
UNIT-IV

Linear models: The least-squares method, The Perceptron: a heuristic learning algorithm for linear classifiers,
Support vector machines, obtaining probabilities from linear classifiers, Going beyond linearity with kernel methods.
Distance Based Models: Introduction, Neighbours and exemplars, Nearest Neighbours classification,
Distance Based Clustering, Hierarchical Clustering.

CHAPTER-1) Linear models

In statistics, the term linear model is used in different ways according to the context. The most common
occurrence is in connection with regression models and the term is often taken as synonymous with linear
regression model. However, the term is also used in time series analysis with a different meaning. In each case, the
designation "linear" is used to identify a subclass of models for which substantial reduction in the complexity of the
related statistical theory is possible.
A)The least-squares method:

The "least squares" method is a form of mathematical regression analysis used to determine the line of best fit for a
set of data, providing a visual demonstration of the relationship between the data points. Each point of data
represents the relationship between a known independent variable and an unknown dependent variable.

The least squares method provides the overall rationale for the placement of the line of best fit among the data points
being studied. The most common application of this method, sometimes referred to as "linear" or "ordinary" least
squares, aims to create a straight line that minimizes the sum of the squared errors, i.e., the squared residuals arising
from the differences between the observed values and the values predicted by the model.

This method of regression analysis begins with a set of data points to be plotted on an x- and y-axis graph. An
analyst using the least squares method will generate a line of best fit that explains the potential relationship between
independent and dependent variables.

In regression analysis, dependent variables are illustrated on the vertical y-axis, while independent variables are
illustrated on the horizontal x-axis. These designations will form the equation for the line of best fit, which is
determined from the least squares method.

In contrast to a linear problem, a non-linear least squares problem has no closed-form solution and is generally solved by
iteration. The discovery of the least squares method is attributed to Carl Friedrich Gauss, who developed it in 1795.

Linear Regression Learning the Model


Learning a linear regression model means estimating the values of the coefficients used in the representation with the
data that we have available.

In this section we will take a brief look at four techniques to prepare a linear regression model. This is not enough
information to implement them from scratch, but enough to get a flavor of the computation and trade-offs involved.

There are many more techniques because the model is so well studied. Take note of Ordinary Least Squares because
it is the most common method used in general. Also take note of Gradient Descent as it is the most common
technique taught in machine learning classes.

i. Simple Linear Regression


With simple linear regression when we have a single input, we can use statistics to estimate the coefficients.

This requires that you calculate statistical properties from the data such as means, standard deviations, correlations
and covariance. All of the data must be available to traverse and calculate statistics.

This is fun as an exercise in Excel, but not really useful in practice.
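
As a rough illustration, the following minimal Python sketch (with made-up data values) estimates the two coefficients from summary statistics: the slope from the covariance of x and y over the variance of x, and the intercept from the means.

# Simple linear regression from summary statistics (illustrative data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]      # single input variable
y = [1.2, 1.9, 3.1, 3.9, 5.2]      # output variable

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# B1 = covariance(x, y) / variance(x); B0 = mean(y) - B1 * mean(x)
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
var_x = sum((xi - mean_x) ** 2 for xi in x)
b1 = cov_xy / var_x
b0 = mean_y - b1 * mean_x

print(f"y = {b0:.3f} + {b1:.3f} * x")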

ii. Ordinary Least Squares
When we have more than one input we can use Ordinary Least Squares to estimate the values of the coefficients.

The Ordinary Least Squares procedure seeks to minimize the sum of the squared residuals. This means that given a
regression line through the data we calculate the distance from each data point to the regression line, square it, and
sum all of the squared errors together. This is the quantity that ordinary least squares seeks to minimize.
This approach treats the data as a matrix and uses linear algebra operations to estimate the optimal values for the
coefficients. It means that all of the data must be available and you must have enough memory to fit the data and
perform matrix operations.

It is unusual to implement the Ordinary Least Squares procedure yourself unless as an exercise in linear algebra. It is
more likely that you will call a procedure in a linear algebra library. This procedure is very fast to calculate.
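
In practice you would typically call such a routine; a minimal sketch using NumPy's least-squares solver (with illustrative data and a column of ones for the bias term) looks like this.

# Ordinary Least Squares via a linear algebra library routine (illustrative data).
import numpy as np

X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 1.0],
              [1.0, 3.0, 4.0],
              [1.0, 4.0, 3.0]])   # first column of ones models the bias coefficient
y = np.array([6.0, 8.0, 15.0, 17.0])

# Solve min ||X @ coef - y||^2; coef[0] is the bias, the rest are the weights.
coef, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print("coefficients:", coef)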

iii. Gradient Descent


When there are one or more inputs, you can optimize the values of the coefficients by iteratively minimizing the
error of the model on your training data.

This operation is called Gradient Descent and works by starting with random values for each coefficient. The sum of
the squared errors is calculated for each pair of input and output values. A learning rate is used as a scale factor and
the coefficients are updated in the direction towards minimizing the error. The process is repeated until a minimum
sum squared error is achieved or no further improvement is possible.
When using this method, you must select a learning rate (alpha) parameter that determines the size of the
improvement step to take on each iteration of the procedure.

Gradient descent is often taught using a linear regression model because it is relatively straightforward to understand.
In practice, it is useful when you have a very large dataset either in the number of rows or the number of columns
that may not fit into memory.
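
A minimal sketch of this procedure for simple linear regression is shown below; the data, learning rate and number of iterations are illustrative choices.

# Gradient descent for simple linear regression (illustrative data and settings).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.1, 3.9, 5.2]

b0, b1 = 0.0, 0.0     # start from arbitrary (here zero) coefficient values
alpha = 0.01          # learning rate
for _ in range(5000):
    # Gradient of the mean squared error with respect to b0 and b1.
    grad_b0 = sum(2 * ((b0 + b1 * xi) - yi) for xi, yi in zip(x, y)) / len(x)
    grad_b1 = sum(2 * ((b0 + b1 * xi) - yi) * xi for xi, yi in zip(x, y)) / len(x)
    # Step each coefficient in the direction that reduces the error.
    b0 -= alpha * grad_b0
    b1 -= alpha * grad_b1

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")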

iv. Regularization

There are extensions of the training of the linear model called regularization methods. These seek both to minimize
the sum of the squared errors of the model on the training data (as in ordinary least squares) and to reduce the
complexity of the model (such as the number or the absolute size of the coefficients in the model).

Two popular examples of regularization procedures for linear regression are:

Lasso Regression: where Ordinary Least Squares is modified to also minimize the sum of the absolute values of the
coefficients (called L1 regularization).
Ridge Regression: where Ordinary Least Squares is modified to also minimize the sum of the squared values of the
coefficients (called L2 regularization).
These methods are effective to use when there is collinearity in your input values and ordinary least squares would
overfit the training data.
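
Both are available off the shelf; a minimal sketch using scikit-learn (with illustrative data and penalty strengths) is shown below.

# L1 (Lasso) and L2 (Ridge) regularized linear regression with scikit-learn.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([6.0, 8.0, 15.0, 17.0])

lasso = Lasso(alpha=0.1).fit(X, y)   # penalizes the absolute values of the coefficients
ridge = Ridge(alpha=0.1).fit(X, y)   # penalizes the squared values of the coefficients

print("Lasso coefficients:", lasso.coef_)
print("Ridge coefficients:", ridge.coef_)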

Now that you know some techniques to learn the coefficients in a linear regression model, let’s look at how we can
use a model to make predictions on new data.

Making Predictions with Linear Regression

Given the representation is a linear equation, making predictions is as simple as solving the equation for a specific
set of inputs.

Let’s make this concrete with an example. Imagine we are predicting weight (y) from height (x). Our linear
regression model representation for this problem would be:

y = B0 + B1 * x1

or

weight = B0 + B1 * height

Where B0 is the bias coefficient and B1 is the coefficient for the height column. We use a learning technique to find
a good set of coefficient values. Once found, we can plug in different height values to predict the weight.

For example, let's use B0 = 0.1 and B1 = 0.5. Let's plug them in and calculate the weight (in kilograms) for a person
with the height of 182 centimetres.

weight = 0.1 + 0.5 * 182

weight = 91.1

You can see that the above equation could be plotted as a line in two dimensions. The B0 is our starting point
regardless of what height we have. We can run through a bunch of heights from 100 to 250 centimetres, plug
them into the equation and get weight values, creating our line.
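
A minimal sketch of doing exactly that, using the coefficients from the example above, could look as follows.

# Making predictions with the learned coefficients B0 = 0.1 and B1 = 0.5.
b0, b1 = 0.1, 0.5

def predict_weight(height_cm):
    return b0 + b1 * height_cm

print(predict_weight(182))   # 91.1, as in the worked example above

# Run through heights from 100 to 250 centimetres to trace out the line.
line = [(h, predict_weight(h)) for h in range(100, 251, 10)]
for height, weight in line[:3]:
    print(f"height={height} cm -> weight={weight:.1f} kg")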

Now that we know how to make predictions given a learned linear regression model, let's look at some rules of thumb for preparing our data to make the most of this type of model.

Preparing Data for Linear Regression

Linear regression has been studied at great length, and there is a lot of literature on how your data must be structured
to make best use of the model. As such, there is a lot of sophistication when talking about these requirements and
expectations, which can be intimidating. In practice, you can use these rules more as rules of thumb when using
Ordinary Least Squares Regression, the most common implementation of linear regression.

Try different preparations of your data using these heuristics and see what works best for your problem.

Linear Assumption. Linear regression assumes that the relationship between your input and output is linear. It does
not support anything else. This may be obvious, but it is good to remember when you have a lot of attributes. You
may need to transform data to make the relationship linear (e.g. log transform for an exponential relationship).
Remove Noise. Linear regression assumes that your input and output variables are not noisy. Consider using data
cleaning operations that let you better expose and clarify the signal in your data. This is most important for the output
variable and you want to remove outliers in the output variable (y) if possible.
Remove Collinearity. Linear regression will overfit your data when you have highly correlated input variables.
Consider calculating pairwise correlations for your input data and removing the most correlated.
Gaussian Distributions. Linear regression will make more reliable predictions if your input and output variables
have a Gaussian distribution. You may get some benefit from using transforms (e.g. log or Box-Cox) on your variables to
make their distribution more Gaussian-looking.
Rescale Inputs: Linear regression will often make more reliable predictions if you rescale input variables using
standardization or normalization.
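
As a small, illustrative sketch, two of these heuristics (a log transform to linearize an exponential relationship, and standardization to rescale an input) can be applied with NumPy as follows; the data is made up.

# Log transform and standardization as data preparation steps (illustrative data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * np.exp(x)                 # exponential relationship between x and y

y_linearized = np.log(y)            # log transform: y_linearized is now linear in x
x_standardized = (x - x.mean()) / x.std()   # rescaled to zero mean, unit variance

print(y_linearized)
print(x_standardized)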
B)The Perceptron:

A perceptron is a neural network unit (an artificial neuron) that does certain computations to detect features or
business intelligence in the input data.

Perceptron was introduced by Frank Rosenblatt in 1957. He proposed a Perceptron learning rule based on the
original MCP neuron.
A Perceptron is an algorithm for supervised learning of binary classifiers. This algorithm enables neurons to learn
and process elements in the training set one at a time.

There are two types of Perceptrons:
i)Single layer
ii) Multilayer.
Single layer Perceptrons can learn only linearly separable patterns.
Multilayer Perceptrons, or feedforward neural networks with two or more layers, have greater processing power.
The Perceptron algorithm learns the weights for the input signals in order to draw a linear decision boundary. This
enables you to distinguish between the two linearly separable classes +1 and -1.
Note: Supervised Learning is a type of Machine Learning used to learn models from labeled training data. It enables
output prediction for future or unseen data.
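
A minimal sketch of the Perceptron learning rule on a small, linearly separable dataset with labels +1 and -1 is shown below; the data, learning rate and number of passes are illustrative.

# Perceptron learning rule for a linearly separable problem (illustrative data).
import numpy as np

X = np.array([[2.0, 1.0], [3.0, 4.0], [-1.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = np.zeros(X.shape[1])   # weights
b = 0.0                    # bias
eta = 0.1                  # learning rate

for _ in range(100):                       # fixed number of passes over the data
    for xi, yi in zip(X, y):
        if yi * (np.dot(w, xi) + b) <= 0:  # example is misclassified (or on the boundary)
            w += eta * yi * xi             # nudge the decision boundary toward it
            b += eta * yi

print("weights:", w, "bias:", b)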

A heuristic learning algorithm for linear classifiers:

 A heuristic algorithm is one that is designed to solve a problem in a faster and more efficient
fashion than traditional methods by sacrificing optimality, accuracy, precision, or completeness for
speed.

 Heuristic algorithms are often used to solve NP-complete problems, a class of decision problems for
which no known efficient way exists to find a solution quickly and accurately, although solutions can
be verified when given.

 Heuristics can produce a solution individually or be used to provide a good baseline and are
supplemented with optimization algorithms.

 Heuristic algorithms are most often employed when approximate solutions are sufficient and exact
solutions are necessarily computationally expensive.

Example Algorithms

Swarm Intelligence
Swarm Intelligence systems employ large numbers of agents interacting locally with one another and the
environment. Swarm intelligence refers to the collective behavior of decentralized systems and can be used to
describe both natural and artificial systems. Specific algorithms for this class of system include the particle swarm
optimization algorithm, the ant colony optimization algorithm, and artificial bee colony algorithm. Each of the
previous algorithms was inspired by the natural, self-organized behavior of animals.
Tabu Search
This heuristic technique uses dynamically generated tabus to guide the solution search to optimum solutions. It
examines potential solutions to a problem and checks immediate local neighbors to find an improved solution. The
search creates a set of rules dynamically and prevents the system from searching around the same area redundantly
by marking rule violating solutions as “tabu” or forbidden. This method solves the problem of local search methods
when the search is stuck in suboptimal regions or in areas when there are multiple equally fit solutions.
Simulated Annealing
Borrowing the metallurgical term, this technique converges to a solution in the same way metals are brought to
minimum energy configurations by increasing grain size. Simulated annealing is used in global optimization and can
give a reasonable approximation of a global optimum for a function with a large search space. At each iteration, it
probabilistically decides between staying at its current state or moving to another while ultimately leading the system
to the lowest energy state.
Genetic Algorithms
Genetic algorithms are a subset of a larger class of evolutionary algorithms that describe a set of techniques inspired
by natural selection such as inheritance, mutation, and crossover. Genetic algorithms require both a genetic
representation of the solution domain and a fitness function to evaluate the solution domain. The technique generates
a population of candidate solutions and uses the fitness function to select the optimal solution by iterating with each
generation. The algorithm terminates when the satisfactory fitness level has been reached for the population or the
maximum generations have been reached.
Artificial Neural Networks
Artificial Neural Networks (ANNs) are models capable of pattern recognition and machine learning, in which a
system analyzes a set of training data and is then able to categorize new examples and data. ANNs are influenced by
animals’ central nervous systems and brains, and are used to solve a wide variety of problems including speech
recognition and computer vision.
Support Vector Machines
Support Vector Machines (SVMs) are models with training data used by artificial intelligence to recognize patterns
and analyze data. These algorithms are used for regression analysis and classification purposes. Using example data,
the algorithm will sort new examples into groupings. These SVMs are involved with machine learning, a subset of
artificial intelligence where systems learn from data, and require training data before being capable of analyzing new
examples.
Example Problems:

Travelling Salesman Problem


A well-known example of a heuristic algorithm is used to solve the common Travelling Salesman Problem. The
problem is as follows: given a list of cities and the distances between each city, what is the shortest possible route
that visits each city exactly once? A heuristic algorithm used to quickly solve this problem is the nearest neighbor
(NN) algorithm (also known as the Greedy Algorithm). Starting from a randomly chosen city, the algorithm
repeatedly moves to the closest unvisited city until none remain.

Figure 1: Example of how the nearest neighbor algorithm functions.

These are the steps of the NN algorithm:

1. Start at a random vertex


2. Determine the shortest distance connecting the current vertex and an unvisited vertex V
3. Make the current vertex the unvisited vertex V
4. Make V visited
5. Record the distance traveled
6. Terminate if no other unvisited vertices remain
7. Repeat from step 2.

This algorithm is heuristic in that it does not take into account the possibility of better steps being excluded by the
selection process. For n cities, the NN algorithm creates a path that is approximately 25% longer than the optimal
solution.

Traveling Salesman Example Problem

There are 4 points of interest located in a 10x10 plot of space: (3,4.5), (9,6.25), (1,8), and (5.5,0). The following
lists the distance required to visit all 4 points from a chosen starting point using the nearest neighbor algorithm:

Starting at point (1,8): The shortest distance to an unvisited point is 4.03 units to point (3,4.5). The shortest
distance to an unvisited point is 5.15 units to point (5.5,0). The shortest distance to an unvisited point is
7.16 units to point (9,6.25). The total distance traveled is 16.34 units.

Starting at point (9,6.25): The shortest distance to an unvisited point is 6.25 units to point (3,4.5). The shortest
distance to an unvisited point is 4.03 units to point (1,8). The shortest distance to an unvisited point is
9.18 units to point (5.5,0). The total distance traveled is 19.46 units.

Both situations followed the NN algorithm to solve the problem; however, the total distance traveled
changed based on the starting location. This shows how a heuristic algorithm can give a good solution, but
not the best solution.
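
A minimal Python sketch of this nearest neighbor heuristic, applied to the four points above, reproduces the two totals of roughly 16.34 and 19.46 units.

# Nearest neighbor heuristic for the four example points.
from math import dist

points = [(3, 4.5), (9, 6.25), (1, 8), (5.5, 0)]

def nearest_neighbor_tour(start, points):
    unvisited = [p for p in points if p != start]
    current, total = start, 0.0
    while unvisited:
        nxt = min(unvisited, key=lambda p: dist(current, p))  # closest unvisited point
        total += dist(current, nxt)
        unvisited.remove(nxt)
        current = nxt
    return total

print(round(nearest_neighbor_tour((1, 8), points), 2))      # ~16.34
print(round(nearest_neighbor_tour((9, 6.25), points), 2))   # ~19.46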
Knapsack Problem
Another common use of heuristics is to solve the Knapsack Problem, in which a given set of items (each
with a mass and a value) must be selected so as to maximize the total value while staying under a certain mass limit.
The heuristic algorithm for this problem is called the Greedy Approximation Algorithm, which sorts the items
based on their value per unit mass and adds the items with the highest value-to-mass ratio as long as there is still
space remaining.

To illustrate, there is a bag with a maximum weight limit W. We want to maximize the value of all the objects that
go into the bag, so the objective is:

maximize sum_j (v_j * x_j) subject to sum_j (w_j * x_j) <= W

where x_j is a binary variable that determines whether object j goes in the bag, v_j is the value of object j, and
w_j is object j's weight; the sum of the weights of the chosen objects must not be larger than W.

In general, Greedy Algorithms are used to approximately solve combinatorics problems in a timely manner.
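
A minimal sketch of the Greedy Approximation Algorithm (with illustrative item values and weights) is shown below.

# Greedy approximation for the Knapsack Problem (illustrative items).
def greedy_knapsack(items, capacity):
    # items: list of (value, weight) pairs; capacity: maximum total weight W.
    chosen, total_weight = [], 0.0
    # Consider items with the highest value-to-weight ratio first.
    for value, weight in sorted(items, key=lambda vw: vw[0] / vw[1], reverse=True):
        if total_weight + weight <= capacity:
            chosen.append((value, weight))
            total_weight += weight
    return chosen

items = [(60, 10), (100, 20), (120, 30)]
print(greedy_knapsack(items, capacity=50))   # keeps the best-ratio items that still fit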

Virus Scanning

In virus scanning, an algorithm searches for key pieces of code associated with particular kinds of viruses, reducing
the number of files that need to be scanned. One of the benefits of heuristic virus scanning is that different viruses of
the same family can be detected without being known, thanks to their common code markers.
Searching and Sorting
One of the most common uses of heuristic algorithms is in searching and sorting. As a search runs, it
adjusts its working parameters to optimize speed, an important characteristic in a search function. The
algorithm discards current possibilities if they are worse than already found solutions. Some forms of
heuristic methods can be detrimental to searching, such as the best-first search algorithm: it takes search
results close to the goal and follows the new path even when it may not continue to lead to the optimal
search result.

B)Support vector machines


SVMs are among the most popular algorithms for classification in machine learning. Their mathematical
formulation provides the foundation for drawing a geometric boundary between two classes. We will see how
Support Vector Machines work by observing how they are applied.

What is SVM?

Support Vector Machines are a type of supervised machine learning algorithm used for both classification and
regression analysis. While they can be used for regression, SVMs are mostly used for classification. We plot each
data item as a point in n-dimensional space, where the value of each feature is the value of a particular coordinate.
Then, we find the ideal hyperplane that differentiates between the two classes.

The support vectors are the coordinate representations of individual observations. The SVM is a frontier
method for segregating the two classes.

How does SVM work?


The basic principle behind the working of Support Vector Machines is simple: create a hyperplane that
separates the dataset into classes. Let us start with a sample problem. Suppose that for a given dataset, you
have to classify red triangles from blue circles. Your goal is to create a line that classifies the data into two
classes, creating a distinction between red triangles and blue circles.

While one can hypothesize a clear line that separates the two classes, there can be many lines that do this
job. Therefore, there is no single line that everyone would agree performs this task best. Let us visualize some
of the lines that can differentiate between the two classes as follows.

In the above visualizations, we have a green line and a red line. Which one do you think would better
differentiate the data into two classes? If you choose the red line, then it is the line that partitions the two
classes properly. However, we still have not established that it is the line that would classify
our data most efficiently.

According to SVM, we have to find the points that lie closest to both the classes. These points are known as
support vectors. In the next step, we measure the distance between our dividing plane and the support vectors.
This distance between the points and the dividing line is known as the margin. The aim of an SVM algorithm is to
maximize this margin. When the margin reaches its maximum, the hyperplane becomes the optimal one.

The SVM model tries to enlarge the distance between the two classes by creating a well-defined decision
boundary. In the above case, our hyperplane divided the data. While our data was in 2 dimensions, the
hyperplane was of 1 dimension. For higher dimensions, say an n-dimensional Euclidean space, we have an
(n-1)-dimensional subset that divides the space into two disconnected components.
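
A minimal sketch of fitting a linear SVM with scikit-learn is shown below; the toy data stands in for the red triangles and blue circles described above.

# Fitting a linear SVM on a small, linearly separable toy dataset.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])
y = np.array([0, 0, 0, 1, 1, 1])          # two linearly separable classes

clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("support vectors:\n", clf.support_vectors_)   # the points closest to the boundary
print("prediction for [4, 4]:", clf.predict([[4, 4]]))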

Advantages of SVM

Guaranteed Optimality: Owing to the nature of convex optimization, the solution will always be a global
minimum, not a local minimum.
Abundance of Implementations: We can access it conveniently, be it from Python or MATLAB.

SVM can be used for linearly separable as well as non-linearly separable data. Linearly separable data
calls for a hard margin, whereas non-linearly separable data requires a soft margin.

 SVMs can be adapted to semi-supervised learning, where the data is partly labeled and partly unlabeled.
Adding a constraint to the minimization problem yields what is known as the Transductive SVM.

Explicit feature mapping used to place quite a load on the computational complexity of the overall training of
the model. However, with the help of the kernel trick, SVM can carry out the feature mapping implicitly using
simple dot products.

Disadvantages of SVM
SVM does not handle text structures directly. This leads to a loss of sequential information and, thereby,
to worse performance.
A vanilla SVM cannot return a probabilistic confidence value the way logistic regression can. This limits
interpretability, as the confidence of a prediction is important in several applications.

Choice of the kernel is perhaps the biggest limitation of the support vector machine. With so many
kernels available, it becomes difficult to choose the right one for the data.

How to Tune SVM Parameters?


Kernel
The kernel in the SVM is responsible for transforming the input data into the required format. Some of the kernels
used in SVM are linear, polynomial and radial basis function (RBF). For creating a non-linear hyperplane, we
use the RBF or polynomial kernel. For complex applications, one should use more advanced kernels to separate
classes that are nonlinear in nature. With this transformation, one can obtain accurate classifiers.

Regularization
We can maintain regularization by adjusting it in the Scikit-learn’s C parameters. C denotes a penalty
parameter representing an error or any form of misclassification. With this misclassification, one can
10
understand how much of the error is actually bearable. Through this, you can nullify the compensation between
the misclassified term and the decision boundary. With a smaller C value, we obtain hyperplane of small margin
and with a larger C value, we obtain hyperplane of larger value.

Gamma
A lower value of gamma gives a looser fit to the training dataset, while a higher value of gamma fits the
training data more closely (and can overfit). In terms of the RBF kernel, gamma defines how far the influence of a
single training point reaches: with a low gamma, points far from the separating line are also taken into account,
whereas with a high gamma, only the points close to the line influence its calculation.
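
A minimal sketch of trying out these parameters with scikit-learn (toy data, illustrative parameter values) is shown below.

# Trying different kernel, C and gamma settings on a toy dataset.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 1], [2, 3], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

for kernel in ("linear", "poly", "rbf"):
    for C in (0.1, 1.0, 10.0):
        for gamma in (0.01, 1.0):
            clf = SVC(kernel=kernel, C=C, gamma=gamma).fit(X, y)
            print(kernel, C, gamma, "training accuracy:", clf.score(X, y))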

Applications of SVM
Some of the areas where Support Vector Machines are used are as follows –

Face Detection

SVMs are capable of classifying parts of an image as face or non-face, creating a bounding box that
separates the face from the rest of the image.
Text and hypertext categorization
SVMs can be used for document classification, performing text and hypertext categorization. Based on the
score generated for each document, a comparison is made with a threshold value.

Bioinformatics
In the field of bioinformatics, SVMs are used for protein and gene classification. They can be used to classify
patients based on their genetic and other biological markers.

Handwriting recognition
Another area where support vector machines are used for visual recognition is handwriting recognition.

Summary
In this article, we studied Support Vector Machines. We learned how these SVM algorithms work
and walked through how they are applied. We also discussed the various applications of SVMs
in our daily lives. Hopefully you now understand the complete theory of SVM.

C)Obtaining probabilities from linear classifiers:
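
One common route to probabilities is to pass the linear classifier's score through the logistic (sigmoid) function, as logistic regression does; for SVMs, a logistic function is often fitted to the decision values afterwards (Platt scaling, available in scikit-learn via SVC(probability=True)). A minimal, illustrative sketch with assumed weights is shown below.

# Turning a linear classifier's raw score s = w . x + b into a probability
# with the logistic (sigmoid) function. Weights, bias and input are assumed
# values for illustration only.
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

w = [0.8, -0.4]      # assumed weights of a trained linear classifier
b = 0.1              # assumed bias
x = [2.0, 1.5]       # one example

score = sum(wi * xi for wi, xi in zip(w, x)) + b
print("P(class = +1 | x) =", round(sigmoid(score), 3))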

D)Going beyond linearity with kernel methods:

In machine learning, kernel methods are a class of algorithms for pattern analysis, whose best known member
is the support vector machine (SVM). The general task of pattern analysis is to find and study general types of
relations (for example clusters, rankings, principal components, correlations, classifications) in datasets. For
many algorithms that solve these tasks, the data in raw representation have to be explicitly transformed into
feature vector representations via a user-specified feature map: in contrast, kernel methods require only a user-
specified kernel, i.e., a similarity function over pairs of data points in raw representation.

Kernel methods owe their name to the use of kernel functions, which enable them to operate in a high-
dimensional, implicit feature space without ever computing the coordinates of the data in that space, but
rather by simply computing the inner products between the images of all pairs of data in the feature space.
This operation is often computationally cheaper than the explicit computation of the coordinates. This
approach is called the "kernel trick". Kernel functions have been introduced for sequence data, graphs,
text, images, as well as vectors.

Algorithms capable of operating with kernels include the kernel perceptron, support vector machines
(SVM), Gaussian processes, principal components analysis (PCA), canonical correlation analysis, ridge
regression, spectral clustering, linear adaptive filters and many others. Any linear model can be turned into
a non-linear model by applying the kernel trick to the model: replacing its features (predictors) by a kernel
function.
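
As a small illustration of the idea, the sketch below compares an explicit degree-2 polynomial feature map with the equivalent polynomial kernel computed directly from the inner product in the original space; the two vectors are illustrative.

# Kernel trick illustration: a degree-2 polynomial kernel equals the inner
# product in an explicit quadratic feature space, without computing that space.
import math

def phi(v):
    # Explicit degree-2 feature map for a 2-d vector.
    x1, x2 = v
    return [x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2]

def poly_kernel(u, v):
    # Degree-2 polynomial kernel computed directly in the input space.
    return (u[0] * v[0] + u[1] * v[1]) ** 2

u, v = [1.0, 2.0], [3.0, 0.5]
explicit = sum(a * b for a, b in zip(phi(u), phi(v)))
print(explicit, poly_kernel(u, v))   # the two values agree (up to floating point)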

Most kernel algorithms are based on convex optimization or eigenproblems and are statistically well-
founded. Typically, their statistical properties are analyzed using statistical learning theory (for example,
using Rademacher complexity).

CHAPTER-2DISTANCE BASED MODELS:

a)Introduction:

Distance-based models are the second class of Geometric models. Like Linear models, distance-based models
are based on the geometry of data. As the name implies, distance-based models work on the concept of
distance. In the context of Machine learning, the concept of distance is not based on merely the physical
distance between two points. Instead, we could think of the distance between two points considering the mode
of transport between two points. Travelling between two cities by plane covers less distance physically than
by train because a plane is unrestricted. Similarly, in chess, the concept of distance depends on the piece used
– for example, a Bishop can move diagonally. Thus, depending on the entity and the mode of travel, the
concept of distance can be experienced differently. The distance metrics commonly used are Euclidean,
Minkowski, Manhattan, and Mahalanobis.
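
A minimal sketch of computing some of these metrics for two illustrative points is shown below; the Mahalanobis distance is omitted here since it additionally requires a covariance matrix.

# Minkowski distance of order r: r=1 gives Manhattan, r=2 gives Euclidean.
def minkowski(p, q, r):
    return sum(abs(pi - qi) ** r for pi, qi in zip(p, q)) ** (1.0 / r)

a, b = (1.0, 2.0), (4.0, 6.0)
print("Manhattan:", minkowski(a, b, 1))            # |1-4| + |2-6| = 7
print("Euclidean:", minkowski(a, b, 2))            # sqrt(9 + 16) = 5
print("Minkowski (r=3):", round(minkowski(a, b, 3), 3))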

Distance is applied through the concept of neighbours and exemplars. Neighbours are points in proximity
with respect to the distance measure expressed through exemplars. Exemplars are either centroids that find
a centre of mass according to a chosen distance metric or medoids that find the most centrally located data
point. The most commonly used centroid is the arithmetic mean, which minimises squared Euclidean
distance to all other points.

Notes:

 The centroid represents the geometric centre of a plane figure, i.e., the arithmetic mean position of
all the points in the figure from the centroid point. This definition extends to any object in n-
dimensional space: its centroid is the mean position of all the points.
 Medoids are similar in concept to means or centroids. Medoids are most commonly used on data
when a mean or centroid cannot be defined. They are used in contexts where the centroid is not
representative of the dataset, such as in image data.

Examples of distance-based models include the nearest-neighbour models, which use the training data as
exemplars – for example, in classification. The K-means clustering algorithm also uses exemplars to
create clusters of similar data points.

b)Neighbours and exemplars & Nearest Neighbours classification:

Example for NN
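
A small, illustrative nearest-neighbour (k-NN) classification sketch: the predicted class is the majority label among the k closest training exemplars. The data and the value of k are assumed.

# k-nearest-neighbour classification on a tiny illustrative dataset.
from collections import Counter
from math import dist

train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_y = ["A", "A", "A", "B", "B", "B"]

def knn_predict(x, k=3):
    # Sort the training exemplars by distance to the query and vote among the k nearest.
    neighbours = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

print(knn_predict((2.0, 2.0)))   # "A": the closest exemplars belong to class A
print(knn_predict((6.0, 7.0)))   # "B"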

c)Distance Based Clustering:
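
As a brief, illustrative sketch of distance-based clustering, the K-means algorithm mentioned in the introduction can be run with scikit-learn as follows; the data and number of clusters are assumed.

# K-means clustering: centroids act as exemplars of the clusters.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("cluster centroids:\n", kmeans.cluster_centers_)  # the exemplars (centroids)
print("cluster labels:", kmeans.labels_)                # cluster assignment per point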

d)Hierarchical Clustering:

Introduction to Hierarchical Clustering

Hierarchical clustering is another unsupervised learning algorithm that is used to group together unlabeled
data points having similar characteristics. Hierarchical clustering algorithms fall into the following
two categories.

1. Agglomerative hierarchical algorithms − In agglomerative hierarchical algorithms, each data point is
initially treated as a single cluster, and pairs of clusters are then successively merged, or agglomerated
(bottom-up approach). The hierarchy of the clusters is represented as a dendrogram or tree structure.

Algorithm:

given a dataset (d1, d2, d3, ...., dN) of size N

# compute the distance matrix
for i = 1 to N:
    # as the distance matrix is symmetric about
    # the primary diagonal, we compute only the lower
    # part of the primary diagonal
    for j = 1 to i:
        dis_mat[i][j] = distance(di, dj)

treat each data point as a singleton cluster
repeat
    merge the two clusters having the minimum distance
    update the distance matrix
until only a single cluster remains
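
In practice this can be done with SciPy; a minimal sketch (illustrative data, single linkage) is shown below.

# Agglomerative hierarchical clustering with SciPy's linkage function.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])

Z = linkage(X, method="single", metric="euclidean")  # merge the closest clusters first
labels = fcluster(Z, t=2, criterion="maxclust")      # cut the dendrogram into 2 clusters

print(Z)        # each row: the two clusters merged, their distance, and the new cluster size
print(labels)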

2. Divisive hierarchical algorithms − On the other hand, in divisive hierarchical algorithms, all the data
points are treated as one big cluster, and the process of clustering involves dividing (top-down approach)
the one big cluster into various smaller clusters.

Algorithm :

given a dataset (d1, d2, d3, ...., dN) of size N

at the top we have all the data in one cluster
the cluster is split using a flat clustering method, e.g. K-Means
repeat
    choose the best cluster among all the clusters to split
    split that cluster with the flat clustering algorithm
until each data point is in its own singleton cluster

