
Unit VI
Applications and Linear Discriminant Functions

Taught by
Prof. Datta Deshmukh

 Gradient descent procedures
 Perceptron
 Support vector machines

Linear Discriminant Analysis in Machine Learning

When dealing with a high-dimensional dataset, we must apply dimensionality reduction techniques so that we can explore the data and use it for modeling efficiently. In this unit, we will learn about one such dimensionality reduction technique, which maps high-dimensional data to a comparatively lower dimension without much loss of information.

What is Linear Discriminant Analysis?

Linear Discriminant Analysis (LDA), also known as Normal Discriminant Analysis or Discriminant Function Analysis, is a dimensionality reduction technique primarily used in supervised classification problems. It facilitates the modeling of distinctions between groups, effectively separating two or more classes. LDA operates by projecting features from a higher-dimensional space into a lower-dimensional one. In machine learning, LDA serves as a supervised learning algorithm specifically designed for classification tasks, aiming to identify a linear combination of features that optimally separates the classes within a dataset.

For example, suppose we have two classes that we need to separate efficiently. Each class can have multiple features. Using only a single feature to classify them may result in some overlap, as shown in the figure below. So we keep increasing the number of features for proper classification.

Assumptions of LDA

 LDA assumes that the data has a Gaussian distribution and that
the covariance matrices of the different classes are equal. It also
assumes that the data is linearly separable, meaning that a
linear decision boundary can accurately classify the different
classes.
 Suppose we have two sets of data points belonging to two different classes that we want to classify. As shown in the given 2D graph, when the data points are plotted on the 2D plane, there is no straight line that can completely separate the two classes of data points. Hence, in this case, LDA (Linear Discriminant Analysis) is used, which reduces the 2D graph to a 1D graph in order to maximize the separability between the two classes.
Here, Linear Discriminant Analysis uses both axes (X and Y) to create a new axis and projects the data onto this new axis in a way that maximizes the separation of the two categories, thereby reducing the 2D graph to a 1D graph.
Two criteria are used by LDA to create the new axis:
Maximize the distance between the means of the two classes.
Minimize the variation within each class.
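These two criteria are commonly combined into a single objective known as Fisher's criterion (not spelled out on the slide). For two classes projected onto a direction w, with projected class means m1 and m2 and within-class scatters s1² and s2², it takes the standard form:

J(w) = (m1 − m2)² / (s1² + s2²)

LDA chooses the projection direction w that maximizes J(w): a large numerator pushes the class means apart, while a small denominator keeps each class tightly clustered.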

In the above graph, it can be seen that a new axis (in
red) is generated and plotted in the 2D graph such that
it maximizes the distance between the means of the
two classes and minimizes the variation within each
class. In simple terms, this newly generated axis
increases the separation between the data points of the
two classes. After generating this new axis using the
above-mentioned criteria, all the data points of the
classes are plotted on this new axis and are shown in
the figure given below.

However, Linear Discriminant Analysis fails when the means of the distributions are shared, as it becomes impossible for LDA to find a new axis that makes both classes linearly separable. In such cases, we use non-linear discriminant analysis.
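Before moving on, here is a minimal sketch of LDA in practice using scikit-learn (assuming scikit-learn is installed; the Iris dataset and the variable names are illustrative assumptions, not part of the original slides):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Load a small labelled dataset (4 features, 3 classes)
X, y = load_iris(return_X_y=True)

# Fit LDA and project the data onto at most (n_classes - 1) = 2 components
lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit_transform(X, y)

print(X_reduced.shape)   # (150, 2): high-dimensional data mapped to 2D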

Gradient Descent

Gradient descent was first proposed by Augustin-Louis Cauchy in 1847. Gradient descent is one of the most commonly used iterative optimization algorithms in machine learning, used to train machine learning and deep learning models. It helps in finding a local minimum of a function.

The local minimum or local maximum of a function can be characterized using the gradient as follows:
If we move towards the negative gradient, i.e., away from the gradient of the function at the current point, we will approach the local minimum of that function.
Whenever we move towards the positive gradient, i.e., towards the gradient of the function at the current point, we will approach the local maximum of that function.

The procedure of moving towards the positive gradient is known as gradient ascent; moving towards the negative gradient is gradient descent, also known as steepest descent. The main objective of using a gradient descent algorithm is to minimize the cost function through iteration. To achieve this goal, it performs two steps iteratively:
 Calculate the first-order derivative of the function to compute the gradient, or slope, of that function at the current point.
 Move away from the direction of the gradient, i.e., step from the current point by alpha times the gradient, where alpha is defined as the learning rate. It is a tuning parameter in the optimization process that helps decide the length of the steps.
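Each iteration therefore applies the update x ← x − alpha · f'(x). A minimal sketch of these two steps in Python (the function f(x) = x², the starting point, and the variable names are illustrative assumptions, not part of the original slides):

# Gradient descent on f(x) = x**2, whose first-order derivative is f'(x) = 2*x
def gradient(x):
    return 2 * x              # slope of f at the current point

x = 10.0                      # starting point
alpha = 0.1                   # learning rate: decides the length of the steps

for _ in range(100):
    x = x - alpha * gradient(x)   # move away from the direction of the gradient

print(x)   # approaches 0.0, the local (and here also global) minimum of f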
What is a Perceptron | The Simplest Artificial Neural Network
 The perceptron is one of the simplest artificial neural network architectures. It was introduced by Frank Rosenblatt in 1957. It is the simplest type of feedforward neural network, consisting of a single layer of input nodes that are fully connected to a layer of output nodes. It can learn linearly separable patterns. It uses a slightly different type of artificial neuron known as a threshold logic unit (TLU), which was first introduced by Warren McCulloch and Walter Pitts in the 1940s.
 Types of Perceptron
 Single-Layer Perceptron: This type of perceptron is limited to learning linearly separable patterns. It is effective for tasks where the data can be divided into distinct categories by a straight line.
 Multilayer Perceptron: Multilayer perceptrons possess enhanced processing capabilities as they consist of two or more layers, adept at handling more complex patterns and relationships within the data.

 Basic Components of Perceptron
 A perceptron, the basic unit of a neural network, comprises essential
components that collaborate in information processing.
 Input Features: The perceptron takes multiple input features; each input feature represents a characteristic or attribute of the input data.
 Weights: Each input feature is associated with a weight,
determining the significance of each input feature in influencing the
perceptron’s output. During training, these weights are adjusted to
learn the optimal values.
 Summation Function: The perceptron calculates the weighted sum
of its inputs using the summation function. The summation function
combines the inputs with their respective weights to produce a
weighted sum.

 Activation Function: The weighted sum is then passed through an activation function. The perceptron uses the Heaviside step function, which takes the weighted sum as input, compares it with a threshold, and outputs 0 or 1.
 Output: The final output of the perceptron is determined by the activation function's result. For example, in binary classification problems, the output might represent a predicted class (0 or 1).
 Bias: A bias term is often included in the perceptron model. The bias allows the model to
make adjustments that are independent of the input. It is an additional parameter that is
learned during training.
 Learning Algorithm (Weight Update Rule): During training, the perceptron learns by
adjusting its weights and bias based on a learning algorithm. A common approach is the
perceptron learning algorithm, which updates weights based on the difference between the
predicted output and the true output.
 These components work together to enable a perceptron to learn and make predictions.
While a single perceptron can perform binary classification, more complex tasks require the
use of multiple perceptrons organized into layers, forming a neural network.

How does a Perceptron work?

A weight is assigned to each input node of a perceptron, indicating the significance of that input to the output. The perceptron's output is the weighted sum of the inputs passed through an activation function, which decides whether or not the perceptron will fire. It computes the weighted sum of its inputs as:

z = w1x1 + w2x2 + ... + wnxn = xᵀw

The activation function that perceptrons use most frequently is a step function, which compares this weighted sum to a threshold and outputs 1 if the input is larger than the threshold value and 0 otherwise. The most common step function used in the perceptron is the Heaviside step function:

h(z) = 0 if z < 0, and h(z) = 1 if z ≥ 0

When all the neurons in a layer are connected to every neuron of the previous layer, it is known as a fully connected layer or dense layer.
The output of the fully connected layer can be written as φ(XW + b), where X is the matrix of input features, W the weight matrix, b the bias vector, and φ the activation function.
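A minimal sketch of a single perceptron in Python, combining the weighted sum, the Heaviside step activation, and the perceptron weight-update rule described above (NumPy, the AND example, and all variable names are illustrative assumptions, not part of the original slides):

import numpy as np

def heaviside(z):
    # Heaviside step activation: 1 if z >= 0, otherwise 0
    return np.where(z >= 0, 1, 0)

def train_perceptron(X, y, lr=0.1, epochs=20):
    # X: (n_samples, n_features) inputs, y: 0/1 class labels
    w = np.zeros(X.shape[1])   # one weight per input feature
    b = 0.0                    # bias term
    for _ in range(epochs):
        for xi, target in zip(X, y):
            z = np.dot(w, xi) + b            # weighted sum
            y_pred = heaviside(z)            # step activation
            update = lr * (target - y_pred)  # perceptron learning rule
            w += update * xi
            b += update
    return w, b

# Example: learn the linearly separable AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print(heaviside(X @ w + b))   # expected output: [0 0 0 1]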
Support vector machines
Support Vector Machine (SVM) is a powerful machine
learning algorithm used for linear or nonlinear
classification, regression, and even outlier detection tasks.
SVMs can be used for a variety of tasks, such as text
classification, image classification, spam
detection, handwriting identification, gene expression
analysis, face detection, and anomaly detection. SVMs are
adaptable and efficient in a variety of applications because
they can manage high-dimensional data and nonlinear
relationships.
SVM algorithms are very effective because they try to find the maximum separating hyperplane between the different classes available in the target feature.
Support Vector Machine
Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and regression. Though it can handle regression problems as well, it is best suited for classification. The main objective of the SVM algorithm is to find the optimal hyperplane in an N-dimensional space that can separate the data points of different classes in the feature space.
 The hyperplane is chosen so that the margin between the closest points of different classes is as large as possible.
The dimension of the hyperplane depends upon the number of features. If the number of input features is two, then the hyperplane is just a line. If the number of input features is three, then the hyperplane becomes a 2-D plane. It becomes difficult to imagine when the number of features exceeds three.
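In general (this is the standard form, not written out on the slide), a hyperplane in an N-dimensional feature space is the set of points x satisfying w1x1 + w2x2 + ... + wNxN + b = 0; with two input features this reduces to a line, and with three input features to a plane.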
Let’s consider two independent variables x1,
x2, and one dependent variable which is either a
blue circle or a red circle.

From the figure, it is very clear that there are multiple lines (our hyperplane here is a line because we are considering only two input features, x1 and x2) that segregate our data points, i.e., classify the red and blue circles. So how do we choose the best line, or in general the best hyperplane, that segregates our data points?

How does SVM work?

One reasonable choice for the best hyperplane is the one that represents the largest separation, or margin, between the two classes.

So we choose the hyperplane whose distance to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane, or hard margin. So from the above figure, we choose L2. Let's consider a scenario like the one shown below.

Here we have one blue ball in the boundary of the red balls. So how does SVM classify the data? It's simple! The blue ball within the boundary of the red ones is an outlier of the blue balls. The SVM algorithm has the characteristic of ignoring the outlier and finding the best hyperplane that maximizes the margin. SVM is robust to outliers.

So for this type of data point, what SVM does is find the maximum margin, as with the previous data sets, and in addition it adds a penalty each time a point crosses the margin. The margins in these cases are called soft margins. When there is a soft margin on the data set, the SVM tries to minimize (1/margin + λ·(∑penalty)). Hinge loss is a commonly used penalty: if there is no violation, there is no hinge loss; if there is a violation, the hinge loss is proportional to the distance of the violation.
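In its standard textbook form (not spelled out on the slide), the soft-margin objective with the hinge-loss penalty can be written as:

minimize over w, b:   (1/2)·||w||² + C · ∑ max(0, 1 − yᵢ(w·xᵢ + b))

The first term corresponds to maximizing the margin (the margin width is 2/||w||), the summation is the hinge-loss penalty for points that violate the margin, and C controls the trade-off between the two.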

Till now, we were talking about linearly separable data (the group of blue balls and red balls are separable by a straight line). What do we do if the data are not linearly separable?

Say our data is as shown in the figure above. SVM solves this by creating a new variable using a kernel. For a point xi on the line, we create a new variable yi as a function of its distance from the origin o. If we plot this, we get something like what is shown below.

In this case, the new variable y is created as a function
of distance from the origin. A non-linear function that
creates a new variable is referred to as a kernel.
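A minimal sketch of this idea in Python (the data points and the particular mapping yi = xi² are illustrative assumptions, not from the original slides):

import numpy as np

# 1D data that is NOT linearly separable: class 1 sits between the class 0 points
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
labels = np.array([0, 0, 1, 1, 1, 0, 0])

# Create a new variable as a function of the distance from the origin o
y = x ** 2          # squared distance from the origin

# In the (x, y) plane, a horizontal line such as y = 2 now separates the classes
print(y)            # [9.   6.25 0.25 0.   0.25 6.25 9.  ]
print(y < 2)        # True exactly for the class-1 points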

Support Vector Machine Terminology
Hyperplane: The hyperplane is the decision boundary that is used to separate the data points of different classes in a feature space. In the case of linear classification, it is a linear equation, i.e., wx + b = 0.
Support Vectors: Support vectors are the data points closest to the hyperplane; they play a critical role in deciding the hyperplane and the margin.
Margin: The margin is the distance between the support vectors and the hyperplane. The main objective of the support vector machine algorithm is to maximize the margin. A wider margin indicates better classification performance.

Kernel: The kernel is a mathematical function used in SVM to map the original input data points into high-dimensional feature spaces, so that the hyperplane can be found easily even if the data points are not linearly separable in the original input space. Some common kernel functions are linear, polynomial, radial basis function (RBF), and sigmoid; a brief usage sketch follows this list.
Hard Margin: The maximum-margin hyperplane, or hard-margin hyperplane, is a hyperplane that properly separates the data points of different categories without any misclassifications.
Soft Margin: When the data is not perfectly separable or contains outliers, SVM permits a soft-margin technique. In the soft-margin SVM formulation, each data point has a slack variable that softens the strict margin requirement and permits certain misclassifications or violations. It finds a compromise between increasing the margin and reducing the violations.
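A minimal sketch of using an SVM with an RBF kernel and a soft margin in scikit-learn (assuming scikit-learn is installed; the toy dataset and parameter values are illustrative assumptions):

from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Non-linearly separable toy data
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# The RBF kernel implicitly maps the data into a high-dimensional feature space;
# C controls the soft margin (a smaller C allows more margin violations)
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)

print(clf.support_vectors_.shape)   # the support vectors found by the model
print(clf.score(X, y))              # training accuracy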
