
Q.1] What is class-conditional density? [3 marks]

Ans. The variability of the measurements is expressed as a random variable x, and its probability density function depends on the class ωj. p(x|ωj) is the class-conditional probability density function: the probability density function for x given that the class is ωj.

Example of classification using the class-conditional density:

Classification problem: discriminate between healthy people and people with anemia.

• We have the results of a blood test, so we know the red blood cell count.

• The red blood cell count is the random variable (x).

• This variable has a Gaussian distribution.

Blood test: 4,500,000 red blood cells, so the patient is classified as healthy, because

p(x = 4,500,000 | healthy) > p(x = 4,500,000 | ill)

If we assume the patient is healthy, the probability of observing 4.5 million red blood cells is higher than if we assume he is ill.
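
As a minimal sketch of this comparison, assuming hypothetical Gaussian parameters for the two classes (the means and standard deviations below are illustrative only, not real clinical values):

```python
import math

def gaussian_pdf(x, mean, std):
    """Class-conditional density p(x | class) under a Gaussian assumption."""
    coeff = 1.0 / (std * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mean) ** 2) / (2 * std ** 2))

# Illustrative parameters (red blood cells per unit volume); not real clinical values.
healthy_mean, healthy_std = 5_000_000, 500_000
ill_mean, ill_std = 3_500_000, 600_000

x = 4_500_000  # observed blood test result
p_healthy = gaussian_pdf(x, healthy_mean, healthy_std)   # p(x | healthy)
p_ill = gaussian_pdf(x, ill_mean, ill_std)               # p(x | ill)

print("healthy" if p_healthy > p_ill else "ill")
```

The class whose conditional density is larger at the observed x is the one the classifier prefers.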
Q.2] What is a Decision Boundary? [3 marks]

Ans. While training a classifier on a dataset using a specific classification algorithm, it is required to define a set of hyperplanes, called the decision boundary, that separates the data points into specific classes; it is where the algorithm switches from one class to another. On one side of a decision boundary, a data point is more likely to be labelled as class A; on the other side of the boundary, it is more likely to be labelled as class B.

Let's take the example of logistic regression.

The goal of logistic regression is to find a way to split the data points so that a given observation's class can be predicted accurately from the information in its features.

Suppose we define a line that describes the decision boundary: all of the points on one side of the boundary belong to class A, and all of the points on the other side belong to class B.

In order to map predicted values to probabilities, logistic regression uses the sigmoid function:

S(z) = 1 / (1 + e^(-z))

• S(z) = output between 0 and 1 (probability estimate)

• z = input to the function (z = mx + b)

• e = base of the natural logarithm

The prediction function returns a probability score between 0 and 1. In order to map this to a discrete class (A/B), we select a threshold value, or tipping point, above which we classify values into class A and below which we classify values into class B:

p >= 0.5, class = A
p < 0.5, class = B

If our threshold is 0.5 and our prediction function returns 0.7, we classify the observation as class A. If the prediction is 0.2, we classify the observation as class B.

The line where the predicted probability equals 0.5 is therefore called the decision boundary.
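
As a minimal sketch of this decision rule, the slope m and intercept b below are arbitrary illustrative values, not parameters actually fitted by logistic regression:

```python
import math

def sigmoid(z):
    """S(z) = 1 / (1 + e^(-z)); maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(x, m=1.5, b=-3.0, threshold=0.5):
    """Label a 1-D observation as class 'A' or 'B' using z = m*x + b."""
    p = sigmoid(m * x + b)          # probability estimate for class A
    return "A" if p >= threshold else "B"

print(classify(4.0))   # p = sigmoid(3.0)  ≈ 0.95 -> 'A'
print(classify(1.0))   # p = sigmoid(-1.5) ≈ 0.18 -> 'B'
```

The set of points where m*x + b = 0 (so that S(z) = 0.5) is exactly the decision boundary described above.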


Q.3] Explain K-Nearest Neighbors Classifier. [6 marks]

Ans. K-Nearest Neighbours, also known as KNN, is a supervised learning algorithm that can be used for regression as well as classification problems. In machine learning it is generally used for classification.

KNN works on the principle that data points lying near each other tend to belong to the same class. In other words, it classifies a new data point based on similarity.

The KNN algorithm uses a number k, which is the number of nearest neighbours considered for the data point to be classified. If the value of k is 5, it looks for the 5 nearest neighbours of that data point.

For example, assume k = 4. KNN finds the 4 nearest neighbours of the black data point. If all the data points near the black data point belong to the green class, i.e. all the neighbours belong to the green class, then according to the KNN algorithm the black point is assigned to that class. The red class is not considered, because no red data points are close to the black data point.

The simplest version of the K-nearest neighbour classifier predicts the target label as the class of the single nearest neighbour. The closeness of each training point to the point being classified is measured using the Euclidean distance.

Advantages of KNN

• A simple algorithm that is easy to understand.

• Can be used for non-linear data.

• A versatile algorithm, used for both classification and regression.

• Gives high accuracy, although there are better-performing algorithms among supervised models.

• The algorithm does not require building a model, tuning several model parameters, or making additional assumptions.

Disadvantages of KNN

• Requires high storage.

• Slow prediction rate.

• Stores all the training data.

• The algorithm gets slower as the number of examples, predictors, or independent variables increases.
Significance of k

Specifically, the KNN algorithm works as follows: find the distance between a query and all examples in the data, select the K examples nearest to the query, and then decide

• the most frequent label, if used for classification problems, or

• the average of the labels, if used for regression problems

(a sketch of this procedure is given below).

Therefore, the algorithm depends heavily on the value of K:

• A larger value of k increases confidence in the prediction.

• Decisions may be skewed if k has a very large value.
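
A minimal from-scratch sketch of this procedure, assuming small in-memory NumPy arrays and the Euclidean distance metric (the toy data below is illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - query, axis=1)   # Euclidean distance to every example
    nearest = np.argsort(distances)[:k]                   # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)          # count class labels among neighbours
    return votes.most_common(1)[0][0]                     # most frequent label

# Toy 2-D data: class 'green' clustered near (1, 1), class 'red' near (5, 5).
X_train = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9], [5.0, 5.1], [5.2, 4.9]])
y_train = np.array(["green", "green", "green", "red", "red"])

print(knn_predict(X_train, y_train, np.array([1.0, 1.0]), k=4))  # -> 'green'
```

For regression, the majority vote would be replaced by the average of the neighbours' target values.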


Q.4] What is a Linear Classifier? [6 marks]

Ans. Linear classifiers assign data to labels based on a linear combination of the input features. These classifiers therefore separate data using a line, a plane, or a hyperplane (a plane in more than 2 dimensions). In their basic form they can only classify data that is linearly separable, though they can be modified to handle non-linearly separable data.

We will explore three major algorithms in linear binary classification:

Perceptron

In the perceptron, we take a weighted linear combination of the input features and pass it through a thresholding function which outputs 1 or 0. The sign of wTx tells us on which side of the plane wTx = 0 the point x lies. Thus, by taking the threshold as 0, the perceptron classifies data based on which side of the plane the new point lies on. The task during training is to arrive at a plane (defined by w) that accurately classifies the training data. If the data is linearly separable, perceptron training always converges.

Figure 1: Perceptron
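
A minimal sketch of perceptron training under these assumptions: a toy linearly separable dataset, labels in {0, 1}, and a fixed learning rate (all values are illustrative):

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=50):
    """Learn weights w and bias b so that step(w.x + b) matches the 0/1 labels."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if np.dot(w, xi) + b >= 0 else 0   # threshold at 0
            w += lr * (yi - pred) * xi                  # update only on mistakes
            b += lr * (yi - pred)
    return w, b

# Toy linearly separable data: class 1 roughly above the line x1 + x2 = 3, class 0 below.
X = np.array([[2.0, 2.0], [3.0, 1.5], [0.5, 1.0], [1.0, 0.5]])
y = np.array([1, 1, 0, 0])

w, b = train_perceptron(X, y)
print(1 if np.dot(w, [2.5, 2.0]) + b >= 0 else 0)   # expected: 1
```

Because the toy data is linearly separable, the mistake-driven updates stop once a separating plane is found.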

Logistic Regression

In logistic regression, we take a weighted linear combination of the input features and pass it through a sigmoid function, which outputs a number between 0 and 1. Unlike the perceptron, which only tells us which side of the plane the point lies on, logistic regression gives the probability of a point lying on a particular side of the plane. The probability of classification gets very close to 1 or 0 as the point moves far away from the plane, and is close to 0.5 for points very near the plane.
Figure 2: Logistic Regression

SVM
SVM is another linear classification algorithm (one that separates data with a hyperplane), just like the logistic regression and perceptron algorithms we saw before. Unlike the perceptron, which accepts any separating hyperplane, SVM chooses the hyperplane that maximizes the margin, i.e. the distance to the closest training points (the support vectors).

Figure 3: SVM Architecture


Q.5] Explain the discriminant function under the multivariate normal distribution with various cases of the covariance matrix. [12 marks]

Ans. Here we examine discriminant functions, classifiers, and decision surfaces under the normal density, in particular the multivariate normal distribution.

The discriminant function is given by

gi(x) = ln p(x|ωi) + ln P(ωi)

Since the density p(x|ωi) follows the multivariate normal distribution, the discriminant function can be written as

gi(x) = −1/2 (x − μi)tΣi−1(x − μi) − d/2 ln 2π − 1/2 ln|Σi| + ln P(ωi)

We will now examine this discriminant function in detail by dividing the covariance into

different cases.

Case 1 (Σi = σ2I)

This case occurs when σij = 0 for i ≠ j, i.e. the covariances are zero, and each variance σii equals σ2. This implies that the features are statistically independent and each feature has the same variance σ2. Since the |Σi| and d/2 ln 2π terms are independent of i, they do not change across classes and are unimportant for the decision, so they can be ignored.

Substituting this assumption into the normal discriminant function, we get

gi(x) = −||x − μi||2 / (2σ2) + ln P(ωi)

where the squared Euclidean norm is

||x − μi||2 = (x − μi)t(x − μi)


Here we notice that the discriminant function is the sum of two terms: the squared distance from the mean, normalized by the variance, and the log of the prior. If x is equally near two class means, the decision favours the a priori more likely class.

Expanding the equation further,

gi(x) = −1/(2σ2) [xtx − 2μitx + μitμi] + ln P(ωi)

Here the quadratic term xtx is the same for all i, so it can also be ignored. Hence we arrive at the equivalent linear discriminant function

gi(x) = witx + wi0

Comparing terms,

wi = (1/σ2) μi

and wi0 = −1/(2σ2) μitμi + ln P(ωi)

where wi0 is the threshold or bias for the ith category.

Classifiers that use linear discriminant functions are often called linear machines. For a linear machine, the decision surface between the two categories with the highest posterior probabilities is the hyperplane defined by gi(x) = gj(x). Applying this condition, we get

wt(x − x0) = 0

w = μi − μj

x0 = 1/2 (μi + μj) − [σ2 / ||μi − μj||2] ln[P(ωi)/P(ωj)] (μi − μj)


This is the equation of a hyperplane through the point x0 and orthogonal to the vector w. Since w = μi − μj, the hyperplane separating the regions Ri and Rj is orthogonal to the line joining the means. Further, if P(ωi) = P(ωj), the subtractive term in x0 vanishes and the hyperplane is the perpendicular bisector of the line between the means, i.e. it passes halfway between them.
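
As a minimal sketch of this Case 1 linear machine, assuming illustrative class means, a shared variance σ2, and given priors (all values below are made up for illustration):

```python
import numpy as np

def linear_discriminants(x, means, priors, sigma2):
    """Case 1 (Σi = σ²I): g_i(x) = w_i.x + w_i0 with w_i = μ_i/σ², w_i0 = -μ_i.μ_i/(2σ²) + ln P(ω_i)."""
    scores = []
    for mu, prior in zip(means, priors):
        w = mu / sigma2
        w0 = -np.dot(mu, mu) / (2 * sigma2) + np.log(prior)
        scores.append(np.dot(w, x) + w0)
    return np.array(scores)

# Illustrative two-class problem in 2-D.
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
priors = [0.5, 0.5]
sigma2 = 1.0

x = np.array([2.5, 2.0])
g = linear_discriminants(x, means, priors, sigma2)
print(int(np.argmax(g)))   # expected: 1 (x is closer to the second mean)
```

With equal priors this reduces to assigning x to the class with the nearest mean, as the text above describes.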

Case 2 (Σi = Σ)

This case rarely occurs in practice, but it is an important stepping stone from Case 1 to the more general Case 3. Here the covariance matrix is identical for all classes. We can again ignore the terms d/2 ln 2π and |Σi|, which leads to the equation

gi(x) = −1/2 (x − μi)tΣ−1(x − μi) + ln P(ωi)

If the prior probabilities P(ωi) are the same for all classes, the decision rule reduces to assigning the feature vector x to the class with the nearest mean vector. If the priors are unequal, the decision is biased in favour of the a priori more likely class.

The quadratic form (x − μi)tΣ−1(x − μi) can be expanded, and we again notice that the term xtΣ−1x is independent of i and can be ignored. After this term is dropped we get the linear discriminant function

gi(x) = witx + wi0


wi = Σ−1μi

wi0 = −1/2 μitΣ−1μi + ln P(ωi)

Since the discriminant function is linear, the resulting decision boundaries are again hyperplanes, with the equation

wt(x − x0) = 0

w = Σ−1(μi − μj)

x0 = 1/2 (μi + μj) − { ln[P(ωi)/P(ωj)] / (μi − μj)tΣ−1(μi − μj) } (μi − μj)

Compared with the hyperplane of Case 1, this hyperplane is generally not orthogonal to the line between the means; however, when the priors are equal it still intersects that line halfway between the means.

Case 3 (Σi arbitrary)

We now come to the most realistic case, in which the covariance matrix is different for each class. Now the only term dropped from the discriminant function of the multivariate normal density is d/2 ln 2π.

The resulting discriminant function is no longer linear; it is inherently quadratic:
gi(x) = xtWix + witx + wi0

Wi = −1/2 Σi−1

wi = Σi−1μi

wi0 = −1/2μitΣi−1μi − 1/2ln |Σi| + ln P(ωi)

The decision surfaces are hyperquadrics, i.e. general quadric surfaces: hyperplanes, pairs of hyperplanes, hyperspheres, hyperparaboloids, and more complicated shapes. The extension to more than two categories is straightforward.
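
A minimal sketch of this general Case 3 quadratic discriminant, assuming illustrative class means, covariance matrices, and priors (the values below are made up for illustration):

```python
import numpy as np

def quadratic_discriminant(x, mu, sigma, prior):
    """g_i(x) = x^t W_i x + w_i^t x + w_i0 for an arbitrary covariance Σ_i."""
    sigma_inv = np.linalg.inv(sigma)
    W = -0.5 * sigma_inv
    w = sigma_inv @ mu
    w0 = (-0.5 * mu @ sigma_inv @ mu
          - 0.5 * np.log(np.linalg.det(sigma))
          + np.log(prior))
    return x @ W @ x + w @ x + w0

# Illustrative two-class problem with different covariance matrices.
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
sigma = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
prior = [0.6, 0.4]

x = np.array([2.0, 2.5])
scores = [quadratic_discriminant(x, m, s, p) for m, s, p in zip(mu, sigma, prior)]
print(int(np.argmax(scores)))   # index of the class with the largest g_i(x)
```

Because each class keeps its own Σi, the resulting decision surface between any two classes is a hyperquadric rather than a hyperplane.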
