
Q.1] What is class-conditional density? [3 marks]

Ans. The variability of the measurements is expressed as a random variable x, and its probability density function depends on the class ωj. p(x|ωj) is the class-conditional probability density function: the probability density function for x given that the class is ωj.

Example of classification using the class-conditional density:

Classification problem: discriminate between healthy people and people with anemia.

• We have the results of a blood test, so we know the red blood cell count.

• The red blood cell count is the random variable (x).

• This variable has a Gaussian distribution.

Blood test: 4,500,000 red blood cells, so the patient is classified as healthy, because

p(x = 4,500,000 | healthy) > p(x = 4,500,000 | ill)

If we assume the patient is healthy, the probability of observing 4.5 million red blood cells is higher than if we assume he is ill.
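
As a minimal sketch of this comparison, assuming hypothetical Gaussian parameters for the two classes (the means and standard deviations below are illustrative only, not real clinical values):

```python
import math

def gaussian_pdf(x, mean, std):
    """Class-conditional density p(x | class) under a Gaussian assumption."""
    coeff = 1.0 / (std * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mean) ** 2) / (2 * std ** 2))

# Illustrative parameters (red blood cells per unit volume); not real clinical values.
healthy_mean, healthy_std = 5_000_000, 500_000
ill_mean, ill_std = 3_500_000, 600_000

x = 4_500_000  # observed blood test result
p_healthy = gaussian_pdf(x, healthy_mean, healthy_std)   # p(x | healthy)
p_ill = gaussian_pdf(x, ill_mean, ill_std)               # p(x | ill)

print("healthy" if p_healthy > p_ill else "ill")
```

The class whose conditional density is larger at the observed x is the one the classifier prefers.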
Q.2] What is a Decision Boundary? [3 marks]

Ans. While training a classifier on a dataset using a specific classification algorithm, it is required to define a set of hyperplanes, called the decision boundary, that separates the data points into specific classes; it is where the algorithm switches from one class to another. On one side of a decision boundary, a data point is more likely to be labelled as class A; on the other side of the boundary, it is more likely to be labelled as class B.

Let's take the example of logistic regression.

The goal of logistic regression is to find a way to split the data points so that a given observation's class can be predicted accurately from the information in its features.

Suppose we define a line that describes the decision boundary: all of the points on one side of the boundary belong to class A, and all of the points on the other side belong to class B.

In order to map predicted values to probabilities, logistic regression uses the sigmoid function:

S(z) = 1 / (1 + e^(-z))

• S(z) = output between 0 and 1 (probability estimate)

• z = input to the function (z = mx + b)

• e = base of the natural logarithm

The prediction function returns a probability score between 0 and 1. In order to map this to a discrete class (A/B), we select a threshold value, or tipping point, above which we classify values into class A and below which we classify values into class B:

p >= 0.5, class = A
p < 0.5, class = B

If our threshold is 0.5 and our prediction function returns 0.7, we classify the observation as class A. If the prediction is 0.2, we classify the observation as class B.

The line where the predicted probability equals 0.5 is therefore called the decision boundary.
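
As a minimal sketch of this decision rule, the slope m and intercept b below are arbitrary illustrative values, not parameters actually fitted by logistic regression:

```python
import math

def sigmoid(z):
    """S(z) = 1 / (1 + e^(-z)); maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(x, m=1.5, b=-3.0, threshold=0.5):
    """Label a 1-D observation as class 'A' or 'B' using z = m*x + b."""
    p = sigmoid(m * x + b)          # probability estimate for class A
    return "A" if p >= threshold else "B"

print(classify(4.0))   # p = sigmoid(3.0)  ≈ 0.95 -> 'A'
print(classify(1.0))   # p = sigmoid(-1.5) ≈ 0.18 -> 'B'
```

The set of points where m*x + b = 0 (so that S(z) = 0.5) is exactly the decision boundary described above.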


Q.3] Explain K-Nearest Neighbors Classifier. [6 marks]

Ans. K-Nearest Neighbours, also known as KNN, is a supervised learning algorithm that can be used for regression as well as classification problems. In machine learning it is generally used for classification.

KNN works on the principle that data points lying near each other tend to belong to the same class. In other words, it classifies a new data point based on similarity.

The KNN algorithm uses a number k, which is the number of nearest neighbours considered for the data point to be classified. If the value of k is 5, it looks for the 5 nearest neighbours of that data point.

For example, assume k = 4. KNN finds the 4 nearest neighbours of the black data point. If all the data points near the black data point belong to the green class, i.e. all the neighbours belong to the green class, then according to the KNN algorithm the black point is assigned to that class. The red class is not considered, because no red data points are close to the black data point.

The simplest version of the K-nearest neighbour classifier predicts the target label as the class of the single nearest neighbour. The closeness of each training point to the point being classified is measured using the Euclidean distance.

Advantages of KNN

• A simple algorithm that is easy to understand.

• Can be used for non-linear data.

• A versatile algorithm, used for both classification and regression.

• Gives high accuracy, although there are better-performing algorithms among supervised models.

• The algorithm does not require building a model, tuning several model parameters, or making additional assumptions.

Disadvantages of KNN

• Requires high storage.

• Slow prediction rate.

• Stores all the training data.

• The algorithm gets slower as the number of examples, predictors, or independent variables increases.
Significance of k

Specifically, the KNN algorithm works as follows: find the distance between a query and all examples in the data, select the K examples nearest to the query, and then decide

• the most frequent label, if used for classification problems, or

• the average of the labels, if used for regression problems

(a sketch of this procedure is given below).

Therefore, the algorithm depends heavily on the value of K:

• A larger value of k increases confidence in the prediction.

• Decisions may be skewed if k has a very large value.
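
A minimal from-scratch sketch of this procedure, assuming small in-memory NumPy arrays and the Euclidean distance metric (the toy data below is illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - query, axis=1)   # Euclidean distance to every example
    nearest = np.argsort(distances)[:k]                   # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)          # count class labels among neighbours
    return votes.most_common(1)[0][0]                     # most frequent label

# Toy 2-D data: class 'green' clustered near (1, 1), class 'red' near (5, 5).
X_train = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9], [5.0, 5.1], [5.2, 4.9]])
y_train = np.array(["green", "green", "green", "red", "red"])

print(knn_predict(X_train, y_train, np.array([1.0, 1.0]), k=4))  # -> 'green'
```

For regression, the majority vote would be replaced by the average of the neighbours' target values.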


Q.4] What is a Linear Classifier? [6 marks]

Ans. Linear classifiers assign data to labels based on a linear combination of the input features. These classifiers therefore separate data using a line, a plane, or a hyperplane (a plane in more than 2 dimensions). In their basic form they can only classify data that is linearly separable, though they can be modified to handle non-linearly separable data.

We will explore three major algorithms in linear binary classification:

Perceptron

In the perceptron, we take a weighted linear combination of the input features and pass it through a thresholding function which outputs 1 or 0. The sign of wTx tells us on which side of the plane wTx = 0 the point x lies. Thus, by taking the threshold as 0, the perceptron classifies data based on which side of the plane the new point lies on. The task during training is to arrive at a plane (defined by w) that accurately classifies the training data. If the data is linearly separable, perceptron training always converges.

Figure 1: Perceptron
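
A minimal sketch of perceptron training under these assumptions: a toy linearly separable dataset, labels in {0, 1}, and a fixed learning rate (all values are illustrative):

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=50):
    """Learn weights w and bias b so that step(w.x + b) matches the 0/1 labels."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if np.dot(w, xi) + b >= 0 else 0   # threshold at 0
            w += lr * (yi - pred) * xi                  # update only on mistakes
            b += lr * (yi - pred)
    return w, b

# Toy linearly separable data: class 1 roughly above the line x1 + x2 = 3, class 0 below.
X = np.array([[2.0, 2.0], [3.0, 1.5], [0.5, 1.0], [1.0, 0.5]])
y = np.array([1, 1, 0, 0])

w, b = train_perceptron(X, y)
print(1 if np.dot(w, [2.5, 2.0]) + b >= 0 else 0)   # expected: 1
```

Because the toy data is linearly separable, the mistake-driven updates stop once a separating plane is found.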

Logistic Regression

In logistic regression, we take a weighted linear combination of the input features and pass it through a sigmoid function, which outputs a number between 0 and 1. Unlike the perceptron, which only tells us which side of the plane the point lies on, logistic regression gives the probability of a point lying on a particular side of the plane. The probability of classification gets very close to 1 or 0 as the point moves far away from the plane, and is close to 0.5 for points very near the plane.
Figure 2: Logistic Regression

SVM
SVM is another linear classification algorithm (one that separates data with a hyperplane), just like the logistic regression and perceptron algorithms we saw before. Unlike the perceptron, which accepts any separating hyperplane, SVM chooses the hyperplane that maximizes the margin, i.e. the distance to the closest training points (the support vectors).

Figure 3: SVM Architecture


Q.5] Explain the discriminant function under the multivariate normal distribution with various cases of the covariance matrix. [12 marks]

Ans. Here we examine discriminant functions, classifiers, and decision surfaces under the normal density, in particular the multivariate normal distribution.

The discriminant function is given by

gi(x) = ln p(x|ωi) + ln P(ωi)

Since the density p(x|ωi) follows the multivariate normal distribution, the discriminant function can be written as

gi(x) = −1/2 (x − μi)tΣi−1(x − μi) − d/2 ln 2π − 1/2 ln|Σi| + ln P(ωi)

We will now examine this discriminant function in detail by dividing the covariance into

different cases.

Case 1 (Σi = σ2I)

This case occurs when σij = 0 for i ≠ j, i.e. the covariances are zero, and each variance σii equals σ2. This implies that the features are statistically independent and each feature has the same variance σ2. Since the |Σi| and d/2 ln 2π terms are independent of i, they do not change across classes and are unimportant for the decision, so they can be ignored.

Substituting this assumption into the normal discriminant function, we get

gi(x) = −||x − μi||2 / (2σ2) + ln P(ωi)

where the squared Euclidean norm is

||x − μi||2 = (x − μi)t(x − μi)


Here we notice that the discriminant function is the sum of two terms: the squared distance from the mean, normalized by the variance, and the log of the prior. If x is equally near two class means, the decision favours the a priori more likely class.

Expanding the equation further,

gi(x) = −1/(2σ2) [xtx − 2μitx + μitμi] + ln P(ωi)

Here the quadratic term xtx is the same for all i, so it can also be ignored. Hence we arrive at the equivalent linear discriminant function

gi(x) = witx + wi0

Comparing terms,

wi = (1/σ2) μi

and wi0 = −1/(2σ2) μitμi + ln P(ωi)

where wi0 is the threshold or bias for the ith category.

Classifiers that use linear discriminant functions are often called linear machines. For a linear machine, the decision surface between the two categories with the highest posterior probabilities is the hyperplane defined by gi(x) = gj(x). Applying this condition, we get

wt(x − x0) = 0

w = μi − μj

x0 = 1/2 (μi + μj) − [σ2 / ||μi − μj||2] ln[P(ωi)/P(ωj)] (μi − μj)


This is the equation of a hyperplane through the point x0 and orthogonal to the vector w. Since w = μi − μj, the hyperplane separating the regions Ri and Rj is orthogonal to the line joining the means. Further, if P(ωi) = P(ωj), the subtractive term in x0 vanishes and the hyperplane is the perpendicular bisector of the line between the means, i.e. it passes halfway between them.
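
As a minimal sketch of this Case 1 linear machine, assuming illustrative class means, a shared variance σ2, and given priors (all values below are made up for illustration):

```python
import numpy as np

def linear_discriminants(x, means, priors, sigma2):
    """Case 1 (Σi = σ²I): g_i(x) = w_i.x + w_i0 with w_i = μ_i/σ², w_i0 = -μ_i.μ_i/(2σ²) + ln P(ω_i)."""
    scores = []
    for mu, prior in zip(means, priors):
        w = mu / sigma2
        w0 = -np.dot(mu, mu) / (2 * sigma2) + np.log(prior)
        scores.append(np.dot(w, x) + w0)
    return np.array(scores)

# Illustrative two-class problem in 2-D.
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
priors = [0.5, 0.5]
sigma2 = 1.0

x = np.array([2.5, 2.0])
g = linear_discriminants(x, means, priors, sigma2)
print(int(np.argmax(g)))   # expected: 1 (x is closer to the second mean)
```

With equal priors this reduces to assigning x to the class with the nearest mean, as the text above describes.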

Case 2 (Σi = Σ)

This case rarely occurs in practice, but it is an important stepping stone from Case 1 to the more general Case 3. Here the covariance matrix is identical for all classes. We can again ignore the terms d/2 ln 2π and |Σi|, which leads to the equation

gi(x) = −1/2 (x − μi)tΣ−1(x − μi) + ln P(ωi)

If the prior probabilities P(ωi) are the same for all classes, the decision rule reduces to assigning the feature vector x to the class with the nearest mean vector. If the priors are unequal, the decision is biased in favour of the a priori more likely class.

The quadratic form (x − μi)tΣ−1(x − μi) can be expanded, and we again notice that the term xtΣ−1x is independent of i and can be ignored. After this term is dropped we get the linear discriminant function

gi(x) = witx + wi0


wi = Σ−1μi

wi0 = −1/2 μitΣ−1μi + ln P(ωi)

Since the discriminant function is linear, the resulting decision boundaries are again hyperplanes, with the equation

wt(x − x0) = 0

w = Σ−1(μi − μj)

x0 = 1/2 (μi + μj) − { ln[P(ωi)/P(ωj)] / (μi − μj)tΣ−1(μi − μj) } (μi − μj)

Compared with the hyperplane of Case 1, this hyperplane is generally not orthogonal to the line between the means; however, when the priors are equal it still intersects that line halfway between the means.

Case 3 (Σi arbitrary)

We now come to the most realistic case, in which the covariance matrix is different for each class. Now the only term dropped from the discriminant function of the multivariate normal density is d/2 ln 2π.

The resulting discriminant function is no longer linear; it is inherently quadratic:
gi(x) = xtWix + witx + wi0

Wi = −1/2 Σi−1

wi = Σi−1μi

wi0 = −1/2μitΣi−1μi − 1/2ln |Σi| + ln P(ωi)

The decision surfaces are hyperquadrics, i.e. general quadric surfaces: hyperplanes, pairs of hyperplanes, hyperspheres, hyperparaboloids, and more complicated shapes. The extension to more than two categories is straightforward.
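
A minimal sketch of this general Case 3 quadratic discriminant, assuming illustrative class means, covariance matrices, and priors (the values below are made up for illustration):

```python
import numpy as np

def quadratic_discriminant(x, mu, sigma, prior):
    """g_i(x) = x^t W_i x + w_i^t x + w_i0 for an arbitrary covariance Σ_i."""
    sigma_inv = np.linalg.inv(sigma)
    W = -0.5 * sigma_inv
    w = sigma_inv @ mu
    w0 = (-0.5 * mu @ sigma_inv @ mu
          - 0.5 * np.log(np.linalg.det(sigma))
          + np.log(prior))
    return x @ W @ x + w @ x + w0

# Illustrative two-class problem with different covariance matrices.
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
sigma = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
prior = [0.6, 0.4]

x = np.array([2.0, 2.5])
scores = [quadratic_discriminant(x, m, s, p) for m, s, p in zip(mu, sigma, prior)]
print(int(np.argmax(scores)))   # index of the class with the largest g_i(x)
```

Because each class keeps its own Σi, the resulting decision surface between any two classes is a hyperquadric rather than a hyperplane.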
