Q.1] What is a class-conditional probability density? Explain with an example. [3 marks]
Ans. The variability of the measurements is expressed as a random variable x, and its
probability density function depends on the class ωj. p(x|ωj) is the class-conditional
probability density function: the probability density of x given that the class is ωj.
• We have the results of a blood test, so we know the patient's red-blood-cell count.
Blood test: 4,500,000 red blood cells. So, the patient is healthy.
If we consider the patient healthy, the probability that he has 4.5 million red blood cells is
higher than if we consider him ill with this number of red blood cells.
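The comparison above can be sketched numerically. The densities below are hypothetical Gaussians; the means and standard deviations are illustrative assumptions, not clinical values:

```python
import math

def gaussian_pdf(x, mean, std):
    """Univariate normal density N(x; mean, std^2)."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

# Hypothetical class-conditional densities (means/SDs are made up for illustration):
# healthy ~ N(5.0, 0.5^2) million cells, ill ~ N(3.5, 0.7^2) million cells.
x = 4.5  # observed red-blood-cell count, in millions
p_healthy = gaussian_pdf(x, mean=5.0, std=0.5)
p_ill = gaussian_pdf(x, mean=3.5, std=0.7)

# The observation is more likely under the "healthy" class-conditional density.
print(p_healthy > p_ill)
```

Under these assumed densities, p(x|healthy) exceeds p(x|ill) at x = 4.5, which is the sense in which the observation favours the "healthy" class.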
Q.2] What are Decision Boundaries? [3 marks]
Ans. A decision boundary is the surface that separates data points into specific classes; it is
where the algorithm switches from one class to another. On one side of a decision boundary, a
data point is more likely to be labelled class A; on the other side, class B.
The goal of logistic regression is to figure out some way to split the data points so as to make
an accurate prediction of a given observation's class using the information present in the
features.
Let's suppose we define a line that describes the decision boundary. Then all of the data points
on one side of the boundary belong to class A, and all of the data points on the other side of
the boundary belong to class B.
S(z) = 1 / (1 + e^(−z))
Our current prediction function returns a probability score between 0 and 1. In order to
map this to a discrete class (A/B), we select a threshold value or tipping point above
which we will classify values into class A and below which we classify values into class B.
p ≥ 0.5: class = A
p < 0.5: class = B
If our threshold was 0.5 and our prediction function returned 0.7, we would classify the
observation as belonging to class A. If our prediction was 0.2, we would classify the
observation as belonging to class B.
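A minimal sketch of this thresholding rule, with the sigmoid included for completeness:

```python
import math

def sigmoid(z):
    """Logistic function S(z) = 1 / (1 + e^(-z)); maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(p, threshold=0.5):
    """Map a probability score to a discrete class label using a tipping point."""
    return "A" if p >= threshold else "B"

print(classify(0.7))  # class A, since 0.7 >= 0.5
print(classify(0.2))  # class B, since 0.2 < 0.5
```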
Ans. K-Nearest Neighbor, also known as KNN, is a supervised learning algorithm that can be used
for regression as well as classification problems. Generally, it is used for classification
problems in machine learning.
KNN works on the principle that data points lying near each other tend to belong to the same
class. In other words, it classifies a new data point based on similarity.
The KNN algorithm fixes a number k, the number of nearest neighbours considered for the data
point that is to be classified. If the value of k is 5, it will look for the 5 nearest
neighbours of that data point.
For example, assume k = 4. KNN finds the 4 nearest neighbours of the black data point. All the
data points near the black data point belong to the green class, meaning all the neighbours
belong to the green class, so according to the KNN algorithm the black point will belong to this
class only. The red class is not considered because red-class data points are nowhere close to
the black data point.
The simplest version of the k-nearest-neighbour classifier predicts the target label by finding
the nearest neighbour class: the class closest to the point being classified is assigned as its
label.
Advantages of KNN
A simple algorithm that is easy to understand.
Gives high accuracy, though there are better algorithms among supervised models.
The algorithm doesn't demand building a model, tuning several model parameters, or making
additional assumptions.
Disadvantages of KNN
The algorithm gets slower as the number of examples, predictors, or independent variables
increases.
Significance of k
Specifically, the KNN algorithm works in this way: find the distance between a query and all
examples in the data, select the particular number of examples (say K) nearest to the query,
then decide
the most frequent label among them if used for classification problems, or
the average of their labels if used for regression problems.
Therefore, the algorithm hugely depends on the value of K: a small K makes predictions sensitive
to noise, while a very large K averages over too many neighbours and blurs the class boundaries.
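The steps above can be sketched as a from-scratch implementation; the toy data set below is a made-up illustration of the green/red example:

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training points.
    `train` is a list of (feature_vector, label) pairs."""
    # Step 1: sort all examples by Euclidean distance to the query.
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    # Step 2: take the K nearest examples.
    top_k_labels = [label for _, label in by_distance[:k]]
    # Step 3: return the most frequent label among them.
    return Counter(top_k_labels).most_common(1)[0][0]

# Toy data (illustrative): points near the origin are "green", far points are "red".
train = [((0, 0), "green"), ((0, 1), "green"), ((1, 0), "green"), ((1, 1), "green"),
         ((8, 8), "red"), ((9, 8), "red"), ((8, 9), "red")]
print(knn_predict(train, (0.5, 0.5), k=4))  # "green": all 4 nearest neighbours are green
```

For regression, step 3 would instead return the mean of the K neighbours' target values.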
Ans. Linear classifiers classify data into labels based on a linear combination of input features.
Therefore, these classifiers separate data using a line, a plane, or a hyperplane (a plane in
more than 2 dimensions). They can only be used to classify data that is linearly separable,
though they can be modified to classify non-linearly separable data.
Perceptron
In the perceptron, we take a weighted linear combination of input features and pass it through a
thresholding function which outputs 1 or 0. The sign of wtx tells us which side of the plane
wtx = 0 the point x lies on. Thus, by taking the threshold as 0, the perceptron classifies data
based on which side of the plane the new point lies on. The task during training is to arrive at
the plane (defined by w) that accurately classifies the training data. If the data is linearly
separable, perceptron training always converges.
Figure 1: Perceptron
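A minimal sketch of perceptron training under the description above (threshold at 0, weights updated only on mistakes); the AND data set is an illustrative linearly separable example, and the bias is folded in as a constant first feature:

```python
def perceptron_train(X, y, epochs=100):
    """Train a perceptron on linearly separable data.
    X: feature vectors with a leading 1 for the bias; y: labels in {0, 1}."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        errors = 0
        for xi, target in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi))
            pred = 1 if z >= 0 else 0            # threshold at w^T x = 0
            if pred != target:                   # update only on misclassified points
                sign = 1 if target == 1 else -1
                w = [wj + sign * xj for wj, xj in zip(w, xi)]
                errors += 1
        if errors == 0:                          # all training points classified: converged
            break
    return w

# Illustrative linearly separable data (logical AND), bias term prepended as x0 = 1:
X = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
y = [0, 0, 0, 1]
w = perceptron_train(X, y)
```

Because AND is linearly separable, the loop terminates with a plane that classifies all four points correctly, consistent with the convergence guarantee stated above.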
Logistic Regression
In logistic regression, we take a weighted linear combination of input features and pass it
through a sigmoid function which outputs a number between 0 and 1. Unlike the perceptron, which
just tells us which side of the plane the point lies on, logistic regression gives the
probability of a point lying on a particular side of the plane. The probability of
classification will be very close to 1 or 0 as the point goes far away from the plane, while the
probability for points very close to the plane is close to 0.5.
Figure 2: Logistic Regression
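This behaviour can be illustrated by evaluating the sigmoid at a few values of z = wtx, which grows with the signed distance of x from the plane wtx = 0:

```python
import math

def sigmoid(z):
    """Logistic function: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Far on the positive side -> probability near 1; far on the negative side -> near 0;
# close to the plane (z near 0) -> probability near 0.5.
for z in (-6.0, -0.1, 0.0, 0.1, 6.0):
    print(f"z = {z:+.1f}  ->  P(class A) = {sigmoid(z):.3f}")
```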
SVM
SVM is another linear classification algorithm (one which separates data with a hyperplane),
just like the logistic regression and perceptron algorithms we saw before.
Ans. Discriminant functions, classifiers, decision surfaces, and the normal density, including
the univariate as well as multivariate normal distribution.
Since the normal density p(x|ωi) follows the multivariate normal distribution, our discriminant
function takes the form
gi(x) = −(1/2)(x − μi)t Σi−1 (x − μi) − (d/2) ln 2π − (1/2) ln |Σi| + ln P(ωi)
We will now examine this discriminant function in detail by dividing the covariance into
different cases.
Case-1 ( Σi = σ2 I )
This case occurs when σij = 0 for i != j, i.e., the covariances are zero and the variance of
each feature σii remains σ2. This implies that the features are statistically independent and
each feature has variance σ2. As the |Σi| and (d/2) ln 2π terms are independent of i, they do
not change the decision and can be dropped; the distance from the mean is normalized by the
variance, and the other term is the log of the prior. Our discriminant function reduces to
gi(x) = −||x − μi||2 / (2σ2) + ln P(ωi)
Here the quadratic term xtx is the same for all i, so it can also be ignored. Hence, we obtain a
linear discriminant function
gi(x) = witx + wi0
Comparing terms,
wi = (1/σ2) μi
wi0 = −(1/(2σ2)) μitμi + ln P(ωi)
Such classifiers that use linear discriminant functions are often called linear machines.
For any linear machine, the decision surface is the hyperplane defined by the equation
gi(x) = gj(x) for the two categories with the highest posterior probabilities:
wt(x − x0) = 0
w = μi − μj
x0 = (1/2)(μi + μj) − (σ2 / ||μi − μj||2) ln [P(ωi)/P(ωj)] (μi − μj)
Since w = μi − μj, the hyperplane separating the two regions Ri and Rj is orthogonal to the line
joining the means. Further, if P(ωi) = P(ωj), the subtractive term in x0 vanishes, and the
hyperplane passes halfway between the means.
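The case-1 formulas for w and x0 can be sketched directly; the means, variance, and priors below are illustrative values:

```python
import math

def case1_boundary(mu_i, mu_j, sigma2, prior_i, prior_j):
    """Hyperplane wt(x - x0) = 0 for Σi = σ² I:
    w = μi - μj,  x0 = ½(μi + μj) - (σ²/||μi - μj||²) ln[P(ωi)/P(ωj)] (μi - μj)."""
    w = [a - b for a, b in zip(mu_i, mu_j)]
    norm2 = sum(c * c for c in w)                       # ||μi - μj||²
    shift = (sigma2 / norm2) * math.log(prior_i / prior_j)
    x0 = [0.5 * (a + b) - shift * c for a, b, c in zip(mu_i, mu_j, w)]
    return w, x0

# With equal priors the subtractive term vanishes: x0 is the midpoint of the means.
w, x0 = case1_boundary(mu_i=(2.0, 0.0), mu_j=(0.0, 0.0), sigma2=1.0,
                       prior_i=0.5, prior_j=0.5)
print(w, x0)  # w = [2.0, 0.0], x0 = [1.0, 0.0]
```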
Case- 2 ( Σi = Σ )
This case rarely occurs, but it is important as a transition from case-1 to the more generalized
case-3. Here the covariance matrix is identical for all of the classes. Again we can ignore the
terms (d/2) ln 2π and |Σi|, which eventually leads us to the equation
gi(x) = −(1/2)(x − μi)t Σ−1 (x − μi) + ln P(ωi)
If the prior probabilities P(ωi) are the same for all classes, the decision rule is to assign
the feature vector x to the class with the nearest mean vector (nearest in the sense of the
Mahalanobis distance). If the prior probabilities are biased, the decision is in favour of the
more likely prior class.
The quadratic form (x − μi)t Σ−1 (x − μi) can be expanded, and we again notice that the term
xt Σ−1 x can be ignored as it is independent of i. After this term is dropped we get a linear
discriminant function
gi(x) = witx + wi0
wi = Σ−1μi
wi0 = −(1/2) μit Σ−1 μi + ln P(ωi)
As the discriminant function is linear, the resulting decision boundaries are again hyperplanes:
wt(x − x0) = 0
w = Σ−1(μi − μj)
The difference between this hyperplane and the one in case 1 is that it is generally not
orthogonal to the line between the means, and it does not intersect that line halfway between
the means unless the priors are equal.
Case- 3 ( Σi = arbitrary )
We now come to the most realistic case, where the covariance is different for each class. Now
the only term dropped from the discriminant function of the multivariate normal is the
(d/2) ln 2π term. The resulting discriminant function is no longer linear; it is inherently
quadratic:
gi(x) = xtWix + witx + wi0
Wi = −(1/2) Σi−1
wi = Σi−1μi
wi0 = −(1/2) μit Σi−1 μi − (1/2) ln |Σi| + ln P(ωi)
The decision surfaces are hyperquadrics, i.e., the general forms of hyperplanes: pairs of
hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, and hyperhyperboloids.
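A sketch of the case-3 quadratic discriminant, evaluating Wi, wi, and wi0 exactly as defined above; the class parameters below are illustrative:

```python
import numpy as np

def quadratic_discriminant(x, mu, cov, prior):
    """gi(x) = xt Wi x + wit x + wi0 with
    Wi = -1/2 Σi^{-1},  wi = Σi^{-1} μi,
    wi0 = -1/2 μit Σi^{-1} μi - 1/2 ln|Σi| + ln P(ωi)."""
    inv = np.linalg.inv(cov)
    W = -0.5 * inv
    w = inv @ mu
    w0 = -0.5 * mu @ inv @ mu - 0.5 * np.log(np.linalg.det(cov)) + np.log(prior)
    return x @ W @ x + w @ x + w0

# Two classes with different (arbitrary) covariances: the boundary is a hyperquadric.
x = np.array([1.0, 1.0])
g1 = quadratic_discriminant(x, np.array([0.0, 0.0]), np.diag([1.0, 1.0]), 0.5)
g2 = quadratic_discriminant(x, np.array([3.0, 3.0]), np.diag([4.0, 4.0]), 0.5)
print("class 1" if g1 > g2 else "class 2")  # decide for the larger discriminant
```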