
Machine Learning Techniques (KCS 055)

UNIT-II (A)
A-REGRESSION: Linear Regression and Logistic Regression
B-BAYESIAN LEARNING - Bayes theorem, Concept learning, Bayes Optimal Classifier, Naïve Bayes
classifier, Bayesian belief networks, EM algorithm.
C-SUPPORT VECTOR MACHINE: Introduction, Types of support vector kernel – (Linear kernel,
Polynomial kernel, and Gaussian kernel), Hyperplane – (Decision surface), Properties of SVM, and
Issues in SVM.

A-REGRESSION

Regression

Regression analysis is used to predict the value of a variable based on the value of another
variable. The variable you want to predict is called the dependent variable. The variable you are
using to predict the other variable's value is called the independent variable.

Regression analysis is a statistical method to model the relationship between a dependent
(target) variable and one or more independent (predictor) variables. More specifically,
regression analysis helps us understand how the value of the dependent variable changes
corresponding to an independent variable when the other independent variables are held
fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.

Linear models play a central part in modern statistical methods. These models can
approximate a large class of metric data structures over their entire range of definition, or
at least piecewise.

Model = Data + Analysis

The term “model” is broadly used to represent any phenomenon in a mathematical framework.

Regression is a supervised learning technique which helps in finding the correlation between
variables and enables us to predict a continuous output variable based on one or more
predictor variables. It is mainly used for prediction, forecasting, time-series modeling, and
determining cause-and-effect relationships between variables.



In regression, we plot a graph between the variables that best fits the given data points;
using this plot, the machine learning model can make predictions about the data. In simple
words, "Regression shows a line or curve that passes through all the data points on the target-
predictor graph in such a way that the vertical distance between the data points and the
regression line is minimum." The distance between the data points and the line tells whether the
model has captured a strong relationship or not.

Some examples of regression are:

o Prediction of rain using temperature and other factors
o Determining market trends
o Prediction of road accidents due to rash driving

Important Terms

Dependent Variable: The main factor in Regression analysis which we want to predict or
understand is called the dependent variable. It is also called target variable.

Independent Variable: The factors which affect the dependent variables or which are used to
predict the values of the dependent variables are called independent variable, also called as
a predictor.

Outliers: An outlier is an observation with either a very low or a very high value in
comparison to the other observed values. An outlier may distort the results, so it should be
avoided.

Multicollinearity: If the independent variables are highly correlated with each other, then such
a condition is called multicollinearity. It should not be present in the dataset, because it creates
problems when ranking the most influential variables.

Underfitting and Overfitting: If our algorithm works well with the training dataset but not with
the test dataset, the problem is called overfitting. And if our algorithm does not perform well
even with the training dataset, the problem is called underfitting.

There are two types of regression:

a. Simple linear regression: It uses one independent variable to explain or predict the outcome
of the dependent variable Y.
Y = b0 + b1X + u
b. Multiple linear regression: It uses two or more independent variables to predict outcomes.
Y = a + b1X1 + b2X2 + b3X3 + ... + btXt + u
Where:
Y = The variable we are trying to predict (dependent variable).



X = The variable that we are using to predict Y (independent variable).
a = The intercept.
b = The slope.
u = The regression residual.

Linear Regression

It is a statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc. The linear regression
algorithm shows a linear relationship between a dependent (y) variable and one or more independent (x)
variables, hence it is called linear regression. Since linear regression shows a linear relationship, it finds
how the value of the dependent variable changes according to the value of the independent variable.

Mathematically, we can represent a linear regression as:

y = a0 + a1x + ε

where
y = Dependent variable (target variable)
x = Independent variable (predictor variable)
a0 = Intercept of the line (gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor applied to each input value)
ε = Random error
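
To make the formula concrete, here is a small illustration (not from the original notes) of estimating a0 and a1 by ordinary least squares; the numbers are invented for the example:

    # Closed-form least-squares estimates of the intercept a0 and slope a1
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # independent variable (predictor)
    y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])   # dependent variable (target)

    a1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
    a0 = y.mean() - a1 * x.mean()                                               # intercept
    print(f"y = {a0:.2f} + {a1:.2f} x")        # fitted regression line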

Types of Linear Regression

Linear regression can be further divided into two types of the algorithm:

o Simple Linear Regression:

If a single independent variable is used to predict the value of a numerical dependent variable,
then such a Linear Regression algorithm is called Simple Linear Regression.



o Multiple Linear regression:
If more than one independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Multiple Linear Regression.

Assumptions of Linear Regression

Below are some important assumptions of linear regression. These are formal checks to perform while
building a linear regression model, which ensure that we get the best possible result from the given dataset.

1. Linear relationship between the features and target:

Linear regression assumes the linear relationship between the dependent and independent variables.
2. Little or no multicollinearity between the features:
Multicollinearity means high correlation between the independent variables. Due to
multicollinearity, it may be difficult to find the true relationship between the predictors and the target
variable; in other words, it is difficult to determine which predictor variable is affecting the target
variable and which is not. So, the model assumes either little or no multicollinearity between the
features or independent variables.
3. Homoscedasticity assumption:
Homoscedasticity is the situation in which the error term has the same variance for all values of the
independent variables. With homoscedasticity, there should be no clear pattern in the distribution of
the data in the scatter plot.
4. Normal distribution of error terms:
Linear regression assumes that the error terms follow the normal distribution. If the error terms are
not normally distributed, then the confidence intervals will become either too wide or too narrow,
which may cause difficulties in finding the coefficients.
This can be checked using a q-q plot: if the plot shows a straight line without any deviation, the
errors are normally distributed.
5. No autocorrelation:
The linear regression model assumes no autocorrelation in the error terms. If there is any
correlation in the error terms, it will drastically reduce the accuracy of the model.
Autocorrelation usually occurs when there is a dependency between residual errors.
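
The sketch below (an illustration only; the residuals array is a stand-in for the residuals of a fitted model, and it assumes SciPy is installed) shows quick checks for two of these assumptions, normality of the error terms and autocorrelation:

    import numpy as np
    from scipy import stats

    residuals = np.array([0.3, -0.5, 0.1, 0.4, -0.2, -0.1, 0.2, -0.3])

    # Normality of error terms: Shapiro-Wilk test (complements the q-q plot)
    print("Shapiro-Wilk p-value:", stats.shapiro(residuals).pvalue)

    # Autocorrelation: correlation of residuals with their lag-1 values
    print("lag-1 autocorrelation:", np.corrcoef(residuals[:-1], residuals[1:])[0, 1])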

Logistic Regression

o Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1, True or False,
etc., but instead of giving the exact values 0 and 1, it gives probabilistic values which lie
between 0 and 1.
o Logistic regression is much like linear regression except in how they are used.
Linear regression is used for solving regression problems, whereas logistic regression is used
for solving classification problems.



Logistic Function (Sigmoid Function):
o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The output of logistic regression must be between 0 and 1 and cannot go beyond this limit,
so it forms a curve like the "S" form. The S-form curve is called the sigmoid function or the
logistic function.
o In logistic regression, we use the concept of a threshold value, which defines the probability of
either 0 or 1: values above the threshold tend to 1, and values below the threshold tend to 0.
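
A minimal sketch of the sigmoid function and the threshold rule described above (the 0.5 threshold used here is just the usual default, not mandated by the notes):

    import numpy as np

    def sigmoid(z):
        # map any real value z into the range (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
    probs = sigmoid(z)                        # probabilistic values between 0 and 1
    labels = (probs >= 0.5).astype(int)       # values above the threshold tend to class 1
    print(probs, labels)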

Logistic Regression Equation:

The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:

o We know the equation of a straight line can be written as:
y = b0 + b1x1 + b2x2 + ... + bnxn
o In logistic regression, y can be between 0 and 1 only, so let's divide the above equation
by (1 - y):
y / (1 - y); which is 0 for y = 0 and infinity for y = 1
o But we need a range between -[infinity] and +[infinity], so taking the logarithm of the equation, it
becomes:
log[y / (1 - y)] = b0 + b1x1 + b2x2 + ... + bnxn



The above equation is the final equation for Logistic Regression.
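
As a hedged usage example (assuming scikit-learn is available; the toy data is invented), a fitted logistic regression model returns both the probability and the thresholded class:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.array([[1], [2], [3], [4], [5], [6]])   # e.g., hours studied
    y = np.array([0, 0, 0, 1, 1, 1])               # fail (0) / pass (1)

    clf = LogisticRegression().fit(X, y)
    print(clf.predict_proba([[3.5]]))              # probabilities between 0 and 1
    print(clf.predict([[3.5]]))                    # class after applying the threshold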

Types of Logistic Regression:

On the basis of the categories, logistic regression can be classified into three types:

o Binomial: In binomial logistic regression, there can be only two possible types of the dependent
variable, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial logistic regression, there can be 3 or more possible unordered
types of the dependent variable, such as "cat", "dog", or "sheep".
o Ordinal: In ordinal logistic regression, there can be 3 or more possible ordered types of the
dependent variable, such as "low", "medium", or "high".

Bayesian learning

Bayes' theorem is named after the English statistician, philosopher, and Presbyterian minister Thomas
Bayes (18th century). Bayes' ideas in decision theory are extensively used in probability and other
important areas of mathematics.

The concept of conditional probability is introduced in elementary statistics. We noted
that the conditional probability of an event is a probability obtained with the additional
information that some other event has already occurred. We used P(B|A) to denote the
conditional probability of event B occurring, given that event A has already occurred. The
following formula was provided for finding P(B|A):

P(B|A) = P(A and B) / P(A)

Definitions
A prior probability is an initial probability value originally obtained before any additional information is
obtained.
A posterior probability is a probability value that has been revised by using additional information that
is later obtained.
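
A small illustration (the numbers are invented) of how a prior probability is revised into a posterior probability using Bayes' theorem, with P(B) obtained from the law of total probability:

    prior_A = 0.01                  # prior probability of event A
    p_B_given_A = 0.90              # probability of observing B when A holds
    p_B_given_not_A = 0.05          # probability of observing B when A does not hold

    p_B = p_B_given_A * prior_A + p_B_given_not_A * (1 - prior_A)
    posterior_A = p_B_given_A * prior_A / p_B   # revised (posterior) probability of A given B
    print(posterior_A)                          # about 0.15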



PRACTICE QUESTIONS ON BAYES' FORMULA AND ON PROBABILITY

1. remarks
Remark 1. First, I'll make a remark about question 40 from section 12.4 in the book. Let
A = event that the first card is a spade and B = event that the second card is a spade. As part of this
question, you computed (presumably using the total law of probability) that

P(B) = P(A)P(B | A) + P(A^c)P(B | A^c) = (13/52)(12/51) + (39/52)(13/51) = 1/4.

Note that in this case, of course, you already knew that

P(B) = 13/52 = 1/4,

since there are 13 spades in 52 cards, therefore the unconditional probability of B is 13/52.

The law of total probability gives you a method for computing the unconditional
(or total) probability of an event B if you know its conditional probabilities with respect to
some other event A and the probability of A. In this case, we knew directly what P(B) is
(because we had enough information: we know how many cards there are and how many are
spades) and you can see how it agrees with what the total law of probability gives you.

However, in most of the other examples, such as the one with the test for a virus we
did in class, it’s not possible to compute the probability of B (in that case, that the test
is positive) directly because you don’t have enough information (we don’t know how many
tests come out positive and how many tests are being administered, i.e., we don’t know the
percentage of tests that come out positive). What we know are the conditional probabilities
of the test coming out positive with the conditions that the person taking it was infected or
not. And we know the probability of this condition happening, i.e., we know the probability
that someone is infected. So the information you have here consists of precisely the pieces
that you need in order to use the total law of probability to compute the probability that a
test comes out positive, and there’s no other way to know this probability.
Remark 2. For all the following questions, the easiest way to think about them is to draw
the tree diagram. Please do so when you try to do them, or when you read the solutions –
draw the diagram to try to follow what’s happening.

2. solutions
Exercise 1. A doctor is called to see a sick child. The doctor has prior information that
90% of sick children in that neighborhood have the flu, while the other 10% are sick with
measles. Let F stand for an event of a child being sick with flu and M stand for an event of
a child being sick with measles. Assume for simplicity that F ∪ M = Ω, i.e., that there are no
other maladies in that neighborhood.
A well-known symptom of measles is a rash (the event of having which we denote R).
Assume that the probability of having a rash if one has measles is P(R | M) = 0.95.
However, occasionally children with flu also develop a rash, and the probability of having a
rash if one has flu is P(R | F) = 0.08.
Upon examining the child, the doctor finds a rash. What is the probability that the child
has measles?
Solution.
We use Bayes' formula:

P(M | R) = P(R | M)P(M) / [P(R | M)P(M) + P(R | F)P(F)]
         = (0.95 × 0.10) / (0.95 × 0.10 + 0.08 × 0.90)
         ≈ 0.57,

which is nowhere close to P(R | M) = 95%.
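
A quick numerical check of this result (an illustrative sketch, not part of the exercise):

    p_M, p_F = 0.10, 0.90
    p_R_given_M, p_R_given_F = 0.95, 0.08

    p_M_given_R = (p_R_given_M * p_M) / (p_R_given_M * p_M + p_R_given_F * p_F)
    print(round(p_M_given_R, 2))    # 0.57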
Exercise 2. In a study, physicians were asked what the odds of breast cancer would be in
a woman who was initially thought to have a 1% risk of cancer but who ended up with a
positive mammogram result (a mammogram accurately classifies about 80% of cancerous
tumors and 90% of benign tumors.)
95 out of a hundred physicians estimated the probability of cancer to be about 75%. Do
you agree?
Solution.
Introduce the events:

+ = mammogram result is positive,
B = tumor is benign,
M = tumor is malignant.

Note that B^c = M. We are given P(M) = 0.01, so P(B) = 1 − P(M) = 0.99.
We are also given the conditional probabilities P(+ | M) = 0.80 and P(− | B) = 0.90, where
the event − is the complement of +, thus P(+ | B) = 0.10.
Bayes' formula in this case is

P(M | +) = P(+ | M)P(M) / [P(+ | M)P(M) + P(+ | B)P(B)]
         = (0.80 × 0.01) / (0.80 × 0.01 + 0.10 × 0.99)
         ≈ 0.075.

So the chance would be 7.5%, a far cry from the common estimate of 75%.
Exercise 3. Suppose we have 3 cards identical in form except that both sides of the first
card are colored red, both sides of the second card are colored black, and one side of the
third card is colored red and the other side is colored black.
The 3 cards are mixed up in a hat, and 1 card is randomly selected and put down on the
ground. If the upper side of the chosen card is colored red, what is the probability that the
other side is colored black?
Solution.
Let RR, BB, and RB denote, respectively, the events that the chosen card is the red-red,
the black-black, or the red-black card. Letting R be the event that the upturned side of the
chosen card is red, we have that the desired probability is obtained by

P(RB | R) = P(RB ∩ R) / P(R)
          = P(R | RB)P(RB) / [P(R | RR)P(RR) + P(R | RB)P(RB) + P(R | BB)P(BB)]
          = ((1/2)(1/3)) / ((1)(1/3) + (1/2)(1/3) + (0)(1/3))
          = 1/3.
This question was actually just like the Monty Hall problem!
Exercise 4. It is estimated that 50% of emails are spam emails. Some software has been
applied to filter these spam emails before they reach your inbox. A certain brand of software
claims that it can detect 99% of spam emails, and the probability for a false positive (a
non-spam email detected as spam) is 5%.
Now if an email is detected as spam, then what is the probability that it is in fact a
non-spam email?
Solution.
Define the events:

A = event that an email is detected as spam,
B = event that an email is spam,
B^c = event that an email is not spam.

We know P(B) = P(B^c) = 0.5, P(A | B) = 0.99, P(A | B^c) = 0.05.
Hence by Bayes' formula we have

P(B^c | A) = P(A | B^c)P(B^c) / [P(A | B)P(B) + P(A | B^c)P(B^c)]
           = (0.05 × 0.5) / (0.05 × 0.5 + 0.99 × 0.5)
           = 5/104.
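
Again, a short verification of the arithmetic (illustrative sketch only):

    p_spam = p_not_spam = 0.5
    p_detect_given_spam, p_detect_given_not_spam = 0.99, 0.05

    p_not_spam_given_detect = (p_detect_given_not_spam * p_not_spam) / (
        p_detect_given_spam * p_spam + p_detect_given_not_spam * p_not_spam
    )
    print(p_not_spam_given_detect)  # 5/104, roughly 0.048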

w.e.f: January 2020
Axis Institute of Technology & Management, Kanpur Form No. Acad-006A
Department of Computer Science & Engineering

Session: 2023-24 Semesters: V Section: A/B


Course Code: KCS055 Course Name: Machine Learning Techniques

Assignment 2

Question No. | Course Outcome No. / Blooms Level | Title of Questions

1 | CO2 | Define the term regression with its types; also differentiate between linear regression and
logistic regression.

2 | CO2 | What are the types of support vector machine? Describe the Gaussian kernel.

3 | CO2 | Explain Bayesian learning. Explain two-category classification.

4 | CO2 | Let blue, green, and red be three classes of objects with prior probabilities given by
P(blue) = 1/4, P(green) = 1/2, P(red) = 1/4. Let there be three types of objects: pencils,
pens, and paper. Let the class-conditional probabilities of these objects be given as
follows. Use the Bayes classifier to classify pencil, pen and paper.
P(pencil/green) = 1/3   P(pen/green) = 1/2   P(paper/green) = 1/6
P(pencil/blue) = 1/2    P(pen/blue) = 1/6    P(paper/blue) = 1/3
P(pencil/red) = 1/6     P(pen/red) = 1/3     P(paper/red) = 1/2

5 | CO2 | Explain the Naive Bayes classifier.



Machine Learning Techniques (KCS 055)

UNIT-II (B)
SUPPORT VECTOR MACHINE: Introduction, Types of support vector kernel – (Linear kernel,
Polynomial kernel, and Gaussian kernel), Hyperplane – (Decision surface), Properties of SVM, and
Issues in SVM.

Support Vector Machines

Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is
used for Classification as well as Regression problems. However, primarily, it is used for Classification
problems in Machine Learning. The goal of the SVM algorithm is to create the best line or decision
boundary that can segregate n-dimensional space into classes so that we can easily put the new data
point in the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are
called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the
below diagram in which there are two different categories that are classified using a decision boundary
or hyperplane:

Example: SVM can be understood with the example that we used in the KNN classifier. Suppose
we see a strange cat that also has some features of dogs; if we want a model that can accurately
identify whether it is a cat or a dog, such a model can be created by using the SVM algorithm. We will
first train our model with lots of images of cats and dogs so that it can learn the different features of


cats and dogs, and then we test it with this strange creature. Since the support vectors create a decision
boundary between these two classes (cat and dog) and choose the extreme cases (support vectors), it will
see the extreme cases of cat and dog. On the basis of the support vectors, it will classify it as a cat. Consider
the below diagram:

SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM

SVM can be of two types:

Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified
into two classes by using a single straight line, then such data is termed linearly separable data, and the
classifier used is called the Linear SVM classifier.
Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset
cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier
used is called the Non-linear SVM classifier.

Properties of SVM:
1. Flexibility in choosing a similarity function
2. Ability to handle large feature spaces
3. Overfitting can be controlled by the soft margin approach

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional
space, but we need to find out the best decision boundary that helps to classify the data points. This
best boundary is known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the number of features present in the dataset, which means
that if there are 2 features (as shown in the image), then the hyperplane will be a straight line, and if there
are 3 features, then the hyperplane will be a 2-dimensional plane.



We always create the hyperplane that has the maximum margin, which means the maximum distance
between the hyperplane and the nearest data points of either class.

Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the position of the
hyperplane are termed support vectors. Since these vectors support the hyperplane, they are called
support vectors.

Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset
that has two tags (green and blue), and the dataset has two features x1 and x2. We want a classifier that
can classify the pair (x1, x2) of coordinates as either green or blue. Consider the below image:

Since it is a 2-d space, by just using a straight line we can easily separate these two classes. But there
can be multiple lines that can separate these classes. Consider the below image:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region
is called a hyperplane. The SVM algorithm finds the closest points of the lines from both classes.
These



points are called support vectors. The distance between the support vectors and the hyperplane is called
the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is
called the optimal hyperplane.
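
A hedged sketch of a linear SVM on a toy two-feature dataset (assumes scikit-learn is installed; the points and tags are invented):

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])  # features x1, x2
    y = np.array([0, 0, 0, 1, 1, 1])                                # two tags (green/blue)

    clf = SVC(kernel="linear").fit(X, y)
    print(clf.support_vectors_)      # the points closest to the hyperplane
    print(clf.predict([[4, 4]]))     # classify a new (x1, x2) pair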

Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we
cannot draw a single straight line. Consider the below image:

So to separate these data points, we need to add one more dimension. For linear data, we have used
two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:

So now, SVM will divide the datasets into classes in the following way. Consider the below image:

Since we are in 3-d space, it looks like a plane parallel to the x-axis. If we convert it into 2-d


space with z=1, then it will become as:
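
The mapping can be illustrated with a short sketch (invented data): points on a small circle and a large circle cannot be separated by a straight line in 2-d, but after adding z = x² + y² a simple threshold on z separates them.

    import numpy as np

    angles = np.linspace(0, 2 * np.pi, 8, endpoint=False)
    inner = np.c_[np.cos(angles), np.sin(angles)]           # class 0, radius 1
    outer = np.c_[3 * np.cos(angles), 3 * np.sin(angles)]   # class 1, radius 3

    X = np.vstack([inner, outer])
    z = X[:, 0] ** 2 + X[:, 1] ** 2                          # the added third dimension
    print(z)   # about 1 for the inner circle, about 9 for the outer circle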

SVM Kernels
In practice, SVM algorithm is implemented with kernel that transforms an input data space into the
required form. SVM uses a technique called the kernel trick in which kernel takes a low dimensional
input space and transforms it into a higher dimensional space. In simple words, kernel converts non-
separable problems into separable problems by adding more dimensions to it. It makes SVM more
powerful, flexible and accurate. The following are some of the types of kernels used by SVM.

Linear Kernel
It can be used as a dot product between any two observations. The formula of the linear kernel is as below:
K(x, xi) = sum(x * xi)
From the above formula, we can see that the kernel between two vectors x and xi is the sum of the
products of each pair of input values, i.e., their dot product.
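
Written out directly (a minimal sketch), the linear kernel is just the dot product of the two vectors:

    import numpy as np

    def linear_kernel(x, xi):
        return np.sum(x * xi)       # sum of element-wise products, i.e. the dot product

    print(linear_kernel(np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])))   # 32.0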

The Gaussian kernel

The Gaussian (better Gaußian) kernel is named after Carl Friedrich Gauß (1777-1855), a brilliant
German mathematician. This chapter discusses many of the nice and peculiar properties of the
Gaussian kernel.
The Gaussian kernel is defined in 1-D, 2-D and N-D respectively as

G(x; σ) = 1/(√(2π) σ) · exp(−x² / (2σ²))
G(x, y; σ) = 1/(2πσ²) · exp(−(x² + y²) / (2σ²))
G(x; σ) = 1/((√(2π) σ)^N) · exp(−|x|² / (2σ²))   (x an N-dimensional vector)

The σ determines the width of the Gaussian kernel. In statistics, when we consider the Gaussian
probability density function, it is called the standard deviation, and its square, σ², the variance.
When we consider the Gaussian as an aperture function of some observation, we refer to σ as the
inner scale or, shortly, the scale. The scale can only take positive values, σ > 0. In the process of
observation σ can never become zero, for this would imply making an observation through an
infinitesimally small aperture, which is impossible.
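
For use in an SVM, the Gaussian (RBF) kernel between two observations is typically computed as below (a sketch; the values are examples and sigma is the width parameter discussed above):

    import numpy as np

    def gaussian_kernel(x, xi, sigma=1.0):
        # K(x, xi) = exp(-||x - xi||^2 / (2 * sigma^2))
        return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))

    print(gaussian_kernel(np.array([1.0, 2.0]), np.array([2.0, 3.0])))   # about 0.37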

Advantages of SVM:
1. The abundance of implementations
2. Guaranteed optimality
3. SVM can be used for linearly separable as well as non-linearly separable data.
4. SVMs can be extended to semi-supervised learning models, which make use of both labelled and unlabelled data.



5. Feature mapping, which would otherwise place a heavy load on the computational complexity of
training, is handled efficiently through the kernel trick.

Disadvantage of SVM:
1. The choice of the kernel is perhaps the biggest limitation of the support vector machine.
2. SVM does not return probabilistic confidence values the way logistic regression does.
3. SVM does not give the best performance for handling text structures as compared to other
algorithms that are used in handling text data.



Assignment Numerical

Question 1-

Sl. No. Color Legs Height Smelly Species

1 White 3 Short Yes M

2 Green 2 Tall No M

3 Green 3 Short Yes M

4 White 3 Short Yes M

5 Green 2 Short No H

6 White 2 Tall No H

7 White 2 Tall No H

8 White 2 Short Yes H

Using the above data, identify the species of an entity with the following attributes.

A ) X={Color=Green, Legs=2, Height=Tall, Smelly=No}

B) X={Color=white, Legs=3, Height=Tall, Smelly=No}

C) X={Color=white, Legs=3, Height=short, Smelly=Yes}

Question 2:
Using the above data, identify the status of playing an entity with the following
attributes.

A) X={outlook=rain, Temp=hot, humidity =high, wind =strong}

B) X={outlook=Sunny, Temp=mild, humidity =high, wind =weak}
C) X={outlook=overcast, Temp=cool, humidity =normal, wind =weak}
D) X={outlook=rain, Temp=mild, humidity =normal, wind =weak}

Question 3-

Question 4:

Let blue, green, and red be three classes of objects with prior probabilities given by
P(blue) = 1/4, P(green) = 1/2, P(red) = 1/4.
Let there be three types of objects: pencils, pens, and paper. Let the class-conditional
probabilities of these objects be given as follows.
Use the Bayes classifier to classify pencil, pen, and paper.

o P(pencil/green) = 1/3
o P(pen/green) = 1/2
o P(paper/green) = 1/6
o P(pencil/blue) = 1/2
o P(pen/blue) = 1/6
o P(paper/blue) = 1/3
o P(pencil/red) = 1/6
o P(pen/red) = 1/3
o P(paper/red) = 1/2
