Unit 2
Supervised Learning
By Sachin Rathore,
Assistant Professor
Supervised Learning: Introduction to Supervised Learning,
Classification,
Regression Analysis and its Types,
Model Selection Procedures,
Bayesian Decision Theory,
Naïve Bayes Classifier,
Bayes Optimal Classifier,
Evaluating an Estimator: Bias and Variance,
Support Vector Machines,
Types of Support Vector Kernels (Linear Kernel, Polynomial Kernel, Gaussian Kernel), Issues in SVM,
Case Study on House Price Prediction using Machine Learning.
• In supervised learning, you train your model on a labelled dataset, which means we have both the raw input data and its corresponding results (labels).
• We split our data into a training dataset and a test dataset, where the training dataset is used to train our network while the test dataset acts as new data for predicting results, i.e. for assessing the accuracy of our model.
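As a minimal sketch of this workflow, the split can be done with scikit-learn's train_test_split; the synthetic data and the 80/20 ratio here are illustrative assumptions, not part of the unit:

import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative labelled dataset: 100 samples, 2 features, binary labels
X = np.random.rand(100, 2)                 # raw input data
y = (X[:, 0] + X[:, 1] > 1).astype(int)    # labels derived from the inputs

# Hold out 20% of the data to act as "new" data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)   # (80, 2) (20, 2)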
Which classification algorithm to use depends on the data and the situation. Here are a few popular classification algorithms (a brief fitting sketch follows the list):
• Linear Classifiers
• Decision Trees
• K-Nearest Neighbor
• Random Forest
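As a quick, illustrative sketch, two of the classifiers listed above can be fit on a small synthetic dataset with scikit-learn; the data and parameter values are assumptions chosen only for demonstration:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                       # two illustrative features
y = (X[:, 0] ** 2 + X[:, 1] > 0.5).astype(int)      # nonlinear class boundary

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit a decision tree and a k-nearest-neighbor classifier on the same split
for clf in (DecisionTreeClassifier(max_depth=4), KNeighborsClassifier(n_neighbors=5)):
    clf.fit(X_tr, y_tr)
    print(type(clf).__name__, "accuracy:", clf.score(X_te, y_te))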
A linear classifier makes its prediction from a weighted sum of the inputs:

y = w[1]*x[1] + w[2]*x[2] + ... + w[n]*x[n] + b

Where x[i] is the feature(s) for the data, and w[i] and b are parameters that are learned during training.
For simple linear regression models with only one feature in the data, the formula looks like this:

y = w[1]*x[1] + b
Imagine we want to determine a student's test grade based on how many hours they studied the week of the test. Let's say the plotted data with a line of best fit looks like the figure below:
• A line of best fit can be drawn through the data points to show the model's predictions when given a new input.
• Say we wanted to know how well a student would do with five hours of studying.
• We can use the line of best fit to predict the test score based on other students' performances (see the sketch below).
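A minimal sketch of this example, assuming some made-up (hours, grade) pairs; the numbers are illustrative, not real data:

import numpy as np

# Hypothetical (hours studied, test grade) data
hours  = np.array([1, 2, 3, 4, 6, 7, 8])
grades = np.array([55, 60, 64, 71, 80, 84, 90])

# Least-squares line of best fit: grade ≈ slope * hours + intercept
slope, intercept = np.polyfit(hours, grades, deg=1)

# Predict the test score for five hours of studying
print(round(slope * 5 + intercept, 1))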
There are many different types of regression algorithms. The three most common are listed below:
• Linear Regression
• Logistic Regression
• Polynomial Regression
When there is a single input variable (x), the method is referred to as simple linear regression. When there are multiple
input variables, literature from statistics often refers to the method as multiple linear regression.
Let’s make this concrete with an example.
Imagine we are predicting weight (y) from height (x). Our linear regression model
representation for this problem would be:
y = B0 + B1 * x1

or

weight = B0 + B1 * height

Where B0 is the bias coefficient and B1 is the coefficient for the height column.
We use a learning technique to find a good set of coefficient values. Once found, we can plug in different height values to predict the weight.
For example, let's use B0 = 0.1 and B1 = 0.5, and calculate the weight (in kilograms) for a person 182 centimeters tall:

weight = 0.1 + 0.5 * 182 = 91.1
You can see that the above equation could be plotted as a line in two dimensions. B0 is our starting point regardless of what height we have. We can run through a bunch of heights from 100 to 250 centimeters, plug them into the equation, and get weight values, creating our line.
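A short sketch of this calculation, using the illustrative coefficients B0 = 0.1 and B1 = 0.5 from the example above:

import numpy as np

B0, B1 = 0.1, 0.5                    # illustrative coefficients from above

heights = np.arange(100, 251, 10)    # heights from 100 to 250 cm
weights = B0 + B1 * heights          # predicted weights, tracing out the line

print(B0 + B1 * 182)                 # 91.1, reproducing the worked example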
Mean Squared Error (MSE)
Mean squared error is perhaps the most popular metric used for regression problems. It essentially finds the average of the squared difference between the target value and the value predicted by the regression model:

MSE = (1/N) * Σ_{j=1}^{N} (y_j - y_hat_j)²

Where:
• y_j: ground-truth value
• y_hat: predicted value from the regression model
• N: number of datums
Root Mean Squared Error (RMSE)
Root mean squared error is the square root of the MSE, which brings the metric back to the units of the target variable:

RMSE = sqrt( (1/N) * Σ_{j=1}^{N} (y_j - y_hat_j)² )

Where:
• y_j: ground-truth value
• y_hat: predicted value from the regression model
• N: number of datums
Finally, we have our formula for the coefficient of determination, which can tell us how good or bad the fit of the regression line is:

R² = 1 - Σ (y_j - y_hat_j)² / Σ (y_j - y_bar)²

where y_bar is the mean of the ground-truth values.
[Figure: scatter plot of Series1 with fitted trendline y = 0.4437x + 39.476, R² = 0.2488]
RMSE = 13.33229163
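As a hedged sketch, these three metrics can be computed directly with NumPy; the sample arrays below are placeholders, not the data behind the plot above:

import numpy as np

# Placeholder ground-truth and predicted values
y     = np.array([3.0, 5.5, 7.2, 9.1, 12.0])
y_hat = np.array([2.8, 6.0, 7.0, 8.5, 12.5])

mse  = np.mean((y - y_hat) ** 2)     # Mean Squared Error
rmse = np.sqrt(mse)                  # Root Mean Squared Error
r2   = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)  # R²

print(mse, rmse, r2)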
• Recommendations: Every e-commerce site or media platform uses a recommendation system to recommend products and new releases to its customers or users on the basis of their activities. Netflix, Amazon, YouTube and Flipkart are earning huge profits with the help of their recommendation systems.
• Spam Filtration: Detecting spam emails is a very helpful tool; these filtration techniques can easily detect any sort of virus, malware or even harmful URLs. Recent studies found that about 56.87 percent of all emails on the internet were spam in March 2017, a major drop from April 2014's 71.1 percent spam share.
Logistic regression is another technique borrowed by machine learning from the field of statistics.
It is the go-to method for binary classification problems (problems with two class values). In this section you will discover the logistic regression algorithm for machine learning.
For example (see the sketch after these examples):
• To predict whether an email is spam (1) or not (0)
• Whether a tumor is malignant (1) or not (0)
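A minimal sketch of binary classification with logistic regression; the single "suspicious word count" feature and the labels are assumptions made up for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical feature (count of suspicious words) and spam labels
X = np.array([[0], [1], [1], [2], [3], [5], [6], [7]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])    # 1 = spam, 0 = not spam

model = LogisticRegression().fit(X, y)

# Predicted class and class probabilities for an email with 4 such words
print(model.predict([[4]]), model.predict_proba([[4]]))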
• If we add higher degrees such as quadratic, the line turns into a curve that better fits the data. Polynomial regression is generally used when the points in the data set are scattered and a linear model cannot describe the result clearly. We should always keep an eye on overfitting and underfitting while adding these degrees to the equation.
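A hedged sketch of fitting a quadratic with NumPy; the curved data below is synthetic:

import numpy as np

# Synthetic data following a noisy quadratic trend
x = np.linspace(0, 10, 30)
y = 2 + 1.5 * x - 0.2 * x**2 + np.random.default_rng(1).normal(0, 0.5, 30)

# Degree-2 polynomial fit; raising deg much further risks overfitting
coeffs = np.polyfit(x, y, deg=2)
print(coeffs)                     # approx [-0.2, 1.5, 2], up to noise
print(np.polyval(coeffs, 5.0))    # prediction at x = 5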
Model selection is the process of selecting one final machine learning model from
among a collection of candidate machine learning models for a training dataset.
Model selection is a process that can be applied both across different types of models
(e.g. logistic regression, SVM, KNN, etc.) and across models of the same type
configured with different model hyperparameters (e.g. different kernels in an SVM).
Consider a case where we have a dataset and want to develop a predictive model. We do not know beforehand which model will perform best on this problem, as it is unknowable.
Model selection is the process of choosing one of the models as the final model that addresses the problem. In practice, rather than the single "best" model, we look for a "good enough" model, for example (see the cross-validation sketch below):
• A model that is sufficiently skillful given the time and resources available.
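As an illustrative sketch, candidate models can be compared with k-fold cross-validation in scikit-learn; the synthetic dataset and the 5-fold choice are assumptions:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "svm-rbf":  SVC(kernel="rbf"),
    "knn-5":    KNeighborsClassifier(n_neighbors=5),
}

# 5-fold cross-validated accuracy for each candidate model
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(name, round(scores.mean(), 3))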
Introduction
It is basically a classification technique that involves the use of Bayes' Theorem, which is used to find conditional probabilities.
Bayes Theorem
The conditional probability of A given B, represented by P(A | B), is the chance of occurrence of A given that B has occurred:

P(A | B) = P(A, B) / P(B), or
P(A, B) = P(A | B) P(B) = P(B | A) P(A)

Rearranging gives

P(A | B) = P(B | A) P(A) / P(B)   ... (1)

Here, equation (1) is known as the Bayes Theorem of probability. Our aim is to explore each of the components included in this theorem.
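Before that, here is a tiny numeric sketch of equation (1); the prior, likelihood, and evidence values are made up for illustration:

# Hypothetical values: P(A) prior, P(B|A) likelihood, P(B) evidence
p_a = 0.3
p_b_given_a = 0.8
p_b = 0.5

# Bayes' Theorem, equation (1): posterior P(A|B)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)   # 0.48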
(a) Prior or State of Nature:
It is the prior probability of class wi, i.e. how likely the class is before any feature is observed. It is denoted by P(wi).
(b) Class-Conditional Probability:
It represents the probability of how likely a feature x occurs given that it belongs to the particular class wi. It is denoted by P(x | wi), where x is a particular feature.
Sometimes, it is also known as the Likelihood.
It is the quantity that we have to evaluate while training the data. During the training process, we have input (features) X labeled with the corresponding class w, and we figure out the likelihood of occurrence of that set of features given the class label.
(c) Posterior Probability:
• It is the probability of occurrence of class wi when certain features are given.
• It is what we aim at computing in the test phase, in which we have testing input or features (the given entity) and have to find how likely it is that the trained model assigns the features to the particular class wi.
• The feature matrix contains all the vectors (rows) of the dataset, in which each vector consists of the values of the features. In this dataset (a fragment is shown below), the features are 'Outlook', 'Temperature', 'Humidity' and 'Windy'.

   Outlook    Temperature  Humidity  Windy  Play
3  Sunny      Mild         High      False  Yes
4  Sunny      Cool         Normal    False  Yes
5  Sunny      Cool         Normal    True   No
6  Overcast   Cool         Normal    True   Yes
• Firstly, we assume that no pair of features is dependent; the features are assumed to be independent of each other.
• Secondly, each feature is given the same weight (or importance). For example, knowing only temperature and humidity alone can't predict the outcome accurately. None of the attributes is irrelevant, and each is assumed to be contributing equally to the outcome.
Note: The assumptions made by Naive Bayes are not generally correct in real-world situations. In fact, the independence assumption is never correct, but it often works well in practice.
Bayes' Theorem finds the probability of an event occurring given the probability of another event that has already occurred. Bayes' theorem is stated mathematically as the following equation:

P(A | B) = P(B | A) P(A) / P(B)

• Basically, we are trying to find the probability of event A, given that event B is true. Event B is also termed the evidence.
• P(A) is the priori of A (the prior probability, i.e. the probability of the event before evidence is seen). The evidence is an attribute value of an unknown instance (here, event B). P(A | B) is the posterior probability, i.e. the probability of the event after the evidence is seen.
Just to be clear, an example of a feature vector and corresponding class variable can be: (refer to the 1st row of the dataset)
The Naïve Bayes Classifier belongs to the family of probabilistic classifiers based on Bayes' theorem. It is called 'Naïve' because it requires a rigid independence assumption between input variables; it would therefore be more proper to call it Simple Bayes or Independence Bayes. This algorithm has been studied extensively since the 1960s.
Simple though it is, the Naïve Bayes Classifier remains one of the popular methods for solving the text categorization problem, the problem of judging documents as belonging to one category or another, such as email spam detection.
Maximum Likelihood Estimation (MLE) is used to estimate the parameters: the prior probability and the conditional probability.
The prior probability equals the number of cases in which a certain value of y occurs, divided by the total number of records.
The conditional probability p(x1 = a1 | y = C1) equals the number of cases when x1 equals a1 and y equals C1, divided by the number of cases when y equals C1.
The Naïve Bayes Classifier uses the following formula to make a prediction:

y = argmax over y of P(y) * Π_i P(xi | y)

Use the formulas above to estimate the prior and conditional probabilities, and we can get the prediction.
Please note that P(y) is also called the class probability and P(xi | y) is called the conditional probability.
The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of P(xi | y).
Let us try to apply the above formula manually on our weather dataset. For this, we need to do some precomputations on our dataset.
We need to find P(xi | yj) for each xi in X and yj in y. All these calculations have been demonstrated in the tables below.
These numbers can be converted into probabilities by making their sum equal to 1 (normalization).
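As a hedged sketch of these counting-and-normalizing steps, here is a tiny "by hand" Naive Bayes over a few hypothetical weather records; the rows are illustrative, not the full dataset above:

# Hypothetical records: (Outlook, Windy) features with a Play label
data = [("Sunny", "False", "Yes"), ("Sunny", "False", "Yes"),
        ("Sunny", "True",  "No"),  ("Overcast", "True", "Yes"),
        ("Rainy", "True",  "No"),  ("Rainy", "False", "Yes")]

labels = [row[-1] for row in data]
prior = {y: labels.count(y) / len(labels) for y in set(labels)}   # P(y)

def cond(i, value, y):
    # P(xi = value | y): count of matching rows / count of rows with label y
    rows_y = [row for row in data if row[-1] == y]
    return sum(row[i] == value for row in rows_y) / len(rows_y)

# Score each class for a new day (Sunny, Windy = False), then normalize
x = ("Sunny", "False")
scores = {y: prior[y] * cond(0, x[0], y) * cond(1, x[1], y) for y in prior}
total = sum(scores.values())
print({y: s / total for y, s in scores.items()})   # normalized posteriors

Note that a zero count drives a class score to zero, which is why practical implementations add smoothing.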
The Bayes Optimal Classifier is a probabilistic model that makes the most probable prediction for a new example.
It is described using the Bayes Theorem that provides a principled way for calculating a conditional probability. It is
also closely related to the Maximum a Posteriori: a probabilistic framework referred to as MAP that finds the most
probable hypothesis for a training dataset.
In practice, the Bayes Optimal Classifier is computationally expensive, if not intractable to calculate, and instead,
simplifications such as the Gibbs algorithm and Naive Bayes can be used to approximate the outcome.
In this section, you will discover the Bayes Optimal Classifier for making the most accurate predictions for new instances of data.
After reading this post, you will know:
•Bayes Theorem provides a principled way for calculating conditional probabilities, called a posterior probability.
•Maximum a Posteriori is a probabilistic framework that finds the most probable hypothesis that describes the
training dataset.
•Bayes Optimal Classifier is a probabilistic model that finds the most probable prediction using the training data and
space of hypotheses to make a prediction for a new data instance.
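A small numeric sketch of this idea, with a made-up hypothesis space of three hypotheses and their posteriors; all numbers are assumptions:

# Hypothetical posteriors P(h | D) for three hypotheses
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}

# Each hypothesis' prediction for a new instance: P(positive | h)
p_pos_given_h = {"h1": 1.0, "h2": 0.0, "h3": 0.0}

# Bayes optimal prediction: argmax over v of sum_h P(v | h) P(h | D)
p_pos = sum(p_pos_given_h[h] * posterior[h] for h in posterior)   # 0.4
p_neg = 1 - p_pos                                                 # 0.6
print("positive" if p_pos > p_neg else "negative")
# "negative", even though the single most probable hypothesis h1 says positive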
Plus, how to compare estimators based on their bias, variance and mean squared error
A statistical estimator can be evaluated on the basis of how biased it is in its prediction, how consistent its performance is, and how efficiently it can make predictions. And the quality of your model's predictions is only as good as the quality of the estimator it uses.
In this section, we’ll cover the property of bias in detail and learn how to measure it.
The bias of an estimator happens to be joined at the hip with the variance of the estimator's predictions via a concept called the bias-variance tradeoff, and so we'll learn about that concept too.
We’ll close with a discussion on the Mean Squared Error of the estimator, its applicability to regression modeling, and we’ll
show how to evaluate various estimators of the population mean using the properties of bias, variance and their Mean
Squared Error.
Let's state an informal definition of what an estimator is: A statistical estimator is a statistical device used to estimate the true, but unknown, value of some parameter of the population, such as the mean or the median. It does this by using the information contained in the data points that make up a sample of values.
In our daily lives, we tend to employ various types of estimators without even realizing it. Following are some types of estimators
that we commonly use:
y_min = 0.28704015899999996, y_max = 15.02521203
Estimate #1 of the population mean = 7.3690859355
Estimator #2: We could choose a random value from the sample and designate it as the population mean:
Estimator #3: We could use the following estimator, which averages out all temperature values in the data sample:
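A hedged sketch of estimators #2 and #3 on a synthetic temperature sample; the data and seed are assumptions, so the printed values will not match the figures above:

import numpy as np

rng = np.random.default_rng(42)
sample = rng.uniform(0.0, 15.0, size=50)   # synthetic temperature sample

# Estimator #2: a randomly chosen sample value designated as the mean
estimate_2 = rng.choice(sample)

# Estimator #3: the sample mean, averaging all values in the sample
estimate_3 = sample.mean()

print(estimate_2, estimate_3)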
One aspect that might be apparent to you from the two figures above is that in the first figure, although the bias is large, the 'dispersion' of the missed shots is small, leading to a lower variance in outcomes. In the second figure, the bias has undoubtedly been reduced because of a more uniform spreading out of the missed shots, but that has also led to a higher spread, a.k.a. higher variance.
The first technique appears to have a larger bias and a smaller variance, and it is vice versa for the second technique. This is no coincidence, and it can be easily proved (in fact, we will prove it later!) that there is a direct give and take between the bias and variance of your estimation technique.
We are now in a position to define the bias of the estimator y_bar for the population mean µ as follows:
The bias of the estimator y_bar for the population mean µ is the difference between the expected value of the sample mean y_bar and the population mean µ. Here is the formula:

Bias(y_bar) = E(y_bar) − µ
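A short simulation sketch of this definition, comparing the sample mean with a deliberately biased estimator; the population parameters and sample sizes are assumptions:

import numpy as np

rng = np.random.default_rng(0)
mu = 10.0                                 # true population mean

# Draw many samples and compute each estimator on every sample
means, halves = [], []
for _ in range(10_000):
    sample = rng.normal(mu, 2.0, size=30)
    means.append(sample.mean())           # estimator: sample mean
    halves.append(sample.max() / 2)       # a deliberately biased estimator

# Bias = E(estimator) - mu, approximated by the simulation average
print(np.mean(means) - mu)    # close to 0: the sample mean is unbiased
print(np.mean(halves) - mu)   # far from 0: this estimator is biased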
Support vector machines (SVMs) are a set of supervised learning methods used
for classification, regression and outliers detection.
To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find the plane that has the maximum margin, i.e. the maximum distance between the data points of both classes. Maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence.
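An illustrative sketch of fitting maximum-margin classifiers with the kernels named in this unit (linear, polynomial, and Gaussian/RBF); the synthetic circular data is an assumption:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)   # circular boundary

# Compare the kernel types listed in the syllabus; "rbf" is the Gaussian kernel
for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, C=1.0).fit(X, y)
    print(kernel, "training accuracy:", round(clf.score(X, y), 3))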
Hyperplanes and Support Vectors