Unit 2
Supervised Learning
By Sachin Rathore,
Assistant Professor
Supervised Learning: Introduction to Supervised Learning,
Classification,
Regression Analysis and its Types,
Model Selection Procedures,
Bayesian Decision Theory,
Naïve Bayes Classifier,
Bayes Optimal Classifier,
Evaluating an Estimator: Bias and Variance,
Support Vector Machines,
Types of Support Vector Kernels (Linear Kernel, Polynomial Kernel, Gaussian Kernel), Issues in SVM,
Case Study on House Price Prediction using Machine Learning.
• In supervised learning, you train your model on a labelled dataset, which means we have both the raw input data and its corresponding results (labels).
• We split our data into a training dataset and a test dataset, where the training dataset is used to train our network while the test dataset acts as new data for predicting results, i.e. for assessing the accuracy of our model.
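As a minimal sketch of this workflow, the split can be done with scikit-learn's train_test_split; the synthetic data and the 80/20 ratio here are illustrative assumptions, not part of the unit:

import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative labelled dataset: 100 samples, 2 features, binary labels
X = np.random.rand(100, 2)                 # raw input data
y = (X[:, 0] + X[:, 1] > 1).astype(int)    # labels derived from the inputs

# Hold out 20% of the data to act as "new" data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)   # (80, 2) (20, 2)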
Which classification algorithm to use depends on the data and the situation. Here are a few popular classification algorithms (a brief fitting sketch follows the list):
• Linear Classifiers
• Decision Trees
• K-Nearest Neighbor
• Random Forest
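As a quick, illustrative sketch, two of the classifiers listed above can be fit on a small synthetic dataset with scikit-learn; the data and parameter values are assumptions chosen only for demonstration:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                       # two illustrative features
y = (X[:, 0] ** 2 + X[:, 1] > 0.5).astype(int)      # nonlinear class boundary

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit a decision tree and a k-nearest-neighbor classifier on the same split
for clf in (DecisionTreeClassifier(max_depth=4), KNeighborsClassifier(n_neighbors=5)):
    clf.fit(X_tr, y_tr)
    print(type(clf).__name__, "accuracy:", clf.score(X_te, y_te))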
A linear classifier makes its prediction from a weighted sum of the inputs:

y = w[1]*x[1] + w[2]*x[2] + ... + w[n]*x[n] + b

Where x[i] is the feature(s) for the data, and w[i] and b are parameters that are learned during training.
For simple linear regression models with only one feature in the data, the formula looks like this:

y = w[1]*x[1] + b
Imagine we want to determine a student's test grade based on how many hours they studied the week of the test. Let's say the plotted data with a line of best fit looks like the figure below:
• A line of best fit can be drawn through the data points to show the model's predictions when given a new input.
• Say we wanted to know how well a student would do with five hours of studying.
• We can use the line of best fit to predict the test score based on other students' performances (see the sketch below).
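A minimal sketch of this example, assuming some made-up (hours, grade) pairs; the numbers are illustrative, not real data:

import numpy as np

# Hypothetical (hours studied, test grade) data
hours  = np.array([1, 2, 3, 4, 6, 7, 8])
grades = np.array([55, 60, 64, 71, 80, 84, 90])

# Least-squares line of best fit: grade ≈ slope * hours + intercept
slope, intercept = np.polyfit(hours, grades, deg=1)

# Predict the test score for five hours of studying
print(round(slope * 5 + intercept, 1))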
There are many different types of regression algorithms. The three most common are listed below:
• Linear Regression
• Logistic Regression
• Polynomial Regression
When there is a single input variable (x), the method is referred to as simple linear regression. When there are multiple
input variables, literature from statistics often refers to the method as multiple linear regression.
Let’s make this concrete with an example.
Imagine we are predicting weight (y) from height (x). Our linear regression model
representation for this problem would be:
y = B0 + B1 * x1

or

weight = B0 + B1 * height

Where B0 is the bias coefficient and B1 is the coefficient for the height column.
We use a learning technique to find a good set of coefficient values. Once found, we can plug in different height values to predict the weight.
For example, let's use B0 = 0.1 and B1 = 0.5, and calculate the weight (in kilograms) for a person 182 centimeters tall:

weight = 0.1 + 0.5 * 182 = 91.1
You can see that the above equation could be plotted as a line in two dimensions. B0 is our starting point regardless of what height we have. We can run through a bunch of heights from 100 to 250 centimeters, plug them into the equation, and get weight values, creating our line.
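A short sketch of this calculation, using the illustrative coefficients B0 = 0.1 and B1 = 0.5 from the example above:

import numpy as np

B0, B1 = 0.1, 0.5                    # illustrative coefficients from above

heights = np.arange(100, 251, 10)    # heights from 100 to 250 cm
weights = B0 + B1 * heights          # predicted weights, tracing out the line

print(B0 + B1 * 182)                 # 91.1, reproducing the worked example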
Mean Squared Error (MSE)
Mean squared error is perhaps the most popular metric used for regression problems. It essentially finds the average of the squared difference between the target value and the value predicted by the regression model:

MSE = (1/N) * Σ_{j=1}^{N} (y_j - y_hat_j)²

Where:
• y_j: ground-truth value
• y_hat: predicted value from the regression model
• N: number of datums
Root Mean Squared Error (RMSE)
Root mean squared error is the square root of the MSE, which brings the metric back to the units of the target variable:

RMSE = sqrt( (1/N) * Σ_{j=1}^{N} (y_j - y_hat_j)² )

Where:
• y_j: ground-truth value
• y_hat: predicted value from the regression model
• N: number of datums
Finally, we have our formula for the coefficient of determination, which can tell us how good or bad the fit of the regression line is:

R² = 1 - Σ (y_j - y_hat_j)² / Σ (y_j - y_bar)²

where y_bar is the mean of the ground-truth values.
[Figure: scatter plot of Series1 with fitted trendline y = 0.4437x + 39.476, R² = 0.2488]
RMSE = 13.33229163
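As a hedged sketch, these three metrics can be computed directly with NumPy; the sample arrays below are placeholders, not the data behind the plot above:

import numpy as np

# Placeholder ground-truth and predicted values
y     = np.array([3.0, 5.5, 7.2, 9.1, 12.0])
y_hat = np.array([2.8, 6.0, 7.0, 8.5, 12.5])

mse  = np.mean((y - y_hat) ** 2)     # Mean Squared Error
rmse = np.sqrt(mse)                  # Root Mean Squared Error
r2   = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)  # R²

print(mse, rmse, r2)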
• Recommendations: Every e-commerce site or media platform uses a recommendation system to recommend products and new releases to its customers or users on the basis of their activities. Netflix, Amazon, YouTube and Flipkart are earning huge profits with the help of their recommendation systems.
• Spam Filtration: Detecting spam emails is a very helpful tool; these filtration techniques can easily detect any sort of virus, malware or even harmful URLs. Recent studies found that about 56.87 percent of all emails on the internet were spam in March 2017, a major drop from April 2014's 71.1 percent spam share.
Logistic regression is another technique borrowed by machine learning from the field of statistics.
It is the go-to method for binary classification problems (problems with two class values). In this section you will discover the logistic regression algorithm for machine learning.
For example (see the sketch after these examples):
• To predict whether an email is spam (1) or not (0)
• Whether a tumor is malignant (1) or not (0)
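A minimal sketch of binary classification with logistic regression; the single "suspicious word count" feature and the labels are assumptions made up for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical feature (count of suspicious words) and spam labels
X = np.array([[0], [1], [1], [2], [3], [5], [6], [7]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])    # 1 = spam, 0 = not spam

model = LogisticRegression().fit(X, y)

# Predicted class and class probabilities for an email with 4 such words
print(model.predict([[4]]), model.predict_proba([[4]]))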
• If we add higher degrees such as quadratic, the line turns into a curve that better fits the data. Polynomial regression is generally used when the points in the data set are scattered and a linear model cannot describe the result clearly. We should always keep an eye on overfitting and underfitting while adding these degrees to the equation.
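A hedged sketch of fitting a quadratic with NumPy; the curved data below is synthetic:

import numpy as np

# Synthetic data following a noisy quadratic trend
x = np.linspace(0, 10, 30)
y = 2 + 1.5 * x - 0.2 * x**2 + np.random.default_rng(1).normal(0, 0.5, 30)

# Degree-2 polynomial fit; raising deg much further risks overfitting
coeffs = np.polyfit(x, y, deg=2)
print(coeffs)                     # approx [-0.2, 1.5, 2], up to noise
print(np.polyval(coeffs, 5.0))    # prediction at x = 5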
Model selection is the process of selecting one final machine learning model from
among a collection of candidate machine learning models for a training dataset.
Model selection is a process that can be applied both across different types of models
(e.g. logistic regression, SVM, KNN, etc.) and across models of the same type
configured with different model hyperparameters (e.g. different kernels in an SVM).
Consider a case where we have a dataset and want to develop a predictive model. We do not know beforehand which model will perform best on this problem, as it is unknowable.
Model selection is the process of choosing one of the models as the final model that addresses the problem. In practice, rather than the single "best" model, we look for a "good enough" model, for example (see the cross-validation sketch below):
• A model that is sufficiently skillful given the time and resources available.
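As an illustrative sketch, candidate models can be compared with k-fold cross-validation in scikit-learn; the synthetic dataset and the 5-fold choice are assumptions:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "svm-rbf":  SVC(kernel="rbf"),
    "knn-5":    KNeighborsClassifier(n_neighbors=5),
}

# 5-fold cross-validated accuracy for each candidate model
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(name, round(scores.mean(), 3))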
Introduction
It is basically a classification technique that involves the use of Bayes' Theorem, which is used to find conditional probabilities.
Bayes Theorem
The conditional probability of A given B, represented by P(A | B), is the chance of occurrence of A given that B has occurred:

P(A | B) = P(A, B) / P(B), or
P(A, B) = P(A | B) P(B) = P(B | A) P(A)

Rearranging gives

P(A | B) = P(B | A) P(A) / P(B)   ... (1)

Here, equation (1) is known as the Bayes Theorem of probability. Our aim is to explore each of the components included in this theorem.
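Before that, here is a tiny numeric sketch of equation (1); the prior, likelihood, and evidence values are made up for illustration:

# Hypothetical values: P(A) prior, P(B|A) likelihood, P(B) evidence
p_a = 0.3
p_b_given_a = 0.8
p_b = 0.5

# Bayes' Theorem, equation (1): posterior P(A|B)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)   # 0.48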
(a) Prior or State of Nature:
It is the prior probability of class wi, i.e. how likely the class is before any feature is observed. It is denoted by P(wi).
(b) Class-Conditional Probability:
It represents the probability of how likely a feature x occurs given that it belongs to the particular class wi. It is denoted by P(x | wi), where x is a particular feature.
Sometimes, it is also known as the Likelihood.
It is the quantity that we have to evaluate while training the data. During the training process, we have input (features) X labeled with the corresponding class w, and we figure out the likelihood of occurrence of that set of features given the class label.
(c) Posterior Probability:
• It is the probability of occurrence of class wi when certain features are given.
• It is what we aim at computing in the test phase, in which we have testing input or features (the given entity) and have to find how likely it is that the trained model assigns the features to the particular class wi.
• The feature matrix contains all the vectors (rows) of the dataset, in which each vector consists of the values of the features. In this dataset (a fragment is shown below), the features are 'Outlook', 'Temperature', 'Humidity' and 'Windy'.

   Outlook    Temperature  Humidity  Windy  Play
3  Sunny      Mild         High      False  Yes
4  Sunny      Cool         Normal    False  Yes
5  Sunny      Cool         Normal    True   No
6  Overcast   Cool         Normal    True   Yes
• Firstly, we assume that no pair of features is dependent; the features are assumed to be independent of each other.
• Secondly, each feature is given the same weight (or importance). For example, knowing only temperature and humidity alone can't predict the outcome accurately. None of the attributes is irrelevant, and each is assumed to be contributing equally to the outcome.
Note: The assumptions made by Naive Bayes are not generally correct in real-world situations. In fact, the independence assumption is never correct, but it often works well in practice.
Bayes' Theorem finds the probability of an event occurring given the probability of another event that has already occurred. Bayes' theorem is stated mathematically as the following equation:

P(A | B) = P(B | A) P(A) / P(B)

• Basically, we are trying to find the probability of event A, given that event B is true. Event B is also termed the evidence.
• P(A) is the priori of A (the prior probability, i.e. the probability of the event before evidence is seen). The evidence is an attribute value of an unknown instance (here, event B). P(A | B) is the posterior probability, i.e. the probability of the event after the evidence is seen.
Just to be clear, an example of a feature vector and corresponding class variable can be: (refer to the 1st row of the dataset)
The Naïve Bayes Classifier belongs to the family of probabilistic classifiers based on Bayes' theorem. It is called 'Naïve' because it requires a rigid independence assumption between input variables; it would therefore be more proper to call it Simple Bayes or Independence Bayes. This algorithm has been studied extensively since the 1960s.
Simple though it is, the Naïve Bayes Classifier remains one of the popular methods for solving the text categorization problem, the problem of judging documents as belonging to one category or another, such as email spam detection.
Maximum Likelihood Estimation (MLE) is used to estimate the parameters: the prior probability and the conditional probability.
The prior probability equals the number of cases in which a certain value of y occurs, divided by the total number of records.
The conditional probability p(x1 = a1 | y = C1) equals the number of cases when x1 equals a1 and y equals C1, divided by the number of cases when y equals C1.
The Naïve Bayes Classifier uses the following formula to make a prediction:

y = argmax over y of P(y) * Π_i P(xi | y)

Use the formulas above to estimate the prior and conditional probabilities, and we can get the prediction.
Please note that P(y) is also called the class probability and P(xi | y) is called the conditional probability.
The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of P(xi | y).
Let us try to apply the above formula manually on our weather dataset. For this, we need to do some precomputations on our dataset.
We need to find P(xi | yj) for each xi in X and yj in y. All these calculations have been demonstrated in the tables below.
These numbers can be converted into probabilities by making their sum equal to 1 (normalization).
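As a hedged sketch of these counting-and-normalizing steps, here is a tiny "by hand" Naive Bayes over a few hypothetical weather records; the rows are illustrative, not the full dataset above:

# Hypothetical records: (Outlook, Windy) features with a Play label
data = [("Sunny", "False", "Yes"), ("Sunny", "False", "Yes"),
        ("Sunny", "True",  "No"),  ("Overcast", "True", "Yes"),
        ("Rainy", "True",  "No"),  ("Rainy", "False", "Yes")]

labels = [row[-1] for row in data]
prior = {y: labels.count(y) / len(labels) for y in set(labels)}   # P(y)

def cond(i, value, y):
    # P(xi = value | y): count of matching rows / count of rows with label y
    rows_y = [row for row in data if row[-1] == y]
    return sum(row[i] == value for row in rows_y) / len(rows_y)

# Score each class for a new day (Sunny, Windy = False), then normalize
x = ("Sunny", "False")
scores = {y: prior[y] * cond(0, x[0], y) * cond(1, x[1], y) for y in prior}
total = sum(scores.values())
print({y: s / total for y, s in scores.items()})   # normalized posteriors

Note that a zero count drives a class score to zero, which is why practical implementations add smoothing.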
The Bayes Optimal Classifier is a probabilistic model that makes the most probable prediction for a new example.
It is described using the Bayes Theorem that provides a principled way for calculating a conditional probability. It is
also closely related to the Maximum a Posteriori: a probabilistic framework referred to as MAP that finds the most
probable hypothesis for a training dataset.
In practice, the Bayes Optimal Classifier is computationally expensive, if not intractable to calculate, and instead,
simplifications such as the Gibbs algorithm and Naive Bayes can be used to approximate the outcome.
In this section, you will discover the Bayes Optimal Classifier for making the most accurate predictions for new instances of data.
After reading this post, you will know:
•Bayes Theorem provides a principled way for calculating conditional probabilities, called a posterior probability.
•Maximum a Posteriori is a probabilistic framework that finds the most probable hypothesis that describes the
training dataset.
•Bayes Optimal Classifier is a probabilistic model that finds the most probable prediction using the training data and
space of hypotheses to make a prediction for a new data instance.
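A small numeric sketch of this idea, with a made-up hypothesis space of three hypotheses and their posteriors; all numbers are assumptions:

# Hypothetical posteriors P(h | D) for three hypotheses
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}

# Each hypothesis' prediction for a new instance: P(positive | h)
p_pos_given_h = {"h1": 1.0, "h2": 0.0, "h3": 0.0}

# Bayes optimal prediction: argmax over v of sum_h P(v | h) P(h | D)
p_pos = sum(p_pos_given_h[h] * posterior[h] for h in posterior)   # 0.4
p_neg = 1 - p_pos                                                 # 0.6
print("positive" if p_pos > p_neg else "negative")
# "negative", even though the single most probable hypothesis h1 says positive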
Plus, how to compare estimators based on their bias, variance and mean squared error
A statistical estimator can be evaluated on the basis of how biased it is in its prediction, how consistent its performance is, and how efficiently it can make predictions. And the quality of your model's predictions is only as good as the quality of the estimator it uses.
In this section, we’ll cover the property of bias in detail and learn how to measure it.
The bias of an estimator happens to be joined at the hip with the variance of the estimator's predictions via a concept called the bias-variance tradeoff, and so we'll learn about that concept too.
We’ll close with a discussion on the Mean Squared Error of the estimator, its applicability to regression modeling, and we’ll
show how to evaluate various estimators of the population mean using the properties of bias, variance and their Mean
Squared Error.
Let's state an informal definition of what an estimator is: A statistical estimator is a statistical device used to estimate the true, but unknown, value of some parameter of the population, such as the mean or the median. It does this by using the information contained in the data points that make up a sample of values.
In our daily lives, we tend to employ various types of estimators without even realizing it. Following are some types of estimators
that we commonly use:
y_min = 0.28704015899999996, y_max = 15.02521203
Estimate #1 of the population mean = 7.3690859355
Estimator #2: We could choose a random value from the sample and designate it as the population mean:
Estimator #3: We could use the following estimator, which averages out all temperature values in the data sample:
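A hedged sketch of estimators #2 and #3 on a synthetic temperature sample; the data and seed are assumptions, so the printed values will not match the figures above:

import numpy as np

rng = np.random.default_rng(42)
sample = rng.uniform(0.0, 15.0, size=50)   # synthetic temperature sample

# Estimator #2: a randomly chosen sample value designated as the mean
estimate_2 = rng.choice(sample)

# Estimator #3: the sample mean, averaging all values in the sample
estimate_3 = sample.mean()

print(estimate_2, estimate_3)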
One aspect that might be apparent to you from the two figures above is that in the first figure, although the bias is large, the 'dispersion' of the missed shots is small, leading to a lower variance in outcomes. In the second figure, the bias has undoubtedly been reduced because of a more uniform spreading out of the missed shots, but that has also led to a higher spread, a.k.a. higher variance.
The first technique appears to have a larger bias and a smaller variance, and it is vice versa for the second technique. This is no coincidence, and it can be easily proved (in fact, we will prove it later!) that there is a direct give and take between the bias and variance of your estimation technique.
We are now in a position to define the bias of the estimator y_bar for the population mean µ as follows:
The bias of the estimator y_bar for the population mean µ is the difference between the expected value of the sample mean y_bar and the population mean µ. Here is the formula:

Bias(y_bar) = E(y_bar) − µ
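A short simulation sketch of this definition, comparing the sample mean with a deliberately biased estimator; the population parameters and sample sizes are assumptions:

import numpy as np

rng = np.random.default_rng(0)
mu = 10.0                                 # true population mean

# Draw many samples and compute each estimator on every sample
means, halves = [], []
for _ in range(10_000):
    sample = rng.normal(mu, 2.0, size=30)
    means.append(sample.mean())           # estimator: sample mean
    halves.append(sample.max() / 2)       # a deliberately biased estimator

# Bias = E(estimator) - mu, approximated by the simulation average
print(np.mean(means) - mu)    # close to 0: the sample mean is unbiased
print(np.mean(halves) - mu)   # far from 0: this estimator is biased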
Support vector machines (SVMs) are a set of supervised learning methods used
for classification, regression and outliers detection.
To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find the plane that has the maximum margin, i.e. the maximum distance between the data points of both classes. Maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence.
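An illustrative sketch of fitting maximum-margin classifiers with the kernels named in this unit (linear, polynomial, and Gaussian/RBF); the synthetic circular data is an assumption:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)   # circular boundary

# Compare the kernel types listed in the syllabus; "rbf" is the Gaussian kernel
for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, C=1.0).fit(X, y)
    print(kernel, "training accuracy:", round(clf.score(X, y), 3))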
Hyperplanes and Support Vectors