
Unit 2:

Supervised Learning

By Sachin Rathore,
Assistant Professor
Mechanical Engineering
Supervised Learning: Introduction to Supervised Learning,
Classification,
Regression Analysis and its Types,
Model Selection Procedures,
Bayesian Decision Theory,
Naïve Bayes Classifier,
Bayes Optimal Classifier,
Evaluating an Estimator: Bias and Variance,
Support Vector Machines,
Types of Support Vector Kernels (Linear Kernel, Polynomial Kernel, Gaussian Kernel), Issues in SVM,
Case Study on House Price Prediction using Machine Learning.



Introduction to Supervised Learning



In supervised learning,

• you train your model on a labelled dataset, meaning we have both the raw input data and its corresponding results.
• We split our data into a training dataset and a test dataset: the training dataset is used to train the model, while the test dataset acts as new data for predicting results and checking the accuracy of the model.
• The model learns a mapping Y = f(x), where Y is the predicted output and f is a function that assigns a class (or value) to an input x.
• This function connecting input features to a predicted output is created by the machine learning model during training (a minimal code sketch follows).
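As a minimal sketch of this workflow (assuming scikit-learn and its bundled iris dataset purely for illustration):

# A minimal sketch of the supervised learning workflow:
# split labelled data, train a model, then check accuracy on held-out data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)          # raw inputs X and their labels y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)  # the mapping Y = f(x) is learned here
model.fit(X_train, y_train)                # training on the labelled training set

y_pred = model.predict(X_test)             # the test set acts as "new" data
print("Test accuracy:", accuracy_score(y_test, y_pred))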



Supervised learning can be split into two subcategories:
• Classification and
• Regression.



Classification

During training, a classification algorithm will be given data points with an assigned category. The job of a classification algorithm is to then take an input value and assign it a class, or category, that it fits into based on the training data provided.



The most common example of classification is determining whether an email is spam or not. With two classes to choose from (spam or not spam), this problem is called a binary classification problem.



Classification problems can be solved with numerous algorithms. Which algorithm you choose depends on the data and the situation. Here are a few popular classification algorithms (a short comparison sketch follows the list):

• Linear Classifiers

• Support Vector Machines (SVM)

• Decision Trees

• K-Nearest Neighbor

• Random Forest
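As an illustrative sketch (assuming scikit-learn and a synthetic dataset), several of the classifiers above can be tried side by side on the same data:

# Sketch: trying several classifiers from the list above on one dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "K-Nearest Neighbor": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "accuracy:", model.score(X_test, y_test))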



Regression
Regression is a predictive statistical process where the model attempts to find the important relationship between dependent and independent variables.
The goal of a regression algorithm is to predict a continuous number such as sales, income, or test scores. The equation for basic linear regression can be written as:

y = w[1]*x[1] + w[2]*x[2] + ... + w[n]*x[n] + b

Where each x[i] is a feature of the data, and w[i] and b are parameters which are developed during training.
For simple linear regression models with only one feature in the data, the formula looks like this:

y = w*x + b

Where w is the slope, x is the single feature and b is the y-intercept.



For simple regression problems such as this, the model's predictions are represented by the line of best fit. For models using two features, a plane is used. Finally, for a model using more than two features, a hyperplane is used.

Imagine we want to determine a student's test grade based on how many hours they studied the week of the test. Let's say the plotted data with a line of best fit looks like this:

• There is a clear positive correlation between hours studied (independent variable) and the student's final test score (dependent variable).

• A line of best fit can be drawn through the data points to show the model's predictions when given a new input.

• Say we wanted to know how well a student would do with five hours of studying.

• We can use the line of best fit to predict the test score based on other students' performances.



Regression Analysis and its Types

There are many different types of regression algorithms. The three most common are listed below:

• Linear Regression

• Logistic Regression

• Polynomial Regression



Linear Regression

Linear regression is a linear model, i.e. a model that assumes a linear relationship between the input variables (x) and the single output variable (y). More specifically, y can be calculated from a linear combination of the input variables (x).

When there is a single input variable (x), the method is referred to as simple linear regression. When there are multiple input variables, the statistics literature often refers to the method as multiple linear regression.
Let’s make this concrete with an example.
Imagine we are predicting weight (y) from height (x). Our linear regression model
representation for this problem would be:

y = B0 + B1 * x1

or

weight = B0 + B1 * height

Where B0 is the bias coefficient and B1 is the coefficient for the height column.
We use a learning technique to find a good set of coefficient values. Once found, we can
plug in different height values to predict the weight.

For example, let's use B0 = 0.1 and B1 = 0.5.

Let's plug them in and calculate the weight (in kilograms) for a person with a height of 182 centimeters:

weight = 0.1 + 0.5 * 182

weight = 91.1

You can see that the above equation could be plotted as a line in two dimensions. B0 is our starting point regardless of what height we have. We can run through a range of heights from 100 to 250 centimeters, plug them into the equation to get weight values, and so create our line.
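A short sketch of this calculation in Python (the coefficients B0 = 0.1 and B1 = 0.5 are the illustrative values from above, not learned ones):

# Sketch: evaluating the simple linear regression weight = B0 + B1 * height.
B0, B1 = 0.1, 0.5                      # illustrative coefficients from the text

def predict_weight(height_cm):
    return B0 + B1 * height_cm

print(predict_weight(182))             # -> 91.1, as computed above

# Running through heights from 100 to 250 cm traces out the regression line.
line = [(h, predict_weight(h)) for h in range(100, 251, 10)]
print(line[:3])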
Mean Squared Error (MSE)
Mean squared error is perhaps the most popular metric used for regression problems. It essentially finds the average of the squared difference between the target value and the value predicted by the regression model:

MSE = (1/N) * Σ (y_j − ŷ_j)²

Where:
• y_j: ground-truth value
• ŷ_j (y_hat): predicted value from the regression model
• N: number of data points



Mean Absolute Error (MAE)
Mean Absolute Error is the average of the absolute difference between the ground-truth and the predicted values. Mathematically, it is represented as:

MAE = (1/N) * Σ |y_j − ŷ_j|

Where:
• y_j: ground-truth value
• ŷ_j (y_hat): predicted value from the regression model
• N: number of data points



Root Mean Squared Error (RMSE)
Root Mean Squared Error corresponds to the square root of the average of the squared difference between the target value and the value predicted by the regression model; basically, sqrt(MSE). Mathematically it can be represented as:

RMSE = sqrt( (1/N) * Σ (y_j − ŷ_j)² )



R² Coefficient of determination
The R² coefficient of determination works as a post metric, meaning it is a metric calculated using other metrics.
The point of calculating this coefficient is to answer the question "How much (what %) of the total variation in Y (target) is explained by the variation in X (regression line)?"
This is calculated using sums of squared errors. Let's go through the formulation to understand it better.
Total variation in Y (proportional to the variance of Y):

SS_tot = Σ (y_j − ȳ)²   (where ȳ is the mean of the actual values)

Unexplained variation, i.e. the sum of squared errors of the regression line:

SS_res = Σ (y_j − ŷ_j)²

Finally, we have our formula for the coefficient of determination, which can tell us how good or bad the fit of the regression line is:

R² = 1 − SS_res / SS_tot


The following table shows the midterm and final exam grades obtained for students in a database course.
(i) Use the method of least squares to find an equation for the prediction of a student's final exam grade based on the student's midterm grade in the course.
(ii) Predict the final exam grade of a student who received an 86 on the midterm exam.

X (Midterm exam)   Y (Final exam)
72                 84
50                 63
81                 77
74                 78
94                 90
86                 75
59                 49
83                 79
65                 77
33                 52
88                 44
81                 90
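Before the spreadsheet and MATLAB solutions below, here is a sketch of the same least-squares fit in Python (assuming NumPy); it reproduces the coefficients used in the rest of this example:

# Sketch: least-squares fit of final-exam grade (y) on midterm grade (x).
import numpy as np

x = np.array([72, 50, 81, 74, 94, 86, 59, 83, 65, 33, 88, 81], float)
y = np.array([84, 63, 77, 78, 90, 75, 49, 79, 77, 52, 44, 90], float)

b1, b0 = np.polyfit(x, y, 1)                 # (i) slope and intercept
print(b1, b0)                                # approx. 0.4437 and 39.476

print(b0 + b1 * 86)                          # (ii) predicted final grade, approx. 77.6

y_pred = b0 + b1 * x
print(np.sqrt(np.mean((y - y_pred) ** 2)))   # RMSE, approx. 13.33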
Solution by MS Excel

[Excel scatter chart of midterm (X) vs. final exam (Y) marks with a fitted trendline:
y = 0.4437x + 39.476, R² = 0.2488]


X (Midterm)   Y (Final, actual)   Prediction ŷ (y = 0.4437x + 39.476)   Error (y − ŷ)   Squared error
72            84                  71                                     13              169
50            63                  62                                     1               1
81            77                  75                                     2               4
74            78                  72                                     6               36
94            90                  81                                     9               81
86            75                  78                                     -3              9
59            49                  66                                     -17             289
83            79                  76                                     3               9
65            77                  68                                     9               81
33            52                  54                                     -2              4
88            44                  79                                     -35             1225
81            90                  75                                     15              225

RMSE = sqrt(2133/12) ≈ 13.33



Solution by MATLAB



STEP 1: ADD DATA (Machine Learning Toolbox @ MATLAB)



Training Module



Response



Actual vs prediction



Test module



Final result



The resulting error table matches the Excel solution above: RMSE ≈ 13.33.



Prediction model



Applications of Supervised Learning
•Sentiment Analysis: A natural language processing technique in which we analyze text data and categorize its meaning. For example, if we are analyzing people's tweets and want to predict whether a tweet is a query, complaint, suggestion, opinion or news, we simply use sentiment analysis.

•Recommendations: Every e-commerce or media site uses a recommendation system to recommend products and new releases to its customers or users on the basis of their activities. Netflix, Amazon, YouTube and Flipkart earn huge profits with the help of their recommendation systems.

•Spam Filtration: Detecting spam emails is a very helpful tool; these filtration techniques can easily detect viruses, malware or even harmful URLs. Recent studies found that about 56.87 percent of all emails on the internet were spam in March 2017, a major drop from April 2014's 71.1 percent spam share.
Logistic Regression

Logistic regression is another technique borrowed by machine learning from the field of statistics.

It is the go-to method for binary classification problems (problems with two class values). In this section you will discover the logistic regression algorithm for machine learning.

For example:
• To predict whether an email is spam (1) or not (0)
• To predict whether a tumor is malignant (1) or not (0)
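A brief sketch (assuming scikit-learn and a synthetic binary dataset) of fitting a logistic regression for such a binary problem:

# Sketch: logistic regression on a synthetic binary classification problem.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=4, random_state=0)  # y is 0/1
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
print(clf.predict(X_test[:5]))         # hard 0/1 predictions
print(clf.predict_proba(X_test[:5]))   # class probabilities from the sigmoid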



Polynomial Regression

Regression is defined as the method to find the relationship between the independent and dependent variables to predict the outcome. The first polynomial regression model was used in 1815 by Gergonne.

Polynomial regression is used to find the best-fit line (or curve), using the regression equation, for predicting outcomes.

There are many types of regression techniques; polynomial regression is one of them.

•If we add higher degrees such as quadratic, the line turns into a curve that better fits the data. Generally, it is used when the points in the data set are scattered and the linear model is not able to describe the result clearly. We should always keep an eye on overfitting and underfitting while adding these degrees to the equation (a short sketch follows below).
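A sketch of fitting a quadratic (degree-2) polynomial with NumPy, illustrating how a higher degree curves the fit (the data here is synthetic):

# Sketch: degree-1 vs degree-2 fits on noisy quadratic data.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 2 + 0.5 * x + 1.5 * x**2 + rng.normal(0, 1, x.size)   # truly quadratic data

linear = np.polyfit(x, y, 1)      # degree 1: underfits with a straight line
quadratic = np.polyfit(x, y, 2)   # degree 2: captures the curvature

for coeffs in (linear, quadratic):
    y_hat = np.polyval(coeffs, x)
    print("degree", len(coeffs) - 1, "MSE:", np.mean((y - y_hat) ** 2))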



Polynomial Regression Uses
•It is used in many experimental procedures to produce the outcome using this equation.
•It provides a well-defined relationship between the independent and dependent variables.
•It is used to study the isotopes of sediments.
•It is used to study the rise of different diseases within a population.
•It is used to study the generation of any synthesis.



Model Selection for Machine Learning

What Is Model Selection?

Model selection is the process of selecting one final machine learning model from
among a collection of candidate machine learning models for a training dataset.

Model selection is a process that can be applied both across different types of models
(e.g. logistic regression, SVM, KNN, etc.) and across models of the same type
configured with different model hyperparameters (e.g. different kernels in an SVM).
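As a sketch (assuming scikit-learn), model selection across model types and across hyperparameters can be done with cross-validation:

# Sketch: comparing candidate models (and SVM kernels) by cross-validation score.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "SVM (linear kernel)": SVC(kernel="linear"),
    "SVM (RBF kernel)": SVC(kernel="rbf"),   # same model type, different hyperparameter
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(name, "mean CV accuracy:", scores.mean())

# The final model is the candidate with the best (good enough) score.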



For example, we may have a dataset for which we are interested in developing a classification or regression predictive model. We do not know beforehand which model will perform best on this problem, as it cannot be known in advance.

Therefore, we fit and evaluate a suite of different models on the problem.

Model selection is the process of choosing one of the models as the final model that addresses the problem.

Model selection is different from model assessment.



Therefore, a “good enough” model may refer to many things and is specific to your project, such as:

•A model that meets the requirements and constraints of project stakeholders.

•A model that is sufficiently skillful given the time and resources available.

•A model that is skillful as compared to naive models.

•A model that is skillful relative to other tested models.

•A model that is skillful relative to the state-of-the-art.



Bayesian Decision Theory

Introduction

Bayesian decision theory refers to the statistical approach based on quantifying the tradeoff among various classification decisions using the concept of probability (Bayes Theorem) and the costs associated with the decision.

It is basically a classification technique that involves the use of the Bayes Theorem, which is used to find conditional probabilities.

In statistical pattern recognition, we focus on the statistical properties of patterns that are generally expressed in probability densities (pdf's and pmf's); this will command most of our attention here as we develop the fundamentals of Bayesian decision theory.



Prerequisites
Random Variable
A random variable is a function that maps a set of possible outcomes to numerical values, e.g. when tossing a coin, mapping Head (H) to 1 and Tail (T) to 0; here 0 and 1 are the values taken by the random variable.

Bayes Theorem
The conditional probability of A given B, represented by P(A | B), is the chance of occurrence of A given that B has occurred.

P(A | B) = P(A,B) / P(B)

By using the chain rule, this can also be written as:

P(A,B) = P(A|B)P(B) = P(B|A)P(A)

so that

P(A | B) = P(B|A)P(A) / P(B) ——- (1)

where P(B) = P(B,A) + P(B,A') = P(B|A)P(A) + P(B|A')P(A')

Here, equation (1) is known as the Bayes Theorem of probability. Our aim is to explore each of the components included in this theorem.
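A tiny numeric sketch of equation (1) in Python (the probabilities here are made-up illustrative numbers, not from any dataset):

# Sketch: Bayes Theorem P(A|B) = P(B|A)P(A) / P(B), with P(B) by total probability.
p_A = 0.3                    # prior P(A); illustrative value
p_B_given_A = 0.8            # likelihood P(B|A); illustrative value
p_B_given_notA = 0.1         # P(B|A'); illustrative value

p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)   # evidence P(B)
p_A_given_B = p_B_given_A * p_A / p_B                  # posterior, equation (1)
print(p_A_given_B)           # approx. 0.774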
(a) Prior or State of Nature:

Prior probabilities represent how likely each class is to occur.
Priors are known before the training process.
The state of nature w is treated as a random variable, with prior probabilities P(wi).
If there are only two classes, then the sum of the priors is P(w1) + P(w2) = 1, if the classes are exhaustive.

(b) Class Conditional Probabilities:

This represents the probability of how likely a feature x occurs given that it belongs to a particular class wi. It is denoted by P(x | wi), where x is a particular feature.

It is the probability of how likely the feature x occurs given that it belongs to the class wi.
Sometimes it is also known as the Likelihood.
It is the quantity that we have to evaluate while training on the data. During the training process, we have input (features) X labeled with a corresponding class w, and we figure out the likelihood of occurrence of that set of features given the class label.



(c) Evidence:

•It is the probability of occurrence of a particular feature, i.e. P(X).
•It can be calculated using the law of total probability as: P(X) = Σ_i P(X | wi) P(wi)
•Since it is built from the class-conditional probabilities and the priors, the evidence values can also be figured out during training.

(d) Posterior Probabilities:

•It is the probability of occurrence of class wi when certain features X are given: P(wi | X).
•It is what we aim to compute in the test phase, in which we have test inputs (features) and have to find how likely the trained model considers the features to belong to the particular class wi.



Consider a fictional dataset that describes the weather conditions for playing a game of golf. Given the weather conditions, each tuple classifies the conditions as fit ("Yes") or unfit ("No") for playing golf. Here is a tabular representation of our dataset.

Case  Outlook   Temperature  Humidity  Windy  Play Golf
0     Rainy     Hot          High      False  No
1     Rainy     Hot          High      True   No
2     Overcast  Hot          High      False  Yes
3     Sunny     Mild         High      False  Yes
4     Sunny     Cool         Normal    False  Yes
5     Sunny     Cool         Normal    True   No
6     Overcast  Cool         Normal    True   Yes
7     Rainy     Mild         High      False  No
8     Rainy     Cool         Normal    False  Yes
9     Sunny     Mild         Normal    False  Yes
10    Rainy     Mild         Normal    True   Yes
11    Overcast  Mild         High      True   Yes
12    Overcast  Hot          Normal    False  Yes
13    Sunny     Mild         High      True   No


The dataset is divided into two parts, namely, the feature matrix and the response vector.

•The feature matrix contains all the vectors (rows) of the dataset, in which each vector consists of the values of the dependent features. In the above dataset, the features are 'Outlook', 'Temperature', 'Humidity' and 'Windy'.

•The response vector contains the value of the class variable (prediction or output) for each row of the feature matrix. In the above dataset, the class variable name is 'Play golf'.


Assumption:
The fundamental Naive Bayes assumption is that each feature makes an:
•independent
•equal

contribution to the outcome.

With relation to our dataset, this concept can be understood as:


•We assume that no pair of features is dependent. For example, the temperature being 'Hot' has nothing to do with the humidity, and the outlook being 'Rainy' has no effect on the wind. Hence, the features are assumed to be independent.

•Secondly, each feature is given the same weight (or importance). For example, knowing only the temperature or the humidity alone can't predict the outcome accurately. None of the attributes is irrelevant, and each is assumed to contribute equally to the outcome.

Note: The assumptions made by Naive Bayes are not generally correct in real-world situations. In fact, the independence assumption is never exactly correct, but it often works well in practice.



Bayes’ Theorem

Bayes’ Theorem finds the probability of an event occurring given the probability of another event that has already occurred. Bayes’ theorem is stated mathematically as the following equation:

P(A | B) = P(B | A) P(A) / P(B)

where A and B are events and P(B) ≠ 0.

•Basically, we are trying to find the probability of event A, given that event B is true. Event B is also termed the evidence.

•P(A) is the priori of A (the prior probability, i.e. the probability of the event before the evidence is seen). The evidence is an attribute value of an unknown instance (here, it is event B).

•P(A|B) is the a posteriori probability of A, i.e. the probability of the event after the evidence is seen.



Now, with regards to our dataset, we can apply Bayes’ theorem in the following way:

P(y | X) = P(X | y) P(y) / P(X)

where y is the class variable and X = (x1, x2, …, xn) is the feature vector.

Just to be clear, an example of a feature vector and corresponding class variable is (refer to the 1st row of the dataset):

X = (Rainy, Hot, High, False)
y = No

So basically, P(y|X) here means the probability of “Not playing golf” given that the weather conditions are “Rainy outlook”, “Temperature is hot”, “high humidity” and “no wind”.



Naïve Bayes Classifier



What is the Naïve Bayes Classifier?

The Naïve Bayes Classifier belongs to the family of probability classifiers, using Bayes' theorem. The reason why it is called 'Naïve' is that it requires a rigid independence assumption between input variables. Therefore, it may be more proper to call it Simple Bayes or Independence Bayes. This algorithm has been studied extensively since the 1960s. Simple though it is, the Naïve Bayes Classifier remains one of the popular methods to solve the text categorization problem, i.e. the problem of judging documents as belonging to one category or another, such as email spam detection.

The goal of the Naïve Bayes Classifier is to calculate the conditional probability:

P(Ck | x1, x2, …, xn)

for each of K possible outcomes or classes Ck.



Let x = (x1, x2, …, xn). Using Bayes' theorem, we can get:

P(Ck | x) = P(Ck) P(x | Ck) / P(x)

The joint probability can be written, via the chain rule, as:

P(Ck, x1, …, xn) = P(x1 | x2, …, xn, Ck) P(x2 | x3, …, xn, Ck) … P(xn | Ck) P(Ck)

Assuming that all features x are mutually independent given the class, we get:

P(Ck | x) = P(Ck) Π_i P(xi | Ck) / P(x)

This is the final formula for the Naïve Bayes Classifier.
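As a sketch, this final formula can be implemented directly in Python on the golf dataset from above (simple counting-based estimates, no smoothing):

# Sketch: Naive Bayes on the golf dataset, implementing P(Ck) * product of P(xi|Ck).
from collections import Counter, defaultdict

data = [  # (Outlook, Temperature, Humidity, Windy) -> Play Golf
    (("Rainy","Hot","High","False"), "No"),     (("Rainy","Hot","High","True"), "No"),
    (("Overcast","Hot","High","False"), "Yes"), (("Sunny","Mild","High","False"), "Yes"),
    (("Sunny","Cool","Normal","False"), "Yes"), (("Sunny","Cool","Normal","True"), "No"),
    (("Overcast","Cool","Normal","True"), "Yes"), (("Rainy","Mild","High","False"), "No"),
    (("Rainy","Cool","Normal","False"), "Yes"), (("Sunny","Mild","Normal","False"), "Yes"),
    (("Rainy","Mild","Normal","True"), "Yes"),  (("Overcast","Mild","High","True"), "Yes"),
    (("Overcast","Hot","Normal","False"), "Yes"), (("Sunny","Mild","High","True"), "No"),
]

class_counts = Counter(y for _, y in data)       # for the priors P(Ck)
feat_counts = defaultdict(Counter)               # for the likelihoods P(xi|Ck)
for x, y in data:
    for i, v in enumerate(x):
        feat_counts[(i, y)][v] += 1

def posterior(x, c):
    p = class_counts[c] / len(data)                       # prior P(c)
    for i, v in enumerate(x):
        p *= feat_counts[(i, c)][v] / class_counts[c]     # conditional P(xi|c)
    return p                                              # proportional to P(c|x)

today = ("Sunny", "Hot", "Normal", "False")
scores = {c: posterior(today, c) for c in class_counts}
print(scores)                          # {'No': ~0.0046, 'Yes': ~0.0212}
print(max(scores, key=scores.get))     # -> 'Yes'

These numbers match the manual computation carried out later in this section.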



How to calculate parameters and make a prediction in the Naïve Bayes Classifier?

Maximum Likelihood Estimation (MLE) is used to estimate the parameters — the prior probability and the conditional probabilities.

The prior probability of a class equals the number of records where y takes that class, divided by the total number of records.

The conditional probability p(x1 = a1 | y = C1) equals the number of cases where x1 equals a1 and y equals C1, divided by the number of cases where y equals C1.

The Naïve Bayes Classifier uses the following formula to make a prediction:

y = argmax_Ck P(Ck) Π_i P(xi | Ck)



For example, suppose 15 records (with two features X1, X2 and a label Y) are used to train a Naïve Bayes model, and then a prediction is made for a new record X(B, S).

Using the formulas above to estimate the prior and conditional probabilities, we find for X(B, S):

P(Y=0) P(X1=B | Y=0) P(X2=S | Y=0) > P(Y=1) P(X1=B | Y=1) P(X2=S | Y=1), so y = 0.
So, finally, we are left with the task of calculating P(y) and P(xi | y).

Please note that P(y) is also called the class probability and P(xi | y) is called the conditional probability.

The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of P(xi | y).

Let us try to apply the above formula manually on our weather dataset. For this, we need to do some precomputations on our dataset.

We need to find P(xi | yj) for each xi in X and yj in y. These calculations are demonstrated in the tables below:



[Tables 1-4: the likelihoods P(xi | yj) for each feature value and class, tallied from the golf dataset above. Table 5: the class probabilities P(y).]


So, in the tables above, we have calculated P(xi | yj) for each xi in X and yj in y manually, in tables 1-4. For example, the probability of playing golf given that the temperature is cool is P(temp. = Cool | play golf = Yes) = 3/9.

Also, we need the class probabilities P(y), which have been calculated in table 5. For example, P(play golf = Yes) = 9/14.

So now we are done with our pre-computations, and the classifier is ready!

Let us test it on a new set of features (let us call it today):

today = (Sunny, Hot, Normal, False)

So, the probability of playing golf is given by:

P(Yes | today) = P(Sunny|Yes) P(Hot|Yes) P(Normal|Yes) P(False|Yes) P(Yes) / P(today)

and the probability of not playing golf is given by:

P(No | today) = P(Sunny|No) P(Hot|No) P(Normal|No) P(False|No) P(No) / P(today)

Since P(today) is common to both probabilities, we can ignore P(today) and compare proportional probabilities:

P(Yes | today) ∝ (3/9) · (2/9) · (6/9) · (6/9) · (9/14) ≈ 0.0212

and

P(No | today) ∝ (2/5) · (2/5) · (1/5) · (2/5) · (5/14) ≈ 0.0046


These numbers can be converted into probabilities by making the sum equal to 1 (normalization):

P(Yes | today) = 0.0212 / (0.0212 + 0.0046) ≈ 0.82

and

P(No | today) = 0.0046 / (0.0212 + 0.0046) ≈ 0.18

Now, since P(Yes | today) > P(No | today), the prediction is that golf would be played: 'Yes'.

The method that we discussed above is applicable for discrete data. In the case of continuous data, we need to make some assumptions regarding the distribution of values of each feature. The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of P(xi | y); a common choice for continuous features is the Gaussian (normal) distribution.



Classification @MATLAB

To use the golf dataset from above in MATLAB, the categorical values are first encoded as numbers:

Outlook:      Rainy = 1, Overcast = 2, Sunny = 3
Temperature:  Hot = 1, Mild = 2, Cool = 3
Humidity:     Normal = 1, High = 2
Windy:        FALSE = 1, TRUE = 2
Play Golf:    Yes = 1, No = 2


Conversion @MATLAB

Case  Outlook  Temperature  Humidity  Windy  Play Golf
0     1        1            2         1      2
1     1        1            2         2      2
2     2        1            2         1      1
3     3        2            2         1      1
4     3        3            1         1      1
5     3        3            1         2      1
6     2        3            1         2      1
7     1        2            2         1      2
8     1        3            1         1      1
9     3        2            1         1      1
10    1        2            1         2      1
11    2        2            2         2      1
12    2        1            1         1      1
13    1        2            2         2      2
STEP 1: IMPORT DATA



STEP 2: ML APP



STEP 3: TRAIN DATA



STEP 4: ACCURACY



STEP 5: PREDICTION
Bayes Optimal Classifier

The Bayes Optimal Classifier is a probabilistic model that makes the most probable prediction for a new example.
It is described using the Bayes Theorem that provides a principled way for calculating a conditional probability. It is
also closely related to the Maximum a Posteriori: a probabilistic framework referred to as MAP that finds the most
probable hypothesis for a training dataset.

In practice, the Bayes Optimal Classifier is computationally expensive, if not intractable to calculate, and instead,
simplifications such as the Gibbs algorithm and Naive Bayes can be used to approximate the outcome.

In this section you will discover the Bayes Optimal Classifier for making the most accurate predictions for new instances of data. After reading it, you will know:

•Bayes Theorem provides a principled way for calculating conditional probabilities, called a posterior probability.
•Maximum a Posteriori is a probabilistic framework that finds the most probable hypothesis that describes the
training dataset.
•Bayes Optimal Classifier is a probabilistic model that finds the most probable prediction using the training data and
space of hypotheses to make a prediction for a new data instance.



Estimator Bias and the Bias-Variance Tradeoff

Plus, how to compare estimators based on their bias, variance and mean squared error.

A statistical estimator can be evaluated on the basis of how biased it is in its prediction, how consistent its performance is, and how efficiently it can make predictions. And the quality of your model's predictions is only as good as the quality of the estimator it uses.

In this section, we’ll cover the property of bias in detail and learn how to measure it.

The bias of an estimator happens to be joined at the hip with the variance of the estimator's predictions via a concept called the bias-variance tradeoff, and so we'll learn about that concept too.

We’ll close with a discussion on the Mean Squared Error of the estimator, its applicability to regression modeling, and we’ll
show how to evaluate various estimators of the population mean using the properties of bias, variance and their Mean
Squared Error.



What is a Statistical Estimator?
An estimator is any procedure or formula that is used to predict or estimate the value of some unknown quantity such as say,
your flight’s departure time, or today’s NASDAQ closing price.

Let’s state an informal definition of what an estimator is: A statistical estimator is a statistical device used to estimate the true, but unknown, value of some parameter of the population, such as the mean or the median. It does this by using the information contained in the data points that make up a sample of values.

In our daily lives, we tend to employ various types of estimators without even realizing it. Following are some types of estimators
that we commonly use:

Estimators based on good (or bad!) Judgement


You ask your stock broker buddy to estimate how high the price of your favorite stock will go in a year’s time. In this case, you
are likely to get an interval estimate of the price, instead of a point estimate.

Estimators based on rules of thumb, and some calculation


You estimate the effort needed to complete your next home improvement project using some estimation technique such as the Work Breakdown Structure.

Estimators based on polling


You ask an odd number of your friends, who they think will win the next election, and you accept the majority result.
In each case, you wish to estimate a parameter you don’t know.
In statistical modeling, the mean, especially the mean of the population, is a fundamental parameter that is often estimated.
Let’s look at a real-life data sample. Following is a data set of surface temperatures in the North Eastern Atlantic Ocean at a certain time of year. [Table of sample temperature readings.]


•Estimator #1: We could take the average of the minimum and maximum temperature values in the above sample:

y_min = −0.287040159, y_max = 15.02521203

Estimate #1 of the population mean = (y_min + y_max)/2 = 7.3690859355
Estimator #2: We could choose a random value from the sample and designate it as the population mean:

Estimate #2 of the population mean = 13.5832207

Estimator #3: We could use the sample mean, which averages out all the temperature values in the data sample: ȳ = (1/n) Σ y_j

Estimate #3 of the population mean = 11.94113359335031


Estimator Bias
Suppose you are shooting basketballs. While some balls make it through the net, you find that most of your throws
are hitting a point below the hoop. In this case, whatever technique you are using to estimate the correct angle and
speed of the throw is underestimating the (unknown) correct values of angle and speed.
Your estimator has a negative bias.



With practice, your throwing technique improves, and you are able to dunk more often. And from a bias perspective,
you begin overshooting the basket approximately just as often as undershooting it. You have managed to reduce
your estimation technique’s bias.



The bias-variance trade-off

One aspect that might be apparent to you from the above two figures is that in the first figure, although the bias is large, the 'dispersion' of the missed shots is smaller, leading to a lower variance in outcomes. In the second figure, the bias has undoubtedly been reduced because of a more uniform spreading out of the missed shots, but that has also led to a higher spread, a.k.a. higher variance.

The first technique appears to have a larger bias and a smaller variance, and it is vice versa for the second technique. This is no coincidence: it can be shown that there is a direct give and take between the bias and variance of your estimation technique.



Note that this mean y_bar relates to our sample of n values, i.e. n ball throws, or n ocean surface temperatures, etc. Switching to the ocean temperatures example, if we collect another set of n ocean temperature values and average them out, we'll get a second value for the sample mean y_bar. A third sample of size n will yield a third sample mean, and so on. So the sample mean y_bar is itself a random variable. And just like any random variable, y_bar has a probability distribution and an expected value, denoted by E(y_bar).

We are now in a position to define the bias of the estimator y_bar for the population mean µ as follows:

The bias of the estimator y_bar for the population mean µ is the difference between the expected value of the sample mean y_bar and the population mean µ. Here is the formula:

Bias(y_bar) = E(y_bar) − µ


Support Vector Machines

Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outlier detection.

The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (N = the number of features) that distinctly classifies the data points.



Possible hyperplanes

To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find the plane that has the maximum margin, i.e. the maximum distance between data points of both classes. Maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence.
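A sketch (assuming scikit-learn) of training SVMs with the kernel types named in the unit outline — linear, polynomial and Gaussian (RBF):

# Sketch: SVM classifiers with linear, polynomial and Gaussian (RBF) kernels.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.25, random_state=0)  # non-linear classes
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "poly", "rbf"):      # "rbf" is the Gaussian kernel
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, "test accuracy:", clf.score(X_test, y_test))

# The support vectors are the training points closest to the decision boundary:
print("support vectors per class (rbf):", clf.n_support_)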
Hyperplanes and Support Vectors



Support vectors are data points that are closer to the hyperplane and influence the position and orientation of the
hyperplane. Using these support vectors, we maximize the margin of the classifier. Deleting the support vectors will
change the position of the hyperplane. These are the points that help us build our SVM.

