
Department of Computer Engineering
Machine Learning (01CE0711)
Sem 7 | 4 Credits
Unit #2
Prof. Ankita Mishra


Course Outcomes
After completion of this course, students will be able to:
● Understand machine-learning concepts. (Understand)
● Understand optimization theory and concepts. (Understand)
● Understand and analyze the different methods of Gradient Descent. (Analyze)
● Apply the concepts of Supervised and Unsupervised Learning. (Apply)
● Apply the concepts of machine learning and optimization in designing intelligent systems. (Apply)
Topics
● Evaluating a Learning Algorithm: Deciding what to try next
● Evaluating a Hypothesis, Model Selection, and Train/Validation/Test Sets
● Bias vs. Variance: Diagnosing Bias vs. Variance, Regularization and Bias/Variance, Learning Curves
● Building a Spam Classifier: Prioritizing what to work on, Error Analysis
● Handling Skewed Data: Error Metrics for Skewed Classes, Trading off Precision and Recall
● Data for Machine Learning
● Support Vector Machines: Large Margin Classification: Optimization Objective, Large Margin Intuition, Kernels
Evaluating a Learning Algorithm: Deciding what to try next
Problem Statement:
● It can happen that, upon applying a learning algorithm to a problem, the model makes unacceptably large errors in its predictions. There are various options that can possibly improve the performance of the model, and in turn its accuracy, such as:
● Acquire more training data
● Filter and reduce the number of features
● Increase the number of features
● Add polynomial features
● Decrease the regularization parameter λ
● Increase the regularization parameter λ
● It is often hard to decide which of these options to try when evaluating the effectiveness of an algorithm.
● Given the number of options, it can be cumbersome to decide which path to follow.
● It can happen that one or more of the listed techniques simply do not work in a given case, which leads to wasted resources.
● Machine learning diagnostics are tests that help gain insight into what would or would not work with a learning algorithm, and hence give guidance on how to improve its performance. They can take time to implement, but are still worth running in times of uncertainty.
Overfitting and Train-Test Split
● In the case of overfitting in linear regression, we have seen that even though the hypothesis has a low training error, it is still inaccurate on new data because it overfits.
● This is where the standard technique of the train-test split comes in handy.
● In this method, the given dataset is split into two sets: a training set and a test set.
● Typically, the training set has 70% of the data while the test set has the remaining 30%.
● The training process is then defined as:
  Learn Θ by minimizing J(Θ) on the training set.
  Compute the error J_test(Θ) on the test set.
● The test set error can be defined as follows:
  For linear regression, the mean-squared error (MSE):
  J_test(Θ) = (1 / (2 m_test)) · Σ_{i=1}^{m_test} ( h_Θ(x_test^(i)) − y_test^(i) )²
  For logistic regression, the cross-entropy cost function:
  J_test(Θ) = −(1 / m_test) · Σ_{i=1}^{m_test} [ y_test^(i) · log h_Θ(x_test^(i)) + (1 − y_test^(i)) · log(1 − h_Θ(x_test^(i))) ]
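As a small illustration (not from the original slides), the split and the test-set MSE above could be computed roughly as follows; the dataset here is synthetic and the 1/(2·m_test) convention matches the formula given:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Hypothetical dataset: 100 examples, 3 features (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.1, size=100)

# 70/30 train-test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Learn the parameters Theta by minimizing J(Theta) on the training set.
model = LinearRegression().fit(X_train, y_train)

# Test-set error J_test(Theta): mean-squared error with the 1/(2*m_test) convention.
m_test = len(y_test)
j_test = np.sum((model.predict(X_test) - y_test) ** 2) / (2 * m_test)
print(f"J_test(Theta) = {j_test:.4f}")
```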

Model selection: Issue with the train/test split
Why Train/Validation/Test Splits?
● Suppose we have n models with varying candidate hyperparameters (such as the number of polynomial terms in a regression, or the number of hidden layers and neurons in a neural network), and one is chosen based on the lowest test error it reports after training.
● Would it be correct to report this error as an indicator of the generalized performance of the selected model?
● In practice, many practitioners do report this as the model's performance, but it is advised against.
● This is because, just as the model parameters were fit to the training samples and therefore report a lower error on the training set, the hyperparameters have been fit to the test set and will likewise report a lower error on it.
● To overcome this issue, it is recommended to split the dataset into three parts: train, cross-validation (or validation), and test.
● The training set is used to optimize the model parameters; the cross-validation set is then used to select the best model among those with varying hyperparameters.
● Finally, the generalized performance is calculated on the test set, which is not seen during training or model selection.
● This is a truly unbiased report of the model's performance, since the hyperparameters have not been tuned using the test set.
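A minimal sketch of this three-way split, assuming a hypothetical polynomial-degree selection problem and a synthetic dataset (the scikit-learn calls are standard, everything else is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=300)

# 60% train, 20% cross-validation, 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Select the polynomial degree (a hyperparameter) using the validation set only.
best_degree, best_val_err = None, float("inf")
for degree in range(1, 11):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    val_err = mean_squared_error(y_val, model.predict(X_val))
    if val_err < best_val_err:
        best_degree, best_val_err = degree, val_err

# Report the generalization error on the untouched test set.
final_model = make_pipeline(PolynomialFeatures(best_degree), LinearRegression())
final_model.fit(X_train, y_train)
test_err = mean_squared_error(y_test, final_model.predict(X_test))
print(f"chosen degree={best_degree}, test MSE={test_err:.4f}")
```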
Model selection: solution to the issue with the train/test split
Errors in Machine Learning
There are mainly two types of errors in machine learning:
● Reducible errors: These errors can be reduced to improve the model's accuracy. They can further be classified into bias and variance.
● Irreducible errors: These errors will always be present in the model. They are caused by unknown variables whose influence cannot be reduced.
Bias
● A machine learning model analyses the data, finds patterns in it, and makes predictions.
● While training, the model learns these patterns in the dataset and applies them to test data for prediction.
● While making predictions, a difference occurs between the values predicted by the model and the actual/expected values; this difference is known as bias error, or error due to bias.
● A model has either:
  Low bias: A low-bias model makes fewer assumptions about the form of the target function.
  High bias: A high-bias model makes more assumptions and becomes unable to capture the important features of the dataset. A high-bias model also cannot perform well on new data.
Variance
● Variance specifies the amount by which the prediction would change if different training data were used.
● In simple words, variance tells how much a random variable differs from its expected value.
● Ideally, a model should not vary too much from one training dataset to another, which means the algorithm should be good at capturing the hidden mapping between the input and output variables.
● A model has either:
  Low variance: there is only a small variation in the prediction of the target function with changes in the training dataset.
  High variance: there is a large variation in the prediction of the target function with changes in the training dataset.
Different Combinations of Bias-Variance
There are four possible combinations of bias and variance, represented by the diagram below:
1. Low bias, low variance: The combination of low bias and low variance is the ideal machine learning model. However, it is practically not achievable.
2. Low bias, high variance: With low bias and high variance, model predictions are inconsistent but accurate on average. This occurs when the model learns a large number of parameters, and it leads to overfitting.
3. High bias, low variance: With high bias and low variance, predictions are consistent but inaccurate on average. This occurs when the model does not learn well from the training dataset or uses few parameters. It leads to underfitting.
4. High bias, high variance: With high bias and high variance, predictions are inconsistent and also inaccurate on average.
Overfitting and Underfitting
Overfitting
● Overfitting occurs when our machine learning model tries to cover all the data points, or more than the required data points, present in the given dataset.
● Because of this, the model starts capturing the noise and inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy of the model.
● An overfitted model has low bias and high variance.
● The chances of overfitting increase the more we train our model: the more we train it, the higher the chance of obtaining an overfitted model.
Underfitting
● Underfitting occurs when our machine learning model is not able to capture the underlying trend of the data.
● To avoid overfitting, the feeding of training data can be stopped at an early stage, because of which the model may not learn enough from the training data.
● As a result, it may fail to find the best fit for the dominant trend in the data.
● In the case of underfitting, the model is not able to learn enough from the training data, and hence it reduces the accuracy and produces unreliable predictions.
● An underfitted model has high bias and low variance.
Diagnosing Bias vs. Variance
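One common diagnostic, sketched here as an illustration rather than taken from the slides, is to compare the training error with the cross-validation error: both high suggests high bias (underfitting), while a low training error with a much higher validation error suggests high variance (overfitting). The dataset below is synthetic.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = X[:, 0] * 2.0 - X[:, 1] + rng.normal(scale=0.3, size=200)

# scikit-learn scorers return negative MSE; flip the sign for readability.
scores = cross_validate(Ridge(alpha=1.0), X, y,
                        scoring="neg_mean_squared_error",
                        return_train_score=True, cv=5)
train_err = -scores["train_score"].mean()
cv_err = -scores["test_score"].mean()

# High train_err and high cv_err  -> high bias (underfitting).
# Low train_err but high cv_err   -> high variance (overfitting).
print(f"train MSE={train_err:.3f}, cross-validation MSE={cv_err:.3f}")
```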
Bias-Variance Trade-Off
● While building a machine learning model, it is really important to take care of bias and variance in order to avoid overfitting and underfitting.
● If the model is very simple, with few parameters, it may have low variance and high bias.
● Whereas, if the model has a large number of parameters, it will have high variance and low bias.
● So, it is necessary to strike a balance between the bias and variance errors; this balance is known as the bias-variance trade-off.
● For an accurate prediction, an algorithm needs both low variance and low bias. But this is not fully achievable, because bias and variance are related to each other:
  If we decrease the variance, the bias will increase.
  If we decrease the bias, the variance will increase.
● So, we need to find a sweet spot between bias and variance to make an optimal model.
● Hence, the bias-variance trade-off is about finding that sweet spot and striking a balance between the bias and variance errors.
Bias/Variance as a function of the regularization parameter λ
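The original slides show this relationship as plots; a sketch of how such a curve could be produced with scikit-learn's validation_curve is given below. Ridge regression's alpha plays the role of λ here, and the data is synthetic.

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=150)

# Sweep the regularization strength (alpha ~ lambda) over several orders of magnitude.
lambdas = np.logspace(-3, 3, 13)
train_scores, val_scores = validation_curve(
    Ridge(), X, y, param_name="alpha", param_range=lambdas,
    scoring="neg_mean_squared_error", cv=5)

train_err = -train_scores.mean(axis=1)
val_err = -val_scores.mean(axis=1)

# Small lambda: low training error but higher validation error (high variance).
# Large lambda: both errors high (high bias).
for lam, tr, va in zip(lambdas, train_err, val_err):
    print(f"lambda={lam:8.3f}  train MSE={tr:.3f}  CV MSE={va:.3f}")
```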
Learning Curves
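As a hedged sketch of the learning-curve idea (error as a function of the number of training examples), scikit-learn's learning_curve can be used; the data here is again synthetic.

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 5))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 2.0]) + rng.normal(scale=0.4, size=500)

# Training and cross-validation error for increasing training-set sizes.
sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y, train_sizes=np.linspace(0.1, 1.0, 8),
    scoring="neg_mean_squared_error", cv=5)

# If the two curves converge to a high error, the model suffers from high bias;
# a persistent gap between them indicates high variance.
for n, tr, va in zip(sizes, -train_scores.mean(axis=1), -val_scores.mean(axis=1)):
    print(f"m={n:4d}  train MSE={tr:.3f}  CV MSE={va:.3f}")
```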
Building a Spam Classifier: Prioritizing what to work on
● Supervised learning:
  x = features of the email
  y = spam (1) or not spam (0)
● Features x: choose 100 words indicative of spam/not spam.
  E.g., indicative words in spam emails: Buy, Discount, Deal
  Indicative words in non-spam emails: Andrew, Now
● Step 1: Arrange the 100 indicative words in alphabetical order, e.g.: Andrew, Buy, Deal, Discount, Now.
● Step 2: Check the email, map its words against the list from Step 1, and define the feature vector X accordingly, e.g.:
  X = [0 1 1 0 0 …]
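A small sketch (not from the slides) of how such a binary feature vector could be built in Python, with a hypothetical 5-word vocabulary standing in for the 100-word list:

```python
import re

# Hypothetical vocabulary, kept in alphabetical order as in Step 1.
vocabulary = ["andrew", "buy", "deal", "discount", "now"]

def email_to_feature_vector(email_text: str) -> list[int]:
    """Return X, where X[j] = 1 if vocabulary[j] appears in the email, else 0."""
    words = set(re.findall(r"[a-z]+", email_text.lower()))
    return [1 if word in words else 0 for word in vocabulary]

print(email_to_feature_vector("Best deal ever! Buy two, get one free."))
# -> [0, 1, 1, 0, 0]
```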
Error Analysis
Error analysis of skewed classes
What are Skewed Classes?
● Skewed classes refer to a dataset in which the number of training examples belonging to one class heavily outnumbers the number of training examples belonging to the other.
● For example: consider a binary classification where a cancerous patient is to be detected based on some features, and say only 1% of the provided data is cancer-positive. In a setting where having cancer is labeled 1 and not having cancer is labeled 0, if a system naively predicts all 0's, the prediction accuracy will still be 99%.
● Therefore, it can be said with conviction that accuracy or mean-squared error is not a proper indicator of model performance for skewed classes.
● Hence, there is a need for a different error metric for skewed classes.
● Precision/Recall
In a binary classification, one of the following four scenarios may occur:
● True Positive (TP): the model predicts 1 and the actual class is 1
● True Negative (TN): the model predicts 0 and the actual class is 0
● False Positive (FP): the model predicts 1 but the actual class is 0
● False Negative (FN): the model predicts 0 but the actual class is 1
Confusion Matrix

                   Actual: 1              Actual: 0
Predicted: 1       True Positive (TP)     False Positive (FP)
Predicted: 0       False Negative (FN)    True Negative (TN)
Precision and recall can then be defined as follows:
● Precision: of all the predictions y = 1, the fraction that are correct.
  Precision = TP / (TP + FP)
● Recall: of all the actual y = 1 examples, the fraction the model predicts correctly.
  Recall = TP / (TP + FN)
● Note: https://www.youtube.com/watch?v=qWfzIYCvBqo
● Now, if we evaluate a scenario where the classifier predicts all 0's, the recall of the model will be 0, which exposes the inability of the system.
● Note: In the case of skewed classes, it is not possible for a classifier to cheat the evaluation metrics of recall and precision. It is also important to note that precision and recall work better when y = 1 denotes the presence of the rarer class.
F1-Score
● The F1-score is a measure combining both precision and recall. It is generally described as the harmonic mean of the two. The harmonic mean is just another way to calculate an "average" of values, generally considered more suitable for ratios (such as precision and recall) than the traditional arithmetic mean. The formula used for the F1-score in this case is:
  F1 = 2 · (Precision · Recall) / (Precision + Recall)
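A short sketch of these metrics computed with scikit-learn, using hypothetical labels for a small skewed problem:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical labels for a skewed problem: 1 = rare positive class.
y_true = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")

# Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1 = harmonic mean of the two.
print("precision =", precision_score(y_true, y_pred))
print("recall    =", recall_score(y_true, y_pred))
print("f1        =", f1_score(y_true, y_pred))
```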
● The Tradeoff
● By changing the threshold on the classifier's confidence, one can adjust the precision and recall of the model.
● For example, in logistic regression the threshold is generally 0.5. If one increases it, more of the predictions that are made will be correct, hence higher precision. But there is also a higher chance of missing positive cases, hence lower recall.
● Similarly, if one decreases the threshold, the chance of false positives increases, hence lower precision. Also, there is less probability of missing the actual positive cases, hence higher recall.
● A precision-recall tradeoff curve may look like one of the following:
Example to understand precision and recall
Numerical problem on error analysis of skewed classes
● Consider the following hospital dataset predicting the number of sick people. Compute the confusion matrix for the sick/not-sick predictions of patients. Calculate the precision, recall, and F1-score.
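The dataset table from the slide is not reproduced here; purely as an illustration with hypothetical counts, the computation would proceed as follows:

```python
# Hypothetical counts only (the slide's actual table is not available here):
# 20 patients truly sick, 80 not sick; the model flags 25 patients as sick,
# of which 15 are truly sick.
TP, FP = 15, 10          # flagged as sick: correct / incorrect
FN = 20 - TP             # sick patients the model missed
TN = 80 - FP             # healthy patients correctly identified

precision = TP / (TP + FP)                            # 15 / 25 = 0.60
recall = TP / (TP + FN)                               # 15 / 20 = 0.75
f1 = 2 * precision * recall / (precision + recall)    # ~0.667
print(precision, recall, f1)
```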
Data in Machine Learning
Support Vector Machine Algorithm
● Support Vector Machine, or SVM, is one of the most popular supervised learning algorithms, used for classification as well as regression problems.
● However, it is primarily used for classification problems in machine learning.
● The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes, so that we can easily put a new data point in the correct category in the future. This best decision boundary is called a hyperplane.
● SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine.
● The SVM algorithm can be used for face detection, image classification, text categorization, etc.
● Consider the diagram below, in which two different categories are classified using a decision boundary, or hyperplane.
● Example:
Suppose we see a strange cat that also has some features of dogs. If we want a model that can accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm.
We will first train our model with lots of images of cats and dogs so that it can learn the different features of cats and dogs, and then we test it with this strange creature.
The support vector machine creates a decision boundary between these two classes (cat and dog) and chooses the extreme cases (support vectors), so it will look at the extreme cases of cat and dog.
On the basis of the support vectors, it will classify it as a cat. Consider the diagram below:
● Types of SVM:
SVM can be of two types:
● Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
● Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be classified using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
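A minimal linear-SVM sketch with scikit-learn, assuming a synthetic, linearly separable toy dataset:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs: a linearly separable toy dataset.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=0)

# Linear SVM: the learned hyperplane is w.x + b = 0.
clf = SVC(kernel="linear").fit(X, y)
print("hyperplane weights w:", clf.coef_[0])
print("intercept b:", clf.intercept_[0])
print("number of support vectors per class:", clf.n_support_)
```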
● Hyperplane:
● There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of the SVM.
● The dimensionality of the hyperplane depends on the number of features present in the dataset:
● If there are 2 features (as shown in the image), the hyperplane will be a straight line.
● If there are 3 features, the hyperplane will be a 2-dimensional plane.
● We always create the hyperplane that has the maximum margin, i.e. the maximum distance to the data points.
● Support Vectors:
● The data points or vectors that are closest to the hyperplane, and that affect the position of the hyperplane, are termed support vectors.
● Since these vectors support the hyperplane, they are called support vectors.
Which hyperplane is the best?
● The best hyperplane is the one that maximizes the margin.
● The margin is the distance between the hyperplane and a few close points. These close points are the support vectors, because they control the hyperplane.
● This is basically the Maximum Margin Classifier: it maximizes the margin of the hyperplane.
● This is the best hyperplane because it reduces the generalization error the most.
● If we add new data, the Maximum Margin Classifier is the hyperplane most likely to classify the new data correctly.
● Drawback:
● This SVM requires the two classes to be completely linearly separable, which will not always be the case, so we need a solution.
● Example: consider the figure below, where the data points cannot be separated by a linear hyperplane.
● In this case, the Maximum Margin Classifier would not work.
● The soft margin is the solution: it allows for some misclassification of the data. This is known as a Soft Margin Classifier, or a Support Vector Classifier. It also attempts to maximize the margin separating the two classes. The graph below illustrates this SVM.
● The support vector classifier contains a tuning parameter that controls how much misclassification it will allow. This tuning parameter is important when looking to minimize error.
● When the tuning parameter (often denoted C, a budget for margin violations) is small, the classifier allows only a small amount of misclassification. The support vector classifier will have low bias but may not generalize well and will have high variance. We may overfit the training data if the tuning parameter is too small.
● If C is large, more misclassifications are allowed. This classifier generalizes better but may have a higher bias. When the tuning parameter is zero, no misclassification is allowed and we recover the maximum margin classifier.
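A hedged sketch of this effect on synthetic data. Note one assumption worth flagging: the C above is described as a budget for violations, whereas scikit-learn's C parameter is a misclassification penalty, so the direction is inverted (large sklearn C tolerates few violations).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=2, n_redundant=0,
                           class_sep=0.8, random_state=0)

# scikit-learn's C is a misclassification *penalty*, the inverse of the
# "budget" described above. Large sklearn C  -> few violations tolerated
# (low bias, high variance); small sklearn C -> wider, softer margin.
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C)
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"C={C:>6}: cross-validated accuracy = {acc:.3f}")
```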
● Kernels:
● The support vector classifier can fail if the data is not linearly separable. So, there is a method for handling non-linearly separable classes: the kernel trick.
● We need "to enlarge the feature space in order to accommodate a non-linear boundary between the classes".
● Kernels are functions that quantify the similarity between observations.
● Simply put, these kernels transform our data so that a linear hyperplane can be passed through it, and thus classify our data.
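A short sketch of the kernel trick on a synthetic, non-linearly separable dataset (concentric circles), comparing a linear kernel with an RBF kernel:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Concentric circles: impossible to separate with a straight line.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.1, random_state=0)

linear_acc = cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean()
rbf_acc = cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()

# The RBF kernel implicitly enlarges the feature space, so a linear
# hyperplane in that space becomes a non-linear boundary in the original one.
print(f"linear kernel accuracy: {linear_acc:.3f}")
print(f"RBF kernel accuracy:    {rbf_acc:.3f}")
```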
End of Unit-2
Thank you. Any queries?
Goodbye (આવજો)
