RV College of Engineering – Go, change the world

Introduction to Machine Learning
UNIT II
Jyoti Shetty
Selecting a model

• Case study – criminal recognition


Selecting a model

• Model – a structured representation of raw input data as a
meaningful pattern.
• Training – the process of assigning a model and fitting that
specific model to a data set is called training.
• Target function – a machine learning algorithm builds its
cognitive capability as a mathematical formulation or function,
known as the target function, based on the features of the
input data set.
• Input variables are denoted as X1, X2, …, Xn and the output
variable as Y. The relationship between X & Y is represented as
• Y = f(X) + e, where f is the target function & e is a random error term
Selecting a model

• In ML, cost functions are used to estimate how badly models are performing. Put simply, a cost
function is a measure of how wrong the model is in terms of its ability to estimate the relationship
between X and Y. This is typically expressed as a difference or distance between the predicted value
and the actual value. Example: R-squared error in linear regression.

• A loss function is a method of evaluating how well a specific algorithm models a given data point. The loss
function computes the error for a single training example, while the cost function is the average of
the loss functions over the entire training set.

• An objective function takes data & a model and returns a value; training searches for model parameters that
maximize or minimize that return value. Example: maximize the reward value in reinforcement learning, minimize the
squared error in regression.
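A minimal sketch of the loss/cost distinction using squared error (the numbers below are illustrative, not from the slides):

import numpy as np

y_actual = np.array([3.0, 5.0, 7.0])     # true values
y_predicted = np.array([2.5, 5.5, 6.0])  # model predictions
losses = (y_actual - y_predicted) ** 2   # loss: error of one example at a time
cost = losses.mean()                     # cost: average loss over the training set
print(losses, cost)                      # [0.25 0.25 1.  ] 0.5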
Selecting a model

• Predictive models – try to establish a relationship between the target feature & the predictor
features. (supervised learning)
• Predict win/loss
• Predict fraud
• Predict customer churn
• The models used for prediction of a categorical target feature are known as classification
models.
• The target feature is known as the class & the categories into which the class is divided are known as
levels.
• Example – kNN, Decision Tree, Naïve Bayes.
• The models used for prediction of a numerical target feature are known as regression models.
• Predict revenue growth
• Predict rainfall amount
• Predict demand for flu shots
• Example – linear regression, Decision Tree, neural network, support vector machines (SVM); a
short sketch contrasting the two model types follows below.
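As a concrete illustration of the two model families above, a minimal scikit-learn sketch (toy data, not from the slides) fitting one classification model (kNN) and one regression model (linear regression):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression

# Classification: categorical target (e.g. win = 1 / loss = 0)
X_cls = [[1], [2], [3], [4]]
y_cls = [0, 0, 1, 1]
knn = KNeighborsClassifier(n_neighbors=3).fit(X_cls, y_cls)
print(knn.predict([[2.5]]))   # predicted class label

# Regression: numerical target (e.g. rainfall amount)
X_reg = [[1], [2], [3], [4]]
y_reg = [2.1, 3.9, 6.0, 8.1]
lin = LinearRegression().fit(X_reg, y_reg)
print(lin.predict([[5]]))     # predicted numeric value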
Selecting a model

• A categorical feature may be required to be converted to numerical form,
• example: category malignant – 0, non-malignant category – 1, as sketched below.

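A minimal pandas sketch of this conversion (the column and category names are illustrative assumptions):

import pandas as pd

df = pd.DataFrame({"tumor": ["malignant", "non-malignant", "malignant"]})
# Map each category to a numeric code: malignant -> 0, non-malignant -> 1
df["tumor_code"] = df["tumor"].map({"malignant": 0, "non-malignant": 1})
print(df)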
• Factors considered for model selection –
• Type of learning task in hand – supervised or unsupervised
• Type of data
• Size of dataset
• Small – low-variance model, e.g. naïve Bayes; large – low-bias
model, e.g. regression
• Problem domain
• Experience of developer
Selecting a model

Descriptive models

• Descriptive modeling mainly uses unsupervised learning approaches for summarizing,
clustering, and extracting rules to answer what happened in the past.
(unsupervised learning)

• Predictive analysis, in contrast, uses machine learning approaches with the aim of
forecasting future data based on past data.
• There is no target feature or single feature of interest.
• Based on the values of all features, interesting patterns are discovered.
• Descriptive models that group together similar data instances, i.e. data instances having
similar values for different features, are called clustering models.
• Customer segmentation based on demographic, ethnic and social factors.
• Grouping music genres
• Grouping commodities in inventory
• Example – k-means, DBSCAN are used for clustering; a minimal k-means sketch follows below
• Market basket analysis – discovering rules about items frequently purchased together
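A minimal k-means clustering sketch with scikit-learn (synthetic data standing in for, say, customer attributes):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # toy data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.labels_[:10])      # cluster assigned to each instance
print(kmeans.cluster_centers_)  # discovered cluster centres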
Holdout method

• In this approach we randomly split the complete data into training and test
sets, then perform the model training on the training set and use the test set
for validation; ideally the data is split 70:30 or 80:20.
• With this approach there is a possibility of high bias if we have limited data,
because we would miss some information about the data that we have not
used for training.
• If our data is huge and our test sample and train sample have the same
distribution, then this approach is acceptable.
Holdout method

• In some cases the data is partitioned into 3 sets: train, test & validation.
• Test data is used only once, at the end, to measure performance.
• Validation data is used over iterations to improve the performance of the model.
• The problem arises when data is scarce.
• Stratified sampling – the data is split into multiple strata & a random sample is
selected from each stratum, as sketched below.
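A minimal sketch of a stratified holdout split with scikit-learn; the stratify argument keeps the class proportions of y the same in both partitions (toy data):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)               # toy feature matrix
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])   # toy labels, 50/50 classes
# stratify=y keeps the class proportions equal in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=10)
print(y_train, y_test)  # each split preserves the 50/50 ratio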

# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn

# Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import LeavePOut
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import StratifiedKFold

# Load the diabetes dataset and inspect it
dat = pd.read_csv('/content/diabetes.csv')
print(dat.shape)
dat.describe().transpose()

# Separate predictor features and the target feature
x1 = dat.drop('Outcome', axis=1).values
y1 = dat['Outcome'].values

# Holdout: 70% train / 30% test split
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(x1, y1, test_size=0.30, random_state=10)

# Fit a logistic regression model on the training set and score it on the test set
model = LogisticRegression()
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)
print("Accuracy: %.2f%%" % (result*100.0))
K-fold cross-validation method

1. Split the entire data randomly into K folds (the value of K shouldn't be too
small or too high; ideally we choose 5 to 10 depending on the data size).
A higher value of K leads to a less biased model (but large variance
might lead to overfit), whereas a lower value of K is similar to the
train-test split approach we saw before.

2. Then fit the model using K-1 (K minus 1) folds and validate the
model using the remaining Kth fold. Note down the scores/errors.

3. Repeat this process until every fold has served as the test set. Then take
the average of the recorded scores. That will be the performance metric
for the model.
Leave-one-out cross-validation method

Leave-one-out cross-validation is K-fold cross-validation taken to its logical extreme, with K equal to N, the
number of data points in the set. That means that, N separate times, the function approximator is trained on all
the data except for one point, and a prediction is made for that point. As before, the average error is computed
and used to evaluate the model. The evaluation given by the leave-one-out cross-validation error (LOO-XVE) is
good, but at first pass it seems very expensive to compute.
K-fold cross-validation method

# shuffle=True is required when random_state is set; newer scikit-learn
# versions raise an error otherwise
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=100)
model_kfold = LogisticRegression()
# cross_val_score fits and scores the model once per fold
results_kfold = model_selection.cross_val_score(model_kfold, x1, y1, cv=kfold)
print("Accuracy: %.2f%%" % (results_kfold.mean()*100.0))
Bootstrap sampling

• The bootstrap method involves iterative simple random sampling
with replacement (SRSWR).
• Given an input dataset with n instances, bootstrapping can create one or more
training datasets, each with n instances, where some instances are
repeated within them.
• Useful for small datasets, as sketched below.
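A minimal sketch of one bootstrap sample using scikit-learn's resample utility (toy data):

import numpy as np
from sklearn.utils import resample

data = np.arange(10)  # toy dataset with n = 10 instances
# Sample n instances with replacement: some appear twice, some not at all
boot = resample(data, replace=True, n_samples=len(data), random_state=1)
oob = np.setdiff1d(data, boot)  # "out-of-bag" instances never drawn
print(boot)
print(oob)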
Stratified k-fold cross-validation method

# Stratified k-fold keeps the class proportions of y1 in every fold;
# shuffle=True is required when random_state is set
skfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=100)
model_skfold = LogisticRegression()
results_skfold = model_selection.cross_val_score(model_skfold, x1, y1, cv=skfold)
print("Accuracy: %.2f%%" % (results_skfold.mean()*100.0))
Leave-one-out cross-validation method

# One fold per data point (K = N); accurate but expensive for large datasets
loocv = model_selection.LeaveOneOut()
model_loocv = LogisticRegression()
results_loocv = model_selection.cross_val_score(model_loocv, x1, y1, cv=loocv)
print("Accuracy: %.2f%%" % (results_loocv.mean()*100.0))

Lazy learning

• Lazy learning skips abstraction and generalization
• Lazy learning discovers patterns
• Lazy learning is also known as rote learning, instance learning, or
non-parametric learning
Parametric and non-parametric models in machine learning
Model representation & interpretability

Underfitting
• If the target function is too simple, it might not be able to capture the nuances
of the training data well.
• Ex – representing non-linear data using a linear model
• The reason may be unavailability of sufficient training data
• Underfitting results in poor performance on training data as well as poor
performance on test data
• Underfitting can be avoided by
• Using more training data
• Increasing the number of features of the training data
Model representation & interpretability

Overfitting
• Refers to a situation where the model emulates the training data too closely (memorizing)
• Outliers or noise in the training data get embedded in the model
• The reason may be trying to fit a complex model that maps too closely to the training data
• Overfitting results in good performance on training data but poor performance on test data
• Overfitting can be avoided by
• Cross-validation. Cross-validation is a powerful preventative measure against overfitting.
• Training with more data. It won't work every time, but training with more data can help
algorithms detect the signal better.
• Removing features.
• Early stopping.
• Regularization.
• Ensembling.
Bias
• Bias is the difference between the predicted value and the actual value.
• Bias is one type of error which occurs due to wrong assumptions about the
data, such as assuming the data is linear when in reality it follows a
complex function.
• It occurs when the model pays little attention to the underlying data. Even if
we provide more training samples, a high-bias model barely changes.
• It basically amounts to underfitting of the data.
• Linear regression, logistic regression, etc. suffer from high bias
Variance
• Variance is the difference in the model's predictions on the train and test
data.
• It is the difference between many models' (fluctuating) predictions.
• If the model performs well on the training data (low training error) but
fails to generalize well to the test data (high test error), it is
said to have high variance.
• This essentially means that the model is overfitting the training data.
• It can be because the model captures random noise in the data, or
• it has not seen enough data in training to generalize well to
unseen examples. A minimal sketch of detecting this train/test gap follows below.
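A minimal sketch of spotting high variance by comparing train and test scores (toy synthetic data; an unconstrained decision tree is used as the assumed high-variance model):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # no depth limit
# A large gap between the two scores signals high variance (overfitting)
print("Train accuracy:", tree.score(X_tr, y_tr))  # usually 1.0
print("Test accuracy:", tree.score(X_te, y_te))   # noticeably lower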
Bias – variance combinations
• High Bias (UNDESIRABLE): this type of model always predicts the
same output every time or takes a random guess at prediction.
• Low Bias/High Variance (UNDESIRABLE): a model that overfits the
training data. This performs poorly on unseen data.
• Low Bias/Low Variance (DESIRABLE): a model that almost always
gives the best results. This model performs well on seen and unseen
examples.
Bias – Variance Dilemma
• To decrease the bias we need to increase the model complexity,
which means the variance increases.
• Lowering the variance means the model may not fit the data well, and
hence the bias increases.
• An optimal model is one that makes a good trade-off between the two.
• To avoid underfitting (high bias)
• Try to increase the complexity of the model
• Train for a longer time
• Try to increase the number of features by finding new features or making new
features from the existing ones
• Decrease regularization
• Use a different model
• To avoid overfitting (high variance), try the following –
• Increase the training data (collect more data / augment the training dataset)
• Use a less complex model
• Early stopping (reduce epochs)
• Remove features (not recommended)
• Ensembling
• L1/L2 regularization to simplify your model (a minimal sketch follows below)
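A minimal sketch of L2 regularization using scikit-learn's Ridge regression (an assumed choice for illustration); alpha controls the penalty strength:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=50)  # only feature 0 matters

# Larger alpha = stronger L2 penalty = smaller, smoother coefficients
ridge = Ridge(alpha=1.0).fit(X, y)
print(ridge.coef_.round(2))  # noise features shrink toward zero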
Evaluating performance of a model – supervised learning (classification)
• A true positive is an outcome where the model correctly predicts the positive class. Similarly,
• a true negative is an outcome where the model correctly predicts the negative class.
• A false positive is an outcome where the model incorrectly predicts the positive class.
• And a false negative is an outcome where the model incorrectly predicts the negative class.

1. the model predicted win and the team won (true positive)
2. the model predicted win and the team lost (false positive)
3. the model predicted loss and the team won (false negative)
4. the model predicted loss and the team lost (true negative)
Confusion matrix
• A matrix containing correct and incorrect predictions in the form of
TPs, FPs, FNs and TNs is known as a confusion matrix.

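A minimal scikit-learn sketch of building a confusion matrix from hypothetical win/loss predictions (1 = win, 0 = loss):

from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 1, 0, 0]  # what really happened
y_predicted = [1, 0, 0, 1, 1, 1, 0, 0]  # what the model predicted
# Rows = actual class, columns = predicted class: [[TN, FP], [FN, TP]]
print(confusion_matrix(y_actual, y_predicted))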
Sensitivity and specificity
The sensitivity of a model measures the proportion of
positive examples which were correctly
classified:

    sensitivity = TP / (TP + FN)

The specificity of a model measures the proportion of
negative examples which have been correctly classified:

    specificity = TN / (TN + FP)
F-measure is another measure of model performance
which combines precision and recall. It takes the
harmonic mean of precision and recall, calculated as

    F-measure = 2 × (precision × recall) / (precision + recall)

where precision = TP / (TP + FP) and recall = TP / (TP + FN). A minimal
sketch of these metrics follows below.

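A minimal sketch computing precision, recall and F-measure with scikit-learn on the same hypothetical predictions as above:

from sklearn.metrics import precision_score, recall_score, f1_score

y_actual    = [1, 0, 1, 1, 0, 1, 0, 0]
y_predicted = [1, 0, 0, 1, 1, 1, 0, 0]
print("Precision: %.2f" % precision_score(y_actual, y_predicted))  # TP / (TP + FP)
print("Recall:    %.2f" % recall_score(y_actual, y_predicted))     # TP / (TP + FN)
print("F-measure: %.2f" % f1_score(y_actual, y_predicted))         # harmonic mean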
• A logistic regression model that returns 0.9995 for a particular email
message is predicting that it is very likely to be spam. Conversely,
another email message with a prediction score of 0.0003 on that
same logistic regression model is very likely not spam. However, what
about an email message with a prediction score of 0.6? In order to
map a logistic regression value to a binary category, you must define
a classification threshold (also called the decision threshold). A value
above that threshold indicates "spam"; a value below indicates "not
spam." It is tempting to assume that the classification threshold
should always be 0.5, but thresholds are problem-dependent, and are
therefore values that you must tune.
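A minimal sketch of applying a classification threshold to probability scores (the scores are the examples from the text above):

import numpy as np

scores = np.array([0.9995, 0.0003, 0.6])  # predicted spam probabilities
threshold = 0.5  # problem-dependent; must be tuned, not assumed
labels = np.where(scores >= threshold, "spam", "not spam")
print(labels)  # ['spam' 'not spam' 'spam']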
Further reading:
• https://developers.google.com/machine-learning/crash-course/classification/check-your-understanding-accuracy-precision-recall
• https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
• https://developers.google.com/machine-learning/crash-course/classification/check-your-understanding-roc-and-auc
Supervised learning – regression

Residual = distance between the actual value (Y) and the predicted value (Ŷ), i.e. e = Y − Ŷ
Linear Regression – R Squared
R-squared is a good measure to evaluate model fitness. It is also known as the coefficient of
determination or, for multiple regression, the coefficient of multiple determination. The R-
squared value lies between 0 and 1 (0%–100%), with a larger value representing a better fit. It is
calculated as

    R² = 1 − (SS_res / SS_tot)

where SS_res = Σ(yᵢ − ŷᵢ)² is the sum of squared residuals and SS_tot = Σ(yᵢ − ȳ)² is the total
sum of squares about the mean.
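A minimal sketch of computing R-squared with scikit-learn (toy numbers):

from sklearn.metrics import r2_score

y_actual = [3.0, 5.0, 7.0, 9.0]
y_predicted = [2.8, 5.1, 7.2, 8.7]
print("R-squared: %.3f" % r2_score(y_actual, y_predicted))  # close to 1 = good fit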
Clustering – internal evaluation
The internal evaluation methods generally measure cluster quality based on the homogeneity of
data belonging to the same cluster and the heterogeneity of data belonging to different clusters.
The homogeneity/heterogeneity is decided by some similarity measure, as in the sketch below.

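As one example of such a similarity-based internal measure, a minimal sketch of the silhouette coefficient (an assumed choice; the slides do not name a specific measure):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # toy data
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
# Closer to 1 = homogeneous clusters that are well separated from each other
print("Silhouette: %.3f" % silhouette_score(X, labels))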
Clustering – external evaluation
• Purity: purity is a measure of the extent to which clusters contain a
single class. Its calculation can be thought of as follows: for each
cluster, count the number of data points from the most common class
in said cluster. Now take the sum over all clusters and divide by the
total number of data points. Formally, given some set of clusters M
and some set of classes D, both partitioning N data points, purity can
be defined as

    purity(M, D) = (1/N) Σ_{m∈M} max_{d∈D} |m ∩ d|

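A minimal sketch of this purity computation on toy cluster assignments:

from collections import Counter

clusters = [0, 0, 0, 1, 1, 1, 2, 2]                  # cluster of each point
classes  = ['a', 'a', 'b', 'b', 'b', 'b', 'a', 'a']  # true class of each point

total = 0
for c in set(clusters):
    members = [cls for cl, cls in zip(clusters, classes) if cl == c]
    total += Counter(members).most_common(1)[0][1]   # count of most common class
print("Purity: %.3f" % (total / len(clusters)))      # (2 + 3 + 2) / 8 = 0.875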
Improving performance of a model
• Model selection is done based on several aspects:
• Type of learning task in hand – supervised or unsupervised
• Type of data – i.e. categorical or numeric
• Problem domain
• Experience
• Model parameter tuning – the process of adjusting model fitting options
• e.g. the K value in kNN; the number of hidden layers can be adjusted to tune
performance in a neural network model
• Ensemble approach – to improve performance, combine
different models with diverse strengths
• Ensembling helps in averaging out biases & reducing variance
• Combine weak learners to create a strong learner

Improving performance of a model
• Bagging, or bootstrap aggregating, uses bootstrap sampling to
generate multiple training datasets. Multiple models are created using the
same machine learning algorithm on the bootstrap samples. The
outcomes of the models are combined using majority voting or averaging, as sketched below.

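A minimal bagging sketch with scikit-learn, reusing the x1/y1 arrays from the earlier diabetes example (assumed still in scope):

from sklearn.ensemble import BaggingClassifier

# 50 decision trees (the default base estimator), each trained on a
# bootstrap sample; predictions are combined by majority voting
bagging = BaggingClassifier(n_estimators=50, random_state=100)
results_bag = model_selection.cross_val_score(bagging, x1, y1, cv=10)
print("Accuracy: %.2f%%" % (results_bag.mean()*100.0))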
• In boosting algorithms, each classifier is trained on the data taking into
account the previous classifiers' success. After each training step, the
weights are redistributed: misclassified data increases its weight to
emphasise the most difficult cases, so that subsequent learners
focus on them during their training. Ex – Adaptive Boosting (AdaBoost),
Gradient Boosting, XGBoost. A minimal AdaBoost sketch follows below.

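A minimal AdaBoost sketch on the same assumed x1/y1 data:

from sklearn.ensemble import AdaBoostClassifier

# Each successive learner focuses on the examples the previous ones misclassified
ada = AdaBoostClassifier(n_estimators=50, random_state=100)
results_ada = model_selection.cross_val_score(ada, x1, y1, cv=10)
print("Accuracy: %.2f%%" % (results_ada.mean()*100.0))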
• Random forests or random decision forests are an ensemble
learning method for classification, regression and other tasks that
operates by constructing a multitude of decision trees at training time
and outputting the class that is the mode of the classes (classification)
or the mean/average prediction (regression) of the individual trees. A
minimal sketch follows below.

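A minimal random forest sketch on the same assumed x1/y1 data:

from sklearn.ensemble import RandomForestClassifier

# Each tree sees a bootstrap sample and a random subset of features per split
rf = RandomForestClassifier(n_estimators=100, random_state=100)
results_rf = model_selection.cross_val_score(rf, x1, y1, cv=10)
print("Accuracy: %.2f%%" % (results_rf.mean()*100.0))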
Improving performance of a model
• The fundamental difference between bagging and random forest is
that in random forests only a subset of features is selected
at random out of the total, and the best split feature from that subset
is used to split each node in a tree, unlike in bagging, where all
features are considered for splitting a node.

The Kappa value of a model indicates the adjusted model
accuracy, i.e. the accuracy corrected for agreement expected by chance. It is
calculated using the formula below:

    kappa = (P(o) − P(e)) / (1 − P(e))

where P(o) is the observed accuracy and P(e) is the expected (chance) agreement.

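A minimal sketch with scikit-learn's cohen_kappa_score on the hypothetical predictions used earlier:

from sklearn.metrics import cohen_kappa_score

y_actual    = [1, 0, 1, 1, 0, 1, 0, 0]
y_predicted = [1, 0, 0, 1, 1, 1, 0, 0]
# 1.0 = perfect agreement, 0.0 = no better than chance
print("Kappa: %.3f" % cohen_kappa_score(y_actual, y_predicted))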