Unit 3 Modelling and Evaluation

Silver Oak College Of Engineering And Technology
Unit 3:
Modeling and Evaluation
1 Prof. Monali Suthar (SOCET-CE)

Outline
 Selecting a Model: Predictive/Descriptive,
 Training a Model for supervised learning,
 Model representation and interpretability,
 Evaluating performance of a model,
 Improving performance of a model.

Model Selection
 Model selection is the process of selecting one final
machine learning model from among a collection of
candidate machine learning models for a training dataset.
 Model selection can be applied both across different types
of models (e.g. logistic regression, SVM, KNN, etc.) and
across models of the same type configured with different
hyperparameters.
 Model selection is the process of choosing one of the
models as the final model that addresses the problem.
 Model selection is different from model assessment.
What is a good enough model?

Model Selection
 A “good enough” model may refer to many things and is
specific to your project, such as:
1. A model that meets the requirements and constraints of
project stakeholders.
2. A model that is sufficiently skillful given the time and
resources available.
3. A model that is skillful as compared to naive models.
4. A model that is skillful relative to other tested models.
5. A model that is skillful relative to the state-of-the-art.

Model Selection Techniques
 The best approach to model selection requires “sufficient”
data, which may be nearly infinite depending on the
complexity of the problem.
 There are two main classes of techniques to approximate
the ideal case of model selection; they are:
1. Probabilistic Measures: Choose a model via in-sample error
and complexity.
2. Resampling Methods: Choose a model via estimated out-of-
sample error.

 Probabilistic Measures
 Probabilistic measures involve analytically scoring a candidate model using both its
performance on the training dataset and the complexity of the model.
 It is known that training error is optimistically biased, and therefore is not a good basis
for choosing a model.
 The performance can be penalized based on how optimistic the training error is
believed to be.
 This is typically achieved using algorithm-specific methods, often linear, that penalize
the score based on the complexity of the model.
 A model with fewer parameters is less complex and is therefore preferred, because it is
likely to generalize better on average.
 Four commonly used probabilistic model selection measures include:
 Akaike Information Criterion (AIC).
 Bayesian Information Criterion (BIC).
 Minimum Description Length (MDL).
 Structural Risk Minimization (SRM).
 Probabilistic measures are appropriate when using simpler linear models like linear
regression or logistic regression, where the calculation of the model complexity penalty
(e.g. in-sample bias) is known and tractable.
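As a rough illustration of how such a score trades fit against complexity, the sketch below computes AIC and BIC for two least-squares candidates. It assumes the common Gaussian-error form with additive constants dropped, AIC = n ln(RSS/n) + 2k and BIC = n ln(RSS/n) + k ln(n). The helper name `aic_bic`, the choice to count only the fitted coefficients in k, and the reuse of the pizza data from the example in this unit are assumptions made for illustration.

```python
import math

def aic_bic(residuals, n_params):
    """Return (AIC, BIC) for a least-squares fit, up to an additive constant.

    Assumes Gaussian errors; n_params counts the fitted coefficients only.
    """
    n = len(residuals)
    rss = sum(r ** 2 for r in residuals)
    fit_term = n * math.log(rss / n)          # in-sample error
    aic = fit_term + 2 * n_params             # complexity penalty: 2k
    bic = fit_term + n_params * math.log(n)   # complexity penalty: k ln(n)
    return aic, bic

# Pizza diameters (inches) and prices (dollars) from the example in this unit.
diameters = [6, 8, 10, 14, 18]
prices = [7, 9, 13, 17.5, 18]
n = len(prices)
mean_x = sum(diameters) / n
mean_y = sum(prices) / n

# Candidate 1: intercept-only model (always predicts the mean price), k = 1.
const_residuals = [y - mean_y for y in prices]

# Candidate 2: straight line fitted by least squares, k = 2.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(diameters, prices))
         / sum((x - mean_x) ** 2 for x in diameters))
intercept = mean_y - slope * mean_x
line_residuals = [y - (slope * x + intercept) for x, y in zip(diameters, prices)]

aic_const, bic_const = aic_bic(const_residuals, 1)
aic_line, bic_line = aic_bic(line_residuals, 2)
print(f"intercept-only: AIC={aic_const:.2f}  BIC={bic_const:.2f}")
print(f"straight line:  AIC={aic_line:.2f}  BIC={bic_line:.2f}")
```

Lower scores are preferred. On this data the straight line scores lower on both criteria even though it pays the larger complexity penalty.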
 Resampling Methods
 Resampling methods seek to estimate the performance of a model (or more precisely, the
model development process) on out-of-sample data.
 This is achieved by splitting the training dataset into sub train and test sets, fitting a
model on the sub train set, and evaluating it on the test set.
 This process may then be repeated multiple times and the mean performance across each
trial is reported.
 It is a type of Monte Carlo estimate of model performance on out-of-sample data,
although each trial is not strictly independent as depending on the resampling method
chosen, the same data may appear multiple times in different training datasets, or test
datasets.
 Three common resampling model selection methods include:
 Random train/test splits.
 Cross-Validation (k-fold, LOOCV, etc.).
 Bootstrap.
 Most of the time probabilistic measures (described in the previous section) are not
available, therefore resampling methods are used.
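The k-fold variant of this idea can be sketched in plain Python as follows. The data are hypothetical (invented for illustration), and the helper names `k_fold_indices`, `fit_line` and `cross_val_mse` are assumptions, not part of any library. Folds are taken contiguously here for simplicity; in practice the rows are usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous, non-overlapping folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

def fit_line(xs, ys):
    """Least-squares slope and intercept for one input variable."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def cross_val_mse(xs, ys, k):
    """Mean and per-fold mean squared error, holding each fold out once."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        slope, intercept = fit_line([xs[i] for i in train_idx],
                                    [ys[i] for i in train_idx])
        errors = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in test_idx]
        scores.append(sum(errors) / len(errors))
    return sum(scores) / len(scores), scores

# Hypothetical data, invented for illustration (roughly y = 2x + 1 plus noise).
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [3.1, 4.9, 7.2, 9.1, 10.8, 13.2, 14.9, 17.1, 19.0, 20.8]

mean_score, fold_scores = cross_val_mse(xs, ys, k=5)
print("per-fold MSE:", [round(s, 4) for s in fold_scores])
print("mean MSE:", round(mean_score, 4))
```

The mean across folds is the estimate of out-of-sample error that would be compared between candidate models.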

Types of model
 Predictive Analytics
 Predictive analytics makes statements about future outcomes rather than
current behavior.
 It helps an organization know what might happen next by predicting the
future from the data currently available.
 It analyzes the data and makes statements about events that have not
happened yet. All such predictions are probabilistic in nature.
 It uses supervised learning functions to predict the target value.
 The methods that fall under this category of mining are classification,
time-series analysis and regression.
 Modeling the data is a prerequisite for predictive analysis, which works by
using the known values of some variables to predict the unknown or future
values of other variables.

Predictive Model
 Different predictive models have been developed for specific
purposes:
1. Forecast models
2. Classification models
3. Outlier detection models
4. Time series model
5. Clustering Model
6. Neural Network algorithms
7. Decision Trees Algorithms

PREDICTIVE MODELLING PROCESS
 The process involves running algorithms on the data set on which the prediction is
to be made.
 Several models are trained on the same data set, and the one that fits best, judged
against the business understanding of the data, is finally chosen.
 The predictive models’ category includes predictive, descriptive, and decision models.
 The predictive modelling process goes as follows:
1. Pre-processing.
2. Data mining.
3. Results validation.
4. Understand business & data.
5. Prepare data.
6. Model data.
7. Evaluation.
8. Deployment.
9. Monitor & improve.

PREDICTIVE MODELLING
 FEATURES :
1. Data analysis & manipulation: Create new data sets, tools for data
analysis, categorize, club, merge and filter data sets.
2. Visualization: This includes interactive graphics and reports.
3. Statistics: To confirm and create relationships between variables in the
data.
4. Hypothesis testing: Creating models, evaluating and choosing the right
models.
 Limitations
1. Errors in data labeling
2. Shortage of the massive data sets needed to train machine learning models
3. The machine’s inability to explain what and why it did what it did
4. Generalizability of learning, or rather lack thereof
5. Bias in data and algorithms

Types of model
 Descriptive Analytics
 Descriptive analytics helps an organization know what has happened in the
past, using the data it has stored.
 A company needs to know about past events in order to make decisions based
on statistics from historical data.
 It is typically used to produce correlations, cross-tabulations, frequencies,
etc.
 These techniques are used to determine similarities in the data and to find
existing patterns.
 Another application of descriptive analysis is to identify interesting
subgroups within the bulk of the available data.
 This kind of analytics emphasizes the summarization and transformation of
data into meaningful information for reporting and monitoring.
 For example, you might want to know how much money you lost due to fraud.

Descriptive Model
 Descriptive analysis mainly uses unsupervised learning approaches
for summarizing, classifying, and extracting rules to answer what
happened in the past.
 Descriptive models differ in nature from predictive models in that
they do not need to be as accurate.
 Predictions concern a potential future event, and the business wants
to exploit that knowledge and act on the predictions, so the
reliability of a prediction matters a lot.
 A descriptive model describes data in clusters or association rules,
so it only needs to be approximate, not exact.
 Ex: association rules or market-basket analysis, clustering, feature
extraction.
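To make the association-rule example concrete, the sketch below computes the two standard rule measures, support and confidence, for a tiny market-basket data set. The transactions are hypothetical (invented for illustration) and the helper names `support` and `confidence` are assumptions, not a library API.

```python
# Hypothetical shopping baskets, invented purely for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(f"support(bread, milk)      = {rule_support:.2f}")
print(f"confidence(bread -> milk) = {rule_confidence:.2f}")
```

The rule only summarizes what was bought together in the stored baskets; it makes no claim about future purchases, which is the sense in which the model is descriptive.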

Descriptive Vs Predictive
Basis of comparison, with the Descriptive and Predictive entry for each:

Basic
 Descriptive: It determines what happened in the past by analyzing stored data.
 Predictive: It determines what can happen in the future with the help of past data analysis.

Preciseness
 Descriptive: It provides accurate data.
 Predictive: It produces results but does not ensure accuracy.

Practical analysis methods
 Descriptive: Standard reporting, query/drill down and ad-hoc reporting.
 Predictive: Predictive modelling, forecasting, simulation and alerts.

Require
 Descriptive: It requires data aggregation and data mining.
 Predictive: It requires statistics and forecasting methods.

Type of approach
 Descriptive: Reactive approach
 Predictive: Proactive approach

Describe
 Descriptive: Describes the characteristics of the data in a target data set.
 Predictive: Carries out induction over the current and past data so that predictions can be made.

Methods (in general)
 Descriptive: What happened? Where exactly is the problem? What is the frequency of the problem?
 Predictive: What will happen next? What is the outcome if these trends continue? What actions are required to be taken?
Train a Supervised Machine Learning Model
 Steps Involved in Supervised Learning:
1. First, determine the type of training dataset.
2. Collect/gather the labelled training data.
3. Split the dataset into a training dataset, a test dataset, and a
validation dataset.
4. Determine the input features of the training dataset, which should carry
enough information for the model to predict the output accurately.
5. Determine a suitable algorithm for the model, such as support vector
machine, decision tree, etc.
6. Execute the algorithm on the training dataset. Sometimes a validation
set, which is a subset of the training data, is needed to tune the
control parameters.
7. Evaluate the accuracy of the model on the test set. If the model
predicts the correct outputs, the model is accurate.
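The splitting step can be sketched as below. The 70/15/15 proportions, the fixed seed, the helper name `train_val_test_split`, and the toy labelled data are all assumptions chosen for illustration.

```python
import random

def train_val_test_split(rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle rows, then cut them into train, validation and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Hypothetical labelled data: (input feature, label) pairs.
data = [(x, 2 * x + 1) for x in range(100)]

train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))
```

The model is fitted on `train`, control parameters are tuned against `val`, and `test` is touched only once, for the final accuracy estimate.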

Supervised Machine Learning
 Regression
1. Linear Regression
2. Regression Trees
3. Non-Linear Regression
4. Bayesian Linear Regression
5. Polynomial Regression
 Classification
 Random Forest
 Decision Trees
 Logistic Regression
 Support vector Machines

Predictive model
Predictive modeling is the process of taking known results
and developing a model that can predict values for new
occurrences.
It uses historical data to predict future events.
There are many different types of predictive modeling
techniques including linear regression (ordinary least
squares), logistic regression, ridge regression, time series,
decision trees, and neural networks.
Application of Predictive method
Process of Predictive model
Step 1: Data collection and purification: Data is
accumulated from all the sources, and the required
information is extracted by cleaning the data with operations
that eliminate noisy data, so that estimates are accurate.
Sources include transaction and customer assistance data,
survey data and economic data.
Step 2: Data transformation: Data needs to be transformed
through appropriate processing to obtain normalized data.
The values are scaled into a given range, and extraneous
elements are removed by correlation analysis so that the
final decision can be made.
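One common way to scale values into a given range is min-max scaling, sketched below. The choice of min-max scaling, the helper name `min_max_scale`, and the default range [0, 1] are assumptions; the slides do not say which normalization is intended. The input is the list of pizza diameters used in the example in this unit.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to new_min.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]

# Pizza diameters in inches, from the example in this unit.
diameters = [6, 8, 10, 14, 18]

scaled = min_max_scale(diameters)
print([round(v, 3) for v in scaled])
```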
Step 3: Formulation of the predictive model: A predictive
model is typically designed using regression techniques or a
classification algorithm. During this process, test data is
set aside, and the model's decisions on the test data are
used to determine the performance of the model.
Step 4: Performance analysis or conclusion: Finally,
inferences are drawn from the model; for this, cluster
analysis is performed. After the model is built, ongoing
analysis is important for maintaining it.
Steps in building regression model
STEP 1: Collect/Extract Data
The first step in building a regression model is to collect or extract data on the dependent
(outcome) variable and independent (feature) variables from different data sources. Data
collection in many cases can be time-consuming and expensive, even when the organization has
a well-designed enterprise resource planning (ERP) system.
STEP 2: Pre-Process the Data
Before the model is built, it is essential to check the quality of the data for issues such as
reliability, completeness, usefulness, accuracy, missing data, and outliers.
1. Data imputation techniques may be used to deal with missing data. Descriptive statistics
and visualization (such as box plots and scatter plots) may be used to identify outliers and
variability in the dataset.
2. Many new variables (such as the ratio or product of variables) can be derived (aka
feature engineering) and also used in model building.
3. Categorical data must be pre-processed using dummy variables (part of feature engineering)
before it is used in the regression model.
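Two of these pre-processing steps, imputation and dummy variables, can be sketched as follows. Mean imputation is only one of several possible techniques and is an assumption here, as are the helper names `impute_mean` and `dummy_encode` and the two hypothetical columns. One category level is dropped as the baseline, which is the usual way to avoid perfectly collinear dummy columns in a regression with an intercept.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def dummy_encode(categories, drop_first=True):
    """Return one 0/1 column per category level, optionally dropping the first level."""
    levels = sorted(set(categories))
    if drop_first:
        levels = levels[1:]   # the dropped level becomes the baseline
    return {level: [1 if c == level else 0 for c in categories] for level in levels}

# Hypothetical columns, invented for illustration.
prices = [7.0, None, 13.0, 17.5, 18.0]
crust = ["thin", "thick", "thin", "stuffed", "thick"]

print(impute_mean(prices))
print(dummy_encode(crust))
```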
STEP 3: Dividing Data into Training and Validation Datasets

In this stage the data is divided into two subsets (sometimes more than two subsets): training
dataset and validation or test dataset. The proportion of training dataset is usually between 70%
and 80% of the data and the remaining data is treated as the validation data.
STEP 4: Perform Descriptive Analytics or Data Exploration

It is always good practice to perform descriptive analytics before building a predictive
analytics model. Descriptive statistics help us understand the variability in the data, and
visualization through, say, a box plot will show whether there are any outliers in the
data.
STEP 5: Build the Model
The model is built using the training dataset to estimate the regression parameters. The method of
Ordinary Least Squares (OLS) is used to estimate the regression parameters.
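For a single input variable, the OLS estimates have a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. The sketch below applies this to the pizza data from the example in this unit; the helper name `fit_simple_ols` is an assumption.

```python
def fit_simple_ols(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Pizza diameters (inches) and prices (dollars) from the example in this unit.
diameters = [6, 8, 10, 14, 18]
prices = [7, 9, 13, 17.5, 18]

slope, intercept = fit_simple_ols(diameters, prices)
print(f"price = {slope:.4f} * diameter + {intercept:.4f}")
print(f'A 12" pizza should cost: ${slope * 12 + intercept:.2f}')
```

The predicted price for a 12" pizza agrees with the $13.68 reported for the scikit-learn model in this unit.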
STEP 6: Perform Model Diagnostics
Regression is often misused because the modeler frequently fails to perform the necessary
diagnostic tests before applying the model. Before it can be applied, the model must be
validated against all model assumptions, including the definition of the functional form.
If the model assumptions are violated, then the modeler must use remedial measures.
STEP 7: Validate the Model and Measure Model Accuracy

A major concern in analytics is over-fitting, that is, the model may perform very well on the
training dataset but badly on the validation dataset. It is important to ensure that the
model performance on the validation dataset is consistent with that on the training dataset.
In fact, the model may be cross-validated using multiple training and test datasets.
Linear regression model
 Linear regression is a quite simple statistical regression method used for
predictive analysis; it shows the relationship between continuous variables.
 Linear regression shows the linear relationship between the independent variable
(X-axis) and the dependent variable (Y-axis), hence the name linear regression.
 If there is a single input variable (x), it is called simple linear
regression. If there is more than one input variable, it is called
multiple linear regression.
 The linear regression model gives a sloped straight line describing the relationship
between the variables.
Cost function
A cost function, also called a loss function, is used to define and measure the error of a model. The
differences between the prices predicted by the model and the observed prices of the pizzas in the
training set are called residuals or training errors.
The cost function is used to optimize the regression coefficients or weights, and it measures how
well a linear regression model is performing. It is used to find the accuracy of the mapping
function that maps the input variable to the output variable. This mapping function is also known
as the hypothesis function.
In linear regression, the Mean Squared Error (MSE) cost function is used, which is the average of
the squared errors between the predicted values and the actual values.
For the simple linear equation y = mx + b, we can calculate MSE as:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - (m x_i + b) \right)^2$$

where N is the number of observations, y_i is the actual value, and m x_i + b is the predicted value.
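As a small sketch of the MSE cost function, the code below averages the squared differences between the actual prices and the prices predicted by a line y = mx + b, using the pizza data from the example in this unit. The two candidate lines are hypothetical, picked only to show that different coefficients give different costs; the helper name `mean_squared_error` is an assumption.

```python
def mean_squared_error(xs, ys, m, b):
    """Average squared difference between actual ys and the predictions m*x + b."""
    n = len(xs)
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys)) / n

# Pizza diameters (inches) and prices (dollars) from the example in this unit.
diameters = [6, 8, 10, 14, 18]
prices = [7, 9, 13, 17.5, 18]

# Two hypothetical candidate lines.
mse_through_origin = mean_squared_error(diameters, prices, m=1.0, b=0.0)
mse_shifted = mean_squared_error(diameters, prices, m=1.0, b=2.0)
print(f"m=1.0, b=0.0 -> MSE = {mse_through_origin:.2f}")
print(f"m=1.0, b=2.0 -> MSE = {mse_shifted:.2f}")
```

The line with the lower MSE fits the training data better; training a linear regression amounts to searching for the m and b that make this value as small as possible.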
EXAMPLE:
 Let's assume that you have recorded the diameters and prices of pizzas that
you have previously eaten in your pizza journal. These observations comprise
our training data.
import matplotlib.pyplot as plt

X = [[6], [8], [10], [14], [18]]
y = [[7], [9], [13], [17.5], [18]]
plt.figure()
plt.title('Pizza price plotted against diameter')
plt.xlabel('Diameter in inches')
plt.ylabel('Price in dollars')
plt.plot(X, y, 'k.')
plt.axis([0, 25, 0, 25])
plt.grid(True)
plt.show()
EXAMPLE:
from sklearn.linear_model import LinearRegression
# Training data
X = [[6], [8], [10], [14], [18]]
y = [[7], [9], [13], [17.5], [18]]
# Create and fit the model
model = LinearRegression()
model.fit(X, y)
# predict() expects a 2D array of samples and, because y is 2D, returns a 2D array
print('A 12" pizza should cost: $%.2f' % model.predict([[12]])[0][0])
A 12" pizza should cost: $13.68
EVALUATING THE FITNESS OF MODEL
We can produce the best pizza-price predictor by minimizing the sum of the
squared residuals. That is, our model fits if the values it predicts for the
response variable are close to the observed values for all of the training
examples. This measure of the model's fitness is called the residual sum of
squares cost function. Formally, this function assesses the fitness of a
model by summing the squared residuals for all of our training examples. The
residual sum of squares is calculated with the formula in the following
equation, where y_i is the observed value and f(x_i) is the predicted value:

$$SS_{res} = \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2$$
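The sketch below computes the residual sum of squares for the least-squares line on the pizza training data and compares it with a few slightly nudged lines. The nudged lines and their offsets are hypothetical, and the helper name `residual_sum_of_squares` is an assumption.

```python
def residual_sum_of_squares(xs, ys, slope, intercept):
    """Sum of squared differences between observed ys and slope*x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Pizza diameters (inches) and prices (dollars) from the example in this unit.
diameters = [6, 8, 10, 14, 18]
prices = [7, 9, 13, 17.5, 18]

# Least-squares slope and intercept for a single input variable.
n = len(diameters)
mean_x = sum(diameters) / n
mean_y = sum(prices) / n
best_slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(diameters, prices))
              / sum((x - mean_x) ** 2 for x in diameters))
best_intercept = mean_y - best_slope * mean_x
best_rss = residual_sum_of_squares(diameters, prices, best_slope, best_intercept)
print(f"least-squares line: RSS = {best_rss:.4f}")

# Hypothetical alternatives: nudge the slope or the intercept a little.
nudged = [
    (best_slope + 0.1, best_intercept),
    (best_slope - 0.1, best_intercept),
    (best_slope, best_intercept + 0.5),
    (best_slope, best_intercept - 0.5),
]
for slope, intercept in nudged:
    rss = residual_sum_of_squares(diameters, prices, slope, intercept)
    print(f"slope={slope:.3f}, intercept={intercept:.3f}: RSS = {rss:.4f}")
```

Every nudged line has a larger residual sum of squares than the least-squares line, which is what minimizing this cost function means.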
EVALUATING THE MODEL
R-squared measures how well the observed values of the response variables are
predicted by the model. More concretely, r-squared is the proportion of the
variance in the response variable that is explained by the model. An
r-squared score of one indicates that the response variable can be predicted
without any error using the model.
In the case of simple linear regression, r-squared is equal to the square of
the Pearson product moment correlation coefficient, or Pearson's r.
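The link between r-squared and Pearson's r can be checked numerically. The sketch below does so on the pizza training data, assuming the usual definition r-squared = 1 - SS_res / SS_tot; the equality holds for a simple (one-variable) least-squares line evaluated on the data it was fitted to. Variable names are illustrative.

```python
import math

# Pizza diameters (inches) and prices (dollars) from the example in this unit.
diameters = [6, 8, 10, 14, 18]
prices = [7, 9, 13, 17.5, 18]

n = len(diameters)
mean_x = sum(diameters) / n
mean_y = sum(prices) / n
sxx = sum((x - mean_x) ** 2 for x in diameters)
syy = sum((y - mean_y) ** 2 for y in prices)     # total sum of squares
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(diameters, prices))

# Pearson product moment correlation coefficient.
pearson_r = sxy / math.sqrt(sxx * syy)

# r-squared of the least-squares line on the same data.
slope = sxy / sxx
intercept = mean_y - slope * mean_x
ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(diameters, prices))
r_squared = 1 - ss_res / syy

print(f"Pearson's r squared: {pearson_r ** 2:.4f}")
print(f"r-squared:           {r_squared:.4f}")
```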
CALCULATION
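One way to lay out the calculation, assuming the usual definition r-squared = 1 - SS_res / SS_tot, is shown below. It uses the five test prices from the Python implementation in this unit and the line fitted to the training data, whose predictions are approximately f(x) = 1.9655 + 0.9763x (coefficients rounded).

```latex
\begin{aligned}
\bar{y} &= \frac{11 + 8.5 + 15 + 18 + 11}{5} = 12.7 \\
SS_{tot} &= \sum_{i=1}^{n} (y_i - \bar{y})^2 = 56.8 \\
SS_{res} &= \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2 \approx 19.1981 \\
R^2 &= 1 - \frac{SS_{res}}{SS_{tot}} \approx 1 - \frac{19.1981}{56.8} \approx 0.6620
\end{aligned}
```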
PYTHON IMPLEMENTATION
from sklearn.linear_model import LinearRegression
# Training data: pizza diameters (inches) and prices (dollars)
X = [[6], [8], [10], [14], [18]]
y = [[7], [9], [13], [17.5], [18]]
# Held-out test data
X_test = [[8], [9], [11], [16], [12]]
y_test = [[11], [8.5], [15], [18], [11]]
# Fit on the training data, then compute r-squared on the test data
model = LinearRegression()
model.fit(X, y)
print('R-squared: %.4f' % model.score(X_test, y_test))
An r-squared score of 0.6620 indicates that a large proportion of the
variance in the test instances' prices is explained by the model.