
Silver Oak College Of Engineering And Technology

Unit 3: Modeling and Evaluation

1 Prof. Monali Suthar (SOCET-CE)


Outline
 Selecting a Model: Predictive/Descriptive,
 Training a Model for supervised learning,
 Model representation and interpretability,
 Evaluating performance of a model,
 Improving performance of a model.



Model Selection
 Model selection is the process of selecting one final
machine learning model from among a collection of
candidate machine learning models for a training dataset.
 Model selection is a process that can be applied both
across different types of models (e.g. logistic regression,
SVM, KNN, etc.) and across models of the same type
configured with different hyperparameters.
 Model selection is the process of choosing one of the
models as the final model that addresses the problem.
 Model selection is different from model assessment.

What is a “good enough” model?


Model Selection
 A “good enough” model may refer to many things and is
specific to your project, such as:
1. A model that meets the requirements and constraints of
project stakeholders.
2. A model that is sufficiently skillful given the time and
resources available.
3. A model that is skillful as compared to naive models.
4. A model that is skillful relative to other tested models.
5. A model that is skillful relative to the state-of-the-art.



Model Selection Techniques
 The best approach to model selection requires “sufficient”
data, which may be nearly infinite depending on the
complexity of the problem.
 There are two main classes of techniques to approximate
the ideal case of model selection; they are:
1. Probabilistic Measures: Choose a model via in-sample error
and complexity.
2. Resampling Methods: Choose a model via estimated out-of-
sample error.



Model Selection Techniques
 Probabilistic Measures
 Probabilistic measures involve analytically scoring a candidate model using both its
performance on the training dataset and the complexity of the model.
 It is known that training error is optimistically biased, and therefore is not a good basis
for choosing a model.
 The performance can be penalized based on how optimistic the training error is
believed to be.
 This is typically achieved using algorithm-specific methods, often linear, that penalize
the score based on the complexity of the model.
 A model with fewer parameters is less complex and is therefore preferred,
because it is likely to generalize better on average.
 Four commonly used probabilistic model selection measures include:
 Akaike Information Criterion (AIC).
 Bayesian Information Criterion (BIC).
 Minimum Description Length (MDL).
 Structural Risk Minimization (SRM).
 Probabilistic measures are appropriate when using simpler linear models like linear
regression or logistic regression, where the calculation of the model complexity penalty
(e.g. in-sample bias) is known and tractable.
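
The idea of penalizing training error by model complexity can be sketched in a few lines of Python. The block below uses the common least-squares forms of AIC and BIC (with additive constants dropped); the two sets of in-sample predictions and their parameter counts are hypothetical values chosen for illustration, not results from a real model.

```python
import math

def aic_bic(y_true, y_pred, num_params):
    """Least-squares forms of AIC and BIC (additive constants dropped)."""
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    aic = n * math.log(mse) + 2 * num_params            # penalty: 2 per parameter
    bic = n * math.log(mse) + num_params * math.log(n)  # penalty: ln(n) per parameter
    return aic, bic

# Hypothetical in-sample predictions from two candidate models.
y_true = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0]
simple_pred = [3.5, 4.5, 7.5, 8.5, 11.5, 12.5, 15.5, 16.5]   # assumed 2 parameters
complex_pred = [3.4, 4.6, 7.4, 8.6, 11.4, 12.6, 15.4, 16.6]  # assumed 6 parameters

simple_scores = aic_bic(y_true, simple_pred, num_params=2)
complex_scores = aic_bic(y_true, complex_pred, num_params=6)
print("simple  (AIC, BIC):", simple_scores)
print("complex (AIC, BIC):", complex_scores)
```

Lower scores are better for both criteria. In this toy case the complex model has the smaller training error, yet the simple model scores lower once the complexity penalty is added.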
Model Selection Techniques
 Resampling Methods
 Resampling methods seek to estimate the performance of a model (or more precisely, the
model development process) on out-of-sample data.
 This is achieved by splitting the training dataset into sub train and test sets, fitting a
model on the sub train set, and evaluating it on the test set.
 This process may then be repeated multiple times and the mean performance across each
trial is reported.
 It is a type of Monte Carlo estimate of model performance on out-of-sample data,
although the trials are not strictly independent: depending on the resampling method
chosen, the same data may appear multiple times in different training or test
datasets.
 Three common resampling model selection methods include:
 Random train/test splits.
 Cross-Validation (k-fold, LOOCV, etc.).
 Bootstrap.
 Most of the time probabilistic measures (described in the previous section) are not
available, so resampling methods are used instead.
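
A minimal sketch of k-fold cross-validation, written with the standard library only, may make the split/fit/evaluate/average loop concrete. The helper names, the target values and the naive "predict the training mean" model are all assumptions made for illustration; any real model could be fitted in its place.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train_idx, test_idx

def mean_predictor_mse(train_y, test_y):
    """Naive model: always predict the mean of the training targets."""
    mean_y = sum(train_y) / len(train_y)
    return sum((v - mean_y) ** 2 for v in test_y) / len(test_y)

# Hypothetical target values.
y = [7.0, 9.0, 13.0, 17.5, 18.0, 11.0, 8.5, 15.0, 18.0, 11.0]

scores = []
for train_idx, test_idx in k_fold_indices(len(y), k=5):
    train_y = [y[i] for i in train_idx]
    test_y = [y[i] for i in test_idx]
    scores.append(mean_predictor_mse(train_y, test_y))

cv_score = sum(scores) / len(scores)  # mean performance across the trials
print("per-fold MSE:", scores)
print("mean MSE:", cv_score)
```

Each observation is used for testing exactly once and for training in the other k-1 trials, which is why the trials are not independent of each other.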





Types of model
 Predictive Analytics
 Predictive analytics says something about future results, not about current
behavior.
 Predictive analytics helps an organization to know what might happen
next; it predicts the future based on the present data available.
 It analyzes the data and makes statements about events that have not happened yet.
Every prediction it makes is probabilistic in nature.
 It uses supervised learning functions to predict the target
value.
 The methods that come under this category of mining are classification,
time-series analysis and regression.
 Modeling of data is a necessity for predictive analysis, which works by
using a few variables of the present to predict the unknown future
values of other variables.


Predictive Model
 Different predictive models have been developed for specific
purposes:
1. Forecast models
2. Classification models
3. Outlier detection models
4. Time series models
5. Clustering models
6. Neural network algorithms
7. Decision tree algorithms


PREDICTIVE MODELLING PROCESS
 The process involves running algorithms on the data set on which the prediction is
going to be made.
 It involves training several models on the same data
set and finally arriving at the model that fits best, based on an understanding of the
business data.
 The predictive modelling category includes predictive, descriptive, and decision models.
 The predictive modelling process goes as follows:
1. Pre-processing.
2. Data mining.
3. Results validation.
4. Understand business & data.
5. Prepare data.
6. Model data.
7. Evaluation.
8. Deployment.
9. Monitor & improve. 



PREDICTIVE MODELLING
 Features
1. Data analysis & manipulation: Tools for data analysis that create new data sets and
categorize, club, merge and filter data sets.
2. Visualization: This includes interactive graphics and reports.
3. Statistics: To create and confirm relationships between variables in the
data.
4. Hypothesis testing: Creating models, evaluating and choosing the right
models.
 Limitations
1. Errors in data labeling
2. Shortage of the massive data sets needed to train machine learning models
3. The machine’s inability to explain what it did and why
4. Generalizability of learning, or rather lack thereof
5. Bias in data and algorithms


Types of model
 Descriptive Analytics
 Descriptive analytics helps an organization to know what has happened in
the past; it provides analytics of the past using the data that are stored.
 A company needs to know about past events so that it can make
decisions based on statistics from historical data.
 The term is basically used for producing correlations, cross-tabulations, frequencies,
etc.
 These techniques are used to determine the similarities in the data and to find
existing patterns.
 One more application of descriptive analysis is to identify interesting
subgroups in the major part of the data available.
 This type of analytics emphasizes the summarization and transformation of the data
into meaningful information for reporting and monitoring.
 For example, you might want to know how much money you lost due to fraud.


Descriptive Model
 Descriptive analysis mainly uses unsupervised learning
approaches for summarizing, classifying and extracting rules to answer
what has happened in the past.
 Descriptive models differ in nature from predictive
models, since they do not need to perform as accurately as
predictive models do.
 Because predictions concern a potential future event, and the business wants
to exploit that knowledge and act on the predictions, the
reliability of a prediction matters a lot.
 A descriptive model describes data in clusters or association rules, so it does not need to
be accurate, just approximate.
 Ex: Association rules or market-basket analysis, clustering, feature
extraction


Descriptive Vs Predictive
Comparison | Descriptive | Predictive
Basic | It determines what happened in the past by analyzing stored data. | It determines what can happen in the future with the help of past data analysis.
Preciseness | It provides accurate data. | It produces results but does not ensure accuracy.
Practical analysis methods | Standard reporting, query/drill down and ad-hoc reporting. | Predictive modelling, forecasting, simulation and alerts.
Require | It requires data aggregation and data mining. | It requires statistics and forecasting methods.
Type of approach | Reactive approach | Proactive approach
Describe | Describes the characteristics of the data in a target data set. | Carries out induction over the current and past data so that predictions can be made.
Methods (in general) | What happened? Where exactly is the problem? What is the frequency of the problem? | What will happen next? What is the outcome if these trends continue? What actions are required to be taken?


Train a Supervised Machine Learning Model
 Steps Involved in Supervised Learning:
1. First, determine the type of training dataset.
2. Collect/gather the labelled training data.
3. Split the data into a training dataset, a test dataset, and a
validation dataset.
4. Determine the input features of the training dataset, which should carry
enough information for the model to predict the output accurately.
5. Determine a suitable algorithm for the model, such as a support vector
machine, decision tree, etc.
6. Execute the algorithm on the training dataset. Sometimes a validation set,
held out from the training data, is needed to tune the control parameters.
7. Evaluate the accuracy of the model on the test set. If the
model predicts the correct outputs, the model is accurate.
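
The steps above can be sketched end to end with a small k-nearest-neighbours classifier written in plain Python. Everything here is an illustrative assumption: the synthetic two-class data, the 60/20/20 split, the candidate values of k and the helper names. The slides do not prescribe a particular algorithm or split ratio.

```python
import random
from collections import Counter

def knn_predict(train, point, k):
    """Majority label among the k training points closest to `point`."""
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], point)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, rows, k):
    """Fraction of `rows` whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == label for x, label in rows)
    return hits / len(rows)

# Step 2: labelled data (synthetic, two well-separated classes).
rng = random.Random(42)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), "A") for _ in range(50)]
data += [((rng.gauss(6, 1), rng.gauss(6, 1)), "B") for _ in range(50)]

# Step 3: shuffle, then split into training, validation and test sets.
rng.shuffle(data)
train, val, test = data[:60], data[60:80], data[80:]

# Steps 5-6: pick the algorithm (k-NN) and tune k on the validation set.
best_k = max((1, 3, 5), key=lambda k: accuracy(train, val, k))

# Step 7: report accuracy on the untouched test set.
test_accuracy = accuracy(train, test, best_k)
print("best k:", best_k, "test accuracy:", test_accuracy)
```

The test set is used only once, at the end, so the reported accuracy is not influenced by the choice of k.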



Supervised Machine Learning
 Regression
1. Linear Regression
2. Regression Trees
3. Non-Linear Regression
4. Bayesian Linear Regression
5. Polynomial Regression
 Classification
 Random Forest
 Decision Trees
 Logistic Regression
 Support Vector Machines


Predictive model
Predictive modeling is the process of taking known results
and developing a model that can predict values for new
occurrences.
It uses historical data to predict future events.
There are many different types of predictive modeling
techniques, including linear regression (ordinary least
squares), logistic regression, ridge regression, time series,
decision trees and neural networks.

Process of Predictive model
Step 1: Data collection and purification: Data is
accumulated from all the sources, and the required
information is extracted by cleaning the data with operations that
eliminate noisy data, so that accurate estimations can be made.
Sources include transaction and customer assistance
data, survey data and economic data.

Step 2: Data transformation: Data needs to be transformed
through appropriate processing to get normalized data. The
values are scaled into a given range, and
extraneous elements are removed by correlation analysis to
reach the final decision on which elements to keep.
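
As one possible illustration of scaling values into a given range, the sketch below applies min-max normalization. The function name and the default target range of 0 to 1 are assumptions; the slides do not say which scaling method is intended.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to new_min.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]

diameters = [6, 8, 10, 14, 18]  # example feature values
scaled = min_max_scale(diameters)
print(scaled)
```

After scaling, features measured in different units fall in the same range, which keeps any single feature from dominating distance-based or gradient-based methods.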

Step 3: Formulation of the predictive model: A
predictive model is often designed using regression techniques
or a classification
algorithm. During this process, test data is identified, and
classification decisions are applied to the test data to
determine the performance of the model.

Step 4: Performance analysis or conclusion: Finally,
inferences are drawn from the model; for this, cluster
analysis is performed. After the model is built, ongoing analysis is
important for maintaining it.

Steps in building regression model
STEP 1: Collect/Extract Data
The first step in building a regression model is to collect or extract data on the dependent
(outcome) variable and independent (feature) variables from different data sources. Data
collection in many cases can be time-consuming and expensive, even when the organization has a
well-designed enterprise resource planning (ERP) system.
STEP 2: Pre-Process the Data
Before the model is built, it is essential to check the quality of the data for issues such as
reliability, completeness, usefulness, accuracy, missing data, and outliers.
1. Data imputation techniques may be used to deal with missing data. Descriptive statistics
and visualization (such as box plots and scatter plots) may be used to identify the existence of
outliers and variability in the dataset.

2. Many new variables (such as the ratio of variables or product of variables) can be derived (aka
feature engineering) and also used in model building.
3. Categorical data must be pre-processed using dummy variables (part of feature engineering)
before it is used in the regression model.
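
Two of the pre-processing operations listed under Step 2 can be sketched briefly: mean imputation for missing values and dummy-variable encoding for a categorical column. The column contents and helper names below are hypothetical, and mean imputation is only one of several imputation techniques.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def dummy_encode(categories):
    """One 0/1 column per level, dropping the first level as the baseline."""
    levels = sorted(set(categories))
    return {
        f"is_{level}": [int(c == level) for c in categories]
        for level in levels[1:]
    }

prices = [7.0, None, 13.0, 17.5, None]                 # hypothetical column with gaps
crust = ["thin", "thick", "thin", "stuffed", "thick"]  # hypothetical categorical column

imputed = impute_mean(prices)
dummies = dummy_encode(crust)
print(imputed)
print(dummies)
```

Dropping one level as the baseline avoids perfectly collinear dummy columns when the regression model also includes an intercept.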

STEP 3: Dividing Data into Training and Validation Datasets
In this stage the data is divided into two subsets (sometimes more than two subsets): a training
dataset and a validation or test dataset. The proportion of the training dataset is usually between 70%
and 80% of the data, and the remaining data is treated as the validation data.

STEP 4: Perform Descriptive Analytics or Data Exploration
It is always a good practice to perform descriptive analytics before moving to building a predictive
analytics model. Descriptive statistics help us to understand the variability in the data, and
visualization of the data through, say, a box plot will show whether there are any outliers in the
data.
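
A box plot flags outliers using the quartiles of the data, and the same check can be sketched numerically. The block below assumes the common 1.5 x IQR whisker rule and uses hypothetical price values; `statistics.quantiles` is used with its default method.

```python
import statistics

def iqr_outliers(values):
    """Return values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical prices with one suspicious entry.
prices = [7, 9, 13, 17.5, 18, 11, 8.5, 15, 60]

print("mean:", statistics.mean(prices))
print("standard deviation:", statistics.stdev(prices))
print("outliers:", iqr_outliers(prices))
```

Whether a flagged value is an error or a genuine extreme observation still has to be decided by looking at the data source.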

STEP 5: Build the Model
The model is built using the training dataset: the regression parameters are estimated
with the method of Ordinary Least Squares (OLS).
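
For a single feature, the OLS estimates have a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. The sketch below applies this to the pizza diameters and prices that appear in the EXAMPLE slides of this unit; the function name is an assumption.

```python
def fit_simple_ols(xs, ys):
    """Closed-form OLS estimates (slope, intercept) for one feature."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean  # line passes through (x_mean, y_mean)
    return slope, intercept

diameters = [6, 8, 10, 14, 18]
prices = [7, 9, 13, 17.5, 18]

slope, intercept = fit_simple_ols(diameters, prices)
print(f"price = {slope:.4f} * diameter + {intercept:.4f}")
print(f'A 12" pizza should cost: ${slope * 12 + intercept:.2f}')
```

The prediction for a 12-inch pizza agrees with the value printed by the scikit-learn example in these slides, since `LinearRegression` also fits by least squares.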

STEP 6: Perform Model Diagnostics
Regression is often misused, since many times the modeler fails to
perform the necessary diagnostic tests before applying the model. Before it can be applied, it is
necessary that the model created is validated for all model assumptions, including the definition of
the functional form. If the model assumptions are violated, then the modeler must use remedial
measures.

STEP 7: Validate the Model and Measure Model Accuracy
A major concern in analytics is over-fitting, that is, the model may perform very well on the
training dataset but badly on the validation dataset. It is important to ensure that the
model performance on the validation dataset is consistent with that on the training dataset. In fact, the
model may be cross-validated using multiple training and test datasets.


Linear regression model
 Linear regression is a quite simple statistical regression method used for
predictive analysis; it shows the relationship between continuous variables.
 Linear regression shows the linear relationship between the independent variable
(X-axis) and the dependent variable (Y-axis), hence the name linear regression.
 If there is a single input variable (x), the method is called simple linear
regression. If there is more than one input variable, it is
called multiple linear regression.
 The linear regression model gives a sloped straight line describing the relationship
between the variables.

Cost function
A cost function, also called a loss function, is used to define and measure the error of a model. The
differences between the values predicted by the model and the observed values in the
training set (for example, predicted and observed pizza prices) are called residuals or training errors.
The cost function is used to optimize the regression coefficients or weights, and it measures how well a linear
regression model is performing. It is used to find the accuracy of the mapping
function that maps the input variable to the output variable. This mapping function is also known
as the hypothesis function.

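In linear regression, the Mean Squared Error (MSE) cost function is used, which is the average of
the squared errors between the predicted values and the actual values.

For the simple linear equation y = mx + b, the MSE is calculated as:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \bigl(y_i - (m x_i + b)\bigr)^2$$

where N is the number of observations, y_i is the actual value and m x_i + b is the predicted value for observation i.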
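
To see how the cost function separates a good line from a poor one, the sketch below evaluates the MSE of two candidate lines on the pizza diameters and prices used in the EXAMPLE slides. Both candidate lines are hand-picked assumptions for illustration, not fitted models.

```python
def mse(xs, ys, m, b):
    """Mean squared error of the line y = m*x + b on the given data."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

diameters = [6, 8, 10, 14, 18]
prices = [7, 9, 13, 17.5, 18]

flat = mse(diameters, prices, m=0.0, b=12.9)    # horizontal line at the mean price
sloped = mse(diameters, prices, m=1.0, b=2.0)   # hand-picked sloped line

print("MSE of flat line:  ", flat)
print("MSE of sloped line:", sloped)
```

The sloped line has a much smaller MSE, so it is the better of the two candidates. Fitting a regression model amounts to searching for the m and b that make this cost as small as possible.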

EXAMPLE:
 Let's assume that you have recorded the diameters and prices of pizzas that
you have previously eaten in your pizza journal. These observations comprise
our training data:

import matplotlib.pyplot as plt


X = [[6], [8], [10], [14], [18]]
y = [[7], [9], [13], [17.5], [18]]
plt.figure()
plt.title('Pizza price plotted against diameter')
plt.xlabel('Diameter in inches')
plt.ylabel('Price in dollars')
plt.plot(X, y, 'k.')
plt.axis([0, 25, 0, 25])
plt.grid(True)
plt.show()

EXAMPLE:
from sklearn.linear_model import LinearRegression
# Training data
X = [[6], [8], [10], [14], [18]]
y = [[7], [9], [13], [17.5], [18]]
# Create and fit the model
model = LinearRegression()
model.fit(X, y)
# predict() expects a 2-D array: one row per sample, one column per feature
print('A 12" pizza should cost: $%.2f' % model.predict([[12]])[0][0])
# Output: A 12" pizza should cost: $13.68

EVALUATING THE FITNESS OF MODEL
We can produce the best pizza-price predictor by minimizing the sum of the squared residuals. That is, our
model fits if the values it predicts for the response variable are close to the
observed values for all of the training examples. This measure of the model's
fitness is called the residual sum of squares cost function. Formally, this
function assesses the fitness of a model by summing the squared residuals
for all of our training examples. The residual sum of squares is calculated with the following equation,

$$SS_{res} = \sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2$$

where y_i is the observed value and f(x_i) is the predicted value for training example i.

EVALUATING THE MODEL
R-squared measures how well the observed values of the response variable are predicted by the
model. More concretely, r-squared is the proportion of the variance in the
response variable that is explained by the model. An r-squared score of one
indicates that the response variable can be predicted without any error using
the model.

In simple linear regression, r-squared is equal to the square of the Pearson product-moment correlation
coefficient, or Pearson's r.

CALCULATION
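
As a sketch of the calculation, r-squared can be computed as 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the observed values from their mean. The test diameters and prices below are the ones used in the Python implementation in these slides; the slope and intercept are an assumption, namely the least-squares line for the five training pizzas rounded to four decimals.

```python
def r_squared(y_true, y_pred):
    """Proportion of the variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation
    return 1 - ss_res / ss_tot

# Assumed coefficients: rounded least-squares fit to the training data.
slope, intercept = 0.9763, 1.9655

x_test = [8, 9, 11, 16, 12]
y_test = [11, 8.5, 15, 18, 11]
y_pred = [slope * x + intercept for x in x_test]

r2 = r_squared(y_test, y_pred)
print("R-squared: %.4f" % r2)
```

The result agrees with the score reported by `model.score(X_test, y_test)` in the scikit-learn version, up to the rounding of the coefficients.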

PYTHON IMPLEMENTATION
from sklearn.linear_model import LinearRegression
# Training data
X = [[6], [8], [10], [14], [18]]
y = [[7], [9], [13], [17.5], [18]]
# Held-out test data
X_test = [[8], [9], [11], [16], [12]]
y_test = [[11], [8.5], [15], [18], [11]]
model = LinearRegression()
model.fit(X, y)
# score() returns the r-squared of the predictions on the test data
print('R-squared: %.4f' % model.score(X_test, y_test))

An r-squared score of 0.6620 indicates that a large proportion of the
variance in the test instances' prices is explained by the model.
