Lecture presented to
MTech/BTech Pet Engg Students
IIT (ISM) Dhanbad
November 2, 2022
Presentation Outline
CONTEXT
MECHANICS
CAVEATS
Mishra - IIT(ISM) Nov 2022 2
Data-Driven Modeling
• Classical statistics must postulate a model relating the
independent (predictor) and dependent (response) variables
• However, there is a growing need to look beyond linear regression
(and its variants) for complex, multi-dimensional data sets
• The idea is to extract the model from the data without making any
assumptions about the underlying functional form
– the key difference from the classical approach
– regression problems, where the response variable is
continuous (e.g., permeability)
– classification problems, where the response variable is
categorical (e.g., rock type)
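As a minimal sketch of the two problem types, assuming scikit-learn and synthetic data (neither the library nor the data set is specified in the lecture):

```python
# Hedged sketch: the same data-driven workflow handles both problem types.
# Data, model choice (random forests), and variable names are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))  # three predictor variables

# Regression: continuous response (e.g., a permeability-like quantity)
y_cont = 10 * X[:, 0] + np.sin(2 * np.pi * X[:, 1]) + rng.normal(0, 0.1, 200)
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y_cont)

# Classification: categorical response (e.g., rock type "A" vs "B")
y_cat = np.where(X[:, 0] + X[:, 2] > 1.0, "A", "B")
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y_cat)

print(reg.predict(X[:2]))  # continuous predictions
print(clf.predict(X[:2]))  # categorical predictions
```

Note that no functional form is postulated here; the tree ensemble extracts the relationship directly from the data.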
[Figure: growth trend over 1985–2025, y-axis 0–3000 (source: blog.gartner.com)]
[Figure: train/predict workflow schematic]
[Figure: correlation matrix of the data set variables (T, C1, C2.C6, C7., MWC5., MWC7., API, Vol, Int, V.I, MMP), color scale from −1 to 1]
• Seek parsimonious balance between goodness of fit and model complexity
• Minimize AIC (Akaike Information Criterion)

AIC = n log(SSE/n) + 2p

where
n = number of observations
p = number of model parameters
SSE = residual sum of squares
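The AIC formula above can be applied directly to compare candidate models; a short sketch in Python with NumPy (the library choice and the polynomial-fit example are illustrative assumptions, not from the lecture):

```python
import numpy as np

def aic(y, y_hat, p):
    """AIC = n*log(SSE/n) + 2p, with n observations and p parameters."""
    n = len(y)
    sse = float(np.sum((np.asarray(y) - np.asarray(y_hat)) ** 2))
    return n * np.log(sse / n) + 2 * p

# Truly linear data: higher-degree fits reduce SSE slightly but pay
# the 2p complexity penalty, so AIC favors the parsimonious model.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 0.2, 50)

scores = {}
for degree in (1, 3, 7):
    coefs = np.polyfit(x, y, degree)
    y_hat = np.polyval(coefs, x)
    scores[degree] = aic(y, y_hat, p=degree + 1)

best = min(scores, key=scores.get)  # degree with the lowest AIC
print(scores, best)
```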
• Decision Tree: partition parameter space into rectangular regions with constant values or class labels
• Random Forest: build ensemble of trees using random subsets of observations and predictors
• Boosted Trees: build sequence of trees that address shortcomings of each previous fitted tree
• Support Vector Machine: find hyperplane maximizing separation of data, transforming data into linear space
• Artificial Neural Network: inputs mapped to outputs via hidden units using a sequence of nonlinear functions
• Gaussian Process Emulation: multidimensional interpolation considering trend and autocorrelation structure of data

[Diagram: partition of parameter space into rectangular regions R1–R4]
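The six methods above can be tried on one common synthetic problem. This sketch assumes scikit-learn implementations (with gradient boosting standing in for boosted trees); the lecture does not prescribe any particular software:

```python
# Hedged comparison of the six methods on synthetic regression data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] + rng.normal(0, 0.05, 200)

models = {
    "decision tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "boosted trees": GradientBoostingRegressor(random_state=0),
    "SVM": SVR(kernel="rbf"),
    "neural network": MLPRegressor(hidden_layer_sizes=(32,), max_iter=5000,
                                   random_state=0),
    "Gaussian process": GaussianProcessRegressor(),
}
for name, model in models.items():
    print(name, round(model.fit(X, y).score(X, y), 3))  # in-sample R^2
```

In-sample R² alone flatters flexible models; the validation schemes on the next slide address this.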
Three Types of Model Validation
• full training data
• 10-fold CV
• held-out test data
https://christophm.github.io/interpretable-ml-book/
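A sketch of the three validation views on synthetic data (the model and library are illustrative assumptions):

```python
# Hedged sketch: full-training fit, 10-fold CV, and a held-out test set.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.0]) + rng.normal(0, 0.1, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)

model.fit(X_train, y_train)
# Full training data: optimistic, since the model has seen every point
print("full training R^2:", round(model.score(X_train, y_train), 3))
# 10-fold CV: each point predicted by a model that did not train on it
cv = cross_val_score(model, X_train, y_train, cv=10)
print("10-fold CV R^2:   ", round(cv.mean(), 3))
# Held-out test data: never touched during training or tuning
print("held-out test R^2:", round(model.score(X_test, y_test), 3))
```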
Identifying Predictor Importance
[Figure: predictor importance from Linear and Random Forest models for two data sets, N = 81 and N = 2935]

Robust models can be built with small datasets if all relevant causal variables are included in the model
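The small-N case can be illustrated with permutation importance; this sketch uses N = 81 as on the slide, but the data, model, and scikit-learn implementation are assumptions:

```python
# Hedged illustration: with a small dataset that contains the causal
# variables (x0, x1), importance scores still single them out.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.uniform(size=(81, 4))                            # N = 81
y = 5 * X[:, 0] + 3 * X[:, 1] + rng.normal(0, 0.1, 81)   # x2, x3 irrelevant

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=20, random_state=0)
for i, imp in enumerate(result.importances_mean):
    print(f"x{i}: {imp:.3f}")  # causal predictors x0, x1 rank highest
```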