
BEST PRACTICES FOR

MACHINE LEARNING APPLICATIONS


IN PETROLEUM ENGINEERING

Dr. Srikanta Mishra


Battelle Memorial Institute, USA
mishras@battelle.org

Lecture presented to
MTech/BTech Pet Engg Students
IIT (ISM) Dhanbad
November 2, 2022
Presentation Outline

CONTEXT 

MECHANICS 

CAVEATS 

Data-Driven Modeling
• Classical statistics → have to postulate a model between
independent (predictor) & dependent (response) variables
• However, there is a growing need to look beyond linear regression (and
its variants) for complex multi-dimensional data sets
• Idea is to extract the model from the data without making any
assumptions regarding the underlying functional form
– key difference from the classical approach
– regression problems, where the response variable is
continuous (e.g., permeability)
– classification problems, where the response variable is
categorical (e.g., rock type)



Some Definitions

• Data analytics → data collection and analysis to understand hidden
patterns and relationships
– Machine learning → enabling tool for building a model between
predictors and response, typically using “black-box” methods
• Artificial intelligence → applying the predictive model with new data
to make decisions, without human intervention
(with possibility of feedback)

[Figure: workflow from Collect Data (A, C) to Infer Rules / predictive model (B)
to Make Decision (D, E); Data analytics → A, B; Machine learning → B;
Artificial intelligence → C, B, D, E]

Mishra et al., 2021, JPT (March), 25-30.

Types of Analytics

[Figure: types of analytics (source: blog.gartner.com)]



Why Data-Driven Models?
• Mechanistic modeling in conventional reservoirs is data- and
computation-intensive
• Mechanistic modeling in unconventional reservoirs is complex
(coupled processes) and often immature
• Physics-based modeling in other areas (e.g., wellbore flow,
drilling operations, equipment failure) is often not robust
• Empirical models (e.g., decline curves) are a popular alternative
but have many limitations (model form, parameterization)
• Data-driven models are emerging as an alternative approach
(let the “machine” learn about the system from the data)



Why ML Models and When?
• Historically, subsurface science and engineering analyses
have relied on mechanistic models
– incorporation of causal input-output relationships
• Experienced professionals are wary of purely “black-box”
ML models that lack such understanding
• Nevertheless, the use of ML models is easy to justify when
➢ the relevant physics-based model is computation-intensive
and/or immature
➢ a suitable mechanistic modeling paradigm does not exist

Mishra et al., 2021, JPT (March), 25-30.



Growing Application Base in O&G

[Chart: “Machine Learning” hits from the OnePetro database, 1985–2025 –
lots of examples exist!]

Mishra et al., 2021, JPT (March), 25-30.

Data-Driven Modeling Mechanics

[Workflow: Exploratory Data Analysis → Feature Selection → Multivariate Analysis →
Model Building (train/predict) → Cross-Validation → Variable Importance]



Exploratory Data Analysis
• Primarily based on visualization + basic metrics
– Range, variability and distribution
– Correlation between variables
– Outliers
– Underlying trend
– Natural groupings

• Also known as “descriptive analytics”
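
A minimal sketch of such an exploratory pass in Python (the file name and
columns below are hypothetical placeholders for any tabular well/core dataset):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical file; any tabular dataset with numeric columns works
df = pd.read_csv("well_data.csv")

# Range, variability and distribution of each variable
print(df.describe())

# Correlation between variables (Pearson and rank-based Spearman)
print(df.corr(method="pearson", numeric_only=True))
print(df.corr(method="spearman", numeric_only=True))

# Quick visual check for trends, natural groupings and outliers
sns.pairplot(df)
plt.show()
```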




Scatter Plot Matrix (Multivariate Data)

Mishra et al., 2014, Env. Geosci., 21(2), 59-74.




Dealing with Missing Variables
• Key assumption → the missing-data mechanism has not distorted
the observed data
• Multiple possible strategies (see the sketch below)
– Fill in (“impute”) each missing value with the mean or median
of the non-missing values for that input
– Build a predictive model for each input given the other inputs,
and impute missing values
– Use the built-in methods in RF and GBM
• Check for consistency between original and imputed distributions
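
One possible sketch of the simple mean/median imputation strategy, using
scikit-learn's SimpleImputer (dataset and column names are hypothetical):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("well_data.csv")                     # hypothetical dataset
predictors = ["porosity", "tvd", "lateral_length"]    # hypothetical columns

# Impute each missing value with the median of the non-missing values for that input
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(df[predictors]), columns=predictors)

# Check for consistency between original and imputed distributions
print(df[predictors].describe())
print(imputed.describe())
```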




Handling Outliers
• Both univariate and bi/multi-variate analysis may be needed
• Start with a simple plot of the normalized variable (z-score) to flag
samples with |z| > 3, where z = (x - m)/s (assuming normality) –
sketched below
• Look for differences between Pearson (ordinary) and Spearman (rank)
correlation coefficients – an indicator of outliers or non-linearity
• Identify values beyond a pre-defined percentile threshold (P5-P95) –
also possible with boxplots
• Outliers can simply be removed, or replaced with min/max quantiles

[Figure annotation: human input error (typo)? Need for data QA/QC!]
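
A rough sketch of these checks, assuming a hypothetical dataset and column names:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("well_data.csv")      # hypothetical dataset
x = df["permeability"]                 # hypothetical column

# Flag samples whose z-score exceeds 3 in absolute value (assumes near-normality)
z = (x - x.mean()) / x.std()
print(df[np.abs(z) > 3])

# A large gap between Pearson and Spearman correlations hints at outliers or non-linearity
pearson = df["porosity"].corr(df["permeability"], method="pearson")
spearman = df["porosity"].corr(df["permeability"], method="spearman")
print(pearson, spearman)

# Replace values beyond the P5-P95 range with those quantiles (winsorizing)
capped = x.clip(lower=x.quantile(0.05), upper=x.quantile(0.95))
```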

Visualizing Multivariate Correlations

[Figure: “Correlation CO2_MMP.csv using Spearman” – Spearman rank correlation
matrix for the CO2 MMP dataset (variables include MMP, T, API, Vol, Int, V.I,
C1, C2.C6, C7., MWC5., MWC7.), generated with Rattle]
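
A comparable heatmap can be produced in Python; a small sketch assuming the
CO2_MMP.csv file from the slide is available locally:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("CO2_MMP.csv")   # dataset name taken from the slide; path assumed

# Spearman (rank) correlations are less sensitive to outliers and monotone non-linearity
corr = df.corr(method="spearman", numeric_only=True)

sns.heatmap(corr, vmin=-1, vmax=1, cmap="RdBu_r", annot=True, fmt=".2f")
plt.title("Spearman correlation matrix")
plt.show()
```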



Regression Coefficients

Regression Statistics
Multiple R          0.733851
R Square            0.538538   ← fraction of total variance explained by the model
Adjusted R Square   0.522625
Standard Error      44.71329   ← estimated SD of the error term in the regression (≈ RMSE)
Observations        31

              Coeff    Std Error   t Stat   P-value    Lower 95%   Upper 95%
Intercept     97.397   13.844      7.035    9.75E-08   69.082      125.711
X Variable 1  2.063    0.355       5.818    2.63E-06   1.337       2.788

Notes:
– Coeff and Std Error = mean and SD of the regression coefficients
– t Stat = Coeff/SE → the bigger the better
– P-value → the smaller the better (likelihood that Coeff is different from zero)
– Lower/Upper 95% ≈ Coeff ± 2·SE
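
Similar output can be reproduced in Python with statsmodels; a small sketch with
hypothetical dataset and column names:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mmp_data.csv")            # hypothetical dataset
X = sm.add_constant(df[["temperature"]])    # adds the Intercept term
y = df["MMP"]                               # hypothetical response column

model = sm.OLS(y, X).fit()
print(model.summary())   # R-squared, coefficients, std errors, t stats, p-values, 95% CIs
```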
How Many Terms in Regression?

• Seek a parsimonious balance between goodness of fit and
model complexity
• Minimize AIC (Akaike Information Criterion)

AIC = n log(SSE/n) + 2p

where
n = number of observations
p = number of model parameters
SSE = residual sum of squares
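
A short sketch of computing AIC from a fitted linear model using the definition
above (dataset and column names hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("well_data.csv")                 # hypothetical dataset
X = df[["porosity", "tvd", "lateral_length"]]     # hypothetical predictors
y = df["cum_oil_12mo"]                            # hypothetical response

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

n = len(y)                      # number of observations
p = X.shape[1] + 1              # model parameters (slopes + intercept)
sse = np.sum(residuals ** 2)    # residual sum of squares

aic = n * np.log(sse / n) + 2 * p   # AIC as defined on the slide
print(aic)
```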




Predictive Modeling Methods

• Regression & Classification Tree – partition the parameter space into
rectangular regions with constant values or class labels
• Random Forest – build an ensemble of trees using random subsets of
observations and predictors
• Gradient Boosting Machine – build a sequence of trees, each addressing
the shortcomings of the previous fitted tree
• Support Vector Machine – find a hyperplane maximizing separation of the
data, after transforming the data into a linear space
• Artificial Neural Network – inputs mapped to outputs via hidden units
using a sequence of nonlinear functions
• Gaussian Process Emulation – multidimensional interpolation considering
the trend and autocorrelation structure of the data

[Figure: example tree with splits X1 < t1, X2 < t2, X2 < t3 leading to
regions R1–R4]
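
As a rough illustration, all of these methods are available through a common
scikit-learn interface; the dataset and column names below are hypothetical:

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.gaussian_process import GaussianProcessRegressor

df = pd.read_csv("well_data.csv")                 # hypothetical dataset
X = df[["porosity", "tvd", "lateral_length"]]     # hypothetical predictors
y = df["cum_oil_12mo"]                            # hypothetical response

models = {
    "tree": DecisionTreeRegressor(),
    "random_forest": RandomForestRegressor(n_estimators=500),
    "gbm": GradientBoostingRegressor(),
    "svm": SVR(),
    "ann": MLPRegressor(hidden_layer_sizes=(20,), max_iter=5000),
    "gp": GaussianProcessRegressor(),
}

for name, model in models.items():
    model.fit(X, y)
    print(name, model.score(X, y))   # training R^2 only; see the cross-validation slide
```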



Deep Learning Approaches

Convolutional Neural Network
– Uses: extracting local (neighborhood) information from data. Very stable
model that scales to very large data sets. Fewer parameters than fully
connected models.
– Drawbacks: does not capture global information from data (only local
relationships).

Recurrent Neural Network
– Uses: capturing sequential relationships in data. Can capture long-term
dependencies in data (subject-verb relationships in text, etc.).
– Drawbacks: the sequential nature of the model leads to performance issues
(not easily parallelizable). Can suffer from exploding/vanishing gradients.

Autoencoder
– Uses: compressing data into a smaller number of features. Can be useful for
data exploration, denoising, and generative models.
– Drawbacks: not typically useful for standard tasks like regression or
classification.

Reinforcement Learning
– Uses: when optimizing a reward signal is the goal, e.g., maximizing points
in a video game. A good option when the training signal isn’t described via a
loss (MSE, cross entropy, etc.).
– Drawbacks: training is sample-inefficient (needs many examples to learn).
Stability of training is an issue. Need to balance exploration vs. exploitation
in training.

Transfer Learning
– Uses: most useful when performing a task that is well explored (text, images)
with a large number of pre-trained models. By freezing model parameters, we can
maximize performance when we have limited data for our task.
– Drawbacks: requires pre-trained models for a similar task to exist, and this
isn’t always the case.



k-fold Cross Validation

Recommended if an independent test dataset is not available.

[Figure: the full dataset is split into k folds – each model is trained on the
remaining folds and used to predict the held-out fold]
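
A minimal sketch of 10-fold cross-validation with scikit-learn (dataset and
column names hypothetical):

```python
import pandas as pd
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("well_data.csv")                 # hypothetical dataset
X = df[["porosity", "tvd", "lateral_length"]]     # hypothetical predictors
y = df["cum_oil_12mo"]                            # hypothetical response

# 10-fold CV: each fold is held out once while the model is trained on the other nine
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestRegressor(n_estimators=500), X, y,
                         cv=cv, scoring="r2")
print(scores.mean(), scores.std())
```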



Evaluating Model Fits

[Figure: model fits for a variety of modeling methods under three types of
model validation – full training data, 10-fold CV, and held-out test data]



Model Aggregation – Why?
• Model fits measured in terms of training or test error –
multiple competing models may arise!
• Aggregating over a large set of acceptable models can provide
more robust understanding and predictions
• Ensemble models (with predictions aggregated) are generally
top performers in data science competitions



Ensemble Modeling Methods
• Model aggregation strategies (see the sketch below)
– Simple averaging (direct average of constituent model predictions,
e.g., using the arithmetic average)
– Weighted averaging (weighted average of constituent model
predictions, e.g., weights from the inverse of RMSE)
– Stacking (predictions from the constituent models are used as
predictors in an aggregate model, e.g., NN training)

• Similar arguments underlie Beven’s concept of
“equifinality” in watershed hydrology modeling
with the GLUE framework (weighted averaging)

Schuetter et al., 2019, URTeC-929
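
A hedged sketch of simple and weighted (inverse-RMSE) averaging of constituent
model predictions; in practice the weights would come from cross-validation
rather than the test split, and the dataset and column names are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

df = pd.read_csv("well_data.csv")                 # hypothetical dataset
X = df[["porosity", "tvd", "lateral_length"]]     # hypothetical predictors
y = df["cum_oil_12mo"]                            # hypothetical response
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = [RandomForestRegressor(n_estimators=500),
          GradientBoostingRegressor(),
          SVR()]
preds, rmses = [], []
for m in models:
    m.fit(X_tr, y_tr)
    p = m.predict(X_te)
    preds.append(p)
    rmses.append(np.sqrt(mean_squared_error(y_te, p)))

preds = np.array(preds)
simple_avg = preds.mean(axis=0)                    # simple averaging
w = 1 / np.array(rmses)                            # inverse-RMSE weights
w /= w.sum()
weighted_avg = (w[:, None] * preds).sum(axis=0)    # weighted averaging
```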



Variable Importance Approaches

• Remove (removing a variable) – remove a variable from the model, re-train the
model and compare the reduction in pseudo-R2, i.e., R2 loss.
• Permute (permuting a variable) – permute a variable’s values, which breaks the
relationship between the variable and the true outcome, then compare the R2 loss
of the dataset with permuted values to that with true values.
• PDP (Partial Dependence Plot) – shows the marginal effect of different variables
on the predicted outcome. PDPs are “flat” for less important variables, while
variables whose PDP varies across a wider range of the response are more likely
to be important.
• ALE (Accumulated Local Effects plot) – compares how the model predictions change
in a small “window” of different variables. ALE plots are a faster and unbiased
alternative to partial dependence plots.
• LIME (Local Interpretable Model-agnostic Explanations) – attempts to understand
the model by perturbing the input data samples and interpreting how the
predictions change. Variable weights can then be extracted from a simple local
model on the perturbed dataset to explain local behavior.
• SHAP (SHapley Additive exPlanations) – explains individual predictions based on
the game-theoretically optimal Shapley values. A prediction can be explained by
assuming that each feature value of the instance is a “player” in a game where
the prediction is the payout; Shapley values – a method from coalitional game
theory – tell us how to fairly distribute the “payout” among the features.

https://christophm.github.io/interpretable-ml-book/
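
A small sketch of the “Permute” strategy using scikit-learn’s
permutation_importance (dataset and column names hypothetical):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

df = pd.read_csv("well_data.csv")                 # hypothetical dataset
X = df[["porosity", "tvd", "lateral_length"]]     # hypothetical predictors
y = df["cum_oil_12mo"]                            # hypothetical response
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=500).fit(X_tr, y_tr)

# Permute each predictor in turn and record the drop in R^2 on held-out data
result = permutation_importance(model, X_te, y_te, n_repeats=20,
                                random_state=0, scoring="r2")
for name, imp in zip(X.columns, result.importances_mean):
    print(name, round(imp, 3))
```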

Identifying Predictor Importance




Classification Tree Analysis
▪ Binary classification to identify factors separating top 25% from
bottom 25% producing wells

▪ Accuracy:

              BOTTOM 25%   TOP 25%   CORRECT ID
BOTTOM 25%        62          18        78%
TOP 25%            7          73        91%
TOTAL             69          91        70%

[Tree interpretation – Top 25%: not too shallow, not too deep, long lateral
with more proppant, but not too long]
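
A minimal sketch of such a binary classification tree and its confusion matrix
(dataset, columns, and the top/bottom-quartile label are hypothetical):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

df = pd.read_csv("well_data.csv")                        # hypothetical dataset
X = df[["tvd", "lateral_length", "proppant_per_ft"]]     # hypothetical predictors
y = df["top_quartile"]                                   # hypothetical 0/1 label (bottom/top 25%)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(max_depth=3).fit(X_tr, y_tr)   # shallow tree stays interpretable
print(confusion_matrix(y_te, tree.predict(X_te)))            # rows = true class, cols = predicted
```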




Challenges for Acceptance of ML

“Our ML models are not very good”
• Consumer marketing ML/AI models are not necessarily highly accurate!
• Need to manage expectations re. quality of fit for subsurface models
• Focus more on added value from ML models + complementary role

“If I don’t understand the model, how can I believe it?”
• Articulate adequacy of predictors
• Demonstrate model robustness
• Explain inner workings (key variables)
• Use creative visualizations

“We are still waiting for the ‘Aha’ moment!”
• ML model may or may not produce new insights
• Provides an alternative quantitative input-output relationship
• Useful when physics-based model is slow, data-intensive or immature

“My staff need to learn data science, but how?”
• Significant (informal) self-learning to become “citizen data scientists”
• Need formal knowledge of conventional data analysis, python/R
programming, and machine learning

Mishra et al., 2021, JPT (March)

Recommended Approach & Mindset
• Framing the problem
• Selecting the causal variables
• Checking the data
• Choosing the modeling technique(s)
• Validating the model
• Understanding variable importance
• Communicating the results and caveats

▪ Data-driven models can complement physics- and geology-based insights
▪ Need to “understand + apply & predict + communicate”
▪ Danger of becoming “technicians” if understanding of foundational
concepts is lacking

Template for documenting results + reviewing/evaluating other studies



Data-Driven Modeling Steps
• Frame the problem (what is the question I want to answer?
What causal variables do I need?)
• Explore/check the data (distributions – correlations –
significance – outliers – missing variables)
• Prepare data for model building/testing (80/20 or 70/30
split with replicates, k-fold cross-validation, blind test data)
• Try at least 3 models (reference linear model, one tree-based
model, one non-tree-based model) – see the sketch below
– Look at model performance with training vs. test data
– Select best model(s) – train on all data – aggregate as needed
• Determine variable importance (R2-loss, error rate change)
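
A compact sketch of this recipe with an 80/20 split and three model families
(dataset and column names hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

df = pd.read_csv("well_data.csv")                 # hypothetical dataset
X = df[["porosity", "tvd", "lateral_length"]]     # hypothetical predictors
y = df["cum_oil_12mo"]                            # hypothetical response

# 80/20 split for model building vs. testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Reference linear model, one tree-based model, one non-tree-based model
for name, model in [("linear", LinearRegression()),
                    ("random_forest", RandomForestRegressor(n_estimators=500)),
                    ("svm", SVR())]:
    model.fit(X_tr, y_tr)
    print(name, model.score(X_tr, y_tr), model.score(X_te, y_te))  # train vs. test R^2
```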



Q1 – Do I Need Machine Learning?

[Figure: comparison of a linear model fit with a random forest fit]


Q2 – Which Technique Works Best?
• No single technique is consistently the best performer
• Often, multiple competing models have equally good fits (R2/RMSE)
• Aggregate models → robust understanding and predictions
• Pick “Forest” over “Trees”

[Figure: power of ensemble modeling]


Q3 – How Much Data Do I Need?
• Large datasets can produce poor results if key causal variables
are not included in the model
• Robust models can be built with small datasets if all relevant
causal variables are included in the model

[Figures: example applications with N = 81 and N = 2935]


Q4 – How Do I Learn Data Science?

• Citizen data scientist/analyst (one who learns from data)
– Basic skills → domain knowledge (e.g., PE)
– New skills → statistics/ML, programming

• Core (data science) competencies
– Data collection, preparation, exploration
– Data storage and retrieval
– Computing with data
– Applied machine learning
– Data visualization/communication

Donoho, J. Comp. Graphical Stat., 2017
https://www.linkedin.com/pulse/new-venn-diagram-data-science-pierluigi-casale/



Looking Ahead
• Growing trend towards the use of statistical and machine
learning techniques for subsurface applications
• Goal → “mine” big data and develop data-driven insights to
improve reservoir description & performance prediction
• Geo-scientists/engineers need to develop better awareness of
full repertoire of available techniques and their potential
• Data scientists need to understand problem domain to
propose/apply appropriate techniques
• Decision makers, regulators and other stakeholders need to
be educated on what ML can and cannot do



Dr Srikanta Mishra
Battelle, Columbus, USA
(512) 351-6038
mishras@battelle.org

