
BEST PRACTICES FOR

MACHINE LEARNING APPLICATIONS


IN PETROLEUM ENGINEERING

Dr. Srikanta Mishra


Battelle Memorial Institute, USA
mishras@battelle.org

Lecture presented to
MTech/BTech Pet Engg Students
IIT (ISM) Dhanbad
November 2, 2022
Presentation Outline

CONTEXT 

MECHANICS 

CAVEATS 

Data-Driven Modeling
• Classical statistics → have to postulate a model between
independent (predictor) & dependent (response) variables
• However, there is a growing need to look beyond linear regression (and
its variants) for complex multi-dimensional data sets
• Idea is to extract the model from the data without making any
assumptions regarding the underlying functional form
– key difference from the classical approach
– regression problems, where the response variable is
continuous (e.g., permeability)
– classification problems, where the response variable is
categorical (e.g., rock type)



Some Definitions

• Data analytics → data collection and analysis to understand hidden
patterns and relationships
– Machine learning → enabling tool for building a model between
predictors and response, typically using “black-box” methods
• Artificial intelligence → applying the predictive model with new data
to make decisions, without human intervention
(with possibility of feedback)

[Figure: workflow from Collect Data (A, C) to Infer Rules / predictive model (B)
to Make Decision (D, E); Data analytics → A, B; Machine learning → B;
Artificial intelligence → C, B, D, E]

Mishra et al., 2021, JPT (March), 25-30.

Types of Analytics

[Figure: types of analytics (source: blog.gartner.com)]



Why Data-Driven Models?
• Mechanistic modeling in conventional reservoirs is data- and
computation-intensive
• Mechanistic modeling in unconventional reservoirs is complex
(coupled processes) and often immature
• Physics-based modeling in other areas (e.g., wellbore flow,
drilling operations, equipment failure) is often not robust
• Empirical models (e.g., decline curves) are a popular alternative
but have many limitations (model form, parameterization)
• Data-driven models are emerging as an alternative approach
(let the “machine” learn about the system from the data)



Why ML Models and When?
• Historically, subsurface science and engineering analyses
have relied on mechanistic models
– incorporation of causal input-output relationships
• Experienced professionals are wary of purely “black-box”
ML models that lack such understanding
• Nevertheless, the use of ML models is easy to justify when
➢ the relevant physics-based model is computation-intensive
and/or immature
➢ a suitable mechanistic modeling paradigm does not exist

Mishra et al., 2021, JPT (March), 25-30.



Growing Application Base in O&G

[Chart: “Machine Learning” hits from the OnePetro database, 1985–2025 –
lots of examples exist!]

Mishra et al., 2021, JPT (March), 25-30.

Data-Driven Modeling Mechanics

[Workflow: Exploratory Data Analysis → Feature Selection → Multivariate Analysis →
Model Building (train/predict) → Cross-Validation → Variable Importance]



Exploratory Data Analysis
• Primarily based on visualization + basic metrics
– Range, variability and distribution
– Correlation between variables
– Outliers
– Underlying trend
– Natural groupings

• Also known as “descriptive analytics”
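
A minimal sketch of such an exploratory pass in Python (the file name and
columns below are hypothetical placeholders for any tabular well/core dataset):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical file; any tabular dataset with numeric columns works
df = pd.read_csv("well_data.csv")

# Range, variability and distribution of each variable
print(df.describe())

# Correlation between variables (Pearson and rank-based Spearman)
print(df.corr(method="pearson", numeric_only=True))
print(df.corr(method="spearman", numeric_only=True))

# Quick visual check for trends, natural groupings and outliers
sns.pairplot(df)
plt.show()
```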




Scatter Plot Matrix (Multivariate Data)

Mishra et al., 2014, Env. Geosci., 21(2), 59-74.




Dealing with Missing Variables
• Key assumption → the missing-data mechanism has not distorted
the observed data
• Multiple possible strategies (see the sketch below)
– Fill in (“impute”) each missing value with the mean or median
of the non-missing values for that input
– Build a predictive model for each input given the other inputs,
and impute missing values
– Use the built-in methods in RF and GBM
• Check for consistency between original and imputed distributions
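
One possible sketch of the simple mean/median imputation strategy, using
scikit-learn's SimpleImputer (dataset and column names are hypothetical):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("well_data.csv")                     # hypothetical dataset
predictors = ["porosity", "tvd", "lateral_length"]    # hypothetical columns

# Impute each missing value with the median of the non-missing values for that input
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(df[predictors]), columns=predictors)

# Check for consistency between original and imputed distributions
print(df[predictors].describe())
print(imputed.describe())
```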




Handling Outliers
• Both univariate and bi/multi-variate analysis may be needed
• Start with a simple plot of the normalized variable (z-score) to flag
samples with |z| > 3, where z = (x - m)/s (assuming normality) –
sketched below
• Look for differences between Pearson (ordinary) and Spearman (rank)
correlation coefficients – an indicator of outliers or non-linearity
• Identify values beyond a pre-defined percentile threshold (P5-P95) –
also possible with boxplots
• Outliers can simply be removed, or replaced with min/max quantiles

[Figure annotation: human input error (typo)? Need for data QA/QC!]
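
A rough sketch of these checks, assuming a hypothetical dataset and column names:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("well_data.csv")      # hypothetical dataset
x = df["permeability"]                 # hypothetical column

# Flag samples whose z-score exceeds 3 in absolute value (assumes near-normality)
z = (x - x.mean()) / x.std()
print(df[np.abs(z) > 3])

# A large gap between Pearson and Spearman correlations hints at outliers or non-linearity
pearson = df["porosity"].corr(df["permeability"], method="pearson")
spearman = df["porosity"].corr(df["permeability"], method="spearman")
print(pearson, spearman)

# Replace values beyond the P5-P95 range with those quantiles (winsorizing)
capped = x.clip(lower=x.quantile(0.05), upper=x.quantile(0.95))
```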

Visualizing Multivariate Correlations

[Figure: “Correlation CO2_MMP.csv using Spearman” – Spearman rank correlation
matrix for the CO2 MMP dataset (variables include MMP, T, API, Vol, Int, V.I,
C1, C2.C6, C7., MWC5., MWC7.), generated with Rattle]
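
A comparable heatmap can be produced in Python; a small sketch assuming the
CO2_MMP.csv file from the slide is available locally:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("CO2_MMP.csv")   # dataset name taken from the slide; path assumed

# Spearman (rank) correlations are less sensitive to outliers and monotone non-linearity
corr = df.corr(method="spearman", numeric_only=True)

sns.heatmap(corr, vmin=-1, vmax=1, cmap="RdBu_r", annot=True, fmt=".2f")
plt.title("Spearman correlation matrix")
plt.show()
```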



Regression Coefficients

Regression Statistics
Multiple R          0.733851
R Square            0.538538   ← fraction of total variance explained by the model
Adjusted R Square   0.522625
Standard Error      44.71329   ← estimated SD of the error term in the regression (≈ RMSE)
Observations        31

              Coeff    Std Error   t Stat   P-value    Lower 95%   Upper 95%
Intercept     97.397   13.844      7.035    9.75E-08   69.082      125.711
X Variable 1  2.063    0.355       5.818    2.63E-06   1.337       2.788

Notes:
– Coeff and Std Error = mean and SD of the regression coefficients
– t Stat = Coeff/SE → the bigger the better
– P-value → the smaller the better (likelihood that Coeff is different from zero)
– Lower/Upper 95% ≈ Coeff ± 2·SE
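
Similar output can be reproduced in Python with statsmodels; a small sketch with
hypothetical dataset and column names:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mmp_data.csv")            # hypothetical dataset
X = sm.add_constant(df[["temperature"]])    # adds the Intercept term
y = df["MMP"]                               # hypothetical response column

model = sm.OLS(y, X).fit()
print(model.summary())   # R-squared, coefficients, std errors, t stats, p-values, 95% CIs
```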
How Many Terms in Regression?

• Seek a parsimonious balance between goodness of fit and
model complexity
• Minimize AIC (Akaike Information Criterion)

AIC = n log(SSE/n) + 2p

where
n = number of observations
p = number of model parameters
SSE = residual sum of squares
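
A short sketch of computing AIC from a fitted linear model using the definition
above (dataset and column names hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("well_data.csv")                 # hypothetical dataset
X = df[["porosity", "tvd", "lateral_length"]]     # hypothetical predictors
y = df["cum_oil_12mo"]                            # hypothetical response

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

n = len(y)                      # number of observations
p = X.shape[1] + 1              # model parameters (slopes + intercept)
sse = np.sum(residuals ** 2)    # residual sum of squares

aic = n * np.log(sse / n) + 2 * p   # AIC as defined on the slide
print(aic)
```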




Predictive Modeling Methods

• Regression & Classification Tree – partition the parameter space into
rectangular regions with constant values or class labels
• Random Forest – build an ensemble of trees using random subsets of
observations and predictors
• Gradient Boosting Machine – build a sequence of trees, each addressing
the shortcomings of the previous fitted tree
• Support Vector Machine – find a hyperplane maximizing separation of the
data, after transforming the data into a linear space
• Artificial Neural Network – inputs mapped to outputs via hidden units
using a sequence of nonlinear functions
• Gaussian Process Emulation – multidimensional interpolation considering
the trend and autocorrelation structure of the data

[Figure: example tree with splits X1 < t1, X2 < t2, X2 < t3 leading to
regions R1–R4]
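
As a rough illustration, all of these methods are available through a common
scikit-learn interface; the dataset and column names below are hypothetical:

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.gaussian_process import GaussianProcessRegressor

df = pd.read_csv("well_data.csv")                 # hypothetical dataset
X = df[["porosity", "tvd", "lateral_length"]]     # hypothetical predictors
y = df["cum_oil_12mo"]                            # hypothetical response

models = {
    "tree": DecisionTreeRegressor(),
    "random_forest": RandomForestRegressor(n_estimators=500),
    "gbm": GradientBoostingRegressor(),
    "svm": SVR(),
    "ann": MLPRegressor(hidden_layer_sizes=(20,), max_iter=5000),
    "gp": GaussianProcessRegressor(),
}

for name, model in models.items():
    model.fit(X, y)
    print(name, model.score(X, y))   # training R^2 only; see the cross-validation slide
```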



Deep Learning Approaches

Convolutional Neural Network
– Uses: extracting local (neighborhood) information from data. Very stable
model that scales to very large data sets. Fewer parameters than fully
connected models.
– Drawbacks: does not capture global information from data (only local
relationships).

Recurrent Neural Network
– Uses: capturing sequential relationships in data. Can capture long-term
dependencies in data (subject-verb relationships in text, etc.).
– Drawbacks: the sequential nature of the model leads to performance issues
(not easily parallelizable). Can suffer from exploding/vanishing gradients.

Autoencoder
– Uses: compressing data into a smaller number of features. Can be useful for
data exploration, denoising, and generative models.
– Drawbacks: not typically useful for standard tasks like regression or
classification.

Reinforcement Learning
– Uses: when optimizing a reward signal is the goal, e.g., maximizing points
in a video game. A good option when the training signal isn’t described via a
loss (MSE, cross entropy, etc.).
– Drawbacks: training is sample-inefficient (needs many examples to learn).
Stability of training is an issue. Need to balance exploration vs. exploitation
in training.

Transfer Learning
– Uses: most useful when performing a task that is well explored (text, images)
with a large number of pre-trained models. By freezing model parameters, we can
maximize performance when we have limited data for our task.
– Drawbacks: requires pre-trained models for a similar task to exist, and this
isn’t always the case.



k-fold Cross Validation

Recommended if an independent test dataset is not available.

[Figure: the full dataset is split into k folds – each model is trained on the
remaining folds and used to predict the held-out fold]
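
A minimal sketch of 10-fold cross-validation with scikit-learn (dataset and
column names hypothetical):

```python
import pandas as pd
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("well_data.csv")                 # hypothetical dataset
X = df[["porosity", "tvd", "lateral_length"]]     # hypothetical predictors
y = df["cum_oil_12mo"]                            # hypothetical response

# 10-fold CV: each fold is held out once while the model is trained on the other nine
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestRegressor(n_estimators=500), X, y,
                         cv=cv, scoring="r2")
print(scores.mean(), scores.std())
```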



Evaluating Model Fits

[Figure: model fits for a variety of modeling methods under three types of
model validation – full training data, 10-fold CV, and held-out test data]



Model Aggregation – Why?
• Model fits measured in terms of training or test error –
multiple competing models may arise!
• Aggregating over a large set of acceptable models can provide
more robust understanding and predictions
• Ensemble models (with predictions aggregated) are generally
top performers in data science competitions



Ensemble Modeling Methods
• Model aggregation strategies (see the sketch below)
– Simple averaging (direct average of constituent model predictions,
e.g., using the arithmetic average)
– Weighted averaging (weighted average of constituent model
predictions, e.g., weights from the inverse of RMSE)
– Stacking (predictions from the constituent models are used as
predictors in an aggregate model, e.g., NN training)

• Similar arguments underlie Beven’s concept of
“equifinality” in watershed hydrology modeling
with the GLUE framework (weighted averaging)

Schuetter et al., 2019, URTeC-929
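
A hedged sketch of simple and weighted (inverse-RMSE) averaging of constituent
model predictions; in practice the weights would come from cross-validation
rather than the test split, and the dataset and column names are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

df = pd.read_csv("well_data.csv")                 # hypothetical dataset
X = df[["porosity", "tvd", "lateral_length"]]     # hypothetical predictors
y = df["cum_oil_12mo"]                            # hypothetical response
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = [RandomForestRegressor(n_estimators=500),
          GradientBoostingRegressor(),
          SVR()]
preds, rmses = [], []
for m in models:
    m.fit(X_tr, y_tr)
    p = m.predict(X_te)
    preds.append(p)
    rmses.append(np.sqrt(mean_squared_error(y_te, p)))

preds = np.array(preds)
simple_avg = preds.mean(axis=0)                    # simple averaging
w = 1 / np.array(rmses)                            # inverse-RMSE weights
w /= w.sum()
weighted_avg = (w[:, None] * preds).sum(axis=0)    # weighted averaging
```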



Variable Importance Approaches

• Remove (removing a variable) – remove a variable from the model, re-train the
model and compare the reduction in pseudo-R2, i.e., R2 loss.
• Permute (permuting a variable) – permute a variable’s values, which breaks the
relationship between the variable and the true outcome, then compare the R2 loss
of the dataset with permuted values to that with true values.
• PDP (Partial Dependence Plot) – shows the marginal effect of different variables
on the predicted outcome. PDPs are “flat” for less important variables, while
variables whose PDP varies across a wider range of the response are more likely
to be important.
• ALE (Accumulated Local Effects plot) – compares how the model predictions change
in a small “window” of different variables. ALE plots are a faster and unbiased
alternative to partial dependence plots.
• LIME (Local Interpretable Model-agnostic Explanations) – attempts to understand
the model by perturbing the input data samples and interpreting how the
predictions change. Variable weights can then be extracted from a simple local
model on the perturbed dataset to explain local behavior.
• SHAP (SHapley Additive exPlanations) – explains individual predictions based on
the game-theoretically optimal Shapley values. A prediction can be explained by
assuming that each feature value of the instance is a “player” in a game where
the prediction is the payout; Shapley values – a method from coalitional game
theory – tell us how to fairly distribute the “payout” among the features.

https://christophm.github.io/interpretable-ml-book/
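
A small sketch of the “Permute” strategy using scikit-learn’s
permutation_importance (dataset and column names hypothetical):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

df = pd.read_csv("well_data.csv")                 # hypothetical dataset
X = df[["porosity", "tvd", "lateral_length"]]     # hypothetical predictors
y = df["cum_oil_12mo"]                            # hypothetical response
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=500).fit(X_tr, y_tr)

# Permute each predictor in turn and record the drop in R^2 on held-out data
result = permutation_importance(model, X_te, y_te, n_repeats=20,
                                random_state=0, scoring="r2")
for name, imp in zip(X.columns, result.importances_mean):
    print(name, round(imp, 3))
```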

Identifying Predictor Importance




Classification Tree Analysis
▪ Binary classification to identify factors separating top 25% from
bottom 25% producing wells

▪ Accuracy:

              BOTTOM 25%   TOP 25%   CORRECT ID
BOTTOM 25%        62          18        78%
TOP 25%            7          73        91%
TOTAL             69          91        70%

[Tree interpretation – Top 25%: not too shallow, not too deep, long lateral
with more proppant, but not too long]
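
A minimal sketch of such a binary classification tree and its confusion matrix
(dataset, columns, and the top/bottom-quartile label are hypothetical):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

df = pd.read_csv("well_data.csv")                        # hypothetical dataset
X = df[["tvd", "lateral_length", "proppant_per_ft"]]     # hypothetical predictors
y = df["top_quartile"]                                   # hypothetical 0/1 label (bottom/top 25%)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(max_depth=3).fit(X_tr, y_tr)   # shallow tree stays interpretable
print(confusion_matrix(y_te, tree.predict(X_te)))            # rows = true class, cols = predicted
```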




Challenges for Acceptance of ML

“Our ML models are not very good”
• Consumer marketing ML/AI models are not necessarily highly accurate!
• Need to manage expectations re. quality of fit for subsurface models
• Focus more on added value from ML models + complementary role

“If I don’t understand the model, how can I believe it?”
• Articulate adequacy of predictors
• Demonstrate model robustness
• Explain inner workings (key variables)
• Use creative visualizations

“We are still waiting for the ‘Aha’ moment!”
• ML model may or may not produce new insights
• Provides an alternative quantitative input-output relationship
• Useful when physics-based model is slow, data-intensive or immature

“My staff need to learn data science, but how?”
• Significant (informal) self-learning to become “citizen data scientists”
• Need formal knowledge of conventional data analysis, python/R
programming, and machine learning

Mishra et al., 2021, JPT (March)

Recommended Approach & Mindset
• Framing the problem
• Selecting the causal variables
• Checking the data
• Choosing the modeling technique(s)
• Validating the model
• Understanding variable importance
• Communicating the results and caveats

▪ Data-driven models can complement physics- and geology-based insights
▪ Need to “understand + apply & predict + communicate”
▪ Danger of becoming “technicians” if understanding of foundational
concepts is lacking

Template for documenting results + reviewing/evaluating other studies



Data-Driven Modeling Steps
• Frame the problem (what is the question I want to answer?
What causal variables do I need?)
• Explore/check the data (distributions – correlations –
significance – outliers – missing variables)
• Prepare data for model building/testing (80/20 or 70/30
split with replicates, k-fold cross-validation, blind test data)
• Try at least 3 models (reference linear model, one tree-based
model, one non-tree-based model) – see the sketch below
– Look at model performance with training vs. test data
– Select best model(s) – train on all data – aggregate as needed
• Determine variable importance (R2-loss, error rate change)
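
A compact sketch of this recipe with an 80/20 split and three model families
(dataset and column names hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

df = pd.read_csv("well_data.csv")                 # hypothetical dataset
X = df[["porosity", "tvd", "lateral_length"]]     # hypothetical predictors
y = df["cum_oil_12mo"]                            # hypothetical response

# 80/20 split for model building vs. testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Reference linear model, one tree-based model, one non-tree-based model
for name, model in [("linear", LinearRegression()),
                    ("random_forest", RandomForestRegressor(n_estimators=500)),
                    ("svm", SVR())]:
    model.fit(X_tr, y_tr)
    print(name, model.score(X_tr, y_tr), model.score(X_te, y_te))  # train vs. test R^2
```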



Q1 – Do I Need Machine Learning?

[Figure: comparison of a linear model fit with a random forest fit]


Q2 – Which Technique Works Best?
• No single technique is consistently the best performer
• Often, multiple competing models have equally good fits (R2/RMSE)
• Aggregate models → robust understanding and predictions
• Pick “Forest” over “Trees”

[Figure: power of ensemble modeling]


Q3 – How Much Data Do I Need?
• Large datasets can produce poor results if key causal variables
are not included in the model
• Robust models can be built with small datasets if all relevant
causal variables are included in the model

[Figures: example applications with N = 81 and N = 2935]


Q4 – How Do I Learn Data Science?

• Citizen data scientist/analyst (one who learns from data)
– Basic skills → domain knowledge (e.g., PE)
– New skills → statistics/ML, programming

• Core (data science) competencies
– Data collection, preparation, exploration
– Data storage and retrieval
– Computing with data
– Applied machine learning
– Data visualization/communication

Donoho, J. Comp. Graphical Stat., 2017
https://www.linkedin.com/pulse/new-venn-diagram-data-science-pierluigi-casale/



Looking Ahead
• Growing trend towards the use of statistical and machine
learning techniques for subsurface applications
• Goal → “mine” big data and develop data-driven insights to
improve reservoir description & performance prediction
• Geo-scientists/engineers need to develop better awareness of
full repertoire of available techniques and their potential
• Data scientists need to understand problem domain to
propose/apply appropriate techniques
• Decision makers, regulators and other stakeholders need to
be educated on what ML can and cannot do



Dr Srikanta Mishra
Battelle, Columbus, USA
(512) 351-6038
mishras@battelle.org

