
by Srikanta Mishra and Akhil Datta-Gupta


*Applied Statistical Modeling and Data Analytics: A Practical Guide for the Petroleum Geosciences* provides a practical guide to many of the classical and modern statistical techniques that have become established for oil and gas professionals in recent years. It serves as a "how to" reference volume for the practicing petroleum engineer or geoscientist interested in applying statistical methods in formation evaluation, reservoir characterization, reservoir modeling and management, and uncertainty quantification.

Beginning with a foundational discussion of exploratory data analysis, probability distributions and linear regression modeling, the book focuses on fundamentals and practical examples of such key topics as multivariate analysis, uncertainty quantification, data-driven modeling, and experimental design and response surface analysis. Data sets from the petroleum geosciences are extensively used to demonstrate the applicability of these techniques. The book will also be useful for professionals dealing with subsurface flow problems in hydrogeology, geologic carbon sequestration, and nuclear waste disposal.

• Authored by internationally renowned experts in developing and applying statistical methods for oil & gas and other subsurface problem domains

• Written by practitioners for practitioners

• Presents an easy-to-follow narrative which progresses from simple concepts to more challenging ones

• Includes online resources with software applications and practical examples for the most relevant and popular statistical methods, using data sets from the petroleum geosciences

• Addresses the theory and practice of statistical modeling and data analytics from the perspective of petroleum geoscience applications

Publisher: Elsevier. Released: Oct 27, 2017. ISBN: 9780128032800.


**Chapter 1**

**Basic Concepts**

Statistics is the science of acquiring and utilizing data. It provides us with the tools for data collection, summarization, and interpretation, with the goal of identifying the underlying structure, trends, and relationships inherent in the data. This is how we convert data into information.

Keywords: Analysis; Bayes rule; Data; Knowledge; Population; Probability; Statistic

1.1 **Background and Scope**

1.1.1 **What Is Statistics?**

1.1.2 **What Is Big Data Analytics?**

1.1.3 **Data Analysis Cycle**

1.1.4 **Some Applications in the Petroleum Geosciences**

1.2 **Data, Statistics, and Probability**

1.2.1 **Outcomes and Events**

1.2.2 **Probability**

1.2.3 **Conditional Probability and Bayes Rule**

1.3 **Random Variables**

1.3.1 **Discrete Case**

1.3.2 **Continuous Case**

1.3.3 **Indicator Transform**

1.4 **Summary**

**Exercises**

**References**

We introduce the reader to some fundamental concepts of classical statistics such as probability and random variables, along with basic concepts from the emerging field of data analytics and big-data technologies. We also list some typical applications of the relevant techniques for data analysis in the petroleum geosciences.

Statistics is the science of acquiring and utilizing data. It provides us with the tools for data collection, summarization, and interpretation—with the goal of identifying the underlying structure, trends, and relationships inherent in the data. This is how we convert data into information.

Fundamental to statistics are the concepts of *population* and *sample*. A population is the universe of all possible outcomes and events, whereas a sample is a finite subset extracted from the population. Statistical analyses are performed on the sampled data to draw *inference* about the characteristics of the population, without having to study the entire population. The population is exhaustive and is characterized by its *parameters*. The sample is limited and is characterized by the *statistic* that is related to the population parameters.

**Fig. 1.1** shows a schematic of the relationship between population and sample. Here, the population represents permeability values for an entire oil reservoir at the scale of a small core plug. To learn more about the distribution of permeability values, in step (1), we randomly sample this population using a finite number of core plugs (e.g., 250). In step (2), we analyze these permeability values to determine the proportion of plugs with permeability greater than 10 mD (e.g., 65%). Finally, in step (3), we determine the representativeness of this result for the entire population (e.g., 95% certain that the margin of error is ± 6%).

**Fig. 1.1** Schematic showing the population–sample relationship.
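The three-step workflow of Fig. 1.1 can be sketched numerically. The snippet below is a minimal illustration, not the authors' code: the synthetic lognormal "population" and its parameters are assumptions, chosen only to mimic the example of 250 core plugs and a proportion of plugs above 10 mD with its 95% margin of error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: lognormally distributed core-plug permeabilities (mD)
population = rng.lognormal(mean=2.5, sigma=1.0, size=1_000_000)

# Step (1): randomly sample a finite number of core plugs (e.g., 250)
sample = rng.choice(population, size=250, replace=False)

# Step (2): proportion of plugs with permeability greater than 10 mD
p_hat = np.mean(sample > 10.0)

# Step (3): 95% margin of error for the sample proportion
# (normal approximation: z * sqrt(p(1 - p)/n), with z = 1.96 for 95% confidence)
margin = 1.96 * np.sqrt(p_hat * (1.0 - p_hat) / len(sample))

print(f"proportion > 10 mD: {p_hat:.2f} +/- {margin:.2f}")
```

With n = 250 and a proportion near 0.5–0.65, the margin of error works out to roughly ± 6%, matching the example in the text.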

Application of statistics to any dataset generally begins with exploratory data analysis. Here, the goal is to quantify and visualize the range of values a given variable can take, summary attributes such as averages and spread, and the nature and strength of correlation between two or more variables (**Chapter 2**). In the next step, the distribution of the variable is examined to understand the relative likelihood of various values within the observed range and the possibility of describing the distribution using a compact mathematical form (**Chapter 3**). Another common task involves exploring how the relationship between two variables can be described using a linear regression model or variants thereof (**Chapter 4**). When multiple variables are included in the dataset, it is useful to identify the degree of redundancy among different variables and whether the dataset can be partitioned into statistically homogeneous subpopulations (i.e., clusters). This is the scope of multivariate analysis (**Chapter 5**).

The broad classes of techniques described above fall within the realm of classical statistics and have been employed by petroleum engineers and geoscientists for many years (see **Stanley, 1973** and references therein). Recent contributions (e.g., **Davis, 2002**; **Jensen et al., 2000**) discuss the geoscience-oriented application of these techniques in greater detail, including other topics not covered in this book such as geostatistics and time series analysis.

Statistical methods are also relevant in the context of uncertainty analysis, where the goal is to translate the uncertainty in the inputs of a model into uncertainty in corresponding model predictions (**Chapter 6**). Here, the concepts mentioned in the previous paragraph are fundamental to characterizing the uncertainty both in the model inputs and the model results and building predictive models that relate the specified uncertain inputs to the computed uncertain outputs. Another important application is with respect to design of experiments, both physical and computational (**Chapter 7**). Statistical approaches are useful for determining how to construct a limited number of experiments that properly span the design space and how to fit a response surface to the experimental results that can be used as a surrogate model.
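The input-to-output uncertainty translation described above can be sketched with a simple Monte Carlo simulation. The volumetric model and all input distributions below are hypothetical, chosen only to illustrate propagating sampled inputs through a model and summarizing the spread of the output.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000  # number of Monte Carlo realizations

# Hypothetical uncertain inputs (names and distributions are illustrative only)
porosity = rng.uniform(0.10, 0.30, n)              # fraction
thickness = rng.triangular(10.0, 20.0, 40.0, n)    # meters
area = rng.lognormal(np.log(2.0e6), 0.3, n)        # square meters

# Simple volumetric model: pore volume = area * thickness * porosity
pore_volume = area * thickness * porosity

# Uncertainty in the model output, summarized by percentiles
p10, p50, p90 = np.percentile(pore_volume, [10, 50, 90])
print(f"P10={p10:.3g}  P50={p50:.3g}  P90={p90:.3g} m^3")
```

Each realization draws one value per uncertain input, so the resulting distribution of pore volumes directly reflects the specified input uncertainty.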

The terms *big data* and *data analytics* have become buzzwords in recent years, especially because of many reported applications in areas such as consumer marketing, health and life sciences, and national security. This has led to the perception that big data analytics has the potential to be a game changer for oil and gas applications (**Holdaway, 2014**). The industry is beginning to explore the possibilities of mining large volumes of data about the subsurface, physical infrastructure, and flows to obtain new insights about the reservoir that can help increase operational efficiencies.

*Big data* generally refers to large, multivariate datasets characterized by the three V's: volume, variety, and velocity (**Fig. 1.2**). Volume refers to the size of the data, where we are increasingly dealing with ~ 10²–10⁴ independent variables and ~ 10³–10⁶ observations or data records, each collected at multiple temporal and/or spatial locations. Variety refers to data in multiple formats such as numbers, video, and text, which can be both structured and unstructured, and requires a combination of numerical methods, image analysis, and/or natural language processing. Velocity refers to the growing ubiquity of real-time streaming data from downhole sensors or surface gauges, which adds to the size of the dataset with additional considerations such as data archival, resampling, and redundancy analysis.

**Fig. 1.2** Big data analytics—what and why.

As shown in **Fig. 1.2**, data analytics is the process of (a) examining the data, (b) understanding what the data say and learning from the data, and (c) making predictions based on these data-driven insights that (hopefully) lead to better decisions (**Hastie et al., 2008**). Essentially, data analytics methods are applied to help understand hidden patterns and relationships in large and complex datasets. A number of equivalent terms such as statistical learning, knowledge discovery, data mining, and data-driven modeling are often used interchangeably to describe this collection of techniques, which are drawn from computer science, machine learning, and artificial intelligence (**Chapter 8**).

From an information technology perspective, however, the scope of data analytics is somewhat broader because it includes the following steps (**IDC Energy Insights, 2014**):

• *Data organization and management*, which involves data collection, warehousing, tagging, QA/QC, normalization, integration, and extraction.

• *Analytics and discovery*, which involves software-driven analysis, predictive model building, and extraction of data-driven insights.

• *Decision support and automation*, which involves deploying rule-based systems with functionality to support collaboration, scenario evaluation, and risk management.

Although big data has not become ubiquitous in the oil and gas industry, a vision for how big-data-related technologies can be implemented in the context of exploration and production operations is described in **Brulé (2015)**.

For petroleum geoscience applications, it is more useful to consider statistical modeling and data analytics as part of an integrated data analysis cycle, as shown in **Fig. 1.3**. The scope of the various work elements that comprise this cycle is explained below.

**Fig. 1.3** Schematic of the data analysis cycle.

*Data collection and management*. This step involves the acquisition and aggregation of data from multiple sources (e.g., cores, well logs, and production records), possibly in multiple forms (e.g., numbers and text). The data also undergo a QA/QC process to ensure the traceability and accuracy of each data record. Finally, the data have to be made easily available for visualization and analysis.

*Exploratory data analysis*. The goal of this step is to develop a preliminary understanding of the data in terms of the characteristics of individual variables and the relationships among various variables. Other objectives include identifying key variables of interest, formulating questions for digging deeper into the data, and selecting techniques that will be used for detailed analysis. The relevant concepts involved in this step are discussed in **Chapters 2 and 3**.
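As a minimal illustration of this exploratory step, the sketch below computes summary attributes and a correlation coefficient for a pair of variables. The porosity and log-permeability values are made-up numbers for demonstration, not data from the book.

```python
import numpy as np

# Hypothetical core-plug measurements: porosity (fraction) and log-permeability
porosity = np.array([0.12, 0.18, 0.22, 0.15, 0.25, 0.20, 0.17, 0.28])
log_perm = np.array([0.8, 1.5, 2.1, 1.1, 2.6, 1.9, 1.4, 2.9])

# Univariate summaries: location and spread of each variable
print("mean porosity:", porosity.mean())
print("std porosity: ", porosity.std(ddof=1))  # sample standard deviation

# Bivariate summary: strength of the linear association between the variables
r = np.corrcoef(porosity, log_perm)[0, 1]
print("porosity/log-perm correlation:", round(r, 3))
```

Summaries like these help identify key variables of interest and suggest whether, for example, a linear regression model is worth pursuing in the next step.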

*Predictive modeling*. The analyses in this step generally begin with *unsupervised learning*, where the issues of redundancy among the independent variables and possible reduction in data dimensionality (without losing any information) are first addressed. This is followed by *supervised learning*, where observed values of a response variable are used to train a model between the independent variables (i.e., predictors) and the dependent variable (i.e., response). This predictive model can then be used to answer questions posed in the previous step. **Chapters 4–8** discuss the relevant concepts that are integral to this step.
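A compact sketch of this two-stage workflow, using only NumPy: an unsupervised step (principal component analysis via the singular value decomposition) followed by a supervised step (least-squares regression on the leading component scores). The synthetic dataset and its dimensions are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dataset: 100 samples, 5 correlated predictors, one response
n = 100
latent = rng.normal(size=(n, 2))                      # two underlying factors
X = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(n, 5))
y = latent @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=n)

# Unsupervised step: principal component analysis via the SVD
Xc = X - X.mean(axis=0)                               # center the predictors
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)                       # variance per component
print("variance explained by first 2 PCs:", round(explained[:2].sum(), 3))

# Supervised step: least-squares regression on the two leading PC scores
scores = Xc @ Vt[:2].T
A = np.column_stack([np.ones(n), scores])             # intercept + scores
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coef
r2 = 1 - resid.var() / y.var()
print("R^2 of PC regression:", round(r2, 3))
```

Because the five predictors are driven by only two underlying factors, the first two components capture nearly all the variance, and regressing on their scores reduces dimensionality without sacrificing predictive power.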

*Visualization and reporting*. The ultimate goal of any modeling and/or analysis is to provide input for a decision by transferring information to decision-makers. It is therefore necessary to capture what has been learned in the form of visual summaries, compact reports, or decision-support tools that can be used to answer what-if type questions. Another useful outcome from this step is the use of insights from predictive modeling to identify what new data should be collected and the kinds of questions to pursue in the future.

The principles described throughout the book are explained with the help of many illustrative examples and problems to demonstrate their practical applicability. These include the following:

• Determining conditional probabilities of cause-effect relationships

• Computing summary statistics (e.g., mean and variance)

• Calculating correlation and rank correlation coefficients between two variables

• Visualizing univariate, bivariate, and multivariate data

• Estimating probability coverage levels for different distributions

• Analyzing behavior of normal and lognormal distributions

• Calculating confidence interval and sampling distribution for the mean

• Testing for significance of difference in means

• Comparing two different distributions for statistical equivalence

• Fitting simple and multiple linear regression models to observed data

• Developing a nonparametric regression model from given data

• Reducing data dimensionality with principal component analysis

• Grouping data with k-means and hierarchical clustering

• Identifying classification boundary between clusters using discriminant analysis

• Developing distributions from data, limited knowledge, or subjective judgment

• Translating model input uncertainty into uncertainty in model predictions using Monte Carlo simulation and analytic alternatives

• Analyzing input-output dependencies from Monte Carlo simulation results

• Creating an experimental design and fitting a response surface to the results

• Applying machine learning techniques (e.g., random forest, gradient boosting machine, support vector regression, and kriging model) for predictive modeling

• Generating decision rules with classification tree analysis

Some of the examples listed here are purely pedagogic in nature, while others are based on actual datasets (albeit reduced in size to make the presentation tractable). Finally, several field datasets have been analyzed to demonstrate how multiple methods come together in the context of linear and nonparametric regression analysis, multivariate analysis, and data-driven modeling.

Generally, there is some degree of unpredictability or randomness associated with most natural phenomena. We can represent this unpredictability in terms of the many possible outcomes of an experiment to define what can happen.

Simply put, statistics is concerned with the determination of the probable (events) given the possible (outcomes) (**Davis, 2002**). Formally stated, outcomes are elements of the sample space Ω, events are an appropriate subset of Ω, and probability, P, is the likelihood of the event occurring (0 ≤ P ≤ 1).

The sample space, *Ω*, is a set whose elements describe outcomes of the experiment of interest. For example, if the experiment is a wildcat well with two possible outcomes—dry well (*D*) or success (*S*), then the sample space is *Ω* = {*D*, *S*}. If the experiment is porosity determination from core samples, where porosity is a fraction that can take any value between 0 and 1, then the sample space is the continuous interval *Ω* = [0, 1]. Another experiment could be the order in which three wells are tested—leading to six different outcomes—with the sample space being *Ω* = {123, 132, 213, 231, 312, 321}.
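The three-well ordering experiment can be enumerated directly. The snippet below is a small illustration of listing a sample space and computing the probability of an event as the fraction of equally likely outcomes it contains.

```python
from itertools import permutations

# Sample space for the order in which three wells (1, 2, 3) are tested
omega = {"".join(p) for p in permutations("123")}
print(sorted(omega))  # six equally likely outcomes

# Event: well 1 is tested first
event = {o for o in omega if o.startswith("1")}
prob = len(event) / len(omega)
print("P(well 1 tested first) =", prob)  # 2/6
```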
