
Machine Learning for Chemical Engineers

CHE F315

Ajaya Kumar Pani


Department of Chemical Engineering
BITS Pilani, Pilani Campus
Lecture-2
12-01-2024

Data Preprocessing

Outline

• Industrial data characteristics
• Missing values
• Outliers

16 January 2024 4
BITS Pilani, Pilani Campus
CHE F315 Machine Learning for Chemical Engineers

Industrial data characteristics and treatment
• Widespread use of distributed control systems, increasing use of
online sensors with short sampling intervals, and improved data
transmission and storage facilities have made huge amounts of
historical process data available.
• Data-driven process modeling, monitoring, prediction,
and control have received much attention in recent
years.
• By analyzing the patterns in process data and the
relationships among variables, useful information can be
extracted. Statistical models can then be developed from it
for applications such as process monitoring, fault diagnosis,
mode clustering, and soft sensing of key variables/quality
variables.


Industrial data characteristics and treatment
• Big data in the process industries is characterized by
volume, variety, and velocity, or simply V^3
• “Volume” refers to the size of ever-growing data sets, which
range from terabytes (10^12 bytes) to zettabytes (10^21
bytes)
• “Variety” describes the various types of data: process
measurements, text, audio, and images
• “Velocity” refers to the speed of big data generation


Industrial data characteristics and treatment
• Data preparation is the initial step in machine learning model
development
• The main tasks of this step are to extract the dataset from the
historical database, examine the structure of the dataset, and
select data along the sample and variable directions
• To extract an effective dataset from the historical
database, the operating regions of the process need to be
analyzed, and any changes in operating conditions need
to be identified
• To make the information extraction step efficient,
the characteristics of the process data should be
analyzed, such as non-Gaussianity, linear/nonlinear
relationships among variables, and time-series
correlations


Industrial data characteristics and treatment

Thebelt, A., Wiebe, J., Kronqvist, J., Tsay, C., & Misener, R. (2022). Maximizing information from chemical
engineering data sets: Applications to machine learning. Chemical Engineering Science, 252, 117469.


Industrial data characteristics and treatment
Normal Distribution
• The normal distribution is also known as the Gaussian
distribution.

Figure: Probabilities associated with the normal distribution
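
The probabilities associated with the normal distribution can be checked numerically. The sketch below is an illustration, not part of the original slides: it uses the standard identity P(|x − μ| ≤ kσ) = erf(k/√2) for a normally distributed x, and the function name prob_within is hypothetical.

```python
import math

def prob_within(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))

# Coverage for 1, 2 and 3 standard deviations
for k in (1, 2, 3):
    print(f"P(|x - mu| <= {k} sigma) = {prob_within(k):.4f}")
# prints 0.6827, 0.9545 and 0.9973
```

About 99.7% of normally distributed samples fall within 3σ of the mean, which is the basis of the 3σ outlier rule used later in the lecture.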


Data preprocessing

Data preprocessing is carried out to improve the quality of the data:
• Outliers and gross errors should be removed from the
modeling dataset, since they otherwise greatly
degrade the performance of the machine learning
model
• Missing values need to be addressed, e.g. by deletion of the
sample, missing value estimation, Bayesian inference,
etc.
• The scale difference among process variables needs to be
considered
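
One common way to handle the scale difference among variables is z-score standardization. The slides do not name a specific scaling method, so this is only a minimal sketch under that assumption; the variable names and sample values are hypothetical.

```python
import statistics

def standardize(values):
    """Scale a variable to zero mean and unit (sample) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation
    return [(v - mean) / std for v in values]

# Hypothetical measurements on very different scales
temperature_K = [350.0, 352.0, 349.0, 355.0, 351.0]
pressure_bar = [1.01, 1.03, 0.99, 1.05, 1.02]

temperature_scaled = standardize(temperature_K)
pressure_scaled = standardize(pressure_bar)
print(temperature_scaled)
print(pressure_scaled)
```

After scaling, both variables are dimensionless and of comparable magnitude, so neither dominates a distance-based or variance-based model purely because of its units.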


Data preprocessing

The raw data stored in databases in different formats are
not useful until they are cleaned and transformed.
Data cleaning consists of four steps:
• Missing data imputation
• Outlier detection and noise removal
• Time alignment
• Delay estimation


Data preprocessing
Missing data imputation
Missing values in the process industries are entries in the
data set that have no connection with the real state of
the process and take values such as ±∞, 0, or NaN (not a
number).
There are generally three missing-data patterns:
• Missing completely at random (MCAR)
• Missing at random (MAR)
• Missing not at random (MNAR)
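
A minimal sketch of flagging missing entries by the placeholder values listed above (±∞, 0, NaN). The helper is_missing and the sample flow readings are hypothetical. Whether 0 counts as missing depends on the variable, so it is left as an explicit option here.

```python
import math

def is_missing(value, zero_is_missing=False):
    """Return True if the entry carries no information about the process state."""
    if value is None:
        return True
    if math.isnan(value) or math.isinf(value):
        return True
    # A zero reading is only treated as missing when the caller says so
    return zero_is_missing and value == 0.0

# Hypothetical flow readings with NaN, infinity and a spurious zero
flow = [12.1, float("nan"), 11.8, float("inf"), 0.0, 12.3]
mask = [is_missing(v, zero_is_missing=True) for v in flow]
print(mask)
```

The resulting mask marks the second, fourth and fifth entries as missing and can be passed on to a deletion or replacement step.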


Data preprocessing
Missing data imputation
Figure: Common missing patterns in the process industries
• A and C: missing values for single/multiple variables, due to sensor failure
• B: values of some variables missing at the same time instances → fault
• D: a single variable showing regular missing values → multirate sampling

Xu, S., Lu, B., Baldea, M., Edgar, T. F., Wojsznis, W., Blevins, T., & Nixon, M. (2015). Data cleaning in the process
industries. Reviews in Chemical Engineering, 31(5), 453-490.

Data preprocessing

Missing data imputation


Deletion
• Eliminate any time point that contains missing values
• Works well for large datasets
• May sacrifice a large amount of data, reduce the statistical
power, and lead to biased parameter estimates with
more uncertainty
Replacement
• Mean replacement
• Interpolation replacement
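
The three options above can be sketched as follows. This is an illustration only: it assumes missing entries are encoded as NaN, samples are uniformly spaced in time, and each series has at least one observed value. All function names and data are hypothetical.

```python
import math

def delete_rows(rows):
    """Deletion: drop every time point (row) that contains a missing value."""
    return [r for r in rows if not any(math.isnan(v) for v in r)]

def mean_replace(series):
    """Mean replacement: fill gaps with the mean of the observed values."""
    observed = [v for v in series if not math.isnan(v)]
    mean = sum(observed) / len(observed)
    return [mean if math.isnan(v) else v for v in series]

def interpolate_replace(series):
    """Interpolation replacement: linear interpolation between observed neighbours.
    Gaps at the start or end take the nearest observed value."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if not math.isnan(v)]
    for i, v in enumerate(series):
        if not math.isnan(v):
            continue
        left = max((j for j in known if j < i), default=None)
        right = min((j for j in known if j > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            w = (i - left) / (right - left)
            filled[i] = series[left] + w * (series[right] - series[left])
    return filled

nan = float("nan")
rows = [(1.0, 5.0), (nan, 5.1), (3.0, nan), (4.0, 5.3)]
x = [1.0, nan, 3.0, nan, nan, 6.0]

print(delete_rows(rows))
print(mean_replace(x))
print(interpolate_replace(x))
```

On this small example, deletion keeps only two of the four rows, which shows how quickly data is lost when several variables have gaps at different times.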


Data preprocessing
Outlier detection and removal

• Outliers: observations or subsets of
observations whose behavior is not
consistent with the rest
of the data set from a statistical
perspective
• Causes: malfunction of sensors
and inappropriate treatment of
missing data
• Two types of outliers: univariate
and multivariate

Pani, A. K., & Mohanta, H. K. (2016). Online monitoring of cement clinker quality using multivariate statistics and Takagi-Sugeno fuzzy-inference technique. Control Engineering Practice, 57, 1-17.


Data preprocessing

Univariate outlier detection


3σ rule: The 3σ rule is widely used for detecting outliers
in an i.i.d. data set {x_k} drawn from a normal distribution
N(μ, σ^2). x_k is an outlier if
|x_k − μ| > 3σ
Hampel identifier: Instead of the mean and standard
deviation, the Hampel identifier uses the median med
and the median absolute deviation MAD = median(|x_k − med|).
x_k is an outlier if
|x_k − med| > 3 × 1.483 × MAD
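
Both univariate detectors can be sketched in a few lines. This is a minimal illustration, assuming μ and σ are estimated from the data itself (population standard deviation) and MAD is the median of the absolute deviations from the median. The function names and the sample data are hypothetical.

```python
import statistics

def three_sigma_outliers(x):
    """Flag x_k when |x_k - mu| > 3*sigma, with mu and sigma estimated from x."""
    mu = statistics.fmean(x)
    sigma = statistics.pstdev(x)
    return [abs(v - mu) > 3 * sigma for v in x]

def hampel_outliers(x):
    """Flag x_k when |x_k - med| > 3 * 1.483 * MAD."""
    med = statistics.median(x)
    mad = statistics.median([abs(v - med) for v in x])
    return [abs(v - med) > 3 * 1.483 * mad for v in x]

# Hypothetical sensor readings with one gross error at the end
x = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 25.0]
print(three_sigma_outliers(x))
print(hampel_outliers(x))
```

On this sample the 3σ rule flags nothing, because the gross error inflates both the estimated mean and standard deviation, while the Hampel identifier flags only the last point. The median and MAD are much less affected by the outlier they are meant to detect.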


Data preprocessing
Quartile-based identifier and boxplots:
• Uses the interquartile distance Q as the scale parameter:
Q = Q3 − Q1
where Q1 is the lower quartile, x_0.25, and Q3 is the upper quartile, x_0.75
• For a symmetric data distribution, the median is
med = (Q1 + Q3)/2
and x_k is flagged as an outlier if
|x_k − med| > 2Q
• A boxplot is a graphical demonstration of the quartile-based detector.
In the plot, any point that lies outside the
upper or lower fence is considered an
outlier.
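
A minimal sketch of the quartile-based identifier, assuming the quartiles are estimated with Python's statistics.quantiles using the "inclusive" method (other quartile conventions give slightly different values). The function name and sample data are hypothetical.

```python
import statistics

def quartile_outliers(x):
    """Flag x_k when |x_k - med| > 2Q, with Q = Q3 - Q1 and med = (Q1 + Q3)/2."""
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    q = q3 - q1
    med = (q1 + q3) / 2  # midpoint of the quartiles (symmetric-distribution assumption)
    # med +/- 2Q is the same as the fences Q1 - 1.5Q and Q3 + 1.5Q
    return [abs(v - med) > 2 * q for v in x]

# Hypothetical sensor readings with one gross error at the end
x = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 25.0]
flags = quartile_outliers(x)
print(flags)
```

For this sample Q1 = 9.9 and Q3 = 10.1, so only the last reading falls outside the fences and is flagged.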
