Lecture 2

Machine Learning for Chemical Engineers
CHE F315
Ajaya Kumar Pani

BITS Pilani Department of Chemical Engineering
B.I.T.S-Pilani, Pilani Campus
12-01-2024
Data Preprocessing
Outline
Industrial data characteristics

Missing values
Outlier
16 January 2024 4
BITS Pilani, Pilani Campus

Industrial data characteristics and treatment
• Wide use of distributed control systems, increasing use of
online sensors with short sampling times, and improved data
transmission and storage facilities have resulted in the
availability of huge amounts of historical process data.
• Data-driven process modeling, monitoring, prediction,
and control have received much attention in recent
years.
• By analyzing the patterns in process data and the
relationships among variables, useful information can be
extracted. Statistical models can then be developed from
this information for applications such as process
monitoring, fault diagnosis, mode clustering, and soft
sensing of key or quality variables.
• Big Data in the process industries is characterized by
volume, variety, and velocity, or simply V³
• “Volume” refers to the size of ever-growing data sets, which
range from terabytes (10^12 bytes) to zettabytes (10^21
bytes)
• “Variety” describes the various types of data: process
measurements, text, audio, and images
• “Velocity” refers to the speed at which big data is generated
• Data preparation is the initial step in machine learning model
development.
• The main tasks of this step are to extract the dataset from the
historical database, examine the structure of the dataset, and
select data along both the sample and the variable directions.
• To extract an effective dataset from the historical
database, the operating regions of the process need to be
analyzed, and any changes in operating conditions need
to be identified.
• To make the information extraction step efficient,
the characteristics of the process data should be
analyzed, such as non-Gaussianity, linear/nonlinear
relationships among variables, and time-series
correlations.
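The characteristics listed above can be screened with a few simple statistics. The sketch below is only an illustration using the Python standard library: skewness and excess kurtosis as rough indicators of non-Gaussianity, the Pearson coefficient for linear relationships, and the lag-1 autocorrelation for time-series correlation. The function names and the sample data are hypothetical and do not come from the lecture.

```python
import statistics


def skewness_and_excess_kurtosis(x):
    """Return (skewness, excess kurtosis) from population moments.

    Both are close to 0 for normally distributed data.
    """
    n = len(x)
    mean = statistics.fmean(x)
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def pearson_correlation(x, y):
    """Linear correlation coefficient between two equally long series."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def lag1_autocorrelation(x):
    """Correlation between the series and itself shifted by one sample."""
    return pearson_correlation(x[:-1], x[1:])


# Hypothetical measurements, for illustration only
temperature = [1.0, 2.0, 3.0, 4.0, 5.0]
flow = [2.0, 4.0, 6.0, 8.0, 10.0]

skew, kurt = skewness_and_excess_kurtosis(temperature)
corr = pearson_correlation(temperature, flow)
acf1 = lag1_autocorrelation(temperature)
print(skew, kurt, corr, acf1)
```

A coefficient near zero does not rule out a nonlinear relationship, so a scatter plot remains a useful complement to these numbers.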
Thebelt, A., Wiebe, J., Kronqvist, J., Tsay, C., & Misener, R. (2022). Maximizing information from chemical
engineering data sets: Applications to machine learning. Chemical Engineering Science, 252, 117469.
Normal Distribution
• The normal distribution is also known as the Gaussian
distribution.
Probabilities associated with the normal distribution
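The probabilities associated with the normal distribution can be computed directly. The sketch below uses `statistics.NormalDist` from the Python standard library to evaluate the probability that a normal variable lies within k standard deviations of its mean; the helper name `prob_within` is hypothetical.

```python
from statistics import NormalDist


def prob_within(k, mu=0.0, sigma=1.0):
    """P(mu - k*sigma <= X <= mu + k*sigma) for X ~ N(mu, sigma^2)."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2, 3):
    print(f"P(|X - mu| <= {k} sigma) = {prob_within(k):.4f}")
# P(|X - mu| <= 1 sigma) = 0.6827
# P(|X - mu| <= 2 sigma) = 0.9545
# P(|X - mu| <= 3 sigma) = 0.9973
```

Only about 0.27% of normally distributed samples fall outside μ ± 3σ, which is the basis of the 3σ rule for outlier detection.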
Data preprocessing
Data pre-processing is carried out in order to improve the
quality of the data:
• Outliers and gross errors should be removed from the
modeling dataset; otherwise they greatly deteriorate
the performance of the machine learning model
• Missing values need to be addressed, e.g. by deletion of the
sample, missing value estimation, Bayesian inference,
etc.
• The scale difference among process variables needs to be
considered
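One common way to handle the scale difference among variables is z-score standardization, which rescales each variable to zero mean and unit standard deviation. The lecture does not prescribe a particular scaling method, so the sketch below is an assumption; the variable names and values are hypothetical.

```python
import statistics


def standardize(column):
    """Return the z-scores (v - mean) / std of one variable."""
    mean = statistics.fmean(column)
    std = statistics.stdev(column)  # sample standard deviation
    return [(v - mean) / std for v in column]


# Hypothetical process variables on very different scales
temperature_K = [350.0, 355.0, 360.0, 365.0, 370.0]
pressure_bar = [1.0, 1.1, 1.2, 1.3, 1.4]

temperature_z = standardize(temperature_K)
pressure_z = standardize(pressure_bar)
print(temperature_z)
print(pressure_z)
```

After scaling, both variables have comparable magnitude, so neither dominates a distance-based or gradient-based model merely because of its units.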
The raw data of different formats stored in databases are
not useful until they are cleaned and transformed.
Data cleaning consists of four steps:
• Missing data imputation
• Outlier detection and noise removal
• Time alignment
• Delay estimation
Missing data imputation
Missing values in the process industries refer to entries in the
data set that have no connection with the real state of
the process and take values such as ±∞, 0, or NaN (not a
number).
There are generally three missing patterns:
• Missing completely at random (MCAR)
• Missing at random (MAR)
• Missing not at random (MNAR)
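Before imputation, the placeholder values have to be recognized as missing. The sketch below marks ±∞ and NaN entries as missing (represented by `None`). Treating 0 as missing is optional here, because a zero can also be a legitimate measurement; that switch, the function name, and the sample readings are assumptions made for illustration.

```python
import math


def mark_missing(values, zero_is_missing=False):
    """Replace placeholder entries (None, NaN, +/-inf, optionally 0) by None."""
    cleaned = []
    for v in values:
        invalid = v is None or math.isnan(v) or math.isinf(v)
        if zero_is_missing and not invalid and v == 0:
            invalid = True
        cleaned.append(None if invalid else v)
    return cleaned


# Hypothetical sensor readings containing placeholder values
readings = [101.2, float("inf"), 0.0, float("nan"), 99.8, float("-inf")]

print(mark_missing(readings))
# [101.2, None, 0.0, None, 99.8, None]
print(mark_missing(readings, zero_is_missing=True))
# [101.2, None, None, None, 99.8, None]
```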
Figure: Common missing patterns in the process industries
• A and C: missing values for single/multiple variables
→ due to sensor failure
• B: values of some variables missing at the same time
instances → fault
• D: a single variable showing regularly missing values
→ multirate sampling
Xu, S., Lu, B., Baldea, M., Edgar, T. F., Wojsznis, W., Blevins, T., & Nixon, M. (2015). Data cleaning in the process
industries. Reviews in Chemical Engineering, 31(5), 453-490.
Deletion
• Eliminate any time point that contains missing values
• Works well for large datasets in which only a few time
points are affected
• Otherwise it sacrifices a large amount of data, reduces the
statistical power, and leads to biased parameter estimates
with more uncertainty
Replacement
• Mean replacement
• Interpolation replacement
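The three options can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: missing entries are represented by `None`, samples are equally spaced in time, and the interpolation is linear (the lecture does not specify the interpolation scheme). All function names and data are hypothetical.

```python
import statistics


def delete_missing(rows):
    """Deletion: drop every time point (row) that has any missing entry."""
    return [row for row in rows if all(v is not None for v in row)]


def mean_replace(series):
    """Mean replacement: fill gaps with the mean of the observed values."""
    observed = [v for v in series if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in series]


def interpolate_replace(series):
    """Linear interpolation between the nearest observed neighbours.

    Gaps before the first or after the last observed value stay None.
    """
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


# Hypothetical data: rows are time points, columns are variables
rows = [[1.0, 2.0], [None, 3.0], [4.0, 5.0]]
series = [1.0, None, 3.0, None, None, 6.0]

print(delete_missing(rows))         # [[1.0, 2.0], [4.0, 5.0]]
print(mean_replace(series))
print(interpolate_replace(series))  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

Mean replacement ignores the time ordering of the samples, whereas interpolation uses it; for slowly varying process variables the interpolated values usually follow the trend more closely.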
Outlier detection and removal
• Outliers are observations, or subsets of observations, that
do not behave consistently with the rest of the data set
from a statistical perspective
• Causes: malfunction of sensors and inappropriate treatment of
missing data
• Two types of outliers: univariate and multivariate
Pani, A. K., & Mohanta, H. K. (2016). Online monitoring of cement
clinker quality using multivariate statistics and Takagi-Sugeno fuzzy-
inference technique. Control Engineering Practice, 57, 1-17.
Univariate outlier detection

3σ rule: The 3σ rule is widely used for detecting outliers
in an i.i.d. data set {x_k} that follows a normal distribution
N(μ, σ²). x_k is an outlier if
|x_k − μ| > 3σ
Hampel identifier: Instead of the mean and standard
deviation, the Hampel identifier uses the median (med)
and the median absolute deviation (MAD). x_k is an outlier if
|x_k − med| > 3 × 1.483 × MAD
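Both identifiers can be sketched in a few lines of Python. The sketch assumes that μ and σ are estimated from the data itself (sample mean and population standard deviation), since the true parameters are rarely known in practice; the function names and the sample data are hypothetical.

```python
import statistics


def three_sigma_outliers(x):
    """Flag x_k when |x_k - mean| > 3 * std (mean and std estimated from x)."""
    mu = statistics.fmean(x)
    sigma = statistics.pstdev(x)
    return [abs(v - mu) > 3 * sigma for v in x]


def hampel_outliers(x, threshold=3.0, scale=1.483):
    """Flag x_k when |x_k - med| > threshold * scale * MAD."""
    med = statistics.median(x)
    mad = statistics.median([abs(v - med) for v in x])
    return [abs(v - med) > threshold * scale * mad for v in x]


# Hypothetical measurements with one gross error (50.0)
x = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 50.0]

print(three_sigma_outliers(x))  # no point is flagged
print(hampel_outliers(x))       # only the last point is flagged
```

In this small sample the gross error inflates the estimated mean and standard deviation so much that the 3σ rule misses it, while the median and MAD are barely affected. If more than half of the samples are identical, the MAD is zero and every other point is flagged, so that case needs separate handling.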
Quartile-based identifier and boxplots:
Uses the interquartile distance Q as the scale parameter:
Q = Q3 − Q1
where Q1 is the lower quartile, x_0.25, and Q3 is the upper quartile,
x_0.75.
For a symmetric data distribution, the median can be written as
med = (Q1 + Q3)/2
and the following condition is used to detect outliers:
|x_k − med| > 2Q
A boxplot is used as a graphical demonstration
of the quartile-based detector.
In the plot, any point that lies outside the
upper or lower fence is considered an
outlier.
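The quartile-based detector can be sketched as follows. Quartile conventions differ between software packages; this sketch uses `statistics.quantiles` with `method="inclusive"`, which is one possible choice and an assumption here. With med = (Q1 + Q3)/2, the condition |x_k − med| > 2Q is the same as lying outside the fences Q1 − 1.5Q and Q3 + 1.5Q. The function name and the sample data are hypothetical.

```python
import statistics


def quartile_outliers(x):
    """Return (lower fence, upper fence, flags) for the quartile-based rule."""
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    q = q3 - q1                # interquartile distance Q
    med = (q1 + q3) / 2        # centre, assuming a symmetric distribution
    flags = [abs(v - med) > 2 * q for v in x]
    return q1 - 1.5 * q, q3 + 1.5 * q, flags


# Hypothetical measurements with one gross error (50.0)
x = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 50.0]

lower_fence, upper_fence, flags = quartile_outliers(x)
print(lower_fence, upper_fence)
print(flags)  # only the last point is flagged
```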