Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our
understanding, changes in research methods, professional practices, or medical treatment may become
necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using
any information, methods, compounds, or experiments described herein. In using such information or methods
they should be mindful of their own safety and the safety of others, including parties for whom they have a
professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability
for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or
from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
1 Introduction
Chapter outline
1.1 Who this book is for
1.2 Preview of the content
1.3 Power generation industry overview
1.4 Fuels as limited resources
1.5 Challenges of power generation
References
Fig. 1.1 The evolution of machine learning as compared to human performance relative to a common standard dataset.
Fig. 1.2 The six main application areas of machine learning in power generation.
2 Data science, statistics, and time series
Patrick Bangert a,b
a Artificial Intelligence, Samsung SDSA, San Jose, CA, United States
b Algorithmica Technologies GmbH, Bad Nauheim, Germany
Chapter outline
2.1 Measurement, uncertainty, and record keeping
2.1.1 Uncertainty
2.1.2 Record keeping
2.2 Correlation and timescales
2.3 The idea of a model
2.4 First principles models
2.5 The straight line
2.6 Representation and significance
2.7 Outlier detection
2.8 Residuals and statistical distributions
2.9 Feature engineering
2.10 Principal component analysis
2.11 Practical advice
References
2.1.1 Uncertainty
Every sensor has a tolerance and so does the analog-to-digital conversion. Some electrical signal loss and distortion is introduced along the way from the sensor to the converter. Sensors drift over time and must be recalibrated at regular intervals. Sensors get damaged and must be repaired. Accumulations of material may cover a sensor and distort its measurement. Numbers are stored in IT systems, such as the control systems, with finite precision, and there is loss of such precision with every conversion or calculation. All these factors lead to a de facto uncertainty in any measurement. The total uncertainty in any one measurement may be difficult to assess. In the temperature example, we may decide that the uncertainty is 1°C. This would make the last two observations, 3 and 2.5, numerically distinct but actually equal.
Any computation made based on an uncertain input has an uncertainty of its own as a consequence. Computing this inherited uncertainty may be tricky, depending on the computation itself, and may lead to a larger uncertainty than the human designer anticipated. In general, the uncertainty in a quantity $y$ that has been computed using several uncertain inputs $x_i$ is the total differential,
$$(\Delta y)^2 = \sum_{i=1}^{N} \left( \frac{\partial y}{\partial x_i} \right)^2 (\Delta x_i)^2$$
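As a concrete illustration, here is a minimal Python sketch of this propagation rule. The helper name propagate_uncertainty and the voltage/current example are illustrative assumptions, not taken from the book; the partial derivatives are estimated numerically by central differences.

```python
import numpy as np

def propagate_uncertainty(f, x, dx, eps=1e-6):
    """Return the uncertainty of y = f(x) given input uncertainties dx,
    using (dy)^2 = sum_i (df/dx_i)^2 (dx_i)^2 with central-difference
    estimates of the partial derivatives."""
    x = np.asarray(x, dtype=float)
    dx = np.asarray(dx, dtype=float)
    var_y = 0.0
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        dy_dxi = (f(x + step) - f(x - step)) / (2 * eps)  # estimate of dy/dx_i
        var_y += (dy_dxi * dx[i]) ** 2
    return np.sqrt(var_y)

# Illustrative example (not from the book): electrical power P = V * I
# with V = 230 +/- 1 V and I = 10 +/- 0.2 A.
power = lambda v: v[0] * v[1]
print(propagate_uncertainty(power, x=[230.0, 10.0], dx=[1.0, 0.2]))  # about 47 W
```

Even with an input precision of better than half a percent on each measurement, the propagated uncertainty on the result is about 2%, which shows how quickly uncertainty can grow through a computation.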
Fig. 2.3 Determining the uncertainty of a computed result can be difficult and may need to be estimated.
Fig. 2.4 A sensor outputs its value at every time step but not every value is recorded if the compression factor is 1.
Fig. 2.5 There are several ways to interpolate missing data. Here the data for t = 3 is missing. Which method shall we choose?
Fig. 2.6 Correlation between two tags indicates the amount of information that the change in one provides for the other.
Fig. 2.7 The correlation function between two tags as a function of the time lag.
The two time series are clearly related by some mechanism that takes about 10 time steps to work. If there is an amount of time that is inherent to the dynamics of the system, we call it the timescale of the system. It is very important to know the timescales of the various dynamics in the system under study. Some dynamics may go from cause to effect in seconds while others may take days.

It is important to know the timescales because they determine the frequency at which we must observe the system. For example, in Fig. 2.7 the timescale is clearly 10. If we observe the system once every 20 time steps, then we lose important information about the dynamics of the system. Any modeling attempt on such a dataset is doomed to fail.
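To make the idea of a timescale concrete, the following small sketch (plain NumPy, synthetic data) scans candidate lags between two tags and reports the lag with the highest correlation, in the spirit of the correlation function of Fig. 2.7. The helper lagged_correlation, the signals, and the 10-step delay are invented for illustration.

```python
import numpy as np

def lagged_correlation(x, y, max_lag):
    """Correlation between x(t) and y(t + lag) for lags 0..max_lag."""
    n = len(x)
    return [np.corrcoef(x[: n - lag], y[lag:])[0, 1] for lag in range(max_lag + 1)]

# Two synthetic tags: y follows x with a delay of 10 time steps plus noise.
rng = np.random.default_rng(0)
base = np.convolve(rng.normal(size=2000), np.ones(5) / 5, mode="same")
x = base
y = np.roll(base, 10) + 0.1 * rng.normal(size=2000)

corr = lagged_correlation(x, y, max_lag=30)
print("estimated timescale:", int(np.argmax(corr)), "time steps")  # expect about 10
```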
A famous result in the field of signal processing is known as the Nyquist-Shannon sampling theorem, and it says, roughly speaking: If you want to be able to reconstruct the dynamics of a system based on discrete observations, the frequency of observations must be at least two observations per shortest timescale present in the system. This assumes that you know what the shortest timescale is and that the data is more or less perfect. In industrial reality, this is not true. Therefore, in practical machine learning, there is a rule of thumb that we should try for 10 observations relative to the timescale of the system. In the example of Fig. 2.7, where we measured the system every 1 time step, the observation frequency was therefore chosen well.
The other reason for choosing to sample more frequently than strictly necessary is an effect known as aliasing, sometimes seen in digital photography. This is illustrated in Fig. 2.8. Based on multiple observations, we desire to reconstruct the dynamics, and there are multiple candidates if the sampling frequency is tuned perfectly to the internal timescale. The danger is that the wrong one is chosen, resulting in poor performance. If we measure more frequently, this problem is overcome.
Fig. 2.8 The aliasing problem: the observations allow multiple valid reconstructions.
In Fig. 2.8, the dots are the observations and the lines are various possible reconstructions of the system. All reconstructions fit the data perfectly but are quite different from each other. The data is therefore not sufficient to characterize the system properly and we must make an arbitrary choice.
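This ambiguity is easy to reproduce numerically. In the small sketch below (NumPy only, synthetic signal with illustrative frequencies), a 9 Hz sine sampled at 10 Hz, which is below its Nyquist rate of 18 Hz, produces exactly the same samples as a 1 Hz sine, so the observations alone cannot distinguish the two reconstructions.

```python
import numpy as np

f_true = 9.0                       # true frequency of the dynamics (Hz)
fs = 10.0                          # sampling rate, below the Nyquist rate of 18 Hz
t = np.arange(0.0, 1.0, 1.0 / fs)  # the discrete observation times

observed = np.sin(2 * np.pi * f_true * t)       # samples of the 9 Hz signal
alias = -np.sin(2 * np.pi * (fs - f_true) * t)  # a 1 Hz sine (sign-flipped)

# The two candidate reconstructions agree at every sample instant,
# which is exactly the ambiguity illustrated in Fig. 2.8.
print(np.allclose(observed, alias))  # True
```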
The practical recommendation to a data scientist is to obtain domain knowledge on the system in question in order to determine the best sampling frequency for the task. Sampling the data less frequently than indicated earlier results in loss of information. Sampling it more frequently only costs the analysis additional time and effort. On balance, we should err on the side of measuring somewhat more frequently than deemed necessary. When working with transient systems, the right frequency might change depending on the present regime. In those cases, we will want to choose the fastest frequency found anywhere in the system, as sketched below.
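In practice this often amounts to resampling raw historian data to the chosen working frequency. The sketch below assumes the data sits in a pandas DataFrame with a datetime index; the tag name and the 1-second and 10-second rates are illustrative assumptions, not values from the book.

```python
import pandas as pd

# Illustrative raw data: a fictitious tag recorded once per second.
raw = pd.DataFrame(
    {"turbine_inlet_temp": range(600)},
    index=pd.date_range("2021-01-01", periods=600, freq="s"),
)

# Downsample to 10-second means: still roughly 10 or more observations per
# timescale if the process dynamics of interest take a few minutes.
downsampled = raw["turbine_inlet_temp"].resample("10s").mean()
print(downsampled.head())
```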