Volume 9
Edited by
Konstantinos N. Zafeiris
Christos H. Skiadas
Yiannis Dimotikalis
Alex Karagrigoriou
Christiana Karagrigoriou-Vonta
First published 2022 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted
under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or
transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the
case of reprographic reproduction in accordance with the terms and licenses issued by the
CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the
undermentioned address:
www.iste.co.uk www.wiley.com
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the
author(s), contributor(s) or editor(s) and do not necessarily reflect the views of ISTE Group.
Preface
Konstantinos N. ZAFEIRIS, Yiannis DIMOTIKALIS, Christos H. SKIADAS, Alex KARAGRIGORIOU and Christiana KARAGRIGORIOU-VONTA
Part 1
1.1. Introduction
1.2. Data understanding
1.3. Modeling
1.4. Findings
1.5. Conclusion
1.6. References
3.1. Introduction
3.2. Background
3.2.1. Blockchain
3.2.2. Blockchain types
3.2.3. Blockchain-based web applications
3.2.4. Blockchain consensus algorithms
3.2.5. Other consensus algorithms
3.3. Analysis stack
3.3.1. Art Shop web application
3.3.2. SQL-based application
3.3.3. NoSQL-based application
3.3.4. Blockchain-based application
3.4. Analysis
3.4.1. Adding records
3.4.2. Query
3.4.3. Functionality
3.4.4. Security
3.5. Conclusion
3.6. References
4.1. Introduction
4.2. Discrete-time model with reinsurance and bank loans
4.2.1. Model description
4.2.2. Optimization problem
4.2.3. Model stability
4.3. Continuous-time insurance model with dividends
4.3.1. Model description
4.3.2. Optimal barrier strategy
4.3.3. Special form of claim distribution
4.3.4. Numerical analysis
5.1. Introduction
5.2. Data
5.3. Methodology
5.3.1. Main limit results
5.3.2. Block maxima method
5.3.3. Largest order statistics method
5.3.4. Estimation of other tail parameters
5.4. Results and conclusion
5.5. Acknowledgements
5.6. References
6.1. Introduction
6.2. Nearest neighbor methods
6.2.1. Background of the NN methods
6.2.2. The k-nearest neighbors method
6.2.3. The fixed-radius NN method
6.2.4. The kernel-NN method
6.2.5. Algorithms of the three considered NN methods
6.2.6. Parameter and distance metric selection
6.3. Experimental results
6.3.1. Dataset description
6.3.2. Variable selection and data splitting
6.3.3. Results
6.3.4. A discussion and comparison of results
6.4. Conclusion
6.5. References
viii Data Analysis and Related Applications 1
7.1. Introduction
7.2. Methods
7.2.1. Participants
7.2.2. Instrument
7.2.3. Statistical analyses
7.3. Results
7.3.1. EFA results
7.3.2. CFA results
7.3.3. Scale construction and assessment
7.4. Conclusion
7.5. Funding
7.6. References
Part 2
Chapter 12. Invariant Description for a Batch Version of the UCB Strategy with Unknown Control Horizon
Sergey GARBAR
Part 3
Part 4
Chapter 27. High Speed and Secured Network Connectivity for Higher Education Institutions Using Software Defined Networks
Lincoln S. PETER and Viranjay M. SRIVASTAVA
Index
The field of data analysis has grown enormously over recent decades due to the
rapid growth of the computer industry, the continuous development of innovative
algorithmic techniques and recent advances in statistical tools and methods. Due to
the wide applicability of data analysis, a collective work is always needed to bring
all recent developments in the field, from all areas of science and engineering, under
a single umbrella.
Part 1 focuses mainly on computational data analysis and related fields, with
nine chapters covering machine learning algorithms, web applications, spatial
analysis, multivariate regression, factor analysis, mixture models, non-parametric
techniques and tail distributions.
Part 2 focuses mainly on stochastic and algorithmic data analysis and related
fields, with nine chapters covering volatility, calibration, segmentation, Markov
chains, genetic algorithms, classification algorithms, batch processing, entropies and
pseudodistances.
Part 3 focuses mainly on applied statistical data analysis and related fields, with
five chapters covering spatial statistics, Monte Carlo methods, machine learning
methods, time series analysis and gas analysis.
Part 4 focuses mainly on economic and numerical data analysis and related
fields, with six chapters covering economic downturn, cyber systems, morbidity,
fixed-income market, Bayesian inference and reliability analysis.
Konstantinos N. ZAFEIRIS
Yiannis DIMOTIKALIS
Christos H. SKIADAS
Alex KARAGRIGORIOU
Christiana KARAGRIGORIOU-VONTA
April 2022
PART 1
Data Analysis and Related Applications 1, First Edition. Edited by Konstantinos N. Zafeiris, Christos H. Skiadas, Yiannis Dimotikalis, Alex Karagrigoriou and Christiana Karagrigoriou-Vonta.
© ISTE Ltd 2022. Published by ISTE Ltd and John Wiley & Sons, Inc.
1
Performance of Evaluation of Diagnosis of Various Thyroid Diseases
Thyroid cancer is the second most prevalent cancer type among women in Turkey. The number of people diagnosed with thyroid cancer in the United States in 2021 was estimated at 44,280, according to the report published by the American Cancer Society. The risk of thyroid cancer can be reduced by early diagnosis and treatment. This study focuses on predicting five different thyroid diseases, based
on various symptoms and reports of the thyroid. Several machine learning
algorithms, such as support vector machine, k-nearest neighbors, artificial neural
network and decision tree are used for diagnosis of various thyroid diseases, and
their classification performances are compared with each other. For this purpose,
a thyroid disease dataset gathered from the Department of Nuclear Medicine and
Endocrinology in Istanbul University-Cerrahpaşa Faculty of Medicine was used.
1.1. Introduction
Chapter written by Burcu Bektas GÜNEŞ, Evren BURSUK and Rüya ŞAMLI.
There are four basic steps in the decision-making process providing diagnosis in
medicine. These are: cue acquisition, hypothesis generation, cue interpretation and
hypothesis evaluation. In modern times, the wide variety of diseases (differential
diagnosis), complicated disease states (the presence of more than one disease in the
same person), selectivity in perception, variety/size of medical data, insufficient
time allocated to the evaluation processes and the need for these processes to be
done in a limited time are all factors that may cause errors in the steps of this
decision-making process. Physical or emotional changes due to human nature such
as stress, fatigue, distraction, illness or inexperience can also increase the likelihood
of these diagnostic errors. Considering today’s technology, various computer-aided
systems are used to reduce these errors, and a new one is added to these systems
every day (Bursuk 1999; Nohria 2015). In addition, machine learning (ML), a branch of artificial intelligence, is being used in an increasingly wide range of recently designed programs.
In this study, we explored the use of machine learning methodology for the
automatic classification of thyroid diseases using 10 attributes. We used the private
dataset that contains the information of 130 patients from the Department of Nuclear
Medicine and Endocrinology in Istanbul University-Cerrahpaşa Faculty of
Medicine, Turkey (IUC). After the pre-processing stages, models were trained by adapting several of the ML algorithms to our data. The results of this research indicated
that by using all the findings (physical examination, laboratory findings and
radiologic findings) together, various types of thyroid disease can be diagnosed and
the ML provides almost 100% correct answers.
1.2. Data understanding

This research was carried out using physical examination, laboratory findings
and radiologic findings, depicted in Table 1.1. Data were obtained from IUC after
the Ethical Committee’s approval.
This dataset contains five diseases. These are Plummer disease, toxic
multi-nodular goiter, Hashimoto’s disease, Graves’ disease and subacute thyroiditis.
In this context, the number of target attributes is seven for Plummer disease, 40 for
toxic multi-nodular goiter, 32 for Hashimoto’s disease, 48 for Graves’ disease and
three for subacute thyroiditis for multiple classifications, as shown in Figure 1.1.
Figure 1.1. Class visualization for the whole dataset. For a color
version of this figure, see www.iste.co.uk/zafeiris/data1.zip
1.3. Modeling
For five different diseases, analyses were performed using machine learning
methods. SVM, k-nearest neighbors (KNN), artificial neural network (ANN) and
decision tree (DT) were used. With these algorithms, fivefold cross-validation was used as a performance evaluation method before the models were built. According to this method, the dataset is divided into five equal parts; each time, one part is held out for testing and the other four parts are used as training data.
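This protocol can be sketched with scikit-learn. Since the IUC dataset is private, synthetic data of the same dimensions (130 samples, 10 attributes, five classes) stand in, and the model settings below are illustrative assumptions rather than the chapter's actual configurations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data mirroring the dataset dimensions: 130 patients,
# 10 attributes, 5 disease classes.
X, y = make_classification(n_samples=130, n_features=10, n_informative=8,
                           n_classes=5, n_clusters_per_class=1, random_state=0)

models = {
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "ANN": MLPClassifier(max_iter=2000, random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
}

# Fivefold cross-validation: each fold is held out once for testing while
# the other four folds are used for training.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```

The stratified variant keeps the class proportions similar across folds, which matters here because one class (subacute thyroiditis) has only three instances in the real dataset.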
The accuracy metric in equation [1.1], the precision metric in equation [1.2], the
recall metric in equation [1.3] and F-measure metric in equation [1.4] are widely
used for model performance. In this study, accuracy was selected as the model
performance evaluation metric.
Accuracy = (TP + TN) / (TP + TN + FP + FN) [1.1]

Precision = TP / (TP + FP) [1.2]

Recall = TP / (TP + FN) [1.3]

F-measure = 2 × (Precision × Recall) / (Precision + Recall) [1.4]
True positive (TP): the number of samples whose true label is positive and that the classifier also predicts as positive. True negative (TN): the number of samples whose true label is negative and that the classifier also predicts as negative. False positive (FP): the number of samples whose true label is negative but that the classifier incorrectly predicts as positive. False negative (FN): the number of samples whose true label is positive but that the classifier incorrectly predicts as negative (Bulut et al. 2020).
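As an illustration of how these quantities are obtained from a multi-class confusion matrix, the sketch below uses the chapter's class sizes (48, 32, 3, 40, 7) with one hypothetical misclassification; the numbers are illustrative, not the study's reported result.

```python
import numpy as np

# 5x5 confusion matrix: rows = true labels, columns = predicted labels.
# Start from a perfect diagonal, then move one subacute case (class index 2)
# into the Plummer column (class index 4) as an illustrative error.
cm = np.diag([48.0, 32.0, 3.0, 40.0, 7.0])
cm[2, 2] -= 1
cm[2, 4] += 1

TP = np.diag(cm)                 # correctly predicted samples, per class
FP = cm.sum(axis=0) - TP         # predicted as the class, but wrong
FN = cm.sum(axis=1) - TP         # belonging to the class, but missed

accuracy = TP.sum() / cm.sum()                              # overall accuracy
precision = TP / (TP + FP)                                  # per class, [1.2]
recall = TP / (TP + FN)                                     # per class, [1.3]
f_measure = 2 * precision * recall / (precision + recall)   # per class, [1.4]
print(round(accuracy, 4))        # 129 of 130 samples correct
```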
SVM is a supervised machine learning algorithm used for both classification and regression problems, although it is mostly applied to classification. Each data item is plotted as a point in n-dimensional space, with the value of each feature being the value of a particular coordinate. Classification then takes place by finding the hyperplane that best separates the classes (Razia et al. 2018; Raisinghani et al. 2019; Dharmarajan et al. 2020).
In a decision tree (DT), internal nodes represent tests on attribute values and leaves represent a class. The DT algorithm commonly uses the Gini index, information gain, chi-square and reduction in variance to make a strategic split (Raisinghani et al. 2019; Chaubey et al. 2021). In this study, the J48 decision tree algorithm was used.
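Weka's J48 is an implementation of C4.5, which splits on information gain. Scikit-learn has no J48, but a rough analogue is a DecisionTreeClassifier with the entropy criterion; the sketch below uses synthetic stand-in data, since the IUC data are private.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with the same shape as the IUC dataset.
X, y = make_classification(n_samples=130, n_features=10, n_informative=8,
                           n_classes=5, n_clusters_per_class=1, random_state=0)

# criterion="entropy" selects splits by information gain, approximating the
# split criterion of J48/C4.5 (this is an analogue, not a re-implementation).
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, max_depth=2))   # inspect only the top splits
```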
1.4. Findings
The performance of the models is assessed using the accuracy metric. The results
are shown in Table 1.2 and Figure 1.2. The SVM algorithm achieved 100%
performance. Figure 1.2 shows the accuracy performances of the ML algorithms
compared with each other.
Figure 1.2. Comparison of the accuracy of the SVM, ANN, KNN and decision tree algorithms
In the four confusion matrices below, rows correspond to the true label and columns to the predicted label, with the classes in the order: Graves' disease, Hashimoto's disease, subacute thyroiditis, toxic multi-nodular goiter, Plummer disease.

48  0  0  0  0
 0 32  0  0  0
 0  0  3  0  0
 0  0  0 40  0
 0  0  0  0  7

48  0  0  0  0
 0 32  0  0  0
 0  0  2  0  1
 0  0  0 40  0
 0  0  0  0  7

48  0  0  0  0
 0 32  0  0  0
 3  0  0  0  0
 0  0  0 40  0
 0  0  0  0  7

48  0  0  0  0
 0 32  0  0  0
 0  0  2  0  1
 0  0  0 40  0
 0  0  0  0  7
1.5. Conclusion
In this study, we explored the use of machine learning methodologies for the
automatic classification of thyroid diseases using 10 attributes. We used the private
dataset that contains the information of 130 patients from IUC. After the pre-processing stages, models were trained by adapting several of the ML algorithms to our data. The
results of this research indicated that by using all the findings (physical examination,
laboratory findings and radiologic findings) together, various types of thyroid
disease can be diagnosed and the ML provides almost 100% correct answers. The
IUC dataset was sufficiently differentiated according to the disease for which it was
labeled. For this reason, ML algorithms have shown very high performances.
Overfitting was not observed. This system can be developed by using a larger and
more balanced dataset. Further development can be done by using image processing
of ultrasonic scanning of thyroid images to predict thyroid nodules, which cannot be
recognized in laboratory findings.
1.6. References
Bulut, B., Kalın, V., Güneş, B.B., Khazhin, R. (2020). Deep learning approach for detection
of retinal abnormalities based on color fundus images. 2020 Innovations in Intelligent
Systems and Applications Conference, 1–6, Istanbul, 15–17 October 2020.
Bursuk, E. (1999). A diagnostic expert system for cardiological, respiratory, vascular and
hematological diseases. Master’s thesis, Institute of Biomedical Engineering, Bosphorus
University, Istanbul.
Chaubey, G., Bisen, D., Arjaria, S., Yadav, V. (2021). Thyroid disease prediction using
machine learning approaches. Natl. Acad. Sci. Lett., 44(3), 233–238.
Dharmarajan, K., Balasree, K., Arunachalam, A.S., Abirmai, K. (2020). Thyroid disease
classification using decision tree and SVM. Indian J. Public Health Res. Dev., 11, 229.
Godara, S. and Kumar, S. (2018). Prediction of thyroid disease using machine learning
techniques. International Journal of Electronics Engineering, 10(2), 787–793.
Hameed, M.A. (2017). Artificial neural network system for thyroid diagnosis. Eng. Sci.,
11(25), 518–528.
Haykin, S.S. (2009). Neural Networks and Learning Machines, 3rd edition. Prentice Hall, New York.
Nohria, R. (2015). Medical expert system – A comprehensive review. Int. J. Comput. Appl.,
130(7), 44–50.
Raisinghani, S., Shamdasani, R., Motwani, M., Bahreja, A., Raghavan Nair Lalitha, P. (2019).
Thyroid prediction using machine learning techniques. In ICACDS 2019: Advances in
Computing and Data Sciences, Singh, M., Gupta, P., Tyagi, V., Flusser, J., Ören, T.,
Kashyap, R. (eds). Springer, Singapore.
Razia, S., Swathi Prathyusha, P., Krishna, N.V., Sumana, N. (2018). A comparative study of
machine learning algorithms on thyroid disease prediction. International Journal of
Engineering & Technology, 7(2.8), 315–319.
Reza Obeidavi, M., Rafiee, A., Mahdiyar, O. (2017). Diagnosing thyroid disease by neural
networks. Biomed. Pharmacol. J., 10(2), 509–524.
Wang, Y., Yue, W., Li, X., Liu, S., Guo, L., Xu, H., Zhang, H., Yang, G. (2020). Comparison
study of radiomics and deep learning-based methods for thyroid nodules classification
using ultrasound images. IEEE Access, 8, 52010–52017.
2
Exploring Chronic Diseases' Spatial Patterns
Spatial analyses of infectious diseases have a long tradition, and with the
contemporary increasing incidences of chronic and degenerative diseases, consistent
interest has emerged regarding the geography of these types of non-infectious
pathologies and their environmental correlations. In this work, we explore spatial
variations in the prevalence of thyroid cancer, taking into account the demographic
heterogeneity in the at-risk population at the small-area level.
This work aims to enhance the existing research surrounding thyroid incidence in
volcanic areas by analyzing spatial patterns of thyroid cancer cases in Mount Etna’s
area, in the eastern part of Sicily. It is known from the medical literature that several
constituents of volcanic lava and ashes, such as radioactive and heavy metals, are
involved in the pathogenesis of thyroid cancer via the biocontamination of
atmosphere, soil and aquifers. Here, we exploit a unique dataset that allowed us to
geocode the geographic location of cases at the household level, whereas all studies
that we are aware of use aggregated data. Applying the local Moran’s I statistic as a
means for detecting spatial clustering, we aimed to disentangle the spatial
aggregation of thyroid cancer cases due to the proximity to a volcanic area from that
due to the geographic variations in the density of the population at risk and other
concomitant environmental risk factors.
Our preliminary findings seem to confirm a vast empirical literature that has revealed an increased thyroid cancer incidence in volcanic areas, such as the islands of Hawaii and the Philippines, where intense basaltic volcanic activity has also long been detected; furthermore, parts of the Etna volcanic area seem to be more affected than others.
2.1. Introduction
At the end of the 18th century, Dr. Valentine Seaman mapped yellow fever cases
in New York and thus succeeded in highlighting a possible correlation between the
sites of various dumps and the location of the cases (Stevenson 1965). About
60 years later, John Snow came up with the idea of creating a map of the cholera cases that were plaguing Soho (London) at the time, and he realized that the source of the epidemic was a specific public water pump. By closing the pump, he managed to stop the outbreak (Snow 1855; Walter 2000). These are just two of the
first attempts to use cartography as a tool to provide epidemiological information.
From that time on, geographic maps have increasingly been adopted as a traditional
tool to visualize the spatial distribution of diseases in the field of health. In general,
considerable effort has been devoted to the development of geographic information
systems (GIS) that facilitate the understanding of public health problems and foster
collaboration between physicians, epidemiologists and geographers to map and
predict disease risk (Croner et al. 1996). As a result of the epidemiological
transition, the long tradition of using geographic techniques for the analysis of infectious diseases has encouraged a similar application to the geographic distribution of chronic diseases such as cancer and various types of heart disease (Ghosh et al.
1999; Wakefield 2007). There are many environmental risk factors included among
the possible concurrent causes of non-infectious pathologies, and geographical
representations constitute a valid tool for conducting exploratory analyses on the
spatial distribution of cases. In particular, May (1950) emphasized how a disease is
the product of the interaction between pathological factors (such as vectors and
genetic causes) and geographical factors acting on a physical, biological and social
level.
To date, many epidemiological studies suggest that the etiology of thyroid cancer
(TC) includes the presence of an active volcano among several factors such as the
technological improvement of screening systems, iodine consumption and others
(Marcello et al. 2014; Vigneri et al. 2015). TC is the most widespread endocrine
neoplasm, whose incidence has grown steadily around the world in recent decades
(Curado et al. 2007; Kilfoy et al. 2009; Fitzmaurice et al. 2015; Liu et al. 2017).
local Moran’s I index. The local Moran’s I statistic is able to detect the presence of
spatial autocorrelation at the level of sub-areas, which may not emerge at the global
level. Although TC case maps and cluster analysis cannot prove the causal
mechanisms underlying the investigated phenomenon, we rely on these
methodologies to provide further evidence regarding the volcano–TC relationship
and to support decision-making in the public health sector. Our results show the
presence of areas of greater risk that would suggest a possible effect of proximity to
Mount Etna and also to Mount Vulcano, although the latter presents a reduced
activity in comparison with the first one. Despite this, given the exploratory
contribution of our work, a more in-depth study is required to gain a greater
understanding of the phenomenon.
This work is organized as follows: the second section describes the available
data and the salient features of the area under analysis; the third section reports the
methodology applied, with particular mention of SIR and local Moran’s I index; the
fourth section illustrates and discusses the distribution of TC in the eastern part of
Sicily and shows the presence of clusters of high- and low-risk areas and the fifth
and last section summarizes and concludes the work.
TC is the most widespread endocrine neoplasm in the world and has been
increasing steadily in recent decades (Curado et al. 2007; Kilfoy et al. 2009;
Fitzmaurice et al. 2015). Incidence rates significantly higher than the national
averages were recorded in various volcanic areas such as the area that we consider in
this work, eastern Sicily. This area includes four provinces: Messina, Catania, Enna
and Siracusa. The volcanic area that refers to Mount Etna, the highest active
European volcano, is located in the province of Catania but involves some other
areas of the southern province of Messina. Pellegriti et al. (2009) actually report a
considerable increase in the incidence rate of TC compared to the Italian average,
especially in the province of Catania. The Sicilian TC incidence figures are made public in the Health Atlas of Sicily, published by the Department for Health Activities and Epidemiological Observatory (Regional Health Department 2016). Table 2.1 shows the TC incidence rate for the provinces of eastern Sicily (calculated
for the period 2003–2011 by standardization on the new European population per
100,000 inhabitants), disclosed in the Health Atlas. The rate is always higher for
women than men, as known in the literature, and higher than the regional value in
the provinces of Catania and Messina, for both sexes.
Several studies have revealed, over time, the presence of high levels of heavy
metals in the volcanic area, as a result of the continuous emissions of gas (mainly CO2 and SO2), ash and lava by Mount Etna (Buat-Ménard
and Arnold 1978; Cimino and Ziino 1983; Caltabiano et al. 2004; Andronico et al.
2009; D’Aleo et al. 2016). Such heavy metals include among others arsenic,
cadmium, chromium, cobalt, mercury, tungsten and zinc which, in high
concentrations, could contaminate soil, water and the atmosphere, eventually
entering the food chain (Vigneri et al. 2017). These works indicate that the presence
of an active volcano could contaminate the surrounding area through the repeated
emissions leading to potential repercussions for human health.
The territory of these provinces is heterogeneous and includes the volcanic area
as well as urban, rural and industrial regions (Istat 2013). As a result, the resident
population and the cases of TC are distributed in a non-homogeneous way according
to the characteristics of the urban morphology and of the natural environment
(Figure 2.1).
2.3. Methodology
neighboring areas, such as the presence of volcanic areas adjacent to coastal and
plain areas. Therefore, the expected risk of cancer will be higher where the at-risk population is large and the environmental risk factors are nearby. Conversely, the risk will be relatively lower in sparsely populated areas or where the natural causes of the risk are absent.
In this case, it is possible that neighboring areas with similar population density
or in the presence (absence) of other risk factors, give rise to actual clusters of high,
medium and low risk of TC. The analysis of the similarity of the attributes of nearby
geographic areas is generally part of the study of spatial autocorrelation, which
evaluates the spatial distribution of a particular process in terms of relationships,
mutual influences and distance (Cressie 1991; Anselin and Rey 2010; Borruso and
Murgante 2012).
The risk of TC was represented through the production of maps showing the
spatial distribution, for each census tract, of the standardized incidence ratio (SIR).
The SIRs were calculated for each inhabited census tract by indirect standardization
(Waller and Gotway 2004, pp. 12–15), using the incidence rate of TC observed in
the same period (2003–2016) in the whole of eastern Sicily. The SIR is the ratio between observed and expected TC cases in each census tract i:

SIRi = Oi / Ei

where Oi is the number of cases observed for census tract i and Ei is the number of cases expected in the same census tract i. The number of expected cases is calculated as the product of the population at risk (and therefore the entire resident population) in the given census tract i and the general incidence rate for the entire investigated area:

Ei = Pi × r+
where Pi is the population at risk in the specific census tract i and r+ is the general incidence rate of TC, calculated for the four provinces of interest as a whole, as

r+ = Σi Oi / Σi Pi
The SIR index suffers from limits in terms of variability: sparsely populated
areas have a high probability of resulting in a significantly high index, suggesting a spurious increase in the risk of TC. Furthermore, by construction, the standard
error of SIR tends to be large for sparsely populated areas and small for densely
populated ones. As a result, the confidence intervals of SIR will attribute
significance mostly to the highly populated areas (Haining 2003). On the whole,
areas with low population density often result in extreme values of SIR while highly
populated areas are mostly associated with SIR significantly different from 1. To
overcome these issues and contain the variability in the spatial distribution of the
population, we will consider only the census tracts with more than 30 residents for
the calculation of SIR. By contrast, when computing the expected global
number of cases for each stratum, rj, we will consider the totality of TC cases and
the resident population.
The local Moran’s I indicator belongs to the so-called LISA (Local Indicators of
Spatial Association) or local indicators of spatial autocorrelation proposed by
Anselin (1995). It is calculated with the following formula:
I_i = ((x_i − x̄) / m_2) Σ_{j≠i} w_ij (x_j − x̄),

where x_i is the value of the variable under study in region i, x̄ is its mean over all
regions, m_2 = Σ_i (x_i − x̄)²/n, and w_ij are the spatial weights connecting region i
with its neighbors j.
Positive and high values of the local Moran’s I index indicate that a given region
is surrounded by neighboring regions with similar high (or low) values of the
variable under study. In this case, the spatial groups detected are defined as
“high–high” (region with a high value surrounded by regions with high values) or
“low–low” (region with low value surrounded by regions with low values). In terms
of cancer risk, a “high–high” cluster would indicate a high-risk area, while a
“low–low” cluster would denote a low-risk area. Negative values of the local
Moran’s I reveal that the region under examination is a spatial outlier. A spatial
outlier is an area that has a markedly different value from that of its neighbors
(Cerioli and Riani 1999). Spatial outliers are divided into “high–low” (high value
surrounded by neighbors with low values) and “low–high” (low value surrounded by
neighbors with high values).
The local Moran’s I can be standardized so that its significance can be tested
under a normal distribution assumption. However, its distribution under the null
hypothesis is not well approximated by the normal, so significance is commonly
assessed through a conditional permutation procedure yielding pseudo p-values.
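The local Moran's I, the cluster/outlier labeling and the conditional-permutation pseudo p-values can be sketched as follows. This is a minimal illustration with hypothetical data and a binary contiguity matrix, not the spatial-analysis software actually used in the study:

```python
import random

def local_moran(x, W):
    """Local Moran's I_i = ((x_i - xbar)/m2) * sum_j w_ij (x_j - xbar)."""
    n = len(x)
    xbar = sum(x) / n
    z = [v - xbar for v in x]
    m2 = sum(v * v for v in z) / n
    return [z[i] / m2 * sum(W[i][j] * z[j] for j in range(n))
            for i in range(n)]

def quadrants(x, W):
    """Label each region HH/LL (clusters) or HL/LH (spatial outliers)."""
    n = len(x)
    xbar = sum(x) / n
    labels = []
    for i in range(n):
        lag = sum(W[i][j] * x[j] for j in range(n)) / max(sum(W[i]), 1)
        hi_self, hi_nbrs = x[i] > xbar, lag > xbar
        labels.append("HH" if hi_self and hi_nbrs else
                      "LL" if not hi_self and not hi_nbrs else
                      "HL" if hi_self else "LH")
    return labels

def pseudo_p(x, W, i, n_perm=999, seed=0):
    """Conditional permutation: x_i stays fixed, the other values are shuffled
    among the remaining locations; the pseudo p-value compares the observed
    local I_i with its permutation distribution."""
    rng = random.Random(seed)
    obs = local_moran(x, W)[i]
    others = [x[j] for j in range(len(x)) if j != i]
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(others)
        perm = others[:i] + [x[i]] + others[i:]
        if abs(local_moran(perm, W)[i]) >= abs(obs):
            extreme += 1
    return (extreme + 1) / (n_perm + 1)

# Four regions on a line 0-1-2-3 (binary contiguity weights).
W = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
x = [10, 8, 2, 0]
labels = quadrants(x, W)   # high values cluster on one side, low on the other
```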
In eastern Sicily from 2003 to 2016, 7,182 individuals were affected by TC. The
etiology of this tumor is complex and varied: as already mentioned, it can involve
genetic, dietary and other environmental factors. In the case of
Sicily, the distribution of TC cases could also be conditioned by two geographical
components:
– the spatial arrangement of the resident population, with particular reference to
the female part, which is known to be the most affected by TC (Parkin et al. 2005).
Where the population is more concentrated or where the female population is
predominant, it will be more likely to record a high incidence of TC;
– the presence of environmental factors such as the volcanic nature of the
territory. The fumes emitted by an active volcano, such as Mount Etna, are able to
transport heavy metals and radioactive substances capable of contaminating the air,
water and soil of the surrounding areas (Fiore et al. 2019).
Figures 2.2(a) and 2.2(b) show, respectively, the SIR by census section and the
relative confidence intervals. From the SIR representation alone (Figure 2.2(a)),
different risk areas emerge, namely those with an SIR value greater than 1. These
areas are located around Mount Etna as well as in the non-volcanic provinces,
especially those of Enna and Messina. The consideration of the confidence intervals
for SIR (Figure 2.2(b)) instead highlights the area south-east of Mount Etna and
different sections belonging mainly to the Messina province.
In both maps, it is evident that while in the non-volcanic provinces the census
sections with SIR greater than 1 are scattered randomly across the territory, in the
province of Catania the at-risk sections are concentrated in an area close to Mount
Etna, leaving
the rest of the province almost free. Furthermore, the location of the risk areas along
the NW–SE axis could suggest that persistent winds in the SE direction could carry
the toxic substances emitted by the volcano, therefore polluting the atmosphere of
the territories positioned along this corridor, as highlighted in Boffetta et al. (2020).
It is also interesting to note that the census sections on the island of Lipari show a
high and significant SIR. Indeed, this area is also of a volcanic type and is located in
the immediate vicinity of Mount Vulcano, an active volcano with only modest
activity compared to that of Mount Etna. The island of Vulcano is home to
numerous sulfurous fumaroles as well as a field of frequent submarine volcanic CO2
emissions, whose spatial distribution follows the direction given by persistent winds
blowing from the NW (Vizzini et al. 2020). Moreover, Vizzini et al. (2013) stated
that the area experiences “low”-level contamination due to elements such as Ba, Fe,
As and Cd. Overall, the significance of SIR in Lipari seems to further corroborate
the idea that a volcano can influence the incidence of TC nearby.
Figure 2.3(a) shows the local Moran’s I statistic, while Figure 2.3(b) shows the
pseudo p-values obtained from the conditioned permutation procedure. Low-risk
census sections surrounded by low-risk census sections are represented in bright
yellow; those of high risk with high-risk neighbors are in brown; low-risk
sections surrounded by neighboring high-risk sections are colored light orange and
high-risk ones with a low-risk neighborhood appear in dark orange. Figure 2.3(a)
shows a variation in the risk between the northeast and the southwest: southern and
western internal areas do not host high-risk clusters, while the eastern and northern
ones present different high-risk clusters. In particular, there are extensive low-risk
clusters along the eastern coast of Messina and Syracuse, whereas high-risk groups
emerge in the area SSE of Mount Etna, in the Aeolian Islands up north and on the
northern coast near Barcellona Pozzo di Gotto. Figure 2.3(b) illustrates that the
sections constituting the high- and low-risk clusters are significant at a level equal to
at most α = 0.05. Finally, it should be noted that most of the considered sections
were found to be of insignificant risk, as can be seen from the large gray areas
present in both maps.
The cluster analysis could confirm the hypothesis according to which persistent
winds in the SE direction would push the radioactive substances emitted by the
volcano towards areas that report a high risk. A similar suggestion seems to apply to
the Aeolian Islands and the sections near Barcellona Pozzo di Gotto.
2.5. Conclusion
In this chapter, we mapped the risk of TC in eastern Sicily through the SIR,
computed by indirect standardization. From the first maps obtained, we found a
possible significant risk area at the foot of Mount Etna. We then conducted a cluster
analysis to uncover possible high-risk pockets in the area. We computed the local
Moran’s I index on the
SIR previously obtained and created maps of high- and low-risk clusters, and of risk
change. These maps highlighted the presence of a high-risk cluster to the SSE of
Mount Etna, in the Aeolian Islands, and near Barcellona Pozzo di Gotto. In the rest
of the region, no other important high-risk clusters have emerged. The detection of
areas of greatest risk located near Mount Etna seems to support the hypothesis that
the presence of a volcano may influence the incidence of TC in the surrounding
people. In addition to this, the risk areas that emerged on the island of Lipari (and in
the Aeolian Islands as a whole), and along the northern coast of Sicily, also seem to
indicate a possible influence of the nearby Mount Vulcano. This preliminary finding
should be of crucial interest for public health and could help optimize the
distribution of local health services and support targeted screening, monitoring and
prevention campaigns that efficiently exploit the available resources.
2.6. References
Andronico, D., Spinetti, C., Cristaldi, A., Buongiorno, M.F. (2009). Observations of Mt. Etna
volcanic ash plumes in 2006: An integrated approach from ground-based and polar
satellite NOAA-AVHRR monitoring system. Journal of Volcanology and Geothermal
Research, 180, 135–147.
Anselin, L. (1995). Local indicators of spatial association–LISA. Geographical Analysis, 27,
93–115.
Anselin, L. (2005). Exploring spatial data with GeoDa: A workbook. Workbook, Spatial
Analysis Laboratory, Department of Geography, University of Illinois, Urbana, IL.
Anselin, L. and Bera, A.K. (1998). Spatial dependence in linear regression models with an
introduction to spatial econometrics. In Handbook of Applied Economic Statistics, Ullah,
A. and Giles, D. (eds). Marcel Dekker, New York.
Anselin, L. and Rey, S.J. (2010). Perspectives on Spatial Data Analysis. Springer, Berlin,
Heidelberg.
Arnbjörnsson, E., Arnbiörnsson, A., Ólafsson, A. (1986). Thyroid cancer incidence in relation
to volcanic activity. Archives of Environmental Health, 41(1), 36–40.
Assessorato Regionale alla Salute (2016). Atlante Sanitario della Sicilia. Supplement,
Dipartimento per le Attività Sanitarie ed Osservatorio Epidemiologico.
Banerjee, S., Carlin, B.P., Gelfand, A.E. (2004). Hierarchical Modeling and Analysis for
Spatial Data. Chapman & Hall/CRC, Boca Raton/London.
Biondi, B., Arpaia, D., Montuori, P., Ciancia, G., Ippolito, S., Pettinato, G., Triassi, M.
(2012). Under the shadow of Vesuvius: A risk for thyroid cancer? Thyroid, 22(12),
1296–1297.
Bivand, R.S., Pebesma, E., Gómez-Rubio, V. (2008). Applied Spatial Data Analysis with R.
Springer, New York.
Boffetta, P., Memeo, L., Giuffrida, D., Ferrante, M., Sciacca. S. (2020). Exposure to
emissions from Mount Etna (Sicily, Italy) and incidence of thyroid cancer: A geographic
analysis. Scientific Reports, 10, 21298.
Borruso, G. and Murgante, B. (2012). Analisi dei fenomeni immigratori e tecniche di
autocorrelazione spaziale. Primi risultati e riflessioni, Geotema, 43–45.
Bray, F., Colombet, M., Mery, L., Piñeros, M., Znaor, A., Zanetti R., Ferlay, J. (2017).
Cancer Incidence in Five Continents, Volume XI. International Agency for Research on
Cancer, Lyon.
Breslow, N.E. and Day, N.E. (1987). Statistical Methods in Cancer Research, Heseltine, E.
(ed.). IARC Scientific Publications no. 82, Lyon.
Buat-Ménard, P. and Arnold, M. (1978). The heavy metal chemistry of atmospheric
particulate matter emitted by Mount Etna Volcano. Geophysical Research Letters, 5(4),
245–248.
Caguioa, P.B., Bebero, K.G.M., Bendebel, M.T.B., Saldana, J.S. (2019). Incidence of thyroid
carcinoma in the Philippines: A retrospective study from a tertiary university hospital.
Annals of Oncology, 30.
Caltabiano, T., Burton, M., Giammanco, S., Allard, P., Bruno, N., Murè, F., Romano, R.
(2004). Volcanic gas emissions from the summit craters and flanks of Mt. Etna,
1987–2000. Geophysical Monograph Series, 143, 111–128.
Cerioli, A. and Riani, M. (1999). The ordering of spatial data and the detection of multiple
outliers. Journal of Computational and Graphical Statistics, 8(2), 239–258.
Cimino, G. and Ziino, M. (1983). Heavy metal pollution. Part VII. Emissions from Mount
Etna volcano. Geophysical Research Letters, 10(1), 31–34.
Cressie, N. (1991). Statistics for Spatial Data. Wiley, New York.
Croner, C.M., Sperling, J., Broome. F.R. (1996). Geographic information systems (GIS): New
perspectives in understanding human health and environmental relationships. Statistics in
Medicine, 15(18), 1961–1977.
Curado, M.-P.E., Brenda, H.R.S., Storm, H., Ferlay, M., Heanue, J., Boyle. P. (2007). Cancer
Incidence in Five Continents, Volume IX. WHO, Geneva.
D’Aleo, R., Bitetto, M., Delle Donne, D., Tamburello, G., Battaglia, A., Coltelli, M.,
Patanè, D., Prestifilippo, M., Sciotto, M., Aiuppa, A. (2016). Spatially resolved SO2 flux
emissions from Mt Etna. Geophysical Research Letters, 43(14), 7511–7519.
Duntas, L.H. and Doumas, C. (2009). The “rings of fire” and thyroid cancer. Hormones, 8(4),
249–253.
Fiore, M., Conti, G.O., Caltabiano, R., Buffone, A., Zuccarello, P., Cormaci, L., Cannizzaro,
M.A., Ferrante. M. (2019). Role of emerging environmental risk factors in thyroid cancer:
A brief review. International Journal of Environmental Research and Public Health,
16(1185).
Fitzmaurice, C., Dicker, D., Pain, A., Hamavid, H., Moradi-Lakeh, M., MacIntyre, M.F.,
Allen, C., Hansen, G., Hansen, G., Woodbrook, R. et al. (2015). The global burden of
cancer 2013. JAMA Oncology, 1(4), 505–527.
Ghosh, M., Natarajan, K., Waller, L.A., Kim, D. (1999). Hierarchical Bayes GLMs for the
analysis of spatial data: An application to disease mapping. Journal of Statistical
Planning and Inference, 75(2).
Goodman, M.T., Yoshizawa, C.N., Kolonel, L.N. (1988). Descriptive epidemiology of
thyroid cancer in Hawaii. Cancer, 61, 1272–1281.
Haining, R. (2003). Spatial Data Analysis: Theory and Practice. Cambridge University Press,
Cambridge.
Hawai’i Tumor Registry (2019). Hawai’i Cancer at a Glance 2012–2016. Hawai’i Tumor
Registry.
Hrafnkelsson, J.H., Tulinius, J.G., Ólafsdottir, J.G., Sigvaldason. H. (1989). Papillary thyroid
carcinoma in Iceland: A study of the occurrence in families and the coexistence of other
primary tumours. Acta Oncologica, 28(6), 785–788.
Istat (2013). La Sicilia, un territorio che cambia. Istat.
Kilfoy, B.A., Zheng, T., Holford, T.R., Han, X., Ward, M.H., Sjodin, A., Zhang, Y., Bai, Y.,
Zhu, C., Guo, G.L. et al. (2009). International patterns and trends in thyroid cancer
incidence, 1973–2002. Cancer Causes and Control, 20(5), 525–531.
Kolonel, L.N., Hankin, J.H., Wilkens, L.R., Fukunaga, F.H., Ward Hinds, M. (1990). An
epidemiologic study of thyroid cancer in Hawaii. Cancer Causes and Control,
1, 223–234.
Kung, T.M., Ng, W.L., Gibson, J.B. (1981). Volcanoes and carcinoma of the thyroid:
A possible association. Archives of Environmental Health, 36(5), 265–267.
LeSage J.P. and Pace, K.R. (2014). The biggest myth in spatial econometrics. Econometrics,
2(4), 217–249.
Liu, Y., Su, L., Xiao, H. (2017). Review of factors related to the thyroid cancer epidemic.
International Journal of Endocrinology, 2017:5308635. doi: 10.1155/2017/5308635.
Malandrino, P., Scollo, C., Marturano, I., Russo, M., Tavarelli, M., Attard, M., Richiusa, P.,
Violi, M.A., Dardanoni, G., Vigneri, R. et al. (2013). Descriptive epidemiology of human
thyroid cancer: Experience from a regional registry and the “Volcanic Factor”. Frontiers
in Endocrinology, 4(65), 1–7.
Malandrino, P., Russo, M., Ronchi, A., Minoia, C., Cataldo, D., Regalbuto, C., Giordano, C.,
Attard, M., Squatrito, S., Trimarchi, F. et al. (2016). Increased thyroid cancer incidence
in a basaltic volcanic area is associated with non-anthropogenic pollution and
biocontamination. Endocrine, 53, 471–479.
Marcello, M.A., Malandrino, P., Almeida, J.F.M., Martins, M.B., Cunha, L.L., Bufalo, N.E.,
Pellegriti, G., Ward, L.S. (2014). The influence of the environment on the development of
thyroid tumors: A new appraisal. Endocrine-related Cancer, 21(5), T235–T254.
May, J.M. (1950). Medical geography: Its methods and objectives. Geographical Review,
40(1), 9–41.
Parkin, D.M., Bray, F., Ferlay, J., Pisani, P. (2005). Global cancer statistics, 2002. CA:
A Cancer Journal for Clinicians, 55(2), 74–108.
Pellegriti, G., De Vathaire, F., Scollo, C., Attard, M., Giordano, C., Arena, S., Dardanoni, G.,
Frasca, F., Malandrino, P., Vermiglio, F. (2009). Papillary thyroid cancer incidence in the
volcanic area of Sicily. Journal of the National Cancer Institute, 101, 1575–1583.
Snow, J. (1855). On the Mode of Communication of Cholera. John Churchill, London.
Stevenson, L.G. (1965). Putting disease on the map: The early use of spot maps in the
study of yellow fever. Journal of the History of Medicine and Allied Sciences, 20(3),
226–261.
Truong, T., Rougier, Y., Dubourdieu, D., Guihenneuc-Jouyaux, C., Orsi, L., Hémon, D.,
Guénel, P. (2007). Time trends and geographic variations for thyroid cancer in New
Caledonia, a very high incidence area (1985–1999). European Journal of Cancer
Prevention, 16(1), 62–70.
Vigneri, R., Malandrino, P., Vigneri, P. (2015). The changing epidemiology of thyroid
cancer: Why is incidence increasing? Current Opinion in Oncology, 27, 1–7.
Vigneri, R., Malandrino, F., Russo, G.M., Vigneri, P. (2017). Heavy metals in the
volcanic environment and thyroid cancer. Molecular and Cellular Endocrinology, 457,
73–80.
Vizzini, S., Di Leonardo, R., Costa, V., Tramati, C.D., Luzzu, F., Mazzola, A. (2013). Trace
element bias in the use of CO2 vents as analogues for low pH environments: Implications
for contamination levels in acidified oceans. Estuarine, Coastal and Shelf Science, 134,
19–30.
Vizzini, S., Andolina, C., Caruso, C., Corbo, A. (2020). Isole Eolie: I campi di emissioni
vulcaniche sottomarine di CO2 a Vulcano e Panarea. Memorie Descrittive della Carta
geologica d’Italia, 105, 91–96.
Wakefield, J. (2007). Disease mapping and spatial regression with count data. Biostatistics,
8(2), 158–183.
Waller, L.A. and Gotway, C.A. (2004). Applied Spatial Statistics for Public Health Data.
John Wiley & Sons, Hoboken, NJ.
Walter, S.D. (2000). Disease mapping: A historical perspective. In Spatial Epidemiology:
Methods and Applications, Elliott, P., Wakefield, J., Best, N., Briggs, D. (eds). Oxford
University Press, Oxford.
3
Analysis of Blockchain-based
Databases in Web Applications
3.1. Introduction
Databases have been continuously improved since the beginning of the computer era,
to the point where they have become indispensable in our daily lives. Current
database management systems are powered by a legacy that has been developed
over many years according to users’ needs, alongside the evolution of computers.
Technologies continue to be developed according to the needs and the level of
development that humanity has reached. Blockchain, one of the solutions developed
in this way, has brought disruptive changes in various fields.
Data Analysis and Related Applications 1, First Edition. Edited by Konstantinos N. Zafeiris,
Christos H. Skiadas, Yiannis Dimotikalis, Alex Karagrigoriou and Christiana
Karagrigoriou-Vonta.
© ISTE Ltd 2022. Published by ISTE Ltd and John Wiley & Sons, Inc.
In this chapter, blockchain-based systems and SQL and NoSQL database systems
will be compared. Some analyses will be shared through the example of an Art Shop
web application.
3.2. Background
3.2.1. Blockchain
Application layer: the structure that defines the business part of the blockchain
and regulates state transitions.
Consensus layer: the layer that enables the nodes to agree on the state of the
blockchain and creates the decision-making mechanism of the decentralized
structure.
Networking layer: the layer responsible for the reproduction and propagation of
transactions, state transition messages and consensus messages
(https://v1.cosmos.network/intro).
Proof of work (PoW): blockchain state changes are performed by expending
computing-power resources. The effort required to establish a new block or a new
transaction is far greater than that required to verify already established proof. The
idea behind this asymmetry is to prevent the system from being deceived by a
fraudulent transaction.
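The asymmetry can be illustrated with a hashcash-style sketch (a toy example, not Bitcoin's actual block format): finding a valid nonce takes many hash evaluations, while checking one takes a single hash.

```python
import hashlib

def verify(data: str, nonce: int, difficulty: int) -> bool:
    """Check whether the SHA-256 digest starts with `difficulty` zero hex digits."""
    digest = hashlib.sha256(f"{data}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

def mine(data: str, difficulty: int) -> int:
    """Search for a valid nonce: expensive to find, cheap to verify."""
    nonce = 0
    while not verify(data, nonce, difficulty):
        nonce += 1
    return nonce

nonce = mine("block payload", 4)        # tens of thousands of attempts on average
ok = verify("block payload", nonce, 4)  # a single hash to check
```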
Bitcoin, the most popular blockchain, also works with the PoW algorithm.
In 2021, the daily average confirmation time (the average time for a transaction with
miner fees to be included in a mined block and added to the public ledger) exceeded
800 minutes as a monthly average (Blockchain.com n.d.).
Proof of stake (PoS): in response to the limits of the PoW algorithm, the PoS
algorithm, first suggested in a forum in 2011, requires that a new node wishing to
participate in the block creation process first prove that it holds a certain number of
the relevant value tokens and lock/stake a certain amount of value in an escrow
account. The locked amount serves as an escrow to ensure the security of
transactions. If the node performing a transaction behaves inappropriately, it may,
according to the rules, lose the value it has locked in escrow and be barred from
participating in block state changes again.
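A toy sketch of the staking idea (the class and method names are hypothetical; real BFT PoS implementations such as Tendermint's are far more involved):

```python
import random

class TinyPoS:
    """Validators lock value in escrow; proposers are drawn with probability
    proportional to stake; detected misbehavior forfeits the stake."""
    def __init__(self):
        self.stakes = {}

    def stake(self, node: str, amount: int):
        self.stakes[node] = self.stakes.get(node, 0) + amount

    def pick_proposer(self, rng: random.Random) -> str:
        nodes = list(self.stakes)
        return rng.choices(nodes, weights=[self.stakes[n] for n in nodes])[0]

    def slash(self, node: str) -> int:
        # Fraud detected: the locked value is not returned and the node is
        # excluded from further block state changes.
        return self.stakes.pop(node, 0)

ledger = TinyPoS()
ledger.stake("alice", 90)
ledger.stake("bob", 10)
forfeited = ledger.slash("bob")   # bob loses his 10 locked tokens
```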
Applications are being developed that use PoW and PoS algorithms in a hybrid
way or with consensus algorithms that are developed with different mechanisms
from start to finish. While these algorithms are sometimes developed based on
users’ needs, sometimes they have to be used by considering the restrictions on the
business logic side. Blockchain applications are brought to life with new algorithms
every day and offered to the masses to use (Ferdous et al. 2020).
The art gallery application, with an inventory of artworks, was implemented with
MySQL (SQL), MongoDB (NoSQL) and a blockchain-based structure. With the art
gallery application, the gallery owner can add new artworks to their inventory,
delete them and update the specified information of the works.
For the NoSQL version of the Art Shop application, Mongo Atlas and MongoDB
version 4.4.6 were used. On the server side, Ubuntu 20.04 operating system and
DigitalOcean’s 2 GB memory/1 CPU system were used. Strapi 3.6.5 was used as the
content management system.
For the blockchain-supported version of the Art Shop web application, the
Ubuntu 20.04 operating system was used in DigitalOcean’s 2 GB memory, 1 CPU
droplet. Starport 0.16 was used to create and manage the blockchain. Go 1.16 was
installed to run Starport. Starport’s frontend application works with “Vue.js”.
“Node.js” and “npm” were also installed to run these packages on the server.
3.4. Analysis
In the Art Shop application, the shop owner has three main options when they
want to add pieces of art to the inventory:
1) adding data directly to the database with the command interface;
2) adding data using the graphical interface of the database management systems;
3) adding data using the specially developed application user interface.
The custom bulk-loading commands provided by the SQL and NoSQL systems can
be used to add entries in bulk from a JSON file or comma-separated values.
Adding new entries is one of the tasks on which the performance of the systems
can be compared. The scenario of the art shop owner adding their whole inventory
to the web application in one go, using the 150-line comma-separated values (CSV)
file shown in Figure 3.3 as sample data, was implemented through the command
interface. The SQL and NoSQL systems performed the task without any problems,
in the times shown in Figures 3.4 and 3.5; the SQL system completed the bulk
addition faster than the NoSQL system. While the “LOAD DATA LOCAL
INFILE” command is used directly on the server for SQL, the “mongoimport”
command, which connects to the MongoDB Atlas servers from a local computer, is
used for the NoSQL structure.
Multiple rows were used in the SQL and NoSQL systems so that the performance
test would be meaningful when adding the entries. However, since the Starport
infrastructure used in the blockchain-supported system relies on Tendermint’s
consensus algorithm, called BFT PoS, entries must, as in all popular algorithms, be
added one by one (Ferdous et al. 2020). For a node to add more than one entry, the
messages must be added to the Merkle-tree structure with unique hashes and proven
one at a time.
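The one-at-a-time requirement stems from how entries are committed: each message is hashed into a Merkle tree whose root authenticates the whole set. A minimal sketch follows (the duplication rule for odd levels and the hash choice are illustrative assumptions, not Tendermint's exact scheme):

```python
import hashlib

def sha(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def merkle_root(messages):
    """Hash each message into a leaf, then hash pairs upwards until a single
    root remains; any changed message changes the root."""
    level = [sha(m.encode()) for m in messages]
    if not level:
        return sha(b"")
    while len(level) > 1:
        if len(level) % 2:               # odd level: duplicate the last hash
            level.append(level[-1])
        level = [sha(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

root = merkle_root(["tx1", "tx2", "tx3"])
```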
An entry was added to the blockchain-based system, which uses the Starport
infrastructure and, by construction, accepts data one item at a time. This addition
first required the creation of functions for the CRUD (Create, Read, Update, Delete)
actions of the digital asset. The artwork structure was created as a digital asset on
the blockchain with the Starport command “starport type artwork Arttype name
artist year owner”. Adjustments were then made to various proto and structural files
for the API system of Starport’s web application. Then, with the command
“artshopd tx artshop create-artwork ‘Painting’ ‘Name2’ ‘Artist 2’ ‘1999’ ‘0’
--from=dbtests”, the first record of the blockchain was made on behalf of the user
“dbtests”. Figure 3.6 shows the time associated with the record added to the
blockchain by the validator of the chain.
Figure 3.5. The process of adding 150 lines of dummy-data with the
NoSQL system and the time elapsed. For a color version
of this figure, see www.iste.co.uk/zafeiris/data1.zip
Figure 3.6. Adding a single entry to the blockchain-supported system and the time
elapsed. For a color version of this figure, see www.iste.co.uk/zafeiris/data1.zip
Since a query test follows later in the analysis, the 150 records were nevertheless
added to the blockchain-based system one by one, a process that took a total of
450 seconds.
According to the tests made, the order in time taken to add inputs, from fastest to
slowest, is SQL, NoSQL, blockchain.
3.4.2. Query
Figure 3.7 shows the queries and operation times for retrieving, from the relevant
table in the database, the artworks whose names start with the expression “Name 1”
among the 150 rows of sample data containing the artworks in the Art Shop
application.
The detailed operation time was not displayed in the MongoDB command
interface used for the NoSQL system. Before the query, the “setVerboseShell(true)”
command was run to show the operation time at the end of the operation.
In Starport, which is used for the blockchain-based database system, only the
commands listed below are available by default on the query side of the CRUD
functions created with the “type” command. To run the query used in this
comparison, it had to be developed in Go, the language in which the system was
created, and added to the system.
Query commands that come by default with the “type” command in Starport are:
3.4.3. Functionality
The first version of MySQL, the database used for the SQL system, was released in
1995, and MongoDB, used for the NoSQL system, was first released in 2009.
Starport, the open-source tool developed by the Tendermint company and used for
the blockchain-supported system, released its first version at the beginning of 2020.
For traditional database systems, with their long legacy, there are dedicated
database management applications, server images installed with one click at hosting
companies, and ready-made database servers managed in the cloud. On the
blockchain side, there are limited alternatives providing a managed server service in
the cloud, namely the Oracle (Oracle.com n.d.) and Amazon Web Services
(AWS.amazon.com n.d.) solutions.
For software development, the accumulated legacy and the SQL- and NoSQL-related
resources are far more plentiful. For blockchain-based database systems, the
Tendermint consensus algorithm, the Cosmos SDK and Starport offer a starting
point for the application, consensus and network layers, but improvements and
arrangements must still be developed according to application needs. With its API
support and frontend application, the Cosmos SDK has facilitated the development
of blockchain-based web applications.
3.4.4. Security
In SQL and NoSQL systems, the IP addresses allowed to connect to the database
server can be defined at the database management system layer, and all requests
other than those from these IP addresses can be blocked. In addition, by defining
user accounts, authorization can be granted to users coming from a specific IP
address and authenticated with a username/password. In the PoS algorithm on the
blockchain side, if a fraudulent transaction is detected, the staked value may not be
returned to the node and the account may be deleted from the chain completely.
3.5. Conclusion
Web applications have become part of every aspect of our daily lives. All of the
main areas, such as government services, health, finance and entertainment, are
indispensably and irreversibly managed with web applications. As estimates of the
number of interconnected devices grow day by day, the communication, security
and speed of this interconnected crowd all gain importance. With this growth, the
popularity of decentralized and trustless blockchain-based structures is increasing.
Databases in web applications have traditionally been classified according to the
units in which the data are kept and the relationships of the data with each other. In
the blockchain world, systems are classified according to participation rules and the
consensus algorithm. Blockchain-based systems are decentralized, consistent and
eliminate the trust problem.
SQL and NoSQL systems record, send and process any data that passes the
authorization barrier, without questioning application-layer decisions. The use of
their built-in functions is not very common, owing to the rapid development and
flexibility of web languages. Blockchain technologies are also diversified, especially
on the consensus side. Authority over, and participation in, the system is one of the
most critical points for data processing in web applications intended for use by
multiple stakeholders.
3.6. References
4.1. Introduction
Now let f_n(x) be the minimal expected cost during n periods and β be the discount
factor for future costs. Then, using dynamic programming (see Bellman (1957)), we
easily obtain the following relation:

f_n(x) = −kx + min_{y ≥ x} G_n(y),

with G_n(y) defined by [4.3].
THEOREM 4.1.– There exists an increasing sequence of critical levels {y_n}_{n≥1} such
that:

f_n(x) = −kx + G_n(y_n), if x ≤ y_n,    [4.4]
f_n(x) = −kx + G_n(x), if x > y_n.
PROOF.– Consider at first the one-period case. Obviously, we have to find the solution
of the equation G′_1(y) = 0, where G′_1(y) = k − r F̄(y/α). Due to the assumption r > k,
it follows immediately that y_1 = α F^{−1}(1 − k/r) exists; moreover, it is the unique
solution of the equation under consideration.
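As a quick numerical sanity check (with an assumed exponential claim distribution, which is not specified in the chapter), y_1 = αF^{−1}(1 − k/r) can be evaluated directly:

```python
import math

def y1(alpha: float, k: float, r: float, lam: float) -> float:
    """One-period critical level for Exp(lam) claims: F(s) = 1 - exp(-lam*s),
    so F^{-1}(1 - k/r) = -ln(k/r)/lam (using the assumption r > k)."""
    assert r > k > 0, "the model assumes r > k"
    return alpha * (-math.log(k / r)) / lam

level = y1(1.0, 1.0, math.e, 1.0)   # here F^{-1}(1 - 1/e) equals 1
```

The level rises with the ratio r/k, as expected from the unique root of k − rF̄(y/α) = 0.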
Hence, it is clear that f′_1(x) < 0 for all x. Moreover, on the one hand,

G′_2(y) = G′_1(y) + β ∫_0^∞ f′_1(y + M − αs) φ(s) ds ≤ G′_1(y),
That means y_1 < y_2 < ȳ. Furthermore, f′_2(x) < 0 for all x. Thus, the base of
induction is established. Assuming that [4.4] is true for a number of periods less than
or equal to n, we prove its validity for n + 1. It is possible to write:

f′_n(x) − f′_{n−1}(x) = 0, if x ≤ y_{n−1},
f′_n(x) − f′_{n−1}(x) = −G′_{n−1}(x), if y_{n−1} < x ≤ y_n,
f′_n(x) − f′_{n−1}(x) = G′_n(x) − G′_{n−1}(x), if x > y_n.
Since

G′_{n+1}(y) − G′_n(y) = β ∫_0^∞ [f′_n(y + M − αs) − f′_{n−1}(y + M − αs)] φ(s) ds,
we deduce that G′_{n+1}(y) < G′_n(y), so y_n < y_{n+1}. Rewriting G′_{n+1}(y) as follows:

G′_1(y) + β ∫_0^∞ f′_n(y + M − αs) φ(s) ds
= H(y) + β ∫_0^{(y + M − y_n)/α} G′_n(y + M − αs) φ(s) ds,
it is easy to see that G′_{n+1}(y) > H(y). This entails the needed relation y_{n+1} < ȳ,
thus ending the proof. It is possible to formulate an obvious corollary:
Now, we turn to sensitivity analysis and prove that the model under consideration
is stable with respect to small perturbations of the underlying distribution. For this
purpose, we introduce two variants of the model. In the first one, the claim distribution
has density ϕX (x) and d.f. is denoted by FX (x). In the second one, the claim density
is ϕY (x) and d.f. is FY (x). The corresponding cost functions are denoted by fn,X (x)
and fn,Y (x). The distance between distributions will be measured by means of the
Kantorovich metric.
DEFINITION 4.1.– For random variables X and Y defined on some probability space
and possessing finite expectations, it is possible to define their distance on the basis of
the Kantorovich metric in the following way:

κ(X, Y) = ∫_{−∞}^{+∞} |F_X(t) − F_Y(t)| dt.
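As an illustration of the definition (a numerical sketch with assumed uniform distributions, not tied to the chapter's model), shifting a distribution by c moves it a Kantorovich distance of c:

```python
def kantorovich(F_X, F_Y, lo, hi, n=20000):
    """Midpoint-rule approximation of the integral of |F_X(t) - F_Y(t)| over
    [lo, hi] (the d.f.s are assumed to coincide outside the interval)."""
    h = (hi - lo) / n
    return h * sum(abs(F_X(lo + (i + 0.5) * h) - F_Y(lo + (i + 0.5) * h))
                   for i in range(n))

F_X = lambda t: min(max(t, 0.0), 1.0)         # d.f. of U(0, 1)
F_Y = lambda t: min(max(t - 0.5, 0.0), 1.0)   # d.f. of U(0.5, 1.5)
dist = kantorovich(F_X, F_Y, -1.0, 3.0)       # approximately the shift 0.5
```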
The distance between the cost functions is measured in terms of the Kolmogorov
uniform metric. Thus, we are going to study Δ_n = sup_x |f_{n,X}(x) − f_{n,Y}(x)|.
LEMMA 4.1.– Let functions gi(y), i = 1, 2, be such that |g1(y) − g2(y)| < δ for some
δ > 0 and any y. Then sup_x |inf_{y≥x} g1(y) − inf_{y≥x} g2(y)| < δ.
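Before the proof, the lemma can be sanity-checked numerically. The functions g1 and g2 below are our illustrative choices (not from the text), built so that |g1(y) − g2(y)| ≤ 0.04 < δ; the infimum over [x, ∞) is approximated on a dense grid:

```python
import math

def inf_tail(g, x, upper=10.0, steps=100_000):
    """Approximate inf_{y >= x} g(y) on a dense finite grid (illustration only)."""
    h = (upper - x) / steps
    return min(g(x + i * h) for i in range(steps + 1))

delta = 0.05
g1 = lambda y: (y - 2.0) ** 2                      # minimum 0, attained at y = 2
g2 = lambda y: g1(y) + 0.04 * math.sin(40.0 * y)   # |g1 - g2| <= 0.04 < delta
```

For several values of x, the two tail infima indeed differ by less than δ, as the lemma asserts.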
PROOF.– Fix x and put Ci = inf_{y≥x} gi(y). Then, according to the definition of
infimum, for any ε > 0, there exists y1(ε) ≥ x such that g1(y1(ε)) < C1 + ε.
Therefore, g2 (y1 (ε)) < g1 (y1 (ε)) + δ < C1 + ε + δ implying C2 < g2 (y1 (ε)) <
C1 + ε + δ. Letting ε → 0, we obtain immediately C2 < C1 + δ. In a similar way, we
establish C1 < C2 + δ, thus obtaining the desired result |C1 − C2 | < δ. Now, we are
able to estimate Δ1.

LEMMA 4.2.– If κ(X, Y) = ρ, then Δ1 ≤ αrρ.

PROOF.– According to Lemma 4.1, we need to estimate |G1,X(y) − G1,Y(y)| for any
y. The definition of these functions gives

G1,X(y) − G1,Y(y) = r[E(αX − y)+ − E(αY − y)+] = rα ∫_{y/α}^∞ (F̄X(t) − F̄Y(t)) dt.

This leads immediately to the desired estimate. Next, we prove the main result
demonstrating the model's stability.
THEOREM 4.2.– If κ(X, Y) = ρ, then Δn ≤ Dnρ, where

Dn = α( r(1 − β^n)/(1 − β) + k(β − β^n)/(1 − β) ).
PROOF.– As in the proof of Lemma 4.2, we begin with the estimation of
|Gn,X(y) − Gn,Y(y)| for any y. Due to definition [4.3], we have:

|Gn,X(y) − Gn,Y(y)| ≤ r|E(αX − y)+ − E(αY − y)+|
    + β|∫_0^∞ fn−1,X(y + M − αs)ϕX(s) ds − ∫_0^∞ fn−1,Y(y + M − αs)ϕY(s) ds|.
Obviously, the first term on the right-hand side of the inequality is less than αrρ.
To estimate the second term, we rewrite it in the form:

β|∫_0^∞ fn−1,X(y + M − αs)ϕX(s) ds − ∫_0^∞ fn−1,Y(y + M − αs)ϕX(s) ds
    + ∫_0^∞ fn−1,Y(y + M − αs)ϕX(s) ds − ∫_0^∞ fn−1,Y(y + M − αs)ϕY(s) ds|.
Clearly,

|∫_0^∞ [fn−1,X(y + M − αs) − fn−1,Y(y + M − αs)]ϕX(s) ds| ≤ Δn−1.
Integrating by parts, we rewrite ∫_0^∞ fn−1,Y(y + M − αs)ϕY(s) ds in the form:

−fn−1,Y(y + M − αs)F̄Y(s)|_0^∞ − α ∫_0^∞ fn−1,Y′(y + M − αs)F̄Y(s) ds
    = fn−1,Y(y + M) − α ∫_0^∞ fn−1,Y′(y + M − αs)F̄Y(s) ds.
Hence, we obtain:

Δn ≤ αrρ + αβ max_y |fn−1,Y′(y)| ρ + βΔn−1.

It is not difficult to prove that max_y |fn−1,Y′(y)| ≤ k for all n, so:

Δn ≤ α(r + βk)ρ + βΔn−1.

Solving this recurrent relation, we finish the proof and get the desired form of Dn.
COROLLARY 4.2.– Δn ≤ (r + kβ)αρ/(1 − β) for any n.
In other words, we established the stability of the model with respect to small
perturbations of claim distribution.
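The recurrence behind Theorem 4.2 and the closed form of Dn can be cross-checked numerically. The sketch below is ours (parameter values are arbitrary illustrations): iterating the recurrence from Δ1 = αrρ reproduces Dn·ρ exactly, and Dn·ρ never exceeds the uniform bound of Corollary 4.2.

```python
def d_n(n, alpha, r, k, beta):
    """Closed form D_n = alpha*(r*(1 - beta**n) + k*(beta - beta**n))/(1 - beta)."""
    return alpha * (r * (1 - beta ** n) + k * (beta - beta ** n)) / (1 - beta)

def delta_bound(n, alpha, r, k, beta, rho):
    """Iterate Delta_n <= alpha*(r + beta*k)*rho + beta*Delta_{n-1},
    starting from Delta_1 <= alpha*r*rho."""
    delta = alpha * r * rho
    for _ in range(n - 1):
        delta = alpha * (r + beta * k) * rho + beta * delta
    return delta
```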
The company capital is described by the classical risk process

R(t) = x + ct − Σ_{i=1}^{N(t)} Xi,

where x = R(0) is the initial capital, Xn is the amount of the nth claim, and N(t)
is the number of claims up to time t. The sequence {Xn, n ≥ 1} consists of i.i.d.
non-negative r.v.'s with finite mean and d.f. F(x). It is independent of the Poisson
process N(t) with intensity λ. The premium inflow rate is c > 0.
Starting with the seminal paper by De Finetti (1957), the study of dividends has been
an important subject of actuarial mathematics. We mention also in
Optimization and Asymptotic Analysis of Insurance Models 49
passing the papers by Gordon (1959) and Miller and Modigliani (1961), which were
among the first to treat the dividend problem, and the paper by Albrecher and Thonhauser
(2009), which reviews the results obtained before 2009.
The objective function V(Q0, L) is the expected discounted dividends paid until
ruin. To calculate it, we introduce the following notation. Let L be some strategy of
dividend payment, Q0 be the initial capital and δ > 0 be the force of interest. Then,
denoting by T the time of ruin, it is possible to write:

V(Q0, L) = E ∫_0^T e^{−δt} dL(t).
DEFINITION 4.3.– The strategy L is called a barrier strategy with barrier level b if, for
Q(t) > b, the amount Q(t) − b is paid immediately; if Q(t) = b, all the premium
inflow is paid as dividends; and if Q(t) < b, nothing is paid.
THEOREM 4.3.– There exists b∗ such that for any initial capital satisfying the condition
0 ≤ Q0 ≤ b∗, the barrier strategy specified by this level is optimal.
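Definition 4.3 and the objective function can be made concrete with a small Monte Carlo sketch. This is our illustration only (the chapter itself proceeds analytically, and the parameter values below are arbitrary): each path runs the compound Poisson capital process under a barrier strategy and accumulates the discounted dividends until ruin.

```python
import math
import random

def discounted_dividends(q0, b, c, lam, gamma, delta, rng, horizon=200.0):
    """One path of the compound Poisson risk process under a barrier strategy:
    any excess over b is paid out at once, the premium inflow is paid as dividends
    while the capital sits at b, and nothing is paid below the barrier. Returns
    the dividends discounted at force of interest delta, until ruin (or `horizon`)."""
    t, paid = 0.0, 0.0
    q = min(q0, b)
    if q0 > b:
        paid += q0 - b                        # initial overshoot paid immediately
    while t < horizon:
        w = rng.expovariate(lam)              # waiting time to the next claim
        reach = (b - q) / c                   # time needed to climb back to b
        if reach < w:
            # dividends flow at rate c on [t + reach, t + w]
            paid += (c / delta) * (math.exp(-delta * (t + reach))
                                   - math.exp(-delta * (t + w)))
            q = b
        else:
            q += c * w
        t += w
        q -= rng.expovariate(gamma)           # claim amount ~ Exp(gamma)
        if q < 0:
            break                             # ruin: no further dividends
    return paid

def value(q0, b, c, lam, gamma, delta, n_paths=1000, seed=1):
    """Monte Carlo estimate of V(Q0, L) for the barrier strategy with level b."""
    rng = random.Random(seed)
    return sum(discounted_dividends(q0, b, c, lam, gamma, delta, rng)
               for _ in range(n_paths)) / n_paths
```

If Q0 exceeds b, every path pays at least Q0 − b immediately, so the estimated value is bounded below by that amount.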
Put for simplicity Q0 = Q. Then, it is not difficult to prove, using the total
probability formula and properties of the Poisson process, that V(Q, b) satisfies the
following integro-differential equation:

∂V(Q, b)/∂Q = ((λ + δ)/c) V(Q, b) − (λ/c) ∫_0^{Q+0} V(Q − y, b) dF(y). [4.6]
We can find in the book by Bühlmann (1970) that the solution of [4.6] with
boundary condition [4.7] has the form:

V(Q, b) = h(Q)/h′(b),
where the function h(x) is a unique solution, up to a constant factor, of the equation

h′(x) = ((λ + δ)/c) h(x) − (λ/c) ∫_0^{x+0} h(x − y) dF(y). [4.8]
THEOREM 4.4.– Assume that the d.f. of claims has a density ϕ(y) given by ϕ(y) =
P(y)e^{−y}, y ≥ 0, where P(y) is a polynomial of degree m. Then, the
integro-differential equation [4.8] can be reduced to a homogeneous ordinary
differential equation of order m + 2 with constant coefficients.
Integration by parts of the last summand and replacement of the integral via
formula [4.9] leads to:

ch″(x) = (λ + δ − c)h′(x) + (λ + δ − λP(0))h(x) − λ ∫_0^x h(x − y)e^{−y}P′(y) dy. [4.10]
Note that we have already obtained a linear differential equation with constant
coefficients plus an integral term. Performing the same transformation on expression
[4.10], we obtain a differential equation of higher order:

ch‴(x) = (λ + δ − 2c)h″(x) + (2λ + 2δ − λP(0) − c)h′(x)
    + (λ + δ − λP(0) − λP′(0))h(x) − λ ∫_0^x h(x − y)e^{−y}P″(y) dy. [4.11]
Finally, to establish the relation between the function h(x) and its several
derivatives, we repeat the previous cycle, getting:

ch(4)(x) = (λ + δ − 3c)h‴(x) + (3λ + 3δ − λP(0) − 3c)h″(x)
    + (3λ + 3δ − 2λP(0) − λP′(0) − c)h′(x)
    + (λ + δ − λP(0) − λP′(0) − λP″(0))h(x) − λ ∫_0^x h(x − y)e^{−y}P‴(y) dy. [4.12]
Each time, we obtain a homogeneous linear differential equation with constant
coefficients plus an integral term. To make it clear that such a statement is true for an
equation of any order, Table 4.1 shows how the new coefficients are
related to those from the previous equation.
The lth column of Table 4.1 presents the coefficients in the expression for ch(l)(x)
corresponding to h(x) (the first row) and h(j−1)(x) (the jth row), j > 1. Hence, it is
not difficult to see that the coefficients of the main diagonal have the form λ + δ − kc
for non-negative integer k. This is clear from the procedure of getting the equation of
order k + 1 from that of order k. The same reasoning applies to the expressions in the first
row of the table (the coefficients of h(x), having the form λ + δ (in the first column) and
λ(1 − P(0) − P′(0) − . . . − P(k)(0)) + δ (in the (k + 1)th column for any non-negative
integer k)). In order to calculate the other non-zero coefficients (i.e. those on the main
diagonal and above), we have to use the following rule: di,j = di,j−1 + di−1,j−1,
where di,j stands in the ith row and jth column. Obviously, all the coefficients are constant.
In all the equations, along with the derivatives, there is an integral term. However,
the order of the derivative of the polynomial P(y) under the integral sign increases each
time we pass from the kth equation to the (k + 1)th one. Thus, using induction, we obtain,
for each k ≥ 1,

ch(k+1)(x) = Σ_{i=1}^{k+1} di,k+1 h(i−1)(x) − λ ∫_0^x h(x − y)e^{−y}P(k)(y) dy.
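The rule di,j = di,j−1 + di−1,j−1, together with the closed forms for the first row and the main diagonal, is easy to mechanize. The sketch below is ours (numeric values are arbitrary); it regenerates the columns of Table 4.1 and can be checked term by term against [4.11] and [4.12]:

```python
def next_column(col, lam, delta, c, p_derivs):
    """From the coefficient column of c*h^(l)(x) (col[j] multiplies h^(j)(x)),
    build the column of c*h^(l+1)(x): interior entries follow
    d[i][j] = d[i][j-1] + d[i-1][j-1], the first row is
    lam*(1 - P(0) - P'(0) - ... - P^(l-1)(0)) + delta, and the main-diagonal
    entry is lam + delta - l*c."""
    l = len(col)
    new = [0.0] * (l + 1)
    new[0] = lam * (1.0 - sum(p_derivs[:l])) + delta
    for j in range(1, l):
        new[j] = col[j] + col[j - 1]
    new[l] = lam + delta - l * c
    return new

lam, delta, c = 1.1, 0.3, 2.0
p = [0.5, 0.25, 0.125]                               # P(0), P'(0), P''(0), illustrative
col2 = [lam + delta - lam * p[0], lam + delta - c]   # coefficients in [4.10]
col3 = next_column(col2, lam, delta, c, p)           # should match [4.11]
col4 = next_column(col3, lam, delta, c, p)           # should match [4.12]
```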
Since the (m + 1)th derivative of the polynomial is zero, the integral term
disappears. It follows from the proved theorem that it is easier to find the optimal
barrier for 0 ≤ Q ≤ b if the d.f. satisfies the condition dF(y) = e^{−y}P(y) dy,
where P(y) is a polynomial of degree m. An example of such a distribution is
Γ(m + 1, 1), where m is a non-negative integer. In this case, the density has the form
ϕ(y) = (1/m!) y^m e^{−y}.
Therefore, we assume further that the claim amount has the exponential
distribution with parameter γ, that is, F(y) = 1 − e^{−γy}, where γ is the inverse of the
mathematical expectation. Since ϕ(y) = γe^{−γy} (the polynomial degree is equal to
zero), proceeding as in the general case, we obtain a homogeneous linear differential
equation of the second order with constant coefficients. Its characteristic equation is

cr² − (λ + δ − cγ)r − δγ = 0.
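The roots of this quadratic can be computed and sign-checked directly; the small sketch below is ours, with illustrative parameter values:

```python
import math

def char_roots(c, lam, delta, gamma):
    """Roots of c*r**2 - (lam + delta - c*gamma)*r - delta*gamma = 0."""
    b = lam + delta - c * gamma
    disc = b * b + 4.0 * c * delta * gamma    # always positive: two real roots
    r1 = (b + math.sqrt(disc)) / (2.0 * c)    # positive root
    r2 = (b - math.sqrt(disc)) / (2.0 * c)    # negative root
    return r1, r2
```

Since the product of the roots is −δγ/c < 0, they indeed have opposite signs, and Vieta's theorem gives r1 + r2 = (λ + δ − cγ)/c, as used below.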
The general solution of the differential equation under consideration has the form:

h(x) = C1 e^{r1 x} + C2 e^{r2 x},

where r1 and r2 are the roots of the characteristic equation. Due to our assumptions,
all the parameters are positive. It follows immediately that the signs of the roots are
different. For certainty, suppose that r1 > 0 and r2 < 0.

LEMMA 4.3.– The constants satisfy C1/C2 = −(r1 + γ)/(r2 + γ).
PROOF.– We substitute the explicit form of h(x) in [4.8]. After calculation of h′(x)
and of the integral in this equation, we set x = 0, obtaining:

C1 r1 + C2 r2 = ((λ + δ)/c)(C1 + C2).
According to Vieta's theorem, we have (λ + δ − cγ)/c = r1 + r2; in other words,
(λ + δ)/c = γ + r1 + r2. Using this relation, we obtain 0 = (γ + r2)C1 + (γ + r1)C2,
whence it follows immediately that C1/C2 has the desired form, thus ending the proof.
The inequality r2 + γ > 0 is also valid. Therefore, according to Lemma 4.3,
C1/C2 < 0 and the constants C1 and C2 have different signs. Our aim is to find a
positive solution h(x) for 0 < x < ∞. To this end, we need to have C1 > 0 (this is
easy to understand by letting x tend to +∞ in the expression for h(x)); hence, C2 < 0
and C1 + C2 > 0 (we want to have h(0) > 0).
Thus, we established the form of h(x) and found the restrictions on the constants.
The last step is to find the optimal barrier b∗ in the set [0; +∞). In order to
minimize the derivative h′(x), we have to find the root of the equation h″(x) = 0.
Hence, we have to solve the following equation:

C1 r1² e^{r1 b∗} + C2 r2² e^{r2 b∗} = 0,

giving

b∗ = (1/(r1 − r2)) ln(−C2 r2²/(C1 r1²)). [4.13]
The right-hand side is well defined, since we have already established that:

−C2/C1 = (r2 + γ)/(r1 + γ) > 0.
We consider here the claim distributions with density ϕ(y) = γe^{−γy}, that is, the
exponential distribution with parameter γ. To find the optimal barrier b∗, formula
[4.13] is used.
In the following, we provide the Python code for solving the problem of barrier
calculation in a particular case and the results obtained:
import math
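Only the first line of the published listing survives in this reproduction. A minimal sketch of how formula [4.13] can be evaluated (the function and parameter names are ours, not the authors') is:

```python
import math

def optimal_barrier(c, lam, delta, gamma):
    """Optimal barrier b* from formula [4.13] for Exp(gamma) claims, using
    -C2/C1 = (r2 + gamma)/(r1 + gamma) from Lemma 4.3; clamped to [0, +inf)."""
    b = lam + delta - c * gamma
    disc = b * b + 4.0 * c * delta * gamma
    r1 = (b + math.sqrt(disc)) / (2.0 * c)       # positive root
    r2 = (b - math.sqrt(disc)) / (2.0 * c)       # negative root
    ratio = (r2 + gamma) * r2 ** 2 / ((r1 + gamma) * r1 ** 2)  # = -C2*r2^2/(C1*r1^2)
    return max(0.0, math.log(ratio) / (r1 - r2))
```

With λ = γ = 1, c = 1.1 (a 10% loading) and δ = 0.01, this gives b* close to 7, and the returned level satisfies h″(b*) = 0.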
In Table 4.2, the optimal values of the barrier b∗ are given for some parameters under
the additional condition (1/γ) · δ = 1,000,000.
The results for the case (1/γ) · δ = 5,000,000 are given in Table 4.3.
In both cases, we calculated c in such a way that there is a safety loading of 5%,
10%, 15% and 20%. Note that an increase in the safety loading leads to a decrease in
the optimal barrier level.
A problem proposed in the book by Bühlmann (1970) is solved in section 4.3. The
company capital is described by a compound Poisson process controlled by a dividend
strategy. The expected discounted dividends until ruin are chosen as the objective
function. For the barrier strategy, the explicit form of the linear differential equation
is established if the claim amounts have the density ϕ(y) = P(y)e^{−y}, where P(y) is
a polynomial of degree m. Gamma distributions with integer shape parameter belong to this
class. Further investigation includes the sensitivity analysis of such a model and the
consideration of more complicated models, including dependence between the claim
amounts and their number, investment in risky and risk-free assets, and taxes. Other
dividend strategies can also be considered (see Bulinskaya (2018)). Due to a lack of space,
these results will be published in another paper.
4.5. References
Albrecher, H. and Thonhauser, S. (2009). Optimality results for dividend problems in insurance.
Revista de la Real Academia de Ciencias Exactas, Fisicas y Naturales. Serie A. Matematicas,
103(2), 295–320.
Bellman, R. (1957). Dynamic Programming. Princeton University Press, Princeton, NJ.
Bernstein, P.L. (1996). Against the Gods: The Remarkable Story of Risk. John Wiley and Sons,
Inc., New York.
Bulinskaya, E. (2017). New research directions in modern actuarial sciences. Springer
Proceedings in Mathematics and Statistics, 208, 349–408.
Bulinskaya, E. (2018). Asymptotic analysis and optimization of some insurance models.
Applied Stochastic Models in Business and Industry, 34(6), 762–773.
Bulinskaya, E. and Gusak, J. (2016). Optimal control and sensitivity analysis for two risk
models. Communications in Statistics – Simulation and Computation, 45, 1451–1466.
Bühlmann, H. (1970). Mathematical Methods in Risk Theory. Springer, Berlin, Heidelberg,
New York.
Cramér, H. (1955). Collective Risk Theory: A Survey of the Theory from the Point of View of the
Theory of Stochastic Process. Ab Nordiska Bokhandeln, Stockholm.
De Finetti, B. (1957). Su un’impostazione alternativa della teoria collettiva del rischio.
Transactions of the XV International Congress of Actuaries, 433–443.
Dickson, D.C.M. and Waters, H. (2004). Some optimal dividends problems. ASTIN Bulletin,
34, 49–74.
Gerber, H. (1969). Entscheidungskriterien für den zusammengesetzten Poisson-Prozess.
Schweizerische Vereinigung der Versicherungsmathematiker Mitteilungen, 69, 185–228.
Gordon, M.J. (1959). Dividends, earnings and stock prices. Review of Economics and Statistics,
41, 99–105.
Lundberg, F. (1903). Approximerad framställning av sannolikhetsfunktionen. Återförsäkring
av kollektivrisker. Akad. Afhandling, Almqvist o. Wiksell, Uppsala.
Miller, M.H. and Modigliani, F. (1961). Dividend policy, growth, and the valuation of shares.
The Journal of Business, 34(4), 411–433.
5
Bridges are important structures. They are used in land transportation to connect
points that are often otherwise inaccessible. Loading forces due to traffic volume and
flow are important physical factors that affect a bridge's structural reliability. Thus,
for safety assessments, it is important to monitor and study traffic volume. In this
work, we analyze the traffic data on the 25 de Abril Bridge in Portugal. The aim is to
study the tail of the traffic volume distribution.
5.1. Introduction
Bridges are the structures that allow people and vehicles to cross a space between
two elevations. They are used to join roads, as well as to connect the two banks of a
body of water, like a lake or river, or a deep opening, like a valley. The assessment of
the safety of existing bridges has received technical and scientific attention, partly due
to the occurrence of grave accidents in these structures. For safety assessments, it is
thus important to monitor and study traffic volume and flow on bridges. In this work,
we analyze the traffic volume data on the 25 de Abril Bridge, in Portugal (Figure 5.1).
One main concern is the analysis of high traffic since it can lead to long periods of
traffic congestion which can result in higher probabilities of failure of the bridge in
its lifetime. The 25 de Abril Bridge opened on the 6th of August 1966 and connects
Lisbon to the southern side of the Tagus River. This is the longest suspension bridge in
Europe, with a total length of 2,277 meters. It has two levels: an upper level for cars,
with a three-lane roadway in each direction and a dividing guardrail, and a lower
one, built in 1999, for trains. Due to its similarity, and because it was manufactured by
the same company, it is often compared to the Golden Gate Bridge in San Francisco.
Chapter written by Frederico CAEIRO, Ayana MATEUS and Conceição VEIGA DE ALMEIDA.
Data Analysis and Related Applications 1, First Edition. Edited by Konstantinos N. Zafeiris,
Christos H. Skiadas, Yiannis Dimotikalis, Alex Karagrigoriou and Christiana
Karagrigoriou-Vonta.
© ISTE Ltd 2022. Published by ISTE Ltd and John Wiley & Sons, Inc.
The rest of this chapter is organized as follows: in section 5.2 we describe the data
under study. In section 5.3, we review the extreme value methodology used in this
work. Finally, in section 5.4, we apply the extreme value models to infer the extremal
behavior of the traffic volume and provide some concluding remarks.
Figure 5.1. 25 de Abril Bridge and the Sanctuary of Christ the King monument (to
the right of the photo) in the city of Almada. The photo was taken by the first author
in September 2019. For a color version of this figure, see www.iste.co.uk/zafeiris/data1.zip
5.2. Data
The traffic data we considered in our analysis was provided by INE (Instituto
Nacional de Estatística/Statistics Portugal) and by IMT (Instituto da Mobilidade e dos
Transportes, I.P.). Although there are only tolls in the South–North direction, traffic is
also counted in the other direction through sensors embedded in the roadway. The available
data consists of the number of vehicles. No information is available regarding the
class of a vehicle and the corresponding load. INE provides an archive of public
data with easy online access. Regarding traffic volume, the data obtained from INE
consists of the annual and monthly average daily traffic between 1998 and 2019. To
study variations in the traffic, including the extreme values, the daily average could
be meaningless. Thus, daily (or hourly) observations are more appropriate for making
inferences about the right tail. Daily values from January 1, 2010 to December 31, 2018
were provided on request by IMT. We also obtained from IMT annual and monthly
average daily data for the years before 1998.
Figure 5.2 shows the annual average daily traffic from 1966 to 2019. The years
from 1966 to 2001 correspond to a period of traffic growth. After 2001, the annual
average daily traffic appears to be stationary, with a change point in 2010.
Note that the year 2001 corresponds to the beginning of the Portuguese economic crisis.
Statistical Analysis of Traffic Volume in the 25 de Abril Bridge 59
5.3. Methodology
We focus on the right tail of the underlying distribution; results for the left tail can
be easily derived from the analogous results for the right tail. Fréchet (1927) and
Fisher and Tippett (1928) were the first to derive asymptotic
probability models for the transformed sample maximum. The first fundamental
limit result is due to Gnedenko (1943) who fully characterized the three possible
non-degenerate limit distributions of the linearly normalized sample maximum of
iid random variables (see also von Mises (1964) 1). This result is now known as the
extremal types theorem. Let X(n) = max1≤i≤n Xi be the sample maximum. Let
us also assume that there exist normalizing constants an > 0, bn ∈ R and some
non-degenerate d.f. G such that, for all x,

lim_{n→∞} P( (X(n) − bn)/an ≤ x ) = G(x). [5.1]
With the appropriate choice of the normalizing constants, G must be one of the
three limit models, which may be unified in the generalized extreme value (GEV)
distribution,

G(x) ≡ G(x|ξ) :=  exp(−(1 + ξx)^{−1/ξ}),  1 + ξx > 0,  if ξ ≠ 0,
                  exp(−exp(−x)),          x ∈ R,       if ξ = 0,   [5.2]
here presented in the von Mises–Jenkinson form (Jenkinson 1955; von Mises 1964).
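As a purely illustrative sketch (not from the chapter), the standard GEV d.f. [5.2] can be coded directly; note that outside the support, the d.f. is 0 below the lower endpoint (ξ > 0) and 1 above the upper endpoint (ξ < 0):

```python
import math

def gev_cdf(x, xi):
    """Standard GEV d.f. G(x | xi) in the von Mises-Jenkinson form of [5.2]."""
    if xi == 0.0:
        return math.exp(-math.exp(-x))     # Gumbel case
    t = 1.0 + xi * x
    if t <= 0.0:
        return 0.0 if xi > 0 else 1.0      # outside the support 1 + xi*x > 0
    return math.exp(-t ** (-1.0 / xi))
```

The ξ ≠ 0 branch converges to the Gumbel branch as ξ → 0, which is why the three classical limit models unify into the single family above.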
When the non-degenerate limit in [5.1] exists, we say that F belongs to the max-domain of attraction of G.
1 This reference is a reprint of the 1936 edition, found at: von Mises, R. (1936). La distribution
de la plus grande de n valeurs, Rev., Math, Union Interbalcanique, 1, 141–160.
Another important result in the field of EVT is the joint limiting distribution of
the r largest order statistics (with r fixed). We will assume that equation [5.1] holds,
i.e. (X(n) − bn )/an converges in distribution to G(x), with adequate normalizing
constants an > 0 and bn ∈ R. Then, the joint limiting distribution of the normalized
r largest order statistics,

( (X(n) − bn)/an , (X(n−1) − bn)/an , . . . , (X(n−r+1) − bn)/an ),

with X(n) ≥ X(n−1) ≥ . . . ≥ X(n−r+1), is the multivariate GEV model (Dwass
1964), with an associated probability density function given by:

hr(x(n), x(n−1), . . . , x(n−r+1)) = g(x(n−r+1)) ∏_{i=1}^{r−1} g(x(n−i+1)) / G(x(n−i+1)), [5.3]
if x(n) > x(n−1) > · · · > x(n−r+1), where g(x) = ∂G(x)/∂x and G(x) is the GEV
distribution given in [5.2]. Note that for r = 1, equation [5.3] corresponds to the
density function of the GEV distribution, as expected. Also, if we consider the kth
largest order statistic X(n−k+1) for some fixed k, we have (Arnold et al. 1992):

lim_{n→∞} P( (X(n−k+1) − bn)/an ≤ x ) = G(x) Σ_{i=0}^{k−1} (−ln G(x))^i / i!. [5.4]
The block maxima method consists of dividing the initial sample into disjoint
blocks of equal size and fitting the GEV model in equation [5.2] to the sample of
block maxima. The size of the block is important due to the usual trade-off between
bias (small block size) and variance (large block size). When working with time-series
data, it is usual to choose the block length as one year. This choice allows us to assume
that the block maxima are iid, even though the data has serial dependence. The limit in
equation [5.1] justifies the following approximation, for large values of n:

P(X(n) ≤ z) ≈ G( (z − bn)/an ).
Because the GEV model provides only an approximation for the distribution of
X(n), bias due to model misspecification can occur. Since the normalizing constants
an > 0 and bn ∈ R are unknown, they are incorporated in the GEV distribution as
location and scale parameters, λ and δ, leading to the model:

G(z|ξ, λ, δ) :=  exp(−(1 + ξ(z − λ)/δ)^{−1/ξ}),  1 + ξ(z − λ)/δ > 0,  if ξ ≠ 0,
                 exp(−exp(−(z − λ)/δ)),          z ∈ R,              if ξ = 0.  [5.5]
Next, we fit the GEV model in equation [5.5] to the block maxima sample.
The estimation of the parameters (ξ, λ, δ) is usually performed using the maximum
likelihood method or the probability weighted moment (PWM) method (Hosking
et al. 1985). Since the support of the GEV model may depend on its parameters, the
asymptotic normality of the maximum likelihood estimators may not hold. However,
if ξ > −0.5, the maximum likelihood estimators are consistent and asymptotically
normal (Smith 1985). Regarding PWM estimators, consistency and asymptotically
normality can be guaranteed for ξ < 1 and ξ < 0.5, respectively. Note that in
practical applications, we often have −0.5 < ξ < 0.5. Additional asymptotic results
for the block maxima method were recently presented in Bücher and Segers (2017)
and Dombry and Ferreira (2019).
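The chapter's estimation was done in R (with the ismev package, as noted in section 5.4). Purely as an illustration, and not the authors' code, the PWM estimators of Hosking et al. (1985) can be implemented directly; the function returns ξ in the convention of equation [5.5] (ξ = −k in Hosking's parameterization), and the synthetic Gumbel sample (ξ = 0) below is ours:

```python
import math
import random

def gev_pwm_fit(sample):
    """Probability-weighted-moment estimators (Hosking et al. 1985) for the GEV.
    Returns (xi, loc, scale) in the parameterization of equation [5.5]."""
    x = sorted(sample)
    n = len(x)
    b0 = sum(x) / n
    b1 = sum((j / (n - 1.0)) * x[j] for j in range(n)) / n
    b2 = sum((j * (j - 1.0) / ((n - 1.0) * (n - 2.0))) * x[j] for j in range(n)) / n
    c = (2 * b1 - b0) / (3 * b2 - b0) - math.log(2) / math.log(3)
    k = 7.8590 * c + 2.9554 * c * c          # Hosking's shape k (= -xi)
    g = math.gamma(1.0 + k)
    scale = (2 * b1 - b0) * k / (g * (1.0 - 2.0 ** (-k)))
    loc = b0 + scale * (g - 1.0) / k
    return -k, loc, scale

# Synthetic yearly-maximum-like Gumbel sample (true xi = 0)
rng = random.Random(42)
sample = [170000.0 + 3800.0 * (-math.log(-math.log(rng.random())))
          for _ in range(5000)]
xi_hat, loc_hat, scale_hat = gev_pwm_fit(sample)
```

On this sample, the fitted shape is close to zero and the location and scale are recovered to within sampling error, illustrating the consistency claim above for ξ < 1.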
Model checking can be done with a histogram, a probability plot, a quantile plot or
with a return level plot with empirical estimates of the return level function (see Coles
(2001) and Reiss and Thomas (2007) for further details).
When analyzing extreme values with the block maxima method, we often miss
several extreme observations. This problem has motivated researchers to use more
extreme values from the sample. Smith (1986) and Weissman (1978) were the first
to make inference with a model based on the r-largest order statistics from each
block. Under this approach, the initial sample is divided into blocks and we select
the r-largest order statistics from each block. Then, the model in equation [5.3] with
additional location and scale parameters λ and δ > 0 is fitted to the data. The
estimation is usually performed by maximum likelihood. As with the choice of the
block length, the choice of the parameter r accommodates a trade-off between bias
(large r) and variance (small r). In practice, it is advisable not to choose r too large
(Smith 1986).
REMARK 5.1.– Note that both probabilistic models used in sections 5.3.2 and 5.3.3
share the same shape, location and scale parameters, (ξ, λ, δ). Therefore, it is usual
to estimate those parameters, using the r-largest order statistics method, and then
incorporate those estimates in the GEV model in equation [5.5] to estimate other
important parameters.
Estimation of the model parameters is an important first step for further inference
in the tail. The second and most important step is to yield precise inference about
the tail behaviour of F: more precisely, to estimate parameters such as an upper tail
probability, an extreme quantile or the right endpoint of F, whenever finite.
An upper tail probability is the probability p that the block maximum exceeds some
high value yp (p small). The tail probability can be estimated by
1 − G(yp | ξ̂, λ̂, δ̂), where G is the GEV d.f. in equation [5.5].

The quantile q1−p, the solution of G(q1−p | ξ̂, λ̂, δ̂) = 1 − p, is also the level expected
to be exceeded on average once every 1/p years. We usually say that q1−p is the return
level associated with the return period 1/p. A plot of the return period (on a logarithmic
scale) versus the return level is called a return level plot.
Let ω = sup{x : F(x) < 1} denote the right endpoint of the GEV model. If
ξ < 0, the right endpoint is finite and can be estimated by:

ω̂ = λ̂ − δ̂/ξ̂.
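Using the r = 1 point estimates reported in Table 5.1 (ξ̂ = −0.132, λ̂ = 170156.651, δ̂ = 3778.887), the return levels of Table 5.2 and the endpoint estimate can be reproduced up to the rounding of the parameter estimates; a sketch (ours, not the authors' code):

```python
import math

# r = 1 point estimates from Table 5.1 (shape, location, scale)
xi, loc, scale = -0.132, 170156.651, 3778.887

def return_level(m, xi, loc, scale):
    """m-year return level: solves G(z | xi, loc, scale) = 1 - 1/m (case xi != 0)."""
    y = -math.log(1.0 - 1.0 / m)
    return loc + scale * (y ** (-xi) - 1.0) / xi

right_endpoint = loc - scale / xi    # finite since xi < 0
```

return_level(10, ...) and return_level(100, ...) come out close to the 177,511 and 183,175 of Table 5.2, and the endpoint close to the 198,690 vehicles quoted in the text.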
The models presented in section 5.3 will now be applied to the traffic data of
the 25 de Abril Bridge. We will consider only the period where daily values are
available (2010–2018). Due to yearly seasonality, the block is defined as one year.
All computations were done in the R software, with the package ismev (Heffernan and
Stephenson 2018). Table 5.1 shows the maximized log-likelihood (ll0), the parameter
estimates and standard errors (in parentheses) of the GEV (r = 1) and of the multivariate
GEV model with 2 ≤ r ≤ 5.
r ll0 λ̂ σ̂ ξ̂
1 −87.599 170,156.651 (1,409.031) 3,778.887 (1,026.358) −0.132 (0.240)
2 −168.730 172,045.883 (1,404.189) 4,348.683 (664.485) −0.346 (0.164)
3 −244.945 172,548.763 (1,255.496) 4,071.888 (526.778) −0.314 (0.148)
4 −318.084 172,636.307 (1,109.469) 3,858.426 (464.409) −0.277 (0.123)
5 −388.923 172,390.040 (986.444) 3,546.707 (357.260) −0.250 (0.097)
Table 5.1. Maximized log-likelihood (ll0), parameter estimates and standard errors
in parentheses of the GEV (r = 1) and multivariate GEV model with 2 ≤ r ≤ 5
Comparing the results, we note that both the estimates and the standard errors change
with different values of r. The standard errors decrease as r increases. Due to a possible
increase in bias, it is advisable not to let r be too large. Coles (2001) suggests choosing
r as large as possible, subject to diagnostics of the fit.
Figure 5.4. Diagnostic plots of the GEV model fit to the yearly maximum
from the daily traffic data of the 25 de Abril Bridge. For a color
version of this figure, see www.iste.co.uk/zafeiris/data1.zip
We validated the fitted model using the histogram, the probability plot, the quantile
plot and the return level plot. These plots confirm that the fit is more satisfactory for
r = 1. In Figure 5.4, we present the diagnostic plots of the GEV distribution based on
the block maxima method (r = 1).
Using the delta method, the asymptotic 95% confidence intervals for the
parameters ξ, λ and δ are, respectively, (−0.603, 0.338), (167395.0, 172918.3) and
(1767.262, 5790.513). Despite the fact that the point estimate of the shape parameter
ξ is negative, the corresponding confidence interval includes the value zero. Therefore,
we do not have enough evidence to assume that the Weibull model is the most
appropriate one. The likelihood ratio test statistic is equal to 0.292, which suggests
that the Gumbel model could be adequate. Nevertheless, we took the safer decision
and prefer to model the tail within the GEV family of distributions.
In Table 5.2, we provide estimates and confidence intervals for the m-year return
level (m = 10, 50, 100). Assuming the stationarity of future extreme values, we expect
the daily traffic to remain below 195,000 vehicles during the next 100 years. Also, since the
estimate of the shape parameter is negative, the right endpoint estimate is 198,690 vehicles.
Return period Return level 95% confidence interval for the return level
10 177,511 (173,115, 181,906)
50 181,672 (173,267, 190,076)
100 183,175 (172,367, 193,982)
Table 5.2. Estimates and 95% confidence intervals for the m-year return level (m = 10, 50, 100)
5.5. Acknowledgements
This work was partially funded by national funds through the FCT – Fundação
para a Ciência e a Tecnologia, I.P., under the scope of the project UIDB/00297/2020
(Center for Mathematics and Applications).
5.6. References
Arnold, B.C., Balakrishnan, N., Nagaraja, H.N. (1992). A First Course in Order Statistics.
Wiley, New York.
Beirlant, J., Caeiro, F., Gomes, M.I. (2012). An overview and open research topics in statistics
of univariate extremes. Revstat – Statistical Journal, 10(1), 1–31.
Bücher, A. and Segers, J. (2017). On the maximum likelihood estimator for the generalized
extreme-value distribution. Extremes, 20, 839–872.
6.1. Introduction
The World Health Organization (WHO) defines diabetes mellitus as “a chronic,
metabolic disease characterized by elevated levels of blood glucose (or blood sugar),
which leads over time to serious damage to the heart, blood vessels, eyes, kidneys
and nerves”. Gestational diabetes mellitus (GDM) is a form of diabetes which arises
during pregnancy.
Savona-Ventura et al. (2013) explained that the screening, as well as the OGTT, are
costly diagnostic methods. To this end, an alternative clinical risk assessment method
for GDM, based on explanatory variables that can be easily measured at minimal cost,
is sought to preclude these tests, especially in countries and health centers dealing
with budget cuts and a lack of resources. The prediction of the risk of an individual
acquiring GDM is a problem that can be tackled using a variety of classification
techniques. Savona-Ventura et al. (2013) applied binary logistic regression (BLR).
In the literature, some shortcomings of the BLR model devised in the study by
Savona-Ventura et al. (2013) are outlined. Kotzaeridi et al. (2021) remark that this
model tended to underestimate the risk of GDM. Furthermore, Lamain-de Ruiter
et al. (2017) found that this same model also involved a moderate risk of bias when
compared to other models. Thus, we seek alternative methods which may serve as
an improvement over the BLR model implemented by Savona-Ventura et al. (2013).
Nearest neighbor (NN) methods, which are non-parametric classification techniques,
were found to be commonly used in studies involving the prediction of diabetes
mellitus. Kandhasamy and Balamurali (2015) compared the performance of four
popular classification techniques, namely the J48 decision tree, the k-nearest neighbor
(kNN) classifier, random forests and support vector machines (SVMs), in predicting
the risk of diabetes mellitus for noisy (or inconsistent) data with missing values and
for consistent data. The study showed that the J48 decision tree performed best for the
noisy data, while random forests and the kNN classifier with k = 1 performed best
for the consistent data. Furthermore, Saxena et al. (2004) discuss in detail the use of
kNN in classifying diabetes mellitus. The authors applied this algorithm to a dataset
consisting of 11 variables, among which were glucose concentration, age, sex and
body mass index. Saxena et al. (2004) then analyzed the results obtained for k = 3 and
k = 5 through the use of well-known performance measures; the results of the study
led to the conclusion that the error rate increased for the larger value of k, and so better
results were obtained for k = 3.
In this chapter, our main aim is to test the applicability and the performance
of three well-known NN methods, to the problem of predicting the risk of GDM.
In particular, we focus on the application of the kNN method, the fixed-radius-NN
method and the kernel-NN method. These methods will be applied to a dataset
pertaining to 1,368 pregnant women from 11 Mediterranean countries. More
specifically, the dataset consists of 71 explanatory variables such as age, pre-existing
hypertension, menstrual cycle regularity and history of diabetes in the family. The
classification accuracy may be affected by factors such as the presence of missing values.
Predicting the Risk of Gestational Diabetes Mellitus 69
We begin this section by introducing some notation that will be used throughout
this chapter. The problem being tackled here involves binary prediction, which means
having two possible class labels: positive for GDM (1) and negative for GDM (0).
Thus, let Y be a random variable that represents a possible class label of an individual,
and let X = (X1 , . . . , Xp ) be a p-dimensional random vector whose components,
which are random variables, represent a certain feature in the dataset, for example, the
age of the mother and number of miscarriages. Also, let x = (x1 , . . . , xp ) be a vector
of observed values. In a classification problem, we make use of a dataset comprising
a finite sample of independent, identically distributed pairs (x1 , y1 ), . . . , (xn , yn ),
where yi indicates the class label of the ith observation, for i = 1, ..., n, and n
denotes the sample size. Then, we aim at using this dataset to estimate a function
Ŷ that, given a newly obtained observation/feature vector x, outputs a predicted label
Ŷ (x) ∈ {0, 1}. The function Ŷ is called a classifier. The best classifier in terms of
minimizing probability of error is the so-called Bayes classifier and is defined as
follows:
ŶBayes(x) = argmax_{y∈{0,1}} P(Y = y | X = x)
          = { 1  if P(Y = 1 | X = x) ≥ P(Y = 0 | X = x)
            { 0  otherwise.                                              [6.1]
In the case of a tie, we predict x to have label 1. By defining η(x) = P(Y = 1 | X = x), it can easily
be shown that equation [6.1] can be re-written as follows:
ŶBayes(x) = { 1  if η(x) ≥ 1/2
            { 0  otherwise.                                              [6.2]
It was shown by Chen and Shah (2018) that the Bayes classifier in equation [6.2]
is indeed the one that minimizes the probability of misclassification. Thus, no
classification procedure can do better than the Bayes classifier. Unfortunately, in
classification, we do not know the Bayes classifier ŶBayes and have to estimate it
from training data. In the next sections, we will see how we can define/approximate
the function η using three different NN methods.
The kNN algorithm is a non-parametric method that is used for classification and
regression. In the former, to decide the class label of a feature vector, we consider the
k points in the set of observed data that are closest to the point of interest. An object is
allocated to the most common class among its k nearest neighbors, where k ∈ Z+ and
usually takes on a small value. If k = 1, then the object is merely predicted to belong
to the same class as that single nearest neighbor.
Using the set-up shown in the previous section, we now proceed to define an
estimate η̂ for η(x) = P(Y = 1|X = x) as follows:
η̂(x) = (1/k) Σ_{i=1}^{k} Y_(i)(x),                                      [6.3]
where Y(i) = 1 if the ith neighbor of x has label 1 and 0 otherwise. Hence, an estimate
for equation [6.2] is as follows:
ŶkNN(x) = { 1  if η̂(x) ≥ 1/2
          { 0  otherwise.                                                [6.4]
Over the years, a number of results concerning upper and lower bounds on
misclassification errors of the kNN classifier as well as a number of convergence
guarantees have been proven. These can be found in Chaudhuri and Dasgupta (2014).
We now move on to discuss the fixed-radius NN method.
In the fixed-radius NN method, instead of determining the test point’s label by looking at its k nearest neighbors,
this point is assigned a class label through a majority vote of its neighbors that are
captured within a ball of radius r. We have to note here that if the radius r, which can
take any positive value, is not chosen carefully, then there is a risk of not finding any
points in the neighborhood of the test point.
An estimate of η is then given by:

η̂fr−NN(x) = ( Σ_{i=1}^{n} 1{ρ(x, xi) ≤ r} yi ) / ( Σ_{i=1}^{n} 1{ρ(x, xi) ≤ r} ),        [6.5]

where ρ represents the considered distance function and 1{·} is an indicator function
taking the value of 1 if its argument is true and 0 otherwise. The difference between
equations [6.5] and [6.3] is that instead of taking the average of the labels of the
k-nearest neighbors of the test point, η̂f r−N N (x) is estimated by taking the average of
the labels of all points within distance r from the reference point x. Hence, an estimate
of equation [6.2] is obtained by replacing η with equation [6.5] in equation [6.2] and
we obtain:
Ŷfr−NN(x) = { 1  if η̂fr−NN(x) ≥ 1/2 and Σ_{i=1}^{n} 1{ρ(x, xi) ≤ r} > 0
            { 0  otherwise.                                              [6.6]
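A corresponding sketch for the fixed-radius rule (again illustrative, with our own toy data; not the chapter's code):

```python
import numpy as np

def fixed_radius_predict(X_train, y_train, x, r):
    """Fixed-radius NN prediction: average the labels of all training
    points within distance r of x (eq. [6.5]); if the ball is empty,
    eq. [6.6] falls back to predicting 0."""
    dists = np.linalg.norm(X_train - x, axis=1)
    in_ball = dists <= r                  # indicator 1{rho(x, x_i) <= r}
    if not in_ball.any():                 # empty neighborhood
        return 0
    eta_hat = y_train[in_ball].mean()     # eq. [6.5]
    return int(eta_hat >= 0.5)            # eq. [6.6]

X_train = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(fixed_radius_predict(X_train, y_train, np.array([0.95, 1.0]), r=0.2))
```

The empty-ball fallback makes concrete the warning above: a poorly chosen r can leave the test point without any neighbors.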
Convergence results related to this method were discussed in detail in Chen and
Shah (2018). Next, we proceed to discuss the kernel-NN method.
In the kernel-NN method, the choice of the bandwidth h, rather than that of the kernel function, is
crucial, since the bandwidth is what determines which points give the most contribution; different kernel functions yield equally good results.
In the literature, various kernel functions have been proposed. These include the
uniform, Epanechnikov, normal, biweight and triweight kernels, among others. We
invite the interested reader to refer to Scheid (2004) and Guidoum (2015) for a
discussion on various types of kernels. Since the choice of the kernel function is not
crucial, throughout this chapter the Epanechnikov kernel was selected. Indeed, when
other kernels were selected the results did not change significantly.
As in the previous two sections, we now provide an estimate for the conditional
probability η. For kernel classification, this estimator is defined as follows:
η̂Kernel−NN(x, h) = { ( Σ_{i=1}^{n} K(ρ(x, xi)/h) yi ) / ( Σ_{i=1}^{n} K(ρ(x, xi)/h) )  if Σ_{i=1}^{n} K(ρ(x, xi)/h) > 0
                   { 0  otherwise,                                       [6.7]
where the kernel function K determines the contribution of the ith training point to
the class label prediction through a weighted average. As with the previous methods,
we now replace η with η̂Kernel−N N in equation [6.2] to obtain the following:
ŶKernel−NN(x) = { 1  if η̂Kernel−NN(x) ≥ 1/2 and Σ_{i=1}^{n} K(ρ(x, xi)/h) > 0
                { 0  otherwise.                                          [6.8]
Convergence results related to this method were discussed in detail in Chen and
Shah (2018).
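Equations [6.7] and [6.8] with the Epanechnikov kernel can be sketched as follows (a minimal illustration with our own toy data, not the chapter's implementation):

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: K(u) = 0.75 * (1 - u^2) for |u| <= 1, else 0."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def kernel_nn_predict(X_train, y_train, x, h):
    """Kernel-NN prediction: weighted average of training labels with
    weights K(rho(x, x_i)/h) (eq. [6.7]), thresholded at 1/2 (eq. [6.8]).
    Points farther than h from x receive zero weight."""
    dists = np.linalg.norm(X_train - x, axis=1)
    w = epanechnikov(dists / h)
    if w.sum() == 0:                          # no point receives positive weight
        return 0
    eta_hat = np.dot(w, y_train) / w.sum()    # eq. [6.7]
    return int(eta_hat >= 0.5)                # eq. [6.8]

X_train = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(kernel_nn_predict(X_train, y_train, np.array([0.95, 1.0]), h=0.5))
```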
The dataset is split into a training set of size m and a test set of size n − m, each observation being labeled as positive or negative for GDM. Thus, the training set will be denoted by ((x1 , y1 ), . . . , (xm , ym )) and the
test set will be denoted by ((xm+1 , ym+1 ), . . . , (xn , yn )). Therefore, xi , 1 ≤ i ≤ m is
the feature vector that contains the observed values for the ith individual in the training
set and will sometimes be called a training point. Similarly, xj , m + 1 ≤ j ≤ n is the
feature vector that contains the observed values for the jth individual in the test set,
and for simplicity will sometimes be called a test point. Therefore, although the class
label of the test point is known, we will be using the previously described NN methods
to predict it in order to evaluate the classification performance of the algorithms.
Algorithm 6.1 provides the pseudocode of kNN, while Algorithms 6.2 and 6.3
provide the pseudocode of the fixed-radius NN method and the kernel NN method,
respectively.
The NN methods discussed above all require a distance metric. To this end, an
appropriate distance metric must be utilized to account for the type of features in
the feature space (continuous variables, categorical variables or mixed data). The
choice of this metric is particularly tricky when a mixed dataset has to be
considered: some variables are quantitative in nature, while others are categorical.
Hence, in such cases, traditional distance metrics like the
Euclidean distance are not appropriate. However, there exist distance metrics that have
been designed for mixed data. These are the heterogeneous value difference metric
(HVDM) and the heterogeneous Euclidean-overlap metric (HEOM) defined in Mody
(2009).
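A sketch of the HEOM, as it is commonly defined for mixed data (this is our illustrative implementation under that common definition, not code from Mody (2009)): categorical attributes use the overlap distance, continuous attributes use the range-normalized absolute difference, and missing values receive the maximal distance.

```python
def heom(a, b, is_categorical, ranges):
    """Heterogeneous Euclidean-Overlap Metric (HEOM) sketch for mixed data.
    Categorical attributes: 0 if equal, 1 otherwise (overlap distance).
    Continuous attributes: |x - y| / range.  Missing values (None): distance 1."""
    total = 0.0
    for j, (x, y) in enumerate(zip(a, b)):
        if x is None or y is None:         # missing value: maximal per-attribute distance
            d = 1.0
        elif is_categorical[j]:
            d = 0.0 if x == y else 1.0
        else:
            d = abs(x - y) / ranges[j]     # ranges[j] = max - min of attribute j
        total += d ** 2
    return total ** 0.5

# Two records with one continuous attribute (range 4) and one categorical attribute
print(heom([1.0, "a"], [3.0, "a"], is_categorical=[False, True], ranges=[4.0, None]))
```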
In the next section, we discuss the results obtained when the above three techniques
were applied to a dataset for GDM risk prediction.
The dataset utilized is made up of information related to 1,368 mothers and their
newborn babies from 11 countries collected between 2010 and 2011. It is a mixed
dataset consisting of 72 variables – 44 categorical and 28 continuous. The categorical
variables also include the binary response variable that indicates whether the mother
has GDM or not, thus representing the mother’s class label. We note that the number
of variables in the dataset is quite large, and some may not be relevant in determining
whether a mother is at risk of being diagnosed with GDM. Therefore, appropriate tests
(described in section 6.3.2) were used to eliminate insignificant variables.
Furthermore, we see that 352 mothers were diagnosed with GDM according to
the International Association of Diabetes and Pregnancy Study Groups’ (IADPSGs)
criteria, making up 26% of the total number of mothers. In contrast, 1,016
mothers were found not to have GDM, making up 74% of the cases. Thus, we are
dealing with imbalanced data, since the majority of mothers do not have GDM, and
the minority do. This may deteriorate the performance of the classifier by increasing
the false negatives. To overcome this problem, the SMOTE-NC technique as described
by Chawla et al. (2002) was implemented prior to the application of the classification
techniques to balance out the data.
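The idea behind SMOTE-NC can be illustrated with a simplified toy oversampler (our own sketch, not the algorithm of Chawla et al. (2002): real SMOTE-NC interpolates between a minority point and one of its k nearest minority neighbors and sets categorical features by majority vote among those neighbors):

```python
import random

def smote_nc_like(minority, cont_idx, cat_idx, n_new, seed=0):
    """Toy SMOTE-NC-style oversampling: each synthetic minority sample
    interpolates the continuous features between two randomly chosen
    minority points, and copies the categorical features from one of them."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        b = rng.choice(minority)      # simplified: real SMOTE-NC uses a's k nearest minority neighbors
        gap = rng.random()
        new = list(a)
        for j in cont_idx:
            new[j] = a[j] + gap * (b[j] - a[j])   # linear interpolation of continuous features
        for j in cat_idx:
            new[j] = b[j]             # simplified: real SMOTE-NC uses the neighbors' mode
        synthetic.append(new)
    return synthetic

print(smote_nc_like([[1.0, "x"], [3.0, "x"]], cont_idx=[0], cat_idx=[1], n_new=3))
```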
Therefore, after performing variable selection, we are left with 12 categorical and
18 continuous variables, making up 30 variables in all.
The dataset was split into two non-overlapping sets: the training set and the test
set. This was carried out at an 80:20 ratio, with the training set in this case being made
up of 1,094 mothers, 283 (26%) of whom were diagnosed with GDM, and the test set
consisting of 274 mothers, 69 (25%) of whom were diagnosed with GDM. The class
imbalance in the training set was then catered for through the use of the SMOTE-NC
algorithm, as explained in Chawla et al. (2002).
6.3.3. Results
Variables were first scaled and standardized before fitting any models. The optimal
hyperparameter values for all three NN methods were then found using 10-fold cross
validation; namely, for kNN, k = 5, for fixed-radius NN, r = 5.2 and for kernel-NN,
h = 0.19.
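The tuning step can be sketched as follows: estimate the cross-validated misclassification rate for each candidate hyperparameter and keep the minimizer (a self-contained sketch for k, with our own helper names; r and h can be tuned the same way):

```python
import numpy as np

def cv_error_knn(X, y, k, n_folds=10, seed=0):
    """Estimate the misclassification rate of kNN for a given k
    via n-fold cross validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, n_folds)
    errors = 0
    for fold in folds:
        train = np.setdiff1d(idx, fold)            # all indices not in the held-out fold
        for i in fold:
            d = np.linalg.norm(X[train] - X[i], axis=1)
            nearest = train[np.argsort(d)[:k]]     # k nearest training points
            pred = int(y[nearest].mean() >= 0.5)
            errors += pred != y[i]
    return errors / len(X)

# The optimal k minimizes the cross-validated error, e.g.:
# best_k = min(range(1, 16), key=lambda k: cv_error_knn(X, y, k))
```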
                        Actual
                   Positive   Negative   Total
Predicted Positive    40          6        46
          Negative    29        199       228
          Total       69        205       274

Table 6.1. Confusion matrix for kNN on the test set (balanced case)
In the following, we will take a look at the confusion matrices obtained for each
method. Table 6.1 presents the confusion matrix for kNN, Table 6.2 presents the
confusion matrix for fixed-radius NN and Table 6.3 presents the confusion matrix
for kernel-NN.
                        Actual
                   Positive   Negative   Total
Predicted Positive     2          0         2
          Negative    73        199       272
          Total       75        199       274

Table 6.2. Confusion matrix for fixed-radius NN on the test set (balanced case)
                        Actual
                   Positive   Negative   Total
Predicted Positive    43         26        69
          Negative     2        203       205
          Total       45        229       274

Table 6.3. Confusion matrix for the kernel-NN on the test set (balanced case)
Table 6.4. Confusion matrix for BLR on the test set using
the original variables (imbalanced case)
                        Actual
                   Positive   Negative   Total
Predicted Positive    45         10        55
          Negative    24        195       219
          Total       69        205       274

Table 6.5. Confusion matrix for BLR on the test set using
the original variables (balanced case)
78 Data Analysis and Related Applications 1
After checking for multicollinearity through the Spearman correlation matrix and
removing any correlated predictors, another BLR model was applied to the training
set, and the parsimonious model was obtained using a backward stepwise process.
In this case, seven significant variables were found and retained, namely “Country”,
“Family history: diabetes mellitus in mother”, “Family history: diabetes mellitus in
father”, “Parity”, “Weight at oral glucose tolerance test”, “Area under the curve” and
“Apgar score”. Finally, this BLR model was then applied to the test set. The confusion
matrix obtained is given in Table 6.6.
                        Actual
                   Positive   Negative   Total
Predicted Positive    49          7        56
          Negative    20        198       218
          Total       69        205       274

Table 6.6. Confusion matrix for BLR on the test set using
only the significant variables (balanced case)
We should note that for the kNN, kernel-NN and BLR methods, the proportions
of true predictions in the confusion matrices were considerably higher than those
for false predictions, meaning that these techniques seem to be adequate for the
data. However, for the fixed-radius NN method, the confusion matrix in the balanced
case showed a high proportion of true negatives (72.6%), while the proportion of
true positives (0.73%) was very low relative to false predictions. This means that
this method is probably not the best for the data, since it does not perform well in
diagnosing mothers who have GDM.
Table 6.7 shows an ordering of the methods according to their overall performance
on the test set, from best to worst. This was based on five performance measures,
namely accuracy, area under the ROC curve, precision, sensitivity and F1 score.
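Four of these measures can be computed directly from the cells of a confusion matrix; the AUC additionally requires the underlying classification scores, so it cannot be recovered from the counts alone. A minimal sketch, using the Table 6.1 (kNN) counts as input:

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, sensitivity (recall) and F1 score
    from the four cells of a binary confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, precision, sensitivity, f1

# Table 6.1 (kNN, balanced case): TP = 40, FP = 6, FN = 29, TN = 199
acc, prec, sens, f1 = classification_metrics(40, 6, 29, 199)
print(round(acc, 3), round(prec, 3), round(sens, 3), round(f1, 3))
```

The low sensitivity relative to precision here reflects the 29 false negatives visible in Table 6.1.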
We see here that BLR using the original variables on the test set in the imbalanced
case (which was the original case studied by Savona-Ventura et al. (2013)) was the
fifth best classification technique, surpassing fixed-radius NN which had the worst
overall performance. Furthermore, BLR using the original variables on the test set in
the balanced case came in fourth overall. The kNN method in the balanced case had
the best AUC, and the best overall performance for the NN methods followed by the
kernel method. Finally, the binary logistic regression technique applied to the balanced
data after obtaining the parsimonious model performed slightly better than the kNN
method and proved to perform the best overall for this dataset.
6.4. Conclusion
While carrying out this study, a limitation encountered was that 10-fold cross
validation to determine the optimal hyperparameters for kNN and fixed-radius NN
in Python was not computationally efficient, meaning that it took a very long time
to train the algorithms. A possible improvement to the study may be the exploration
of Bayesian neural networks for classification problems, where cross validation is no
longer needed and so the algorithm is trained more efficiently using MCMC methods.
Alternative classification methods found in the literature can also be applied and
compared with NN methods to obtain the best model for prediction, namely decision
trees, random forests and support vector machines, for example, which are also widely
used in these types of problems.
6.5. References
Alfadhli, E.M. (2015). Gestational diabetes mellitus. Saudi Medical Journal, 36(4), 399–406.
Chaudhuri, K. and Dasgupta, S. (2014). Rates of convergence for nearest neighbour
classification. Proceedings of the 27th International Conference on Neural Information
Processing Systems – Volume 2, 3437–3445, Montreal.
Chawla, N.V., Bowyer, K., Hall, L.O., Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority
over-sampling technique. Journal of Artificial Intelligence Research, 16(1), 321–357.
Chen, G.H. and Shah, D. (2018). Explaining the success of nearest neighbour methods in
prediction. Foundations and Trends in Machine Learning, 10(5–6), 337–588.
Guidoum, A.C. (2015). Kernel estimator and bandwidth selection for density and its derivatives
[Online]. Available at: https://cran.r-project.org/web/packages/kedd/vignettes/kedd.pdf.
Hastie, T., Tibshirani, R., Friedman, J. (2001). The Elements of Statistical Learning: Data
Mining, Inference and Prediction. Springer, New York.
Kandhasamy, J.P. and Balamurali, S. (2015). Performance analysis of classifier models to predict
diabetes mellitus. Procedia Computer Science, 47, 45–51.
Kotzaeridi, G., Blatter, J., Eppel, D., Rosicky, I., Mittlboeck, M., Yerlikaya-Schatten, G.,
Schatten, C. (2021). Performance of early risk assessment tools to predict the later
development of gestational diabetes. European Journal of Clinical Investigation, 51(23).
Lamain-de Ruiter, M., Kwee, A., Naaktgeboren, C.A., Franx, A., Moons, K., Koster, M. (2017).
Prediction models for the risk of gestational diabetes: A systematic review. Diagnostic and
Prognostic Research, 1, 3.
Mody, R. (2009). Optimizing the distance function for nearest neighbors classification. Thesis,
University of California San Diego [Online]. Available at: https://escholarship.org/uc/item/
9b3839xn.
Savona-Ventura, C., Vassallo, J., Marre, M., Karamanos, B.G. (2013). A composite risk
assessment model to screen for gestational diabetes mellitus among Mediterranean women.
International Journal of Gynecology and Obstetrics, 120(3), 240–244.
Saxena, K., Khan, D.Z., Singh, S. (2004). Diagnosis of diabetes mellitus using K nearest
neighbor algorithm. International Journal of Computer Science Trends and Technology, 2(4),
36–43.
Scheid, S. (2004). Introduction to kernel smoothing [Online]. Available at: https://compdiag.
molgen.mpg.de/docs/talk_05_01_04_stefanie.pdf.
7
In all the countries of both surveys, EFA performed on the first half-samples
resulted in a unidimensional solution based on the four common items measuring
political trust in national institutions. CFA performed on the second half-samples
and the full samples resulted in adequate model fit for all cases. Moreover, the
analysis provided reliable scales which were of adequate convergent validity. The
methodology presented may be easily applied to other cases of validating scales
composed of pseudo-interval or ordinal items.
7.1. Introduction
7.2. Methods
7.2.1. Participants
The analysis was based on the 2008 ESS and EVS data for Greece, Portugal and
Spain. The ESS defines the survey population as all individuals aged 15+ residing
within private households in each country, regardless of their nationality, citizenship
or language, and this definition applies to all rounds of the survey. The EVS applies
a similar definition with the exception of age, which is defined at 18+. Therefore,
the analysis is based on those aged 18+ for both datasets so as to establish their
comparability. In Table 7.1, the demographic and social characteristics of the
participants aged 18+ are presented.
Country     N       Men (%)  Women (%)  Age mean (SD)  Married (%)  Secondary education  In paid
                                                                    or lower (%)         work* (%)
Greece
  ESS     2,019     45.1     54.9       45.8 (16.3)    59.5         73.5                 58.1
  EVS     1,500     43.3     56.7       49.6 (18.4)    59.3         82.2                 45.4
Portugal
  ESS     2,296     38.5     61.5       53.9 (19.2)    56.9         87.4                 41.6
  EVS     1,553     40.4     59.6       53.0 (18.7)    59.6         91.0                 47.2
Spain
  ESS     2,486     47.5     52.5       47.9 (18.6)    56.6         76.0                 54.1
  EVS     1,500     43.9     56.1       47.9 (19.4)    45.5         67.8                 50.9

*The reference period for the respondent’s main activity was defined as during the last seven days.

Table 7.1. Demographic and social characteristics of the participants aged 18+
As shown, in all samples, there were more women than men. Gender is
distributed similarly in the Greek and Portuguese ESS and EVS samples. In the
Spanish case, a difference of 3.4% is detected between the two samples. In all
samples, the mean age was at least 45.8 years. In the case of Spain, the mean age
was the same for both the ESS and EVS samples, while in the case of Portugal the
two means were very similar. In the case of Greece, the mean age of the EVS sample
was higher than that of the ESS sample. At least 45.5% of the participants were
married. In the case of Greece, there were no differences between the two samples.
In the case of Portugal, the percentage of married
participants in the EVS sample was higher than in the ESS one and the reverse holds
true for the case of Spain. In all samples, at least 67.8% had completed secondary
education or lower. In the cases of Greece and Portugal, the percentage of those that
had completed secondary education or lower was higher in the EVS sample and the
reverse holds true for the case of Spain. In all samples, at least 41.6% were in paid
work. In the cases of Greece and Spain, the percentage of those that were in paid
work is higher in the ESS sample and the reverse holds true for the case of Portugal.
7.2.2. Instrument
In the ESS core questionnaire, five items are used for the measurement of
political trust in national institutions: parliament, legal system, police, politicians
and political parties. All these items are included in all rounds of the survey with the
exception of the question on political parties, which was introduced in the second
round (2004). Each item is assigned a scale ranging from 0 (no trust at all) to 10
(complete trust). The level of measurement of these items is pseudo-interval. The
EVS measures political trust as confidence in 17 national institutions of which only
four are common with ESS (Table 7.2): police, courts, political parties and
parliament. These items are assigned a scale ranging from 1 (a great deal) to 4 (none
at all) and therefore their level of measurement is ordinal. The values of the EVS
items were first reversed in order to achieve correspondence with the ordering of the
response categories of the ESS items.
Table 7.2. The European Social Survey (ESS) and European Values
Study (EVS) measurement of political trust in national institutions
Political Trust in National Institutions 85
EFA was performed on the first half in order to assess the construct validity of
the scale (Fabrigar et al. 1999; Bartholomew et al. 2008). The structure suggested by
EFA was subsequently validated by carrying out CFA on the second half. Based on
the full sample and the CFA results, the psychometric properties of the scale were
assessed. Statistical analyses were performed using Mplus Version 8.4 and IBM
SPSS Statistics Version 20.
The half-sample sizes were large enough (>300) to carry out factor analyses
(Tabachnick and Fidell 2007). Since sample sizes ranged from 1,500 (Greece and
Spain, EVS) to 2,486 (Spain, ESS), the half-samples were 750 (Greece and Spain,
EVS) to 1,243 (Spain, ESS) and were therefore considered large enough to carry out
factor analyses separately in each country.
Initially, missing data analysis and data screening for outliers and unengaged
responses was performed for both half-samples (Michalopoulou 2017; Charalampi
2018; Charalampi et al. 2019, 2020). Only cases with missing values on all items
were automatically excluded from the analysis (Muthén and Muthén 1998–2017).
Cases were also eliminated if they exhibited low standard deviation (< 0.5), i.e. no
variance in the responses (Gaskin 2016). Data screening for outliers was based on
background variables, for example, gender (dichotomy), age (ratio) and education
(pseudo-interval). Cases were eliminated if they were shown in the boxplots as
outliers (Gaskin 2016; see also Thompson 2005; Tabachnick and Fidell 2007;
Brown 2015).
7.2.3.1. EFA
In performing EFA, the following sequence of decisions was required
(Michalopoulou 2017; Charalampi 2018; Charalampi et al. 2019, 2020):
1) Initially, the items’ frequency distributions were inspected (in the case of
pseudo-interval items, also for floor and ceiling effects), bearing in mind that response
percentages of less than 15% are normally deemed to be acceptable (Terwee et al. 2007).
In the case of pseudo-interval items, the appropriate univariate statistics were
computed for each item and their distributional properties were inspected (testing for
normality) to decide on the appropriateness of the methods to be used. The criterion
of corrected item-total correlations < 0.30 (Nunnally and Bernstein 1994) was used
to decide which items to exclude from the analysis. In the case of ordinal items, only
the mode and median were computed for each item.
2) The covariance matrix and the polychoric correlation matrix were employed
as the appropriate matrices of associations for pseudo-interval and ordinal items,
respectively (Brown 2015).
3) Maximum likelihood and robust weighted least squares were applied as the
appropriate methods of factor extraction for pseudo-interval and ordinal items,
respectively (Brown 2015).
4) Considering the factor analytic theory, “factors that are represented by two or
three indicators may be underdetermined […] and highly unstable across
replications” (Brown 2015, p. 21), only a unidimensional model could be tested.
5) Items were considered salient if their factor loadings were > 0.30 and therefore
the meaning of the dimension was inferred from these items (Fabrigar et al. 1999;
Thompson 2005). Items with loadings < 0.30 (i.e. low communalities) were
excluded from the analysis (Brown 2015).
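Two of the screening criteria above, the corrected item-total correlation (step 1) and the loading salience threshold (step 5), are easy to express in code. A minimal sketch with our own function names and synthetic data (not the Mplus/SPSS procedures actually used):

```python
import numpy as np

def corrected_item_total(data):
    """Corrected item-total correlation for each item: the correlation
    of the item with the sum of the *remaining* items.
    Items with values < 0.30 would be excluded (Nunnally and Bernstein 1994)."""
    data = np.asarray(data, dtype=float)
    n_items = data.shape[1]
    out = []
    for j in range(n_items):
        rest = data[:, [i for i in range(n_items) if i != j]].sum(axis=1)
        out.append(np.corrcoef(data[:, j], rest)[0, 1])
    return out

def salient(loadings, cutoff=0.30):
    """Indices of items whose factor loadings exceed the salience cutoff."""
    return [j for j, ld in enumerate(loadings) if abs(ld) > cutoff]

print(salient([0.72, 0.65, 0.25, 0.81]))   # the third item would be dropped
```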
7.2.3.2. CFA
In applying CFA, the following sequence of decisions was required
(Michalopoulou 2017; Charalampi 2018; Charalampi et al. 2019, 2020):
1) The decision on the inclusion of items in the analysis was based on the results
of the item analysis and EFA carried out on the first half-sample.
2) CFA was performed using the covariance matrix of associations and
maximum likelihood estimation in the case of pseudo-interval items and the
polychoric correlation matrix and robust weighted least squares in the case of
ordinal items.
3) Model fit was considered adequate if χ2/df < 3, standardized root-mean-square
residual (SRMR) < 0.05, comparative fit index (CFI) and Tucker-Lewis index (TLI)
values were ≥ 0.95 and the root-mean-square error of approximation (RMSEA) was ≤ 0.06
with the 90% confidence interval (CI) upper limit ≤ 0.06 (Bollen 1989; Hu and
Bentler 1999; Thompson 2005; Tabachnick and Fidell 2007; Schmitt 2011; Brown
2015). Model fit was considered acceptable if χ2/df < 3, SRMR < 0.08, CFI and TLI
values were > 0.90 and RMSEA < 0.08 with the 90% CI upper limit < 0.08 (Hu and
Bentler 1999; Marsh et al. 2004). However, because SRMR seems to not perform
well in CFA models with categorical items (Yu 2002; Brown 2015), it was not used
in the case of ordinal items.
4) Searches for modification indices and further specifications were performed.
Where necessary, correlations between error variances were introduced (Thompson
2005; Brown 2015).
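The adequate/acceptable cut-offs in step 3 can be collected into a small helper (a sketch; the function name and the returned labels are ours, and SRMR is skipped when not supplied, mirroring the ordinal-item case):

```python
def model_fit(chi2, df, cfi, tli, rmsea, rmsea_ci_upper, srmr=None):
    """Classify CFA model fit by the chapter's cut-offs:
    adequate:   chi2/df < 3, SRMR < 0.05, CFI/TLI >= 0.95,
                RMSEA <= 0.06 with 90% CI upper limit <= 0.06;
    acceptable: chi2/df < 3, SRMR < 0.08, CFI/TLI > 0.90,
                RMSEA < 0.08 with 90% CI upper limit < 0.08.
    srmr=None skips the SRMR check (ordinal items)."""
    def srmr_ok(cut):
        return srmr is None or srmr < cut
    if (chi2 / df < 3 and srmr_ok(0.05) and cfi >= 0.95 and tli >= 0.95
            and rmsea <= 0.06 and rmsea_ci_upper <= 0.06):
        return "adequate"
    if (chi2 / df < 3 and srmr_ok(0.08) and cfi > 0.90 and tli > 0.90
            and rmsea < 0.08 and rmsea_ci_upper < 0.08):
        return "acceptable"
    return "poor"
```

Note that, as discussed below for the reported models, a wide RMSEA confidence interval can fail these mechanical checks even when the SRMR, CFI and TLI values support adequate fit.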
In order to facilitate the comparison between the results of the two surveys for
the full samples, all items of the EVS survey datasets were rescaled into a 0–10 scale
by applying the following simple transformation (Charalampi 2018; Charalampi
et al. 2019, 2020):
x_new = ((x − min) / (max − min)) × (new_max − new_min) + new_min,

where min and max denote the minimum and maximum of the original 1–4 scale and
new_min and new_max those of the target 0–10 scale.
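This standard min-max rescaling is a one-liner (a minimal sketch for the 1–4 to 0–10 case; the function name is ours):

```python
def rescale(x, old_min=1, old_max=4, new_min=0, new_max=10):
    """Linearly rescale a response x from the EVS 1-4 scale
    to the ESS 0-10 scale (min-max transformation)."""
    return (x - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

# The endpoints map to the new endpoints; intermediate categories spread evenly:
print([rescale(v) for v in (1, 2, 3, 4)])
```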
7.3. Results
The full sample screening of datasets for both surveys identified no unengaged
responses (standard deviation = 0.000). In the Portuguese sample of both surveys,
four and six outlying cases with a higher education degree were detected,
respectively, and it was decided not to reject them from the analysis. There were
four, eleven and five cases with missing values on all items in the Greek, Portuguese
and Spanish ESS samples, respectively. Moreover, there were seventeen and eight
cases with missing values on all items in the Portuguese and Spanish EVS samples,
respectively. These cases were excluded from the analysis.
In every country of the ESS datasets, respondents had used the full range of
possible responses for all items (Table 7.3). The majority of the responses were
clustered closer to the lower end of their respective scales. Floor effects were present
in all three countries’ samples for the item measuring trust in political parties (PT5),
and consequently, this item had the lowest mean responses. Relatively high mean
responses were found for the item defining trust in the police (PT3), mainly in the
case of the Spanish sample. None of the items were rejected based on the criterion of
corrected item-total correlations < 0.30. Non-normality was not severe for any item
(no item exhibited skewness > 2 or kurtosis > 7). As shown, the proportion of missing values was
negligible, exceeding 5.4% for only one item (PT1) of the Spanish sample.
In parallel, frequency distributions and mode and median values of the items
based on the first Greek, Portuguese and Spanish EVS half-samples were inspected
(Table 7.4). The full range of possible responses was used for all items. The
majority of the responses were clustered around the middle and closer to the lower
end of their respective scales. As shown, the proportion of missing values was
negligible, exceeding 4.5% for only one item (PT1) in each of the Portuguese and
Spanish samples.
EFA for the pseudo-interval and ordinal items was performed on the first
half-samples with maximum likelihood of the covariance matrix of associations and
with robust weighted least squares of the polychoric matrix of associations,
respectively. Table 7.5 shows the factorial structure of the one-factor solutions of
both surveys. All items exhibited strong factor loadings (≥ 0.40).
The one first-order factor model indicated by the EFA results was tested by
performing CFA on the second half-samples. Modification searches were conducted,
and, where necessary, correlations between error variances were introduced. The
CFA results for the Greek ESS and EVS samples were χ2/df = 2.23 (df = 1),
SRMR = 0.005, CFI = 0.999, TLI = 0.996, RMSEA (90% CI) = 0.035 (0.000–0.099)
and χ2/df = 14.87 (df = 2), CFI = 0.983, TLI = 0.950, RMSEA (90% CI) = 0.136
(0.095–0.181), respectively. The CFA results for the Portuguese ESS and EVS
samples were χ2/df = 5.09 (df = 1) SRMR = 0.010, CFI = 0.997, TLI = 0.982, RMSEA
(90% CI) = 0.060 (0.017–0.115) and χ2/df = 5.23 (df = 1), CFI = 0.998, TLI = 0.986,
RMSEA (90% CI) = 0.074 (0.023–0.142), respectively. The CFA results for the
Spanish ESS and EVS samples were χ2/df = 1.79 (df = 1), SRMR = 0.005,
CFI = 0.999, TLI = 0.997, RMSEA (90% CI) = 0.025 (0.000–0.085) and χ2/df = 1.81
(df = 1), CFI = 0.999, TLI = 0.992, RMSEA (90% CI) = 0.033 (0.000–0.109),
respectively.
Frequency percentage of response categories
Country/item Mean SD 95% CI 0 1 2 3 4 5 6 7 8 9 10 NA Skew. Kurt. CC
Greece (n = 1,009)
PT1 3.58 2.496 3.42–3.74 14.1 12.1 11.1 11.4 9.6 18.6 8.8 7.2 4.5 1.1 0.8 0.8 0.20 -0.88 0.706
PT2 4.75 2.577 4.59–4.91 7.8 6.1 7.3 10.3 10.3 16.9 10.4 14.7 10.3 4.2 1.1 0.5 -0.26 -0.85 0.765
PT3 4.87 2.604 4.71–5.04 6.9 5.9 7.7 8.9 10.3 19.4 10.3 12.3 10.7 4.9 2.4 0.2 -0.17 -0.76 0.619
PT5 2.50 2.151 2.36–2.63 23.3 18.7 11.6 12.8 10.9 14.2 3.3 2.9 1.3 0.1 0.2 0.8 0.57 -0.51 0.589
Portugal (n = 1,148)
PT1 3.43 2.403 3.28–3.58 16.5 7.0 11.6 13.4 11.9 17.5 7.1 5.1 2.8 0.7 1.0 5.4 0.26 -0.55 0.681
PT2 3.77 2.478 3.61–3.92 13.1 6.6 9.4 14.5 10.5 17.8 8.6 6.4 5.4 1.8 0.7 5.1 0.15 -0.73 0.689
PT3 5.39 2.330 5.25–5.53 5.0 1.3 5.0 6.7 9.1 21.9 15.2 15.1 12.0 3.3 3.9 1.6 -0.43 -0.05 0.456
PT5 2.42 2.114 2.29–2.54 30.1 9.6 14.0 12.4 13.0 12.3 2.8 2.3 0.4 0.2 0.2 2.8 0.46 -0.62 0.595
Spain (n = 1,243)
PT1 4.93 2.268 4.80–5.06 5.8 2.9 5.1 7.6 10.5 21.7 15.7 10.9 8.5 2.3 0.7 8.2 -0.43 -0.22 0.658
PT2 4.20 2.444 4.06–4.35 8.8 6.8 10.7 11.2 13.2 18.8 9.9 8.0 7.1 2.2 0.9 2.4 0.04 -0.71 0.669
PT3 5.95 2.185 5.82–6.08 2.3 2.0 3.3 5.3 7.2 17.4 15.7 19.4 16.7 7.1 3.0 0.7 -0.63 0.16 0.536
PT5 3.33 2.331 3.19–3.46 17.6 9.5 11.2 13.2 13.0 17.7 6.8 3.8 2.7 1.1 0.3 3.1 0.24 -0.62 0.608
SD = standard deviation; CI = confidence interval; NA = no answer (missing values); Skew. = skewness; Kurt. = kurtosis; CC = corrected
item-total correlation.
Standard errors for skewness and kurtosis of the Greek items were 0.078 and 0.155, respectively; standard errors for skewness and kurtosis of
the Portuguese items were 0.076 and 0.152, respectively; standard errors for skewness and kurtosis of the Spanish items were 0.073 and 0.146,
respectively.
Table 7.3. Item analysis of the political trust in national institutions for Greece, Portugal
and Spain based on the first half-samples: European Social Survey, 2008
Table 7.4. Item analysis of the political trust in national institutions for Greece,
Portugal and Spain based on the first half-samples: European Values Study, 2008
Table 7.5. Exploratory factor analysis of the political trust in national institutions
items performed on the first half-samples of Greece, Portugal and Spain:
European Social Survey (ESS) and European Values Study (EVS), 2008
In all these cases, the model df were one, with the exception of the Greek EVS
sample, where the model df were two. However, although the half-sample sizes were
large enough, ranging from 750 to 1,243, for these single (and double) degree of
freedom models, the RMSEA 90% CI limits ranged from 0.0 to 0.181, suggesting
that they were “likely somewhere between perfect and extremely horrible! Clearly,
any RMSEA value with a CI this wide is of no value” (Kenny et al. 2015, p. 501).
Moreover, as all models were composed of four items, we considered Kenny and McCoach's (2003) finding that the RMSEA tends to improve as items are added to the model, whereas the CFI and TLI tend to worsen as the number of items increases. In this respect, relying on the SRMR, CFI and TLI values – with all the reservations expressed by Kenny et al. (2015) – the findings suggested adequate model fit for all models under consideration.
The AVE was computed for each scale based on the CFA repeated for the full
samples of Greece (Figure 7.1, ESS: χ2/df = 4.19 (df = 1), SRMR = 0.005,
CFI = 0.999, TLI = 0.995 and RMSEA = 0.040 with the 90% CI = 0.007–0.082 and
EVS: χ2/df = 5.30 (df = 1), CFI = 0.999, TLI = 0.992 and RMSEA = 0.054 with
the 90% CI = 0.017–0.102), Portugal (Figure 7.2, ESS: χ2/df = 9.17 (df = 1),
SRMR = 0.009, CFI = 0.997, TLI = 0.982 and RMSEA = 0.060 with the 90%
CI = 0.030–0.098 and EVS: χ2/df = 10.08 (df = 1), CFI = 0.998, TLI = 0.986 and
RMSEA = 0.077 with the 90% CI = 0.039–0.123) and Spain (Figure 7.3, ESS:
χ2/df = 2.45 (df = 1), SRMR = 0.004, CFI = 0.999, TLI = 0.997 and RMSEA = 0.024
with the 90% CI = 0.000–0.064 and EVS: χ2/df = 8.65 (df = 1), CFI = 0.994,
TLI = 0.967 and RMSEA = 0.072 with the 90% CI = 0.034–0.119). Therefore,
based on the argument presented for the CFA results of half-samples, all models
provided adequate model fit.
Figure 7.1. Standardized solution for the political trust (pt) one first-order factor
models based on CFA performed on the Greek ESS (N = 2,015) and EVS
(N = 1,500) full samples. Observed variables are represented by squares and the
latent variable by a circle
Figure 7.2. Standardized solution for the political trust (pt) one first-order factor
models based on CFA performed on the Portuguese ESS (N = 2,285) and EVS
(N = 1,536) full samples. Observed variables are represented by squares and the
latent variable by a circle
Figure 7.3. Standardized solution for the political trust (pt) one first-order factor
models based on CFA performed on the Spanish ESS (N = 2,481) and EVS
(N = 1,492) full samples. Observed variables are represented by squares and the
latent variable by a circle
Table 7.6. Descriptive statistics, convergent validity, composite reliability and internal
consistencies of the political trust in national institutions items based on the full
samples of Greece, Portugal and Spain: European Social Survey (ESS) and
European Values Study (EVS), 2008
As shown, higher mean scale values were obtained from the EVS samples of
Greece and Portugal, and the reverse holds true for the Spanish samples.
7.4. Conclusion
The investigation of the structure (dimensionality) of the 2008 ESS and EVS measurements of the political trust in national institutions scale, carried out by applying the traditional approaches of EFA and CFA to randomly split half-samples, resulted in a unidimensional structure for all countries, following Brown's (2015) recommendation to eliminate from the analysis factors defined by two or three items.
7.5. Funding
This research, conducted under the auspices of the National Centre for Social
Research, was co-financed by Greece and the European Union (European Social
Fund – ESF) through the Operational Programme “Human Resources Development,
Education and Lifelong Learning 2014-2020” in the context of the project “Greece
and Southern Europe: Investigating political trust to institutions, social trust and
human values, 2002-2017” (MIS 5049524).
7.6. References
Bartholomew, D.J., Steele, F., Moustaki, I., Galbraith, J. (2008). Analysis of Multivariate
Social Science Data. Chapman & Hall/CRC, London.
Bollen, K.A. (1989). Structural Equations with Latent Variables. John Wiley & Sons,
New York.
Brown, T.A. (2015). Confirmatory Factor Analysis for Applied Research, 2nd edition. The
Guilford Press, New York.
Charalampi, A. (2018). The importance of items’ level of measurement in investigating the
structure and assessing the psychometric properties of multidimensional constructs.
Doctoral Dissertation, Panteion University of Social and Political Sciences, Athens.
Charalampi, A., Michalopoulou, C., Richardson, C. (2019). Determining the structure and
assessing the psychometric properties of multidimensional scales constructed from ordinal
and pseudo-interval items. Communications in Statistics – Case Studies, Data Analysis
and Applications, 5(1), 26–38.
Charalampi, A., Michalopoulou, C., Richardson, C. (2020). Validation of the 2012 European
Social Survey measurement of wellbeing in seventeen European countries. Applied
Research in Quality of Life, 15(1), 73–105.
Clark, L.A. and Watson, D. (1995). Constructing validity: Basic issues in objective scale
development. Psychological Assessment, 7(3), 309–319.
Daskalopoulou, I. (2018). Individual-level evidence on the causal relationship between social
trust and institutional trust. Social Indicators Research, 144, 275–298.
Ervasti, H., Kouvo, A., Venetoklis, T. (2018). Social and institutional trust in times of crisis:
Greece, 2002–2011. Social Indicators Research, 141, 1207–1231.
Fabrigar, L.R., Wegener, D.T., MacCallum, R.C., Strahan, E.J. (1999). Evaluating the use of
exploratory factor analysis in psychological research. Psychological Methods, 4(3),
272–299.
Fornell, C. and Larcker, D.F. (1981). Evaluating structural equation models with
unobservable variables and measurement error. Journal of Marketing Research, 18(1),
39–50.
Gaskin, J. (2016). Data screening. Gaskination’s StatWiki [Online]. Available at:
http://statwiki.gaskination.com/index.php?title=Main_Page [Accessed 30 June 2016].
Hooghe, M. and Kern, A. (2015). Party membership and closeness and the development of
trust in political institutions: An analysis of the European Social Survey, 2002–2010.
Party Politics, 21(6), 944–956.
Hu, L. and Bentler, P.M. (1999). Cutoff criteria for fit indexes in covariance structure
analysis: Conventional criteria versus new alternatives. Structural Equation Modeling,
6(1), 1–55.
Kenny, D.A. and McCoach, D.B. (2003). Effect of the number of variables on measures of fit
in structural equation modeling. Structural Equation Modeling, 10(3), 333–351.
Kenny, D.A., Kaniskan, B., McCoach, D.B. (2015). The performance of RMSEA in models
with small degrees of freedom. Sociological Methods & Research, 44(3), 486–507.
Marsh, H.W., Hau, K.T., Wen, Z. (2004). In search of golden rules: Comment on hypotheses-
testing approaches to setting cutoff values for fit indexes and dangers in overgeneralizing
Hu and Bentler’s (1999) findings. Structural Equation Modeling, 11(3), 320–341.
Michalopoulou, C. (2017). Likert scales require validation before application – Another
cautionary tale. BMS Bulletin de Méthodologie Sociologique, 134, 5–23.
Muthén, L.K. and Muthén, B.O. (1998–2017). Mplus User’s Guide, 8th edition. Muthén &
Muthén, Los Angeles, CA.
Nunnally, J.C. and Bernstein, I.H. (1994). Psychometric Theory. McGraw-Hill, New York.
Raykov, T. (2007). Reliability if deleted, not “alpha if deleted”: Evaluation of scale reliability
following component deletion. British Journal of Mathematical and Statistical
Psychology, 60(2), 201–216.
Chapter written by Agnese Maria DI BRISCO, Roberto ASCARI, Sonia MIGLIORATI and Andrea ONGARO.
For a color version of all the figures in this chapter, see www.iste.co.uk/zafeiris/data1.zip.
Data Analysis and Related Applications 1, First Edition. Edited by Konstantinos N. Zafeiris, Christos H. Skiadas, Yiannis Dimotikalis, Alex Karagrigoriou and Christiana Karagrigoriou-Vonta.
© ISTE Ltd 2022. Published by ISTE Ltd and John Wiley & Sons, Inc.
8.1. Introduction
Although the beta distribution can show very different shapes, it fails to model
a wide range of phenomena, including heavy tails and bimodal responses. To
achieve greater flexibility, two further distributions have been proposed on the
restricted interval that take advantage of a mixture structure. The first one is called
variance-inflated beta (VIB) (Di Brisco et al. 2020); it is a mixture of two betas sharing a common mean parameter, where the first component has its precision decreased by a factor k and thus displays a larger variance. The second distribution that has been
proposed to enhance the flexibility of the regression models for constrained data is
the flexible beta (FB) (Migliorati et al. 2018). The rationale behind this distribution
is to consider a mixture of two betas sharing a common precision parameter and
with different component means. It is noteworthy that, not only are the component
means indeed different, but they are also arranged so that the first component mean
is greater than the second one, thus avoiding any computational burden related to
label-switching (Frühwirth-Schnatter 2006).
Other strategies for dealing with bounded responses have been proposed in the literature, such as regression models based on a new class of Johnson SB distributions (Lemonte and Bazán 2016), mixed regression models based on the simplex distribution
The State of the Art in Flexible Regression Models 101
(Qiu et al. 2008), quantile regression models (Bayes et al. 2017) and fully
non-parametric regression models (Barrientos et al. 2017). The analysis of these
proposals goes beyond the aim of this chapter.
The rest of this chapter is structured as follows. Section 8.2 describes the general
framework of a regression model for bounded responses, whereas section 8.2.1
extends the model with the augmentation strategy. Section 8.2.2 illustrates the beta
distribution and its flexible extensions and shows how to get the corresponding
parametric regression models, either augmented or not. Section 8.3 is dedicated to two
case studies. Section 8.3.1 illustrates the analysis of the “Stress” dataset by making use
of the regression models without augmentation; it also provides a quick overview of
the FlexReg package. Section 8.3.2 focuses on the analysis of the “Reading” dataset
by illustrating mainly the augmented regression models.
1 https://CRAN.R-project.org/package=FlexReg.
Moreover, we can also link the precision parameter to some covariates (either the
same as in the regression model for the mean or different). To do that, equation [8.1]
is complemented with the following:
g2 (φi ) = x2i β2 [8.2]
where g2 (·) is an adequate link function, x2i is a vector of covariates observed on
subject i (i = 1, . . . , n), and β2 is a vector of regression coefficients for the precision.
Common choices for g2 (·) are the logarithm and the square root.
8.2.1. Augmentation

When the response variable also takes values at the boundaries of the interval, the density on (0, 1) can be augmented with point masses at 0 and 1:

fAug(y; q0, q1, q2, η) = q0 if y = 0, q1 if y = 1, and q2 f(y; η) if 0 < y < 1 [8.3]

where the vector (q0, q1, q2) belongs to the simplex, being 0 < q0, q1, q2 < 1 and q0 + q1 + q2 = 1. The density f(y; η), with η being a vector of parameters which will include at least a mean and a precision parameter, is defined on the open interval and it can be either a beta or one of its flexible alternatives (see section 8.2.2). The marginal mean and variance of an rv with an augmented distribution are equal to:

E(Y) = q1 + q2 μ,  Var(Y) = q1 + q2 (V + μ²) − (q1 + q2 μ)² [8.4]

where μ and V denote the mean and variance of the continuous component on (0, 1).
Please note that, by simply setting one or both probabilities q1 and q0 equal to zero,
it is possible to model scenarios where only 0s or 1s or neither are observed, the latter
case restoring a non-augmented regression model.
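The marginal moments above follow from standard mixture algebra and can be checked numerically by sampling from the augmented mixture; a short sketch (parameter values are illustrative, not from the chapter):

```python
import numpy as np

rng = np.random.default_rng(42)
q0, q1, q2 = 0.10, 0.15, 0.75   # point masses at 0 and 1, weight of the continuous part
mu, phi = 0.4, 15.0             # mean-precision beta for the continuous component

n = 200_000
u = rng.uniform(size=n)
y = np.where(u < q0, 0.0,
             np.where(u < q0 + q1, 1.0,
                      rng.beta(mu * phi, (1 - mu) * phi, size=n)))

# Mixture algebra: E(Y) = q1 + q2*mu and Var(Y) = q1 + q2*(V + mu^2) - E(Y)^2,
# with V the variance of the continuous component.
V = mu * (1 - mu) / (phi + 1)
mean_th = q1 + q2 * mu
var_th = q1 + q2 * (V + mu ** 2) - mean_th ** 2
print(abs(y.mean() - mean_th) < 0.01, abs(y.var() - var_th) < 0.01)  # True True
```

Setting q0 = q1 = 0 (q2 = 1) recovers the moments of the non-augmented model.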
Having in mind the regression framework that has just been outlined, a
fully parametric approach requires the definition of a proper distribution on the
bounded support for the response variable. As a general rule, it is convenient, for
regression purposes, to express the distributions on the bounded support in terms of
mean-precision parameters.
A well-known distribution on the open interval is the beta one. The standard Breg
model is derived if Yi , i = 1, . . . , n, are independent and follow a beta distribution.
Its augmented version, referred to as the augmented beta regression (ABreg) model,
is obtained when the density function f (y; η) in equation [8.3], for 0 < y < 1,
is of a beta rv. The probability density function of the beta with a mean-precision
parameterization, Y ∼ Beta(μφ, (1 − μ)φ), is as follows:
fB∗(y; μ, φ) = [Γ(φ) / (Γ(μφ) Γ((1 − μ)φ))] y^(μφ−1) (1 − y)^((1−μ)φ−1)

for 0 < y < 1, where the parameter 0 < μ < 1 identifies the mean and φ > 0 is interpreted as a precision parameter, since:

Var(Y) = μ(1 − μ) / (φ + 1).
By varying the parameters that index the distribution, we can observe a variety of shapes. Despite its inherent flexibility, however, the beta is not designed to model heavy tails (often due to outlying observations) or bimodality (possibly due to latent structures in the data).
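In terms of the usual shape parameters, the mean-precision parameterization is Beta(a = μφ, b = (1 − μ)φ); a quick SciPy check of the moment formulas, with illustrative values:

```python
from scipy.stats import beta

mu, phi = 0.3, 20.0
rv = beta(a=mu * phi, b=(1 - mu) * phi)  # mean-precision parameterization

print(rv.mean())                               # equals mu = 0.3
print(rv.var(), mu * (1 - mu) / (phi + 1))     # both equal 0.01
```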
The flexible extensions of the beta originate to precisely manage these types of
data patterns that, in our experience, often occur in practical situations. The flexibility
of which we speak is achieved by making use of mixture distributions.
The first flexible extension is the VIB distribution, Y ∼ VIB(μ, φ, p, k), whose probability density function is the two-component mixture

fVIB(y; μ, φ, p, k) = p fB∗(y; μ, kφ) + (1 − p) fB∗(y; μ, φ)

for 0 < y < 1, where 0 < μ < 1 identifies the overall mean of Y (as well as the mixture component means), 0 < k < 1 is a measure of the extent of the variance inflation, 0 < p < 1 is the mixing proportion parameter, and φ > 0 plays the role of a precision parameter, since as it increases Var(Y) decreases. The idea behind this distribution is to draw up a mixture of two betas where one component is entirely dedicated to outlying observations.
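Under the mixture form just described – two betas with common mean μ and precisions kφ and φ – the VIB density integrates to one and preserves the mean; a numerical sketch with illustrative parameter values:

```python
import numpy as np
from scipy.stats import beta
from scipy.integrate import quad

def vib_pdf(y, mu, phi, p, k):
    """Variance-inflated beta: mixture of two betas with common mean mu,
    the first component's precision deflated by the factor k."""
    comp1 = beta.pdf(y, mu * k * phi, (1 - mu) * k * phi)
    comp2 = beta.pdf(y, mu * phi, (1 - mu) * phi)
    return p * comp1 + (1 - p) * comp2

mu, phi, p, k = 0.5, 20.0, 0.3, 0.1
total, _ = quad(lambda y: vib_pdf(y, mu, phi, p, k), 0, 1)
mean, _ = quad(lambda y: y * vib_pdf(y, mu, phi, p, k), 0, 1)
print(round(total, 6), round(mean, 6))  # 1.0 0.5
```

The inflated component (precision kφ = 2) spreads mass toward the tails while the mean stays at μ.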
The second flexible extension is the FB distribution, Y ∼ FB(μ, φ, w̃, p), whose probability density function is the two-component mixture

fFB(y; μ, φ, w̃, p) = p fB∗(y; λ1, φ) + (1 − p) fB∗(y; λ2, φ), with λ1 = μ + (1 − p)w̃ and λ2 = μ − pw̃,

for 0 < y < 1, where 0 < μ < 1 identifies the mean of Y and 0 < w̃ < min{μ/p, (1 − μ)/(1 − p)} is a measure of the distance between the two mixture component means.
Figure 8.1. Top panels: the dotted line refers to the density of a Beta(0.5, 20). On
the left-hand side, the dashed lines refer to densities from a V IB rv with μ = 0.5,
φ = 20, k = 0.1 and p = {0.1 (red), 0.3 (green), 0.5 (blue)}. On the right-hand side,
the dashed lines refer to densities from an F B rv with μ = 0.5, φ = 20, p = 0.5
and w = {0.3 (red), 0.5 (green), 0.8 (blue)}. Bottom panels: the dotted line refers to the
density of a Beta(0.8, 10). On the left-hand side, the dashed lines refer to densities from
a V IB rv with μ = 0.8, φ = 10, k = 0.01 and p = {0.3 (red), 0.5 (green), 0.8 (blue)}.
On the right-hand side, the dashed lines refer to densities from an F B rv with μ = 0.8,
φ = 10, p = 0.9 and w = {0.5 (red), 0.8 (green), 0.9 (blue)}
The best way to appreciate the additional flexibility provided by the proposed mixture distributions is to visualize some densities. In the top panels of Figure 8.1, the dotted line represents a symmetric beta density with mean 0.5. The VIB distribution yields densities, shown as colored dashed lines on the left-hand side, that are still centered at 0.5 but have heavier tails than the beta. Conversely, the FB distribution can provide densities, shown as colored dashed lines on the right-hand side, that are bimodal: the overall mean is still 0.5 but the component means differ. Another scenario, represented in the bottom panels of Figure 8.1, considers a negatively skewed beta. By properly setting the parameters of the VIB, it is possible to obtain densities, shown as colored dashed lines on the left-hand side, that put increasing mass on the left tail of the distribution. The FB, in turn, can handle a heavier left tail while still preserving the center of the distribution, as emerges from the right-hand side.
Inference in regression models for bounded responses can be carried out with either a likelihood-based or a Bayesian approach. Likelihood-based inference, however, requires numerical integration and optimization, which often lead to analytical challenges and computational issues.
The additional parameters of the FB-type and VIB-type regression models – the mixing proportion p, the normalized distance w̃ and the extent of the variance inflation k – all have a uniform prior on (0, 1).
Model fit is evaluated through the widely applicable information criterion (WAIC) (Vehtari et al. 2017), whose rationale is the same as that of standard comparison criteria, namely penalizing an estimate of the goodness of fit of a model by an estimate of its complexity. The advantage of WAIC over other well-established criteria in this framework is that it is fully Bayesian and well defined for mixture models. As a rule of thumb, models with smaller values of the comparison criterion fit better.
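In practice, WAIC is computed from the matrix of pointwise log-likelihood values over the posterior draws; a minimal sketch of the (Vehtari et al. 2017) formula, where the penalty p_waic is the sum of pointwise posterior variances:

```python
import numpy as np

def waic(log_lik):
    """WAIC from an (S x n) pointwise log-likelihood matrix
    (S posterior draws, n observations): -2 * (lppd - p_waic)."""
    lppd = np.sum(np.log(np.mean(np.exp(log_lik), axis=0)))
    p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))
    return -2.0 * (lppd - p_waic)

# With constant log-likelihoods the complexity penalty vanishes: WAIC = -2 * lppd
const = np.full((100, 3), -1.0)
print(waic(const))  # 6.0
```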
The best way to illustrate all the methodological aspects described so far is to resort
to some practical applications. The computational implementation of the regression
models without augmentation, that is, estimation issues and assessment of the results,
is made easier with the R package FlexReg. An upgrade of the package containing the
augmented versions of all models at hand is forthcoming.
The “Stress” dataset, available from the FlexReg package, concerns a sample
of non-clinical women in Townsville, Queensland, Australia. Respondents were
asked to fill out a questionnaire from which the stress and anxiety rates were
computed (Smithson and Verkuilen 2006). We fit Breg, FBreg and VIBreg regression
models by regressing the mean of anxiety onto the stress level. Each model is run for 20,000 iterations, with the first half discarded as burn-in. The implementation is done with the flexreg() function of the FlexReg package:
> data("Stress")
Please note that, as much as possible, the function preserves the structure of lm()
and glm() functions so as to facilitate its use among R users. In particular, formula
is the main argument of the function where the user has to specify the name of the
dependent variable and, separated by a tilde, the names of the covariates for the
regression model for the mean. If appropriate, we can also specify the names of the
covariates for the regression model for the precision (separated from the rest by a
vertical bar). The argument type allows us to select the type of model out of Breg,
FBreg and VIBreg.
Once the parameters have been estimated through the HMC algorithm, and before continuing with further analyses, it is good practice to check convergence to the posterior distributions. To this end, the FlexReg package provides the convergence.plot() and convergence.diag() functions, both requiring as their main argument an object of class flexreg, obtained as the result of the flexreg() function. The former produces a .pdf file containing convergence plots (i.e. density plots, trace plots, intervals, rates, Rhat and autocorrelation plots) for the Monte Carlo draws. The latter returns diagnostics for convergence to the equilibrium distribution of the Markov chains: it prints the number (and percentage) of iterations that ended with a divergence or that saturated the maximum treedepth, and the E-BFMI value for each chain whose E-BFMI is less than 0.2 (Gelman et al. 2014).
Figure 8.2. Left-hand side: fitted regression curves for the models. Breg in dotted line,
VIBreg in solid line and FBreg in dashed lines. Colored dashed curves refer to the
component means of the FBreg model. Right-hand side: scatterplot of stress level
versus anxiety level. Red dots refer to subjects belonging to group 1
Aside from selecting the best model, it is of interest to evaluate any inconsistency between observed and predicted values. From a Bayesian perspective, it is convenient to compute the posterior predictive distribution, namely the distribution of unobserved values conditional on the observed data. This operation is straightforward for our flexible regression models thanks to the posterior_predict() function, which returns an object of class flexreg_postpred containing a matrix with the simulated posterior predictions. The plot method applied to posterior predictives returns the posterior predictive interval for each statistical unit, plus the observed value of the response variable in red dots. By way of example, Figure 8.3 shows the 95% posterior predictive intervals for the VIBreg model. It is worth noting that the model provides accurate predictive intervals, since all observed values fall within the intervals. A similar behavior also holds for the Breg and FBreg models.
Figure 8.3. 95% posterior predictive intervals for each statistical unit for the
VIBreg model. The observed anxiety levels are represented with orange dots
The last inspection regarding the behavior of the regression models involves the
Bayesian residuals, either raw or standardized:
ri^raw = yi − μ̂i,  ri^std = ri^raw / √Var(yi),  i = 1, . . . , n [8.5]
where μ̂i and Var(yi ) are the predicted mean and variance of the response, both
assessed using the posterior means of the parameters.
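Equation [8.5] translates directly into code; a small sketch with hypothetical fitted values:

```python
import numpy as np

def bayes_residuals(y, mu_hat, var_hat, standardized=False):
    """Raw residuals y_i - mu_hat_i; the standardized version divides by
    the square root of the predicted variance."""
    raw = np.asarray(y) - np.asarray(mu_hat)
    return raw / np.sqrt(var_hat) if standardized else raw

y = np.array([0.42, 0.55, 0.31])          # hypothetical observed responses
mu_hat = np.array([0.40, 0.50, 0.35])     # hypothetical posterior predicted means
var_hat = np.array([0.010, 0.012, 0.008])
print(bayes_residuals(y, mu_hat, var_hat))                     # raw residuals
print(bayes_residuals(y, mu_hat, var_hat, standardized=True))  # standardized
```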
The computation of residuals for flexible models can be done through the function
called residuals(). The argument object features an object of class flexreg,
which contains all the results related to the estimated model of type Breg, VIBreg
or FBreg. By specifying the argument type, it is possible to compute either raw or
standardized residuals. Furthermore, if the model is of FB type, the function also allows us to compute cluster residuals, obtained as the difference between the observed responses and the cluster means. This is achieved by simply setting cluster = T:
It is worth noting that the cluster residuals computed for the FBreg model allow us
to provide a classification of data into two clusters, as shown on the right-hand side of
Figure 8.2. This result is consistent with that seen with the regression curves.
The second dataset we explore, likewise from the FlexReg package, is called “Reading” and it collects data on a group of 44 children, 19 of whom have received a diagnosis of dyslexia. The available information concerns the proportion of accuracy in reading tasks and the non-verbal intelligence quotient (IQ), besides the dyslexia status (DYS: dyslexic (1) or not (0)).
This case study has been extensively analyzed in the literature on regression models for bounded responses and is of special interest because of the presence of values at the upper boundary of the support, corresponding to children (13 out of 44) who achieved a perfect score in reading tests. One possibility is to handle this dataset by simply transforming the response variable from (0, 1] to the open interval (0, 1). An alternative option is to analyze the dataset through an augmentation strategy.
Figure 8.4. Left-hand side: fitted regression curves for the models with augmentation
(violet lines) and without augmentation (black lines) refer to the Breg (dotted lines),
VIBreg (solid lines) and FBreg (dashed lines) models. Right-hand side: fitted
regression curve for the overall mean (solid line) and for the component means (dashed
lines) of the AFBreg model
The augmented Breg, VIBreg and FBreg models, with regression equations as in equation [8.6], have been estimated through the HMC algorithm.
The three competing models show similar fit to the data in terms of WAIC. Moreover, looking at the posterior means and credible intervals (CIs) in Table 8.1, it emerges that the dyslexic status of children plays a significant role in explaining the probability of achieving a perfect score, as well as the mean and the precision of the reading accuracy response variable, in all competing models.
Table 8.1. Reading data: posterior means and CIs for the parameters of the
AFBreg, AVIBreg and ABreg regression models together with WAIC values
8.4. References
Aitchison, J. (1986). The Statistical Analysis of Compositional Data. Chapman and Hall,
London.
Albert, J. (2009). Bayesian Computation With R, 2nd edition. Springer Science, New York.
Barrientos, A.F., Jara, A., Quintana, F.A. (2017). Fully nonparametric regression for bounded
data using dependent Bernstein polynomials. Journal of the American Statistical Association,
112(518), 806–825.
Bayes, C., Bazan, J.L., de Castro, M. (2017). A quantile parametric mixed regression model for
bounded response variables. Statistics and Its Interface, 10(3), 483–493.
Di Brisco, A.M., Migliorati, S., Ongaro, A. (2020). Robustness against outliers: A new variance
inflated regression model for proportions. Statistical Modelling, 20(3), 274–309.
Duane, S., Kennedy, A., Pendleton, B.J., Roweth, D. (1987). Hybrid Monte Carlo. Physics
Letters B, 195(2), 216–222.
Ferrari, S. and Cribari-Neto, F. (2004). Beta regression for modelling rates and proportions.
Journal of Applied Statistics, 31(7), 799–815.
Frühwirth-Schnatter, S. (2006). Finite Mixture and Markov Switching Models. Springer, Berlin,
Heidelberg.
Gelman, A., Carlin, J.B., Stern, H.S., Rubin, D.B. (2014). Bayesian Data Analysis, Volume 2.
Taylor & Francis, Abingdon.
Lemonte, A.J. and Bazán, J.L. (2016). New class of Johnson SB distributions and its associated
regression model for rates and proportions. Biometrical Journal, 58(4), 727–746.
McCullagh, P. and Nelder, J.A. (1989). Generalized Linear Models, Volume 37. CRC Press,
Boca Raton, FL.
Migliorati, S., Di Brisco, A.M., Ongaro, A. (2018). A new regression model for bounded
responses. Bayesian Analysis, 13(3), 845–872.
Neal, R.M. (1994). An improved acceptance procedure for the hybrid Monte Carlo algorithm.
Journal of Computational Physics, 111(1), 194–203.
Qiu, Z., Song, P.X.-K., Tan, M. (2008). Simplex mixed-effects models for longitudinal
proportional data. Scandinavian Journal of Statistics, 35(4), 577–596.
Smithson, M. and Verkuilen, J. (2006). A better lemon squeezer? Maximum-likelihood
regression with beta-distributed dependent variables. Psychological Methods, 11(1), 54–71.
Stan Development Team (2016). Stan modeling language users guide and reference manual
[Online]. Available at: https://mc-stan.org/users/documentation/.
Vehtari, A., Gelman, A., Gabry, J. (2017). Practical Bayesian model evaluation using
leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5), 1413–1432.
9
Simulation Studies for a Special Mixture Regression Model
Compositional data are defined as vectors whose elements are strictly positive
and subject to a unit-sum constraint. When the multivariate response is of
compositional type, a proper regression model that takes account of the unit-sum
constraint is required. This contribution illustrates a new multivariate regression
model for compositional data that is based on a mixture of Dirichlet-distributed
components. Its complex structure is offset by good theoretical properties (among
which identifiability) and a greater flexibility than the standard Dirichlet regression
model. We perform intensive simulation studies to evaluate the fit of the proposed
regression model and its robustness in the presence of multivariate outliers. The
(Bayesian) estimation procedure is performed via the efficient Hamiltonian Monte
Carlo algorithm.
9.1. Introduction
Chapter written by Agnese Maria DI BRISCO, Roberto ASCARI, Sonia MIGLIORATI and Andrea ONGARO.
For a color version of all the figures in this chapter, see www.iste.co.uk/zafeiris/data1.zip.
The rest of this chapter is organized as follows. Section 9.2 introduces the Dirichlet
and the EFD distributions, and it shows convenient parameterizations for regression
purposes. Section 9.3 outlines details on the EFD regression model. Section 9.3.1
provides an overview on the HMC algorithm, a Bayesian approach to inference
especially suited for mixture models. Section 9.4 illustrates several simulation studies
that have been performed to evaluate the behavior and the fit to data of the EFD
regression model in comparison to the Dirichlet one.
fD(y; α) = [Γ(α⁺) / ∏_{j=1}^{D} Γ(αj)] ∏_{j=1}^{D} yj^(αj−1), [9.1]

where α⁺ = ∑_{j=1}^{D} αj.
The EFD distribution is a finite mixture of Dirichlet components,

fEFD(y) = ∑_{r=1}^{D} pr Dir(y; α + τr er),

where Dir(·; ·) denotes the Dirichlet distribution, and er is a vector of zeros except for the r-th element, which is equal to one. It is worth noting that the EFD distribution contains the Dirichlet as an inner point when τr = 1 and pr = ᾱr for every r = 1, . . . , D.
The p.d.f. of the EFD admits a variety of shapes including, but not limited to, uni- and multimodal ones. Moreover, the richer parameterization of the EFD with respect to the Dirichlet allows for more flexible modeling of the dependence structure of the composition. Finally, the EFD distribution enjoys several theoretical properties, i.e. some simplicial forms of dependence/independence and identifiability (Ongaro), that make it tractable from computational and inferential points of view.
Since both parameterizations of the Dirichlet and of the EFD illustrated in section
9.2 explicitly include the mean vector μ, it is possible to derive a regression model
for compositional data. Let Y = (Y1 , . . . , Yn ) be the response matrix such that Yi ,
for i = 1, . . . , n, is a D-dimensional vector on the simplex, and let X = (x1 , . . . , xn )
be the design matrix such that xi are (K + 1)-dimensional vectors. The mean vector
ν i of Yi can be regressed onto a set of covariates in accordance with a GLM strategy
(McCullagh and Nelder 1989). Indeed, since ν i lies on the simplex, a multinomial
logit link function can be adopted as follows:
g(νij) = log(νij / νiD) = xi⊤ βj, [9.6]
where νij = E[Yij], xi = (1, xi1, . . . , xiK) is the vector of covariates, and βj = (βj0, βj1, . . . , βjK) is a vector of regression coefficients. Please note that the D-th category is conventionally fixed as baseline, so that βDk = 0 for k = 0, 1, . . . , K, and thus:
νij = g⁻¹(xi⊤ βj) = exp(xi⊤ βj) / (1 + ∑_{r=1}^{D−1} exp(xi⊤ βr)), for j = 1, . . . , D − 1, and νiD = 1 / (1 + ∑_{r=1}^{D−1} exp(xi⊤ βr)), for j = D. [9.7]
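Equation [9.7] is a softmax transform with the D-th category as baseline; a brief sketch checking that the resulting mean vector lies on the simplex:

```python
import numpy as np

def inv_multinomial_logit(eta):
    """eta: linear predictors x_i' beta_j for j = 1, ..., D-1 (category D is
    the baseline with eta_D = 0). Returns the D-dimensional mean vector."""
    denom = 1.0 + np.sum(np.exp(eta))
    return np.append(np.exp(eta) / denom, 1.0 / denom)

nu = inv_multinomial_logit(np.array([0.2, -1.0]))  # a D = 3 composition
print(np.all(nu > 0), round(nu.sum(), 12))  # True 1.0
```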
Presence of outliers: To evaluate the behavior of the DirReg and EFDReg models
in the presence of outliers, we perturbed scenario (i) according to the following
perturbation scheme. We randomly selected 15 observations (10% of the sample size)
and we applied the perturbation operation defined as y ⊕ δ = C {y1 · δ1 , . . . , yD · δD } ∈
S D , where y and δ are the vectors on the simplex playing the roles of perturbed
and perturbing element, respectively. Moreover, the closure operation C {·} is defined
as C {q} = {q1 /q+ , . . . , qD /q+ } with q+ = ∑Dj=1 q j and q j > 0, ∀ j = 1, . . . , D. The
neutral element of the perturbation operation is δ = (1/D, . . . , 1/D) , so that if
element y j is perturbed by δ j greater (lower) than 1/D, the perturbation is upward
(downward). We set three scenarios of perturbation by fixing the perturbing element
δ equal to (0.86, 0.07, 0.07) in scenario (I), (0.07, 0.86, 0.07) in scenario (II) and
(0.07, 0.07, 0.86) in scenario (III).
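The closure and perturbation operations above can be sketched in a few lines (a minimal NumPy illustration; the helper names are ours):

```python
import numpy as np

def closure(q):
    """C{q} = {q1/q+, ..., qD/q+}: rescale a positive vector onto the simplex."""
    q = np.asarray(q, dtype=float)
    return q / q.sum()

def perturb(y, delta):
    """Perturbation y (+) delta = C{y1*d1, ..., yD*dD} on the simplex."""
    return closure(np.asarray(y, dtype=float) * np.asarray(delta, dtype=float))

y = np.array([0.2, 0.3, 0.5])
# The neutral element (1/D, ..., 1/D) leaves y unchanged.
print(np.allclose(perturb(y, np.full(3, 1 / 3)), y))     # True
# Scenario (III): delta = (0.07, 0.07, 0.86) perturbs the third part upward.
y_pert = perturb(y, [0.07, 0.07, 0.86])
print(np.isclose(y_pert.sum(), 1.0), y_pert[2] > y[2])   # True True
```

The result stays on the simplex because the closure renormalizes the element-wise product.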
[Figure: ternary plot of the composition (y1, y2, y3) and scatterplots of each element against the covariate x.]
Simulation Studies for a Special Mixture Regression Model 121

Figures 9.3, 9.4 and 9.5 show the effect of perturbation on the Dirichlet-distributed responses. In all plots, the perturbed points are in light blue, while the unperturbed points are in black. Looking at the scatterplots, we can observe that scenario (I) produces outlying observations upward for the first element and downward for the second and third elements of the composition; this is coherent with the chosen vector δ, whose first element is greater than 0.5 and whose second and third elements are lower than 0.5. In scenarios (II) and (III), the second and third elements of the composition, respectively, are perturbed upward, while the remaining elements are perturbed downward. Focusing on the ternary plots, the effect of perturbation in scenario (III) is clearly visible: the group of perturbed values, in blue, is well separated from the remaining points, in black. The overall effect of the perturbing vector δ = (0.07, 0.07, 0.86) is thus to shift the cloud of points towards the bottom-right vertex of the plot. Conversely, in scenarios (I) and (II), the perturbed points are shifted overall towards the bottom-left and top vertex of the ternary plot, respectively, i.e. into a region with a higher presence of unperturbed points.
[Figure: ternary plot of the composition (y1, y2, y3) and scatterplots of each element against the covariate x.]
Presence of latent groups: The following simulation study explores the case of a latent (unobserved) covariate that induces the occurrence of clusters. Data are therefore simulated by including an additional covariate in the regression model that is assumed unknown and hence not accounted for in the estimates of the DirReg and EFDReg models. In particular, we replicated the generating mechanism
of fitting study (i) by adding a latent dichotomous covariate (scenario (a)) and a latent
covariate with three categories (scenario (b)). In scenario (a), the additional regression
coefficients are β12 = −1 and β22 = 2, and in scenario (b), they also include β13 = 0.5
and β23 = −3. With respect to the dichotomous covariate of scenario (a), the categories
have probabilities of 0.3 and 0.7. In scenario (b), the three categories of the latent
covariate have probabilities of 0.3, 0.15, and 0.55. Figures 9.6 and 9.7 show one
random replication from scenarios (a) and (b) with latent groups, respectively. In the
ternary plots, points are colored and shaped according to their belonging to the latent
groups. The existence of two and three clusters, respectively, is particularly visible in the scatterplots referring to the first and second elements of the composition.
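The generating mechanism of scenario (a) can be sketched as follows. This is a simplified stand-in of our own: responses are drawn from a Dirichlet distribution with the logit-linked mean and precision α+ = 50, whereas the chapter also fits the EFD; the latent effects β12 = −1 and β22 = 2 and the group probabilities 0.3 and 0.7 are from the text, while which category receives which probability is our choice.

```python
import numpy as np

def simulate_latent_groups(n=150, alpha_plus=50.0, seed=1):
    """Sketch of scenario (a): Dirichlet responses whose mean depends on an
    observed covariate x and a latent dichotomous covariate z with P(z=1)=0.7."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-0.5, 0.5, n)
    z = rng.binomial(1, 0.7, n)                 # latent group indicator
    # Coefficients from the chapter, plus the latent effects of scenario (a):
    # beta_{12} = -1 and beta_{22} = 2.
    eta1 = 1.0 + 2.0 * x - 1.0 * z
    eta2 = 0.5 - 3.0 * x + 2.0 * z
    denom = 1.0 + np.exp(eta1) + np.exp(eta2)
    nu = np.column_stack([np.exp(eta1), np.exp(eta2), np.ones(n)]) / denom[:, None]
    Y = np.vstack([rng.dirichlet(alpha_plus * m) for m in nu])
    return x, z, Y

x, z, Y = simulate_latent_groups()
print(Y.shape)   # (150, 3)
```

A model fitted to (x, Y) alone, ignoring z, faces exactly the clustered responses described above.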
[Figure: ternary plot of the composition (y1, y2, y3), with points colored by latent group, and scatterplots of each element against the covariate x.]
[Figure: ternary plot of the composition (y1, y2, y3), with points colored by latent group, and scatterplots of each element against the covariate x.]
The generic mixture of two Dirichlet distributions has been chosen to induce heavier tails than those of a single Dirichlet. The ternary plot in the top left panel of Figure 9.8 shows one random replication from the generic mixture, where the green points belong to the first component of the mixture and the orange triangles belong to the second component. We can observe that the majority of points (belonging to the second component of the mixture) are placed on the ternary plot and on the scatterplots similarly to scenario (i). At the same time, the group of data coming from the Dirichlet with the smaller precision parameter is far from the remaining points. Focusing on the scatterplots referring to the first and second elements of the composition (top right and bottom left panels of Figure 9.8), it is worth noting that the responses belonging to the first component of the mixture, that is, the one with the smaller precision parameter, depart from the data cloud both upward and downward.
[Figure: ternary plot of the composition (y1, y2, y3), with points marked by mixture component, and scatterplots of each element against the covariate x.]
9.4.1. Comments
Table 9.1 shows the WAIC values in all simulation studies. In fitting study (i), where the data-generating mechanism is Dirichlet, the WAIC of the two models is comparable, while in all remaining scenarios the EFDReg model is far better than the DirReg one. The superiority in fit of the EFDReg model is particularly noticeable in fitting study (ii), in all scenarios with outliers, and in the presence of a latent group induced by a dichotomous covariate (scenario (a)). Scenario (b) (i.e. three latent groups) and the scenario from a generic mixture of two Dirichlet distributions are particularly challenging and make fitting difficult for both models. Nevertheless, the EFDReg model is capable of providing a better adaptation to the data (lower WAIC) than the DirReg.
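The WAIC used for this comparison can be computed from posterior draws of the pointwise log-likelihood as in Vehtari et al. (2017). The sketch below is our own minimal implementation, on the deviance scale where lower is better, checked on synthetic normal draws rather than on the chapter's models:

```python
import numpy as np

def waic(loglik):
    """WAIC = -2 * (lppd - p_waic) from an (S, n) matrix of pointwise
    log-likelihoods evaluated at S posterior draws (Vehtari et al. 2017)."""
    S = loglik.shape[0]
    m = loglik.max(axis=0)
    # log pointwise predictive density, computed stably (log-sum-exp trick)
    lppd = np.sum(m + np.log(np.mean(np.exp(loglik - m), axis=0)))
    p_waic = np.sum(np.var(loglik, axis=0, ddof=1))   # effective number of parameters
    return -2.0 * (lppd - p_waic)

# Toy check: posterior draws centred on the true mean beat badly biased draws.
rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=50)

def normal_loglik(mu_draws):
    return -0.5 * np.log(2 * np.pi) - 0.5 * (y[None, :] - mu_draws[:, None]) ** 2

good = waic(normal_loglik(rng.normal(0.0, 0.1, 400)))
bad = waic(normal_loglik(rng.normal(5.0, 0.1, 400)))
print(good < bad)   # True: the well-specified model has the lower WAIC
```

The same computation applies to any model for which pointwise log-likelihood draws are saved, e.g. in a Stan generated-quantities block.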
[Figure: ternary plot of the composition (y1, y2, y3) and scatterplots of each element against the covariate x.]
Let us now analyze and comment on the posterior means and MSEs for the
Dirichlet and EFD regression models in all scenarios. All results can be found in
Tables 9.2, 9.3 and 9.4. Moreover, we deepen the analysis of the two models by
inspecting the regression curves that are superimposed on the scatterplots in Figures
9.1–9.8. In all figures, black solid lines refer to the EFD model and black dashed lines
refer to the Dirichlet one. In some scenarios, only the solid line appears, meaning that the regression curves of the two models almost coincide. Colored lines refer to the component means λ1 (orange), λ2 (blue) and λ3 (green) of the EFDReg model.
[Figure: ternary plot of the composition (y1, y2, y3) and scatterplots of each element against the covariate x, with regression curves superimposed.]
Results of the fitting study with Dirichlet-distributed data (scenario (i)) can be found in the second and third columns of Table 9.2. It is worth noting that both models
provide precise estimates for the regression parameters and similar MSEs. This is
confirmed by almost identical regression curves for the Dirichlet (black dashed line)
and EFD (black solid line) models (see scatterplots in Figure 9.1). The DirReg model
also provides a precise estimate for the precision parameter α + , while the EFDReg
model slightly overestimates it. Looking at the additional parameters of the EFDReg
model, we can observe that the adaptation to Dirichlet-distributed data is achieved
thanks to equally weighted (estimated p j equal to approximately 0.3 for j = 1, 2, 3)
Simulation Studies for a Special Mixture Regression Model 127
[Figure: scatterplots of y1, y2 and y3 against the covariate x, with regression curves superimposed.]
In fitting study (ii), the EFDReg model adapts well to data and provides precise
estimates with low MSEs and SEs for all the parameters (see Table 9.4). On the
contrary, the DirReg model, in trying to adapt to the data, estimates a considerably lower precision than the true one, and it also fails to correctly estimate some of the regression parameters. The scatterplots in Figure 9.2 show that the regression curves of the EFDReg model adapt very well to the data (both for the overall mean and for the component means), while they are systematically flatter for the DirReg model.
Scenario Fitting study (i) Latent groups (a) Latent groups (b)
Model Dir EFD Dir EFD Dir EFD
β10 = 1 1.001 (0.001) 1.001 (0.001) 0.514 (0.231) 0.850 (0.024) 0.756 (0.060) 1.104 (0.012)
β11 = 2 1.998 (0.018) 1.992 (0.018) 1.567 (0.203) 1.665 (0.131) 1.328 (0.461) 1.299 (0.510)
β20 = 0.5 0.501 (0.001) 0.502 (0.001) 0.883 (0.148) 1.275 (0.605) -0.600 (1.213) -0.044 (0.230)
β21 = −3 -3.006 (0.021) -2.998 (0.021) -1.963 (1.086) -2.096 (0.835) -1.568 (2.096) -1.634 (1.945)
α + = 50 50.052 (4.338) 53.636 (5.473) 5.894 (0.258) 21.241 (2.023) 3.033 (0.121) 5.389 (0.600)
p1 — 0.290 (0.111) — 0.640 (0.030) — 0.582 (0.071)
p2 — 0.315 (0.119) — 0.353 (0.030) — 0.409 (0.071)
p3 — 0.395 (0.116) — 0.007 (0.002) — 0.009 (0.001)
w̃1 — 0.149 (0.031) — 0.608 (0.032) — 0.584 (0.037)
w̃2 — 0.146 (0.041) — 0.707 (0.021) — 0.780 (0.019)
w̃3 — 0.151 (0.032) — 0.461 (0.056) — 0.421 (0.012)
Table 9.2. Posterior means for the Dirichlet and EFD regression models in fitting
study (i) and in scenarios (a) and (b) with latent groups. MSEs for the regression coefficients and SEs for the remaining parameters are in parentheses
The estimates of the unknown parameters in the three scenarios with outliers
are shown in Table 9.3. Moreover, the regression curves of the Dirichlet and EFD
models are plotted on the scatterplots in Figures 9.3, 9.4 and 9.5 referred to scenarios
(I), (II), and (III), respectively. The estimates of the regression parameters of the
Dirichlet and EFD models are affected by the presence of outliers. The element
of flexibility used by the DirReg model to adapt to data that depart from the Dirichlet distribution is the precision parameter, which is systematically
underestimated in all scenarios with outliers. Conversely, the EFDReg model can take
advantage of its special mixture structure to better adapt to data. It is worth noting
that in all scenarios with outliers, one component of the mixture is dedicated to the
group of perturbed values as indicated by the corresponding p j estimate which is
around 0.1. The remaining two components equally describe the remaining majority
of unperturbed data with estimates of p j ’s between 0.3 and 0.5. The analysis of the
regression curves allows us to better understand the different behavior of the DirReg
and EFDReg models. The regression curves of the DirReg model are slightly shifted, in the direction of the perturbed values, with respect to the DirReg curves in the scenario without perturbation (dotted lines in Figures 9.3–9.5). Instead, looking at the component means of the EFD, we note that the first, second and third components of the mixture are entirely dedicated to modeling the subgroup of outliers in scenarios (I), (II) and (III), respectively.
Table 9.3. Posterior means for the Dirichlet and EFD regression models in
scenarios (I), (II) and (III) with outliers. MSEs for the regression coefficients and SEs for the remaining parameters are in parentheses
Results concerning the presence of some latent groups in data are shown in the
last four columns of Table 9.2. The estimates of regression parameters are biased for
both models. Once again, the DirReg model tries to adapt to the data by estimating a very low value for the precision parameter; nevertheless, this results in a very poor fit.
The regression curves of the DirReg model, reported in Figures 9.6 and 9.7, severely
miss the data cloud, particularly in scenario (a). The EFDReg model has a satisfactory
behavior in scenario (a) where the latent covariate has two categories with probabilities
of 0.3 and 0.7. These latent clusters are captured by the EFDReg model with estimates equal to 0.64 and 0.353 for the mixing proportions p1 and p2 of the first and second components, and an estimate close to zero for p3. This is clearly reflected by the
regression curves of the component means of the EFD model plotted in Figures 9.6 and
9.7. It is worth noting that the orange and blue lines λ 1 and λ 2 perfectly fit the two data
clouds. On the contrary, the green line λ 3 has a very poor fit, but this does not affect
the overall fit of the model since the third component of the mixture has a probability
of occurrence around zero. Scenario (b) is more challenging for the EFDReg model.
Please recall that this scenario assumes the existence of a latent covariate having three
categories with probabilities of 0.3, 0.15 and 0.55. Nevertheless, the EFD model is
able to capture only two out of the three latent clusters, as witnessed by the estimate
of the third mixing proportion p3 which is close to zero. A look at the regression
curves of the component means of the EFD model (Figure 9.7) better explains this
behavior. The first scatterplot, referring to the first element of the response, shows a good fit of the orange curve λ1. The remaining two curves λ2 and λ3 are unable to
describe the two visible clusters of data since they are placed in the middle. In the second scatterplot, referring to the second element of the response, the blue curve adapts well to one cluster, and the green and orange ones, which almost overlap, fit a second cluster well, while a third cluster of data is missed by all curves. In the third scatterplot, referring to the third element of the response, the blue and orange curves cross the data cloud, but the green one misses it completely. Overall, the mixture structure of the EFDReg model is too rigid to adapt well to this scenario, although it remains a far better model than the Dirichlet one.
Scenario Fitting study (ii) Scenario Generic mixture
Model Dir EFD Model Dir EFD
β10 = 1 1.087 (0.015) 1.014 (0.010) β10 = 1 1.006 (0.010) 0.947 (0.010)
β11 = 2 1.990 (0.069) 1.999 (0.012) β11 = 2 2.063 (0.149) 1.967 (0.083)
β20 = 0.5 0.752 (0.068) 0.511 (0.010) β20 = 0.5 0.501 (0.014) 0.457 (0.014)
β21 = −3 -2.409 (0.395) -3.009 (0.014) β21 = −3 -2.967 (0.189) -2.866 (0.159)
α + = 50 6.444 (0.306) 50.153 (4.253) α+ 5.619 (0.892) 6.877 (1.684)
p1 = 1/3 — 0.335 (0.024) p1 — 0.149 (0.239)
p2 = 1/3 — 0.335 (0.034) p2 — 0.189 (0.293)
p3 = 1/3 — 0.331 (0.035) p3 — 0.662 (0.353)
w̃1 = 0.6 — 0.601 (0.016) w̃1 — 0.563 (0.230)
w̃2 = 0.2 — 0.199 (0.032) w̃2 — 0.553 (0.229)
w̃3 = 0.7 — 0.694 (0.029) w̃3 — 0.732 (0.153)
Table 9.4. Posterior means for the Dirichlet and EFD regression models in fitting
study (ii) and in the case of a generic mixture of Dirichlet distributions. MSEs for the regression coefficients and SEs for the remaining parameters are in parentheses
The last two columns of Table 9.4 show the estimates when the observations come from a generic mixture of two Dirichlet distributions. It is worth recalling that this scenario assumes that the second mixture component follows the same Dirichlet distribution as in scenario (i), while the first component differs from the second only in having a lower precision parameter. Both the DirReg and EFDReg models provide reasonably unbiased estimates of the regression parameters, although the MSEs are greater than those in scenario (i). Consistently, the regression curves (dashed and solid lines in Figure 9.8) adapt well to the majority of observations and are almost overlapping. The presence of a group of data, around 30%, coming from the Dirichlet distribution with the lower precision parameter forces the DirReg model to provide a low estimate of the precision parameter in trying to adapt to the data.
The EFDReg model performs better than the DirReg model since it is capable of recognizing the presence of clusters in the data. In particular, it dedicates its third component to describing the majority of the data: indeed, the estimate of p3 is approximately equal to 0.7. The first and second components are instead dedicated to the data coming from the first component of the generic mixture (the one with the lower precision parameter), and they show similar estimates of all parameters (pj and w̃j). In this regard, the green curve is near the solid one, particularly in the scatterplots referring to the first and second elements of the composition. The blue and orange curves, in turn, fit the values from the first component of the generic mixture, and they are placed either upward or downward with respect to the majority of points in the scatterplots.
9.5. References
Aitchison, J. (2003). The Statistical Analysis of Compositional Data. The Blackburn Press,
London.
Albert, J. (1987). Bayesian computation with R. ASA Proceedings of Section on Statistical
Graphics.
Campbell, G. and Mosimann, J.E. (2009). Multivariate analysis of size and shape: Modelling
with the Dirichlet distribution. ASA Proceedings of Section on Statistical Graphics, 93–101.
Di Brisco, A.M., Ascari, R., Migliorati, S., Ongaro, A. (2019). A new regression model for
bounded multivariate responses. Smart Statistics for Smart Applications – Book of Short
Papers SIS, 817–822.
Frühwirth-Schnatter, S. (2006). Finite Mixture and Markov Switching Models. Springer Science
+ Business Media, New York.
Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B. (2013). Bayesian
Data Analysis, 3rd edition. CRC Press, London.
Hijazi, R.H. and Jernigan, R.W. (2009). Modelling compositional data using Dirichlet
regression models. Journal of Applied Probability and Statistics, 4, 77–91.
Maier, M.J. (2014). Dirichletreg: Dirichlet regression for compositional data in R. Paper,
Research Report Series, University of Economics and Business, Vienna.
McCullagh, P. and Nelder, J. (1989). Generalized Linear Models. Chapman & Hall, London.
Neal, R.M. (1994). An improved acceptance procedure for the hybrid Monte Carlo algorithm.
Journal of Computational Physics, 111(1), 194–203.
Ongaro, A., Migliorati, S., Ascari, R. (2020). A new mixture model on the simplex. Statistics and Computing [Online]. Available at: https://doi.org/10.1007/s11222-019-09920-x.
Stan Development Team (2016). Stan modeling language users guide and reference manual
[Online]. Available at: http://mc-stan.org/.
Vehtari, A., Gelman, A., Gabry, J. (2017). Practical Bayesian model evaluation using
leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5), 1413–1432.
PART 2
10
Numerical Studies of Implied Volatility Expansions Under the Gatheral Model
10.1. Introduction
The classical Black–Scholes option pricing model assumes that the underlying
asset follows a geometric Brownian motion with constant volatility, but a significant number of model extensions relax this assumption of constant volatility.
One of the recent and popular extensions is the Gatheral model, given in Gatheral
(2008), where a double-mean-reverting market model is considered. The same model
is later considered in Bayer et al. (2013).
Our object of interest is the asymptotic expansions of implied volatility under the Gatheral model presented in Albuhayri et al. (2021), which were obtained by applying the Taylor formula of the implied volatility given in Pagliarani and Pascucci (2017) to the Gatheral model. Applying this general Taylor formula to a specific three-factor model is a non-trivial task. In Albuhayri et al. (2021), only analytical formulas were obtained. Therefore, there
is a need for a thorough numerical study on the performances of these asymptotic
expansions as approximation formulas for the implied volatility. Moreover, it is of
practical interest to investigate how these approximation formulas can be used for
calibrating the model to real data. This chapter addresses these two issues for the first-
and second-order implied volatility expansions. The contribution of this chapter is to:
1) clarify for which range of option parameters the first- and second-order
expansions give reasonable approximations;
2) propose a convenient and straightforward partial calibration procedure and
implement it to synthetic and real market data.
Since there is no exact analytical formula on the implied volatility under the
Gatheral model, we use the Monte Carlo simulation to generate benchmark (reference)
values of implied volatilities for the numerical study on the performances of the
asymptotic expansions.
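Turning a Monte Carlo option price into a benchmark implied volatility requires inverting the Black–Scholes formula. A standard way to do this (our own sketch, with zero interest rate as assumed throughout this chapter) is bisection on the monotone price-volatility map:

```python
import math

def bs_call(S, K, T, sigma):
    """Black-Scholes call price with zero interest rate."""
    if T <= 0 or sigma <= 0:
        return max(S - K, 0.0)
    d1 = (math.log(S / K) + 0.5 * sigma**2 * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    N = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # standard normal CDF
    return S * N(d1) - K * N(d2)

def implied_vol(price, S, K, T, lo=1e-6, hi=5.0, tol=1e-10):
    """Invert bs_call by bisection; the price is increasing in sigma."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bs_call(S, K, T, mid) < price:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

p = bs_call(100.0, 105.0, 0.25, 0.2)
print(round(implied_vol(p, 100.0, 105.0, 0.25), 6))   # round trip recovers 0.2
```

Applied to Monte Carlo prices of the Gatheral model, this inversion yields the reference implied volatilities against which the expansions are compared.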
Calibration means finding model parameters such that the model is consistent
with the market data. In terms of option pricing models, we often minimize an error
function on the differences between the market and model implied volatilities. In
contrast to the time-consuming Monte Carlo simulation method, an analytical formula
of the implied volatility model is beneficial for calibration purposes. For a general
overview of model calibration to option data, we refer to Hilpisch (2019).
For the calibration task, we take advantage of the simple polynomial form of the
implied volatility approximation formulas associated with the first- and second-order
asymptotic expansion in order to propose a simple partial calibration procedure. By
saying partial calibration, we mean that only a part of the original model parameters
or a grouped form of them can be calibrated. However, such partial calibration is
still useful as an intermediate step towards the final local optimization problem on
the full calibration, which is a standard procedure. Moreover, our partial calibration
reveals easily applicable values, like the present volatility level. It should be mentioned
that Fouque et al. (2011) have proposed a calibration procedure under a different
three-factor model using a polynomial form of the implied volatility. The asymptotic
expansion in their work was obtained under the assumption of a fast mean-reverting
volatility component together with a slow mean-reverting volatility component.
A more recent and general extension is given in Pagliarani and Pascucci (2017). Assume that, under a martingale probability measure, the market model is described by an Rd-valued stochastic process (S(t), Y2(t), ..., Yd(t)) satisfying the following system of stochastic differential equations:

dS(t) = η1(t, S(t), Y(t)) S(t) dW1∗(t),  S(0) = s,
dYi(t) = μi(t, S(t), Y(t)) dt + ηi(t, S(t), Y(t)) dWi∗(t),  Y(0) = y,

where 2 ≤ i ≤ d, Y(t) is the vector with components Yi(t), y ∈ R^(d−1) is a deterministic vector, and the time t correlation matrix of the Rd-valued stochastic process with components Wi∗(t) has entries

ρij(t, S(t), Y(t)) ∈ [−1, 1].
In the rest of this chapter, we refer to this model as a local stochastic volatility
model.
The model under consideration is the Gatheral model from this family of local
stochastic volatility models. The Gatheral model is a double-mean-reverting market
model proposed by Gatheral (2008). In a subsequent publication, Bayer et al. (2013),
the model is given as follows:

dS(t) = √v(t) S(t) dW1∗(t),
dv(t) = κ1 (v′(t) − v(t)) dt + ξ1 v^(α1)(t) dW2∗(t),   [10.1]
dv′(t) = κ2 (θ − v′(t)) dt + ξ2 v′^(α2)(t) dW3∗(t),

and the time t correlation matrix of the R3-valued stochastic process with components Wi∗(t) has constant entries ρij ∈ [−1, 1].
The reason why we choose this model was described by Bayer et al. (2013) as
follows:
Thus variance mean-reverts to a level that itself moves slowly over time
with the state of the economy.
European call options are traded on the market; however, the stock's volatility, σ, is not directly observable. A possible solution to this problem follows. It is well known that the Black–Scholes price with zero interest rate satisfies the following boundary value problem for the Black–Scholes partial differential equation:

∂C(S, t)/∂t + (σ²/2) S² ∂²C(S, t)/∂S² = 0,
lim_{t↑T} C(S, t) = max{0, S − K},   [10.2]

with (S, t) ∈ (0, ∞) × (0, T). A possible solution along these lines is described by Berestycki et al. (2002).
In Albuhayri et al. (2021), we proved the following results under model [10.1].
THEOREM 10.1.– The asymptotic expansion of order 1 of the implied volatility has the form:

σ(t, x0, ν0; T, k) = √ν0 + (1/4) ρ12 ξ1 ν0^(α1−1) (k − x0) + o(√(T − t) + |k − x0|).
Numerical Studies of Implied Volatility Expansions Under the Gatheral Model 139
THEOREM 10.2.– The asymptotic expansion of order 2 of the implied volatility has the form:

σ(t, x0, ν0, ν0′; T, k) = √ν0 + (1/8) ρ12 ξ1 ν0^(α1−1) (k − x0)
  + (1/128) [32 κ1 ν0^(−1/2) (ν0′ − ν0) + 8 ρ12 ξ1 ν0^(α1) + 3 ρ12² ξ1² ν0^(2α1−3/2)] (T − t)
  − (3/64) ρ12² ξ1² ν0^(2α1−2) (k − x0)²
  + o(T − t + (k − x0)²).
The parameters used for the simulation are given in Table 10.1. Here, the initial
asset price is denoted by S0 , the number of steps by M , the number of paths by I, and
the rest are the Gatheral model parameters.
Parameter   Value        Parameter   Value     Parameter   Value
r           0            κ1          5.5       ρ12         −0.4
S0          100          κ2          0.1       ρ13         0
M           150          v0          0.05      ρ23         0
I           10,000,000   v0′         0.04      θ           0.078
The number of steps, M , is larger for longer maturities. The parameter choices
come from Bayer et al. (2013). Here, ρ13 = ρ23 = 0 is the most realistic situation.
Indeed, the correlation between the underlying asset and the long-run mean and the
correlation between the volatility and its long-run mean should be close to zero.
Correlation ρ12 is set to be negative as the underlying price and the volatility are
usually negatively correlated.
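A benchmark simulation of model [10.1] with the parameters of Table 10.1 can be sketched with an Euler scheme as follows. The values of ξ1 and ξ2, and the reduced number of paths, are illustrative choices of our own (the volatility-of-volatility parameters are not listed in the table, and I = 10,000,000 paths would be slow in a sketch):

```python
import numpy as np

def gatheral_paths(S0=100.0, v0=0.05, vp0=0.04, kappa1=5.5, kappa2=0.1,
                   theta=0.078, xi1=0.6, xi2=0.2, alpha1=0.94, alpha2=0.94,
                   rho12=-0.4, T=30 / 365, M=150, I=20000, seed=0):
    """Euler scheme for the Gatheral model [10.1] with rho13 = rho23 = 0."""
    rng = np.random.default_rng(seed)
    dt = T / M
    S = np.full(I, S0)
    v = np.full(I, v0)
    vp = np.full(I, vp0)
    for _ in range(M):
        z1 = rng.standard_normal(I)
        z2 = rho12 * z1 + np.sqrt(1 - rho12**2) * rng.standard_normal(I)
        z3 = rng.standard_normal(I)          # uncorrelated with z1 and z2
        # Log-Euler step for S, full truncation floor for v and v'.
        S *= np.exp(-0.5 * v * dt + np.sqrt(v * dt) * z1)
        v = np.maximum(v + kappa1 * (vp - v) * dt
                       + xi1 * v**alpha1 * np.sqrt(dt) * z2, 1e-8)
        vp = np.maximum(vp + kappa2 * (theta - vp) * dt
                        + xi2 * vp**alpha2 * np.sqrt(dt) * z3, 1e-8)
    return S

S_T = gatheral_paths()
K = 100.0
price = np.maximum(S_T - K, 0.0).mean()   # undiscounted since r = 0
print(round(price, 2))
```

Feeding such Monte Carlo prices into a Black–Scholes inversion produces the benchmark implied volatilities used below.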
We consider 130 options with 10 maturities (30, 60, 91, 122, 152, 182, 273, 365,
547 and 730 calendar days) and with log-moneyness between −0.2 and 0.2 and report
the proportion of options that can be approximated within a relative error of 5% using
the second-order asymptotic expansion below.
For the Double Heston model, this proportion is 45% of all options. However, the
accuracy becomes much higher for options with log-moneyness between −0.1 and
0.07, and maturities from 30 days to 1 year.
Figures 10.1 and 10.2 show examples of the asymptotic expansions of orders 1 and 2 of the implied volatility, together with the benchmark values, for two different times to maturity, 30 days and 1 year, respectively. The number of time steps was M = 300.
From this example, it may be seen that the asymptotic expansion of order 2 gives better
approximations, as expected. In addition, note that Figure 10.2 represents the worst
case for maturities ranging from 30 days to 1 year. The second-order approximations
are more accurate for maturities shorter than 1 year.
Similarly, for the Double Lognormal model, with the second-order expansion, the
corresponding proportion of options that can be approximated within a relative error
of 5% is around 55%. For options with log-moneyness between −0.07 and 0.096, and maturities from 30 days to 1 year, the accuracy again becomes higher.
For the first-order expansion, the approximation is decent with relative error less
than 5% only for options with a maturity as short as 30 days, and log-moneyness
between −0.1 and 0.07.
Similar experiments have been done for other values of α1 and α2 (for example, α1 = α2 = 0.94); the results are alike.
The numerical study in the previous section gives a base for the calibration. We
know now for which range of log-moneyness and maturities the first- and second-order
expansions can be used as calibration formulas.
To explain the partial calibration procedure, recall the form of the asymptotic expansion of order 1 of the implied volatility:

σ1(t, x0, ν0; T, k) = √ν0 + (1/4) ρ12 ξ1 ν0^(α1−1) (k − x0) + o(√(T − t) + |k − x0|),   [10.3]

and the form of the asymptotic expansion of order 2 of the implied volatility:

σ2(t, x0, ν0, ν0′; T, k) = √ν0 + (1/8) ρ12 ξ1 ν0^(α1−1) (k − x0)
  + (1/128) [32 κ1 ν0^(−1/2) (ν0′ − ν0) + 8 ρ12 ξ1 ν0^(α1) + 3 ρ12² ξ1² ν0^(2α1−3/2)] (T − t)   [10.4]
  − (3/64) ρ12² ξ1² ν0^(2α1−2) (k − x0)²
  + o(T − t + (k − x0)²).
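Both approximation formulas are simple polynomials in (k − x0) and (T − t) and are cheap to evaluate; a direct transcription (with the o-terms dropped and argument names of our own choosing) reads:

```python
import numpy as np

def sigma1(t, x0, nu0, T, k, rho12, xi1, alpha1):
    """First-order expansion [10.3], o-term dropped."""
    return np.sqrt(nu0) + 0.25 * rho12 * xi1 * nu0**(alpha1 - 1.0) * (k - x0)

def sigma2(t, x0, nu0, nu0p, T, k, rho12, xi1, alpha1, kappa1):
    """Second-order expansion [10.4], o-term dropped; nu0p is v'(0)."""
    s = np.sqrt(nu0)
    s += 0.125 * rho12 * xi1 * nu0**(alpha1 - 1.0) * (k - x0)
    s += (1.0 / 128.0) * (32.0 * kappa1 * nu0**(-0.5) * (nu0p - nu0)
                          + 8.0 * rho12 * xi1 * nu0**alpha1
                          + 3.0 * rho12**2 * xi1**2 * nu0**(2.0 * alpha1 - 1.5)) * (T - t)
    s -= (3.0 / 64.0) * rho12**2 * xi1**2 * nu0**(2.0 * alpha1 - 2.0) * (k - x0)**2
    return s

# Example: a 30-day, slightly out-of-the-money point (parameter values illustrative).
print(round(sigma2(0.0, 0.0, 0.05, 0.04, 30 / 365, -0.05,
                   rho12=-0.4, xi1=0.3, alpha1=0.94, kappa1=5.5), 4))
```

At the money and at expiry (k = x0, T = t) both formulas collapse to √ν0, the spot volatility level.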
To compare the calibrated and true parameter values, a good way to start is by
generating an implied surface using the Monte Carlo simulation with a known set
of parameters and applying the calibration procedure. To do so, the Gatheral model
with parameters given in Table 10.1 and a fixed value of α1 = α2 = 0.94 is used.
The synthetic data is generated for options with log-moneyness from −0.2 to 0.2,
and maturities from 30 days to 2 years but, as discussed previously, only the options with suitable log-moneyness and maturities are used in the calibration procedure.
As expected, the calibration gives close-to-true values. Table 10.2 shows that the
difference between true and calibrated values is fairly small.
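The essence of this partial calibration can be reproduced in a few lines: under [10.3], a regression of short-maturity implied volatilities on log-moneyness has intercept √v0 and slope (1/4) ρ12 ξ1 v0^(α1−1). The sketch below generates an exact synthetic smile from [10.3] with illustrative parameter values (not those of the chapter) and recovers them; with α1 = 1, the Double Lognormal case used for the real data, the slope is simply ρ12 ξ1 / 4.

```python
import numpy as np

# "True" values for the synthetic smile (illustrative, not from the chapter).
v0, rho12_xi1, alpha1 = 0.05, -0.24, 1.0          # alpha1 = 1: Double Lognormal case

k = np.linspace(-0.1, 0.07, 15)                   # 30-day log-moneyness grid
iv = np.sqrt(v0) + 0.25 * rho12_xi1 * v0**(alpha1 - 1.0) * k   # equation [10.3]

# Partial calibration: ordinary least squares of implied vol on log-moneyness.
slope, intercept = np.polyfit(k, iv, 1)
sqrt_v0_hat = intercept                           # estimate of sqrt(v0)
rho12_xi1_hat = 4.0 * slope * sqrt_v0_hat**(2.0 * (1.0 - alpha1))
print(round(sqrt_v0_hat, 4), round(rho12_xi1_hat, 4))   # → 0.2236 -0.24
```

The (T − t) coefficient of [10.4] can be recovered analogously by regressing at-the-money implied volatilities on time to maturity.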
Before moving to the calibration of the real market data, a short description of
the dataset follows. It consists of daily implied volatility surfaces calculated and
interpolated from the traded call options on ABB stock and on Eurostock 50 Index,
in Nasdaqomx Nordic Exchange and Eurex, respectively. The dataset is processed
from the data provided by the company OptionMetrics LLC. The period is from
November 2019 to November 2020; that is, it starts before the Covid-19 pandemic, which makes the period particularly interesting. There are 10 times to maturity: 30, 60, 91, 122, 152, 182, 273, 365, 547 and 730 calendar days. There are also 13 implied exercise prices obtained from the well-known Greek Deltas (0.20 + 0.05n, n = 0, 1, 2, . . . , 12). For
this study, only options with maturities from 30 days to 1 year with a suitable range of
log-moneyness are of interest. While using the first-order expansion, we take 30-day
options with a range of log-moneyness from −0.1 to 0.07. We use at-the-money
(or closely at-the-money) options with maturities from 30 days to 1 year for the
second-order expansion.
For simplicity, we calibrate the special case of the Double Lognormal model, i.e. we set α1 = α2 = 1.
Applying the calibration procedure to the real market data, Figures 10.3 and 10.4
show calibrated daily values of the volatility process of the ABB stock and Eurostock
50 Index, respectively. Both figures clearly show that the pandemic had a great impact on the volatility in the middle of March 2020, when Covid-19 started spreading in Europe and high volatility was expected.
Figures 10.5 and 10.6 show calibrated daily values of the product of the correlation
ρ12 and ξ1 of the ABB stock and Eurostock 50 Index, respectively. Because ξ1 is
assumed to be positive, and the realistic situation suggests that ρ12 should be negative,
the product should be negative too. This can be seen in Figure 10.5. Besides this, the
product should be constant if the Double Lognormal model is a representation of the
market. However, in Figure 10.6, it can be seen that there are some extremely positive
values. Given the severity of the Covid-19 situation, the calibration should be performed more frequently during this period to avoid obtaining unrealistic values. Again, in both cases, the impact of the pandemic is obvious.
[Figures 10.3 and 10.4: calibrated daily values of √v0 for the ABB stock and the Eurostock 50 Index, respectively.]
To calibrate daily values of the product of the reversion rate κ1 and the difference v0′ − v0, i.e. κ1(v0′ − v0), a linear regression of implied volatilities against a range of times to maturity for at-the-money options was used, as mentioned above. Figures 10.7 and 10.8 show
146 Data Analysis and Related Applications 1
values obtained from the calibration for the ABB stock and the Euro Stoxx 50 index,
respectively. The pandemic appears to have had a greater impact on the ABB stock in
this case. During the middle of March 2020, the effect of the Covid-19 pandemic was
undoubtedly the largest.
[Figures 10.5 and 10.6: calibrated daily values of ρ12ξ1 for the ABB stock and the Euro Stoxx 50 index, respectively. Figures 10.7 and 10.8: calibrated daily values of κ1(v0′ − v0) for the ABB stock and the Euro Stoxx 50 index, over November 2019 to November 2020.]
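The regression step of this calibration is straightforward to sketch. The snippet below is a minimal illustration, not the authors' code: the function name and the quotes are hypothetical, and we simply treat the fitted slope as being, up to a constant factor, an estimate of κ1(v0′ − v0).

```python
import numpy as np

def calibrate_slope(maturities, atm_implied_vols):
    """Fit implied volatility ~ a + b * time-to-maturity for ATM options.

    Under the expansion used in the chapter, the slope b is (up to a
    constant factor) an estimate of kappa1 * (v0' - v0); here we just
    return the raw regression coefficients.
    """
    b, a = np.polyfit(maturities, atm_implied_vols, deg=1)
    return a, b  # intercept, slope

# Hypothetical ATM implied-volatility quotes, maturities of 30 days to 1 year
taus = np.array([30.0, 91.0, 182.0, 273.0, 365.0]) / 365.0
vols = np.array([0.32, 0.30, 0.28, 0.27, 0.26])
intercept, slope = calibrate_slope(taus, vols)
```

Repeating such a fit for every trading day would yield daily series of the kind plotted in Figures 10.7 and 10.8.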
10.6. References
Albuhayri, M., Malyarenko, A., Silvestrov, S., Ni, Y., Engström, C., Tewolde, F., Zhang, J.
(2021). Asymptotics of implied volatility in the Gatheral double stochastic volatility model.
In Applied Modeling Techniques and Data Analysis 2, Dimotikalis, Y., Karagrigoriou, A.,
Parpoula, C., Skiadas, C.H. (eds). ISTE Ltd, London, and John Wiley & Sons, New York.
Bayer, C., Gatheral, J., Karlsmark, M. (2013). Fast Ninomiya–Victoir calibration of the
double-mean-reverting model. Quantitative Finance, 13(11), 1813–1829.
Berestycki, H., Busca, J., Florent, I. (2002). Asymptotics and calibration of local volatility
models. Quantitative Finance, 2(1), 61–69.
Dupire, B. (1997). Pricing and hedging with smiles. In Mathematics of Derivative Securities,
Dempster, M.A.H. and Pliska, S.R. (eds). Cambridge University Press, Cambridge.
Fouque, J.P., Papanicolaou, G., Sircar, R., Sølna, K. (2011). Multiscale Stochastic Volatility for
Equity, Interest Rate, and Credit Derivatives. Cambridge University Press, Cambridge.
Gatheral, J. (2008). Consistent modeling of SPX and VIX options. Paper presented at The Fifth
World Congress of the Bachelier Finance Society, London, 18 July 2008.
Hilpisch, Y. (2019). Derivatives Analytics with Python: Data Analysis, Models, Simulation,
Calibration and Hedging. John Wiley & Sons, New York.
Pagliarani, S. and Pascucci, A. (2017). The exact Taylor formula of the implied volatility.
Finance and Stochastics, 21(3), 661–718.
11
11.1. Introduction
Data Analysis and Related Applications 1, First Edition. Edited by Konstantinos N. Zafeiris,
Christos H. Skiadas, Yiannis Dimotikalis, Alex Karagrigoriou and Christiana Karagrigoriou-Vonta.
© ISTE Ltd 2022. Published by ISTE Ltd and John Wiley & Sons, Inc.
The issue of the hot-hand or icy-hand effect in the performance of mutual funds
could be highly useful to managers of collective investment companies as well as
individual investors for a number of reasons. The former could apply the findings
associated with these effects from behavioral finance in their information and
marketing activities. The latter, in turn, might find the relations between the results
occurring in consecutive periods important in the context of evaluating returns and
continuing a possible winning or losing streak. Moreover, the issue can be viewed
from a third perspective: studies dealing with hot-handed fund managers enable
academics to evaluate the intensity of competition in the industry and to enhance
their knowledge of portfolio management theories.
This section will provide a brief review of the literature discussing research on
performance persistence. Such research normally consists in comparing the rates of
return achieved in consecutive periods.
Performance Persistence of Polish Mutual Funds: Mobility Measures 151
The analysis of the relevant literature discussing the issue of persistence of the
mutual fund allocation effects has allowed identification of three groups of research.
The first one covers basic research, which was the first to show performance
persistence and at the same time constituted the starting point for numerous
subsequent inquiries. The criterion for classifying studies to another group was the
emergence of more recent streams in this area, which comprised more advanced
research approaches. One of them is the publications engaging a Markov chain. The
last group of research included in the review contains the studies describing the
European, including Polish, experiences as regards the occurrence of performance
persistence of domestic mutual funds.
The empirical studies of the turn of the 1990s (Grinblatt and Titman 1989;
Brown and Goetzmann 1995) were the first to suggest a relative stability of the
returns generated by mutual funds. It is then that, for example, Hendricks et al.
(1993) identified the above-mentioned hot-hand effect, which refers to a short-term
performance persistence. Other studies attempted also to determine whether
performance persistence was connected with managerial characteristics or stock
selection (e.g. Grinblatt and Titman 1992). The additionally asked question
concerned the issue whether performance persistence of mutual funds might
possibly be a group phenomenon of adopting a common investment strategy
consisting in allocating assets to the securities that performed well in earlier periods
(e.g. Goetzmann and Ibbotson 1994). This is when a set of research tools was
developed, ranging from regression models and analyses of Spearman rank
correlation coefficients to the now-classical contingency tables.
One of the reasons for the relative performance persistence over time was
identified as the so-called survivorship bias. For instance, Malkiel (1995), whose
inquiries additionally involved the funds which discontinued their activities, stated
that the evidence for recurring results in a survivorship-bias-free sample deteriorated
with time. At the same time, he was critical about the hypothesis providing that
some managers were able to continuously achieve better results at an acceptable risk
level. Carhart (1997), in turn, noted that funds generating better short-term returns
managed to do so by applying a momentum strategy. On the other hand, returns on
investments diminished after transaction costs were taken into account. Significant
performance persistence was notable, but only in the case of losing funds.
Huij and Verbeek (2007) found that historical results achieved by equity funds influenced future performance, yet in the
short term only. When extending the timeframe of the analysis, the relations
between the returns generated in successive periods vanished. Interestingly, the
persistence effect of good performance was stronger for younger and smaller funds.
The researchers argued that achieving higher or lower results is related to managers’
luck rather than skills.
Huij and Derwall (2008), in turn, chose to use and confront a broad range of
research methods: from contingency tables, which are traditional for the discussed
stream of the relevant literature, to bootstrap techniques. Their findings showed that
the examined bond funds, which were characterized by good and poor performance
in the past, repeated their rates of return in subsequent periods. Using traditional
research methods, their study demonstrated a relationship between managerial
skills and performance persistence in virtually all analyzed groups of funds.
Studies from non-US markets rely on the invoked Markov chain approaches fairly
infrequently. The group of European-market research employing chiefly traditional
tools includes, for example, Otten and Bams (2002).
Its authors dealt with the results achieved by equity funds coming from the United
Kingdom, France, Germany, Italy and the Netherlands. Earlier, however, Dahlquist
et al. (2000) based their reasoning on a sample of Swedish funds operating in several
core market segments. Casarin et al. (2008), in turn, examined performance
persistence for Italian funds.
Among more recent studies, coming, however, from developing markets, the
following also deserve attention: Koutsokostas et al. (2020) for Greek equity funds,
the performance assessment for Hungarian funds by Bota and Ormos (2017), and the
studies by Czekaj and Grotowski (2014) and Machnik (2020), among others, for
Polish funds. The findings of the above-mentioned analyses were not consistent,
but long-term persistence was ruled out virtually every time. In the context of an
attempt to capture short-term persistence, the conclusions were more convergent, yet
often dependent on the employed research approach or performance measure.
Generally, the relevant literature, despite the multitude of studies, the abundance of
topics and the diversity of the used data, has still failed to provide straightforward
answers to a number of substantial questions. This means that the discussed
phenomenon deserves a further analysis.
The sample used in this study consists of 101 Polish open-end investment funds
covering the period from January 2000 to September 2018. The dataset involving
monthly unit prices of the above registered domestic equity funds was derived from
the reports by Analizy Online, a web service collecting this kind of information in
Poland. Moreover, data on the values of the stock exchange index, which was
important for calculating the used measure of returns, came from the Warsaw Stock
Exchange (GPW) website.
The performance measurement employed in this study uses asset unit values. It
was decided to use continuous return as the base rate, which is one of the most
popular measures of investment effects used in financial analyses. It is based on the
values of funds’ share units and can be calculated logarithmically as follows:
r_i = ln( up_t / up_{t−1} ),   [11.1]

where r_i is the continuous return of fund i in period t, and up_t and up_{t−1} are the
unit prices of fund i at the end (t) and at the beginning (t − 1) of the analyzed period,
respectively.
In the next step, the median of the rates of return calculated in this manner is
used to identify the winning funds and the losing funds in individual periods.
The benchmark return, i.e. the return on the stock exchange index, is then deducted
from the funds' rates of return. The resulting market-adjusted return allows the
determination of the rate of income exceeding the benchmark. The presented measure
of returns is expressed with the following formula (Lee et al. 2008):

r_b = r_i − r_m,   [11.2]

where r_m is the return on the benchmark (the stock exchange index).
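Equations [11.1] and [11.2], together with the median-based winner/loser classification, can be sketched in a few lines; this is a minimal illustration and the function names are ours, not the authors'.

```python
import math

def log_return(up_t, up_prev):
    """Continuous (log) return of a fund's unit price, equation [11.1]."""
    return math.log(up_t / up_prev)

def market_adjusted(fund_returns, market_return):
    """Benchmark-adjusted returns r_b = r_i - r_m, equation [11.2]."""
    return [r - market_return for r in fund_returns]

def classify_by_median(returns):
    """Mark funds above the cross-sectional median as winners (1),
    the rest as losers (0)."""
    ordered = sorted(returns)
    n = len(ordered)
    med = (ordered[n // 2 - 1] + ordered[n // 2]) / 2 if n % 2 == 0 else ordered[n // 2]
    return [1 if r > med else 0 for r in returns]
```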
As was mentioned in the first part, the main aim of this chapter is to examine
whether the performance persistence phenomenon occurs in the Polish mutual fund
market. Like Brown and Goetzmann (1995) did in one of their early studies, the
benchmark used here was the median of the rates of return for each period (relative
benchmark) and the value of the stock market index (absolute benchmark). Hence,
the null hypothesis states that the results achieved in consecutive periods are
unrelated to each other. It will be verified using a stochastic procedure supported
by a few mobility measures.
The main research approach applied was a Markovian framework (see Kemeny
and Snell 1976). The Markov chain used in this study is a special stochastic process
with a countable state space and transitions at integer times. It could be said that
a process X = (X_t)_{t=1}^∞ is a Markov chain with state space S if it takes values
in S and, for every n ∈ ℕ, every s_1, …, s_n, s_{n+1} ∈ S and every
t ∈ {n, n + 1, n + 2, …}, we have that:

P( X_{t+1} = s_{n+1} | X_t = s_n, …, X_{t−n+1} = s_1 ) = P( X_{t+1} = s_{n+1} | X_t = s_n ).
A crucial aspect of dealing with a Markov chain is its transition matrices, i.e. the
matrices of one-step transition probabilities p_ij^t = P(X_{t+1} = s_j | X_t = s_i):

       | p_11^t  …  p_1m^t |
P_t =  |   ⋮     ⋱    ⋮    |     [11.5]
       | p_m1^t  …  p_mm^t |
The probability of events was calculated with the application of the moving ratio,
which allows the determination of the number of times an outcome can occur
compared to all possible outcomes within a chosen time horizon. The above
assumptions enabled the estimation of transition probabilities on the basis of a
six-month horizon for a monthly perspective, a four-quarter horizon for a quarterly
perspective and a three-year horizon for a yearly perspective.
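The moving-ratio estimation of a two-state (winner/loser) transition matrix can be sketched as follows. This is an illustration, not the authors' code; in particular, the chapter does not spell out exact formulas for IR, MU and MD, so the shares-of-funds definitions below are an assumption.

```python
import numpy as np

def transition_matrix(states, window):
    """Estimate a 2x2 winner/loser transition matrix.

    `states` has shape (n_funds, n_periods) with entries 1 (winner)
    and 0 (loser); transitions are counted over the last `window`
    periods -- a simple stand-in for the chapter's moving ratio --
    and each row is normalized to sum to one.
    """
    states = np.asarray(states)
    counts = np.zeros((2, 2))
    for fund in states[:, -window:]:
        for a, b in zip(fund[:-1], fund[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def mobility_measures(prev_rank, next_rank):
    """IR / MU / MD taken as the shares of funds whose ranking-segment
    position stayed the same, improved or deteriorated (one simple
    convention; assumed here, not quoted from the chapter)."""
    n = len(prev_rank)
    ir = sum(p == q for p, q in zip(prev_rank, next_rank)) / n
    mu = sum(q > p for p, q in zip(prev_rank, next_rank)) / n
    md = sum(q < p for p, q in zip(prev_rank, next_rank)) / n
    return ir, mu, md
```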
With reference to the six-month horizon, the generated raw returns, which do not
take market factors into account, seem to imply the existence of a short-term
persistence. As follows from the data provided in Panel A of Table 11.1, the
probability of remaining in the group of winning or losing funds is only slightly
higher (above 0.53) than that of transiting to the other state, i.e. a performance
reversal (approx. 0.47). This is not entirely confirmed by
the results achieved as an effect of applying mobility measures. The degree of
mobility measured by means of MU and MD indicates an increased propensity for
transiting between individual performance ranking segments. Funds characterized
by immobility were outnumbered both by funds with improving positions and by
funds with deteriorating ones.
The findings obtained after grouping funds by the market efficiency criterion are
as expected. According to the results from Panel B of Table 11.1, the probability of
failing to beat the market in consecutive periods was the highest of all permissible
states, which is consistent with the efficient market theory. The noted lack of
uniformity suggests that the percentage of funds whose positions deteriorated was
larger than the percentage of funds whose positions improved. Moreover, it was
noted that the funds that remained in their states as losers were observed most
frequently, as far as a short term is concerned (probability at the level of 0.56). This
proves the existence of performance persistence, but only with respect to the
icy-hand effect. Mobility measures confirmed only the tendency for deteriorating
performance (MD at the level of 0.69).
The results of the research, presented in Panel A of Table 11.2, suggest the
persistence of quarterly raw returns of funds in consecutive periods in accordance
with the selected classification criterion, namely the achievement or non-achievement
of the median value in the performance distribution. The probability of transiting
from the winner (loser) state to the loser (winner) state was lower (approx. 0.43)
than that of remaining in the group of winners or losers (approx. 0.56–0.57). This is
partly confirmed by the value of the immobility ratio (IR), which was determined to
be relatively high, i.e. 0.29, yet not enough to exceed the expected level of 0.33.
Once again, it must be repeated that the estimations of transition probabilities for
a yearly perspective were made on the basis of observations from a three-year
horizon. Long-term observations were made for classifications on the basis of the
relative (Panel A) as well as the absolute (Panel B) benchmark, and their results are
presented in Table 11.3.
The last of the specified time horizons concerns a yearly perspective. As follows
from the data provided in Panel A of Table 11.3, the performance persistence
observed earlier decreases as the timeframe increases. For the sake of comparison,
the probability of transiting from winning (losing) funds to losing (winning) funds
was approximately 0.52–0.53. The mobility measures (MU and MD) also take high
values, considerably exceeding the natural level of 0.33, which is consistent with the
low value of the immobility ratio (0.13). This means that the transition probabilities
are not uniform.
When the absolute benchmark (Panel B of Table 11.3), i.e. the stock market
index in this case, was introduced as a classification criterion, the findings again
turned out to be consistent with the efficient market hypothesis. The empirical
transition probabilities were the highest when funds' performance deteriorated in
consecutive periods (0.64) or poor performance persisted (0.64). The obtained results
are supported by the high value of the MD ratio (0.62), which signifies a high share
of funds with deteriorating investment results.
11.8. Conclusion
The aim of this chapter was to examine whether the performance persistence
phenomenon occurred in a developing mutual fund market. The analysis was
conducted for Polish equity funds from three time perspectives: monthly, quarterly
and yearly. The empirical investigation was possible through the employment of
Markov chains with transition matrices, supported with a few mobility measures.
The applied research framework, still relatively little used in the area of finance,
has proved useful in the verification of the performance persistence hypothesis.
However, this study, along with the findings of Filip and Rogala (2021), may be
considered as an introduction to the research on the performance of mutual funds in
developing countries by means of stochastic processes and as a basis for further
discussions and analyses in this respect.
11.9. References
Bigard, A., Guillotin, Y., Lucifora, C. (1998). Earnings mobility: An international comparison
of Italy and France. Review of Income and Wealth, 44(4), 535–554.
Bota, G. and Ormos, M. (2017). Determinants of the performance of investment funds
managed in Hungary. Ekonomska Istraživanja / Economic Research, 30(1), 1–14.
Brown, S.J. and Goetzmann, W.N. (1995). Performance persistence. The Journal of Finance,
50(2), 679–698.
Carhart, M. (1997). On persistence in mutual fund performance. The Journal of Finance,
52(1), 57–82.
Casarin, R., Pelizzon, L., Piva, A. (2008). Italian equity funds: Efficiency and performance
persistence. ICFAI Journal of Financial Economics, 6(1), 7–28.
Czekaj, J. and Grotowski, M. (2014). Short-term persistence of the results achieved by
common stock investment funds acting in the Polish Capital Market (in Polish).
Ekonomista, 4, 545–557.
Dahlquist, M., Engstrom, S., Soderlind, P. (2000). Performance and characteristics of
Swedish mutual funds. Journal of Financial and Quantitative Analysis, 35(3), 409–423.
Drakos, K., Giannakopoulos, N., Konstantinou, P.T. (2015). Investigating persistence in the
US mutual fund market: A mobility approach. Review of Economic Analysis, 7, 54–83.
Filip, D. and Rogala, T. (2021). Analysis of Polish mutual funds performance: A Markovian
approach. Statistics in Transition New Series, 22(1), 115–130.
Goetzmann, W.N. and Ibbotson, R.G. (1994). Do winners repeat? The Journal of Portfolio
Management, 20(2), 9–18.
Grinblatt, M. and Titman, S. (1989). Mutual fund performance: An analysis of quarterly
portfolio holdings. The Journal of Business, 62(3), 393–416.
Grinblatt, M. and Titman, S. (1992). The persistence of mutual fund performance. Journal of
Finance, 47(5), 1977–1984.
Haslem, J. (2003). Mutual Funds: Risk and Performance Analysis for Decision Making.
Blackwell Publishing, Malden.
Hendricks, D., Patel, J., Zeckhauser, R. (1993). Hot hands in mutual funds: Short-run
persistence of relative performance, 1974–1988. Journal of Finance, 48(1), 93–130.
Huij, J. and Derwall, J. (2008). “Hot hands” in bond funds. Journal of Banking and Finance,
32(4), 559–572.
Huij, J. and Verbeek, M. (2007). Cross-sectional learning and short-run persistence in mutual
fund performance. Journal of Banking and Finance, 31(3), 973–997.
Kemeny, J.G. and Snell, L.J. (1976). Finite Markov Chains. With a New Appendix
“Generalization of a Fundamental Matrix”. Springer-Verlag, New York-Berlin-
Heidelberg-Tokyo.
Koutsokostas, D., Papathanasiou, S., Eriotis, N. (2020). Short-term versus longer-term
persistence in performance of equity mutual funds: Evidence from the Greek market.
International Journal of Bonds and Derivatives, 4(2), 89–103.
Lee, J.S., Yen, P.H., Chen, Y.J. (2008). Longer tenure, greater seniority, or both. Evidence
from open-end equity mutual fund managers in Taiwan. Asian Academy of Management
Journal of Accounting and Finance, 4(2), 1–20.
12.1. Introduction
The regret (loss function) after N rounds is defined as the expected difference
between the reward sum associated with an optimal strategy and the sum of the
collected rewards, and is equal to

L_N(σ, θ) = E_{σ,θ} ( N · max(m_1, …, m_J) − Σ_{n=1}^{N} ξ_n ).

Here, E_{σ,θ} denotes the expected value calculated with respect to the measure
generated by the strategy σ and the parameter θ, m_1, …, m_J are the expected
one-step rewards of the J arms, and ξ_n is the reward collected at step n.
Further, we consider the normalized regret, scaled by (DN)^{1/2}, where D is the
known variance of the rewards and N is the reached step. The maximum values of
the normalized regret are attained on the set of “close” distributions

Θ = { m_j = m + c_j (D/N)^{1/2} ; c_j ∈ ℝ, |c_j| ≤ C < ∞, j = 1, …, J }.

For “distant” distributions, the normalized regrets have smaller values. For
example, they have order log N if max(m_1, …, m_J) exceeds all the other expected
values by some Δ > 0 (see Lai et al. 1980).
Invariant Description for a Batch Version of the UCB Strategy 165
We aim to build a batch version of the UCB strategy described in Lai (1987).
Also, we obtain its invariant description on the unit horizon in the domain of “close”
distributions (since it is for “close” distributions that the maximum values of
expected regret are attained). Finally, we show (using Monte Carlo simulations) that
the expected regret depends only on the number of processed batches (not on the
number of steps) and that the maximum of the scaled regret is reached for a step
number proportional to N.
Suppose that at step n, the l-th arm has been chosen n_l times, and let X_l(n) denote
the corresponding cumulative reward (for l = 1, …, J).
Since the goal is to maximize the total expected reward, it might seem reasonable
to always apply the action corresponding to the current largest value X_l(n)/n_l.
However, such a rule can result in a significant loss, since the initial estimate
X_l(n)/n_l corresponding to the largest m_l can by chance take a lower value, and
consequently this action will never be applied.
To get a correct estimation of m_l, we must ensure that each arm is chosen
infinitely many times as N → ∞. To this end, the UCB strategy chooses at each step
the arm with the largest value of the index

U_l(n) = X_l(n)/n_l + ( 2D log(n/n_l) / n_l )^{1/2},   l = 1, 2, …, J; n = 1, 2, …, N.
It is supposed that for initial estimation of mean rewards, each arm will be used
once in the initial stage of control.
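The index rule above can be sketched as follows for Gaussian rewards with known variance D (the function names are ours; as in the text, each arm must have been pulled once in the initial stage before the rule is applied).

```python
import math

def ucb_index(cum_reward, n_pulls, step, variance):
    """Upper confidence bound of one arm in the Gaussian setting:
    X(n)/n_l + sqrt(2 * D * log(n / n_l) / n_l)."""
    mean = cum_reward / n_pulls
    bonus = math.sqrt(2.0 * variance * math.log(step / n_pulls) / n_pulls)
    return mean + bonus

def choose_arm(cum_rewards, pulls, step, variance):
    """Pick the arm with the largest index; every arm must already
    have been pulled at least once (the initial stage of control)."""
    indices = [ucb_index(x, n, step, variance)
               for x, n in zip(cum_rewards, pulls)]
    return max(range(len(indices)), key=indices.__getitem__)
```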
We consider a setting in which the gambler can change the arm only after using
it M times in a row. We assume for simplicity that N = MK, where K is the number
of batches. This limitation allows batch (and also parallel) processing (see
Kolnogorov 2012, 2018).
If M is large and the variance D is finite, then due to the central limit theorem,
the reward for each batch will have a close-to-normal distribution with probability
density function

f(x | m) = (2πMD)^{−1/2} exp( −(x − mM)² / (2MD) ).
For the k-th batch on the l-th arm, the following expression for the reward holds:

X = M m_l + (MD)^{1/2} Z = M m + M c_l (D/N)^{1/2} + (MD)^{1/2} Z   (Z ∼ N(0, 1)).

The batch version of the UCB index, with k_l denoting the number of batches in
which the l-th arm has been used, becomes

U_l(k) = X_l(k) / (k_l M) + ( 2D log(k/k_l) / (k_l M) )^{1/2},   l = 1, 2, …, J.
The aim is to get a description of the strategy and the regret that is independent
of the control horizon size. That way, it will be possible to draw conclusions about
its properties regardless of the horizon size. We aim to scale the step number by
some parameter N (this parameter is the horizon size in the case where it is a priori
known).
We denote by

I_l(k) = 1 if U_l(k) = max( U_1(k), …, U_J(k) ), and I_l(k) = 0 otherwise,

the indicator of the action chosen for processing the (k + 1)-th batch according to the
considered rule (also recall that at k ≤ J every arm is chosen once for a batch, so
I_l(k) = 1 for k = l). With probability 1, only one of the values I_1(k), …, I_J(k) is
equal to 1.
The cumulative reward of the l-th arm after k processed batches can therefore be
represented as

X_l(k) = k_l M m + k_l M c_l (D/N)^{1/2} + (k_l M D)^{1/2} Z_l,   Z_l ∼ N(0, 1),   l = 1, 2, …, J.

The upper bound value for each arm can then be written as (l = 1, 2, …, J;
k = J + 1, J + 2, …, K):

U_l(k) = m + c_l (D/N)^{1/2} + Z_l ( D / (k_l M) )^{1/2} + ( 2D log(k/k_l) / (k_l M) )^{1/2}.
Next, we apply a linear transformation that does not change the arrangement of
the bounds,

V_l(k) = ( U_l(k) − m ) (N/D)^{1/2},

which, after substituting s = k/K and s_l = k_l/K, yields

V_l(s) = c_l + Z_l s_l^{−1/2} + ( 2 s_l^{−1} log(s/s_l) )^{1/2},   l = 1, 2, …, J.

Here, s changes in the interval (0, 1] when k changes from 1 to K, i.e. the control
horizon is scaled to a unit size. A priori unknown control horizons are also covered,
since s may then grow from 0 to any value.
To obtain an invariant form for the regret, we first assume without loss of generality
that m_J = max(m_1, …, m_J), so the regret can be expressed as

L_N(σ, θ) = (D/N)^{1/2} Σ_{l=1}^{J} (c_J − c_l) E_{σ,θ} n_l(N)
          = (DN)^{1/2} Σ_{l=1}^{J} (c_J − c_l) E_{σ,θ} ( n_l(N)/N ),

where n_l(N) is the total number of times the l-th arm is chosen. After normalization
(scaling by (DN)^{1/2}), we get the following expression for the regret:

(DN)^{−1/2} L_N(σ, θ) = Σ_{l=1}^{J} (c_J − c_l) E_{σ,θ} ( n_l(N)/N ),

which is the required invariant description. Hence, we can present the results in the
form of the following theorem.
THEOREM 12.1.– For Gaussian multi-armed bandits with J arms, fixed known
variance D and unknown expected values m_1, …, m_J that have “close” distributions
defined by

m_j = m + c_j (D/N)^{1/2} ; c_j ∈ ℝ, |c_j| ≤ C < ∞, j = 1, …, J,

the batch UCB strategy

U_l(k) = X_l(k) / (k_l M) + ( 2D log(k/k_l) / (k_l M) )^{1/2},   l = 1, 2, …, J; k = 1, 2, …, K,

admits the invariant description on the unit horizon

V_l(s) = c_l + Z_l s_l^{−1/2} + ( 2 s_l^{−1} log(s/s_l) )^{1/2},   l = 1, 2, …, J,

and the normalized regret takes the form

(DN)^{−1/2} L_N(σ, θ) = Σ_{l=1}^{J} (c_J − c_l) E_{σ,θ} ( n_l(N)/N ).
Figure 12.1. Maximum scaled regret versus step number for c_2 − c_1 = 10.
For a color version of this figure, see www.iste.co.uk/zafeiris/data1.zip
The difference value c_2 − c_1 = 3.3 (Figure 12.2) is chosen because the biggest
maximum regret was obtained in this case according to Garbar (2020a, 2020b). The
other values (c_2 − c_1 = 10, Figure 12.1; c_2 − c_1 = 1, Figure 12.3) correspond to
bigger and smaller differences in the expected reward.
In all cases, we can observe that the maximum of the scaled regret is reached for
a step number proportional to N. When the difference in mean rewards is large
(Figure 12.1), the strategy can distinguish between the better and worse arms in the
early stages of control, and the regret is comparatively small.
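A Monte Carlo experiment of this kind can be sketched as below for a two-armed bandit. This is an illustrative reimplementation under our notation, not the authors' code, and it assumes known variance D, m = 0 and "close" expected values m_j = c_j (D/N)^{1/2}.

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_ucb_regret(c, K, M, D=1.0, n_runs=200):
    """Monte Carlo estimate of the normalized regret (DN)^(-1/2) * L_N
    for a two-armed Gaussian bandit with 'close' expected values
    m_j = c_j * sqrt(D / N), played in K batches of size M."""
    N = K * M
    means = np.asarray(c) * np.sqrt(D / N)
    best = means.max()
    regrets = np.empty(n_runs)
    for run in range(n_runs):
        pulls = np.ones(2)                            # batches used per arm
        rew = rng.normal(means * M, np.sqrt(M * D))   # one initial batch each
        for k in range(2, K):
            # batch UCB index per arm
            idx = rew / (pulls * M) + np.sqrt(
                2 * D * np.log(k / pulls) / (pulls * M))
            j = int(np.argmax(idx))
            rew[j] += rng.normal(means[j] * M, np.sqrt(M * D))
            pulls[j] += 1
        regrets[run] = N * best - (pulls * M * means).sum()
    return regrets.mean() / np.sqrt(D * N)

# e.g. batch_ucb_regret([0.0, 3.3], K=50, M=20)
```

Sweeping K while holding the difference c_2 − c_1 fixed reproduces the kind of scaled-regret curves shown in Figures 12.1 to 12.3.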
12.6. Conclusion
We reviewed a batch version of the UCB rule with an a priori unknown control
horizon.
Monte Carlo simulations were performed to study the normalized regret for
several fairly large horizon sizes; they show that the maximum of the regret is
reached for a step number proportional to N, as expected from the obtained
invariant descriptions.
12.7. Affiliations
12.8. References
Garbar, S.V. (2020a). Invariant description for batch version of UCB strategy for multi-armed
bandit. J. Phys. Conf. Ser., 1658, 012015.
Garbar, S.V. (2020b). Invariant description of UCB strategy for multi-armed bandits for batch
processing scenario. Proceedings of the 24th International Conference on Circuits,
Systems, Communications and Computers (CSCC), 75–78, Chania, Greece.
Kolnogorov, A.V. (2012). Parallel design of robust control in the stochastic environment (the
two-armed bandit problem). Automation and Remote Control, 73, 689–701.
Kolnogorov, A.V. (2018). Gaussian two-armed bandit and optimization of batch data
processing. Problems of Information Transmission, 54(1), 84–100.
Lai, T.L. (1987). Adaptive treatment allocation and the multi-armed bandit problem. Ann.
Statist., 25, 1091–1114.
Lai, T.L., Levin, B., Robbins, H., Siegmund, D. (1980). Sequential medical trials (stopping
rules/asymptotic optimality). Proc. Natl. Acad. Sci. USA, 77, 3135–3138.
Lattimore, T. and Szepesvari, C. (2020). Bandit Algorithms. Cambridge University Press,
Cambridge, UK.
Vogel, W. (1960). An asymptotic minimax theorem for the two-armed bandit problem. Ann.
Math. Statist., 31, 444–451.
13
Beta regression is used to analyze data whose value is within the range (0,1),
such as rates, proportions or percentages, and therefore is useful for analyzing the
variables that affect them (Ferrari and Cribari-Neto 2004; Simas et al. 2010). This
method is based on the beta distribution or its re-parametrizations, proposed by
Ferrari and Cribari-Neto (2004) and Cribari-Neto and Souza (2012), to obtain a
regression structure on the mean that is easier to analyze and interpret. For the
regression for binary data, the literature has debated the problem of incorrect link
functions and therefore proposed new links, such as gev (generalized extreme
value), while, for the mean of the beta regression, the traditional link functions for
binary regressions were used, i.e. logit, probit and complementary log–log. In this
chapter, a new inverse link function is proposed for the mean parameter of a beta
regression, which has as its particular cases inverse logit, representing a traditional
symmetric inverse link function, and gev, proposed for binary data due to its
asymmetry. The new inverse link function proposed in this chapter has the
advantage that it can also be non-monotonic, unlike those proposed until now. The
parameters are estimated by maximizing the likelihood function using a modified
version of the genetic algorithm, which gives greater importance to the traditional
link functions than to the others. This method is compared with the one proposed by
Cribari-Neto and Zeileis (2010), in which the researcher decides the link function
a priori, on simulated data, so that it is possible to compare which of the two
methods comes closest to the true values. The proposed method turns out to be better
because it is able to correctly determine the link function with which the data were
simulated and to estimate the parameters with less error.
13.1. Introduction
Beta regression is typically used to analyze data whose value is within the range
(0,1), such as rates, proportions or percentages, and to study the variables which
affect them (Cox 1996; Ferrari and Cribari-Neto 2004; Simas et al. 2010; Cribari-Neto
and Queiroz 2014). This statistical method is based on beta distribution or its
re-parametrizations, proposed by Ferrari and Cribari-Neto (2004) and by Cribari-Neto
and Souza (2012), to obtain a regression structure on the mean which is easier to
analyze and interpret. Cox (1996) and Kieschnick and McCullough (2003) were
among the first to propose beta regression. They proposed their own version of the
generalized linear models (McCullagh and Nelder 1989) for variables of interest
restricted to the unit interval, exploiting the fact that the beta distribution belongs to
the exponential family. Generalized linear models link the mean of the variable of
interest to a function, called the link function, of exogenous explanatory variables,
also called regressors. The inverse of the link function is
called the response function. Since the mean of the beta distribution is between
0 and 1, Kieschnick and McCullough (2003) recommend the use of a logit link
function, essentially created to be applied to regressions for binary data. In its
traditional form, the beta distribution is characterized by the two parameters p and q.
Because its mean is a function of the parameters p and q, Kieschnick and
McCullough (2003) link them to the explanatory variables through the link function,
considering q as a function of the regressors. Ferrari and Cribari-Neto (2004)
re-parameterize the beta distribution, which thus becomes characterized by the
parameters µ and φ, which are, respectively, the mean and the precision. With this
modification, the analysis is simplified, since the mean is directly linked to the
explanatory variables through the link function. The parameters of the link function
are estimated using the quasi-maximum likelihood (QMLE) method (Cox 1996;
Kieschnick and McCullough 2003) or the maximum likelihood method (Ferrari and
Cribari-Neto 2004). Beta regression has had a further evolution with the possibility
of also linking the precision parameter to explanatory variables through another link
function (Smithson and Verkuilen 2006; Simas et al. 2010). Smithson and Verkuilen
(2006) use the logarithmic function for the link function of the precision parameter
and the logit function for the mean. Simas et al. (2010), instead, propose some
functions both for the link function of the precision parameter and for that of the
mean. Simas et al. (2010) apply, for the mean, the traditional link functions of binary
regressions, i.e. logit, probit and complementary log–log, and, for the precision
parameter, the logarithmic function, the square root and equality. The parameters
of the two link functions are estimated with the maximum likelihood method.
Cribari-Neto and Souza (2012) propose their new parameterization of the beta
distribution, where the parameters represent the mean and a measure of dispersion.
In their study, the logit function is used for both link functions.
A New Non-monotonic Link Function for Beta Regressions 175
In the literature, the problem of incorrect link functions has been discussed in the
context of regressions for binary data (Czado and Santner 1992), and therefore, new
and further link functions have been proposed (Aranda-Ordaz 1981; Stukel 1988;
Nagler 1994; Wang and Dey 2010; Jiang et al. 2013; Gheno 2018). For the mean of
the beta regression, however, researchers have until now used the link functions of binary regressions, i.e. logit, probit and complementary log–log. Since the mean lies in (0, 1), only functions (0, 1) → ℜ can be used as link functions. The logit and probit functions are monotonic and symmetric, while the complementary log–log approaches 0 slowly and 1 quickly. The complementary log–log, also defined as the extreme minimal value link (Fahrmeir and Tutz 2013), has its complementary version in the log–log, or extreme maximal value link (Fahrmeir and Tutz 2013), because it approaches 0 quickly and 1 slowly. Other
non-symmetric functions have been proposed for binary data. Some examples of these
link functions are gev (generalized extreme value), which has the complementary
log–log link function as a special case (Wang and Dey 2010; Calabrese and Osmetti
2013); scobit (Nagler 1994) and Aranda-Ordaz’s link, which has the logit and
the complementary log–log as special cases (Aranda-Ordaz 1981). Only Canterle and
Bayer (2019) use Aranda-Ordaz’s link for the mean in a beta regression.
In this chapter, a new response function is proposed for the mean parameter of a
beta regression, which has as its particular cases the symmetric inverse link function
logit and the asymmetric gev. The new response function has the advantage that it can also be non-monotonic, a feature not present in those proposed until now. The parameters are estimated by maximum likelihood, made tractable by a modified version of the genetic algorithm that gives more relevance to the traditional link functions than to the others. This new method is compared with that proposed by Cribari-Neto and Zeileis (2010) using simulated data, in order to determine which of the two methods comes closest to the true values. The new method is able to correctly determine the link function with which the data are simulated and to estimate the parameters with less error.
13.2. Model
The variable of interest of a beta regression has a beta distribution and therefore
takes values in (0, 1), with the extremes excluded. However, beta regression can also be used for variables taking values in an interval (a, b) after the appropriate transformation (y − a)/(b − a) (Ferrari and Cribari-Neto 2004; Smithson and Verkuilen 2006). If y, instead, can also assume the values 0 and 1, Smithson and Verkuilen (2006) propose the transformation (y(n − 1) + 0.5)/n, where n represents
the sample size.
176 Data Analysis and Related Applications 1
To facilitate the interpretation of the estimated values from the beta regression,
Ferrari and Cribari-Neto (2004) propose the following re-parametrization of the beta
distribution
f(y; μ, φ) = [Γ(φ) / (Γ(μφ) Γ((1 − μ)φ))] y^(μφ−1) (1 − y)^((1−μ)φ−1),   0 < y < 1
where μ represents the mean and is between 0 and 1 excluded, and φ represents the
precision parameter and is greater than 0. In this parameterization, the variable y has
the mean equal to μ and variance equal to μ (1-μ)/(1 + φ). In the simplest form of
beta regression, the mean of the variable of interest is equal to
g(μ_i) = x_i′β = Σ_j x_ij β_j  ⟹  μ_i = g^(−1)(x_i′β)

where g(∙) is the link function of the mean; for the logit link, μ_i = exp(x_i′β)/(1 + exp(x_i′β)). In the more complex version, the precision parameter is also linked to explanatory variables through its own link function

h(φ_i) = z_i′γ

where h(∙) represents the link function of the precision parameter and its inverse is such that h^(−1)(∙): ℜ → (0, ∞) (Cribari-Neto and Zeileis 2010). A sample of size n is
used to estimate the parameters β and φ of the simpler version or the parameters β
and γ of the more complex version. The relative log-likelihood function of the
simplest model becomes
ℓ(β, φ) = Σ_{i=1}^{n} log f(y_i; μ_i, φ)
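The (μ, φ) parameterization and its moments can be checked numerically; a minimal sketch (assuming SciPy is available) maps (μ, φ) back to the standard shape parameters p = μφ and q = (1 − μ)φ:

```python
from scipy.stats import beta

# Ferrari and Cribari-Neto (2004) reparameterization:
# standard shape parameters p = mu*phi and q = (1 - mu)*phi
mu, phi = 0.3, 5.0
dist = beta(mu * phi, (1 - mu) * phi)

print(dist.mean())                 # equals mu
print(dist.var())                  # equals mu*(1 - mu)/(1 + phi)
print(mu * (1 - mu) / (1 + phi))
```

The mean and variance returned by the reparameterized distribution coincide with the expressions μ and μ(1 − μ)/(1 + φ) used throughout this section.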
In the simplest version, each observation has mean equal to μ_i and variance equal to μ_i(1 − μ_i)/(1 + φ). The parameters are obtained by maximizing the
log-likelihood function. The most commonly used link functions for the mean are, respectively, logit, probit and complementary log–log:

g(μ_i) = log(μ_i/(1 − μ_i)) = x_i′β  ⟹  μ_i = exp(x_i′β)/(1 + exp(x_i′β))

g(μ_i) = Φ^(−1)(μ_i) = x_i′β  ⟹  μ_i = Φ(x_i′β)

g(μ_i) = log(−log(1 − μ_i)) = x_i′β  ⟹  μ_i = 1 − exp(−exp(x_i′β))
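The three response (inverse link) functions can be sketched directly in code (the function names are illustrative), using the standard normal CDF for the probit:

```python
import math
from statistics import NormalDist

def inv_logit(eta):
    # mu = exp(eta) / (1 + exp(eta))
    return math.exp(eta) / (1.0 + math.exp(eta))

def inv_probit(eta):
    # mu = Phi(eta), the standard normal CDF
    return NormalDist().cdf(eta)

def inv_cloglog(eta):
    # mu = 1 - exp(-exp(eta))
    return 1.0 - math.exp(-math.exp(eta))

# Symmetry at eta = 0: logit and probit give 0.5, cloglog does not
print(inv_logit(0.0), inv_probit(0.0), round(inv_cloglog(0.0), 4))
```

Evaluating at η = 0 makes the asymmetry of the complementary log–log visible: it returns 1 − e^(−1) ≈ 0.632 rather than 0.5.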
While logit and probit are monotonic and symmetric functions, complementary
log–log approaches 0 slowly and 1 quickly. In this chapter, to broaden the possibility of studying the relationship between the mean and the explanatory variables x_i, i = 1, …, n, more comprehensively, a new response function called logev is proposed, because it contemplates among its particular cases the inverse link function of the logit and that of the gev (Calabrese and Osmetti 2013), until now only applied to regressions for binary variables. The gev link function is
g(μ_i) = [(−ln μ_i)^(−τ) − 1]/τ = x_i′β  ⟹  μ_i = exp(−(1 + τ x_i′β)^(−1/τ))
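A sketch of the gev response function, including its support restriction (an illustrative implementation, not the authors' code):

```python
import math

def inv_gev(eta, tau):
    # gev response: mu = exp(-(1 + tau*eta)^(-1/tau)),
    # defined only where 1 + tau*eta > 0
    base = 1.0 + tau * eta
    if base <= 0.0:
        raise ValueError("1 + tau*eta must be positive")
    return math.exp(-base ** (-1.0 / tau))

print(inv_gev(0.5, 0.25))   # a value in (0, 1)
try:
    inv_gev(-5.0, 0.25)     # outside the support: 1 + 0.25*(-5) < 0
except ValueError as err:
    print("undefined:", err)
```

The explicit domain check mirrors the fact, noted below, that the support of the gev component depends on the parameter values.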
With τ → 0, the gev response function becomes the response function of the complementary log–log, and with τ < 0, it becomes the response function of the Weibull (Calabrese and Osmetti 2013). The logev function is
μ_i = δ exp(−(1 + τ(α + x_i′β))^(−1/τ)) + (1 − δ) exp(α + x_i′β)/(1 + exp(α + x_i′β))

defined piecewise according to the sign of 1 + τ(α + x_i′β), so that the gev component is well defined, where δ ∈ [0, 1] weights the gev and logit components. A peculiarity of the gev component is that its support depends on the estimated parameters, and therefore, an a priori choice can eliminate this problem.
Another peculiarity of this function is the possibility of being non-monotonic, a
feature that has not been considered for a beta regression until now. Non-monotonicity
for binary data has only been proposed in the Bayesian field by Gheno (2018).
13.3. Estimation
The parameters of the logev beta regression are estimated by maximizing the log-likelihood with a modified version of the genetic algorithm (Holland 1975; Whitley 1994). Hybrid response functions such as

0.5 · logit^(−1)(η_i) + 0.5 · gev^(−1)(η_i)

0.1 · logit^(−1)(η_i) + 0.9 · gev^(−1)(η_i)

are admitted, but the modified algorithm gives greater importance to the logit and gev models than to a hybrid model. The standard
errors of the estimated parameters can be calculated with the bootstrap method.
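As an illustration of the likelihood maximization (substituting a generic numerical optimizer for the modified genetic algorithm used in the chapter; the variable names and simulation settings are illustrative), a logit beta regression can be fitted by maximum likelihood with NumPy and SciPy:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln, expit

rng = np.random.default_rng(0)

# Simulate a logit beta regression with alpha = 1, beta = -2, phi = 20
n = 500
x = rng.uniform(-1.0, 1.0, n)
mu = expit(1.0 - 2.0 * x)
phi = 20.0
y = rng.beta(mu * phi, (1.0 - mu) * phi)

def negloglik(theta):
    # negative log-likelihood of the (mu, phi)-parameterized beta density
    a, b, log_phi = theta
    m = expit(a + b * x)
    p = np.exp(log_phi)
    ll = (gammaln(p) - gammaln(m * p) - gammaln((1.0 - m) * p)
          + (m * p - 1.0) * np.log(y)
          + ((1.0 - m) * p - 1.0) * np.log1p(-y))
    return -ll.sum()

fit = minimize(negloglik, x0=np.zeros(3), method="Nelder-Mead",
               options={"maxiter": 5000, "xatol": 1e-8, "fatol": 1e-8})
print(fit.x[:2])  # estimates of (alpha, beta), close to (1, -2)
```

Parameterizing the precision as exp(log_phi) keeps the optimization unconstrained, a common device when the link of the precision is logarithmic.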
13.4. Comparison
To study the goodness of the method, it is compared with the beta logit
regression proposed by Cribari-Neto and Zeileis (2010) (hereinafter also defined as
betareg), using simulated data so as to know exactly what the true relationship
between the response variable and the explanatory variables is. In the first analysis,
30 datasets of sample size 500 are simulated from a logit model with α = 1 and β = −2. Logev beta regression estimates all datasets, while logit beta regression is
able to estimate only 25 datasets. Figure 13.1 shows that in almost 80% of cases,
logev chooses the logit model exactly. As simulated data is used, it is possible to
analyze which of the two methods is closest to the true value. Only the cases where
logev chooses the logit model exactly are considered, and the bias is analyzed
(Langner et al. 2003):
bias(α̂_c) = (1/D) Σ_{d=1}^{D} (α̂_{c,d} − α) = (1/D) Σ_{d=1}^{D} (α̂_{c,d} − 1)

bias(β̂_c) = (1/D) Σ_{d=1}^{D} (β̂_{c,d} − β) = (1/D) Σ_{d=1}^{D} (β̂_{c,d} + 2)
where D is the number of simulated datasets estimated by both methods as a logit model, equal to 19, and c = betareg, logev. If the bias is considered, the
intercept and the coefficient β are better estimated by the logev method, because, in
both cases, the logev bias is closer to 0 than the betareg bias. Considering again the cases where logev chooses the logit model exactly, the MSE statistic of the two methods is compared, in order to analyze both the variance and the bias:
MSE(α̂_c) = (1/D) Σ_{d=1}^{D} (α̂_{c,d} − α)² = (1/D) Σ_{d=1}^{D} (α̂_{c,d} − 1)²

MSE(β̂_c) = (1/D) Σ_{d=1}^{D} (β̂_{c,d} − β)² = (1/D) Σ_{d=1}^{D} (β̂_{c,d} + 2)²
where D = 19 and c = betareg, logev. The intercept is better estimated by the betareg
method, even if the two MSEs are very close, while logev estimates the coefficient β
much better.
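The bias and MSE computations above are straightforward to express in code (the estimate arrays below are illustrative placeholders, not the chapter's D = 19 replications):

```python
import numpy as np

ALPHA_TRUE, BETA_TRUE = 1.0, -2.0

def bias(estimates, true_value):
    # bias = (1/D) * sum_d (estimate_d - true_value)
    return float(np.mean(np.asarray(estimates) - true_value))

def mse(estimates, true_value):
    # MSE = (1/D) * sum_d (estimate_d - true_value)^2
    return float(np.mean((np.asarray(estimates) - true_value) ** 2))

# Illustrative intercept estimates from D = 3 replications
alpha_hat = [1.05, 0.98, 1.02]
print(bias(alpha_hat, ALPHA_TRUE), mse(alpha_hat, ALPHA_TRUE))
```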
The two methods are also compared with the AIC and BIC criteria, which are equal to (Qi and Zhang 2001):

AIC_c = ln(MSE_c) + 2k_c/n

BIC_c = ln(MSE_c) + k_c ln(n)/n

MSE_c = (1/n) Σ_{i=1}^{n} (y_i − ŷ_{c,i})²

where c = betareg, logev and k_c represents the number of parameters, so that k_betareg = 4, while k_logev = 6. The two AICs are now compared:
ΔAIC = ln(MSE_betareg) − ln(MSE_logev) + 8/500 − 12/500

If ΔAIC > 0, the best model is logev, and then the condition for choosing the logev is

ln(MSE_betareg) − ln(MSE_logev) > −8/500 + 12/500 = 0.008
ΔBIC = ln(MSE_betareg) − ln(MSE_logev) + 4 ln(500)/500 − 6 ln(500)/500
If ΔBIC > 0, the best model is the one proposed by the logev regression model, and the condition for choosing the logev becomes

ln(MSE_betareg) − ln(MSE_logev) > 2 ln(500)/500 = 0.024858
Figure 13.2 shows that logev is always better than the logit regression model.
In the following three datasets, logev and logit are only compared by analyzing the
goodness of the model. Indeed, the estimation of the parameters α and β is not considered, because, as noted by Czado and Santner (1992), their estimation depends on the type of link function used. In the first dataset, instead, the estimation of the
parameters α and β is compared because logev also chooses the link function logit. In
the second analysis, 30 datasets of sample size 500 are simulated from a gev model
with α = 1, β = −4, τ = −2. In this case, logev always correctly chooses the gev
model (Figure 13.3), but estimates better in only about 60% of cases (Figure 13.4).
In the third analysis, 30 datasets of sample size 500 are simulated from a gev
model with α = 1, β = 1, τ = −2. In this case, logev almost always correctly
chooses the gev model (Figure 13.5) and estimates better in about 90% of cases
(Figure 13.6).
In the fourth analysis, 30 datasets of sample size 500 are simulated from a
non-monotonic model, in which the mean is a non-monotonic function of the absolute value of the linear predictor α + x_i′β.
The logev method is always able to estimate the model, while logit beta
regression is only able to estimate the model 28 times. In this case, logev chooses
the correct model in almost 75% of cases (Figure 13.7) and estimates better in about
99% of cases (Figure 13.8).
13.5. Conclusion
The study of the link functions in the case of beta regression has, until now, been
poorly developed, and therefore, in this chapter, a new response function has been
proposed, which includes as special cases the asymmetric response function gev and
the symmetric inverse link function logit, both monotonic. The peculiarity of this
inverse link function, which is called logev, is that it can also be non-monotonic. To
estimate its parameters, a modified version of the genetic algorithm is used. The
logev beta regression is compared with logit beta regression using simulated data in
order to know the real model. The logev beta regression estimates much better than
logit beta regression and, in addition, finds the true model effectively in most cases.
Therefore, this new response function greatly improves the study of the relationships
among variables.
13.6. References
Holland, J.H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan
Press, Ann Arbor.
Jiang, X., Dey, D.K., Prunier, R., Wilson, A.M., Holsinger, K.E. (2013). A new class of
flexible link functions with application to species co-occurrence in Cape Floristic Region.
The Annals of Applied Statistics, 7(4), 2180–2204.
Kieschnick, R. and McCullough, B.D. (2003). Regression analysis of variates observed on (0, 1):
Percentages, proportions and fractions. Statistical Modelling, 3(3), 193–213.
Langner, I., Bender, R., Lenz-Tönjes, R., Küchenhoff, H., Blettner, M. (2003). Bias of
maximum-likelihood estimates in logistic and Cox regression models: A comparative
simulation study. Discussion paper 362, Ludwig Maximilian University of Munich.
McCullagh, P. and Nelder, J.A. (1989). Generalized Linear Models, 2nd edition. Chapman
and Hall, London.
Nagler, J. (1994). Scobit: An alternative estimator to logit and probit. American Journal of
Political Science, 38, 230–255.
Qi, M. and Zhang, G.P. (2001). An investigation of model selection criteria for neural
network time series forecasting. European Journal of Operational Research, 132(3),
666–680.
Simas, A.B., Barreto-Souza, W., Rocha, A.V. (2010). Improved estimators for a general class
of beta regression models. Computational Statistics & Data Analysis, 54(2), 348–366.
Smithson, M. and Verkuilen, J. (2006). A better lemon squeezer? Maximum-likelihood
regression with beta-distributed dependent variables. Psychological Methods, 11(1), 54.
Stukel, T.A. (1988). Generalized logistic models. Journal of the American Statistical
Association, 83(402), 426–431.
Wang, X. and Dey, D.K. (2010). Generalized extreme value regression for binary response
data: An application to B2B electronic payments system adoption. The Annals of Applied
Statistics, 4(4), 2000–2023.
Whitley, D. (1994). A genetic algorithm tutorial. Statistics and Computing, 4(2), 65–85.
14
A Method of Big Data Collection and Normalization
Data collection and storage have become the greatest challenges and tedious
processes in data science engineering. Data from various nodes (sensors, bridges,
switches, hubs, etc.) in the environment or in a particular system is collected at the
nodes from which they arrive at the storage point. These types of operations need a
separate workforce to monitor the whole process of data handling. This proposed
work mainly focuses on the data analytics of creating normalized data from
unprocessed data. This reduces the manipulation of data when it is of a different
form. The data may be realistic depending on the system which produces it. The
normal distribution is applied to the collected data to create a dataset distributed according to a continuous probability density function, which extends to infinity in both directions of the axis. The proposed work provides an easy storage and data
retrieval method in the case of large data volumes. The proposed data recovery is
compliant with the conventional data collection methods. This type of data
interpretation provides security and confidentiality of the user’s data.
14.1. Introduction
Data science has long been prevalent in all areas of science in this digital era.
Data science is an interdisciplinary field that fuses science and technologies by using
algorithms, tasks and devices to extract usable data from raw unstructured data. The
extracted data is applied to various domains to gain insights from the data acquired
to refine the required data through efficient searches. George and Groza (2019)
introduced a concept of the extract-transform-load (ETL) that used graduate
attributes in the common form of places and details. Separate attention was given to
the transformation procedures that helped to get two different reports as final results.
The reports were the graduate attribute report per cohort (GAR/C) and course
progression report per cohort (CPR/C). The GAR/C accessed each attribute based on
the average and the CPR/C showed the tracking information based on the
achievement made by the students in their program. The reports were generated
simultaneously to enhance the ETL efficiency. The model paved
the way for the integrated assessment of the database based on the granularity of
the ETL. Sulaiman et al. (2019) proposed incorporating information and communication technologies (ICT) into the power industry by applying modernization concepts to it. The resulting field is the smart power grid, which integrates all the smart meters used in the grid. The grid collects an enormous amount of data from these smart meters and processes it in centralized servers, making the entire grid easier to control and observe from a remote location. In their work, the smart meters act as the backbone of the smart grid, and data science is used to process this huge volume of data, thereby benefitting both the user and the energy supplier in that domain of engineering.
Ghosh and Grolinger (2019) investigated the merging of cloud and edge
computing for IoT data analytics and came up with a deep learning (DL)-based
system to define the data processing along with ML concepts. The encoder is widely
used in all the nodes to reduce data congestion while reading data from the sensors.
These kinds of reduced data from the devices make the big data feasible to the
application. This data has been used directly by the ML algorithm to expand the
original features with the help of a decoder present in the auto-encoder module.
McHann (2013) developed a strategy of collecting data from all the nodes at present
and processing it further at a later stage. But this method requires the data to be stored in a lot of storage devices. The large volumes of data have to be stored in the cloud to
perform ML at later stages. As data storage capacity increases, so do the need for technology infrastructure and the skill sets required to work with that infrastructure, and it is also expensive in terms of time and budget (Bhuiyan et al. 2017). This
chapter is organized as follows. Section 14.2 elaborates on machine learning models in materials science and their application in electronic engineering. It also discusses data acquisition by supervised learning and outlines accessing data
repositories and the data storage, respectively. This section describes the comparison
of the predicted and the actual values. Section 14.3 illustrates the application of
machine learning in electronic engineering. Finally, section 14.4 concludes the work
and recommends the future aspects.
A Method of Big Data Collection and Normalization 189
The application of ML in semiconductors includes learning from the data. The data
has been acquired from the databases or clouds called repositories. Numerous
researchers around the globe perform their simulations and experiments and provide
their valuable results in the common cloud to prove their integrated research
(Karmawijaya et al. 2019; Naveen Balaji et al. 2019; Balaji et al. 2020; Malini et al.
2020). By accessing the data repositories, the ML user can retrieve usable data by
using efficient algorithms and refining the model. The model can be trained in this
environment by undergoing several searches and improved by the feedback given
(Malini et al. 2020).
Data acquisition is the initial step in ML where data has been extracted from the
data repositories based on the user’s search (Malini et al. 2019). The next step is
learning from the data based on mathematical applications such as correlation and
regression. Supervised learning (SL) is the method of backtracking the inputs from
the outputs, thereby establishing the relationship between the input and output
pairs (Casula et al. 2019; Gowthaman and Srivastava 2021a, 2021b; https://www.
wolframalpha.com). SL has been the most prevalent research area in the ML platform
to refine the model created. Unsupervised learning (UL) is the method of creating an algorithm to study the pattern behaviors that exist between the input and output. UL is the most time-consuming process, since it needs a revision of the pattern to create the model (Kampker et al. 2018). The design of experimental procedures plays a major
role in the design of the ML model. This needs a lot of procedures and algorithms to be
used in the ML model (Moradi et al. 2020). Hence, cyberinfrastructure comes into
existence. The famous cyberinfrastructures are Citrine, The Materials Project, Wolfram
Alpha, KIM and Materials Data Facility (MDF) for materials science research
(Karpatne et al. 2017; Tanifuji et al. 2019; Liu and Shur 2020; Gowthaman and
Srivastava 2021c).
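As a minimal illustration of the supervised learning idea described above, fitting the input-output relationship from labeled pairs (the data here is synthetic, not drawn from any repository):

```python
import numpy as np

# Synthetic labeled pairs: y = 2*x + 1 plus small measurement noise
rng = np.random.default_rng(42)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, x.size)

# Supervised learning step: estimate the relationship from the pairs
slope, intercept = np.polyfit(x, y, deg=1)
print(round(slope, 1), round(intercept, 1))  # close to 2.0 and 1.0
```

Once fitted, the model "backtracks" from outputs to the underlying input-output relationship, which is the essence of the SL setting described above.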
The flow of the ML platform for the materials science project has been illustrated
in Figure 14.1. The existing data is present in the data repositories submitted by
researchers around the globe. The predictive model suggests the test run of the search
of the properties of the particular material (Kampker et al. 2018). The search results
have been shared with the information acquisition system for further processing. The
information acquisition system collects the search results from above and sends the
data packets for verification based on the query made on the web front of the
cyberinfrastructure (Liu and Shur 2020). The query has been recorded in the database
of the query information source register. The next step of this process is the
verification of the goals given to the results obtained (Karpatne et al. 2017).
Data acquisition is the most important aspect of data analytics in the materials
science domain. The individual researcher has to devise a platform to make sure the
data acquired is legit and correct as per the requirement (Casula et al. 2019). Every
individual researcher has to collaborate with researchers around the world to make
an integrated search based on the big data (McHann 2013; Moradi et al. 2020). The
next step is the extraction of the result, with its publication metric, from the open repositories.
The high-κ dielectrics have to be chosen for the double-gate (DG) MOSFET designs. The selection of the high-κ dielectric material plays a major role in the operation of the MOSFET with negligible short channel effects (SCEs)
(Hack and Papka 2015; Feng et al. 2019; Gowthaman and Srivastava 2021d). The
dielectric properties of the material compounds are derived from the periodic table for
analysis. This work uses the Wolfram Alpha web front end to predict the values of high-κ dielectrics for suitable inclusion in the DG MOSFET design
(Gowthaman and Srivastava 2021d). A computational intelligence platform like
Wolfram Alpha uses ML procedures to select particular required properties of the
chemical compounds. The dielectric compounds discussed are Al2O3, HfO2, ZrO2,
La2O3, SiO2, etc. (He et al. 2018; Ghosh and Grolinger 2019). The advantage of the data repository is that it is easy to use and to predict results using an Internet platform.
But this has become labor-intensive in case of frequently used materials as they have
many combinations of compounds. Examples of application programming interfaces
(APIs) are the databases – ACK, Citrine Informatics, OQMD, Wolfram Alpha, etc.
(Chen et al. 2018).
Data storage is done in any one of the following formats: .csv (comma separated
values), Numpy arrays, Matlab, pandas. The .csv files are simple and good for
storage in programs like MS Excel (https://www.wolframalpha.com).
The Numpy arrays are good for mathematical operations and processing.
Sometimes, MATLAB data files can be used but this involves a large amount of
computation and system memory. The pandas data format has been used for sorting, parsing and storage (pandas is also called the Excel of Python). These data formats filter data based on logic operations and plot the values accurately (Malini et al. 2019,
2020). The data stored in the system can be used for further processing based on the
needs of the user. The dataset based on the electronic simulation, which shows the
electron density in the valley, has been tabulated in Table 14.1. The normalized data
after statistical processing performed in the raw data has been illustrated in
Table 14.2.
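The storage and normalization steps can be sketched as follows (the column name and values are illustrative, not the chapter's dataset), assuming pandas is installed:

```python
import pandas as pd

# Raw sensor-style data
raw = pd.DataFrame({"node": ["n1", "n2", "n3", "n4"],
                    "electron_density": [3.2, 4.8, 2.9, 4.1]})

# .csv storage: simple, Excel-friendly, easy to retrieve on demand
raw.to_csv("raw_data.csv", index=False)
data = pd.read_csv("raw_data.csv")

# Normalization (z-score): the stored values no longer reveal the
# original magnitudes, which supports the confidentiality argument
col = data["electron_density"]
data["normalized"] = (col - col.mean()) / col.std()
print(data["normalized"].round(3).tolist())
```

The normalized column has mean 0 and unit standard deviation, so an unauthorized reader sees only relative positions, not the raw measurements.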
The physical vapor deposition method to create the elemental compounds using
various elements has been illustrated in Figure 14.3. The marked items are the
compounds which have been used in electronic engineering and applied to the DG
MOSFETs to get rid of SCEs. The elements are as follows: lanthanum, gallium, aluminum and hafnium (Mohammadi et al. 2018). Their properties have been
analyzed for effective usage in electronic applications. The inkjet method of forming
the same compounds has been portrayed in Figure 14.4 for further analysis. These
elements have been displayed for the number of material level analyses to
create/deposit the material (Singh et al. 2017; Singh and Srivastava 2018;
Gowthaman and Srivastava 2021e).
Figure 14.3(a). Dataset derived from the physical vapor deposition device for various high-κ dielectric and semiconductor materials. For a color version of this figure, see www.iste.co.uk/zafeiris/data1.zip
These methods highly depend on the ionic packaging value of the particular
material. Using ML, the time taken to create the material has been reduced
drastically (Chen et al. 2018; Gowthaman and Srivastava 2021e).
Figure 14.3(b). Dataset derived from the inkjet method for various high-κ dielectric and semiconductor materials. For a color version of this figure, see www.iste.co.uk/zafeiris/data1.zip
Figure 14.4. Simulation data for energy sub-bands for valleys of the DG MOSFET.
For a color version of this figure, see www.iste.co.uk/zafeiris/data1.zip
This work mainly focuses on the easy storage of the data and its efficient
retrieval on demand. This concentrates on the large data volumes of databases called
cyber infrastructures (Feng et al. 2019). The data recovery projected in this work is compliant with the conventional method of data retrieval. Comparison with previous research results shows good agreement with the novel data retrieval technique. The confidentiality and the security of the user’s data were ensured by
normalization of the raw data. The unauthorized user cannot determine the type of
data they visualize since it is in a normalized form. Hence, normalization of the data provides an enormous capability for security and confidentiality in the cloud-based data query. The electronic engineering field has been enhanced by the use of normalization and other statistical modeling of the raw data in order to process it.
ML reduces the data processing and data storage compared to non-ML-based
statistical models.
Figure 14.5. Normalized dataset for energy sub-bands attained for the same.
For a color version of this figure, see www.iste.co.uk/zafeiris/data1.zip
The enhanced database architecture and data storage facilities ease the access of databases through larger queries and their assessment. Big data was introduced in this work to reduce human work in data analysis. The normalization of the data helps the user to create a detailed analysis in terms of the processed data.
The idea of ML can be further developed in theoretical form to apply statistical models to the raw data for quicker processing. Training is not required at the data user end, since automated processing of the raw data and reporting of additional queries are used.
14.5. References
Balaji, N., Sethupathi, M., Sivaramakrishnan, N., Theeijitha, S. (2020). EDF-VD scheduling-
based mixed-criticality cyber-physical systems in smart city paradigm. Inventive
Communication and Computational Technologies, Lecture Notes in Networks and
Systems, 89, 931–946.
Bhuiyan, S.M.A., Khan, J.F., Murphy, G.V. (2017). Big data analysis of the electric power
PMU data from the smart grid. SoutheastCon 2017, Concord, NC, USA, 30 March–
2 April 2017, pp. 1–5.
Casula, L., D’Amico, G., Masala, G., Petroni, F., Sobolewski, R.A. (2019). Performance
estimation of a wind farm with a copula dependence structure. 18th Applied Stochastic
Models and Data Analysis International Conference with Demographics Workshop,
Florence, Italy, 11–14 June 2019.
Chen, K., He, Z., Wang, S.X., Hu, J., Li, L., He, J. (2018). Learning-based data analytics:
Moving towards transparent power grids. CSEE Journal of Power and Energy Systems,
4(1), 67–82.
Feng, M., Zheng, J., Ren, J., Hussain, A., Li, X., Xi, Y., Liu, Q. (2019). Big data analytics and
mining for effective visualization and trends forecasting of crime data. IEEE Access, 7,
106111–106123.
George, A. and Groza, V. (2019). Information analytics system database for uniform approach
to continuous engineering program improvement. 15th International Conference on
Engineering of Modern Electric Systems (EMES), Oradea, Romania, 13–14 June 2019,
185–188.
Ghosh, A.M. and Grolinger, K. (2019). Deep learning: Edge-cloud data analytics for IoT.
IEEE Canadian Conference of Electrical and Computer Engineering (CCECE),
Edmonton, AB, Canada, 5–8 May 2019, pp. 1–7.
Gowthaman, N. and Srivastava, V.M. (2021a). Analysis of n-type double-gate MOSFET
(at nanometer scale) using high-K dielectrics for high-speed applications.
44th International Spring Seminar on Electronics Technology, Advancements in
Microelectronics Packaging for Harsh Environment, Dresden, Germany, 6–7 May 2021,
130–131.
Gowthaman, N. and Srivastava, V.M. (2021b). Analysis of InN/La2O3 twosome for
double-gate MOSFETs for radio frequency applications. Third International Conference
on Materials Science and Manufacturing Technology (ICMSMT 2021), Coimbatore,
India, 8–9 April 2021, 1–10.
Gowthaman, N. and Srivastava, V.M. (2021c). Dual gate material (Au and Pt) based
double-gate MOSFET for high-speed devices. IEEE Latin America Electron Devices
Conference (LAEDC), Mexico, 19–21 April 2021, 1–4.
Gowthaman, N. and Srivastava, V.M. (2021d). Design of hafnium oxide (HfO2) sidewall in
InGaAs/InP for high-speed electronic devices. International Conference on Materials
Sciences and Nanomaterials, London, UK, 12–14 July 2021, 1–6.
Gowthaman, N. and Srivastava, V.M. (2021e). Capacitive modeling of cylindrical
surrounding double-gate MOSFETs for hybrid RF applications. IEEE Access, 9,
89234–89242.
Hack, J.J. and Papka, M.E. (2015). Big data: Next-generation machines for big science.
Computing in Science & Engineering, 17(4), 63–65.
He, X., Chu, L., Qiu, R.C., Ai, Q., Ling, Z. (2018). A novel data-driven situation awareness
approach for future grids – Using large random matrices for big data modeling. IEEE
Access, 6, 13855–13865.
Kampker, A., Kreisköther, K., Büning, M.K., Möller, T., Windau, S. (2018). Exhaustive
data- and problem-driven use case identification and implementation for electric drive
production. 8th International Electric Drives Production Conference (EDPC),
Schweinfurt, Germany, 4–5 December 2018, 1–8.
Karmawijaya, M.I., Nashirul Haq, I., Leksono, E., Widyotriatmo, A. (2019). Development of
big data analytics platform for electric vehicle battery management system.
6th International Conference on Electric Vehicular Technology (ICEVT), Bali, Indonesia,
18–21 November 2019, 151–155.
Karpatne, A., Atluri, G., Faghmous, J.H., Steinbach, M., Banerjee, A., Ganguly, A.,
Shekhar, S., Samatova, N., Kumar, V. (2017). Theory-guided data science: A new
paradigm for scientific discovery from data. IEEE Transactions on Knowledge and Data
Engineering, 29(10), 2318–2331, 1 October 2017.
Liu, X. and Shur, M.S. (2020). TCAD model for TeraFET detectors operating in a large
dynamic range. IEEE Transactions on Terahertz Science and Technology, 10(1), 15–20.
Malini, P., Poovika, T., Shanmugavadivu, P., Priya, I.R.P., Balaji, G.N., Rajotiya, R.N.,
Kumar, A., Mashette, G. (2019). 22nm 0.8V strained silicon-based programmable MISR
under various temperature ranges. American Institute of Physics – CF, 2087(020004),
020004-1–020004-12.
Malini, P., Kokila, S., Karthiga, M., Naveen Balaji, G. (2020). Design of hybrid full adder
using full swing and non-full swing XOR XNOR gates. TEST Engineering and
Management, January–February 2020, 2778–2787.
McHann, S.E. (2013). Grid analytics: How much data do you really need? IEEE Rural
Electric Power Conference (REPC), Stone Mountain, GA, USA, 28 April–1 May 2013,
C3–1–C3–4.
Mohammadi, M., Al-Fuqaha, A., Sorour, S., Guizani, M. (2018). Deep learning for IoT big
data and streaming analytics: A survey. IEEE Communications Surveys & Tutorials,
20(4), 2923–2960.
Moradi, J., Shahinzadeh, H., Nafisi, H., Marzband, M., Gharehpetian, G.B. (2020). Attributes
of big data analytics for data-driven decision making in cyber-physical power systems.
14th International Conference on Protection and Automation of Power Systems (IPAPS),
Tehran, Iran, 83–92.
Naveen Balaji, G., Karthiga, M., Swetha, D., Suchitra, M. (2019). Low power design of 0.8V
based 8 bit content addressable memory using MSML implemented in 22nm technology
for aeronautical applications. International Journal of Recent Technology and
Engineering, 8(2S11), 2688–2694.
Singh, M. and Srivastava, V.M. (2018). An analysis of key challenges for adopting the cloud
computing in the Indian education sector. Communications in Computer and Information
Science, 905(1), 439–448, Chapter 44, Springer, Singapore.
198 Data Analysis and Related Applications 1
Singh, M., Srivastava, V.M., Gaurav, K., Gupta, P.K. (2017). Automatic test data generation
based on multi-objective ANT LION optimization algorithm. 28th Annual Symposium of
the Pattern Recognition Association of South Africa and 10th Robotics and Mechatronics
International Conference of South Africa (PRASA-RobMech-2017), Bloemfontein,
South Africa, 30 November–1 December 2017, 168–174.
Sulaiman, S.M., Jeyanthy, P.A., Devaraj, D. (2019). Smart meter data analysis issues: A data
analytics perspective. IEEE International Conference on Intelligent Techniques in
Control, Optimization and Signal Processing (INCOS), Tamilnadu, India, 11–13, April
2019, 1–5.
Tanifuji, M., Matsuda, A., Yoshikawa, H. (2019). Materials data platform – A FAIR system
for data-driven materials science. 8th International Congress on Advanced Applied
Informatics (IIAI-AAI), 7–11 July 2019, 1021–1022.
15
Stochastic Runge–Kutta Solvers Based on Markov Jump Processes
15.1. Introduction
The numerical method presented in this chapter is based on the connection between
the infinitesimal generators of Markov jump processes and corresponding differential
equations. The theoretical background for this property is presented in Ethier and
Kurtz (1986).
Data Analysis and Related Applications 1,
First Edition. Edited by Konstantinos N. Zafeiris, Christos H. Skiadas, Yiannis Dimotikalis, Alex
Karagrigoriou and Christiana Karagrigoriou-Vonta.
© ISTE Ltd 2022. Published by ISTE Ltd and John Wiley & Sons, Inc.
200 Data Analysis and Related Applications 1
A special choice of the Markov jump process, which approximates the solution
of the system of ordinary differential equations Ẋ = F (t, X), delivers the so-called
direct simulation method, where at every jump, only one component of the
vector-valued process is changed. The selection of this component occurs by sampling
according to a probability table, the probability for component i being proportional
to |Fi (t, X)|. Based on this first approximation, Guiaş and Eremeev (2016), Guiaş
(2017) and Guiaş (2019) presented several improvements suitable in principle for
autonomous systems Ẋ = F (X).
The choice Q = F (·, X̃(·)) corresponds to a Picard iteration. We can also take
a deterministic value for the time step, equal to the expected value 1/λ of the
exponentially distributed time step. After this first correction step with the result
denoted by X̄, we can apply further correction steps of the Runge–Kutta type by
taking for Q a polynomial which interpolates between the values F (X ∗ (t)) and
F (X̄(t + Δt)), optionally with an additional intermediate point F (X̄(t + Δt/2)).
The basic stochastic direct simulation scheme for systems of ordinary differential
equations Ẋ = F (t, X), which is then successively improved, delivers paths of a Markov jump
process X̃(·). Its feature is that at every jump, only one component of the process
is changed with a fixed amount ±1/N , which can be interpreted as the resolution
of the method. The component i, which is chosen to be changed in the next step, is
selected at random with a probability proportional to |Fi (t, X̃(t))|. The steps of the
direct simulation method are the following:
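A single jump of this scheme can be sketched in Python (an illustration only; the function name and interface are our own, not from the chapter):

```python
import numpy as np

def direct_simulation_jump(F, t, x, N, rng):
    """One jump of the basic direct simulation scheme: component i is chosen
    with probability proportional to |F_i(t, x)| and changed by 1/N in the
    direction of F_i; the waiting time is exponentially distributed with
    rate lambda = N * sum_i |F_i(t, x)|."""
    f = F(t, x)
    rates = np.abs(f)
    lam = N * rates.sum()
    dt = rng.exponential(1.0 / lam)      # or the deterministic step 1/lam
    i = rng.choice(len(x), p=rates / rates.sum())
    x_new = x.copy()
    x_new[i] += np.sign(f[i]) / N
    return t + dt, x_new
```

Iterating this jump yields a path X̃(·) of the Markov jump process.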
Writing the ODE system on the time interval [t, t + Δt] in the integral form yields:

X(t + Δt) = X(t) + ∫_t^{t+Δt} F(s, X(s)) ds.
Assuming that X̄(t) is an approximation for the exact solution X(t) and that we
have simulated a path X̃(s), t ≤ s ≤ t + Δt of the Markov jump process, we can use
these data in order to compute an approximation for X(t + Δt) which improves the
crude result X̃(t + Δt). This is done by a Picard iteration:
X̄(t + Δt) = X̄(t) + ∫_t^{t+Δt} F(s, X̃(s)) ds.
In the case of autonomous systems, where F does not explicitly depend on t, the
integral is that of a step function and can be computed effectively by updating its value
after every jump of the Markov process X̃(·). This approach was used in Guiaş and
Eremeev (2016), Guiaş (2017), Guiaş (2019) for autonomous systems.
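For autonomous F, the Picard integral is a finite sum over the constant pieces of the jump path. A minimal sketch, assuming the path is stored as jump times together with the state entered at each jump (our own data layout):

```python
import numpy as np

def picard_correction(F, x_bar_t, jump_times, jump_states, t_end):
    """Picard step for autonomous F: integrate the step function
    s -> F(X~(s)) along a simulated jump path on [jump_times[0], t_end].
    jump_times[k] is the time at which the path enters jump_states[k];
    the path is constant between consecutive jumps."""
    integral = np.zeros_like(x_bar_t)
    times = list(jump_times) + [t_end]
    for k, x in enumerate(jump_states):
        integral += F(x) * (times[k + 1] - times[k])
    return x_bar_t + integral
```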
The general correction step has the form X∗(t + Δt) = X∗(t) + ∫_t^{t+Δt} Q(s) ds, where X∗(t) is a given approximation for the exact solution X(t), Δt is the
time between two consecutive jumps and Q(s) is a vector of polynomials which
approximates the exact term F (s, X(s)). Note that in the case of Picard iterations,
we considered Q(s) = F (s, X̃(s)), i.e. we used the path of the simulated Markov
jump process. In this case, by denoting X̄(t + Δt) a predictor computed by one of the
previous methods, the polynomial Q(s) used by the Runge–Kutta steps interpolates
between the values F (t, X ∗ (t)) and F (t + Δt, X̄(t + Δt)), optionally with an
additional intermediate point F (t + Δt/2, X̄(t + Δt/2)).
For the RK2-method, we consider linear interpolation between the values at the
boundaries of the time interval and we therefore have:
X∗(t + Δt) = X∗(t) + (Δt/2) [F(t, X∗(t)) + F(t + Δt, X̄(t + Δt))] [15.1]
This method is similar to the classical second-order Heun method, and we denote it
therefore by RK2. Note that X̄(t + Δt) can be any predictor: the value of the Markov
jump process, the approximation obtained by Picard iteration or an approximation
obtained by another RK2 step based, for example, on a previously performed Picard
step. This variant of the scheme can therefore be applied in several layers.
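Equation [15.1] is a one-line correction; a sketch (the predictor x_pred may be the jump-process value, a Picard step or a previous RK2 step, as described above):

```python
import numpy as np

def rk2_step(F, t, x_star, x_pred, dt):
    """Heun-type RK2 correction of equation [15.1]: a trapezoidal rule in
    which the predictor x_pred stands in for X(t + dt)."""
    return x_star + dt / 2 * (F(t, x_star) + F(t + dt, x_pred))
```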
Stochastic Runge–Kutta Solvers Based on Markov Jump Processes 203
Within this framework, for approximations also using values at the midpoint t +
Δt/2 of the time interval, we therefore have several options. The general scheme
RK23 has the form:
X∗(t + Δt) = X∗(t) + (Δt/6) [F(t, X∗(t)) + 4F(t + Δt/2, X̄(t + Δt/2)) + F(t + Δt, X̄(t + Δt))] [15.2]
By X ∗ (·), we denote the final, i.e. the best approximation for the solution at
the given time points, while X̄(·) are predictors computed by a combination of the
previously described Picard and RK2 methods (theoretically also by direct simulation,
but in this case, the precision is lower, so we do not make this choice here). This
method shows a similarity to the usual Runge–Kutta method of order 3 and, since
for the predictors we use the RK2 scheme, we denote it by RK23. For the value
X̄(t + Δt), we have two options: we compute this value by applying Picard and two RK2
steps starting either directly from X ∗ (t), an algorithm which is denoted by RK23_1,
or starting from the best approximation at t+Δt/2, an algorithm denoted by RK23_2.
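The RK23 correction of equation [15.2] has the same shape, with an additional midpoint predictor (again a sketch with our own interface):

```python
import numpy as np

def rk23_step(F, t, x_star, x_mid, x_end, dt):
    """Simpson-type correction of equation [15.2]: x_mid and x_end are
    predictors for X(t + dt/2) and X(t + dt), computed by Picard/RK2 steps."""
    return x_star + dt / 6 * (F(t, x_star)
                              + 4 * F(t + dt / 2, x_mid)
                              + F(t + dt, x_end))
```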
We can also apply the classical fourth-order Runge–Kutta scheme on the adapted
time interval Δt, which is either exponentially distributed with parameter
λ = N · Σᵢ |Fᵢ(t, X∗(t))|, or deterministic, equal to the expected value 1/λ.
A well-known test problem is the Lorenz system, where a small change in the initial
data leads to a totally different trajectory at larger values of time. The challenge is
therefore to compute an accurate solution at time
values as large as possible, since all numerical schemes induce small approximation
errors which in this case may propagate very fast. Kehlet and Logg (2010) presented a
high precision reference solution at large values of time for σ = 10, b = 8/3, r = 28
and initial data x(0) = (1, 0, 0).
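The parameters σ = 10, b = 8/3 and r = 28 are those of the classical Lorenz system; its right-hand side, written in the F(t, X) form used by the solvers, is:

```python
import numpy as np

def lorenz_rhs(t, x, sigma=10.0, b=8/3, r=28.0):
    """Right-hand side of the Lorenz system (autonomous; the argument t is
    kept only to match the solvers' F(t, X) interface)."""
    return np.array([sigma * (x[1] - x[0]),
                     x[0] * (r - x[2]) - x[1],
                     x[0] * x[1] - b * x[2]])
```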
The results of the methods introduced in this chapter are presented in Figure 15.1.
The error is taken in the ‖·‖₁-norm compared to the reference solution. At times
tmax = 30 or tmax = 40, the solvers RK and RK23_1 have a similar precision to
the MATLAB solvers ode45 and ode113 used for comparison, but the latter ones turn
out to be faster due to the performance of internal MATLAB routines. However, for
tmax = 50, the RK solver delivers better results than the MATLAB solvers (smaller
error and comparable computation time). The RK23_1 solver has a similar error to
the MATLAB solvers but needs a longer computational time.
We consider first a grid that consists of n = 200 points, with hmin = 3.2 · 10⁻³ and
hmax = 2.28 · 10⁻².
The results of the MATLAB solvers ode45 and ode113 (set at the maximum
possible precision) are compared with those of the solvers RK, RK23_1 and RK23_2.
We note that the solvers presented in this chapter basically show similar behavior
and that they can achieve similar precision to the MATLAB solvers, but at a longer
computational time.
This framework of 400 grid points, with spatial step sizes differing by orders of
magnitude (ranging from 2 · 10⁻⁴ to 2.7 · 10⁻²), proves to be more difficult for the
high-precision MATLAB solvers: ode113 is unable to perform the computation due to
memory problems, and only the solvers ode15s and ode23s, suited for stiff problems,
manage to compute a solution. However, the solvers RK and RK23_1 show better
precision than the MATLAB solvers, with RK23_1 performing best for this problem.
15.4. Conclusion
Starting from the direct simulation method using Markov jump processes,
we developed a class of numerical schemes suited for non-autonomous ordinary
differential equations. After every jump of the underlying process, which occurs on
an adapted time interval, scalable by a given factor which controls the magnitude of
the jumps, we performed Picard iterations and different variants of Runge–Kutta steps.
The result turns out to be a highly efficient scheme, relatively easy to implement, with
precision similar to or even better than that delivered by standard MATLAB solvers.
15.5. References
Ethier, S.N. and Kurtz, T.G. (1986). Markov Processes: Characterization and Convergence.
John Wiley & Sons, New York.
Guiaş, F. (2017). Stochastic Picard–Runge–Kutta solvers for large systems of autonomous
ordinary differential equations. Proceedings of the Fourth International Conference on
Mathematics and Computers in Sciences and in Industry (MCSI), 298–302, Corfu, Greece.
Guiaş, F. (2019). High precision stochastic solvers for large autonomous systems of differential
equations. International Journal of Mathematical Models and Methods in Applied Sciences,
13, 60–63.
Guiaş, F. and Eremeev, P. (2016). Improving the stochastic direct simulation method
with applications to evolution partial differential equations. Applied Mathematics and
Computation, 289, 353–370.
Kehlet, B. and Logg, A. (2010). A reference solution for the Lorenz system on [0,1000]
[Online]. Available at: https://doi.org/10.1063/1.3498141.
16
Interpreting a Topological Measure of Complexity for Decision Boundaries
Our method focuses on binary classification, using training data to sample points
on the decision boundary of the feature space. We then calculate the persistent
homology of this sample and compute metrics to quantify the complexity of the
decision boundary. Our experiments with data sets in various dimensions suggest that
in certain cases, our measures of complexity are correlated with a model’s ability
to generalize to unseen data. We hope that refining this method will lead to a better
understanding of overfitting and a means to compare models.
16.1. Introduction
Chapter written by Alan HYLTON, Ian LIM, Michael MOY and Robert SHORT.
For a color version of all the figures in this chapter, see www.iste.co.uk/zafeiris/data1.zip.
shape of the decision boundary of a classification algorithm and interpret the results.
This geometric approach has become a recurrent theme in machine learning and data
science: our contribution is development and formalization towards a technique that
relies on TDA to evaluate a trained neural network independent of a validation data
set. We include an introduction to our chosen computational approach to TDA, as well
as evidence that it is sensitive to the shape of a decision boundary.
The choice of neural networks is not surprising, as they are the foremost drivers in
the demand for applying pure mathematics to applied problems, including algebraic
topology. This is largely due to their great ability to learn complex relationships in
data sets, as well as the difficulty of “looking under the hood”. We hasten to add that
these methods generally apply to classification algorithms, and with the ever-growing
presence of machine learning in everyday applications, it will be crucial to gain a
better understanding of their functions and reliability.
There are several methods of machine learning model evaluation such as train/test
split, cross-validation and various metrics for accuracy, all of which give some sort of
indication of how a model will generalize to unseen data. Ideally, we would like to
determine the efficacy of a model based on the training data alone. With this in mind,
measures of accuracy would not be reliable in the event the model is biased towards
the training data – this is what motivates us to seek to analyze the decision boundaries
learned by models.
In Figure 16.1, suppose the green and black lines represent two decision
boundaries determined by a binary classification model. The green line represents a
model that achieves perfect training accuracy but is likely biased towards the training
data. The black line represents a model with some flaws, but which will likely perform
better on unseen data. It is possible to distinguish this difference in complexity
between models visually in low-dimensional data sets, but it can also be viewed via
techniques from TDA for data sets in higher dimensions. A member of the toolkit
of TDA known as persistent homology, which is constructed from ideas in algebraic
topology, estimates the shape of a data set by detecting connected components, holes,
voids and other higher-dimensional features. In Figure 16.1, persistent homology can
be used to measure the complexity of the shapes (lines) at play – this will be made
more clear in section 16.2.
Our study shows that persistent homology can distinguish between decision
boundaries such as the examples pictured in Figure 16.1. We provide evidence that
as a model becomes biased towards the training data set, it is notable via its decision
boundary, including in higher-dimensional data sets. We focus on neural networks
because of the complex relationships they are capable of learning and because their
iterative training process allows us to view the changes in a decision boundary as a
network trains. A proposed metric, which summarizes the topological complexity, is
demonstrated on synthetic data sets with varying amounts of noise. As the noise in a
data set increases, so does our metric; indeed, noisier data sets are more likely to lead
to overfit models, and this tendency is visible in the metric.
Our work is not alone in incorporating persistent homology into machine learning.
Methods have recently been proposed to transform persistence diagrams into a more
suitable input for machine learning algorithms: see Adams et al. (2017), Hofer et al.
(2018) and Zhao and Wang (2019). The interest in applying persistent homology to
machine learning tasks comes from its ability to provide a summary of the geometric
and topological features of a data set. Applications come from a variety of fields: see
for instance Carlsson et al. (2008), Bendich et al. (2016) and Motta et al. (2018).
Other work has used persistent homology to analyze machine learning algorithms,
such as Naitzat et al. (2020), and work similar to ours can be found in Varshney and
Ramamurthy (2015), Chen et al. (2019) and Ramamurthy et al. (2019). The main difference in our
work is the explicit sampling of a decision boundary to describe its shape.
In Figure 16.2, the points are sampled from a circle. Instead of recovering the circle
in its entirety – location and radius – we are more interested in the fact that there is
a circle: a loop enclosing a single one-dimensional hole. Moreover,
while this is visible for this curve, we want to recover this type of information
for less-familiar shapes of arbitrary dimensions. This is the function of persistent
homology; the member of the TDA toolkit we will use quantifies the shape of a
decision boundary. We give a brief overview here; see Ghrist (2008) for a more
thorough introduction and Zomorodian and Carlsson (2005) and Edelsbrunner and
Harer (2010) for mathematical details. We will consider data sets consisting of points
in some Euclidean space Rn . We begin by building additional structures on such a
point set, in order to study the shape outlined by the points without any other given
information.
With this ability to construct simplicial complexes on data sets, we would like
to have a systematic way to describe the shape of the data set. Forming a quantitative
description of a shape poses a challenge. Another challenge is to choose an appropriate
r that determines the simplicial complex VR(X; r): without prior knowledge about
the data set, there is no clear way to choose an r such that the resulting simplicial
complex accurately depicts the data. Persistent homology offers a solution to both
of these problems. It is based on homology, which is a method from the field
of algebraic topology to characterize the holes in a space. If K is a simplicial
complex, then roughly speaking, the homology vector space Hk (K) has a dimension
equal to the number of k-dimensional holes in K. For instance, a zero-dimensional
hole is a connected component, a one-dimensional hole is a circular hole and a
two-dimensional hole is a spherical hole.
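As a concrete illustration (our own code, not the chapter's): in zero-dimensional persistent homology, every point is born as its own component at scale 0, and a component dies at the Vietoris–Rips radius that first merges it into another, so the H0 diagram can be computed by a union-find over edges sorted by length:

```python
import numpy as np

def h0_persistence(X):
    """Zero-dimensional persistence diagram of a point cloud via single
    linkage: every point is born at 0; a component dies at the edge length
    that first merges it (the deaths are the MST edge lengths, found here
    with Kruskal's algorithm). One component never dies (death = inf)."""
    n = len(X)
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a
    # All pairwise edges, sorted by length.
    edges = sorted((np.linalg.norm(X[i] - X[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)
    return [(0.0, d) for d in deaths] + [(0.0, np.inf)]
```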
used to understand the global shape of the data set. On the other hand, a significant
number of points near the diagonal can contain information about small-scale features
of the data set. An example of a persistence diagram generated from a uniform random
sample of points on the surface of a sphere is shown in Figure 16.5. The sphere
has one zero-dimensional hole, signifying that it has one connected component, no
one-dimensional holes and one two-dimensional hole (void). Thus, in Figure 16.5, we
see evidence that the data set is sphere-shaped – at least in the eyes of topology. Indeed,
there is one blue point high above the diagonal showing that the dimension of H0 is
likely one; there are no orange points showing strong evidence for H1 , and the blue
point for H2 high above the diagonal identifies the two-dimensional void: the data set
most likely outlines a sphere. We can also note that the single point in H0 at height ∞
records the fact that one connected component never dies.
Our study makes use of the Scikit-TDA (Saul and Tralie 2019) Python library,
and specifically the Ripser package, which computes persistent homology using
Vietoris–Rips complexes as described above. We aim to use persistent homology to
characterize the complexity of a set of points; rather than use an entire persistence
diagram, we simplify further to the sum of lifetimes of the points in a diagram. Thus,
for a point set X ⊂ Rn , we define
Sk(X) = Σ_{(b,d) ∈ dgmk(X), d ≠ ∞} (d − b) [16.1]
where dgmk (X) is the persistence diagram for X resulting from k-dimensional
persistent homology using Vietoris–Rips complexes. We focus on S0 and S1 , as
persistent homology can be computed quickly in dimensions 0 and 1.
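Equation [16.1] is straightforward to evaluate from a persistence diagram; a sketch (with Ripser, the diagrams for a point cloud X would be obtained as ripser(X, maxdim=1)['dgms']):

```python
import numpy as np

def sum_of_lifetimes(dgm):
    """S_k of equation [16.1]: the sum of lifetimes (death - birth) over the
    finite points of a k-dimensional persistence diagram, given as an
    (n, 2) array of (birth, death) pairs."""
    dgm = np.asarray(dgm, dtype=float)
    if dgm.size == 0:
        return 0.0
    finite = np.isfinite(dgm[:, 1])      # drop the point with death = infinity
    return float(np.sum(dgm[finite, 1] - dgm[finite, 0]))
```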
16.3. Methodology
For each neural network, we construct a wide and deep-layer architecture with
hidden layers that are equipped with the ReLU activation function. Since we only
consider binary classification, the output layers contain two nodes with the softmax
activation function. We use TensorFlow (Abadi et al. 2015) to train neural networks.
In each example, we use neural networks that are larger than necessary to allow them
to overfit, since our goal is to examine the decision boundary of overfit networks.
For emphasis, we recall that our techniques can also be applied to other
classification algorithms; we only use neural networks as an example. We also focus
on binary classification for simplicity, as this leads to a single decision boundary
between two classes. The ideas presented here can likely be generalized to work for
data sets divided into more than two classes, but we will leave it to future work to
determine how this can best be done.
Our analysis relies on taking a representative sample of points from the decision
boundary in order to characterize its shape. In general, the decision boundary may
be an infinitely large region in Rn , so we require a method that samples the relevant
portion. Our method for sampling the decision boundary uses the training data, along
with the classes assigned to the data by the algorithm, to determine the portion of the
boundary that is sampled. To generate m sample points of the decision boundary, we
begin with m pairs of data points, where each pair has one point from each class; we
sample points close to the decision boundary by choosing the m pairs with minimal
distances between their points. Then, for each pair, we search on the line segment
between the two points of the pair for a point on the decision boundary, simply using
bisection to find a point x such that f (x) ≈ .5.
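The pairing-and-bisection procedure can be sketched as follows (our own interface: f returns the model's probability for the first class, points in X_a are assumed to satisfy f > .5 and points in X_b f < .5):

```python
import numpy as np

def sample_decision_boundary(f, X_a, X_b, m, tol=1e-3, max_iter=50):
    """Sample m points near the decision boundary of a binary classifier f
    by bisecting between the m closest pairs of opposite-class points."""
    # All pairwise distances between the two classes; keep the m closest pairs.
    d = np.linalg.norm(X_a[:, None, :] - X_b[None, :, :], axis=-1)
    idx = np.dstack(np.unravel_index(np.argsort(d, axis=None)[:m], d.shape))[0]
    samples = []
    for i, j in idx:
        lo, hi = X_a[i], X_b[j]          # f(lo) > .5 > f(hi) by assumption
        for _ in range(max_iter):
            mid = (lo + hi) / 2
            p = f(mid)
            if abs(p - .5) < tol:
                break
            lo, hi = (mid, hi) if p > .5 else (lo, mid)
        samples.append(mid)
    return np.array(samples)
```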
16.3.3. Procedure
For each data set, we ran multiple experiments with varying amounts of noise
added, following the steps below:
1) We choose n neural networks to train and a period of m epochs.
2) Each data set is split into a training set and a test set.
3) A neural network trains for m epochs, then we determine the training accuracy,
testing accuracy and a sample of points from the decision boundary.
4) Persistent homology is computed from the decision boundary sample, and S0
and S1 are computed from the persistence diagrams.
5) Steps 3 and 4 are repeated until the neural network achieves a specific training
accuracy or reaches a set maximum training time.
6) The process repeats for all n neural networks.
To examine the data recorded in the experiments, S0 and S1 are compared to the
difference in the training accuracy and test accuracy of a model. This difference is a
reliable metric to determine if a model is overfit, and for convenience, we will refer to
this as the overfitness of a model.
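Steps 1–6, together with the overfitness record, can be organized as a loop of the following shape (all callables are stand-ins for the chapter's components; this is a skeleton, not a real implementation):

```python
def run_experiment(train_block, accuracy, sample_boundary, summaries,
                   n_networks, target_acc, max_blocks):
    """Skeleton of the procedure in section 16.3.3. train_block(net) trains
    network net for m more epochs and returns the model; accuracy(model)
    returns (train_acc, test_acc); summaries(sample) returns (S0, S1)."""
    records = []
    for net in range(n_networks):
        for _ in range(max_blocks):
            model = train_block(net)
            train_acc, test_acc = accuracy(model)
            s0, s1 = summaries(sample_boundary(model))
            # train minus test accuracy is the "overfitness" of the model
            records.append((net, train_acc - test_acc, s0, s1))
            if train_acc >= target_acc:
                break
    return records
```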
It seems reasonable to begin with data sets we can concretely depict. Consider
500 points randomly sampled from a torus and 500 points randomly sampled from a
sphere in R3 with the sphere nested inside the torus. A reasonable decision boundary
distinguishing these two sets of data would take on a circular profile.
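Such nested data sets are easy to generate directly (a numpy sketch with radii of our own choosing; the chapter itself uses the TaDAsets package):

```python
import numpy as np

def sample_torus(n, R=2.0, r=0.5, rng=None):
    """Sample n points on a torus in R^3 (angles drawn uniformly; the radii
    R and r are our choice)."""
    rng = np.random.default_rng(rng)
    u, v = rng.uniform(0, 2 * np.pi, (2, n))
    return np.column_stack(((R + r * np.cos(v)) * np.cos(u),
                            (R + r * np.cos(v)) * np.sin(u),
                            r * np.sin(v)))

def sample_sphere(n, radius=1.0, rng=None):
    """Uniform sample on a sphere via normalized Gaussian vectors."""
    rng = np.random.default_rng(rng)
    p = rng.standard_normal((n, 3))
    return radius * p / np.linalg.norm(p, axis=1, keepdims=True)
```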
The TaDAsets package of the Scikit-TDA (Saul and Tralie 2019) Python library
was used to produce the data sets in Figure 16.6 and to add a given percentage of noise.
We considered three such data sets, with noise levels of 0%, 10% and 20%, to allow
for a neural network to become overfit. For each data set, 30% of the points were
reserved as a test set, and 25 neural networks, with 20 hidden layers consisting of 15
nodes each, were trained on the remaining 70%. Every 250 epochs, a sample of 700
points was taken from the decision boundary to calculate S0 and S1 . For this data set,
the decision boundary is expected to have a one-dimensional hole, as the boundary
is expected to look like a circle, so S1 is considered in this case. This process was
repeated until each neural network achieved a 98% accuracy on the training data.
In Figures 16.7 and 16.8, we compare the overfitness of each neural network to S0
and S1 for each noise level. In addition to this summary, Table 16.1 shows the Pearson
correlation coefficients of the overfitness compared to S0 and S1 for each data set.
As noise increases, it is expected that the complexity of the decision boundary should
increase as well, and these metrics now provide a way to quantify this behavior. We
see that in the presence of noise, S0 has significant correlation with overfitness as
r > .8. We also see notable correlation for S1 , and this is likely due to the fact that
the decision boundary is expected to have a one-dimensional hole. Furthermore, in
Figures 16.7 and 16.8, we see that each group, corresponding to a different data set,
seems to cluster in three different regions. This means that these metrics not only
measure noise, but can distinguish one boundary from another to some degree.
Next, we consider some examples of simple data sets in various dimensions. The
data points were uniformly sampled from a unit ball in Rn ; we ran tests in dimensions
n = 2, 4, 6, 8, and 10. The data points were divided into two classes based on what
side of a randomly chosen hyperplane they were on, except we allowed an overlap
between the classes near the hyperplane. For each value of n above, we created data
sets with the two classes overlapping 5%, 10%, 20% and 30% by volume, with the goal
of observing the different amounts of overfitting resulting from the different amounts
of overlap. For each dimension n, the number of training data points was chosen
to allow neural networks to gradually overfit over the course of a sufficiently long
training period. In each case, test data was generated according to the same distribution
used for the training data.
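One way to generate data of this kind (a sketch: for simplicity the overlap here is a fixed band around the hyperplane, rather than the chapter's overlap-by-volume construction; all names are our own):

```python
import numpy as np

def hyperplane_ball_data(n_points, dim, overlap_width=0.1, rng=None):
    """Uniform points in the unit ball of R^dim, labeled by the side of a
    random hyperplane through the origin, with labels randomized inside a
    band of half-width overlap_width around the hyperplane."""
    rng = np.random.default_rng(rng)
    # Uniform in the ball: unit directions from normalized Gaussians,
    # radii distributed as U^(1/dim).
    g = rng.standard_normal((n_points, dim))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    X = g * rng.uniform(0, 1, (n_points, 1)) ** (1 / dim)
    w = rng.standard_normal(dim)
    w /= np.linalg.norm(w)
    s = X @ w
    y = (s > 0).astype(int)
    band = np.abs(s) < overlap_width
    y[band] = rng.integers(0, 2, band.sum())   # classes overlap near the plane
    return X, y
```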
Periodically during the training process, a sample of 500 points was taken from
the decision boundary. At each step, persistence diagrams were computed from the
sample, and S0 and S1 were recorded. The training accuracies and test accuracies at
each step were also recorded.
Results for the case of 30% overlap in dimensions 4 and 6 are shown in Figures 16.9
and 16.10. Each includes the samples from all networks.
Table 16.2 shows the Pearson correlation coefficients r for S0 and overfitness. In
each case, this was calculated from the data from all networks and each point in the
training process at which the decision boundary was sampled.
Positive values of r indicate that a more overfit model tends to produce a higher
value of S0 . Furthermore, the greater values of r found for higher percentages of
overlap suggest that S0 is in fact sensitive to the amount of overfitting. The correlations
tended to be highest in dimensions 4 and 6. While the correlations shown here were
calculated from the collective trials of all networks with a given dimension and
percentage of overlap, the samples taken during the training of an individual network
often resulted in higher correlation coefficients. For instance, in trials with dimension
6 and a 30% overlap, the individual correlation coefficient for a network was greater
than .8 for 20 out of the 25 networks, and greater than .9 for 16 out of the 25 networks.
As an example, results from one such network with correlation coefficient r = .904
are shown in Figure 16.11.
Our experiments suggest that S0 and S1 are correlated with a model’s ability to
generalize to unseen data. We interpret this as an indication that persistent homology
is sensitive to the shape of a decision boundary, where a more complicated decision
boundary produced by an overfit model results in a more complicated persistence
diagram. A key feature of our method was the sampling of a decision boundary, where
the training data was used to find the relevant portion of the decision boundary. Our
approach gives evidence that topological measures of a decision boundary can give
insight into the quality of a model.
There is a wide variety of opportunities to expand upon this work. We list some
ideas for future work here:
1) Experiments with other classification algorithms such as logistic regression,
random forests or k-nearest neighbors.
2) Decision boundary sampling techniques: the method presented here only
demonstrates that the decision boundary can be sampled in a meaningful way. More
work is needed to determine how a high-quality sample can be obtained.
3) Incorporating persistence diagram data into the training process (see for
instance Chen et al. (2019)).
4) Investigation of measures of complexity other than S0 and S1 that could be
extracted from persistence diagrams.
5) Applications of these methods to higher-dimensional data and real data sets. For
instance, experiments could be done with image data or with applications of natural
language processing.
The simple examples considered here have demonstrated the ability of persistent
homology to describe the shape of a decision boundary. However, much more work is
needed to create a method suitable for practical applications. We hope that this work
motivates further study of the topological complexity of a model, as well as other work
applying TDA to machine learning.
16.6. References
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S.,
Davis, A., Dean, J., Devin, M. et al. (2015). TensorFlow: Large-scale machine learning
on heterogeneous systems [Online]. Available at: tensorflow.org.
Adams, H., Chepushtanova, S., Emerson, T., Hanson, E., Kirby, M., Motta, F., Neville, R.,
Peterson, C., Shipman, P., Ziegelmeier, L. (2017). Persistence images: A stable vector
representation of persistent homology. Journal of Machine Learning Research, 18(8), 1–35.
Bendich, P., Marron, J.S., Miller, E., Pieloch, A., Skwerer, S. (2016). Persistent homology
analysis of brain artery trees. The Annals of Applied Statistics, 10(1), 198.
Carlsson, G., Ishkhanov, T., de Silva, V., Zomorodian, A. (2008). On the local behavior of
spaces of natural images. International Journal of Computer Vision, 76, 1–12.
Chen, C., Ni, X., Bai, Q., Wang, Y. (2019). A topological regularizer for classifiers via persistent
homology. In Proceedings of Machine Learning Research, Chaudhuri, K. and Sugiyama, M.
(eds). 89, 2573–2582, 16–18 April 2019.
Edelsbrunner, H. and Harer, J. (2010). Computational Topology – An Introduction. American
Mathematical Society, Providence, RI.
Ghrist, R. (2008). Barcodes: The persistent topology of data. Bulletin of the American
Mathematical Society, 45, 61–75.
Hofer, C., Kwitt, R., Niethammer, M., Uhl, A. (2018). Deep learning with topological
signatures. arXiv preprint arXiv:1707.04041.
Motta, F.C., Neville, R., Shipman, P.D., Pearson, D.A., Bradley, R.M. (2018). Measures of
order for nearly hexagonal lattices. Physica D: Nonlinear Phenomena, 380, 17–30.
Naitzat, G., Zhitnikov, A., Lim, L.-H. (2020). Topology of deep neural networks. Journal of
Machine Learning Research, 21(184), 1–40.
Ramamurthy, K.N., Varshney, K., Mody, K. (2019). Topological data analysis of decision
boundaries with application to model selection. In Proceedings of the 36th International
Conference on Machine Learning, Chaudhuri, K. and Salakhutdinov, R. (eds). 97,
5351–5360, Long Beach, California, 9–15 June 2019.
Saul, N. and Tralie, C. (2019). Scikit-TDA: Topological data analysis for Python [Online].
Available at: https://github.com/scikit-tda/scikit-tda.
Varshney, K.R. and Ramamurthy, K.N. (2015). Persistent topology of decision boundaries.
Proceedings of the IEEE International Conference on Acoustic Speech Signal Processing,
3931–3935, Brisbane, Australia, April 2015.
Zhao, Q. and Wang, Y. (2019). Learning metrics for persistence-based summaries and
applications for graph classification. Advances in Neural Information Processing Systems,
9859–9870.
Zomorodian, A. and Carlsson, G. (2005). Computing persistent homology. Discrete &
Computational Geometry, 33, 249–274.
17
The Minimum Renyi’s Pseudodistance
Estimators for Generalized Linear Models
17.1. Introduction
Data Analysis and Related Applications 1, First Edition. Edited by Konstantinos N. Zafeiris, Christos H. Skiadas, Yiannis Dimotikalis, Alex Karagrigoriou and Christiana Karagrigoriou-Vonta.
© ISTE Ltd 2022. Published by ISTE Ltd and John Wiley & Sons, Inc.
224 Data Analysis and Related Applications 1
The lack of robustness of the maximum likelihood estimator (MLE) and the
maximum quasi-likelihood estimators (MQLE) in GLMs has been widely studied
in the literature. For this reason, robust procedures for GLMs have been considered
to robustify the classical MLE. Among others, Stefanski et al. (1986) studied
optimally bounded score functions for the GLM and generalized the results obtained
by Krasker and Welsch (1982) for classical LRM. However, the robust estimator
proposed by Stefanski et al. (1986) is difficult to compute. In this line, Künsch et al.
(1989) introduced the so-called conditionally unbiased bounded-influence estimate.
The development of robust estimators for the GLM continued with the work of
Morgenthaler (1992) and more recently Cantoni and Ronchetti (2001) proposed a
robust approach based on robust quasi-deviance functions, which simultaneously
performs parameter estimation and variable selection. Another class of M-estimators
was proposed by Bianco and Yohai (1996) and further studied by Croux and
Haesbroeck (2003). Bianco et al. (2013) proposed general M-estimators with missing
values in the responses. Later, Valdora and Yohai (2014) proposed a family of robust estimators for the GLM, based on M-estimators after stabilizing the response variance. Along the same lines, Ghosh and Basu (2016) presented a robust family of estimators based on the density power divergence (DPD) approach, but assuming a fixed design matrix.
On the other hand, Broniatowski et al. (2012) considered for the first time the minimum Renyi's pseudodistance (RP) estimators for the LRM, and they studied their robustness properties. Based on these minimum RP estimators, Castilla et al. (2020b) introduced and studied Wald-type tests for the parameters of the LRM. Later, the results were extended to the context of the high-dimensional LRM in Castilla et al. (2020a), combining the robust loss given by the RP and regularization methods. Some interesting results for independent but non-identically distributed observations (i.n.i.d.o.) were obtained in Castilla et al. (2021). Following these previous works, in this chapter, we consider the minimum RP estimators for the GLMs under the i.n.i.d.o. setup.
The Minimum Renyi’s Pseudodistance Estimators for Generalized Linear Models 225
In the following, we assume that the explanatory variables, x_i, are fixed, and therefore the random response variables Y_i are independent but non-homogeneous observations, i.e. we deal with the i.n.i.d.o. setup studied in Castilla et al. (2021). Let us then consider the i.n.i.d.o. random variables Y_1, ..., Y_n with density functions g_1, ..., g_n, respectively, with respect to some common dominating measure. The true densities g_i are modeled by the density functions given in [17.1], belonging to the exponential family. As pointed out in section 17.1, we will denote these density functions by f_i(y, β, φ), highlighting their dependence on the regression vector β, the nuisance parameter φ and the observation index i, i = 1, ..., n. For each observation, the RP between f_i(y, θ) and g_i can be defined for positive values of α as

R_\alpha(f_i(y,\theta), g_i) = \frac{1}{\alpha+1} \log \int f_i(y,\theta)^{\alpha+1}\,dy - \frac{1}{\alpha} \log \int f_i(y,\theta)^{\alpha} g_i(y)\,dy + k,   [17.2]
where

k = \frac{1}{\alpha(\alpha+1)} \log \int g_i(y)^{\alpha+1}\,dy
does not depend on θ = (β, φ). Since we only have one observation of each random variable Y_i, the best way to estimate the true density g_i is to assume that the distribution is degenerate at the observation y_i. Accordingly, we denote by g_i the density function of the variable degenerate at the point y_i. Then, [17.2] yields the loss

R_\alpha(f_i(y,\theta), g_i) = \frac{1}{\alpha+1} \log \int f_i(y,\theta)^{\alpha+1}\,dy - \frac{1}{\alpha} \log f_i(y_i,\theta)^{\alpha} + k.   [17.3]
At α = 0, the RP loss can be defined by taking the continuous limit:

R_0(f_i(y,\theta), g_i) = \lim_{\alpha \downarrow 0} R_\alpha(f_i(y,\theta), g_i) = -\log f_i(y_i,\theta) + k.   [17.4]
Based on the previous idea, we consider the objective function

H_n^{\alpha}(\theta) = \frac{1}{n} \sum_{i=1}^{n} V_i(Y_i,\theta), \qquad V_i(Y_i,\theta) = \frac{f_i(Y_i,\theta)^{\alpha}}{L_\alpha^i(\theta)}, \qquad L_\alpha^i(\theta) = \Big(\int f_i(y,\theta)^{\alpha+1}\,dy\Big)^{\frac{\alpha}{\alpha+1}},   [17.5]

and then the minimum RP estimator, \widehat{\theta}_\alpha, for the common parameter vector θ is given by

\widehat{\theta}_\alpha = \arg\max_{\theta} H_n^{\alpha}(\theta),   [17.6]

with H_n^{\alpha}(\theta) defined in [17.5] for α > 0 and H_n^{0}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log f_i(Y_i,\theta). Note that at α = 0, the minimum RP estimator coincides with the MLE, and therefore the proposed family can be considered a generalization of the maximum likelihood procedure.
Differentiating with respect to β, we obtain

\frac{\partial V_i(y_i,\beta,\phi)}{\partial \beta} = \frac{1}{L_\alpha^i(\beta,\phi)^2} \Big[ \alpha f_i(y_i,\beta,\phi)^{\alpha} \frac{\partial \log f_i(y_i,\beta,\phi)}{\partial \beta} L_\alpha^i(\beta,\phi) - \alpha \Big(\int f_i(y,\beta,\phi)^{\alpha+1}\,dy\Big)^{\frac{\alpha}{\alpha+1}-1} \int f_i(y,\beta,\phi)^{\alpha+1} \frac{\partial \log f_i(y,\beta,\phi)}{\partial \beta}\,dy \; f_i(y_i,\beta,\phi)^{\alpha} \Big].
Following Ghosh and Basu (2016), we can rewrite the previous partial derivatives as

\frac{\partial \log f_i(y_i,\beta,\phi)}{\partial \beta} = \frac{y_i - \mu_i}{\mathrm{Var}(Y_i)\, g'(\mu_i)}\, x_i = K_{1i}(y_i,\beta,\phi)\, x_i

and

\frac{\partial \log f_i(y_i,\beta,\phi)}{\partial \phi} = -\frac{y_i\theta_i - b(\theta_i)}{a(\phi)^2}\, a'(\phi) + \frac{\partial c(y_i,\phi)}{\partial \phi} = K_{2i}(y_i,\beta,\phi).
Then, substituting into the first equation, we have that the estimating equations for the parameter β are given by

\sum_{i=1}^{n} \frac{x_i}{L_\alpha^i(\beta,\phi)} \{M_i(y_i,\beta,\phi) - N_i(y_i,\beta,\phi)\} = 0_k,   [17.7]

where

M_i(y_i,\beta,\phi) = f_i(y_i,\beta,\phi)^{\alpha}\, K_{1i}(y_i,\beta,\phi)

and

N_i(y_i,\beta,\phi) = \frac{f_i(y_i,\beta,\phi)^{\alpha}}{\int f_i(y,\beta,\phi)^{\alpha+1}\,dy} \int f_i(y,\beta,\phi)^{\alpha+1}\, K_{1i}(y,\beta,\phi)\,dy.
In relation to the estimating equation for φ, we have

\frac{\partial V_i(y_i,\beta,\phi)}{\partial \phi} = \frac{1}{L_\alpha^i(\beta,\phi)^2} \Big[ \alpha f_i(y_i,\beta,\phi)^{\alpha} \frac{\partial \log f_i(y_i,\beta,\phi)}{\partial \phi} L_\alpha^i(\beta,\phi) - \alpha \Big(\int f_i(y,\beta,\phi)^{\alpha+1}\,dy\Big)^{\frac{\alpha}{\alpha+1}-1} \int f_i(y,\beta,\phi)^{\alpha+1} \frac{\partial \log f_i(y,\beta,\phi)}{\partial \phi}\,dy \; f_i(y_i,\beta,\phi)^{\alpha} \Big]

= \frac{1}{L_\alpha^i(\beta,\phi)^2} \Big[ \alpha f_i(y_i,\beta,\phi)^{\alpha} \frac{\partial \log f_i(y_i,\beta,\phi)}{\partial \phi} L_\alpha^i(\beta,\phi) - \alpha \frac{L_\alpha^i(\beta,\phi)}{\int f_i(y,\beta,\phi)^{\alpha+1}\,dy} \int f_i(y,\beta,\phi)^{\alpha+1} \frac{\partial \log f_i(y,\beta,\phi)}{\partial \phi}\,dy \; f_i(y_i,\beta,\phi)^{\alpha} \Big].
Thus, the estimating equation for φ is given by

\sum_{i=1}^{n} \frac{1}{L_\alpha^i(\beta,\phi)} \{M_i^{*}(y_i,\beta,\phi) - N_i^{*}(y_i,\beta,\phi)\} = 0,   [17.8]
where

M_i^{*}(y_i,\beta,\phi) = f_i(y_i,\beta,\phi)^{\alpha}\, K_{2i}(y_i,\beta,\phi)

and

N_i^{*}(y_i,\beta,\phi) = \frac{f_i(y_i,\beta,\phi)^{\alpha}}{\int f_i(y,\beta,\phi)^{\alpha+1}\,dy} \int f_i(y,\beta,\phi)^{\alpha+1}\, K_{2i}(y,\beta,\phi)\,dy.
Castilla et al. (2021) proved that the minimum RP estimator (\widehat{\beta}_\alpha, \widehat{\phi}_\alpha) is consistent and asymptotically normal under some regularity conditions. Before stating the asymptotic distribution, let us introduce some useful notation. We define the quantities

m_{jli}(\beta,\phi) = \frac{1}{\int f_i(y,\beta,\phi)^{\alpha+1}\,dy} \int f_i(y,\beta,\phi)^{\alpha+1}\, K_{ji}(y,\beta,\phi)\, K_{li}(y,\beta,\phi)\,dy,

m_{ji}(\beta,\phi) = \frac{1}{\int f_i(y,\beta,\phi)^{\alpha+1}\,dy} \int f_i(y,\beta,\phi)^{\alpha+1}\, K_{ji}(y,\beta,\phi)\,dy,

l_{jli}(\beta,\phi) = \int \frac{f_i(y,\beta,\phi)^{2\alpha+1}}{L_\alpha^i(\beta,\phi)^2}\, \big(K_{ji}(y,\beta,\phi) - m_{ji}(\beta,\phi)\big) \big(K_{li}(y,\beta,\phi) - m_{li}(\beta,\phi)\big)\,dy,   [17.9]

for all j, l = 1, 2 and i = 1, ..., n.
THEOREM 17.1.– Let Y_1, ..., Y_n be i.n.i.d.o., each with a density function given in [17.1]. The asymptotic distribution of the minimum RP estimator, (\widehat{\beta}_\alpha, \widehat{\phi}_\alpha), is given by

\sqrt{n}\,\Omega_n(\beta,\phi)^{-1/2}\,\Psi_n(\beta,\phi)\,\big((\widehat{\beta}_\alpha,\widehat{\phi}_\alpha) - (\beta,\phi)\big) \xrightarrow[n\to\infty]{L} N(0_{p+1}, I_{p+1}),

where I_k denotes the k-dimensional identity matrix and the matrices \Psi_n and \Omega_n are given by

\Omega_n(\beta,\phi) = \frac{1}{n}\begin{pmatrix} X^T D_{11} X & X^T D_{12}\mathbf{1} \\ \mathbf{1}^T D_{12} X & \mathbf{1}^T D_{22}\mathbf{1} \end{pmatrix},

where X denotes the design matrix and D_{jk} = \mathrm{diag}(l_{jki})_{i=1,...,n}, j, k = 1, 2, and

\Psi_n(\beta,\phi) = \frac{1}{n}\begin{pmatrix} X^T\big(D_{11}^{*} - (D_1^{*})^T D_1^{*}\big) X & X^T\big(D_{12}^{*} - (D_1^{*})^T D_2^{*}\big)\mathbf{1} \\ \mathbf{1}^T\big(D_{12}^{*} - (D_1^{*})^T D_2^{*}\big) X & \mathbf{1}^T\big(D_{22}^{*} - (D_2^{*})^T D_2^{*}\big)\mathbf{1} \end{pmatrix},

with D_{jk}^{*} = \mathrm{diag}(m_{jki}(\beta,\phi))_{i=1,...,n} and D_j^{*} = \mathrm{diag}(m_{ji}(\beta,\phi))_{i=1,...,n}, j, k = 1, 2.
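For illustration, the block matrix Ω_n can be assembled directly from the quantities l_{jki} of [17.9]; the following NumPy sketch (the helper name and toy inputs are ours, not the chapter's) treats the l_{jk} as precomputed length-n vectors:

```python
import numpy as np

def omega_n(X, l11, l12, l22):
    """Assemble Omega_n = (1/n) [[X' D11 X, X' D12 1], [1' D12 X, 1' D22 1]]
    with D_jk = diag(l_jk); returns a (p+1) x (p+1) symmetric matrix."""
    n = X.shape[0]
    one = np.ones(n)
    top = np.hstack([X.T @ (l11[:, None] * X), (X.T @ (l12 * one))[:, None]])
    bot = np.append((l12 * one) @ X, (l22 * one) @ one)[None, :]
    return np.vstack([top, bot]) / n

# toy illustration: n = 3 observations, p = 2 covariates
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Om = omega_n(X, l11=np.ones(3), l12=np.zeros(3), l22=2 * np.ones(3))
```

The matrix Ψ_n can be built the same way from the m_{jki} and m_{ji} quantities.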
We illustrate the proposed robust method for the Poisson regression model. As
pointed out in section 17.1, the Poisson regression model belongs to the GLM with
a known shape parameter φ = 1 and location parameters θi = νi = xTi β and
c(yi ) = − log(yi !). The mean of the response variable is then linked to the linear
predictor through the natural logarithm, i.e. μi = exp(xTi β). Thus, we can apply the
previously proposed method to estimate the vector of regression parameters β with
the objective function given in equation [17.5]. Note that the expression Liα (β) does
not have a simplified form for the Poisson regression model, and it must be computed
numerically.
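One possible numerical treatment (a sketch under our own assumptions, not the authors' code) truncates the infinite sum defining L_α^i at a large y_max:

```python
import numpy as np
from scipy.stats import poisson

def L_alpha(mu, alpha, y_max=500):
    """L_alpha^i = (sum_y f_i(y)^{alpha+1})^{alpha/(alpha+1)}, with the
    infinite Poisson sum truncated at y_max."""
    y = np.arange(y_max + 1)
    s = np.sum(poisson.pmf(y, mu) ** (alpha + 1))
    return s ** (alpha / (alpha + 1))

def rp_objective(beta, x, y, alpha):
    """H_n^alpha(beta) for the Poisson GLM with log link (alpha > 0);
    for alpha = 0 the mean log-likelihood is used instead, as in the text."""
    mu = np.exp(x @ beta)
    f = poisson.pmf(y, mu)
    L = np.array([L_alpha(m, alpha) for m in mu])
    return np.mean(f ** alpha / L)

x = np.array([[1.0], [1.0], [1.0]])   # toy intercept-only design
y = np.array([1, 2, 3])
val = rp_objective(np.array([np.log(2.0)]), x, y, 0.5)
```

Maximizing this objective over β with any general-purpose optimizer yields the minimum RP estimate for the chosen α.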
Figure 17.1 shows the mean squared error of the estimate, MSE = \|\widehat{\beta}_\alpha - \beta\|^2, for different values of α = 0, 0.1, 0.3, 0.5, against the sample size. Our proposed estimators are clearly more robust than the classical MLE: in all contaminated scenarios, the MSE is lower for all positive values of α than for α = 0 (corresponding to the MLE), except for very small sample sizes. Conversely, the MLE is, as expected, the most efficient estimator in the absence of contamination, closely followed by our proposed estimators with α = 0.1, 0.3, highlighting the role of α in controlling the trade-off between efficiency and robustness. In this regard, values of α around 0.3 perform best, taking into account the low loss of efficiency and the gain in robustness. Finally, note that small sample sizes adversely affect larger values of α.
We finally apply our proposed estimators to a real dataset arising from a clinical trial of 59 patients suffering from epilepsy. The data were first studied in Leppik et al. (1985) and have previously been used for robust Poisson regression in Hosseinian (2009) and Ghosh and Basu (2016). Patients with epilepsy were treated with the anti-epileptic drug progabide or a placebo.
Table 17.1 shows the estimated values of the regression coefficients for different values of the tuning parameter α, together with the coefficient estimates obtained with other robust and non-robust methods in the literature, namely the MLE, the weighted maximum likelihood estimators (WMLE) proposed in Hosseinian (2009), the M-estimators proposed in Cantoni and Ronchetti (2001) and the minimum density power divergence (MDPD) estimators proposed in Ghosh and Basu (2016). These estimated values have been taken from the papers mentioned. The proposed robust estimators behave similarly to the other robust proposals, confirming their robustness on real data as well. Note that all robust estimates of the regression coefficients for the Trt, Bline and Trt×Bline variables, whichever method is used, are greater (in absolute value) than
the MLE, suggesting that these non-robust estimates are influenced by data contamination.
17.4. Conclusion
In this chapter, we have presented the minimum RP estimators for GLMs. The proposed estimators are robust against data contamination, including outliers and leverage points, as well as consistent and asymptotically normal. Following this idea, Wald-type test statistics could be developed in order to test simple and composite null hypotheses, extending the previous work for the LRM in Castilla et al. (2020b). The latter has been explored in Basu et al. (2021) for the minimum DPD estimators, but assuming a random design matrix.
17.5. Acknowledgments
17.6. Appendix
We use the same notation introduced for Theorem 17.1 in [17.9]. Let us rewrite

\frac{\partial V_i(y_i;\beta,\phi)}{\partial \beta} = f_i(y_i,\beta,\phi)^{\alpha}\, H_{i1}(y_i,\beta,\phi)\, x_i \quad \text{and} \quad \frac{\partial V_i(y_i;\beta,\phi)}{\partial \phi} = f_i(y_i,\beta,\phi)^{\alpha}\, H_{i2}(y_i,\beta,\phi), \quad i = 1, ..., n,

with

H_{ij}(y_i,\beta,\phi) = \frac{1}{L_\alpha^i(\beta,\phi)} \Big[ K_{ji}(y_i,\beta,\phi) - \frac{1}{\int f_i(y,\beta,\phi)^{\alpha+1}\,dy} \int f_i(y,\beta,\phi)^{\alpha+1}\, K_{ji}(y,\beta,\phi)\,dy \Big] = \frac{1}{L_\alpha^i(\beta,\phi)} \big( K_{ji}(y_i,\beta,\phi) - m_{ji}(\beta,\phi) \big), \quad j = 1, 2.
with

l_{jli}(\beta,\phi) = \int \frac{f_i(y,\beta,\phi)^{2\alpha+1}}{L_\alpha^i(\beta,\phi)^2}\, \big(K_{ji}(y,\beta,\phi) - m_{ji}(\beta,\phi)\big) \big(K_{li}(y,\beta,\phi) - m_{li}(\beta,\phi)\big)\,dy, \quad j, l = 1, 2, \; i = 1, ..., n,

and, for each observation i, the matrix

\frac{1}{\int f_i(y,\beta,\phi)^{\alpha+1}\,dy} \int f_i(y,\beta,\phi)^{\alpha+1}\, u(y,\beta,\phi)\, u(y,\beta,\phi)^{T}\,dy - \frac{1}{\big(\int f_i(y,\beta,\phi)^{\alpha+1}\,dy\big)^{2}} \Big(\int f_i(y,\beta,\phi)^{\alpha+1}\, u(y,\beta,\phi)\,dy\Big) \Big(\int f_i(y,\beta,\phi)^{\alpha+1}\, u(y,\beta,\phi)\,dy\Big)^{T},
where u(y,\beta,\phi) = \big(K_{1i}(y,\beta,\phi)\, x_i^{T},\; K_{2i}(y,\beta,\phi)\big)^{T}. Therefore, using the quantities defined in [17.9], we can express the matrix \Psi_n(\beta,\phi) as

\Psi_n(\beta,\phi) = \frac{1}{n} \sum_{i=1}^{n} \left[ \begin{pmatrix} m_{11i}(\beta,\phi)\, x_i x_i^{T} & m_{12i}(\beta,\phi)\, x_i \\ m_{12i}(\beta,\phi)\, x_i^{T} & m_{22i}(\beta,\phi) \end{pmatrix} - \begin{pmatrix} m_{1i}(\beta,\phi)\, x_i \\ m_{2i}(\beta,\phi) \end{pmatrix} \begin{pmatrix} m_{1i}(\beta,\phi)\, x_i^{T} & m_{2i}(\beta,\phi) \end{pmatrix} \right].
Finally, defining D_{jk}^{*} = \mathrm{diag}(m_{jki}(\beta,\phi))_{i=1,...,n} and D_j^{*} = \mathrm{diag}(m_{ji}(\beta,\phi))_{i=1,...,n}, j, k = 1, 2, we can write

\Psi_n(\beta,\phi) = \frac{1}{n}\begin{pmatrix} X^T\big(D_{11}^{*} - (D_1^{*})^T D_1^{*}\big) X & X^T\big(D_{12}^{*} - (D_1^{*})^T D_2^{*}\big)\mathbf{1} \\ \mathbf{1}^T\big(D_{12}^{*} - (D_1^{*})^T D_2^{*}\big) X & \mathbf{1}^T\big(D_{22}^{*} - (D_2^{*})^T D_2^{*}\big)\mathbf{1} \end{pmatrix}.
17.7. References
Basu, A., Ghosh, A., Mandal, A., Martin, N., Pardo, L. (2021). Robust Wald-type tests in GLM
with random design based on minimum density power divergence estimators. Statistical
Methods and Applications, 3, 933–1005.
Bianco, A.M. and Yohai, V.J. (1996). Robust estimation in the logistic regression model. Robust
Statistics, Data Analysis, and Computer Intensive Methods, 109, 17–34.
Bianco, A.M., Boente, G., Rodrigues, I.M. (2013). Robust tests in generalized linear models with
missing responses. Computational Statistics and Data Analysis, 65, 80–97.
Broniatowski, M., Toma, A., Vajda, I. (2012). Decomposable pseudodistances and applications
in statistical estimation. Journal of Statistical Planning and Inference, 142, 2574–2585.
Cantoni, E. and Ronchetti, E. (2001). Robust inference for generalized linear models. Journal
of the American Statistical Association, 96, 1022–1030.
Castilla, E., Ghosh, A., Jaenada, M., Pardo, L. (2020a). On regularization methods based
on Rényi’s pseudodistances for sparse high-dimensional linear regression models [Online].
Available at: arXiv:2007.15929.
Castilla, E., Martín, N., Muñoz, S., Pardo, L. (2020b). Robust Wald-type tests based on
minimum Rényi pseudodistance estimators for the multiple regression model. Journal of
Statistical Computation and Simulation, 90(14), 2592–2613.
Castilla, E., Jaenada, M., Pardo, L. (2021). Estimation and testing on independent not
identically distributed observations based on Rényi’s pseudodistances [Online]. Available
at: arXiv:2102.12282.
Croux, C. and Haesbroeck, G. (2003). Implementing the Bianco and Yohai estimator for logistic
regression. Computational Statistics and Data Analysis, 44, 273–295.
Ghosh, A. and Basu, A. (2016). Robust estimation in generalized linear models: The density
power divergence approach. TEST, 25, 269–290.
Hosseinian, S. (2009). Robust inference for generalized linear models: Binary and Poisson
regression. Thesis, École Polytechnique Fédérale de Lausanne.
Krasker, W.S. and Welsch, R.E. (1982). Efficient bounded-influence regression estimation.
Journal of the American Statistical Association, 77, 595–604.
Künsch, H.R., Stefanski, L.A., Carroll, R.J. (1989). Conditionally unbiased bounded-influence
estimation in general regression models, with applications to generalized linear models.
Journal of the American Statistical Association, 84, 460–466.
Leppik, I., Dreifuss, F., Bowman, T., Santilli, N., Jacobs, M.P., Crosby, C., Cloyd, J.C.,
Stockman, J., Graves, N.M., Sutula, T.P. et al. (1985). A double-blind crossover evaluation
of progabide in partial seizures. Neurology, 35, 285.
McCullagh, P. and Nelder, J.A. (1983). Generalized Linear Models. Chapman and Hall,
London.
Morgenthaler, S. (1992). Least-absolute-deviations fits for generalized linear models.
Biometrika, 79, 747–754.
Nelder, J.A. and Wedderburn, R.W.M. (1972). Generalized linear models. Journal of the Royal
Statistical Society, 135, 370–384.
Stefanski, L.A., Carroll, R.J., Ruppert, D. (1986). Optimally bounded score functions for
generalized linear models with applications to logistic regression. Biometrika, 73, 413–424.
Thall, P.F. and Vail, S.C. (1990). Some covariance models for longitudinal count data with
overdispersion. Biometrics, 46(3), 657–671.
Valdora, M. and Yohai, V.J. (2014). Robust estimators for generalized linear models. Journal of
Statistical Planning and Inference, 146, 31–48.
18
Data Analysis based on Entropies and Measures of Divergence
18.1. Introduction
probability distributions are closer together than others. Many of the currently used tests, such as the likelihood ratio, the chi-squared, the score and the Wald tests, are defined in terms of appropriate measures.
For historical reasons, we present first Shannon's entropy (Shannon 1948), given by

I^S(X) \equiv I^S(f) = -\int f \ln f \, d\mu = E_f[-\ln f].
For more details about entropy measures, the reader is referred to Mathai and
Rathie (1975) and Nadarajah and Zografos (2003).
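As a quick numerical illustration (ours, not from the chapter), the entropy of a N(0, σ²) density computed by quadrature matches the known closed form (1/2) ln(2πeσ²):

```python
import numpy as np
from scipy import integrate, stats

sigma = 2.0
f = stats.norm(0.0, sigma).pdf
# I_S(f) = -int f ln f dmu, computed by numerical quadrature
num, _ = integrate.quad(lambda t: -f(t) * np.log(f(t)), -30 * sigma, 30 * sigma)
closed = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)  # normal entropy, closed form
```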
where ϕ is a convex function on [0, ∞) such that ϕ(1) = ϕ'(1) = 0 and ϕ''(1) ≠ 0. We also adopt the conventions 0ϕ(0/0) = 0 and 0ϕ(u/0) = u \lim_{v\to\infty} ϕ(v)/v, u > 0. The class of Csiszar's measures includes a number of widely used measures that can be recovered for appropriate choices of the function ϕ. When the function ϕ is defined as
then the above measure reduces to the Kullback–Leibler measure given in [18.1]. If

ϕ(u) = ϕ_λ(u) = \frac{u^{λ+1} - u - λ(u-1)}{λ(λ+1)}   [18.5]

or

ϕ(u) = ϕ^{*}_λ(u) = \frac{u^{λ+1} - u}{λ(λ+1)},

we obtain the Cressie and Read power divergence (Cressie and Read 1984), λ ≠ 0, −1.
If

ϕ(u) = ϕ_a(u) = 1 - \Big(1 + \frac{1}{a}\Big)u + \frac{u^{1+a}}{a}, \quad a ≠ 0,   [18.6]
we obtain a measure associated with the BHHJ power divergence (Basu et al. 1998), a member of the BHHJ family of divergence measures proposed by Mattheou et al. (2009), which depends on a general convex function Φ and a positive index a and is given by

D_X^{a}(g, f) = \int g^{1+a}\, \Phi\Big(\frac{f}{g}\Big)\, d\mu, \quad a > 0, \; \Phi \in \Phi^{*},   [18.7]
where μ represents the Lebesgue measure and Φ* is the class of all convex functions Φ on [0, ∞) such that Φ(1) = Φ'(1) = 0 and Φ''(1) ≠ 0. We also adopt the conventions 0Φ(0/0) = 0 and 0Φ(u/0) = u \lim_{v\to\infty} Φ(v)/v, u > 0.
For Φ having the special form given in [18.6], we obtain the BHHJ measure
(Basu et al. 1998) which was proposed for the development of a minimum divergence
estimating method for robust parameter estimation. Observe that for Φ (u) = φ (u) ∈
Φ0 and a = 0, the family reduces to Csiszár’s φ−divergence family of measures,
while for a = 0 and for Φ (u) = ϕλ (u) as in [18.5], it reduces to the Cressie and Read
power divergence measure. Other important special cases of the Φ−divergence family
are those for which the function Φ(u) takes the form
\Phi^{1}_λ(u) = (1 + λ)\, ϕ_λ(u)   [18.9]

and

\Phi^{1}_a(u) = \frac{1}{1+a}\, \Phi_a(u) = \frac{1}{1+a}\Big(1 - \Big(1 + \frac{1}{a}\Big)u + \frac{u^{1+a}}{a}\Big).   [18.10]
It is easy to see that for a → 0, the measures Φa (·) and Φ1a (·) reduce to the KL
measure.
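A short numerical check of these facts (illustrative code; the discrete sum below stands in for the integral in [18.7] under the counting measure): ϕ_a of [18.6] satisfies the Φ* conditions at u = 1, and the resulting divergence approaches the Kullback–Leibler measure as a → 0:

```python
import numpy as np

def phi_a(u, a):
    """BHHJ function of [18.6]: phi_a(u) = 1 - (1 + 1/a) u + u^{1+a} / a."""
    return 1.0 - (1.0 + 1.0 / a) * u + u ** (1.0 + a) / a

def bhhj_divergence(g, f, a):
    """Discrete analogue of [18.7]: D_a(g, f) = sum_i g_i^{1+a} Phi(f_i / g_i)."""
    return np.sum(g ** (1.0 + a) * phi_a(f / g, a))

a, h = 0.5, 1e-5
cond_value = phi_a(1.0, a)                                      # Phi(1) = 0
cond_slope = (phi_a(1.0 + h, a) - phi_a(1.0 - h, a)) / (2 * h)  # Phi'(1) = 0
cond_curv = (phi_a(1.0 + h, a) - 2 * phi_a(1.0, a)
             + phi_a(1.0 - h, a)) / h ** 2                      # Phi''(1) = 1 + a

g = np.array([0.2, 0.3, 0.5])
f = np.array([0.4, 0.4, 0.2])
kl = np.sum(f * np.log(f / g))           # Kullback-Leibler divergence of f from g
d_small_a = bhhj_divergence(g, f, 1e-6)  # should be close to kl
```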
Data Analysis based on Entropies and Measures of Divergence 241
More examples of φ functions are given in Arndt (2001) and Pardo (2006).
For more details on divergence measures, see Cavanaugh (2004), Toma (2009)
and Toma and Broniatowski (2011). Specifically, for robust inference based on
divergence measures, see Basu et al. (2011) and a paper by Patra et al. (2013) on
the power divergence and the density power divergence families. The discretized version of these measures has received considerable attention over the years, with some representative works being those of Zografos et al. (1986) and Papaioannou et al. (1994).
Let Y_1, ..., Y_n be a random sample from F and let n_i = \sum_{j=1}^{n} I_{E_i}(Y_j) with \sum_{i=1}^{m} n_i = n, where

I_{E_i}(Y_j) = \begin{cases} 1, & Y_j \in E_i \\ 0, & \text{otherwise} \end{cases}, \quad i = 1, 2, ..., m,
where θ_0 is the true value of the k-dimensional parameter under the null model and p_0(θ_0) = (p_{10}(θ_0), ..., p_{m0}(θ_0))^T. Pearson encountered this problem in the
well-known chi-square test statistic and suggested the use of a consistent estimator
for the unknown parameter. He further claimed that the asymptotic distribution of
the resulting test statistic, under the null hypothesis, is a chi-square random variable
with m degrees of freedom. Later, for the same test, Fisher (1924) established that
the correct distribution has m − 1 degrees of freedom. The result was later discussed
by Neyman (1949) and recently by Menendez et al. (2001). In this case, since the
null distribution depends on the unknown parameter θ, a consistent estimator of θ is
required.
The partition of the data range is a delicate matter since it is frequently associated
with the loss of information. For a thorough investigation on the issue, the interested
reader is referred to the works by Ferentinos and Papaioannou (1979, 1983).
For testing the above null hypotheses, the most commonly used test statistics are Pearson's chi-squared test statistic and the likelihood ratio test statistic, both of which are special cases of the family of power-divergence test statistics (CR test) introduced by Cressie and Read (1984), which is based on the measure given in [18.5] and is given by

I_n^{\lambda}\big(p, p_0(\widehat{\theta})\big) = \frac{2n}{\lambda(\lambda+1)} \sum_{i=1}^{m} p_i \left[\Big(\frac{p_i}{p_{i0}(\widehat{\theta})}\Big)^{\lambda} - 1\right]   [18.12]

= 2n \sum_{i=1}^{m} p_{i0}(\widehat{\theta})\, \Phi_{2,\lambda}\Big(\frac{p_i}{p_{i0}(\widehat{\theta})}\Big),   [18.13]

where λ ≠ −1, 0, −∞ < λ < ∞, p_0(\widehat{\theta}) = (p_{10}(\widehat{\theta}), ..., p_{m0}(\widehat{\theta}))^T, and \widehat{\theta} is a consistent estimator of θ. Particular values of λ in [18.12] correspond to well-known test statistics: the chi-squared test statistic (λ = 1), the likelihood ratio test statistic (λ → 0), the Freeman–Tukey test statistic (λ = −1/2), the minimum discrimination information statistic (Gokhale and Kullback 1978; Kullback 1985) (λ → −1), the modified chi-squared test statistic (Neyman 1949) (λ = −2) and the Cressie–Read test statistic (λ = 2/3).
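These special cases can be checked numerically with SciPy's implementation of the same power-divergence family (made-up counts for illustration; `scipy.stats.power_divergence` uses the identical λ parameterization):

```python
import numpy as np
from scipy.stats import power_divergence, chisquare

obs = np.array([16, 18, 16, 14, 12, 12])   # observed cell counts n_i
exp = np.full(6, obs.sum() / 6)            # equiprobable null, m = 6 cells

pearson, _ = power_divergence(obs, exp, lambda_=1)          # chi-squared
loglik, _ = power_divergence(obs, exp, lambda_="log-likelihood")  # LR test
cr, _ = power_divergence(obs, exp, lambda_=2 / 3)           # Cressie-Read
chi2, _ = chisquare(obs, exp)              # same as lambda_=1
```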
with φ (x) a convex, twice continuously differentiable function for x > 0 such that
φ (1) = 0.
The above family of tests was generalized by Mattheou and Karagrigoriou (2010) to the following Φ-family of tests, which is based on the Φ-divergence measure given in [18.8]:

I_n^{\Phi}\big(p, p_0(\widehat{\theta})\big) = \frac{2n\,\widehat{d}_a}{\Phi''(1)},   [18.15]

where

\widehat{d}_a = \sum_{i=1}^{m} p_{i0}(\widehat{\theta})^{1+a}\, \Phi\Big(\frac{p_i}{p_{i0}(\widehat{\theta})}\Big), \quad \Phi \in \Phi^{*}.   [18.16]
THEOREM 18.1.– (Cressie and Read 1984). Under the null hypothesis H_0: p = p_0 = (p_{10}, ..., p_{m0})^T, the asymptotic distribution of the Cressie and Read divergence test statistic, I_n^λ(p, p_0), is chi-square with m − 1 degrees of freedom:

I_n^{\lambda}(p, p_0) \xrightarrow[n\to\infty]{L} \chi^2_{m-1}.
THEOREM 18.3.– (Mattheou and Karagrigoriou 2010). Under the composite null hypothesis H_0: p = p_0(θ_0), the asymptotic distribution of the Φ-divergence test statistic, I_n^Φ(p, p_0(\widehat{θ})), divided by a constant c, is chi-square with m − 1 degrees of freedom:

\frac{1}{c}\, I_n^{\Phi}\big(p, p_0(\widehat{\theta})\big) \xrightarrow[n\to\infty]{L} \chi^2_{m-1},

where

c = 0.5\Big[\min_i p_{i0}^{a}(\widehat{\theta}) + \max_i p_{i0}^{a}(\widehat{\theta})\Big].   [18.17]
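A hedged sketch (our code, not the chapter's) of the corrected statistic of Theorem 18.3, taking Φ to be the BHHJ function ϕ_a of [18.6], for which Φ''(1) = 1 + a, and c as in [18.17]; as a → 0 the statistic approaches the likelihood ratio statistic:

```python
import numpy as np

def phi_a(u, a):
    """BHHJ function of [18.6]."""
    return 1.0 - (1.0 + 1.0 / a) * u + u ** (1.0 + a) / a

def phi_test_statistic(counts, p0, a):
    """(1/c) I_n^Phi with Phi = phi_a, so Phi''(1) = 1 + a;
    I_n^Phi = 2 n d_a / Phi''(1) as in [18.15]-[18.16], c as in [18.17]."""
    n = counts.sum()
    p_hat = counts / n
    d_a = np.sum(p0 ** (1.0 + a) * phi_a(p_hat / p0, a))
    c = 0.5 * (np.min(p0 ** a) + np.max(p0 ** a))
    return 2.0 * n * d_a / (1.0 + a) / c

counts = np.array([16, 18, 16, 14, 12, 12])
p0 = np.full(6, 1.0 / 6.0)    # simple equiprobable null
t = phi_test_statistic(counts, p0, 0.5)
```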
For the case of the simple null hypothesis, the theorem is adjusted accordingly and
the asymptotic distribution is therefore chi-square with m − 1 degrees of freedom. For
the fixed alternative hypothesis, the power is given in the theorem below:
THEOREM 18.5.– Under the contiguous alternative hypotheses given in [18.19], the asymptotic distribution of the Φ-divergence test statistic, I_n^Φ(p, p_0), divided by a constant c, is non-central chi-square with m − 1 degrees of freedom:

\frac{1}{c}\, I_n^{\Phi}(p, p_0) \xrightarrow[n\to\infty]{L} \chi^2_{m-1,\delta},

where c = 0.5\big[\min_i p_{i0}^{a} + \max_i p_{i0}^{a}\big] and the non-centrality parameter is \delta = \sum_{i=1}^{m} d_i^2 / p_{i0}.
Due to the above theorems, the power of the test under the fixed alternative
hypothesis H1 : pi = pib and the local contiguous alternative hypotheses [18.19]
can be easily obtained. For the case of the local contiguous alternative hypotheses, the
power is given by
We close this section with a short discussion about the estimation of the
unknown parameter θ which is a classic inferential problem. Optimal estimating
approaches, like the maximum likelihood estimation, are available in the literature
(e.g. Papaioannou et al. 2007). Here, we focus on the parameter estimator under
the composite hypothesis. Although the traditional MLE can be evaluated and
implemented, we may alternatively consider a wider class of estimators, known as
Φ−divergence estimators. More specifically, the minimum Φ−divergence estimator
of θ is any θ̂ Φ ∈ θk satisfying
m
1+a pi
da (θ̂Φ ) = min da (θ) = min pi0 (θ) Φ
θ∈θ k θ∈θ k
i=1
pi0 (θ)
for a function Φ ∈ Φ∗ and with p̂i = ni /n. Obviously, the resulting estimator depends
on the Φ-function chosen. Observe that for Φ as in [18.6] or [18.10] and for a → 0,
the resulting estimator is the usual MLE for the grouped data. It should be pointed
out that the function Φ used for the Φ−divergence estimator θ̂Φ does not necessarily
coincide with the Φ-function used for the test statistic, which, in general, is written as

T_{\Phi_1}^{\alpha_1}\big(\widehat{\theta}_{(\Phi_2,\alpha_2)}\big) = \frac{2n}{\Phi_1''(1)} \sum_{i=1}^{m} p_{i0}\big(\widehat{\theta}_{(\Phi_2,\alpha_2)}\big)^{1+\alpha_1}\, \Phi_1\Big(\frac{\widehat{p}_i}{p_{i0}(\widehat{\theta}_{(\Phi_2,\alpha_2)})}\Big)   [18.20]

for two, not necessarily different, functions Φ_1 and Φ_2 ∈ Φ*. Finally, note that such
a type of estimator has been thoroughly investigated and their asymptotic theory has
been presented in Meselidis and Karagrigoriou (2020). Indeed, the innovative idea
behind the proposal by Meselidis and Karagrigoriou (2020) is the duality in choosing
among the members of the general class of divergences, one for estimating and one for
testing purposes which may not necessarily be the same. In that sense, the divergence
test statistic given in [18.20] offers the greatest possible range of options both for
the strictly convex function Φ and the indicator value α ∈ R. More specifically, if a
parameter θ needs to be estimated, then a function Φ, say Φ2 , and an index α, say
α2 , are used for that purpose and then we proceed with the distance and the testing
246 Data Analysis and Related Applications 1
problem using a function Φ, say Φ1 , and an index α, say α1 , which, in general, can be
different from those used for the estimation problem. The resulting divergence is given
in [18.20] and [18.21], where θ̂ (Φ2 ,α2 ) is the minimum (Φ2 , α2 ) divergence estimator
which is allowed to be obtained even under restrictions, say c(θ) = 0.
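The estimation step can be sketched as follows (illustrative code with a hypothetical partition, hypothetical counts and a Gamma(θ) null model with unit scale; not from the chapter):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import gamma

def phi_a(u, a):
    """BHHJ function of [18.6]."""
    return 1.0 - (1.0 + 1.0 / a) * u + u ** (1.0 + a) / a

edges = np.array([0.0, 0.5, 1.0, 2.0, np.inf])   # assumed partition, m = 4
counts = np.array([38, 24, 25, 13])              # hypothetical grouped counts
p_hat = counts / counts.sum()

def d_a(theta, a):
    """d_a(theta) = sum_i p_i0(theta)^{1+a} Phi(p_hat_i / p_i0(theta))."""
    p0 = np.diff(gamma.cdf(edges, theta))        # cell probabilities, Gamma(theta)
    return np.sum(p0 ** (1.0 + a) * phi_a(p_hat / p0, a))

res = minimize_scalar(lambda th: d_a(th, a=0.25), bounds=(0.1, 10.0),
                      method="bounded")
theta_hat = res.x
```

With these made-up counts (close to the Gamma(1) cell probabilities), the minimum (Φ₂, α₂) divergence estimate lands near θ = 1.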
18.4. Simulations
The study is implemented not only for the regular case but also for cases where the data set is contaminated. In this regard, we define ε as the contamination level, with ε ∈ [0, 1]. Thus, the data generating distribution has the form (1 − ε)Γ_d + εΓ_c, where Γ_d is the dominant and Γ_c the contaminant Gamma distribution. Note that the contamination level used is taken to be equal to ε = 0.075. Thus, for the examination of estimators and test statistics in terms of the size of the test (α), we contaminate the null distribution with observations from the alternative hypotheses, and vice versa for the examination of tests in terms of power (γ). Furthermore, for the implementation, we have considered a large sample size, n = 200, and N = 100000 repetitions of the experiment, while for the partition of the data range, we use ⌈√200⌉ = 15 equiprobable intervals, where the ⌈·⌉ operator returns the least integer greater than or equal to its argument.
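The sampling and partitioning scheme just described can be sketched as follows (illustrative code; the seed, the Γ(10) contaminant and the unit scales are our choices for one of the four cases):

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(2022)
n, eps = 200, 0.075
m = int(np.ceil(np.sqrt(n)))               # ceil(sqrt(200)) = 15 cells

# draw from the mixture (1 - eps) * Gamma(1) + eps * Gamma(10), unit scale
is_cont = rng.random(n) < eps
sample = np.where(is_cont, rng.gamma(10.0, size=n), rng.gamma(1.0, size=n))

# equiprobable partition under the null Gamma(1): cut at its quantiles
edges = gamma.ppf(np.linspace(0.0, 1.0, m + 1), 1.0)
counts, _ = np.histogram(sample, bins=edges)
```

The resulting cell counts feed directly into the divergence estimators and test statistics above.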
In reference to the test statistics, we proceed in a similar manner and retrieve from the Cressie–Read family the classical modified chi-squared MCS(\widehat{\theta}_{MCS}), minimum discrimination information MDI(\widehat{\theta}_{MDI}), Freeman–Tukey FT(\widehat{\theta}_{FT}), likelihood ratio LR(\widehat{\theta}_{LR}), Cressie–Read CR(\widehat{\theta}_{CR}) and Pearson's chi-squared CS(\widehat{\theta}_{CS}) test statistics, along with the proposed T_{\Phi_1}^{\alpha_1}(\widehat{\theta}_{(\Phi_2,\alpha_2)}) for \alpha_1 = 10^{-7}, 0.01, 0.05, 0.10 (0.10) 1.00 and \Phi_1 as in [18.6].
with \widehat{\theta}_l^I being the minimum divergence estimator based on any divergence I for the lth sample.
Figure 18.1 presents the MSE for the four cases, which are associated with no contamination and contamination from the three alternative distributions. The minimum divergence estimators are displayed in ascending order, following a counterclockwise direction, according to the case where the contaminant distribution lies farthest from the null, i.e. when the data are generated from 0.925Γ(1) + 0.075Γ(10).
Results indicate that in terms of MSE, estimators that can be derived from the
Cressie–Read family with λ ≥ 0 along with those that can be derived from the BHHJ
family with small values of α2 have better performance for the no contamination case
and when the contaminant distribution is close to the null (Figures 18.1a and 18.1b).
Note that in these two cases, (θ̂MLE ) has the best performance among all competing
estimators. On the contrary, when the contaminant distribution departs further from the
null (Figures 18.1c and 18.1d), estimators from the BHHJ family with larger values
of α2 and those from the Cressie–Read family with negative values of λ appear to
behave better, while the worst results arise for the θ̂_MLE. In addition, Figure 18.1 reveals the robustness of the BHHJ and the Cressie–Read estimators, since it is apparent that, in the presence of contamination, the larger the value of the index α2 and the smaller the value of the parameter λ, the smaller the MSE. Finally, note that in every case, the MSE of θ̂_(Φ2,α2) lies between the MSEs of θ̂_LR and θ̂_L2.
We should state here that for presentation purposes, the MSE has been multiplied by
100. For more information about robust estimation for grouped data, refer to Basu et al. (1997), Victoria-Feser and Ronchetti (1997), Lin and He (2006) and Toma and Broniatowski (2011), while for the mathematical connection between the BHHJ and Cressie–Read families, refer to Patra et al. (2013).
Figure 18.1. MSE (×100) for the four cases of contamination regarding the estimators that can be derived both from the BHHJ and Cressie–Read families
Figure 18.2. Size for the four contamination cases regarding the tests
that can be derived from the BHHJ family. For a color version of
this figure, see www.iste.co.uk/zafeiris/data1.zip
In Figure 18.2, we examine under the four aforementioned cases the behavior of
the BHHJ test statistics in terms of size for various values of the indices α1 and α2 ,
while in Table 18.1, the behavior of the classical tests is presented. In general, we
can see that as the index α1 increases, the size decreases, while as the index α2 increases, the size increases. Furthermore, we can observe that in the case where the contaminant distribution lies far from the null (Figure 18.2d), the size becomes very large, indicating the disastrous effect imposed by the contaminant distribution on all BHHJ test statistics. This disastrous effect is also apparent in the
classical test statistics. In the case where the contaminant distribution is the Γ(4)
(Figure 18.2c), the BHHJ family of tests discounts the effect of contamination for
values of α1 ≥ 0.8, while the classical tests are largely affected by the contamination
once again. Finally, for the no contamination case and contamination from the Γ(1.5), we can draw the following conclusions about the behavior of the tests. Regarding the BHHJ family (Figures 18.2a and 18.2b), we can observe that the larger the value of α1, the more conservative the test, while the best performance appears for α1 ≤ 0.10
and α2 ≥ 0.50. With respect to the classical tests, M DI(θ̂MDI ), F T (θ̂F T ) and
M CS(θ̂MCS ) appear to be conservative, while CS(θ̂CS ) and CR(θ̂CR ) appear to
be liberal. Note that in terms of size, LR(θ̂LR ) appears to have the best performance
among all classical test statistics.
Data distribution FT CR CS LR M DI M CS
Γ(1) 0.04028 0.06538 0.07841 0.04744 0.03783 0.04263
0.925Γ(1) + 0.075Γ(1.5) 0.03943 0.06759 0.08195 0.04775 0.03659 0.04072
0.925Γ(1) + 0.075Γ(4) 0.09546 0.12762 0.14392 0.10539 0.09063 0.09521
0.925Γ(1) + 0.075Γ(10) 0.83558 0.85420 0.85953 0.85953 0.82183 0.68850
Table 18.1. Size for the four contamination cases regarding the classical
tests that can be derived from the Cressie–Read family
Figure 18.3. Power for the no contamination and contamination from Γ(1)
cases regarding the tests that can be derived from the BHHJ family. For
a color version of this figure, see www.iste.co.uk/zafeiris/data1.zip
Data distribution FT CR CS LR M DI M CS
Γ(1.5) 0.69412 0.69911 0.70887 0.69061 0.70568 0.74808
0.925Γ(1.5) + 0.075Γ(1) 0.55308 0.57515 0.59049 0.55571 0.56224 0.60448
Table 18.2. Power for the no contamination and contamination from Γ(1) cases
regarding the classical tests that can be derived from the Cressie–Read family
Data Analysis based on Entropies and Measures of Divergence 251
In terms of power, results are presented in Figure 18.3 and Table 18.2 for the BHHJ
and classical tests, respectively. Note that we only present results that are associated
with the Γ(1.5) alternative, since in every other case the power reaches its highest level
of 1 for all tests. As a general conclusion, we can state that the contamination affects the
performance of all tests by notably downgrading their power. Concerning the BHHJ
tests, the best results appear for small values of α1 and large values of α2 , while the
classical modified chi-squared test statistic, MCS(θ̂_MCS), has the best performance
among all classical tests.
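The size distortions in Table 18.1 are easy to reproduce in miniature. The following sketch assumes a fully specified Exp(1) = Γ(1) null and uses a plain Pearson chi-squared test on equiprobable cells (the chapter's tests additionally involve parameter estimation, which we omit; all function names are ours); it shows how contamination from the Γ(10) inflates the empirical size far above the nominal 5%.

```python
import numpy as np

def sample_contaminated(n, shape_cont, eps, rng):
    """Draw n points from the mixture (1 - eps) * Gamma(1) + eps * Gamma(shape_cont)."""
    x = rng.gamma(1.0, 1.0, size=n)
    cont = rng.random(n) < eps
    x[cont] = rng.gamma(shape_cont, 1.0, size=cont.sum())
    return x

def pearson_size(shape_cont, eps, n_obs=500, n_rep=300, m=10, seed=42):
    """Empirical size of the Pearson chi-squared GOF test of H0: Exp(1),
    using m equiprobable cells and the chi-square(m - 1) 0.95 critical value."""
    cuts = -np.log(1.0 - np.arange(1, m) / m)  # cell boundaries under Exp(1)
    edges = np.concatenate(([0.0], cuts, [np.inf]))
    crit = 16.919                              # 0.95 quantile of chi-square(9)
    rng = np.random.default_rng(seed)
    expected = n_obs / m
    rejections = 0
    for _ in range(n_rep):
        x = sample_contaminated(n_obs, shape_cont, eps, rng)
        counts = np.histogram(x, bins=edges)[0]
        stat = ((counts - expected) ** 2 / expected).sum()
        rejections += stat > crit
    return rejections / n_rep
```

With eps = 0 the rejection rate stays near 0.05, while 7.5% contamination from the Γ(10) drives it towards 1, mirroring the last row of Table 18.1.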
To assess whether the exact simulated size is close to the nominal one, we use Dale's (1986) criterion:

|logit(1 − α̂_n) − logit(1 − α)| ≤ d,   [18.22]

where logit(p) = log(p/(1 − p)), while α̂_n and α are the exact simulated and nominal
sizes, respectively. When [18.22] is satisfied with d = 0.35, the exact simulated size
is considered to be close to the nominal size. For α = 0.05, the exact simulated size is
close to the nominal if α̂n ∈ [0.0357, 0.0695]. This criterion has been used previously
by Pardo (2010) and Batsidis et al. (2016). We apply the criterion not only for α =
0.05 but also for a range of nominal sizes that are of interest, namely α ∈ [0, 0.1].
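Inverting the logit bound gives the admissible range of exact simulated sizes in closed form; a quick check of the interval quoted above (function names are ours):

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def dale_interval(alpha, d=0.35):
    """Range of exact simulated sizes a_hat satisfying Dale's criterion
    |logit(1 - a_hat) - logit(1 - alpha)| <= d."""
    lo = 1.0 - 1.0 / (1.0 + math.exp(-(logit(1.0 - alpha) + d)))
    hi = 1.0 - 1.0 / (1.0 + math.exp(-(logit(1.0 - alpha) - d)))
    return lo, hi
```

For alpha = 0.05 and d = 0.35 this returns approximately (0.0358, 0.0695), matching the interval [0.0357, 0.0695] up to rounding.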
Results are presented in Figure 18.4, where the dashed line refers to the situation
where the exact simulated equals the nominal size; thus, lines that lie above this
reference line refer to liberal, while those that lie below refer to conservative test
statistics. Furthermore, the gray area that is depicted in Figure 18.4 refers to Dale’s
criterion; thus, lines that lie in this area satisfy the criterion. From Figures 18.4a
and 18.4b, we observe that in the no contamination case and when the contaminant
distribution is close to the null, every test besides CS and T4 satisfies Dale's
criterion. On a more granular level, we observe that the CR test statistic satisfies
the criterion only for nominal sizes α ≥ 0.03. For the case where the contaminant
distribution is the Γ(4), we can see that the only test that resists the contamination and
satisfies the criterion is T4. One conclusion that can be derived from Figure 18.4d is
that, even though every test fails to satisfy Dale's criterion, MCS appears to be notably
resistant to the contamination relative to all other tests, especially for small nominal
sizes.
Apparently, the actual size of each test differs from the targeted nominal one; thus,
in order to proceed further with the comparison of the tests in terms of power, we
have to make an adjustment. We follow the method proposed in Lloyd (2005) which
involves the so-called receiver operating characteristic (ROC) curves. In particular,
let G(t) = P r(T ≥ t) be the survivor function of a general test statistic T , and
c = inf{t : G(t) ≤ α} be the critical value, then ROC curves can be formulated
by plotting the power G1 (c) against the size G0 (c) for various values of the critical
value c. Note that with G0 (t), we denote the distribution of the test statistic under the
null hypothesis and with G1 (t) under the alternative.
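Given simulated values of a test statistic under the null and the alternative, this size-adjusted comparison reduces to a few lines; a sketch under the stated definitions (function names are ours):

```python
import numpy as np

def empirical_roc(t0, t1):
    """Size-adjusted comparison in the spirit of Lloyd (2005): for each
    candidate critical value c, pair the empirical size G0(c) = P(T >= c | H0)
    with the empirical power G1(c) = P(T >= c | H1)."""
    crit = np.sort(np.concatenate([t0, t1]))   # candidate critical values
    size = np.array([(t0 >= c).mean() for c in crit])
    power = np.array([(t1 >= c).mean() for c in crit])
    return size, power
```

Plotting power against size traces the ROC curve, so tests are compared at a common empirical size rather than at their (possibly liberal or conservative) nominal critical values.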
Figure 18.4. Exact simulated sizes against nominal sizes for the four cases of
contamination. The gray area depicts the range of exact simulated sizes in which
Dale’s criterion is satisfied. For a color version of this figure, see www.iste.co.uk/
zafeiris/data1.zip
Results are presented in Figure 18.5, from where we can observe that under the
adjustment the test statistics have similar behavior in terms of power for both cases
of no contamination and contamination from the Γ(1), with the performance being
reduced in the latter case. Note also that the results under the adjustment differ from
those of the preceding analysis. In particular, we can see that even though from
Figure 18.3 we derived the conclusion that the best results arise for small values of
α1 and large values of α2 in the no contamination case, T1 has the worst performance
among all the BHHJ tests under the adjustment in size. Similar conclusions can be
derived for the classical tests. For example, CS and CR have the worst performance
among the classical tests under the adjustment, although in Table 18.2, results indicate
the opposite. This behavior is explained by the fact that the power of a test is strongly
affected by how liberal or conservative it is, making the adjustment in size mandatory
before proceeding to the comparison.
Figure 18.5. Left: empirical ROC curves for the no contamination and contamination
from Γ(1) cases. Right: the same curves magnified over a relevant range of empirical
sizes. For a color version of this figure, see www.iste.co.uk/zafeiris/data1.zip
In addition, taking into account the results of Figure 18.5, we focus our interest on
the following four tests, two from each family, T4, MCS, T3 and MDI, which appear
to have the best performance in terms of power. Note that, even though MDI and T3
closely follow each other in terms of size, T3 appears to perform better in terms of
power. Additionally, we can see that the performance of T3 in terms of power closely
follows that of MCS, especially when the alternative distribution is contaminated
from the null. Although T4 appears to have the best performance among all competing
tests in terms of power, we should only consider it when the null distribution is
contaminated from a distribution which is neither far from nor close to the null,
since in every other case its exact simulated size fails to satisfy Dale's criterion.
In conclusion, based on the analysis conducted with regard to the two families of
estimators and test statistics, namely the BHHJ and the Cressie–Read families, we
can make the following remarks. For estimation purposes, under contamination, the
best estimators arise for large values of the index α2 and small negative values of the
parameter λ, while the opposite is true when there is no contamination. In relation to
testing procedures, when the null distribution is not contaminated or is contaminated
from a distribution that is close to it, the best test statistics from the BHHJ family arise
for values of the indices α1 and α2 close to 0.50, say between 0.40 and 0.60, while the
most prominent members of the Cressie–Read family arise for values of λ ∈ [−2, −1].
In the case where the contaminant distribution lies neither too close nor too far from
the null, only test statistics that are members of the BHHJ family with large values of
α1 near 0.90 and moderate values of α2 near 0.30 are appropriate choices.
18.5. References
Ali, S.M. and Silvey, S.D. (1966). A general class of coefficients of divergence of one
distribution from another. Journal of the Royal Statistical Society Series B, 28, 131–142.
Arndt, C. (2001). Information Measures. Springer, Berlin, Heidelberg.
Basu, A., Basu, S., Chaudhuri, G. (1997). Robust minimum divergence procedures for count
data models. Sankhyā: The Indian Journal of Statistics Series B (1960–2002), 59(1), 11–27.
Basu, A., Harris, I.R., Hjort, N.L., Jones, M.C. (1998). Robust and efficient estimation by
minimising a density power divergence. Biometrika, 85, 549–559.
Basu, S., Basu, A., Jones, M.C. (2006). Robust and efficient parametric estimation for censored
survival data. Annals of the Institute of Statistical Mathematics, 58, 341–355.
Basu, A., Shioya, H., Park, C. (2011). Statistical Inference: The Minimum Distance Approach.
Chapman & Hall/CRC Press, Boca Raton, FL.
Batsidis, A., Martin, N., Pardo Llorente, L., Zografos, K. (2016). ϕ-divergence based procedure
for parametric change-point problems. Methodology and Computing in Applied Probability,
18(1), 21–35.
Cavanaugh, J.E. (2004). Criteria for linear model selection based on Kullback’s symmetric
divergence. Australian & New Zealand Journal of Statistics, 46, 257–274.
Cover, T.M. and Thomas, J.A. (2006). Elements of Information Theory. John Wiley and Sons,
New York.
Cressie, N. and Read, T.R.C. (1984). Multinomial goodness-of-fit tests. Journal of the Royal
Statistical Society Series B, 46, 440–464.
Csiszár, I. (1963). Eine informationstheoretische Ungleichung und ihre Anwendung auf den
Beweis der Ergodizität von Markoffschen Ketten. Publications of the Mathematical Institute
of the Hungarian Academy of Sciences, 8, 84–108.
D’Agostino, R.B. and Stephens, M.A. (1986). Goodness-of-Fit Techniques. Marcel Dekker,
New York.
Dale, J.R. (1986). Asymptotic normality of goodness-of-fit statistics for sparse product
multinomials. Journal of the Royal Statistical Society. Series B (Methodological), 48(1),
48–59.
Ferentinos, K. and Papaioannou, T. (1979). Loss of information due to groupings. Transactions
of the Eighth Prague Conference on Information Theory, Statistical Decision Functions,
Random Processes, vol. C, 87–94, Reidel, Dordrecht-Boston, MA.
Ferentinos, K. and Papaioannou, T. (1983). Convexity of measures of information and loss
of information due to grouping of observations. Journal of Combinatorics, Information &
System Sciences, 8(4), 286–294.
Fisher, R.A. (1924). The conditions under which χ2 measures the discrepancy between
observation and hypothesis. Journal of the Royal Statistical Society, 87, 442–450.
Ghosh, A., Maji, A., Basu, A. (2013). Robust inference based on divergence. In Applied
Reliability Engineering and Risk Analysis: Probabilistic Models and Statistical Inference,
Frenkel, I., Karagrigoriou, A., Lisnianski, A., Kleiner, A. (eds). John Wiley and Sons,
New York.
Gokhale, D.V. and Kullback, S. (1978). The Information in Contingency Tables, vol. 23. Marcel
Dekker, New York.
Kagan, A.M. (1963). On the theory of Fisher’s amount of information. Soviet Mathematics –
Doklady, 4, 991–993.
Kateri, M. and Papaioannou, T. (1997). Asymmetry models for contingency tables. Journal of
the American Statistical Association, 92(439), 1124–1131.
Kateri, M. and Papaioannou, T. (2007). Measures of symmetry-asymmetry for square
contingency tables. TR07-3, University of Piraeus [Online]. Available at: https://www.
researchgate.net/profile/Takis-Papaioannou-2/publication/255586795_Measures_of_
Symmetry-Asymmetry_for_Square_%20Contingency_Tables/links/543147840cf27
e39fa9eb943/Measures-of-Symmetry-Asymmetry-for-Square-Contingency-Tables.pdf.
Kateri, M., Papaioannou, T., Ahmad, R. (1996). New association models for the analysis of sets
of two-way contingency tables. Statistica Applicata, 8, 537–551.
Kullback, S. (1985). Minimum discrimination information (MDI) estimation. In Encyclopedia
of Statistical Sciences, Volume 5, Kotz, S. and Johnson, N.L. (eds). John Wiley and Sons,
New York.
Kullback, S. and Leibler, R. (1951). On information and sufficiency. Annals of Mathematical
Statistics, 22, 79–86.
Liese, F. and Vajda, I. (1987). Convex Statistical Distances. Teubner, Leipzig.
Lin, N. and He, X. (2006). Robust and efficient estimation under data grouping. Biometrika,
93(1), 99–112.
Lloyd, C.J. (2005). Estimating test power adjusted for size. Journal of Statistical Computation
and Simulation, 75(11), 921–933.
Mathai, A. and Rathie, P.N. (1975). Basic Concepts in Information Theory. John Wiley and
Sons, New York.
Mattheou, K. and Karagrigoriou, A. (2010). A new family of divergence measures for tests of
fit. Australian and New Zealand Journal of Statistics, 52, 187–200.
Mattheou, K., Lee, S., Karagrigoriou, A. (2009). A model selection criterion based on the BHHJ
measure of divergence. Journal of Statistical Planning and Inference, 139, 128–135.
Matusita, K. (1967). On the notion of affinity of several distributions and some of its
applications. Annals of the Institute of Statistical Mathematics, 19, 181–192.
Menéndez, M.L., Morales, D., Pardo, L., Vajda, I. (2001). Approximations to powers
of ϕ-disparity goodness-of-fit. Communications in Statistics – Theory and Methods, 30,
105–134.
Meselidis, C. and Karagrigoriou, A. (2020). Statistical inference for multinomial populations
based on a double index family of test statistics. Journal of Statistical Computation and
Simulation, 90(10), 1773–1792.
Nadarajah, S. and Zografos, K. (2003). Formulas for Renyi information and related measures
for univariate distributions. Information Sciences, 155, 118–119.
Neyman, J. (1949). Contribution to the theory of χ2 test. In Proceedings of the 1st Symposium
on Mathematical Statistics and Probability, University of Berkeley, 239–273.
Papaioannou, T. (1985). Measures of information. In Encyclopedia of Statistical Sciences,
Vol. 5, Kotz, S. and Johnson, N.L. (eds). Wiley, Hoboken, NJ.
Papaioannou, T., Ferentinos, K., Menéndez, M.L., Salicrú, M. (1994). Discretization of
(h,ϕ)-divergences. Information Sciences, 77(3–4), 351–358.
Papaioannou, T., Ferentinos, K., Tsairidis, C. (2007). Some information theoretic ideas useful
in statistical inference. Methodology and Computing in Applied Probability, 9(2), 307–323.
Pardo, L. (2006). Statistical Inference Based on Divergence Measures. Chapman & Hall/CRC,
Boca Raton, FL.
Pardo, J.A. (2010). An approach to multiway contingency tables based on ϕ-divergence test
statistics. Journal of Multivariate Analysis, 101, 2305–2319.
Patra, S., Maji, A., Basu, A., Pardo, L. (2013). The power divergence and the density power
divergence families: The mathematical connection. Sankhya B, 75(1), 16–28.
Renyi, A. (1961). On measures of entropy and information. Proceedings of the Fourth Berkeley
Symposium on Mathematical Statistics and Probability, 1, 547–561.
Sachlas, A. and Papaioannou, T. (2014). Residual and past entropy in actuarial science and
survival models. Methodology and Computing in Applied Probability, 16(1), 79–99.
Shannon, C.E. (1948). A mathematical theory of communication. Bell System Technical
Journal, 27(3), 379–423.
Toma, A. (2009). Optimal robust M-estimators using divergences. Statistics and Probability
Letters, 79, 1–5.
Toma, A. and Broniatowski, M. (2011). Dual divergence estimators and tests: Robustness
results. Journal of Multivariate Analysis, 102(1), 20–36.
Tsairidis, C., Ferentinos, K., Papaioannou, T. (1996). Information and random censoring.
Information Science, 92(1–4), 159–174.
Tsairidis, C., Zografos, K., Ferentinos, K., Papaioannou, T. (2001). Information in quantal
response data and random censoring. Annals of the Institute of Statistical Mathematics, 53(3),
528–542.
Victoria-Feser, M. and Ronchetti, E. (1997). Robust estimation for grouped data. Journal of the
American Statistical Association, 92(437), 333–340.
Vonta, F. and Karagrigoriou, A. (2010). Generalized measures of divergence in survival analysis
and reliability. Journal of Applied Probability, 47(1), 216–234.
Zografos, K. and Nadarajah, S. (2005). Survival exponential entropies. IEEE Transactions on
Information Theory, 51, 1239–1246.
Zografos, K., Ferentinos, K., Papaioannou, T. (1986). Discrete approximations to the Csiszár,
Rényi, and Fisher measures of information. Canadian Journal of Statistics, 14(4), 355–366.
Zografos, K., Ferentinos, K., Papaioannou, T. (1990). Divergence statistics: Sampling
properties and multinomial goodness of fit and divergence tests. Communications in
Statistics-Theory and Methods, 19(5), 1785–1802.
PART 3
Data Analysis and Related Applications 1, First Edition. Edited by Konstantinos N. Zafeiris,
Christos H. Skiadas, Yiannis Dimotikalis, Alex Karagrigoriou and Christiana Karagrigoriou-Vonta.
© ISTE Ltd 2022. Published by ISTE Ltd and John Wiley & Sons, Inc.
19
This study models Tokyo official land price data using geographically weighted
regression (GWR) and multi-scale GWR (MGWR) models. The GWR model spatially
explores the varying relationships between land prices and the explanatory variables.
Based on the estimated model parameters, the influence of land individuality increases
as the estimated bandwidth parameters in the GWR model decrease. These facts are
also confirmed by the local regression coefficients of the access index, the distance
to the nearest station and residential area dummy variables. The differences between
local coefficients for some convenience indicators, including access time to central
Tokyo and walking distances to nearest stations, tend to increase between the west
and central areas of Tokyo.
19.1. Introduction
evaluation procedures are set for normal land prices, the price announcements of
public land are stationary observations, and since survey points can change frequently,
it is difficult to monitor land prices at the same time points over a long period of time.
Thus, in this study, we analyze the changes in the price announcements for public
land using the geographically weighted regression (GWR) and the multi-scale GWR
(MGWR) models for a total of 38,914 residential use land areas in Tokyo based on
price announcements for public land from 1997 to 2018.
The GWR model was proposed by Brunsdon et al. (1996) and Fotheringham et al.
(1998) as a spatial statistical model considering spatial heterogeneity. In other words,
the GWR model is a local regression model that captures spatial heterogeneity or
non-stationarity by estimating spatially varying regression coefficients. One of the
disadvantages of the GWR model is that the multicollinearity between explanatory
variables occurs when using a common bandwidth for spatial kernels for all
explanatory variables, which yields similar or unstable regression coefficients for the
target area. Hence, various models that extend the GWR model have been proposed
and applied in the literature. For instance, the mixed GWR model is a mixed model
of linear regression and GWR, which attempts to explain both the global variables
common to all observations and the local variations in the characteristics of each site
(Lee et al. 2009). In this study, we overcome the shortcomings of the GWR model
by using the MGWR model, which estimates the local regression coefficients using
variable-specific bandwidth for spatial kernels (Lu et al. 2017) 1.
Various global GWR applications also exist in the literature. For example, Cho
et al. (2006) estimated the GWR model using housing data from Knox County,
Tennessee, showing that the proximity of water areas and parks to housing is reflected
in the price. Helbich et al. (2014) estimated a mixed GWR model that distinguishes
between local and global explanatory variables based on Austrian residential data.
Using sample size-based distance measures for spatial kernels, Lu et al. (2015) argued
that the fitting performance is better than that for the usual distance-based kernels,
and constructed a parameter-specific distance metrics-GWR (PSDM-GWR) model
using both distinct bandwidth and metric functions of each explanatory variable.
Additionally, they proposed back-fitting algorithms to fit the generalized linear model
with the parameter estimation of PSDM-GWR models. Lu et al. (2017) estimated
GWR and MGWR models using housing transaction prices in London in 2001,
and showed that the MGWR model is superior in terms of fitting and prediction
accuracy. Recently, several studies extended the GWR and MGWR models to the
space–time dimensions, including that of Huang et al. (2010). LeSage and Pace (2009)
derived estimates focusing on the results of spatiotemporal long-term equilibrium with
1 Both mixed GWR and multi-scale GWR are sometimes referred to as MGWR but, to avoid
confusion, we refer to multi-scale GWR as MGWR.
Geographically Weighted Regression for Official Land Prices 263
regard to the use of cross-sectional data and focusing on the dynamics embodied by
time-dependent parameters with regard to the use of spatiotemporal data.
In this study, we assume independence between the different time points and
estimate secular changes under the GWR and MGWR models. This study thus clarifies
the interannual variability of geographical and environmental factors for land prices
by applying the GWR and MGWR models. Our findings are as follows. For over
20 years, the individual factors of land prices increase, as demonstrated by the increase
in the local regression coefficients in Tokyo. Additionally, due to the secular changes
in environmental factors that indicate convenience, the land price differences between
the central and southern parts of the 23 wards, including the surrounding and other
areas, increase. In particular, in the western part of Tokyo, the estimates of the MGWR
model show that the land price differences in the eastern part increase towards the
west. Moreover, the influence of higher land prices was stronger in the southern part
of Kitatama (Northern Tama Area) than in the northeastern part of the 23 wards. There
are also regional differences in land preferences. From the central to the northern
areas of the 23 wards and from the western area of Kitatama to the eastern area of
Nishitama (Western Tama Area), low-rise residential areas that have an emphasis
on the living environment are preferred. Conversely, in the Minamitama (Southern
Tama) area, residential areas and semi-residential areas that have an emphasis on
convenience and commerciality are preferred. Each influence became stronger as time
progressed. Furthermore, the above-mentioned effects significantly changed before
the 2008 financial crisis and remained stable after the crisis.
Let yt(s) denote the logarithmic public land price of site s ∈ D in region D at time t.
Then, the global or non-spatial model can be expressed as follows:

y_t(s) = X_t(s) β_t + ε_t(s),

where Xt(s) is the matrix of the explanatory variables described in the previous
section, βt is the vector of regression coefficients, including the constant term, and
εt(s) is the error term, which is assumed to be independent over time t and sites s.
Under the GWR, to estimate β_{t,i} = [β_{0,t,i}, β_{1,t,i}, ..., β_{k−1,t,i}]′, we use the
generalized least-squares (GLS) method based on the following weighted model:

V_{t,i}^{1/2} y_t(s) = V_{t,i}^{1/2} X_t(s) β_{t,i} + V_{t,i}^{1/2} ε_t(s).

Here, V_{t,i} is a diagonal matrix whose j-th diagonal element v_{t,i,j} is the weight
given to site j.
The estimator of the local regression coefficient at site i and time t is then given by:

β̂_{t,i} = (X_t′ V_{t,i} X_t)^{−1} X_t′ V_{t,i} y_t.
In the GWR model, it is important to define the weight matrix V_{t,i}. To this end, we
use a Gaussian distance-decay function:

v_{t,i,j} = exp(−d_{i,j}² / δ_t²),
where di,j is the Euclidean distance between i and j. δt is the bandwidth of a common
spatial kernel at time t. Bandwidth δt is determined by minimizing the cross-validation
(CV) error of the following equation:
δ̂_t = argmin_{δ_t} CV(δ_t),   CV(δ_t) = Σ_{i=1}^{n} [y_{t,i} − ŷ_{t,≠i}(δ_t)]²,   [19.1]

where ŷ_{t,≠i}(δ_t) is the predicted value at site i obtained without using site i. If
the spatial distribution of the observed points is not constant, an adaptive kernel that
adjusts the bandwidth according to the number of samples, not the distance, may be
used; see, for example, Lu et al. (2015).
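The local estimator and the CV score of equation [19.1] can be sketched numerically as follows; this is an illustrative implementation with a simple grid search (function names and the grid search are our own choices, not the authors' code):

```python
import numpy as np

def gwr_coef(X, y, coords, target, delta):
    """Local WLS estimate beta_hat = (X' V X)^(-1) X' V y with Gaussian
    distance-decay weights v_j = exp(-d_j^2 / delta^2) centred at `target`."""
    d2 = ((coords - target) ** 2).sum(axis=1)
    v = np.exp(-d2 / delta ** 2)
    Xv = X * v[:, None]                      # rows of X scaled by their weights
    return np.linalg.solve(Xv.T @ X, Xv.T @ y)

def cv_error(X, y, coords, delta):
    """Leave-one-out CV score of equation [19.1]: each site is predicted
    from a local fit that excludes that site."""
    n = len(y)
    sse = 0.0
    for i in range(n):
        m = np.arange(n) != i
        b = gwr_coef(X[m], y[m], coords[m], coords[i], delta)
        sse += (y[i] - X[i] @ b) ** 2
    return sse

def select_bandwidth(X, y, coords, grid):
    """Grid-search version of the argmin in equation [19.1]."""
    return min(grid, key=lambda d: cv_error(X, y, coords, d))
```

With a very large bandwidth every weight is close to 1, so the local estimate collapses to the global least-squares fit, which is a convenient sanity check.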
Let the explanatory variable for site s0 be Xt(s0). Then, the predicted value of
the log official land price becomes:

ŷ_t(s_0) = X_t(s_0) β̂_{t,s_0},

where β̂_{t,s_0} is the local coefficient vector estimated with weights centered at s0;
see, for example, Leung et al. (2000) and Harris et al. (2011), where the corresponding
variance of the predictor is also given.
Brunsdon et al. (1999) pointed out that, for the GWR and mixed GWR
models, the bandwidths of the common spatial kernels are sometimes restrictive
and the resulting GWR estimates tend to be inflexible. Additionally, Wheeler and
Tiefelsdorf (2005) explained that, under the GWR model, there exists instability
that creates multicollinearity due to the similarities of local explanatory variables.
Hence, Yang (2014) proposed the MGWR model, which applies a distinct bandwidth
for each explanatory variable for the spatial kernels. The MGWR model can
provide more location-specific regression surfaces, which makes it possible to avoid
multicollinearity between variables. In this study, we use the following extended
algorithm, as proposed by Lu et al. (2017).
Step 0: Data formatting: we denote the log land price and the data matrix by y_t and
X_t, respectively, for time t (1 ≤ t ≤ T) and site i (1 ≤ i ≤ p). Let V_{k,t,i}^{(0)} be the initial
weight matrix for t, i and the k-th regression coefficient in the GWR model. The
initial kernel bandwidth is set to bw_{k,t}^{(0)}. The required precision is denoted by τ > 0,
and the maximum number of iterations is set as N.

Step 1: Initialization: initial estimates β̂_t^{(0)} = [β̂_{0,t}^{(0)}, β̂_{1,t}^{(0)}, ..., β̂_{k−1,t}^{(0)}] are
obtained by the GWR model. Then, we calculate ŷ_{0,t}^{(0)} = X_{0,t} ∘ β̂_{0,t}^{(0)}, ŷ_{1,t}^{(0)} =
X_{1,t} ∘ β̂_{1,t}^{(0)}, ..., ŷ_{k−1,t}^{(0)} = X_{k−1,t} ∘ β̂_{k−1,t}^{(0)}, where X_{h−1,t} denotes the h-th row
of matrix X_t and ∘ is the Hadamard product. We obtain the residual sum of squares,
RSS^{(0)} = Σ (y_t − Σ_{j=0}^{k−1} ŷ_{j,t}^{(0)})².
Step 2: Update the n-th estimates using the estimates of the (n − 1)-th iteration
as follows. Here, we re-index the explanatory variables as X_{l,t} (0 ≤ l ≤ m).

1) Calculate ξ_{l,t}^{(n)} = y_t − Σ_{j≠l} Latestyhat(ŷ_{j,t}^{(n−1)}, ŷ_{j,t}^{(n)}), where Σ_{j≠l} denotes
the sum over all indices other than l and

Latestyhat(ŷ_{j,t}^{(n−1)}, ŷ_{j,t}^{(n)}) = ŷ_{j,t}^{(n)}, if ŷ_{j,t}^{(n)} exists, and ŷ_{j,t}^{(n−1)} otherwise.

2) We calculate the bandwidth bw_{l,t}^{(n)} using criteria such as the CV scoring method
and obtain the weight matrix V_{l,t,i}^{(n)}. We then calculate β̂_{l,t}^{(n)} using ξ_{l,t}^{(n)} and X_{l,t}.

3) We update ŷ_{l,t}^{(n)} = X_{l,t} ∘ β̂_{l,t}^{(n)}.

The iterations stop when the change in the residual sum of squares,

CVR^{(n)} = (RSS^{(n)} − RSS^{(n−1)}) / RSS^{(n−1)},   [19.2]

falls below the required precision τ or the maximum number of iterations N is reached.
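The back-fitting loop condenses to a few lines for the special case of Gaussian kernels, fixed per-variable bandwidths and per-variable simple weighted regressions; this is a simplified sketch in the spirit of the algorithm above (names and simplifications are ours):

```python
import numpy as np

def backfit_mgwr(X, y, coords, bandwidths, n_iter=30, tol=1e-6):
    """Back-fit local coefficients with a distinct Gaussian-kernel bandwidth
    per explanatory variable; returns an n x k matrix beta so that the fitted
    surface is sum_l X[:, l] * beta[:, l]."""
    n, k = X.shape
    beta = np.zeros((n, k))
    yhat = np.zeros((n, k))              # per-variable fitted parts x_l * beta_l
    rss_prev = np.inf
    for _ in range(n_iter):
        for l in range(k):
            # partial residual for variable l (Step 2.1)
            xi = y - yhat.sum(axis=1) + yhat[:, l]
            for i in range(n):
                d2 = ((coords - coords[i]) ** 2).sum(axis=1)
                v = np.exp(-d2 / bandwidths[l] ** 2)
                # weighted simple regression of xi on x_l, centred at site i (Step 2.2)
                beta[i, l] = (v * X[:, l] * xi).sum() / (v * X[:, l] ** 2).sum()
            yhat[:, l] = X[:, l] * beta[:, l]        # Step 2.3
        rss = ((y - yhat.sum(axis=1)) ** 2).sum()
        # CVR stopping rule of equation [19.2]
        if np.isfinite(rss_prev) and abs(rss - rss_prev) / rss_prev < tol:
            break
        rss_prev = rss
    return beta
```

With large, equal bandwidths the weights are nearly uniform and the back-fit converges towards the global least-squares coefficients, which makes the sketch easy to verify.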
19.3.1. Data
The public announcement of land prices in 2018 was conducted for 47 prefectures
nationwide in Japan, targeting 20,572 areas for urbanization, 1,394 urbanization
control areas, 4,015 other urban planning areas and 19 publicly announced areas
outside the urban planning area, for a total of 26,000 standard land areas. In Tokyo,
there were 2,602 sites and 1,540 residential zones, excluding islands. In this study,
we use the public announcement of land price data for residential zoning in Tokyo
as of January 1 of each year from 1997 to 2018. A total of 38,914 data points exist during the 22-year
analysis period. The number of sites subject to the public announcement of land prices
for residential zoning changed annually as needed, varying from 1,200 to 2,000.
We estimate each model using the official land price as the objective variable. As
explanatory variables, we selected the following seven variables: (1) access index of
the target site (minutes), (2) distance to the main nearest station (m), (3) front road
width (m), (4) land area of the target site (m2 ), (5) low-rise residential area dummy,
(6) residential area dummy, and (7) gas equipment dummy. All variables except for
the dummy ones are transformed into logarithmic values.
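Assembling these regressors is mechanical; a sketch (function and argument names are ours) with the continuous variables entering in logs and the three dummies as 0/1 indicators:

```python
import numpy as np

def design_matrix(access_min, station_m, road_w, area_m2, lowrise, resid_d, gas):
    """Column order: constant, log access index, log nearest-station distance,
    log front road width, log land area, low-rise dummy, residential dummy,
    gas equipment dummy."""
    cont = np.log(np.column_stack([access_min, station_m, road_w, area_m2]))
    dums = np.column_stack([lowrise, resid_d, gas]).astype(float)
    return np.column_stack([np.ones(len(access_min)), cont, dums])
```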
Figure 19.1 shows a boxplot of the transitions in official land prices for 22 analyzed
years. There are outliers above the boxplot due to the presence of very high land prices.
The average official land price at the analysis sites was 393,000 yen/m2 in 2018,
and the median value was 310,000 yen/m2 . Regarding the time-series transitions,
land prices had been declining since 1997 until the early 2000s and then rose until
2008, after which they showed a downward trend once more due to the effects of
the 2008 financial crisis. In recent years, the upward trend of high official land
price sites has been remarkable. Figure 19.2 shows the distribution of official land
prices in the residential areas of Tokyo in 2018. The highest official land price was
4,010,000 yen/m2 and the lowest was 45,000 yen/m2 . The official land prices are
generally high near the central area of the 23 wards, which is also the center of the
city, but they are not always high within the other 23 wards, except for the Adachi,
Katsushika and Edogawa wards in the northeastern part of Tokyo and Musashino City
and Mitaka City, which are adjacent to the 23 wards to the west. Additionally, the
locations and numbers of public notice points are highly biased by region.
Figure 19.1. Boxplots of official land prices in Tokyo from 1997 to 2018
19.3.2. Results
Table 19.1. Regression coefficients for the non-spatial and GWR models for 2018
Table 19.1 shows a comparison of the regression coefficients for the non-spatial
and GWR models. Under the non-spatial model, the low-rise residential area and
residential area dummies are insignificant at the 5% significance level. The local
regression coefficient on the GWR model is estimated for each site, and there is
a range in the distribution of the regression coefficients. If we compare the median
values of the regression coefficients estimated by the GWR model, then the absolute
value of the estimates, which was significant under the non-spatial model, becomes
smaller. Table 19.2 shows a comparison with the MGWR model. If we compare the
median values of the regression coefficients, the estimates for the GWR and MGWR
models take similar values, which are smaller in absolute value than those of the
non-spatial model.
The range of each regression coefficient, which was large under the GWR model, is
smaller under the MGWR model, probably because the GWR model uses a common
bandwidth for the spatial kernels of all explanatory variables. Specifically, this
common bandwidth might be too large or too small for individual variables in the
GWR model, whereas each bandwidth is estimated adequately under the
variable-specific spatial kernels of the MGWR model.
Figure 19.3 shows the time-series transition of the estimated regression coefficients
under the GWR model. The Gaussian distance-decay function is adopted, and the
common bandwidth for the spatial kernels is determined by the CV scoring method,
according to equation [19.1]. Except for the intercept, the range of the regression
coefficients on the access index and gas equipment dummy is larger than for the other
regression coefficients. Additionally, outliers are present for all regression coefficients.
The regression coefficients on the access index, nearest station distance and the gas
equipment dummy took on a negative trend in recent years, similar to the non-spatial
model. This fact indicates that, if the explanatory variables are at the same level, the
effect of reducing land prices becomes stronger over time. No visual trend is observed
for the coefficients on the other explanatory variables.
Table 19.2. Regression coefficients for the GWR and MGWR models for 2018
Figure 19.3. Boxplots of the transition for the estimated GWR model parameters
Figure 19.4 shows the time-series transition of the estimated regression coefficients
under the MGWR model using boxplots. We use the algorithm of Lu et al. (2017)
for parameter estimation. The bandwidth for the variable-specific spatial kernels is
determined by converging the CVR in equation [19.2]. Compared to the GWR model,
the range of the regression coefficients is smaller and the vertical length of the boxplot
becomes longer with fewer outliers. In addition to the access index and nearest station
distance, a negative trend can be confirmed for front road width and the gas equipment
dummy, and a positive one for the low-rise residential area dummy. The range of
the boxplots for the access index, nearest station distance, low-rise residential area
dummy and residential area dummy becomes larger over time, which indicates that the
270 Data Analysis and Related Applications 1
individual factor of land for the explanatory variable becomes stronger with respect
to land price. Additionally, the increase in the range of the constant terms means that
the individual factors of land for the explanatory variables not used in this analysis are
likely increasing.²
Table 19.3 summarizes the fitting performances of the non-spatial, GWR and
MGWR models by the land price function in 2018. The MSE indicates the
mean-squared error, and the prediction accuracy is defined by the following equation
[19.3]:
Prediction accuracy = (1/n) Σ_{i=1}^{n} [ (exp(y_t(s_i)) − exp(ŷ_t(s_i))) / exp(y_t(s_i)) ]² × 100 (%). [19.3]
2 The individual factors of land include areas of caution on hazard maps, the existence of crime,
local sunshine and noise conditions, as well as the location of garbage collection sites.
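The prediction accuracy measure in equation [19.3] is easy to restate in code. The sketch below is illustrative, with hypothetical argument names y_log and yhat_log standing for the observed and fitted log prices y_t(s_i) and ŷ_t(s_i); it back-transforms the log prices and averages the squared relative errors:

```python
import math

def prediction_accuracy(y_log, yhat_log):
    # Equation [19.3]: mean squared relative error of the back-transformed
    # (exponentiated) land prices, expressed as a percentage.
    n = len(y_log)
    total = sum(((math.exp(y) - math.exp(yh)) / math.exp(y)) ** 2
                for y, yh in zip(y_log, yhat_log))
    return total / n * 100.0
```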
Regarding the goodness of fit, it is desirable that the AICc and prediction accuracy
(%) are small and the adjusted R2 is close to 1. As for the spatial correlation of
residuals, it is desirable that Moran’s I is close to 0 because the spatial correlation
cannot be confirmed for the error term if the spatial regression model is fitted properly.
From this table, the MGWR model outperforms the non-spatial and GWR models in
2018.
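Moran's I, used throughout this comparison to check residual spatial correlation, follows the standard definition I = (n/S0) · Σ_ij w_ij (x_i − x̄)(x_j − x̄) / Σ_i (x_i − x̄)², where S0 is the sum of the spatial weights. The code below is an illustrative sketch of that definition, not the chapter's implementation:

```python
def morans_i(values, weights):
    """Moran's I for residuals x_i and a spatial weight matrix weights[i][j].
    Values near 0 indicate no spatial correlation; positive values indicate
    that similar residuals cluster in space."""
    n = len(values)
    xbar = sum(values) / n
    dev = [x - xbar for x in values]
    s0 = sum(sum(row) for row in weights)                 # total weight
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * num / den
```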
Figure 19.5 shows the time-series transition of the fitting performance for each
model. The MGWR model has the best fit of the three models every year. Since the
adjusted R2 of the non-spatial model fluctuates around 0.84, the non-spatial model
can explain a large proportion of the official land price, but residual Moran’s I is
around 0.50. If there is a spatial correlation, the adjusted R2 is overestimated. The
fit of the GWR and MGWR model is significantly better than that of the non-spatial
model, as the adjusted R2 of the GWR model is around 0.97 and the AICc ranges
from −4,000 to −1,500. Since the transition of residual Moran’s I is around 0.03, no
significant spatial correlation is observed. In the MGWR model, the adjusted R2 is
around 0.98 and the AICc is from −5,000 to −2,000, which is even better than for the
GWR model and the MSE and prediction accuracy are also improved. The change in
residual Moran’s I ranges from −0.04 to −0.03, and no significant spatial correlation
is observed even at the 1% level. Additionally, the MSE and prediction accuracy of
the MGWR model and residual Moran’s I are stable.
Figure 19.6 shows the time-series transition of the bandwidth for the
variable-specific spatial kernels under the MGWR model. The “GWR” label in the
top-right panel indicates the common bandwidth for spatial kernels under the GWR
model. Under this model, the kernel bandwidth changed from 1.4 km to 1.9 km. After
the burst of the bubble economy, the range of official land prices narrowed during their
downward trend, and has remained stable since then. Under the MGWR model, the
trend in the kernel bandwidth for each explanatory variable was confirmed, regardless
of the trend of official land prices. The kernel bandwidth of the constant term is smaller
than that for the other explanatory variables due to the increase in the individual factors
of land over time that cannot be explained by the explanatory variables included in
this study. Moreover, the kernel bandwidths of the access index, nearest station
distance, front road width, residential area dummy and gas equipment dummy jump
from 2013 to 2016. The cause is thought to be a large change in the number of
publicly announced points. The fact that the kernel bandwidth of the front road width
is larger than those of the other explanatory variables and an upward trend can be seen
in recent years indicates that it may become a global explanatory variable. We would
like to consider these cases in future research.
19.4. Conclusion
This study estimated a land price model using 38,914 sites over 22 years in the
residential areas of Tokyo. Three models were compared based on their fitting
performances, namely the non-spatial, GWR and MGWR models. We found that the
MGWR model with variable-specific bandwidth for spatial kernels has a better fit than
the non-spatial and GWR models in terms of the adjusted R2 , AICc, MSE, prediction
accuracy and spatial correlation of residuals. The results of the MGWR model and
its visualization confirmed that the individuality of the land, which is a factor of land
price formation, is gradually strengthened by the increase in the range of each local
regression coefficient. The effects of the access index, nearest station distance and
low-rise residential area dummy are remarkable. Additionally, from the increase in
the range of constant terms, the individuality of the land other than the explanatory
variables used in this study strengthened.
19.5. Acknowledgments
This research was supported in part by JSPS KAKENHI Grant Number 18K01706
and Nanzan University Pache Research Subsidy I-A-2 for the 2021 academic year.
19.6. References
Brunsdon, C., Fotheringham, A.S., Charlton, M.E. (1996). Geographically weighted regression:
A method for exploring spatial nonstationarity. Geographical Analysis, 28(4), 281–298.
Brunsdon, C., Fotheringham, A.S., Charlton, M.E. (1999). Some notes on parametric
significance tests for geographically weighted regression. Journal of Regional Science, 39(3),
497–524.
Chay, K.Y. and Greenstone, M. (2005). Does air quality matter? Evidence from the housing
market. Journal of Political Economy, 113(2), 376–424.
Cho, S.H., Bowker, J.M., Park, W.M. (2006). Measuring the contribution of water and green
space amenities to housing values: An application and comparison of spatially weighted
hedonic models. Journal of Agricultural and Resource Economics, 31(3), 485–507.
Fotheringham, A.S., Charlton, M.E., Brunsdon, C. (1998). Geographically weighted regression:
A natural evolution of the expansion method for spatial data analysis. Environment and
Planning A, 30(11), 1905–1927.
Fotheringham, A.S., Crespo, R., Yao, J. (2015). Geographical and temporal weighted regression
(GTWR). Geographical Analysis, 47(4), 431–452.
Harris, P., Brunsdon, C., Fotheringham, A.S. (2011). Links, comparisons and extensions of
the geographically weighted regression model when used as a spatial predictor. Stochastic
Environmental Research and Risk Assessment, 25(2), 123–138.
Heckman, J.J., Matzkin, R.L., Nesheim, L. (2010). Nonparametric identification and estimation
of nonadditive hedonic models. Econometrica, 78(5), 1569–1591.
Helbich, M., Brunauer, W., Vaz, E., Nijkamp, P. (2014). Spatial heterogeneity in hedonic house
price models: The case of Austria. Urban Studies, 51(2), 390–411.
Huang, B., Wu, B., Barry, M. (2010). Geographically and temporally weighted regression for
modeling spatio-temporal variation in house prices. International Journal of Geographical
Information Science, 24(3), 383–401.
Lee, S., Kang, D., Kim, M. (2009). Determinants of crime incidence in Korea: A mixed GWR
approach. World Conference of the Spatial Econometrics Association, Barcelona.
LeSage, J.P. and Pace, R.K. (2009). Introduction to Spatial Econometrics. Chapman and
Hall/CRC, Boca Raton, FL.
Leung, Y., Mei, C.L., Zhang, W.X. (2000). Statistical tests for spatial nonstationarity based on
the geographically weighted regression model. Environment and Planning A, 32(1), 9–32.
Lu, B., Harris, P., Charlton, M., Brunsdon, C. (2015). Calibrating a geographically weighted
regression model with parameter-specific distance metrics. Procedia Environmental
Sciences, 26, 109–114.
Lu, B., Brunsdon, C., Charlton, M., Harris, P. (2017). Geographically weighted regression
with parameter-specific distance metrics. International Journal of Geographical Information
Science, 31(5), 982–998.
Wheeler, D.C. and Tiefelsdorf, M. (2005). Multicollinearity and correlation among local
regression coefficients in geographically weighted regression. Journal of Geographical
Systems, 7(2), 161–187.
Wu, C., Ren, F., Hu, W., Du, Q. (2019). Multiscale geographically and temporally weighted
regression: Exploring the spatiotemporal determinants of housing prices. International
Journal of Geographical Information Science, 33(3), 489–511.
Yang, W. (2014). An extension of geographically weighted regression with flexible bandwidth.
PhD Thesis, St Andrews.
20
Software Cost Estimation Using Machine Learning Algorithms
20.1. Introduction
Software cost estimation can be defined as “predicting the resources required for
a software development process”. The estimation process includes size estimation,
effort estimation, development of initial project schedules and, finally, estimation
of the overall project cost. This can be used to generate requests for proposals,
Data Analysis and Related Applications 1,
First Edition. Edited by Konstantinos N. Zafeiris, Christos H. Skiadas, Yiannis Dimotikalis, Alex
Karagrigoriou and Christiana Karagrigoriou-Vonta.
© ISTE Ltd 2022. Published by ISTE Ltd and John Wiley & Sons, Inc.
In this study, the cost of software projects is estimated by testing 29 different
machine learning algorithms in WEKA (the Waikato Environment for Knowledge
Analysis). The algorithms were applied to a Chinese dataset taken from the
PROMISE data repository.
20.2. Methodology
20.2.1. Dataset
In this study, we used a Chinese dataset that was taken from the PROMISE
software engineering data repository. The Chinese dataset was added to the
PROMISE repository in 2010. This dataset, although comparatively new, was used
in this study because it consisted of 499 records, which was a large number when
compared with most other publicly available software engineering datasets.
However, it is difficult to provide any further information about this dataset (Bosu
and MacDonell 2019). Both dependent and independent attributes were included in
the Chinese dataset, which were used to estimate the cost of software projects. This
dataset consisted of 19 features: 18 independent variables (ID, AFP, Input, Output,
Enquiry, File, Interface, Added, Changed, Deleted, PDR_AFP, PDR_UFP,
NPDR_AFP, NPDR_UFP, Resource, Dev.Type, Duration and N_effort) and one
dependent variable (Effort). The independent attributes of the dataset determined the
value of the dependent attribute. Table 20.1 presents the statistics of the Chinese
dataset.
Some of the independent variables that were not very important for predicting the
effort were removed, thus making the model simpler and more efficient (Prabhakar
and Dutta 2013). For example, the ID and Dev.Type attributes were deleted from the
Chinese dataset because they had no effect on effort estimation. The Chinese dataset
was analyzed by 29 machine learning algorithms. The dataset was randomly divided
into the training set and the test set using a k-fold cross-validation technique.
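A k-fold split of this kind can be sketched in a few lines of pure Python. This is illustrative only (the chapter used WEKA's built-in cross-validation); the function name and fixed seed are assumptions:

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Randomly partition record indices 0..n-1 into k roughly equal folds.
    Each fold serves once as the test set; the remaining folds form the
    training set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, test))
    return splits
```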
Software Cost Estimation Using Machine Learning Algorithms 277
Table 20.1. Statistics of the Chinese dataset (only the rows recoverable from the
extracted text are shown; the three numeric columns appear to be the minimum,
maximum and median values of each attribute):

No.  Attribute  Min  Max    Median
1    ID         1    499    250
5    Enquiry    0    952    62
6    File       0    2,955  91
7    Interface  0    1,572  24
9    Changed    0    5,193  85
10   Deleted    0    2,657  12
15   Resource   1    4      1
16   Dev.Type   0    0      0
17   Duration   1    84     9
20.2.2. Model
The WEKA contains a large number of machine learning algorithms for data
preprocessing, clustering, classification, regression, visualization and feature
selection.
In this study, the cost of software projects is estimated by using machine learning
algorithms in the WEKA. A total of 29 classification algorithms in the WEKA were
applied to the Chinese dataset.
The algorithms under the Meta group, LWL in the Lazy group and Input Mapped
in the Rules group take, in addition to their own parameters, a basic classifier and its
parameters. Therefore, the classifier parameters were changed from the properties
window to get the best performance, and the REP Tree classification algorithm was
chosen for all of them in order to have an accurate comparison.
MAE = (1/n) Σ_{i=1}^{n} |Pi − Ai| [20.1]
where Pi is the estimated value, Ai is the actual value and n is the number of
samples.
RMSE = sqrt( (1/n) Σ_{i=1}^{n} (Pi − Ai)² ) [20.2]
where Pi is the estimated value, Ai is the actual value and n is the number of
samples.
RAE = ( Σ_{i=1}^{n} |Pi − Ai| / Σ_{i=1}^{n} |Ai − Am| ) × 100 (%) [20.3]

where Pi is the estimated value, Ai is the actual value, Am is the mean of the actual
values and n is the number of samples.
RRSE = sqrt( Σ_{i=1}^{n} (Pij − Ai)² / Σ_{i=1}^{n} (Ai − Am)² ) × 100 (%) [20.4]

where Pij is the value predicted by the individual model j for data point i, Ai is the
actual value, Am is the mean of the actual values and n is the number of samples.
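The four measures can be sketched as follows (illustrative code, following the usual WEKA conventions in which Am denotes the mean of the actual values and RAE/RRSE are reported as percentages):

```python
import math

def mae(p, a):
    # Mean absolute error, equation [20.1]
    return sum(abs(pi - ai) for pi, ai in zip(p, a)) / len(a)

def rmse(p, a):
    # Root mean squared error, equation [20.2]
    return math.sqrt(sum((pi - ai) ** 2 for pi, ai in zip(p, a)) / len(a))

def rae(p, a):
    # Relative absolute error (%), equation [20.3]
    am = sum(a) / len(a)
    return (sum(abs(pi - ai) for pi, ai in zip(p, a))
            / sum(abs(ai - am) for ai in a)) * 100.0

def rrse(p, a):
    # Root relative squared error (%), equation [20.4]
    am = sum(a) / len(a)
    return math.sqrt(sum((pi - ai) ** 2 for pi, ai in zip(p, a))
                     / sum((ai - am) ** 2 for ai in a)) * 100.0
```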
In this study, the Chinese dataset was used to estimate the software cost. The
WEKA, which is a data mining tool, was used in the experiments. Datasets were
randomly divided into training and test data using a 10-fold cross-validation
technique. The performance measurements of the developed models were evaluated
based on the correlation coefficient, MAE, RMSE, RAE and RRSE.
The algorithms under the Meta group, LWL in the Lazy group and Input Mapped
in the Rules group take, in addition to their own parameters, a basic classifier and its
parameters. Therefore, the REP Tree classification algorithm was chosen for all of
them in order to have an accurate comparison. The results are presented in
Table 20.2.
Table 20.2 presents the performance evaluation results of the machine learning
algorithms applied to the Chinese dataset. The results reveal that the SMOreg
algorithm obtains the best estimation result. The correlation coefficient is 0.9897,
the MAE is 271.9954 and the RAE is 7.3511%. In addition to the SMOreg
algorithm, the Linear Regression, Simple Linear Regression, M5P, M5 Rules and
Random Committee algorithms also performed relatively well. The Linear
Regression algorithm was the second best performing algorithm, with a correlation
coefficient of 0.9889, MAE of 362.939 and RAE of 9.809%. As shown in Table
20.3, the correlation coefficient value of the M5P algorithm on the Chinese dataset
is high and the RMSE, RAE, RRSE and MAE values are low. The M5P algorithm
performed well in general.
Table 20.2. Performance of the 29 algorithms on the Chinese dataset, grouped by
WEKA category (Lazy, Meta, Misc, Rules, Tree). Most rows were lost in extraction;
one recoverable example is the Randomizable Filtered Classifier (Meta group), with a
correlation coefficient of 0.9566, MAE of 635.8599, RMSE of 1887.3988, RAE of
17.1852% and RRSE of 29.0692%.

Table 20.3. Best-performing algorithms on the Chinese dataset:

Algorithm                 Correlation coefficient  MAE       RMSE       RAE (%)  RRSE (%)
SMOreg                    0.9897                   271.9954  939.2438   7.3511   14.466
Linear Regression         0.9889                   362.939   968.6259   9.809    14.9185
M5P                       0.9847                   389.5608  1127.8709  10.5285  17.3711
Simple Linear Regression  0.9833                   414.1948  1176.6757  11.1943  18.1228
M5 Rules                  0.9762                   411.9179  1412.4493  11.1328  21.7541
Random Committee          0.9693                   473.8313  1594.3058  12.8061  24.555
When the ZeroR algorithm, one of the machine learning algorithms, was applied
to the Chinese dataset for cost estimation using the WEKA tool, it showed the worst
forecast performance. The Decision Stump classification algorithm from the Tree
group performed the worst after the ZeroR algorithm. The correlation coefficient
was 0.8155 and the RAE was 62.2355%.
20.4. Conclusion
It has been noted that the attributes in the datasets affect the estimation result.
There were 19 attributes in the Chinese dataset. When two attributes that did not
affect the cost were removed from the Chinese dataset, much better performance
values were obtained. Some algorithms were able to take another classifier and its
parameters in addition to their own parameters. The REP Tree algorithm was
selected as the basic classifier, and tests were carried out on all of these algorithms
for an accurate comparison.
The analysis of the test results showed that the best prediction algorithm in the
Chinese dataset was the SMOreg algorithm, with a correlation coefficient of 0.9897
and an RAE of 7.3511%. The ZeroR algorithm showed the worst prediction result.
This study made it possible to obtain information about which machine learning
algorithms could be used for software cost estimation, what the prediction results
might be when these algorithms were applied to the Chinese dataset and which
algorithms worked best.
In future studies, tests will be performed in the WEKA tool using datasets of
software projects prepared with different methodologies. Other methods of artificial
intelligence, such as genetic algorithms and fuzzy logic will also be used for the cost
estimation of software projects.
20.5. References
Attarzadeh, I. and Ow, S.H. (2010). A novel algorithmic cost estimation model based on soft
computing technique. Journal of Computer Science, 6(2), 117–125.
Bosu, M.F. and MacDonell, S.G. (2019). Experience: Quality benchmarking of datasets used
in software effort estimation. Journal of Data and Information Quality, 11(4), 38.
Kumari, S. and Pushkar, S. (2013). Performance analysis of the software cost estimation
methods: A review. International Journal of Advanced Research in Computer Science
and Software Engineering, 229–238.
Marapelli, B. (2019). Software development effort duration and cost estimation using linear
regression and K-nearest neighbors machine learning algorithms. International Journal of
Innovative Technology and Exploring Engineering (IJITEE), 9(2), 2278–3075.
Prabhakar and Dutta, M. (2013). Application of machine learning techniques for predicting
software effort. Elixir International Journal, 56, 13677–13682.
21
Monte Carlo Accuracy Evaluation of Laser Cutting Machine
A large selection of laser cutting machines is currently available – from 100 USD
entry-level devices to 50,000 USD industrial machines. Generally, the accuracy of
the specific model is characterized by a simple numerical parameter – from 0.3 mm
for simple models to 0.01 mm for industrial models. However, using one parameter
to characterize engraving accuracy may, in some situations, be misleading. This
single parameter may be adequate for evaluating the accuracy of, say, a horizontal
cut, but more parameters may be required to adequately describe the accuracy of the
cut between two arbitrary points. In order to evaluate the practical accuracy of the
different mechanical designs of laser cutting machines, a MAPLE-based software
simulator was designed. By changing the type of the mechanical design and the
values of the parameters of the geometrical sizes of the mechanical members used,
mechanical slacks and mechanical rigidity, a practical evaluation of the resulting cut
accuracy for different parts of the cut can be calculated. This chapter describes
pintograph-based laser cutting machines. The pintograph has recently become popular
due to its simple mechanical implementation that uses inexpensive servo motors
controlled by an inexpensive microcontroller. Despite the simplicity of the
mechanical design, the mathematical model of the real-life pintograph contains a
large number of mechanical and electronic parameters. To evaluate the accuracy of
the pintograph by taking into account slacks in the pintograph’s joints, the previously
designed Monte Carlo software simulator was reworked. Relevant math equations
were created and solved using the MAPLE symbolic software. The simulator takes
into account rod length, slacks in the joints and the servo motor’s resolution. The
simulator operation results are the drawing zone map and the accuracy map in the
drawing zone. By changing the sizes and slacks of the pintograph elements as inputs
of the simulator, it is possible to evaluate the drawing zone and the cutting accuracy.
21.1. Introduction
Two motors (marked “M#1” and “M#2”) are positioned on the axis “X”. The
distance of Motor #1 from the origin {0,0} is marked as “L1”, so that the absolute
coordinates of the Motor #1 shaft (axis) are {L1, 0}. The distance between Motor #1
and Motor #2 is marked as “L2”, so that the absolute coordinates of the Motor #2 shaft
are {(L1+L2), 0}.
The pintograph contains four rigid rods that, in most cases, have equal length.
However, the developed model uses the length of all the rods {L3, L4, L5, L6} as
parameters that can be changed.
The bottom left rod is connected to the shaft of Motor #1, so that the angle
between the axis “X” and the bottom left rod is marked as “a”, whereas the bottom
right rod is connected to Motor #2, so that the angle between the axis “X” and the
bottom right rod is marked as “b”.
The coordinates of the upper part of L3 are marked as {X11, Y11}. The
coordinates of the lower part of L5 are marked as {X12, Y12}. In the ideal design,
L3 and L5 form a flexible joint, so that X11 = X12 and Y11 = Y12. However, in
the case of slack in the left joint, X12 = X11 + sX1 and Y12 = Y11 + sY1. The
values of sX1 and sY1 effectively describe slack in the left joint. The coordinates
of the upper part of L4 are marked as {X21, Y21}. The coordinates of the lower part
of L6 are marked as {X22, Y22}. In the ideal design, L4 and L6 form a flexible
joint, so that X21 = X22 and Y21 = Y22. However, in the case of slack in the right
joint, X22 = X21 + sX2 and Y22 = Y21 + sY2. The values of sX2 and sY2
effectively describe slack in the right joint. The coordinates of the top joint are
marked as {X, Y}. We assume that the instrument (e.g. a laser) is positioned at this
point (the top joint). Considering the mechanical design presented in Figure 21.2,
angles “a” and “b” define the position of the instrument {X, Y}, so the shafts of a
controlled rotating motor can position the instrument {X, Y} in a predictable
manner. The angles of “a” and “b” have some tolerance that must be taken into
account in the model.
X12:= L1-L3*cos(a)+sX1;
Y12:= L3*sin(a)+sY1;
X22:= L1+L2+L4*cos(b)+sX2;
Y22:= L4*sin(b)+sY2.
These equations describe “upper” pairs {X12, Y12} and {X22, Y22} of the
pintograph design as it is presented in Figure 21.2 by using parameters L1, L2, L3
and L4, angles “a” and “b” and by using slacks sX1, sY1, sX2 and sY2. The above
equations were then substituted into the formulae for X and Y, and the final
formulae for X and Y were then created. The simplified formula for X in the
(unrealistic) case when all rods have length "L" and all slacks are equal to "s" is
presented in Figure 21.5.
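A numerical counterpart of these formulae can be sketched as follows. The four lower-joint equations are taken directly from the MAPLE model above; the top joint {X, Y} is then computed as the upper intersection of the circles of radius L5 and L6 around the lower joints (a standard circle-intersection construction; the function name and the reachability handling are assumptions, not the chapter's code):

```python
import math

def pintograph_xy(a, b, L1, L2, L3, L4, L5, L6,
                  sX1=0.0, sY1=0.0, sX2=0.0, sY2=0.0):
    """Top-joint position {X, Y} of the four-rod pintograph, or None when the
    angles give a geometrically unreachable configuration."""
    # Lower-joint positions, as in the MAPLE model:
    x12 = L1 - L3 * math.cos(a) + sX1
    y12 = L3 * math.sin(a) + sY1
    x22 = L1 + L2 + L4 * math.cos(b) + sX2
    y22 = L4 * math.sin(b) + sY2
    # Top joint = upper intersection of the circle of radius L5 around
    # {X12, Y12} with the circle of radius L6 around {X22, Y22}:
    d = math.hypot(x22 - x12, y22 - y12)
    if d == 0.0 or d > L5 + L6 or d < abs(L5 - L6):
        return None
    h = (L5 ** 2 - L6 ** 2 + d ** 2) / (2.0 * d)
    v = math.sqrt(max(L5 ** 2 - h ** 2, 0.0))
    mx = x12 + h * (x22 - x12) / d
    my = y12 + h * (y22 - y12) / d
    # Take the intersection lying above the line joining the lower joints:
    X = mx - v * (y22 - y12) / d
    Y = my + v * (x22 - x12) / d
    return X, Y
```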
In addition, we must take into account that the angles of the servo motor's shaft,
controlled by a microcontroller, can only be set to a limited number of values. The
model parameters thus include the angles "a" and "b" and their tolerances, the
resolution of the stepper motors and the four slacks sX1, sY1, sX2 and sY2.
It is clear that in order to get a real-life evaluation of the “design in test”, the
Monte Carlo approach must be used. Thus, an additional parameter, the “number of
Monte Carlo tests”, was added to the software simulator. When “number of Monte
Carlo tests” is set to 1, and all tolerances are set to 0, a software simulator calculates
“X” and “Y” for all geometrically possible values of angles “a” and “b”.
Considering that angles {a, b} in the mathematical model are arguments of nonlinear
functions, the simulator operation results are a non-trivial map of the points that can
be reached by the instrument – not all the points on the XY plane can be reached by
the instrument, so the "points that can be reached" effectively create a "drawing
zone”, which is a function of the selected {L1..L6}. An exemplary “drawing zone”
created by the software simulator for the “resolution” parameter when set to 90 is
presented in Figure 21.6.
In Figure 21.6, all slacks were set to 0. The resolution of the servo motors was set
to 90 possible angles. To make points visible, the drawing option “bold” was selected.
In this case, bundles of five points are drawn. The grid step was set to 10 mm.
However, when the parameter “number of Monte Carlo tests” was set to 100, the
results were significantly different (see Figure 21.7).
Figure 21.7 presents the effect of the slacks in the pintograph mechanical
structure. To simplify the analysis, all slacks were set to 0.3. The resolution of the
servo motors was set to 90 possible angles. The number of Monte Carlo tests was set
to 100. In this case, the option “bold” was disabled.
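The Monte Carlo idea can be reproduced in miniature as follows: for a fixed pair of motor angles, each of the four slacks is drawn uniformly from [−s, +s] and the resulting scatter of top-joint positions is measured. The condensed forward-position function below assumes equal rod lengths and illustrative default sizes (L1 = 0, L2 = 200, L = 100); it is a sketch of the approach, not the chapter's simulator.

```python
import math
import random

def fk(a, b, s, L1=0.0, L2=200.0, L=100.0):
    # Condensed top-joint position for equal rod lengths L, with the four
    # slacks packed as s = (sX1, sY1, sX2, sY2). With equal upper rods the
    # top joint sits on the perpendicular bisector of the lower joints.
    x1, y1 = L1 - L * math.cos(a) + s[0], L * math.sin(a) + s[1]
    x2, y2 = L1 + L2 + L * math.cos(b) + s[2], L * math.sin(b) + s[3]
    d = math.hypot(x2 - x1, y2 - y1)
    v = math.sqrt(max(L * L - (d / 2.0) ** 2, 0.0))
    return ((x1 + x2) / 2.0 - v * (y2 - y1) / d,
            (y1 + y2) / 2.0 + v * (x2 - x1) / d)

def monte_carlo_cloud(a, b, slack=0.3, trials=100, seed=1):
    # Sample the slacks and return the size of the resulting point "cloud"
    # in the X and Y directions (max minus min coordinate).
    rng = random.Random(seed)
    pts = [fk(a, b, [rng.uniform(-slack, slack) for _ in range(4)])
           for _ in range(trials)]
    xs, ys = zip(*pts)
    return max(xs) - min(xs), max(ys) - min(ys)
```

Near the stretched configuration a = b = 90°, the Y spread of the cloud is typically much larger than the X spread, which mirrors the observation that the error differs between directions and between regions of the drawing plane.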
We can see that the individual “points” from Figure 21.6 can now be seen as
“clouds” of points. The sizes of those clouds effectively represent the resulting error
of the mechanical design. We can see that the error is different in the different
regions of the drawing plane of the pintograph. We can also see that the error in the
X and Y directions is different.
The software simulator created enables us to immediately see the drawing map
for the selected parameters. However, for the user of the laser cutter, it is more
important to evaluate the accuracy of the cut. Thus, the option “draw vertical line”
was added. Two lines as they were drawn by the simulator are presented in
Figure 21.8.
Figure 21.8. Line. Left: number of Monte Carlo tests was set to 1.
Right: number of Monte Carlo tests was set to 100
Monte Carlo Accuracy Evaluation of Laser Cutting Machine 295
In Figure 21.8, the resolution of the servo motors was set to 360, whereas all
tolerances were set to 0.3. By using the option “line”, the end user can visually
evaluate the accuracy of the laser cut.
21.5. Conclusion
The software developed for the Monte Carlo simulator enables the evaluation of
the drawing zone and the drawing accuracy of the four-rod pintograph for the
selected set of parameters. Simulator runs reveal that the four-rod pintograph with a
unit length of 100 mm achieves an accuracy of 0.5 mm in the center of the drawing
zone, which is good enough for an inexpensive DIY laser cutting machine or laser
engraving machine. When better accuracy is required, designs with the customer’s
selected sizes and tolerances of pintograph elements can be tested.
21.6. Acknowledgments
This study was supported by a grant from the ORT Braude College Research
Committee under grant number 5000.838.3.3-58.
22
Nowadays, detailed epidemiological data are available in the form of time series
data (or as an array): N[k] – where N is the documented number of events registered
at the equidistant time moments T(k) = To + k*delta (e.g. “Number of newly
reported cases of Covid-19 in the last 24 hours” – published on a daily basis by
WHO). Theoretically, those data can be adequately described by different dynamic
models containing exponential growth and exponential decay elements. Practically,
parameters of those models are not constants – they can change in time because of
many factors like changing hygiene policies, changing social behavior and
vaccination. Hence, it was decided to use a piecewise approach: short sequential
fragments of time series data are approximated by a function containing some
parameters. The above parameters are evaluated for the first time series data
fragment. Then, the next data fragments are processed. As a result, new time series
data (arrays) are created: evaluated sequences of parameters. Those new series can
be considered and analyzed as functions of time. In the simplest example, the
function to be used for every fragment is A + B*exp(alpha*t). The resulting values
of A, B and alpha in that case are time series data – arrays: A[k], B[k] and alpha[k]
known at the equidistant time moments T(k). By plotting those sequences, it can be
seen if the simple growth or decay model is adequate. Significant jumps in values
may point to an interesting event, for example the start of mass vaccination or the
effect of a non-desirable social behavior on the specific date. In order to make
calculations robust, some preliminary filtration and after-filtration can be used
(e.g. by using moving average, moving median or other filters such as the Gaussian
filter and Bessel filter). A number of practical examples were considered.
22.1. Introduction
Practically, more sophisticated filters must be used to filter “noise”. The idea of this
approach will be used here to describe signal Y as a sequence of time shifted
piecewise exponents. In that case, the resulting “signals” will be parameters of
“moving exponents”. In some approximations, algorithm data are approximated by
an exponential function A*exp(-alpha*t), where A and “alpha” are the parameters to
be found. Practically, to find values of “A” and “alpha” logarithms (and log graphs)
are used. However, this approach works only if values of “A” and “alpha” are
constants – at least for the time interval selected for measurements.
To analyze the situation when parameters of the function used for the
approximation may change, the following approach (analogous to approach used in
the “moving average”) will be used.
The following function was used to approximate short fragments of original data:
It is known that if three values of some signal Y1, Y2 and Y3 are known for
equidistant points of “t”, “t+delta” and “t+2*delta”, then parameters A, B and alpha
can be found by solving the following equations:
Equ1 := A+B*exp(alpha*t) = Y1
Equ2 := A+B*exp(alpha*(t+delta)) = Y2
Equ3 := A+B*exp(alpha*(t+2*delta)) = Y3
and after operating MAPLE simplifications, the following formulae for the
parameters A, B and alpha were obtained:
A := (Y1*Y3-Y2^2) / (-2*Y2+Y3+Y1)
B := (Y2-Y3)*(Y1-Y2)*((Y2-Y3)/(Y1-Y2))^((-delta-t)/delta)/(-2*Y2+Y3+Y1)
alpha := ln ( (Y2-Y3) / (Y1-Y2) ) / delta
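These closed-form expressions are straightforward to check numerically. The sketch below (illustrative; the function name is an assumption) recovers A and alpha from three equidistant samples of A + B*exp(alpha*t):

```python
import math

def fit_exponential_3pt(y1, y2, y3, delta):
    """Recover A and alpha of y(t) = A + B*exp(alpha*t) from three samples
    taken delta apart, using the closed-form solution above."""
    alpha = math.log((y2 - y3) / (y1 - y2)) / delta
    A = (y1 * y3 - y2 ** 2) / (y1 + y3 - 2.0 * y2)
    return A, alpha

# Example: samples of 300 - 300*exp(-t/8), i.e. A = 300, B = -300, alpha = -1/8;
# the fit recovers A and alpha up to floating-point error.
y = lambda t: 300.0 - 300.0 * math.exp(-t / 8.0)
A, alpha = fit_exponential_3pt(y(1.0), y(2.0), y(3.0), 1.0)
```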
It can be seen that formulae for “alpha” and “A” are relatively simple for
practical implementation. However, the equation for “B” is slightly problematic for
the goals of this analysis because, obviously, the value of B depends on the value of
“t”. This behavior can be compensated but to do this, some “starting” moment of “t”
must be set. Hence, in this chapter, only values of A and alpha will be analyzed for
the real data. It can be noted that while for the symbolic calculations the above
formulae are adequate, for the numerical calculations some real-life numerical data
combinations of values may become problematic. For example, if (-2Y2 +Y3+Y1) is
equal to zero, then numerical calculations of A and B cannot be executed. For
numerical calculation of “alpha”, the following “protected” procedure was used:
It must be noted that this protection added “impulse noise”. However, this noise
can be effectively eliminated by using the median filter.
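A moving-median filter of the kind referred to here can be sketched in a few lines (illustrative; a shrinking window is used at the edges):

```python
def moving_median(x, window=5):
    # Replace each sample by the median of a window centred on it. An isolated
    # impulse (e.g. from the zero-denominator protection) is discarded, since
    # it is never the median of a window of mostly regular samples.
    half = window // 2
    out = []
    for i in range(len(x)):
        seg = sorted(x[max(0, i - half):i + half + 1])
        out.append(seg[len(seg) // 2])
    return out
```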
The synthetic test data were generated as follows:
for k to arraySize do
  TestData[k] := evalf(TestA1*(1-exp(k/TestK1))
    + Heaviside(k-(1/2)*arraySize)*TestA2*(1-exp((k-(1/2)*arraySize)/TestK2)))
end do
Parameters were set as: TestA1 = 300, TestK1 = -8, TestA2 = 500, TestK2 = -5.
Figure 22.1 presents the synthetic data in graphical form. The presented signal is typical of digital electronic signals.
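For reference, the synthetic signal can be reproduced outside MAPLE. The Python sketch below mirrors the loop above; arraySize is assumed to be 64, since the chapter does not state it but does say that the second exponent starts at moment 32, i.e. arraySize/2:

```python
import math

def heaviside(x):
    # MAPLE's Heaviside; the value at 0 is immaterial here, because at
    # k = array_size/2 the second term is A2*(1 - exp(0)) = 0 anyway.
    return 1.0 if x > 0 else 0.0

def make_test_data(array_size=64, a1=300.0, k1=-8.0, a2=500.0, k2=-5.0):
    """Two delayed saturating exponents: the second switches on at
    k = array_size/2, on top of the (by then almost constant) first one."""
    half = array_size / 2
    return [a1 * (1.0 - math.exp(k / k1))
            + heaviside(k - half) * a2 * (1.0 - math.exp((k - half) / k2))
            for k in range(1, array_size + 1)]
```

The last samples approach the combined magnitude 300 + 500 = 800 described in the text.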
Using Parameters of Piecewise Approximation by Exponents 301
From Figure 22.1, it can be seen that “the second exponent” starts at moment 32. It can also be seen that, by this time, “the first exponent” (which started at moment “1”) has practically become a constant.
The array “TestData” was processed in the manner of a “moving average”, but instead of the “average”, the parameters “Alpha”, “A” and “B” were calculated for the different values of the index “k”. The parameter TestDataStep was set to “1” here.
This procedure effectively creates new data series (arrays): “Zalpha[k]”, “Za[k]”
and “Zb[k]”. Figure 22.2 presents the array “Zalpha” and Figure 22.3 represents the
array “Za”.
The values of Zalpha represent the calculated values of the parameter “alpha” for the different moments of time. From Figures 22.2 and 22.3, we can clearly see that in the left part, “the signal” can be described as an exponent having “alpha” = -0.125 = -1/8 and magnitude A = 300, whereas in the right part, “the signal” can be described as an exponent having “alpha” = -0.2 = -1/5 and magnitude A = 300+500 = 800.
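The moving-window computation itself is straightforward. A Python sketch (not the chapter's MAPLE code; degenerate triples are simply skipped here rather than replaced by impulses) producing the analogues of “Zalpha” and “Za” is:

```python
import math

def moving_exponent_params(data, step=1):
    """Slide a three-point window over `data` (by analogy with the moving
    average) and compute alpha and A at each admissible index k."""
    z_alpha, z_a = {}, {}
    for k in range(len(data) - 2 * step):
        y1, y2, y3 = data[k], data[k + step], data[k + 2 * step]
        denom = y1 - 2.0 * y2 + y3
        if denom == 0.0 or y1 == y2 or (y2 - y3) / (y1 - y2) <= 0.0:
            continue  # numerically degenerate triple
        z_alpha[k] = math.log((y2 - y3) / (y1 - y2)) / step
        z_a[k] = (y1 * y3 - y2 ** 2) / denom
    return z_alpha, z_a
```

On a pure exponent 2 + 5·exp(−k/4), every window returns alpha = −1/4 and A = 2, which is exactly the constancy visible in the flat parts of Figures 22.2 and 22.3.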
Real-life Covid-19 data were downloaded as an Excel file from the site (Ritchie
et al. 2021). This file contains data for a large number of countries. To demonstrate
use of the developed approach, the data concerning Israel and Sweden were used.
Original data for the “total number of Covid-19 cases per million” in Israel
published in the source (Ritchie et al. 2021) were from February 2, 2020 (before that
date no Covid-19 cases were registered in Israel) up to March 20, 2021 – totaling
395 days. The original data were then smoothed by using the MAPLE “moving median” filter with parameter 5 and the “moving average” filter with parameter 20. As a result, the data presented in Figure 22.4 contain only 370 points, which means that, for a valid epidemiological analysis, a more accurate evaluation of the introduced time shift must be provided. However, the aim of this chapter is to provide a preliminary evaluation of the proposed approach; hence, the following
Figure 22.7 presents the “total number of Covid-19 deaths per million” for Israel.
Figure 22.8 presents the results of calculations of “alpha” for the data presented in
Figure 22.7. Parameter “TestDataStep” (as for the previous cases) was set to 1.
Figure 22.9 presents the results of calculations of “alpha” for the data presented in
Figure 22.7. In that case, however, the parameter “TestDataStep” was set to 4. It can be seen that using an increased value of this parameter creates more robust results, albeit at a decreased resolution.
Figure 22.10 presents the “total number of Covid-19 deaths per million” for
Sweden. Figure 22.11 presents the results of calculations of “alpha” for the data
presented in Figure 22.10. It can be seen that the behavior of the graphs for these two countries differs. Even in the simplest implementation, the proposed method of data approximation by piecewise exponential functions (whose parameters effectively change in time) reveals that the well-known parameter “number of waves” is not as obvious as it appears from a visual observation of the original data. However, more data must be checked to evaluate the usefulness of the proposed method. In addition, different modifications of this method are to be implemented and tested later.
22.5. Conclusion
Analysis of synthetic and real-life Covid-19 data demonstrates that the proposed
approach can be used to evaluate the validity of mathematical epidemiological
models under test for the different periods of time. Developed equations can be used
for the analysis of other processes for which the description by exponents may be
adequate. However, more real-life data from different countries must be analyzed in
order to recommend an optimal set of the smoothing parameters, and to evaluate the
reliability of the proposed approach for the analysis of real-life data.
22.6. References
Data Analysis and Related Applications 1, First Edition.
Edited by Konstantinos N. Zafeiris, Christos H. Skiadas, Yiannis Dimotikalis, Alex Karagrigoriou and Christiana Karagrigoriou-Vonta.
© ISTE Ltd 2022. Published by ISTE Ltd and John Wiley & Sons, Inc.
23.1. Introduction
The state of the body systems responsible for gas exchange can vary as a result of the dynamics of the pathological process and of changes in physiological parameters. This means that, for the effective management of patients on ALV (artificial lung ventilation), constant monitoring of the functions of external respiration is required (Petrova et al. 2014).
Of the main parameters registered for respiratory monitoring, the capnogram can
be highlighted, which allows us to estimate the partial pressure of carbon dioxide in
the respiratory mixture, as well as the flow graph, which shows the rate of flow
change and is measured in liters per minute (Vasilev et al. 2015).
There are many parameters that can be used to assess the interaction between the
ventilator and the patient, including data on flow, pressure, breathing volume and
frequency, the ratio of inhalation to exhalation, etc. Until now, however, the ordinary arsenal of a doctor's practice has contained no parameters that allow the effectiveness of external respiration to be sufficiently assessed on the monitor. This assessment requires data on the amount of oxygen taken up and carbon dioxide emitted.
dioxide. Nonetheless, the possibilities of obtaining the necessary data exist. The use of
a metabolograph could be taken as an example. A metabolograph is a module that
integrates an artificial lung ventilation machine and allows you to calculate the amount
of energy consumed by the patient, which makes it possible to select the daily kcal
intake. Its operating principle is based on indirect calorimetry, which requires data on carbon dioxide emission and oxygen absorption, both of which this device measures (Petrova et al. 2014). Therefore, this device can be used
to configure an artificial lung ventilation machine and select the best values that
ensure maximum gas exchange efficiency (Mihnevich and Kursov 2008).
Since the device was not originally intended for adjusting the parameters of artificial respiration, its readings do not meet all the necessary requirements; in particular, the indicators of gas emission and absorption are averaged and displayed with a delay. These features strongly limit the use of this device by physicians when setting up an artificial lung ventilation machine. The efficiency of external
The Correlation Between Oxygen Consumption and Excretion of Carbon Dioxide 309
To obtain a sufficient amount of energy, the cells need oxygen, which they
receive from the blood, and it enters the blood from the external environment. In the
process of obtaining energy, carbon dioxide is formed, which must be removed from
the body (Gabdulkhakova et al. 2016). The process of exchange of gas molecules between the external environment and the body is called external respiration, and the main role in it is played by the lungs, which are located in a sealed pleural cavity. The lungs provide contact between the blood and a mixture of gases from the external environment in special alveolar sacs, which have extremely thin walls. The gas mixture enters the alveoli through the air-conducting system – the bronchi and bronchioles. Blood flows through the arteries, which divide into capillaries at the alveoli and are then collected into the veins (Figure 23.1).
The mechanics of breathing are provided by the breathing movements of the thorax
and diaphragm. The volume of the thorax increases with inspiration, due to the
contraction of the intercostal muscles and the flattening of the diaphragm. Due to this,
pleural pressure is reduced. The atmospheric pressure in the alveoli stretches the lungs;
as a result, the pressure in the lungs begins to drop below atmospheric pressure. This
difference ensures the flow of air into the lungs – inhalation. Exhalation occurs
according to similar mechanisms, but in the opposite direction. The main difference is
that when inhaling, the driving force is muscle contraction, and when exhaling, it is the
stored elastic traction in the fibers of the lungs and thorax. As soon as the muscles
relax, the lungs and thorax, like a spring, decrease in volume; this increases the pleural and alveolar pressures above atmospheric pressure, and the resulting flow is directed outward (Gutsol et al. 2014).
Each alveolus has its own characteristics of ventilation and perfusion; therefore,
the ratio of ventilation and perfusion may be different in different parts of the lungs.
This may lead to the conclusion that even in a normal state, the ventilation of the
lungs may have some unevenness. With the pathology of the respiratory system, for
example, in inflammatory processes, the degree of ventilation inequalities of
different alveoli increases. It is necessary to constantly maintain an effective ratio of
ventilation and perfusion in the lungs in order to ensure effective gas exchange. The
structure of the respiratory system is aimed at maintaining the required ratio of
ventilation and perfusion, if there are no pathologies.
The devices used in intensive care and in resuscitation rooms are implemented
on the principle of PPV (positive pressure ventilation), i.e. the pressure of the gas
mixture in the lungs during inhalation is higher than the atmospheric pressure.
Further, the physician must maintain strict control over the delivery of oxygen,
its consumption and the carbon emissions.
The principle of oximetry is based on the fact that oxygenated hemoglobin and
deoxyhemoglobin absorb the red and infrared (IR) parts of the spectrum differently.
Oxyhemoglobin absorbs IR radiation well, while deoxyhemoglobin absorbs red light
intensively. Saturation (SO2) is the grade of oxygen saturation of the blood,
determined by the ratio of red and IR streams that have come from the source to the
photodetector through a tissue site. The pulse wave can be used to determine the
heart rate and assess the quality of peripheral blood flow (Lapitsky 2015).
Gas exchange can be monitored using data from gas analyzers, which use different types of sensors: paramagnetic, fuel cells, IR absorption sensors, etc. Indirect calorimetry, as carried out by the metabolograph, has its own difficulties for continuous gas analysis: data on oxygen consumption and carbon dioxide emission by the body are not registered in real time, although it is necessary to control these parameters.
23.4. The algorithm for monitoring the carbon emissions and oxygen
consumption
The control of the O2 content in the inhaled and exhaled air, and the respiratory
removal of CO2 can be carried out using inertial mechanical gas analyzers based on
the separation of a component from the gas mixture by special absorbers and
measuring changes in the sample volume at constant pressure, or pressure at a
constant volume of the measuring chamber.
So, the data from the metabolograph cannot be used for continuous monitoring
of oxygen consumption and carbon dioxide emission in real time. This means that
the task arises of implementing an algorithm that will allow solving this problem.
The capnogram shows the partial pressure of carbon dioxide over time, but for
further calculations of volumes, we need to know the concentration of carbon
dioxide. To find it, it is worth referring to the definition of partial pressure and the
formula for calculating it.
Partial pressure is the pressure that would be produced by a gas that is part of a
mixture of gases if it alone at a given temperature occupies a volume filled with the
entire mixture of gases.
If the gas content in parts or percent and the total pressure of the mixture are
known, then the partial pressure of the gas entering the gas mixture can be
determined.
p1 = (a*Pgeneral)/100,
where p1 is the partial pressure of a gas, a is the gas content of the mixture in % and Pgeneral is the gas mixture pressure.
This equation can be used to find an array of values for the amount of carbon
dioxide in a mixture in parts:
PrCO2 = CO2./Pair,
where PrCO2 is the instantaneous values of the amount of carbon dioxide in parts,
CO2 is the instantaneous values of the partial pressure of carbon dioxide and Pair is
the gas mixture pressure equal to atmospheric.
According to Dalton's law, the total pressure of a mixture of gases is equal to the sum of the partial pressures of its components.
Based on this, we can find the partial pressure of nitrogen, provided that the initial data on the partial pressures of oxygen and carbon dioxide of the supplied gas mixture are known:
N = Pair - O2(1) - CO2(1),
where O2(1) and CO2(1) are the values of the partial pressure of oxygen and carbon
dioxide in the mixture, respectively, and N is the partial pressure of nitrogen
(Petrovsky 1988).
Also, using the equation for partial pressure, you can find the amount of nitrogen
in parts:
PrN = N/ Pair,
where PrN is the amount of nitrogen in parts and N is the partial pressure of
nitrogen.
Taking advantage of the fact that there are only three gases in the mixture, we can then find, through the difference, an array of instantaneous values for the amount of oxygen in parts:
PrO2 = 1 - PrCO2 - PrN,
where PrO2 is the instantaneous values of the amount of oxygen in parts, PrCO2 is
the instantaneous values of the amount of carbon dioxide in parts and PrN is the
amount of nitrogen in parts.
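This bookkeeping is easy to mis-order, so here is a compact Python sketch (illustrative only; the function and variable names are mine, units of mmHg and a total pressure of 760 mmHg are assumed) that turns capnogram samples and the inspired-gas partial pressures into the fractions PrCO2, PrN and PrO2:

```python
P_AIR = 760.0  # total (atmospheric) pressure, mmHg -- an assumed value

def gas_fractions(co2_samples, o2_inspired, co2_inspired):
    """co2_samples: instantaneous CO2 partial pressures from the capnogram.
    Returns (PrCO2 array, PrN, PrO2 array), assuming the mixture contains
    only O2, CO2 and N2."""
    n2 = P_AIR - o2_inspired - co2_inspired    # Dalton's law: partials sum to total
    pr_n = n2 / P_AIR                          # nitrogen fraction, treated as constant
    pr_co2 = [p / P_AIR for p in co2_samples]
    pr_o2 = [1.0 - c - pr_n for c in pr_co2]   # the remaining fraction is oxygen
    return pr_co2, pr_n, pr_o2
```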
The algorithm for real-time VO2 and VCO2 measurements includes the numerical integration, over the respiratory cycle, of the instantaneous values of dVO2/dt and dVCO2/dt, derived from the product of the instantaneous values of the given gas concentration and the total flow.
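As an illustrative sketch (not the authors' implementation), with inspiratory flow counted positive and expiratory flow negative, integrating fraction × flow over one cycle directly yields the inhaled-minus-exhaled volume of that gas:

```python
def gas_volume(fractions, flows, dt):
    """Trapezoidal integration of dV/dt = fraction(t) * flow(t) over one
    respiratory cycle, sampled every dt seconds."""
    vdot = [f * q for f, q in zip(fractions, flows)]
    return sum(0.5 * (vdot[i] + vdot[i + 1]) * dt
               for i in range(len(vdot) - 1))
```

Applied to the O2 fraction this gives a quantity of the Int1 kind; applied to the CO2 fraction, one of the Int2 kind.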
Accordingly, the data were digitized and brought to a form suitable for further processing.
23.5. Results
The results of the implementation of this algorithm are two integrated quantities,
as well as their graphs.
The result of the first integration (Int1) is the difference between the volumes of the inhaled and exhaled oxygen. That is, according to the calculations, the amount of oxygen in the exhaled air decreased by 11.3957 ml in comparison with the inhaled air.
The result of the second integration (Int2) is the difference between the volumes of the inhaled and exhaled carbon dioxide: the volume of carbon dioxide in the exhaled air increased by 8.5367 ml relative to that inhaled.
23.6. Conclusion
23.7. References
24
Approximate Bayesian Inference
Using the Mean-Field Distribution
24.1. Introduction
Population models may be used to assess, from data, the interaction laws governing
the individual dynamics (Bongini et al. 2017; Lu et al. 2019). In most of these models,
the interaction of an individual with the rest of the population is represented by means
of some statistics, potentially depending on the state variables of the whole population.
These statistics can take the form of the average velocity in bird swarms, for example (Cucker and Smale 2007; Degond et al. 2014), or the mean competition potential exerted by a population of plants over a single plant in Schneider et al. (2006).
In this chapter, we will consider population models that satisfy a list of frequently
encountered properties:
– Each individual in the population is represented by a state variable x, which
may vary through time, and an individual trait variable θ, which remains constant.
The variability of trait θ from one individual to another can be used to model the
heterogeneous aspect of the population.
– The evolution of a population of N individuals is given by a differential system, where the motion of each individual i is driven by a transition function hN depending on some population statistics T_i^N(t), i.e. for any time t ≥ 0,
\[
\forall i \in [\![1;N]\!], \quad \frac{dx_i^N}{dt}(t) = h_N\big(t, x_i(t), \theta_i, \hat{\mu}_N(t)\big) = H_N\big(t, x_i(t), \theta_i, T_i^N(t)\big),
\]
\[
\text{where } \hat{\mu}_N(t) = \frac{1}{N}\sum_{j=1}^{N} \delta_{(x_j(t),\,\theta_j)} \text{ is the empirical population measure,} \tag*{[24.1]}
\]
\[
\text{and } T_i^N(t) = \mathbb{E}\big\{\Phi_N\big(x_i^N(t), \theta_i, x', \theta'\big),\ (x',\theta') \sim \hat{\mu}_N(t)\big\} \text{ a statistic.}
\]
When the initial distribution is factorized and when the transition function
depends, as above, on the empirical measure μ̂N (t), then the system dynamics has
the property of being invariant by permutation of its individuals’ labels. We then
say that the system is symmetric if for any time t ≥ 0 and for any bijection
ρ : 1; N → 1; N , the distribution of the permuted collection (xρ(1:N ) (t), θρ(1:N ) )
is the same as the original collection (x1:N (t), θ1:N ). This property is commonly
shared by population models, where, most often, the assignment of labels is arbitrary
(Carrillo et al. 2010).
Our focus in this chapter is to discuss the statistical inference problems related to
the study of such symmetric systems. More specifically, when some elements of the
Approximate Bayesian Inference Using the Mean-Field Distribution 321
structure of the system are partially known, such as the initial condition or the size N
of the population, determining parameters of the transition function hN or the initial
distribution μ0 can appear as a very complex task, leading to the necessity of building
approximations. In section 24.2, the plant population model introduced by Schneider
et al. (2006) is taken as an example of systems leading to difficult inference problems
when the size of the population is partially known. Section 24.3 uses an asymptotic
property of the empirical measure of the Schneider system when N → ∞, i.e. the
fact that it admits a mean-field limit distribution, to simplify the previously mentioned
inference problem. Section 24.4 deals with the consistency of this approximation.
In this section, we consider a plant growth model with competition, first introduced
by Schneider et al. (2006) and studied later by Lv et al. (2008) and Nakagawa et al. (2015).
This system describes the growth of Arabidopsis thaliana: each plant is represented
by the state variable s ∈ R∗+ , the diameter of its rosette, its position x ∈ R2 and two
growth parameters γ ∈ R+ , S ∈ R∗+ , namely, the growth rate and the asymptotic
isolated size. As the parameters x, S, γ remain constant over time, they are considered
as components of the individual trait θ. The differential system giving the dynamics
of N plants takes the following expression, for all plants i ∈ [\![1;N]\!] and for all t ≥ 0,
\[
(s_0, x_i, S_i, \gamma_i) \sim \mu_0, \qquad s_i(0) = s_0,
\]
\[
\frac{ds_i}{dt}(t) = \gamma_i\, s_i(t)\Big[\log(S_i/s_m)\big(1 - C_N^i(t)\big) - \log\big(s_i(t)/s_m\big)\Big],
\]
where s_m > 0 is a minimal size parameter and
\[
C_N^i(t) = \frac{1}{N-1}\sum_{j \neq i} C\big(s_i(t), s_j(t), |x_i - x_j|\big),
\]
\[
\text{with } C\big(s_i(t), s_j(t), |x_i - x_j|\big) = \frac{\log\big(s_j(t)/s_m\big)\Big(1 + \tanh\big(\tfrac{1}{\sigma_r}\log(s_j(t)/s_i(t))\big)\Big)}{2 R_M \Big(1 + \frac{|x_i - x_j|^2}{\sigma_x^2}\Big)}. \tag*{[24.2]}
\]
In the equation above, C_N^i(t) models the competition exerted on plant i by all the other plants at time t. As we can read in the competition potential C(s_i(t), s_j(t), |x_i − x_j|), the competition is stronger the closer the competitors are to plant i and the larger they are in relation to plant i. We assume that the distribution μ0 is such that all
plants initially have the same size s0 > sm. RM, σx and σr are parameters of the competition potential C.
In other words, the initial size is lower than the asymptotic isolated size S, and
the asymptotic size is below some threshold depending on the competition parameter
RM. In this setting, we can prove that, for all i ∈ [\![1;N]\!] and for all t ≥ 0, sm ≤ si(t) ≤ Si and that C_N^i(t) ∈ [0; 1] almost surely, which is consistent with the initial assumptions of the model. Indeed, for any time t > 0 sufficiently close to zero, we have the following inequality on the derivative of the state variable:
\[
-\gamma_i \log\Big(\frac{s_i(t)}{s_m}\Big) \le \frac{d}{dt}\log\Big(\frac{s_i(t)}{s_m}\Big) \le \gamma_i \log\Big(\frac{S_i}{s_i(t)}\Big),
\]
which leads to
\[
s_m \Big(\frac{s_0}{s_m}\Big)^{e^{-\gamma_i t}} \le s_i(t) \le S_i \Big(\frac{s_0}{S_i}\Big)^{e^{-\gamma_i t}}
\]
and proves that the inequality sm ≤ si(t) ≤ Si holds for any time t ∈ ℝ+.
The Schneider system fits into the definition of symmetric population models as
the transition function can be expressed as a function of the individual variables and
the population empirical measure only. The differential system can be rewritten as
follows:
\[
\forall i \in [\![1;N]\!], \quad \frac{ds_i}{dt}(t) = h_N\big(s_i(t), x_i, S_i, \gamma_i, \hat{\mu}_N(t)\big),
\]
\[
\text{where } \hat{\mu}_N(t) = \frac{1}{N}\sum_{i=1}^{N} \delta_{(s_i(t),\,x_i,\,S_i,\,\gamma_i)},
\]
with, separating the self-interaction term from the empirical expectation,
\[
h_N = \gamma_i\, s_i(t)\left[\log\Big(\frac{S_i}{s_m}\Big)\Big(1 - \tfrac{N}{N-1}\,\mathbb{E}\big\{C(s_i(t), s', |x_i - x'|),\ (s', x') \sim \hat{\mu}_N^{s,x}(t)\big\}\Big) - \log\Big(\frac{s_i(t)}{s_m}\Big)\right] + \frac{\gamma_i\, s_i(t)}{N-1}\,\log\Big(\frac{S_i}{s_m}\Big)\, C\big(s_i(t), s_i(t), 0\big). \tag*{[24.3]}
\]
In the equation above, we use the notation μ̂_N^{s,x}(t) to denote the marginal distribution of the distribution μ̂_N(t) with respect to the variables s, x.
\[
\forall i \in [\![1;N]\!], \quad \tilde{C}_N^i(t, t_k) = \sum_{m=0}^{d} \frac{d^m C_N^i}{dt^m}(t_k)\, \frac{(t - t_k)^m}{m!}.
\]
of the posterior density, which can only be known up to a multiplicative factor in our case, i.e. for all (η, s) ∈ H × ℝ^{N0 m},
\[
p_{\eta|s}(\eta|s) \propto p_{s|\eta}(s|\eta)\, p_{\eta}(\eta).
\]
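Knowing the posterior only up to a multiplicative constant is exactly the setting of Markov chain Monte Carlo. As a generic illustration (not the method used in this chapter), a random-walk Metropolis sampler needs nothing but the unnormalized log-density:

```python
import math
import random

def metropolis(log_unnorm, init, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis: samples from a density known only up to a
    multiplicative factor, as in p(eta|s) proportional to p(s|eta)p(eta)."""
    rng = random.Random(seed)
    x, lp = init, log_unnorm(init)
    chain = []
    for _ in range(n_samples):
        prop = x + rng.gauss(0.0, step)
        lp_prop = log_unnorm(prop)
        # accept with probability min(1, exp(lp_prop - lp))
        if lp_prop >= lp or rng.random() < math.exp(lp_prop - lp):
            x, lp = prop, lp_prop
        chain.append(x)
    return chain
```

Run on the unnormalized log-density −x²/2, the chain's empirical mean and variance approach those of the standard normal distribution.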
Beforehand, the inference requires the evaluation of the likelihood distribution ps|η
of the observations knowing the parameters. The likelihood of the observations has a
density ps|η of expression
\[
\forall (s, \eta) \in \mathbb{R}^{N_0 m} \times H, \quad p_{s|\eta}(s|\eta) = \sum_{N=N_0}^{+\infty} p_N(N)\, p_{s|\eta,N}(s|\eta, N),
\]
\[
p_{s|\eta,N}(s|\eta, N) = \int_{\Theta^N} \frac{1}{(2\pi\sigma^2)^{\frac{N_0 m}{2}}}\, \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{N_0}\sum_{j=1}^{m}\big(s_{ij} - s_i^N(t_j, \eta, \theta_{1:N})\big)^2\right) \mu_0^{\theta\otimes N}(d\theta_{1:N}) \tag*{[24.4]}
\]
where s_i^N(t, η, θ_{1:N}) is the solution of the Schneider system of size N. In practice, we cannot evaluate the trajectories s_i^N exactly, and we have to resort to numerical
methods to estimate them. The management of this source of uncertainty is outside the scope of this chapter; let us assume that we are able to solve the differential system exactly, or with a numerical error that we can reasonably ignore.
The main difficulty in the computation of this likelihood distribution comes from
the fact that individuals are interdependent for any time t > 0. As a consequence,
to compute the trajectory sN i , we need to introduce the initial configuration of the
whole population θ1:N as latent variables, although we only observe a subset of N0
individuals. Thus, the likelihood appears as an infinite mixture of densities, due to
the infinite support of the prior pN . Moreover, each of the densities in the mixture
is expressed as an integral over a space of increasing dimension. The inference
problem associated with the simulation of a posterior distribution with a latent variable
changing the dimension of the candidate model is called a trans-dimensional inference
problem (Preston 1975). Numerically simulating such a complex distribution is still
an open research topic, as pointed out in Roberts and Rosenthal (2006).
with d being some metric defined over X × Θ. When a sequence of probability measures (μ_n)_{n∈ℕ} converges to some distribution μ for the metric W_p, it means, in particular, that for all ϕ continuous and bounded, we have
\[
\int \varphi\, d\mu_n \xrightarrow[n \to \infty]{} \int \varphi\, d\mu,
\]
and that the moments of order p of the sequence also converge towards the same moments of the limit distribution. In the case of the population empirical measure,
this convergence can only hold almost surely, since μ̂N (t) is a stochastic measure
depending on the initial configuration of the system.
As the empirical measure μ̂N (t) is a description of the population of size N , the
limit μ(t) of this sequence can be interpreted as a description of an infinite population.
The mean-field distribution μ(t) addresses the issues entailed by the interdependence
of the individuals, since the individuals in this infinite population are independent.
Such a paradox can be explained by comparing the interactions within a subgroup of
individuals of constant size in a population of increasing N admitting a mean-field
distribution. Let us illustrate this on a toy example.
Consider the linear system in which every particle is driven by the population mean,
\[
\frac{dx_i^N}{dt}(t) = \lambda\, \mathbb{E}\big\{x',\ x' \sim \hat{\mu}_N(t)\big\} = \frac{\lambda}{N}\sum_{j=1}^{N} x_j^N(t),
\]
with Gaussian initial condition x_i^0 ∼ N(m, σ²). The analytical trajectories of this system are, for all i ∈ [\![1;N]\!] and all t ≥ 0:
\[
x_i(t) = x_i^0 + \frac{e^{\lambda t} - 1}{N}\sum_{j=1}^{N} x_j^0.
\]
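These trajectories can be checked numerically (a sketch, not from the chapter): an explicit Euler scheme for the toy dynamics dx_i/dt = λ·(1/N)Σ_j x_j(t) reproduces the closed form above.

```python
def simulate_linear_swarm(x0, lam, t_end, n_steps):
    """Explicit Euler for the toy linear system in which every particle is
    driven by lam times the population mean; the exact solution is
    x_i(t) = x_i(0) + (exp(lam*t) - 1)/N * sum_j x_j(0)."""
    x = list(x0)
    n = len(x)
    dt = t_end / n_steps
    for _ in range(n_steps):
        mean = sum(x) / n                       # the population statistic
        x = [xi + dt * lam * mean for xi in x]  # each particle follows the mean
    return x
```

With x0 = [1, 3], λ = 0.5 and t = 1, each particle gains (e^{0.5} − 1)/2 · 4 ≈ 1.2974, matching the analytic trajectories.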
It is quite straightforward to prove that, in this case, the empirical measure has a
mean-field limit, for any time t ≥ 0 and almost surely,
\[
\hat{\mu}_N(t) = \frac{1}{N}\sum_{i=1}^{N} \delta_{x_i(t)} \xrightarrow[N \to \infty]{W_p} \mathcal{N}\big(m e^{\lambda t}, \sigma^2\big).
\]
Moreover, the covariance between two given particles vanishes as N grows:
\[
\operatorname{Cov}\big(x_1^N(t), x_2^N(t)\big) = \sigma^2\, \frac{e^{\lambda t} - 1}{N}\left(1 + \frac{e^{\lambda t} - 1}{N}\right) \xrightarrow[N \to \infty]{} 0.
\]
It follows that these two particles are asymptotically independent. This property of
asymptotic independence can be generalized to more complex and nonlinear systems,
as long as they admit mean-field limits. In a symmetric system of the type in equation
[24.1], a necessary condition for the existence of the mean-field limit is the pointwise
convergence of the transition function hN when N → ∞. This pointwise convergence
is clearly satisfied in the case of the toy example above, since we have, for any fixed
probability measure μ ∈ P(R), hN (x, μ) = E{x , x ∼ μ}, which is independent
of N . It is also the case of the Schneider system, as we can see in equation [24.3],
replacing μ̂N (t) by a fixed probability measure μ. Moreover, we can prove the
convergence \(\hat{\mu}_N(t) \xrightarrow[N \to \infty]{W_2} \mu(t)\) almost surely for any t ≥ 0, where μ(t) is defined
as the pushforward measure of the initial distribution μ0 by the flow of the following
differential equation, starting from the initial configuration (s0 , θ) = (s, x, S, γ),
\[
\frac{\partial s_\infty}{\partial t}(t, s_0, \theta) = \gamma\, s_\infty(t, s_0, \theta)\left[\log\Big(\frac{S}{s_m}\Big)\Big(1 - \int_{\mathbb{R}^*_+ \times \Theta} C\big(s_\infty(t, s_0, \theta),\, s_\infty(t, s_0', \theta'),\, |x - x'|\big)\, \mu_0(ds_0', d\theta')\Big) - \log\Big(\frac{s_\infty(t, s_0, \theta)}{s_m}\Big)\right] = \int_{\mathbb{R}^*_+ \times \Theta} g\big(s_\infty(t, s_0, \theta), \theta,\, s_\infty(t, s_0', \theta'), \theta'\big)\, \mu_0(ds_0', d\theta').
\]
This equation can be interpreted as the continuous version of the original Schneider
system, where the empirical expectation is replaced by the theoretical expectation.
For all time t ≥ 0, the mean-field limit of the Schneider system is given by
μ(t) = (s∞ (t), IdΘ )#μ0 , where # is the pushforward operator between a function
and a probability measure. To prove the existence and uniqueness of the flow s∞ ,
we proceed to the exact same steps as in the proof of the Cauchy–Lipschitz theorem,
except that this time the initial conditions are not scalar but probability distributions.
More specifically, the existence and uniqueness are consequences of a fixed point
procedure, with the recurrence equation:
\[
f^{n+1}(t, s_0, \theta) = s_0 + \int_0^t \int_{\mathbb{R}^*_+ \times \Theta} g\big(f^n(\tau, s_0, \theta), \theta,\, f^n(\tau, s_0', \theta'), \theta'\big)\, \mu_0(ds_0', d\theta')\, d\tau.
\]
The sequence (f n )n∈N converges for some functional metric to a fixed point of
the recurrence function, which defines the mean-field flow s∞ . The convergence of
the empirical measure μ̂N (t) is proved in Della Noce et al. (2019), resorting to an
argument referred to as the propagation of chaos.
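The fixed-point construction can be illustrated on a scalar caricature (an assumed example, not the Schneider flow): iterate f^{n+1}(t) = s0 + ∫₀ᵗ g(f^n(τ)) dτ on a grid. For g(y) = −y, the iterates converge to s0·e^{−t}.

```python
def picard_iterate(g, s0, t_grid, n_iter):
    """Picard iteration f^{n+1}(t) = s0 + int_0^t g(f^n(tau)) dtau,
    with the integral evaluated by the trapezoidal rule on a uniform grid."""
    dt = t_grid[1] - t_grid[0]
    f = [s0] * len(t_grid)                 # start from the constant function
    for _ in range(n_iter):
        new, acc = [s0], 0.0
        for k in range(1, len(t_grid)):
            acc += 0.5 * (g(f[k - 1]) + g(f[k])) * dt
            new.append(s0 + acc)
        f = new
    return f
```

In the mean-field setting the scalar g is replaced by an expectation over μ0, but the contraction mechanism driving the convergence is the same.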
Let us return to the inference problem described in section 24.2.2. We consider the
following approximation: if the size of the population is large enough, s∞ is close
to the individual trajectories, and we can assume that observations are made on the
infinite population rather than on the finite population. Under this approximation, the
resulting likelihood of the observations has a simplified expression, in comparison
with the original one in equation [24.4]:
\[
p^{\infty}_{s|\eta}(s|\eta) = \frac{1}{(2\pi\sigma^2)^{\frac{N_0 m}{2}}}\, \prod_{i=1}^{N_0} \int_{\Theta} \exp\left(-\frac{1}{2\sigma^2}\sum_{j=1}^{m}\big(s_{ij} - s_\infty(t_j, \eta, \theta_i)\big)^2\right) \mu_0^{\theta}(d\theta_i).
\]
which means that, for an accurate observation protocol (with large N0 and small σ),
the mean-field approximation is relevant for population size N higher than in the case
of a rough observation protocol.
We can therefore use this limit in total variation to approximate the infinite mixture
of densities by a finite mixture, by truncating the infinite sum in equation [24.4] at
a size N = Nmf , above which the mean-field likelihood is close to the original
likelihood, below some tolerance ε:
\[
\tilde{p}^{\,N_{mf}}_{s|\eta}(s|\eta) = \sum_{N=N_0}^{N_{mf}} p_N(N)\, p_{s|\eta,N}(s|\eta, N) + \left(\sum_{N=N_{mf}+1}^{+\infty} p_N(N)\right) p^{\infty}_{s|\eta}(s|\eta).
\]
This approximated likelihood seems much more manageable than the original one.
Nevertheless, the main difficulty is in the simulation of the mean-field flow s∞ which
is used in the expression of p̃s|η .
To estimate the mean-field flow s∞, we use a two-level numerical method with the same structure as the numerical method used in section 24.2.1 to simulate a population of finite size N. Similarly, we consider the same subdivision {t0 = 0, t1, . . . , tM = T} of the interval [0; T]. In a finite population, the dynamics of the individuals' sizes are driven by the competition potential C_N^i(t); in the case of an infinite population, the same role is played by the averaged competition potential, defined for all t ∈ ℝ+, (s, θ) ∈ D by
Figure 24.2. Top left: mean value of the parameter S according to the position x of the
plant. Top right: mean value of the parameter γ according to the position of the plant.
Bottom: evaluation of the mean-field flow s∞ at the end of the observation period,
with the mean values of parameters S and γ . s∞ is computed using the two-level
numerical scheme introduced in this section. For a color version of this figure, see
www.iste.co.uk/zafeiris/data1.zip
24.5. Conclusion
24.6. References
Bongini, M., Fornasier, M., Hansen, M., Maggioni, M. (2017). Inferring interaction rules from
observations of evolutive systems I: The variational approach. Mathematical Models and
Methods in Applied Sciences, 27(05), 909–951.
Carrillo, J.A., Fornasier, M., Toscani, G., Vecil, F. (2010). Particle, kinetic, and hydrodynamic
models of swarming. In Mathematical Modeling of Collective Behavior in Socio-Economic
and Life Sciences, Naldi, G., Pareschi, L., Toscani, G. (eds). Birkhäuser, Boston.
Cucker, F. and Smale, S. (2007). On the mathematics of emergence. Japanese Journal of
Mathematics, 2(1), 197–227.
Degond, P., Dimarco, G., Mac, T.B.N. (2014). Hydrodynamics of the Kuramoto–Vicsek model
of rotating self-propelled particles. Mathematical Models and Methods in Applied Sciences,
24(02), 277–325.
Della Noce, A., Mathieu, A., Cournède, P.H. (2019). Mean field approximation of a
heterogeneous population of plants in competition. arXiv preprint [Online]. Available at:
arXiv:1906.01368.
Golse, F. (2016). On the dynamics of large particle systems in the mean field limit. In
Macroscopic and Large Scale Phenomena: Coarse Graining, Mean Field Limits and
Ergodicity, Muntean, A., Rademacher, J., Zagaris, A. (eds). Springer, Cham.
Lu, F., Zhong, M., Tang, S., Maggioni, M. (2019). Nonparametric inference of interaction laws
in systems of agents from trajectory data. Proceedings of the National Academy of Sciences,
116(29), 14424–14433.
Lv, Q., Schneider, M.K., Pitchford, J.W. (2008). Individualism in plant populations: Using
stochastic differential equations to model individual neighbourhood-dependent plant growth.
Theoretical Population Biology, 74(1), 74–83.
Nakagawa, Y., Yokozawa, M., Hara, T. (2015). Competition among plants can lead to an
increase in aggregation of smaller plants around larger ones. Ecological Modelling, 301,
41–53.
Preston, C. (1975). Spatial birth and death processes. Advances in Applied Probability, 7(3),
465–466.
Roberts, G.O. and Rosenthal, J.S. (2006). Harris recurrence of Metropolis-within-Gibbs and
trans-dimensional Markov chains. The Annals of Applied Probability, 16(4), 2123–2139.
Schneider, M.K., Law, R., Illian, J.B. (2006). Quantification of neighbourhood-dependent plant
growth by Bayesian hierarchical modelling. Journal of Ecology, 94, 310–321.
25

Pricing Financial Derivatives in the Hull–White Model

Data Analysis and Related Applications 1,
First Edition. Edited by Konstantinos N. Zafeiris, Christos H. Skiadas, Yiannis Dimotikalis, Alex
Karagrigoriou and Christiana Karagrigoriou-Vonta.
© ISTE Ltd 2022. Published by ISTE Ltd and John Wiley & Sons, Inc.

Without going through the technical details, we mention that the classical idea
of cubature methods, and consequently of cubature formulae, can be described as the
construction of a probability measure with finite support on a finite-dimensional real
linear space which approximates the standard Gaussian measure. For more technical
details, see Lyons and Victoir (2002) and Malyarenko et al. (2017). A generalization
of this idea, when a finite-dimensional space is replaced with a Wiener space, can
be used for constructing modern Monte Carlo estimates. In what follows, we briefly
review both classical and modern Monte Carlo approaches.
Simulation can be performed using two approaches, classical and modern. These two
approaches can be described in the following steps.
In this section, we would like to review how we applied the Stratonovich correction
to the Black–Scholes SDE in Nohrouzian and Malyarenko (2019). To begin with,
in mathematical finance, we mostly deal with parabolic partial differential equations
(PDEs). We would like to use the cubature method on Wiener space to simulate certain
SDEs, transforming the problem of solving PDEs into that of estimating stochastic
integrals on Wiener space (for more details, see Nohrouzian and Malyarenko (2019)).
Let us give our full attention to the Wiener space of scalar-valued functions.
In order to explain the idea of the cubature method on Wiener space, we consider
first the dynamics of risky asset prices proposed by Samuelson (1965), i.e.
dS(t) = rS(t)dt + σS(t)dW (t), [25.1]
where S(t) is the time-t price of a (non-dividend paying) risky asset, r and σ
are drift (risk-free interest rate) and diffusion (volatility of asset price) coefficients,
respectively, and {W (t)}t≥0 is a standard one-dimensional Wiener process. Black
and Scholes in Black and Scholes (1973) used equation [25.1] and delta-hedging to
derive the celebrated Black–Scholes PDE. Then, they used such a change of variables
that the above PDE became a well-known heat equation.
C = e^{-rT}\,\mathbb{E}^*[X],

where r denotes the risk-free interest rate and T is the time to maturity of the claim.
Recall that the mathematical expectation is the integral

C = e^{-rT} \int_\Omega X(\omega)\, dP^*(\omega), [25.5]

where the stochastic process S(t, ω) denotes the price of a risky asset and is the
solution to the Black–Scholes SDE given in [25.1].
Even in the Black–Scholes model, such a calculation is not simple if the claim X
is more complicated than the European call or put option. Instead, we use the cubature
method.
For a function f, which is P^*-integrable, the cubature formula takes the form

\int_{-\infty}^{\infty} f(x)\, dP^*(x) \approx \int_{-\infty}^{\infty} f(x)\, dQ(x) = \sum_{k=1}^{N} \lambda_k f(x_k).
If two persons propose two different cubature formulae, which one is better? Note
that any polynomial in x is P^*-integrable. We say that a cubature formula Q has degree
m if, for any polynomial P(x) of degree less than or equal to m, we have

\int_{-\infty}^{\infty} P(x)\, dP^*(x) = \sum_{k=1}^{N} \lambda_k P(x_k),

that is, for any such polynomial, the cubature formula is exact. If we have two different
cubature formulae of degree m, then the one with the smaller number of nodes N is better. The
classical Tchakaloff theorem (Tchakaloff 1957) guarantees the existence of a cubature
formula with N less than or equal to the dimension of the linear space of corresponding
polynomials. However, it is an existence theorem; it does not give any way to construct
the nodes x_k and the weights λ_k.
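To make the definition concrete, here is a minimal sketch (Python, not from the chapter) of the classical degree-5 cubature formula for the standard Gaussian measure on the real line, namely the three-point Gauss–Hermite rule with nodes 0 and ±√3, together with an exactness check on monomials:

```python
import math

# Degree-5 cubature for the standard Gaussian measure on the real line:
# the classical three-point Gauss-Hermite rule with nodes 0, +/- sqrt(3)
# and weights 2/3, 1/6, 1/6.
nodes = [-math.sqrt(3.0), 0.0, math.sqrt(3.0)]
weights = [1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0]

def cubature(f):
    """Approximate the Gaussian integral of f by the finite sum sum_k lambda_k f(x_k)."""
    return sum(w * f(x) for w, x in zip(weights, nodes))

# Exact Gaussian moments E[x^m]: 1, 0, 1, 0, 3, 0 for m = 0..5.
exact_moments = [1.0, 0.0, 1.0, 0.0, 3.0, 0.0]
for m, target in enumerate(exact_moments):
    assert abs(cubature(lambda x: x ** m) - target) < 1e-12  # exact up to degree 5

# Degree 6 is NOT reproduced: the rule gives 9, while E[x^6] = 15.
```

The same three nodes reappear later in the chapter as the terminal values ω_k(1) of the degree-5 cubature paths on Wiener space.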
and the formula has degree m if and only if all ω_k have bounded variation and, for
all iterated Stratonovich integrals up to degree m, we have

\mathbb{E}^*[P(\omega)] = \sum_{k=1}^{N} \lambda_k \int_0^T \int_{t_1}^T \cdots \int_{t_{\ell-1}}^T d\omega_k(t_\ell) \cdots d\omega_k(t_1),

where the integrals on the right-hand side are iterated Riemann–Stieltjes integrals.
The Tchakaloff theorem remains true, but is still not constructive. In order to construct
cubature formulae on Wiener space, we followed Lyons and Victoir (see Lyons and
Victoir 2002), where we used advanced algebraic methods presented in Malyarenko
et al. (2017) and Nohrouzian and Malyarenko (2019).
25.2.2.1. Application
Let {Y(t)}_{t≥0} be the solution to the following Itô SDE:

dY(t) = a(t, Y(t))\, dt + b(t, Y(t))\, dW(t).

We rewrite the above equation in its Stratonovich differential form (see Øksendal
(2013)):

dY(t) = \left( a(t, Y(t)) - \tfrac{1}{2}\, b(t, Y(t))\, \partial_y b(t, Y(t)) \right) dt + b(t, Y(t)) \circ dW(t).

We replace W(s) with the paths ω_k, which gives the following integral equations:

Y_k(t) = Y_k(0) + \int_0^t a(s, Y_k(s))\, ds + \int_0^t b(s, Y_k(s))\, d\omega_k(s), \quad 1 \le k \le N.
For the next step, let us briefly review the implementation of Stratonovich
corrections and the obtained results in Nohrouzian and Malyarenko (2019).
In order to use the cubature formula on a Wiener space for the Black–Scholes
SDE, we need to rewrite the Itô process given in [25.2] in its Stratonovich form (see,
for example, Øksendal (2013) and Nohrouzian and Malyarenko (2019)). That is,

dS(t) = \left( r - \tfrac{1}{2}\sigma^2 \right) S(t)\, dt + \sigma S(t) \circ dW(t). [25.6]

Rearranging the last equation and calculating the integral of both hand sides gives

\hat{S}_k(t_j) = \hat{S}_k(t_{j-1}) \exp\!\left( \left( r - \tfrac{1}{2}\sigma^2 \right) [t_j - t_{j-1}] + \sigma [\omega_k(t_j) - \omega_k(t_{j-1})] \right), [25.7]

with j = 1, \ldots, l and 0 \le t_j \le 1,
where 0 = t0 < t1 < t2 < t3 = 1, i.e. trajectories start from time 0 and stop at time 1,
j = 1, 2, 3, tj − tj−1 = 1/3 and with weight λk and coefficients θk summarized in
Table 25.1.
Figure 25.1 is created in MATLAB® and depicts the two possible sets of
trajectories for cubature of degree 5.
Assume that today’s market price of an arbitrary asset is S0 = 20$, the yearly
interest rate is r = 12% and the yearly volatility σ = 30%. Then, we can rearrange
equation [25.8] and substitute the result in equation [25.7]. As a result, we get two
possible sets of trajectories for the price process illustrated in Figure 25.2.
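The price update [25.7] over one period can be sketched as follows (Python, not from the chapter). The terminal path values ω_k(1) ∈ {−√3, 0, √3} appear in the text; the weights λ_k = 1/6, 2/3, 1/6 are the classical degree-5 (three-point Gauss–Hermite) values and are assumed here, since Table 25.1 is not reproduced:

```python
import math

# Cubature price update [25.7] over one period [0, 1] with the degree-5 formula.
S0, r, sigma = 20.0, 0.12, 0.30   # example data from the text
omegas = [-math.sqrt(3.0), 0.0, math.sqrt(3.0)]
lambdas = [1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0]  # assumed Table 25.1 weights

# Terminal prices along the three cubature trajectories, equation [25.7]:
S1 = [S0 * math.exp((r - 0.5 * sigma**2) * 1.0 + sigma * w) for w in omegas]

# The weighted average approximates the risk-neutral expectation E*[S(1)],
# which for the Black-Scholes dynamics equals S0 * exp(r).
approx = sum(l * s for l, s in zip(lambdas, S1))
exact = S0 * math.exp(r)
assert abs(approx - exact) / exact < 1e-4
```

The small relative error reflects the fact that the formula integrates polynomials of the Gaussian exactly up to degree 5.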
Furthermore, Figure 25.3 depicts the idea of the iterated cubature formula which
we will use later to construct a trinomial tree.
Assume that we divide the time interval [0, T ] into n intervals of not necessarily
equal length, where n ∈ Z, n ≥ 1. Then, we will have {0 = t0 , . . . , tn = T } and an
n-step time grid.
That is,

f_u = \frac{S_1(1)}{S(0)} = \exp\!\left( \left( r - \tfrac{1}{2}\sigma^2 \right) + \sigma\, \omega_1(1) \right) = \exp\!\left( \left( r - \tfrac{1}{2}\sigma^2 \right) + \sigma\sqrt{3} \right),

f_m = \frac{S_2(1)}{S(0)} = \exp\!\left( r - \tfrac{1}{2}\sigma^2 \right),

f_d = \frac{S_3(1)}{S(0)} = \exp\!\left( \left( r - \tfrac{1}{2}\sigma^2 \right) + \sigma\, \omega_3(1) \right) = \exp\!\left( \left( r - \tfrac{1}{2}\sigma^2 \right) - \sigma\sqrt{3} \right),
Furthermore, due to the symmetry of path ω_1 and path ω_3, i.e. ω_1 = −ω_3, and
the log-normality of the considered price process, i.e. S_k > 0, it is easy to see that
f_m = \sqrt{f_u f_d}. Therefore, we have a recombining trinomial tree. Figure 25.5b depicts
a five-step trinomial tree with S_0 = 100$, r = 0.05 and σ = 0.1.
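The recombination property can be checked directly. The sketch below (Python, not from the chapter) builds the three factors with the chapter's example parameters and verifies, by brute force over all paths, that level n of the tree contains only 2n + 1 distinct nodes:

```python
import math
from itertools import product

# A recombining trinomial tree built from the factors f_u, f_m, f_d.
# Because f_m = sqrt(f_u * f_d), an up move followed by a down move lands on
# the same node as two middle moves, so level n has only 2n + 1 nodes.
S0, r, sigma = 100.0, 0.05, 0.10
fu = math.exp((r - 0.5 * sigma**2) + sigma * math.sqrt(3.0))
fm = math.exp(r - 0.5 * sigma**2)
fd = math.exp((r - 0.5 * sigma**2) - sigma * math.sqrt(3.0))

assert abs(fm - math.sqrt(fu * fd)) < 1e-12  # recombination condition

def level(n):
    """Distinct node values at step n: S0 * fm**n * (fu/fm)**j, j = -n..n."""
    return [S0 * fm**n * (fu / fm) ** j for j in range(-n, n + 1)]

# Brute-force check: every path of n moves lands on one of the 2n + 1 values.
for n in range(1, 6):
    vals = {round(S0 * math.prod(p), 8) for p in product([fu, fm, fd], repeat=n)}
    assert vals == {round(v, 8) for v in level(n)}
```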
Interest-rate models can be divided into two groups. First, spot rate models which
consist of equilibrium models and no-arbitrage models. Second, forward rate models.
Let {r(t)}t≥0 be the stochastic interest rate (instantaneous spot rate), B(t) be the
money market account, P (t, T ) be the market price of a default-free discount bond and
f (t, T ) be the instantaneous forward rate. Then, the following relations hold (Kijima
2013):
for t \le T:

r(t) = -\frac{\partial}{\partial T} \ln P(t, T) \Big|_{T=t}, \qquad f(t, T) = -\frac{\partial}{\partial T} \ln P(t, T),

P(t, T) = \exp\!\left( -\int_t^T f(t, s)\, ds \right), \qquad P(t, T) = \exp\!\left( -\int_t^T r(s)\, ds \right) = \frac{B(t)}{B(T)}.
Typically, spot-rate models are mean reverting. Let us briefly review equilibrium,
no-arbitrage and forward rate models.
In equilibrium models, the spot rate solves an SDE of the form

dr(t) = m(r)\, dt + s(r)\, dW(t), [25.9]

where m and s are the instantaneous drift and standard deviation, respectively,
assumed to be functions of the instantaneous spot rate and usually time-independent.
Model                  m           s
Rendleman–Bartter      μr          σr
Vasicek                a(b − r)    σ
Cox–Ingersoll–Ross     a(b − r)    σ√r

Table 25.2. Some well-known equilibrium models, where a and b are constants
Proper choices of m and s summarized in Table 25.2 convert SDE [25.9] to SDEs
given in the Rendleman–Bartter model (Rendleman and Bartter 1980), the Vasicek
model (Vasicek 1977) and the Cox–Ingersoll–Ross (CIR) model (Cox et al. 1985)
(see Kijima 2013; Hull 2017).
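The conversion of [25.9] into a concrete model is mechanical: plug a row of Table 25.2 into an Euler step. A minimal sketch (Python; the parameter values are illustrative, not from the text):

```python
import math
import random

# Euler discretization of the equilibrium SDE [25.9], dr = m(r) dt + s(r) dW,
# for the models of Table 25.2.
def euler_path(m, s, r0, dt, n, rng):
    """Simulate n Euler steps of dr = m(r) dt + s(r) dW and return r(n * dt)."""
    r = r0
    for _ in range(n):
        r += m(r) * dt + s(r) * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return r

a, b, sigma = 0.5, 0.03, 0.0   # sigma = 0 isolates the mean-reverting drift
vasicek_m = lambda r: a * (b - r)
vasicek_s = lambda r: sigma
cir_s = lambda r: sigma * math.sqrt(max(r, 0.0))  # CIR row of Table 25.2

rng = random.Random(0)
# With no noise, the Vasicek rate relaxes from r0 = 0.10 to the long-run level b.
r_end = euler_path(vasicek_m, vasicek_s, 0.10, 0.01, 5000, rng)
assert abs(r_end - b) < 1e-6
```

The same `euler_path` helper works for the CIR diffusion `cir_s` once σ > 0; the `max(r, 0)` guard is a common (assumed) safeguard against the Euler step driving r slightly negative.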
Unlike equilibrium models, the drift and diffusion parts of the SDE [25.9] in
no-arbitrage models are usually functions of time. Given a filtered probability space
(Ω, F, P, (F_t)_{t≥0}), in no-arbitrage models the dynamics of the spot rate r under the
physical probability measure P satisfy the following SDE (Kijima 2013; Pascucci 2011):

dr(t) = \alpha(t, r(t))\, dt + \beta(t, r(t))\, dW(t). [25.10]

By definition, the price of a default-free discount bond via risk-neutral pricing,
i.e. using the equivalent martingale probability measure Q ∼ P, is given by (Pascucci
2011)

P(t, T) = \mathbb{E}^{Q}\!\left[ \exp\!\left( -\int_t^T r(u)\, du \right) \Big|\, \mathcal{F}_t \right], \quad 0 \le t \le T. [25.11]
The Ho–Lee model (Ho and Lee 1986) and the Hull–White one-factor model (Hull
and White 2015) are some well-known no-arbitrage (as well as affine) models. For the
purpose of this chapter, we will look closely at the Hull–White model and its lattice
applications.
The most famous forward rate models are Black (Black 1975),
Heath–Jarrow–Morton (HJM) (Heath et al. 1990, 1992) and Brace–Gatarek–Musiela
(BGM) (Brace et al. 1997) or the LIBOR market model (see also Jamshidian 1997;
Miltersen et al. 1997). For more detailed information about interest-rate models, the
reader is referred to Kijima (2013); Hull (2017); Nohrouzian et al. (2021).
The Hull–White one-factor model, or the extended Vasicek model, generalizes
the Ho–Lee model. The Hull–White one-factor model provides an exact fit to the
initial term structure (see Hull and White 2015). Set α(t, r) = α_1(t) + α_2(t)r =
φ(t) − ar and β²(t, r) = β_1(t) + β_2(t)r = σ² in equation [25.10]; then the
instantaneous short rate r in the Hull–White model is the solution to

dr(t) = [φ(t) − a r(t)]\, dt + \sigma\, dW(t), [25.13]
where a and σ are constants. Note that SDE [25.13] (even if a and σ are not
constants and are functions of time) describes a general Gaussian–Markov process
(see Glasserman 2004, Equation (3.41), p. 109). Moreover, the function ϕ(t) can be
calculated from the initial term structure (Hull and White 2015; Hull 2017). That is,
φ(t) = \partial_t f(0, t) + a f(0, t) + \frac{\sigma^2}{2a}\left( 1 - e^{-2at} \right), [25.14]
where f (0, t) is the observed instantaneous forward rate in the market at time 0. We
will explain the idea and derivation of φ(t) in section 25.3.6. Using the above equation,
we can rewrite SDE [25.13] in the following form:
dr(t) = \left( \partial_t f(0, t) + a \left[ f(0, t) + \frac{\sigma^2}{2a^2}\left( 1 - e^{-2at} \right) - r(t) \right] \right) dt + \sigma\, dW(t). [25.15]
SDE [25.15] describes the dynamics of an affine model and has an explicit
solution. We will not go through the theory of affine models. Instead, we would like
to concentrate on the application of the Hull–White model to construct a trinomial
tree using the cubature method. The reader is therefore referred to Kijima (2013),
Hull (2017) and Hull and White (2015).
Hull and White (1994, 1996) explained how to apply a numerical procedure and
use an interest-rate tree in their model. As we explained in section 25.2.6, the lattice
approximation has advantages in terms of time efficiency and pricing American-style
path-dependent options. Now, we would like to construct the Hull–White trinomial
interest-rate model using the cubature formula of degree 5 on Wiener space.
The discretization of SDE [25.15] in the Euler scheme (see Glasserman 2004)
gives

\hat{r}(t_{i+1}) = \hat{r}(t_i) + \left( \partial_t \hat{f}(0, t_i) + a \left[ \hat{f}(0, t_i) + \frac{\sigma^2}{2a^2}\left( 1 - e^{-2at_i} \right) - \hat{r}(t_i) \right] \right) (t_{i+1} - t_i) + \sigma \sqrt{t_{i+1} - t_i}\; Z_{i+1}, [25.16]
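The Euler scheme [25.16] can be sketched as follows (Python, not from the chapter), under the simplifying assumption of a flat initial forward curve f(0, t) = f0, for which ∂t f(0, t) = 0:

```python
import math
import random

# Euler simulation of the Hull-White short rate, equation [25.16],
# for an assumed flat initial forward curve f(0, t) = f0.
def hull_white_euler(r0, f0, a, sigma, T, n, rng):
    """Simulate n Euler steps on [0, T] and return the terminal short rate."""
    dt = T / n
    r = r0
    for i in range(n):
        t = i * dt
        drift = a * (f0 + sigma**2 / (2 * a**2) * (1 - math.exp(-2 * a * t)) - r)
        r += drift * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return r

rng = random.Random(1)
# Sanity check: with sigma = 0 and r(0) = f0, the drift vanishes along the
# whole path, so the flat curve is reproduced exactly.
assert abs(hull_white_euler(0.02, 0.02, 0.35, 0.0, 5.0, 500, rng) - 0.02) < 1e-12
```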
To begin with, the Ho–Lee model and the Vasicek model are special cases of the
following general Markov process (Glasserman 2004):

dr(t) = [g(t) + h(t) r(t)]\, dt + \sigma(t)\, dW(t),

where the functions g(t), h(t) and σ(t) are time-dependent. The solution to the above
SDE (affine model) is given by

r(t) = r(0)\, e^{H(t)} + \int_0^t e^{H(t) - H(u)} g(u)\, du + \int_0^t e^{H(t) - H(u)} \sigma(u)\, dW(u),

where

H(t) = \int_0^t h(u)\, du.

The above relation can be verified by the Itô formula. For the Hull–White model,
the general solution of the above for any given r(s), 0 < s < t, is

r(t) = r(s)\, e^{-a(t-s)} + \int_s^t e^{-a(t-u)} φ(u)\, du + \int_s^t e^{-a(t-u)} \sigma(u)\, dW(u). [25.17]
For a given r(s), the first two terms on the right-hand side of the above equation
are the mean of the normally distributed r(t). By definition, for the variance, we have

\sigma_r^2(s, t) := \int_s^t \sigma^2 e^{-2a(t-u)}\, du = \frac{\sigma^2}{2a}\left( 1 - e^{-2a(t-s)} \right).

Thus,

\hat{r}(t_{i+1}) = e^{-a(t_{i+1} - t_i)}\, \hat{r}(t_i) + \mu(t_i, t_{i+1}) + \sigma_r(t_i, t_{i+1})\, Z_{i+1}, [25.18]
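The closed form of the variance integral above is easy to verify numerically. A short sketch (Python; parameter values are illustrative) compares a midpoint-rule quadrature of the integral with the closed form:

```python
import math

# Numerical check of the Hull-White transition variance:
# sigma_r^2(s, t) = int_s^t sigma^2 exp(-2a(t-u)) du
#                 = sigma^2 (1 - exp(-2a(t-s))) / (2a).
a, sigma, s, t = 0.35, 0.15, 0.0, 2.0

# Midpoint-rule quadrature of the integral.
n = 20_000
h = (t - s) / n
integral = sum(
    sigma**2 * math.exp(-2 * a * (t - (s + (i + 0.5) * h))) * h
    for i in range(n)
)

closed_form = sigma**2 / (2 * a) * (1 - math.exp(-2 * a * (t - s)))
assert abs(integral - closed_form) < 1e-8
```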
We will calculate ϕ(t) in the next part of this chapter. Let us start by the calibration
of the market’s data to the short rate process r.
25.3.6.2. Calibration
Assume that X ∼ N(m, v²) is a normal random variable; then we have

\mathbb{E}[e^{-X}] = \exp\!\left( -m + \tfrac{1}{2} v^2 \right).

Since the integral of the Gaussian process r is itself Gaussian, equation [25.11] gives

P(t, T) = \exp\!\left( -\,\mathbb{E}\!\left[ \int_t^T r(u)\, du \right] + \frac{1}{2}\, \mathrm{Var}\!\left[ \int_t^T r(u)\, du \right] \right). [25.20]
Substituting equation [25.17] into the expected value of the integral of the process r,

\mathbb{E}\!\left[ \int_0^T r(u)\, du \right] = \int_0^T \mathbb{E}[r(u)]\, du = \frac{r(0)}{a}\left( 1 - e^{-aT} \right) + \int_0^T\!\! \int_0^t e^{-a(t-u)} φ(u)\, du\, dt.
\cdots + \frac{\sigma^2}{2a^2}\left( e^{-2aT} - e^{-aT} \right).
Thus,

φ(T) = \partial_T f(0, T) + a f(0, T) + \frac{\sigma^2}{2a}\left( 1 - e^{-2aT} \right).

Finally, the above equation is true for any maturity. That is,

φ(t) = \partial_T f(0, T) \Big|_{T=t} + a f(0, t) + \frac{\sigma^2}{2a}\left( 1 - e^{-2at} \right). [25.21]
Substituting the result into equation [25.18], we get the following general equation to
simulate the process r:

\hat{r}(t_{i+1}) = e^{-a(t_{i+1} - t_i)}\, \hat{r}(t_i) + \mu(t_i, t_{i+1}) + \sigma \sqrt{\frac{1 - e^{-2a(t_{i+1} - t_i)}}{2a}}\; Z_{i+1}. [25.22]
where

\tilde{α}_i(t, x) = α_i(t, x) - \frac{1}{2} \sum_{j=1}^{m} \sum_{k=1}^{n} β_{kj}\, \frac{\partial β_{ij}}{\partial x_k}, \quad 1 \le i \le n,
Substituting the values of the last equations in equations [25.13] and [25.14]
converts equation [25.15] into

dr(t) = \left( \partial_t f(0, t) + \frac{\sigma^2}{2a}\left( 1 - e^{-2at} \right) + a \left[ f(0, t) - r(t) \right] \right) dt + \sigma \circ dW(t). [25.23]

Let κ(t) = \frac{\sigma^2}{2a}\left( 1 - e^{-2at} \right). In the next step, we rewrite the above equation
in its integral form, that is,

r(t) = r(0) + \int_0^t \left( \partial_s f(0, s) + κ(s) + a \left[ f(0, s) - r(s) \right] \right) ds + \int_0^t \sigma \circ dW(s).

Now, we get the following set of Riemann–Stieltjes integrals given by the cubature
method:

r_k(t) = r_k(0) + \int_0^t \left( \partial_s f(0, s) + κ(s) + a \left[ f(0, s) - r_k(s) \right] \right) ds + \int_0^t \sigma\, dω_k(s).
Taking the derivative of both hand sides of the above equation and applying the
fundamental theorem of calculus replaces the SDE [25.23] with the following finite set
of ODEs:

r_k'(t) = \partial_t f(0, t) + κ(t) + a \left[ f(0, t) - r_k(t) \right] + \sigma\, ω_k'(t), \quad 1 \le k \le N. [25.24]
Let us simulate the SDE and the ODE given in [25.15] and [25.24]. First, we make
a discretization of both equations. On the one hand, recall that the discretization of
SDE [25.15] in the Euler scheme was given by [25.16] (see Glasserman 2004). If we
use our shortened notation, we have

\hat{r}(t_{i+1}) = \hat{r}(t_i) + \left( \partial_t \hat{f}(0, t_i) + κ(t_i) + a \left[ \hat{f}(0, t_i) - \hat{r}(t_i) \right] \right) \Delta t_i + \sigma \sqrt{\Delta t_i}\; Z_{i+1}, [25.25]
On the other hand, in the implementation of [25.24] and for simplicity, we are not
interested in intermediate values in each time grid of length one (see Figure 25.5a).
Therefore, we will use only the last value of the sample path in the cubature formula
in equation [25.8]. After that, discretization of [25.24] gives

\hat{r}_k(t_{i+1}) = \hat{r}_k(t_i) + \left( \partial_t \hat{f}(0, t_i) + κ(t_i) + a \left[ \hat{f}(0, t_i) - \hat{r}_k(t_i) \right] \right) + \sigma ω_k, [25.26]

where 0 \le i \le n, n is the number of time discretizations, 1 \le k \le 3, ω_1 = -\sqrt{3},
ω_2 = 0 and ω_3 = \sqrt{3} (see the values of ω(t) at the end of the grids in Figure 25.2).
Furthermore, the values for λ_k are given in Table 25.1. Also, we observe that

\hat{r}(t_i) = \sum_{k=1}^{3} λ_k\, \hat{r}_k(t_i).
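The recursion [25.26] and the weighted average above can be sketched as follows (Python, not from the chapter). A flat forward curve f(0, t) = f0 is assumed, and the degree-5 weights λ = (1/6, 2/3, 1/6) are assumed since Table 25.1 is not reproduced. The sketch also checks the observation made later in the text that the weighted average coincides with the middle path:

```python
import math

# Cubature recursion [25.26] for the short rate with unit time steps.
a, sigma, f0 = 0.35, 0.15, 0.02   # illustrative parameters
omegas = [-math.sqrt(3.0), 0.0, math.sqrt(3.0)]
lambdas = [1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0]  # assumed Table 25.1 weights

def kappa(t):
    return sigma**2 / (2 * a) * (1 - math.exp(-2 * a * t))

n = 30
paths = [[f0] for _ in omegas]    # r_k(t_0) = f(0, 0) = f0 for every path
for i in range(n):
    for k, w in enumerate(omegas):
        r = paths[k][-1]
        # One unit step of [25.26]; d/dt f(0, t) = 0 for the flat curve.
        paths[k].append(r + kappa(i) + a * (f0 - r) + sigma * w)

# Weighted average of the three paths, r_hat = sum_k lambda_k r_k ...
avg = [sum(l * p[i] for l, p in zip(lambdas, paths)) for i in range(n + 1)]
# ... coincides with the middle path (omega = 0): the recursion is affine in
# omega_k and sum_k lambda_k omega_k = 0.
assert all(abs(x - y) < 1e-12 for x, y in zip(avg, paths[1]))
```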
We used the (instantaneous) forward rates (FR) and spot rates (SR) available from
the European Central Bank (ECB) on March 1, 2021. The given rates are based on
triple-A (AAA) rated bonds and are summarized in Table 25.3.
i 0 1 2 3 4 5 6 7 8
ti (in years) 0 0.25 0.50 0.75 1.00 2.00 3.00 4.00 5.00
f (0, ti ) -0.63737 -0.63737 -0.679325 -0.709803 -0.730072 -0.730798 -0.644705 -0.512856 -0.362631
r(ti ) -0.611094 -0.611094 -0.635227 -0.655307 -0.671664 -0.705749 -0.701509 -0.671451 -0.624844
i 9 10 11 12 13 14 15 16 17
ti (in years) 6.00 7.00 8.00 9.00 10.00 11.00 12.00 13.00 14.00
f (0, ti ) -0.211625 -0.070525 0.054764 0.161581 0.249389 0.318988 0.371949 0.410242 0.43597
r(ti ) -0.56848 -0.507261 -0.444655 -0.383053 -0.324043 -0.268616 -0.217334 -0.170444 -0.127979
i 18 19 20 21 22 23 24 25 26
ti (in years) 15.00 16.00 17.00 18.00 19.00 20.00 21.00 22.00 23.00
f (0, ti ) 0.451206 0.457894 0.457791 0.45244 0.443166 0.431087 0.417124 0.402025 0.386386
r(ti ) -0.089822 -0.055759 -0.025518 0.001205 0.024725 0.045355 0.063396 0.079135 0.092835
i 27 28 29 30 31 32 33 - -
ti (in years) 24.00 25.00 26.00 27.00 28.00 29.00 30.00 - -
f (0, ti ) 0.370669 0.355226 0.340317 0.326124 0.312769 0.300323 0.288821 - -
r(ti ) 0.104738 0.115065 0.124013 0.131759 0.13846 0.144253 0.149261 - -
Table 25.3. ECB instantaneous forward rates and spot rates in %, March 1, 2021
Given the data in Table 25.3, we used the MATLAB® function polyfit to estimate
fˆ(0, t_i) for i = 1, . . . , n, which fits the given data for the instantaneous forward rates
in a least-squares sense. Figure 25.6 illustrates the graphs of the initial data and the
fitted polynomial, which has degree 6.
After that, we set n = 30 × 52, i.e. 1,560 weeks, a = {0.075, 0.35}, σ = 0.15,
r(t1 ) = r(t0 ) and f (0, t1 ) = f (0, t0 ). Then, substituting the obtained polynomial
in equations [25.16] and [25.26], we simulate the spot-rate SDE in the Hull–White
model. We make 100,000 simulations for classical Monte Carlo and then take the
average of simulations. We make one cubature simulation, where we calculate
weighted average values in each step. Note that this is equivalent to simulating
only the middle path in equation [25.26], i.e. for k = 2. The results are depicted
in Figures 25.7a and 25.7b.
Figure 25.7. Initial FR, SR, Monte Carlo and cubature mean
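The curve-fitting step can be mirrored in Python with numpy.polyfit. The sketch below uses a subsample of the (t_i, f(0, t_i)) pairs of Table 25.3 (rates in %), while the chapter fits all points; rescaling time to [0, 1] is an added precaution to keep the Vandermonde system well conditioned (MATLAB's polyfit offers centering and scaling for the same reason):

```python
import numpy as np

# Degree-6 least-squares fit of the instantaneous forward curve,
# mirroring MATLAB's polyfit on a subsample of Table 25.3.
t = np.array([0.0, 0.25, 0.50, 0.75, 1.0, 2.0, 3.0, 4.0, 5.0,
              10.0, 15.0, 20.0, 25.0, 30.0])
f = np.array([-0.63737, -0.63737, -0.679325, -0.709803, -0.730072,
              -0.730798, -0.644705, -0.512856, -0.362631,
              0.249389, 0.451206, 0.431087, 0.355226, 0.288821])

u = t / 30.0                                  # rescaled time in [0, 1]
coeffs = np.polyfit(u, f, deg=6)              # least-squares fit of degree 6
f_hat = np.polyval(coeffs, u)                 # fitted forward curve at the knots
# Chain rule: d/dt = (1/30) d/du, giving the time derivative needed in [25.16].
f_prime = np.polyval(np.polyder(coeffs), u) / 30.0

assert np.max(np.abs(f_hat - f)) < 0.15       # fit tracks the subsampled curve
```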
We do not know which model the ECB, or the financial institutions that provide AAA
rated bonds, use to estimate spot rates. However, as depicted in Figure 25.7, the
Hull–White model via the cubature method seems to fit the ECB's spot rates fairly well.
The cubature method just described for estimating the forward rate is much faster than
classical Monte Carlo but, as we will see in section 25.4.2, it cannot be used to
price financial derivatives. For pricing financial derivatives, we need to consider more
paths (trajectories).
Hull and White (1994, 1996) explicitly explained how they constructed their
recombining trinomial tree. To put it briefly, they assumed that at each node the
short-rate can go either:
– up one, straight along and down one;
– or up two, up one and straight along;
– or straight along, down one and down two.
They also calculated the corresponding probabilities to reach each node (see
also Hull (2017)). We will deal, however, with a non-recombining tree created by
the iterated cubature formula of degree 5 on Wiener space. In other words, in order
to price financial derivatives via cubature method, unlike the approach presented
in Figure 25.7, we need to access more possible random values at each time. Therefore,
we iterate the cubature formula which results in obtaining a non-recombining
trinomial tree.
closer to the ECB FRs than to the ECB SRs. The reason is that the Hull–White
model is a mean-reverting model, and for an accurate evolution of the process r̂(t), we
need more time steps. The mean-reverting characteristic of the Hull–White model
means that we cannot choose a large time interval length; the steps should be small
enough. Figure 25.9 shows what happens if we do not choose a proper length for each
step. Compare this figure with Figure 25.7.
data. After that, we tried to iterate the cubature formula to get enough random price
trajectories (paths) to calculate desirable payoff functions of financial derivatives.
Iterating cubature trajectories in the Hull–White model resulted in a non-recombining
tree with exponential growth. As a result, we saw that working with a non-recombining
tree would cause inaccuracy in results and would not be effective for a small number
of steps.
We would like to emphasize that, in our experience, cubature formulae work better
for small time intervals. To tackle the exponential growth in the number of nodes in a
non-recombining tree, we mention some ideas. One idea is to consider cubature
formulae of degree 7 or higher, where the number of trajectories on Wiener space is
larger and the formula becomes more accurate. If the degree is big enough, then we
might not even need to iterate the formula to get satisfactory results. This, however,
might be extremely complicated. For example, we found the cubature formula
of degree 7, which creates six paths, in Nohrouzian and Malyarenko (2019). To get
the weights and coefficients of the cubature formula, we found the solutions to the
system of Lie polynomial equations. In the case of degree 5, this was done analytically.
In the case of degree 7, the algebraic expressions for the Lie polynomials occupied 48
A4 pages and we used a numerical approach to find solutions to the system. In
future work, we would like to examine the cubature formula of degree 7 in some
interest-rate models. We will also try to find cubature formulae of higher degrees.
Another suggestion to overcome the exponential growth in the number of nodes is to
use either the recombination method proposed by Litterer and Lyons (Litterer and
Lyons 2012) or the tree-based branching algorithm (TBBA) proposed by Crisan and
Lyons (Crisan and Lyons 2002). Finally, we would like to try the following idea
as well. We may shift the initial data by a certain amount so that all data are positive
numbers. Then, using the ideas presented in section 25.2.7, we may use the relation
f_m = \sqrt{f_u f_d} in order to make a recombining tree. That is, when the number of steps
in the tree is even (odd), we calculate f_m and f_d and set f_u = f_m^2 / f_d, and when the
number of steps in the tree is odd (even), we calculate f_m and f_u and set f_d = f_m^2 / f_u.
25.6. References
Cox, J., Ross, S., Rubinstein, M. (1979). Option pricing: A simplified approach. Journal of
Financial Economics, 7(3), 229–263.
Cox, J., Ingersoll, J., Ross, S. (1985). A theory of the term structure of interest rates.
Econometrica, 53(2), 385–407.
Crisan, D. and Lyons, T. (2002). Minimal entropy approximations and optimal algorithms.
Monte Carlo Methods and Applications, 8(4), 343–355.
Glasserman, P. (2004). Monte Carlo Methods in Financial Engineering. Springer, New York.
Heath, D., Jarrow, R., Morton, A. (1990). Bond pricing and the term structure of interest
rates: A discrete time approximation. Journal of Financial and Quantitative Analysis, 25(4),
419–440.
Heath, D., Jarrow, R., Morton, A. (1992). Bond pricing and the term structure of interest rates:
A new methodology for contingent claims valuation. Econometrica, 60(1), 77–105.
Ho, T. and Lee, S. (1986). Term structure movements and pricing interest rate contingent claims.
Journal of Finance, 41(5), 1011–1029.
Hull, J. (2017). Options, Futures, and Other Derivatives, 10th edition. Pearson, London.
Hull, J. and White, A. (1993). Efficient procedures for valuing European and American
path-dependent options. Journal of Derivatives, 1(1), 21–31.
Hull, J. and White, A. (1994). Numerical procedures for implementing term structure models I:
Single-factor models. Journal of Derivatives, 2(1), 7–16.
Hull, J. and White, A. (1996). Pricing interest rate trees. Journal of Derivatives, 3(3), 26–36.
Hull, J. and White, A. (2015). Pricing interest-rate-derivative securities. Review of Financial
Studies, 3(4), 573–592.
Jamshidian, F. (1997). LIBOR and swap market models and measures. Finance and Stochastics,
1(4), 293–330.
Kijima, M. (2013). Stochastic Processes with Applications to Finance, 2nd edition. CRC Press,
Boca Raton, FL.
Litterer, C. and Lyons, T. (2012). High order recombination and an application to cubature on
Wiener space. The Annals of Applied Probability, 22(4), 1301–1327.
Lyons, T. and Victoir, N. (2002). Cubature on Wiener space. Proceedings of the Royal Society of
London. Series A. Mathematical, Physical and Engineering Sciences, 460(2041), 169–198.
Malyarenko, A., Nohrouzian, H., Silvestrov, S. (2017). An algebraic method for pricing
financial instruments on post-crisis market. In Algebraic Structures and Applications,
Silvestrov, S., Malyarenko, A., Rančić, M. (eds). Springer, Cham.
Merton, R. (1973). Theory of rational option pricing. The Bell Journal of Economics and
Management Science, 4(1), 141–183.
Miltersen, K., Sandmann, K., Sondermann, D. (1997). Closed form solutions for term structure
derivatives with log-normal interest rates. Journal of Finance, 52(1), 409–430.
26

Differences in the Structure of Infectious Morbidity of the Population

Chapter written by Vasilii OREL, Olga NOSYREVA, Tatiana BULDAKOVA, Natalya GUREVA,
Viktoria SMIRNOVA, Andrey KIM and Lubov SHARAFUTDINOVA.
care to the population (mobile medical teams, dispensing drugs and pulse oximeters
for providing medical care at home).
The data of regular statistical observation became the basis for making
operational management decisions for the organization of medical care for the
population in the context of an epidemic rise in morbidity.
26.1. Introduction
In 2020, the Russian Federation, like other countries in the world, accepted the
challenge of the spread of the new coronavirus infection, Covid-19. To the greatest
extent, the rise in incidence has affected large metropolitan areas. St. Petersburg is a
city with a population of over 5 million, with a population density of 3,800 people
per km2, a developed transport network, a high level of population migration within the
city, from small towns in neighboring regions, as well as from foreign countries. These
factors have contributed to the high rate of the spread of Covid-19 infection.
St. Petersburg includes 18 administrative districts. This statistical study of the structure
of the incidence of respiratory infections in the first and second half of 2020 was carried
out using the example of one of the administrative districts. The emergence of Covid-19
has posed challenges for healthcare professionals to quickly diagnose and provide
timely medical care to patients (Orel et al. 2020). The basis for making managerial
decisions on the task set is statistical accounting (Orel et al. 2018).
The area covers 240.3 sq. km, or 24,032.6 hectares (16.7% of the area of St.
Petersburg), and is the second largest among the districts of St. Petersburg. The
average length of the region is 21 km from south to north and 21 km from east to
west. Geographically, the district is located in the southern part of St.
Petersburg and includes five municipal regions located at a distance from each other.
Indicators of the ecological state of land, water and air are within acceptable limits.
The region has a population of 240,809 people (184,490 adults over 18 years
old; 56,319 children from 0 to 18 years old). Over the past five years, due to migration
processes, the total population increased by 27.0%, and the child population by
49.5%. High growth rates of the population in general, and of children in particular,
are associated with intensive housing construction. Nevertheless, the age structure of
the district’s population remains regressive. The share of older age groups (50 years
and older) is 31%, from 15 to 49 years old – 50%, children – 19%. The share of the
female population in the structure of the total population is 54%, and the share of
women of fertile age is 48%. The birth rate is 13.2 per 1,000 people, the mortality
rate is 11.0 per 1,000 people and the natural increase is 1.3. Taking into account the
available data on the dynamics of population growth in the district, by 2024, the
number is expected to be 282,245 people (approximately 68,700 children), and by
2030, up to 350,000 (approximately 85,000 children).
In the district, primary health care for adults and children is provided by two
polyclinics, which include 10 polyclinic departments (five for adults and five for
children). The service area is divided into 88 therapeutic and 65 pediatric areas.
Local general practitioners and pediatricians carry out the primary registration of all
cases of the population’s disease, including infectious diseases.
The peculiarity of the course of the Covid-19 pandemic sets the task of obtaining
reliable statistical data on the situation, with morbidity and mortality as a priority.
Monitoring based on an in-depth study of the course of Covid-19 provides the most
relevant, objective and detailed statistics on this disease, in order to more widely
assess the impact of infection on the population and the course of the disease
(Ministry of Health of the Russian Federation 2020a). When registering these
diseases, the recommendations of the Ministry of Health of the Russian Federation
were used (Ministry of Health of the Russian Federation 2020b).
At the district level, in order to collect daily operational information, a form was
developed and implemented for recording the incidence in the adult and child
populations of the individual most important respiratory infections: acute respiratory
viral infections (ARVI), the new coronavirus infection Covid-19 and community-
acquired pneumonia.

In this work, the total indicators are presented in numerical form with an
accuracy of one person.
Since March 2020, there has been an increase in acute respiratory viral infections
with a special course, which subsequently made it possible to differentiate Covid-19.
An important criterion for the diagnosis of a new coronavirus infection is the results
of laboratory diagnostics, namely, the detection of SARS-CoV-2 RNA using nucleic
acid amplification methods. According to the recommendations of the Ministry of
Health of the Russian Federation, laboratory tests for SARS-CoV-2 RNA are
recommended for all persons with signs of acute respiratory infections.
Patients who test positive are assigned the disease code U07.1, while patients
with a negative test result but showing signs of the disease are assigned U07.2.
It was found that the total number of people with ARVI, new coronavirus
infection Covid-19 and community-acquired pneumonia observed in pediatric and
adult polyclinics has two “waves”. The minimum number of these diseases was
registered in the 13th week (from July 20, 2020 to July 27, 2020) for Covid-19 and
ARVI, and the 17th week (from August 17, 2020 to August 23, 2020) for
community-acquired pneumonia.
It should be noted that the number of ARVI patients without signs of Covid-19
and laboratory confirmation of the SARS-CoV-2 antigen in the “second wave”
(36,879 people) was almost five times higher than in the “first wave” (7,461 people),
in total 44,340 people. In the first “wave”, the number of adult patients prevailed –
70%, and the share of children was 30%. During the “second wave”, ARVI was
predominantly registered in children – 71.7%. The total incidence of acute
respiratory viral infections in the total adult and child population in the “first wave”
was 29.4 per 1,000 people, in the “second” – 154.7 per 1,000 people. Figure 26.1
shows the dynamics of the number of ARVI patients under observation in the
district polyclinics during the period of the increase in the incidence (36 weeks of
2020, from April 20, 2020 to December 31, 2020).
364 Data Analysis and Related Applications 1
Figure 26.1. Dynamics of the number of patients with ARVI who were under
observation in the district polyclinics during the period of the increase in the incidence
(36 weeks of 2020, from April 20, 2020 to December 31, 2020). For a color version of
this figure, see www.iste.co.uk/zafeiris/data1.zip
The situation with the incidence of Covid-19 and the involvement of the adult
and child population looks similar to ARVI. In the structure of the incidence of
Covid-19 in the first “wave”, adult patients dominated (93.3%, children – 6.7%).
During the second “wave” of the rise in the incidence of Covid-19, the proportion of
children doubled and amounted to 12.9%. The total incidence of Covid-19 in the
first “wave” was recorded at 7.3 per 1,000 people, in the second – 31.4 per
1,000 people. Figure 26.2 shows the number of patients with Covid-19 under
observation in polyclinics of the district during the period of the increase in the
incidence (36 weeks of 2020, from April 20, 2020 to December 31, 2020).
Figure 26.2. Dynamics of the number of patients with Covid-19 who were monitored
in the district polyclinics during the period of the increase in the incidence (36 weeks
of 2020, from April 20, 2020 to December 31, 2020). For a color version of this figure,
see www.iste.co.uk/zafeiris/data1.zip
The main contribution to mortality from the new coronavirus infection came from community-acquired pneumonia. It is this complication of Covid-19 that demanded the greatest effort from the healthcare system: upgrading the material and technical equipment of inpatient medical institutions and emergency medical services, and optimizing the work of polyclinics.
The total incidence of acute respiratory infections (Covid-19, ARVI and community-acquired pneumonia) in the district population over the specified period was also analyzed. A total of 56,416 cases were registered, or 234.3 per 1,000 people (see Figure 26.5).
Figure 26.5. Dynamics of the number of patients with Covid-19, ARVI and
community-acquired pneumonia who were monitored in the district polyclinics during
the period of the increase in the incidence (36 weeks of 2020, from April 20, 2020 to
December 31, 2020). For a color version of this figure, see www.iste.co.uk/zafeiris/
data1.zip
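The rates quoted in this chapter follow the standard crude-incidence formula: cases divided by population, per 1,000. A minimal sketch; the district population is not stated in the chapter and is back-calculated here from the reported totals, so treat it as an assumption:

```python
def incidence_per_1000(cases: int, population: int) -> float:
    """Crude incidence rate per 1,000 population, rounded to one decimal."""
    return round(cases / population * 1000, 1)

# Population back-calculated from the reported totals
# (56,416 cases = 234.3 per 1,000) -- an assumption, not a figure from the chapter.
POPULATION = 240_786

print(incidence_per_1000(56_416, POPULATION))  # → 234.3
```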
decisions for the organization of medical care for the population, in the context of an
epidemic rise in morbidity.
To reduce the incidence, the efforts of various departments have been combined. Mass cultural, educational and sports events have been canceled. Information campaigns on the prevention of infectious diseases are being carried out in the media and in places of mass gathering. In educational institutions, supervision by medical and pedagogical staff has been strengthened to prevent contact with sick children and adults. Schedules for additional cleaning with disinfectants have been drawn up for the premises of educational institutions.
The district medical service has established cooperation with the Territorial Department of Rospotrebnadzor to monitor people arriving from countries with an unfavorable Covid-19 situation. In the district polyclinics, teams of epidemiologists, district doctors and paramedical personnel have been created to monitor people with Covid-19 and their contacts.
Biomaterial sampling points have been organized, and PCR test results are available in personal accounts on the “St. Petersburg’s Health” portal. The work of call centers for adults and children has been strengthened, as well as interaction with the city service that receives requests for house calls. The polyclinics hold a two-month supply of personal protective equipment for medical personnel. Additional road transport was procured to deliver district general practitioners and pediatricians to house calls, to bring specialist doctors (cardiologist, ENT, neurologist, surgeon, etc.) for consultations, and to collect biomaterial (smears, blood) at home. Drugs are distributed free of charge to patients receiving care for Covid-19 at home. Since December 2020, four vaccination points have been opened in the region and mass immunization of the population with the Gam-COVID-Vac vaccine (Sputnik V) has begun. These measures, based on statistical records and analysis of the incidence of the new coronavirus infection Covid-19, have significantly improved the epidemiological situation in the area and preserved the health of the population.
26.4. Conclusion
1) Statistical accounting for analyzing the dynamics of the spread of the new
coronavirus infection Covid-19 is of paramount importance and is an integral part of
the organization of health care during a pandemic.
26.5. References
Federal State Statistics Service (2017). Order of the Federal State Statistics Service of January 28, 2009, No. 12 (revised on January 20, 2017). On the approval of statistical tools for the organization by the Ministry of Health and Social Development of Russia of federal statistical observation in the field of health care.
Government of the Russian Federation (2020). Decree of the Government of the Russian
Federation of March 31, 2020, No. 373. On approval of temporary rules for recording
information in order to prevent the spread of a new coronavirus infection (COVID-19).
Ministry of Health of the Russian Federation (2013). Order of the Ministry of Health of the
Russian Federation and the Federal Service for Supervision in the Field of Consumer
Rights Protection and Human Well-being of October 10, 2013, No. 726n/740. On
optimizing the system of informing about cases of infectious and parasitic diseases.
Moscow.
Ministry of Health of the Russian Federation (2020a). Guidelines for coding and selection of
the underlying condition in morbidity statistics and the initial cause in mortality statistics
associated with COVID-19 [in Russian]. Moscow.
Ministry of Health of the Russian Federation (2020b). Letter of the Ministry of Health of Russia dated August 4, 2020, No. 13-2/I/2-4335. On the coding of coronavirus infection caused by COVID-19. Moscow.
Ministry of Health of the Russian Federation (2020c). Interim guidelines for the prevention,
diagnosis and treatment of the new coronavirus infection (COVID-19), Version 10
[in Russian]. Moscow.
Ministry of Health of the USSR (1980). Order of the Ministry of Health of the USSR of
October 4, 1980, No. 1030. On approval of forms of primary medical documentation of
health care institutions.
Orel, V.I., Bezhenar, S.I., Buldakova, T.I., Kim, A.V., Roslova, Z.A., Rubezhov, A.L.,
Orel, O.V., Gurieva, N.A., Nosyreva, O.M., Sharafutdinova, L.L. (2018). Scientific and
practical vector of problems of primary medical and social care in a metropolis. Medicine
and Healthcare Organization, 3(2), 63–67.
Orel, V.I., Gurieva, N.A., Nosyreva, O.M., Smirnova, V.I., Buldakova, T.I., Libova, E.B., Razgulyaeva, D.N., Sharafutdinova, L.L., Kulev, A.G. (2020). Modern medico-organizational features of coronavirus infection. Pediatrician, 11(6), 5–12.
27
High Speed and Secured Network Connectivity for Higher Education Institutions
During Covid-19, there was a demand for high data throughput and connectivity
from higher education institutions. This required that backbone and metro links as well
as fiber links be upgraded with more capacity, catering to these high data demands and
connectivity. In addition, there was a need for a reliable and secured network to protect
against attacks from hackers. Cyberinfrastructures needed to be put in place to secure
the network and prevent information from being accessed by unknown/unauthorized
users and outsiders. It was also observed that some network devices lost their configurations during power failures in the surrounding areas; several devices even lost routing and signaling information, requiring a technician to reconfigure the whole device manually. Moreover, throughout this process, there were several challenges in the deployment of transmission network equipment.
In this research work, a model was proposed to solve the aforementioned issues,
which was deployed on the software defined network (SDN). This network has three
layers, namely application, control and data. Typically, the open systems
interconnection (OSI) model has seven layers, but some layers have been combined
and reduced to three in this proposed model. This SDN is user-friendly, as it is
programmable to execute some of the tasks. It also saves bandwidth, as it reuses
network resources. If power (electricity) fails and the networking device reboots, it will automatically fetch the configuration stored on the SDN server and compare it with the one running on the device. If the two match, the device continues to work as usual. If the network device has lost its configurations, then
Data Analysis and Related Applications 1, First Edition. Edited by Konstantinos N. Zafeiris, Christos H. Skiadas, Yiannis Dimotikalis, Alex Karagrigoriou and Christiana Karagrigoriou-Vonta.
© ISTE Ltd 2022. Published by ISTE Ltd and John Wiley & Sons, Inc.
the SDN server will automatically reload the configurations of the device. The SDN
can also calculate the bandwidth utilization for all the links or routes connected to it.
This assists in finding where network resources are used the most. If a particular
route is congested, then the SDN will look for another alternative route where
utilization is low, in order to load balance the traffic.
27.1. Introduction
The communication network is vast and uses different types of connectivity. The
connectivity to any network can be made by means of fiber, wired or wireless
(Khalighi and Uysal 2014; Zhang and Pedersen 2016; Odeyemi et al. 2017). To
deliver this type of network connectivity, we need to first understand the consumer
requirements. In this present research work, consumers are the institutions of higher
learning. These institutions have challenges of delivering high data throughput to
students, lecturers and academic staff. These challenges are most critical in remote areas (Ercan et al. 2010; El-Seoud et al. 2017; Javidi 2017; Srivastava 2020). The Covid-19 pandemic exposed the limits of high-speed connectivity at some of these institutions, and also revealed how fragile the metro and backbone rings were when most people were working from home (Dargar and Srivastava 2019; Teras et al. 2020; Izumi et al. 2021; Maatuk et al. 2021).
As there is a demand for high data throughput and connectivity from higher
learning institutions, it is required that backbone and metro links be upgraded to
more capacity. This will cater to these high data demands and connectivity (Kim and
Feamster 2013; Megyesi et al. 2017). High capacity alone is not enough; there is a
need to have a reliable and secured network to protect against attacks from hackers.
Cyberinfrastructures need to be put in place to secure the network and prevent
information from being accessed by unknown users and outsiders (van Adrichem
et al. 2014; Shu et al. 2016; Wang et al. 2018).
It has also been noted that some network devices lose their configurations during power failures in the surrounding areas. Due to this problem, some devices also lose routing and signaling information (Guo et al. 2013; Jung and Song 2015). This requires a technician to come out and reconfigure the whole device.
To overcome these issues, in this research work, we propose a model for deploying
a software defined network (SDN), which helps to increase productivity and reduces
the cost (Nunes et al. 2014; Alsmadi and Xu 2015; Kreutz et al. 2015; Singh
and Srivastava 2018; Ali et al. 2020). It also saves the bandwidth, as it will reuse
network resources. The proposed model was deployed on the SDN. This chapter is
organized as follows. The existing work and the brief history of the open systems
interconnection (OSI) model and its layers are discussed in section 27.2. Section 27.3
presents a new SDN architecture and its benefits through three layers, namely
High Speed and Secured Network Connectivity for Higher Education Institutions 373
application, control and data. Finally, section 27.4 concludes the work and
recommends future aspects.
The OSI model is a framework that defines tasks required for any computer or
network system to communicate with one another (Bakshi 2013; Farhady et al.
2015; Marconett and Yoo 2015; Cox et al. 2017). The history of the OSI model
began in the 1970s by the International Organization for Standardization (ISO) and
the International Telegraph and Telephone Consultative Committee (CCITT). The
CCITT was later succeeded by the International Telecommunication Union (ITU).
The OSI model is also called the basic reference model with specific protocols. This
model is divided into seven layers, each with protocols that determine its functionality. Each layer interacts directly only with the layer beneath it and provides facilities to the layer above it.
Figure 27.1. The OSI model (Farhady et al. 2015; Cox et al. 2017)
The proposed model has three layers: application, control and infrastructure, as shown in Figure 27.2. The most important feature of the SDN architecture is that the data plane is separated from the control plane. This separation allows management to be centralized: the lower layer of the SDN, the data plane, forwards packets to the control plane, where decisions are made, giving the control plane full network management. This helps to reduce the cost of network devices and the processing done at the data plane. Now let us examine each layer and the role it plays in this new model.
All the issues of the physical layer such as links that are down and hardware
devices that are not reachable on the network, are reported and processed on the
control layer. Once the information is processed on this layer, it will be presented to
the application layer. This is where the information will be made readable for
network engineers and administrators. This also allows the engineers to make any
changes on the network depending on business requirements.
In the new model, the seven layers are reduced to three. In traditional networks, it is difficult to reprogram some devices when they lose their configurations due to power failures. With this new model, all the routing rules can be implemented in centralized software. This gives network engineers and administrators more control over the network and provides high network performance. In old traditional networks based on the OSI model, routers become overwhelmed by data processing and routing-table updates when the network grows rapidly. This can cause network delays, as routers need to exchange information periodically to keep up with the network status. With the SDN model, this burden is carried by the control plane, easing congestion in the routers on the data plane.
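The configuration-recovery behavior described above can be sketched as follows. This is a minimal illustration with hypothetical class and method names; the chapter does not name a specific controller API:

```python
class SdnController:
    """Keeps the authoritative copy of each device's configuration."""

    def __init__(self):
        self._configs = {}  # device_id -> configuration dict

    def store(self, device_id, config):
        self._configs[device_id] = dict(config)

    def restore(self, device_id, running_config):
        """Called by a device after a reboot: compare, and reload on mismatch."""
        saved = self._configs.get(device_id)
        if saved is None or saved == running_config:
            return running_config       # nothing stored, or configs match
        return dict(saved)              # mismatch: push the saved configuration

controller = SdnController()
controller.store("sw1", {"vlan": 10, "ospf": True})
# After a power failure the device has lost its routing settings:
print(controller.restore("sw1", {"vlan": 10}))  # the full saved configuration returns
```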
The SDN uses OpenFlow as its standard protocol. This protocol is multivendor
and managed by the Open Networking Foundation (ONF). It has evolved over the years to become the most widely used protocol in SDN applications, and it can be integrated into both software and hardware without any issues. It uses open-source code to control SDN controllers and can interact with switches and routers from any vendor. The protocol exchanges information between the control plane and the OpenFlow agents residing on the data plane (Xia et al. 2015).
The OpenFlow protocol offers convenient flow table manipulation services for a
controller to insert, delete, modify and find the flow entries. Its main features are
flow tables, which it uses to control the traffic between data plane devices. Each
flow table contains flow entries that are communicated to the controller. The
controller only handles the routing of packets and decision-making.
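The flow-table services mentioned above (insert, delete, modify and find) can be illustrated with a toy in-memory table. The class and the `send-to-controller` fallback are hypothetical illustrations, not the actual OpenFlow wire format:

```python
class FlowTable:
    """Toy flow table: match fields -> forwarding action."""

    def __init__(self):
        self._entries = {}

    def insert(self, match, action):
        self._entries[match] = action

    def modify(self, match, action):
        if match not in self._entries:
            raise KeyError(match)
        self._entries[match] = action

    def delete(self, match):
        self._entries.pop(match, None)

    def find(self, match):
        # An unmatched flow would be handed to the controller for a decision.
        return self._entries.get(match, "send-to-controller")

table = FlowTable()
table.insert(("10.0.0.1", "10.0.0.2"), "forward:port2")
print(table.find(("10.0.0.1", "10.0.0.2")))  # → forward:port2
print(table.find(("10.0.0.9", "10.0.0.2")))  # → send-to-controller
```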
This new model has efficient use of resources. It has the capabilities to distribute
the workload across the controllers. This, in turn, increases the speed and efficiency
of the network. It allows network engineers to make changes on the network
remotely. The SDN is programmable, which allows the control plane to reduce
congestion in the entire network. This new model enables the scalability of the
network with improved security features and resilience to faults.
The SDN can also calculate the bandwidth utilization for all the links or routes
connected to it. This assists in finding where network resources are used the most. If
a particular route is congested, the SDN will look for an alternative
route where utilization is low, in order to load balance the traffic. The SDN concept
is not new – it has been there, but we are extending its application to institutions of
higher learning where the demand for data is needed the most.
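The utilization-based rerouting described here can be sketched as picking the least-utilized alternative once the current route crosses a congestion threshold; the threshold value and route names below are assumptions:

```python
def pick_route(routes, current, threshold=0.8):
    """Keep `current` unless its utilization exceeds `threshold`;
    otherwise switch to the least-utilized alternative route.

    `routes` maps route name -> utilization in [0, 1].
    """
    if routes[current] <= threshold:
        return current
    alternatives = {name: u for name, u in routes.items() if name != current}
    return min(alternatives, key=alternatives.get)

links = {"metro-A": 0.95, "metro-B": 0.40, "backbone": 0.60}
print(pick_route(links, "metro-A"))  # → metro-B (congested route is bypassed)
```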
ForCES is another protocol used in SDN applications. This protocol proposes the
separation of IP control and data forwarding (Hu et al. 2014). Unlike OpenFlow, it
does not have widespread adoption due to its lack of clear language abstraction
definition and controller-switch communication rules. The main advantage of the
ForCES protocol is that it can be easily integrated into old traditional network
devices since it just adds networking/forwarding elements.
The SDN can be applied in the case where the link is down, in order to enable
the traffic to reroute on another path that is up and running. This can only be
achieved when a ring network is on the backbone, aggregation and last mile. The
SDN simplifies the traditional network and makes things easy for troubleshooting
and maintenance. It will rely more on automation and the programmability of the
network.
The SDN works very well with cloud networking and artificial intelligence
networks. It is the future of secured and reliable communication networks.
27.5. References
van Adrichem, N.L.M., van Asten, B.J., Kuipers, F.A. (2014). Fast recovery in software-
defined networks. 3rd European Workshop on Software Defined Networks (EWSDN),
Budapest, Hungary.
Ali, J., Lee, G.M., Roh, B.H., Ryu, D.K., Park, G. (2020). Software defined networks
approaches for link failure recovery: A survey. Sustainability, 12(10), 1–28.
Alsmadi, I. and Xu, D. (2015). Security of software defined networks: A survey. Computers
& Security, 53, 79–108.
Bakshi, K. (2013). Considerations for software defined network (SDN): Approaches and use cases. IEEE Aerospace Conference, Big Sky, MT, USA.
Cox, J.H., Chung, J., Donovan, S., Ivey, J., Clark, R.J., Riley, G., Owen, H.L. (2017).
Advancing software defined networks: A survey. IEEE Access, 5, 25487–25526.
Dargar, S.K. and Srivastava, V.M. (2019). Integration of ICT based methods in higher
education teaching of electronic engineering. 10th International Conference of Strategic
Research on Scientific Studies and Education (ICoSReSSE), Rome, Italy.
El-Seoud, S.A., El-Sofany, H.F., Abdelfattah, M., Mohamed, R. (2017). Big data and cloud
computing: Trends and challenges. International Journal of Interactive Mobile
Technologies, 11(2), 34–52.
Ercan, T., Rajabion, L., Sheybani, E. (2010). Effective use of cloud computing in educational
institutions. Procedia – Social and Behavioral Sciences, 2(2), 938–942.
Farhady, H., Lee, H.Y., Nakao, A. (2015). Software-defined networking: A survey. Computer
Networks, 81, 79–95.
Guo, J., Yang, J., Zhang, Y., Chen, Y. (2013). Low cost power failure protection for MLC
NAND flash storage systems with PRAM/DRAM hybrid buffer. Design, Automation &
Test in Europe Conference & Exhibition (DATE), Grenoble, France.
Hu, F., Hao, Q., Bao, K. (2014). A survey on software-defined network and OpenFlow: From
concept to implementation. IEEE Communications Surveys & Tutorials, 16(4),
2184–2189.
Izumi, T., Sukhwani, V., Surjan, A., Shaw, R. (2021). Managing and responding to
pandemics in higher educational institutions: Initial learning from COVID-19.
International Journal of Disaster Resilience in the Built Environment, 12(1), 51–66.
Javidi, G. (2017). Educational data mining and learning analytics: Overview of benefits and
challenges. International Conference on Computational Science and Computational
Intelligence (CSCI), Las Vegas, NV, USA.
Jung, S. and Song, Y.H. (2015). Data loss recovery for power failure in flash memory storage
systems. Journal of Systems Architecture, 61(1), 12–27.
Khalighi, M.A. and Uysal, M. (2014). Survey on free-space optical communication:
A communication theory perspective. IEEE Communications Surveys & Tutorials, 16(4),
2231–2258.
Kim, H. and Feamster, N. (2013). Improving network management with software defined networking. IEEE Communications Magazine, 51(2), 114–119.
Kreutz, D., Ramos, F.M.V., Verissimo, P.E., Rothenberg, C.E., Azodolmolky, S., Uhlig, S.
(2015). Software-defined networking: A comprehensive survey. Proceedings of the IEEE,
103(1), 14–76.
Maatuk, A.M., Elberkawi, E.K., Aljawarneh, S., Rashaideh, H., Alharbi, H. (2021). The
COVID-19 pandemic and e-learning: Challenges and opportunities from the perspective
of students and instructors. Journal of Computing in Higher Education, 1–18.
Marconett, D. and Yoo, S.J.B. (2015). Flow broker: A software-defined network controller
architecture for multi-domain brokering and reputation. Journal of Network and Systems
Management, 23, 328–359.
Megyesi, P., Botta, A., Aceto, G., Pescape, A., Molnar, S. (2017). Challenges and solutions for measuring available bandwidth in software defined networks. Computer Communications, 99, 48–61.
Nunes, B.A.A., Mendonca, M., Nguyen, X.N., Obraczka, K., Turletti, T. (2014). A survey of
software-defined networking: Past, present, and future of programmable networks. IEEE
Communications Surveys & Tutorials, 16(3), 1617–1634.
Odeyemi, K.O., Owolawi, P.A., Srivastava, V.M. (2017). Performance analysis of free-space
optical system with spatial modulation and diversity combiners over the Gamma-Gamma
atmospheric turbulence. Optics Communications, 382(1), 205–211.
Shu, Z., Wan, J., Li, D.I., Lin, J., Vasilakos, A.V., Imran, M. (2016). Security in software
defined network: Threats and counter measures. Mobile Networks and Applications, 21,
764–776.
Singh, M. and Srivastava, V.M. (2018). An analysis of key challenges for adopting the cloud
computing in Indian education sector. Advances in Computing and Data Sciences, Singh, M.,
Gupta, P.K., Tyagi, V., Flusser, J., Ören, T. (eds). Springer, Singapore.
Srivastava, V.M. (2020). Learning for future education, research, and development through
vision and supervision. International African Conference on Current Studies of Science,
Technology & Social Sciences (African Summit), South Africa.
Teras, M., Suoranta, J., Teras, H., Curcher, M. (2020). Post-Covid-19 education and education
technology “solutionism”: A seller’s market. Postdigital Science and Education, 2(3),
863–878.
Wang, L., Yao, L., Xu, Z., Wu, G., Obaidat, M.S. (2018). CFR: A cooperative link failure
recovery scheme in software defined networks. International Journal of Communication
Systems, 31(10), e3560.
Xia, W., Wen, Y., Foh, C.H., Niyato, D., Xie, H. (2015). Survey on software defined
networking. IEEE Communications Surveys & Tutorials, 17(1), 27–43.
Zhang, S. and Pedersen, G.F. (2016). Mutual coupling reduction for UWB MIMO antennas with a wideband neutralization line. IEEE Antennas and Wireless Propagation Letters, 15, 166–169.
28
Reliability of a Double Redundant System Under the Full Repair Scenario
In Rykov et al. (2020b), modified Laplace–Stieltjes transforms were introduced for two probability distributions A(x) and B(x). In terms of these transforms, the
analytical expressions of the main reliability characteristics of a double redundant
system with arbitrarily distributed life- and repair times of system components under
the partial repair scenario have been found. In this chapter, we extend the investigation
of the same model under the full repair scenario. The analytical expressions for the
time-dependent and steady-state system probabilities as well as the system reliability
function are presented in this chapter. The proposed approach and obtained analytical
results also allow us to investigate the sensitivity of system reliability characteristics
to the shape of system component life- and repair time distributions.
28.1. Introduction
There are many papers devoted to the study of redundant systems. Calculating the reliability characteristics of such systems is not a trivial task, even for a simple double redundant structure, once the component life- and repair times are arbitrarily distributed.
In contrast to previous works, Utkin (2003) proposed imprecise reliability models of cold standby systems. The investigation of these models assumed that arbitrary probability distributions of the component time to failure are possible, restricted only by the available information in the form of lower and upper probabilities of some events. A system with arbitrary life- and repair time distributions remains an open problem in reliability theory.
However, in the recent paper by Rykov et al. (2020b), the authors have obtained
analytical expressions of the main characteristics of the reliability of a double
redundant system with partial repair and arbitrarily distributed both life- and repair
times. This result was reached using the theory of decomposable semi-regenerative
processes. Some of the main milestones of this approach are briefly presented below.
In 1955, Smith (1955) proposed the regeneration idea, which was the beginning
of the development of a new direction in the random processes theory. Smith’s theory
helps researchers to solve many applied problems by reducing their complexity. Thus,
this theory led to many generalizations and modifications of regenerative process
theory. Sometimes, the behavior of the process in a separate regeneration period,
as well as the corresponding probabilities, can be complex enough for analytical
calculations, so it is necessary to investigate the process in more detail. Combining Smith’s theory with the theory of semi-Markov processes proposed by Cinlar (1969) led to the development of semi-regenerative processes. This generalization was continued in papers by Klimov (1966), Jacod (1971), Rykov and Yastrebenetsky (1971) and Korolyuk and Turbin (1976).
Reliability of a Double Redundant System Under the Full Repair Scenario 381
Investigations above dealt with the partial repair of the system. This means that
after a whole system failure, the repair of the failed component is prolonged, and
after its end, the system resumes operation and the repair of the remaining component
begins. However, there is a second way to restore the whole system. Sometimes it
is more useful to consider a full repair instead of a partial one to maintain a high
level of reliability, as well as to save energy and reduce economic costs. Full system
repair means the restoration of all failed components, after which the system becomes
operational and works like a new one. Thus, the problem of studying a system under
a full repair scenario arises. Moreover, in such a system, the time of full repair can be
different from the partial one.
The purpose of this chapter is to apply the theory of decomposable semi-regenerative processes (DSRP) to the study of the reliability characteristics of a double redundant system under the full repair scenario, with arbitrarily distributed life- and repair times of the system components, as well as repair of the whole system.
This chapter is organized as follows. In the next section, the problem statement
and notations are given. Section 28.3 introduces the process for system behavior
described as regenerative and deals with the calculation of the reliability function
and the system’s mean lifetime. In section 28.4, we consider the process behavior
in a separate regeneration period and find time-dependent system state probabilities
(t.d.s.p.s). The final section aims to present the results of steady-state probabilities
(s.s.p.s). This chapter ends with a conclusion and some future research directions.
At the beginning of the work, we assume that both components are in working
order. With the failure of the main component, a reserved one starts operating. If the
repair of a failed component ends before the failure of the other one, the first one
returns to reserve as new. Otherwise, the failure of the reserved component before the
failed main component has been repaired results in the failure of the entire system and
the beginning of its full repair.
Figure 28.1. Two-component cold-standby repairable system with one repair facility
Assume that both life- and repair times of the system’s components are arbitrarily
distributed. Denote by Ai (i = 1, 2, ...), lifetimes of the system elements, by Bi (i =
1, 2, ...) its partial repair times (repair of an element), and by Ci (i = 1, 2, ...)
full repair of the whole system. Suppose that all these random variables (r.v.s) are
mutually independent and identically (for each type of r.v.s) distributed (i.i.d.). Thus,
the corresponding cumulative distribution functions (c.d.f.) are A(x) = P{Ai ≤ x},
B(x) = P{Bi ≤ x} and C(x) = P{Ci ≤ x} (i = 1, 2, . . . ). Denote also by A, B
and C the r.v.s with the same c.d.f.s as Ai , Bi and Ci , respectively. Suppose that the
instantaneous failures and repairs are impossible and their mean times are finite:

a = ∫₀^∞ (1 − A(x)) dx < ∞, b = ∫₀^∞ (1 − B(x)) dx < ∞, c = ∫₀^∞ (1 − C(x)) dx < ∞.
The system behavior is described by a random process J = {J(t), t ≥ 0} with values in E, where E = {0, 1, 2} is the set of system states and j ∈ E is the number of failed components. Figure 28.2 illustrates a transition graph of the considered system.
Figure 28.2. Transition graph of a double redundant system under full repair
ã1−B (s) = ã(s) − ãB (s), b̃1−A (s) = b̃(s) − b̃A (s).
– The probabilities P{B ≤ A} and P{B > A} are associated with these
transforms through the following relations:
ã_B(0) = ∫₀^∞ B(x) dA(x) = P{B ≤ A} ≡ p,
b̃_A(0) = ∫₀^∞ A(x) dB(x) = P{B > A} ≡ q = 1 − p.
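These relations are easy to sanity-check numerically. The sketch below is ours, with exponential distributions chosen purely as an example: it evaluates ã_B(0) = ∫ B(x) dA(x) on a grid and can be compared with the closed form P{B ≤ A} = μ/(λ + μ) that holds in the exponential case.

```python
import math

def p_repair_before_failure(lam, mu, n=200_000, x_max=60.0):
    """Midpoint-rule evaluation of a_B(0) = ∫ B(x) dA(x) = P{B <= A},
    for A ~ Exp(lam) (lifetime) and B ~ Exp(mu) (repair time)."""
    dx = x_max / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * dx
        B = 1.0 - math.exp(-mu * x)          # c.d.f. of the repair time B
        dA = lam * math.exp(-lam * x) * dx   # lifetime density times dx
        total += B * dA
    return total
```

Since p + q = 1, the same routine also yields q = P{B > A} = 1 − p.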
This chapter deals with the following reliability characteristics of the considered
system:
– the reliability function R(t) = 1 − F(t), where F is the time to the first system
failure and F(t) is its c.d.f.;
– the time-dependent state probabilities (t.d.s.p.s) π_j(t) = P{J(t) = j} (j ∈ E);
– the steady-state probabilities (s.s.p.s) π_j = lim_{t→∞} π_j(t) (j ∈ E).
Process J is a regenerative one (see Figure 28.3), and its regeneration times
are the moments when the process returns to state 0 after a full system failure
and repair. Here, Gi (i = 1, 2, . . . ) is a sequence of i.i.d. r.v.s of the time
intervals between two consecutive returns of the system to state 0 after full failure,
which represents the lengths of regenerative periods. Denote G(t) = P{Gi ≤ t}
the corresponding c.d.f. of such r.v.s. Define also the lifetime of the system as W ,
and the time to the first system failure as F (see Figure 28.3) and their c.d.f.s by
W (t) = P{W ≤ t} and F (t) = P{F ≤ t}. The following lemma holds for the
LSTs of these distributions.
LEMMA 28.1.– The LSTs w̃(s), f̃(s) and g̃(s) of the corresponding distributions
W(t), F(t) and G(t) are of the form
w̃(s) = (ã(s) − ã_B(s)) / (1 − ã_B(s)), f̃(s) = ã(s)w̃(s), g̃(s) = f̃(s)c̃(s). [28.1]
PROOF.– The lifetime of the system W is the time between two successive failures
of system components. After a failure in state 0, the system goes to state 1, where
one of two events can occur: either a component is repaired within time B, with a
subsequent transition back to state 0, or the second component fails within time A.
Hence, from Figure 28.3, the time W satisfies the following stochastic equation:
W = A + W, if B < A,
W = A, if B ≥ A. [28.2]
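Equation [28.2] translates directly into a sampling recursion. The sketch below is our illustration (exponential samplers are an arbitrary choice, not part of the chapter); the sample mean of W can be checked against E[W] = a/q, which follows from Corollary 28.1 since f = a + E[W].

```python
import random

def sample_W(rng, lam, mu):
    """Draw the system lifetime W following the stochastic equation [28.2],
    with A ~ Exp(lam) and B ~ Exp(mu)."""
    w = 0.0
    while True:
        a = rng.expovariate(lam)   # next component lifetime A
        b = rng.expovariate(mu)    # concurrent repair time B
        w += a
        if b >= a:                 # repair unfinished before the failure: W ends
            return w
        # otherwise B < A, so a fresh cycle starts: W = A + W'
```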
Figure 28.3. A trajectory of the regenerative process J(t): within a regeneration
period G, the time F to the system failure is followed by the full repair time C, and
the embedded regeneration points S₀, S₁^{(1)}, S₂^{(1)}, S₁, S₂ are marked on the time axis
F = A + W, G = F + C. [28.3]
Applying the LST w̃(s) = E[e^{−sW}] to [28.2], we can obtain
w̃(s) = E[e^{−sW}] = ∫₀^∞ e^{−st} dW(t)
= ∫₀^∞ e^{−sx} [w̃(s)B(x) + (1 − B(x))] dA(x)
= ã_B(s)w̃(s) + ã(s) − ã_B(s),
which gives the first formula of [28.1]; the expressions for f̃(s) and g̃(s) then follow
from [28.3].
COROLLARY 28.1.– The mean time to the first failure f ≡ E[F] and the mean length
of the regeneration period g ≡ E[G] have the following form:
f = a + a/q, g = a + c + a/q.
According to renewal theory, the t.d.s.p.s π_j(t) of the process J at any time t
can be represented in terms of its distribution in a separate regeneration period G,
π_j^{(1)}(t) = P{J(t) = j, t < G} (j = 0, 1, 2),
as
π_j(t) = π_j^{(1)}(t) + ∫₀^t π_j^{(1)}(t − u) dH(u). [28.4]
As follows from Figure 28.3 and as was mentioned in Lemma 28.1, the r.v. G
consists of two time intervals F and C, G = F + C. Therefore, the distribution
π_j^{(1)}(t) of the process J in a separate main regeneration period G can be divided into
two distributions as follows:
π_j^{(1)}(t) = (1 − δ_{j2}) π_j^{(F)}(t) 1_{{t<F}} + δ_{j2} π_j^{(C)}(t) 1_{{F<t<G}} (j = 0, 1, 2). [28.7]
Following this representation, the probability π₂^{(1)}(t) can be calculated very easily
and is given in the following lemma.
LEMMA 28.2.– The LT π̃₂^{(1)}(s) of the t.d.s.p. π₂^{(1)}(t) in the main regeneration period
is given by
π̃₂^{(1)}(s) = f̃(s)(1 − c̃(s))/s.
PROOF.– From [28.7], it follows that the event {J(t) = 2, t < G} occurs if {F ≤
t < F + C}. Thus, it holds that
π₂^{(1)}(t) = π₂^{(C)}(t) = P{F ≤ t < F + C} = ∫₀^t dF(u)(1 − C(t − u)).
Passing to the LT, we get
π̃₂^{(1)}(s) = ∫₀^∞ e^{−su} dF(u) ∫_u^∞ e^{−s(t−u)} (1 − C(t − u)) dt
= f̃(s) ∫₀^∞ e^{−sv} (1 − C(v)) dv = f̃(s)(1 − c̃(s))/s,
which ends the proof.
Taking into account (see Lemma 28.1) that F = A + W , the process J behavior
in a separate period F can be divided into its behavior in intervals A and W . Thus, the
process t.d.s.p.s in this period can be expressed as
π₀^{(F)}(t) ≡ P{J(t) = 0, t < F} = P{t < A} + ∫₀^t dA(u) π₀^{(W)}(t − u),
π₁^{(F)}(t) ≡ P{J(t) = 1, t < F} = ∫₀^t dA(u) π₁^{(W)}(t − u),
where π_j^{(W)}(t) = P{J(t) = j, t < W} (j = 0, 1) are t.d.s.p.s in a separate period W.
In terms of the LTs π̃_j^{(F)}(s) = ∫₀^∞ e^{−st} π_j^{(F)}(t) dt and
π̃_j^{(W)}(s) = ∫₀^∞ e^{−st} π_j^{(W)}(t) dt, we get
π̃₀^{(1)}(s) ≡ π̃₀^{(F)}(s) = (1 − ã(s))/s + ã(s) π̃₀^{(W)}(s),
π̃₁^{(1)}(s) ≡ π̃₁^{(F)}(s) = ã(s) π̃₁^{(W)}(s). [28.9]
To calculate the probabilities π_j^{(W)}(t) (j = 0, 1), we use the theory of DSRP,
briefly described in the introduction. According to this theory, the process distribution
π_j^{(W)}(t) (j = 0, 1) in a separate regeneration period W can be represented in terms
of its distribution
π_j^{(2)}(t) = P{J(t) = j, t < G^{(1)}} (j = 0, 1)
in the embedded regeneration period G^{(1)} with c.d.f. G^{(1)}(t) = P{G^{(1)} ≤ t} and
embedded renewal function H^{(W)}(t), analogously to equation [28.4], as follows:
π_j^{(W)}(t) = π_j^{(2)}(t) + ∫₀^t dH^{(W)}(u) π_j^{(2)}(t − u) (j = 0, 1), [28.10]
H^{(W)}(t) + W(t) = G^{(1)}(t) + ∫₀^t dH^{(W)}(u) G^{(1)}(t − u). [28.11]
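Renewal-type equations such as [28.10] and [28.11] are typically solved numerically by discretizing the convolution. As a generic illustration (ours, not from the chapter), the sketch below computes the ordinary renewal function satisfying H(t) = G(t) + ∫₀^t H(t − u) dG(u) on a uniform grid; for G exponential with rate λ the exact answer H(t) = λt is available as a check.

```python
import math

def renewal_function(G, t_max, n):
    """Solve H(t) = G(t) + ∫_0^t H(t - u) dG(u) on a uniform grid of n steps."""
    dt = t_max / n
    dG = [G((k + 1) * dt) - G(k * dt) for k in range(n)]  # increments of G per bin
    H = [0.0] * (n + 1)
    for k in range(1, n + 1):
        acc = G(k * dt)
        for j in range(k):              # u lies in bin j, so t - u ≈ (k - 1 - j) dt
            acc += H[k - 1 - j] * dG[j]
        H[k] = acc
    return H
```

The same discretization pattern applies to [28.10] and [28.11] once dH^{(W)} is tabulated on the grid.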
(1)
In the considered case, as embedded regeneration times Sk , we use a random
number ν = min{n : An < Bn } of the following time moments:
(1) (1) (1)
S1 = A1 1{A1 >B1 } , S2 = S1 + A2 1{A1 >B1 ,A2 >B2 } , ... ,
until the event {An ≤ Bn } does not occur for the first time. This means that these
time points belong to the interval W , which is defined in equation [28.2], and the time
(1)
intervals Gi (i = 1, ν) between embedded regeneration points have the distribution
(1)
A(t), G (t) = A(t). Based on these arguments, we get the following statement.
LEMMA 28.3.– The LT π̃_j^{(W)}(s) (j = 0, 1) of the process t.d.s.p.s in a separate
embedded regeneration period W satisfies the relation
π̃_j^{(W)}(s) = π̃_j^{(2)}(s) / (1 − ã_B(s)). [28.12]
∞
P ROOF.– In terms of LT h̃(W ) (s) = e−st dH (W ) (t) and taking into account that
0
g̃ (1) (s) = ã(s) from [28.11], it follows
h̃(W ) (s) + w̃(s) = g̃ (1) (s) + h̃(W ) (s)g̃ (1) (s),
ã(s) − w̃(s)
h̃(W ) (s) = .
1 − ã(s)
For the LT π̃_j^{(2)}(s) of the t.d.s.p.s in a separate regeneration period G^{(1)} of the
second level, taking into account the first equality of [28.1], it follows from [28.10] that
π̃_j^{(W)}(s) = (1 + h̃^{(W)}(s)) π̃_j^{(2)}(s)
= ((1 − w̃(s)) / (1 − ã(s))) π̃_j^{(2)}(s) = π̃_j^{(2)}(s) / (1 − ã_B(s)),
which ends the proof.
The next step consists of the calculation of the process t.d.s.p.s π_j^{(2)}(t) (j = 0, 1)
in a separate regeneration period of the second level.
LEMMA 28.4.– The LTs of the t.d.s.p.s in a separate regeneration period of the second
level are
π̃₀^{(2)}(s) = (1/s)[b̃(s) − (ã_B(s) + b̃_A(s))],
π̃₁^{(2)}(s) = (1/s)[1 − (ã(s) + b̃(s)) + ã_B(s) + b̃_A(s)]. [28.13]
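Formulas [28.13] can be verified numerically for concrete distributions. The sketch below is our check, using exponential A and B purely as an example: it compares a direct Riemann integral of π₀^{(2)}(t) = B(t)(1 − A(t)) with the closed-form right-hand side of the first formula.

```python
import math

def lt_pi0_numeric(s, lam, mu, n=200_000, t_max=60.0):
    """Direct LT: ∫_0^∞ e^{-st} B(t)(1 - A(t)) dt for A ~ Exp(lam), B ~ Exp(mu)."""
    dt = t_max / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * dt
        total += math.exp(-s * t) * (1.0 - math.exp(-mu * t)) * math.exp(-lam * t) * dt
    return total

def lt_pi0_formula(s, lam, mu):
    """First formula of [28.13]: (1/s)[b(s) - (a_B(s) + b_A(s))], exponential case."""
    b  = mu / (s + mu)                            # b~(s) = E e^{-sB}
    aB = lam / (s + lam) - lam / (s + lam + mu)   # a~_B(s) = ∫ e^{-sx} B(x) dA(x)
    bA = mu / (s + mu) - mu / (s + mu + lam)      # b~_A(s) = ∫ e^{-sx} A(x) dB(x)
    return (b - (aB + bA)) / s
```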
PROOF.– From Figure 28.3, the probabilities π_j^{(2)}(t) (j = 0, 1) can be obtained in the
following way.
The event {J(t) = 0, t < G^{(1)}} occurs if {B < t < A}, i.e. if the repair of
the component being repaired ends before time point t, while the other component has
not yet failed. Due to the independence of the r.v.s A and B, we get
π₀^{(2)}(t) = P{B < t < A} = B(t)(1 − A(t)).
The event {J(t) = 1, t < G^{(1)}} occurs if {t < B < A} or {t < A < B}, i.e.
being in state 1 at time point t, either one component is still under repair while the
second has not yet failed, or the second component failed before the repair of the
other one was completed. Since these events are mutually exclusive, it follows that
π₁^{(2)}(t) = P{t < B < A} + P{t < A < B}
= ∫_t^∞ (1 − A(u)) dB(u) + ∫_t^∞ (1 − B(u)) dA(u).
Passing to the LTs, we find
π̃₀^{(2)}(s) = ∫₀^∞ e^{−st} B(t)(1 − A(t)) dt = (1/s)[b̃(s) − (ã_B(s) + b̃_A(s))],
π̃₁^{(2)}(s) = ∫₀^∞ e^{−st} [∫_t^∞ (1 − A(u)) dB(u) + ∫_t^∞ (1 − B(u)) dA(u)] dt
= ∫₀^∞ (∫₀^u e^{−st} dt) [(1 − A(u)) dB(u) + (1 − B(u)) dA(u)]
= (1/s) ∫₀^∞ (1 − e^{−su}) [(1 − A(u)) dB(u) + (1 − B(u)) dA(u)]
= (1/s)[1 − (ã(s) + b̃(s)) + ã_B(s) + b̃_A(s)],
which ends the proof.
π̃₁(s) = (1 + h̃(s)) π̃₁^{(1)}(s)
= [(1 − ã_B(s)) / (1 − ã(s) + (ã(s) − ã_B(s))(1 − ã(s)c̃(s)))] · ã(s) π̃₁^{(W)}(s)
= [(1 − ã_B(s)) / (1 − ã(s) + (ã(s) − ã_B(s))(1 − ã(s)c̃(s)))] · (ã(s)/(1 − ã_B(s))) π̃₁^{(2)}(s)
= (1/s) · ã(s)[1 − (ã(s) + b̃(s)) + ã_B(s) + b̃_A(s)] / [1 − ã(s) + (ã(s) − ã_B(s))(1 − ã(s)c̃(s))];
π̃₂(s) = (1 + h̃(s)) π̃₂^{(1)}(s)
= [(1 − ã_B(s)) / (1 − ã(s) + (ã(s) − ã_B(s))(1 − ã(s)c̃(s)))] π̃₂^{(1)}(s)
= (1/s) · ã(s)(1 − c̃(s))(ã(s) − ã_B(s)) / [1 − ã(s) + (ã(s) − ã_B(s))(1 − ã(s)c̃(s))].
Substituting these expressions into [28.15] and taking into account expressions [28.14]
from Theorem 28.2, we can obtain the s.s.p.s, which ends the proof of the theorem.
REMARK 28.1.– Consider a Markov model with exponential distributions of all the
r.v.s of life- and repair times.
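The remark's explicit formulas are not reproduced here, but the exponential case allows a quick consistency check (our own, not from the chapter): by regeneration-reward reasoning, the steady-state probability of full repair equals c/g with g from Corollary 28.1, and this must coincide with the stationary probability of state 2 of the corresponding Markov chain.

```python
def pi2_regenerative(lam, mu, nu):
    """pi_2 = c/g, with g = a + c + a/q from Corollary 28.1 (exponential case:
    lifetimes Exp(lam), partial repairs Exp(mu), full repairs Exp(nu))."""
    a, c = 1.0 / lam, 1.0 / nu
    q = lam / (lam + mu)          # P{B > A} for A ~ Exp(lam), B ~ Exp(mu)
    g = a + c + a / q             # mean regeneration period
    return c / g

def pi2_markov(lam, mu, nu):
    """Stationary probability of state 2 from the CTMC balance equations."""
    p0 = 1.0
    p1 = p0 * lam / (lam + mu)    # flow 0 -> 1 balanced by the flow out of state 1
    p2 = p1 * lam / nu            # flow 1 -> 2 balanced by the full repair 2 -> 0
    return p2 / (p0 + p1 + p2)
```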
28.6. Conclusion
In this chapter, a cold double redundant system with a single repair facility and
with arbitrarily distributed life- and repair times of the system's components under
the full repair scenario is considered. For the system modeling, the theory
of DSRP has been used. The reliability function, time-dependent and steady-state
probabilities of this process have been calculated. Because the process describes
the system behavior, the obtained probabilities can be used for further calculation
of system reliability indicators. These results can be used for further system study,
for example, for sensitivity analysis of system components to the shape of its time
distributions.
28.7. References
The economic crisis occurring in Europe since 2008 has caused major changes to
people’s lives. Past studies have found that mental health disorders rise during
periods of economic recession for both genders in Europe, while others have
suggested that males are more vulnerable than females. The target of this
study is to assess the depression imprint for a large sample of Europeans after the
2008 crisis. The sample studied in the analysis comes from the database of SHARE
(Survey of Health, Ageing and Retirement in Europe), a multidisciplinary,
longitudinal and cross-national database including material regarding health,
socioeconomic and demographic information of individuals aged 50 or higher,
resident in several European countries. The selection of respondents included those
participating both in wave 2, carried out in 2006–2007 and wave 6, completed in
2015, covering cross-national material in two time periods, just before and after the
economic recession. For the purposes of the analysis, multinomial logistic regression
models were applied for the total sample and separately by gender, using SPSS 20.
Special attention is given to the concurrent factors being associated with the
depression burden in older ages, covering different domains of life, before and after
economic recession. Findings indicate that health predictors including mobility
limitations, instrumental activities of daily living and long-term illnesses had
increased after 2015 for the total population of individuals, indicating worse health
levels. Further, cognitive function had declined as well. Concerning factors leading
to decreasing depression levels, the highest contribution is due to the reduction of
limitations in instrumental activities of daily living and in mobility.
Data Analysis and Related Applications 1,
First Edition. Edited by Konstantinos N. Zafeiris, Christos H. Skiadas, Yiannis Dimotikalis, Alex
Karagrigoriou and Christiana Karagrigoriou-Vonta.
© ISTE Ltd 2022. Published by ISTE Ltd and John Wiley & Sons, Inc.
29.1. Introduction
The economic crisis that broke out in Europe in 2007 has caused major changes
to people’s lives. It is associated with a reduction in economic growth, an increase in
unemployment and the deterioration of living conditions through the rise of poverty
(World Health Organization 2011). It has been suggested that poverty is a
socioeconomic risk factor, increasing chances of mental health disorders, including
depression and suicide attempts (Fryers et al. 2005; Frasquilho et al. 2016). Analysts
claim that between 2006 and 2012, a significant increase in mean depressive
symptoms has been observed, more likely due to job loss or to a major illness
(Pruchno et al. 2017). Further, research indicates that from January 2008 to
December 2015, general mental health among European populations was aggravated
and suicides increased (Parmar et al. 2016), although there were differences between
countries and population subgroups. Consequently, the well-being of individuals,
their families and of society as a whole has been undermined.
Past studies have found that mental health disorders have risen during periods of
economic downturn for both genders in Europe (Frasquilho et al. 2016) while others
have supported that these are higher among males, as they are considered more
vulnerable during economic recessions compared to females (Gunnell et al. 2015;
Bacigalupe et al. 2016; Gili et al. 2016; Margerison-Zilko et al. 2016). Nevertheless,
the analysis based on country of residence often presents contradictory results. For
example, a study in Portugal revealed that during recessions including the most
recent one (2008–2015), women reported higher levels of distress compared to men,
mainly due to factors affecting mental health, such as income, employment and
social status (Frasquilho et al. 2017). Another study in England reported that the
likelihood of mental health problems in the Great Recession increased more among
females and less educated individuals (Jofre-Bonet et al. 2018). Glonti et al. (2015)
showed that women’s mental health was more vulnerable during economic
downturns, mainly due to a reduction in income levels and changes in employment
status. Other associations of mental health with age, educational attainment and
Predicting Changes in Depression Levels 397
marital status seem to be less significant, although higher educational levels led to
healthier behaviors (Glonti et al. 2015). Contrary to the above-mentioned results
regarding Europe, both men and women in the United States reported lower odds of
depression during and after recession and better mental health during the recession
(Dagher et al. 2015).
The main aim of this study is to assess the effects of changes in health and other
circumstances in the period following the recession of 2008 on depression, for a
sample of Europeans aged 50 or older. Special attention is given to concurrent
factors associated with depression in older ages, covering different domains of life.
More specifically, the main questions of the present analysis are: (a) which factors'
transitions are most relevant to changes in depression levels? and (b) what are the
differences between genders regarding these changes? For the first question,
predictors have been considered in a holistic manner in order to assess their relative
effect and thus, their contribution to the improvement or deterioration of depression
levels. Results will inform authorities regarding the vulnerability of older persons to
specific life events. For the second question, the interest shifts to comparisons
between sexes. Findings may shape social policies related to depression in later life
and help individuals improve their quality of life.
29.2.1. Sample
The sample studied in the analysis comes from the SHARE study, a
multidisciplinary, longitudinal and cross-national database, including material
regarding health, socioeconomic and demographic information of individuals aged
50 or higher, resident in European countries (Börsch-Supan et al. 2013). The
selection of the sample involved respondents who participated both in wave 2,
carried out in 2006–2007, and wave 6, completed in 2015, covering cross-national
material in different time periods. The initial sample of wave 2 included 31,009
respondents; of these, 16,106 persons were excluded from the analysis as 6,332 of
them had died (39.3%), whereas 9,774 (60.6%) had not taken part in wave 6. In
total, 14,903 individuals were included in the analysis: 6,384
males and 8,519 females, originating from the following European countries: Austria,
Germany, Sweden, Spain, Italy, France, Denmark, Greece, Switzerland, Belgium,
Czech Republic and Poland.
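The sample bookkeeping above can be verified with a few lines of arithmetic (our own check of the figures quoted in this section):

```python
# figures quoted in section 29.2.1
initial_wave2 = 31_009
died, no_wave6 = 6_332, 9_774
excluded = died + no_wave6            # non-respondents at wave 6
analysed = initial_wave2 - excluded   # respondents used in the analysis
males, females = 6_384, 8_519

print(excluded, analysed, analysed == males + females)
# shares of the excluded group; the chapter reports these as 39.3% and 60.6%
print(round(100 * died / excluded, 1), round(100 * no_wave6 / excluded, 1))
```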
29.2.2. Measures
All the above-mentioned variables recorded at wave 2 have been included in the
models to control for baseline characteristics. Further, to assess the impact of
transitions in health, SES and marital status between waves 2 and 6 on depression,
variables reflecting changes in long-term illness, mobility limitations, instrumental
activities of daily living, orientation in time, life satisfaction, financial hardship and
marital status have also been included in the analysis. These variables have three
categories, indicating improvement, worsening or no change; the latter category has
been selected as a reference category for all variables defined this way.
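In code, such three-category change variables can be derived from the two waves along the following lines (a hypothetical sketch; the function and variable names and the codings are ours, not SHARE's):

```python
def change_category(wave2_value, wave6_value, higher_is_worse=True):
    """Recode a wave-2/wave-6 pair into 'improvement', 'worsening' or 'no change'."""
    if wave6_value == wave2_value:
        return "no change"        # reference category in the regression models
    got_worse = (wave6_value > wave2_value) == higher_is_worse
    return "worsening" if got_worse else "improvement"

# e.g. a count of mobility limitations rising between waves counts as worsening,
# while a life-satisfaction score (higher is better) rising counts as improvement:
# change_category(1, 3)  and  change_category(6, 8, higher_is_worse=False)
```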
29.3. Results
Table 29.1 shows the percentage distribution for the factors included in the
analysis regarding the samples of respondents and non-respondents separately,
focusing on their differences. The population of non-respondents is somewhat older
than that of respondents (mean age 66.86 years compared to 62.13 years for
respondents), whereas educational qualifications for both groups are similar (mean
values of education equal to 10.06 and 10.71 years). Further, the percentage of males
is higher among non-respondents. Regarding depression, individuals who dropped out
of the study at wave 6 were more vulnerable to depression (percentages of the
disorder are 28.40% for non-respondents and 22.00% for respondents).
(Columns: total sample, respondents at wave 6, non-respondents)
Marital status (%)
Married, living with spouse 70.40 74.60 66.60
Registered partnership 1.30 1.40 1.30
Married, not living with spouse 1.30 1.30 1.40
Never married 5.00 4.90 5.10
Divorced 6.60 6.60 6.60
Widowed 15.20 11.20 19.00
Control variables
Country (N)
Austria 1,200 526 674
Germany 2,628 903 1,725
Sweden 2,796 1,432 1,364
Spain 2,427 1,241 1,186
Italy 2,986 1,655 1,331
France 2,989 1,129 1,860
Denmark 2,630 1,419 1,211
Greece 3,412 2,011 1,401
Switzerland 1,498 803 695
Belgium 3,227 1,659 1,568
Czech Republic 2,750 956 1,794
Poland 2,466 1,169 1,297
Depression levels (based on EURO-D) (%)
No 74.70 78.00 71.60
Yes 25.30 22.00 28.40
Total sample (N) 31,009 14,903 16,106
Mean*, median**.
Table 29.1. Distribution of the factors included in the analysis for the samples
of respondents and non-respondents
Table 29.2 shows odds ratios and confidence intervals based on logistic
regression models comparing non-respondents at wave 6 (participating only at
wave 2) to respondents (i.e. participating both waves 2 and 6). The odds ratios are
adjusted for country of residence. Findings regarding age and education are
significant, whereas the opposite holds for long-term illness; the remaining
factors also indicate differences in the characteristics of these groups.
For instance, non-respondents include more males and are older compared to
respondents, they include lower proportions reporting a high degree of life
satisfaction (Odds Ratio (OR) = 0.956), lower proportions of married, living with
spouse (OR = 0.892) and of persons dealing with great difficulty with economic
hardship (OR = 0.910). Further, fewer of them report good orientation in time
(OR = 0.754) while they exhibit a slightly higher likelihood of having depression
(OR = 1.069) and of suffering from mobility limitations and limitations in
instrumental activities of daily living (ORs 1.201 and 1.423, respectively).
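For intuition, an odds ratio can also be computed directly from two prevalence figures; the sketch below (ours) does this for the unadjusted depression prevalences of 28.40% and 22.00% shown in Table 29.1. Note that the OR of 1.069 reported above is adjusted for country of residence, so it differs from this raw contrast.

```python
def odds_ratio(p1, p2):
    """Unadjusted odds ratio comparing two prevalence proportions p1 and p2."""
    return (p1 / (1.0 - p1)) / (p2 / (1.0 - p2))

# unadjusted contrast of non-respondents (28.40%) vs. respondents (22.00%)
raw_or = odds_ratio(0.284, 0.220)
```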
Table 29.3 shows descriptive results concerning the total sample of the
respondents and of males and females separately. Because the
measurements cover different time periods, i.e. before and after the economic
downturn, we consider three types of possible transitions for each predictor: an
increase, a decline and no change. Regarding the overall population, all health
factors included in the analysis seem to indicate worse post-crisis health levels.
Moreover, a decline in cognitive function is notable based on the index measuring
orientation in time. Nevertheless, it is obvious that a high percentage of individuals
did not experience any change in the above-mentioned factors. By contrast, factors
relating to a sense of life satisfaction and financial difficulties seem to have
improved. Indeed, a significant portion of the population reported increasing levels
in life satisfaction (37.90%), indicating a more optimistic perspective, as well as
decreasing financial difficulties (33.80%) indicating better socioeconomic status,
whereas a high portion of the population did not exhibit a transition referring to
those predictors. Regarding educational attainment, the mean value is 10.71 years.
As concerns marital status, becoming alone was a more common circumstance
following the economic crisis than the transition to being in a relationship (7.10%
vs. 0.60%); being alone is expressed through the transition from being married,
living with spouse or in a registered partnership to i) married, not living with
spouse, never married or divorced and ii) widowed. Being in a relationship refers
to the opposite transition.
in the total sample in that period (15.50%) rather than a decrease (10.50%).
The sample consists of 6,384 males and 8,519 females with a mean age of
approximately 62 years. There are gender differences to observe. The greatest
difference is detected in mobility limitations, where a more severe worsening is
observed among females (16.40% vs. 12.20% for males). Taking into consideration
life perspectives, it is more frequent for women to enhance their sense of life
satisfaction (39.10% vs. 36.20% for males). Concerning SES, results are similar for
both genders relative to the decreasing or increasing economic hardship. On the
other hand, there is a slight difference in educational qualifications; females have on
average 10.35 years of education compared to males having 11.18 years. Regarding
marital status, the vast majority of males and females did not experience any change
though a higher proportion of women seem to have changed to “be alone” status
compared to men (9.00% vs. 4.70%). Following the recession, depression levels
increased more for women (17.50% vs. 12.80% for men).
Total sample (%)   Males (%)   Females (%)
Later-life predictors
Health factors
Long-term illness
Better health 11.80 11.20 12.20
Worse health 19.50 20.80 18.50
No change 68.70 68.00 69.20
Instrumental activities of daily living
Better health 4.90 3.00 6.40
Worse health 13.30 11.40 14.60
No change 81.80 85.50 79.00
Mobility limitations
Better health 5.90 4.70 6.80
Worse health 14.60 12.20 16.40
No change 79.50 83.10 76.80
Cognitive function
Orientation in time
Worse health 10.30 10.40 10.20
Better health 7.70 8.20 7.30
No change 82.00 81.50 82.50
Perspective of life
Life satisfaction
Worse health 29.70 29.70 29.70
Better health 37.90 36.20 39.10
No change 32.50 34.10 31.20
Socioeconomic status
Household able to make ends meet
Decreased difficulty 33.80 33.50 34.10
Increased difficulty 19.90 18.90 20.60
No change 46.30 47.60 45.30
Educational attainment at wave 2   10.71*, 11.00**   11.18*, 11.00**   10.35*, 11.00**
Demographic characteristics
Age at the time of interview at wave 2   62.13*, 61.00**   62.63*, 62.00**   61.76*, 61.00**
Gender (wave 2)
Males 42.80
Females 57.20
Marital status
Becoming alone 7.10 4.70 9.00
Being in a new relationship 0.60 0.80 0.40
No change 92.30 94.50 90.60
Control factor
Country of residence at wave 2 (N)
Austria 526 198 328
Germany 903 425 478
Sweden 1,432 620 812
Spain 1,241 517 724
Italy 1,655 734 921
France 1,129 471 658
Denmark 1,419 634 785
Greece 2,011 863 1,148
Switzerland 803 341 462
Belgium 1,659 738 921
Czech Republic 956 362 594
Poland 1,169 481 688
Depression levels
(based on EURO-D)
Improvement 10.50 7.30 12.80
Worsening 15.50 12.80 17.50
No change 74.00 79.80 69.70
Total sample (N) 14,903 6,384 8,519
Mean*, median**.
Table 29.3. Descriptive results for the transitions in the predictors, for the
total sample and by gender
Table 29.4 shows the findings based on multinomial logistic regression models
for factors associated with decreasing depression levels, controlling for wave 2
characteristics. Overall, the decrease in instrumental activities of daily living and
mobility limitations have the highest effect on decreasing depression levels (ORs
equal to 1.533 and 1.475, respectively). It is observed that a reduction in
instrumental activities of daily living (or mobility limitations) enhances the relative
chances of decreasing depression by 53.30% (or 47.50% for the latter). Moreover,
improvement in orientation in time and life satisfaction levels leads to better
cognitive function and perspective of life and is associated with a higher likelihood
of decreasing depression burden (ORs equal to 1.407 and 1.369, respectively). In
contrast, predictors related to economic hardship, educational attainment and
changes in marital status are insignificant.
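The percentage interpretations used throughout this section follow mechanically from the odds ratios; a small helper (ours, for illustration) makes the conversion explicit:

```python
def pct_change_in_relative_odds(odds_ratio):
    """Percent change in the relative chances implied by an odds ratio."""
    return (odds_ratio - 1.0) * 100.0

# OR = 1.533 for a reduction in limitations in instrumental activities of daily
# living corresponds to a 53.3% increase in the relative chances of improvement,
# while a protective OR of 0.777 would correspond to a 22.3% decrease.
```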
Concerning genders, it is clear that there are distinct differences between males
and females. Males are more likely to experience an improvement in
depression due to a decrease in long-term illnesses (odds ratio equal to 1.316), while
for females, this seems insignificant. For women, a reduction in instrumental
activities of daily living and mobility limitations contributes to a major decrease in
depression (odds ratios equal to 1.554 and 1.420, respectively). Additionally, an
improvement in orientation in time seems to matter only for men (OR = 1.571),
while for women, it is unimportant. Regarding perspective of life, the sense of life
satisfaction is important for both sexes in an equivalent manner (OR = 1.379 for
males and 1.382 for females). Finally, socioeconomic status (economic difficulties
and educational attainment) as well as becoming single at wave 6 are insignificant
factors for both sexes.
(Columns: total sample, males, females)
Later-life predictors
Health factors
Long-term illness
No change (ref. cat.) 1 1 1
Better health 1.179 1.316 1.139
(0.994–1.398) (0.968–1.790) (0.927–1.399)
Worse health 0.953 0.881 0.990
(0.798–1.139) (0.647–1.200) (0.796–1.232)
Table 29.4. Odds ratios and confidence intervals for predictors related
to an improvement in depression (total model and by gender)a
Table 29.5 shows odds ratios and confidence intervals for all factors that
contribute to increasing depression levels following the recession, controlling for
wave 2 characteristics. It is evident that the significant factors for the total sample
and by sex are the same. In particular, an increase in health factors contributes the
most to a worsening in depression. In fact, for the total sample and females, the
highest odds ratios are related to long-term illness (equal to 1.816 and 1.786,
respectively), whereas for males, instrumental activities of daily living are of major
importance (OR = 2.164). A worsening in orientation in time has an equal
contribution to the increase of depression for the joint sample and both genders;
the OR ranges from 1.347 to 1.364. Moreover, a decrease in life satisfaction
indicates worse health and hence increases the likelihood of depression almost by
two times for the whole population and both sexes (OR almost equal to 2). This is a
significant factor, taking into consideration that increasing life satisfaction decreases
the relative chances of increasing depression levels by 22.3% for the total sample
and by 32.7% for males and 15.5% for females. Increasing economic difficulties
seem to strengthen the increase in depression levels, especially for males (OR equal
to 1.614). On the other hand, even though educational attainment and age at the time
of interview are significant factors, they do not substantially alter the increase in
depression burden. Finally, becoming alone at wave 6 is associated with higher
chances regarding an increase in depression, only for males (OR = 1.550).
29.4. Discussion
The main aim of this study is to explore how changes occurring in the period
2007–2015 in various aspects of life, including health, attitude towards life, SES and
marital status, affect depression. Multinomial logistic regression models were
applied in order to detect which factors are associated with decreasing or increasing
depression levels. The study focused on the total sample and on differentiations
between genders.
Concerning past literature, some findings are in accordance with our results,
whereas others are contradictory. Undoubtedly, past analysis has found that
limitations in instrumental activities of daily living and mobility limitations
negatively affect psychological health (Backe et al. 2017; Musich et al. 2018).
Further, poor cognitive function contributes to a deterioration in depressive
disorders in older ages (Hammar and Ardal 2009; Giri et al. 2016). In contrast, life
satisfaction seems to act positively, predicting better mental health (Beutel et al.
2009; Srivastava 2016). Contrary to our results, other studies have found that an
increase in SES, such as better educational attainment, leads to a lower prevalence of
depression (Bjelland et al. 2008; Zhang et al. 2012; Freeman et al. 2016).
Further, a crucial point to consider is that the time period of the transitions
considered in the analysis means that persons in the longitudinal sample have grown
older by about seven years on average and the deterioration in their health is
attributable not only to the economic recession but also to physical wear due to the
ageing process. This fact should be taken into account when interpreting the
findings.
Future research would benefit from the inclusion of more detailed information on
the magnitude, the onset and the duration of exposures to disadvantage and
advantage, which would allow measuring inequality with greater accuracy. Further,
it would be of great interest to study the effects of such factors using information not
only from late adulthood, as in the present case, but also from childhood and middle
adulthood as well, estimating their effect in a cumulative way.
29.5. Conclusion
The present study aimed at assessing the transitions in depression levels for the
European population aged 50 or higher, after the economic recession, considering
29.6. Acknowledgments
This work was fully supported by the General Secretariat for Research and
Technology (GSRT) and the Hellenic Foundation for Research and Innovation
(HFRI).
29.7. References
Bacigalupe, A., Esnaola, S., Martín, U. (2016). The impact of the great recession on mental
health and its inequalities: The case of a Southern European region, 1997–2013.
International Journal for Equity in Health, 15(1), 1–10.
Backe, I.F., Patil, G.G., Nes, R.B., Clench-Aas, J. (2017). The relationship between physical
functional limitations, and psychological distress: Considering a possible mediating role
of pain, social support and sense of mastery. SSM – Population Health, 4, 153–163.
Beekman, A.T., Copeland, J.R., Prince, M.J. (1999). Review of community prevalence of
depression in later life. The British Journal of Psychiatry, 174(4), 307–311.
Benton, T., Staab, J., Evans, D.L. (2007). Medical co-morbidity in depressive disorders.
Annals of Clinical Psychiatry, 19(4), 289–303.
Beutel, M.E., Glaesmer, H., Wiltink, J., Marian, H., Brähler, E. (2009). Life satisfaction,
anxiety, depression and resilience across the life span of men. The Aging Male, 13(1),
32–39.
Bjelland, I., Krokstad, S., Mykletun, A., Dahl, A.A., Tell, G.S., Tambs, K. (2008). Does a
higher educational level protect against anxiety and depression? The HUNT study. Social
Science & Medicine, 66(6), 1334–1345.
Börsch-Supan, A. and Jurges, H. (2005). The Survey of Health, Aging and Retirement in
Europe. Methodology. Mannheim Research Institute for the Economics of Ageing.
Mannheim, Germany.
Börsch-Supan, A., Brandt, M., Hunkler, C., Kneip, T., Korbmacher, J., Malter, F., Schaan, B.,
Stuck, S., Zuber, S. (2013). Data resource profile: The survey of health, ageing and
retirement in Europe (SHARE). International Journal of Epidemiology, 42(4), 992–1001.
Castro-Costa, E., Dewey, M., Stewart, R., Banerjee, S., Huppert, F., Mendonca-Lima, C.,
Bula, C., Reisches, F., Wancata, J., Ritchie, K. et al. (2007). Prevalence of depressive
symptoms and syndromes in later life in ten European countries: The SHARE study. The
British Journal of Psychiatry, 191(5), 393–401.
Castro-Costa, E., Dewey, M., Stewart, R., Banerjee, S., Huppert, F., Mendonca-Lima, C.,
Bula, C., Reisches, F., Wancata, J., Ritchie, K. (2008). Ascertaining late-life depressive
symptoms in Europe: An evaluation of the survey version of the EURO-D scale in 10
nations. The SHARE project. International Journal of Methods in Psychiatric Research,
17(1), 12–29.
Dagher, R.K., Chen, J., Thomas, S.B. (2015). Gender differences in mental health outcomes
before, during, and after the great recession. PLoS ONE, 10(5), 1–16.
Dewey, M.E. and Prince, M.J. (2005). Mental health. In Health, Ageing and Retirement in
Europe, First Results from the Survey of Health, Ageing and Retirement in Europe.
Börsch-Supan, A., Brugiavini, A., Jürges, H., Mackenbach, J., Siegrist, J., Weber, G.
(eds), Mannheim Research Institute for the Economics of Ageing (MEA), Mannheim,
Germany.
Fenton, W.S. and Stover, E.S. (2006). Mood disorders: Cardiovascular and diabetes
comorbidity. Current Opinion in Psychiatry, 19(4), 421–427.
Frasquilho, D., Matos, M.G., Salonna, F., Guerreiro, D., Storti, C.C., Gaspar, T., Caldas-de-
Almeida, J.M. (2016). Mental health outcomes in times of economic recession: A
systematic literature review. BMC Public Health, 16(1), 115, 1–40.
Frasquilho, D., Cardoso, G., Ana, A., Silva, M., Caldas-de-Almeida, J.M. (2017). Gender
differences on mental health distress: Findings from the economic recession in Portugal.
European Psychiatry, 41(S1), S902.
Freeman, A., Tyrovolas, S., Koyanagi, A., Chatterji, S., Leonardi, M., Ayuso-Mateos, J.L.,
Tobiasz-Adamczyk, B., Koskinen, S., Rummel-Kluge, C., Haro, J.M. (2016). The role of
socio-economic status in depression: Results from the COURAGE (aging survey in
Europe). BMC Public Health, 16(1), 1–8.
Fryers, T., Melzer, D., Jenkins, R., Brugha, T. (2005). The distribution of the common mental
disorders: Social inequalities in Europe. Clinical Practice and Epidemiology in Mental
Health, 1(1), 1–12.
Gili, M., López-Navarro, E., Castro, A., Homar, C., Navarro, C., García-Toro, M., García-
Campayo, J., Roca, M. (2016). Gender differences in mental health during the economic
crisis. Psicothema, 28(4), 407–413.
Predicting Changes in Depression Levels 417
Giri, M., Chen, T., Yu, W., Lü, Y. (2016). Prevalence and correlates of cognitive impairment
and depression among elderly people in the world’s fastest growing city, Chongqing,
People’s Republic of China. Clinical Interventions in Aging, 11, 1091–1098.
Glonti, K., Gordeev, V.S., Goryakin, Y., Reeves, A., Stuckler, D., McKee, M., Roberts, B.
(2015). A systematic review on health resilience to economic crises. PLoS One, 10(4),
1–22.
Gunn, J.M., Ayton, D.R., Densley, K., Pallant, J.F., Chondros, P., Herrman, H.E., Dowrick,
C.F. (2010). The association between chronic illness, multimorbidity and depressive
symptoms in an Australian primary care cohort. Social Psychiatry and Psychiatric
Epidemiology, 47(2), 175–184.
Gunnell, D., Donovan, J., Barnes, M., Davies, R., Hawton, K., Kapur, N., Hollingworth, W.,
Metcalfe, C. (2015). The 2008 global financial crisis: Effects on mental health and
suicide. Policy Bristol, Policy Report 3/2015.
Hammar, A. and Ardal, G. (2009). Cognitive functioning in major depression – A
summary. Frontiers in Human Neuroscience, 3(26), 1–7.
Jofre-Bonet, M., Serra-Sastre, V., Vandoros, S. (2018). The impact of the great recession on
health-related risk factors, behaviour and outcomes in England. Social Science &
Medicine, 197, 213–225.
Li, T. and Fung, H.H. (2013). Age differences in trust: An investigation across 38
countries. Journals of Gerontology: Series B, 68(3), 347–355.
Margerison-Zilko, C., Goldman-Mellor, S., Falconi, A., Downing, J. (2016). Health impacts
of the great recession: A critical review. Current Epidemiology Reports, 3(1), 81–91.
Martin-Carrasco, M., Evans-Lacko, S., Dom, G., Christodoulou, N.G., Samochowiec, J.,
González-Fraile, E., Bienkowski, P., Dos Santos, M.J., Wasserman, D. (2016). EPA
guidance on mental health and economic crises in Europe. European Archives of
Psychiatry and Clinical Neuroscience, 266(2), 89–124.
Musich, S., Wang, S.S., Ruiz, J., Hawkins, K., Wicker, E. (2018). The impact of mobility
limitations on health outcomes among older adults. Geriatric Nursing, 39(2), 162–169.
Ormel, J., Rijsdijk, F.V., Sullivan, M., van Sonderen, E., Kempen, G.I. (2002). Temporal and
reciprocal relationship between IADL/ADL disability and depressive symptoms in late
life. The Journals of Gerontology: Series B, 57(4), 338–347.
Parmar, D., Stavropoulou, C., Ioannidis, J.P. (2016). Health outcomes during the 2008
financial crisis in Europe: Systematic literature review. BMJ, 354, 1–11.
Prince, M.J., Reischies, F., Beekman, A.T., Fuhrer, R., Jonker, C., Kivela, S.L., Lawlor, B.A.,
Lobo, A., Magnusson, H., Fichter, M. (1996a). Development of the EURO-D scale – A
European Union initiative to compare symptoms of depression in 14 European
centres. The British Journal of Psychiatry, 174(4), 330–338.
Prince, M.J., Beekman, A.T., Deeg, D.J., Fuhrer, R., Kivela, S.L., Lawlor, B.A., Lobo, A.,
Magnusson, H., Meller, I., van Oyen, H. (1996b). Depression symptoms in late life
assessed using the EURO-D scale. Effect of age, gender and marital status in 14 European
centres. The British Journal of Psychiatry, 174(4), 339–345.
Pruchno, R., Heid, A.R., Wilson-Genderson, M. (2017). The great recession, Life events, and
mental health of older adults. The International Journal of Aging and Human
Development, 84(3), 294–312.
Sarkisian, C.A., Hays, R.D., Mangione, C.M. (2002). Do older adults expect to age
successfully? The association between expectations regarding aging and beliefs regarding
healthcare seeking among older adults. Journal of the American Geriatrics Society,
50(11), 1837–1843.
Simon, G.E., Katon, W.J., Lin, E.H., Rutter, C., Manning, W.G., Von Korff, M.,
Ciechanowski, P., Ludman, E.J., Young, B.A. (2007). Cost-effectiveness of systematic
depression treatment among people with diabetes mellitus. Archives of General
Psychiatry, 64(1), 65–72.
Srivastava, A. (2016). Relationship between life satisfaction and depression among working
and non-working married women. International Journal of Education and Psychological
Research (IJEPR), 5(3), 1–7.
Vamos, E.P., Mucsi, I., Keszei, A., Kopp, M.S., Novak, M. (2009). Comorbid depression is
associated with increased healthcare utilization and lost productivity in persons with
diabetes: A large nationally representative Hungarian population survey. Psychosomatic
Medicine, 71(5), 501–507.
Welch, C.A., Czerwinski, D., Ghimire, B., Bertsimas, D. (2009). Depression and costs of
health care. Psychosomatics, 50(4), 392–401.
Wilkinson, L.R. (2016). Financial strain and mental health among older adults during the
great recession. Journals of Gerontology Series B: Psychological Sciences and Social
Sciences, 71(4), 745–754.
World Health Organization (2011). Impact of economic crises on mental health. WHO,
Copenhagen, Denmark.
Yan, X.Y., Huang, S.M., Huang, C.Q., Wu, W.H., Qin, Y. (2011). Marital status and risk for
late life depression: A meta-analysis of the published literature. Journal of International
Medical Research, 39(4), 1142–1154.
Zhang, L., Xu, Y., Nie, H., Zhang, Y., Wu, Y. (2012). The prevalence of depressive
symptoms among the older in China: A meta-analysis. International Journal of Geriatric
Psychiatry, 27(9), 900–906.
List of Authors
Data Analysis and Related Applications 1,
First Edition. Edited by Konstantinos N. Zafeiris, Christos H. Skiadas, Yiannis Dimotikalis,
Alex Karagrigoriou and Christiana Karagrigoriou-Vonta.
© ISTE Ltd 2022. Published by ISTE Ltd and John Wiley & Sons, Inc.
Louisa TESTA
Department of Statistics and
Operations Research
University of Malta
Msida
Malta
Index
A, B
acute respiratory viral infection (ARVI), 359, 362, 363
approximation by exponents, 297
arbitrarily distributed life- and repair times, 379, 381, 393
asymptotic analysis, 43
batch processing, 163
Bayesian inference, 106
  approximate, 319
beta regression, 173–177, 179, 183, 184
blockchain, 31–36, 38–41

C, D
calibration, 135–138, 141–144, 146, 148
classification algorithm, 207, 208, 213, 219
CO2, 307, 312, 313
community-acquired pneumonia, 359, 362, 363, 365, 366, 368
compositional data, 115, 116, 118
confirmatory factor analysis (CFA), 81, 82, 85–88, 91–95
contingency tables, 238, 246
Covid-19 (see also new coronavirus infection), 297, 302–305
cubature method, 333–336, 339, 345, 349, 350, 352–354
data
  interpretation, 187
  storage, 188, 192, 195
databases, 31, 32, 34–36, 38–41
depression, 395–400, 402–405, 407, 408, 410, 412–414
developing markets, 149, 152
devices, 371, 372, 375, 376
dividends, 43, 44, 48, 49, 55
double redundant system, 379–381, 393

E, F
economic downturn, 396, 405, 413, 415
entropies, 237, 238
epidemiological data, 297
Europe, 395–398, 414
exploratory factor analysis (EFA), 81, 82, 85–90, 94, 95
extreme value theory, 60
fixed-income market, 333
fixed-radius NN, 67, 68, 70, 71, 73, 74, 76–79
FlexReg package, 99, 101, 107–110

G, H
gas analysis, 307, 312, 316
Gatheral model, 135–140, 143, 147
gender, 395–401, 404, 405, 407, 408, 410–413, 415
generalized linear model (GLM), 223
genetic algorithm, 173, 175, 178, 184
geographically weighted regression (GWR), 261, 262
gestational diabetes mellitus (GDM), 67–69, 73–78
Hamiltonian Monte Carlo (HMC), 115, 118
higher education, 371
Hull–White model, 333, 345, 346, 348, 349, 352, 354, 355

I, K
implied volatility expansions, 135–137, 147
independent and non-identically distributed observations, 223, 224
insurance models, 43
k-nearest neighbour (kNN), 67, 68, 70, 73, 74, 76–79
kernel classification, 72, 74

medical care, 359, 360, 366, 367
mixture
  distribution, 116
  model, 107
morbidity, 359, 360, 362, 367
multi-armed bandit (MAB), 163, 164, 166, 168
multivariate regression, 115

N, O
network, 371–376
new coronavirus infection (see also Covid-19), 359–363, 365, 367, 368
non-parametric, 67, 68, 70
normalized data, 187, 192, 193, 195
NoSQL, 32–41
numerical method, 199
O2, 307, 312, 313
official land price, 261, 264–267, 271, 272
optimization, 43, 45, 55

P, R

S, T
simulations, 238, 246
software
  cost estimation, 275, 282, 283
  defined network (SDN), 371, 372, 374–376
spatial
  clustering, 13
  statistics, 262
SQL, 32, 34, 36–41
Stratonovich integral, 333, 337
tail distribution, 57
temporal variation, 261
tests of fit, 237, 238, 241, 242
thyroid
  cancer, 13, 14
  diseases, 3, 4, 7, 10
time series data, 297, 298
topological data analysis (TDA), 207

U, V, W
UCB, 163, 165, 168, 170
validity, 81, 82, 85, 87, 91, 93–95
volcanic areas, 13–17, 19
WEKA (Waikato Environment for Knowledge Analysis), 275–279, 281–283
Wiener space, 349–351
Summary of Volume 2
Preface
Konstantinos N. ZAFEIRIS, Yiannis DIMOTIKALIS, Christos H. SKIADAS,
Alex KARAGRIGORIOU and Christiana KARAGRIGORIOU-VONTA
Part 1
2.1. Introduction
2.1.1. Distributions for count vectors
3.1. Introduction
3.2. Data and methods
3.3. The trends of class mobility between different birth cohorts
3.4. Conclusion
3.5. References
4.1. Introduction
4.2. Data and methodology
4.3. Results
4.4. Conclusion
4.5. References
6.1. Introduction
6.2. American option pricing
7.1. Introduction
7.2. Methods
7.3. Results
7.4. Conclusion
7.5. References
8.1. Introduction
8.2. Mathematical model
8.3. Optimization problem
8.4. HJBI equation: formulation and solution
8.5. Concluding remarks
8.6. Acknowledgments
8.7. References
Part 2
9.1. Introduction
9.2. Modeling the process of event occurrence
10.1. Introduction
10.2. Structural equation modeling using partial least squares
10.2.1. Specification of the internal model
10.2.2. Specification of the external model
10.2.3. Validation statistics for the external model
10.2.4. Overall validation of structural modeling
10.3. Material and method
10.3.1. Agro-ecological context of the study
10.3.2. Data
10.3.3. The structural model and the estimation
10.4. Results and discussion
10.4.1. Checking the block one-dimensionality
10.4.2. Fitting the external model and assessing the quality of the fit
10.4.3. The structural model after revision
10.5. Conclusion
10.6. References
11.1. Introduction
11.2. Theoretical framework
11.3. Purpose of the research
11.4. Methodology
11.5. Research results
11.6. Conclusion
11.7. References
12.1. Introduction
12.2. Methodology and material
12.2.1. Research tools for measuring motivation and professional
satisfaction for this work
12.2.2. Purpose and objectives of the research
12.2.3. Material and method
12.2.4. Statistical analysis
12.3. Results
12.4. Discussion
12.5. Conclusion
12.6. References
13.1. Introduction
13.2. Methodology
13.3. Discussion and conclusion
13.4. Acknowledgments
13.5. Appendix
13.6. References
14.1. Introduction
14.2. Materials and methods
14.3. Behavior of Covid-19 disease in the Mediterranean region
14.4. Conclusion
14.5. Acknowledgments
14.6. References
Part 3
15.1. Introduction
15.2. Data and methodological remarks
15.3. Statutory retirement age
15.4. Development of the state of health of population
15.5. Development of the state of health of population in productive and
post-productive ages
15.6. Conclusion
15.7. Acknowledgment
15.8. References
16.1. Introduction
16.2. Preliminary results in the area of EVT for heavy tails and asymptotic
behavior of MOp functionals
16.2.1. A brief review of first- and second-order conditions
16.2.2. Asymptotic behavior of the Hill EVI-estimators
16.2.3. Asymptotic behavior of MOp EVI-estimators under
a regular framework
16.2.4. A brief reference to additive stable laws
16.2.5. Asymptotic behavior of EVI-estimators under
a non-regular framework
16.3. Finite-sample behavior of MOp functionals
17.1. Introduction
17.2. Demographic development in the V4 countries
17.3. Development of fertility and family policy
17.4. Pension systems of the Visegrad Four countries
17.5. Prediction of future development of V4 populations
17.6. Conclusion
17.7. Acknowledgments
17.8. References
18.1. Introduction
18.2. Methodology and data
18.3. Main results
18.3.1. Effect of mortality
18.3.2. Effects of mortality and health
18.4. Conclusion
18.5. Acknowledgments
18.6. References
19.1. Introduction
19.1.1. Actual mortality patterns
19.1.2. Objectives of the study
19.2. Methods
19.2.1. Data
19.2.2. Force of subjective mortality
19.2.3. Variables
19.2.4. Statistical modeling
19.3. Results
19.3.1. Sample
19.3.2. Multivariable analyses
19.4. Discussion
19.5. Conclusion
19.6. Acknowledgments
19.7. References
20.1. Introduction
20.2. Binomial mortality model and the empirical distribution of daily
deaths in Germany
20.3. Non-seasonal ARIMA model for weekly data in Germany
20.4. Seasonal ARIMA models of weekly deaths for Spain,
Germany and Sweden
20.5. Measuring excess mortality, especially in Spain, Germany and Sweden
20.6. Forecasting daily deaths in Germany
20.7. Conclusion
20.8. Appendix
20.8.1. Estimation results of the other age classes
20.8.2. Time series decomposition
20.9. References
21.1. Introduction
21.2. Materials and methods
21.2.1. Multilevel logistic model
21.3. Results and discussion
21.4. Conclusion
21.5. References
22.1. Introduction
22.2. Literature review
22.3. Methods
22.4. Results
22.5. Discussion
22.6. Conclusion
22.7. Acknowledgment
22.8. References
24.1. Introduction
24.2. The setting of the statutory retirement age
24.3. The economic status of elderly workers
24.4. The structure of working people by factors
24.5. The change in the number of workers
24.6. Conclusion
24.7. Acknowledgment
24.8. References
Other titles from ISTE in Innovation, Entrepreneurship and Management
2022
BOUCHÉ Geneviève
Productive Economy, Contributory Economy: Governance Tools for the
Third Millennium
HELLER David
Valuation of the Liability Structure by Real Options
MATHIEU Valérie
A Customer-oriented Manager for B2B Services: Principles and
Implementation
NOËL Florent, SCHMIDT Géraldine
Employability and Industrial Mutations: Between Individual Trajectories
and Organizational Strategic Planning (Technological Changes and Human
Resources Set – Volume 4)
SALOFF-COSTE Michel
Innovation Ecosystems: The Future of Civilizations and the Civilization of
the Future (Innovation and Technology Set – Volume 14)
VAYRE Emilie
New Spaces and New Working Times
2021
ARCADE Jacques
Strategic Engineering (Innovation and Technology Set – Volume 11)
BÉRANGER Jérôme, RIZOULIÈRES Roland
The Digital Revolution in Health (Health and Innovation Set – Volume 2)
BOBILLIER CHAUMON Marc-Eric
Digital Transformations in the Challenge of Activity and Work:
Understanding and Supporting Technological Changes
(Technological Changes and Human Resources Set – Volume 3)
BUCLET Nicolas
Territorial Ecology and Socio-ecological Transition
(Smart Innovation Set – Volume 34)
DIMOTIKALIS Yannis, KARAGRIGORIOU Alex, PARPOULA Christina,
SKIADAS Christos H.
Applied Modeling Techniques and Data Analysis 1: Computational Data
Analysis Methods and Tools (Big Data, Artificial Intelligence and Data
Analysis Set – Volume 7)
Applied Modeling Techniques and Data Analysis 2: Financial,
Demographic, Stochastic and Statistical Models and Methods (Big Data,
Artificial Intelligence and Data Analysis Set – Volume 8)
DISPAS Christophe, KAYANAKIS Georges, SERVEL Nicolas,
STRIUKOVA Ludmila
Innovation and Financial Markets
(Innovation between Risk and Reward Set – Volume 7)
ENJOLRAS Manon
Innovation and Export: The Joint Challenge of the Small Company
(Smart Innovation Set – Volume 37)
FLEURY Sylvain, RICHIR Simon
Immersive Technologies to Accelerate Innovation: How Virtual and
Augmented Reality Enables the Co-Creation of Concepts
(Smart Innovation Set – Volume 38)
GIORGINI Pierre
The Contributory Revolution (Innovation and Technology Set – Volume 13)
GOGLIN Christian
Emotions and Values in Equity Crowdfunding Investment Choices 2:
Modeling and Empirical Study
GRENIER Corinne, OIRY Ewan
Altering Frontiers: Organizational Innovations in Healthcare (Health and
Innovation Set – Volume 1)
GUERRIER Claudine
Security and Its Challenges in the 21st Century (Innovation and Technology
Set – Volume 12)
HELLER David
Performance of Valuation Methods in Financial Transactions (Modern
Finance, Management Innovation and Economic Growth Set – Volume 4)
LEHMANN Paul-Jacques
Liberalism and Capitalism Today
SOULÉ Bastien, HALLÉ Julie, VIGNAL Bénédicte, BOUTROY Éric,
NIER Olivier
Innovation in Sport: Innovation Trajectories and Process Optimization
(Smart Innovation Set – Volume 35)
UZUNIDIS Dimitri, KASMI Fedoua, ADATTO Laurent
Innovation Economics, Engineering and Management Handbook 1:
Main Themes
Innovation Economics, Engineering and Management Handbook 2:
Special Themes
VALLIER Estelle
Innovation in Clusters: Science–Industry Relationships in the Face of
Forced Advancement (Smart Innovation Set – Volume 36)
2020
ACH Yves-Alain, RMADI-SAÏD Sandra
Financial Information and Brand Value: Reflections, Challenges and
Limitations
ANDREOSSO-O’CALLAGHAN Bernadette, DZEVER Sam, JAUSSAUD Jacques,
TAYLOR Robert
Sustainable Development and Energy Transition in Europe and Asia
(Innovation and Technology Set – Volume 9)
BEN SLIMANE Sonia, M’HENNI Hatem
Entrepreneurship and Development: Realities and Future Prospects
(Smart Innovation Set – Volume 30)
CHOUTEAU Marianne, FOREST Joëlle, NGUYEN Céline
Innovation for Society: The P.S.I. Approach
(Smart Innovation Set – Volume 28)
CORON Clotilde
Quantifying Human Resources: Uses and Analysis
(Technological Changes and Human Resources Set – Volume 2)
CORON Clotilde, GILBERT Patrick
Technological Change
(Technological Changes and Human Resources Set – Volume 1)
CERDIN Jean-Luc, PERETTI Jean-Marie
The Success of Apprenticeships: Views of Stakeholders on Training and
Learning (Human Resources Management Set – Volume 3)
DELCHET-COCHET Karen
Circular Economy: From Waste Reduction to Value Creation
(Economic Growth Set – Volume 2)
DIDAY Edwin, GUAN Rong, SAPORTA Gilbert, WANG Huiwen
Advances in Data Science
(Big Data, Artificial Intelligence and Data Analysis Set – Volume 4)
DOS SANTOS PAULINO Victor
Innovation Trends in the Space Industry
(Smart Innovation Set – Volume 25)
GASMI Nacer
Corporate Innovation Strategies: Corporate Social Responsibility and
Shared Value Creation
(Smart Innovation Set – Volume 33)
GOGLIN Christian
Emotions and Values in Equity Crowdfunding Investment Choices 1:
Transdisciplinary Theoretical Approach
GUILHON Bernard
Venture Capital and the Financing of Innovation
(Innovation Between Risk and Reward Set – Volume 6)
LATOUCHE Pascal
Open Innovation: Human Set-up
(Innovation and Technology Set – Volume 10)
LIMA Marcos
Entrepreneurship and Innovation Education: Frameworks and Tools
(Smart Innovation Set – Volume 32)
MACHADO Carolina, DAVIM J. Paulo
Sustainable Management for Managers and Engineers
MAKRIDES Andreas, KARAGRIGORIOU Alex, SKIADAS Christos H.
Data Analysis and Applications 3: Computational, Classification, Financial,
Statistical and Stochastic Methods
(Big Data, Artificial Intelligence and Data Analysis Set – Volume 5)
Data Analysis and Applications 4: Financial Data Analysis and Methods
(Big Data, Artificial Intelligence and Data Analysis Set – Volume 6)
MASSOTTE Pierre, CORSI Patrick
Complex Decision-Making in Economy and Finance
MEUNIER François-Xavier
Dual Innovation Systems: Concepts, Tools and Methods
(Smart Innovation Set – Volume 31)
MICHAUD Thomas
Science Fiction and Innovation Design (Innovation in Engineering and
Technology Set – Volume 6)
MONINO Jean-Louis
Data Control: Major Challenge for the Digital Society
(Smart Innovation Set – Volume 29)
MORLAT Clément
Sustainable Productive System: Eco-development versus Sustainable
Development (Smart Innovation Set – Volume 26)
SAULAIS Pierre, ERMINE Jean-Louis
Knowledge Management in Innovative Companies 2: Understanding and
Deploying a KM Plan within a Learning Organization
(Smart Innovation Set – Volume 27)
2019
AMENDOLA Mario, GAFFARD Jean-Luc
Disorder and Public Concern Around Globalization
BARBAROUX Pierre
Disruptive Technology and Defence Innovation Ecosystems
(Innovation in Engineering and Technology Set – Volume 5)
DOU Henri, JUILLET Alain, CLERC Philippe
Strategic Intelligence for the Future 1: A New Strategic and Operational
Approach
Strategic Intelligence for the Future 2: A New Information Function
Approach
FRIKHA Azza
Measurement in Marketing: Operationalization of Latent Constructs
FRIMOUSSE Soufyane
Innovation and Agility in the Digital Age
(Human Resources Management Set – Volume 2)
GAY Claudine, SZOSTAK Bérangère L.
Innovation and Creativity in SMEs: Challenges, Evolutions and Prospects
(Smart Innovation Set – Volume 21)
GORIA Stéphane, HUMBERT Pierre, ROUSSEL Benoît
Information, Knowledge and Agile Creativity
(Smart Innovation Set – Volume 22)
HELLER David
Investment Decision-making Using Optional Models
(Economic Growth Set – Volume 2)
HELLER David, DE CHADIRAC Sylvain, HALAOUI Lana, JOUVET Camille
The Emergence of Start-ups
(Economic Growth Set – Volume 1)
HÉRAUD Jean-Alain, KERR Fiona, BURGER-HELMCHEN Thierry
Creative Management of Complex Systems
(Smart Innovation Set – Volume 19)
LATOUCHE Pascal
Open Innovation: Corporate Incubator
(Innovation and Technology Set – Volume 7)
LEHMANN Paul-Jacques
The Future of the Euro Currency
LEIGNEL Jean-Louis, MÉNAGER Emmanuel, YABLONSKY Serge
Sustainable Enterprise Performance: A Comprehensive Evaluation Method
LIÈVRE Pascal, AUBRY Monique, GAREL Gilles
Management of Extreme Situations: From Polar Expeditions to Exploration-
Oriented Organizations
MILLOT Michel
Embarrassment of Product Choices 2: Towards a Society of Well-being
N’GOALA Gilles, PEZ-PÉRARD Virginie, PRIM-ALLAZ Isabelle
Augmented Customer Strategy: CRM in the Digital Age
NIKOLOVA Blagovesta
The RRI Challenge: Responsibilization in a State of Tension with Market
Regulation
(Innovation and Responsibility Set – Volume 3)
PELLEGRIN-BOUCHER Estelle, ROY Pierre
Innovation in the Cultural and Creative Industries
(Innovation and Technology Set – Volume 8)
PRIOLON Joël
Financial Markets for Commodities
QUINIOU Matthieu
Blockchain: The Advent of Disintermediation
RAVIX Joël-Thomas, DESCHAMPS Marc
Innovation and Industrial Policies
(Innovation between Risk and Reward Set – Volume 5)
ROGER Alain, VINOT Didier
Skills Management: New Applications, New Questions
(Human Resources Management Set – Volume 1)
SAULAIS Pierre, ERMINE Jean-Louis
Knowledge Management in Innovative Companies 1: Understanding and
Deploying a KM Plan within a Learning Organization
(Smart Innovation Set – Volume 23)
SERVAJEAN-HILST Romaric
Co-innovation Dynamics: The Management of Client-Supplier Interactions
for Open Innovation
(Smart Innovation Set – Volume 20)
SKIADAS Christos H., BOZEMAN James R.
Data Analysis and Applications 1: Clustering and Regression, Modeling-
estimating, Forecasting and Data Mining
(Big Data, Artificial Intelligence and Data Analysis Set – Volume 2)
Data Analysis and Applications 2: Utilization of Results in Europe and
Other Topics
(Big Data, Artificial Intelligence and Data Analysis Set – Volume 3)
UZUNIDIS Dimitri
Systemic Innovation: Entrepreneurial Strategies and Market Dynamics
VIGEZZI Michel
World Industrialization: Shared Inventions, Competitive Innovations and
Social Dynamics
(Smart Innovation Set – Volume 24)
2018
BURKHARDT Kirsten
Private Equity Firms: Their Role in the Formation of Strategic Alliances
CALLENS Stéphane
Creative Globalization
(Smart Innovation Set – Volume 16)
CASADELLA Vanessa
Innovation Systems in Emerging Economies: MINT – Mexico, Indonesia,
Nigeria, Turkey
(Smart Innovation Set – Volume 18)
CHOUTEAU Marianne, FOREST Joëlle, NGUYEN Céline
Science, Technology and Innovation Culture
(Innovation in Engineering and Technology Set – Volume 3)
CORLOSQUET-HABART Marine, JANSSEN Jacques
Big Data for Insurance Companies
(Big Data, Artificial Intelligence and Data Analysis Set – Volume 1)
CROS Françoise
Innovation and Society
(Smart Innovation Set – Volume 15)
DEBREF Romain
Environmental Innovation and Ecodesign: Certainties and Controversies
(Smart Innovation Set – Volume 17)
DOMINGUEZ Noémie
SME Internationalization Strategies: Innovation to Conquer New Markets
ERMINE Jean-Louis
Knowledge Management: The Creative Loop
(Innovation and Technology Set – Volume 5)
GILBERT Patrick, BOBADILLA Natalia, GASTALDI Lise,
LE BOULAIRE Martine, LELEBINA Olga
Innovation, Research and Development Management
IBRAHIMI Mohammed
Mergers & Acquisitions: Theory, Strategy, Finance
LEMAÎTRE Denis
Training Engineers for Innovation
LÉVY Aldo, BEN BOUHENI Faten, AMMI Chantal
Financial Management: USGAAP and IFRS Standards
(Innovation and Technology Set – Volume 6)
MILLOT Michel
Embarrassment of Product Choices 1: How to Consume Differently
PANSERA Mario, OWEN Richard
Innovation and Development: The Politics at the Bottom of the Pyramid
(Innovation and Responsibility Set – Volume 2)
RICHEZ Yves
Corporate Talent Detection and Development
SACHETTI Philippe, ZUPPINGER Thibaud
New Technologies and Branding
(Innovation and Technology Set – Volume 4)
SAMIER Henri
Intuition, Creativity, Innovation
TEMPLE Ludovic, COMPAORÉ SAWADOGO Eveline M.F.W.
Innovation Processes in Agro-Ecological Transitions in Developing
Countries
(Innovation in Engineering and Technology Set – Volume 2)
UZUNIDIS Dimitri
Collective Innovation Processes: Principles and Practices
(Innovation in Engineering and Technology Set – Volume 4)
VAN HOOREBEKE Delphine
The Management of Living Beings or Emo-management
2017
AÏT-EL-HADJ Smaïl
The Ongoing Technological System
(Smart Innovation Set – Volume 11)
BAUDRY Marc, DUMONT Béatrice
Patents: Prompting or Restricting Innovation?
(Smart Innovation Set – Volume 12)
BÉRARD Céline, TEYSSIER Christine
Risk Management: Lever for SME Development and Stakeholder
Value Creation
CHALENÇON Ludivine
Location Strategies and Value Creation of International
Mergers and Acquisitions
CHAUVEL Danièle, BORZILLO Stefano
The Innovative Company: An Ill-defined Object
(Innovation between Risk and Reward Set – Volume 1)
CORSI Patrick
Going Past Limits To Growth
D’ANDRIA Aude, GABARRET Inés
Building 21st Century Entrepreneurship
(Innovation and Technology Set – Volume 2)
DAIDJ Nabyla
Cooperation, Coopetition and Innovation
(Innovation and Technology Set – Volume 3)
FERNEZ-WALCH Sandrine
The Multiple Facets of Innovation Project Management
(Innovation between Risk and Reward Set – Volume 4)
FOREST Joëlle
Creative Rationality and Innovation
(Smart Innovation Set – Volume 14)
GUILHON Bernard
Innovation and Production Ecosystems
(Innovation between Risk and Reward Set – Volume 2)
HAMMOUDI Abdelhakim, DAIDJ Nabyla
Game Theory Approach to Managerial Strategies and Value Creation
(Diverse and Global Perspectives on Value Creation Set – Volume 3)
LALLEMENT Rémi
Intellectual Property and Innovation Protection: New Practices
and New Policy Issues
(Innovation between Risk and Reward Set – Volume 3)
LAPERCHE Blandine
Enterprise Knowledge Capital
(Smart Innovation Set – Volume 13)
LEBERT Didier, EL YOUNSI Hafida
International Specialization Dynamics
(Smart Innovation Set – Volume 9)
MAESSCHALCK Marc
Reflexive Governance for Research and Innovative Knowledge
(Responsible Research and Innovation Set – Volume 6)
MASSOTTE Pierre
Ethics in Social Networking and Business 1: Theory, Practice
and Current Recommendations
Ethics in Social Networking and Business 2: The Future and
Changing Paradigms
MASSOTTE Pierre, CORSI Patrick
Smart Decisions in Complex Systems
MEDINA Mercedes, HERRERO Mónica, URGELLÉS Alicia
Current and Emerging Issues in the Audiovisual Industry
(Diverse and Global Perspectives on Value Creation Set – Volume 1)
MICHAUD Thomas
Innovation, Between Science and Science Fiction
(Smart Innovation Set – Volume 10)
PELLÉ Sophie
Business, Innovation and Responsibility
(Responsible Research and Innovation Set – Volume 7)
SAVIGNAC Emmanuelle
The Gamification of Work: The Use of Games in the Workplace
SUGAHARA Satoshi, DAIDJ Nabyla, USHIO Sumitaka
Value Creation in Management Accounting and Strategic Management:
An Integrated Approach
(Diverse and Global Perspectives on Value Creation Set – Volume 2)
UZUNIDIS Dimitri, SAULAIS Pierre
Innovation Engines: Entrepreneurs and Enterprises in a Turbulent World
(Innovation in Engineering and Technology Set – Volume 1)
2016
BARBAROUX Pierre, ATTOUR Amel, SCHENK Eric
Knowledge Management and Innovation
(Smart Innovation Set – Volume 6)
BEN BOUHENI Faten, AMMI Chantal, LEVY Aldo
Banking Governance, Performance and Risk-Taking: Conventional Banks
vs Islamic Banks
BOUTILLIER Sophie, CARRÉ Denis, LEVRATTO Nadine
Entrepreneurial Ecosystems (Smart Innovation Set – Volume 2)
BOUTILLIER Sophie, UZUNIDIS Dimitri
The Entrepreneur (Smart Innovation Set – Volume 8)
BOUVARD Patricia, SUZANNE Hervé
Collective Intelligence Development in Business
GALLAUD Delphine, LAPERCHE Blandine
Circular Economy, Industrial Ecology and Short Supply Chains
(Smart Innovation Set – Volume 4)
GUERRIER Claudine
Security and Privacy in the Digital Era
(Innovation and Technology Set – Volume 1)
MEGHOUAR Hicham
Corporate Takeover Targets
MONINO Jean-Louis, SEDKAOUI Soraya
Big Data, Open Data and Data Development
(Smart Innovation Set – Volume 3)
MOREL Laure, LE ROUX Serge
Fab Labs: Innovative User
(Smart Innovation Set – Volume 5)
PICARD Fabienne, TANGUY Corinne
Innovations and Techno-ecological Transition
(Smart Innovation Set – Volume 7)
2015
CASADELLA Vanessa, LIU Zeting, UZUNIDIS Dimitri
Innovation Capabilities and Economic Development in Open Economies
(Smart Innovation Set – Volume 1)
CORSI Patrick, MORIN Dominique
Sequencing Apple’s DNA
CORSI Patrick, NEAU Erwan
Innovation Capability Maturity Model
FAIVRE-TAVIGNOT Bénédicte
Social Business and Base of the Pyramid
GODÉ Cécile
Team Coordination in Extreme Environments
MAILLARD Pierre
Competitive Quality and Innovation
MASSOTTE Pierre, CORSI Patrick
Operationalizing Sustainability
MASSOTTE Pierre, CORSI Patrick
Sustainability Calling
2014
DUBÉ Jean, LEGROS Diègo
Spatial Econometrics Using Microdata
LESCA Humbert, LESCA Nicolas
Strategic Decisions and Weak Signals
2013
HABART-CORLOSQUET Marine, JANSSEN Jacques, MANCA Raimondo
VaR Methodology for Non-Gaussian Finance
2012
DAL PONT Jean-Pierre
Process Engineering and Industrial Management
MAILLARD Pierre
Competitive Quality Strategies
POMEROL Jean-Charles
Decision-Making and Action
SZYLAR Christian
UCITS Handbook
2011
LESCA Nicolas
Environmental Scanning and Sustainable Development
LESCA Nicolas, LESCA Humbert
Weak Signals for Strategic Intelligence: Anticipation Tool for Managers
MERCIER-LAURENT Eunika
Innovation Ecosystems
2010
SZYLAR Christian
Risk Management under UCITS III/IV
2009
COHEN Corine
Business Intelligence
ZANINETTI Jean-Marc
Sustainable Development in the USA
2008
CORSI Patrick, DULIEU Mike
The Marketing of Technology Intensive Products and Services
DZEVER Sam, JAUSSAUD Jacques, ANDREOSSO Bernadette
Evolving Corporate Structures and Cultures in Asia: Impact
of Globalization
2007
AMMI Chantal
Global Consumer Behavior
2006
BOUGHZALA Imed, ERMINE Jean-Louis
Trends in Enterprise Knowledge Management
CORSI Patrick et al.
Innovation Engineering: The Power of Intangible Networks
WILEY END USER LICENSE AGREEMENT
Go to www.wiley.com/go/eula to access Wiley’s ebook EULA.