
Data Analysis and Related Applications 1

Big Data, Artificial Intelligence and Data Analysis Set


coordinated by
Jacques Janssen

Volume 9

Data Analysis and Related Applications 1

Computational, Algorithmic and Applied Economic Data Analysis

Edited by
Konstantinos N. Zafeiris
Christos H. Skiadas
Yiannis Dimotikalis
Alex Karagrigoriou
Christiana Karagrigoriou-Vonta
First published 2022 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted
under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or
transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the
case of reprographic reproduction in accordance with the terms and licenses issued by the
CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the
undermentioned address:

ISTE Ltd
27-37 St George’s Road
London SW19 4EU
UK
www.iste.co.uk

John Wiley & Sons, Inc.
111 River Street
Hoboken, NJ 07030
USA
www.wiley.com

© ISTE Ltd 2022


The rights of Konstantinos N. Zafeiris, Christos H. Skiadas, Yiannis Dimotikalis, Alex Karagrigoriou and
Christiana Karagrigoriou-Vonta to be identified as the authors of this work have been asserted by them in
accordance with the Copyright, Designs and Patents Act 1988.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the
author(s), contributor(s) or editor(s) and do not necessarily reflect the views of ISTE Group.

Library of Congress Control Number: 2022935196

British Library Cataloguing-in-Publication Data


A CIP record for this book is available from the British Library
ISBN 978-1-78630-771-2
Contents

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Konstantinos N. ZAFEIRIS, Yiannis DIMOTIKALIS, Christos H. SKIADAS, Alex KARAGRIGORIOU
and Christiana KARAGRIGORIOU-VONTA

Part 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

Chapter 1. Performance of Evaluation of Diagnosis of Various Thyroid


Diseases Using Machine Learning Techniques . . . . . . . . . . . . . . . . . . 3
Burcu Bektas GÜNEŞ, Evren BURSUK and Rüya ŞAMLI

1.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2. Data understanding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3. Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4. Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6. References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

Chapter 2. Exploring Chronic Diseases’ Spatial Patterns:


Thyroid Cancer in Sicilian Volcanic Areas . . . . . . . . . . . . . . . . . . . . . 13
Francesca BITONTI and Angelo MAZZA
2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2. Epidemiological data and territory . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3. Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.1. Spatial inhomogeneity and spatial dependence . . . . . . . . . . . . . . . . 18
2.3.2. Standardized incidence ratio (SIR) . . . . . . . . . . . . . . . . . . . . . . 19
2.3.3. Local Moran’s I statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4. Spatial distribution of TC in eastern Sicily . . . . . . . . . . . . . . . . . . . . 22
2.4.1. SIR geographical variation . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.4.2. Estimate of the spatial attraction . . . . . . . . . . . . . . . . . . . . . . . . 24


2.5. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.6. References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

Chapter 3. Analysis of Blockchain-based Databases in Web


Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Orhun Ceng BOZO and Rüya ŞAMLI

3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2. Background. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.1. Blockchain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.2. Blockchain types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.3. Blockchain-based web applications . . . . . . . . . . . . . . . . . . . . . . 33
3.2.4. Blockchain consensus algorithms . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.5. Other consensus algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3. Analysis stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.1. Art Shop web application . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.2. SQL-based application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.3. NoSQL-based application . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.4. Blockchain-based application . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4. Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.1. Adding records . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.2. Query. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4.3. Functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4.4. Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.6. References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

Chapter 4. Optimization and Asymptotic Analysis


of Insurance Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Ekaterina BULINSKAYA

4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2. Discrete-time model with reinsurance and bank loans . . . . . . . . . . . . . . 44
4.2.1. Model description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.2. Optimization problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.3. Model stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3. Continuous-time insurance model with dividends . . . . . . . . . . . . . . . . 48
4.3.1. Model description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3.2. Optimal barrier strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3.3. Special form of claim distribution . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.4. Numerical analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4.4. Conclusion and further research directions . . . . . . . . . . . . . . . . . . . . 55


4.5. References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

Chapter 5. Statistical Analysis of Traffic Volume in


the 25 de Abril Bridge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Frederico CAEIRO, Ayana MATEUS and Conceicao VEIGA de ALMEIDA

5.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.2. Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.3. Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.3.1. Main limit results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.3.2. Block maxima method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.3.3. Largest order statistics method. . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3.4. Estimation of other tail parameters . . . . . . . . . . . . . . . . . . . . . . 63
5.4. Results and conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.6. References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

Chapter 6. Predicting the Risk of Gestational Diabetes Mellitus through


Nearest Neighbor Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Louisa TESTA, Mark A. CARUANA, Maria KONTORINAKI and Charles SAVONA-VENTURA

6.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.2. Nearest neighbor methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.2.1. Background of the NN methods . . . . . . . . . . . . . . . . . . . . . . . . 69
6.2.2. The k-nearest neighbors method . . . . . . . . . . . . . . . . . . . . . . . . 70
6.2.3. The fixed-radius NN method. . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.2.4. The kernel-NN method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.2.5. Algorithms of the three considered NN methods. . . . . . . . . . . . . . . 72
6.2.6. Parameter and distance metric selection . . . . . . . . . . . . . . . . . . . 74
6.3. Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.3.1. Dataset description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.3.2. Variable selection and data splitting. . . . . . . . . . . . . . . . . . . . . . 75
6.3.3. Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.3.4. A discussion and comparison of results . . . . . . . . . . . . . . . . . . . . 78
6.4. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.5. References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

Chapter 7. Political Trust in National Institutions: The Significance


of Items’ Level of Measurement in the Validation of Constructs . . . . . . . 81
Anastasia CHARALAMPI, Eva TSOUPAROPOULOU, Joanna TSIGANOU and
Catherine MICHALOPOULOU

7.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
7.2. Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.2.1. Participants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.2.2. Instrument . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.2.3. Statistical analyses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.3. Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.3.1. EFA results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.3.2. CFA results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
7.3.3. Scale construction and assessment . . . . . . . . . . . . . . . . . . . . . . 91
7.4. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
7.5. Funding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.6. References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

Chapter 8. The State of the Art in Flexible Regression Models for


Univariate Bounded Responses . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Agnese Maria DI BRISCO, Roberto ASCARI, Sonia MIGLIORATI and Andrea ONGARO
8.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
8.2. Regression model for bounded responses . . . . . . . . . . . . . . . . . . . . . 101
8.2.1. Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
8.2.2. Main distributions on the bounded support . . . . . . . . . . . . . . . . . . 103
8.2.3. Inference and fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
8.3. Case studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
8.3.1. Stress data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
8.3.2. Reading data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
8.4. References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

Chapter 9. Simulation Studies for a Special Mixture Regression Model


with Multivariate Responses on the Simplex . . . . . . . . . . . . . . . . . . . 115
Agnese Maria DI BRISCO, Roberto ASCARI, Sonia MIGLIORATI and Andrea ONGARO
9.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
9.2. Dirichlet and EFD distributions . . . . . . . . . . . . . . . . . . . . . . . . . . 116
9.3. Dirichlet and EFD regression models . . . . . . . . . . . . . . . . . . . . . . . 118
9.3.1. Inference and fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
9.4. Simulation studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
9.4.1. Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
9.5. References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

Part 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

Chapter 10. Numerical Studies of Implied Volatility Expansions


Under the Gatheral Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Marko DIMITROV, Mohammed ALBUHAYRI, Ying NI and Anatoliy MALYARENKO
10.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
10.2. Asymptotic expansions of implied volatility . . . . . . . . . . . . . . . . . . . 137
10.3. Performance of the asymptotic expansions . . . . . . . . . . . . . . . . . . . 139
10.4. Calibration using the asymptotic expansions . . . . . . . . . . . . . . . . . . 141
10.4.1. A partial calibration procedure . . . . . . . . . . . . . . . . . . . . . . . . 142
10.4.2. Calibration to synthetic and market data . . . . . . . . . . . . . . . . . . 143
10.5. Conclusion and future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
10.6. References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

Chapter 11. Performance Persistence of Polish Mutual Funds:


Mobility Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Dariusz FILIP

11.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149


11.2. Literature review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
11.3. Dataset and empirical design . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
11.4. Empirical results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
11.5. Monthly perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
11.6. Quarterly perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
11.7. Yearly perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
11.8. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
11.9. References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

Chapter 12. Invariant Description for a Batch Version of the UCB Strategy
with Unknown Control Horizon . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
Sergey GARBAR

12.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163


12.2. UCB strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
12.3. Batch version of the strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
12.4. Invariant description with a unit control horizon . . . . . . . . . . . . . . . . 166
12.5. Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
12.6. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
12.7. Affiliations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
12.8. References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171

Chapter 13. A New Non-monotonic Link Function for


Beta Regressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Gloria GHENO

13.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174


13.2. Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
13.3. Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
13.4. Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
13.5. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
13.6. References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184

Chapter 14. A Method of Big Data Collection and Normalization


for Electronic Engineering Applications . . . . . . . . . . . . . . . . . . . . . . 187
Naveenbalaji GOWTHAMAN and Viranjay M. SRIVASTAVA

14.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187


14.2. Machine learning (ML) in electronic engineering . . . . . . . . . . . . . . . . 189
14.2.1. Data acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
14.2.2. Accessing the data repositories . . . . . . . . . . . . . . . . . . . . . . . . 191
14.2.3. Data storage and management . . . . . . . . . . . . . . . . . . . . . . . . 192
14.3. Electronic engineering applications – data science . . . . . . . . . . . . . . . 193
14.4. Conclusion and future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
14.5. References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

Chapter 15. Stochastic Runge–Kutta Solvers Based on Markov


Jump Processes and Applications to Non-autonomous Systems
of Differential Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
Flavius GUIAŞ
15.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
15.2. Description of the method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
15.2.1. The direct simulation method. . . . . . . . . . . . . . . . . . . . . . . . . 201
15.2.2. Picard iterations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
15.2.3. Runge–Kutta steps. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
15.3. Numerical examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
15.3.1. The Lorenz system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
15.3.2. A combustion model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
15.4. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
15.5. References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206

Chapter 16. Interpreting a Topological Measure of Complexity for


Decision Boundaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
Alan HYLTON, Ian LIM, Michael MOY and Robert SHORT
16.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
16.2. Persistent homology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
16.3. Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
16.3.1. Neural networks and binary classification . . . . . . . . . . . . . . . . . . 213
16.3.2. Persistent homology of a decision boundary . . . . . . . . . . . . . . . . 213
16.3.3. Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
16.4. Experiments and results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
16.4.1. Three-dimensional binary classification . . . . . . . . . . . . . . . . . . . 215
16.4.2. Data divided by a hyperplane. . . . . . . . . . . . . . . . . . . . . . . . . 217
16.5. Conclusion and discussion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
16.6. References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220

Chapter 17. The Minimum Renyi’s Pseudodistance Estimators for


Generalized Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
María JAENADA and Leandro PARDO
17.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
17.2. The minimum RP estimators for the GLM model: asymptotic
distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
17.3. Example: Poisson regression model . . . . . . . . . . . . . . . . . . . . . . . 230
17.3.1. Real data application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
17.4. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
17.5. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
17.6. Appendix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
17.6.1. Proof of Theorem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
17.7. References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234

Chapter 18. Data Analysis based on Entropies and Measures


of Divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
Christos MESELIDIS, Alex KARAGRIGORIOU and Takis PAPAIOANNOU
18.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
18.2. Divergence measures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
18.3. Tests of fit based on Φ−divergence measures . . . . . . . . . . . . . . . . . . 241
18.4. Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
18.5. References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254

Part 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259

Chapter 19. Geographically Weighted Regression for Official Land


Prices and their Temporal Variation in Tokyo . . . . . . . . . . . . . . . . . . . 261
Yuta KANNO and Takayuki SHIOHAMA
19.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
19.2. Models and methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
19.3. Data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
19.3.1. Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
19.3.2. Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
19.4. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
19.5. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
19.6. References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273

Chapter 20. Software Cost Estimation Using Machine


Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
Sukran EBREN KARA and Rüya ŞAMLI

20.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275


20.2. Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
20.2.1. Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
20.2.2. Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
20.2.3. Evaluating the performance of the model . . . . . . . . . . . . . . . . . . 278
20.3. Results and discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
20.4. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
20.5. References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283

Chapter 21. Monte Carlo Accuracy Evaluation of Laser


Cutting Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
Samuel KOSOLAPOV
21.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
21.2. Mathematical model of a pintograph . . . . . . . . . . . . . . . . . . . . . . . 286
21.3. Monte Carlo simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
21.4. Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
21.5. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
21.6. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
21.7. References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295

Chapter 22. Using Parameters of Piecewise Approximation by


Exponents for Epidemiological Time Series Data Analysis . . . . . . . . . . 297
Samuel KOSOLAPOV

22.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298


22.2. Deriving equations for moving exponent parameters . . . . . . . . . . . . . . 298
22.3. Validation of derived equations by using synthetic data . . . . . . . . . . . . 300
22.4. Using derived equations to analyze real-life Covid-19 data . . . . . . . . . . 302
22.5. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
22.6. References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306

Chapter 23. The Correlation Between Oxygen Consumption and


Excretion of Carbon Dioxide in the Human Respiratory Cycle . . . . . . . . 307
Anatoly KOVALENKO, Konstantin LEBEDINSKII and Verangelina MOLOSHNEVA

23.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308


23.2. Respiratory function physiology: ventilation–perfusion ratio . . . . . . . . . 309
23.3. The basic principle of operation of artificial lung ventilation devices:
patient monitoring parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
23.4. The algorithm for monitoring the carbon emissions and oxygen
consumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
23.5. Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
23.6. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
23.7. References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316

Part 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317

Chapter 24. Approximate Bayesian Inference Using the


Mean-Field Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
Antonin DELLA NOCE and Paul-Henry COURNÈDE
24.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
24.2. Inference problem in a symmetric population system . . . . . . . . . . . . . . 321
24.2.1. Example of a symmetric system describing plant competition . . . . . . 321
24.2.2. Inference problem of the Schneider system, in a more
general setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
24.3. Properties of the mean-field distribution . . . . . . . . . . . . . . . . . . . . . 325
24.4. Mean-field approximated inference. . . . . . . . . . . . . . . . . . . . . . . . 327
24.4.1. Case of systems admitting a mean-field limit . . . . . . . . . . . . . . . . 327
24.5. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
24.6. References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330

Chapter 25. Pricing Financial Derivatives in the Hull–White Model Using


Cubature Methods on Wiener Space . . . . . . . . . . . . . . . . . . . . . . . . 333
Hossein NOHROUZIAN, Anatoliy MALYARENKO and Ying NI
25.1. Introduction and outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
25.2. Cubature formulae on Wiener space . . . . . . . . . . . . . . . . . . . . . . . 335
25.2.1. A simple example of classical Monte Carlo estimates . . . . . . . . . . . 335
25.2.2. Modern Monte Carlo estimates via cubature method. . . . . . . . . . . . 336
25.2.3. An application in the Black–Scholes SDE . . . . . . . . . . . . . . . . . 338
25.2.4. Trajectories of the cubature formula of degree 5 on Wiener space . . . . 339
25.2.5. Trajectories of price process given in equation [25.7] . . . . . . . . . . 340
25.2.6. An application on path-dependent derivatives . . . . . . . . . . . . . . . 341
25.2.7. Trinomial tree (model) via cubature formulae of degree 5 . . . . . . . . . 342
25.3. Interest-rate models and Hull–White one-factor model . . . . . . . . . . . . . 343
25.3.1. Equilibrium models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
25.3.2. No-arbitrage models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
25.3.3. Forward rate models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
25.3.4. Hull–White one-factor model . . . . . . . . . . . . . . . . . . . . . . . . 345
25.3.5. Discretization of the Hull–White model via Euler scheme . . . . . . . . 346
25.3.6. Hull–White model for bond prices . . . . . . . . . . . . . . . . . . . . . . 346
25.4. The Hull–White model via cubature method. . . . . . . . . . . . . . . . . . . 349
25.4.1. Simulating SDE [25.15] and ODE [25.24] . . . . . . . . . . . . . . . . 350
25.4.2. The Hull–White interest-rate tree via iterated cubature formulae:
some examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
25.5. Discussion and future works . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
25.6. References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355

Chapter 26. Differences in the Structure of Infectious Morbidity


of the Population during the First and Second Half of
2020 in St. Petersburg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
Vasilii OREL, Olga NOSYREVA, Tatiana BULDAKOVA, Natalya GUREVA, Viktoria SMIRNOVA,
Andrey KIM and Lubov SHARAFUTDINOVA

26.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360


26.2. Materials and methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
26.2.1. Characteristics of the territory of the district . . . . . . . . . . . . . . . . 360
26.2.2. Demographic characteristics of the area . . . . . . . . . . . . . . . . . . . 360
26.2.3. Characteristics of the district medical service . . . . . . . . . . . . . . . . 361
26.2.4. The procedure for collecting primary information on cases of diseases
of the population with a new coronavirus infection . . . . . . . . . . . . . . . . . 361
26.3. Results of the analysis of the incidence of acute respiratory viral infectious
diseases, new coronavirus infection Covid-19 and community-acquired
pneumonia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362

26.4. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367


26.5. References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368

Chapter 27. High Speed and Secured Network Connectivity for Higher
Education Institutions Using Software Defined Networks . . . . . . . . . . . 371
Lincoln S. PETER and Viranjay M. SRIVASTAVA

27.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372


27.2. Existing model review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
27.3. Selection of a suitable model . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
27.4. Conclusion and future recommendations . . . . . . . . . . . . . . . . . . . . . 376
27.5. References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376

Chapter 28. Reliability of a Double Redundant System Under the


Full Repair Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
Vladimir RYKOV and Nika IVANOVA
28.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
28.2. Problem statement, assumptions and notations . . . . . . . . . . . . . . . . . 381
28.3. Reliability function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
28.4. Time-dependent system state probabilities . . . . . . . . . . . . . . . . . . . . 386
28.4.1. General representation of t.d.s.p.s . . . . . . . . . . . . . . . . . . . . . . 386
28.4.2. T.d.s.p.s in a separate regeneration period. . . . . . . . . . . . . . . . . . 387
28.5. Steady-state probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
28.6. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
28.7. References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393

Chapter 29. Predicting Changes in Depression Levels Following the


European Economic Downturn of 2008 . . . . . . . . . . . . . . . . . . . . . . . 395
Eleni SERAFETINIDOU and Georgia VERROPOULOU

29.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396


29.1.1. Aims of the study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
29.2. Data and methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
29.2.1. Sample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
29.2.2. Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
29.3. Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
29.3.1. Descriptive findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
29.3.2. Non-respondents compared to respondents at baseline (wave 2) . . . . . 403
29.3.3. Descriptive findings for respondents – analysis by gender . . . . . . . . 405
29.3.4. Findings regarding decreasing depression levels – analysis for the
total sample and by gender . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
29.3.5. Findings regarding increasing depression levels – analysis for the
total sample and by gender . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410

29.4. Discussion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413


29.5. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
29.6. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
29.7. References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415

List of Authors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419

Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425

Summary of Volume 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429


Preface

This book is a collective work with contributions by leading experts on “Data


Analysis and Related Applications: Theory and Practice”.

The field of data analysis has grown enormously over recent decades due to the
rapid growth of the computer industry, the continuous development of innovative
algorithmic techniques and recent advances in statistical tools and methods. Due to
the wide applicability of data analysis, a collective work is always needed to bring
all recent developments in the field, from all areas of science and engineering, under
a single umbrella.

The contributions to this collective work are by a number of leading scientists,


analysts, engineers, demographers, health experts, mathematicians and statisticians
who have been working on the front end of data analysis. The chapters included in
this collective volume represent a cross-section of current concerns and research
interests in the scientific areas mentioned. The material is divided into four parts
and 29 chapters in a form that will provide the reader with both methodological and
practical information on data analytic methods, models and techniques, together
with a wide range of appropriate applications.

Part 1 focuses mainly on computational data analysis and related fields, with
nine chapters covering machine learning algorithms, web applications, spatial
analysis, multivariate regression, factor analysis, mixture models, non-parametric
techniques and tail distributions.

Part 2 focuses mainly on stochastic and algorithmic data analysis and related
fields, with nine chapters covering volatility, calibration, segmentation, Markov
chains, genetic algorithms, classification algorithms, batch processing, entropies and
pseudodistances.

Part 3 focuses mainly on applied statistical data analysis and related fields, with
five chapters covering spatial statistics, Monte Carlo methods, machine learning
methods, time series analysis and gas analysis.

Part 4 focuses mainly on economic and numerical data analysis and related
fields, with six chapters covering economic downturn, cyber systems, morbidity,
fixed-income market, Bayesian inference and reliability analysis.

Konstantinos N. ZAFEIRIS
Yiannis DIMOTIKALIS
Christos H. SKIADAS
Alex KARAGRIGORIOU
Christiana KARAGRIGORIOU-VONTA

April 2022
PART 1

1

Performance of Evaluation of Diagnosis of Various Thyroid Diseases
Using Machine Learning Techniques

Thyroid cancer is the second most prevalent cancer type among women in
Turkey. The number of people diagnosed with thyroid cancer in the United States in
2021 is estimated as 44,280, according to the report published by the American
Cancer Society. The risk of thyroid cancer can be reduced by early diagnosis and
treatment. This study is focused on predicting five different thyroid diseases, based
on various symptoms and reports of the thyroid. Several machine learning
algorithms, such as support vector machine, k-nearest neighbors, artificial neural
network and decision tree are used for diagnosis of various thyroid diseases, and
their classification performances are compared with each other. For this purpose,
a thyroid disease dataset gathered from the Department of Nuclear Medicine and
Endocrinology in Istanbul University-Cerrahpaşa Faculty of Medicine was used.

1.1. Introduction

According to Feigenbaum, the pioneer of Artificial Intelligence, “an expert


system, which is one of the branches of Artificial Intelligence, is an intelligent
computer program that uses knowledge and inference procedures to solve problems
that are difficult enough to require significant human expertise for their solution”
(Bursuk 1999). The first expert system is DENDRAL, developed by chemist Joshua
Lederberg to describe chemical molecular structures in 1965. Since then, the
spectrum of artificial intelligence and expert systems especially has expanded with
technological developments (Bursuk 1999; Nohria 2015).

Chapter written by Burcu Bektas GÜNEŞ, Evren BURSUK and Rüya ŞAMLI.

Medicine occupies a large place in artificial intelligence and expert systems: these
tools are used to diagnose diseases such as cancer, which can have serious
consequences and even lead to death, at an early stage and to apply the right
treatment method to patients, so that patients can lead a quality life and their
survival rates increase (Bursuk 1999; Nohria 2015).

There are four basic steps in the decision-making process providing diagnosis in
medicine. These are: cue acquisition, hypothesis generation, cue interpretation and
hypothesis evaluation. In modern times, the wide variety of diseases (differential
diagnosis), complicated disease states (the presence of more than one disease in the
same person), selectivity in perception, variety/size of medical data, insufficient
time allocated to the evaluation processes and the need for these processes to be
done in a limited time are all factors that may cause errors in the steps of this
decision-making process. Physical or emotional changes due to human nature such
as stress, fatigue, distraction, illness or inexperience can also increase the likelihood
of these diagnostic errors. Considering today’s technology, various computer-aided
systems are used to reduce these errors, and a new one is added to these systems
every day (Bursuk 1999; Nohria 2015). In addition, machine learning (ML), another
branch of artificial intelligence, is increasingly used in recently designed programs
and across an ever wider range of applications.

There are a number of research works on the classification of thyroid diseases in


the literature. Wang et al. proposed a deep learning-based method to diagnose
benign- or malignant-type thyroid nodules using ultrasound images. They compared
the radiomics and deep learning-based approaches. Deep learning turned out to be
the best approach (Wang et al. 2020). Godara and Kumar used logistics regression
and support vector machine (SVM) ML techniques to analyze the thyroid dataset.
They compared these two algorithms based on precision, recall, F-measure, receiver
operating characteristic curve (ROC) and root-mean-square (RMS) error. Logistic
regression turned out to be the best classifier (Godara and Kumar 2018). Obeidavi
et al. proposed a neural network-based method to diagnose the types of thyroid
disease. In this research, the dataset consisting of T3UR, FTI, FT4, FT3, T4, T3 and
TSH was conducted on 244 subjects. The results of this research indicated that, by
hormone tests and using neural networks, various types of thyroid diseases can be
diagnosed and the neural network provides almost 100% correct answers (Reza
Obeidavi et al. 2017).

In this study, we explored the use of machine learning methodology for the
automatic classification of thyroid diseases using 10 attributes. We used the private
dataset that contains the information of 130 patients from the Department of Nuclear
Medicine and Endocrinology in Istanbul University-Cerrahpaşa Faculty of
Medicine, Turkey (IUC). After pre-processing stages, the data were trained by
adapting most of the ML algorithms to our data. Results of this research indicated

that by using all the findings (physical examination, laboratory findings and
radiologic findings) together, various types of thyroid disease can be diagnosed and
the ML provides almost 100% correct answers.

1.2. Data understanding

This research was carried out using physical examination, laboratory findings
and radiologic findings, depicted in Table 1.1. Data were obtained from IUC after
the Ethical Committee’s approval.

This dataset contains 10 attributes of 130 patients. Each measurement vector


consists of 10 values – seven attributes are binary and three attributes are
continuous. The binary and continuous attribute values are mapped to zero and one,
where zero refers to false (normal) and one refers to true (abnormal).

Attribute                               Domain              Mapped domain

Physical examination
Hypothyroid findings                    [0, 1]              [0, 1]
Hyperthyroid findings                   [0, 1]              [0, 1]
Ophthalmopathy                          [0, 1]              [0, 1]
Past viral inflammation                 [0, 1]              [0, 1]
Goiter                                  [0, 1]              [0, 1]
Bilateral                               [0, 1]              [0, 1]

Laboratory findings
Thyroid-stimulating hormone (TSH)       <0.0002 to >100     [0, 1]
Triiodothyronine (TT3)                  0.532 to >800       [0, 1]
Total thyroxin (TT4)                    0.2 to >30          [0, 1]

Radiologic findings
Nodular thyroid                         [0, 1]              [0, 1]

Table 1.1. Dataset attribute description

This dataset covers five diseases: Plummer disease, toxic multi-nodular goiter,
Hashimoto’s disease, Graves’ disease and subacute thyroiditis. In this multi-class
setting, the number of cases per target class is seven for Plummer disease, 40 for
toxic multi-nodular goiter, 32 for Hashimoto’s disease, 48 for Graves’ disease and
three for subacute thyroiditis, as shown in Figure 1.1.

Figure 1.1. Class visualization for the whole dataset. For a color
version of this figure, see www.iste.co.uk/zafeiris/data1.zip

1.3. Modeling

For the five diseases, analyses were performed using machine learning methods:
SVM, k-nearest neighbors (KNN), artificial neural network (ANN) and decision
tree (DT). For each algorithm, fivefold cross-validation was used to evaluate
performance on the dataset. In this method, the dataset is divided into five equal
parts; in each round, one part is held out for testing and the remaining four parts
are used as training data.
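As a rough sketch of this protocol (not the chapter's original implementation;
scikit-learn and the placeholder arrays below are assumptions, since the IUC dataset
is private), fivefold cross-validation can be written as follows:

# Illustrative sketch of fivefold cross-validation; X and y are hypothetical
# stand-ins for the 130-patient, 10-attribute IUC dataset.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.random((130, 10))                # placeholder attribute matrix
y = rng.integers(0, 5, size=130)         # placeholder labels for the 5 diseases

scores = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    model = SVC(kernel="linear").fit(X[train_idx], y[train_idx])   # four folds train
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))  # one fold tests
print("mean accuracy over the five folds:", np.mean(scores))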

The accuracy metric in equation [1.1], the precision metric in equation [1.2], the
recall metric in equation [1.3] and F-measure metric in equation [1.4] are widely
used for model performance. In this study, accuracy was selected as the model
performance evaluation metric.

Accuracy = (TP + TN) / (TP + TN + FP + FN) [1.1]

Precision = TP / (TP + FP) [1.2]

Recall = TP / (TP + FN) [1.3]

F-measure = 2 ∗ (Precision ∗ Recall) / (Precision + Recall) [1.4]

True positive (TP): the number of samples whose true label is positive and that the
classifier also predicts as positive. True negative (TN): the number of samples whose
true label is negative and that the classifier predicts as negative. False positive (FP):
the number of samples whose true label is negative but that the classifier incorrectly
predicts as positive. False negative (FN): the number of samples whose true label is
positive but that the classifier incorrectly predicts as negative (Bulut et al. 2020).
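As a worked illustration of equations [1.1]–[1.4] (on small hypothetical label vectors,
not the study's outputs), the four counts and the metrics can be computed as:

# Hypothetical binary example of equations [1.1]-[1.4]; 1 = positive, 0 = negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))   # true positives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))   # true negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))   # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))   # false negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)                    # equation [1.1]
precision = tp / (tp + fp)                                    # equation [1.2]
recall = tp / (tp + fn)                                       # equation [1.3]
f_measure = 2 * precision * recall / (precision + recall)     # equation [1.4]
print(accuracy, precision, recall, f_measure)                 # all equal 0.75 here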

SVM, KNN, ANN and DT were selected as the classification models.

SVM is a supervised machine learning algorithm used for both classification and
regression problems, and is most often applied to classification. Each data item is
plotted as a point in n-dimensional space, with the value of each feature being the
value of a particular coordinate. Classification then takes place by finding the
hyper-plane that best separates the classes (Razia et al. 2018; Raisinghani et al.
2019; Dharmarajan et al. 2020).
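A minimal sketch of this idea (scikit-learn's SVC on hypothetical two-dimensional
points, not the chapter's actual configuration):

# Toy example: a linear SVM finds the separating hyper-plane w.x + b = 0.
from sklearn.svm import SVC

X = [[0, 0], [1, 1], [1, 0], [3, 3], [4, 3], [3, 4]]   # hypothetical points
y = [0, 0, 0, 1, 1, 1]                                 # two classes

svm = SVC(kernel="linear").fit(X, y)
print(svm.coef_, svm.intercept_)               # hyper-plane parameters w and b
print(svm.predict([[0.5, 0.5], [3.5, 3.5]]))   # -> [0 1]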

KNN is a simple, supervised machine learning algorithm that can be used to solve
both classification and regression problems. The algorithm classifies a case by a
majority vote of its neighbors, assigning the case to the class most common among
its k nearest neighbors, as measured by a distance function. If k = 1, then the case is
simply assigned to the class of its nearest neighbor. These distance measures are
valid for continuous variables (Dharmarajan et al. 2020). In this study, the k value
was taken as 3.
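A minimal sketch with k = 3 (hypothetical one-dimensional data; scikit-learn's
default Euclidean distance stands in for the distance function mentioned above):

# Toy k-NN example with k = 3: each query point takes the majority class
# among its three nearest training neighbors.
from sklearn.neighbors import KNeighborsClassifier

X = [[0], [1], [2], [8], [9], [10]]      # hypothetical training points
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[1.5], [8.5]]))       # -> [0 1]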

ANN is a well-known artificial intelligence technique for solving problems that are
difficult for human beings or conventional computational algorithms to solve
(Hameed 2017). An ANN can learn and adjust itself to solve different nonlinear
problems by modifying certain weights during the training process with offline data.
Many ANN architectures exist; the fundamental ones are single-layer feedforward,
multi-layer feedforward and recurrent (Haykin and Haykin 2009). In this study, a
multi-layer feedforward ANN is used to recognize the type of thyroid disease. After
different trials, four hidden layers (h = 4) and a learning rate (lr) of 0.3 gave the best
results, so a four-hidden-layer structure was adopted. Back propagation is used as
the learning algorithm to train the ANN. First, the synaptic weights are initialized
with random values. Then, at each iteration of the back propagation algorithm, one
input sample is applied to the ANN to produce the actual output, and the error
between the actual output and the desired output is computed. Depending on this
error, the synaptic weights are updated to minimize it (Hameed 2017).
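A hedged sketch of such a network (four hidden layers and a learning rate of 0.3, as
quoted above; the layer widths, the SGD solver and the placeholder data are
assumptions, not the chapter's exact setup):

# Illustrative multi-layer feedforward ANN trained by back propagation (SGD);
# hidden_layer_sizes gives four hidden layers, learning_rate_init is lr = 0.3.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.random((130, 10))                # placeholder for the 10 attributes
y = rng.integers(0, 5, size=130)         # placeholder for the 5 disease classes

ann = MLPClassifier(hidden_layer_sizes=(10, 10, 10, 10),   # h = 4 hidden layers
                    learning_rate_init=0.3,                # lr = 0.3
                    solver="sgd", max_iter=2000, random_state=0)
ann.fit(X, y)
print("training accuracy:", ann.score(X, y))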

DT is one of the most important classification and prediction methods in supervised
learning. A decision tree classifier has a tree-type structure that provides stability
and high accuracy. Decision trees are formed of two elements, nodes and leaves:
nodes test a particular attribute, and leaves represent a class. The DT algorithm
commonly uses the Gini index, information gain, chi-square or reduction in variance
to choose a split (Raisinghani et al. 2019; Chaubey et al. 2021). In this study, the J48
decision tree algorithm was used.
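J48 is WEKA's implementation of the C4.5 algorithm; a rough analogue (an
approximation, not an exact equivalent) is scikit-learn's entropy-based decision tree,
sketched here on hypothetical data:

# Approximate stand-in for J48: a decision tree split on information gain (entropy).
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 1], [1, 1], [0, 0], [1, 0]]     # hypothetical binary findings
y = [1, 1, 0, 0]                         # hypothetical class labels

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=["finding_a", "finding_b"]))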

1.4. Findings

The performance of the models is assessed using the accuracy metric. The results
are shown in Table 1.2 and Figure 1.2. The SVM algorithm achieved 100%
performance. Figure 1.2 shows the accuracy performances of the ML algorithms
compared with each other.

Algorithm used Accuracy


SVM 1
ANN (h = 4, lr = 0.3) 0.992
KNN (k = 3) 0.9769
Decision tree (J48) 0.9923

Table 1.2. Result analysis

[Figure: bar chart comparing the accuracy (vertical axis, from 0.94 to 1.02) of SVM,
ANN, KNN and Decision Tree.]

Figure 1.2. Accuracy comparison. For a color version
of this figure, see www.iste.co.uk/zafeiris/data1.zip

The confusion matrix is used to evaluate the effectiveness of the classification


model. The matrix compares the actual target values with the predictions of the

machine learning algorithm. The confusion matrix of our dataset is obtained as


shown in Figures 1.3–1.6.
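A brief sketch of how such a matrix is tabulated (with hypothetical labels standing in
for the five disease classes, not the study's predictions):

# Rows are true labels and columns are predicted labels, in the order of `classes`.
from sklearn.metrics import confusion_matrix

classes = ["Graves", "Hashimoto", "Subacute", "Toxic MNG", "Plummer"]
y_true = ["Graves", "Graves", "Hashimoto", "Toxic MNG", "Plummer", "Subacute"]
y_pred = ["Graves", "Graves", "Hashimoto", "Toxic MNG", "Plummer", "Plummer"]

print(confusion_matrix(y_true, y_pred, labels=classes))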

                               Predicted label
True label                   Graves’   Hashimoto’s   Subacute      Toxic multi-     Plummer
                             disease   disease       thyroiditis   nodular goiter   disease
Graves’ disease                 48          0             0               0             0
Hashimoto’s disease              0         32             0               0             0
Subacute thyroiditis             0          0             3               0             0
Toxic multi-nodular goiter       0          0             0              40             0
Plummer disease                  0          0             0               0             7

Figure 1.3. Confusion matrix for SVM

                               Predicted label
True label                   Graves’   Hashimoto’s   Subacute      Toxic multi-     Plummer
                             disease   disease       thyroiditis   nodular goiter   disease
Graves’ disease                 48          0             0               0             0
Hashimoto’s disease              0         32             0               0             0
Subacute thyroiditis             0          0             2               0             1
Toxic multi-nodular goiter       0          0             0              40             0
Plummer disease                  0          0             0               0             7

Figure 1.4. Confusion matrix for ANN

                               Predicted label
True label                   Graves’   Hashimoto’s   Subacute      Toxic multi-     Plummer
                             disease   disease       thyroiditis   nodular goiter   disease
Graves’ disease                 48          0             0               0             0
Hashimoto’s disease              0         32             0               0             0
Subacute thyroiditis             3          0             0               0             0
Toxic multi-nodular goiter       0          0             0              40             0
Plummer disease                  0          0             0               0             7

Figure 1.5. Confusion matrix for KNN



                               Predicted label
True label                   Graves’   Hashimoto’s   Subacute      Toxic multi-     Plummer
                             disease   disease       thyroiditis   nodular goiter   disease
Graves’ disease                 48          0             0               0             0
Hashimoto’s disease              0         32             0               0             0
Subacute thyroiditis             0          0             2               0             1
Toxic multi-nodular goiter       0          0             0              40             0
Plummer disease                  0          0             0               0             7

Figure 1.6. Confusion matrix for DT (J48)

1.5. Conclusion

In this study, we explored the use of machine learning methodologies for the
automatic classification of thyroid diseases using 10 attributes. We used the private
dataset that contains the information of 130 patients from IUC. After pre-processing
stages, the data were trained by adapting most of the ML algorithms to our data. The
results of this research indicated that by using all the findings (physical examination,
laboratory findings and radiologic findings) together, various types of thyroid
disease can be diagnosed and the ML provides almost 100% correct answers. The
IUC dataset was sufficiently differentiated according to the disease for which it was
labeled. For this reason, ML algorithms have shown very high performances.
Overfitting was not observed. This system can be developed by using a larger and
more balanced dataset. Further development can be done by using image processing
of ultrasonic scanning of thyroid images to predict thyroid nodules, which cannot be
recognized in laboratory findings.

1.6. References

Bulut, B., Kalın, V., Güneş, B.B., Khazhin, R. (2020). Deep learning approach for detection
of retinal abnormalities based on color fundus images. 2020 Innovations in Intelligent
Systems and Applications Conference, 1–6, Istanbul, 15–17 October 2020.
Bursuk, E. (1999). A diagnostic expert system for cardiological, respiratory, vascular and
hematological diseases. Master’s thesis, Institute of Biomedical Engineering, Bosphorus
University, Istanbul.
Chaubey, G., Bisen, D., Arjaria, S., Yadav, V. (2021). Thyroid disease prediction using
machine learning approaches. Natl. Acad. Sci. Lett., 44(3), 233–238.

Dharmarajan, K., Balasree, K., Arunachalam, A.S., Abirmai, K. (2020). Thyroid disease
classification using decision tree and SVM. Indian J. Public Health Res. Dev., 11, 229.
Godara, S. and Kumar, S. (2018). Prediction of thyroid disease using machine learning
techniques. International Journal of Electronics Engineering, 10(2), 787–793.
Hameed, M.A. (2017). Artificial neural network system for thyroid diagnosis. Eng. Sci.,
11(25), 518–528.
Haykin, S.S. and Haykin, S.S. (2009). Neural Networks and Learning Machines, 3rd edition.
Prentice Hall, New York.
Nohria, R. (2015). Medical expert system – A comprehensive review. Int. J. Comput. Appl.,
130(7), 44–50.
Raisinghani, S., Shamdasani, R., Motwani, M., Bahreja, A., Raghavan Nair Lalitha, P. (2019).
Thyroid prediction using machine learning techniques. In ICACDS 2019: Advances in
Computing and Data Sciences, Singh, M., Gupta, P., Tyagi, V., Flusser, J., Ören, T.,
Kashyap, R. (eds). Springer, Singapore.
Razia, S., Swathi Prathyusha, P., Krishna, N.V., Sumana, N. (2018). A comparative study of
machine learning algorithms on thyroid disease prediction. International Journal of
Engineering & Technology, 7(2.8), 315–319.
Reza Obeidavi, M., Rafiee, A., Mahdiyar, O. (2017). Diagnosing thyroid disease by neural
networks. Biomed. Pharmacol. J., 10(2), 509–524.
Wang, Y., Yue, W., Li, X., Liu, S., Guo, L., Xu, H., Zhang, H., Yang, G. (2020). Comparison
study of radiomics and deep learning-based methods for thyroid nodules classification
using ultrasound images. IEEE Access, 8, 52010–52017.
2

Exploring Chronic Diseases’ Spatial Patterns:
Thyroid Cancer in Sicilian Volcanic Areas

Spatial analyses of infectious diseases have a long tradition, and with the
contemporary increase in the incidence of chronic and degenerative diseases,
considerable interest has emerged in the geography of these non-infectious
pathologies and their environmental correlates.
variations in the prevalence of thyroid cancer, taking into account the demographic
heterogeneity in the at-risk population at the small-area level.

This work aims to enhance the existing research surrounding thyroid incidence in
volcanic areas by analyzing spatial patterns of thyroid cancer cases in Mount Etna’s
area, in the eastern part of Sicily. It is known from the medical literature that several
constituents of volcanic lava and ashes, such as radioactive and heavy metals, are
involved in the pathogenesis of thyroid cancer via the biocontamination of
atmosphere, soil and aquifers. Here, we exploit a unique dataset that allowed us to
geocode the geographic location of cases at the household level, whereas all studies
that we are aware of use aggregated data. Applying the local Moran’s I statistic as a
means for detecting spatial clustering, we aimed to disentangle the spatial
aggregation of thyroid cancer cases due to the proximity to a volcanic area from that
due to the geographic variations in the density of the population at risk and other
concomitant environmental risk factors.

Chapter written by Francesca BITONTI and Angelo MAZZA.


For a color version of all the figures in this chapter, see www.iste.co.uk/zafeiris/data1.zip.


Our preliminary findings seem to confirm a vast empirical literature that has
revealed an increased thyroid cancer incidence in volcanic areas, such as Iceland,
Hawaii and the Philippines, where intense basaltic volcanic activity has also long
been detected; furthermore, parts of the Etna volcanic area seem to be more affected
than others.

2.1. Introduction

At the end of the 18th century, Dr. Valentine Seaman mapped yellow fever cases
in New York and thus succeeded in highlighting a possible correlation between the
sites of various dumps and the location of the cases (Stevenson 1965). About
60 years later, John Snow came up with the idea of creating a map of the cholera cases that were plaguing Soho (London) at the time, and he realized that the epidemic originated from a specific public fountain. By closing the fountain, he managed to stop the infection (Snow 1855; Walter 2000). These are just two of the
first attempts to use cartography as a tool to provide epidemiological information.
From that time on, geographic maps have increasingly been adopted as a traditional
tool to visualize the spatial distribution of diseases in the field of health. In general,
considerable effort has been devoted to the development of geographic information
systems (GIS) that facilitate the understanding of public health problems and foster
collaboration between physicians, epidemiologists and geographers to map and
predict disease risk (Croner et al. 1996). As a result of the epidemiological transition, the long tradition of using geographic techniques for the analysis of infectious diseases has been extended to the study of the geographic distribution of chronic diseases such as cancer and various types of heart disease (Ghosh et al.
1999; Wakefield 2007). There are many environmental risk factors included among
the possible concurrent causes of non-infectious pathologies, and geographical
representations constitute a valid tool for conducting exploratory analyses on the
spatial distribution of cases. In particular, May (1950) emphasized how a disease is
the product of the interaction between pathological factors (such as vectors and
genetic causes) and geographical factors acting on a physical, biological and social
level.

To date, many epidemiological studies suggest that the etiology of thyroid cancer
(TC) includes the presence of an active volcano among several factors such as the
technological improvement of screening systems, iodine consumption and others
(Marcello et al. 2014; Vigneri et al. 2015). TC is the most widespread endocrine
neoplasm, whose incidence has grown steadily around the world in recent decades
(Curado et al. 2007; Kilfoy et al. 2009; Fitzmaurice et al. 2015; Liu et al. 2017).

An extremely high incidence of TC was found in Hawaii (Goodman et al. 1988;


Kolonel et al. 1990; Hawai’i Tumor Registry 2019), Iceland (Arnbjörnsson et al.
1986; Hrafnkelsson et al. 1989; Bray et al. 2017), the Philippines (Duntas and
Doumas 2009; Caguioa et al. 2019) and Sicily (Pellegriti et al. 2009; Malandrino
et al. 2013; Vigneri et al. 2017); all regions whose common denominator is the
presence of active volcanoes (Duntas and Doumas 2009). Although the underlying
causes of the progressive increase in the incidence of TC are still poorly defined and
greatly debated, many studies have suggested a potential relationship between
volcanic activity and the increase in the incidence of TC. Kung et al. (1981)
analyzed data of cancer registries from various areas, including Hawaii and Iceland,
and identified elements present in volcanic gases as plausible etiological agents of
TC. The research by Goodman et al. (1988) showed that the incidence of TC among
Hawaiian residents was higher than that of people of the same ethnic group but
residing elsewhere. This result supports the idea that environmental risk factors,
such as the volcanic nature of the territory, can play a critical role in increasing the
risk of TC. The same phenomenon of increased risk of TC emerged in other
volcanic areas such as the Vesuvius area in Campania (Biondi et al. 2019), New
Caledonia (Truong et al. 1985; Bray et al. 2017) and French Polynesia (Curado et al.
2007). The area around Mount Etna, in Sicily, was recently monitored because the
figures of the Cancer Registry of Eastern Sicily (CRES) showed that the incidence of TC in the vicinity of the volcano is double that recorded for the Sicily region as a whole (Pellegriti et al. 2009). Several analyses of Sicilian data have
found a possible association between the volcanic environment and the increased
risk of TC in the proximity of Mount Etna (Vigneri et al. 2015; Malandrino et al.
2016).

All the aforementioned studies reinforce the hypothesis of a volcano–TC


relationship but lack a geographical approach to analyze the phenomenon of interest.
The epidemiological data of volcanic areas require, in our opinion, a geographical
investigation capable of offering a new vision of the risk of TC. When mentioning
the concepts of proximity and spatial variations, we cannot neglect the geographical
tools and approaches of spatial statistics. Our work represents an attempt to fill this
gap in the literature, introducing the geographical perspective in the study of the
distribution of TC in space. In particular, after having georeferenced the data from
CRES using the Google Maps Geocoding API interface, we have created maps
describing the risk of TC at the census tract level in the provinces of Messina,
Catania, Enna and Siracusa during the period 2003–2016. The chosen risk indicator
is the standardized incidence ratio (SIR), calculated by indirect standardization.
Using additional maps, we have shown which sections record an increase in
incidence compared to the expected one, which is statistically significant.
Subsequently, to evaluate the presence of clusters of high-risk areas we applied the

local Moran’s I index. The local Moran’s I statistic is able to detect the presence of
spatial autocorrelation at the level of sub-areas, which may not emerge at the global
level. Although TC case maps and cluster analysis cannot prove the causal
mechanisms underlying the investigated phenomenon, we rely on these
methodologies to provide further evidence regarding the volcano–TC relationship
and to support decision-making in the public health sector. Our results show the
presence of areas of greater risk that would suggest a possible effect of proximity to
Mount Etna and also to Mount Vulcano, although the latter is far less active than the former. Nevertheless, given the exploratory
contribution of our work, a more in-depth study is required to gain a greater
understanding of the phenomenon.

This work is organized as follows: the second section describes the available
data and the salient features of the area under analysis; the third section reports the
methodology applied, with particular mention of SIR and local Moran’s I index; the
fourth section illustrates and discusses the distribution of TC in the eastern part of
Sicily and shows the presence of clusters of high- and low-risk areas; and the fifth
and last section summarizes and concludes the work.

2.2. Epidemiological data and territory

TC is the most widespread endocrine neoplasm in the world and has been
increasing steadily in recent decades (Curado et al. 2007; Kilfoy et al. 2009;
Fitzmaurice et al. 2015). Incidence rates significantly higher than the national
averages were recorded in various volcanic areas such as the area that we consider in
this work, eastern Sicily. This area includes four provinces: Messina, Catania, Enna
and Siracusa. The volcanic area that refers to Mount Etna, the highest active
European volcano, is located in the province of Catania but involves some other
areas of the southern province of Messina. Pellegriti et al. (2009) actually report a
considerable increase in the incidence rate of TC compared to the Italian average,
especially in the province of Catania. The Sicilian TC incidence figures are made public in the Health Atlas of Sicily, published by the Department for Health Activities and Epidemiological Observatory (Regional Health Department 2016). Table 2.1 shows the TC incidence rate for the provinces of eastern Sicily (calculated for the period 2003–2011 by standardization on the new European population, per 100,000 inhabitants), disclosed in the Health Atlas. The rate is always higher for
women than men, as known in the literature, and higher than the regional value in
the provinces of Catania and Messina, for both sexes.

Several studies have revealed, over time, the presence of high levels of heavy
metals in the volcanic area, as a result of the continuous emissions of gas (mainly
composed of gases such as CO2 and SO2), ash and lava by Mount Etna (Buat-Ménard
and Arnold 1978; Cimino and Ziino 1983; Caltabiano et al. 2004; Andronico et al.
2009; D’Aleo et al. 2016). Such heavy metals include among others arsenic,
cadmium, chromium, cobalt, mercury, tungsten and zinc which, in high
concentrations, could contaminate soil, water and the atmosphere, eventually
entering the food chain (Vigneri et al. 2017). These works indicate that the presence
of an active volcano could contaminate the surrounding area through the repeated
emissions leading to potential repercussions for human health.

Area       Males   Females
Catania    10.6    35.5
Enna        7.7    25.5
Messina     9.6    29.1
Siracusa    5.2    18.9
Sicily      7.7    25.6

Table 2.1. TC age-standardized incidence rates (per 100,000, 2003–2011) by geographical area. Source: Health Atlas of Sicily, 2016

The territory of these provinces is heterogeneous and includes the volcanic area
as well as urban, rural and industrial regions (Istat 2013). As a result, the resident
population and the cases of TC are distributed in a non-homogeneous way according
to the characteristics of the urban morphology and of the natural environment
(Figure 2.1).

The analyzed cases of TC were recorded by CRES and refer to individuals


residing in the four provinces of interest, aged between 5 and 95 years, and who
manifested the disease in the period 2003–2016. The residential addresses of
individuals have been geocoded using the Google Maps Geocoding API interface
(https://cloud.google.com/maps-platform/). The data concerning the population
residing in the same provinces, on the other hand, come from the 15th General
Population Census carried out by Istat (Italian National Institute of Statistics) in
2011.

Figure 2.1. Spatial arrangement of resident population. Source: 15th Italian General Population Census

2.3. Methodology

2.3.1. Spatial inhomogeneity and spatial dependence

The spatial distribution of cancer can be represented through a planar point


process, displayed as a series of points on a map in which the points, strictly called
“events”, represent precisely the cancer cases. The probability of finding a cancer
case changes according to the geographical distribution of the population and to the
presence of environmental risk factors. It is well known that the population is not
uniformly spread over the territory but is concentrated in localized densely
populated urban areas, leaving large rural and mountain areas mostly deserted. The
morphology of the territory can also present considerable differences even between

neighboring areas, such as the presence of volcanic areas adjacent to coastal and
plain areas. Therefore, the expected risk of cancer will be higher where the population at risk is large and environmental risk factors are present. Conversely, the risk will be relatively lower in sparsely populated areas or where the natural causes of risk are absent.

The variability in the distribution of tumor events is described by the non-homogeneous Poisson point process. In this model, the number of events N(U) in a given area U ⊆ R, where R is the entire study region, follows a Poisson distribution with variable spatial intensity λ(u). Therefore, the expected number of events is

$$E[N(U)] = \int_U \lambda(u)\,du.$$
In this case, it is possible that neighboring areas with similar population density
or in the presence (absence) of other risk factors, give rise to actual clusters of high,
medium and low risk of TC. The analysis of the similarity of the attributes of nearby
geographic areas is generally part of the study of spatial autocorrelation, which
evaluates the spatial distribution of a particular process in terms of relationships,
mutual influences and distance (Cressie 1991; Anselin and Rey 2010; Borruso and
Murgante 2012).
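As an aside, a minimal sketch may help visualize the inhomogeneous Poisson model described above. The Python code below (not from the chapter; the intensity surface is purely hypothetical) simulates such a process by thinning and compares the number of simulated events with the expected value $E[N(R)] = \int_R \lambda(u)\,du$.

```python
# Illustrative sketch (not from the chapter): simulating an inhomogeneous Poisson
# point process on a unit square by thinning, with a hypothetical intensity lambda(u).
import numpy as np

rng = np.random.default_rng(0)

def intensity(x, y):
    # Hypothetical intensity surface: events concentrate around (0.3, 0.7),
    # mimicking a densely populated urban area.
    return 200.0 * np.exp(-((x - 0.3) ** 2 + (y - 0.7) ** 2) / 0.05)

lam_max = 200.0                                   # upper bound for lambda(u)
n_candidates = rng.poisson(lam_max * 1.0)         # homogeneous process on the unit square
xs, ys = rng.uniform(size=n_candidates), rng.uniform(size=n_candidates)

keep = rng.uniform(size=n_candidates) < intensity(xs, ys) / lam_max
events = np.column_stack([xs[keep], ys[keep]])    # retained points follow lambda(u)

# Expected number of events in the whole region, approximated on a grid
# (grid mean of lambda times the area of the unit square).
gx, gy = np.meshgrid(np.linspace(0, 1, 200), np.linspace(0, 1, 200))
expected = intensity(gx, gy).mean()
print(len(events), "simulated events;", round(expected, 1), "expected")
```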

2.3.2. Standardized incidence ratio (SIR)

The risk of TC was represented through the production of maps showing the
spatial distribution, for each census tract, of the standardized incidence ratio (SIR).
The SIRs were calculated for each inhabited census tract by indirect standardization
(Waller and Gotway 2004, pp. 12–15), using the incidence rate of TC observed in
the same period (2003–2016) in the whole of eastern Sicily. SIR is the ratio between
observed TC cases and expected TC cases in each census tract i:

$$SIR_i = \frac{O_i}{E_i},$$

where $O_i$ is the number of cases observed for census tract i and $E_i$ is the number of cases expected in the same census tract i. The number of expected cases is calculated as the product of the population at risk (and therefore the entire resident population) in the given census tract i and the general incidence rate for the entire investigated area:

$$E_i = P_i \, r_+,$$

where $P_i$ is the population at risk in the specific census tract i and $r_+$ is the general incidence rate of TC, calculated for the four provinces of interest as a whole, as

$$r_+ = \frac{O_+}{P_+},$$

where $O_+$ corresponds to the number of cases of TC observed and $P_+$ is the resident population in the whole of eastern Sicily. The subscript + indicates that the variables are calculated for the totality of the study area. Hence, it follows that the SIR of a single census tract is thus calculated as

$$SIR_i = \frac{O_i}{P_i \, r_+} = \frac{O_i / P_i}{O_+ / P_+}.$$
When the characteristics of the population determine a subdivision into strata with different risk levels, it is necessary to give a proper weight to each stratum based on its own specific risk. In this case, instead of calculating a general rate of incidence $r_+$ for the entire reference area, a different rate is calculated for each stratum j as $r_j = \sum_i O_{ij} / \sum_i P_{ij}$. Hence, the expected number of cases in census tract i is given by $E_i = \sum_j P_{ij}\, r_j$. When the number of expected cases Ei is very low, as in the case
of many types of tumor, it is generally assumed that the number of observed cases Oi
comes from a Poisson distribution with mean θi Ei, where θi is the relative risk of the
section i. Therefore, the relative risk of a specific census section equal to 1 implies
that this risk is equal to the risk of the entire reference area. It is therefore of interest
to locate the areas in which the relative risk, estimated by SIR, is greater than 1 and
therefore greater than expected (Banerjee et al. 2004, pp. 150–152; Bivand et al.
2008, pp. 320–323). By exploiting the fact that the TC cases follow a Poisson distribution, it is possible to construct exact 95% confidence intervals for the SIRs, using the “pois.exact” function of the “epitools” package in the R software (R Core Team 2014). The exact method was preferred to the normal approximation since the number of cases observed in many census sections was found to be small. In fact, when the number of observed cases Oi is low, the Poisson distribution is strongly asymmetric and therefore cannot be approximated by a normal distribution (Breslow and Day 1987).
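The chapter performs these computations in R with the “pois.exact” function of the “epitools” package. Purely as an illustration, the following Python sketch reproduces the same steps on hypothetical data: stratum-specific rates, expected cases by indirect standardization, the SIR and exact (Garwood) Poisson confidence intervals obtained through the chi-square relationship. The column names and input values are assumptions.

```python
# Hedged Python sketch of the SIR computation with exact Poisson confidence
# intervals (the chapter uses R and epitools::pois.exact; this is an equivalent
# illustration on hypothetical data).
import numpy as np
import pandas as pd
from scipy.stats import chi2

# Hypothetical stratified inputs: one row per (tract, stratum), with observed
# cases O and population at risk P.
data = pd.DataFrame({
    "tract":   ["A", "A", "B", "B", "C", "C"],
    "stratum": ["F<55", "F>=55"] * 3,
    "O":       [3, 5, 0, 2, 7, 9],
    "P":       [800, 600, 300, 250, 1500, 1200],
})

# Stratum-specific rates r_j over the whole study area, then E_i = sum_j P_ij * r_j.
totals = data.groupby("stratum")[["O", "P"]].sum()
rates = totals["O"] / totals["P"]
data["E_part"] = data["P"] * data["stratum"].map(rates)
tracts = data.groupby("tract").agg(O=("O", "sum"), E=("E_part", "sum"))
tracts["SIR"] = tracts["O"] / tracts["E"]

# Exact (Garwood) 95% CI for a Poisson count, divided by E to get the SIR CI.
alpha = 0.05
O = tracts["O"].to_numpy()
lower = np.where(O > 0, chi2.ppf(alpha / 2, 2 * O) / 2, 0.0)
upper = chi2.ppf(1 - alpha / 2, 2 * (O + 1)) / 2
tracts["SIR_low"] = lower / tracts["E"]
tracts["SIR_high"] = upper / tracts["E"]
print(tracts.round(3))
```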

The SIR index suffers from limits in terms of variability: sparsely populated
areas have a high probability of resulting in a significantly high index, showing a
fallacious increase in the risk of TC. Furthermore, by construction, the standard
error of SIR tends to be large for sparsely populated areas and small for densely
populated ones. As a result, the confidence intervals of SIR will attribute
significance mostly to the highly populated areas (Haining 2003). On the whole,

areas with low population density often result in extreme values of SIR while highly
populated areas are mostly associated with SIR significantly different from 1. To
overcome these issues and contain the variability in the spatial distribution of the
population, we will consider only the census tracts with more than 30 residents for
the calculation of SIR. On the contrary, when computing the overall incidence rate of each stratum, rj, we will consider the totality of TC cases and of the resident population.

2.3.3. Local Moran’s I statistic

The local Moran’s I indicator belongs to the so-called LISA (Local Indicators of
Spatial Association) or local indicators of spatial autocorrelation proposed by
Anselin (1995). It is calculated with the following formula:

$$I_i = \frac{x_i - \bar{x}}{S_i^2} \sum_{j \neq i} w_{ij}\,(x_j - \bar{x}),$$

where n is the number of geographical units, $x_i$ is the value of the variable x in region i, $\bar{x}$ is the sample mean of the variable, $x_j$ is the value of the variable x in all other regions (where $j \neq i$), $S_i^2$ is the sample variance of the variable x and $w_{ij}$ is a weight that can be defined as the inverse of the distance between the various
regions. There are other ways to define wij, some contemplate choosing a limit
distance to define the neighborhood of a given region: the regions that fall within the
limit distance take on a weight equal to one, while the external regions take on a
weight equal to zero.

Positive and high values of the local Moran’s I index indicate that a given region
is surrounded by neighboring regions with similar high (or low) values of the
variable under study. In this case, the spatial groups detected are defined as
“high–high” (region with a high value surrounded by regions with high values) or
“low–low” (region with low value surrounded by regions with low values). In terms
of cancer risk, a “high–high” cluster would indicate a high-risk area, while a
“low–low” cluster would denote a low-risk area. Negative values of the local
Moran’s I reveal that the region under examination is a spatial outlier. A spatial
outlier is an area that has a markedly different value from that of its neighbors
(Cerioli and Riani 1999). Spatial outliers are divided into “high–low” (high value
surrounded by neighbors with low values) and “low–high” (low value surrounded by
neighbors with high values).

The local Moran’s I can be standardized so that its significance can be tested
under normal distribution assumption. However, its distribution under the null

hypothesis of absence of spatial autocorrelation may not be normal, especially in the


presence of highly asymmetric data. For this reason, it is possible to adopt the
method of conditional permutation (Anselin 1995), which does not presuppose
assumptions on the data. According to this approach, when the value of an attribute
of a given region is evaluated, its value is kept fixed, and all other values (from other
regions) are randomly permuted without repetition. Each time the other values are
permuted, the local Moran’s I index is calculated to form an empirical reference
distribution. The significance level (called “pseudo p-value”) can be estimated by
comparing the index actually observed on the data with the empirical distribution
created by conditional permutation (Anselin 2005). Each pseudo p-value is
computed as (M + 1)/(R + 1), where R is the number of permutations and M is the
number of instances where a statistic computed via permutations is equal to or greater than the observed value (for positive index values) or less than or equal to
the observed value (for negative index values). In this study, all local Moran’s I
indices were tested using 999 permutations and the significance level was set at 0.05.
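A self-contained sketch of the statistic and of the conditional permutation procedure described above is given below. It is an illustration only, not the authors' implementation: the weight matrix and the SIR values are hypothetical, and the weights are a generic row-standardized binary matrix.

```python
# Hedged sketch of the local Moran's I with conditional-permutation pseudo
# p-values, following the description above (hypothetical data and weights).
import numpy as np

rng = np.random.default_rng(1)

def local_morans_i(x, W):
    """I_i = (x_i - xbar)/S^2 * sum_j w_ij (x_j - xbar), with W row-standardized."""
    z = x - x.mean()
    s2 = (z ** 2).sum() / (len(x) - 1)        # sample variance of x
    return (z / s2) * (W @ z)

def conditional_permutation_p(x, W, permutations=999):
    n = len(x)
    z = x - x.mean()
    s2 = (z ** 2).sum() / (n - 1)
    I_obs = local_morans_i(x, W)
    M = np.zeros(n, dtype=int)
    for i in range(n):
        others = np.delete(z, i)              # values of all other regions
        wi = np.delete(W[i], i)               # weights from region i to the others
        for _ in range(permutations):
            # x_i is kept fixed; the other values are randomly reassigned.
            I_perm = z[i] / s2 * (wi @ rng.permutation(others))
            # Count permuted values at least as extreme as the observed one,
            # on the side of the observed sign (the (M + 1)/(R + 1) rule above).
            if (I_obs[i] >= 0 and I_perm >= I_obs[i]) or (I_obs[i] < 0 and I_perm <= I_obs[i]):
                M[i] += 1
    return I_obs, (M + 1) / (permutations + 1)

# Toy example: five regions on a line, contiguity neighbors, row-standardized weights.
W = np.array([[0, 1, 0, 0, 0],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
W = W / W.sum(axis=1, keepdims=True)
x = np.array([3.1, 2.9, 3.0, 0.4, 0.5])       # hypothetical SIR values
I_local, p_sim = conditional_permutation_p(x, W)
print(np.round(I_local, 2), np.round(p_sim, 3))
```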

2.4. Spatial distribution of TC in eastern Sicily

2.4.1. SIR geographical variation

In eastern Sicily from 2003 to 2016, 7,182 individuals were affected by TC. As already mentioned, the etiology of this tumor is complex and varied and can involve genetic factors, dietary causes, screening practices and other elements. In the case of Sicily, the distribution of TC cases could also be conditioned by two geographical components:
– the spatial arrangement of the resident population, with particular reference to
the female part, which is known to be the most affected by TC (Parkin et al. 2005).
Where the population is more concentrated or where the female population is
predominant, it will be more likely to record a high incidence of TC;
– the presence of environmental factors such as the volcanic nature of the
territory. The fumes emitted by an active volcano, such as Mount Etna, are able to
transport heavy metals and radioactive substances capable of contaminating the air,
water and soil of the surrounding areas (Fiore et al. 2019).

In an attempt to distinguish the effects of the two geographical components on


the spatial distribution of TC cases, we propose maps of the SIR by census tract and
its significant confidence intervals. SIRs were computed by dividing the population
into strata based on age and sex, to reflect the variation in the risk of TC due to these
two demographic variables. Therefore, a different overall risk rate was calculated for
each stratum (see section 2.3.2).

Figures 2.2(a) and 2.2(b) show, respectively, the SIR by census section and the corresponding confidence intervals. From the SIR representation alone (Figure 2.2(a)), different risk areas emerge, namely those with an SIR value greater than 1. These areas are located in the area around Mount Etna as well as in the non-volcanic provinces, especially in those of Enna and Messina. The consideration of the confidence intervals for SIR (Figure 2.2(b)) instead highlights the area south-east of
Mount Etna and different sections belonging mainly to the Messina province.
In both maps, it is evident that, while in the non-volcanic provinces the census sections with SIR greater than 1 are randomly scattered over the territory, in the province of Catania the risk sections are concentrated in an area close to Mount Etna, leaving the rest of the province almost free. Furthermore, the location of the risk areas along
the NW–SE axis could suggest that persistent winds in the SE direction could carry
the toxic substances emitted by the volcano, therefore polluting the atmosphere of
the territories positioned along this corridor, as highlighted in Boffetta et al. (2020).
It is also interesting to note that the census sections on the island of Lipari show a
high and significant SIR. Indeed, this area is also of a volcanic type and is located in
the immediate vicinity of Mount Vulcano, an active volcano presenting only a little
activity compared to that of Mount Etna. The island of Vulcano is home to
numerous sulfurous fumaroles as well as a field of frequent submarine volcanic CO2
emissions, whose spatial distribution follows the direction given by persistent winds
blowing from the NW (Vizzini et al. 2020). Moreover, Vizzini et al. (2013) stated
that the area experiences “low”-level contamination due to elements such as Ba, Fe,
As and Cd. Overall, the significance of SIR in Lipari seems to further corroborate
the idea that a volcano can influence the incidence of TC nearby.

Figure 2.2. SIR distribution by census section (a) and representation of its significance (b). Source: authors’ elaboration on CRES data

2.4.2. Estimate of the spatial attraction

To visualize the presence of clusters of TC cases on the area in question and to


analyze their arrangement in relation to the proximity of Mount Etna, we mapped
the local Moran’s I index of the previously calculated SIRs. To date, in the
literature, there is no empirical method or clear theoretical foundation to guide the
choice of the “correct” spatial weight matrix (Anselin and Bera 1998); for this
reason, it is common practice to experiment with different types of the matrix. On
the other hand, LeSage and Pace (2014) found no solid theoretical basis showing
that estimates and inferences from spatial regression models are sensitive to a
particular specification of the spatial weight matrix. In this study, we employed a row-standardized binary spatial weights matrix, based on the 20th-order queen contiguity criterion (also including intermediate orders), for the local Moran’s I calculations. Contiguity based on the queen criterion is often selected to analyze
areal data. The decision to include orders up to the 20th comes from the necessity to
consider the inverse correlation between the population residing in a specific section
and the area of the section itself. Generally, in fact, small census sections are
densely populated, while the large ones coincide with sparsely populated rural areas.
If we had considered a lower order, the smaller census sections with higher
population density (and therefore with a large population at risk) would have had a
restricted neighborhood, while the large and relatively less populated sections would
have had an extended neighborhood as a result of their same amplitude. The
consideration of the 20th order allowed us to build neighborhoods that were
“comparable” to each other in terms of geographical extension for all census
sections.
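The chapter does not state which software was used to build these weights. Purely as an illustration, the sketch below shows how a comparable neighborhood (queen contiguity up to the 20th order, including intermediate orders, row-standardized) and the associated local Moran's I could be obtained with the Python libpysal/esda libraries; the shapefile name and the "SIR" column are assumptions, and the loop over orders can be slow on large datasets.

```python
# Illustrative sketch with assumed tools and file/column names (not the authors' code):
# queen contiguity up to the 20th order, including intermediate orders, followed by
# the local Moran's I on the tract SIRs.
import geopandas as gpd
from libpysal.weights import Queen, higher_order, w_union
from esda.moran import Moran_Local

tracts = gpd.read_file("census_tracts.shp")     # hypothetical shapefile with a "SIR" column

w1 = Queen.from_dataframe(tracts)               # first-order queen contiguity
w = w1
for k in range(2, 21):                          # add orders 2..20 to the neighborhood
    w = w_union(w, higher_order(w1, k=k))
w.transform = "r"                               # row standardization

lisa = Moran_Local(tracts["SIR"].values, w, permutations=999)
tracts["I_local"] = lisa.Is                     # local Moran's I values
tracts["p_sim"] = lisa.p_sim                    # pseudo p-values (999 permutations)
tracts["quadrant"] = lisa.q                     # 1 = high-high, 2 = low-high, 3 = low-low, 4 = high-low
```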

Figure 2.3(a) shows the local Moran’s I statistic, while Figure 2.3(b) shows the
pseudo p-values obtained from the conditioned permutation procedure. Low-risk
census sections surrounded by low-risk census sections are represented in bright
yellow; those of high risk with high-risk neighbors are in brown; low-risk
sections surrounded by neighboring high-risk sections are colored light orange and
high-risk ones with a low-risk neighborhood appear in dark orange. Figure 2.3(a)
shows a variation in the risk between the northeast and the southwest: southern and
western internal areas do not host high-risk clusters, while the eastern and northern
ones present different high-risk clusters. In particular, there are extensive low-risk
clusters along the eastern coast of Messina and Syracuse, whereas high-risk groups emerge in the area SSE of Mount Etna, in the Aeolian Islands up north and on the
northern coast near Barcellona Pozzo di Gotto. Figure 2.3(b) illustrates that the
sections constituting the high- and low-risk clusters are significant at a level equal to
at most α = 0.05. Finally, it should be noted that most of the considered sections
were found to be of insignificant risk, as can be seen from the large gray areas
present in both maps.

Figure 2.3. Risk cluster map (a) and relative p-values (b). Source: authors’ elaboration on CRES data

The cluster analysis could confirm the hypothesis according to which persistent
winds in the SE direction would push the radioactive substances emitted by the
volcano towards areas that report a high risk. A similar suggestion seems to apply to
the Aeolian Islands and the sections near Barcellona Pozzo di Gotto.

2.5. Conclusion

The study of the geographical spread of infectious diseases has a consolidated


tradition. The growing incidence of chronic degenerative diseases (mostly cancers and cardiovascular pathologies) has led to the application of the methodologies typically used to study the diffusion of infections in this area as well. When environmental factors are
included among the contributing causes of similar diseases, such as TC, the
geographical analysis is a fundamental step to obtain a greater understanding of the
distribution of risk and incidence. One of the environmental factors often cited as a
possible cause of the onset of TC is the presence of an active volcano. In several volcanic regions around the world, studies carried out on data from local cancer registries have reported a significant increase in the incidence of TC (see
section 2.1). In this work, we mapped the TC cases in eastern Sicily to visualize the
risk areas and relate them to their proximity to Mount Etna. The health data
analyzed were released by the CRES. The geocoding activity of the TC cases’
addresses has allowed us to work at the census tract level and therefore to build indexes and maps of high geographic precision. To quantify the risk, we adopted the SIR, weighted for different strata of the population and calculated by indirect

standardization. From the first maps obtained, we found a possible significant risk
area at the foot of Mount Etna. We then conducted a cluster analysis to uncover possible high-risk pockets in the area. We computed the local Moran’s I index on the SIRs previously obtained and created maps of high- and low-risk clusters, and of risk
change. These maps highlighted the presence of a high-risk cluster to the SSE of
Mount Etna, in the Aeolian Islands, and near Barcellona Pozzo di Gotto. In the rest
of the region, no other important high-risk clusters have emerged. The detection of
areas of greatest risk located near Mount Etna seems to support the hypothesis that
the presence of a volcano may influence the incidence of TC in the surrounding
people. In addition to this, the risk areas emerged on the island of Lipari (and on the
Aeolian Islands on the whole), and along the northern coast of Sicily also seem to
indicate a possible influence of the nearby Mount Vulcano. This preliminary finding should be of crucial interest for public health and could help optimize the distribution of local health services and support targeted screening, monitoring and prevention campaigns that efficiently exploit the available resources.

In general, after controlling for the demographic factors affecting the TC


incidence, the adoption of geographical maps allowed us to visualize the variation in
the risk of TC in space and in relation to the distance from the volcano. The intent of
this work was exploratory; therefore, further analyses should be conducted to obtain
a more detailed and in-depth understanding of the environmental factors that
contribute to the onset of TC.

2.6. References

Andronico, D., Spinetti, C., Cristaldi, A., Buongiorno, M.F. (2009). Observations of Mt. Etna
volcanic ash plumes in 2006: An integrated approach from ground-based and polar
satellite NOAA-AVHRR monitoring system. Journal of Volcanology and Geothermal
Research, 180, 35–147.
Anselin, L. (1995). Local indicators of spatial association–LISA. Geographical Analysis, 27,
93–115.
Anselin, L. (2005). Exploring spatial data with GeoDa: A workbook. Workbook, Spatial
Analysis Laboratory, Department of Geography, University of Illinois, Urbana, IL.
Anselin, L. and Bera, A.K. (1998). Spatial dependence in linear regression models with an
introduction to spatial econometrics. In Handbook of Applied Economic Statistics, Ullah,
A. and Giles, D. (eds). Marcel Dekker, New York.
Anselin, L. and Rey, S.J. (2010). Perspectives on Spatial Data Analysis. Springer, Berlin,
Heidelberg.
Arnbjörnsson, E., Arnbiörnsson, A., Ólafsson, A. (1986). Thyroid cancer incidence in relation
to volcanic activity. Archives of Environmental Health, 41(1), 36–40.

Assessorato Regionale alla Salute (2016). Atlante Sanitario della Sicilia. Supplement,
Dipartimento per le Attività Sanitarie ed Osservatorio Epidemiologico.
Banerjee, S., Carlin, B.P., Gelfand, A.E. (2004). Hierarchical Modeling and Analysis for
Spatial Data. Chapman & Hall/CRC, Boca Raton/London.
Biondi, B., Arpaia, D., Montuori, P., Ciancia, G., Ippolito, S., Pettinato, G., Triassi, M.
(2012). Under the shadow of Vesuvius: A risk for thyroid cancer? Thyroid, 22(12),
1296–1297.
Bivand, R.S., Pebesma, E., Gómez-Rubio, V. (2008). Applied Spatial Data Analysis with R.
Springer, New York.
Boffetta, P., Memeo, L., Giuffrida, D., Ferrante, M., Sciacca. S. (2020). Exposure to
emissions from Mount Etna (Sicily, Italy) and incidence of thyroid cancer: A geographic
analysis. Scientific Reports, 10, 21298.
Borruso, G. and Murgante, B. (2012). Analisi dei fenomeni immigratori e tecniche di
autocorrelazione spaziale. Primi risultati e riflessioni, Geotema, 43–45.
Bray, F., Colombet, M., Mery, L., Piñeros, M., Znaor, A., Zanetti R., Ferlay, J. (2017).
Cancer Incidence in Five Continents, Volume XI. International Agency for Research on
Cancer, Lyon.
Breslow, N.E. and Day, N.E. (1987). Statistical Methods in Cancer Research, Heseltine, E. (ed.). IARC Scientific Publications no. 82, Lyon.
Buat-Ménard, P. and Arnold, M. (1978). The heavy metal chemistry of atmospheric particulate matter emitted by Mount Etna Volcano. Geophysical Research Letters, 5(4), 245–248.
Caguioa, P.B., Bebero, K.G.M., Bendebel, M.T.B., Saldana, J.S. (2019). Incidence of thyroid
carcinoma in the Philippines: A retrospective study from a tertiary university hospital.
Annals of Oncology, 30.
Caltabiano, T., Burton, M., Giammanco, S., Allard, P., Bruno, N., Murè, F., Romano, R.
(2004). Volcanic gas emissions from the summit craters and flanks of Mt. Etna,
1987–2000. Geophysical Monograph Series, 143, 111–128.
Cerioli, A. and Riani, M. (1999). The ordering of spatial data and the detection of multiple
outliers. Journal of Computational and Graphical Statistics, 8(2), 239–258.
Cimino, G. and Ziino, M. (1983). Heavy metal pollution. Part VII. Emissions from Mount
Etna volcano. Geophysical Research Letters, 10(1), 31–34.
Cressie, N. (1991). Statistics for Spatial Data. Wiley, New York.
Croner, C.M., Sperling, J., Broome. F.R. (1996). Geographic information systems (GIS): New
perspectives in understanding human health and environmental relationships. Statistics in
Medicine, 15(18), 1961–1977.
Curado, M.-P.E., Brenda, H.R.S., Storm, H., Ferlay, M., Heanue, J., Boyle. P. (2007). Cancer
Incidence in Five Continents, Volume IX. WHO, Geneva.

D’Aleo, R., Bitetto, M., Delle Donne, D., Tamburello, G., Battaglia, A., Coltelli, M.,
Patanè, D., Prestifilippo, M., Sciotto, M., Aiuppa, A. (2016). Spatially resolved SO2 flux
emissions from Mt Etna. Geophysical Research Letters, 43(14), 7511–7519.
Duntas, L.H. and Doumas, C. (2009). The “rings of fire” and thyroid cancer. Hormones, 8(4),
249–253.
Fiore, M., Conti, G.O., Caltabiano, R., Buffone, A., Zuccarello, P., Cormaci, L., Cannizzaro,
M.A., Ferrante. M. (2019). Role of emerging environmental risk factors in thyroid cancer:
A brief review. International Journal of Environmental Research and Public Health,
16(1185).
Fitzmaurice, C., Dicker, D., Pain, A., Hamavid, H., Moradi-Lakeh, M., MacIntyre, M.F.,
Allen, C., Hansen, G., Hansen, G., Woodbrook, R. et al. (2015). The global burden of
cancer 2013. JAMA Oncology, 1(4), 505–527.
Ghosh, M., Natarajan, K., Waller, L.A., Kim, D. (1999). Hierarchical Bayes GLMs for the
analysis of spatial data: An application to disease mapping. Journal of Statistical
Planning and Inference, 75(2).
Goodman, M.T., Yoshizawa, C.N., Kolonel, L.N. (1988). Descriptive epidemiology of
thyroid cancer in Hawaii. Cancer, 61, 1272–1281.
Haining, R. (2003). Spatial Data Analysis: Theory and Practice. Cambridge University Press,
Cambridge.
Hawai’i Tumor Registry (2019). Hawai’i Cancer at a Glance 2012–2016. Hawai’i Tumor
Registry.
Hrafnkelsson, J.H., Tulinius, J.G., Ólafsdottir, J.G., Sigvaldason. H. (1989). Papillary thyroid
carcinoma in Iceland: A study of the occurrence in families and the coexistence of other
primary tumours. Acta Oncologica, 28(6), 785–788.
Istat (2013). La Sicilia, un territorio che cambia. Istat.
Kilfoy, B.A., Zheng, T., Holford, T.R., Han, X., Ward, M.H., Sjodin, A., Zhang, Y., Bai, Y.,
Zhu, C., Guo, G.L. et al. (2009). International patterns and trends in thyroid cancer
incidence, 1973–2002. Cancer Causes and Control, 20(5), 525–531.
Kolonel, L.N., Hankin, J.H., Wilkens, L.R., Fukunaga, F.H., Ward Hinds, M. (1990). An
epidemiologic study of thyroid cancer in Hawaii. Cancer Causes and Control,
1, 223–234.
Kung, T.M., Ng, W.L., Gibson, J.B. (1981). Volcanoes and carcinoma of the thyroid:
A possible association. Archives of Environmental Health, 36(5), 265–267.
LeSage J.P. and Pace, K.R. (2014). The biggest myth in spatial econometrics. Econometrics,
2(4), 217–249.
Liu, Y., Su, L., Xiao, H. (2017). Review of factors related to the thyroid cancer epidemic.
International Journal of Endocrinology, 2017:5308635. doi: 10.1155/2017/5308635.

Malandrino, P., Scollo, C., Marturano, I., Russo, M., Tavarelli, M., Attard, M., Richiusa, P.,
Violi, M.A., Dardanoni, G., Vigneri, R. et al. (2013). Descriptive epidemiology of human
thyroid cancer: Experience from a regional registry and the “Volcanic Factor”. Frontiers
in Endocrinology, 4(65), 1–7.
Malandrino, P., Russo, M., Ronchi, A., Minoia, C., Cataldo, D., Regalbuto, C., Giordano, C.,
Attard, M., Squatrito, S., Trimarchi, F. et al. (2016). Increased thyroid cancer incidence
in a basaltic volcanic area is associated with non-anthropogenic pollution and
biocontamination. Endocrine, 53, 471–479.
Marcello, M.A., Malandrino, P., Almeida, J.F.M., Martins, M.B., Cunha, L.L., Bufalo, N.E.,
Pellegriti, G., Ward, L.S. (2014). The influence of the environment on the development of
thyroid tumors: A new appraisal. Endocrine-related Cancer, 21(5), T235–T254.
May, J.M. (1950). Medical geography: Its methods and objectives. Geographical Review,
40(1), 9–41.
Parkin, D.M., Bray, F., Ferlay, J., Pisani, P. (2005). Global cancer statistics, 2002. CA:
A Cancer Journal for Clinicians, 55(2), 74–108.
Pellegriti, G., De Vathaire, F., Scollo, C., Attard, M., Giordano, C., Arena, S., Dardanoni, G.,
Frasca, F., Malandrino, P., Vermiglio, F. (2009). Papillary thyroid cancer incidence in the
volcanic area of Sicily. Journal of the National Cancer Institute, 101, 1575–1583.
Snow, J. (1855). On the Mode of Communication of Cholera. John Churchill, London.
Stevenson, L.G. (1965). Putting disease on the map: The early use of spot maps in the
study of yellow fever. Journal of the History of Medicine and Allied Sciences, 20(3),
226–261.
Truong, T., Rougier, Y., Dubourdieu, D., Guihenneuc-Jouyaux, C., Orsi, L., Hémon, D.,
Guénel, P. (1985). Time trends and geographic variations for thyroid cancer in New
Caledonia, a very high incidence area (1985–1999). European Journal of Cancer
Prevention, 16(1), 62–70.
Vigneri, R., Malandrino, P., Vigneri, P. (2015). The changing epidemiology of thyroid
cancer: Why is incidence increasing? Current Opinion in Oncology, 27, 1–7.
Vigneri, R., Malandrino, F., Russo, G.M., Vigneri, P. (2017). Heavy metals in the
volcanic environment and thyroid cancer. Molecular and Cellular Endocrinology, 457,
73–80.
Vizzini, S., Di Leonardo, R., Costa, V., Tramati, C.D., Luzzu, F., Mazzola, A. (2013). Trace
element bias in the use of CO2 vents as analogues for low pH environments: Implications
for contamination levels in acidified oceans. Estuarine, Coastal and Shelf Science, 134,
19–30.
Vizzini, S., Andolina, C., Caruso, C., Corbo, A. (2020). Isole Eolie: I campi di emissioni
vulcaniche sottomarine di CO2 a Vulcano e Panarea. Memorie Descrittive della Carta
geologica d’Italia, 105, 91–96.
Wakefield, J. (2007). Disease mapping and spatial regression with count data. Biostatistics,
8(2), 158–183.

Waller, L.A. and Gotway, C.A. (2004). Applied Spatial Statistics for Public Health Data.
John Wiley & Sons, Hoboken, NJ.
Walter, S.D. (2000). Disease mapping: A historical perspective. In Spatial Epidemiology:
Methods and Applications, Elliott, P., Wakefield, J., Best, N., Briggs, D. (eds). Oxford
University Press, Oxford.
3

Analysis of Blockchain-based Databases in Web Applications

The functions of relational, non-relational and blockchain-based databases in web applications were compared. We evaluated whether these systems
with different capabilities have performance differences based on users, whether
they have security vulnerabilities while meeting the user needs, and the advantages
and disadvantages of using the blockchain.

The types of blockchain technologies and the use of different combinations of database types in different situations were analyzed in terms of functionality, security and performance in web applications. Significant differences were noted in the results, depending especially on which operations were performed.

3.1. Introduction

Databases have been continuously improved from the earliest days of computing to the present day, and they have become indispensable in our daily lives. Current database management systems are powered by a legacy that has been developed over many years according to users’ needs, ever since the invention of computers. Technologies continue to be developed according to the needs and the level of development that humanity has reached. Blockchain, one of the solutions that has emerged, has brought disruptive changes to various fields.

We are currently in a period where blockchain technology and the traditional approach are blended together, developed and used in solutions. In this study, blockchain-based systems and SQL and NoSQL database systems will be compared. Some analyses will be shared through the example of an Art Shop web application.

Chapter written by Orhun Ceng BOZO and Rüya ŞAMLI.

3.2. Background

3.2.1. Blockchain

Blockchain technology, which gained popularity with the invention of Bitcoin,


began to be used in applications because it offers some different solutions compared
to database management systems created with SQL and NoSQL databases.

In general, the blockchain is used as an alternative to traditional databases and/or together with traditional databases, providing peer-to-peer transactions without the need for a central authority and keeping the data on peers in a distributed ledger, linked in a chained manner using a Merkle-tree structure (Merkle 1980; Nakamoto 2008).
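As a minimal illustration of the Merkle-tree chaining mentioned above (a simplified sketch, not the exact construction used by Bitcoin or Tendermint), consider the following Python example, in which the transactions and their contents are purely hypothetical.

```python
# Minimal Merkle-root illustration: leaves are hashed, then paired and hashed
# upwards until a single root remains (simplified sketch, hypothetical data).
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(transactions: list) -> bytes:
    """Pair-wise hash the leaves upwards until a single root remains."""
    level = [sha256(tx) for tx in transactions]
    while len(level) > 1:
        if len(level) % 2 == 1:          # duplicate the last node on odd levels
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

txs = [b"alice->bob:5", b"bob->carol:2", b"carol->dave:1"]
print(merkle_root(txs).hex())
# Changing any single transaction changes the root, which is what makes
# tampering with the recorded history detectable.
```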

The blockchain technology has various advantages and disadvantages compared


to applications working with traditional databases. While examining the
characteristics of these differences, the types of blockchains and the architectures of
the structures that have been created with the blockchain should be examined.

3.2.2. Blockchain types

Public/permissionless blockchains: structures where anyone can participate,


create transactions, verify these transactions and update the state of the blockchain.
All transactions and the state of the chain are transparent and accessible to everyone.

Figure 3.1. Blockchain-based web application layers



Private/permissioned blockchains: structures that have the opposite


characteristics to public chains. Private blockchains are structures where only
authorized users can access the blockchain and data is hidden from public access.

3.2.3. Blockchain-based web applications

Blockchain-based web applications can be examined from an architectural point


of view with application, consensus and networking layers, as shown in Figure 3.1.

Application layer: the structure that defines the business part of the blockchain
and regulates state transitions.

Consensus layer: the layer that enables the nodes to agree on the state of the
blockchain and creates the decision-making mechanism of the decentralized
structure.

Networking layer: the layer responsible for the replication and propagation of transactions, state transition messages and consensus messages (https://v1.cosmos.network/intro).

3.2.4. Blockchain consensus algorithms

The choice of a blockchain algorithm – like the choice between building a


blockchain-based system and using traditional data management systems – is a
fundamental choice point. It radically affects the capabilities, performance, security
and operation of the application.

Proof of work (PoW): blockchain state changes are performed using computing power resources. The computational effort required to create the proof for a new block or a new transaction is much greater than that required to verify an existing proof. The idea behind this asymmetry is to prevent the system from being deceived by a fraudulent transaction.

Bitcoin, the most popular blockchain, also works with the PoW algorithm. In 2021, the daily average confirmation time (the average time for a transaction with miner fees to be included in a mined block and added to the public ledger) exceeded 800 minutes as a monthly average (Blockchain.com n/a).
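The asymmetry between creating and verifying a proof can be illustrated with a toy example; the difficulty target and the block payload below are arbitrary and unrelated to Bitcoin's actual parameters.

```python
# Toy proof-of-work sketch: finding a valid nonce requires many hash attempts,
# verifying it requires a single hash. Difficulty and payload are illustrative.
import hashlib

DIFFICULTY = 4                     # number of leading zero hex digits required

def block_hash(payload: str, nonce: int) -> str:
    return hashlib.sha256(f"{payload}:{nonce}".encode()).hexdigest()

def mine(payload: str) -> int:
    nonce = 0
    while not block_hash(payload, nonce).startswith("0" * DIFFICULTY):
        nonce += 1                 # expensive search over nonces
    return nonce

def verify(payload: str, nonce: int) -> bool:
    return block_hash(payload, nonce).startswith("0" * DIFFICULTY)  # one hash only

payload = "prev_hash|artwork #42 sold"
nonce = mine(payload)
print(nonce, verify(payload, nonce))
```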

Proof of stake (PoS): proposed in response to the limits of the PoW algorithm and first suggested in a forum in 2011, PoS is based on the concept that a new node wishing to participate in the block creation process must first prove that it holds a certain number of the relevant value tokens and lock/stake a certain amount of value in an escrow account. The locked amount serves as an escrow to ensure the security of the transaction. If the node performing the relevant transaction behaves inappropriately, it may lose the value it has locked in escrow and, according to the rules, is no longer allowed to participate in any of the transactions that change the block state.
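A highly simplified sketch of these ideas (stake-weighted selection of a proposer and slashing of a misbehaving node) is given below; it is only an illustration of the general PoS principle, not Tendermint's actual BFT consensus, and all values are hypothetical.

```python
# Simplified PoS illustration: validators lock a stake, proposers are chosen with
# probability proportional to their stake, and misbehaving validators lose part of
# the escrowed amount. Not Tendermint's actual algorithm; values are hypothetical.
import random

random.seed(7)
stakes = {"node_a": 100, "node_b": 50, "node_c": 10}     # locked/escrowed tokens

def pick_proposer(stakes: dict) -> str:
    nodes, weights = zip(*stakes.items())
    return random.choices(nodes, weights=weights, k=1)[0]

def slash(stakes: dict, node: str, fraction: float = 0.5) -> None:
    # Penalty for a fraudulent transaction: part of the escrowed stake is burned.
    stakes[node] = int(stakes[node] * (1 - fraction))

print([pick_proposer(stakes) for _ in range(5)])
slash(stakes, "node_b")
print(stakes)
```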

3.2.5. Other consensus algorithms

Applications are being developed that use PoW and PoS algorithms in a hybrid
way or with consensus algorithms that are developed with different mechanisms
from start to finish. While these algorithms are sometimes developed based on
users’ needs, sometimes they have to be used by considering the restrictions on the
business logic side. Blockchain applications are brought to life with new algorithms
every day and offered to the masses to use (Ferdous et al. 2020).

3.3. Analysis stack

3.3.1. Art Shop web application

The art gallery application, with an inventory of artworks, was created with MySQL (SQL), MongoDB (NoSQL) and a blockchain-based structure. With the art gallery application, the gallery owner can add new artworks to their inventory, delete them and update the specified information of the works.

3.3.2. SQL-based application

Relational database management system MySQL 8.0.21 and PhpMyAdmin 5.0.3


were used to implement the SQL database structure created with the tables in the
diagram shown in Figure 3.2. Apache 2.4.41 is used for server management, and
PHP 7.4 is used for applications. As for servers, DigitalOcean’s servers with 2 GB
memory/1 CPU were used with Ubuntu 20.04 operating system.

Figure 3.2. Art Shop relational database diagram. For a color version of this figure, see www.iste.co.uk/zafeiris/data1.zip

3.3.3. NoSQL-based application

For the NoSQL version of the Art Shop application, Mongo Atlas and MongoDB
version 4.4.6 were used. On the server side, Ubuntu 20.04 operating system and
DigitalOcean’s 2 GB memory/1 CPU system were used. Strapi 3.6.5 was used as the
content management system.

Three main collections were created on the Strapi in MongoDB: customers,


artworks and transactions.

3.3.4. Blockchain-based application

For the blockchain-supported version of the Art Shop web application, the
Ubuntu 20.04 operating system was used in DigitalOcean’s 2 GB memory, 1 CPU
droplet. Starport 0.16 was used to create and manage the blockchain. Go 1.16 was
installed to run Starport. Starport’s frontend application works with “Vue.js”.
“Node.js” and “npm” were also installed to run these packages on the server.

3.4. Analysis

3.4.1. Adding records

In the Art Shop application, the shop owner has three main options when they
want to add pieces of art to the inventory:
1) adding data directly to the database with the command interface;
2) adding data using the graphical interface of the database management systems;
3) adding data using the specially developed application user interface.

The custom commands provided by the SQL and NoSQL systems can be used to add inputs in bulk from a JSON file or comma-separated values.

Adding new inputs is one of the tasks on which the performance of the systems can be compared. The scenario of the art shop owner adding their inventory to the web application in one go, using a 150-line comma-separated values (CSV) file (Figure 3.3) as sample data, was implemented using the command interface. In the SQL and NoSQL systems, the task was performed without any problems and was completed in the times shown in Figures 3.4 and 3.5. The SQL system performed the bulk data addition faster than the NoSQL system. While the “LOAD DATA LOCAL INFILE” command is used directly on the server for SQL, the “mongoimport” command, which connects to the MongoDB Atlas servers from a local computer, is used for the NoSQL structure.
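For reference, a comparable bulk load could also be timed programmatically. The sketch below uses the standard Python client libraries rather than the command-line tools used in the chapter; the connection details, table and collection names, and CSV column headers are assumptions.

```python
# Hedged sketch of a programmatic bulk-load timing comparison (the chapter itself
# used LOAD DATA LOCAL INFILE and mongoimport from the command line). Connection
# details, table/collection names and CSV headers are assumptions.
import csv, time
import mysql.connector
from pymongo import MongoClient

with open("artworks.csv", newline="") as f:          # the 150-line sample file
    rows = [(r["name"], r["artist"], int(r["year"])) for r in csv.DictReader(f)]

# MySQL: multi-row INSERT through executemany.
conn = mysql.connector.connect(host="localhost", user="artshop",
                               password="secret", database="artshop")
cur = conn.cursor()
t0 = time.perf_counter()
cur.executemany("INSERT INTO artworks (name, artist, year) VALUES (%s, %s, %s)", rows)
conn.commit()
print("MySQL bulk insert:", round(time.perf_counter() - t0, 4), "s")

# MongoDB Atlas: insert_many on the artworks collection.
client = MongoClient("mongodb+srv://user:secret@cluster.example.mongodb.net")
col = client["artshop"]["artworks"]
t0 = time.perf_counter()
col.insert_many([{"name": n, "artist": a, "year": y} for n, a, y in rows])
print("MongoDB bulk insert:", round(time.perf_counter() - t0, 4), "s")
```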

Multiple data were used in the SQL and NoSQL systems in order for the
performance test to be meaningful during the addition of the entries, but since the
Starport infrastructure used in the blockchain-supported system – as in all popular
algorithms – actually uses Tendermint’s consensus algorithm, called BFT POS, the
inputs must be added one by one (Ferdous et al. 2020). In order for a node to add
more than one entry, the messages must be added to the Merkle-tree structure with
unique hashes and proven one at a time.

An input was added to the blockchain for the blockchain-based system, which uses the Starport infrastructure and in which, due to its structure, data is added one entry at a time.
This addition first required the creation of functions for the CRUD (Create, Read,
Update, Delete) actions of the digital asset. The creation of the artwork structure as a
digital asset on the blockchain was accomplished with the Starport command “starport
type artwork Arttype name artist year owner”. Adjustments were then made to various
proto and structural files for the API system for Starport’s web application. Then, with
the command “artshopd tx artshop create-artwork ‘Painting’ ‘Name2’ ‘Artist 2’ ‘1999’
‘0’ --from=dbtests”, the first registration of the blockchain was made on behalf of the
user “dbtests”. Figure 3.6 shows the time associated with the record added to the
blockchain by the validator of the chain.

Figure 3.3. Art Shop inventory dummy-data of 150 lines

Figure 3.4. The process of adding 150 lines of dummy-data with the SQL system and the time elapsed. For a color version of this figure, see www.iste.co.uk/zafeiris/data1.zip

Figure 3.5. The process of adding 150 lines of dummy-data with the NoSQL system and the time elapsed. For a color version of this figure, see www.iste.co.uk/zafeiris/data1.zip

Figure 3.6. Adding a single entry to the blockchain-supported system and the time
elapsed. For a color version of this figure, see www.iste.co.uk/zafeiris/data1.zip

Since there will be a query test later in the analysis, adding the 150 records to the blockchain-based system one by one was also fully carried out, taking a total of 450 seconds.

According to the tests performed, the ranking by time taken to add inputs, from fastest to slowest, is SQL, NoSQL, blockchain.

3.4.2. Query

Figure 3.7 shows the queries and operation times used to retrieve, from the relevant table in each database, the artworks whose names start with the expression “Name 1” among the 150 lines of sample data in the Art Shop application.

System       Query                                                  Time
SQL          select * from artworks where name like “Name 1%”       0.0010 seconds
NoSQL        db.artworks.find( {“name”:{ $regex : /^Name 1/ } })    0.046 seconds
Blockchain   artshopd query artshop search-artwork “Name 1”         0.072 seconds

Figure 3.7. Queries and operation times in SQL, NoSQL and blockchain-based databases

The detailed operation time was not displayed in the MongoDB command
interface used for the NoSQL system. Before the query, the “setVerboseShell(true)”
command was run to show the operation time at the end of the operation.

In Starport, which is used for the blockchain-based database system, only the
following commands are available on the query side of the CRUD functions created
with the “type” command by default. In order to run the query carried out in this

system, which was created using the Go language, the custom query had to be developed and added to the system.

Query commands that come by default with the “type” command in Starport are:
– “list-artwork”: outputs all entries;
– “show-artwork”: if there is an entry with the submitted id number, it returns that single entry as output.

3.4.3. Functionality

The first version of the MySQL database used for SQL was released in 1995, and MongoDB, used for the NoSQL system, was released in 2005. Starport, the open-source tool developed by the Tendermint company and used for the blockchain-supported system, released its first version at the beginning of 2020.

For traditional database systems, there are database management applications that build on a long legacy, server packages that can be installed with one click at hosting companies, and ready-made database servers managed in the cloud. On the blockchain side, there are only limited alternatives that provide a managed service in the cloud, namely the Oracle (Oracle.com n/a) and Amazon Web Services (AWS.amazon.com n/a) solutions.

For software development, there are far more legacy and SQL- and NoSQL-related resources. For blockchain-based database systems, the Tendermint consensus algorithm, the Cosmos SDK and Starport offer a starting point for the application, consensus and networking layers, but improvements and adjustments must be developed according to application needs. With its API support and frontend application, the Cosmos SDK has facilitated the development of blockchain-based web applications.

3.4.4. Security

In web applications, management and functionality in areas such as membership and transaction authorization depend on fundamental choices such as the blockchain type and the consensus algorithm. With the Tendermint BFT PoS consensus algorithm used here, memberships to the application are initialized with the definitions in the config.yml file shown in Figure 3.8. As required by the algorithm’s working logic, nodes holding a certain stake of the digital unit perform asset transfers and transfer confirmations and earn rewards. In other words, the public id numbers of everyone who is a member of the system should be shared with all nodes. For an entry to be committed to the blockchain, at least 2/3 of all validators must approve the transfer (Kwon 2014).

Figure 3.8. Starport config.yml file. For a color version of this figure, see www.iste.co.uk/zafeiris/data1.zip

In SQL and NoSQL systems, the IP addresses allowed to connect to the database server can be defined at the database management system layer; all requests except those coming from these IP addresses can be blocked. In addition, by defining user accounts as well as IP addresses, authorization can be granted to users coming from a specific IP address and authenticated with a username/password. In the PoS algorithm on the blockchain side, if a fraudulent transaction is detected, then the staked value may not be returned to the node and the account may be deleted from the chain completely.

In blockchain-based databases, the data history is completely original and cannot be changed, thanks to the blockchain technology. That is, system users cannot change the history of the created and branched chain. Historical data can, however, be changed if the firewalls of SQL and NoSQL systems are breached, and on systems with a huge number of rows such a change may only be noticed through special inspection.

While blockchain-based systems are being set up, the architectural structure can be designed uniquely for each chain. However, if the server running the
blockchain system developed using Starport is shut down and the files are not
backed up, the chain and the status of the chain will be permanently deleted. This
also applies to SQL and NoSQL databases. For all database systems, at least one
application must be running on a server to maintain the final data. The Cosmos SDK
key-value store uses the Go language version of LevelDB (Github.com n/a). The
state of the blockchain (key-value store) can be backed up by writing to an SQL or
NoSQL database, so that even if no node is running in the system, it can be started
again with the backup in the database.
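To make the backup idea concrete, the following minimal Python sketch (our own illustration, not part of the system described above) copies the key-value pairs of a chain's state store into an SQLite table. The function name and table layout are assumptions made for the example; in a real deployment the pairs would be read from the node's LevelDB files through a suitable binding, and the target could equally be any SQL or NoSQL database.

import sqlite3

def backup_kv_state(kv_pairs, sqlite_path="chain_state_backup.db"):
    # kv_pairs: any iterable of (bytes, bytes) pairs taken from the node's
    # key-value store; here a toy in-memory state stands in for LevelDB data.
    con = sqlite3.connect(sqlite_path)
    with con:  # commit (or roll back) the whole snapshot atomically
        con.execute(
            "CREATE TABLE IF NOT EXISTS chain_state (key BLOB PRIMARY KEY, value BLOB)"
        )
        con.executemany(
            "INSERT OR REPLACE INTO chain_state (key, value) VALUES (?, ?)",
            kv_pairs,
        )
    con.close()

backup_kv_state({b"height": b"42", b"app_hash": b"0xabc"}.items())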

3.5. Conclusion

Web applications have become part of all of our daily lives. Main areas
such as government services, health, finance and entertainment are now
indispensably and irreversibly managed through web applications. As estimates of
the number of interconnected devices grow day by day, the communication, security
and speed of this interconnected crowd all gain importance. With this growth, the
popularity of decentralized and trustless blockchain-based structures is also increasing.

Databases in web applications were classified according to the units where the
data is kept and the relationships of the data with each other. Blockchain databases,
in turn, are classified according to system participation and the consensus algorithm.
Blockchain-based systems are decentralized, consistent and eliminate the trust
problem.

SQL and NoSQL systems record, send and process any data that passes the
authorization barrier, without questioning the decisions of the application layer. Their
built-in functions are rarely used, because web languages evolve quickly and offer
greater flexibility. Blockchain technologies, in turn, are diversified, especially on the
consensus side. Authority over, and participation in, the system is one of the most
critical points for data processing in web applications intended for use by multiple
stakeholders.

As the performance tests clearly show, blockchain technology is considerably slower
than SQL and NoSQL database technologies because of its structural features.
Blockchain technology can nevertheless be preferred in web applications when
decentralized decision-making, an immutable data history and the elimination of the
trust problem through computationally hard cryptographic problems are required.

3.6. References

AWS.amazon.com (n/a). Amazon managed blockchain [Online]. Available at: https://aws.amazon.com/en/managed-blockchain/ [Accessed 1 July 2021].
Blockchain.com (n/a). Average confirmation time [Online]. Available at: https://www.blockchain.com/charts/avg-confirmation-time [Accessed 8 May 2021].
Cosmos (n/a). What is Cosmos? [Online]. Available at: https://v1.cosmos.network/intro [Accessed 5 May 2021].
Ferdous, M.S., Chowdhury, M., Hoque, M., Colman, A. (2020). Blockchain consensus algorithms: A survey [Online]. Available at: https://arxiv.org/abs/2001.07091.
Github.com (n/a). Tendermint DB [Online]. Available at: https://github.com/tendermint/tm-db [Accessed 5 May 2021].
Kwon, J. (2014). Tendermint: Consensus without mining [Online]. Available at: https://tendermint.com/static/docs/tendermint.pdf.
Merkle, R.C. (1980). Protocols for public key cryptosystems. Proceedings of the 1980 Symposium on Security and Privacy: April 15–16, 1980, Oakland, California. IEEE Computer Society.
Nakamoto, S. (2008). Bitcoin: A peer-to-peer electronic cash system [Online]. Available at: Bitcoin.org.
Oracle.com (n/a). Oracle blockchain platform cloud service [Online]. Available at: https://www.oracle.com/blockchain/cloud-platform [Accessed 1 July 2021].
4
Optimization and Asymptotic
Analysis of Insurance Models1

Insurance is the oldest domain of applied probability. Moreover, the mathematical
models arising there can be used in other areas of applied probability. Therefore,
the optimization of insurance model performance and their asymptotic analysis
are very important. The modern period in actuarial sciences is characterized by the
investigation of complex systems and the employment of sophisticated mathematical
tools. Discrete-time models became popular since, in many cases, they describe the
real situation more precisely. Hence, we study two models (a discrete-time one and a
continuous-time one) in the framework of the cost approach. Reinsurance, dividends and
bank loans are the controls in the optimization problems. Model stability with respect to
small perturbations of the underlying distributions is treated as well, using probability
metrics.

4.1. Introduction

The investigation of insurance models is a primary task of actuarial sciences. A
keyword in all definitions of actuarial sciences is risk. It is present whenever the
outcome is uncertain, whether favorable or unfavorable. Actuarial sciences emerged
in the 17th century (see Bernstein (1996)), although methods for transferring or
distributing risk were practiced by Chinese and Babylonian traders as long ago as
the 3rd and 2nd millennia BC, respectively. Actuarial sciences have an interesting
history consisting of four periods (deterministic, stochastic, financial and modern ERM;
see Bulinskaya (2017)).

Chapter written by Ekaterina B ULINSKAYA.


1 Research is partially supported by the Russian Foundation for Basic Research, project
20-01-00487.


As stated, the modern period is characterized by the investigation of complex
systems and the employment of sophisticated mathematical tools. Discrete-time
models became popular since, in many cases, they describe the real situation more
precisely. They can also serve as approximations of the corresponding continuous-time
models (see Dickson and Waters (2004)).

This chapter is organized as follows. In section 4.2, we consider a discrete-time
insurance model with proportional reinsurance and bank loans. We establish the
optimal policy of loans in the framework of the cost approach and find the conditions
for the model stability with respect to small perturbations of the underlying
distributions. A continuous-time Cramér–Lundberg model with dividends is treated in
section 4.3. An optimal barrier is found for a special type of claim distribution.
Numerical results are also provided. Our conclusion and further investigation
directions are discussed in section 4.4.

4.2. Discrete-time model with reinsurance and bank loans

4.2.1. Model description

Suppose that the claims arriving to an insurance company are described by a
sequence of independent identically distributed (i.i.d.) non-negative random variables
(r.v.'s) {X_i, i ≥ 1}. Here, X_i is the claim amount during the ith period (year,
month or day). Let F(x) be its distribution function (d.f.) having density ϕ(x) and
finite expectation. Put, as usual, F̄(x) = 1 − F(x). The company uses proportional
reinsurance with quota α and bank loans. If a loan is taken at the beginning of the
period (before the claim arrival), the rate is k, whereas a loan after the claim arrival
is taken at the rate r, with r > k. Our aim is to choose the loans in such a way that
the additional payments entailed by loans are minimized. Denote by M the premium
acquired by the direct insurer (after reinsurance) during each period. If x is the initial
capital (surplus, reserve), then f_1(x), the minimal expected additional cost during one
period, is given by

f_1(x) = \min_{y \ge x} [k(y − x) + r E(αX − y)^+], where (z)^+ = max(0, z).   [4.1]

Clearly, equation [4.1] can be rewritten in the form:

f_1(x) = −kx + \min_{y \ge x} G_1(y), with G_1(y) = ky + rα \int_{y/α}^{∞} F̄(s) ds.   [4.2]

Now let f_n(x) be the minimal expected cost during n periods and β be the discount
factor for future costs. Then, using dynamic programming (see Bellman (1957)), we
easily obtain the following relation:

f_n(x) = −kx + \min_{y \ge x} G_n(y), with G_n(y) = G_1(y) + β E f_{n−1}(y + M − αX).   [4.3]



4.2.2. Optimization problem

It is not difficult to prove the main optimization result.

THEOREM 4.1.– There exists an increasing sequence of critical levels {y_n}_{n ≥ 1} such
that:

f_n(x) = −kx + \begin{cases} G_n(y_n), & if x ≤ y_n, \\ G_n(x), & if x > y_n. \end{cases}   [4.4]

The sequence is bounded by ȳ satisfying the equation H(y) = 0, where
H(y) = G_1'(y) − kβ.

PROOF.– Consider at first a one-period case. Obviously, we have to find the solution
of the equation G_1'(y) = 0, where G_1'(y) = k − r F̄(y/α). Due to the assumption r > k,
it follows immediately that y_1 = α F^{−1}(1 − k/r) exists; moreover, it is the unique
solution of the equation under consideration.

Further results are obtained by induction. Since f_1(x) is given by [4.4], it is
possible to write

f_1'(x) = −k + \begin{cases} 0, & if x ≤ y_1, \\ G_1'(x), & if x > y_1, \end{cases}
        = \begin{cases} −k, & if x ≤ y_1, \\ −r F̄(x/α), & if x > y_1. \end{cases}   [4.5]

Hence, it is clear that f_1'(x) < 0 for all x. Moreover, on the one hand,

G_2'(y) = G_1'(y) + β \int_0^{∞} f_1'(y + M − αs) ϕ(s) ds ≤ G_1'(y);

on the other hand, we can write G_2'(y) in the form:

G_1'(y) − βk + β \int_0^{(y+M−y_1)/α} G_1'(y + M − αs) ϕ(s) ds
   = H(y) + β \int_0^{(y+M−y_1)/α} G_1'(y + M − αs) ϕ(s) ds.

That means y_1 < y_2 < ȳ. Furthermore, f_2'(x) < 0 for all x. Thus, the base of
induction is established. Assuming that [4.4] is true for a number of periods less than or
equal to n, we prove its validity for n + 1. It is possible to write

f_n'(x) − f_{n−1}'(x) = \begin{cases} 0, & if x ≤ y_{n−1}, \\ −G_{n−1}'(x), & if y_{n−1} < x ≤ y_n, \\ G_n'(x) − G_{n−1}'(x), & if x > y_n. \end{cases}

Since

G_{n+1}'(y) − G_n'(y) = β \int_0^{∞} [f_n'(y + M − αs) − f_{n−1}'(y + M − αs)] ϕ(s) ds,

we deduce that G_{n+1}'(y) < G_n'(y), so y_n < y_{n+1}. Rewriting G_{n+1}'(y) as follows:

G_1'(y) + β \int_0^{∞} f_n'(y + M − αs) ϕ(s) ds
   = H(y) + β \int_0^{(y+M−y_n)/α} G_n'(y + M − αs) ϕ(s) ds,

it is easy to see that G_{n+1}'(y) > H(y). This entails the needed relation y_{n+1} < ȳ,
thus ending the proof. It is possible to formulate an obvious corollary:

C OROLLARY 4.1.– There exists ŷ = limn→∞ yn .

REMARK 4.1.– It is interesting to mention that ŷ = ȳ only for M = 0, whereas ŷ < ȳ
for M > 0.
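As a quick numerical illustration of Theorem 4.1 (our own example, not part of the original exposition), the sketch below computes the first critical level y_1 = αF^{−1}(1 − k/r) and the upper bound ȳ, the root of H(y) = G_1'(y) − kβ = 0, for exponentially distributed claims, where both quantities have closed forms; the parameter values are arbitrary.

import math

def critical_levels_exponential(alpha, k, r, beta, mu):
    # Claim X ~ Exp(mu), i.e. F(x) = 1 - exp(-mu*x), so F^{-1}(p) = -ln(1 - p)/mu
    assert r > k > 0 and 0 < beta < 1
    y1 = alpha * math.log(r / k) / mu                    # y_1 = alpha * F^{-1}(1 - k/r)
    y_bar = alpha * math.log(r / (k * (1 - beta))) / mu  # solves k - r*exp(-mu*y/alpha) = k*beta
    return y1, y_bar

print(critical_levels_exponential(alpha=0.8, k=0.05, r=0.12, beta=0.95, mu=1.0))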

4.2.3. Model stability

Now, we turn to sensitivity analysis and prove that the model under consideration
is stable with respect to small perturbations of the underlying distribution. For this
purpose, we introduce two variants of the model. In the first one, the claim distribution
has density ϕX (x) and d.f. is denoted by FX (x). In the second one, the claim density
is ϕY (x) and d.f. is FY (x). The corresponding cost functions are denoted by fn,X (x)
and fn,Y (x). The distance between distributions will be measured by means of the
Kantorovich metric.

DEFINITION 4.1.– For random variables X and Y defined on some probability space
and possessing finite expectations, it is possible to define their distance on the basis of
the Kantorovich metric in the following way:

κ(X, Y) = \int_{−∞}^{∞} |F_X(t) − F_Y(t)| dt,

where F_X and F_Y are the distribution functions of X and Y, respectively.
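For samples drawn from two claim distributions, the Kantorovich metric coincides with the 1-Wasserstein distance and can be approximated numerically. The short sketch below uses SciPy for this purpose; the two exponential scales are chosen purely for illustration.

import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# A "true" claim distribution and a slightly perturbed one
x = rng.exponential(scale=1.00, size=100_000)
y = rng.exponential(scale=1.05, size=100_000)

# Empirical Kantorovich (1-Wasserstein) distance, i.e. the integral of |F_X - F_Y|
rho = wasserstein_distance(x, y)
print(f"estimated kappa(X, Y) = {rho:.4f}")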

The distance between the cost functions is measured in terms of the Kolmogorov
uniform metric. Thus, we are going to study

Δ_n = \sup_x |f_{n,X}(x) − f_{n,Y}(x)|.

To this end, we need the following:



LEMMA 4.1.– Let the functions g_i(y), i = 1, 2, be such that |g_1(y) − g_2(y)| < δ for some
δ > 0 and all y. Then \sup_x |\inf_{y \ge x} g_1(y) − \inf_{y \ge x} g_2(y)| ≤ δ.

PROOF.– Fix x and put C_i = \inf_{y \ge x} g_i(y). Then, according to the definition of the
infimum, for any ε > 0 there exists y_1(ε) ≥ x such that g_1(y_1(ε)) < C_1 + ε.
Therefore, g_2(y_1(ε)) < g_1(y_1(ε)) + δ < C_1 + ε + δ, implying C_2 ≤ g_2(y_1(ε)) <
C_1 + ε + δ. Letting ε → 0, we obtain immediately C_2 ≤ C_1 + δ. In a similar way, we
establish C_1 ≤ C_2 + δ, thus obtaining the desired result |C_1 − C_2| ≤ δ. Now, we are
able to estimate Δ_1.

LEMMA 4.2.– Assume κ(X, Y) = ρ; then Δ_1 ≤ αrρ.

PROOF.– According to Lemma 4.1, we need to estimate |G_{1,X}(y) − G_{1,Y}(y)| for any
y. The definition of these functions gives

G_{1,X}(y) − G_{1,Y}(y) = r[E(αX − y)^+ − E(αY − y)^+] = rα \int_{y/α}^{∞} (F̄_X(t) − F̄_Y(t)) dt.

This leads immediately to the desired estimate. Next, we prove the main result demonstrating the model's stability.
n
T HEOREM 4.2.– If κ(X, Y ) = ρ, then Δn  Dn ρ, where Dn = α( r(1−β
1−β
)
+
n
k(β−β )
1−β ).

PROOF.– As in Lemma 4.2, we begin with the estimation of |G_{n,X}(y) − G_{n,Y}(y)| for
any y. Due to definition [4.3], we have:

|G_{n,X}(y) − G_{n,Y}(y)| ≤ |G_{1,X}(y) − G_{1,Y}(y)|
   + β \left| \int_0^{∞} f_{n−1,X}(y + M − αs) ϕ_X(s) ds − \int_0^{∞} f_{n−1,Y}(y + M − αs) ϕ_Y(s) ds \right|.

Obviously, the first term on the right-hand side of the inequality is less than αrρ.
To estimate the second term, we rewrite it in the form:

β \left| \int_0^{∞} f_{n−1,X}(y + M − αs) ϕ_X(s) ds − \int_0^{∞} f_{n−1,Y}(y + M − αs) ϕ_X(s) ds
   + \int_0^{∞} f_{n−1,Y}(y + M − αs) ϕ_X(s) ds − \int_0^{∞} f_{n−1,Y}(y + M − αs) ϕ_Y(s) ds \right|.

Clearly,

\left| \int_0^{∞} [f_{n−1,X}(y + M − αs) − f_{n−1,Y}(y + M − αs)] ϕ_X(s) ds \right| ≤ Δ_{n−1}.

Integrating by parts, we rewrite \int_0^{∞} f_{n−1,Y}(y + M − αs) ϕ_Y(s) ds in the form:

−f_{n−1,Y}(y + M − αs) F̄_Y(s) \big|_0^{∞} − α \int_0^{∞} f_{n−1,Y}'(y + M − αs) F̄_Y(s) ds
   = f_{n−1,Y}(y + M) − α \int_0^{∞} f_{n−1,Y}'(y + M − αs) F̄_Y(s) ds.

Hence, we obtain:

Δ_n ≤ αrρ + αβ \max_y |f_{n−1,Y}'(y)| ρ + β Δ_{n−1}.

It is not difficult to prove that \max_y |f_{n−1,Y}'(y)| ≤ k for all n, so:

Δ_n ≤ α(r + kβ)ρ + β Δ_{n−1}.

Solving this recurrent relation, we finish the proof and get the desired form of D_n.
COROLLARY 4.2.– Δ_n ≤ \frac{r + kβ}{1 − β} αρ for any n.

In other words, we established the stability of the model with respect to small
perturbations of claim distribution.

4.3. Continuous-time insurance model with dividends

4.3.1. Model description

In this section, we consider the classical Cramér–Lundberg model. Such a model
was first introduced in 1903 by Lundberg (1903) and developed further in the
1930s by Cramér (1955). It was the beginning of the reliability approach (which is
still popular), taking the ruin probability as a risk measure. It is a continuous-time
model assuming that the insurance company capital R(t) at time t is given by the relation:

R(t) = x + ct − \sum_{n=1}^{N(t)} X_n,

where x = R(0) is the initial capital, X_n is the amount of the nth claim, and N(t)
is the number of claims up to time t. The sequence {X_n, n ≥ 1} consists of i.i.d.
non-negative r.v.'s with finite mean and d.f. F(x). It is independent of the Poisson
process N(t) with intensity λ. The premium inflow rate is c > 0.
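The following short simulation (our own illustration, with arbitrary parameter values) generates one trajectory of the surplus process R(t) over a finite horizon by sampling Poisson claim arrival times and i.i.d. exponential claim amounts.

import numpy as np

def simulate_cramer_lundberg(x0, c, lam, claim_mean, horizon, rng):
    # One path of R(t) = x0 + c*t - (sum of claims arrived up to time t),
    # recorded immediately after each claim.
    t, capital = 0.0, x0
    path = [(t, capital)]
    while True:
        t += rng.exponential(1.0 / lam)         # next claim arrival of the Poisson process
        if t > horizon:
            break
        capital += c * (t - path[-1][0])        # premiums earned since the last claim
        capital -= rng.exponential(claim_mean)  # claim amount X_n
        path.append((t, capital))
    return path

rng = np.random.default_rng(1)
path = simulate_cramer_lundberg(x0=10.0, c=1.2, lam=1.0, claim_mean=1.0, horizon=100.0, rng=rng)
print(f"{len(path) - 1} claims, surplus after the last claim = {path[-1][1]:.2f}")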

Starting with the seminal paper by De Finetti (1957), the study of dividends has been
an important subject of actuarial mathematics. We also mention in passing the papers
by Gordon (1959) and Miller and Modigliani (1961), which were among the first to
treat the dividend problem, and the paper by Albrecher and Thonhauser (2009), which
reviews the results obtained before 2009.

Let us consider the Cramér–Lundberg process with dividends paid according to a
barrier strategy with level b. Then, the company capital at time t is

Q(t) = Q_0 + ct − S(t) − L(t),

where S(t) = \sum_{n=1}^{N(t)} X_n and L(t) is the dividend process satisfying the following
conditions:
1) L(t + 0) − L(t) ≤ Q(t),
2) L(t) = L(T) for all t ≥ T, where T = inf{t : Q(t) < 0} is the ruin time.

The objective function V(Q_0, L) is the expected discounted dividend paid until
ruin. To calculate it, we introduce the following notation. Let L be some strategy of
dividend payment, Q_0 be the initial capital, and δ be the force of interest, δ > 0. Then,
it is possible to write:

V(Q_0, L) = E \left[ \int_0^{T} e^{−δt} dL(t) \right].

DEFINITION 4.2.– The strategy L_0 is optimal if

V(Q_0) := V(Q_0, L_0) = \sup_L V(Q_0, L).

4.3.2. Optimal barrier strategy

Further on we consider a barrier strategy.

DEFINITION 4.3.– The strategy L is called a barrier strategy with barrier level b if, for
Q(t) > b, the amount Q(t) − b is paid immediately; if Q(t) = b, all of the premium
inflow is paid as a dividend; and if Q(t) < b, nothing is paid.

We are going to use the following result proved in Gerber (1969).

THEOREM 4.3.– There exists b* such that, for any initial capital satisfying the condition
0 ≤ Q_0 ≤ b*, the barrier strategy specified by this level is optimal.

Put for simplicity Q_0 = Q. Then, it is not difficult to prove, using the total
probability formula and properties of the Poisson process, that V(Q, b) satisfies the
following integro-differential equation:

\frac{∂V(Q, b)}{∂Q} = \frac{λ + δ}{c} V(Q, b) − \frac{λ}{c} \int_0^{Q+0} V(Q − y, b) dF(y)   [4.6]

with boundary condition:

V(b, b) = \frac{c}{λ + δ} + \frac{λ}{λ + δ} \int_0^{b+0} V(b − y, b) dF(y).   [4.7]

We can find in the book by Bühlmann (1970) that the solution of [4.6] with
boundary condition [4.7] has the form:

V(Q, b) = \frac{h(Q)}{h'(b)},

where the function h(x) is a unique solution, up to a constant factor, of the equation

h'(x) = \frac{λ + δ}{c} h(x) − \frac{λ}{c} \int_0^{x+0} h(x − y) dF(y)   [4.8]

for 0 < x < ∞.

4.3.3. Special form of claim distribution

We have established that, in order to find an optimal barrier strategy, it is necessary
to solve two problems: to obtain the positive solution of [4.8] and to find the level b*
at which h'(b*) is minimal.

THEOREM 4.4.– Assume that the d.f. of claims has a density ϕ(y) given by ϕ(y) =
P(y)e^{−y}, y ≥ 0, where P(y) is a polynomial of degree m. Then, the
integro-differential equation [4.8] can be reduced to a homogeneous ordinary
differential equation of order m + 2 with constant coefficients.

PROOF.– Substituting the expression dF(y) = ϕ(y) dy into [4.8], we have:

c h'(x) = (λ + δ) h(x) − λ \int_0^x h(x − y) e^{−y} P(y) dy.   [4.9]

Differentiation gives the following relation:

c h''(x) = (λ + δ) h'(x) − λ h(0) e^{−x} P(x) − λ \int_0^x h'(x − y) e^{−y} P(y) dy.

Integration by parts of the last summand and replacement of the integral via
formula [4.9] leads to:

c h''(x) = (λ + δ − c) h'(x) + (λ + δ − λP(0)) h(x) − λ \int_0^x h(x − y) e^{−y} P'(y) dy.   [4.10]

Note that we have already got a linear differential equation with constant
coefficients plus an integral term. Performing the same transformation of expression
[4.10], we obtain a differential equation of higher order:

c h'''(x) = (λ + δ − 2c) h''(x) + (2λ + 2δ − λP(0) − c) h'(x)
   + (λ + δ − λP(0) − λP'(0)) h(x) − λ \int_0^x h(x − y) e^{−y} P''(y) dy.   [4.11]

Finally, to establish the relation between the function h(x) and its further
derivatives, we repeat the previous cycle, getting:

c h^{(4)}(x) = (λ + δ − 3c) h'''(x) + (3λ + 3δ − λP(0) − 3c) h''(x)
   + (3λ + 3δ − 2λP(0) − λP'(0) − c) h'(x) + (λ + δ − λP(0) − λP'(0) − λP''(0)) h(x)
   − λ \int_0^x h(x − y) e^{−y} P'''(y) dy.   [4.12]

Each time, we obtain a homogeneous linear differential equation with constant
coefficients plus an integral term. To make clear that such a statement is true for an
equation of any order, Table 4.1 shows how the new coefficients are related to those
of the previous equation.

            | c h'(x)  | c h''(x)       | c h'''(x)                 | c h^{(4)}(x)                          | ...
  h(x)      | λ + δ    | λ − λP(0) + δ  | λ − λP(0) − λP'(0) + δ    | λ − λP(0) − λP'(0) − λP''(0) + δ      | ...
  h'(x)     | 0        | λ + δ − c      | 2λ + 2δ − λP(0) − c       | 3λ + 3δ − 2λP(0) − λP'(0) − c         | ...
  h''(x)    | 0        | 0              | λ + δ − 2c                | 3λ + 3δ − λP(0) − 3c                  | ...
  h'''(x)   | 0        | 0              | 0                         | λ + δ − 3c                            | ...
  ...       | ...      | ...            | ...                       | ...                                   | ...

Table 4.1. Relations between coefficients

The lth column of Table 4.1 presents the coefficients of the expression c h^{(l)}(x)
corresponding to h(x) (the first row) and to h^{(j−1)}(x) (the jth row), j ≥ 1. Hence, it is
not difficult to see that the coefficients on the main diagonal have the form λ + δ − kc
for non-negative integer k. This is clear from the procedure of obtaining the equation of
order k + 1 from that of order k. The same reasoning applies to the expressions in the first
row of the table (the coefficients of h(x)), having the form λ + δ (in the first column) and
λ(1 − P(0) − P'(0) − ... − P^{(k−1)}(0)) + δ (in the (k + 1)th column for any positive
integer k). The remaining non-zero coefficients (all non-zero entries lie on or above the main
diagonal) are obtained by the rule d_{i,j} = d_{i,j−1} + d_{i−1,j−1}, where
d_{i,j} stands in the ith row and jth column. Obviously, all the coefficients are constant.

In all of the equations, along with the derivatives, there is an integral term. However,
the order of the derivative of the polynomial P(y) under the integral sign increases each
time we pass from the kth equation to the (k + 1)th one. Thus, using induction, we obtain:

c h^{(m+2)}(x) = (λ + δ − (m + 1)c) h^{(m+1)}(x)
   + ... + \left( λ + δ − λ \sum_{i=0}^{m} P^{(i)}(0) \right) h(x)
   − λ \int_0^x h(x − y) e^{−y} P^{(m+1)}(y) dy.

Since the (m + 1)th derivative of the polynomial is zero, the integral term
disappears. It follows from the proved theorem that it is easier to find the optimal
barrier for 0 ≤ Q ≤ b if the d.f. satisfies the condition dF(y) = e^{−y} P(y) dy,
where P(y) is a polynomial of degree m. An example of such a distribution is
Γ(m + 1, 1), where m is a non-negative integer. In this case, the density has the form
ϕ(y) = \frac{1}{m!} y^m e^{−y}.

An explicit solution is obtained for the case m = 0 (exponential distribution),
whereas for m ≥ 1, a numerical analysis is carried out. The program is written in the
Python programming language.

Therefore, we assume further that the claim amount has the exponential
distribution with parameter γ, that is, F(y) = 1 − e^{−γy}, where γ is the inverse of the
mathematical expectation. Since ϕ(y) = γe^{−γy} (the polynomial degree is equal to
zero), proceeding as in the general case, we obtain a homogeneous linear differential
equation of the second order with constant coefficients. In fact, we obtain

c h''(x) − (λ + δ − cγ) h'(x) − δγ h(x) = 0.

Let r1 and r2 be the roots of the characteristic equation:

c r^2 − (λ + δ − cγ) r − δγ = 0.

The general solution of the differential equation under consideration has the form:

h(x) = C_1 e^{r_1 x} + C_2 e^{r_2 x},

where C1 and C2 are constants to be determined.

Due to our assumptions, all the parameters are positive. It follows immediately
that the signs of roots are different. For certainty, suppose that r1 > 0 and r2 < 0.

LEMMA 4.3.– The following relation holds:

\frac{C_1}{C_2} = − \frac{r_1 + γ}{r_2 + γ}.

PROOF.– We substitute the explicit form of h(x) into [4.8]. After calculating h'(x)
and the integral in this equation, we set x = 0, obtaining:

C_1 r_1 + C_2 r_2 = \frac{λ + δ}{c} C_1 + \frac{λ + δ}{c} C_2.

According to Vieta's theorem, we have \frac{λ + δ − cγ}{c} = r_1 + r_2, in other words,
\frac{λ + δ}{c} = γ + r_1 + r_2. Using this relation, we obtain 0 = (γ + r_2) C_1 + (γ + r_1) C_2,
whence it follows immediately that C_1/C_2 has the desired form, thus ending the proof.

The inequality r_2 + γ > 0 is also valid. Therefore, according to Lemma 4.3, C_1/C_2 < 0
and the constants C_1 and C_2 have different signs. Our aim is to find a positive solution
h(x) for 0 < x < ∞. To this end, we need C_1 > 0 (this is easy to understand by
letting x tend to +∞ in the expression for h(x)); hence, C_2 < 0 and C_1 + C_2 > 0 (we
want h(0) > 0).

Thus, we established the form of h(x) and found the restrictions on the constants.

The last step is to find the optimal barrier b* in the set [0, +∞). In order to
minimize the derivative h'(x), we have to find the root of the equation h''(x) = 0.
Hence, we have to solve the following equation:

C_1 r_1^2 e^{r_1 b*} + C_2 r_2^2 e^{r_2 b*} = 0,

giving

b* = \frac{1}{r_1 − r_2} \ln\left( − \frac{C_2 r_2^2}{C_1 r_1^2} \right).

The right-hand side is well defined, since we have already established that:

− \frac{C_2}{C_1} = \frac{r_2 + γ}{r_1 + γ}.

Thus, we have obtained the following expression:

b* = \frac{1}{r_1 − r_2} \ln\left( \frac{(r_2 + γ) r_2^2}{(r_1 + γ) r_1^2} \right).   [4.13]

4.3.4. Numerical analysis

We consider here claim distributions with density ϕ(y) = γe^{−γy}, that is, the
exponential distribution with parameter γ. To find the optimal barrier b*, formula
[4.13] is used.

In the following, we provide the Python code for solving the problem of barrier
calculation in a particular case and the results obtained:
import math
import numpy as np

delt = 0.05   # force of interest
c = 1.1e6     # annual income (10 percent margin)
lam = 1e3     # average NUMBER of claims
gam = 1e-3    # inverse average SIZE of a claim

# roots of the characteristic equation c*r^2 - (lam + delt - c*gam)*r - delt*gam = 0
r = np.roots([c, -(lam + delt - c*gam), -delt*gam])

# optimal barrier, formula [4.13]
barrier = math.log(((r[1] + gam)*r[1]*r[1]) /
                   ((r[0] + gam)*r[0]*r[0]), math.e) / (r[0] - r[1])
print(barrier)

>>> 112450.45449333431  # the optimal barrier

In Table 4.2, the optimal values of the barrier b* are given for some parameters under
the additional condition (1/γ) · λ = 1,000,000.

1/γ      λ        +5%              +10%             +15%             +20%
                  c = 1,050,000    c = 1,100,000    c = 1,150,000    c = 1,200,000
100      10,000   25,705           16,392           12,571           10,454
500      2,000    93,674           64,050           50,440           42,574
1,000    1,000    156,529          112,450          90,094           76,750
5,000    200      424,339          375,834          322,837          284,878
10,000   100      573,129          588,160          532,745          482,573

Table 4.2. Case 1


The results for the case (1/γ) · λ = 5,000,000 are given in Table 4.3.

1/γ      λ        +5%              +10%             +15%             +20%
                  c = 5,250,000    c = 5,500,000    c = 5,750,000    c = 6,000,000
100      50,000   32,530           19,944           15,043           12,387
500      10,000   128,526          81,962           62,857           52,269
1,000    5,000    227,297          148,555          115,041          96,199
5,000    1,000    782,644          562,252          450,470          383,749
10,000   500      1,253,205        965,726          792,042          682,949

Table 4.3. Case 2

In both cases, we calculated c in such a way as to provide a safety loading of 5%,
10%, 15% and 20%. Note that increasing the safety loading leads to a decrease of the
optimal barrier level.

4.4. Conclusion and further research directions

In section 4.2, we investigated a discrete-time model and established its
stability with respect to small perturbations of the underlying distribution in terms of
the Kantorovich metric. Moreover, we carried out the optimization of company
performance in the framework of the cost approach, proving that the best strategy of
bank loans is determined by an increasing sequence of critical levels. However, we
supposed that the quota share reinsurance treaty is the same for all periods.
The next steps are the consideration of non-proportional reinsurance and the search
for optimal reinsurance depending on the length of the planning horizon. It is also
interesting to deal with other metrics and prove limit theorems, as in Bulinskaya
and Gusak (2016).

A problem, proposed in the book by Bühlmann (1970), is solved in section 4.3. The
company capital is described by a compound Poisson process controlled by dividend
strategy. The expected discounted dividends until ruin are chosen as the objective
function. For the barrier strategy, the explicit form of the linear differential equation
is established if the claim amounts have the density ϕ(y) = P (y)e−y , where P (y) is
a polynomial of degree m. Gamma-distributions with integer parameter belong to this
class. Further investigation includes the sensitivity analysis of such a model and the
consideration of more complicated models including dependence between the claim
amounts and their number, investment in risky and non-risky assets and taxes. Other
dividend strategies can be considered (see Bulinskaya (2018)). Due to a lack of space,
these results will be published in another paper.

4.5. References

Albrecher, H. and Thonhauser, S. (2009). Optimality results for dividend problems in insurance.
Revista de la Real Academia de Ciencias Exactas, Fisicas y Naturales. Serie A. Matematicas,
103(2), 295–320.
Bellman, R. (1957). Dynamic Programming. Princeton University Press, Princeton, NJ.
Bernstein, P.L. (1996). Against the Gods: The Remarkable Story of Risk. John Wiley and Sons,
Inc., New York.
Bulinskaya, E. (2017). New research directions in modern actuarial sciences. Springer
Proceedings in Mathematics and Statistics, 208, 349–408.
Bulinskaya, E. (2018). Asymptotic analysis and optimization of some insurance models.
Applied Stochastic Models in Business and Industry, 34(6), 762–773.
Bulinskaya, E. and Gusak, J. (2016). Optimal control and sensitivity analysis for two risk
models. Communications in Statistics – Simulation and Computation, 45, 1451–1466.
Bühlmann, H. (1970). Mathematical Methods in Risk Theory. Springer, Berlin, Heidelberg,
New York.
Cramér, H. (1955). Collective Risk Theory: A Survey of the Theory from the Point of View of the
Theory of Stochastic Process. Ab Nordiska Bokhandeln, Stockholm.
De Finetti, B. (1957). Su un’impostazione alternativa della teoria collettiva del rischio.
Transactions of the XV International Congress of Actuaries, 433–443.
Dickson, D.C.M. and Waters, H. (2004). Some optimal dividends problems. ASTIN Bulletin,
34, 49–74.
Gerber, H. (1969). Entscheidungskriterien fur den zusammengesetzten Poisson-Prozess.
Schweizerische Vereinigung der Versicherungsmathematiker Mitteilungen, 69, 185–228.
Gordon, M.J. (1959). Dividends, earnings and stock prices. Review of Economics and Statistics,
41, 99–105.
Lundberg, F. (1903). Approximerad Framstallning av sannolikhetsfunktionen. Aterforsakering
av Kollektivrisker. Akad. Afhandling, Almqvist o. Wiksell, Uppsala.
Miller, M.H. and Modigliani, F. (1961). Dividend policy, growth, and the valuation of shares.
The Journal of Business, 34(4), 411–433.
5

Statistical Analysis of Traffic Volume in the 25 de Abril Bridge

Bridges are important structures. They are used on land transportation to connect
different points that are usually inaccessible. Loading forces due to traffic volume and
flow are important physical factors that affect the bridge’s structural reliability. Thus,
for safety assessments, it is important to monitor and study traffic volume. In this
work, we analyze the traffic data on the 25 de Abril Bridge in Portugal. The aim is to
study the tail distribution.

5.1. Introduction

Bridges are the structures that allow people and vehicles to cross a space between
two elevations. They are used to join roads, as well as to connect the two banks of a
body of water, like a lake or river, or a deep opening, like a valley. The assessment of
the safety of existing bridges has received technical and scientific attention, partly due
to the occurrence of grave accidents in these structures. For safety assessments, it is
thus important to monitor and study traffic volume and flow on bridges. In this work,
we analyze the traffic volume data on the 25 de Abril Bridge, in Portugal (Figure 5.1).
One main concern is the analysis of high traffic since it can lead to long periods of
traffic congestion which can result in higher probabilities of failure of the bridge in
its lifetime. The 25 de Abril Bridge opened on the 6th of August 1966 and connects
Lisbon to the southern side of the Tagus River. This is the longest suspension bridge in
Europe, with a total length of 2,277 meters. It has two levels: an upper level for cars,
with a three-lane roadway in each direction separated by a dividing guardrail, and a lower
level, built in 1999, for trains. Due to its similarity, and because it was manufactured by
the same company, it is often compared to the Golden Gate Bridge in San Francisco.

Chapter written by Frederico C AEIRO, Ayana M ATEUS and Conceicao V EIGA DE A LMEIDA.


The rest of this chapter is organized as follows: in section 5.2 we describe the data
under study. In section 5.3, we review the extreme value methodology used in this
work. Finally, in section 5.4, we apply the extreme value models to infer the extremal
behavior of the traffic volume and provide some concluding remarks.

Figure 5.1. 25 de Abril Bridge and the Sanctuary of Christ the King monument (to
the right of the photo) in the city of Almada. The photo was taken by the first author
in September 2019. For a color version of this figure, see www.iste.co.uk/zafeiris/
data1.zip

5.2. Data

The traffic data we considered in our analysis was provided by INE (Instituto
Nacional de Estatística/Statistics Portugal) and by IMT (Instituto da Mobilidade e dos
Transportes, I.P.). Although there are only tolls in the South–North direction, traffic is
also counted in the other direction through sensors placed on the road surface. The available
data consists of the number of vehicles; no information is available regarding the
class of a vehicle and the corresponding load. INE provides an archive of public
data with easy online access. Regarding traffic volume, the data obtained from INE
consists of the annual and monthly average daily traffic between 1998 and 2019. To
study variations in the traffic, including the extreme values, the daily average could
be meaningless; daily (or hourly) observations are more appropriate for making
inferences about the right tail. Daily values from January 1, 2010 to December 31, 2018
were provided on request by IMT. We also obtained from IMT annual and monthly
average daily data for the years before 1998.

Figure 5.2 shows the annual average daily traffic from 1966 to 2019. The years
from 1966 to 2001 correspond to a period of traffic growth. After 2001, the annual
average daily traffic appears to be stationary, with a change point in 2010.
Note that the year 2001 corresponds to the beginning of the Portuguese economic
downturn and 2010 corresponds to the beginning of the sovereign debt crisis.
In Figure 5.3, we present the time series plot of the daily traffic volume.
The plot shows strong seasonality within each year: the traffic volume is
smaller in the winter months (December–February) and higher in the summer months
(June–August). The three smallest values occurred on February 9, 2014 (82,408
vehicles), March 20, 2016 (82,654 vehicles) and March 11, 2018 (88,765 vehicles).
The smallest number of vehicles was a consequence of strong wind: the central
lanes were closed, and traffic was closed to motorcycles and vehicles with canvas
hoods. The other two dates coincide with the Lisbon Half Marathon, when the bridge
was closed to vehicles for several hours. The highest number of vehicles registered in
the period 2010–2018 occurred on July 2, 2010 (180,846 vehicles).

Figure 5.2. Annual average daily traffic volume for the 25 de Abril Bridge, between 1966 and 2019

Figure 5.3. Daily traffic volume for the 25 de Abril Bridge, between January 1, 2010 and December 31, 2018

5.3. Methodology

The objective of extreme value theory (EVT) is to quantify the stochastic


behaviour of extreme events, such as extreme climate events, a stock market crash or
a new world record in athletics. The domains of application of EVT are quite diverse
and include fields such as biology, hydrology, meteorology, geology, insurance,
finance, structural engineering, sports and telecommunications. Thus, EVT provides a
framework to model the tail behaviour and a tool to predict the likelihood of extreme
events.

5.3.1. Main limit results

Let (X1 , . . . , Xn ) be a sample of independent and identically distributed (iid)


random variables from an underlying population with unknown distribution function
(df) F . Here, and due to the nature of the problem under study, we will always deal
with the right tail of F . Since:

min(X1 , . . . , Xn ) = − max(−X1 , . . . , −Xn ),

results for the left tail can be easily derived from the analogous results for the right
tail. Fréchet (1927) and Fisher and Tippett (1928) were the first to derive asymptotic
probability models for the transformed sample maximum. The first fundamental
limit result is due to Gnedenko (1943) who fully characterized the three possible
non-degenerate limit distributions of the linearly normalized sample maximum of
iid random variables (see also von Mises (1964) 1). This result is now known as the
extremal types theorem. Let X_{(n)} = \max_{1 \le i \le n}(X_i) be the sample maximum. Let
us also assume that there exist normalizing constants a_n > 0, b_n ∈ R and some
non-degenerate df G such that, for all x,

\lim_{n \to ∞} P\left( \frac{X_{(n)} − b_n}{a_n} \le x \right) = G(x).   [5.1]

With the appropriate choice of the normalizing constants, G must be one of the
three limit models, which may be unified in the generalized extreme value (GEV)
distribution,

G(x) ≡ G(x|ξ) := \begin{cases} \exp\left( −(1 + ξx)^{−1/ξ} \right), & 1 + ξx > 0, & if ξ ≠ 0, \\ \exp(−\exp(−x)), & x ∈ R, & if ξ = 0, \end{cases}   [5.2]

here presented in the von Mises–Jenkinson form (Jenkinson 1955; von Mises 1964).
When the non-degenerate limit in [5.1] exists, we say that F belongs to the

1 This reference is a reprint of the 1936 edition, found at: von Mises, R. (1936). La distribution
de la plus grande de n valeurs, Rev., Math, Union Interbalcanique, 1, 141–160.

max-domain of attraction of G and write F ∈ D(G). The shape parameter ξ is the
extreme value index (EVI), the most important parameter associated with extreme
events. This real parameter weights the upper tail of F. As ξ increases, the probability
of occurrence of extreme values of X becomes higher. The GEV model unifies the
three possible max-stable limit distributions: the Weibull (ξ < 0), the Gumbel
(ξ = 0) and the Fréchet (ξ > 0). The GEV distribution is nowadays a common model
for extreme value analysis since it covers all three forms of extreme value distributions.

Another important result in the field of EVT is the joint limiting distribution of
the r largest order statistics (with r fixed). We will assume that equation [5.1] holds,
i.e. (X_{(n)} − b_n)/a_n converges in distribution to G(x), with adequate normalizing
constants a_n > 0 and b_n ∈ R. Then, the joint limiting distribution of the normalized
r largest order statistics

\left( \frac{X_{(n)} − b_n}{a_n}, \frac{X_{(n−1)} − b_n}{a_n}, \ldots, \frac{X_{(n−r+1)} − b_n}{a_n} \right),

with X_{(n)} ≥ X_{(n−1)} ≥ \ldots ≥ X_{(n−r+1)}, is the multivariate GEV model (Dwass
1964), with an associated probability density function given by:

h_r(x_{(n)}, x_{(n−1)}, \ldots, x_{(n−r+1)}) = g(x_{(n−r+1)}) \prod_{i=1}^{r−1} \frac{g(x_{(n−i+1)})}{G(x_{(n−i+1)})},   [5.3]

if x_{(n)} > x_{(n−1)} > \ldots > x_{(n−r+1)}, where g(x) = ∂G(x)/∂x and G(x) is the GEV
distribution given in [5.2]. Note that, for r = 1, equation [5.3] corresponds to the
density function of the GEV distribution, as expected. Also, if we consider the extreme
order statistic X_{(n−k+1)} for some fixed k, we have (Arnold et al. 1992):

\lim_{n \to ∞} P\left( \frac{X_{(n−k+1)} − b_n}{a_n} \le x \right) = G(x) \sum_{i=0}^{k−1} \frac{(−\ln G(x))^i}{i!}.   [5.4]

If k = 1, the limit distribution in equation [5.4] is the GEV distribution in


equation [5.2]. There is thus a strong relationship between the asymptotic distribution
of the sample maximum, X(n) , the asymptotic distribution of the r largest order
statistics and the extreme order statistic X(n−k+1) , with k fixed. Other important limit
results outside the scope of this paper can be found in other books (Leadbetter et al.
1983; Arnold et al. 1992; Coles 2001; David and Nagaraja 2003; de Haan and Ferreira
2006). For an overview of several topics in the field of EVT, see Beirlant et al. (2012),
Davison and Huser (2015) and Gomes and Guillou (2015).

5.3.2. Block maxima method

The block maxima method consists of dividing the initial sample into disjoint
blocks of equal size and fitting the GEV model in equation [5.2] to the sample of
block maxima. The size of the block is important due to the usual trade-off between
bias (small block size) and variance (large block size). When working with time-series
data, it is usual to choose the block length as one year. This choice allows us to assume
that the block maxima are iid, even though the data has serial dependence. The limit in
equation [5.1] justifies the following approximation, for large values of n:

P(X_{(n)} \le z) ≈ G\left( \frac{z − b_n}{a_n} \right).

Because the GEV model provides only an approximation to the distribution of
X_{(n)}, bias due to model misspecification can occur. Since the normalizing constants
a_n > 0 and b_n ∈ R are unknown, they are incorporated in the GEV distribution as
location and scale parameters, λ and δ, leading to the model:

G(z|ξ, λ, δ) := \begin{cases} \exp\left( −\left(1 + ξ \frac{z − λ}{δ}\right)^{−1/ξ} \right), & 1 + ξ \frac{z − λ}{δ} > 0, & if ξ ≠ 0, \\ \exp\left( −\exp\left( −\frac{z − λ}{δ} \right) \right), & z ∈ R, & if ξ = 0. \end{cases}   [5.5]

Next, we fit the GEV model in equation [5.5] to the block maxima sample.
The estimation of the parameters (ξ, λ, δ) is usually performed using the maximum
likelihood method or the probability weighted moment (PWM) method (Hosking
et al. 1985). Since the support of the GEV model may depend on its parameters, the
asymptotic normality of the maximum likelihood estimators may not hold. However,
if ξ > −0.5, the maximum likelihood estimators are consistent and asymptotically
normal (Smith 1985). Regarding PWM estimators, consistency and asymptotically
normality can be guaranteed for ξ < 1 and ξ < 0.5, respectively. Note that in
practical applications, we often have −0.5 < ξ < 0.5. Additional asymptotic results
for the block maxima method were recently presented in Bücher and Segers (2017)
and Dombry and Ferreira (2019).

Model checking can be done with a histogram, a probability plot, a quantile plot or
with a return level plot with empirical estimates of the return level function (see Coles
(2001) and Reiss and Thomas (2007) for further details).
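As a minimal illustration of the block maxima fit (not the analysis of section 5.4 itself), the sketch below uses SciPy's genextreme distribution; note that SciPy parameterizes the shape as c = −ξ, and the simulated data merely stands in for a sample of yearly maxima.

import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(42)

# Stand-in for a sample of nine yearly block maxima
true_xi = -0.13
maxima = genextreme.rvs(c=-true_xi, loc=170_000, scale=3_800, size=9, random_state=rng)

# Maximum likelihood fit of the three GEV parameters (xi, lambda, delta)
c_hat, loc_hat, scale_hat = genextreme.fit(maxima)
xi_hat = -c_hat
print(f"xi = {xi_hat:.3f}, lambda = {loc_hat:.1f}, delta = {scale_hat:.1f}")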

5.3.3. Largest order statistics method

When analyzing extreme values with the block maxima method, we often miss
several extreme observations. This problem has motivated researchers to use more
extreme values from the sample. Smith (1986) and Weissman (1978) were the first
to make inference with a model based on the r-largest order statistics from each
block. Under this approach, the initial sample is divided into blocks and we select
the r-largest order statistics from each block. Then, the model in equation [5.3] with
additional location and scale parameters λ and δ > 0 is fitted to the data. The
estimation is usually performed by maximum likelihood. As with the choice of the

block length, the choice of the parameter r accommodates a trade-off between bias
(large r) and variance (small r). In practice, it is advisable not to choose r too large
(Smith 1986).

R EMARK 5.1.– Note that both probabilistic models used in sections 5.3.2 and 5.3.3
share the same shape, location and scale parameters, (ξ, λ, δ). Therefore, it is usual
to estimate those parameters, using the r-largest order statistics method, and then
incorporate those estimates in the GEV model in equation [5.5] to estimate other
important parameters.

5.3.4. Estimation of other tail parameters

Estimation of the model parameters is an important first step for further inference
in the tail. The second and most important step is to yield precise inference about
the tail behaviour of F, more precisely, to estimate parameters such as an upper tail
probability, an extreme quantile or the right endpoint of F, whenever finite.

An upper tail probability is the probability p (p small) that the block maximum exceeds
some high value y_p. It can be estimated by 1 − G(y_p | ξ̂, λ̂, δ̂), where G is the GEV df
in equation [5.5].

Extreme quantiles exceeded with probability p by the block maximum can be
obtained by inverting the GEV df in equation [5.5] and replacing the parameters by
the corresponding estimates,

q̂_{1−p} := G^{←}(1 − p | ξ̂, λ̂, δ̂).

The quantile q_{1−p} is also the level expected to be exceeded on average once every
1/p years. We usually say that q_{1−p} is the return level associated with the return period
1/p. A plot of the return period (on a logarithmic scale) versus the return level is called
a return level plot.

Let ω = sup{x : F(x) < 1} denote the right endpoint of the GEV model. If
ξ < 0, the right endpoint is finite and can be estimated by:

ω̂ = λ̂ − δ̂ / ξ̂.
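Once the GEV parameters have been estimated, return levels and the right endpoint follow directly from the formulas above. In the sketch below, the quantile function is written out explicitly for ξ ≠ 0, and the (rounded) estimates reported later in Table 5.1 for r = 1 are used as input.

import math

def gev_return_level(p, xi, loc, scale):
    # Level exceeded by the yearly maximum with probability p,
    # i.e. the return level for the return period 1/p (formula for xi != 0)
    return loc + (scale / xi) * ((-math.log(1.0 - p)) ** (-xi) - 1.0)

xi_hat, loc_hat, scale_hat = -0.132, 170_156.7, 3_778.9  # rounded values from Table 5.1 (r = 1)

for m in (10, 50, 100):
    print(f"{m}-year return level: {gev_return_level(1.0 / m, xi_hat, loc_hat, scale_hat):,.0f}")

# Finite right endpoint estimate (only meaningful when xi < 0)
print(f"endpoint estimate: {loc_hat - scale_hat / xi_hat:,.0f}")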

5.4. Results and conclusion

The models presented in section 5.3 will now be applied to the traffic data of
the 25 de Abril Bridge. We will consider only the period where daily values are
available (2010–2018). Due to yearly seasonality, the block is defined as one year.

All computations were done in R software, with package ismev (Heffernan and
Stephenson 2018). Table 5.1 shows the maximized log-likelihood (ll_0), the parameter
estimates and the standard errors (in parentheses) of the GEV (r = 1) and of the
multivariate GEV model with 2 ≤ r ≤ 5.

r    ll_0        λ̂                          δ̂                        ξ̂
1    −87.599     170,156.651 (1,409.031)    3,778.887 (1,026.358)    −0.132 (0.240)
2    −168.730    172,045.883 (1,404.189)    4,348.683 (664.485)      −0.346 (0.164)
3    −244.945    172,548.763 (1,255.496)    4,071.888 (526.778)      −0.314 (0.148)
4    −318.084    172,636.307 (1,109.469)    3,858.426 (464.409)      −0.277 (0.123)
5    −388.923    172,390.040 (986.444)      3,546.707 (357.260)      −0.250 (0.097)

Table 5.1. Maximized log-likelihood (ll_0), parameter estimates and standard errors
(in parentheses) of the GEV (r = 1) and of the multivariate GEV model with 2 ≤ r ≤ 5

Comparing the results, we note that both estimates and standard errors change with
different values of r. The standard errors decrease as r increases. Due to a possible
increase of bias, it is advisable to not let r be too large. Coles (2001) suggests choosing
r as large as possible, subject to diagnostics of the fit.

Figure 5.4. Diagnostic plots (probability plot, quantile plot, return level plot and density plot) of the GEV model fit to the yearly maximum from the daily traffic data of the 25 de Abril Bridge. For a color version of this figure, see www.iste.co.uk/zafeiris/data1.zip

We validated the fitted model using the histogram, the probability plot, the quantile
plot and the return level plot. These plots confirm that the fit is more satisfactory for
r = 1. In Figure 5.4, we present the diagnostic plots of the GEV distribution based on
the block maxima method (r = 1).

Using the delta method, the asymptotic 95% confidence intervals for the
parameters ξ, λ and δ are, respectively, (−0.603, 0.338), (167395.0, 172918.3) and
(1767.262, 5790.513). Despite the fact that the point estimate of the shape parameter
ξ is negative, the corresponding confidence interval includes the value zero. Therefore,
we do not have enough evidence to assume that the Weibull model is the most
appropriate one. The likelihood ratio test statistic is equal to 0.292 which suggests
that the Gumbel model could be adequate. Nevertheless, we decided to take the safest
decision and prefer to model the tail within the GEV family of distributions.

In Table 5.2, we provide estimates and confidence intervals for the m-year return
level (m = 10, 50, 100). Assuming the stationarity of future extreme values, we expect
the daily traffic to remain below 195,000 vehicles during the next 100 years. Also, since the
estimate of the shape parameter is negative, the endpoint estimate is 198,690 vehicles.

Return period Return level 95% confidence interval for the return level
10 177,511 (173,115, 181,906)
50 181,672 (173,267, 190,076)
100 183,175 (172,367, 193,982)

Table 5.2. Return period and estimates of the


return level with 95% confidence interval

5.5. Acknowledgements

This work was partially funded by national funds through the FCT – Fundação
para a Ciência e a Tecnologia, I.P., under the scope of the project UIDB/00297/2020
(Center for Mathematics and Applications).

5.6. References

Arnold, B.C., Balakrishnan, N., Nagaraja, H.N. (1992). A First Course in Order Statistics.
Wiley, New York.
Beirlant, J., Caeiro, F., Gomes, M.I. (2012). An overview and open research topics in statistics
of univariate extremes. Revstat – Statistical Journal, 10(1), 1–31.
Bücher, A. and Segers, J. (2017). On the maximum likelihood estimator for the generalized
extreme-value distribution. Extremes, 20, 839–872.

Coles, S. (2001). An Introduction to Statistical Modeling of Extreme Values. Springer-Verlag,
London.
David, H.A. and Nagaraja, H.N. (2003). Order Statistics, 3rd edition. Wiley, Hoboken, NJ.
Davison, A.C. and Huser, R. (2015). Statistics of extremes. Annual Review of Statistics and Its
Application, 2(1), 203–235.
Dombry, C. and Ferreira, A. (2019). Maximum likelihood estimators based on the block
maxima method. Bernoulli, 25(3), 1690–1723.
Dwass, M. (1964). Extremal processes. Annals of Mathematical Statistics, 35, 1718–1725.
Fisher, R.A. and Tippett, L.H.C. (1928). Limiting forms of the frequency of the largest or
smallest member of a sample. Proceedings of the Cambridge Philosophical Society, 24,
180–190.
Fréchet, M. (1927). Sur le loi de probabilité de l’écart maximum. Annales de la Société
polonaise de mathématique, 6, 93–116.
Gnedenko, B.V. (1943). Sur la distribution limite du terme maximum d’une série aléatoire.
Annals of Mathematics, 44, 423–453.
Gomes, M.I. and Guillou, A. (2015). Extreme value theory and statistics of univariate extremes:
A review. International Statistical Review, 83(2), 263–292.
de Haan, L. and Ferreira, A. (2006). Extreme Value Theory: An Introduction. Springer
Science+Business Media LLC, New York.
Heffernan, J.E. and Stephenson, A.G. (2018). Package “ismev”: An introduction to statistical
modeling of extreme values, version 1.42. Document, May 11.
Hosking, J., Wallis, J., Wood, E. (1985). Estimation of the generalized extreme value
distribution by the method of probability-weighted moments. Technometrics, 27(3),
251–261.
Jenkinson, A.F. (1955). The frequency distribution of the annual maximum (or minimum)
values of meteorological elements. Quarterly Journal of the Royal Meteorological Society,
81, 158–171.
Leadbetter, M.R., Lindgren, G., Rootzén, H. (1983). Extremes and Related Properties of
Random Sequences and Processes. Springer-Verlag, New York, Berlin.
von Mises, R. (1964). La distribution de la plus grande de n valeurs. Selected Papers of Richard
von Mises, American Mathematical Society, 2, 271–294.
Reiss, R.D. and Thomas, M. (2007). Statistical Analysis of Extreme Values: With Applications
to Insurance, Finance, Hydrology and Other Fields, 3rd edition. Birkhäuser, Berlin.
Smith, R.L. (1985). Maximum likelihood estimation in a class of nonregular cases. Biometrika,
72(1), 67–90.
Smith, R.L. (1986). Extreme value theory based on the r largest annual events. Journal of
Hydrology, 86, 27–43.
Weissman, I. (1978). Estimation of parameters and large quantiles based on the k largest
observations. Journal of the American Statistical Association, 73, 812–815.
6

Predicting the Risk of Gestational Diabetes Mellitus through Nearest Neighbor Classification

Gestational diabetes mellitus (GDM) may arise as a complication of pregnancy


and can adversely affect both mother and child. Diagnosis of this condition is
carried out through screening coupled with an oral glucose test. This procedure
is costly and time-consuming. Therefore, it would be desirable if a clinical risk
assessment method could filter out any individuals who are not at risk of acquiring
this disease. This problem can be tackled as a binary classification problem.
In this study, our aim is to compare and contrast the results obtained through
binary logistic regression (BLR), used in previous studies, and three well-known
non-parametric classification techniques, namely the k-nearest neighbors (kNN)
method, the fixed-radius-NN method and the kernel-NN method. These techniques
were selected due to their relative simplicity, applicability, lack of assumptions and
nice theoretical properties. The test dataset contains information related to 1,368
subjects across 11 Mediterranean countries. Using various performance measures, the
results revealed that NN methods succeeded in outperforming the BLR method.

6.1. Introduction
The World Health Organization (WHO) defines diabetes mellitus as “a chronic,
metabolic disease characterized by elevated levels of blood glucose (or blood sugar),
which leads over time to serious damage to the heart, blood vessels, eyes, kidneys
and nerves”. Gestational diabetes mellitus (GDM) is a form of diabetes which arises

Chapter written by Louisa TESTA, Mark A. CARUANA, Maria KONTORINAKI and Charles SAVONA-VENTURA.


as a complication of pregnancy. Alfadhli (2015) discusses how, worldwide, the


prevalence of this disease fluctuates between 1% and 20%, and these rates are higher
for certain ethnic groups such as Indian, African, Hispanic and Asian women. In 2010,
the International Association of Diabetes and Pregnancy Study Group (IADPSG)
established new criteria for the diagnosis of GDM; pregnant women are screened,
and the oral glucose tolerance test (OGTT) is used for diagnosis.

Savona-Ventura et al. (2013) explained that the screening, as well as the OGTT, are
costly diagnostic methods. To this end, an alternative clinical risk assessment method
for GDM, based on explanatory variables that can be easily measured at minimal cost,
is sought to preclude these tests, especially in countries and health centers dealing
with budget cuts and a lack of resources. The prediction of the risk of an individual
acquiring GDM is a problem that can be tackled using a variety of classification
techniques. Savona-Ventura et al. (2013) applied binary logistic regression (BLR).
In the literature, some shortcomings of the BLR model devised in the study by
Savona-Ventura et al. (2013) are outlined. Kotzaeridi et al. (2021) remark that this
model tended to underestimate the risk of GDM. Furthermore, Lamain-de Ruiter
et al. (2017) found that this same model also involved a moderate risk of bias when
compared to other models. Thus, we seek alternative methods which may serve as
an improvement over the BLR model implemented by Savona-Ventura et al. (2013).
Nearest neighbor (NN) methods, which are non-parametric classification techniques,
were found to be commonly used in studies involving the prediction of diabetes
mellitus. Kandhasamy and Balamuali (2015) compared the performance of four
popular classification techniques, namely the J48 decision tree, the k-nearest neighbor
(kNN) classifier, random forests and support vector machines (SVMs), in predicting
the risk of diabetes mellitus for noisy (or inconsistent) data with missing values and
for consistent data. The study showed that the J48 decision tree performed best for the
noisy data, while random forests and the kNN classifier with k = 1 performed best
for the consistent data. Furthermore, Saxena et al. (2004) discuss in detail the use of
kNN in classifying diabetes mellitus. The authors applied this algorithm to a dataset
consisting of 11 variables, among which were glucose concentration, age, sex and
body mass index. Saxena et al. (2004) then analyzed the results obtained for k = 3 and
k = 5 through the use of well-known performance measures; the results of the study
led to the conclusion that the error rate increased for the larger value of k, and so better
results were obtained for k = 3.

In this chapter, our main aim is to test the applicability and the performance
of three well-known NN methods, to the problem of predicting the risk of GDM.
In particular, we focus on the application of the kNN method, the fixed-radius-NN
method and the kernel-NN method. These methods will be applied to a dataset
pertaining to 1,368 pregnant women from 11 Mediterranean countries. More
specifically, the dataset consists of 71 explanatory variables such as age, pre-existing
hypertension, menstrual cycle regularity and history of diabetes in the family. Since
the classification accuracy may be affected by factors such as the presence of missing
data or an imbalance between the considered classes, imputation techniques will be
implemented to deal with missing values, while the class imbalance between the
positive and negative cases for GDM will be tackled through a technique called
SMOTE-NC. The performance of these methods will then be evaluated and compared
through the use of various performance measures. In addition, the results obtained will
be compared to those obtained in Savona-Ventura et al. (2013), where BLR has been
applied to the same dataset.

The rest of this chapter is structured as follows. In section 6.2, we discuss in detail the NN techniques used in this chapter and present some important
convergence results. In section 6.3, we provide a thorough description of the dataset
used and the corresponding preliminary data analysis; we also present and compare the
results obtained from the implementation of NN methods and BLR. Finally, section
6.4 contains the concluding remarks of this study.

6.2. Nearest neighbor methods

6.2.1. Background of the NN methods

We begin this section by introducing some notations that will be used throughout
this chapter. The problem being tackled here involves binary prediction, which means
having two possible class labels: positive for GDM (1) and negative for GDM (0).
Thus, let Y be a random variable that represents a possible class label of an individual,
and let X = (X1 , . . . , Xp ) be a p-dimensional random vector whose components,
which are random variables, represent a certain feature in the dataset, for example, the
age of the mother and number of miscarriages. Also, let x = (x1 , . . . , xp ) be a vector
of observed values. In a classification problem, we make use of a dataset comprising
a finite sample of independent, identically distributed pairs (x1 , y1 ), . . . , (xn , yn ),
where yi indicates the class label of the ith observation, for i = 1, ..., n, and n
denotes the sample size. Then, we aim at using this dataset to estimate a function
Ŷ that, given a newly obtained observation/feature vector x, outputs a predicted label
Ŷ (x) ∈ {0, 1}. The function Ŷ is called a classifier. The best classifier in terms of
minimizing probability of error is the so-called Bayes classifier and is defined as
follows:

\[
\hat{Y}_{\mathrm{Bayes}}(x) = \operatorname*{argmax}_{y \in \{0,1\}} P(Y = y \mid X = x)
= \begin{cases} 1 & \text{if } P(Y = 1 \mid X = x) \ge P(Y = 0 \mid X = x) \\ 0 & \text{otherwise} \end{cases} \qquad [6.1]
\]

In words, the Bayes classifier compares the conditional probabilities P(Y = 0|X = x) and P(Y = 1|X = x), given the observed feature vector X = x and
if the former is higher than the latter, it predicts x to have label 0, otherwise it predicts
x to have label 1. By defining η(x) = P(Y = 1|X = x), it can easily be shown that
equation [6.1] can be re-written as follows:

\[
\hat{Y}_{\mathrm{Bayes}}(x) = \begin{cases} 1 & \text{if } \eta(x) \ge \tfrac{1}{2} \\ 0 & \text{otherwise} \end{cases} \qquad [6.2]
\]

It was shown by Chen and Shah (2018) that the Bayes classifier in equation [6.2]
is indeed the one that minimizes the probability of a misclassification. Thus, no
classification procedure can do better than the Bayes classifier. Unfortunately, in
classification, we do not know the Bayes classifier ŶBayes and have to estimate it
from training data. In the next sections, we will see how we can define/approximate
the function η using three different NN methods.

6.2.2. The k-nearest neighbors method

The kNN algorithm is a non-parametric method that is used for classification and
regression. In the former, to decide the class label of a feature vector, we consider the
k points in the set of observed data that are closest to the point of interest. An object is
allocated to the most common class among its k nearest neighbors, where k ∈ Z+ and
usually takes on a small value. If k = 1, then the object is merely predicted to belong
to the same class as that single nearest neighbor.

Using the set-up shown in the previous section, we now proceed to define an
estimate η̂ for η(x) = P(Y = 1|X = x) as follows:
\[
\hat{\eta}(x) = \frac{1}{k} \sum_{i=1}^{k} Y_{(i)}(x), \qquad [6.3]
\]

where Y(i) = 1 if the ith neighbor of x has label 1 and 0 otherwise. Hence, an estimate
for equation [6.2] is as follows:

\[
\hat{Y}_{kNN}(x) = \begin{cases} 1 & \text{if } \hat{\eta}(x) \ge \tfrac{1}{2} \\ 0 & \text{otherwise} \end{cases} \qquad [6.4]
\]
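As an illustration of equations [6.3] and [6.4], a minimal NumPy sketch of the kNN decision rule for a single test point might look as follows (the Euclidean distance and all variable names are assumptions made purely for this example):

import numpy as np

def knn_predict(x, X_train, y_train, k=5):
    # Equation [6.3]: average the labels of the k nearest training points,
    # then equation [6.4]: classify as 1 if this average is at least 1/2.
    dists = np.linalg.norm(X_train - x, axis=1)   # distances to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbors
    eta_hat = y_train[nearest].mean()             # estimate of eta(x)
    return int(eta_hat >= 0.5)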

Over the years, a number of results concerning upper and lower bounds on
misclassification errors of the kNN classifier as well as a number of convergence
guarantees have been proven. These can be found in Chaudhuri and Dasgupta (2014).
We now move on to discuss the fixed-radius NN method.

6.2.3. The fixed-radius NN method

The fixed-radius NN classification method is another technique that is used in tackling binary classification problems. This method is similar to kNN; however,
instead of determining the test point’s label by looking at its k nearest neighbors,
this point is assigned a class label through a majority vote of its neighbors that are
captured within a ball of radius r. We have to note here that if the radius r, which can
take any positive value, is not chosen carefully, then there is a risk of not finding any
points in the neighborhood of the test point.

According to the background presented in section 6.2.1, an estimate η̂ for η(x) = P(Y = 1|X = x), when considering the fixed-radius NN, is derived as follows:
\[
\hat{\eta}_{fr\text{-}NN}(x) = \begin{cases} \dfrac{\sum_{i=1}^{n} \mathbb{1}_{\rho(x,x_i) \le r}\, y_i}{\sum_{i=1}^{n} \mathbb{1}_{\rho(x,x_i) \le r}} & \text{if } \sum_{i=1}^{n} \mathbb{1}_{\rho(x,x_i) \le r} > 0 \\[2ex] 0 & \text{otherwise} \end{cases} \qquad [6.5]
\]

where ρ represents the considered distance function and 1{·} is an indicator function
taking the value of 1 if its argument is true and 0 otherwise. The difference between
equations [6.5] and [6.3] is that instead of taking the average of the labels of the
k-nearest neighbors of the test point, η̂f r−N N (x) is estimated by taking the average of
the labels of all points within distance r from the reference point x. Hence, an estimate
of equation [6.2] is obtained by replacing η with equation [6.5] in equation [6.2] and
we obtain:
 
\[
\hat{Y}_{fr\text{-}NN}(x) = \begin{cases} 1 & \text{if } \hat{\eta}_{fr\text{-}NN}(x) \ge \tfrac{1}{2} \text{ and } \sum_{i=1}^{n} \mathbb{1}_{\rho(x,x_i) \le r} > 0 \\ 0 & \text{otherwise} \end{cases} \qquad [6.6]
\]
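A corresponding sketch of the fixed-radius rule in equations [6.5] and [6.6], again assuming a Euclidean distance and illustrative variable names, could be:

import numpy as np

def fixed_radius_nn_predict(x, X_train, y_train, r):
    dists = np.linalg.norm(X_train - x, axis=1)
    in_ball = dists <= r               # indicator of training points within radius r
    if not in_ball.any():              # empty neighborhood: equation [6.6] returns 0
        return 0
    eta_hat = y_train[in_ball].mean()  # equation [6.5]: average label within the ball
    return int(eta_hat >= 0.5)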

Convergence results related to this method were discussed in detail in Chen and
Shah (2018). Next, we proceed to discuss the kernel-NN method.

6.2.4. The kernel-NN method

The kernel method makes use of a kernel function K : R+ → [0, 1] which takes the normalized distance between the test point and a training point (i.e. the
distance calculated using some distance metric, divided by a bandwidth parameter
h), and produces a similarity score, or weight, between 0 and 1. Therefore, in this
case, each training point is given a weighting depending on how far it is from the test
point. The aforementioned bandwidth parameter h controls this weighting; a lower
bandwidth would mean that only points very close to the test point would contribute
to the weighting, meaning that points far away would contribute zero or very little
weight. A higher bandwidth, by contrast, would mean that points further away
from the test point would give a slightly higher contribution. Furthermore, Chen and
Shah (2018) note that the kernel function is assumed to be monotonically decreasing
on the positive domain, meaning that the further away a training point is from the
test point, the lower its similarity score is. Guidoum (2015) remarks that, unlike the
choice of the bandwidth parameter h, the choice of the kernel function is not that
crucial since, as previously mentioned, the former is the one that affects which points
give the most contribution, yielding equally good results for different kernel functions.

In the literature, various kernel functions have been proposed. These include the uniform, Epanechnikov, normal, biweight and triweight kernels, among others. We
invite the interested reader to refer to Scheid (2004) and Guidoum (2015) for a
discussion on various types of kernels. Since the choice of the kernel function is not
crucial, throughout this chapter the Epanechnikov kernel was selected. Indeed, when
other kernels were selected the results did not change significantly.

As in the previous two sections, we now provide an estimate for the conditional
probability η. For kernel classification, this estimator is defined as follows:
\[
\hat{\eta}_{Kernel\text{-}NN}(x, h) = \begin{cases} \dfrac{\sum_{i=1}^{n} K\!\left(\frac{\rho(x,x_i)}{h}\right) y_i}{\sum_{i=1}^{n} K\!\left(\frac{\rho(x,x_i)}{h}\right)} & \text{if } \sum_{i=1}^{n} K\!\left(\frac{\rho(x,x_i)}{h}\right) > 0 \\[2ex] 0 & \text{otherwise} \end{cases} \qquad [6.7]
\]

where the kernel function K determines the contribution of the ith training point to
the class label prediction through a weighted average. As with the previous methods,
we now replace η with η̂Kernel−N N in equation [6.2] to obtain the following:
\[
\hat{Y}_{Kernel\text{-}NN}(x) = \begin{cases} 1 & \text{if } \hat{\eta}_{Kernel\text{-}NN}(x, h) \ge \tfrac{1}{2} \text{ and } \sum_{i=1}^{n} K\!\left(\frac{\rho(x,x_i)}{h}\right) > 0 \\ 0 & \text{otherwise} \end{cases} \qquad [6.8]
\]
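For equations [6.7] and [6.8], a minimal sketch using the Epanechnikov kernel (again with an assumed Euclidean distance and illustrative names) might be:

import numpy as np

def epanechnikov(u):
    # Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)

def kernel_nn_predict(x, X_train, y_train, h):
    dists = np.linalg.norm(X_train - x, axis=1)
    weights = epanechnikov(dists / h)          # similarity scores K(rho(x, x_i)/h)
    if weights.sum() == 0:                     # no training point within the bandwidth
        return 0
    eta_hat = np.dot(weights, y_train) / weights.sum()   # equation [6.7]
    return int(eta_hat >= 0.5)                 # equation [6.8]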

Convergence results related to this method were discussed in detail in Chen and
Shah (2018).

6.2.5. Algorithms of the three considered NN methods

In practice, when we want to test the classification performance of an algorithm, we need to split the available dataset of the observed feature vectors into two sets.
These two sets are of unequal sizes and are called the training set and the test set; the
former contains the data of m individuals, while the latter contains the data of n − m
individuals. The individuals are placed in one of these two sets in a random way;
however, the class proportions should be maintained to assure that the training and test
sets share the same distribution characteristics. This means that if 25% of the original
data belonged to the minority class, then 25% of the training set and the test set will
also belong to the minority class (in our case, these are individuals that tested positive
for GDM). Thus, the training set will be denoted by ((x1 , y1 ), . . . , (xm , ym )) and the
test set will be denoted by ((xm+1 , ym+1 ), . . . , (xn , yn )). Therefore, xi , 1 ≤ i ≤ m is
the feature vector that contains the observed values for the ith individual in the training
set and will sometimes be called a training point. Similarly, xj , m + 1 ≤ j ≤ n is the
feature vector that contains the observed values for the jth individual in the test set,
and for simplicity will sometimes be called a test point. Therefore, although the class
label of the test point is known, we will be using the previously described NN methods
to predict it in order to evaluate the classification performance of the algorithms.

Algorithm 6.1 provides the pseudocode of kNN, while Algorithms 6.2 and 6.3
provide the pseudocode of the fixed-radius NN method and the kernel NN method,
respectively.

Algorithm 6.1 The k-nearest neighbors algorithm


Input: The training set {x1 , . . . , xm }, the test set {xm+1 , . . . , xn }, the class labels yi ∈ {0, 1} of the points in the
training set for each i ∈ {1, . . . , m}, the value of k
Output: The predicted class labels yj ∈ {0, 1} of the test points xj for each j ∈ {m + 1, . . . , n}
1: procedure k-NN
2: for each xj in the test set do
3: for each xi in the training set do
4: calculate the distance di between xi and the test point xj and store
5: each distance in a list [di ]
6: end for
7: sort the list of distances [di ] in ascending order and choose the training
8: points associated with the first k distances
9: determine the class label yj ∈ {0, 1} for the test point by checking which
10: is the most frequent class among these points, and assign this class label
11: to the test point xj
12: if k is even and exactly k/2 neighbors belong to each class then
13: randomly select a class label 0 or 1 for xj
14: end if
15: end for
16: end procedure

Algorithm 6.2 The fixed-radius nearest neighbor algorithm


Input: The training set {x1 , . . . , xm }, the test set {xm+1 , . . . , xn }, the class labels yi ∈ {0, 1} of the points in the
training set for each i ∈ {1, . . . , m}, the value of the radius r
Output: The predicted class labels yj ∈ {0, 1} of the test points xj for each j ∈ {m + 1, . . . , n}
1: procedure FIXED -R ADIUS NN
2: for each xj in the test set do
3: for each xi in the training set do
4: calculate the distance di between xi and the test point xj and store
5: each distance in a list [di ]
6: end for
7: sort the list of distances [di ] in ascending order and choose the training
8: points whose distance from the test point is less than or equal to r
9: determine the class label yj ∈ {0, 1} for the test point by checking which
10: is the most frequent class among these points, and assign this class label
11: to the test point xj
12: if the number of neighbors belonging to each class is tied then
13: randomly select a class label 0 or 1 for xj
14: end if
15: end for
16: end procedure

Algorithm 6.3 The kernel nearest neighbors algorithm


Input: The training set {x1 , . . . , xm }, the test set {xm+1 , . . . , xn }, the class labels yi ∈ {0, 1} of the points in the
training set for each i ∈ {1, . . . , m}, the value of the bandwidth h
Output: The predicted class labels yj ∈ {0, 1} of the test points xj for each j ∈ {m + 1, . . . , n}
1: procedure K ERNEL C LASSIFICATION
2: for each xj in the test set do
3: for each xi in the training set do
4: calculate the distance di between xi and the test point xj , divide this distance by
5: the bandwidth h and call this value si
6: compute kernel function at si to obtain a similarity score and store in a list [K(si )]
7: multiply this value by the class label yi for xi and store in a list [K(si )yi ]
8: end for
9: divide the average of the values in the list [K(si )yi ] by the average of the values in the
10: list [K(si )]
11: if the value obtained is greater than or equal to 1/2 then
12: assign class label yj = 1 to xj
13: else assign class label yj = 0 to xj
14: end if
15: end for
16: end procedure

6.2.6. Parameter and distance metric selection

The NN algorithms presented in the previous sections employ parameters that need to be appropriately selected to achieve accurate predictions. In kNN, the main
parameter that needs to be appropriately selected is k, that is, the number of nearest
neighbors of the test data point to be considered. Furthermore, the size of the radius
r needs to be properly selected for the fixed-radius NN, while the optimal bandwidth
size h needs to be selected for the kernel methods. These parameters will be selected
through k-fold cross validation which in turn makes use of the maximum value of the
area under the ROC curve to choose the optimal parameter.
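One way to carry out such a selection in Python is sketched below with scikit-learn's cross-validation utilities, scoring each candidate by the area under the ROC curve; the candidate grid and the synthetic stand-in data are purely illustrative and are not the values used in this study:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the (already balanced) training data.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
y_train = rng.integers(0, 2, size=200)

param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}     # illustrative candidate values of k
search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      cv=10,                # 10-fold cross-validation
                      scoring="roc_auc")    # select the k maximizing the AUC
search.fit(X_train, y_train)
print(search.best_params_)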

The NN methods discussed above all require a distance metric. To this end, an
appropriate distance metric must be utilized to account for the type of features in
the feature space (continuous variables, categorical variables or mixed data). The
choice of this metric is particularly tricky when a mixed dataset has to be considered, in which some variables are quantitative in nature while others are categorical. Hence, in such cases, traditional distance metrics like the
Euclidean distance are not appropriate. However, there exist distance metrics that have
been designed for mixed data. These are the heterogeneous value difference metric
(HVDM) and the heterogeneous Euclidean-overlap metric (HEOM) defined in Mody
(2009).
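A minimal sketch of the HEOM distance, following its usual definition (range-normalized absolute differences for continuous features, a 0/1 overlap for categorical ones and a contribution of 1 for missing values), is given below; the function and argument names are ours:

import numpy as np

def heom(x, y, is_categorical, ranges):
    # is_categorical: boolean flags per feature; ranges: max - min of each continuous feature
    d = np.empty(len(x))
    for a in range(len(x)):
        if x[a] is None or y[a] is None:           # missing values contribute 1
            d[a] = 1.0
        elif is_categorical[a]:                    # overlap: 0 if equal, 1 otherwise
            d[a] = 0.0 if x[a] == y[a] else 1.0
        else:                                      # range-normalized absolute difference
            d[a] = abs(x[a] - y[a]) / ranges[a]
    return np.sqrt(np.sum(d ** 2))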

In the next section, we discuss the results obtained when the above three techniques
were applied to a dataset for GDM risk prediction.

6.3. Experimental results

6.3.1. Dataset description

The dataset utilized is made up of information related to 1,368 mothers and their
newborn babies from 11 countries collected between 2010 and 2011. It is a mixed
dataset consisting of 72 variables – 44 categorical and 28 continuous. The categorical
variables also include the binary response variable that indicates whether the mother
has GDM or not, thus representing the mother’s class label. We note that the number
of variables in the dataset is quite large, and some may not be relevant in determining
whether a mother is at risk of being diagnosed with GDM. Therefore, appropriate tests
(described in section 6.3.2) were used to eliminate insignificant variables.

Furthermore, we see that 352 mothers were diagnosed with GDM according to the International Association of Diabetes and Pregnancy Study Group (IADPSG) criteria, making up 26% of the total number of mothers. In contrast, 1,016 mothers were found not to have GDM, making up 74% of the cases. Thus, we are dealing with imbalanced data, since the majority of mothers do not have GDM while only a minority do. This may deteriorate the performance of the classifier by increasing
the false negatives. To overcome this problem, the SMOTE-NC technique as described
by Chawla et al. (2002) was implemented prior to the application of the classification
techniques to balance out the data.
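Assuming the imbalanced-learn implementation of SMOTE-NC, the balancing step could be sketched as follows; the toy data and the positions of the categorical columns are placeholders rather than the actual study variables:

import numpy as np
from imblearn.over_sampling import SMOTENC

# Toy stand-in for a mixed dataset: column 0 categorical, columns 1-2 continuous.
rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(0, 3, 100), rng.normal(size=(100, 2))])
y = np.array([1] * 26 + [0] * 74)             # roughly the 26%/74% class imbalance

smote_nc = SMOTENC(categorical_features=[0],  # indices of the categorical columns
                   random_state=0)
X_res, y_res = smote_nc.fit_resample(X, y)    # the two classes are now of equal size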

6.3.2. Variable selection and data splitting

The robustness of NN techniques is inherently dependent on the dimension of the problem, i.e. the number of explanatory variables involved in the dataset (see
Chapter 2 in Hastie et al. (2001)). High-dimensional problems favor fairly large
neighborhoods of observations close to any reference point x, thereby compromising
the effectiveness of the estimation of η̂ in equations [6.3], [6.5] and [6.7]. Moreover,
and in contrast to other parametric techniques such as BLR, all explanatory variables
are equally important when NN techniques are used for making predictions, making
it impossible for the algorithms themselves to identify the variables that affect the
response more. Due to the reasons above, proceeding with a variable selection
technique for reducing the problem’s dimension is essential.

In order to determine which variables are significant in diagnosing mothers with GDM, two tests were carried out at a 0.05 level of significance in accordance with the
nature of each variable: a Pearson’s chi-square test for independence between GDM
diagnosis and the other 43 categorical variables, and a Mann–Whitney U test between
GDM diagnosis and the 28 continuous variables.
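A sketch of these two tests using SciPy is given below; the data frame and the column names (a binary "GDM" indicator and the explanatory variables) are assumptions made for the example:

import pandas as pd
from scipy.stats import chi2_contingency, mannwhitneyu

def categorical_significant(df, var, alpha=0.05):
    # Pearson chi-square test of independence between GDM diagnosis and a categorical variable
    table = pd.crosstab(df[var], df["GDM"])
    _, p_value, _, _ = chi2_contingency(table)
    return p_value < alpha

def continuous_significant(df, var, alpha=0.05):
    # Mann-Whitney U test comparing a continuous variable across the two GDM groups
    pos = df.loc[df["GDM"] == 1, var].dropna()
    neg = df.loc[df["GDM"] == 0, var].dropna()
    _, p_value = mannwhitneyu(pos, neg)
    return p_value < alpha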

After performing a chi-square test on the categorical variables, only 12 turned out to be significant and are as follows: “Country”, “Mother’s country of birth”,
“Father’s country of birth”, “Pre-existing hypertension”, “Family history: diabetes mellitus in mother”, “Family history: diabetes mellitus in father”, “Family history:
diabetes mellitus in siblings”, “Menstrual cycle regularity”, “Fasting glycosuria”,
“Insulin used”, “Oral medication” and “Special care baby unit admission”.

Furthermore, after implementing a Mann–Whitney U test on the continuous variables, 18 of these variables resulted as significant and were therefore retained.
These include “Age”, “Pre-pregnancy body mass index”, “Systolic blood pressure”,
“Diastolic blood pressure”, “Absolute hemoglobin A1c”, “Hemoglobin A1c”,
“Parity”, “Personal history: macrosomia”, “Weight pre-pregnancy”, “Weight at oral
glucose tolerance test”, “Body mass index at oral glucose tolerance test”, “Fasting
blood glucose”, “One-hour blood glucose”, “Two-hour blood glucose”, “Area under
the curve”, “Insulin level”, “Gestational age at delivery” and “Apgar score”.

Therefore, after performing variable selection, we are left with 12 categorical and
18 continuous variables, making up 30 variables in all.

The dataset was split into two non-overlapping sets: the training set and the test
set. This was carried out at an 80:20 ratio, with the training set in this case being made
up of 1,094 mothers, 283 (26%) of whom were diagnosed with GDM, and the test set
consisting of 274 mothers, 69 (25%) of whom were diagnosed with GDM. The class
imbalance in the training set was then catered for through the use of the SMOTE-NC
algorithm, as explained in Chawla et al. (2002).
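A stratified 80:20 split of this kind can be sketched with scikit-learn as follows; the synthetic arrays merely stand in for the retained features and GDM labels:

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(1368, 30))                    # stand-in for the 30 retained variables
y = (rng.random(1368) < 0.26).astype(int)          # stand-in for the GDM labels

# stratify=y preserves the class proportions in both the training and the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)
# SMOTE-NC would then be applied to (X_train, y_train) only, leaving the test set untouched.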

6.3.3. Results

Variables were first scaled and standardized before fitting any models. The optimal
hyperparameter values for all three NN methods were then found using 10-fold cross
validation; namely, for kNN, k = 5, for fixed-radius NN, r = 5.2 and for kernel-NN,
h = 0.19.
                               Actual
                     Positive    Negative    Total
Predicted  Positive        40           6       46
           Negative        29         199      228
           Total           69         205      274

Table 6.1. Confusion matrix for kNN on the test set (balanced case)

In the following, we will take a look at the confusion matrices obtained for each
method. Table 6.1 presents the confusion matrix for kNN, Table 6.2 presents the
confusion matrix for fixed-radius NN and Table 6.3 presents the confusion matrix
for kernel-NN.

                               Actual
                     Positive    Negative    Total
Predicted  Positive         2           0        2
           Negative        73         199      272
           Total           75         199      274

Table 6.2. Confusion matrix for fixed-radius NN on the test set (balanced case)

                               Actual
                     Positive    Negative    Total
Predicted  Positive        43          26       69
           Negative         2         203      205
           Total           45         229      274

Table 6.3. Confusion matrix for the kernel-NN on the test set (balanced case)

The BLR model was implemented by Savona-Ventura et al. (2013) in order to predict the probability of developing GDM and included three binary predictors. These
indicated whether fasting blood glucose (FBG) was at a level higher than 5.0 mmol/L,
whether the maternal age was greater than or equal to 30 years, and whether diastolic
blood pressure was greater than or equal to 80 mmHg. This particular model was
first applied to the test set in both the imbalanced case (which was that examined
in Savona-Ventura et al. (2013)) and balanced cases. Tables 6.4 and 6.5 present the
confusion matrices obtained for BLR applied to the imbalanced and balanced datasets,
respectively.
                               Actual
                     Positive    Negative    Total
Predicted  Positive        42           6       48
           Negative        33         193      226
           Total           75         199      274

Table 6.4. Confusion matrix for BLR on the test set using
the original variables (imbalanced case)

                               Actual
                     Positive    Negative    Total
Predicted  Positive        45          10       55
           Negative        24         195      219
           Total           69         205      274

Table 6.5. Confusion matrix for BLR on the test set using
the original variables (balanced case)

After checking for multicollinearity through the Spearman correlation matrix and
removing any correlated predictors, another BLR model was applied to the training
set, and the parsimonious model was obtained using a backward stepwise process.
In this case, seven significant variables were found and retained, namely “Country”,
“Family history: diabetes mellitus in mother”, “Family history: diabetes mellitus in
father”, “Parity”, “Weight at oral glucose tolerance test”, “Area under the curve” and
“Apgar score”. Finally, this BLR model was then applied to the test set. The confusion
matrix obtained is given in Table 6.6.

                               Actual
                     Positive    Negative    Total
Predicted  Positive        49           7       56
           Negative        20         198      205
           Total           69         205      274

Table 6.6. Confusion matrix for BLR on the test set using
only the significant variables (balanced case)

We should note that for the kNN, kernel-NN and BLR methods, the proportion of true predictions, as can be seen in the confusion matrices, was clearly higher than that of false predictions, meaning that these techniques seem to be adequate for the
data. However, for the fixed-radius NN method, the confusion matrix in the balanced
case showed a high proportion of true negatives (72.6%), while the proportion of
true positives (0.73%) was very low relative to false predictions. This means that
this method is probably not the best for the data, since it does not perform well in
diagnosing mothers who have GDM.

6.3.4. A discussion and comparison of results

Table 6.7 shows an ordering of the methods according to their overall performance
on the test set, from best to worst. This was based on five performance measures,
namely accuracy, area under the ROC curve, precision, sensitivity and F1 score.
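A sketch of how these five measures could be computed with scikit-learn is given below; the true labels, predicted labels and predicted scores are assumed to come from one of the fitted classifiers:

from sklearn.metrics import (accuracy_score, roc_auc_score,
                             precision_score, recall_score, f1_score)

def performance_summary(y_true, y_pred, y_score):
    # y_score holds the estimated probabilities of the positive class (used for the AUC)
    return {
        "accuracy":    accuracy_score(y_true, y_pred),
        "AUC":         roc_auc_score(y_true, y_score),
        "precision":   precision_score(y_true, y_pred),
        "sensitivity": recall_score(y_true, y_pred),   # sensitivity is the recall
        "F1 score":    f1_score(y_true, y_pred),
    }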

We see here that BLR using the original variables on the test set in the imbalanced
case (which was the original case studied by Savona-Ventura et al. (2013)) was the
fifth best classification technique, surpassing fixed-radius NN which had the worst
overall performance. Furthermore, BLR using the original variables on the test set in
the balanced case came in fourth overall. The kNN method in the balanced case had
the best AUC, and the best overall performance for the NN methods followed by the
kernel method. Finally, the binary logistic regression technique applied to the balanced
data after obtaining the parsimonious model performed slightly better than the kNN
method and proved to perform the best overall for this dataset.

Technique    Case                              Accuracy    AUC      Precision    Sensitivity    F1 Score
BLR          Balanced - Selected Variables     0.902       0.838    0.875        0.710          0.784
kNN          Balanced                          0.872       0.945    0.870        0.580          0.696
Kernel-NN    Balanced                          0.898       0.913    0.623        0.956          0.754
BLR          Balanced - Original Variables     0.876       0.802    0.818        0.652          0.726
BLR          Imbalanced - Original Variables   0.858       0.765    0.875        0.560          0.683
FR-NN        Balanced                          0.733       0.762    1.000        0.027          0.052

Table 6.7. Classification techniques in order of performance

6.4. Conclusion

In this chapter, three NN techniques, namely kNN, fixed-radius NN and the kernel method, were studied, together with their implementation in binary classification. Upon
applying these methods to the dataset, together with binary logistic regression, we
conclude that in comparison to the NN methods, the binary logistic regression
technique using the variables in the parsimonious model for the balanced case gave
a slightly better performance; however, kNN and the kernel method performed better
in relation to the binary logistic regression model applied by Savona-Ventura et al.
(2013).

While carrying out this study, a limitation encountered was that 10-fold cross
validation to determine the optimal hyperparameters for kNN and fixed-radius NN
in Python was not computationally efficient, meaning that it took a very long time
to train the algorithms. A possible improvement to the study may be the exploration
of Bayesian neural networks for classification problems, where cross validation is no
longer needed and so the algorithm is trained more efficiently using MCMC methods.
Alternative classification methods found in the literature can also be applied and
compared with NN methods to obtain the best model for prediction, namely decision
trees, random forests and support vector machines, for example, which are also widely
used in these types of problems.

6.5. References

Alfadhli, E.M. (2015). Gestational diabetes mellitus. Saudi Medical Journal, 36(4), 399–406.
Chaudhuri, K. and Dasgupta, S. (2014). Rates of convergence for nearest neighbour
classification. Proceedings of the 27th International Conference on Neural Information
Processing Systems – Volume 2, 3437–3445, Montreal.
Chawla, N.V., Bowyer, K., Hall, L.O., Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority
over-sampling technique. Journal of Artificial Intelligence Research, 16(1), 321–357.
Chen, G.H. and Shah, D. (2018). Explaining the success of nearest neighbour methods in
prediction. Foundations and Trends in Machine Learning, 10(5–6), 337–588.
Guidoum, A.C. (2015). Kernel estimator and bandwidth selection for density and its derivatives
[Online]. Available at: https://cran.r-project.org/web/packages/kedd/vignettes/kedd.pdf.
Hastie, T., Tibshirani, R., Friedman, J. (2001). The Elements of Statistical Learning: Data
Mining, Inference and Prediction. Springer, New York.
Kandhasamy, J.P. and Balamurali, S. (2015). Performance analysis of classifier models to predict
diabetes mellitus. Procedia Computer Science, 47, 45–51.
Kotzaeridi, G., Blatter, J., Eppel, D., Rosicky, I., Mittlboeck, M., Yerlikaya-Schatten, G.,
Schatten, C. (2021). Performance of early risk assessment tools to predict the later
development of gestational diabetes. European Journal of Clinical Investigation, 51(23).
Lamain-de Ruiter, M., Kwee, A., Naaktgeboren, C.A., Franx, A., Moons, K., Koster, M. (2017).
Prediction models for the risk of gestational diabetes: A systematic review. Diagnostic and
Prognostic Research, 1, 3.
Mody, R. (2009). Optimizing the distance function for nearest neighbors classification. Thesis,
University of California San Diego [Online]. Available at: https://escholarship.org/uc/item/
9b3839xn.
Savona-Ventura, C., Vassallo, J., Marre, M., Karamanos, B.G. (2013). A composite risk
assessment model to screen for gestational diabetes mellitus among Mediterranean women.
International Journal of Gynecology and Obstetrics, 120(3), 240–244.
Saxena, K., Khan, D.Z., Singh, S. (2004). Diagnosis of diabetes mellitus using K nearest
neighbor algorithm. International Journal of Computer Science Trends and Technology, 2(4),
36–43.
Scheid, S. (2004). Introduction to kernel smoothing [Online]. Available at: https://compdiag.
molgen.mpg.de/docs/talk_05_01_04_stefanie.pdf.
7

Political Trust in National Institutions: The Significance of Items’ Level of Measurement in the Validation of Constructs

The most important consideration in any statistical analysis is to ascertain the level of measurement of the input variables, which guides the appropriate
methodological steps to be used. In this chapter, we carry out the validation of the
2008 European Social Survey (ESS) and European Values Study (EVS) measurement of political trust in national institutions for Greece, Portugal and Spain, when items are considered as pseudo-interval (ESS) and ordinal (EVS). For the
validation of this construct, the sample in each country was randomly split into two
halves and first exploratory factor analysis (EFA) was performed on one half-sample
in order to assess its construct validity. The structure identified by EFA was then
investigated by carrying out confirmatory factor analysis (CFA) on the second half.
Based on the full sample, the psychometric properties of the resulting scales were
assessed.

In all the countries of both surveys, EFA performed on the first half-samples
resulted in a unidimensional solution based on the four common items of the
political trust in national institutions. CFA performed on the second half-samples
and the full samples resulted in adequate model fit for all cases. Moreover, the
analysis provided reliable scales which were of adequate convergent validity. The
methodology presented may be easily applied to other cases of validating scales
composed of pseudo-interval or ordinal items.

Chapter written by Anastasia CHARALAMPI, Eva TSOUPAROPOULOU, Joanna TSIGANOU and Catherine MICHALOPOULOU.


7.1. Introduction

Scaling theory presupposes investigation of scales’ structures (dimensionality) and assessment of the psychometric properties of the scale or subscales by
ascertaining their reliability and validity. The initial choice of methodological steps
depends on whether this investigation aims to theory development, i.e. subscales are
not predetermined by theory, or theory testing, i.e. subscales are predetermined by
theory (Thompson 2005; Tabachnick and Fidell 2007). In order to proceed with
theory testing (Michalopoulou 2017; Charalampi 2018; Charalampi et al. 2019,
2020), as in the case of the present study, an adequate sample size is randomly split
into two halves and exploratory factor analysis (EFA) is conducted on one
half-sample. The structure is then investigated by performing confirmatory factor
analysis (CFA) on the second half-sample. Thus, the structure identified by EFA is
validated by applying CFA. Based on the EFA and the CFA results for the total
sample, the validity and reliability of the resulting scale or subscales and their
distributional properties are assessed.

The most important consideration in any statistical analysis – whether univariate, bivariate or multivariate – is to ascertain the level of measurement of the input
variables, which guides the appropriate methodological steps to be used. Following
the traditional typology of nominal, ordinal, interval and ratio levels of measurement
(Stevens 1946), the items of most attitude scales are considered as ordinal. However,
as we have noted in previous work (Michalopoulou 2017; Charalampi 2018;
Charalampi et al. 2019, 2020), when the number of response categories used for
each item is at least five, ordinal categories can be treated as interval and standard
statistical analyses may be performed using these pseudo-interval variables
(Bartholomew et al. 2008).

In this chapter, to demonstrate the importance of ascertaining the level of measurement of the items in choosing the methods to be used, we carry out the
investigation and assessment of the 2008 European Social Survey (ESS) and
European Values Study (EVS) measurement of political trust in national institutions
for Greece, Portugal and Spain, when items are considered as pseudo-interval (ESS)
and ordinal (EVS). In performing these analyses, a sequence of theoretical and
rule-of-thumb decisions is presented following the traditional methodology. The
measurement of political trust in institutions serves as a good example as in the
literature (e.g. Daskalopoulou 2018; Ervasti et al. 2018), the ESS items – with few
exceptions (e.g. Smets et al. 2013; Hooghe and Kern 2015) – are simply summed up
to provide an indicator of political trust in institutions without first validating the
scale’s structure and assessing its psychometric properties as the attitude scaling
theory requires before scales are applied.

7.2. Methods

7.2.1. Participants

The analysis was based on the 2008 ESS and EVS data for Greece, Portugal and
Spain. The ESS defines the survey population as all individuals aged 15+ residing
within private households in each country, regardless of their nationality, citizenship
or language, and this definition applies to all rounds of the survey. The EVS applies
a similar definition with the exception of age, which is defined at 18+. Therefore,
the analysis is based on those aged 18+ for both datasets so as to establish their
comparability. In Table 7.1, the demographic and social characteristics of the
participants aged 18+ are presented.

Country     N       Men (%)   Women (%)   Age mean (SD)   Married (%)   Secondary education   In paid
                                                                        or lower (%)          work* (%)
Greece
  ESS       2,019   45.1      54.9        45.8 (16.3)     59.5          73.5                  58.1
  EVS       1,500   43.3      56.7        49.6 (18.4)     59.3          82.2                  45.4
Portugal
  ESS       2,296   38.5      61.5        53.9 (19.2)     56.9          87.4                  41.6
  EVS       1,553   40.4      59.6        53.0 (18.7)     59.6          91.0                  47.2
Spain
  ESS       2,486   47.5      52.5        47.9 (18.6)     56.6          76.0                  54.1
  EVS       1,500   43.9      56.1        47.9 (19.4)     45.5          67.8                  50.9
*The reference period for the respondent’s main activity was defined as during the last seven days.

Table 7.1. Demographic and social characteristics of participants aged 18+ for Greece, Portugal and Spain: European Social Survey (ESS) and European Values Study (EVS), 2008

As shown, in all samples, there were more women than men. Gender is
distributed similarly in the Greek and Portuguese ESS and EVS samples. In the
Spanish case, a difference of 3.4% is detected between the two samples. In all
samples, the mean age was at least 45.8 years. In the cases of Portugal and Spain, the
mean age was the same for both the ESS and EVS samples. In the case of Greece,
the mean age of the EVS sample was higher than the ESS one. More than 45.4% of
the participants were married. In the case of Greece, there were no differences
between the two samples. In the case of Portugal, the percentage of married
participants in the EVS sample was higher than in the ESS one and the reverse holds
true for the case of Spain. In all samples, more than 73.5% had completed secondary
education or lower. In the cases of Greece and Portugal, the percentage of those that
had completed secondary education or lower was higher in the EVS sample and the
reverse holds true for the case of Spain. In all samples, at least 41.5% were in paid
work. In the cases of Greece and Spain, the percentage of those that were in paid
work is higher in the ESS sample and the reverse holds true for the case of Portugal.

7.2.2. Instrument

In the ESS core questionnaire, five items are used for the measurement of
political trust in national institutions: parliament, legal system, police, politicians
and political parties. All these items are included in all rounds of the survey with the
exception of the question on political parties, which was introduced in the second
round (2004). Each item is assigned a scale ranging from 0 (no trust at all) to 10
(complete trust). The level of measurement of these items is pseudo-interval. The
EVS measures political trust as confidence in 17 national institutions of which only
four are common with ESS (Table 7.2): police, courts, political parties and
parliament. These items are assigned a scale ranging from 1 (a great deal) to 4 (none
at all) and therefore their level of measurement is ordinal. The values of the EVS
items were first reversed, in order to achieve correspondence between the ordering
of the response categories to the ESS items.

Item                     ESS question   Aligned scale (ESS)   EVS question   Aligned scale (EVS)   Item label
[Country]’s Parliament   B4             0–10                  Q63_v211       1–4 (R)               PT1
The legal system         B5             0–10                  Q63_v218       1–4 (R)               PT2
The police               B6             0–10                  Q63_v210       1–4 (R)               PT3
Political parties        B8             0–10                  Q63_v221       1–4 (R)               PT5
R = the values of these items were reversed before the analysis. The ESS wording of these
questions is as follows: “Using this card, please tell me on a score of 0-10 how much you
personally trust each of the institutions I read out. 0 means you do not trust an institution at
all, and 10 means you have complete trust.” The EVS wording of these questions is as
follows: “Please look at this card and tell me, for each item listed, how much confidence you
have in them, is it a great deal (1), quite a lot (2), not very much (3) or none at all (4)?”

Table 7.2. The European Social Survey (ESS) and European Values
Study (EVS) measurement of political trust in national institutions

7.2.3. Statistical analyses

Initially, as the methodological process falls under theory testing (Thompson 2005), the sample in each country was randomly split into two halves.

EFA was performed on the first half in order to assess the construct validity of
the scale (Fabrigar et al. 1999; Bartholomew et al. 2008). The structure suggested by
EFA was subsequently validated by carrying out CFA on the second half. Based on
the full sample and the CFA results, the psychometric properties of the scale were
assessed. Statistical analyses were performed using Mplus Version 8.4 and IBM
SPSS Statistics Version 20.

The half-sample sizes were large enough (>300) to carry out factor analyses
(Tabachnick and Fidell 2007). Since sample sizes ranged from 1,500 (Greece and
Spain, EVS) to 2,486 (Spain, ESS), the half-samples were 750 (Greece and Spain,
EVS) to 1,243 (Spain, ESS) and were therefore considered large enough to carry out
factor analyses separately in each country.

Initially, missing data analysis and data screening for outliers and unengaged
responses was performed for both half-samples (Michalopoulou 2017; Charalampi
2018; Charalampi et al. 2019, 2020). Only cases with missing values on all items
were automatically excluded from the analysis (Muthén and Muthén 1998–2017).
Cases were also eliminated if they exhibited low standard deviation (< 0.5), i.e. no
variance in the responses (Gaskin 2016). Data screening for outliers was based on
background variables, for example, gender (dichotomy), age (ratio) and education
(pseudo-interval). Cases were eliminated if they were shown in the boxplots as
outliers (Gaskin 2016; see also Thompson 2005; Tabachnick and Fidell 2007;
Brown 2015).

7.2.3.1. EFA
In performing EFA, the following sequence of decisions was required
(Michalopoulou 2017; Charalampi 2018; Charalampi et al. 2019, 2020):
1) Initially, the items’ frequency distributions were inspected and, in the case of pseudo-interval items, checked for floor and ceiling effects, bearing in mind that percentages of responses less than 15% are normally deemed to be acceptable (Terwee et al. 2007).
In the case of pseudo-interval items, the appropriate univariate statistics were
computed for each item and their distributional properties were inspected (testing for
normality) to decide on the appropriateness of the methods to be used. The criterion
of corrected item-total correlations < 0.30 (Nunnally and Bernstein 1994) was used
to decide which items to exclude from the analysis. In the case of ordinal items, only
the mode and median were computed for each item.
2) The covariance matrix and the polychoric correlation matrix were employed
as the appropriate matrices of associations for pseudo-interval and ordinal items,
respectively (Brown 2015).
3) Maximum likelihood and robust weighted least squares were applied as the
appropriate methods of factor extraction for pseudo-interval and ordinal items,
respectively (Brown 2015).
4) Considering the factor analytic theory, “factors that are represented by two or
three indicators may be underdetermined […] and highly unstable across
replications” (Brown 2015, p. 21), only a unidimensional model could be tested.
5) Items were considered salient if their factor loadings were > 0.30 and therefore
the meaning of the dimension was inferred from these items (Fabrigar et al. 1999;
Thompson 2005). Items with loadings < 0.30 (i.e. low communalities) were
excluded from the analysis (Brown 2015).

7.2.3.2. CFA
In applying CFA, the following sequence of decisions was required
(Michalopoulou 2017; Charalampi 2018; Charalampi et al. 2019, 2020):
1) The decision on the inclusion of items in the analysis was based on the results
of the item analysis and EFA carried out on the first half-sample.
2) CFA was performed using the covariance matrix of associations and
maximum likelihood estimation in the case of pseudo-interval items and the
polychoric correlation matrix and robust weighted least squares in the case of
ordinal items.
3) Model fit was considered adequate if χ2/df < 3, standardized root-mean-square
residual (SRMR) < 0.05, comparative fit index (CFI) and Tucker-Lewis index (TLI)
values were ≥ 0.95 and the root-mean-square error of approximation (RMSEA) ≤ 0.06
with the 90% confidence interval (CI) upper limit ≤ 0.06 (Bollen 1989; Hu and
Bentler 1999; Thompson 2005; Tabachnick and Fidell 2007; Schmitt 2011; Brown
2015). Model fit was considered acceptable if χ2/df < 3, SRMR < 0.08, CFI and TLI
values were > 0.90 and RMSEA < 0.08 with the 90% CI upper limit < 0.08 (Hu and
Bentler 1999; Marsh et al. 2004). However, because SRMR seems to not perform
well in CFA models with categorical items (Yu 2002; Brown 2015), it was not used
in the case of ordinal items.
4) Searches for modification indices and further specifications were performed.
Where necessary, correlations between error variances were introduced (Thompson
2005; Brown 2015).

7.2.3.3. Scale construction and assessment
The scale was constructed for the full sample for all countries of both surveys by
averaging the defining items so that low and high scores would indicate low and
high levels of political trust and descriptive statistics were computed. Based on the
CFA results for the full sample, as in our previous work (Michalopoulou 2017;
Charalampi 2018; Charalampi et al. 2019, 2020), the average variance extracted
(AVE) was computed for each scale. Convergent validity was considered adequate if
the AVE was above or around 0.50 (Fornell and Larcker 1981). Average inter-item
correlations in the recommended range of 0.15–0.5 that cluster near their mean value
were used as an indication of the unidimensionality of the scale (Clark and Watson
1995). Moreover, based on the CFA results for the full sample, a scale was
considered to be reliable if the composite reliability coefficient (Raykov 2007) was
above or around 0.70, i.e. using the same criterion as for Cronbach’s alpha
coefficients (Nunnally and Bernstein 1994). However, if AVE is less than 0.5, but
composite reliability is higher than 0.6, the convergent validity of the construct is
still adequate (Fornell and Larcker 1981).
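A minimal sketch of these two quantities, computed from the standardized loadings of a single-factor CFA solution under the usual formulas for a congeneric model (the loadings shown are illustrative, not those reported here), might be:

import numpy as np

def ave_and_composite_reliability(loadings):
    # Assumes standardized loadings and uncorrelated errors, so each error variance is 1 - loading**2.
    lam = np.asarray(loadings)
    errors = 1 - lam ** 2
    ave = np.mean(lam ** 2)                                  # average variance extracted
    cr = lam.sum() ** 2 / (lam.sum() ** 2 + errors.sum())    # composite reliability
    return ave, cr

print(ave_and_composite_reliability([0.80, 0.75, 0.65, 0.70]))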

In order to facilitate the comparison between the results of the two surveys for
the full samples, all items of the EVS survey datasets were rescaled into a 0–10 scale
by applying the following simple transformation (Charalampi 2018; Charalampi
et al. 2019, 2020):


\[
x_{\text{new}} = \frac{(x_{\text{old}} - \min_{\text{old}})(\max_{\text{new}} - \min_{\text{new}})}{\max_{\text{old}} - \min_{\text{old}}} + \min_{\text{new}}
\]
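For instance, rescaling a reversed 1–4 EVS item onto the 0–10 ESS range amounts to the following (a small sketch of the transformation above):

def rescale(x_old, old_min=1, old_max=4, new_min=0, new_max=10):
    # Linearly map a value from [old_min, old_max] onto [new_min, new_max]
    return (x_old - old_min) * (new_max - new_min) / (old_max - old_min) + new_min

print([rescale(v) for v in (1, 2, 3, 4)])   # [0.0, 3.33..., 6.67..., 10.0]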

7.3. Results

The full sample screening of datasets for both surveys identified no unengaged
responses (standard deviation = 0.000). In the Portuguese sample of both surveys,
four and six outlying cases with a higher education degree were detected,
respectively, and it was decided not to reject them from the analysis. There were
four, eleven and five cases with missing values on all items in the Greek, Portuguese
and Spanish ESS samples, respectively. Moreover, there were seventeen and eight
cases with missing values on all items in the Portuguese and Spanish EVS samples,
respectively. These cases were excluded from the analysis.

7.3.1. EFA results

In every country of the ESS datasets, respondents had used the full range of
possible responses for all items (Table 7.3). The majority of the responses were
clustered closer to the lower end of their respective scales. Floor effects were present
in all three countries’ samples for the item measuring trust in political parties (PT5),
and consequently, this item had the lowest mean responses. Relatively high mean
responses were found for the item defining trust in the police (PT3), mainly in the
case of the Spanish sample. None of the items were rejected based on the criterion of
corrected item-total correlations < 0.30. Non-normality was not severe for any item (no item exhibited skewness > 2 or kurtosis > 7). As shown, the proportion of missing values was
negligible, exceeding 5.4% for only one item (PT1) of the Spanish sample.

In parallel, frequency distributions and mode and median values of the items
based on the first Greek, Portuguese and Spanish EVS half-samples were inspected
(Table 7.4). The full range of possible responses was used for all items. The
majority of the responses were clustered around the middle and closer to the lower
end of their respective scales. As shown, the proportion of missing values was
negligible, exceeding 4.5% for only two items of the Portuguese (PT1) and Spanish
(PT1) sample, respectively.

EFA for the pseudo-interval and ordinal items was performed on the first
half-samples with maximum likelihood of the covariance matrix of associations and
with robust weighted least squares of the polychoric matrix of associations,
respectively. Table 7.5 shows the factorial structure of the one-factor solutions of
both surveys. All items exhibited strong factor loadings (≥ 0.40).

7.3.2. CFA results

The one first-order factor model indicated by the EFA results was tested by
performing CFA on the second half-samples. Modification searches were conducted,
and, where necessary, correlations between error variances were introduced. The
CFA results for the Greek ESS and EVS samples were χ2/df = 2.23 (df = 1),
SRMR = 0.005, CFI = 0.999, TLI = 0.996, RMSEA (90% CI) = 0.035 (0.000–0.099)
and χ2/df = 14.87 (df = 2), CFI = 0.983, TLI = 0.950, RMSEA (90% CI) = 0.136
(0.095–0.181), respectively. The CFA results for the Portuguese ESS and EVS
samples were χ2/df = 5.09 (df = 1) SRMR = 0.010, CFI = 0.997, TLI = 0.982, RMSEA
(90% CI) = 0.060 (0.017–0.115) and χ2/df = 5.23 (df = 1), CFI = 0.998, TLI = 0.986,
RMSEA (90% CI) = 0.074 (0.023–0.142), respectively. The CFA results for the
Spanish ESS and EVS samples were χ2/df = 1.79 (df = 1), SRMR = 0.005,
CFI = 0.999, TLI = 0.997, RMSEA (90% CI) = 0.025 (0.000–0.085) and χ2/df = 1.81
(df = 1), CFI = 0.999, TLI = 0.992, RMSEA (90% CI) = 0.033 (0.000–0.109),
respectively.
Frequency percentage of response categories
Country/item Mean SD 95% CI 0 1 2 3 4 5 6 7 8 9 10 NA Skew. Kurt. CC
Greece (n = 1,009)
PT1 3.58 2.496 3.42–3.74 14.1 12.1 11.1 11.4 9.6 18.6 8.8 7.2 4.5 1.1 0.8 0.8 0.20 -0.88 0.706
PT2 4.75 2.577 4.59–4.91 7.8 6.1 7.3 10.3 10.3 16.9 10.4 14.7 10.3 4.2 1.1 0.5 -0.26 -0.85 0.765
PT3 4.87 2.604 4.71–5.04 6.9 5.9 7.7 8.9 10.3 19.4 10.3 12.3 10.7 4.9 2.4 0.2 -0.17 -0.76 0.619
PT5 2.50 2.151 2.36–2.63 23.3 18.7 11.6 12.8 10.9 14.2 3.3 2.9 1.3 0.1 0.2 0.8 0.57 -0.51 0.589
Portugal (n = 1,148)
PT1 3.43 2.403 3.28–3.58 16.5 7.0 11.6 13.4 11.9 17.5 7.1 5.1 2.8 0.7 1.0 5.4 0.26 -0.55 0.681
PT2 3.77 2.478 3.61–3.92 13.1 6.6 9.4 14.5 10.5 17.8 8.6 6.4 5.4 1.8 0.7 5.1 0.15 -0.73 0.689
PT3 5.39 2.330 5.25–5.53 5.0 1.3 5.0 6.7 9.1 21.9 15.2 15.1 12.0 3.3 3.9 1.6 -0.43 -0.05 0.456
PT5 2.42 2.114 2.29–2.54 30.1 9.6 14.0 12.4 13.0 12.3 2.8 2.3 0.4 0.2 0.2 2.8 0.46 -0.62 0.595
Spain (n = 1,243)
PT1 4.93 2.268 4.80–5.06 5.8 2.9 5.1 7.6 10.5 21.7 15.7 10.9 8.5 2.3 0.7 8.2 -0.43 -0.22 0.658
PT2 4.20 2.444 4.06–4.35 8.8 6.8 10.7 11.2 13.2 18.8 9.9 8.0 7.1 2.2 0.9 2.4 0.04 -0.71 0.669
PT3 5.95 2.185 5.82–6.08 2.3 2.0 3.3 5.3 7.2 17.4 15.7 19.4 16.7 7.1 3.0 0.7 -0.63 0.16 0.536
PT5 3.33 2.331 3.19–3.46 17.6 9.5 11.2 13.2 13.0 17.7 6.8 3.8 2.7 1.1 0.3 3.1 0.24 -0.62 0.608
SD = standard deviation; CI = confidence interval; NA = no answer (missing values); Skew. = skewness; Kurt. = kurtosis; CC = corrected
item-total correlation.
Standard errors for skewness and kurtosis of the Greek items were 0.078 and 0.155, respectively; standard errors for skewness and kurtosis of
the Portuguese items were 0.076 and 0.152, respectively; standard errors for skewness and kurtosis of the Spanish items were 0.073 and 0.146,
respectively.

Table 7.3. Item analysis of the political trust in national institutions for Greece, Portugal
and Spain based on the first half-samples: European Social Survey, 2008

Frequency percent of response categories
Country/item Mode Median 1 2 3 4 NA
Greece (n = 750)
PT1 2 2 26.4 42.9 25.2 3.9 1.6
PT2 3 2 17.9 32.8 36.3 11.9 1.2
PT3 3 3 16.4 29.7 39.5 14.1 0.3
PT5 2 2 37.9 44.7 15.1 1.6 0.8
Portugal (n = 776)
PT1 2 2 22.4 33.6 31.6 3.6 8.8
PT2 3 2 22.0 30.3 37.4 6.8 3.5
PT3 3 3 5.9 16.2 57.1 18.9 1.8
PT5 1 2 38.7 35.8 18.7 3.5 3.4
Spain (n = 750)
PT1 3 3 8.9 34.1 40.3 6.7 10.0
PT2 2 2 14.5 41.3 32.7 7.6 3.9
PT3 3 3 6.8 23.9 52.4 14.0 2.9
PT5 2 2 30.8 48.9 14.3 1.5 4.5
NA = no answer (missing values). Items’ values were reversed so as: 1 = none at all; 2 = not
very much; 3 = quite a lot; 4 = a great deal.

Table 7.4. Item analysis of the political trust in national institutions for Greece,
Portugal and Spain based on the first half-samples: European Values Study, 2008

Survey/item Greece Portugal Spain
ESS: (n) (1,008) (1,143) (1,240)
PT1 0.770 0.830 0.765
PT2 0.883 0.731 0.772
PT3 0.718 0.473 0.590
PT5 0.628 0.714 0.691
EVS: (n) (750) (767) (742)
PT1 0.802 0.858 0.791
PT2 0.749 0.774 0.637
PT3 0.689 0.457 0.516
PT5 0.677 0.767 0.676
Exploratory factor analysis was performed with maximum likelihood of the covariance matrix
and robust weighted least squares of the polychoric correlation matrix on the first
half-samples of the ESS and EVS, respectively.

Table 7.5. Exploratory factor analysis of the political trust in national institutions
items performed on the first half-samples of Greece, Portugal and Spain:
European Social Survey (ESS) and European Values Study (EVS), 2008

In all these cases, the model df were one, with the exception of the Greek EVS
sample, where the model df were two. However, although the half-sample sizes were
large enough, ranging from (750) to (1,241), for these single (and double) degree of
freedom models, the RMSEA 90% CI limits ranged from 0.0 to 0.181, suggesting
that they were “likely somewhere between perfect and extremely horrible! Clearly,
any RMSEA value with a CI this wide is of no value” (Kenny et al. 2015, p. 501).
Moreover, as all models were composed of four items, we considered the Kenny and McCoach (2003) results that the RMSEA tends to improve by the addition of
more items to the model whereas the CFI and TLI tend to worsen as the number of
items in the model increases. In this respect, relying on the SRMR, CFI and TLI
values – with all the reservations expressed by Kenny et al. (2015) – the findings
suggested adequate model fit for all models under consideration.

7.3.3. Scale construction and assessment

Scales were constructed by averaging their defining items. In Table 7.6, descriptive statistics, composite reliability and convergent validity are presented for
the full ESS and EVS samples of the three countries. All items of the political trust
in national institutions scale of the EVS countries were rescaled into a 0–10 scale in
order to facilitate the comparison with the results of the ESS countries.

The AVE was computed for each scale based on the CFA repeated for the full
samples of Greece (Figure 7.1, ESS: χ2/df = 4.19 (df = 1), SRMR = 0.005,
CFI = 0.999, TLI = 0.995 and RMSEA = 0.040 with the 90% CI = 0.007–0.082 and
EVS: χ2/df = 5.30 (df = 1), CFI = 0.999, TLI = 0.992 and RMSEA = 0.054 with
the 90% CI = 0.017–0.102), Portugal (Figure 7.2, ESS: χ2/df = 9.17 (df = 1),
SRMR = 0.009, CFI = 0.997, TLI = 0.982 and RMSEA = 0.060 with the 90%
CI = 0.030–0.098 and EVS: χ2/df = 10.08 (df = 1), CFI = 0.998, TLI = 0.986 and
RMSEA = 0.077 with the 90% CI = 0.039–0.123) and Spain (Figure 7.3, ESS:
χ2/df = 2.45 (df = 1), SRMR = 0.004, CFI = 0.999, TLI = 0.997 and RMSEA = 0.024
with the 90% CI = 0.000–0.064 and EVS: χ2/df = 8.65 (df = 1), CFI = 0.994,
TLI = 0.967 and RMSEA = 0.072 with the 90% CI = 0.034–0.119). Therefore,
based on the argument presented for the CFA results of half-samples, all models
provided adequate model fit.

The political trust in national institutions scale demonstrated adequate convergent validity (AVE above or around 0.50) for both the Greek and
Portuguese samples and the Spanish ESS sample. In the Spanish EVS sample,
although the scale’s AVE was less than 0.5, its composite reliability was higher than
0.6, therefore providing evidence of adequate convergent validity. The average
inter-item correlations were within the recommended range for unidimensionality (0.15–0.5) in all countries, except for the Greek ESS dataset. In the Greek,
Portuguese and Spanish ESS samples, the political trust in national institutions scale
was reliable with composite reliability values 0.828, 0.775 and 0.791 (≥0.70),
respectively. Accordingly, in the Greek, Portuguese and Spanish EVS samples, the
scale was reliable with composite reliability values 0.795, 0.819 and 0.743 (≥0.70),
respectively.

Figure 7.1. Standardized solution for the political trust (pt) one first-order factor
models based on CFA performed on the Greek ESS (N = 2,015) and EVS
(N = 1,500) full samples. Observed variables are represented by squares and the
latent variable by a circle

Figure 7.2. Standardized solution for the political trust (pt) one first-order factor
models based on CFA performed on the Portuguese ESS (N = 2,285) and EVS
(N = 1,536) full samples. Observed variables are represented by squares and the
latent variable by a circle

Figure 7.3. Standardized solution for the political trust (pt) one first-order factor
models based on CFA performed on the Spanish ESS (N = 2,481) and EVS
(N = 1,492) full samples. Observed variables are represented by squares and the
latent variable by a circle

Greece Portugal Spain
Number of items 4 4 4

ESS: N 2,015 2,285 2,481
Mean (standard error) 3.92 (0.046) 3.77 (0.040) 4.65 (0.038)
95% confidence interval 3.83–4.01 3.69–3.85 4.58–4.73
Standard deviation 2.032 1.836 1.799
Skewness 0.043 0.034 -0.151
Kurtosis -0.688 -0.309 -0.134
Convergent validity 0.556 0.478 0.490
Composite reliability 0.828 0.775 0.791
Average inter-item correlation 0.571 0.490 0.502
Min.–max. correlations 0.418–0.695 0.280–0.640 0.389–0.581
Range of correlations 0.278 0.360 0.193
EVS: N 1,500 1,536 1,492
Mean (standard error) 4.12 (0.058) 4.44 (0.057) 4.51 (0.051)
95% confidence interval 4.00–4.23 4.33–4.55 4.41–4.61
Standard deviation 2.193 2.107 1.821
Skewness 0.118 -0.024 -0.111
Kurtosis -0.392 -0.491 -0.110
Convergent validity 0.497 0.538 0.422
Composite reliability 0.795 0.819 0.743
Average inter-item correlation 0.447 0.422 0.329
Min.–max. correlations 0.354–0.516 0.192–0.575 0.190–0.439
Range of correlations 0.163 0.383 0.249
ESS: standard errors for skewness and kurtosis of the Greek scale were 0.055 and 0.110,
respectively; standard errors for skewness and kurtosis of the Portuguese scale were 0.054 and
0.108, respectively; standard errors for skewness and kurtosis of the Spanish scale were 0.052
and 0.103, respectively.
EVS: standard errors for skewness and kurtosis of the Greek scale were 0.064 and 0.129,
respectively; standard errors for skewness and kurtosis of the Portuguese scale were 0.066 and
0.133, respectively; standard errors for skewness and kurtosis of the Spanish scale were 0.068
and 0.136, respectively.

Table 7.6. Descriptive statistics, convergent validity, composite reliability and internal
consistencies of the political trust in national institutions items based on the full
samples of Greece, Portugal and Spain: European Social Survey (ESS) and
European Values Study (EVS), 2008

As shown, higher mean scale values were obtained from the EVS samples of
Greece and Portugal, and the reverse holds true for the Spanish samples.

7.4. Conclusion

This chapter demonstrated the importance of ascertaining the level of
measurement of the items in choosing the appropriate methods to be applied when
validating a construct. The 2008 ESS and EVS measurement of political trust in
national institutions were used for Greece, Portugal and Spain where items were
considered as pseudo-interval (ESS) and ordinal (EVS).

The investigation of the structure (dimensionality) of the 2008 ESS and EVS
measurement of political trust in national institutions scale by applying the
traditional approaches of EFA and CFA to randomly split half-samples resulted in
a unidimensional structure in all countries, following Brown’s (2015)
recommendation to eliminate from the analysis factors defined by two or three
items.

The demonstration of the complex sequence of decisions required in performing
EFA and CFA, based on current theory and practice, should be noted among the
strengths of the study. In all countries and both datasets, the four items of the
political trust in national institutions scale exhibited strong factor loadings based on
EFA results. CFA performed on the second half-samples and the full samples
resulted in adequate model fit for all cases. However, the RMSEA and its 90% CI
were not used for the assessment of the CFA models, although the suggestion by
Kenny et al. (2015, p. 503) that “for a single-factor model with four indicators, there
are several alternatives about what two parameters to add, for example, two
correlated errors or a second factor with a correlated error” was taken up but to no
avail. Moreover, the analysis provided reliable scales which were of adequate
convergent validity. Applying the appropriate modifications, the methodology
presented may be easily applied to other cases of validating scales that are composed
of pseudo-interval or ordinal items.

7.5. Funding

This research, conducted under the auspices of the National Centre for Social
Research, was co-financed by Greece and the European Union (European Social
Fund – ESF) through the Operational Programme “Human Resources Development,
Education and Lifelong Learning 2014-2020” in the context of the project “Greece
and Southern Europe: Investigating political trust to institutions, social trust and
human values, 2002-2017” (MIS 5049524).

7.6. References

Bartholomew, D.J., Steele, F., Moustaki, I., Galbraith, J. (2008). Analysis of Multivariate
Social Science Data. Chapman & Hall/CRC, London.
Bollen, K.A. (1989). Structural Equations with Latent Variables. John Wiley & Sons,
New York.
Brown, T.A. (2015). Confirmatory Factor Analysis for Applied Research, 2nd edition. The
Guilford Press, New York.
Charalampi, A. (2018). The importance of items’ level of measurement in investigating the
structure and assessing the psychometric properties of multidimensional constructs.
Doctoral Dissertation, Panteion University of Social and Political Sciences, Athens.
Charalampi, A., Michalopoulou, C., Richardson, C. (2019). Determining the structure and
assessing the psychometric properties of multidimensional scales constructed from ordinal
and pseudo-interval items. Communications in Statistics – Case Studies, Data Analysis
and Applications, 5(1), 26–38.
Charalampi, A., Michalopoulou, C., Richardson, C. (2020). Validation of the 2012 European
Social Survey measurement of wellbeing in seventeen European countries. Applied
Research in Quality of Life, 15(1), 73–105.
Clark, L.A. and Watson, D. (1995). Constructing validity: Basic issues in objective scale
development. Psychological Assessment, 7(3), 309–319.
Daskalopoulou, I. (2018). Individual-level evidence on the causal relationship between social
trust and institutional trust. Social Indicators Research, 144, 275–298.
Ervasti, H., Kouvo, A., Venetoklis, T. (2018). Social and institutional trust in times of crisis:
Greece, 2002–2011. Social Indicators Research, 141, 1207–1231.
Fabrigar, L.R., Wegener, D.T., MacCallum, R.C., Strahan, E.J. (1999). Evaluating the use of
exploratory factor analysis in psychological research. Psychological Methods, 4(3),
272–299.
Fornell, C. and Larcker, D.F. (1981). Evaluating structural equation models with
unobservable variables and measurement error. Journal of Marketing Research, 18(1),
39–50.
Gaskin, J. (2016). Data screening. Gaskination’s StatWiki [Online]. Available at:
http://statwiki.gaskination.com/index.php?title=Main_Page [Accessed 30 June 2016].
Hooghe, M. and Kern, A. (2015). Party membership and closeness and the development of
trust in political institutions: An analysis of the European Social Survey, 2002–2010.
Party Politics, 21(6), 944–956.
Hu, L. and Bentler, P.M. (1999). Cutoff criteria for fit indexes in covariance structure
analysis: Conventional criteria versus new alternatives. Structural Equation Modeling,
6(1), 1–55.
Kenny, D.A. and McCoach, D.B. (2003). Effect of the number of variables on measures of fit
in structural equation modeling. Structural Equation Modeling, 10(3), 333–351.
Kenny, D.A., Kaniskan, B., McCoach, D.B. (2015). The performance of RMSEA in models
with small degrees of freedom. Sociological Methods & Research, 44(3), 486–507.
Marsh, H.W., Hau, K.T., Wen, Z. (2004). In search of golden rules: Comment on hypotheses-
testing approaches to setting cutoff values for fit indexes and dangers in overgeneralizing
Hu and Bentler’s (1999) findings. Structural Equation Modeling, 11(3), 320–341.
Michalopoulou, C. (2017). Likert scales require validation before application – Another
cautionary tale. BMS Bulletin de Méthodologie Sociologique, 134, 5–23.
Muthén, L.K. and Muthén, B.O. (1998–2017). Mplus User’s Guide, 8th edition. Muthén &
Muthén, Los Angeles, CA.
Nunnally, J.C. and Bernstein, I.H. (1994). Psychometric Theory. McGraw-Hill, New York.
Raykov, T. (2007). Reliability if deleted, not “alpha if deleted”: Evaluation of scale reliability
following component deletion. British Journal of Mathematical and Statistical
Psychology, 60(2), 201–216.
Schmitt, T.A. (2011). Current methodological considerations in exploratory and confirmatory
factor analysis. Journal of Psychoeducational Assessment, 29, 304–322.
Smets, A., Hooghe, M., Quintelier, E. (2013). The scale validity of trust in political
institutions measurements over time in Belgium. An analysis of the European social
survey, 2002–2010. Paper presented at the 5th European Survey Research Association
(ESRA) Conference, Ljubljana, 15–19 July 2013.
Stevens, S.S. (1946). On the theory of scales of measurement. Science, 103(2684), 677–680.
Tabachnick, B.G. and Fidell, L.S. (2007). Using Multivariate Statistics. Pearson Allyn &
Bacon, Upper Saddle River, NJ.
Terwee, C.B., Bot, S.D.M., de Boer, M.R., van der Windt, D.A.W.M., Knol, D.L., Dekker, J.,
Bouter, L.M., de Vet, H.C.W. (2007). Quality criteria were proposed for measurement
properties of health status questionnaires. Journal of Clinical Epidemiology, 60(1), 34–42.
Thompson, B. (2005). Exploratory and Confirmatory Factor Analysis: Understanding
Concepts and Applications. American Psychological Association, Washington, DC.
Yu, C.Y. (2002). Evaluating cutoff criteria of model fit indices for latent variable models with
binary and continuous outcomes. Unpublished Doctoral Dissertation, University of
California, Los Angeles, CA.
8

The State of the Art in Flexible
Regression Models for Univariate
Bounded Responses

Modeling bounded continuous responses, such as proportions and rates, is a
relevant problem in methodological and applied statistics. A further issue is observed
if the response takes values at the boundary of the support. Given that standard
models are unsuitable, a successful and relatively recent branch of research favors
modeling the response variable according to distributions that are well-defined on the
restricted support. A popular and well-researched choice is the beta regression model
and its augmented version. More flexible alternatives than the beta regression model,
among which are the flexible beta and the variance inflated beta, have been tailored
for data with outlying observations, latent structures and heavy tails. These models
are based on special mixtures of beta distributions and their augmented versions
to handle the presence of values at the boundary of the support. The aim of this
chapter is to provide a comprehensive review of these models and to briefly describe
the FlexReg package, a newly available tool on CRAN that offers an efficient and
easy-to-use implementation of regression models for bounded responses. Two real
data applications are performed to show the relevance of correctly modeling the
bounded response and to make comparisons between models. Inferential issues are
dealt with by the (Bayesian) Hamiltonian Monte Carlo algorithm.

Chapter written by Agnese Maria D I B RISCO, Roberto A SCARI, Sonia M IGLIORATI and
Andrea O NGARO.
For a color version of all the figures in this chapter, see www.iste.co.uk/zafeiris/data1.zip.


8.1. Introduction

The development of statistical methods to deal with bounded responses in
regression models has seen rapid growth over recent years. The purpose of this chapter
is to review the state of the art in flexible regression models for univariate bounded
responses.

When a continuous variable restricted to the interval (0, 1) is to be regressed onto
covariates, standard approaches are clearly unsuitable: they can produce odd results,
such as fitted values outside the support. A possible
solution is to transform the response variable so that its support becomes the real line
and then switch back to standard methods (Aitchison 1986). Despite it being very
tempting to find a way to restore the well-established regression methodology, this
approach has, in our opinion, two relevant drawbacks. On the one hand, there is the
methodological issue of the failure of the normality assumption, since proportions very
often show asymmetric distributions and heteroscedasticity. On the other hand, there is the practical
issue of finding meaningful interpretations of the estimated parameters in terms of
the original response variable. A different solution, the one that we favored, is to
model the response directly on its restricted support by adopting specific regression
models. With the goal of defining a parametric regression model, it is then necessary
to establish some proper distributions on the restricted support. The first attempt in
the latter direction was to model the bounded response variable according to a beta
distribution, thus defining a regression model for its mean (Ferrari and Cribari-Neto
2004). Extensions to this approach have been proposed in the direction of also
modeling the precision parameter of the beta (Smithson and Verkuilen 2006).

Although the beta distribution can show very different shapes, it fails to model
a wide range of phenomena, including heavy tails and bimodal responses. To
achieve greater flexibility, two further distributions have been proposed on the
restricted interval that take advantage of a mixture structure. The first one is called
variance-inflated beta (VIB) (Di Brisco et al. 2020); it is a mixture of two betas sharing
a common mean parameter where the first component has a precision decreased by
factor k, and thus, it displays a larger variance. The second distribution that has been
proposed to enhance the flexibility of the regression models for constrained data is
the flexible beta (FB) (Migliorati et al. 2018). The rationale behind this distribution
is to consider a mixture of two betas sharing a common precision parameter and
with different component means. It is noteworthy that, not only are the component
means indeed different, but they are also arranged so that the first component mean
is greater than the second one, thus avoiding any computational burden related to
label-switching (Frühwirth-Schnatter 2006).

Other strategies to deal with bounded responses have been proposed in the
literature such as regression models based on a new class of Johnson SB distributions
(Lemonte et al. 2016), mixed regression models based on the simplex distribution
(Qiu et al. 2008), quantile regression models (Bayes et al. 2017) and fully
non-parametric regression models (Barrientos et al. 2017). The analysis of these
proposals goes beyond the aim of this chapter.

In real data applications, it is quite common to observe a bounded response
variable with some observations at the boundary of the support. Since the beta as
well as its flexible extensions do not admit values at the boundary, a simple solution
to preserve their usage is to transform the response variable from [0, 1] back to the
open interval (0, 1). This approach is convenient either when the percentage of values
at the boundary is negligible or when the 0/1 values stem from approximation errors.
However, it may happen that the 0 indicates exactly the absence of the phenomenon
and the 1 its wholeness. In this latter scenario, the transformation approach does not
make it possible to fully exploit the information provided by values at the boundary.
Therefore, it seems more fruitful to adopt an augmentation strategy that consists of
augmenting the probability density function (pdf) of the distribution defined on the
open interval (0, 1) by adding positive probabilities to the occurrence of the values at
the boundary.

In addition to the review of flexible regression models, either augmented or not,
the aim of this chapter is to provide a quick overview of the FlexReg package that fits
FB regression (FBreg), beta regression (Breg) and VIB regression (VIBreg) models
for bounded responses with a Bayesian approach to inference. All functions within the
package are written in R language whereas the models are written in Stan language
(Stan Development Team 2016). The package is available from the Comprehensive R
Archive Network (CRAN) 1.

The rest of this chapter is structured as follows. Section 8.2 describes the general
framework of a regression model for bounded responses, whereas section 8.2.1
extends the model with the augmentation strategy. Section 8.2.2 illustrates the beta
distribution and its flexible extensions and shows how to get the corresponding
parametric regression models, either augmented or not. Section 8.3 is dedicated to two
case studies. Section 8.3.1 illustrates the analysis of the “Stress” dataset by making use
of the regression models without augmentation; it also provides a quick overview of
the FlexReg package. Section 8.3.2 focuses on the analysis of the “Reading” dataset
by illustrating mainly the augmented regression models.

8.2. Regression model for bounded responses

Proportions, rates, or, more generally, phenomena defined on a restricted support
often play the role of dependent variable in a regression framework.

Let us consider a random variable (rv) Y on (0, 1) – the generalization to a generic
open interval of type (a, b), −∞ < a < b < ∞, being straightforward. Let us assume

1 https://CRAN.R-project.org/package=FlexReg.
that Y follows a well-defined distribution on the open interval – in section 8.2.2 we
explore some alternatives – and let us identify the mean of the variable, μ = E(Y),
and its precision, φ, which is inversely related to the variance Var(Y). We consider a
sample of Yi, i = 1, . . . , n, i.i.d. response variables distributed as Y.

In a regression framework, it is reasonable to let the mean parameter be a function
of the vector of covariates as follows:

g1 (μi ) = x1i β1 [8.1]

where g1 (·) is an adequate link function, x1i is a vector of covariates observed on
subject i (i = 1, . . . , n), and β1 is a vector of regression coefficients for the mean.
Thus, the regression model is of GLM-type, despite not being exactly a GLM since
the beta distribution (and its extensions) does not belong to the dispersion-exponential
family (McCullagh and Nelder 1989). Function g1 (·) has to be strictly monotone and
twice differentiable, and many options are available such as logit, probit, cloglog and
loglog. Even so, the choice often falls on the logit function since it allows a convenient
interpretation of the regression coefficients as logarithms of odds.

Moreover, we can also link the precision parameter to some covariates (either the
same as in the regression model for the mean or different). To do that, equation [8.1]
is complemented with the following:

g2 (φi ) = x2i β2 [8.2]

where g2 (·) is an adequate link function, x2i is a vector of covariates observed on
subject i (i = 1, . . . , n), and β2 is a vector of regression coefficients for the precision.
Common choices for g2 (·) are the logarithm and the square root.
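
As a rough numerical illustration of these two link functions (the coefficient and covariate values below are made up purely for demonstration), the mapping from linear predictors to (μi, φi) can be sketched in R as follows:

```r
# Minimal sketch of the link functions in [8.1] and [8.2]: logit link for the
# mean, log link for the precision. Coefficient values are illustrative only.
inv_logit <- function(eta) 1 / (1 + exp(-eta))

x_i   <- c(1, 0.4)        # intercept plus one covariate value
beta1 <- c(-0.5, 1.2)     # regression coefficients for the mean
beta2 <- c(2.0, 0.3)      # regression coefficients for the precision

mu_i  <- inv_logit(sum(x_i * beta1))   # g1 = logit  =>  mu_i in (0, 1)
phi_i <- exp(sum(x_i * beta2))         # g2 = log    =>  phi_i > 0
c(mu = mu_i, phi = phi_i)
```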

8.2.1. Augmentation

In real data applications, it is not uncommon to observe values at the boundary
of the support when dealing with constrained variables. These values might be due
to approximation errors, in which case it might be adequate to just transform the
response from [0, 1] back to the open interval (0, 1). Conversely, in many scenarios, a
response exactly equal to 0 or 1 has a clear interpretation as absence or wholeness of
the phenomenon at hand. In this latter case, it might be of interest to model the data at
the boundary separately from the rest of the sample. This can be achieved through the
augmentation strategy.

An augmented distribution is a three-part mixture that assigns positive
probabilities to 0 and 1 and a (continuous) density to the open interval (0, 1). The
pdf is given by:

f_A(y; \eta, q_0, q_1) =
  \begin{cases}
    q_0 & \text{if } y = 0 \\
    q_1 & \text{if } y = 1 \\
    q_2\, f(y; \eta) & \text{if } 0 < y < 1
  \end{cases}
    [8.3]
where the vector (q0 , q1 , q2 ) belongs to the simplex being 0 < q0 , q1 , q2 < 1 and
q0 + q1 + q2 = 1. The density f (y; η), with η being a vector of parameters which
will include at least a mean and a precision parameter, is defined on the open interval
and it can be either a beta or one of its flexible alternatives (see section 8.2.2). The
marginal mean and variance of an rv with an augmented distribution are equal to:

E(Y) = q_1 + (1 - q_0 - q_1)\, E(Y \mid 0 < Y < 1)

Var(Y) = (1 - q_0 - q_1)\, Var(Y \mid 0 < Y < 1) + q_1
         + (1 - q_0 - q_1)\, E(Y \mid 0 < Y < 1)^2
         - \left[ q_1 + (1 - q_0 - q_1)\, E(Y \mid 0 < Y < 1) \right]^2
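
As a quick numerical check of these two expressions, the following R sketch evaluates the marginal moments for arbitrary (purely illustrative) values of q0, q1 and the conditional moments on (0, 1):

```r
# Marginal mean and variance of an augmented rv; q0, q1 and the conditional
# moments on (0, 1) are arbitrary illustrative values.
q0 <- 0.05; q1 <- 0.10
m  <- 0.60    # E(Y | 0 < Y < 1)
v  <- 0.02    # Var(Y | 0 < Y < 1)

EY   <- q1 + (1 - q0 - q1) * m
VarY <- (1 - q0 - q1) * v + q1 + (1 - q0 - q1) * m^2 -
        (q1 + (1 - q0 - q1) * m)^2
c(mean = EY, variance = VarY)
```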

Let us consider a sample of Yi, i = 1, . . . , n, i.i.d. response variables distributed
as Y with an augmented distribution. Focusing on the part of the distribution defined
on the open interval (0, 1), the regression model for the mean, E(Yi |0 < Yi < 1),
and eventually for the precision parameter, understood as a function of the variance
V ar(Yi |0 < Yi < 1), are defined as in equations [8.1] and [8.2]. The regression model
for the augmented part of the distribution is defined as follows:

g_3(q_{1i}) = \mathbf{x}_{3i}^{\top} \boldsymbol{\beta}_3, \qquad
g_4(q_{0i}) = \mathbf{x}_{4i}^{\top} \boldsymbol{\beta}_4    [8.4]
where g3 (·) and g4 (·) have to be proper link functions, x3i and x4i are the vectors
of covariates observed on subject i (i = 1, . . . , n), and β3 and β4 are the vectors of
regression coefficients for the probabilities of values at the upper and lower bound,
respectively. We propose a bivariate logit link, g3 (q1i ) = log(q1i /(1 − q0i − q1i )) and
g4 (q0i ) = log(q0i /(1 − q0i − q1i )), that has a twofold advantage. First, it retains the
interpretation of parameters in terms of odds ratios. Besides, it is in compliance with
the unit-sum constraint, i.e. q0 + q1 + q2 = 1.
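
The inversion of this bivariate logit link is easily verified numerically; in the R sketch below the two linear-predictor values are arbitrary and chosen only for illustration:

```r
# Inverting the bivariate logit link: from the two linear predictors to
# (q0, q1, q2) on the simplex. Linear-predictor values are illustrative only.
eta1 <- -1.2   # x3i' beta3, governing q1 (mass at 1)
eta0 <- -2.0   # x4i' beta4, governing q0 (mass at 0)

denom <- 1 + exp(eta1) + exp(eta0)
q1 <- exp(eta1) / denom
q0 <- exp(eta0) / denom
q2 <- 1 / denom
c(q0 = q0, q1 = q1, q2 = q2, total = q0 + q1 + q2)  # unit-sum constraint holds
```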

Please note that, by simply setting one or both probabilities q1 and q0 equal to zero,
it is possible to model scenarios where only 0s or 1s or neither are observed, the latter
case restoring a non-augmented regression model.

8.2.2. Main distributions on the bounded support

Having in mind the regression framework that has just been outlined, a
fully parametric approach requires the definition of a proper distribution on the
bounded support for the response variable. As a general rule, it is convenient, for
regression purposes, to express the distributions on the bounded support in terms of
mean-precision parameters.

A well-known distribution on the open interval is the beta one. The standard Breg
model is derived if Yi , i = 1, . . . , n, are independent and follow a beta distribution.
Its augmented version, referred to as the augmented beta regression (ABreg) model,
is obtained when the density function f (y; η) in equation [8.3], for 0 < y < 1,
is of a beta rv. The probability density function of the beta with a mean-precision
parameterization, Y ∼ Beta(μφ, (1 − μ)φ), is as follows:
f_B^*(y; \mu, \phi) = \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\, \Gamma((1-\mu)\phi)}\; y^{\mu\phi - 1} (1 - y)^{(1-\mu)\phi - 1}

for 0 < y < 1, where the parameter 0 < μ < 1 identifies the mean and φ > 0 is
interpreted as a precision parameter, since:

Var(Y) = \frac{\mu(1-\mu)}{\phi + 1}.
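
In R, the mean-precision beta density can be obtained directly from dbeta() by translating (μ, φ) into the usual shape parameters; the short sketch below (with illustrative parameter values) also checks the variance formula by simulation:

```r
# Mean-precision beta density, Beta(mu * phi, (1 - mu) * phi), via dbeta().
dbeta_mp <- function(y, mu, phi) {
  dbeta(y, shape1 = mu * phi, shape2 = (1 - mu) * phi)
}

mu <- 0.3; phi <- 15                        # illustrative values
dbeta_mp(0.25, mu, phi)

# Simulation check of Var(Y) = mu * (1 - mu) / (phi + 1)
set.seed(1)
y <- rbeta(1e5, mu * phi, (1 - mu) * phi)
c(simulated = var(y), closed_form = mu * (1 - mu) / (phi + 1))
```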
By varying the parameters that index the distribution, we can observe a
variety of shapes. Despite its inherent flexibility, however, the beta is not designed to model
heavy tails (often due to outlying observations) and bimodality (possibly due to latent
structures in data).

The flexible extensions of the beta were devised precisely to manage these types of
data patterns, which, in our experience, often occur in practical situations. This extra
flexibility is achieved by making use of mixture distributions.

The first flexible extension refers to the VIB distribution, Y ∼ VIB(μ, φ, k, p),
whose probability density function is as follows:

fV IB (y; μ, φ, k, p) = pfB∗ (y; μ, kφ) + (1 − p)fB∗ (y; μ, φ)

for 0 < y < 1, where 0 < μ < 1 identifies the overall mean of Y (as well as mixture
component means), 0 < k < 1 is a measure of the extent of the variance inflation,
0 < p < 1 is the mixing proportion parameter, and φ > 0 plays the role of a precision
parameter, since as it increases V ar(Y ) decreases. The idea behind this distribution
is to draw up a mixture of two betas where one component is entirely dedicated to
outlying observations.
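
A minimal R sketch of the VIB density, written directly as the two-component beta mixture above (parameter values are illustrative), is:

```r
# VIB density: mixture of two mean-precision betas sharing mu, where the first
# component has its precision deflated by the factor k (hence inflated variance).
dvib <- function(y, mu, phi, k, p) {
  p * dbeta(y, mu * k * phi, (1 - mu) * k * phi) +
    (1 - p) * dbeta(y, mu * phi, (1 - mu) * phi)
}
curve(dvib(x, mu = 0.5, phi = 20, k = 0.1, p = 0.3),
      from = 0.01, to = 0.99, ylab = "density")   # heavier tails than the beta alone
```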

The standard VIBreg model is derived if Yi, i = 1, . . . , n, are independent and
follow a VIB distribution. In the presence of values at the boundary of the support, the
augmented VIB regression (AVIBreg) model is obtained when the density function
f (y; η) in equation [8.3], for 0 < y < 1, is of a VIB rv.

The second flexible extension refers to the FB distribution, Y ∼ F B(μ, φ, w̃, p),
whose probability density function is as follows:

fF B (y; μ, φ, w̃, p) = pfB∗ (y; μ − pw̃ + w̃, φ) + (1 − p)fB∗ (y; μ − pw̃, φ)

for 0 < y < 1, where 0 < μ < 1 identifies the mean of Y, 0 < w̃ < min{μ/p, (1 − μ)/(1 − p)}
is a measure of distance between the two mixture components, 0 < p < 1 is the mixing
proportion parameter, and φ > 0 is a precision parameter since as it increases Var(Y)
decreases. It is worth noting that
the parameters μ, p and w̃ are linked by some constraint, so it is convenient to define a
slightly different parameterization that includes a normalized distance between the two
mixture components, w = w̃/ min{μ/p, (1 − μ)/(1 − p)}. The final parameterization
of the FB distribution, depending on parameters 0 < μ < 1, φ > 0, 0 < w < 1
and 0 < p < 1, has the merit of being variation independent, a useful feature in the
Bayesian framework.
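
Analogously, the FB density can be sketched in R by mapping the normalized distance w back to w̃ and using the two component means μ − pw̃ + w̃ and μ − pw̃ (parameter values below are illustrative):

```r
# FB density: mixture of two betas sharing phi, with component means
# mu + (1 - p) * wtilde and mu - p * wtilde; w is the normalized distance.
dfb <- function(y, mu, phi, w, p) {
  wtilde <- w * min(mu / p, (1 - mu) / (1 - p))
  mu1 <- mu + (1 - p) * wtilde   # first (larger) component mean
  mu2 <- mu - p * wtilde         # second component mean
  p * dbeta(y, mu1 * phi, (1 - mu1) * phi) +
    (1 - p) * dbeta(y, mu2 * phi, (1 - mu2) * phi)
}
curve(dfb(x, mu = 0.5, phi = 20, w = 0.8, p = 0.5),
      from = 0.01, to = 0.99, ylab = "density")   # bimodal, overall mean still 0.5
```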

The standard FBreg model is derived if Yi, i = 1, . . . , n, are independent and
follow an FB distribution. Otherwise, its augmented extension, the augmented FB
regression (AFBreg) model, is achieved when the density function f (y; η) in equation
[8.3], for 0 < y < 1, is of a FB rv.

Figure 8.1. Top panels: the dotted line refers to the density of a Beta(0.5, 20). On
the left-hand side, the dashed lines refer to densities from a V IB rv with μ = 0.5,
φ = 20, k = 0.1 and p = {0.1 (red), 0.3 (green), 0.5 (blue)}. On the right-hand side,
the dashed lines refer to densities from an F B rv with μ = 0.5, φ = 20, p = 0.5
and w = {0.3 (red), 0.5 (green), 0.8 (blue)}. Bottom panels: the dotted line refers to the
density of a Beta(0.8, 10). On the left-hand side, the dashed lines refer to densities from
a V IB rv with μ = 0.8, φ = 10, k = 0.01 and p = {0.3 (red), 0.5 (green), 0.8 (blue)}.
On the right-hand side, the dashed lines refer to densities from an F B rv with μ = 0.8,
φ = 10, p = 0.9 and w = {0.5 (red), 0.8 (green), 0.9 (blue)}

The best way to appreciate the additional flexibility provided by the proposed
mixture distributions is to visualize some densities. In the top panels of Figure 8.1,
the dotted line represents a symmetric beta density with mean equal to 0.5. The VIB
distribution yields densities, represented as colored dashed lines on the left-hand
side, that are still centered at 0.5 but have heavier tails than the beta.
Conversely, the FB distribution can provide densities, represented as colored dashed
lines on the right-hand side, which are bimodal: the overall mean is still equal to 0.5
but the component means are different. Another scenario, represented in the bottom
panels of Figure 8.1, considers a negatively skewed beta. By properly setting the
parameters of the VIB, it is possible to obtain densities, represented as colored dashed
lines on the left-hand side, that put increasing mass on the left tail of the distribution.
Moreover, the FB can handle a heavier left tail while still preserving the center of the
distribution, as can be seen on the right-hand side.

8.2.3. Inference and fit

Inference in regression models for bounded responses can be done either with
a likelihood-based approach or a Bayesian one. However, likelihood-based inference
requires numerical integration and optimization, which often leads to analytical
challenges and computational issues.

Bayesian inference, on the contrary, has a straightforward way of dealing
with complex data and mixtures through the incomplete data mechanism
(Frühwirth-Schnatter 2006). Among the many MC algorithms, a recent solution
is the Hamiltonian Monte Carlo (HMC) (Duane et al. 1987; Neal 1994), which
originates as a generalization of the Metropolis algorithm. The HMC can be
implemented straightforwardly through the Stan modeling language (Gelman et al.
2014; Stan Development Team 2016). To make inference on the samples from the
posterior distributions, it is necessary to specify the full likelihood function and prior
distributions for the unknown parameters.

The definition of the likelihood derives naturally from the distributional
assumption on the response variable. The specification of the prior distributions,
on the other hand, deserves some further remarks. As a general rule, non- or weakly
informative priors are selected to induce the minimum impact on the posteriors
(Albert 2009). Moreover, the assumption of prior independence, which holds for
all the regression models at hand, ensures the factorization of the multivariate prior
density into univariate ones, one for each parameter. Going back to the general
non-informative priors, we choose, for the regression coefficients β h (h = 1, 2, 3, 4),
a multivariate normal with zero mean vector and a diagonal covariance matrix with
“large” values for the variances to induce flatness. In the case where the precision
parameter is not regressed onto covariates a prior is put directly on φ and usually
consists of a Gamma(g, g) with a “small” hyperparameter g, for example g = 0.001.
The additional parameters of the FB type and VIB type regression models, which are
the mixing proportion p, the normalized distance w, and the extent of the variance
inflation k, all have a uniform prior on (0, 1).

The evaluation of fit of a model is made through the widely applicable information
criterion (WAIC) (Vehtari et al. 2017) whose rationale is the same as that of standard
comparison criteria, namely penalizing an estimate of the goodness of fit of a model
by an estimate of its complexity. The advantage of WAIC over other well-established
criteria, in this framework, consists of its being fully Bayesian and well defined for
mixture models. The rule of thumb states that models with smaller values of the
comparison criteria are better in fit.
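
To fix ideas, the WAIC can be computed from a matrix of pointwise log-likelihood evaluations (posterior draws by observations) along the lines of Vehtari et al. (2017); the R sketch below uses a simulated matrix only to keep the example self-contained:

```r
# WAIC from a (draws x observations) matrix of pointwise log-likelihoods,
# following the usual lppd - p_WAIC construction; smaller values indicate
# better fit. The log-likelihood matrix here is simulated for illustration.
waic_from_loglik <- function(ll) {
  lppd   <- sum(log(colMeans(exp(ll))))   # log pointwise predictive density
  p_waic <- sum(apply(ll, 2, var))        # effective number of parameters
  -2 * (lppd - p_waic)
}
set.seed(1)
ll_draws <- matrix(rnorm(4000 * 150, mean = 0.5, sd = 0.1), nrow = 4000)
waic_from_loglik(ll_draws)
```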

8.3. Case studies

The best way to illustrate all the methodological aspects described so far is to resort
to some practical applications. The computational implementation of the regression
models without augmentation, that is, estimation issues and assessment of the results,
is made easier with the R package FlexReg. An upgrade of the package containing the
augmented versions of all models at hand is forthcoming.

8.3.1. Stress data

The “Stress” dataset, available from the FlexReg package, concerns a sample
of non-clinical women in Townsville, Queensland, Australia. Respondents were
asked to fill out a questionnaire from which the stress and anxiety rates were
computed (Smithson and Verkuilen 2006). We fit Breg, FBreg and VIBreg regression
models by regressing the mean of anxiety onto the stress level. Each model is run
20,000 iterations with the first half as burn-in. The implementation is done with the
flexreg() function of the FlexReg package:

> data("Stress")
> flexreg(formula = anxiety ~ stress, dataset = Stress,
          type = c("FB", "VIB", "Beta"), n.iter = 20000)

Please note that, as much as possible, the function preserves the structure of lm()
and glm() functions so as to facilitate its use among R users. In particular, formula
is the main argument of the function where the user has to specify the name of the
dependent variable and, separated by a tilde, the names of the covariates for the
regression model for the mean. If appropriate, we can also specify the names of the
covariates for the regression model for the precision (separated from the rest by a
vertical bar). The argument type allows us to select the type of model out of Breg,
FBreg and VIBreg.
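
For instance, a hypothetical call that also regresses the precision on the stress level (using the vertical bar syntax just described, and the same argument names as the call shown above) would look like:

```r
# Hypothetical call: mean and precision both regressed on stress (sketch only;
# argument names follow the call shown earlier in this section).
fit_fb <- flexreg(formula = anxiety ~ stress | stress, dataset = Stress,
                  type = "FB", n.iter = 20000)
```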

Once the parameters have been estimated through the HMC algorithm and before
continuing with further analyses, it is a good practice to check for the convergence
to the posterior distributions. On that, the FlexReg package is endowed with the
convergence.plot() and the convergence.diag() functions, both requiring as
the main argument an object of class flexreg which is obtained as a result of the
flexreg() function. The former produces a .pdf file containing some convergence
plots (i.e. density-plots, trace-plots, intervals, rates, Rhat and autocorrelation-plots) for
the Monte Carlo draws. The latter returns some diagnostics to check for convergence
to the equilibrium distribution of the Markov chains and it prints the number (and
percentage) of iterations that ended with a divergence and that saturated the max
treedepth, and the E-BFMI values for each chain for which E-BFMI is less than 0.2
(Gelman et al. 2014).
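
In practice, the two checks can be run on the fitted object as in the following sketch (assuming a fitted object such as fit_fb from the hypothetical call above):

```r
# Convergence checks on a fitted object of class flexreg (sketch; `fit_fb`
# denotes a fit obtained as in the previous calls).
convergence.diag(fit_fb)   # divergences, saturated max treedepth, low E-BFMI chains
convergence.plot(fit_fb)   # writes a .pdf with trace, density, Rhat and ACF plots
```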

When the dependent variable is a function of a quantitative covariate, as in this
case study, it is often useful to plot the fitted means of the response over the scatterplot
of the covariate versus the dependent variable. The computation of the fitted means is
made easy thanks to the predict() function evaluated on an object of class flexreg.
From the left-hand side panel of Figure 8.2, we can note that the regression curves
of the Breg and VIBreg models are almost overlapping, whereas the curve of the
FBreg is slightly shifted towards the bottom. The behavior of the FBreg must also be
understood in terms of group regression curves (represented as colored-dashed lines).
It is worth noting the ability of the FBreg model to fit the data well by dedicating
the first component mean to the points on the top of the scatterplot, and the second
component mean to the points shifted towards the x-axis.
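
A minimal sketch of this step is given below; the exact structure of the object returned by predict() may differ across package versions, so the code is indicative only:

```r
# Fitted means and the stress-anxiety scatterplot (sketch; assuming `fit_fb`
# from above; the exact return value of predict() may differ).
mu_hat <- predict(fit_fb)
head(mu_hat)

plot(Stress$stress, Stress$anxiety, xlab = "Stress", ylab = "Anxiety")
# the fitted means can then be overlaid on this scatterplot, as in Figure 8.2
```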

Figure 8.2. Left-hand side: fitted regression curves for the models. Breg in dotted line,
VIBreg in solid line and FBreg in dashed lines. Colored dashed curves refer to the
component means of the FBreg model. Right-hand side: scatterplot of stress level
versus anxiety level. Red dots refer to subjects belonging to group 1

To make a comparison between competing models, a useful tool is the WAIC
criterion which indicates the best model in terms of fit and the number of parameters.
The FlexReg package computes the WAIC values through the waic() function. The
model with the best fit to the data, understood as the lowest WAIC, is the FBreg
(WAIC = −566.4), whereas the Breg (WAIC = −558.8) and VIBreg (WAIC =
−558.2) models have a similar performance.
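
In code, the comparison amounts to calling waic() on each fitted object; in the sketch below, fit_fb, fit_breg and fit_vib denote flexreg fits of type FB, Beta and VIB obtained as above:

```r
# WAIC-based model comparison (sketch; the three objects denote flexreg fits
# of type "FB", "Beta" and "VIB" on the Stress data, obtained as above).
waic(fit_fb)
waic(fit_breg)
waic(fit_vib)
```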

Aside from the assessment of the best model, it is of interest to evaluate any
inconsistency between observed and predicted values. In a Bayesian perspective, it is
convenient to compute the posterior predictive distribution, namely the distribution of
unobserved values conditional on the observed data. This operation is straightforward
for our flexible regression models thanks to the posterior_predict() function, the
result of which is an object of class flexreg_postpred containing a matrix with
the simulated posterior predictions. The plot method applied to posterior predictives
returns the posterior predictive interval for each statistical unit plus the observed value
of the response variable in red dots. By way of example, Figure 8.3 shows the 95%
posterior predictive intervals for the VIBreg model. It is worth noting that the model
provides accurate predictive intervals since all observed values are comprised within
the intervals. A similar behavior also holds for the Breg and FBreg models.
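
A sketch of the corresponding calls (with fit_vib denoting a fitted VIBreg object obtained analogously to the fits above) is:

```r
# Posterior predictive intervals for each statistical unit (sketch).
pp <- posterior_predict(fit_vib)   # object of class flexreg_postpred
plot(pp)                           # one 95% interval per unit, observed values overlaid
```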


Figure 8.3. 95% posterior predictive intervals for each statistical unit for the
VIBreg model. The observed anxiety levels are represented with orange dots

The last inspection regarding the behavior of the regression models involves the
Bayesian residuals, either raw or standardized:
r_i^{raw} = y_i - \hat{\mu}_i, \qquad
r_i^{std} = \frac{r_i^{raw}}{\sqrt{\widehat{Var}(y_i)}}, \qquad i = 1, \dots, n    [8.5]

where μ̂i and Var(yi ) are the predicted mean and variance of the response, both
assessed using the posterior means of the parameters.

The computation of residuals for flexible models can be done through the function
called residuals(). The argument object features an object of class flexreg,
which contains all the results related to the estimated model of type Breg, VIBreg
or FBreg. By specifying the argument type, it is possible to compute either raw or
standardized residuals. Furthermore, if the model is of FB type, the function allows us
to compute also the cluster residuals that are obtained as the difference between the
observed responses and the cluster means. This is achieved by simply setting cluster
= T:

> residuals(object, type = c("raw", "standardized"), cluster = T)

It is worth noting that the cluster residuals computed for the FBreg model allow us
to provide a classification of data into two clusters, as shown on the right-hand side of
Figure 8.2. This result is consistent with that seen with the regression curves.

8.3.2. Reading data

The second dataset we explore, likewise from the FlexReg package, is called
“Reading” and it collects data referring to a group of 44 children, 19 of whom have
received a diagnosis of dyslexia. Available types of information concern the proportion
of accuracy in reading tasks and the non-verbal intelligent quotient (IQ), besides the
dyslexia status (DYS, being dyslexic (1) or not (0)).

This case study has been extensively analyzed within the literature on regression
models for bounded responses and it is of special interest because of the presence of
values at the upper boundary of the support, corresponding to children (13 out of 44)
that achieved a perfect score in reading tests. One possibility is to handle this dataset
by simply transforming the response variable from (0, 1] to the open interval (0, 1).
An alternative option is to analyze the dataset through an augmentation strategy.

If we assume ignorance about the dyslexic status of children, we can fit an
augmented regression model where the mean and the proportion q1 are both regressed
onto the IQ covariate:
logit(μi ) = β10 + β11 IQ
[8.6]
logit(q1i ) = β30 + β31 IQ
Please note that q0 = 0 since there are no values at the lower limit of the support.
We fit ABreg, AFBreg and AVIBreg regression models according to equation [8.6].
Each model is run 20,000 iterations with the first half as burn-in. For comparison
purposes, we also fit the standard Breg, FBreg and VIBreg models by setting q1 = 0
in equation [8.6] and by simply transforming the response values from (0, 1] to the
open interval (0, 1). Focusing on the models with augmentation, the AFBreg model
shows excellent fit to data. Indeed, the WAIC estimated for the AFBreg model is equal
to 8.1, whereas the WAIC values of the ABreg and AVIBreg models are greater and
equal to 12.4 and 12.6, respectively. An analogous outcome is achieved by comparing
the models without augmentation: the best fit (lowest WAIC) is provided by the FBreg
model (WAIC = -84.1), whereas the Breg (WAIC = -63.6) and VIBreg (WAIC = -63.2)
models show a similar performance. The impact of the augmentation strategy is
particularly evident from the left-hand panel of Figure 8.4. When the values at the
upper boundary of the support are modeled separately from the rest of data, the effect
on the regression curves for the mean of the augmented model is a shift towards the
bottom with respect to the regression curve of the non-augmented model, regardless
of the type of model. In both cases, with and without augmentation, the regression
models of FB type generate flatter curves than those of the competing models.
This behavior is a direct consequence of the special mixture structure of the FB
distribution. By way of example, if we focus on the AFBreg model (see the right-hand
panel of Figure 8.4), it emerges that the model dedicates the first component to the fit
of the observations with the highest scores in the reading accuracy test (without the
perfect scores) which also corresponds to non-dyslexic children, whereas the second
component is dedicated to the remaining part of the data cloud. Therefore, in a sense,
we can conclude that the models of FB type are able to detect the clustering structure
induced by the dyslexic status, which is assumed latent so far.

Figure 8.4. Left-hand side: fitted regression curves for the models with augmentation
(violet lines) and without augmentation (black lines) refer to the Breg (dotted lines),
VIBreg (solid lines) and FBreg (dashed lines) models. Right-hand side: fitted
regression curve for the overall mean (solid line) and for the component means (dashed
lines) of the AFBreg model

As a second step, we consider a complete model by regressing the mean, the
precision, and the excess of 1s onto the quantitative covariate IQ, the dyslexic status
(DYS) and their interaction. The ABreg, AFBreg and AVIBreg regression models
with regression equations as in equation [8.6] have been estimated through the HMC
algorithm:

logit(μi) = β10 + β11 DYS + β12 IQ + β13 IQ × DYS
log(φi) = β20 + β21 DYS + β22 IQ + β23 IQ × DYS          [8.7]
logit(q1i) = β30 + β31 DYS + β32 IQ + β33 IQ × DYS

The three competing models show similar fit to data, in terms of WAIC. Moreover,
by looking at posterior means and credible intervals (CIs) from Table 8.1, it emerges
that the dyslexic status of children plays a significant role in explaining the
probability of achieving a perfect score, as well as the mean and the precision of the
reading accuracy response variable, in all competing models.

                          ABreg                     AVIBreg                   AFBreg
Parameters                Mean    95% CI            Mean    95% CI            Mean    95% CI
Mean
β1,0 (intercept)          0.385   (0.25;0.523)      0.389   (0.253;0.538)     0.405   (0.236;0.585)
β1,1 (DYS)                0.872   (0.347;1.341)     0.854   (0.336;1.314)     0.83    (0.402;1.299)
β1,2 (IQ)                 -0.079  (-0.217;0.07)     -0.078  (-0.209;0.081)    -0.076  (-0.214;0.074)
β1,3 (IQ×DYS)             0.438   (-0.423;1.447)    0.434   (-0.385;1.467)    0.38    (-0.39;1.305)
Precision
β2,0 (intercept)          4.434   (3.261;5.377)     4.819   (3.467;6.599)     4.557   (3.27;5.687)
β2,1 (DYS)                -2.011  (-3.471;-0.519)   -1.955  (-3.49;-0.378)    -1.557  (-3.594;0.901)
β2,2 (IQ)                 0.598   (-0.549;1.651)    0.576   (-0.578;1.694)    0.575   (-0.729;1.768)
β2,3 (IQ×DYS)             -1.246  (-3.061;0.887)    -1.273  (-3.21;0.883)     -1.902  (-4.885;0.885)
Augmentation
β3,0 (intercept)          -4.854  (-13.809;-1.575)  -4.623  (-11.454;-1.564)  -5.021  (-14.709;-1.506)
β3,1 (DYS)                4.453   (0.891;13.559)    4.22    (0.867;10.999)    4.612   (0.9;14.286)
β3,2 (IQ)                 0.715   (-1.691;3.228)    0.719   (-1.683;3.36)     0.721   (-1.78;3.331)
β3,3 (IQ×DYS)             0.166   (-2.378;2.708)    0.166   (-2.528;2.798)    0.153   (-2.555;2.89)
Other parameters
p                         –       –                 0.504   (0.02;0.98)       0.386   (0.009;0.987)
k                         –       –                 0.566   (0.077;0.978)     –       –
w                         –       –                 –       –                 0.296   (0.012;0.81)
WAIC                      -23.8                     -23.6                     -23.2

Table 8.1. Reading data: posterior means and CIs for the parameters of the
AFBreg, AVIBreg and ABreg regression models together with WAIC values

8.4. References

Aitchison, J. (1986). The Statistical Analysis of Compositional Data. Chapman and Hall,
London.
Albert, J. (2009). Bayesian Computation With R, 2nd edition. Springer Science, New York.
Barrientos, A.F., Jara, A., Quintana, F.A. (2017). Fully nonparametric regression for bounded
data using dependent Bernstein polynomials. Journal of the American Statistical Association,
112(518), 806–825.
Bayes, C., Bazan, J.L., de Castro, M. (2017). A quantile parametric mixed regression model for
bounded response variables. Statistics and Its Interface, 10(3), 483–493.
Di Brisco, A.M., Migliorati, S., Ongaro, A. (2020). Robustness against outliers: A new variance
inflated regression model for proportions. Statistical Modelling, 20(3), 274–309.
Duane, S., Kennedy, A., Pendleton, B.J., Roweth, D. (1987). Hybrid Monte Carlo. Physics
Letters B, 195(2), 216–222.
Ferrari, S. and Cribari-Neto, F. (2004). Beta regression for modelling rates and proportions.
Journal of Applied Statistics, 31(7), 799–815.
Frühwirth-Schnatter, S. (2006). Finite Mixture and Markov Switching Models. Springer, Berlin,
Heidelberg.
Gelman, A., Carlin, J.B., Stern, H.S., Rubin, D.B. (2014). Bayesian Data Analysis, Volume 2.
Taylor & Francis, Abingdon.
Lemonte, A.J. and Bazán, J.L. (2016). New class of Johnson SB distributions and its associated
regression model for rates and proportions. Biometrical Journal, 58(4), 727–746.
McCullagh, P. and Nelder, J.A. (1989). Generalized Linear Models, Volume 37. CRC Press,
Boca Raton, FL.
Migliorati, S., Di Brisco, A.M., Ongaro, A. (2018). A new regression model for bounded
responses. Bayesian Analysis, 13(3), 845–872.
Neal, R.M. (1994). An improved acceptance procedure for the hybrid Monte Carlo algorithm.
Journal of Computational Physics, 111(1), 194–203.
Qiu, Z., Song, P.X.-K., Tan, M. (2008). Simplex mixed-effects models for longitudinal
proportional data. Scandinavian Journal of Statistics, 35(4), 577–596.
Smithson, M. and Verkuilen, J. (2006). A better lemon squeezer? Maximum-likelihood
regression with beta-distributed dependent variables. Psychological Methods, 11(1), 54–71.
Stan Development Team (2016). Stan modeling language users guide and reference manual
[Online]. Available at: https://mc-stan.org/users/documentation/.
Vehtari, A., Gelman, A., Gabry, J. (2017). Practical Bayesian model evaluation using
leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5), 1413–1432.
9

Simulation Studies for a Special Mixture
Regression Model with Multivariate
Responses on the Simplex

Compositional data are defined as vectors whose elements are strictly positive
and subject to a unit-sum constraint. When the multivariate response is of
compositional type, a proper regression model that takes account of the unit-sum
constraint is required. This contribution illustrates a new multivariate regression
model for compositional data that is based on a mixture of Dirichlet-distributed
components. Its complex structure is offset by good theoretical properties (among
which identifiability) and a greater flexibility than the standard Dirichlet regression
model. We perform intensive simulation studies to evaluate the fit of the proposed
regression model and its robustness in the presence of multivariate outliers. The
(Bayesian) estimation procedure is performed via the efficient Hamiltonian Monte
Carlo algorithm.

9.1. Introduction

Compositional data, namely proportions of some whole, are encountered in several
fields of science and require proper statistical tools of analysis (Aitchison 2003).
Indeed, compositional data have the peculiarity of being vectors of proportions lying
on the simplex space: S^D = \{Y : Y_j > 0,\ j = 1, \dots, D,\ \sum_{j=1}^{D} Y_j = 1\}. The analysis of

Chapter written by Agnese Maria D I B RISCO, Roberto A SCARI, Sonia M IGLIORATI and
Andrea O NGARO.
For a color version of all the figures in this chapter, see www.iste.co.uk/zafeiris/data1.zip.


compositional data is challenging since standard techniques cannot be used: ignoring
the unit-sum constraint may lead to distorted results. A
fruitful strategy in the analysis of compositional data takes advantage of statistical
distributions defined on the simplex. Among them, the Dirichlet distribution is a
widespread distribution for a D-dimensional vector y ∈ S D . A regression model based
on the Dirichlet distribution is straightforward for compositional data and proves
to behave satisfactorily (Campbell and Mosimann 2009; Hijazi and Jernigan 2009;
Maier 2014). On the other hand, it has some limitations, among which its inability
to model multimodality, heavy tails and the possible presence of outliers. A
convenient approach to induce multimodality and an overall increase in flexibility is to
consider a mixture distribution. In this regard, we propose to resort to a special finite
mixture of Dirichlet components referred to as the Extended Flexible Dirichlet (EFD)
(Ongaro). Moreover, we illustrate a regression model based on the EFD distribution
(Di Brisco et al. 2019). The aim of this work is to intensively study the behavior of the
EFD regression model in many simulated scenarios covering some relevant statistical
issues such as the presence of outliers, heavy tails and latent groups. We compare
the EFD regression model with the Dirichlet one in terms of fit and estimates of the
regression parameters.

The rest of this chapter is organized as follows. Section 9.2 introduces the Dirichlet
and the EFD distributions, and it shows convenient parameterizations for regression
purposes. Section 9.3 outlines details on the EFD regression model. Section 9.3.1
provides an overview on the HMC algorithm, a Bayesian approach to inference
especially suited for mixture models. Section 9.4 illustrates several simulation studies
that have been performed to evaluate the behavior and the fit to data of the EFD
regression model in comparison to the Dirichlet one.

9.2. Dirichlet and EFD distributions

A Dirichlet-distributed D-dimensional vector y ∈ S^D has a probability density
function (p.d.f.) as follows:

f_D(y; \alpha) = \frac{\Gamma(\alpha^+)}{\prod_{j=1}^{D} \Gamma(\alpha_j)} \prod_{j=1}^{D} y_j^{\alpha_j - 1},    [9.1]

where α = (α_1, . . . , α_D)', α_j > 0 and α^+ = \sum_{j=1}^{D} α_j. With the aim of regressing
a compositional vector onto a set of covariates, it is convenient to work with an
alternative parameterization based on the mean vector ᾱ = (ᾱ_1, . . . , ᾱ_D)' ∈ S^D,
where ᾱ_j = E[Y_j] = α_j / α^+ for j = 1, . . . , D, and α^+ = \sum_{j=1}^{D} α_j > 0 represents the
precision parameter of the Dirichlet distribution.
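
Under this parameterization, Dirichlet draws are easily generated through independent gammas; the following R sketch (with illustrative parameter values) does so and checks the unit-sum constraint:

```r
# Dirichlet draws under the mean-precision parameterization: alpha = mean * precision,
# generated via independent gammas and row-wise normalization. Values are illustrative.
rdirichlet_mp <- function(n, mean, precision) {
  alpha <- mean * precision
  g <- matrix(rgamma(n * length(alpha), shape = alpha), nrow = n, byrow = TRUE)
  g / rowSums(g)
}
set.seed(1)
y <- rdirichlet_mp(1000, mean = c(0.2, 0.3, 0.5), precision = 50)
head(rowSums(y))   # each row lies on the simplex
colMeans(y)        # close to the specified mean vector
```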

An EFD-distributed D-dimensional vector y ∈ S^D has the following p.d.f.:

f_{EFD}(y; \alpha, \tau, p) = \prod_{r=1}^{D} \frac{y_r^{\alpha_r - 1}}{\Gamma(\alpha_r)} \sum_{h=1}^{D} p_h \frac{\Gamma(\alpha_h)\, \Gamma(\alpha^+ + \tau_h)}{\Gamma(\alpha_h + \tau_h)}\, y_h^{\tau_h},    [9.2]

where p ∈ S^D and vectors α = (α_1, . . . , α_D)' and τ = (τ_1, . . . , τ_D)' have positive
elements. The EFD distribution function admits the following mixture representation:

EFD(y; \alpha, \tau, p) = \sum_{r=1}^{D} p_r\, Dir(y; \alpha + \tau_r e_r),    [9.3]

where Dir(·; ·) denotes the Dirichlet distribution, and er is a vector of zeros except
for the r-th element which is equal to one. It is worth noting that the EFD distribution
contains the Dirichlet as an inner point when τr = 1 and pr = ᾱr for every r = 1, . . . , D.
The p.d.f. of the EFD admits a variety of shapes including, but not limited to, uni- and
multi-modal ones. Moreover, the richer parameterization of the EFD with respect to
the Dirichlet allows for a more flexible modelization of the dependence structure of
the composition. Finally, the EFD distribution shows several theoretical properties,
i.e. some simplicial forms of dependence/independence and identifiability (Ongaro),
that make it tractable from computational and inferential points of view.
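
The mixture representation in [9.3] also provides a direct simulation scheme: pick a component r with probability p_r and draw from Dir(α + τ_r e_r). A minimal R sketch (illustrative parameter values, gamma-based Dirichlet draws) is:

```r
# Simulating from the EFD via its mixture representation [9.3]; each Dirichlet
# draw is obtained through independent gammas. Parameter values are illustrative.
refd <- function(n, alpha, tau, p) {
  D <- length(alpha)
  t(sapply(seq_len(n), function(i) {
    r <- sample.int(D, 1, prob = p)    # pick a mixture component
    a <- alpha; a[r] <- a[r] + tau[r]  # shift the r-th Dirichlet parameter
    g <- rgamma(D, shape = a)
    g / sum(g)
  }))
}
set.seed(1)
y <- refd(1000, alpha = c(2, 3, 5), tau = c(4, 1, 6), p = c(1/3, 1/3, 1/3))
colMeans(y)   # componentwise means of the simulated compositions
```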

To define a regression model based on the EFD, it is convenient to adopt an
alternative parameterization that explicitly includes the mean vector. To this end,
note that the r-th Dirichlet component in equation [9.3] has a mean vector
λ_r = (1 − w_r) ᾱ + w_r e_r (where ᾱ = α / α^+ and w_r = τ_r / (α^+ + τ_r)), which can be
interpreted as a weighted average of the common barycenter ᾱ and the r-th simplex vertex e_r.
The first-order moment of the EFD easily follows from its mixture structure:

\mu_j = E[Y_j] = \sum_{r=1}^{D} p_r \lambda_{rj} = \bar{\alpha}_j \sum_{r} p_r (1 - w_r) + p_j w_j.    [9.4]

However, the parameterization of the EFD based on μ_j, p_j and w_j (j = 1, . . . , D) is
not variation independent, since some constraints hold between the parameters; the
following inequalities for w_j, j = 1, . . . , D, can be derived:

(i)\ w_j < \frac{\mu_j}{p_j}, \qquad (ii)\ w_j > \frac{\mu_j}{p_j} - \frac{1 - \sum_{r} p_r w_r}{p_j}.

Variation independence, whose lack is a potential issue for Bayesian inference
through Monte Carlo (MC) methods, can be achieved by normalizing w_j as follows:

\tilde{w}_j = \frac{w_j}{\min\{\mu_j / p_j,\ 1\}}, \qquad j = 1, \dots, D.    [9.5]

The parameterization of the EFD distribution depending on μ ∈ S^D, p ∈ S^D,
w̃_j ∈ (0, 1) for every j, and α^+ > 0 has the double benefit of being variation
independent – useful for Bayesian inference – and explicitly including the mean
vector – useful for regression purposes.

9.3. Dirichlet and EFD regression models

Since both parameterizations of the Dirichlet and of the EFD illustrated in section
9.2 explicitly include the mean vector μ, it is possible to derive a regression model
for compositional data. Let Y = (Y1 , . . . , Yn ) be the response matrix such that Yi ,
for i = 1, . . . , n, is a D-dimensional vector on the simplex, and let X = (x1 , . . . , xn )
be the design matrix such that xi are (K + 1)-dimensional vectors. The mean vector
ν i of Yi can be regressed onto a set of covariates in accordance with a GLM strategy
(McCullagh and Nelder 1989). Indeed, since ν i lies on the simplex, a multinomial
logit link function can be adopted as follows:

g(\nu_{ij}) = \log\left(\frac{\nu_{ij}}{\nu_{iD}}\right) = \mathbf{x}_i^{\top} \boldsymbol{\beta}_j,    [9.6]
where νi j = E [Yi j ], xi = (1, xi1 , . . . , xiK ) is the vector of covariates, and β j = (β j0 ,β j1 ,
. . . , β jK ) is a vector of regression coefficients. Please note that the Dth category is
conventionally fixed as baseline, so that β_{Dk} = 0 for k = 0, 1, . . . , K, and thus:

\nu_{ij} = g^{-1}(\mathbf{x}_i^{\top} \boldsymbol{\beta}_j) =
  \begin{cases}
    \dfrac{\exp(\mathbf{x}_i^{\top} \boldsymbol{\beta}_j)}{1 + \sum_{r=1}^{D-1} \exp(\mathbf{x}_i^{\top} \boldsymbol{\beta}_r)}, & \text{for } j = 1, \dots, D - 1 \\[2ex]
    \dfrac{1}{1 + \sum_{r=1}^{D-1} \exp(\mathbf{x}_i^{\top} \boldsymbol{\beta}_r)}, & \text{for } j = D.
  \end{cases}    [9.7]
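
The inverse link in [9.7] is the usual softmax with the D-th category as baseline; a short R sketch (with illustrative coefficient values mirroring those later used in the simulation studies of section 9.4) is:

```r
# Inverse multinomial logit link [9.7], with category D as baseline.
# B holds the coefficient vectors beta_1 and beta_2 (one row each); the values
# mirror those used in section 9.4 and serve only as an illustration.
inv_mlogit <- function(x, B) {
  eta <- as.vector(B %*% x)   # linear predictors for categories 1, ..., D-1
  num <- c(exp(eta), 1)       # baseline category D contributes exp(0) = 1
  num / sum(num)              # mean vector nu_i on the simplex
}
B  <- rbind(c(1, 2), c(0.5, -3))   # (beta_10, beta_11) and (beta_20, beta_21)
nu <- inv_mlogit(c(1, 0.25), B)    # x_i = (1, x_i1) with x_i1 = 0.25
c(nu, total = sum(nu))             # components sum to one
```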

If Yi are Dirichlet distributed, we recover the Dirichlet regression (DirReg) model
(Hijazi and Jernigan 2009; Maier 2014) by substituting ν_ij with ᾱ_ij in equation [9.6].
Similarly, if Yi are EFD distributed, we get the EFD regression (EFDReg) model
(Di Brisco et al. 2019) by replacing νi j with μi j in equation [9.6].

9.3.1. Inference and fit

To obtain estimates of the unknown parameters of EFDReg and DirReg models,
we favor a Bayesian approach. This choice is mainly motivated by the difficulty, both
computational and analytical, of likelihood-based inferential approaches in dealing
with complex models such as mixtures. Conversely, the finite mixture structure of
the EFD distribution can be advantageously treated as an incomplete data problem
in a Bayesian paradigm (Frühwirth-Schnatter 2006). Among the MC methods, a
recent solution is the Hamiltonian Monte Carlo (HMC) (Neal 1994) algorithm, a
generalization of the Metropolis algorithm which combines Markov Chain Monte
Carlo (MCMC) and deterministic simulation methods. The (simulated) posterior
distributions of the unknown parameters are simulated on the basis of the full
likelihood and prior distributions. With regard to choice of priors, we adopt non- or
weakly informative priors to induce the minimum impact on the posteriors (Albert
1987), and suppose prior independence. We select a multivariate normal with zero
mean vector and diagonal covariance matrix with “large” values of the variances
as non-informative prior for the regression parameters β j . Moreover, we adopt a
Uniform(0, 1) prior for w̃_j, j = 1, . . . , D, and a Dirichlet prior with hyperparameter
1 for the vector p. Finally, we use a Gamma(g, g) prior, with g “small” and equal to
0.001, for the precision parameter α + . Among the variety of fitting criteria, we favor
the Watanabe-Akaike information criterion (WAIC) (Vehtari et al. 2017) because it is
fully Bayesian and well-defined for non-regular models such as mixture ones as well.
As a general rule, the smaller the criterion is, the better the model fit.

9.4. Simulation studies

To compare the performances of the EFDReg and DirReg models, we simulated
a variety of scenarios that cover many potentially tricky problems among which
multimodality, as well as the presence of heavy tails, outliers and latent groups. In
the following, we illustrate the samples’ simulation schemes and the main inferential
results for each scenario. We took advantage of the HMC algorithm for estimating
the vector of unknown parameters η = (β1, . . . , βD−1, α⁺, p, w̃). The algorithm is easily implemented in the Stan modeling language (Stan Development Team 2016).
We ran chains of length 10,000 and we discarded the first half. Moreover, we checked
the convergence to the target distribution through graphical tools, such as trace-plots,
density-plots and autocorrelation-plots, as well as diagnostic measures such as the
potential scale reduction factor, the effective sample size and the Raftery–Lewis test
(Gelman et al. 2013). Each scenario is replicated 500 times to evaluate MC estimates.
Computational times are approximately 60 seconds for DirReg and 300 seconds for EFDReg.

Fitting study: First, we evaluated some fitting studies by simulating from Dirichlet (scenario (i)) and EFD (scenario (ii)) regression models. The objective of
these studies is to analyze the goodness of fit and estimates of regression coefficients.
The sample size is n = 150, and the multivariate response lies on the three-part
simplex. In both scenarios, a quantitative covariate x, uniformly distributed in
(−0.5, 0.5), is included in the regression model for the mean (see equation [9.6]), with
regression coefficients set equal to β10 = 1, β11 = 2, β20 = 0.5, β21 = −3. In scenario
(i), the response is Dirichlet distributed and the precision parameter is α + = 50. In
scenario (ii), the response is EFD distributed, and the remaining parameters are fixed
equal to α + = 50, p = (1/3, 1/3, 1/3) and w̃ = (0.6, 0.2, 0.7). Ternary plots and
scatterplots of each element of the composition yi j , j = 1, 2, 3, versus the quantitative
covariate x are shown in Figures 9.1 (scenario (i)) and 9.2 (scenario (ii)). It is worth noting the absence of any clustering in scenario (i), whereas the presence of clusters and multiple modes is evident in scenario (ii).
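The following Python sketch, added for illustration, generates one replication of scenario (i) under the stated settings. It assumes the mean–precision parameterization of the Dirichlet (shape parameters α⁺νi); the random seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_scenario_i(n=150, alpha_plus=50.0,
                        beta=np.array([[1.0, 2.0], [0.5, -3.0]])):
    """One replication of fitting study (i): three-part Dirichlet responses whose
    mean follows the multinomial-logit regression of equation [9.6]."""
    x = rng.uniform(-0.5, 0.5, size=n)
    X = np.column_stack([np.ones(n), x])          # design matrix rows (1, x_i)
    expeta = np.exp(X @ beta.T)                   # n x 2, the 3rd part is the baseline
    denom = 1.0 + expeta.sum(axis=1, keepdims=True)
    nu = np.column_stack([expeta, np.ones(n)]) / denom       # n x 3 mean vectors
    Y = np.vstack([rng.dirichlet(alpha_plus * nu_i) for nu_i in nu])
    return x, Y

x, Y = simulate_scenario_i()
print(Y[:2], Y.sum(axis=1)[:2])                   # each row lies on the simplex
```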

Presence of outliers: To evaluate the behavior of the DirReg and EFDReg models in the presence of outliers, we perturbed scenario (i) according to the following perturbation scheme. We randomly selected 15 observations (10% of the sample size) and applied the perturbation operation defined as y ⊕ δ = C{y1·δ1, . . . , yD·δD} ∈ S^D, where y and δ are the vectors on the simplex playing the roles of perturbed and perturbing element, respectively. Moreover, the closure operation C{·} is defined as C{q} = {q1/q⁺, . . . , qD/q⁺} with q⁺ = ∑_{j=1}^{D} qj and qj > 0, ∀ j = 1, . . . , D. The neutral element of the perturbation operation is δ = (1/D, . . . , 1/D), so that if element yj is perturbed by δj greater (lower) than 1/D, the perturbation is upward (downward). We set three scenarios of perturbation by fixing the perturbing element δ equal to (0.86, 0.07, 0.07) in scenario (I), (0.07, 0.86, 0.07) in scenario (II) and (0.07, 0.07, 0.86) in scenario (III).
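A minimal Python sketch of the perturbation and closure operations is given below for illustration; the unperturbed composition used in the example is hypothetical.

```python
import numpy as np

def closure(q):
    """Closure operation C{q}: rescale a positive vector onto the simplex."""
    q = np.asarray(q, dtype=float)
    return q / q.sum()

def perturb(y, delta):
    """Perturbation operation y (+) delta = C{y_1*delta_1, ..., y_D*delta_D}."""
    return closure(np.asarray(y, dtype=float) * np.asarray(delta, dtype=float))

# Scenario (III): perturbing element shifting mass towards the third part
delta = np.array([0.07, 0.07, 0.86])
y = np.array([0.5, 0.3, 0.2])      # a hypothetical unperturbed composition
print(perturb(y, delta))           # the third component is pushed upward
```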

Figure 9.1. Representations of one replication from scenario (i) (ternary plot of (y1, y2, y3) and scatterplots of y1, y2, y3 versus x)

Figures 9.3, 9.4 and 9.5 show the effect of perturbation on the Dirichlet-distributed
responses. In all plots, the perturbed points are in light-blue while unperturbed points
are in black. Looking at the scatterplots, we can observe that scenario (I) assumes
some outlying observations upward for the first element and downward for the second
and third elements of the composition; this is coherent with the chosen vector δ
that has the first element greater than 0.5 and the second and third elements lower
than 0.5. Instead, in scenarios (II) and (III), the second and third elements of the
composition respectively are perturbed upward while the remaining elements are
perturbed downward. Focusing on the ternary plots, it is worth noting that the effect
of perturbation in scenario (III) is clearly visible in that the group of perturbed values,
in blue, is well-separated from the remaining points, in black. The overall effect of
perturbing vector δ = (0.07, 0.07, 0.86) is thus to shift the cloud of points towards
the bottom-right vertex of the plot. Conversely, in scenarios (I) and (II), the perturbed
points are overall shifted towards the bottom-left and top vertex of the ternary plot,
respectively, i.e. in a region with a higher presence of unperturbed points.
Figure 9.2. Representations of one replication from scenario (ii) (ternary plot of (y1, y2, y3) and scatterplots of y1, y2, y3 versus x)

Presence of latent groups: The following simulation study explores the case
of the presence of a latent (unobserved) covariate that induces the occurrence of
clusters. Therefore, data are simulated by including an additional covariate in the
regression model that is assumed unknown, not accounted for by the estimates of the
Dirichlet and EFDReg models. In particular, we replicated the generating mechanism
of fitting study (i) by adding a latent dichotomous covariate (scenario (a)) and a latent
covariate with three categories (scenario (b)). In scenario (a), the additional regression
coefficients are β12 = −1 and β22 = 2, and in scenario (b), they also include β13 = 0.5
and β23 = −3. With respect to the dichotomous covariate of scenario (a), the categories
have probabilities of 0.3 and 0.7. In scenario (b), the three categories of the latent
covariate have probabilities of 0.3, 0.15, and 0.55. Figures 9.6 and 9.7 show one
random replication from scenarios (a) and (b) with latent groups, respectively. In the
ternary plots, points are colored and shaped according to their belonging to the latent
groups. The existence of two and three clusters respectively is particularly visible from
the scatterplots referred to the first and second elements of the composition.
Figure 9.3. Scenario (I). Perturbed points are in light-blue and unperturbed points are in black
Figure 9.4. Scenario (II). Perturbed points are in light-blue and unperturbed points are in black

Generic mixture of Dirichlet: Finally, we evaluate the case of a generic mixture of two Dirichlet distributions: π Dir(yi; ᾱi, α1⁺) + (1 − π) Dir(yi; ᾱi, α2⁺). Please note
that the mixture structure is not of EFD type. Both Dirichlet distributions have the
same regression model equal to that of scenario (i), but they differ in their precision
parameters that are equal to α1+ = 2 and α2+ = 50, respectively. The mixing proportion
parameter π is equal to 0.3.

The generic mixture of two Dirichlet distributions has been chosen to induce heavier tails than those of a single Dirichlet. The ternary plot in the top left panel of
Figure 9.8 shows one random replication from the generic mixture where the green
points belong to the first component of the mixture and the orange triangles belong to
the second component. We can observe that the majority of points (belonging to the
second component of the mixture) are placed on the ternary plot and on the scatterplots
similarly to scenario (i). At the same time, the group of data coming from the Dirichlet
with the smaller precision parameter is far from the remaining points. Focusing on the
scatterplots referred to the first and second elements of the composition (top right and
bottom left panels of Figure 9.8), it is worth noting that the responses belonging to the
first component of the mixture, that is, the one with the smaller precision parameter,
depart from the data cloud both upward and downward.

Figure 9.5. Scenario (III). Perturbed points are in light-blue and unperturbed points are in black

9.4.1. Comments

Table 9.1 shows the WAIC values in all simulation studies. In fitting study (i),
where the data generating mechanism is Dirichlet, the WAIC of both models is
comparable, while in all remaining scenarios the EFDReg model is far better than
the DirReg one. The superiority in fit of the EFDReg model is particularly noticeable
in fitting study (ii), in all scenarios with outliers, and in the presence of a latent
group induced by a dichotomous covariate (scenario (a)). Scenario (b) (i.e. three latent
groups) and the scenario from a generic mixture of two Dirichlet distributions are
particularly challenging and result in a difficulty in fit for both models. Nevertheless,
the EFDReg model is capable of providing a better adaptation to data (lower WAIC)
than the DirReg.

Figure 9.6. Representations of one replication in the presence of latent groups, scenario (a)

Let us now analyze and comment on the posterior means and MSEs for the
Dirichlet and EFD regression models in all scenarios. All results can be found in
Tables 9.2, 9.3 and 9.4. Moreover, we deepen the analysis of the two models by
inspecting the regression curves that are superimposed on the scatterplots in Figures
9.1–9.8. In all figures, black solid lines refer to the EFD model and black dashed lines
refer to the Dirichlet one. In some scenarios, only the solid line appears, meaning that
the regression curves of both models are almost coincident. Colored lines refer to the component means λ1 (orange), λ2 (blue) and λ3 (green) of the EFDReg model.
Figure 9.7. Representations of one replication in the presence of latent groups, scenario (b)

Results for the fitting study with Dirichlet-distributed data (scenario (i)) can be
found in the second and third columns of Table 9.2. It is worth noting that both models
provide precise estimates for the regression parameters and similar MSEs. This is
confirmed by almost identical regression curves for the Dirichlet (black dashed line)
and EFD (black solid line) models (see scatterplots in Figure 9.1). The DirReg model
also provides a precise estimate for the precision parameter α + , while the EFDReg
model slightly overestimates it. Looking at the additional parameters of the EFDReg
model, we can observe that the adaptation to Dirichlet-distributed data is achieved
thanks to equally weighted (estimated p j equal to approximately 0.3 for j = 1, 2, 3)
and close component means (small estimated distances w̃j between components). Graphically, the regression curves for the component means of the EFDReg model (colored solid lines) are close together and at similar distances.
Figure 9.8. Representations of one replication of a generic mixture of two Dirichlet distributions

           Fitting studies        Presence of outliers              Latent groups         Generic mixture
Scenario   (i)        (ii)        (I)        (II)       (III)       (a)        (b)
Dir        -948.029   -512.614    -633.131   -623.990   -565.643    -430.142   -611.100   -557.735
EFD        -946.140   -883.605    -849.742   -814.106   -833.923    -770.366   -764.415   -593.995

Table 9.1. WAIC values in the simulation studies

In fitting study (ii), the EFDReg model adapts well to data and provides precise
estimates with low MSEs and SEs for all the parameters (see Table 9.4). On the
contrary, the DirReg model, in trying to adapt to data, estimates a considerably lower
precision than the true one, and it also fails to correctly estimate some of the regression
parameters. From scatterplots in Figure 9.2 it emerges that the regression curves for
the EFDReg model adapt very well to data (both for the overall mean and for the
component means), while they are systematically flatter for the DirReg model.
Scenario Fitting study (i) Latent groups (a) Latent groups (b)
Model Dir EFD Dir EFD Dir EFD
β10 = 1 1.001 (0.001) 1.001 (0.001) 0.514 (0.231) 0.850 (0.024) 0.756 (0.060) 1.104 (0.012)
β11 = 2 1.998 (0.018) 1.992 (0.018) 1.567 (0.203) 1.665 (0.131) 1.328 (0.461) 1.299 (0.510)
β20 = 0.5 0.501 (0.001) 0.502 (0.001) 0.883 (0.148) 1.275 (0.605) -0.600 (1.213) -0.044 (0.230)
β21 = −3 -3.006 (0.021) -2.998 (0.021) -1.963 (1.086) -2.096 (0.835) -1.568 (2.096) -1.634 (1.945)
α + = 50 50.052 (4.338) 53.636 (5.473) 5.894 (0.258) 21.241 (2.023) 3.033 (0.121) 5.389 (0.600)
p1 — 0.290 (0.111) — 0.640 (0.030) — 0.582 (0.071)
p2 — 0.315 (0.119) — 0.353 (0.030) — 0.409 (0.071)
p3 — 0.395 (0.116) — 0.007 (0.002) — 0.009 (0.001)
w̃1 — 0.149 (0.031) — 0.608 (0.032) — 0.584 (0.037)
w̃2 — 0.146 (0.041) — 0.707 (0.021) — 0.780 (0.019)
w̃3 — 0.151 (0.032) — 0.461 (0.056) — 0.421 (0.012)

Table 9.2. Posterior means for the Dirichlet and EFD regression models in fitting
study (i) and in scenarios (a) and (b) with latent groups. MSEs for the regression
coefficients and SEs for remaining parameters are in parenthesis

The estimates of the unknown parameters in the three scenarios with outliers
are shown in Table 9.3. Moreover, the regression curves of the Dirichlet and EFD
models are plotted on the scatterplots in Figures 9.3, 9.4 and 9.5 referred to scenarios
(I), (II), and (III), respectively. The estimates of the regression parameters of the
Dirichlet and EFD models are affected by the presence of outliers. The element
of flexibility used by the DirReg model in order to adapt to data that depart from
the Dirichlet distribution is given by the precision parameter, that is systematically
underestimated in all scenarios with outliers. Conversely, the EFDReg model can take
advantage of its special mixture structure to better adapt to data. It is worth noting
that in all scenarios with outliers, one component of the mixture is dedicated to the
group of perturbed values as indicated by the corresponding p j estimate which is
around 0.1. The remaining two components equally describe the remaining majority
of unperturbed data with estimates of p j ’s between 0.3 and 0.5. The analysis of the
regression curves allows us to better understand the different behavior of the DirReg
and EFDReg models. The regression curves of the DirReg model are slightly shifted
with respect to the regression curves of the DirReg in the scenario without perturbation
(dotted lines in Figures 9.3–9.5) in the direction of the perturbed values. Instead,
looking at the component means of the EFD, we note that the first, second and third components of the mixture are entirely dedicated to modeling the subgroup of outliers in scenarios (I), (II) and (III), respectively.
Scenario Outliers (I) Outliers (II) Outliers (III)


Model Dir EFD Dir EFD Dir EFD
β10 = 1 1.134 (0.019) 1.183 (0.036) 0.914 (0.008) 0.992 (0.001) 0.699 (0.092) 0.682 (0.103)
β11 = 2 1.840 (0.065) 1.748 (0.081) 1.871 (0.030) 1.924 (0.022) 1.830 (0.075) 1.962 (0.014)
β20 = 0.5 0.472 (0.002) 0.502 (0.002) 0.683 (0.035) 0.898 (0.180) 0.262 (0.058) 0.192 (0.097)
β21 = −3 -2.693 (0.112) -2.895 (0.035) -2.728 (0.123) -2.349 (0.462) -2.658 (0.165) -2.929 (0.020)
α + = 50 14.930 (1.177) 37.754 (7.304) 15.012 (1.205) 35.590 (4.526) 13.328 (0.940) 41.561 (4.424)
p1 — 0.121 (0.170) — 0.553 (0.282) — 0.479 (0.228)
p2 — 0.372 (0.300) — 0.147 (0.086) — 0.434 (0.229)
p3 — 0.507 (0.309) — 0.300 (0.281) — 0.087 (0.008)
w̃1 — 0.744 (0.054) — 0.281 (0.084) — 0.155 (0.050)
w̃2 — 0.265 (0.090) — 0.706 (0.057) — 0.153 (0.052)
w̃3 — 0.272 (0.101) — 0.265 (0.101) — 0.613 (0.030)

Table 9.3. Posterior means for the Dirichlet and EFD regression models in
scenarios (I), (II) and (III) with outliers. MSEs for the regression coefficients
and SEs for remaining parameters are in parenthesis

Results concerning the presence of some latent groups in data are shown in the
last four columns of Table 9.2. The estimates of regression parameters are biased for
both models. Once again, the DirReg model tries to adapt to the data by estimating a very low value for the precision parameter; nevertheless, this results in a very poor fit.
The regression curves of the DirReg model, reported in Figures 9.6 and 9.7, severely
miss the data cloud, particularly in scenario (a). The EFDReg model has a satisfactory
behavior in scenario (a) where the latent covariate has two categories with probabilities
of 0.3 and 0.7. These latent clusters are captured by the EFDReg model with an estimate
equal to 0.64 and 0.353 for the mixing proportions p1 and p2 of the first and second
component, and an estimate close to zero for p3 . This is clearly reflected by the
regression curves of the component means of the EFD model plotted in Figures 9.6 and
9.7. It is worth noting that the orange and blue lines λ 1 and λ 2 perfectly fit the two data
clouds. On the contrary, the green line λ 3 has a very poor fit, but this does not affect
the overall fit of the model since the third component of the mixture has a probability
of occurrence around zero. Scenario (b) is more challenging for the EFDReg model.
Please recall that this scenario assumes the existence of a latent covariate having three
categories with probabilities of 0.3, 0.15 and 0.55. Nevertheless, the EFD model is
able to capture only two out of the three latent clusters, as witnessed by the estimate
of the third mixing proportion p3 which is close to zero. A look at the regression
curves of the component means of the EFD model (Figure 9.7) better explains this
behavior. The first scatterplot, referred to the first element of the response, shows a
good fit of the orange curve λ 1 . The remaining two curves λ 2 and λ 3 are unable to
describe the two visible clusters of data since they are placed in the middle. In the
second scatterplot, referred to the second element of the response, the blue curve adapts
well to one cluster, the green and orange ones are almost overlapping and fit a second
cluster well while a third cluster of data is missed by all curves. In the third scatterplot,
referred to the third element of the response, the blue and orange curves cross the data
cloud, but the green one misses it completely. Overall, the mixture structure of the EFDReg model is too rigid to adapt well to this scenario, whilst it remains a far better model than the Dirichlet one.
Scenario Fitting study (ii) Scenario Generic mixture
Model Dir EFD Model Dir EFD
β10 = 1 1.087 (0.015) 1.014 (0.010) β10 = 1 1.006 (0.010) 0.947 (0.010)
β11 = 2 1.990 (0.069) 1.999 (0.012) β11 = 2 2.063 (0.149) 1.967 (0.083)
β20 = 0.5 0.752 (0.068) 0.511 (0.010) β20 = 0.5 0.501 (0.014) 0.457 (0.014)
β21 = −3 -2.409 (0.395) -3.009 (0.014) β21 = −3 -2.967 (0.189) -2.866 (0.159)
α + = 50 6.444 (0.306) 50.153 (4.253) α+ 5.619 (0.892) 6.877 (1.684)
p1 = 1/3 — 0.335 (0.024) p1 — 0.149 (0.239)
p2 = 1/3 — 0.335 (0.034) p2 — 0.189 (0.293)
p3 = 1/3 — 0.331 (0.035) p3 — 0.662 (0.353)
w̃1 = 0.6 — 0.601 (0.016) w̃1 — 0.563 (0.230)
w̃2 = 0.2 — 0.199 (0.032) w̃2 — 0.553 (0.229)
w̃3 = 0.7 — 0.694 (0.029) w̃3 — 0.732 (0.153)

Table 9.4. Posterior means for the Dirichlet and EFD regression models in fitting
study (ii) and in case of a generic mixture of Dirichlet. MSEs for the regression
coefficients and SEs for remaining parameters are in parenthesis

The last two columns of Table 9.4 show the estimates in case observations come
from a generic mixture of two Dirichlet distributions. It is worth recalling that this
scenario assumes that the second mixture component follows the same Dirichlet
distribution as in scenario (i), and the first component differs from the second one
because of the presence of a lower precision parameter. Both the DirReg and EFDReg
models provide reasonably unbiased estimates of the regression parameters, despite
the MSEs being greater than the ones in scenario (i). To confirm this, the regression
curves (dashed and solid lines in Figure 9.8) adapt well to the majority of observations
and are almost overlapping. The presence of a group of data, around 30%, coming
from the Dirichlet distribution with a lower precision parameter forces the DirReg
model to provide a low estimate of the precision parameter in trying to adapt to data.
The EFDReg performs better than the DirReg model since it is capable of recognizing
the presence of some clusters in data. In particular, it dedicates the third component to
describing the majority of data, indeed the estimate of p3 is approximately equal to 0.7.
Instead, the first and second components are dedicated to data coming from the first component of the generic mixture, and they show similar estimates of all parameters
(p j and w̃ j ). In this regard, the green curve is near the solid one, particularly in the
scatterplots referred to the first and second elements of the composition. Differently,
the blue and orange curves fit the values from the first component of the generic
mixture, and they are placed either upward or downward with respect to the majority
of points in the scatterplots.

9.5. References

Aitchison, J. (2003). The Statistical Analysis of Compositional Data. The Blackburn Press,
London.
Albert, J. (1987). Bayesian computation with R. ASA Proceedings of Section on Statistical
Graphics.
Campbell, G. and Mosimann, J.E. (2009). Multivariate analysis of size and shape: Modelling
with the Dirichlet distribution. ASA Proceedings of Section on Statistical Graphics, 93–101.
Di Brisco, A.M., Ascari, R., Migliorati, S., Ongaro, A. (2019). A new regression model for
bounded multivariate responses. Smart Statistics for Smart Applications – Book of Short
Papers SIS, 817–822.
Frühwirth-Schnatter, S. (2006). Finite Mixture and Markov Switching Models. Springer Science
+ Business Media, New York.
Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B. (2013). Bayesian
Data Analysis, 3rd edition. CRC Press, London.
Hijazi, R.H. and Jernigan, R.W. (2009). Modelling compositional data using Dirichlet
regression models. Journal of Applied Probability and Statistics, 4, 77–91.
Maier, M.J. (2014). Dirichletreg: Dirichlet regression for compositional data in R. Paper,
Research Report Series, University of Economics and Business, Vienna.
McCullagh, P. and Nelder, J. (1989). Generalized Linear Models. Chapman & Hall, London.
Neal, R.M. (1994). An improved acceptance procedure for the hybrid Monte Carlo algorithm.
Journal of Computational Physics, 111(1), 194–203.
Ongaro, A., Migliorati, S., Ascari, R. (2020). A new mixture model on the
simplex. Statistics and Computing [Online]. Available at: https://doi.org/10.1007/s11222-
019-09920-x.
Stan Development Team (2016). Stan modeling language users guide and reference manual
[Online]. Available at: http://mc-stan.org/.
Vehtari, A., Gelman, A., Gabry, J. (2017). Practical Bayesian model evaluation using
leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5), 1413–1432.
PART 2

10

Numerical Studies of Implied Volatility Expansions Under the Gatheral Model

The Gatheral model is a three-factor model with mean-reverting stochastic volatility that reverts to a stochastic long-run mean. This chapter reviews previous
analytical results on the first- and second-order implied volatility expansions under
this model. Using the Monte Carlo simulation as the benchmark method, numerical
studies are conducted to investigate the accuracy and properties of these analytical
expansions. Moreover, a partial calibration procedure is proposed using these
expansions. This calibration procedure is implemented on real market data of daily
implied volatility surfaces for an underlying market index and an underlying equity
stock for periods both before and during the Covid-19 crisis.

10.1. Introduction

The classical Black–Scholes option pricing model assumes that the underlying
asset follows a geometric Brownian motion with constant volatility, but there are a
significant number of model extensions that relax this assumption of constant volatility.
One of the recent and popular extensions is the Gatheral model, given in Gatheral
(2008), where a double-mean-reverting market model is considered. The same model
is later considered in Bayer et al. (2013).

Our object of interest is the asymptotic expansions of implied volatility under the
Gatheral model presented in earlier research in Albuhayri et al. (2021). Albuhayri et al. (2021) obtained such asymptotic expansions by applying the Taylor formula of the implied volatility given in Pagliarani et al. (2017) to the Gatheral model. Applying this general Taylor formula to a specific three-factor model is a non-trivial task.

Chapter written by Marko DIMITROV, Mohammed ALBUHAYRI, Ying NI and Anatoliy MALYARENKO.
For a color version of all the figures in this chapter, see www.iste.co.uk/zafeiris/data1.zip.

In Albuhayri et al. (2021), only analytical formulas were obtained. Therefore, there
is a need for a thorough numerical study on the performances of these asymptotic
expansions as approximation formulas for the implied volatility. Moreover, it is of
practical interest to investigate how these approximation formulas can be used for
calibrating the model to real data. This chapter addresses these two issues for the first-
and second-order implied volatility expansions. The contribution of this chapter is to:
1) clarify for which range of option parameters the first- and second-order
expansions give reasonable approximations;
2) propose a convenient and straightforward partial calibration procedure and
apply it to synthetic and real market data.

Since there is no exact analytical formula on the implied volatility under the
Gatheral model, we use the Monte Carlo simulation to generate benchmark (reference)
values of implied volatilities for the numerical study on the performances of the
asymptotic expansions.

Calibration means finding model parameters such that the model is consistent
with the market data. In terms of option pricing models, we often minimize an error
function on the differences between the market and model implied volatilities. In
contrast to the time-consuming Monte Carlo simulation method, an analytical formula
of the implied volatility model is beneficial for calibration purposes. For a general
overview of model calibration to option data, we refer to Hilpisch (2019).

For the calibration task, we take advantage of the simple polynomial form of the
implied volatility approximation formulas associated with the first- and second-order
asymptotic expansion in order to propose a simple partial calibration procedure. By
saying partial calibration, we mean that only a part of the original model parameters
or a grouped form of them can be calibrated. However, such partial calibration is
still useful as an intermediate step towards the final local optimization problem on
the full calibration, which is a standard procedure. Moreover, our partial calibration
reveals easily applicable values, like the present volatility level. It should be mentioned
that Fouque et al. (2011) have proposed a calibration procedure under a different
three-factor model using a polynomial form of the implied volatility. The asymptotic
expansion in their work was obtained under the assumption of a fast mean-reverting
volatility component together with a slow mean-reverting volatility component.

This chapter proceeds as follows. The first- and second-order asymptotic expansions of implied volatility under the Gatheral model are reviewed in section 10.2. The numerical studies on the performances of the approximations associated with these expansions are presented in section 10.3. In section 10.4, a partial calibration procedure is proposed and calibration to synthetic and real data is conducted. Finally, section 10.5 gives the conclusion and future work.

10.2. Asymptotic expansions of implied volatility

In practice, market prices of options cannot be explained by the Black–Scholes model. One possible solution to this problem is as follows. Following Dupire (1997),
we postulate that under a martingale measure the underlying asset price S(t), t ∈
[0, T ] satisfies the following equation:
dS(t) = η(t, S(t))S(t) dW ∗ (t), S(0) = s,
where s is a deterministic positive number and the coefficient η(t, S(t)) is called the
local volatility function.

A more recent and general extension is given in Pagliarani et al. (2017). Assume that, under a martingale probability measure, the market model is described by an Rᵈ-valued stochastic process (S(t), Y2(t), . . . , Yd(t)) that satisfies the following
system of stochastic differential equations:
dS(t) = η1 (t, S(t), Y(t))S(t) dW1∗ (t), S(0) = s,
dYi (t) = μi (t, S(t), Y(t)) dt + ηi (t, S(t), Y(t)) dWi∗ (t), Y(0) = y,
where 2 ≤ i ≤ d and Y(t) is a vector with components Yi (t), y ∈ Rd−1 is a
deterministic vector, and the time t correlation matrix of the Rd -valued stochastic
process with components Wi∗ (t) has entries:
ρij (t, S(t), Y(t)) ∈ [−1, 1].
In the rest of this chapter, we refer to this model as a local stochastic volatility
model.

The model under consideration is the Gatheral model from this family of local
stochastic volatility models. The Gatheral model is a double-mean-reverting market
model proposed by Gatheral (2008). In a subsequent publication, Bayer et al. (2013),
the model is given as follows:

$$\begin{aligned}
dS(t) &= \sqrt{v(t)}\,S(t)\,dW_1^*(t),\\
dv(t) &= \kappa_1\big(v'(t) - v(t)\big)\,dt + \xi_1 v^{\alpha_1}(t)\,dW_2^*(t), \qquad [10.1]\\
dv'(t) &= \kappa_2\big(\theta - v'(t)\big)\,dt + \xi_2 v'^{\alpha_2}(t)\,dW_3^*(t),
\end{aligned}$$

and the time t correlation matrix of the R³-valued stochastic process with components Wi*(t) has entries ρij(t, S(t), Y(t)) = ρij ∈ [−1, 1].
The reason why we choose this model was described by Bayer et al. (2013) as
follows:

Thus variance mean-reverts to a level that itself moves slowly over time
with the state of the economy.

By setting α1 = α2 = 0.5 in the Gatheral model above, we have what is known as the Double Heston model. Similarly, setting α1 = α2 = 1 leads to the Double
Lognormal model.

European call options are traded on the market; however, the stock’s volatility,
σ, is not directly observable. A possible solution to this problem follows. It is well
known that the Black–Scholes price with zero interest rate satisfies the following
boundary value problem for the Black–Scholes partial differential equation:

$$\frac{\partial C(S,t)}{\partial t} + \frac{\sigma^2}{2}\,S^2\,\frac{\partial^2 C(S,t)}{\partial S^2} = 0, \qquad \lim_{t\uparrow T} C(S,t) = \max\{0,\, S - K\}, \qquad [10.2]$$

with (S, t) ∈ (0, ∞) × (0, T ). Berestycki et al. (2002) describe a possible solution as
follows:

[. . .] it is common practice to start from the observed prices and invert the closed-form solution to (2) in order to find that constant σ – called
implied volatility – for which the solution to (2) agrees with the market
price at today’s value of the stock.

Note that their equation (2) is our equation [10.2].
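As an added illustration of this inversion (not code from the chapter), the sketch below prices a call with the Black–Scholes formula at zero interest rate and recovers the implied volatility by root-finding; the bracketing interval and the example strike and maturity are assumptions.

```python
from math import exp, log, sqrt

from scipy.optimize import brentq
from scipy.stats import norm

def bs_call(S, K, T, sigma, r=0.0):
    """Black-Scholes price of a European call with constant volatility."""
    d1 = (log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm.cdf(d1) - K * exp(-r * T) * norm.cdf(d2)

def implied_vol(price, S, K, T, r=0.0, lo=1e-6, hi=5.0):
    """Invert the closed-form solution for the constant sigma that reproduces
    the observed option price (root-finding on a bracketing interval)."""
    return brentq(lambda sig: bs_call(S, K, T, sig, r) - price, lo, hi)

# Round trip: price an option at sigma = 0.25 and recover the implied volatility
p = bs_call(100.0, 105.0, 0.5, 0.25)
print(implied_vol(p, 100.0, 105.0, 0.5))   # ~ 0.25
```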

As addressed in the introduction, an analytical formula of the implied volatility under the Gatheral model is useful in particular for model calibration purposes.
Such a formula can also be used for European option valuation by plugging in the
model implied volatility into the famous Black–Scholes option pricing formula.

In Albuhayri et al. (2021), we proved the following results under the model
(equation [10.1]).

THEOREM 10.1.– The asymptotic expansion of order 1 of the implied volatility has the form:

$$\sigma(t, x_0, \nu_0; T, k) = \sqrt{\nu_0} + \frac{1}{4}\rho_{12}\xi_1\nu_0^{\alpha_1 - 1}(k - x_0) + o\!\left(\sqrt{T - t} + |k - x_0|\right).$$
THEOREM 10.2.– The asymptotic expansion of order 2 of the implied volatility has the form:

$$\begin{aligned}
\sigma(t, x_0, \nu_0, \nu_0'; T, k) ={}& \sqrt{\nu_0} + \frac{1}{8}\rho_{12}\xi_1\nu_0^{\alpha_1 - 1}(k - x_0)\\
&+ \frac{1}{128}\Big(32\kappa_1\nu_0^{-1/2}(\nu_0' - \nu_0) + 8\rho_{12}\xi_1\nu_0^{\alpha_1} + 3\rho_{12}^2\xi_1^2\nu_0^{2\alpha_1 - 3/2}\Big)(T - t)\\
&- \frac{3}{64}\rho_{12}^2\xi_1^2\nu_0^{2\alpha_1 - 2}(k - x_0)^2 + o\big(T - t + (k - x_0)^2\big).
\end{aligned}$$

Here, x0 = ln S0, k is the logarithmic strike price of a European call option, and the quantity k − x0 will be called the log-moneyness in the rest of this chapter. Discarding
the o(·) terms, we can use these expansions as approximation formulas for the model
implied volatility.
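The following Python sketch, added for illustration, transcribes the two expansions (with the o(·) terms dropped) into approximation formulas. The example values anticipate Table 10.1 in the next section; ξ1 is not listed there, so the value used is an illustrative one consistent with the product ρ12ξ1 reported later in Table 10.2.

```python
import numpy as np

def sigma_order1(k, x0, nu0, rho12, xi1, alpha1):
    """First-order implied volatility expansion (o(.) terms dropped)."""
    return np.sqrt(nu0) + 0.25 * rho12 * xi1 * nu0 ** (alpha1 - 1.0) * (k - x0)

def sigma_order2(k, x0, tau, nu0, nu0_prime, kappa1, rho12, xi1, alpha1):
    """Second-order implied volatility expansion (o(.) terms dropped); tau = T - t."""
    lm = k - x0                                       # log-moneyness
    bracket = (32.0 * kappa1 * nu0 ** -0.5 * (nu0_prime - nu0)
               + 8.0 * rho12 * xi1 * nu0 ** alpha1
               + 3.0 * rho12 ** 2 * xi1 ** 2 * nu0 ** (2.0 * alpha1 - 1.5))
    return (np.sqrt(nu0)
            + 0.125 * rho12 * xi1 * nu0 ** (alpha1 - 1.0) * lm
            + bracket / 128.0 * tau
            - 3.0 / 64.0 * rho12 ** 2 * xi1 ** 2 * nu0 ** (2.0 * alpha1 - 2.0) * lm ** 2)

# Example: Double Lognormal case (alpha1 = 1); xi1 = 7.125 is illustrative,
# chosen so that rho12 * xi1 = -2.85.
k, x0 = np.log(105.0), np.log(100.0)
print(sigma_order1(k, x0, nu0=0.05, rho12=-0.4, xi1=7.125, alpha1=1.0))
print(sigma_order2(k, x0, tau=0.25, nu0=0.05, nu0_prime=0.04,
                   kappa1=5.5, rho12=-0.4, xi1=7.125, alpha1=1.0))
```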

10.3. Performance of the asymptotic expansions

In this section, we conduct numerical studies on the performance of these asymptotic expansions of the implied volatility. We focus on two special cases of the
Gatheral model given in equation [10.1]: the Double Heston model and the Double
Lognormal model. As there is no exact formula for the implied volatility under the
Gatheral model, we employ the Monte Carlo simulation to generate benchmark values
for the implied volatilities. All errors are calculated by treating the benchmark values
from the Monte Carlo simulation as the exact values.

The parameters used for the simulation are given in Table 10.1. Here, the initial
asset price is denoted by S0 , the number of steps by M , the number of paths by I, and
the rest are the Gatheral model parameters.
Parameter   Value        Parameter   Value   Parameter   Value
r           0            κ1          5.5     ρ12         −0.4
S0          100          κ2          0.1     ρ13         0
M           150          v0          0.05    ρ23         0
I           10,000,000   v0′         0.04    θ           0.078

Table 10.1. Fixed parameters used for the numerical study

The number of steps, M , is larger for longer maturities. The parameter choices
come from Bayer et al. (2013). Here, ρ13 = ρ23 = 0 is the most realistic situation.
Indeed, the correlation between the underlying asset and the long-run mean and the
correlation between the volatility and its long-run mean should be close to zero.
Correlation ρ12 is set to be negative as the underlying price and the volatility are
usually negatively correlated.

For the benchmark Monte Carlo simulation, the Euler–Maruyama discretization scheme in combination with the moment matching method for the set of generated
standard normal pseudo-random numbers, volatility and underlying asset’s price series
was used (see, for example, Hilpisch (2019) for the moment-matching method). In
addition, the value of a forward contract on the same underlying, and with the same
strike and maturity was used as a control variate. This is because the exact value of
the forward contract can be easily computed.
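A condensed Python sketch of such a benchmark simulation is given below for illustration; it is not the authors' code, and the values of ξ1 and ξ2 as well as the regression weighting of the control variate are assumptions.

```python
import numpy as np

def mc_call_price_gatheral(S0, K, T, v0, v0p, kappa1, kappa2, theta,
                           xi1, xi2, alpha1, alpha2, rho12,
                           n_steps=150, n_paths=100_000, seed=0):
    """Euler-Maruyama Monte Carlo price of a European call under the Gatheral
    model (zero rate, rho13 = rho23 = 0), with moment matching of the normals
    and the forward payoff S(T) - K as a control variate."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    S = np.full(n_paths, S0)
    v = np.full(n_paths, v0)
    vp = np.full(n_paths, v0p)
    for _ in range(n_steps):
        z = rng.standard_normal((3, n_paths))
        z = (z - z.mean(axis=1, keepdims=True)) / z.std(axis=1, keepdims=True)
        dw1 = np.sqrt(dt) * z[0]
        dw2 = np.sqrt(dt) * (rho12 * z[0] + np.sqrt(1 - rho12 ** 2) * z[1])
        dw3 = np.sqrt(dt) * z[2]
        vpos, vppos = np.maximum(v, 0.0), np.maximum(vp, 0.0)   # keep variances non-negative
        S = S + np.sqrt(vpos) * S * dw1
        v = v + kappa1 * (vp - v) * dt + xi1 * vpos ** alpha1 * dw2
        vp = vp + kappa2 * (theta - vp) * dt + xi2 * vppos ** alpha2 * dw3
    payoff = np.maximum(S - K, 0.0)
    forward = S - K                          # control variate with known mean S0 - K
    b = np.cov(payoff, forward)[0, 1] / np.var(forward, ddof=1)
    return np.mean(payoff - b * (forward - (S0 - K)))

# Double Lognormal case with Table 10.1 parameters; xi1 and xi2 are assumed values
price = mc_call_price_gatheral(100.0, 100.0, 0.5, 0.05, 0.04, 5.5, 0.1, 0.078,
                               xi1=2.0, xi2=0.5, alpha1=1.0, alpha2=1.0,
                               rho12=-0.4, n_paths=50_000)
print(price)
```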

By setting α1 = α2 = 0.5 in the Gatheral model, the Double Heston model is obtained. Similarly, α1 = α2 = 1 leads to the Double Lognormal model.

We consider 130 options with 10 maturities (30, 60, 91, 122, 152, 182, 273, 365,
547 and 730 calendar days) and with log-moneyness between −0.2 and 0.2 and report
the proportion of options that can be approximated within a relative error of 5% using
the second-order asymptotic expansion below.

For the Double Heston model, this proportion is 45% of all options. However, the
accuracy becomes much higher for options with log-moneyness between −0.1 and
0.07, and maturities from 30 days to 1 year.

Figures 10.1 and 10.2 show examples of the asymptotic expansions of orders 1
and 2 of the implied volatility, and the benchmark values for two different times to
maturities, 30 days and 1 year, respectively. The number of time steps was M = 300.
From this example, it may be seen that the asymptotic expansion of order 2 gives better
approximations, as expected. In addition, note that Figure 10.2 represents the worst
case for maturities ranging from 30 days to 1 year. The second-order approximations
are more accurate for maturities shorter than 1 year.

Similarly, for the Double Lognormal model, with the second-order expansion, the
corresponding proportion of options that can be approximated within a relative error
of 5% is around 55%. For options with log-moneyness between −0.07 and 0.096, and
maturities from 30 days to 1 year options, the accuracy again becomes higher.

For the first-order expansion, the approximation is decent with relative error less
than 5% only for options with a maturity as short as 30 days, and log-moneyness
between −0.1 and 0.07.

Similar experiments have been done for other values of α1, α2, for example, α1 = α2 = 0.94; the results are alike.
Figure 10.1. An example of the asymptotic expansions of orders 1 and 2 of the implied volatility, and the benchmark value for a call option with a maturity of 30 days

Figure 10.2. An example of the asymptotic expansions of orders 1 and 2 of the implied volatility, and the benchmark value for a call option with a maturity of 1 year

10.4. Calibration using the asymptotic expansions

The numerical study in the previous section gives a base for the calibration. We
know now for which range of log-moneyness and maturities the first- and second-order
expansions can be used as calibration formulas.
10.4.1. A partial calibration procedure

We propose a simple partial calibration procedure for some model parameters or some grouped expressions of parameters. By saying partial calibration, we mean
that only a part of the original model parameters or a grouped form of them will
be calibrated. If a full calibration is desired, we can use the results from the partial
calibration as inputs for the final local optimization over all model parameters. For
this step, we can use the second-order implied volatility and market options with a
suitable range of maturities and strikes as given in the previous section. As this step of
full calibration is a standard procedure, we will not discuss it here.

To explain the partial calibration procedure, recall the form of the asymptotic
expansion of order 1 of implied volatility:
$$\sigma_1(t, x_0, \nu_0; T, k) = \sqrt{\nu_0} + \frac{1}{4}\rho_{12}\xi_1\nu_0^{\alpha_1 - 1}(k - x_0) + o\!\left(\sqrt{T - t} + |k - x_0|\right), \qquad [10.3]$$

and the form of the asymptotic expansion of order 2 of the implied volatility:
$$\begin{aligned}
\sigma_2(t, x_0, \nu_0, \nu_0'; T, k) ={}& \sqrt{\nu_0} + \frac{1}{8}\rho_{12}\xi_1\nu_0^{\alpha_1 - 1}(k - x_0)\\
&+ \frac{1}{128}\Big(32\kappa_1\nu_0^{-1/2}(\nu_0' - \nu_0) + 8\rho_{12}\xi_1\nu_0^{\alpha_1} + 3\rho_{12}^2\xi_1^2\nu_0^{2\alpha_1 - 3/2}\Big)(T - t)\\
&- \frac{3}{64}\rho_{12}^2\xi_1^2\nu_0^{2\alpha_1 - 2}(k - x_0)^2 + o\big(T - t + (k - x_0)^2\big). \qquad [10.4]
\end{aligned}$$

Note from these forms that, by setting in equation [10.3]

$$\beta_0 = \sqrt{\nu_0}, \qquad \beta_1 = \frac{1}{4}\rho_{12}\xi_1\nu_0^{\alpha_1 - 1},$$

and in equation [10.4]

$$\gamma_0 = \sqrt{\nu_0} + \frac{1}{8}\rho_{12}\xi_1\nu_0^{\alpha_1 - 1}(k - x_0) - \frac{3}{64}\rho_{12}^2\xi_1^2\nu_0^{2\alpha_1 - 2}(k - x_0)^2, \qquad [10.5]$$

$$\gamma_1 = \frac{1}{128}\Big(32\kappa_1\nu_0^{-1/2}(\nu_0' - \nu_0) + 8\rho_{12}\xi_1\nu_0^{\alpha_1} + 3\rho_{12}^2\xi_1^2\nu_0^{2\alpha_1 - 3/2}\Big), \qquad [10.6]$$
the calibration formulas become:

$$\sigma_1(t, x_0, \nu_0; T, k) \approx \beta_0 + \beta_1(k - x_0), \qquad [10.7]$$
$$\sigma_2(t, x_0, \nu_0, \nu_0'; T, k) \approx \gamma_0 + \gamma_1(T - t). \qquad [10.8]$$
The calibration goes as follows. First, consider the asymptotic expansion of order
1 of the implied volatility in the form given by equation [10.7]. Looking at the
log-moneyness k − x0 as the independent variable, and σ1 as the dependent variable,
the form can be seen as a simple linear regression model. The numerical study suggests
that the optimal choice of log-moneyness to fit the model would be between −0.1 and
0.07, using 30 days options. By doing so, the estimate of the intercept β0 leads to

ν0 , and the estimate of the slope β1 leads to ρ12 ξ1 ν0α1 −1 /4. In combination, those
two quantities finally yield values for ν0 and the product ρ12 ξ1 .

Next, from equation [10.8], looking at T − t as the independent variable and σ2 as the dependent variable, a simple linear regression can be fitted. However, to get
the required estimates, a previous discussion should be used in combination with
equations [10.5] and [10.6]. That is, as the asymptotic expansion of order 1 of the
implied volatility leads to ν0 and ρ12 ξ1 , this can be used in equations [10.5] and [10.6]
to eliminate those parameters. Next, using at-the-money options, the only part of equation [10.6] of interest is the product κ1(ν0′ − ν0), which may be obtained from the slope.

To get the calibrated values using equation [10.4], at-the-money 30 days to 1 year options are used in combination with the calibrated values ν0, ρ12 ξ1, obtained from
equation [10.3] as explained.

Therefore, this procedure gives calibrated values of ν0, ρ12 ξ1 and κ1(ν0′ − ν0). It is worth reiterating that these values may be used as an intermediate result for a local optimization procedure on a full calibration such that the difference between the model implied volatilities and the market implied volatilities is minimized.
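The following Python sketch, added for illustration, implements this partial calibration with two ordinary least-squares fits. The input arrays are assumed to hold market (or synthetic) implied volatilities; the quick check at the end uses synthetic, exactly linear data built from the true values reported later in Table 10.2.

```python
import numpy as np

def partial_calibration(lm_30d, iv_30d, tau_atm, iv_atm, alpha1=1.0):
    """Partial calibration via two linear regressions (illustrative sketch).

    lm_30d, iv_30d : log-moneyness and implied vols of 30-day options
                     (log-moneyness roughly between -0.1 and 0.07).
    tau_atm, iv_atm: times to maturity (in years, 30 days to 1 year) and
                     implied vols of at-the-money options.
    """
    # Step 1: sigma_1 ~ beta0 + beta1 * (k - x0)  ->  nu0 and rho12 * xi1
    beta1, beta0 = np.polyfit(lm_30d, iv_30d, 1)      # slope, intercept
    nu0 = beta0 ** 2                                  # beta0 = sqrt(nu0)
    rho12_xi1 = 4.0 * beta1 * nu0 ** (1.0 - alpha1)   # beta1 = rho12*xi1*nu0^(alpha1-1)/4
    # Step 2: sigma_2 ~ gamma0 + gamma1 * (T - t) for ATM options -> kappa1*(nu0' - nu0)
    gamma1, _ = np.polyfit(tau_atm, iv_atm, 1)
    kappa1_dnu = (128.0 * gamma1
                  - 8.0 * rho12_xi1 * nu0 ** alpha1
                  - 3.0 * rho12_xi1 ** 2 * nu0 ** (2.0 * alpha1 - 1.5)) * np.sqrt(nu0) / 32.0
    return nu0, rho12_xi1, kappa1_dnu

# Quick check on exactly linear synthetic implied volatilities
# (true values nu0 = 0.05, rho12*xi1 = -2.85, kappa1*(nu0' - nu0) = -0.055, alpha1 = 1)
lm = np.linspace(-0.1, 0.07, 15)
iv30 = np.sqrt(0.05) + 0.25 * (-2.85) * lm
taus = np.linspace(30, 365, 10) / 365.0
g1 = (32 * 5.5 * 0.05 ** -0.5 * (0.04 - 0.05) + 8 * (-2.85) * 0.05
      + 3 * (-2.85) ** 2 * 0.05 ** 0.5) / 128
iv_atm = np.sqrt(0.05) + g1 * taus
print(partial_calibration(lm, iv30, taus, iv_atm))    # ~ (0.05, -2.85, -0.055)
```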

10.4.2. Calibration to synthetic and market data

To compare the calibrated and true parameter values, a good way to start is by
generating an implied volatility surface using the Monte Carlo simulation with a known set
of parameters and applying the calibration procedure. To do so, the Gatheral model
with parameters given in Table 10.1 and a fixed value of α1 = α2 = 0.94 is used.
The synthetic data is generated for options with log-moneyness from −0.2 to 0.2,
and maturities from 30 days to 2 years, but, as discussed previously in the calibration
procedure, only the options with suitable log-moneyness and maturities are used.

As expected, the calibration gives close-to-true values. Table 10.2 shows that the
difference between true and calibrated values is fairly small.
Parameter/initial value    True value    Calibrated
ν0                         0.050         0.049
ρ12 ξ1                     −2.8500       −2.6424
κ1(ν0′ − ν0)               −0.055        −0.031

Table 10.2. Result of calibration using synthetic data

Before moving to the calibration of the real market data, a short description of
the dataset follows. It consists of daily implied volatility surfaces calculated and
interpolated from the traded call options on ABB stock and on Eurostock 50 Index,
in Nasdaqomx Nordic Exchange and Eurex, respectively. The dataset is processed
from the data provided by the company OptionMetrics LLC. The period is from
November 2019 to November 2020. That is, it starts before the Covid-19 pandemic,
which could be interesting. There are 10 time-to-maturities, 30, 60, 91, 122, 152,
182, 273, 365, 547 and 730 calendar days. There are also 13 implied exercise prices
obtained from the well-known Greek Deltas (0.20 + 0.05n, n = 0, 1, 2, . . . , 12). For
this study, only options with maturities from 30 days to 1 year with a suitable range of
log-moneyness are of interest. While using the first-order expansion, we take 30-day
options with a range of log-moneyness from −0.1 to 0.07. We use at-the-money
(or closely at-the-money) options with maturities from 30 days to 1 year for the
second-order expansion.

For simplicity, we calibrate the special case of the Double Lognormal model, i.e. we
set α1 = α2 = 1.

Applying the calibration procedure to the real market data, Figures 10.3 and 10.4
show calibrated daily values of the volatility process of the ABB stock and Eurostock
50 Index, respectively. In both figures, it is clear that the pandemic had a great impact on the volatility in the middle of March. This period is when Covid-19 started
spreading in Europe, and high volatility was expected.

Figures 10.5 and 10.6 show calibrated daily values of the product of the correlation
ρ12 and ξ1 of the ABB stock and Eurostock 50 Index, respectively. Because ξ1 is
assumed to be positive, and the realistic situation suggests that ρ12 should be negative,
the product should be negative too. This can be seen in Figure 10.5. Besides this, the
product should be constant if the Double Lognormal model is a representation of the
market. However, in Figure 10.6, it can be seen that there are some extremely positive
values. As the situation is severe, due to Covid-19, a calibration should be done more
often during this period to avoid getting unrealistic values. Again, in both cases, the
impact of the pandemic is obvious.
Figure 10.3. Calibrated daily values for the daily volatility v0 using ABB stock data

Figure 10.4. Calibrated daily values for the daily volatility v0 using Eurostock 50 Index data

To calibrate daily values for the product of the reversion rate κ1 and the difference ν0′ − ν0, κ1(ν0′ − ν0), as mentioned, a linear regression of implied volatilities against a range of
time-to-maturities for at-the-money options was used. Figures 10.7 and 10.8 show
values obtained from the calibration for the ABB stock and Eurostock 50 Index,
respectively. It seems like the pandemic had more impact on the ABB stock in
this case. During the middle of March, the effect of the Covid-19 pandemic was
undoubtedly the largest.

Figure 10.5. Calibrated daily values for the product of ρ12 and ξ1, ρ12 ξ1, using ABB stock data

Figure 10.6. Calibrated daily values for the product of ρ12 and ξ1, ρ12 ξ1, using Eurostock 50 Index data
Figure 10.7. Calibrated daily values for the product κ1(v0′ − v0) using ABB stock data

Figure 10.8. Calibrated daily values for the product κ1(v0′ − v0) using Eurostock 50 Index data

10.5. Conclusion and future work

We have investigated the performance of the first- and second-order implied volatility expansions under the Gatheral model and concluded that the first-order expansion gives a plausible approximation when the maturity is as short as 30 days. In contrast, the second-order expansion yields good approximations for maturity up to 1 year. Both expansions work well only for a range of values of log-moneyness.

A simple partial calibration procedure was presented, taking advantage of the simple polynomial form of these expansions as functions of maturities and log-moneyness. In implementing our calibration procedure, note that the effect of the Covid-19 pandemic on the model calibration is high. It can especially be noted during the middle of March, when the effect is the largest. Particular attention should be paid to the calibration when the market is undergoing a similar crisis.

In future work, the performance of the third-order asymptotic expansion for implied volatility should be studied and compared to the second-order expansion. A full calibration to the market data can also be done.

10.6. References

Albuhayri, M., Malyarenko, A., Silvestrov, S., Ni, Y., Engström, C., Tewolde, F., Zhang, J.
(2021). Asymptotics of implied volatility in the Gatheral double stochastic volatility model.
In Applied Modeling Techniques and Data Analysis 2, Dimotikalis, Y., Karagrigoriou, A.,
Parpoula, C., Skiadas, C.H. (eds). ISTE Ltd, London, and John Wiley & Sons, New York.
Bayer, C., Gatheral, J., Karlsmark, M. (2013). Fast Ninomiya–Victoir calibration of the
double-mean-reverting model. Quantitative Finance, 13(11), 1813–1829.
Berestycki, H., Busca, J., Florent, I. (2002). Asymptotics and calibration of local volatility
models. Quantitative Finance, 2(1), 61–69.
Dupire, B. (1997). Pricing and hedging with smiles. In Mathematics of Derivative Securities,
Dempster, M.A.H. and Pliska, S.R. (eds). Cambridge University Press, Cambridge.
Fouque, J.P., Papanicolaou, G., Sircar, R., Sølna, K. (2011). Multiscale Stochastic Volatility for
Equity, Interest Rate, and Credit Derivatives. Cambridge University Press, Cambridge.
Gatheral, J. (2008). Consistent modeling of SPX and VIX options. Paper presented at The Fifth
World Congress of the Bachelier Finance Society, London, 18 July 2008.
Hilpisch, Y. (2019). Derivatives Analytics with Python: Data Analysis, Models, Simulation,
Calibration and Hedging. John Wiley & Sons, New York.
Pagliarani, S. and Pascucci, A. (2017). The exact Taylor formula of the implied volatility.
Finance and Stochastics, 21(3), 661–718.
11

Performance Persistence of Polish Mutual Funds: Mobility Measures

The purpose of this chapter was to evaluate the phenomenon of performance persistence in a developing market. The analysis was conducted for Polish mutual
funds from three time perspectives: monthly, quarterly and yearly. The research
approach applied was a Markovian framework supported with a few mobility
measures. The results reveal the existence of limited performance persistence, which
decreases as the timeframe increases. Moreover, the observed propensity for a
relative repetition of mutual fund performance in consecutive periods seems to
involve losers rather than winners, and hence it takes the form of the “icy hand”
effect.

11.1. Introduction

The notion of the “hot hand” is defined in social psychology as a string of successes believed in by many gamblers during continued gaming. The name of the
phenomenon originates from the conviction shared by basketball players and fans
that the likelihood of a player hitting a shot is greater after a hit than after a miss on
the previous shot (Wilkinson and Klaes 2012). In finance, the term stands for
making decisions on the basis of an increase in stock prices in the previous period
on the assumption that it will be similar in the next period (see Haslem 2003).
Another cognitive social bias, the so-called “gambler’s fallacy”, is in a way an
opposite concept, as it describes the mistaken belief that a certain random event is less likely to happen following an event or a series of events.

Chapter written by Dariusz FILIP.

The hot-hand phenomenon can be extended to the literature on mutual funds, where it is better known as performance persistence, which means financial
intermediaries’ tendency to achieve similar results in consecutive periods. Present
studies distinguish two types of this tendency: winning persistence, which occurs
when managers repeat good investment results, and its opposite – losing persistence,
which means achieving bad results in subsequent periods. With respect to the
subject matter of this chapter, winning persistence can be correlated with the
hot-hand effect, while losing persistence can be correlated with the opposite
phenomenon, namely the so-called “icy hand”.

The issue of the hot-hand or icy-hand effect in the performance of mutual funds
could be highly useful to managers of collective investment companies as well as
individual investors for a number of reasons. The former could apply the findings
associated with these effects from behavioral finance in their information and
marketing activities. The latter, in turn, might find the relations between the results
occurring in consecutive periods important in the context of evaluating returns and
continuing the possible winning or losing streak. Moreover, this issue can be viewed from a third perspective: studies dealing with hot-handed fund managers enable academics to evaluate the intensity of competition in the industry or enhance the knowledge about portfolio management theories.

The main purpose of this chapter is to examine whether the performance persistence phenomenon exists in the Polish mutual fund market. To this end, the
Markovian approach, used in the construction of special chains with transition
matrices, supported with a few mobility measures, can be employed. This
framework is still little known in the area of finance. Nevertheless, this study, along
with the findings of Filip and Rogala (2021), may be considered as an introduction
to the research on the performance of mutual funds in developing countries by
means of stochastic processes and as a basis for further discussions and analyses
with this respect.

The remainder of this chapter is organized as follows: section 11.2 briefly reviews the financial literature concerned with performance persistence in
consecutive periods and groups the most significant studies by research methods and
geographical criteria adopted by authors; section 11.3 concentrates on the empirical
design and dataset; section 11.4 provides empirical findings and the final section
presents the concluding remarks.

11.2. Literature review

This section will provide a brief review of the literature discussing research on
performance persistence. Such research normally consists in comparing the rates of
return achieved in successive periods. In some of the above-mentioned studies, performance analyses adopt an approach related to stochastic modeling, which will
be emphasized in this chapter.

The analysis of the relevant literature discussing the issue of persistence of the
mutual fund allocation effects has allowed identification of three groups of research.
The first one covers basic research, which was the first to show performance
persistence and at the same time constituted the starting point for numerous
subsequent inquiries. The criterion for classifying studies to another group was the
emergence of more recent streams in this area, which comprised more advanced
research approaches. One of these streams comprises publications employing Markov chains. The
last group of research included in the review contains the studies describing the
European, including Polish, experiences as regards the occurrence of performance
persistence of domestic mutual funds.

The empirical studies of the turn of the 1990s (Grinblatt and Titman 1989;
Brown and Goetzmann 1995) were the first to suggest a relative stability of the
returns generated by mutual funds. It is then that, for example, Hendricks et al.
(1993) identified the above-mentioned hot-hand effect, which refers to a short-term
performance persistence. Other studies attempted also to determine whether
performance persistence was connected with managerial characteristics or stock
selection (e.g. Grinblatt and Titman 1992). An additional question concerned whether performance persistence of mutual funds might
possibly be a group phenomenon of adopting a common investment strategy
consisting in allocating assets to the securities that performed well in earlier periods
(e.g. Goetzmann and Ibbotson 1994). This is when a set of research tools ranging
from regression models and analysis of Spearman rank correlation coefficients to the
now classical contingency tables were developed.

One of the reasons for the relative performance persistence over time was
identified as the so-called survivorship bias. For instance, Malkiel (1995), whose
inquiries additionally involved the funds which discontinued their activities, stated
that the evidence for recurring results in a survivorship-bias-free sample deteriorated
with time. At the same time, he was critical about the hypothesis providing that
some managers were able to continuously achieve better results at an acceptable risk
level. Carhart (1997), in turn, noted that funds generating better short-term returns
managed to do so by applying a momentum strategy. On the other hand, returns on
investments diminished after transaction costs were taken into account. Significant
performance persistence was notable, but this was the case with losing funds.

In subsequent years, a few important research streams emerged in the discussed
area. They were characterized by using more advanced research techniques. For
instance, Huij and Verbeek (2007) showed, based on Bayesian methods, that
historical results achieved by equity funds influenced future performance, yet in the
short term only. When extending the timeframe of the analysis, the relations
between the returns generated in successive periods vanished. Interestingly, the
persistence effect of good performance was stronger for younger and smaller funds.
The researchers argued that achieving higher or lower results is related to managers’
luck rather than skills.

Huij and Derwall (2008), in turn, chose to use and confront a broad range of
research methods: from contingency tables, which are traditional for the discussed
stream of the relevant literature, to bootstrap techniques. Their findings showed that
the examined bond funds, which were characterized by good and poor performance
in the past, repeated their rates of return in subsequent periods. Using traditional
research methods, their study demonstrated the existence of a relationship between
managerial skills and performance persistence in virtually all analyzed groups of
funds.

The first studies employing the Markovian approach to the research on
performance persistence were published only recently. One of them is Drakos et al.
(2015), which uses Markov chains, supported with an expanded set of mobility
measures, to analyze performance dependence in consecutive periods. The authors
noted a certain tendency for repeating performance of mutual funds. They called it
inertia, which, however, was characterized by a decrease over time. The applied
mobility indices imply the increasing degree of performance reversal, which means
departure from performance persistence.

Studies from non-US markets refer to the invoked research approaches relying
on Markov chains fairly infrequently. The group of research from European markets
that employ chiefly traditional tools includes, for example, Otten and Bams (2002).
Its authors dealt with the results achieved by equity funds coming from the United
Kingdom, France, Germany, Italy and the Netherlands. Earlier, however, Dahlquist
et al. (2000) based their reasoning on a sample of Swedish funds operating in several
core market segments. Casarin et al. (2008), in turn, examined performance
persistence for Italian funds.

From among more recent studies, which, however, come from developing
markets, the following also deserve attention: Koutsokostas et al. (2020) for Greek
equity funds, performance assessment for Hungarian funds by Bota and Ormos
(2017), studies by Czekaj and Grotowski (2014) and Machnik (2020), among others,
for Polish funds. The findings of the above-mentioned analyses were not consistent,
but long-term persistence was ruled out virtually every time. In the context of an
attempt to capture short-term persistence, the conclusions were more convergent, yet
often dependent on the employed research approach or performance measure.
Generally, the relevant literature, despite the multitude of studies, the abundance of
topics and the diversity of the used data, has still failed to provide straightforward
answers to a number of substantial questions. This means that the discussed
phenomenon deserves a further analysis.

11.3. Dataset and empirical design

The sample used in this study consists of 101 Polish open-end investment funds
covering the period from January 2000 to September 2018. The dataset involving
monthly unit prices of the above registered domestic equity funds was derived from
the reports by Analizy Online, a web service collecting this kind of information in
Poland. Moreover, data on the values of the stock exchange index, which was
important for calculating the used measure of returns, came from the Warsaw Stock
Exchange (GPW) website.

The performance measurement employed in this study uses asset unit values. It
was decided to use continuous return as the base rate, which is one of the most
popular measures of investment effects used in financial analyses. It is based on the
values of funds’ share units and can be calculated logarithmically as follows:

r_i = ln(up_t / up_{t-1}),   [11.1]

where r_i is the continuous return of fund i in period t, and up_t and up_{t-1} are the
unit prices of fund i at the end (t) and at the beginning (t-1) of the analyzed period,
respectively.

In the next step, the median of the rates of return calculated in this manner is
used to identify the winning funds and the losing funds in individual periods.

The benchmark return being the stock exchange index is then deducted from the
funds’ rate of return. The market-adjusted return allows the determination of the rate
of income exceeding the benchmark. The presented measure of returns is expressed
with the following formula (Lee et al. 2008):

r_b = r_i − r_m,   [11.2]

where r_b is the benchmark-adjusted return of fund i in period t and r_m is the return on
the local stock market index in period t. This measure of returns enabled the
classification of funds into groups of outperforming funds and underperforming
funds in particular periods. However, due to the shortage of relevant databases in
Poland, i.e. ones that would contain, for example, information about management
fees, the applied performance calculation completely omits the expense ratio in fund
returns.
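As an illustration of equations [11.1] and [11.2] and of the winner/loser classification described above, a minimal sketch in Python follows; the data layout (a table of monthly unit prices and a series of index values) and the column handling are assumptions made for the example, not taken from the chapter.

```python
import numpy as np
import pandas as pd

def classify_funds(unit_prices: pd.DataFrame, index_values: pd.Series):
    """unit_prices: one column per fund, one row per month (hypothetical layout).
    index_values: stock exchange index values for the same months."""
    # Continuous (log) returns of the funds, equation [11.1]
    fund_returns = np.log(unit_prices / unit_prices.shift(1))
    # Continuous return of the market benchmark
    market_return = np.log(index_values / index_values.shift(1))
    # Benchmark-adjusted returns, equation [11.2]
    adjusted = fund_returns.sub(market_return, axis=0)
    # Relative benchmark: winner if the fund beats the cross-sectional median
    winners_relative = fund_returns.gt(fund_returns.median(axis=1), axis=0)
    # Absolute benchmark: winner if the fund beats the market index
    winners_absolute = adjusted.gt(0)
    return fund_returns, adjusted, winners_relative, winners_absolute
```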

As was mentioned in the first part, the main aim of this chapter is to examine
whether the performance persistence phenomenon occurs in the Polish mutual fund
market. Like Brown and Goetzmann (1995) did in one of their early studies, the
benchmark used here was the median of the rates of return for each period (relative
benchmark) and the value of the stock market index (absolute benchmark). Hence,
the null hypothesis states that the results achieved in consecutive periods are
unrelated to each other. It will be verified using the stochastic procedure, which will
be supported with a few mobility measures.

The main research approach applied was a Markovian framework (see Kemeny
and Snell 1976). The Markov chain used in this study is a special stochastic process
with a countable state space and transitions at integer times. It could be said that
a process X = (X_t)_{t=1}^{∞} is a Markov chain with the state space S if it takes values
in set S and for every n ∈ N, for every s_1, …, s_n, s_{n+1}, and for every
t ∈ {n, n + 1, n + 2, …}, we have that:

P(X_{t+1} = s_{n+1} | X_t = s_n) = P(X_{t+1} = s_{n+1} | X_t = s_n, X_{t-1} = s_{n-1}, …, X_1 = s_1).   [11.3]

Equation [11.3] admits an immediate interpretation of the Markov chain: knowing the
present state, the past of the process does not give us any further information about
its future.

A crucial aspect in dealing with a Markov chain is its transition matrices, i.e.
matrices

P_t = [ P(X_{t+1} = j | X_t = i) ]_{i,j ∈ S} for t = 1, 2, 3, …   [11.4]

If the state space is finite, i.e. if S = {1, 2, 3, …, m} for some m ∈ N, then we
can indicate

        [ p_11^t  …  p_1m^t ]
P_t =   [   ⋮     ⋱     ⋮   ]  .   [11.5]
        [ p_m1^t  …  p_mm^t ]

Each element of transition matrix P_t corresponds to the estimated probability of
transiting from state i to state j across states. Moreover, it will be said that a Markov
chain with transition matrices P_t is homogeneous if P_t does not depend on t. More
precisely, there exists a matrix P such that for every t we have that P = P_t (see Filip
and Rogala 2021).

The probability of events was calculated with the application of the moving ratio,
which allows the determination of the number of times an outcome can occur
compared to all possible outcomes within a chosen time horizon. The above
assumptions enabled the estimation of transition probabilities on the basis of a
six-month horizon for a monthly perspective, a four-quarter horizon for a quarterly
perspective and a three-year horizon for a yearly perspective.
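A minimal sketch of how such transition probabilities can be estimated by counting transitions within a chosen horizon is given below; the two-state coding (1 = winner, 0 = loser) and the function name are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def transition_matrix(states: np.ndarray) -> np.ndarray:
    """states: array of 0/1 labels (0 = loser, 1 = winner) in consecutive periods.
    Returns the 2x2 matrix of empirical transition probabilities."""
    counts = np.zeros((2, 2))
    for current, following in zip(states[:-1], states[1:]):
        counts[current, following] += 1
    # Each row is normalized by the number of transitions starting in that state
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# Example: states observed over a six-month horizon (monthly perspective)
example = np.array([1, 1, 0, 1, 0, 0])
print(transition_matrix(example))
```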

Moreover, it was decided to use several mobility measures to support the
findings obtained based on the Markovian approach. The first of them was the
immobility ratio (IR), which means the percentage of funds with the ability to
maintain their returns (see Bigard et al. 1998). The IR was calculated as the number
of funds characterized by predominance of good performance as well as bad
performance in relation to the whole number of entities in the sample. The next two
were MU and MD, which mean the percentage of funds whose returns improve from
below the median to above the median (under- to outperformance) in general and the
percentage of funds whose returns deteriorate from above the median to below the
median (out- to underperformance) in most cases, respectively (Drakos et al. 2015).
It should be noted that under the assumption of equal transition probabilities, each of
these indices should be 33.3%.
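The sketch below shows one possible way of computing IR, MU and MD from the funds' winner/loser histories; it assumes that each fund is assigned to the category (immobile, up-mover or down-mover) that predominates over its sample path, which is one reading of the definitions above and should be treated as an assumption made for the example.

```python
import numpy as np

def mobility_measures(histories: np.ndarray):
    """histories: matrix of 0/1 winner flags, one row per fund, one column per period.
    Classifies every fund by its predominant behaviour and returns (IR, MU, MD)."""
    n_funds = histories.shape[0]
    stay = up = down = 0
    for fund in histories:
        transitions = list(zip(fund[:-1], fund[1:]))
        n_stay = sum(1 for a, b in transitions if a == b)
        n_up = sum(1 for a, b in transitions if a == 0 and b == 1)
        n_down = sum(1 for a, b in transitions if a == 1 and b == 0)
        # The fund is assigned to the category that occurs most often
        label = max((n_stay, "stay"), (n_up, "up"), (n_down, "down"))[1]
        stay += label == "stay"
        up += label == "up"
        down += label == "down"
    return stay / n_funds, up / n_funds, down / n_funds  # IR, MU, MD
```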

11.4. Empirical results

When deciding to employ the Markovian framework, it had to be assumed that
the probabilities of funds transiting between individual segments of the performance
ranking in successive periods were uniform for funds. According to the assumptions
of the applied approach, this means that all states of the Markov chain are equally
likely to be visited and the process replicates at each consecutive stage
independently.

In the successive transition matrices presented, the values in rows correspond to
the probability of transiting from state i to state j across states (n = 2). The initial
state in the first row means being a well-performing fund, whereas the initial state in
the second row means being a poorly performing fund. In this section, instead of
adopting the matrix approach, it was chosen to use tables with mobility measures
added. The first of them is IR, which means immobility ratio. It will be a way to
show a tendency for persistence, both winning persistence and losing persistence.
The next two (MU and MD) indicate tendencies for performance reversal:
improvement and deterioration, respectively.

11.5. Monthly perspective

It should be restated that the estimations of transition probabilities for a monthly
perspective were made on the basis of observations from a six-month horizon.
Table 11.1 presents the initial results for performance persistence in a short term.

Panel A: Probability of transiting between winning funds and losing funds

Probability     Winners t+1     Losers t+1
Winners t       0.5308          0.4692
Losers t        0.4637          0.5363
IR = 0.0990     MU = 0.4455     MD = 0.4554

Panel B: Probability of transiting between outperforming funds and underperforming funds

Probability     Winners t+1     Losers t+1
Winners t       0.4598          0.5402
Losers t        0.4356          0.5644
IR = 0.0594     MU = 0.2475     MD = 0.6931

Table 11.1. Estimations for a monthly perspective. Source: own study

With reference to the six-month horizon, the generated raw returns, which do not
take market factors into account, seem to imply the existence of a short-term
persistence. As follows from the data provided in Panel A of Table 11.1, the
probability of remaining in the group of winning funds or losing funds is only
slightly higher (above 0.53) than that of transiting to the opposite state, i.e. of a
performance reversal (approx. 0.47). This is not entirely confirmed by the results
obtained from the mobility measures. The degree of mobility measured by means of
MU and MD indicates an increased propensity for transiting between individual
performance ranking segments. Funds characterized by immobility were outnumbered
both by funds with improving positions and by funds with deteriorating positions.

The findings obtained after grouping funds by the market efficiency criterion are
as expected. According to the results from Panel B of Table 11.1, the probability of
failing to beat the market in consecutive periods was the highest of all permissible
states, which is consistent with the efficient market theory. The noted lack of
uniformity suggests that the percentage of funds whose positions deteriorated was
larger than the percentage of funds whose positions improved. Moreover, it was
noted that the funds that remained in their states as losers were observed most
frequently, as far as a short term is concerned (probability at the level of 0.56). This
proves the existence of performance persistence, but only with respect to the
icy-hand effect. Mobility measures confirmed only the tendency for deteriorating
performance (MD at the level of 0.69).

11.6. Quarterly perspective

In relation to a medium-term perspective, it also needs to be restated that the
estimations of transition probabilities were made on the basis of observations from a
four-quarter horizon. Table 11.2 presents results for performance persistence
supported with mobility measures for a quarterly perspective.

Panel A: Probability of transiting between winning funds and losing funds

Probability     Winners t+1     Losers t+1
Winners t       0.5702          0.4298
Losers t        0.4388          0.5612
IR = 0.2857     MU = 0.3673     MD = 0.3469

Panel B: Probability of transiting between outperforming funds and underperforming funds

Probability     Winners t+1     Losers t+1
Winners t       0.4888          0.5112
Losers t        0.3618          0.6382
IR = 0.2653     MU = 0.1429     MD = 0.5918

Table 11.2. Estimations for a quarterly perspective. Source: own study

The results of the research, presented in Panel A of Table 11.2, suggest the
persistence of quarterly raw returns of funds in consecutive periods in accordance
with the selected classification criterion, namely the achievement or non-achievement
of the median value in the performance distribution. The probability of transiting
from the winner (loser) state to the loser (winner) state was lower (approx. 0.43)
than that of remaining in the group of winners or losers (approx. 0.56–0.57). This is
partly confirmed by the value of the immobility ratio (IR), which was determined to
be relatively high, i.e. 0.29, yet not enough to exceed the expected level of 0.33.

The application of the stock market index as an absolute benchmark resulted,
however, in a strong orientation of the recurrence towards the icy-hand
phenomenon. Underperforming funds maintained the level of their performance in
successive periods definitely more frequently (0.64) than funds characterized by
different performance inertia (see Panel B of Table 11.2). Quite a high value of the
probability of performance deterioration (0.51) also needs to be noted, since it was
reflected in high values of the MD index (0.59), which show the percentage of the
funds whose returns deteriorated in most cases from out- to underperformance.

11.7. Yearly perspective

Once again, it must be repeated that the estimations of transition probabilities for
a yearly perspective were made on the basis of observations from a three-year
horizon. Long-term observations were made for classifications on the basis of the
relative (Panel A) as well as the absolute (Panel B) benchmark, and their results are
presented in Table 11.3.

Panel A: Probability of transiting between winning funds and losing funds

Probability     Winners t+1     Losers t+1
Winners t       0.4725          0.5275
Losers t        0.5208          0.4792
IR = 0.1353     MU = 0.4265     MD = 0.4382

Panel B: Probability of transiting between outperforming funds and underperforming funds

Probability     Winners t+1     Losers t+1
Winners t       0.3582          0.6418
Losers t        0.3562          0.6438
IR = 0.2059     MU = 0.1677     MD = 0.6264

Table 11.3. Estimations for a yearly perspective. Source: own study

The last of the specified time horizons concerns a yearly perspective. As follows
from the data provided in Panel A of Table 11.3, the performance persistence
observed earlier decreases as the timeframe increases. For the sake of comparison,
the probability of transiting from winning (losing) funds to losing (winning) funds
was approximately 0.52–0.53. Mobility measures (MU and MD) also indicate high
values, ones considerably exceeding the natural level of 0.33, let alone the low
values of the immobility ratio (0.13). This means that transition probabilities are not
uniform.

When the absolute benchmark (Panel B of Table 11.3), i.e. the stock market
index in this case, was introduced as a classification criterion, the findings again
turned out to be consistent with the efficient market hypothesis. The empirical
transition probabilities were the highest when funds’ performance deteriorated in
consecutive periods (0.64) or poor performance persisted (0.64). The obtained results are
supported by high values of the MD ratio (0.62), which signifies a high share of
funds with deteriorating investment results.

11.8. Conclusion

The aim of this chapter was to examine whether the performance persistence
phenomenon occurred in a developing mutual fund market. The analysis was
conducted for Polish equity funds from three time perspectives: monthly, quarterly
and yearly. The empirical investigation was possible through the employment of
Markov chains with transition matrices, supported with a few mobility measures.

The results suggest the existence of a limited performance persistence, especially
when raw returns were applied. With reference to market-adjusted returns, used
alternatively, only losing persistence was observed. It should be noted that the
phenomenon decreases as the timeframe increases. However, the propensity for a
relative repetition of mutual fund performance in consecutive periods, which takes
the form of the icy-hand effect, as well as the higher transition probability for funds
with deteriorating returns are consistent with the efficient market theory.

The applied research framework, which is still unknown in the area of finance,
has proved useful in the verification of the performance persistence hypothesis.
However, this study, along with the findings of Filip and Rogala (2021), may be
considered as an introduction to the research on the performance of mutual funds in
developing countries by means of stochastic processes and as a basis for further
discussions and analyses in this respect.

11.9. References

Bigard, A., Guillotin, Y., Lucifora, C. (1998). Earnings mobility: An international comparison
of Italy and France. Review of Income and Wealth, 44(4), 535–554.
Bota, G. and Ormos, M. (2017). Determinants of the performance of investment funds
managed in Hungary. Ekonomska Istraživanja / Economic Research, 30(1), 1–14.
Brown, S.J. and Goetzmann, W.N. (1995). Performance persistence. The Journal of Finance,
50(2), 679–698.
Carhart, M. (1997). On persistence in mutual fund performance. The Journal of Finance,
52(1), 57–82.
Casarin, R., Pelizzon, L., Piva, A. (2008). Italian equity funds: Efficiency and performance
persistence. ICFAI Journal of Financial Economics, 6(1), 7–28.
Czekaj, J. and Grotowski, M. (2014). Short-term persistence of the results achieved by
common stock investment funds acting in the Polish Capital Market (in Polish).
Ekonomista, 4, 545–557.
Dahlquist, M., Engstrom, S., Soderlind, P. (2000). Performance and characteristics of
Swedish mutual funds. Journal of Financial and Quantitative Analysis, 35(3), 409–423.
Drakos, K., Giannakopoulos, N., Konstantinou, P.T. (2015). Investigating persistence in the
US mutual fund market: A mobility approach. Review of Economic Analysis, 7, 54–83.
Filip, D. and Rogala, T. (2021). Analysis of Polish mutual funds performance: A Markovian
approach. Statistics in Transition New Series, 22(1), 115–130.
Goetzmann, W.N. and Ibbotson, R.G. (1994). Do winners repeat? The Journal of Portfolio
Management, 20(2), 9–18.
Grinblatt, M. and Titman, S. (1989). Mutual fund performance: An analysis of quarterly
portfolio holdings. The Journal of Business, 62(3), 393–416.
Grinblatt, M. and Titman, S. (1992). The persistence of mutual fund performance. Journal of
Finance, 47(5), 1977–1984.
Haslem, J. (2003). Mutual Funds: Risk and Performance Analysis for Decision Making.
Blackwell Publishing, Malden.
Hendricks, D., Patel, J., Zeckhauser, R. (1993). Hot hands in mutual funds: Short-run
persistence of relative performance, 1974–1988. Journal of Finance, 48(1), 93–130.
Huij, J. and Derwall, J. (2008). “Hot hands” in bond funds. Journal of Banking and Finance,
32(4), 559–572.
Huij, J. and Verbeek, M. (2007). Cross-sectional learning and short-run persistence in mutual
fund performance. Journal of Banking and Finance, 31(3), 973–997.
Kemeny, J.G. and Snell, L.J. (1976). Finite Markov Chains. With a New Appendix
“Generalization of a Fundamental Matrix”. Springer-Verlag, New York-Berlin-
Heidelberg-Tokyo.
Koutsokostas, D., Papathanasiou, S., Eriotis, N. (2020). Short-term versus longer-term
persistence in performance of equity mutual funds: Evidence from the Greek market.
International Journal of Bonds and Derivatives, 4(2), 89–103.
Lee, J.S., Yen, P.H., Chen, Y.J. (2008). Longer tenure, greater seniority, or both. Evidence
from open-end equity mutual fund managers in Taiwan. Asian Academy of Management
Journal of Accounting and Finance, 4(2), 1–20.
Machnik, J. (2020). Performance persistence and gamma convergence in absolute return
funds in Poland over the period 2011–2018. Financial Sciences, 25(2–3), 41–54.
Malkiel, B.G. (1995). Returns from investing in equity mutual funds 1971 to 1991. The
Journal of Finance, 50(2), 549–572.
Otten, R. and Bams, D. (2002). European mutual fund performance. European Financial
Management, 8(1), 75–101.
Wilkinson, N. and Klaes, M. (2012). An Introduction to Behavioral Economics, 2nd edition.
Palgrave Macmillan, London.
12

Invariant Description for a Batch Version of
the UCB Strategy with Unknown Control Horizon

We consider a batch processing variation of the UCB strategy for multi-armed
bandits with a priori unknown control horizon size n. Random rewards of the
considered multi-armed bandits can have a wide range of distributions with finite
variances. We consider a batch processing scenario, when the arm of the bandit can
be switched only after it was used a fixed number of times, and parallel processing is
also possible in this case. A case of close distributions, when expected rewards
differ by the magnitude of order N–1/2 for some fairly large N, is considered as it
yields the highest normalized regret. Invariant descriptions are obtained for upper
bounds in the strategy and for minimax regret. We perform a series of Monte Carlo
simulations to find the estimations of the minimax regret for multi-armed bandits.
The maximum for regret is reached for n proportional to N, as expected based on
obtained descriptions.

12.1. Introduction

The multi-armed bandit (MAB) problem is a control problem in a random
environment. Traditionally, it is pictured as a slot machine that has (two or more)
arms (levers). Choosing an arm at time t yields some random reward (income)
ξ(t) associated with it. The gambler (decision-making agent) begins with no initial
knowledge about the rewards associated with the arms.

Chapter written by Sergey GARBAR.


The goal is to maximize the total expected reward.

This is a classic reinforcement learning problem that exemplifies the exploration–
exploitation trade-off dilemma (Lattimore and Szepesvari 2020): the goal of the
gambler is to maximize the total expected reward. Yet, as there is no prior
knowledge about the parameters of the MAB, the gambler should gather new data to
lessen the losses due to incomplete knowledge (exploration) while getting the most
income based on already obtained knowledge (exploitation).

Formally, we describe a MAB as a controlled random process ξ(t), t = 1, 2, …
(with a priori unknown control horizon). The value ξ(t) at time t only depends
on the chosen arm. Expected values of the rewards m_1, …, m_J are assumed to be
unknown. Variances of the rewards D_1, …, D_J are assumed to be known and equal
(D_1 = ⋯ = D_J = D). The examined MAB can therefore be described with a vector
parameter θ = (m_1, …, m_J).

The regret (loss function) after n rounds is defined as the expected difference
between the reward sum associated with an optimal strategy and the sum of the
collected rewards and is equal to

L_n(σ, θ) = E_{σ,θ} ( n · max(m_1, …, m_J) − Σ_{t=1}^{n} ξ(t) ).

Here, E_{σ,θ} denotes the expected value calculated with respect to the measure
generated by strategy σ and parameter θ. Further, we consider the normalized regret
(scaled by (DN)^{1/2}; n is the reached step).

We assume mean rewards to have “close” distributions that can be described as
values of parameters that are chosen as follows

Θ = { m_l = m + c_l (D/N)^{1/2} ; c_l ∈ ℝ, |c_l| ≤ C < ∞, l = 1, …, J }.

This set of parameters describes “close” distributions: the difference between
expected values has the order N^{-1/2}. As the control horizon size is a priori unknown,
we use N as a parameter. Maximal normalized regrets are observed on that domain
and have the order N^{1/2} (see Vogel 1960).

For “distant” distributions, the normalized regrets have smaller values. For
example, they have order log N if max(m_1, …, m_J) exceeds all other expected
rewards by some fixed δ > 0 (see Lai et al. 1980).

We aim to build a batch version of the UCB strategy described in Lai (1987).
Also, we obtain its invariant description on the unit horizon in the domain of “close”
distributions (as in the case of “close” distributions, the maximum values of
expected regret are attained). Finally, we show (using Monte Carlo simulations) that
expected regret only depends on the number of processed batches (not the number of
steps) and that the maximum of the scaled regret is reached for step number
proportional to N.

12.2. UCB strategy

Suppose that at step n, the l-th arm was chosen n_l times and let X_l(n) denote
the corresponding cumulative reward (for l = 1, …, J).

X_l(n)/n_l is a point estimator of the mean reward for this arm.

Since the goal is to maximize the total expected reward, it might seem reasonable
to always apply the action corresponding to the current largest value of X_l(n)/n_l.

However, such a rule can result in a significant loss since the initial estimate
X_l(n)/n_l, corresponding to the largest m_l, can by chance take a lower value, and
consequently, this action will never be applied.

To get a correct estimation for m_l, we must ensure that each arm is chosen
infinitely many times as n → ∞.

Instead of the estimates themselves, it is proposed to consider the upper bounds of
their confidence intervals

U_l(n) = X_l(n)/n_l + ( 2 D log(n/n_l) / n_l )^{1/2},   l = 1, 2, …, J; n = 1, 2, …

It is supposed that for initial estimation of mean rewards, each arm will be used
once in the initial stage of control.
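A short sketch of the resulting selection rule is given below; it assumes the form of the bound as restored above, with known variance D, and is meant as an illustration rather than the authors' implementation.

```python
import numpy as np

def choose_arm(cumulative_rewards, pull_counts, step, variance):
    """cumulative_rewards: X_l(n) per arm; pull_counts: n_l; step: n; variance: D.
    Returns the index of the arm with the largest upper confidence bound."""
    cumulative_rewards = np.asarray(cumulative_rewards, dtype=float)
    pull_counts = np.asarray(pull_counts, dtype=float)
    means = cumulative_rewards / pull_counts
    bounds = means + np.sqrt(2.0 * variance * np.log(step / pull_counts) / pull_counts)
    return int(np.argmax(bounds))
```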

12.3. Batch version of the strategy

We consider a setting in which the gambler can change the arm only after using
it M times in a row. We assume for simplicity that N = MK, where K is the number
of batches. This limitation allows batch (and also parallel) processing (see
Kolnogorov 2012, 2018).

If M is large and the variance is finite, then due to the central limit theorem, the
reward for each batch will have a close to normal distribution with probability
density function

f(x | m_l) = (2πMD)^{-1/2} exp( −(x − M m_l)^2 / (2MD) )

if the l-th arm is chosen, l = 1, …, J. Therefore, in this scenario, we can assume that
we need to study a Gaussian MAB with J arms.

For the k-th batch, the following expression for the reward holds:

x_l(k) = M·m + M c_l (D/N)^{1/2} + (MD)^{1/2} ζ   (ζ ~ N(0, 1)).

Upper bounds for batches take the form

U_l(k) = X_l(k)/k_l + ( a M D log(k/k_l) / k_l )^{1/2},   l = 1, 2, …, J,

where a is the parameter of the strategy, k is the number of processed batches, k_l is
the number of batches for which the l-th arm was chosen and X_l(k) is the
corresponding cumulative reward after processing k batches (l = 1, 2, …, J).
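The batch mechanics described above can be sketched as follows, using the central limit approximation for the batch reward and the batch-level bounds with strategy parameter a; the symbols follow the formulas as reconstructed here and should be read as assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_reward(mean, batch_size, variance):
    # CLT approximation: the sum of batch_size rewards is close to normal
    return rng.normal(batch_size * mean, np.sqrt(batch_size * variance))

def batch_bounds(cumulative, batches_used, total_batches, batch_size, variance, a):
    """cumulative: X_l(k); batches_used: k_l; total_batches: k processed so far."""
    cumulative = np.asarray(cumulative, dtype=float)
    batches_used = np.asarray(batches_used, dtype=float)
    exploration = np.sqrt(a * batch_size * variance *
                          np.log(total_batches / batches_used) / batches_used)
    return cumulative / batches_used + exploration
```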

12.4. Invariant description with a unit control horizon

The aim is to get a description of the strategy and the regret, which is
independent of the control horizon size. That way, it will be possible to make
conclusions about its properties no matter the horizon size. We aim to scale the step
number by some parameter N (this parameter is the horizon size in the case where it
is a priori known).

We denote by

I_l(k) = 1 if U_l(k) = max( U_1(k), …, U_J(k) ), and I_l(k) = 0 otherwise,

the indicator of the chosen action for processing the (k + 1)-th batch according to the
considered rule (also recall that at k ≤ J every arm is chosen once for a batch, so
I_l(k) = 1 for l = k). With probability 1, only one of the values I_l(k) is equal to 1.

The cumulative reward for each arm can be written as

X_l(k) = k_l M ( m + c_l (D/N)^{1/2} ) + Σ_{i=1}^{k} I_l(i) η(i; l),   l = 1, 2, …, J,

where η(i; l) ~ N(0, (MD)^{1/2}) are i.i.d. normally distributed random variables with
zero means and variances that are equal to MD. Note that for arm l the indicator
equals 1 exactly k_l times, so Σ_{i=1}^{k} I_l(i) η(i; l) is the sum of k_l Gaussian random
variables, which can be presented as a standard normal random variable scaled by its
standard deviation. In that case

X_l(k) = k_l M ( m + c_l (D/N)^{1/2} ) + (k_l M D)^{1/2} ζ_l,

where ζ_l ~ N(0, 1) is a standard normal random variable.

The upper bound value for each arm can be written as (l = 1, 2, …, J; k = J + 1,
J + 2, …, K):

U_l(k) = M m + M c_l (D/N)^{1/2} + (MD/k_l)^{1/2} ζ_l + ( a M D log(k/k_l) / k_l )^{1/2}.

Next, we apply a linear transformation that does not change the arrangement of the
bounds

V_l(k) = ( U_l(k) − M m ) N^{1/2} / ( M D^{1/2} ).

And we introduce the notation t = kM/N, t_l = k_l M/N, that way obtaining the
expression

V_l(t) = c_l + ζ_l / t_l^{1/2} + ( a log(t/t_l) / t_l )^{1/2},   l = 1, 2, …, J.

Here, t changes in the interval (0, 1] when k changes from 1 to K, i.e. the control
horizon is scaled to a unit size. A priori unknown control horizons can change from
0 to any value.

To obtain an invariant form for the regret, we first assume without loss of generality
that m_1 = max(m_1, …, m_J), so the regret can be expressed as

L_n(σ, θ) = (D/N)^{1/2} Σ_{l=1}^{J} (c_1 − c_l) E_{σ,θ}(n_l)

= (DN)^{1/2} Σ_{l=1}^{J} (c_1 − c_l) E_{σ,θ}(n_l / N).

After normalization (scaling by (DN)^{1/2}), we get the following expression for the
regret:

(DN)^{-1/2} L_n(σ, θ) = Σ_{l=1}^{J} (c_1 − c_l) E_{σ,θ}(t_l),

which is the required invariant description. Hence, we can present the results in the
form of the following theorem.

THEOREM 12.1.– For Gaussian multi-armed bandits with J arms, fixed known
variance D and unknown expected values m_1, …, m_J that have “close” distributions
defined by

m_l = m + c_l (D/N)^{1/2} ; c_l ∈ ℝ, |c_l| ≤ C < ∞, l = 1, …, J,

the use of the batch UCB rule with bounds

U_l(k) = X_l(k)/k_l + ( a M D log(k/k_l) / k_l )^{1/2},   l = 1, 2, …, J; k = 1, 2, …, K,

results in an invariant description on the unit control horizon, which is described by

V_l(t) = c_l + ζ_l / t_l^{1/2} + ( a log(t/t_l) / t_l )^{1/2},   l = 1, 2, …, J,

and normalized regret

(DN)^{-1/2} L_n(σ, θ) = Σ_{l=1}^{J} (c_1 − c_l) E_{σ,θ}(t_l).

12.5. Simulation results

Having an invariant description allows us to generalize the results of Monte
Carlo simulations for strategies at large.

In what follows, we consider a Gaussian two-armed bandit with control horizon
sizes up to N = 10,000, which can be considered fairly large. We consider different
values of the batch size M, shown by different line colors. All of the results are
averaged over 10,000 simulations.
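A hedged sketch of such a Monte Carlo experiment for a two-armed bandit is shown below; the batch UCB rule, the regret computation and the (DN)^{1/2} scaling follow the formulas as reconstructed in this chapter, and the number of runs is reduced for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_regret(c_diff, N, M, D=1.0, a=2.0, n_runs=1000):
    """Average normalized regret of the batch UCB rule for a Gaussian two-armed
    bandit with close means m_1 = c_diff*sqrt(D/N), m_2 = 0 (illustrative sketch)."""
    K = N // M                          # number of batches
    means = np.array([c_diff, 0.0]) * np.sqrt(D / N)
    n_pulls = K * M
    total_regret = 0.0
    for _ in range(n_runs):
        X = np.zeros(2)                 # cumulative rewards per arm
        k_l = np.zeros(2)               # batches processed per arm
        for k in range(1, K + 1):
            if k <= 2:
                arm = k - 1             # initial stage: use each arm once
            else:
                bounds = X / k_l + np.sqrt(a * M * D * np.log(k / k_l) / k_l)
                arm = int(np.argmax(bounds))
            X[arm] += rng.normal(M * means[arm], np.sqrt(M * D))
            k_l[arm] += 1
        # regret = optimal expected reward minus expected reward of chosen arms
        total_regret += means.max() * n_pulls - np.dot(k_l * M, means)
    return total_regret / n_runs / np.sqrt(D * N)   # scaled by (DN)^(1/2)

print(simulate_regret(c_diff=3.3, N=10_000, M=50))
```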

Different plots (Figures 12.1–12.3) correspond to different values of the difference
between the mean rewards of the bandit arms. All the plots show the maximum regret
(scaled by (DN)^{1/2}) versus step number n.

Figure 12.1. Maximum scaled regret versus step number for c_1 − c_2 = 10.
For a color version of this figure, see www.iste.co.uk/zafeiris/data1.zip

The values of the differences are chosen as c_1 − c_2 = 3.3 (Figure 12.2), as in this
case the biggest maximum regret was obtained according to Garbar (2020a, 2020b).
The other values (c_1 − c_2 = 10, Figure 12.1; c_1 − c_2 = 1, Figure 12.3) correspond
to bigger and smaller difference values for the expected reward.

For all cases, we can observe that the maximum for the scaled regret is reached for
n proportional to N. When the difference in mean rewards is large (Figure 12.1), the
strategy can distinguish between the better and worse arms in the early stages of
control and the regret is not that big.

Figure 12.2. Maximum scaled regret versus step number for c_1 − c_2 = 3.3.
For a color version of this figure, see www.iste.co.uk/zafeiris/data1.zip

Figure 12.3. Maximum scaled regret versus step number for c_1 − c_2 = 1.
For a color version of this figure, see www.iste.co.uk/zafeiris/data1.zip

12.6. Conclusion

We reviewed a batch version of the UCB rule with a priori unknown control
horizon.

An invariant description of the strategy was obtained.



Monte Carlo simulations were performed to study the normalized regret for
different fairly large horizon sizes; it is shown that the maximum for regret is
reached for n proportional to N, as expected based on the obtained descriptions.

12.7. Affiliations

This study was funded by RFBR, project number 20-01-00062.

12.8. References

Garbar, S.V. (2020a). Invariant description for batch version of UCB strategy for multi-armed
bandit. J. Phys. Conf. Ser., 1658, 012015.
Garbar, S.V. (2020b). Invariant description of UCB strategy for multi-armed bandits for batch
processing scenario. Proceedings of the 24th International Conference on Circuits,
Systems, Communications and Computers (CSCC), 75–78, Chania, Greece.
Kolnogorov, A.V. (2012). Parallel design of robust control in the stochastic environment (the
two-armed bandit problem). Automation and Remote Control, 73, 689–701.
Kolnogorov, A.V. (2018). Gaussian two-armed bandit and optimization of batch data
processing. Problems of Information Transmission, 54(1), 84–100.
Lai, T.L. (1987). Adaptive treatment allocation and the multi-armed bandit problem. Ann.
Statist., 25, 1091–1114.
Lai, T.L., Levin, B., Robbins, H., Siegmund, D. (1980). Sequential medical trials (stopping
rules/asymptotic optimality). Proc. Natl. Acad. Sci. USA, 77, 3135–3138.
Lattimore, T. and Szepesvari, C. (2020). Bandit Algorithms. Cambridge University Press,
Cambridge, UK.
Vogel, W. (1960). An asymptotic minimax theorem for the two-armed bandit problem. Ann.
Math. Statist., 31, 444–451.
13

A New Non-monotonic Link
Function for Beta Regressions

Beta regression is used to analyze data whose value is within the range (0,1),
such as rates, proportions or percentages, and therefore is useful for analyzing the
variables that affect them (Ferrari and Cribari-Neto 2004; Simas et al. 2010). This
method is based on the beta distribution or its re-parametrizations, proposed by
Ferrari and Cribari-Neto (2004) and Cribari-Neto and Souza (2012), to obtain a
regression structure on the mean that is easier to analyze and interpret. For the
regression for binary data, the literature has debated the problem of incorrect link
functions and therefore proposed new links, such as gev (generalized extreme
value), while, for the mean of the beta regression, the traditional link functions for
binary regressions were used, i.e. logit, probit and complementary log–log. In this
chapter, a new inverse link function is proposed for the mean parameter of a beta
regression, which has as its particular cases inverse logit, representing a traditional
symmetric inverse link function, and gev, proposed for binary data due to its
asymmetry. The new inverse link function proposed in this chapter has the
advantage that it can also be non-monotonic, unlike those proposed until now. The
parameters are estimated by maximizing the likelihood function, using a modified
version of the genetic algorithm which gives greater importance to the traditional
link functions than to the others. The method is compared with the one proposed by
Cribari-Neto and Zeileis (2010), in which the researcher decides the link function
a priori, using simulated data so as to be able to determine which of the two
methods comes closest to the true values. The proposed method turns out to be better
because it is able to correctly determine the link function with which the data were
simulated and to estimate the parameters with less error.

Chapter written by Gloria GHENO.


For a color version of all the figures in this chapter, see www.iste.co.uk/zafeiris/data1.zip.


13.1. Introduction

Beta regression is typically used to analyze data whose value is within the range
(0,1), such as rates, proportions or percentages, and to study the variables which
affect them (Cox 1996; Ferrari and Cribari-Neto 2004; Simas et al. 2010; Cribari-Neto
and Queiroz 2014). This statistical method is based on beta distribution or its
re-parametrizations, proposed by Ferrari and Cribari-Neto (2004) and by Cribari-Neto
and Souza (2012), to obtain a regression structure on the mean which is easier to
analyze and interpret. Cox (1996) and Kieschnick and McCullough (2003) were
among the first to propose beta regression. They proposed their own version of the
generalized linear models (McCullagh and Nelder 1989) for the variables of interest
included in the unit interval, exploiting the belonging of the beta distribution to the
exponential family, on the basis of the generalized linear model. These link the
mean of the variable of interest to a function, called the link function, of exogenous
explanatory variables, also called regressors. The inverse of the link function is
called the response function. Since the mean of the beta distribution is between
0 and 1, Kieschnick and McCullough (2003) recommend the use of a logit link
function, essentially created to be applied to regressions for binary data. In its
traditional form, the beta distribution is characterized by the two parameters p and q.
Because its mean is a function of the parameters p and q, Kieschnick and
McCullough (2003) link them to the explanatory variables through the link function,
considering q as a function of the regressors. Ferrari and Cribari-Neto (2004)
re-parameterize the beta distribution, which thus becomes characterized by the
parameters µ and φ, which are, respectively, the mean and the precision. With this
modification, the analysis is simplified, since the mean is directly linked to the
explanatory variables through the link function. The parameters of the link function
are estimated using the quasi-maximum likelihood (QMLE) method (Cox 1996;
Kieschnick and McCullough 2003) or the maximum likelihood method (Ferrari and
Cribari-Neto 2004). Beta regression has had a further evolution with the possibility
of also linking the precision parameter to explanatory variables through another link
function (Smithson and Verkuilen 2006; Simas et al. 2010). Smithson and Verkuilen
(2006) use the logarithmic function for the link function of the precision parameter
and the logit function for the mean. Simas et al. (2010), instead, propose some
functions both for the link function of the precision parameter and for that of the
mean. Simas et al. (2010) apply, for the mean, the traditional link functions of binary
regressions, i.e. logit, probit and complementary log–log, and, for the precision
parameter, the logarithmic function, the square root and equality. The parameters
of the two link functions are estimated with the maximum likelihood method.
Cribari-Neto and Souza (2012) propose their new parameterization of the beta
distribution, where the parameters represent the mean and a measure of dispersion.
In their study, the logit function is used for both link functions.

In the literature, the problem of incorrect link functions has been discussed in the
context of regressions for binary data (Czado and Santner 1992), and therefore, new
and further link functions have been proposed (Aranda-Ordaz 1981; Stukel 1988;
Nagler 1994; Wang and Dey 2010; Jiang et al. 2013; Gheno 2018). For the mean of
the beta regression, however, until now, researchers have used the link functions
used for the binary regressions, i.e. logit, probit and complementary log–log. Since
the mean is between (0,1), however, only the functions (0,1) →ℜ can be used as link
functions. The logit and probit functions are monotonic and symmetric functions,
while the complementary log–log approaches slowly 0 and quickly 1. The
complementary log–log, also defined as extreme minimal value (Fahrmeir and Tutz
2013), has its complementary version in the log–log, or extreme maximal value
(Fahrmeir and Tutz 2013), because it approaches 0 quickly and 1 slowly. Other
non-symmetric functions have been proposed for binary data. Some examples of these
link functions are gev (generalized extreme value), which has the complementary
log–log link function as a special case (Wang and Dey 2010; Calabrese and Osmetti
2013); scobit (Nagler 1994) and Aranda-Ordaz’s link, which has the logit and
the complementary log–log as special cases (Aranda-Ordaz 1981). Only Canterle and
Bayer (2019) use Aranda-Ordaz’s link for the mean in a beta regression.

In this chapter, a new response function is proposed for the mean parameter of a
beta regression, which has as its particular cases the symmetric inverse link function
logit and the asymmetric gev. The new response function has the advantage that it
can also be non-monotonic, a feature not present in those proposed until now. The
parameters are estimated with the maximization of likelihood, made possible by the
use of a modified version of the genetic algorithm, to give more relevance to
the traditional link functions than the others. This new method is compared with that
proposed by Cribari-Neto and Zeileis (2010) using simulated data, in order to
compare which of the two methods is closest to the true values. This method is able
to correctly determine the link function with which the data are simulated and to
estimate the parameters with less error.

13.2. Model

The variable of interest of a beta regression has a beta distribution and therefore
takes values between 0 and 1, excluded extremes. However, beta regression can also
be used for variables included in an interval (a, b) with the appropriate
modifications, i.e. (y-a)/(b-a) (Ferrari and Cribari-Neto 2004; Smithson and
Verkuilen 2006). If y, instead, can also assume the values 0 and 1, Smithson and
Verkuilen (2006) propose the transformation (y (n-1) + 0.5)/n, where n represents
the sample size.

To facilitate the interpretation of the estimated values from the beta regression,
Ferrari and Cribari-Neto (2004) propose the following re-parametrization of the beta
distribution

f(y; μ, φ) = [ Γ(φ) / ( Γ(μφ) Γ((1 − μ)φ) ) ] y^{μφ − 1} (1 − y)^{(1 − μ)φ − 1}

where μ represents the mean and is between 0 and 1 excluded, and φ represents the
precision parameter and is greater than 0. In this parameterization, the variable y has
mean equal to μ and variance equal to μ(1 − μ)/(1 + φ). In the simplest form of
beta regression, the mean of the variable of interest is equal to

g(μ_i) = Σ_j x_ij β_j  =>  μ_i = r( Σ_j x_ij β_j ),

V(y_i) = μ_i (1 − μ_i) / (1 + φ),

where the x_ij with j = 1, …, J represent regressors or explanatory variables and r(∙) is a
function such that r(∙): ℜ → (0,1). The function g(∙) is called the link function
(Cribari-Neto and Zeileis 2010) while r(∙) is called the response function. In this
case, g(∙) represents the link function of the mean. In the most advanced form of
beta regression (Smithson and Verkuilen 2006; Simas et al. 2010), indeed, even the
precision parameter becomes a linear function of the explanatory variables z_iq with
q = 1, …, Q,

h(φ_i) = Σ_q z_iq γ_q,

where h(∙) represents the link function of the precision parameter and is such that
h(∙): ℜ → (0, ∞) (Cribari-Neto and Zeileis 2010). A sample of size n is
used to estimate the parameters β and φ of the simpler version or the parameters β
and γ of the more complex version. The corresponding log-likelihood function of the
simplest model becomes

ℓ(β, φ) = Σ_{i=1}^{n} log( f(y_i; μ_i, φ) ).

In the simplest version, each observation y_i has mean equal to μ_i and variance
equal to μ_i(1 − μ_i)/(1 + φ). The parameters are obtained by maximizing the
log-likelihood function. The most commonly used link functions for the mean are,
respectively, logit, probit and complementary log–log:

g(μ_i) = log( μ_i / (1 − μ_i) ) = Σ_j x_ij β_j  =>  μ_i = exp(Σ_j x_ij β_j) / ( 1 + exp(Σ_j x_ij β_j) ),

g(μ_i) = Φ^{-1}(μ_i) = Σ_j x_ij β_j  =>  μ_i = Φ( Σ_j x_ij β_j ),

g(μ_i) = log( −log(1 − μ_i) ) = Σ_j x_ij β_j  =>  μ_i = 1 − exp( −exp( Σ_j x_ij β_j ) ).
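For illustration, the log-likelihood of the simplest model with the logit response can be written in a few lines; the sketch below uses standard SciPy special functions and is only an example of the estimation target, not the authors' code.

```python
import numpy as np
from scipy.special import gammaln, expit

def beta_logit_negloglik(params, X, y):
    """params = (beta_1, ..., beta_J, phi); X: n x J design matrix; y in (0, 1)."""
    beta, phi = params[:-1], params[-1]
    mu = expit(X @ beta)                      # logit response function
    a, b = mu * phi, (1.0 - mu) * phi         # shape parameters of the beta density
    loglik = (gammaln(phi) - gammaln(a) - gammaln(b)
              + (a - 1.0) * np.log(y) + (b - 1.0) * np.log(1.0 - y))
    return -np.sum(loglik)
```

Maximizing the likelihood then amounts to minimizing this function, for example with scipy.optimize.minimize under a positivity bound on φ.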

While logit and probit are monotonic and symmetric functions, the complementary
log–log approaches 0 slowly and 1 quickly. In this chapter, to broaden the
possibility of studying the relationship between the mean and the explanatory
variables x_ij with j = 1, …, J more comprehensively, a new response function called
logev is proposed, because it includes among its particular cases the inverse
link function logit and the gev one (Calabrese and Osmetti 2013), until now only
applied to regressions for binary variables. The gev link function is

g(μ_i) = ( (−ln μ_i)^{−τ} − 1 ) / τ = Σ_j x_ij β_j  =>  μ_i = exp( −( 1 + τ Σ_j x_ij β_j )^{−1/τ} ).

The response function gev becomes, with τ → 0, the response function of the
complementary log–log and, with τ < 0, the response function of the Weibull
(Calabrese and Osmetti 2013). The logev function is
μ_i = exp( −( 1 − ω + τ(α + Σ_j x_ij β_j) )^{−1/τ} ) / ( 1 + ω exp( α + Σ_j x_ij β_j ) ),
        if 1 − ω + τ(α + Σ_j x_ij β_j) > 0,

μ_i = ω exp( α + Σ_j x_ij β_j ) / ( 1 + ω exp( α + Σ_j x_ij β_j ) ),
        if 1 − ω + τ(α + Σ_j x_ij β_j) ≤ 0,
with τ, α, β_j ∈ (−∞, +∞) and ω ∈ [0, 1]. If ω = 0, the equation becomes the
response function gev; if, instead, ω = 1 and τ = −1, the equation becomes
the inverse link function logit. The peculiarity of this response function is that the
choice of symmetry or asymmetry is not given a priori but is determined by the data.
In the response functions proposed until now, this choice is made a priori by
the researcher, while with this response function it is the data that provide it. As noted
by Czado and Santner (1992) for the binomial regression model, also for the
beta regression the misspecification of the link function induces bias in the
estimated parameters, and therefore avoiding an a priori choice can eliminate this problem.
Another peculiarity of this function is the possibility of being non-monotonic, a
feature that has not been considered for a beta regression until now. Non-monotonicity
for binary data has only been proposed in the Bayesian field by Gheno (2018).
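A sketch of the gev response and of the logev response is given below; the gev part follows the standard generalized extreme value form recalled above, while the logev branch follows the piecewise expression as given in this section and is offered as an illustrative assumption rather than a definitive implementation.

```python
import numpy as np

def gev_response(eta, tau):
    """Generalized extreme value response: mu = exp(-(1 + tau*eta)^(-1/tau)).
    Out-of-range linear predictors are mapped to the lower limit here for simplicity."""
    arg = 1.0 + tau * eta
    gev = np.exp(-np.power(np.clip(arg, 1e-12, None), -1.0 / tau))
    return np.where(arg > 0, gev, 0.0)

def logev_response(eta, tau, omega):
    """Piecewise logev response as written above (omega = 0 -> gev;
    omega = 1, tau = -1 -> logit). Illustrative assumption only."""
    arg = 1.0 - omega + tau * eta
    denom = 1.0 + omega * np.exp(eta)
    gev_part = np.exp(-np.power(np.clip(arg, 1e-12, None), -1.0 / tau))
    return np.where(arg > 0, gev_part / denom, omega * np.exp(eta) / denom)
```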

13.3. Estimation

The parameters of the logev regression cannot be estimated by maximizing the
log-likelihood with derivative-based methods, due to its complexity, and
therefore, the use of a modified version of the genetic algorithm presented by
Holland is proposed (Holland 1975). Genetic algorithms, based on the concept of
evolution, are very often used for optimization problems (Whitley 1994). The
algorithm, which is proposed in this chapter, divides the sample into two parts: the
train part, which corresponds to 75% of the sample, and the test part, which refers to
the remaining 25%. This subdivision is used to analyze the goodness of the estimate
(train) and the goodness of the forecast (test). The parameters are estimated applying
the genetic algorithm to the likelihood calculated on the train dataset. The MSE
(mean square error) is calculated in both datasets (train and test), in order to study
both the goodness of the estimate and the goodness of the forecast. Repeating the
subdivision of the dataset and the related estimation process 100 times, the
parameters that minimize the following two functions are obtained:

0.5 MSE(train) + 0.5 MSE(test)

0.1 MSE(test) + 0.9 MSE(train)

In the first function, equal importance is attributed to goodness of estimate and
goodness of forecast; in the second, instead, more importance is given to goodness
of estimate. If the estimation of the parameters ω̂, τ̂ determines a known model (logit
or gev), the estimate of the parameters ω̂, τ̂, α̂, β̂_j with j = 1, …, J is equal to the
estimate obtained from the subdivision d, with d = 1, …, 100, that yields

min_d [ 0.5 MSE_d(train) + 0.5 MSE_d(test) ].

If ω̂, τ̂, instead, determine an unknown model, the algorithm analyzes the
parameters estimated by the following function:

min_d [ 0.1 MSE_d(test) + 0.9 MSE_d(train) ].

If these determine a known model, they become the searched estimated
parameters. If both functions, instead, determine ω̂, τ̂ of an unknown model, the
estimate of the first function is chosen. This procedure is preferred in order to give
greater importance to the logit and gev models than a hybrid model. The standard
errors of the estimated parameters can be calculated with the bootstrap method.
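The estimation procedure can be sketched as follows, with SciPy's differential_evolution standing in for the modified genetic algorithm and a hypothetical predict_mean helper producing fitted means; the 75/25 split and the weighted MSE criteria follow the description above, so the code is an assumption-laden illustration rather than the authors' algorithm.

```python
import numpy as np
from scipy.optimize import differential_evolution

def fit_once(X, y, negloglik, bounds, rng):
    """One 75/25 split: fit on the train part, report the MSE on both parts."""
    n = len(y)
    idx = rng.permutation(n)
    train, test = idx[: int(0.75 * n)], idx[int(0.75 * n):]
    # differential_evolution plays the role of the genetic-style global optimizer
    result = differential_evolution(negloglik, bounds, args=(X[train], y[train]), seed=0)
    params = result.x
    mse = {}
    for name, part in (("train", train), ("test", test)):
        mu_hat = predict_mean(params, X[part])  # hypothetical helper returning fitted means
        mse[name] = np.mean((y[part] - mu_hat) ** 2)
    return params, mse

# Repeat the split 100 times and keep the parameters minimizing
# 0.5*MSE(train) + 0.5*MSE(test), falling back to the second criterion as described above.
```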

13.4. Comparison

To study the goodness of the method, it is compared with the beta logit
regression proposed by Cribari-Neto and Zeileis (2010) (hereinafter also defined as
betareg), using simulated data so as to know exactly what the true relationship
between the response variable and the explanatory variables is. In the first analysis,
30 datasets of sample size 500 are simulated from a logit model with α = 1 and
β = −2. Logev beta regression estimates all datasets, while logit beta regression is
able to estimate only 25 datasets. Figure 13.1 shows that in almost 80% of cases,
logev chooses the logit model exactly. As simulated data is used, it is possible to
analyze which of the two methods is closest to the true value. Only the cases where
logev chooses the logit model exactly are considered, and the bias is analyzed
(Langner et al. 2003):

Bias_c(α) = (1/D) Σ_{d=1}^{D} ( α̂_{c,d} − α ) = (1/D) Σ_{d=1}^{D} ( α̂_{c,d} − 1 ) = ᾱ_c − 1,

Bias_c(β) = (1/D) Σ_{d=1}^{D} ( β̂_{c,d} − β ) = (1/D) Σ_{d=1}^{D} ( β̂_{c,d} + 2 ) = β̄_c + 2,

where D is the number of simulated datasets, which are estimated by both methods
as logit model and is equal to 19 and c = betareg, logev. If the bias is considered, the
intercept and the coefficient β are better estimated by the logev method, because, in
both cases, the logev bias is closer to 0 than the betareg bias:

|Bias_logev(α)| = |−0.06| < |Bias_betareg(α)| = |−0.11|,

|Bias_logev(β)| = 0.16 < |Bias_betareg(β)| = 0.68.

When always considering the cases where logev chooses the logit model exactly,
the MSE statistic of the two methods is compared, in order to analyze both the
variance and the bias:

MSE_c(α) = (1/D) Σ_{d=1}^{D} ( α̂_{c,d} − α )^2 = (1/D) Σ_{d=1}^{D} ( α̂_{c,d} − 1 )^2,

MSE_c(β) = (1/D) Σ_{d=1}^{D} ( β̂_{c,d} − β )^2 = (1/D) Σ_{d=1}^{D} ( β̂_{c,d} + 2 )^2,

where D = 19 and c = betareg, logev. The intercept is better estimated by the betareg
method, even if the two MSEs are very close, while logev estimates the coefficient β
much better

MSE_logev(α) = 0.03 > MSE_betareg(α) = 0.02,

MSE_logev(β) = 0.07 < MSE_betareg(β) = 0.46.
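A minimal sketch of the bias and MSE computations defined above; the array of estimates per parameter is a hypothetical input.

```python
import numpy as np

def bias_and_mse(estimates, true_value):
    """estimates: estimated values of one parameter across the D retained datasets."""
    estimates = np.asarray(estimates, dtype=float)
    bias = np.mean(estimates - true_value)
    mse = np.mean((estimates - true_value) ** 2)
    return bias, mse

# e.g. bias_and_mse(alpha_hat_logev, 1.0) and bias_and_mse(beta_hat_logev, -2.0)
```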

The two methods are compared with the AIC and BIC criteria. The AIC and BIC
are equal to (Qi and Zhang 2001)

AIC_c = log( MSE_{c,train} ) + 2 k_c / n,

BIC_c = log( MSE_{c,train} ) + k_c log(n) / n,

MSE_{c,train} = (1/n) Σ_{i=1}^{n} ( y_i − ŷ_{c,i} )^2,

where c = betareg, logev and k_c represents the number of parameters and therefore
k_betareg = 4, while k_logev = 6. The two AICs are now compared:

ΔAIC = AIC_betareg − AIC_logev = log(MSE_{betareg,train}) − log(MSE_{logev,train}) + 8/500 − 12/500.

If ΔAIC > 0, the best model is logev, and then the condition for choosing the
logev is

Δlog(MSE) = log(MSE_{betareg,train}) − log(MSE_{logev,train}) > −8/500 + 12/500 = 0.008.

Now, the BIC is compared:

ΔBIC = BIC_betareg − BIC_logev = log(MSE_{betareg,train}) − log(MSE_{logev,train}) + 4 log(500)/500 − 6 log(500)/500.

If ΔBIC > 0, the best model is the one proposed by the logev regression model,
and the condition for choosing the logev becomes

Δlog(MSE) = log(MSE_{betareg,train}) − log(MSE_{logev,train}) > 2 log(500)/500 = 0.024858.
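Both conditions reduce to comparing Δlog(MSE) with a constant, as the following sketch (with the sample size and the parameter counts used above) makes explicit.

```python
import numpy as np

def choose_by_information_criteria(mse_train_betareg, mse_train_logev, n=500,
                                   k_betareg=4, k_logev=6):
    delta_log_mse = np.log(mse_train_betareg) - np.log(mse_train_logev)
    aic_threshold = 2.0 * (k_logev - k_betareg) / n         # 4/500 = 0.008
    bic_threshold = (k_logev - k_betareg) * np.log(n) / n   # ~0.024858
    return {"AIC prefers logev": delta_log_mse > aic_threshold,
            "BIC prefers logev": delta_log_mse > bic_threshold}
```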

Figure 13.2 shows that logev is always better than the logit regression model.

Figure 13.1. Percentage of choice (logit dataset)

Figure 13.2. Comparison between AIC and BIC,
blue = ∆log(MSE), gray = 0.008, orange = 0.024858

In the following three datasets, logev and logit are only compared by analyzing the
goodness of the model. Indeed, the estimation of the parameters α and β is not
considered, because, as noted by Czado and Santner (1992), their estimation depends
on the type of the used link function. In the first dataset, instead, the estimation of the
parameters α and β is compared because logev also chooses the link function logit. In
the second analysis, 30 datasets of sample size 500 are simulated from a gev model
with α = 1, β = −4, τ = −2. In this case, logev always correctly chooses the gev
model (Figure 13.3), but estimates better in only about 60% of cases (Figure 13.4).

Figure 13.3. Percentage of choice (gev data)

Figure 13.4. Comparison of AIC and of MSE (if ∆MSE > 0,
according to MSE the best model is logev)

In the third analysis, 30 datasets of sample size 500 are simulated from a gev
model with α = 1, β = 1, τ = −2. In this case, logev almost always correctly
chooses the gev model (Figure 13.5) and estimates better in about 90% of cases
(Figure 13.6).

Figure 13.5. Percentage of choice (gev data)



Figure 13.6. Comparison of AIC and of MSE (if ∆MSE > 0,
according to MSE the best model is logev)

In the fourth analysis, 30 datasets of sample size 500 are simulated from a
non-monotonic model:
μ_i = exp( −| α + Σ_j x_ij β_j | )

The logev method is always able to estimate the model, while logit beta
regression is only able to estimate the model 28 times. In this case, logev chooses
the correct model in almost 75% of cases (Figure 13.7) and estimates better in about
99% of cases (Figure 13.8).

Figure 13.7. Percentage of choice (non-monotonic data)

Figure 13.8. Comparison between AIC and BIC,
blue = ∆log(MSE), gray = 0.008 and orange = 0.024858

13.5. Conclusion

The study of the link functions in the case of beta regression has, until now, been
poorly developed, and therefore, in this chapter, a new response function has been
proposed, which includes as special cases the asymmetric response function gev and
the symmetric inverse link function logit, both monotonic. The peculiarity of this
inverse link function, which is called logev, is that it can also be non-monotonic. To
estimate its parameters, a modified version of the genetic algorithm is used. The
logev beta regression is compared with logit beta regression using simulated data in
order to know the real model. The logev beta regression estimates much better than
logit beta regression and, in addition, finds the true model effectively in most cases.
Therefore, this new response function greatly improves the study of the relationships
among variables.

13.6. References

Aranda-Ordaz, F.J. (1981). On two families of transformations to additivity for binary
response data. Biometrika, 68(2), 357–363.
Calabrese, R. and Osmetti, S.A. (2013). Modelling small and medium enterprise loan defaults
as rare events: The generalized extreme value regression model. Journal of Applied
Statistics, 40(6), 1172–1188.
Canterle, D.R. and Bayer, F.M. (2019). Variable dispersion beta regressions with parametric
link functions. Statistical Papers, 60(5), 1541–1567.
Cox, C. (1996). Nonlinear quasi-likelihood models: Applications to continuous proportions.
Computational Statistics & Data Analysis, 21(4), 449–461.
Cribari-Neto, F. and Queiroz, M.P.F. (2014). On testing inference in beta regressions. Journal
of Statistical Computation and Simulation, 84(1), 186–203.
Cribari-Neto, F. and Souza, T.C. (2012). Testing inference in variable dispersion beta
regressions. Journal of Statistical Computation and Simulation, 82(12), 1827–1843.
Cribari-Neto, F. and Zeileis, A. (2010). Beta regression in R. Journal of Statistical Software,
34(2), 1–24.
Czado, C. and Santner, T.J. (1992). The effect of link misspecification on binary regression
inference. Journal of Statistical Planning and Inference, 33(2), 213–231.
Fahrmeir, L. and Tutz, G. (2013). Multivariate Statistical Modelling Based on Generalized
Linear Models. Springer Science & Business Media, New York.
Ferrari, S. and Cribari-Neto, F. (2004). Beta regression for modelling rates and proportions.
Journal of Applied Statistics, 31(7), 799–815.
Gheno, G. (2018). A new link function for the prediction of binary variables. Croatian Review
of Economic, Business and Social Statistics, 4(2), 67–77.

Holland, J.H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan
Press, Ann Arbor.
Jiang, X., Dey, D.K., Prunier, R., Wilson, A.M., Holsinger, K.E. (2013). A new class of
flexible link functions with application to species co-occurrence in Cape Floristic Region.
The Annals of Applied Statistics, 7(4), 2180–2204.
Kieschnick, R. and McCullough, B.D. (2003). Regression analysis of variates observed on (0, 1):
Percentages, proportions and fractions. Statistical Modelling, 3(3), 193–213.
Langner, I., Bender, R., Lenz-Tönjes, R., Küchenhoff, H., Blettner, M. (2003). Bias of
maximum-likelihood estimates in logistic and Cox regression models: A comparative
simulation study. Discussion paper 362, Ludwig Maximilian University of Munich.
McCullagh, P. and Nelder, J.A. (1989). Generalized Linear Models, 2nd edition. Chapman
and Hall, London.
Nagler, J. (1994). Scobit: An alternative estimator to logit and probit. American Journal of
Political Science, 38, 230–255.
Qi, M. and Zhang, G.P. (2001). An investigation of model selection criteria for neural
network time series forecasting. European Journal of Operational Research, 132(3),
666–680.
Simas, A.B., Barreto-Souza, W., Rocha, A.V. (2010). Improved estimators for a general class
of beta regression models. Computational Statistics & Data Analysis, 54(2), 348–366.
Smithson, M. and Verkuilen, J. (2006). A better lemon squeezer? Maximum-likelihood
regression with beta-distributed dependent variables. Psychological Methods, 11(1), 54.
Stukel, T.A. (1988). Generalized logistic models. Journal of the American Statistical
Association, 83(402), 426–431.
Wang, X. and Dey, D.K. (2010). Generalized extreme value regression for binary response
data: An application to B2B electronic payments system adoption. The Annals of Applied
Statistics, 4(4), 2000–2023.
Whitley, D. (1994). A genetic algorithm tutorial. Statistics and Computing, 4(2), 65–85.
14

A Method of Big Data Collection and Normalization for Electronic Engineering Applications

Data collection and storage have become some of the greatest challenges and most tedious processes in data science engineering. Data from various nodes (sensors, bridges, switches, hubs, etc.) in the environment or in a particular system is collected at the nodes, from which it arrives at the storage point. These types of operations need a separate workforce to monitor the whole process of data handling. The proposed work mainly focuses on the data analytics of creating normalized data from unprocessed data, which reduces the manipulation needed when the data arrives in different forms. The data may be realistic depending on the system which produces it. The normal distribution is applied to the collected data to create a dataset that is distributed according to a continuous probability density function, which extends to infinity in both directions of the axes. The proposed work provides an easy storage and data retrieval method in the case of large data volumes, and the proposed data recovery is compatible with conventional data collection methods. This type of data interpretation provides security and confidentiality of the user’s data.

14.1. Introduction

Data science has long been prevalent in all areas of science in this digital era.
Data science is an interdisciplinary field that fuses science and technologies by using
algorithms, tasks and devices to extract usable data from raw unstructured data. The
extracted data is applied to various domains to gain insights from the data acquired

Chapter written by Naveenbalaji GOWTHAMAN and Viranjay M. SRIVASTAVA.


to refine the required data through efficient searches. George and Groza (2019)
introduced a concept of the extract-transform-load (ETL) that used graduate
attributes in the common form of places and details. Separate attention was given to
the transformation procedures that helped to get two different reports as final results.
The reports were the graduate attribute report per cohort (GAR/C) and course
progression report per cohort (CPR/C). The GAR/C accessed each attribute based on
the average and the CPR/C showed the tracking information based on the
achievement made by the students in their program. The reports were generated simultaneously to enhance ETL efficiency. The model paved the way for the integrated assessment of the database based on the granularity of the ETL. Sulaiman et al. (2019) incorporated information and communication technologies (ICT) into the power industry by applying modernization concepts to it, giving rise to the smart power grid, which integrates all the smart meters used in the grid. The grid collects an enormous amount of data from these smart meters and processes it in centralized servers to control the entire grid, making it easier to control and observe from a remote location. In this setting, the smart meters act as the backbone of the smart grid, and data science is used to process the resulting data volumes, benefitting both the user and the energy supplier.

Ghosh and Grolinger (2019) investigated the merging of cloud and edge computing for IoT data analytics and proposed a deep learning (DL)-based system that defines the data processing together with ML concepts. An encoder is used at the nodes to reduce data congestion while reading data from the sensors; the reduced data from the devices makes big data feasible for the application. This data is used directly by the ML algorithm, which expands the original features with the help of the decoder present in the auto-encoder module. McHann (2013) developed a strategy of collecting data from all the nodes first and processing it further at a later stage, but this method requires data to be stored in a large number of storage devices. The large volumes of data have to be stored in the cloud to perform ML at later stages. As the data storage capacity increases, the need for technology infrastructure increases, skill sets to work in that infrastructure increase, and it also becomes expensive in terms of time and budget (Bhuiyan et al. 2017). This
chapter has been organized as follows. Section 14.2 elaborates on the machine
learning models in materials science and its application in Electronic Engineering.
It also discusses data acquisition by supervised learning and outlines accessing data
repositories and the data storage, respectively. This section describes the comparison
of the predicted and the actual values. Section 14.3 illustrates the application of
machine learning in electronic engineering. Finally, section 14.4 concludes the work and outlines future directions.

14.2. Machine learning (ML) in electronic engineering

Machine learning (ML) has a wide range of applications in the semiconductor industry. ML in this industry has two different phases of application:
– training phase – the algorithm is trained to obtain results;
– inference phase – the trained model responds to stimuli.

The application of ML in semiconductors includes learning from the data, which is acquired from databases or clouds called repositories. Numerous
researchers around the globe perform their simulations and experiments and provide
their valuable results in the common cloud to prove their integrated research
(Karmawijaya et al. 2019; Naveen Balaji et al. 2019; Balaji et al. 2020; Malini et al.
2020). By accessing the data repositories, the ML user can retrieve usable data by
using efficient algorithms and refining the model. The model can be trained in this
environment by undergoing several searches and improved by the feedback given
(Malini et al. 2020).

The various procedural steps in ML are as follows:


– acquiring data from the repository;
– handling acquired data by algorithms;
– supervised learning by predictive models;
– unsupervised learning through patterns that exist between I/O;
– experimental procedures design for materials science;
– cyberinfrastructure – data platforms.

Data acquisition is the initial step in ML where data has been extracted from the
data repositories based on the user’s search (Malini et al. 2019). The next step is
learning from the data based on mathematical applications such as correlation and
regression. Supervised learning (SL) is the method of backtracking the inputs from
the outputs, thereby establishing the relationship between the input and output
pairs (Casula et al. 2019; Gowthaman and Srivastava 2021a, 2021b; https://www.
wolframalpha.com). SL has been the most prevalent research area in the ML platform
to refine the model created. Unsupervised learning (UL) is the method of creating the
algorithm to study the pattern behaviors that exist between the input and output. The
UL is the most time-consuming process that needs a revision of the pattern and creates
the model (Kampker et al. 2018). The design of experimental procedures plays a major
role in the design of the ML model. This needs a lot of procedures and algorithms to be
used in the ML model (Moradi et al. 2020). Hence, cyberinfrastructure comes into
existence. The famous cyberinfrastructures are Citrine, The Materials Project, Wolfram

Alpha, KIM and Materials Data Facility (MDF) for materials science research
(Karpatne et al. 2017; Tanifuji et al. 2019; Liu and Shur 2020; Gowthaman and
Srivastava 2021c).

Figure 14.1. The flow of the machine learning platform for materials science (Karmawijaya et al. 2019)

The flow of the ML platform for the materials science project is illustrated in Figure 14.1. The existing data resides in the data repositories, submitted by researchers around the globe. The predictive model suggests a test run of the search for the properties of the particular material (Kampker et al. 2018). The search results are shared with the information acquisition system for further processing. The information acquisition system collects these search results and sends the data packets for verification based on the query made on the web front of the cyberinfrastructure (Liu and Shur 2020). The query is recorded in the database of the query information source register. The next step of this process is the verification of the goals against the results obtained (Karpatne et al. 2017).

14.2.1. Data acquisition

Data acquisition is the most important aspect of data analytics in the materials
science domain. The individual researcher has to devise a platform to make sure the
data acquired is legitimate and correct as per the requirements (Casula et al. 2019). Every
individual researcher has to collaborate with researchers around the world to make
an integrated search based on the big data (McHann 2013; Moradi et al. 2020). The
next step is the extraction of the result with the publication metric from the open

repositories. The cyberinfrastructures are the larger databases of information from many researchers. ML has been used to extract the required data efficiently.

Figure 14.2. Data acquisition process in ML (Gowthaman and Srivastava 2021b)

The application program interface (API) is a set of rules or even algorithms used to extract the required data from larger datasets (Hack and Papka 2015). The API is widely used by advanced ML users, while beginners usually use a web application function before the API to train the model. The efficient API model is supported by the web function created to enhance the search.
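As a rough illustration of such an API-based acquisition step, the sketch below issues a single REST query with the requests library. The endpoint URL, the API key, the parameter names and the field names are hypothetical placeholders introduced only for this example; they do not describe the actual interface of Wolfram Alpha, Citrine or any other repository mentioned in this chapter.

```python
import requests

# Hypothetical repository endpoint and key: placeholders for illustration only
REPOSITORY_URL = "https://example-materials-repository.org/api/query"
API_KEY = "YOUR_API_KEY"

def fetch_material_properties(compound, properties):
    """Query a (hypothetical) materials repository for selected properties."""
    params = {
        "apikey": API_KEY,
        "compound": compound,            # e.g. "HfO2"
        "fields": ",".join(properties),  # e.g. "dielectric_constant,band_gap"
    }
    response = requests.get(REPOSITORY_URL, params=params, timeout=30)
    response.raise_for_status()          # fail early on HTTP errors
    return response.json()               # parsed result for the ML pipeline

# Example (hypothetical field names for a high-k dielectric candidate):
# data = fetch_material_properties("HfO2", ["dielectric_constant", "band_gap"])
```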

14.2.2. Accessing the data repositories

The high-ƙ dielectrics have to be chosen for the double-gate (DG) MOSFET
designs. The selection of the high-ƙ dielectric material plays a major role in the
operation of the MOSFET with negligible short channel effects (SCEs)
(Hack and Papka 2015; Feng et al. 2019; Gowthaman and Srivastava 2021d). The
dielectric properties of the material compounds are derived from the periodic table for
analysis. This work concentrates on the web front Wolfram Alpha to predict the values
of high-ƙ dielectric for the suitable inclusion in the DG MOSFET design
(Gowthaman and Srivastava 2021d). A computational intelligence platform like
Wolfram Alpha uses ML procedures to select particular required properties of the
chemical compounds. The dielectric compounds discussed are Al2O3, HfO2, ZrO2,
La2O3, SiO2, etc. (He et al. 2018; Ghosh and Grolinger 2019). The advantages of the
data repository are that it is easy to use and predict results using an Internet platform.

But this has become labor-intensive in case of frequently used materials as they have
many combinations of compounds. Examples of application programming interfaces
(APIs) are the databases – ACK, Citrine Informatics, OQMD, Wolfram Alpha, etc.
(Chen et al. 2018).

14.2.3. Data storage and management

Data storage is done in any one of the following formats: .csv (comma separated
values), Numpy arrays, Matlab, pandas. The .csv files are simple and good for
storage in programs like MS Excel (https://www.wolframalpha.com).

Electronic simulation dataset (×10¹⁷)


Set 1 Set 2 Set 3 Set 4 Set 5 Set 6 Set 7 Set 8 Set 9
2.70 1.65 1.65 3.29 1.36 1.36 3.30 1.35 1.35
2.70 1.65 1.65 3.29 1.36 1.36 3.30 1.35 1.35
2.70 1.65 1.65 3.29 1.36 1.36 3.30 1.35 1.35
2.70 1.65 1.65 3.29 1.36 1.36 3.30 1.35 1.35
2.70 1.65 1.65 3.28 1.35 1.35 3.30 1.35 1.35
2.70 1.66 1.66 3.28 1.35 1.35 3.30 1.35 1.35
2.70 1.66 1.66 3.28 1.35 1.35 3.29 1.35 1.35
2.70 1.66 1.66 3.27 1.34 1.34 3.29 1.35 1.35
2.70 1.65 1.65 3.27 1.32 1.32 3.29 1.34 1.34
2.70 1.66 1.66 3.25 1.30 1.30 3.29 1.34 1.34

Table 14.1. Dataset based on electronic simulation

Normalized dataset (×10¹⁷) based on Table 14.1


Set 1 Set 2 Set 3 Set 4 Set 5 Set 6 Set 7 Set 8 Set 9
0.9992 0.9337 0.9337 0.9999 0.8486 0.8486 0.9999 0.8475 0.8475
0.9992 0.9337 0.9337 0.9999 0.8486 0.8486 0.9999 0.8475 0.8475
0.9992 0.9337 0.9337 0.9999 0.8484 0.8484 0.9999 0.8474 0.8472
0.9992 0.9337 0.9337 0.9999 0.8480 0.8805 0.9999 0.8472 0.8475
0.9992 0.9338 0.9338 0.9999 0.8473 0.8473 0.9999 0.8469 0.8464
0.9992 0.9345 0.9345 0.9999 0.8462 0.8462 0.9999 0.8464 0.8466
0.9992 0.9346 0.9346 0.9999 0.8443 0.8443 0.9999 0.8456 0.8469
0.9992 0.9346 0.9346 0.9999 0.8409 0.8409 0.9999 0.8445 0.8444
0.9992 0.9337 0.9337 0.9999 0.8359 0.8359 0.9999 0.8428 0.8428
0.9992 0.9346 0.9346 0.9999 0.8280 0.8280 0.9999 0.8407 0.8407

Table 14.2. Consolidated dataset after normalization



The Numpy arrays are good for mathematical operations and processing.
Sometimes, MATLAB data files can be used but this involves a large amount of
computation and system memory. The pandas file type has been used in sorting,
parsing and storage (also called the excel of python). These data formats remove
data based on logic operations and plot the values accurately (Malini et al. 2019,
2020). The data stored in the system can be used for further processing based on the
needs of the user. The dataset based on the electronic simulation, which shows the
electron density in the valley, has been tabulated in Table 14.1. The normalized data, after statistical processing of the raw data, is illustrated in Table 14.2.
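The exact statistical transformation behind Table 14.2 is not spelled out in the text; the sketch below is one plausible reading, mapping each column of the raw .csv data into (0, 1) through the normal (Gaussian) cumulative distribution function, in line with the normal distribution mentioned above. The file name and column layout are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

def normalize_dataset(csv_path):
    """Map each column of raw simulation data into (0, 1) via the normal CDF.

    Assumption: the normalization used for Table 14.2 is not specified in the
    text, so a column-wise Gaussian CDF is used here purely for illustration.
    """
    raw = pd.read_csv(csv_path)                 # e.g. columns "Set 1" ... "Set 9"
    normalized = pd.DataFrame(index=raw.index)
    for column in raw.columns:
        values = raw[column].to_numpy(dtype=float)
        mu, sigma = values.mean(), values.std(ddof=0)
        if sigma == 0.0:                        # constant column: map to 0.5
            normalized[column] = 0.5
        else:
            normalized[column] = norm.cdf(values, loc=mu, scale=sigma)
    return normalized

# Example: normalized = normalize_dataset("electron_density_1e17.csv")
```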

14.3. Electronic engineering applications – data science

The physical vapor deposition method to create the elemental compounds using
various elements has been illustrated in Figure 14.3. The marked items are the
compounds which have been used in electronic engineering and applied to the DG
MOSFETs to get rid of SCEs. The elements involved are as follows: lanthanum, gallium, aluminum and hafnium (Mohammadi et al. 2018). Their properties have been
analyzed for effective usage in electronic applications. The inkjet method of forming
the same compounds has been portrayed in Figure 14.4 for further analysis. These
elements have been displayed for the number of material level analyses to
create/deposit the material (Singh et al. 2017; Singh and Srivastava 2018;
Gowthaman and Srivastava 2021e).

Figure 14.3(a). Dataset derived from the physical vapor deposition device
for various high-ƙ dielectric and semiconductor materials. For a color
version of this figure, see www.iste.co.uk/zafeiris/data1.zip

These methods highly depend on the ionic packaging value of the particular
material. Using ML, the time taken to create the material has been reduced
drastically (Chen et al. 2018; Gowthaman and Srivastava 2021e).

Figure 14.3(b). Dataset derived from the inkjet method for various
high-ƙ dielectric and semiconductor materials. For a color version
of this figure, see www.iste.co.uk/zafeiris/data1.zip

Figure 14.4. Simulation data for energy sub-bands for valleys of the DG MOSFET.
For a color version of this figure, see www.iste.co.uk/zafeiris/data1.zip

This work mainly focuses on the easy storage of the data and its efficient
retrieval on demand. This concentrates on the large data volumes of databases called
cyber infrastructures (Feng et al. 2019). The data recovery proposed in this work is compatible with the conventional method of data retrieval, and comparison with previous research results shows good agreement with the novel data retrieval technique. The confidentiality and security of the user’s data are ensured by

normalization of the raw data. An unauthorized user cannot determine the type of data they visualize since it is in a normalized form. Hence, normalization of the data provides a substantial degree of security and confidentiality in cloud-based data queries. The electronic engineering field benefits from the usage of normalization and other statistical modeling of the raw data in order to process it. ML reduces the data processing and data storage requirements compared with non-ML-based statistical models.

Figure 14.5. Normalized dataset for energy sub-bands attained for the same.
For a color version of this figure, see www.iste.co.uk/zafeiris/data1.zip

14.4. Conclusion and future work

The enhanced database architecture and data storage facilities ease access to databases through larger queries and their assessment. Big data was introduced in this work to reduce the human effort required for data handling and analysis. The normalization of the data helps the user to create a detailed analysis in terms of the processed data. The idea of ML can be further improved in theoretical form by applying statistical models to the raw data to achieve quicker processing. Training is not required at the data user end, since the approach uses automated processing of raw data and reporting of additional queries.

14.5. References

Balaji, N., Sethupathi, M., Sivaramakrishnan, N., Theeijitha, S. (2020). EDF-VD scheduling-
based mixed-criticality cyber-physical systems in smart city paradigm. Inventive
Communication and Computational Technologies, Lecture Notes in Networks and
Systems, 89, 931–946.

Bhuiyan, S.M.A., Khan, J.F., Murphy, G.V. (2017). Big data analysis of the electric power
PMU data from the smart grid. SoutheastCon 2017, Concord, NC, USA, 30 March–
2 April 2017, pp. 1–5.
Casula, L., D’Amico, G., Masala, G., Petroni, F., Sobolewski, R.A. (2019). Performance
estimation of a wind farm with a copula dependence structure. 18th Applied Stochastic
Models and Data Analysis International Conference with Demographics Workshop,
Florence, Italy, 11–14, June 2019.
Chen, K., He, Z., Wang, S.X., Hu, J., Li, L., He, J. (2018). Learning-based data analytics:
Moving towards transparent power grids. CSEE Journal of Power and Energy Systems,
4(1), 67–82.
Feng, M., Zheng, J., Ren, J., Hussain, A., Li, X., Xi, Y., Liu, Q. (2019). Big data analytics and
mining for effective visualization and trends forecasting of crime data. IEEE Access, 7,
106111–106123.
George, A. and Groza, V. (2019). Information analytics system database for uniform approach
to continuous engineering program improvement. 15th International Conference on
Engineering of Modern Electric Systems (EMES), Oradea, Romania, 13–14, June 2019,
185–188.
Ghosh, A.M. and Grolinger, K. (2019). Deep learning: Edge-cloud data analytics for IoT.
IEEE Canadian Conference of Electrical and Computer Engineering (CCECE),
Edmonton, AB, Canada, 5–8 May 2019, pp. 1–7.
Gowthaman, N. and Srivastava, V.M. (2021a). Analysis of n-type double-gate MOSFET
(at nanometer scale) using high-K dielectrics for high-speed applications.
44th International Spring Seminar on Electronics Technology, Advancements in
Microelectronics Packaging for Harsh Environment, Dresden, Germany, 6–7, May 2021,
130–131.
Gowthaman, N. and Srivastava, V.M. (2021b). Analysis of InN/La2O3 twosome for
double-gate MOSFETs for radio frequency applications. Third International Conference
on Materials Science and Manufacturing Technology (ICMSMT 2021), Coimbatore,
India, 8–9 April 2021, 1–10.
Gowthaman, N. and Srivastava, V.M. (2021c). Dual gate material (Au and Pt) based
double-gate MOSFET for high-speed devices. IEEE Latin America Electron Devices
Conference (LAEDC), Mexico, 19–21 April 2021, 1–4.
Gowthaman, N. and Srivastava, V.M. (2021d). Design of hafnium oxide (HfO2) sidewall in
InGaAs/InP for high-speed electronic devices. International Conference on Materials
Sciences and Nanomaterials, London, UK, 12–14 July 2021, 1–6.
Gowthaman, N. and Srivastava, V.M. (2021e). Capacitive modeling of cylindrical
surrounding double-gate MOSFETs for hybrid RF applications. IEEE Access, 9,
89234–89242.
Hack, J.J. and Papka, M.E. (2015). Big data: Next-generation machines for big science.
Computing in Science & Engineering, 17(4), 63–65.

He, X., Chu, L., Qiu, R.C., Ai, Q., Ling, Z. (2018). A novel data-driven situation awareness
approach for future grids – Using large random matrices for big data modeling. IEEE
Access, 6, 13855–13865.
Kampker, A., Kreisköther, K., Büning, M.K., Möller, T., Windau, S. (2018). Exhaustive
data- and problem-driven use case identification and implementation for electric drive
production. 8th International Electric Drives Production Conference (EDPC),
Schweinfurt, Germany, 4–5 December 2018, 1–8.
Karmawijaya, M.I., Nashirul Haq, I., Leksono, E., Widyotriatmo, A. (2019). Development of
big data analytics platform for electric vehicle battery management system.
6th International Conference on Electric Vehicular Technology (ICEVT), Bali, Indonesia,
18–21, November 2019, 151–155.
Karpatne, A., Atluri, G., Faghmous, J.H., Steinbach, M., Banerjee, A., Ganguly, A.,
Shekhar, S., Samatova, N., Kumar, V. (2017). Theory-guided data science: A new
paradigm for scientific discovery from data. IEEE Transactions on Knowledge and Data
Engineering, 29(10), 2318–2331, 1 October 2017.
Liu, X. and Shur, M.S. (2020). TCAD model for TeraFET detectors operating in a large
dynamic range. IEEE Transactions on Terahertz Science and Technology, 10(1), 15–20.
Malini, P., Poovika, T., Shanmugavadivu, P., Priya, I.R.P., Balaji, G.N., Rajotiya, R.N.,
Kumar, A., Mashette, G. (2019). 22nm 0.8V strained silicon-based programmable MISR
under various temperature ranges. American Institute of Physics – CF, 2087(020004),
020004-1–020004-12.
Malini, P., Kokila, S., Karthiga, M., Naveen Balaji, G. (2020). Design of hybrid full adder
using full swing and non-full swing XOR XNOR gates. TEST Engineering and
Management, January–February 2020, 2778–2787.
McHann, S.E. (2013). Grid analytics: How much data do you really need? IEEE Rural
Electric Power Conference (REPC), Stone Mountain, GA, USA, 28 April–1 May 2013,
C3–1–C3–4.
Mohammadi, M., Al-Fuqaha, A., Sorour, S., Guizani, M. (2018). Deep learning for IoT big
data and streaming analytics: A survey. IEEE Communications Surveys & Tutorials,
20(4), 2923–2960.
Moradi, J., Shahinzadeh, H., Nafisi, H., Marzband, M., Gharehpetian, G.B. (2020). Attributes
of big data analytics for data-driven decision making in cyber-physical power systems.
14th International Conference on Protection and Automation of Power Systems (IPAPS),
Tehran, Iran, 83–92.
Naveen Balaji, G., Karthiga, M., Swetha, D., Suchitra, M. (2019). Low power design of 0.8V
based 8 bit content addressable memory using MSML implemented in 22nm technology
for aeronautical applications. International Journal of Recent Technology and
Engineering, 8(2S11), 2688–2694.
Singh, M. and Srivastava, V.M. (2018). An analysis of key challenges for adopting the cloud
computing in the Indian education sector. Communications in Computer and Information
Science, 905(1), 439–448, Chapter 44, Springer, Singapore.

Singh, M., Srivastava, V.M., Gaurav, K., Gupta, P.K. (2017). Automatic test data generation
based on multi-objective ANT LION optimization algorithm. 28th Annual Symposium of
the Pattern Recognition Association of South Africa and 10th Robotics and Mechatronics
International Conference of South Africa (PRASA-RobMech-2017), Bloemfontein,
South Africa, 30 November–1 December 2017, 168–174.
Sulaiman, S.M., Jeyanthy, P.A., Devaraj, D. (2019). Smart meter data analysis issues: A data
analytics perspective. IEEE International Conference on Intelligent Techniques in
Control, Optimization and Signal Processing (INCOS), Tamilnadu, India, 11–13, April
2019, 1–5.
Tanifuji, M., Matsuda, A., Yoshikawa, H. (2019). Materials data platform – A FAIR system
for data-driven materials science. 8th International Congress on Advanced Applied
Informatics (IIAI-AAI), 7–11 July 2019, 1021–1022.
15

Stochastic Runge–Kutta Solvers Based on Markov Jump Processes and Applications to Non-autonomous Systems of Differential Equations

We present a solver for non-autonomous systems of ordinary differential equations


based on the approximation principle by Markov jump processes. Each jump occurs
after an exponentially distributed random waiting time which is intrinsically adapted,
being computed in dependence on the current state, and is scalable by a given
factor which controls the precision. The step function computed by simulating the
jump processes can serve as a predictor which is further improved by suitable
correction steps, which can be described as Picard iterations followed by Runge–Kutta
approximations. The correction steps are applied after every jump of the original
process, and the final result is a high precision scheme with several layers, which
starts from the crude approximation delivered by the standard jump process, and based
on this data, it computes several steps in which the approximations are successively
refined.

15.1. Introduction

The numerical method presented in this chapter is based on the connection between
the infinitesimal generators of Markov jump processes and corresponding differential
equations. The theoretical background for this property is presented in Ethier and
Kurtz (1986).

Chapter written by Flavius GUIAŞ.


The transitions X → X′ of a Markov jump process X with state space (E, d) (a Polish space, for example Rⁿ or a space of Radon measures) take place according to a transition kernel r(X, dX′). With the infinitesimal generator (ΛΨ)(X(t)) = ∫_E (Ψ(X′) − Ψ(X)) r(X, dX′) for all bounded continuous functions Ψ defined on E, we have the following martingale characterization of the dynamics: Ψ(X(t)) = Ψ(X(0)) + ∫_0^t (ΛΨ)(X(s)) ds + M_Ψ(t), with the deterministic trend (ΛΨ)(X(t)) and the trendless stochastic noise (martingale) M_Ψ, i.e. E[M_Ψ(t) | F_s] = M_Ψ(s) for s ≤ t.

If we have a family of Markov jump processes X_N indexed by the natural number N with the property that E[(M_N(t))²] → 0, then, under further suitable additional conditions, we have a convergence result X_N → X (in mean square or in distribution), where the limit process X solves the deterministic equation Ψ(X(t)) = Ψ(X(0)) + ∫_0^t (ΛΨ)(X(s)) ds for any Ψ ∈ C_b(E) (or in a given subclass).

A special choice of the Markov jump process, which approximates the solution
of the system of ordinary differential equations Ẋ = F (t, X), delivers the so
called direct simulation method, where at every jump, only one component of the
vector-valued process is changed. The selection of this component occurs by sampling
according to a probability table, the probability for component i being proportional
to |Fi (t, X)|. Based on this first approximation, Guiaş and Eremeev (2016), Guiaş
(2017) and Guiaş (2019) presented several improvements suitable in principle for
autonomous systems Ẋ = F (X).

In this chapter, we present a further development of this family of methods which is also suitable for non-autonomous systems. Given the improved approximation X*(t) at time t, we first perform a direct simulation step. That is, we first compute a random time interval Δt which is exponentially distributed with parameter λ = N · Σ_i |F_i(t, X)| and sample the component i which is changed by ±1/N, depending on the sign of F_i(t, X). The improved approximation at time t + Δt is computed next by an integral scheme of the form

X*(t + Δt) = X*(t) + ∫_t^{t+Δt} Q(s) ds.

The choice Q = F (·, X̃(·)) corresponds to a Picard iteration. We can also take
a deterministic value for the time step, equal to the expected value 1/λ of the
exponentially distributed time step. After this first correction step with the result
denoted by X̄, we can apply further correction steps of the Runge–Kutta type by
taking for Q a polynomial which interpolates between the values F (X ∗ (t)) and
F (X̄(t + Δt)), optionally with an additional intermediate point F (X̄(t + Δt/2)).

The detailed description of the method is presented in section 15.2, and


applications are illustrated in section 15.3. The first example is the Lorenz system,

which is a well-known example of a chaotic system of differential equations. The


main challenge of this autonomous system is to compute an accurate solution at large
values of time. The second example is a large system of ordinary differential equations
obtained by a discretization of a parabolic partial differential equation modeling an
ignition process. The main feature is the appearance of a steep temperature gradient
which moves with high speed through the spatial domain. By choosing certain
coefficients to be non-constant, the obtained system will be non-autonomous. The
final section of this chapter is dedicated to conclusions and comments.

15.2. Description of the method

15.2.1. The direct simulation method

The basic stochastic direct simulation scheme for systems of ordinary differential equations Ẋ = F(t, X), which is then successively improved, delivers paths of a Markov jump process X̃(·). Its feature is that at every jump, only one component of the process is changed by a fixed amount ±1/N, which can be interpreted as the resolution of the method. The component i, which is chosen to be changed in the next step, is selected at random with a probability proportional to |F_i(t, X̃(t))|. The steps of the direct simulation method are the following:
direct simulation method are the following:

While t ≤ t_max do:

1) Given the state vector X̃(t) of the process at time t.
2) Select a component X_i with probability proportional to |F_i(t, X̃(t))|.
3) The time step Δt = −log U/λ, with U uniformly distributed on (0, 1), is then exponentially distributed with parameter λ = N Σ_{i=1}^n |F_i(t, X̃(t))|.
4) Update the value of the selected component: X̃_i → X̃_i + (1/N) sign(F_i(t, X̃)), and set the new time as t = t + Δt.
5) Update the values of F_j(t, X̃(t)) for all j.
6) GOTO 1.
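A minimal Python sketch of this loop is given below; the function name, the argument layout and the stopping rule at a stationary point are illustrative choices, while the component selection and the exponential waiting time follow steps 2)–4) above.

```python
import numpy as np

def direct_simulation(F, x0, t0, t_max, N, rng=None):
    """Direct stochastic simulation of dX/dt = F(t, X) by a Markov jump process.

    Each jump changes one component by +/-1/N; the waiting time is exponential
    with rate lambda = N * sum_i |F_i(t, X)|. Returns jump times and the path.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    t = t0
    times, path = [t], [x.copy()]
    while t <= t_max:
        f_vals = F(t, x)
        rates = np.abs(f_vals)
        lam = N * rates.sum()
        if lam == 0.0:                                   # stationary point: stop
            break
        i = rng.choice(len(x), p=rates / rates.sum())    # step 2: pick component
        dt = -np.log(rng.uniform()) / lam                # step 3: waiting time
        x[i] += np.sign(f_vals[i]) / N                   # step 4: jump of size 1/N
        t += dt
        times.append(t)
        path.append(x.copy())
    return np.array(times), np.array(path)
```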

15.2.2. Picard iterations

Writing the ODE system on the time interval [t, t + Δt] in the integral form yields:

X(t + Δt) = X(t) + ∫_t^{t+Δt} F(s, X(s)) ds.

Assuming that X̄(t) is an approximation for the exact solution X(t) and that we
have simulated a path X̃(s), t ≤ s ≤ t + Δt of the Markov jump process, we can use

these data in order to compute an approximation for X(t + Δt) which improves the
crude result X̃(t + Δt). This is done by a Picard iteration:
X̄(t + Δt) = X̄(t) + ∫_t^{t+Δt} F(s, X̃(s)) ds.
In the case of autonomous systems, where F does not explicitly depend on t, the
integral is that of a step function and can be computed effectively by updating its value
after every jump of the Markov process X̃(·). This approach was used in Guiaş and
Eremeev (2016), Guiaş (2017), Guiaş (2019) for autonomous systems.

In this chapter, we consider non-autonomous systems and a Picard iteration is


therefore possible only if Δt is the time between two jumps of the Markov process.
During the time interval [t, t + Δt), the value of the process remains unchanged, X̃(s) = X̃(t); therefore, the integral takes the form ∫_t^{t+Δt} F(s, X̃(t)) ds and can be computed approximately by a numerical scheme such as the Simpson method, since the function F and the value X̃(t) are explicitly known.
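As a sketch, one such Picard correction over a single inter-jump interval can be written as below; the integrand is explicit because the jump process is frozen at X̃(t) on [t, t + Δt), and Simpson's rule is used as suggested above. The function and argument names are illustrative.

```python
def picard_correction(F, x_bar_t, x_tilde_t, t, dt):
    """One Picard correction step over [t, t + dt].

    The jump process is constant on [t, t + dt), so s -> F(s, x_tilde_t) is an
    explicit function of s and is integrated here with Simpson's rule.
    """
    integral = (dt / 6.0) * (
        F(t, x_tilde_t)
        + 4.0 * F(t + dt / 2.0, x_tilde_t)
        + F(t + dt, x_tilde_t)
    )
    return x_bar_t + integral
```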

15.2.3. Runge–Kutta steps

A further improvement of the precision of the above schemes is the employment of Runge–Kutta steps of the general form

X*(t + Δt) = X*(t) + ∫_t^{t+Δt} Q(s) ds,

where X ∗ (t) is a given approximation for the exact solution X(t), Δt is the
time between two consecutive jumps and Q(s) is a vector of polynomials which
approximates the exact term F (s, X(s)). Note that in the case of Picard iterations,
we considered Q(s) = F (s, X̃(s)), i.e. we used the path of the simulated Markov
jump process. In this case, by denoting X̄(t + Δt) a predictor computed by one of the
previous methods, the polynomial Q(s) used by the Runge–Kutta steps interpolates
between the values F (t, X ∗ (t)) and F (t + Δt, X̄(t + Δt)), optionally with an
additional intermediate point F (t + Δt/2, X̄(t + Δt/2)).

For the RK2-method, we consider linear interpolation between the values at the
boundaries of the time interval and we therefore have:
X*(t + Δt) = X*(t) + (Δt/2) [F(t, X*(t)) + F(t + Δt, X̄(t + Δt))]   [15.1]
This method is similar to the classical second-order Heun method, and we denote it
therefore by RK2. Note that X̄(t + Δt) can be any predictor: the value of the Markov
jump process, the approximation obtained by Picard iteration or an approximation
obtained by another RK2 step based, for example, on a previously performed Picard
step. This variant of the scheme can therefore be applied in several layers.

Within this framework, for approximations also using values at the midpoint t +
Δt/2 of the time interval, we therefore have several options. The general scheme
RK23 has the form:
X*(t + Δt) = X*(t) + (Δt/6) [F(t, X*(t)) + 4F(t + Δt/2, X̄(t + Δt/2)) + F(t + Δt, X̄(t + Δt))]   [15.2]
By X ∗ (·), we denote the final, i.e. the best approximation for the solution at
the given time points, while X̄(·) are predictors computed by a combination of the
previously described Picard and RK2 methods (theoretically also by direct simulation,
but in this case, the precision is lower, so we do not make this choice here). This
method shows a similarity to the usual Runge–Kutta method of order 3 and, since
for the predictors we use the RK2 scheme, we denote it by RK23. For the value
X̄(t + Δt), we have two options: we compute this value applying Picard and two RK2
steps starting either directly from X ∗ (t), an algorithm which is denoted by RK23_1,
or starting from the best approximation at t+Δt/2, an algorithm denoted by RK23_2.

We can also apply the classical fourth-order Runge–Kutta scheme on the adapted time interval Δt, which is either exponentially distributed with parameter λ = N · Σ_i |F_i(t, X*(t))|, or deterministic, equal to the expected value 1/λ.

The RK method therefore has the form:

X*(t + Δt) = X*(t) + (Δt/6) (F(t, X*(t)) + 2k₂ + 2k₃ + k₄)   [15.3]

where

X₂ = X*(t) + (Δt/2) · F(t, X*(t)),   k₂ = F(t + Δt/2, X₂)
X₃ = X*(t) + (Δt/2) · k₂,   k₃ = F(t + Δt/2, X₃)
X₄ = X*(t) + Δt · k₃,   k₄ = F(t + Δt, X₄).
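The correction formulas [15.1] and [15.3] can be sketched in Python as follows; the helper names are illustrative, and the predictor x_bar_next is assumed to come from the jump process, a Picard step or a previous RK2 step, as described above.

```python
def rk2_step(F, x_star, x_bar_next, t, dt):
    """Heun-type correction, equation [15.1], on the adapted time interval dt."""
    return x_star + (dt / 2.0) * (F(t, x_star) + F(t + dt, x_bar_next))

def rk4_step(F, x_star, t, dt):
    """Classical fourth-order Runge-Kutta step, equation [15.3], on the adapted interval dt."""
    k1 = F(t, x_star)
    x2 = x_star + (dt / 2.0) * k1
    k2 = F(t + dt / 2.0, x2)
    x3 = x_star + (dt / 2.0) * k2
    k3 = F(t + dt / 2.0, x3)
    x4 = x_star + dt * k3
    k4 = F(t + dt, x4)
    return x_star + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
```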
Some applications of these schemes are illustrated in the next section.

15.3. Numerical examples

15.3.1. The Lorenz system

The ODE system introduced by Edward Lorenz is a standard example for


illustrating chaotic behavior:
x′ = σ(y − x)
y′ = rx − y − xz
z′ = xy − bz
204 Data Analysis and Related Applications 1

A small change in the initial data leads to a totally different trajectory at larger
values of time. The challenge is therefore to compute an accurate solution at time
values as large as possible, since all numerical schemes induce small approximation
errors which in this case may propagate very fast. Kehlet and Logg (2010) presented a
high precision reference solution at large values of time for σ = 10, b = 8/3, r = 28
and initial data x(0) = (1, 0, 0).
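For reference, the right-hand side used in this example can be coded as below with the parameter values quoted above; the function name and the resolution N suggested in the usage comment are illustrative choices, not values taken from the text.

```python
import numpy as np

SIGMA, B, R = 10.0, 8.0 / 3.0, 28.0   # parameters used in the text

def lorenz_rhs(t, state):
    """Right-hand side F(t, X) of the Lorenz system (autonomous, so t is unused)."""
    x, y, z = state
    return np.array([
        SIGMA * (y - x),
        R * x - y - x * z,
        x * y - B * z,
    ])

# Initial data x(0) = (1, 0, 0) from the text; e.g. with the earlier sketch:
# times, path = direct_simulation(lorenz_rhs, [1.0, 0.0, 0.0], 0.0, 40.0, N=10**6)
```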

The results of the methods introduced in this chapter are presented in Figure 15.1:

Figure 15.1. Efficiency comparison (error in the ‖·‖₁-norm). For a color version of this figure, see www.iste.co.uk/zafeiris/data1.zip

The error is taken in the ‖·‖₁-norm compared to the reference solution. At times
tmax = 30 or tmax = 40, the solvers RK and RK23_1 have a similar precision to
the MATLAB solvers ode45 and ode113 used for comparison, but the latter ones turn
out to be faster due to the performance of internal MATLAB routines. However, for
tmax = 50, the RK solver delivers better results than the MATLAB solvers (smaller
error and comparable computation time). The RK23_1 solver has a similar error with
the MATLAB solvers but needs a longer computational time.

15.3.2. A combustion model

The following reaction–diffusion equation models an ignition process:

du/dt = Δu + (5 exp(δ(t))/δ(t)) (2 − u) exp(−δ(t)/u)

with initial condition u₀ ≡ 1. The equation is considered on (0, 1) with boundary conditions ∂_ν u(0) = 0 and u(1) = 1. The final time is t_max = 0.247.

We assume that δ(t) = 30 + t² + sin(50t) and solve the non-autonomous ODE system obtained by finite-difference discretization with irregular (random) grids, in order to artificially enhance the difficulty of the numerical computations.
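A minimal sketch of the semi-discretized right-hand side is given below. It assumes a standard three-point finite-difference Laplacian on a non-uniform grid of interior points; the mirrored ghost node used for the Neumann condition at x = 0 and the handling of the Dirichlet value u(1) = 1 are illustrative choices rather than the exact discretization used for the reported results.

```python
import numpy as np

def delta(t):
    """Time-dependent coefficient from the text: delta(t) = 30 + t^2 + sin(50t)."""
    return 30.0 + t**2 + np.sin(50.0 * t)

def make_combustion_rhs(x):
    """Build F(t, u) for the semi-discretized reaction-diffusion equation.

    x: interior grid points in (0, 1), possibly irregular. Ghost nodes implement
    du/dx = 0 at x = 0 (mirror node) and u = 1 at x = 1 (illustrative choices).
    """
    def rhs(t, u):
        d = delta(t)
        u_ext = np.concatenate(([u[0]], u, [1.0]))       # ghost values
        x_ext = np.concatenate(([-x[0]], x, [1.0]))      # mirrored left node
        h_minus = x_ext[1:-1] - x_ext[:-2]               # x_i - x_{i-1}
        h_plus = x_ext[2:] - x_ext[1:-1]                 # x_{i+1} - x_i
        lap = 2.0 * (h_minus * u_ext[2:] - (h_minus + h_plus) * u_ext[1:-1]
                     + h_plus * u_ext[:-2]) / (h_minus * h_plus * (h_minus + h_plus))
        reaction = (5.0 * np.exp(d) / d) * (2.0 - u) * np.exp(-d / u)
        return lap + reaction
    return rhs

# Example: rhs = make_combustion_rhs(np.sort(np.random.uniform(0.0, 1.0, 200)))
```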

We consider first a grid that consists of n = 200 points, with h_min = 3.2 · 10⁻³ and h_max = 2.28 · 10⁻².

The results are shown in Figure 15.2.

Figure 15.2. Efficiency comparison (error in the max-norm) for n = 200. For a color version of this figure, see www.iste.co.uk/zafeiris/data1.zip

The results of the MATLAB solvers ode45 and ode113 (set at the maximum
possible precision) are compared with those of the solvers RK, RK23_1 and RK23_2.
We note that the solvers presented in this chapter basically show similar behavior
and that they can achieve similar precision to the MATLAB solvers, but at a longer
computational time.

Next we consider a grid of n = 400 points, with h_min = 2 · 10⁻⁴ and h_max = 2.7 · 10⁻². As a reference solution, we consider the results of the MATLAB solver ode15s, set to maximum precision (note that this is considered a rather imprecise solver, used for stiff problems, as in this case).

A comparative run of different solvers delivers the following results:

– MATLAB solver ode23s – error: 2 · 10⁻⁶;
– MATLAB solver ode113 – imprecise/out of memory;
– RK solver – error: 9.65 · 10⁻⁷;
– RK23_1 solver – error: 2.12 · 10⁻⁷.

This framework of 400 grid points with large differences of magnitude between the spatial step sizes (a range between 2 · 10⁻⁴ and 2.7 · 10⁻²) proves to be more difficult for high-precision MATLAB solvers like ode113, which is unable to perform the computation due to memory problems. Only the solvers ode15s and ode23s, suited for stiff problems, manage to compute a solution. However, the solvers RK and RK23_1 show better precision than the MATLAB solvers, the latter one performing best for this problem.

15.4. Conclusion

Starting from the direct simulation method using Markov jump processes,
we developed a class of numerical schemes suited for non-autonomous ordinary
differential equations. After every jump of the underlying process, which occurs on
an adapted time interval, scalable by a given factor which controls the magnitude of
the jumps, we performed Picard iterations and different variants of Runge–Kutta steps.
The result turns out to be a highly efficient scheme, relatively easy to implement, with
precision similar to or even better than that delivered by standard MATLAB solvers.

15.5. References

Ethier, S.N. and Kurtz, T.G. (1986). Markov Processes. Characterization and Convergence.
John Wiley & Sons, New York.
Guiaş, F. (2017). Stochastic Picard–Runge–Kutta solvers for large systems of autonomous
ordinary differential equations. Proceedings of the Fourth International Conference on
Mathematics and Computers in Sciences and in Industry (MCSI), 298–302, Corfu, Greece.
Guiaş, F. (2019). High precision stochastic solvers for large autonomous systems of differential
equations. International Journal of Mathematical Models and Methods in Applied Sciences,
13, 60–63.
Guiaş, F. and Eremeev, P. (2016). Improving the stochastic direct simulation method
with applications to evolution partial differential equations. Applied Mathematics and
Computation, 289, 353–370.
Kehlet, B. and Logg, A. (2010). A reference solution for the Lorenz system on [0,1000]
[Online]. Available at: https://doi.org/10.1063/1.3498141.
16

Interpreting a Topological Measure of Complexity for Decision Boundaries

We propose a method to examine the decision boundaries of classification


algorithms to yield insight into the nature of overfitting. In machine learning,
model evaluation can be performed via two common techniques: train–test split or
cross-validation. In this chapter, we expand this toolkit to include tools from the field
of topological data analysis. In particular, we use persistent homology, which roughly
characterizes the shape of a data set.

Our method focuses on binary classification, using training data to sample points
on the decision boundary of the feature space. We then calculate the persistent
homology of this sample and compute metrics to quantify the complexity of the
decision boundary. Our experiments with data sets in various dimensions suggest that
in certain cases, our measures of complexity are correlated with a model’s ability
to generalize to unseen data. We hope that refining this method will lead to a better
understanding of overfitting and a means to compare models.

16.1. Introduction

In this chapter, we introduce and investigate the usage of a toolkit known as


topological data analysis (TDA) to deeply understand classification algorithms by
inspecting decision boundaries, which can yield signs of overfitting, among other
information. This toolkit is motivated by the notion that data intrinsically has a shape,
and that this shape can be recovered computationally; in this chapter, we determine the

Chapter written by Alan HYLTON, Ian LIM, Michael MOY and Robert SHORT.
For a color version of all the figures in this chapter, see www.iste.co.uk/zafeiris/data1.zip.


shape of the decision boundary of a classification algorithm and interpret the results.
This geometric approach has become a recurrent theme in machine learning and data
science: our contribution is development and formalization towards a technique that
relies on TDA to evaluate a trained neural network independent of a validation data
set. We include an introduction to our chosen computational approach to TDA, as well
as evidence that it is sensitive to the shape of a decision boundary.

The choice of neural networks is not surprising, as they are the foremost drivers in
the demand for applying pure mathematics to applied problems, including algebraic
topology. This is largely due to their great ability to learn complex relationships in
data sets, as well as the difficulty to “look under the hood”. We hasten to add that
these methods generally apply to classification algorithms, and with the ever-growing
presence of machine learning in everyday applications, it will be crucial to gain a
better understanding of their functions and reliability.

There are several methods of machine learning model evaluation such as train/test
split, cross-validation and various metrics for accuracy, all of which give some sort of
indication of how a model will generalize to unseen data. Ideally, we would like to
determine the efficacy of a model based on the training data alone. With this in mind,
measures of accuracy would not be reliable in the event the model is biased towards
the training data – this is what motivates us to seek to analyze the decision boundaries
learned by models.

Figure 16.1. Example decision boundaries

In Figure 16.1, suppose the green and black lines represent two decision
boundaries determined by a binary classification model. The green line represents a
model that achieves perfect training accuracy but is likely biased towards the training
data. The black line represents a model with some flaws, but which will likely perform
better on unseen data. It is possible to distinguish this difference in complexity
between models visually in low-dimensional data sets, but it can also be viewed via

techniques from TDA for data sets in higher dimensions. A member of the toolkit
of TDA known as persistent homology, which is constructed from ideas in algebraic
topology, estimates the shape of a data set by detecting connected components, holes,
voids and other higher-dimensional features. In Figure 16.1, persistent homology can
be used to measure the complexity of the shapes (lines) at play – this will be made
more clear in section 16.2.

Our study shows that persistent homology can distinguish between decision
boundaries such as the examples pictured in Figure 16.1. We provide evidence that
as a model becomes biased towards the training data set, it is notable via its decision
boundary, including in higher-dimensional data sets. We focus on neural networks
because of the complex relationships they are capable of learning and because their
iterative training process allows us to view the changes in a decision boundary as a
network trains. A proposed metric, which summarizes the topological complexity, is
demonstrated on synthetic data sets with varying amounts of noise. As the noise in a
data set increases, this is indicated via our metric; indeed, noisier data sets are more
likely to lead to overfit models, and this tendency can be observed through our metric.

Our work is not alone in incorporating persistent homology into machine learning.
Methods have recently been proposed to transform persistence diagrams into a more
suitable input for machine learning algorithms: see Adams et al. (2017), Hofer et al.
(2018) and Zhao and Wang (2019). The interest in applying persistent homology to
machine learning tasks comes from its ability to provide a summary of the geometric
and topological features of a data set. Applications come from a variety of fields: see
for instance Carlsson et al. (2008), Bendich et al. (2016) and Motta et al. (2018).
Other work has used persistent homology to analyze machine learning algorithms,
such as Naitzat et al. (2020), and work similar to ours can be found in Varshney et al.
(2015), Chen et al. (2019) and Ramamurthy et al. (2019). The main difference in our
work is the explicit sampling of a decision boundary to describe its shape.

16.2. Persistent homology

In Figure 16.2, the points are sampled from a circle. Instead of recovering the circle
in its entirety – location and radius – we are more interested in the fact that there is
a circle, which has a line and a single one-dimensional hole in the middle. Moreover,
while this is visible for this curve, we want to recover this type of information
for less-familiar shapes of arbitrary dimensions. This is the function of persistent
homology; the member of the TDA toolkit we will use quantifies the shape of a
decision boundary. We give a brief overview here; see Ghrist (2008) for a more
thorough introduction and Zomorodian and Carlsson (2005) and Edelsbrunner and
Harer (2010) for mathematical details. We will consider data sets consisting of points
in some Euclidean space Rn . We begin by building additional structures on such a
point set, in order to study the shape outlined by the points without any other given
information.

A simplex is a generalization of a triangle to an arbitrary dimension. A 0-simplex is a point, a 1-simplex is a line segment, a 2-simplex is a triangle and, in general, a k-simplex is the convex hull of k + 1 vertices in Rᵏ. A simplicial complex is formed by attaching any number of simplices together along their faces, and an example is shown in Figure 16.3. Beginning with a data set X ⊂ Rⁿ, we can form simplicial complexes with data points as vertices and simplices formed between data points that are close enough together. One standard method is to use the Vietoris–Rips complex, VR(X; r), which forms a simplex for any set of points {x₁, . . . , xₘ} ⊆ X such that ‖xᵢ − xⱼ‖ ≤ r for all i and j.

Figure 16.2. Sampling of a circle

Figure 16.3. A simplicial complex



With this ability to construct simplicial complexes on data sets, we would like
to have a systematic way to describe the shape of the data set. Forming a quantitative
description of a shape poses a challenge. Another challenge is to choose an appropriate
r that determines the simplicial complex VR(X; r): without prior knowledge about
the data set, there is no clear way to choose an r such that the resulting simplicial
complex accurately depicts the data. Persistent homology offers a solution to both
of these problems. It is based on homology, which is a method from the field
of algebraic topology to characterize the holes in a space. If K is a simplicial
complex, then roughly speaking, the homology vector space Hk (K) has a dimension
equal to the number of k-dimensional holes in K. For instance, a zero-dimensional
hole is a connected component, a one-dimensional hole is a circular hole and a
two-dimensional hole is a spherical hole.

Given a data set X ⊂ Rn , homology gives a concrete description of the


shape of a simplicial complex built on X. We now return to the problem of
choosing a parameter r. The solution offered by persistent homology is to avoid
choosing a single parameter, and instead consider a sequence of parameters r1 <
r2 < · · · < rN . This leads to a sequence of nested Vietoris–Rips complexes
VR(X; r1 ) ⊆ VR(X; r2 ) ⊆ · · · ⊆ VR(X; rN ), as shown in Figure 16.4. We can
imagine the simplicial complex growing over time as more simplices are added in.

Figure 16.4. Example of a sequence of Vietoris–Rips complexes

Persistent homology starts by considering the homology of each of these


complexes, and then summarizes this information by tracking each individual hole
through the sequence. A k-dimensional hole that appears at parameter ri corresponds
to a nonzero element of Hk (VR(X; ri )), and this element is said to be born at
parameter ri . Tracking this hole through the sequence of complexes, persistent
homology identifies when the hole collapses or merges with a previous hole; we
say the element dies at the parameter when this happens. By recording the birth and
death times of all holes, we end up with a reductive summary of the shape of the
data set called a persistence diagram, which is the output of persistent homology. A
persistence diagram plots a point (b, d) in R2 for each hole, where b is the birth time
and d is the death time. Thus, all points are above the diagonal line. Points that are far
above the diagonal indicate features with long lifetimes, and these can sometimes be

used to understand the global shape of the data set. On the other hand, a significant
number of points near the diagonal can contain information about small-scale features
of the data set. An example of a persistence diagram generated from a uniform random
sample of points on the surface of a sphere is shown in Figure 16.5. The sphere
has one zero-dimensional hole, signifying that it has one connected component, no
one-dimensional holes and one two-dimensional hole (void). Thus, in Figure 16.5, we
see evidence that this shape is sphere-shaped – at least in the eyes of topology. Indeed,
there is one blue point high above the diagonal showing that the dimension of H0 is
likely one; there are no orange points showing strong evidence for H1 , and the blue
point for H2 high above the diagonal identifies the two-dimensional void: the data set
most likely outlines a sphere. We can also note that the single point in H0 at height ∞
records the fact that one connected component never dies.

Figure 16.5. Persistence diagram of a sample of a sphere in R3

Our study makes use of the Scikit-TDA (Saul and Tralie 2019) python library,
and specifically the Ripser package, which computes persistent homology using
Vietoris–Rips complexes as described above. We aim to use persistent homology to
characterize the complexity of a set of points; rather than use an entire persistence
diagram, we simplify further to the sum of lifetimes of the points in a diagram. Thus,
for a point set X ⊂ Rn , we define

S_k(X) = Σ_{(b,d) ∈ dgm_k(X), d < ∞} (d − b)   [16.1]

where dgmk (X) is the persistence diagram for X resulting from k-dimensional
persistent homology using Vietoris–Rips complexes. We focus on S0 and S1 , as
persistent homology can be computed quickly in dimensions 0 and 1.
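A short sketch of this computation with the Ripser package is given below; the helper name is illustrative, and infinite death times (such as the ever-living connected component in H0) are excluded, as in equation [16.1].

```python
import numpy as np
from ripser import ripser

def persistence_sums(points, maxdim=1):
    """Compute S_k of equation [16.1] for k = 0, ..., maxdim.

    points: (m, n) array of samples in R^n (e.g. a decision boundary sample).
    Returns [S_0, S_1, ...], the sums of finite lifetimes per dimension.
    """
    diagrams = ripser(points, maxdim=maxdim)["dgms"]
    sums = []
    for dgm in diagrams:
        finite = dgm[np.isfinite(dgm[:, 1])]      # drop points with d = infinity
        sums.append(float(np.sum(finite[:, 1] - finite[:, 0])))
    return sums

# Example: S0, S1 = persistence_sums(boundary_sample)
```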

16.3. Methodology

We seek to use persistent homology to characterize the complexity of a decision


boundary produced by a neural network. In particular, we aim to observe the decision
boundary of a network as it trains excessively and becomes overfit. Our experiments
were conducted on synthetic data sets of varying dimensions and levels of noise; in
each case, we generated a training data set and a separate test data set. For each
test, numerous neural networks were trained until they each reached a high level
of accuracy on the training data. At regular intervals during the training process,
determined by a set number of epochs, we determined a sample of points from
the decision boundary and calculated the persistent homology of this sample. Using
equation [16.1], S0 and S1 were computed for each sample and compared to the
model’s accuracy on training and test data.

16.3.1. Neural networks and binary classification

For each neural network, we construct a wide and deep-layer architecture with
hidden layers that are equipped with the ReLU activation function. Since we only
consider binary classification, the output layers contain two nodes with the softmax
activation function. We use TensorFlow (Abadi et al. 2015) to train neural networks.
In each example, we use neural networks that are larger than necessary to allow them
to overfit, since our goal is to examine the decision boundary of overfit networks.

For emphasis, we recall that our techniques can also be applied to other
classification algorithms; we only use neural networks as an example. We also focus
on binary classification for simplicity, as this leads to a single decision boundary
between two classes. The ideas presented here can likely be generalized to work for
data sets divided into more than two classes, but we will leave it to future work to
determine how this can best be done.

16.3.2. Persistent homology of a decision boundary

We will consider binary classification of a data set consisting of points in some
Euclidean space Rn . A binary classification algorithm uses training data to construct a
function that places points of Rn into two classes, say class 0 and class 1. In our cases,
this will be a function f : Rn → [0, 1], where x is assigned to class 0 if f (x) < .5
and assigned to class 1 if f (x) > .5. The decision boundary of such a function is the
set of x ∈ Rn such that f (x) = .5, which are points that are expected to be near
points of either class. In general, the decision boundary produced by a classification
algorithm can be expected to be a surface of dimension n − 1 in Rn that separates the
two classes.

Our analysis relies on taking a representative sample of points from the decision
boundary in order to characterize its shape. In general, the decision boundary may
be an infinitely large region in Rn , so we require a method that samples the relevant
portion. Our method for sampling the decision boundary uses the training data, along
with the classes assigned to the data by the algorithm, to determine the portion of the
boundary that is sampled. To generate m sample points of the decision boundary, we
begin with m pairs of data points, where each pair has one point from each class; we
sample points close to the decision boundary by choosing the m pairs with minimal
distances between their points. Then, for each pair, we search on the line segment
between the two points of the pair for a point on the decision boundary, simply using
bisection to find a point x such that f (x) ≈ .5.
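The sampling step can be sketched as follows; this is an illustrative implementation of the pairing-and-bisection idea described above (the function and parameter names are ours), not the authors' exact code.

import numpy as np

def sample_decision_boundary(f, X0, X1, m, tol=1e-3, max_iter=50):
    # f maps a point of R^n to [0, 1]; X0 and X1 hold the training points
    # assigned to class 0 and class 1, respectively.
    d = np.linalg.norm(X0[:, None, :] - X1[None, :, :], axis=-1)
    order = np.dstack(np.unravel_index(np.argsort(d, axis=None), d.shape))[0]
    samples = []
    for i, j in order[:m]:                      # the m closest cross-class pairs
        a, b = X0[i], X1[j]                     # assumes f(a) < 0.5 < f(b)
        for _ in range(max_iter):
            mid = 0.5 * (a + b)
            if abs(f(mid) - 0.5) < tol:
                break
            if f(mid) < 0.5:
                a = mid
            else:
                b = mid
        samples.append(mid)
    return np.array(samples)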

Given a sample of a decision boundary, we aim to give a rough characterization of
its shape using persistent homology. We use the Ripser package of Scikit-TDA (Saul
and Tralie 2019) to compute a persistence diagram for the Vietoris–Rips complexes
built on a decision boundary sample, and focusing on H0 and H1 , we further compute
S0 and S1 for each sample. Each of these quantities should be thought of as a rough
description of the complexity of the decision boundary.

16.3.3. Procedure

For each data set, we ran multiple experiments with varying amounts of noise
added, following the steps below:
1) We choose n neural networks to train and a period of m epochs.
2) Each data set is split into a training set and a test set.
3) A neural network trains for m epochs, then we determine the training accuracy,
testing accuracy and a sample of points from the decision boundary.
4) Persistent homology is computed from the decision boundary sample, and S0
and S1 are computed from the persistence diagrams.
5) Steps 3 and 4 are repeated until the neural network achieves a specific training
accuracy or reaches a set maximum training time.
6) The process repeats for all n neural networks.

To examine the data recorded in the experiments, S0 and S1 are compared to the
difference in the training accuracy and test accuracy of a model. This difference is a
reliable metric to determine if a model is overfit, and for convenience, we will refer to
this as the overfitness of a model.

16.4. Experiments and results

16.4.1. Three-dimensional binary classification

It seems reasonable to begin with data sets we can concretely depict. Consider
500 points randomly sampled from a torus and 500 points randomly sampled from a
sphere in R3 with the sphere nested inside the torus. A reasonable decision boundary
distinguishing these two sets of data would take on a circular profile.

The TaDAsets package of the Scikit-TDA (Saul and Tralie 2019) python library
was used to produce the data sets in Figure 16.6 and implement a percentage of noise.
We considered three such data sets, with noise levels of 0%, 10% and 20%, to allow
for a neural network to become overfit. For each data set, 30% of the points were
reserved as a test set, and 25 neural networks, with 20 hidden layers consisting of 15
nodes each, were trained on the remaining 70%. Every 250 epochs, a sample of 700
points was taken from the decision boundary to calculate S0 and S1 . For this data set,
the decision boundary is expected to have a one-dimensional hole, as the boundary
is expected to look like a circle, so S1 is considered in this case. This process was
repeated until each neural network achieved a 98% accuracy on the training data.
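An illustrative construction of such a data set with TaDAsets is sketched below; the radii, noise level and labels are arbitrary choices made for the sketch, not necessarily the exact settings of the experiments.

import numpy as np
import tadasets

noise = 0.10                                        # e.g. the 10% noise setting
X_torus = tadasets.torus(n=500, c=3, a=1, noise=noise)
X_sphere = tadasets.sphere(n=500, r=1, noise=noise)
X = np.vstack([X_torus, X_sphere])                  # sphere nested inside the torus
y = np.concatenate([np.zeros(500), np.ones(500)])   # class labels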

Figure 16.6. Points sampled from a torus and sphere in R3

       0% Noise   10% Noise   20% Noise
S0     0.507      0.848       0.884
S1     0.369      0.581       0.746

Table 16.1. Pearson correlation coefficients for S0 or S1 and overfitness

In Figures 16.7 and 16.8, we compare the overfitness of each neural network to S0
and S1 for each noise level. In addition to this summary, Table 16.1 shows the Pearson
correlation coefficients of the overfitness compared to S0 and S1 for each data set.
As noise increases, it is expected that the complexity of the decision boundary should
increase as well, and these metrics now provide a way to quantify this behavior. We
see that in the presence of noise, S0 has significant correlation with overfitness as
r > .8. We also see notable correlation for S1 , and this is likely due to the fact that
the decision boundary is expected to have a one-dimensional hole. Furthermore, in
Figures 16.7 and 16.8, we see that each group, corresponding to a different data set,
seems to cluster in three different regions. This means that these metrics not only
measure noise, but can distinguish one boundary from another to some degree.

Figure 16.7. Values of S0 from the first data set

Figure 16.8. Values of S1 from the first data set



Figure 16.9. S0 from four-dimensional data with a 30% overlap

16.4.2. Data divided by a hyperplane

Next, we consider some examples of simple data sets in various dimensions. The
data points were uniformly sampled from a unit ball in Rn ; we ran tests in dimensions
n = 2, 4, 6, 8, and 10. The data points were divided into two classes based on what
side of a randomly chosen hyperplane they were on, except we allowed an overlap
between the classes near the hyperplane. For each value of n above, we created data
sets with the two classes overlapping 5%, 10%, 20% and 30% by volume, with the goal
of observing the different amounts of overfitting resulting from the different amounts
of overlap. For each dimension n, the number of training data points was chosen
to allow neural networks to gradually overfit over the course of a sufficiently long
training period. In each case, test data was generated according to the same distribution
used for the training data.

For each choice of dimension n and percentage of overlap, we trained 25 neural
networks, each on a different data set matching the description above, and each with
nine hidden layers consisting of 20 nodes each, plus one output layer. Each network
was trained until it reached an accuracy of 99% on training data or a fixed maximum
amount of training time.

Periodically during the training process, a sample of 500 points was taken from
the decision boundary. At each step, persistence diagrams were computed from the
sample, and S0 and S1 were recorded. The training accuracies and test accuracies at
each step were also recorded.

In cases of large overlap, in which more overfitting is to be expected, there was an
approximately linear relationship between overfitness and S0 . A similar, but weaker,
linear relationship was also observed with S1 . Examples for S0 are shown for the cases
of 30% overlap in dimensions 4 and 6 in Figures 16.9 and 16.10. Each includes the
samples from all networks.

Table 16.2 shows the Pearson correlation coefficients r for S0 and overfitness. In
each case, this was calculated from the data from all networks and each point in the
training process at which the decision boundary was sampled.

Figure 16.10. S0 from six-dimensional data with a 30% overlap

                5% overlap   10% overlap   20% overlap   30% overlap
Dimension 2        -0.166        0.050         0.408         0.633
Dimension 4         0.173       -0.100         0.724         0.795
Dimension 6        -0.147        0.020         0.579         0.801
Dimension 8        -0.169        0.008         0.272         0.639
Dimension 10       -0.092       -0.089         0.288         0.454

Table 16.2. Correlation coefficients for S0 and overfitness

Positive values of r indicate that a more overfit model tends to produce a higher
value of S0 . Furthermore, the greater values of r found for higher percentages of
overlap suggest that S0 is in fact sensitive to the amount of overfitting. The correlations
tended to be highest in dimensions 4 and 6. While the correlations shown here were
calculated from the collective trials of all networks with a given dimension and
percentage of overlap, the samples taken during the training of an individual network
often resulted in higher correlation coefficients. For instance, in trials with dimension
6 and a 30% overlap, the individual correlation coefficient for a network was greater
than .8 for 20 out of the 25 networks, and greater than .9 for 16 out of the 25 networks.
As an example, results from one such network with correlation coefficient r = .904
are shown in Figure 16.11.

Figure 16.11. Results from an individual network with correlation r = .904

16.5. Conclusion and discussion

Our experiments suggest that S0 and S1 are correlated with a model’s ability to
generalize to unseen data. We interpret this as an indication that persistent homology
is sensitive to the shape of a decision boundary, where a more complicated decision
boundary produced by an overfit model results in a more complicated persistence
diagram. A key feature of our method was the sampling of a decision boundary, where
the training data was used to find the relevant portion of the decision boundary. Our
approach gives evidence that topological measures of a decision boundary can give
insight into the quality of a model.

There is a wide variety of opportunities to expand upon this work. We list some
ideas for future work here:
1) Experiments with other classification algorithms such as logistic regression,
random forests or k-nearest neighbors.
2) Decision boundary sampling techniques: the method presented here only
demonstrates that the decision boundary can be sampled in a meaningful way. More
work is needed to determine how a high-quality sample can be obtained.
3) Incorporating persistence diagram data into the training process (see for
instance Chen et al. (2019)).
4) Investigation of measures of complexity other than S0 and S1 that could be
extracted from persistence diagrams.
5) Applications of these methods to higher-dimensional data and real data sets. For
instance, experiments could be done with image data or with applications of natural
language processing.

The simple examples considered here have demonstrated the ability of persistent
homology to describe the shape of a decision boundary. However, much more work is
needed to create a method suitable for practical applications. We hope that this work
motivates further study of the topological complexity of a model, as well as other work
applying TDA to machine learning.

16.6. References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S.,
Davis, A., Dean, J., Devin, M. et al. (2015). TensorFlow: Large-scale machine learning
on heterogeneous systems [Online]. Available at: tensorflow.org.
Adams, H., Chepushtanova, S., Emerson, T., Hanson, E., Kirby, M., Motta, F., Neville, R.,
Peterson, C., Shipman, P., Ziegelmeier, L. (2017). Persistence images: A stable vector
representation of persistent homology. Journal of Machine Learning Research, 18(8), 1–35.
Bendich, P., Marron, J.S., Miller, E., Pieloch, A., Skwerer, S. (2016). Persistent homology
analysis of brain artery trees. The Annals of Applied Statistics, 10(1), 198.
Carlsson, G., Ishkhanov, T., de Silva, V., Zomorodian, A. (2008). On the local behavior of
spaces of natural images. International Journal of Computer Vision, 76, 1–12.
Chen, C., Ni, X., Bai, Q., Wang, Y. (2019). A topological regularizer for classifiers via persistent
homology. In Proceedings of Machine Learning Research, Chaudhuri, K. and Sugiyama, M.
(eds). 89, 2573–2582, 16–18 April 2019.
Edelsbrunner, H. and Harer, J. (2010). Computational Topology – An Introduction. American
Mathematical Society, Providence, RI.
Ghrist, R. (2008). Barcodes: The persistent topology of data. Bulletin of the American
Mathematical Society, 45, 61–75.
Hofer, C., Kwitt, R., Niethammer, M., Uhl, A. (2018). Deep learning with topological
signatures. arXiv preprint arXiv:1707.04041.
Motta, F.C., Neville, R., Shipman, P.D., Pearson, D.A., Bradley, R.M. (2018). Measures of
order for nearly hexagonal lattices. Physica D: Nonlinear Phenomena, 380, 17–30.
Naitzat, G., Zhitnikov, A., Lim, L.-H. (2020). Topology of deep neural networks. Journal of
Machine Learning Research, 21(184), 1–40.
Ramamurthy, K.N., Varshney, K., Mody, K. (2019). Topological data analysis of decision
boundaries with application to model selection. In Proceedings of the 36th International
Conference on Machine Learning, Chaudhuri, K. and Salakhutdinov, R. (eds). 97,
5351–5360, Long Beach, California, 9–15 June 2019.
Saul, N. and Tralie, C. (2019). Scikit-TDA: Topological data analysis for Python [Online].
Available at: https://github.com/scikit-tda/scikit-tda.
Varshney, K.R. and Ramamurthy, K.N. (2015). Persistent topology of decision boundaries.
Proceedings of the IEEE International Conference on Acoustic Speech Signal Processing,
3931–3935, Brisbane, Australia, April 2015.

Zhao, Q. and Wang, Y. (2019). Learning metrics for persistence-based summaries and
applications for graph classification. Advances in Neural Information Processing Systems,
9859–9870.
Zomorodian, A. and Carlsson, G. (2005). Computing persistent homology. Discrete &
Computational Geometry, 33, 249–274.
17. The Minimum Renyi's Pseudodistance Estimators for Generalized Linear Models

Minimum Renyi’s pseudodistance (RP) estimators have good robustness
properties without a significant loss of efficiency for linear regression models (LRM).
The main purpose of this chapter is to extend these minimum RP estimators to
generalized linear models (GLM), using some results previously obtained by Castilla
et al. (2021) in relation to independent and non-identically distributed observations in
LRM. We theoretically derive asymptotic properties of the proposed estimators and
examine the performance of the estimators in Poisson regression models through a
simulation study, focusing on the robustness properties of the estimators. We finally
test the proposed methods in a real dataset related to the treatment of epilepsy,
illustrating the outperformance of the robust minimum RP estimators when there are
outlier observations.

17.1. Introduction

Generalized linear models (GLMs) were first introduced by Nelder and Wedderburn (1972) and later developed extensively by McCullagh and Nelder (1983). The GLMs
represent a natural extension of the standard linear regression model (LRM), which
encloses a large variety of response variable distributions, including distributions
of counts, binary or positive values. The regression model is defined in terms of a
set of independent response variables, Y1 , ..., Yn , following a distribution from the
general exponential family. That is, the density function of each response variable, Yi ,
i = 1, .., n with respect to a convenient σ-finite measure, is of the form
f(y_i, \theta_i, \phi) = \exp\left\{ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right\},   [17.1]

Chapter written by María JAENADA and Leandro PARDO.


where the canonical parameter, θi , is an unknown measure of location depending on
the predictor x and φ is a known or unknown nuisance scale or dispersion parameter
typically required to produce standard errors following Gaussian, Gamma or inverse
Gaussian distribution. The functions a(φ), b(θ) and c (y, φ) are known. In particular,
a(φ) is set to 1 for binomial, Poisson, and negative binomial distributions (known φ).
The distribution of each Yi is then defined except for a common scale parameter φ and
a location parameter θi depending on the sample i. The GLM assumes that the mean
of the response variable, μi = E[Yi ], is related to a linear predictor xTi β through a
link function g. Thus, for model specification, we are usually interested in a vector of
unknown parameters β = (β1 , ..., βk ) (k < n), satisfying  relation g(μi ) = xi β,
 T the
where g is monotone and differentiable. Since θ = θ x β , we will also denote the
density in equation [17.1] by f (yi , β, φ).

The lack of robustness of the maximum likelihood estimator (MLE) and the
maximum quasi-likelihood estimators (MQLE) in GLMs has been widely studied
in the literature. For this reason, robust procedures for GLMs have been considered
to robustify the classical MLE. Among others, Stefanski et al. (1986) studied
optimally bounded score functions for the GLM and generalized the results obtained
by Krasker and Welsch (1982) for classical LRM. However, the robust estimator
proposed by Stefanski et al. (1986) is difficult to compute. In this line, Künsch et al.
(1989) introduced the so-called conditionally unbiased bounded-influence estimate.
The development of robust estimators for the GLM continued with the work of
Morgenthaler (1992) and more recently Cantoni and Ronchetti (2001) proposed a
robust approach based on robust quasi-deviance functions, which simultaneously
performs parameter estimation and variable selection. Another class of M-estimators
was proposed by Bianco and Yohai (1996) and further studied by Croux and
Haesbroeck (2003). Bianco et al. (2013) proposed general M-estimators with missing
values in the responses. Later, Valdora and Yohai (2014) proposed a family of robust
estimators for GLM, based on M-estimators after stabilizing the response variance. In
this line, Ghosh and Basu (2016) presented a robust family of estimators based on the
density power divergence (DPD) approach but assuming a fixed design matrix.

On the other hand, Broniatowski et al. (2012) considered for the first time the
minimum Renyi’s pseudodistance (RP) estimators for the LRM, and they studied
their robustness properties. Based on these minimum RP estimators, Castilla et al.
(2020b) introduced and studied Wald-type tests for the parameters in the LRM. Later,
the results were extended to the context of high-dimensional LRM in Castilla et al.
(2021), combining the robust loss given by RP and regularization methods. Some
interesting results based on independent and non-identically distributed observations
(i.n.i.d.o) were considered in Castilla et al. (2020a). Following the previous works, in
this chapter, we consider the minimum RP estimators for i.n.i.d.o. for the GLMs.

17.2. The minimum RP estimators for the GLM model: asymptotic distribution

In the following, we assume that the explanatory variables, xi , are fixed, and
therefore the random response variables Yi are independent but non-homogeneous
observations, i.e. we deal with the i.n.i.d.o. setup studied in Castilla et al. (2020a).
Let us then consider the i.n.i.d.o. random variables, Y1 , ..., Yn , with density functions
with respect to some common dominating measure, g1 , ..., gn , respectively. The true
densities gi are modeled by the density functions given in [17.1], belonging to the
exponential family. As pointed out in section 17.1, we will denote by fi (y, β, φ)
these density functions, highlighting its dependence on the regression vector β, the
nuisance parameter φ and the observation i, i = 1, .., n. For each observation, the RP
between fi (y, θ) and gi , can be defined for positive values of α as
 
R_\alpha(f_i(y, \theta), g_i) = \frac{1}{\alpha+1} \log \int f_i(y, \theta)^{\alpha+1}\, dy - \frac{1}{\alpha} \log \int f_i(y, \theta)^{\alpha} g_i(y)\, dy + k,   [17.2]
where
 
k = \frac{1}{\alpha(\alpha+1)} \log \int g_i(y)^{\alpha+1}\, dy
does not depend on θ = (β, φ). Since we only have one observation of each
random variable Yi , the best way to estimate the true density gi is to assume that
the distribution is degenerate in the observation yi . Accordingly, we denote by gi the
density function of the degenerate variable at the point yi . Then, [17.2] yields the loss
 
R_\alpha(f_i(y, \theta), g_i) = \frac{1}{\alpha+1} \log \int f_i(y, \theta)^{\alpha+1}\, dy - \frac{1}{\alpha} \log f_i(y_i, \theta)^{\alpha} + k.   [17.3]
[17.3]
At α = 0, the RP loss can be defined taking continuous limits by
R_0(f_i(y, \theta), g_i) = \lim_{\alpha \downarrow 0} R_\alpha(f_i(y, \theta), g_i) = - \log f_i(y_i, \theta) + k.   [17.4]

Now, expression [17.3] can be rewritten as


R_\alpha(f_i(y, \theta), g_i) = -\frac{1}{\alpha} \log \frac{f_i(Y_i, \theta)^{\alpha}}{\left( \int f_i(y, \theta)^{\alpha+1}\, dy \right)^{\frac{\alpha}{\alpha+1}}} + k,

and thus minimizing R_\alpha(f_i(y, \theta), g_i) in \theta, for \alpha > 0, is equivalent to maximizing

V_i(Y_i, \theta) = \frac{f_i(Y_i, \theta)^{\alpha}}{\left( \int f_i(y, \theta)^{\alpha+1}\, dy \right)^{\frac{\alpha}{\alpha+1}}}.

For notational simplicity, in the following, we will denote,


L_\alpha^i(\theta) = \left( \int f_i(y, \theta)^{\alpha+1}\, dy \right)^{\frac{\alpha}{\alpha+1}},

so the Rényi loss is given by

V_i(Y_i, \theta) = \frac{f_i(Y_i, \theta)^{\alpha}}{L_\alpha^i(\theta)}.
Based on the previous idea, we consider the objective function

H_n^\alpha(\theta) = \frac{1}{n} \sum_{i=1}^{n} \frac{f_i(Y_i, \theta)^{\alpha}}{\left( \int f_i(y, \theta)^{\alpha+1}\, dy \right)^{\frac{\alpha}{\alpha+1}}} = \frac{1}{n} \sum_{i=1}^{n} \frac{f_i(Y_i, \theta)^{\alpha}}{L_\alpha^i(\theta)} = \frac{1}{n} \sum_{i=1}^{n} V_i(Y_i, \theta)   [17.5]

and then the minimum RP estimator, θ α , for the common parameter vector θ, is given
by

\widehat{\theta}_\alpha = \arg\max_{\theta \in \Theta} H_n^\alpha(\theta),   [17.6]

with H_n^\alpha(\theta) defined in [17.5] for \alpha > 0 and H_n^0(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log f_i(Y_i, \theta). Note that at \alpha = 0, the minimum RP estimator coincides with the MLE, and therefore the proposed family can be considered a generalization of the maximum likelihood procedure.

Since the minimum RP estimator is defined as a maximum, it must annul the
first derivatives of the loss function given in [17.5]. The estimating equations of the
parameters β and φ are given by
\begin{cases} \dfrac{1}{n} \displaystyle\sum_{i=1}^{n} \dfrac{\partial V_i(y_i, \beta, \phi)}{\partial \beta} = \mathbf{0}_k \\[6pt] \dfrac{1}{n} \displaystyle\sum_{i=1}^{n} \dfrac{\partial V_i(y_i, \beta, \phi)}{\partial \phi} = 0. \end{cases}

For the first equation, we have



\frac{\partial V_i(y_i, \beta, \phi)}{\partial \beta} = \frac{1}{L_\alpha^i(\beta, \phi)^2} \left[ \alpha f_i(y_i, \beta, \phi)^{\alpha} \frac{\partial \log f_i(y_i, \beta, \phi)}{\partial \beta} L_\alpha^i(\beta, \phi) - \alpha \left( \int f_i(y, \beta, \phi)^{\alpha+1} dy \right)^{\frac{\alpha}{\alpha+1} - 1} \left( \int f_i(y, \beta, \phi)^{\alpha+1} \frac{\partial \log f_i(y, \beta, \phi)}{\partial \beta} dy \right) f_i(y_i, \beta, \phi)^{\alpha} \right].

Following Ghosh and Basu (2016), we can rewrite the previous partial derivatives
as
\frac{\partial \log f_i(y_i, \beta, \phi)}{\partial \beta} = \frac{y_i - \mu_i}{\mathrm{Var}(Y_i)\, g'(\mu_i)}\, \mathbf{x}_i = K_{1i}(y_i, \beta, \phi)\, \mathbf{x}_i

and

\frac{\partial \log f_i(y_i, \beta, \phi)}{\partial \phi} = -\frac{y_i \theta_i - b(\theta_i)}{a(\phi)^2}\, a'(\phi) + \frac{\partial c(y_i, \phi)}{\partial \phi} = K_{2i}(y_i, \beta, \phi).
Then, substituting on the first equation, we have that the estimating equations for
the parameter β are given by
\sum_{i=1}^{n} \frac{\mathbf{x}_i}{L_\alpha^i(\beta, \phi)} \left\{ M_i(y_i, \beta, \phi) - N_i(y_i, \beta, \phi) \right\} = \mathbf{0}_k   [17.7]

being

M_i(y_i, \beta, \phi) = f_i(y_i, \beta, \phi)^{\alpha}\, K_{1i}(y_i, \beta, \phi)

and

N_i(y_i, \beta, \phi) = \frac{f_i(y_i, \beta, \phi)^{\alpha}}{\int f_i(y, \beta, \phi)^{\alpha+1}\, dy} \int f_i(y, \beta, \phi)^{\alpha+1}\, K_{1i}(y, \beta, \phi)\, dy.
In relation to the estimating equation for φ, we have,

\frac{\partial V_i(y_i, \beta, \phi)}{\partial \phi} = \frac{1}{L_\alpha^i(\beta, \phi)^2} \left[ \alpha f_i(y_i, \beta, \phi)^{\alpha} \frac{\partial \log f_i(y_i, \beta, \phi)}{\partial \phi} L_\alpha^i(\beta, \phi) - \alpha \left( \int f_i(y, \beta, \phi)^{\alpha+1} dy \right)^{\frac{\alpha}{\alpha+1} - 1} \left( \int f_i(y, \beta, \phi)^{\alpha+1} \frac{\partial \log f_i(y, \beta, \phi)}{\partial \phi} dy \right) f_i(y_i, \beta, \phi)^{\alpha} \right]

= \frac{1}{L_\alpha^i(\beta, \phi)^2} \left[ \alpha f_i(y_i, \beta, \phi)^{\alpha} \frac{\partial \log f_i(y_i, \beta, \phi)}{\partial \phi} L_\alpha^i(\beta, \phi) - \alpha \frac{L_\alpha^i(\beta, \phi)}{\int f_i(y, \beta, \phi)^{\alpha+1} dy} \left( \int f_i(y, \beta, \phi)^{\alpha+1} \frac{\partial \log f_i(y, \beta, \phi)}{\partial \phi} dy \right) f_i(y_i, \beta, \phi)^{\alpha} \right].
Thus, the estimating equation for φ is given by
\sum_{i=1}^{n} \frac{1}{L_\alpha^i(\beta, \phi)} \left\{ M_i^{*}(y_i, \beta, \phi) - N_i^{*}(y_i, \beta, \phi) \right\} = 0   [17.8]

being
M_i^{*}(y_i, \beta, \phi) = f_i(y_i, \beta, \phi)^{\alpha}\, K_{2i}(y_i, \beta, \phi)

and

N_i^{*}(y_i, \beta, \phi) = \frac{f_i(y_i, \beta, \phi)^{\alpha}}{\int f_i(y, \beta, \phi)^{\alpha+1}\, dy} \int f_i(y, \beta, \phi)^{\alpha+1}\, K_{2i}(y, \beta, \phi)\, dy,

Castilla et al. (2021) proved that the minimum RP estimator (βα , φα ) is consistent
and asymptotically normal under some regularity conditions. Before stating the
asymptotic distribution, let us introduce some useful notations. We define the
quantities

m_{jli}(\beta, \phi) = \frac{1}{\int f_i(y, \beta, \phi)^{\alpha+1} dy} \int f_i(y, \beta, \phi)^{\alpha+1}\, K_{ji}(y, \beta, \phi)\, K_{li}(y, \beta, \phi)\, dy,

m_{ji}(\beta, \phi) = \frac{1}{\int f_i(y, \beta, \phi)^{\alpha+1} dy} \int f_i(y, \beta, \phi)^{\alpha+1}\, K_{ji}(y, \beta, \phi)\, dy,

l_{jli}(\beta, \phi) = \int \frac{f_i(y, \beta, \phi)^{2\alpha+1}}{L_\alpha^i(\beta, \phi)^2} \left( K_{ji}(y, \beta, \phi) - m_{ji}(\beta, \phi) \right) \left( K_{li}(y, \beta, \phi) - m_{li}(\beta, \phi) \right) dy,   [17.9]

for all j, l = 1, 2 and i = 1, ..., n.

THEOREM 17.1.– Let Y_1, ..., Y_n be i.n.i.d.o. each with a density function given in [17.1]. The asymptotic distribution of the minimum RP estimator, (\widehat{\beta}_\alpha, \widehat{\phi}_\alpha), is given by

\sqrt{n}\, \Omega_n(\beta, \phi)^{-\frac{1}{2}}\, \Psi_n(\beta, \phi) \left( (\widehat{\beta}_\alpha, \widehat{\phi}_\alpha) - (\beta, \phi) \right) \xrightarrow[n \to \infty]{L} N(\mathbf{0}_{p+1}, \mathbf{I}_{p+1}),

being I_k the k-dimensional identity matrix, and the matrices \Psi_n and \Omega_n are given by

\Omega_n(\beta, \phi) = \frac{1}{n} \begin{pmatrix} X^T D_{11} X & X^T D_{12} \mathbf{1} \\ \mathbf{1}^T D_{12} X & \mathbf{1}^T D_{22} \mathbf{1} \end{pmatrix},

where X denotes the design matrix and D_{jk} = \mathrm{diag}\,(l_{jki})_{i=1,...,n}, j, k = 1, 2, and

\Psi_n(\beta, \phi) = \frac{1}{n} \begin{pmatrix} X^T \left( D_{11}^{*} - (D_1^{*})^T D_1^{*} \right) X & X^T \left( D_{12}^{*} - (D_1^{*})^T D_2^{*} \right) \mathbf{1} \\ \mathbf{1}^T \left( D_{12}^{*} - (D_1^{*})^T D_2^{*} \right) X & \mathbf{1}^T \left( D_{22}^{*} - (D_2^{*})^T D_2^{*} \right) \mathbf{1} \end{pmatrix},

with D_{jk}^{*} = \mathrm{diag}\,(m_{jki}(\beta, \phi))_{i=1,...,n} and D_j^{*} = \mathrm{diag}\,(m_{ji}(\beta, \phi))_{i=1,...,n}, j, k = 1, 2.

PROOF.– See the Appendix.



Figure 17.1. MSE in β estimation with different values of α against sample size for the Poisson regression model: (a) MSE under pure data; (b) MSE under 5% of contaminated data; (c) MSE under 10% of contaminated data. For a color version of this figure, see www.iste.co.uk/zafeiris/data1.zip

17.3. Example: Poisson regression model

We illustrate the proposed robust method for the Poisson regression model. As
pointed out in section 17.1, the Poisson regression model belongs to the GLM with
a known shape parameter φ = 1 and location parameters θi = νi = xTi β and
c(yi ) = − log(yi !). The mean of the response variable is then linked to the linear
predictor through the natural logarithm, i.e. μi = exp(xTi β). Thus, we can apply the
previously proposed method to estimate the vector of regression parameters β with
the objective function given in equation [17.5]. Note that the expression Liα (β) does
not have a simplified form for the Poisson regression model, and it must be computed
numerically.
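A minimal sketch of this computation is given below, assuming the log link and φ = 1 and truncating the infinite sum over counts at an arbitrary bound; the function name and optimizer are our illustrative choices, not the authors' implementation.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import poisson

def neg_rp_objective(beta, X, y, alpha, y_max=200):
    # Negative of H_n^alpha(beta) from equation [17.5] for Poisson regression.
    mu = np.exp(X @ beta)                           # mean under the log link
    counts = np.arange(y_max + 1)
    pmf = poisson.pmf(counts[None, :], mu[:, None])
    # L_alpha^i = (sum_y f_i(y)^{alpha+1})^{alpha/(alpha+1)}, computed numerically
    L = np.sum(pmf ** (alpha + 1), axis=1) ** (alpha / (alpha + 1))
    V = poisson.pmf(y, mu) ** alpha / L
    return -np.mean(V)

# Usage sketch: beta_hat = minimize(neg_rp_objective, np.zeros(X.shape[1]),
#                                   args=(X, y, 0.3)).x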

We analyze the performance of the proposed methods in Poisson regression
through a simulation study. In the following, we set the regression parameter
β = (1.8, 1, 0, 0, 1.5, 0, ...0) with dimension p = 12 and generate the explanatory
variables, xi , from the standard uniform distribution with the variance–covariance
matrix having the Toeplitz structure, with the (j, k)-th element being 0.5|j−k| , j, k =
1, ..., p. The response variables are generated from the Poisson regression model with mean μ_i = exp(x_i^T β), Y_i ∼ P(μ_i). In order to evaluate the robustness of the proposed estimators, we contaminate the responses using a perturbed distribution of the form (1 − b)P(μ_i) + bP(2μ_i), where b is a realization of a Bernoulli variable with parameter ε = 0.05, 0.1, called the contamination level. That is, the distribution of the contaminated responses lies in a small neighborhood of the assumed model. We repeat the process R = 100 times for each value of α.
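The data-generating scheme can be sketched as follows; for illustration we use uncorrelated uniform covariates, the log link and an arbitrary seed, which simplifies the design described above.

import numpy as np

rng = np.random.default_rng(1)
n, p, eps = 200, 12, 0.05
beta = np.zeros(p)
beta[[0, 1, 4]] = [1.8, 1.0, 1.5]
X = rng.uniform(size=(n, p))
mu = np.exp(X @ beta)                            # Poisson mean under the log link
b = rng.binomial(1, eps, size=n)                 # contamination indicators
y = rng.poisson(np.where(b == 1, 2 * mu, mu))    # (1 - b) P(mu) + b P(2 mu)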

Figure 17.1 shows the mean-squared error of the estimate, MSE = ||\widehat{β}_α − β||^2, for different values of α = 0, 0.1, 0.3, 0.5, against the sample size. It is clear that our proposed estimators are more robust than the classical MLE, since in all contaminated scenarios the MSE is lower for all positive values of α than for α = 0 (corresponding to the MLE), except for very small sample sizes. Conversely, the MLE is, as expected, the most efficient estimator in the absence of contamination, closely followed by our proposed estimators with α = 0.1, 0.3, highlighting the role of α in controlling the trade-off between efficiency and robustness. In this regard, values of α around 0.3 perform best, taking into account the low loss of efficiency and the gain
in robustness. Finally, note that small sample sizes adversely affect greater values of
α.

17.3.1. Real data application

We finally apply our proposed estimators in a real dataset arising from a clinical
trial of 59 patients who suffer from epilepsy. The data was first studied in Leppik et al.
(1985) and has been previously studied for robust Poisson regression in Hosseinian
(2009) and Ghosh and Basu (2016). Patients with epilepsy were treated by the
anti-epileptic drug “progabide” or a placebo with randomized assignment, and at each of the four successive post-randomization clinic visits the number of seizures was recorded. Epileptic seizure count is modeled using a Poisson regression model with three
explanatory variables, namely, baseline seizure rate recorded over an eight-week
period prior to randomization divided by 4, “Bline”, the age of the patients in years
divided by 10, “Age”, and the binary indicator “Trt” for the progabide group. As pointed out in Thall and Vail (1990), the interaction between treatment and baseline, “Trt × Bline”, should be considered, as the mean seizure rate for the progabide group is higher or lower than for the control group depending on whether the baseline count exceeds a certain threshold, indicating a possible contra-indication of the drug for patients with high baseline counts. In that study, it was observed that patient 207
may be an outlying observation, since both baseline and epileptic seizure count after
the clinical trial are greater than those of other patients. Moreover, it has also been
highlighted that robust methods show the interaction between treatment and baseline
variable, “Trt × Bline”, to be significant, whereas with non-robust estimation it turns out to be insignificant due to possible data contamination.

                 Intercept      Trt     Bline     Age    Trt × Bline
MLE                 1.9680  -0.2553    0.0854  0.0243         0.0075
Cantoni             2.04    -0.32      0.085   0.16           0.012
WMLE                2.13    -0.47      0.128   0.044          0.054
MDPD α = 0.1        2.1089  -0.3169    0.0866  0.1153         0.0107
MDPD α = 0.3        1.9106  -0.3871    0.1689  0.0408         0.0156
MDPD α = 0.5        1.9691  -0.3893    0.1631  0.0362         0.0165
MDPD α = 0.7        2.0060  -0.3516    0.1622  0.0242         0.0131
MDPD α = 1          1.9653  -0.3186    0.1562  0.0559         0.0098
MRP α = 0.1         1.9543  -0.4780    0.0848  0.0187         0.0294
MRP α = 0.3         2.1669  -0.3653    0.0865  0.0084         0.0108
MRP α = 0.5         1.7350  -0.6180    0.1621  0.0122         0.0415
MRP α = 0.7         1.8859  -0.6044    0.1575  0.0080         0.0392

Table 17.1. Estimates of the regression coefficients for the epilepsy data with different estimating methods

Table 17.1 shows the estimated values of the regression coefficients for different
values of the tuning parameter α, jointly with the coefficient estimates using
some other robust and non-robust methods in the literature, namely the MLE, the
weighted maximum likelihood estimators (WMLE) proposed in Hosseinian (2009),
the M-estimators proposed in Cantoni and Ronchetti (2001) and minimum density
estimators based on the DPD (MDPD), proposed in Ghosh and Basu (2016). These
estimated values have been taken from the mentioned papers. The proposed robust
estimators behave similarly to the other robust proposals, confirming their robustness on real data as well. Note that for all robust methods, the estimates of the regression
coefficients for Trt, Bline and Trt×Bline variables are greater (in absolute value) than
the MLE, leading to the suspicion that the last non-robust estimates are influenced by
data contamination.

17.4. Conclusion

In this chapter, we have presented the minimum RP estimator for GLMs. The
proposed estimators are robust against data contamination, including outliers and
leverage points, as well as consistent and asymptotically normal. Following this idea,
Wald-type test statistics could be developed in order to test a simple and composite
null hypothesis, extending the previous work for the LRM in Castilla et al. (2020b).
The latter have been explored in Basu et al. (2021) for the minimum DPD estimators,
but assuming the random design matrix.

17.5. Acknowledgments

This research was supported by the Spanish Grants PGC2018-095194-B-100
(L. Pardo and M. Jaenada) and FPU/018240 (M. Jaenada).

17.6. Appendix

17.6.1. Proof of Theorem 17.1

We use the same notation introduced for Theorem 17.1 in [17.9]. Let us rewrite

\frac{\partial V_i(y_i; \beta, \phi)}{\partial \beta} = f_i(y_i, \beta, \phi)^{\alpha}\, H_{i1}(y_i, \beta, \phi)\, \mathbf{x}_i \quad \text{and} \quad \frac{\partial V_i(y_i; \beta, \phi)}{\partial \phi} = f_i(y_i, \beta, \phi)^{\alpha}\, H_{i2}(y_i, \beta, \phi), \quad i = 1, ..., n,

with

H_{ij}(y_i, \beta, \phi) = \frac{1}{L_\alpha^i(\beta, \phi)} \left( K_{ji}(y_i, \beta, \phi) - \frac{1}{\int f_i(y, \beta, \phi)^{\alpha+1} dy} \int f_i(y, \beta, \phi)^{\alpha+1}\, K_{ji}(y, \beta, \phi)\, dy \right) = \frac{1}{L_\alpha^i(\beta, \phi)} \left( K_{ji}(y_i, \beta, \phi) - m_{ji}(\beta, \phi) \right), \quad j = 1, 2.

As stated in Castilla et al. (2021), the matrix Ωn (β, φ) is given by


 
\Omega_n(\beta, \phi) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Var}_{\beta, \phi}\left[ \frac{\partial V_i(Y_i; \beta, \phi)}{\partial(\beta, \phi)} \right] = \frac{1}{n} \sum_{i=1}^{n} E_{\beta, \phi}\left[ \left( \frac{\partial V_i(Y_i; \beta, \phi)}{\partial(\beta, \phi)} \right)^T \left( \frac{\partial V_i(Y_i; \beta, \phi)}{\partial(\beta, \phi)} \right) \right],

which can be rewritten using the expressions defined in [17.9] as

\mathrm{Var}_{\beta, \phi}\left[ \frac{\partial V_i(Y_i; \beta, \phi)}{\partial(\beta, \phi)} \right] = \begin{pmatrix} \mathbf{x}_i \mathbf{x}_i^T\, l_{11i}(\beta, \phi) & \mathbf{x}_i\, l_{12i}(\beta, \phi) \\ \mathbf{x}_i^T\, l_{12i}(\beta, \phi) & l_{22i}(\beta, \phi) \end{pmatrix},

with

l_{jli}(\beta, \phi) = \int \frac{f_i(y, \beta, \phi)^{2\alpha+1}}{L_\alpha^i(\beta, \phi)^2} \left( K_{ji}(y, \beta, \phi) - m_{ji}(\beta, \phi) \right) \left( K_{li}(y, \beta, \phi) - m_{li}(\beta, \phi) \right) dy, \quad j, l = 1, 2, \ i = 1, ..., n.

Now, if we denote the matrices

D_{jk} = \mathrm{diag}\,(l_{jki})_{i=1,...,n}, \quad j, k = 1, 2,   [17.10]

we get the final expression of \Omega_n(\beta, \phi),

\Omega_n(\beta, \phi) = \frac{1}{n} \begin{pmatrix} X^T D_{11} X & X^T D_{12} \mathbf{1} \\ \mathbf{1}^T D_{12} X & \mathbf{1}^T D_{22} \mathbf{1} \end{pmatrix}

where X T = (x1 , ..., xn ) . We now derive an expression of the matrix Ψn (β, φ) .


Based on Remark 5.2 in Castilla et al. (2021), we can use their formula (21) to obtain
\Psi_n(\beta, \phi) = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{1}{\int f_i(y, \beta, \phi)^{\alpha+1} dy} \int f_i(y, \beta, \phi)^{\alpha+1}\, u(y, \beta, \phi)\, u(y, \beta, \phi)^T dy - \frac{1}{\left( \int f_i(y, \beta, \phi)^{\alpha+1} dy \right)^2} \left( \int f_i(y, \beta, \phi)^{\alpha+1}\, u(y, \beta, \phi)\, dy \right) \left( \int f_i(y, \beta, \phi)^{\alpha+1}\, u(y, \beta, \phi)\, dy \right)^T \right]

being u (yi , β, φ) = (K1i (yi , β, φ)xi , K2i (yi , β, φ)) . Therefore, using the quantities
defined in [17.9], we can express the matrix Ψn (β, φ) as
 
\Psi_n(\beta, \phi) = \frac{1}{n} \sum_{i=1}^{n} \left[ \begin{pmatrix} m_{11i}(\beta, \phi)\, \mathbf{x}_i \mathbf{x}_i^T & m_{12i}(\beta, \phi)\, \mathbf{x}_i \\ m_{12i}(\beta, \phi)\, \mathbf{x}_i^T & m_{22i}(\beta, \phi) \end{pmatrix} - \begin{pmatrix} m_{1i}(\beta, \phi)\, \mathbf{x}_i \\ m_{2i}(\beta, \phi) \end{pmatrix} \begin{pmatrix} m_{1i}(\beta, \phi)\, \mathbf{x}_i^T & m_{2i}(\beta, \phi) \end{pmatrix} \right].
Finally, defining

D_{jl}^{*} = \mathrm{diag}\,(m_{jli}(\beta, \phi))_{i=1,...,n}, \quad j, l = 1, 2   [17.11]

and

D_j^{*} = \mathrm{diag}\,(m_{ji}(\beta, \phi))_{i=1,...,n}, \quad j = 1, 2,   [17.12]

we can write

\Psi_n(\beta, \phi) = \frac{1}{n} \begin{pmatrix} X^T \left( D_{11}^{*} - (D_1^{*})^T D_1^{*} \right) X & X^T \left( D_{12}^{*} - (D_1^{*})^T D_2^{*} \right) \mathbf{1} \\ \mathbf{1}^T \left( D_{12}^{*} - (D_1^{*})^T D_2^{*} \right) X & \mathbf{1}^T \left( D_{22}^{*} - (D_2^{*})^T D_2^{*} \right) \mathbf{1} \end{pmatrix}.

17.7. References

Basu, A., Ghosh, A., Mandal, A., Martin, N., Pardo, L. (2021). Robust Wald-type tests in GLM
with random design based on minimum density power divergence estimators. Statistical
Methods and Applications, 3, 933–1005.
Bianco, A.M. and Yohai, V.J. (1996). Robust estimation in the logistic regression model. Robust
Statistics, Data Analysis, and Computer Intensive Methods, 109, 17–34.
Bianco, A.M., Boente, G., Rodrigues, I.M. (2013). Robust tests in generalized linear models with
missing responses. Computational Statistics and Data Analysis, 65, 80–97.
Broniatowski, M., Toma, A., Vajda, I. (2012). Decomposable pseudodistances and applications
in statistical estimation. Journal of Statistical Planning and Inference, 142, 2574–2585.
Cantoni, E. and Ronchetti, E. (2001). Robust inference for generalized linear models. Journal
of the American Statistical Association, 96, 1022–1030.
Castilla, E., Ghosh, A., Jaenada, M., Pardo, L. (2020a). On regularization methods based
on Rényi’s pseudodistances for sparse high-dimensional linear regression models [Online].
Available at: arXiv:2007.15929.
Castilla, E., Martín, N., Muñoz, S., Pardo L. (2020b). Robust Wald-type tests based on
minimum Rényi pseudodistance estimators for the multiple regression model. Journal of
Statistical Computation and Simulation, 90(14), 2592–2613.
Castilla, E., Jaenada, M., Pardo, L. (2021). Estimation and testing on independent not
identically distributed observations based on Rényi’s pseudodistances [Online]. Available
at: arXiv:2102.12282.

Croux, C. and Haesbroeck, G. (2003). Implementing the Bianco and Yohai estimator for logistic
regression. Computational Statistics and Data Analysis, 44, 273–295.
Ghosh, A. and Basu, A. (2016). Robust estimation in generalized linear models: The density
power divergence approach. TEST, 25, 269–290.
Hosseinian, S. (2009). Robust inference for generalized linear models: Binary and Poisson
regression. Thesis, École Polytechnique Fédérale de Lausanne.
Krasker, W.S. and Welsch, R.E. (1982). Efficient bounded-influence regression estimation.
Journal of the American Statistical Association, 77, 595–604.
Künsch, H.R., Stefanski, L.A., Carroll, R.J. (1989). Conditionally unbiased bounded-influence
estimation in general regression models, with applications to generalized linear models.
Journal of the American Statistical Association, 84, 460–466.
Leppik, I., Dreifuss, F., Bowman, T., Santilli, N., Jacobs, M.P., Crosby, C., Cloyd, J.C.,
Stockman, J., Graves, N.M., Sutula, T.P. et al. (1985). A double-blind crossover evaluation
of progabide in partial seizures. Neurology, 35, 285.
McCullagh, P. and Nelder, J.A. (1983). Generalized Linear Models. Chapman and Hall,
London.
Morgenthaler, S. (1992). Least-absolute-deviations fits for generalized linear models.
Biometrika, 79, 747–754.
Nelder, J.A. and Wedderburn, R.W.M. (1972). Generalized linear models. Journal of the Royal
Statistical Society, 135, 370–384.
Stefanski, L.A., Carroll, R.J., Ruppert, D. (1986). Optimally bounded score functions for
generalized linear models with applications to logistic regression. Biometrika, 73, 413–424.
Thall, P.F. and Vail, S.C. (1990). Some covariance models for longitudinal count data with
overdispersion. Biometrics, 46(3), 657–671.
Valdora, M. and Yohai, V.J. (2014). Robust estimators for generalized linear models. Journal of
Statistical Planning and Inference, 146, 31–48.
18. Data Analysis based on Entropies and Measures of Divergence

In this chapter, we discuss entropies and measures of divergence which are
extensively used in data analysis and statistical inference. Tests of goodness of fit are
reviewed and their asymptotic theory is discussed. Simulation studies are undertaken
for comparing their performance capabilities.

18.1. Introduction

Measures of divergence are powerful statistical tools directly related to statistical
inference, including robustness, with diverse applicability (see, for example,
Papaioannou 1985; Basu et al. 2011; Ghosh et al. 2013). Indeed, on one hand, they can
be used for estimation purposes with the classical example, the well-known maximum
likelihood estimator (MLE) which is the result of the implementation of the famous
Kullback–Leibler measure. On the other hand, measures are applicable in tests of fit
to quantify the degree of agreement between the distribution of an observed random
sample and a theoretical, hypothesized distribution. The problem of goodness of fit
(gof) to any distribution on the real line is frequently treated by partitioning the range
of data in a number of disjoint intervals. In all cases, a test statistic is compared against
a known critical value to accept or reject the hypothesis that the sample is from
the postulated distribution. Over the years, numerous non-parametric gof methods
including the chi-squared test and various empirical distribution function (edf) tests
(D’Agostino and Stephens 1986) have been developed. At the same time, measures
of entropy, divergence and information are quite popular in goodness-of-fit tests.
Over the years, several measures have been suggested to reflect the fact that some probability distributions are closer together than others. Many of the currently used tests, such as the likelihood ratio, the chi-squared, the score and the Wald tests, are defined in terms of appropriate measures.

Chapter written by Christos MESELIDIS, Alex KARAGRIGORIOU and Takis PAPAIOANNOU.

In this chapter, we provide a comprehensive review on entropies and distance
measures and their use in inferential statistics. Section 18.2 is devoted to a brief
literature review on entropies and divergence measures, and section 18.3 presents the
main results on tests of fit. Section 18.4, through extensive simulations, explores the
performance of a recently proposed double index test of fit in contingency tables.

18.2. Divergence measures

Measures of information are powerful statistical tools with diverse applicability.


In this section, we will focus on a specific type of information measure, known as
measures of discrepancy (distance or divergence) between two variables X and Y
with pdfs f and g. Furthermore, we will explore ways to measure the discrepancy
between (i) the distribution of X as deduced from an available set of data and (ii) the
distribution of X as compared with a hypothesized distribution believed to be the
generating mechanism that produced the set of data at hand.

For historical reasons, we present first Shannon’s entropy (Shannon 1948) given
by

I^S(X) \equiv I^S(f) = -\int f \ln f\, d\mu = E_f[-\ln f],

where X is a random variable with density function f (x) and μ is a probability
measure on R. The development of the concept of entropy started in the 19th century
in the field of physics, in particular in describing thermodynamic processes, but the development of the statistical description of entropy by Boltzmann met with strong resistance from many. Shannon's entropy was introduced and used during World War II
in Communication Engineering. Shannon derived the discrete version of I S (f ),
where f is a probability mass function and named it entropy because of its similarity
to thermodynamics entropy. The continuous version was defined by analogy and it is
called differential entropy (Cover and Thomas 2006). For a finite number of points,
Shannon’s entropy measures the expected information of a signal provided without
noise from a source X with density f (x) and is related to the Kullback–Leibler
divergence (Kullback and Leibler 1951) through the following expression:
I^S(f) = I^S(h) - I_X^{KL}(f, h)
where h is the density of the uniform distribution and the Kullback–Leibler divergence
between two densities f (x) and g(x) is given by

I_X^{KL}(f, g) = \int f \ln(f/g)\, d\mu = E_f[\ln(f/g)].   [18.1]

Many generalizations of Shannon’s entropy were hereupon introduced. Rényi’s
(1963) entropy as extended by Liese and Vajda (1987) is given by
I^{R_{lv}, a}(X) \equiv I^{R_{lv}, a}(f) = \frac{1}{a(a-1)} \ln E_f\left[ f^{a-1} \right], \quad a \neq 0, 1.

For more details about entropy measures, the reader is referred to Mathai and
Rathie (1975) and Nadarajah and Zografos (2003).

A measure of divergence is used as a way to evaluate the distance (divergence)
between any two functions f and g associated with the variables X and Y. Among the
most popular measures of divergence are the Kullback–Leibler measure of divergence
given in [18.1] and Csiszar’s ϕ-divergence family of measures (Csiszar 1963; Ali and
Silvey 1966) given by
  
I_{f,g}^{\varphi} = \int g\, \varphi\!\left( \frac{f}{g} \right) d\mu,   [18.2]

where ϕ is a convex function on [0, ∞) such that ϕ(1) = ϕ'(1) = 0 and ϕ''(1) ≠ 0. We also assume the conventions 0ϕ(0/0) = 0 and 0ϕ(u/0) = \lim_{u \to \infty} ϕ(u)/u, u > 0.
The class of Csiszar’s measures includes a number of widely used measures that can
be recovered for appropriate choices of the function ϕ. When the function ϕ is defined
as

ϕ(u) = u log u or ϕ(u) = u log u + 1 − u [18.3]

then the above measure reduces to the Kullback–Leibler measure given in [18.1]. If

ϕ(u) = (1 − u)^2,   [18.4]

Csiszar's measure yields Pearson's chi-square divergence (also known as Kagan's divergence; Kagan (1963)). If

ϕ(u) = ϕ_λ(u) = \left[ u^{λ+1} − u − λ(u − 1) \right] / (λ(λ + 1))   [18.5]

or
ϕ(u) = ϕ^*_λ(u) = \left[ u^{λ+1} − u \right] / (λ(λ + 1)),

we obtain the Cressie and Read power divergence (Cressie and Read 1984), λ ≠ 0, −1.

Another function usually considered in practice is

ϕ(u) = ϕ_a(u) = 1 − \left(1 + \frac{1}{a}\right) u + \frac{u^{1+a}}{a}, \quad a \neq 0.   [18.6]

which is associated with the BHHJ power divergence (Basu et al. 1998) and is a
member of the BHHJ family of divergence measures proposed by Mattheou et al.
(2009), which depends on a general convex function Φ and a positive index a and is
given by
  
D_X^{a}(g, f) = \int g^{1+a}\, \Phi\!\left( \frac{f}{g} \right) d\mu, \quad a > 0, \quad \Phi \in \Phi^*   [18.7]
where μ represents the Lebesgue measure and Φ* is the class of all convex functions Φ on [0, ∞) such that Φ(1) = Φ'(1) = 0 and Φ''(1) ≠ 0. We also assume the conventions 0Φ(0/0) = 0 and 0Φ(u/0) = \lim_{u \to \infty} Φ(u)/u, u > 0.

Appropriately chosen functions Φ(·) give rise to special measures mentioned
above, while for α = 0, the BHHJ family reduces to Csiszar’s family. Expression
[18.7] covers not only the continuous case but also the discrete one where the measure
μ is a counting measure. Indeed, for the discrete case, the divergence in [18.7] is
meaningful for probability mass functions f and g whose support is a subset of the
support Sμ , finite or countable, of the counting measure μ that satisfies μ (x) = 1 for
x ∈ Sμ and 0 otherwise. The discrete version of the Φ−family of divergence measures
is presented in the definition below.

DEFINITION.– For two discrete distributions P = (p_1, ..., p_m) and Q = (q_1, ..., q_m) with sample space Ω = {x : p(x) q(x) > 0}, where p(x) and q(x) are the probability mass functions of the two distributions, the discrete version of the Φ−family of divergence measures with a general function Φ ∈ Φ* and a > 0 is given by

d_a = \sum_{i=1}^{m} q_i^{1+a}\, \Phi\!\left( \frac{p_i}{q_i} \right).   [18.8]

For Φ having the special form given in [18.6], we obtain the BHHJ measure
(Basu et al. 1998) which was proposed for the development of a minimum divergence
estimating method for robust parameter estimation. Observe that for Φ (u) = φ (u) ∈
Φ0 and a = 0, the family reduces to Csiszár’s φ−divergence family of measures,
while for a = 0 and for Φ (u) = ϕλ (u) as in [18.5], it reduces to the Cressie and Read
power divergence measure. Other important special cases of the Φ−divergence family
are those for which the function Φ(u) takes the form
\Phi^1_\lambda(u) = (1 + \lambda)\, \varphi_\lambda(u)   [18.9]

and

\Phi^1_\alpha(u) = \frac{1}{1+a}\, \Phi_a(u) = \frac{1}{1+a}\left( \frac{u^{1+a}}{a} - \left(1 + \frac{1}{a}\right)u + 1 \right).   [18.10]
It is easy to see that for a → 0, the measures Φa (·) and Φ1a (·) reduce to the KL
measure.
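For concreteness, the following short Python sketch evaluates the discrete divergence d_a of [18.8] with the function of [18.6]; the two probability vectors are arbitrary examples and the function names are ours.

import numpy as np

def phi_a(u, a):
    # The BHHJ-type function of equation [18.6].
    return 1.0 - (1.0 + 1.0 / a) * u + u ** (1.0 + a) / a

def d_a(p, q, a):
    # Discrete Phi-divergence of equation [18.8].
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q ** (1.0 + a) * phi_a(p / q, a)))

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.25, 0.25, 0.5])
print(d_a(p, q, a=0.1))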

More examples of φ functions are given in Arndt (2001) and Pardo (2006).
For more details on divergence measures, see Cavanaugh (2004), Toma (2009)
and Toma and Broniatowski (2011). Specifically, for robust inference based on
divergence measures, see Basu et al. (2011) and a paper by Patra et al. (2013) on
the power divergence and the density power divergence families. The discretized
version of measures has been given considerable attention over the years, with some
representative works being by Zografos et al. (1986) and Papaioannou et al. (1994).

18.3. Tests of fit based on Φ−divergence measures

In this section, we discuss the problem of goodness-of-fit tests via divergence
measures. Assume that X1 , ..., Xn are i.i.d. random variables with common
distribution function (d.f.) F . Given some specified d.f. F0 , the classical
goodness-of-fit problem is concerned with testing the simple null hypothesis H0 :
F = F0 . This problem is frequently treated by partitioning the range of data in m
disjoint intervals and by testing a hypothesis based upon the vector of parameters of a
multinomial distribution.

Let P = {E_i}_{i=1,...,m} be a partition of the real line R into m intervals. Let p = (p_1, ..., p_m)' and p_0 = (p_{10}, ..., p_{m0})' be the true and the hypothesized probabilities of the intervals E_i, i = 1, ..., m, respectively, in such a way that p_i = P_F(E_i), i = 1, ..., m, and p_{i0} = P_{F_0}(E_i) = \int_{E_i} dF_0, i = 1, ..., m.

Let Y_1, ..., Y_n be a random sample from F and let n_i = \sum_{j=1}^{n} I_{E_i}(Y_j) with \sum_{i=1}^{m} n_i = n, where

I_{E_i}(Y_j) = \begin{cases} 1 & \text{if } Y_j \in E_i \\ 0 & \text{otherwise} \end{cases}, \quad i = 1, 2, ..., m,

and let \widehat{p} = (\widehat{p}_1, \widehat{p}_2, ..., \widehat{p}_m)' with \widehat{p}_i = n_i/n, i = 1, 2, ..., m, be the MLE of p.
Although the above simple null hypothesis frequently appears in practice, it is
common to test the composite null hypothesis that the unknown distribution belongs
to a parametric family {Fθ }θ∈Θ , where Θ is an open subset in Rk . In this case, we
can again consider a partition of the original sample space with the probabilities of
the elements of the partition depending on the unknown k−dimensional parameter θ.
Then, the problem can be formulated as testing the hypotheses

H0 : p = p0 (θ0 ) vs. H1 : p = p(θ) [18.11]

where θ 0 is the true value of the k-dimensional parameter under the null model
and p0 (θ 0 ) = (p10 (θ 0 ), . . . , pm0 (θ 0 )) . Pearson encountered this problem in the
well-known chi-square test statistic and suggested the use of a consistent estimator
for the unknown parameter. He further claimed that the asymptotic distribution of
the resulting test statistic, under the null hypothesis, is a chi-square random variable
with m degrees of freedom. Later, for the same test, Fisher (1924) established that
the correct distribution has m − 1 degrees of freedom. The result was later discussed
by Neyman (1949) and recently by Menendez et al. (2001). In this case, since the
null distribution depends on the unknown parameter θ, a consistent estimator of θ is
required.

The partition of the data range is a delicate matter since it is frequently associated
with the loss of information. For a thorough investigation on the issue, the interested
reader is referred to the works by Ferentinos and Papaioannou (1979, 1983).

For testing the above null hypotheses, the most commonly used test statistics are
Pearson’s or the chi-squared test statistic and the likelihood ratio test statistic which
are both special cases of the family of power-divergence test statistics (CR test) which
was introduced by Cressie and Read (1984), is based on the measure given in [18.5]
and is given by
I_n^\lambda\!\left(\widehat{p}, p_0(\widehat{\theta})\right) = \frac{2n}{\lambda(\lambda+1)} \sum_{i=1}^{m} \widehat{p}_i \left[ \left( \frac{\widehat{p}_i}{p_{i0}(\widehat{\theta})} \right)^{\lambda} - 1 \right]   [18.12]

= 2n \sum_{i=1}^{m} p_{i0}(\widehat{\theta})\, \Phi_{2,\lambda}\!\left( \frac{\widehat{p}_i}{p_{i0}(\widehat{\theta})} \right),   [18.13]

where λ ≠ −1, 0, −∞ < λ < ∞, p_0(θ̂) = (p_{10}(θ̂), ..., p_{m0}(θ̂))', and
θ̂ is a consistent estimator of θ. Particular values of λ in [18.12] correspond to
well-known test statistics: chi-squared test statistic (λ = 1), likelihood ratio test
statistic (λ → 0), Freeman–Tukey test statistic (λ = −1/2), minimum discrimination
information statistic (Gokhale and Kullback 1978; Kullback 1985) (λ → −1),
modified chi-squared test statistic (Neyman 1949) (λ = −2) and Cressie–Read test
statistic (λ = 2/3).

Although the power-divergence test statistics yield an important family of tests of
fit, it is possible to consider the more general Csiszar's family of φ−divergence test statistics for testing [18.11], which contains [18.12] as a particular case, is based on the discrete form of [18.2] and is defined by

I_n^{\varphi}\!\left(\widehat{p}, p_0(\widehat{\theta})\right) = \frac{2n}{\varphi''(1)} \sum_{i=1}^{m} p_{i0}(\widehat{\theta})\, \varphi\!\left( \frac{\widehat{p}_i}{p_{i0}(\widehat{\theta})} \right)   [18.14]

with φ(x) a convex, twice continuously differentiable function for x > 0 such that φ(1) = 0.

The above family of tests was generalized by Mattheou and Karagrigoriou (2010)
to the following Φ−family of tests which is based on the Φ−divergence measure given
in [18.8]:
I_n^{\Phi}\!\left(\widehat{p}, p_0(\widehat{\theta})\right) = \frac{2n\, \widehat{d}_a}{\Phi''(1)},   [18.15]

\widehat{d}_a = \sum_{i=1}^{m} p_{i0}(\widehat{\theta})^{1+a}\, \Phi\!\left( \frac{\widehat{p}_i}{p_{i0}(\widehat{\theta})} \right), \quad \Phi \in \Phi^*.   [18.16]

Cressie and Read (1984) obtained the asymptotic distribution of the power-divergence test statistic I_n^λ(p̂, p_0(θ̂)) given in [18.12], Zografos et al. (1990) extended the result to the family I_n^Φ(p̂, p_0(θ̂)) for a = 0 and Φ = φ ∈ Φ_0, and Mattheou and Karagrigoriou (2010) extended the result to cover any function Φ ∈ Φ*:

THEOREM 18.1.– (Cressie and Read 1984). Under the null hypothesis H_0: p = p_0 = (p_{10}, ..., p_{m0})', the asymptotic distribution of the Cressie and Read divergence test statistic, I_n^λ(p̂, p_0), is chi-square with m − 1 degrees of freedom:

I_n^\lambda(\widehat{p}, p_0) \xrightarrow[n \to \infty]{L} \chi^2_{m-1}.

THEOREM 18.2.– (Zografos et al. 1990). Under the null hypothesis H_0: p = p_0 = (p_{10}, ..., p_{m0})', the asymptotic distribution of the φ−divergence test statistic, I_n^φ(p̂, p_0), is chi-square with m − 1 degrees of freedom:

I_n^\varphi(\widehat{p}, p_0) \xrightarrow[n \to \infty]{L} \chi^2_{m-1}.

THEOREM 18.3.– (Mattheou and Karagrigoriou 2010). Under the composite null hypothesis H_0: p = p_0(θ_0), the asymptotic distribution of the Φ−divergence test statistic, I_n^Φ(p̂, p_0(θ̂)), divided by a constant c, is chi-square with m − 1 degrees of freedom:

\frac{1}{c}\, I_n^{\Phi}\!\left(\widehat{p}, p_0(\widehat{\theta})\right) \xrightarrow[n \to \infty]{L} \chi^2_{m-1},

where

c = 0.5 \left( \min_i p_{i0}^{a}(\widehat{\theta}) + \max_i p_{i0}^{a}(\widehat{\theta}) \right)   [18.17]

and θ̂ is a consistent estimator of θ.
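As an illustration of Theorem 18.3 in the simple null case, the sketch below computes the Φ-divergence statistic with the function of [18.6] (for which Φ''(1) = 1 + a) and compares it with the chi-square critical value; the cell counts and null probabilities are made up for the example.

import numpy as np
from scipy.stats import chi2

def phi_a(u, a):
    return 1.0 - (1.0 + 1.0 / a) * u + u ** (1.0 + a) / a

def phi_divergence_test(counts, p0, a, level=0.05):
    n = counts.sum()
    p_hat = counts / n
    d_hat = np.sum(p0 ** (1.0 + a) * phi_a(p_hat / p0, a))   # equation [18.16]
    c = 0.5 * (np.min(p0 ** a) + np.max(p0 ** a))            # equation [18.17]
    stat = 2.0 * n * d_hat / ((1.0 + a) * c)                 # (1/c) I_n^Phi
    crit = chi2.ppf(1.0 - level, df=len(p0) - 1)
    return stat, crit, stat > crit

counts = np.array([18, 22, 27, 33])
p0 = np.full(4, 0.25)
print(phi_divergence_test(counts, p0, a=0.1))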

For the case of the simple null hypothesis, the theorem is adjusted accordingly and
the asymptotic distribution is therefore chi-square with m − 1 degrees of freedom. For
the fixed alternative hypothesis, the power is given in the theorem below:

THEOREM 18.4.– The power of the test H_0: p_i = p_{i0} vs H_a: p_i = p_{ib}, i = 1, ..., m, using the test statistic [18.15] with p_i(θ̂) = p_{ib} is approximately equal to

\gamma_a = P\left( Z \geq \frac{\Phi''(1)\, c\, \chi^2_{m-1,\alpha} + 2N\, \Phi(1) \sum_{i=1}^{m} p_{i0}^{1+a} - 2N\, d_a}{2\sqrt{N}\, \sigma_a} \right),   [18.18]

where Z is a standard normal random variable and

\sigma_a^2 = \sum_{j=1}^{m} p_{jb} \left[ p_{j0}^{a}\, \Phi'\!\left( \frac{p_{jb}}{p_{j0}} \right) \right]^2 - \left[ \sum_{j=1}^{m} p_{jb}\, p_{j0}^{a}\, \Phi'\!\left( \frac{p_{jb}}{p_{j0}} \right) \right]^2.

It is known that it is not always possible to determine the asymptotic distribution
under any alternative. Here, we will provide the asymptotic distribution under
contiguous alternatives. Suppose that the simple null hypothesis indicates that pi =
pi0 , i = 1, 2, . . . , m when in fact it is pi = pib , ∀i. As is well known, if pi0 and pib are
fixed, then as n tends to infinity, the power of the test tends to 1. In order to examine
the situation when the power is not close to 1, we must make it continually harder
for the test as n increases. This can be done by letting the alternative hypothesis move steadily closer to the null hypothesis. As a result, we define a sequence of alternative
hypotheses as follows

H_{1,n}: p = p_n = p_0 + d/\sqrt{n}   [18.19]

where p_n = (p_{1n}, ..., p_{mn})' and d = (d_1, ..., d_m)' is a fixed vector such that \sum_{i=1}^{m} d_i = 0. This hypothesis is known as the Pitman (local) alternative or local
contiguous alternative to the null hypothesis H0 : p = p0 . Observe that as n tends
to infinity, the local contiguous alternative converges to the null hypothesis at the rate
O(n−1/2 ).

The following theorem by Mattheou and Karagrigoriou (2010) provides the
asymptotic distribution under contiguous alternatives.

THEOREM 18.5.– Under the contiguous alternative hypothesis given in [18.19], the asymptotic distribution of the Φ−divergence test statistic, I_n^Φ(p̂, p_0), divided by a constant c, is a non-central chi-square with m − 1 degrees of freedom:

\frac{1}{c}\, I_n^{\Phi}(\widehat{p}, p_0) \xrightarrow[n \to \infty]{L} \chi^2_{m-1,\delta},

where c = 0.5 \left( \min_i p_{i0}^{a} + \max_i p_{i0}^{a} \right) and the non-centrality parameter is \delta = \sum_{i=1}^{m} d_i^2 / p_{i0}.

Due to the above theorems, the power of the test under the fixed alternative
hypothesis H1 : pi = pib and the local contiguous alternative hypotheses [18.19]
can be easily obtained. For the case of the local contiguous alternative hypotheses, the
power is given by

\gamma_n = P\!\left( I_n^{\Phi}(\widehat{p}, p_0) > \chi^2_{m-1,\alpha} \mid p_i = p_{in},\ i = 1, ..., m \right) = P\!\left( \chi^2_{m-1,\delta} > \chi^2_{m-1,\alpha} \right),

where \chi^2_{m-1,\alpha} is the upper α−percentile of the \chi^2_{m-1} distribution.
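Numerically, this local power is easy to evaluate with the non-central chi-square distribution, as the following sketch (with arbitrary p_0 and d) illustrates.

import numpy as np
from scipy.stats import chi2, ncx2

p0 = np.array([0.25, 0.25, 0.25, 0.25])   # illustrative null probabilities
d = np.array([0.3, -0.1, -0.1, -0.1])     # fixed direction with components summing to zero
delta = np.sum(d ** 2 / p0)               # non-centrality parameter
crit = chi2.ppf(0.95, df=len(p0) - 1)     # upper 5% point of chi-square(m - 1)
power = ncx2.sf(crit, df=len(p0) - 1, nc=delta)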

We close this section with a short discussion about the estimation of the
unknown parameter θ which is a classic inferential problem. Optimal estimating
approaches, like the maximum likelihood estimation, are available in the literature
(e.g. Papaioannou et al. 2007). Here, we focus on the parameter estimator under
the composite hypothesis. Although the traditional MLE can be evaluated and
implemented, we may alternatively consider a wider class of estimators, known as
Φ−divergence estimators. More specifically, the minimum Φ−divergence estimator
of θ is any \widehat{\theta}_\Phi \in \Theta \subseteq \mathbb{R}^k satisfying

d_a(\widehat{\theta}_\Phi) = \min_{\theta \in \Theta} d_a(\theta) = \min_{\theta \in \Theta} \sum_{i=1}^{m} p_{i0}(\theta)^{1+a}\, \Phi\!\left( \frac{\widehat{p}_i}{p_{i0}(\theta)} \right)

for a function Φ ∈ Φ∗ and with p̂i = ni /n. Obviously, the resulting estimator depends
on the Φ-function chosen. Observe that for Φ as in [18.6] or [18.10] and for a → 0,
the resulting estimator is the usual MLE for the grouped data. It should be pointed
out that the function Φ used for the Φ−divergence estimator θ̂Φ does not necessarily
coincide with the Φ-function used for the test statistic which, in general, is written as

$$I_n^{\Phi_1}\bigl(p, p_0(\hat{\theta}_{\Phi_2,\alpha_2})\bigr) = \frac{2n\, \hat{d}_{a_1}(\hat{\theta}_{\Phi_2,\alpha_2})}{\Phi_1''(1)}, \qquad [18.20]$$
where
$$\hat{d}_{a_1}(\hat{\theta}_{\Phi_2,\alpha_2}) = \sum_{i=1}^{m} p_{i0}^{1+a_1}(\hat{\theta}_{\Phi_2,\alpha_2})\, \Phi_1\!\left( \frac{\hat{p}_i}{p_{i0}(\hat{\theta}_{\Phi_2,\alpha_2})} \right), \qquad [18.21]$$

for two, not necessarily different functions Φ1 & Φ2 ∈ Φ∗ . Finally, note that such
a type of estimator has been thoroughly investigated and their asymptotic theory has
been presented in Meselidis and Karagrigoriou (2020). Indeed, the innovative idea
behind the proposal by Meselidis and Karagrigoriou (2020) is the duality in choosing
among the members of the general class of divergences, one for estimating and one for
testing purposes which may not necessarily be the same. In that sense, the divergence
test statistic given in [18.20] offers the greatest possible range of options both for
the strictly convex function Φ and the indicator value α ∈ R. More specifically, if a
parameter θ needs to be estimated, then a function Φ, say Φ2 , and an index α, say
α2 , are used for that purpose and then we proceed with the distance and the testing

problem using a function Φ, say Φ1 , and an index α, say α1 , which, in general, can be
different from those used for the estimation problem. The resulting divergence is given
in [18.20] and [18.21], where θ̂ (Φ2 ,α2 ) is the minimum (Φ2 , α2 ) divergence estimator
which is allowed to be obtained even under restrictions, say c(θ) = 0.
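To make the estimation step concrete, the sketch below computes a minimum Φ-divergence estimate from grouped data under a Gamma model with shape 1 (i.e. an exponential scale model). Since the BHHJ function of [18.6] is not reproduced in this section, the likelihood-ratio-type choice Φ(x) = x log x − x + 1 is used as a stand-in member of Φ∗ (recall that for a → 0 this recovers the grouped-data MLE); the grouping rule and all names are illustrative rather than the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import gamma

def Phi(x):
    """Likelihood-ratio-type member of Phi*: Phi(x) = x*log(x) - x + 1 (with Phi(0) = 1)."""
    x = np.asarray(x, dtype=float)
    xlogx = np.where(x > 0, x * np.log(np.where(x > 0, x, 1.0)), 0.0)
    return xlogx - x + 1.0

def cell_probs(theta, edges):
    """Model cell probabilities p_i0(theta) for a Gamma(shape=1, scale=theta) model."""
    return np.diff(gamma.cdf(edges, a=1.0, scale=theta))

def d_a(theta, p_hat, edges, a):
    """Grouped-data distance d_a(theta) = sum_i p_i0(theta)^(1+a) * Phi(p_hat_i / p_i0(theta))."""
    p0 = cell_probs(theta, edges)
    return np.sum(p0 ** (1.0 + a) * Phi(p_hat / p0))

def min_phi_divergence_estimate(x, m=15, a=0.1):
    """Minimum Phi-divergence estimate of the scale parameter from grouped data."""
    # cells that are equiprobable under a preliminary fit (here: the sample mean)
    edges = gamma.ppf(np.linspace(0.0, 1.0, m + 1), a=1.0, scale=x.mean())
    counts = np.diff(np.searchsorted(np.sort(x), edges, side="right"))
    p_hat = counts / x.size                               # observed frequencies n_i / n
    res = minimize_scalar(d_a, bounds=(1e-3, 100.0),
                          args=(p_hat, edges, a), method="bounded")
    return res.x

rng = np.random.default_rng(1)
sample = rng.gamma(shape=1.0, scale=2.0, size=200)
print(min_phi_divergence_estimate(sample))
```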

18.4. Simulations

The problem of contingency tables or cross-tabulations and their statistical analysis


based on measures of divergence always attracts the attention of researchers with a
plethora of important contributions (see, for example, Kateri et al. 1996; Kateri and
Papaioannou 1997, 2007). Such problems though are often associated with two serious
issues that frequently appear in practice and considerably affect both estimating and
testing procedures, namely censoring and contamination often encountered among
other fields, in survival analysis (see Basu et al. 2006; Vonta and Karagrigoriou 2010;
Sachlas and Papaioannou 2014). For an extensive overview of such issues and their
handling, please refer to Tsairidis et al. (1996, 2001). The emphasis in this section is
on contamination. More specifically, in order to attain a better insight of the behavior
of the proposed divergences used both for estimation and testing purposes, we proceed
further with a simulation study. The null hypothesis considered focuses on the Gamma
distribution with a shape parameter equal to 1, denoted by Γ(1). On the other hand,
as alternative hypotheses, we have used Gamma distributions with shape parameters
equal to 1.5, 4.0 and 10.0 denoted by Γ(1.5), Γ(4) and Γ(10), respectively. In every
case, the scale parameter is chosen to be equal to 1 due to the fact that the distribution
is scale invariant.

The study is implemented not only for the regular case but also for cases where the
data set is contaminated. In this regard, we define ε as the contamination level, with ε ∈
[0, 1]. Thus, the data generating distribution has the form (1 − ε)Γd + εΓc, where Γd is
the dominant and Γc the contaminant Gamma distribution. Note that the contamination
level used is taken to be equal to 0.075. Thus, for the examination of estimators and
test statistics in terms of size of the test (α), we contaminate the null distribution with
observations from the alternative hypotheses and vice versa for the examination of
tests in terms of power (γ). Furthermore, for the implementation, we have considered
a large sample size, n = 200, and N = 100,000 repetitions of the experiment, while
for the partition of the data range, we use ⌈√200⌉ = 15 equiprobable intervals, where
the ⌈·⌉ operator returns the least integer which is greater than or equal to its argument.
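A rough sketch of this data-generating and grouping step, with the contamination level, sample size and number of cells as stated above (the function and variable names are illustrative, not the code used for the study), is the following:

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(2022)

def contaminated_sample(n=200, eps=0.075, shape_dominant=1.0, shape_contaminant=4.0):
    """Draw from the mixture (1 - eps) * Gamma(shape_d, 1) + eps * Gamma(shape_c, 1)."""
    from_contaminant = rng.random(n) < eps
    shapes = np.where(from_contaminant, shape_contaminant, shape_dominant)
    return rng.gamma(shape=shapes, scale=1.0)

n = 200
m = int(np.ceil(np.sqrt(n)))                  # ceil(sqrt(200)) = 15 equiprobable cells
# cell boundaries that are equiprobable under the null hypothesis Gamma(1)
edges = gamma.ppf(np.linspace(0.0, 1.0, m + 1), a=1.0, scale=1.0)
x = contaminated_sample()
counts = np.diff(np.searchsorted(np.sort(x), edges, side="right"))
p_hat = counts / n                            # observed multinomial frequencies
print(m, counts.sum(), p_hat.round(3))
```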

As classical minimum divergence estimators, we use those that can be derived


from the Cressie–Read family for λ = −2, −1, −1/2, 0, 2/3, 1 which are
known as the minimum modified chi-squared (θ̂MCS ), discrimination information
(θ̂MDI ), Freeman–Tukey (θ̂F T ), likelihood ratio (θ̂LR ), Cressie–Read (θ̂CR ) and
chi-squared (θ̂CS ) estimators. On the other hand, the proposed BHHJ family of
estimators (θ̂_(Φ2,α2)) is applied for 13 values of the parameter α2 = 10^{-7}, 0.01,

0.05, 0.10...(0.10)...1.00 and Φ2 as in [18.6]. Furthermore, we have included in our


analysis not only the L2 -distance estimator (θ̂L2 ) which along with the likelihood ratio
serve as benchmark estimators (divergences in general) since they are equivalent with
the BHHJ family for α = 1 and α → 0, respectively, but also the MLE (θ̂MLE ) based
on the ungrouped data.

In reference to the test statistics, we proceed in a similar manner and retrieve from
the Cressie–Read family the classical modified chi-squared M CS(θ̂MCS ), minimum
discrimination information M DI(θ̂MDI ), Freeman–Tukey F T (θ̂F T ), likelihood
ratio LR(θ̂LR ), Cressie–Read CR(θ̂CR ) and Pearson’s chi-squared CS(θ̂CS ) test
statistics along with the proposed T^{α1}_{Φ1}(θ̂_(Φ2,α2)) for α1 = 10^{-7}, 0.01, 0.05,
0.10...(0.10)...1.00 and Φ1 as in [18.6].

The examination of the behavior of the minimum divergence estimators is based


on the mean-squared error (MSE) given by
$$MSE_{\hat{\theta}} = \frac{1}{N} \sum_{l=1}^{N} \left( \hat{\theta}_l^{I} - \theta_0 \right)^2,$$

with θlI being the minimum divergence estimator based on any I divergence for the
lth sample.

Figure 18.1 presents the MSE for the four cases which are associated with
no contamination and contamination from the three alternative distributions. The
minimum divergence estimators are displayed in ascending order following a
counterclockwise direction according to the case where the contaminant distribution
lies far from the null, i.e. when the data are generated from 0.925Γ(1) + 0.075Γ(10).
Results indicate that in terms of MSE, estimators that can be derived from the
Cressie–Read family with λ ≥ 0 along with those that can be derived from the BHHJ
family with small values of α2 have better performance for the no contamination case
and when the contaminant distribution is close to the null (Figures 18.1a and 18.1b).
Note that in these two cases, (θ̂MLE ) has the best performance among all competing
estimators. On the contrary, when the contaminant distribution departs further from the
null (Figures 18.1c and 18.1d), estimators from the BHHJ family with larger values
of α2 and those from the Cressie–Read family with negative values of λ appear to
behave better while the worst results arise for the θ̂MLE . In addition, Figure 18.1
reveals the robustness aspect of the BHHJ and the Cressie–Read estimators since it
is apparent that in the presence of contamination the larger the value of the index α2
and the smaller the value of the parameter λ the smaller the MSE. Finally, note that
in every case, the MSE of θ̂(Φ2 ,α2 ) lies between the MSEs of the θ̂LR and the θ̂L2 .
We should state here that for presentation purposes, the MSE has been multiplied by

100. For more information about robust estimation for grouped data, refer to Basu
et al. (1997), Victoria-Feser and Ronchetti (1997), Lin and He (2006) and Toma
and Broniatowski (2011), while for the mathematical connection of the BHHJ and
Cressie–Read families, refer to Patra et al. (2013).
[Figure 18.1: plots of the MSE (×100) for each estimator under the four settings (a) Γ(1), (b) 0.925Γ(1) + 0.075Γ(1.5), (c) 0.925Γ(1) + 0.075Γ(4), (d) 0.925Γ(1) + 0.075Γ(10); plot content not reproduced.]

Figure 18.1. MSE (×100) for the four cases of contamination regarding the
estimators that can be derived both from the BHHJ and Cressie–Read families

Under the setup of this study, we have m = 15 probabilities of the multinomial
model and k = 1 unknown parameter to estimate; thus, the critical values used
are the asymptotic critical values based on the asymptotic distribution $c\chi^2_{13}$ with
$$c = 0.5\left[\min_i p_{i0}^{\alpha_1}(\hat{\theta}_{(\Phi_2,\alpha_2)}) + \max_i p_{i0}^{\alpha_1}(\hat{\theta}_{(\Phi_2,\alpha_2)})\right],$$
being a generalization of [18.17], for the BHHJ family of test statistics, and the $\chi^2_{13}$
distribution for the classical test statistics that can be derived from the Cressie–Read
family, with a nominal level equal to 0.05.

(a) Γ(1) (b) 0.925Γ(1) + 0.075Γ(1.5)

(c) 0.925Γ(1) + 0.075Γ(4) (d) 0.925Γ(1) + 0.075Γ(10)

Figure 18.2. Size for the four contamination cases regarding the tests
that can be derived from the BHHJ family. For a color version of
this figure, see www.iste.co.uk/zafeiris/data1.zip

In Figure 18.2, we examine under the four aforementioned cases the behavior of
the BHHJ test statistics in terms of size for various values of the indices α1 and α2 ,
while in Table 18.1, the behavior of the classical tests is presented. In general, we
can see that as the index α1 increases, the size decreases, while as the index α2
increases, the size increases as well. Furthermore, we can observe that in the case
where the contaminant distribution lies far from the null (Figure 18.2d), the size
becomes very large, indicating the disastrous effect imposed from the contaminant
distribution to all BHHJ test statistics. This disastrous effect is also apparent in the
classical test statistics. In the case where the contaminant distribution is the Γ(4)
(Figure 18.2c), the BHHJ family of tests discounts the effect of contamination for
values of α1 ≥ 0.8, while the classical tests are largely affected by the contamination
once again. Finally, for the no contamination and contamination from the Γ(1.5), we
can derive the following conclusions about the behavior of the tests. Regarding the
BHHJ family (Figures 18.2a and 18.2b), we can observe that the larger the value of α1 ,

the more conservative the test is, while the best performance appears for α1 ≤ 0.10
and α2 ≥ 0.50. With respect to the classical tests, M DI(θ̂MDI ), F T (θ̂F T ) and
M CS(θ̂MCS ) appear to be conservative, while CS(θ̂CS ) and CR(θ̂CR ) appear to
be liberal. Note that in terms of size, LR(θ̂LR ) appears to have the best performance
among all classical test statistics.

Data distribution FT CR CS LR MDI MCS
Γ(1) 0.04028 0.06538 0.07841 0.04744 0.03783 0.04263
0.925Γ(1) + 0.075Γ(1.5) 0.03943 0.06759 0.08195 0.04775 0.03659 0.04072
0.925Γ(1) + 0.075Γ(4) 0.09546 0.12762 0.14392 0.10539 0.09063 0.09521
0.925Γ(1) + 0.075Γ(10) 0.83558 0.85420 0.85953 0.85953 0.82183 0.68850

Table 18.1. Size for the four contamination cases regarding the classical
tests that can be derived from the Cressie–Read family

(a) Γ(1.5) (b) 0.925Γ(1.5) + 0.075Γ(1)

Figure 18.3. Power for the no contamination and contamination from Γ(1)
cases regarding the tests that can be derived from the BHHJ family. For
a color version of this figure, see www.iste.co.uk/zafeiris/data1.zip

Data distribution FT CR CS LR MDI MCS
Γ(1.5) 0.69412 0.69911 0.70887 0.69061 0.70568 0.74808
0.925Γ(1.5) + 0.075Γ(1) 0.55308 0.57515 0.59049 0.55571 0.56224 0.60448

Table 18.2. Power for the no contamination and contamination from Γ(1) cases
regarding the classical tests that can be derived from the Cressie–Read family

In terms of power, results are presented in Figure 18.3 and Table 18.2 for the BHHJ
and classical tests, respectively. Note that we only present results that are associated
with the Γ(1.5) alternative since in every other case the power reaches the highest level

1 for all tests. As a general conclusion, we can state that the contamination affects the
performance of all tests by notably downgrading their power. Concerning the BHHJ
tests, the best results appear for small values of α1 and large values of α2 , while the
classical modified chi-squared test statistic, M CS(θ̂MCS ), has the best performance
among all classical tests.

Based on the preceding analysis, we proceed further with the comparison of


the tests. Beyond the classical ones, we choose the following four test statistics
from the BHHJ family T 1 ≡ TΦ0.05 1
(θ̂(Φ2 ,0.90) ), T 2 ≡ TΦ0.30
1
(θ̂(Φ2 ,0.30) ), T 3 ≡
0.60 0.90
TΦ1 (θ̂(Φ2 ,0.60) ) and T 4 ≡ TΦ1 (θ̂(Φ2 ,0.30) ). In order to derive solid conclusions
about the behavior of the test statistics in terms of size, we consider Dale’s criterion
(Dale 1986) which involves the following inequality:

|logit(1 − α̂n ) − logit(1 − α)| ≤ d, [18.22]

where logit(p) = log(p/(1 − p)), while α̂n and α are the exact simulated and nominal
sizes, respectively. When [18.22] is satisfied with d = 0.35, the exact simulated size
is considered to be close to the nominal size. For α = 0.05, the exact simulated size is
close to the nominal if α̂n ∈ [0.0357, 0.0695]. This criterion has been used previously
by Pardo (2010) and Batsidis et al. (2016). We apply the criterion not only for α =
0.05 but also for a range of nominal sizes that are of interest, namely α ∈ [0, 0.1].
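Dale's criterion is straightforward to check numerically; a minimal sketch (with illustrative names) is given below.

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def satisfies_dale(alpha_hat, alpha=0.05, d=0.35):
    """Dale's criterion [18.22]: |logit(1 - alpha_hat) - logit(1 - alpha)| <= d."""
    return abs(logit(1.0 - alpha_hat) - logit(1.0 - alpha)) <= d

# For alpha = 0.05 and d = 0.35, exact simulated sizes of roughly 0.036-0.070 satisfy the criterion
print(satisfies_dale(0.04), satisfies_dale(0.08))
```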

Results are presented in Figure 18.4, where the dashed line refers to the situation
where the exact simulated equals the nominal size; thus, lines that lie above this
reference line refer to liberal, while those that lie below refer to conservative test
statistics. Furthermore, the gray area that is depicted in Figure 18.4 refers to Dale’s
criterion; thus, lines that lie in this area satisfy the criterion. From Figures 18.4a
and 18.4b, we observe that in the no contamination case and when the contaminant
distribution is close to the null, besides CS and T 4, every other test satisfies Dale’s
criterion. On a more granular level, we observe that the CR test statistic satisfies
the criterion only for nominal sizes α ≥ 0.03. For the case where the contaminant
distribution is the Γ(4), we can see that the only test that resists the contamination and
satisfies the criterion is T 4. One conclusion that can be derived from Figure 18.4d is
that even though every test fails to satisfy Dale’s criterion, M CS appears to be notably
resistant to the contamination, in relation to all other tests, especially for small nominal
sizes.

Apparently, the actual size of each test differs from the targeted nominal one; thus,
in order to proceed further with the comparison of the tests in terms of power, we
have to make an adjustment. We follow the method proposed in Lloyd (2005) which
involves the so-called receiver operating characteristic (ROC) curves. In particular,
let G(t) = P r(T ≥ t) be the survivor function of a general test statistic T , and
c = inf{t : G(t) ≤ α} be the critical value, then ROC curves can be formulated
by plotting the power G1 (c) against the size G0 (c) for various values of the critical

value c. Note that with G0 (t), we denote the distribution of the test statistic under the
null hypothesis and with G1 (t) under the alternative.
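A minimal sketch of this size adjustment, computed from simulated test statistics under the null and the alternative, is given below (the toy statistics are illustrative and not the chapter's simulated values).

```python
import numpy as np

def empirical_roc(t_null, t_alt):
    """ROC points for a test rejecting for large values: power G1(c) vs. size G0(c) over thresholds c.

    t_null : simulated statistics under H0  -> empirical size  G0(c) = P(T >= c | H0)
    t_alt  : simulated statistics under H1  -> empirical power G1(c) = P(T >= c | H1)
    """
    c_grid = np.sort(np.concatenate([t_null, t_alt]))
    size = np.array([np.mean(t_null >= c) for c in c_grid])
    power = np.array([np.mean(t_alt >= c) for c in c_grid])
    return size, power

rng = np.random.default_rng(0)
t_null = rng.chisquare(df=13, size=5000)
t_alt = rng.chisquare(df=13, size=5000) + 4.0
size, power = empirical_roc(t_null, t_alt)
idx = np.argmin(np.abs(size - 0.05))        # size-adjusted power at an empirical size of 5%
print(size[idx], power[idx])
```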

(a) Γ(1) (b) 0.925Γ(1) + 0.075Γ(1.5)

(c) 0.925Γ(1) + 0.075Γ(4) (d) 0.925Γ(1) + 0.075Γ(10)

Figure 18.4. Exact simulated sizes against nominal sizes for the four cases of
contamination. The gray area depicts the range of exact simulated sizes in which
Dale’s criterion is satisfied. For a color version of this figure, see www.iste.co.uk/
zafeiris/data1.zip

Results are presented in Figure 18.5, from where we can observe that under the
adjustment the test statistics have similar behavior in terms of power for both cases
of no contamination and contamination from the Γ(1), with the performance being
downsized in the latter case. Note also that results under the adjustment differ from
those of the preceding analysis. In particular, we can see that even though from
Figure 18.3 we derived the conclusion that the best results arise for small values of
α1 and large values of α2 in the no contamination case, T1 has the worst performance
among all the BHHJ tests under the adjustment in size. Similar conclusions can be
derived for the classical tests. For example, CS and CR have the worst performance
among the classical tests under the adjustment, although in Table 18.2, results indicate

the opposite. This behavior is explained by the fact that the power of the test is highly
affected from its liberality or not, making the adjustment in size mandatory before
proceeding to the comparison.
[Figure 18.5: empirical ROC curves, plotting empirical power against empirical size, for the panels (a) Γ(1.5), (b) Γ(1.5) magnified, (c) 0.925Γ(1.5) + 0.075Γ(1), (d) 0.925Γ(1.5) + 0.075Γ(1) magnified; plot content not reproduced.]

Figure 18.5. Left: empirical ROC curves for the no contamination and contamination
from Γ(1) cases. Right: the same curves magnified over a relevant range of empirical
sizes. For a color version of this figure, see www.iste.co.uk/zafeiris/data1.zip

In addition, taking into account the results of Figure 18.5, we focus our interest on
the following four tests, two from each family, T 4, M CS, T 3 and M DI which appear
to have the best performance in terms of power. Note that, even though M DI and T 3
closely follow each other in terms of size, T 3 appears to perform better in terms of
power. Additionally, we can see that the performance of T 3 in terms of power closely
follows the performance of M CS and especially when the alternative distribution
is contaminated from the null. Although T 4 appears to have the best performance
among all competing tests in terms of power, we should only consider it when the null
distribution is contaminated from a distribution which is neither far nor close to the
null since in every other case the exact simulated size fails to satisfy Dale’s criterion.

In conclusion, based on the analysis conducted with regard to the two families of
estimators and test statistics, namely the BHHJ and the Cressie–Read families, we
can state the following remarks. For estimation purposes, under contamination, the
best estimators arise for large values of the index α2 and small negative values of the
parameter λ, while the opposite is true when there is no contamination. In relation to
testing procedures, when the null distribution is not contaminated or is contaminated
from a distribution that is close to it, the best test statistics from the BHHJ family arise
for values of the indices α1 and α2 close to 0.50, say between 0.40 and 0.60, while the
most prominent members of the Cressie–Read family arise for values of λ ∈ [−2, −1].
In the case where the contaminant distribution lies neither too close nor too far from
the null, only test statistics that are members of the BHHJ family with large values of
α1 near 0.90 and moderate values of α2 near 0.30 are appropriate choices.

18.5. References

Ali, S.M. and Silvey, S.D. (1966). A general class of coefficients of divergence of one
distribution from another. Journal of the Royal Statistical Society Series B, 28, 131–142.
Arndt, C. (2001). Information Measures. Springer, Berlin, Heidelberg.
Basu, A., Basu, S., Chaudhuri, G. (1997). Robust minimum divergence procedures for count
data models. Sankhyā: The Indian Journal of Statistics Series B (1960–2002), 59(1), 11–27.
Basu, A., Harris, I.R., Hjort, N.L., Jones, M.C. (1998). Robust and efficient estimation by
minimising a density power divergence. Biometrika, 85, 549–559.
Basu, S., Basu, A., Jones, M.C. (2006). Robust and efficient parametric estimation for censored
survival data. Annals of the Institute of Statistical Mathematics, 58, 341–355.
Basu, A., Shioya, H., Park, C. (2011). Statistical Inference: The Minimum Distance Approach.
Chapman & Hall/CRC Press, Boca Raton, FL.
Batsidis, A., Martin, N., Pardo Llorente, L., Zografos, K. (2016). ϕ-divergence based procedure
for parametric change-point problems. Methodology and Computing in Applied Probability,
18(1), 21–35.
Cavanaugh, J.E. (2004). Criteria for linear model selection based on Kullback’s symmetric
divergence. Australian & New Zealand Journal of Statistics, 46, 257–274.
Cover, T.M. and Thomas, J.A. (2006). Elements of Information Theory. John Wiley and Sons,
New York.
Cressie, N. and Read, T.R.C. (1984). Multinomial goodness-of-fit tests. Journal of the Royal
Statistical Society, 5, 440–454.
Csiszar, I. (1963). Eine informationstheoretische Ungleichung und ihre Anwendung auf den
Beweis der Ergodizität von Markoffschen Ketten. Publications of the Mathematical Institute
of the Hungarian Academy of Sciences, 8, 84–108.
D’Agostino, R.B. and Stephens, M.A. (1986). Goodness-of-Fit Techniques. Marcel Dekker,
New York.

Dale, J.R. (1986). Asymptotic normality of goodness-of-fit statistics for sparse product
multinomials. Journal of the Royal Statistical Society. Series B (Methodological), 48(1),
48–59.
Ferentinos, K. and Papaioannou, T. (1979). Loss of information due to groupings. Transactions
of the Eighth Prague Conference on Information Theory, Statistical Decision Functions,
Random Processes, vol. C, 87–94, Reidel, Dordrecht-Boston, MA.
Ferentinos, K. and Papaioannou, T. (1983). Convexity of measures of information and loss
of information due to grouping of observations. Journal of Combinatorics, Information &
System Sciences, 8(4), 286–294.
Fisher, R.A. (1924). The conditions under which χ2 measures the discrepancy between
observation and hypothesis. Journal of the Royal Statistical Society, 87, 442–450.
Ghosh, A., Maji, A., Basu, A. (2013). Robust inference based on divergence. In Applied
Reliability Engineering and Risk Analysis: Probabilistic Models and Statistical Inference,
Frenkel, I., Karagrigoriou, A., Lisnianski, A., Kleiner, A. (eds). John Wiley and Sons,
New York.
Gokhale, D.V. and Kullback, S. (1978). The Information in Contingency Tables, vol. 23. Marcel
Dekker, New York.
Kagan, A.M. (1963). On the theory of Fisher’s amount of information. Soviet Mathematics –
Doklady, 4, 991–993.
Kateri, M. and Papaioannou, T. (1997). Asymmetry models for contingency tables. Journal of
the American Statistical Association, 92(439), 1124–1131.
Kateri, M. and Papaioannou, T. (2007). Measures of symmetry-asymmetry for square
contingency tables. TR07-3, University of Piraeus [Online]. Available at: https://www.
researchgate.net/profile/Takis-Papaioannou-2/publication/255586795_Measures_of_
Symmetry-Asymmetry_for_Square_%20Contingency_Tables/links/543147840cf27
e39fa9eb943/Measures-of-Symmetry-Asymmetry-for-Square-Contingency-Tables.pdf.
Kateri, M., Papaioannou, T., Ahmad, R. (1996). New association models for the analysis of sets
of two-way contingency tables. Statistica Applicata, 8, 537–551.
Kullback, S. (1985). Minimum discrimination information (MDI) estimation. In Encyclopedia
of Statistical Sciences, Volume 5, Kotz, S. and Johnson, N.L. (eds). John Wiley and Sons,
New York.
Kullback, S. and Leibler, R. (1951). On information and sufficiency. Annals of Mathematical
Statistics, 22, 79–86.
Liese, F. and Vajda, I. (1987). Convex Statistical Distances. Teubner, Leipzig.
Lin, N. and He, X. (2006). Robust and efficient estimation under data grouping. Biometrika,
93(1), 99–112.
Lloyd, C.J. (2005). Estimating test power adjusted for size. Journal of Statistical Computation
and Simulation, 75(11), 921–933.
Mathai, A. and Rathie, P.N. (1975). Basic Concepts in Information Theory. John Wiley and
Sons, New York.

Mattheou, K. and Karagrigoriou, A. (2010). A new family of divergence measures for tests of
fit. Australian and New Zealand Journal of Statistics, 52, 187–200.
Mattheou, K., Lee, S., Karagrigoriou, A. (2009). A model selection criterion based on the BHHJ
measure of divergence. Journal of Statistical Planning and Inference, 139, 128–135.
Matusita, K. (1967). On the notion of affinity of several distributions and some of its
applications. Annals of the Institute of Statistical Mathematics, 19, 181–192.
Menéndez, M.L., Morales, D., Pardo, L., Vajda, I. (2001). Approximations to powers
of ϕ-disparity goodness-of-fit. Communications in Statistics – Theory and Methods, 30,
105–134.
Meselidis, C. and Karagrigoriou, A. (2020). Statistical inference for multinomial populations
based on a double index family of test statistics. Journal of Statistical Computation and
Simulation, 90(10), 1773–1792.
Nadarajah, S. and Zografos, K. (2003). Formulas for Renyi information and related measures
for univariate distributions. Information Sciences, 155, 118–119.
Neyman, J. (1949). Contribution to the theory of χ2 test. In Proceedings of the 1st Symposium
on Mathematical Statistics and Probability, University of Berkeley, 239–273.
Papaioannou, T. (1985). Measures of information. In Encyclopedia of Statistical Sciences,
Vol. 5, Kotz, J. (ed.). Wiley, Hoboken, NJ.
Papaioannou, T., Ferentinos, K., Menéndez, M.L., Salicrú, M. (1994). Discretization of
(h,ϕ)-divergences. Information Sciences, 77(3–4), 351–358.
Papaioannou, T., Ferentinos, K., Tsairidis, C. (2007). Some information theoretic ideas useful
in statistical inference. Methodology and Computing in Applied Probability, 9(2), 307–323.
Pardo, L. (2006). Statistical Inference Based on Divergence Measures. Chapman & Hall/CRC,
Boca Raton, FL.
Pardo, J.A. (2010). An approach to multiway contingency tables based on ϕ-divergence test
statistics. Journal of Multivariate Analysis, 101, 2305–2319.
Patra, S., Maji, A., Basu, A., Pardo, L. (2013). The power divergence and the density power
divergence families: The mathematical connection. Sankhya B, 75(1), 16–28.
Renyi, A. (1961). On measures of entropy and information. Proceedings of the Fourth Berkeley
Symposium on Mathematical Statistics and Probability, 1, 547–561.
Sachlas, A. and Papaioannou, T. (2014). Residual and past entropy in actuarial science and
survival models. Methodology and Computing in Applied Probability, 16(1), 79–99.
Shannon, C.E. (1948). A mathematical theory of communication. Bell Systems Technical
Journal, 27(3), 379–423.
Toma, A. (2009). Optimal robust M-estimators using divergences. Statistics and Probability
Letters, 79, 1–5.
Toma, A. and Broniatowski, M. (2011). Dual divergence estimators and tests: Robustness
results. Journal of Multivariate Analysis, 102(1), 20–36.
Tsairidis, C., Ferentinos, K., Papaioannou, T. (1996). Information and random censoring.
Information Science, 92(1–4), 159–174.

Tsairidis, C., Zografos, K., Ferentinos, K., Papaioannou, T. (2001). Information in quantal
response data and random censoring. Annals of the Institute of Statistical Mathematics, 53(3),
528–542.
Victoria-Feser, M. and Ronchetti, E. (1997). Robust estimation for grouped data. Journal of the
American Statistical Association, 92(437), 333–340.
Vonta, F. and Karagrigoriou, A. (2010). Generalized measures of divergence in survival analysis
and reliability. Journal of Applied Probability, 47(1), 216–234.
Zografos, K. and Nadarajah, S. (2005). Survival exponential entropies. IEEE Transactions on
Information Theory, 51, 1239–1246.
Zografos, K., Ferentinos, K., Papaioannou, T. (1986). Discrete approximations to the Csiszár,
Rényi, and Fisher measures of information. Canadian Journal of Statistics, 14(4), 355–366.
Zografos, K., Ferentinos, K., Papaioannou, T. (1990). Divergence statistics: Sampling
properties and multinomial goodness of fit and divergence tests. Communications in
Statistics-Theory and Methods, 19(5), 1785–1802.
PART 3

19

Geographically Weighted Regression for Official Land Prices and their Temporal Variation in Tokyo

This study models Tokyo official land price data using geographically weighted
regression (GWR) and multi-scale GWR (MGWR) models. The GWR model spatially
explores the varying relationships between land prices and the exploratory variables.
Based on the estimated model parameters, the influence of land individuality increases
as the estimated bandwidth parameters in the GWR model decrease. These facts are
also confirmed by the local regression coefficients of the access index, the distance
to the nearest station and residential area dummy variables. The differences between
local coefficients for some convenience indicators, including access time to central
Tokyo and walking distances to nearest stations, tend to increase between the west
and central areas of Tokyo.

19.1. Introduction

In the “price announcement of public land”, published annually in March, the


Ministry of Land, Infrastructure, Transport and Tourism (MLIT) of Japan announces
the prices of standard land on January 1 based on the Land Price Announcement
Law. This announcement not only provides an appropriate index of land prices for
general land transactions, but also plays the role of institutional infrastructure in
the socio-economic context, being applied to national land use plans, including land
acquisition and planning for public work projects. The price announcement of public
land is thus not only the standard for evaluating inheritance and fixed property taxes,
but is also an important index that reflects the actual state of land prices. While strict

Chapter written by Yuta KANNO and Takayuki SHIOHAMA.



evaluation procedures are set for normal land prices, the price announcements of
public land are stationary observations, and since survey points can change frequently,
it is difficult to monitor land prices at the same time points over a long period of time.
Thus, in this study, we analyze the changes in the price announcements for public
land using the geographically weighted regression (GWR) and the multi-scale GWR
(MGWR) models for a total of 38,914 residential use land areas in Tokyo based on
price announcements for public land from 1997 to 2018.

The GWR model was proposed by Brunsdon et al. (1996) and Fotheringham et al.
(1998) as a spatial statistical model considering spatial heterogeneity. In other words,
the GWR model is a local regression model that captures spatial heterogeneity or
non-stationarity by estimating spatially varying regression coefficients. One of the
disadvantages of the GWR model is that the multicollinearity between explanatory
variables occurs when using a common bandwidth for spatial kernels for all
explanatory variables, which yields similar or unstable regression coefficients for the
target area. Hence, various models that extend the GWR model have been proposed
and applied in the literature. For instance, the mixed GWR model is a mixed model
of linear regression and GWR, which attempts to explain both the global variables
common to all observations and the local variations in the characteristics of each site
(Lee et al. 2009). In this study, we overcome the shortcomings of the GWR model
by using the MGWR model, which estimates the local regression coefficients using
variable-specific bandwidth for spatial kernels (Lu et al. 2017) 1.

Various global GWR applications also exist in the literature. For example, Cho
et al. (2006) estimated the GWR model using housing data from Knox County,
Tennessee, showing that the proximity of water areas and parks to housing is reflected
in the price. Helbich et al. (2014) estimated a mixed GWR model that distinguishes
between local and global explanatory variables based on Australian residential data.
Using sample size-based distance measures for spatial kernels, Lu et al. (2015) argued
that the fitting performance is better than that for the usual distance-based kernels,
and constructed a parameter-specific distance metrics-GWR (PSDM-GWR) model
using both distinct bandwidth and metric functions of each explanatory variable.
Additionally, they proposed back-fitting algorithms to fit the generalized linear model
with the parameter estimation of PSDM-GWR models. Lu et al. (2017) estimated
GWR and MGWR models using housing transaction prices in London in 2001,
and showed that the MGWR model is superior in terms of fitting and prediction
accuracy. Recently, several studies extended the GWR and MGWR models to the
space–time dimensions, including that of Huang et al. (2010). LeSage and Pace (2009)
derived estimates focusing on the results of spatiotemporal long-term equilibrium with

1 Both mixed GWR and multi-scale GWR are sometimes referred to as MGWR but, to avoid
confusion, we refer to multi-scale GWR as MGWR.

regard to the use of cross-sectional data and focusing on the dynamics embodied by
time-dependent parameters with regard to the use of spatiotemporal data.

In this study, we assume independence between the different time points and
estimate secular changes under the GWR and MGWR models. This study thus clarifies
the interannual variability of geographical and environmental factors for land prices
by applying the GWR and MGWR models. Our findings are as follows. For over
20 years, the individual factors of land prices increase, as demonstrated by the increase
in the local regression coefficients in Tokyo. Additionally, due to the secular changes
in environmental factors that indicate convenience, the land price differences between
the central and southern parts of the 23 wards, including the surrounding and other
areas, increase. In particular, in the western part of Tokyo, the estimates of the MGWR
model show that the land price differences in the eastern part increase towards the
west. Moreover, the influence of higher land prices was stronger in the southern part
of Kitatama (Northern Tama Area) than in the northeastern part of the 23 wards. There
are also regional differences in land preferences. From the central to the northern
areas of the 23 wards and from the western area of Kitatama to the eastern area of
Nishitama (Western Tama Area), low-rise residential areas that have an emphasis
on the living environment are preferred. Conversely, in the Minamitama (Southern
Tama) area, residential areas and semi-residential areas that have an emphasis on
convenience and commerciality are preferred. Each influence became stronger as time
progressed. Furthermore, the above-mentioned effects significantly changed before
the 2008 financial crisis and remained stable after the crisis.

The rest of this chapter is structured as follows. In section 19.2, we explain


the GWR model, which is a spatial econometric model that considers both
spatial dependence and spatial heterogeneity, and its extension, the MGWR model.
Section 19.3.1 presents the data used to obtain the land price function. In
section 19.3.2, we estimate the non-spatial model by OLS, GWR and MGWR using
published land prices. Additionally, we consider secular changes by visualizing the
spatial prediction distribution of the parameters. Finally, section 19.4 summarizes and
discusses the results.

19.2. Models and methodology

We denote yt (s) by the logarithmic public land price vector of site s ∈ D in region
D at time t. Then, the global or non-spatial model can be expressed as follows:

yt (s) = Xt (s) βt + εt (s),

where Xt (s) is the matrix of the explanatory variables described in the previous
section, βt is the vector of the regression coefficient, including the constant term,
and εt (s) is the error term, which is assumed to be independent at time t and for site

s. Here, x′ denotes the transpose of a matrix or vector x. The regression coefficients of the


non-spatial models are common for all sites.

The GWR uses location-wise estimates to model spatially varying relationships.


Let yt (si ) be the logarithmically transformed official land price of each site si with
n observation sites (si , i = 1, . . . , n) at time t. Further, let the k-dimensional vector
of the explanatory variable be Xt (si ) = [1, x1,t (si ), · · · , xk−1,t (si )] . We denote the
vector of local regression coefficients by βt,i (k × 1) and the error term by εt (si ).
Then, the GWR model can be expressed as:

yt (si ) = Xt (si ) βt,i + εt (si ).

Under the GWR, to estimate β_{t,i} = [β_{0,t,i}, β_{1,t,i}, · · · , β_{k−1,t,i}]′, we use the
generalized least-squares (GLS) method applied to the weighted model:
$$V_{t,i}^{1/2}\, y_t(s) = V_{t,i}^{1/2}\, X_t(s)\, \beta_{t,i} + V_{t,i}^{1/2}\, \varepsilon_t(s).$$

Here, matrix Vt,i is a diagonal matrix and its j-th component vt,i,j is the weight
given to site j:

Vt,i = diag (vt,i,1 , . . . , vt,i,n ) .

The estimator of the local regression coefficient at site i and time t is given by:
$$\hat{\beta}_{t,i} = \left[ X_t(s)'\, V_{t,i}\, X_t(s) \right]^{-1} X_t(s)'\, V_{t,i}\, y_t(s).$$

In the GWR model, it is important to define weight matrix V_{t,i}. To this end, we
use a Gaussian distance-decay function:
$$v_{t,i,j} = \exp\left( -\frac{d_{i,j}^2}{\delta_t^2} \right),$$

where d_{i,j} is the Euclidean distance between i and j, and δ_t is the bandwidth of a common
spatial kernel at time t. Bandwidth δ_t is determined by minimizing the cross-validation
(CV) error of the following equation:
$$\hat{\delta}_t = \operatorname*{argmin}_{\delta_t} \mathrm{CV}(\delta_t), \qquad \mathrm{CV}(\delta_t) = \sum_{i=1}^{n} \left[ y_{t,i} - \hat{y}_{t,\neq i}(\delta_t) \right]^2, \qquad [19.1]$$

where ŷ_{t,≠i}(δ_t) is the predicted value at site i obtained without using the observation at site i. If
the spatial distribution of the observed points is not constant, an adaptive kernel that
adjusts the bandwidth according to the number of samples, not the distance, may be
used; see, for example, Lu et al. (2015).
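A compact sketch of these ingredients, i.e. the Gaussian kernel weights, the location-wise weighted least-squares fit and the leave-one-out CV score of [19.1], is given below (coordinates, data and function names are illustrative; this is not the implementation used in the study).

```python
import numpy as np

def gwr_coefficients(coords, X, y, bandwidth):
    """Local GWR estimates: beta_i = (X' V_i X)^{-1} X' V_i y with Gaussian weights."""
    n, k = X.shape
    betas = np.empty((n, k))
    for i in range(n):
        d2 = np.sum((coords - coords[i]) ** 2, axis=1)   # squared distances d_{i,j}^2
        v = np.exp(-d2 / bandwidth ** 2)                 # Gaussian distance-decay weights
        XtV = X.T * v
        betas[i] = np.linalg.solve(XtV @ X, XtV @ y)
    return betas

def cv_score(coords, X, y, bandwidth):
    """Leave-one-out CV score of [19.1]: squared prediction errors without site i."""
    n = X.shape[0]
    err = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        d2 = np.sum((coords[keep] - coords[i]) ** 2, axis=1)
        v = np.exp(-d2 / bandwidth ** 2)
        XtV = X[keep].T * v
        beta_i = np.linalg.solve(XtV @ X[keep], XtV @ y[keep])
        err[i] = y[i] - X[i] @ beta_i
    return np.sum(err ** 2)

# Toy data: 100 sites, intercept plus one covariate, spatially varying slope
rng = np.random.default_rng(3)
coords = rng.uniform(0, 10, size=(100, 2))
x1 = rng.normal(size=100)
X = np.column_stack([np.ones(100), x1])
y = 1.0 + (0.5 + 0.1 * coords[:, 0]) * x1 + rng.normal(scale=0.1, size=100)
best = min([0.5, 1.0, 2.0, 4.0], key=lambda bw: cv_score(coords, X, y, bw))
print(best, gwr_coefficients(coords, X, y, best)[:3])
```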

Let the explanatory variable for site s_0 be X_t(s_0). Then, the predicted value of
the log official land price becomes:
$$\hat{y}_t(s_0) = X_t(s_0)'\, \hat{\beta}_{t,s_0}, \qquad \hat{\beta}_{t,s_0} = \left[ X_t(s)'\, V_{t,s_0}\, X_t(s) \right]^{-1} X_t(s)'\, V_{t,s_0}\, y_t(s).$$

See, for example, Leung et al. (2000) and Harris et al. (2011). The corresponding
variance of the predictor becomes:

$$\mathrm{Var}[y_t(s_0) - \hat{y}_t(s_0)] = \left[ 1 + X_t(s_0)' \left[ X_t(s)' V_{t,s_0} X_t(s) \right]^{-1} X_t(s)' V_{t,s_0}^2 X_t(s) \left[ X_t(s)' V_{t,s_0} X_t(s) \right]^{-1} X_t(s_0) \right] \sigma_{\varepsilon}^2,$$

where σε2 is the variance of error term ε(s0 ) for site s0 .

Brunsdon et al. (1999) pointed out that, for the GWR and mixed GWR
models, the bandwidths of the common spatial kernels are sometimes restrictive
and the resulting GWR estimates tend to be inflexible. Additionally, Wheeler and
Tiefelsdorf (2005) explained that, under the GWR model, there exists instability
that creates multicollinearity due to the similarities of local explanatory variables.
Hence, Yang (2014) proposed the MGWR model, which applies a distinct bandwidth
for each explanatory variable for the spatial kernels. The MGWR model can
provide more location-specific regression surfaces, which makes it possible to avoid
multicollinearity between variables. In this study, we use the following extended
algorithm, as proposed by Lu et al. (2017).

Step 0: Data formatting: we denote the log land price and the data matrix by y_t and
X_t, respectively, for time t (1 ≤ t ≤ T) and sites i (1 ≤ i ≤ n). Let V_{k,t,i}^(0) be the initial
weight matrix for t, i and the k-th regression coefficient in the GWR model. The
initial kernel bandwidth is set to bw_{k,t}^(0). The required precision is denoted by τ > 0,
and the maximum number of iterations is set as N.

Step 1: Initialization: initial estimates β̂_t^(0) = [β̂_{0,t}^(0), β̂_{1,t}^(0), · · · , β̂_{k−1,t}^(0)]′ are
obtained by the GWR model. Then, we calculate ŷ_{0,t}^(0) = X_{0,t} ◦ β̂_{0,t}^(0), ŷ_{1,t}^(0) =
X_{1,t} ◦ β̂_{1,t}^(0), · · · , ŷ_{k−1,t}^(0) = X_{k−1,t} ◦ β̂_{k−1,t}^(0). Here, X_{h−1,t} denotes the h-th row
of matrix X_t and ◦ is the Hadamard product. We obtain the residual sum of squares,
RSS^(0) = Σ(y_t − Σ_{i=0}^{k−1} ŷ_{i,t}^(0))².

Step 2: Update the n-th estimates using the estimates of the (n − 1)-th iteration
as follows. Here, we re-denote the explanatory variables as X_{l,t} (0 ≤ l ≤ m).

1) Calculate ξ_{l,t}^(n) = y_t − Σ_{j≠l} Latestyhat(ŷ_{j,t}^(n−1), ŷ_{j,t}^(n)), where Σ_{j≠l} denotes
the sum over indices other than l and
$$\mathrm{Latestyhat}\bigl(\hat{y}_{j,t}^{(n-1)}, \hat{y}_{j,t}^{(n)}\bigr) = \begin{cases} \hat{y}_{j,t}^{(n)}, & \text{if } \hat{y}_{j,t}^{(n)} \text{ exists}, \\ \hat{y}_{j,t}^{(n-1)}, & \text{otherwise}. \end{cases}$$

2) We calculate bandwidth bw_{l,t}^(n) using criteria such as the CV scoring method
and obtain weight matrix V_{l,t,i}^(n). Finally, we calculate β̂_{l,t}^(n) by using ξ_{l,t}^(n) and X_{l,t}.

3) We update ŷ_{l,t}^(n) = X_{l,t} ◦ β̂_{l,t}^(n).

Step 3: Using β̂_t^(n) = [β̂_{0,t}^(n), β̂_{1,t}^(n), · · · , β̂_{k−1,t}^(n)]′, we calculate ŷ_t^(n) and RSS^(n)
and obtain the rate of change CVR^(n):
$$\mathrm{CVR}^{(n)} = \frac{\mathrm{RSS}^{(n)} - \mathrm{RSS}^{(n-1)}}{\mathrm{RSS}^{(n-1)}}. \qquad [19.2]$$

If |CVR^(n)| < τ or n ≥ N, the calculation ends. Otherwise, n = n + 1 and the
process is repeated.
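A simplified sketch of this back-fitting loop is given below, reusing a Gaussian kernel and holding the variable-specific bandwidths fixed for brevity (in the algorithm above they are re-selected by CV at every iteration); all names and data are illustrative and this is not the authors' implementation.

```python
import numpy as np

def local_fit(coords, x, target, bandwidth):
    """Univariate GWR fit of a partial residual on a single column x, one site at a time."""
    n = x.shape[0]
    beta = np.empty(n)
    for i in range(n):
        d2 = np.sum((coords - coords[i]) ** 2, axis=1)
        v = np.exp(-d2 / bandwidth ** 2)                    # Gaussian distance-decay weights
        beta[i] = np.sum(v * x * target) / np.sum(v * x * x)
    return beta

def mgwr_backfit(coords, X, y, bandwidths, tol=1e-5, max_iter=50):
    """Back-fitting loop with one bandwidth per column of X."""
    n, k = X.shape
    beta = np.zeros((n, k))
    yhat = np.zeros((n, k))                                 # additive components yhat_l = X_l o beta_l
    rss_old = None
    for _ in range(max_iter):
        for l in range(k):
            partial = y - (yhat.sum(axis=1) - yhat[:, l])   # xi_l = y - sum_{j != l} yhat_j
            beta[:, l] = local_fit(coords, X[:, l], partial, bandwidths[l])
            yhat[:, l] = X[:, l] * beta[:, l]               # Hadamard product X_l o beta_l
        rss = np.sum((y - yhat.sum(axis=1)) ** 2)
        if rss_old is not None and abs(rss - rss_old) / rss_old < tol:   # CVR of [19.2]
            break
        rss_old = rss
    return beta

# Toy data: intercept plus two covariates, each column with its own bandwidth
rng = np.random.default_rng(4)
n = 100
coords = rng.uniform(0, 10, size=(n, 2))
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = 2.0 + (0.3 * coords[:, 0]) * X[:, 1] - 0.5 * X[:, 2] + rng.normal(scale=0.1, size=n)
beta_hat = mgwr_backfit(coords, X, y, bandwidths=[5.0, 1.0, 8.0])
print(beta_hat.mean(axis=0))
```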

19.3. Data analysis

19.3.1. Data

The public announcement of land prices in 2018 was conducted for 47 prefectures
nationwide in Japan, targeting 20,572 areas for urbanization, 1,394 urbanization
control areas, 4,015 other urban planning areas and 19 publicly announced areas
outside the urban planning area, for a total of 26,000 standard land areas. In Tokyo,
there were 2,602 sites and 1,540 residential zones, excluding islands. In this study,
we use the public announcement of land price data for residential zoning in Tokyo
as of January 1, 1997 to 2018. A total of 38,914 data points exist during the 22-year
analysis period. The number of sites subject to the public announcement of land prices
for residential zoning changed annually as needed, varying from 1,200 to 2,000.

We estimate each model using the official land price as the objective variable. As
explanatory variables, we selected the following seven variables: (1) access index of
the target site (minutes), (2) distance to the main nearest station (m), (3) front road
width (m), (4) land area of the target site (m2 ), (5) low-rise residential area dummy,
(6) residential area dummy, and (7) gas equipment dummy. All variables except for
the dummy ones are transformed into logarithmic values.

Figure 19.1 shows a boxplot of the transitions in official land prices for 22 analyzed
years. There are outliers above the boxplot due to the presence of very high land prices.
The average official land price at the analysis sites was 393,000 yen/m2 in 2018,
and the median value was 310,000 yen/m2 . Regarding the time-series transitions,
land prices had been declining since 1997 until the early 2000s and then rose until
2008, after which they showed a downward trend once more due to the effects of
the 2008 financial crisis. In recent years, the upward trend of high official land
price sites has been remarkable. Figure 19.2 shows the distribution of official land
prices in the residential areas of Tokyo in 2018. The highest official land price was
4,010,000 yen/m2 and the lowest was 45,000 yen/m2 . The official land prices are
generally high near the central area of the 23 wards, which is also the center of the
city, but they are not always high within the other 23 wards, except for the Adachi,
Katsushika and Edogawa wards in the northeastern part of Tokyo and Musashino City
and Mitaka City, which are adjacent to the 23 wards to the west. Additionally, the
locations and numbers of public notice points are highly biased by region.

Figure 19.1. Boxplots of official land prices in Tokyo from 1997 to 2018

Figure 19.2. Spatial distributions of Tokyo official land prices in 2018



19.3.2. Results

Here, we perform the parameter estimation using the models presented in


section 19.2, namely the non-spatial model, GWR model and MGWR model. A
non-spatial model is a global regression model whose coefficients are common for all
sites and are estimated by OLS. For GWR and MGWR models, the local regression
parameters that show the spatial patterns and heterogeneity are estimated.

                      Non-spatial model        GWR model (δ̂ = 1.41 km)
                      β̂i         SE            Min       Q1        Median     Q3        Max
Intercept 17.3990 0.1181 2.3637 14.1987 15.4629 18.1178 38.9184
Access index -1.0633 0.0182 -5.4377 -1.1618 -0.5631 -0.2583 1.5504
Distance to station -0.2770 0.0112 -0.4836 -0.2392 -0.1772 -0.1208 0.1633
Front road width 0.1430 0.0218 -0.2991 0.0368 0.0916 0.1664 0.7921
Land area 0.1378 0.0150 -0.7251 -0.0106 0.0488 0.1125 1.3120
Low-rise residential 0.0098 0.0181 -0.6611 -0.1087 -0.0275 0.0456 0.5553
Residential -0.0099 0.0217 -0.6033 -0.0683 0.0153 0.0993 0.4892
Gas equipment -0.2836 0.0273 -2.2197 -0.7126 -0.3128 -0.1007 1.1098

Table 19.1. Regression coefficients for the non-spatial and GWR models for 2018

Table 19.1 shows a comparison of the regression coefficients for the non-spatial
and GWR models. Under the non-spatial model, the low-rise residential area and
residential area dummies are insignificant at the 5% significance level. The local
regression coefficient on the GWR model is estimated for each site, and there is
a range in the distribution of the regression coefficients. If we compare the median
values of the regression coefficients estimated by the GWR model, then the absolute
value of the estimates, which was significant under the non-spatial model, becomes
smaller. Table 19.2 shows a comparison with the MGWR model. If we compare the
regression coefficients on the median values, the estimates for the GWR and MGWR
models take similar values, but smaller absolute values under the non-spatial model.
The range of each regression coefficient, which was large under the GWR model, is
smaller under the MGWR model, probably because of the common bandwidth for
spatial kernels for all explanatory variables. Specifically, this bandwidth might be too
large or too small for each variable in the GWR model and was estimated adequately
under a variable-specific bandwidth for spatial kernels in the MGWR model.

Figure 19.3 shows the time-series transition of the estimated regression coefficients
under the GWR model. The Gaussian distance-decay function is adopted, and the
common bandwidth for the spatial kernels is determined by the CV scoring method,
according to equation [19.1]. Except for the intercept, the range of the regression
coefficients on the access index and gas equipment dummy is larger than for the other
regression coefficients. Additionally, outliers are present for all regression coefficients.

The regression coefficients on the access index, nearest station distance and the gas
equipment dummy took on a negative trend in recent years, similar to the non-spatial
model. This fact indicates that, if these explanatory variables are held at the same level, the
effect of reducing land prices becomes stronger over time. No visual trend is observed
for the coefficients on the other explanatory variables.

                      GWR model              MGWR model
                      Mean       SD          Min       Q1        Median     Q3        Max       bŵ_k,t (km)
Intercept 16.4557 2.6373 12.9792 13.8973 15.1192 19.2707 21.2709 0.58
Access index -0.7830 0.6063 -1.6806 -1.4486 -0.4205 -0.1955 -0.1586 2.52
Distance to station -0.1695 0.0443 -0.2592 -0.1972 -0.1720 -0.1112 -0.0664 3.76
Front road width 0.0972 0.0143 0.0749 0.0833 0.0987 0.1102 0.1172 14.41
Land area 0.0414 0.0758 -0.1907 -0.0063 0.0423 0.0891 0.2231 1.90
Low-rise residential -0.0065 0.0902 -0.3621 -0.0373 0.0047 0.0463 0.1654 2.37
Residential 0.0244 0.0302 -0.0277 -0.0027 0.0165 0.0478 0.0958 6.81
Gas equipment -0.4706 0.3869 -1.1953 -0.8197 -0.4709 -0.0785 0.0500 3.38

Table 19.2. Regression coefficients for the GWR and MGWR models for 2018

Figure 19.3. Boxplots of the transition for the estimated GWR model parameters

Figure 19.4 shows the time-series transition of the estimated regression coefficients
under the MGWR model using boxplots. We use the algorithm of Lu et al. (2017)
for parameter estimation. The bandwidth for the variable-specific spatial kernels is
determined by iterating until the CVR in equation [19.2] converges. Compared to the GWR model,
the range of the regression coefficients is smaller and the vertical length of the boxplot
becomes longer with fewer outliers. In addition to the access index and nearest station
distance, a negative trend can be confirmed for front road width and the gas equipment
dummy, and a positive one for the low-rise residential area dummy. The range of
the boxplots for the access index, nearest station distance, low-rise residential area
dummy and residential area dummy becomes larger over time, which indicates that the

individual factor of land for the explanatory variable becomes stronger with respect
to land price. Additionally, the increase in the range of the constant terms means that
the individual factors of land for the explanatory variables not used in this analysis are
likely increasing 2.

Figure 19.4. Boxplots of the transitions for the


estimated parameters under the MGWR model

Non-spatial model GWR model MGWR model


Kernel bandwidth (km) — 1.41 0.58 – 14.41
Adjusted R2 0.8492 0.9746 0.9821
AICc 365.18 -1632.58 -2024.02
MSE 0.0734 0.0067 0.0040
Prediction accuracy (%) 21.9019 5.9864 4.5662
Moran's I of residuals 0.5004 (p-value 0.0000) 0.0063 (p-value 0.1585) -0.0336 (p-value 1.0000)

Table 19.3. Fitting performances of the models for 2018

Table 19.3 summarizes the fitting performances of the non-spatial, GWR and
MGWR models by the land price function in 2018. The MSE indicates the
mean-squared error, and the prediction accuracy is defined by the following equation
[19.3]:
$$\text{Prediction accuracy} = \frac{1}{n} \sum_{i=1}^{n} \frac{\left| \exp(y_t(s_i)) - \exp(\hat{y}_t(s_i)) \right|}{\exp(y_t(s_i))} \times 100\,(\%). \qquad [19.3]$$
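A minimal helper for this accuracy measure, reading [19.3] as the mean absolute relative error of the back-transformed prices in yen/m², could look as follows (names and the toy values are illustrative).

```python
import numpy as np

def prediction_accuracy(y_log, yhat_log):
    """Mean absolute relative error of the back-transformed prices, in percent."""
    y, yhat = np.exp(np.asarray(y_log)), np.exp(np.asarray(yhat_log))
    return np.mean(np.abs(y - yhat) / y) * 100.0

# Toy check: log-scale predictions within a few percent of the true prices
y_log = np.log(np.array([310_000.0, 393_000.0, 45_000.0]))
yhat_log = y_log + np.array([0.02, -0.05, 0.01])
print(prediction_accuracy(y_log, yhat_log))
```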

2 The individual factors of land include areas of caution on hazard maps, the existence of crime,
local sunshine and noise conditions, as well as the location of garbage collection sites.

Regarding the goodness of fit, it is desirable that the AICc and prediction accuracy
(%) are small and the adjusted R2 is close to 1. As for the spatial correlation of
residuals, it is desirable that Moran’s I is close to 0 because the spatial correlation
cannot be confirmed for the error term if the spatial regression model is fitted properly.
From this table, the MGWR model outperforms the non-spatial and GWR models in
2018.
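For reference, residual Moran's I can be computed as sketched below; the spatial weight specification (here a Gaussian kernel with an arbitrary bandwidth) is an assumption, since the chapter does not state which weights were used.

```python
import numpy as np

def morans_i(residuals, coords, bandwidth=2.0):
    """Moran's I of model residuals with Gaussian-kernel spatial weights (zero diagonal)."""
    e = residuals - residuals.mean()
    d2 = np.sum((coords[:, None, :] - coords[None, :, :]) ** 2, axis=2)
    W = np.exp(-d2 / bandwidth ** 2)          # assumed weight specification, not from the chapter
    np.fill_diagonal(W, 0.0)
    n = e.size
    return (n / W.sum()) * (e @ W @ e) / (e @ e)

# Toy check: spatially unstructured residuals should give Moran's I near zero
rng = np.random.default_rng(5)
coords = rng.uniform(0, 10, size=(200, 2))
e = rng.normal(size=200)
print(morans_i(e, coords))
```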

Figure 19.5 shows the time-series transition of the fitting performance for each
model. The MGWR model has the best fit of the three models every year. Since the
adjusted R2 of the non-spatial model changes to around 0.84, the non-spatial model
can explain a large proportion of the official land price, but residual Moran’s I is
around 0.50. If there is a spatial correlation, the adjusted R2 is overestimated. The
fit of the GWR and MGWR model is significantly better than that of the non-spatial
model, as the adjusted R2 of the GWR model is around 0.97 and the AICc ranges
from −4,000 to −1,500. Since the transition of residual Moran’s I is around 0.03, no
significant spatial correlation is observed. In the MGWR model, the adjusted R2 is
around 0.98 and the AICc is from −5,000 to −2,000, which is even better than for the
GWR model and the MSE and prediction accuracy are also improved. The change in
residual Moran’s I ranges from −0.04 to −0.03, and no significant spatial correlation
is observed even at the 1% level. Additionally, the MSE and prediction accuracy of
the MGWR model and residual Moran’s I are stable.

Figure 19.5. Transition of the performance measure fit

Figure 19.6 shows the time-series transition of the bandwidth for the
variable-specific spatial kernels under the MGWR model. The “GWR” label in the
top-right panel indicates the common bandwidth for spatial kernels under the GWR
model. Under this model, the kernel bandwidth changed from 1.4 km to 1.9 km. After
the burst of the bubble economy, the range of official land prices narrowed during their
downward trend, and has remained stable since then. Under the MGWR model, the

trend in the kernel bandwidth for each explanatory variable was confirmed, regardless
of the trend of official land prices. The kernel bandwidth of the constant term is smaller
than that for the other explanatory variables due to the increase in the individual factors
of land over time that cannot be explained by the explanatory variables included in
this study. Moreover, the kernel bandwidths of the access index, nearest station distance,
front road width, residential area dummy and gas equipment dummy jump from 2013
to 2016. The cause is thought to be a large change in the number of
publicly announced points. The fact that the kernel bandwidth of the front road width
is larger than those of the other explanatory variables and an upward trend can be seen
in recent years indicates that it may become a global explanatory variable. We would
like to consider these cases in future research.

Figure 19.6. Transition of bandwidth for variable-specific kernels under the


MGWR model and the common bandwidth for the GWR model

19.4. Conclusion

This study estimated a land price model using 38,914 sites over 22 years of
residential area of Tokyo. Three models were compared based on their fitting
performances, namely the non-spatial, GWR and MGWR models. We found that the
MGWR model with variable-specific bandwidth for spatial kernels has a better fit than
the non-spatial and GWR models in terms of the adjusted R2 , AICc, MSE, prediction
accuracy and spatial correlation of residuals. The results of the MGWR model and
its visualization confirmed that the individuality of the land, which is a factor of land
price formation, is gradually strengthened by the increase in the range of each local
regression coefficient. The effects of the access index, nearest station distance and
low-rise residential area dummy are remarkable. Additionally, from the increase in
the range of constant terms, the individuality of the land other than the explanatory
variables used in this study strengthened.

A future research project is to build an MGWR model that includes global


explanatory variables and extend it to the space–time dimension. The mixed GWR

model estimates the parameters of the regression model by distinguishing between


global and local explanatory variables, but the local explanatory variables use a
common bandwidth for spatial kernels. The GWR model considers spatial effects
such as spatial autocorrelation and spatial heterogeneity, while the MGWR model
is an advanced version of it. However, because of the nature of the explanatory
variables, they are not always spatially affected. We will consider constructing a mixed
MGWR model that quantitatively measures the characteristics of the explanatory
variables and enhances the consistency of the model in future studies. Additionally,
Huang et al. (2010) and Fotheringham et al. (2015) proposed a geographically and
temporally weighted regression model (GTWR) with weights for both the space and
time dimensions for Calgary, Canada and London, respectively, and reported that the
GTWR model improved the forecasting accuracy. Further, Wu et al. (2019) proposed
an MGTWR model, which is a multi-scale version of the GTWR and estimated the
land price model in Shenzhen, Guangdong Province, China. Analyzing these models is
a future research topic. Moreover, it is necessary to consider the explanatory variables
of the land price function and its form. In particular, as per Chay and Greenstone
(2005) and Heckman et al. (2010), it is necessary to construct a nonlinear model
that considers the size and age of households and the spatial heterogeneity of their
characteristics in estimating the land price function.

19.5. Acknowledgments

This research was supported in part by JSPS KAKENHI Grant Number 18K01706
and Nanzan University Pache Research Subsidy I-A-2 for the 2021 academic year.

19.6. References

Brunsdon, C., Fotheringham, A.S., Charlton, M.E. (1996). Geographically weighted regression:
A method for exploring spatial nonstationarity. Geographical Analysis, 28(4), 281–298.
Brunsdon, C., Fotheringham, A.S., Charlton, M.E. (1999). Some notes on parametric
significance tests for geographically weighted regression. Journal of Regional Science, 39(3),
497–524.
Chay, K.Y. and Greenstone, M. (2005). Does air quality matter? Evidence from the housing
market. Journal of Political Economy, 113(2), 376–424.
Cho, S.H., Bowker, J.M., Park, W.M. (2006). Measuring the contribution of water and green
space amenities to housing values: An application and comparison of spatially weighted
hedonic models. Journal of Agricultural and Resource Economics, 31(3), 485–507.
Fotheringham, A.S., Charlton, M.E., Brunsdon, C. (1998). Geographically weighted regression:
A natural evolution of the expansion method for spatial data analysis. Environment and
Planning A, 30(11), 1905–1927.
Fotheringham, A.S., Crespo, R., Yao, J. (2015). Geographical and temporal weighted regression
(GTWR). Geographical Analysis, 47(4), 431–452.
Harris, P., Brunsdon, C., Fotheringham, A.S. (2011). Links, comparisons and extensions of
the geographically weighted regression model when used as a spatial predictor. Stochastic
Environmental Research and Risk Assessment, 25(2), 123–138.
Heckman, J.J., Matzkin, R.L., Nesheim, L. (2010). Nonparametric identification and estimation
of nonadditive hedonic models. Econometrica, 78(5), 1569–1591.
Helbich, M., Brunauer, W., Vaz, E., Nijkamp, P. (2014). Spatial heterogeneity in hedonic house
price models: The case of Austria. Urban Studies, 51(2), 390–411.
Huang, B., Wu, B., Barry, M. (2010). Geographically and temporally weighted regression for
modeling spatio-temporal variation in house prices. International Journal of Geographical
Information Science, 24(3), 383–401.
Lee, S., Kang, D., Kim, M. (2009). Determinants of crime incidence in Korea: A mixed GWR
approach. World Conference of the Spatial Econometrics Association, Barcelona.
LeSage, J.P. and Pace, R.K. (2009). Introduction to Spatial Econometrics. Chapman and
Hall/CRC, Boca Raton, FL.
Leung, Y., Mei, C.L., Zhang, W.X. (2000). Statistical tests for spatial nonstationarity based on
the geographically weighted regression model. Environment and Planning A, 32(1), 9–32.
Lu, B., Harris, P., Charlton, M., Brunsdon, C. (2015). Calibrating a geographically weighted
regression model with parameter-specific distance metrics. Procedia Environmental
Sciences, 26, 109–114.
Lu, B., Brunsdon, C., Charlton, M., Harris, P. (2017). Geographically weighted regression
with parameter-specific distance metrics. International Journal of Geographical Information
Science, 31(5), 982–998.
Wheeler, D.C. and Tiefelsdorf, M. (2005). Multicollinearity and correlation among local
regression coefficients in geographically weighted regression. Journal of Geographical
Systems, 7(2), 161–187.
Wu, C., Ren, F., Hu, W., Du, Q. (2019). Multiscale geographically and temporally weighted
regression: Exploring the spatiotemporal determinants of housing prices. International
Journal of Geographical Information Science, 33(3), 489–511.
Yang, W. (2014). An extension of geographically weighted regression with flexible bandwidth.
PhD Thesis, St Andrews.
20

Software Cost Estimation Using Machine Learning Algorithms

Software cost estimation is one of the most important problems in software projects. When the project manager estimates the project cost correctly, ambiguities in the project are reduced; otherwise, serious economic problems will arise. As a result of the growth and complexity of software projects, new cost estimation methods are constantly being developed. In this study, the cost of software projects is estimated by testing different machine learning algorithms using the Waikato Environment for Knowledge Analysis (WEKA) data mining software tool. Algorithms were applied to a Chinese dataset taken from the PROMISE data store using a 10-fold cross-validation technique, and the results were evaluated using the correlation coefficient as the performance criterion and error rates such as the mean absolute error (MAE), root mean square error (RMSE), relative absolute error (RAE) and root relative squared error (RRSE). This study made it possible to determine which algorithms can be used for software cost estimation, what the estimation results are when these algorithms are applied to the Chinese dataset and which algorithm works best.

Chapter written by Sukran EBREN KARA and Rüya ŞAMLI.

20.1. Introduction

Software cost estimation can be defined as "predicting the resources required for a software development process". The estimation process includes size estimation, effort estimation, development of initial project schedules and, finally, estimation of the overall project cost. This can be used to generate requests for proposals, contract negotiations, scheduling, monitoring and control (Kumari and Pushkar 2013). Software cost estimation is important for developers and customers.
As a result of the growth and complexity of software projects, new cost estimation
models are constantly being developed. Software cost estimation models are
categorized in different ways. Attarzadeh and Ow (2010) categorized them as
algorithmic and non-algorithmic models, and the main traditional algorithmic
models include the COCOMO, function point, regression model and SLIM.
Non-algorithmic models are based on soft computing, some of which are based on
expert, learning, linguistic and optimization models (Marapelli 2019).
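
As a simple illustration of the algorithmic family, the basic COCOMO model estimates effort from program size alone; the short sketch below (Python) uses the standard organic-mode coefficients purely as an example and is not part of the experiments reported in this chapter.

def basic_cocomo_effort(kloc, a=2.4, b=1.05):
    # Basic COCOMO effort in person-months for an organic-mode project.
    # kloc is the estimated size in thousands of lines of code; the constants
    # a and b are the standard organic-mode values (other project modes differ).
    return a * kloc ** b

print(basic_cocomo_effort(32))   # estimated effort for a 32 KLOC project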

In this study, the cost of software projects is estimated by testing 29 different machine learning algorithms in the WEKA (Waikato Environment for Knowledge Analysis) data mining tool. The algorithms were applied to a Chinese dataset taken from the PROMISE data repository.

20.2. Methodology

20.2.1. Dataset

In this study, we used a Chinese dataset that was taken from the PROMISE
software engineering data repository. The Chinese dataset was added to the
PROMISE repository in 2010. This dataset, although comparatively new, was used
in this study because it consisted of 499 records, which was a large number when
compared with most other publicly available software engineering datasets.
However, it is difficult to provide any further information about this dataset (Bosu
and MacDonell 2019). Both dependent and independent attributes were included in
the Chinese dataset, which were used to estimate the cost of software projects. This
dataset consisted of 19 features: 18 independent variables (ID, AFP, Input, Output,
Enquiry, File, Interface, Added, Changed, Deleted, PDR_AFP, PDR_UFP,
NPDR_AFP, NPDR_UFP, Resource, Dev.Type, Duration and N_effort) and one
dependent variable (Effort). The independent attributes of the dataset determined the
value of the dependent attribute. Table 20.1 presents the statistics of the Chinese
dataset.

Some of the independent variables that were not very important for predicting the effort were removed, thus making the model much simpler and more efficient (Prabhakar and Dutta 2013). For example, the ID and Dev.Type attributes were deleted from the
Chinese dataset because they had no effect on effort estimation. The Chinese dataset
was analyzed by 29 machine learning algorithms. The dataset was randomly divided
into the training set and the test set using a k-fold cross-validation technique.
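
A minimal preprocessing sketch is shown below (Python with pandas, used here only as an illustration; the experiments in this chapter were carried out in WEKA). The file name is a placeholder, and the dropped columns follow the description above.

import pandas as pd

# Placeholder file name for the Chinese dataset from the PROMISE repository.
data = pd.read_csv("china.csv")

# Drop the attributes that have no effect on effort estimation (see text).
data = data.drop(columns=["ID", "Dev.Type"])

# Reproduce the kind of summary shown in Table 20.1 (min, max, mean per attribute).
print(data.agg(["min", "max", "mean"]).T)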
S. no.  Variable    Min   Max      Mean
1       ID          1     499      250
2       AFP         9     17,518   487
3       Input       0     9,404    167
4       Output      0     2,455    114
5       Enquiry     0     952      62
6       File        0     2,955    91
7       Interface   0     1,572    24
8       Added       0     13,580   360
9       Changed     0     5,193    85
10      Deleted     0     2,657    12
11      PDR_AFP     0.3   83.8     12
12      PDR_UFP     0.3   96.6     12
13      NPDR_AFP    0.4   101      13
14      NPDU_UFP    0.4   108      14
15      Resource    1     4        1
16      Dev.Type    0     0        0
17      Duration    1     84       9
18      N_effort    31    54,620   4,278
19      Effort      26    54,620   3,921

Table 20.1. Chinese dataset statistics

20.2.2. Model

The WEKA contains a large number of machine learning algorithms for data
preprocessing, clustering, classification, regression, visualization and feature
selection.
In this study, the cost of software projects is estimated by using machine learning
algorithms in the WEKA. A total of 29 classification algorithms in the WEKA were
applied to the Chinese dataset.

The algorithms under the Meta group, LWL in the Lazy group and Input Mapped
in the Rules group take, in addition to their own parameters, a basic classifier and its
parameters. Therefore, the classifier parameters were changed from the properties
window to get the best performance, and the REP Tree classification algorithm was
chosen for all of them in order to have an accurate comparison.

20.2.3. Evaluating the performance of the model

20.2.3.1. Correlation coefficient


The correlation coefficient is a measure of the direction and strength of the
statistical relationship between dependent and independent variables. Different
correlation coefficients have been developed for different situations. The correlation
coefficient can have a value between -1 and 1. A correlation coefficient of -1 indicates a perfect inverse (negative) relationship between the two variables, a correlation coefficient of 0 indicates that there is no linear relationship between the two variables and a correlation coefficient of 1 indicates a perfect positive relationship between the two variables (Marapelli 2019).

20.2.3.2. Mean absolute error (MAE)


The MAE is used to determine how far the predicted values deviate from the
actual values. The associated expression is given in [20.1]:

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert P_i - A_i \rvert \qquad [20.1]

where Pi is the estimated value, Ai is the actual value and n is the number of
samples.

20.2.3.3. Root mean square error (RMSE)


The RMSE provides the standard deviation of the differences between the
predicted and actual values of the sample. The associated expression is given in
[20.2]:

\mathrm{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (P_i - A_i)^2 } \qquad [20.2]

where Pi is the estimated value, Ai is the actual value and n is the number of
samples.

20.2.3.4. Relative absolute error (RAE)


The RAE is the sum of the absolute differences between the predicted and actual values, divided by the sum of the absolute differences between the actual values and the mean of the actual values. The associated expression is given in [20.3]:

\mathrm{RAE} = \frac{ \sum_{i=1}^{n} \lvert P_i - A_i \rvert }{ \sum_{i=1}^{n} \lvert A_i - A_m \rvert } \qquad [20.3]

where Pi is the estimated value, Ai is the actual value, Am is the mean of the actual values and n is the number of samples.

20.2.3.5. Root relative squared error (RRSE)


The RRSE of an individual model j is defined by equation [20.4]:

\mathrm{RRSE}_j = \sqrt{ \frac{ \sum_{i=1}^{n} (P_{ij} - A_i)^2 }{ \sum_{i=1}^{n} (A_i - A_m)^2 } } \qquad [20.4]

where Pij is the value predicted by model j for data point i, Ai is the actual value, Am is the mean of the actual values and n is the number of samples.
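
All four error measures and the correlation coefficient can be computed directly from paired vectors of actual and predicted effort values. The following sketch (Python with NumPy, an illustration rather than the WEKA implementation) follows the definitions in equations [20.1]–[20.4].

import numpy as np

def regression_metrics(actual, predicted):
    # Returns the correlation coefficient, MAE, RMSE, RAE (%) and RRSE (%).
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    a_mean = a.mean()
    corr = np.corrcoef(a, p)[0, 1]
    mae = np.mean(np.abs(p - a))                                            # [20.1]
    rmse = np.sqrt(np.mean((p - a) ** 2))                                   # [20.2]
    rae = 100 * np.sum(np.abs(p - a)) / np.sum(np.abs(a - a_mean))          # [20.3]
    rrse = 100 * np.sqrt(np.sum((p - a) ** 2) / np.sum((a - a_mean) ** 2))  # [20.4]
    return corr, mae, rmse, rae, rrse

# Toy usage with made-up effort values.
print(regression_metrics([26, 300, 1500, 5000], [40, 280, 1600, 4700]))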

20.3. Results and discussion

In this study, the Chinese dataset was used to estimate the software cost. The
WEKA, which is a data mining tool, was used in the experiments. Datasets were
randomly divided into training and test data using a 10-fold cross-validation
technique. The performance measurements of the developed models were evaluated
based on the correlation coefficient, MAE, RMSE, RAE and RRSE.
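
WEKA performs this evaluation internally; as a rough analogue outside WEKA, the following sketch (Python with scikit-learn) runs 10-fold cross-validation with a support vector regressor, which plays a role similar to SMOreg, and derives the same performance measures from the out-of-fold predictions. The file name, column names and parameter values are illustrative assumptions.

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

data = pd.read_csv("china.csv").drop(columns=["ID", "Dev.Type"])   # placeholder file name
X, y = data.drop(columns=["Effort"]), data["Effort"]

# SVR is used here as a stand-in for WEKA's SMOreg (both are SVM regressors).
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0))
cv = KFold(n_splits=10, shuffle=True, random_state=1)
pred = cross_val_predict(model, X, y, cv=cv)

a, p = y.to_numpy(dtype=float), np.asarray(pred, dtype=float)
corr = np.corrcoef(a, p)[0, 1]
mae = np.mean(np.abs(p - a))
rmse = np.sqrt(np.mean((p - a) ** 2))
rae = 100 * np.sum(np.abs(p - a)) / np.sum(np.abs(a - a.mean()))
rrse = 100 * np.sqrt(np.sum((p - a) ** 2) / np.sum((a - a.mean()) ** 2))
print(corr, mae, rmse, rae, rrse)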

The algorithms under the Meta group, LWL in the Lazy group and Input Mapped
in the Rules group take, in addition to their own parameters, a basic classifier and its
parameters. Therefore, the REP Tree classification algorithm was chosen for all of
them in order to have an accurate comparison. The results are presented in
Table 20.2.

Table 20.2 presents the performance evaluation results of the machine learning
algorithms applied to the Chinese dataset. The results reveal that the SMOreg
algorithm obtains the best estimation result. The correlation coefficient is 0.9897,
the MAE is 271.9954 and the RAE is 7.3511%. In addition to the SMOreg
algorithm, the Linear Regression, Simple Linear Regression, M5P, M5 Rules and
Random Committee algorithms also performed relatively well. The Linear
Regression algorithm was the second best performing algorithm, with a correlation
coefficient of 0.9889, MAE of 362.939 and RAE of 9.809%. As shown in Table
20.3, the correlation coefficient value of the M5P algorithm on the Chinese dataset
is high and the RMSE, RAE, RRSE and MAE values are low. The M5P algorithm
performed well in general.

Chinese dataset

Algorithm                            Correlation   MAE        RMSE       RAE (%)   RRSE (%)
                                     coefficient

Functions
Gaussian Processes                   0.9301        2314.077   2908.9407  62.5417   44.8027
Linear Regression                    0.9889        362.939    968.6259   9.809     14.9185
Multilayer Perceptron                0.9673        473.8363   1697.273   12.8062   26.1409
Simple Linear Regression             0.9833        414.1948   1176.6757  11.1943   18.1228
SMOreg                               0.9897        271.9954   939.2438   7.3511    14.466

Lazy
IBK (K-nearest neighbor)             0.8398        1409.0681  3574.2831  38.0824   55.0501
KStar                                0.9637        623.0072   1748.6995  16.8378   26.933
LWL                                  0.9491        607.4129   2040.737   16.4163   31.4309

Meta
Additive Regression                  0.9523        578.9322   1976.4062  15.6466   30.4401
Attribute Selected Classifier        0.9594        563.1707   1837.6206  15.2206   28.3025
Bagging                              0.9605        509.9053   1804.7185  13.781    27.7958
CVParameter Selection                0.9595        562.0643   1833.9513  15.1907   28.246
Multi Schema                         0.9504        602.9682   2014.6861  16.2962   31.0296
Random Committee                     0.9693        473.8313   1594.3058  12.8061   24.555
Randomizable Filtered Classifier     0.9566        635.8599   1887.3988  17.1852   29.0692
Random SubSpace                      0.963         749.8866   2085.5992  20.2669   32.1218
Regression by Discretization         0.9396        1345.0928  2233.4769  36.3533   34.3994
Stacking                             0.927         721.25     2431.5936  19.493    37.4507
Vote                                 0.9595        562.0643   1833.9513  15.1907   28.246
Weighted Instances Handler Wrapper   0.9595        562.0643   1833.9513  15.1907   28.246

Misc
Input Mapped Classifier              0.9595        562.0643   1833.9513  15.1907   28.246

Rules
Decision Table                       0.9292        1333.4336  2393.9995  36.0382   36.8717
M5 Rules                             0.9762        411.9179   1412.4493  11.1328   21.7541
ZeroR                                -0.156        3700.0519  6492.7821  100       100

Tree
Decision Stump                       0.8155        2302.7444  3747.8124  62.2355   57.7228
M5P                                  0.9847        389.5608   1127.8709  10.5285   17.3711
Random Forest                        0.9592        549.7298   2038.384   14.8574   31.3946
Random Tree                          0.9317        1014.6122  2357.5622  27.4216   36.3105
REP Tree                             0.9595        562.0643   1833.9513  15.1907   28.246

Table 20.2. Performance evaluation of machine learning algorithms in the WEKA


Algorithm                    Correlation   MAE        RMSE       RAE (%)   RRSE (%)
                             coefficient
SMOreg                       0.9897        271.9954   939.2438   7.3511    14.466
Linear Regression            0.9889        362.939    968.6259   9.809     14.9185
M5P                          0.9847        389.5608   1127.8709  10.5285   17.3711
Simple Linear Regression     0.9833        414.1948   1176.6757  11.1943   18.1228
M5 Rules                     0.9762        411.9179   1412.4493  11.1328   21.7541
Random Committee             0.9693        473.8313   1594.3058  12.8061   24.555

Table 20.3. Top six performance evaluation of machine learning algorithms on the Chinese dataset

When the ZeroR algorithm, one of the machine learning algorithms, was applied
to the Chinese dataset for cost estimation using the WEKA tool, it showed the worst
forecast performance. The Decision Stump classification algorithm from the Tree
group performed the worst after the ZeroR algorithm. The correlation coefficient
was 0.8155 and the RAE was 62.2355%.

20.4. Conclusion

Different estimation methods have been developed to estimate the cost of software projects; one of these methods is the machine learning method. In this study, machine learning algorithms were tested on the Chinese dataset that was taken from the PROMISE repository for software cost estimation.

It has been noted that the attributes in the datasets affect the estimation result.
There were 19 attributes in the Chinese dataset. When two attributes that did not
affect the cost were removed from the Chinese dataset, much better performance
values were obtained. Some algorithms were able to take another classifier and its
parameters in addition to their own parameters. The REP Tree algorithm was
selected as the basic classifier, and tests were carried out on all of these algorithms
for an accurate comparison.

The analysis of the test results showed that the best prediction algorithm in the
Chinese dataset was the SMOreg algorithm, with a correlation coefficient of 0.9897
and an RAE of 7.3511%. The ZeroR algorithm showed the worst prediction result.
This study made it possible to obtain information about which machine learning
algorithms could be used for software cost estimation, what the prediction results
might be when these algorithms were applied to the Chinese dataset and which
algorithms worked best.

In future studies, tests will be performed in the WEKA tool using datasets of
software projects prepared with different methodologies. Other artificial intelligence methods, such as genetic algorithms and fuzzy logic, will also be used for the cost estimation of software projects.

20.5. References

Attarzadeh, I. and Ow, S.H. (2010). A novel algorithmic cost estimation model based on soft
computing technique. Journal of Computer Science, 6(2), 117–125.
Bosu, M.F. and MacDonell, S.G. (2019). Experience: Quality benchmarking of datasets used
in software effort estimation. Journal of Data and Information Quality, 11(4), 38.
Kumari, S. and Pushkar, S. (2013). Performance analysis of the software cost estimation
methods: A review. International Journal of Advanced Research in Computer Science
and Software Engineering, 229–238.
Marapelli, B. (2019). Software development effort duration and cost estimation using linear
regression and K-nearest neighbors machine learning algorithms. International Journal of
Innovative Technology and Exploring Engineering (IJITEE), 9(2), 2278–3075.
Prabhakar and Dutta, M. (2013). Application of machine learning techniques for predicting
software effort. Elixir International Journal, 56, 13677–13682.
21

Monte Carlo Accuracy Evaluation of Laser Cutting Machine

A large selection of laser cutting machines is currently available – from 100 USD
entry-level devices to 50,000 USD industrial machines. Generally, the accuracy of
the specific model is characterized by a simple numerical parameter – from 0.3 mm
for simple models to 0.01 mm for industrial models. However, using one parameter
to characterize engraving accuracy may, in some situations, be misleading. This
single parameter may be adequate for evaluating the accuracy of, say, a horizontal
cut, but more parameters may be required to adequately describe the accuracy of the
cut between two arbitrary points. In order to evaluate the practical accuracy of the
different mechanical designs of laser cutting machines, the MAPLE-based software
simulator was designed. By changing the type of the mechanical design and the
values of the parameters of the geometrical sizes of the mechanical members used,
mechanical slacks and mechanical rigidity, a practical evaluation of the resulting cut
accuracy for different parts of the cut can be calculated. This chapter describes
pintograph-based laser cutting machines. The pintograph has recently become popular
due to its simple mechanical implementation that uses inexpensive servo motors
controlled by an inexpensive microcontroller. Despite the simplicity of the
mechanical design, the mathematical model of the real-life pintograph contains a
large number of mechanical and electronic parameters. To evaluate the accuracy of
the pintograph by taking into account slacks in the pintograph’s joints, the previously
designed Monte Carlo software simulator was reworked. Relevant math equations
were created and solved using the MAPLE symbolic software. The simulator takes
into account rod length, slacks in the joints and the servo motor’s resolution. The
simulator operation results are the drawing zone map and the accuracy map in the
drawing zone. By changing the sizes and slacks of the pintograph elements as inputs
of the simulator, it is possible to evaluate the drawing zone and the cutting accuracy.

Chapter written by Samuel KOSOLAPOV.

21.1. Introduction

A pintograph is a lateral (2D) implementation of the 3D harmonograph (Doan 1923). A pintograph uses four rods to move a pen or another instrument relative to a flat drawing surface (Pinterest 2017). Figure 21.1 demonstrates a practical implementation of the pintograph, designed to draw figures on sand (according to the "Sand Clock" design of Joostens and S'heeren (2017)).

The mechanical design of the pintograph has an obvious advantage compared to other designs of this type – 2D models using stepper motors with timing rubber belts, 2D models using stepper motors with lead screws and 2D two-arm models using servo motors – the pintograph must move only lightweight rods and the instrument but not the heavy motor.

Figure 21.1. Practical implementation of the pintograph to draw figures (according to the "Sand Clock" design of Joostens and S'heeren (2017)). For a color version of this figure, see www.iste.co.uk/zafeiris/data1.zip

21.2. Mathematical model of a pintograph

The mechanical design and parameters of a pintograph described by Kosolapov (2017) were reworked to take into account slacks in the joints. The design used to develop the mathematical model of the pintograph is presented in Figure 21.2.
Two motors (marked “M#1” and “M#2”) are positioned on the axis “X”. The
distance of Motor #1 from the origin {0,0} is marked as “L1”, so that the absolute
coordinates of the Motor #1 shaft (axis) are {L1, 0}. The distance between Motor #1
and Motor #2 is marked as “L2”, so that the absolute coordinates of the Motor #2 shaft
are {(L1+L2), 0}.

The pintograph contains four rigid rods that, in most cases, have equal length.
However, the developed model uses the length of all the rods {L3, L4, L5, L6} as
parameters that can be changed.

The bottom left rod is connected to the shaft of Motor #1, so that the angle
between the axis “X” and the bottom left rod is marked as “a”, whereas the bottom
right rod is connected to Motor #2, so that the angle between the axis “X” and the
bottom right rod is marked as “b”.

Figure 21.2. Mechanical design and parameters of a pintograph

The coordinates of the upper part of L3 are marked as {X11, Y11}. The
coordinates of the lower part of L5 are marked as {X12, Y12}. In the ideal design,
L3 and L5 formed a flexible joint, so that X11 = X12 and Y11 = Y12. However, in
the case of slack in the left joint, X12 = X11 + sX1 and Y12 = Y11 + sY1. The
values of sX1 and sY1 effectively described slack in the left joint. The coordinates
of the upper part of L4 are marked as {X21, Y21}. The coordinates of the lower part
of L6 are marked as {X22, Y22}. In the ideal design, L4 and L6 formed a flexible
joint, so that X21 = X22 and Y21 = Y22. However, in the case of slack in the right
joint, X22 = X21 + sX2 and Y22 = Y21 + sY2. The values of sX2 and sY2
effectively described slack in the right joint. The coordinates of the top joint are
marked as {X, Y}. We assume that the instrument (e.g. a laser) is positioned at this
point (the top joint). Considering the mechanical design presented in Figure 21.2,
angles “a” and “b” define the position of the instrument {X, Y}, so the shafts of a
controlled rotating motor can position the instrument {X, Y} in a predictable
manner. The angles of “a” and “b” have some tolerance that must be taken into
account in the model.

The mathematical model of a pintograph must be able to calculate {X, Y} from {a, b} and vice versa. {L1..L6} are the model parameters. The current design has an inherent (and well-known) mathematical ambiguity: for the same angles {a, b}, two possible sets of {X, Y} exist, as can be seen in Figure 21.3: the "upper" position {X, Y} and the "bottom" position {X', Y'}. To prevent this ambiguity, the mathematical model of a pintograph enforces the use of the "upper" configuration only.

Figure 21.3. Ambiguity in {X, Y}



The following two geometrical equations of the pintograph are derived from Figure 21.2. Equation 1 describes the length of the rod L5 in terms of the coordinates {X12, Y12} and {X, Y}. Equation 2 describes the length of the rod L6 in terms of the coordinates {X22, Y22} and {X, Y}.

Equation 1:= (X-X12)*(X-X12)+(Y-Y12)*(Y-Y12) = L5*L5;

Equation 2:= (X-X22)*(X-X22)+(Y-Y22)*(Y-Y22) = L6*L6;

By using the MAPLE procedure "solve", Equation 1 and Equation 2 can be solved symbolically so that {X, Y} can be presented as a formula using temporary variables {X12, Y12} and {X22, Y22} and parameters L5 and L6. The intermediate formula for X derived by MAPLE is presented in Figure 21.4. It is not simple. The formula for Y is also not simple.

To finalize the mathematical model, additional geometrical equations must be used. The relevant additional equations can be presented as:

X12:= L1-L3*cos(a)+sX1;

Y12:= L3*sin(a)+sY1;

X22:= L1+L2+L4*cos(b)+sX2;

Y22:= L4*sin(b)+sY2.

These equations describe “upper” pairs {X12, Y12} and {X22, Y22} of the
pintograph design as it is presented in Figure 21.2 by using parameters L1, L2, L3
and L4, angles “a” and “b” and by using slacks sX1, sY1, sX2 and sY2. The above
equations were then substituted into the formulae for X and Y, and the final
formulae for X and Y were then created. The unrealistically simplified formula for
X for the case when all rods are of length “L” and all slacks are equal to “s” is
presented in Figure 21.5.

It is obvious that even after MAPLE simplification, human-oriented analysis of the formula provided in Figure 21.5 is not possible. The full formulae for "X" and "Y" are too long to be presented here.
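
Instead of the long closed-form expressions, the same forward computation can be carried out numerically: for given motor angles and slacks, {X12, Y12} and {X22, Y22} are computed from the geometry, and {X, Y} is obtained as the upper intersection of the two circles of radii L5 and L6. The sketch below (Python with NumPy, an independent re-implementation rather than the MAPLE/C# code of this chapter) follows Figure 21.2 and the equations above; the example values are illustrative.

import numpy as np

def pintograph_xy(a, b, L1, L2, L3, L4, L5, L6, sX1=0.0, sY1=0.0, sX2=0.0, sY2=0.0):
    # Forward computation of the instrument position {X, Y} for motor angles
    # a and b (in radians), following Figure 21.2 and the relations above.
    x1 = L1 - L3 * np.cos(a) + sX1          # {X12, Y12}: inner joint of the left arm
    y1 = L3 * np.sin(a) + sY1
    x2 = L1 + L2 + L4 * np.cos(b) + sX2     # {X22, Y22}: inner joint of the right arm
    y2 = L4 * np.sin(b) + sY2
    dx, dy = x2 - x1, y2 - y1
    d = np.hypot(dx, dy)
    if d == 0 or d > L5 + L6 or d < abs(L5 - L6):
        return None                         # the two circles do not intersect
    t = (L5 ** 2 - L6 ** 2 + d ** 2) / (2 * d)
    h2 = L5 ** 2 - t ** 2
    if h2 < 0:
        return None
    h = np.sqrt(h2)
    mx, my = x1 + t * dx / d, y1 + t * dy / d
    # Two geometrically possible intersections; keep the "upper" one (larger Y),
    # as enforced in the text to remove the ambiguity shown in Figure 21.3.
    candidates = [(mx + h * dy / d, my - h * dx / d),
                  (mx - h * dy / d, my + h * dx / d)]
    return max(candidates, key=lambda p: p[1])

# Example: equal rod lengths, no slack (units in mm, illustrative only).
print(pintograph_xy(np.radians(60), np.radians(120),
                    L1=30, L2=100, L3=100, L4=100, L5=100, L6=100))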

A preferable (and inexpensive) implementation of a pintograph uses two servo motors (M#1 and M#2, as presented in Figure 21.2). It is assumed that these servo motors can change the shaft angles ("a" and "b") in the range {0..180°}. In addition,
we must take into account that possible angles of the servo motor shaft, controlled
by a microcontroller, can only be set to a limited number of values. To resolve this problem, an additional parameter, the "resolution" of the servo motor, must be added to the model. The angles ("a" and "b") also have some tolerances, which must be considered in the real-life model of a pintograph in addition to the tolerances of {L1..L6}.

Figure 21.4. Intermediate formula for “X”

The resulting mathematical model of a pintograph design (presented in Figure 21.2) has 15 parameters to consider: six lengths {L1..L6}, two angles "a" and "b" and their tolerances, the resolution of the servo motor and four slacks sX1, sY1, sX2 and sY2. As a result, the actual position {X, Y} of the instrument is shifted in a pseudo-random manner.
Figure 21.5. Simplified formulae for "X" and "Y" (all rods' lengths set to L, all slacks set to s)

21.3. Monte Carlo simulator

To simulate the operation of a real-life pintograph, considering the additions described above, an earlier designed software simulator was modified accordingly. The simulator was implemented using Visual Studio 2019 as a C# .NET desktop application. The formulae for "X" and "Y", created by MAPLE, were converted to C# code by using the MAPLE "CodeGeneration" package. The code for "X" has about 1500 "words" and the code for "Y" has about 1300 "words"; hence, it will not be demonstrated here.

A preferable (and inexpensive) implementation of a pintograph uses two servo motors (M#1 and M#2, as presented in Figure 21.2). It is assumed that these servo motors can change the shaft angles ("a" and "b") in the range {0..180°}.

In addition, we must take into account that possible angles of the servo motor’s
shaft, controlled by a microcontroller, can only be set to a limited number of values.

Hence, an additional parameter describing the "resolution" (the number of possible angles that the servo motor can be positioned to) of the servo motors was added to the model. In addition, the tolerances of angles "a" and "b" were considered in the current mathematical model of a pintograph.

As stated previously, the resulting mathematical model of a pintograph design, presented in Figure 21.2, has 15 parameters to consider: six lengths {L1..L6}, two angles "a" and "b" and their tolerances, the resolution of the servo motors and four slacks sX1, sY1, sX2 and sY2.

It is clear that in order to get a real-life evaluation of the “design in test”, the
Monte Carlo approach must be used. Thus, an additional parameter, the “number of
Monte Carlo tests”, was added to the software simulator. When “number of Monte
Carlo tests” is set to 1, and all tolerances are set to 0, a software simulator calculates
“X” and “Y” for all geometrically possible values of angles “a” and “b”.
Considering that angles {a, b} in the mathematical model are arguments of nonlinear
functions, the simulator operation results are a non-trivial map of the points that can
be reached by the instrument – not all the points on the XY plane can be reached by
the instrument, so the "points that can be reached" effectively create a "drawing zone", which is a function of the selected {L1..L6}. An exemplary "drawing zone" created by the software simulator with the "resolution" parameter set to 90 is presented in Figure 21.6.

Figure 21.6. Drawing map for L1=30, L2=L3=L4=L5=L6=100 mm. For a color version of this figure, see www.iste.co.uk/zafeiris/data1.zip
In Figure 21.6, all slacks were set to 0. The resolution of the servo motors was set
to 90 possible angles. To make points visible, the drawing option “bold” was selected.
In this case, bundles of five points are drawn. The grid step was set to 10 mm.

However, when the parameter “number of Monte Carlo tests” was set to 100, the
results were significantly different (see Figure 21.7).

Figure 21.7. Drawing map for L1=30, L2=L3=L4=L5=L6=100 mm. For a color version of this figure, see www.iste.co.uk/zafeiris/data1.zip

Figure 21.7 presents the effect of the slacks in the pintograph mechanical
structure. To simplify the analysis, all slacks were set to 0.3. The resolution of the
servo motors was set to 90 possible angles. The number of Monte Carlo tests was set
to 100. In this case, the option “bold” was disabled.

We can see that the individual “points” from Figure 21.6 can now be seen as
“clouds” of points. The sizes of those clouds effectively represent the resulting error
of the mechanical design. We can see that the error is different in the different
regions of the drawing plane of the pintograph. We can also see that the error in the
X and Y directions is different.
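
The sampling step behind these clouds can be sketched as follows (Python with NumPy; a simplified stand-in for the C# simulator that reuses the pintograph_xy() helper from the previous sketch). The rod lengths and slack value follow the text, while the resolution and number of tests are reduced placeholders to keep the sketch fast.

import numpy as np

rng = np.random.default_rng(42)
L = dict(L1=30, L2=100, L3=100, L4=100, L5=100, L6=100)   # rod lengths (mm)
resolution = 45          # selectable servo angles in {0..180°} (90 in the chapter)
slack = 0.3              # assumed maximum slack of each joint component (mm)
n_mc = 20                # Monte Carlo tests per angle pair (100 in the chapter)

angles = np.linspace(0.0, np.pi, resolution)
cloud = []
for a in angles:
    for b in angles:
        for _ in range(n_mc):
            # Draw the four slack components uniformly within the tolerance.
            sX1, sY1, sX2, sY2 = rng.uniform(-slack, slack, size=4)
            # pintograph_xy() is the forward-kinematics helper from the previous sketch.
            p = pintograph_xy(a, b, **L, sX1=sX1, sY1=sY1, sX2=sX2, sY2=sY2)
            if p is not None:
                cloud.append(p)
cloud = np.array(cloud)  # each reachable point becomes a small "cloud" whose spread reflects the error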
21.4. Simulation results

The software simulator created enables us to immediately see the drawing map
for the selected parameters. However, for the user of the laser cutter, it is more
important to evaluate the accuracy of the cut. Thus, the option “draw vertical line”
was added. Two lines as they were drawn by the simulator are presented in
Figure 21.8.

Figure 21.8. Line. Left: number of Monte Carlo tests was set to 1.
Right: number of Monte Carlo tests was set to 100
In Figure 21.8, the resolution of the servo motors was set to 360, whereas all
tolerances were set to 0.3. By using the option “line”, the end user can visually
evaluate the accuracy of the laser cut.

21.5. Conclusion

The software developed for the Monte Carlo simulator enables the evaluation of
the drawing zone and the drawing accuracy of the four-rod pintograph for the
selected set of parameters. Simulator runs reveal that the four-rod pintograph with a
unit length of 100 mm achieves an accuracy of 0.5 mm in the center of the drawing
zone, which is good enough for an inexpensive DIY laser cutting machine or laser
engraving machine. When better accuracy is required, designs with the customer’s
selected sizes and tolerances of pintograph elements can be tested.

21.6. Acknowledgments

This study was supported by a grant from the ORT Braude College Research
Committee under grant number 5000.838.3.3-58.

21.7. References

Doan, R. (1923). The harmonograph as a project in high school physics. School Science and
Mathematics, 23(5), 450–455.
Joostens, I. and S’heeren, P. (2017). Sand clock: A real eye-catcher. Elektor, 1(January &
February), 33–39.
Kosolapov, S. (2017). Monte-Carlo accuracy evaluation of a pintograph-based laser cutting
machine. Paper presented at The 17th Applied Stochastic Models and Data Analysis
International Conference, London, 6–9 June 2017.
Pinterest (2017). The world’s catalog of ideas [Online]. Available at: https://www.pinterest.
com/pin/216032113350089616/.
22

Using Parameters of Piecewise Approximation by Exponents for Epidemiological Time Series Data Analysis

Nowadays, detailed epidemiological data are available in the form of time series
data (or as an array): N[k] – where N is the documented number of events registered
at the equidistant time moments T(k) = To + k*delta (e.g. “Number of newly
reported cases of Covid-19 in the last 24 hours” – published on a daily basis by
WHO). Theoretically, those data can be adequately described by different dynamic
models containing exponential growth and exponential decay elements. Practically,
parameters of those models are not constants – they can change in time because of
many factors like changing hygiene policies, changing social behavior and
vaccination. Hence, it was decided to use a piecewise approach: short sequential
fragments of time series data are approximated by a function containing some
parameters. The above parameters are evaluated for the first time series data
fragment. Then, the next data fragments are processed. As a result, new time series
data (arrays) are created: evaluated sequences of parameters. Those new series can
be considered and analyzed as functions of time. In the simplest example, the
function to be used for every fragment is A + B*exp( alpha* t). The resulting values
of A, B and alpha in that case are time series data – arrays: A[k], B[k] and alpha[k]
known at the equidistant time moments T(k). By plotting those sequences, it can be
seen if the simple growth or decay model is adequate. Significant jumps in values
may point to an interesting event, for example the start of mass vaccination or the
effect of a non-desirable social behavior on the specific date. In order to make
calculations robust, some preliminary filtration and after-filtration can be used (e.g. by using moving average, moving median or other filters such as the Gaussian filter and Bessel filter). A number of practical examples were considered.

Chapter written by Samuel KOSOLAPOV.

22.1. Introduction

A plurality of epidemiological models is known. The simplest (classical) models describe epidemic data in terms of exponential growth and decay; see, for example, the tutorial for students by Okabe and Shudo (2020). More sophisticated mathematical models use a set of linear or nonlinear differential equations with a set of empirical parameters – from the classic SIR epidemic model (using compartments with labels S for Susceptible, I for Infectious and R for Recovered) created by Kermack and McKendrick (1927), to its different modifications listed in a constantly updated list in Wikipedia. The problem with those models is that they consider parameters as constants, which, in the best case, is only a theoretically reasonable assumption. In real life, the parameters of the models are influenced by a number of non-epidemiological factors – for example, by executing (or not executing) some specific policy, by changing hygiene policies, by changing social behavior and by the timing of vaccination. In this chapter, the "local piecewise approach" will be used to analyze epidemiological time series data.

22.2. Deriving equations for moving exponent parameters

In real-life measurements, time series data are typically described as an array of digital values obtained from measurements provided at equidistant time points. In electronics, for example, such an array is treated as a "signal". A large number of DSP (digital signal processing) algorithms are known and widely used. Probably the simplest DSP algorithm is the "moving average" (also called the rolling average or running average). Input data are described by an array Y[k] of a fixed size. Values of Y are known for the equidistant time marks T[k] = To + k*delta; the index k is changed from "1" to "N", where N is the number of measured values of Y. In some programming languages such as C and C++ (typically used in the implementation of DSP algorithms), the index "k" starts from 0 and the maximal index is (N-1); however, in this chapter, the starting index "1" will be used. For the simplest three-tap moving average, the resulting (processed) array is Z[k]. For an arbitrary index "k", Z[k] is calculated as (Y[k-1]+Y[k]+Y[k+1])/3. It is clear that the values of Z[1] and Z[N] cannot be calculated by using this formula. In some implementations, these values are set to 0 or to NaN. In other implementations, the resulting array Z has a size of (N-2). In the latter case, some time shift is created and, thus, this time shift must be taken into account. The resulting Z (if treated as a signal) represents the filtered signal Y.
Practically, more sophisticated filters must be used to filter "noise". The idea of this approach will be used here to describe the signal Y as a sequence of time-shifted piecewise exponents. In that case, the resulting "signals" will be the parameters of the "moving exponents". In the simplest approximation algorithm, data are approximated by an exponential function A*exp(-alpha*t), where A and "alpha" are the parameters to be found. Practically, to find the values of "A" and "alpha", logarithms (and log graphs) are used. However, this approach works only if the values of "A" and "alpha" are constants – at least for the time interval selected for measurements.

To analyze the situation when the parameters of the function used for the approximation may change, the following approach (analogous to the approach used in the "moving average") will be used.

To derive the equations in symbolic form, to provide numerical calculations and to present the results as graphs, MAPLE software was used. In the following description of the selected method, self-explanatory fragments of MAPLE code will be used.

The following function was used to approximate short fragments of original data:

Y := proc(t, A, B, alpha)
  A + B*exp(alpha*t):
end proc

It is known that if three values of some signal Y1, Y2 and Y3 are known for
equidistant points of “t”, “t+delta” and “t+2*delta”, then parameters A, B and alpha
can be found by solving the following equations:

Equ1 := A+B*exp(alpha*t) = Y1
Equ2 := A+B*exp(alpha*(t+delta)) = Y2
Equ3 := A+B*exp(alpha*(t+2*delta)) = Y3

After using MAPLE procedure “solve”

Solution1 := solve({Equ1, Equ2, Equ3}, {A, B, alpha})

and after operating MAPLE simplifications, the following formulae for the
parameters A, B and alpha were obtained:

A := (Y1*Y3-Y2^2) / (-2*Y2+Y3+Y1)
B := (Y2-Y3)*(Y1-Y2)*((Y2-Y3)/(Y1-Y2))^((-delta-t)/delta)/(-2*Y2+Y3+Y1)
alpha := ln ( (Y2-Y3) / (Y1-Y2) ) / delta

It can be seen that formulae for “alpha” and “A” are relatively simple for
practical implementation. However, the equation for “B” is slightly problematic for
the goals of this analysis because, obviously, the value of B depends on the value of
“t”. This behavior can be compensated but to do this, some “starting” moment of “t”
must be set. Hence, in this chapter, only values of A and alpha will be analyzed for
the real data. It can be noted that while for the symbolic calculations the above
formulae are adequate, for the numerical calculations some real-life numerical data
combinations of values may become problematic. For example, if (-2Y2 +Y3+Y1) is
equal to zero, then numerical calculations of A and B cannot be executed. For
numerical calculation of “alpha”, the following “protected” procedure was used:

AlphaF := proc(Y1, Y2, Y3, delta)
  local x, y;
  x := (Y2-Y3)/(Y1-Y2);
  if x <= 0 then y := 0
  else y := evalf( ln(x) / delta )
  end if;
  return y
end proc;

It must be noted that this protection added “impulse noise”. However, this noise
can be effectively eliminated by using the median filter.
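
The same computation can be expressed compactly outside MAPLE; the sketch below (Python with NumPy, an independent re-implementation of the procedures above rather than the code used for the figures) slides a three-sample window over the data and returns the protected "alpha" and the offset "A" as new time series. The toy signal at the end is a simplified re-creation of the synthetic data described in the next section.

import numpy as np

def moving_exponent(y, delta=1.0):
    # Piecewise fit of A + B*exp(alpha*t) over sliding windows of three samples.
    # Returns the arrays alpha[k] and A[k] (endpoints set to NaN), using the
    # closed-form expressions derived above and the same protection against
    # invalid logarithms as the MAPLE procedure AlphaF.
    y = np.asarray(y, dtype=float)
    n = len(y)
    alpha = np.full(n, np.nan)
    A = np.full(n, np.nan)
    for k in range(1, n - 1):
        y1, y2, y3 = y[k - 1], y[k], y[k + 1]
        denom = y1 - 2.0 * y2 + y3
        ratio = (y2 - y3) / (y1 - y2) if y1 != y2 else 0.0
        alpha[k] = np.log(ratio) / delta if ratio > 0 else 0.0   # protected, as in AlphaF
        A[k] = (y1 * y3 - y2 ** 2) / denom if denom != 0 else np.nan
    return alpha, A

# Toy usage with TestA1 = 300, TestK1 = -8, TestA2 = 500, TestK2 = -5.
k = np.arange(1, 65)
test = 300 * (1 - np.exp(k / -8.0)) + (k > 32) * 500 * (1 - np.exp((k - 32) / -5.0))
alpha, A = moving_exponent(test)
print(alpha[10], A[10])   # close to -1/8 and 300 in the left part of the signal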

22.3. Validation of derived equations by using synthetic data

To validate the proposed approach, synthetic data were used. To make the calculations, plotting and analysis simple, the size of the "TestData" array was set to the relatively small number of 64. The values of "TestData" were calculated as a combination of two exponents having different amplitudes and "alpha". The first exponent "started" at time "0", whereas the second exponent started after a time delay equal to 32 time intervals "delta". The following fragment of the code demonstrates how the values of "TestData" were calculated:

for k to arraySize do
  TestData[k] := evalf(TestA1*(1 - exp(k/TestK1))
    + Heaviside(k - (1/2)*arraySize)*TestA2*(1 - exp((k - (1/2)*arraySize)/TestK2)))
end do

Parameters were set as: TestA1 = 300, TestK1 = -8. TestA2 = 500, TestK2 = -5.

Figure 22.1 presents the synthetic data in graphical form. The presented signal is
typical in the field of digital electronic signals.
Figure 22.1. "TestData" created by calculations. Axes "Y": values of "TestData". Axes "X": time ticks in the range {1..64}

From Figure 22.1, it can be seen that “the second exponent” started at moment
32. It can be seen that at this time, “the first exponent” (that started at moment “1”)
practically became a constant.

The array "TestData" was processed in the manner of a "moving average", but instead of the "average", the parameters "Alpha", "A" and "B" were calculated for the different values of the index "k". The parameter TestDataStep was set to "1" here.

for k from 2 to arraySize-1 do
  Zalpha[k] := evalf(AlphaF(TestData[k-1], TestData[k], TestData[k+1], 1));
  Za[k] := evalf(AoF(TestData[k-1], TestData[k], TestData[k+1], 1));
  Zb[k] := evalf(BoF(TestData[k-1], TestData[k], TestData[k+1], 1, k))
end do

This procedure effectively creates new data series (arrays): “Zalpha[k]”, “Za[k]”
and “Zb[k]”. Figure 22.2 presents the array “Zalpha” and Figure 22.3 represents the
array “Za”.

Figure 22.2. Calculated array "Zalpha" after median filtration. Axes "Y": values of "Zalpha". Axes "X": counts in the range {1..64}
Figure 22.3. Calculated array "Za" after median filtration. Axes "Y": values of "Za". Axes "X": counts in the range {1..64}

The values of Zalpha represent the calculated values of the parameter "alpha" for the different moments of time. From Figures 22.2 and 22.3, we can clearly
see that in the left part, “the signal” can be described as an exponent having “alpha”
= -0.125 = -1/8 and magnitude A = 300, whereas in the right part, “the signal” can
be described as an exponent having “alpha”= -0.2 = -1/5, and magnitude
A = 300+500 = 800.

It can be concluded that in this case, the description by a "moving exponent" is adequate, and that the calculated values coincide with the values that were used for the calculations of "TestData".

22.4. Using derived equations to analyze real-life Covid-19 data

Real-life Covid-19 data were downloaded as an Excel file from the site (Ritchie
et al. 2021). This file contains data for a large number of countries. To demonstrate
use of the developed approach, the data concerning Israel and Sweden were used.

Figure 22.4. Total number of Covid-19 cases per million (Israel)



Figure 22.5. Values of “alpha” calculated by using data of Figure 22.4

Figure 22.6. Values of “A” calculated by using data of Figure 22.4

Figure 22.7. Total number of deaths per million (Israel)

Original data for the “total number of Covid-19 cases per million” in Israel
published in the source (Ritchie et al. 2021) were from February 2, 2020 (before that
date no Covid-19 cases were registered in Israel) up to March 20, 2021 – totaling
395 days. The original data were smoothed by using the MAPLE "moving median" filter with a parameter of 5 and the "moving average" filter with a parameter of 20. As a result, the data presented in Figure 22.4 contain only 370 points,
which means that for valid epidemiological analysis more accurate evaluation of the
introduced time shift must be provided. However, the aim of this chapter is to
provide a preliminary evaluation of the proposed approach; hence, the following
results will not be used to derive epidemiological consequences. Figure 22.5 presents the results of calculations of "alpha" for the data presented in Figure 22.4.
Figure 22.6 presents the results of calculations of “A” for the data presented in
Figure 22.4. Arrays “alpha” and “A” were additionally filtered by using the MAPLE
median and averaging filters; hence, the number of valid points was even smaller,
albeit still large enough to be used at least for preliminary analysis of the “method in
test”. It can be seen that values of “alpha” are not constant, hence the simple
exponential growth/decay model cannot be used to describe these real-life data.
It can be assumed that by observing changes of “alpha” from positive to negative
and back, some known mathematical models can be modified. It is important to note
that by visually inspecting the “alpha” graph, a human observer can reveal trend
changes at earlier stages than by using the “original data” graph. It appears that the
graph of “A” is less robust, and thus, is less informative.

Figure 22.8. Values of "alpha" calculated by using data of Figure 22.7. Parameter "TestDataStep" = 1

Figure 22.9. Values of "alpha" calculated by using data of Figure 22.7. Parameter "TestDataStep" = 4

Figure 22.7 presents the “total number of Covid-19 deaths per million” for Israel.
Figure 22.8 presents the results of calculations of “alpha” for the data presented in
Figure 22.7. Parameter “TestDataStep” (as for the previous cases) was set to 1.
Figure 22.9 presents the results of calculations of “alpha” for the data presented in
Figure 22.7. However, in that case, parameter “TestDataStep” was set to 4. It can be
seen that using an increased value of that parameter obviously creates more robust
results, albeit decreasing resolution.

Figure 22.10. Total number of Covid-19 deaths per million (Sweden)

Figure 22.11. Values of “alpha” calculated by using data of Figure 22.10

Figure 22.10 presents the “total number of Covid-19 deaths per million” for
Sweden. Figure 22.11 presents the results of calculations of “alpha” for the data
presented in Figure 22.10. It can be seen that the behavior of graphs for these two
countries differs. Even in its simplest implementation, the proposed method of data approximation using piecewise exponential functions (whose parameters effectively change in time) reveals that the well-known parameter "number of waves" is not as obvious as it appears when visually observing the original data. However, more data must be checked to evaluate the usefulness of the proposed method. In addition, different modifications of this method are to be implemented and tested later.

22.5. Conclusion

Analysis of synthetic and real-life Covid-19 data demonstrates that the proposed
approach can be used to evaluate the validity of mathematical epidemiological
models under test for the different periods of time. Developed equations can be used
for the analysis of other processes for which the description by exponents may be
adequate. However, more real-life data from different countries must be analyzed in
order to recommend an optimal set of the smoothing parameters, and to evaluate the
reliability of the proposed approach for the analysis of real-life data.

22.6. References

Kermack, W. and McKendrick, A. (1927). A contribution to the mathematical theory of epidemics. Proceedings of the Royal Society A, 115(772), 700–721.
Okabe, Y. and Shudo, A. (2020). A mathematical model of epidemics – A tutorial for
students. Mathematics 2020, 8, 1174 [Online]. Available at: www.mdpi.com/journal/
mathematics.
Ritchie, H., Mathieu, E., Rodés-Guirao, L., Appel, C., Giattino, C., Ortiz-Ospina, E.,
Hasell, J., Macdonald, B., Beltekian, D., Roser, M. (2021). Coronavirus source data,
Covid-19 dataset [Online]. Available at: https://ourworldindata.org/coronavirus-source-
data.
Wikipedia (n.d.). Compartmental models in epidemiology [Online]. Available at: https://en.
wikipedia.org/wiki/Compartmental_models_in_epidemiology.
23

The Correlation Between Oxygen Consumption and Excretion of Carbon Dioxide in the Human Respiratory Cycle

General anesthesia and lung damage treatment in novel coronavirus infection and other diseases with mechanical ventilation often require strict monitoring of
oxygen delivery (DO2) and consumption (VO2), as well as CO2 exhalation rate
(VCO2). Oxygen transport consistency can be easily monitored in real time by
means of percutaneous pulse oximetry via the level of blood oxygen saturation in
pulsatile flow (SpO2). The method is based on the difference of absorbed light
wavelength by the blood in red and infrared parts of the spectrum, dependent on the
number of O2 molecules captured by hemoglobin. The control of the O2 content in
the inhaled and exhaled air, and the respiratory removal of CO2 can be carried out
using paramagnetic or fuel cell and infrared absorption sensors, respectively.
Acoustic, thermal, magnetic, ionization and other types of gas analyzers based on
the change in the relevant properties of the measured gases, depending on their
concentration in the mixture, are also used. The algorithm for real-time VO2 and
VCO2 measurements includes the numerical integration of dVO2/dt and dVCO2/dt
instantaneous values during respiratory cycle, derived from the product of certain
gas concentration and total flow instantaneous values. The report discusses such
an approach, emphasizing (i) precision gas concentration and flow signal
synchronization and (ii) taking into account air humidification in the respiratory
tract.

Chapter written by Anatoly KOVALENKO, Konstantin LEBEDINSKII and Verangelina MOLOSHNEVA.


23.1. Introduction

Artificial lung ventilation (ALV) – a method aimed at maintaining gas exchange between the external environment and the body (external respiration) when the body on its own cannot provide this process – is reasonably effective. An example is
the use of this technique during general anesthesia, as well as in the treatment of
respiratory distress, which is a complication of many diseases, including the
infamous coronavirus infection. The probability of developing hypoxia decreases,
and the gas composition of the blood is normalized, with the timely start of
respiratory support (Vasilev et al. 2015).

The state of the body systems responsible for gas exchange can vary as a result
of the dynamics of the pathological process and as a result of changes in
physiological parameters. This means that for effective management of patients on ALV, constant monitoring of the functions of external respiration is required (Petrova et al. 2014).

Of the main parameters registered for respiratory monitoring, the capnogram can
be highlighted, which allows us to estimate the partial pressure of carbon dioxide in
the respiratory mixture, as well as the flow graph, which shows the rate of flow
change and is measured in liters per minute (Vasilev et al. 2015).

There are many parameters that can be used to assess the interaction between the
ventilator and the patient, including data on flow, pressure, breathing volume and
frequency, the ratio of inhalation and exhalation, etc. However, until now, the ordinary arsenal of a doctor's practice has contained no parameters that allow the effectiveness of external respiration to be sufficiently assessed on the monitor. This assessment requires data on the amounts of oxygen taken up and carbon dioxide emitted. Nonetheless, the possibilities of obtaining the necessary data exist. The use of a metabolograph can be taken as an example. A metabolograph is a module that integrates with an artificial lung ventilation machine and allows the amount of energy consumed by the patient to be calculated, which makes it possible to select the daily kcal intake. Its operating principle is based on indirect calorimetry, for which data on carbon dioxide emission and oxygen absorption are required, and this is implemented in the device (Petrova et al. 2014). Therefore, this device can be used to configure an artificial lung ventilation machine and select the best values that ensure maximum gas exchange efficiency (Mihnevich and Kursov 2008).
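
As an illustration of the indirect calorimetry calculation mentioned above (a sketch, not the metabolograph's proprietary algorithm), the abbreviated Weir equation converts measured VO2 and VCO2 into an energy expenditure estimate; the input values below are typical resting figures used only as an example.

def weir_energy_expenditure(vo2_l_min, vco2_l_min):
    # Abbreviated Weir equation for resting energy expenditure (kcal/day),
    # with VO2 and VCO2 expressed in liters per minute.
    return (3.941 * vo2_l_min + 1.106 * vco2_l_min) * 1440.0

print(weir_energy_expenditure(0.25, 0.20))   # roughly 1740 kcal/day for typical resting values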

Since the device was not originally intended for adjusting the parameters of artificial respiration, its readings do not meet all the necessary requirements; in particular, the indicators of gas emission and absorption are averaged and have a display delay. These features strongly limit the use of this device by physicians when setting up an artificial lung ventilation machine. Assessing the efficiency of external respiration is thus an important task, which it is proposed to solve by monitoring in real time the concentrations of oxygen and carbon dioxide in the inhaled and exhaled gas mixtures (Naumenko et al. 2018).
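
As a simple numerical illustration of this approach (a sketch under stated assumptions, not the algorithm of any particular monitor), breath-by-breath O2 uptake and CO2 output can be obtained by integrating the product of the instantaneous gas fraction and the instantaneous total flow over one respiratory cycle. The following Python sketch uses synthetic, idealized waveforms; all shapes and numbers are illustrative assumptions.

import numpy as np

def integrate_trapezoid(y, t):
    # Numerical (trapezoidal) integration of y(t).
    y, t = np.asarray(y, dtype=float), np.asarray(t, dtype=float)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t)))

def gas_exchange_per_breath(t, flow, f_o2, f_co2):
    # t: time stamps (s); flow: total flow (L/s), positive during inspiration and
    # negative during expiration; f_o2, f_co2: instantaneous gas fractions.
    # The gas and flow signals are assumed to be already synchronized in time,
    # which the text identifies as a key requirement.
    vo2 = integrate_trapezoid(f_o2 * flow, t)      # inspired O2 minus expired O2 (L)
    vco2 = -integrate_trapezoid(f_co2 * flow, t)   # expired CO2 minus inspired CO2 (L)
    return vo2, vco2

# Synthetic 4 s breath: 1.5 s inspiration, 2.5 s expiration (illustrative only).
t = np.linspace(0.0, 4.0, 400)
flow = np.where(t < 1.5, 0.5 * np.sin(np.pi * t / 1.5),
                -0.3 * np.sin(np.pi * (t - 1.5) / 2.5))
f_o2 = np.where(t < 1.5, 0.21, 0.16)     # inspired vs. (simplified) expired O2 fraction
f_co2 = np.where(t < 1.5, 0.0, 0.045)    # inspired vs. (simplified) expired CO2 fraction
print(gas_exchange_per_breath(t, flow, f_o2, f_co2))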

23.2. Respiratory function physiology: ventilation–perfusion ratio

To obtain a sufficient amount of energy, cells need oxygen, which they receive from the blood; the blood, in turn, receives it from the external environment. In the process of obtaining energy, carbon dioxide is formed, which must be removed from the body (Gabdulkhakova et al. 2016). The process of exchange of gas molecules between the external environment and the body is called external respiration, and the main role in it is played by the lungs, which are located in a sealed pleural cavity. The lungs bring the blood into contact with a mixture of gas from the external environment in special alveolar sacs, which have an extremely thin wall. The gas mixture enters the alveoli through the air-conducting system – the bronchi and bronchioles. Blood flows through the arteries, which divide into capillaries at the alveoli and are then collected into the veins (Figure 23.1).

Figure 23.1. The structure of the human bronchus. Source: https://cinetoday.ru/breast/dyhatelnaya-sistema-cheloveka-ege-chelovek-organy-sistemy-organov. For a color version of this figure, see www.iste.co.uk/zafeiris/data1.zip

The mechanics of breathing are provided by the breathing movements of the thorax and diaphragm. The volume of the thorax increases with inspiration, due to the contraction of the intercostal muscles and the flattening of the diaphragm, and the pleural pressure is thereby reduced. The pressure in the alveoli then stretches the lungs and, as a result, begins to drop below atmospheric pressure. This difference ensures the flow of air into the lungs – inhalation. Exhalation occurs by a similar mechanism, but in the opposite direction. The main difference is that, when inhaling, the driving force is muscle contraction, whereas when exhaling, it is the elastic traction stored in the fibers of the lungs and thorax. As soon as the muscles

relax, the lungs and thorax, like a spring, decrease in volume; the pleural and alveolar pressures increase, the pressure rises above atmospheric, and the resulting flow is directed outward (Gutsol et al. 2014).

This breathing mechanics ensures a constant renewal of the composition of the gas mixture in the alveoli. However, for effective gas exchange, it is also necessary to constantly renew the composition of its second component – the blood. The flow of blood within an organ is called perfusion.

Each alveolus has its own characteristics of ventilation and perfusion; therefore, the ratio of ventilation to perfusion may differ in different parts of the lungs. It follows that, even in a normal state, the ventilation of the lungs may show some unevenness. With pathologies of the respiratory system, for example, in inflammatory processes, the degree of ventilation inequality between different alveoli increases. To ensure effective gas exchange, an effective ratio of ventilation and perfusion in the lungs must be constantly maintained. In the absence of pathologies, the structure of the respiratory system is aimed at maintaining the required ratio of ventilation and perfusion.

23.3. The basic principle of operation of artificial lung ventilation devices: patient monitoring parameters

An artificial lung ventilation (ALV) device – a piece of equipment that simulates external respiration – forcibly delivers a gas mixture to the alveoli and removes the used gas from the lungs (Chursin 2008).

As mentioned earlier, the natural mechanism of human breathing is based on pressure differences and muscle contraction. Autonomous breathing is therefore NPV (negative pressure ventilation), since during inhalation the pressure in the lungs is below atmospheric. Artificial ventilation devices based on the NPV principle exist, but their designs are rather cumbersome, although physiological.

The devices used in intensive care and in resuscitation rooms are implemented
on the principle of PPV (positive pressure ventilation), i.e. the pressure of the gas
mixture in the lungs during inhalation is higher than the atmospheric pressure.

Ventilation techniques can vary in end-expiratory pressure:


1) ZEEP (zero end expiratory pressure) – at the end of exhalation, the pressure
drops to atmospheric;
2) PEEP (positive end expiratory pressure) – at the end of exhalation, the
pressure does not drop to atmospheric (Goryachev and Savin 2019).

Ventilators can also differ in the principle of switching from inhalation to exhalation and back:
1) Ventilation devices or modes with tidal volume control – the device controls the timing of the phases of the respiratory cycle, inspiration and expiration ("in frequency"), and, within the time allotted to inspiration, the equipment calculates the speed at which the set tidal volume must be delivered to the patient's lungs.
2) Ventilation devices or modes with pressure control during inspiration – the device also works "in frequency", but in this case the equipment injects the gas mixture at a certain speed until the set pressure in the lungs is reached, measuring the delivered tidal volume.

It is necessary to select certain parameters when connecting a patient to a


ventilator.

On the equipment, the doctor needs to configure the following parameters:
– the value of the tidal volume;
– the respiratory rate;
– the concentration of oxygen in the gas mixture;
– control of the pressure value at the height of inspiration (Goryachev and Savin 2019).

Further, the physician must maintain strict control over the delivery of oxygen, its consumption and the emission of carbon dioxide.

Oxygen transfer can be monitored in real time by using percutaneous pulse


oximetry.

Pulse oximetry is based on two phenomena:
1) Oximetry – hemoglobin absorbs light of a specific wavelength passing through the tissues differently, depending on its degree of oxygenation.
2) The pulse wave – the pulsation of the arteries and arterioles corresponds to the stroke volume of the heart.

The principle of oximetry is based on the fact that oxygenated hemoglobin and deoxyhemoglobin absorb the red and infrared (IR) parts of the spectrum differently. Oxyhemoglobin absorbs IR radiation well, while deoxyhemoglobin intensively absorbs red light. Saturation (SO2) is the degree of oxygen saturation of the blood, determined by the ratio of red and IR streams that have come from the source to the

photodetector through a tissue site. The pulse wave can be used to determine the
heart rate and assess the quality of peripheral blood flow (Lapitsky 2015).

Gas exchange can be monitored using data from gas analyzers, which use different types of sensors: paramagnetic sensors, fuel cells, IR absorption sensors, etc. Indirect calorimetry, as carried out by the metabolograph, has its own difficulties for continuous gas analysis: data on oxygen consumption and carbon dioxide emission by the body are not registered in real time, although it is necessary to control these parameters.

23.4. The algorithm for monitoring carbon dioxide emission and oxygen consumption

Treatment of lung damage in coronavirus infection with the use of artificial ventilation, as well as surgery under general anesthesia, requires strict monitoring of the consumption and transport of O2, as well as of the excretion of CO2, over the patient's respiratory cycle. Oxygen transport can be easily monitored in real time by means of percutaneous pulse oximetry based on the level of arterial oxygen saturation.

The control of the O2 content in the inhaled and exhaled air, and the respiratory
removal of CO2 can be carried out using inertial mechanical gas analyzers based on
the separation of a component from the gas mixture by special absorbers and
measuring changes in the sample volume at constant pressure, or pressure at a
constant volume of the measuring chamber.

So, the data from the metabolograph cannot be used for continuous monitoring
of oxygen consumption and carbon dioxide emission in real time. This means that
the task arises of implementing an algorithm that will allow solving this problem.

More adequate to the clinical conditions of intensive care units is the processing of synchronous current-monitoring data: the air flow velocity over the patient's respiratory cycle and the time capnogram measuring the partial pressure of CO2 released by the body into the exhaled air mixture. Based on these data, it is possible to create an algorithm and code that allow real-time control of the body's oxygen consumption and carbon dioxide removal.

The capnogram shows the partial pressure of carbon dioxide over time, but for
further calculations of volumes, we need to know the concentration of carbon
dioxide. To find it, it is worth referring to the definition of partial pressure and the
formula for calculating it.

Partial pressure is the pressure that would be produced by a gas that is part of a mixture of gases if it alone, at the given temperature, occupied the volume filled by the entire mixture. If the gas content (in parts or percent) and the total pressure of the mixture are known, then the partial pressure of any gas entering the gas mixture can be determined.

The equation for calculating the partial pressure is as follows:

p1 = (a · Pgeneral)/100,

where p1 is the partial pressure of a gas, a is the gas content of the mixture in % and Pgeneral is the gas mixture pressure.

This equation can be used to find an array of values for the amount of carbon
dioxide in a mixture in parts:

PrCO2 = CO2./Pair,

where PrCO2 is the instantaneous values of the amount of carbon dioxide in parts,
CO2 is the instantaneous values of the partial pressure of carbon dioxide and Pair is
the gas mixture pressure equal to atmospheric.

In addition, we need Dalton’s law.

According to this law, the total pressure of a mixture of gases is equal to the sum of the partial pressures of its components.

Based on this, we can find the partial pressure of nitrogen, provided that the
initial data on the partial pressures of oxygen and carbon dioxide of the supplied gas
mixture are known:

N = Pair – O2(1) – CO2(1),

where O2(1) and CO2(1) are the values of the partial pressure of oxygen and carbon
dioxide in the mixture, respectively, and N is the partial pressure of nitrogen
(Petrovsky 1988).

Also, using the equation for partial pressure, you can find the amount of nitrogen
in parts:

PrN = N/ Pair,

where PrN is the amount of nitrogen in parts and N is the partial pressure of
nitrogen.

Taking advantage of the fact that there are only three gases in the mixture, the array of instantaneous values of the amount of oxygen in parts can then be found by difference:

PrO2 = 1 – PrCO2 – PrN,

where PrO2 is the instantaneous values of the amount of oxygen in parts, PrCO2 is
the instantaneous values of the amount of carbon dioxide in parts and PrN is the
amount of nitrogen in parts.

The algorithm for real-time VO2 and VCO2 measurement includes the numerical integration of the instantaneous values dVO2/dt and dVCO2/dt over the respiratory cycle, each derived as the product of the corresponding gas concentration and the instantaneous total flow.
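As a purely illustrative sketch of this integration step, the following Python fragment computes the two integrated quantities from synchronized flow and capnogram samples; the function name, the argument names and the assumption of a three-gas (O2/CO2/N2) mixture at atmospheric pressure are ours and are not part of the original study.

```python
import numpy as np

def gas_exchange_volumes(flow_lpm, pco2_mmhg, po2_insp_mmhg, dt_s,
                         pco2_insp_mmhg=0.0, p_air_mmhg=760.0):
    """Estimate O2 uptake (Int1) and CO2 output (Int2), in ml, over one respiratory cycle.

    flow_lpm      : synchronized total flow signal, L/min (inspiration > 0, expiration < 0)
    pco2_mmhg     : synchronized capnogram, partial pressure of CO2, mmHg
    po2_insp_mmhg : partial pressure of O2 in the supplied gas mixture, mmHg
    dt_s          : sampling step of both signals, s
    """
    flow_ml_s = np.asarray(flow_lpm, dtype=float) * 1000.0 / 60.0   # L/min -> ml/s
    pr_co2 = np.asarray(pco2_mmhg, dtype=float) / p_air_mmhg        # CO2 in parts (PrCO2)
    # Dalton's law: nitrogen partial pressure of the supplied mixture, then in parts (PrN)
    pr_n = (p_air_mmhg - po2_insp_mmhg - pco2_insp_mmhg) / p_air_mmhg
    pr_o2 = 1.0 - pr_co2 - pr_n                                     # O2 in parts (PrO2)

    # Instantaneous gas flows = fraction * total flow; numerical integration over the cycle.
    # With the sign convention above, the signed integral equals the inhaled-minus-exhaled
    # volume of each gas.
    v_o2 = float(np.sum(pr_o2 * flow_ml_s) * dt_s)     # ml of O2 taken up per cycle
    v_co2 = -float(np.sum(pr_co2 * flow_ml_s) * dt_s)  # ml of CO2 released per cycle
    return v_o2, v_co2
```

Applied to digitized flow and capnogram curves such as those in Figure 23.2, the two returned values play the roles of the integrated quantities Int1 and Int2 discussed in the results below.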

To implement this computation, it is necessary to synchronize the data of the capnogram and the flow graph.

The initial data were presented graphically (Figure 23.2).

Figure 23.2. Graph of flow and capnogram. For a color version of this figure, see www.iste.co.uk/zafeiris/data1.zip

Accordingly, the data were digitized and brought to a form suitable for further processing.

23.5. Results

The results of the implementation of this algorithm are two integrated quantities,
as well as their graphs.

The result of the first integration (Int1) is the difference between the volume of the inhaled and exhaled oxygen. That is, according to the calculations, the amount of oxygen in the exhaled air decreased by 11.3957 ml in comparison with the inhaled air.

The result of the second integration (Int2) is the difference between the volume of the inhaled and exhaled carbon dioxide: the volume of carbon dioxide in the exhaled air increased by 8.5367 ml relative to that inhaled.

In addition to calculating the amount of oxygen consumed and carbon dioxide


emitted, this code plots the flow of oxygen and carbon dioxide during the
inhalation–exhalation cycle (Figures 23.3 and 23.4).

Figure 23.3. Oxygen flow graph

Figure 23.4. Carbon dioxide flow graph



23.6. Conclusion

The main conclusions of this work are as follows:


1) There is a need for a gas-analysis algorithm providing continuous monitoring of oxygen consumption and carbon dioxide emission in real time.
2) The algorithm developed on the basis of the analysis of synchronous capnogram and gas-mixture flow data makes it possible to assess whether the input parameters of the mechanical ventilator were correctly selected and, in the case of deviations of the analyzed values, to correct these parameters and find the optimal values for a particular patient, so that the ratio of ventilation and perfusion corresponds to the norm.

23.7. References

Chursin, V.V. (2008). Artificial Lung Ventilation. Tutorial, Almaty.


Gabdulkhakova, I.R., Shamratova, A.R., Insarova, G.E. (2016). Respiratory physiology.
Tutorial, Federal State Budgetary Educational Institution of Higher Education Bashkir
State Medical University, Ufa.
Goryachev, A.S. and Savin, I.A. (2019). The basics of artificial lung ventilation, Guide for
doctors. N.N. BURDENKO National Scientific and Practical Center for Neurosurgery of
the Ministry of Healthcare of the Russian Federation, Moscow.
Gutsol, L.O., Nepomnyashchikh, S.F., Korytov, L.I. (2014). Physiological and
pathophysiological aspects of external respiration. Tutorial, Federal State Budgetary
Educational Institution of Higher Education Irkutsk State Medical University, Irkutsk.
Lapitsky, D.V. (2015). Diagnostic capabilities of non-invasive monitoring of arterial blood
hemoglobin saturation with oxygen in the clinic of internal diseases. Tutorial, Belarusian
State Medical University, Minsk.
Mihnevich, K.G. and Kursov, S.V. (2008). Acute respiratory failure, methodical instructions.
Kharkiv National Medical University, Kharkov.
Naumenko, Z.K., Chernyak, A.V., Neklyudova, G.V., Chuchalin, A.G. (2018).
Ventilation/perfusion ratio. Practical Pulmonology, 4, 86–90.
Petrova, M.V., Butrov, A.V., Bikharri, S.D., Storchay, M.N. (2014). Monitoring of
metabolism in patients with critical conditions, effective pharmacotherapy.
Anesthesiology and Resuscitation, 2, 8–12.
Petrovsky, B.V. (1988). Big Medical Encyclopedia, 3rd edition. Soviet Encyclopedia,
Moscow.
Vasilev, D.V., Baklakov, A.A., Kim, V.A., Kozhakhmetov, B.A., Loshik, R.V., Sklyarov,
V.V. (2015). Monitoring of ventilation function of the lungs in patients in intensive care.
Medicine and Ecology, 4, 80–82.
PART 4

24
Approximate Bayesian Inference
Using the Mean-Field Distribution

Dynamical systems representing populations of interacting heterogeneous individuals are rarely studied and validated within a Bayesian framework, with the notable exception of Schneider et al. (2006), dealing with a model of plants in competition for a light resource. The reasons for this lack of coverage of a subject with such significant stakes (agriculture, crowd dynamics) are to be found in the computational difficulties posed by the problem of inference when the size of the population is large. In this chapter, we will focus on dynamical systems admitting a mean-field limit distribution when the population's size tends to infinity, such as the flocking models presented in Carrillo et al. (2010). We introduce a numerical scheme to simulate the mean-field distribution, which is the solution of a partial differential transport equation, and we use these simulations to simplify the likelihood distributions associated with Bayesian inference problems arising when the population is only partially observed.

24.1. Introduction

Population models may be used to assess, from data, the interaction laws governing
the individual dynamics (Bongini et al. 2017; Lu et al. 2019). In most of these models,
the interaction of an individual with the rest of the population is represented by means
of some statistics, potentially depending on the state variables of the whole population.
These statistics can take the form of the average velocity in bird swarms, for example (Cucker and Smale 2007; Degond et al. 2014), or the mean competition potential exerted by a population of plants over a single plant in Schneider et al. (2006).

Chapter written by Antonin DELLA NOCE and Paul-Henry COURNÈDE.



In this chapter, we will consider population models that satisfy a list of frequently
encountered properties:
– Each individual in the population is represented by a state variable x, which
may vary through time, and an individual trait variable θ, which remains constant.
The variability of trait θ from one individual to another can be used to model the
heterogeneous aspect of the population.
– The evolution of a population of N individuals is given by a differential system, where the motion of each individual i is driven by a transition function $h_N$ depending on some population statistics $T_i^N(t)$, i.e. for any time $t \geq 0$,

$$\forall i \in \llbracket 1; N \rrbracket, \quad \frac{dx_i^N}{dt}(t) = h_N\big(t, x_i(t), \theta_i, \hat{\mu}_N(t)\big) = H_N\big(t, x_i(t), \theta_i, T_i^N(t)\big), \qquad [24.1]$$

where $\hat{\mu}_N(t) = \frac{1}{N}\sum_{j=1}^{N} \delta\big(x_j^N(t), \theta_j\big)$ is the empirical population measure, and $T_i^N(t) = \mathbb{E}\big\{\Phi_N(x_i^N(t), \theta_i, x', \theta'),\ (x', \theta') \sim \hat{\mu}_N(t)\big\}$ is a statistic defined with $\Phi_N : (\mathcal{X} \times \Theta)^2 \to \mathbb{R}^p$ a feature function.


In the above equation, we have used the notation δ(x, θ) for the Dirac distribution
centered at point (x, θ). The empirical population measure μ̂N (t) is the probability
distribution corresponding to the uniform sampling of an individual (x, θ) in the
population. μ̂N (t) is used to compute all possible statistics over the population, such as
the first-order moment and covariance of the state variable. Here we consider statistics
taking a form expressed as the empirical expectation of the function ΦN . The state
variable x evolves in a Euclidean space $\mathcal{X}$, while the individual trait θ remains within a finite-dimensional metric space Θ.
– To account for the uncertainty on the initial configuration of the system, the
collection of initial conditions and individual traits (x0i , θi )1≤i≤N is assumed to be
an independent and identically distributed (i.i.d.) sample of some distribution μ0 ∈
P(X × Θ), referred to as the initial distribution.

When the initial distribution is factorized and the transition function depends, as above, on the empirical measure $\hat{\mu}_N(t)$, the system dynamics has the property of being invariant under permutation of the individuals' labels. We then say that the system is symmetric if, for any time $t \geq 0$ and for any bijection $\rho : \llbracket 1; N \rrbracket \to \llbracket 1; N \rrbracket$, the distribution of the permuted collection $(x_{\rho(1:N)}(t), \theta_{\rho(1:N)})$
is the same as the original collection (x1:N (t), θ1:N ). This property is commonly
shared by population models, where, most often, the assignment of labels is arbitrary
(Carrillo et al. 2010).

Our focus in this chapter is to discuss the statistical inference problems related to
the study of such symmetric systems. More specifically, when some elements of the

structure of the system are partially known, such as the initial condition or the size N of the population, determining parameters of the transition function $h_N$ or of the initial distribution $\mu_0$ can be a very complex task, leading to the necessity of building approximations.
approximations. In section 24.2, the plant population model introduced by Schneider
et al. (2006) is taken as an example of systems leading to difficult inference problems
when the size of the population is partially known. Section 24.3 uses an asymptotic
property of the empirical measure of the Schneider system when N → ∞, i.e. the
fact that it admits a mean-field limit distribution, to simplify the previously mentioned
inference problem. Section 24.4 deals with the consistency of this approximation.

24.2. Inference problem in a symmetric population system

24.2.1. Example of a symmetric system describing plant competition

In this section, we consider a plant growth model with competition, first introduced
by Schneider et al. (2006) and later by Lv et al. (2008) and Nakagawa et al. (2015).
This system describes the growth of Arabidopsis thaliana: each plant is represented
by the state variable s ∈ R∗+ , the diameter of its rosette, its position x ∈ R2 and two
growth parameters γ ∈ R+ , S ∈ R∗+ , namely, the growth rate and the asymptotic
isolated size. As the parameters x, S, γ remain constant over time, they are considered
as components of the individual trait θ. The differential system giving the dynamics
of N plants takes the following expression, for each plant $i \in \llbracket 1; N \rrbracket$ and for all $t \geq 0$:

$$(s_0, x_i, S_i, \gamma_i) \sim \mu_0, \qquad s_i(0) = s_0,$$
$$\frac{ds_i}{dt}(t) = \gamma_i s_i(t)\Big[\log(S_i/s_m)\big(1 - C_N^i(t)\big) - \log\big(s_i(t)/s_m\big)\Big], \qquad [24.2]$$

where $s_m > 0$ is a minimal size parameter and

$$C_N^i(t) = \frac{1}{N-1}\sum_{j \neq i} C\big(s_i(t), s_j(t), |x_i - x_j|\big),$$

with

$$C\big(s_i(t), s_j(t), |x_i - x_j|\big) = \frac{\log\big(s_j(t)/s_m\big)\Big[1 + \tanh\big(\tfrac{1}{\sigma_r}\log(s_j(t)/s_i(t))\big)\Big]}{2R_M\Big(1 + \frac{|x_i - x_j|^2}{\sigma_x^2}\Big)}.$$
In the equation above, $C_N^i(t)$ models the competition exerted on plant i by all the other plants at time t. As can be read from the competition potential $C(s_i(t), s_j(t), |x_i - x_j|)$, the competition exerted by a competitor is stronger the closer it is to plant i and the larger it is relative to plant i. We assume that the distribution $\mu_0$ is such that all plants initially have the same size $s_0 > s_m$. $R_M$, $\sigma_x$ and $\sigma_r$ are parameters of the

competition potential. Because this system is nonlinear, we need to make additional assumptions on the initial distribution $\mu_0$ to ensure that the system has a biologically consistent behavior, in particular, that a finite-time blowup cannot occur and that the competition potential does not take negative values. A sufficient condition to prevent such pathological situations from appearing is to include the support of the initial distribution $\mu_0$ within the domain

$$\mathcal{D} = \big\{ (s, x, S, \gamma) \in \mathbb{R}_+^* \times \mathbb{R}^2 \times \mathbb{R}_+^* \times \mathbb{R}_+ \ \big|\ s_m \leq s \leq S,\ s_m \leq S \leq s_m \exp(R_M) \big\}.$$

In other words, the initial size is lower than the asymptotic isolated size S, and the asymptotic size is below some threshold depending on the competition parameter $R_M$. In this setting, we can prove that, for all $i \in \llbracket 1; N \rrbracket$ and for all $t \geq 0$, $s_m \leq s_i(t) \leq S_i$ and $C_N^i(t) \in [0; 1]$ almost surely, which is consistent with the initial assumptions of the model. Indeed, for any time $t > 0$ sufficiently close to zero, we have the following inequality on the derivative of the state variable:

$$-\gamma_i \log\frac{s_i(t)}{s_m} \leq \frac{d}{dt}\log\frac{s_i(t)}{s_m} \leq \gamma_i \log\frac{S_i}{s_i(t)},$$

which leads to $s_m\left(\frac{s_0}{s_m}\right)^{e^{-\gamma_i t}} \leq s_i(t) \leq S_i\left(\frac{s_0}{S_i}\right)^{e^{-\gamma_i t}}$ and proves that the inequality $s_m \leq s_i(t) \leq S_i$ holds for any time $t \in \mathbb{R}_+$.

The Schneider system fits into the definition of symmetric population models, as the transition function can be expressed as a function of the individual variables and the population empirical measure only. The differential system can be rewritten as follows:

$$\forall i \in \llbracket 1; N \rrbracket, \quad \frac{ds_i}{dt}(t) = h_N\big(s_i(t), x_i, S_i, \gamma_i, \hat{\mu}_N(t)\big),$$

where $\hat{\mu}_N(t) = \frac{1}{N}\sum_{i=1}^{N} \delta\big(s_i(t), x_i, S_i, \gamma_i\big)$ and

$$\begin{aligned}
h_N\big(s_i(t), x_i, S_i, \gamma_i, \hat{\mu}_N(t)\big)
&= \frac{N}{N-1}\,\gamma_i s_i(t)\bigg[\log\frac{S_i}{s_m}\bigg(1 - \int_{\mathbb{R}_+^* \times \mathbb{R}^2} C\big(s_i(t), s', |x_i - x'|\big)\,\hat{\mu}_N^{s,x}(t, ds', dx')\bigg) - \log\frac{s_i(t)}{s_m}\bigg] \\
&\quad - \frac{\gamma_i s_i(t)}{N-1}\bigg[\log\frac{S_i}{s_m}\big(1 - C(s_i(t), s_i(t), 0)\big) - \log\frac{s_i(t)}{s_m}\bigg]. \qquad [24.3]
\end{aligned}$$

In the equation above, we use the notation $\hat{\mu}_N^{s,x}(t)$ to denote the marginal distribution of $\hat{\mu}_N(t)$ with respect to the variables s, x. The second term

in the expression of $h_N$ represents the exclusion of the diagonal interaction between an individual plant and itself.

Now that we have guarantees on the global existence of the trajectories of the Schneider system, we can consider a numerical methodology to estimate the evolution of the individual sizes $t \in \mathbb{R}_+ \mapsto s_{1:N}(t) \in \mathbb{R}^N$. The numerical scheme we introduce in this chapter takes into account the two-level structure of the dynamics: a population level, represented by the competition potential $C_N^i(t)$, and an individual level, represented by the variables $(s_i, \theta_i)$. Let $\{t_0 = 0, t_1, \ldots, t_M = T\}$ be a subdivision of the interval $[0; T]$ over which we want to simulate the system. Then, over each sub-interval $[t_k; t_{k+1})$, for $k \in \llbracket 0; M-1 \rrbracket$, we approximate the evolution of each individual competition potential by a piecewise polynomial function, given by a Taylor expansion with respect to the time variable t:

$$\forall i \in \llbracket 1; N \rrbracket, \quad \tilde{C}_N^i(t, t_k) = \sum_{m=0}^{d} \frac{d^{(m)} C_N^i}{dt^m}(t_k)\, \frac{(t - t_k)^m}{m!}.$$
dt m!

With this approximation, we can obtain an approximation of the individual trajectories over the interval $[t_k; t_{k+1})$ by solving the analytical differential equation, for all $i \in \llbracket 1; N \rrbracket$:

$$\tilde{s}_i(t_k) \ \text{given by the previous iteration},$$
$$\forall t \in [t_k; t_{k+1}), \quad \frac{d\tilde{s}_i}{dt}(t) = \gamma_i \tilde{s}_i(t)\bigg[\log\frac{S_i}{s_m}\big(1 - \tilde{C}_N^i(t, t_k)\big) - \log\frac{\tilde{s}_i(t)}{s_m}\bigg].$$

After this, we can evaluate the competition potentials $C_N^i(t_{k+1})$ at the next time step.
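To make the scheme concrete, the sketch below implements a minimal degree-0 variant in Python: the competition potential of each plant is frozen at its value at $t_k$, so that over $[t_k; t_{k+1})$ the individual equation becomes linear in $\log(s_i/s_m)$ and can be advanced exactly. The function names and the parameter values of the example are arbitrary illustrations, not those of Schneider et al. (2006).

```python
import numpy as np

def competition(si, sj, dist2, s_m, R_M, sigma_x, sigma_r):
    """Pairwise competition potential C(s_i, s_j, |x_i - x_j|) of equation [24.2]."""
    return (np.log(sj / s_m) * (1.0 + np.tanh(np.log(sj / si) / sigma_r))
            / (2.0 * R_M * (1.0 + dist2 / sigma_x**2)))

def simulate_schneider(s0, x, S, gamma, t_grid, s_m=1.0, R_M=2.0, sigma_x=1.0, sigma_r=1.0):
    """Degree-0 two-level scheme: freeze C_N^i at t_k on [t_k, t_{k+1}), then advance
    du/dt = gamma * (A - u), with u = log(s/s_m), exactly over the sub-interval."""
    N = len(S)
    dist2 = np.sum((x[:, None, :] - x[None, :, :])**2, axis=-1)   # squared pairwise distances
    s = np.full(N, s0, dtype=float)
    traj = [s.copy()]
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        # population level: competition exerted on each plant, diagonal term excluded
        C = competition(s[:, None], s[None, :], dist2, s_m, R_M, sigma_x, sigma_r)
        np.fill_diagonal(C, 0.0)
        C_i = C.sum(axis=1) / (N - 1)
        # individual level: exact step of the frozen (Gompertz-type) equation
        A = np.log(S / s_m) * (1.0 - C_i)
        u = A + (np.log(s / s_m) - A) * np.exp(-gamma * (t1 - t0))
        s = s_m * np.exp(u)
        traj.append(s.copy())
    return np.array(traj)

# Illustrative run with N = 50 plants and traits drawn inside the domain D
rng = np.random.default_rng(0)
N = 50
sizes = simulate_schneider(s0=1.2, x=rng.uniform(0.0, 10.0, (N, 2)),
                           S=rng.uniform(2.0, 6.0, N), gamma=rng.uniform(0.5, 1.5, N),
                           t_grid=np.linspace(0.0, 5.0, 51))
```

Higher-degree variants replace the frozen value by the Taylor polynomial $\tilde{C}_N^i(t, t_k)$, at the cost of propagating time derivatives of the competition potential.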

24.2.2. Inference problem of the Schneider system, in a more general setting

Let us consider the inference problem consisting of the identification of the parameters $\eta = (s_0, \sigma_x, \sigma_r) \in H = \mathbb{R}_+^3$ from observations made on a sub-population of size $N_0 \leq N$. The actual size N of the population is unknown; it is a latent variable of the problem. We introduce the prior distributions $p_\eta$, $p_N$ on the parameters to estimate and on the unknown size, respectively. We consider that the support of the prior $p_N$ is infinite, included in the interval $\llbracket N_0; +\infty \llbracket$. Let us assume that the observation error is Gaussian with standard deviation σ, and that observations are made independently over the timeline $\{t_1, \ldots, t_m\} \subset \mathbb{R}_+$. We denote by $s = (s_{ij}) \in \mathbb{R}^{N_0 m}$ the random variable representing the observations, $s_{ij}$ being the measured size of individual i at time $t_j$. In a Bayesian setting, we would like to estimate the posterior distribution of η knowing the observations s. Bayes' formula classically gives the following expression

of the posterior density, which can only be known up to a multiplicative factor in our case, i.e. for all $(\eta, s) \in H \times \mathbb{R}^{N_0 m}$,

$$p_{\eta|s}(\eta|s) \propto p_{s|\eta}(s|\eta)\, p_\eta(\eta).$$

Figure 24.1. Evolution of plant sizes in a population of N = 50 individuals described by the Schneider system. All plants initially have the same size $s_0$, and the individual traits θ = (x, S, γ) have a uniform distribution in a domain included within $\mathcal{D}$. Trajectories are computed using the two-level numerical scheme introduced in this section. For a color version of this figure, see www.iste.co.uk/zafeiris/data1.zip

Beforehand, the inference requires the evaluation of the likelihood distribution $p_{s|\eta}$ of the observations knowing the parameters. The likelihood of the observations has a density $p_{s|\eta}$ of expression

$$\forall (s, \eta) \in \mathbb{R}^{N_0 m} \times H, \quad p_{s|\eta}(s|\eta) = \sum_{N=N_0}^{+\infty} p_N(N)\, p_{s|\eta,N}(s|\eta, N),$$
$$p_{s|\eta,N}(s|\eta, N) = \frac{1}{(2\pi\sigma^2)^{\frac{N_0 m}{2}}} \int_{\Theta^N} \exp\Bigg(-\frac{1}{2\sigma^2}\sum_{i=1}^{N_0}\sum_{j=1}^{m}\big(s_{ij} - s_i^N(t_j, \eta, \theta_{1:N})\big)^2\Bigg)\, \mu_0^{\theta\,\otimes N}(d\theta_{1:N}), \qquad [24.4]$$

where $s_i^N(t, \eta, \theta_{1:N})$ is the solution of the Schneider system of size N. In practice, we cannot evaluate the trajectories $s_i^N$ exactly, and we have to resort to numerical

methods to estimate them. The management of this source of uncertainty is beyond the scope of this chapter, so let us assume that we are able to solve the differential system exactly, or with a numerical error that we can reasonably ignore.

The main difficulty in the computation of this likelihood distribution comes from the fact that the individuals are interdependent for any time t > 0. As a consequence, to compute the trajectory $s_i^N$, we need to introduce the initial configuration of the whole population $\theta_{1:N}$ as latent variables, although we only observe a subset of $N_0$ individuals. Thus, the likelihood appears as an infinite mixture of densities, due to the infinite support of the prior $p_N$. Moreover, each of the densities in the mixture is expressed as an integral over a space of increasing dimension. The inference problem associated with the simulation of a posterior distribution with a latent variable changing the dimension of the candidate model is called a trans-dimensional inference problem (Preston 1975). Numerically simulating such a complex distribution is still an open research topic, as pointed out in Roberts and Rosenthal (2006).

24.3. Properties of the mean-field distribution

To ignore the aforementioned source of uncertainty, we resort to the notion of


mean-field limit distribution, which is used in statistical physics and in mathematical
biology (Carrillo et al. 2010). This notion describes the asymptotic behavior of a
population system, for which the population empirical measure μ̂N (t) converges
weakly and almost surely, when N → ∞, towards some deterministic probability
distribution μ(t), referred to as the mean-field distribution. The weak convergence of
the sequence μ̂N (t) is quantified classically by the Wasserstein distance (Golse 2016).
It is formally defined, for any couple of probability measures $(\mu, \nu) \in \mathcal{P}(\mathcal{X} \times \Theta)^2$ and for any $p \geq 1$, as the infimum of the distance averaged by a coupling of the distributions μ and ν, i.e.

$$W_p(\mu, \nu)^p = \inf_{\pi} \int_{(\mathcal{X} \times \Theta)^2} d(x, \theta, x', \theta')^p\, \pi(dx, d\theta, dx', d\theta'),$$

where the infimum is taken over all couplings π such that the 1st marginal of π is μ and the 2nd marginal of π is ν,

with d being some metric defined over $\mathcal{X} \times \Theta$. When a sequence of probability measures $(\mu_n)_{n\in\mathbb{N}}$ converges to some distribution μ for the metric $W_p$, it means, in particular, that, for all ϕ continuous and bounded, we have

$$\mathbb{E}\{\varphi(x, \theta),\ (x, \theta) \sim \mu_n\} \xrightarrow[n \to \infty]{} \mathbb{E}\{\varphi(x, \theta),\ (x, \theta) \sim \mu\}$$

and that the moments of order p of the sequence converge also towards the same
moments of the limit distribution. In the case of the population empirical measure,

this convergence can only hold almost surely, since μ̂N (t) is a stochastic measure
depending on the initial configuration of the system.

As the empirical measure μ̂N (t) is a description of the population of size N , the
limit μ(t) of this sequence can be interpreted as a description of an infinite population.
The mean-field distribution μ(t) addresses the issues entailed by the interdependence
of the individuals, since the individuals in this infinite population are independent.
Such a paradox can be explained by comparing the interactions within a subgroup of
individuals of constant size in a population of increasing N admitting a mean-field
distribution. Let us illustrate this on a toy example.

We consider a linear population model, given by

$$\forall i \in \llbracket 1; N \rrbracket, \quad \frac{dx_i}{dt}(t) = \frac{\lambda}{N}\sum_{j=1}^{N} x_j(t) = \lambda\, \mathbb{E}\{x,\ x \sim \hat{\mu}_N(t)\},$$

with Gaussian initial condition $x_i^0 \sim \mathcal{N}(m, \sigma^2)$. The analytical trajectories of this system are, for all $i \in \llbracket 1; N \rrbracket$ and all $t \geq 0$:

$$x_i(t) = x_i^0 + \frac{e^{\lambda t} - 1}{N}\sum_{j=1}^{N} x_j^0.$$

It is quite straightforward to prove that, in this case, the empirical measure has a mean-field limit, for any time $t \geq 0$ and almost surely,

$$\hat{\mu}_N(t) = \frac{1}{N}\sum_{i=1}^{N} \delta(x_i(t)) \xrightarrow[N \to \infty]{W_p} \mathcal{N}\big(m e^{\lambda t}, \sigma^2\big).$$

Besides, if we consider the covariance of two particles $x_1^N(t)$ and $x_2^N(t)$, that are Gaussian,

$$\mathrm{Cov}\big(x_1^N(t), x_2^N(t)\big) = \sigma^2\, \frac{e^{\lambda t} - 1}{N}\left(1 + \frac{e^{\lambda t} - 1}{N}\right) \xrightarrow[N \to \infty]{} 0.$$

It follows that these two particles are asymptotically independent. This property of asymptotic independence can be generalized to more complex and nonlinear systems, as long as they admit mean-field limits. In a symmetric system of the type in equation [24.1], a necessary condition for the existence of the mean-field limit is the pointwise convergence of the transition function $h_N$ when $N \to \infty$. This pointwise convergence is clearly satisfied in the case of the toy example above, since we have, for any fixed probability measure $\mu \in \mathcal{P}(\mathbb{R})$, $h_N(x, \mu) = \lambda\, \mathbb{E}\{x',\ x' \sim \mu\}$, which is independent of N. It is also the case of the Schneider system, as we can see in equation [24.3], replacing $\hat{\mu}_N(t)$ by a fixed probability measure μ. Moreover, we can prove the

convergence $\hat{\mu}_N(t) \xrightarrow[N \to \infty]{W_2} \mu(t)$ almost surely for any $t \geq 0$, where $\mu(t)$ is defined as the pushforward measure of the initial distribution $\mu_0$ by the flow of the following differential equation, starting from the initial configuration $(s_0, \theta) = (s, x, S, \gamma)$:

$$\begin{aligned}
\frac{\partial s_\infty}{\partial t}(t, s_0, \theta) &= \gamma\, s_\infty(t, s_0, \theta)\bigg[\log\frac{S}{s_m}\bigg(1 - \int_{\mathbb{R}_+^* \times \Theta} C\big(s_\infty(t, s_0, \theta), s_\infty(t, s_0', \theta'), |x - x'|\big)\, \mu_0(ds_0', d\theta')\bigg) - \log\frac{s_\infty(t, s_0, \theta)}{s_m}\bigg] \\
&= \int_{\mathbb{R}_+^* \times \Theta} g\big(s_\infty(t, s_0, \theta), \theta, s_\infty(t, s_0', \theta'), \theta'\big)\, \mu_0(ds_0', d\theta').
\end{aligned}$$

This equation can be interpreted as the continuous version of the original Schneider system, where the empirical expectation is replaced by the theoretical expectation. For all times $t \geq 0$, the mean-field limit of the Schneider system is given by $\mu(t) = (s_\infty(t), \mathrm{Id}_\Theta)\#\mu_0$, where # is the pushforward operator between a function and a probability measure. To prove the existence and uniqueness of the flow $s_\infty$, we follow exactly the same steps as in the proof of the Cauchy–Lipschitz theorem, except that this time the initial conditions are not scalars but probability distributions. More specifically, the existence and uniqueness are consequences of a fixed-point procedure, with the recurrence equation:

$$f^{n+1}(t, s_0, \theta) = s_0 + \int_0^t \int_{\mathbb{R}_+^* \times \Theta} g\big(f^n(\tau, s_0, \theta), \theta, f^n(\tau, s_0', \theta'), \theta'\big)\, \mu_0(ds_0', d\theta')\, d\tau.$$

The sequence $(f^n)_{n\in\mathbb{N}}$ converges, for some functional metric, to a fixed point of the recurrence function, which defines the mean-field flow $s_\infty$. The convergence of the empirical measure $\hat{\mu}_N(t)$ is proved in Della Noce et al. (2019), resorting to an argument referred to as the propagation of chaos.

24.4. Mean-field approximated inference

24.4.1. Case of systems admitting a mean-field limit

Let us return to the inference problem described in section 24.2.2. We consider the following approximation: if the size of the population is large enough, $s_\infty$ is close to the individual trajectories, and we can assume that observations are made on the infinite population rather than on the finite population. Under this approximation, the resulting likelihood of the observations has a simplified expression, in comparison with the original one in equation [24.4]:

$$p_{s|\eta}^{\infty}(s|\eta) = \frac{1}{(2\pi\sigma^2)^{\frac{N_0 m}{2}}} \prod_{i=1}^{N_0} \int_{\Theta} \exp\Bigg(-\frac{1}{2\sigma^2}\sum_{j=1}^{m}\big(s_{ij} - s_\infty(t_j, \eta, \theta_i)\big)^2\Bigg)\, \mu_0^{\theta}(d\theta_i).$$

Here, the asymptotic independence of the individuals is used to decompose the integral into a product of integrals over the space Θ, the dimension of which does not depend on N. As a consequence of the convergence of the empirical measure mentioned in the previous section, we can prove that this approximation of the likelihood is consistent when $N \to \infty$, and the convergence of the likelihood is quantified, in this case, using the total variation distance:

$$d_{TV}\big(p_{s|N}, p_s^{\infty}\big) = \frac{1}{2}\int_{\mathbb{R}^{N_0 m}} \left|\int_{H} \big(p_{s|\eta,N}(s|\eta, N) - p_{s|\eta}^{\infty}(s|\eta)\big)\, p_\eta(\eta)\, \lambda(d\eta)\right| \lambda(ds) \xrightarrow[N \to \infty]{} 0.$$

Moreover, the speed of the convergence towards the mean-field approximated inference depends on the configuration of the observation protocol, with an upper bound of the total variation distance taking the following expression:

$$d_{TV}\big(p_{s|N}, p_s^{\infty}\big) \leq K\, \frac{N_0}{\sigma} \sum_{j=1}^{m} \mathbb{E}\big\{W_2\big(\hat{\mu}_N(t_j), \mu(t_j)\big)\big\},$$

which means that, for an accurate observation protocol (with large $N_0$ and small σ), the mean-field approximation becomes relevant only for population sizes N larger than in the case of a rough observation protocol.

We can therefore use this limit in total variation to approximate the infinite mixture of densities by a finite mixture, by truncating the infinite sum in equation [24.4] at a size $N = N_{mf}$ above which the mean-field likelihood is close to the original likelihood, below some tolerance ε:

$$\tilde{p}_{s|\eta}^{N_{mf}}(s|\eta) = \sum_{N=N_0}^{N_{mf}} p_N(N)\, p_{s|\eta,N}(s|\eta, N) + \Bigg(\sum_{N=N_{mf}+1}^{+\infty} p_N(N)\Bigg) p_{s|\eta}^{\infty}(s|\eta).$$

This approximated likelihood is much more manageable than the original one. Nevertheless, the main difficulty lies in the simulation of the mean-field flow $s_\infty$, which appears in the expression of $\tilde{p}_{s|\eta}$.
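Assuming that an estimate of the mean-field flow $s_\infty$ is available (a numerical scheme for it is described next), the mean-field factor $p_{s|\eta}^{\infty}$ can be evaluated by plain Monte Carlo over the trait distribution, thanks to the product form above. The Python sketch below is an illustration under that assumption; the function and argument names and the Monte Carlo sample size are ours.

```python
import numpy as np

def log_meanfield_likelihood(s_obs, t_obs, eta, s_inf, sample_theta, sigma, n_mc=1000):
    """Monte Carlo estimate of log p^infinity_{s|eta}(s|eta).

    s_obs        : (N0, m) array of observed sizes s_ij
    t_obs        : (m,) observation times
    s_inf        : callable s_inf(t_obs, eta, theta) -> (m,) mean-field trajectory
    sample_theta : callable returning k i.i.d. draws from the trait marginal of mu_0
    sigma        : standard deviation of the Gaussian observation error
    """
    thetas = sample_theta(n_mc)
    s_mc = np.array([s_inf(t_obs, eta, th) for th in thetas])    # (n_mc, m) trajectories
    log_p = -0.5 * s_obs.size * np.log(2.0 * np.pi * sigma**2)   # Gaussian normalization
    for s_i in s_obs:
        # one integral over Theta per observed individual, estimated by the sample mean
        log_weights = -np.sum((s_i[None, :] - s_mc)**2, axis=1) / (2.0 * sigma**2)
        log_p += np.logaddexp.reduce(log_weights) - np.log(n_mc)
    return log_p
```

The finite-N terms $p_{s|\eta,N}$ of the truncated mixture can be estimated in the same spirit, by sampling whole trait configurations $\theta_{1:N}$ and simulating the finite system, but only for the few sizes $N \leq N_{mf}$ retained by the truncation.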

To estimate the mean-field flow $s_\infty$, we use a two-level numerical method with the same structure as the numerical method used in section 24.2.1 to simulate a population of finite size N. Similarly, we consider the same subdivision $\{t_0 = 0, t_1, \ldots, t_M = T\}$ of the interval $[0; T]$. In a finite population, the dynamics of the individuals' sizes are driven by the competition potential $C_N^i(t)$; in the case of an infinite population, the same role is played by the averaged competition potential, defined for all $t \in \mathbb{R}_+$ and $(s, \theta) \in \mathcal{D}$ by

$$C_\infty(t, s, \theta) = \mathbb{E}\big\{C\big(s_\infty(t, s, \theta), s_\infty(t, s', \theta'), |x - x'|\big),\ (s', \theta') \sim \mu_0\big\}.$$



Over each sub-interval $[t_k; t_{k+1})$, for $k \in \llbracket 0; M-1 \rrbracket$, we approximate $C_\infty$ by a piecewise polynomial function with respect to time and by a dense parametric family to approximate the dependency with respect to $(s, \theta)$ (we can choose multivariate polynomial functions if the domain of the variables is compact). The resulting approximation of $C_\infty$ over the interval takes the following form:

$$\tilde{C}_\infty(t, t_k, s, \theta) = \sum_{m=0}^{d} c_m(s, \theta)\, \frac{(t - t_k)^m}{m!}, \quad \text{where} \quad c_m(s, \theta) \approx \frac{\partial^{(m)}}{\partial t^m}\Big[\mathbb{E}\big\{C\big(\tilde{s}_\infty(t_k, s, \theta), \tilde{s}_\infty(t_k, s', \theta'), |x - x'|\big),\ (s', \theta') \sim \mu_0\big\}\Big].$$

Figure 24.2. Top left: mean value of the parameter S according to the position x of the
plant. Top right: mean value of the parameter γ according to the position of the plant.
Bottom: evaluation of the mean-field flow s∞ at the end of the observation period,
with the mean values of parameters S and γ . s∞ is computed using the two-level
numerical scheme introduced in this section. For a color version of this figure, see
www.iste.co.uk/zafeiris/data1.zip

Once the trajectories of the competition potential are approximated, we can analytically compute an approximation of the mean-field flow $s_\infty$ by solving a differential equation over $[t_k; t_{k+1})$. The analytical differential equation to solve is, for any $(s, \theta) = (s, x, S, \gamma)$ and $t \in [t_k; t_{k+1})$,

$$\frac{\partial \tilde{s}_\infty}{\partial t}(t, s, \theta) = \gamma\, \tilde{s}_\infty(t, s, \theta)\bigg[\log\frac{S}{s_m}\big(1 - \tilde{C}_\infty(t, t_k, s, \theta)\big) - \log\frac{\tilde{s}_\infty(t, s, \theta)}{s_m}\bigg].$$
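A simple, hedged way to realize this step in practice is to replace the theoretical expectation defining $C_\infty$ by an average over a large i.i.d. sample drawn from $\mu_0$, reusing the same frozen-potential analytic step as in the finite-N sketch given earlier. This particle (Monte Carlo) variant, shown below in Python, differs from the parametric approximation described above and is only meant as an illustration; all names and shapes are assumptions.

```python
import numpy as np

def meanfield_flow_particles(theta_sample, s0, t_grid, s_m=1.0, R_M=2.0,
                             sigma_x=1.0, sigma_r=1.0):
    """Particle approximation of the mean-field flow s_infinity at the sampled traits.

    theta_sample : (M, 4) draws (x1, x2, S, gamma) from the trait marginal of mu_0
    Returns an array of shape (len(t_grid), M).
    """
    x, S, gamma = theta_sample[:, :2], theta_sample[:, 2], theta_sample[:, 3]
    dist2 = np.sum((x[:, None, :] - x[None, :, :])**2, axis=-1)
    s = np.full(len(S), s0, dtype=float)
    traj = [s.copy()]
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        # averaged competition potential C_infinity(t_k, ., .), estimated over the sample
        C = (np.log(s[None, :] / s_m) * (1.0 + np.tanh(np.log(s[None, :] / s[:, None]) / sigma_r))
             / (2.0 * R_M * (1.0 + dist2 / sigma_x**2)))
        C_bar = C.mean(axis=1)              # theoretical expectation ~ sample average
        A = np.log(S / s_m) * (1.0 - C_bar)
        u = A + (np.log(s / s_m) - A) * np.exp(-gamma * (t1 - t0))  # frozen-potential step
        s = s_m * np.exp(u)
        traj.append(s.copy())
    return np.array(traj)
```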

Figure 24.2 gives an illustration of simulations of the Schneider system under


the mean-field approximation. We consider a specific initial distribution μ0 giving the
parameters x, S, γ. We use the two-level numerical scheme introduced in this section
to evaluate the numerical flow at a time T , where the individual sizes are close to a
stationary distribution. We can note the impact of competition between the plants, as
the surface corresponding to the mean-field flow does not entirely correspond to the
surface characterizing the spatial distribution of parameter S.

24.5. Conclusion

In this chapter, we have presented a methodology to simplify inference problems


related to symmetric systems admitting a mean-field distribution. This methodology
should not be confused with the mean-field variational inference, which considers
the minimization of other metrics to simplify the inference, most often the
Kullback–Leibler divergence. Research on how to generalize such an approach to symmetric systems with less restrictive asymptotic behavior is ongoing.

24.6. References

Bongini, M., Fornasier, M., Hansen, M., Maggioni, M. (2017). Inferring interaction rules from
observations of evolutive systems I: The variational approach. Mathematical Models and
Methods in Applied Sciences, 27(05), 909–951.
Carrillo, J.A., Fornasier, M., Toscani, G., Vecil, F. (2010). Particle, kinetic, and hydrodynamic
models of swarming. In Mathematical Modeling of Collective Behavior in Socio-Economic
and Life Sciences, Naldi, G., Pareschi, L., Toscani, G. (eds). Birkhäuser, Boston.
Cucker, F. and Smale, S. (2007). On the mathematics of emergence. Japanese Journal of
Mathematics, 2(1), 197–227.
Degond, P., Dimarco, G., Mac, T.B.N. (2014). Hydrodynamics of the Kuramoto–Vicsek model
of rotating self-propelled particles. Mathematical Models and Methods in Applied Sciences,
24(02), 277–325.
Della Noce, A., Mathieu, A., Cournède, P.H. (2019). Mean field approximation of a
heterogeneous population of plants in competition. arXiv preprint [Online]. Available at:
arXiv:1906.01368.
Golse, F. (2016). On the dynamics of large particle systems in the mean field limit. In
Macroscopic and Large Scale Phenomena: Coarse Graining, Mean Field Limits and
Ergodicity, Muntean, A., Rademacher, J., Zagaris, A. (eds). Springer, Cham.
Lu, F., Zhong, M., Tang, S., Maggioni, M. (2019). Nonparametric inference of interaction laws
in systems of agents from trajectory data. Proceedings of the National Academy of Sciences,
116(29), 14424–14433.
Lv, Q., Schneider, M.K., Pitchford, J.W. (2008). Individualism in plant populations: Using
stochastic differential equations to model individual neighbourhood-dependent plant growth.
Theoretical Population Biology, 74(1), 74–83.

Nakagawa, Y., Yokozawa, M., Hara, T. (2015). Competition among plants can lead to an
increase in aggregation of smaller plants around larger ones. Ecological Modelling, 301,
41–53.
Preston, C. (1975). Spatial birth and death processes. Advances in Applied Probability, 7(3),
465–466.
Roberts, G.O. and Rosenthal, J.S. (2006). Harris recurrence of Metropolis-within-Gibbs and
trans-dimensional Markov chains. The Annals of Applied Probability, 16(4), 2123–2139.
Schneider, M.K., Law, R., Illian, J.B. (2006). Quantification of neighbourhood-dependent plant
growth by Bayesian hierarchical modelling. Journal of Ecology, 94, 310–321.
25

Pricing Financial Derivatives in the


Hull–White Model Using Cubature
Methods on Wiener Space

In our previous studies, we developed novel cubature methods of degree 5 on


the Wiener space in the sense that the cubature formula is exact for all multiple
Stratonovich integrals up to dimension equal to the degree. In this chapter, we apply
the above methods to the modeling of fixed-income markets via affine models. Then,
we apply the obtained results to price interest rate derivatives in the Hull–White
one-factor model.

25.1. Introduction and outline

Calculating stochastic integrals is one of the main challenges in probability.


Stochastic integrals cannot always be calculated in closed form. Therefore, proper numerical methods should be used to estimate their values. Monte Carlo estimates are among the popular approaches for estimating the values of stochastic integrals in mathematics and physics. In mathematical finance (and physics), we would like to calculate (estimate) the expected values of functionals defined on the solutions of stochastic differential equations (SDEs).

Without going through the technical details, we mention that the classical idea
of cubature methods and consequently cubature formulae can be described as a
construction of a probability measure with finite support on a finite-dimensional real

Chapter written by Hossein NOHROUZIAN, Anatoliy MALYARENKO and Ying NI.


For a color version of all the figures in this chapter, see www.iste.co.uk/zafeiris/data1.zip.


linear space which approximates the standard Gaussian measure. For more technical
details, see Lyons and Victoir (2002) and Malyarenko et al. (2017). A generalization
of this idea, when a finite-dimensional space is replaced with a Wiener space, can
be used for constructing modern Monte Carlo estimates. In what follows, we briefly
review both classical and modern Monte Carlo approaches.

To begin with, in mathematical terminology, a sample space Ω is the set of all


possible outcomes ω of an experiment. Also, an event is a subset of Ω. In probabilistic
notations, a real random variable X is a measurable function X : Ω → R such that
for any x ∈ ℝ, the set {ω ∈ Ω : X(ω) ≤ x} is an event. In other words, we associate an event with a set of possible outcomes (ω's), and then define the probability of that event occurring to be its measure (volume) relative to the set of all possible outcomes (Ω).

According to Glasserman (2004), (classical) Monte Carlo methods are constructed


upon probability and volume but in the reverse order of the above description. To
put it simply, we randomly sample from Ω and take the fraction of random draws
that falls in a given set (event) as an estimate of the set’s probability. As the number
of simulations increases, the law of large numbers ensures that such an estimate converges to the correct value. Moreover, after a finite number of draws (simulations),
confidence intervals and errors in the estimations can be obtained using the central
limit theorem.

Simulation can be performed in two approaches, classical and modern. These two approaches can be described in the following steps.

Classical:
1. Space discretization.
2. Time discretization.
3. Discretization on the space of trajectories.

Modern (see Bayer and Teichmann (2008)):
1. Discretization on the space of trajectories.
2. Using standard numerical methods.

In order to estimate the value of some stochastic integrals in finance, we would


like to use modern Monte Carlo estimates and cubature methods on Wiener space. So,
the outline of this work will be as follows. In section 25.2, we review and discuss the
idea of cubature formulae on Wiener space. Also, we will apply cubature formulae
of degree 5 in the Black–Scholes pricing process. After that, in section 25.3, we will
briefly mention some interest-rate models and will look at the Hull–White one-factor
model and its SDE. Then, in section 25.4, we will simulate the Hull–White price
process using both classical Monte Carlo simulation and a cubature formula on Wiener
space. We will iterate sample paths created by a cubature formula to construct a
trinomial tree. Finally, we will close this work with section 25.5, where we discuss the
results of this chapter, possible improvements and applications of cubature methods
on Wiener space in the forthcoming work.

25.2. Cubature formulae on Wiener space

In this section, we would like to review how we applied the Stratonovich correction to the Black–Scholes SDE in Nohrouzian and Malyarenko (2019). To begin with, in mathematical finance, we mostly deal with parabolic partial differential equations (PDEs). We would like to use the cubature method on Wiener space to simulate some SDEs, transforming the solution of PDEs into the estimation of the values of stochastic integrals on Wiener space (for more details, see Nohrouzian and Malyarenko (2019)). Let us give our full attention to the Wiener space of scalar-valued functions.

In order to explain the idea of the cubature method on Wiener space, we consider
first the dynamics of risky asset prices proposed by Samuelson (1965), i.e.
dS(t) = rS(t)dt + σS(t)dW (t), [25.1]
where S(t) is the time-t price of a (non-dividend paying) risky asset, r and σ
are drift (risk-free interest rate) and diffusion (volatility of asset price) coefficients,
respectively, and {W (t)}t≥0 is a standard one-dimensional Wiener process. Black
and Scholes in Black and Scholes (1973) used equation [25.1] and delta-hedging to
derive the celebrated Black–Scholes PDE. Then, they used such a change of variables
that the above PDE became a well-known heat equation.

25.2.1. A simple example of classical Monte Carlo estimates

We can rearrange equation [25.1] and obtain the following:

$$\frac{dS(t)}{S(t)} = r\, dt + \sigma\, dW(t). \qquad [25.2]$$

Then, making a change of variables from S(t) to ln S(t) and applying the Itô lemma yields

$$S(t) = S(0)\exp\Big(\big(r - \tfrac{1}{2}\sigma^2\big)t + \sigma W(t)\Big). \qquad [25.3]$$
Equation [25.3] can be used in the classical Monte Carlo approach to estimate the fair price of European call and put options. That is, performing the steps presented in section 25.1, we are able to simulate the following equation (see Glasserman (2004)):

$$\hat{S}(t_i) = \hat{S}(t_{i-1}) \exp\Big(\big(r - \tfrac{1}{2}\sigma^2\big)(t_i - t_{i-1}) + \sigma\sqrt{t_i - t_{i-1}}\, Z_i\Big), \qquad [25.4]$$

where $i = 1, \ldots, n$, $0 = t_0 < t_1 < \ldots < t_n = T$ and $\{Z_i : 1 \leq i \leq n\}$ are independent standard normal random variables, i.e. $Z_i \sim \mathcal{N}(0, 1)$. If we simulate equation [25.4] enough times, then we can estimate the fair price of an option, which is its discounted expected payoff. Let us review how we can apply the Stratonovich correction to equation [25.1]. We start with the theories behind this idea.
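Before doing so, the classical approach can be illustrated with a short Python sketch (the figures in this chapter were produced by the authors in MATLAB; the code below, its function name and the strike value are illustrative assumptions). For a European call, only the terminal value matters, so the full time grid of equation [25.4] can be collapsed into a single step.

```python
import numpy as np

def mc_call_price(S0, K, r, sigma, T, n_paths=100_000, seed=0):
    """Classical Monte Carlo estimate of a European call price under Black-Scholes."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal(n_paths)
    # terminal prices from equation [25.3]/[25.4] applied over a single step of length T
    ST = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * Z)
    payoff = np.maximum(ST - K, 0.0)
    price = np.exp(-r * T) * payoff.mean()
    stderr = np.exp(-r * T) * payoff.std(ddof=1) / np.sqrt(n_paths)   # CLT-based error
    return price, stderr

# Example with the market data used later in section 25.2.5 (S0 = 20, r = 12%, sigma = 30%)
print(mc_call_price(S0=20.0, K=20.0, r=0.12, sigma=0.30, T=1.0))
```

For path-dependent claims, discussed in section 25.2.6, the intermediate steps of equation [25.4] cannot be collapsed and the full grid must be simulated.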

25.2.2. Modern Monte Carlo estimates via cubature method

To begin with, according to the fundamental theorem of asset pricing, the


no-arbitrage price of an attainable contingent claim X is equal to the discounted
expected payoff of the claim under the risk-neutral probability measure (see Kijima
(2013)). In particular, for the Black–Scholes model, this price is

C = e−rT E∗ [X],

where r denotes the risk-free interest rate and T is the time to maturity of the claim.
Recall that the mathematical expectation is the integral

$$C = e^{-rT} \int_{\Omega} X(\omega)\, dP^*(\omega). \qquad [25.5]$$

For the Black–Scholes model, Ω is the Banach space C0 ([0, T ]) of continuous


functions ω : [0, T ] → R with supremum norm and ω(0) = 0, F is the σ-field of
Borel subsets of Ω, and P∗ is the Wiener measure on Ω. In the simplest case, when the
contingent claim X is the European call option with strike price K, we have

X(ω) = max{S(T, ω) − K, 0},

where the stochastic process S(t, ω) denotes the price of a risky asset and is the
solution to the Black–Scholes SDE given in [25.1].

In fact, Merton (1973) obtained the Black–Scholes formula by calculating the integral given in equation [25.5] in closed form.

Even in the Black–Scholes model, such a calculation is not simple if the claim X
is more complicated than the European call or put option. Instead, we use the cubature
method.

For simplicity, let us consider a classical one-dimensional cubature method first.


Put $\Omega = \mathbb{R}^1$, let $\mathcal{F} = \mathcal{B}(\mathbb{R}^1)$ be the σ-field of Borel subsets of $\mathbb{R}^1$, and let

$$P^*(A) = \frac{1}{\sqrt{2\pi}} \int_{A} e^{-x^2/2}\, dx, \qquad A \in \mathcal{F},$$

that is, the probability measure which corresponds to the standard normal random variable. The idea of the cubature method is to replace the measure $P^*$ by another probability measure, say Q, which has a finite support. This means that there is a finite set $\{x_1, \ldots, x_\ell\} \subset \mathbb{R}^1$ and a set of positive real numbers $\{\lambda_1, \ldots, \lambda_\ell\}$ such that $\lambda_1 + \cdots + \lambda_\ell = 1$ and

$$Q(A) = \sum_{k \,:\, x_k \in A} \lambda_k, \qquad A \in \mathcal{F}.$$

For a function f , which is P∗ -integrable, the cubature formula takes the form
 ∞  ∞ 

f (x) dP∗ (x) ≈ f (x) dQ(x) = λk f (xk ).
−∞ −∞ k=1

If two persons propose two different cubature formulae, which one is better? Note that any polynomial in x is $P^*$-integrable. We say that a cubature formula Q has degree m if, for any polynomial P(x) of degree less than or equal to m, we have

$$\int_{-\infty}^{\infty} P(x)\, dP^*(x) = \sum_{k=1}^{\ell} \lambda_k P(x_k),$$

that is, for any such polynomial, the cubature formula is exact. If we have two different cubature formulae of degree m, then the one with the lower value of ℓ is better. The classical Tchakaloff theorem (Tchakaloff 1957) guarantees the existence of a cubature formula with ℓ less than or equal to the dimension of the linear space of corresponding polynomials. However, it is an existence theorem; it does not give any way to construct the nodes $x_k$ and the weights $\lambda_k$.
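As a concrete illustration added here (it is the classical three-point Gauss–Hermite rule for the standard normal measure, not a construction from the original text), the nodes $\{-\sqrt{3}, 0, \sqrt{3}\}$ with weights $\{1/6, 2/3, 1/6\}$ form a cubature formula of degree 5 with ℓ = 3; the short Python check below verifies exactness on the monomials up to degree 5.

```python
import numpy as np
from math import factorial

nodes = np.array([-np.sqrt(3.0), 0.0, np.sqrt(3.0)])
weights = np.array([1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0])

def gaussian_moment(n):
    """E[X^n] for X ~ N(0, 1): zero for odd n, (n - 1)!! for even n."""
    return 0.0 if n % 2 else factorial(n) / (2 ** (n // 2) * factorial(n // 2))

for n in range(6):   # exactness on x^0, ..., x^5 implies degree 5
    assert abs(np.dot(weights, nodes ** n) - gaussian_moment(n)) < 1e-12
```

The same three weights reappear in the degree-5 cubature formula on Wiener space summarized in Table 25.1.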

The definition of a cubature formula on Wiener space $C_0([0, T])$ is literally the same. That is, we have ℓ paths $\omega_k \in C_0([0, T])$ and ℓ weights $\lambda_k$ with $\lambda_1 + \cdots + \lambda_\ell = 1$ such that

$$\int_{C_0([0,T])} X(\omega)\, dP^*(\omega) \approx \sum_{k=1}^{\ell} \lambda_k X(\omega_k).$$

In order to compare competitive cubature formulae, we need to know which functions on the Wiener space $C_0([0, T])$ play the role of the polynomials P(x) on the space $\mathbb{R}^1$. The idea of Lyons and Victoir, presented in Lyons and Victoir (2002), is that these functions are the following iterated Stratonovich integrals:

$$P(\omega) = \int_0^T \int_{t_1}^T \cdots \int_{t_{\ell-1}}^T \circ\, dW(t_\ell) \circ \cdots \circ dW(t_1),$$

and the formula has degree m if and only if all $\omega_k$ have bounded variation and, for all iterated Stratonovich integrals up to degree m, we have

$$\mathbb{E}^*[P(\omega)] = \sum_{k=1}^{\ell} \lambda_k \int_0^T \int_{t_1}^T \cdots \int_{t_{\ell-1}}^T d\omega_k(t_\ell) \cdots d\omega_k(t_1),$$

where the integrals on the right-hand side are iterated Riemann–Stieltjes integrals. The Tchakaloff theorem remains true, but is still not constructive. In order to construct cubature formulae on Wiener space, we followed Lyons and Victoir (see Lyons and Victoir 2002) and used the advanced algebraic methods presented in Malyarenko et al. (2017) and Nohrouzian and Malyarenko (2019).

25.2.2.1. Application
Let {Y (t)}t≥0 be the solution to the following Itô SDE

dY (t) = ã(t, Y (t))dt + b(t, Y (t))dW (t), 0 ≤ t ≤ T.

We rewrite the above equation in its Stratonovich differential form (see Øksendal
(2013))

dY (t) = a(t, Y (t))dt + b(t, Y (t)) ◦ dW (t), 0 ≤ t ≤ T.

Now, writing the above SDE in its integral form yields

$$Y(t) = Y(0) + \int_0^t a(s, Y(s))\, ds + \int_0^t b(s, Y(s)) \circ dW(s).$$

We replace W(s) with the ℓ paths $\omega_k$, which gives the following ℓ integral equations:

$$Y_k(t) = Y_k(0) + \int_0^t a(s, Y_k(s))\, ds + \int_0^t b(s, Y_k(s))\, d\omega_k(s), \qquad 1 \leq k \leq \ell.$$

If the $\omega_k$, $k = 1, \ldots, \ell$, are piecewise differentiable, then the Riemann–Stieltjes integral on the right-hand side can be replaced with an ordinary Riemann integral, i.e.

$$Y_k(t) = Y_k(0) + \int_0^t a(s, Y_k(s))\, ds + \int_0^t b(s, Y_k(s))\, \omega_k'(s)\, ds.$$

Now, differentiating both sides gives

$$Y_k'(t) = a(t, Y_k(t)) + b(t, Y_k(t))\, \omega_k'(t).$$

If we are able to solve this equation explicitly, we obtain ℓ deterministic functions $Y_k(t)$. For each of them, we can calculate the value of a given contingent claim, say $X_k$. The number $\sum_{k=1}^{\ell} \lambda_k X_k$ is the approximate estimate for the value of the claim.
Later, in section 25.4, we can try to realize this program for the case of the Hull–White
one-factor model in Stratonovich form. If time to maturity is not small, we apply trees.

For the next step, let us briefly review the implementation of Stratonovich
corrections and the obtained results in Nohrouzian and Malyarenko (2019).

25.2.3. An application in the Black–Scholes SDE

In order to use the cubature formula on Wiener space for the Black–Scholes SDE, we need to rewrite the Itô process given in [25.2] in its Stratonovich form (see, for example, Øksendal (2013) and Nohrouzian and Malyarenko (2019)). That is,

$$dS(t) = \big(r - \tfrac{1}{2}\sigma^2\big) S(t)\, dt + \sigma S(t) \circ dW(t). \qquad [25.6]$$

Implementing the results of the cubature method on Wiener space in equation [25.6] yields

$$dS_k(t) = \big(r - \tfrac{1}{2}\sigma^2\big) S_k(t)\, dt + \sigma S_k(t)\, d\omega_k(t),$$

where $\omega_k$ ($1 \leq k \leq l$) is the kth possible trajectory, and l ($l \in \mathbb{Z}_+$) stands for the possible number of trajectories in the cubature formula of degree m.

Rearranging the last equation and calculating the integral of both sides gives

$$\hat{S}_k(t_j) = \hat{S}_k(t_{j-1}) \exp\Big(\big(r - \tfrac{1}{2}\sigma^2\big)[t_j - t_{j-1}] + \sigma[\omega_k(t_j) - \omega_k(t_{j-1})]\Big), \qquad [25.7]$$

with $j = 1, \ldots, l$ and $0 \leq t_j \leq 1$.

25.2.4. Trajectories of the cubature formula of degree 5 on Wiener space

We explicitly explain how to calculate ωk (tj ) − ωk (tj−1 ) in a cubature formula of


degree m = 5. In a cubature formula of degree 5, the number of trajectories is l = 3
and one of the possible solutions becomes

ωk (tj ) = 3θk,j (tj − tj−1 ) + ωk (tj−1 ), j = 1, 2, 3, ωk (0) = 0, [25.8]

where $0 = t_0 < t_1 < t_2 < t_3 = 1$, i.e. the trajectories start at time 0 and stop at time 1, $j = 1, 2, 3$, $t_j - t_{j-1} = 1/3$, and the weights $\lambda_k$ and coefficients $\theta_{k,j}$ are summarized in Table 25.1.

k   λk    θk,1 = θk,3         θk,2               θk,3 = θk,1
1   1/6   (−2√3 ∓ √6)/6       (−√3 ± √6)/3       (−2√3 ∓ √6)/6
2   2/3   ±√6/6               ∓√6/3              ±√6/6
3   1/6   (2√3 ± √6)/6        (√3 ∓ √6)/3        (2√3 ± √6)/6

Table 25.1. Information for cubature formulae of degree 5

Figure 25.1 is created in MATLAB® and depicts the two possible sets of
trajectories for cubature of degree 5.
Figure 25.1. Two sets of trajectories of cubature (degree 5) on Wiener space
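For readers who want to reproduce these trajectories, the following short sketch (our own illustration, taking the upper signs in Table 25.1; the lower signs give the mirrored set of paths in the other panel of Figure 25.1) evaluates equation [25.8] on the grid 0 = t0 < t1 < t2 < t3 = 1:

import numpy as np

# One of the two possible solutions of the degree-5 cubature (Table 25.1),
# taking the upper signs; the lower signs give the mirrored set of paths.
s3, s6 = np.sqrt(3.0), np.sqrt(6.0)
lam = np.array([1/6, 2/3, 1/6])                        # weights lambda_k
theta = np.array([                                     # theta_{k,j}, j = 1, 2, 3
    [(-2*s3 - s6)/6, (-s3 + s6)/3, (-2*s3 - s6)/6],
    [ s6/6,          -s6/3,         s6/6         ],
    [ (2*s3 + s6)/6, ( s3 - s6)/3, (2*s3 + s6)/6 ],
])

t = np.array([0.0, 1/3, 2/3, 1.0])                     # t_0 < t_1 < t_2 < t_3

# Equation [25.8]: omega_k(t_j) = 3*theta_{k,j}*(t_j - t_{j-1}) + omega_k(t_{j-1}).
omega = np.zeros((3, 4))                               # omega_k(t_j), omega_k(0) = 0
for k in range(3):
    for j in range(1, 4):
        omega[k, j] = 3*theta[k, j-1]*(t[j] - t[j-1]) + omega[k, j-1]

print(np.round(omega, 4))
# The end points omega_k(1) are -sqrt(3), 0 and +sqrt(3) (up to the sign choice).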

25.2.5. Trajectories of price process given in equation [25.7]

Assume that today's market price of an arbitrary asset is S_0 = $20, the yearly interest rate is r = 12% and the yearly volatility is σ = 30%. Then, we can rearrange equation [25.8] and substitute the result in equation [25.7]. As a result, we get two possible sets of trajectories for the price process, illustrated in Figure 25.2.

Figure 25.2. Price trajectories for S0 = 20, r = 0.12 and σ = 0.30
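A sketch of how such price trajectories can be computed from equation [25.7] (our own illustration; for brevity it uses straight-line paths ω_k ending at −√3, 0 and √3, whereas plugging in the ω array from the previous sketch reproduces the cubature price paths of Figure 25.2 exactly):

import numpy as np

# Price trajectories via equation [25.7] for S0 = 20, r = 0.12, sigma = 0.30.
S0, r, sigma = 20.0, 0.12, 0.30
t = np.array([0.0, 1/3, 2/3, 1.0])

def price_paths(omega):
    """omega: array of shape (3, 4) holding the values omega_k(t_j)."""
    S = np.full(omega.shape, S0)
    for k in range(omega.shape[0]):
        for j in range(1, omega.shape[1]):
            drift = (r - 0.5 * sigma**2) * (t[j] - t[j-1])
            jump = sigma * (omega[k, j] - omega[k, j-1])
            S[k, j] = S[k, j-1] * np.exp(drift + jump)
    return S

# Straight-line omega_k for illustration (end values -sqrt(3), 0, sqrt(3)).
omega = np.outer([-np.sqrt(3.0), 0.0, np.sqrt(3.0)], t)
print(np.round(price_paths(omega), 4))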

Furthermore, Figure 25.3 depicts the idea of the iterated cubature formula which
we will use later to construct a trinomial tree.

If we remove intermediate movements in each grid, then the trajectories of Figure 25.3 will look like Figure 25.4a, where SBM stands for standard Brownian motion. Additionally, if we choose different interval lengths, interest and volatility in each grid, the result of iterated trajectories of the cubature formula of degree 5 on a geometric Brownian motion (GBM) will look like Figure 25.4b. In Figure 25.4b, we chose the length of the first grid to be one day, the length of the second grid to be two days and the length of the last grid to be three days.
Figure 25.3. Iterated trajectories of cubature (degree 5) on Wiener space

(a) Recombining tree (SBM) (b) Non-recombining tree (GBM)

Figure 25.4. Iterated trajectories on Wiener space

25.2.6. An application on path-dependent derivatives

According to Hull (2017), the path-dependent (history-dependent) derivative's payoff functions depend not only on the final value of the underlying asset but also
on the path followed by the underlying asset. Lookback and Asian options are some
examples of the path-dependent derivatives (see Hull and White 1993). Although
Monte Carlo simulation can be used to price the path-dependent derivatives, the
estimation might be very time consuming until it reaches the desired accuracy. A
classical way to deal with American-style path-dependent options is using lattice
approximations. Cox, Ross and Rubinstein (CRR) proposed the famous CRR
(binomial) model in Cox et al. (1979). There exist lattice models of different kinds.
We could use our cubature formula of degree 5 in Nohrouzian and Malyarenko (2019)
to construct a recombining trinomial model. Let us briefly review this model in the
next section.
25.2.7. Trinomial tree (model) via cubature formulae of degree 5

Assume that we divide the time interval [0, T ] into n intervals of not necessarily
equal length, where n ∈ Z, n ≥ 1. Then, we will have {0 = t0 , . . . , tn = T } and an
n-step time grid.

We will construct an n-step recombining trinomial model by setting n equal time intervals, i.e. a uniform grid (see Nohrouzian and Malyarenko (2019) for full details). Denote
the up factor by fu , the middle factor by fm and the down factor by fd . Then, use
equation [25.7] for 1 ≤ k ≤ 3 to calculate the up, middle and down factors in each
time grid. Note that, depending on the choice of n, the interest rate r and volatility σ
should be calibrated for the chosen time interval length.

That is,

\[
f_u = \frac{S_1(1)}{S(0)} = \exp\Big(\big(r - \tfrac{1}{2}\sigma^2\big) + \sigma\,\omega_1(1)\Big) = \exp\Big(\big(r - \tfrac{1}{2}\sigma^2\big) + \sigma\sqrt{3}\Big),
\]
\[
f_m = \frac{S_2(1)}{S(0)} = \exp\Big(r - \tfrac{1}{2}\sigma^2\Big),
\]
\[
f_d = \frac{S_3(1)}{S(0)} = \exp\Big(\big(r - \tfrac{1}{2}\sigma^2\big) + \sigma\,\omega_3(1)\Big) = \exp\Big(\big(r - \tfrac{1}{2}\sigma^2\big) - \sigma\sqrt{3}\Big),
\]

where we omitted the intermediate calculations in equation [25.7]; as depicted in Figure 25.5a, they do not affect the final values of f_u, f_m and f_d.

(a) One-step trinomial and cubature (b) Five-step trinomial model via cubature

Figure 25.5. Idea of a trinomial tree

Furthermore, due to the symmetry of the paths ω_1 and ω_3, i.e. ω_1 = −ω_3, and the log-normality of the considered price process, i.e. S_k > 0, it is easy to see that $f_m = \sqrt{f_u f_d}$. Therefore, we have a recombining trinomial tree. Figure 25.5b depicts a five-step trinomial tree with S_0 = $100, r = 0.05 and σ = 0.1.
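The following sketch (our own check, treating the quoted values S_0 = 100, r = 0.05 and σ = 0.1 as per-step quantities) computes f_u, f_m and f_d, verifies the recombination condition f_m = √(f_u f_d) and counts the distinct nodes of the five-step tree:

import numpy as np

# Up, middle and down factors of the one-step cubature trinomial model,
# and a check that f_m = sqrt(f_u * f_d), so the tree recombines.
S0, r, sigma, n = 100.0, 0.05, 0.1, 5

fu = np.exp((r - 0.5*sigma**2) + sigma*np.sqrt(3.0))
fm = np.exp( r - 0.5*sigma**2)
fd = np.exp((r - 0.5*sigma**2) - sigma*np.sqrt(3.0))
assert np.isclose(fm, np.sqrt(fu*fd))        # recombination condition

# Node values of an n-step recombining trinomial tree: after i steps a node is
# reached by u up-moves and d down-moves (u + d <= i); its value depends only
# on u - d, which is what makes the tree recombine.
for i in range(n + 1):
    level = sorted({round(S0 * fu**u * fd**d * fm**(i - u - d), 6)
                    for u in range(i + 1) for d in range(i + 1 - u)})
    print(f"step {i}: {len(level)} distinct nodes")

The node count grows only linearly (2i + 1 distinct nodes at step i), which is precisely the advantage of the recombining construction over the iterated (non-recombining) trees discussed later.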

In Nohrouzian and Malyarenko (2019), we explicitly tested such a trinomial model and compared its results with the results of the classical CRR and Black–Scholes models. In the next section, we would like to extend the usage of the described trinomial trees to the interest-rate (term-structure) models. Namely, we consider the Hull–White one-factor model.

25.3. Interest-rate models and Hull–White one-factor model

In mathematical finance, it is possible to model the dynamics of risky assets via security market models and interest-rate (term-structure) models in the following
sense. In security market models, the underlying assets are securities or stocks. In
interest-rate models, however, the underlying assets are interest rates. Moreover, in the
interest-rate models, the price of a default-free discount bond for different maturities
is called the term structure of interest rates (Kijima 2013). Interest-rate models are
typically used to value bonds and interest-rate derivatives, for example, caps, floors and swap options (swaptions).

Interest-rate models can be divided into two groups. First, spot rate models which
consist of equilibrium models and no-arbitrage models. Second, forward rate models.
Let {r(t)}t≥0 be the stochastic interest rate (instantaneous spot rate), B(t) be the
money market account, P (t, T ) be the market price of a default-free discount bond and
f (t, T ) be the instantaneous forward rate. Then, the following relations hold (Kijima
2013):
\[
\text{for } t \le T, \qquad r(t) = -\frac{\partial}{\partial T}\ln P(t,T)\Big|_{T=t}, \qquad f(t,T) = -\frac{\partial}{\partial T}\ln P(t,T),
\]
\[
P(t,T) = \exp\left(-\int_t^T f(t,s)\,ds\right), \qquad
P(t,T) = \exp\left(-\int_t^T r(s)\,ds\right) = \frac{B(t)}{B(T)}.
\]
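As a small numerical illustration of the first relation (a sketch only: the forward curve f(0, s) below is a hypothetical linear curve, and the trapezoidal rule stands in for the exact integral):

import numpy as np

# P(0, T) = exp( -integral_0^T f(0, s) ds ), evaluated by the trapezoidal rule
# for a hypothetical linear instantaneous forward curve (rates in decimals).
def f0(s):
    return 0.01 + 0.002 * s

def discount_bond(T, n=1000):
    s = np.linspace(0.0, T, n + 1)
    f = f0(s)
    integral = np.sum(0.5 * (f[:-1] + f[1:]) * np.diff(s))
    return np.exp(-integral)

for T in (1.0, 5.0, 10.0):
    print(f"P(0, {T}) = {discount_bond(T):.6f}")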

Typically, spot-rate models are mean reverting. Let us briefly review equilibrium,
no-arbitrage and forward rate models.

25.3.1. Equilibrium models

The equilibrium models typically satisfy the following SDE:

dr(t) = m(r(t))dt + s(r(t))dW (t), t ≥ 0, [25.9]

where m and s are the instantaneous drift and standard deviation, respectively, and are assumed to be functions of the instantaneous spot rate, usually time-independent.
Model                  Drift       Diffusion
Rendleman–Bartter      μr          σr
Vasicek                a(b − r)    σ
Cox–Ingersoll–Ross     a(b − r)    σ√r

Table 25.2. Some well-known equilibrium models, where a and b are constants

Proper choices of m and s summarized in Table 25.2 convert SDE [25.9] to SDEs
given in the Rendleman–Bartter model (Rendleman and Bartter 1980), the Vasicek
model (Vasicek 1977) and the Cox–Ingersoll–Ross (CIR) model (Cox et al. 1985)
(see Kijima 2013; Hull 2017).
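For concreteness, a sketch of the three drift/diffusion pairs of Table 25.2 together with a short Euler–Maruyama path of the Vasicek model (all parameter values below are hypothetical and chosen only for illustration):

import numpy as np

# Drift m(r) and diffusion s(r) of the equilibrium models in Table 25.2,
# with illustrative (hypothetical) parameter values.
MU, A, B, SIG = 0.03, 0.5, 0.03, 0.01

models = {
    "Rendleman-Bartter":  (lambda r: MU * r,      lambda r: SIG * r),
    "Vasicek":            (lambda r: A * (B - r), lambda r: SIG),
    "Cox-Ingersoll-Ross": (lambda r: A * (B - r), lambda r: SIG * np.sqrt(max(r, 0.0))),
}

def euler_path(m, s, r0, T=5.0, n=1000, seed=0):
    """Euler-Maruyama discretization of dr = m(r) dt + s(r) dW."""
    rng = np.random.default_rng(seed)
    dt = T / n
    r = r0
    for _ in range(n):
        r += m(r) * dt + s(r) * np.sqrt(dt) * rng.standard_normal()
    return r

m, s = models["Vasicek"]
print("Vasicek sample of r(T):", round(euler_path(m, s, r0=0.02), 5))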

25.3.2. No-arbitrage models

Unlike equilibrium models, the drift and diffusion parts of the SDE [25.9] in
no-arbitrage models are usually functions of time. Given a filtered probability space
(Ω, F, P, (F_t)_{t≥0}), in no-arbitrage models, the dynamics of the spot rate r under the physical probability measure P satisfy the following SDE (Kijima 2013; Pascucci 2011):

dr(t) = α(t, r(t))dt + β(t, r(t))dW (t), t ≥ 0. [25.10]

By definition, the price of a default-free discount bond via risk neutral pricing,
i.e. using the equivalent martingale probability measure Q ∼ P , is given by Pascucci
(2011)
 
\[
P(t,T) = \mathsf{E}^{Q}\!\left[\exp\left(-\int_t^T r(u)\,du\right)\,\Big|\,\mathcal{F}_t\right], \qquad 0 \le t \le T. \qquad [25.11]
\]

Let V(t, r) = P(t, T); then, by equation [25.11] and the Feynman–Kac representation formula, we can obtain the Q-dynamics of equation [25.10] for the short rate r, the so-called term-structure equation (Kijima 2013; Pascucci 2011):

\[
\partial_t V(t,r) + \alpha(t,r)\,\partial_r V(t,r) + \tfrac{1}{2}\,\beta^2(t,r)\,\partial_{rr} V(t,r) = r\,V(t,r), \qquad V(T,r) = 1. \qquad [25.12]
\]

Let α_1, α_2, β_1 and β_2 be deterministic functions and set α(t, r) = α_1(t) + α_2(t)r and β^2(t, r) = β_1(t) + β_2(t)r; then, we obtain affine models. According to Pascucci (2011), affine models are effective under the empirical term structure. In the case of affine models, the term-structure equation given in [25.12] can be solved semi-analytically in terms of (Riccati-type) first-order differential equations (see also Kijima (2013, Theorem 15.1)).
The Ho–Lee model (Ho and Lee 1986) and the Hull–White one-factor model (Hull
and White 2015) are some well-known no-arbitrage (as well as affine) models. For the
purpose of this chapter, we will closely look at the Hull–White model and its lattice
applications.

25.3.3. Forward rate models

The most famous forward rate models are Black (Black 1976),
Heath–Jarrow–Morton (HJM) (Heath et al. 1990, 1992) and Brace–Gatarek–Musiela
(BGM) (Brace et al. 1997) or the LIBOR market model (see also Jamshidian 1997;
Miltersen et al. 1997). For more detailed information about interest-rate models, the
reader is referred to Kijima (2013); Hull (2017); Nohrouzian et al. (2021).

25.3.4. Hull–White one-factor model

The Hull–White one-factor model, also known as the extended Vasicek model, generalizes the Ho–Lee model. The Hull–White one-factor model provides an exact fit to the
initial term structure (see Hull and White 2015). Set α(t, r) = α1 (t) + α2 (t)r =
[ϕ(t) − ar] and β 2 (t, r) = β1 (t) + β2 (t)r = σ 2 in equation [25.10]; then, the
instantaneous short-rate r in the Hull–White model is the solution to

dr(t) = [ϕ(t) − ar(t)]dt + σdW (t), 0 ≤ t ≤ T, [25.13]

where a and σ are constants. Note that SDE [25.13] (even if a and σ are not
constants and are functions of time) describes a general Gaussian–Markov process
(see Glasserman 2004, Equation (3.41), p. 109). Moreover, the function ϕ(t) can be
calculated from the initial term structure (Hull and White 2015; Hull 2017). That is,

\[
\varphi(t) = \partial_t f(0,t) + a f(0,t) + \frac{\sigma^2}{2a}\big(1 - e^{-2at}\big). \qquad [25.14]
\]
where f(0, t) is the observed instantaneous forward rate in the market at time 0. We will explain the idea and derivation of ϕ(t) in section 25.3.6. Using the above equation, we can rewrite SDE [25.13] in the following form:
  
\[
dr(t) = \left[\partial_t f(0,t) + a\left(f(0,t) + \frac{\sigma^2}{2a^2}\big(1 - e^{-2at}\big) - r(t)\right)\right]dt + \sigma\,dW(t). \qquad [25.15]
\]

SDE [25.15] describes the dynamics of an affine model and has an explicit
solution. We will not go through the theory of affine models. Instead, we would like
to concentrate on the application of the Hull–White model to construct a trinomial
tree using the cubature method. The reader is therefore referred to Kijima (2013), Hull (2017) and Hull and White (2015).
Hull and White (1994, 1996) explained how to apply a numerical procedure and
use an interest-rate tree in their model. As we explained in section 25.2.6, the lattice
approximation has advantages in terms of time efficiency and pricing American-style
path-dependent options. Now, we would like to construct the Hull–White trinomial
interest-rate model using the cubature formula of degree 5 on Wiener space.

25.3.5. Discretization of the Hull–White model via Euler scheme

The discretization of SDE [25.15] in the Euler scheme (see Glasserman 2004) gives

\[
\hat r(t_{i+1}) = \hat r(t_i) + \left[\partial_t \hat f(0,t_i) + a\left(\hat f(0,t_i) + \frac{\sigma^2}{2a^2}\big(1 - e^{-2at_i}\big) - \hat r(t_i)\right)\right](t_{i+1}-t_i) + \sigma\sqrt{t_{i+1}-t_i}\,Z_{i+1}, \qquad [25.16]
\]

where 0 ≤ i ≤ n, n is the number of time discretizations and Z_{i+1} ∼ N(0, 1) are independent standard normal random variables. Also, observe that, by the fundamental theorem of calculus, we have $\int_{t_i}^{t_{i+1}} \partial_u f(0,u)\,du = f(0,t_{i+1}) - f(0,t_i)$.
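A sketch of this Euler scheme (our own illustration: the forward curve f(0, t) below is a hypothetical linear curve in percent, loosely echoing the scale of the ECB data used later in the chapter, and its derivative is supplied analytically):

import numpy as np

# Euler scheme [25.16] for the Hull-White short rate, with a hypothetical
# smooth forward curve f(0, t); df/dt is obtained analytically here.
a, sigma = 0.075, 0.15

def f0(t):                 # illustrative instantaneous forward curve (in %)
    return -0.64 + 0.03 * t

def df0(t):                # its time derivative
    return 0.03

def euler_hull_white(r0, T=30.0, n=1560, seed=1):
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, T, n + 1)
    r = np.empty(n + 1)
    r[0] = r0
    for i in range(n):
        dt = t[i+1] - t[i]
        drift = df0(t[i]) + a * (f0(t[i])
                                 + sigma**2 / (2*a**2) * (1 - np.exp(-2*a*t[i]))
                                 - r[i])
        r[i+1] = r[i] + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    return t, r

t, r = euler_hull_white(r0=-0.61)
print("simulated r(T):", round(r[-1], 5))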

25.3.6. Hull–White model for bond prices

To begin with, the Ho–Lee model and Vasicek model are special cases of the
following general Markov process (Glasserman 2004)

dr(t) = [g(t) + h(t)r(t)]dt + σ(t)dW (t),

where the functions g(t), h(t) and σ(t) are time-dependent. The solution to the above
SDE (affine model) is given by

\[
r(t) = r(0)\,e^{H(t)} + \int_0^t e^{H(t)-H(u)}\,g(u)\,du + \int_0^t e^{H(t)-H(u)}\,\sigma(u)\,dW(u),
\]

where

\[
H(t) = \int_0^t h(u)\,du.
\]

The above relation can be verified by the Itô formula. For the Hull–White model,
the general solution of the above for any given r(s), 0 < s < t, is

\[
r(t) = r(s)\,e^{-a(t-s)} + \int_s^t e^{-a(t-u)}\,\varphi(u)\,du + \int_s^t e^{-a(t-u)}\,\sigma(u)\,dW(u). \qquad [25.17]
\]
For a given r(s), the first two terms on the right-hand side of the above equation
are the mean of the normally distributed r(t). By definition, for the variance, we have

\[
\sigma_r^2(s,t) := \sigma^2 \int_s^t e^{-2a(t-u)}\,du = \frac{\sigma^2}{2a}\big(1 - e^{-2a(t-s)}\big).
\]

25.3.6.1. Simulation (Part I)


Now, we can simulate the process r(t), using

r̂(ti+1 ) = e−a(ti+1 −ti ) r̂(ti ) + μ(ti , ti+1 ) + σr (ti , ti+1 )Zi+1 , [25.18]

where 0 ≤ i ≤ n, n is the number of time discretizations, Z_{i+1} ∼ N(0, 1) are independent standard normal random variables and

\[
\mu(t_i, t_{i+1}) = \int_{t_i}^{t_{i+1}} e^{-a(t_{i+1}-u)}\,\varphi(u)\,du. \qquad [25.19]
\]

We will calculate ϕ(t) in the next part of this chapter. Let us start with the calibration of the market data to the short-rate process r.

25.3.6.2. Calibration
Assume that X ∼ N (m, v 2 ) is a normal random variable, then we have

E[exp(X)] = exp(m + v 2 /2).

Using the above relation and equation [25.11] yields

\[
P(t,T) = \mathsf{E}\!\left[\exp\left(-\int_t^T r(u)\,du\right)\right]
= \exp\left(-\mathsf{E}\!\left[\int_t^T r(u)\,du\right] + \frac{1}{2}\,\mathrm{Var}\!\left[\int_t^T r(u)\,du\right]\right). \qquad [25.20]
\]

Substituting equation [25.17] into the expected value of the integral of the process r gives

\[
\mathsf{E}\!\left[\int_0^T r(u)\,du\right] = \int_0^T \mathsf{E}[r(u)]\,du
= \frac{r(0)}{a}\big(1 - e^{-aT}\big) + \int_0^T\!\!\int_0^t e^{-a(t-u)}\,\varphi(u)\,du\,dt.
\]
For the variance, let u ≤ t,

\[
\mathrm{Var}\!\left[\int_0^T r(u)\,du\right]
= 2\int_0^T\!\!\int_0^t \mathrm{Cov}[r(t), r(u)]\,du\,dt
= 2\int_0^T\!\!\int_0^t \left(\sigma^2 \int_0^u e^{-a(t-s)}\,e^{-a(u-s)}\,ds\right) du\,dt
\]
\[
= 2\int_0^T\!\!\int_0^t \frac{\sigma^2}{2a}\big(e^{a(u-t)} - e^{-a(u+t)}\big)\,du\,dt
= \frac{\sigma^2}{a^2}\left[T + \frac{1}{2a}\big(1 - e^{-2aT}\big) + \frac{2}{a}\big(e^{-aT} - 1\big)\right].
\]
Substituting the obtained mean and variance in [25.20] gives the bond price

\[
P(t,T) = \exp\left(-\frac{r(t)}{a}\big(1 - e^{-a(T-t)}\big) - \int_t^T\!\!\int_t^u e^{-a(u-s)}\,\varphi(s)\,ds\,du\right)
\times \exp\left(\frac{\sigma^2}{2a^2}\left[(T-t) + \frac{1}{2a}\big(1 - e^{-2a(T-t)}\big) + \frac{2}{a}\big(e^{-a(T-t)} - 1\big)\right]\right).
\]

25.3.6.3. Derivation of ϕ(t) in the Hull–White model for bond prices


On the one hand, we now know the value of P(0, T). On the other hand, we discussed $P(0,T) = \exp\big(-\int_0^T f(0,s)\,ds\big)$. Equating these two results in

\[
\int_0^T f(0,s)\,ds = \frac{r(0)}{a}\big(1 - e^{-aT}\big) + \int_0^T\!\!\int_0^t e^{-a(t-u)}\,\varphi(u)\,du\,dt
- \frac{\sigma^2}{2a^2}\left[T + \frac{1}{2a}\big(1 - e^{-2aT}\big) + \frac{2}{a}\big(e^{-aT} - 1\big)\right].
\]
Differentiating the above expression once with respect to T gives

\[
f(0,T) = r(0)\,e^{-aT} + \int_0^T e^{-a(T-u)}\,\varphi(u)\,du - \frac{\sigma^2}{2a^2}\big(1 + e^{-2aT} - 2e^{-aT}\big).
\]
By the Leibniz integral rule, we have

\[
\partial_T \int_0^T e^{-a(T-u)}\,\varphi(u)\,du = \int_0^T \varphi(u)\,(-a)\,e^{-a(T-u)}\,du + \varphi(T).
\]

Therefore, calculating ∂_T f(0, T) gives

\[
\partial_T f(0,T) = -a\,r(0)\,e^{-aT} + \int_0^T \varphi(u)\,(-a)\,e^{-a(T-u)}\,du + \varphi(T)
+ \frac{\sigma^2}{a}\big(e^{-2aT} - e^{-aT}\big).
\]
Thus,

\[
\varphi(T) = \partial_T f(0,T) + a f(0,T) + \frac{\sigma^2}{2a}\big(1 - e^{-2aT}\big).
\]
Finally, the above equation is true for any maturity. That is,

\[
\varphi(t) = \partial_T f(0,T)\Big|_{T=t} + a f(0,t) + \frac{\sigma^2}{2a}\big(1 - e^{-2at}\big). \qquad [25.21]
\]

25.3.6.4. Simulation (Part II)


We first substitute [25.21] into [25.19]. That is,

\[
\mu(t_i, t_{i+1}) = \int_{t_i}^{t_{i+1}} e^{-a(t_{i+1}-u)}\left[\partial_T f(0,T)\Big|_{T=u} + a f(0,u) + \frac{\sigma^2}{2a}\big(1 - e^{-2au}\big)\right] du.
\]

Substituting the result into equation [25.18], we get the following general equation to simulate process r:

\[
\hat r(t_{i+1}) = e^{-a(t_{i+1}-t_i)}\,\hat r(t_i) + \mu(t_i, t_{i+1}) + \sigma\sqrt{\frac{1 - e^{-2a(t_{i+1}-t_i)}}{2a}}\,Z_{i+1}. \qquad [25.22]
\]

25.4. The Hull–White model via cubature method

To use the cubature formula of degree 5 in constructing the Hull–White trinomial tree, we follow the discussion from section 25.2. The first step is to rewrite the SDE [25.10] in its (multi-dimensional, n × m) Stratonovich representation form. That can be achieved by (Øksendal 2013)

\[
dr(t) = \tilde\alpha(t, r(t))\,dt + \beta(t, r(t)) \circ dW(t),
\]

where

\[
\tilde\alpha_i(t,x) = \alpha_i(t,x) - \frac{1}{2}\sum_{j=1}^{m}\sum_{k=1}^{n} \beta_{kj}\,\frac{\partial \beta_{ij}}{\partial x_k}, \qquad 1 \le i \le n,
\]

is called the Stratonovich correction. For the Hull–White model (one-dimensional case), and since σ is constant, we have

\[
\tilde\alpha(t,x) = \alpha(t,x) - 0 = \alpha(t,x).
\]
Substituting the values of the last equations in equations [25.13] and [25.14] converts equation [25.15] into

\[
dr(t) = \left[\partial_t f(0,t) + \frac{\sigma^2}{2a}\big(1 - e^{-2at}\big) + a\big(f(0,t) - r(t)\big)\right]dt + \sigma \circ dW(t). \qquad [25.23]
\]
 
Let κ(t) = σ 2 /(2a) 1 − e−2at . In the next step, we rewrite the above equation
in its integral form, that is
 t  t
r(t) = r(0)+ (∂s f (0, s)+κ(s) + a [f (0, s) − r(s)]) ds + σ ◦dW (s).
0 0

Now, we get the following set of Riemann–Stieltjes integral equations given by the cubature method:

\[
r_k(t) = r_k(0) + \int_0^t \big(\partial_s f(0,s) + \kappa(s) + a\,[f(0,s) - r_k(s)]\big)\,ds + \sigma \int_0^t d\omega_k(s).
\]

Taking the derivative of both sides of the above equation and applying the fundamental theorem of calculus replaces the SDE [25.23] with the following finite set of ODEs:

\[
r_k'(t) = \partial_t f(0,t) + \kappa(t) + a\,[f(0,t) - r_k(t)] + \sigma\,\omega_k'(t), \qquad [25.24]
\]

with 1 ≤ k ≤ ℓ and $r(t) = \sum_{k=1}^{\ell} \lambda_k r_k(t)$. The last equation, for the cubature formula of degree m = 5 with ℓ = 3, works better for a small time interval. Let us try to implement this equation, iterating it multiple times.

25.4.1. Simulating SDE [25.15] and ODE [25.24]

Let us simulate the SDE and the ODE given in [25.15] and [25.24]. First, we make a discretization of both equations. On the one hand, recall that the discretization of SDE [25.15] in the Euler scheme was given by [25.16] (see Glasserman 2004). If we use our shortened notation, we have

\[
\hat r(t_{i+1}) = \hat r(t_i) + \Big(\partial_t \hat f(0,t_i) + \kappa(t_i) + a\big[\hat f(0,t_i) - \hat r(t_i)\big]\Big)\,\Delta t_i + \sigma\sqrt{\Delta t_i}\,Z_{i+1}, \qquad [25.25]
\]

where 0 ≤ i ≤ n, n is the number of time discretizations, Δt_i = t_{i+1} − t_i and Z_{i+1} ∼ N(0, 1) are independent standard normal random variables.

On the other hand, in the implementation of [25.24] and for simplicity, we are not
interested in intermediate values in each time grid with length one (see Figure 25.5a).
Therefore, we will use only the last value of the sample path in the cubature formula in equation [25.8]. After that, discretization of [25.24] gives

\[
\hat r_k(t_{i+1}) = \hat r_k(t_i) + \partial_t \hat f(0,t_i) + \kappa(t_i) + a\big[\hat f(0,t_i) - \hat r_k(t_i)\big] + \sigma\,\omega_k, \qquad [25.26]
\]

where 0 ≤ i ≤ n, n is the number of time discretizations, 1 ≤ k ≤ 3, ω_1 = −√3, ω_2 = 0 and ω_3 = √3 (see the values of ω(t) at the end of the grids in Figure 25.2). Furthermore, the values of λ_k are given in Table 25.1. Also, we observe that

\[
\hat r(t_i) = \sum_{k=1}^{3} \lambda_k\,\hat r_k(t_i).
\]
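A compact sketch of this recursion (our own illustration; the forward curve and its derivative are hypothetical placeholders in percent, and the grid length is one, as assumed above):

import numpy as np

# Sketch of the cubature recursion [25.26] on a unit-length grid, together with
# the weighted average r_hat(t_i) = sum_k lambda_k * r_hat_k(t_i).
a, sigma = 0.075, 0.15
lam   = np.array([1/6, 2/3, 1/6])
omega = np.array([-np.sqrt(3.0), 0.0, np.sqrt(3.0)])   # end-of-grid values

def f0(t):    return -0.64 + 0.03 * t   # illustrative forward curve, in %
def df0(t):   return 0.03
def kappa(t): return sigma**2 / (2*a) * (1 - np.exp(-2*a*t))

def cubature_mean(r0, n=30):
    rk = np.full(3, r0)                  # r_hat_k(t_0)
    means = [r0]
    for i in range(n):                   # unit time steps, t_i = i
        drift = df0(i) + kappa(i) + a * (f0(i) - rk)
        rk = rk + drift + sigma * omega  # equation [25.26]
        means.append(float(lam @ rk))    # weighted average r_hat(t_{i+1})
    return np.array(means)

print(np.round(cubature_mean(r0=-0.61)[:5], 4))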

We used the (instantaneous) forward rates (FR) and spot rates (SR) available from the European Central Bank (ECB) on March 1, 2021. The given rates are based on triple-A, i.e. AAA-rated, bonds and are summarized in Table 25.3.

i 0 1 2 3 4 5 6 7 8
ti (in years) 0 0.25 0.50 0.75 1.00 2.00 3.00 4.00 5.00
f (0, ti ) -0.63737 -0.63737 -0.679325 -0.709803 -0.730072 -0.730798 -0.644705 -0.512856 -0.362631
r(ti ) -0.611094 -0.611094 -0.635227 -0.655307 -0.671664 -0.705749 -0.701509 -0.671451 -0.624844
i 9 10 11 12 13 14 15 16 17
ti (in years) 6.00 7.00 8.00 9.00 10.00 11.00 12.00 13.00 14.00
f (0, ti ) -0.211625 -0.070525 0.054764 0.161581 0.249389 0.318988 0.371949 0.410242 0.43597
r(ti ) -0.56848 -0.507261 -0.444655 -0.383053 -0.324043 -0.268616 -0.217334 -0.170444 -0.127979
i 18 19 20 21 22 23 24 25 26
ti (in years) 15.00 16.00 17.00 18.00 19.00 20.00 21.00 22.00 23.00
f (0, ti ) 0.451206 0.457894 0.457791 0.45244 0.443166 0.431087 0.417124 0.402025 0.386386
r(ti ) -0.089822 -0.055759 -0.025518 0.001205 0.024725 0.045355 0.063396 0.079135 0.092835
i 27 28 29 30 31 32 33 - -
ti (in years) 24.00 25.00 26.00 27.00 28.00 29.00 30.00 - -
f (0, ti ) 0.370669 0.355226 0.340317 0.326124 0.312769 0.300323 0.288821 - -
r(ti ) 0.104738 0.115065 0.124013 0.131759 0.13846 0.144253 0.149261 - -

Table 25.3. ECB instantaneous forward rates and spot rates in %, March 1, 2021

Given the data in Table 25.3, we used the MATLAB® function polyfit to estimate f̂(0, t_i) for i = 1, ..., n, which fits the given data for the instantaneous forward rates in a least-squares sense. Figure 25.6 illustrates the graphs of the initial data and the obtained polynomial, where the degree of the fitted polynomial is 6.
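The same fit can be sketched in Python with numpy's polyfit (the chapter used MATLAB's polyfit; only a subset of the Table 25.3 forward rates is typed in below, so the resulting coefficients are illustrative):

import numpy as np

# Degree-6 least-squares polynomial fit of the instantaneous forward rates,
# mirroring the MATLAB polyfit step; a subset of Table 25.3 is typed in here.
ti = np.array([0.0, 0.25, 0.5, 0.75, 1.0, 2.0, 3.0, 5.0, 10.0, 15.0, 20.0, 25.0, 30.0])
fr = np.array([-0.63737, -0.63737, -0.679325, -0.709803, -0.730072, -0.730798,
               -0.644705, -0.362631, 0.249389, 0.451206, 0.431087, 0.355226, 0.288821])

coeffs = np.polyfit(ti, fr, deg=6)        # highest-degree coefficient first
f_hat  = np.poly1d(coeffs)                # polynomial estimate of f(0, t)
df_hat = f_hat.deriv()                    # its derivative, needed in [25.16]/[25.26]

print("fitted f(0, 10):", round(f_hat(10.0), 5), " observed:", fr[8])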

After that, we set n = 30 × 52, i.e. 1,560 weeks, a ∈ {0.075, 0.35}, σ = 0.15, r(t_1) = r(t_0) and f(0, t_1) = f(0, t_0). Then, substituting the obtained polynomial in equations [25.16] and [25.26], we simulate the spot-rate SDE in the Hull–White model. We make 100,000 simulations for classical Monte Carlo and then take the average of the simulations. We make one cubature simulation, where we calculate weighted average values in each step. Note that this is equivalent to simulating only the middle path in equation [25.26], i.e. for k = 2. The results are depicted in Figures 25.7a and 25.7b.

Figure 25.6. Initial FR, SR and polynomial fit

(a) a = 0.075 and σ = 0.15 (b) a = 0.35 and σ = 0.15

Figure 25.7. Initial FR, SR, Monte Carlo and cubature mean

We do not know which model the ECB, or the financial institutions that provide AAA-rated bonds, use to estimate spot rates. However, as depicted in Figure 25.7, the Hull–White model via the cubature method seems to fit the ECB's spot rates fairly well. The cubature method just described for estimating the forward rate is much faster than the classical Monte Carlo, but as we will see in section 25.4.2, it cannot be used to
price financial derivatives. For pricing financial derivatives, we need to consider more
paths (trajectories).

25.4.2. The Hull–White interest-rate tree via iterated cubature formulae: some examples

Hull and White (1994, 1996) explicitly explained how they constructed their
recombining trinomial tree. To put it briefly, they assumed that at each node the
short-rate can go either:
– up one, straight along and down one;
– or up two, up one and straight along;
– or straight along, down one and down two.

They also calculated the corresponding probabilities of reaching each node (see also Hull (2017)). We will deal, however, with a non-recombining tree created by the iterated cubature formula of degree 5 on Wiener space. In other words, in order to price financial derivatives via the cubature method, unlike the approach presented in Figure 25.7, we need to access more possible random values at each time. Therefore, we iterate the cubature formula, which results in a non-recombining trinomial tree.
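A sketch of what this iteration amounts to computationally (our own illustration, reusing the hypothetical forward-curve placeholders from the earlier sketches): after n iterations of equation [25.26] there are 3^n weighted end nodes, the weight of each branch being the product of the per-step weights λ_k.

import numpy as np
from itertools import product

# Iterated cubature: enumerate all 3**n branch combinations of [25.26].
a, sigma = 0.075, 0.15
lam   = np.array([1/6, 2/3, 1/6])
omega = np.array([-np.sqrt(3.0), 0.0, np.sqrt(3.0)])

def f0(t):    return -0.64 + 0.03 * t   # hypothetical forward curve, in %
def df0(t):   return 0.03
def kappa(t): return sigma**2 / (2*a) * (1 - np.exp(-2*a*t))

def iterated_cubature(r0, n=5):
    paths = []
    for choice in product(range(3), repeat=n):       # 3**n combinations
        r, weight = r0, 1.0
        for i, k in enumerate(choice):
            r += df0(i) + kappa(i) + a * (f0(i) - r) + sigma * omega[k]
            weight *= lam[k]
        paths.append((weight, r))
    return paths

paths = iterated_cubature(r0=-0.61)
print("number of end nodes:", len(paths))                   # 3**5 = 243
print("weighted mean r(t_n):", round(sum(w*r for w, r in paths), 5))
print("total weight:", round(sum(w for w, _ in paths), 5))  # should be 1.0

The exponential growth of the loop above is exactly the computational bottleneck discussed in the remainder of this section.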

(a) Including intermediate values (b) Excluding intermediate values

Figure 25.8. Paths via the iterated cubature formula and their weighted average values

Figure 25.8 shows the results of a nine-step iterated cubature formula in equation [25.26], with and without intermediate steps, for all initial forward rates given in Table 25.3. Note that the paths are not recombining and, therefore, at the last step of the tree we have 3^9 = 19,683 nodes. As we can see in the figure, the thick blue line is
closer to the ECB FRs than to the ECB SRs. The reason is that the Hull–White model is a mean-reverting model and, for an accurate evolution of the process r̂(t), we need more time steps. The mean-reverting characteristic of the Hull–White model suggests that we cannot choose a large time interval length; the intervals should be small enough. Figure 25.9 shows what happens if we do not choose a proper length in each step. Compare this figure with Figure 25.7.

(a) n = 8, a = 0.75 and σ = 0.15 (b) n = 50, a = 0.75 and σ = 0.15

Figure 25.9. Effect of time discretization in MC and cubature methods

We would like to emphasize that, for plotting the intermediate movements of the paths created by the n = 9-step iterated cubature formula, we created a matrix of size 3^n × (3n + 1). This took around 8–10 minutes. Ignoring the plots, we could perform an iterated cubature formula for n ≤ 10 within five minutes or so. For n > 10, this approach would hardly work on personal computers.

25.5. Discussion and future works

In this chapter, we discussed the idea of cubature formulae on Wiener space. We briefly explained the theory behind the idea of the cubature method and compared and
contrasted it with the classical Monte Carlo simulation. Then, we reviewed how the
cubature formulae of degree 5 can be used in security market models, namely in the
Samuelson (Black–Scholes) price process to estimate the price of European call and
put options. Also, we reviewed the idea of constructing a recombining trinomial tree
in the Black–Scholes model to price path-dependent derivatives. After that, we turned
our attention to the interest-rate (term-structure) models and picked the Hull–White
one-factor model to study the application of the cubature formula in the fixed-income
market models. We used the triple A rated data for instantaneous forward rates and
spot rates available from the European Central Bank. Using the ECB forward rate
data, we simulated the spot-rate process via the cubature method and classical Monte
Carlo simulation. The cubature method performed better in fitting the ECB spot-rate
data. After that, we tried to iterate the cubature formula to get enough random price
trajectories (paths) to calculate desirable payoff functions of financial derivatives.
Iterating cubature trajectories in the Hull–White model resulted in a non-recombining
tree with exponential growth. As a result, we saw that working with a non-recombining
tree would cause inaccuracy in results and would not be effective for a small number
of steps.

We would like to emphasize that, in our experience, cubature formulae work better for small time intervals. To tackle the exponential growth in the number of nodes in a non-recombining tree, we mention some ideas. One idea is to consider cubature formulae of degree 7 or higher, where the number of trajectories on Wiener space is larger and the formula becomes more accurate. If the degree is big enough, then we might not even need to iterate the formula to get satisfactory results. This, however, might be extremely complicated. For example, we found the cubature formula of degree 7, which creates six paths, in Nohrouzian and Malyarenko (2019). To get the weights and coefficients of the cubature formula, we found the solutions to the system of Lie polynomial equations. In the case of degree 5, this was done analytically. In the case of degree 7, the algebraic expressions for the Lie polynomials occupied 48 A4 pages and we used a numerical approach to find solutions to the system. In future work, we would like to examine the cubature formula of degree 7 in some interest-rate models. We will also try to find cubature formulae of higher degrees.
Another suggestion to overcome the exponential growth in the number of nodes is to either use the recombination method proposed by Litterer and Lyons (2012) or the tree-based branching algorithm (TBBA) proposed by Crisan and Lyons (2002). Finally, we would like to try the following idea as well. We may shift the initial data by a certain amount to have all data as positive numbers. Then, using the ideas presented in section 25.2.7, we may use the relation $f_m = \sqrt{f_u f_d}$ in order to make a recombining tree. That is, when the number of steps in the tree is even (odd), we calculate f_m and f_d and set $f_u = f_m^2/f_d$, and when the number of steps in the tree is odd (even), we calculate f_m and f_u and set $f_d = f_m^2/f_u$.

25.6. References

Bayer, C. and Teichmann, J. (2008). Cubature on Wiener space in infinite dimension. Proceedings of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, 464(2097), 2493–2516.
Black, F. (1976). The pricing of commodity contracts. Journal of Financial Economics, 1(1),
167–169.
Black, F. and Scholes, M. (1973). The pricing of options and corporate liabilities. Journal of
Political Economy, 81(3), 637–654.
Brace, A., Gatarek, D., Musiela, M. (1997). The market model of interest rate dynamics.
Journal of Mathematical Finance, 7(2), 127–155.
Cox, J., Ross, S., Rubinstein, M. (1979). Option pricing: A simplified approach. Journal of
Financial Economics, 7(3), 229–263.
Cox, J., Ingersoll, J., Ross, S. (1985). A theory of the term structure of interest rates.
Econometrica, 53(2), 385–407.
Crisan, D. and Lyons, T. (2002). Minimal entropy approximations and optimal algorithms.
Monte Carlo Methods and Applications, 8(4), 343–355.
Glasserman, P. (2004). Monte Carlo Methods in Financial Engineering. Springer, New York.
Heath, D., Jarrow, R., Morton, A. (1990). Bond pricing and the term structure of interest
rates: A discrete time approximation. Journal of Financial and Quantitative Analysis, 25(4),
419–440.
Heath, D., Jarrow, R., Morton, A. (1992). Bond pricing and the term structure of interest rates:
A new methodology for contingent claims valuation. Econometrica, 60(1), 77–105.
Ho, T. and Lee, S. (1986). Term structure movements and pricing interest rate contingent claims.
Journal of Finance, 41(5), 1011–1029.
Hull, J. (2017). Options, Futures, and Other Derivatives, 10th edition. Pearson, London.
Hull, J. and White, A. (1993). Efficient procedures for valuing European and American
path-dependent options. Journal of Derivatives, 1(1), 21–31.
Hull, J. and White, A. (1994). Numerical procedures for implementing term structure models I:
Single-factor models. Journal of Derivatives, 2(1), 7–16.
Hull, J. and White, A. (1996). Pricing interest rate trees. Journal of Derivatives, 3(3), 26–36.
Hull, J. and White, A. (2015). Pricing interest-rate-derivative securities. Review of Financial
Studies, 3(4), 573–592.
Jamshidian, F. (1997). LIBOR and swap market models and measures. Finance and Stochastics,
1(4), 293–330.
Kijima, M. (2013). Stochastic Processes with Applications to Finance, 2nd edition. CRC Press,
Boca Raton, FL.
Litterer, C. and Lyons, T. (2012). High order recombination and an application to cubature on
Wiener space. The Annals of Applied Probability, 22(4), 1301–1327.
Lyons, T. and Victoir, N. (2002). Cubature on Wiener space. Proceedings of the Royal Society of
London. Series A. Mathematical, Physical and Engineering Sciences, 460(2041), 169–198.
Malyarenko, A., Nohrouzian, H., Silvestrov, S. (2017). An algebraic method for pricing financial instruments on post-crisis market. In Algebraic Structures and Applications, Silvestrov, S., Malyarenko, A., Rančić, M. (eds). Springer, Cham.
Merton, R. (1973). Theory of rational option pricing. The Bell Journal of Economics and
Management Science, 4(1), 141–183.
Miltersen, K., Sandmann, K., Sondermann, D. (1997). Closed form solutions for term structure derivatives with log-normal interest rates. Journal of Finance, 52(1), 409–430.
Nohrouzian, H. and Malyarenko, A. (2019). Testing cubature formulae on Wiener space vs explicit pricing formulae. SPAS 2019. 2nd Edition of the International Conference on Stochastic Processes and Algebraic Structures, Västerås.
Nohrouzian, H., Ni, Y., Malyarenko, A. (2021). An arbitrage-free large market model
for forward spread curves. In Applied Modeling Techniques and Data Analysis 2,
Dimotikalis, Y., Karagrigoriou, A., Parpoula, C., Skiadas, C.H. (eds). ISTE Ltd, London,
and John Wiley & Sons, New York.
Øksendal, B. (2013). Stochastic Differential Equations: An Introduction with Applications.
Springer, Berlin.
Pascucci, A. (2011). PDE and Martingale Methods in Option Pricing. Springer, Milan.
Rendleman, R. and Bartter, B. (1980). The pricing of options on debt securities. Journal of
Financial and Quantitative Analysis, 15(1), 11–24.
Samuelson, P. (1965). Rational theory of warrant pricing. Industrial Management Review, 6(2),
13–31.
Tchakaloff, V. (1957). Formules de cubatures mécaniques à coefficients non négatifs. Bulletin
des sciences mathématiques, 81(2), 123–134.
Vasicek, O. (1977). An equilibrium characterization of the term structure. Journal of Financial
Economics, 5(2), 177–188.
26

Differences in the Structure of Infectious Morbidity of the Population during the First and Second Half of 2020 in St. Petersburg

The emergence of Covid-19 has posed challenges for healthcare professionals to quickly diagnose and provide timely medical care to patients. The basis for making
managerial decisions on the task set is statistical accounting. In this work, a
comparative statistical analysis of the incidence of ARVI, new coronavirus infection
Covid-19 and community-acquired pneumonia was carried out in one of the
administrative districts of St. Petersburg. To collect operational information, a form
for recording the incidence of the adult and child population with acute respiratory
viral infectious diseases, new coronavirus infection Covid-19 and community-
acquired pneumonia was developed and implemented. It was found that the total
number of people with ARVI, new coronavirus infection Covid-19 and community-
acquired pneumonia observed in pediatric and adult clinics had two “waves”. In the
structure of the incidence of Covid-19 in the first “wave”, adult patients prevailed
(93.3%, children – 6.7%). During the second “wave” of the rise in the incidence of
Covid-19, the proportion of children doubled to 12.9%. The structure of the
incidence of ARVI when comparing the two periods was different. In the first
“wave”, the number of adult patients prevailed (70%, children – 30%). During the
second “wave”, ARVI was mainly registered in children (71.7%). Community-
acquired pneumonia for the entire period was registered in 2,725 cases, in four
children (0.14%), and in 2,721 adult patients (99.8%). The increased infectious
morbidity required the involvement of additional medical personnel, transport, as
well as the introduction of new organizational technologies for providing medical

Chapter written by Vasilii OREL, Olga NOSYREVA, Tatiana BULDAKOVA, Natalya GUREVA,
Viktoria SMIRNOVA, Andrey KIM and Lubov SHARAFUTDINOVA.

care to the population (mobile medical teams, dispensing drugs and pulse oximeters
for providing medical care at home).

The data of regular statistical observation became the basis for making
operational management decisions for the organization of medical care for the
population in the context of an epidemic rise in morbidity.

26.1. Introduction

In 2020, the Russian Federation, like other countries in the world, accepted the
challenge of the spread of the new coronavirus infection, Covid-19. To the greatest
extent, the rise in incidence has affected large metropolitan areas. St. Petersburg is a
city with a population of over 5 million, with a population density of 3,800 people
per km2, a developed transport network, a high level of population migration within the
city, from small towns in neighboring regions, as well as from foreign countries. These
factors have contributed to the high rate of the spread of Covid-19 infection.
St. Petersburg includes 18 administrative districts. This statistical study of the structure
of the incidence of respiratory infections in the first and second half of 2020 was carried
out using the example of one of the administrative districts. The emergence of Covid-19
has posed challenges for healthcare professionals to quickly diagnose and provide
timely medical care to patients (Orel et al. 2020). The basis for making managerial
decisions on the task set is statistical accounting (Orel et al. 2018).

26.2. Materials and methods

26.2.1. Characteristics of the territory of the district

The area covers 240.3 sq. km or 24,032.6 hectares (16.7% of the area of St.
Petersburg) and is the second largest area among the districts of St. Petersburg. The
average length of the region is: from the south to the north – 21 km, from the east to
the west – 21 km. Geographically, the district is located in the southern part of St.
Petersburg and includes five municipal regions located at a distance from each other.
Indicators of the ecological state of land, water and air are within acceptable limits.

26.2.2. Demographic characteristics of the area

The region has a population of 240,809 people (184,490 – adults over 18 years
old; 56,319 – children from 0 to 18 years old). For five years, due to migration
processes, the total population increased by 27.0%, and the child population by
49.5%. High growth rates of the population in general, and of children in particular,
are associated with intensive housing construction. Nevertheless, the age structure of
the district’s population remains regressive. The share of older age groups (50 years
and older) is 31%, from 15 to 49 years old – 50%, children – 19%. The share of the
female population in the structure of the total population is 54%, and the share of
women of fertile age is 48%. The birth rate is 13.2 per 1,000 people, the mortality
rate is 11.0 per 1,000 people and the natural increase is 1.3. Taking into account the
available data on the dynamics of population growth in the district, by 2024, the
number is expected to be 282,245 people (approximately 68,700 children), and by
2030, up to 350,000 (approximately 85,000 children).

26.2.3. Characteristics of the district medical service

In the district, primary health care for adults and children is provided by two
polyclinics, which include 10 polyclinic departments (five for adults and five for
children). The service area is divided into 88 therapeutic and 65 pediatric areas.
Local general practitioners and pediatricians carry out the primary registration of all
cases of the population’s disease, including infectious diseases.

In each polyclinic, statistics departments have been created, with 12 specialists (four in a children's clinic and eight in an adult clinic) performing the whole range of
work on statistical accounting and preparation of state reporting forms.

Polyclinics are equipped with computers for processing statistical information
and maintaining federal registers. Medical assistance to the population was
organized in accordance with the methodological recommendations of the Russian
Ministry of Health (Ministry of Health of the Russian Federation 2020c).

26.2.4. The procedure for collecting primary information on cases of diseases of the population with a new coronavirus infection

The Ministry of Health of the Russian Federation approved a special registration
and reporting form – “Emergency notification of an infectious disease, food, acute
occupational poisoning, an unusual reaction to vaccination”. This notice is intended
to notify the supervisory authorities of these conditions. The purpose of this form is
to prevent the spread of infectious diseases and the emergence of epidemics of
dangerous diseases (Ministry of Health of the USSR 1980; Federal State Statistics
Service 2017).

The procedure for transmitting information about infectious diseases is also
regulated by the Ministry of Health of the Russian Federation. It involves a path
from a medical organization to a territorial department of the Office of the Federal
Service for Supervision of Consumer Rights Protection and Human Well-being: by
phone – within two hours, in writing (emergency notification) – within 12 hours
after the establishment of the preliminary diagnosis (Ministry of Health of the
Russian Federation 2013).

The peculiarity of the course of the Covid-19 pandemic sets the task of obtaining
reliable statistical data on the situation, with morbidity and mortality as a priority.
Monitoring based on an in-depth study of the course of Covid-19 provides the most
relevant, objective and detailed statistics on this disease, in order to more widely
assess the impact of infection on the population and the course of the disease
(Ministry of Health of the Russian Federation 2020a). When registering these
diseases, the recommendations of the Ministry of Health of the Russian Federation
were used (Ministry of Health of the Russian Federation 2020b).

At the district level, in order to collect daily operational information, a form was
developed and implemented for recording the incidence of the adult and child
population by the individual most important respiratory infections: acute respiratory
viral infectious diseases, new coronavirus infection Covid-19 and community-
acquired pneumonia.

The registration of morbidity and the number of those observed in a medical
organization began on April 20, 2020 and continues today. The analysis of the
situation with the incidence of respiratory infections was carried out from April 20,
2020 to December 31, 2020.

The information received by district doctors of district polyclinics is registered in
accordance with the temporary rules for recording information about a new
coronavirus infection (Covid-19) in a special all-Russian information resource – the
Federal Register (Government of the Russian Federation 2020).

In the work, the total indicators are presented in numerical form with an
accuracy of one person.

26.3. Results of the analysis of the incidence of acute respiratory viral infectious diseases, new coronavirus infection Covid-19 and community-acquired pneumonia

Since March 2020, there has been an increase in acute respiratory viral infections
with a special course, which subsequently made it possible to differentiate Covid-19.
An important criterion for the diagnosis of a new coronavirus infection is the results
of laboratory diagnostics, namely, the detection of SARS-CoV-2 RNA using nucleic
acid amplification methods. According to the recommendations of the Ministry of
Health of the Russian Federation, laboratory tests for SARS-CoV-2 RNA are
recommended for all persons with signs of acute respiratory infections.

As a preliminary screening examination, it is recommended to use the
SARS-CoV-2 antigen test in nasal/oropharyngeal swabs by immunochromatography.
The main type of biomaterial for laboratory research on SARS-CoV-2 RNA is
the material obtained by taking a swab from the nasopharynx (from two nasal
passages) and oropharynx. Swabs from the mucous membrane of the nasopharynx
and oropharynx are collected in one tube for a higher concentration of the virus.

Patients who test positive are assigned the disease code – U07.1, and patients
with a negative test result, but showing signs of the disease – U07.2.

It was found that the total number of people with ARVI, new coronavirus
infection Covid-19 and community-acquired pneumonia observed in pediatric and
adult polyclinics has two “waves”. The minimum number of these diseases was
registered in the 13th week (from July 20, 2020 to July 27, 2020) for Covid-19 and
ARVI, and the 17th week (from August 17, 2020 to August 23, 2020) for
community-acquired pneumonia.


It should be noted that the number of ARVI patients without signs of Covid-19
and laboratory confirmation of the SARS-CoV-2 antigen in the “second wave”
(36,879 people) was almost five times higher than in the “first wave” (7,461 people),
in total 44,340 people. In the first “wave”, the number of adult patients prevailed –
70%, and the share of children was 30%. During the “second wave”, ARVI was
predominantly registered in children – 71.7%. The total incidence of acute
respiratory viral infections in the total adult and child population in the “first wave”
was 29.4 per 1,000 people, in the “second” – 154.7 per 1,000 people. Figure 26.1
shows the dynamics of the number of ARVI patients under observation in the
district polyclinics during the period of the increase in the incidence (36 weeks of
2020, from April 20, 2020 to December 31, 2020).
Figure 26.1. Dynamics of the number of patients with ARVI who were under
observation in the district polyclinics during the period of the increase in the incidence
(36 weeks of 2020, from April 20, 2020 to December 31, 2020). For a color version of
this figure, see www.iste.co.uk/zafeiris/data1.zip

The situation with the incidence of Covid-19 and the involvement of the adult
and child population looks similar to ARVI. In the structure of the incidence of
Covid-19 in the first “wave”, adult patients dominated (93.3%, children – 6.7%).
During the second “wave” of the rise in the incidence of Covid-19, the proportion of
children doubled and amounted to 12.9%. The total incidence of Covid-19 in the
first “wave” was recorded at 7.3 per 1,000 people, in the second – 31.4 per
1,000 people. Figure 26.2 shows the number of patients with Covid-19 under
observation in polyclinics of the district during the period of the increase in the
incidence (36 weeks of 2020, from April 20, 2020 to December 31, 2020).

Figure 26.2. Dynamics of the number of patients with Covid-19 who were monitored
in the district polyclinics during the period of the increase in the incidence (36 weeks
of 2020, from April 20, 2020 to December 31, 2020). For a color version of this figure,
see www.iste.co.uk/zafeiris/data1.zip
The main contribution to mortality from the new coronavirus infection was made
by the incidence of community-acquired pneumonia. It is this complication in the
course of Covid-19 that required the greatest efforts from the healthcare system:
improving the material and technical equipment of inpatient medical institutions,
emergency medical services and optimizing the work of clinics.

Over the entire analyzed period, 2,725 cases of community-acquired pneumonia
of viral etiology were registered in the district. Of the cases in 2020, only 4 children
were registered (0.14%) and 2,721 people were adult patients (99.8%). The total
incidence of community-acquired pneumonia in the total adult and child population
in the first “wave” was 6.8 per 1,000 people, in the “second” – 4.5 per 1,000 people.

The graph indicates a decrease in the number of primary cases of community-
acquired pneumonia in the last week of 2020 (see Figure 26.3). However, the
number of patients simultaneously observed in polyclinics during the “second wave”
gradually increased (see Figure 26.4).

Figure 26.3. Dynamics of the number of newly registered cases of community-
acquired pneumonia in adult patients in polyclinics of the district during the period of
the increase in the incidence (36 weeks of 2020, from April 20, 2020 to December
31, 2020)

This is due to the severity of the course of community-acquired pneumonia and
the duration of the course of the disease (three to eight weeks). Most of the patients
were hospitalized in city hospitals (about 70%), and the rest of the patients received
treatment at home.
An analysis of the total incidence of the population of the district for the
specified period with acute respiratory infections (Covid-19, ARVI, community-
acquired pneumonia) was also carried out. A total of 56,416 cases were registered,
or 234.3 per 1,000 people (see Figure 26.5).

Figure 26.4. Dynamics of the number of adults with community-acquired pneumonia
who were monitored in the district polyclinics during the period of the increase in the
incidence (36 weeks of 2020, from April 20, 2020 to December 31, 2020)

Figure 26.5. Dynamics of the number of patients with Covid-19, ARVI and
community-acquired pneumonia who were monitored in the district polyclinics during
the period of the increase in the incidence (36 weeks of 2020, from April 20, 2020 to
December 31, 2020). For a color version of this figure, see www.iste.co.uk/zafeiris/
data1.zip

The emergence of Covid-19 has posed challenges for healthcare professionals to
quickly diagnose and provide timely medical care to patients. The data of regular
statistical observation became the basis for making operational management
decisions for the organization of medical care for the population, in the context of an
epidemic rise in morbidity.

To reduce the incidence, the efforts of various departments have been combined.
Mass cultural, educational and sports events have been canceled. Information work
on the rules for the prevention of infectious diseases is being carried out in the
media, places of mass stay. In educational institutions, the control of medical and
pedagogical workers has been strengthened to prevent contact with sick children and
adults. Schedules of additional cleaning with the use of disinfectants in the premises
of educational institutions have been drawn up.

On the territory of microdistricts, disinfection is carried out in the entrances of
residential buildings and adjacent areas, playgrounds for children. Social protection
institutions provide assistance to citizens in the delivery of medicines and food.

The district medical service has worked out interaction with the Territorial
Department of Rospotrebnadzor for the control of people arriving from countries
with an unfavorable situation for Covid-19. In the district polyclinics, teams of
epidemiologists, district doctors and paramedical personnel have been created to
monitor and contact people with Covid-19.

Biomaterial sampling points have been organized; the results of PCR tests are
available in personal accounts on the “St. Petersburg’s Health” portal. The work of
call centers for adults and children has been strengthened, as well as interaction with
the city service to receive house calls from doctors for patients. The polyclinics have
a two-month supply of personal protective equipment for medical personnel. An
additional mode of road transport was purchased for the delivery of district general
practitioners and pediatricians to house calls for patients, specialist doctors
(cardiologist, ENT, neurologist, surgeon, etc.) for consultations, sampling of
biomaterial at home (smears, blood), etc. Free distribution of drugs is carried out to
provide medical care to patients with Covid-19 at home. Since December 2020, four
vaccination points have been opened in the region and mass immunization of the
population with the Gam Covid Vac vaccine (Sputnik V) has begun. The measures
taken, based on the results of statistical records and analysis of the incidence of
the new coronavirus infection Covid-19, have significantly improved the
epidemiological situation in the area and preserved the health of the population.

26.4. Conclusion

1) Statistical accounting for analyzing the dynamics of the spread of the new
coronavirus infection Covid-19 is of paramount importance and is an integral part of
the organization of health care during a pandemic.
2) To obtain reliable information on the spread of Covid-19, a nationwide
information resource is used – the Federal Register. At the district level, daily
operational records have been developed and are being successfully applied.
3) Statistical analysis of reporting on the incidence of acute respiratory infections
(Covid-19, ARVI and community-acquired pneumonia) made it possible to see
differences in the development of the pandemic in 2020 – two “waves” of growth in
the incidence that affected the adult and child population to varying degrees. In the
“first wave”, a new coronavirus infection was registered, mainly among adult
patients. During the “second wave”, the prevalence of the new coronavirus infection
increased significantly among children.
4) Severe forms of Covid-19, complicated by community-acquired pneumonia,
were registered in 99.8% of cases among the adult population.
5) Thanks to the coordinated work and interdepartmental interaction of various departments, and to significant state budget investments in the provision of medical and other services, drug supply and vaccination of the population, it was possible to contain the spread in the district, in St. Petersburg and in the Russian Federation as a whole.

26.5. References

Federal State Statistics Service (2017). Order of the Federal State Statistics Service of January
28, 2009, No. 12 (revised on January 20, 2017). On the approval of statistical tools for the
organization of the Ministry of Health and Social Development of Russia federal
statistical observation in the field of health care.
Government of the Russian Federation (2020). Decree of the Government of the Russian
Federation of March 31, 2020, No. 373. On approval of temporary rules for recording
information in order to prevent the spread of a new coronavirus infection (COVID-19).
Ministry of Health of the Russian Federation (2013). Order of the Ministry of Health of the
Russian Federation and the Federal Service for Supervision in the Field of Consumer
Rights Protection and Human Well-being of October 10, 2013, No. 726n/740. On
optimizing the system of informing about cases of infectious and parasitic diseases.
Moscow.
Ministry of Health of the Russian Federation (2020a). Guidelines for coding and selection of
the underlying condition in morbidity statistics and the initial cause in mortality statistics
associated with COVID-19 [in Russian]. Moscow.
Ministry of Health of the Russian Federation (2020b). Letter of the Ministry of Health of Russia
dated August 4, 2020, No. 13-2 / I / 2-4335. On the coding of coronavirus infection caused
by COVID-19. Moscow.
Ministry of Health of the Russian Federation (2020c). Interim guidelines for the prevention,
diagnosis and treatment of the new coronavirus infection (COVID-19), Version 10
[in Russian]. Moscow.
Ministry of Health of the USSR (1980). Order of the Ministry of Health of the USSR of
October 4, 1980, No. 1030. On approval of forms of primary medical documentation of
health care institutions.
Orel, V.I., Bezhenar, S.I., Buldakova, T.I., Kim, A.V., Roslova, Z.A., Rubezhov, A.L.,
Orel, O.V., Gurieva, N.A., Nosyreva, O.M., Sharafutdinova, L.L. (2018). Scientific and
practical vector of problems of primary medical and social care in a metropolis. Medicine
and Healthcare Organization, 3(2), 63–67.
Orel, V.I., Gurieva, N.A., Nosyreva, O.M., Smirnovа, V.I., Buldakova, T.I., Libova, E.B.,
Razgulyaeva, D.N., Sharafutdinova, L.L., Kulev. A.G. (2020). Modern medico-
organizational features of coronavirus infection. Pediatrician, 11(6), 5–12.
27

High Speed and Secured Network Connectivity for Higher Education Institutions Using Software Defined Networks

During Covid-19, there was a demand for high data throughput and connectivity
from higher education institutions. This required that backbone and metro links as well
as fiber links be upgraded with more capacity, catering to these high data demands and
connectivity. In addition, there was a need for a reliable and secured network to protect
against attacks from hackers. Cyberinfrastructures needed to be put in place to secure
the network and prevent information from being accessed by unknown/unauthorized
users and outsiders. It was also observed that some network devices lost configurations
with power failures around the areas. Due to this problem, several devices even lost
routing and signaling information. This required a technician to reconfigure the whole
device manually. Moreover, throughout this process, there were several challenges in
the deployment of transmission network equipment.

In this research work, a model was proposed to solve the aforementioned issues,
which was deployed on the software defined network (SDN). This network has three
layers, namely application, control and data. Typically, the open systems
interconnection (OSI) model has seven layers, but some layers have been combined
and reduced to three in this proposed model. This SDN is user-friendly, as it is
programmable to execute some of the tasks. It also saves bandwidth, as it reuses
network resources. If power (electricity) fails and the networking device reboots,
then it will automatically look for configuration information on the SDN server and
compare it with the one configured on the device. If the comparison is the same,
then the device will work as usual. If the network device loses configurations, then

Chapter written by Lincoln S. PETER and Viranjay M. SRIVASTAVA.


the SDN server will automatically reload the configurations of the device. The SDN
can also calculate the bandwidth utilization for all the links or routes connected to it.
This assists in finding where network resources are used the most. If a particular
route is congested, then the SDN will look for another alternative route where
utilization is low, in order to load balance the traffic.

27.1. Introduction

The communication network is vast and uses different types of connectivity. The
connectivity to any network can be made by means of fiber, wired or wireless
(Khalighi and Uysal 2014; Zhang and Pedersen 2016; Odeyemi et al. 2017). To
deliver this type of network connectivity, we need to first understand the consumer
requirements. In this present research work, consumers are the institutions of higher
learning. These institutions have challenges of delivering high data throughput to
students, lecturers and academic staff. The issue of high data challenges is more
critical in remote areas (Ercan et al. 2010; El-Seoud et al. 2017; Javidi 2017;
Srivastava 2020). The Covid-19 pandemic has in fact exposed some of these
institutions when it comes to high-speed connectivity. It did not end there: the pandemic also revealed how fragile
the metro and backbone rings were when most people were working from home
(Dargar and Srivastava 2019; Teras et al. 2020; Izumi et al. 2021; Maatuk et al. 2021).

As there is a demand for high data throughput and connectivity from higher
learning institutions, it is required that backbone and metro links be upgraded to
more capacity. This will cater to these high data demands and connectivity (Kim and
Feamster 2013; Megyesi et al. 2017). High capacity alone is not enough; there is a
need to have a reliable and secured network to protect against attacks from hackers.
Cyberinfrastructures need to be put in place to secure the network and prevent
information from being accessed by unknown users and outsiders (van Adrichem
et al. 2014; Shu et al. 2016; Wang et al. 2018).

It has also been noted that some network devices lose their configurations during
power failures in the surrounding areas. Due to this problem, some devices also lose routing and
signaling information (Guo et al. 2013; Jung and Song 2015). This requires a
technician to come out and reconfigure the whole device.

To overcome these issues, in this research work, we propose a model for deploying
a software defined network (SDN), which helps to increase productivity and reduces
the cost (Nunes et al. 2014; Alsmadi and Xu 2015; Kreutz et al. 2015; Singh
and Srivastava 2018; Ali et al. 2020). It also saves the bandwidth, as it will reuse
network resources. The proposed model was deployed on the SDN. This chapter is
organized as follows. The existing work and the brief history of the open systems
interconnection (OSI) model and its layers are discussed in section 27.2. Section 27.3
presents a new SDN architecture and its benefits through three layers, namely

application, control and data. Finally, section 27.4 concludes the work and
recommends future aspects.

27.2. Existing model review

The OSI model is a framework that defines tasks required for any computer or
network system to communicate with one another (Bakshi 2013; Farhady et al.
2015; Marconett and Yoo 2015; Cox et al. 2017). The history of the OSI model
began in the 1970s with the International Organization for Standardization (ISO) and
the International Telegraph and Telephone Consultative Committee (CCITT). The
CCITT was later succeeded by the International Telecommunication Union (ITU).
The OSI model is also called the basic reference model with specific protocols. This
model is divided into seven layers. Each layer has protocols that affect its
functionality; each layer interacts directly only with the layer underneath it and
provides facilities for the layer above it.

The purpose of the OSI model was to assist telecommunication equipment
vendors and software developers to produce an interoperable network system. The
OSI model succeeded as a tool for describing and defining how the network systems
should communicate. This layered approach was developed to address the
following:
a) to break the complex communication network into smaller, more manageable
and understandable parts;
b) to provide a standard interface between the network function and modules;
c) to provide a standard language that network engineers can use.

The seven OSI model layers are shown in Figure 27.1.

Figure 27.1. The OSI model (Farhady et al. 2015; Cox et al. 2017)

27.3. Selection of a suitable model

Telecommunication uses seven layers for communication: physical, data link,
network, transport, session, presentation and application. The SDN combines all
seven layers into three layers (Farhady et al. 2015; Marconett and Yoo 2015). The
SDN has application, control and data. The data layer on the SDN combines the first
four layers of the OSI model. This layer performs data forwarding for upper layers
such as control and application. The control plane or layer collects all the
information forwarded to it by the data layer. This information is collected and
analyzed on this layer. The SDN is a network where control and data planes are
decoupled and programmable through a software running on top of the network
operating system (NOS) on the controller.

Figure 27.2. The SDN architecture



The newly selected model has three layers: application, control and
infrastructure, as shown in Figure 27.2. The most important feature of the SDN
architecture is that the data plane is separated from the control plane. This separation
allows for the centralization of management. The lower layer of the SDN, the data
plane, enables the controller plane to have full network management. The data plane
forwards packets to the control plane where decisions are made. This helps to reduce
the cost of network devices and processing at the data plane. Now let us examine
each layer and the roles they play in this new model.

All the issues of the physical layer, such as links that are down and hardware
devices that are not reachable on the network, are reported and processed on the
control layer. Once the information is processed on this layer, it will be presented to
the application layer. This is where the information will be made readable for
network engineers and administrators. This also allows the engineers to make any
changes on the network depending on business requirements.

In the new model, seven layers are reduced to three layers. In other cases, it is
difficult to reprogram some of the devices when they lose configurations due to
power failures. With this new model, it is possible to implement all the routing rules
in centralized software. This helps network engineers and administrators to have
more control over the network, as well as to provide high network performance. In
old traditional networks such as the OSI model, the routers become overwhelmed by
data processing and updating routing tables when the network grows rapidly. This
can cause network delays as routers need to exchange information periodically in
order to keep up with the network status. With the SDN model, all of this burden
will be carried by the control plane to ease congestion in routers on the data plane.
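
As a minimal sketch of this idea (the device and controller interfaces shown here are hypothetical, not a vendor API), the logic of comparing a rebooted device's configuration with the copy stored centrally and reloading it only when they differ could be expressed as follows:

import hashlib

def fingerprint(config_text: str) -> str:
    # Hash the configuration so that large configurations can be compared cheaply.
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()

def config_after_reboot(running_config: str, stored_config: str) -> str:
    # If the device still holds the same configuration as the one stored on the
    # SDN server, it keeps working as usual; otherwise the stored copy is pushed back.
    if fingerprint(running_config) == fingerprint(stored_config):
        return running_config
    return stored_config

# Example: a switch that lost its routing stanza after a power failure.
stored = "hostname agg-sw1\nvlan 100\nroute 10.10.0.0/16 via 10.10.1.1"
running = "hostname agg-sw1"
print(config_after_reboot(running, stored))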

The SDN uses OpenFlow as its standard protocol. This protocol is multivendor
and managed by the Open Networking Foundation (ONF). It has evolved over the
years to become the most widely used protocol in SDN applications. It can be
integrated into both software and hardware without any issues on the SDN. It is
advanced in such a way that it uses open-source code to control SDN controllers. It
can interact with any switch and router vendor. This protocol exchanges information
between the control plane and the OpenFlow switches residing on the data plane (Xia et al. 2015).
The OpenFlow protocol offers convenient flow table manipulation services for a
controller to insert, delete, modify and find the flow entries. Its main features are
flow tables, which it uses to control the traffic between data plane devices. Each
flow table contains flow entries that are communicated to the controller. The
controller only handles the routing of packets and decision-making.
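
The following toy Python sketch (a simplified in-memory structure, not the OpenFlow wire protocol or any controller library) illustrates the insert, delete, modify and find operations on flow entries mentioned above:

from dataclasses import dataclass, field

@dataclass
class FlowEntry:
    match: dict                  # e.g. {"dst": "10.0.2.0/24"}
    actions: list                # e.g. ["output:2"]
    priority: int = 0

@dataclass
class FlowTable:
    entries: list = field(default_factory=list)

    def insert(self, entry: FlowEntry) -> None:
        self.entries.append(entry)

    def find(self, match: dict):
        # Return the highest-priority entry with an exact match, if any.
        hits = [e for e in self.entries if e.match == match]
        return max(hits, key=lambda e: e.priority, default=None)

    def modify(self, match: dict, actions: list) -> None:
        entry = self.find(match)
        if entry is not None:
            entry.actions = actions

    def delete(self, match: dict) -> None:
        self.entries = [e for e in self.entries if e.match != match]

table = FlowTable()
table.insert(FlowEntry({"dst": "10.0.2.0/24"}, ["output:2"], priority=10))
table.modify({"dst": "10.0.2.0/24"}, ["output:3"])
print(table.find({"dst": "10.0.2.0/24"}))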

This new model makes efficient use of resources. It has the capability to distribute
the workload across the controllers. This, in turn, increases the speed and efficiency
of the network. It allows network engineers to make changes on the network

remotely. The SDN is programmable, which allows the control plane to reduce
congestion in the entire network. This new model enables the scalability of the
network with improved security features and resilience to faults.

The SDN can also calculate the bandwidth utilization for all the links or routes
connected to it. This assists in finding where network resources are used the most. If
a particular route is congested, the SDN will look for another alternative
route where utilization is low, in order to load balance the traffic. The SDN concept
is not new – it has existed for some time, but we are extending its application to institutions of
higher learning, where the demand for data is greatest.
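
A simple Python sketch of this utilization-based rerouting (route names and the congestion threshold are illustrative assumptions, not measured values) could look as follows:

def pick_route(utilization: dict, current: str, threshold: float = 0.8) -> str:
    # utilization maps route name -> fraction of link capacity in use (0.0-1.0).
    if utilization[current] < threshold:
        return current                             # current route is not congested
    return min(utilization, key=utilization.get)   # least-utilized alternative route

links = {"metro-ring-A": 0.93, "metro-ring-B": 0.41, "backbone-C": 0.62}
print(pick_route(links, current="metro-ring-A"))   # -> metro-ring-B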

ForCES is another protocol used in SDN applications. This protocol proposes the
separation of IP control and data forwarding (Hu et al. 2014). Unlike OpenFlow, it
does not have widespread adoption due to its lack of clear language abstraction
definition and controller-switch communication rules. The main advantage of the
ForCES protocol is that it can be easily integrated into old traditional network
devices since it just adds networking/forwarding elements.

The optimal performance of the SDN relies mainly on the controllers. Concerning
SDN applications, we must have a distributed controller system. In this
case, a single controller does not become a single point of failure. The distributed
controller system continuously checks for updates to avoid inconsistencies and
incorrect routing of packets. The testing and checking of the system is performed
using Mininet.

27.4. Conclusion and future recommendations

The SDN can be applied in the case where the link is down, in order to enable
the traffic to reroute on another path that is up and running. This can only be
achieved when a ring network is on the backbone, aggregation and last mile. The
SDN simplifies the traditional network and makes things easy for troubleshooting
and maintenance. It will rely more on automation and the programmability of the
network.

The SDN works very well with cloud networking and artificial intelligence
networks. It is the future of secured and reliable communication networks.

27.5. References

van Adrichem, N.L.M., van Asten, B.J., Kuipers, F.A. (2014). Fast recovery in software-
defined networks. 3rd European Workshop on Software Defined Networks (EWSDN),
Budapest, Hungary.

Ali, J., Lee, G.M., Roh, B.H., Ryu, D.K., Park, G. (2020). Software defined networks
approaches for link failure recovery: A survey. Sustainability, 12(10), 1–28.
Alsmadi, I. and Xu, D. (2015). Security of software defined networks: A survey. Computers
& Security, 53, 79–108.
Bakshi, K. (2013). Considerations for software define network (SDN): Approaches and use
cases. IEEE Aerospace Conference, Big Sky, MT, USA.
Cox, J.H., Chung, J., Donovan, S., Ivey, J., Clark, R.J., Riley, G., Owen, H.L. (2017).
Advancing software defined networks: A survey. IEEE Access, 5, 25487–25526.
Dargar, S.K. and Srivastava, V.M. (2019). Integration of ICT based methods in higher
education teaching of electronic engineering. 10th International Conference of Strategic
Research on Scientific Studies and Education (ICoSReSSE), Rome, Italy.
El-Seoud, S.A., El-Sofany, H.F., Abdelfattah, M., Mohamed, R. (2017). Big data and cloud
computing: Trends and challenges. International Journal of Interactive Mobile
Technologies, 11(2), 34–52.
Ercan, T., Rajabion, L., Sheybani, E. (2010). Effective use of cloud computing in educational
institutions. Procedia – Social and Behavioral Sciences, 2(2), 938–942.
Farhady, H., Lee, H.Y., Nakao, A. (2015). Software-defined networking: A survey. Computer
Networks, 81, 79–95.
Guo, J., Yang, J., Zhang, Y., Chen, Y. (2013). Low cost power failure protection for MLC
NAND flash storage systems with PRAM/DRAM hybrid buffer. Design, Automation &
Test in Europe Conference & Exhibition (DATE), Grenoble, France.
Hu, F., Hao, Q., Bao, K. (2014). A survey on software-defined network and OpenFlow: From
concept to implementation. IEEE Communications Surveys & Tutorials, 16(4),
2184–2189.
Izumi, T., Sukhwani, V., Surjan, A., Shaw, R. (2021). Managing and responding to
pandemics in higher educational institutions: Initial learning from COVID-19.
International Journal of Disaster Resilience in the Built Environment, 12(1), 51–66.
Javidi, G. (2017). Educational data mining and learning analytics: Overview of benefits and
challenges. International Conference on Computational Science and Computational
Intelligence (CSCI), Las Vegas, NV, USA.
Jung, S. and Song, Y.H. (2015). Data loss recovery for power failure in flash memory storage
systems. Journal of Systems Architecture, 61(1), 12–27.
Khalighi, M.A. and Uysal, M (2014). Survey on free-space optical communication:
A communication theory perspective. IEEE Communications Surveys & Tutorials, 16(4),
2231–2258.
Kim, H. and Feamster, N. (2013). Improving network management with software defined
networking. IEEE Communications Magazine, 51(2), 114–119.

Kreutz, D., Ramos, F.M.V., Verissimo, P.E., Rothenberg, C.E., Azodolmolky, S., Uhlig, S.
(2015). Software-defined networking: A comprehensive survey. Proceedings of the IEEE,
103(1), 14–76.
Maatuk, A.M., Elberkawi, E.K., Aljawarneh, S., Rashaideh, H., Alharbi, H. (2021). The
COVID-19 pandemic and e-learning: Challenges and opportunities from the perspective
of students and instructors. Journal of Computing in Higher Education, 1–18.
Marconett, D. and Yoo, S.J.B. (2015). Flow broker: A software-defined network controller
architecture for multi-domain brokering and reputation. Journal of Network and Systems
Management, 23, 328–359.
Megyesi, P., Botta, A., Aceto, G., Pescape, A., Molnar, S. (2017). Challenges and solution for
measuring available bandwidth in software define networks. Computer Communications,
99, 48–61.
Nunes, B.A.A., Mendonca, M., Nguyen, X.N., Obraczka, K., Turletti, T. (2014). A survey of
software-defined networking: Past, present, and future of programmable networks. IEEE
Communications Surveys & Tutorials, 16(3), 1617–1634.
Odeyemi, K.O., Owolawi, P.A., Srivastava, V.M. (2017). Performance analysis of free-space
optical system with spatial modulation and diversity combiners over the Gamma-Gamma
atmospheric turbulence. Optics Communications, 382(1), 205–211.
Shu, Z., Wan, J., Li, D.I., Lin, J., Vasilakos, A.V., Imran, M. (2016). Security in software
defined network: Threats and counter measures. Mobile Networks and Applications, 21,
764–776.
Singh, M. and Srivastava, V.M. (2018). An analysis of key challenges for adopting the cloud
computing in Indian education sector. Advances in Computing and Data Sciences, Singh, M.,
Gupta, P.K., Tyagi, V., Flusser, J., Ören, T. (eds). Springer, Singapore.
Srivastava, V.M. (2020). Learning for future education, research, and development through
vision and supervision. International African Conference on Current Studies of Science,
Technology & Social Sciences (African Summit), South Africa.
Teras, M., Suoranta, J., Teras, H., Curcher, M. (2020). Post-Covid-19 education and education
technology “solutionism”: A seller’s market. Postdigital Science and Education, 2(3),
863–878.
Wang, L., Yao, L., Xu, Z., Wu, G., Obaidat, M.S. (2018). CFR: A cooperative link failure
recovery scheme in software defined networks. International Journal of Communication
Systems, 31(10), e3560.
Xia, W., Wen, Y., Foh, C.H., Niyato, D., Xie, H. (2015). Survey on software defined
networking. IEEE Communications Surveys & Tutorials, 17(1), 27–43.
Zhang, S. and Pedersen, G.F. (2016). Mutual coupling reduction for UWB MIMO antennas
with a wideband neutralization line. IEEE Antennas and Wireless Propagation Letters,
15, 166–169.
28

Reliability of a Double Redundant System Under the Full Repair Scenario

For two probability distributions A(x) and B(x), modified Laplace–Stieltjes transforms
have been introduced in Rykov et al. (2020b). In terms of these transforms, the
analytical expressions of the main reliability characteristics of a double redundant
system with arbitrarily distributed life- and repair times of system components under
the partial repair scenario have been found. In this chapter, we extend the investigation
of the same model under the full repair scenario. The analytical expressions for the
time-dependent and steady-state system probabilities as well as the system reliability
function are presented in this chapter. The proposed approach and obtained analytical
results also allow us to investigate the sensitivity of system reliability characteristics
to the shape of system component life- and repair time distributions.

28.1. Introduction

Reliability of both simple and complex systems is a key issue in management,
production and other areas of human activity. The redundancy technique is one of the
most common methods for improving reliability (Ushakov 2013; Levitin et al. 2016).
Duplication, for example, is of particular interest among researchers.

There are many papers devoted to the study of redundant systems. Calculation
of the reliability characteristics of such systems is not a trivial task, even for a simple
double redundant structure, when component life- and repair times are arbitrarily
distributed.

Chapter written by Vladimir RYKOV and Nika IVANOVA.


Several analytical methods allow us to calculate the reliability characteristics
of such systems, when one time (life or repair) has an exponential distribution,
and the other is arbitrarily distributed. For example, in Rykov et al. (2014), the
authors considered two types of cold redundancy models by solving a system of
forward differential equations. In Rykov and Kozyrev (2019) and Rykov et al.
(2020a), to describe the system behavior, the Markovization method based on
additional variable introduction has been used. Following it, a two-dimensional
Markov process was constructed, and for its time-dependent probabilities, the
system of Kolmogorov forward partial differential equations was written out. For its
solution, the method of characteristics was applied. Using a combination of these
methods, the time-dependent system reliability and stationary characteristics have
been calculated. In Vanderperre and Makhanov (2014), a similar system of partial
differential equations was solved using numerical methods. A probabilistic approach
to calculating reliability characteristics was proposed in Vanderperre (2005), where
the results have been obtained in terms of the Laplace–Stieltjes transform (LST) and
the Cauchy-type integral.

As opposed to previous works, in Utkin (2003) the author proposed imprecise
reliability models of cold standby systems. The investigation of these models supposed
that arbitrary probability distributions of the component time to failure are possible
and that they are restricted only by the available information in the form of lower and
upper probabilities of some events. Any system subject to arbitrary life- and repair
time distributions poses an open problem in reliability theory.

However, in the recent paper by Rykov et al. (2020b), the authors have obtained
analytical expressions of the main characteristics of the reliability of a double
redundant system with partial repair and arbitrarily distributed both life- and repair
times. This result was reached using the theory of decomposable semi-regenerative
processes. Some of the main milestones of this approach are briefly presented below.

In 1955, Smith (1955) proposed the regeneration idea, which was the beginning
of the development of a new direction in the random processes theory. Smith’s theory
helps researchers to solve many applied problems by reducing their complexity. Thus,
this theory led to many generalizations and modifications of regenerative process
theory. Sometimes, the behavior of the process in a separate regeneration period,
as well as the corresponding probabilities, can be complex enough for analytical
calculations, so it is necessary to investigate this process in more detail. Combining
the theory of semi-Markov processes proposed by Cinlar (1969) with Smith’s theory
led to the development of semi-regenerative processes. The continuation of this
generalization was reflected in papers by Klimov (1966), Jacod (1971), Rykov and
Yastrebenetsky (1971) and Korolyuk and Turbin (1976).

Later, the idea of semi-regenerative processes was continued by Rykov (1975),
where the theory of decomposable semi-regenerative processes (DSRP) was proposed.
If the process behavior in some separate regeneration period is complex enough and its
distribution cannot be analytically represented, sometimes it is possible to find some
embedded regeneration time points within this period, in which the process forgets
its past up to the present state conditionally to its behavior in the main regeneration
period. Extending the procedure above to any regeneration period leads to the
construction of the DSRP. For details, see Rykov (2011).

Investigations above dealt with the partial repair of the system. This means that
after a whole system failure, the repair of the failed component is prolonged, and
after its end, the system resumes operation and the repair of the remaining component
begins. However, there is a second way to restore the whole system. Sometimes it
is more useful to consider a full repair instead of a partial one to maintain a high
level of reliability, as well as to save energy and reduce economic costs. Full system
repair means the restoration of all failed components, after which the system becomes
operational and works like a new one. Thus, the problem of studying a system under
a full repair scenario arises. Moreover, in such a system, the time of full repair can be
different from the partial one.

The purpose of this chapter is to apply the DSRP theory to the problem of
studying the reliability characteristics of a double redundant system under the full
repair scenario and arbitrarily distributed life- and repair times of system components,
as well as repair of the whole system.

This chapter is organized as follows. In the next section, the problem statement
and notations are given. Section 28.3 introduces the process for system behavior
described as regenerative and deals with the calculation of the reliability function
and the system’s mean lifetime. In section 28.4, we consider the process behavior
in a separate regeneration period and find time-dependent system state probabilities
(t.d.s.p.s). The final section aims to present the results of steady-state probabilities
(s.s.p.s). This chapter ends with a conclusion and some future research directions.

28.2. Problem statement, assumptions and notations

Consider a homogeneous cold double redundant system with arbitrarily distributed
life- and repair times. Such a system has two components (main and reserved) and one
repair facility (Figure 28.1).

At the beginning of the work, we assume that both components are in working
order. With the failure of the main component, a reserved one starts operating. If the
repair of a failed component ends before the failure of the other one, the first one

returns to reserve as new. Otherwise, the failure of the reserved component before the
failed main component has been repaired results in the failure of the entire system and
the beginning of its full repair.

Figure 28.1. Two-component cold-standby repairable system with one repair facility

Assume that both life- and repair times of the system’s components are arbitrarily
distributed. Denote by $A_i$ (i = 1, 2, ...) the lifetimes of the system elements, by $B_i$
(i = 1, 2, ...) the partial repair times (repair of an element) and by $C_i$ (i = 1, 2, ...) the
full repair times of the whole system. Suppose that all these random variables (r.v.s) are
mutually independent and identically (for each type of r.v.) distributed (i.i.d.). Thus,
the corresponding cumulative distribution functions (c.d.f.s) are $A(x) = \mathrm{P}\{A_i \le x\}$,
$B(x) = \mathrm{P}\{B_i \le x\}$ and $C(x) = \mathrm{P}\{C_i \le x\}$ (i = 1, 2, ...). Denote also by A, B
and C the r.v.s with the same c.d.f.s as $A_i$, $B_i$ and $C_i$, respectively. Suppose that
instantaneous failures and repairs are impossible and that their mean times are finite:
\[
A(0) = B(0) = C(0) = 0,
\]
\[
a = \int_0^\infty (1 - A(x))\,dx < \infty, \qquad
b = \int_0^\infty (1 - B(x))\,dx < \infty, \qquad
c = \int_0^\infty (1 - C(x))\,dx < \infty.
\]

Introduce a random process J = {J(t), t ≥ 0}, where

J(t) = j, if at time t the system is in state j,

where E = {0, 1, 2} is the set of system states and j denotes the number of failed
components. Figure 28.2 illustrates the transition graph of the considered system.

Figure 28.2. Transition graph of a double redundant system under full repair

To study the system reliability, introduce the following notations.

– The LSTs of the corresponding c.d.f.s of failures and repairs are denoted by
\[
\tilde a(s) = \int_0^\infty e^{-sx}\,dA(x), \qquad
\tilde b(s) = \int_0^\infty e^{-sx}\,dB(x), \qquad
\tilde c(s) = \int_0^\infty e^{-sx}\,dC(x).
\]

– Modified LSTs of truncated distributions:
\[
\tilde a_B(s) = \int_0^\infty e^{-sx} B(x)\,dA(x), \qquad
\tilde b_A(s) = \int_0^\infty e^{-sx} A(x)\,dB(x).
\]
Note the property of these transforms:
\[
\tilde a_{1-B}(s) = \tilde a(s) - \tilde a_B(s), \qquad
\tilde b_{1-A}(s) = \tilde b(s) - \tilde b_A(s).
\]

– Modified expectations of the corresponding truncated distributions are denoted by
\[
a_B = -\tilde a_B'(0) = \int_0^\infty x B(x)\,dA(x), \qquad
b_A = -\tilde b_A'(0) = \int_0^\infty x A(x)\,dB(x).
\]

– The probabilities P{B ≤ A} and P{B > A} are associated with these transforms through the following relations:
\[
\tilde a_B(0) = \int_0^\infty B(x)\,dA(x) = \mathrm{P}\{B \le A\} \equiv p, \qquad
\tilde b_A(0) = \int_0^\infty A(x)\,dB(x) = \mathrm{P}\{B > A\} \equiv q = 1 - p.
\]
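
For concreteness, the modified LSTs and the probabilities p and q can be evaluated numerically for any chosen pair of distributions. The short Python sketch below (using illustrative Gamma and exponential choices for A and B, which are assumptions and not part of the model) computes the two modified LSTs and checks that p + q is close to 1.

import numpy as np
from scipy import integrate, stats

A = stats.gamma(a=2.0, scale=1.0)   # illustrative lifetime distribution
B = stats.expon(scale=0.5)          # illustrative repair-time distribution

def a_B(s):
    # modified LST: integral of exp(-s x) * B(x) dA(x) over (0, infinity)
    val, _ = integrate.quad(lambda x: np.exp(-s * x) * B.cdf(x) * A.pdf(x), 0, np.inf)
    return val

def b_A(s):
    # modified LST: integral of exp(-s x) * A(x) dB(x) over (0, infinity)
    val, _ = integrate.quad(lambda x: np.exp(-s * x) * A.cdf(x) * B.pdf(x), 0, np.inf)
    return val

p, q = a_B(0.0), b_A(0.0)   # P{B <= A} and P{B > A}
print(p, q, p + q)           # p + q should be (numerically) 1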

This chapter deals with the following reliability characteristics of the considered
system:

– the reliability function
\[
R(t) = \mathrm{P}\{F > t\} = 1 - F(t),
\]
where F is the time to the first system failure and F(t) its c.d.f.;

– the t.d.s.p.s
\[
\pi_j(t) = \mathrm{P}\{J(t) = j\} \quad (j = 0, 1, 2);
\]

– the s.s.p.s
\[
\pi_j = \lim_{t \to \infty} \pi_j(t) \quad (j = 0, 1, 2).
\]

28.3. Reliability function

Process J is a regenerative one (see Figure 28.3), and its regeneration times

\[
S_0 = 0, \quad S_1 = S_0 + G_1, \;\ldots,\; S_n = S_{n-1} + G_n, \;\ldots
\]

mean the time moments when the process returns to state 0 after a full system failure
and repair. Here, Gi (i = 1, 2, . . . ) are the sequence of i.i.d. r.v.s of the time
intervals between two consecutive returns of the system to state 0 after full failure,
which represents the lengths of regenerative periods. Denote G(t) = P{Gi ≤ t}
the corresponding c.d.f. of such r.v.s. Define also the lifetime of the system as W ,
and the time to the first system failure as F (see Figure 28.3) and their c.d.f.s by
W (t) = P{W ≤ t} and F (t) = P{F ≤ t}. The following lemma holds for the
LSTs of these distributions.

LEMMA 28.1.– The LSTs $\tilde w(s)$, $\tilde f(s)$ and $\tilde g(s)$ of the corresponding distributions
W(t), F(t) and G(t) are of the form
\[
\tilde w(s) = \frac{\tilde a(s) - \tilde a_B(s)}{1 - \tilde a_B(s)}, \qquad
\tilde f(s) = \tilde a(s)\,\frac{\tilde a(s) - \tilde a_B(s)}{1 - \tilde a_B(s)}, \qquad
\tilde g(s) = \tilde a(s)\tilde c(s)\,\frac{\tilde a(s) - \tilde a_B(s)}{1 - \tilde a_B(s)}. \tag*{[28.1]}
\]

PROOF.– The lifetime of the system W is the time between two successive failures
of system components. After a failure in state 0, the system goes to state 1 where
there can be two events: either the recovery of a component in state 1 in time B and
subsequent transition again to state 0, or a failure of a second component in time A
will happen. Hence, from Figure 28.3, the time W satisfies the following stochastic
equation:
\[
W =
\begin{cases}
A + W, & \text{if } B < A,\\
A, & \text{if } B \ge A.
\end{cases}
\tag*{[28.2]}
\]

Figure 28.3. Trajectory of the process J

Accordingly, the r.v.s F and G can be found as
\[
F = A + W, \qquad G = F + C. \tag*{[28.3]}
\]
Applying the LST $\tilde w(s) = \mathrm{E}\left[e^{-sW}\right]$ to [28.2], we can obtain
\[
\tilde w(s) = \mathrm{E}\left[e^{-sW}\right] = \int_0^\infty e^{-st}\,dW(t)
= \int_0^\infty e^{-sx}\left[\tilde w(s)B(x) + (1 - B(x))\right] dA(x)
= \tilde w(s)\tilde a_B(s) + \tilde a_{1-B}(s).
\]
Application of the LST to [28.3] leads to the remaining equalities in [28.1].

The main result of this section is the following theorem.



THEOREM 28.1.– The LT $\tilde R(s)$ of the system reliability function R(t) is
\[
\tilde R(s) = \frac{(1 - \tilde a(s))(1 + \tilde a(s) - \tilde a_B(s))}{s(1 - \tilde a_B(s))}.
\]

PROOF.– The reliability function is defined as R(t) = 1 − F(t). Thus, applying the LT
and taking into account that
\[
\tilde F(s) = \int_0^\infty e^{-st} F(t)\,dt = \frac{1}{s}\int_0^\infty e^{-st}\,dF(t) = \frac{1}{s}\,\tilde f(s),
\]
the LT $\tilde R(s)$ can be calculated with the second equation of [28.1]:
\[
\tilde R(s) = \frac{1}{s} - \tilde F(s) = \frac{1}{s}\,(1 - \tilde f(s))
= \frac{(1 - \tilde a(s))(1 + \tilde a(s) - \tilde a_B(s))}{s(1 - \tilde a_B(s))}.
\]

COROLLARY 28.1.– The mean time to the first failure $f \equiv \mathrm{E}[F]$ and the mean length
of the regeneration period $g \equiv \mathrm{E}[G]$ have the following form:
\[
f = a + \frac{a}{q}, \qquad g = a + c + \frac{a}{q}.
\]

PROOF.– The relations above follow from the equalities
\[
f = \tilde R(0) = \lim_{s \to 0} \frac{(1 - \tilde a(s))(1 + \tilde a(s) - \tilde a_B(s))}{s(1 - \tilde a_B(s))}
= \lim_{s \to 0} \frac{1 - \tilde a_B(s) - \tilde a(s)(\tilde a(s) - \tilde a_B(s))}{s(1 - \tilde a_B(s))} = a + \frac{a}{q};
\]
\[
g = \lim_{s \to 0} \frac{1 - \tilde g(s)}{s} = a + c + \frac{a}{q}.
\]
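
Corollary 28.1 can also be checked by simulation. The Python sketch below (with an illustrative choice of lifetime and repair distributions, which is an assumption and not part of the model) estimates E[F] directly from the stochastic equation [28.2] and compares it with a + a/q.

import numpy as np

rng = np.random.default_rng(0)
draw_A = lambda n: rng.gamma(shape=2.0, scale=1.0, size=n)   # illustrative lifetimes, E[A] = 2
draw_B = lambda n: rng.exponential(scale=0.5, size=n)        # illustrative repair times

def sample_F(n_trials=200_000):
    # F = A + W: lifetimes accumulate until a repair is not finished before the next failure.
    total = draw_A(n_trials)                 # time of the first component failure
    active = np.ones(n_trials, dtype=bool)
    while active.any():
        a = draw_A(active.sum())
        b = draw_B(active.sum())
        total[active] += a                   # next component failure while in state 1
        idx = np.flatnonzero(active)
        active[idx[b >= a]] = False          # repair still running -> whole system fails
    return total

F = sample_F()
a_mean = 2.0                                              # E[A] for the chosen Gamma distribution
q = np.mean(draw_B(1_000_000) > draw_A(1_000_000))        # Monte Carlo estimate of P{B > A}
print(F.mean(), a_mean + a_mean / q)                      # the two values should be close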

28.4. Time-dependent system state probabilities

28.4.1. General representation of t.d.s.p.s

According to the renewal theory, the t.d.s.p.s $\pi_j(t)$ of the process J at any time t
can be represented in terms of its distribution in a separate regeneration period G,
\[
\pi_j^{(1)}(t) = \mathrm{P}\{J(t) = j,\; t < G\} \quad (j = 0, 1, 2),
\]
and the renewal function H(t)
\[
H(t) = \sum_{n \ge 1} \mathrm{P}\Big\{\sum_{1 \le i \le n} G_i \le t\Big\} = \sum_{n \ge 1} G^{*n}(t)
\]
as
\[
\pi_j(t) = \pi_j^{(1)}(t) + \int_0^t \pi_j^{(1)}(t - u)\,dH(u). \tag*{[28.4]}
\]

In terms of the LT and LST of the corresponding functions
\[
\tilde\pi_j(s) = \int_0^\infty e^{-st}\pi_j(t)\,dt, \qquad
\tilde\pi_j^{(1)}(s) = \int_0^\infty e^{-st}\pi_j^{(1)}(t)\,dt, \qquad
\tilde h(s) = \int_0^\infty e^{-st}\,dH(t),
\]
the equation above can be represented in the form
\[
\tilde\pi_j(s) = (1 + \tilde h(s)) \cdot \tilde\pi_j^{(1)}(s) \quad (j = 0, 1, 2). \tag*{[28.5]}
\]

In addition, from the renewal theory it is well known that the LST of the renewal
function H(t) can be expressed with the LST of the regeneration period distribution
as follows:
\[
\tilde h(s) = \frac{\tilde g(s)}{1 - \tilde g(s)}.
\]
Using the third equality from [28.1], equation [28.5] turns into
\[
\tilde\pi_j(s) = \frac{1 - \tilde a_B(s)}{1 - \tilde a(s) + (\tilde a(s) - \tilde a_B(s))(1 - \tilde a(s)\tilde c(s))} \cdot \tilde\pi_j^{(1)}(s) \quad (j = 0, 1, 2). \tag*{[28.6]}
\]

Further, we calculate the LTs $\tilde\pi_j^{(1)}(s)$ of the t.d.s.p.s in a separate regeneration period.

28.4.2. T.d.s.p.s in a separate regeneration period

As follows from Figure 28.3 and as was mentioned in Lemma 28.1, the r.v. G
consists of two time intervals F and C, G = F + C. Therefore, the distribution
$\pi_j^{(1)}(t)$ of the process J in a separate main regeneration period G can be divided into
two distributions as follows:
\[
\pi_j^{(1)}(t) = (1 - \delta_{j2})\,\pi_j^{(F)}(t)\,\mathbf{1}_{\{t < F\}} + \delta_{j2}\,\pi_j^{(C)}(t)\,\mathbf{1}_{\{F < t < G\}} \quad (j = 0, 1, 2). \tag*{[28.7]}
\]
Following this representation, the probability $\pi_2^{(1)}(t)$ can be calculated very easily
and is given in the following lemma.

LEMMA 28.2.– The LT $\tilde\pi_2^{(1)}(s)$ of the t.d.s.p. $\pi_2^{(1)}(t)$ in the main regeneration period
is given by
\[
\tilde\pi_2^{(1)}(s) = \frac{\tilde a(s)(\tilde a(s) - \tilde a_B(s))}{1 - \tilde a_B(s)} \cdot \frac{1 - \tilde c(s)}{s}. \tag*{[28.8]}
\]

PROOF.– From [28.7], it follows that the event $\{J(t) = 2,\; t < G\}$ occurs if $\{F \le t < F + C\}$. Thus, it holds that
\[
\pi_2^{(1)}(t) = \pi_2^{(C)}(t) = \mathrm{P}\{F \le t < F + C\} = \int_0^t (1 - C(t - u))\,dF(u).
\]
Applying the LT and changing the variable in integration, we get
\[
\tilde\pi_2^{(1)}(s) = \int_0^\infty e^{-st}\int_0^t (1 - C(t - u))\,dF(u)\,dt
= \int_0^\infty e^{-su}\,dF(u)\int_u^\infty e^{-s(t-u)}(1 - C(t - u))\,dt
\]
\[
= \tilde f(s)\int_0^\infty e^{-sv}(1 - C(v))\,dv
= \tilde f(s)\,\frac{1 - \tilde c(s)}{s}
= \frac{\tilde a(s)(\tilde a(s) - \tilde a_B(s))}{1 - \tilde a_B(s)} \cdot \frac{1 - \tilde c(s)}{s}.
\]

For states j = 0, 1, equation [28.7] takes the form
\[
\pi_j^{(1)}(t) = \pi_j^{(F)}(t).
\]
Taking into account (see Lemma 28.1) that F = A + W, the process J behavior
in a separate period F can be divided into its behavior in the intervals A and W. Thus, the
process t.d.s.p.s in this period can be expressed as
\[
\pi_0^{(F)}(t) \equiv \mathrm{P}\{J(t) = 0,\; t < F\} = \mathrm{P}\{t < A\} + \int_0^t dA(u)\,\pi_0^{(W)}(t - u),
\]
\[
\pi_1^{(F)}(t) \equiv \mathrm{P}\{J(t) = 1,\; t < F\} = \int_0^t dA(u)\,\pi_1^{(W)}(t - u),
\]
where $\pi_j^{(W)}(t) = \mathrm{P}\{J(t) = j,\; t < W\}$ (j = 0, 1) are the t.d.s.p.s in a separate period W.
In terms of the LTs $\tilde\pi_j^{(F)}(s) = \int_0^\infty e^{-st}\pi_j^{(F)}(t)\,dt$ and $\tilde\pi_j^{(W)}(s) = \int_0^\infty e^{-st}\pi_j^{(W)}(t)\,dt$, we get
\[
\tilde\pi_0^{(1)}(s) \equiv \tilde\pi_0^{(F)}(s) = \frac{1 - \tilde a(s)}{s} + \tilde a(s)\tilde\pi_0^{(W)}(s),
\qquad
\tilde\pi_1^{(1)}(s) \equiv \tilde\pi_1^{(F)}(s) = \tilde a(s)\tilde\pi_1^{(W)}(s). \tag*{[28.9]}
\]

For the calculation of the probabilities $\pi_j^{(W)}(t)$ (j = 0, 1), we use the theory of DSRP,
briefly described in the introduction. According to this theory, the process distribution
$\pi_j^{(W)}(t)$ (j = 0, 1) in a separate regeneration period W can be represented in terms
of its distribution
\[
\pi_j^{(2)}(t) = \mathrm{P}\{J(t) = j,\; t < G^{(1)}\} \quad (j = 0, 1)
\]
in the embedded regeneration period $G^{(1)}$ with c.d.f. $G^{(1)}(t) = \mathrm{P}\{G^{(1)} \le t\}$ and
embedded renewal function $H^{(W)}(t)$, analogously to equation [28.4], as follows:
\[
\pi_j^{(W)}(t) = \pi_j^{(2)}(t) + \int_0^t dH^{(W)}(u)\,\pi_j^{(2)}(t - u) \quad (j = 0, 1), \tag*{[28.10]}
\]
where $H^{(W)}(t)$ satisfies the following equation:
\[
H^{(W)}(t) + W(t) = G^{(1)}(t) + \int_0^t dH^{(W)}(u)\,G^{(1)}(t - u). \tag*{[28.11]}
\]

In the considered case, as embedded regeneration times $S_k^{(1)}$, we use a random
number $\nu = \min\{n : A_n < B_n\}$ of the following time moments:
\[
S_1^{(1)} = A_1\mathbf{1}_{\{A_1 > B_1\}}, \qquad S_2^{(1)} = S_1^{(1)} + A_2\mathbf{1}_{\{A_1 > B_1,\, A_2 > B_2\}}, \;\ldots,
\]
until the event $\{A_n \le B_n\}$ occurs for the first time. This means that these
time points belong to the interval W, which is defined in equation [28.2], and the time
intervals $G_i^{(1)}$ ($i = 1, \ldots, \nu$) between embedded regeneration points have the distribution
A(t), $G^{(1)}(t) = A(t)$. Based on these arguments, we get the following statement.

LEMMA 28.3.– The LT $\tilde\pi_j^{(W)}(s)$ (j = 0, 1) of the process t.d.s.p.s in a separate
embedded regeneration period W satisfies the relation
\[
\tilde\pi_j^{(W)}(s) = \frac{1}{1 - \tilde a_B(s)}\,\tilde\pi_j^{(2)}(s). \tag*{[28.12]}
\]

PROOF.– In terms of the LT $\tilde h^{(W)}(s) = \int_0^\infty e^{-st}\,dH^{(W)}(t)$ and taking into account that
$\tilde g^{(1)}(s) = \tilde a(s)$, from [28.11] it follows that
\[
\tilde h^{(W)}(s) + \tilde w(s) = \tilde g^{(1)}(s) + \tilde h^{(W)}(s)\tilde g^{(1)}(s),
\qquad
\tilde h^{(W)}(s) = \frac{\tilde a(s) - \tilde w(s)}{1 - \tilde a(s)}.
\]
For the LT $\tilde\pi_j^{(2)}(s)$ of the t.d.s.p.s in a separate regeneration period $G^{(1)}$ of the second
level, taking into account the first equality of [28.1], from [28.10] it follows that
\[
\tilde\pi_j^{(W)}(s) = (1 + \tilde h^{(W)}(s))\,\tilde\pi_j^{(2)}(s)
= \frac{1 - \tilde w(s)}{1 - \tilde a(s)}\,\tilde\pi_j^{(2)}(s)
= \frac{1}{1 - \tilde a_B(s)}\,\tilde\pi_j^{(2)}(s),
\]
which ends the proof.

The next step consists of the calculation of the process t.d.s.p.s in a separate
regeneration period of the second level, $\pi_j^{(2)}(t)$ (j = 0, 1).

LEMMA 28.4.– The LTs of the t.d.s.p.s in a separate regeneration period of the second
level are
\[
\tilde\pi_0^{(2)}(s) = \frac{1}{s}\left[\tilde b(s) - (\tilde a_B(s) + \tilde b_A(s))\right],
\qquad
\tilde\pi_1^{(2)}(s) = \frac{1}{s}\left[1 - (\tilde a(s) + \tilde b(s)) + \tilde a_B(s) + \tilde b_A(s)\right]. \tag*{[28.13]}
\]

PROOF.– From Figure 28.3, the probabilities $\pi_j^{(2)}(t)$ (j = 0, 1) can be obtained in the
following way.

The event $\{J(t) = 0,\; t < G^{(1)}\}$ occurs if $\{B < t < A\}$, i.e. if the repair of
the component being repaired ends before time point t and the other component has not yet failed.
By the independence of the r.v.s A and B, we get
\[
\pi_0^{(2)}(t) = \mathrm{P}\{B < t < A\} = B(t)(1 - A(t)).
\]
The event $\{J(t) = 1,\; t < G^{(1)}\}$ occurs if $\{t < B < A\}$ or $\{t < A < B\}$, i.e.
being in state 1 at time point t, either one component is under repair and the second has not failed,
or a failure of the second component occurred before the repair completion of the
other one. Due to the incompatibility of the events above, it follows that
\[
\pi_1^{(2)}(t) = \mathrm{P}\{t < B < A\} + \mathrm{P}\{t < A < B\}
= \int_t^\infty (1 - A(u))\,dB(u) + \int_t^\infty (1 - B(u))\,dA(u).
\]
Calculating the LTs of these expressions by partial integration,
\[
\tilde\pi_0^{(2)}(s) = \int_0^\infty e^{-st} B(t)(1 - A(t))\,dt
= \frac{1}{s}\int_0^\infty e^{-st}\left[(1 - A(t))\,dB(t) - B(t)\,dA(t)\right]
= \frac{1}{s}\left[\tilde b(s) - (\tilde a_B(s) + \tilde b_A(s))\right],
\]
\[
\tilde\pi_1^{(2)}(s) = \int_0^\infty e^{-st}\left[\int_t^\infty (1 - A(u))\,dB(u) + \int_t^\infty (1 - B(u))\,dA(u)\right]dt
= \int_0^\infty \left[\int_0^u e^{-st}\,dt\right]\left[(1 - A(u))\,dB(u) + (1 - B(u))\,dA(u)\right]
\]
\[
= \frac{1}{s}\int_0^\infty (1 - e^{-su})\left[(1 - A(u))\,dB(u) + (1 - B(u))\,dA(u)\right]
= \frac{1}{s}\left[1 - (\tilde a(s) + \tilde b(s)) + \tilde a_B(s) + \tilde b_A(s)\right],
\]
which ends the proof.

Collecting all the results of the current section leads to the following theorem.

THEOREM 28.2.– The LTs of the process t.d.s.p.s $\pi_j(t)$ are:
\[
\tilde\pi_0(s) = \frac{1}{s} \cdot \frac{1 - \tilde a_B(s) - \tilde a(s)(1 - \tilde b(s) + \tilde b_A(s))}{1 - \tilde a(s) + (\tilde a(s) - \tilde a_B(s))(1 - \tilde a(s)\tilde c(s))},
\]
\[
\tilde\pi_1(s) = \frac{1}{s} \cdot \frac{\tilde a(s)\left(1 - (\tilde a(s) + \tilde b(s)) + \tilde a_B(s) + \tilde b_A(s)\right)}{1 - \tilde a(s) + (\tilde a(s) - \tilde a_B(s))(1 - \tilde a(s)\tilde c(s))},
\]
\[
\tilde\pi_2(s) = \frac{1}{s} \cdot \frac{\tilde a(s)(1 - \tilde c(s))(\tilde a(s) - \tilde a_B(s))}{1 - \tilde a(s) + (\tilde a(s) - \tilde a_B(s))(1 - \tilde a(s)\tilde c(s))}. \tag*{[28.14]}
\]

PROOF.– The proof is presented as a series of equalities, obtained by substituting [28.9], [28.12]
and [28.13] into [28.6] for j = 0, 1, and [28.8] for j = 2:
\[
\tilde\pi_0(s) = (1 + \tilde h(s))\,\tilde\pi_0^{(1)}(s)
= \frac{1 - \tilde a_B(s)}{1 - \tilde a(s) + (\tilde a(s) - \tilde a_B(s))(1 - \tilde a(s)\tilde c(s))} \cdot \left[\frac{1 - \tilde a(s)}{s} + \tilde a(s)\tilde\pi_0^{(W)}(s)\right]
\]
\[
= \frac{1 - \tilde a_B(s)}{1 - \tilde a(s) + (\tilde a(s) - \tilde a_B(s))(1 - \tilde a(s)\tilde c(s))} \cdot \left[\frac{1 - \tilde a(s)}{s} + \frac{\tilde a(s)}{1 - \tilde a_B(s)}\,\tilde\pi_0^{(2)}(s)\right]
= \frac{1}{s} \cdot \frac{1 - \tilde a_B(s) - \tilde a(s)(1 - \tilde b(s) + \tilde b_A(s))}{1 - \tilde a(s) + (\tilde a(s) - \tilde a_B(s))(1 - \tilde a(s)\tilde c(s))};
\]
\[
\tilde\pi_1(s) = (1 + \tilde h(s))\,\tilde\pi_1^{(1)}(s)
= \frac{1 - \tilde a_B(s)}{1 - \tilde a(s) + (\tilde a(s) - \tilde a_B(s))(1 - \tilde a(s)\tilde c(s))} \cdot \tilde a(s)\tilde\pi_1^{(W)}(s)
\]
\[
= \frac{1 - \tilde a_B(s)}{1 - \tilde a(s) + (\tilde a(s) - \tilde a_B(s))(1 - \tilde a(s)\tilde c(s))} \cdot \frac{\tilde a(s)}{1 - \tilde a_B(s)}\,\tilde\pi_1^{(2)}(s)
= \frac{1}{s} \cdot \frac{\tilde a(s)\left(1 - (\tilde a(s) + \tilde b(s)) + \tilde a_B(s) + \tilde b_A(s)\right)}{1 - \tilde a(s) + (\tilde a(s) - \tilde a_B(s))(1 - \tilde a(s)\tilde c(s))};
\]
\[
\tilde\pi_2(s) = (1 + \tilde h(s))\,\tilde\pi_2^{(1)}(s)
= \frac{1 - \tilde a_B(s)}{1 - \tilde a(s) + (\tilde a(s) - \tilde a_B(s))(1 - \tilde a(s)\tilde c(s))}\,\tilde\pi_2^{(1)}(s)
= \frac{1}{s} \cdot \frac{\tilde a(s)(1 - \tilde c(s))(\tilde a(s) - \tilde a_B(s))}{1 - \tilde a(s) + (\tilde a(s) - \tilde a_B(s))(1 - \tilde a(s)\tilde c(s))}.
\]

28.5. Steady-state probabilities

To calculate the process s.s.p.s, we use the following relation:
\[
\pi_j = \lim_{t \to \infty} \pi_j(t) = \lim_{s \to 0} s\,\tilde\pi_j(s). \tag*{[28.15]}
\]

THEOREM 28.3.– The s.s.p.s of the considered process are
\[
\pi_0 = \frac{aq + (a_B + b_A) - b}{a + q(a + c)}, \qquad
\pi_1 = \frac{a + b - (a_B + b_A)}{a + q(a + c)}, \qquad
\pi_2 = \frac{cq}{a + q(a + c)}. \tag*{[28.16]}
\]

PROOF.– Substituting the Maclaurin expansions of the expressions
\[
\tilde a_B(s) \approx \tilde a_B(0) + \tilde a_B'(0)\,s = p - a_B\,s, \qquad
\tilde b_A(s) \approx \tilde b_A(0) + \tilde b_A'(0)\,s = q - b_A\,s,
\]
\[
\tilde a(s) \approx \tilde a(0) + \tilde a'(0)\,s = 1 - a\,s, \qquad
\tilde b(s) \approx \tilde b(0) + \tilde b'(0)\,s = 1 - b\,s, \qquad
\tilde c(s) \approx \tilde c(0) + \tilde c'(0)\,s = 1 - c\,s,
\]
into [28.15] and taking into account expressions [28.14] from Theorem 28.2, we can
get the s.s.p.s, which ends the proof of the theorem.

REMARK 28.1.– Consider a Markov model with exponential distributions of all the
r.v.s of life- and repair times:
\[
A(t) = 1 - e^{-\alpha t}, \qquad B(t) = 1 - e^{-\beta t}, \qquad C(t) = 1 - e^{-\gamma t}.
\]
Then equation [28.16] from Theorem 28.3 gives
\[
\pi_0 = \frac{(\alpha + \beta)\gamma}{\alpha^2 + 2\alpha\gamma + \beta\gamma}, \qquad
\pi_1 = \frac{\alpha\gamma}{\alpha^2 + 2\alpha\gamma + \beta\gamma}, \qquad
\pi_2 = \frac{\alpha^2}{\alpha^2 + 2\alpha\gamma + \beta\gamma}.
\]
This result coincides with the one calculated by the direct approach using the birth
and death process for the Markov case.
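
This reduction can also be verified symbolically. The short Python sketch below substitutes the exponential ingredients into [28.16] with sympy and confirms that the differences from the Markov-case expressions simplify to zero.

import sympy as sp

alpha, beta, gamma, x = sp.symbols('alpha beta gamma x', positive=True)

# Ingredients of [28.16] for A, B, C exponential with rates alpha, beta, gamma
a, b, c = 1/alpha, 1/beta, 1/gamma
q = alpha/(alpha + beta)                                   # P{B > A}
aB = sp.integrate(x*(1 - sp.exp(-beta*x))*alpha*sp.exp(-alpha*x), (x, 0, sp.oo))
bA = sp.integrate(x*(1 - sp.exp(-alpha*x))*beta*sp.exp(-beta*x), (x, 0, sp.oo))

den = a + q*(a + c)
pi0 = (a*q + (aB + bA) - b)/den
pi1 = (a + b - (aB + bA))/den
pi2 = c*q/den

D = alpha**2 + 2*alpha*gamma + beta*gamma
print(sp.simplify(pi0 - (alpha + beta)*gamma/D))   # expected: 0
print(sp.simplify(pi1 - alpha*gamma/D))            # expected: 0
print(sp.simplify(pi2 - alpha**2/D))               # expected: 0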

28.6. Conclusion

In this chapter, a cold double redundant system with a single repair facility and
with arbitrarily distributed life- and repair times of the system components under
the full repair scenario is considered. For the system modeling, the theory
of DSRP has been used. The reliability function, time-dependent and steady-state
probabilities of this process have been calculated. Because the process describes
the system behavior, the obtained probabilities can be used for further calculation
of system reliability indicators. These results can be used for further system study,
for example, for sensitivity analysis of system components to the shape of its time
distributions.

28.7. References

Cinlar, E. (1969). On semi-Markov processes on arbitrary space. Mathematical Proceedings of
the Cambridge Philosophical Society, 66, 381–392.
Jacod, J. (1971). Théorème de renouvellement et classification pour les chaines
semi-Markoviennes. Annales de l’institut Henri Poincare B, 7, 85–129.

Klimov, G.P. (1966). Stochastic Service Systems. Nauka, Moscow.


Korolyuk, V.S. and Turbin, A.F. (1976). Semi-Markov Processes and their Applications.
Naukova Dumka, Kyiv.
Levitin, G., Xing, L., Dai, Y. (2016). Cold standby systems with imperfect backup. IEEE
Transactions on Reliability, 65, 4, 1798–1809 [Online]. Available at: https://doi.org/10.1109/
TR.2015.2491599.
Rykov, V. (1975). Regenerative processes with embedded regeneration periods and their
application for priority queuing systems investigation. Cybernetics, B(6), 105–111.
Rykov, V. (2011). Decomposable Semi-Regenerative Processes and their Applications. Lap
Lambert Academic Publishing, Berlin.
Rykov, V. and Kozyrev, D. (2019). On the reliability function of a double redundant system
with general repair time distribution. Applied Stochastic Models in Business and Industry,
35, 191–197 [Online]. Available at: https://doi.org/10.1002/asmb.2368.
Rykov, V. and Yastrebenetsky, M. (1971). On regenerative processes with several types of
regeneration points. Cybernetics, 3, 82–86.
Rykov, V., Efrosinin, D., Vishnevsiy, V. (2014). On sensitivity of reliability models to
the shape of life and repair time distributions. Proceedings of the Ninth International
Conference on Availability, Reliability and Security, 430–437 [Online]. Available at:
https://doi.org/10.1109/ARES.2014.65.
Rykov, V., Kozyrev, D., Filimonov, A., Ivanova, N. (2020a). On reliability function of a
k-out-of-n system with general repair time distribution. Probability in the Engineering
and Informational Sciences, 1–18 [Online]. Available at: https://doi.org/10.1017/
S0269964820000285.
Rykov, V., Efrosinin, D., Stepanova, N., Sztrik, J. (2020b). On reliability of a double redundant
renewable system with a generally distributed life and repair times. Mathematics, 8(2), 278
[Online]. Available at: https://doi.org/10.3390/math80202783.
Smith, W.L. (1955). Regenerative stochastic processes. Proceedings of the Royal Society of
London, 232 [Online]. Available at: https://doi.org/10.1098/rspa.1955.0198.
Ushakov, I. (2013). Redundancy. Encyclopedia of operations research and management science
[Online]. Available at: https://doi.org/10.1007/978-1-4419-1153-7_868.
Utkin, L.V. (2003). Imprecise reliability of cold standby systems. International Journal of
Quality & Reliability Management, 20(6), 722–739 [Online]. Available at: https://doi.org/
10.1108/02656710310482159.
Vanderperre, E.J. (2005). On the reliability of a renewable multiple cold standby system.
Mathematical Problems in Engineering, 3, 269–273 [Online]. Available at: https://doi.org/
10.1155/MPE.2005.269.
Vanderperre, E.J. and Makhanov, S.S. (2014). On the availability of a warm standby system:
A numerical approach. TOP, 22, 644–657 [Online]. Available at: https://doi.org/10.1007/
s11750-013-0285-9.
29

Predicting Changes in Depression Levels Following the European Economic Downturn of 2008

The economic crisis occurring in Europe since 2008 has caused major changes to
people’s lives. Past studies found that mental health disorders have risen during
periods of economic recession for both genders in Europe while others have
supported that males are more vulnerable compared to females. The target of this
study is to assess the depression imprint for a large sample of Europeans after the
2008 crisis. The sample studied in the analysis comes from the database of SHARE
(Survey of Health, Aging and Retirement in Europe), a multidisciplinary
longitudinal and cross-national database including material regarding health,
socioeconomic and demographic information of individuals aged 50 or higher,
resident in several European countries. The selection of respondents included those
participating both in wave 2, carried out in 2006–2007 and wave 6, completed in
2015, covering cross-national material in two time periods, just before and after the
economic recession. For the purposes of the analysis, multinomial logistic regression
models were applied for the total sample and separately by gender, using SPSS 20.
Special attention is given to the concurrent factors being associated with the
depression burden in older ages, covering different domains of life, before and after
economic recession. Findings indicate that health predictors including mobility
limitations, instrumental activities of daily living and long-term illnesses had
increased after 2015 for the total population of individuals, indicating worse health
levels. Further, cognitive function had declined as well. Concerning factors leading
to decreasing depression levels, the highest contribution is due to the reduction of

Chapter written by Eleni SERAFETINIDOU and Georgia VERROPOULOU.


limitations in instrumental activities of daily living and in mobility. Furthermore,


better cognitive function and life satisfaction levels are related to a decline in
depression levels. In general, factors associated with increasing depression levels are
related to a worsening of the health indicators. Men are more vulnerable to mental
disease due to an increase of limitations in instrumental activities of daily living
while women are more vulnerable due to an increase in long-term illness. Worse
levels in orientation in time and life satisfaction affect both sexes, increasing
depression scores. For males, other factors related to depression are facing more
economic difficulties and living alone.

29.1. Introduction

The economic crisis bursting out in Europe since 2007 has caused major changes
to people’s lives. It is associated with a reduction in economic growth, an increase in
unemployment and the deterioration of living conditions through the rise of poverty
(World Health Organization 2011). It has been suggested that poverty is a
socioeconomic risk factor, increasing chances of mental health disorders, including
depression and suicide attempts (Fryers et al. 2005; Frasquilho et al. 2016). Analysts
claim that between 2006 and 2012, a significant increase in mean depressive
symptoms has been observed, more likely due to job loss or to a major illness
(Pruchno et al. 2017). Further, research indicates that from January 2008 to
December 2015, general mental health among European populations was aggravated
and suicides increased (Parmar et al. 2016), although there were differences between
countries and population subgroups. Consequently, the well-being of individuals,
their families and of society as a whole has been undermined.

Past studies have found that mental health disorders have risen during periods of
economic downturn for both genders in Europe (Frasquilho et al. 2016) while others
have supported that these are higher among males, as they are considered more
vulnerable during economic recessions compared to females (Gunnell et al. 2015;
Bacigalupe et al. 2016; Gili et al. 2016; Margerison-Zilko et al. 2016). Nevertheless,
the analysis based on country of residence often presents contradictory results. For
example, a study in Portugal revealed that during recessions including the most
recent one (2008–2015), women reported higher levels of distress compared to men,
mainly due to factors affecting mental health, such as income, employment and
social status (Frasquilho et al. 2017). Another study in England reported that the
likelihood of mental health problems in the Great Recession increased more among
females and less educated individuals (Jofre-Bonet et al. 2018). Glonti et al. (2015)
showed that women’s mental health was more vulnerable during economic
downturns, mainly due to a reduction in income levels and changes in employment
status. Other associations of mental health with age, educational attainment and

marital status seem to be less significant, although higher educational levels led to
healthier behaviors (Glonti et al. 2015). Contrary to the above-mentioned results
regarding Europe, both men and women in the United States reported lower odds of
depression during and after recession and better mental health during the recession
(Dagher et al. 2015).

The association between health levels, socioeconomic status (SES), demographic
factors and depression in older ages has been thoroughly studied. Past findings
support that individuals suffering from cardiac disease or heart attack, coronary
artery disease and stroke have a greater likelihood of experiencing depression
(Fenton and Stover 2006; Benton et al. 2007; Welch et al. 2009; Gunn
et al. 2010) and the same holds for patients having diabetes (Fenton and Stover
2006; Simon et al. 2007; Vamos et al. 2009). Furthermore, limitations in
instrumental activities of daily living reinforce the presence of depression in older
ages (Ormel et al. 2002), and mobility limitations strongly predict negative health
outcomes and psychological distress (Backe et al. 2017; Musich et al. 2018).
Regarding cognitive impairment, it is clear that it may enhance depression disorders
in later life (Hammar and Ardal 2009; Giri et al. 2016). Although such factors are
related to increased depression levels for individuals, there are others that act in a
protective manner. For instance, a feeling of “life satisfaction”, related to successful
aging, plays a significant and straightforward role in the absence of depression
(Beutel et al. 2009; Srivastava 2016). Furthermore, Sarkisian et al. (2002) found in
their study that people having lower levels of life satisfaction are more likely to
experience higher levels of depression, a decline in quality of life and become less
energetic. Moreover, enhancing “trust in other people” helps older individuals
maintain their emotional connection with others, leading to better psychological
conditions (Li and Fung 2013). Finally, there is a significant inverse association
between SES and depression observed in European countries. There is evidence that
an increase in SES significantly decreases the odds of depression (Freeman et al.
2016) while other studies suggest that low SES or decreasing financial resources and
financial strain are strong predictors of a worsening in mental health, especially
between 2006 and 2010 (Martin-Carrasco et al. 2016; Wilkinson 2016). These
adverse effects are substantially reinforced through country austerity-imposed
measures and poorly developed welfare systems (Martin-Carrasco et al. 2016).
Regarding the role of educational attainment in times of recession, Freeman et al.
(2016) found that the higher the education qualifications, the lower the relative
chances of depression prevalence for Finland, Poland and Spain. This conclusion
has been confirmed by other studies as well (Bjelland et al. 2008; Zhang et al. 2012).
Finally, marital status is regarded as a significant factor related to the risk of
depression in later life. Past analysis has shown that being unmarried, widowed or
divorced among older persons is linked to a higher prevalence of depression
compared to married individuals (Yan et al. 2011).

29.1.1. Aims of the study

The main aim of this study is to assess the effects of changes in health and other
circumstances in the period following the recession of 2008 on depression, for a
sample of Europeans aged 50 or older. Special attention is given to concurrent
factors associated with depression in older ages, covering different domains of life.
More specifically, the main questions of the present analysis are: (a) Which factors’
transitions are more relevant to changes in depression levels? and (b) what are the
differentiations between genders regarding these changes? For the first question,
predictors have been considered in a holistic manner in order to assess their relative
effect and thus, their contribution to the improvement or deterioration of depression
levels. Results will inform authorities regarding the vulnerability of older persons to
specific life events. For the second question, the interest shifts to comparisons
between sexes. Findings may shape social policies related to depression in later life
and help individuals improve their quality of life.

29.2. Data and methods

29.2.1. Sample

The sample studied in the analysis comes from the SHARE study, a
multidisciplinary, longitudinal and cross-national database, including material
regarding health, socioeconomic and demographic information of individuals aged
50 or higher, resident in European countries (Börsch-Supan et al. 2013). The
selection of the sample involved respondents who participated both in wave 2,
carried out in 2006–2007, and wave 6, completed in 2015, covering cross-national
material in different time periods. The initial sample of wave 2 included 31,009
respondents; of these, 16,106 persons were excluded from the analysis as 6,332 of
them had died (39.3%), whereas 9,774 (60.6%) had not taken part in wave 6. In
total, the number of individuals who are included in the analysis was 14,903, 6,384
males and 8,519 females, originating in the following European countries: Austria,
Germany, Sweden, Spain, Italy, France, Denmark, Greece, Switzerland, Belgium,
Czech Republic and Poland.

29.2.2. Measures

29.2.2.1. Dependent variable


Depression in older ages is measured by the EURO-D scale (Beekman et al.
1999; Prince et al. 1999a, 1999b; Börsch-Supan and Jurges 2005) and includes the
following 12 symptoms: depression, pessimism, suicidality, guilt, sleep, interest,

irritability, appetite, fatigue, concentration, enjoyment and tearfulness. This scale
has been reduced to a binary indicator, comparing respondents having reported 0–3
symptoms (not having depression) with those having reported 4 or more symptoms
(having depression). This cut-off point has been validated by several scientists as an
appropriate measure for depression (Prince et al. 1999a, 1999b; Dewey and Prince
2005; Castro-Costa et al. 2007, 2008). The outcome variable in this study is
considered as the difference in this binary variable between the two waves. Hence,
the dependent variable has three categories, indicating i) an improvement in mental
health between the waves (i.e. the respondent had depression in wave 2 but did not
have in wave 6), ii) a worsening in mental health (in the opposite case) and iii) no
change in mental health status.
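
A small pandas sketch of this construction (column names and values are illustrative assumptions, not those of the SHARE release) may clarify the coding of the outcome:

import pandas as pd

# Illustrative EURO-D symptom counts (0-12) at wave 2 and wave 6
df = pd.DataFrame({"eurod_w2": [2, 5, 4, 1], "eurod_w6": [6, 1, 4, 0]})

# Binary indicator with the 4+ symptoms cut-off used in the chapter
dep_w2 = (df["eurod_w2"] >= 4).astype(int)
dep_w6 = (df["eurod_w6"] >= 4).astype(int)

# Three-category outcome: improvement, worsening or no change between waves
df["dep_change"] = (dep_w6 - dep_w2).map(
    {-1: "improvement", 1: "worsening", 0: "no change"})
print(df)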

29.2.2.2. Independent variables


The independent variables cover health, cognitive status, SES, demographic
characteristics as well as factors reflecting the respondents’ perspective of life.

Health status is represented by a binary indicator describing whether respondents
suffered from a long-term illness, including cancer, cardiovascular problems, stroke
or other serious diseases. Moreover, a variable measuring limitations in instrumental
activities of daily living (out of a list of seven) is taken into consideration. These
activities include 1) using a map to figure out how to get around in a strange place,
2) preparing a hot meal, 3) shopping for groceries, 4) making telephone calls,
5) taking medication, 6) doing work around the house or garden and 7) managing
money, such as paying bills and keeping track of expenses. Further, a binary
indicator was included in the analysis showing whether individuals had to face
mobility limitations. Concerning cognitive function, orientation in time is used on a
scale from 0 (bad) to 4 (good). SES for individuals is based on their educational
attainment (measured in years of education) and on how easily they could
make ends meet. The perspective of life for respondents is represented by life
satisfaction, measured on a scale from 1 (least satisfied) to 10 (most satisfied).
Demographic factors include age of respondents at the time of interview, gender and
marital status; the latter distinguishes individuals being married or in a registered
partnership from those never married, divorced and widowed. Country of residence
is also considered in the models as a control variable.

All the above-mentioned variables recorded at wave 2 have been included in the
models to control for baseline characteristics. Further, to assess the impact of
transitions in health, SES and marital status between waves 2 and 6 on depression,
variables reflecting changes in long-term illness, mobility limitations, instrumental
activities of daily living, orientation in time, life satisfaction, financial hardship and
marital status have also been included in the analysis. These variables have three
categories, indicating improvement, worsening or no change; the latter category has
been selected as a reference category for all variables defined this way.
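
As a sketch only (the column names mobility_w2 and mobility_w6 are hypothetical), the
same three-category coding can be obtained for any numeric predictor as follows; the
resulting "no change" category is the reference in the models:

import numpy as np
import pandas as pd

# Hypothetical numbers of mobility limitations reported at waves 2 and 6.
df = pd.DataFrame({"mobility_w2": [0, 2, 1], "mobility_w6": [1, 0, 1]})

# Worsening if more limitations are reported at wave 6, improvement if fewer,
# and no change otherwise (the reference category).
df["mobility_change"] = np.select(
    [df["mobility_w6"] > df["mobility_w2"],
     df["mobility_w6"] < df["mobility_w2"]],
    ["worsening", "improvement"],
    default="no change",
)
print(df)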

29.2.2.3. Statistical analysis


For the purposes of the analysis, multinomial logistic regression models were
applied for the total sample and separately by gender. Further, a comparison
between respondents and non-respondents was carried out, using a logistic
regression model. All models were run using SPSS, version 20.
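
The models themselves were fitted in SPSS; purely as an illustrative sketch (not the
authors' code), comparable models could be specified with statsmodels in Python,
assuming hypothetical data frames and variable names in the spirit of the examples
above (df for the analysis sample, df_w2 for the full wave 2 sample with a
non_respondent indicator):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Code the three-category outcome so that 'no change' (code 0) is the
# reference category of the multinomial logit.
df["dep_change"] = pd.Categorical(
    df["dep_change"], categories=["no change", "improvement", "worsening"]
)
df["dep_change_code"] = df["dep_change"].cat.codes

# Multinomial logistic regression for the total sample; the gender-specific
# models are simply refitted on the male and female subsamples.
mnl = smf.mnlogit(
    "dep_change_code ~ C(mobility_change) + C(illness_change)"
    " + age_w2 + C(gender) + C(country)",
    data=df,
).fit()
print(np.exp(mnl.params))  # odds ratios for each contrast vs. 'no change'

# Logistic regression comparing non-respondents at wave 6 with respondents,
# adjusted for country of residence.
attrition = smf.logit("non_respondent ~ age_w2 + C(gender) + C(country)",
                      data=df_w2).fit()
print(np.exp(attrition.params))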

29.3. Results

29.3.1. Descriptive findings

Table 29.1 shows the percentage distribution for the factors included in the
analysis regarding the samples of respondents and non-respondents separately,
focusing on their differences. The population of non-respondents is a little older
compared to respondents (mean age 66.86 years compared to 62.13 years for
respondents), whereas educational qualifications for both groups are similar (mean
values of education equal to 10.06 and 10.71 years). Further, the percentage of males
is higher among non-respondents. Regarding depression, individuals who dropped out
of the study at wave 6 were more vulnerable to depression (the prevalence of the
disorder was 28.40% and 22.00%, respectively).

Regarding health status, it seems that non-respondents include lower proportions
suffering from a long-term illness (47.20% vs. 57.10%), whereas they had to deal
with more limitations in instrumental activities of daily living and mobility
limitations (22.90% vs. 10.40% and 30.80% vs. 18.00%). Concerning factors
measuring cognition and perspective of life, non-participants indicate somewhat
worse levels of orientation in time (mean value equal to 3.65 vs. 3.89) and of life
satisfaction (mean value equal to 7.32 vs. 7.67). Further, results measuring economic
hardship show that non-participants face slightly higher financial difficulties.
Finally, the distribution of marital status is similar between the two groups for
persons in a registered partnership, never married or divorced, whereas the
proportion of widowed persons was higher among non-respondents (19.00% vs.
11.20%) and the proportion of married people living with their spouse was lower in
that group (66.60% vs. 74.60%). To conclude, non-participants demonstrate on average worse health,
orientation in time, life satisfaction and economic status compared to participants.

Later-life predictors                          Total sample (%)    Participants (%)    Non-participants (%)
Health factors
Long-term illness
No 47.90 42.90 52.20
Yes 52.10 57.10 47.20
Instrumental activities of daily living
No 83.10 89.60 77.10
Yes 16.90 10.40 22.90
Mobility limitations
No 75.40 82.00 69.20
Yes 24.60 18.00 30.80
Cognitive function
Orientation in time (0: bad – 4: good) 3.76* 4.00** 3.89* 4.00** 3.65* 4.00**
Perspective of life
Life satisfaction (0: completely dissatisfied – 10: completely satisfied)   7.49* 8.00**   7.67* 8.00**   7.32* 8.00**
Socio economic status
Household able to make ends meet
Easily 12.10 12.00 12.10
With great difficulty 30.10 29.30 30.90
With some difficulty 32.80 32.40 33.20
Fairly easily 25.00 26.30 23.70
Educational attainment   10.37* 11.00**   10.71* 11.00**   10.06* 10.00**
Demographic characteristics
Age at the time of interview   64.59* 63.00**   62.13* 61.00**   66.86* 66.00**
Gender
Males 44.10 42.80 45.20
Females 55.90 57.20 54.80
Marital status
Married, living with spouse 70.40 74.60 66.60
Registered partnership 1.30 1.40 1.30
Married, not living with spouse 1.30 1.30 1.40
Never married 5.00 4.90 5.10
Divorced 6.60 6.60 6.60
Widowed 15.20 11.20 19.00
Control variables
Country (N)
Austria 1,200 526 674
Germany 2,628 903 1,725
Sweden 2,796 1,432 1,364
Spain 2,427 1,241 1,186
Italy 2,986 1,655 1,331
France 2,989 1,129 1,860
Denmark 2,630 1,419 1,211
Greece 3,412 2,011 1,401
Switzerland 1,498 803 695
Belgium 3,227 1,659 1,568
Czech Republic 2,750 956 1,794
Poland 2,466 1,169 1,297
Depression levels
(based on EURO-D)
No 74.70 78.00 71.60
Yes 25.30 22.00 28.40
Total sample (N) 31,009 14,903 16,106
Mean*, median**.

Table 29.1. Descriptive statistics (baseline characteristics) for the total sample (wave 2),
the participants in the analysis (respondents at both waves 2 and 6) and non-participants at wave 6

29.3.2. Non-respondents compared to respondents at baseline (wave 2)

Table 29.2 shows odds ratios and confidence intervals based on logistic
regression models comparing non-respondents at wave 6 (participating only at
wave 2) to respondents (i.e. participating both waves 2 and 6). The odds ratios are
adjusted for country of residence. Findings regarding age and education are
significant, whereas the opposite holds for long-term illness; the remaining factors
also point to differences in the characteristics of these groups. For instance,
non-respondents include more males and are older compared to respondents, they
include lower proportions reporting a high degree of life satisfaction (Odds Ratio
(OR) = 0.956), lower proportions of persons married and living with a spouse
(OR = 0.892) and of persons making ends meet with great difficulty (OR = 0.910).
Further, fewer of them report good orientation in time
(OR = 0.754) while they exhibit a slightly higher likelihood of having depression
(OR = 1.069) and of suffering from mobility limitations and limitations in
instrumental activities of daily living (ORs 1.201 and 1.423, respectively).

Later-life predictors Total sample


Health factors
Long-term illness
No (ref. cat.) 1
Yes 0.991
(0.940 1.044)
Instrumental activities of daily living
No (ref. cat.) 1
Yes 1.423***
(1.313 1.542)
Mobility limitations
No (ref. cat.) 1
Yes 1.201***
(1.118 1.289)
Cognitive function
Orientation in time 0.754***
(0.716 0.794)
Perspective of life
Life satisfaction 0.956***
(0.941 0.971)
Socio economic status


Household able to make ends meet
Easily (ref. cat.) 1
With great difficulty 0.910**
(0.825 1.003)
With some difficulty 1.023
(0.951 1.101)
Fairly easily 1.005
(0.941 1.073)
Educational attainment 0.983***
(0.976 0.989)
Demographic characteristics
Age at the time of interview 1.032***
(1.029 1.035)
Gender
Females (ref. cat.) 1
Males 1.203***
(1.144 1.266)
Marital status
Widowed (ref. cat.) 1
Married, living with spouse 0.892***
(0.825 0.964)
Registered partnership 1.202
(0.962 1.502)
Married, not living with spouse 1.036
(0.834 1.287)
Never married 0.994
(0.874 1.130)
Divorced 0.977
(0.869 1.098)
Depression levels (based on EURO-D)
No (ref. cat.) 1
Yes 1.069**
(1.003 1.139)
*** p-value < 0.01, ** p-value < 0.05.
a. The model controls for country of residence.

Table 29.2. Odds ratios and confidence intervals comparing baseline (wave 2)
characteristics of non-respondents to respondents (a)

29.3.3. Descriptive findings for respondents – analysis by gender

Table 29.3 shows descriptive results concerning the total sample of the
respondents and for males and females separately. Because the measurements cover
different time periods, i.e. before and after the economic downturn, we consider
three types of possible transitions for each predictor: an
increase, a decline and no change. Regarding the overall population, all health
factors included in the analysis seem to indicate worse post-crisis health levels.
Moreover, a decline in cognitive function is notable based on the index measuring
orientation in time. Nevertheless, it is obvious that a high percentage of individuals
did not experience any change in the above-mentioned factors. By contrast, factors
relating to a sense of life satisfaction and financial difficulties seem to have
improved. Indeed, a significant portion of the population reported increasing levels
in life satisfaction (37.90%), indicating a more optimistic perspective, as well as
decreasing financial difficulties (33.80%) indicating better socioeconomic status,
whereas a high portion of the population did not exhibit a transition referring to
those predictors. Regarding educational attainment of individuals, the mean value is
10.71 years. Regarding marital status, becoming alone was a more common transition
following the economic crisis than entering a relationship (7.10% vs. 0.60%); being alone is
expressed through the transition from being married, living with spouse or in a
registered partnership to i) married, not living with spouse, or never married or
divorced and ii) widowed. Moreover, being in a relationship refers to the opposite
transition. Finally, depression levels seem to exhibit an increase for more individuals
in the total sample in that period (15.50%) rather than a decrease (10.50%).

The sample consists of 6,384 males and 8,519 females with a mean age of
approximately 62 years. There are gender differences to observe. The greatest
difference is detected in mobility limitations, where a more severe worsening is
observed among females (16.40% vs. 12.20% for males). Taking into consideration
life perspectives, it is more frequent for women to enhance their sense of life
satisfaction (39.10% vs. 36.20% for males). Concerning SES, results are similar for
both genders relative to the decreasing or increasing economic hardship. On the
other hand, there is a slight difference in educational qualifications; females have on
average 10.35 years of education compared to males having 11.18 years. Regarding
marital status, the vast majority of males and females did not experience any change
though a higher proportion of women seem to have changed to “be alone” status
compared to men (9.00% vs. 4.70%). Following the recession, depression levels
increased more for women (17.50% vs. 12.80% for men).

Later-life predictors                          Total sample (%)    Males (%)    Females (%)
Health factors
Long-term illness
Better health 11.80 11.20 12.20
Worse health 19.50 20.80 18.50
No change 68.70 68.00 69.20
Instrumental activities of daily living
Better health 4.90 3.00 6.40
Worse health 13.30 11.40 14.60
No change 81.80 85.50 79.00
Mobility limitations
Better health 5.90 4.70 6.80
Worse health 14.60 12.20 16.40
No change 79.50 83.10 76.80
Cognitive function
Orientation in time
Worse health 10.30 10.40 10.20
Better health 7.70 8.20 7.30
No change 82.00 81.50 82.50
Perspective of life
Life satisfaction
Worse health 29.70 29.70 29.70
Better health 37.90 36.20 39.10
No change 32.50 34.10 31.20
Socio economic status
Household able to make ends meet
Decreased difficulty 33.80 33.50 34.10
Increased difficulty 19.90 18.90 20.6
No change 46.30 47.60 45.30
Educational attainment at wave 2   10.71* 11.00**   11.18* 11.00**   10.35* 11.00**

Demographic characteristics
Age at the time of interview at wave 2   62.13* 61.00**   62.63* 62.00**   61.76* 61.00**
Gender (wave 2)
Males 42.80
Females 57.20
Marital status
Becoming alone 7.10 4.70 9.00
Being in a new relationship 0.60 0.80 0.40
No change 92.30 94.50 90.6
Control factor
Country of residence at wave 2 (N)
Austria 526 198 328
Germany 903 425 478
Sweden 1,432 620 812
Spain 1,241 517 724
Italy 1,655 734 921
France 1,129 471 658
Denmark 1,419 634 785
Greece 2,011 863 1,148
Switzerland 803 341 462
Belgium 1,659 738 921
Czech Republic 956 362 594
Poland 1,169 481 688

Depression levels
(based on EURO-D)
Improvement 10.50 7.30 12.80
Worsening 15.50 12.8 17.50
No change 74.00 79.80 69.70
Total sample (N) 14,903 6,384 8,519
Mean*, median**.

Table 29.3. Descriptive statistics for the predictors in the model
for the total sample and by gender

29.3.4. Findings regarding decreasing depression levels – analysis for the total sample and by gender

Table 29.4 shows the findings based on multinomial logistic regression models
for factors associated with decreasing depression levels, controlling for wave 2
characteristics. Overall, a decrease in limitations in instrumental activities of daily
living and in mobility has the strongest effect on decreasing depression levels (ORs
equal to 1.533 and 1.475, respectively): such a reduction increases the relative
chances of a decrease in depression by 53.3% and 47.5%, respectively. Moreover, an
improvement in orientation in time and in life satisfaction, reflecting better cognitive
function and perspective of life, is associated with a higher likelihood of a decreasing
depression burden (ORs equal to 1.407 and 1.369, respectively). In contrast,
predictors related to economic hardship, educational attainment and changes in
marital status are insignificant.
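
For reference, the percentage interpretation used here follows from the standard
multinomial logit formulation with "no change" as the reference outcome (a textbook
identity, not a result specific to this chapter):

\log \frac{P(Y_i = j \mid \mathbf{x}_i)}{P(Y_i = \text{no change} \mid \mathbf{x}_i)} = \mathbf{x}_i^{\top}\boldsymbol{\beta}_j, \qquad j \in \{\text{improvement}, \text{worsening}\},

so that each reported odds ratio equals e^{\beta} for the corresponding coefficient,
and the implied change in the relative chances is (OR − 1) × 100%; for example,
OR = 1.533 gives (1.533 − 1) × 100% = 53.3%.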

Concerning genders, it is clear that there are distinct differences between males
and females. Males are more likely to experience an improvement in
depression due to a decrease in long-term illnesses (odds ratio equal to 1.316), while
for females, this factor seems insignificant. For women, a reduction in instrumental
activities of daily living and mobility limitations contributes to a major decrease in
depression (odds ratios equal to 1.554 and 1.420, respectively). Additionally, an
improvement in orientation in time seems to matter only for men (OR = 1.571),
while for women, it is unimportant. Regarding perspective of life, the sense of life
satisfaction is important for both sexes in an equivalent manner (OR = 1.379 for
males and 1.382 for females). Finally, socioeconomic status (economic difficulties
and educational attainment) as well as becoming single at wave 6 are insignificant
factors for both sexes.

Later-life predictors                          Total    Males    Females
Health factors
Long-term illness
No change (ref. cat.) 1 1 1
Better health 1.179 1.316 1.139
(0.994 1.398) (0.968 1.790) (0.927 1.399)
Worse health 0.953 0.881 0.990
(0.798 1.139) (0.647 1.200) (0.796 1.232)
Instrumental activities of daily living


No change (ref. cat.) 1 1 1
Better health 1.533*** 1.399 1.554***
(1.143 2.056) (0.768 2.546) (1.109 2.178)
Worse health 0.876 1.145 0.774**
(0.715 1.072) (0.797 1.644) (0.607 0.987)
Mobility limitations
No change (ref. cat.) 1 1 1
Better health 1.475*** 1.283 1.420**
(1.165 1.868) (0.821 2.004) (1.074 1.879)
Worse health 1.009 0.740 1.084
(0.837 1.216) (0.500 1.094) (0.874 1.344)
Cognitive function
Orientation in time
No change (ref. cat.) 1 1 1
Worse health 0.892 0.911 0.881
(0.733 1.086) (0.650 1.279) (0.690 1.123)
Better health 1.407** 1.571 1.354
(1.019 1.943) (1.022 2.689) (0.900 2.038)
Perspective of life
Life satisfaction
No change (ref. cat.) 1 1 1
Worse health 1.051 1.159 1.002
(0.893 1.238) (0.867 1.547) (0.821 1.222)
Better health 1.369*** 1.379** 1.382***
(1.189 1.578) (1.070 1.777) (1.163 1.641)
Socio economic status
Household able to make ends meet
No change (ref. cat.) 1 1 1
Decreased difficulty 1.076 1.044 1.080
(0.940 1.233) (0.819 1.332) (0.916 1.273)
Increased difficulty 1.005 1.216 0.923
(0.855 1.181) (0.911 1.622) (0.759 1.122)
Educational attainment at wave 2 1.003 0.987 1.014
(0.988 1.019) (0.961 1.014) (0.995 1.035)
Demographic characteristics
Age at the time of interview at wave 2 0.988*** 1.006 0.981***
(0.980 0.995) (0.992 1.020) (0.971 0.990)
Gender (wave 2)
Females (ref. cat.) 1
Males 0.567***
(0.501 0.642)
Marital status
No change (ref. cat.) 1 1 1
Becoming alone 0.840 1.180 0.778
(0.669 1.056) (0.756 1.843) (0.596 1.015)
Being in a relationship 1.097 1.249 0.930
(0.543 2.215) (0.452 3.451) (0.345 2.510)
*** p-value < 0.01, ** p-value < 0.05.
a. All models control for wave 2 characteristics and country of residence.

Table 29.4. Odds ratios and confidence intervals for predictors related
to an improvement in depression (total model and by gender)a

29.3.5. Findings regarding increasing depression levels – analysis for the total sample and by gender

Table 29.5 shows odds ratios and confidence intervals for all factors that
contribute to increasing depression levels following the recession, controlling for
wave 2 characteristics. It is evident that the significant factors for the total sample
and by sex are the same. In particular, a deterioration in the health factors contributes
the most to a worsening in depression. In fact, for the total sample and females, the
highest odds ratios are related to long-term illness (equal to 1.816 and 1.786,
respectively), whereas for males, instrumental activities of daily living are of major
importance (OR = 2.164). A worsening in orientation in time has an equal
contribution to the increase of depression for the total sample and both genders;
the OR ranges from 1.347 to 1.364. Moreover, a decrease in life satisfaction almost
doubles the likelihood of increasing depression for the whole population and both
sexes (OR almost equal to 2). This is a
significant factor, taking into consideration that increasing life satisfaction decreases
the relative chances of increasing depression levels by 22.3% for the total sample
and by 32.7% for males and 15.5% for females. Increasing economic difficulties
seem to strengthen the increase in depression levels, especially for males (OR equal
to 1.614). On the other hand, even though educational attainment and age at the time
of interview are significant factors, they do not substantially alter the increase in
depression burden. Finally, becoming alone at wave 6 is associated with higher
chances regarding an increase in depression, only for males (OR = 1.550).

Later-life predictors                          Total    Males    Females
Health factors
Long-term illness
No change (ref. cat.) 1 1 1
Better health 0.822** 0.756 0.873
(0.691 0.978) (0.563 1.017) (0.704 1.082)
Worse health 1.816*** 1.856*** 1.786***
(1.589 2.075) (1.485 2.319) (1.508 2.115)
Instrumental activities of daily
living
No change (ref. cat.) 1 1 1
Better health 0.839 0.779 0.820
(0.617 1.139) (0.402 1.509) (0.579 1.160)
Worse health 1.796*** 2.164*** 1.587***
(1.569 2.057) (1.732 2.704) (1.337 1.885)
Mobility limitations
No change (ref. cat.) 1 1 1
Better health 1.001 0.858 0.993
(0.786 1.273) (0.538 1.367) (0.748 1.318)
Worse health 1.752*** 1.878*** 1.673***
(1.534 2.001) (1.509 2.336) (1.413 1.980)
Cognitive function
Orientation in time
No change (ref. cat.) 1 1 1
Worse health 1.364*** 1.347** 1.354***
(1.182 1.575) (1.066 1.703) (1.126 1.627)
Better health 0.945 0.973 0.934
(0.714 1.250) (0.618 1.532) (0.653 1.338)
Perspective of life
Life satisfaction
No change (ref. cat.) 1 1 1
Worse health 1.989*** 1.980*** 1.973***
(1.767 2.237) (1.628 2.410) (1.701 2.290)
Better health 0.777*** 0.673*** 0.845**
(0.680 0.887) (0.539 0.840) (0.715 0.999)
Socio economic status
Household able to make ends meet
No change (ref. cat.) 1 1 1
Decreased difficulty 0.850*** 0.781** 0.887
(0.752 0.960) (0.635 0.960) (0.761 1.034)
Increased difficulty 1.485*** 1.614*** 1.438***
(1.313 1.679) (1.315 1.981) (1.232 1.678)
Educational attainment at wave 2 1.013 1.003 1.021**
(0.999 1.026) (0.982 1.023) (1.004 1.039)
Demographic characteristics
Age at the time of interview at wave 2   1.013*** 1.022*** 1.009**
(1.006 1.019) (1.011 1.033) (1.001 1.017)
Gender (wave 2)
Females (ref. cat.) 1
Males 0.657***
(0.593 0.729)
Marital status
No change (ref. cat.) 1 1 1
Becoming alone 1.125 1.550*** 1.033
(0.947 1.337) (1.120 2.146) (0.842 1.267)
Being in a relationship 1.723 1.735 1.620
(0.945 3.143) (0.758 3.968) (0.640 4.098)
*** p-value < 0.01, ** p-value < 0.05.
a. All models control for wave 2 characteristics and country of residence.
Table 29.5. Odds ratios and confidence intervals for predictors related
to increasing depression levels (total model and by gender)a

29.4. Discussion

The main aim of this study is to explore how changes occurring in the period
2007–2015 in various aspects of life, including health, attitude towards life, SES and
marital status, affect depression. Multinomial logistic regression models were
applied in order to detect which factors are associated with decreasing or increasing
depression levels. The study focused on the total sample and on differences
between genders.

The descriptive analysis showed that health, including mobility limitations,
instrumental activities of daily living and long-term illnesses, had worsened for the
total sample in the period under investigation. Further, cognitive function had
declined as well. In contrast, the sense of life satisfaction was enhanced, and
economic difficulties seemed to be reduced. Regarding marital status, more people
had ended up alone. Finally, levels of depression had also increased for the total
population. Considering the genders separately, it is observed that health for both
men and women had declined after the economic recession, especially for women
due to an increase in mobility limitations. Moreover, depression levels seem to have
increased more among women. Past analyses support these findings and explain that
the recession has proved a significant risk factor for the development of mental health
problems for both men and women (Frasquilho et al. 2016). On the other hand,
Glonti et al. (2015) claimed that the economic downturn has mainly affected the
mental health of women. Moreover, the feeling of life satisfaction is more prevalent
among females.

Concerning factors leading to decreasing depression levels, the highest
contribution is due to the reduction in instrumental activities of daily living and
mobility limitations. Furthermore, better cognitive function and life satisfaction
levels also play a role. Regarding genders, for males, a reduction in long-term
illnesses is more important, whereas for females, a decline in instrumental activities
of daily living and mobility limitations plays a greater part. Better scores in
orientation in time seem to influence only males. Other factors included in the
analysis such as economic difficulties, educational qualifications and living with a
partner were insignificant. Factors related to increasing depression levels include a
worsening in health. Men seem to be more vulnerable to an increase in limitations in
instrumental activities of daily living, while women seem to be more vulnerable to
an increase in long-term illness. Worse levels in orientation in time and life
satisfaction affect both sexes, increasing depression scores. For males, an increase in
economic difficulties and transitioning to the “alone” status also seem significant
risk factors. Finally, educational attainment and age at the time of interview are of
less importance.

Concerning past literature, some findings are in accordance with our results,
whereas others are contradictory. Undoubtedly, past analysis has found that
limitations in instrumental activities of daily living and mobility limitations
negatively affect psychological health (Backe et al. 2017; Musich et al. 2018).
Further, poor cognitive function contributes to a deterioration in depressive
disorders in older ages (Hammar and Ardal 2009; Giri et al. 2016). In contrast, life
satisfaction seems to act positively, predicting better mental health (Beutel et al.
2009; Srivastava 2016). Contrary to our results, other studies have found that an
increase in SES, such as better educational attainment, leads to a lower prevalence of
depression (Bjelland et al. 2008; Zhang et al. 2012; Freeman et al. 2016).

Some limitations of the study should be discussed. First, the percentage of
individuals who were excluded from the analysis due to non-participation in wave 6
is high. Hence, 14,903 observations were included out of a total of 31,009
respondents at wave 2. Most of the non-respondents were males; overall, they were
older than the respondents and reported higher levels of depression. Moreover, they
had on average worse health, with the exception of long-term illnesses. They also
reported low levels of life satisfaction, but they did not seem to differ substantially
from respondents regarding their ability to manage economic difficulties. Hence,
based on our findings and the characteristics of non-respondents, we expect our
findings to represent a lower bound of the possible effects of deteriorating health on
depression. Further, we consider our findings representative for persons with better
than average health.

Further, a crucial point to consider is that, over the time period of the transitions
examined in the analysis, persons in the longitudinal sample have grown older by
about seven years on average, so the deterioration in their health is attributable
not only to the economic recession but also to physical wear due to the
ageing process. This fact should be taken into account when interpreting the
findings.

Future research would benefit from the inclusion of more detailed information on
the magnitude, the onset and the duration of exposures to disadvantage and
advantage, which would allow measuring inequality with greater accuracy. Further,
it would be of great interest to study the effects of such factors using information not
only from late adulthood, as in the present case, but also from childhood and middle
adulthood as well, estimating their effect in a cumulative way.

29.5. Conclusion

The present study aimed at assessing the transitions in depression levels for the
European population aged 50 or higher, after the economic recession, considering
the role played by transitions in health-related factors, cognitive status, life
perspective, SES and demographic characteristics. Emphasis was put on gender
differentials. Results showed that the depression burden has increased after the
economic downturn, mainly for females. In total, the reduction in limitations in
instrumental activities of daily living and in mobility has the greatest effect on
decreasing depression levels, whereas the increase in long-term illness enhances
depression in the post-recession time. Regarding males, the decrease in long-term
illness and the improvement in cognitive function led to lower depression levels,
whereas for women, the same holds concerning a reduction in mobility and
instrumental activities of daily living limitations. Additionally, an increase in
limitations in instrumental activities of daily living is related to increasing depression
levels for males, whereas an increase in long-term illnesses plays this role for
females. Worse orientation in time and life
satisfaction seem to have a negative impact on depression for both genders, whereas
increasing economic difficulties led to a higher depression risk, especially for males.

29.6. Acknowledgments

This work was fully supported by the General Secretariat for Research and
Technology (GSRT) and the Hellenic Foundation for Research and Innovation
(HFRI).

29.7. References

Bacigalupe, A., Esnaola, S., Martín, U. (2016). The impact of the great recession on mental
health and its inequalities: The case of a Southern European region, 1997–2013.
International Journal for Equity in Health, 15(1), 1–10.
Backe, I.F., Patil, G.G., Nes, R.B., Clench-Aas, J. (2017). The relationship between physical
functional limitations, and psychological distress: Considering a possible mediating role
of pain, social support and sense of mastery. SSM – Population Health, 4, 153–163.
Beekman, A.T., Copeland, J.R., Prince, M.J. (1999). Review of community prevalence of
depression in later life. The British Journal of Psychiatry, 174(4), 307–311.
Benton, T., Staab, J., Evans, D.L. (2007). Medical co-morbidity in depressive disorders.
Annals of Clinical Psychiatry, 19(4), 289–303.
Beutel, M.E., Glaesmer, H., Wiltink, J., Marian, H., Brähler, E. (2009). Life satisfaction,
anxiety, depression and resilience across the life span of men. The Aging Male, 13(1),
32–39.
Bjelland, I., Krokstad, S., Mykletun, A., Dahl, A.A., Tell, G.S., Tambs, K. (2008). Does a
higher educational level protect against anxiety and depression? The HUNT study. Social
Science & Medicine, 66(6), 1334–1345.
Börsch-Supan, A. and Jurges, H. (2005). The Survey of Health, Aging and Retirement in
Europe. Methodology. Mannheim Research Institute for the Economics of Ageing.
Mannheim, Germany.
Börsch-Supan, A., Brandt, M., Hunkler, C., Kneip, T., Korbmacher, J., Malter, F., Schaan, B.,
Stuck, S., Zuber, S. (2013). Data resource profile: The survey of health, ageing and
retirement in Europe (SHARE). International Journal of Epidemiology, 42(4), 992–1001.
Castro-Costa, E., Dewey, M., Stewart, R., Banerjee, S., Huppert, F., Mendonca-Lima, C.,
Bula, C., Reisches, F., Wancata, J., Ritchie, K. et al. (2007). Prevalence of depressive
symptoms and syndromes in later life in ten European countries: The SHARE study. The
British Journal of Psychiatry, 191(5), 393–401.
Castro-Costa, E., Dewey, M., Stewart, R., Banerjee, S., Huppert, F., Mendonca-Lima, C.,
Bula, C., Reisches, F., Wancata, J., Ritchie, K. (2008). Ascertaining late-life depressive
symptoms in Europe: An evaluation of the survey version of the EURO-D scale in 10
nations. The SHARE project. International Journal of Methods in Psychiatric Research,
17(1), 12–29.
Dagher, R.K., Chen, J., Thomas, S.B. (2015). Gender differences in mental health outcomes
before, during, and after the great recession. PLoS ONE, 10(5), 1–16.
Dewey, M.E. and Prince, M.J. (2005). Mental health. In Health, Ageing and Retirement in
Europe, First Results from the Survey of Health, Ageing and Retirement in Europe.
Borsch-Supan, A., Brugiavini, A., Jurges, H., Mackenbach, J., Siegrist, J., Weber, G.
(eds), Mannheim Research Institute for the Economics of Ageing (MEA), Mannheim,
Germany.
Fenton, W.S. and Stover, E.S. (2006). Mood disorders: Cardiovascular and diabetes
comorbidity. Current Opinion in Psychiatry, 19(4), 421–427.
Frasquilho, D., Matos, M.G., Salonna, F., Guerreiro, D., Storti, C.C., Gaspar, T., Caldas-de-
Almeida, J.M. (2016). Mental health outcomes in times of economic recession: A
systematic literature review. BMC Public Health, 16(1), 115, 1–40.
Frasquilho, D., Cardoso, G., Ana, A., Silva, M., Caldas-de-Almeida, J.M. (2017). Gender
differences on mental health distress: Findings from the economic recession in Portugal.
European Psychiatry, 41, S(1), S902–S902.
Freeman, A., Tyrovolas, S., Koyanagi, A., Chatterji, S., Leonardi, M., Ayuso-Mateos, J.L.,
Tobiasz-Adamczyk, B., Koskinen, S., Rummel-Kluge, C., Haro, J.M. (2016). The role of
socio-economic status in depression: Results from the COURAGE (aging survey in
Europe). BMC Public Health, 16(1), 1–8.
Fryers, T., Melzer, D., Jenkins, R., Brugha, T. (2005). The distribution of the common mental
disorders: Social inequalities in Europe. Clinical Practice and Epidemiology in Mental
Health, 1(1), 1–12.
Gili, M., López-Navarro, E., Castro, A., Homar, C., Navarro, C., García-Toro, M., García-
Campayo, J., Roca, M. (2016). Gender differences in mental health during the economic
crisis. Psicothema, 28(4), 407–413.
Giri, M., Chen, T., Yu, W., Lü, Y. (2016). Prevalence and correlates of cognitive impairment
and depression among elderly people in the world’s fastest growing city, Chongqing,
People’s Republic of China. Clinical Interventions in Aging, 11, 1091–1098.
Glonti, K., Gordeev, V.S., Goryakin, Y., Reeves, A., Stuckler, D., McKee, M., Roberts, B.
(2015). A systematic review on health resilience to economic crises. PLoS One, 10(4),
1–22.
Gunn, J.M., Ayton, D.R., Densley, K., Pallant, J.F., Chondros, P., Herrman, H.E., Dowrick,
C.F. (2010). The association between chronic illness, multimorbidity and depressive
symptoms in an Australian primary care cohort. Social Psychiatry and Psychiatric
Epidemiology, 47(2), 175–84.
Gunnell, D., Donovan, J., Barnes, M., Davies, R., Hawton, K., Kapur, N., Hollingworth, W.,
Metcalfe, C. (2015). The 2008 global financial crisis: Effects on mental health and
suicide. Policy Bristol, Policy Report 3/2015.
Hammar, A. and Ardal, G. (2009). Cognitive functioning in major depression – A
summary. Frontiers in Human Neuroscience, 3(26), 1–7.
Jofre-Bonet, M., Serra-Sastre, V., Vandoros, S. (2018). The impact of the great recession on
health-related risk factors, behaviour and outcomes in England. Social Science &
Medicine, 197, 213–225.
Li, T. and Fung, H.H. (2013). Age differences in trust: An investigation across 38
countries. Journals of Gerontology: Series B, 68(3), 347–355.
Margerison-Zilko, C., Goldman-Mellor, S., Falconi, A., Downing, J. (2016). Health impacts
of the great recession: A critical review. Current Epidemiology Reports, 3(1), 81–91.
Martin-Carrasco, M., Evans-Lacko, S., Dom, G., Christodoulou, N.G., Samochowiec, J.,
González-Fraile, E., Bienkowski, P., Dos Santos, M.J., Wasserman, D. (2016). EPA
guidance on mental health and economic crises in Europe. European Archives of
Psychiatry and Clinical Neuroscience, 266(2), 89–124.
Musich, S., Wang, S.S., Ruiz, J., Hawkins, K. Wicker, E. (2018). The impact of mobility
limitations on health outcomes among older adults. Geriatric Nursing, 39(2), 162–169.
Ormel, J., Rijsdijk, F.V., Sullivan, M., van Sonderen, E., Kempen, G.I. (2002). Temporal and
reciprocal relationship between IADL/ADL disability and depressive symptoms in late
life. The Journals of Gerontology: Series B, 57(4), 338–347.
Parmar, D., Stavropoulou, C., Ioannidis, J.P. (2016). Health outcomes during the 2008
financial crisis in Europe: Systematic literature review. BMJ, 354, 1–11.
Prince, M.J., Reischies, F., Beekman, A.T., Fuhrer, R., Jonker, C., Kivela, S.L., Lawlor, B.A.,
Lobo, A., Magnusson, H., Fichter, M. (1999a). Development of the EURO-D scale – A
European Union initiative to compare symptoms of depression in 14 European
centres. The British Journal of Psychiatry, 174(4), 330–338.
Prince, M.J., Beekman, A.T., Deeg, D.J., Fuhrer, R., Kivela, S.L., Lawlor, B.A., Lobo, A.,
Magnusson, H., Meller, I., van Oyen, H. (1999b). Depression symptoms in late life
assessed using the EURO-D scale. Effect of age, gender and marital status in 14 European
centres. The British Journal of Psychiatry, 174(4), 339–345.
Pruchno, R., Heid, A.R., Wilson-Genderson, M. (2017). The great recession, Life events, and
mental health of older adults. The International Journal of Aging and Human
Development, 84(3), 294–312.
Sarkisian, C.A., Hays, R.D., Mangione, C.M. (2002). Do older adults expect to age
successfully? The association between expectations regarding aging and beliefs regarding
healthcare seeking among older adults. Journal of the American Geriatrics Society,
50(11), 1837–1843.
Simon, G.E., Katon, W.J., Lin, E.H., Rutter, C., Manning, W.G., Von Korff, M.,
Ciechanowski, P., Ludman, E.J., Young, B.A. (2007). Cost-effectiveness of systematic
depression treatment among people with diabetes mellitus. Archives of General
Psychiatry, 64(1), 65–72.
Srivastava, A. (2016). Relationship between life satisfaction and depression among working
and non-working married women. International Journal of Education and Psychological
Research (IJEPR), 5(3), 1–7.
Vamos, E.P., Mucsi, I., Keszei, A., Kopp, M.S., Novak, M. (2009). Comorbid depression is
associated with increased healthcare utilization and lost productivity in persons with
diabetes: A large nationally representative Hungarian population survey. Psychosomatic
Medicine, 71(5), 501–507.
Welch, C.A., Czerwinski, D., Ghimire, B., Bertsimas, D. (2009). Depression and costs of
health care. Psychosomatics, 50(4), 392–401.
Wilkinson, L.R. (2016). Financial strain and mental health among older adults during the
great recession. Journals of Gerontology Series B: Psychological Sciences and Social
Sciences, 71(4), 745–754.
World Health Organization (2011). Impact of economic crises on mental health. WHO,
Copenhagen, Denmark.
Yan, X.Y., Huang, S.M., Huang, C.Q., Wu, W.H., Qin, Y. (2011). Marital status and risk for
late life depression: A meta-analysis of the published literature. Journal of International
Medical Research, 39(4), 1142–1154.
Zhang, L., Xu, Y., Nie, H., Zhang, Y., Wu, Y. (2012). The prevalence of depressive
symptoms among the older in China: A meta-analysis. International Journal of Geriatric
Psychiatry, 27(9), 900–906.
List of Authors

Mohammed ALBUHAYRI Tatiana BULDAKOVA


Division of Mathematics and Physics Department of Social Pediatrics
Mälardalen University and Health Organization
Västerås Faculty of Postgraduate and
Sweden Additional Professional Education
Saint Petersburg State Pediatric
Roberto ASCARI Medical University
Department of Economics, Russia
Management and Statistics (DEMS)
University of Milano-Bicocca Ekaterina BULINSKAYA
Milan Department of Probability Theory
Italy Lomonosov Moscow State University
Moscow
Francesca BITONTI Russia
Department of Economics
and Business Evren BURSUK
University of Catania Program of Biomedical Technologies
Italy Istanbul University-Cerrahpaşa
Turkey
Orhun Ceng BOZO
Computer Engineering Frederico CAEIRO
Istanbul University-Cerrahpaşa Department of Mathematics
Turkey Universidade NOVA de Lisboa
Caparica
Portugal


Mark A. CARUANA Marko DIMITROV


Department of Statistics and Division of Mathematics and Physics
Operations Research Mälardalen University
University of Malta Västerås
Msida Sweden
Malta
Sukran EBREN KARA
Anastasia CHARALAMPI Department of Computer
Social Policy Technologies
Panteion University of Social Sirnak University
and Political Sciences Turkey
Athens
Greece Dariusz FILIP
Department of Finance
Paul-Henry COURNÈDE Cardinal Stefan Wyszynski
Mathematics University in Warsaw (UKSW)
MICS laboratory Poland
Université Paris-Saclay
Gif-sur-Yvette Sergey GARBAR
France Department of Applied Mathematics
and Informatics
Antonin DELLA NOCE Yaroslav-the-Wise Novgorod State
Unité Mixte de Recherche 981 University
Institut Gustave Roussy Velikiy Novgorod
Villejuif Russia
France
Gloria GHENO
Agnese Maria DI BRISCO Ronin Institute
Department of Economics and Montclair, New Jersey
Business Studies USA
University of Eastern Piedmont
Novara Naveenbalaji GOWTHAMAN
Italy Department of Electronic Engineering
University of KwaZulu-Natal
Yiannis DIMOTIKALIS Durban
Department of Management Science South Africa
and Technology
Hellenic Mediterranean University
Heraklion
Greece

Flavius GUIAŞ Yuta KANNO


Department of Mechanical Graduate School of Engineering
Engineering Tokyo University of Science
Dortmund University of Applied Japan
Sciences and Arts
Dortmund Alex KARAGRIGORIOU
Germany Department of Statistics and
Actuarial-Financial Mathematics
Burcu Bektas GÜNEŞ University of the Aegean
Department of Computer Engineering Samos
National Defence University Greece
Turkish Naval Academy
Istanbul Christiana KARAGRIGORIOU-VONTA
Turkey Freelance translator and editor
Athens
Natalya GUREVA Greece
Department of Social Pediatrics
and Health Organization Andrey KIM
Faculty of Postgraduate and Department of Social Pediatrics
Additional Professional Education and Health Organization
Saint Petersburg State Pediatric Faculty of Postgraduate and
Medical University Additional Professional Education
Russia Saint Petersburg State Pediatric
Medical University
Alan HYLTON Russia
NASA Glenn Research Center
Cleveland, Ohio Maria KONTORINAKI
USA Department of Statistics and
Operations Research
Nika IVANOVA University of Malta
Department of Applied Probability Msida
and Informatics Malta
Peoples’ Friendship University of
Russia (RUDN University) Samuel KOSOLAPOV
Moscow Department of Electronics
Russia ORT Braude Academic College of
Engineering
María JAENADA Karmiel
Department of Statistics and O.R. Israel
Complutense of Madrid
Spain

Anatoly KOVALENKO Christos MESELIDIS


Faculty of Energy and Ecotechnology Department of Statistics and
Ioffe Institute of the Russian Academy Actuarial-Financial Mathematics
of Science University of the Aegean
Saint Petersburg Samos
Russia Greece

Konstantin LEBEDINSKII Catherine MICHALOPOULOU


Department of Anesthesiology and Department of Social Policy
Emergency Medicine Panteion University of Social
Mechnikov North-Western State and Political Sciences
Medical University Athens
Saint Petersburg Greece
Russia
Sonia MIGLIORATI
Ian LIM Department of Economics,
The University of Texas at Arlington Management and Statistics (DEMS)
Texas University of Milano-Bicocca
USA Milan
Italy
Anatoliy MALYARENKO
Division of Mathematics and Physics Verangelina MOLOSHNEVA
Mälardalen University Faculty of Energy and Ecotechnology
Västerås ITMO University
Sweden Saint Petersburg
Russia
Ayana MATEUS
Department of Mathematics Michael MOY
Universidade NOVA de Lisboa Mathematics
Caparica Colorado State University
Portugal Fort Collins
USA
Angelo MAZZA
Department of Economics and Ying NI
Business Division of Mathematics and Physics
University of Catania Mälardalen University
Italy Västerås
Sweden

Hossein NOHROUZIAN Lincoln S. PETER


Division of Mathematics and Physics Department of Electronic
Mälardalen University Engineering
Västerås Howard College
Sweden University of KwaZulu-Natal
Durban
Olga NOSYREVA South Africa
Department of Social Pediatrics
and Health Organization Vladimir RYKOV
Faculty of Postgraduate and Department of Applied Probability
Additional Professional Education and Informatics
Saint Petersburg State Pediatric Peoples’ Friendship University of
Medical University Russia (RUDN University)
Russia Moscow
Russia
Andrea ONGARO
Department of Economics, Rüya ŞAMLI
Management and Statistics (DEMS) Department of Computer Engineering
University of Milano-Bicocca Istanbul University-Cerrahpaşa
Milan Turkey
Italy
Charles SAVONA-VENTURA
Vasilii OREL Department of Obstetrics and
Department of Social Pediatrics Gynaecology
and Health Organization University of Malta
Faculty of Postgraduate and Msida
Additional Professional Education Malta
Saint Petersburg State Pediatric
Medical University Eleni SERAFETINIDOU
Russia Department of Statistics and
Insurance Science
Takis PAPAIOANNOU University of Piraeus
Department of Statistics and Greece
Insurance Science
University of Piraeus Lubov SHARAFUTDINOVA
Greece Department of Social Pediatrics
and Health Organization
Leandro PARDO Faculty of Postgraduate and
Department of Statistics and O.R. Additional Professional Education
Complutense of Madrid Saint Petersburg State Pediatric
Spain Medical University
Russia

Takayuki SHIOHAMA Joanna TSIGANOU


Department of Data Science The National Centre for
Nanzan University Social Research
Nagoya Athens
Japan Greece

Robert SHORT Eva TSOUPAROPOULOU


NASA Glenn Research Center Social Policy
Cleveland Panteion University of Social
Ohio and Political Sciences
USA Athens
Greece
Christos H. SKIADAS
ManLab Conceição VEIGA DE ALMEIDA
Technical University of Crete Department of Mathematics
Chania Universidade NOVA de Lisboa
Greece Portugal

Viktoria SMIRNOVA Georgia VERROPOULOU


Department of Social Pediatrics Department of Statistics and
and Health Organization Insurance Science
Faculty of Postgraduate and University of Piraeus
Additional Professional Education Greece
Saint Petersburg State Pediatric
Medical University Konstantinos N. ZAFEIRIS
Russia Department of History and Ethnology
Democritus University of Thrace
Viranjay M. SRIVASTAVA Komotini
Department of Electronic Greece
Engineering
Howard College
University of KwaZulu-Natal
Durban
South Africa

Louisa TESTA
Department of Statistics and
Operations Research
University of Malta
Msida
Malta
Index

A, B data
interpretation, 187
acute respiratory viral infection (ARVI),
storage, 188, 192, 195
359, 362, 363
databases, 31, 32, 34–36, 38–41
approximation by exponents, 297
depression, 395–400, 402–405, 407, 408,
arbitrarily distributed life- and repair
410, 412–414
times, 379, 381, 393
developing markets, 149, 152
asymptotic analysis, 43
devices, 371, 372, 375, 376
batch processing, 163
dividends, 43, 44, 48, 49, 55
Bayesian inference, 106
double redundant system, 379–381, 393
approximate, 319
beta regression, 173–177, 179, 183, 184
E, F
blockchain, 31–36, 38–41
economic downturn, 396, 405, 413, 415
C, D entropies, 237, 238
epidemiological data, 297
calibration, 135–138, 141–144, 146, 148
Europe, 395–398, 414
classification algorithm, 207, 208, 213, 219
exploratory factor analysis (EFA), 81, 82,
CO2, 307, 312, 313
85–90, 94, 95
community-acquired pneumonia, 359,
extreme value theory, 60
362, 363, 365, 366, 368
fixed-income market, 333
compositional data, 115, 116, 118
fixed-radius NN, 67, 68, 70, 71, 73, 74,
confirmatory factor analysis (CFA), 81,
76–79
82, 85–88, 91–95
FlexReg package, 99, 101, 107–110
contingency tables, 238, 246
Covid-19 (see also new coronavirus
G, H
infection), 297, 302–305
cubature method, 333–336, 339, 345, gas analysis, 307, 312, 316
349, 350, 352–354 Gatheral model, 135–140, 143, 147


gender, 395–401, 404, 405, 407, 408, medical care, 359, 360, 366, 367
410–413, 415 mixture
generalized linear model (GLM), 223 distribution, 116
genetic algorithm, 173, 175, 178, 184 model, 107
geographically weighted regression morbidity, 359, 360, 362, 367
(GWR), 261, 262 multi-armed bandit (MAB), 163, 164,
gestational diabetes mellitus (GDM), 166, 168
67–69, 73–78 multivariate regression, 115
Hamiltonian Monte Carlo (HMC), 115,
118 N, O
higher education, 371
network, 371–376
Hull–White model, 333, 345, 346, 348,
new coronavirus infection (see also
349, 352, 354, 355
Covid-19), 359–363, 365, 367, 368
non-parametric, 67, 68, 70
I, K
normalized data, 187, 192, 193, 195
implied volatility expansions, 135–137, NoSQL, 32–41
147 numerical method, 199
independent and non-identically O2, 307, 312, 313
distributed observations, 223, 224 official land price, 261, 264–267, 271,
insurance models, 43 272
k-nearest neighbour (kNN), 67, 68, 70, optimization, 43, 45, 55
73, 74, 76–79
kernel classification, 72, 74 P, R

performance inertia, 158


L, M
persistence, 149–152, 154–159
laser pintograph, 285
cutting machine, 285, 295 point processes, 18, 19
engraver, 295 political trust, 81, 82, 84, 87, 89–95
likelihood, 173–178 prediction method, 7
link function, 173–177, 181, 184 PROMISE, 275, 276, 282
local Moran’s I statistic, 13, 16, 21, 24, proportions, 99–101
26 Poisson regression model, 223, 229–231
machine learning, 207–209, 220 regenerative processes, 380, 381
algorithms, 3, 7, 9, 275 reinsurance, 43, 44, 55
MAPLE, 285, 289, 291, 299, 303 reliability, 82, 87, 91, 93, 94
Markov characteristics, 379–381, 384
chain, 151, 152, 154, 155, 159 Renyi’s pseudodistance, 223, 224
jump processes, 199, 200, 206 respiratory cycle, 307, 311, 312, 314
mean-field limit, 319, 321, 325–327 robustness, 223, 224, 230, 231
measures, 237–241, 246

S, T thyroid
cancer, 13, 14
simulations, 238, 246
diseases, 3, 4, 7, 10
software
time series data, 297, 298
cost estimation, 275, 282, 283
topological data analysis (TDA), 207
defined network (SDN), 371, 372,
374–376
U, V, W
spatial
clustering, 13 UCB, 163, 165, 168, 170
statistics, 262 validity, 81, 82, 85, 87, 91, 93–95
SQL, 32, 34, 36–41 volcanic areas, 13–17, 19
Stratonovich integral, 333, 337 WEKA (Waikato Environment for
tail distribution, 57 Knowledge Analysis), 275–279,
temporal variation, 261 281–283
tests of fit, 237, 238, 241, 242 Wiener space, 349–351
Summary of Volume 2

Preface
Konstantinos N. ZAFEIRIS, Yiannis DIMOTIKALIS, Christos H. SKIADAS,
Alex KARAGRIGORIOU and Christiana KARAGRIGORIOU-VONTA

Part 1

Chapter 1. A Topological Clustering of Variables


Rafik ABDESSELAM
1.1. Introduction
1.2. Topological context
1.2.1. Reference adjacency matrices
1.2.2. Quantitative variables
1.2.3. Qualitative variables
1.2.4. Mixed variables
1.3. Topological clustering of variables – selective review
1.4. Illustration on real data of simple examples
1.4.1. Case of a set of quantitative variables
1.4.2. Case of a set of qualitative variables
1.4.3. Case of a set of mixed variables
1.5. Conclusion
1.6. Appendix
1.7. References

Chapter 2. A New Regression Model for Count Compositions


Roberto ASCARI and Sonia MIGLIORATI

2.1. Introduction
2.1.1. Distributions for count vectors


2.2. Regression models and Bayesian inference


2.3. Simulation studies
2.3.1. Fitting study
2.3.2. Excess of zeroes
2.4. Application to real electoral data
2.5. References

Chapter 3. Intergenerational Class Mobility in Greece with Evidence from EU-SILC
Glykeria STAMATOPOULOU, Maria SYMEONAKI and Catherine MICHALOPOULOU

3.1. Introduction
3.2. Data and methods
3.3. The trends of class mobility between different birth cohorts
3.4. Conclusion
3.5. References

Chapter 4. Capturing School-to-Work Transitions Using Data from the First European Graduate Survey
Maria SYMEONAKI, Glykeria STAMATOPOULOU and Dimitris PARSANOGLOU

4.1. Introduction
4.2. Data and methodology
4.3. Results
4.4. Conclusion
4.5. References

Chapter 5. A Cluster Analysis Approach for Identifying Precarious Workers
Maria SYMEONAKI, Glykeria STAMATOPOULOU and Dimitris PARSANOGLOU
5.1. Introduction
5.2. Data and methodology
5.3. Results
5.4. Conclusion and discussion
5.4.1. Declarations
5.5. References

Chapter 6. American Option Pricing Under a Varying Economic Situation Using Semi-Markov Decision Process
Kouki TAKADA, Marko DIMITROV, Lu JIN and Ying NI

6.1. Introduction
6.2. American option pricing

6.3. Exercising strategies


6.3.1. Setting parameter
6.3.2. Relationship between the American option price and
economic situation i
6.3.3. Relationship between the American option price
and the asset price s
6.3.4. Relationship between the American option price
and maturity T
6.3.5. Relationship between the American option price and
transition probabilities P
6.3.6. Consideration of the optimal exercise region
6.4. Conclusion
6.5. References

Chapter 7. The Implementation of Hierarchical Classifications and Cochran’s Rule in the Analysis of Social Data
Aggeliki YFANTI and Catherine MICHALOPOULOU

7.1. Introduction
7.2. Methods
7.3. Results
7.4. Conclusion
7.5. References

Chapter 8. Dynamic Optimization with Tempered Stable Subordinators for Modeling River Hydraulics
Hidekazu YOSHIOKA and Yumi YOSHIOKA

8.1. Introduction
8.2. Mathematical model
8.3. Optimization problem
8.4. HJBI equation: formulation and solution
8.5. Concluding remarks
8.6. Acknowledgments
8.7. References

Part 2

Chapter 9. Predicting Event Counts in Event-Driven Clinical Trials Accounting for Cure and Ongoing Recruitment
Vladimir ANISIMOV, Stephen GORMLEY, Rosalind BAVERSTOCK and Cynthia KINEZA

9.1. Introduction
9.2. Modeling the process of event occurrence

9.2.1. Estimating parameters of the model


9.3. Predicting event counts for patients at risk
9.3.1. Global prediction
9.4. Predicting event counts accounting for ongoing recruitment
9.4.1. Modeling and predicting patient recruitment
9.4.2. Predicting event counts
9.4.3. Global forecasting event counts at interim stage
9.5. Monte Carlo simulation
9.6. Software development
9.6.1. R package design
9.6.2. R package input data required
9.7. R package and implementation in a clinical trial
9.7.1. Introduction
9.7.2. Key predictions
9.7.3. Plots and parameter estimates
9.8. Conclusion
9.9. References

Chapter 10. Structural Modeling: An Application to the Evaluation of Ecosystem Practices at the Plot Level
Dominique DESBOIS

10.1. Introduction
10.2. Structural equation modeling using partial least squares
10.2.1. Specification of the internal model
10.2.2. Specification of the external model
10.2.3. Validation statistics for the external model
10.2.4. Overall validation of structural modeling
10.3. Material and method
10.3.1. Agro-ecological context of the study
10.3.2. Data
10.3.3. The structural model and the estimation
10.4. Results and discussion
10.4.1. Checking the block one-dimensionality
10.4.2. Fitting the external model and assessing the quality of the fit
10.4.3. The structural model after revision
10.5. Conclusion
10.6. References

Chapter 11. Lean Management as an Improvement Factor in Health Services – The Case of Venizeleio General Hospital of Crete, Greece
Eleni GENITSARIDI and George MATALLIOTAKIS

11.1. Introduction
11.2. Theoretical framework
11.3. Purpose of the research
11.4. Methodology
11.5. Research results
11.6. Conclusion
11.7. References

Chapter 12. Motivation and Professional Satisfaction of Medical and Nursing Staff of Primary Health Care Structures (Urban and Regional Health Centers) of the Prefecture of Heraklion, Under the Responsibility of the 7th Ministry
Mihalis KYRIAKAKIS and George MATALLIOTAKIS

12.1. Introduction
12.2. Methodology and material
12.2.1. Research tools for measuring motivation and professional
satisfaction for this work
12.2.2. Purpose and objectives of the research
12.2.3. Material and method
12.2.4. Statistical analysis
12.3. Results
12.4. Discussion
12.5. Conclusion
12.6. References

Chapter 13. Developing a Bibliometric Quality Indicator for Journals Applied to the Field of Dentistry
Pilar VALDERRAMA, Ana M. AGUILERA and Mariano J. VALDERRAMA

13.1. Introduction
13.2. Methodology
13.3. Discussion and conclusion
13.4. Acknowledgments
13.5. Appendix
13.6. References

Chapter 14. Statistical Process Monitoring Techniques for Covid-19


Emmanouil-Nektarios KALLIGERIS and Andreas MAKRIDES

14.1. Introduction
14.2. Materials and methods
14.3. Behavior of Covid-19 disease in the Mediterranean region
14.4. Conclusion
14.5. Acknowledgments
14.6. References

Part 3

Chapter 15. Increase of Retirement Age and Health State of Population in Czechia
Tomáš FIALA, Jitka LANGHAMROVÁ and Jana VRABCOVÁ

15.1. Introduction
15.2. Data and methodological remarks
15.3. Statutory retirement age
15.4. Development of the state of health of population
15.5. Development of the state of health of population in productive and
post-productive ages
15.6. Conclusion
15.7. Acknowledgment
15.8. References

Chapter 16. A Generalized Mean Under a Non-Regular Framework and Extreme Value Index Estimation
M. Ivette GOMES, Lígia HENRIQUES-RODRIGUES and Dinis PESTANA

16.1. Introduction
16.2. Preliminary results in the area of EVT for heavy tails and asymptotic
behavior of MOp functionals
16.2.1. A brief review of first- and second-order conditions
16.2.2. Asymptotic behavior of the Hill EVI-estimators
16.2.3. Asymptotic behavior of MOp EVI-estimators under
a regular framework
16.2.4. A brief reference to additive stable laws
16.2.5. Asymptotic behavior of EVI-estimators under
a non-regular framework
16.3. Finite-sample behavior of MOp functionals
16.4. A non-regular adaptive choice of p and k
16.5. Concluding remarks
16.6. References

Chapter 17. Demography and Policies in V4 Countries
Michaela KADLECOVÁ, Filip HON and Jitka LANGHAMROVÁ

17.1. Introduction
17.2. Demographic development in the V4 countries
17.3. Development of fertility and family policy
17.4. Pension systems of the Visegrad Four countries
17.5. Prediction of future development of V4 populations
17.6. Conclusion
17.7. Acknowledgments
17.8. References

Chapter 18. Decomposing Differences in Life Expectancy with and without Disability: The Case of Czechia
David MORÁVEK, Tomáš BĚLOCH and Jitka LANGHAMROVÁ

18.1. Introduction
18.2. Methodology and data
18.3. Main results
18.3.1. Effect of mortality
18.3.2. Effects of mortality and health
18.4. Conclusion
18.5. Acknowledgments
18.6. References

Chapter 19. Assessing the Predictive Ability of Subjective Survival Probabilities
Apostolos PAPACHRISTOS and Georgia VERROPOULOU

19.1. Introduction
19.1.1. Actual mortality patterns
19.1.2. Objectives of the study
19.2. Methods
19.2.1. Data
19.2.2. Force of subjective mortality
19.2.3. Variables
19.2.4. Statistical modeling
19.3. Results
19.3.1. Sample
19.3.2. Multivariable analyses
19.4. Discussion
19.5. Conclusion
19.6. Acknowledgments
19.7. References

Chapter 20. Exploring Excess Mortality During the Covid-19 Pandemic with Seasonal ARIMA Models
Karl-Heinz JÖCKEL and Peter PFLAUMER

20.1. Introduction
20.2. Binomial mortality model and the empirical distribution of daily
deaths in Germany
20.3. Non-seasonal ARIMA model for weekly data in Germany
20.4. Seasonal ARIMA models of weekly deaths for Spain,
Germany and Sweden
20.5. Measuring excess mortality, especially in Spain, Germany and Sweden
20.6. Forecasting daily deaths in Germany
20.7. Conclusion
20.8. Appendix
20.8.1. Estimation results of the other age classes
20.8.2. Time series decomposition
20.9. References

Chapter 21. The Impact of Cesarean Section on Neonatal Mortality in Rural–Urban Divisions in a Region of Brazil
Carlos SANTOS and Neir PAES

21.1. Introduction
21.2. Materials and methods
21.2.1. Multilevel logistic model
21.3. Results and discussion
21.4. Conclusion
21.5. References

Chapter 22. Analysis of Alcohol Policy in Czechia: Estimation of Alcohol Policy Scale Compared to EU Countries
Kornélia SVAČINOVÁ, Markéta Majerová PECHHOLDOVÁ and Jana VRABCOVÁ

22.1. Introduction
22.2. Literature review
22.3. Methods
22.4. Results
22.5. Discussion
22.6. Conclusion
22.7. Acknowledgment
22.8. References

Chapter 23. Alcohol-Related Mortality and Its Cause-Elimination in Life Tables in Selected European Countries and USA: An International Comparison
Jana VRABCOVÁ, Markéta Majerová PECHHOLDOVÁ and Kornélia SVAČINOVÁ

23.1. Introduction
23.2. Data and methods
23.3. Alcohol consumption in European countries by the OECD
23.4. Czechia
23.5. Poland
23.6. Belarus
23.7. Russia
23.8. France
23.9. USA
23.10. Conclusion
23.11. Acknowledgment
23.12. References

Chapter 24. Labor Force Aging in the Czech Republic: The Role of Education and Economic Industry
Martina SIMKOVA and Jaroslav SIXTA

24.1. Introduction
24.2. The setting of the statutory retirement age
24.3. The economic status of elderly workers
24.4. The structure of working people by factors
24.5. The change in the number of workers
24.6. Conclusion
24.7. Acknowledgment
24.8. References
Other titles from ISTE in Innovation, Entrepreneurship and Management

2022
BOUCHÉ Geneviève
Productive Economy, Contributory Economy: Governance Tools for the
Third Millennium
HELLER David
Valuation of the Liability Structure by Real Options
MATHIEU Valérie
A Customer-oriented Manager for B2B Services: Principles and
Implementation
NOËL Florent, SCHMIDT Géraldine
Employability and Industrial Mutations: Between Individual Trajectories
and Organizational Strategic Planning (Technological Changes and Human
Resources Set – Volume 4)
SALOFF-COSTE Michel
Innovation Ecosystems: The Future of Civilizations and the Civilization of
the Future (Innovation and Technology Set – Volume 14)
VAYRE Emilie
New Spaces and New Working Times

2021
ARCADE Jacques
Strategic Engineering (Innovation and Technology Set – Volume 11)
BÉRANGER Jérôme, RIZOULIÈRES Roland
The Digital Revolution in Health (Health and Innovation Set – Volume 2)
BOBILLIER CHAUMON Marc-Eric
Digital Transformations in the Challenge of Activity and Work:
Understanding and Supporting Technological Changes
(Technological Changes and Human Resources Set – Volume 3)
BUCLET Nicolas
Territorial Ecology and Socio-ecological Transition
(Smart Innovation Set – Volume 34)
DIMOTIKALIS Yannis, KARAGRIGORIOU Alex, PARPOULA Christina,
SKIADAS Christos H.
Applied Modeling Techniques and Data Analysis 1: Computational Data
Analysis Methods and Tools (Big Data, Artificial Intelligence and Data
Analysis Set – Volume 7)
Applied Modeling Techniques and Data Analysis 2: Financial,
Demographic, Stochastic and Statistical Models and Methods (Big Data,
Artificial Intelligence and Data Analysis Set – Volume 8)
DISPAS Christophe, KAYANAKIS Georges, SERVEL Nicolas,
STRIUKOVA Ludmila
Innovation and Financial Markets
(Innovation between Risk and Reward Set – Volume 7)
ENJOLRAS Manon
Innovation and Export: The Joint Challenge of the Small Company
(Smart Innovation Set – Volume 37)
FLEURY Sylvain, RICHIR Simon
Immersive Technologies to Accelerate Innovation: How Virtual and
Augmented Reality Enables the Co-Creation of Concepts
(Smart Innovation Set – Volume 38)
GIORGINI Pierre
The Contributory Revolution (Innovation and Technology Set – Volume 13)
GOGLIN Christian
Emotions and Values in Equity Crowdfunding Investment Choices 2:
Modeling and Empirical Study
GRENIER Corinne, OIRY Ewan
Altering Frontiers: Organizational Innovations in Healthcare (Health and
Innovation Set – Volume 1)
GUERRIER Claudine
Security and Its Challenges in the 21st Century (Innovation and Technology
Set – Volume 12)
HELLER David
Performance of Valuation Methods in Financial Transactions (Modern
Finance, Management Innovation and Economic Growth Set – Volume 4)
LEHMANN Paul-Jacques
Liberalism and Capitalism Today
SOULÉ Bastien, HALLÉ Julie, VIGNAL Bénédicte, BOUTROY Éric,
NIER Olivier
Innovation in Sport: Innovation Trajectories and Process Optimization
(Smart Innovation Set – Volume 35)
UZUNIDIS Dimitri, KASMI Fedoua, ADATTO Laurent
Innovation Economics, Engineering and Management Handbook 1:
Main Themes
Innovation Economics, Engineering and Management Handbook 2:
Special Themes
VALLIER Estelle
Innovation in Clusters: Science–Industry Relationships in the Face of
Forced Advancement (Smart Innovation Set – Volume 36)

2020
ACH Yves-Alain, RMADI-SAÏD Sandra
Financial Information and Brand Value: Reflections, Challenges and
Limitations
ANDREOSSO-O’CALLAGHAN Bernadette, DZEVER Sam, JAUSSAUD Jacques,
TAYLOR Robert
Sustainable Development and Energy Transition in Europe and Asia
(Innovation and Technology Set – Volume 9)
BEN SLIMANE Sonia, M’HENNI Hatem
Entrepreneurship and Development: Realities and Future Prospects
(Smart Innovation Set – Volume 30)
CHOUTEAU Marianne, FOREST Joëlle, NGUYEN Céline
Innovation for Society: The P.S.I. Approach
(Smart Innovation Set – Volume 28)
CORON Clotilde
Quantifying Human Resources: Uses and Analysis
(Technological Changes and Human Resources Set – Volume 2)
CORON Clotilde, GILBERT Patrick
Technological Change
(Technological Changes and Human Resources Set – Volume 1)
CERDIN Jean-Luc, PERETTI Jean-Marie
The Success of Apprenticeships: Views of Stakeholders on Training and
Learning (Human Resources Management Set – Volume 3)
DELCHET-COCHET Karen
Circular Economy: From Waste Reduction to Value Creation
(Economic Growth Set – Volume 2)
DIDAY Edwin, GUAN Rong, SAPORTA Gilbert, WANG Huiwen
Advances in Data Science
(Big Data, Artificial Intelligence and Data Analysis Set – Volume 4)
DOS SANTOS PAULINO Victor
Innovation Trends in the Space Industry
(Smart Innovation Set – Volume 25)
GASMI Nacer
Corporate Innovation Strategies: Corporate Social Responsibility and
Shared Value Creation
(Smart Innovation Set – Volume 33)
GOGLIN Christian
Emotions and Values in Equity Crowdfunding Investment Choices 1:
Transdisciplinary Theoretical Approach
GUILHON Bernard
Venture Capital and the Financing of Innovation
(Innovation Between Risk and Reward Set – Volume 6)
LATOUCHE Pascal
Open Innovation: Human Set-up
(Innovation and Technology Set – Volume 10)
LIMA Marcos
Entrepreneurship and Innovation Education: Frameworks and Tools
(Smart Innovation Set – Volume 32)
MACHADO Carolina, DAVIM J. Paulo
Sustainable Management for Managers and Engineers
MAKRIDES Andreas, KARAGRIGORIOU Alex, SKIADAS Christos H.
Data Analysis and Applications 3: Computational, Classification, Financial,
Statistical and Stochastic Methods
(Big Data, Artificial Intelligence and Data Analysis Set – Volume 5)
Data Analysis and Applications 4: Financial Data Analysis and Methods
(Big Data, Artificial Intelligence and Data Analysis Set – Volume 6)
MASSOTTE Pierre, CORSI Patrick
Complex Decision-Making in Economy and Finance
MEUNIER François-Xavier
Dual Innovation Systems: Concepts, Tools and Methods
(Smart Innovation Set – Volume 31)
MICHAUD Thomas
Science Fiction and Innovation Design (Innovation in Engineering and
Technology Set – Volume 6)
MONINO Jean-Louis
Data Control: Major Challenge for the Digital Society
(Smart Innovation Set – Volume 29)
MORLAT Clément
Sustainable Productive System: Eco-development versus Sustainable
Development (Smart Innovation Set – Volume 26)
SAULAIS Pierre, ERMINE Jean-Louis
Knowledge Management in Innovative Companies 2: Understanding and
Deploying a KM Plan within a Learning Organization
(Smart Innovation Set – Volume 27)

2019
AMENDOLA Mario, GAFFARD Jean-Luc
Disorder and Public Concern Around Globalization
BARBAROUX Pierre
Disruptive Technology and Defence Innovation Ecosystems
(Innovation in Engineering and Technology Set – Volume 5)
DOU Henri, JUILLET Alain, CLERC Philippe
Strategic Intelligence for the Future 1: A New Strategic and Operational
Approach
Strategic Intelligence for the Future 2: A New Information Function
Approach
FRIKHA Azza
Measurement in Marketing: Operationalization of Latent Constructs
FRIMOUSSE Soufyane
Innovation and Agility in the Digital Age
(Human Resources Management Set – Volume 2)
GAY Claudine, SZOSTAK Bérangère L.
Innovation and Creativity in SMEs: Challenges, Evolutions and Prospects
(Smart Innovation Set – Volume 21)
GORIA Stéphane, HUMBERT Pierre, ROUSSEL Benoît
Information, Knowledge and Agile Creativity
(Smart Innovation Set – Volume 22)
HELLER David
Investment Decision-making Using Optional Models
(Economic Growth Set – Volume 2)
HELLER David, DE CHADIRAC Sylvain, HALAOUI Lana, JOUVET Camille
The Emergence of Start-ups
(Economic Growth Set – Volume 1)
HÉRAUD Jean-Alain, KERR Fiona, BURGER-HELMCHEN Thierry
Creative Management of Complex Systems
(Smart Innovation Set – Volume 19)
LATOUCHE Pascal
Open Innovation: Corporate Incubator
(Innovation and Technology Set – Volume 7)
LEHMANN Paul-Jacques
The Future of the Euro Currency
LEIGNEL Jean-Louis, MÉNAGER Emmanuel, YABLONSKY Serge
Sustainable Enterprise Performance: A Comprehensive Evaluation Method
LIÈVRE Pascal, AUBRY Monique, GAREL Gilles
Management of Extreme Situations: From Polar Expeditions to Exploration-
Oriented Organizations
MILLOT Michel
Embarrassment of Product Choices 2: Towards a Society of Well-being
N’GOALA Gilles, PEZ-PÉRARD Virginie, PRIM-ALLAZ Isabelle
Augmented Customer Strategy: CRM in the Digital Age
NIKOLOVA Blagovesta
The RRI Challenge: Responsibilization in a State of Tension with Market
Regulation
(Innovation and Responsibility Set – Volume 3)
PELLEGRIN-BOUCHER Estelle, ROY Pierre
Innovation in the Cultural and Creative Industries
(Innovation and Technology Set – Volume 8)
PRIOLON Joël
Financial Markets for Commodities
QUINIOU Matthieu
Blockchain: The Advent of Disintermediation
RAVIX Joël-Thomas, DESCHAMPS Marc
Innovation and Industrial Policies
(Innovation between Risk and Reward Set – Volume 5)
ROGER Alain, VINOT Didier
Skills Management: New Applications, New Questions
(Human Resources Management Set – Volume 1)
SAULAIS Pierre, ERMINE Jean-Louis
Knowledge Management in Innovative Companies 1: Understanding and
Deploying a KM Plan within a Learning Organization
(Smart Innovation Set – Volume 23)
SERVAJEAN-HILST Romaric
Co-innovation Dynamics: The Management of Client-Supplier Interactions
for Open Innovation
(Smart Innovation Set – Volume 20)
SKIADAS Christos H., BOZEMAN James R.
Data Analysis and Applications 1: Clustering and Regression, Modeling-
estimating, Forecasting and Data Mining
(Big Data, Artificial Intelligence and Data Analysis Set – Volume 2)
Data Analysis and Applications 2: Utilization of Results in Europe and
Other Topics
(Big Data, Artificial Intelligence and Data Analysis Set – Volume 3)
UZUNIDIS Dimitri
Systemic Innovation: Entrepreneurial Strategies and Market Dynamics
VIGEZZI Michel
World Industrialization: Shared Inventions, Competitive Innovations and
Social Dynamics
(Smart Innovation Set – Volume 24)

2018
BURKHARDT Kirsten
Private Equity Firms: Their Role in the Formation of Strategic Alliances
CALLENS Stéphane
Creative Globalization
(Smart Innovation Set – Volume 16)
CASADELLA Vanessa
Innovation Systems in Emerging Economies: MINT – Mexico, Indonesia,
Nigeria, Turkey
(Smart Innovation Set – Volume 18)
CHOUTEAU Marianne, FOREST Joëlle, NGUYEN Céline
Science, Technology and Innovation Culture
(Innovation in Engineering and Technology Set – Volume 3)
CORLOSQUET-HABART Marine, JANSSEN Jacques
Big Data for Insurance Companies
(Big Data, Artificial Intelligence and Data Analysis Set – Volume 1)
CROS Françoise
Innovation and Society
(Smart Innovation Set – Volume 15)
DEBREF Romain
Environmental Innovation and Ecodesign: Certainties and Controversies
(Smart Innovation Set – Volume 17)
DOMINGUEZ Noémie
SME Internationalization Strategies: Innovation to Conquer New Markets
ERMINE Jean-Louis
Knowledge Management: The Creative Loop
(Innovation and Technology Set – Volume 5)
GILBERT Patrick, BOBADILLA Natalia, GASTALDI Lise,
LE BOULAIRE Martine, LELEBINA Olga
Innovation, Research and Development Management
IBRAHIMI Mohammed
Mergers & Acquisitions: Theory, Strategy, Finance
LEMAÎTRE Denis
Training Engineers for Innovation
LÉVY Aldo, BEN BOUHENI Faten, AMMI Chantal
Financial Management: USGAAP and IFRS Standards
(Innovation and Technology Set – Volume 6)
MILLOT Michel
Embarrassment of Product Choices 1: How to Consume Differently
PANSERA Mario, OWEN Richard
Innovation and Development: The Politics at the Bottom of the Pyramid
(Innovation and Responsibility Set – Volume 2)
RICHEZ Yves
Corporate Talent Detection and Development
SACHETTI Philippe, ZUPPINGER Thibaud
New Technologies and Branding
(Innovation and Technology Set – Volume 4)
SAMIER Henri
Intuition, Creativity, Innovation
TEMPLE Ludovic, COMPAORÉ SAWADOGO Eveline M.F.W.
Innovation Processes in Agro-Ecological Transitions in Developing
Countries
(Innovation in Engineering and Technology Set – Volume 2)
UZUNIDIS Dimitri
Collective Innovation Processes: Principles and Practices
(Innovation in Engineering and Technology Set – Volume 4)
VAN HOOREBEKE Delphine
The Management of Living Beings or Emo-management

2017
AÏT-EL-HADJ Smaïl
The Ongoing Technological System
(Smart Innovation Set – Volume 11)
BAUDRY Marc, DUMONT Béatrice
Patents: Prompting or Restricting Innovation?
(Smart Innovation Set – Volume 12)
BÉRARD Céline, TEYSSIER Christine
Risk Management: Lever for SME Development and Stakeholder
Value Creation
CHALENÇON Ludivine
Location Strategies and Value Creation of International
Mergers and Acquisitions
CHAUVEL Danièle, BORZILLO Stefano
The Innovative Company: An Ill-defined Object
(Innovation between Risk and Reward Set – Volume 1)
CORSI Patrick
Going Past Limits To Growth
D’ANDRIA Aude, GABARRET Inés
Building 21st Century Entrepreneurship
(Innovation and Technology Set – Volume 2)
DAIDJ Nabyla
Cooperation, Coopetition and Innovation
(Innovation and Technology Set – Volume 3)
FERNEZ-WALCH Sandrine
The Multiple Facets of Innovation Project Management
(Innovation between Risk and Reward Set – Volume 4)
FOREST Joëlle
Creative Rationality and Innovation
(Smart Innovation Set – Volume 14)
GUILHON Bernard
Innovation and Production Ecosystems
(Innovation between Risk and Reward Set – Volume 2)
HAMMOUDI Abdelhakim, DAIDJ Nabyla
Game Theory Approach to Managerial Strategies and Value Creation
(Diverse and Global Perspectives on Value Creation Set – Volume 3)
LALLEMENT Rémi
Intellectual Property and Innovation Protection: New Practices
and New Policy Issues
(Innovation between Risk and Reward Set – Volume 3)
LAPERCHE Blandine
Enterprise Knowledge Capital
(Smart Innovation Set – Volume 13)
LEBERT Didier, EL YOUNSI Hafida
International Specialization Dynamics
(Smart Innovation Set – Volume 9)
MAESSCHALCK Marc
Reflexive Governance for Research and Innovative Knowledge
(Responsible Research and Innovation Set – Volume 6)
MASSOTTE Pierre
Ethics in Social Networking and Business 1: Theory, Practice
and Current Recommendations
Ethics in Social Networking and Business 2: The Future and
Changing Paradigms
MASSOTTE Pierre, CORSI Patrick
Smart Decisions in Complex Systems
MEDINA Mercedes, HERRERO Mónica, URGELLÉS Alicia
Current and Emerging Issues in the Audiovisual Industry
(Diverse and Global Perspectives on Value Creation Set – Volume 1)
MICHAUD Thomas
Innovation, Between Science and Science Fiction
(Smart Innovation Set – Volume 10)
PELLÉ Sophie
Business, Innovation and Responsibility
(Responsible Research and Innovation Set – Volume 7)
SAVIGNAC Emmanuelle
The Gamification of Work: The Use of Games in the Workplace
SUGAHARA Satoshi, DAIDJ Nabyla, USHIO Sumitaka
Value Creation in Management Accounting and Strategic Management:
An Integrated Approach
(Diverse and Global Perspectives on Value Creation Set – Volume 2)
UZUNIDIS Dimitri, SAULAIS Pierre
Innovation Engines: Entrepreneurs and Enterprises in a Turbulent World
(Innovation in Engineering and Technology Set – Volume 1)

2016
BARBAROUX Pierre, ATTOUR Amel, SCHENK Eric
Knowledge Management and Innovation
(Smart Innovation Set – Volume 6)
BEN BOUHENI Faten, AMMI Chantal, LEVY Aldo
Banking Governance, Performance And Risk-Taking: Conventional Banks
Vs Islamic Banks
BOUTILLIER Sophie, CARRÉ Denis, LEVRATTO Nadine
Entrepreneurial Ecosystems (Smart Innovation Set – Volume 2)
BOUTILLIER Sophie, UZUNIDIS Dimitri
The Entrepreneur (Smart Innovation Set – Volume 8)
BOUVARD Patricia, SUZANNE Hervé
Collective Intelligence Development in Business
GALLAUD Delphine, LAPERCHE Blandine
Circular Economy, Industrial Ecology and Short Supply Chains
(Smart Innovation Set – Volume 4)
GUERRIER Claudine
Security and Privacy in the Digital Era
(Innovation and Technology Set – Volume 1)
MEGHOUAR Hicham
Corporate Takeover Targets
MONINO Jean-Louis, SEDKAOUI Soraya
Big Data, Open Data and Data Development
(Smart Innovation Set – Volume 3)
MOREL Laure, LE ROUX Serge
Fab Labs: Innovative User
(Smart Innovation Set – Volume 5)
PICARD Fabienne, TANGUY Corinne
Innovations and Techno-ecological Transition
(Smart Innovation Set – Volume 7)

2015
CASADELLA Vanessa, LIU Zeting, UZUNIDIS Dimitri
Innovation Capabilities and Economic Development in Open Economies
(Smart Innovation Set – Volume 1)
CORSI Patrick, MORIN Dominique
Sequencing Apple’s DNA
CORSI Patrick, NEAU Erwan
Innovation Capability Maturity Model
FAIVRE-TAVIGNOT Bénédicte
Social Business and Base of the Pyramid
GODÉ Cécile
Team Coordination in Extreme Environments
MAILLARD Pierre
Competitive Quality and Innovation
MASSOTTE Pierre, CORSI Patrick
Operationalizing Sustainability
MASSOTTE Pierre, CORSI Patrick
Sustainability Calling

2014
DUBÉ Jean, LEGROS Diègo
Spatial Econometrics Using Microdata
LESCA Humbert, LESCA Nicolas
Strategic Decisions and Weak Signals

2013
HABART-CORLOSQUET Marine, JANSSEN Jacques, MANCA Raimondo
VaR Methodology for Non-Gaussian Finance

2012
DAL PONT Jean-Pierre
Process Engineering and Industrial Management
MAILLARD Pierre
Competitive Quality Strategies
POMEROL Jean-Charles
Decision-Making and Action
SZYLAR Christian
UCITS Handbook

2011
LESCA Nicolas
Environmental Scanning and Sustainable Development
LESCA Nicolas, LESCA Humbert
Weak Signals for Strategic Intelligence: Anticipation Tool for Managers
MERCIER-LAURENT Eunika
Innovation Ecosystems

2010
SZYLAR Christian
Risk Management under UCITS III/IV

2009
COHEN Corine
Business Intelligence
ZANINETTI Jean-Marc
Sustainable Development in the USA

2008
CORSI Patrick, DULIEU Mike
The Marketing of Technology Intensive Products and Services
DZEVER Sam, JAUSSAUD Jacques, ANDREOSSO Bernadette
Evolving Corporate Structures and Cultures in Asia: Impact
of Globalization

2007
AMMI Chantal
Global Consumer Behavior

2006
BOUGHZALA Imed, ERMINE Jean-Louis
Trends in Enterprise Knowledge Management
CORSI Patrick et al.
Innovation Engineering: the Power of Intangible Networks
WILEY END USER LICENSE AGREEMENT
Go to www.wiley.com/go/eula to access Wiley’s ebook EULA.
