The copyright holder for this preprint (which was not peer-reviewed) is the
author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
It is made available under a CC-BY-NC-ND 4.0 International license .
Abstract:
This paper provides a model-based method for forecasting the total
number of currently CoViD-19-positive individuals and the occupancy
of the available Intensive Care Units in Italy. The predictions
obtained, for a time horizon of 10 days starting from March 29th,
are provided at the national level as well as at more disaggregated
levels, following a criterion based on the magnitude of the
phenomenon. While the Regions hit hardest by the pandemic have been
kept separate, the less affected ones have been aggregated into
homogeneous macro-areas. Results show that, within the forecast
period considered (March 29th - April 7th), all of the Italian
regions will show a decreasing number of CoViD-19-positive people.
The same holds for the number of people who will need to be
hospitalized in an Intensive Care Unit (ICU). These estimates are
valid provided that the Government’s current containment policies
remain unchanged. In this scenario, the Northern Regions will remain
the most affected ones and no significant outbreaks are foreseen in
the southern regions.
1. Introduction
On March 19th, the death toll paid by Italy for the spread of CoViD-19 amounted to
3405 deaths, the highest of any single country in the world. Despite a hard and
relatively timely lock-down policy implemented by the Government, by March 26 this
figure had risen to 8165 deaths.
medRxiv preprint doi: https://doi.org/10.1101/2020.03.30.20047894.
Since the Italian regions are affected to different extents by CoViD-19, the
forecasting exercise has been performed for the following geographical areas:
Lombardia, Piedmont, Valle d’Aosta, Veneto, Friuli Venezia Giulia, Trentino Alto
Adige, Lazio and Campania. The remaining Regions have been grouped into the
following macro-areas: “Center” (Marche, Umbria and Toscana) and “South” (Abruzzo,
Molise, Puglia, Basilicata, Calabria, Sicilia and Sardegna). At least two other
reasons justify such a breakdown:
In essence, in this study the available official data have been employed in a
three-step procedure, i.e.:
1. data pre-processing, in which data anomalies are identified and corrected
using a Kalman-filter-type approach;
2. univariate forecasting, based on an autoregressive moving average (ARMA) model
for the number of positive cases and of ICU patients;
3. bootstrap-based generation of predicted values and confidence intervals.
2. The Data
This paper employs the data related to COVID-19 collected and regularly updated
by the Italian National Institute of Health (an agency of the Italian Ministry of Health)
and by the Italian Civil Protection Department. The whole data set is freely and publicly
available in a comprehensive database, accessible at the web address
https://www.iss.it/. It collects crucial data on all the persons tested for COVID-19
since the outbreak of the pandemic (February 24th) and, in particular, it
As already pointed out, the variables of interest in the present study are the
number of people who have been:
1. tested positive for SARS-CoV-2 (in what follows denoted by the bold Latin letter
V);
2. hospitalized in an ICU (denoted by the bold Latin letter U).
In order to correctly process the data, all the regions showing no positive cases
at the beginning of the recording period and/or low values along the whole time span
have been aggregated into macro areas. This has been done to i) give more meaningful
results and ii) save degrees of freedom (which are always precious in short time series).
A) Northern Regions
1. Lombardia
2. Piedmont
3. Valle d’Aosta
4. Veneto
5. Friuli Venezia Giulia
6. Emilia Romagna
7. Liguria
8. Macro Area “Trentino Alto Adige” (Trento and Bolzano)
B) Center Regions
1. Lazio
2. Macro-area “Center”: Marche, Umbria and Toscana
C) Southern Regions
1. Campania
2. Macro-area “South”: Abruzzo, Molise, Puglia, Basilicata, Calabria, Sicilia,
Sardegna.
The Northern Italian Regions, presently the most severely affected by the pandemic,
have been treated separately, along with two other regions, i.e. Lazio and Campania,
since their major cities (Rome and Naples) deserve special attention for their
institutional role and population density. On the other hand, the Regions showing
less worrying figures have been aggregated into macro-areas according to their
geographical location. The only exception is Valle d’Aosta, which has been kept
separate as no aggregation options could be found.
To simplify notation, for both the variables of interest V and U, the following
convention is introduced:
$^{K}V_{j}$.
Here, the upper left superscript (the upper case Latin letter K) refers to the
geographical area (A = North, B = Center, C = South), whereas the subscript j is
the number under which the region or macro-area is listed above. For example, the
symbols $^{A}V_{6}$ and $^{B}V_{2}$ identify the number of positive cases for the
Emilia Romagna region and for the Center macro-area “Marche – Umbria – Toscana”,
respectively.
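The coding convention can be made concrete with a small lookup table. The sketch below simply transcribes the region list of this section; the flat label format "A-V-6" and the function name `describe` are hypothetical choices made for illustration, not notation used by the paper.

```python
# Region codes as listed in this section: area letter K -> {index j -> name}.
AREAS = {
    "A": {1: "Lombardia", 2: "Piedmont", 3: "Valle d'Aosta", 4: "Veneto",
          5: "Friuli Venezia Giulia", 6: "Emilia Romagna", 7: "Liguria",
          8: "Trentino Alto Adige (Trento and Bolzano)"},
    "B": {1: "Lazio", 2: "Center (Marche, Umbria, Toscana)"},
    "C": {1: "Campania",
          2: "South (Abruzzo, Molise, Puglia, Basilicata, Calabria, "
             "Sicilia, Sardegna)"},
}

# V = tested positive, U = hospitalized in an ICU.
VARIABLES = {"V": "positive cases", "U": "ICU patients"}


def describe(symbol: str) -> str:
    """Decode a label such as 'A-V-6' (hypothetical area-variable-index format)."""
    area, variable, index = symbol.split("-")
    return f"{VARIABLES[variable]}, {AREAS[area][int(index)]}"


print(describe("A-V-6"))
print(describe("B-V-2"))
```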
3. Data Preprocessing
Missing data and other anomalies are the first challenge when designing predictive
models, as statistical methods are generally designed and tested under the
assumption of no missing observations (Moritz et al., 2015).
Before delving into the details of the proposed procedure, a word of caution is
needed: unfortunately, a visual inspection of the database suggests the presence of
a number of anomalous data points at both the regional and the national level. The
detected anomalies might be associated with the process of collecting biological
samples and with the testing procedures. In fact, the typical lab workflow is
governed by a set of rigid protocols which might be critically affected by factors
such as the availability of manpower, swabs, reagents and other laboratory
materials. In emergency situations, such a workflow can be disrupted and temporal
inconsistencies might appear as a result. For example, a set of samples might be
delivered to a laboratory with a longer than usual delay with respect to the time of
collection, or a given lab might be able to complete the screening process for only
a certain number of samples. In both cases, a shift of one day (or more) in the
release of the lab results can reasonably be expected. A further source of anomalies
is the data entry and data editing processes, carried out in a working environment
necessarily affected by the risk of contagion and under rigid deadlines.
An example of such anomalous data is given in Figure 1, where the series
$^{1}V_{1,t}$ (Lombardia) is depicted. Here, some data points showing values
inconsistent with the overall pattern are clearly noticeable. Given the (very small)
available sample size, the relative weight of such data is almost surely not
negligible and can introduce severe distortions in the inference on the model
parameters and thus in the predicted values.
In Figure 2 the corrected version of the series $^{1}V_{1,t}$, obtained by applying
the Kalman smoother, is depicted. Not only does this series lend itself to better
visual inspection but, more importantly, it is better suited to processing by the
adopted prediction model.
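The kind of correction described in this section can be illustrated with a minimal sketch. The paper does not state the state-space model or the rule used to flag anomalies, so everything specific here is an assumption: a scalar local-level model (random walk plus noise), fixed variances `q` and `r`, and anomalous points already marked as `None` by the user. A Kalman filter pass followed by a Rauch-Tung-Striebel backward pass then fills the flagged points.

```python
def kalman_smooth_local_level(y, q=1.0, r=1.0):
    """Smooth a series under an assumed local-level model.

    level[t] = level[t-1] + w[t], Var(w) = q   (state equation)
    y[t]     = level[t]   + v[t], Var(v) = r   (observation equation)
    Entries of y equal to None are treated as missing and filled in.
    """
    n = len(y)
    first = next(v for v in y if v is not None)
    m_pred, p_pred = [0.0] * n, [0.0] * n   # one-step-ahead mean / variance
    m_filt, p_filt = [0.0] * n, [0.0] * n   # filtered mean / variance
    for t in range(n):
        if t == 0:
            mp, pp = first, 1e6             # vague prior on the initial level
        else:
            mp, pp = m_filt[t - 1], p_filt[t - 1] + q
        m_pred[t], p_pred[t] = mp, pp
        if y[t] is None:                    # no update when the point is missing
            m_filt[t], p_filt[t] = mp, pp
        else:
            gain = pp / (pp + r)
            m_filt[t] = mp + gain * (y[t] - mp)
            p_filt[t] = (1.0 - gain) * pp
    # Backward (Rauch-Tung-Striebel) pass.
    smoothed = m_filt[:]
    for t in range(n - 2, -1, -1):
        g = p_filt[t] / p_pred[t + 1]
        smoothed[t] = m_filt[t] + g * (smoothed[t + 1] - m_pred[t + 1])
    return smoothed


# Hypothetical toy series; None marks a point flagged as anomalous.
series = [100.0, 130.0, None, 210.0, 260.0]
print(kalman_smooth_local_level(series, q=400.0, r=25.0))
```

In practice the variances would be estimated from the data rather than fixed by hand; they are fixed here only to keep the sketch short.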
4. Theoretical framework
The approach used in this paper relies on i) the theory of stochastic processes and
ii) a resampling method. While the former is necessary to generate the input
(predicted values) of the bootstrap algorithm, as well as to justify the use of the
outlier correction method, the latter serves the purpose of
1. generating the final predictions, which are affected by a reduced amount of
uncertainty (with respect to those generated by the stochastic model);
2. yielding the related confidence intervals.
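Let $X = (X_t)_{t \in \mathbb{Z}}$ be a real second-order stationary process. It is
said to admit an ARMA(p,q) representation ($p, q \in \mathbb{Z}$) if, for some
constants $a_1, \dots, a_p, b_1, \dots, b_q$:

$$\sum_{j=0}^{p} a_j X(t-j) = \sum_{j=0}^{q} b_j \varepsilon(t-j) \quad (t \in \mathbb{Z}), \qquad a_0 = b_0 = 1 \qquad (1)$$

$$E\{\varepsilon(t) \mid \mathcal{F}_{t-1}\} = 0, \qquad E\{\varepsilon^{2}(t) \mid \mathcal{F}_{t-1}\} = \sigma^{2}$$

$$\sum_{j=0}^{p} a_j Z^{j} \neq 0, \qquad \sum_{j=0}^{q} b_j Z^{j} \neq 0, \qquad |Z| \leq 1$$

Here $\mathcal{F}_t$ denotes the sigma algebra induced by $\varepsilon(j)$, $j \leq t$,
and the polynomials $\sum_{j=0}^{p} a_j Z^{j}$ and $\sum_{j=0}^{q} b_j Z^{j}$ are
assumed not to have common zeros.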
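To make the sign convention of Eqn. 1 concrete, the sketch below generates a series from a given innovation sequence by solving Eqn. 1 for X(t). It is only an illustration: the function name `arma_filter` is hypothetical, and values before t = 0 are assumed to be zero.

```python
def arma_filter(a, b, eps):
    """Generate X from the innovations eps under the convention of Eqn. 1.

    sum_{j=0..p} a[j] * X[t-j] = sum_{j=0..q} b[j] * eps[t-j], a[0] = b[0] = 1,
    hence X[t] = sum_{j>=0} b[j] * eps[t-j] - sum_{j>=1} a[j] * X[t-j].
    Terms with a negative time index are taken to be zero (sketch assumption).
    """
    if a[0] != 1 or b[0] != 1:
        raise ValueError("Eqn. 1 requires a[0] = b[0] = 1")
    x = []
    for t in range(len(eps)):
        ma_part = sum(b[j] * eps[t - j] for j in range(len(b)) if t - j >= 0)
        ar_part = sum(a[j] * x[t - j] for j in range(1, len(a)) if t - j >= 0)
        x.append(ma_part - ar_part)
    return x


# Response to a unit impulse for a_1 = -0.5 (i.e. X(t) = 0.5 X(t-1) + eps(t)).
print(arma_filter([1, -0.5], [1], [1, 0, 0, 0]))
```

With a_1 = -0.5 the polynomial 1 - 0.5 Z vanishes only at Z = 2, outside the unit disc, so the condition stated above holds and the impulse response decays geometrically.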
The dynamics of the series under investigation are not suitable for this
theoretical framework, which requires second-order stationarity. Stationarity is
achieved by pre-processing the series according to the following filter:
$\log(\nabla^{d})$, where the symbol $\nabla$ is the difference operator and the
exponent d indicates the order of the difference. To fully understand the role
played by $\nabla$, the backward operator B is now introduced. In essence, $B^{p}$
moves the time index of an observation back by p time intervals, i.e.
$B^{p} x_t = x_{t-p}$, and thus we have that $\nabla^{d} x_t = (1 - B)^{d} x_t$.
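A short sketch of the pre-processing filter follows. It assumes the standard definition of the difference operator in terms of the backward operator, (1 - B)^d, and it assumes that the logarithm is taken before differencing; the notation log(∇^d) in the text leaves the order open, so this ordering is a guess. Function names are hypothetical.

```python
import math


def difference(x, d=1):
    """Apply (1 - B)^d, i.e. difference the series d times."""
    out = list(x)
    for _ in range(d):
        out = [curr - prev for prev, curr in zip(out, out[1:])]
    return out


def log_difference(x, d=1):
    """Take logs, then difference d times (ordering is an assumption)."""
    return difference([math.log(v) for v in x], d)


def invert_log_difference(first_value, log_diffs):
    """Map first-order log-differences back to levels, given the first level."""
    levels = [first_value]
    for step in log_diffs:
        levels.append(levels[-1] * math.exp(step))
    return levels


counts = [100.0, 120.0, 150.0, 165.0]        # hypothetical toy counts
growth = log_difference(counts)              # approximate daily growth rates
print(growth)
print(invert_log_difference(counts[0], growth))
```

The inverse step is what allows forecasts produced on the transformed scale to be reported as counts again.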
Unlike standard bootstrap schemes, in the maximum entropy bootstrap (MEB) the
resample set Ω mimics the observed realization of the underlying stochastic
process. In MEB, a large number of subsets, say $\{\omega_1, \dots, \omega_N\}$,
become the elements of Ω, each of them containing a large number of replicates
$\{x_1, \dots, x_J\}$. Among the important features of the MEB scheme, it is worth
mentioning the consistency of its bootstrap samples with the ergodic theorem (see,
e.g., Birkhoff (1931)) and with the probabilistic structure of the observed time
series. In Figure 3 an example of the application of MEB to the variable
$^{1}V_{t,1}$ is given.
1. Eqn. 1 is estimated for both $V_t$ and $U_t$, so that the model orders $M_1$ and
$M_2$ become available;
2. for each time series $V_t$ and $U_t$ the MEB procedure is applied, so that the
sets $\mathcal{V}$ and $\mathcal{U}$, each containing B = 500 “bona fide”
replications, are available, i.e. $\mathcal{V} \equiv [V_1^{*}, V_2^{*}, \dots, V_B^{*}]$
and $\mathcal{U} \equiv [U_1^{*}, U_2^{*}, \dots, U_B^{*}]$ (in Figure 3 the set
$\mathcal{V}$ for the variable $^{1}V_{t,1}$ is given);
3. for each of the replications stored in $\mathcal{V}$, Eqn. 1 is estimated
according to the selected model order, i.e. $M_1$, and the 1- to 10-step-ahead
predictions, as well as the 5% and 95% bootstrap confidence interval, are generated;
4. the B predictions and the confidence intervals obtained in the previous step are
stored in the B × 3 matrix $[F_V(h);\ h = 1, 2, \dots, 10]$, whose columns are:
lower bootstrap confidence interval, bootstrap prediction and upper bootstrap
confidence interval, respectively denoted by the symbols $CI^{*}_{L,b}(h)$,
$V^{*}_{t,b}$, $CI^{*}_{U,b}(h)$, $b = 1, \dots, B$;
5. the median value $\hat{V}^{*} = M(V^{*}_{t,b})$ is then extracted along with the
≈ 95% confidence intervals $CI^{*}_{L,b}(h = 1)$ and $CI^{*}_{U,b}(h = 1)$, computed
according to the t-percentile method. The explanation of this procedure goes beyond
the scope of this paper; the interested reader is therefore referred to the
excellent paper by Berkowitz and Kilian (2000);
6. $CI^{*}_{L}(h = 2, \dots, 10)$ and $CI^{*}_{U}(h = 2, \dots, 10)$ (the subscript b
is omitted for brevity) are computed conditional on a subset of $\mathcal{V}$, say
$\tilde{\mathcal{V}}$, made up of the bootstrap replications whose range falls
between the minimum and maximum values of the confidence intervals computed for
h = 1. In symbols:

$$\min(CI^{*}_{U}(h = 1)) \leq [\tilde{\mathcal{V}} \subset \mathcal{V}] \leq \max(CI^{*}_{U}(h = 1)); \qquad (3)$$

7. steps 1–6 are repeated for $U_t$, so that a new prediction matrix of dimension
B × 3 is built, i.e. $[F_U(h);\ h = 1, 2, \dots, 10]$, whose columns are as in
$F_V(h)$ and denoted by the symbols $CI^{*}_{L}(h)$, $\hat{U}^{*}_{t}$, $CI^{*}_{U}(h)$.
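The core of steps 3 to 5 (fit a model on each replication, forecast, then summarize across replications) can be sketched as follows. This is a simplified illustration under loudly stated assumptions, not the paper's implementation: an AR(1) with intercept fitted by least squares stands in for the ARMA model of Eqn. 1, a plain nearest-rank percentile stands in for the t-percentile method, the conditioning of step 6 is omitted, and the replications are hypothetical noise-perturbed copies of a toy series rather than MEB output.

```python
import math
import random
import statistics


def fit_ar1(x):
    """Least-squares fit of x[t] = c + phi * x[t-1]; returns (c, phi)."""
    prev, curr = x[:-1], x[1:]
    mean_prev, mean_curr = statistics.fmean(prev), statistics.fmean(curr)
    sxx = sum((p - mean_prev) ** 2 for p in prev)
    sxy = sum((p - mean_prev) * (c - mean_curr) for p, c in zip(prev, curr))
    phi = sxy / sxx
    return mean_curr - phi * mean_prev, phi


def forecast_ar1(x, horizon):
    """Iterate the fitted AR(1) forward for `horizon` steps."""
    c, phi = fit_ar1(x)
    path, last = [], x[-1]
    for _ in range(horizon):
        last = c + phi * last
        path.append(last)
    return path


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list."""
    rank = math.ceil(pct / 100 * len(sorted_values))
    return sorted_values[min(max(rank, 1), len(sorted_values)) - 1]


def bootstrap_forecast(replicates, horizon=10, lower=5, upper=95):
    """Return one (lower, median, upper) triple per forecast horizon."""
    paths = [forecast_ar1(rep, horizon) for rep in replicates]
    summary = []
    for h in range(horizon):
        values = sorted(path[h] for path in paths)
        summary.append((percentile(values, lower),
                        statistics.median(values),
                        percentile(values, upper)))
    return summary


# Hypothetical stand-in for the B = 500 replications (NOT produced by MEB).
rng = random.Random(0)
observed = [5.0 + 0.2 * t for t in range(30)]
replicates = [[v + rng.gauss(0.0, 0.05) for v in observed] for _ in range(500)]
for h, (lo, med, hi) in enumerate(bootstrap_forecast(replicates), start=1):
    print(h, round(lo, 3), round(med, 3), round(hi, 3))
```

Taking the median across replications, rather than the forecast from the single observed series, is what the text refers to as generating predictions with a reduced amount of uncertainty.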
Fig. 3. Lombardia: B = 500 bootstrap replications performed via the MEB algorithm
on the adjusted, log-transformed data (the original time series is depicted in
red). Horizontal axis: Time.
6. Empirical evidence
At the national level (data have been plotted in Figure 4), the peak in the number
of CoViD-19-positive people will be reached on 2 April, with a predicted number of
positive cases close to 77,000. The maximum forecasted value for ICU occupancy,
expected for April 4, will be 4280. These values have been calculated using an
indirect methodology, i.e. by summing up the estimates obtained at a disaggregated
level. The model outcomes obtained at the disaggregated level are discussed below.
• Lombardia, the most affected region, will reach the peak of positive cases
(25963) and of the demand for ICUs (1425) on 2 and 4 April, respectively;
• Emilia Romagna is the Region second most affected by COVID-19 and still shows
a very high number of victims. The number of infected people will reach its peak
on April 5th, whereas the number of cases in Intensive Care will continue to grow
at a progressively slower rate over the forecasting period;
• Veneto ranks third among the Regions by number of deaths. Here, the number of
positive cases, as well as the number of cases in ICU, will reach its peak on
April 3rd;
• for Piedmont, the fourth Region by number of victims, the predicted positive
cases will reach their peak on March 29th (6635), whereas the number of persons in
ICU is predicted to peak at 431 on 31 March;
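The indirect methodology mentioned at the start of this section (summing the disaggregated estimates day by day and then locating the peak) can be sketched as below. The figures are the median ICU forecasts copied from the Lombardia and Veneto tables of this paper; since only two regions are summed, the result is a partial total for illustration and not the national figure. Function names are hypothetical.

```python
DATES = ["29 March", "30 March", "31 March", "1 April", "2 April",
         "3 April", "4 April", "5 April", "6 April", "7 April"]

# Median ICU forecasts, copied from the Lombardia and Veneto tables.
REGIONAL_MEDIANS = {
    "Lombardia": [1369, 1389, 1402, 1412, 1420, 1427, 1425, 1421, 1416, 1412],
    "Veneto": [357, 360, 365, 370, 375, 378, 375, 376, 370, 369],
}


def aggregate(regional):
    """Sum the regional forecasts day by day."""
    return [sum(day) for day in zip(*regional.values())]


def peak(dates, totals):
    """Return (date, value) of the first maximum of the aggregated series."""
    value = max(totals)
    return dates[totals.index(value)], value


totals = aggregate(REGIONAL_MEDIANS)
print(totals)
print(peak(DATES, totals))
```

One consequence of this design is that the aggregated peak need not coincide with the peak date of any single region, because it depends on how the regional curves overlap.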
Fig. 4. Italy: time series of the number of people tested positive (data corrected
via Kalman filter, left axis) and of the number of people hospitalized in ICUs
(right axis). Legend: number of people tested positive (corrected); number of ICUs.
Horizontal axis: Index.
[Figure panels, images not reproduced: Lombardia, Piedmont, Liguria, Valle d’Aosta,
Veneto, Emilia Romagna, Lazio, Campania]
Lombardia
Date        CI*_L(h)    Û*_t    CI*_U(h)
29 March    1272        1369    1426
30 March    1286        1389    1458
31 March    1287        1402    1482
1 April     1291        1412    1490
2 April     1289        1420    1509
3 April     1286        1427    1526
4 April     1270        1425    1537
5 April     1252        1421    1558
6 April     1227        1416    1578
7 April     1191        1412    1599

Piedmont
Date        CI*_L(h)    Û*_t    CI*_U(h)
29 March    422         433     451
30 March    428         436     457
31 March    428         438     470
1 April     427         436     478
2 April     4241        434     491
3 April     419         427     498
4 April     409         417     508
5 April     395         406     512
6 April     373         396     523
7 April     354         383     535

Liguria
Date        CI*_L(h)    Û*_t    CI*_U(h)
29 March    156         163     168
30 March    159         163     171
31 March    158         163     174
1 April     158         162     178
2 April     158         161     181
3 April     156         159     185
4 April     153         157     187
5 April     148         153     191
6 April     143         151     193
7 April     137         145     195
Valle d’Aosta
Date        CI*_L(h)    Û*_t    CI*_U(h)
29 March    24          25      27
30 March    25          25      29
31 March    24          25      30
1 April     24          24      31
2 April     23          22      32
3 April     22          21      33
4 April     21          19      34
5 April     19          16      36
6 April     18          14      36
7 April     15          12      37

Veneto
Date        CI*_L(h)    Û*_t    CI*_U(h)
29 March    331         357     367
30 March    335         360     375
31 March    335         365     382
1 April     334         370     391
2 April     332         375     401
3 April     324         378     409
4 April     312         375     424
5 April     298         376     440
6 April     286         370     448
7 April     267         369     461
Emilia Romagna
Date        CI*_L(h)    Û*_t    CI*_U(h)
29 March    306         349     379
30 March    309         365     394
31 March    309         369     409
1 April     312         372     421
2 April     313         379     433
3 April     314         386     439
4 April     311         388     442
5 April     307         392     442
6 April     298         399     447
7 April     288         401     444

Lazio
Date        CI*_L(h)    Û*_t    CI*_U(h)
29 March    120         136     146
30 March    125         139     157
31 March    123         141     171
1 April     118         144     179
2 April     111         145     200
3 April     104         145     219
4 April     95          141     235
5 April     84          139     264
6 April     73          134     294
7 April     61          131     343
Campania
Date        CI*_L(h)    Û*_t    CI*_U(h)
29 March    126         142     167
30 March    133         159     197
31 March    138         176     228
1 April     144         189     257
2 April     147         201     293
3 April     148         206     326
4 April     143         211     348
5 April     134         203     382
6 April     120         197     411
7 April     109         192     447
7. Disclaimer
The views and opinions expressed in this article are those of the author and do not
necessarily reflect the official policy or position of the Italian National Institute
of Statistics.