
Accurate long‑range forecasting of COVID‑19 mortality in the USA


Pouria Ramazi, Arezoo Haratian, Maryam Meghdadi, Arash Mari Oriyad, Mark A. Lewis, Zeinab Maleki, Roberto Vega, Hao Wang, David S. Wishart & Russell Greiner

Overview
● Introduction
● Data Collection
● Data Preprocessing
● Prediction
● Results
● STPredict Package

COVID-19 Pandemic
Coronavirus disease (COVID-19) was declared a public health emergency of international concern in January 2020 by the World Health Organization.
Since then, about 260 million confirmed cases and almost 5 million deaths due to COVID-19 have been reported worldwide (as of Nov 28, 2021).

COVID-19 Pandemic
Some of these cases and deaths might have been prevented if more aggressive public policies, such as travel restrictions and lockdowns, had been implemented in a timely manner.
Robust policy development and planning require pandemic models that can accurately predict the number of COVID-19 cases and deaths far into the future.
Governments and policy makers also need an accurate tool to examine the effects of different preventive scenarios.

Main Goal
To address these issues, our team has designed and developed a novel approach that improves long-range forecasting substantially over existing COVID-19 prediction models.
The approach uses machine learning methods to predict the number of new COVID-19 confirmed cases and deaths in the US over different time periods.
To be precise, the main goal is to predict the number of deaths and confirmed cases in each county, each state, and the whole country of the US for the next five to ten weeks.

Overview
● Introduction
● Data Collection
● Data Preprocessing
● Prediction
● Results
● STPredict Package

Data Collection
The first phase of every data-driven problem, such as COVID-19 mortality prediction, is gathering proper data.
We have collected information related to the outbreak of the COVID-19 disease in the United States, including data from each of the 3142 US counties since the beginning of the outbreak.
The dataset was assembled from many public online databases and is now available online for data scientists at https://doi.org/10.6084/m9.figshare.12986069.v1.

County Map of the US

Dataset
Considerable effort went into collecting 48 covariates that may be relevant to the pandemic dynamics: demographic, geographic, climatic, traffic, public-health, social-distancing-policy-adherence, and political characteristics.
A covariate is an independent variable or characteristic that can influence the outcome of a given statistical trial and is possibly predictive of the outcome under study.
There are two main types of covariates in the dataset:
- Fixed Covariates: characteristics that remain constant over long periods
- Temporal Covariates: variables that change from day to day

Fixed Covariates

Covariate | Description | Percent observed
Total population | Total population of the county in 2018 | 100%
Population density | Population per square mile | 100%
Female ratio | Total number of females divided by total population | 100%
Age distribution | Percentage of residents in the age groups 0-39, 40-59, 60-79, and 80 or older | 100%
Education level distribution | Percentage of residents with different levels of education | 100%
Area | Land area of the county | 100%
Latitude | The latitude of the county | 100%
Longitude | The longitude of the county | 100%
Houses density | Number of houses per square mile | 100%
Academic population ratio | Total number of students and staff of universities and colleges divided by total population in the county | 100%
Hospital bed ratio | Total number of hospital beds divided by total population in the county | 100%
ICU bed ratio | Total number of ICU beds divided by total population | 77%
Ventilator capacity ratio | Ventilator capacity divided by total population | 77%
Smoker ratio | Adult smoking ratio | 100%
Diabetes ratio | Adult diabetes ratio | 100%
Airport distance | Distance to the nearest international airport with an average daily passenger load above 10 | 100%
Passenger load ratio | Average daily passenger load of the nearest international airport | 100%
Deaths in 2018 per 100,000 residents | Number of deaths in 2018 per 100,000 residents in each county | 76%

Temporal Covariates

Covariate | Description | Percent observed
Moisture | Precipitation of the county | 89%
Temperature | The temperature of the county | 80%
Severe cases | Number of severe cases in the county | 22%
Flight arrival | Number of incoming flights | 45%
Social distancing total grade | A grade of overall social-distancing adherence | 59%
Neighboring virus pressure | A measure of transmission of the virus from neighboring counties to the county | 100%
Occupied ICU beds | Total number of staffed inpatient ICU beds that are occupied | 45%
Grocery and pharmacy mobility percent change | Percent change in mobility trends in grocery stores and pharmacies compared to the pre-COVID-19 period | 44%
Parks mobility percent change | Percent change in mobility trends in parks (including local and national parks, public beaches, marinas, dog parks, plazas, and public gardens) compared to the pre-COVID-19 period | 18%

Target Variables
The dataset includes two target variables:
● Confirmed Cases
● Deaths

Notably, the target variables themselves can also be used as temporal covariates.

The dataset also contains two identifying characteristics:

● Date (Temporal ID)
● County Code (Spatial ID)

Raw Data Table
(Total population and population density are fixed covariates; moisture, temperature, and severe cases are time-dependent covariates.)

County code | Date | Total population | Population density | … | Moisture | Temperature | Severe cases
01001 | 2020/01/20 | 55601 | 93.5 | … | 0.23 | 31 | 1
⁞
01001 | 2020/05/08 | 55601 | 93.5 | … | 0.50 | 25 | 5
⁞
56045 | 2020/01/20 | 6967 | 2.9 | … | 0.47 | 6 | 1
⁞
56045 | 2020/05/08 | 6967 | 2.9 | … | 0.96 | 10 | 9

Covariates in COVID-19 Mortality Prediction for the US
We use a dataset containing the following 9 covariates alongside the number of deaths and confirmed cases:
● Average Number of Daily Tests
● Average Daily Temperature
● Daily Precipitation
● 6 Social Distancing Covariates (mobility trends of individuals visiting different places, including parks, transit stations, residences, workplaces, grocery stores and pharmacies, and retail shops and recreation centers)

Overview
● Introduction
● Data Collection
● Data Preprocessing
● Prediction
● Results
● STPredict Package

Data Preprocessing

Given the raw data, several preprocessing steps are applied to produce the preprocessed dataset.

There are four main processes in this step:

● Imputation
● Spatial and Temporal Scale Changing
● Target and Feature Mode Changing
● Constructing Historical Data

Imputation

Imputation is the process of replacing missing data with substituted values.

Some of the covariates, such as the number of ICU beds, have no recorded value for certain counties or days.

To handle these missing values, we use imputation methods such as replacing them with the mean or median of nearby past and future covariate values.
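As an illustration, here is a minimal pandas sketch of this kind of window-based imputation; the data and column name are hypothetical, not taken from the actual dataset:

import pandas as pd

# Hypothetical daily ICU-bed series for one county, with gaps.
df = pd.DataFrame({
    "date": pd.date_range("2020-03-01", periods=6, freq="D"),
    "icu_beds": [12.0, None, 13.0, None, None, 15.0],
})

# Mean of nearby past and future values: a centered 3-day window
# (min_periods=1 lets the window shrink at the edges of the series).
window_mean = df["icu_beds"].rolling(window=3, center=True, min_periods=1).mean()
df["icu_beds"] = df["icu_beds"].fillna(window_mean)
print(df)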

Spatial and Temporal Scale Changing

Each row of the COVID-19 dataset corresponds to one county, so the spatial scale of the dataset is the county level.

However, we may sometimes want to predict the number of deaths (or confirmed cases) for each state or for the whole country. To this end, we first change the spatial scale of the dataset by aggregating the values of the covariates.

Likewise, we have the option to predict the target value for each week or month instead of each day, by aggregating the values of every covariate to change the temporal scale.

The available aggregation methods are summation and mean.
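For instance, here is a minimal pandas sketch of both scale changes, with hypothetical column names and values (counts such as deaths are summed; quantities such as temperature would use the mean instead):

import pandas as pd

# Hypothetical daily county-level rows with a 'state' column.
df = pd.DataFrame({
    "state":  ["AL", "AL", "AL", "AL"],
    "county": ["01001", "01001", "01003", "01003"],
    "date":   pd.to_datetime(["2020-05-04", "2020-05-05"] * 2),
    "deaths": [1, 2, 3, 4],
})

# Spatial scale change: aggregate counties to states by summation.
state_daily = df.groupby(["state", "date"], as_index=False)["deaths"].sum()

# Temporal scale change: aggregate days to weeks by summation.
state_weekly = (state_daily
                .groupby(["state", pd.Grouper(key="date", freq="W")])["deaths"]
                .sum()
                .reset_index())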


Target and Feature Mode Changing

By default, the target values of the COVID-19 outbreak dataset are the number of deaths (or confirmed cases) for each day, which is called the 'normal' target mode.

We provide the possibility of changing the target mode from 'normal' to 'cumulative', 'differential', or 'moving average'. For example, in the cumulative case, the value of the target variable on day t is the sum of all target values from the first day in the dataset up to day t.

Similar to the target variable, we can apply modifications to the values of the covariates using methods like standardization and normalization. We call this process feature mode changing.
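A minimal sketch of these target- and feature-mode changes in pandas, on a hypothetical single-county daily series:

import pandas as pd

df = pd.DataFrame({"deaths": [1, 3, 2, 5, 4, 6, 8],
                   "temperature": [20.0, 22.0, 19.0, 25.0, 24.0, 21.0, 23.0]})

y = df["deaths"]                     # 'normal' target mode
y_cumulative = y.cumsum()            # 'cumulative': total up to day t
y_differential = y.diff()            # 'differential': day-to-day change
y_moving_avg = y.rolling(3).mean()   # 'moving average', e.g. a 3-day window

# Feature mode changing: standardization and min-max normalization.
x = df["temperature"]
x_standardized = (x - x.mean()) / x.std()
x_normalized = (x - x.min()) / (x.max() - x.min())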

Constructing Historical Data

This is the main process for converting the raw dataset into a dataset suitable for our learner.

We are interested in predicting targets r days into the future based on h days of data. Thus, the prediction of the target variables at day t+r uses the covariates measured at days t, t-1, ..., t-h+1.

We call the parameters r and h the 'forecast horizon' and the 'history length' of the prediction, respectively.

Constructing Historical Data

Recall that each row of the raw data table corresponds to a specific day and county.

Now suppose that for a given day, say February 14, we want to predict the target value on February 16 (r=2), using only the information from February 14 (h=1).

Constructing Historical Data
● t = February 14
● r=2
● h=1

Constructing Historical Data

So the data instance for h=1 on the previous slide consists of the covariate values from February 14, paired with the target value from February 16.

As another example, suppose we want to use the information from days t and t-1 to predict the target value of day t+r.

Constructing Historical Data
● t = February 14
● r=2
● h=2

Constructing Historical Data

So, given the values of the forecast horizon and the history length, we construct the historical data from the raw data.

In the context of the COVID-19 pandemic, this means, for example, that we aim to predict the number of deaths (or confirmed cases) on day t+r using the temperature and the grocery and pharmacy mobility percent change on days t, t-1, ..., t-h+1.
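The construction itself amounts to shifting columns. Below is a minimal sketch for a single county, with hypothetical data and column names (the real pipeline also handles multiple counties via the spatial ID):

import pandas as pd

def make_historical(df, covariates, target, h, r):
    """One row per day t: covariates at days t, t-1, ..., t-h+1,
    labeled with the target at day t+r. Rows lacking a full history
    or a known future target are dropped."""
    out = pd.DataFrame(index=df.index)
    for c in covariates:
        for lag in range(h):                    # lag 0 is day t itself
            out[f"{c} t-{lag}"] = df[c].shift(lag)
    out["label"] = df[target].shift(-r)         # target at day t+r
    return out.dropna()

# Example with r=2 and h=2, as on the previous slides.
df = pd.DataFrame({"temperature": [31, 30, 28, 27, 29, 26],
                   "deaths": [1, 2, 2, 4, 5, 7]})
hist = make_historical(df, ["temperature", "deaths"], target="deaths", h=2, r=2)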

Overview
● Introduction
● Data Collection
● Data Preprocessing
● Prediction
● Results
● STPredict Package

Prediction Phase

After the preparation of historical data, the prediction phase begins.

As an overview of this phase, the preprocessed data is split into three parts: training data, validation data, and testing data. We then train predictive models (such as machine learning models) on feature sets of the training data and measure the performance of the models on the validation set according to performance measures like MAPE. Based on the reported performances, the best model and the best feature set are used to make predictions on the testing data.

Prediction Phase

As mentioned in the previous slide, the prediction phase consists of the following
processes:
● Data Splitting
● Feature Ranking
● Best Feature Set and Model Selection

Data Splitting

As the first step of the prediction phase, we split the preprocessed (historical) data into the following three parts, a common pattern in machine learning approaches:
● Training Data
● Validation Data
● Testing Data

An important question is how we should perform this partitioning.

Is random partitioning a good choice?

Data Splitting

When predicting the future, we only have present and past data available for training predictive models. So a realistic measurement of model performance can be achieved only by keeping this same arrangement in the data-splitting process: selecting the testing set from the last "fold" of temporal units in the data, rather than using the commonly practiced random split.

So, to predict the number of COVID-19 deaths (or confirmed cases), we select the testing set from the last temporal units (the most recent days or weeks); this is why our approach is called 'LaFoPaFo', short for LAst FOld PArtitioning FOrecaster.

Data Splitting
● r=9
● h=3


Data Splitting

Gap in Data Splitting

For forecast horizons greater than one (r>1), the gap between the present time and the target time must be taken into account during data splitting, ensuring that the training instances maintain this temporal distance from the testing instances.

More specifically, to predict an instance at time t with a target variable at time t+r, we may not use any of the instances between times t and t+r.
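A minimal sketch of this last-fold split with the gap, assuming the historical data frame hist is sorted chronologically with one instance per temporal unit, and r is the forecast horizon:

def last_fold_split(hist, r, n_test=1, n_val=1):
    """Chronological split: the last n_test instances form the test set;
    a gap of r-1 instances is dropped before both the test and the
    validation folds so that no earlier instance overlaps their target times."""
    test = hist.iloc[-n_test:]
    rest = hist.iloc[:len(hist) - n_test - (r - 1)]    # gap before the test fold
    val = rest.iloc[-n_val:]
    train = rest.iloc[:len(rest) - n_val - (r - 1)]    # gap before the validation fold
    return train, val, test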

Feature Ranking

Redundant and irrelevant attributes in a dataset can mislead the modeling algorithm and cause overfitting on the training data, which can adversely affect the modeling and cripple the predictive accuracy.

To address this problem, the prediction module executes a ranking process to obtain an ordered list of ranked features.

We provide two types of ranking methods:


● Correlation-Based Ranking Method
● mRMR (minimum redundancy maximum relevance) Ranking Method
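A minimal sketch of the correlation-based variant (the mRMR method additionally penalizes redundancy among the selected features):

import pandas as pd

def rank_features_by_correlation(X: pd.DataFrame, y: pd.Series) -> list:
    """Order features by the absolute Pearson correlation of each
    column of X with the target y, strongest first."""
    scores = X.corrwith(y).abs().sort_values(ascending=False)
    return list(scores.index)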

Ranked Features of the COVID-19 Dataset
1. Past confirmed
2. Neighboring virus pressure
3. Past deaths
4. Total population
5. Population density
6. Meat plants
7. Houses density
8. Social distancing visitation grade

Best Feature Set and Best Model Selection
Perhaps the most distinguishing feature of our approach, compared with other similar tools, is its systematic search over ranked features and models during the training and validation phases to find the best configuration.

In this context, the best configuration means the best feature set and the best model, which together achieve the best performance (the minimum loss) on the validation data.

Best Feature Set Selection

To find the best feature set, our method searches over the ordered list of ranked features produced by the feature-ranking process, rather than the whole feature-set space. This spares us from considering a long history for the temporal covariates and keeps the number of features, and hence the computational complexity of the training process, manageable.

Then, according to the performance of the predictive models on the validation set, the feature set with the minimum loss is selected.

Best Model Selection

While searching for the best feature set, our approach simultaneously examines different models to select the best one based on the models' performance on the validation set.

We not only provide predefined models such as KNN (k-nearest neighbors), GBM (gradient boosting machine), and GLM (generalized linear model), but also allow the execution of user-defined, custom models.

Furthermore, to utilize the power of multiple models for more accurate predictions, a mixed-model option is provided, which combines the predictions of several models to train a new model.

For COVID-19 mortality prediction, we used the KNN model.
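Putting the two searches together, here is a minimal sketch of the validation loop, assuming X_train/y_train, X_val/y_val, and ranked_features come from the previous steps (scikit-learn's LinearRegression stands in for the GLM):

from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error

models = {"KNN": KNeighborsRegressor(),
          "GBM": GradientBoostingRegressor(),
          "GLM": LinearRegression()}

best_loss, best_model, best_features = float("inf"), None, None
for name, model in models.items():
    for k in range(1, len(ranked_features) + 1):
        feats = ranked_features[:k]            # top-k prefix of the ranked list
        model.fit(X_train[feats], y_train)
        loss = mean_absolute_percentage_error(y_val, model.predict(X_val[feats]))
        if loss < best_loss:
            best_loss, best_model, best_features = loss, name, feats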


Performance Measures

We examine the performance of a specific model on the validation data using different performance measures, such as:
● MAPE (mean absolute percentage error)
● MAE (mean absolute error)
● MASE (mean absolute scaled error)
● MSE (mean squared error)

For COVID-19 mortality prediction, we used the MAPE performance measure.
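For reference, the MAPE of predictions over true values y_1, ..., y_n is

\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|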

Training, Validation and Testing

To put it all together, the following processes are carried out in the prediction phase:
● First, we split the data into training, validation, and testing sets.
● Then, a list of ranked features is obtained from the training and validation data.
● Next, the best model and the best feature set are found systematically, based on a performance measure evaluated on the validation set.
● Finally, the testing process begins: the training and validation data are used together to retrain the best model obtained in the previous step.

Testing Type

The proposed method also supports two different ways of predicting the test points' target variables:
● Whole-as-One, which makes predictions on the whole testing set after a single training and validation process.
● One-by-One, which performs the training and validation processes for each testing-set instance separately, so that each time one of the testing-set instances is used as the last fold in the data. So, for each instance in the testing set, we have a training set, a validation set, and a gap that differ from those of the other testing-set instances.

For COVID-19 mortality prediction, we used the 'LaFoPaFo' learner, which utilizes the 'One-by-One' testing type.
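A minimal sketch of the One-by-One loop, reusing the hypothetical last_fold_split from the data-splitting sketch; select_best_model stands for the model/feature-set search described earlier, and hist, r, and n_test are assumed from the previous steps:

predictions = []
for i in range(n_test):
    # Truncate the data so that test instance i is the very last fold.
    data_i = hist.iloc[:len(hist) - (n_test - 1 - i)]
    train, val, test = last_fold_split(data_i, r, n_test=1)
    model = select_best_model(train, val)          # hypothetical helper
    predictions.append(model.predict(test.drop(columns="label")))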
Future Forecasting

Finally, the proposed method provides the option of training the model obtained in the model-selection step on the whole available dataset, and making a real prediction of the target variable, whose true values are not yet known to us.

Overview
● Introduction
● Data Collection
● Data Preprocessing
● Prediction
● Results
● STPredict Package

Results

The Centers for Disease Control and Prevention (CDC) maintains a national forecasting site, where modeling groups continually update their forecasts for new COVID-19 cases and deaths in the USA.

We compare the forecasting results of our LaFoPaFo with the predictive models available at the CDC over the 7 weeks from September 27 to November 14, 2020.

Results (Deaths - 7 weeks from Sep 27 to Nov 14, 2020)

Results (Cases - 7 weeks from Sep 27 to Nov 14, 2020)

Results

LaFoPaFo’s projected number of deaths from 5 to 10 weeks in the future:

Results

LaFoPaFo’s weekly forecasting results for the number of deaths, ten weeks ahead:

Results

LaFoPaFo’s daily forecasting results for the number of deaths over the next 14 days:

Overview
● Introduction
● Data Collection
● Data Preprocessing
● Prediction
● Results
● STPredict Package

STPredict Package

To apply the proposed approach to other time-series problems in ecology, geology, and other fields, and to execute all of the above processes, from constructing the historical data to the testing procedure, with just one command, we have designed and implemented a Python package called STPredict, short for Spatio-Temporal Prediction.

The package manual (documentation) is available at https://stpredict.readthedocs.io/, and the package can be installed with the following command:

pip install stpredict

STPredict Package

Thank you :)
Any questions?

