Overview
● Introduction
● Data Collection
● Data Preprocessing
● Prediction
● Results
● STPredict Package
2
COVID-19 Pandemic
Coronavirus disease (COVID-19) was declared a public health emergency of international concern in January 2020 by the World Health Organization.
Since then, about 260 million confirmed cases and almost 5 million deaths due to COVID-19 have been reported worldwide (as of Nov 28, 2021).
4
COVID-19 Pandemic
Some of these cases and deaths might have been prevented if more aggressive public policies, such as travel restrictions and lockdowns, had been implemented in a timely manner.
Robust policy development and planning require pandemic models that can accurately predict the number of COVID-19 cases and deaths far into the future.
Governments and policy makers also need an accurate tool to examine the effect of different preventive scenarios.
5
Main Goal
To address these issues, our team has designed and developed a novel approach that improves long-range forecasting substantially over existing COVID-19 prediction models.
The approach uses machine learning methods to predict the number of new COVID-19 confirmed cases and deaths in the US over different time periods.
To be precise, the main goal is to predict the number of deaths and confirmed cases in each county, each state, and the whole US for the next five to ten weeks.
6
Overview
● Introduction
● Data Collection
● Data Preprocessing
● Prediction
● Results
● STPredict Package
7
Data Collection
The first phase of any data-driven problem, such as COVID-19 mortality prediction, is gathering proper data.
We have collected information related to the COVID-19 outbreak in the United States, including data from each of the 3,142 US counties since the beginning of the outbreak.
The dataset was assembled from many public online databases and is now available online for data scientists at http://10.0.23.196/m9.figshare.12986069.v1.
8
County Map of the US
9
Dataset
Considerable effort went into collecting 48 covariates that may be relevant to the pandemic dynamics: demographic, geographic, climatic, traffic, public-health, social-distancing-policy-adherence, and political characteristics.
A covariate is an independent variable or characteristic that can influence the outcome of a given statistical trial and is possibly predictive of the outcome under study.
There are two main types of covariates in the dataset:
- Fixed covariates: constant characteristics that do not change over long periods
- Temporal covariates: variables that vary day by day
10
Fixed Covariates
Covariate | Description | Percent observed
Total population | Total population of the county in 2018 | 100%
Population density | Population per square mile | 100%
Female ratio | Total number of females divided by total population | 100%
Age distribution | Percentage of residents in the age groups 0-39, 40-59, 60-79, 80 or older | 100%
Education level distribution | Percentage of residents with different levels of education | 100%
Area | Land area of the county | 100%
11
Fixed Covariates
Covariate | Description | Percent observed
Airport distance | Distance to nearest international airport with average daily passenger load more than 10 | 100%
Passenger load ratio | Average daily passenger load of nearest international airport | 100%
Deaths in 2018 per 100,000 residents | Number of deaths in 2018 per 100,000 residents in each county | 76%
14
Temporal Covariates
15
Temporal Covariates
Covariate | Description | Percent observed
Occupied ICU beds | Total number of staffed inpatient ICU beds that are occupied | 45%
16
Temporal Covariates
Covariate | Description | Percent observed
Grocery and pharmacy mobility percent change | Percent change in mobility trends at grocery stores and pharmacies compared to the pre-COVID-19 period | 44%
17
Target Variables
The dataset includes two target variables:
● Confirmed Cases
● Deaths
Notably, the target variables can also be used as temporal covariates.
18
Raw Data Table
Fixed covariates | Time-dependent covariates
County code, Total population, Population density, … | date, moisture, temperature, …, Severe cases
(one row per county per day)
20
Overview
● Introduction
● Data Collection
● Data Preprocessing
● Prediction
● Results
● STPredict Package
21
Data Preprocessing
Starting from the raw data, several preprocessing steps are applied to produce the preprocessed dataset.
22
Imputation
Some covariates, such as the number of ICU beds, have no recorded value for certain counties or days.
To handle these missing values, we use imputation methods, such as replacing them with the mean or median of some of the past and future covariate values.
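As a minimal sketch of this idea (the window size and the use of a plain mean are illustrative assumptions, not the exact settings used in the study), a missing daily value can be replaced by the mean of the observed values around it:

```python
def impute_series(values, window=2):
    """Fill None entries with the mean of up to `window` observed
    neighbours on each side (past and future values)."""
    filled = list(values)
    for i, v in enumerate(values):
        if v is None:
            neighbours = [x for x in values[max(0, i - window): i + window + 1]
                          if x is not None]
            filled[i] = sum(neighbours) / len(neighbours) if neighbours else None
    return filled

# Hypothetical daily ICU-bed counts for one county, with a gap on day 3
icu_beds = [10, 12, None, 14, 16]
print(impute_series(icu_beds))  # [10, 12, 13.0, 14, 16]
```

Using the median instead of the mean only requires replacing the aggregation over the neighbours.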
23
Spatial and Temporal Scale Changing
Each row of the COVID-19 dataset corresponds to one county, so the spatial scale of the dataset is the county.
Sometimes, however, we may want to predict the number of deaths (or confirmed cases) for each state or for the whole country. For this purpose, we first change the spatial scale of the dataset by aggregating the covariate values.
In the same way, we have the option to predict the target value for each week or month instead of each day by aggregating the values of every covariate, thereby changing the temporal scale.
By default, the target values of the COVID-19 outbreak dataset are the number of deaths (or confirmed cases) for each day, which is called the 'normal' target mode.
Analogously to the target variable, we can apply modifications to the covariate values using methods like standardization and normalization. We call this process feature mode changing.
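The two operations above can be sketched in a few lines of Python (the column names and the choice of summation as the aggregation are illustrative assumptions; non-additive covariates would need a different aggregation, e.g. a population-weighted mean):

```python
def change_spatial_scale(rows, key="state"):
    """Aggregate county-level rows to a coarser spatial unit by
    summing additive covariates such as population and deaths."""
    totals = {}
    for row in rows:
        unit = row[key]
        agg = totals.setdefault(unit, {"population": 0, "deaths": 0})
        agg["population"] += row["population"]
        agg["deaths"] += row["deaths"]
    return totals

def normalize(values):
    """Feature mode changing via min-max normalization to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

counties = [
    {"state": "CA", "population": 100, "deaths": 3},
    {"state": "CA", "population": 200, "deaths": 5},
    {"state": "NY", "population": 150, "deaths": 4},
]
print(change_spatial_scale(counties))
print(normalize([2, 4, 6, 8]))
```

Changing the temporal scale works the same way, with the grouping key being the week or month instead of the spatial unit.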
25
Constructing Historical Data
This is the main process for converting the raw dataset into a dataset suitable for our learner.
We are interested in predicting targets r days into the future based on h days of data. Thus, the prediction of the target variables at day t+r uses the covariates measured at days t, t-1, ..., t-h+1.
We call the parameters r and h the 'forecast horizon' and the 'history length' of the prediction, respectively.
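For a single covariate series, this construction can be sketched as follows (the toy temperature and death series are invented for illustration):

```python
def make_historical_instances(covariate, target, h, r):
    """Build (features, label) pairs: the features are the covariate
    at days t, t-1, ..., t-h+1; the label is the target at day t+r."""
    instances = []
    for t in range(h - 1, len(target) - r):
        features = [covariate[t - k] for k in range(h)]  # t, t-1, ..., t-h+1
        instances.append((features, target[t + r]))
    return instances

temperature = [20, 21, 19, 18, 22, 23]
deaths      = [1,  2,  2,  3,  5,  8]
# h=2 days of history, forecast horizon r=2 days ahead
print(make_historical_instances(temperature, deaths, h=2, r=2))
# [([21, 20], 3), ([19, 21], 5), ([18, 19], 8)]
```

With several temporal covariates, the same window is applied to each of them and the feature vectors are concatenated.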
26
Constructing Historical Data
Recall an arbitrary row of the raw data table, which corresponds to a specific day and county.
Now suppose that for a given day, say February 14, we want to predict the target value on February 16 (r=2) using only the information of February 14 (h=1).
27
Constructing Historical Data
● t = February 14
● r=2
● h=1
28
Constructing Historical Data
So the data instance for h=1 in the previous slide consists of the covariates of February 14, except for the target value, which belongs to February 16.
As another example, suppose we want to use the information of days t and t-1 to predict the target value of day t+r.
29
Constructing Historical Data
● t = February 14
● r=2
● h=2
30
Constructing Historical Data
So, given the values of the forecast horizon and the history length, we construct the historical data from the raw data.
In terms of the COVID-19 pandemic, this means, for example, that we aim to predict the number of deaths (or confirmed cases) at day t+r using the temperature and the grocery and pharmacy mobility percent change at days t, t-1, ..., t-h+1.
31
Overview
● Introduction
● Data Collection
● Data Preprocessing
● Prediction
● Results
● STPredict Package
32
Prediction Phase
As an overview of this phase, the preprocessed data is split into three parts: training data, validation data, and testing data. We then train predictive models (such as machine learning models) using some feature sets of the training data and measure the models' performance on the validation set according to performance measures like MAPE. Based on the reported performances, the best model and the best feature set are used to make predictions on the testing data.
33
Prediction Phase
As mentioned in the previous slide, the prediction phase consists of the following
processes:
● Data Splitting
● Feature Ranking
● Best Feature set and Model Selection
34
Data Splitting
As the first step of the prediction phase, we split the preprocessed data (the historical data) into the following three parts, which is a common pattern in machine learning approaches:
● Training Data
● Validation Data
● Testing Data
35
Data Splitting
When predicting the future, we only have present and past data available for training predictive models. So a realistic measurement of model performance can be achieved only by keeping the same arrangement in the data splitting process and selecting the testing set from the last "fold" of temporal units in the data, rather than using the commonly practiced random split.
So, for predicting the number of COVID-19 deaths (or confirmed cases), we select the testing set from the last temporal units (the last days or weeks); this is why our approach is called 'LaFoPaFo', short for LAst FOld PArtitioning FOrecaster.
36
Data Splitting
● r=9
● h=3
37
Data Splitting
38
Gap in Data Splitting
For forecast horizons greater than one (r>1), the gap between the present and the target time must be considered in the data splitting, ensuring that the training instances maintain this temporal distance from the testing instances.
More specifically, to predict an instance at time t with a target variable at time t+r, we may not use any of the instances between times t and t+r.
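A last-fold split with this gap can be sketched as follows (dropping r-1 instances before each boundary is one way to enforce the rule above; the fold sizes here are illustrative assumptions):

```python
def last_fold_split(instances, r, n_test=1, n_val=1):
    """Last-fold partitioning: the newest instances form the test set,
    the ones just before them the validation set, and a gap of r-1
    instances is dropped before each boundary so that no training
    instance overlaps the target period of a later instance."""
    gap = r - 1
    test = instances[len(instances) - n_test:]
    val_end = len(instances) - n_test - gap
    val = instances[val_end - n_val: val_end]
    train = instances[: val_end - n_val - gap]
    return train, val, test

days = list(range(12))  # instances ordered by time
train, val, test = last_fold_split(days, r=3, n_test=2, n_val=2)
print(train, val, test)  # [0, 1, 2, 3] [6, 7] [10, 11]
```

Note how instances 4-5 and 8-9 are discarded as gap data rather than used for training.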
39
Feature Ranking
To keep the large number of candidate features manageable, the prediction module executes a ranking process to obtain an ordered list of ranked features.
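The deck does not specify the ranking criterion; as one common choice, a sketch that ranks features by the absolute Pearson correlation of each feature column with the target (an assumption for illustration, with invented data) could look like this:

```python
def rank_features(feature_columns, target):
    """Rank features by absolute Pearson correlation with the target.
    (The correlation-based criterion is an illustrative assumption.)"""
    def corr(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy)
    scores = {name: abs(corr(col, target))
              for name, col in feature_columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

features = {
    "past_confirmed": [10, 20, 30, 40, 50],
    "population":     [5, 5, 5, 6, 5],
}
deaths = [1, 2, 3, 4, 5]
print(rank_features(features, deaths))  # past_confirmed ranks first
```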
40
Ranked Features of Covid-19 Dataset
1. Past confirmed
2. Neighboring virus pressure
3. Past deaths
4. Total population
5. Population density
6. Meat plants
7. Houses density
41
Best Feature Set and Best Model Selection
Perhaps the most distinguishing feature of our approach compared to other similar tools is its systematic search over ranked features and models during the training and validation phases to find the best configuration.
In this context, the best configuration means the best feature set and the best model, which together achieve the best performance (the minimum loss) on the validation data.
42
Best Feature Set Selection
To find the best feature set, our method searches over the ordered list of ranked features from the feature ranking process rather than the whole feature-set space. This spares us from considering a long history for the temporal covariates and from the feature volume and computational complexity of the training process.
Then, according to the performance of the predictive models on the validation set, the feature set with the minimum loss is selected.
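Searching along the ranked list means evaluating only the nested sets top-1, top-2, ... instead of all 2^n subsets. A sketch, with hypothetical validation losses standing in for real model training:

```python
def select_best_feature_set(ranked_features, evaluate):
    """Grow candidate feature sets along the ranked list (top-1,
    top-2, ...) and keep the set with the minimum validation loss.
    `evaluate` maps a feature set to its validation loss."""
    best_set, best_loss = None, float("inf")
    for k in range(1, len(ranked_features) + 1):
        candidate = ranked_features[:k]
        loss = evaluate(candidate)
        if loss < best_loss:
            best_set, best_loss = candidate, loss
    return best_set, best_loss

# Hypothetical validation losses for each top-k feature set
losses = {1: 12.0, 2: 9.5, 3: 10.1}
ranked = ["past_confirmed", "virus_pressure", "past_deaths"]
print(select_best_feature_set(ranked, lambda s: losses[len(s)]))
# (['past_confirmed', 'virus_pressure'], 9.5)
```

In the real pipeline, `evaluate` would train a model on the training data restricted to the candidate features and report its loss on the validation set.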
43
Best Model Selection
At the same time as searching for the best feature set, our approach examines different models to select the best one based on the models' performance on the validation set.
We not only provide predefined models like KNN (k-nearest neighbors), GBM (gradient boosting machine), and GLM (generalized linear model), but also allow the execution of user-defined, custom models.
Furthermore, to utilize the power of multiple models for more accurate predictions, a mixed-model option is provided, which combines the predictions of several models to train a new model.
We examine the performance of a specific model on the validation data using performance measures such as:
● MAPE (mean absolute percentage error)
● MAE (mean absolute error)
● MASE (mean absolute scaled error)
● MSE (mean squared error)
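These four measures have standard definitions, sketched below (the example values are invented; note that MAPE requires non-zero actual values, and MASE scales MAE by the error of a naive one-step-lag forecast):

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mase(actual, predicted):
    """Mean absolute scaled error: MAE divided by the MAE of the
    naive one-step-lag forecast on the actual series."""
    naive = sum(abs(actual[i] - actual[i - 1])
                for i in range(1, len(actual))) / (len(actual) - 1)
    return mae(actual, predicted) / naive

y_true, y_pred = [100, 200, 400], [110, 180, 400]
print(round(mape(y_true, y_pred), 2))  # 6.67
print(mae(y_true, y_pred))             # 10.0
```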
45
Training, Validation and Testing
To put it all together, the following processes are done in the prediction phase:
● First, we split the data into training, validation, and testing sets.
● Then, a list of ranked features is obtained from the training and validation data.
● Next, the best model and the best feature set are found systematically based on a performance measure on the validation set.
● Finally, the testing process begins, which uses the training and validation data together to train the best model obtained in the previous step.
46
Testing Type
The proposed method also supports two different ways of predicting the target variables of the test points.
● Whole-as-One, which makes predictions on the whole testing set after a single training and validation process.
● One-by-One, which performs the training and validation processes for each testing-set instance separately, so that each time one of the testing-set instances is used as the last fold in the data. Thus, for each instance in the testing set, we have a training set, a validation set, and gap data that differ from those of the other testing instances.
For COVID-19 mortality prediction, we used the 'LaFoPaFo' learner, which utilizes the 'One-by-One' testing type.
47
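The One-by-One testing type amounts to a rolling re-fit, sketched below with a stand-in predictor (the naive "repeat the last observed target" predictor and the toy instances are illustrative assumptions, not the models actually used):

```python
def one_by_one_forecast(instances, r, fit_predict):
    """For each instance in turn, treat it as the last fold: train on
    everything at least r steps older (keeping the r-1 gap), then
    predict it. Early instances with no usable history are skipped."""
    predictions = []
    for i, (features, _target) in enumerate(instances):
        history = instances[: max(0, i - (r - 1))]  # enforce the gap
        if history:
            predictions.append(fit_predict(history, features))
    return predictions

instances = [([1], 2), ([2], 4), ([3], 6), ([4], 8)]
naive = lambda history, feats: history[-1][1]  # repeat last observed target
print(one_by_one_forecast(instances, r=2, fit_predict=naive))  # [2, 4]
```

Whole-as-One, by contrast, would call `fit_predict` once with a fixed training history and predict all test instances from that single fit.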
Future Forecasting
Finally, the proposed method provides the option of training the model obtained in the model selection step on the whole available dataset and making a real prediction of the target variable, whose true values are not yet known to us.
48
Overview
● Introduction
● Data Collection
● Data Preprocessing
● Prediction
● Results
● STPredict Package
49
Results
The Centers for Disease Control and Prevention (CDC) maintains a national forecasting site, where modeling groups continually update their forecasts for new COVID-19 cases and deaths in the USA.
We compare the forecasting results of our LaFoPaFo with the predictive models available at the CDC during the 7 weeks from September 27 to November 14, 2020.
50
Results (Deaths - 7 weeks from Sep 27 to Nov 14, 2020)
51
Results (Cases - 7 weeks from Sep 27 to Nov 14, 2020)
52
Results
53
Results
LaFoPaFo’s weekly forecasting results for the number of deaths ten weeks ahead:
54
Results
LaFoPaFo’s daily forecasting results for the number of deaths over the next 14 days.
55
Overview
● Introduction
● Data Collection
● Data Preprocessing
● Prediction
● Results
● STPredict Package
56
STPredict Package
To utilize the proposed approach in other time series problems in ecology, geology, etc., and to execute all of the mentioned processes, from making the historical data to the testing procedure, with just one command, we have designed and implemented a Python package called STPredict, which is short for Spatio-Temporal Prediction.
57
STPredict Package
58
Thank you :)
Any questions?
59