
Milestone 2_Sample submission

(Including Milestone 1 and 2)

Contents
Overall Goal / Research
Abstract
Research Questions
Introduction
Analysis Progress
Evaluation of prior work
Feature Selection
Training Methods

Overall Goal / Research
The overall goal of my research project is to analyze and extrapolate data from a large dataset
that shows cases of COVID-19 across the globe.

Note: I am currently using inline references; the final project will be referenced according to
APA rules.

Abstract
The purpose of this research paper is to develop a better understanding of the spread of the
COVID-19 pandemic across the globe and to develop a method for projecting infection rates
based on the responses of countries that have confirmed cases of COVID-19. In this research
I looked at a subset of 7 countries (the final number remains to be seen) and modeled their infection
rates using the SIR model, which was developed for modeling the spread of infectious diseases.

Research Questions
- Is there a correlation between COVID-19 infection rates and the government responses of the
following countries: USA, Germany, South Korea?
- Can we model future infection rates from government responses and current infection rates to
show the future progression and potential end of the COVID-19 pandemic?

Introduction
As COVID-19 continues to spread across the world with varying infection and death rates,
this research paper aims to develop a model of when countries can expect to see a
decrease in their COVID-19 cases. Before any model is created, I will have to analyze the data to
see how best to use it. For this research paper, I decided to use a large collection of data from
https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset, which
contains multiple datasets that together include over 100 columns and 600k rows of information.

The primary method that will be used to model the spread of COVID-19 is the SIR model
(https://en.wikipedia.org/wiki/Compartmental_models_in_epidemiology), which is the simplest
compartmental model for projecting the spread of infectious diseases and for building models of
how they propagate. The three main values in this model are Susceptible, Infectious, and Recovered. In
this report, Susceptible will be modified depending on the mitigation rates of each
country. The Infectious and Recovered numbers will depend on how well the mitigation works and will
be the output of this research project. Once I can recreate the SIR model myself for a known source, I
will then begin to build projections of how the virus will propagate through the world in the future.
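
For reference, the standard form of the SIR model can be written as the system below. This is the
textbook formulation, not something taken from the dataset: the symbols β (transmission rate),
γ (recovery rate), and N (total population) do not appear elsewhere in this report and are introduced
here as assumptions. How mitigation changes the susceptible group is not yet specified, so it is not shown.

```latex
\begin{aligned}
\frac{dS}{dt} &= -\frac{\beta S I}{N} \\
\frac{dI}{dt} &= \frac{\beta S I}{N} - \gamma I \\
\frac{dR}{dt} &= \gamma I, \qquad S + I + R = N
\end{aligned}
```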

It should be noted that none of the modeling that I will be performing includes outside factors
that could affect the results, such as weather, the financial capability of a country, and mutation of the
virus.

(More introduction information added over time)

Analysis Progress
Prior to beginning this process, I looked at two submissions to gain some understanding of
how individuals were using the dataset to present solutions to the tasks outlined in the Novel
Corona Virus 2019 Dataset. The first of the two was Predicting the end of the Coronavirus
disease 2019 (https://www.kaggle.com/lumierebatalong/predicting-the-end-of-covid19). In this
research the author modeled the data as the disease spread across the globe and then used the SIR
model to model the epidemiology of the COVID-19 pandemic. The research showed that the SIR model,
with some modification, was well suited to modeling the spread and control of COVID-19 in China.
The second submission that I looked at was Relation between growth rate, containment and mitigation
measures (https://www.kaggle.com/lumierebatalong/covid19-growth-rate-containment-mitigation-measure).
That submission used three rates: the growth rate of positive cases, the growth rate of recovered
cases, and the growth rate of deaths. I will also note that it used an additional CSV file listing all of
the containment measures, which was developed by the Johns Hopkins Hospital and hosted here
(http://epidemicforecasting.org/containment). The research was also able to point out that there was a
relationship between growth rate and containment and mitigation measures.

Now that the background research is complete, my first task is preprocessing the data
to meet my research needs. To do this, I will be using multiple tools associated with Python and Jupyter.
I have imported the data into Jupyter to begin the analysis, which includes finding the start and end
dates of the data (figure 1), the update counts by country (figure 2), and the total number affected by
COVID-19 (figure 3). Once compiled, the data is not sorted in any manner that makes it possible to
discern which country has the most cases, so sorting it was my next task (figures 4 and 5).
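
The preprocessing steps above can be sketched as follows. This is a minimal illustration using only the
Python standard library, not the notebook code behind the figures. The column names
(ObservationDate, Country/Region, Province/State, Confirmed) are assumptions about the Kaggle file layout,
and the rows are made-up sample values.

```python
from collections import defaultdict
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y"  # assumed date format of the ObservationDate column

# Made-up sample rows (cumulative confirmed counts); not real data.
rows = [
    {"ObservationDate": "03/01/2020", "Country/Region": "US", "Province/State": "Washington", "Confirmed": 5},
    {"ObservationDate": "03/01/2020", "Country/Region": "US", "Province/State": "New York", "Confirmed": 1},
    {"ObservationDate": "03/01/2020", "Country/Region": "Germany", "Province/State": "", "Confirmed": 8},
    {"ObservationDate": "03/01/2020", "Country/Region": "South Korea", "Province/State": "", "Confirmed": 30},
    {"ObservationDate": "03/02/2020", "Country/Region": "US", "Province/State": "Washington", "Confirmed": 9},
    {"ObservationDate": "03/02/2020", "Country/Region": "US", "Province/State": "New York", "Confirmed": 4},
    {"ObservationDate": "03/02/2020", "Country/Region": "Germany", "Province/State": "", "Confirmed": 12},
    {"ObservationDate": "03/02/2020", "Country/Region": "South Korea", "Province/State": "", "Confirmed": 40},
]


def date_range(records):
    """Return the first and last observation dates in the records."""
    dates = [datetime.strptime(r["ObservationDate"], DATE_FORMAT) for r in records]
    return min(dates), max(dates)


def latest_totals(records):
    """Sum confirmed cases per country on the last date, sorted from most to fewest."""
    _, last_date = date_range(records)
    totals = defaultdict(int)
    for r in records:
        if datetime.strptime(r["ObservationDate"], DATE_FORMAT) == last_date:
            totals[r["Country/Region"]] += r["Confirmed"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


first, last = date_range(rows)
ranked = latest_totals(rows)
print(first.date(), last.date())
print(ranked)
```

Because the counts are assumed to be cumulative, only the rows for the last date are summed; adding up
every date would count the same cases more than once.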

Figure 1

Figure 2

Figure 3
Figure 4

Figure 5

End of Milestone 1 / Deliverable 1


Evaluation of prior work
For this portion of the project I evaluated existing source code that has been uploaded to the Kaggle
website. These evaluations are in addition to the background research that I performed during the first
phase of my research. First, I continued my look at the submission Predicting the end of the
Coronavirus disease 2019 (https://www.kaggle.com/lumierebatalong/predicting-the-end-of-covid19).
In this research the author looked at how the confirmed, recovered, deaths, and active confirmed
information from the Kaggle dataset could be used to identify when the pandemic would end, based
on country quarantine responses. While
the submission looked at multiple countries, for this portion I will only rely on what was found for the
Chinese population, as the analysis of the other countries repeats the same steps.

Continuing from my prior analysis of the data, the first step in the submission was to find the feature
correlations, using a heatmap of the correlation matrix for the data from China, as shown below in figure 6.

Figure 6

This is followed by a graph showing how the different features are correlated with one another (figure 7).
With this information we can see that deaths and recovered are associated with each other. Recovered also
moves in the opposite direction to deaths, as shown in the highlighted images in figure 7.

Figure 7.

Following the correlation between deaths and recovery, the author went on to find the growth rate of the
pandemic in this dataset. Figure 8 shows the active growth rate in China. The same was done
for the ratio of recovery to deaths in China, as shown in figure 9.

Figure 8

Figure 9.

On 25 Feb the authorities in Hubei, China quarantined the entire population, setting in motion the
tightest restrictions on citizens that could possibly be imposed. With that information, we are able to
look at the recovered and death rates as the two most important features in this dataset. This influenced
the features that I used in my further research in other sections of this paper.

The last piece of information that I wanted to review from this submission was how it used the SIR
model to model the dataset. The variables used for the SIR model are as follows: X is the
susceptible fraction, Y is the infective fraction, and Z is the recovered fraction.

With the SIR model defined, we are able to see the model simulating the COVID-19 cases
in China, shown below in figure 10. In figure 11 we see how it applies to real data from China.
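
A minimal sketch of this kind of simulation, assuming the standard SIR equations applied to the fractions
X, Y, and Z and a simple forward Euler step. The function name, the parameter names beta and gamma, and
their values are illustrative assumptions, not the values used in the reviewed submission.

```python
def simulate_sir(beta, gamma, x0, y0, z0, days, steps_per_day=10):
    """Integrate the SIR equations with forward Euler.

    x, y, z are the susceptible, infective, and recovered fractions.
    Returns one (x, y, z) tuple per day, starting with day 0.
    """
    x, y, z = x0, y0, z0
    dt = 1.0 / steps_per_day
    history = [(x, y, z)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * x * y * dt
            new_recoveries = gamma * y * dt
            x -= new_infections
            y += new_infections - new_recoveries
            z += new_recoveries
        history.append((x, y, z))
    return history


# Illustrative parameters only; they are not fitted to any country.
trajectory = simulate_sir(beta=0.3, gamma=0.1, x0=0.999, y0=0.001, z0=0.0, days=160)
peak_day = max(range(len(trajectory)), key=lambda day: trajectory[day][1])
print(f"Infective fraction peaks on day {peak_day}")
```

A library ODE solver would be more accurate than the fixed Euler step, but the step keeps the sketch
dependency-free and makes each term of the model visible.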

Figure 10

Figure 11
During my review of previous and related contributions, I found that a large number of submissions use the
SIR model or some modified form of it to develop a solution to the problem outlined. This
is because infectious diseases have been studied for a long time and the SIR model is a well-established
way to model them.

In conclusion for this section, the reviewed submission shows that the data provided through Kaggle can be
used to model the spread of COVID-19 in China accurately. Using this same information, I will now continue
with the project by performing feature selection and beginning the process of creating a model for the
five countries I set out to plot data for.

Feature Selection
As I begin this portion of the project, I have to decide which features are most important for the countries
that I am trying to model (USA, Germany, South Korea). I begin by using the information that I learned
from reviewing other submissions on the Kaggle website. In this section I will describe how I discovered
which features are the most important, show plots of those features, and give my interpretation of that
information. Each country's data will be broken down in this section, but the overall data analysis is the same.

Prior to further feature selection and analysis, I had to create one more column of information in order
to perform the same analysis and fit the SIR model. The column that I created was active cases, to
show how COVID-19 spread across the globe over time.
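
A minimal sketch of how such a column can be derived, assuming active cases are defined as confirmed
cases minus deaths minus recoveries. That definition is an assumption; the report does not state the
formula. The field name active_c matches the correlation output later in this section, and the helper
name is hypothetical.

```python
def add_active_cases(record):
    """Return a copy of the record with an 'active_c' field added.

    Assumes active = Confirmed - Deaths - Recovered (cumulative counts).
    """
    updated = dict(record)
    updated["active_c"] = record["Confirmed"] - record["Deaths"] - record["Recovered"]
    return updated


# Made-up values for illustration.
example = add_active_cases({"Confirmed": 100, "Deaths": 5, "Recovered": 20})
print(example["active_c"])
```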

Through general analysis, I looked at the entire dataset to see what the most important
features were and then drilled down into each of the three countries that I want to model
to find whether there is a correlation between government responses and infection rate. The figure below,
Figure 12, shows the correlation between features. With that information I am able to see that all features
are strongly correlated with each other, except for recovered and active cases (0.25), which are only
weakly correlated. I used both RStudio and Python code to
compute the correlations, and the results are below.

            Recovered   active_c    Deaths      Confirmed
Recovered   1.0000000   0.2483919   0.5423903   0.6034546
active_c    0.2483919   1.0000000   0.8552491   0.9213635
Deaths      0.5423903   0.8552491   1.0000000   0.9350506
Confirmed   0.6034546   0.9213635   0.9350506   1.0000000

Figure 12.
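
A correlation matrix like the one in Figure 12 can be computed as sketched below. This assumes the
Pearson correlation coefficient, which is the default in common R and Python tools, though the report
does not say which coefficient was used. The input values are made up and do not reproduce the numbers
above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Build a nested dict of pairwise correlations from {name: values}."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up columns for illustration only.
sample = {
    "Confirmed": [10, 20, 40, 80, 160],
    "Deaths": [0, 1, 2, 5, 9],
    "Recovered": [0, 0, 5, 30, 90],
}
matrix = correlation_matrix(sample)
for name, row in matrix.items():
    print(name, {other: round(value, 3) for other, value in row.items()})
```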

The next piece of the puzzle that I wanted to solve was the growth rate of COVID-19 in the three
countries. With it, I will be able to compare the COVID-19 variables already
discussed against the dates when different countries put their mitigation efforts into effect, to see how
those efforts affected the growth rate.
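
One common way to compute a growth rate is the day-over-day relative change in the cumulative count,
sketched below. This definition is an assumption; the report does not state which formula was used.
The function name and the sample values are illustrative.

```python
def daily_growth_rate(cumulative):
    """Day-over-day relative change: (today - yesterday) / yesterday.

    Days that follow a zero count are reported as 0.0 to avoid division by zero.
    """
    rates = []
    for previous, current in zip(cumulative, cumulative[1:]):
        rates.append((current - previous) / previous if previous > 0 else 0.0)
    return rates


# Made-up cumulative confirmed counts growing by 10% per day.
rates = daily_growth_rate([100, 110, 121])
print(rates)
```

Comparing the rates before and after a mitigation date then gives a simple view of whether growth slowed.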

Moving on, with the above information I began my feature analysis and selection for each country,
starting with the US.
US Figures

Data for Germany:


It is interesting to note that the data for Germany showed active cases beginning to decrease
around the end of April, unlike the US, which has not shown a decrease in active cases.

Germany Figures

The last country that I looked at was South Korea. The data shows that it hit its
peak of active cases in the middle of March, with a very low death rate.
South Korea Figures

For each country I ran plots to show the relationships between features and how they would affect the data.
US Germany South Korea

In conclusion, I found that I had to create or transform only minor information inside the dataset. The
main change was creating an active cases variable so that I would be able to apply it later in my analysis.
One interesting finding is that the growth rate in the US was half that of Germany and
South Korea, yet the US still had the largest number of infections. There are many possible explanations
for this, the main one being that the population of the US is greater
than the population of the other two countries. Also, all three countries showed a spike in confirmed
cases around the end of February.

Training Methods
Prior to beginning this process, I looked at multiple submissions to gain some understanding of how
individuals were using the dataset to present solutions to the tasks outlined in the Novel
Corona Virus 2019 Dataset. For this project I used the SIR model to see whether it could be
fitted to each country's data depending on how the country responded to COVID-19.

For my main dataset, in the CSV file named “modified.csv”, I did not merge any new external information into
the dataset; I only added information to the original dataset that was computed in the Python code
itself, namely the active cases and the growth rate of each country. The growth rate of the
pandemic was not included in the CSV file since it was used for analysis only.

Before we move into analyzing the data and applying the model, it is important to first find the dates on
which countries responded to the pandemic. The table below shows how and when each of the countries
responded, as found on various websites (see the list of references).
Country       Responses
China         Wuhan enters lockdown, 23 Jan; quarantine of citizens on 25 Feb
Germany       State of emergency declared, 16 March
South Korea   No national lockdown; mass testing by end of January
US            Public health emergency declared, Jan 31; national emergency, 13 March;
              social distancing begins, 16 March

The SIR model worked for both South Korea and Germany, as shown in the figure below.

Both countries line up with the SIR model, showing confirmed cases decreasing over time. In both
cases the decrease occurred after the country imposed strict social distancing measures or other measures
to identify citizens who were infected with COVID-19.
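
One way to fit the SIR model to a country's active-case curve is a simple grid search over the
transmission and recovery rates, sketched below. This is an illustration under stated assumptions,
not the procedure used for the figure: the function names, the parameter grid, and the synthetic
"observed" curve (generated from known parameters so the fit can be checked) are all hypothetical.

```python
def simulate_infective(beta, gamma, y0, days, steps_per_day=10):
    """Return the daily infective fraction from a forward Euler SIR run."""
    x, y = 1.0 - y0, y0
    dt = 1.0 / steps_per_day
    curve = [y]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * x * y * dt
            new_recoveries = gamma * y * dt
            x -= new_infections
            y += new_infections - new_recoveries
        curve.append(y)
    return curve


def fit_sir(observed, y0, betas, gammas):
    """Grid search for the (beta, gamma) pair with the lowest squared error."""
    best = None
    for beta in betas:
        for gamma in gammas:
            model = simulate_infective(beta, gamma, y0, len(observed) - 1)
            error = sum((m - o) ** 2 for m, o in zip(model, observed))
            if best is None or error < best[0]:
                best = (error, beta, gamma)
    return best[1], best[2], best[0]


# Synthetic observations from known parameters; real data would be the
# active cases divided by the country's population.
observed = simulate_infective(0.30, 0.10, 0.001, 60)
grid = [round(0.05 * k, 2) for k in range(1, 11)]  # 0.05, 0.10, ..., 0.50
beta_hat, gamma_hat, squared_error = fit_sir(observed, 0.001, grid, grid)
print(beta_hat, gamma_hat, squared_error)
```

A grid search is slow but easy to follow; an optimizer from a numerical library could replace it once
the approach is working. Fitting separate parameters before and after a response date would be one way
to tie the fit to the government responses in the table.
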
Work to be completed between deliverable 2 and final submission:

Model execution for the US to determine end of COVID-19

- How long does it take to train your model?
- How long does it take to generate predictions using your model?
- How long does it take to train the simplified model (referenced in section A6)?
- How long does it take to generate predictions from the simplified model?
- What was the most important trick you used?
- What do you think set you apart from others in the competition?
- Did you find any interesting relationships in the data that don't fit in the sections above?

List of references:

US Responses: https://www.usatoday.com/in-depth/news/nation/2020/04/21/coronavirus-updates-how-covid-19-unfolded-u-s-timeline/2990956001/

Germany Responses: https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Germany#February

South Korea Responses: https://english.alarabiya.net/en/features/2020/04/03/South-Korea-conquered-coronavirus-without-a-lockdown-a-model-to-follow-
