Project Stage II Report On: "Covid19 Impact Upon The Community"
CERTIFICATE
Certified that the project report entitled, “Covid19 Impact Upon the Community”, is a
bonafide work done by Riya Bajaj, Tarique Shahidi, and Anirudh Amla in fulfillment of the
requirements for the award of the degree of Bachelor of Technology in Electronics &
Telecommunication Engineering.
Date:
Examiner 1:
Examiner 2:
ACKNOWLEDGEMENT
We would like to extend our sincere gratitude to the Principal, Dr. Vidula Sohoni, and the
Head of the Department of Electronics & Telecommunication, Prof. Shruti Oza, for nurturing
a congenial yet competitive environment that motivates all students not only to pursue their
goals but also to grow as people.
Inspiration and guidance are invaluable in every aspect of life, and we have received both
from our respected project guide, Prof. Sudhir Bussa, whose careful and ardent guidance
enabled us to complete this project. Words cannot suffice to express our gratitude for his
untiring devotion; he is undoubtedly a master of his craft.
We would also like to thank all the faculty members who directly or indirectly helped us from
time to time with their invaluable inputs.
ABSTRACT
The current destructive pandemic of coronavirus disease 2019 (COVID-19), caused by severe
acute respiratory syndrome coronavirus 2 (SARS-CoV-2), was first reported in Wuhan,
China, in December 2019. The outbreak has affected millions of people around the world,
and the number of infections and deaths has been growing at an alarming rate. In such a
situation, forecasting and careful study of the pattern of disease spread can inform the design
of better strategies and more efficient decisions. Moreover, such studies play an important
role in achieving accurate predictions.
Machine learning offers numerous tools for visualization and prediction, and it is now used
worldwide to study the pattern of COVID-19 spread. One of the main focuses of this project
is to use machine learning techniques to analyze and visualize the spread of the virus,
country-wise as well as globally, over a specific period of time by considering confirmed
cases, recovered cases, and fatalities.
In this project, we apply linear regression, support vector machines, and other algorithms to
Johns Hopkins University’s COVID-19 data to anticipate the future effects of the COVID-19
pandemic in India and some other countries. Moreover, we study the impact of parameters
such as geographic conditions, economic statistics, population statistics, and life expectancy
on the prediction of COVID-19 spread.
TABLE OF CONTENTS
List of Figures
List of Tables
Abstract
Chapter 1: Introduction
1.1 Covid19 and Problem Statement
1.2 Technology and Concept
1.3 Software Used
1.4 Data Sources
Chapter 2: Literature Survey
Chapter 3: Data Mining
Chapter 4: Exploratory Data Analysis
Chapter 5: Machine Learning Algorithms
LIST OF FIGURES
3.1 Figure 1
4.1–4.23 Figures 2–24
5.1–5.6 Figures 25–30
1 INTRODUCTION
On 31st December 2019, in the city of Wuhan, China, a cluster of cases of pneumonia of
unknown cause was reported to the World Health Organization. In January 2020, a
previously unknown virus was identified and subsequently named the 2019 novel
coronavirus. The WHO has since declared COVID-19 a pandemic: a disease spread over a
wide geographical area and affecting a high proportion of the population.
The pandemic has taken a grip on people’s lives. Since it began, some countries have faced
ever-increasing case counts. Through data analysis of cases, one can assess how countries all
over the world are doing at controlling the pandemic, and adopt the prevention models of the
countries that are succeeding in flattening the curve. Predictions made from the datasets
available to an individual, country, or organisation help them judge how far they have been
able to control the pandemic and to what extent they should apply preventive measures. This
project is a step towards helping people understand the spread and predict the cases in their
country. It also gives an insight into how a country is doing in terms of limiting the spread.
Python Language
Machine Learning
Machine learning is the field of study, or process, of teaching a computer to learn from the
data fed to it without being explicitly programmed. It enables computers to make decisions in
a way similar to humans. Nowadays it is actively used in various fields, e.g. medicine,
industry, and astronomy.
The major types of Machine learning are Supervised Learning, Unsupervised Learning and
Reinforcement Learning.
Supervised Learning
The machine learning task of learning a function that maps input data to output data, based
on example input-output pairs.
Unsupervised Learning
A type of machine learning that draws inferences from datasets consisting of input data
without labelled responses. One of the most common unsupervised learning methods, cluster
analysis, is used to find hidden patterns or groupings in data.
Reinforcement Learning
A type of machine learning that learns from experience; such methods work in the absence of
a training dataset. An agent is rewarded or penalised for the actions it takes, and its task is to
find the best possible path to reach the goal.
Data frame
A pandas DataFrame is a 2D, mutable, heterogeneous tabular data structure with labelled
axes. A DataFrame can be made of more than one Series (a Series can only contain a single
list with an index).
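As a small illustrative sketch (with made-up country names and numbers), a DataFrame can be assembled from several Series that share a labelled index:

```python
import pandas as pd

# Two Series sharing the same labelled index
cases = pd.Series([100, 250, 75], index=["India", "Italy", "China"])
deaths = pd.Series([2, 10, 1], index=["India", "Italy", "China"])

# A DataFrame is a 2D, labelled, heterogeneous table built from such Series
df = pd.DataFrame({"Confirmed": cases, "Deaths": deaths})
print(df.loc["Italy", "Confirmed"])  # 250
```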
Hypothesis
In machine learning, a hypothesis is a model that approximates the target function and
performs the mapping of inputs to outputs.
Regression
The task of predicting a continuous output value from input variables.
Classification
The task of predicting a discrete class label from input variables.
NumPy
NumPy is a Python library used for working with arrays. It also has functions for working in
domain of linear algebra, Fourier transform, and matrices. NumPy was created in 2005 by
Travis Oliphant. It is an open-source project, and you can use it freely. NumPy stands for
Numerical Python. In Python we have lists that serve the purpose of arrays, but they are slow
to process. NumPy aims to provide an array object that is up to 50x faster than traditional
Python lists. The array object in NumPy is called ndarray; it provides a lot of supporting
functions that make working with it very easy. Arrays are very frequently used in data
science, where speed and resources are very important.
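A minimal sketch of ndarray usage, showing the vectorized operations that make NumPy faster than plain Python lists:

```python
import numpy as np

# ndarray supports fast, vectorized element-wise operations
a = np.array([1, 2, 3, 4])
b = a * 2            # vectorized multiply, no Python loop
print(b.tolist())    # [2, 4, 6, 8]
print(a.mean())      # 2.5
```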
Pandas
Pandas is a Python library used for working with data sets. It has functions for analyzing,
cleaning, exploring, and manipulating data. The name "Pandas" refers to both "Panel Data"
and "Python Data Analysis"; the library was created by Wes McKinney in 2008. Pandas
allows us to analyze big data and draw conclusions based on statistical theory. It can clean
messy data sets and make them readable and relevant. Relevant data is very important in data
science.
Pandas can also delete rows that are not relevant, or that contain wrong values such as empty
or NULL values. This is called cleaning the data.
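For example (on illustrative data), rows containing empty or NULL values can be dropped in one call:

```python
import pandas as pd
import numpy as np

# A messy toy table: two rows have a missing value
df = pd.DataFrame({"Country": ["India", "Italy", None],
                   "Confirmed": [100, np.nan, 75]})

clean = df.dropna()   # drop any row containing NULL/NaN values
print(len(clean))     # 1 (only the fully populated row survives)
```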
Matplotlib
Matplotlib is a low-level graph plotting library in python that serves as a visualization utility.
Matplotlib was created by John D. Hunter. Matplotlib is open source, and we can use it
freely. Matplotlib is mostly written in python, a few segments are written in C, Objective-C
and JavaScript for Platform compatibility. Matplotlib is a comprehensive library for creating
static, animated, and interactive visualizations in Python. Matplotlib makes easy things easy
and hard things possible.
Pyplot
Most of the Matplotlib utilities lie under the pyplot submodule, which is usually imported
under the plt alias:
import matplotlib.pyplot as plt
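As a minimal sketch (with made-up day and case numbers), a script-safe pyplot plot looks like this:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

# A minimal line plot of made-up daily case counts
days = [1, 2, 3, 4]
cases = [10, 25, 60, 130]

fig, ax = plt.subplots()
ax.plot(days, cases)
ax.set_xlabel("Day")
ax.set_ylabel("Confirmed cases")
fig.savefig("cases.png")  # write the figure to disk
```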
Seaborn
Seaborn is a library that uses Matplotlib underneath to plot graphs. It will be used to visualize
random distributions. Seaborn is an open source, BSD-licensed Python library providing high
level API for visualizing the data using Python programming language. Seaborn is built on
top of Python’s core visualization library Matplotlib. It is meant to serve as a complement,
and not a replacement. However, Seaborn comes with some very important features.
Google Colab
Colab is a free Jupyter notebook environment that runs entirely in the cloud. Most
importantly, it requires no setup, and the notebooks that you create can be simultaneously
edited by your team members - just the way you edit documents in Google Docs. Colab
supports many popular machine learning libraries, which can be easily loaded in your
notebook. As a programmer, you can do the following using Google Colab:
Write and execute code in Python
Document your code that supports mathematical equations
Create/Upload/Share notebooks
Import/Save notebooks from/to Google Drive
Import/Publish notebooks from GitHub
Import external datasets e.g., from Kaggle
Kaggle
Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine
learning practitioners. Kaggle allows users to find and publish data sets, explore and build
models in a web-based data-science environment, work with other data scientists and
machine learning engineers, and enter competitions to solve data science challenges. Kaggle
got its start in 2010 by offering machine learning competitions and now also offers a public
data platform, a cloud-based workbench for data science, and Artificial Intelligence
education. Its key personnel were Anthony Goldbloom and Jeremy Howard. Nicholas
Gruen was founding chair, succeeded by Max Levchin. Equity was raised in 2011, valuing the
company at $25 million. On 8 March 2017, Google announced that it was acquiring
Kaggle.
Kaggle’s Services
Machine Learning Competitions: this was Kaggle's first product. Companies post
problems and machine learners compete to build the best algorithm, typically with
cash prizes.
Kaggle Kernels: a cloud-based workbench for data science and machine learning.
Allows data scientists to share code and analysis in Python, R and R Markdown.
Over 150K "kernels" (code snippets) have been shared on Kaggle covering
everything from sentiment analysis to object detection.
Public dataset platform: community members share datasets with each other. Has
datasets on everything from bone x-rays to results from boxing bouts.
Kaggle learn: a platform for AI education in manageable chunks.
2-LITERATURE SURVEY
3-DATA MINING
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset. When combining multiple data sources, there
are many opportunities for data to be duplicated or mislabeled. If data is incorrect, outcomes
and algorithms are unreliable, even though they may look correct. There is no one absolute
way to prescribe the exact steps in the data cleaning process because the processes will vary
from dataset to dataset. But it is crucial to establish a template for your data cleaning process,
so you know you are doing it the right way every time.
While the techniques used for data cleaning may vary according to the types of data your
company stores, you can follow these basic steps to map out a framework for your
organization.
Step 1: Remove duplicate or irrelevant observations
Remove unwanted observations from your dataset, including duplicate observations or
irrelevant observations. Duplicate observations will happen most often during data collection.
When you combine data sets from multiple places, scrape data, or receive data from clients or
multiple departments, there are opportunities to create duplicate data. De-duplication is one
of the largest areas to be considered in this process. Irrelevant observations are when you
notice observations that do not fit into the specific problem you are trying to analyze. For
example, if you want to analyze data regarding millennial customers, but your dataset
includes older generations, you might remove those irrelevant observations. This can make
analysis more efficient and minimize distraction from your primary target, as well as
creating a more manageable and more performant dataset.
Step 2: Fix structural errors
Structural errors are when you measure or transfer data and notice strange naming
conventions, typos, or incorrect capitalization. These inconsistencies can cause mislabeled
categories or classes. For example, you may find “N/A” and “Not Applicable” both appear,
but they should be analyzed as the same category.
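A sketch of fixing such a structural error with pandas (the column name is hypothetical):

```python
import pandas as pd

# Two spellings of the same category appear in the raw data
df = pd.DataFrame({"Status": ["N/A", "Not Applicable", "Recovered"]})

# Normalize both spellings into one category
df["Status"] = df["Status"].replace({"Not Applicable": "N/A"})
print(df["Status"].tolist())  # ['N/A', 'N/A', 'Recovered']
```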
Step 3: Filter unwanted outliers
Often, there will be one-off observations where, at a glance, they do not appear to fit within
the data you are analyzing. If you have a legitimate reason to remove an outlier, like improper
data-entry, doing so will help the performance of the data you are working with. However,
sometimes it is the appearance of an outlier that will prove a theory you are working on.
Remember: just because an outlier exists, doesn’t mean it is incorrect. This step is needed to
determine the validity of that number. If an outlier proves to be irrelevant for analysis or is a
mistake, consider removing it.
Step 4: Handle missing data
1 As a first option, you can drop observations that have missing values, but doing this
will drop or lose information, so be mindful of this before you remove it.
2 As a second option, you can input missing values based on other observations; again,
there is an opportunity to lose integrity of the data because you may be operating from
assumptions and not actual observations.
3 As a third option, you might alter the way the data is used to effectively navigate null
values.
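The three options above can be sketched with pandas on illustrative numbers:

```python
import pandas as pd
import numpy as np

s = pd.Series([10.0, np.nan, 30.0])

dropped = s.dropna()           # option 1: drop observations with missing values
imputed = s.fillna(s.mean())   # option 2: impute from the other observations
flagged = s.isna()             # option 3: navigate nulls explicitly instead

print(len(dropped))            # 2
print(imputed.tolist())        # [10.0, 20.0, 30.0]
```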
Figure 1
The data provided by Johns Hopkins University also required cleaning for better results, so
we cleaned it as follows.
Step-by-step process followed for cleaning the data:
1 Removing useless columns such as Latitude and Longitude (from the Covid dataset), and
Overall rank, Score, Generosity, and Perception of corruption (from the World Happiness
Report).
covid_data_agg = covid_data.groupby('Country/Region').sum()
happiness_data.set_index('Country or region', inplace=True)
full_table['Date'] = pd.to_datetime(full_table['Date'])
full_table['Recovered'] = full_table['Recovered'].fillna(0)
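Putting the steps together, a hedged sketch of the cleaning pipeline on synthetic stand-in data (the real column names follow the Johns Hopkins file; the numbers here are made up):

```python
import pandas as pd

# Synthetic stand-in for the Johns Hopkins daily file
covid_data = pd.DataFrame({
    "Country/Region": ["India", "India", "Italy"],
    "Lat": [20.5, 20.5, 41.8],    # columns to discard
    "Long": [78.9, 78.9, 12.5],
    "Confirmed": [100, 150, 75],
})

# Drop the unneeded columns, then aggregate provinces per country
covid_data = covid_data.drop(columns=["Lat", "Long"])
covid_data_agg = covid_data.groupby("Country/Region").sum()

print(covid_data_agg.loc["India", "Confirmed"])  # 250
```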
The initial analysis of supplied or extracted data, using descriptive statistics and visualization
tools to understand the trends, underlying limitations, quality, patterns, and relationships
between entities within the data set, is called Exploratory Data Analysis (EDA). EDA gives
you a fair idea of which model better fits the data and whether any data cleansing and
massaging might be required before the data is taken through advanced modelling techniques
or machine learning and artificial intelligence algorithms.
EDA falls broadly into two categories, graphical and non-graphical. These are further
divided into univariate and multivariate EDA, based on the interdependency of the variables
in your data.
Univariate non-graphical: Here the data features a single variable, and the EDA is done
mostly in tabular form, for example with summary statistics. These non-graphical analyses
give you statistics that indicate, for instance, how skewed your data might be or which value
is dominant for your variable, if any.
Univariate graphical: The EDA here involves graphic tools such as bar charts, histograms
and box plots to get a quick view of how the values of a single variable are distributed.
Multivariate non-graphical: non-graphical methods like crosstabs are used to depict
the relationship between two or more variables. Statistical values like the correlation
coefficient indicate whether there is a possible relationship and measure its strength.
Multivariate graphical: A graphical representation always gives you a better
understanding of the relationship, especially among multiple variables.
The most commonly used software tools for EDA are Python and R. Both enjoy massive
community support and frequent updates to packages that can be used for EDA. Let’s look at
the various graphical instruments that can be used to execute an EDA.
Box plots
Box plots are used where there is a need to summarize data on an interval scale like the ones
on the stock market, where ticks observed in one whole day may be represented in a single
box, highlighting the lowest, highest, median and outliers.
Heatmap
Heatmaps are most often used for the representation of the correlation between variables.
Histograms
The histogram is the graphical representation of numerical data that splits the data into
ranges. The taller the bar, the greater the number of data points falling in that range. A good
example here is the height data of a class of students. You would notice that the height data
looks like a bell curves for a particular class with most the data lying within a certain range
and a few of outside these ranges. There will be outliers too, either very short or very small.
1 Plotted the data for the confirmed cases for China, Italy and India for a specific
number of days.
2 Plotted the first derivative for China, Italy and India. This way we can count the
daily increase in cases, which can also be used to calculate the infection
rate.
covid_data_agg.loc['India'].diff().plot()
Figure 5
covid_data_agg.loc['China'].diff().plot()
Figure 6
covid_data_agg.loc['Italy'].diff().plot()
Figure 7
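The first-derivative step above can be sketched on synthetic cumulative counts; `.diff()` turns running totals into daily new cases:

```python
import pandas as pd

# Cumulative confirmed cases over five days (made-up numbers)
cumulative = pd.Series([10, 15, 25, 40, 70])

daily_new = cumulative.diff()   # first derivative: new cases per day
print(daily_new.max())          # 30.0 (the maximum daily jump, i.e. infection rate)
```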
3 We calculated a new column that denotes the maximum infection rate for all the
countries.
max_infections = []
for c in countries:
    max_infections.append(covid_data_agg.loc[c].diff().max())
covid_data_agg['max_infection_rate'] = max_infections
There are mainly two types of feature selection techniques: supervised and unsupervised.
Figure 8
4) We used feature selection and created a new data frame with only the necessary columns.
df = pd.DataFrame(covid_data_agg['max_infection_rate'])
df.head()
Figure 9
A left join will return all rows from the first table written, Customer in this
instance, and only populate the second table’s fields where the key value exists.
It will return NULLs (or NaNs in Python) where it does not exist in the second
table.
Figure 10
A right join will return all rows from the second table written, Order in this
instance, and only populate the first table’s fields where the key value exists. It
will return NULLs (or NaNs in Python) where it does not exist in the first table.
Figure 11
A full join will return all rows whether the values in the field you are joining on
exist in both tables or not. If the value does not exist in the other table, NULLs
will be returned for that table’s fields.
Figure 12
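The three join types above can be sketched with `pd.merge` on toy Customer and Order tables (column names and values are illustrative):

```python
import pandas as pd

customer = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Bob"]})
order = pd.DataFrame({"id": [2, 3], "total": [50, 80]})

left = pd.merge(customer, order, on="id", how="left")    # all customers
right = pd.merge(customer, order, on="id", how="right")  # all orders
full = pd.merge(customer, order, on="id", how="outer")   # everything, NaN-filled

print(len(left), len(right), len(full))  # 2 2 3
```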
Histogram:
The histogram shows the distribution of a continuous variable. It can discover the
frequency distribution for a single variable in a univariate analysis.
Bar Chart:
Bar Chart or Bar Plot is used to represent categorical data with vertical or horizontal
bars. It is a general plot that allows you to aggregate the categorical data based on
some function, by default the mean.
Pie Chart:
Pie Chart is a type of plot which is used to represent the proportion of each category
in categorical data. The whole pie is divided into slices which are equal to the number
of categories.
Countplot:
A countplot is similar to a bar plot, except that we pass only the X-axis variable and
the Y-axis explicitly represents the count of occurrences: each bar shows the count
for one category of the variable.
Boxplot:
Boxplot is used to show the distribution of a variable. The box plot is a standardized
way of displaying the distribution of data based on the five-number summary:
minimum, first quartile, median, third quartile, and maximum.
Heatmap:
Heatmap is a type of Matrix plot that allows you to plot data as color-encoded
matrices. It is mostly used to find multi-collinearity in a dataset. To plot a heatmap,
your data should already be in a matrix form, the heatmap basically just colors it in
for you.
Regression-
A regression problem is one where the output variable is a real or continuous value, such as
“salary” or “weight”. Many different models can be used; the simplest is linear
regression, which tries to fit the data with the best hyper-plane through the points.
Figure 13
Figure 14
6) To visualize the correlation between the infection rate and factors such as GDP, Healthy
Life Expectancy, etc., we further used scatter plots, regression lines and heatmaps.
Figure 15
Figure 16
Figure 17
Figure 18
Figure 19
From the above figure, we can interpret that the maximum infection rate is negatively
correlated with ‘Freedom to make life choices’, ‘Healthy Life Expectancy’, ‘Social
support’ and ‘GDP per capita’: as the maximum infection rate increases, these factors
decrease, which we all witnessed during this pandemic outbreak.
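A hedged sketch of computing such a correlation with pandas (the numbers are made up; the report's real columns come from the merged Happiness data):

```python
import pandas as pd

# Synthetic data in which higher infection rates pair with lower GDP
df = pd.DataFrame({
    "max_infection_rate": [1000, 5000, 20000, 80000],
    "GDP per capita": [1.8, 1.4, 1.1, 0.7],
})

corr = df.corr()  # pairwise Pearson correlation matrix
print(corr.loc["max_infection_rate", "GDP per capita"] < 0)  # True
```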
7) We used the Altair library in Python to calculate and visualize Daily New Cases, Daily
New Deaths, Total Confirmed Cases and Total Deaths.
Altair-
Altair is a statistical visualization library in Python. It is declarative in nature and is based
on the Vega and Vega-Lite visualization grammars. It is fast becoming the first choice of
people looking for a quick and efficient way to visualize datasets.
It is rightly regarded as a declarative visualization library since, while visualizing any dataset
in Altair, the user only needs to specify how the data columns map to encoding channels, i.e.
declare links between the data columns and encoding channels such as the x and y axes,
rows, columns, etc. Simply put, a declarative visualization library lets you focus on the
“what” rather than the “how”, handling the other plot details itself without the user’s
help.
The following command can be used to install Altair like any other python library:
pip install altair
All altair charts need three essential elements: Data, Mark and Encoding. A valid chart can
also be made by specifying only the data and mark.
The basic format of all Altair charts is:
alt.Chart(data).mark_bar().encode(
    encoding1='column1',
    encoding2='column2',
)
Advantages
1 The basic code remains the same for all types of plots; the user only needs to change
the mark attribute to get a different plot.
2 The code is shorter and simpler to write than in imperative visualization libraries.
The user can focus on the relationship between the data columns and forget about the
unnecessary plot details.
3 Faceting and Interactivity are very easy to implement.
The most popular and commonly used Machine Learning Algorithms are
Linear Regression
Logistic Regression
Decision Tree
SVM Algorithm
Naïve Bayes Algorithm
KNN Algorithm
K means Clustering
Random Forest Algorithm
XGBoost Algorithm
In our model, we used Linear Regression, Support Vector Machine and the XGBoost
algorithm to compare results as well as anticipate the future effects of Covid19.
The first step in applying the algorithms was to divide our dataset into training and testing
datasets.
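The split can be sketched with scikit-learn's `train_test_split` on synthetic day/case numbers (the split ratio here is an illustrative assumption, not the report's exact value):

```python
import numpy as np
from sklearn.model_selection import train_test_split

days = np.arange(1, 51).reshape(-1, 1)   # feature: day number
cases = np.arange(1, 51) ** 2            # target: made-up case counts

# Hold out the last 15% of days as the test set (no shuffling for time series)
X_train, X_test, y_train, y_test = train_test_split(
    days, cases, test_size=0.15, shuffle=False)

print(len(X_train), len(X_test))  # 42 8
```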
Figure 25
Linear Regression
To understand the working functionality of this algorithm, imagine how you would arrange
random logs of wood in increasing order of their weight. There is a catch; however – you
cannot weigh each log. You have to guess its weight just by looking at the height and girth of
the log (visual analysis) and arrange them using a combination of these visible parameters.
This is what linear regression in machine learning is like.
In this process, a relationship is established between independent and dependent variables by
fitting them to a line. This line is known as the regression line and is represented by the
linear equation Y = a*X + b.
In this equation:
Y – Dependent Variable
a – Slope
X – Independent variable
b – Intercept
The coefficients a & b are derived by minimizing the sum of the squared difference of
distance between data points and the regression line.
We performed linear regression on the confirmed-cases data by dividing the data into two
parts, training data and testing data.
First we fitted a model of degree four on the training data, and then we evaluated it on the
testing data, obtaining an error of 8.
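The degree-four fit can be sketched with scikit-learn on synthetic data (the curve, split point, and hyperparameters are assumptions for illustration, not the report's exact setup):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Made-up cumulative cases following a clean cubic growth curve
days = np.arange(1, 31).reshape(-1, 1)
cases = 3 * days.ravel() ** 3 + 10

poly = PolynomialFeatures(degree=4)   # the degree-four transform
X = poly.fit_transform(days)

model = LinearRegression().fit(X[:24], cases[:24])  # train on the first 24 days
pred = model.predict(X[24:])                        # test on the remaining days

# On this noise-free synthetic curve the fit is essentially exact
print(np.allclose(pred, cases[24:], rtol=1e-2))  # True
```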
Figure 26
Figure 27
We applied the Support Vector Machine algorithm by dividing the dataset into training
and testing datasets. First we trained on the training dataset, and then we passed the
testing dataset through that model, obtaining an error of 7.
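A minimal SVM-regression sketch with scikit-learn (the kernel and hyperparameters are illustrative assumptions; the report does not state its exact settings):

```python
import numpy as np
from sklearn.svm import SVR

# Made-up daily counts; SVR with an RBF kernel fits a smooth curve
days = np.arange(1, 41).reshape(-1, 1)
cases = np.sqrt(days.ravel()) * 10

model = SVR(kernel="rbf", C=100, epsilon=0.1)
model.fit(days[:32], cases[:32])        # train on the first 32 days
pred = model.predict(days[32:])         # predict the held-out days
print(pred.shape)                       # (8,)
```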
Figure 28
Figure 29
Conclusion-
As seen above, the Support Vector Machine algorithm, with the lower test error, proved to be
a good fit for our model.
References-
[1] Predictive Analysis for Covid 19 Using Python. Divyadharani A K, Gayatri A B,
Jeyanithy N M, Padmavathi R, Dr. K. Velmurugan. International Journal of All Research
Education and Scientific Methods, ISSN 2455-6211.
[2] Machine Learning Algorithms - a Review. Batta Mahesh. International Journal of
Science and Research, ISSN 2319-7064.
[3] A Study of Real World Data Visualization of Covid-19 dataset using Python. Kamlendu
Pandey, Ronak Panchal. International Journal of Management and Humanities,
ISSN 2394-0913.