
Project Stage II Report

On

“Covid19 Impact Upon the Community”

Submitted in fulfillment of the requirements


For the Degree of
Bachelor of Technology
in
Electronics & Telecommunication Engineering

By

Riya Bajaj (1714110300)


Tarique Shahidi (1714110309)
Anirudh Amla (1714110231)

Under the guidance of

Prof. Sudhir Bussa

Department of Electronics & Telecommunication Engineering


Bharati Vidyapeeth (Deemed to be) University
College of Engineering, Pune – 411043

Academic Year 2021-22

BHARATI VIDYAPEETH (DEEMED TO BE) UNIVERSITY


COLLEGE OF ENGINEERING, PUNE – 411043
DEPARTMENT OF ELECTRONICS & TELECOMMUNICATION
ENGINEERING

CERTIFICATE

Certified that the project report entitled “Covid19 Impact Upon the Community” is a
bonafide work done by Riya Bajaj, Tarique Shahidi and Anirudh Amla in fulfillment of the
requirements for the award of the degree of Bachelor of Technology in Electronics &
Telecommunication Engineering.

Date:

Prof. Sudhir Bussa Prof. Deepak Ray Prof. Shruti Oza


Guide Project Coordinator HOD

Examiner 1:

Examiner 2:

ACKNOWLEDGEMENT

We would like to extend our sincere gratitude to the Principal, Dr. Vidula Sohoni, and the Head of
the Department of Electronics & Telecommunication Engineering, Prof. Shruti Oza, for nurturing a
congenial yet competitive environment, one that motivates all the students not only to pursue their
goals but also to elevate themselves on a humanitarian level.

Inspiration and guidance are invaluable in every aspect of life, and we have received both from
our respected project guide, Prof. Sudhir Bussa, whose careful and ardent guidance enabled us to
complete this project. Words cannot fully express our gratitude for his untiring devotion; he is
undoubtedly a master of his craft.

We would also like to thank all the faculty members who directly or indirectly helped us from
time to time with their invaluable inputs.

ABSTRACT

The ongoing destructive pandemic of coronavirus disease 2019 (COVID-19), caused by severe
acute respiratory syndrome coronavirus 2 (SARS-CoV-2), was first reported in Wuhan,
China, in December 2019. The outbreak has affected millions of people around the world, and
the number of infections and deaths has been growing at an alarming rate. In such a
situation, forecasting and careful study of the pattern of disease spread can inform the design of
better strategies and more efficient decisions. Moreover, such studies play an important
role in achieving accurate predictions.

Machine learning offers numerous tools for visualization and prediction, and it is now used
worldwide to study the pattern of COVID-19 spread. One of the main focuses of this project is to
use machine learning techniques to analyze and visualize the spread of the virus, both
country-wise and globally, over a specific period of time by considering confirmed cases,
recovered cases and fatalities.

In this project, we apply linear regression, support vector machines and related methods to
Johns Hopkins University’s COVID-19 data to anticipate the future course of the COVID-19
pandemic in India and some other countries. We also study the influence of parameters such as
geographic conditions, economic statistics, population statistics and life expectancy on the
prediction of COVID-19 spread.

TABLE OF CONTENTS
Chapter No.   Name of Topic                         Page No.
              List of Figures                       III
              List of Tables                        IV
              Abstract                              V
Chapter 1     Introduction                          7-11
              1.1 Covid19 and Problem Statement     7
              1.2 Technology and Concept            7-10
              1.3 Software Used                     10-11
              1.4 Data Sources                      11
Chapter 2     Literature Survey                     12
Chapter 3     Data Mining                           13-15
Chapter 4     Exploratory Data Analysis             16-32
Chapter 5     Machine Learning Algorithms           33-38

LIST OF FIGURES
FIGURE NO.  NAME OF THE FIGURE  PAGE NO. 
3.1  Figure1 14
4.1  Figure2 18
4.2  Figure3 18
4.3  Figure4 19
4.4  Figure5 19 
4.5  Figure6 20 
4.6 Figure7 20 
4.7 Figure8 22
4.8  Figure9 23
4.9  Figure10 23
4.10  Figure11 24
4.11 Figure12 24
4.12 Figure13 26
4.13 Figure14 26
4.14 Figure15 27
4.15 Figure16 27
4.16 Figure17 28
4.17 Figure18 28
4.18 Figure19 29
4.19 Figure20 30
4.20 Figure21 31
4.21 Figure22 31
4.22 Figure23 32
4.23 Figure24 33
5.1 Figure25 34
5.2 Figure26 35
5.3 Figure27 36
5.4 Figure28 36
5.5 Figure29 38
5.6 Figure30 38

1 - INTRODUCTION

1.1 Covid19 and Problem Statement

On 31st December 2019, in the city of Wuhan (China), a cluster of cases of pneumonia of
unknown cause was reported to the World Health Organization (WHO). In January 2020, a
previously unknown virus was identified and subsequently named the 2019 novel coronavirus,
and WHO later declared COVID-19 a pandemic. A pandemic is a disease that has spread over a
wide geographical area and affected a high proportion of the population.
The pandemic has taken a firm grip on people’s lives. Since its start, some countries have faced
the problem of ever-increasing cases. Through analysis of case data, one can see how countries
all over the world are doing in terms of controlling the pandemic, and one can adopt the
prevention models of the countries that have done well in flattening the curve. Predictions made
from the datasets available to individuals, countries and organisations help them judge how far
they have been able to control the pandemic and to what extent preventive measures should be
enforced. This project is a step towards helping people understand the spread and predict the
cases in their country; it also gives insight into how well a country is limiting the spread.

1.2 Technology and Concept

Python Language

Python is an interpreted, high-level, general-purpose programming language. Its design
philosophy emphasizes code readability through its use of significant indentation. Its language
constructs and object-oriented approach aim to help programmers write clear, logical code for
small and large-scale projects.
Python is dynamically typed and garbage-collected. It supports multiple programming
paradigms, including structured (particularly procedural), object-oriented and functional
programming. It is often described as a "batteries included" language due to its
comprehensive standard library.
Guido van Rossum began working on Python in the late 1980s as a successor to the ABC
programming language and first released it in 1991 as Python 0.9.0. Python 2.0 was released
in 2000 and introduced new features such as list comprehensions and a cycle-detecting
garbage collection system (in addition to reference counting). Python 3.0, released in 2008,
was a major revision of the language that is not completely backward compatible. Python 2
was discontinued with version 2.7.18 in 2020. Python consistently ranks as one of the most
popular programming languages.

Machine Learning

Machine learning is the field of study concerned with teaching a computer to learn from the data
it is fed without being explicitly programmed, enabling it to make decisions in a way similar to humans.

Nowadays it is actively used in various fields, e.g. medicine, industry, astronomy, etc.
The major types of Machine learning are Supervised Learning, Unsupervised Learning and
Reinforcement Learning.

Supervised Learning

The machine learning task of learning a function that maps input data to output data, based on
example input-output pairs.

Unsupervised Learning

A type of machine learning that draws inferences from datasets consisting of input data
without labelled responses. One of the most common unsupervised learning methods, cluster
analysis, is used to find hidden patterns or groupings in data.

Reinforcement Learning

A type of machine learning that learns from experience. No training dataset is provided (such
methods work in the absence of datasets). In reinforcement learning, an agent is rewarded or
penalised for the actions taken by the algorithm, and the task is to find the best possible path
to reach the goal.

Some important terms

Data frame

A Pandas data frame is a 2D, mutable, heterogeneous tabular data structure with labelled
axes. A data frame can be made up of more than one Series (a Series can only contain a single
list with an index).
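As a small illustration (not taken from the project notebook; the values and column names are made up), a data frame can be built from one or more Series:

import pandas as pd

# hypothetical example data
cases = pd.Series([100, 250, 400], index=['India', 'Italy', 'China'])
deaths = pd.Series([2, 10, 12], index=['India', 'Italy', 'China'])

# combine the two Series into a single labelled, tabular structure
df = pd.DataFrame({'Confirmed': cases, 'Deaths': deaths})
print(df)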

Hypothesis

In machine learning, a hypothesis is a model that is used to approximate the target function
and map inputs to outputs.

Regression

Regression in machine learning is about predicting a continuous value based on the learning
gained from a dataset. The correctness of the output can depend on the size of the dataset, the
features, the hypothesis used, etc.

Classification

The problem of identifying which sub-population a new example or observation belongs to, on
the basis of a training set of observations whose categories are already known.

Some important Libraries used

NumPy

NumPy is a Python library used for working with arrays. It also has functions for working in
the domains of linear algebra, Fourier transforms and matrices. NumPy was created in 2005 by
Travis Oliphant. It is an open-source project and free to use. NumPy stands for Numerical
Python. In Python, lists serve the purpose of arrays, but they are slow to process; NumPy aims
to provide an array object that is up to 50x faster than traditional Python lists. The array object
in NumPy is called ndarray, and it comes with many supporting functions that make working
with it very easy. Arrays are used very frequently in data science, where speed and resources
are very important.
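For illustration only (the numbers are made up, not project data), an ndarray supports fast vectorised operations without explicit Python loops:

import numpy as np

daily_cases = np.array([120, 135, 160, 210, 190])   # hypothetical daily counts
print(daily_cases.shape)      # (5,)
print(daily_cases.cumsum())   # vectorised cumulative sum
print(daily_cases.max())      # largest single-day value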

Pandas

Pandas is a Python library used for working with data sets. It has functions for analyzing,
cleaning, exploring and manipulating data. The name "Pandas" refers to both "Panel Data"
and "Python Data Analysis"; the library was created by Wes McKinney in 2008. Pandas allows
us to analyze big data and draw conclusions based on statistical theories. It can clean messy
data sets and make them readable and relevant, and relevant data is very important in data
science.

Pandas gives you answers about the data, such as:

 Is there a correlation between two or more columns?
 What is the average value?
 What is the maximum value?
 What is the minimum value?

Pandas is also able to delete rows that are not relevant or that contain wrong values, like empty
or NULL values. This is called cleaning the data.
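A brief illustrative sketch of these operations on a hypothetical frame (the column names are assumptions, not the project's):

import pandas as pd

df = pd.DataFrame({'Confirmed': [100, 250, None, 400],
                   'Deaths': [2, 10, 1, 12]})

print(df.corr())               # correlation between the columns
print(df['Confirmed'].mean())  # average value
print(df['Confirmed'].max())   # maximum value
print(df['Confirmed'].min())   # minimum value

clean = df.dropna()            # drop rows containing empty/NULL values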

Matplotlib

Matplotlib is a low-level graph plotting library in Python that serves as a visualization utility.
Matplotlib was created by John D. Hunter. It is open source and free to use. Matplotlib is
mostly written in Python, with a few segments written in C, Objective-C and JavaScript for
platform compatibility. It is a comprehensive library for creating static, animated and
interactive visualizations in Python, and it makes easy things easy and hard things possible.

 Create publication quality plots.
 Make interactive figures that can zoom, pan, update.
 Customize visual style and layout.
 Export to many file formats.
 Embed in JupyterLab and graphical user interfaces.

 Use a rich array of third-party packages built on Matplotlib.

Pyplot

Most of the Matplotlib utilities lie under the pyplot submodule and are usually imported
under the plt alias:

import matplotlib.pyplot as plt
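A minimal pyplot sketch (the data here is made up, purely to show the plotting calls):

import matplotlib.pyplot as plt

days = [1, 2, 3, 4, 5]
cases = [120, 135, 160, 210, 190]   # hypothetical daily counts

plt.plot(days, cases, marker='o')
plt.xlabel('Day')
plt.ylabel('Daily cases')
plt.title('Hypothetical daily cases')
plt.show()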

Seaborn

Seaborn is a library that uses Matplotlib underneath to plot graphs; it will be used to visualize
random distributions. Seaborn is an open-source, BSD-licensed Python library that provides a
high-level API for data visualization. It is built on top of Matplotlib, Python’s core visualization
library, and is meant to serve as a complement to it, not a replacement. Seaborn comes with
some very important features:

 Built-in themes for styling Matplotlib graphics
 Visualizing univariate and bivariate data
 Fitting and visualizing linear regression models
 Plotting statistical time series data
 Working well with NumPy and Pandas data structures
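A small sketch using one of Seaborn's bundled example datasets (fetched on first use; this is not the COVID data):

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset('tips')                    # example dataset shipped with Seaborn
sns.regplot(x='total_bill', y='tip', data=tips)    # scatter plot with a fitted regression line
plt.show()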

1.3 Software Used-

Google Colab

Colab is a free Jupyter notebook environment that runs entirely in the cloud. Most
importantly, it requires no setup, and the notebooks that you create can be simultaneously
edited by your team members, just the way you edit documents in Google Docs. Colab
supports many popular machine learning libraries, which can be easily loaded in your
notebook. Using Google Colab, a programmer can:
 Write and execute code in Python
 Document your code that supports mathematical equations
 Create/Upload/Share notebooks
 Import/Save notebooks from/to Google Drive
 Import/Publish notebooks from GitHub
 Import external datasets e.g., from Kaggle

 Integrate PyTorch, TensorFlow, Keras and OpenCV
 Use a free cloud service with a free GPU

Kaggle
Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine
learning practitioners. Kaggle allows users to find and publish data sets, explore and build
models in a web-based data-science environment, work with other data scientists and
machine learning engineers, and enter competitions to solve data science challenges. Kaggle
got its start in 2010 by offering machine learning competitions and now also offers a public
data platform, a cloud-based workbench for data science, and artificial intelligence education.
Its key personnel were Anthony Goldbloom and Jeremy Howard; Nicholas Gruen was founding
chair, succeeded by Max Levchin. Equity was raised in 2011, valuing the company at
$25 million. On 8 March 2017, Google announced that it was acquiring Kaggle.

Kaggle’s Services

 Machine Learning Competitions: this was Kaggle's first product. Companies post
problems and machine learning practitioners compete to build the best algorithm, typically
for cash prizes.
 Kaggle Kernels: a cloud-based workbench for data science and machine learning that
allows data scientists to share code and analysis in Python, R and R Markdown.
Over 150K "kernels" (code snippets) have been shared on Kaggle, covering
everything from sentiment analysis to object detection.
 Public dataset platform: community members share datasets with each other, on
everything from bone X-rays to results from boxing bouts.
 Kaggle Learn: a platform for AI education in manageable chunks.

1.4 Dataset Sources

 Covid19 datasets from Johns Hopkins University
 World Happiness Report from Kaggle

2 - LITERATURE SURVEY

 Predictive Analysis for Covid 19 Using Python
Divyadharani A K, Gayatri A B, Jeyanithy N M, Padmavathi R, Dr. K. Velmurugan
International Journal of All Research Education and Scientific Methods, ISSN 2455-6211

 A Study of Real World Data Visualization of Covid-19 dataset using Python
Kamlendu Pandey, Ronak Panchal
International Journal of Management and Humanities, ISSN 2394-0913

3 - DATA MINING

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset. When combining multiple data sources, there
are many opportunities for data to be duplicated or mislabeled. If data is incorrect, outcomes
and algorithms are unreliable, even though they may look correct. There is no one absolute
way to prescribe the exact steps in the data cleaning process because the processes will vary
from dataset to dataset. But it is crucial to establish a template for your data cleaning process,
so you know you are doing it the right way every time.

While the techniques used for data cleaning may vary according to the types of data your
company stores, you can follow these basic steps to map out a framework for your
organization.
Step 1: Remove duplicate or irrelevant observations
Remove unwanted observations from your dataset, including duplicate observations or
irrelevant observations. Duplicate observations will happen most often during data collection.
When you combine data sets from multiple places, scrape data, or receive data from clients or
multiple departments, there are opportunities to create duplicate data. De-duplication is one
of the largest areas to be considered in this process. Irrelevant observations are when you
notice observations that do not fit into the specific problem you are trying to analyze. For
example, if you want to analyze data regarding millennial customers, but your dataset
includes older generations, you might remove those irrelevant observations. This can make
analysis more efficient and minimize distraction from your primary target—as well as
creating a more manageable and more performant dataset.
Step 2: Fix structural errors
Structural errors are when you measure or transfer data and notice strange naming
conventions, typos, or incorrect capitalization. These inconsistencies can cause mislabeled
categories or classes. For example, you may find “N/A” and “Not Applicable” both appear,
but they should be analyzed as the same category.
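As a generic sketch of Steps 1 and 2 (the file name and column names are hypothetical, not from the project):

import pandas as pd

df = pd.read_csv('survey.csv')                      # hypothetical input file

# Step 1: drop exact duplicate rows and observations irrelevant to the analysis
df = df.drop_duplicates()
df = df[df['age_group'] != 'unknown']

# Step 2: fix structural errors such as inconsistent labels and capitalization
df['status'] = df['status'].str.strip().str.lower()
df['status'] = df['status'].replace({'not applicable': 'n/a'})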
Step 3: Filter unwanted outliers
Often, there will be one-off observations where, at a glance, they do not appear to fit within
the data you are analyzing. If you have a legitimate reason to remove an outlier, like improper
data-entry, doing so will help the performance of the data you are working with. However,
sometimes it is the appearance of an outlier that will prove a theory you are working on.
Remember: just because an outlier exists, doesn’t mean it is incorrect. This step is needed to
determine the validity of that number. If an outlier proves to be irrelevant for analysis or is a
mistake, consider removing it.
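One common way to flag such outliers is the interquartile-range rule; a generic sketch (the column name and the 1.5 multiplier are conventional assumptions):

# flag values outside 1.5 * IQR as potential outliers
q1 = df['daily_cases'].quantile(0.25)
q3 = df['daily_cases'].quantile(0.75)
iqr = q3 - q1

outliers = df[(df['daily_cases'] < q1 - 1.5 * iqr) |
              (df['daily_cases'] > q3 + 1.5 * iqr)]
# inspect before dropping: an outlier is not automatically an error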

Step 4: Handle missing data


You can’t ignore missing data, because many algorithms will not accept missing values. There
are a few ways to deal with missing data. None of them is optimal, but all can be considered.

1 As a first option, you can drop observations that have missing values, but doing this
will drop or lose information, so be mindful of this before you remove them.
2 As a second option, you can impute missing values based on other observations; again,
there is an opportunity to lose the integrity of the data, because you may be operating from
assumptions and not actual observations.
3 As a third option, you might alter the way the data is used to effectively navigate null
values.
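A generic sketch of the first two options (the 'Recovered' imputation mirrors the cleaning step shown later in this chapter; the other column name is an assumption):

# Option 1: drop rows that contain any missing value
df_dropped = df.dropna()

# Option 2: impute missing values from other observations,
# e.g. replace missing recovery counts with 0 or with the column median
df['Recovered'] = df['Recovered'].fillna(0)
df['daily_cases'] = df['daily_cases'].fillna(df['daily_cases'].median())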

Figure1

Some data cleaning tools-

 OpenRefine
 Trifacta Wrangler
 TIBCO Clarity
 Cloudingo
 IBM InfoSphere QualityStage

The data provided by Johns Hopkins University also required cleaning for better results, so
we cleaned it. The step-by-step process followed for cleaning the data was:

1 Removing unnecessary columns, such as Latitude and Longitude (from the COVID dataset)
and Overall rank, Score, Generosity and Perceptions of corruption (from the World Happiness
Report)

covid_data.drop(['Lat', 'Long'], axis=1, inplace=True)
useless_cols = ['Overall rank', 'Score', 'Generosity', 'Perceptions of corruption']
happiness_data.drop(useless_cols, inplace=True, axis=1)

2 Changing the indexing of the dataset from provinces to countries

covid_data_agg = covid_data.groupby('Country/Region').sum()
happiness_data.set_index('Country or region', inplace=True)

3 Converting Date from string to datetime format

full_table['Date'] = pd.to_datetime(full_table['Date'])

4 Replacing missing NaN values with 0

full_table['Recovered'] = full_table['Recovered'].fillna(0)

4 - EXPLORATORY DATA ANALYSIS

The initial analysis of data supplied or extracted, carried out to understand the trends, underlying
limitations, quality, patterns and relationships between the various entities within the data set
using descriptive statistics and visualization tools, is called Exploratory Data Analysis (EDA).
EDA gives you a fair idea of which model fits the data better and whether any data cleansing
and massaging might be required before taking the data through advanced modelling techniques
or putting it through machine learning and artificial intelligence algorithms.

Of the many outcomes of EDA, the important ones that one should try to get from the data are:

 Detect outliers and anomalies
 Determine the quality of data
 Determine what statistical models can fit the data
 Find out whether the assumptions about the data that you or your team started out with
are correct or way off
 Extract variables or dimensions on which the data can be pivoted
 Determine whether to apply univariate or multivariate analytical techniques

There are broadly two categories of EDA, graphical and non-graphical. These two are further
divided into univariate and multivariate EDA, based on interdependency of variables in your
data.

 Univariate non-graphical: Here the data features a single variable, and the EDA is done
mostly in tabular form, for example summary statistics. These non-graphical analyses give
you a statistic that indicates how skewed your data might be, or which is the dominant
value for your variable, if any.
 Univariate graphical: The EDA here involves graphical tools like bar charts and
histograms to get a quick view of how variable properties are stacked against each
other, whether there is a relationship between these properties and whether there is
any interdependency among them.
 Multivariate non-graphical: Non-graphical methods like crosstabs are used to depict
the relationship between two or more variables. Statistical values like the correlation
coefficient indicate whether there is a possible relationship and the strength of the correlation.
 Multivariate graphical: A graphical representation always gives you a better
understanding of the relationship, especially among multiple variables.
The most commonly used software tools to perform EDA are Python and R. Both enjoy
massive community support and frequent updates to packages that can be used for EDA. Let’s
look at the various graphical instruments that can be used to execute an EDA.
 Box plots
Box plots are used where there is a need to summarize data on an interval scale like the ones
on the stock market, where ticks observed in one whole day may be represented in a single
box, highlighting the lowest, highest, median and outliers.  

 Heatmap
Heatmaps are most often used for the representation of the correlation between variables.

 Histograms
The histogram is a graphical representation of numerical data that splits the data into
ranges. The taller the bar, the greater the number of data points falling in that range. A good
example here is the height data of a class of students. You would notice that the height data
looks like a bell curve for a particular class, with most of the data lying within a certain range
and a few points outside it. There will be outliers too, either very short or very tall students.

Exploratory Data Analysis Performed

1 Plotted the data for the confirmed cases for China, Italy and India for a specific
number of days (a minimal sketch of this step is shown below).
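The report shows only the resulting figures for this step; a sketch of how the cumulative curves could be plotted, assuming covid_data_agg is the country-indexed frame built in Chapter 3:

import matplotlib.pyplot as plt

# each row of covid_data_agg is a country, each remaining column a date
covid_data_agg.loc['China'].plot()
covid_data_agg.loc['Italy'].plot()
covid_data_agg.loc['India'].plot()
plt.legend(['China', 'Italy', 'India'])
plt.show()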

2 Plotted the first derivative of these curves for China, Italy and India. This way we can count
the increase in cases each day, which can also be used to calculate the infection rate.

covid_data_agg.loc['India'].diff().plot()

Figure 5
covid_data_agg.loc['China'].diff().plot()

Figure 6

covid_data_agg.loc['Italy'].diff().plot()

Figure 7

3 We calculated a new column that denotes the maximum infection rate for all the
countries

max_infections = []
for c in countries:   # countries: the list of country names, e.g. covid_data_agg.index
    max_infections.append(covid_data_agg.loc[c].diff().max())
covid_data_agg['max_infection_rate'] = max_infections

Feature Selection in Machine Learning-


A feature is an attribute that has an impact on a problem or is useful for solving it, and
choosing the important features for the model is known as feature selection. Every machine
learning process depends on feature engineering, which mainly comprises two processes:
feature selection and feature extraction. Although feature selection and extraction may have
the same objective, they are completely different from each other. The main difference is that
feature selection is about selecting a subset of the original feature set, whereas feature
extraction creates new features. Feature selection is a way of reducing the input variables for
the model by using only relevant data, in order to reduce overfitting in the model.
Before implementing any technique, it is important to understand the need for it, and the same
is true for feature selection. As we know, in machine learning it is necessary to provide a
pre-processed, good input dataset in order to get better outcomes. We collect a huge amount of
data to train our model and help it learn better. Generally, the dataset consists of noisy data,
irrelevant data and some useful data. Moreover, the huge amount of data also slows down the
training process of the model, and with noise and irrelevant data the model may not predict
and perform well. So it is very necessary to remove such noise and less important data from
the dataset, and feature selection techniques are used to do this.

Below are some benefits of using feature selection in machine learning:

o It helps in avoiding the curse of dimensionality.
o It helps in the simplification of the model so that it can be easily interpreted by the
researchers.
o It reduces the training time.
o It reduces overfitting and hence enhances generalization.

There are mainly two types of feature selection techniques:

o Supervised Feature Selection technique
Supervised feature selection techniques consider the target variable and can be used
for labelled datasets.
o Unsupervised Feature Selection technique
Unsupervised feature selection techniques ignore the target variable and can be used
for unlabelled datasets.

Figure 8
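For illustration only (this is not how the project selected its feature), a supervised feature selection sketch with scikit-learn, assuming a frame like the merged data built later in this chapter:

from sklearn.feature_selection import SelectKBest, f_regression

X = data[['GDP per capita', 'Social support', 'Healthy life expectancy',
          'Freedom to make life choices']]      # candidate predictors
y = data['max_infection_rate']                  # target

selector = SelectKBest(score_func=f_regression, k=2)   # keep the two strongest predictors
X_selected = selector.fit_transform(X, y)
print(selector.get_support())                   # boolean mask of the chosen columns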

4 We used feature selection and created a new data frame with only the necessary columns.
df = pd.DataFrame(covid_data_agg['max_infection_rate'])
df.head()

5 Merging the datasets using JOINS-


There are 4 types of joins: inner, left, right, and full outer.
 An inner join will return rows where the values in the field you are joining on
exist in both tables.

Figure 9

 A left join will return all rows from the first table written, Customer in this
instance, and only populate the second table’s fields where the key value exists.
It will return NULLs (or NaNs in python) where it does not exist in the second
table.

Figure 10

 A right join will return all rows from the second table written, Order in this
instance, and only populate the first table’s fields where the key value exists. It
will return NULLs (or NaNs in python) where it does not exist in the first table.

Figure 11

 A full join will return all rows whether the values in the field you are joining on
exist in both tables or not. If the value does not exist in the other table, NULLs
will be returned for that table’s fields.

Figure 12
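In pandas, these four join types correspond to the how argument of merge (or join); a generic sketch with made-up Customer and Order frames:

import pandas as pd

customer = pd.DataFrame({'cust_id': [1, 2, 3], 'name': ['A', 'B', 'C']})
order = pd.DataFrame({'cust_id': [2, 3, 4], 'amount': [10, 20, 30]})

inner = pd.merge(customer, order, on='cust_id', how='inner')   # keys present in both tables
left  = pd.merge(customer, order, on='cust_id', how='left')    # all customers, NaN where no order
right = pd.merge(customer, order, on='cust_id', how='right')   # all orders, NaN where no customer
full  = pd.merge(customer, order, on='cust_id', how='outer')   # everything from both tables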

We used Inner Join to merge the two datasets-


data = df.join(happiness_data, how='inner')
data.head()

Types of Visualization in Python-


 Scatterplot:
This is used to find a relationship in bivariate data. It is most commonly used to find
correlations between two continuous variables.

 Histogram:
The histogram shows the distribution of a continuous variable.  It can discover the
frequency distribution for a single variable in a univariate analysis.

 Bar Chart:
Bar Chart or Bar Plot is used to represent categorical data with vertical or horizontal
bars. It is a general plot that allows you to aggregate the categorical data based on
some function, by default the mean. 

 Pie Chart:
Pie Chart is a type of plot which is used to represent the proportion of each category
in categorical data. The whole pie is divided into slices which are equal to the number
of categories.

 Countplot:
A countplot is similar to a bar plot, except that we only pass the X-axis variable; the Y-axis
explicitly represents the count of occurrences, so each bar shows the count for one category.

 Boxplot:
Boxplot is used to show the distribution of a variable. The box plot is a standardized
way of displaying the distribution of data based on the five-number summary:
minimum, first quartile, median, third quartile, and maximum

 Heatmap:
Heatmap is a type of Matrix plot that allows you to plot data as color-encoded
matrices. It is mostly used to find multi-collinearity in a dataset. To plot a heatmap,
your data should already be in a matrix form, the heatmap basically just colors it in
for you.
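A brief sketch of a few of these plot types using Seaborn's bundled example data (purely illustrative, not the project's dataset):

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset('tips')

sns.scatterplot(x='total_bill', y='tip', data=tips)        # scatterplot of two continuous variables
plt.show()

sns.countplot(x='day', data=tips)                          # countplot of a categorical variable
plt.show()

sns.boxplot(x='day', y='total_bill', data=tips)            # boxplot: distribution per category
plt.show()

sns.heatmap(tips[['total_bill', 'tip', 'size']].corr(), annot=True)   # heatmap of a correlation matrix
plt.show()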

Regression-
A regression problem is one where the output variable is a real or continuous value, such as
“salary” or “weight”. Many different models can be used; the simplest is linear regression,
which tries to fit the data with the best hyperplane that goes through the points.

Figure 13

Types of Regression Models

Figure 14

6 To visualize the correlation between the infection rate and factors such as GDP, healthy life
expectancy, etc., we further used scatter plots, regression lines and heatmaps.

GDP vs Maximum Infection Rate-


x = data['GDP per capita']
y = data['max_infection_rate']
sns.regplot(x, np.log(y))

Figure 15

Social Support vs Maximum Infection Rate-


x = data['Social support']
y = data['max_infection_rate']
sns.regplot(x, np.log(y))

Figure 16

Healthy Life Expectancy vs Maximum Infection Rate-


x = data['Healthy life expectancy']
y = data['max_infection_rate']
sns.regplot(x, np.log(y))

Figure 17

Freedom to make life choices vs Maximum Infection Rate-


x = data['Freedom to make life choices']
y = data['max_infection_rate']
sns.regplot(x, np.log(y))

Figure 18

Plotting a correlation matrix and heatmap


data.corr()
sns.heatmap(data.corr(), annot=True)

Figure 19

From the figure above, we can see that 'max_infection_rate' has a negative correlation with
'Freedom to make life choices', 'Healthy life expectancy', 'Social support' and 'GDP per
capita', which suggests that as the maximum infection rate increases, these factors decrease,
something we all witnessed during this pandemic outbreak.

7 We used the Altair library in Python to calculate and visualize daily new cases, daily new
deaths, total confirmed cases and total deaths.

Altair-
Altair is a statistical visualization library in Python. It is declarative in nature and is based on
the Vega and Vega-Lite visualization grammars. It is fast becoming the first choice of people
looking for a quick and efficient way to visualize datasets.
It is rightly regarded as a declarative visualization library since, while visualizing any dataset
in Altair, the user only needs to specify how the data columns are mapped to the encoding
channels, i.e. declare links between the data columns and encoding channels such as the x and
y axes, rows, columns, etc. Simply put, a declarative visualization library allows you to focus
on the "what" rather than the "how", by handling the other plot details itself without the
user's help.
The following command can be used to install Altair like any other python library:
pip install altair
All Altair charts need three essential elements: data, mark and encoding. A valid chart can
also be made by specifying only the data and mark.
The basic format of an Altair chart is:

alt.Chart(data).mark_bar().encode(
    encoding1='column1',
    encoding2='column2',
)
Advantages
1 The basic code remains the same for all types of plots; the user only needs to change
the mark attribute to get different plots.
2 The code is shorter and simpler to write than in other imperative visualization libraries.
The user can focus on the relationship between the data columns and forget about the
unnecessary plot details.
3 Faceting and interactivity are very easy to implement.
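A minimal sketch of how a daily-new-cases chart like the ones below might be produced, assuming a frame india_daily with date and new_cases columns (the frame and column names are assumptions):

import altair as alt

chart = alt.Chart(india_daily).mark_bar().encode(
    x='date:T',                       # temporal encoding on the x-axis
    y='new_cases:Q',                  # quantitative encoding on the y-axis
    tooltip=['date:T', 'new_cases:Q']
).interactive()

chart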

Total Confirmed Cases for India


Figure 20

Total Deaths for India


Figure 21

Daily new cases for India


Figure 22

Daily new deaths for India


Figure 23

Daily new cases country wise


Figure 24

5 - MACHINE LEARNING ALGORITHMS


Machine learning algorithms are mathematical model-mapping methods used to learn or
uncover the underlying patterns embedded in data. Machine learning comprises a group of
computational algorithms that can perform pattern recognition, classification and prediction
on data by learning from existing data (a training set). In the recent era we have all experienced
the benefits of machine learning techniques, from streaming movie services that recommend
titles based on viewing habits to the monitoring of fraudulent activity based on customers'
spending patterns. Machine learning can handle large and complex data and draw interesting
patterns or trends from it, such as anomalies.
Machine Learning Algorithms are classified into 4 types
 Supervised Learning
 Unsupervised Learning
 Semi-supervised Learning
 Reinforcement Learning

The most popular and commonly used Machine Learning Algorithms are
 Linear Regression
 Logistic Regression
 Decision Tree
 SVM Algorithm
 Naïve Bayes Algorithm
 KNN Algorithm
 K means Clustering
 Random Forest Algorithm
 XGBoost Algorithm

In our model, we have used Linear Regression, Support Vector Machine and the XGBoost
algorithm to compare the results as well as anticipate the future effects of Covid19.

The first step in applying the algorithms was to divide our dataset into training and testing
sets.

Figure 25
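Figure 25 shows this step in the notebook. A hedged sketch of such a split, assuming india_cases holds the cumulative confirmed counts (the variable names are assumptions):

import numpy as np
from sklearn.model_selection import train_test_split

days = np.arange(len(india_cases)).reshape(-1, 1)    # day index as the feature
cases = np.array(india_cases).reshape(-1, 1)         # cumulative confirmed cases as the target

# keep the chronological order: the last 15% of days form the test set
X_train, X_test, y_train, y_test = train_test_split(days, cases, test_size=0.15, shuffle=False)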

Linear Regression
To understand the working functionality of this algorithm, imagine how you would arrange
random logs of wood in increasing order of their weight. There is a catch; however – you
cannot weigh each log. You have to guess its weight just by looking at the height and girth of
the log (visual analysis) and arrange them using a combination of these visible parameters.
This is what linear regression in machine learning is like.
In this process, a relationship is established between the independent and dependent variables
by fitting them to a line. This line is known as the regression line and is represented by the
linear equation Y = a*X + b.
In this equation:
 Y – Dependent Variable
 a – Slope
 X – Independent variable
 b – Intercept

The coefficients a & b are derived by minimizing the sum of the squared difference of
distance between data points and the regression line.

We performed linear regression on the confirmed-cases data by dividing the data into two
parts, training data and testing data. First we fitted a polynomial model of degree four to the
training data, then evaluated the testing data on the same model and obtained an error of 8.
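Figures 26 and 27 show the notebook code and the resulting fit. A hedged, self-contained sketch of the same idea, reusing the split sketched above (the exact error metric used in the report is not stated, so it is an assumption here):

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_absolute_error

poly = PolynomialFeatures(degree=4)          # degree-four polynomial features
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

model = LinearRegression()
model.fit(X_train_poly, y_train)

pred = model.predict(X_test_poly)
print(mean_absolute_error(y_test, pred))     # evaluate the fit on the held-out days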

Figure 26

Figure 27

Support Vector Machine


Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is
used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes, so that we can easily put new data points in the
correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed a Support Vector
Machine.
SVM is of two types:
 Linear SVM: Linear SVM is used for linearly separable data, which means that if a
dataset can be classified into two classes using a single straight line, it is termed linearly
separable data, and the classifier used is called a Linear SVM classifier.
 Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means
that if a dataset cannot be classified using a straight line, it is termed non-linear data, and
the classifier used is called a Non-linear SVM classifier.

We applied the Support Vector Machine algorithm by dividing the dataset into training and
testing datasets. First we trained the model on the training dataset, then passed the testing
dataset through the model and obtained an error of 7.
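Figures 28 and 29 show the notebook output for this step. A hedged sketch of a support vector regressor on the same split (the kernel and hyperparameters are assumptions, not the project's exact values):

from sklearn.svm import SVR

svm_model = SVR(kernel='poly', degree=5, C=0.1, epsilon=0.01)
svm_model.fit(X_train, y_train.ravel())      # fit on the training portion of the day/case data

svm_pred = svm_model.predict(X_test)         # predict on the held-out days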

Figure 28

Figure 29

Conclusion-
As seen above, the Support Vector Machine algorithm, with its lower error, proved to be a good
fit for our data.

References-
[1] Predictive Analysis for Covid 19 Using Python
Divyadharani A K, Gayatri A B, Jeyanithy N M, Padmavathi R, Dr. K. Velmurugan
International Journal of All Research Education and Scientific Methods, ISSN 2455-6211
[2] Machine Learning Algorithms - a Review
Batta Mahesh
International Journal of Science and Research, ISSN 2319-7064
[3] A Study of Real World Data Visualization of Covid-19 dataset using Python
Kamlendu Pandey, Ronak Panchal
International Journal of Management and Humanities, ISSN 2394-0913
