Project Stage II Report On: "Covid19 Impact Upon The Community"
CERTIFICATE
Certified that the project report entitled, “Covid19 Impact Upon the Community”, is a
bonafide work done by Riya Bajaj, Tarique Shahidi, and Anirudh Amla in fulfillment of the
requirements for the award of the degree of Bachelor of Technology in Electronics &
Telecommunication Engineering.
Date:
Examiner 1:
Examiner 2:
ACKNOWLEDGEMENT
We would like to extend our sincere gratitude to the Principal, Dr. Vidula Sohoni, and the
Head of the Department of Electronics & Telecommunication, Prof. Shruti Oza, for nurturing
a congenial yet competitive environment that motivates all students not only to pursue their
goals but also to grow as people.
Inspiration and guidance are invaluable in every aspect of life, and we have received both
from our respected project guide, Prof. Sudhir Bussa, whose careful and ardent guidance
enabled us to complete this project. Words cannot suffice to express our gratitude for his
untiring devotion; he is undoubtedly a master of his craft.
We would also like to thank all the faculty members who directly or indirectly helped us from
time to time with their invaluable inputs.
ABSTRACT
The current destructive pandemic of coronavirus disease 2019 (COVID-19), caused by severe
acute respiratory syndrome coronavirus 2 (SARS-CoV-2), was first reported in Wuhan,
China, in December 2019. The outbreak has affected millions of people around the world,
and the number of infections and deaths has been growing at an alarming rate. In such a
situation, forecasting and careful study of the pattern of disease spread can inform the design
of better strategies and more efficient decisions. Moreover, such studies play an important
role in achieving accurate predictions.
Machine learning offers numerous tools for visualization and prediction, and it is now used
worldwide to study the pattern of COVID-19 spread. One of the main focuses of this project
is to use machine learning techniques to analyze and visualize the spread of the virus,
country-wise as well as globally, over a specific period of time by considering confirmed
cases, recovered cases, and fatalities.
In this project, we apply linear regression, support vector machines, and other algorithms to
Johns Hopkins University’s COVID-19 data to anticipate the future effects of the COVID-19
pandemic in India and some other countries. Moreover, we study the impact of parameters
such as geographic conditions, economic statistics, population statistics, and life expectancy
on the prediction of COVID-19 spread.
TABLE OF CONTENTS
List of Figures
List of Tables
Abstract
Chapter 1: Introduction
1.1 Covid19 and Problem Statement
1.2 Technology and Concept
1.3 Software Used
1.4 Data Sources
Chapter 2: Literature Survey
Chapter 3: Data Mining
Chapter 4: Exploratory Data Analysis
Chapter 5: Machine Learning Algorithms
LIST OF FIGURES
3.1 Figure 1
4.1–4.23 Figures 2–24
5.1–5.6 Figures 25–30
1 INTRODUCTION
On 31st December 2019, in the city of Wuhan, China, a cluster of cases of pneumonia of
unknown cause was reported to the World Health Organization. In January 2020, a
previously unknown virus was identified and subsequently named the 2019 novel
coronavirus. The WHO has since declared COVID-19 a pandemic: a disease spread over a
wide geographical area and affecting a high proportion of the population.
The pandemic has taken a grip on people’s lives. Since it began, some countries have faced
ever-increasing case counts. Through data analysis of cases, one can assess how countries all
over the world are doing at controlling the pandemic, and adopt the prevention models of the
countries that are succeeding in flattening the curve. Predictions made from the datasets
available to an individual, country, or organisation help them judge how far they have been
able to control the pandemic and to what extent they should apply preventive measures. This
project is a step towards helping people understand the spread and predict the cases in their
country. It also gives an insight into how a country is doing in terms of limiting the spread.
Python Language
Machine Learning
Machine learning is the field of study, or process, of teaching a computer to learn from the
data fed to it without being explicitly programmed. It enables computers to make decisions in
a way similar to humans. Nowadays it is actively used in various fields, e.g. medicine,
industry, and astronomy.
The major types of Machine learning are Supervised Learning, Unsupervised Learning and
Reinforcement Learning.
Supervised Learning
The machine learning task of learning a function that maps input data to output data, based
on example input-output pairs.
Unsupervised Learning
A type of machine learning that draws inferences from datasets consisting of input data
without labelled responses. One of the most common unsupervised learning methods, cluster
analysis, is used to find hidden patterns or groupings in data.
Reinforcement Learning
A type of machine learning that learns from experience; such methods work in the absence of
a training dataset. An agent is rewarded or penalised for the actions it takes, and its task is to
find the best possible path to reach the goal.
Data frame
A pandas DataFrame is a 2D, mutable, heterogeneous tabular data structure with labelled
axes. A DataFrame can be made of more than one Series (a Series can only contain a single
list with an index).
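As a small illustrative sketch (with made-up country names and numbers), a DataFrame can be assembled from several Series that share a labelled index:

```python
import pandas as pd

# Two Series sharing the same labelled index
cases = pd.Series([100, 250, 75], index=["India", "Italy", "China"])
deaths = pd.Series([2, 10, 1], index=["India", "Italy", "China"])

# A DataFrame is a 2D, labelled, heterogeneous table built from such Series
df = pd.DataFrame({"Confirmed": cases, "Deaths": deaths})
print(df.loc["Italy", "Confirmed"])  # 250
```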
Hypothesis
In machine learning, a hypothesis is a model that approximates the target function and
performs the mapping of inputs to outputs.
Regression
The task of predicting a continuous output value from input variables.
Classification
The task of predicting a discrete class label from input variables.
NumPy
NumPy is a Python library used for working with arrays. It also has functions for working in
domain of linear algebra, Fourier transform, and matrices. NumPy was created in 2005 by
Travis Oliphant. It is an open-source project, and you can use it freely. NumPy stands for
Numerical Python. In Python we have lists that serve the purpose of arrays, but they are slow
to process. NumPy aims to provide an array object that is up to 50x faster than traditional
Python lists. The array object in NumPy is called ndarray; it provides a lot of supporting
functions that make working with it very easy. Arrays are very frequently used in data
science, where speed and resources are very important.
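A minimal sketch of ndarray usage, showing the vectorized operations that make NumPy faster than plain Python lists:

```python
import numpy as np

# ndarray supports fast, vectorized element-wise operations
a = np.array([1, 2, 3, 4])
b = a * 2            # vectorized multiply, no Python loop
print(b.tolist())    # [2, 4, 6, 8]
print(a.mean())      # 2.5
```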
Pandas
Pandas is a Python library used for working with data sets. It has functions for analyzing,
cleaning, exploring, and manipulating data. The name "Pandas" refers to both "Panel Data"
and "Python Data Analysis"; the library was created by Wes McKinney in 2008. Pandas
allows us to analyze big data and draw conclusions based on statistical theory. It can clean
messy data sets and make them readable and relevant. Relevant data is very important in data
science.
Pandas can also delete rows that are not relevant, or that contain wrong values such as empty
or NULL values. This is called cleaning the data.
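For example (on illustrative data), rows containing empty or NULL values can be dropped in one call:

```python
import pandas as pd
import numpy as np

# A messy toy table: two rows have a missing value
df = pd.DataFrame({"Country": ["India", "Italy", None],
                   "Confirmed": [100, np.nan, 75]})

clean = df.dropna()   # drop any row containing NULL/NaN values
print(len(clean))     # 1 (only the fully populated row survives)
```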
Matplotlib
Matplotlib is a low-level graph plotting library in python that serves as a visualization utility.
Matplotlib was created by John D. Hunter. Matplotlib is open source, and we can use it
freely. Matplotlib is mostly written in python, a few segments are written in C, Objective-C
and JavaScript for Platform compatibility. Matplotlib is a comprehensive library for creating
static, animated, and interactive visualizations in Python. Matplotlib makes easy things easy
and hard things possible.
Pyplot
Most of the Matplotlib utilities lie under the pyplot submodule, which is usually imported
under the plt alias:
import matplotlib.pyplot as plt
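As a minimal sketch (with made-up day and case numbers), a script-safe pyplot plot looks like this:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

# A minimal line plot of made-up daily case counts
days = [1, 2, 3, 4]
cases = [10, 25, 60, 130]

fig, ax = plt.subplots()
ax.plot(days, cases)
ax.set_xlabel("Day")
ax.set_ylabel("Confirmed cases")
fig.savefig("cases.png")  # write the figure to disk
```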
Seaborn
Seaborn is a library that uses Matplotlib underneath to plot graphs. It will be used to visualize
random distributions. Seaborn is an open source, BSD-licensed Python library providing high
level API for visualizing the data using Python programming language. Seaborn is built on
top of Python’s core visualization library Matplotlib. It is meant to serve as a complement,
and not a replacement. However, Seaborn comes with some very important features.
Google Colab
Colab is a free Jupyter notebook environment that runs entirely in the cloud. Most
importantly, it requires no setup, and the notebooks that you create can be simultaneously
edited by your team members - just the way you edit documents in Google Docs. Colab
supports many popular machine learning libraries, which can be easily loaded in your
notebook. As a programmer, you can do the following using Google Colab:
Write and execute code in Python
Document your code that supports mathematical equations
Create/Upload/Share notebooks
Import/Save notebooks from/to Google Drive
Import/Publish notebooks from GitHub
Import external datasets e.g., from Kaggle
Kaggle
Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine
learning practitioners. Kaggle allows users to find and publish data sets, explore and build
models in a web-based data-science environment, work with other data scientists and
machine learning engineers, and enter competitions to solve data science challenges. Kaggle
got its start in 2010 by offering machine learning competitions and now also offers a public
data platform, a cloud-based workbench for data science, and Artificial Intelligence
education. Its key personnel were Anthony Goldbloom and Jeremy Howard. Nicholas
Gruen was founding chair, succeeded by Max Levchin. Equity was raised in 2011, valuing the
company at $25 million. On 8 March 2017, Google announced that it was acquiring
Kaggle.
Kaggle’s Services
Machine Learning Competitions: this was Kaggle's first product. Companies post
problems and machine learners compete to build the best algorithm, typically with
cash prizes.
Kaggle Kernels: a cloud-based workbench for data science and machine learning.
Allows data scientists to share code and analysis in Python, R and R Markdown.
Over 150K "kernels" (code snippets) have been shared on Kaggle covering
everything from sentiment analysis to object detection.
Public dataset platform: community members share datasets with each other. Has
datasets on everything from bone x-rays to results from boxing bouts.
Kaggle learn: a platform for AI education in manageable chunks.
2-LITERATURE SURVEY
3-DATA MINING
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset. When combining multiple data sources, there
are many opportunities for data to be duplicated or mislabeled. If data is incorrect, outcomes
and algorithms are unreliable, even though they may look correct. There is no one absolute
way to prescribe the exact steps in the data cleaning process because the processes will vary
from dataset to dataset. But it is crucial to establish a template for your data cleaning process,
so you know you are doing it the right way every time.
While the techniques used for data cleaning may vary according to the types of data your
company stores, you can follow these basic steps to map out a framework for your
organization.
Step 1: Remove duplicate or irrelevant observations
Remove unwanted observations from your dataset, including duplicate observations or
irrelevant observations. Duplicate observations will happen most often during data collection.
When you combine data sets from multiple places, scrape data, or receive data from clients or
multiple departments, there are opportunities to create duplicate data. De-duplication is one
of the largest areas to be considered in this process. Irrelevant observations are when you
notice observations that do not fit into the specific problem you are trying to analyze. For
example, if you want to analyze data regarding millennial customers, but your dataset
includes older generations, you might remove those irrelevant observations. This can make
analysis more efficient and minimize distraction from your primary target, as well as
creating a more manageable and more performant dataset.
Step 2: Fix structural errors
Structural errors are when you measure or transfer data and notice strange naming
conventions, typos, or incorrect capitalization. These inconsistencies can cause mislabeled
categories or classes. For example, you may find “N/A” and “Not Applicable” both appear,
but they should be analyzed as the same category.
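A sketch of fixing such a structural error with pandas (the column name is hypothetical):

```python
import pandas as pd

# Two spellings of the same category appear in the raw data
df = pd.DataFrame({"Status": ["N/A", "Not Applicable", "Recovered"]})

# Normalize both spellings into one category
df["Status"] = df["Status"].replace({"Not Applicable": "N/A"})
print(df["Status"].tolist())  # ['N/A', 'N/A', 'Recovered']
```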
Step 3: Filter unwanted outliers
Often, there will be one-off observations where, at a glance, they do not appear to fit within
the data you are analyzing. If you have a legitimate reason to remove an outlier, like improper
data-entry, doing so will help the performance of the data you are working with. However,
sometimes it is the appearance of an outlier that will prove a theory you are working on.
Remember: just because an outlier exists, doesn’t mean it is incorrect. This step is needed to
determine the validity of that number. If an outlier proves to be irrelevant for analysis or is a
mistake, consider removing it.
Step 4: Handle missing data
1 As a first option, you can drop observations that have missing values, but doing this
will drop or lose information, so be mindful of this before you remove it.
2 As a second option, you can input missing values based on other observations; again,
there is an opportunity to lose integrity of the data because you may be operating from
assumptions and not actual observations.
3 As a third option, you might alter the way the data is used to effectively navigate null
values.
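The three options above can be sketched with pandas on illustrative numbers:

```python
import pandas as pd
import numpy as np

s = pd.Series([10.0, np.nan, 30.0])

dropped = s.dropna()           # option 1: drop observations with missing values
imputed = s.fillna(s.mean())   # option 2: impute from the other observations
flagged = s.isna()             # option 3: navigate nulls explicitly instead

print(len(dropped))            # 2
print(imputed.tolist())        # [10.0, 20.0, 30.0]
```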
Figure 1
The data provided by Johns Hopkins University also required cleaning for better results, so
we cleaned it as follows.
Step-by-step process followed for cleaning the data:
1 Removing useless columns such as Latitude and Longitude (from the Covid dataset), and
Overall rank, Score, Generosity, and Perception of corruption (from the World Happiness
Report).
covid_data_agg = covid_data.groupby('Country/Region').sum()
happiness_data.set_index('Country or region', inplace=True)
full_table['Date'] = pd.to_datetime(full_table['Date'])
full_table['Recovered'] = full_table['Recovered'].fillna(0)
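Putting the steps together, a hedged sketch of the cleaning pipeline on synthetic stand-in data (the real column names follow the Johns Hopkins file; the numbers here are made up):

```python
import pandas as pd

# Synthetic stand-in for the Johns Hopkins daily file
covid_data = pd.DataFrame({
    "Country/Region": ["India", "India", "Italy"],
    "Lat": [20.5, 20.5, 41.8],    # columns to discard
    "Long": [78.9, 78.9, 12.5],
    "Confirmed": [100, 150, 75],
})

# Drop the unneeded columns, then aggregate provinces per country
covid_data = covid_data.drop(columns=["Lat", "Long"])
covid_data_agg = covid_data.groupby("Country/Region").sum()

print(covid_data_agg.loc["India", "Confirmed"])  # 250
```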
The initial analysis of supplied or extracted data, using descriptive statistics and visualization
tools to understand the trends, underlying limitations, quality, patterns, and relationships
between entities within the data set, is called Exploratory Data Analysis (EDA). EDA gives
you a fair idea of which model better fits the data and whether any data cleansing and
massaging might be required before the data is taken through advanced modelling techniques
or machine learning and artificial intelligence algorithms.
EDA falls broadly into two categories, graphical and non-graphical. These are further
divided into univariate and multivariate EDA, based on the interdependency of the variables
in your data.
Univariate non-graphical: Here the data features a single variable, and the EDA is done
mostly in tabular form, for example with summary statistics. These non-graphical analyses
give you statistics that indicate, for instance, how skewed your data might be or which value
is dominant for your variable, if any.
Univariate graphical: The EDA here involves graphic tools such as bar charts, histograms
and box plots to get a quick view of how the values of a single variable are distributed.
Multivariate non-graphical: non-graphical methods like crosstabs are used to depict
the relationship between two or more variables. Statistical values like the correlation
coefficient indicate whether there is a possible relationship and measure its strength.
Multivariate graphical: A graphical representation always gives you a better
understanding of the relationship, especially among multiple variables.
The most commonly used software tools for EDA are Python and R. Both enjoy massive
community support and frequent updates to packages that can be used for EDA. Let’s look at
the various graphical instruments that can be used to execute an EDA.
Box plots
Box plots are used where there is a need to summarize data on an interval scale like the ones
on the stock market, where ticks observed in one whole day may be represented in a single
box, highlighting the lowest, highest, median and outliers.
Heatmap
Heatmaps are most often used for the representation of the correlation between variables.
Histograms
The histogram is the graphical representation of numerical data that splits the data into
ranges. The taller the bar, the greater the number of data points falling in that range. A good
example here is the height data of a class of students. You would notice that the height data
looks like a bell curves for a particular class with most the data lying within a certain range
and a few of outside these ranges. There will be outliers too, either very short or very small.
1 Plotted the data for the confirmed cases for China, Italy and India for a specific
number of days.
2 Plotted the first derivative for China, Italy and India. This way we can count the
daily increase in cases, which can also be used to calculate the infection
rate.
covid_data_agg.loc['India'].diff().plot()
Figure 5
covid_data_agg.loc['China'].diff().plot()
Figure 6
covid_data_agg.loc['Italy'].diff().plot()
Figure 7
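The first-derivative step above can be sketched on synthetic cumulative counts; `.diff()` turns running totals into daily new cases:

```python
import pandas as pd

# Cumulative confirmed cases over five days (made-up numbers)
cumulative = pd.Series([10, 15, 25, 40, 70])

daily_new = cumulative.diff()   # first derivative: new cases per day
print(daily_new.max())          # 30.0 (the maximum daily jump, i.e. infection rate)
```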
3 We calculated a new column that denotes the maximum infection rate for all the
countries.
max_infections = []
for c in countries:
    max_infections.append(covid_data_agg.loc[c].diff().max())
covid_data_agg['max_infection_rate'] = max_infections
There are mainly two types of feature selection techniques: supervised and unsupervised.
Figure 8
4) We used feature selection and created a new data frame with only the necessary columns.
df = pd.DataFrame(covid_data_agg['max_infection_rate'])
df.head()
Figure 9
A left join will return all rows from the first table written, Customer in this
instance, and only populate the second table’s fields where the key value exists.
It will return NULLs (or NaNs in Python) where it does not exist in the second
table.
Figure 10
A right join will return all rows from the second table written, Order in this
instance, and only populate the first table’s fields where the key value exists. It
will return NULLs (or NaNs in Python) where it does not exist in the first table.
Figure 11
A full join will return all rows whether the values in the field you are joining on
exist in both tables or not. If the value does not exist in the other table, NULLs
will be returned for that table’s fields.
Figure 12
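The three join types above can be sketched with `pd.merge` on toy Customer and Order tables (column names and values are illustrative):

```python
import pandas as pd

customer = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Bob"]})
order = pd.DataFrame({"id": [2, 3], "total": [50, 80]})

left = pd.merge(customer, order, on="id", how="left")    # all customers
right = pd.merge(customer, order, on="id", how="right")  # all orders
full = pd.merge(customer, order, on="id", how="outer")   # everything, NaN-filled

print(len(left), len(right), len(full))  # 2 2 3
```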
Histogram:
The histogram shows the distribution of a continuous variable. It can discover the
frequency distribution for a single variable in a univariate analysis.
Bar Chart:
Bar Chart or Bar Plot is used to represent categorical data with vertical or horizontal
bars. It is a general plot that allows you to aggregate the categorical data based on
some function, by default the mean.
Pie Chart:
Pie Chart is a type of plot which is used to represent the proportion of each category
in categorical data. The whole pie is divided into slices which are equal to the number
of categories.
Countplot:
A countplot is similar to a bar plot, except that we pass only the X-axis variable and
the Y-axis explicitly represents the count of occurrences: each bar shows the count
for one category of the variable.
Boxplot:
Boxplot is used to show the distribution of a variable. The box plot is a standardized
way of displaying the distribution of data based on the five-number summary:
minimum, first quartile, median, third quartile, and maximum.
Heatmap:
Heatmap is a type of Matrix plot that allows you to plot data as color-encoded
matrices. It is mostly used to find multi-collinearity in a dataset. To plot a heatmap,
your data should already be in a matrix form, the heatmap basically just colors it in
for you.
Regression-
A regression problem is one where the output variable is a real or continuous value, such as
“salary” or “weight”. Many different models can be used; the simplest is linear
regression, which tries to fit the data with the best hyper-plane through the points.
Figure 13
Figure 14
6) To visualize the correlation between the infection rate and factors such as GDP, Healthy
Life Expectancy, etc., we further used scatter plots, regression lines and heatmaps.
Figure 15
Figure 16
Figure 17
Figure 18
Figure 19
From the above figure, we can interpret that the maximum infection rate is negatively
correlated with ‘Freedom to make life choices’, ‘Healthy Life Expectancy’, ‘Social
support’ and ‘GDP per capita’: as the maximum infection rate increases, these factors
decrease, which we all witnessed during this pandemic outbreak.
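A hedged sketch of computing such a correlation with pandas (the numbers are made up; the report's real columns come from the merged Happiness data):

```python
import pandas as pd

# Synthetic data in which higher infection rates pair with lower GDP
df = pd.DataFrame({
    "max_infection_rate": [1000, 5000, 20000, 80000],
    "GDP per capita": [1.8, 1.4, 1.1, 0.7],
})

corr = df.corr()  # pairwise Pearson correlation matrix
print(corr.loc["max_infection_rate", "GDP per capita"] < 0)  # True
```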
7) We used the Altair library in Python to calculate and visualize Daily New Cases, Daily
New Deaths, Total Confirmed Cases and Total Deaths.
Altair-
Altair is a statistical visualization library in Python. It is declarative in nature and is based
on the Vega and Vega-Lite visualization grammars. It is fast becoming the first choice of
people looking for a quick and efficient way to visualize datasets.
It is rightly regarded as a declarative visualization library since, while visualizing any dataset
in Altair, the user only needs to specify how the data columns map to encoding channels, i.e.
declare links between the data columns and encoding channels such as the x and y axes,
rows, columns, etc. Simply put, a declarative visualization library lets you focus on the
“what” rather than the “how”, handling the other plot details itself without the user’s
help.
The following command can be used to install Altair like any other python library:
pip install altair
All altair charts need three essential elements: Data, Mark and Encoding. A valid chart can
also be made by specifying only the data and mark.
The basic format of all Altair charts is:
alt.Chart(data).mark_bar().encode(
    encoding1='column1',
    encoding2='column2',
)
Advantages
1 The basic code remains the same for all types of plots; the user only needs to change
the mark attribute to get a different plot.
2 The code is shorter and simpler to write than in imperative visualization libraries.
The user can focus on the relationship between the data columns and forget about the
unnecessary plot details.
3 Faceting and Interactivity are very easy to implement.
The most popular and commonly used Machine Learning Algorithms are
Linear Regression
Logistic Regression
Decision Tree
SVM Algorithm
Naïve Bayes Algorithm
KNN Algorithm
K means Clustering
Random Forest Algorithm
XGBoost Algorithm
In our model, we used Linear Regression, Support Vector Machine and the XGBoost
algorithm to compare results as well as anticipate the future effects of Covid19.
The first step in applying the algorithms was to divide our dataset into training and testing
datasets.
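The split can be sketched with scikit-learn's `train_test_split` on synthetic day/case numbers (the split ratio here is an illustrative assumption, not the report's exact value):

```python
import numpy as np
from sklearn.model_selection import train_test_split

days = np.arange(1, 51).reshape(-1, 1)   # feature: day number
cases = np.arange(1, 51) ** 2            # target: made-up case counts

# Hold out the last 15% of days as the test set (no shuffling for time series)
X_train, X_test, y_train, y_test = train_test_split(
    days, cases, test_size=0.15, shuffle=False)

print(len(X_train), len(X_test))  # 42 8
```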
Figure 25
Linear Regression
To understand the working functionality of this algorithm, imagine how you would arrange
random logs of wood in increasing order of their weight. There is a catch; however – you
cannot weigh each log. You have to guess its weight just by looking at the height and girth of
the log (visual analysis) and arrange them using a combination of these visible parameters.
This is what linear regression in machine learning is like.
In this process, a relationship is established between independent and dependent variables by
fitting them to a line. This line is known as the regression line and is represented by the
linear equation Y = a*X + b.
In this equation:
Y – Dependent Variable
a – Slope
X – Independent variable
b – Intercept
The coefficients a & b are derived by minimizing the sum of the squared difference of
distance between data points and the regression line.
We performed linear regression on the confirmed-cases data by dividing the data into two
parts, training data and testing data.
First we fitted a model of degree four on the training data, and then we evaluated it on the
testing data, obtaining an error of 8.
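The degree-four fit can be sketched with scikit-learn on synthetic data (the curve, split point, and hyperparameters are assumptions for illustration, not the report's exact setup):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Made-up cumulative cases following a clean cubic growth curve
days = np.arange(1, 31).reshape(-1, 1)
cases = 3 * days.ravel() ** 3 + 10

poly = PolynomialFeatures(degree=4)   # the degree-four transform
X = poly.fit_transform(days)

model = LinearRegression().fit(X[:24], cases[:24])  # train on the first 24 days
pred = model.predict(X[24:])                        # test on the remaining days

# On this noise-free synthetic curve the fit is essentially exact
print(np.allclose(pred, cases[24:], rtol=1e-2))  # True
```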
Figure 26
Figure 27
We applied the Support Vector Machine algorithm by dividing the dataset into training
and testing datasets. First we trained on the training dataset, and then we passed the
testing dataset through that model, obtaining an error of 7.
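A minimal SVM-regression sketch with scikit-learn (the kernel and hyperparameters are illustrative assumptions; the report does not state its exact settings):

```python
import numpy as np
from sklearn.svm import SVR

# Made-up daily counts; SVR with an RBF kernel fits a smooth curve
days = np.arange(1, 41).reshape(-1, 1)
cases = np.sqrt(days.ravel()) * 10

model = SVR(kernel="rbf", C=100, epsilon=0.1)
model.fit(days[:32], cases[:32])        # train on the first 32 days
pred = model.predict(days[32:])         # predict the held-out days
print(pred.shape)                       # (8,)
```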
Figure 28
Figure 29
Conclusion-
As seen above, the Support Vector Machine algorithm, with the lower test error, proved to be
a good fit for our model.
References-
[1] Predictive Analysis for Covid 19 Using Python. Divyadharani A K, Gayatri A B,
Jeyanithy N M, Padmavathi R, Dr. K. Velmurugan. International Journal of All Research
Education and Scientific Methods, ISSN 2455-6211.
[2] Machine Learning Algorithms - a Review. Batta Mahesh. International Journal of
Science and Research, ISSN 2319-7064.
[3] A Study of Real World Data Visualization of Covid-19 dataset using Python. Kamlendu
Pandey, Ronak Panchal. International Journal of Management and Humanities,
ISSN 2394-0913.