
A STATISTICAL ANALYSIS OF THE NOVEL

CORONAVIRUS (COVID-19) IN INDIA


A PROJECT REPORT

Submitted by

Yugan S
19BCE1072

in partial fulfillment for the award of the degree of Bachelor of Technology

in

Computer Science and Engineering

School of Computer Science and Engineering


Vellore Institute of Technology
Vandalur - Kelambakkam Road, Chennai - 600 127

11th Dec - 2021


DECLARATION

I hereby declare that the project entitled A STATISTICAL ANALYSIS OF THE NOVEL
CORONAVIRUS (COVID-19) IN INDIA, submitted by me to the School of Computer
Science and Engineering, VIT Chennai, 600127, in partial fulfillment of the
requirements for the award of the degree of Bachelor of Technology in Computer
Science and Engineering (4 year Programme), is a bona fide record of the work
carried out by me under the supervision of Prof. Dr. Arunkumar Sivaraman. I
further declare that the work reported in this project has not been submitted
and will not be submitted, either in part or in full, for the award of any other
degree or diploma of this institute or of any other institute or university.

Place:
Chennai
Date:
11/12/2021

Signature of the candidate


CERTIFICATE

This is to certify that the report entitled A STATISTICAL ANALYSIS OF THE NOVEL
CORONAVIRUS (COVID-19) IN INDIA, prepared and submitted by Yugan S (19BCE1072)
to VIT Chennai in partial fulfillment of the requirement for the award of the
degree of Bachelor of Technology in Computer Science & Engineering (4 year
Programme), is a bona fide record of work carried out under my guidance. The
project fulfills the requirements as per the regulations of VIT and in my
opinion meets the necessary standards for submission. The contents of this
report have not been submitted and will not be submitted, either in part or in
full, for the award of any other degree or diploma, and the same is certified.

Guide/Supervisor                        HoD

Name: Prof. Dr. Arunkumar Sivaraman     Name: Dr. Nithiyanandam P

Date: 11-12-2021                        Date: 11-12-2021

(Seal of SCSE)

Acknowledgement
I am obliged to express my appreciation to a number of people without whom I
could not have completed this thesis successfully. I would like to place on
record my deep sense of gratitude and thanks to my internal guide
Prof. Dr. Arunkumar Sivaraman, School of Computer Science and Engineering
(SCOPE), Vellore Institute of Technology, Chennai, whose esteemed support and
immense guidance encouraged me to complete the project successfully. I would
like to thank our HoD Dr. Nithiyanandam P, School of Computer Science and
Engineering (SCOPE), and Project Co-ordinator Dr. Arunkumar Sivaraman,
Vellore Institute of Technology, Chennai, for their valuable support and
encouragement to take up and complete this thesis. Special mention goes to our
dean Dr. Ganesan R, School of Computer Science and Engineering (SCOPE), Vellore
Institute of Technology, Chennai, for motivating us in every aspect of software
engineering. I thank the management of Vellore Institute of Technology,
Chennai, for permitting me to use the library and laboratory resources. I also
thank all the faculty members for giving me the courage and the strength that I
needed to complete my goal. This acknowledgment would be incomplete without
expressing my wholehearted thanks to my family and friends, who motivated me
during the course of my work.

Yugan S
Reg. No. 19BCE1072

Abstract
This paper focuses on the incidence of COVID-19 in India and analyses the most
affected states in the country by population. Data analytics can help in
understanding different aspects of the pandemic, such as the cumulative number
of confirmed cases, deaths, and recovered cases in different states. We use
libraries such as Scrapy, Pandas, NumPy, Matplotlib, and Plotly for dataset
extraction, scientific calculations, and visualizations. We also use Tableau, a
powerful visualization tool that allows us to plot geographic data and create
dashboards.

Contents

Declaration
Certificate
Acknowledgement
Abstract

1 Introduction
  1.1 Background
  1.2 Statement
  1.3 Motivation
  1.4 Challenges
2 Planning & Requirements Specification
  2.1 Literature Review
  2.2 Requirements
    2.2.1 System Requirements
      Software Requirements
      Hardware Requirements
3 System Design
4 Implementation of System / Methodology
  4.1 Confirmed Cases & Deceased Cases Analysis
  4.2 Time Series Analysis
  4.3 Analysing Lockdown
5 Results & Discussion
6 Conclusion and Future Work
7 References
List of Figures

S.no  Figure name
1     System design
2     Dual axis graph
3     Pie chart
4     Dumbbell chart
5     Geographical heat map
6     Line Time series analysis
7     Bar Time series analysis
8     Bar Time series analysis

1. Introduction

1.1 Background
Coronaviruses are a large family of viruses that are known to cause illness ranging from the
common cold to more severe diseases such as Middle East Respiratory Syndrome (MERS)
and Severe Acute Respiratory Syndrome (SARS).

1.2 Statement
The COVID-19 outbreak was first reported in Wuhan, China, and has spread to more than 50
countries. The first case in India was reported on 27th January 2020 in Kerala. WHO declared
COVID-19 a Public Health Emergency of International Concern (PHEIC) on 30th January 2020.
Naturally, an emerging infectious disease spreads fast, endangering the health of large numbers
of people, and thus requires immediate action to prevent the disease at the community level.
Therefore, CoronaTracker was born as an online platform that provides the latest and reliable
news developments, as well as statistics and analysis on COVID-19. This paper aims to predict and
forecast COVID-19 cases, deaths, and recoveries through time series analysis, and to analyse
whether the lockdown period in India was effective. The model helps to interpret patterns of
public sentiment on disseminating related health information, and to assess the political and
economic influence of the spread of the virus.

1.3 Motivation
Data analytics can be used to understand COVID-19 trends, so that decisions and precautions can
be taken according to the observed results. States can be classified based on their number of
active cases, recovered cases, deaths, and other attributes. We use time series analysis on the
data to predict future cases based on previously observed cases across all the states in India.

1.4 Challenges
The retrieved dataset contained some misleading Indian state names and duplicated state names.
A huge amount of data was collected, so it was not possible to verify the legitimacy of the
dataset, and working with it was time consuming. All the COVID-19 counts were given as
cumulative numbers.
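
The cleaning steps implied by these challenges can be sketched in Pandas as follows. This is a minimal sketch, not the project's actual code: the column names (date, state, confirmed), the spelling variants in ALIASES, and all the numbers are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical raw extract: column names, spellings and counts are made up.
raw = pd.DataFrame({
    "date": ["2020-04-01", "2020-04-02", "2020-04-02", "2020-04-03",
             "2020-04-01", "2020-04-02", "2020-04-03"],
    "state": ["Maharashtra", "Maharastra", "Maharastra", "maharashtra ",
              "Kerala", "Kerala", "Kerala"],
    "confirmed": [302, 335, 335, 423, 241, 265, 286],
})

# Map known spelling variants to one canonical name (extend as needed).
ALIASES = {"maharastra": "Maharashtra", "maharashtra": "Maharashtra",
           "kerala": "Kerala"}

def clean(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"])
    key = out["state"].str.strip().str.lower()
    # Unknown names fall back to the stripped original spelling.
    out["state"] = key.map(ALIASES).fillna(out["state"].str.strip())
    out = out.drop_duplicates(subset=["date", "state"])
    out = out.sort_values(["state", "date"])
    # Cumulative totals -> daily new cases; the first day keeps its total.
    daily = out.groupby("state")["confirmed"].diff()
    out["new_cases"] = daily.fillna(out["confirmed"]).astype(int)
    return out.reset_index(drop=True)

cleaned = clean(raw)
print(cleaned)
```

Differencing within each state is what turns the cumulative counts into daily figures; it only works after the duplicate rows and name variants have been merged, which is why the normalisation comes first.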

2. Planning and Requirements


2.1 Literature Review
[1] Mapping the spread of COVID-19 outbreak in India (Vanshika Bidhan, Bhavini
Malhotra, Mansi Pandit and N. Latha)
In this research paper, real-time data were extracted from daily observations using
publicly available data from COVID-19 websites and official government reports for the
period 15th Feb 2020 to 28th April 2020. Statistical analysis and visualization were
performed to draw important inferences regarding the COVID-19 trend in India. The Case
Fatality Rate (CFR) is calculated from the total deaths and the active cases in a given
time interval; the CFR is 3.22% for the interval analysed in the paper. The analysis
concludes that cases decreased due to the lockdown period.
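
As a worked illustration of the CFR arithmetic, the sketch below assumes the commonly used definition, deaths divided by confirmed cases over the same period. The denominator actually used in [1] may differ, and the counts are placeholders chosen only to reproduce a value of the same size as the one quoted.

```python
def case_fatality_rate(deaths: int, confirmed: int) -> float:
    """Return the CFR as a percentage (deaths per 100 confirmed cases)."""
    if confirmed <= 0:
        raise ValueError("confirmed must be positive")
    return 100.0 * deaths / confirmed

# Placeholder counts, not taken from the paper.
cfr = case_fatality_rate(deaths=322, confirmed=10_000)
print(f"CFR = {cfr:.2f}%")
```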

[2] Impact of COVID-19 epidemic curtailment strategies in selected Indian states: An
analysis by reproduction number and doubling time with incidence modelling (Arun
Mitra, Abhijit P. Pakhare, Adrija Roy, Ankur Joshi)
A crowd-sourced database of COVID-19 cases, available in the public domain, is used.
After preparing the data for analysis, R0 was estimated using the maximum likelihood
(ML) method. Data from the top 10 states with the highest number of cases were subset.
A Poisson regression method is used in this paper to curve-fit R(t) and to check its
robustness by plotting it against new incidence. The reproduction numbers R0 of the third
phase (30 days into lockdown from 24th March 2020) were used to model the incidence and
predict the cumulative caseload for the selected states. A total of 23,040 COVID-19 cases
had been reported in India as of 23rd April 2020, of which 20,590 cases (89.4%) were seen
in the selected 10 states. The proportion of imported cases (i.e. people arriving from
abroad) was less than 2% in all 10 states.
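
Doubling time, one of the quantities named in the title of [2], can be illustrated with a simple log-linear fit. This is a simplified stand-in, not the maximum likelihood or Poisson regression procedure of the paper: it assumes exponential growth over the fitted window, and the input series is synthetic.

```python
import numpy as np

def doubling_time(cumulative_cases) -> float:
    """Estimate the doubling time in days from a log-linear fit."""
    y = np.asarray(cumulative_cases, dtype=float)
    days = np.arange(len(y))
    # Slope of log(cases) per day is the exponential growth rate.
    growth_rate, _intercept = np.polyfit(days, np.log(y), 1)
    return float(np.log(2) / growth_rate)

# Synthetic series that doubles every 5 days, used only as a sanity check.
series = 100 * 2 ** (np.arange(15) / 5)
td = doubling_time(series)
print(f"Estimated doubling time: {td:.2f} days")
```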

[3] Statistical Explorations and Univariate Time series Analysis on COVID-19
Datasets to Understand the Trend of Disease Spreading and Death.
This paper performs three analytical studies: (1) correlation analysis to identify how
the spread of human coronavirus and its fatality are related to factors such as external
temperature, sunshine, rainfall, population, area, and density; (2) finding the
importance of the social isolation factor ("f") in restricting the spread of COVID-19;
and (3) development of univariate LSTM models to forecast total deaths and total cases
globally or country-wise (choice-based), with a comparison of their performance. The
statistical correlation study found that COVID-19 does not depend on external weather
factors such as external temperature, sunshine, and precipitation. It depends mostly on
the population and its density. Therefore, it is considered a community disease. The LSTM
models in this study may help to take necessary actions in advance to control an upcoming
health crisis.

[4] Prediction of new active cases of coronavirus disease (COVID-19) pandemic using
multiple linear regression models.
In this paper, regression techniques such as linear and multiple linear regression are
applied to the dataset to visualize the trend of the affected cases. The paper finds that
the multiple linear regression model is more accurate in predicting the outcome. The
strength of the model is its R2 value of 1.0, which indicates a strong predictor model
that takes all the factors into consideration.
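
For reference, the R2 statistic mentioned above can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below fits a multiple linear regression by ordinary least squares on synthetic data; the two predictors and their coefficients are assumptions for illustration and have no connection to the dataset used in [4].

```python
import numpy as np

def r_squared(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Synthetic data: y depends on two predictors plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + rng.normal(scale=0.5, size=50)

A = np.column_stack([X, np.ones(len(X))])      # add an intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # ordinary least squares
r2 = r_squared(y, A @ coef)
print(f"R2 = {r2:.3f}")
```

An R2 of exactly 1.0 means the fitted values reproduce every observation; with noisy data the value stays below 1.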

[5] Statistical Analysis of COVID cases in India
In this paper, statistical measures such as mean, median, mode, standard deviation, and
global average are applied to confirmed cases, recovered cases, and death cases. These
statistical tools provide complete region-wise information for India when finding the
relationship between the various case types. Poisson regression models are used for
modelling where counts of outcomes are a necessary criterion. The model has two important
kinds of outcome: count data and rate data. Count data are discrete, non-negative values
that occur within a certain period of time. Rate data describe the speed at which events
occur within a time interval.
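
The descriptive measures listed above map directly onto Pandas methods. The following is a small sketch with placeholder daily counts for a single region; the values are invented for illustration.

```python
import pandas as pd

# Placeholder daily counts for one region (count data: discrete, non-negative).
confirmed = pd.Series([12, 15, 15, 20, 31, 15, 40], name="confirmed")

summary = {
    "mean": confirmed.mean(),
    "median": confirmed.median(),
    "mode": confirmed.mode().iloc[0],   # most frequent value
    "std": confirmed.std(),             # sample standard deviation (ddof=1)
}
print(summary)
```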

[6] Statistical analysis and visualization of the potential cases of pandemic coronavirus
This paper helped us to generate and disseminate detailed information to the scientific
community and to the public, especially at the peak phase, in order to understand the
growth and impact of the novel coronavirus. The dataset is provided by the Johns Hopkins
CSSE data repository. The paper analyses and compares deaths and recoveries in each
country. In the United States, 5% of reported cases resulted in death and 8% in recovery.
In Spain, 10% of confirmed cases resulted in death and 40% in recovery.

2.2 Requirements

2.2.1 System Requirements

Software Requirement
➢ Pandas (0.22.1)
➢ Numpy (1.14.2)
➢ Scrapy (web scraping)
➢ Tableau (data visualization tool)

Hardware Requirement
➢ Microsoft Windows 8/8.1, Windows 10 (x64)
➢ RAM: 2 GB
➢ 1.5 GB minimum free disk space
➢ CPUs must support SSE4.2 and POPCNT

3. System Design

Figure 1: System design

Figure 1 shows the overall process of the statistical analysis of COVID-19 in India and
the prediction using time series analysis, from the user's interaction with the system
until the result is generated. It helps to understand the working of the system and the
communication between its modules. The goal is to define all modules in a single system
design. The COVID-19 dataset is extracted from the web using Scrapy and then passed to
the next phase, in which data types are classified and variables are identified. This is
followed by data processing, data cleaning, time series analysis, and finally result
analysis.

4. Implementation of the System

STATE WISE AVERAGE CONFIRMED & DECEASED CASES

Figure 2: Dual axis chart

Analysing the state-wise dataset of confirmed COVID-19 cases in India from 31st Jan 2020 to
31st Oct 2021, it is evident that confirmed cases are higher in states and union territories
with a larger population and higher population density. The dual-axis bar-line chart visualizes
the total number of confirmed and deceased cases. In this interval, Maharashtra has around
6.3 million confirmed cases and 0.14 million deaths, the highest number of reported cases and
the highest number of deaths. Other states with many confirmed cases include Kerala, Karnataka,
Tamil Nadu, Andhra Pradesh, Uttar Pradesh, and Delhi. Confirmed cases are high where the
population is high; for example, the population of Maharashtra is 123,144,223 (approx. 123 million).
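
To relate the Maharashtra totals to the size of its population, the figures quoted above can be expressed per million residents. This is a small worked calculation using the rounded counts from the paragraph, so the results are approximate.

```python
confirmed = 6_300_000       # approx. confirmed cases quoted above
deaths = 140_000            # approx. deaths quoted above
population = 123_144_223    # population figure quoted above

cases_per_million = confirmed / population * 1_000_000
deaths_per_million = deaths / population * 1_000_000

print(f"{cases_per_million:,.0f} confirmed cases per million residents")
print(f"{deaths_per_million:,.0f} deaths per million residents")
```

Per-capita rates make states of different sizes comparable, which raw totals do not.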

Figure 3: Pie chart

From the pie chart we can see the state-wise confirmed cases in India. Maharashtra has the
most confirmed cases compared to other states. Each state is indicated by a different shade.
Kerala, Karnataka, and Tamil Nadu also have many confirmed cases and rank among the top 4
states by confirmed cases.

Figure 4: Dumbbell chart

Analysing the impact of COVID-19 in different states across 2020 and 2021, the number of
confirmed cases increased dramatically in Maharashtra, to approximately 3,980k, the highest
average number of confirmed cases. States like Kerala, Chhattisgarh, and Punjab had fewer
cases in 2020 and saw an increase in 2021. Even though Kerala has a comparatively small
population, COVID-19 spread widely there due to the migration of people.

Figure 5: Geographical Heatmap

Maharashtra stands out in terms of average confirmed cases, with 62,29,596. The main
reason behind this large number is the population of the state and the district-wise
population density of each region. The population density of Mumbai, the capital city of
Maharashtra, is approximately 73,000 people per square mile, so the outcome is
proportional.

Figure 6: Line chart (Time Series)

Analysing the total recovered cases throughout India state-wise and comparing them with
the state-wise population, it is evident that states and union territories with a higher
population and population density have more cases, and that recovered cases are
proportional to confirmed cases. However, the observation shows that Odisha and
Chhattisgarh were not proportional in the comparison between confirmed and recovered
cases: Odisha has a 12.6% higher recovery rate than Chhattisgarh.

ANALYSING LOCKDOWN:

2020

Figure 7: Bar chart (Time Series)

Phase 1: 25 March 2020 – 14 April 2020 (21 days)

On 25 March, the first day of the lockdown, nearly all services and factories were
suspended. People were hurrying to stock essentials in some parts. Arrests were made
across the states for violating lockdown norms, such as venturing out without an
emergency, opening businesses, and home quarantine violations. The government held
meetings with e-commerce websites and vendors to ensure a seamless supply of essential
goods across the nation during the lockdown period.

Phase 2: 15 April 2020 – 3 May 2020 (19 days)

On 14 April, PM Modi extended the nationwide lockdown till 3 May, with a conditional
relaxation promised after 20 April for the regions where the spread had been contained by
then. He said that every town, every police station area, and every state would be
carefully evaluated to see if it had contained the spread. The areas that were able to do
so would be released from the lockdown on 20 April. If any new cases emerged in those
areas, lockdown could be reimposed. On 16 April, lockdown areas were classified as "red
zone", indicating the presence of infection hotspots, "orange zone", indicating some
infection, and "green zone", with no infections.

Phase 3: 4 May 2020 – 17 May 2020 (14 days)

On 1 May, the Ministry of Home Affairs (MHA) and the Government of India (GoI) further
extended the lockdown period by two weeks beyond 4 May, with some relaxations. The
country was split into 3 zones: red zones (130 districts), orange zones (284 districts),
and green zones (320 districts). Red zones are those with high coronavirus cases and a
high doubling rate, orange zones are those with comparatively fewer cases than red zones,
and green zones are those without any cases in the past 21 days.

Phase 4: 18 May 2020 – 31 May 2020 (14 days)

On 17 May, the National Disaster Management Authority (NDMA) and the Ministry of
Home Affairs (MHA) extended the lockdown for a period of two weeks beyond 18 May,
with additional relaxations. Unlike the previous extensions, states were given a larger
say in the demarcation of green, orange, and red zones and in the implementation roadmap.
Red zones were further divided into containment and buffer zones. The local bodies were
given the authority to demarcate containment and buffer zones.

Analysing the chart for the lockdown period (Mar-May 2020), there is no large increase in
confirmed cases during the lockdown, but after the lockdown ended (Jun-Dec 2020) the
number of cases became very large. So the conclusion is that cases were under control
during the lockdown, but rose sharply once the lockdown was lifted.
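
The comparison described above, average cases during the lockdown against the months after it, can be sketched in Pandas as follows. The daily series here is synthetic and steadily increasing, so it only demonstrates the mechanics; the real values would come from the cleaned dataset, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic daily new-case series for March to December 2020.
dates = pd.date_range("2020-03-01", "2020-12-31", freq="D")
new_cases = pd.Series(np.linspace(10, 3000, len(dates)).round(), index=dates)

# Lockdown phases 1 to 4 as listed above: 25 March to 31 May 2020.
start, end = pd.Timestamp("2020-03-25"), pd.Timestamp("2020-05-31")
in_lockdown = (new_cases.index >= start) & (new_cases.index <= end)

# Monthly averages, the quantity a monthly bar chart would display.
monthly_avg = new_cases.groupby(new_cases.index.to_period("M")).mean()

lockdown_avg = new_cases[in_lockdown].mean()
post_lockdown_avg = new_cases[new_cases.index > end].mean()

print(monthly_avg.round(1))
print(f"Lockdown average: {lockdown_avg:.1f}")
print(f"Post-lockdown average: {post_lockdown_avg:.1f}")
```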

Figure 8: Bar chart (Time Series)

5 April - 15 June 2021 (Lockdown Phase)

At the end of February 2021, India was hit by its largest COVID wave. It was noted that
people had become careless between November and April, not wearing masks and not
following social distancing. This wave caused a rapid surge in COVID cases and death
rates. Cases started to rise by March 2021, which resulted in state-wide lockdowns.
Maharashtra had a total of 4 lockdown phases from April to June.

Analysing the graph, the average number of confirmed cases increased during the lockdown
period from April to July. Since many people had to take a COVID test and wait about a
week for the results, the increase does not appear from April to June; it appears from
May to July. As for the effects of the lockdown, from July to November there is only a
slight increase in the average of confirmed cases, so the rise was brought under control
by the lockdown, and in September and October there are no visible changes.

5. Results and Discussion

From the above charts we can observe the impact of COVID-19 on Indian states. The most
affected states were Maharashtra, Kerala, Karnataka, Tamil Nadu, etc. Maharashtra
suffered the most during this period, but we can see that proper measures were taken and
the recovery rate increased. Even though the confirmed cases of COVID-19 were high, the
death rate is comparatively low and has been declining since the start of the COVID-19
period, so we can assume that there was good medical support.

6. Conclusion and Future Work

This research presented current trends of the COVID-19 outbreak in India from 31st Jan
2020 to 31st October 2021, visualized in different charts using the Tableau visualization
tool. The trajectory of the outbreak is also forecast using a time series analysis model.
COVID-19 is still an infectious disease with some unclear or unknown properties, which
means accurate SEIR prediction can only be obtained once the outbreak has been
successfully contained. The spread of the outbreak is largely influenced by each
country's policy and social responsibility. In a pandemic like this, providing timely
information to the public is paramount.
Future work includes analysing the district-wise effect of COVID-19 in each state to find
the reasons behind the most affected areas, and studying the role of vaccination in
reducing the impact of COVID-19 in each state.

References:
[1] Mapping the spread of COVID-19 outbreak in India (Vanshika Bidhan, Bhavini
Malhotra, Mansi Pandit and N. Latha)

[2] Impact of COVID-19 epidemic curtailment strategies in selected Indian states: An
analysis by reproduction number and doubling time with incidence modelling
(Arun Mitra, Abhijit P. Pakhare, Adrija Roy, Ankur Joshi)

[3] Statistical Explorations and Univariate Time series Analysis on COVID-19 Datasets to
Understand the Trend of Disease Spreading and Death.

[4] Prediction of new active cases of coronavirus disease (COVID-19) pandemic using
multiple linear regression model.

[5] Statistical Analysis of COVID cases in India.

[6] Statistical analysis and visualization of the potential cases of pandemic coronavirus.
