
Covid-19 Cases Prediction

Industrial/In-house Training REPORT

Submitted in partial fulfilment of the requirements for the award of the

degree of

BACHELOR OF TECHNOLOGY

in

ELECTRONICS & COMMUNICATION ENGINEERING


by

Anuja Rawat Gaurav Sharma Ronak Rana


00251207320 00551207320 01051207320

Guided by

Mr. Adgaonker Shashank


Innovians Technology

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING

BHARATI VIDYAPEETH’S COLLEGE OF ENGINEERING
(AFFILIATED TO GURU GOBIND SINGH INDRAPRASTHA UNIVERSITY, DELHI)
NEW DELHI – 110063
February 2022

1|Page
CANDIDATE’S DECLARATION

It is hereby certified that the work which is being presented in the B. Tech Industrial/In-house training
Report entitled "Covid-19 Cases Prediction" in partial fulfilment of the requirements for the award of the
degree of Bachelor of Technology and submitted in the Department of Electronics & Communication
Engineering of BHARATI VIDYAPEETH’S COLLEGE OF ENGINEERING, New Delhi (Affiliated to Guru Gobind
Singh Indraprastha University, Delhi) is an authentic record of our own work carried out during a period
from February 14th 2021 to March 25th 2021 under the guidance of Mr. Adgaonker Shashank, Innovians
Technology.
The matter presented in the B. Tech Industrial/In-house training Report has not been submitted by us
for the award of any other degree of this or any other Institute.

(Anuja Rawat) (Gaurav Sharma) (Ronak Rana)


(En. No: 00251207320) (En. No: 00551207320) (En. No: 01051207320)

This is to certify that the above statement made by the candidate is correct to the best of
my knowledge. He/She/They are permitted to appear in the External Industrial/In-house
training Examination

(Mr. Adgaonker Shashank) Prof. Kirti Gupta


Innovians Technology Head, ECE

The B. Tech Industrial/In-house training Viva-Voce Examination of Name of the Student


(Enrollment No: XXX), has been held on ……………………………….

Industrial/In-house training Coordinator (Signature of External Examiner)

ABSTRACT

Forecasting Covid-19 cases is now a valuable tool for improving healthcare accountability,
as it helps hospitals and governments prepare for better treatment and for a safer future
for the country. Covid-19 outbreaks not only harm people's lives but also have a severe
impact on the country's economy. The World Health Organization (WHO) declared the
outbreak a global health emergency on Jan. 30, 2020. More than 3 million individuals had
been infected by this virus by April 28, 2020, and there was no vaccine to prevent it.
The WHO issued certain safety guidelines, although they were primarily precautionary.

Information technology, with an emphasis on subjects like data science and machine
learning, can aid in the fight against the epidemic. It is critical to have early warning
systems in place that can predict how much a disease will harm society, so that
decisions can be made based on that information. In this project, we apply methods for
forecasting future cases from existing data: an ML technique predicts the number of
future active cases on the basis of historical active cases, deaths, and recoveries.

This report describes our use of a polynomial regression model to predict Covid-19
cases. By comparing its results with those of other regression models, we found
that the polynomial regression model gave the most accurate predictions.

ACKNOWLEDGMENT

We express our deep gratitude to Mr. Adgaonker Shashank, Innovians Technology, for
his valuable guidance and suggestions throughout our project work. We are thankful to
Dr S.B Kumar for his valuable guidance.

We would like to extend our sincere thanks to the Head of the Department, Prof. Kirti
Gupta, for her suggestions from time to time, which helped us complete our project work.
We are also thankful to Prof. Dharmender Saini, Principal, for providing us the facilities
to carry out our project work.

Sign Sign Sign


(Anuja Rawat) (Gaurav Sharma) (Ronak Rana)
(En. No: 00251207320) (En. No: 00551207320) (En. No: 01051207320)

TABLE OF CONTENTS

CANDIDATE DECLARATION
ABSTRACT
ACKNOWLEDGEMENT
TABLE OF CONTENTS

Chapter 1: Introduction

1.1 Machine learning


1.2 Dataset
1.3 Models
1.3.1 Linear Regression
1.3.2 Support Vector regression
1.3.3 Polynomial Regression

Chapter 2: Motivation
Chapter 3: Objective
Chapter 4: Workflow

4.1 Upload dataset


4.2 Data Analysis
4.3 Data Pre-Processing
4.4 Train-Test Split
4.5 Model accuracy
4.6 Testing Model
4.7 Prediction for future

Chapter 5: Results

5.1 Visual representation of Result

Chapter 6: Conclusion

LIST OF FIGURES

1. Machine Learning workflow.

2. Graphs showing Actual values, Predicted Values and difference between actual
and predicted values.

CHAPTER 1

INTRODUCTION

1.1 Machine Learning

Machine Learning (ML) is all about programming the unprogrammable. For example,
if you want to predict Covid-19 cases, ML can learn the prediction from past data.
The prediction of Covid-19 cases depends on various features, such as confirmed cases,
confirmed deaths, the rising trend in the data, the daily new cases reported, the
recovery rate, and other factors.

Traditionally, most insurance companies employ actuaries to calculate insurance
premiums. Actuaries are business professionals who use mathematics and statistics to
assess the risk of financial loss and predict the likelihood of an insurance premium and
claim, based on factors/features such as age and gender. They typically produce
an actuarial table that is provided to an insurance company’s underwriting
department, which uses it to set insurance premiums. The insurance company
calculates and writes all the programs itself, but this becomes much simpler with
Machine Learning.

Machine Learning workflow

There are three key tenets of the ML workflow:

• Prepare the data. Load the data from the database or CSV files.
Extract/identify the key features (input and output parameters) relevant to
the problem you will solve or the outcome you will predict.
• Build and train the ML model. Here you can evaluate different algorithms and
settings and see which model is best for your scenario.
• Consume the model. Once the model is ready, consume the model in your application.

Fig 1: Machine Learning Workflow.

1.2 About dataset

This dataset contains almost one year of Covid-19 case data. Each row corresponds to
one date and records the cumulative and daily counts of cases, deaths, and
recoveries listed below.

These are the contents of the dataset used:

1. date: date of the Covid-19 record

2. confirmed: total confirmed cases of Covid-19

3. death: total deaths up to the given date

4. recovered: number of patients recovered from Covid-19

5. active: number of cases currently active

6. new cases: new Covid-19 cases reported on a daily basis

7. new deaths: new Covid-19 deaths reported on a daily basis

8. new recovered: new patients recovered from Covid-19 on a daily basis

Since we are predicting covid-19 cases, new cases will be our target feature.
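
As an illustration of this layout, a minimal sketch of loading such a dataset with pandas is given below. The three sample rows, the underscore-style column names, and the helper name load_covid_data are assumptions made for the example; they are not taken from the actual dataset file.

```python
import io

import pandas as pd

# Hypothetical three-day sample; the numbers are made up for illustration only.
SAMPLE_CSV = """date,confirmed,deaths,recovered,active,new_cases,new_deaths,new_recovered
2020-03-01,100,2,10,88,20,1,5
2020-03-02,130,3,18,109,30,1,8
2020-03-03,175,5,30,140,45,2,12
"""


def load_covid_data(source):
    """Read the daily records, parse the date column, and sort by date."""
    df = pd.read_csv(source, parse_dates=["date"])
    return df.sort_values("date").reset_index(drop=True)


df = load_covid_data(io.StringIO(SAMPLE_CSV))

# Candidate input features and the target named in the text (new cases).
X = df[["confirmed", "deaths", "recovered", "active"]]
y = df["new_cases"]

print(df.dtypes)
```

In practice the io.StringIO object would be replaced by the path to the real CSV file.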

1.3 Regression Models

1.3.1 Linear Regression Model

Linear Regression is a machine learning algorithm based on supervised learning. It
performs a regression task: regression models a target prediction value based on
independent variables. It is mostly used for finding the relationship between
variables and for forecasting. Regression models differ in the kind of relationship
they assume between the dependent and independent variables and in the number of
independent variables used.

Linear regression predicts the value of a dependent variable (y) from a given
independent variable (x). This regression technique finds a linear relationship
between x (input) and y (output); hence the name Linear Regression.
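
A minimal sketch of fitting such a line by ordinary least squares with NumPy follows. The day numbers and case counts are made-up illustrative values, and fit_line is a hypothetical helper name, not code from the project.

```python
import numpy as np


def fit_line(x, y):
    """Least-squares estimates of slope w and intercept b for y ~ w*x + b."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix with one column for x and one column of ones for the intercept.
    A = np.column_stack([x, np.ones_like(x)])
    (w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return w, b


# Made-up data that lies exactly on the line y = 2x + 10.
days = [0, 1, 2, 3, 4]
cases = [10, 12, 14, 16, 18]

w, b = fit_line(days, cases)
prediction = w * 5 + b  # predicted value for day 5
print(w, b, prediction)
```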

1.3.2 Support vector Regression

Support Vector Machine (SVM) is a very popular Machine Learning algorithm that is
used in both Regression and Classification. Support Vector Regression (SVR) is similar
to Linear Regression in that the equation of the line is

y = wx + b

In SVR, this straight line is referred to as the hyperplane. The data points on either
side of the hyperplane that are closest to it are called Support Vectors, and they are
used to plot the boundary lines.

Unlike other regression models, which try to minimize the error between the real and
predicted values, SVR tries to fit the best line within a threshold value a (the distance
between the hyperplane and a boundary line). Thus, we can say that the SVR model tries
to satisfy the condition

-a < y - (wx + b) < a.

It uses the points within this boundary to predict the value.
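
The boundary condition can be illustrated with a short NumPy sketch. It does not train an SVR; it only evaluates, for a fixed and arbitrarily chosen w, b and a, which points fall inside the band and how far the others lie outside it (the epsilon-insensitive loss). All values and the function name are assumptions for illustration.

```python
import numpy as np


def epsilon_insensitive_loss(x, y, w, b, a):
    """Return 0 for points inside the band of half-width a, else the excess distance."""
    residual = y - (w * x + b)
    return np.maximum(0.0, np.abs(residual) - a)


# Made-up points and an arbitrary line y = 2x + 1 with threshold a = 0.5.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.2, 4.9, 8.0])
w, b, a = 2.0, 1.0, 0.5

loss = epsilon_insensitive_loss(x, y, w, b, a)
inside = loss == 0  # True where the point is not penalised

print(loss)
print(inside)
```

Points exactly on the boundary also receive zero loss here, so this check is slightly more permissive than the strict inequality in the text.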


1.3.3 Polynomial Regression

In statistics, polynomial regression is a form of regression analysis in which the
relationship between the independent variable x and the dependent variable y is
modelled as an nth degree polynomial in x. Polynomial regression fits a nonlinear
relationship between the value of x and the corresponding conditional mean of y,
denoted E(y | x). Although polynomial regression fits a nonlinear model to the data, as
a statistical estimation problem it is linear, in the sense that the regression function
E(y | x) is linear in the unknown parameters that are estimated from the data. For this
reason, polynomial regression is considered to be a special case of multiple linear
regression.

In the simple linear regression model y = β0 + β1x + e, for each unit increase in the
value of x, the conditional expectation of y increases by β1 units. In many settings,
such a linear relationship may not hold. For example, if we are modeling the yield of a
chemical synthesis in terms of the temperature at which the synthesis takes place, we
may find that the yield improves by increasing amounts for each unit increase in
temperature. In this case, we might propose a quadratic model of the form

y = β0 + β1x + β2x^2 + e

In this model, when the temperature is increased from x to x + 1 units, the expected
yield changes by β1 + β2(2x + 1). For infinitesimal changes in x, the effect on y is given
by the total derivative with respect to x: β1 + 2β2x. The fact that the change in yield
depends on x is what makes the relationship between x and y nonlinear even though the
model is linear in the parameters to be estimated.
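
The unit-change expression for the quadratic model follows from subtracting the expected yield at x from the expected yield at x + 1. The derivation below is a worked check; it uses the symbols of the quadratic model and assumes the error term e has zero mean.

```latex
\begin{aligned}
E(y \mid x+1) - E(y \mid x)
  &= \bigl[\beta_0 + \beta_1 (x+1) + \beta_2 (x+1)^2\bigr]
   - \bigl[\beta_0 + \beta_1 x + \beta_2 x^2\bigr] \\
  &= \beta_1 + \beta_2 \bigl[(x+1)^2 - x^2\bigr] \\
  &= \beta_1 + \beta_2 (2x + 1), \\[4pt]
\frac{d}{dx}\, E(y \mid x) &= \beta_1 + 2 \beta_2 x .
\end{aligned}
```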

In general, we can model the expected value of y as an nth degree polynomial, yielding
the general polynomial regression model

y= β0+β1x+β2x^2+β3x^3+......+βnx^n+e

Conveniently, these models are all linear from the point of view of estimation, since the
regression function is linear in terms of the unknown parameters β0, β1, .... Therefore,
for least squares analysis, the computational and inferential problems of polynomial
regression can be completely addressed using the techniques of multiple regression.
This is done by treating x, x^2, ... as being distinct independent variables in a multiple
regression model.
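
The idea of treating the powers of x as separate independent variables can be sketched with NumPy as follows. The data are synthetic values generated from a known quadratic, and fit_polynomial and predict are hypothetical helper names, not the code used in the project.

```python
import numpy as np


def fit_polynomial(x, y, degree):
    """Fit y ~ b0 + b1*x + ... + bn*x^n by ordinary least squares."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Each column of X is one "independent variable": 1, x, x^2, ..., x^degree.
    X = np.vander(x, degree + 1, increasing=True)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def predict(beta, x):
    """Evaluate the fitted polynomial at new x values."""
    X = np.vander(np.asarray(x, dtype=float), len(beta), increasing=True)
    return X @ beta


# Synthetic data from y = 1 + 2x + 3x^2 (no noise), for illustration only.
x = np.arange(6)
y = 1 + 2 * x + 3 * x**2

beta = fit_polynomial(x, y, degree=2)
future = predict(beta, [6, 7])  # extrapolate two steps beyond the data

print(beta)
print(future)
```

Extrapolating a polynomial far beyond the observed range can diverge quickly, so forecasts of this kind are usually limited to a short horizon.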

CHAPTER 2

MOTIVATION

Forecasting people's healthcare costs is now a valuable tool for improving healthcare
accountability. The healthcare sector produces a very large amount of data related
to patients, diseases, and diagnoses, but because this data has not been analyzed
properly, it does not deliver the value it holds, including for estimating patient
healthcare costs.

A health insurance policy is a policy that covers or minimizes the expenses of losses
caused by a variety of hazards. A variety of factors influence the cost of insurance or
healthcare. For a variety of stakeholders and health departments, accurately predicting
individual healthcare expenses using prediction models is critical. Accurate cost
estimates can help health insurers and, increasingly, healthcare delivery organizations
to plan for the future and prioritize the allocation of limited care management
resources. Furthermore, knowing their probable future expenses ahead of time can
help patients choose insurance plans with appropriate deductibles and
premiums. These elements play a role in the development of insurance policies.

In the insurance sector, ML can help enhance the efficiency of policy wording. In
healthcare, ML algorithms are particularly good at predicting high-cost, high-need
patient expenditures. ML can be categorized into three different types, as shown in
the following Figure. These types are supervised machine learning (i.e., a task-driven
approach) used for classification/regression and all data labelled; unsupervised
machine learning (i.e., a data-driven approach) used for clustering and all data
unlabeled; and reinforcement learning (i.e., learning from mistakes) used for decision
making.

CHAPTER 3

OBJECTIVE

The objective is to train an ML polynomial regression model that can predict rising
Covid-19 active cases more accurately. Since this is a regression problem, metrics such
as the coefficient of determination and the mean absolute error are used to evaluate the
model.
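
For reference, both metrics can be computed directly from their standard definitions, as in the sketch below. The actual and predicted values are made-up numbers, not results from the project.

```python
import numpy as np


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Made-up actual and predicted values for illustration.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(mae, r2)
```

A coefficient of determination close to 1 and a small mean absolute error both indicate a closer fit.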

CHAPTER 4

WORKFLOW

4.1 Upload dataset

Information about dataset

Categorical features are :-


1. Confirmed
2. Deaths
3. Active

Checking for Null Values

After inspecting the heatmap, we found that there are no null values in our dataset.
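
A null check of this kind can also be done numerically with pandas, as sketched below. The small DataFrame and its values are hypothetical; the heatmap code used in the report is not reproduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with no missing values.
df = pd.DataFrame(
    {
        "confirmed": [100, 130, 175],
        "deaths": [2, 3, 5],
        "active": [88, 109, 140],
    }
)

null_counts = df.isnull().sum()              # missing values per column
has_nulls = bool(df.isnull().values.any())   # True if any cell is missing

# For contrast: the same data with one value deliberately removed.
df_with_gap = df.astype(float)
df_with_gap.loc[1, "deaths"] = np.nan
gap_counts = df_with_gap.isnull().sum()

print(null_counts)
print(has_nulls)
print(gap_counts)
```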

Data Analysis

Statistical measures of the dataset

Data Analysis

Distribution of Age Value

Distribution of Gender Column

BMI Distribution

4.2 Data Pre-Processing

4.3 Train-Test Split
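
The report does not state how the split was made. As one possible approach for time-ordered data, the sketch below keeps the earliest records for training and the most recent ones for testing; the function name and the 20% test fraction are assumptions, not the project's settings.

```python
def chronological_split(rows, test_fraction=0.2):
    """Split time-ordered rows into (train, test) without shuffling."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    # Hold out at least one row, taken from the end of the series.
    n_test = max(1, int(len(rows) * test_fraction))
    cut = len(rows) - n_test
    return rows[:cut], rows[cut:]


# Ten placeholder records standing in for ten consecutive days.
records = list(range(10))
train, test = chronological_split(records, test_fraction=0.2)
print(train, test)
```

Keeping the test set at the end of the series means the model is evaluated on dates it has not seen, which mirrors how a forecast is used.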

4.4 Model Accuracy

4.5 Testing Model

4.6 Prediction for future

CHAPTER 5

RESULTS

5.1 Visual Representation of result

The graphs show the actual and predicted values for four different models.

Fig. 2: Graphs showing Actual and Predicted Values and future prediction for active cases.

CHAPTER 6

CONCLUSION

Machine learning (ML) is one aspect of computational intelligence that can solve different problems in
a wide range of applications and systems when it comes to leveraging historical data. Predicting
medical insurance costs is still a problem in the healthcare industry that needs to be investigated and
improved. In this project, by using a set of ML algorithms, a computational intelligence approach is
applied to predict healthcare insurance costs. The medical insurance dataset was obtained from the
Kaggle repository and was utilized for training and testing the Linear Regression, Support Vector
Regression, Gradient Boosting, and Random Forest Regressor models. The regression on this dataset
followed the steps of preprocessing, feature engineering, data splitting, regression, and evaluation.
After the evaluation, it was observed that the Gradient Boosting model gave better accuracy and a lower
mean absolute error.
