
EARLY AGE DETECTION OF

SEVERE COVID-19 PATIENTS

A project report submitted in partial fulfillment of the
requirements for B.Tech. Project

by

Harsh Walia (2018IMT-036)

Under the Supervision of

Dr. Jeevaraj S

ABV INDIAN INSTITUTE OF
INFORMATION TECHNOLOGY AND
MANAGEMENT
CANDIDATE'S DECLARATION
I hereby certify that the work presented in this report, entitled
EARLY AGE DETECTION OF SEVERE COVID-19 PATIENTS, in partial
fulfillment of the requirement for the award of the Degree of
Bachelor of Technology and submitted to the institution, is an
authentic record of my own work carried out during the period June
2021 to October 2021 under the supervision of Dr. Jeevaraj S. I have
also cited the references for the text(s)/figure(s)/table(s) taken
from other sources.

Date: Signatures of the Candidates

This is to certify that the above statement made by the candidates is
correct to the best of my knowledge.

Date: Signatures of the Research Supervisors

ACKNOWLEDGEMENTS
I am highly indebted to Dr. Jeevaraj S and obliged to him for giving
me the autonomy to function and experiment with ideas. I would like
to take this opportunity to express my profound gratitude to him, not
only for his academic guidance but also for his personal interest in
my project and his constant support, coupled with confidence-boosting
and motivating sessions that proved very fruitful and were
instrumental in infusing self-assurance and trust within me. The
nurturing and blossoming of the present work is mainly due to his
valuable guidance, suggestions, astute judgment, constructive
criticism, and eye for perfection. My mentor always answered my
myriad doubts with smiling graciousness and prodigious patience,
never letting me feel like a novice, always lending an ear to my
views, appreciating and improving them, and giving me a free hand in
my project. It is only because of his overwhelming interest and
helpful attitude that the present work has attained the stage it has.
Finally, I am grateful to my Institution, whose constant
encouragement served to renew my spirit, refocus my attention and
energy, and helped me in carrying out this work.

(Harsh Walia)

Contents

1 ABSTRACT

2 INTRODUCTION
2.1 BACKGROUND MOTIVATION
2.2 PROJECT OBJECTIVES
2.3 LITERATURE SURVEY

3 METHODOLOGY
3.1 BLOCK DESIGN DIAGRAM
3.2 SYSTEM ARCHITECTURE
3.3 IMPLEMENTATION
3.3.1 TOOLS AND LIBRARIES USED

4 RESULTS (Progress Made so far)

5 TASK TO BE COMPLETED

6 GANTT CHART

7 REFERENCES

1 ABSTRACT
COVID-19 is the disease caused by the virus subsequently named
SARS-CoV-2. The first human cases were found in Wuhan City, China, in
December 2019, and the World Health Organization (WHO) declared the
outbreak a pandemic on 11 March 2020. In this study, our primary aim
is to detect severe COVID-19 patients in the early stages by looking
at information on demographics, comorbidities, admission laboratory
values, admission medications, admission supplemental oxygen orders,
discharge, and mortality. The dataset covers 4711 patients with
confirmed SARS-CoV-2 infection. We filtered the top features out of
the 85 features in the dataset using seven different feature
selection algorithms and kept the features most commonly selected
across them. After selecting the most important features, we applied
around 17 different machine learning models, such as Linear
Regression, Logistic Regression, SVM, K-Means, XGBoost, Random
Forest, Decision Tree Classifier, and neural networks, and compared
them using different metrics to obtain effective results. The
resulting model can be deployed and used by hospitals to predict the
severity of COVID-19 patients.

2 INTRODUCTION
The novel coronavirus saw its first case in December 2019, and the
number of cases has kept rising since. Many people lost their lives
in the first wave of COVID-19, and the number of deaths increased in
the second wave. In this research, we therefore aim to build a robust
and efficient model that predicts the severity of COVID-19 patients
in the early stages from information on demographics, comorbidities,
admission laboratory values, admission medications, admission
supplemental oxygen orders, discharge, and mortality. The dataset
provides 85 features, only some of which are important for this task.
We therefore use several feature selection algorithms to select the
best features for predicting patient mortality.
After shortlisting the best features, we use them to predict the
severity of COVID-19 patients. Different machine learning models are
used to identify patients who are at high risk of mortality. The
models to be used in this research are Linear Regression, Logistic
Regression, SVM, Linear SVC, Naive Bayes, k-Nearest Neighbours,
Neural Network with Keras, Stochastic Gradient Descent, Gradient
Boosting Classifier, RidgeCV, Bagging Classifier, Decision Tree
Classifier, Random Forest Classifier, AdaBoost Classifier,
XGBClassifier, LGBM Classifier, ExtraTrees Classifier, Gaussian
Process Classification, MLP Classifier, and Voting Classifier.
Finally, we ensemble the most accurate of these models to predict the
final result with good accuracy.

2.1 BACKGROUND MOTIVATION


On 12 January 2020, the World Health Organization (WHO) confirmed
that a novel coronavirus was the cause of a respiratory illness in a
cluster of people in Wuhan, China, which had been reported to the WHO
on 31 December 2019. After that, the number of COVID-19 cases
increased exponentially, threatening to overwhelm health systems
around the world with a demand for ICU beds far above the existing
capacity.
The first case of COVID-19 in India, which originated from China, was
reported on 30 January 2020. From then to the time of writing, about
3.01 crore cases and 3.92 lakh deaths have been recorded in the
country. Our motivation behind the project is to help overwhelmed
hospitals by predicting severe COVID-19 patients in the early stages
on the basis of their comorbidities, admission laboratory test
values, admission medications, and admission supplemental oxygen
orders.
By knowing how many patients may require an Intensive Care Unit (ICU)
in the future, hospitals can arrange ICU beds accordingly, which can
save patients' lives. Patients who are not severely affected by
COVID-19 and do not need ICU support can instead go into home
quarantine.

2.2 PROJECT OBJECTIVES


The main objectives of this project are:
• To predict the risk of mortality in COVID-19 patients in the early
stages by looking at their admission laboratory test values,
comorbidities, admission medications, and admission supplemental
oxygen orders.
• To help doctors and patients by identifying severe COVID-19
patients in the early stages, so that resources such as oxygen
cylinders and ICU beds can be arranged according to future
requirements.

2.3 LITERATURE SURVEY


Some research work already done on predicting severity among COVID-19
patients is presented here:
1. Explainable Machine Learning for Early Assessment of
COVID-19 Risk Prediction in Emergency Departments
• The paper presents a machine learning-based computational method
for early and rapid risk prediction in COVID-19 patients that can
be readily installed and used in emergency rooms.
• The authors describe a computerised system that aims to extract
the most relevant radiological, clinical, and laboratory variables
for improving patient risk prediction, as well as an explainable
machine learning system that may provide clinicians with simple
decision criteria to use as a support for assessing patient risk.
2. Early prediction keys for COVID-19 cases progression:
A meta-analysis
• The meta-analysis searched many databases for relevant articles
on biomarker values and major risk factors that predict
progression from mild to moderate to severe and critical cases.
• It compares COVID-19 patients with severe versus mild disease on:
(A) mean age, (B) albumin level, (C) aspartate aminotransferase,
(D) creatinine, (E) C-reactive protein, (F) D-dimer,
(G) interleukin-6, (H) LDH, (I) lymphocytes, (J) neutrophil count,
(K) %PD-1 expression on T cells, (L) cortisol, (M) hypertension,
(N) diabetes, and (O) chronic obstructive lung disease.
3. Development and validation of a laboratory risk score
for the early prediction of COVID-19 severity and in-
hospital mortality
• The goal of this study was to generate a scoring system for
identifying high-risk people, validate it in a different sample,
and assess its accuracy in predicting in-hospital mortality.
• Methods: Biological data from 330 SARS-CoV-2 infected individuals
were used to construct a risk score that might predict severity
progression in this cohort study. The score was then validated
using data from 240 more COVID-19 participants in a second step.
The area under the receiver operating characteristic curve was
used to determine the score's accuracy.

3 METHODOLOGY
3.1 BLOCK DESIGN DIAGRAM

3.2 SYSTEM ARCHITECTURE


The architecture of the project is divided into various sub-tasks.
The sub-tasks performed in this project are listed below:
1. Data pre-processing and Exploratory Data Analysis
• About the data: The data file contains information on
demographics, comorbidities, admission laboratory values, admission
medications, admission supplemental oxygen orders, discharge, and
mortality. The data relate to COVID-19 patients admitted to a single
healthcare system over a specific period of time, separated into the
first 3 weeks and the second 3 weeks of the pandemic.
• Exploratory Data Analysis is performed to better understand and
visualize the data. Each feature of the dataset is visualized to
gain better insight into the data.
2. Feature Engineering
• Some features of the dataset, such as Age, have the object
datatype (they are stored as strings), so we converted them with
target encoding before passing them to the model.
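
As an illustration only, target encoding can be sketched as replacing each category with the mean of the label for that category. The column names `age_band` and `died` and the toy values are hypothetical stand-ins, not the columns of the actual dataset.

```python
import pandas as pd


def target_encode(frame: pd.DataFrame, column: str, target: str) -> pd.Series:
    """Replace each category in `column` with the mean of `target` for that category."""
    category_means = frame.groupby(column)[target].mean()
    return frame[column].map(category_means)


# Toy data (hypothetical): string-valued age bands and a binary mortality label.
toy = pd.DataFrame({
    "age_band": ["<40", "<40", "40-65", "40-65", ">65", ">65"],
    "died": [0, 0, 0, 1, 1, 1],
})
toy["age_band_encoded"] = target_encode(toy, "age_band", "died")
print(toy)
```

In practice the category means would normally be computed on the training split only and then applied to the test split, so that test labels do not leak into the features.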
3. Feature Selection
• The dataset contains a large number of features (85). Our task is
to find the most important and relevant features among them. Seven
feature selection algorithms are used for this. Each algorithm has
its own criterion for finding the best features, so we keep the
features most commonly selected across all of them.
The 7 feature selection algorithms are described below:
3.1 FS with the Pearson correlation :
• Highly correlated features are close to linearly dependent and
hence have roughly the same influence on the dependent variable.
When two features are strongly correlated, one of them can be
dropped.
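
A minimal sketch of this idea, assuming a numeric pandas DataFrame and an arbitrary cut-off of 0.9 (the report does not state the threshold that was used). The helper `drop_correlated` and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd


def drop_correlated(frame: pd.DataFrame, threshold: float = 0.9):
    """Drop the later column of every pair whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop


rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": 2 * a + 0.01 * rng.normal(size=200),  # almost a linear copy of "a"
    "b": rng.normal(size=200),                      # independent of "a"
})
reduced, dropped = drop_correlated(toy)
print("dropped:", dropped)
```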
3.2 FS by the SelectFromModel with LinearSVC :
• SelectFromModel is a meta-transformer that can be used with any
estimator that assigns an importance to each feature through a
specific attribute (such as coef_ or feature_importances_).
• LinearSVC (Linear Support Vector Classification) is similar to
SVC with the parameter kernel='linear', but it is implemented in
terms of liblinear rather than libsvm, so it has more flexibility in
the choice of penalties and loss functions and scales better to
large numbers of samples.
3.3 FS by the SelectFromModel with Lasso :
• The Lasso is a linear model that estimates sparse coefficients.
It is useful here because it prefers solutions with fewer non-zero
coefficients, effectively reducing the number of features on which
the solution depends.
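
The two SelectFromModel variants above can be sketched as follows. This is only an illustration on synthetic data from `make_classification`; the regularisation strengths `C=0.05` and `alpha=0.05` are assumptions, not values taken from the project.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the patient table (assumption: 20 features, 4 informative).
X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           n_redundant=0, n_clusters_per_class=1, random_state=0)
X = StandardScaler().fit_transform(X)

# L1-penalised estimators push the coefficients of weak features to exactly zero.
estimators = {
    "linear_svc": LinearSVC(C=0.05, penalty="l1", dual=False,
                            max_iter=5000, random_state=0),
    "lasso": Lasso(alpha=0.05),
}

selected = {}
for name, estimator in estimators.items():
    selector = SelectFromModel(estimator).fit(X, y)
    # Column indices whose coefficient magnitude passed the selector's threshold.
    selected[name] = selector.get_support(indices=True)
    print(name, "kept", len(selected[name]), "of", X.shape[1], "features")
```

Stronger regularisation (smaller `C`, larger `alpha`) keeps fewer features, so these values would need tuning on the real data.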
3.4 FS by the SelectKBest with Chi-2 :
• SelectKBest keeps the K features with the highest scores under a
scoring function, here the chi-squared statistic between each
non-negative feature and the class label.
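
A short sketch of SelectKBest with the chi-squared score, again on synthetic data. The value `k=5` is an assumption, and the features are rescaled to [0, 1] first because `chi2` rejects negative inputs.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           n_redundant=0, random_state=0)

# chi2 requires non-negative feature values, so rescale each column to [0, 1].
X_nonneg = MinMaxScaler().fit_transform(X)

selector = SelectKBest(score_func=chi2, k=5).fit(X_nonneg, y)
top_idx = selector.get_support(indices=True)   # indices of the 5 best-scoring columns
X_top = selector.transform(X_nonneg)
print("selected columns:", top_idx)
```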
3.5 FS by the Recursive Feature Elimination with Logistic
Regression :
• Recursive Feature Elimination (RFE) is a greedy optimization
method that seeks the best-performing feature subset. It repeatedly
builds a model and sets aside the best or worst performing feature
at each iteration, then builds the next model from the remaining
features until all features are used up. The features are then
ranked in the order of their elimination.
3.6 FS by the Recursive Feature Elimination with Random
Forest :
• Here Recursive Feature Elimination uses a Random Forest as the
estimator to recursively select the best features.
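
As a hedged sketch of Recursive Feature Elimination with scikit-learn's `RFE`, which removes the least important feature at each step: the number of features to keep (5) and the synthetic data are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           n_redundant=0, random_state=0)

# step=1 removes one feature per iteration, refitting the model each time.
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=5, step=1).fit(X, y)

# ranking_ is 1 for kept features; larger numbers were eliminated earlier.
print("kept columns:", [i for i, keep in enumerate(rfe.support_) if keep])
print("ranking:", rfe.ranking_)
```

Passing `RandomForestClassifier()` as the `estimator` gives the Random Forest variant, since RFE only needs the estimator to expose `coef_` or `feature_importances_`.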
3.7 FS by the Variance Threshold
• Feature selector that removes all low-variance features.
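
The step of keeping the features most commonly chosen by the selectors can be sketched as a simple vote count. The feature names, the selector outputs, and the `min_votes` cut-off below are all hypothetical.

```python
from collections import Counter


def most_common_features(selections, min_votes):
    """Return features chosen by at least `min_votes` of the selection methods."""
    votes = Counter(
        name
        for chosen in selections.values()
        for name in set(chosen)          # each method votes at most once per feature
    )
    return sorted(name for name, count in votes.items() if count >= min_votes)


# Hypothetical output of three of the selectors.
selections = {
    "pearson": ["age", "crp", "d_dimer"],
    "lasso": ["age", "crp"],
    "rfe_logistic": ["age", "ferritin"],
}
shortlist = most_common_features(selections, min_votes=2)
print(shortlist)
```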

4. Training Our Model :
To train the model, I implemented different models and compared
their AUC-ROC scores to find the best one.
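
A minimal sketch of this comparison, assuming a held-out test split and two example models on synthetic data. The split ratio and model settings are illustrative assumptions, not the configuration used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # AUC-ROC is computed from the predicted probability of the positive class.
    scores[name] = {
        "train": roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1]),
        "test": roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]),
    }

best = max(scores, key=lambda name: scores[name]["test"])
for name, s in scores.items():
    print(f"{name}: train AUC={s['train']:.3f}, test AUC={s['test']:.3f}")
print("best on test:", best)
```

Reporting the train and test scores side by side makes overfitting visible as a large gap between the two.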

4.1 Linear Regression


• Linear Regression is a supervised learning algorithm which
assumes that the independent and dependent features have a linear
relationship. The algorithm predicts the result by fitting the
best line to the data. It uses the least-squares loss function,
which can be minimized with gradient descent.
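
To make the least-squares loss and gradient descent concrete, here is a small NumPy sketch on synthetic data. The learning rate, iteration count, and true coefficients (intercept 3, slope 2) are assumptions chosen for the example.

```python
import numpy as np


def fit_linear_gd(X, y, lr=0.1, n_iter=2000):
    """Fit y ~ w0 + X @ w by gradient descent on the mean squared error."""
    Xb = np.c_[np.ones(len(X)), X]           # prepend a column of ones for the intercept
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        residual = Xb @ w - y
        grad = 2.0 / len(y) * Xb.T @ residual  # gradient of mean((Xb @ w - y) ** 2)
        w -= lr * grad
    return w


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 + 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)

w = fit_linear_gd(X, y)
print("intercept, slope:", w)
```

For a problem this small the closed-form least-squares solution gives the same line; gradient descent is shown only because the text mentions it.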

4.2 Logistic Regression


• Logistic Regression is a supervised learning classification
algorithm which predicts the probability of an event occurring.
It uses binary cross-entropy as the loss function, minimizes the
loss using gradient descent, and finds the best line dividing the
2 classes.
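
The binary cross-entropy loss mentioned above can be written out as a short NumPy sketch. The function names and the clipping constant `eps` are choices made for this example.

```python
import numpy as np


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def binary_cross_entropy(y_true, p, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))


y_true = [0, 1, 1, 0]
confident_right = sigmoid(np.array([-4.0, 4.0, 4.0, -4.0]))
confident_wrong = sigmoid(np.array([4.0, -4.0, -4.0, 4.0]))
print(binary_cross_entropy(y_true, confident_right))
print(binary_cross_entropy(y_true, confident_wrong))
```

The loss is small when the predicted probabilities agree with the labels and grows quickly for confident mistakes, which is what gradient descent exploits during training.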

4.3 Support Vector Machines


• SVM is a supervised machine learning algorithm which can be
used for classification or regression problems. It draws a
hyperplane between the classes to separate them. The hyperplane is
chosen to have the maximum margin from the support vectors, the
training points closest to it.

4.4 Linear SVC
• LinearSVC (Linear Support Vector Classification) is similar to
SVC with the parameter kernel='linear', but it is implemented in
terms of liblinear rather than libsvm, so it has more flexibility in
the choice of penalties and loss functions and scales better to
large numbers of samples.
This class can handle both sparse and dense input, and multiclass
support is handled according to a one-vs-the-rest scheme.

4.5 MLP Classifier


• The Multi-layer Perceptron (MLP) is a supervised learning method
that trains on a dataset to learn a function
$f: \mathbb{R}^m \rightarrow \mathbb{R}^o$, where $m$ is the number
of input dimensions and $o$ is the number of output dimensions.
Given a set of features and a target, it can learn a non-linear
function approximator for classification or regression. It differs
from logistic regression in that one or more non-linear layers,
known as hidden layers, can exist between the input and output
layers.
4.6 Decision Tree Classifier
• Decision Trees are a supervised learning algorithm used for both
classification and regression tasks. The algorithm divides the
population into 2 or more sub-samples that are more homogeneous
and have less impurity. A node is split into sub-nodes according
to a criterion such as the Gini index, entropy, or reduction in
variance.
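
The two classification criteria named above can be illustrated with a few lines of plain Python. The helper names and the toy labels are hypothetical; a tree would pick the split with the lowest weighted impurity.

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Fraction of samples in each class present in `labels`."""
    counts = Counter(labels)
    return [count / len(labels) for count in counts.values()]


def gini(labels):
    """Gini index: 1 - sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: -sum of p * log2(p)."""
    return -sum(p * log2(p) for p in class_proportions(labels))


def weighted_impurity(left, right, impurity=gini):
    """Impurity of a split, weighting each child by its share of the samples."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)


print(gini([0, 0, 1, 1]), entropy([0, 0, 1, 1]))   # maximally mixed node
print(weighted_impurity([0, 0], [1, 1]))            # perfect split
print(weighted_impurity([0, 1], [0, 1]))            # uninformative split
```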

3.3 IMPLEMENTATION
The various machine learning models and feature selection algorithms
are implemented using several ML libraries.

3.3.1 TOOLS AND LIBRARIES USED


1. NUMPY- The NumPy library is used in the project for its large
collection of high-level mathematical functions that operate on
arrays.
2. MATPLOTLIB- The Matplotlib library is used to plot images and
graphs.
3. SkLearn- Sklearn is the library that contains the various ML
algorithms used in the project, such as Linear Regression,
Logistic Regression, MLPClassifier, SVM, and Decision Tree.
4. AutoViz- AutoViz (AutoVisualization) is a Python library which
can automate the whole process of data visualization in a single
line of code.
5. PANDAS- Pandas is a software library written for the Python
programming language for data manipulation and analysis. In
particular, it offers data structures and operations for
manipulating numerical tables and time series.

4 RESULTS (Progress Made so far)
The training and test AUC (Area Under the Curve) scores of the
various models are given in the table below.

5 TASK TO BE COMPLETED
• Some more advanced models are yet to be implemented, such as
Random Forest, AdaBoost, Gradient Boost, XGBoost, LightGBM, Ridge
Classifier, BaggingClassifier, Extra Trees Classifier, k-Nearest
Neighbors (KNN), Naive Bayes, and a neural network with Keras.
• Finally, a Voting Classifier will be built as an ensemble of the
top machine learning models.
• Different hyperparameter tuning methods, such as GridSearchCV and
RandomizedSearchCV, will be used when training the models.
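
As a sketch of how the planned ensemble and tuning steps might fit together, the example below wraps two models in a soft-voting `VotingClassifier` and tunes it with `GridSearchCV`. The choice of member models, the parameter grid, and the synthetic data are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Soft voting averages the predicted class probabilities of the members.
voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)

# Member hyperparameters are addressed as <member name>__<parameter>.
param_grid = {"lr__C": [0.1, 1.0], "rf__max_depth": [3, None]}
search = GridSearchCV(voter, param_grid, scoring="roc_auc", cv=3).fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated AUC:", round(search.best_score_, 3))
```

`RandomizedSearchCV` takes the same estimator and samples a fixed number of combinations instead of trying all of them, which helps when the grid is large.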

6 GANTT CHART

7 REFERENCES
[1] Early prediction keys for COVID-19 cases progression: A
meta-analysis
https://doi.org/10.1016/j.jiph.2021.03.001

[2] Development and validation of a laboratory risk score for the early
prediction of COVID-19 severity and in-hospital mortality
https://doi.org/10.1016/j.iccn.2021.103012

[3] Explainable Machine Learning for Early Assessment of COVID-19
Risk Prediction in Emergency Departments
https://ieeexplore.ieee.org/document/9239931
https://doi.org/10.1109/ACCESS.2020.3034032

[4] Diagnostic utility of C-reactive protein to albumin ratio as an early
warning sign in hospitalized severe COVID-19 patients
https://doi.org/10.1016/j.intimp.2020.107285

[5] Development and Validation of a Web-Based Severe COVID-19
Risk Prediction Model
https://doi.org/10.1016/j.amjms.2021.04.001

[6] Applicability of MuLBSTA scoring system as diagnostic and
prognostic role in early warning of severe COVID-19
https://doi.org/10.1016/j.micpath.2020.104706

[7] Missing Data in Clinical Research: A Tutorial on Multiple
Imputation
https://doi.org/10.1016/j.cjca.2020.11.010
