Professional Documents
Culture Documents
Supervised Machine Learning Models For Prediction of COVID 19 Infection Using Epidemiology Dataset
Supervised Machine Learning Models For Prediction of COVID 19 Infection Using Epidemiology Dataset
https://doi.org/10.1007/s42979-020-00394-7
ORIGINAL RESEARCH
Received: 26 October 2020 / Accepted: 5 November 2020 / Published online: 27 November 2020
© Springer Nature Singapore Pte Ltd 2020
Abstract
COVID-19 or 2019-nCoV is no longer pandemic but rather endemic, with more than 651,247 people around world having
lost their lives after contracting the disease. Currently, there is no specific treatment or cure for COVID-19, and thus living
with the disease and its symptoms is inevitable. This reality has placed a massive burden on limited healthcare systems
worldwide especially in the developing nations. Although neither an effective, clinically proven antiviral agents’ strategy
nor an approved vaccine exist to eradicate the COVID-19 pandemic, there are alternatives that may reduce the huge burden
on not only limited healthcare systems but also the economic sector; the most promising include harnessing non-clinical
techniques such as machine learning, data mining, deep learning and other artificial intelligence. These alternatives would
facilitate diagnosis and prognosis for 2019-nCoV pandemic patients. Supervised machine learning models for COVID-19
infection were developed in this work with learning algorithms which include logistic regression, decision tree, support
vector machine, naive Bayes, and artificial neutral network using epidemiology labeled dataset for positive and negative
COVID-19 cases of Mexico. The correlation coefficient analysis between various dependent and independent features was
carried out to determine a strength relationship between each dependent feature and independent feature of the dataset prior
to developing the models. The 80% of the training dataset were used for training the models while the remaining 20% were
used for testing the models. The result of the performance evaluation of the models showed that decision tree model has the
highest accuracy of 94.99% while the Support Vector Machine Model has the highest sensitivity of 93.34% and Naïve Bayes
Model has the highest specificity of 94.30%.
1
* L. J. Muhammad Department of Mathematics and Computer Science, Faculty
lawan.jibril@fukashere.edu.ng of Science, Federal University of Kashere, P.M.B. 0182,
Gombe, Nigeria
Ebrahem A. Algehyne
2
e.algehyne@ut.edu.sa Department of Mathematics, University of Tabuk,
Tabuk 71491, Saudi Arabia
Sani Sharif Usman
3
ssu992@fukashere.edu.ng Department of Biological Sciences, Faculty of Science,
Federal University of Kashere, P.M.B. 0182, Gombe, Nigeria
Abdulkadir Ahmad
4
aayola99@gmail.com Department of Computer Science, Kano University
of Science and Technology, Wudil, Kano, Nigeria
Chinmay Chakraborty
5
cchakrabarty@bitmesra.ac.in Department of Electronics and Communication Engineering,
Birla Institute of Technology, Ranchi, Jharkhand, India
I. A. Mohammed
6
ibrahimsallau@gmail.com Computer Science Department, Yobe StateUniversity,
Damaturu, Yobe State, Nigeria
SN Computer Science
Vol.:(0123456789)
11
Page 2 of 13 SN Computer Science (2021) 2:11
SN Computer Science
SN Computer Science (2021) 2:11 Page 3 of 13 11
Fig. 1 Essential learning
process for the development of
predictive models
SN Computer Science
11
Page 4 of 13 SN Computer Science (2021) 2:11
and Linear Regression Models were used to estimate the Reference [18] presented a data-driven ML approach for
number of 2019-nCoV positive cases. The models were the analysis of the 2019-nCoV pandemic from its early
evaluated with root mean square error (RMSE) metric and infection dynamics especially inflation counter over time,
10 folds cross-validation techniques, respectively.. The using US data starting from the first case on 20th January
RMSE of long short-term memory and linear regression 2020. The actionable public health insight was extracted
Models were 27.187 and 7.562, respectively. Moreover, which includes infectious force, rate of the mild infection
the study predicted the trend of the 2019-nCoV outbreak. becoming serious, estimates for asymptomatic infections
Such predictions can support healthcare managers and and prediction of new infections over time. The approach
policy makers with planning, allocating and deploying revealed a very significant number of cases of asympto-
healthcare resources effectively. Reference [9] identified matic infections of pandemic 2019-nCoV, a lag of about
an intrinsic 2019-nCoV genomic signature using an ML- ten days. It was quantitatively confirmed that the infectious
based alignment-free approach. This approach incorpo- force of the virus is strong, with about 0.14% transition
rated ML-controlled digital signal analysis of genome from mild to serious infection.
analysis, augmented decision-making process and Spear- The related works that have been reviewed so far indi-
man’s rank correlation coefficient analysis for validation cate that ML techniques and other artificial intelligence
result. The result of the study corroborates the research techniques have played important roles in prediction,
hypothesis of a bat as the origin 2019-nCoV pandemic diagnosis and containment of the COVID-19 pandemic,
and the study further classifies the pandemic as Sarbe- which can help reduce the huge burden on limited health-
covirus within betacoronavirus. More than 5000 unique care systems. To the best of our knowledge, no work has
genomic sequences from the dataset, totaling 61:8 mil- been reported so far using epidemiology labeled datasets
lion bp were analyzed with more than 90% accuracy. In for positive and negative COVID-19 cases in Mexico for
the work of [7] the machine learning-based approach was development of supervised ML models for prediction of
developed for a real-time forecast of 2019-nCoV outbreak the COVID-19 infection. Therefore, the present study
using news alerts reported by Media Cloud and official intends to investigate these gaps.
heath report from Chinese Center Disease for Control
and Prevention, internet search activity from Baidu and
daily forecast from GLEAM (an agent-based mechanis-
tic model). The approach uses a clustering that enables
exploration of geo-spatial synchronicities of 2019-nCoV
activities across Chinese provinces. The approach is able
to produce an accurate forecast two days ahead of time.
The ML-driven approach was also used to predict the
severity of the infection in patients. A clinical dataset
from Wuhan was proposed in the study, with 15 patients
admitted to a hospital in Wuhan, China between January
10th to February, 18th, 2020 being screened. There were
375 patients who were discharged including 201 survi-
vors. The prognostic prediction model based on the ML
XGboost algorithm was developed and was tested with
29 patients including 3 patients from other hospitals, who
were cleared after 19th February 2020. The model was
able to predict the mortality risk of 2019-nCoV patients
and clinical route to the recognition of critical cases from
severe cases, and more so the model helped the doctors
with identification of 2019-nCoV patients and interven-
tion with the model potentially being able to reduce the
mortality risk. In the study of [8] ML and Vaxign reverse
vaccinology tools were used to predict 2019-nCoV vaccine
candidates. The Vaxign reverse vaccinology tool predicted
S protein as a likely adhesion while Vaxign ML predicted
S protein had a high protective antigenicity score. The
predicted vaccine in the study provides new strategies Fig. 2 Methodology to build machine learning classification models
for effective and safe 2019-nCoV vaccine development. for COVID-19 infection
SN Computer Science
SN Computer Science (2021) 2:11 Page 5 of 13 11
SN Computer Science
11
Page 6 of 13 SN Computer Science (2021) 2:11
74 0 0 1 0 1 0 1 0 0 0
71 1 0 1 0 1 0 1 0 1 0
50 0 1 0 0 0 0 0 0 0 1
25 1 0 0 0 0 0 1 0 0 1
28 1 0 0 0 0 0 0 0 0 0
67 0 0 0 0 1 0 1 0 1 0
44 1 0 0 0 0 0 1 0 1 0
62 0 0 0 0 0 0 1 0 0 0
30 1 0 0 0 0 0 0 0 0 0
30 0 0 0 0 0 0 0 0 0 0
SN Computer Science
SN Computer Science (2021) 2:11 Page 7 of 13 11
divided classes [48]. Therefore, the plane with the highest limit
SN Computer Science
11
Page 8 of 13 SN Computer Science (2021) 2:11
Thus b represents the value of bias, which is often 1 of the correlation between feature set, and k is the number
value. of features.
In this study, the correlation coefficient analysis between
Correlation coefficient analysis the various dependent and independent features was carried
out to determine a strong relationship between each depend-
Correlation coefficient analysis is used to determine a strong ent feature and independent feature of the dataset. Figure 8
relationship between two sets of dataset features which can shows the scatterplot correlation coefficient of the various
be either dependent and independent features or variables dependent features, including age, sex, pneumonia, diabetes,
[6, 26]. Therefore, the value r is a finite number between − 1 asthma, hypertension, CVDs, obesity, CKDs and tobacco
and + 1 which shows the strong relationship between two use, against the result, which is an independent feature of
sets of dependent and independent features or variables [29]. the dataset. Figure 9 shows the corresponding correlation
The relationship can be positive if the number is positive; matrix of the independent dataset features against depend-
likewise, the relationship can be negative if the number is ent features. Table 4 shows the relationship value (r value)
negative. The idea behind the correlation coefficient analysis of each dependent feature against the independent feature.
approach is that the importance of a relevant feature set in a
dataset can be determined by evaluating a strength relation-
ship between dependent and independent features [6, 29]. A Experimental Setup
feature set is considered good for ML model if the dependent
features are correlated with the independent features [26]. The supervised machine learning algorithms were executed
The feature can be evaluated by Eq. (3) below:- using a python programming language in window operating
system environment deployed HP Branded computer system
( ) (Laptop), Corei5 with 8 GB of Ram and 2.8 GHz processor
kavg corrfc
Importance = √ , (6) speed. All the necessary libraries were installed on python
( )
k + k(k − 1)avg corrff notebook and used for the data analysis including correlation
analysis and development of the models.
where the Importance is the correlation coefficient between
dependent feature set and independent feature and is the Predictive Models for COVID‑19 Infection
ranking criteria for evaluating the set of feature, (avg(corr fc))
is the average of the correlation between the dependent fea- The supervised ML models for COVID-19 infection were
ture and the independent feature, avg(corrff) is the average developed with a decision tree, logistic regression, naive
Fig. 8 Scatterplot correlation
coefficient of the feature of the
dataset
SN Computer Science
SN Computer Science (2021) 2:11 Page 9 of 13 11
1 Age RT-PCR COVID-19 test result 0.17 Weak positive correlation coefficient relationship
2 Sex RT-PCR COVID-19 Test Result 0.085 Weak positive correlation coefficient relationship
3 Pneumonia RT-PCR COVID-19 test result 0.082 Weak positive correlation coefficient relationship
4 Diabetes RT-PCR COVID-19 test result 0.028 A weak positive correlation coefficient relationship
5 Asthma RT-PCR COVID-19 test result 0.024 A weak positive correlation coefficient relationship
6 Hypertension RT-PCR COVID-19 Test Result 0.03 A weak positive correlation coefficient relationship
7 CVDs RT-PCR COVID-19 test result 0.025 A weak positive correlation coefficient relationship
8 Obesity RT-PCR COVID-19 test result 0.034 A weak positive correlation coefficient relationship
9 CVDs RT-PCR COVID-19 test result 0.025 Weak positive correlation coefficient relationship
10 Tobacco RT-PCR COVID-19 test result 0.025 Weak positive correlation coefficient relationship
Bayes. SVM and ANN machine learning algorithms with an and testing sets. Therefore, the model were trained the
epidemiology dataset for positive and negative COVID-19 models with 80% training data and tested with the remain-
cases in Mexico. Before the development of the model, the ing 20% of the dataset. Five different models were devel-
correlation coefficient analysis between the various depend- oped using machine learning classification algorithms to
ent and independent features was carried out to determine predict whether the patient is infected with COVID-19
a strong relationship between each dependent feature and or otherwise using ML classification algorithms which
independent feature of the dataset. Table 4 shows the rela- include decision tree, logistic regression, naive Bayes,
tionship value (r value) of each dependent feature against SVM and ANN.
the independent feature of the dataset. All the dependent The models were evaluated using an accuracy, sensitivity
features have a positive correlation coefficient relationship and specificity performance evaluation metric to determine
with the independent feature of the dataset. However, it is their efficiency and quality.
a weak positive correlation coefficient relationship that all The accuracy evaluation metric shows the percentage of
dependent features have on independent feature. the dataset instances correctly predicted by the model devel-
The epidemiology dataset for positive and negative oped by the machine learning algorithm [28]. The accuracy
COVID-19 cases in Mexico been partitioned into training is expressed in the equation below (7).
SN Computer Science
11
Page 10 of 13 SN Computer Science (2021) 2:11
tp + tn Results and Discussion
Accuracy = . (7)
tp + tn + fn + fp
Early prediction of COVID-19 can be helpful in reduc-
While, sensitivity evaluation metric shows the percent- ing the huge burden on healthcare systems by helping to
age of COVID-19 positive patients correctly by the mod- diagnose COVID-19 patients. In this work, decision tree,
els, as it is expressed in the equation below (8). logistic regression and naive, SVM and ANN supervised
tp learning classification models for prediction of COVID-19
Sensitivity = . (8) infection using an epidemiology dataset for positive and
tp + fn
negative COVID-19 cases in Mexico were developed. The
More so, the specificity evaluation metric shows the performance of all models was evaluated based on accuracy
percentage of COVID-19 negative patients correctly by parameters. The performance result of the models is shown
the models, as it is expressed in the equation below (9). in Table 5 and Fig. 11 respectively.
tp The model developed with decision tree happened to
Specificity = . (9) be the best model among all models developed in terms of
tn + fp
accuracy with 94.99% when compared with other models
where tp is the true positive, tn is the true negative, fp is developed with logistic regression, naive Bayes, SVM and
the false positive while fn is the false negative. ANN which have 94.41%, 94.36%, 92.40% and 89.20%
Figure 10 shows the decision tree model for prediction accuracy respectively. While for sensitivity which shows
of COVID-19 Infection. the percentage of COVID-19 positive patients correctly
by the models, SVM emerged to be the best model among
all models with 93.34%, followed by ANN Model which
Table 5 Performance evaluation S. No. Model Accuracy (%) Sensitivity (%) Specificity (%)
result
1 Decision tree 94.99 89.2 93.22
2 Logistic regression 94.41 86.34 87.34
3 Naive bayes 94.36 83.76 94.3
4 Support vector machine 92.4 93.34 76.5
5 Artificial neural network 89.2 92.4 83.3
SN Computer Science
SN Computer Science (2021) 2:11 Page 11 of 13 11
60
Accuracy (%)
40
Sensivity (%)
20
Specificity (%)
0
Decision Logisc Nave Bayes Support Arficial
Tree Regression Vector Neural
Machine Network
has 92.40% then Decision Model which has 89.20%, then Conclusion
logistic regression model which has 86.34% and followed
by Naïve Beyes Model which has least with 83.76%. More The COVID-19 pandemic now appears to be endemic like
so, for specificity which shows the percentage of COVID- other communicable diseases including HIV/AIDs, Tuber-
19 negative patients correctly by the models, Naïve Bayes culosis, Measles and Hepatitis. It has affected nearly 213
model emerged to be the best model among all models countries and territories worldwide and 2 international con-
with 94.30%, followed by decision tree model which has veyances thereby leading the World Health Organization
93.22% then logistic regression model which has 87.34%, (WHO) to declare the disease a public health emergency
then ANN model which has 83.30% and followed by SVM of international concern. COVID-19 is transmitted through
model which has least of 76.50%. direct contact with an infected person via sneezing and
Decision tree model indicated that age feature is the coughing and has no medically approved vaccine or medi-
most important feature among all the dependent features cation. Non-clinical techniques such as ML techniques are
of the dataset including the clinical features. The model being used as an alternative means for diagnosis and progno-
indicates that most of the people above the age of 45 are sis of 2019-nCoV pandemic patients with a view to comple-
prone to be infected with SARS-CoV-2 when compared menting and reducing the huge burden on limited healthcare
to people with lower ages. Similarly, people that are suf- systems in almost all countries around the world, with the
fering from pneumonia, CKDs, CVDs, diabetes, asthma, aim of reviving the deeply affected economic sector. Super-
obesity and hypertension more likely to be infected with vised ML models for COVID-19 infection were developed in
COVID-19. Regarding gender, males are more suscepti- this work with a decision tree, logistic regression and naive
ble to COVID-19 infection than females, and those who Bayes learning algorithms using an epidemiology labeled
smoke tobacco are more likely to be infected than non- dataset of positive and negative COVID-19 cases in Mexico.
tobacco smokers. The model will help the health workers The models were trained with 80% training data and tested
with a diagnosis of the suspected COVID-19 patients, and with the remaining 20% of the data. The model developed
this will supplement RT-PCR COVID-19 testing thereby with decision tree happened to be the best model among
reducing the huge burden on healthcare systems. all models developed in terms of accuracy with 94.99%. In
The supervised ML models can be used as retrospec- contrast, SVM and naive Bayes models emerged to be the
tive evaluation techniques or tools to validate COVID- best models among all models in terms of sensitivity and
19 infection cases. This study shows how ML Predictive specificity with 93.34% and 94.30% respectively.
COVID-19 infection models can be developed, validated
and used as the tools for rapid diagnosis of COVID-19
infection cases. The study also shows the important roles Funding No funding sources.
playing by supervised ML algorithms in prediction, diag-
nosis and containment of the COVID-19 pandemic, which Compliance with Ethical Standards
can help reduce the huge burden on limited healthcare
systems in most of the nations around the world, espe- Conflict of interest Authors have declared that no conflict of interest
exists.
cially developing nations.
SN Computer Science
11
Page 12 of 13 SN Computer Science (2021) 2:11
SN Computer Science
SN Computer Science (2021) 2:11 Page 13 of 13 11
40. Yang Z, Zeng Z, Wang K, et al. Modified SEIR and AI predic- 47. Chinese diagnosis and treatment plan of COVID-19 patients
tion of the epidemics trend of COVID-19 in China under public (The sixth edition). http://www.nhc.gov.cn/yzygj/s7653p/20200
health interventions. J Thorac Dis. 2020;12(3):165–74. https:// 2/8334a8326dd94d329df351d7da8aefc2.s html. 2020 Accessed
doi.org/10.21037/jtd.2020.02.64. 7 Jun 2020.
41. https://www.kaggle.com/marianarfranklin/mexico-covid19-clini 48. Kaur H, Kumari V. Predictive modeling and analytics for diabetes
cal-data/metadata Accessed 26 Jun 2020. using a machine learning approach. Appl Comput Inform. 2018.
42. https : //www.scien c emag . org/news/2020/05/unpro ven-herba https://doi.org/10.1016/j.aci.2018.12.004.
l-remedy-against-covid-19 could-fuel-drug-resistant-malaria- 49. Hearst MA, Dumais ST, Osuna E, Platt J, Scholkopf B. Support
scientists, Accessed 2 Jun 2020. vector machines. IEEE Intell Syst Appl. 1998;13(4):18–28.
43. Dong L, Hu S, Gao J. Discovering drugs to treat coronavirus dis- 50. Huang GB, Zhu QY, Siew CK. Extreme learning machine: theory
ease 2019 (COVID-19). Drug Discov Ther. 2020;14:58–60. and applications. Neurocomputing. 2006;70(1):489–501.
44. Wang D, Hu B, Hu C, et al. Clinical characteristics of 138 hos- 51. Islam M, Mahmud S, Muhammad LJ, et al. Wearable technology
pitalized patients with 219 novel coronavirus-infected pneu- to assist the patients infected with novel coronavirus (COVID-19).
monia in Wuhan, China. JAMA. 2020. https://doi.org/10.1001/ SN Comput Sci. 2020;1:320. https: //doi.org/10.1007/s42979 -020-
jama.2020.1585. 00335-4.
45. Huang C, et al. (2020) Clinical features of patients infected with
2019 novel coronavirus in Wuhan, China. Lancet; ahead of print. Publisher’s Note Springer Nature remains neutral with regard to
46. Chinese diagnosis and treatment plan of COVID-19 patients (The jurisdictional claims in published maps and institutional affiliations.
fifth edition). http://www.nhc.gov.cn/yzygj/ s7653p /202002 /3b09b
894ac9b4204a79db5b8912d4440. shtml. 2020. Accessed 5 Jun
2020.
SN Computer Science