
Engineering Science and Technology, an International Journal 24 (2021) 839–847

journal homepage: www.elsevier.com/locate/jestch

Full Length Article

A new COVID-19 detection method from human genome sequences using CpG island features and KNN classifier

Hilal Arslan a,⇑, Hasan Arslan b

a Department of Computer Engineering, Izmir Bakircay University, Izmir, Turkey
b Department of Mathematics, Erciyes University, Kayseri, Turkey

Article info

Article history:
Received 13 October 2020
Revised 18 December 2020
Accepted 30 December 2020
Available online 9 January 2021

Keywords:
COVID-19
SARS-CoV-2
K-Nearest Neighbors
CpG islands
Human coronaviruses

Abstract

Various viral epidemics have been detected such as the severe acute respiratory syndrome coronavirus and the Middle East respiratory syndrome coronavirus in the last two decades. The coronavirus disease 2019 (COVID-19) is a pandemic caused by a novel betacoronavirus called severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2). After the rapid spread of COVID-19, many researchers have investigated diagnosis and treatment for this terrifying disease quickly. Identifying COVID-19 from the other types of coronaviruses is a difficult problem due to their genetic similarity. In this study, we propose a new efficient COVID-19 detection method based on the K-nearest neighbors (KNN) classifier using the complete genome sequences of human coronaviruses in the dataset recorded in 2019 Novel Coronavirus Resource. We also describe two features based on CpG island that efficiently detect COVID-19 cases. Thus, genome sequences including approximately 30,000 nucleotides can be represented by only two real numbers. The KNN method is a simple and effective non-parametric technique for solving classification problems. However, performance of the KNN depends on the distance measure used. We perform 19 distance metrics investigated in five categories to improve the performance of the KNN algorithm. Some efficient performance parameters are computed to evaluate the proposed method. The proposed method achieves 98.4% precision, 99.2% recall, 98.8% F-measure, and 98.4% accuracy in a few seconds when any L1 type metric is used as a distance measure in the KNN.

© 2020 Karabuk University. Publishing services by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

Coronaviruses, which are positive-sense, single-stranded RNA viruses, are known to have the largest genomes among all RNA viruses [1]. The family of coronaviruses has been classified into four genera [2]: alphacoronavirus (AlphaCoV), betacoronavirus (BetaCoV), gammacoronavirus (GammaCoV), and deltacoronavirus (DeltaCoV). AlphaCoV and BetaCoV infect mammalian hosts, whereas GammaCoV and DeltaCoV mainly infect bird species [3].

Various coronaviruses of the AlphaCoV and BetaCoV types have been recorded. Human coronavirus 229E (229E-CoV) and human coronavirus NL63 (NL63-CoV) are types of AlphaCoV. Moreover, human coronavirus HKU1 (HKU1-CoV), Severe Acute Respiratory Syndrome coronavirus (SARS-CoV), and Middle East Respiratory Syndrome coronavirus (MERS-CoV) are types of BetaCoV recorded recently. While 229E-CoV, NL63-CoV and HKU1-CoV cause only mild respiratory symptoms, SARS-CoV and MERS-CoV cause dangerous and potentially fatal respiratory infections. SARS-CoV was first identified in the Guangdong province of southern China on 16 November 2002 [4] and MERS-CoV emerged in June 2012 in Jeddah, Saudi Arabia [5]. In late December 2019, a novel coronavirus, SARS-CoV-2, emerged in China and spread rapidly throughout the world. The SARS-CoV-2 virus, which is genetically similar to SARS-CoV, caused a severe illness known as COVID-19 and a large number of deaths worldwide. Because several of the earliest infected people had visited a local seafood market in Wuhan, China, in December 2019, the virus is thought to be a pathogen that jumped from an animal to humans and caused an infectious disease. The similarity rate between SARS-CoV-2 and a bat coronavirus is 96% [6]. The COVID-19 outbreak was declared a global pandemic on 11 March 2020 by the World Health Organization (WHO). As of 8 October 2020, the disease had spread to 188 countries and territories, had infected over 36.2 million people, and had caused more than 1.05 million deaths; more than 25.2 million people had recovered from the illness.

⇑ Corresponding author.
E-mail addresses: hilal.arslan@bakircay.edu.tr (H. Arslan), hasanarslan@erciyes.edu.tr (H. Arslan).

https://doi.org/10.1016/j.jestch.2020.12.026
2215-0986/© 2020 Karabuk University. Publishing services by Elsevier B.V.

The most common symptoms of COVID-19 are cough, fever, gastrointestinal and musculoskeletal symptoms, and loss of taste or smell. Shortness of breath and chest pressure or pain are among the less common symptoms [7]. Because the common symptoms resemble those of the flu, early diagnosis of SARS-CoV-2 infection is difficult. It is essential to identify positive cases quickly, since the virus spreads rapidly and poses a threat to the public health system. Furthermore, no specific antiviral drug has so far been recommended for the treatment of the novel coronavirus disease other than supportive care. In pre-clinical research, some medications already prescribed for other diseases have shown positive effects against the COVID-19 virus. Owing to the absence of an effective COVID-19 treatment, early detection of this life-threatening infectious disease plays a very important role in all medical therapies and in preventing the spread of the disease. Recently, many data scientists have been working extensively on remarkable features of the virus. For this purpose, artificial intelligence applications and machine learning methods have been used successfully to speed up the diagnosis of COVID-19 cases [8–10]. Learning-based classification models will therefore not only reduce the burden on healthcare professionals but also facilitate early diagnosis of the disease.

In this paper, we introduce an effective classification method to distinguish SARS-CoV-2 from common human coronaviruses, which are AlphaCoV, BetaCoV-1, MERS-CoV, HKU1-CoV, NL63-CoV, and 229E-CoV, using CpG island features and the KNN classification method. The main contributions of this research can be summarized as follows:

• The choice of discriminative features based on the characteristics of SARS-CoV-2 is a critical step in improving classification performance. In this study, we propose CpG island features to differentiate SARS-CoV-2 from the other human coronaviruses.
• We propose a robust prediction method that uses the KNN classifier with any L1 type distance metric, selected from among 19 distance metrics investigated in five categories: L1 type, L2 type, vicissitude, inner product metrics, and other types of metrics.
• We construct a larger dataset containing almost all types of the human coronaviruses that are genetically similar to SARS-CoV-2. They are AlphaCoV, BetaCoV-1, MERS-CoV, NL63-CoV, HKU1-CoV, and 229E-CoV.
• The proposed method efficiently detects COVID-19 cases in a few seconds on this relatively large dataset.

The rest of the paper is organized as follows. In Section 2, a literature survey on COVID-19 is presented. In Section 3, the K-nearest neighbors (KNN) method and the distance measures used in the KNN method are summarized. The proposed classification strategy to detect COVID-19 cases is introduced in Section 4. Section 5 reports experimental results and an evaluation of the experimental observations conducted with various distance metrics. In Section 6, we compare the proposed method with other detection methods. Finally, Section 7 presents concluding remarks and future directions.

2. Related works

Deep learning and machine learning techniques have been widely used in a variety of research areas such as big data analysis [11–13], image classification [14], face detection [15,16], and disease prediction [17–21]. Computer-based diagnostic systems developed with the help of artificial intelligence techniques will speed up early diagnosis of COVID-19 and thus decrease the workload on health workers. Chen et al. [22] published a comprehensive survey of studies in the literature that use artificial intelligence in the fight against COVID-19. In addition, they observed that machine learning, deep learning and artificial neural network technologies have been successfully implemented at almost every stage of combating COVID-19 when compared to other coronaviruses such as SARS-CoV and MERS-CoV.

Randhawa et al. [23] suggested a combination of supervised machine learning with digital signal processing methods (ML-DSP) for an accurate and scalable taxonomic classification of genomic sequences. They mapped each genomic sequence to discrete values corresponding to its genomic signals. They employed six supervised machine learning classifiers (linear discriminant, linear support vector machine, quadratic support vector machine, fine KNN, subspace discriminant, and subspace KNN) to detect SARS-CoV-2. They used 29 SARS-CoV-2 genome sequences and 20 genome sequences for each of alphacoronavirus, betacoronavirus, and deltacoronavirus. They observed that the linear discriminant method achieved 100% accuracy. Naeem et al. [24] proposed a method to distinguish COVID-19, SARS-CoV, and MERS-CoV viruses by using the K-nearest neighbors and the trainable cascade-forward back-propagation neural network methods. They extracted genomic signal processing features using a dataset that contains 76 genome sequences for each type of coronavirus from the National Center for Biotechnology Information (NCBI). Their results showed that the KNN algorithm outperformed the cascade-forward back-propagation neural network in all of the COVID-19/SARS-CoV, COVID-19/MERS-CoV and COVID-19/SARS-CoV/MERS-CoV classification tasks, and achieved an accuracy of 100%. Batista et al. [25] developed a new method to predict whether patients in the emergency care unit are COVID-19 positive by using five supervised machine learning algorithms: neural networks, random forests, logistic regression, support vector machines, and gradient boosting regression trees. They collected data from 235 patients and used 15 different types of features such as age, gender, and hemoglobin. Their experimental results showed that the support vector machine classifier achieved the highest performance among the classifiers, with an AUC value of 84.7%.

Unal and Dudak [26] studied the diagnosis of COVID-19 viral disease. They applied four supervised machine learning algorithms (Naive Bayes, KNN, support vector machines and decision trees) to the COVID-19 Mexico Patient Health Dataset. The dataset consists of 95839 cases recorded by the Mexican government and 19 different types of features, such as the sex and age of the patient, the state of pneumonia and intubation, and the state of many other diseases. Their experimental results showed that the support vector machine achieved the best predictive performance, with a classification accuracy of 100%.

Although only a few studies detect COVID-19 cases from genome sequences, a number of papers have recently been published on detecting COVID-19 cases from X-ray or computed tomography (CT) images [27–34]. Barstugan et al. [27] developed a COVID-19 classification method by using the support vector machine classifier on a dataset including different types of 150 abdominal CT images. Their experimental results showed that their proposed method achieved an accuracy of 99.6%. Ozturk et al. [28] applied the DarkCovidNet deep learning model to raw chest X-ray images and generated a binary classification (COVID-19, no findings). They also used the DarkCovidNet classifier on the same dataset to create a triple classification (COVID-19, no findings, pneumonia). They reported that the highest accuracy of the classifier was 98.08% for the binary classification and 87.02% for the triple case. Sekeroglu et al. [29] proposed an alternative COVID-19 detection method by using deep learning and machine
learning classifiers on a publicly available dataset which contains 1583 healthy, 4292 pneumonia and 225 confirmed COVID-19 chest X-ray images. They stated that a convolutional neural network (CNN) without pre-processing and with minimized layers achieved an accuracy of 98.50%. Jain et al. [30] developed a new approach to detect COVID-19 cases among chest X-ray images of healthy, bacterial pneumonia, viral pneumonia and COVID-19 by using ResNet18, ResNet101, DenseNet121 and VGG-16 deep learning models. Their experimental results demonstrated that the ResNet101 method achieved the highest accuracy of 98.93%. Apostolopoulos and Mpesiana [31] used VGG19, MobileNet, Inception, Xception and Inception ResNet v2 deep transfer learning classifiers to detect COVID-19 positive cases among 224 COVID-19, 700 bacterial pneumonia and 504 normal chest X-ray images. Their results showed that VGG19 achieved the best binary classification accuracy, 98.75%, among the CNN methods.

Ahuja et al. [32] used deep learning methods to detect COVID-19 positive cases from chest X-ray images. Asnaoui and Chawki [33] constructed a novel method for detecting COVID-19 from pneumonia chest X-ray and tomography images. They applied recent deep learning methods: VGG16, VGG19, DenseNet201, Inception_ResNet_V2, Inception_V3, Resnet50, and MobileNet_V2. They reported that Inception_ResNet_V2 gave the best accuracy, 92.18%. Basu et al. [34] proposed an alternative screening method for COVID-19, called Domain Extension Transfer Learning. They extracted discriminative features from the chest X-ray dataset and then classified the images as normal, pneumonia, other diseases, and COVID-19 with an accuracy of 95.3%.

3. Background

We propose a new COVID-19 detection method based on CpG island features and the KNN classifier. The KNN is extremely useful for large data classification and its performance depends on the distance metric used [35–38]. Several studies have been conducted to find optimum metrics for the KNN algorithm [39,40]. In this section, we first provide a brief description of the KNN method and some recent studies based on the KNN. Second, we investigate the metrics used in the KNN classifier under five categories to improve the performance of the KNN classifier on our model.

3.1. K-nearest neighbors

K-nearest neighbors (KNN), a supervised machine learning algorithm, can be efficiently used to solve classification problems. The KNN was introduced in 1951 by [41] and then recast in 1967 by [42]. The KNN is a non-parametric classifier known as one of the simplest lazy learning algorithms; that is, there is no need to build a learning model in this classification method. Despite its lazy structure, KNN was named one of the 10 most effective methods for analyzing information in a database [43]. In the prediction process, the class of a new observation is determined from the classes of the K training samples that are nearest to it under the chosen distance.

There are some recent studies that aim to improve the performance of the KNN. To decrease the sensitivity to the neighborhood size k and improve the voting strategy in the region of the neighborhoods, Gou et al. [44] proposed two k-nearest neighbor rules: the weighted representation-based k-nearest neighbor rule and the weighted local mean representation-based k-nearest neighbor rule. Their experimental results demonstrate that the proposed methods are less sensitive to k. Gou et al. [45] proposed another study to improve the selection of the neighborhood size k by introducing the generalized mean distance-based KNN classifier. They stated that the proposed method is less sensitive to k than the other KNN-based classifiers.

3.2. Metrics

To improve the performance of the KNN classifier, we investigate the metrics used in the KNN algorithm under five categories, following a categorization similar to that in [39,40]. We list them below:

1. L1 type metrics: Six metrics are investigated in this category. They are the Manhattan metric (ManM), Chebyshev metric (ChebM), Canberra metric (CanM), Sorensen metric (SM), Kulczynski metric (KM), and Mean character metric (MCM), which are defined in Table 1.
2. L2 type metrics: Six metrics are investigated in this category. They are the Euclidean metric (EM), Clark metric (ClaM), Neyman χ² metric (NCSM), Squared χ² metric (SquM), Divergence metric (DivM), and Squared Chi-squared metric (SCSM). We list these metrics in Table 2.
3. Vicissitude metrics: Three metrics are considered in this category. They are the Vicis Symmetric 1 metric (VSDFM1), Vicis Symmetric 2 metric (VSDFM2) and Vicis Symmetric 3 metric (VSDFM3). Their definitions are given in Table 3.
4. Inner product metrics: Two metrics are investigated in this category. They are the Dice metric (DicM) and Chord metric (ChoM). Their definitions are given in Table 4.
5. Other metrics: In this category, we use two metrics that significantly affect the performance of the KNN algorithm. These are the Motyka (MotM) and Hassanat (HasM) metrics. They are defined in Table 5, where

$$
D(x_i, y_i) =
\begin{cases}
1 - \dfrac{1 + \min(x_i, y_i)}{1 + \max(x_i, y_i)}, & \text{if } \min(x_i, y_i) \ge 0 \\[2ex]
1 - \dfrac{1 + \min(x_i, y_i) + \lvert \min(x_i, y_i) \rvert}{1 + \max(x_i, y_i) + \lvert \min(x_i, y_i) \rvert}, & \text{if } \min(x_i, y_i) < 0
\end{cases}
$$

Table 1
L1 type metrics.

| Abbreviation | Metric | Definition |
|---|---|---|
| ManM | Manhattan metric | $\sum_{i=1}^{n} \lvert x_i - y_i \rvert$ |
| ChebM | Chebyshev metric | $\max_{1 \le i \le n} \lvert x_i - y_i \rvert$ |
| CanM | Canberra metric | $\sum_{i=1}^{n} \frac{\lvert x_i - y_i \rvert}{\lvert x_i \rvert + \lvert y_i \rvert}$ |
| SM | Sorensen metric | $\frac{\sum_{i=1}^{n} \lvert x_i - y_i \rvert}{\sum_{i=1}^{n} (x_i + y_i)}$ |
| KM | Kulczynski metric | $\frac{\sum_{i=1}^{n} \lvert x_i - y_i \rvert}{\sum_{i=1}^{n} \min(x_i, y_i)}$ |
| MCM | Mean character metric | $\frac{\sum_{i=1}^{n} \lvert x_i - y_i \rvert}{n}$ |

Table 2
L2 type metrics.

| Abbreviation | Metric | Definition |
|---|---|---|
| EM | Euclidean metric | $\sqrt{\sum_{i=1}^{n} \lvert x_i - y_i \rvert^2}$ |
| ClaM | Clark metric | $\sqrt{\sum_{i=1}^{n} \left( \frac{x_i - y_i}{\lvert x_i \rvert + \lvert y_i \rvert} \right)^2}$ |
| NCSM | Neyman χ² metric | $\sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i}$ |
| SquM | Squared χ² metric | $\sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i + y_i}$ |
| DivM | Divergence metric | $2 \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{(x_i + y_i)^2}$ |
| SCSM | Squared Chi-squared metric | $\sum_{i=1}^{n} \frac{(x_i - y_i)^2}{\lvert x_i + y_i \rvert}$ |

Table 3
Vicissitude metrics.

| Abbreviation | Metric | Definition |
|---|---|---|
| VSDFM1 | Vicis Symmetric 1 metric | $\sum_{i=1}^{n} \frac{(x_i - y_i)^2}{\min(x_i, y_i)^2}$ |
| VSDFM2 | Vicis Symmetric 2 metric | $\sum_{i=1}^{n} \frac{(x_i - y_i)^2}{\min(x_i, y_i)}$ |
| VSDFM3 | Vicis Symmetric 3 metric | $\sum_{i=1}^{n} \frac{(x_i - y_i)^2}{\max(x_i, y_i)}$ |

Table 4
Inner product metrics.

| Abbreviation | Metric | Definition |
|---|---|---|
| DicM | Dice metric | $1 - \frac{2 \sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2 + \sum_{i=1}^{n} y_i^2}$ |
| ChoM | Chord metric | $\sqrt{2 - 2 \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i^2}}$ |

Table 5
Other metrics: Motyka and Hassanat metrics.

| Abbreviation | Metric | Definition |
|---|---|---|
| MotM | Motyka metric | $\frac{\sum_{i=1}^{n} \max(x_i, y_i)}{\sum_{i=1}^{n} (x_i + y_i)}$ |
| HasM | Hassanat metric | $\sum_{i=1}^{n} D(x_i, y_i)$ |

4. Proposed method

In this section, we present a new COVID-19 detection method based on the KNN classifier and CpG island features. The main steps of the proposed method are described in Algorithm 1. The first step of the algorithm is feature extraction. Extracting robust and discriminative features from human coronavirus genome sequences is the most critical step in improving the diagnosis of SARS-CoV-2. In this step, we propose to use CpG based features, since CpG dinucleotides in the open reading frames of SARS-CoV-2 have an extremely low frequency [46]. The main reason for the lower CpG dinucleotide density is the mutation of C into A and G into T [47,48]. The proposed features are extracted by using Eq. 1 and Eq. 2.

$$ CGp = \mathrm{ratio}(C) + \mathrm{ratio}(G) \tag{1} $$

$$ CpGo = \frac{\mathrm{ratio}(CG)}{\mathrm{ratio}(C)\,\mathrm{ratio}(G)} \tag{2} $$

where ratio(C), ratio(G), and ratio(CG) are the numbers of occurrences of C, G, and CG in the sequence, respectively, divided by the sequence length. Thus, each sequence containing about 30 000 nucleotides is represented by only two features. Fig. 1 provides an example of how the features are calculated from a part of a sequence.

Fig. 1. CpG based features. The numbers of C, G, and CG are 13, 20, and 3, respectively. Thus, CGp = ratio(C) + ratio(G) = 0.55, and CpGo = ratio(CG)/(ratio(C) ratio(G)) = 0.69.

After the feature extraction step, we apply the KNN method to classify SARS-CoV-2 sequences. The performance of the KNN mainly depends on the metric that is used to compute the distances between data samples. To improve the performance of the KNN algorithm, we evaluate 19 distance metrics in five categories: L1 type, L2 type, vicissitude, inner product metrics, and other types of metrics. In this step, we propose to use an L1 type metric as the distance measure in the KNN.

Algorithm 1. Proposed Algorithm

Require:
    S: training genome sequences,
    L: COVID-19 positive, COVID-19 negative
    seq: a test sequence, and k: the neighborhood size
Ensure:
    Determine the class label of the test sequence seq
Step 1: Feature Extraction
    1: for all genome sequences do
    2:     feature1 = ratio(C) + ratio(G)
    3:     feature2 = ratio(CG) / (ratio(C) ratio(G))
    4: end for
Step 2: Apply KNN and use L1 type metric
    5: Compute the distance between seq and every sample in S using any L1 type metric
    6: Choose k samples in S that are nearest to seq
    7: Assign seq to majority class

5. Experimental results

In this section, we evaluate the performance of the proposed method. Before discussing the results, we first explain our dataset, then describe the experimental setup, and then briefly introduce the performance measures. Finally, we give the results of the experiments conducted with different distance metrics.

5.1. Dataset

The 2019 novel coronavirus resource (2019nCoVR) [49] at the China National Center for Bioinformation is one of the most important sources of various types of coronaviruses. It integrates various important databases including GISAID, NCBI, NMDC and CNCB/NGDC. In this study, we used complete genomic sequences of human coronaviruses obtained from 2019nCoVR. The genome sequence of each coronavirus has a length of approximately 30 000 nucleotides. The properties of the human coronavirus sequences are presented in Table 6. The dataset includes various types of human coronaviruses, namely AlphaCoV, BetaCoV-1, MERS-CoV, NL63-CoV, HKU1-CoV and 229E-CoV, as well as SARS-CoV-2. We refer to the SARS-CoV-2 sequences as COVID-19 positive and to all other sequences as COVID-19 negative. In addition to 1000 SARS-CoV-2 sequences, we used 592 genome sequences of other human coronaviruses in our experiments. We note that all available genome sequences of human coronaviruses other than SARS-CoV-2 were downloaded.

Table 6
The properties of complete genome sequences of human coronaviruses.

| Human coronaviruses | Number of sequences | Label |
|---|---|---|
| SARS-CoV-2 | 1000 | Positive |
| AlphaCoV | 88 | Negative |
| BetaCoV-1 | 140 | Negative |
| MERS-CoV | 258 | Negative |
| NL63-CoV | 61 | Negative |
| HKU1-CoV | 18 | Negative |
| 229E-CoV | 27 | Negative |

5.2. Experimental setup

The experiments were performed on a Core i7 2.7 GHz processor with 16 GB RAM under the Linux operating system. Feature extraction and the classification process were implemented in the Python programming language. Precision, recall, F-measure, and accuracy in predicting COVID-19 positive cases are used as the major performance measures. Next, we define these performance measures.
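Since feature extraction and classification were carried out in Python, the two steps of Algorithm 1 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`cpg_features`, `manhattan`, `knn_predict`) are hypothetical, the Manhattan metric stands in for "any L1 type metric", ties are broken by whichever label was seen first, and the example sequence is made up so that its counts match those quoted in the Fig. 1 caption.

```python
from collections import Counter


def cpg_features(seq):
    """Return (CGp, CpGo) for a nucleotide string, following Eqs. (1) and (2)."""
    seq = seq.upper()
    n = len(seq)
    ratio_c = seq.count("C") / n
    ratio_g = seq.count("G") / n
    # "CG" cannot overlap with itself, so str.count gives the exact number.
    ratio_cg = seq.count("CG") / n
    cgp = ratio_c + ratio_g
    # Assumption: the sequence contains at least one C and one G;
    # otherwise the denominator below is zero.
    cpgo = ratio_cg / (ratio_c * ratio_g)
    return cgp, cpgo


def manhattan(x, y):
    """Manhattan metric (ManM), one of the L1 type metrics."""
    return sum(abs(a - b) for a, b in zip(x, y))


def knn_predict(train_x, train_y, query, k=1, metric=manhattan):
    """Assign `query` to the majority label among its k nearest training samples."""
    order = sorted(range(len(train_x)), key=lambda i: metric(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical sequence with 13 C, 20 G and 3 CG in 60 nucleotides,
# i.e. the counts given in the Fig. 1 caption (not the actual Fig. 1 sequence).
example = "CG" * 3 + "C" * 10 + "A" + "G" * 17 + "AT" * 13
cgp, cpgo = cpg_features(example)
print(round(cgp, 2), round(cpgo, 2))  # 0.55 0.69
```

In use, `cpg_features` would be applied to every training and test genome, and `knn_predict` would then be called on the resulting two-dimensional feature vectors with the labels "positive" and "negative".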

5.3. Performance measures

COVID-19 prediction as a binary classification problem has four prediction outcomes. True Positive (TP) is the number of genome sequences which are correctly classified as COVID-19 positive, True Negative (TN) is the number of genome sequences which are correctly classified as COVID-19 negative, False Positive (FP) is the number of genome sequences which are incorrectly classified as COVID-19 positive, and False Negative (FN) is the number of genome sequences which are incorrectly classified as COVID-19 negative. The performance evaluation metrics precision, recall, F-measure, and accuracy are defined by using these outcomes, which are arranged by actual and predicted label in the confusion matrix in Fig. 2. Next, we briefly explain the performance parameters used in this study.

Precision measures how reliable the positive predictions of the classifier are in the presence of false positives. It is computed as the ratio of the number of correctly classified positive samples to the number of samples labeled by the system as positive, as given in Eq. 3.

$$ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{3} $$

Recall is the fraction of all positive samples in the dataset that are predicted as positive. It is computed as in Eq. 4.

$$ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{4} $$

F-measure is the harmonic mean of precision and recall, and is computed using Eq. 5.

$$ F\text{-}\mathrm{Measure} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{5} $$

Accuracy is computed by dividing the total number of correctly classified cases by the number of all cases, as given in Eq. 6.

$$ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{6} $$

5.4. Evaluation of the proposed features and KNN classifier with respect to metrics

In this section, we evaluate the efficiency of the proposed features and the performance of the KNN classifier with respect to the metrics using fivefold cross-validation performed on the dataset. The dataset is randomly divided into a training set and a test set: 80% of the genome sequences are used for training and the remaining 20% of the dataset is used for validation. The experiments are repeated five times, as presented in Fig. 3. In our experiments, we increase the neighborhood size k from 1 to 20 and observe that the classification performance of the proposed method remains the same or slightly decreases as k increases. For this reason, k is set to 1 in all experiments.

Fig. 3. 5-fold cross validation.

Table 7 presents the precision, recall, and F-measure values of the KNN classifier with respect to the five metric groups: L1 type, L2 type, vicissitude, inner product metrics, and other types of metrics. The results of the metrics in the same group are close to each other; thus, we present the average of the results of the experiments performed with the metrics in the same group. First, we analyze the results of the KNN equipped with L1 type metrics, which are KM, ChebM, ManM, SM, CanM, and MCM. The KNN with L1 type metrics gives the best result, achieving a precision of 98.4%, a recall of 99.2%, and an F-measure of 98.8%. This means that almost all sequences are classified correctly. Next, we look at the results of the KNN with L2 type metrics, which are ClaM, EM, SquM, NCSM, DivM, and SCSM. The KNN classifier with this group of metrics achieves a precision of 96.0%, a recall of 98.2%, and an F-measure of 97.1% on average. The KNN with vicissitude metrics gives better results than the KNN with L2 type metrics: it achieves a precision of 98.4%, a recall of 99.0%, and an F-measure of 98.7%. Next, we investigate the results of the KNN with the inner product metrics, ChoM and DicM. This group of metrics gives the worst results among the groups, with a precision of 94.4%, a recall of 92.8%, and an F-measure of 93.4%. Finally, we investigate the other type of metrics, which are HasM and MotM. The results of the KNN with HasM are close to those of the KNN with inner product metrics: a precision of 94.4%, a recall of 92.9%, and an F-measure of 93.4%. On the other hand, the results of the KNN with MotM are close to those of the L2 type metrics and better than those of HasM: a precision of 96.4%, a recall of 98.2%, and an F-measure of 97.3%.
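The measures in Eqs. 3–6 follow directly from the four confusion-matrix counts. The sketch below shows one way to compute them; the function names and the example labels are illustrative assumptions and are not taken from the experiments.

```python
def confusion_counts(actual, predicted, positive="positive"):
    """Count TP, TN, FP and FN for a binary classification problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def performance_measures(tp, tn, fp, fn):
    """Return (precision, recall, F-measure, accuracy) as in Eqs. 3-6."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f_measure, accuracy


# Made-up labels: 9 actual positives (8 found, 1 missed) and
# 7 actual negatives (2 wrongly flagged, 5 correct).
actual = ["positive"] * 9 + ["negative"] * 7
predicted = ["positive"] * 8 + ["negative"] + ["positive"] * 2 + ["negative"] * 5
counts = confusion_counts(actual, predicted)
measures = performance_measures(*counts)
```

With fivefold cross-validation, these measures would be computed on each held-out fold and then averaged.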
Fig. 2. Confusion matrix.

In addition to the precision, recall, and F-measure values, we present accuracy results. We take the average of the accuracies of the KNN classifier obtained with the metrics in the same group, and show the results for each metric group separately in Fig. 4. The accuracy results of the KNN classifier using inner product metrics and HasM are close to each other and are the lowest, at 91.7% and 91.8%, respectively. When L2 type metrics are used, the method achieves an accuracy of 96.2%. The KNN with the MotM metric reaches an accuracy of 96.5%. The accuracy values of the KNN with L1 and vicissitude type metrics are close to each other, and our method achieves the best accuracy, 98.4%, when L1 type metrics are used.
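To make the compared groups concrete, one representative metric from each of the five categories can be written down from the definitions in Tables 1–5. This is only an illustrative sketch: the function names are hypothetical, inputs are assumed to be equal-length numeric vectors, and no guard is included for zero denominators.

```python
import math


def manhattan(x, y):
    """L1 type: Manhattan metric (ManM)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """L2 type: Euclidean metric (EM)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def vicis_symmetric_3(x, y):
    """Vicissitude: Vicis Symmetric 3 metric (VSDFM3)."""
    return sum((a - b) ** 2 / max(a, b) for a, b in zip(x, y))


def dice(x, y):
    """Inner product: Dice metric (DicM)."""
    dot = sum(a * b for a, b in zip(x, y))
    return 1 - 2 * dot / (sum(a * a for a in x) + sum(b * b for b in y))


def motyka(x, y):
    """Other: Motyka metric (MotM)."""
    return sum(max(a, b) for a, b in zip(x, y)) / sum(a + b for a, b in zip(x, y))


def hassanat(x, y):
    """Other: Hassanat metric (HasM), the sum of the per-coordinate terms D(x_i, y_i)."""
    total = 0.0
    for a, b in zip(x, y):
        lo, hi = min(a, b), max(a, b)
        if lo >= 0:
            total += 1 - (1 + lo) / (1 + hi)
        else:
            total += 1 - (1 + lo + abs(lo)) / (1 + hi + abs(lo))
    return total
```

Any of these functions can be passed as the distance measure of a KNN classifier, which is how the metric groups are compared in the experiments.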


Table 7
Average of the precision, recall, and F-measure values.

| Group | Metric name | Precision (%) | Recall (%) | F-measure (%) |
|---|---|---|---|---|
| L1 | KM, ChebM, ManM, SM, CanM, MCM | 98.4 | 99.2 | 98.8 |
| L2 | ClaM, EM, SquM, NCSM, DivM, SCSM | 96.0 | 98.2 | 97.1 |
| Vicissitude | VSDFM1, VSDFM2, VSDFM3 | 98.4 | 99.0 | 98.7 |
| Inner Product | ChoM, DicM | 94.4 | 92.8 | 93.4 |
| Other | HasM | 94.4 | 92.9 | 93.4 |
| Other | MotM | 96.4 | 98.2 | 97.3 |

Fig. 4. Accuracy of the proposed method with respect to each group of metric.

6. Comparisons with other methods

In this part, we compare the results of the proposed method with the state-of-the-art methods. Table 8 provides a simple comparison of the results in terms of the method, dataset, class, and accuracy. We investigate these studies under two categories: the studies using genome sequence datasets and those using other datasets.

First, we compare our method with the studies which detect COVID-19 cases from genome sequences. Randhawa et al. [23] combined supervised machine learning methods with digital signal processing. They used 29 sequences of SARS-CoV-2, and 20 sequences for each of alphacoronavirus, betacoronavirus, and deltacoronavirus. They achieved 100% accuracy when the linear discriminant method was used. The main limitation of their proposed method is the small number of sequences in their dataset. Furthermore, Randhawa et al. used deltacoronavirus genomes, although deltacoronavirus mainly causes infectious disease among bird species rather than humans. If the number of sequences in their dataset were increased and human coronavirus sequences genetically similar to SARS-CoV-2 were added, the overall accuracy of their method might decrease. Another study predicting COVID-19 from genome sequences was introduced by Naeem et al. [24]. They used the classical KNN method to distinguish SARS-CoV-2 sequences among SARS-CoV-2, SARS-CoV and MERS-CoV genome sequences. They used 76 sequences for each of SARS-CoV-2, SARS-CoV, and MERS-CoV, and achieved an accuracy of 100%. They extracted features by using the Discrete Fourier transform, Discrete cosine transform, and Seven Moment Invariants methods. When comparing to our method, their feature


Table 8
Comparison of existing state-of-the-art classification studies.

| Study | Method used | Dataset | Classes | Accuracy (%) |
|---|---|---|---|---|
| Randhawa et al. [23] | Linear Discriminant; Linear SVM; Quadratic SVM; Fine KNN; Subspace Discriminant; Subspace KNN | 29 COVID-19 virus sequences; 20 alphacoronavirus sequences; 20 betacoronavirus sequences; 20 deltacoronavirus sequences | COVID-19 vs other types of viruses | 100 |
| Naeem et al. [24] | Discrete Fourier Transform with KNN; Discrete Cosine Transform with KNN; Seven Moment Invariants with KNN | 76 COVID-19 sequences; 76 SARS-CoV sequences; 76 MERS-CoV sequences | COVID-19 vs SARS-CoV vs MERS-CoV | 100 |
| Proposed Method | KNN with L1 type metrics; CpG based features | 1000 COVID-19 sequences; 592 other coronavirus sequences | COVID-19 vs other types of coronaviruses | 98.4 |
| Unal and Dudak [26] | Naive Bayes; KNN; Support Vector Machines; Decision Tree Algorithm | 95839 cases with 19 types of laboratory findings | COVID-19 vs other diseases | 100 |
| Barstugan et al. [27] | Support Vector Machine | 150 abdominal CT images | COVID-19 vs other viral pneumonias | 99.6 |
| Apostolopoulos and Bessiana [31] | Deep Transfer Learning | 224 COVID-19 chest X-ray images; 700 images of bacterial pneumonia; 504 images of normal conditions | COVID-19 vs pneumonia vs normal | 98.75 |
| Ozturk et al. [28] | DarkCovidNet | Raw chest X-ray images | COVID-19 vs healthy vs pneumonia | 87.02 |
| Jain et al. [30] | VGG16; ResNet18; DenseNet121; ResNet101 | Chest X-ray images | COVID-19 vs bacterial pneumonia vs viral pneumonia vs healthy | 98.93 |
| Asnaoui and Chawki [33] | VGG16; VGG19; DenseNet201; Inception ResNet V2; Inception V3; ResNet50; MobileNet V2 | 1583 normal X-ray and CT images; 231 COVID-19 images; 2780 images of bacterial pneumonia; 1493 coronavirus images | COVID-19 vs other viral pneumonias | 92.18 |
| Basu et al. [34] | Domain Extension Transfer Learning | Chest X-ray images | COVID-19 vs normal vs pneumonia vs other diseases | 95.3 |

extraction method is expensive. Another disadvantage of the model established by Naeem et al. [24] is that, like Randhawa et al. [23], they work with a smaller dataset.

Next, we discuss the advantages and drawbacks of the methods using other datasets. Unal and Dudak [26] used the Mexico Patient Health Dataset to detect COVID-19 cases. Although their dataset is large, it only represents the features of a specific region. Furthermore, they used 19 different features, including the sex and age of the patient, the state of pneumonia and intubation, and the state of many other diseases. One advantage of our model over the method of Unal and Dudak [26] is that it uses only two powerful and effective features derived from the complete genome sequences of human coronaviruses. Thus, our proposed method detects COVID-19 positive cases within a few seconds, whereas Unal and Dudak [26] did not report the detection time. Another difference is that they use the classical KNN classifier with default parameters, whereas we use the KNN classifier with an optimum distance metric.

There are a number of existing studies in the literature that detect COVID-19 cases from image datasets. In Table 8, we also list six image-based studies achieving remarkable accuracy values. These image-based methods used different deep learning or machine learning methods. Although some image-based studies report a higher accuracy, given the high mutation rate of SARS-CoV-2, using genomic sequences is extremely beneficial for tracking coronavirus genes that change frequently as the disease spreads from one person to another. Moreover, the radiation generated by X-ray or CT scanning machines may cause permanent damage to people. For this reason, X-ray or CT scans may not be obtainable for some people, which can be considered a disadvantage of image-based studies.

7. Conclusion

COVID-19 is an ongoing epidemic that sets new records in terms of cumulative and daily numbers of global infections. The pandemic is an unprecedented situation for healthcare systems worldwide, and to overcome it, it is essential to accurately detect COVID-19 cases by analyzing patient data in a minimum amount of time. In this study, we propose an accurate and fast method to detect COVID-19 positive cases from genome sequences of human coronaviruses. In the proposed method, first, the features that significantly differentiate COVID-19 cases are extracted from the complete genome sequences of human coronaviruses. In this step, we propose to use CpG island features. Each genome sequence of a human coronavirus, which includes about 30,000 nucleotides, is represented by only two real numbers. The feature extraction step takes just a few seconds. Second, the KNN method is used to classify COVID-19 positive cases against the other types of human coronaviruses: AlphaCoV, Beta-Cov-1, MERS-CoV, NL63-CoV, HKU1-CoV, and 229E-CoV. The KNN classifier is a simple method with high flexibility for solving complex classification problems. The accuracy of the KNN is higher than that of state-of-the-art classifiers in certain cases, and it often performs efficiently. However, the performance of the KNN greatly depends on the metric used. To identify the most appropriate metric, we review five groups of metrics used in the KNN classifier. The selection of different distance metrics for the KNN can result in a variation in accuracy for the same dataset. Experimental results reveal that the proposed method achieves its highest accuracy, 98.4% on average, in a few seconds when L1 type metrics are used as the distance measure in the KNN. In future studies, we will compare human SARS-CoV-2 sequences to other types of coronavirus sequences such as bat SARS-CoV-like coronaviruses 2, and propose a similarity-based feature to increase overall accuracy. In addition, the factors affecting the recovery status of patients suffering from COVID-19 may be investigated in future studies by combining machine learning and parallel computing methods with effective features.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] D. Schoeman, B. Fielding, Coronavirus envelope protein: current knowledge, Virol. J. 16, https://doi.org/10.1186/s12985-019-1182-0.
[2] R.J. de Groot, S. Baker, R. Baric, L. Enjuanes, A. Gorbalenya, K. Holmes, S. Perlman, L. Poon, P. Rottier, P. Talbot, et al., Family coronaviridae, Virus Taxon. (2012) 806–828.
[3] J. Cui, F. Li, Z.L. Shi, Origin and evolution of pathogenic coronaviruses, Nat. Rev. Microbiol. 17 (2019) 181–192, https://doi.org/10.1371/journal.pone.0119815.
[4] A chronicle on the SARS epidemic, Chin. Law Govern. 36 (4) (2003) 12–15, https://doi.org/10.2753/CLG0009-4609360412.
[5] A. Zumla, D.S. Hui, S. Perlman, Middle East respiratory syndrome, The Lancet 386 (9997) (2015) 995–1007, https://doi.org/10.1016/S0140-6736(15)60454-8.
[6] L. Xingguang, W. Wei, Z. Xiaofang, Z. Junjie, Z. Qiang, L. Yi, C. Antoine, Transmission dynamics and evolutionary history of 2019-nCoV, J. Med. Virol. 92 (2020) 501–511, https://doi.org/10.1002/jmv.25701.


[7] L. Fu, B. Wang, T. Yuan, X. Chen, Y. Ao, T. Fitzpatrick, P. Li, Y. Zhou, Y. fan Lin, Q. Duan, G. Luo, S. Fan, Y. Lu, A. Feng, Y. Zhan, B. Liang, W. Cai, L. Zhang, X. Du, L. Li, Y. Shu, H. Zou, Clinical characteristics of coronavirus disease 2019 (COVID-19) in China: a systematic review and meta-analysis, J. Infect. 80 (6) (2020) 656–665, https://doi.org/10.1016/j.jinf.2020.03.041.
[8] R. Vaishya, M. Javaid, I. Khan, A. Haleem, Artificial intelligence (AI) applications for COVID-19 pandemic, Diab. Metab. Syndrome Clin. Res. Rev. 14, https://doi.org/10.1016/j.dsx.2020.04.012.
[9] G.G. Waleed Salehi A, Baglat P, Review on machine and deep learning models for the detection and prediction of coronavirus, Mater. Today Proc., https://doi.org/10.1016/j.matpr.2020.06.245.
[10] M.A. Dey L., S. Chakraborty, Machine learning techniques for sequence-based prediction of viral–host interactions between SARS-CoV-2 and human proteins, Biomed. J., https://doi.org/10.1016/j.bj.2020.08.003.
[11] P. Gupta, A. Sharma, R. Jindal, Scalable machine-learning algorithms for big data analytics: a comprehensive review, WIREs Data Min. Knowl. Discovery 6 (6) (2016) 194–214, https://doi.org/10.1002/widm.1194.
[12] M.M. Najafabadi, F. Villanustre, T.M. Khoshgoftaar, N. Seliya, R. Wald, E. Muharemagic, Deep learning applications and challenges in big data analytics, J. Big Data 2 (1) (2015) 194–214, https://doi.org/10.1186/s40537-014-0007-7.
[13] T.T. Zin, J.C.W. Lin, Big data analysis and deep learning applications: proceedings of the first international conference on big data analysis and deep learning, 2019. https://www.springer.com/gp/book/9789811308680.
[14] T. Das, Machine learning algorithms for image classification of hand digits and face recognition dataset, Int. Res. J. Eng. Technol. 4 (12) (2017) 640–649.
[15] M. Sharma, J. Anuradha, H. Manne, G. Kashyap, Facial detection using deep learning, IOP Conf. Ser.: Mater. Sci. Eng. 263 (2017) 042092, https://doi.org/10.1088/1757-899X/263/4/042092.
[16] P. Viola, M. Jones, Robust real-time face detection, Int. J. Comput. Vision 57 (2004) 137–154, https://doi.org/10.1023/B:VISI.0000013087.49260.fb.
[17] K. Lewandowski, Y. Xu, S. Pullan, S. Lumley, D. Foster, N. Sanderson, A. Vaughan, M. Morgan, N. Bright, J. Kavanagh, R. Vipond, M. Carroll, A. Marriott, K. Gooch, M. Andersson, K. Jeffery, T. Peto, D. Crook, A. Walker, P. Matthews, Metagenomic nanopore sequencing of influenza virus direct from clinical respiratory samples, J. Clin. Microbiol. 58, https://doi.org/10.1128/JCM.00963-19.
[18] L. Kafetzopoulou, K. Efthymiadis, K. Lewandowski, A. Crook, D. Carter, J. Osborne, E. Aarons, R. Hewson, J. Hiscox, M. Carroll, R. Vipond, S. Pullan, Assessment of metagenomic nanopore and Illumina sequencing for recovering whole genome sequences of chikungunya and dengue viruses directly from clinical samples, Eurosurveillance 23, https://doi.org/10.2807/1560-7917.ES.2018.23.50.1800228.
[19] A. Khanday, S. Rabani, Q. Khan, et al., Machine learning based approaches for detecting COVID-19 using clinical text data, Int. J. Inf. Technol. 12 (2020) 731–739, https://doi.org/10.1007/s41870-020-00495-9.
[20] L. Li, L. Qin, Z. Xu, Y. Yin, X. Wang, B. Kong, J. Bai, Y. Lu, Z. Fang, Q. Song, K. Cao, D. Liu, G. Wang, Q. Xu, X. Fang, S. Zhang, J. Xia, J. Xia, Using artificial intelligence to detect COVID-19 and community-acquired pneumonia based on pulmonary CT: evaluation of the diagnostic accuracy, Radiology 296 (2) (2020) E65–E71, PMID: 32191588, https://doi.org/10.1148/radiol.2020200905.
[21] A. Alimadadi, S. Aryal, I. Manandhar, P.B. Munroe, B. Joe, X. Cheng, Artificial intelligence and machine learning to fight COVID-19, Physiol. Genom. 52 (4) (2020) 200–202, PMID: 32216577, https://doi.org/10.1152/physiolgenomics.00029.2020.
[22] J. Chen, K. Li, Z. Zhang, K. Li, P.S. Yu, A survey on applications of artificial intelligence in fighting against COVID-19, arXiv e-prints (2020), arXiv:2007.02202.
[23] G.S. Randhawa, M.P.M. Soltysiak, H. El Roz, C.P.E. de Souza, K.A. Hill, L. Kari, Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study, vol. 15, Public Library of Science, 2020, pp. 1–24, https://doi.org/10.1371/journal.pone.0232391.
[24] S.M. Naeem, M.S. Mabrouk, S.Y. Marzouk, M.A. Eldosoky, A diagnostic genomic signal processing (GSP)-based system for automatic feature analysis and detection of COVID-19, Brief. Bioinf. bbaa170, https://doi.org/10.1093/bib/bbaa170.
[25] A.F.d.M. Batista, J.L. Miraglia, T.H.R. Donato, A.D.P. Chiavegatto Filho, COVID-19 diagnosis prediction in emergency care patients: a machine learning approach, medRxiv, https://doi.org/10.1101/2020.04.04.20052092.
[26] Y. Ünal, M.N. Dudak, Classification of COVID-19 dataset with some machine learning methods (2020).
[27] M. Barstugan, U. Ozkaya, S. Ozturk, Coronavirus (COVID-19) classification using CT images by machine learning methods (2020), arXiv:2003.09424.
[28] T. Ozturk, M. Talo, E.A. Yildirim, U.B. Baloglu, O. Yildirim, U. Rajendra Acharya, Automated detection of COVID-19 cases using deep neural networks with X-ray images, Comput. Biol. Med. 121 (2020) 103792, https://doi.org/10.1016/j.compbiomed.2020.103792.
[29] B. Sekeroglu, I. Ozsahin, Detection of COVID-19 from chest X-ray images using convolutional neural networks, SLAS Technol. Transl. Life Sci. Innov. 0 (0) (0) 2472630320958376, PMID: 32948098, https://doi.org/10.1177/2472630320958376.
[30] G. Jain, D. Mittal, D. Thakur, M.K. Mittal, A deep learning approach to detect COVID-19 coronavirus with X-ray images, Biocybern. Biomed. Eng. 40 (4) (2020) 1391–1405, https://doi.org/10.1016/j.bbe.2020.08.008.
[31] I. Apostolopoulos, T. Bessiana, COVID-19: automatic detection from X-ray images utilizing transfer learning with convolutional neural networks, Phys. Eng. Sci. Med. 43 (2) (2020) 635–640, https://doi.org/10.1007/s13246-020-00865-4.
[32] S. Ahuja et al., Deep transfer learning-based automated detection of COVID-19 from lung CT scan slices, Appl. Intell. (2020), https://doi.org/10.1007/s10489-020-01826-w.
[33] K.E. Asnaoui, Y. Chawki, Using X-ray images and deep learning for automated detection of coronavirus disease, J. Biomol. Struct. Dyn. 0 (0) (2020) 1–12, PMID: 32397844, https://doi.org/10.1080/07391102.2020.1767212.
[34] S. Basu, S. Mitra, N. Saha, Deep learning for screening COVID-19 using chest X-ray images (2020), arXiv:2004.10507.
[35] J. Maillo, I. Triguero, F. Herrera, A MapReduce-based k-nearest neighbor approach for big data classification, in: 2015 IEEE Trustcom/BigDataSE/ISPA, vol. 2, 2015, pp. 167–172.
[36] Z. Deng, X. Zhu, D. Cheng, M. Zong, S. Zhang, Efficient kNN classification algorithm for big data, Neurocomput. 195 (C) (2016) 143–148, https://doi.org/10.1016/j.neucom.2015.08.112.
[37] J. Maillo, S. Ramírez, I. Triguero, F. Herrera, kNN-IS: an iterative Spark-based design of the k-nearest neighbors classifier for big data, Knowl.-Based Syst. 117, https://doi.org/10.1016/j.knosys.2016.06.012.
[38] F. Wang, Q. Wang, F. Nie, W. Yu, R. Wang, Efficient tree classifiers for large scale datasets, Neurocomputing, https://doi.org/10.1016/j.neucom.2017.12.061.
[39] S.H. Cha, Comprehensive survey on distance/similarity measures between probability density functions, Int. J. Math. Models Methods Appl. Sci. 1 (4) (20) 300–307.
[40] H. Abu Alfeilat, A. Hassanat, O. Lasassmeh, A. Tarawneh, M. Alhasanat, H. Eyal-Salman, S. Prasath, Effects of distance measure choice on k-nearest neighbor classifier performance: a review, Big Data 7, https://doi.org/10.1089/big.2018.0175.
[41] E. Fix, J.L.H. (1951), Discriminatory analysis. Nonparametric discrimination: consistency properties, Technical Report 4, USAF School of Aviation Medicine, Randolph Field, TX, USA. http://www.jstor.org/stable/1403797.
[42] T. Cover, P. Hart, Nearest neighbor pattern classification, IEEE Trans. Inf. Theory 13 (1).
[43] X. Wu, V. Kumar, J. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G.J. McLachlan, A.F.M. Ng, B. Liu, P. Yu, Z. Zhou, M. Steinbach, D. Hand, D. Steinberg, Top 10 algorithms in data mining, Knowl. Inf. Syst. 14 (2007) 1–37.
[44] J. Gou, W. Qiu, Z. Yi, X. Shen, Y. Zhan, W. Ou, Locality constrained representation-based k-nearest neighbor classification, Knowl.-Based Syst. 167 (2019) 38–52, https://doi.org/10.1016/j.knosys.2019.01.016.
[45] J. Gou, H. Ma, W. Ou, S. Zeng, Y. Rao, H. Yang, A generalized mean distance-based k-nearest neighbor classifier, Expert Syst. Appl. 115 (2019) 356–372, https://doi.org/10.1016/j.eswa.2018.08.021.
[46] X. Xia, Extreme genomic CpG deficiency in SARS-CoV-2 and evasion of host antiviral defense, Mol. Biol. Evol. 37 (9) (2020) 2699–2705, https://doi.org/10.1093/molbev/msaa094.
[47] Y. Wang, J.M. Mao, G.D. Wang, Z.P. Luo, L. Yang, Q. Yao, K.P. Chen, Human SARS-CoV-2 has evolved to reduce CG dinucleotide in its open reading frames, Sci. Rep. 10 (2020) 5165–5184.
[48] H. Dinka, A. Milkesa, Unfolding SARS-CoV-2 viral genome to understand its gene expression regulation, Infect. Genet. Evol. 84, https://doi.org/10.1016/j.meegid.2020.104386.
[49] The 2019 novel coronavirus resource, https://bigd.big.ac.cn/ncov, accessed: 2020-09-24.
