Tracing Back The Temporal Change of SARS-CoV-2 With Genomic Signatures

bioRxiv preprint doi: https://doi.org/10.1101/2020.04.24.057380; this version posted May 7, 2020.
The copyright holder for this preprint (which

was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Tracing Back the Temporal Change of SARS-CoV-2

with Genomic Signatures
Sourav Biswas1,∗ , Suparna Saha1,∗ , Sanghamitra Bandyopadhyay1 , and
Malay Bhattacharyya1,2
1
Machine Intelligence Unit
Indian Statistical Institute, Kolkata
203 B. T. Road, Kolkata – 700108, India

2
Centre for Artificial Intelligence and Machine Learning
Indian Statistical Institute, Kolkata
203 B. T. Road, Kolkata – 700108, India
E-mail correspondence: malaybhattacharyya@isical.ac.in
Phone: (+91) (33) 2575 2231

∗
These authors contributed equally
May 8, 2020
Abstract
The coronavirus disease (COVID-19) outbreak starting from China at the end of
2019 and its subsequent spread in many countries have given rise to thousands of
coronavirus samples being collected and sequenced till date. To trace back the initial
temporal change of SARS-CoV-2, the coronavirus implicated in COVID-19, we study
the limited genomic sequences that were available within the first couple of months of
its spread. These samples were collected under varying circumstances and highlight
wide variations in their genomic compositions. In this paper, we explore whether these
variations characterize the initial temporal change of SARS-CoV-2 sequences. We ob-
serve that n-mer distributions in the SARS-CoV-2 samples, which were collected at an
earlier period of time, predict its collection timeline with approximately 78% accuracy.
However, such a distinctive pattern disappears with the inclusion of samples collected
at a later time. We further observe that isolation sources (e.g., oronasopharynx, saliva,
feces, etc.) could not be predicted by the n-mer patterns in these sequences. Finally,
1
bioRxiv preprint doi: https://doi.org/10.1101/2020.04.24.057380; this version posted May 7, 2020. The copyright holder for this preprint (which
the phylogenetic and protein-alignment analyses highlight interesting associations be-
tween SARS-CoV-2 and other coronaviruses.
Keywords: SARS-CoV-2, COVID-19, disease outbreak, phylogenetic tree.
1 Introduction
Coronaviruses are a large family of RNA viruses that cause diseases in mammals and birds [1].
Coronaviruses have been identified broadly in avian hosts along with various mammals, which
include camels, masked palm civets, bats, mice, dogs, cats, and human. The coronaviruses
cause respiratory and neurological diseases in hosts [2]. In human, it causes several illnesses,
ranging from mild common cold to much severe diseases such as SARS-CoV (Severe Acute
Respiratory Syndrome) and MERS-Cov (Middle East Respiratory Syndrome). They are
enveloped viruses with a positive-sense single-stranded RNA genome. The genome size of
coronaviruses ranges between 26-32 kb [3], the largest among known RNA viruses [4].
The recent coronavirus outbreak in China [5], originating from Wuhan, has been char-
acterized by the identification of a novel type of betacoronavirus SARS-CoV-2 that infects
human [6]. In this paper, we restrict our study to those samples collected and sequenced
between December, 2019 and January, 2020. In China, 11,791 cases were identified as in-
fected with SARS-CoV-2, and 17,988 were suspected in 34 provinces as of January 31, 2020
[7]. Due to the quick spread of SARS-CoV-2, more than 200 countries and territories around
the world have already been marked with the viral infection [8]. It is therefore interesting
to trace back whether the origin and spread of this virus has any influence on the change of
its nucleotide sequence.
The sequencing of viral genomes has opened up new promises through bioinformatics
analyses [9, 10]. With the recent efforts of sequencing the SARS-CoV-2 samples collected
across the world, hundreds of genomic samples are becoming available day by day [11]. This
has lead to the discovery of new pathophysiological facts about coronaviruses. It is equally
interesting to explore the mutations happening in the samples emerging across the world. In
this paper, we aim to verify whether the changes of sequence in the samples can reveal the
temporal footprints (precisely the month of collection) of this virus.
2 Related Work
Since the outbreak of 2019 novel coronavirus disease (COVID-19) in late December, 2019,
several studies have already been carried out in understanding the virus and its pathophysiol-
ogy. The SARS-CoV-2 has been identified by the close genomic observations of clinical sam-
ples from patients with viral pneumonia in Wuhan, China [12]. Identification of the molecular
2
mechanism of SARS-CoV-2 is essential to prevent and control the disease. Through next
generation sequencing, homology has been observed in the structure of the receptor-binding
domain between SARS-CoV and SARS-CoV-2. The SARS-CoV-2 is found to largely deviate
from both MERS-CoV and SARS-CoV and is closely related to two bat-derived SARS-like
coronaviruses [12].
Another recent work has performed an entropy-based analysis with the diverse genomes
and proteomes of SARS-CoV-2 [13]. The results of this study highlighted low varying ge-
nomic sequences within the SARS-CoV-2 sequenced samples where the higher variability was
observed in at least two nucleotide positions in protein-coding regions. The interplay between
the receptor-binding domain (RBD) of spike glycoprotein of SARS-CoV-2 and angiotensin-
converting enzyme II (ACE2) complex in human had been extensively studied in several
other papers. As the free energy of RBD-ACE2 binding for SARS-CoV-2 is remarkably
lower than that for SARS-CoV, the SARS-CoV-2 is epidemically stronger than SARS-CoV
[14].
The presence of high prevalence, ecological distribution, tremendous genetic diversity, and
frequent alterations of coronavirus genomes have already been implicated in the recurrence
of this virus in humans [15]. However, there is hardly any study that links the sequence
patterns (precisely n-mer distributions here) to the timeline of sample collection. There
are many recent resources on coronaviruses due to their repeated outbreak. CoVDBd is a
1
repository of annotated coronaviruses [16]. There are recent attempts to understand the
genome composition and divergence of SARS-CoV-2 [17], however, such studies are restricted
to China. Our analysis is more global from this perspective.
3 Materials and Methods
The materials and methods used in this paper are all available either from the developers or
from us for public use. Additional details are given below.
3.1 Dataset Details
We collected the SARS-CoV-2 genomic sequences from NCBI Virus portal2 that cover multi-
ple countries. Only those countries that have at least 5 samples were considered and complete
genomic sequences were taken. Total 36 sequences (17 samples from China, 5 samples from
Japan, and 14 samples from USA) were obtained from the 87 genomic sequences of SARS-
CoV-2 available in NCBI. These samples are gathered from multiple patients. For all these
genomic sequences, we collected the geographical locations, sources of isolation, and months
1
genes/genomes http://covdb.microbiology.hku.hk
2
https://www.ncbi.nlm.nih.gov/labs/virus (accessed in February, 2020)
3
of collection. The collection month of SARS-CoV-2 samples confirm the presence of the virus
in the patients of China in December, 2019 and January, 2020, of USA in January, 2020, and
of Japan in January, 2020. The exact collection dates are available for the samples of China
(except one) and USA but not for Japan (see Supplementary Dataset 3). The sequences from
Japan have recently been removed from NCBI but are still accessible3 . An inquiry to the
submitter regarding the reason of this removal has resulted in no reply so far. The sequences
of other types of coronaviruses were collected from NCBI for phylogenetic analysis.
3.2 Methods
The Sieve plot was created for geographical location versus isolate (source of viral sample
collection) using the Orange toolbox4 [18]. The area of each rectangle in this is proportional
to the expected frequency of different isolation sources with respect to the countries, while
the observed frequency is shown by the number of squares in each rectangle. The Sieve
diagram highlights frequencies in a two-way contingency table and compare them to expected
frequencies under the assumption of independence. Several classification algorithms were
performed that are standard [19]. The classification results were taken from Orange. For this,
the default values of the parameters set in Orange were considered. The performance of the
classification models are determined in terms of the five evaluation criteria, namely accuracy
T P +T N TP TP 2T P
(= T P +T N +F P +F N
), precision(= T P +F P
), recall (= T P +F N
), F1 score (= 2T P +F P +F N
), and
AUC (area under ROC curve), where T P , T N , F P and F N stands for the number of true
positives, true negatives, false positives and false negatives, respectively.
FASTX (version 36.3.8h Aug, 2019) was used for multiple sequence alignment of DNA
sequence of the reference genome of SARS-CoV-2 with the proteins in UnitProt [20]. The
parameters for FASTX were: BL50 matrix (15:-5), open/ext: -10/-2, shift: -20, ktup: 2, E-
join: 0.5 (0.0832), E-opt: 0.1 (0.0198) and width: 16. The fast family and domain predictions
were also obtained using FASTX.
The phylogenmetic tree was built using the tool embedded in NCBI Virus and further
processed in the form of a consensus network using SplitsTree (version 5.0.0 alpha). The
original input consisted of 269 taxa. The Consensus Network method [21] was used (with
default options) so as to obtain 535 splits as compatible.
4 Results
The different analyses performed on the SARS-CoV-2 genomic sequences are reported here-
under.
3
https://www.ncbi.nlm.nih.gov/nuccore
4
https://orange.biolab.si
4
Figure 1: The heatmap reflecting the similarity of n-mer distribution across multiple SARS-
CoV-2 samples. The samples are grouped by the geographical location from where they were
collected.
4.1 Extraction of Genomic Signatures
The frequency counts for several n-mers (for n = 1, 2 and 3) were extracted yielding a total
84 (4+16+64) features. The frequency values were normalized by dividing with the length of
the sequence. The feature similarity across the different samples showed higher importance
of shorter n-mers (Fig. 1).
To test the power of n-mer features in distinguishing different geographical locations
from where the samples were collected, hierarchical clustering with complete linkage was
performed (with number of clusters set to be 3) on the feature data. On labeling the
clustering result (Fig. 2), a strong group between the samples collected from Japan was
found. The samples from USA and China were highly overlapping. It is interesting to note
that all the samples from Japan were collected in late January, whereas those from USA and
China were collected much earlier. This highlights a possibility of having a temporal change
of SARS-CoV-2 sequences at an early stage.
4.2 Tracing Back the Temporal Change
As the samples we considered were collected over a period of couple of months, classification
results were obtained by setting collection months as the labels against the samples. The
collection months December, 2019 and January, 2020 were used as the labels of the samples
and multiple classification algorithms were applied (with 5-fold cross validation). The clas-
sification results (with 5-fold cross-validation), in terms of classification accuracy, precision,
recall, F1 score, and area under the ROC curve (AUC), are reported in Table 1. Close to
5
Figure 2: The clusters obtained from the samples based on n-mer features by applying
hierarchical clustering with complete linkage. The samples from Japan that were collected
relatively later came under a single group. Note that, all the samples from Japan have a
distinct collection month of January, 2020.
78% prediction accuracy was obtained with the Random Forest model, thereby indicating a
connection between the changes of n-mer in the SARS-CoV-2 sequences over time.
Table 1: Performance evaluation of different classification approaches (5-fold cross validation)

based on the 84 n-mer features to distinguish the months when samples were collected. The
best values over a column are shown in bold.
Classification Accuracy Precision Recall F1 Score AUC

k-NN 77.8% 0.763 0.778 0.758 0.743
AdaBoost 77.8% 0.815 0.778 0.787 0.793
Random Forest 77.8% 0.767 0.778 0.770 0.820
As many of the samples from USA and Japan were collected lately, we further studied the
importance of both geographical location and temporal signature to distinguish the footprint
of SARS-CoV-2. On performing a clustering of samples based on the best features labeled
with geographical location and collection time, well segregated groups were found. This
highlights a clear distinction of samples based on collection details (Fig. 3). The gradual
mutations in SARS-CoV-2 sequences might be the reason of the said distinguishing pattern
between the samples collected in December, 2019 and January, 2020. Unfortunately, with
the newer (collected in March and April, 2020)) SARS-CoV-2 samples added to the list, this
distinctive pattern of the corresponding sequences no more exists. However, it is interesting
to note that, our classification attempts consider the class label as month of collection and
6
Figure 3: The clustering of samples based on selected features to distinguish collection

month.
the data is skewed over the month. In fact, the samples are now getting sequenced with an
exponential growth. Therefore, further investigation is required for a deeper understanding
of the temporal change of SARS-CoV-2 sequences at a later phase.
4.3 Relevance of Isolation Sources
Countrywise isolation sources were very much overlapped and do not provide any significant
information. The distribution of samples across the different isolation sources as collected
from each geographical location are shown in Fig. 4.
To further verify the significance of isolation sources, multiple classification algorithms
were applied on the feature data with 5-fold cross validation. The same n-mer based normal-
ized features (84 in total) were used. The cross validation results, in terms of classification
accuracy, precision, recall, F1 score, and area under the ROC curve (AUC), are reported in
Table 2. The classification results were not good based on the sources of isolation of the
samples studied.
Table 2: Performance evaluation of different classification approaches (5-fold cross validation)

based on the 84 n-mer features to distinguish the sources of isolation from where samples
were collected. The best values over a column are shown in bold.
Classification Accuracy Precision Recall F1 Score AUC

k-NN 41.7% 0.409 0.417 0.407 0.669
AdaBoost 52.8% 0.479 0.528 0.501 0.678
Random Forest 58.3% 0.502 0.583 0.539 0.800
On collecting the isolates of the samples, a significant overlap was found (Fig. 5). The
observed frequencies of the geographical location and isolate in these samples indicate that
the isolate is not an important feature to distinguish (N = 36, χ2 = 42.45, p = 0) the
7
Figure 4: The distribution of samples across the sources of isolation (termed as isolates) and geographical location.
8
Figure 5: The Sieve plot showing distribution of samples in different geographical locations
(marked as Country) versus different isolates (marked as Isolate).
coronavirus samples. Hence, there is still no affinity towards the collection of samples from
different body parts in human.
4.4 Phylogenetic Analaysis
The phylogenetic tree was constructed based on the 36 sequences of the SARS-CoV-2 samples
that we studied against the other samples of coronaviruses (Fig. 6). The results highlight a
strong overlap of SARS-CoV-2 with human SARS and bat coronaviruses. Most interestingly,
some of the sample sequences of SARS-CoV-2 collected from USA exhibit a separate phylo-
genetic closeness than the others. In fact, in a majority of the cases, the samples from the
same location comes under the same branch in this tree, thereby highlighting a evolutionary
connection between the sample and its collection footprint.
4.5 Protein Analysis
On comparing the genomic sequence of Wuhan seafood market pneumonia virus isolate
Wuhan-Hu-1 complete genome (29903 nt) with the protein sequence data bank UnitProt, two
major proteins were found to match with many different coronaviruses. These include human,
bat, middle east respiratory syndrome-related, murine, bovin, avian, porcine transmissible
gastroenteritis, and feline coronaviruses, with bat coronaviruses having a major share (see
Supplementary Fig. 1). Many of these are in fact experimentally validated (PE=1 as marked
in UniProt). Interestingly, out of the three proteins found, two (Replicase polyprotein 1a
and 1ab) were aligned to many coronaviruses but the third one (spike glycoprotein) was
aligned to human SARS coronavirus only. These proteins are now being studied in many
recent studies to link mutations in them with the pathophysiology of SARS-CoV-2 [17]. The
fast family and domain prediction results (see Supplementary Fig. 2) also highlight a deep
9
MT039888_||USA|2020-01-29
KY417145_||China|2012-09-18 MN997409_||USA|2020-01-22
KT444582_||China|2013-07-21 MT049951_||China|2020-01-17
KY417150_||China|2013-07-21 MT020881_||USA|2020-01-25 MN996531_||China|2019-12-30
KY417144_||China|2012-09-18 MN985325_||USA|2020-01-19 MT093631_||China|2020-01-08
KY417151_||China|2014-10-24 MT020880_||USA|2020-01-25 MT019531_||China|2019-12-30
KY417152_||China|2015-10-16 MT106053_||USA|2020-02-10 MN994467_||USA|2020-01-23
MN988668_||China|2020-01-02 MT044257_||USA|2020-01-28
MK211376_||China|2016-09
KY417142_||China|2014-05-12 MN996527_||China|2019-12-30
MN996530_||China|2019-12-30
MT027064_||USA|2020-01-29
KY417147_||China|2013-04-17 MN988669_||China|2020-01-02
KY417148_||China|2013-04-17 MN908947_||China|2019-12 MT019529_||China|2019-12-23
KY417143_||China|2012-09-18 KJ473814_||China|2013 MT019532_||China|2019-12-30
KY417149_||China|2013-04-17 JX993987_||China|2011-09 NC_045512 ||China|2019-12 MT106052_||USA|2020-02-06
KY770859_||China|2013 JX993988_||China|2011 MG772934_||China|2015-07 MT039873_||China|2020-01-20 MN975262_||China|2020-01-11
KY770858_||China|2013 AY278489_||| GQ153547_||| MT019533_||China|2020-01-01
KJ473816_||China|2013 AY508724_||| DQ648857_||| MT118835_||USA|2020-02-23
KJ473815_||China|2013KY417146_||China|2013-04-17 MG772933_||China|2017-02
MT044258_||USA|2020-01-27
KY770860_||China|2012MK211377_||China|2016-09 GQ153542_|||
MT039887_||USA|2020-01-31
KY938558_||South_Korea|2016-04-17
MN996529_||China|2019-12-30
KU182964_||China|2013-08
MT027062_||USA|2020-01-29
KF636752_||China|2013-04-29
MT027063_||USA|2020-01-29
NC_025217 ||China|2013-04-29
MT019530_||China|2019-12-30
GU190215_||Bulgaria|2008
MT106054_||USA|2020-02-11
NC_014470 ||Bulgaria|2008
MG916901_||China|2012-05-19
MG916903_||China|2013-04-17
MG916902_||China|2012-09-16
MG916904_||China|2013-06-01
KY770851_||China|2013
KY770850_||China|2013
KM349743_||China|2012-05-17
KM349744_||China|2012-05-17
NC_026011 ||China|2012-05-17
KM349742_||China|2012-05-17
KU182965_||China|2012-12
MK492263_||Singapore|2015
EF065509_||China|
NC_039207 ||Germany|2012
EF065505_||China|
DQ415905_||China|
DQ415900_||China|
DQ415901_||China|
DQ415908_||China|
DQ415909_||China|
KY674941_||USA|2016 DQ415910_||China| DQ415907_||China|
DQ415896_||China| DQ415904_||China|
KY674942_||USA|2016 DQ415906_||China| KT779555_||China|2009-04-23
KT779556_||China|2009-04-23
KY674943_||USA|2016 DQ415914_||China|
DQ415903_||China|NC_006577 |||HM034837_||France|2005-03-04
KF686341_||USA|2010-01-16
KF430201_||USA|2010-01-22
KF686340_||USA|2009-11-28
KF686342_||USA|2009-12-13
MK167038_||USA|2017
KY674921_||USA|2016
DQ415911_||China|
DQ415912_||China|
DQ415902_||China|
DQ415897_||China|
AY884001_|||
DQ415899_||China|
MH940245_||Thailand|2017-06-04
DQ415913_||China|
DQ339101_|||
DQ415898_||China|
KF530092_||USA|2000-08-08
KF530072_||USA|1997-12-11
KF530080_||USA|1997-12-18
KF530064_||USA|1996-12-04
KF530070_||USA|1999-01-15 KF530078_||USA|1996-12-17
KF530081_||USA|1999-01-07 KY014281_||France|2002 NC_034440 ||Uganda|2013-02-20
KF530068_||USA|2000-07-27 KY967360_||USA|2015 KX574227_||Uganda|2013-02-20
KF530099_||USA|1997-01-02 MF314143_||USA|2016-03-07
KF530063_||USA|1996-12-30 KY014282_||France|2007 EF065513_||China|
KF530069_||USA|1998-02-05 KP198610_||China|2010-06
KF530096_||USA|1991-01-15 KF923906_||China|2012-03 KU762338_||China|2014-05-28
KF530094_||USA|1991-02-22 KY554972_||USA|2016 NC_030886 ||China|2014-05-28
KF530082_||USA|1991-02-07
KF530089_||USA|1991-01-29 NC_006213 ||USA|
KF530079_||USA|1991-03-14 KY674917_||USA|2016
KF530071_||USA|1992-05-04
KF530088_||USA|1990-01-23 KY674918_||USA|2016
MN310478_||USA|2019
KF530091_||USA|1991-01-24 KY554973_||USA|2016
MN306041_||USA|2019
KF530067_||USA|1991-02-07
MN306042_||USA|2019
KY967359_||USA|2015 MN306053_||USA|2019
MF374984_||USA|2017-01-03 MN306036_||USA|2019
KY983588_||USA|2015 KY983583_||USA|2015
MH121121_||USA|2016-12-19 KY369905_||USA|2016 KY369907_||USA|2016KY967361_||USA|2015
KY369906_||USA|2016 MN026164_||Kenya|2018-01-18 KF530086_||USA|1987-02-10
MG197723_||China|2016-06-20
KY684759_||USA|2016 MK303620_||| KF530074_||USA|1992-12-16
KY983585_||USA|2015 KF923904_||China|2012-05 KF923899_||China|2006-09 KF530061_||USA|1990-01-19
KY967358_||USA|2015 MG197713_||China|2015-08-14KF923895_||China|2010-07 KF530073_||USA|1989-12-21
MF374985_||USA|2017-01-17
MG977452 ||Cote dIvoire|2016-12-10 MG197715_||China|2015-05-21KF923893_||China|2010-07MK303625_|||KF530065_||USA|1990-01-17
KX538979_||Malaysia|2013-02-15KY554975_||USA|2016 MN306043_||USA|2019KF530066_||USA|1990-01-16
MG977447 ||Cote dIvoire|2016-12-26
KF923902_||China|2012-05 MK303621_||| KF530098_||USA|1996-05-10 KF530085_||USA|1987-01-22
KX538972_||Malaysia|2012-07-04 KF923903_||China|2012-05 KF923900_||China|2006-10 KF530075_||USA|1995-03-09
KX538970_||Malaysia|2012-06-20 KX538975_||Malaysia|2012-08-24 KF923905_||China|2005-06 MN310476_||USA|2019
MG977449 ||Cote dIvoire|2016-12-28KX538976_||Malaysia|2012-08-27 KF923897_||China|2012-06 JN129834_||China|2004-11 KF923889_||China|2006-03
MG977444 ||Cote dIvoire|2017-01-04 KX538964_||Malaysia|2012-02-22 KJ958219_||China|2011-10-04 KF923887_||China|2010-04
KX538967_||Malaysia|2012-05-02KJ958218_||China|2011-10-03 KF923888_||China|2010-07
KX538978_||Malaysia|2013-01-02
MK303624_||| KX344031_||Mexico|2011-02-09 KF923898_||China|2012-03
MK303622_|||
MK303623_||| AY903460_||Belgium|KF923918_||China|2010-05 MG197714_||China|2015-07-13
KY967356_||USA|2015 MG197712_||China|2015-06-09
KF923916_||China|2007-06 MG197719_||China|2015-06-04
MG977445 ||Cote dIvoire|2016-12-31 MG197716_||China|2015-06-06KX538973_||Malaysia|2012-07-16 MG197711_||China|2015-06-09
KF923917_||China|2007-06
MG197717_||China|2015-07-06 KX538965_||Malaysia|2012-03-28KF923920_||China|2007-07 MG197721_||China|2015-06-13
MG197710_||China|2015-05-06 KX538968_||Malaysia|2012-05-09 KF923914_||China|2007-06 MG197718_||China|2015-03-12
MG977451 ||Cote dIvoire|2016-12-10 JN129835_||China|2004-11KF923894_||China|2007-05 MG197720_||China|2015-06-05
KX538977_||Malaysia|2012-09-10 KF923909_||China|2007-06 MG197709_||China|2015-04-22
KX538969_||Malaysia|2012-05-18 KF923911_||China|2007-06 KF923886_||China|2010-03
MK303619_||| KF923907_||China|2007-05
MF374983_||USA|2016-02-01 KF923901_||China|2007-06
MG197722_||China|2015-12-02 KF923912_||China|2007-06
KF923910_||China|2007-06
KF923890_||China|2007-04
KF923892_||China|2007-05
KF923921_||China|2007-05
KF923915_||China|2007-06
KF923913_||China|2007-06
KF923908_||China|2007-06
KY554974_||USA|2016
KY674920_||USA|2016
KF923923_||China|2008-10
Figure 6: The phylogenetic tree connecting SARS-CoV-2 and other coronaviruses. The
SARS-CoV-2 samples are marked in red. Some samples from USA form a group on the left
side and the rest (a mixture of samples from USA and China) form a group on the right
side.
connection between human SARS coronavirus and SARS-CoV-2.
10
5 Conclusion
The SARS-CoV-2 is still a poorly characterized virus due to its diverse genomic alterations,
rapid power of proliferation from host to host, and spread over several countries. Due to
such complex characterization, understanding the pathophysiology of SARS-CoV-2 can be
a real challenge. As the samples from China and USA had distinct isolates but they were
closer together in terms of n-mer distribution, hence there are other important factors like
temporal shift of locations where the spread took place. This highlights a high complexity
of understanding the effect of mutations taking place in this virus over time. It is equally
interesting to explore how much dependent such mutations are on the geographic origin of
the samples. As the recurrent occurrence of this virus in human is already characterized
by its diversity [15], its aberrant mutations in different locations raise the possibility of its
outbreak in future.
Abbreviations
SARS-CoV-2, 2019 novel coronavirus; SARS-CoV, Severe Acute Respiratory Syndrome;

MERS-Cov, Middle East Respiratory Syndrome; COVID-19, 2019 novel coronavirus dis-
ease.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Funding
No funding to mention.
11
Availability of data and materials
All datasets used in this paper are freely available from the NCBI data repository. The data
with normalized n-mer features is provided as Supplementary Dataset 1. The metadata
for all the 87 sequences available and for those 36 that were considered in this analysis are
provided as Supplementary Dataset 2 and 3, respectively.
Authors’ contributions
SoB, SS and MB designed the experiments. SoB and SS carried out the experiments and
data analysis. SoB, SS, SaB and MB drafted the manuscript. All the authors read and
approved the final manuscript.
Acknowledgments
SoB acknowledges Digital India Corporation (formerly Media Lab Asia), Ministry Of Elec-
tronics and Information Technology (MeitY), Government of India, for providing him a
Senior Research Fellowship under the Visvesvaraya Ph.D. scheme for Electronics and IT.
Condolence
The authors feel deep grief and sympathy for those who got affected by the SARS-CoV-2.
References
[1] Anthony R Fehr and Stanley Perlman. Coronaviruses: an overview of their replication
and pathogenesis. In Coronaviruses, pages 1–23. Springer, 2015.
[2] Susan R Weiss and Julian L Leibowitz. Coronavirus pathogenesis. In Advances in Virus
Research, volume 81, pages 85–164. Elsevier, 2011.
[3] Shuo Su, Gary Wong, Weifeng Shi, Jun Liu, Alexander CK Lai, Jiyong Zhou, Wenjun
Liu, Yuhai Bi, and George F Gao. Epidemiology, genetic recombination, and pathogen-
esis of coronaviruses. Trends in Microbiology, 24(6):490–502, 2016.
[4] Nicole R Sexton, Everett Clinton Smith, Hervé Blanc, Marco Vignuzzi, Olve B Peersen,
and Mark R Denison. Homology-based identification of a mutation in the coronavirus
rna-dependent rna polymerase that confers resistance to multiple mutagens. Journal of
virology, 90(16):7415–7428, 2016.
12
[5] W Tan, X Zhao, X Ma, W Wang, P Niu, W Xu, GF Gao, and G Wu. A novel coronavirus
genome identified in a cluster of pneumonia cases—wuhan, china 2019-2020. China CDC
Weekly, 2(4):61–62, 2020.
[6] Na Zhu, Dingyu Zhang, Wenling Wang, Xingwang Li, Bo Yang, Jingdong Song, Xiang
Zhao, Baoying Huang, Weifeng Shi, Roujian Lu, et al. A novel coronavirus from patients
with pneumonia in china, 2019. New England Journal of Medicine, 2020.
[7] Sasmita Poudel Adhikari, Sha Meng, Yu-Ju Wu, Yu-Ping Mao, Rui-Xue Ye, Qing-
Zhi Wang, Chang Sun, Sean Sylvia, Scott Rozelle, Hein Raat, et al. Epidemiology,
causes, clinical manifestation and diagnosis, prevention and control of coronavirus dis-
ease (covid-19) during the early outbreak period: a scoping review. Infectious diseases
of poverty, 9(1):1–12, 2020.
[8] World Health Organization et al. Novel coronavirus (2019-ncov): situation report, 94.
2020.
[9] Eleanor M Cottam, Jemma Wadsworth, Nick J Knowles, and Donald P King. Full se-
quencing of viral genomes: practical strategies used for the amplification and character-
ization of foot-and-mouth disease virus. In Molecular Epidemiology of Microorganisms,
pages 217–230. 2009.
[10] Deepak Sharma, Pragya Priyadarshini, and Sudhanshu Vrati. Unraveling the web of vi-
roinformatics: computational tools and databases in virus research. Journal of Virology,
89(3):1489–1501, 2015.
[11] Kai Kupferschmidt. Preprints bring ‘firehose’ of outbreak data. Science, 367(6481):963–
964, 2020.
[12] Roujian Lu, Xiang Zhao, Juan Li, Peihua Niu, Bo Yang, Honglong Wu, Wenling Wang,
Hao Song, Baoying Huang, Na Zhu, et al. Genomic characterisation and epidemiology of
2019 novel coronavirus: implications for virus origins and receptor binding. The Lancet,
395(10224):565–574, 2020.
[13] Carmine Ceraolo and Federico M Giorgi. Genomic variance of the 2019-ncov coron-
avirus. Journal of Medical Virology, 2020.
[14] Jiahua He, Huanyu Tao, Yumeng Yan, Sheng-You Huang, and Yi Xiao. Molecular
mechanism of evolution and human infection with sars-cov-2. Viruses, 12(4):428, 2020.
[15] Jie Cui, Fang Li, and Zheng-Li Shi. Origin and evolution of pathogenic coronaviruses.
Nature Reviews Microbiology, 17(3):181–192, 2019.
13
[16] Yi Huang, Susanna KP Lau, Patrick CY Woo, and Kwok-yung Yuen. CoVDB: a
comprehensive database for comparative analysis of coronavirus genes and genomes.
Nucleic Acids Research, 36(suppl 1):D504–D511, 2007.
[17] Aiping Wu, Yousong Peng, Baoying Huang, Xiao Ding, Xianyue Wang, Peihua Niu,
Jing Meng, Zhaozhong Zhu, Zheng Zhang, Jiangyuan Wang, et al. Genome composition
and divergence of the novel coronavirus (2019-ncov) originating in china. Cell Host &
Microbe, 2020.
[18] Janez Demšar, Tomaž Curk, Aleš Erjavec, Črt Gorup, Tomaž Hočevar, Mitar Miluti-
novič, Martin Možina, Matija Polajnar, Marko Toplak, Anže Starič, et al. Orange: data
mining toolbox in python. The Journal of Machine Learning Research, 14(1):2349–2353,
2013.
[19] Richard O Duda, Peter E Hart, and David G Stork. Pattern Classification. John Wiley
& Sons, 2012.
[20] William R Pearson, Todd Wood, Zheng Zhang, and Webb Miller. Comparison of dna
sequences with protein sequences. Genomics, 46(1):24–36, 1997.
[21] Barbara Holland and Vincent Moulton. Consensus networks: A method for visualising
incompatibilities in collections of trees. In International Workshop on Algorithms in
Bioinformatics, pages 165–176. Springer, 2003.
14

Tracing Back The Temporal Change of SARS-CoV-2 With Genomic Signatures

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Tracing Back The Temporal Change of SARS-CoV-2 With Genomic Signatures

Uploaded by

Copyright:

Available Formats

bioRxiv preprint doi: https://doi.org/10.1101/2020.04.24.057380; this version posted May 7, 2020.

The copyright holder for this preprint (which

Tracing Back the Temporal Change of SARS-CoV-2

Sourav Biswas1,∗ , Suparna Saha1,∗ , Sanghamitra Bandyopadhyay1 , and

Indian Statistical Institute, Kolkata

203 B. T. Road, Kolkata – 700108, India

Indian Statistical Institute, Kolkata

203 B. T. Road, Kolkata – 700108, India

E-mail correspondence: malaybhattacharyya@isical.ac.in

Phone: (+91) (33) 2575 2231

temporal change of SARS-CoV-2, the coronavirus implicated in COVID-19, we study

variations characterize the initial temporal change of SARS-CoV-2 sequences. We ob-

the phylogenetic and protein-alignment analyses highlight interesting associations be-

tween SARS-CoV-2 and other coronaviruses.

Keywords: SARS-CoV-2, COVID-19, disease outbreak, phylogenetic tree.

3 Materials and Methods

3.1 Dataset Details

4.1 Extraction of Genomic Signatures

4.2 Tracing Back the Temporal Change

Table 1: Performance evaluation of different classification approaches (5-fold cross validation)

Classification Accuracy Precision Recall F1 Score AUC

Figure 3: The clustering of samples based on selected features to distinguish collection

4.3 Relevance of Isolation Sources

Table 2: Performance evaluation of different classification approaches (5-fold cross validation)

Classification Accuracy Precision Recall F1 Score AUC

4.4 Phylogenetic Analaysis

4.5 Protein Analysis

connection between human SARS coronavirus and SARS-CoV-2.

SARS-CoV-2, 2019 novel coronavirus; SARS-CoV, Severe Acute Respiratory Syndrome;

The authors declare that they have no competing interests.

Consent for publication

Ethics approval and consent to participate

Availability of data and materials

You might also like