
Computational genomics

Computational genomics refers to the use of computational and statistical analysis to decipher biology
from genome sequences and related data,[1] including both DNA and RNA sequence as well as other
"post-genomic" data (i.e., experimental data obtained with technologies that require the genome sequence,
such as genomic DNA microarrays). Because these data are combined with computational and statistical
approaches to understanding gene function and with statistical association analysis, the field is also often
referred to as computational and statistical genetics/genomics. As such, computational genomics may be regarded
as a subset of bioinformatics and computational biology, but with a focus on using whole genomes (rather
than individual genes) to understand the principles of how the DNA of a species controls its biology at the
molecular level and beyond. With the current abundance of massive biological datasets, computational
studies have become one of the most important means of biological discovery.[2]

History
The roots of computational genomics are shared with those of bioinformatics. During the 1960s, Margaret
Dayhoff and others at the National Biomedical Research Foundation assembled databases of homologous
protein sequences for evolutionary study.[3] Their research produced a phylogenetic tree that was used to determine
the evolutionary changes required for a particular protein to change into another protein, based on
the underlying amino acid sequences. This led them to create a scoring matrix that assessed the likelihood
of one protein being related to another.

In the 1980s, databases of genome sequences began to be compiled, which presented new
challenges in searching and comparing the gene information they contained. Unlike the text-searching
algorithms used on websites such as Google or Wikipedia, searching for regions of genetic
similarity requires finding strings that are not simply identical, but similar. Such comparisons build on
the Needleman-Wunsch algorithm (published in 1970), a dynamic programming algorithm for comparing
amino acid sequences with each other, which can use scoring matrices such as those derived from the earlier
research by Dayhoff. Later, the BLAST algorithm was developed for performing fast, optimized searches of gene
sequence databases. BLAST and its derivatives are probably the most widely used algorithms for this
purpose.[4]
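
The dynamic programming idea behind Needleman-Wunsch can be sketched in a few lines of Python. This is a minimal illustration rather than the original formulation: the function name needleman_wunsch_score and the flat match, mismatch and gap scores are assumptions made here for brevity, whereas real protein comparisons would typically look up substitution scores in a matrix such as those derived from Dayhoff's work.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b.

    The scoring values are illustrative assumptions, not taken from the article.
    """
    # prev[j] is the best score for aligning the previous prefix of a with b[:j].
    # Before the first row it describes the empty prefix: j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] with an empty prefix costs i gaps
        for j in range(1, len(b) + 1):
            # Align a[i-1] with b[j-1], either as a match or a mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned against a gap in b
            left = curr[j - 1] + gap  # b[j-1] aligned against a gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("ACGT", "AGT"))  # three matches and one gap
```

Only two rows of the table are kept because this sketch returns the score alone; recovering the alignment itself would require storing the full table and tracing back through it.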

The emergence of the phrase "computational genomics" coincides with the availability of complete
sequenced genomes in the mid-to-late 1990s. The first meeting of the Annual Conference on
Computational Genomics was organized by scientists from The Institute for Genomic Research (TIGR) in
1998, providing a forum for this speciality and effectively distinguishing this area of science from the more
general fields of Genomics or Computational Biology. The first use of this term in scientific literature,
according to MEDLINE abstracts, was just one year earlier in Nucleic Acids Research.[5] The final
Computational Genomics conference was held in 2006, featuring a keynote talk by Nobel Laureate Barry
Marshall, co-discoverer of the link between Helicobacter pylori and stomach ulcers. As of 2014, the leading
conferences in the field include Intelligent Systems for Molecular Biology (ISMB) and Research in
Computational Molecular Biology (RECOMB).

The development of computer-assisted mathematics (using products such as Mathematica or Matlab) has
helped engineers, mathematicians and computer scientists to start working in this domain, and a public
collection of case studies and demonstrations is growing, ranging from whole-genome comparisons to gene
expression analysis.[6] This has brought in ideas from other disciplines, including concepts from
systems and control, information theory, string analysis and data mining. It is anticipated that
computational approaches will become and remain a standard topic for research and teaching, while
students fluent in both topics are being trained in the many courses created in recent years.

Contributions of computational genomics research to biology

Contributions of computational genomics research to biology include:[2]

proposing cellular signalling networks
proposing mechanisms of genome evolution
predicting precise locations of all human genes using comparative genomics techniques with several mammalian and vertebrate species
predicting conserved genomic regions that are related to early embryonic development
discovering potential links between repeated sequence motifs and tissue-specific gene expression
measuring regions of genomes that have undergone unusually rapid evolution

Genome comparison
Computational tools have been developed to assess the similarity of genomic sequences. Some of them use
alignment-based distances such as Average Nucleotide Identity.[7] These methods are highly specific but
computationally slow. Other, alignment-free methods include statistical and probabilistic approaches.
One example is Mash,[8] a probabilistic approach using MinHash. In this method, given a number k, a
genomic sequence is transformed into a shorter sketch through a random hash function on the possible
k-mers. For example, if k = 2 and sketches of size 4 are being constructed, then given the following hash
function

(AA,0) (AC,8) (AT,2) (AG,14)
(CA,6) (CC,13) (CT,5) (CG,4)
(GA,15) (GC,12) (GT,10) (GG,1)
(TA,3) (TC,11) (TT,9) (TG,7)

the sketch of the sequence

CTGACCTTAACGGGAGACTATGATGACGACCGCAT

is {0,1,1,2}, the four smallest hash values of its k-mers of size 2 (the value 1 appears twice because the
k-mer GG occurs twice in the sequence). These sketches are then compared to
estimate the fraction of shared k-mers (Jaccard index) of the corresponding sequences. Note
that each hash value is a binary number. In a real genomic setting, a useful k-mer size ranges from 14 to 21,
and the size of the sketches would be around 1000.[8]
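
The worked example can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not Mash's implementation: the names HASH, kmers, sketch and jaccard_estimate are hypothetical, the hash table is the toy one from the example (so only k = 2 is supported), and the Jaccard estimator shown is one simple bottom-sketch variant. Counting every k-mer occurrence, as the example appears to do, yields the values 0, 1, 1, 2. Treating the k-mers as a set, which is how MinHash is usually defined, yields 0, 1, 2, 3.

```python
# Toy hash table from the example: each 2-mer maps to a value in 0..15.
HASH = {
    "AA": 0, "AC": 8, "AT": 2, "AG": 14,
    "CA": 6, "CC": 13, "CT": 5, "CG": 4,
    "GA": 15, "GC": 12, "GT": 10, "GG": 1,
    "TA": 3, "TC": 11, "TT": 9, "TG": 7,
}


def kmers(seq, k):
    """Return every overlapping substring of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def sketch(seq, k=2, size=4, distinct=False):
    """Return the `size` smallest hash values of the k-mers of seq.

    With distinct=False every k-mer occurrence is counted (as in the
    worked example); with distinct=True repeated values are dropped.
    """
    values = [HASH[km] for km in kmers(seq, k)]
    if distinct:
        values = set(values)
    return sorted(values)[:size]


def jaccard_estimate(seq_a, seq_b, k=2, size=4):
    """Estimate the Jaccard index of two k-mer sets from their sketches."""
    a = set(sketch(seq_a, k, size, distinct=True))
    b = set(sketch(seq_b, k, size, distinct=True))
    merged = sorted(a | b)[:size]  # smallest values of the combined sketches
    if not merged:
        return 0.0
    shared = [h for h in merged if h in a and h in b]
    return len(shared) / len(merged)


SEQ = "CTGACCTTAACGGGAGACTATGATGACGACCGCAT"
print(sketch(SEQ))                 # [0, 1, 1, 2]
print(sketch(SEQ, distinct=True))  # [0, 1, 2, 3]
```

A real tool would replace the lookup table with a hash function over much longer k-mers and keep sketches of around a thousand values, as noted above.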

By shrinking the sequences, sometimes by a factor of hundreds, and comparing them in an alignment-free
way, this method significantly reduces the time needed to estimate the similarity of sequences.

Clustering of genomic data

Clustering data is a tool used to simplify statistical analysis of a genomic sample. For example, the
authors of [9] developed a tool (BiG-SCAPE) to analyze sequence similarity networks of biosynthetic gene
clusters (BGCs). In [10], successive layers of clustering of biosynthetic gene clusters are used in the
automated tool BiG-MAP, both to filter redundant data and to identify gene cluster families. This tool profiles
the abundance and expression levels of BGCs in microbiome samples.

Biosynthetic gene clusters

Bioinformatic tools have been developed to predict biosynthetic gene clusters in microbiome samples and
to determine their abundance and expression from metagenomic data.[11] Since the size of metagenomic
data is considerable, filtering and clustering it are important parts of these tools. These processes
can consist of dimensionality-reduction techniques, such as MinHash,[8] and clustering algorithms such
as k-medoids and affinity propagation. Several metrics and similarity measures have also been developed to
compare gene clusters.

Genome mining for biosynthetic gene clusters (BGCs) has become an integral part of natural product
discovery. The >200,000 microbial genomes now publicly available hold information on abundant novel
chemistry. One way to navigate this vast genomic diversity is through comparative analysis of homologous
BGCs, which allows identification of cross-species patterns that can be matched to the presence of
metabolites or biological activities. However, current tools are hindered by a bottleneck caused by the
expensive network-based approach used to group these BGCs into gene cluster families (GCFs).
BiG-SLiCE (Biosynthetic Genes Super-Linear Clustering Engine) is a tool designed to cluster massive numbers
of BGCs. By representing them in Euclidean space, BiG-SLiCE can group BGCs into GCFs in a
non-pairwise, near-linear fashion.
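
To illustrate why a Euclidean representation makes non-pairwise grouping possible, the Python sketch below assigns each feature vector to the nearest existing cluster centroid, or starts a new cluster when no centroid lies within a distance threshold. It is a simple leader-style clustering written for this illustration and does not reproduce BiG-SLiCE's actual feature extraction or clustering procedure; the function name leader_cluster, the toy vectors and the threshold are all hypothetical.

```python
import math


def leader_cluster(vectors, threshold):
    """Assign each vector to the nearest centroid within `threshold`.

    Returns (labels, centroids). Each vector is compared with the current
    centroids only, never with every other vector.
    """
    centroids = []  # running mean of each cluster
    counts = []     # number of members in each cluster
    labels = []
    for v in vectors:
        best, best_dist = None, threshold
        for idx, c in enumerate(centroids):
            d = math.dist(v, c)  # Euclidean distance
            if d <= best_dist:
                best, best_dist = idx, d
        if best is None:
            # No centroid is close enough: this vector founds a new cluster.
            centroids.append(list(v))
            counts.append(1)
            best = len(centroids) - 1
        else:
            # Update the running mean of the chosen cluster.
            counts[best] += 1
            n = counts[best]
            centroids[best] = [c + (x - c) / n for c, x in zip(centroids[best], v)]
        labels.append(best)
    return labels, centroids


toy_vectors = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
labels, centroids = leader_cluster(toy_vectors, threshold=1.0)
print(labels)  # [0, 0, 1, 1, 0]
```

The number of distance computations grows with the number of vectors times the number of clusters, rather than with the square of the number of vectors as in an all-against-all similarity network.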

Kautsar et al. (2021)[12] demonstrate the utility of such analyses with BiG-SLiCE by reconstructing a global
map of secondary metabolic diversity across taxonomy to identify uncharted biosynthetic potential. This opens
up new possibilities to accelerate natural product discovery and offers a first step towards constructing a
global, searchable and interconnected network of BGCs. As more genomes are sequenced from
understudied taxa, more information can be mined to highlight their potentially novel chemistry.[12]

See also
Bioinformatics
Computational biology
Genomics
Microarray
BLAST
Computational epigenetics

References
1. Koonin EV (March 2001). "Computational genomics"
(https://doi.org/10.1016%2FS0960-9822%2801%2900081-1). Current Biology. 11 (5): R155–8.
doi:10.1016/S0960-9822(01)00081-1 (https://doi.org/10.1016%2FS0960-9822%2801%2900081-1).
PMID 11267880 (https://pubmed.ncbi.nlm.nih.gov/11267880). S2CID 17202180
(https://api.semanticscholar.org/CorpusID:17202180).
2. "Computational Genomics and Proteomics at MIT"
(https://web.archive.org/web/20180322062228/http://www.eecs.mit.edu/bioeecs/CompGenProt.html).
Archived from the original (http://www.eecs.mit.edu/bioeecs/CompGenProt.html) on 2018-03-22.
Retrieved 2006-12-29.
3. Mount D (2000). Bioinformatics, Sequence and Genome Analysis. Cold Spring Harbor
Laboratory Press. pp. 2–3. ISBN 978-0-87969-597-2.
4. Brown TA (1999). Genomes (https://archive.org/details/genomes00tabr). Wiley.
ISBN 978-0-471-31618-3.
5. Wagner A (September 1997). "A computational genomics approach to the identification of
gene networks" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC146952). Nucleic Acids
Research. 25 (18): 3594–604. doi:10.1093/nar/25.18.3594
(https://doi.org/10.1093%2Fnar%2F25.18.3594). PMC 146952
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC146952). PMID 9278479
(https://pubmed.ncbi.nlm.nih.gov/9278479).
6. Cristianini N, Hahn M (2006). Introduction to Computational Genomics
(http://www.computational-genomics.net/). Cambridge University Press. ISBN 978-0-521-67191-0.
7. Konstantinidis KT, Tiedje JM (2005). "Genomic insights that advance the species definition
for prokaryotes" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC549018). Proc Natl Acad Sci
U S A. 102 (7): 2567–72. Bibcode:2005PNAS..102.2567K
(https://ui.adsabs.harvard.edu/abs/2005PNAS..102.2567K). doi:10.1073/pnas.0409727102
(https://doi.org/10.1073%2Fpnas.0409727102). PMC 549018
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC549018). PMID 15701695
(https://pubmed.ncbi.nlm.nih.gov/15701695).
8. Ondov B, Treangen T, Melsted P, Mallonee A, Bergman N, Koren S, Phillippy A (2016).
"Mash: fast genome and metagenome distance estimation using MinHash"
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4915045). Genome Biology. 17 (32): 14.
doi:10.1186/s13059-016-0997-x (https://doi.org/10.1186%2Fs13059-016-0997-x).
PMC 4915045 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4915045). PMID 27323842
(https://pubmed.ncbi.nlm.nih.gov/27323842).
9. Navarro-Muñoz J, Selem-Mojica N, Mullowney M, Kautsar S, Tryon J, Parkinson E, De Los
Santos E, Yeong M, Cruz-Morales P, Abubucker S, Roeters A, Lokhorst W, Fernandez-Guerra A,
Dias-Cappelini L, Goering A, Thomson R, Metcalf W, Kelleher N, Barona-Gomez
F, Medema M (2020). "A computational framework to explore large-scale biosynthetic
diversity" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6917865). Nat Chem Biol. 16 (1):
60–68. doi:10.1038/s41589-019-0400-9 (https://doi.org/10.1038%2Fs41589-019-0400-9).
PMC 6917865 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6917865). PMID 31768033
(https://pubmed.ncbi.nlm.nih.gov/31768033).
10. Pascal-Andreu V, Augustijn H, van den Berg K, van der Hooft J, Fischbach M, Medema M
(2020). "BiG-MAP: an automated pipeline to profile metabolic gene cluster abundance and
expression in microbiomes" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8547482).
mSystems. 6 (5): e00937-21. bioRxiv 10.1101/2020.12.14.422671
(https://doi.org/10.1101%2F2020.12.14.422671). doi:10.1128/msystems.00937-21
(https://doi.org/10.1128%2Fmsystems.00937-21). PMC 8547482
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8547482). PMID 34581602
(https://pubmed.ncbi.nlm.nih.gov/34581602).
11. Pascal-Andreu V, Augustijn H, van den Berg K, van der Hooft J, Fischbach M, Medema M
(2020). "BiG-MAP: an automated pipeline to profile metabolic gene cluster abundance and
expression in microbiomes" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8547482).
bioRxiv. 6 (5): e00937-21. doi:10.1101/2020.12.14.422671
(https://doi.org/10.1101%2F2020.12.14.422671). PMC 8547482
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8547482). PMID 34581602
(https://pubmed.ncbi.nlm.nih.gov/34581602).
12. Kautsar SA, van der Hooft JJJ, de Ridder D, Medema MH (13 January 2021). "BiG-SLiCE: A
highly scalable tool maps the diversity of 1.2 million biosynthetic gene clusters"
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7804863). GigaScience. 10 (1): giaa154.
doi:10.1093/gigascience/giaa154 (https://doi.org/10.1093%2Fgigascience%2Fgiaa154).
PMC 7804863 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7804863). PMID 33438731
(https://pubmed.ncbi.nlm.nih.gov/33438731).

External links
Harvard Extension School Biophysics 101, Genomics and Computational Biology,
http://www.courses.fas.harvard.edu/~bphys101/info/syllabus.html
University of Bristol course in Computational Genomics,
http://www.computational-genomics.net/

Retrieved from "https://en.wikipedia.org/w/index.php?title=Computational_genomics&oldid=1160543471"
