
Bioinformatics

Bioinformatics is an interdisciplinary research area at the interface
between computer science and biological science. A variety of definitions exist in
the literature and on the world wide web; some are more inclusive than others.
Here, we adopt the definition proposed by Luscombe et al. in defining
bioinformatics as a union of biology and informatics: bioinformatics involves the
technology that uses computers for storage, retrieval, manipulation, and
distribution of information related to biological macromolecules such as DNA,
RNA, and proteins. The emphasis here is on the use of computers because most
of the tasks in genomic data analysis are highly repetitive or mathematically
complex. The use of computers is absolutely indispensable in mining genomes
for information gathering and knowledge building.
Bioinformatics differs from a related field known as computational biology.
Bioinformatics is limited to sequence, structural, and functional analysis of genes
and genomes and their corresponding products and is often considered
computational molecular biology. However, computational biology encompasses
all biological areas that involve computation. For example, mathematical
modeling of ecosystems, population dynamics, application of the game theory in
behavioral studies, and phylogenetic construction using fossil records all employ
computational tools, but do not necessarily involve biological macromolecules.
Besides this distinction, it is worth noting that there are other views of how the
two terms relate. For example, one version defines bioinformatics as the
development and application of computational tools in managing all kinds of
biological data, whereas computational biology is more confined to the
theoretical development of algorithms used for bioinformatics. The confusion at
present over definition may partly reflect the nature of this vibrant and quickly
evolving new field.
The ultimate goal of bioinformatics is to better understand a living cell and
how it functions at the molecular level. By analyzing raw molecular sequence
and structural data, bioinformatics research can generate new insights and
provide a global perspective of the cell. The functions of a cell can be better
understood by analyzing sequence data because the flow of genetic information
is dictated by the central dogma of biology, in which DNA is transcribed to RNA,
which is translated to proteins. Cellular functions
are mainly performed by proteins whose capabilities are ultimately determined
by their sequences. Therefore, solving functional problems using sequence and
sometimes structural approaches has proved to be a fruitful endeavor.
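As a rough illustration of the information flow described above, the following Python sketch transcribes a DNA coding strand to mRNA and translates it to a peptide. The function names, the example sequence, and the deliberately partial codon table are assumptions made for this illustration; a real analysis would use the complete genetic code.

```python
# Minimal sketch of the central dogma: DNA -> RNA -> protein.
# CODON_TABLE is intentionally partial and for illustration only.
CODON_TABLE = {
    "AUG": "M",                                      # start codon (methionine)
    "UUU": "F", "UUC": "F",                          # phenylalanine
    "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",  # glycine
    "AAA": "K", "AAG": "K",                          # lysine
    "UAA": "*", "UAG": "*", "UGA": "*",              # stop codons
}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA matching a DNA coding strand (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate codon by codon until a stop codon or the end of the mRNA."""
    residues = []
    for i in range(0, len(mrna) - 2, 3):
        # "X" marks a codon that is missing from the partial table.
        residue = CODON_TABLE.get(mrna[i:i + 3], "X")
        if residue == "*":
            break
        residues.append(residue)
    return "".join(residues)


dna = "ATGTTTGGCAAATAA"      # hypothetical example sequence
mrna = transcribe(dna)       # "AUGUUUGGCAAAUAA"
protein = translate(mrna)    # "MFGK"
```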
Gene Prediction
The current gene prediction methods can be classified into two major
categories: ab initio-based and homology-based approaches. The ab initio-based
approach predicts genes based on the given sequence alone. It does so by
relying on two major features associated with genes. The first is the existence of
gene signals, which include start and stop codons, intron splice signals,
transcription factor binding sites, ribosomal binding sites, and polyadenylation
(poly-A) sites. In addition, the triplet codon structure limits the coding frame
length to multiples of three, which can be used as a condition for gene
prediction. The second feature used by ab initio algorithms is gene content,
which is a statistical description of coding regions. It has been observed that
nucleotide composition and statistical patterns of the coding regions tend to vary
significantly from those of the noncoding regions. The unique features can be
detected by employing probabilistic models such as Markov models or hidden
Markov models to help distinguish coding from noncoding regions.
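The signal-based part of the ab initio idea can be sketched as a simple open reading frame (ORF) scan: look for a start codon, then read in steps of three until an in-frame stop codon. This is only a minimal sketch under strong assumptions (forward strand only, no introns, ATG as the only start codon); the function name and the `min_codons` threshold are hypothetical, and the gene content step with Markov models is not shown.

```python
# Sketch of a signal-based ORF scan (forward strand only, no introns).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3):
    """Return (start, end, frame) tuples for ATG...stop ORFs.

    start is 0-based, end is exclusive and includes the stop codon,
    so end - start is always a multiple of three.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # first in-frame start codon
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                    # look for the next ORF
    return orfs


seq = "CCATGAAATTTGGGTAACC"     # hypothetical example sequence
orfs = find_orfs(seq)           # [(2, 17, 2)]
```

Here the single reported ORF is `seq[2:17]`, that is `ATGAAATTTGGGTAA`, which lies in the third reading frame.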
The homology-based method makes predictions based on significant
matches of the query sequence with sequences of known genes. For instance, if
a translated DNA sequence is found to be similar to a known protein or protein
family from a database search, this can be strong evidence that the region codes
for a protein. Alternatively, when possible exons of a genomic DNA region match
a sequenced cDNA, this also provides experimental evidence for the existence of
a coding region. Some algorithms make use of both gene-finding strategies.
There are also a number of programs that actually combine prediction results
from multiple individual programs to derive a consensus prediction. This type of
algorithm can therefore be considered consensus based.
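One simple way to picture a consensus-based combination is a per-position vote. The sketch below assumes that each program's output has already been converted to a string of equal length in which "C" marks a predicted coding position and "N" a noncoding one; this encoding, the function name, and the voting threshold are assumptions for illustration and do not describe how any particular program combines results.

```python
def consensus_coding_mask(predictions, min_votes):
    """Label a position coding ("C") when at least min_votes predictions agree."""
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must cover the same sequence length")
    mask = []
    for i in range(length):
        votes = sum(1 for p in predictions if p[i] == "C")
        mask.append("C" if votes >= min_votes else "N")
    return "".join(mask)


# Hypothetical per-base outputs from three gene finders.
predictions = ["NNCCCCNN", "NCCCCNNN", "NNCCCCCN"]
majority = consensus_coding_mask(predictions, min_votes=2)   # "NNCCCCNN"
```

Raising `min_votes` makes the consensus more conservative, while lowering it to 1 yields the union of all predictions.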
Sequence Homology and Sequence Similarity
An important concept in sequence analysis is sequence homology. When
two sequences are descended from a common evolutionary origin, they are said
to have a homologous relationship or share homology. A related but different
term is sequence similarity, which is the percentage of aligned residues that are
similar in physicochemical properties such as size, charge, and hydrophobicity.
It is important to distinguish sequence homology from the related
term sequence similarity because the two terms are often confused by
researchers who use them interchangeably in scientific literature. To be clear,
sequence homology is an inference or a conclusion about a common ancestral
relationship drawn from sequence similarity comparison when the two sequences
share a high enough degree of similarity. On the other hand, similarity is a direct
result of observation from the sequence alignment. Sequence similarity can be
quantified using percentages; homology is a qualitative statement. For example,
one may say that two sequences share 40% similarity. It is incorrect to say that
the two sequences share 40% homology. They are either homologous or
nonhomologous.
Generally, if the sequence similarity level is high enough, a common
evolutionary relationship can be inferred. In dealing with real research
problems, the question of what similarity level permits inferring a homologous
relationship is not always clear. The answer depends on the type of sequences
being examined and sequence lengths. Nucleotide sequences consist of only four
characters, and therefore, unrelated sequences have at least a 25% chance of
being identical at any given position. For protein sequences, there are twenty
possible amino acid residues, and so two unrelated sequences can match at about
5% of the residues by random chance. If gaps are allowed, the percentage could
increase to 10 to 20%.
Sequence length is also a crucial factor. The shorter the sequence, the higher the
chance that some alignment is attributable to random chance. The longer the
sequence, the less likely the matching at the same level of similarity is
attributable to random chance.
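The 25% and 5% baselines follow from a short calculation: if residues at a position are drawn independently with frequencies p_i, the probability that two unrelated sequences match there is the sum of p_i squared. The Python sketch below assumes ungapped, position-by-position comparison of independent sequences; the function name and the AT-rich composition are hypothetical examples.

```python
def expected_identity(freqs):
    """Chance that two independently drawn residues are identical: sum of p_i^2."""
    return sum(p * p for p in freqs.values())


# Uniform composition gives 1/4 for nucleotides and 1/20 for amino acids.
dna_uniform = {base: 0.25 for base in "ACGT"}
protein_uniform = {aa: 1 / 20 for aa in "ACDEFGHIKLMNPQRSTVWY"}

# A skewed (hypothetical AT-rich) composition raises the chance above 25%.
dna_at_rich = {"A": 0.35, "T": 0.35, "G": 0.15, "C": 0.15}

p_dna = expected_identity(dna_uniform)          # 0.25
p_protein = expected_identity(protein_uniform)  # about 0.05
p_at_rich = expected_identity(dna_at_rich)      # about 0.29
```

Because the uniform case minimizes this sum, any compositional bias only increases the chance identity, which is why 25% is a lower bound for nucleotide sequences.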
There are two ways to calculate the sequence similarity/identity. One
involves the use of the overall sequence lengths of both sequences; the other
normalizes by the size of the shorter sequence. The first method uses the
following formula:

$$S = \frac{L_s \times 2}{L_a + L_b} \times 100$$

where $S$ is the percentage sequence similarity, $L_s$ is the number of aligned
residues with similar characteristics, and $L_a$ and $L_b$ are the total lengths of each
individual sequence.
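Both normalizations can be sketched in a few lines of Python. The first function follows the formula above, which divides twice the number of similar aligned residues by the combined sequence lengths. The second is an assumed reading of the other method mentioned in the text, dividing by the length of the shorter sequence, since its formula is not given here. The function names and the example counts are hypothetical.

```python
def similarity_total_length(n_similar: int, len_a: int, len_b: int) -> float:
    """Percent similarity normalized by the combined lengths of both sequences."""
    return (2 * n_similar) / (len_a + len_b) * 100


def similarity_shorter_length(n_similar: int, len_a: int, len_b: int) -> float:
    """Percent similarity normalized by the shorter sequence (assumed form)."""
    return n_similar / min(len_a, len_b) * 100


# Hypothetical alignment: 40 similar residues, sequences of length 100 and 80.
s_total = similarity_total_length(40, 100, 80)      # about 44.4
s_shorter = similarity_shorter_length(40, 100, 80)  # 50.0
```

The two values coincide only when both sequences have the same length; otherwise normalizing by the shorter sequence gives the larger percentage.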
Phylogenetics Basics
Biological sequence analysis is founded on solid evolutionary principles.
Similarities and divergence among related biological sequences revealed by
sequence alignment often have to be rationalized and visualized in the context of
phylogenetic trees. Thus, molecular phylogenetics is a fundamental aspect of
bioinformatics.
Evolution can be defined in various ways under different contexts. In the
biological context, evolution can be defined as the development of a biological
form from other preexisting forms or its origin to the current existing form through
natural selections and modifications. The driving force behind evolution is natural
selection in which unfit forms are eliminated through changes of environmental
conditions or sexual selection so that only the fittest are selected. The underlying
mechanism of evolution is genetic mutations that occur spontaneously. The
mutations on the genetic material provide the biological diversity within a
population; hence, the variability of individuals within the population to survive
successfully in a given environment. Genetic diversity thus provides the source
of raw material for the natural selection to act on.
Phylogenetics is the study of the evolutionary history of living organisms
using treelike diagrams to represent pedigrees of these organisms. The tree
branching patterns representing the evolutionary divergence are referred to as
phylogeny. Phylogenetics can be studied in various ways. It is often studied using
fossil records, which contain morphological information about ancestors of
current species and the timeline of divergence. However, fossil records have
many limitations; they may be available only for certain species. Existing fossil
data can be fragmentary and their collection is often limited by abundance,
habitat, geographic range, and other factors. The descriptions of morphological
traits, which are due to multiple genetic factors, are often ambiguous. Thus, using
fossil records to determine phylogenetic relationships can often be biased. For
microorganisms, fossils are essentially nonexistent, which makes it impossible to
study phylogeny with this approach.

Fortunately, molecular data that are in the form of DNA or protein
sequences can also provide very useful evolutionary perspectives of existing
organisms because, as organisms evolve, the genetic materials accumulate
mutations over time causing phenotypic changes. Because genes are the
medium for recording the accumulated mutations, they can serve as molecular
fossils. Through comparative analysis of the molecular fossils from a number of
related organisms, the evolutionary history of the genes and even the organisms
can be revealed.
The advantage of using molecular data is obvious. Molecular data are more
numerous than fossil records and easier to obtain. There is no sampling bias
involved, which helps to mend the gaps in real fossil records. More clear-cut and
robust phylogenetic trees can be constructed with the molecular data. Therefore,
they have become the favorite and sometimes the only information available for
researchers to reconstruct evolutionary history. The advent of the genomic era
with tremendous amounts of molecular sequence data has led to the rapid
development of molecular phylogenetics.
The field of molecular phylogenetics can be defined as the study of
evolutionary relationships of genes and other biological macromolecules by
analyzing mutations at various positions in their sequences and developing
hypotheses about the evolutionary relatedness of the biomolecules. Based on
the sequence similarity of the molecules, evolutionary relationships between the
organisms can often be inferred.
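As a small illustration of comparing mutations position by position, the sketch below computes the proportion of differing sites between pairs of aligned sequences. This measure (often called the p-distance) is a choice made for this example rather than something prescribed by the text, and the sequence names and sequences are hypothetical.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ("-") are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(1 for x, y in pairs if x != y) / len(pairs)


# Hypothetical aligned sequences from three organisms.
aligned = {
    "seq1": "ACGTACGTAC",
    "seq2": "ACGTACGTTC",
    "seq3": "ACGAACCTTC",
}

distances = {
    (m, n): p_distance(aligned[m], aligned[n])
    for m, n in combinations(sorted(aligned), 2)
}
# seq1 and seq2 differ at 1 of 10 sites, seq2 and seq3 at 2, seq1 and seq3 at 3.
```

In this toy example seq1 and seq2 are the closest pair, which is the kind of observation that distance-based tree building starts from.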
Using molecular data to reconstruct evolutionary history requires making
a number of reasonable assumptions. The first is that the molecular sequences
used in phylogenetic construction are homologous, meaning that they share a
common origin and subsequently diverged through time. Phylogenetic
divergence is assumed to be bifurcating, meaning that a parent branch splits into
two daughter branches at any given point. Another assumption in phylogenetics
is that each position in a sequence evolved independently. A further assumption
is that the variability among sequences is sufficiently informative for
constructing unambiguous phylogenetic trees.
(Xiong, J. Essential Bioinformatics. Cambridge: Cambridge University Press; 2006.
4, 5, 32, 33, 97, 127-128)
