UNIT – I
Bioinformatics
Objectives of Bioinformatics
o Development of new algorithms and statistics for assessing the relationships among large
sets of biological data.
o Application of these tools for the analysis and interpretation of the various biological
data.
o Development of databases for efficient storage, access and management of the large
body of biological information.
Components of Bioinformatics
Scope of Bioinformatics
Application of Bioinformatics
Biological Data
o Nucleic Acid Sequences - Raw DNA Sequences, Genomic sequence tags (GSTs), cDNA
sequences, Expressed sequence tags (ESTs), Organellar DNA sequences and RNA
Sequences.
o Protein sequences.
o Protein structures
o Metabolic pathways
o Gel pictures
o Literature
Biological Database
A database is a vast collection of data pertaining to a specific topic e.g. nucleotide sequence,
protein sequence, protein structure etc., in an electronic environment.
o They are the heart of bioinformatics.
o Computerized storehouse of data (records).
o Allows extraction of specified records.
o Allows adding, changing, removing, and merging of records.
o Uses standardized formats.
NUCLEIC ACID
ENTREZ in NCBI
The NCBI developed and maintains Entrez, a biological database retrieval system. It is a
gateway that allows text-based searches for a wide variety of data, including annotated genetic
sequence information, structural information, as well as citations and abstracts, full research
papers and taxonomic data. The key feature of Entrez is its ability to integrate information, which
comes from cross-referencing between NCBI databases based on pre-existing and logical
relationships between individual entries.
DNA DATA BANK OF JAPAN (DDBJ)
DDBJ is an annotated collection of all publicly available nucleotide and protein sequences, started
in 1984 at the National Institute of Genetics (NIG) in Mishima. The institute maintains the
DDBJ with a team led by Takashi Gojobori. Since 1987, the DDBJ has been collecting annotated
nucleotide sequences as its traditional database service. This endeavor has been conducted in
collaboration with GenBank at the National Center for Biotechnology Information and with the
European Molecular Biology Laboratory at the European Bioinformatics Institute. The
collaborative framework is called the International Nucleotide Sequence Database Collaboration
(INSDC). Because DDBJ mirrors its information daily with GenBank and EMBL, beginning
sequence searchers may simply use whichever database has the friendlier searching interface.
However, DDBJ also offers all its pages in Japanese, so if you are more comfortable reading the
Japanese versions of the pages, it can be very useful.
The mission of DDBJ: biological data resources are diverse, and some are Very Large Scale
Databases (VLSD); there are diverse requirements for integrating these biological data
resources; DDBJ aims to contribute to the interoperability of biological data resources.
EUROPEAN MOLECULAR BIOLOGY LABORATORY (EMBL)
EMBL MISSION
To provide freely available data and bioinformatics services to all facets of the scientific
community in ways that promote scientific progress.
Contribute to the advancement of biology through basic investigator-driven research in
bioinformatics.
Provide advanced bioinformatics training to scientists at all levels, from PhD students to
independent investigators.
Help disseminate cutting-edge technologies to industry.
Coordinate biological data provision throughout Europe.
PROTEIN DATABASES
As biology has increasingly turned into a data-rich science, the need for storing and
communicating large datasets has grown tremendously.
The obvious examples are the nucleotide sequences, the protein sequences, and the 3D structural
data produced by X-ray crystallography and macromolecular NMR.
The biological information of proteins is available as sequences and structures. Sequences are
represented in a single dimension whereas the structure contains the three-dimensional data of
sequences.
A biological database is a collection of data that is organized so that its contents can easily be
accessed, managed, and updated.
A protein database is one or more datasets about proteins, which could include a protein's amino
acid sequence, conformation, structure, and features such as active sites.
Protein databases are compiled by the translation of DNA sequences from different gene
databases and include structural information. They are an important resource because proteins
mediate most biological functions.
b. SWISS-PROT
The other well-known and extensively used protein database is SWISS-PROT. Like the PIR-
PSD, this curated protein sequence database also provides a high level of annotation.
The data in each entry can be considered separately as core data and annotation.
The core data consists of the sequences entered in common single letter amino acid code, and the
related references and bibliography. The taxonomy of the organism from which the sequence
was obtained also forms part of this core information.
The annotation contains information on the function or functions of the protein; post-
translational modifications such as phosphorylation, acetylation, etc.; functional and structural
domains and sites, such as calcium-binding regions, ATP-binding sites, zinc fingers, etc.; known
secondary structural features, for example alpha helices and beta sheets; the quaternary
structure of the protein; similarities to other proteins, if any; and diseases that may arise from
deficiencies in the protein. Conflicts that occur when different authors publish different
sequences for the same protein, or that arise from mutations in different strains of an organism,
are also described as part of the annotation.
TrEMBL (for Translated EMBL) is a computer-annotated protein sequence database that is
released as a supplement to SWISS-PROT. It contains the translation of all coding sequences
present in the EMBL Nucleotide database, which have not been fully annotated. Thus it may
contain the sequence of proteins that are never expressed and never actually identified in the
organisms.
a. PROSITE:
A set of databases collects together patterns found in protein sequences rather than the
complete sequences. PROSITE is one such pattern database.
The protein motifs and patterns are encoded as "regular expressions".
The information corresponding to each entry in PROSITE takes two forms: the pattern
itself and the related descriptive text.
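As an illustration of how such patterns map onto regular expressions, the sketch below converts a simplified PROSITE-style pattern (elements separated by '-', 'x' for any residue, [..] for allowed residues, {..} for forbidden residues, and (n) or (n,m) for repeat counts) into a Python regular expression. The converter is a teaching sketch, not the official PROSITE tooling.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simplified PROSITE-style pattern into a Python regex.
    Handles 'x', [allowed], {forbidden} and (n) / (n,m) repeat counts."""
    regex = ""
    for element in pattern.strip(".").split("-"):
        # Split off an optional repeat count such as (2) or (2,4)
        m = re.match(r"^(.+?)(?:\((\d+)(?:,(\d+))?\))?$", element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":
            piece = "."                        # any residue
        elif core.startswith("["):
            piece = core                       # allowed residues
        elif core.startswith("{"):
            piece = "[^" + core[1:-1] + "]"    # forbidden residues
        else:
            piece = core                       # literal residue
        if lo and hi:
            piece += "{%s,%s}" % (lo, hi)
        elif lo:
            piece += "{%s}" % lo
        regex += piece
    return regex

# The classic P-loop (ATP/GTP-binding) motif:
print(prosite_to_regex("[AG]-x(4)-G-K-[ST]"))   # [AG].{4}GK[ST]
```

The resulting regex can be used with `re.search` to scan a protein sequence for the motif.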
b. PRINTS:
In the PRINTS database, the protein sequence patterns are stored as 'fingerprints'. A
fingerprint is a set of motifs or patterns rather than a single one.
The information contained in the PRINT entry may be divided into three sections. In
addition to entry name, accession number and number of motifs, the first section contains
cross-links to other databases that have more information about the characterized family.
The second section provides a table showing how many of the motifs that make up the
fingerprint occur in how many of the sequences in that family.
The last section of the entry contains the actual fingerprints, which are stored as multiple
aligned sets of sequences; the alignment is made without gaps. There is, therefore, one set
of aligned sequences for each motif.
c. MHCPep:
MHCPep is a database comprising over 13,000 peptide sequences known to bind the
Major Histocompatibility Complex (MHC) of the immune system.
Each entry in the database contains not only the peptide sequence, which may be 8 to 10
amino acids long, but also information on the specific MHC molecules to which it binds,
the experimental method used to assay the peptide, the degree of activity and binding
affinity observed, the source protein that, when broken down, gave rise to the peptide,
the positions along the peptide where it anchors to the MHC molecule, and references
and cross-links to other information.
d. Pfam
Pfam contains protein family profiles built using hidden Markov models (HMMs).
HMMs model a pattern as a series of match, insert, and delete states, with scores
assigned for transitions from one state to another.
Each family or pattern defined in Pfam consists of four elements. The first is the
annotation, which has information on the source used to make the entry, the method used
and some numbers that serve as figures of merit.
The second is the seed alignment that is used to bootstrap the rest of the sequences into
the multiple alignments and then the family.
The third is the HMM profile.
The fourth element is the complete alignment of all the sequences identified in that
family.
UNIT- II
Sequence analysis -Introduction to Sequences, alignments and Dynamic Programming; Pairwise
alignment (BLAST and FASTA Algorithm) and multiple sequence alignment (Clustal W algorithm) and
phylogenetic analysis.
Introduction to Sequences
Biological sequence analysis (bioinformatics) is the study of the relationships between biological
sequences and the implication of these relationships for macromolecular structure, function, and
evolution.
A biological sequence is a single, continuous molecule of nucleic acid or protein. It can be
thought of as a multiple inheritance class hierarchy. One hierarchy is that of the underlying
molecule type: DNA, RNA, or protein.
DNA Sequence:
In biological databases, an RNA sequence will not contain U (uracil); the base is
recorded as thymine (T), just as in a DNA sequence.
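The T-for-U convention can be sketched with two small helpers that convert between the database spelling and the conventional RNA spelling; the function names are illustrative.

```python
# In nucleotide databases, RNA is stored with T in place of U.
# These helpers convert between the two spellings.

def to_database_form(rna_seq):
    """Write an RNA sequence the way a nucleotide database stores it."""
    return rna_seq.upper().replace("U", "T")

def to_rna_form(db_seq):
    """Recover the conventional RNA spelling (U for uracil)."""
    return db_seq.upper().replace("T", "U")

print(to_database_form("AUGGCU"))   # ATGGCT
print(to_rna_form("ATGGCT"))        # AUGGCU
```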
Protein Sequence
Sequence Alignment
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or
protein to identify regions of similarity that may be a consequence of functional, structural, or
evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino
acid residues are typically represented as rows within a matrix. Gaps are inserted between the
residues so that identical or similar characters are aligned in successive columns.
Alignment Methods
Very short or very similar sequences can be aligned by hand. However, most interesting
problems require the alignment of lengthy, highly variable or extremely numerous sequences that
cannot be aligned solely by human effort. Instead, human knowledge is applied in constructing
algorithms to produce high-quality sequence alignments, and occasionally in adjusting the final
results to reflect patterns that are difficult to represent algorithmically (especially in the case of
nucleotide sequences). Computational approaches to sequence alignment generally fall into two
categories: global alignments and local alignments. Calculating a global alignment is a form of
global optimization that "forces" the alignment to span the entire length of all query sequences.
By contrast, local alignments identify regions of similarity within long sequences that are often
widely divergent overall. Local alignments are often preferable, but can be more difficult to
calculate because of the additional challenge of identifying the regions of similarity. A variety of
computational algorithms have been applied to the sequence alignment problem. These include
slow but formally correct methods like dynamic programming. These also include efficient,
heuristic algorithms or probabilistic methods designed for large-scale database search that do not
guarantee to find best matches.
Pairwise Alignment
Pairwise alignment is the process of aligning two DNA, RNA or protein sequences such that the
regions of similarity are maximized. This is often performed to find functional, structural or
evolutionary commonalities. In most cases, scientists use two protein sequences to quantitatively
assess relatedness (i.e., homology). There are two types of sequence alignment: global and
local alignment. Both are computed using dynamic programming.
Dynamic Programming
Global Alignment
The Needleman-Wunsch algorithm (Needleman and Wunsch, 1970) is a dynamic programming
algorithm to identify the optimal pairwise global alignment. The algorithm works by associating
a cost with each edge in the edit graph, and then calculating the highest-scoring path through the
graph. We store these scores in a score matrix S according to the following rule:

S(i,j) = max( S(i-1,j-1) + s(x_i, y_j), S(i-1,j) + g, S(i,j-1) + g )

where s(x_i, y_j) is the match/mismatch score and g is the (negative) gap score. In this formula,
the first term of the maximum corresponds to the score of a match or a mismatch (depending on
x_i and y_j), and the next two terms correspond to the score of indels.
The Si,j term of this matrix corresponds to the optimal alignment score of a substring of x of
length i and a substring of y of length j. Because the edit graph is a DAG, we can calculate the
cells in an order such that all the cells required to take the maximum have already been
calculated.
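The fill step described above can be sketched in a few lines of Python; the match/mismatch/gap values are illustrative defaults, not a standard substitution matrix.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-2):
    """Fill the Needleman-Wunsch score matrix and return the optimal
    global alignment score (traceback omitted for brevity)."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):            # first column: leading gaps in y
        S[i][0] = i * gap
    for j in range(1, m + 1):            # first row: leading gaps in x
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = S[i-1][j-1] + (match if x[i-1] == y[j-1] else mismatch)
            S[i][j] = max(diag, S[i-1][j] + gap, S[i][j-1] + gap)
    return S[n][m]

print(needleman_wunsch("ACGT", "ACGT"))   # 4
```

A full implementation would also record which term of the maximum was taken at each cell, so the alignment itself can be recovered by traceback.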
Local Alignment
The Smith-Waterman algorithm (Smith and Waterman, 1981) is a dynamic programming
algorithm to identify the optimal local pairwise alignment. It closely resembles the Needleman-
Wunsch algorithm, but differs in that it identifies locally rather than globally optimal alignments.
To accomplish this, a simple change to the definition of S is required:

S(i,j) = max( 0, S(i-1,j-1) + s(x_i, y_j), S(i-1,j) + g, S(i,j-1) + g )

Here the 0 term captures the fact that a local alignment can start anywhere. If a region of an
alignment was known to have negative score, we could always remove it to produce a superior
local alignment, so the 0 term essentially wipes the slate clean whenever the alignment score
becomes negative in a local region.
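The same sketch with the extra 0 term gives Smith-Waterman; again the scoring values are illustrative, and the best local score can end anywhere in the matrix, so we track the maximum over all cells.

```python
def smith_waterman(x, y, match=1, mismatch=-1, gap=-2):
    """Return the optimal local alignment score (traceback omitted)."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = S[i-1][j-1] + (match if x[i-1] == y[j-1] else mismatch)
            # The 0 term lets a local alignment restart anywhere.
            S[i][j] = max(0, diag, S[i-1][j] + gap, S[i][j-1] + gap)
            best = max(best, S[i][j])
    return best

print(smith_waterman("TTTACGTTT", "GGGACGGGG"))   # 3 (the shared "ACG")
```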
Dynamic programming
A direct method for producing an MSA uses the dynamic programming technique to identify the
globally optimal alignment solution. For proteins, this method usually involves two sets of
parameters: a gap penalty and a substitution matrix assigning scores or probabilities to the
alignment of each possible pair of amino acids based on the similarity of the amino acids'
chemical properties and the evolutionary probability of the mutation. For nucleotide sequences, a
similar gap penalty is used, but a much simpler substitution matrix, wherein only identical
matches and mismatches are considered, is typical. The scores in the substitution matrix may be
either all positive or a mix of positive and negative in the case of a global alignment, but must be
both positive and negative, in the case of a local alignment.
For n individual sequences, the naive method requires constructing the n-dimensional equivalent
of the matrix formed in standard pairwise sequence alignment. The search space thus increases
exponentially with increasing n and is also strongly dependent on sequence length. Expressed
with the big O notation commonly used to measure computational complexity, a naïve MSA
takes O(Length^Nseqs) time to produce. To find the global optimum for n sequences this way has
been shown to be an NP-complete problem. In 1989, based on Carrillo-Lipman Algorithm,
Altschul introduced a practical method that uses pairwise alignments to constrain the n-
dimensional search space. In this approach pairwise dynamic programming alignments are
performed on each pair of sequences in the query set, and only the space near the n-dimensional
intersection of these alignments is searched for the n-way alignment. The MSA program
optimizes the sum of all of the pairs of characters at each position in the alignment (the so-called
sum of pair score) and has been implemented in a software program for constructing multiple
sequence alignments. In 2019, Hosseininasab and van Hoeve showed that by using decision
diagrams, MSA may be modeled in polynomial space complexity.
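The sum-of-pairs objective mentioned above can be sketched as follows. The scoring values are illustrative; real tools use substitution matrices and affine gap penalties.

```python
from itertools import combinations

def sum_of_pairs(column, match=1, mismatch=-1, gap_penalty=-1):
    """Score one alignment column as the sum over all residue pairs."""
    score = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue                     # a pair of gaps scores 0
        elif a == "-" or b == "-":
            score += gap_penalty
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

def sp_score(msa):
    """Total sum-of-pairs score of an MSA given as equal-length rows."""
    return sum(sum_of_pairs(col) for col in zip(*msa))

print(sp_score(["ACGT", "ACGT", "AC-T"]))   # 8
```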
Iterative methods
A set of methods to produce MSAs while reducing the errors inherent in progressive methods are
classified as "iterative" because they work similarly to progressive methods but repeatedly
realign the initial sequences as well as adding new sequences to the growing MSA. One reason
progressive methods are so strongly dependent on a high-quality initial alignment is the fact that
these alignments are always incorporated into the final result — that is, once a sequence has been
aligned into the MSA, its alignment is not considered further. This approximation improves
efficiency at the cost of accuracy. By contrast, iterative methods can return to previously
calculated pairwise alignments or sub-MSAs incorporating subsets of the query sequence as a
means of optimizing a general objective function such as finding a high-quality alignment score.
A variety of subtly different iteration methods have been implemented and made available in
software packages; reviews and comparisons have been useful but generally refrain from
choosing a "best" technique. The software package PRRN/PRRP uses a hill-climbing algorithm
to optimize its MSA alignment score and iteratively corrects both alignment weights and locally
divergent or "gappy" regions of the growing MSA. PRRP performs best when refining an
alignment previously constructed by a faster method.
Another iterative program, DIALIGN, takes an unusual approach of focusing narrowly on local
alignments between sub-segments or sequence motifs without introducing a gap penalty. The
alignment of individual motifs is then achieved with a matrix representation similar to a dot-
matrix plot in a pairwise alignment. An alternative method that uses fast local alignments as
anchor points or "seeds" for a slower global-alignment procedure is implemented in the
CHAOS/DIALIGN suite.
A third popular iteration-based method called MUSCLE (multiple sequence alignment by log-
expectation) improves on progressive methods with a more accurate distance measure to assess
the relatedness of two sequences. The distance measure is updated between iteration stages
(although, in its original form, MUSCLE contained only 2-3 iterations depending on whether
refinement was enabled).
Consensus methods
Consensus methods attempt to find the optimal multiple sequence alignment given multiple
different alignments of the same set of sequences. There are two commonly used consensus
methods, M-COFFEE and MergeAlign. M-COFFEE uses multiple sequence alignments
generated by seven different methods to generate consensus alignments. MergeAlign is capable
of generating consensus alignments from any number of input alignments generated using
different models of sequence evolution or different methods of multiple sequence alignment. The
default option for MergeAlign is to infer a consensus alignment using alignments generated
using 91 different models of protein sequence evolution.
Phylogeny-aware methods
Most multiple sequence alignment methods try to minimize the number of insertions/deletions
(gaps) and, as a consequence, produce compact alignments. This causes several problems if the
sequences to be aligned contain non-homologous regions, or if gaps are informative in a phylogeny
analysis. These problems are common in newly produced sequences that are poorly annotated
and may contain frame-shifts, wrong domains or non-homologous spliced exons. The first such
method was developed in 2005 by Löytynoja and Goldman. The same authors released a
software package called PRANK in 2008. PRANK improves alignments when insertions are
present. Nevertheless, it runs slowly compared to progressive and/or iterative methods which
have been developed for several years.
In 2012, two new phylogeny-aware tools appeared. One is called PAGAN that was developed by
the same team as PRANK. The other is ProGraphMSA developed by Szalkowski. Both software
packages were developed independently but share common features, notably the use of graph
algorithms to improve the recognition of non-homologous regions, and an improvement in code
making these software faster than PRANK.
Motif finding
Motif finding, also known as profile analysis, is a method of locating sequence motifs in global
MSAs that is both a means of producing a better MSA and a means of producing a scoring
matrix for use in searching other sequences for similar motifs. A variety of methods for isolating
the motifs have been developed, but all are based on identifying short highly conserved patterns
within the larger alignment and constructing a matrix similar to a substitution matrix that reflects
the amino acid or nucleotide composition of each position in the putative motif. The alignment
can then be refined using these matrices. In standard profile analysis, the matrix includes entries
for each possible character as well as entries for gaps. Alternatively, statistical pattern-finding
algorithms can identify motifs as a precursor to an MSA rather than as a derivation. In many
cases when the query set contains only a small number of sequences or contains only highly
related sequences, pseudocounts are added to normalize the distribution reflected in the scoring
matrix. In particular, this corrects zero-probability entries in the matrix to values that are small
but nonzero.
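The profile matrix with pseudocounts described above can be sketched as follows; this is a generic illustration of profile construction, not the matrix format of any specific tool.

```python
from collections import Counter

def position_frequency_matrix(motif_instances, alphabet="ACGT", pseudocount=1):
    """Build a per-position frequency matrix from aligned, ungapped motif
    instances, adding a pseudocount so that no entry has zero probability."""
    length = len(motif_instances[0])
    matrix = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in motif_instances)
        total = len(motif_instances) + pseudocount * len(alphabet)
        matrix.append({c: (counts[c] + pseudocount) / total for c in alphabet})
    return matrix

pfm = position_frequency_matrix(["ACG", "ACG", "ATG"])
print(pfm[1])   # position 2: C is most frequent, but T and the unseen
                # bases still get small nonzero probabilities
```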
Blocks analysis is a method of motif finding that restricts motifs to ungapped regions in the
alignment. Blocks can be generated from an MSA or they can be extracted from unaligned
sequences using a precalculated set of common motifs previously generated from known gene
families. Block scoring generally relies on the spacing of high-frequency characters rather than
on the calculation of an explicit substitution matrix. The BLOCKS server provides an interactive
method to locate such motifs in unaligned sequences.
Statistical pattern-matching has been implemented using both the expectation-maximization
algorithm and the Gibbs sampler. One of the most common motif-finding tools, known as
MEME, uses expectation maximization and hidden Markov methods to generate motifs that are
then used as search tools by its companion MAST in the combined suite MEME/MAST.
MSA Algorithm
There are three important algorithms used in MSA for optimization. They are (i) genetic algorithms
and simulated annealing, (ii) mathematical programming and exact solution algorithms, and (iii)
simulated quantum computing.
Genetic algorithms and simulated annealing
Standard optimization techniques in computer science — both of which were inspired by, but do
not directly reproduce, physical processes — have also been used in an attempt to more
efficiently produce quality MSAs. One such technique, genetic algorithms, has been used for
MSA production in an attempt to broadly simulate the hypothesized evolutionary process that
gave rise to the divergence in the query set. The method works by breaking a series of possible
MSAs into fragments and repeatedly rearranging those fragments with the introduction of gaps at
varying positions. A general objective function is optimized during the simulation, most
generally the "sum of pairs" maximization function introduced in dynamic programming-based
MSA methods. A technique for protein sequences has been implemented in the software program
SAGA (Sequence Alignment by Genetic Algorithm) and its equivalent in RNA is called RAGA.
In simulated annealing, an existing MSA produced by another method is refined by a series of
rearrangements designed to find better regions of alignment space than the one the input
alignment already occupies. Like the genetic algorithm method, simulated annealing maximizes
an objective function like the sum-of-pairs function. Simulated annealing
uses a metaphorical "temperature factor" that determines the rate at which rearrangements
proceed and the likelihood of each rearrangement; typical usage alternates periods of high
rearrangement rates with relatively low likelihood (to explore more distant regions of alignment
space) with periods of lower rates and higher likelihoods to more thoroughly explore local
minima near the newly "colonized" regions. This approach has been implemented in the program
MSASA (Multiple Sequence Alignment by Simulated Annealing).
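The role of the temperature factor can be sketched with a generic annealing loop. This is a schematic of the technique, not the actual MSASA code; 'state' would be an alignment, 'propose' a rearrangement move, and all names are illustrative.

```python
import math
import random

def anneal(state, score, propose, steps=1000, t0=2.0, cooling=0.995):
    """Generic simulated-annealing loop: maximize score(state) by
    repeatedly proposing rearrangements and cooling the temperature."""
    current, current_score = state, score(state)
    temperature = t0
    for _ in range(steps):
        candidate = propose(current)
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse moves with probability
        # exp(delta / T), which shrinks as the temperature cools.
        if delta >= 0 or random.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
        temperature *= cooling
    return current, current_score
```

Early on, the high temperature lets the search escape local optima by accepting worse rearrangements; as the temperature drops, the loop settles into refining the best region found.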
Mathematical programming and exact solution algorithms
Mathematical programming and in particular Mixed integer programming models are another
approach to solve MSA problems. The advantage of such optimization models is that they can be
used to find the optimal MSA solution more efficiently compared to the traditional DP approach.
This is due in part, to the applicability of decomposition techniques for mathematical programs,
where the MSA model is decomposed into smaller parts and iteratively solved until the optimal
solution is found. Example algorithms used to solve mixed integer programming models of MSA
include branch and price and Benders decomposition. Although exact approaches are
computationally slow compared to heuristic algorithms for MSA, they are guaranteed to reach
the optimal solution eventually, even for large-size problems.
Simulated quantum computing
In January 2017, D-Wave Systems announced that its qbsolv open-source quantum computing
software had been successfully used to find a faster solution to the MSA problem.
PHYLOGENETIC ANALYSIS
Phylogenetic analysis provides an in-depth understanding of how species evolve through genetic
changes. Using phylogenetics, scientists can evaluate the path that connects a present-day
organism with its ancestral origin, as well as can predict the genetic divergence that may occur in
the future.
Phylogenetics is important because it enriches our understanding of how genes, genomes, species
(and molecular sequences more generally) evolve.
The results of phylogenetic analysis are represented as a tree, called a phylogenetic
tree. A phylogenetic tree (also phylogeny or evolutionary tree) is a branching diagram or a tree
showing the evolutionary relationships among various biological species or other entities based
upon similarities and differences in their physical or genetic characteristics. All life on Earth is
part of a single phylogenetic tree, indicating common ancestry.
In a rooted phylogenetic tree, each node with descendants represents the inferred most recent
common ancestor of those descendants, and the edge lengths in some trees may be interpreted as
time estimates. Each node is called a taxonomic unit. Internal nodes are generally called
hypothetical taxonomic units, as they cannot be directly observed. Trees are useful in fields of
biology such as bioinformatics, systematics, and phylogenetics. Unrooted trees illustrate only the
relatedness of the leaf nodes and do not require the ancestral root to be known or inferred.
Construction of Phylogenetic Tree
Phylogenetic trees built from a nontrivial number of input sequences are constructed using
computational phylogenetics methods. Distance-matrix methods such as neighbor-joining or
UPGMA, which calculate genetic distance from multiple sequence alignments, are simplest to
implement, but do not invoke an evolutionary model. Many sequence alignment methods such as
ClustalW also create trees by using the simpler algorithms (i.e. those based on distance) of tree
construction. Maximum parsimony is another simple method of estimating phylogenetic trees,
but implies an implicit model of evolution (i.e. parsimony). More advanced methods use the
optimality criterion of maximum likelihood, often within a Bayesian framework, and apply an
explicit model of evolution to phylogenetic tree estimation. Identifying the optimal tree using
many of these techniques is NP-hard, so heuristic search and optimization methods are used in
combination with tree-scoring functions to identify a reasonably good tree that fits the data.
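As an example of the distance-matrix methods mentioned above, here is a didactic UPGMA sketch. It returns only the tree topology as nested tuples; real implementations also compute branch lengths.

```python
def upgma(labels, dist):
    """UPGMA tree construction from pairwise distances given as a dict
    keyed by sorted label pairs, e.g. dist[("A", "B")] = 2.0.  Repeatedly
    merges the closest pair of clusters, scoring cluster distance as the
    average leaf-to-leaf distance."""
    clusters = [(lab,) for lab in labels]      # each cluster: tuple of leaves
    trees = {c: lab for c, lab in zip(clusters, labels)}

    def cdist(c1, c2):
        pairs = [tuple(sorted((a, b))) for a in c1 for b in c2]
        return sum(dist[p] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        candidates = [(c1, c2) for i, c1 in enumerate(clusters)
                      for c2 in clusters[i + 1:]]
        c1, c2 = min(candidates, key=lambda p: cdist(*p))
        merged = c1 + c2
        trees[merged] = (trees[c1], trees[c2])  # record the new subtree
        clusters = [c for c in clusters if c not in (c1, c2)] + [merged]
    return trees[clusters[0]]

# Three leaves where A and B are closest: they are joined first.
print(upgma(["A", "B", "C"],
            {("A", "B"): 2.0, ("A", "C"): 4.0, ("B", "C"): 4.0}))
# ('C', ('A', 'B'))
```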
Tree-building methods can be assessed on the basis of several criteria:
efficiency (how long does it take to compute the answer, how much memory does it need?)
power (does it make good use of the data, or is information being wasted?)
consistency (will it converge on the same answer repeatedly, if each time given different data for
the same model problem?)
robustness (does it cope well with violations of the assumptions of the underlying model?)
falsifiability (does it alert us when it is not good to use, i.e. when assumptions are violated?)
Tree-building techniques have also gained the attention of mathematicians. Trees can also be
built using T-theory.