UNIT – I
Bioinformatics
Objectives of Bioinformatics
o Development of new algorithms and statistics for assessing the relationships among large
sets of biological data.
o Application of these tools for the analysis and interpretation of the various biological
data.
o Development of databases for efficient storage, access and management of the large
body of biological information.
Components of Bioinformatics
Scope of Bioinformatics
Application of Bioinformatics
Biological Data
o Nucleic Acid Sequences - Raw DNA Sequences, Genomic sequence tags (GSTs), cDNA
sequences, Expressed sequence tags (ESTs), Organellar DNA sequences and RNA
Sequences.
o Protein sequences.
o Protein structures
o Metabolic pathways
o Gel pictures
o Literature
Biological Database
A database is a vast collection of data pertaining to a specific topic e.g. nucleotide sequence,
protein sequence, protein structure etc., in an electronic environment.
o They are the heart of bioinformatics.
o Computerized storehouse of data (records).
o Allows extraction of specified records.
o Allows adding, changing, removing, and merging of records.
o Uses standardized formats.
NUCLEIC ACID
ENTREZ in NCBI
The NCBI developed and maintains Entrez, a biological database retrieval system. It is a
gateway that allows text-based searches for a wide variety of data, including annotated genetic
sequence information, structural information, as well as citations and abstracts, full research
papers and taxonomic data. The key feature of Entrez is its ability to integrate information, which
comes from cross-referencing between NCBI databases based on pre-existing and logical
relationships between individual entries.
DNA DATA BANK OF JAPAN (DDBJ)
DDBJ is an annotated collection of all publicly available nucleotide and protein sequences, started
in 1984 at the National Institute of Genetics (NIG) in Mishima. The institute maintains the
DDBJ with a team led by Takashi Gojobori. Since 1987, the DDBJ has been collecting annotated
nucleotide sequences as its traditional database service. This endeavor has been conducted in
collaboration with GenBank at the National Center for Biotechnology Information and with the
European Molecular Biology Laboratory at the European Bioinformatics Institute. The
collaborative framework is called the International Nucleotide Sequence Database Collaboration
(INSDC). Because DDBJ mirrors its information daily with GenBank and EMBL, beginning
sequence searchers may simply use whichever database has the friendlier searching interface.
However, DDBJ also offers all its pages in Japanese, so if you are more comfortable reading the
Japanese versions of the pages, it can be very useful.
The mission of DDBJ: biological data resources are diverse, and some are Very Large Scale
Databases (VLSD); there are diverse requirements for integrating these biological data
resources; DDBJ aims to contribute to the interoperability of biological data resources.
EUROPEAN MOLECULAR BIOLOGY LABORATORY (EMBL)
EMBL MISSION
To provide freely available data and bioinformatics services to all facets of the scientific
community in ways that promote scientific progress.
Contribute to the advancement of biology through basic investigator-driven research in
bioinformatics.
Provide advanced bioinformatics training to scientists at all levels, from PhD students to
independent investigators.
Help disseminate cutting-edge technologies to industry.
Coordinate biological data provision throughout Europe.
PROTEIN DATABASES
As biology has increasingly turned into a data-rich science, the need for storing and
communicating large datasets has grown tremendously.
The obvious examples are the nucleotide sequences, the protein sequences, and the 3D structural
data produced by X-ray crystallography and macromolecular NMR.
The biological information of proteins is available as sequences and structures. Sequences are
represented in a single dimension whereas the structure contains the three-dimensional data of
sequences.
A biological database is a collection of data that is organized so that its contents can easily be
accessed, managed, and updated.
A protein database is one or more datasets about proteins, which could include a protein's amino
acid sequence, conformation, structure, and features such as active sites.
Protein databases are compiled by the translation of DNA sequences from different gene
databases and include structural information. They are an important resource because proteins
mediate most biological functions.
b. SWISS-PROT
The other well-known and extensively used protein database is SWISS-PROT. Like the PIR-
PSD, this curated protein sequence database also provides a high level of annotation.
The data in each entry can be considered separately as core data and annotation.
The core data consists of the sequences entered in common single letter amino acid code, and the
related references and bibliography. The taxonomy of the organism from which the sequence
was obtained also forms part of this core information.
The annotation contains information on the function or functions of the protein; post-
translational modifications such as phosphorylation, acetylation, etc.; functional and structural
domains and sites, such as calcium-binding regions, ATP-binding sites, zinc fingers, etc.; known
secondary structural features, for example alpha helices and beta sheets; the quaternary
structure of the protein; similarities to other proteins, if any; and diseases that may arise from
deficiencies in the protein. Conflicts that occur when different authors publish different
sequences for the same protein, or that arise from mutations in different strains of an organism,
are also described as part of the annotation.
TrEMBL (for Translated EMBL) is a computer-annotated protein sequence database that is
released as a supplement to SWISS-PROT. It contains the translation of all coding sequences
present in the EMBL Nucleotide database, which have not been fully annotated. Thus it may
contain the sequence of proteins that are never expressed and never actually identified in the
organisms.
a. PROSITE:
A set of databases collects together patterns found in protein sequences rather than the
complete sequences. PROSITE is one such pattern database.
The protein motifs and patterns are encoded as "regular expressions".
The information corresponding to each entry in PROSITE takes two forms: the pattern
itself and the related descriptive text.
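As an illustration of how such patterns map onto regular expressions, the sketch below converts a simplified PROSITE-style pattern (elements separated by '-', 'x' for any residue, [..] for allowed residues, {..} for forbidden residues, and (n) or (n,m) for repeat counts) into a Python regular expression. The converter is a teaching sketch, not the official PROSITE tooling.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simplified PROSITE-style pattern into a Python regex.
    Handles 'x', [allowed], {forbidden} and (n) / (n,m) repeat counts."""
    regex = ""
    for element in pattern.strip(".").split("-"):
        # Split off an optional repeat count such as (2) or (2,4)
        m = re.match(r"^(.+?)(?:\((\d+)(?:,(\d+))?\))?$", element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":
            piece = "."                        # any residue
        elif core.startswith("["):
            piece = core                       # allowed residues
        elif core.startswith("{"):
            piece = "[^" + core[1:-1] + "]"    # forbidden residues
        else:
            piece = core                       # literal residue
        if lo and hi:
            piece += "{%s,%s}" % (lo, hi)
        elif lo:
            piece += "{%s}" % lo
        regex += piece
    return regex

# The classic P-loop (ATP/GTP-binding) motif:
print(prosite_to_regex("[AG]-x(4)-G-K-[ST]"))   # [AG].{4}GK[ST]
```

The resulting regex can be used with `re.search` to scan a protein sequence for the motif.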
b. PRINTS:
In the PRINTS database, the protein sequence patterns are stored as 'fingerprints'. A
fingerprint is a set of motifs or patterns rather than a single one.
The information contained in the PRINT entry may be divided into three sections. In
addition to entry name, accession number and number of motifs, the first section contains
cross-links to other databases that have more information about the characterized family.
The second section provides a table showing how many of the motifs that make up the
fingerprint occur in how many of the sequences in that family.
The last section of the entry contains the actual fingerprints, which are stored as multiple
aligned sets of sequences; the alignment is made without gaps. There is, therefore, one set
of aligned sequences for each motif.
c. MHCPep:
MHCPep is a database comprising over 13,000 peptide sequences known to bind the
Major Histocompatibility Complex (MHC) of the immune system.
Each entry in the database contains not only the peptide sequence, which may be 8 to 10
amino acids long, but also information on the specific MHC molecules to which it binds,
the experimental method used to assay the peptide, the degree of activity and binding
affinity observed, the source protein that, when broken down, gave rise to the peptide,
the positions along the peptide where it anchors to the MHC molecule, and references
and cross-links to other information.
d. Pfam
Pfam contains protein family profiles built using hidden Markov models (HMMs).
HMMs model a pattern as a series of match, insert, and delete states, with scores
assigned for transitions from one state to another.
Each family or pattern defined in Pfam consists of four elements. The first is the
annotation, which has information on the source used to make the entry, the method used
and some numbers that serve as figures of merit.
The second is the seed alignment that is used to bootstrap the rest of the sequences into
the multiple alignments and then the family.
The third is the HMM profile.
The fourth element is the complete alignment of all the sequences identified in that
family.
UNIT- II
Sequence analysis -Introduction to Sequences, alignments and Dynamic Programming; Pairwise
alignment (BLAST and FASTA Algorithm) and multiple sequence alignment (Clustal W algorithm) and
phylogenetic analysis.
Introduction to Sequences
Biological sequence analysis (bioinformatics) is the study of the relationships between biological
sequences and the implication of these relationships for macromolecular structure, function, and
evolution.
A biological sequence is a single, continuous molecule of nucleic acid or protein. It can be
thought of as a multiple inheritance class hierarchy. One hierarchy is that of the underlying
molecule type: DNA, RNA, or protein.
DNA Sequence:
In biological databases, an RNA sequence will not contain U (uracil); the base is
recorded as thymine (T), just as in a DNA sequence.
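The T-for-U convention can be sketched with two small helpers that convert between the database spelling and the conventional RNA spelling; the function names are illustrative.

```python
# In nucleotide databases, RNA is stored with T in place of U.
# These helpers convert between the two spellings.

def to_database_form(rna_seq):
    """Write an RNA sequence the way a nucleotide database stores it."""
    return rna_seq.upper().replace("U", "T")

def to_rna_form(db_seq):
    """Recover the conventional RNA spelling (U for uracil)."""
    return db_seq.upper().replace("T", "U")

print(to_database_form("AUGGCU"))   # ATGGCT
print(to_rna_form("ATGGCT"))        # AUGGCU
```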
Protein Sequence
Sequence Alignment
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or
protein to identify regions of similarity that may be a consequence of functional, structural, or
evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino
acid residues are typically represented as rows within a matrix. Gaps are inserted between the
residues so that identical or similar characters are aligned in successive columns.
Alignment Methods
Very short or very similar sequences can be aligned by hand. However, most interesting
problems require the alignment of lengthy, highly variable or extremely numerous sequences that
cannot be aligned solely by human effort. Instead, human knowledge is applied in constructing
algorithms to produce high-quality sequence alignments, and occasionally in adjusting the final
results to reflect patterns that are difficult to represent algorithmically (especially in the case of
nucleotide sequences). Computational approaches to sequence alignment generally fall into two
categories: global alignments and local alignments. Calculating a global alignment is a form of
global optimization that "forces" the alignment to span the entire length of all query sequences.
By contrast, local alignments identify regions of similarity within long sequences that are often
widely divergent overall. Local alignments are often preferable, but can be more difficult to
calculate because of the additional challenge of identifying the regions of similarity. A variety of
computational algorithms have been applied to the sequence alignment problem. These include
slow but formally correct methods like dynamic programming. These also include efficient,
heuristic algorithms or probabilistic methods designed for large-scale database search that do not
guarantee to find best matches.
Pairwise Alignment
Pairwise alignment is the process of aligning two DNA, RNA or protein sequences such that the
regions of similarity are maximized. This is often performed to find functional, structural or
evolutionary commonalities. In most cases, scientists use two protein sequences to quantitatively
assess relatedness (i.e., homology). There are two types of sequence alignment: global and
local alignment. Both are computed using dynamic programming.
Dynamic Programming
Global Alignment
The Needleman-Wunsch algorithm (Needleman and Wunsch, 1970) is a dynamic programming
algorithm to identify the optimal pairwise global alignment. The algorithm works by associating
a cost with each edge in the edit graph, and then calculating the highest-scoring path through the
graph. We store these scores in a score matrix S according to the following rule:

S(i,j) = max( S(i-1,j-1) + s(x_i, y_j), S(i-1,j) + g, S(i,j-1) + g )

where s(x_i, y_j) is the match/mismatch score and g is the (negative) gap score. In this formula,
the first term of the maximum corresponds to the score of a match or a mismatch (depending on
x_i and y_j), and the next two terms correspond to the score of indels.
The Si,j term of this matrix corresponds to the optimal alignment score of a substring of x of
length i and a substring of y of length j. Because the edit graph is a DAG, we can calculate the
cells in an order such that all the cells required to take the maximum have already been
calculated.
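The fill step described above can be sketched in a few lines of Python; the match/mismatch/gap values are illustrative defaults, not a standard substitution matrix.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-2):
    """Fill the Needleman-Wunsch score matrix and return the optimal
    global alignment score (traceback omitted for brevity)."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):            # first column: leading gaps in y
        S[i][0] = i * gap
    for j in range(1, m + 1):            # first row: leading gaps in x
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = S[i-1][j-1] + (match if x[i-1] == y[j-1] else mismatch)
            S[i][j] = max(diag, S[i-1][j] + gap, S[i][j-1] + gap)
    return S[n][m]

print(needleman_wunsch("ACGT", "ACGT"))   # 4
```

A full implementation would also record which term of the maximum was taken at each cell, so the alignment itself can be recovered by traceback.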
Local Alignment
The Smith-Waterman algorithm (Smith and Waterman, 1981) is a dynamic programming
algorithm to identify the optimal local pairwise alignment. It closely resembles the Needleman-
Wunsch algorithm, but differs in that it identifies locally rather than globally optimal alignments.
To accomplish this, a simple change to the definition of S is required:

S(i,j) = max( 0, S(i-1,j-1) + s(x_i, y_j), S(i-1,j) + g, S(i,j-1) + g )

Here the 0 term captures the fact that a local alignment can start anywhere. If a region of an
alignment was known to have negative score, we could always remove it to produce a superior
local alignment, so the 0 term essentially wipes the slate clean whenever the alignment score
becomes negative in a local region.
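The same sketch with the extra 0 term gives Smith-Waterman; again the scoring values are illustrative, and the best local score can end anywhere in the matrix, so we track the maximum over all cells.

```python
def smith_waterman(x, y, match=1, mismatch=-1, gap=-2):
    """Return the optimal local alignment score (traceback omitted)."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = S[i-1][j-1] + (match if x[i-1] == y[j-1] else mismatch)
            # The 0 term lets a local alignment restart anywhere.
            S[i][j] = max(0, diag, S[i-1][j] + gap, S[i][j-1] + gap)
            best = max(best, S[i][j])
    return best

print(smith_waterman("TTTACGTTT", "GGGACGGGG"))   # 3 (the shared "ACG")
```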
Dynamic programming
A direct method for producing an MSA uses the dynamic programming technique to identify the
globally optimal alignment solution. For proteins, this method usually involves two sets of
parameters: a gap penalty and a substitution matrix assigning scores or probabilities to the
alignment of each possible pair of amino acids based on the similarity of the amino acids'
chemical properties and the evolutionary probability of the mutation. For nucleotide sequences, a
similar gap penalty is used, but a much simpler substitution matrix, wherein only identical
matches and mismatches are considered, is typical. The scores in the substitution matrix may be
either all positive or a mix of positive and negative in the case of a global alignment, but must be
both positive and negative, in the case of a local alignment.
For n individual sequences, the naive method requires constructing the n-dimensional equivalent
of the matrix formed in standard pairwise sequence alignment. The search space thus increases
exponentially with increasing n and is also strongly dependent on sequence length. Expressed
with the big O notation commonly used to measure computational complexity, a naïve MSA
takes O(Length^Nseqs) time to produce. To find the global optimum for n sequences this way has
been shown to be an NP-complete problem. In 1989, based on Carrillo-Lipman Algorithm,
Altschul introduced a practical method that uses pairwise alignments to constrain the n-
dimensional search space. In this approach pairwise dynamic programming alignments are
performed on each pair of sequences in the query set, and only the space near the n-dimensional
intersection of these alignments is searched for the n-way alignment. The MSA program
optimizes the sum of all of the pairs of characters at each position in the alignment (the so-called
sum of pair score) and has been implemented in a software program for constructing multiple
sequence alignments. In 2019, Hosseininasab and van Hoeve showed that by using decision
diagrams, MSA may be modeled in polynomial space complexity.
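The sum-of-pairs objective mentioned above can be sketched as follows. The scoring values are illustrative; real tools use substitution matrices and affine gap penalties.

```python
from itertools import combinations

def sum_of_pairs(column, match=1, mismatch=-1, gap_penalty=-1):
    """Score one alignment column as the sum over all residue pairs."""
    score = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue                     # a pair of gaps scores 0
        elif a == "-" or b == "-":
            score += gap_penalty
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

def sp_score(msa):
    """Total sum-of-pairs score of an MSA given as equal-length rows."""
    return sum(sum_of_pairs(col) for col in zip(*msa))

print(sp_score(["ACGT", "ACGT", "AC-T"]))   # 8
```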
Iterative methods
A set of methods to produce MSAs while reducing the errors inherent in progressive methods are
classified as "iterative" because they work similarly to progressive methods but repeatedly
realign the initial sequences as well as adding new sequences to the growing MSA. One reason
progressive methods are so strongly dependent on a high-quality initial alignment is the fact that
these alignments are always incorporated into the final result — that is, once a sequence has been
aligned into the MSA, its alignment is not considered further. This approximation improves
efficiency at the cost of accuracy. By contrast, iterative methods can return to previously
calculated pairwise alignments or sub-MSAs incorporating subsets of the query sequence as a
means of optimizing a general objective function such as finding a high-quality alignment score.
A variety of subtly different iteration methods have been implemented and made available in
software packages; reviews and comparisons have been useful but generally refrain from
choosing a "best" technique. The software package PRRN/PRRP uses a hill-climbing algorithm
to optimize its MSA alignment score and iteratively corrects both alignment weights and locally
divergent or "gappy" regions of the growing MSA. PRRP performs best when refining an
alignment previously constructed by a faster method.
Another iterative program, DIALIGN, takes an unusual approach of focusing narrowly on local
alignments between sub-segments or sequence motifs without introducing a gap penalty. The
alignment of individual motifs is then achieved with a matrix representation similar to a dot-
matrix plot in a pairwise alignment. An alternative method that uses fast local alignments as
anchor points or "seeds" for a slower global-alignment procedure is implemented in the
CHAOS/DIALIGN suite.
A third popular iteration-based method called MUSCLE (multiple sequence alignment by log-
expectation) improves on progressive methods with a more accurate distance measure to assess
the relatedness of two sequences. The distance measure is updated between iteration stages
(although, in its original form, MUSCLE contained only 2-3 iterations depending on whether
refinement was enabled).
Consensus methods
Consensus methods attempt to find the optimal multiple sequence alignment given multiple
different alignments of the same set of sequences. There are two commonly used consensus
methods, M-COFFEE and MergeAlign. M-COFFEE uses multiple sequence alignments
generated by seven different methods to generate consensus alignments. MergeAlign is capable
of generating consensus alignments from any number of input alignments generated using
different models of sequence evolution or different methods of multiple sequence alignment. The
default option for MergeAlign is to infer a consensus alignment using alignments generated
using 91 different models of protein sequence evolution.
Phylogeny-aware methods
Most multiple sequence alignment methods try to minimize the number of insertions/deletions
(gaps) and, as a consequence, produce compact alignments. This causes several problems if the
sequences to be aligned contain non-homologous regions, or if gaps are informative in a phylogeny
analysis. These problems are common in newly produced sequences that are poorly annotated
and may contain frame-shifts, wrong domains or non-homologous spliced exons. The first such
method was developed in 2005 by Löytynoja and Goldman. The same authors released a
software package called PRANK in 2008. PRANK improves alignments when insertions are
present. Nevertheless, it runs slowly compared to progressive and/or iterative methods which
have been developed for several years.
In 2012, two new phylogeny-aware tools appeared. One is called PAGAN that was developed by
the same team as PRANK. The other is ProGraphMSA developed by Szalkowski. Both software
packages were developed independently but share common features, notably the use of graph
algorithms to improve the recognition of non-homologous regions, and an improvement in code
making these software faster than PRANK.
Motif finding
Motif finding, also known as profile analysis, is a method of locating sequence motifs in global
MSAs that is both a means of producing a better MSA and a means of producing a scoring
matrix for use in searching other sequences for similar motifs. A variety of methods for isolating
the motifs have been developed, but all are based on identifying short highly conserved patterns
within the larger alignment and constructing a matrix similar to a substitution matrix that reflects
the amino acid or nucleotide composition of each position in the putative motif. The alignment
can then be refined using these matrices. In standard profile analysis, the matrix includes entries
for each possible character as well as entries for gaps. Alternatively, statistical pattern-finding
algorithms can identify motifs as a precursor to an MSA rather than as a derivation. In many
cases when the query set contains only a small number of sequences or contains only highly
related sequences, pseudocounts are added to normalize the distribution reflected in the scoring
matrix. In particular, this corrects zero-probability entries in the matrix to values that are small
but nonzero.
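The profile matrix with pseudocounts described above can be sketched as follows; this is a generic illustration of profile construction, not the matrix format of any specific tool.

```python
from collections import Counter

def position_frequency_matrix(motif_instances, alphabet="ACGT", pseudocount=1):
    """Build a per-position frequency matrix from aligned, ungapped motif
    instances, adding a pseudocount so that no entry has zero probability."""
    length = len(motif_instances[0])
    matrix = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in motif_instances)
        total = len(motif_instances) + pseudocount * len(alphabet)
        matrix.append({c: (counts[c] + pseudocount) / total for c in alphabet})
    return matrix

pfm = position_frequency_matrix(["ACG", "ACG", "ATG"])
print(pfm[1])   # position 2: C is most frequent, but T and the unseen
                # bases still get small nonzero probabilities
```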
Blocks analysis is a method of motif finding that restricts motifs to ungapped regions in the
alignment. Blocks can be generated from an MSA or they can be extracted from unaligned
sequences using a precalculated set of common motifs previously generated from known gene
families. Block scoring generally relies on the spacing of high-frequency characters rather than
on the calculation of an explicit substitution matrix. The BLOCKS server provides an interactive
method to locate such motifs in unaligned sequences.
Statistical pattern-matching has been implemented using both the expectation-maximization
algorithm and the Gibbs sampler. One of the most common motif-finding tools, known as
MEME, uses expectation maximization and hidden Markov methods to generate motifs that are
then used as search tools by its companion MAST in the combined suite MEME/MAST.
MSA Algorithm
There are three important algorithms used in MSA for optimization. They are (i) genetic algorithms
and simulated annealing, (ii) mathematical programming and exact solution algorithms, and (iii)
simulated quantum computing.
Genetic algorithms and simulated annealing
Standard optimization techniques in computer science — both of which were inspired by, but do
not directly reproduce, physical processes — have also been used in an attempt to more
efficiently produce quality MSAs. One such technique, genetic algorithms, has been used for
MSA production in an attempt to broadly simulate the hypothesized evolutionary process that
gave rise to the divergence in the query set. The method works by breaking a series of possible
MSAs into fragments and repeatedly rearranging those fragments with the introduction of gaps at
varying positions. A general objective function is optimized during the simulation, most
generally the "sum of pairs" maximization function introduced in dynamic programming-based
MSA methods. A technique for protein sequences has been implemented in the software program
SAGA (Sequence Alignment by Genetic Algorithm) and its equivalent in RNA is called RAGA.
In simulated annealing, an existing MSA produced by another method is refined by a series of
rearrangements designed to find better regions of alignment space than the one the input
alignment already occupies. Like the genetic algorithm method, simulated annealing maximizes
an objective function like the sum-of-pairs function. Simulated annealing
uses a metaphorical "temperature factor" that determines the rate at which rearrangements
proceed and the likelihood of each rearrangement; typical usage alternates periods of high
rearrangement rates with relatively low likelihood (to explore more distant regions of alignment
space) with periods of lower rates and higher likelihoods to more thoroughly explore local
minima near the newly "colonized" regions. This approach has been implemented in the program
MSASA (Multiple Sequence Alignment by Simulated Annealing).
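The role of the temperature factor can be sketched with a generic annealing loop. This is a schematic of the technique, not the actual MSASA code; 'state' would be an alignment, 'propose' a rearrangement move, and all names are illustrative.

```python
import math
import random

def anneal(state, score, propose, steps=1000, t0=2.0, cooling=0.995):
    """Generic simulated-annealing loop: maximize score(state) by
    repeatedly proposing rearrangements and cooling the temperature."""
    current, current_score = state, score(state)
    temperature = t0
    for _ in range(steps):
        candidate = propose(current)
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse moves with probability
        # exp(delta / T), which shrinks as the temperature cools.
        if delta >= 0 or random.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
        temperature *= cooling
    return current, current_score
```

Early on, the high temperature lets the search escape local optima by accepting worse rearrangements; as the temperature drops, the loop settles into refining the best region found.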
Mathematical programming and exact solution algorithms
Mathematical programming and in particular Mixed integer programming models are another
approach to solve MSA problems. The advantage of such optimization models is that they can be
used to find the optimal MSA solution more efficiently compared to the traditional DP approach.
This is due in part, to the applicability of decomposition techniques for mathematical programs,
where the MSA model is decomposed into smaller parts and iteratively solved until the optimal
solution is found. Example algorithms used to solve mixed integer programming models of MSA
include branch and price and Benders decomposition. Although exact approaches are
computationally slow compared to heuristic algorithms for MSA, they are guaranteed to reach
the optimal solution eventually, even for large-size problems.
Simulated quantum computing
In January 2017, D-Wave Systems announced that its qbsolv open-source quantum computing
software had been successfully used to find a faster solution to the MSA problem.
PHYLOGENETIC ANALYSIS
Phylogenetic analysis provides an in-depth understanding of how species evolve through genetic
changes. Using phylogenetics, scientists can evaluate the path that connects a present-day
organism with its ancestral origin, as well as can predict the genetic divergence that may occur in
the future.
Phylogenetics is important because it enriches our understanding of how genes, genomes, species
(and molecular sequences more generally) evolve.
The results of phylogenetic analysis are represented as a tree, called a phylogenetic
tree. A phylogenetic tree (also phylogeny or evolutionary tree) is a branching diagram or a tree
showing the evolutionary relationships among various biological species or other entities based
upon similarities and differences in their physical or genetic characteristics. All life on Earth is
part of a single phylogenetic tree, indicating common ancestry.
In a rooted phylogenetic tree, each node with descendants represents the inferred most recent
common ancestor of those descendants, and the edge lengths in some trees may be interpreted as
time estimates. Each node is called a taxonomic unit. Internal nodes are generally called
hypothetical taxonomic units, as they cannot be directly observed. Trees are useful in fields of
biology such as bioinformatics, systematics, and phylogenetics. Unrooted trees illustrate only the
relatedness of the leaf nodes and do not require the ancestral root to be known or inferred.
Construction of Phylogenetic Tree
Phylogenetic trees built from a nontrivial number of input sequences are constructed using
computational phylogenetics methods. Distance-matrix methods such as neighbor-joining or
UPGMA, which calculate genetic distance from multiple sequence alignments, are simplest to
implement, but do not invoke an evolutionary model. Many sequence alignment methods such as
ClustalW also create trees by using the simpler algorithms (i.e. those based on distance) of tree
construction. Maximum parsimony is another simple method of estimating phylogenetic trees,
but implies an implicit model of evolution (i.e. parsimony). More advanced methods use the
optimality criterion of maximum likelihood, often within a Bayesian framework, and apply an
explicit model of evolution to phylogenetic tree estimation. Identifying the optimal tree using
many of these techniques is NP-hard, so heuristic search and optimization methods are used in
combination with tree-scoring functions to identify a reasonably good tree that fits the data.
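As an example of the distance-matrix methods mentioned above, here is a didactic UPGMA sketch. It returns only the tree topology as nested tuples; real implementations also compute branch lengths.

```python
def upgma(labels, dist):
    """UPGMA tree construction from pairwise distances given as a dict
    keyed by sorted label pairs, e.g. dist[("A", "B")] = 2.0.  Repeatedly
    merges the closest pair of clusters, scoring cluster distance as the
    average leaf-to-leaf distance."""
    clusters = [(lab,) for lab in labels]      # each cluster: tuple of leaves
    trees = {c: lab for c, lab in zip(clusters, labels)}

    def cdist(c1, c2):
        pairs = [tuple(sorted((a, b))) for a in c1 for b in c2]
        return sum(dist[p] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        candidates = [(c1, c2) for i, c1 in enumerate(clusters)
                      for c2 in clusters[i + 1:]]
        c1, c2 = min(candidates, key=lambda p: cdist(*p))
        merged = c1 + c2
        trees[merged] = (trees[c1], trees[c2])  # record the new subtree
        clusters = [c for c in clusters if c not in (c1, c2)] + [merged]
    return trees[clusters[0]]

# Three leaves where A and B are closest: they are joined first.
print(upgma(["A", "B", "C"],
            {("A", "B"): 2.0, ("A", "C"): 4.0, ("B", "C"): 4.0}))
# ('C', ('A', 'B'))
```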
Tree-building methods can be assessed on the basis of several criteria:
efficiency (how long does it take to compute the answer, how much memory does it need?)
power (does it make good use of the data, or is information being wasted?)
consistency (will it converge on the same answer repeatedly, if each time given different data for
the same model problem?)
robustness (does it cope well with violations of the assumptions of the underlying model?)
falsifiability (does it alert us when it is not good to use, i.e. when assumptions are violated?)
Tree-building techniques have also gained the attention of mathematicians. Trees can also be
built using T-theory.