
Bioinformatics

http://www.molecularsciences.org/book/export/html/2

Brought to you by molecularsciences.org. This work is licensed under a Creative Commons Attribution-Share Alike 3.0 License. This publication may not be redistributed without this notice.

Bioinformatics
Computers and the internet have revolutionized everything from agriculture and architecture to research, and biological research is no exception. Biology is the science of studying living beings. Bioinformatics is the use of techniques from applied mathematics, informatics, statistics, and computer science to solve biological problems. It is the science of using information to understand biology.

Long before the invention of the word bioinformatics, researchers tried to use computers to assist in their research. These researchers recognized three concepts which are still fundamental to bioinformatics today:
- data representation
- the concept of similarity
- bioinformatics is a data-driven science as opposed to a theoretical science

To make it possible for a computer to work on a problem, the problem must be abstracted into a format the computer can process. This often requires simplification of the problem and coding. Computers can be very good at detecting similarity, and similarity allows us to infer that two seemingly different entities share a certain property. Bioinformatics is a data-driven science, meaning that it requires lots of data. Fortunately, the biggest problem is not the lack of data but the quality of data, the meaningful classification of data, and our insufficient capacity to interpret the data.

Biological Data
Biology is now a data-intensive science, and fortunately most of the data is freely available over the Internet. Before beginning, one needs to know what kind of data is available, where, in what format, and how it can be accessed. Most databases provide useful and powerful tools to help their users access, manipulate, and analyze the data. Knowing and using these tools helps the user avoid a lot of unnecessary work.

Bioinformatics Research Centers


Several research centers are dedicated to bioinformatics research. The following are among the most significant:
- NCBI: National Center for Biotechnology Information
- EBI: European Bioinformatics Institute
- SIB: Swiss Institute of Bioinformatics
- ANGIS: Australian National Genome Information Service
- CBR: Canadian Bioinformatics Resource
- CBI: Peking Center for Bioinformatics
- BIC: Singapore Bioinformatics Centre
- SANBI: South African National Bioinformatics Institute
- Sanger Institute

Biological Databases
The invention of various techniques and instruments for analyzing living beings at the molecular level has led to an explosion of data generated by the scientific community. This data cannot be stored on paper. It must be stored, organized, and indexed in an electronic database. In addition, we need tools to view, verify, and analyze this data and to interface it with other databases.

An electronic biological database is a large, organized body of persistent data that can be queried to add, update, extract, and remove data. Biological databases have to respond to the needs of their various users. A given piece of biological data often means very different things to different researchers. For example, a physicist, a biochemist, and a biologist sitting in the same room would be interested in different aspects of the same protein. They might even use different terminology to refer to the same protein. Even two biologists would be interested in looking at the protein from different perspectives.

Biological data is often highly connected, and these connections are essential for comprehension and discovery. A nucleotide sequence is linked to the protein it codes for. Nucleotide sequences are grouped into genes. A gene may code for one protein, several proteins, or none at all. A protein might have different names in different species. A protein belongs to a protein family and must be linked to its evolutionary relatives. We would also like to have links to scientific publications related to our protein, and to find out the methods and instruments used for its discovery, and even the parameters of the instrument used. Researchers frequently repeat experiments conducted by others to verify and improve their processes.

Why do we need biological databases?


Back in the 1970s, researchers referred to the "Atlas of Protein Sequence and Structure" by Margaret Dayhoff to find information on their protein of interest. Since then, biological data has exploded to the point that we can no longer imagine publishing all of it on paper. One of the earliest electronic databases was PIR (http://pir.georgetown.edu), which was essentially run by a group of researchers. This was a significant improvement, since it made adding, updating, deleting, and most importantly searching the data much more efficient. Today PIR is no longer being updated; it is still online, but it serves only as an archive. It could not cope with the growing demands, while databases such as Swiss-Prot were built to cope with them.

Today, biology is a data-rich science where each experiment generates enormous amounts of data. We can no longer analyze all this data by eye. We need powerful data analysis tools to help us interpret and understand the significance of this data. Biological databases offer data storage facilities and various tools which help us understand and analyze the data.

Nucleotide Sequence Databases


Each database is different; however, a nucleotide sequence entry is expected to contain at least the following:
- ID and/or accession number
- taxonomic data
- references
- annotation/curation
- keywords
- cross-references
- sequences
- documentation

Annotation refers to adding extra information regarding a certain record in a database. Curation refers to evaluating what goes into the database and what is not fit to go into it.

First Generation Nucleotide Sequence Databases

The first generation nucleotide sequence databases are essentially sequence archives. The data is present in the database as it was determined and interpreted by its publisher. The original author retains full control of the submitted information. As one can imagine, this results in a multitude of problems, such as:
- data of varying quality and length
- highly redundant data
- errors in sequences, annotations, etc.
- lack of consistency

Second Generation Nucleotide Sequence Databases


The second generation nucleotide sequence databases were built with an eye on the lessons learned from the first generation. The goal is to have one sequence entry for every naturally occurring molecule. In RefSeq, a second generation database, chromosome, gene, mRNA, and protein data are curated. Other data, such as contigs, model mRNA, and model protein, are calculated. A gene can give rise to multiple products. In such a case, separate RefSeq IDs are used for each product and all are linked by a Locus ID. Second generation nucleotide sequence databases are essentially gene-centric databases.

Gene-Centric Databases
In a gene-centric database, all information relevant to a given gene is made accessible at once. Entrez Gene and RefSeq are the most commonly used, and Entrez Gene is tightly linked to RefSeq. RefSeq, the Reference Sequence collection, aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript RNA, and protein products. Gene-centric databases contain gene-specific information and focus on genomes that have been completely sequenced, that have an active research community to contribute gene-specific information, or that are scheduled for intense analysis. The content of Entrez Gene represents the result of curation and automated integration of data from NCBI's RefSeq and other collaborating databases.

Genome-Centric Databases
Genome-centric databases contain information about the gene sequence, relative position, strand orientation, biochemical functions, etc. Ensembl and TIGR are information management systems that are able to connect specialized sequence collection and browsing tools.

GenBank: case study


GenBank is a comprehensive public database of nucleotide sequences built and distributed by the NCBI. GenBank is primarily built from sequence data submitted by authors and from bulk submissions of EST, GSS, and other high-throughput data from sequencing centers. GenBank doubles in size every 18 months. WGS and environmental sequences now occupy a significant share of the database.

- EST: Expressed Sequence Tags are produced by one-shot sequencing of a cloned cDNA.
- GSS: Genome Survey Sequences are similar to ESTs, with the exception that most of the sequences are genomic in origin.
- WGS: Whole Genome Shotgun sequences are contigs of a sequencing project. WGS data can contain annotation and should be updated as sequencing progresses.
- Contig: A contig is a DNA sequence assembled from DNA fragments of 100-300 base pairs.
- Environmental sequences: These are all DNA sequences present in a sample. The sample often contains many different organisms, and these organisms are very often unknown and unidentified.

Each GenBank entry includes a concise description of:
- the sequence
- the scientific name and taxonomy of the source organism
- bibliographic references
- a listing of areas of biological significance, such as coding regions and their protein translations, transcription units, repeat regions, and sites of mutations or modifications

GenBank partitions sequences into divisions that roughly correspond to:
- taxonomic groups such as bacteria (BCT), viruses (VRL), and rodents (ROD)
- sequencing strategies such as EST, GSS, HTG, HTC, and environmental sample (ENV) sequences

HTC: high-throughput cDNA.
HTG: high-throughput genomic sequences (single-pass, unfinished genomic sequences).

EST and HTC are RNA or cDNA. GSS, HTG, WGS, and ENV are DNA.

The data in GenBank, and in the collaborating databases EMBL and DDBJ, are submitted primarily by individual authors to one of the three databases, or by sequencing centers as batches of EST, STS, GSS, HTC, WGS, or HTG sequences. Data are exchanged daily with DDBJ and EMBL so that the daily updates from NCBI servers incorporate the most recently available sequence data from all sources. Virtually all records enter GenBank as direct electronic submissions.

EMBL, GenBank, DDBJ, and Swiss-Prot use both identifiers and accession numbers to identify each entry. To make things more complicated, identifiers and accession numbers mean different things in different databases. In Swiss-Prot, identifiers are alphanumeric terms that are meaningful to a human being. For example, HBA_HUMAN refers to the human haemoglobin alpha chain. Identifiers can change, but they rarely do. The accession number for HBA_HUMAN is P69905. Accession numbers are primary keys, so they never change. If two entries are merged, the new entry will have both accession numbers: one becomes the primary key and the other the secondary key. When an entry is split, new accession numbers are assigned to each resulting entry and the old accession number is noted as a secondary key.

GenBank data can be retrieved through Entrez. Entrez covers over 30 biological databases containing DNA and protein sequence data, genome mapping data, population sets, phylogenetic sets, environmental sample sets, gene expression data, the NCBI taxonomy, protein domain information, protein structures from the Molecular Modeling Database (MMDB), and MEDLINE references via PubMed. Entrez is a very good system to use, since it returns much more information than is available in GenBank alone. Biological databases often come with useful tools. BLAST is a very powerful tool which allows sequence-similarity comparisons. The GenBank database can be downloaded by FTP from ftp.ncbi.nih.gov.

This page is a brief summary of the descriptions of Swiss-Prot, GenBank, and EMBL available on their websites.
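The merge and split rules for accession numbers described above can be illustrated with a small data structure. This is only a sketch of the bookkeeping, not how any of the databases implement it: the `Entry` class, its field names, and the `merge` and `split` helpers are hypothetical, and every accession other than P69905 is a made-up placeholder.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Entry:
    """Hypothetical database entry; the field names are illustrative only."""
    identifier: str          # human-readable and allowed to change, e.g. HBA_HUMAN
    primary_accession: str   # stable primary key, e.g. P69905
    secondary_accessions: List[str] = field(default_factory=list)

    def all_accessions(self) -> List[str]:
        """Primary accession first, followed by any secondary accessions."""
        return [self.primary_accession] + self.secondary_accessions


def merge(kept: Entry, absorbed: Entry) -> Entry:
    """Merge two entries: the result carries both accession numbers."""
    secondary = kept.secondary_accessions + absorbed.all_accessions()
    return Entry(kept.identifier, kept.primary_accession, secondary)


def split(original: Entry, parts: List[Tuple[str, str]]) -> List[Entry]:
    """Split an entry: each part gets a new primary accession and
    records the old accession numbers as secondary keys."""
    return [
        Entry(identifier, accession, original.all_accessions())
        for identifier, accession in parts
    ]


hba = Entry("HBA_HUMAN", "P69905")
other = Entry("PLACEHOLDER_ID", "ACC_B")   # made-up entry for illustration
merged = merge(hba, other)
print(merged.all_accessions())             # primary key first, then secondary
```

A lookup by any old accession number can then still find the current entry, which is the practical reason accession numbers are never reused or changed.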

Protein Sequences
There are two major protein sequence resources:
- UniProt = Swiss-Prot + TrEMBL + PIR
- NCBI-nr = Swiss-Prot + GenPept + PIR + RefSeq + PDB + PRF

In addition, there are several specialized protein databases.

UniProt
UniProt is a central resource for protein sequence and function. The UniProt consortium (since 2003) consists of EMBL, SIB, and PIR. PIR is no longer being updated; it now functions only as an archive. UniProt itself is divided into several components.

UniProtKB/TrEMBL

UniProtKB/TrEMBL contains computer-annotated protein sequences. TrEMBL entries are produced by translating the coding sequences (CDS) of nucleic acid entries in EMBL using computer tools. In addition, it includes data from PIR. TrEMBL suffers from poor submission of annotated CDS. TrEMBL is a platform for the improvement of automated annotation tools. A TrEMBL entry is created after applying many annotation tools such as SignalP, TMHMM, REP, etc. Evidence tags are then added to any part of a TrEMBL entry not derived from the original EMBL entry.

UniProtKB/Swiss-Prot
UniProtKB/Swiss-Prot contains manually annotated protein sequences. Swiss-Prot entries are produced by manually annotating TrEMBL entries. Before a Swiss-Prot entry is created, the sequence is checked and analyzed. The data is cross-checked against the literature and external scientific expertise. Once an entry is moved to Swiss-Prot, it is deleted from TrEMBL. Data in Swiss-Prot does not migrate to TrEMBL. Together, Swiss-Prot and TrEMBL provide all known protein sequences in the public domain.

The goals of Swiss-Prot are:
- Non-redundant: one entry, one gene, one species
- Maximum manual annotation: maximum annotation of protein diversity
- Maximum links to other databases

A Swiss-Prot entry contains:
- ID and accession number
- names and taxonomy
- references
- comments
- cross-references
- keywords
- features
- sequence

UniRef
One UniRef100 entry contains all identical sequences, including fragments. One UniRef90 entry contains sequences that have at least 90% identity. One UniRef50 entry contains sequences that have at least 50% identity.
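To make the idea of grouping sequences by an identity threshold concrete, here is a minimal sketch. It is not the procedure UniRef actually uses: it assumes sequences of equal length that are already aligned position by position, and it uses a simple greedy rule in which the first member of each cluster acts as the representative. The function names are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this simplified sketch needs equal-length sequences")
    if not a:
        return 100.0
    same = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * same / len(a)


def greedy_cluster(seqs, threshold):
    """Put each sequence in the first cluster whose representative it matches
    at >= threshold percent identity; otherwise start a new cluster."""
    clusters = []  # list of lists; the first member is the representative
    for s in seqs:
        for cluster in clusters:
            if percent_identity(s, cluster[0]) >= threshold:
                cluster.append(s)
                break
        else:
            clusters.append([s])
    return clusters


seqs = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAATTTTT", "CCCCCCCCCC"]
for threshold in (100, 90, 50):
    print(threshold, greedy_cluster(seqs, threshold))
```

Lowering the threshold can only merge clusters in this toy example, which mirrors how a 50% grouping is coarser than a 90% grouping.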

UniParc
UniParc contains raw archived protein sequences. Sequences and information in UniProt are accessible via text search, BLAST similarity search, and FTP.

Non-coding DNA
A remarkable variability exists in genome size among eukaryotes, and it has little correlation with organismal complexity, size, or number of coding genes. Even a unicellular organism can have a larger genome than a mammal! This striking disparity is due to non-coding DNA. Non-coding DNA is DNA which does not contain instructions for making cell products. It constitutes a large portion of the genome of eukaryotes. Some of this non-coding DNA is involved in regulating the coding regions of DNA. The functions of the remaining non-coding DNA are still unknown.

The genome contains several types of non-coding regions (regions not coding for proteins). Non-coding regions can be found in three areas: genic DNA, genic DNA coding for ncRNA, and intergenic DNA.

Genic DNA is involved directly in gene expression. UTRs (untranslated regions of mRNA) and introns are genic DNA.

The intergenic region contains mostly repetitive regions. Functional regions, which constitute about 15% of intergenic regions, contain SARs (scaffold attachment regions), telomeres, and centromeres. The functions of the remaining 85% are unknown.

A SAR (scaffold attachment region) is an AT-rich segment of a eukaryotic genome that acts as an attachment point to the nuclear matrix. The nuclear matrix is a proteinaceous, scaffold-like network that permeates the nucleus.

A telomere is a region of highly repetitive DNA at the end of a chromosome that functions as a disposable buffer. Every time linear eukaryotic chromosomes are replicated, the DNA polymerase complex is incapable of replicating all the way to the end of the chromosome; if it were not for telomeres, this would quickly result in the loss of useful genetic information.

The centromere is the site where spindle fibers of the mitotic spindle attach to the chromosome during mitosis. In most eukaryotes, the centromere has no defined DNA sequence. It typically consists of large arrays of repetitive DNA where the sequence within individual repeat elements is similar but not identical.

Repetitive DNA sequence classes


Much of the variation in genome size is due to non-coding, tandemly repeated DNA. A substantial fraction of a eukaryotic genome is often composed of repetitive DNA.

1. Simple Repeats
Simple repeats are duplications of simple sets of DNA bases, typically 1-5 bp. CpG repeats are among the most important simple repeats. A CpG island is a short stretch of DNA in which the frequency of the dinucleotide sequence CG is higher than in other regions. The p simply indicates that C and G are connected by a phosphodiester bond. To be classified as a CpG island, a sequence must be at least 200 bases long. DNA methylation occurs at CG-rich sites. Over evolutionary time, methylated cytosines may be converted to thymine by deamination (CpG -> TpG). Methylated (inactive) regions are thus poor in CpG. CpG islands are unmethylated regions of the genome that are associated with the 5' ends of genes which are frequently switched on. CpG islands often overlap the promoter and extend about 1000 base pairs downstream into the transcription unit.
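The idea that CG dinucleotides are more frequent in a CpG island than expected can be checked numerically. The sketch below computes the GC fraction and the ratio of observed to expected CpG counts for a DNA string. Only the 200-base minimum length comes from the text above; the other two thresholds (GC fraction above 0.5 and an observed/expected ratio above 0.6) are commonly quoted criteria and are treated here as assumptions, as are the function names.

```python
def cpg_stats(seq: str):
    """Return (length, GC fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so a plain count is exact
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count if C and G were placed independently along the sequence.
    expected = (c * g) / n if n else 0.0
    obs_exp = cpg / expected if expected else 0.0
    return n, gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the assumed thresholds to a single candidate region."""
    n, gc, ratio = cpg_stats(seq)
    return n >= min_len and gc > min_gc and ratio > min_obs_exp


print(cpg_stats("CG" * 100))
print(looks_like_cpg_island("CG" * 100), looks_like_cpg_island("AT" * 100))
```

In practice such a test would be applied to sliding windows along a genome rather than to one hand-picked string.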

2. Tandem Repeats - DNA satellites


Tandem repeats are typically found at the centromeres and telomeres of chromosomes. These are duplications of more complex 100-200 base sequences. DNA satellites can further be divided into satellites, minisatellites, and microsatellites, based on the number of nucleotides involved.

3. Segmental Duplications
Segmental Duplications are large blocks of 10-300kbp which have been copied to another region of the genome.

4. Interspersed Repeats (Transposons)


Interspersed repeats are repeated DNA sequences located at dispersed regions in a genome. They are also known as mobile elements or transposable elements. LINEs are long interspersed elements. SINEs are short interspersed elements.

5. Pseudogenes

Pseudogenes are defined as nonfunctional sequences of DNA originally derived from functional genes (evolutionary relics). There are two major classes:
- unprocessed pseudogenes, derived from gene duplication
- processed pseudogenes, derived from retrotransposition of mRNA

Pseudogenes may be transcribed but not translated. Their chromosomal distribution appears random and dispersed. Pseudogenes can be considered potogenes, i.e. DNA sequences with a probability of becoming new genes. Processed pseudogenes are very similar to their closest corresponding human gene, being 94% complete in coding regions, with sequence similarity of 75% for amino acids and 86% for nucleotides. Pseudogene.org is a resource which concentrates on pseudogenes.

Protein Coding DNA


In prokaryotes, one gene codes for one protein. Eukaryotes use much more elaborate mechanisms to increase sequence diversity and to enable themselves to produce new proteins.

Alternative promoter usage


Several exons are involved in coding for a single protein. Any one of several exons can be used to initiate expression. The choice of the initiating exon can generate a different isoform of the same protein. In other words, alternative usage of promoters results in different isoforms of a protein.

Alternative splicing
RNA splicing is a precisely regulated co- and post-transcriptional process (occurring prior to mRNA translation) that removes introns and joins exons in a primary transcript. During RNA splicing, exons can either be retained in the mature message or targeted for removal in different combinations to create a diverse array of mRNAs from a single pre-mRNA, a process referred to as alternative RNA splicing (tissue and cell specific). There are four known modes of alternative splicing:

1. Alternative selection of promoters: This is the only mode of splicing which can produce an alternative N-terminus domain in proteins. In this case, different sets of promoters can be spliced with certain sets of other exons.
2. Alternative selection of cleavage/polyadenylation sites: This is the only mode of splicing which can produce an alternative C-terminus domain in proteins. In this case, different sets of polyadenylation sites can be spliced with the other exons.
3. Intron retaining mode: In this case, instead of being spliced out, an intron is retained in the mRNA transcript. However, the intron must properly encode amino acids. The intron's code must be properly expressible; otherwise a stop codon or a shift in the reading frame will cause the protein to be non-functional.
4. Exon cassette mode: In this case, certain exons are spliced out to alter the sequence of amino acids in the expressed protein.

mRNA editing

~15% of disease-causing mutations involve misregulation of alternative splicing (missplicing).

Exon order is not conserved. It can be scrambled, a technique used in alternative promoter usage.
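The constraint on the intron retaining mode (no frame shift and no stop codon) can be expressed as a short check. This is a deliberately simplified sketch: it assumes the retained intron begins exactly at a codon boundary, it uses the three standard stop codons, and the function name is hypothetical.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def retained_intron_is_translatable(intron: str) -> bool:
    """True if the intron keeps the reading frame and has no in-frame stop codon.

    Simplifying assumption: the intron starts exactly at a codon boundary.
    """
    rna = intron.upper().replace("T", "U")  # accept DNA or RNA letters
    if len(rna) % 3 != 0:
        return False  # the downstream exon would be read in a shifted frame
    codons = (rna[i:i + 3] for i in range(0, len(rna), 3))
    return not any(codon in STOP_CODONS for codon in codons)


print(retained_intron_is_translatable("GCUGCUGCU"))  # three in-frame codons
print(retained_intron_is_translatable("GCUUAAGCU"))  # contains UAA in frame
print(retained_intron_is_translatable("GCUGC"))      # length not a multiple of 3
```

A stop codon that is present only out of frame, as in "GUAAGC", does not fail the check, because the ribosome never reads it as a codon under the stated assumption.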

Trans-splicing vs. Cis-splicing


Splicing converts pre-mRNA in eukaryotes into mature mRNA. This mature messenger RNA is then ready to undergo translation as part of protein synthesis. When the exons are in the same RNA transcript, the process is called cis-splicing. Trans-splicing is a form of splicing that joins two exons that are not within the same RNA transcript.

Exonic splicing enhancers (ESEs): pre-mRNA cis-acting elements


ESEs are discrete sequences within exons that promote both constitutive and regulated splicing. The precise mechanism by which ESEs facilitate the assembly of splicing complexes has been controversial. However, recent studies have provided insights into this question and have led to a new model for ESE function. Other recent work has suggested that ESEs are comprised of diverse sequences and occur frequently within exons. Ominously, these latter studies predict that many human genetic diseases linked to mutations within exons might be caused by the inactivation of ESEs.

Exon sequence enhancers prediction - http://rulai.cshl.edu/tools/ESE/
Alternative splicing database project - http://www.ebi.ac.uk/asd/index.html

Non-coding RNA
Non-coding RNAs represent ~10% of the genes but ~98% of all human transcripts. snRNA participates in post-transcriptional chemical modification or processing of different RNAs.

Micro RNAs (miRNAs) are a class of non-coding RNA genes. They play an important role in the regulation of translation and the degradation of mRNAs through base pairing to partially complementary sites in the untranslated regions (UTRs) of the messenger.

Antisense transcription is transcription from the strand opposite to a protein-coding or sense strand. Computational analysis suggests that between 15 and 25% of mammalian genes overlap, giving rise to pairs of sense and antisense RNA. Antisense transcripts are almost universally associated with candidate imprinted loci and also occur on the autosomes. They play roles in gene regulation involving degradation of the corresponding sense transcripts (RNA interference) as well as gene silencing at the chromatin level. The challenge is to determine the correct orientation of an expressed sequence, especially an expressed sequence tag (EST).

Antisense mRNA is an mRNA transcript that is complementary to endogenous mRNA. It is the noncoding strand complementary to the coding sequence of mRNA. Introducing a transgene coding for antisense mRNA is a strategy used to block expression of a gene of interest. A strand of antisense mRNA can also be introduced into the cytosol by microinjection. Radioactively labelled antisense mRNA can be used to hybridise to endogenous sense mRNA, which can show the level of transcription of genes in various cell types.

ncRNA genes are found in genomic sequences by their sequence or structural homology. tRNAs have conserved sequence elements. Programs use a combination of pattern searches, probabilistic methods, and (for eukaryotes) searches for Pol III promoters. tRNAscan is a very good program for finding tRNAs.
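Because antisense RNA is complementary to the sense transcript, its sequence can be derived by taking the reverse complement. The following minimal sketch does this for an RNA string written 5' to 3'; the function name is hypothetical and only the four standard bases are handled.

```python
# Translation table mapping each RNA base to its Watson-Crick partner.
_RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def antisense_rna(sense: str) -> str:
    """Return the antisense strand (reverse complement) of an RNA sequence,
    written 5'->3' like the input."""
    return sense.upper().translate(_RNA_COMPLEMENT)[::-1]


sense = "AUGGCC"
anti = antisense_rna(sense)
print(sense, anti)
```

Applying the function twice returns the original sequence, which is a quick sanity check that the complement table is symmetric.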

Pairwise Alignment
Much of bioinformatics involves sequences. Sequences are represented as strings of letters from an alphabet. DNA has an alphabet of 4 letters, while proteins have an alphabet of 20 letters. The most basic sequence analysis is to ask whether two sequences are related. This involves aligning the two sequences and then deciding whether they are related or whether the similarity is just due to chance. The key issues to consider are:

1. what sorts of alignments should be considered
2. the scoring system used to rank alignments
3. the algorithm used to find optimal (or good) scoring alignments
4. the statistical methods used to evaluate the significance

Finding similarity between sequences is important for many biological questions. Some examples:
- Finding similar proteins allows us to predict their function and structure.
- Locating similar subsequences in DNA allows us to identify regions of interest, such as regulatory elements.
- Locating overlapping DNA sequences helps us in sequence assembly.

Two similar sequences are probably biologically similar. Very often, similar sequences have similar 3D structures. This is important since the 3D structure of a protein defines its functions. In addition, similar sequences can come from two species which share a common ancestor, thereby indicating their evolutionary relationship. In other words, the residues occupying similar positions could have similar functional roles.

Evolution tends to conserve the more efficient functional units. Therefore, important sequences which code for important proteins are conserved among organisms in nature. In the absence of an understanding of the biological mechanisms, it is indispensable to compare a new unknown sequence to sequences that we know better. Therefore, the discovery of efficient and reliable algorithms is becoming more and more important as the number of sequences increases exponentially.

Similar, Identical, Homologous

Understanding the difference between similar and identical is crucial for sequence alignment. An identical pair is a pair of two identical amino acids. A similar pair is a pair of amino acids which can be considered chemically similar in that position. Two amino acids are considered similar if one can be substituted for the other with a positive log-odds score from a scoring matrix.

    VKASQRTTV
    VK ++RTTV
    VKPNKRTTV

In this example, T, V, R, and K form identical pairs, while S,N and Q,K are similar pairs.

Similarity can often be misleading. It can reveal evolutionarily related sequences, or it can align two sequences with completely different function and structure. The challenge is to differentiate between the former and the latter.
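The middle line of the example above can be generated mechanically from a scoring matrix: identical pairs print the residue, pairs with a positive score print "+", and everything else prints a space. The sketch below uses a small hand-written score table covering only the pairs in the example. The values follow BLOSUM62 as far as I know, but they should be read as illustrative, and the helper names are hypothetical.

```python
# Illustrative log-odds scores for the non-identical pairs in the example.
# Assumption: values taken from BLOSUM62; only the pairs needed here are listed.
SCORES = {
    ("A", "P"): -1,
    ("S", "N"): 1,
    ("Q", "K"): 1,
}


def pair_score(a: str, b: str) -> int:
    """Look up the score of a non-identical pair, in either order."""
    if (a, b) in SCORES:
        return SCORES[(a, b)]
    return SCORES[(b, a)]  # raises KeyError for pairs outside this tiny table


def midline(x: str, y: str) -> str:
    """Residue for identical pairs, '+' for similar pairs, space otherwise."""
    out = []
    for a, b in zip(x, y):
        if a == b:
            out.append(a)
        elif pair_score(a, b) > 0:
            out.append("+")
        else:
            out.append(" ")
    return "".join(out)


top, bottom = "VKASQRTTV", "VKPNKRTTV"
print(top)
print(midline(top, bottom))
print(bottom)
```

With a full substitution matrix in place of the three-entry table, the same function would work for any pair of equal-length aligned protein sequences.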

Sequence alignment
A sequence alignment takes two sequences over the same alphabet as input and outputs an alignment of the two sequences. Alignment simply refers to placing one symbol against another. It does not involve judging the quality of the alignment. An alignment consists of writing the two sequences in two rows and inserting gap symbols so that both rows have the same length. Any arrangement is permitted as long as the order of the symbols in each sequence is not modified. There is no quality evaluation in the alignment step. Let's look at the following two sequences:

    GCGCATGGATTGAGCGA
    TGCGCCATTGATGACCA

A possible alignment could be:

    -GCGC-ATGGATTGAGCGA
    TGCGCCATTGAT-GACC-A

The string GCGC is a perfect match. The G in the ninth column of the alignment is a mismatch, since it is paired with T. The - symbols are indels (insertions or deletions); they allow a better match to occur. Many different alignments are possible. The trick is to choose the most likely alignment. This is accomplished by scoring alignments and is covered in the next section.

Sequence identity refers to the occurrence of exactly the same nucleic acid or amino acid in the same position in two aligned sequences. Sequence similarity is meaningful only when possible substitutions are scored according to the probability with which they occur. Sequence homology indicates evolutionary relatedness among sequences. Two sequences are said to be homologous if they are both derived from a common ancestral sequence. Similarity refers to the presence of identical and similar sites in two sequences, while homology reflects the stronger claim that the two sequences share a common ancestor.

Similarity is not defined in a unique and exact manner. It is a mix of biological knowledge and mathematical and heuristic concepts. Sequence similarity is not about comparing two texts to state whether they are the same or different. A measure of sequence similarity must be capable of tolerating gaps and substitutions. This is an optimization problem which can be formulated as a dynamic programming problem. The idea is to give a score to each pair of residues, and then to search for the insertions and deletions which maximize the global score using a substitution matrix. In addition, the degree of similarity must be validated biologically and statistically. It is also important to be able to distinguish between accidental similarity and similarity based on biological factors.

Note: Parts of this section are a summary of Durbin.
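As a small illustration of the terms used above, the columns of the example alignment can be classified as matches, mismatches, or gaps. This is only a counting sketch with a hypothetical function name; it makes no judgement about whether the alignment is a good one.

```python
def classify_columns(row1: str, row2: str) -> dict:
    """Count match, mismatch and gap columns in a gapped pairwise alignment."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            counts["gap"] += 1
        elif a == b:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    return counts


row1 = "-GCGC-ATGGATTGAGCGA"
row2 = "TGCGCCATTGAT-GACC-A"
print(classify_columns(row1, row2))

# Removing the gap symbols recovers the original, unaligned sequences.
print(row1.replace("-", ""), row2.replace("-", ""))
```

For the example alignment this yields 13 matching columns, 2 mismatches, and 4 gap columns out of 19.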

Scoring
Scoring Model
When we compare sequences, we are looking for evidence that they have diverged from a common ancestor by a process of mutation and selection. The basic mutational processes are:
- Substitutions: residue changes in the sequence
- Insertions: addition of a residue
- Deletions: removal of a residue

Insertions and deletions together are called gaps. The total score we assign to an alignment is the sum of terms for each aligned pair of residues, plus terms for each gap. In a probabilistic interpretation, this corresponds to the log of the relative likelihood that the sequences are related, compared to being unrelated. In other words, it is the log of the ratio between the probability of the alignment if the sequences are related and its probability if they are unrelated.

The easiest scoring method is to assume that each element of the sequence evolved independently and that the probability of a mutation is 1/20. However, this is an erroneous assumption, since some changes are more plausible than others. The plausibility depends on the properties of the amino acids. Substitutions which are likely to preserve the structure and function of the protein are more likely to be retained over evolution than ones which alter them. It is therefore expected that identities and conservative substitutions are more likely to occur in true alignments than by chance. Thus true positives are more likely to have a positive score, while random substitutions are expected to contribute towards a negative score.

Using an additive score corresponds to the assumption that mutations at different sites of the sequence have occurred independently, with each gap treated as a single mutation. This seems to be a reasonable assumption for DNA and protein sequences. However, it is seriously inaccurate for structural RNAs, where base pairing introduces long-range dependencies between sites.

The additive scoring function is defined as follows:
- σ(x,y) is the score of replacing x by y
- σ(x,-) is the score of deleting x
- σ(-,x) is the score of inserting x

The score of an alignment is the sum of its position scores. The optimal or maximal score between two sequences is the maximal score over all alignments of these sequences, namely:

d(s1,s2) = max(alignment score)

The additive form of the score allows us to use dynamic programming to compute the optimal score efficiently.

Substitution Matrices / Scoring Matrices


What you really want to learn when evaluating a sequence alignment is whether a given alignment is random or meaningful. To assess the meaningfulness of an alignment we construct a scoring matrix. A scoring matrix is a table of values that describe how likely each residue pair is to occur in an alignment. The values in a scoring matrix are logarithms of ratios of two probabilities. One is the probability that the pair of residues occurs in an alignment by chance. The other is the probability that the pair occurs in a meaningful alignment of related sequences. In order to score an alignment, the alignment program needs to know whether it is more likely or less likely that a given amino acid pair occurred randomly. A negative log-odds score indicates a pair that is more likely to occur by chance, while a positive score indicates an evolutionary relationship. It is important to note that the scores are logarithms, so even a modest positive score indicates a match that is far from a coincidence.
Notation
sequences: x, y
xi is the ith symbol in x, yj is the jth symbol in y
A is the alphabet, e.g. A = {A, C, G, T} for DNA
symbols from the alphabet are a, b, ...
The unrelated or random model R is the simplest. It assumes that residue a occurs independently with some frequency qa, so the probability of the two sequences is just the product of the probabilities of each residue:

$$P(x,y \mid R) = \prod_i q_{x_i} \prod_j q_{y_j}$$

The product of the frequencies of each element of sequence x multiplied by the product of the frequencies of each element of sequence y. In the alternative match model M, aligned pairs of residues occur with a joint probability pab. pab can be thought of as the probability that the residues a and b have each independently been derived from some unknown original residue c which was present in their common ancestor. This gives:

$$P(x,y \mid M) = \prod_i p_{x_i y_i}$$

Joint probability is the probability of two or more things happening at once. In our case, pab is the probability of finding residue a in one sequence aligned with residue b in the other. In this model, we take the product of these joint probabilities over all aligned pairs. The ratio between the values computed by these two formulas is called the odds ratio:

$$\frac{P(x,y \mid M)}{P(x,y \mid R)} = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$$

In order to arrive at an additive scoring system, we take the log of this ratio, which gives the log-odds ratio. The log likelihood ratio of a residue pair is computed with:

$$s(a,b) = \log\left(\frac{p_{ab}}{q_a q_b}\right)$$

This is the log of the joint probability of a pair divided by the product of the frequencies of each member of the pair. Summing this value over all aligned pairs gives the log-odds ratio of the alignment:

$$S = \sum_i s(x_i, y_i)$$

The log-odds ratio can therefore be looked at as log(P(alternative) / P(random)). There are several ways to derive substitution scores; however, substitution scoring based on probabilistic models seems to be the most accurate. The log likelihood ratios can be arranged in a matrix. DNA has a 4 x 4 matrix while proteins have a 20 x 20 matrix. This matrix is called the score matrix or substitution matrix. The BLOSUM (for example Blosum50) and PAM families are the most commonly used matrices. Substitution matrices essentially make a statement about the probability of observing ab pairs in real alignments.
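The log-odds construction can be sketched in a few lines of Python. This is a minimal illustration, not a real substitution matrix: the pair probabilities `p` and background frequencies `q` are made-up toy values for a DNA alphabet, and the function names are hypothetical.

```python
import math

def log_odds_matrix(pair_probs, freqs):
    """Return s(a,b) = log(p_ab / (q_a * q_b)) for every residue pair."""
    return {
        (a, b): math.log(p_ab / (freqs[a] * freqs[b]))
        for (a, b), p_ab in pair_probs.items()
    }

def alignment_score(x, y, s):
    """Sum the per-column scores of an ungapped alignment of x and y."""
    if len(x) != len(y):
        raise ValueError("an ungapped alignment needs equal-length sequences")
    return sum(s[(a, b)] for a, b in zip(x, y))

# Hypothetical toy model: uniform background, identical pairs favoured.
ALPHABET = "ACGT"
q = {a: 0.25 for a in ALPHABET}
p = {(a, b): (0.16 if a == b else 0.03) for a in ALPHABET for b in ALPHABET}
s = log_odds_matrix(p, q)

print(round(s[("A", "A")], 3), round(s[("A", "C")], 3))
print(round(alignment_score("ACGT", "ACGA", s), 3))
```

With these toy numbers identical pairs get a positive score and mismatched pairs a negative one, which is the behaviour the text describes for meaningful versus random pairs.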

Gap Penalties
DNA sequences change not only by point mutation, but by insertion and deletion of residues as well. Consequently, it is often necessary to introduce gaps into one or both of the sequences being aligned to produce a meaningful alignment between them. Gaps have to be penalized. The standard cost associated with a gap of length g is given either by a linear score

$$\gamma(g) = -g d$$

or by an affine score

$$\gamma(g) = -d - (g-1) e$$

where d is called the gap-open penalty and e is called the gap-extension penalty. Most sequence alignment models use affine gap penalties, where the cost of opening a gap in a sequence is different from the cost of extending a gap that has already been started. The extension penalty e is usually set to a number less than the gap-open penalty d. This allows long insertions and deletions to be penalized less than they would be under a linear gap cost. This is desirable when gaps of a few residues are expected almost as often as gaps of a single residue. [1]
Gap penalties also correspond to a probabilistic model of alignment, although this is less widely recognized than the probabilistic basis of substitution matrices. We assume that the probability of a gap occurring at a particular site in a given sequence is the product of a function f(g) of the length of the gap, and the combined probability of the set of inserted residues. In other words, the length of a gap is not correlated with the residues it contains. Here the gap penalties correspond to the log probability of a gap of that length. [1]
On the other hand, if there is evidence for a different distribution of residues in gap regions, then there should be residue-specific scores for the unaligned residues in gap regions. These scores should be equal to the logs of the ratio of their frequencies in gapped versus aligned regions. For example, the residues found in gap regions may differ in hydrophobicity from those found in aligned regions. [1]
Gap penalties are intimately tied to the scoring matrix that aligns the sequences. The best pair of gap opening and extension penalties for one scoring matrix doesn't necessarily work with another.
Linear Gap Penalty
Linear gap penalties are the simplest type of gap penalty. The only parameter, d, is the penalty per gapped position. Its contribution to the score is almost always negative, so that the alignment with fewer gaps is favored over the alignment with more gaps. Under a linear gap penalty, the overall penalty for one large gap is the same as for many small gaps of the same total length.
Affine Gap Penalty
Affine gap penalties attempt to overcome this problem. In biological sequences, for example, it is much more likely that one big gap of length 10 occurs in one sequence, due to a single insertion or deletion event, than that 10 small gaps of length 1 are made. Therefore, affine gap penalties have a gap opening penalty, c, and a gap extension penalty, e. A gap of length l is then given a penalty c + (l-1)e. So that gaps are discouraged, c and e almost always contribute negatively to the score. Since a few large gaps are better than many small gaps, e is almost always smaller in magnitude than c.
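The difference between the two gap models is easy to see numerically. The sketch below uses the convention in which d and e are positive penalties that are subtracted from the score; the values d = 8 and e = 2 are arbitrary illustrative choices.

```python
def linear_gap(g, d):
    """Score of a gap of length g under a linear model: -g * d."""
    return -g * d

def affine_gap(g, d, e):
    """Score of a gap of length g under an affine model: -d - (g - 1) * e."""
    if g <= 0:
        return 0
    return -d - (g - 1) * e

d, e = 8, 2  # illustrative gap-open and gap-extension penalties

# One gap of length 10 versus ten separate gaps of length 1.
one_long_linear = linear_gap(10, d)
ten_short_linear = 10 * linear_gap(1, d)
one_long_affine = affine_gap(10, d, e)
ten_short_affine = 10 * affine_gap(1, d, e)

print(one_long_linear, ten_short_linear)   # identical under the linear model
print(one_long_affine, ten_short_affine)   # the long gap is cheaper under the affine model
```

Under the linear model both scenarios cost the same, while the affine model charges the single long gap much less than ten separate openings.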

Source
[1] Durbin

Significance of Scores
Once we have an optimal alignment, we need to assess the significance of its score. This permits us to decide whether it is a biologically meaningful alignment or not. We look at the distribution of the maximum of N match scores against independent random sequences. If the probability of this maximum being greater than the observed best score is small, then the observation is considered significant.

Alignment Algorithms
Given a scoring system, we need an algorithm for finding an optimal alignment for a pair of sequences. When both sequences have the same length and no gaps are allowed, there is only one possible global alignment of the complete sequences, but things get complicated once gaps are allowed or when we look for local alignments between subsequences of two sequences. It is not computationally feasible to enumerate all possible alignments. For two sequences of length n, there are:

$$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \simeq \frac{2^{2n}}{\sqrt{\pi n}}$$

possible global alignments between the two. Clearly, this number grows exponentially with n, so enumerating all alignments is not feasible for sequences of any reasonable length.
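To get a feeling for how quickly the number of global alignments grows, the count can be evaluated directly. This sketch assumes the central binomial coefficient C(2n, n) quoted in Durbin, together with its Stirling approximation; the function names are illustrative.

```python
import math

def alignment_count(n):
    """Central binomial coefficient C(2n, n) for two sequences of length n."""
    return math.comb(2 * n, n)

def stirling_estimate(n):
    """Approximation 2^(2n) / sqrt(pi * n) of C(2n, n)."""
    return 2 ** (2 * n) / math.sqrt(math.pi * n)

for n in (10, 50, 100):
    exact = alignment_count(n)
    print(n, exact, f"{stirling_estimate(n) / exact:.4f}")
```

Even for n = 100 the count has about 59 digits, which is why the following sections turn to dynamic programming instead of enumeration.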

Dynamic Programming
A dynamic programming algorithm finds optimal alignments under an additive alignment score. Dynamic programming is crucial for computational sequence analysis. Unlike heuristic methods, dynamic programming algorithms are guaranteed to find the optimal scoring alignment or set of alignments. Dynamic programming involves dividing the problem into smaller problems and storing the results in a table. It is like recursion with memory.
In the previous section, we defined the additive scoring function as:
σ(x,y) is the score of replacing x by y
σ(x,-) is the score of deleting x
σ(-,x) is the score of inserting x
The score of an alignment is the sum of its position scores. The optimal alignment between two sequences is the one which gives the maximal score. As we have just seen, enumerating all possible alignments is not feasible. In a log-odds ratio scoring scheme, better alignments produce higher scores. To find the optimal alignment, we would like to maximize the score. In terms of a Blosum50 matrix, we want to collect as many positive values and as few negative values as possible.
Since dynamic programming is recursion with memory, let's look at how the recursion argument is constructed. Suppose we have two sequences, s[1..n+1] and t[1..m+1]. The best alignment must end in one of three ways:
1. The last column is (s[n+1], t[m+1])
2. The last column is (s[n+1], -)
3. The last column is (-, t[m+1])
Thus d(s[1..n+1], t[1..m+1]) is the maximum of:
1. d(s[1..n], t[1..m]) + σ(s[n+1], t[m+1])
2. d(s[1..n], t[1..m+1]) + σ(s[n+1], -)
3. d(s[1..n+1], t[1..m]) + σ(-, t[m+1])
where σ is the scoring function defined above.

Global Alignment: Needleman & Wunsch Algorithm


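We now construct a matrix F indexed by i and j, one index for each sequence, where F(i,j) is the score of the best alignment between the initial segments s[1..i] and t[1..j] of each sequence. F(i,j) is built recursively.
F(i,j) = d(s[1..i],t[1..j])
Using our recursive argument, we get the following recurrence, where σ is the additive scoring function:

$$F(i+1, j+1) = \max \begin{cases} F(i,j) + \sigma(s[i+1], t[j+1]) \\ F(i, j+1) + \sigma(s[i+1], -) \\ F(i+1, j) + \sigma(-, t[j+1]) \end{cases}$$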


Graphically, this translates to the following:

Certain texts write this algorithm from the perspective of F(i-1,j-1), but I find this method more intuitive. This, of course, makes no difference to the algorithm. We first need to handle the base cases of the recursion, where σ is the additive scoring function:
F(0,0) = 0
F(i+1,0) = F(i,0) + σ(s[i+1],-)
F(0,j+1) = F(0,j) + σ(-,t[j+1])
This allows us to fill the first column and the first row. Since we are using linear gaps, we need to assign a gap cost. Here, I have assigned a gap cost of 2. So the values for the first row and column are 0, -2, -4, etc. Graphically, it looks like the following:

Now we need to find F(1,1). We know that A and A are a perfect match. Therefore, we add 1 in the first case of the recurrence, since it represents a perfect match. The other two cases represent the A,- and -,A columns. To fill in a value for F(1,1), and to fill the rest of the table, we take the maximum of the three.
F(1,1) = max(0+1, -2-2, -2-2) = 1
F(1,2) = max(-2+1, -4-2, 1-2) = -1
...
Remember that A,- and -,A are penalized by the gap cost.


Thus the conclusion is that d(AAAC, AGC) = -1. To find the best alignment, we would need to traceback to F(0,0). In this step, we start from the last cell and simply point our arrows back to the cells we used to derive our cells.

The traceback gives us the best alignment. In this case, one optimal alignment is:
AAAC
AG-C
We chose an arbitrary gap cost for our example. If we had chosen a different value such as 8, we would still have gotten the same traceback.

This algorithm has both space and time complexity of O(mn), since filling the table requires O(mn) operations and the traceback requires O(m+n). In programming terms, N&W involves an iterative matrix method of calculation. All possible pairs of residues (bases or amino acids), one from each sequence, are represented in a 2-dimensional array. All possible alignments (comparisons) are represented by pathways through this array. The following four steps are necessary to align sequence1 of N positions with sequence2 of M positions:
1. Build a matrix of size (N+1) x (M+1), where the extra row and column hold the base cases;
2. Assign similarity values;
3. For each cell, look at all possible pathways back to the beginning of the sequence and give that cell the value of the maximum scoring pathway;
4. Construct an alignment (pathway) back from the final cell F(N,M) to give the highest scoring alignment.
Try out graphical alignment at http://www.itu.dk/people/sestoft/bsa/graphalign.html
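The table filling and traceback described above can be sketched as follows. This is a minimal illustration with a linear gap score; the match score of +1 and gap cost of 2 follow the worked example, while the mismatch score of -1 is an assumption (it is the value consistent with the final score of -1 for AAAC and AGC).

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap score.

    Returns (score, aligned_s, aligned_t).
    """
    n, m = len(s), len(t)
    # F[i][j] is the best score for aligning s[:i] with t[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + gap
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,   # align s[i] with t[j]
                          F[i - 1][j] + gap,       # s[i] against a gap
                          F[i][j - 1] + gap)       # t[j] against a gap
    # Trace back from the last cell to F[0][0].
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return F[n][m], "".join(reversed(a)), "".join(reversed(b))

score, top, bottom = needleman_wunsch("AAAC", "AGC")
print(score)
print(top)
print(bottom)
```

Several alignments tie for the best score in this small example, so which one is reported depends on the order in which the three cases are tested during the traceback.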

Local Alignment: Smith-Waterman Algorithm



Global alignment is useful when we want to align two sequences completely. Very often, however, two sequences do not align completely. In fact we are usually more interested in finding the best alignment of subsequences. For example, we would like to find out if human and mouse haemoglobins are homologous. The highest scoring alignment of subsequences of sequence s and sequence t is called the best local alignment. The algorithm for finding local alignments is similar to the global alignment algorithm, with two notable differences:
1. F(i,j) takes the value 0 if all other options are less than 0. A 0 value corresponds to starting a new local alignment.
2. The traceback can start from anywhere in the matrix. It starts at the maximum value and ends at a 0.
The algorithm is as follows, where σ is the additive scoring function:

$$F(i+1, j+1) = \max \begin{cases} 0 \\ F(i,j) + \sigma(s[i+1], t[j+1]) \\ F(i, j+1) + \sigma(s[i+1], -) \\ F(i+1, j) + \sigma(-, t[j+1]) \end{cases}$$

The base cases are:
F(0,0) = 0
F(i+1, 0) = max(0, F(i,0) + σ(s[i+1],-))
F(0, j+1) = max(0, F(0,j) + σ(-,t[j+1]))
If we have two sequences, s=TAATA and t=TACTAA, we would get the following alignments:

TAATA_
TACTAA

___TAATA
TACTAA
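The two differences from global alignment (the extra 0 option and the traceback starting at the maximum cell) can be sketched as follows. The scores used here (match +1, mismatch -1, gap -2) are illustrative assumptions, not values given in the text.

```python
def smith_waterman(s, t, match=1, mismatch=-1, gap=-2):
    """Local alignment with a linear gap score.

    Returns (best_score, aligned_part_of_s, aligned_part_of_t).
    """
    n, m = len(s), len(t)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(0,                       # start a new local alignment
                          F[i - 1][j - 1] + sub,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Trace back from the maximum cell until a 0 is reached.
    a, b = [], []
    i, j = best_pos
    while F[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))

best, part_s, part_t = smith_waterman("TAATA", "TACTAA")
print(best)
print(part_s)
print(part_t)
```

Only the first maximum cell encountered is traced back here; a fuller implementation might report every cell that reaches the best score.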

For local alignment to work, the expected score for a random match must be negative. If that is not true, then long matches between entirely unrelated sequences will have high scores, just based on their length. As a consequence, although the algorithm is local, the maximal scoring alignments would be global or nearly global. A true subsequence alignment would be likely to be masked by a longer but incorrect alignment, just because of its length. Similarly, there must be some σ(a,b) greater than 0, otherwise the algorithm won't find any alignment at all. [1] The expected score of a random match is required to be negative. In the ungapped case, only the expected value of a fixed length alignment needs to be considered, and in the random model all residues are independent. This gives the following condition:

$$\sum_{a,b} q_a q_b \, \sigma(a,b) < 0$$

where qa is the probability that symbol a occurs at any given position in a sequence. When σ(a,b) is derived as a log likelihood ratio, using the same qa as for the random probabilities, the condition above is satisfied. No equivalent analysis exists for optimal gapped alignments. There is no analytical method for predicting which gap scores will result in local versus global alignment behavior.
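The requirement that a random match has a negative expected score can be checked directly for a given scoring scheme. The sketch below assumes a uniform DNA background and two simple match/mismatch schemes chosen only for illustration.

```python
def expected_score(q, score):
    """Expected score of one random aligned pair: sum of q_a * q_b * score(a, b)."""
    return sum(q[a] * q[b] * score(a, b) for a in q for b in q)

q = {a: 0.25 for a in "ACGT"}  # assumed uniform background frequencies

def plus_one_minus_one(a, b):
    return 1 if a == b else -1

def plus_three_minus_one(a, b):
    return 3 if a == b else -1

print(expected_score(q, plus_one_minus_one))    # negative: suitable for local alignment
print(expected_score(q, plus_three_minus_one))  # not negative: long random matches would score well
```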


Repeated Matches
If one or both sequences are long enough, we would most probably find several different local alignments with a significant score. For example, we might find several copies of a repeated domain or motif in a protein. Here we are interested in an asymmetric method which finds one or more non-overlapping copies of sections of one sequence (e.g. a domain or motif) in the other. [1] In our algorithm, we are interested in sequence matches with a score higher than a certain threshold T. The reason for defining T is that we would always find small subsequences with small positive scores, which would quite likely match unrelated sequences. The following notation is used for this algorithm:
y is a sequence containing some domain or motif
x is the sequence in which we are looking for multiple matches
T is some threshold score value
Once again, we use the F(i,j) matrix, but the recurrence is now different, as is the meaning of F(i,j). In the final alignment, x will be partitioned into regions that match parts of y in gapped alignments, and regions that are unmatched. The score of a completed match region is the standard gapped alignment score minus the threshold T. [1] The algorithm obtains all local alignments in one pass. Changing the value of T changes what the algorithm finds.


Overlap Matches


Source:
[1] Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids by Durbin et al.
[2] Dr. Nir Friedman's lectures: www.cs.huji.ac.il/~nir
[3] http://www.itu.dk/people/sestoft/bsa/graphalign.html
[4] http://thor.info.uaic.ro/~ciortuz/SLIDES/pairAlign.pdf

Dynamic Programming with more complex models



A linear gap scoring scheme is not ideal for biological sequences, since it penalizes every additional gapped residue as much as the first, whereas gaps are often longer than one residue. If we are given a general gap function γ(g), we can still use all the dynamic programming algorithms described in the previous section, with adjustments to the recurrence relation. This, however, requires O(n^3) operations rather than O(n^2), which makes it impractical for long sequences.

Alignment with affine gap scores


The alternative is to assume an affine gap cost structure. This brings us back to an O(n^2) implementation of dynamic programming. It does, however, require us to keep track of multiple values for each pair of residue indices (i,j) in place of the single value F(i,j). This corresponds to a finite state automaton (FSA). An alignment corresponds to a path through the states, with symbols from the underlying pair of sequences being transferred to the alignment according to the (i,j) values in the states.
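One common way to keep several values per cell is the three-state formulation described in Durbin: a match state M and two gap states Ix and Iy. The sketch below computes only the optimal global score, and, like that formulation, it assumes a gap in one sequence is never followed directly by a gap in the other. The match/mismatch scores and the default penalties are illustrative assumptions.

```python
NEG_INF = float("-inf")

def affine_global_score(x, y, d=8, e=2, match=1, mismatch=-1):
    """Optimal global alignment score with gap score -d - (g - 1) * e."""
    n, m = len(x), len(y)
    # M: x[i] aligned to y[j]; Ix: x[i] aligned to a gap; Iy: y[j] aligned to a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        Ix[i][0] = -d - (i - 1) * e
    for j in range(1, m + 1):
        Iy[0][j] = -d - (j - 1) * e
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + sub
            Ix[i][j] = max(M[i - 1][j] - d, Ix[i - 1][j] - e)   # open or extend
            Iy[i][j] = max(M[i][j - 1] - d, Iy[i][j - 1] - e)   # open or extend
    return max(M[n][m], Ix[n][m], Iy[n][m])

print(affine_global_score("ACGT", "AT", d=8, e=2))  # one gap of length 2
print(affine_global_score("ACGT", "AT", d=8, e=8))  # behaves like a linear gap cost
```

Setting e equal to d reduces the gap score to the linear form, which gives a simple sanity check against a linear-gap implementation.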

Heuristic Algorithms
Dynamic programming is guaranteed to find the best solution but has a complexity of O(mn), which is too slow for searching large databases. Heuristic algorithms do not guarantee the best solution but are very fast in comparison with exact algorithms such as dynamic programming.

BLAST
BLAST finds high scoring local alignments between a query sequence and a target database, both of which can be either DNA or protein. The idea is that true match alignments are very likely to contain short stretches of high scoring identities. We use these as seeds and extend the alignments from them. BLAST makes a list of all neighborhood words of a fixed length that would match the query sequence somewhere with a score higher than some threshold.
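The idea of a neighborhood word list can be illustrated with a brute-force sketch. This is not how BLAST is implemented; it simply enumerates every word of length w over a toy alphabet and keeps those scoring above the threshold against some word of the query. The scoring function, alphabet, and parameter values are illustrative assumptions.

```python
from itertools import product

def neighborhood_words(query, w, threshold, score, alphabet):
    """Map each query position to the words of length w scoring above threshold."""
    hits = {}
    for pos in range(len(query) - w + 1):
        q_word = query[pos:pos + w]
        for cand in product(alphabet, repeat=w):
            total = sum(score(a, b) for a, b in zip(q_word, cand))
            if total > threshold:
                hits.setdefault(pos, []).append("".join(cand))
    return hits

def toy_score(a, b):
    """Hypothetical scoring: +2 for identity, -1 otherwise."""
    return 2 if a == b else -1

words = neighborhood_words("ACG", w=3, threshold=2, score=toy_score, alphabet="ACGT")
print(len(words[0]), words[0][:5])
```

With these toy scores, the exact word and every word with a single mismatch pass the threshold, while words with two or more mismatches do not.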

FASTA
FASTA uses a multistep approach to finding local high scoring alignments:
1. Use a lookup table to locate all identically matching words of length ktup between the 2 sequences.
2. Look for diagonals with many mutually supporting word matches.
3. Pursue the best diagonals, extending the exact word matches to find maximal scoring ungapped regions. This is analogous to hit extension in BLAST.
4. Check to see if any of these ungapped regions can be joined by a gapped region, allowing for gap costs.
5. The highest scoring candidate matches in a search database are realigned using the full dynamic programming algorithm, but restricted to a subregion of the dynamic programming matrix forming a band around the candidate heuristic match.
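The first two steps above (the word lookup table and the search for well-supported diagonals) can be sketched as follows. This is a simplified illustration rather than the FASTA implementation; the function names and the example sequences are assumptions.

```python
from collections import Counter, defaultdict

def ktup_index(seq, ktup):
    """Lookup table: every word of length ktup mapped to its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        index[seq[i:i + ktup]].append(i)
    return index

def diagonal_hits(query, target, ktup):
    """Count identical word matches on each diagonal (query offset minus target offset)."""
    index = ktup_index(query, ktup)
    diagonals = Counter()
    for j in range(len(target) - ktup + 1):
        for i in index.get(target[j:j + ktup], ()):
            diagonals[i - j] += 1
    return diagonals

hits = diagonal_hits("TAATA", "TACTAA", ktup=2)
print(hits.most_common())
```

Diagonals with the highest counts are the candidates that the later steps would extend and try to join.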

Hidden Markov Models


Hidden Markov Models (HMMs) have many applications in bioinformatics. They are, for example, used to search for patterns in a sequence. Here a pattern refers to a particular chain of characters arranged in a particular order, e.g. a TATA box or a CpG island.

Patterns
Patterns can be deterministic or non-deterministic at initial inspection. For example, traffic lights follow a predictable pattern. Yellow follows green and red follows yellow. Weather, however, does not follow a predictable pattern in most parts of the planet. A sunny day can be followed by a rainy day, cloudy day or even another sunny day.

Markov Chains
It is necessary to learn Markov chains before one can understand hidden Markov models. Suppose we are looking for CpG islands in a sequence. If we are using a probabilistic model, we would want a model where the probability of a symbol depends on the previous symbol. A Markov chain is such a model.


A Markov chain is a set of states connected by arrows called transitions. Each transition has a probability parameter associated with it. The probability parameter on an arrow from C to G represents the probability of a G following a C.

A finite Markov chain is an integer-time stochastic process, consisting of:
1. a domain D of m states {s1,...,sm}
2. an m-dimensional initial distribution vector (p(s1),...,p(sm))
3. an m x m transition probability matrix M = (asisj)
For DNA:
1. D = {A,C,G,T}
2. p(A) is the probability of A being the first letter in the sequence, and aAG is the probability that G follows A in a sequence.
3. The matrix M is shown below.

This matrix represents the transition probabilities, M = (ast). Note that each row of the matrix sums to 1. The transition probability is represented as follows:

$$a_{st} = P(x_i = t \mid x_{i-1} = s)$$

Note that this is simply the conditional probability P(t|s), the probability of t occurring given that s has just occurred. A key property of Markov chains is that the probability of each symbol xi depends only on the preceding symbol xi-1 rather than on the entire sequence. [1] The probability of a sequence x of length L is therefore:

$$P(x) = P(x_1) \prod_{i=2}^{L} a_{x_{i-1} x_i}$$

This equation shows that we need to specify P(x1), the probability of starting in a particular state, in addition to the transition probabilities. We now add a begin state and an end state to our model to ensure that the beginning and end are modeled. $\mathcal{B}$ is the begin state and $\mathcal{E}$ is the end state:

$$P(x_1 = s) = a_{\mathcal{B}s} \qquad P(\mathcal{E} \mid x_L = t) = a_{t\mathcal{E}}$$

No emitted symbol needs to be associated with the begin and end states. They simply serve as points where transitions begin and end. The end state is useful in modeling the distribution of lengths of sequences. With a fixed probability of moving to the end state, the distribution over lengths decays exponentially.
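The probability of a sequence under a Markov chain is the initial probability times the product of the transition probabilities, which in code is most safely accumulated in log space. The initial and transition values below are invented for illustration and do not come from real sequence data.

```python
import math

def markov_log_prob(x, initial, transition):
    """log P(x) = log P(x1) + sum over i of log a[x_{i-1}][x_i]."""
    logp = math.log(initial[x[0]])
    for prev, cur in zip(x, x[1:]):
        logp += math.log(transition[prev][cur])
    return logp

# Hypothetical parameters; each row of the transition table sums to 1.
INITIAL = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
TRANSITION = {
    "A": {"A": 0.4, "C": 0.2, "G": 0.3, "T": 0.1},
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "G": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "T": {"A": 0.1, "C": 0.3, "G": 0.3, "T": 0.3},
}

print(math.exp(markov_log_prob("AAG", INITIAL, TRANSITION)))
```

This sketch leaves out the explicit begin and end states; the initial distribution plays the role of the transitions out of the begin state.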

Using Markov chains for discrimination


In human genomes the C of the pair CG is often methylated, and methyl-C frequently mutates to T, so that CG often transforms to TG. Hence the pair CG appears less often than would be expected from the independent frequencies of C and G alone. For biological reasons, this process is sometimes suppressed in short stretches of the genome, such as the start regions of many genes. These areas are called CpG islands. To be able to discriminate between CpG islands and non-CpG islands, we model strings with and without CpG islands as Markov chains over the same states {A,C,G,T} but with different transition probabilities:
+ model: uses the transition matrix a+st, where a+st is the probability that t follows s in a CpG island
- model: uses the transition matrix a-st, where a-st is the probability that t follows s in a non-CpG island
We produce a matrix for the + model and another for the - model. To use these models for discrimination, we calculate the log-odds ratio:

$$S(x) = \log \frac{P(x \mid \text{model } +)}{P(x \mid \text{model } -)} = \sum_{i=1}^{L} \log \frac{a^{+}_{x_{i-1} x_i}}{a^{-}_{x_{i-1} x_i}}$$

The per-transition terms log(a+st / a-st) form another matrix, which should ideally discriminate between CpG islands and other regions by positive and negative scores. If S(x) is positive (that is, the likelihood ratio is greater than 1), a CpG island is more likely.
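The discrimination procedure can be sketched by estimating the two transition tables from labeled training sequences and then summing log ratios over the transitions of a query. The training strings below are tiny invented examples, the pseudocount is an assumption to avoid zero probabilities, and the score ignores the term for the first symbol.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def estimate_transitions(seqs, pseudocount=1.0):
    """Estimate a[s][t] from dinucleotide counts, with a pseudocount per cell."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    table = {}
    for s in ALPHABET:
        row_total = sum(counts[(s, t)] + pseudocount for t in ALPHABET)
        table[s] = {t: (counts[(s, t)] + pseudocount) / row_total for t in ALPHABET}
    return table

def log_odds(x, plus, minus):
    """Sum of log2(a+ / a-) over consecutive symbol pairs of x."""
    return sum(math.log2(plus[s][t] / minus[s][t]) for s, t in zip(x, x[1:]))

# Hypothetical training data standing in for island and non-island regions.
plus_model = estimate_transitions(["CGCGCGGCGC", "GCGCGCCGCG"])
minus_model = estimate_transitions(["ATTATATTAA", "TTAATATGAT"])

print(round(log_odds("CGCG", plus_model, minus_model), 3))
print(round(log_odds("ATAT", plus_model, minus_model), 3))
```

A positive total suggests the + model explains the query better, a negative total the - model.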

Hidden Markov Models


Using Markov chains, we had to build two models, + and -. Now we would like to use one model to do both. To do this, we need to combine both the + and - probabilities into one model. Thus we end up with two states corresponding to each nucleotide symbol. To avoid confusion, we rename our states from A, C, G, T to A+, C+, G+, T+ and A-, C-, G-, T-. The transition probabilities in this model are set so that within each group they are close to the transition probabilities of the original component model, but there is also a small but finite chance of switching into the other component. Overall there is more chance of switching from + to - than vice versa, so if left to run free, the model will spend more of its time in the - non-island states than in the island states. [1]
A hidden Markov model (HMM) is a statistical model where the system being modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable parameters. The extracted model parameters can then be used to perform further analysis. In a regular Markov model, the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters. In a hidden Markov model, the state is not directly visible, but variables influenced by the state are visible. Each state has a probability distribution over the possible output tokens. Therefore the sequence of tokens generated by an HMM gives some information about the sequence of states. The essential difference between a Markov chain and a hidden Markov model is that while there is a 1-1 correspondence between states and symbols in Markov chains, the same isn't true for hidden Markov models.
Definition: An HMM is a triplet M = (S, Q, T) where:
S is an alphabet of symbols
Q is a finite set of states, capable of emitting symbols from the alphabet S
T is a set of probabilities, comprised of:
State transition probabilities, denoted by akl for each k, l ∈ Q
Emission probabilities, denoted by ek(b) for each k ∈ Q and b ∈ S
We now need to distinguish the sequence of states from the sequence of symbols. The sequence of states is called the path π. The path itself follows a simple Markov chain, so the probability of a state depends only on the previous state. As in the Markov chain model, we can define the state transition probabilities in terms of π:

$$a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$$

This is the probability of going to l given that we are at k. Since we have decoupled the symbols b from the states k, there is no longer a 1-1 correspondence between states and symbols. Thus we must introduce a new set of parameters for the model:


$$e_k(b) = P(x_i = b \mid \pi_i = k)$$

ek(b) is the probability that the symbol b is seen in state k. These are known as emission probabilities. The joint probability of an observed sequence x and a state sequence π is:

$$P(x, \pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i) \, a_{\pi_i \pi_{i+1}}$$

where for convenience state 0 denotes the begin state and π(L+1) the end state. First a state π1 is chosen according to the probabilities a0i. In that state a symbol is emitted according to the distribution eπ1 for that state. Then a new state π2 is chosen according to the transition probabilities aπ1i, and so forth. P(x,π) is the joint probability of an observed sequence x and a state sequence π. In practice, this is not very useful since very often we do not know the path. So we have to estimate the path, either by finding the most likely path or by using an a posteriori distribution over states.

Using HMM
There are 3 canonical problems associated with HMMs:
1. Given the parameters of the model, compute the probability of a particular output sequence. This problem is solved by the forward algorithm (part of the forward-backward procedure).
2. Given the parameters of the model, find the most likely sequence of hidden states that could have generated a given output sequence. This problem is solved by the Viterbi algorithm.
3. Given an output sequence or a set of such sequences, find the most likely set of state transition and output probabilities. In other words, train the parameters of the HMM given a dataset of sequences. This problem is solved by the Baum-Welch algorithm.

Viterbi Algorithm
Although it is no longer possible to tell what state the system is in by looking at the corresponding symbol, it is often the sequence of underlying states that we are interested in. Decoding is the act of finding out the meaning of a sequence by considering the underlying states. A commonly used method is a dynamic programming algorithm, Viterbi. In general, many different state paths can give rise to any particular sequence of symbols. For example, the following 3 state paths all produce the symbol sequence CGCG:
1. C+, G+, C+, G+
2. C-, G-, C-, G-
3. C+, G-, C+, G-
The probability of the first is larger than that of the second, which is larger than that of the third. So if we are to choose only one path, the first one, with the highest probability, should be chosen. We can calculate the most probable path in a hidden Markov model using a dynamic programming algorithm.

The most probable path π* can be found recursively. Suppose the probability Vk(i) of the most probable path ending in state k with observation xi is known for all states k. Then these probabilities can be calculated for observation xi+1 as:

$$V_l(i+1) = e_l(x_{i+1}) \max_k \left( V_k(i) \, a_{kl} \right)$$


All sequences have to start in state 0 (the begin state), so the initial condition is that V0(0) = 1. By keeping pointers backwards, the actual state sequence can be found by backtracking.

We use logs to convert products to sums to avoid underflow problems. The Viterbi algorithm makes three assumptions:
1. Viterbi operates on state machine assumptions
2. Transition from a previous state to a new state is marked by an incremental metric
3. Events are cumulative
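A log-space version of the Viterbi recursion with backpointers might look like the sketch below. To keep it short it uses a simplified two-state model ("+" for island, "-" for non-island) rather than the eight-state CpG model, it has no explicit end state, and all probabilities are invented for illustration.

```python
import math

def viterbi(obs, states, start, trans, emit):
    """Return (most probable state path, its log probability)."""
    # V[i][k]: log probability of the best path ending in state k at position i.
    V = [{k: math.log(start[k]) + math.log(emit[k][obs[0]]) for k in states}]
    back = []
    for sym in obs[1:]:
        row, ptr = {}, {}
        for l in states:
            best_k = max(states, key=lambda k: V[-1][k] + math.log(trans[k][l]))
            row[l] = V[-1][best_k] + math.log(trans[best_k][l]) + math.log(emit[l][sym])
            ptr[l] = best_k
        V.append(row)
        back.append(ptr)
    # Backtrack from the best final state.
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, V[-1][last]

# Hypothetical two-state model.
STATES = ("+", "-")
START = {"+": 0.5, "-": 0.5}
TRANS = {"+": {"+": 0.9, "-": 0.1}, "-": {"+": 0.1, "-": 0.9}}
EMIT = {
    "+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

path, logp = viterbi("CGCGATAT", STATES, START, TRANS, EMIT)
print("".join(path), round(logp, 3))
```

Working with sums of logs rather than products of probabilities is what keeps the values from underflowing on long sequences.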

Forward Algorithm
Since many different state paths can give rise to the same sequence x, we must add the probabilities for all possible paths to obtain the full probability of x:

$$P(x) = \sum_{\pi} P(x, \pi)$$

The number of possible paths increases exponentially with the length of the sequence, so evaluation by enumerating all paths is not practical. Neither approximation nor enumeration is necessary, as the full probability can itself be calculated by a dynamic programming procedure similar to the Viterbi algorithm, replacing the maximization steps with sums. This is called the forward algorithm. The quantity corresponding to the Viterbi variable Vk(i) in the forward algorithm is:

$$f_k(i) = P(x_1 \ldots x_i, \pi_i = k)$$

Note: This post is a summary of chapter 3 of Durbin.
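Replacing the maximization with a sum gives a compact forward algorithm. The sketch below works with plain probabilities for readability, so it is only suitable for short sequences; longer ones would need scaling or log-space arithmetic. The two-state model and its parameters are invented for illustration, and no explicit end state is modeled.

```python
def forward(obs, states, start, trans, emit):
    """Return P(obs), summed over all state paths."""
    # f[k] holds f_k(i) = P(x_1..x_i, pi_i = k) for the current position i.
    f = {k: start[k] * emit[k][obs[0]] for k in states}
    for sym in obs[1:]:
        f = {l: emit[l][sym] * sum(f[k] * trans[k][l] for k in states)
             for l in states}
    return sum(f.values())

# Hypothetical two-state model ("+" island, "-" non-island).
STATES = ("+", "-")
START = {"+": 0.5, "-": 0.5}
TRANS = {"+": {"+": 0.9, "-": 0.1}, "-": {"+": 0.1, "-": 0.9}}
EMIT = {
    "+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

print(forward("CG", STATES, START, TRANS, EMIT))
```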

Multiple Sequence Alignment


Multiple sequence alignment techniques are most commonly applied to protein sequences; ideally they are a statement of both evolutionary and structural similarity among the proteins encoded by each sequence in the alignment. Multiple alignments must usually be inferred from primary sequences alone. Biologists produce high quality multiple sequence alignments by hand using expert knowledge of protein sequence evolution. This knowledge comes from experience. Important factors include:
specific sorts of columns in alignments, such as highly conserved residues or buried hydrophobic residues
the influence of secondary or tertiary structure, such as the alternation of hydrophobic and hydrophilic columns in exposed beta sheets
expected patterns of insertions and deletions, which tend to alternate with blocks of conserved sequence
The phylogenetic relationships between sequences dictate constraints on the changes that occur in columns and in the patterns of gaps.
Manual alignment is tedious. Automating the process is difficult because it is hard to define exactly what an optimal multiple sequence alignment is, and it is impossible to set a standard for a single correct multiple alignment. In theory, there is one underlying evolutionary process and one evolutionarily correct alignment generated from any group of sequences. However, the differences between sequences can be so great in parts of an alignment that there isn't an apparent, unique solution to be found by an alignment algorithm. Those same divergent regions are often structurally unalignable as well. Most of the insight that we derive from multiple alignments comes from analyzing the regions of similarity, not from attempting to align highly diverged regions.
In general, an automatic method must have a way to assign a score so that better multiple alignments get better scores. We should carefully distinguish the problem of scoring a multiple alignment from the problem of searching over possible multiple alignments to find the best one. To automate multiple alignment, we need to do the following:
look at what we need to do for automatic multiple alignment structurally and evolutionarily
consider how to turn the biological criteria into a numerical scoring scheme, so that a program will recognize a good multiple alignment
examine various approaches taken by different multiple alignment programs
describe a full probabilistic multiple alignment approach based on profile HMMs

What does a multiple alignment mean?


In a multiple sequence alignment, homologous residues among a set of sequences are aligned together in columns. Homologous is meant in both the structural and the evolutionary sense. Ideally, a column of aligned residues occupies similar 3D structural positions, and all residues in it diverge from a common ancestral residue. Except for trivial cases of nearly identical sequences, it is not possible to unambiguously identify structurally or evolutionarily homologous positions and create a single correct multiple alignment.

Since protein structures also evolve, we do not expect two protein structures with different sequences to be entirely superposable. Even the definition of structurally superposable is subjective and can be expected to vary among experts. In principle, there is always an unambiguously correct evolutionary alignment even if the structures diverge. In practice, however, an evolutionarily correct alignment can be even more difficult to infer than a structural alignment. Structural alignment has an independent point of reference: the superposition of X-ray crystallography or NMR structures. The evolutionary history of the residues of a sequence family cannot be known independently from any source; it must be inferred from sequence alignment.

An alignment program should therefore not be expected to reproduce any one reference alignment exactly. Instead, it should focus on the subset of columns corresponding to key residues and core structural elements that can be aligned with confidence.

Summary
- A multiple alignment is an alignment of more than two sequences.
- It usually gives more information about conserved regions than a pairwise alignment.
- It gives a better estimate of significance when using a sequence of unknown function.
- Multiple alignments must be used when establishing phylogenetic relationships.

Note: This post is a summary of chapter 6.1 of Durbin.

Scoring MSA
The scoring system should take two important features into account:
1. Some positions are more conserved than others.
2. Sequences are not independent, but are related by a phylogenetic tree.

An idealized way to score a multiple alignment would be to specify a complete probabilistic model of molecular sequence evolution. Given the correct phylogenetic tree for the sequences, the probability of a multiple alignment is the product of the probabilities of all the evolutionary events necessary to produce that alignment via ancestral intermediate sequences, times the prior probability of the root ancestral sequence. This evolutionary model would be very complex. The probabilities of evolutionary change would depend on the evolutionary times along each branch of the tree, as well as on position-specific structural and functional constraints imposed by natural selection, so that key residues and structural elements would be conserved. High-probability alignments would then be good structural and evolutionary alignments under this model.

Unfortunately, we do not have enough data to parameterize such a model, so simplifying assumptions must be made. Almost all alignment methods assume that the individual columns of an alignment are statistically independent. Such a scoring function is written as follows:

$$S(m) = G + \sum_i S(m_i)$$

where $m_i$ is column i of the multiple alignment m, $S(m_i)$ is the score for column i, and G is a function for scoring the gaps that occur in the alignment. Most multiple alignment methods use affine gap scoring functions, which charge a higher cost for opening a gap than for extending it, so successive gaps are not treated independently. In other words, we sum the alignment scores of the individual columns and then add the gap score G. Now let's focus on definitions of $S(m_i)$ for scoring a column of aligned residues with no gaps.

Minimum Entropy
Let's look at this equation:

$$c_{ia} = \sum_j \delta(m_i^j = a)$$

where
- $\delta(\ldots)$ is 1 if the condition inside the function is true, and 0 otherwise
- m is a multiple alignment
- i is a column
- j is a sequence
- $m_i^j$ is the symbol in column i for sequence j
- $c_{ia}$ is the observed count of residue a in column i
- $m_i$ is the column of aligned symbols in column i
- $c_i = (c_{i1}, \ldots, c_{iK})$ is the count vector of observed symbols in column i for an alphabet of K different residues; it records how often each symbol is observed in the column (if residue a is seen 7 times in column i, then $c_{ia} = 7$)
- $p_{ia}$ is the probability of residue a in column i

If the phylogenetic tree for the sequences has many intermediate ancestors, then the statistical dependence between sequences is complex. The scoring problem is greatly simplified if we assume that the sequences have all been generated independently. If we assume that residues within a column are independent, as well as being independent between columns, then the probability of a column $m_i$ is:

$$P(m_i) = \prod_a p_{ia}^{c_{ia}}$$

That is, $P(m_i)$ is obtained by multiplying, over all sequences j, the probability of the residue observed in column i of sequence j. We can define a column score as the negative log of this probability:

$$S(m_i) = -\sum_a c_{ia} \log p_{ia}$$

This entropy measure is a convenient measure of the variability observed in an aligned column of residues. The more variable the column is, the higher the entropy. A completely conserved column scores 0 when $p_{ia}$ is estimated from the observed frequencies. A good alignment is one which minimizes the total entropy of the alignment.

Thus, in return for giving up the evolutionary tree and assuming independence between sequences, we gain the ability to straightforwardly estimate a position-specific model of both residue probabilities in columns and insertions and deletions. This assumption, however, is only reasonable if the representative sequences of a sequence family are chosen carefully. In practice, sample sequences are often biased, with under- or over-representation of subfamilies. Several tree-based weighting schemes have been devised to deal with this.
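
The minimum entropy column score can be sketched in a few lines of Python. This is only an illustration under stated assumptions: columns contain no gaps, and the residue probabilities are estimated as the observed frequencies in the column (no pseudocounts, no sequence weighting). The function names are made up for this example and do not come from any library.

```python
import math
from collections import Counter


def column_entropy_score(column):
    """Return -sum_a c_ia * log(p_ia) for one gap-free column.

    Assumption: p_ia is estimated as the observed frequency c_ia / N,
    where N is the number of sequences in the column.
    """
    counts = Counter(column)        # c_ia for every residue a seen in the column
    total = sum(counts.values())    # N, the number of residues in the column
    return -sum(c * math.log(c / total) for c in counts.values())


def alignment_entropy_score(rows):
    """Sum the column scores over an alignment given as equal-length strings."""
    return sum(column_entropy_score(col) for col in zip(*rows))


# A fully conserved column scores 0; a more variable column scores higher.
conserved = column_entropy_score("AAAA")
variable = column_entropy_score("AACC")
```

Under these assumptions a lower total is better, so an alignment search would try to minimize `alignment_entropy_score`.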

Sum of pairs: SP scores


Several kinds of column score are in use:
- Sum-of-Pairs scores sum all possible pairwise match scores between amino acids in an aligned column.
- Entropic scores use Shannon's information-theoretical entropy to measure the diversity of symbols (amino acids) in a column.
- Matrix scores employ a substitution matrix to evaluate stereochemical diversity in a column.
- Sequence-weighted scores normalize against redundancy of sequences in the alignment.

The standard method of scoring multiple alignments is not the HMM formulation, but it is similar in that it does not use a phylogenetic tree and it assumes statistical independence of the columns. Columns are scored by an SP function using a substitution scoring matrix. The SP score for a column is defined as:

$$S(m_i) = \sum_{k < l} s(m_i^k, m_i^l)$$

where the scores s(a,b) come from a substitution scoring matrix such as a PAM or BLOSUM matrix. For simple linear gap costs, gaps are handled by defining s(a,-) and s(-,a) to be the gap cost, and s(-,-) to be zero. Otherwise gap costs are scored separately (e.g. affine gap costs).

Substitution scores are derived as log-odds scores for pairwise comparisons, so the SP score has no strict probabilistic justification for more than two sequences. For instance, the natural log-odds extension to three sequences would be

$$\log \frac{p_{abc}}{q_a q_b q_c}$$

rather than the SP score

$$\log \frac{p_{ab}}{q_a q_b} + \log \frac{p_{bc}}{q_b q_c} + \log \frac{p_{ac}}{q_a q_c}$$

A consequence is that, under SP scoring, the relative difference between the scores of a correct and an incorrect alignment decreases with the number of sequences in the alignment.

Note: This post is a summary of chapter 6.2 of Durbin.
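
As a rough illustration of SP scoring with linear gap costs, the sketch below sums a pairwise score over all pairs of symbols in each column. The scoring function `toy_pair_score` is an assumption made for this example (+1 match, -1 mismatch, gap cost 2); a real program would look up s(a,b) in a PAM or BLOSUM matrix instead.

```python
from itertools import combinations

GAP = "-"


def toy_pair_score(a, b, gap_cost=2):
    """Hypothetical s(a, b): s(-,-) = 0, s(a,-) = s(-,a) = -gap_cost."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return -gap_cost
    return 1 if a == b else -1


def sp_column_score(column, s=toy_pair_score):
    """SP score of one column: sum of s over all unordered pairs of rows."""
    return sum(s(a, b) for a, b in combinations(column, 2))


def sp_alignment_score(rows, s=toy_pair_score):
    """SP score of an alignment, assuming statistically independent columns."""
    return sum(sp_column_score(col, s) for col in zip(*rows))


# Three aligned sequences, one per row.
score = sp_alignment_score(["AA-", "AAC", "ACC"])
```

Passing the pairwise scoring function as a parameter keeps the column logic independent of the choice of substitution matrix.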

Multidimensional dynamic programming


The dynamic programming algorithms used for pairwise sequence alignment can in theory be extended to any number of sequences. However, the time and memory requirements increase exponentially with the number of sequences: for N sequences of length L, the dynamic programming matrix has $L^N$ entries and filling it takes $O(2^N L^N)$ time. The only assumption necessary to make multidimensional dynamic programming work is that column scores are independent.

A common approach to multiple sequence alignment is to progressively align pairs of sequences. The general strategy is:
1. A starting pair of sequences is selected and aligned.
2. Each subsequent sequence is aligned to the previous alignment.

This is a greedy heuristic algorithm. A greedy algorithm decomposes a problem into pieces, and then chooses the best solution to each piece without paying attention to the problem as a whole. Since it is a heuristic algorithm, progressive alignment is not guaranteed to find the best solution. In practice, however, progressive alignment methods are efficient and produce biologically meaningful results.

The program MSA uses a clever heuristic multidimensional dynamic programming algorithm. It assumes an SP scoring system for both residues and gaps, so that the score of a multiple alignment is the sum of the scores of all pairwise alignments defined by the multiple alignment. The score of the complete alignment a is given by:

$$S(a) = \sum_{k < l} S(a^{kl})$$

where $a^{kl}$ is the pairwise alignment of sequences k and l induced by a.

Let $a^{kl}$ denote the pairwise alignment of sequences k and l induced by the multiple alignment, and let $\hat{a}^{kl}$ be the optimal pairwise alignment of k and l, which we can calculate in $O(L^2)$ time by standard dynamic programming. Obviously, $S(a^{kl}) \le S(\hat{a}^{kl})$. Combining this simple observation with the definition of the SP scoring system, we obtain a lower bound on the score of any pairwise alignment that can occur in the optimal multiple alignment. We then only need to consider pairwise alignments that score at least this lower bound, which significantly reduces the complexity.

Note: This post is a summary of chapter 6.3 of Durbin.
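
The lower bound can be written out explicitly. The derivation below is a sketch of the usual Carrillo-Lipman argument. Here $a^{kl}$ is the pairwise alignment of sequences k and l induced by the optimal multiple alignment a, and $\hat{a}^{kl}$ is their optimal pairwise alignment. The derivation assumes that some lower bound $\sigma(a)$ on the score of the optimal multiple alignment is already known, for example the score of a quick progressive alignment; the symbols $\sigma(a)$ and $\beta^{kl}$ are introduced here for illustration and do not appear in the text above.

```latex
\begin{align*}
\sigma(a) \le S(a) &= \sum_{k' < l'} S(a^{k'l'}) \\
  &\le S(a^{kl}) + \sum_{(k',l') \ne (k,l)} S(\hat{a}^{k'l'}) \\
  &= S(a^{kl}) - S(\hat{a}^{kl}) + \sum_{k' < l'} S(\hat{a}^{k'l'}), \\
\text{hence}\quad S(a^{kl}) &\ge \beta^{kl}
  = \sigma(a) + S(\hat{a}^{kl}) - \sum_{k' < l'} S(\hat{a}^{k'l'}).
\end{align*}
```

Every quantity on the right-hand side of the last line can be computed from pairwise alignments alone, so cells of the multidimensional matrix whose pairwise projections score below $\beta^{kl}$ need not be visited.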

Progressive alignment methods


Progressive alignment works by constructing a succession of pairwise alignments:
1. Choose two sequences and align them by pairwise alignment.
2. Choose a third sequence and align it to the previous alignment.
3. Repeat this until no sequences are left.

There are several progressive alignment strategies. They differ in the following ways:
- in the way they choose the order of alignment
- in whether the progression involves only alignment of sequences to a single growing alignment, or whether subfamilies are built on a tree structure which is used for alignments
- in the procedure used to align and score sequences or alignments against existing alignments

Progressive alignment is heuristic:
- It does not separate the process of scoring an alignment from the optimization algorithm.
- It does not directly optimize any global scoring function of alignment correctness.

It is, however, relatively fast and efficient, and in many cases the resulting multiple alignment is quite reasonable.

The most important heuristic of progressive alignment algorithms is to align the most similar pairs of sequences first, because these are the most reliable alignments. Most algorithms build a binary guide tree whose leaves represent sequences and whose interior nodes represent alignments. The root node represents a complete multiple alignment and the nodes furthest from the root represent the most similar pairs.
                MA
              /    \
           AL1      AL2
          /   \    /   \
       AL3    S1  S2    S3
      /   \
     S4    S5

Feng-Doolittle Algorithm
1. Calculate a diagonal matrix of N(N-1)/2 distances between all pairs of N sequences by standard pairwise alignment, converting raw alignment scores to pairwise distances.
2. Construct a guide tree from the distance matrix using the clustering algorithm of Fitch & Margoliash.
3. Starting from the first node added to the tree, align the child nodes. Repeat for the other nodes in the order in which they were added to the tree, i.e. from most similar to least similar, until all sequences have been aligned.

To convert alignment scores to distance values, we use the following formula:

$$D = -\log S_{eff} = -\log \frac{S_{obs} - S_{rand}}{S_{max} - S_{rand}}$$

where
- $S_{obs}$: observed pairwise alignment score
- $S_{max}$: maximum score, the average of the scores of aligning each of the two sequences to itself
- $S_{rand}$: expected score of aligning two random sequences of the same length and residue composition
- $S_{eff}$: can be seen as a normalized percentage similarity that decreases to 0 with increasing evolutionary distance

The method for converting alignment scores to distances does not need to be very accurate, as the goal is only to create an approximate guide tree. Fitch & Margoliash is a fast clustering algorithm that builds evolutionary trees from distance matrices.

To add a sequence to an existing group, it is aligned pairwise to each member of the group in turn. The highest-scoring of these pairwise alignments determines how the new sequence will be aligned to the group. Group-to-group alignments are done by comparing all possible pairwise alignments of the members of one group with the members of the other group; the best of these alignments determines the alignment of the two groups. Generally, a PAM substitution matrix with an affine gap penalty is used.

After an alignment is completed, gaps are replaced by the neutral symbol X. The rule is: once a gap, always a gap. This rule is put in place to ensure consistency. It also encourages gaps to occur in the same columns in subsequent pairwise alignments, because anything can be aligned to an X at no cost: s(X,anything) = 0.

A problem with Feng-Doolittle is that all alignments are determined by pairwise sequence alignments. Once an aligned group has been built up, it is advantageous to use position-specific information from the group's multiple alignment to align a new sequence to it. The degree of sequence conservation at each position should be taken into account, and mismatches at highly conserved positions penalized more stringently than mismatches at variable positions. Gap penalties might be reduced at positions where lots of gaps occur in the cluster alignment, and increased where no gaps occur.
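
The score-to-distance conversion is simple enough to sketch directly. The function names below are illustrative, and how $S_{rand}$ is obtained (for example by shuffling the sequences, or from an approximation) is left to the caller; raising an error when the observed score does not exceed the random score is a choice made for this example only.

```python
import math


def max_score(self_score_a, self_score_b):
    """S_max: average of the scores of aligning each sequence to itself."""
    return (self_score_a + self_score_b) / 2


def feng_doolittle_distance(s_obs, s_max, s_rand):
    """D = -log S_eff, with S_eff = (S_obs - S_rand) / (S_max - S_rand)."""
    s_eff = (s_obs - s_rand) / (s_max - s_rand)
    if s_eff <= 0:
        # Assumption for this sketch: a pair scoring no better than random
        # has no finite distance, so we refuse to return one.
        raise ValueError("S_obs must exceed S_rand for a finite distance")
    return -math.log(s_eff)


# Identical sequences give distance 0; halving S_eff adds log(2).
d_identical = feng_doolittle_distance(100, max_score(90, 110), 0)
d_half = feng_doolittle_distance(55, 100, 10)
```

Because the guide tree only needs to be approximate, small errors in `s_rand` usually matter little.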

Profile Alignment
A profile for a given group of sequences contains all features which are somehow typical for this group; conserved residues are one example of such a feature. The idea behind profile alignment is to use this position-specific information and penalize mismatches more strongly in highly conserved positions than in variable positions. Many progressive methods use pairwise alignment of sequences to profiles, or of profiles to profiles, as a subroutine which is used many times in the process.

The exact definition of the scoring function used in profile-sequence or profile-profile alignment varies. Aligned residues are usually scored by some form of SP score, but the handling of gaps varies substantially between different methods. For linear gap scoring, profile alignment is simple because the gap scores can be included in the SP score by setting s(-,a) = s(a,-) = -g and s(-,-) = 0.

If we have two multiple alignments or profiles, an alignment of the two means that gaps are inserted in whole columns, so that the alignment within each profile is not changed. We can then split the SP sum into two sums that concern only the sequences within each of the two profiles, and one sum containing all cross terms. The first two sums are unaffected by the global alignment, because adding columns of gap characters to a profile adds 0 to the score, since s(-,-) = 0. Therefore, the optimal alignment of the two profiles can be obtained by optimizing only the last sum with the cross terms. This can be done exactly like a pairwise alignment in which columns are scored against columns by adding the pair scores. One of the profiles can of course consist of a single sequence, which corresponds to aligning a single sequence to a profile.
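
The split of the SP sum into within-profile terms and cross terms can be checked numerically. The sketch below uses a hypothetical scoring function (+1 match, -1 mismatch, linear gap cost g = 2) in place of a real substitution matrix, and all names are illustrative.

```python
from itertools import combinations

GAP = "-"


def pair_score(a, b, gap_cost=2):
    """Hypothetical s(a, b): s(-,-) = 0, s(a,-) = s(-,a) = -g."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return -gap_cost
    return 1 if a == b else -1


def sp_score(column):
    """SP score of one column: all unordered pairs of symbols."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def cross_score(col_a, col_b):
    """Cross terms only: each symbol of col_a against each symbol of col_b."""
    return sum(pair_score(a, b) for a in col_a for b in col_b)


# One column from profile A (two sequences) aligned with
# one column from profile B (three sequences).
col_a, col_b = "AA", "A-C"
merged = col_a + col_b

# The merged SP score is the two within-profile sums plus the cross terms.
assert sp_score(merged) == sp_score(col_a) + sp_score(col_b) + cross_score(col_a, col_b)

# A whole column of gaps scores 0 within a profile, so inserting it
# changes only the cross terms.
assert sp_score("--") == 0
```

In a profile-profile dynamic programming recursion, `cross_score` would play the role that s(a,b) plays in ordinary pairwise alignment.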

Clustal W
One widely used implementation of profile-based progressive alignment is the CLUSTALW program. CLUSTALW works in much the same way as the Feng-Doolittle method, except for its carefully tuned use of profile alignment methods.

Algorithm
1. Construct a distance matrix of all N(N-1)/2 pairs by pairwise dynamic programming alignment, followed by approximate conversion of similarity scores to evolutionary distances using the model of Kimura.
2. Construct a guide tree by the neighbor-joining clustering algorithm of Saitou & Nei.

3. Progressively align at nodes in order of decreasing similarity, using sequence-sequence, sequence-profile, and profile-profile alignment.

ClustalW is unabashedly ad hoc (designed for this task, not generalizable) in its alignment construction and scoring stage. In addition to the usual methods of profile construction and alignment, various heuristics contribute to its accuracy:
- Sequences are weighted to compensate for biased representation in large subfamilies.
- The substitution matrix used to score an alignment is chosen on the basis of the similarity expected of the alignment: closely related sequences are aligned with hard matrices (BLOSUM80) and distant sequences with soft matrices (BLOSUM50).
- Position-specific gap-open penalties are multiplied by a modifier that is a function of the residues observed at the position. The penalties are obtained from gap frequencies observed in a large number of structurally based alignments.
- Gap-open penalties are also decreased if the position is spanned by a consecutive stretch of five or more hydrophilic residues.
- Both gap-open and gap-extend penalties are increased if there are no gaps in a column but gaps occur nearby in the alignment. This rule tries to force all the gaps to occur in the same places in an alignment.
- In the progressive alignment stage, if the score of an alignment is low, the guide tree may be adjusted on the fly to defer the low-scoring alignment until later in the progressive alignment phase, when more profile information has been accumulated.

Iterative Refinement
One problem with progressive alignment algorithms is that the subalignments are 'frozen'. Once a group of sequences has been aligned, their alignment to each other cannot be changed at a later stage as more data arrives. Iterative refinement algorithms attempt to circumvent this problem.

In iterative refinement, an initial alignment is generated, then one sequence or a set of sequences is taken out and realigned to a profile of the remaining aligned sequences. If a meaningful score is being optimized, this either increases the overall score or results in the same score. Another sequence is chosen and realigned, and so on, until we arrive at a point where the alignment does not change. This procedure is guaranteed to converge to a local maximum of the score provided that all the sequences are tried and a maximum score exists, simply because the sequence space is finite. Barton-Sternberg is a good example.

Barton-Sternberg Algorithm
1. Find the two sequences with the highest pairwise similarity and align them using standard pairwise dynamic programming.
2. Find the sequence that is most similar to a profile of the alignment of the first two, and align it to the first two by profile-sequence alignment. Repeat until all sequences are included.
3. Remove one sequence and realign it to a profile of the other sequences. Repeat for all sequences.
4. Repeat step 3 until the score converges, or a fixed number of times if it does not.

The ideas of profile alignment and iterative refinement come quite close to the formulation of probabilistic HMM approaches for multiple alignment.

Note: This post is a summary of chapter 6.4 of Durbin.
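
The control flow of iterative refinement can be sketched as a generic hill-climbing loop. This is not the Barton-Sternberg program itself: the `realign` and `score` callbacks are hypothetical placeholders that a real implementation would supply (profile-sequence alignment and, for instance, an SP score), and accepting a realignment only when the score strictly improves is a simplification chosen for this sketch.

```python
def iterative_refinement(alignment, realign, score, max_rounds=10):
    """Repeatedly realign one sequence at a time until nothing improves.

    alignment  -- any sequence-like container with one entry per sequence
    realign    -- hypothetical callback: realign(alignment, index) returns a
                  new alignment with sequence `index` removed and realigned
                  to a profile of the others
    score      -- hypothetical callback returning the score to maximize
    max_rounds -- fixed cap on the number of passes if the score never converges
    """
    best_score = score(alignment)
    for _ in range(max_rounds):
        improved = False
        for index in range(len(alignment)):
            candidate = realign(alignment, index)
            candidate_score = score(candidate)
            if candidate_score > best_score:
                alignment, best_score = candidate, candidate_score
                improved = True
        if not improved:
            break  # converged: a full pass changed nothing
    return alignment, best_score
```

The `max_rounds` cap mirrors step 4 of the algorithm above: stop after a fixed number of passes if the score does not converge.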

Sources
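[1] R. Durbin, S. Eddy, A. Krogh, G. Mitchison. Biological Sequence Analysis. Cambridge University Press, 1998.
[2] http://www.cs.helsinki.fi/u/ajrantan/talks/slides_andre.pdf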

Profile HMM Training


Profile HMMs could be used in place of standard profiles in progressive or iterative alignment methods. Using the profile HMM formalism may have certain advantages, such as replacing the SP scoring scheme with the profile HMM assumption that sequences are generated independently from a single 'root' probability distribution.

Profile HMMs can also be trained from initially unaligned sequences using the Baum-Welch expectation maximization algorithm.

Multiple alignment with a known profile HMM


Before tackling the problem of estimating a model and a multiple alignment simultaneously from initially unaligned training sequences, we consider the simpler problem of obtaining a multiple alignment from a known model. To align a sequence to a profile HMM, we find the most probable path through the model, which is found by the Viterbi algorithm. Constructing a multiple alignment just requires calculating a Viterbi alignment for each individual sequence. Residues aligned to the same profile HMM match state are aligned in columns. Use fig. 456.

Suppose we align 5 sequences. We derive the Viterbi optimal path for each sequence and realign the sequences accordingly. In the resulting alignment, residues assigned to insert states are written in lower case [a-z] and residues assigned to match states in upper case [A-Z]. A profile HMM does not attempt to align the residues assigned to insert states. Insert state residues represent parts of the sequences which are atypical, unconserved, and not meaningfully alignable.
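
Turning per-sequence Viterbi paths into a multiple alignment can be sketched as follows. The path representation is an assumption made for this example: each path is a list of `(state, k, residue)` tuples, where `state` is "M", "I" or "D", `k` is the index of the match, insert or delete state, and `residue` is `None` for delete states. Padding insert regions with "." is one common display convention, not something prescribed by the text above.

```python
def paths_to_alignment(paths, model_length):
    """Build alignment rows from Viterbi paths through a profile HMM.

    Match-state residues are upper case and share a column per state index;
    insert-state residues are lower case and padded with '.', since they
    are not aligned to each other. Delete states show as '-'.
    """
    parsed = []
    for path in paths:
        match = ["-"] * (model_length + 1)    # match[k] for k = 1..M
        inserts = [""] * (model_length + 1)   # inserts[k] for k = 0..M
        for state, k, residue in path:
            if state == "M":
                match[k] = residue.upper()
            elif state == "I":
                inserts[k] += residue.lower()
            elif state == "D":
                match[k] = "-"
            else:
                raise ValueError(f"unknown state {state!r}")
        parsed.append((match, inserts))

    # Width of each insert region = longest insertion seen at that position.
    widths = [max(len(ins[k]) for _, ins in parsed)
              for k in range(model_length + 1)]

    rows = []
    for match, inserts in parsed:
        row = inserts[0].ljust(widths[0], ".")
        for k in range(1, model_length + 1):
            row += match[k] + inserts[k].ljust(widths[k], ".")
        rows.append(row)
    return rows


example_paths = [
    [("M", 1, "A"), ("M", 2, "C"), ("M", 3, "G")],
    [("M", 1, "A"), ("I", 1, "t"), ("I", 1, "t"), ("M", 2, "C"), ("D", 3, None)],
    [("D", 1, None), ("M", 2, "C"), ("I", 2, "a"), ("M", 3, "G")],
]
rows = paths_to_alignment(example_paths, model_length=3)
```

Each sequence is processed independently of the others; only the column widths of the insert regions depend on the whole set.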

Profile HMM trained from unaligned sequences


Now we try to estimate a model and a multiple alignment from initially unaligned sequences.
- Initialization: Choose the length of the profile HMM and initialize the parameters.
- Training: Estimate the model using Baum-Welch or Viterbi training. It is necessary to use a heuristic method for avoiding local optima.
- Multiple alignment: Align all sequences to the final model using the Viterbi algorithm and build a multiple alignment.

Initialization
A profile HMM is a repeating linear structure of three states (match, delete, and insert). The only decision that must be made in choosing an initial architecture for Baum-Welch estimation is the length of the model, M. M is the number of match states in the profile HMM rather than the total number of states. It is usually set to the average length of the training sequences, or chosen based on prior knowledge.

Since Baum-Welch finds local optima, it is important to choose initial models carefully. The model should be encouraged to use 'sensible' transitions; for instance, transitions into match states should be large compared to other transition probabilities. At the same time, we want to start Baum-Welch from multiple different points to see if all converge to approximately the same optimum, so we want some randomness in the choice of initial model parameters.

Training Avoiding Local Maxima


Note: This post is a summary of chapter 6.5 of Durbin.

Gene Prediction
Gene prediction refers to algorithmically identifying stretches of DNA sequence that are biologically functional. In the old days, gene prediction was a very painstaking and difficult process. Today, thanks to comprehensive genome sequencing and powerful computational resources, gene prediction is largely a computational problem.

Gene prediction is used to find a functional sequence, in other words a region of the DNA which codes for a protein or an RNA product. Regulatory regions, i.e. regions of DNA that regulate gene expression, are also considered functional. Gene prediction does not tell us which genes code for which proteins.

There are two primary approaches for predicting genes:
- Intrinsic approach: ab initio
- Extrinsic approach: homology-based

Prerequisite Knowledge
A gene is the fundamental physical and functional unit of heredity. It is an ordered sequence of nucleotides located in a particular position on a particular chromosome that encodes a specific functional product (RNA or protein).

An Open Reading Frame (ORF) is a series of DNA codons which does not contain any stop codons.

A Coding Sequence (CDS) is a region of DNA or RNA whose sequence determines the sequence of amino acids in a protein.

Frames are always read from 5' to 3'.

Prokaryotic gene model


Prokaryotes have small genomes with high gene density. They contain operons, which means that one transcript can carry several genes. Since there are no introns, one gene produces one protein, and there is one ORF per gene. ORFs begin with a start codon and end with a stop codon. There are conserved promoter regions around the start sites of transcription and translation. Genes often overlap in prokaryotes. The principal difficulties with prokaryote gene prediction are overlapping ORFs, short genes, and finding promoters. In spite of these difficulties, gene prediction in prokaryotes can reach about 99% accuracy.

Eukaryotic gene structure

Ab Initio Gene Prediction


Ab initio gene prediction is an intrinsic method based on gene content and signal detection. The genomic DNA sequence is systematically searched for signs of protein-coding genes; a signal indicates the presence of a coding region in the vicinity. Ab initio methods make a prediction based on the sequence information only. They identify only the coding exons of protein-coding genes; the transcription start site and the 5' and 3' UTRs are ignored. These methods can detect new genes with no similarity to known sequences or domains.

Ab initio methods are based on rules, using coding statistics and signal detection. Statistical properties of coding regions are also taken into consideration. Training sets of known gene structures are used to generate statistical tests for the likelihood of a prediction being real. Since these statistical properties are specific to each species, the knowledge is usually not transferable between species.

Gene Content
Certain information in the gene content, such as GC content, codon bias, and hexamer frequency, is used by ab initio methods to discriminate coding regions from non-coding regions. Codon bias refers to unusually high usage of certain codons over their alternatives. For example, leucine (L) can be coded by six different codons, but human genes prefer CTG over the others.

Coding statistics
A coding statistic is a function that, for a given DNA sequence, computes the likelihood that the sequence codes for a protein. We know that intergenic regions, introns and exons have different nucleotide content, and this information helps the function discriminate between the regions. For example, the probability of finding a stop codon in a random sequence is different from the probability of finding it in a coding sequence. Intergenic regions are DNA sequences located between genes; they make up a large percentage of the human genome and mostly have no known function.

Unequal usage of codons in coding regions (codon bias) is a universal feature of genomes. Uneven usage of amino acids, uneven usage of synonymous codons (codon usage, which correlates with the abundance of the corresponding tRNAs), and hexamer usage also help discriminate coding regions from non-coding regions.
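
Two of the content measures mentioned above, GC content and codon usage, are easy to compute. The sketch below is illustrative only: the function names are invented for this example, the input is assumed to be an in-frame coding sequence written in the DNA alphabet, and real gene finders combine such statistics in trained models rather than using them in isolation.

```python
from collections import Counter

# The six codons that encode leucine in the standard genetic code.
LEUCINE_CODONS = ("TTA", "TTG", "CTT", "CTC", "CTA", "CTG")


def gc_content(seq):
    """Fraction of G and C nucleotides in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def codon_usage(cds):
    """Count codons in an in-frame coding sequence (trailing bases ignored)."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def synonymous_fraction(usage, codon, synonyms):
    """Share of one codon among all synonymous codons for the same amino acid."""
    total = sum(usage[c] for c in synonyms)
    return usage[codon] / total if total else 0.0


usage = codon_usage("ATGCTGCTGTTATAA")          # ATG CTG CTG TTA TAA
ctg_share = synonymous_fraction(usage, "CTG", LEUCINE_CODONS)
```

A strong skew in `synonymous_fraction` across many genes is the kind of codon bias that helps separate coding from non-coding sequence.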

Gene identification in prokaryotes


Gene prediction is easier and more accurate in prokaryotes than in eukaryotes, since prokaryote gene structure is much simpler. In prokaryotes, ab initio methods look for:
- the presence of an ORF (start + stop) of a statistically significant size to code for a protein
- codon usage bias
- RBS (ribosome binding site) and terminator identification

Locating ORFs is much simpler in prokaryotes. DNA sequences encoding proteins are generally transcribed into mRNA, which is translated into protein with very little modification. Locating an ORF from a start codon to a stop codon may suggest a protein-coding region. Longer ORFs are more likely to predict protein-coding regions than shorter ORFs.

Ab initio gene prediction has certain advantages here, largely due to the simplicity of prokaryote genomes. The genomes are small, with high gene density and simple structure (no introns). The principal difficulties are:
- detection of the initiation site (AUG)
- alternative start codons
- gene overlap
- undetected small proteins

In spite of these difficulties, prokaryote gene prediction can reach 99% accuracy.
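
A naive ORF scan along these lines can be sketched as follows. This is a simplified illustration, not a gene finder: it assumes ATG is the only start codon, scans only the three forward-strand reading frames (the reverse complement is not handled), and uses an arbitrary minimum length. The name `find_orfs` and its parameters are invented for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: alternative start codons are ignored


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    `start` is the index of the A in ATG, `end` is one past the stop codon,
    and ORFs shorter than `min_codons` codons (stop included) are dropped.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i                      # open an ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF
    return sorted(orfs)


sequence = "CCATGAAATTTTAGGG"
orfs = find_orfs(sequence)
first_orf = sequence[orfs[0][0]:orfs[0][1]]
```

Raising `min_codons` is the simplest way to reflect the observation that longer ORFs are more likely to be real genes, at the price of missing small proteins.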

Gene prediction in Eukaryotes


Gene identification in eukaryotes is much more complicated, more difficult, and a lot less accurate. In eukaryotes, we look for the following patterns:
- upstream promoter sequences
- the Kozak sequence
- exon-intron boundaries

This information is used for poly-A signal prediction and start/stop prediction.

In eukaryotes, the signals are not as clearly defined as in prokaryotes, so simple pattern matching techniques cannot be used. The problems with eukaryote gene prediction are numerous, and the prediction accuracy is about 50% at best. Modern gene prediction tools use advanced techniques such as hidden Markov models. GENSCAN is a notable program in this domain.

Locating ORFs is less effective for eukaryotic genomes. There are large non-coding regions between genes, and introns within genes. mRNA undergoes processing before translation (splicing and alternative splicing). A protein-encoding gene may contain stop codons within intronic regions. Post-translational modifications (PTMs) make gene prediction even more difficult. There are several tools which attempt to locate ORFs or help to do so, such as SpliceView and ORF Finder.

Gene Prediction Methods
Various pattern recognition methods are used to identify signals:
- weight matrices
- decision trees
- HMMs
- artificial neural networks
- linear discriminant analysis

An algorithm can be:
- rule-based
- neural network based
- HMM-based

GENSCAN is a general-purpose gene identification program which analyzes genomic DNA sequences from a variety of organisms including human, other vertebrates, invertebrates and plants. Genscan:
- identifies the complete exon/intron structure of genes in genomic DNA
- predicts multiple genes, both partial and complete
- uses an HMM to model gene structure

Genscan takes the following things into account to make a prediction:
- transcription signals
- translation signals
- splicing signals (donors, acceptors, and branch points)
- exon length distributions
- compositional features such as G+C content and hexamer frequency

Weaknesses of ab initio prediction
Ab initio prediction is not reliable enough, especially in eukaryotes. It is not specific enough (too many false positives), although exon sensitivity can be good. It is generally used to point sequence similarity searches in the right direction.

Similarity-based Methods
placeholder page until the content is ready for publishing

Comparative Genomics
placeholder page until the content is ready for publishing.

Phylogenetics
Definitions
Phylogeny refers to the evolutionary relationships among organisms. It is the study of patterns of lineage branching produced by the true evolutionary history of the organisms being considered.

Phylogenetics is the field of biology that deals with the relationships between organisms. It includes the discovery of these relationships, and the study of the causes behind their pattern. In molecular biology terms, phylogenetics is useful for:
- inferring function by similarity
- choosing a template for homology modeling
- discovering and analyzing gene families
- comparing whole genomes

Taxon: a unit of classification. Often it refers to the members of the groups of organisms being analyzed. This may be a single species or a group of species. It is the label at the leaf of the tree.

Homology: similarity due to a common ancestor. It is in fact the hypothesis we make when we align sequences. Homology is not similarity. Similarity is a measurable quantity; homology is a hypothesis that can be either true or false.

Homoplasy: the occurrence of similar states of a character not due to common lineage. This may be due to environmental constraints or simply a random occurrence.
- Convergence: bats and birds both have wings, but the wings evolved independently rather than being inherited from a common ancestor.
- Reversion: whales resemble fish, but whales' ancestors lived on land.

Orthologs and Paralogs


Two genes are orthologous if they diverged after a speciation event. Two genes are paralogous if they diverged after a gene duplication event. α-Haemoglobin and β-Haemoglobin are paralogs whether we compare within or across species. Human α-Haemoglobin and pig α-Haemoglobin are orthologs.

There is only one speciation event. It appears twice in the tree because each paralog diverged after it occurred. Comparing human α-Haemoglobin and pig β-Haemoglobin for the purpose of inferring function would give aberrant results.

Introduction
Alignment of sequences should take account of their evolutionary relationship. For example, an alignment that implies many substitutions between closely related sequences is less plausible than one that makes most of its changes over large evolutionary distances.

Similarity of the molecular mechanisms of different organisms strongly suggests that they originated from a common ancestor. The relationship between species is called phylogeny, and it can be represented in a phylogenetic tree. Phylogenetics is the science of inferring a phylogenetic tree from experiments and observations.

Genes diverge through either gene duplication or speciation events. In a gene duplication event, a gene is duplicated and over time the two copies diverge. In a speciation event, one species splits into two, and the copies of a gene in the two new species then diverge independently. Because of gene duplication, the phylogenetic tree of a group of sequences does not necessarily reflect the phylogenetic tree of the host species. If we are interested in inferring the phylogenetic tree of the species carrying the genes, we must use orthologous genes (created by speciation events).

Phylogenetic Trees
Phylogenetic trees are usually binary trees: each edge branches into two daughter edges. Each edge of the tree has a certain amount of evolutionary divergence associated with it. This divergence is quantified by some measure, such as a distance between sequences or a substitution model of residues over the course of evolution. Different proteins evolve at different rates. Even the same sequences in different organisms change at different rates. However, averaging over larger sets of proteins, we see a correspondence between edge lengths and evolutionary time periods.

By definition, a phylogeny has a root, which is the ancestor of all sequences. However, it is not always possible to reliably infer a root. Several algorithms provide information about the location of the root, while others, like parsimony and the probabilistic models, are completely uninformative. For such algorithms, other criteria need to be used for rooting the tree. A rooted tree indicates the direction of evolutionary time. The direction of time is undetermined in an unrooted tree.

Counting and labelling trees
A rooted binary tree with n leaves contains n-1 non-leaf nodes, 2n-1 nodes in total, and 2n-2 edges. An unrooted binary tree with n leaves has 2n-2 nodes and 2n-3 edges.
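The node and edge counts above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are made up for this example, and the two topology counts, (2n-3)!! rooted and (2n-5)!! unrooted labelled binary trees, are a standard result that the text above does not state explicitly.

```python
def rooted_tree_counts(n):
    """Node and edge counts of a rooted binary tree with n leaves."""
    return {"internal_nodes": n - 1, "nodes": 2 * n - 1, "edges": 2 * n - 2}


def unrooted_tree_counts(n):
    """Node and edge counts of an unrooted binary tree with n leaves."""
    return {"nodes": 2 * n - 2, "edges": 2 * n - 3}


def double_factorial(m):
    """m!! = m * (m - 2) * (m - 4) * ... down to 1 (returns 1 for m <= 1)."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def num_rooted_topologies(n):
    """Number of distinct labelled rooted binary trees on n leaves (assumed formula)."""
    return double_factorial(2 * n - 3)


def num_unrooted_topologies(n):
    """Number of distinct labelled unrooted binary trees on n leaves (assumed formula)."""
    return double_factorial(2 * n - 5)


if __name__ == "__main__":
    for n in (4, 10, 20):
        print(n, num_unrooted_topologies(n), num_rooted_topologies(n))
```

The rapid growth of these numbers is the reason exhaustive searches over topologies are only feasible for small n.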

Phylogenetic Algorithms
There are three classes of phylogenetic algorithms:
- Numerical taxonomic (phenetic) methods, which are distance based
- Cladistic methods
- Probabilistic methods

Cladistic Methods
Cladistic methods make inferences about characters at internal nodes. All cladistic methods attempt to find the following:

The vast majority of cladistic methods are optimization algorithms. These algorithms search for an optimum in a search space. The search space is the set of possible trees. This includes all topologies and all ancestral states for each topology. A search method can be brute force, branch and bound, or heuristic.

Brute force cladistic search methods
The search space can be represented in the form of a decision tree. A selection is made at each node. In brute force, a complete search of all phylogenetic trees is made by walking the decision tree and calculating the score at each leaf of the decision tree.

Branch and bound cladistic search methods
Branch and bound algorithms also use search trees. The score of the partially constructed tree is calculated at each internal node. If the score is worse than the best score obtained so far, we do not continue with that branch.

Heuristic cladistic search methods
Both brute force and branch and bound always find the best solution, but they are too slow for all but small datasets. Heuristic methods are much faster but do not guarantee the optimal solution: they may stop at a local optimum instead of the global optimum.

Advantages of cladistic methods
- Take variable rates of evolution and homoplasy into account
- Give a tree with putative ancestral states

Disadvantages of cladistic methods
- Slow
- Often only a local optimum is found
- Care must be taken when interpreting evolutionary distances
- Many equally optimal solutions may be generated

Probabilistic Methods
Probabilistic methods start with a model of evolution. This model is described in the form of mutation probabilities. The most probable tree given the data and the model can then be calculated. The probabilities of multiple mutations in a branch are also taken into account. The most commonly used probabilistic algorithms are maximum likelihood and Bayesian methods.

Advantages
- Based on a model of evolution
- Take variable rates of evolution, homoplasy and even multiple mutations in a branch into account
- Statistical confidence for the result is inherent in the method

Disadvantages
- Slow
- Often only a local optimum is found

Molecular Clock
At the molecular level, mutations occur with a certain probability. However, a date cannot be read directly from molecular data. In some organisms the mutation rate is higher than in others because of geographical and temporal variations. Mutations do not accumulate at a constant rate. Purely molecular dating methods can therefore give aberrant results.

Distance based algorithms



Distance-based phylogenetic algorithms are used to infer phylogenetic trees. The basic idea behind distance methods is to:
1. Compute the evolutionary distance for each sequence pair. These distances are usually stored in a distance matrix.
2. Infer a tree topology on the basis of the relationships between the distance values.

Distance-based methods rely on calculating pairwise distances. A distance score can be calculated between two taxa. This can be as simple as the number of morphological characters that differ between two organisms or as complex as a corrected PAM or BLOSUM alignment score. Although the distance is due to evolutionary divergence, it is not directly a measure of branch lengths in the tree.

There are several ways of defining the distance between two taxa. Suppose we align two sequences. The fraction f of aligned sites at which the residues differ could be used to define the distance. This gives a sensible definition for small fractions. For unrelated sequences, f approaches the fraction of differences expected for random substitutions, so we would like the distance to become large as f nears that value. Naturally, we expect f to decrease as the similarity increases. Markov models of residue substitution, such as the Jukes-Cantor model, provide such a distance. For DNA, the Jukes-Cantor distance is d = -(3/4) ln(1 - 4f/3), which tends to infinity as f approaches 75%.

The distance matrix is a square matrix of all pairwise distances.

Clustering (UPGMA, WPGMA): join the closest neighbors first.
Neighbor joining: relies on additivity.
Advantages: rapid calculation; can deal with variable rates of evolution.
Disadvantages: does not deal well with non-additive distances (homoplasy); can give negative branch lengths.
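As a small illustration of the first step, the sketch below computes the fraction f of differing sites for two aligned DNA sequences and applies the Jukes-Cantor correction d = -(3/4) ln(1 - 4f/3). The function names and the choice to skip gapped columns are assumptions made for this example only.

```python
import math


def p_distance(seq_a, seq_b):
    """Fraction f of aligned, ungapped sites at which the two residues differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption for this sketch: columns containing a gap are ignored.
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(f):
    """Jukes-Cantor distance for DNA; infinite once f reaches 75%."""
    if f >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * f / 3.0)


if __name__ == "__main__":
    f = p_distance("ACGTACGT", "ACGTACGA")
    print(f, jukes_cantor(f))
```

For small f the corrected distance is close to f itself, and it grows without bound as f approaches the 75% expected for unrelated DNA sequences.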

Source
[1] Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids by Durbin.

UPGMA
UPGMA, which stands for unweighted pair group method using arithmetic averages, is a clustering method used to build phylogenetic trees. It works by averaging distances, and the averaging is weighted by the number of taxa in each cluster. If A and B are two leaves that have been joined into the cluster (AB), the distance between (AB) and a third leaf C is computed by:

$d_{(AB)C} = \frac{d_{AC} + d_{BC}}{2}$

which is simply the average of the distances AC and BC. More generally, suppose we have two clusters Ci and Cj, where |Ci| and |Cj| denote the number of sequences in each cluster, and a cluster Ck formed by the union of Ci and Cj. If we wish to compute the distance between Ck and some other cluster Cl:

$d_{kl} = \frac{d_{il}|C_{i}| + d_{jl}|C_{j}|}{|C_{i}| + |C_{j}|}$

The following is a more intuitive description of UPGMA. We assign each sequence to its own cluster, so initially each sequence is a cluster in itself. We find the two clusters i and j for which dij is minimal and define a new cluster formed by the union of both. If we find several equidistant pairs, we choose one at random. We repeat this process until all clusters are connected.


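Now let's look at this example using a distance matrix. We have the following matrix:

     A  B  C  D
B    2
C    4  4
D    6  6  6
E    6  6  6  2

The closest pairs, A and B (distance 2) and D and E (distance 2), are joined first. The remaining distances are averages over the members of the clusters:

d(AB)C = (dAC + dBC) / 2 = (4 + 4) / 2 = 4

d(ABC)(DE) = (dAD + dAE + dBD + dBE + dCD + dCE) / 6 = 36 / 6 = 6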
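The clustering procedure can be sketched in Python as follows, using the example matrix. This is a minimal illustration rather than a reference implementation: clusters are represented as nested tuples, ties are broken by input order instead of at random, and placing each new node at height dij / 2 is an assumption taken from the usual textbook formulation.

```python
from itertools import combinations


def upgma(labels, d):
    """Cluster `labels` using distances d[frozenset({x, y})].

    Returns (tree, height): the tree as nested tuples and a dict with
    the height of every node (leaves are at height 0).
    """
    dist = dict(d)
    size = {lab: 1 for lab in labels}
    height = {lab: 0.0 for lab in labels}
    active = list(labels)
    while len(active) > 1:
        # Pick the closest pair of clusters (first one found on ties).
        i, j = min(combinations(active, 2), key=lambda p: dist[frozenset(p)])
        dij = dist[frozenset((i, j))]
        k = (i, j)  # the new cluster is the union of i and j
        size[k] = size[i] + size[j]
        height[k] = dij / 2  # assumption: node placed at half the distance
        for l in active:
            if l not in (i, j):
                # Average weighted by the number of sequences in each cluster.
                dist[frozenset((k, l))] = (
                    dist[frozenset((i, l))] * size[i]
                    + dist[frozenset((j, l))] * size[j]
                ) / size[k]
        active = [c for c in active if c not in (i, j)] + [k]
    return active[0], height


# Example matrix from the text, entered as its lower triangle.
LABELS = ["A", "B", "C", "D", "E"]
ROWS = {"B": [2], "C": [4, 4], "D": [6, 6, 6], "E": [6, 6, 6, 2]}
DIST = {
    frozenset((row, LABELS[col])): value
    for row, values in ROWS.items()
    for col, value in enumerate(values)
}

if __name__ == "__main__":
    tree, height = upgma(LABELS, DIST)
    print(tree, height[tree])
```

On this matrix the sketch joins A with B and D with E first, then C with (AB), and finally (ABC) with (DE).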


When the data is ultrametric, UPGMA and WPGMA give the same result. When the data is not ultrametric, their inferences differ.

Molecular clocks and the ultrametric property of distances


UPGMA produces a rooted tree of a special kind. The edge lengths in the resulting tree can be viewed as times measured by a molecular clock with a constant rate. The divergence of sequences is assumed to occur at the same constant rate at all points in the tree, i.e. the sum of times down a path to the leaves from any node is the same. If our distance data is derived by adding edge lengths in a tree T with a molecular clock, then UPGMA will reconstruct T correctly. If the original tree has routes of different lengths to its leaves, then it may be reconstructed incorrectly by UPGMA. The problem is that the closest leaves may not be neighboring leaves, i.e. they may not share an immediate common ancestor. A test of whether reconstruction is likely to be correct is the ultrametric condition. The distances dij are said to be ultrametric if, for any triplet of sequences xi, xj, xk, the distances dij, djk, dik are either all equal, or two are equal and the remaining one is smaller. This condition holds for distances derived from a tree with a molecular clock.
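The ultrametric condition is easy to test directly on a distance matrix. The following sketch checks every triplet; the function name, the dictionary layout and the numerical tolerance are assumptions for this example.

```python
from itertools import combinations


def is_ultrametric(labels, d, tol=1e-9):
    """True if, for every triplet, the two largest distances are equal."""
    for x, y, z in combinations(labels, 3):
        smallest, middle, largest = sorted(
            (d[frozenset((x, y))], d[frozenset((y, z))], d[frozenset((x, z))])
        )
        # Either all three are equal, or two are equal and the third is smaller.
        if abs(largest - middle) > tol:
            return False
    return True


# Distances from a clock-like tree (same matrix as the UPGMA example).
CLOCK = {
    frozenset(pair): value
    for pair, value in {
        ("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 4,
        ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6,
        ("A", "E"): 6, ("B", "E"): 6, ("C", "E"): 6, ("D", "E"): 2,
    }.items()
}

# A hypothetical matrix that violates the condition for the triplet A, B, C.
NO_CLOCK = dict(CLOCK)
NO_CLOCK[frozenset(("B", "C"))] = 5

if __name__ == "__main__":
    print(is_ultrametric("ABCDE", CLOCK), is_ultrametric("ABCDE", NO_CLOCK))
```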

Additivity and Neighbor Joining


The following is an example where simple clustering methods would produce erroneous results.

A and B are clustered together first since they are closest. Additivity can be used to solve this problem.

Additivity: Given a tree, its edge lengths are said to be additive if the distance between any pair of leaves is the sum of the lengths of the edges on the path connecting them. [1] An additive distance is fully characterized by the four point condition, which states that any 4 points can be renamed such that:

d(a,c) + d(b,d) < d(a,b) + d(c,d) = d(a,d) + d(b,c)

The following is a graphical representation of this equation.

[Figure: the four point condition applied to the example graph; one pairwise sum is smaller (<) than the other two]

This equation states that, for any four leaves, the three pairwise sums can be arranged so that two are equal while the third is smaller than both. Suppose we have the following distance matrix. It must constitute an additive metric for it to produce a valid tree.
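A matrix can be tested for additivity by checking the four point condition on every set of four leaves. The sketch below does this for a small hypothetical matrix derived from a four-leaf tree with leaf edges a = 1, b = 2, c = 4, d = 5 and an internal edge of length 3; the tree, the names and the tolerance are assumptions made for illustration.

```python
from itertools import combinations


def satisfies_four_point(labels, d, tol=1e-9):
    """True if, for every quartet, the two largest pairwise sums are equal."""
    def dist(x, y):
        return d[frozenset((x, y))]

    for a, b, c, e in combinations(labels, 4):
        sums = sorted(
            (
                dist(a, b) + dist(c, e),
                dist(a, c) + dist(b, e),
                dist(a, e) + dist(b, c),
            )
        )
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Path lengths in the hypothetical tree ((a,b),(c,d)) described above.
ADDITIVE = {
    frozenset(pair): value
    for pair, value in {
        ("a", "b"): 3, ("a", "c"): 8, ("a", "d"): 9,
        ("b", "c"): 9, ("b", "d"): 10, ("c", "d"): 9,
    }.items()
}

# Perturbing one entry breaks additivity.
NOT_ADDITIVE = dict(ADDITIVE)
NOT_ADDITIVE[frozenset(("b", "d"))] = 12

if __name__ == "__main__":
    print(satisfies_four_point("abcd", ADDITIVE))
    print(satisfies_four_point("abcd", NOT_ADDITIVE))
```

Note that the additive example is not ultrametric (the distances for a, b, c are 3, 8 and 9), which illustrates that additivity can hold when the molecular clock does not.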

We choose a pair of leaf nodes for which Dij is minimal, where

Dab = dab - (ra + rb)

and ra is the average distance between a and all other leaves and rb is the average distance between b and all other leaves. Using our matrix, we obtain the following sums of distances to all other leaves:

A = 2 + 7 + 4 + 7 = 20
B = 2 + 7 + 4 + 7 = 20
C = 27
D = 22
E = 27

Since A and B have the smallest values, we would start by joining A and B to a new node.

And we continue using the additivity equations.
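One joining step can be sketched as follows. The text does not spell out the averaging or the update formulas, so this example assumes the common textbook form: ri is the sum of distances from i divided by (number of leaves - 2), the new node k lies at distance (dij + ri - rj) / 2 from i, and dkm = (dim + djm - dij) / 2 for every other leaf m. The test data is a hypothetical four-leaf tree in which the closest pair (A, B) are not neighbors.

```python
from itertools import combinations


def nj_step(labels, d):
    """Perform one neighbor-joining step.

    Returns the joined pair, the two new branch lengths, and the reduced
    label list and distance dictionary (keys are frozensets of two labels).
    """
    n = len(labels)
    r = {
        i: sum(d[frozenset((i, k))] for k in labels if k != i) / (n - 2)
        for i in labels
    }
    # Choose the pair minimising D_ij = d_ij - (r_i + r_j).
    i, j = min(
        combinations(labels, 2),
        key=lambda p: d[frozenset(p)] - r[p[0]] - r[p[1]],
    )
    dij = d[frozenset((i, j))]
    len_i = 0.5 * (dij + r[i] - r[j])
    len_j = dij - len_i
    k = (i, j)  # new internal node replacing i and j
    new_d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
    for m in labels:
        if m != i and m != j:
            new_d[frozenset((k, m))] = 0.5 * (
                d[frozenset((i, m))] + d[frozenset((j, m))] - dij
            )
    new_labels = [m for m in labels if m not in (i, j)] + [k]
    return (i, j), (len_i, len_j), new_labels, new_d


# Hypothetical tree ((A,C),(B,D)): leaf edges A=1, B=1, C=4, D=4, internal edge 1.
EXAMPLE = {
    frozenset(pair): value
    for pair, value in {
        ("A", "B"): 3, ("A", "C"): 5, ("A", "D"): 6,
        ("B", "C"): 6, ("B", "D"): 5, ("C", "D"): 9,
    }.items()
}

if __name__ == "__main__":
    print(nj_step(["A", "B", "C", "D"], EXAMPLE)[:2])
```

A and B are the closest pair here (distance 3), yet the step joins A with C, its true neighbor, and recovers the edge lengths 1 and 4. A closest-pair clustering method would have joined A and B first.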

If a new path branches off an existing branch in a tree, replace one of the original leaves by another leaf along the branching path.


Each run of this algorithm has a complexity of O(n). Therefore, the complexity of reconstructing a tree from an additive distance is O(n^2). In practice, the distances between molecular sequences tend to be non-additive. In such cases, we look for a tree T whose distance matrix is close to the given one.

Analysis
It is possible for the molecular clock property to fail but for additivity to hold. UPGMA fails when the molecular clock property fails. We can use neighbor-joining algorithms when the molecular clock fails but additivity holds.

Neighbor-joining is a bottom-up clustering method used for the creation of phylogenetic trees. The algorithm requires knowledge of the distance between each pair of taxa in the tree. Neighbor-joining is based on the minimum-evolution criterion for phylogenetic trees, i.e. the topology that gives the least total branch length is preferred at each step of the algorithm. [2] Being a greedy algorithm, neighbor-joining may not find the tree topology with the least total branch length. However, it usually finds a tree quite close to the optimal tree. Neighbor-joining is an efficient algorithm with polynomial-time complexity, which makes it suitable for large datasets where other algorithms, such as minimum evolution, are computationally impractical.

Unlike UPGMA, neighbor-joining does not assume a molecular clock, i.e. that all species evolve at the same rate. Thus, neighbor-joining produces unrooted trees. There are several techniques to root a tree. For example, a root can be added by using an outgroup, and the root can then be placed at the point in the tree where the edge from the outgroup connects. Neighbor-joining is statistically consistent under many evolutionary models. Hence, given data of sufficient length, neighbor-joining will reconstruct the true tree with high probability. [2]

Additivity is a property that depends on the distance measure used. A tree may be additive with respect to one distance measure and not with respect to another. Just as the ultrametric condition provides a test for the molecular clock property, the four point condition provides a test for additivity: for every set of 4 points, two of the three pairwise sums are equal and both are larger than the third.

Neighbor-joining strips away pairs of neighboring leaves one at a time, replacing each pair with a single node. The tree is then reassembled from these joins.

Source
[1] Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids by Durbin. [2] Wikipedia [3] http://www.icp.ucl.ac.be/~opperd/private/neighbor.html [4] Class notes of B. Sonderegger

Parsimony
Leaves of a phylogenetic tree represent the objects being compared. Objects can be genes, proteins, species, etc. Internal nodes are hypothetical ancestral objects. In a rooted tree, the path from the root to a node corresponds to a path in evolutionary time. An unrooted tree specifies relationships among objects, but not evolutionary time.


Parsimony is probably the most widely used of all tree building algorithms. It works by finding the tree that can explain the observed sequences with a minimal number of substitutions. Instead of building a tree, it assigns a cost to a given tree, and it is necessary to search through all topologies, or to pursue a more efficient strategy that achieves the same effect, in order to identify the best tree. Parsimony takes aligned sequences (character data) as input and outputs a labeled tree that explains the data with a minimal number of changes across edges.

Important Points
- Parsimony does not build trees, it rearranges them
- Parsimony assigns a cost to a given tree
- Parsimony requires a search through all topologies
- Parsimony treats each site of the alignment independently

The parsimony algorithm has two important components:
1. Compute the cost of a given tree
2. Search through all trees to find the tree with the minimal cost

In plain words, we would need to build the tree from the bottom up. At each step, we compute the cost of connecting a leaf or node to some site. The site is the node we wish to connect our leaf or node to. When the tree is complete, we calculate the cost of the tree by summing the costs of all connections. We repeat these steps for all possible trees. When we have computed the cost for all trees, we look for the tree with the minimum cost. This tree is the best solution.

Weighted Parsimony
Weighted parsimony is an extension of traditional parsimony. It uses a substitution cost S(a,b) for each substitution of a by b. It reduces to traditional parsimony when S(a,a) = 0 and S(a,b) = 1 for a ≠ b.

Suppose we have the following aligned sequences: AAG, AAA, GGA, AGA

                  7. A?A
                 /      \
          5. A?A          6. A?A
          /    \          /    \
     1. AAG  2. AGA  3. AAA  4. GGA

We need to do a post-order traversal to compute the cost of the tree. In simple terms, we start from the leaves and work our way up to the root. The calculation below is for the middle site of the alignment (the ? in the tree), where the leaves carry A, G, A, G.

When k = {1,2,3,4}, which are leaf nodes:
k = 1: S1(A) = 0, S1(C) = S1(G) = S1(T) = ∞
k = 2: S2(G) = 0, S2(C) = S2(A) = S2(T) = ∞
k = 3: S3(A) = 0, S3(C) = S3(G) = S3(T) = ∞
k = 4: S4(G) = 0, S4(C) = S4(A) = S4(T) = ∞
The infinite cost forces the algorithm to start at the actually observed residue.

k = 5:
S5(a) = minb(S1(b) + S(a,b)) + minb(S2(b) + S(a,b))
S5(a) = S1(A) + S(a,A) + S2(G) + S(a,G)
S5(a) = S(a,A) + S(a,G)
This translates to:
S5(A) = S(A,A) + S(A,G)
S5(T) = S(T,A) + S(T,G)
S5(C) = S(C,A) + S(C,G)
S5(G) = S(G,A) + S(G,G)
If this were traditional parsimony, we would get:
S5(A) = 0 + 1 = 1
S5(T) = 1 + 1 = 2
S5(C) = 1 + 1 = 2
S5(G) = 1 + 0 = 1

k = 6: Same as k = 5

k = 7:
S7(a) = minb(S5(b) + S(a,b)) + minb(S6(b) + S(a,b))
S7(a) = 2 min(S5(A) + S(a,A), S5(T) + S(a,T), S5(C) + S(a,C), S5(G) + S(a,G))
If this were traditional parsimony, we would get:
S7(a) = 2 min(1 + S(a,A), 2 + S(a,T), 2 + S(a,C), 1 + S(a,G))
S7(A) = 2 min(1 + 0, 2 + 1, 2 + 1, 1 + 1) = 2
S7(T) = 2 min(1 + 1, 2 + 0, 2 + 1, 1 + 1) = 4
S7(C) = 2 min(1 + 1, 2 + 1, 2 + 0, 1 + 1) = 4
S7(G) = 2 min(1 + 1, 2 + 1, 2 + 1, 1 + 0) = 2

The minimal cost of the tree = mina S7(a) = 2. Therefore, the two solutions are:

                  7. AGA
                 /      \
          5. AGA          6. AGA
          /    \          /    \
     1. AAG  2. AGA  3. AAA  4. GGA

                  7. AAA
                 /      \
          5. AAA          6. AAA
          /    \          /    \
     1. AAG  2. AGA  3. AAA  4. GGA

It may sometimes be of interest to keep track of the minimizing residues (the residues that returned minimal costs) in the recursion. By assigning the following pointers to the left and right daughters at the end of each recursion block, we can track the minimizing residues:

lk(a) = argminb(Si(b) + S(a,b))
rk(a) = argminb(Sj(b) + S(a,b))

To obtain an assignment of ancestral residues, we pick the residue a at the root that gives the minimal cost, follow the pointers, and choose arbitrarily if there is more than one possible target.
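The recursion for Sk(a) can be written compactly in Python. The sketch below handles one site at a time on a tree given as nested tuples and sums the minimal root costs over all sites; the representation, the function names and the default unit cost (which reduces weighted parsimony to traditional parsimony) are assumptions for this example.

```python
import math

ALPHABET = "ACGT"


def unit_cost(a, b):
    """S(a,b) for traditional parsimony: 0 if a == b, otherwise 1."""
    return 0 if a == b else 1


def site_costs(tree, cost=unit_cost):
    """Return {a: S_k(a)} for one site.

    A leaf is a single residue; an internal node is a (left, right) tuple.
    """
    if isinstance(tree, str):
        # Infinite cost forces the observed residue at a leaf.
        return {a: (0 if a == tree else math.inf) for a in ALPHABET}
    left, right = (site_costs(sub, cost) for sub in tree)
    return {
        a: min(left[b] + cost(a, b) for b in ALPHABET)
        + min(right[b] + cost(a, b) for b in ALPHABET)
        for a in ALPHABET
    }


def column(tree, u):
    """Replace every leaf sequence by its residue at site u."""
    if isinstance(tree, str):
        return tree[u]
    return tuple(column(sub, u) for sub in tree)


def tree_cost(tree, n_sites, cost=unit_cost):
    """Sum over sites of the minimal cost at the root."""
    return sum(
        min(site_costs(column(tree, u), cost).values()) for u in range(n_sites)
    )


# The tree from the text: node 5 joins leaves 1 and 2, node 6 joins 3 and 4.
TREE = (("AAG", "AGA"), ("AAA", "GGA"))

if __name__ == "__main__":
    print(site_costs(column(TREE, 1)))  # middle site, as worked through above
    print(tree_cost(TREE, 3))
```

For the middle site this reproduces S7(A) = S7(G) = 2 and S7(T) = S7(C) = 4. Passing a different `cost` function gives weighted parsimony.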


Traditional Parsimony
In traditional parsimony only the number of substitutions is counted. A list Rk of minimal-cost residues is kept at each node k, along with the current cost C.

There is a traceback procedure for finding ancestral assignments in traditional parsimony.
1. Choose a residue from the root set (R2n-1) and proceed down the tree.
2. Having chosen a residue in Rk, pick the same residue in Ri and Rj if possible, otherwise pick one at random.

Some optimal assignments cannot be obtained by traceback with Rk alone. They can be found by also keeping, at each node k, a set Qk of residues whose cost is one higher than that of the residues in Rk.

So far, we have only looked at parsimony for rooted trees. The minimum cost for traditional parsimony is independent of root location. The minimum cost for weighted parsimony is independent of root location if S(a,b) is a metric:
S(a,a) = 0
S(a,b) = S(b,a)
S(a,c) ≤ S(a,b) + S(b,c)

When the cost is independent of root location, we only need to search over unrooted topologies, which reduces the number of topologies to consider.
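The bookkeeping with Rk and the traceback can be sketched as below. The set rule used here (take the intersection of the daughters' sets if it is non-empty, otherwise the union and add one to the cost) is the usual Fitch formulation, which the text does not state explicitly; the random choice in the traceback is replaced by a deterministic one (alphabetical order) so that the example is reproducible.

```python
def fitch(tree):
    """Return (R_k, cost) for one site.

    A leaf is a single residue; an internal node is a (left, right) tuple.
    """
    if isinstance(tree, str):
        return {tree}, 0
    (r_i, c_i), (r_j, c_j) = fitch(tree[0]), fitch(tree[1])
    common = r_i & r_j
    if common:
        return common, c_i + c_j
    # No shared residue: one substitution is needed below this node.
    return r_i | r_j, c_i + c_j + 1


def traceback(tree, residue=None):
    """Assign residues top-down, keeping the parent's residue when possible."""
    if isinstance(tree, str):
        return tree
    r_k, _ = fitch(tree)  # recomputed for brevity, not efficiency
    if residue is None or residue not in r_k:
        residue = min(r_k)  # deterministic stand-in for a random choice
    return (residue, traceback(tree[0], residue), traceback(tree[1], residue))


if __name__ == "__main__":
    site = (("A", "G"), ("A", "G"))  # middle site of the earlier example
    print(fitch(site))
    print(traceback(site))
```

For the middle site of the earlier example this gives R7 = {A, G} with cost 2, and the traceback starting from A reproduces the all-A ancestral assignment.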

Branch and Bound


In parsimony, the number of possible topologies increases rapidly as the number of leaves increases. Therefore, a more efficient strategy is required. There are several search strategies which are quite efficient, but they are not guaranteed to find the best tree. Branch and bound is an algorithm which is guaranteed to find the best tree. The idea is to systematically enumerate trees while abandoning a search avenue as soon as the incomplete tree is already more expensive than the best complete tree found so far. The trees can be enumerated like an odometer.


[i3] lists the possible edges where a new edge for x4 can be added.
[i5] lists the possible edges where a new edge for x5 can be added.
[i2n-5] lists the possible edges where a new edge for xn can be added.
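The odometer idea can be illustrated with `itertools.product`, whose rightmost counter turns fastest. In this sketch each counter ik simply ranges over 0..k-1, one value per edge available when the next leaf is added; how an index maps to a concrete edge, and the pruning that turns the enumeration into branch and bound, are left out and would be needed in a real implementation.

```python
from itertools import product


def odometer(n):
    """Yield the counter arrays (i3, i5, ..., i_{2n-5}) for n >= 4 leaves.

    The unrooted tree on x1, x2, x3 has 3 edges, so x4 can be attached in
    3 places; the resulting tree has 5 edges for x5, and so on.
    """
    ranges = [range(k) for k in range(3, 2 * n - 4, 2)]
    return product(*ranges)


if __name__ == "__main__":
    for counters in odometer(5):
        print(counters)
```

For n = 5 this yields 3 x 5 = 15 counter arrays, one per unrooted topology on five leaves.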

Assessing trees with bootstrap


Once a tree has been built using an algorithm of choice, how much can we trust it? How can we assess its accuracy? Bootstrap is a method which allows us to assess the significance of a phylogenetic feature.
1. The dataset is an alignment of sequences
2. Generate an artificial dataset by selecting columns with replacement
3. Build a tree for the artificial dataset
4. Repeat the previous 2 steps about 1000 times
5. The frequency of appearance of a feature is the measure of confidence in this feature
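The resampling steps can be sketched as follows. `build_tree` and `has_feature` are hypothetical callables supplied by the user (any tree building method, and a test for the feature of interest); they are placeholders for this example rather than functions of a particular library.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample the columns of an alignment with replacement."""
    n_cols = len(alignment[0])
    chosen = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[c] for c in chosen) for seq in alignment]


def bootstrap_support(alignment, build_tree, has_feature, replicates=1000, seed=0):
    """Fraction of bootstrap replicates whose tree shows the feature."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replicates):
        replicate = bootstrap_alignment(alignment, rng)
        if has_feature(build_tree(replicate)):
            hits += 1
    return hits / replicates


if __name__ == "__main__":
    aln = ["ACGT", "AGGT", "ACGA"]
    print(bootstrap_alignment(aln, random.Random(1)))
```

Each replicate has the same number of sequences and columns as the original alignment; only the choice of columns differs.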

Source
[1] Durbin [2] Haari Jaalinoja's slides

Simultaneous alignment and phylogeny


Now we turn to the problem of simultaneously aligning sequences and finding a plausible phylogeny for them. There are two parsimony-type algorithms that tackle this problem, the first using a character-substitution model of gaps, the second using affine gap penalties. Both find an alignment given a tree; it is necessary to search over trees to find the overall optimum. The first, Sankoff & Cedergren's algorithm, is guaranteed to find ancestral sequences, and their alignment with the leaf sequences, that together minimize a tree-based parsimony-type cost.

Hein's affine cost algorithm


Hein's algorithm uses an affine gap cost, which is more realistic than the simple substitution treatment of gaps. It is also much faster than Sankoff & Cedergren's algorithm in most realistic situations, fast enough in fact to allow a search over tree topologies for modest-sized sets of sequences. It is the only current practical algorithm able to align sequences and explore alternative phylogenies effectively. The price paid for these very considerable gains is that the algorithm makes a simplifying assumption in the choice of ancestral sequences which does not always lead to the overall most parsimonious choices.

Suppose we are building a tree bottom-up. In Hein's algorithm, in the upward pass through the tree, the only sequences considered at a node are those with minimal cost given the sequences at its two daughter nodes. This procedure is not guaranteed to find the minimum cost for the whole tree.


The aim is to find a sequence z at a given node that is aligned to both sequence x and sequence y at the daughter nodes and satisfies:

S(x,z) + S(z,y) = S(x,y)

where S here denotes the total cost for a given alignment of two sequences. We need to show that sequences z satisfying the above equation can be found, which requires dealing with gaps. Using dynamic programming, we define the following:

VM(i,j): the ith residue in x is aligned to the jth residue in y, with cost s(a,b)
VX(i,j): the ith residue in x is aligned to a gap in y, with cost s(a,-)
VY(i,j): the jth residue in y is aligned to a gap in x, with cost s(-,b)

These correspond to Viterbi costs up to the match state M(i,j) and the insert states X and Y, respectively. We write the three numbers VM, VX, VY in the (i,j)th cell of the dynamic programming matrix. Let the affine gap cost for a gap of length k be d + (k-1)e, where e ≤ d. Then the recursion is:

VM(i,j) = min{VM(i-1,j-1), VX(i-1,j-1), VY(i-1,j-1)} + S(xi,yj)
VX(i,j) = min{VM(i-1,j) + d, VX(i-1,j) + e}
VY(i,j) = min{VM(i,j-1) + d, VY(i,j-1) + e}

Using these equations, we construct the following matrix.

We mark all transitions that occur on paths that give the minimal cost. Any path that we piece together using these transitions will give an optimal alignment of x and y.
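The three-state recursion can be sketched in Python as follows. It computes only the minimal alignment cost of x and y (the forward pass), not the marked transitions or the ancestral sequence z; the boundary conditions and the unit substitution cost are assumptions made for this example.

```python
import math


def unit_cost(a, b):
    """Substitution cost: 0 for a match, 1 for a mismatch (assumed)."""
    return 0 if a == b else 1


def affine_alignment_cost(x, y, d, e, sub_cost=unit_cost):
    """Minimal cost of aligning x and y with gap cost d + (k - 1) * e."""
    n, m = len(x), len(y)
    inf = math.inf
    vm = [[inf] * (m + 1) for _ in range(n + 1)]  # x_i aligned to y_j
    vx = [[inf] * (m + 1) for _ in range(n + 1)]  # x_i aligned to a gap
    vy = [[inf] * (m + 1) for _ in range(n + 1)]  # y_j aligned to a gap
    vm[0][0] = 0  # assumed start: empty alignment with zero cost
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                vm[i][j] = min(
                    vm[i - 1][j - 1], vx[i - 1][j - 1], vy[i - 1][j - 1]
                ) + sub_cost(x[i - 1], y[j - 1])
            if i > 0:
                vx[i][j] = min(vm[i - 1][j] + d, vx[i - 1][j] + e)
            if j > 0:
                vy[i][j] = min(vm[i][j - 1] + d, vy[i][j - 1] + e)
    return min(vm[n][m], vx[n][m], vy[n][m])


if __name__ == "__main__":
    print(affine_alignment_cost("ACGT", "AGT", d=2, e=1))
    print(affine_alignment_cost("AAAA", "AA", d=2, e=1))
```

With d = 2 and e = 1, a single gap of length two costs 3, which is cheaper than two separate gaps of length one (cost 4); this is the behaviour the affine cost is meant to capture.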

Note: This post is a summary of chapter 7.6 of Durbin.

Bioinformatics Databases
What is a database?
In simple terms, a database is an electronic filing system. It allows a user to quickly store, search, retrieve, exchange and remove data. An application that manages a database (DB) is called a DBMS (Database Management System). The big biological databases can be queried through the Internet.

Why are there so many biological databases?



Biological data is very diverse and is growing at an exponential rate. Therefore, no single database can handle all the data and serve the diverse needs of the scientific community. As a result, many different databases exist, each with different capabilities and often redundant data. Right now, there is a large effort underway by different groups around the world to link and interface all the important databases and the data contained within them.

What will I find on this website?


We do not run or maintain any bioinformatics database. We simply lack the expertise and the funds. Here you will find links to and brief descriptions of various important databases. Our list is not exhaustive and it is not meant to be exhaustive. Our goal is to list the best and the most respected databases while offering links to pages or websites offering a comprehensive list.

How do I use a database listed here?


All biological databases listed on this website come with a set of tools to help their users retrieve, submit, and analyze the data contained within. Tools evolve over time; new tools are introduced and obsolete ones are removed. These tools often have to be learned, and usually the database website offers help or tutorials to assist its users.

Meta Databases
A meta-database is a DBMS which is either linked to or collects information from various other databases. A meta-database allows users to access information related to a specific topic from several databases on one page.

MetaDB
The MetaDB metadatabase is a sorted, searchable collection of biological databases. Most entries in the metadatabase include a relevant peer-reviewed abstract or excerpt along with a link to the abstract or full text article. Database descriptions surrounded by quotation marks were borrowed from the database websites. It contains links to over 1200 databases.

Entrez
Entrez is the integrated, text-based search and retrieval system used at NCBI for the major databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, and others.

euGenes
euGenes provides a common summary of gene and genomic information from eukaryotic organism databases. This includes gene symbol and full name; chromosome, genetic and molecular map information; gene product information (function, structure, and homologies); and links to extended gene information.

GeneCards
The GeneCards project aims to integrate the fragments of information scattered over a variety of specialized databases into a coherent picture.

Source
SOURCE is a unification tool which dynamically collects and compiles data from many scientific databases, and thereby attempts to encapsulate the genetics and molecular biology of genes from the genomes of Homo sapiens, Mus musculus, and Rattus norvegicus into easy to navigate GeneReports. The mission of SOURCE is to provide a unique scientific resource that pools publicly available data commonly sought after for any clone, GenBank accession number, or gene. SOURCE is specifically designed to facilitate the analysis of large sets of data that biologists can now produce using genome-scale experimental approaches.

Harvester
A picture speaks a thousand words, and the following screenshot of the website is self-explanatory.

Nucleotide Sequence Databases


EMBL Nucleotide Sequence Database
The EMBL Nucleotide Sequence Database constitutes Europe's primary nucleotide sequence resource. Main sources for DNA and RNA sequences are direct submissions from individual researchers, genome sequencing projects and patent applications. The database is produced in an international collaboration with GenBank (USA) and the DNA Data Bank of Japan (DDBJ). Each of the three groups collects a portion of the total sequence data reported worldwide, and all new and updated database entries are exchanged between the groups on a daily basis.

NCBI - National Center For Biotechnology Information


Established in 1988 as a national resource for molecular biology information, NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information, all for the better understanding of molecular processes affecting human health and disease.

DDBJ - DNA Data Bank of Japan


DDBJ (DNA Data Bank of Japan) began DNA data bank activities in earnest in 1986 at the National Institute of Genetics (NIG). DDBJ has been functioning as the international nucleotide sequence database in collaboration with EBI/EMBL and NCBI/GenBank. DNA sequence records organismic evolution more directly than other biological materials and, thus, is invaluable not only for research in the life sciences but also for human welfare in general. The databases are, so to speak, a common treasure of human beings.

Unigene
Each UniGene entry is a set of transcript sequences that appear to come from the same transcription locus (gene or expressed pseudogene), together with information on protein similarities, gene expression, cDNA clone reagents, and genomic location.

Genome Databases
Ensembl
Ensembl is a joint project between EMBL - EBI and the Sanger Institute to develop a software system which produces and maintains automatic annotation on selected eukaryotic genomes.

TIGR
The Institute for Genomic Research (TIGR) is a not-for-profit center dedicated to deciphering and analyzing genomes, the complex molecular chains that constitute each organism's unique genetic heritage.

Protein Databases
PMD - Protein Mutant Database
Compilations of protein mutant data are valuable as a basis for protein engineering. They provide information on what kinds of functional and/or structural influences are brought about by amino acid mutation at a specific position of a protein. The Protein Mutant Database (PMD) covers natural as well as artificial mutants, including random and site-directed ones, for all proteins except members of the globin and immunoglobulin families. The PMD is based on literature, not on proteins. That is, each entry in the database corresponds to one article, which may describe one, several or a number of protein mutants.

Gene3D
Structural and Functional Annotation of Protein Families

Panther

The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System is a unique resource that classifies genes by their functions, using published scientific experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence. Proteins are classified by expert biologists into families and subfamilies of shared function, which are then categorized by molecular function and biological process ontology terms. For an increasing number of proteins, detailed biochemical interactions in canonical pathways are captured and can be viewed interactively.

DIP - Database of Interacting Proteins


The DIP database catalogs experimentally determined interactions between proteins. It combines information from a variety of sources to create a single, consistent set of protein-protein interactions. The data stored within the DIP database were curated both manually, by expert curators, and automatically, using computational approaches that utilize the knowledge about the protein-protein interaction networks extracted from the most reliable, core subset of the DIP data.

HPRD - Human Protein Reference Database


The Human Protein Reference Database represents a centralized platform to visually depict and integrate information pertaining to domain architecture, post-translational modifications, interaction networks and disease association for each protein in the human proteome. All the information in HPRD has been manually extracted from the literature by expert biologists who read, interpret and analyze the published data. HPRD has been created using an object oriented database in Zope, an open source web application server, that provides versatility in query functions and allows data to be displayed dynamically.

For a more comprehensive list, please refer to ExPASy.

Protein Sequence Databases


UniProt
UniProt (Universal Protein Resource) is the world's most comprehensive catalog of information on proteins. It is a central repository of protein sequence and function, created by joining the information contained in Swiss-Prot, TrEMBL, and PIR. UniProt comprises three components, each optimized for different uses. The UniProt Knowledgebase (UniProtKB) is the central access point for extensive curated protein information, including function, classification, and cross-references. The UniProt Reference Clusters (UniRef) databases combine closely related sequences into a single record to speed up searches. The UniProt Archive (UniParc) is a comprehensive repository that reflects the history of all protein sequences.

Swiss-Prot and TrEMBL


UniProtKB/Swiss-Prot: a curated protein sequence database that strives to provide a high level of annotation (such as a description of the protein's function, its domain structure, post-translational modifications, and variants), a minimal level of redundancy, and a high level of integration with other databases.


UniProtKB/TrEMBL: a computer-annotated supplement to Swiss-Prot that contains all translations of EMBL nucleotide sequence entries not yet integrated into Swiss-Prot.
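In practice, the curated/computer-annotated split is visible in UniProtKB FASTA headers, which begin with `sp|` for Swiss-Prot entries and `tr|` for TrEMBL entries. The following is a minimal sketch, not an official UniProt tool: the function name `parse_uniprot_header` and the returned dictionary keys are illustrative assumptions, and the TrEMBL header used in the usage note is made up.

```python
def parse_uniprot_header(header: str) -> dict:
    """Split a UniProtKB FASTA header into its source database and identifiers.

    Expects the form '>db|ACCESSION|ENTRY_NAME description ...',
    where db is 'sp' (Swiss-Prot) or 'tr' (TrEMBL).
    """
    sources = {"sp": "UniProtKB/Swiss-Prot", "tr": "UniProtKB/TrEMBL"}

    # Drop the leading '>' and split on the first two '|' separators only.
    line = header.strip().lstrip(">")
    fields = line.split("|", 2)
    if len(fields) != 3 or fields[0] not in sources:
        raise ValueError(f"not a UniProtKB FASTA header: {header!r}")

    db, accession, rest = fields
    # The entry name is the first whitespace-delimited token after the accession.
    entry_name = rest.split(" ", 1)[0]
    return {
        "database": sources[db],
        "accession": accession,
        "entry_name": entry_name,
        "reviewed": db == "sp",  # Swiss-Prot entries are manually curated
    }


if __name__ == "__main__":
    curated = parse_uniprot_header(
        ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens"
    )
    print(curated["database"], curated["accession"], curated["reviewed"])

    # Hypothetical TrEMBL header, for illustration only.
    automatic = parse_uniprot_header(">tr|X0X0X0|X0X0X0_EXAMP Uncharacterized protein")
    print(automatic["database"], automatic["reviewed"])
```

A filter built on the `reviewed` flag is one simple way to restrict an analysis to manually curated sequences.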

Protein Structure Databases


Protein Data Bank
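The Protein Data Bank (PDB) is the primary archive of experimentally determined three-dimensional structures of proteins and nucleic acids, and the most authoritative resource for protein structure information.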

BMRB - Biological Magnetic Resonance Data Bank


A repository for NMR spectroscopy data on proteins, peptides, and nucleic acids.

Swiss-Model Repository
The SWISS-MODEL Repository is a database of annotated three-dimensional comparative protein structure models generated by the fully automated homology-modelling pipeline SWISS-MODEL. The repository is developed at the Biozentrum Basel within the Swiss Institute of Bioinformatics.

CATH
CATH is a hierarchical classification of protein domain structures that clusters proteins at four major levels: Class (C), Architecture (A), Topology (T), and Homologous superfamily (H). Class, derived from secondary structure content, is assigned automatically for more than 90% of protein structures. Architecture, which describes the gross orientation of secondary structures independent of their connectivity, is currently assigned manually. The Topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The Homologous superfamily level clusters proteins with highly similar structures and functions. Assignments of structures to fold groups and homologous superfamilies are made by sequence and structure comparisons. The boundaries and assignments for each protein domain are determined using a combination of automated and manual procedures, including computational techniques, empirical and statistical evidence, literature review, and expert analysis.
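The four-level hierarchy is reflected in CATH node identifiers, which are written as four dot-separated numbers (for example, 3.40.50.300), one per level. The sketch below shows how such an identifier can be unpacked into its C, A, T, and H nodes. The names `CathNode`, `parse_cath_id`, and `class_name` are hypothetical helpers, and the class-name table covers only classes 1 to 4 as an assumption about the CATH release in use.

```python
from typing import NamedTuple

# Names of the top-level classes; assumed here, check the CATH release you use.
CATH_CLASSES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha Beta",
    4: "Few Secondary Structures",
}


class CathNode(NamedTuple):
    """Identifiers of the four nested levels for one CATH superfamily."""
    class_id: str       # C
    architecture: str   # C.A
    topology: str       # C.A.T
    superfamily: str    # C.A.T.H


def parse_cath_id(cath_id: str) -> CathNode:
    """Split a 'C.A.T.H' identifier into the node id at each level."""
    parts = cath_id.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated numbers, got {cath_id!r}")
    # Each level's id is the prefix of the full id up to that depth.
    return CathNode(*(".".join(parts[:depth]) for depth in range(1, 5)))


def class_name(cath_id: str) -> str:
    """Look up the class name for the first number of the identifier."""
    return CATH_CLASSES.get(int(parse_cath_id(cath_id).class_id), "Unknown")


if __name__ == "__main__":
    node = parse_cath_id("3.40.50.300")
    print(node)
    print(class_name("3.40.50.300"))
```

Because every level's identifier is a prefix of the next, two domains share a fold group exactly when their `topology` fields match, which makes grouping by level a simple string comparison.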

SCOP
Nearly all proteins have structural similarities with other proteins and, in some of these cases, share a common evolutionary origin. The SCOP database, created by manual inspection and abetted by a battery of automated methods, aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known. As such, it provides a broad survey of all known protein folds, detailed information about the close relatives of any particular protein, and a framework for future research and classification.


For a more comprehensive list, please refer to ExPASy.

Durbin Summary
