Introduction To Different Resources of Bioinformatics and Application PDF

Introduction to different resources of
bioinformatics and application

Bioinformatics Resources and Tools:
A database is a structured collection of records stored in a computer system. Genomic
databases typically store DNA or protein sequences as well as annotated information
about those sequences.
Many databases also provide bioinformatics tools, such as BLAST, for finding specific
sequences or annotations. There are hundreds of genomics databases: some are
comprehensive, but are not carefully curated (GenBank), while others are carefully
curated, but are narrow (FlyBase).
Bioinformatic tools are computer programs that analyze one or more sequences.
There is a dizzying array of bioinformatic tools that can analyze sequences to find protein
domains (Pfam), or that can search through databases of millions of sequences to find ones that
are similar (BLAST) or that can find potential protein-coding regions (ORF-Finder).
Many are freely available over the web. It can be overwhelming to find and use bioinformatic tools
because you need to know
1) what type of analysis you want to perform
2) what type of tool to use
3) where to find the tool.
The Online Bioinformatics Resources Collection (OBRC) contains annotations and links for
2428 bioinformatics databases and software tools.
DNA Sequence Databases and Analysis Tools (460)
Enzymes and Pathways (241)
Gene Mutations, Genetic Variations and Diseases (247)
Genomics Databases and Analysis Tools (632)
Immunological Databases and Tools (48)
Microarray, SAGE, and other Gene Expression (164)

Organelle Databases (23)
Other Databases and Tools (Literature Mining, Lab Protocols, Medical Topics, and
others) (143)
Plant Databases (146)
Protein Sequence Databases and Analysis Tools (404)
Proteomics Resources (58)
RNA Databases and Analysis Tools (216)
Structure Databases and Analysis Tools (378)

Nucleotide Sequence Databases (the principal ones)
NCBI - National Center for Biotechnology Information.

NCBI houses sets of databases of sequences for everything under the sun.
EBI - European Bioinformatics Institute
DDBJ - DNA Data Bank of Japan

A public database of annotated nucleotide sequences. Includes the Japanese
Genotype-phenotype Archive (JGA), personal genotype and phenotype data from
individuals who have signed consent agreements authorizing data release only for
specific research uses.
Database Searching by Sequence Similarity

BLAST @ NCBI
PSI-BLAST @ NCBI
FASTA @ EBI
BLAT @ UCSC - Jim Kent's BLAT is superb in terms of speed and the integrated view it
gives of the results
Basic Local Alignment Search Tool
The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between
sequences. The program compares nucleotide or protein sequences to sequence
databases and calculates the statistical significance of matches. BLAST can be used to
infer functional and evolutionary relationships between sequences as well as help identify
members of gene families.
BLAT@UCSC
BLAT on DNA is designed to quickly find sequences of 95% and greater similarity of length 25
bases or more. It may miss more divergent or shorter sequence alignments. It will find perfect
sequence matches of 20 bases. BLAT on proteins finds sequences of 80% and greater similarity
of length 20 amino acids or more. In practice, DNA BLAT works well on primates, and protein
BLAT on land vertebrates.
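To make the thresholds above concrete, the following minimal sketch computes percent identity over two ungapped, equal-length segments and checks it against the 95% / 25-base figures quoted for DNA. It is not BLAT's algorithm; the function names `percent_identity` and `meets_dna_threshold` and the example sequences are assumptions made for illustration.

```python
def percent_identity(a, b):
    """Percentage of positions at which two equal-length segments agree."""
    if len(a) != len(b) or not a:
        raise ValueError("segments must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def meets_dna_threshold(a, b, min_identity=95.0, min_length=25):
    """True if the segment pair is long enough and similar enough.

    The default values mirror the figures quoted in the text for DNA;
    the check itself is only an illustration, not how BLAT searches.
    """
    return len(a) >= min_length and percent_identity(a, b) >= min_identity


query = "ACGTACGTACGTACGTACGTACGTA"          # 25 bases (made-up example)
one_mismatch = "ACGTACGTACGTACGTACGTACGTC"   # differs at the last base
two_mismatches = "TCGTACGTACGTACGTACGTACGTC" # differs at first and last base

print(percent_identity(query, one_mismatch))       # 24 of 25 positions agree
print(meets_dna_threshold(query, one_mismatch))
print(meets_dna_threshold(query, two_mismatches))
```

With 25 bases, a single mismatch still leaves 96% identity, while two mismatches drop the pair to 92%, below the quoted threshold.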
FASTA @ EBI
This tool provides sequence similarity searching against protein databases using the FASTA
suite of programs. FASTA provides a heuristic search with a protein query.
FASTX and FASTY translate a DNA query. Optimal searches are available with SSEARCH
(local), GGSEARCH (global) and GLSEARCH (global query, local database).
Issues to consider
► Dealing with gaps

► Do we want gaps in alignment?
► What are disadvantages of
► Many small gaps?
► Some big gaps?
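One way to explore the questions above is to compare how two gap-scoring schemes treat many small gaps versus one big gap. The sketch below is an illustration only: the affine scheme (a gap-opening cost plus a gap-extension cost) and all parameter values are assumptions, not something these slides define.

```python
def gap_runs(aligned):
    """Return the lengths of the consecutive '-' runs in one aligned string."""
    runs, current = [], 0
    for ch in aligned:
        if ch == "-":
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def linear_cost(runs, per_gap=-1):
    """Every gap position costs the same, however the gaps are grouped."""
    return sum(per_gap * n for n in runs)


def affine_cost(runs, gap_open=-3, gap_extend=-1):
    """Hypothetical affine scheme: opening a run costs gap_open,
    each further position in the same run costs gap_extend."""
    return sum(gap_open + gap_extend * (n - 1) for n in runs)


many_small = gap_runs("AC-GT-AC-GT")  # three gaps of length 1
one_big = gap_runs("ACGT---ACGT")     # one gap of length 3

print(linear_cost(many_small), linear_cost(one_big))
print(affine_cost(many_small), affine_cost(one_big))
```

Under the linear scheme both layouts cost the same, so the scoring cannot prefer one over the other; under the assumed affine scheme the three separate gaps are penalized more heavily than the single long one.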
Summary
► Why are biological sequences similar to one
another?
► Knowledge of how and why sequences change
over time can help you interpret similarities and
differences between them
Why do similarity search?
► Similarity indicates conserved function
► Human and mouse genes are more than 80% similar at sequence level
► But these genes are small fraction of genome
► Most sequences in the genome are not recognizably similar
► Comparing sequences helps us understand function
► Locate similar gene in another species to understand your new gene
► Rosetta stone
BLAST
► Basic Local Alignment Search Tool

► Algorithm for comparing a given sequence against
sequences in a database
► A match between two sequences is an alignment
► Many BLAST databases and web services available
Example BLAST questions
► Which bacterial species have a protein that is related in
lineage to a protein whose amino-acid sequence is known?
► Where does the DNA I sequenced come from?
► What other genes encode proteins that exhibit structures
similar to the one I just determined?
Background: Identifying Similarity
► Algorithms to match sequences:

► Needleman-Wunsch
► Smith Waterman
► BLAST
Needleman-Wunsch
► Global alignment algorithm

► An example: align COELACANTH and PELICAN
► Scoring scheme: +1 if letters match, -1 for mismatches, -1 for gaps
COELACANTH COELACANTH
P-ELICAN-- -PELICAN--
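The scoring scheme above can be applied column by column to each of the two alignments shown. This is a minimal sketch; the function name `score_alignment` is an assumption, and '-' is taken to mark a gap.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-1):
    """Score two equal-length gapped strings one column at a time."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap        # letter paired with a gap
        elif a == b:
            total += match      # identical letters
        else:
            total += mismatch   # different letters
    return total


first = score_alignment("COELACANTH", "P-ELICAN--")
second = score_alignment("COELACANTH", "-PELICAN--")
print(first, second)
```

Both alignments contain five matches, two mismatches and three gaps, so they receive the same total score of 0 under this scheme.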
Needleman-Wunsch Details
► Two-dimensional matrix
► Diagonal when two letters align
► Horizontal when letters paired to gaps
[Figure: two-dimensional alignment matrix with COELACANTH along the top and PELICAN down the side; the path traced through the cells gives the alignment.]
Needleman-Wunsch
► In reality, each cell of matrix contains score and
pointer
► Score is derived from scoring scheme (-1 or +1 in our
example)
► Pointer is an arrow that points up, left, or diagonal
► After initializing matrix, compute the score and arrow
for each cell
Two sequences will be aligned.
GAATTCAGTTA (sequence #1)

GGATCGA (sequence #2)
A simple scoring scheme will be used
S(i,j) = 1 if the residue at position i of sequence #1 is the same as
the residue at position j of sequence #2 (match score)
S(i,j) = 0 otherwise (mismatch score)
w = gap penalty
To study the algorithm, consider the two given sequences.

CGTGAATTCAT (sequence #1), GACTTAC (sequence #2)

The lengths (number of nucleotides or amino acids) of sequence 1 and sequence 2 are
11 and 7 respectively. The initial matrix is created with A+1 columns and B+1 rows (where
A and B correspond to the lengths of the sequences). The extra row and column at the start of
the matrix allow alignment with a gap, as shown in Figure 1.
After creating the initial matrix, a scoring scheme has to be introduced, which can be user defined
with specific scores. In a simple scoring scheme, if the two residues
(nucleotides or amino acids) at the ith and jth positions are the same, the match score is 1 (S(i,j) = 1); if the
two residues at the ith and jth positions are not the same, the mismatch score is -1 (S(i,j) = -1).
The gap score (w), or gap penalty, is assumed to be -1.
*Note: The match, mismatch and gap scores can be user defined, provided the gap penalty
is negative or zero.
The gap score is the penalty given to an alignment when there is an insertion or deletion.
The dynamic programming matrix is defined with three different steps.

1. Initialization of the matrix with the possible scores.
2. Matrix filling with maximum scores.
3. Trace back of the residues for the appropriate alignment.
Initialization Step

This example assumes that there is a gap penalty. The first row and first column of the matrix can
initially be filled with 0. When a gap score is used, each cell of the first row and first column is
instead obtained by adding the gap score to the previous cell of that row or column (Figure 2).
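The initialization step can be sketched as follows for the two example sequences, with a gap score of -1. The function name `init_matrix` and the choice to put sequence #1 along the columns and sequence #2 along the rows are assumptions made for illustration.

```python
def init_matrix(seq1, seq2, gap=-1):
    """Build a (len(seq2)+1) x (len(seq1)+1) matrix and fill its first
    row and first column by repeatedly adding the gap score."""
    cols, rows = len(seq1) + 1, len(seq2) + 1
    M = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        M[0][j] = M[0][j - 1] + gap   # previous cell in the row plus gap
    for i in range(1, rows):
        M[i][0] = M[i - 1][0] + gap   # previous cell in the column plus gap
    return M


M = init_matrix("CGTGAATTCAT", "GACTTAC")
print(M[0])                    # first row: 0, -1, -2, ... down to -11
print([row[0] for row in M])   # first column: 0, -1, -2, ... down to -7
```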
Matrix Fill Step

The second and crucial step of the algorithm is matrix filling, starting from the upper left-hand
corner of the matrix. To find the maximum score of each cell, it is required to know the
neighbouring scores (diagonal, left and above) of the current position. Add the match or
mismatch score to the diagonal value, and add the gap score to each of the other two
neighbouring values. This gives three candidate values; take the maximum among them and
fill position (i,j) with it.

In terms of matrix positions, the three candidate values are M(i-1,j-1) + S(i,j), M(i,j-1) + w and M(i-1,j) + w.

Overall, the equation can be written in the following manner:

M(i,j) = max[ M(i-1,j-1) + S(i,j), M(i,j-1) + w, M(i-1,j) + w ]
To score the current position (the first position, M(1,1)), the above
formula can be used. The first residues (nucleotides or amino acids) of the two sequences
are ‘C’ and ‘G’. Since they do not match, S(1,1) = -1, and the three candidate values are
M(0,0) + S(1,1) = -1, M(0,1) + w = -2 and M(1,0) + w = -2. The maximum, -1, is placed
in position (1,1) of the scoring matrix. Using the same equation and method, fill all the
remaining rows and columns. In each cell, place a back pointer to the predecessor cell
from which the maximum score was obtained (Figure 3).
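The fill step can be sketched as below, using the example sequences and scores (match 1, mismatch -1, gap -1). The name `fill_matrix`, the pointer labels "diag", "left" and "up", and the tie-breaking order are assumptions; here sequence #1 runs along the columns and sequence #2 along the rows.

```python
def fill_matrix(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Return the score matrix M and the back-pointer matrix P."""
    cols, rows = len(seq1) + 1, len(seq2) + 1
    M = [[0] * cols for _ in range(rows)]
    P = [[None] * cols for _ in range(rows)]
    for j in range(1, cols):            # first row: gaps only
        M[0][j] = M[0][j - 1] + gap
        P[0][j] = "left"
    for i in range(1, rows):            # first column: gaps only
        M[i][0] = M[i - 1][0] + gap
        P[i][0] = "up"
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq2[i - 1] == seq1[j - 1] else mismatch
            candidates = [
                (M[i - 1][j - 1] + s, "diag"),  # align the two residues
                (M[i][j - 1] + gap, "left"),    # gap in sequence #2
                (M[i - 1][j] + gap, "up"),      # gap in sequence #1
            ]
            # On ties max() keeps the first candidate, so "diag" is preferred.
            M[i][j], P[i][j] = max(candidates, key=lambda c: c[0])
    return M, P


M, P = fill_matrix("CGTGAATTCAT", "GACTTAC")
print(M[1][1])     # score of the first cell
print(M[-1][-1])   # score in the bottom right-hand corner
```

Because only one pointer is stored per cell, this sketch records a single predecessor even when several candidates tie; keeping all tied pointers would be needed to enumerate every optimal alignment.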
Trace back Step

The final step in the algorithm is the trace back for the best alignment. In the above
example, the score in the bottom right-hand corner is -1. The
important point to note here is that there may be two or more alignments possible
between the two example sequences.
The current cell, with value -1, has an immediate predecessor from which its maximum score
was obtained; it is located diagonally and its value is 0. If two or more cells
point back with the same maximum, there are two or more possible alignments.

By continuing the trace back in this way, one eventually reaches the
0th row, 0th column. Following the steps described above, the alignment of the two sample
sequences can be found. The best alignment among the candidates can be identified by
the maximum alignment score under a scoring scheme (for example match = 5, mismatch = -1,
gap = -2), which may be user defined (Figure 4).
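The three steps can be put together in one short sketch that returns the score in the bottom right-hand corner and one of the possible alignments. The function name `needleman_wunsch` is an assumption, as is the tie-breaking rule (diagonal first, then left, then up), which decides which of several equally good alignments is reported.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Global alignment of seq1 (columns) and seq2 (rows).

    Returns (score, aligned_seq1, aligned_seq2) for one optimal path.
    """
    cols, rows = len(seq1) + 1, len(seq2) + 1

    # Step 1: initialization with the gap score.
    M = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        M[0][j] = M[0][j - 1] + gap
    for i in range(1, rows):
        M[i][0] = M[i - 1][0] + gap

    # Step 2: matrix fill with the maximum of the three candidates.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq2[i - 1] == seq1[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,
                          M[i][j - 1] + gap,
                          M[i - 1][j] + gap)

    # Step 3: trace back from the bottom right-hand corner to (0, 0).
    top, bottom = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq2[i - 1] == seq1[j - 1] else mismatch
            if M[i][j] == M[i - 1][j - 1] + s:      # diagonal predecessor
                top.append(seq1[j - 1])
                bottom.append(seq2[i - 1])
                i, j = i - 1, j - 1
                continue
        if j > 0 and M[i][j] == M[i][j - 1] + gap:  # left predecessor
            top.append(seq1[j - 1])
            bottom.append("-")
            j -= 1
        else:                                       # upper predecessor
            top.append("-")
            bottom.append(seq2[i - 1])
            i -= 1
    return M[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


score, aligned1, aligned2 = needleman_wunsch("CGTGAATTCAT", "GACTTAC")
print(score)
print(aligned1)
print(aligned2)
```

Passing different `match`, `mismatch` and `gap` values shows how a user-defined scoring scheme changes both the score and the alignment that is reported.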
Sequence Alignment
USC Sequence Alignment Server - align 2 sequences with all
possible varieties of dynamic programming
T-COFFEE - multiple sequence alignment
ClustalW @ EBI - multiple sequence alignment
MSA 2.1 - optimal multiple sequence alignment using the
Carrillo-Lipman method
BOXSHADE - pretty printing and shading of multiple alignments

Splign - Splign is a utility for computing cDNA-to-Genomic, or spliced sequence
alignments. At the heart of the program is a global alignment algorithm that specifically
accounts for introns and splice signals.
Spidey - an mRNA-to-genomic alignment program
Wise2 - align a protein or profile HMM against genomic sequence to predict a gene
structure, and related tools
PipMaker - computes alignments of similar regions in two (long) DNA sequences
VISTA - align + detect conserved regions in long genomic sequences
myGodzilla - align a sequence to its ortholog in the human genome

Sequence Alignment
Clustal Omega @ EBI

Clustal Omega is a multiple sequence alignment program for proteins. It produces biologically meaningful
multiple sequence alignments of divergent sequences. Evolutionary relationships can be seen via viewing
Cladograms or Phylograms.
PipMaker
PipMaker computes alignments of similar regions in two DNA sequences. The resulting alignments are
summarized with a "percent identity plot", or "pip" for short. MultiPipMaker allows the user to see
relationships among more than two sequences. All pairwise alignments with the first sequence are
computed and then returned as interleaved pips. Moreover, MultiPipMaker can be requested to compute a
true multiple alignment of the input sequences and return a nucleotide-level view of the results.
T-COFFEE@EBI
T-Coffee is a multiple sequence alignment program. Its main characteristic is that it will allow you to
combine results obtained with several alignment methods.
Motif and Pattern Search in Sequences
Gibbs Motif Sampler - identification of conserved motifs in DNA or protein sequences
AlignACE Homepage - gene regulatory motif finding
MEME - motif discovery and search in protein and DNA sequences
SAM - tools for creating and using Hidden Markov Models
Pratt - discover patterns in unaligned protein sequences
Motivated Proteins - a web facility for exploring small hydrogen-bonded motifs

Gibbs Motif Sampler
The Gibbs Motif Sampler will allow you to identify motifs, conserved regions, in DNA or
protein sequences.
MEME
Multiple EM for Motif Elicitation (MEME) performs motif discovery on DNA, RNA or protein
datasets and discovers novel, ungapped motifs (recurring, fixed-length patterns) in your
sequences. MEME splits variable-length patterns into two
or more separate motifs.
Pratt
The Pratt program is able to discover patterns conserved in sets of unaligned protein
sequences.
Protein Sequence Databases
SWISS-PROT & TrEMBL - Protein sequence database and computer annotated supplement
UniProt - UniProt (Universal Protein Resource) is the world's most comprehensive catalog
of information on proteins. It is a central repository of protein sequence and function
created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.
PIR - Protein Information Resource
MIPS - Munich Information centre for Protein Sequences
HUPO - HUman Proteome Organization

Human Genome Databases
•Draft Human Genome @ NCBI
•Draft Human Genome @ UCSC
•Ensembl - automatically annotated human genome.
•GDB - Genome Database
•Mammalian Gene Collection - full-length (open reading frame) sequences
for human and mouse
•STACK - Sequence Tag Alignment and Consensus Knowledgebase
•GeneCards - human genes, proteins and diseases

Databases of other Organisms
GOLD - Genomes OnLine Database, information on complete and ongoing genome projects
TIGR Comprehensive Microbial Resource
TIGR Microbial Database
The Proteome Databases - yeast, worm, & human, good annotation
Saccharomyces Genome Database

WormBase - C. elegans
FlyBase
Berkeley Drosophila Genome Project
Mouse Genome Informatics
The Arabidopsis Information Resource
ZFIN - Zebrafish Information Network
DictyBase - Dictyostelium discoideum
EcoGene - E. coli
HIV sequence database

Genome-wide Analysis
MBGD - comparative analysis of completely sequenced microbial genomes
COGs - phylogenetic classification of orthologous proteins from complete genomes
STRING - detect whether a given query gene occurs repeatedly with certain other genes in potential
operons
Pedant - automatic whole genome annotation
GeneCensus - various whole genome comparisons

Protein Domains: Databases and Search Tools
InterPro - integration of Pfam, PRINTS, PROSITE, SWISS-PROT + TrEMBL
PROSITE - database of protein families and domains
Pfam - alignments and hidden Markov models covering many common protein domains
SMART - analysis of domains in proteins
ProDom - protein domain database
PRINTS Database - groups of conserved motifs used to characterise protein families
Blocks - multiply aligned ungapped segments corresponding to the most highly conserved
regions of proteins
Protein Domain Profile Analysis @ BMERC - search a library of profiles with a protein
sequence
TIGRFAMs - yet more protein families based on Hidden Markov Models

Protein 3D Structure
PDB - protein 3D structure database
RasMol / Protein Explorer - molecule 3D structure viewers
SCOP - Structural Classification Of Proteins
UCL BSM CATH classification
The DALI Domain Database
FSSP - fold classification based on structure-structure alignment of proteins
SWISS-MODEL - homology modeling server

Structure Prediction Meta-server
K2 - protein structure alignment
DALI - 3D structure alignment server
DSSP - defines secondary structure and solvent exposure from 3D coordinates
HSSP Database - Homology-derived Secondary Structure of Proteins
PredictProtein & PHD - predict secondary structure, solvent accessibility, transmembrane helices,
and other stuff
Jpred2 - protein secondary structure prediction
PSIpred (& MEMSAT & GenTHREADER) - protein secondary structure prediction (& transmembrane
helix prediction & tertiary structure prediction by threading)
Metabolic, Gene Regulatory & Signal Transduction Network Databases
KEGG - Kyoto Encyclopedia of Genes and Genomes
BioCarta
DAVID - Database for Annotation, Visualization and Integrated Discovery - a useful server
for annotating microarray and other genetic data.
stke - Signal Transduction Knowledge Environment
BIND - Biomolecular Interaction Network Database
EcoCyc
WIT
PathGuide - a very useful collection of resources dealing primarily with pathways
SPAD - Signaling Pathway Database
CSNDB - Cell Signalling Networks Database
PathDB
Transpath
DIP - Database of Interacting Proteins
PFBP - Protein Function and Biochemical Networks
Alliance for Cellular Signalling

DDB (Diseases Database) –
Web site
http://www.diseasesdatabase.com/
Purpose
Diseases Database serves as way of classifying medical concepts along clinical axes, such as
cause/effect, risk factors, interactions, etc., rather than in hierarchies of anatomical, physiological,
or pathological systems.
Description
Diseases Database is a cross-referenced index of human disease, medications, symptoms, signs, and
abnormal investigation findings. The content focuses on internal medicine, inherited disease, clinical
biochemistry, and pharmacology. Its terms are referred to as "items."
Update Frequency
Diseases Database is updated regularly.
The Gene Disease Associations Database DisGeNET
DisGeNET is a comprehensive gene-disease association database that integrates associations from
several sources that cover different biomedical aspects of diseases.[25] In particular, it is focused on
the current knowledge of human genetic diseases including Mendelian, complex and environmental
diseases.
To assess the concept of modularity of human diseases, this database performs a systematic study
of the emergent properties of human gene-disease networks by means of network topology and
functional annotation analysis.[1]
The results indicate a highly shared genetic origin of human diseases and show that for most
diseases, including Mendelian, complex and environmental diseases, functional modules exist.
Moreover, a core set of biological pathways is found to be associated with most human diseases.
Obtaining similar results when studying clusters of diseases, the findings in this database suggest
that related diseases might arise due to dysfunction of common biological processes in the cell.
The network analysis of this integrated database points out that data integration is needed to obtain
a comprehensive view of the genetic landscape of human diseases and that the genetic origin of
complex diseases is much more common than expected.
DisGeNET gene-disease association ontology
The description of each association type in this ontology is:
#Therapeutic Association: The gene/protein has a therapeutic role in the amelioration of the disease.
#Biomarker Association: The gene/protein either plays a role in the etiology of the disease (e.g. participates in the
molecular mechanism that leads to disease) or is a biomarker for a disease.
#Genetic Variation Association: Used when a sequence variation (a mutation, a SNP) is associated with the disease phenotype,
but there is still no evidence to say that the variation causes the disease. In some cases the presence of the variant
increases the susceptibility to the disease. In general, the NCBI SNP identifiers are provided.
#Altered Expression Association: Alterations in the function of the protein by means of altered expression of the gene are
associated with the disease phenotype.
#Post-translational Modification Association: Alterations in the function of the protein by means of post-translational
modifications (methylation or phosphorylation of the protein) are associated with the disease phenotype. [1]
Fluorescent Protein
Fluorescent Protein Visualization

This web site provides an interactive graph of fluorescent protein properties
Photoswitchable Fluorescent Proteins

This web site provides an interactive graph of photoswitchable fluorescent protein properties
BD Fluorescence Spectrum Viewer

A Fluorescence Spectrum Viewer for Excitation and Emission Curves
ThermoFisher Fluorescence SpectraViewer

A Fluorescence SpectraViewer with options for different Fluorophores, Light Sources,
Excitation Filters and Emission Filters
Fluorescent Protein Plasmids & Resources

Addgene's plasmid repository contains a variety of fluorescent protein plasmids. Use this
guide to learn more about the many applications of fluorescent proteins (FPs) and to find the
plasmids that are available from Addgene's depositing scientists.
Gene Prediction
Genscan - eukaryotes
GeneMark
Genie - eukaryotes
GLIMMER - prokaryotes
tRNAscan-SE 1.1 - search for tRNA genes in genomic sequence
GFF (General Feature Format) Specification - a standard format for genomic sequence
annotation
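As an illustration of the GFF entry above: a GFF feature line consists of nine tab-separated columns (sequence ID, source, feature type, start, end, score, strand, phase, attributes), with "." standing for a missing value. The sketch below parses one such line; the names `GffRecord` and `parse_gff_line` and the sample line are assumptions made up for this example, and comment or directive lines are not handled.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    seqid: str
    source: str
    feature: str
    start: int            # 1-based start coordinate
    end: int              # inclusive end coordinate
    score: Optional[float]
    strand: str
    phase: str
    attributes: str


def parse_gff_line(line):
    """Split one tab-separated GFF feature line into a GffRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GFF feature line must have 9 tab-separated columns")
    seqid, source, feature, start, end, score, strand, phase, attrs = fields
    return GffRecord(
        seqid, source, feature, int(start), int(end),
        None if score == "." else float(score),   # "." means no score
        strand, phase, attrs,
    )


# Hypothetical feature line, invented for this example.
sample = "chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene0001"
record = parse_gff_line(sample)
print(record.feature, record.start, record.end, record.strand)
```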
Other Databases (Annotations, Ontologies, Consortia, etc.)
Entrez Gene - Gene provides a unified query environment for genes defined by sequence and/or
in NCBI's Map Viewer. You can query on names, symbols, accessions, publications, GO terms,
chromosome numbers, E.C. numbers, and many other attributes associated with genes and the
products they encode. Replaces LocusLink.
Cancer Genome Anatomy Project
HUGO's Human Gene Nomenclature
Gene Ontology Consortium - a controlled vocabulary of eukaryotic gene roles
Open Biological Ontologies - an umbrella web address for well-structured controlled vocabularies
for shared use across different biological domains.
ACUTS - compilation of Ancient Conserved UnTranslated Sequences
UTR database
ase of databases, listing all the biological databases currently available on the internet.
ENZYME - enzyme nomenclature database
BRENDA - enzyme database
TC-DB - comprehensive classification of membrane transport proteins
The SNP Consortium

HGBASE - database of sequence variations in the human genome
MethDB - DNA methylation database
SpliceDB - canonical and non-canonical splice site sequences in mammalian genes
SpliceOme - database of intron-exon boundaries
InBase - intein database

The I.M.A.G.E. Consortium
The Kabat Database of Sequences of Proteins of Immunological Interest

Nelson Lab: Cytochrome C
REBASE - restriction enzyme database
Chemfinder.com - molecule database
Genomics Institute of the Novartis Research Foundation

Applications:
Drug Discovery
The idea of using X-ray crystallography in drug discovery emerged more than 30 years ago, when the
first three-dimensional protein structures were determined. In the space of a decade, a radical
change has started in drug design, incorporating knowledge of 3D target protein structures into the
design process.
Protein structure can influence drug discovery at every stage of the design process. It is traditionally
used in lead optimization, a process that uses a structure to guide the chemical modification of a lead
molecule in order to optimize shape, hydrogen bonds, and other non-covalent interactions with the
target.
Gene Therapy
Gene therapy is a new form of drug delivery in which the therapeutic agent is produced by the patient’s
own synthetic machinery. It involves the efficient introduction of functional genes into the patient’s
appropriate cells, so that enough of the protein encoded by the gene (the transgene) is produced to
correct the disorder accurately and permanently.
Personalized Medicine
Personalized medicine is a medical model that customizes health care, using
genetic or other information to tailor all decisions and practices to the individual patient.
Applications beyond long-established considerations, such as a patient’s family history, social
conditions, environment, and behavior, are so far very limited, and almost
no progress has been made in the last decade.
Personalized medicine research seeks to identify solutions on the basis of each individual’s susceptibility
profile. These are the areas where new approaches to diagnostics, drug development and individual
therapy can be found.
Preventive Medicine
Preventive medicine, or preventive treatment, consists of measures taken to prevent diseases (or
injuries) rather than to cure them or treat their symptoms. This contrasts in method with curative and
palliative medicine, and in scope with public health methods (which work at the level of populations
rather than individuals). Simple examples of preventive medicine are hand washing, breastfeeding, and
immunization.
Microbial Genome Applications
In microbial genomics, bioinformatics is applied in the following
areas:
Waste Cleanup
Bioinformatics helps identify bacteria and other microbes that are useful in waste cleanup.
The bacterium Deinococcus radiodurans can repair damaged DNA and small chromosome
fragments by isolating the damaged segments within a concentrated area. Deinococcus radiodurans is
listed in the Guinness Book of World Records.
Climate Change
Climate change is due to variations in the solar radiation received by the Earth, plate tectonics, volcanic eruptions,
and oceanic processes (such as oceanic circulation), or to human changes to the natural world.
By studying microorganisms, genome researchers can understand these microbes at a very basic level and isolate
the genes that enable them to survive in extreme conditions. Rhodopseudomonas palustris is a phototrophic purple
non-sulfur bacterium commonly found in soil and water. It converts sunlight into cellular energy, absorbing
carbon dioxide from the atmosphere and converting it into biomass.
Biotechnology
The broad concept of “biotech” or “biotechnology” covers a range of procedures for modifying living organisms
for human purposes, going back to the domestication of animals, the cultivation of plants, and their
improvement through breeding programs that use artificial selection and hybridization.
Modern uses also include genetic engineering and cell and tissue culture technologies. With the help of
bioinformatics, biotechnology has identified organisms and microorganisms that may be useful to the dairy
sector and to food processors.
Lactococcus lactis, a non-pathogenic bacterium that is critical for the production of dairy foodstuffs
like buttermilk, yoghurt and cheese, is one of the most important microorganisms involved in the dairy sector.
This bacterium is also used in the preparation of pickled vegetables, beer, wine, some breads and
sausages.
Researchers expect the understanding of the physiology and genetic make-up of this bacterium to be
invaluable for food producers as well as for the pharmaceutical industry, which is researching the capability
of L. lactis to be used as a drug delivery vehicle.
Agriculture
In agriculture, bioinformatics is applied in the following areas:
Crop Improvement
Comparative genetics of plant genomes has shown that the organization of their genes has remained more
conserved over evolutionary time than had previously been expected. These findings suggest that information
from model crop systems can be used to suggest improvements to other food crops.
Examples of the available complete plant genomes are Arabidopsis thaliana
(thale cress) and Oryza sativa (rice). Arabidopsis thaliana, the first plant to be sequenced, is considered a
model for research in plant genetics and biology.
Many genes are similar across all plants, so the analysis of genes in a model organism such as A. thaliana
helps us to understand gene function and expression in all plants. In addition, because both animals and
plants are eukaryotes, many of the genes found in A. thaliana have animal homologs.
The main reason it was chosen as a model organism for genome sequencing is its genome size: Arabidopsis DNA
consists of approximately 140 million bases divided among five chromosomes, one of the smallest genomes
known among flowering plants.
