Bioinformatics
Servers and Databases
Trinh Hong Thai
Dept. of Biology, College of Science, VNU
Bioinformatics Application Tools
• Downloadable software
  – Freeware
  – Shareware (payware)
• On-line application
  – Free access
  – Membership access (payable)
Bioinformatics Application Tools
• User friendly
• Good looking
• Good quality
• Area of interest
Bioinformatics servers and databases
• ExPASy
  http://www.expasy.org/
• UniProt database
  http://www.uniprot.org/
• NCBI
  http://www.ncbi.nlm.nih.gov/
• EBI
  http://www.ebi.ac.uk/
• DDBJ
  http://www.ddbj.nig.ac.jp/
• GenomeNet
  http://www.genome.ad.jp/
• EMBNet
  http://www.ch.embnet.org/
• Protein Data Bank (PDB)
  http://www.rcsb.org/pdb/index.html
• DBTSS: Database of Transcriptional Start Sites
  http://dbtss.hgc.jp/index.html
Human Genome Centers and Contacts
• Baylor College of Medicine Human Genome Center
  http://www.hgsc.bcm.tmc.edu/
  Director: Richard Gibbs; Informatics contact: Andy Arenson
• Columbia University Genome Center
  http://genome1.ccc.columbia.edu/%7Egenome/
  Director: Arg Efstratiadis; Informatics contact: Eric Schon
• Cooperative Human Linkage Center (CHLC)
  http://lpg.nci.nih.gov/CHLC/
  Director: Jeff Murray; Informatics contact: Ken Buetow
• Eleanor Roosevelt Institute (U. Colorado)
  http://eri.uchsc.edu/chromosome21
  Director: David Patterson; Informatics contact: Guido Vacano
• Fondation Jean Dausset - CEPH
  http://www.cephb.fr/
  Director: Howard Cann; Informatics contact: Mourad Sahbatou
• Genethon
  http://www.genethon.fr/genethon_en.html
  Director: Jean Weissenbach; Informatics contact: Cecile Fizames
• Lawrence Berkeley Laboratory Human Genome Center (LBL)
  http://www-hgc.lbl.gov/GenomeHome.html
  Director: Michael Palazzolo; Informatics contact: Nomi Harris
• Lawrence Livermore National Laboratory Biology and Biotechnology Research Program (LLNL)
  http://www-bio.llnl.gov/bbrp/genome/genome.html
  Director: Tony Carrano; Informatics contact: Tom Slezak
• Los Alamos National Laboratory Center for Human Genome Studies (LANL)
  http://www-ls.lanl.gov/index.html
  Director: Larry L. Deaven; Informatics contact: Robert Sutherland
• Marshfield Medical Research Foundation
  http://www.marshmed.org/genetics/
  Director: James L. Weber; Informatics contact: Chengfeng Zhao
• Resource for Molecular Cytogenetics (UCSF/LBL)
  http://ioerror.ucsf.edu:8080/%7Edfdavy/rmchome.html
  Director: Joe Gray; Informatics contact: Manfred Zorn
• Sanger Centre
  http://www.sanger.ac.uk/
  Director: John Sulston; Informatics contact: Peter Rice
• Stanford DNA Sequence and Technology Center
  http://genome-www.stanford.edu/SDSATC/
  Director: Ron Davis; Informatics contact: Mike Cherry
• Stanford Human Genome Center
  http://shgc-www.stanford.edu/
  Directors: Richard Myers, David Cox; Informatics contact: Kate McKusick
• TIGR - The Institute for Genomic Research
  http://www.tigr.org/
  Director: Claire M. Fraser; Informatics contact: Tony Kerlavage
• University of Michigan Human Genome Center
  http://seqcore.brcf.med.umich.edu/
  Director: Miriam Meisler; Informatics contact: Spencer Thomas
• University of Pennsylvania
  http://www.cbil.upenn.edu/
  Director: Beverly Emanuel; Informatics contact: Chris Overton
• University of Texas Health Science Center at San Antonio Genome Center
  http://apollo.uthscsa.edu/
  Director: Sue Naylor; Informatics contact: Vladimir Pekkel
• University of Texas Southwestern Medical Center
  http://www.swmed.edu/
  Director: Glen Evans; Informatics contact: Chris Davies
• University of Utah
  http://www.genetics.utah.edu/
  Director: Ray Gesteland; Informatics contact: Peter Cartwright
• University of Washington Genome Center
  http://www.genome.washington.edu/uwgc/
  Director: Maynard Olson; Informatics contact: Phil Green
• Washington University Center for Genetics in Medicine
  http://www.ibc.wustl.edu/cgm/cgm.html
  Director: David States; Informatics contact: David States
• Washington University Genome Sequencing Center
  http://genome.wustl.edu/gsc/gschmpg.html
  Director: Bob Waterston; Informatics contact: LaDeana Hillier
• Weizmann Institute
  http://bioinformatics.weizmann.ac.il/wis_genome_project.html
  Director: Doron Lancet; Informatics contact: Jaime Prilusky
• Wellcome Trust Centre for Human Genetics
  http://www.well.ox.ac.uk/
  Directors: John Bell, Mark Lathrop; Informatics contact: Geoff Barton
• Whitehead Institute Center for Genome Research (at MIT)
  http://www-genome.wi.mit.edu/
  Director: Eric Lander; Informatics contact: Lincoln Stein
• Albert Einstein Genome Center
  http://sequence.aecom.yu.edu/chr12/
  Director: Raju Kucherlapati; Informatics contact: Perry Miller
Bioinformatics servers and databases
• BLAST
  http://www.ncbi.nlm.nih.gov/BLAST/
• SWISS-MODEL
  http://www.expasy.org/swissmod/SWISS-MODEL.html
• Pedro's BioMolecular Research Tools
  http://www.biophys.uni-duesseldorf.de/BioNet/Pedro/research_tools.html
Multiple sequence alignment program: ClustalX
Phylogenetic trees: TreeView
Representing and Retrieving Sequences
Definition
• A sequence is a linear set of characters
(sequence elements) representing
nucleotides or amino acids
  – DNA is composed of four nucleotides or "bases": A, C, G, T
  – RNA is also composed of four: A, C, G, U
    (T in DNA is transcribed as U in RNA)
  – Proteins are composed of amino acids (20)
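The definition above can be turned into a small check. The following is a minimal sketch, not a standard tool: the function names `guess_sequence_type` and `transcribe` are made up for illustration, and the guess is only a heuristic, because the letters A, C, G and T are also valid amino acid codes.

```python
# Alphabets taken from the definition above.
DNA_BASES = set("ACGT")
RNA_BASES = set("ACGU")
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def guess_sequence_type(seq):
    """Guess whether seq is DNA, RNA or protein from its letters (heuristic)."""
    letters = set(seq.upper())
    if not letters:
        return "empty"
    # Order matters: a string of only A/C/G/T is reported as DNA,
    # although those letters are also legal amino acid codes.
    if letters <= DNA_BASES:
        return "DNA"
    if letters <= RNA_BASES:
        return "RNA"
    if letters <= AMINO_ACIDS:
        return "protein"
    return "unknown"


def transcribe(dna):
    """Return the RNA version of a DNA string (T becomes U)."""
    return dna.upper().replace("T", "U")


print(guess_sequence_type("CCAAGAAGAAG"))  # DNA
print(transcribe("ATGC"))                  # AUGC
```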
Representation of Sequences
• Characters
  – Simplest
  – Easy to read, edit, etc.
• Bit-coding
  – More compact, both on disk and in memory
  – Comparisons more efficient
  – More to come on this
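As an illustration of bit-coding, here is a minimal sketch that packs unambiguous DNA into 2 bits per base. The particular bit assignment (A=00, C=01, G=10, T=11) and the function names are assumptions chosen for the example; real programs may use other layouts.

```python
# Hypothetical 2-bit code for unambiguous DNA.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"


def pack(seq):
    """Pack a DNA string into one integer, 2 bits per base.

    Returns (bits, length); the length is kept because leading A's
    are encoded as zero bits and would otherwise be lost.
    """
    bits = 0
    for base in seq.upper():
        bits = (bits << 2) | CODE[base]
    return bits, len(seq)


def unpack(bits, length):
    """Recover the DNA string from its packed form."""
    out = []
    for i in range(length):
        shift = 2 * (length - 1 - i)
        out.append(BASES[(bits >> shift) & 0b11])
    return "".join(out)


packed = pack("ACGT")
print(packed)           # (27, 4)
print(unpack(*packed))  # ACGT
```

Four bases fit in one byte instead of four, and comparing two packed sequences of equal length is a single integer comparison.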
Character representation of sequences
• DNA or RNA
  – use 1-letter codes (e.g., A, C, G, T)
• Protein
  – use 1-letter codes
  – can convert to/from 3-letter codes
    (e.g., A = Ala = Alanine,
    C = Cys = Cysteine)
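The conversion between 1-letter and 3-letter codes is a table lookup. A minimal sketch for the 20 standard amino acids follows; the function names and the hyphen-separated 3-letter layout are choices made for this example only.

```python
# Standard 1-letter to 3-letter amino acid codes (20 standard residues).
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}
THREE_TO_ONE = {three: one for one, three in ONE_TO_THREE.items()}


def to_three_letter(protein):
    """Convert 'AC' to 'Ala-Cys'."""
    return "-".join(ONE_TO_THREE[aa] for aa in protein.upper())


def to_one_letter(three_letter):
    """Convert 'Ala-Cys' to 'AC'."""
    codes = three_letter.split("-")
    return "".join(THREE_TO_ONE[code.capitalize()] for code in codes)


print(to_three_letter("AC"))     # Ala-Cys
print(to_one_letter("Ala-Cys"))  # AC
```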
Representing uncertainty in nucleotide sequences
• It is often the case that we would like to represent
uncertainty in a nucleotide sequence, i.e., that
more than one base is “possible” at a given
position
  – to express ambiguity during sequencing
  – to express variation at a position in a gene
    during evolution
  – to express the ability of an enzyme to tolerate
    more than one base at a given position of a
    recognition site
• To do this for nucleotides, we use a set
of single character codes that represent
all possible combinations of bases
• This set was proposed and adopted by
the International Union of Biochemistry
and is referred to as the I.U.B. code
The I.U.B. Code
• A, C, G, T, U
• R = A, G (puRine)
• Y = C, T (pYrimidine)
• S = G, C (Strong hydrogen bonds)
• W = A, T (Weak hydrogen bonds)
• M = A, C (aMino group)
• K = G, T (Keto group)
• B = C, G, T (not A)
• D = A, G, T (not C)
• H = A, C, T (not G)
• V = A, C, G (not T/U)
• N = A, C, G, T/U (iNdeterminate); X or - are
sometimes used
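The code table above maps naturally onto bit-coding: give each base its own bit, and an ambiguity code becomes the OR of its bases. The sketch below assumes the bit assignment A=1, C=2, G=4, T=8 and treats U as T; the function names and the site pattern GTYRAC are used only for illustration.

```python
# One bit per base (an assumed layout); U is treated like T.
A, C, G, T = 1, 2, 4, 8
IUB = {
    "A": A, "C": C, "G": G, "T": T, "U": T,
    "R": A | G, "Y": C | T, "S": G | C, "W": A | T,
    "M": A | C, "K": G | T,
    "B": C | G | T, "D": A | G | T, "H": A | C | T, "V": A | C | G,
    "N": A | C | G | T,
}


def compatible(code1, code2):
    """True if the two codes share at least one possible base."""
    return bool(IUB[code1.upper()] & IUB[code2.upper()])


def site_matches(site, seq):
    """True if seq is consistent with the ambiguous site pattern."""
    if len(site) != len(seq):
        return False
    return all(compatible(s, b) for s, b in zip(site, seq))


print(site_matches("GTYRAC", "GTCGAC"))  # True  (Y allows C, R allows G)
print(site_matches("GTYRAC", "GTAGAC"))  # False (Y does not allow A)
```

A single bitwise AND answers "could these two positions be the same base?", which is why comparisons are efficient in this representation.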
Representing uncertainty in protein sequences
• Given the size of the amino acid
“alphabet”, it is not practical to design a
set of codes for ambiguity in protein
sequences
• Fortunately, ambiguity is less common in
protein sequences than in nucleic acid
sequences
• Could use bit-coding as for nucleic acids
but rarely done
[Table: Single Letter Code (SLC) for amino acids]
Sequence File Formats
Sequence file formats
• Two characteristics of file formats
  – text or binary
  – minimal or annotated
• Text files use IUB codes and are readable by a
word processor (e.g., SimpleText, Microsoft
Word) or text editor (e.g., emacs)
• Binary files are usually readable only by the
program that created them (e.g., MacVector)
• Annotated files preserve information known
about the sequence (coding region start/stop,
protein features, literature references, etc.)
Examples of ASCII sequence file formats
• FASTA
>gi|995614|dbj|D49653|RATOBESE Rat mRNA for obese.
CCAAGAAGAAGAAGACCCCAGCGAGGAAAATGTGCTGGAGACCCC
TGTGCCGGTTCCTGTGGCTTTGGTCCTGTCCTATGTTCAAGCTGTG
CCTATCCACAAAGTCCAGGATGACACCAAAACCCTCATCAAGACCAT
TGTCACCAGGATCAATGACATTTCACACACGCAGTCGGTATCCGCC
AGGCAGAGGGTCACCGGTTTGGACTTCATTCCCGGGCTTCACCCCA
TTCTGAGTTTGTCCAAGATGGACCAGACCCTGGCAGTCTATCAACA
GATCCTCACCAGCTTGCCTTCCCAAAACGTGCTGCAGATAGCTCAT
GACCTGGAGAACCTGCGAGACCTCCTCCATCTGCTGGCCTTCTCCA
AGAGCTGCTCCCTGCCGCAGACCCGTGGCCTGCAGAAGCCAGAGA
GCCTGGATGGCGTCCTGGAAGCCTCGCTCTACTCCACAGAGGTGG
TGGCTCTGAGCAGGCTGCAGGGCTCTCTGCAGGACATTCTTCAACA
GTTGGACCTTAGCCCTGAATGCTGAGGTTTC
• GCG
LOCUS RATOBESE.G 539 BP SS-RNA ENTERED 09/23/95
DEFINITION Rat mRNA for obese.
ACCESSION -
KEYWORDS -
SOURCE Rattus norvegicus; Norway rat
ORGANISM Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata; Vertebrata;
Sarcopterygii; Mammalia; Eutheria; Rodentia; Sciurognathi;
Myomorpha; Muridae; Murinae; Rattus
REFERENCE [1]
AUTHORS Murakami, T. & Shima, K.
TITLE Cloning of rat obese cDNA and its expression in obese rats.
JOURNAL Biochem. Biophys. Res. Commun., 209, 3, 944-952, (1995)
COMMENT Database Reference:
DDBJ RATOBESE
Accession: D49653
------------
Submitted (10-Mar-1995) to DDBJ by:
Takashi Murakami
Department of Laboratory Medicine
School of Medicine
University of Tokushima
Kuramotocho 3-chome
Tokushima 770
Japan
Phone: +81-886-33-7184
Fax: +81-886-31-9495
FEATURES From To/Span Description
pept 30 533 obese
???? 1 539 source; /organism=Rattus norvegicus;
/strain=OLETF, LETO and Zucker;
/dev_stage=differentiated; /sequenced_mol=cDNA
to mRNA; /tissue_type=adipose
BASE COUNT 121 A 167 C 133 G 118 T 0 OTHER
ORIGIN ?
RATOBESE.G Length: 539 Jan 30, 1996 - 05:32 PM Check: 5797 ..
1 CCAAGAAGAA GAAGACCCCA GCGAGGAAAA TGTGCTGGAG ACCCCTGTGC CGGTTCCTGT
61 GGCTTTGGTC CTATCTGTCC TATGTTCAAG CTGTGCCTAT CCACAAAGTC CAGGATGACA
121 CCAAAACCCT CATCAAGACC ATTGTCACCA GGATCAATGA CATTTCACAC ACGCAGTCGG
181 TATCCGCCAG GCAGAGGGTC ACCGGTTTGG ACTTCATTCC CGGGCTTCAC CCCATTCTGA
241 GTTTGTCCAA GATGGACCAG ACCCTGGCAG TCTATCAACA GATCCTCACC AGCTTGCCTT
301 CCCAAAACGT GCTGCAGATA GCTCATGACC TGGAGAACCT GCGAGACCTC CTCCATCTGC
361 TGGCCTTCTC CAAGAGCTGC TCCCTGCCGC AGACCCGTGG CCTGCAGAAG CCAGAGAGCC
421 TGGATGGCGT CCTGGAAGCC TCGCTCTACT CCACAGAGGT GGTGGCTCTG AGCAGGCTGC
481 AGGGCTCTCT GCAGGACATT CTTCAACAGT TGGACCTTAG CCCTGAATGC TGAGGTTTC
//
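Both examples above are plain text, so a few lines of code can read the minimal format and lay a sequence out in the numbered style of the annotated one. The sketch below is an illustration under stated assumptions: `parse_fasta`, `format_numbered` and `base_count` are hypothetical helper names, and the 9-column position field with six blocks of 10 bases is an assumed layout, since the original spacing of the record is not preserved in these slides.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_numbered(seq, width=60, block=10):
    """Lay out seq in numbered rows of space-separated blocks."""
    lines = []
    for start in range(0, len(seq), width):
        row = seq[start:start + width]
        blocks = [row[i:i + block] for i in range(0, len(row), block)]
        lines.append(f"{start + 1:>9} " + " ".join(blocks))
    return "\n".join(lines)


def base_count(seq):
    """Count A, C, G, T and everything else, as in a BASE COUNT line."""
    seq = seq.upper()
    counts = {base: seq.count(base) for base in "ACGT"}
    counts["OTHER"] = len(seq) - sum(counts.values())
    return counts


demo = ">id1 demo record\nCCAAGAAGAA\nGAAG\n"
header, seq = parse_fasta(demo)[0]
print(header)
print(format_numbered(seq))
print(base_count(seq))
```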
Entrez
Entrez Databases
http://www.ncbi.nlm.nih.gov/
• PubMed: The biomedical literature
  – The PubMed database contains MEDLINE abstracts as
    well as links to full-text articles on sites maintained
    by journal publishers
• PubMed Central: free, full text journal articles
• Books: online books
• OMIM: Online Mendelian Inheritance in Man
• Nucleotide sequence database (GenBank)
• Protein sequence database
• Genome: complete genome assemblies
• Structure: three-dimensional macromolecular
structures
• Taxonomy: organisms in GenBank
• SNP: single nucleotide polymorphism
• PopSet: population study data sets
• And many more…
Entrez essentials
• Semi-automated entry of
information into databases
• The links between databases are
critical to its usefulness
Entrez literature searching
• Can find papers on a given subject
• Can find papers on a specific gene
• Can find papers related to a given paper
• Can switch between literature and sequence
databases
• PubMed has links to publishers’ websites to
view the full text of articles
• PubMed Central has free full-text copies
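The slides describe searching through the Entrez web pages. NCBI also provides a programmatic interface, the E-utilities, which these slides do not cover; the sketch below assumes its `esearch.fcgi` endpoint and the JSON layout (`esearchresult` / `idlist`) it returns. The helper names are hypothetical, and no network request is made here: the code only builds the URL and parses a response that has already been downloaded.

```python
import json
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (an assumption of this sketch).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for a text query against one Entrez database."""
    query = urlencode(
        {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    )
    return f"{EUTILS}/esearch.fcgi?{query}"


def parse_id_list(response_text):
    """Extract the list of record IDs from an ESearch JSON response."""
    return json.loads(response_text)["esearchresult"]["idlist"]


print(esearch_url("pubmed", "cystic fibrosis"))

# A made-up response body with the assumed shape, for illustration only.
sample = '{"esearchresult": {"count": "2", "idlist": ["111", "222"]}}'
print(parse_id_list(sample))  # ['111', '222']
```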
Entrez sequence searching
• Can find sequences for a given
gene or protein
• Can download a copy of the sequence
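Downloading a sequence can also be scripted. This is a minimal sketch assuming the NCBI E-utilities `efetch.fcgi` endpoint, the database name `nuccore` for nucleotide records, and the `fasta` return type; the helper names are hypothetical. `download_sequence` needs network access and is not called here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities (an assumption of this sketch).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(db, uid, rettype="fasta"):
    """Build an EFetch URL for one record in the given format."""
    query = urlencode(
        {"db": db, "id": uid, "rettype": rettype, "retmode": "text"}
    )
    return f"{EUTILS}/efetch.fcgi?{query}"


def download_sequence(db, uid, rettype="fasta"):
    """Fetch the record as text. Requires network access."""
    with urlopen(efetch_url(db, uid, rettype)) as response:
        return response.read().decode("utf-8")


# D49653 is the accession shown in the FASTA example earlier in the slides.
print(efetch_url("nuccore", "D49653"))
```

For protein records, the same call with `db="protein"` and `rettype="gp"` is assumed to return the GenPept flat file.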
Example Entrez Session
• Goal: Find literature and sequences for cystic
fibrosis genes
  – Use OMIM with keyword searching.
  – Switch to the Nucleotide database to see the
    sequence.
  – Switch to the Protein database to see the sequence.
  – Change to GenPept format to save the sequence.
  – Use Links to find related literature in PubMed.
  – Use Related Articles to find similar articles.
  – Search the Nucleotide database by gene name.
  – Set Limits to narrow down the search.
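The "switch database" and "Related Articles" steps in this session correspond to following links between records. As a rough sketch, assuming the NCBI E-utilities `elink.fcgi` endpoint and its `dbfrom`, `db` and `id` parameters, the link requests could be built as follows. The helper names and the record ID `12345` are placeholders, not real identifiers, and the code only constructs URLs.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (an assumption of this sketch).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def elink_url(dbfrom, db, uid):
    """Build an ELink URL: records in `db` linked to `uid` in `dbfrom`."""
    query = urlencode({"dbfrom": dbfrom, "db": db, "id": uid})
    return f"{EUTILS}/elink.fcgi?{query}"


def related_articles_url(pmid):
    """Linking PubMed to itself is assumed to return related articles."""
    return elink_url("pubmed", "pubmed", pmid)


# Placeholder ID; a real session would take it from a search result.
omim_id = "12345"
for target in ("nuccore", "protein", "pubmed"):
    print(elink_url("omim", target, omim_id))
print(related_articles_url("12345"))
```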
Example Entrez Session: home of Entrez
Example Entrez Session: search OMIM for ‘BRCA1’
Example Entrez Session: first hit
Example Entrez Session: after clicking Links → Nucleotide
Example Entrez Session: after clicking Links → Protein
Example Entrez Session: Links → PubMed
Block Diagram for Entrez Literature Searching
[Figure: block diagram with boxes labeled Results of Previous Search, Results of Additional Entrez Search (List), Search Criterion, Search Engine, Displayed Item, Item Display, Selection, Desired Output Format]