
Lecture 03

Biological Sequences
Biological Sequences
•  Similar biological sequences tend to be
related

•  Information:
–  Functional
–  Structural
–  Evolutionary
Searching Sequence Databases
•  Compare a query sequence against a
target database

•  Return significant results
–  Possible homologous sequences
–  Yields insight into structure and function
2020 NAR Database Category List
•  Nucleotide Sequence Databases
•  RNA sequence databases
•  Protein sequence databases
•  Structure Databases
•  Genomics Databases (non-vertebrate)
•  Metabolic and Signaling Pathways
•  Human and other Vertebrate Genomes
•  Human Genes and Diseases
•  Microarray Data and other Gene Expression Databases
•  Proteomics Resources
•  Other Molecular Biology Databases
•  Organelle databases
•  Plant databases
•  Immunological databases
•  etc…

Source : https://academic.oup.com/nar/article/48/D1/D1/5695332
(27th annual version) (181 articles) (1637 online databases)
Types of Databases
Note: differentiate between primary and derived databases. Derived databases have credibility.

•  Primary Databases
–  Original submissions by experimentalists
–  Content controlled by the submitter
•  Examples: GenBank, SNP, etc.

•  Derived Databases
–  Built from primary data
–  Content controlled by third party (e.g., NCBI)
•  Examples: RefSeq, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain
National Center for Biotechnology Information
Bethesda, MD

Created in 1988 as a part of the National Library of Medicine at NIH
–  Establish public databases
–  Research in computational biology
–  Develop software tools for sequence analysis
–  Disseminate biomedical information
What is GenBank?
NCBI’s Primary Sequence Database
•  Nucleotide only sequence database
•  Archival in nature (primary database); submitters can update their submissions

•  GenBank Data
–  Direct submissions (traditional records)
–  Batch submissions (EST, GSS, STS)
–  ftp accounts (genome data)
•  Three collaborating databases
–  GenBank
–  DNA Database of Japan (DDBJ)
–  European Molecular Biology Laboratory (EMBL) Database
International Sequence Database Collaboration

[Figure: the three collaborating databases. GenBank (NCBI, NIH; retrieval through Entrez), EMBL (EBI; retrieval through SRS) and DDBJ (CIB, NIG; retrieval through getentry) each receive submissions and updates.]

The Entrez System
GenBank Divisions (18 in number)

PRI - primate sequences
ROD - rodent sequences
MAM - other mammalian sequences
VRT - other vertebrate sequences
INV - invertebrate sequences
PLN - plant, fungal, and algal sequences
BCT - bacterial sequences
VRL - viral sequences
PHG - bacteriophage sequences
SYN - synthetic sequences
UNA - unannotated sequences
EST - EST sequences (expressed sequence tags)
PAT - patent sequences
STS - STS sequences (sequence tagged sites)
GSS - GSS sequences (genome survey sequences)
HTG - HTG sequences (high-throughput genomic sequences)
HTC - unfinished high-throughput cDNA sequencing
ENV - environmental sampling sequences
A Traditional GenBank Record
LOCUS AF062069 3808 bp mRNA linear INV 23-DEC-2010
DEFINITION Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION AF062069
VERSION AF062069.2 GI:7144484
KEYWORDS .
SOURCE Limulus polyphemus (Atlantic horseshoe crab)
ORGANISM Limulus polyphemus
Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
Xiphosura; Limulidae; Limulus.
REFERENCE 1 (bases 1 to 3808)
AUTHORS Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
Greenberg,R.M. and Smith,W.C.
TITLE A myosin III from Limulus eyes is a clock-regulated phosphoprotein
JOURNAL J. Neurosci. 18 (12), 4548-4559 (1998)
MEDLINE 98279067
PUBMED 9614231
REFERENCE 2 (bases 1 to 3808)
AUTHORS Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
Greenberg,R.M. and Smith,W.C.
TITLE Direct Submission
JOURNAL Submitted (29-APR-1998) Whitney Laboratory, University of Florida,
9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
REFERENCE 3 (bases 1 to 3808)
AUTHORS Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
Greenberg,R.M. and Smith,W.C.
TITLE Direct Submission
JOURNAL Submitted (02-MAR-2000) Whitney Laboratory, University of Florida,
9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
REMARK Sequence update by submitter
COMMENT On Mar 2, 2000 this sequence version replaced gi:3132700.
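The header lines of the record above can be pulled apart programmatically. The following is a minimal sketch, assuming the whitespace-separated layout shown on the slide (real GenBank flat files are column-based, so a production parser should not rely on simple splitting). The function names `parse_locus` and `parse_version` are hypothetical, and "topology" is the usual name for the "linear" field, which the slide does not label.

```python
def parse_locus(line):
    """Split a LOCUS line into named fields (whitespace-based simplification)."""
    tokens = line.split()
    # Expected tokens: LOCUS, name, length, "bp", molecule, topology, division, date
    if len(tokens) != 8 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError("unexpected LOCUS line layout: %r" % line)
    return {
        "locus_name": tokens[1],
        "length_bp": int(tokens[2]),
        "molecule_type": tokens[4],
        "topology": tokens[5],  # not labelled on the slide; assumed field name
        "division": tokens[6],
        "modification_date": tokens[7],
    }


def parse_version(line):
    """Split a VERSION line into accession, version number and GI."""
    tokens = line.split()
    # Expected tokens: VERSION, accession.version, GI:number
    accession, version = tokens[1].split(".")
    gi = tokens[2].split(":")[1]
    return {"accession": accession, "version": int(version), "gi": gi}


locus = parse_locus("LOCUS AF062069 3808 bp mRNA linear INV 23-DEC-2010")
version = parse_version("VERSION AF062069.2 GI:7144484")
print(locus["division"], locus["length_bp"], version["version"])
```

The version number 2 is consistent with the COMMENT line, which says this sequence version replaced an earlier one.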
GenBank Record: Locus
Note: memorise the fields of the LOCUS line.

LOCUS AF062069 3808 bp mRNA linear INV 23-DEC-2010

•  Locus name: AF062069
•  Length: 3808 bp
•  Molecule type: mRNA
•  Division: INV
•  Modification Date: 23-DEC-2010
GenBank Record: Identifiers
DEFINITION Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION AF062069
VERSION AF062069.2 GI:7144484

•  DEFINITION: "complete cds" means the record contains the complete coding sequence for myosin III. Myosin is a protein; the mRNA is translated into that protein sequence.
•  ACCESSION (AF062069): unique ID of the record
•  VERSION (AF062069.2): the sequence has been updated once, so this is the second version
•  GI: 7144484
GenBank Record: Organism
SOURCE Limulus polyphemus (Atlantic horseshoe crab)
ORGANISM Limulus polyphemus
Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
Xiphosura; Limulidae; Limulus.

•  The organism name and lineage come from NCBI's Taxonomy
•  The whole sequence is given in the record, but the coding sequence (cds) is also mentioned
•  This record has three references: the published paper and one Direct Submission for each of the two versions
GenBank Record: Feature Table
•  The feature table continues the record from the previous slide, just below the references
•  /protein_id and /db_xref="GI:7144485" under CDS are GenPept IDs: they identify the protein sequence only, and the protein_id is given with its version (AAC16332.2)

FEATURES Location/Qualifiers
source 1..3808
/organism="Limulus polyphemus"
/db_xref="taxon:6850"
/tissue_type="lateral eye"
CDS 258..3302
/note="N-terminal protein kinase domain; C-terminal myosin
heavy chain head; substrate for PKA"
/codon_start=1
/product="myosin III"
/protein_id="AAC16332.2"
/db_xref="GI:7144485"
/translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQA
NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWLGI
EFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAVQYLHENSIIHRDIRAANIMF
SKEGYVKLIDFGLSASVKNTNGKAQSSVGSPYWMAPEVISCDCLQEPYNYTCDVWSIG
ITAIELADTVPSLSDIHALRAMFRINRNPPPSVKRETRWSETLKDFISECLVKNPEYR
PCIQEIPQHPFLAQVEGKEDQLRSELVDILKKNPGEKLRNKPYNVTFKNGHLKTISGQ
BASE COUNT 1201 a 689 c 782 g 1136 t
ORIGIN
1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
3781 aagatacagt aactagggaa aaaaaaaa
//
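The numbers in the feature table can be cross-checked with a little arithmetic. The sketch below is illustrative only: the helper name `cds_length` is hypothetical, it handles only simple `start..end` locations, and it assumes the CDS span includes the stop codon.

```python
def cds_length(location):
    """Length in nucleotides of a simple 'start..end' location (1-based, inclusive)."""
    start, end = (int(part) for part in location.split(".."))
    return end - start + 1


# Values copied from the record above.
length_nt = cds_length("258..3302")
codons = length_nt // 3
base_count = {"a": 1201, "c": 689, "g": 782, "t": 1136}

print(length_nt)                 # nucleotides covered by the CDS
print(length_nt % 3 == 0)        # a complete CDS should be a whole number of codons
print(codons - 1)                # amino acids, assuming the last codon is the stop codon
print(sum(base_count.values()))  # should equal the 3808 bp given on the LOCUS line
```

The BASE COUNT total matching the LOCUS length is a quick sanity check that a record was copied completely.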
FASTA format (universal)

>gi|7144485|gb|AAC16332.2| myosin III [Limulus polyphemus]
MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQANKKVALKIIGHIAENLLDIETEYRIY
KAVNGIQFFPEFRGAFFKRGERESDNEVWLGIEFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAV
QYLHENSIIHRDIRAANIMFSKEGYVKLIDFGLSASVKNTNGKAQSSVGSPYWMAPEVISCDCLQEPYNY
TCDVWSIGITAIELADTVPSLSDIHALRAMFRINRNPPPSVKRETRWSETLKDFISECLVKNPEYRPCIQ
EIPQHPFLAQVEGKEDQLRSELVDILKKNPGEKLRNKPYNVTFKNGHLKTISGQPHEKIYVDDLAFLDSP
>gi|7144486|gb|AAA23731.2| metC peptide [Escherichia coli
MADKKLDTQLVNAGRSKKYSLGAVNSVIQRASSLVFDSVEAKKHATRNRANGELFYGRRGTLTHFSLQQA
MCELEGGAGCVLFPCGAAAVANSILAFIEQGDPRVPSSNS

•  Text-based format
•  Representation by single letter codes
•  Format allows sequence names and comments to precede the sequence
•  Easy to manipulate and parse sequences
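To illustrate how easy the format is to parse, here is a minimal sketch of a FASTA reader. The function name `parse_fasta` is hypothetical, and the sketch assumes each record starts with a `>` header line followed by one or more sequence lines; it does no validation of the residue letters.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>gi|7144485|gb|AAC16332.2| myosin III [Limulus polyphemus]
MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQANKKVALKIIGHIAENLLDIETEYRIY
KAVNGIQFFPEFRGAFFKRGERESDNEVWLGIEFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAV
"""

for header, sequence in parse_fasta(example):
    # In this header style the fourth '|'-separated field is the accession.version.
    print(header.split("|")[3], len(sequence))
```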
DNA vs. Protein Searches
•  Redundancy in Genetic code
–  Multiple codons code for same amino acid
•  A.A. sequence could be identical
•  DNA sequence could be different
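The redundancy can be seen directly by translating two different DNA sequences. This is a small sketch using the standard genetic code; the two example sequences and the names `translate`, `dna_a` and `dna_b` are made up for illustration.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... (bases T, C, A, G).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA coding sequence, reading whole codons from position 0."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


dna_a = "ATGCTTTCTCGT"  # hypothetical sequence
dna_b = "ATGTTGAGCAGA"  # different codons chosen for the same amino acids

identical_bases = sum(x == y for x, y in zip(dna_a, dna_b))
print(translate(dna_a), translate(dna_b))  # the two proteins are identical
print(identical_bases, "of", len(dna_a), "bases identical")
```

A protein-level search would treat these two sequences as a perfect hit, while a DNA-level comparison sees mostly mismatches.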
DNA vs. Protein Searches
•  Easier to determine similarity in protein sequences
–  DNA has only 4 bases, so matches between random sequences are more likely
Note (similarity vs. alignment): aligning protein sequences is much more difficult than aligning DNA sequences.

•  Consider alignment of length 4
–  DNA: 1/4^4 = 1/256 chance at random
–  AA: 1/20^4 = 1/160,000 chance at random
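The two figures above follow from one formula. The sketch below assumes every residue is equally likely and positions are independent, which real sequences only approximate; the function name `random_match_probability` is hypothetical.

```python
from fractions import Fraction


def random_match_probability(alphabet_size, length):
    """Chance that a random sequence matches a given one at every position.

    Assumes uniform, independent residues (a simplification).
    """
    return Fraction(1, alphabet_size) ** length


print(random_match_probability(4, 4))   # DNA, 4 bases
print(random_match_probability(20, 4))  # protein, 20 amino acids
```

Under this assumption a length-4 protein match is 625 times less likely to occur by chance than a length-4 DNA match.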
Basics of Sequence alignment
Biological sequences
•  Common mistake:
–  sequence similarity is not homology!

•  Homologous sequences: derived from a common ancestor
Possible Residue Alignments
•  Match
•  Mismatch (substitution or mutation)
•  Insertion/Deletion (INDELS – gaps)
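The three cases can be read off column by column from a gapped pairwise alignment. This is a minimal sketch, assuming the two rows are already aligned to equal length and use `-` as the gap symbol; the function name `classify_columns` and the example rows are hypothetical.

```python
from collections import Counter


def classify_columns(row_a, row_b, gap="-"):
    """Label each alignment column as 'match', 'mismatch' or 'indel'."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    labels = []
    for a, b in zip(row_a, row_b):
        if a == gap and b == gap:
            raise ValueError("a column cannot be a gap in both rows")
        if a == gap or b == gap:
            labels.append("indel")     # insertion in one sequence / deletion in the other
        elif a == b:
            labels.append("match")
        else:
            labels.append("mismatch")  # substitution
    return labels


labels = classify_columns("ACGT-A", "AC-TTG")
print(labels)
print(Counter(labels))
```

Without knowing the ancestral sequence, a gap cannot be attributed to an insertion or a deletion specifically, which is why the two are grouped as indels.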
