Biological Sequences
• Similar biological sequences tend to be related
• Information:
– Functional
– Structural
– Evolutionary
Searching Sequence Databases
• Compare a query sequence against a target database
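The idea of comparing a query against a target database can be sketched in a few lines. This is only a toy illustration, not how production search tools work: the scoring is a plain ungapped percent identity, and the database entries (`seq_a`, `seq_b`, `seq_c`) and the function names are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of identical positions over the shorter sequence (no gaps, no shifting)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / n


def search(query: str, database: dict) -> list:
    """Score the query against every database entry and rank hits, best first."""
    hits = [(name, percent_identity(query, seq)) for name, seq in database.items()]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical mini-database of short protein fragments (illustration only).
database = {
    "seq_a": "MEYKCISEHL",
    "seq_b": "MEYKCLSDHL",
    "seq_c": "GTAADLLATH",
}

hits = search("MEYKCISEHL", database)
for name, score in hits:
    print(f"{name}\t{score:.1f}%")
```

Real searches also have to handle insertions, deletions and local matches, which this sketch deliberately ignores.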
Source: https://academic.oup.com/nar/article/48/D1/D1/5695332
(27th annual version) (181 articles) (1637 online databases)
Types of Databases
• Derived Databases
– Built from primary data
– Content controlled by a third party (e.g., NCBI)
• Examples: RefSeq, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain
National Center for Biotechnology Information
Bethesda, MD
• GenBank Data
– Direct submissions (traditional records)
– Batch submissions (EST, GSS, STS)
– FTP accounts (genome data)
• Three collaborating databases
– GenBank
– DNA Database of Japan (DDBJ)
– European Molecular Biology Laboratory (EMBL) Database
International Sequence Database Collaboration
[Figure: the three collaborating databases. GenBank (NCBI, NIH; retrieval via Entrez), EMBL (EBI; retrieval via SRS) and DDBJ (CIB, NIG; retrieval via getentry) each receive submissions and updates.]
The Entrez System
GenBank Divisions (18 in number)
Modification Date
REFERENCE 1 (bases 1 to 3808)
AUTHORS Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
Greenberg,R.M. and Smith,W.C.
TITLE A myosin III from Limulus eyes is a clock-regulated phosphoprotein
JOURNAL J. Neurosci. 18 (12), 4548-4559 (1998)
MEDLINE 98279067
PUBMED 9614231
REFERENCE 2 (bases 1 to 3808)
AUTHORS Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
Greenberg,R.M. and Smith,W.C.
TITLE Direct Submission
JOURNAL Submitted (29-APR-1998) Whitney Laboratory, University of Florida,
9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
REFERENCE 3 (bases 1 to 3808)
AUTHORS Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
Greenberg,R.M. and Smith,W.C.
TITLE Direct Submission
JOURNAL Submitted (02-MAR-2000) Whitney Laboratory, University of Florida,
9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
REMARK Sequence update by submitter
COMMENT On Mar 2, 2000 this sequence version replaced gi:3132700.
GenBank Record: Identifiers
• ACCESSION AF062069: the unique ID of the record
• VERSION AF062069.2 (GI:7144484): the sequence has been updated once, so this is the second version
• "complete cds": the complete coding sequence for myosin III
• Myosin is a protein; the mRNA is translated into that protein sequence
LOCUS AF062069 3808 bp mRNA linear INV 23-DEC-2010
DEFINITION Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION AF062069
VERSION AF062069.2 GI:7144484
KEYWORDS .
SOURCE Limulus polyphemus (Atlantic horseshoe crab)
ORGANISM Limulus polyphemus
Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
Xiphosura; Limulidae; Limulus.
REFERENCE 1 (bases 1 to 3808)
AUTHORS Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
Greenberg,R.M. and Smith,W.C.
TITLE A myosin III from Limulus eyes is a clock-regulated phosphoprotein
JOURNAL J. Neurosci. 18 (12), 4548-4559 (1998)
MEDLINE 98279067
PUBMED 9614231
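The identifiers in the VERSION line can be pulled apart programmatically. The following is a minimal sketch that assumes the line has exactly the three whitespace-separated fields shown in this record (`VERSION`, `accession.version`, `GI:number`); the names `SequenceId` and `parse_version_line` are illustrative, not part of any library.

```python
from typing import NamedTuple


class SequenceId(NamedTuple):
    accession: str  # stable ID of the record, e.g. AF062069
    version: int    # incremented each time the sequence itself changes
    gi: int         # GI number assigned to this particular sequence version


def parse_version_line(line: str) -> SequenceId:
    """Split a line such as 'VERSION AF062069.2 GI:7144484' into its parts."""
    fields = line.split()
    if len(fields) != 3 or fields[0] != "VERSION" or not fields[2].startswith("GI:"):
        raise ValueError(f"unexpected VERSION line: {line!r}")
    accession, _, version = fields[1].partition(".")
    return SequenceId(accession, int(version), int(fields[2][len("GI:"):]))


seq_id = parse_version_line("VERSION AF062069.2 GI:7144484")
print(seq_id)
```

The accession stays the same across updates, while the version suffix and the GI change, which is why the COMMENT line reports that this version replaced gi:3132700.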
GenBank Record: Organism
• SOURCE and ORGANISM give the organism name and its lineage, following NCBI's Taxonomy
• The whole sequence is given, but the coding sequence (cds) is also indicated
• Each sequence version has its own Direct Submission reference with its authors (REFERENCE 2 and 3 in this record)
LOCUS AF062069 3808 bp mRNA linear INV 23-DEC-2010
DEFINITION Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION AF062069
VERSION AF062069.2 GI:7144484
KEYWORDS .
SOURCE Limulus polyphemus (Atlantic horseshoe crab)
ORGANISM Limulus polyphemus
Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
Xiphosura; Limulidae; Limulus.
GenBank Record: Feature Table
• The feature table continues the record from the previous slide, directly below the references
• GenPept IDs: the protein_id and its GI identify the protein sequence only; the protein_id also carries a version (AAC16332.2)
FEATURES Location/Qualifiers
source 1..3808
/organism="Limulus polyphemus"
/db_xref="taxon:6850"
/tissue_type="lateral eye"
CDS 258..3302
/note="N-terminal protein kinase domain; C-terminal myosin
heavy chain head; substrate for PKA"
/codon_start=1
/product="myosin III"
/protein_id="AAC16332.2"
/db_xref="GI:7144485"
/translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQA
NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWLGI
EFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAVQYLHENSIIHRDIRAANIMF
SKEGYVKLIDFGLSASVKNTNGKAQSSVGSPYWMAPEVISCDCLQEPYNYTCDVWSIG
ITAIELADTVPSLSDIHALRAMFRINRNPPPSVKRETRWSETLKDFISECLVKNPEYR
PCIQEIPQHPFLAQVEGKEDQLRSELVDILKKNPGEKLRNKPYNVTFKNGHLKTISGQ
BASE COUNT 1201 a 689 c 782 g 1136 t
ORIGIN
1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
3781 aagatacagt aactagggaa aaaaaaaa
//
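The numbers in the feature table can be cross-checked with a little arithmetic. The sketch below works only for simple `start..end` locations like the ones in this record; `span_length` is an illustrative helper, and the protein length is derived under the assumption that the last codon of a complete cds is a stop codon.

```python
def span_length(location: str) -> int:
    """Length of a simple 'start..end' feature location (1-based, inclusive)."""
    start, end = (int(part) for part in location.split(".."))
    return end - start + 1


# CDS 258..3302 from the feature table.
cds_bases = span_length("258..3302")
codons = cds_bases // 3
# Assumption: the final codon is the stop codon, so it adds no amino acid.
protein_length = codons - 1

# BASE COUNT line: the per-base counts should add up to the LOCUS length.
base_count = {"a": 1201, "c": 689, "g": 782, "t": 1136}
total_bases = sum(base_count.values())

print(cds_bases, codons, protein_length, total_bases)
```

The CDS length is a multiple of three, as expected for /codon_start=1, and the base counts sum to the 3808 bp given in the LOCUS line and in the source feature (1..3808).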
FASTA format
• Universal, text-based format
• Nucleotides or amino acids are represented by single-letter codes
• A header line with the sequence name and comments precedes the sequence
• Easy to manipulate and parse
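To show how little is needed to parse the format, here is a minimal reader, assuming the common convention that a header line starts with `>` and the sequence follows on one or more lines. The example header text is an assumption, and the sequence is only the first 60 bases shown in the ORIGIN section of the record above, not the full 3808 bp.

```python
def parse_fasta(text: str) -> dict:
    """Return {header: sequence} for FASTA-formatted text."""
    chunks = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the '>' marker
            chunks[header] = []
        elif header is not None:
            chunks[header].append(line)  # a sequence may span several lines
    return {name: "".join(parts) for name, parts in chunks.items()}


# First line of ORIGIN, with the spacing between blocks of ten removed.
first_60 = (
    "tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt"
).replace(" ", "")

# Hypothetical header line; the sequence is wrapped over two lines on purpose.
example = (
    ">AF062069.2 Limulus polyphemus myosin III mRNA, complete cds\n"
    + first_60[:30] + "\n"
    + first_60[30:] + "\n"
)

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```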
DNA vs. Protein Searches
• Redundancy in the genetic code
– Multiple codons code for the same amino acid
• Amino acid sequences can be identical even when the DNA sequences that encode them differ
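The redundancy can be made concrete with the standard genetic code. In this sketch the two DNA sequences are made up for illustration (they are not taken from AF062069); both encode the peptide MEYKC but differ at four positions.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; '*' marks a stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate DNA in frame 1, ignoring any incomplete trailing codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# Hypothetical sequences that use different codons for the same amino acids.
dna_1 = "ATGGAATATAAATGT"
dna_2 = "ATGGAGTACAAGTGC"

protein_1 = translate(dna_1)
protein_2 = translate(dna_2)
mismatches = sum(x != y for x, y in zip(dna_1, dna_2))

print(protein_1, protein_2, mismatches)
```

All four differences fall on the third base of a codon, which is where most of the redundancy in the code sits.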
DNA vs. Protein Searches
• It is easier to detect similarity between protein sequences
– DNA has only 4 bases, so unrelated (random) sequences match by chance more often
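A rough calculation shows why the small DNA alphabet produces more chance matches. The sketch assumes every letter is equally likely and positions are independent, which real sequences do not satisfy, so the numbers are only indicative; the function names are illustrative.

```python
def chance_identity(alphabet_size: int) -> float:
    """Probability that two random letters match, assuming uniform frequencies."""
    return 1.0 / alphabet_size


def chance_run(alphabet_size: int, length: int) -> float:
    """Probability that `length` consecutive positions all match by chance."""
    return chance_identity(alphabet_size) ** length


dna_match = chance_identity(4)        # 4 bases
protein_match = chance_identity(20)   # 20 amino acids

print(f"per-position chance match: DNA {dna_match:.2f}, protein {protein_match:.2f}")
print(f"10 matches in a row: DNA {chance_run(4, 10):.2e}, protein {chance_run(20, 10):.2e}")
```

Under these assumptions two unrelated DNA sequences agree at about one position in four, whereas unrelated proteins agree at about one in twenty, so a given level of identity is stronger evidence of relatedness for proteins.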
similarity vs alignment