
Biological sequence data formats

IUPAC Codes

In order to standardize sequence data, the Nomenclature Committee of the International
Union of Biochemistry and the International Union of Pure and Applied Chemistry have
established a standard code to represent bases, including bases that are uncertain or
ambiguous. The code, often referred to as the IUPAC code, is as follows:

A = adenine
C = cytosine
G = guanine
T = thymine
U = uracil
R = G or A (purine)
Y = T or C (pyrimidine)
K = G or T (keto)
M = A or C (amino)
S = G or C
W = A or T
B = G, T or C
D = G, A or T
H = A, C or T
V = G, C or A
N = A, G, C or T (any)

Any character other than those listed above (with the exception of the gap character
-) is an error that nearly all sequence analysis programs will refuse to accept.
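The check described above can be sketched in a few lines of Python. This is only an illustration: the names IUPAC_NUCLEOTIDES and invalid_positions are assumptions made for this example and do not come from any particular analysis program.

```python
# IUPAC nucleotide codes from the table above, mapped to the bases they stand for.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC",
    "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA",
    "N": "AGCT",
}
GAP = "-"


def invalid_positions(sequence):
    """Return (index, character) pairs that are neither IUPAC codes nor gaps."""
    allowed = set(IUPAC_NUCLEOTIDES) | {GAP}
    return [(i, ch) for i, ch in enumerate(sequence.upper()) if ch not in allowed]


print(invalid_positions("ACGTNRY-ACGT"))  # every character is allowed
print(invalid_positions("ACGTXACGJ"))     # X and J are not nucleotide codes
```

A program that must reject bad input can simply test whether the returned list is empty.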


Standard Amino Acid Code

In addition to the nucleic acid codes, IUPAC has formulated standard single-letter and
three-letter amino acid codes as well. The table for these codes is as follows:










1-Letter  3-Letter    Description
A         Ala         Alanine
R         Arg         Arginine
N         Asn         Asparagine
D         Asp         Aspartic acid
C         Cys         Cysteine
Q         Gln         Glutamine
E         Glu         Glutamic acid
G         Gly         Glycine
H         His         Histidine
I         Ile         Isoleucine
L         Leu         Leucine
K         Lys         Lysine
M         Met         Methionine
F         Phe         Phenylalanine
P         Pro         Proline
S         Ser         Serine
T         Thr         Threonine
W         Trp         Tryptophan
Y         Tyr         Tyrosine
V         Val         Valine
B         Asx         Aspartic acid or Asparagine
Z         Glx         Glutamine or Glutamic acid
X         Xaa or Xxx  Any amino acid
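The table lends itself to a simple lookup structure. The following Python sketch converts between the two codes; the function names and the choice of a hyphen as separator are assumptions made for this example.

```python
# One-letter to three-letter amino acid codes, taken from the table above.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "B": "Asx", "Z": "Glx", "X": "Xaa",
}
THREE_TO_ONE = {three: one for one, three in ONE_TO_THREE.items()}
THREE_TO_ONE["Xxx"] = "X"  # the table lists Xxx as an alternative to Xaa


def to_three_letter(sequence):
    """Convert 'MLT' to 'Met-Leu-Thr'."""
    return "-".join(ONE_TO_THREE[ch] for ch in sequence.upper())


def to_one_letter(three_letter_sequence):
    """Convert 'Met-Leu-Thr' back to 'MLT'."""
    return "".join(THREE_TO_ONE[code] for code in three_letter_sequence.split("-"))


print(to_three_letter("MLTAEEK"))
```

An unknown letter raises a KeyError, which is a reasonable default for a format that does not tolerate other characters.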

Fasta

Fasta sequence format is one of the most basic and widespread sequence formats. A
sequence in Fasta format has as its first line a descriptor beginning with a '>' character.
The lines that follow contain the sequence (either nucleotide or amino acid) using
standard one-letter symbols. This format is extremely useful for sequence analysis
programs, since it is devoid of numerical and nonsequence characters (with the exception
of the newline character).

Example Fasta Sequence:
>gi|27819608|ref|NP_776342.1| hemoglobin, beta [beta globin] [Bos taurus]
MLTAEEKAAVTAFWGKVKVDEVGGEALGRLLVVYPWTQRFFESFGDLSTADAVMNNPKVKAHGKKVLDSF
SNGMKHLDDLKGTFAALSELHCDKLHVDPENFKLLGNVLVVVLARNFGKEFTPVLQADFQKVVAGVANAL
AHRYH

Note the first line begins with '>', which in this case is followed by gi, indicating that the
next field, delimited by '|', will be the GenBank identifier. Following the GenBank
identifier is the keyword 'ref', indicating the next field will be the reference for the
version of this sequence. The final field is the description. Note that nearly all
sequence-based programs will treat anything following the '>' as a comment and disregard
it (or only use it as a sequence descriptor). There are, however, a few sequence analysis
programs that expect the sequences to be in a strict Fasta format.
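As a rough illustration of how a program might read this format, the Python sketch below splits Fasta text into records and breaks an NCBI-style descriptor into its '|'-delimited fields. The function names parse_fasta and split_ncbi_header are assumptions for this example, and the header splitting only covers the tag|value layout shown above.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from Fasta-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_ncbi_header(header):
    """Split a 'gi|...|ref|...| description' descriptor into named fields."""
    parts = header.split("|")
    # Tags sit at even positions, their values at odd positions.
    fields = dict(zip(parts[0:-1:2], parts[1:-1:2]))
    fields["description"] = parts[-1].strip()
    return fields


example = """>gi|27819608|ref|NP_776342.1| hemoglobin, beta [beta globin] [Bos taurus]
MLTAEEKAAVTAFWGKVKVDEVGGEALGRLLVVYPWTQRFFESFGDLSTADAVMNNPKVKAHGKKVLDSF
SNGMKHLDDLKGTFAALSELHCDKLHVDPENFKLLGNVLVVVLARNFGKEFTPVLQADFQKVVAGVANAL
AHRYH
"""

for header, sequence in parse_fasta(example):
    fields = split_ncbi_header(header)
    print(fields["gi"], fields["ref"], len(sequence))
```

Programs that treat the descriptor as a comment would simply ignore the header returned by parse_fasta.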


GenBank

GenBank is the National Center for Biotechnology Information's nucleic acid and protein
sequence database. It is the most widely used source of biological sequence data.
The GenBank file format contains information about the sequence, including literature
references, functions of the sequence, locations of various features, etc.

The information in GenBank records is organized into fields, each with an identifier,
justified to the farthest left column. Some identifiers have additional subfields. The
actual sequence data lies between the identifier ORIGIN and the '//' which signals the
end of a GenBank record.

Example GenBank Sequence:

LOCUS       HBB                      145 aa            linear   MAM 22-JAN-2003
DEFINITION  hemoglobin, beta [beta globin] [Bos taurus].
ACCESSION   NP_776342
VERSION     NP_776342.1  GI:27819608
DBSOURCE    REFSEQ: accession NM_173917.1
KEYWORDS    .
SOURCE      Bos taurus (cow)
  ORGANISM  Bos taurus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Cetartiodactyla; Ruminantia; Pecora; Bovoidea;
            Bovidae; Bovinae; Bos.
REFERENCE   1  (residues 1 to 145)
  AUTHORS   Duncan,C.H.
  JOURNAL   Unpublished (1991)
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence was derived from M63453.1.
FEATURES             Location/Qualifiers
     source          1..145
                     /organism="Bos taurus"
                     /db_xref="taxon:9913"
                     /chromosome="15"
                     /map="15q22-q27"
                     /tissue_type="thymus"
                     /dev_stage="newborn"
     Protein         1..145
                     /product="hemoglobin, beta [beta globin]"
     Region          3..145
                     /region_name="Globin"
                     /note="globin"
                     /db_xref="CDD:pfam00042"
     CDS             1..145
                     /gene="HBB"
                     /coded_by="NM_173917.1:53..490"
                     /db_xref="LocusID:280813"
ORIGIN
        1 mltaeekaav tafwgkvkvd evggealgrl lvvypwtqrf fesfgdlsta davmnnpkvk
       61 ahgkkvldsf sngmkhlddl kgtfaalsel hcdklhvdpe nfkllgnvlv vvlarnfgke
      121 ftpvlqadfq kvvagvanal ahryh
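Because the sequence data sits between ORIGIN and the terminating //, it can be recovered without interpreting any of the other fields. The Python sketch below shows one way to do this; the function name genbank_sequence is an assumption for this example, and the closing // line is appended to the sample text because it marks the end of a record.

```python
def genbank_sequence(record_text):
    """Return the sequence found between ORIGIN and // in one GenBank record."""
    residues = []
    in_origin = False
    for line in record_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break
        elif in_origin:
            # Keep letters only: this drops the position counter at the start
            # of each line and the spaces between the ten-residue groups.
            residues.extend(ch for ch in line if ch.isalpha())
    return "".join(residues)


record = """LOCUS       HBB                      145 aa            linear   MAM 22-JAN-2003
ORIGIN
        1 mltaeekaav tafwgkvkvd evggealgrl lvvypwtqrf fesfgdlsta davmnnpkvk
       61 ahgkkvldsf sngmkhlddl kgtfaalsel hcdklhvdpe nfkllgnvlv vvlarnfgke
      121 ftpvlqadfq kvvagvanal ahryh
//
"""

sequence = genbank_sequence(record)
print(len(sequence), sequence[:10])
```

The length of the extracted sequence can be compared with the length given on the LOCUS line as a quick consistency check.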




ASN.1

Abstract Syntax Notation One (ASN.1) is a formal description language that was
developed to encode various data such that it can be easily exchanged across computer
systems. ASN.1 format is highly structured and detailed. An ASN.1 record contains all of
the information found in the other formats.

Seq-entry ::= set {
level 1 ,
class nuc-prot ,
descr {
source {
genome genomic ,
org {
taxname "Bos taurus" ,
common "cow" ,
db {
{
db "taxon" ,
tag
id 9913 } } ,
orgname {
name
binomial {
genus "Bos" ,
species "taurus" } ,
lineage "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;
Euteleostomi; Mammalia; Eutheria; Cetartiodactyla; Ruminantia; Pecora;
Bovoidea; Bovidae; Bovinae; Bos" ,
gcode 1 ,
mgcode 2 ,
div "MAM" } } ,
subtype {
{
subtype chromosome ,
name "15" } ,
{
subtype map ,
name "15q22-q27" } ,
{
subtype tissue-type ,
name "thymus" } ,
{
subtype dev-stage ,
name "newborn" } } } ,
user {
type
str "RefGeneTracking" ,
data {
{
label
str "Status" ,
data
str "Provisional" } ,
{
label
str "Assembly" ,
data
fields {
{
label
id 0 ,
data
fields {
{
label
str "accession" ,
data
str "M63453.1" } ,
{
label
str "gi" ,
data
int 162741 } } } } } ,
{
label
str "Related" ,
data
fields {
{
label
id 0 ,
data
fields {
{
label
str "accession" ,
data
str "X00376.1" } ,
{
label
str "gi" ,
data
int 395 } } } } } ,
{
label
str "Unknown" ,
data
fields {
{
label
id 0 ,
data
fields {
{
label
str "accession" ,
data
str "X03248.1" } ,
{
label
str "gi" ,
data
int 319 } } } } } } } ,
pub {
pub {
gen {
cit "Unpublished" ,
authors {
names
std {
{
name
name {
last "Duncan" ,
initials "C.H." } } } } ,
date
std {
year 1991 } } } ,
comment "simple staff_entry" } ,
update-date
std {
year 2003 ,
month 1 ,
day 22 } } ,
seq-set {
seq {
id {
other {
accession "NM_173917" ,
version 1 } ,
gi 27819607 } ,
descr {
molinfo {
biomol mRNA } ,
title "Bos taurus hemoglobin, beta [beta globin] (HBB), mRNA" ,
create-date
std {
year 2003 ,
month 1 ,
day 22 } } ,
inst {
repr raw ,
mol rna ,
length 821 ,
strand ss ,
seq-data
ncbi2na '11F9F784416EF47241C440484539E1E78A20A796D165FFAA42B80BA382F
AEB8A57A929E7AFB7157A1D22BDFE2D7FAA1FB51E78E7BCE0415C2B82953A420AE723D7F2C3A4E
09376385D0A917F9E678B89E47B8C2793BA35E207D09D7A906E72EBEE7A7643FE90A0F455AE792
9E1FD20AEBA7AEE94395E9512334F09D5FD79FD4A02BFFD35D225408F833A003CE0BBFE24DE977
970C084FCFF4F91EBB3F03CFD1EDDF1D23A913A8A900478213020382A72D885F8803334B37E855
38492EBEC0C9E3BCE80129FE75F25F1DD5F020F40'H } ,
annot {
{
data
ftable {
{
data
gene {
locus "HBB" ,
db {
{
db "LocusID" ,
tag
id 280813 } } } ,
location
int {
from 0 ,
to 820 ,
strand plus ,
id
gi 27819607 } } } } } } ,
seq {
id {
other {
accession "NP_776342" ,
version 1 } ,
gi 27819608 } ,
descr {
molinfo {
biomol peptide } ,
title "hemoglobin, beta [beta globin] [Bos taurus]" ,
create-date
std {
year 2003 ,
month 1 ,
day 22 } } ,
inst {
repr raw ,
mol aa ,
length 145 ,
seq-data
ncbieaa "MLTAEEKAAVTAFWGKVKVDEVGGEALGRLLVVYPWTQRFFESFGDLSTADAVMNNPKV
KAHGKKVLDSFSNGMKHLDDLKGTFAALSELHCDKLHVDPENFKLLGNVLVVVLARNFGKEFTPVLQADFQKVVAGVA
NALAHRYH" } ,
annot {
{
data
ftable {
{
data
prot {
name {
"hemoglobin, beta [beta globin]" } } ,
location
whole
gi 27819608 } ,
{
data
region "Globin" ,
comment "globin" ,
location
int {
from 2 ,
to 144 ,
id
gi 27819608 } ,
ext {
type
str "cddScoreData" ,
data {
{
label
str "definition" ,
data
str "Globin" } ,
{
label
str "short_name" ,
data
str "globin" } ,
{
label
str "score" ,
data
int 327 } ,
{
label
str "evalue" ,
data
real { 813255, 10, -37 } } ,
{
label
str "bit_score" ,
data
real { 130091, 10, -3 } } } } ,
dbxref {
{
db "CDD" ,
tag
str "pfam00042" } } } } } } } } ,
annot {
{
data
ftable {
{
data
cdregion {
frame one ,
code {
id 1 } } ,
product
whole
gi 27819608 ,
location
int {
from 52 ,
to 489 ,
id
gi 27819607 } } } } } }

Sample ASN.1 file
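In the sample above, the mRNA sequence is stored in the ncbi2na field as a string of hexadecimal digits rather than as letters. ncbi2na is NCBI's two-bit nucleotide encoding (A=0, C=1, G=2, T=3), so each byte holds four bases. The Python sketch below decodes such a string; the function name decode_ncbi2na is an assumption for this example, and the length argument corresponds to the length field of the record, since the last byte may be padded.

```python
NCBI2NA = "ACGT"  # two-bit codes: A=0, C=1, G=2, T=3


def decode_ncbi2na(hex_string, length):
    """Decode hexadecimal ncbi2na data into a nucleotide string of given length."""
    packed = bytes.fromhex(hex_string)
    bases = []
    for byte in packed:
        # The first base of each byte is stored in the two highest bits.
        for shift in (6, 4, 2, 0):
            bases.append(NCBI2NA[(byte >> shift) & 0b11])
    return "".join(bases[:length])


# First four bytes of the seq-data field in the sample record.
print(decode_ncbi2na("11F9F784", 16))
```

This compact encoding cannot represent ambiguity codes such as N, which is one reason several encodings appear in ASN.1 records.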



SwissProt

XML File Format


Databases
GenBank
DDBJ
EMBL

SwissProt
BLOCKS
PFAM


Using Entrez


Complete Process:

1) Determine sequences to align (Globins)
>sp|P02023|HBB_HUMAN Hemoglobin beta chain - Homo sapiens (Human), Pan troglodytes (Chimpanzee), and Pan paniscus (Pygmy chimpanzee) (Bonobo).
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTP
DAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLV
CVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>sp|P02062|HBB_HORSE Hemoglobin beta chain - Equus caballus (Horse).
VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNP
GAVMGNPKV
KAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDPENFRLLGNVLV
VVLARHFGK
DFTPELQASYQKVVAGVANALAHKYH
>sp|P01922|HBA_HUMAN Hemoglobin alpha chain - Homo sapiens (Human), Pan troglodytes (Chimpanzee), and Pan paniscus (Pygmy chimpanzee) (Bonobo).
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHG
SAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLA
AHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>sp|P01958|HBA_HORSE Hemoglobin alpha chains (Slow and fast) - Equus caballus (Horse).
VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSHG
SAQVKAHGK
KVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLSTLAV
HLPNDFTPA
VHASLDKFLSSVSTVLTSKYR
>sp|P02185|MYG_PHYCA Myoglobin - Physeter catodon (Sperm whale) (Physeter macrocephalus).
VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTE
AEMKASED
LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHV
LHSRHP
GDFGADAQGAMNKALELFRKDIAAKYKELGYQG
>sp|P02208|GLB5_PETMA Globin V - Petromyzon marinus (Sea lamprey).
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPK
FKGLTT
ADQLKKSADVRWHAERIINAVNDAVASMDDTEKMSMKLRDLSGKHAKSFQ
VDPQYFKVLA
AVIADTVAAGDAGFEKLMSMICILLRSAY
>sp|P02240|LGB2_LUPLU Leghemoglobin II - Lupinus luteus (Yellow lupine).
GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSEVP
QNNPEL
QAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVADAHFPVV
KEAILKTIKE
VVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA




2) Determine multiple alignment (ClustalW)

>sp|P02023|HBB_HUMAN
--------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQR
FFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTF
ATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVA
GVANALAHKYH------
>sp|P02062|HBB_HORSE
--------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQR
FFDSFGDLSNPGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTF
AALSELHCDKLHVDPENFRLLGNVLVVVLARHFGKDFTPELQASYQKVVA
GVANALAHKYH------
>sp|P01922|HBA_HUMAN
---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKT
YFPHF-DLS-----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNAL
SALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLA
SVSTVLTSKYR------
>sp|P01958|HBA_HORSE
---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKT
YFPHF-DLS-----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGAL
SNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTPAVHASLDKFLS
SVSTVLTSKYR------
>sp|P02185|MYG_PHYCA
---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLE
KFDRFKHLKTEAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAEL
KPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGDFGADAQGAMNKALE
LFRKDIAAKYKELGYQG
>sp|P02208|GLB5_PETMA
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQE
FFPKFKGLTTADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKL
RDLSGKHAKSFQVDPQYFKVLAAVIADTVAAG---------DAGFEKLMS
MICILLRSAY-------
>sp|P02240|LGB2_LUPLU
--------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKD
LFSFLKGTSEVP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATL
KNLGSVHVSKGVAD-AHFPVVKEAILKTIKEVVGAKWSEELNSAWTIAYD
ELAIVIKKEMNDAA---
3) View alignment using various methods
4) Find Blocks in the alignment (BLOCKS)



Profile: scores for substitutions and gaps in each column
Blocks: ungapped aligned regions
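The definition of a block as an ungapped aligned region can be made concrete with a short Python sketch. It only locates runs of columns in which no sequence has a gap; it does not reproduce the motif-finding and scoring performed by the BLOCKS server. The function name ungapped_blocks and the min_width parameter are assumptions for this example.

```python
def ungapped_blocks(aligned, min_width=5):
    """Return (start, end) column ranges, 0-based and end-exclusive, without gaps."""
    width = len(aligned[0])
    if any(len(row) != width for row in aligned):
        raise ValueError("all aligned sequences must have the same length")
    blocks, start = [], None
    # Scan one column past the end so that a block touching the last column closes.
    for col in range(width + 1):
        gap_free = col < width and all(row[col] not in "-." for row in aligned)
        if gap_free and start is None:
            start = col
        elif not gap_free and start is not None:
            if col - start >= min_width:
                blocks.append((start, col))
            start = None
    return blocks


toy_alignment = [
    "--ACDEFG-HIK",
    "MMACDEFGWHIK",
    "M-ACDEYG-HIK",
]
print(ungapped_blocks(toy_alignment, min_width=5))
```

Both '-' and '.' are treated as gap characters here, since different alignment formats use one or the other.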


Alignments based on locally conserved patterns found in the same order in the sequences
(synteny)

Use of statistical methods and probabilistic models of the sequences


Multiple sequence alignments yield information about the evolutionary history of the
sequences: sequences that are most similar are likely to be recently derived from a
common ancestor sequence.

If the sequences in a multiple alignment have quite a bit of variation, then it is difficult to
create a multiple sequence alignment, because many different combinations of
substitutions, insertions, and deletions can be used.




CECS 694-02
Introduction to Bioinformatics
Lecture 5: Searching Sequence Databases


Multiple Alignment Format

In addition to storing individual sequences in a specified format, the results from a
multiple sequence alignment can be stored in a specified format as well. Various
programs (including the BLOCKS server) can then read in these multiple sequence
alignments and perform analysis on them. The most widely used multiple sequence
alignment file formats are FASTA, GCG Multiple Sequence Format, and ALN.


FASTA Format

In Fasta Format, each sequence in the multiple alignment starts with a Fasta description
line (beginning with a '>'). Following the description line is the sequence data. The gap
character - marks the positions where gaps were inserted into the sequence when the
multiple alignment was created.

>JC2395
NVSDVNLNK---YIWRTAEKMK---ICDAKKFARQHKIPESKIDEIEHNSPQDAAE----
-------------------------QKIQLLQCWYQSHGKT--GACQALIQGLRKANRCD
IAEEIQAM
>KPEL_DROME
MAIRLLPLPVRAQLCAHLDAL-----DVWQQLATAVKLYPDQVEQISSQKQRGRS-----
-------------------------ASNEFLNIWGGQYN----HTVQTLFALFKKLKLHN
AMRLIKDY
>FASA_MOUSE
NASNLSLSK---YIPRIAEDMT---IQEAKKFARENNIKEGKIDEIMHDSIQDTAE----
-------------------------QKVQLLLCWYQSHGKS--DAYQDLIKGLKKAECRR
TLDKFQDM

Stockholm Format

Stockholm Format (http://www.cgr.ki.se/cgr/groups/sonnhammer/Stockholm.html)


# STOCKHOLM 1.0
#=GF ID CBS
#=GF AC PF00571
#=GF DE CBS domain
#=GF AU Bateman A
#=GF CC CBS domains are small intracellular modules mostly found
#=GF CC in 2 or four copies within a protein.
#=GF SQ 67
#=GS O31698/18-71 AC O31698
#=GS O83071/192-246 AC O83071
#=GS O83071/259-312 AC O83071
#=GS O31698/88-139 AC O31698
#=GS O31698/88-139 OS Bacillus subtilis
O83071/192-246          MTCRAQLIAVPRASSLAE..AIACAQKM....RVSRVPVYERS
#=GR O83071/192-246 SA  999887756453524252..55152525....36463774777
O83071/259-312          MQHVSAPVFVFECTRLAY..VQHKLRAH....SRAVAIVLDEY
#=GR O83071/259-312 SS  CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEEEE
O31698/18-71            MIEADKVAHVQVGNNLEH..ALLVLTKT....GYTAIPVLDPS
#=GR O31698/18-71 SS    CCCHHHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEHHH
O31698/88-139           EVMLTDIPRLHINDPIMK..GFGMVINN......GFVCVENDE
#=GR O31698/88-139 SS   CCCCCCCHHHHHHHHHHH..HEEEEEEE....EEEEEEEEEEH
#=GC SS_cons            CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEEEH
O31699/88-139           EVMLTDIPRLHINDPIMK..GFGMVINN......GFVCVENDE
#=GR O31699/88-139 AS *
#=GR O31699/88-139 IN
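The line prefixes in the example indicate what each line carries: #=GF lines annotate the whole file, #=GS lines annotate one sequence, #=GR and #=GC lines hold per-residue and per-column markup, and unprefixed lines hold a name followed by aligned residues. The Python sketch below reads the first two kinds and the sequences; it skips #=GR and #=GC markup, and the function name parse_stockholm is an assumption for this example.

```python
def parse_stockholm(text):
    """Collect #=GF annotations, #=GS annotations and sequences from one alignment."""
    gf, gs, seqs = {}, {}, {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "//" or line.startswith("# STOCKHOLM"):
            continue
        if line.startswith("#=GF"):
            _, feature, value = line.split(None, 2)
            # Repeated features such as CC continue the previous text.
            gf[feature] = gf[feature] + " " + value if feature in gf else value
        elif line.startswith("#=GS"):
            _, name, feature, value = line.split(None, 3)
            gs.setdefault(name, {})[feature] = value
        elif line.startswith("#"):
            continue  # per-residue and per-column markup is skipped in this sketch
        else:
            name, residues = line.split(None, 1)
            seqs[name] = seqs.get(name, "") + residues.replace(" ", "")
    return gf, gs, seqs


sample = """# STOCKHOLM 1.0
#=GF ID CBS
#=GF CC CBS domains are small intracellular modules mostly found
#=GF CC in 2 or four copies within a protein.
#=GS O31698/18-71 AC O31698
O83071/192-246          MTCRAQLIAVPRASSLAE..AIACAQKM....RVSRVPVYERS
#=GR O83071/192-246 SA  999887756453524252..55152525....36463774777
//
"""

gf, gs, seqs = parse_stockholm(sample)
print(gf["ID"], gs["O31698/18-71"]["AC"], len(seqs["O83071/192-246"]))
```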



GCG Multiple Sequence Format


!!AA_MULTIPLE_ALIGNMENT 1.0

  msf  MSF: 131  Type: P  22/01/02  CompCheck: 3003  ..

  Name: IXI_234  Len: 131  Check: 6808  Weight: 1.00
  Name: IXI_235  Len: 131  Check: 4032  Weight: 1.00
  Name: IXI_236  Len: 131  Check: 2744  Weight: 1.00
  Name: IXI_237  Len: 131  Check: 9419  Weight: 1.00

//

           1                                               50
IXI_234    TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_235    TSPASIRPPAGPSSR.........RPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_236    TSPASIRPPAGPSSRPAMVSSR..RPSPPPPRRPPGRPCCSAAPPRPQAT
IXI_237    TSPASLRPPAGPSSRPAMVSSRR.RPSPPGPRRPT....CSAAPRRPQAT

           51                                             100
IXI_234    GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSRSAG
IXI_235    GGWKTCSGTCTTSTSTRHRGRSGW..........RASRKSMRAACSRSAG
IXI_236    GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSR..G
IXI_237    GGYKTCSGTCTTSTSTRHRGRSGYSARTTTAACLRASRKSMRAACSR..G

           101                         131
IXI_234    SRPNRFAPTLMSSCITSTTGPPAWAGDRSHE
IXI_235    SRPNRFAPTLMSSCITSTTGPPAWAGDRSHE
IXI_236    SRPPRFAPPLMSSCITSTTGPPPPAGDRSHE
IXI_237    SRPNRFAPTLMSSCLTSTTGPPAYAGDRSHE





PileUp



MSF: 92 Type: P Check: 1886 ..

Name: JC2395 oo Len: 92 Check: 8870 Weight: 35.3
Name: FASA_MOUSE oo Len: 92 Check: 527 Weight: 64.6
Name: KPEL_DROME oo Len: 92 Check: 2489 Weight: 41.2

//



JC2395      .NVSDVNLNK YIWRTAEKMK ICDAKKFARQ HKIPESKIDE IEHNSPQDAA
FASA_MOUSE  .NASNLSLSK YIPRIAEDMT IQEAKKFARE NNIKEGKIDE IMHDSIQDTA
KPEL_DROME  MAIRLLPLPV RAQLCAHLDA LDVWQQLATA VKLYPDQVEQ ISSQKQRGRS


JC2395      EQKIQLLQCW YQSHGKTGAC QALIQGLRKA NRCDIAEEIQ AM
FASA_MOUSE  EQKVQLLLCW YQSHGKSDAY QDLIKGLKKA ECRRTLDKFQ DM
KPEL_DROME  ASN.EFLNIW GGQYNHT..V QTLFALFKKL KLHNAMRLIK DY




ClustalW ALN Format

CLUSTAL W (1.82) multiple sequence alignment


JC2395      -NVSDVNLNKYIWRTAEKMKICDAKKFARQHKIPESKIDEIEHNSPQDAAEQKIQLLQCW 59
FASA_MOUSE  -NASNLSLSKYIPRIAEDMTIQEAKKFARENNIKEGKIDEIMHDSIQDTAEQKVQLLLCW 59
KPEL_DROME  MAIRLLPLPVRAQLCAHLDALDVWQQLATAVKLYPDQVEQISSQKQRGRSASN-EFLNIW 59
: * *. : :::* :: .::::* :. :. : .: ::* *

JC2395      YQSHGKTGACQALIQGLRKANRCDIAEEIQAM 91
FASA_MOUSE  YQSHGKSDAYQDLIKGLKKAECRRTLDKFQDM 91
KPEL_DROME  GGQYNHT--VQTLFALFKKLKLHNAMRLIKDY 89
.:.:: * *: ::* : ::










CLUSTAL W(1.4) multiple sequence alignment


IXI_234     TSPASIRPPA GPSSRPAMVS SRRTRPSPPG PRRPTGRPCC SAAPRRPQAT
IXI_235     TSPASIRPPA GPSSR----- ----RPSPPG PRRPTGRPCC SAAPRRPQAT
IXI_236     TSPASIRPPA GPSSRPAMVS SR--RPSPPP PRRPPGRPCC SAAPPRPQAT
IXI_237     TSPASLRPPA GPSSRPAMVS SRR-RPSPPG PRRPT----C SAAPRRPQAT


IXI_234     GGWKTCSGTC TTSTSTRHRG RSGWSARTTT AACLRASRKS MRAACSRSAG
IXI_235     GGWKTCSGTC TTSTSTRHRG RSGW------ ----RASRKS MRAACSRSAG
IXI_236     GGWKTCSGTC TTSTSTRHRG RSGWSARTTT AACLRASRKS MRAACSR--G
IXI_237     GGYKTCSGTC TTSTSTRHRG RSGYSARTTT AACLRASRKS MRAACSR--G


IXI_234     SRPNRFAPTL MSSCITSTTG PPAWAGDRSH E
IXI_235     SRPNRFAPTL MSSCITSTTG PPAWAGDRSH E
IXI_236     SRPPRFAPPL MSSCITSTTG PPPPAGDRSH E
IXI_237     SRPNRFAPTL MSSCLTSTTG PPAYAGDRSH E

Phylip

3 92
JC2395      -NVSDVNLNK YIWRTAEKMK ICDAKKFARQ HKIPESKIDE IEHNSPQDAA
FASA_MOUSE  -NASNLSLSK YIPRIAEDMT IQEAKKFARE NNIKEGKIDE IMHDSIQDTA
KPEL_DROME  MAIRLLPLPV RAQLCAHLDA LDVWQQLATA VKLYPDQVEQ ISSQKQRGRS

            EQKIQLLQCW YQSHGKTGAC QALIQGLRKA NRCDIAEEIQ AM
            EQKVQLLLCW YQSHGKSDAY QDLIKGLKKA ECRRTLDKFQ DM
            ASN-EFLNIW GGQYNHT--V QTLFALFKKL KLHNAMRLIK DY
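In the Phylip example, the first line gives the number of sequences and the number of alignment columns, the first block carries the sequence names, and later blocks continue the sequences in the same order without names. The Python sketch below reads this interleaved layout under a relaxed assumption, namely that names contain no spaces and are separated from the residues by whitespace (strict Phylip instead reserves the first ten columns for the name). The function name parse_phylip_interleaved is an assumption for this example.

```python
def parse_phylip_interleaved(text):
    """Parse relaxed interleaved Phylip text into a {name: sequence} dict."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ntax, nchar = (int(x) for x in lines[0].split())
    names, seqs = [], []
    for i, line in enumerate(lines[1:]):
        fields = line.split()
        if i < ntax:
            # First block: a name followed by space-separated residue groups.
            names.append(fields[0])
            seqs.append("".join(fields[1:]))
        else:
            # Later blocks: residues only, in the same sequence order.
            seqs[i % ntax] += "".join(fields)
    for name, seq in zip(names, seqs):
        if len(seq) != nchar:
            raise ValueError(f"{name}: expected {nchar} columns, got {len(seq)}")
    return dict(zip(names, seqs))


toy = """ 2 15
seqA  ACDEFGHIKL
seqB  ACDE-GHIKL

      MNPQR
      MN-QR
"""
print(parse_phylip_interleaved(toy))
```

The length check against the header catches truncated or mis-ordered blocks early.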

PIR Format

>P1;JC2395

-NVSDVNLNKYIWRTAEKMKICDAKKFARQHKIPESKIDEIEHNSPQDAAEQKIQLLQCW
YQSHGKTGACQALIQGLRKANRCDIAEEIQAM
*
>P1;FASA_MOUSE

-NASNLSLSKYIPRIAEDMTIQEAKKFARENNIKEGKIDEIMHDSIQDTAEQKVQLLLCW
YQSHGKSDAYQDLIKGLKKAECRRTLDKFQDM
*
>P1;KPEL_DROME

MAIRLLPLPVRAQLCAHLDALDVWQQLATAVKLYPDQVEQISSQKQRGRSASN-EFLNIW
GGQYNHT--VQTLFALFKKLKLHNAMRLIKDY
*

GDE

%JC2395
nvsdvnlnkyiwrtaekmkicdakkfarqhkipeskideiehnspqdaaeqkiqllqcwy
qshgktgacqaliqglrkanrcdiaeeiqam
%FASA_MOUSE
nasnlslskyipriaedmtiqeakkfarennikegkideimhdsiqdtaeqkvqlllcwy
qshgksdayqdlikglkkaecrrtldkfqdm
%KPEL_DROME
--mairllplpvraqlcahldaldvwqqlatavklypdqveqissqkqrgrsasneflni
wggqynhtvqtlfalfkklklhnamrlikdy

Nexus

#NEXUS
BEGIN DATA;
dimensions ntax=3 nchar=91;
format missing=?
symbols="ABCDEFGHIKLMNPQRSTUVWXYZ"
interleave datatype=PROTEIN gap= -;

matrix
JC2395      NVSDVNLNKYIWRTAEKMKICDAKKFARQHKIPESKIDEIEHNSPQDAAE
FASA_MOUSE  NASNLSLSKYIPRIAEDMTIQEAKKFARENNIKEGKIDEIMHDSIQDTAE
KPEL_DROME  --MAIRLLPLPVRAQLCAHLDALDVWQQLATAVKLYPDQVEQISSQKQRG

JC2395      QKIQLLQCWYQSHGKTGACQALIQGLRKANRCDIAEEIQAM
FASA_MOUSE  QKVQLLLCWYQSHGKSDAYQDLIKGLKKAECRRTLDKFQDM
KPEL_DROME  RSASNEFLNIWGGQYNHTVQTLFALFKKLKLHNAMRLIKDY
;
end;

General Feature Format (GFF)

The general feature format was developed so that annotations could be readily parsed by
a number of programs to quickly determine the location of various features. Example
uses of GFF include importing data into ACE formats for quick feature viewing, and
creating sequence images complete with features.

http://www.sanger.ac.uk/Software/formats/GFF/
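A GFF file holds one feature per line, with tab-separated fields for the sequence name, source, feature type, start, end, score, strand, frame, and an optional attributes (group) field. The Python sketch below parses one such line. The names GffRecord and parse_gff_line are assumptions for this example, and the sample line is hypothetical: it reuses the coding region 53..490 of NM_173917.1 from the GenBank example, with an invented source value.

```python
from collections import namedtuple

GffRecord = namedtuple(
    "GffRecord",
    "seqname source feature start end score strand frame attributes",
)


def parse_gff_line(line):
    """Parse one tab-separated GFF feature line into a GffRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a GFF line needs at least 8 tab-separated fields")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    attributes = fields[8] if len(fields) > 8 else ""
    return GffRecord(
        seqname, source, feature,
        int(start), int(end),
        None if score == "." else float(score),  # "." means no score
        strand, frame, attributes,
    )


# Hypothetical line for illustration only.
sample_line = 'NM_173917.1\tRefSeq\tCDS\t53\t490\t.\t+\t0\tgene "HBB"'
record = parse_gff_line(sample_line)
print(record.feature, record.start, record.end, record.end - record.start + 1)
```

Keeping start and end as integers makes it easy to locate features or compute their lengths, which is the main purpose described above.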

A description of multiple alignment formats is given on the BLOCKS server page:
http://www.blocks.fhcrc.org/blocks/help/blocksformat.html

The sequence formats used by EMBL are found at:
http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Themes/SequenceFormats.html


Sequence Conversion Programs

SeqIO
ReadSeq
