2nd Lec Student Copy - 3

Introduction to Bioinformatics 2nd lecture
Muhammad Muddassir Ali Muddassir.ali@uvas.edu.pk IBBT
Topics of this lecture

Course aims and learning goals
Databases (primary and secondary)
Different sequence formats
NCBI
Learning outcomes of the lecture

The student should be able to:
Define in detail the concept of a database and its types
Understand NCBI and the basic information related to it
Database
Is a database just a collection of data?
DB = structured (organized) collection of data
Must have:
Routines for adding new entries and updating old entries
Methods for handling user queries, i.e. access to the data
Biological databases are libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analyses.
They contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics.
The information contained in biological databases includes gene function, structure, localization (both cellular and chromosomal), and clinical effects of mutations, as well as similarities of biological sequences and structures.
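The two requirements above (routines for adding and updating entries, and methods for answering queries) can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table layout, the function names and the DEMO0001 entry are hypothetical and do not describe how any real biological database is implemented.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (accession TEXT PRIMARY KEY, description TEXT, sequence TEXT)"
)

def add_entry(accession, description, sequence):
    """Routine for adding a new entry."""
    conn.execute(
        "INSERT INTO entries VALUES (?, ?, ?)", (accession, description, sequence)
    )

def update_description(accession, description):
    """Routine for updating an existing entry."""
    conn.execute(
        "UPDATE entries SET description = ? WHERE accession = ?",
        (description, accession),
    )

def find_by_keyword(keyword):
    """Query method: accessions whose description contains the keyword."""
    rows = conn.execute(
        "SELECT accession FROM entries WHERE description LIKE ? ORDER BY accession",
        ("%" + keyword + "%",),
    )
    return [row[0] for row in rows]

# Hypothetical entry, used only to exercise the three routines.
add_entry("DEMO0001", "hypothetical demo mRNA", "ACAAGATGCC")
update_description("DEMO0001", "hypothetical demo mRNA, complete cds")
print(find_by_keyword("complete cds"))
```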
Types of DBs:
On the basis of the format in which data is stored
Relational
Object-oriented
Flat file, i.e. all DB entries stored in text file(s)
Many biological DBs are in flat file format:
Historical reasons (that's what biologists started with)
Easy to distribute, download and store
No need for database management software
Types of DBs:
On the basis of the kind of data included
Primary database: data from biological experiments
Secondary database: data from analysis
Examples
Primary databases: GenBank, Entrez, EMBL and DDBJ, The Sequence Retrieval System, SWISS-PROT, UniProt
Secondary databases: PROSITE, PRINTS, Pfam, InterPro
Sequence formats
Plain sequence format
A file in plain sequence format may contain only one sequence, while most other formats accept several sequences in one file.
An example sequence in plain format is:
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGC
CACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGAC
AGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCCTCCTG
ACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGC
CCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCG
CACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTT
CTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGT
CACGCAAGTTTAATTACAGACCTGAA
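Because a plain-format file holds a single sequence and nothing else, reading it amounts to removing line breaks and spaces. A minimal sketch, with a hypothetical function name and a shortened input:

```python
def read_plain_sequence(text):
    """Join a plain-format sequence into one uppercase string.

    All whitespace (line breaks, spaces, tabs) is dropped.
    """
    return "".join(text.split()).upper()

# First two lines of the example above, shortened for illustration.
raw = """ACAAGATGCCATTGTCCCCC
GGCCTCCTGC"""
sequence = read_plain_sequence(raw)
print(len(sequence), sequence)
```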
EMBL format
A file in EMBL format can contain several sequences.
One sequence entry starts with an identifier line ("ID"), followed by further annotation lines.
The start of the sequence is marked by a line starting with "SQ", and the end of the sequence is marked by two slashes ("//").
An example sequence in EMBL format is:
ID AB000263 standard; RNA; 240 BP.
XX
AC AB000263;
XX
DE Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
XX
SQ Sequence 240 BP;
acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg 60
ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg 120
caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc 180
aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag 240
//
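The three markers described for EMBL format ("ID", "SQ" and "//") are enough to pull the sequences out of a file. The following is a rough sketch under that assumption, not a full EMBL parser: the function name is hypothetical and all annotation lines other than "ID" are ignored.

```python
def parse_embl(text):
    """Return a list of (identifier, sequence) pairs from EMBL-style text."""
    entries = []
    identifier, chunks, in_sequence = None, [], False
    for line in text.splitlines():
        if line.startswith("//"):
            # End of the current entry.
            entries.append((identifier, "".join(chunks)))
            identifier, chunks, in_sequence = None, [], False
        elif in_sequence:
            # Keep letters only: drops spaces and the trailing position counter.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ID"):
            identifier = line.split()[1]
        elif line.startswith("SQ"):
            in_sequence = True
    return entries

EMBL_SAMPLE = """\
ID AB000263 standard; RNA; 240 BP.
XX
AC AB000263;
XX
SQ Sequence 240 BP;
acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg 60
ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg 120
caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc 180
aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag 240
//
"""

embl_entries = parse_embl(EMBL_SAMPLE)
for identifier, sequence in embl_entries:
    print(identifier, len(sequence))
```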
FASTA format
The name comes from "FAST-All": it covers both protein (FASTP) and nucleotide (FASTN) sequences.
A file in FASTA format can contain several sequences.
Each sequence begins with a single-line description, followed by lines of sequence data.
The description line must begin with a greater-than (">") symbol in the first column.
An example sequence in FASTA format is:
>AB000263 |acc=AB000263|descr=Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.|len=368
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGC
TCTCCGGGGCCACGGCCACCGCTGCCCTGCCCCTGGAGG
GTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAA
GCGGCAGGAATAAGGAAAAGCAGCCTCCTGACTTTCCT
CGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGG
GCCCCTCATAGGAGAGG
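The FASTA rule (a ">" in the first column starts a new record, everything up to the next ">" is sequence data) can be sketched as a small reader. The function name and the two short records below are hypothetical and serve only to show a file with several sequences.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA text."""
    records = []
    description, chunks = None, []
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue
        if line.startswith(">"):
            # A ">" in the first column starts a new record.
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:], []
        elif description is not None:
            # Sequence data; internal spaces are dropped.
            chunks.append("".join(line.split()))
    if description is not None:
        records.append((description, "".join(chunks)))
    return records

# Two short, hypothetical records.
FASTA_SAMPLE = """\
>demo1 first hypothetical record
ACAAGATGCC
ATTGTCCCCC
>demo2 second hypothetical record
GGCCTCCTGC
"""

fasta_records = parse_fasta(FASTA_SAMPLE)
for description, sequence in fasta_records:
    print(description, len(sequence))
```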
GCG format
Contains exactly one sequence
Begins with annotation lines
Start of the sequence is marked by a line ending with two dot ("..") characters.
This line also contains the sequence identifier and the sequence length.
The format should only be used if the file was created with the GCG package.
An example sequence in GCG format is:
ID AB000263 standard; RNA; 121 BP.
XX
AC AB000263
XX
DE Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
XX
SQ Sequence 121 BP;
AB000263 Length: ..
1 acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
61 ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg
121 caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc
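A reader for GCG-style text only needs to find the line ending in ".." and treat everything after it as sequence. The sketch below assumes the identifier is the first word on that line; the function name and the DEMO1 entry are hypothetical, and the sample is much simpler than a file written by the GCG package itself.

```python
def parse_gcg(text):
    """Return (identifier, sequence) from GCG-style text holding one sequence."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            # Assumption: the identifier is the first word on the ".." line.
            identifier = line.split()[0]
            body = lines[index + 1:]
            # Keep letters only: drops position numbers and spaces.
            sequence = "".join(ch for row in body for ch in row if ch.isalpha())
            return identifier, sequence
    raise ValueError("no line ending with '..' found")

# Hypothetical, simplified entry.
GCG_SAMPLE = """\
DE hypothetical demo entry
DEMO1 Length: 20 ..
1 acaagatgcc attgtccccc
"""

gcg_identifier, gcg_sequence = parse_gcg(GCG_SAMPLE)
print(gcg_identifier, len(gcg_sequence))
```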
IG format
A sequence file in IG format can contain several sequences. Each consists of:
a number of comment lines that must begin with a semicolon (";")
a line with the sequence name (it may not contain spaces!)
the sequence itself, terminated with the termination character '1' for linear or '2' for circular sequences
An example sequence in IG format is:
; comment
; comment
AB000263
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCG
GGGCCACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACC
GGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGG
AAAAGCAGCCTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGAC
CTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGG
GAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCG
CGCGCCGGGACAGAATGCC1
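The IG layout (comment lines, then a name line, then sequence lines ending in '1' or '2') can be read with a small state machine. This is a sketch under those rules only; the function name and the second, circular entry are hypothetical.

```python
def parse_ig(text):
    """Return a list of (name, sequence, topology) tuples from IG-style text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # blank line or comment line
        if name is None:
            name = line  # first non-comment line is the sequence name
            continue
        if line[-1] in "12":
            # Termination character: '1' = linear, '2' = circular.
            chunks.append(line[:-1])
            topology = "linear" if line[-1] == "1" else "circular"
            records.append((name, "".join(chunks), topology))
            name, chunks = None, []
        else:
            chunks.append(line)
    return records

IG_SAMPLE = """\
; comment
; comment
AB000263
ACAAGATGCC
ATTGTCCCCC1
; a second, hypothetical entry
PLASMID1
GGCCTT2
"""

ig_records = parse_ig(IG_SAMPLE)
for name, sequence, topology in ig_records:
    print(name, len(sequence), topology)
```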
GenBank format
A file in GenBank format can contain several sequences.
One sequence entry starts with a line containing the word LOCUS, followed by a number of annotation lines.
The start of the sequence is marked by a line containing "ORIGIN" and the end of the sequence is marked by two slashes ("//").
An example sequence in GenBank format is:
LOCUS AB000263 181 bp mRNA linear PRI 05-FEB-1999
DEFINITION Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
ACCESSION AB000263
ORIGIN
1 acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
61 ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg
121 caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc
181 aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag
//
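The GenBank markers (LOCUS, ORIGIN and "//") can be used in the same way to split a file into entries. The sketch below also picks up the ACCESSION line; it is not a complete GenBank parser, and the function name and the DEMO0001 entry are hypothetical.

```python
def parse_genbank(text):
    """Return a list of dicts (locus, accession, sequence) from GenBank-style text."""
    entries = []
    entry, chunks, in_sequence = {}, [], False
    for line in text.splitlines():
        if line.startswith("//"):
            # End of the current entry.
            entry["sequence"] = "".join(chunks)
            entries.append(entry)
            entry, chunks, in_sequence = {}, [], False
        elif in_sequence:
            # Keep letters only: drops position numbers and spaces.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("LOCUS"):
            entry["locus"] = line.split()[1]
        elif line.startswith("ACCESSION"):
            entry["accession"] = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_sequence = True
    return entries

# Hypothetical, shortened entry.
GENBANK_SAMPLE = """\
LOCUS       DEMO0001 20 bp mRNA linear
DEFINITION  hypothetical demo entry.
ACCESSION   DEMO0001
ORIGIN
        1 acaagatgcc attgtccccc
//
"""

genbank_entries = parse_genbank(GENBANK_SAMPLE)
for item in genbank_entries:
    print(item["locus"], item["accession"], len(item["sequence"]))
```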
NCBI
GenBank is hosted at NCBI (National Center for Biotechnology Information)
GenBank was founded in 1982
Stores nucleotide sequences
Anyone can add new data:
BankIt: web-based submission of a single sequence
Sequin: software for larger submissions
Scientific journals require sequence submission
Locus: unique, but may change; shows organism (here, SCU = S. cerevisiae)
Accession number: unique and permanent; no info, just a number; most reliable identification
CDS: coding sequence; exons; translation shown
Origin: the actual nucleotide sequence
Summary
Databases
Types of databases
Sequence formats
NCBI
