
Introduction to Bioinformatics 2nd lecture

Muhammad Muddassir Ali Muddassir.ali@uvas.edu.pk IBBT

Topics of this lecture


Course aims and learning goals
Databases (primary and secondary)
Different sequence formats
NCBI

Learning outcomes of the lecture


The student should be able to:
Define in detail the concept of a database and its types

Understand NCBI and the basic information related to it

Database
Is a database just a collection of data?
DB = structured (organized) collection of data
Must have:
  Routines for adding new entries and updating old entries
  Methods for handling user queries, i.e. access to the data

Biological databases are libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analyses.
They contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics.
The information contained in biological databases includes gene function, structure, localization (both cellular and chromosomal), and clinical effects of mutations, as well as similarities of biological sequences and structures.
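The two requirements listed for a DB (routines for adding and updating entries, and methods for handling user queries) can be illustrated with a minimal sketch. The class `TinySequenceDB` and its method names are hypothetical and exist only for this illustration; real biological databases are far more elaborate.

```python
class TinySequenceDB:
    """Hypothetical in-memory database keyed by accession number."""

    def __init__(self):
        self._entries = {}

    def add(self, accession, sequence, description=""):
        """Routine for adding a new entry; refuses to overwrite an old one."""
        if accession in self._entries:
            raise KeyError(f"{accession} already exists; use update()")
        self._entries[accession] = {
            "sequence": sequence,
            "description": description,
        }

    def update(self, accession, **fields):
        """Routine for updating an old entry (KeyError if it does not exist)."""
        self._entries[accession].update(fields)

    def query(self, keyword):
        """Handle a user query: accessions whose description contains keyword."""
        keyword = keyword.lower()
        return sorted(
            accession
            for accession, entry in self._entries.items()
            if keyword in entry["description"].lower()
        )


db = TinySequenceDB()
db.add("AB000263", "ACAAGATGCC", "Homo sapiens mRNA")
db.update("AB000263", description="Homo sapiens mRNA for prepro cortistatin like peptide")
print(db.query("cortistatin"))
```

The point of the sketch is only that the data is organized (here, by accession number) and that adding, updating and querying go through defined routines rather than through direct edits of the stored data.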

Types of DBs:

On the basis of the format in which data is stored
Relational
Object-oriented
Flat file, i.e. all DB entries stored in text file(s)

Many biological DBs are in flat file format:
Historical reasons (that's what biologists started with)
Easy to distribute, download and store
No need for database management software

Types of DBs:
On the basis of the kind of data included
Primary and secondary databases

Biological experiments -> Primary database -> Analysis -> Secondary database

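Examples
Primary databases
GenBank, EMBL and DDBJ (nucleotide sequences)
SWISS-PROT, UniProt (protein sequences)
Retrieval systems used to query these databases: Entrez, the Sequence Retrieval System (SRS)

Secondary databases
PROSITE, PRINTS, Pfam, InterPro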

Sequence formats
Plain sequence format
A file in plain sequence format may contain only one sequence, while most other formats accept several sequences in one file.

An example sequence in plain format is:
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGC
CACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGAC
AGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCCTCCTG
ACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGC
CCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCG
CACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTT
CTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGT
CACGCAAG TTTAATTACAGACCTGAA
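Because a plain-format file holds a single sequence and nothing else, reading it can be sketched in a few lines. The function name `read_plain_sequence` is hypothetical, and the sketch assumes that only letters and whitespace (line breaks, spaces) appear in the file.

```python
def read_plain_sequence(text):
    """Return the single sequence in plain-format text as one uppercase string."""
    # Line breaks and spaces carry no meaning in plain format, so drop them.
    sequence = "".join(text.split())
    if not sequence.isalpha():
        raise ValueError("expected sequence letters only")
    return sequence.upper()


print(read_plain_sequence("ACAAGATGCC\nATTGTCCCCC\n"))
```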

EMBL format
A sequence file in EMBL format can contain several sequences.
One sequence entry starts with an identifier line ("ID"), followed by further annotation lines.
The start of the sequence is marked by a line starting with "SQ" and the end of the sequence is marked by two slashes ("//").
An example sequence in EMBL format is:
ID   AB000263 standard; RNA; 240 BP.
XX
AC   AB000263;
XX
DE   Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
XX
SQ   Sequence 240 BP;
     acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg        60
     ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg       120
     caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc       180
     aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag       240
//
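The three markers described for EMBL format ("ID", "SQ" and "//") are enough to pull identifiers and sequences out of a file. The following is a minimal sketch, not a full EMBL parser: the function name `parse_embl` is hypothetical, all annotation lines other than "ID" are ignored, and the toy entry is shortened for illustration.

```python
def parse_embl(text):
    """Return a list of (identifier, sequence) tuples from EMBL-formatted text."""
    entries = []
    identifier, in_sequence, chunks = None, False, []
    for line in text.splitlines():
        if line.startswith("//"):
            # Two slashes close the current entry.
            entries.append((identifier, "".join(chunks)))
            identifier, in_sequence, chunks = None, False, []
        elif in_sequence:
            # Keep the blocks of bases, skip the trailing position count.
            chunks.extend(tok for tok in line.split() if tok.isalpha())
        elif line.startswith("ID"):
            identifier = line.split()[1].rstrip(";")
        elif line.startswith("SQ"):
            in_sequence = True
    return entries


toy_entry = """ID   AB000263 standard; RNA; 20 BP.
XX
AC   AB000263;
XX
SQ   Sequence 20 BP;
     acaagatgcc attgtccccc        20
//
"""
print(parse_embl(toy_entry))
```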

FASTA format
The name stands for "FAST-All": the format is used for both protein (FAST-P) and nucleotide (FAST-N) sequences.
A sequence file in FASTA format can contain several sequences.
Each sequence begins with a single-line description, followed by lines of sequence data. The description line must begin with a greater-than (">") symbol in the first column.
>AB000263 |acc=AB000263|descr=Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.|len=368
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGC
TCTCCGGGGCCACGGCCACCGCTGCCCTGCCCCTGGAGG
GTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAA
GCGGCAGGAATAAGGAAAAGCAGCCTCCTGACTTTCCT
CGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGG
GCCCCTCATAGGAGAGG
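The rule that every record starts with a ">" description line makes FASTA straightforward to read. A minimal sketch follows; the function name `parse_fasta` and the toy records are made up for illustration, and established libraries such as Biopython offer more robust readers.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA text."""
    records = []
    description, chunks = None, []
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:], []
        elif description is not None:
            chunks.append(line.strip())
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


toy_fasta = """>seq1 first toy record
ACAAGATGCC
ATTGTCCCCC
>seq2 second toy record
GGCCTCCTGC
"""
for description, sequence in parse_fasta(toy_fasta):
    print(description, len(sequence))
```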

GCG format
A sequence file in GCG format contains exactly one sequence and begins with annotation lines.
The start of the sequence is marked by a line ending with two dot ("..") characters. This line also contains the sequence identifier and the sequence length.
This format should only be used if the file was created with the GCG package.

An example sequence in GCG format is:

ID AB000263 standard; RNA; 121 BP.
XX
AC AB000263
XX
DE Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
XX
SQ Sequence 121 BP;
AB000263 Length: ..
  1 acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
 61 ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg
121 caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc
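The GCG layout can be read by searching for the line that ends with "..": everything before it is annotation, everything after it is sequence. This is a rough sketch only. The function name `parse_gcg` is hypothetical, it assumes the identifier is the first word on the ".." line, and the toy input (including its length value) is invented for illustration.

```python
def parse_gcg(text):
    """Return (identifier, sequence) from GCG-formatted text (one sequence per file)."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            break
    else:
        raise ValueError("no line ending with '..' found")
    # Assumption: the identifier is the first word on the '..' line.
    identifier = lines[index].split()[0]
    bases = []
    for line in lines[index + 1:]:
        # Sequence lines start with a position number, which is skipped here.
        bases.extend(tok for tok in line.split() if tok.isalpha())
    return identifier, "".join(bases)


toy_gcg = """Toy annotation line
SEQ1 Length: 20 ..
  1 acaagatgcc attgtccccc
"""
print(parse_gcg(toy_gcg))
```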

IG formats
A sequence file in IG format can contain several sequences, each consisting of a number of comment lines that must begin with a semicolon (";"), a line with the sequence name (it may not contain spaces!) and the sequence itself terminated with the termination character '1' for linear or '2' for circular sequences.
An example sequence in IG format is:
; comment
; comment
AB000263
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCG
GGGCCACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACC
GGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGG
AAAAGCAGCCTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGAC
CTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGG
GAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCG
CGCGCCGGGACAGAATGCC1
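The IG rules (comment lines starting with ";", then a name line, then sequence lines ending in '1' or '2') translate into a small state machine. The sketch below is illustrative only; the function name `parse_ig` and the toy records are assumptions, not part of any standard tool.

```python
def parse_ig(text):
    """Return a list of (name, sequence, topology) tuples from IG-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # comment lines begin with a semicolon
        if name is None:
            name = line  # first non-comment line is the sequence name
            continue
        if line[-1] in ("1", "2"):
            # The termination character also encodes the topology.
            chunks.append(line[:-1])
            topology = "linear" if line[-1] == "1" else "circular"
            records.append((name, "".join(chunks), topology))
            name, chunks = None, []
        else:
            chunks.append(line)
    return records


toy_ig = """; comment
; comment
SEQ1
ACAAGATGCC
ATTGT1
; another comment
SEQ2
GGCC2
"""
print(parse_ig(toy_ig))
```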

GenBank format
A sequence file in GenBank format can contain several sequences.
One sequence starts with a line containing the word LOCUS, followed by a number of annotation lines.
The start of the sequence is marked by a line containing "ORIGIN" and the end of the sequence is marked by two slashes ("//").
LOCUS       AB000263 181 bp mRNA linear PRI 05-FEB-1999
DEFINITION  Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
ACCESSION   AB000263
ORIGIN
        1 acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
       61 ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg
      121 caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc
      181 aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag
//
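The same marker-based reading works for GenBank format, with LOCUS opening a record, ORIGIN opening the sequence and "//" closing it. The sketch below is a simplified illustration: the function name `parse_genbank` is hypothetical, only the locus name and the sequence are extracted, and the toy record is shortened.

```python
def parse_genbank(text):
    """Return a list of (locus_name, sequence) tuples from GenBank-formatted text."""
    records = []
    locus, in_origin, bases = None, False, []
    for line in text.splitlines():
        if line.startswith("//"):
            # Two slashes close the current record.
            if locus is not None:
                records.append((locus, "".join(bases)))
            locus, in_origin, bases = None, False, []
        elif in_origin:
            # Keep the blocks of bases, skip the leading position number.
            bases.extend(tok for tok in line.split() if tok.isalpha())
        elif line.startswith("LOCUS"):
            locus = line.split()[1]
        elif line.upper().startswith("ORIGIN"):
            # Tolerates mixed-case spellings of the ORIGIN marker.
            in_origin = True
    return records


toy_genbank = """LOCUS       TOY0001 20 bp mRNA linear
DEFINITION  Toy record for illustration.
ACCESSION   TOY0001
ORIGIN
        1 acaagatgcc attgtccccc
//
"""
print(parse_genbank(toy_genbank))
```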

NCBI
GenBank at NCBI (National Center for Biotechnology Information)
GenBank was founded in 1982
Nucleotide sequences
Anyone can add new data:
  BankIt: web-based submission of a single sequence
  Sequin: software for larger submissions
Scientific journals require sequence submission

Locus
  Unique, but may change
  Shows organism
  Here, SCU = S. cerevisiae
Accession number
  Unique and permanent
  No info, just a number
  Most reliable identification
CDS
  Coding sequence
  Exons
  Translation shown
Origin
  The actual nucleotide sequence

Summary
Databases
Types of databases
Sequence formats
NCBI
