Central Dogma
Analysis 1 (part 1)
Review of basic biology + database searching in
biology.
Hugues Sicotte
NCBI
The Flow of Biotechnology
Information
Gene Function
How does an mRNA specify amino acid sequence? The answer lies in
the genetic code. It would be impossible for each amino acid to be
specified by one nucleotide, because there are only 4 nucleotides and 20
amino acids. Similarly, two-nucleotide combinations could only specify
16 amino acids. Three nucleotides give 64 combinations, more than enough
for 20 amino acids, so each amino acid is specified by a particular
combination of three nucleotides, called a codon.
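The counting argument above can be checked directly. The following is a minimal Python sketch (the variable names are illustrative, not from the slides) that enumerates every word of length 1, 2 and 3 over the four RNA nucleotides and compares the count with the 20 amino acids:

```python
from itertools import product

NUCLEOTIDES = "ACGU"      # the four RNA bases
AMINO_ACIDS = 20          # standard amino acids to be encoded

# Number of distinct words of each length over a 4-letter alphabet.
codon_counts = {}
for length in (1, 2, 3):
    words = ["".join(p) for p in product(NUCLEOTIDES, repeat=length)]
    codon_counts[length] = len(words)
    enough = len(words) >= AMINO_ACIDS
    print(f"{length} nucleotide(s): {len(words)} combinations, enough: {enough}")

# Shortest word length that can encode all 20 amino acids.
shortest = min(n for n, count in codon_counts.items() if count >= AMINO_ACIDS)
```

Because 64 is larger than 20, several codons can map to the same amino acid, and some are left over to act as stop signals.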
[Figure: secondary structure and three-dimensional (tertiary) structure of tRNA]
Bacterial Gene Prediction
Gene finding in eukaryotic cDNA uses ORF finding + blastx as well.
http://www.ncbi.nlm.nih.gov/gorf/gorf.html
Try with gi=41 (or your own piece of DNA).
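To make the idea of ORF finding concrete, here is a minimal Python sketch. It is not the algorithm used by the NCBI ORF Finder; it is a simplified illustration that assumes ATG as the only start codon, the three standard stop codons, and an unambiguous A/C/G/T sequence. The function names and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq, min_codons=10):
    """Scan all six reading frames for ATG...stop open reading frames.

    Returns (strand, frame, start, end) tuples. Coordinates are 0-based,
    end-exclusive, and relative to the strand being scanned.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None                   # look for the next ORF
    return orfs

# Toy sequence (hypothetical, not gi=41): one short ORF on the forward strand.
demo = "CCATGAAATTTGGGTAACC"
orfs = find_orfs(demo, min_codons=2)
print(orfs)
```

Each candidate ORF would then be translated and compared against a protein database (the blastx step) to decide whether it is likely to be a real gene.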
Eukaryotic Central Dogma
In eukaryotes (cells where the DNA is sequestered in a separate nucleus):
The DNA does not contain a contiguous copy of the coding sequence; rather, the exons must be spliced together.
(Many eukaryotic genes nevertheless contain no introns; this is particularly true in 'lower' organisms.)
mRNA (messenger RNA): contains the assembled copy of the gene. The mRNA acts as a
messenger to carry the information stored in the DNA in the nucleus to the cytoplasm,
where the ribosomes can make it into protein.
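Splicing can be pictured as joining the exon intervals of the primary transcript in order and discarding everything in between. The sketch below assumes the exon coordinates are already known; the sequences are toy examples (the intron simply starts with GU and ends with AG), and the `splice` helper is hypothetical.

```python
def splice(pre_mrna, exons):
    """Join exon intervals (0-based, end-exclusive) in order, dropping introns."""
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))

# Toy primary transcript: exon 1, one intron, exon 2.
exon1 = "AUGGCC"
intron = "GUAAGU" + "UUUUUU" + "CAG"
exon2 = "UUUUAA"
pre_mrna = exon1 + intron + exon2

# Exon coordinates derived from the pieces above.
exons = [(0, len(exon1)), (len(exon1) + len(intron), len(pre_mrna))]

mrna = splice(pre_mrna, exons)
print(mrna)
```

A gene with no introns is the special case of a single exon interval covering the whole transcript.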
Eukaryotic Nuclear Gene Structure
[Figure: intron with branch-point sequence UACUAAC and 3' splice site AG]
Exons
•The exons of the transcript region are
composed of:
•5’ UTR (mean length of 769 bp, with a
specific base composition that
depends on the local G+C content of the
genome)
•AUG (or other start codon)
•Remainder of coding region
•Stop codon
•3’ UTR (mean length of 457 bp, with a
specific base composition that
depends on the local G+C content of the
genome)
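The layout listed above (5’ UTR, start codon, coding region, stop codon, 3’ UTR) can be illustrated with a short Python sketch. It makes a strong simplifying assumption that is not in the slides: translation starts at the first AUG in the mRNA. The function name and the toy sequence are hypothetical.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}

def split_transcript(mrna):
    """Split a spliced mRNA into (5' UTR, coding region, 3' UTR).

    Assumes the first AUG is the start codon; returns None if there is
    no AUG or no in-frame stop codon.
    """
    start = mrna.find("AUG")
    if start == -1:
        return None
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            end = i + 3                      # stop codon belongs to the CDS
            return mrna[:start], mrna[start:end], mrna[end:]
    return None

parts = split_transcript("GGCAUGGCCUUUUAAGGAAA")
print(parts)
```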
Structure of the Eukaryotic Genome
Pseudogenes:
DNA sequence that might code for a
gene, but that is unable to produce a protein.
This deficiency might be in transcription (lack of
promoter, for example), in translation, or both.
Processed pseudogenes:
Gene retroposed back into the genome
after being processed by the splicing apparatus.
Thus it is fully spliced and has a polyA tail.
The insertion process flanks the mRNA sequence with
short direct repeats.
Thus it has no promoter, unless it is accidentally
retroposed downstream of a promoter
sequence.
Do not confuse with single-exon genes.
Repeats
Each repeat family has many subfamilies.
- ALU: ~300 nt long; 600,000 elements in the human
genome. Can cause false homology with mRNA.
Many have an AluI restriction site.
- Retroposons (can get copied back into the
genome).
- Telltale sign: direct or inverted repeats flank
the repeated element. That repeat was the
priming site for the RNA that was inserted.
LINEs (Long INterspersed Elements)
L1: 1-7 kb long, 50,000 copies
Have two ORFs! Will cause problems
for gene prediction programs.
SINEs (Short INterspersed Elements)
Low-Complexity Elements
The protein as read off from the mRNA may not be in the final
form that will be used in the cell. Some proteins contain:
• Signal peptide (located at the N-terminus (beginning)): this signal
peptide is used to guide the protein towards its
final cellular localization. The signal peptide is cleaved off at
the cleavage site once the protein has reached (or is near) its final
destination.
• Various post-translational modifications (e.g. phosphorylation)
• Introduction to Databases
• Searching the Internet for Biology
Information.
– General Search methods
– Biology Web sites
• Introduction to Genbank file format.
• Introduction to Entrez and Pubmed
Genbank division
PRI        6226959;6226762;4557224;…
MAM        41;…
Accession
NM_000014  6226959;6226762;4557224;
X63129     41;
Indexed searches.
• Boolean Query: Merging and Intersecting lists:
– AND (in both lists) (e.g. human AND genome)
– +human +genome
– human && genome
– OR (in either list) (e.g. human OR genome)
– human || genome
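An indexed search can be sketched as a dictionary that maps each index term to the set of record identifiers (GIs) containing it; AND and OR then become set intersection and union. The sketch below uses only the GIs listed in the index example on the earlier slide (the elided entries are left out), and the `index` layout itself is an assumption made for illustration.

```python
# Term -> set of GIs, using only the GIs shown in the slide's index example.
index = {
    "PRI": {6226959, 6226762, 4557224},
    "MAM": {41},
    "NM_000014": {6226959, 6226762, 4557224},
    "X63129": {41},
}

def query_and(index, term_a, term_b):
    """Records present in both lists (Boolean AND)."""
    return index.get(term_a, set()) & index.get(term_b, set())

def query_or(index, term_a, term_b):
    """Records present in either list (Boolean OR)."""
    return index.get(term_a, set()) | index.get(term_b, set())

print(query_and(index, "MAM", "X63129"))   # shared record(s)
print(query_and(index, "PRI", "MAM"))      # empty: no record is in both divisions
print(sorted(query_or(index, "PRI", "MAM")))
```

Because each list is precomputed, a query only has to merge or intersect short lists of identifiers instead of scanning every record.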
Search strategies
• Search engines use complex strategies that go
beyond Boolean queries.
– Phrase matching:
• human genome -> “human genome”
– Togetherness: documents with human close to genome
are scored higher.
– Term expansion & synonyms:
• human -> homo sapiens
– Neighbours:
– human genome -> genome projects, chromosomes, genetics
– Frequency of links (www.google.com)
• To avoid this term mapping, enclose your queries in quotes:
“human” AND “genome”
(disclaimer.. I don’t own stock in that company.. But I’d like to)
• Search Engines:
– Curated Collections: Not comprehensive:
Contains list of best sites for commonly
requested topics, but is missing important sites
for more specialized topics (like biology)
• www.yahoo.com (Has travel maps too!)
– Answer-based curated collections: Easy to
use, English-like queries. First looks at a list of
predefined answers, then refines answers based
on user interaction. Also answers new questions.
• www.askjeeves.com
• www.magellan.com
• www.altavista.com (has translation tools)
• www.hotbot.com
• Search Engines:
– Meta-Search Engines: Poll several search
engines and return the consensus of all results.
Likely to miss sites, but the sites returned are
very relevant to the query.
– The other operating mode is to return the sum of all
the results. It then becomes very sensitive to a
very detailed query.
• www.metacrawler.com
• www.savvysearch.com
• www.1blink.com (fast)
• www.metafind.com
• www.dogpile.com
• Virtual Libraries: Curated collections of
links for biologists (by biologists).
– Pedro’s BioMolecular Research Tools (1996):
• http://www.public.iastate.edu/~pedro/
– Virtual Library: Bio Sciences
• http://vlib.org/Biosciences.html
– Publications and abstract search.
• http://www.ncbi.nlm.nih.gov/
– Expasy server
• http://www.expasy.ch
– EBI Biocatalog (software & databases list)
• http://www.ebi.ac.uk/biocat/
Biological Databases
• Nucleotide databases:
– Genbank: International Collaboration
• NCBI(USA), EMBL(Europe), DDBJ (Japan and Asia)
• A “bank”: no curation. Submission to these databases is
required for publication in a journal.
– Organism-specific databases (Exercise: find the URLs
using search engines)
• FlyBase
• ChickGBASE
• pigbase
• wormpep
• YPD (Yeast Protein Database)
• SGD(Saccharomyces Genome Database)
• Protein Databases:
– NCBI:
– Swiss Prot:(Free for academic use, otherwise
commercial. Licensing restrictions on discoveries made
using the DB. 1998 version free of any licensing)
• http://www.expasy.ch (latest pay version)
• NCBI has the latest free version.
• Translated Proteins from Genbank Submissions
– EMBL
• TrEMBL is a computer-annotated supplement of SWISS-PROT
that contains all the translations of EMBL nucleotide sequence
entries not yet integrated in SWISS-PROT
– PIR
• Structure databases:
– PDB: Protein structure database.
• http://www.rcsb.org/pdb/
– MMDB: NCBI’s version of PDB with Entrez
links.
• http://www.ncbi.nlm.nih.gov
• Genome Mapping Information:
– http://www.il-st-acad-sci.org/health/genebase.html
– NCBI(Human)
– Genome Centers:
• Stanford, Washington University, Stanford
– Research Centers and Universities
• Literature databases:
– NCBI: Pubmed: All biomedical literature.
• www.ncbi.nlm.nih.gov
• Abstracts and links to publisher sites for
– full text retrieval/ordering
– journal browsing.
– Publisher web sites.
– Biomednet: Commercial site for literature
search.
• Pathways Database:
– KEGG: Kyoto Encyclopedia of Genes and
Genomes: www.genome.ad.jp/kegg/kegg/html
• Database Identifiers: Primary keys
– GI (NCBI only; changes with each sequence update)
• Annotation may change without the GI changing!
– Accession (stable)
– Version (changes with each sequence update)
– “Version” also refers to Accession.version
– Secondary accession: records may have been
merged in the past, so the records that were
not chosen as the primary were made
secondary.
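The relationship between the stable accession and the changing version can be shown with a small Python sketch that splits an Accession.version identifier into its two parts. The helper names are hypothetical; the identifiers are the ones that appear elsewhere in these slides.

```python
def split_accession(identifier):
    """Split 'Accession.version' into (accession, version).

    The version is None when the identifier carries no '.version' suffix.
    """
    accession, dot, version = identifier.partition(".")
    return accession, (int(version) if dot else None)

def same_record(id_a, id_b):
    """True when two identifiers share the same stable accession."""
    return split_accession(id_a)[0] == split_accession(id_b)[0]

print(split_accession("X63129.1"))     # accession plus sequence version
print(split_accession("NM_000014"))    # bare accession, no version given
print(same_record("X63129.1", "X63129.2"))
```

Comparing on the accession alone finds the record regardless of sequence updates, while comparing on the full Accession.version pins down one exact sequence.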
Primary Databases
• A primary Database is a repository of data
derived from experiments or from research
knowledge.
– Genbank (Nucleotide repository)
– Protein DB, Swissprot
– PDB (MMDB) are primary databases.
– Pubmed (literature)
– Genome Mapping databases.
– Kegg Database.(pathways)
Secondary Databases
• A secondary database contains information
derived from other sources.
– Refseq (Curated collection of Genbank at
NCBI)
– Unigene (Clustering of ESTs at NCBI)
• Organism-specific databases are often a mix
between primary and secondary.
Genbank Records
• A bank: no attempt at reconciliation.
• Submit a sequence, get an accession number!
– Cannot modify sequences without submitter’s consent.
– No attempt at reconciliation (not a unique collection per
LOCUS/gene).
– Entries of various sequence quality and from different
sources ==> separated into various divisions:
• High-quality sequences in taxon-specific divisions.
• Low-quality sequences in usage-specific databases.
• A Collaboration between NCBI, EMBL and
DDBJ. They contain (nearly) the same
information, only the data format differs.
EMBL does not differentiate between the different types of RNA
records, while NCBI (and DDBJ) do. In Entrez EMBL records are
patched up to add that information.
Refseq and LocusLink
• Attempt to produce 1 mRNA, 1 protein, and
1 genomic gene for each frequently
occurring allele of a protein-expressing gene.
• www.ncbi.nlm.nih.gov/LocusLink
• Special non-genbank Accession numbers
– NM_nnnnnn mRNA refseq
– NP_nnnnnn protein refseq
– NC_nnnnnn refseq genomic contig
– NT_nnnnnn temporary genomic contig
– NX_nnnnnn predicted gene
Genbank divisions
Example:
>gi|41|emb|X63129.1|BTA1AT B.taurus mRNA for alpha-1-anti-trypsin
GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTC
CATCACGCGGGGCCTTCTGCTGCTGGC ….
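The definition line of this FASTA record packs several identifiers into one `|`-separated field. The Python sketch below pulls them apart. It assumes exactly the five-field layout shown in the example (gi, GI number, database tag, Accession.version, locus name); other records may use different layouts, and the function name is hypothetical.

```python
def parse_defline(defline):
    """Parse a 'gi|<gi>|<db>|<accession.version>|<locus> description' defline."""
    identifiers, _, description = defline.lstrip(">").partition(" ")
    fields = identifiers.split("|")
    if len(fields) != 5 or fields[0] != "gi":
        raise ValueError("unexpected defline layout: " + defline)
    return {
        "gi": int(fields[1]),
        "database": fields[2],            # 'emb' marks an EMBL record
        "accession_version": fields[3],
        "locus": fields[4],
        "description": description,
    }

record = parse_defline(
    ">gi|41|emb|X63129.1|BTA1AT B.taurus mRNA for alpha-1-anti-trypsin"
)
print(record)
```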
Modified FASTA Format
1) A few tools follow the convention that
lower case sequences are masked. (repeat
masker, some versions of blast, megablast,
blastz)
2) A few analysis tools (like CLUSTAL)
want a simplified identifier on the defline..
So they can have a short string for the
alignment.
>X63129.1
GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTC
CATCACGCGGGGCCTTCTGCTGCTGGC ….
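Both conventions can be handled with a few lines of Python. The sketch below shortens an NCBI-style defline to its Accession.version and reports which stretches of a sequence are lower case (masked). The function names are hypothetical, and the masked example sequence is made up for illustration.

```python
import re

def simplify_defline(defline):
    """Reduce '>gi|41|emb|X63129.1|BTA1AT ...' to '>X63129.1'.

    Deflines that do not start with a 'gi|' field are cut at the first space.
    """
    identifiers = defline.lstrip(">").split(" ", 1)[0]
    fields = identifiers.split("|")
    if fields[0] == "gi" and len(fields) >= 4:
        return ">" + fields[3]
    return ">" + identifiers

def masked_intervals(sequence):
    """Return (start, end) of every lower-case run, 0-based, end-exclusive."""
    return [(m.start(), m.end()) for m in re.finditer(r"[a-z]+", sequence)]

short = simplify_defline(
    ">gi|41|emb|X63129.1|BTA1AT B.taurus mRNA for alpha-1-anti-trypsin"
)
masked = masked_intervals("GACCagccctGACC")
print(short, masked)
```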
• WIM now will talk about GCG …
Feature table
(NCBI;EMBL/DDBJ)
• http://www.ncbi.nlm.nih.gov/collab/FT/index.html
Genbank Data format
[Figure: Genbank record for gi 41]
• Tutorials:
• http://www.ncbi.nlm.nih.gov/Class/MLACourse/Genetics/index.html
• http://www.ncbi.nlm.nih.gov/Literature/pubmed_search.html
• http://www.ncbi.nlm.nih.gov/Database.tut1.html
SWISSPROT
http://www.expasy.ch/sprot/sprot_details.html
1. Core data: protein sequence data; the citation information and the
taxonomic data
2. Annotation
• Function(s) of the protein
• Domains and sites. For example calcium binding regions, ATP-
binding sites, zinc fingers, homeobox, kringle, etc.
• Post-translational modification(s). For example carbohydrates,
phosphorylation, acetylation, GPI-anchor, etc.
• Secondary structure
• Quaternary structure. For example homodimer, heterotrimer, etc.
• Similarities to other proteins
• Disease(s) associated with deficiency(ies) in the protein
• Sequence conflicts, variants, etc.
SWISSPROT
http://www.expasy.ch/cgi-bin/get-random-entry.pl?S
REBASE (Restriction enzymes dataBASE)
Restriction enzymes recognize a specific sequence pattern; the actual
cutting site lies within that pattern or a few bases away
from it.
http://rebase.neb.com/rebase/rebase.html
I prefer the Bairoch format (SWISSPROT format)
http://rebase.neb.com/rebase/rebase.f19.html
ID enzyme name
ET enzyme type
OS microorganism name
PT prototype
RS recognition sequence, cut site
MS methylation site (type)
CR commercial sources for the restriction enzyme
CM commercial sources for the methylase
RN [count]
RA authors
RL jour, vol, pages, year, etc.
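The pairing of a recognition sequence with a cut site (the RS field) can be illustrated with a short Python sketch that digests a linear DNA string. It is a simplification: only the top strand is considered, the recognition sequence contains no ambiguity codes, and the cut position is given as an offset from the start of the match (an offset beyond the pattern length models an enzyme that cuts a few bases away). The function names, this offset representation and the test sequence are assumptions; EcoRI (GAATTC, cutting after the first G) is used as the example enzyme.

```python
def cut_positions(seq, recognition, cut_offset):
    """Return top-strand cut positions for every (possibly overlapping) site."""
    seq = seq.upper()
    cuts = []
    start = seq.find(recognition)
    while start != -1:
        cut = start + cut_offset
        if 0 < cut < len(seq):            # ignore cuts that fall off the ends
            cuts.append(cut)
        start = seq.find(recognition, start + 1)
    return cuts

def digest(seq, recognition, cut_offset):
    """Cut a linear sequence at every site and return the fragments in order."""
    bounds = [0] + cut_positions(seq, recognition, cut_offset) + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]

# EcoRI: recognition sequence GAATTC, cut between G and AATTC (offset 1).
fragments = digest("AAGAATTCTTGAATTCGG", "GAATTC", 1)
print(fragments)
```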
Exercises
•You can work in teams for this.
•1a) Use the first 6000 bases of your genomic piece [or find a
bacterial genomic or mRNA sequence in Entrez with length between
2000:10000]
•b) Use the ORF finder to find the gene(s). Compare the answer you
get to the annotation you can infer from using blastn against Genbank
and from using blastx against a protein database.
•Do the Entrez exercises (separate Word document).