Bioinformatics

Lecture 3
Presenter
Muhammad Shahid Khan
BS (Bioinformatics)
M.P.H
MSc. Epidemiology and Biostatistics

Outline
1. Introduction
2. Information sources in biology and
associated problems
3. DNA databases
4. Entrez. (+ exercise)
5. Summary

Aims
• Convince you that these bioinformatics
resources are valuable for research
• Give you some important searching
strategies
• Show you how to find what you want
• Suggest other resources and further help

Information Sources for
Research - Key Questions
• What is
available?
• Where do I find
it?
• How do I search
it?

Information Sources for
Research
Journals, books, theses, abstracts.

Technical literature (e.g. protocols,
equipment handbooks).

Conferences, seminars, meetings and
exhibitions.

Molecular biology databases.

Problems with Biological
Data
• Data collection
• The base of information is large,
expanding and diverse
• Organisation and accessibility
• Requirement for special search
techniques. You can’t Google a DNA
sequence…yet!
• A student/researcher wants the right
information quickly!!!

The Good News
• Large projects working to organise
this information
• Much is freely available over the
internet

What is a DNA Sequence?
• The DNA double helix is made up of a
series of chemical bases strung along a
sugar backbone
• There are 4 bases usually represented by
the letters A, T, C and G
• The linear sequence in which these bases
occur determines all the instructions for
building an organism

What is a Protein Sequence?
• Proteins are complex molecules which
control most aspects of cell biology
• Constructed of small subunits called
amino acids.
• There are 20 types of amino acid
• Assembled by ‘reading’ (or translating)
the DNA sequence
• Every set of 3 bases (e.g. ATG)
corresponds to an amino acid
• So a protein is built up one amino acid
at a time according to the DNA
blueprint.

In Summary…

DNA Sequence

DNA
Molecule

Proteins

Complete
Organism

Gene Prediction: Computational
Challenge
• Gene: A sequence of nucleotides
coding for protein
• Gene Prediction Problem: Determine
the beginning and end positions of
genes in a genome

Gene Prediction: Computational
Challenge
aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaa
tgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgc
taagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatt
taccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaa
tggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctggga
tccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcc
gatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggat
ccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcct
gcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatcc
gatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatat
gctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctggga
tccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcc
gatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgc
ggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaat
gcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgct
aagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaat
ggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggc
tatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctat
gctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgc
ggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctg
cggctatgctaatgcatgcggctatgctaagctcatgcgg

Gene!

Central Dogma: DNA -> RNA ->
Protein
DNA

CCTGAGCCAACTATTGATGAA

transcription

RNA

CCUGAGCCAACUAUUGAUGAA

translation

Protein

PEPTIDE
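
The example on this slide can be reproduced with a minimal Python sketch. The codon table below is deliberately partial: it holds only the seven codons that occur in this sequence, and the function names are illustrative assumptions, not part of any library.

```python
# Partial codon table: only the codons that occur in the slide's example.
CODON_TABLE = {
    "CCU": "P", "GAG": "E", "CCA": "P", "ACU": "T",
    "AUU": "I", "GAU": "D", "GAA": "E",
}

def transcribe(dna: str) -> str:
    """Return the RNA copy of a coding-strand DNA sequence (T becomes U)."""
    return dna.upper().replace("T", "U")

def translate(rna: str) -> str:
    """Translate RNA codon by codon; codons missing from the table show as '?'."""
    usable = len(rna) - len(rna) % 3
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE.get(codon, "?") for codon in codons)

dna = "CCTGAGCCAACTATTGATGAA"
rna = transcribe(dna)
protein = translate(rna)
print(rna)      # CCUGAGCCAACUAUUGAUGAA
print(protein)  # PEPTIDE
```

A real translation routine would need the full 64-codon table and would stop at the first stop codon.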

Prokaryotic and eukaryotic
organisms

Translating Nucleotides into Amino
Acids

• Codon: 3 consecutive nucleotides
• 4^3 = 64 possible codons
– Includes start and stop codons

– An amino acid may be coded by more
than one codon
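
The counting argument above can be checked with a short sketch. It enumerates every triplet and pairs it with the standard genetic code written as a 64-letter string in T, C, A, G order; the variable names are illustrative choices.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# All 4 * 4 * 4 ordered triplets, first base varying slowest.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
genetic_code = dict(zip(codons, AMINO_ACIDS))

codons_per_residue = Counter(genetic_code.values())
stop_codons = sorted(c for c, aa in genetic_code.items() if aa == "*")

print(len(codons))              # 64
print(stop_codons)              # ['TAA', 'TAG', 'TGA']
print(codons_per_residue["L"])  # 6 (leucine is coded by six codons)
print(genetic_code["ATG"])      # M (methionine, the start codon)
```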

Codons

• In 1961 Sydney Brenner and Francis Crick
discovered frameshift mutations
• Systematically deleted nucleotides from
DNA
– Single and double deletions dramatically
altered protein product
– Effects of triple deletions were minor
– Conclusion: every triplet of nucleotides,
each codon, codes for exactly one
amino acid in a protein
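
The effect of these deletions on the reading frame can be illustrated with a small sketch. The sequence is made up, and the deletion is placed on a codon boundary to keep the output easy to read; it illustrates the reasoning, not the original experiment.

```python
def codons(seq: str) -> list[str]:
    """Split a sequence into complete triplets, reading from position 0."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

def delete(seq: str, start: int, count: int) -> str:
    """Remove `count` nucleotides beginning at index `start`."""
    return seq[:start] + seq[start + count:]

original = "ATGGCCAAAGGGTTTCCC"  # hypothetical sequence for illustration
print(0, codons(original))
for n in (1, 2, 3):
    # Deleting 1 or 2 bases shifts every downstream codon;
    # deleting 3 removes one codon and leaves the rest in frame.
    print(n, codons(delete(original, 3, n)))
```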

Genetic Code and Stop Codons
UAA, UAG and UGA are the 3 Stop codons
that (together with the Start codon AUG,
written ATG in DNA) delineate Open
Reading Frames

Six Frames in a DNA
Sequence
5’ CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC 3’
3’ GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG 5’

• stop codons – TAA, TAG, TGA
• start codons - ATG
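
As a rough sketch, the six reading frames of the sequence on this slide can be generated as follows: three offsets on the given strand and three on its reverse complement. The function names and the "+1" to "-3" frame labels are illustrative assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def six_frames(seq: str) -> dict[str, list[str]]:
    """Map frame names (+1..+3, -1..-3) to their lists of complete codons."""
    frames = {}
    for sign, strand in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            # Last index that still leaves room for complete triplets.
            usable = len(strand) - (len(strand) - offset) % 3
            frames[f"{sign}{offset + 1}"] = [
                strand[i:i + 3] for i in range(offset, usable, 3)
            ]
    return frames

seq = "CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC"
for name, frame in six_frames(seq).items():
    stops = sum(codon in STOPS for codon in frame)
    print(name, len(frame), "codons,", stops, "stop(s),", frame.count("ATG"), "ATG")
```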

Open Reading Frames (ORFs)
• Detect potential coding regions by looking at
ORFs
– A genome of length n is comprised of (n/3)
codons
– Stop codons break genome into segments
between consecutive Stop codons
– The sub segments of these that start from
the Start codon (ATG) are ORFs
• ORFs in different frames may overlap

[Figure: a genomic sequence with an open reading frame running from ATG to TGA]

Codon Usage in Human Genome

Transcription in prokaryotes
[Figure: a prokaryotic gene drawn 5’ to 3’ (upstream to downstream),
labelled with the promoter, transcription start site, untranslated
regions, start codon, coding region, stop codon, transcription stop
site and the transcribed region]

Microbial gene finding
• Microbial genomes tend to be gene
rich (80%-90% of the sequence is
coding)
• The most reliable method –
homology searches (e.g. using
BLAST and/or FASTA)
• Major problem – finding genes
without a known homologue.

Open Reading Frame
An Open Reading Frame (ORF) is a sequence of codons
which starts with a start codon, ends with a stop codon
and has no stop codons in between.
Searching for ORFs – consider all 6 possible
reading frames: 3 forward and 3 reverse
Is the ORF a coding sequence?
1. Must be long enough (roughly 300 bp or more)
2. Should have an average amino-acid composition specific to the
given organism
3. Should have codon use specific for the given organism
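
The ORF search described on this slide might be sketched as below: scan all six frames for an ATG followed by the first in-frame stop codon, and keep only candidates of at least `min_length` bases (300 by default, per criterion 1). The function name and the tuple layout are hypothetical; criteria 2 and 3 are not checked, and coordinates on the "-" strand refer to the reverse-complemented sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def find_orfs(seq: str, min_length: int = 300) -> list[tuple[str, int, int, int, str]]:
    """Return (strand, frame, start, end, sequence) for each ORF found."""
    orfs = []
    for strand_name, strand in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            start = None  # index of the ATG that opened the current ORF
            for i in range(offset, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_length:
                        orfs.append((strand_name, offset + 1, start, i + 3,
                                     strand[start:i + 3]))
                    start = None
    return orfs

toy = "AAATGGCCTAAGG"  # made-up sequence; threshold lowered so the ORF shows
print(find_orfs(toy, min_length=9))  # [('+', 3, 2, 11, 'ATGGCCTAA')]
```

With the default threshold the toy sequence yields no ORFs, which is the point of the length filter: short ATG-to-stop stretches arise easily by chance.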

Eukaryotic gene finding
• On average, a vertebrate gene is about
30 kb long
• The coding region takes up about 1 kb
• Exon sizes vary from double digit
numbers to kilobases
• An average 5’ UTR is about 750 bp
• An average 3’UTR is about 450 bp but
both can be much longer.

Exons and Introns
• In eukaryotes, the gene is a
combination of coding segments
(exons) that are interrupted by noncoding segments (introns)
• This makes computational gene
prediction in eukaryotes even more
difficult
• Prokaryotes don’t have introns: genes in prokaryotes are continuous

Gene Structure

Central Dogma and Splicing
exon1

intron1

exon2

intron2

exon3

transcription
splicing

exon = coding
intron = non-coding

translation

Splicing Signals

Exons are interspersed with introns,
which typically begin with GT and end with AG
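
A toy illustration of splicing, assuming the intron positions are already known and that each intron begins with GT and ends with AG. The sequences, coordinates and function name are invented for the example; real splice-site prediction needs far more than this two-letter signal.

```python
def splice(gene: str, introns: list[tuple[int, int]]) -> str:
    """Remove introns given as (start, end) half-open index pairs."""
    mature, position = [], 0
    for start, end in sorted(introns):
        intron = gene[start:end]
        # Check the GT...AG boundary signal before cutting the intron out.
        if not (intron.startswith("GT") and intron.endswith("AG")):
            raise ValueError(f"intron at {start}-{end} lacks GT...AG boundaries")
        mature.append(gene[position:start])
        position = end
    mature.append(gene[position:])
    return "".join(mature)

# Hypothetical gene: exon1 + intron1 + exon2 + intron2 + exon3
gene = "ATGGCC" + "GTAAGTTTTCAG" + "AAATTT" + "GTCCCCAG" + "GGGTAA"
print(splice(gene, [(6, 18), (24, 32)]))  # ATGGCCAAATTTGGGTAA
```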

Promoters
• Promoters are DNA segments
upstream of transcripts that initiate
transcription
[Figure: promoter marked on a strand running 5’ to 3’]

• Promoter attracts RNA Polymerase to
the transcription start site

Looking at DNA sequences I
• Analysis of DNA or protein sequences is a
frequent requirement of research
– Locating genes within a sequence
– Comparing two sequences for similarity
– Searching for similar (orthologous)
genes in other organisms

Looking at DNA sequences II

DNA sequences are easily stored, retrieved,
compared and manipulated on computers
Just represent each base as a letter!

Computers can compare two or more
sequences and find similar regions
Much analysis of genetic information now
takes place in silico
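
As one simple illustration of finding a similar region, the sketch below returns the longest stretch of identical letters shared by two sequences, using dynamic programming. This is exact matching only, not the alignment method used by tools such as BLAST, and the function name and sequences are made up.

```python
def longest_common_region(a: str, b: str) -> str:
    """Return the longest run of identical letters shared by a and b."""
    best_len, best_end = 0, 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the match that ended at the previous pair of letters.
                current[j] = previous[j - 1] + 1
                if current[j] > best_len:
                    best_len, best_end = current[j], i
        previous = current
    return a[best_end - best_len:best_end]

print(longest_common_region("AAACCGGTTT", "GGGCCGGAAA"))  # CCGG
```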

Looking at DNA Sequences III

DNA sequences can be determined
experimentally
Software allows biologists to construct and
view maps of DNA sequence
The DNA code of ATCG gets transformed into
something much more human friendly

Using a DNA Sequence

DNA Databases

Free access to vast numbers of sequences deposited by researchers
all over the world
Used alongside scientific papers
Can be searched or ‘mined’ in a variety of ways

Global Bioinformatics Agencies

International Nucleotide Sequence Database
Collaboration (INSDC)
• DNA Data Bank of Japan (DDBJ)
• European Molecular Biology Laboratory (EMBL)
• National Center for Biotechnology
Information (NCBI)

NCBI and GenBank

GenBank is NCBI’s DNA database
Extensive search and deposit capabilities

606 sequences

A Practical Example

A researcher might start with a piece of DNA rather than a literature
citation
Here we will –
1. Search a DNA database using a piece of DNA sequence
2. Use the results of the search to identify relevant literature

The Experiment
1) Grow some bugs.
2) Extract the DNA.
3) Amplify up the desired section of DNA.
4) Generate sequence.

A DNA Sequence

The following sequence is in FASTA
format
>G08_CHEV11Fed.seq
GTCGACGCGCAAATGGTTCTATATCCATACCAATAGCAGTATCGTTGCCA
TTATCACGAATGGAATTAAGTAAAGTTTTCATTCTATCAATAGACTCTAA
AACCACATCCATGATATCTGGAGTTATTTTTAACTCGCCATGTCTTGCTT
TGTTTAAAACATCCTCCATGTGGTGAGTTAACTTTGTTAAAACATCAAAA
TTTAAGAAGCTTGATGATCCTTTAACCGTATGTGCAACACGGAAAATTCT
ATTTAATAATTCTAAATCTTCTGGATTTGATTCAAGCTCTACTAAATCAT
GGTCGATTTGCTCAACAAGCTCAAAAGCTTCAACCAAAAAGTCTTCAAGT
ATTTCTTGCATATCTTCCATATTTTACCCCTGTTCTTGAGATTGATGTTT
TTTAATAACCTTTGCAATTTCATTGAAGAAATCGCTAGCGTTAAATTTGA
CAAGATAGCCTTCTCCACCAGCTTCTTGAACACCTTTCTCATTCATAAAT
TCATTTGATAAAGATGAGTTAAAGACTATAGGAATATCTTTAAATCCGGG
ATCTTCTTTAATGCGTGCAGCGGATCCCGGGTACCTGCAGAATTCAGCTG
CGCCCTTTAGTTCCTAAAGGGTTTTTATCAGTGCGACAAACTGGGATTTT
ATTTATTCAGCAAGTCTTGTAATTCATCCAAAAAACGGCAAACATGAAAG
CCGTCACAAACGGCATGATGCACTTGAATCGATAAGGGAATATAGTATTT
TCCGCCCTCCTCATAATACTTCCCAAACGTAAATATCGGCAGTAGATAGT
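
A FASTA record is a header line starting with ">" followed by one or more lines of sequence. A minimal parser might look like the sketch below; the function name is an illustrative choice, and the example uses only the first two sequence lines from this slide.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Return {header: sequence} for every record in FASTA-formatted text."""
    records: dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line  # join wrapped sequence lines
    return records

example = """>G08_CHEV11Fed.seq
GTCGACGCGCAAATGGTTCTATATCCATACCAATAGCAGTATCGTTGCCA
TTATCACGAATGGAATTAAGTAAAGTTTTCATTCTATCAATAGACTCTAA
"""
records = parse_fasta(example)
for name, sequence in records.items():
    print(name, len(sequence), sequence[:10])
```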

A BLAST Search

Basic Local Alignment Search Tool
Aimed at finding highly similar sequences in the database
Let’s see how to submit a sequence query to the GenBank database

BLAST Search Screen

Enter sequence.

Select database.

Select BLAST type.

BLAST Results I

The Statistics

• Guidelines for evaluating stats (data from
‘Introduction to Bioinformatics’, Lesk, A, OUP (2005))
– E ≤0.02 – Sequences probably
homologous (i.e. derived from a
common ancestor)
– E between 0.02 and 1 – homology
unproven but can’t be ruled out.
– E > 1 – A match this good would be
expected by chance.

• Putting the amino acid sequence
NELLYTHEELEPHANT into a BLAST
protein search produces results!
• Best match E value = 9
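
The guideline thresholds on this slide can be written as a small helper. The function name is hypothetical, and the cut-offs are the rules of thumb quoted above, not a statistical test.

```python
def interpret_e_value(e_value: float) -> str:
    """Classify a BLAST E value using the slide's rule-of-thumb thresholds."""
    if e_value < 0:
        raise ValueError("E values cannot be negative")
    if e_value <= 0.02:
        return "probably homologous"
    if e_value <= 1:
        return "homology unproven but cannot be ruled out"
    return "a match this good is expected by chance"

print(interpret_e_value(9))  # a match this good is expected by chance
```

By these thresholds the NELLYTHEELEPHANT hit (E = 9) is no evidence of homology.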

BLAST Results II

Two possible
matches.

BLAST Results III

Literature references
allow us to go straight
to citations in PubMed
relevant to the
sequence we have
found.

Here is the name of the
gene!

Evaluating the Data
• There are errors in these databases!
• Is a BLAST search appropriate?
• What is the source of this sequence?
• Should I cross reference?
• What are the statistics telling me?

Structure of Entrez

Powerful resource for research
Entrez is a cross-database search engine
Records are cross referenced and linked

Entrez Main Screen

Single Keyword Search
• Type keyword into the search box
and click ‘GO’
• The number of hits for the search
term is shown by each database
• Single keyword searches are limited
• Advanced search techniques refine
results and produce fewer irrelevant
hits

Using Boolean Operators
• Boolean operators and phrases build
complex searches
• Use AND, OR and NOT to join terms
• Use UPPERCASE for the operators
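
A query string of this kind can be assembled programmatically. The helper names and search terms below are invented for illustration; the sketch only builds text to paste into the search box and does not contact Entrez.

```python
def any_of(*terms: str) -> str:
    """Join terms with OR inside brackets."""
    return "(" + " OR ".join(terms) + ")"

def all_of(*terms: str) -> str:
    """Join terms with AND inside brackets."""
    return "(" + " AND ".join(terms) + ")"

def exclude(query: str, term: str) -> str:
    """Drop records matching `term` from `query` using NOT."""
    return f"({query} NOT {term})"

query = exclude(
    all_of(any_of("malaria", "plasmodium"), '"drug resistance"'),
    "review",
)
print(query)  # (((malaria OR plasmodium) AND "drug resistance") NOT review)
```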

Refining Searches and Setting
Limits.
• Within an individual database results
may be further refined by setting
limits
• The number and type of limits will
depend on the database
• Click the ‘limits’ tab from within one
of the databases

Steps in Setting a Limit
1. Select a field to limit the search by.
2. Type in the limiting term in the
search box.
3. Select other limiting options e.g. –
– Publication date.
– Database.

4. Hit ‘GO’ to retrieve the results.

Using the History
• The history keeps track of previous
searches
• You can combine searches and limits
quickly and easily
• You can isolate records matching
very specific criteria

Jumping Between
Databases
• Records in Entrez are extensively cross
linked.
• The ‘links’ hyperlink next to each record
lets you jump between databases.

Entrez in Summary
• We’ve looked at –
– Simple and advanced searching.
– Accessing and moving between records
– Using the clipboard
– Setting limits
– Using the history
– Sorting results

Evaluating Entrez I

Advantages
Quickly cross reference many databases.
Elaborate searches can be constructed within each database.
Tools to save and modify searches.
Pools many resources.

Evaluating Entrez II

Disadvantages
Can return many irrelevant results.
Syntax for advanced searching is complicated (many databases
= many fields).
Doesn't cover everything!

Summary
• Bioinformatics resources help collect,
organise and analyse biological data.
• Essential resources for biology research.
• Bioinformatics databases can be searched
in unique ways.
• Entrez provides a powerful cross-database
searching tool.
• Many more resources out there!

Your Turn!
• A little practice using
Entrez

10 Minutes

Notes on the Exercise
• Using brackets with Boolean
operators refines search results.
• Care with placing brackets is
essential!
• The clipboard is helpful for recording
results of searches.
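
Why bracket placement matters can be shown with ordinary sets standing in for search results. The four records and their keywords are made up; set union plays the role of OR and intersection the role of AND.

```python
# Hypothetical records: id -> keywords attached to the record.
records = {
    1: {"mouse", "insulin"},
    2: {"human", "insulin"},
    3: {"mouse", "glucagon"},
    4: {"human", "leptin"},
}

def hits(term: str) -> set[int]:
    """Return the ids of records that carry the keyword."""
    return {rid for rid, keywords in records.items() if term in keywords}

# (mouse OR human) AND insulin
grouped_first = (hits("mouse") | hits("human")) & hits("insulin")
# mouse OR (human AND insulin)
grouped_last = hits("mouse") | (hits("human") & hits("insulin"))

print(sorted(grouped_first))  # [1, 2]
print(sorted(grouped_last))   # [1, 2, 3]
```

The same three terms return different records depending only on where the brackets sit.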

And Finally…
Thanks for listening!
Any Questions?

Resources
Search Engines and Software
• NCBI BLAST –
www.ncbi.nlm.nih.gov/blast/Blast.cgi
• Entrez – www.ncbi.nlm.nih.gov/sites/gquery
• SRS – Another cross database search engine for
bioinformatics data similar in principle to Entrez.
http://srs.ebi.ac.uk/
• EMBOSS Bioinformatics software – A whole
suite of free applications for processing many
kinds of biological data.
http://emboss.sourceforge.net/
• ARTEMIS – A free sequence viewer and editor.
www.sanger.ac.uk/Software/Artemis/