3gno Notes 2004

OPT-3GNO.
Functional genomics and bioinformatics Semester 2
Functional genomics and bioinformatics - 3GNO Lecture Notes

Dr. Simon Hubbard (H13) x. 8930, email: Simon.Hubbard@umist.ac.uk
Learning outcomes:
In these lectures, we will build on the 2GEN lectures and previous lectures in the 3GNO lecture
series to examine some applications of bioinformatics to functional genomics. At the end of the
lectures you should:
• Be able to describe how bioinformatics is used as a tool to functionally annotate genomes
• Be able to discuss issues in assigning biological function
• Recognise novel bioinformatics techniques currently being developed for comparative and
functional genomics to predict gene function.
• Be able to discuss how bioinformatics approaches can identify targets of clinical interest
• Be able to describe how bioinformatics is used to analyse gene expression data to predict
protein function and in clinical diagnostics.
• Be able to recognise the utility of bioinformatics applications in proteomics data analysis for
functional genomics
Bioinformatics
In second year lectures, we considered how sequence similarity searching (with programs such as
BLAST and FASTA) can find homologous genes and proteins, based on common patterns in
their primary sequences. The function of the unknown gene or protein sequence is inferred from
the function of the database match, based on the hypothesis that the proteins share a common
ancestor from which they have both evolved. These methods typically begin to fail around the
"twilight zone" of c. 25% pairwise sequence identity, below which no statistical significance can be
assigned to a match, even when the proteins do share a common ancestral gene. However, many
genes in the yeast genome were seen to fall into this category: roughly 2000 so-called
"orphan" open reading frames, with no orthologues in the databases with functional
annotation. New methods are beginning to appear that extend functional classification of genome
sequences beyond these traditional bioinformatics attempts.
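As a rough illustration of what "pairwise sequence identity" means, the sketch below counts identical residues over the aligned (non-gap) columns of a pairwise alignment. The two aligned strings are made up for the example, and other definitions of identity exist (e.g. dividing by the full alignment length), so treat this as one possible convention rather than the one used by BLAST or FASTA.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over the columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # gap columns are not counted in this convention
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


TWILIGHT_ZONE = 25.0  # approximate figure quoted in the notes

# Hypothetical alignment, invented for illustration only.
identity = percent_identity("MKT-AYIAKQR", "MRTLAY-SKQK")
in_twilight_zone = identity <= TWILIGHT_ZONE
```

Here 6 of the 9 compared columns match, so this made-up pair sits well above the twilight zone.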
• Genome assembly, gene hunting, annotation

[Figure: flowchart of genome annotation. Cosmids → DNA sequencers → raw sequence → assembly into contigs → find genes and ORFs → annotation. Annotation asks: is there an orthologue of known function? If yes, functional information is obtained. If no, is there a homologue? If yes, the gene is annotated as a homologue; if no, it is an orphan.]
A schematic of the process involved in annotating a genome sequence is shown above. We hope
we can find many orthologous genes from other genomes whose function is well
characterised.
Terminology: Homologous proteins share a common ancestral gene from which they have evolved.
Orthologous sequences are homologous sequences that are now in different organisms' genomes.
Paralogous sequences are homologues which share sequence and functional similarity but have most
likely arisen from a gene duplication; they may still be in the same genome.
PSI-BLAST
Position-Specific Iterated BLAST. A more sensitive method [1,2]. It finds more distant homologues
than standard BLAST, and hence allows more annotations to be made.
[Figure, PSI-BLAST (1): flowchart. BLAST the unknown (new) query sequence against the database; align all "hits" with score better than threshold; build a profile; BLAST the profile against the database; repeat. If no new hits, STOP.]
[Figure, PSI-BLAST (2): the "unknown" query sequence and its matches, distinguishing false positives from real biological matches.]
Pros                        Cons
Sensitivity                 False positives
Get automatic alignments    Errors are propagated
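The iterate-until-no-new-hits loop can be sketched in a few lines. This is a toy, not PSI-BLAST itself: it assumes ungapped sequences of equal length, a uniform background, a simple pseudocount, and a fixed raw-score threshold, whereas the real program uses gapped alignments, sequence weighting and E-values. All sequences and the threshold are invented for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(seqs, pseudocount=1.0):
    """Per-position log-odds scores from equal-length, ungapped sequences."""
    background = 1.0 / len(ALPHABET)  # assumption: uniform background
    profile = []
    for column in zip(*seqs):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({aa: math.log(((counts[aa] + pseudocount) / total) / background)
                        for aa in ALPHABET})
    return profile


def score(profile, seq):
    """Sum of the position-specific scores for one candidate sequence."""
    return sum(column[aa] for column, aa in zip(profile, seq))


def iterative_search(query, database, threshold, max_rounds=10):
    """Rebuild the profile from all hits so far; stop when no new hits appear."""
    hits = [query]
    for _ in range(max_rounds):
        profile = build_profile(hits)
        new_hits = [s for s in database
                    if s not in hits and score(profile, s) >= threshold]
        if not new_hits:
            break  # "If no new hits, STOP"
        hits.extend(new_hits)
    return hits[1:]  # everything except the query itself


# Hypothetical data: the second sequence is only found via the profile.
query = "ACDEFG"
database = ["ACDEFW", "ACDKFW", "WWWWWW"]
found = iterative_search(query, database, threshold=3.0)
first_round_only = iterative_search(query, database, threshold=3.0, max_rounds=1)
```

A single round finds only the close relative; the profile built from the first hit then pulls in the more distant one. The same mechanism explains the "cons": a false positive admitted in one round contributes to the profile and its error is propagated.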
Annotation using protein structures

SUPERFAMILY resource ( http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/ )
HMMs are built for every structure in the SCOP database at the superfamily level.
Applying these to sequenced genomes shows that 40-50% of a genome matches known structures.
Most proteins are multi-domain.
Real genome annotation

What have we learnt from the human genome ?
Human Genome, ENSEMBL www.ensembl.org

See ref. Nature 409, 861-921 (p.896 onwards)
[Figure: pie chart of homology of human genes, found with BLASTP & PSI-BLAST; 74% of ~30000 genes.]
• prokaryotes only: 0%
• eukaryote and prokaryote: 21%
• vertebrate only: 22%
• vertebrate and other animals: 24%
• animals and other eukaryotes: 32%
• no animal homology: 1%
Plasmodium genome: lots of new drug targets identified.
14 chromosomes, c. 5000 genes. See Nature, 415, p.702-
Errors in Genome annotation

Errors can be assessed by comparing assignments of EC numbers [3], or by comparing different
groups annotating different genomes (Steve Brenner group).
What is protein function ?

This is not such a well-defined concept for bioinformatics. We can consider protein function on
several different levels (molecular/biochemical, cellular, phenotypic, etc.). Even when we know
the structure, it is not trivial to predict the full cellular function, or sometimes even the
biochemical function. This will become more of a problem when structural genomics projects
start delivering many new structures.
Gene ontology [4]
Attempt to unify biology! (Based on 3 genomes: yeast, fly, worm.) Three ontological classes
(restricted vocabularies or keywords):
• Biological process
• Molecular function
• Cellular component
[Figure: example hierarchies for the three classes: cellular component, biological process, molecular function.]
The example shows how genes can fit into different places in the hierarchy. E.g. MCM2 can work
in different cellular locations, and it can also be defined in different molecular functions; all
of them are involved in different aspects of the biological process of DNA replication.
New approaches
• Proteins with common function share common phylogeny

[Figure: 6217 yeast proteins are linked to functionally related proteins by: experimental data, related metabolic function, related phylogenetic profiles, the Rosetta Stone method, and correlated mRNA expression. Genome key for the phylogenetic profiles: E = E. coli, H = H. influenzae, M = M. jannaschii, P = H. pylori. Yeast proteins Y-1 and Y-2 are each shown with a profile across E, H, M and P.]

Hypothesis: yeast proteins Y-1 and Y-2 are functionally related.

Predict functions of yeast proteins using links with characterised proteins.
These approaches attempt to go beyond prediction by homology and look at other features of sequences. Two recent
Nature papers [5,6] epitomise recent advances. Eisenberg and colleagues [5] have attempted to
combine lots of approaches, using all the known annotated genome sequences to predict the
function of 6217 yeast proteins. This exploited protein-protein interaction databases, metabolic
pathway databases, phylogenetic analyses, gene-fusion analysis and micro-array expression data.
Results were quite promising (although it's not obvious!): several functional annotations were
made that could not otherwise have been spotted using simple homology methods. The "Rosetta
stone" method, applied also by Enright and co-workers [6], shows promise in identifying likely
protein:protein interactions from genome sequences.
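The phylogenetic profile idea (proteins with common function share common phylogeny) can be sketched as follows. The presence/absence values are hypothetical, and requiring an exact profile match is a simplifying assumption; a real analysis would tolerate small differences and use many more genomes.

```python
from collections import defaultdict

# Genome key from the notes: E. coli, H. influenzae, M. jannaschii, H. pylori.
GENOMES = ["E", "H", "M", "P"]

# Hypothetical data: 1 = a homologue is present in that genome, 0 = absent.
profiles = {
    "Y-1": (1, 1, 0, 1),
    "Y-2": (1, 1, 0, 1),
    "Y-3": (0, 1, 1, 0),
    "Y-4": (1, 0, 0, 0),
}


def link_by_profile(profiles):
    """Group proteins that share an identical presence/absence profile."""
    groups = defaultdict(list)
    for protein, profile in profiles.items():
        groups[profile].append(protein)
    # Only groups with at least two members give a functional link.
    return [sorted(members) for members in groups.values() if len(members) > 1]


links = link_by_profile(profiles)
```

With these made-up profiles only Y-1 and Y-2 are linked, which corresponds to the hypothesis that they are functionally related.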
Combined approach of Eisenberg: exploiting phylogenetic data, the Rosetta stone method, and
others to predict the function of yeast genes

• Rosetta Stone method (multidomain proteins)

[Figure: Genome A and Genome B. Proteins A and B (each drawn from N-terminus to C-terminus) are separate polypeptide chains; in protein C the 2 domains are fused.]

Hypothesis:
If A & C are orthologues, and
B & C are orthologues, and
A & B are not paralogues
Then: A and B are functionally related

Technique                            Number of   Number of          False positive   Ability to predict   Ability in random
                                     proteins    functional links   rate (%)         function (%)         trials (%)
Experimental data                    484         500                6.5              33.2                 4.0
Metabolic data                       188         2391               2.5              20.3                 4.5
Phylogenetic profiles                1976        20749              29.5             33.1                 7.4
Rosetta stone method                 1898        45502              36.4             26.5                 7.7
Correlated mRNA                      3387        26013              35.8             11.5                 6.9
Combined: links from >= 2 methods    683         1249               16.1             55.6                 6.9

Of the 2557 uncharacterised yeast proteins, 374 can be assigned functions (high confidence).
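The Rosetta stone hypothesis can be written as a small rule over homology assignments. The sketch below assumes we already know which fused proteins each separate protein matches (the `matches` table is hypothetical), and it ignores the further check that A and B should match different regions of C.

```python
from itertools import combinations

# Hypothetical assignments: separate protein -> fused protein(s) it matches.
matches = {
    "A": {"C"},
    "B": {"C"},
    "D": {"F"},
}


def rosetta_links(matches, paralogues=frozenset()):
    """Link two proteins if both match the same fused protein
    and the pair is not known to be paralogous."""
    links = []
    for p, q in combinations(sorted(matches), 2):
        shared = matches[p] & matches[q]
        if shared and frozenset((p, q)) not in paralogues:
            links.append((p, q, sorted(shared)))
    return links


predicted = rosetta_links(matches)
# If A and B were paralogues, the hypothesis would not apply.
excluded = rosetta_links(matches, paralogues={frozenset(("A", "B"))})
```

The paralogue check matters because two copies of the same domain would both match C without implying any functional link between distinct proteins.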
References
Hopefully, many of these will be available from online resources. I would suggest you try and consult
at least one paper from each section. Most of them are review-type articles that you should be
able to follow. (** = highly recommended reading, should be available from short loans etc. * =
recommended, but more specialist).
1. *Altschul SF & Koonin EV (1998) Iterated profile searches with PSI-BLAST. TIBS 23, 444-447
2. **Jones DT & Swindells M (2002) Getting the most from PSI-BLAST. TIBS 27, 161-164
3. *Devos D & Valencia A (2001) Intrinsic errors in genome annotation. Trends in Genetics 17, 429-431
4. **Gene Ontology Consortium. Tool for the unification of biology. Nature Gen 25, 25-28. Short
introduction to the functional annotation problem. See also website.
5. **Marcotte, Pellegrini, Thompson, Yeates & Eisenberg (1999) A combined algorithm for
genome-wide prediction of protein function. Nature 402, 83-
6. *Enright et al. (1999) Protein interaction maps for complete genomes based on gene fusion
events. Nature 402, 86- (Note: 5 and 6 appeared next to each other in the same edition of Nature).
Comparative genomics
Much interest is currently being generated from the new biological insights that can be gained by
comparison of genome sequences using bioinformatics.
Comparative genomics can be used to spot pathogenicity islands [7], a subset of genomic
islands which have structurally conserved properties and are acquired via horizontal transfer.
[Figure: arrows mark the CAG pathogenicity islands.]
Comparative genomics can also be used to study conserved elements in non-coding regions to
find regulatory sites [8,9] (CFTR = cystic fibrosis transmembrane conductance regulator).
[Figure: a 1.8 Mb segment containing 10 genes, including CFTR.]
Gene annotation can also be verified.
Quality control on original yeast genome

Conservation of reading frame across genomes (RFC)
[Figure key: green = conserved, yellow = not conserved, white = gap, red = insertion. Also, many stop codons (not shown).]
5945 ORFs tested, 367 rejected
Functional genomics
Two principal areas of study are the transcriptome and the proteome. These are the total cell
or tissue content of messenger RNA and protein respectively. They are distinguished from
the genome by their dynamic nature: both are context-dependent, and will vary under
different conditions. Hence, these technologies create the possibility to analyse/study:
• gene expression changes during the cell cycle
• gene expression changes to stress
• developmental changes in gene expression
• DNA polymorphism
• comparative studies of pathogenic/normal cells
• function of unknown genes
• protein:protein interactions
• post-translational modifications
Transcriptome studies
• Multi-dimensional data with a "value" for every gene

ORF/Gene    Value
YAL001w     2.4
YAL002c     1.0
YAL003w     0.2
YAL004w     3.8
...
YNR231w     0.2
YNR232c     3.8

At the end of the experiment (after the array is scanned), a single value is obtained for each gene
being investigated. This value describes either the abundance of message (mRNA) in the cell
corresponding to that gene, or the relative rate of expression of that gene under two different
conditions. In the case of the latter, this can be a comparison between different time points
relative to time 0 (t0, the start of the experiment). This is like the data from the Brown lab at
Stanford [10].
Using a starting point as a reference is a natural way to "normalise" the data, as the expression of
every single gene can be described relative to its expression at t0 or in a wild-type,
e.g. ratio = expression at time t / expression at t0. In the example above, YAL001w is
up-regulated and YNR231w is down-regulated.
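The normalisation is a simple division, sketched below. The raw time series and the 2-fold cut-off used to call a gene up- or down-regulated are assumptions made for the example; only the four ratio values are taken from the table in these notes.

```python
import math


def ratios_vs_t0(series):
    """Express each time point relative to the first one (t0)."""
    t0 = series[0]
    return [value / t0 for value in series]


def classify(ratio, fold=2.0):
    """Call a gene up/down-regulated using an assumed fold-change cut-off."""
    if ratio >= fold:
        return "up"
    if ratio <= 1.0 / fold:
        return "down"
    return "unchanged"


# Ratios from the example table in the notes.
values = {"YAL001w": 2.4, "YAL002c": 1.0, "YAL003w": 0.2, "YNR231w": 0.2}
calls = {orf: classify(ratio) for orf, ratio in values.items()}

# Hypothetical raw intensities for one gene over three time points.
raw = [50.0, 100.0, 25.0]
relative = ratios_vs_t0(raw)
log_relative = [math.log2(r) for r in relative]  # symmetric around 0
```

Log ratios are often preferred for the later distance calculations because a 2-fold increase (+1) and a 2-fold decrease (-1) become equally far from no change.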
Each gene can be represented as a vector (or array, or row) of values corresponding to the
expression profile over the experiments. The "distance" between each pair of gene vectors can then be
calculated using some mathematical function (e.g. Euclidean distance or correlation coefficient).
Then, a matrix can be built up of these "distance" metrics for an "all against all" comparison of
the expression profiles. Clustering procedures are then used to build simple dendrograms (trees)
where the genes that are co-regulated should be clustered together.
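A minimal sketch of this pipeline is given below, using 1 minus the Pearson correlation as the distance and average-linkage agglomerative clustering to build the tree. The expression profiles are hypothetical and the choice of average linkage is an assumption; this is not presented as the exact procedure of any particular paper.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def distance_matrix(profiles):
    """All-against-all distances, stored by unordered gene pair."""
    return {frozenset((g, h)): 1.0 - pearson(profiles[g], profiles[h])
            for g, h in combinations(profiles, 2)}


def cluster(profiles):
    """Average-linkage clustering; returns the tree as nested tuples."""
    dist = distance_matrix(profiles)
    nodes = [(gene, [gene]) for gene in profiles]  # (subtree, member genes)

    def linkage(a, b):
        pairs = [dist[frozenset((g, h))] for g in a[1] for h in b[1]]
        return sum(pairs) / len(pairs)

    while len(nodes) > 1:
        a, b = min(combinations(nodes, 2), key=lambda pair: linkage(*pair))
        nodes.remove(a)
        nodes.remove(b)
        nodes.append(((a[0], b[0]), a[1] + b[1]))  # merge the closest pair
    return nodes[0][0]


def leaves(tree):
    """List the gene names under a (sub)tree."""
    return [tree] if isinstance(tree, str) else leaves(tree[0]) + leaves(tree[1])


# Hypothetical expression profiles over four experiments.
profiles = {
    "geneA": [0.1, 1.0, 2.0, 3.0],
    "geneB": [0.0, 1.1, 2.1, 2.9],
    "geneC": [3.0, 2.0, 1.0, 0.2],
    "geneD": [2.8, 2.1, 0.9, 0.0],
}
tree = cluster(profiles)
```

In this made-up data geneA and geneB rise together while geneC and geneD fall together, so the top of the tree splits them into those two groups.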
Eisen et al. [10] did this for 2467 yeast ORFs, using data obtained by studying the yeast response
to the diauxic shift (during starvation), the mitotic cell cycle, and stress (temperature and
reduction shock). Genes which have similar cellular functions cluster together.
They found several clusters:
• 126 down-regulated in response to stress (ribosomal proteins, transcription and elongation factors)
• proteasome genes (multiprotein complex)
• mitochondrial protein synthesis & respiration & ATP synthesis genes
• glycolytic genes
Diagnostics with microarrays

2 overall approaches: supervised and unsupervised [11]

Application using 14 different classes of cancer [12]: collect mRNA from tumours and perform
microarray experiments (220 tumour, 90 normal samples), find the most variant genes (approx. 10K),
perform clustering, and find the gene sets which best represent a tumour.
SVM = support vector machines. Machine learning method for classifying data into 2 classes
(binary classifiers). Since we have multiple classes, we need multiple SVMs to test for each class.
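The "one binary classifier per class" arrangement can be sketched without any machine learning library. To keep the example short and self-contained, the binary classifier here is a stand-in that compares distances to class centroids; it is not an SVM, and the two-dimensional "expression" values and class labels are invented. Only the one-vs-rest wrapping is the point.

```python
import math


def centroid(points):
    """Mean position of a list of equal-length feature vectors."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def train_binary(positive, negative):
    """Stand-in binary scorer (NOT an SVM): higher score = more like the class."""
    pos_c, neg_c = centroid(positive), centroid(negative)
    return lambda x: math.dist(x, neg_c) - math.dist(x, pos_c)


def train_one_vs_rest(samples):
    """Train one 'this class vs everything else' scorer per class."""
    classifiers = {}
    for label, members in samples.items():
        rest = [p for other, pts in samples.items() if other != label for p in pts]
        classifiers[label] = train_binary(members, rest)
    return classifiers


def predict(classifiers, x):
    """Assign the class whose binary scorer is most confident."""
    return max(classifiers, key=lambda label: classifiers[label](x))


# Hypothetical two-gene expression values for three tumour classes.
samples = {
    "classA": [(0, 0), (1, 0), (0, 1)],
    "classB": [(5, 0), (6, 0), (5, 1)],
    "classC": [(0, 5), (1, 5), (0, 6)],
}
classifiers = train_one_vs_rest(samples)
predictions = [predict(classifiers, x) for x in [(5.5, 0.5), (0.5, 5.5), (0.5, 0.5)]]
```

With 14 tumour classes the same wrapper would hold 14 binary classifiers, each answering "is it this class or not?".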
Transcriptome vs. Proteome
There are good reasons to study both the proteome and the transcriptome. Currently the
technological advances enable us to study the transcriptome more easily and extensively, but
ultimately it can be argued that the proteome is the real "working model" for the cell, as the
proteins carry out most of the function and lead to a given phenotype.
Transcriptome
• Easier to assay large number of genes simultaneously
• Easily quantified, higher dynamic range (>1 mRNA per cell is detectable)
• Nucleotides easier to manipulate (recombinant technologies) but are less stable
• mRNA presence does not guarantee presence or level of gene product

Proteome
• Difficult to assay large number of proteins (400 yeast proteins)
• Can be difficult to quantify (esp. mass spectrometry)
• Proteins harder to deal with but are more stable
• Proteins are true gene products & are functional entities
• Proteome is also affected by post-translational modifications, protein turnover

Correlation between transcriptome & proteome may be ~0.48-0.76
Proteomics
Proteomics involves protein identification from some experimental data (usually mass
spectrometric) using bioinformatics tools.
[Figure: proteomics workflow. The genome, together with knowledge and prediction, gives a "virtual" proteome, from which a peptide mass and fragment database is built. In the cell, transcription and regulation and post-translational modifications give the real proteome. Separation methods (2D-gels, functional separations, n-dimensional chromatography) reduce it to simple mixtures and single proteins, which are digested to give a peptide mass map (fingerprint) and fragmented to give fragment ion spectra. Bioinformatics identification matches these data against the database.]
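The "virtual proteome to peptide mass database" step can be sketched as an in-silico tryptic digest followed by a mass look-up. The sketch assumes the common simplified trypsin rule (cleave after K or R unless followed by P), no missed cleavages, no modifications, and monoisotopic residue masses; the two protein sequences and the 0.5 Da tolerance are invented for illustration.

```python
# Monoisotopic residue masses (Da); water is added once per peptide.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_digest(protein):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_matches(observed, protein, tolerance=0.5):
    """Number of observed masses explained by the protein's tryptic peptides."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(protein)]
    return sum(any(abs(m - t) <= tolerance for t in theoretical) for m in observed)


# Hypothetical "virtual proteome" and a fingerprint simulated from protA.
database = {"protA": "MKWVTFRPLLK", "protB": "GASPVTKCLINDR"}
observed = [peptide_mass(p) for p in tryptic_digest(database["protA"])]
best = max(database, key=lambda name: count_matches(observed, database[name]))
```

Real search tools score matches statistically rather than simply counting them, but the comparison of measured masses with masses predicted from the genome is the same idea.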
Several high-throughput proteomic applications to protein:protein interactions have been
developed [13-15]. These include:
• Yeast 2 hybrid
• TAP affinity pulldown + mass spec
• Synthetic lethals
• Correlated mRNA expression
• Bioinformatic approaches (see Eisenberg paper).
Bioinformatic approaches have been used to compare and validate all these approaches [15]. They
have shown that some approaches are not all they are cracked up to be. Interacting proteins
should share functional classes (using MIPS or GO-based classifiers).
[Figure: methods should be "centred" on the diagonal.]
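One simple way to apply the "interacting proteins should share functional classes" test is to compute, for a set of reported interactions, the fraction of annotated pairs that have at least one class in common. The protein names, interactions and class labels below are hypothetical, and this fraction is only a sketch of the idea, not the scoring used in the cited comparison.

```python
def shared_class_fraction(interactions, classes):
    """Fraction of interactions (both partners annotated) sharing a class."""
    scored = [(a, b) for a, b in interactions if a in classes and b in classes]
    if not scored:
        return 0.0
    agree = sum(1 for a, b in scored if classes[a] & classes[b])
    return agree / len(scored)


# Hypothetical functional classes (e.g. from a MIPS- or GO-based scheme).
classes = {
    "P1": {"replication"},
    "P2": {"replication", "repair"},
    "P3": {"metabolism"},
    "P4": {"repair"},
}

# Hypothetical interactions reported by one method; P5 and P6 are unannotated.
interactions = [("P1", "P2"), ("P1", "P3"), ("P2", "P4"), ("P5", "P6")]
fraction = shared_class_fraction(interactions, classes)
```

Comparing this fraction across methods, and against randomly paired proteins, gives a crude indication of which data sets are more reliable.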
References (** strongly recommended, * recommended)

7. *Hentschel and Hacker (2001) Pathogenicity Islands. Microbes and Infection 3, 545-548.
8. *Kellis M et al. (2003) Sequencing and comparison of yeast species to identify genes and regulatory
elements. Nature 423, 241-. This paper is quite long, but is an important new development.
9. **Thomas JW et al. (2003) Comparative analyses of multi-species sequences from targeted genomic
regions. Nature 424, 788-. Has the same principles as 8, but is applied to human chromosome 7.
10. *Eisen, Spellman, Brown & Botstein (1998) Cluster analysis and display of genome-wide expression
patterns. Proc Natl Acad Sci USA 95, 14863-14868. A classic microarray paper.
11. *Hampton GM, Frierson Jr HF (2002) Classifying human cancer by gene expression. Trends in Mol
Medicine 9, 5-10. A review paper which covers key concepts.
12. **Ramaswamy S et al. (2001) Multiclass cancer diagnosis using tumor gene expression signatures. Proc
Natl Acad Sci USA 98, 15149-. This contains more detail on methods than 11.
13. Ho et al. (2002) Systematic identification of protein complexes in S. cerevisiae by mass spectrometry.
Nature 415, 180-
14. Gavin et al. (2002) Functional organization of the yeast proteome by systematic analysis of protein
complexes. Nature 415, 141-
15. **von Mering et al. (2002) Comparative assessment of large-scale data sets of protein-protein
interactions. Nature 417, 399-. Read this shorter, critical analysis rather than 13 or 14?
URLs
http://www.ncbi.nlm.nih.gov/BLAST - the BLAST tools
http://www.ebi.ac.uk - general bioinformatics tools, info and databases.
http://genome-www.stanford.edu/Saccharomyces/ - the SGD database
http://www.geneontology.org - Gene Ontology website
http://www.ensembl.org - the European consortium's Human Genome Annotation website
http://www.tigr.org/ - The Institute for Genomic Research
http://www.sanger.ac.uk - the Sanger Centre in Hinxton, Cambs. Lots of genome sequencing stuff.
http://cmgm.stanford.edu/pbrown/ - the Brown Lab homepage at Stanford. Lots of good microarray info
http://industry.ebi.ac.uk/~alan/MicroArray/ - lots of good stuff on gene expression/microarrays at the EBI
http://www.proteome.com - Lots of proteome info. Follow YPD link for yeast stuff.
