Learning outcomes:
In these lectures, we will build on the 2GEN lectures and previous lectures in the 3GNO lecture
series to examine some applications of bioinformatics to functional genomics. At the end of the
lectures you should:
• Be able to describe how bioinformatics is used as a tool to functionally annotate genomes
• Be able to discuss issues in assigning biological function
• Recognise novel bioinformatics techniques currently being developed for comparative and
functional genomics to predict gene function.
• Be able to discuss how bioinformatics approaches can identify targets of clinical interest
• Be able to describe how bioinformatics is used to analyse gene expression data to predict
protein function and in clinical diagnostics.
• Be able to recognise the utility of bioinformatics applications in proteomics data analysis for
functional genomics
Bioinformatics
In second year lectures, we considered how sequence similarity searching (with programs such as
BLAST and FASTA) can find homologous genes and proteins, based on common patterns in
their primary sequences. The function of the unknown gene or protein sequence is inferred from
the function of the database match based on the hypothesis that the proteins share a common
ancestor from which they have both evolved. These methods typically begin to fail around the
"twilight zone" of c.25% pairwise sequence identity beyond which no statistical significance can be
assigned to a match, even when the proteins do share a common ancestral gene. However, many
genes in the yeast genome were seen to fall into this category - roughly 2000 so-called
"orphan" open reading frames - with no orthologues in the databases with functional
annotation. New methods are beginning to appear to extend functional classification of genome
sequences beyond these traditional bioinformatics attempts.
[Figure: genome annotation flowchart. Raw sequence, then assembly into contigs, then find genes
and ORFs, then "Homologue?". If yes: "Functional information?". If no homologue: orphan.]
A schematic of the process involved in annotating a genome sequence is shown above. We hope
we can find many orthologous genes from other genomes whose function is well
characterised.
Terminology: Homologous proteins share a common ancestral gene from which they have evolved.
Orthologous sequences are homologous sequences that are now in different organisms' genomes.
Paralogous sequences are homologues which share sequence and functional similarity but have most
likely arisen from a gene duplication; they may still be in the same genome.
OPT-3GNO. Functional genomics and bioinformatics Semester 2
PSI-BLAST
Position-Specific Iterated BLAST is a more sensitive method1,2. It finds more distant homologues
than standard BLAST, and hence allows more annotations to be made.
[Figure: PSI-BLAST flowchart. (1) BLAST the unknown (new) query sequence against the database.
(2) Align all "hits" with a score better than the threshold. (3) Build a profile. (4) BLAST the
profile against the database. If there are no new hits, STOP; otherwise repeat from step 2.]
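The iteration in the flowchart can be sketched in a few lines. This is a toy illustration of the loop only, not of BLAST itself: it assumes ungapped sequences of equal length, scores by simple identity and column frequencies rather than log-odds scores and E-values, and all sequence names and the 0.55 threshold are made up for the example.

```python
from collections import Counter

def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def build_profile(seqs):
    """Per-column residue frequencies from a set of aligned sequences."""
    profile = []
    for column in zip(*seqs):
        counts = Counter(column)
        profile.append({res: n / len(column) for res, n in counts.items()})
    return profile

def profile_score(profile, seq):
    """Mean profile frequency of the residues observed in seq."""
    return sum(col.get(res, 0.0) for col, res in zip(profile, seq)) / len(seq)

def iterative_search(query, database, threshold=0.55, max_rounds=10):
    """Search with the query, then re-search with a profile until no new hits."""
    hits = {n for n, s in database.items() if identity(query, s) >= threshold}
    for _ in range(max_rounds):
        profile = build_profile([query] + [database[n] for n in sorted(hits)])
        new_hits = {n for n, s in database.items()
                    if profile_score(profile, s) >= threshold}
        if new_hits <= hits:      # if no new hits, STOP
            break
        hits |= new_hits
    return hits

# Hypothetical toy data: "bridge" is only 50% identical to the query.
query = "ACDEFGHIKL"
database = {
    "close1": "ACDEFGHIKV",
    "close2": "ACDQFGHLKL",
    "bridge": "ATDQFGNLKV",
    "unrelated": "WWYYPPMMSS",
}
found = iterative_search(query, database)
```

Here "bridge" is missed by the first pairwise search but is picked up once the profile includes the substitutions seen in the closer hits. The same mechanism explains the main weakness: a false positive admitted to the profile influences every later round.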
Pros                        Cons
Sensitivity                 False positives
Get automatic alignments    Errors are propagated
Applying PSI-BLAST to sequenced genomes shows that 40-50% of a genome matches known structures,
and that most proteins are multi-domain.
[Figure: pie chart of PSI-BLAST homology categories (prokaryotes only; eukaryote and prokaryote;
vertebrate only; no animal homology).]
Gene ontology4
An attempt to unify biology (based on 3 genomes: yeast, fly and worm). There are three ontological
classes (restricted vocabularies or keywords):
• Biological process
• Molecular function
• Cellular component
[Figure panels: Biological process; Molecular function.]
New approaches
• Proteins with common function share common phylogeny
[Figure: 6217 yeast proteins. Link functionally related proteins by: experimental data, related
metabolic function, related phylogenetic profiles, Rosetta Stone method, correlated mRNA
expression. Phylogenetic profiles (e.g. for proteins Y-1 and Y-2) are drawn across four genomes:
E = E. coli, H = H. influenzae, M = M. jannaschii, P = H. pylori.]
These approaches attempt to go beyond prediction by homology and look at other features of
sequences. Two recent Nature papers5,6 epitomise recent advances. Eisenberg and colleagues5 have
attempted to combine lots of approaches, using all the known annotated genome sequences to predict
the function of 6217 yeast proteins. This exploited protein-protein interaction databases,
metabolic pathway databases, phylogenetic analyses, gene-fusion analysis and micro-array
expression data. Results were quite promising (although it's not obvious!): several functional
annotations were made that could not otherwise have been spotted using simple homology methods.
The "Rosetta stone" method, applied also by Enright and co-workers6, shows promise in identifying
likely protein:protein interactions from genome sequences.
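The phylogenetic profile idea (proteins with common function share common phylogeny) can be illustrated with a minimal sketch. Each protein gets a presence/absence vector across the four genomes E, H, M and P, and proteins with matching vectors are proposed as functionally linked. The profile values, the protein "Y-3" and the `max_mismatches` parameter are hypothetical and chosen only for illustration; this is not the published algorithm.

```python
# Genomes in profile order: E. coli, H. influenzae, M. jannaschii, H. pylori
GENOMES = ["E", "H", "M", "P"]

# 1 = a homologue of the yeast protein is present in that genome, 0 = absent.
# Values are invented for the example.
profiles = {
    "Y-1": (1, 1, 0, 1),
    "Y-2": (1, 1, 0, 1),
    "Y-3": (0, 0, 1, 0),
}

def hamming(p, q):
    """Number of genomes in which two profiles disagree."""
    return sum(a != b for a, b in zip(p, q))

def linked_pairs(profiles, max_mismatches=0):
    """Pairs of proteins whose profiles differ in at most max_mismatches genomes."""
    names = sorted(profiles)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if hamming(profiles[a], profiles[b]) <= max_mismatches
    ]

links = linked_pairs(profiles)
```

With only four genomes many unrelated proteins would share a profile by chance, so in practice the signal depends on having many genomes in the comparison.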
References
Hopefully, many of these will be available from online resources. I would suggest you try and consult
at least one paper from each section. Most of them are review type articles that you should be
able to follow. (** = highly recommended reading, should be available from short loans etc. * =
recommended, but more specialist).
1. *Altschul SF & Koonin EV (1998) Iterated profile searches with PSI-BLAST. TIBS 23, 444-447
2. **Jones DT & Swindells M (2002) Getting the most from PSI-BLAST. TIBS. 27, 161-164
3. *Devos D & Valencia A (2001) Intrinsic errors in genome annotation. Trends in Genetics. 17, 429-431
4. **Gene Ontology Consortium. Tool for the unification of biology. Nature Gen 25, 25-28. Short
introduction to the functional annotation problem. See also website.
5. **Marcotte, Pellegrini, Thompson, Yeates & Eisenberg (1999) A combined algorithm for genome-
wide prediction of protein function. Nature, 402, 83-
6. *Enright, et al. (1999) Protein interaction maps for complete genomes based on gene fusion
events. Nature, 402, 86- (Note: 5 and 6 appeared next to each other in the same edition of Nature).
Comparative genomics
Much interest is currently being generated from the new biological insights that can be gained by
comparison of genome sequences using bioinformatics.
Comparative genomics can be used to spot pathogenicity islands7. These are a subset of genomic
islands, which have structurally conserved properties and are acquired via horizontal transfer.
Comparative genomics can also be used to study conserved elements in non-coding regions to
find regulatory sites8,9. (CFTR = cystic fibrosis transmembrane conductance regulator)
[Figure: comparison of a 1.8 Mb segment containing 10 genes, including CFTR.]
Functional genomics
The two principal areas of study are the transcriptome and the proteome. These are the total cell
or tissue content of the messenger RNA and protein respectively. They are distinguished from
the genome by their dynamic nature. They are both context-dependent, and will vary under
different conditions. Hence, these technologies create the possibility to analyse/study:
• gene expression changes during the cell cycle
• gene expression changes to stress
• developmental changes in gene expression
• DNA polymorphism
• comparative studies of pathogenic/normal cells
• function of unknown genes
• protein:protein interactions
• post-translational modifications
Transcriptome studies
At the end of the experiment, a single value is obtained for each gene being investigated. This
value describes either the abundance of message (mRNA) in the cell corresponding to that gene, or
the relative rate of expression of that gene under two different conditions. In the case of the
latter, this can be a comparison between different time points relative to time 0 (t0, the start
of the experiment). This is like the data reported by the Brown lab at Stanford10.
• Multi-dimensional data with a "value" for every gene
ORF/Gene    Value
YAL001w     2.4
YAL002c     1.0
YAL003w     0.2
YAL004w     3.8
...         ...
YNR231w     0.2
YNR232c     3.8
Using a starting point as a reference is a natural way to "normalise" the data, as the expression of
every single gene can be described relative to its expression at t0 or a wild-type.
e.g. ratio = expression at time t / expression at t0. In the example shown, YAL001w is
up-regulated and YNR231w is down-regulated.
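As a small worked example of this normalisation, the sketch below takes the ratios from the example table and labels each gene. The two-fold cutoff and the log2 transform are common conventions assumed here for illustration; they are not stated in these notes.

```python
from math import log2

# Expression ratios (expression at time t / expression at t0) from the example.
ratios = {
    "YAL001w": 2.4,
    "YAL002c": 1.0,
    "YAL003w": 0.2,
    "YAL004w": 3.8,
    "YNR231w": 0.2,
    "YNR232c": 3.8,
}

def expression_ratio(at_t, at_t0):
    """Expression at time t relative to the reference (t0 or wild-type)."""
    return at_t / at_t0

def classify(ratio, fold=2.0):
    """Label a ratio using an assumed fold-change cutoff (default two-fold)."""
    if ratio >= fold:
        return "up-regulated"
    if ratio <= 1.0 / fold:
        return "down-regulated"
    return "unchanged"

# log2 makes up- and down-regulation symmetric about zero.
log_ratios = {gene: log2(r) for gene, r in ratios.items()}
labels = {gene: classify(r) for gene, r in ratios.items()}
```

On the log2 scale an unchanged gene sits at 0, so a gene with ratio 1.0 such as YAL002c is easy to tell apart from genes that moved in either direction.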
Each gene can be represented as a vector (or array, or row) of values corresponding to the
expression profile over the experiments. The "distance" between each gene vector can then be
calculated using some mathematical function (e.g. Euclidean distance or correlation coefficient).
Then, a matrix can be built up of these "distance" metrics for an "all against all" comparison of
the expression profiles. Clustering procedures are then used to build simple dendrograms (trees)
where the genes that are co-regulated should be clustered together.
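The procedure just described can be sketched as follows. This is a minimal illustration under stated assumptions: distance is taken as 1 minus the Pearson correlation coefficient, clusters are joined by average linkage, and the gene names and profiles are invented. It is not the published method, and real analyses would normally use a library implementation.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def distance(x, y):
    """Correlation distance: 0 for perfectly co-regulated, 2 for anti-correlated."""
    return 1.0 - pearson(x, y)

def average_linkage(profiles, members_a, members_b):
    """Mean pairwise distance between the genes of two clusters."""
    total = sum(distance(profiles[a], profiles[b])
                for a in members_a for b in members_b)
    return total / (len(members_a) * len(members_b))

def cluster(profiles):
    """Agglomerative clustering; returns the dendrogram as nested tuples."""
    clusters = [(name, [name]) for name in profiles]   # (tree, member genes)
    while len(clusters) > 1:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(profiles, pair[0][1], pair[1][1]),
        )
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]

# Hypothetical expression ratios over four time points.
profiles = {
    "geneA": [0.1, 0.5, 1.2, 2.0],
    "geneB": [0.2, 0.6, 1.1, 2.1],
    "geneC": [2.0, 1.1, 0.4, 0.1],
    "geneD": [1.9, 1.2, 0.5, 0.2],
}
tree = cluster(profiles)
```

In this toy data geneA and geneB rise together while geneC and geneD fall together, so each pair forms its own branch before the two branches are joined at the root.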
Eisen10 did this for 2467 yeast ORFs, using data obtained by studying the yeast response to the
diauxic shift (during starvation), the mitotic cell cycle, and stress (temperature and reduction
shock). Genes which have similar cellular functions cluster together.
They found several clusters:
• 126 down-regulated in response to stress (ribosomal proteins, transcription and elongation factors)
• proteasome genes (multiprotein complex)
• mitochondrial protein synthesis & respiration & ATP synthesis genes
• glycolytic genes
SVM = support vector machines. Machine learning method for classifying data into 2 classes
(binary classifiers). Since we have multiple classes, we need multiple SVMs to test for each class.
Transcriptome vs. Proteome
There are good reasons to study both the proteome and the transcriptome. Currently the
technological advances enable us to study the transcriptome more easily and extensively, but
ultimately it can be argued that the proteome is the real "working model" for the cell, as the
proteins carry out most of the function and lead to a given phenotype.
Transcriptome
• Easier to assay large number of genes simultaneously
• Easily quantified, higher dynamic range (>1 mRNA per cell is detectable)
• Nucleotides easier to manipulate (recombinant technologies) but are less stable
• mRNA presence does not guarantee presence or level of gene product
Proteome
• Difficult to assay large number of proteins (400 yeast proteins)
• Can be difficult to quantify (esp. mass spectrometry)
• Proteins harder to deal with but are more stable
• Proteins are true gene products & are functional entities
• Proteome is also affected by post-translational modifications, protein turnover
Correlation between transcriptome & proteome may be ~0.48-0.76
Proteomics
Proteomics involves protein identification from some experimental data (usually mass
spectrometric) using bioinformatics tools.
[Figure: proteomics workflow. Genome, then (knowledge + prediction) "virtual" proteome, then
(prediction) peptide mass and fragment database. In parallel: transcription & regulation and
post-translational modifications give the real proteome; separation methods (2D-gels, functional
separations, n-dimensional chromatography) give simple mixtures & single proteins; [digest] gives
a peptide mass map (fingerprint); [fragment] gives fragment ion spectra. Bioinformatics compares
these with the database to give the identification.]
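The "digest" and "peptide mass fingerprint" steps can be mimicked in silico. The sketch below assumes trypsin as the protease (cleaving after K or R, but not before P) and uses standard monoisotopic residue masses; the example sequence, the function names and the 0.2 Da tolerance are hypothetical, and modifications and missed cleavages are ignored.

```python
# Monoisotopic residue masses in Da (standard values for the 20 amino acids).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide

def tryptic_digest(seq):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, res in enumerate(seq):
        next_res = seq[i + 1] if i + 1 < len(seq) else ""
        if res in "KR" and next_res != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides

def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER

def count_matches(observed, protein, tolerance=0.2):
    """How many observed masses fall within tolerance of a predicted peptide."""
    predicted = [peptide_mass(p) for p in tryptic_digest(protein)]
    return sum(any(abs(m - p) <= tolerance for p in predicted) for m in observed)

peptides = tryptic_digest("AKRPGK")   # hypothetical sequence
masses = [peptide_mass(p) for p in peptides]
```

Scoring every protein in the "virtual" proteome with `count_matches` and ranking by the count is the simplest form of fingerprint identification; real search engines use more careful statistics.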
Methods should be
“centred” on diagon