
Integrating forward and reverse proteomics to unravel protein function

Article in PROTEOMICS · October 2006


DOI: 10.1002/pmic.200600211 · Source: PubMed



Proteomics 2006, 6, 5467–5480 DOI 10.1002/pmic.200600211 5467

REVIEW

Integrating forward and reverse proteomics to unravel protein function

Sandrine Palcy1 and Eric Chevet1, 2*

1 Organelle Signaling laboratory, Department of Surgery, McGill University, Montreal, Quebec, Canada
2 Departments of Anatomy and Cell Biology and of Medicine, McGill University, Montreal, Quebec, Canada

To date, proteomics approaches have aimed to identify either novel proteins or changes in protein expression/modification in various organisms under normal or disease conditions. One major aspect of functional proteomics is to identify the biological properties of proteins in a given context; however, forward proteomics approaches alone cannot achieve this goal. Indeed, with the increasing success of such proteomics-based research strategies and the resulting growth in the number of identified proteins with unknown molecular functions, approaches allowing for systematic analyses of protein functions are needed. In this review, we describe how forward and reverse proteomics approaches complement each other in reaching a definite understanding of protein functions. This dual strategy requires a data integration loop which allows for systematic characterization of protein function(s). The details of the integrative process combining both in silico and experimental resources and tools are presented. Altogether, we believe that the integration of forward and reverse proteomics approaches supported by bioinformatics will provide an efficient path towards systems biology.

Received: March 28, 2006
Revised: May 19, 2006
Accepted: May 21, 2006

Keywords:
Bioinformatics / Functional genomics / High-throughput proteomics / Protein

Correspondence: Dr. Sandrine Palcy, McGill Surgery, 687 Pines Av. West, Montreal Quebec H3A1A1, Canada
E-mail: palcy@hotmail.com

* Additional corresponding author: Dr. Eric Chevet
E-mail: eric.chevet@mcgill.ca

Abbreviations: ICAT, isotope-coded affinity tags; SCX, strong cation exchange

© 2006 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.proteomics-journal.com

1 Introduction

Proteomics has become a major discipline not only in biology, where it applies to biochemistry and to cell and molecular biology, but also in pharmacology and toxicology. This discipline, which relies mainly on technological advances in protein separation, identification, sensitivity, and throughput, as well as on bioinformatics, is essentially based on understanding the changes that occur at the level of the proteome in cells, tissues, or organisms [1]. These changes may include not only quantitative parameters such as expression levels but also qualitative aspects such as localization, post-translational modifications (PTM), or complex formation [2]. In addition, the identification of novel proteins also represents an avenue for understanding as yet unclear biological mechanisms.

Such accurate characterization requires not only the capacity to generate quantitative data but also the ability to assess the diversity of the above-described parameters at a proteome-wide scale. The sequencing of an increasing number of genomes, added to the large body of proteomics data, has allowed the generation of reliable databases, which in turn facilitate the acquisition of accurate and significant data at the level of the prime proteome analysis. Quantification is also an important parameter in the design of proteomics experiments, as it provides data that could in turn be used to model specific biological systems.

These approaches, defined as forward proteomics, have been successfully integrated in an increasing number of studies describing proteome-wide model systems [3, 4]. In addition, in several studies this information was also integrated with other types of datasets such as the transcriptome [4] or localizome [5, 6], thus providing more comprehensive representations of defined models and their variations.

However, although these proteome-wide approaches have yielded a very significant amount of biological data informative for specific biological systems, the biological functions of specific individual components of these systems remain difficult to assess in a comprehensive manner. In a context where biological systems can be broken down into simpler parts, it becomes of interest to develop integrated experimental approaches in which these simplified systems are characterized not only by the identification of their protein content (and its variation(s)) but also by the functional characterization of each individual protein identified in relevant experimental models (from in vitro to in vivo; reverse proteomics). A preconceived experimental design that takes into account the nature and composition of the initial biological system will therefore allow major progress in the understanding of the specific functions of each individual component.

2 Description of forward and reverse proteomics

Thus far, proteomics has been mainly considered as an approach allowing for either protein identification from complex mixtures or the characterization of changes in protein expression/PTM levels. This has been named forward proteomics and relies mainly on the power of sample preparation technologies, MS analyses, and bioinformatics to achieve satisfactory goals. However, the understanding of protein function(s) necessarily has to be carried out using recombinant-based methodologies. These functional approaches, which include recombinant expression in various systems, i.e., for protein localization and functional characterization, can be named reverse proteomics when based on an initial biological system. In this context, the biological information associated with each component of the system studied can indicate a specific modularity which, as a consequence, may determine its functionality as a whole. Consequently, forward proteomics is used to define (exhaustively) the components of a given biological system and reverse proteomics to study in an integrated manner the biological characteristics of this system.

2.1 Forward proteomics

This refers to the classical proteomics approach starting from a protein sample and leading to the characterization and identification of its content. The process flow described in Fig. 1 reports a number of critical steps to consider in forward proteomics approaches. Initially, a critical
decision step is made at the level of the separation method, which will condition both the characterization methods and the identification steps (Fig. 1).

Figure 1. Schematic representation of the forward proteomics workflow.

2.1.1 Protein isolation/extraction, separation, and preparation for MS analyses

Regardless of the strategy chosen (global or hypothesis-driven analyses), the nature of the separation method will condition the isolation/extraction methods.

2.1.1.1 Sample collection and processing

First, sample collection should be controlled as strictly as possible. For instance, tissue samples should include, when possible, biopsies from both unaffected and affected regions as well as a representative distribution of certain parameters such as gender [7] and type [7, 8]. In addition, sample collection must be well documented with as much clinically relevant information as possible. When cultured cells are used, highly consistent culture conditions are required. As for any experimental approach, sample preparation is the most critical part of proteomics experiments. This step involves tissue/cell homogenization and/or lysis. However, conditions that maintain protein solubility and are compatible with the downstream methods are frequently the most difficult to achieve. For instance, the nature of the detergents used for membrane solubilization as well as the presence of specific ions (Mg2+ and Ca2+) may be of importance relative to the methods selected for protein separation. Similarly, inhibition of proteases and phosphatases is critical for accurate analyses.

2.1.1.2 Sample prefractionation

In several cases, sample fractionation may be necessary to reduce sample complexity and/or to enrich for a particular sample subset. Typical examples are the isolation of cellular compartments or specific molecular machines. Subcellular proteomics is now a well-accepted methodology to determine the content of cellular compartments. These approaches have led to significant successes [9, 10] and revealed novel and unexpected cellular functions [11]. Besides subcellular fractionation, the isolation of specific molecular machines such as ribosomes, proteasomes, or other large subcellular complexes [12] may provide significant enrichment of their components for MS analyses. Traditionally, subcellular fractionation has been accomplished after cell/tissue homogenization by physical methods such as density gradient centrifugation, but immunomagnetic separations are also used to obtain more highly purified subcellular fractions [13]. Finally, in a number of cases, and more particularly for serum proteomics, various depletion steps have been established in an attempt to reduce the amount of very abundant proteins such as albumin and, as a consequence, decrease the dynamic range of the analysis [14, 15].

2.1.1.3 Protein separation

With or without sample prefractionation, the isolated proteins still represent a mixture too complex to be analyzed directly by MS. Therefore, a protein separation step is applied in most proteomics workflows. Two major protein separation approaches can be undertaken, using either gel-based or nongel-based methods. In addition, a combination of both approaches can also be considered. Gel-based methods include both 1-DE and 2-DE. Although 1-DE still represents a significant separation tool to resolve proteins, proteomics is closely associated with 2-DE because of the separation power and quantification provided by staining and image analysis. However, there are a number of reasons why 2-D gels are less than optimal for proteomics studies, such as poor routine separations at extremes of molecular weight (MW) (limited from 10 to 150 kDa) or pI (limited from 3 to 10), LOD for staining (ng), limited dynamic range for staining (below 1:1000), solubility problems for some integral membrane proteins, and labor-intensive procedures. Although many of these limitations are being addressed, such as membrane protein separation [16] and dynamic range, routine 2-D gels do not represent a universal separation technique for all proteins [17, 18]. Alternatively, protein chromatography represents a very efficient strategy to separate intact proteins before proteolysis. Interestingly, a recent study carried out on serum samples revealed that when the aim of the analyses is to detect as many proteins as possible, the use of protein prefractionation methods coupled with multidimensional protein identification technology (MudPIT; [19]) was the most effective in identifying many proteins and in achieving better coverage of individual proteins. This approach included strong cation exchange (SCX) chromatography on intact proteins followed by trypsin digestion, SCX chromatography, RP-HPLC, and ESI-MS [20, 21]. Additionally, this type of protein fractionation prior to MS analyses is particularly suitable for membrane proteins; however, in this case an additional separation step such as 1-DE could also be carried out [22].

2.1.1.4 Sample preparation for MS analyses

In bottom-up analyses, proteins are digested with proteolytic enzymes such as trypsin (or Glu C, etc.) prior to MS sequencing. These methodologies have been extensively described in previous studies and are well accepted. Another aspect of sample preparation pertains to protein quantitation by labeling either intact proteins prior to peptide generation (stable isotope labeling with amino acids in cell culture, SILAC) [23, 24] or peptides resulting from trypsin digestion (isotope-coded affinity tags, ICAT; or isobaric tags for relative and absolute quantitation, iTRAQ) [25–28]. These approaches allow for the relative comparison of two biological samples (e.g., normal vs. disease or control vs. treated). Indeed, in the case of SILAC, stable isotopes are incorporated into proteins prior to isolation. This occurs by
selective metabolic labeling with 15N-containing amino acids (such as leucine). In the ICAT technique, labeling is performed after protein isolation through alkylation of cysteine groups with either an isotopically heavy reagent that incorporates eight deuterium atoms in place of eight hydrogen atoms (i.e., d8-ICAT) or a light reagent containing the natural distribution of elements (i.e., d0-ICAT). After labeling, protein samples are combined and digested with trypsin. Peptides are fractionated by SCX chromatography and further resolved by microcapillary RP LC before MS analysis using IT, hybrid quadrupole TOF (Q-TOF), or TOF–TOF instruments. Because isotopically labeled peptide pairs are biochemically identical, they coelute during column chromatography [27].

2.1.2 MS sequencing and analyses

At this stage of the approach, several options are still possible. PMF by MALDI-TOF MS of peptide mixtures has the advantage of being very sensitive, automatable for high-throughput measurements, and very accurate for mass determinations. It has been shown that very small amounts of the proteolytic digest are sufficient to provide an excellent MALDI spectrum [29–31]. In the measurement of a tryptic peptide digest, intense peaks from trypsin autoproteolytic fragments may be used as internal standards. In addition, internal peptide standards may also be added to achieve greater mass accuracy. When the peptide mass mapping experiments do not yield unequivocal matches, fragment ion measurements by MS/MS or PSD are carried out. The availability of tandem mass spectrometers with MALDI sources, such as Q-TOF [32, 33] and TOF–TOF [34], also permits the acquisition of more extensive fragmentation for one or more peptides to confirm the peptide mass map results.

Generally, peptide fragmentation spectra are obtained by ESI-MS/MS. Triple quadrupole, quadrupole IT, and Q-TOF are the typical mass analyzers. The fragment ion spectra are acquired in a data-dependent manner (i.e., the presence of a peptide mass above a specified intensity threshold triggers the automated generation of the fragment ion spectrum for that precursor mass). A comparative study of nanospray versus LC-MS/MS showed that in a complex mixture fragment ion spectra could be generated for a greater number of peptides by use of LC-MS/MS rather than nanospray [35]. The use of LC-MS/MS for complex peptide mixtures has been enhanced in a number of ways. Peptide separation and identification can be enhanced further by increasing the ability of the mass spectrometer to perform MS/MS on every component. Normally the transient nature of components eluting from LC columns limits the number of coeluting peptides that can be selected and fragmented by MS/MS. For peptide relative quantitation, the mass spectrometer toggles between survey scans (MS) and sequencing scans (MS/MS). The ratio of ion intensities from coeluting labeled peptides is determined in the survey scan and defines the ratio between parent proteins in the starting samples. MS database searching will be discussed in Section 3.

2.2 Reverse proteomics

Whereas in classical proteomics the starting material is isolated proteins, which are analyzed via MS techniques and identified using complete genome sequences, in reverse proteomics the starting point is the genome sequence of an organism. First, the transcriptome and proteome are predicted in silico, and subsequently this information is used to generate reagents for gene functional analysis. This experimental strategy includes several steps such as cDNA cloning, protein expression (several systems from Escherichia coli to mammals are available), and specific functional assays based on gene function annotations. When reverse proteomics is applied downstream of a forward proteomics pipeline, the experimental design will be focused on the genome subset corresponding to the previously identified gene products.

2.2.1 Cloning

Historically, expression clone collections have relied on cDNA libraries cloned as a pool into specific vectors. To generate such libraries, mRNA collected from an appropriate source is reverse-transcribed to DNA and then introduced into a vector. Complementary DNA libraries have found considerable success in the study of prokaryotes and simple eukaryotes, but several issues such as the presence of the 5’ and 3’ untranslated regions (UTRs) have limited their application for genetic studies in more complex species (metazoans). In addition, the translational reading frame of the tag with respect to the coding sequence is not known, and the UTR sequences themselves may contain in-frame stop codons. Therefore, it becomes of interest to create clones that contain only coding sequences, often referred to as ORF clones. Typically, four types of starting material may be used as template to generate clones: (i) genomic DNA; (ii) first-strand cDNA; (iii) cDNA library; or (iv) pre-existing sequence-validated full-length cDNA clones (such as those available from the Mammalian Gene Collection (MGC; [36–38]) or RIKEN [39, 40]). Genomic DNA is the best template, because all genes are equally represented, but it is only applicable to organisms with little or no splicing, limiting it to prokaryotes and simple eukaryotes.

Nevertheless, the challenges associated with the creation of ORF clones are clearly outweighed by the advantages of their use, such as the opportunity to execute parallel experiments in which all or a subset of genes in a genome can be tested in a controlled setting under identical conditions. Moreover, since their identity is known in advance, information regarding all clones can be recorded.
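
To illustrate what distinguishes an ORF clone from a raw cDNA insert, the following minimal Python sketch trims a cDNA sequence down to its longest open reading frame, discarding the flanking UTRs. The function name `longest_orf` and the toy sequence are hypothetical and serve only as an illustration; actual ORFeome projects rely on curated gene models rather than on a naive scan for ATG and stop codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf(cdna: str) -> str:
    """Return the longest ATG-to-stop ORF (stop codon included) on the forward strand."""
    seq = cdna.upper()
    best = ""
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the frame set by this ATG until a stop codon is met.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                orf = seq[start:pos + 3]
                if len(orf) > len(best):
                    best = orf
                break
    return best

# Toy cDNA (made up): 5' UTR + coding sequence + 3' UTR.
cdna = "GGCTAGCC" + "ATGGCTAAAGGTTGA" + "CCTTAATAAA"
orf = longest_orf(cdna)
print(orf)
```

In this toy example the 5’ UTR contains a stop codon (TAG), the kind of feature that, as noted above, can interfere with the expression of a tagged protein when UTRs are retained in the clone.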


To date, besides the earliest comprehensive ORF clone collections, which were constructed with gap-repair cloning, a method of cloning that uses homologous recombination in the yeast Saccharomyces cerevisiae [41–43], the best example of a reverse proteomics cloning platform for metazoan genomes was provided by the work initiated by the Vidal lab [44–46], where ORFs were deduced from the analysis of complete genome sequences. This was first demonstrated by the generation of the Caenorhabditis elegans ORFeome [47], and since then a significant number of similar projects have followed, including the Arabidopsis thaliana, Sinorhizobium meliloti, Brucella melitensis, and human ORFeomes [48–51]. These ORF collections can now be used in the assays described in Section 2.2.2.

2.2.2 Expression and functional assays

These experimental systems will provide the basic recombinant material to conduct further protein characterization studies including protein functional assays (see Fig. 2).

2.2.2.1 Expression systems

These experiments can be carried out in vitro or in vivo, using expression systems ranging from bacterial to mammalian. The expression of recombinant protein can serve several purposes such as antibody production or structural or functional studies. At this stage, the nature of the protein (size, presence of transmembrane domains, acidic or basic stretches, and whether the protein is normally secreted) and the application chosen must be considered in the experimental design, including the choice of the expression system. For instance, in the case of antibody production or structural studies, significant amounts of recombinant protein from specific ORFs must be generated. Furthermore, proteins containing transmembrane regions will most likely be insoluble in bacterial expression systems. This type of problem can be solved by selecting eukaryotic expression systems or by expressing protein fragments/domains, the latter representing a simpler solution. Similarly, for structural genomics projects, protein solubility and folding are of course of major interest [52].

Figure 2. Schematic representation of the reverse proteomics workflow.


Today, there is a large variety of expression systems available for recombinant protein production, each having its own advantages in terms of cost, ease of use, and PTM profile. (1) Traditional cell-free expression systems have always been limited by low expression levels. However, the use of a eukaryotic-based approach coupled with the rapid, cell-free utility offered by the system has a number of clear advantages and offers a viable alternative to traditional approaches [53]. (2) Bacterial expression is probably the most commonly used system for the production of recombinant proteins. However, since it is a prokaryotic-based system, heterologously expressed eukaryotic proteins are not modified correctly. Furthermore, proteins expressed in large amounts can precipitate, forming inclusion bodies, and large complex proteins may also be difficult to propagate. (3) Baculovirus-mediated insect cell expression has become one of the most popular vehicles for the production of large quantities of recombinant protein for structural and functional studies. Baculovirus protein expression is a eukaryotic-based expression system and thus offers protein modification and processing patterns similar to those in higher eukaryotic cells. However, a major limitation of the baculovirus expression system has always been that the N-glycosylation pathway of insect cells, specifically those of Sf9 and Sf21 (the most commonly used cell lines), is different from that found in mammalian cells [52]. (4) Yeast, in particular Pichia pastoris, offers a very powerful option for secretory proteins and an alternative to baculovirus or mammalian expression systems, specifically with regard to the generation of large quantities of secreted material [52]. (5) Finally, the expression of proteins in a mammalian background offers many clear advantages over their generation in E. coli or insect cells, including correct PTM and folding. However, although the use of mammalian cells such as CHO or HEK293 is well documented, the process of creating stable mammalian cell lines can often be laborious and time consuming [52].

2.2.2.2 Functional assays

These assays will contribute to the functional characterization of gene products. This information can be integrated at genome-wide levels and therefore provide functionally relevant annotation of the gene products studied.

Subcellular localization: Determination of the subcellular localization and dynamics of proteins is a very important first step toward the understanding of their cellular function. For instance, proteins of a given functional network must colocalize in order to interact with each other, and protein relocalization in response to external stimuli may reflect changes in activities. Systematic experimental pipelines to determine the subcellular localization of proteins have been set in place in various systems. The most comprehensive studies have been carried out in yeast, where systematic tagging of the endogenous ORF was performed with GFP [54, 55]. In addition, in mammalian systems several initiatives have also been undertaken (http://www.dkfz.de/LIFEdb; [56–58]) in which specific ORFs were systematically tagged with fluorescent protein encoding sequences and transfected into various cell lines. A major limitation of these projects is the presence of the GFP tag on the original protein sequence, which can lead to protein mislocalization. An alternative to this approach is provided by the systematic use of antibodies, as illustrated by the protein atlas project ([59–61]; www.proteinatlas.org).

Protein–protein interactions: The characterization of protein function by the identification of specific functional partners has been tackled by the establishment of protein–protein interaction networks at a genome-wide level. The use of yeast two-hybrid allowed the establishment of the first interaction map of a complete organism, S. cerevisiae, in 2001 [62]. The evolution of this technology has then allowed the establishment of genome-wide protein–protein interaction maps in C. elegans [63], Drosophila melanogaster [64, 65], more recently in human [66], and soon in mouse. Similarly, approaches using MS were used to identify and characterize protein complexes. These strategies involve the use of tandem affinity purification (TAP) tags [67], which allow the elimination of the vast majority of contaminants usually found in classical immunoprecipitation studies. Comprehensive genome-wide protein–protein interaction maps were generated in yeast using this approach [68], as well as several others in mammalian systems but at lower scales [69].

Enzymatic activities: The systematic analysis of enzymatic activities at a genome scale requires the development of enzymatic assays allowing for high-throughput studies. Thus far, such assays have been developed mainly to screen for compounds altering the activity of a given protein/enzyme [70–72]. However, in the case of reverse proteomics, the objective would be to systematically identify substrates for a family of enzymes, which requires recombinant protein expression systems able to handle the expression of a large spectrum of proteins, not necessarily in large quantities (mg). Such studies have been carried out in S. cerevisiae, where the yeast kinome was studied using a protein chip technology [73]. In addition, several other assays have also been developed to allow high-throughput studies of small [74] or trimeric G proteins [72]. Therefore, the development of assays allowing for the analysis of enzymatic activities (either intrinsic properties or substrate identification) represents another step in the characterization of proteins in a reverse proteomics pipeline.

Gene silencing: These approaches contribute importantly to the understanding of molecular functions related to specific phenotypes. These studies, initially performed using gene deletion strategies in various organisms, have more recently included RNA interference approaches. Specific single or double gene knockouts have been carried out in a systematic fashion in S. cerevisiae, and the impact of these deletions on cell survival [75, 76] and other specific phenotypes has been evaluated. Similarly, RNA interference has become in the past 5–10 years a major technology in functional genomics, providing specific gene silencing and scalable to
high-throughput [77]. The first genome-wide approaches were carried out on the worm C. elegans [78]. Since then these high-throughput approaches have been extended to other systems including D. melanogaster or mammalian cells in culture [79–83]. Analyzed in the context of specific phenotypes (which can be molecular as well), these systematic approaches have provided a wealth of functional information on the genes studied.

Integration of the above described datasets to unravel protein function: Several examples are now available where, either at the level of a single protein or at the level of a large number of proteins, the integration of datasets providing heterogeneous information has allowed the identification of protein functions. For instance, this was the case for a small GTP binding protein of the Ras superfamily identified by sequence homology in the C. elegans genome [84]. By collecting information on its subcellular localization, its tissue distribution in the worm, and its activity, it was possible to postulate a potential biological role in the maintenance of the apical polarity in epithelial cells. Similarly, but at a much larger scale, Gunsalus et al. [85] reported the establishment of phenoclusters in the worm based on the integration of heterogeneous datasets. This process has allowed the understanding of C. elegans early embryogenesis at a system level.

2.3 Integration of forward and reverse experimental pipelines

Individually, forward and reverse proteomics are two approaches with very high discovery potential for protein characterization. However, we can postulate that the systematic integration of both strategies (see Fig. 3) may represent an additional advance in the understanding of protein functions in a specific biological context. To integrate these two experimental pipelines as described in Fig. 3, several parameters must be considered, including a very strict definition of the initial biological sample and a data integration pipeline between the forward and reverse steps. A very elegant example of such an experimental strategy has been recently provided by Skop et al. [86], who purified mammalian midbodies, identified the components using a forward proteomics approach, and tested the functionality of these components in a reverse pipeline using C. elegans as a model system, thus revealing conserved cytokinesis mechanisms.

The definition of the biological sample must include information about its nature (model and experimental conditions). Additional information about similar samples from either the literature or publicly available datasets must be collected and used for initial sample annotation. Then, the list of proteins resulting from MS analyses will constitute the final information to condition the reverse approach. Although the forward workflow is not affected in such an integrated pipeline, the strategy for collecting and storing the information must be compatible with downstream integration with the reverse step. The actual constraint is therefore applied to the reverse pipeline. Indeed, the limitation is no longer imposed by the identification of ORFs in a given genome but by the list of proteins identified in the forward approach. This list must be analyzed (see Section 3) to provide sequence feature information (e.g., presence of transmembrane domains) and eventually functional annotation. In addition, the nature of the initial biological sample will also condition the nature of the assays which will be undertaken for the functional evaluation of the identified components. Finally, the data harvested through the reverse pipeline will complement the initial knowledge collected on the biological sample and lead to a comprehensive analysis of the system studied.

3 Data analysis and integration

Forward and reverse proteomics analyses cannot be performed without the assistance of bioinformatics resources for data harvesting, management, storage, and analysis. Throughout all the steps, bioinformatics will support the analytical workflow and will provide the appropriate interface for the interpretation of the data in a biologically relevant context.
3.1 Protein identification

Protein identification via MS is achieved using searching


against molecular sequence databases (see Table 1). In brief,
peptide or peptide fragment masses are matched to theoret-
ical masses predicted from in silico digested protein sequen-
ces [87]. Three major criteria have to be considered when
undertaking protein identification using MS data: (i) the
quality of the protein identification (as indicated by match
scoring and protein sequence coverage), (ii) the total number
of identified proteins, and (iii) the percentage of false posi-
tive. These criteria are not only influenced by the quality of
the MS data acquired but also by the proficiency of the
Figure 3. Integration of forward and reverse proteomics. bioinformatics resources.
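
The matching principle described in Section 3.1 (observed peptide masses compared with theoretical masses from an in silico digest) can be illustrated with a minimal sketch. It is not the scoring scheme of any search engine cited in this review: it assumes trypsin-like cleavage (after K or R, not before P), monoisotopic residue masses, and a simple count of matched masses as the score. The function names, the tolerance value, and the toy sequences are illustrative assumptions.

```python
"""Toy peptide mass fingerprinting: in silico digest plus mass matching."""

# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the sum of residue masses


def tryptic_digest(sequence, missed_cleavages=0):
    """Return the set of peptides from a trypsin-like in silico digest."""
    # Cleavage sites: after K or R, unless the next residue is P.
    sites = [i + 1 for i, aa in enumerate(sequence[:-1])
             if aa in "KR" and sequence[i + 1] != "P"]
    bounds = [0] + sites + [len(sequence)]
    peptides = set()
    for i in range(len(bounds) - 1):
        last = min(i + 1 + missed_cleavages, len(bounds) - 1)
        for j in range(i + 1, last + 1):
            peptides.add(sequence[bounds[i]:bounds[j]])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def pmf_score(observed_masses, proteins, tol=0.2, missed_cleavages=1):
    """Rank proteins by the number of observed masses they explain.

    proteins maps an entry name to its sequence; tol is in Da.
    """
    hits = {}
    for name, seq in proteins.items():
        theo = [peptide_mass(p) for p in tryptic_digest(seq, missed_cleavages)]
        hits[name] = sum(
            1 for m in observed_masses if any(abs(m - t) <= tol for t in theo)
        )
    return sorted(hits.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical two-entry "database" and a simulated peak list.
    database = {"protA": "ACDEFGHIKLMNPQRSTVWY", "protB": "GGGGKAAAARSSSS"}
    peaks = [peptide_mass(p) for p in tryptic_digest(database["protA"])]
    for entry, matched in pmf_score(peaks, database):
        print(entry, matched)
```

Counting matched masses is deliberately crude; the scoring models discussed later in the text weigh matches probabilistically and account for modifications, which this sketch ignores.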

© 2006 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.proteomics-journal.com


5474 S. Palcy and E. Chevet Proteomics 2006, 6, 5467–5480

Table 1. Molecular sequence databases

The International Nucleotide Sequence Database Collaboration (INSDC), http://www.insdc.org/


Name Database features Description Number of total
(or selected) entries

Nucleotide and genome databases


GenBank (NCBI) General This database contains a collection of publicly available DNA sequences for 54 584 635
Comprehensive more than 205 000 organisms. The sequence information is obtained from February 06
Minimal annotation individual laboratories and batch submissions from large-scale
Redundant sequencing projects. The records include annotation using a standard
Noncurated set of biological terms and crosslinks to taxonomy, genome, mapping,
protein structure, and domain information as well as the literature
database PubMed.
URL: http://www.ncbi.nlm.nih.gov/
dbEST (NCBI) General dbEST is the division of GenBank that contains “single-pass” cDNA 32 889 225
Minimal annotation sequences, or ESTs, from a number of organisms. January 06
Redundant URL: http://www.ncbi.nlm.nih.gov/
Noncurated
EMBL (EBI) General The EMBL Nucleotide Sequence Database contains DNA and RNA 64 739 883
Comprehensive sequences collected from individual researchers, genome sequencing December 05
Minimal annotation projects, patent applications and more.
Redundant The database is highly integrated with other data resources offered at
Noncurated the EBI and elsewhere (URL: http://www.ebi.ac.uk/embl/)
DDBJ DNA (Data Bank General This database contains DNA sequences including cDNAs, ESTs, genomic 52 272 669
of Japan) Comprehensive DNAs, sequence tagged sites (STS), and synthetic constructs December 05
Minimal annotation (URL: http://www.ddbj.nig.ac.jp)
Redundant
Noncurated
Genomic databases
NCBI Entrez Genome General This general genomic resource includes genomes of over 1000 organisms. 5202
Comprehensive Complete and partial genomes are included March 06
Moderate annotation (URL: http://www.ncbi.nlm.nih.gov/)
Redundant
Noncurated
MGD (The Jackson Specific The Mouse Genome Database is a highly curated and annotated resource 30 927 (genes)
Laboratory) Comprehensive on mouse genome which presents an integrated view of genotype March 06
Extensive annotation (sequence) to phenotype information including genes and gene
Nonredundant products. MGD includes homology data for mammalian orthologs,
Curated alleles and targeted mutation reports, and more
(http://www.informatics.jax.org)

Protein databases
NCBI Entrez Protein General The database is a compilation of translated nucleotide sequences from the 8 212 494
Comprehensive annotated coding regions in GenBank and RefSeq as well as protein March 06
Moderate annotation entries from Swiss-Prot, Protein Information Resource (PIR), Protein
Redundant Research Foundation (PRF), and Protein Data Bank (PDB). Protein
Noncurated sequence records in Entrez are linked to the other NCBI resources such
as precomputed protein BLAST alignments, protein structures,
conserved protein domains, nucleotide sequences, genomes, and genes
(URL: http://www.ncbi.nlm.nih.gov/entrez)
NCBInr General Nr is a composite database which contains all nonredundant GenBank 3 479 934
Comprehensive coding sequence translations, RefSeq Proteins, PDB, Swiss-Prot, PIR, March 06
Moderate annotation and PRF entries. The sequences in nr are crossreferenced to their
Nonredundant initial databases in order to avoid duplication
Noncurated (URL: http://www.ncbi.nlm.nih.gov)
RefSeq Protein (NCBI) General The database provides reference protein sequences. Records are compiled 2 273 764
Moderate annotation using a combined approach of collaboration, automated methods, January 06
Nonredundant prediction, and curation. They are highly integrated with other NCBI
curated resources (URL: http://www.ncbi.nlm.nih.gov/RefSeq)

Swiss-Prot (Swiss-Prot General Swiss-Prot Protein Knowledgebase is a manually curated protein sequence 211 104
group and SIB) Extensive annotation database which provides a high level of annotation (including protein March 06
Nonredundant function, domain structure, PTMs, and variants). All sequence reports
curated corresponding to the same gene product are merged into a single entry
which is currently crossreferenced with about 60 different databases.
Protein records are annotated and reviewed by biologists to ensure that
the database is of a high quality (URL: http://www.expasy.org/sprot)
TrEMBL (Swiss-Prot General This database is a computer-annotated supplement of Swiss-Prot that 2 638 494
group and SIB) Extensive annotation contains all the translations of INSDC entries, not yet integrated in March 06
Nonredundant Swiss-Prot. Besides, the collection includes protein sequences extracted
curated from the literature or direct submission. As in Swiss-Prot, information
redundancy is avoided by merging several sequence records of the same
gene product. Together, Swiss-Prot and TrEMBL constitute a
comprehensive resource for all the publicly available protein sequences
(URL: http://ca.expasy.org/sprot/)
PIR-PSD (Georgetown General The Protein Information Resource Protein Sequence Database, the world’s 283 009
University Medical Comprehensive first database of classified and functionally annotated protein sequences. December 2004
Center) Extensive annotation The resource compiles protein sequences, organized by superfamily and
Nonredundant family, and annotated with functional, structural, bibliographic, and
curated genetic data. As all PIR-PSD entries have been merged into the UniProt
databases, it is no longer updated (last release in December 2004,
URL: http://pir.georgetown.edu/pirwww/dbinfo/pir_psd.shtml).
The database is extensively crossreferenced with INSDC collection,
PubMed and MEDLINE IDs, and many other databases
UniProt (UniProt General The Universal Protein Resource is the most comprehensive collection of 2 812 716
Consortium-EBI, SIB, Comprehensive protein sequences today. It is a joint resource of protein sequence February 06
and PIR) Extensive annotation repository and functional information. The resources include three
Nonredundant components: (i) UniProt Knowledgebase (UniProtKB) in which the protein
Curated sequence and functional information from Swiss-Prot, TrEMBL, and
PIR-PSD entries are merged, (ii) UniProt Reference Clusters (UniRef)
which provides a nonredundant view of the resources using several levels
of sequence identity and (iii) UniProt Archive (UniParc), the most
comprehensive and nonredundant collection of publicly accessible
protein sequences from Swiss-Prot, TrEMBL, PIR-PSD, EMBL, Ensembl,
International Protein Index (IPI), PDB, RefSeq, FlyBase, WormBase and
the patent offices in Europe, the United States, and Japan
(URL: http://www.pir.uniprot.org/)
Example of specific protein databases
Brenda (University Specific BRENDA is a comprehensive resource on functional and molecular 3996 (EC numbers)
of Cologne) Comprehensive information of enzymes, based on primary literature. It contains all January 04
Extensive annotation enzymes classified according to the EC system of the Enzyme
Nonredundant Nomenclature Committee (IUBMB). The database content includes
Curated enzyme sequence, structure, specificity, stability, reaction parameters,
and isolation data (URL: http://www.brenda.uni-koeln.de)
Histone Sequence Specific This database is a collection of histone-fold containing sequences derived 1378 (unique
Database (NHGRI/ Comprehensive from sequence-similarity searches of public databases. Sequence sets sequences, all
NCBI) Minimal annotation are presented in redundant and nonredundant, linked to GenBank histone classes)
Curated sequence files. For each class of eukaryotic histone, sequence- and January 05
text-based queries have been used to extract all relevant entries from
NCBI's Entrez protein sequence database
(URL: http://genome.nhgri.nih.gov/histones/)

Sequence databases are designed for the direct archiving of experimental results. They are termed primary databases and contain
more or less associated information such as literature references, functional annotation and crossreferences to other sequence data-
bases.
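
Table 1 describes NCBInr as a composite, nonredundant database whose sequences are crossreferenced to their initial databases. The idea can be sketched as follows, assuming that redundancy means exact sequence identity and that each input record carries its source database and accession. The function name and the accessions in the usage example are hypothetical and do not reflect the actual construction of any listed resource.

```python
def build_nonredundant(records):
    """Collapse identical sequences into single entries.

    records: iterable of (source_db, accession, sequence) tuples.
    Returns a dict mapping each unique sequence to the list of
    (source_db, accession) crossreferences that share it.
    """
    merged = {}
    for source_db, accession, sequence in records:
        key = sequence.strip().upper()  # normalise case and stray whitespace
        merged.setdefault(key, []).append((source_db, accession))
    return merged


def redundancy_ratio(records):
    """Fraction of input records removed by collapsing duplicates."""
    records = list(records)
    if not records:
        return 0.0
    return 1.0 - len(build_nonredundant(records)) / len(records)


if __name__ == "__main__":
    # Hypothetical accessions, for illustration only.
    entries = [
        ("GenBank", "AAA00001", "MKTAYIAK"),
        ("Swiss-Prot", "P00001", "MKTAYIAK"),
        ("PIR", "A00001", "MKVLAAGG"),
    ]
    for seq, xrefs in build_nonredundant(entries).items():
        print(seq, xrefs)
```

Keeping the crossreferences with each merged entry preserves the link back to the source databases while reducing the number of sequences to be searched.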


3.1.1 Molecular sequence databases

The choice of the database is an important factor which can influence protein identification. Several aspects of the database have to be taken into consideration: (i) the source of the information, (ii) the nature and specificity of the information, and (iii) the size of the database. In protein databases, the sequence information may have several sources. Some protein sequences result from Edman degradation of purified gene products. Others are sequences from functional cDNA cloning and mRNA characterization. Finally, a significant part corresponds to putative protein sequences deduced from genome annotation and EST databases. The degree of uncertainty regarding the exact protein coding sequence varies among these different sources. Indeed, information on the size of the gene product, translation start site, reading frame, and occurrence of alternative splicing could be more or less accurate [88]. To compensate for the lack of confidence in certain sources, sequence validation strategies have been put in place to automatically [89] or manually validate the available information [90].

Molecular sequence databases can be classified as either “general” or “specific”. The general databases contain all types of protein sequence entries from several organisms/species, such as Swiss-Prot, TrEMBL, RefSeq (NCBI), and PIR [91]. In contrast, specific databases contain protein entries relative to an organism/species, a category of protein (such as enzymes) or a protein superfamily (such as histones) (see Table 1). Furthermore, a clear distinction has to be made between databases which are only sequence repositories (e.g., GenBank and TrEMBL) and annotated/curated databases (e.g., PIR, Swiss-Prot, and RefSeq). Finally, a new generation of databases, named “metadatabases”, are databases of databases which group under a unique entry the information found in several resources (e.g., UniProt).

Most of the global approaches to forward proteomics use general protein databases for protein sequence identification, with or without species restriction, depending on genome data availability [92, 93]. This is due to the fact that the comprehensiveness of the selected databases is critical for guaranteeing a high number of protein identifications. However, the direct use of sequence repositories is generally avoided due to their high level of redundancy [91]. Database redundancy is a real problem which affects the searching time. Moreover, the amino acid distribution in the database will also be different and this will impact on some scoring models [94]. Nevertheless, sequence repository databases allow the rapid availability of sequence information to the community, and when an appropriate filter is applied to decrease the information redundancy, these refined sources of protein sequences become a powerful tool for protein identification (e.g., NCBInr). When undertaking a more specific project, the use of curated protein databases is of great interest. Indeed, in addition to an absence of redundant sequences, these resources present the advantage of being well annotated and crossreferenced to other databases such as 3-D structures, 2-D gels, protein–protein interactions, protein families, or literature databases.

Searches using genomic or EST databases can identify additional protein species which have not yet been incorporated in protein databases [95]. However, this source of data presents several challenges. First, nucleic acid sequences require translation into the six possible reading frames prior to being used for searching with mass spectra. This considerably extends the time required for searches. Second, genomic data, in the case of eukaryotes, contain a mixture of introns and exons, and a peptide sequence may overlap two different exons, which will prevent its identification due to the lack of a contiguous sequence in the genomic database [96].

When using large databases, the impact on the identification (positive match) and its quality (e-value) must be taken into consideration. The size of the database can indeed affect the positive identification of unique peptides in shotgun approaches [97].

A general issue with the use of any molecular sequence database for peptide searching is that none of the PTMs encountered in the biological context or chemical modifications occurring during the protein separation will be taken into account. While chemical modifications due to the experimental processes are well identified [98] and therefore easily predictable, the nature and number of the PTMs represent a real challenge for global approaches. Therefore, specific analytical workflows (using methods of sample isolation and/or MS analysis, see Section 2) should be applied for PTM detection at a large scale [99]. Otherwise, it would be extremely difficult to identify the modified peptides due to poor abundance and limitations in the database searching. As a consequence, a list of unassigned and incorrectly assigned peptides would be returned. This is a major concern of the field that some groups have attempted to address [100, 101].

3.1.2 MS and MS/MS database searching

Comparison of MS data to in silico digested protein databases is performed using bioinformatics tools commonly named “MS search engines”. Several algorithms for peptide identification are used in forward proteomics approaches such as SEQUEST, MASCOT, and ProteinProspector [94]. The principle of these algorithms is to match molecular mass values of peptides or peptide fragments to theoretical mass values predicted from protein sequence databases. When several peptides match the same protein entry, they are clustered together. The protein entry is then reported as a protein hit.

Two main methodologies have been developed for protein identification using MS (see Section 2): (i) PMF (using MS data) and (ii) peptide mass tag (using MS/MS data). The choice of one or the other method is determined by the protein sample complexity and the database to be searched. PMF cannot be applied when the protein mixture is too complex or when searching against a genomic or EST database [95]. Similarly, the choice of the MS search engine will
influence the list of the validated peptide matches. All search engines will return peptide matches based on the criteria set up by the investigator such as database used, mass tolerance (according to the mass accuracy of the mass spectrometer used), ion charge state, chemical modifications (depending on protein separation methods) and PTMs, number of proteolytic miscleavages, etc. However, the validity of the peptide matches and corresponding protein hits is controlled by a scoring model which varies among the different search algorithms. The quality of the search is then defined by the number of false positives and unassigned MS spectra. Clearly, the choice of the MS search engine is a determinant for the success of a proteomics analysis, and approaches have already been developed which combine results from different search engines for confirmation of peptide assignment and higher comprehensiveness [102].

3.2 Protein annotation

Protein annotation provides the basic elements for understanding the biological relevance and context of the identified proteins. The proteins are annotated with information derived either from experimental data or prediction via sequence similarities [103]. Protein annotation provides information ranging from protein sequence features to elucidated biological function(s) and partner(s), including tissue distribution, subcellular localization, and involvement in pathologies. In the context of forward proteomics, where the datasets generated are protein sequences, this step mainly implies the annotation of the primary structure. A large panel of bioinformatics tools is now available to analyze sequence homologies, detect conserved domains, motifs, and signatures as well as predict topology, subcellular localization, and 3-D structures. The analysis of protein sequence homologies using global sequence alignment tools such as CLUSTALW [104] or MULTALIN [105] returns extremely useful information about gene product conservation among species and the occurrence of a close homolog. The detection of a high degree of homology with well-characterized proteins provides a good link to the putative function of unknown proteins [106]. Furthermore, the use of local alignment tools such as BLAST [107], PSI-BLAST [108], or SIM [109] informs on domain and pattern homology. This may also highlight new conserved domains or signatures and as such uncover new protein families.

Further information to predict protein function can also be obtained by scanning protein sequences against a sequence feature database such as Pfam, Prosite, Smart, and Blocks. The metadatabase InterPro (an excellent resource of integrated documentation) unites the information of 11 resources for protein sequence features [110] such as protein families, conserved domains, and functional sites. Its associated tool, InterProScan [111], scans sequences against the InterPro database members and returns an integrated protein sequence analysis. Prediction of transmembrane domains and topology (Psort [112] and TMHMM [113]) is a critical element of protein function and is of high importance for the design of subsequent experiments in the reverse proteomics phase. Similarly, the prediction of subcellular localization (Psort and SignalP [114]) can also provide additional information on protein function and how it should be assessed experimentally. For all the tools mentioned above, the amino acid sequence is sufficient to obtain some information. However, the prediction of 3-D structure from protein sequence is a harder task to achieve. Nevertheless, model prediction can be obtained with knowledge-based algorithms such as ProMod [115]. The comparative modeling method allows the prediction of 3-D structures of proteins sharing >30% of sequence identity with a protein with a known 3-D structure [116]. For well-characterized proteins, reliable annotation information can simply be retrieved from curated databases such as UniProt [117].

One important aspect of protein annotation is the use of controlled vocabulary; otherwise the use of the information would lack homogeneity and crossdatabase searches would become extremely difficult. The Gene Ontology Consortium has produced the first and most widely used controlled vocabulary index in genomics [118]. The GO database has become an essential annotation resource for describing the roles of genes and gene products in any organism. The GO terms are used by several collaborating databases, which facilitates automatic transfers of annotation. Recently, GO Engine, a computational platform for GO annotation, has been developed and used for large-scale annotation of human proteins [119].

3.3 Organizing the data

Forward proteomics analyses return such a large amount of data that their direct interpretation or utilization in further analytical studies is extremely difficult. To overcome this problem, data clustering methods can be applied in order to turn these large datasets into manageable and meaningful subsets. Indeed, using the annotation information, proteins identified through the forward step can then be classified based on similar attributes (sequence, localization, function, etc.).

When combining forward and reverse proteomics approaches, it becomes of high importance to partition the identified proteins into functional subsets or clusters which will provide the framework for the design and selection of the subsequent functional assays. As for the annotation step, the clustering steps cannot be undertaken manually when the scale of the analysis is too large and automated processes have to be implemented [120].

3.4 Data integration

In the forward–reverse proteomics pipeline, the genes corresponding to the proteins identified via MS analysis (forward step) will then be subjected to functional analysis (reverse step). Due to crossreferences between protein and
genome databases, DNA sequence information is easy to access and, in most cases, is linked to either gene collections or ORFeomes. Therefore, the corresponding DNA material will be either simply retrieved from DNA collections or cloned. Once expressed, the subsequent recombinant proteins will be tested according to their functional annotation.

The new set of data collected through the reverse proteomics analysis will expand the knowledge on the proteins. This information will therefore constitute a new source of protein annotation which could be used either for another run of reverse proteomics analysis or for setting new analytical conditions for the forward proteomics workflow.

4 Conclusions

The integration of the data collected in a linked fashion [121], using systems which allow for querying and navigating multiple databanks, or crossreferences between entries from different databases, and those generated experimentally represents the next challenge to complete a system representation of the biological model studied. As a consequence, a systems biology type of integration should be based on modeling the initial biological system by integrating information collected at the level of each of its individual components, thus revealing new knowledge through the emergent properties of the model.

We apologize to those whose work was not cited in this review due to space limitation. We thank the Chevet lab for critically reviewing this manuscript. E. C. is a Junior Scholar from the Fonds de la recherche en Santé du Québec (FRSQ).

5 References

[1] Cristoni, S., Bernardi, L. R., Expert Rev. Proteomics 2004, 1, 469–483.
[2] Bradshaw, R. A., Burlingame, A. L., IUBMB Life 2005, 57, 267–272.
[3] Arita, M., Robert, M., Tomita, M., Curr. Opin. Biotechnol. 2005, 16, 344–349.
[4] Ge, H., Walhout, A. J., Vidal, M., Trends Genet. 2003, 19, 551–560.
[5] Fink, J. L., Aturaliya, R. N., Davis, M. J., Zhang, F. et al., Nucleic Acids Res. 2006, 34, D213–D217.
[6] Davis, T. N., Curr. Opin. Chem. Biol. 2004, 8, 49–53.
[7] Yan, L., Ge, H., Li, H., Lieber, S. C. et al., J. Mol. Cell Cardiol. 2004, 37, 921–929.
[8] Zhan, X., Desiderio, D. M., Clin. Chem. 2003, 49, 1740–1751.
[9] Murphy, R. F., Biochem. Soc. Trans. 2005, 33, 535–538.
[10] Brunet, S., Thibault, P., Gagnon, E., Kearney, P. et al., Trends Cell Biol. 2003, 13, 629–638.
[11] Gagnon, E., Duclos, S., Rondeau, C., Chevet, E. et al., Cell 2002, 110, 119–131.
[12] Yates, J. R. III, Gilchrist, A., Howell, K. E., Bergeron, J. J., Nat. Rev. Mol. Cell Biol. 2005, 6, 702–714.
[13] Nguyen, D. T., Kebache, S., Fazel, A., Wong, H. N. et al., Mol. Biol. Cell 2004, 15, 4248–4260.
[14] Yocum, A. K., Yu, K., Oe, T., Blair, I. A., J. Proteome Res. 2005, 4, 1722–1731.
[15] Granger, J., Siddiqui, J., Copeland, S., Remick, D., Proteomics 2005, 5, 4713–4718.
[16] Santoni, V., Kieffer, S., Desclaux, D., Masson, F. et al., Electrophoresis 2000, 21, 3329–3344.
[17] Gygi, S. P., Corthals, G. L., Zhang, Y., Rochon, Y. et al., Proc. Natl. Acad. Sci. USA 2000, 97, 9390–9395.
[18] Corthals, G. L., Wasinger, V. C., Hochstrasser, D. F., Sanchez, J. C., Electrophoresis 2000, 21, 1104–1115.
[19] Kislinger, T., Gramolini, A. O., MacLennan, D. H., Emili, A., J. Am. Soc. Mass Spectrom. 2005, 16, 1207–1220.
[20] Hattan, S. J., Marchese, J., Khainovski, N., Martin, S. et al., J. Proteome Res. 2005, 4, 1931–1941.
[21] Li, X., Gong, Y., Wang, Y., Wu, S. et al., Proteomics 2005, 5, 3423–3441.
[22] Szponarski, W., Sommerer, N., Boyer, J. C., Rossignol, M. et al., Proteomics 2004, 4, 397–406.
[23] Ong, S. E., Blagoev, B., Kratchmarova, I., Kristensen, D. B. et al., Mol. Cell. Proteomics 2002, 1, 376–386.
[24] Blagoev, B., Kratchmarova, I., Ong, S. E., Nielsen, M. et al., Nat. Biotechnol. 2003, 21, 315–318.
[25] Dunkley, T. P., Dupree, P., Watson, R. B., Lilley, K. S., Biochem. Soc. Trans. 2004, 32, 520–523.
[26] Turecek, F., J. Mass Spectrom. 2002, 37, 1–14.
[27] Ong, S. E., Mann, M., Nat. Chem. Biol. 2005, 1, 252–262.
[28] Guerrera, I. C., Kleiner, O., Biosci. Rep. 2005, 25, 71–93.
[29] Cramer, R., Gobom, J., Nordhoff, E., Expert Rev. Proteomics 2005, 2, 407–420.
[30] Thiede, B., Hohenwarter, W., Krah, A., Mattow, J. et al., Methods 2005, 35, 237–247.
[31] Wysocki, V. H., Resing, K. A., Zhang, Q., Cheng, G., Methods 2005, 35, 211–222.
[32] Shevchenko, A., Loboda, A., Ens, W., Standing, K. G., Anal. Chem. 2000, 72, 2132–2141.
[33] Shevchenko, A., Chernushevich, I., Wilm, M., Mann, M., Methods Mol. Biol. 2000, 146, 1–16.
[34] Medzihradszky, K. F., Campbell, J. M., Baldwin, M. A., Falick, A. M. et al., Anal. Chem. 2000, 72, 552–558.
[35] Pasa-Tolic, L., Masselon, C., Barry, R. C., Shen, Y. et al., Biotechniques 2004, 37, 621–624, 626–633, 636 passim.
[36] Strausberg, R. L., Feingold, E. A., Grouse, L. H., Derge, J. G. et al., Proc. Natl. Acad. Sci. USA 2002, 99, 16899–16903.
[37] Strausberg, R. L., Camargo, A. A., Riggins, G. J., Schaefer, C. F. et al., Pharmacogenomics J. 2002, 2, 156–164.
[38] Strausberg, R. L., Feingold, E. A., Klausner, R. D., Collins, F. S., Science 1999, 286, 455–457.
[39] Okazaki, N., Kikuno, R., Ohara, R., Inamoto, S. et al., DNA Res. 2002, 9, 179–188.
[40] Okazaki, Y., Furuno, M., Kasukawa, T., Adachi, J. et al., Nature 2002, 420, 563–573.
[41] Martzen, M. R., McCraith, S. M., Spinelli, S. L., Torres, F. M. et al., Science 1999, 286, 1153–1155.
[42] Uetz, P., Giot, L., Cagney, G., Mansfield, T. A. et al., Nature 2000, 403, 623–627.
[43] Ito, T., Tashiro, K., Muta, S., Ozawa, R. et al., Proc. Natl. Acad. Sci. USA 2000, 97, 1143–1147.
[44] Walhout, M., Endoh, H., Thierry-Mieg, N., Wong, W. et al., Am. J. Hum. Genet. 1998, 63, 955–961.
[45] Walhout, A. J., Vidal, M., Nat. Rev. Mol. Cell Biol. 2001, 2, 55–62.
[46] Rual, J. F., Hill, D. E., Vidal, M., Curr. Opin. Chem. Biol. 2004, 8, 20–25.
[47] Reboul, J., Vaglio, P., Rual, J. F., Lamesch, P. et al., Nat. Genet. 2003, 34, 35–41.
[48] Schroeder, B. K., House, B. L., Mortimer, M. W., Yurgel, S. N. et al., Appl. Environ. Microbiol. 2005, 71, 5858–5864.
[49] Dricot, A., Rual, J. F., Lamesch, P., Bertin, N. et al., Genome Res. 2004, 14, 2201–2206.
[50] Rual, J. F., Hirozane-Kishikawa, T., Hao, T., Bertin, N. et al., Genome Res. 2004, 14, 2128–2135.
[51] Gong, W., Shen, Y. P., Ma, L. G., Pan, Y. et al., Plant Physiol. 2004, 135, 773–782.
[52] Hunt, I., Protein Expr. Purif. 2005, 40, 1–22.
[53] Elia, A. E., Cantley, L. C., Yaffe, M. B., Science 2003, 299, 1228–1231.
[54] Huh, W. K., Falvo, J. V., Gerke, L. C., Carroll, A. S. et al., Nature 2003, 425, 686–691.
[55] Ghaemmaghami, S., Huh, W. K., Bower, K., Howson, R. W. et al., Nature 2003, 425, 737–741.
[56] Pepperkok, R., Simpson, J. C., Rietdorf, J., Cetin, C. et al., Methods Enzymol. 2005, 404, 8–18.
[57] Simpson, J. C., Pepperkok, R., Genome Biol. 2003, 4, 240.
[58] Simpson, J. C., Wellenreuther, R., Poustka, A., Pepperkok, R. et al., EMBO Rep. 2000, 1, 287–292.
[59] Nilsson, P., Paavilainen, L., Larsson, K., Odling, J. et al., Proteomics 2005, 5, 4327–4337.
[60] Uhlen, M., Bjorling, E., Agaton, C., Szigyarto, C. A. et al., Mol. Cell. Proteomics 2005, 4, 1920–1932.
[61] Uhlen, M., Ponten, F., Mol. Cell. Proteomics 2005, 4, 384–393.
[62] Ito, T., Chiba, T., Ozawa, R., Yoshida, M. et al., Proc. Natl. Acad. Sci. USA 2001, 98, 4569–4574.
[63] Li, S., Armstrong, C. M., Bertin, N., Ge, H. et al., Science 2004, 303, 540–543.
[64] Formstecher, E., Aresta, S., Collura, V., Hamburger, A. et al., Genome Res. 2005, 15, 376–384.
[65] Giot, L., Bader, J. S., Brouwer, C., Chaudhuri, A. et al., Science 2003, 302, 1727–1736.
[66] Rual, J. F., Venkatesan, K., Hao, T., Hirozane-Kishikawa, T. et al., Nature 2005, 437, 1173–1178.
[67] Rigaut, G., Shevchenko, A., Rutz, B., Wilm, M. et al., Nat. Biotechnol. 1999, 17, 1030–1032.
[68] Gavin, A. C., Bosche, M., Krause, R., Grandi, P. et al., Nature 2002, 415, 141–147.
[69] Bouwmeester, T., Bauch, A., Ruffner, H., Angrand, P. O. et al., Nat. Cell Biol. 2004, 6, 97–105.
[70] Zips, D., Thames, H. D., Baumann, M., In Vivo 2005, 19, 1–7.
[71] Eglen, R. M., Comb. Chem. High Throughput Screen. 2005, 8, 311–318.
[72] Thomsen, W., Frazer, J., Unett, D., Curr. Opin. Biotechnol. 2005, 16, 655–665.
[73] Ptacek, J., Devgan, G., Michaud, G., Zhu, H. et al., Nature 2005, 438, 679–684.
[74] Caruso, M. E., Jenna, S., Beaulne, S., Lee, E. H. et al., Mol. Cell. Proteomics 2005, 4, 936–944.
[75] Tong, A. H., Lesage, G., Bader, G. D., Ding, H. et al., Science 2004, 303, 808–813.
[76] Tong, A. H., Evangelista, M., Parsons, A. B., Xu, H. et al., Science 2001, 294, 2364–2368.
[77] Barstead, R., Curr. Opin. Chem. Biol. 2001, 5, 63–66.
[78] Sugimoto, A., Differentiation 2004, 72, 81–91.
[79] Kuttenkeuler, D., Boutros, M., Brief. Funct. Genomic Proteomic 2004, 3, 168–176.
[80] Friedman, A., Perrimon, N., Curr. Opin. Genet. Dev. 2004, 14, 470–476.
[81] Downward, J., Oncogene 2004, 23, 8376–8383.
[82] Wheeler, D. B., Carpenter, A. E., Sabatini, D. M., Nat. Genet. 2005, 37, S25–S30.
[83] Ito, M., Kawano, K., Miyagishi, M., Taira, K., FEBS Lett. 2005, 579, 5988–5995.
[84] Jenna, S., Caruso, M. E., Emadali, A., Nguyen, D. T. et al., Mol. Biol. Cell 2005, 16, 1629–1639.
[85] Gunsalus, K. C., Ge, H., Schetter, A. J., Goldberg, D. S. et al., Nature 2005, 436, 861–865.
[86] Skop, A. R., Liu, H., Yates, J. III, Meyer, B. J. et al., Science 2004, 305, 61–66.
[87] Blackstock, W. P., Weir, M. P., Trends Biotechnol. 1999, 17, 121–127.
[88] Brent, M. R., Genome Res. 2005, 15, 1777–1786.
[89] Bianchetti, L., Thompson, J. D., Lecompte, O., Plewniak, F. et al., J. Bioinform. Comput. Biol. 2005, 3, 929–947.
[90] Pruitt, K. D., Tatusova, T., Maglott, D. R., Nucleic Acids Res. 2005, 33, D501–D504.
[91] Apweiler, R., Bairoch, A., Wu, C. H., Curr. Opin. Chem. Biol. 2004, 8, 76–80.
[92] Oh, P., Li, Y., Yu, J., Durr, E. et al., Nature 2004, 429, 629–635.
[93] Andersen, J. S., Lam, Y. W., Leung, A. K., Ong, S. E. et al., Nature 2005, 433, 77–83.
[94] Sadygov, R. G., Cociorva, D., Yates, J. R. III, Nat. Methods 2004, 1, 195–202.
[95] Choudhary, J. S., Blackstock, W. P., Creasy, D. M., Cottrell, J. S., Trends Biotechnol. 2001, 19, S17–S22.
[96] Arthur, J. W., Wilkins, M. R., J. Proteome Res. 2004, 3, 393–402.
[97] Verberkmoes, N. C., Hervey, W. J., Shah, M., Land, M. et al., Anal. Chem. 2005, 77, 923–932.
[98] Haebel, S., Albrecht, T., Sparbier, K., Walden, P. et al., Electrophoresis 1998, 19, 679–686.
[99] Mann, M., Jensen, O. N., Nat. Biotechnol. 2003, 21, 255–261.
[100] Ulintz, P. J., Zhu, J., Qin, Z. S., Andrews, P. C., Mol. Cell. Proteomics 2006, 5, 497–509.
[101] Tsur, D., Tanner, S., Zandi, E., Bafna, V. et al., Nat. Biotechnol. 2005, 23, 1562–1567.
[102] Chamrad, D. C., Koerting, G., Gobom, J., Thiele, H. et al., Anal. Bioanal. Chem. 2003, 376, 1014–1022.

[103] Watson, J. D., Laskowski, R. A., Thornton, J. M., Curr. Opin. Struct. Biol. 2005, 15, 275–284.
[104] Thompson, J. D., Higgins, D. G., Gibson, T. J., Nucleic Acids Res. 1994, 22, 4673–4680.
[105] Corpet, F., Nucleic Acids Res. 1988, 16, 10881–10890.
[106] Shah, P. K., Aloy, P., Bork, P., Russell, R. B., Protein Sci. 2005, 14, 1305–1314.
[107] McGinnis, S., Madden, T. L., Nucleic Acids Res. 2004, 32, W20–W25.
[108] Altschul, S. F., Koonin, E. V., Trends Biochem. Sci. 1998, 23, 444–447.
[109] Huang, X., Miller, W., Adv. Appl. Math. 1991, 12, 337–357.
[110] Apweiler, R., Attwood, T. K., Bairoch, A., Bateman, A. et al., Nucleic Acids Res. 2001, 29, 37–40.
[111] Quevillon, E., Silventoinen, V., Pillai, S., Harte, N. et al., Nucleic Acids Res. 2005, 33, W116–W120.
[112] Nakai, K., Horton, P., Trends Biochem. Sci. 1999, 24, 34–36.
[113] Krogh, A., Larsson, B., Von Heijne, G., Sonnhammer, E. L., J. Mol. Biol. 2001, 305, 567–580.
[114] Bendtsen, J. D., Nielsen, H., Von Heijne, G., Brunak, S., J. Mol. Biol. 2004, 340, 783–795.
[115] Guex, N., Peitsch, M. C., Electrophoresis 1997, 18, 2714–2723.
[116] Ginalski, K., Curr. Opin. Struct. Biol. 2006, 16, 172–177.
[117] Wu, C. H., Apweiler, R., Bairoch, A., Natale, D. A. et al., Nucleic Acids Res. 2006, 34, D187–D191.
[118] Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D. et al., Nat. Genet. 2000, 25, 25–29.
[119] Xie, H., Wasserman, A., Levine, Z., Novik, A. et al., Genome Res. 2002, 12, 785–794.
[120] Can, T., Camoglu, O., Singh, A. K., Wang, Y. F., Proc. IEEE Comput. Syst. Bioinform. Conf. 2004, 224–235.
[121] Barriot, R., Poix, J., Groppi, A., Barre, A. et al., Nucleic Acids Res. 2004, 32, 3581–3589.
