Professional Documents
Culture Documents
Quantitative proteomics holds considerable promise for protein abundances for comprehensive, system-wide bio-
elucidation of basic biology and for clinical biomarker logical research (1, 2). Quantitative proteomics may be used
discovery. However, it has been difficult to fulfill this to systematically identify and quantify proteins and their
promise due to over-reliance on identification-based modifications as a function of cell cycle, differentiation, or
quantitative methods and problems associated with chro- chemical treatment to obtain novel insights into basic cellular
matographic separation reproducibility. Here we describe biology. Proteomics also holds promise for discovery of pro-
new algorithms termed “Landmark Matching” and “Peak
teins in readily accessible biofluids that are diagnostic or prog-
Matching” that greatly reduce these problems. Landmark
nostic of a disease condition. Such proteins are termed
Matching performs time base-independent propagation
of peptide identities onto accurate mass LC-MS features “biomarkers.”
© 2006 by The American Society for Biochemistry and Molecular Biology, Inc. Molecular & Cellular Proteomics 5.10
Supplemental Material can be found at: 1927
This paper is available on line at http://www.mcponline.org http://www.mcponline.org/cgi/content/full/M600222-MCP200
/DC1
Proteomic Pattern Recognition Using PEPPeR
However, a significant problem for these identity-based peptide sequence, mass, and retention time can built from
methods is the limited and stochastic nature of MS/MS sam- multiple experiments and searched for ex post facto assign-
pling for peptide sequencing on the chromatographic time ment of identity to a chromatographic LC-MS peak based on
scale. This is exacerbated in complex samples where the these parameters. Abundance of the peaks enables large
MS/MS sampling process favors acquisition of spectra for the scale relative quantification of peptides among groups of
peptides of highest abundance that represent only a fraction samples. This in turn enables statistical analysis for biological
of the detectable m/z peaks in the mass spectrum. This characterization or biomarker discovery. Other approaches
means that a considerable amount of usable data is discarded that exploit high resolution pattern have recently been de-
by data-dependent experiments, and only the most abun- scribed as well (27–29), including a hybrid MS, MS/MS de-
dant peptides (and therefore proteins) in these experiments convolution method developed by Silva et al. (30, 31).
will be reliably quantified. Lower abundance proteins will be Several key problems face practitioners of quantitative pro-
sampled for MS/MS at lower frequency, resulting in poor teomics. Label-free approaches such as Smith’s do not nec-
reproducibility across samples (21). Researchers counter essarily rely on direct MS/MS sequencing for quantification,
this effect by either analyzing the same sample multiple but then the issues of chromatographic reproducibility and
times (21) or by fractionating highly complex samples at the alignment become manifest. Many approaches such as dy-
protein and peptide level prior to the final LC-MS/MS anal- namic time warping have been suggested as possible reme-
ysis (22). Although these approaches often improve detec- dies to these problems, but none has gained wide acceptance
tion of lower abundance components, they also greatly (12, 32). Finally in approaches that do not rely on sequencing
TABLE I
Protein mixture analysis concentrations
All values are in fmol/l.
Scale Mix Variability Mix
A B C D E F G H I ␣ 
Aprotinin 1 2 3 10 20 30 100 200 300 100 5
Ribonuclease A 300 1 2 3 10 20 30 100 200 100 100
Myoglobin 200 300 1 2 3 10 20 30 100 100 100
-Lactoglobulin 100 200 300 1 2 3 10 20 30 50 1
␣-Casein 30 100 200 300 1 2 3 10 20 100 10
Carbonic anhydrase 20 30 100 200 300 1 2 3 10 100 100
Ovalbumin 10 20 30 100 200 300 1 2 3 5 10
Fibrinogen 3 10 20 30 100 200 300 1 2 25 25
BSA 2 3 10 20 30 100 200 300 1 200 200
Transferrin 100 100 100 100 100 100 100 100 100 10 5
Plasminogen 30 30 30 30 30 30 30 30 30 2.5 25
-Galactosidase 10 10 10 10 10 10 10 10 10 1 10
data obtained from these instruments while greatly increas- tools on proteomics data for biomarker discovery and pro-
ing the accuracy and efficiency of obtaining the sequence vides a clear reverse path for identification of unknown
identities of even minor m/z peaks found to change across molecular species via accurate mass-driven, targeted
samples. Finally it will allow for use of established statistical MS/MS experiments (33).
EXPERIMENTAL PROCEDURES
Reagents and Chemicals
Proteins were obtained from Sigma. Mitochondrial extracts were
kindly provided by Dr. Vamsi Mootha of the Broad Institute, Cam-
bridge, MA. All other reagents (including water) were of HPLC or
proteomics grade.
harmonic noise (target value raised to 1 ⫻ 106 for mitochondrial sam- computer programs that implement landmark matching are written in
ples). Parameters for the Quantification Experiment acquisition strategy Perl and are available as a module of GenePattern at www.broad.
were identical except that the top three most abundant ions were mit.edu/tools/software.html.
selected rather than 10. For Identification Experiments, each sample Peptides sequenced from related LC-MS experiments become
was subjected to LC-MS analysis once. For Quantification Experiments, part of the BASIS SET. Information about confidently identified peptides
each sample was analyzed with five technical replicates. is retained in the BASIS SET, namely peptide sequence, charge in
which it was observed, experiment in which it was observed, and
Peptide Spectrum Interpretation scan boundaries composing the MS/MS scans grouped together for
sequence identification. These scan boundaries become the basis for
MS/MS spectra were extracted from the raw data and interpreted absolute or relative retention time comparisons in later landmark
using SpectrumMill Data Extractor and MS/MS Search Revision matching. This information is easily culled from the spectrum filename
B.03.02.059 (Agilent). SpectrumMill parameters for extraction, for spectrum extractor outputs that follow the same file naming con-
searching, and autovalidation can be seen in the accompanying sup- vention as extract_msn (ThermoElectron). In the case of our experi-
plemental information. Data from the Scale Mixes and Variability ments, an example filename VARMIX_A_01.3645.3675.2.pkl would
Mixes were searched against a small protein database consisting of translate to experiment VARMIX_A_01, scan boundaries 3645–3675,
only those proteins that composed the mixtures and common con- charge state z ⫽ 2. If a peptide was confidently identified from this
taminants (52 proteins total). Subsequently novel feature spectra spectrum, its sequence and spectrum name would become part of
acquired by targeted means were also searched against the National the BASIS SET. In cases where absolute retention time is required, scan
Center for Biotechnology Information (NCBI) non-redundant protein numbers can be translated into time units using software libraries
database dated August 1, 2005 and containing 2,724,841 entries. provided by the instrument manufacturer or “looked up” in the reten-
Data from the mitochondrial preparations were searched against the tion time (RT) ruler generated by MapQuant during xr2or data extrac-
International Protein Index (IPI) mouse database version 3.01 (35) and
each type of matching exercise are shown in Box 2. sponding m/z, retention time, and abundance information as derived
A walk-through of LANDMARK LIST selection and scoring is shown in from the corresponding MapQuant features. Processing an entire ex-
Fig. 2. A LANDMARK LIST is selected via a common overlap between periment results in a list of such matches. Any directly sequenced
peptides observed in the CURRENT EXPERIMENT (the LANDMARKS) and features for that experiment (LANDMARKS) that were not included in the
some other experiment in the BASIS SET (the COMPARISON EXPERIMENT) final match list are merged back into the data set. This final process also
with the requirement that the sequence of the PUTATIVE ASSIGNMENT provides an estimate of the false negative rate for landmark matching.
made to the feature being examined was confidently identified in the
COMPARISON EXPERIMENT. In this way, each PUTATIVE ASSIGNMENT has its Probabilistic Evaluation of Landmark Matches by Bootstrapping
own unique LANDMARK LIST, and multiple LANDMARK LISTS may be The probability that a given match is by chance can be evaluated
possible, but the “best” LANDMARK LIST is derived from a single COM- using the combined probabilities that the m/z assignment is by
PARISON EXPERIMENT. Currently the COMPARISON EXPERIMENT is selected
chance and that passing the landmark threshold filter is by chance.
where (a) the standard deviation of the centroided scans leading to Both probabilities are dependent on the assumptions that identifica-
the confident identification of the PUTATIVE ASSIGNMENT in the COMPAR- tions in the BASIS SET are correct and that the BASIS SET is complete.
ISON EXPERIMENT is less than a constant (typically ⫽ 200 scans
A p value can be computed as follows.
indicates a sharply eluting peak for our acquisition methods), (b) there
is at least one confidently identified peptide in common between the poverall ⫽ pm/z p共landmark兩m/z兲 (Eq. 2)
two experiments both before and after the observation of the PUTATIVE
ASSIGNMENT, (c) the standard deviation of the centroided scans leading pm/z can be computed analytically by counting the total number of
to the confident identification of the peptides before and after the BASIS SET peptides that fall within the tolerance of the matched m/z
PUTATIVE ASSIGNMENT in the COMPARISON EXPERIMENT is less than a at the specified charge (z) and dividing by the total number of BASIS
constant (typically ⫽ 500 scans leads to acceptable perform- SET peptides. p(landmark兩m/z) can be computed according to Bayes’
ance), and (d) the two experiments share the greatest common over- Rule.
冘
w
冦
1 if 共m兲 ⬍ 共n兲
共m, n兲 ⫽ 再 0.5
if 共m兲 ⬎ 共n兲
if 共n兲 ⫺ 共m兲 ⬍ ␦ and 共m兲 ⫹ 共m兲 ⬎ 共n兲 ⫺ 共n兲
⫺1 if else
0 if else
Because absolute retention time is difficult to reproduce, we apply tion time for all (landmark or unidentified) peptides is calculated as
a coarse retention time correction to all runs prior to clustering. The RTcorrected ⫽ a0 ⫹ a1 ⫻ RT ⫹ a2 ⫻ RT2 where the constants are
aim is not to achieve perfect chromatographic alignment but merely estimated by comparing the retention time of common landmarks in
to improve the efficiency of clustering and derive appropriate RT the run to their retention times in the reference.
tolerances. We start by selecting an arbitrary run as the reference. For In both mass and retention time corrections, there may be land-
each of the remaining runs, we correct retention time as follows: marks that have unusually large variation in mass or retention time,
identify the common set of landmarks in the reference run and the run respectively. Mass outliers can result from potentially incorrect pep-
under consideration. Using these landmarks, the corrected reten- tide identification, artifacts arising from MapQuant peak detection,
erroneous landmark matching, or stochastic MS variation. Retention Peak matching is performed using model-based clustering (37, 38)
time outliers can be caused by chromatographic variation, occasional using the Expectation Maximization algorithm (39) for Gaussian mix-
elution of a peptide over multiple time points, or the gradual elution of ture model parameter estimation. Peak matching takes only m/z and
a peptide over a long time period (i.e. a very broad chromatographic RT of the peak into account and does not use intensity during the
base peak). Because such variation is the exception rather than the peak clustering operation. For each matched peak, we assume that its
rule, we have designed our mass and retention time correction algo- true position in (m/z, RT) space is represented by , and the observed
rithms to be robust to such variation by excluding outlier landmarks. coordinates of this peak vary around in a Gaussian manner with
Outliers are defined as those landmarks whose mass or retention mean and covariance ⌺ so that it is represented by
再 冎
times (for mass or retention time correction, respectively) are greater
then Q3 ⫹ 1.5 ⫻ IQR or less than Q1 ⫺ 1.5 ⫻ IQR where Q1 is the 1
lower quartile, Q3 is the upper quartile, and the interquartile range
exp ⫺ 共x ⫺ k兲T⌺⫺1 k 共x ⫺ k兲
2
IQR ⫽ Q3–Q1. fk共x兩k, ⌺k兲 ⫽ (Eq. 6)
共2兲d/ 2兩⌺兩1/ 2
Clustering of Features with a Gaussian Mixture Model—With mass
and retention time corrected data that include all charge-identified where x represents the data (constituent peaks subsumed by this
peaks in all the runs detected by MapQuant, we now address the matched peak), k is an integer subscript specifying the matched peak,
peak matching and alignment problem. The challenge here is to and d is the number of dimensions (⫽2 in our case). Assuming that the
match identical LC-MS features across multiple sample runs, taking m/z strip has K matched peaks, i.e. k 僆 {1,2, . . . ,K}, the strip is
into account m/z and retention time variation. Given the high perform- represented by a mixture of K Gaussian peaks. The likelihood function
ance MS instrumentation we are using, m/z variation is minimal, but (37) for this Gaussian mixture model is as follows.
still needs to addressed due to the fact that complex protein mixtures
写冘
can give rise to peptides with very similar m/z values. Retention time, n K
however, can vary significantly from run to run for a given peptide, L共1, K, K; ⌺1, K, ⌺K; 1, K, K兩x兲 ⫽ kfk共xi兩k, ⌺k兲 (Eq. 7)
cases, the final intensity of the matched peak for that sample is the Landmark matching adds value even for the simple protein
sum of the multiple peak intensities. If a single peak from a run falls mixtures used to calibrate the system (Scale Mixes) and de-
into a matched peak, the corresponding intensity is carried over;
matched peaks missing from a specific run are marked missing.
termine the robustness of ratios (Variability Mixes). An aver-
Furthermore any peptide identities from landmarking are carried over, age of 165 ⫾ 27 peaks were directly sequence-identified in
and the matched peak is marked with the identity of the landmark. these experiments. After landmark matching, an average of
The final result of the peak matching and alignment process is an 281 ⫾ 44 peaks had sequence assignments, an increase of
intensity table whose rows represent features that are the matched 70%. The false positive rate of these matches can be evalu-
peaks and whose columns represent sample (or runs).
The peak matching algorithms are implemented primarily as an R
ated according to the formulas described in Equations 2 and
language library (41) with supporting shell and Perl scripts. The algo- 3. In a randomly selected LC-MS experiment, 93% of
rithms have been parallelized to efficiently utilize cluster computing matched features have a multiple hypothesis-corrected p
environments. value of ⬍0.005, but the p values of each individual match are
preserved and could be used for further filtering. The false
Quantitative Data Analysis
negative rate can be evaluated by counting the number of
Peptide sequences from matched features were assigned to their peaks that were directly sequence-identified but not assigned
parent proteins by string matching. The matches were then sorted by landmark matching. We calculate this rate at ⬍2%. In any
and grouped by parent protein. All abundance values were normal-
case, we can merge these directly sequence-identified peaks
ized to the summed abundance of all features identified by MapQuant
for the experiment and log2-transformed. Log2 transformation effec- back into the data set if they were missed during landmark
tively redistributes the abundance values observed into a normal matching. Summary landmark matching statistics can be
FIG. 3. Calibration of peptide data in Scale Mixes. A, composite averages for scaled peptide abundances across all variable proteins in
the Scale Mixes. Scaling and averaging is described under “Experimental Procedures.” A quadratic fit is shown with R2 ⫽ 0.98. B, composite
quantification ranged over 4 orders of magnitude demonstrat- ments (i.e. ratio clustering). Ratio clustering in combination
these contaminant proteins is akin to real world validation of tools. The key difference is that proteomics experiments are
At the same time, the utilization of high performance data and members of the Carr laboratory for performing protein digests for this
blind discovery of potential markers by PEPPeR may lessen study and especially Eric Kuhn for data on -galactosidase
contamination.
the need for sample fractionation.
We demonstrated an unintentional but remarkable valida- * This work was supported by grants from The Gates Foundation
tion of the PEPPeR system by detecting significantly changing and the Entertainment Industry Foundation (to S. A. C.) and the
peaks that turned out to correspond to peptides from E. coli United States Department of Energy (DOE-GTL) (to G. M. C.). The
proteins. There was only a single protein in our mixtures costs of publication of this article were defrayed in part by the pay-
ment of page charges. This article must therefore be hereby marked
sourced from E. coli, and its ratio did in fact change between “advertisement” in accordance with 18 U.S.C. Section 1734 solely to
the two Variability Mixes. These peaks were derived from indicate this fact.
contaminants present in the powdered stock. A secondary □S The on-line version of this article (available at http://www.
benefit to using the high performance instrumentation as used mcponline.org) contains supplemental material.
¶ To whom correspondence should be addressed. Tel.: 617-324-
in this study is that we had an accurate measurement of the
9762; Fax: 617-252-1902; E-mail: scarr@broad.mit.edu.
m/z values of the novel candidate markers. This allowed us to
rapidly close the loop and identify the candidates with the use REFERENCES
of accurate mass-driven precursor-dependent MS/MS acqui-
1. MacCoss, M. J., and Matthews, D. E. (2005) Quantitative MS for proteom-
sition. We obtained the identities of the novel markers literally ics: teaching a new dog old tricks. Anal. Chem. 77, 294A–302A
within 1 day of computational marker selection. The unbiased 2. Ong, S. E., and Mann, M. (2005) Mass spectrometry-based proteomics
marker selection also resulted in greater peptide coverage turns quantitative. Nat. Chem. Biol. 1, 252–262
17. Keller, A., Nesvizhskii, A. I., Kolker, E., and Aebersold, R. (2002) Empirical nonlinear peak alignment, matching, and identification. Anal. Chem. 78,
statistical model to estimate the accuracy of peptide identifications made 779 –787
by MS/MS and database search. Anal. Chem. 74, 5383–5392 30. Silva, J. C., Denny, R., Dorschel, C. A., Gorenstein, M., Kass, I. J., Li, G. Z.,
18. Nesvizhskii, A. I., Keller, A., Kolker, E., and Aebersold, R. (2003) A statistical McKenna, T., Nold, M. J., Richardson, K., Young, P., and Geromanos, S.
model for identifying proteins by tandem mass spectrometry. Anal. (2005) Quantitative proteomic analysis by accurate mass retention time
Chem. 75, 4646 – 4658 pairs. Anal. Chem. 77, 2187–2200
19. Ulintz, P. J., Zhu, J., Qin, Z. S., and Andrews, P. C. (2006) Improved 31. Silva, J. C., Gorenstein, M. V., Li, G. Z., Vissers, J. P., and Geromanos, S. J.
classification of mass spectrometry database search results using newer (2006) Absolute quantification of proteins by LCMSE: a virtue of parallel
machine learning approaches. Mol. Cell. Proteomics 5, 497–509 MS acquisition. Mol. Cell. Proteomics 5, 144 –156
20. Venable, J. D., Dong, M. Q., Wohlschlegel, J., Dillin, A., and Yates, J. R. 32. Aach, J., and Church, G. M. (2001) Aligning gene expression time series
(2004) Automated approach for quantitative analysis of complex peptide with time warping algorithms. Bioinformatics 17, 495–508
mixtures from tandem mass spectra. Nat. Methods 1, 39 – 45 33. Calvo, S., Jain, M., Xie, X., Sheth, S. A., Chang, B., Goldberger, O. A.,
21. Washburn, M. P., Ulaszek, R. R., and Yates, J. R., III (2003) Reproducibility Spinazzola, A., Zeviani, M., Carr, S. A., and Mootha, V. K. (2006) Sys-
of quantitative proteomic analyses of complex biological mixtures by tematic identification of human mitochondrial disease genes through
multidimensional protein identification technology. Anal. Chem. 75, integrative genomics. Nat. Genet. 38, 576 –582
5054 –5061 34. Mootha, V. K., Bunkenborg, J., Olsen, J. V., Hjerrild, M., Wisniewski, J. R.,
22. Tang, H. Y., Ali-Khan, N., Echan, L. A., Levenkova, N., Rux, J. J., and Stahl, E., Bolouri, M. S., Ray, H. N., Sihag, S., Kamal, M., Patterson, N.,
Speicher, D. W. (2005) A novel four-dimensional strategy combining Lander, E. S., and Mann, M. (2003) Integrated analysis of protein com-
protein and peptide separation methods enables detection of low-abun- position, tissue diversity, and gene regulation in mouse mitochondria.
dance proteins in human plasma and serum proteomes. Proteomics 5, Cell 115, 629 – 640
3329 –3342 35. Kersey, P. J., Duarte, J., Williams, A., Karavidopoulou, Y., Birney, E., and
23. Listgarten, J., and Emili, A. (2005) Statistical and computational methods Apweiler, R. (2004) The International Protein Index: an integrated data-
for comparative proteomic profiling using liquid chromatography-tan- base for proteomics experiments. Proteomics 4, 1985–1988
dem mass spectrometry. Mol. Cell. Proteomics 4, 419 – 434 36. Leptos, K. C., Sarracino, D. A., Jaffe, J. D., Krastins, B., and Church, G. M.