Abstract
Background: A small number of prognostic and predictive tests based on gene expression are currently offered as
reference laboratory tests. In contrast to such success stories, a number of flaws and errors have recently been
identified in other genomic-based predictors and the success rate for developing clinically useful genomic
signatures is low. These errors have led to widespread concerns about the protocols for conducting and reporting
computational research. As a result, a need has emerged for a template for reproducible development of
genomic signatures that incorporates full transparency, data sharing and statistical robustness.
Results: Here we present the first fully reproducible analysis of the data used to train and test MammaPrint, an
FDA-cleared prognostic test for breast cancer based on a 70-gene expression signature. We provide all the software
and documentation necessary for researchers to build and evaluate genomic classifiers based on these data. As an
example of the utility of this reproducible research resource, we develop a simple prognostic classifier that uses
only 16 genes from the MammaPrint signature and is equally accurate in predicting 5-year disease-free survival.
Conclusions: Our study provides a prototypic example for reproducible development of computational algorithms
for learning prognostic biomarkers in the era of personalized medicine.
Keywords: Reproducible research, Gene expression analysis, Biomarkers, Top scoring pair, Prediction, Genomics,
Personalized medicine, Breast cancer, MammaPrint
© 2013 Marchionni et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the
Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use,
distribution, and reproduction in any medium, provided the original work is properly cited.
Marchionni et al. BMC Genomics 2013, 14:336 Page 2 of 7
http://www.biomedcentral.com/1471-2164/14/336
hurdle hindering the translation of this research into clinically useful assays has been identified in the lack of
rigorous criteria to report and publish tumor prognostic marker studies [8]. This issue has been addressed by
introducing the REMARK guidelines, a set of recommendations for tumor marker prognostic studies that
provides the necessary framework for reporting all relevant information about prognostic marker development
(i.e. study design, specimen and patient characteristics, analytical and statistical methods) [11]. Another key
issue in the development of cancer biomarkers is the need for detailed and complete disclosure of all data and
software [8,12,13]. This need is not specific to the development of predictive signatures from high-throughput
molecular data but extends to many other branches of computational medicine and biology [14,15]. Whereas the
guidelines for transparency in genomic data sharing date back a decade to the adoption of the Minimum
Information About a Microarray Experiment (MIAME) standards [16], the recent scandal leading to the decision to
cancel three clinical trials based on microarray-based gene expression screening tests has dramatically
underscored the need for revised genomics research criteria [17] that extend and/or integrate the REMARK and
MIAME guidelines.
Maximizing the level of evidence on the spectrum of reproducibility requires complete, independent replication
[18]. As measured by this criterion, neither of the two successful breast cancer assays, MammaPrint and
OncotypeDX, provides a paradigmatic example of the way genomic predictors should be developed. In the case of
OncotypeDX, the prediction algorithm is described in detail and can be reprogrammed, but the original datasets
used for the implementation and validation [4] of the assay were never placed in the public domain. Conversely,
in the case of MammaPrint, although the original discovery and validation datasets [3,19] are available, the
pre-processing protocol and prediction algorithm are only partially described.
Thus the entire development, including data and code, is not available for either MammaPrint or OncotypeDX.
However, in the case of MammaPrint it is possible to undertake a transparent re-analysis of the data using an
alternative approach, since the raw microarray data are available. We therefore focus our efforts here on
reproducing the results of MammaPrint. We collect and organize the original MammaPrint discovery and validation
data. We also coordinate the associated metadata for these experiments and develop reproducible documents for
their analysis. We reproduce and implement the preprocessing described in the original manuscripts. These data
represent a resource that can be used by other investigators both to verify the original claims about the
MammaPrint signature and to build alternative predictors. As an example of the utility of these data, we use the
MammaPrint discovery and validation data to develop an alternative signature and prognostic test for breast
cancer, which is based on several two-gene comparisons [20,21]. This provides a detailed, transparent and fully
reproducible example of constructing a multi-gene classifier.
Methods
Data assembly and code
We collected the data from the original experiments used to identify [9] and develop [10] the MammaPrint 70-gene
prognostic signature, which were provided as additional files with the original manuscripts. We also collected
from ArrayExpress [22] the dataset used to retrain this signature on the custom array currently used in the
MammaPrint assay [3], as well as the independent validation cohort using the same array [19]. All of these
datasets have been organized in an open resource that can be used to develop and compare prognostic signatures
for breast cancer (available at http://luigimarchionni.org/breastTSP.html and through Bioconductor [23]). This
resource also encompasses the R [24] code and libraries used to retrieve, pre-process, manipulate, annotate, and
analyze these data. The code, fully annotated and executable, is provided in Additional files 1 and 2. All the
analyses performed in our study were based on de-identified, publicly available data, and they were performed in
compliance with the Declaration of Helsinki. The research did not involve any experiment on human subjects or
animals, and for this reason no ethical approval was necessary.
An example of reproducible signature development
In order to build our new classifier we selected the 78 patients originally used in the 70-gene prognostic
signature discovery and limited our analysis to the 70 genes contained in the original signature. We made these
decisions for two reasons: (a) to make our development process entirely analogous to the process for MammaPrint
and (b) so that our signature can be calculated on the basis of the data from any current MammaPrint assay. To
this end it should also be noted that the MammaPrint microarray platform only includes the prognostic signature
genes and a set of housekeeping genes used for normalization purposes. These latter genes are designed not to
change across samples and were therefore not used to train our predictor. We adopted an extension of a rank-based
approach to classification called "top-scoring pairs" (TSP) for developing understandable and powerful genomic
signatures. This approach is invariant to all data preprocessing and normalization steps that maintain the
ordering within sample gene expression profiles. The TSP algorithm selects the pair of genes whose expression
levels switch their ranking most consistently between the two prognostic groups (Figure 1). The original TSP
algorithm [20] and extensions [25] have previously been
Figure 3 8-TSP breast cancer prognosis signature. Each of the 8 gene pairs votes independently; patients with two or more votes are
classified as poor prognosis.
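The pair selection and voting rule described here can be illustrated with a minimal Python sketch. It is not the authors' implementation (their R code is in Additional files 1 and 2). The function names, the coding of labels as 1 for poor and 0 for good prognosis, the orientation of each pair, and the greedy choice of non-overlapping pairs are all assumptions made for illustration.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Pair = Tuple[int, int]


def pair_score(X: Sequence[Sequence[float]], y: Sequence[int], i: int, j: int) -> float:
    """Signed pair score: P(x_i > x_j | poor) - P(x_i > x_j | good).

    X holds one expression profile per sample; y is 1 for poor and 0 for
    good prognosis (a labelling convention assumed for this sketch).
    Both groups must contain at least one sample.
    """
    poor = [x for x, label in zip(X, y) if label == 1]
    good = [x for x, label in zip(X, y) if label == 0]
    p_poor = sum(x[i] > x[j] for x in poor) / len(poor)
    p_good = sum(x[i] > x[j] for x in good) / len(good)
    return p_poor - p_good


def select_pairs(X: Sequence[Sequence[float]], y: Sequence[int], k: int) -> List[Pair]:
    """Greedily pick up to k non-overlapping gene pairs with the largest |score|.

    Each returned pair (a, b) is oriented so that x[a] > x[b] is relatively
    more frequent in the poor prognosis group than in the good one.
    """
    n_genes = len(X[0])
    scored = []
    for i, j in combinations(range(n_genes), 2):
        s = pair_score(X, y, i, j)
        # Flip the pair when the score is negative so every stored score is >= 0.
        scored.append((abs(s), (i, j) if s >= 0 else (j, i)))
    scored.sort(key=lambda item: item[0], reverse=True)

    chosen: List[Pair] = []
    used = set()
    for _, (a, b) in scored:
        if a not in used and b not in used:
            chosen.append((a, b))
            used.update((a, b))
        if len(chosen) == k:
            break
    return chosen


def classify(x: Sequence[float], pairs: Sequence[Pair], min_votes: int = 2) -> int:
    """Return 1 (poor prognosis) if at least min_votes pairs vote poor, else 0."""
    votes = sum(x[a] > x[b] for a, b in pairs)
    return 1 if votes >= min_votes else 0
```

Because `classify` only compares values within one sample, any transformation that preserves the within-sample ordering (for example a log transform or a per-sample shift) leaves the prediction unchanged, which is the invariance property the text relies on.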
study [19]. In this independent validation cohort our test achieved 91% sensitivity, 47% specificity, and 69%
overall accuracy (Figures 4A and 4B, and Additional files 1 and 2). Sensitivity refers to correctly classifying
poor prognosis patients and specificity refers to correctly classifying good prognosis patients. For comparison,
the MammaPrint prognostic test achieves 89% sensitivity, 42% specificity, and 65% overall accuracy [19,31] in this
same validation set. Such performance in predicting metastatic recurrence within 5 years was reflected in the AUC
estimates: 0.69 (95% CI: 0.64-0.74) and 0.59 (95% CI: 0.55-0.62) for the 8-TSP and MammaPrint, respectively.
(Comparable results were obtained by PAM, a well-known classification method; see Additional files 1 and 2.)
Finally, while the 8-TSP classifier performed better than the MammaPrint test in the prediction of a metastatic
event within five years, the latter assay maintained a better performance at later time points, as revealed in
survival analyses. This finding probably indicates that the additional features of the 70-gene signature not used
in the 8-TSP classifier might carry additional prognostic information beyond five years (see Additional files 1
and 2).
We have therefore built a prognostic classifier based on the genes from the MammaPrint signature that is as
accurate in predicting 5-year disease-free survival as the MammaPrint prognostic test. Our classifier only
requires the measurement of expression for 16 of the 70 genes used in MammaPrint. Moreover, the new test is easy
to interpret and is robust with respect to any preprocessing of the expression data that maintains the ordering
among expression levels within sample profiles. Finally, all design decisions and choices of parameters were
based entirely on the training set. There was no "data leakage": no test data were examined until all aspects of
classifier development were "locked up." These are considered critical steps in developing reproducible and
accurate genomic signatures as defined by the IOM report [8]. The two key parameters are K, the number of pairs of
genes in the signature, and the score threshold. We only considered values of K between 6 and 10 since these
values maximized overall performance, and we only considered thresholds that obtained 100% sensitivity. Under
these design constraints, we selected K = 8 since this value maximized specificity at 100% sensitivity (Figure 2).
Our final classifier labels a sample as poor prognosis if two or more among the 8 pairs vote for the poor
prognosis group (Figure 3).
Our 8-TSP signature can be viewed as the combination of multiple coordinated biological processes. Of the 70
genes originally identified in the study by van't Veer and colleagues [10], 18 genes had expression values
positively associated with good prognosis, while 52 were associated with metastatic recurrence. Four of the K = 8
pairs combine genes positively correlated with good prognosis (RTN4RL1, LGP2, MS4A7, and GSTM3) with genes
associated with bad prognosis (OXCT1, HRASLS, Contig40831_RC, and MELK). These pairs represent a coordinated
change from good prognosis expression patterns to poor prognosis patterns across multiple gene
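The performance measures and the threshold choice reported in this section can be made concrete with a short Python sketch. It is offered only as an illustration under stated assumptions: labels are coded 1 for poor and 0 for good prognosis, `votes` holds the number of gene pairs voting for poor prognosis in each training sample, and the helper names are hypothetical. The authors' actual procedure is documented in Additional files 1 and 2.

```python
from typing import Dict, Optional, Sequence, Tuple


def confusion_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Sensitivity, specificity and overall accuracy.

    Sensitivity is the fraction of poor prognosis samples (label 1) called
    poor; specificity is the fraction of good prognosis samples (label 0)
    called good. Both classes must be present in y_true.
    """
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    n_poor = sum(t == 1 for t in y_true)
    n_good = len(y_true) - n_poor
    return {
        "sensitivity": tp / n_poor,
        "specificity": tn / n_good,
        "accuracy": (tp + tn) / len(y_true),
    }


def best_threshold(
    votes: Sequence[int], y_true: Sequence[int], k: int
) -> Optional[Tuple[int, float]]:
    """Pick the vote threshold with the highest specificity at 100% sensitivity.

    Thresholds 1..k are tried on the training data only. Returns
    (threshold, specificity), or None if no threshold reaches 100% sensitivity.
    """
    best: Optional[Tuple[int, float]] = None
    for threshold in range(1, k + 1):
        y_pred = [1 if v >= threshold else 0 for v in votes]
        m = confusion_metrics(y_true, y_pred)
        if m["sensitivity"] == 1.0 and (best is None or m["specificity"] > best[1]):
            best = (threshold, m["specificity"])
    return best
```

Under the same assumptions, the choice of K could be sketched by calling `best_threshold` once for each K from 6 to 10 on the training set and keeping the K with the highest specificity, leaving the validation cohort untouched until that choice is fixed.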
Conclusions
Our goal was to provide a transparent example of the manner in which a genomics-based cancer predictor might be
developed from training data and evaluated on independent test data with sufficient detail and documentation to
allow the full process to be replicated by other researchers. Because the original data are unavailable, it was
not possible to carry out this process for OncotypeDX, which is presently the most widely used and validated
predictor of this kind. Consequently, we performed a re-analysis of MammaPrint data. To this end, we selected the
same samples and end-point originally used for the implementation of this assay, although we are aware that a
stratified analysis across ER positive and negative patients would be much more appropriate. In order to
illustrate the development process from end to end, including a transparent decision rule, we have introduced a
more parsimonious classifier with sensitivity, specificity, and overall accuracy very similar to the 70-gene
MammaPrint signature.
Our analysis was performed in complete adherence to the principles of transparent and reproducible research
[13,18], providing all data sources used, and the complete code and software necessary for data preprocessing,
analysis and validation. To our knowledge, this
doi:10.1186/1471-2164-14-336
Cite this article as: Marchionni et al.: A simple and reproducible breast
cancer prognostic test. BMC Genomics 2013 14:336.