Marchionni et al.
BMC Genomics 2013, 14:336
http://www.biomedcentral.com/1471-2164/14/336

METHODOLOGY ARTICLE Open Access
A simple and reproducible breast cancer prognostic test
Luigi Marchionni1, Bahman Afsari4, Donald Geman3,4* and Jeffrey T Leek2,5*
Abstract
Background: A small number of prognostic and predictive tests based on gene expression are currently offered as
reference laboratory tests. In contrast to such success stories, a number of flaws and errors have recently been
identified in other genomic-based predictors and the success rate for developing clinically useful genomic
signatures is low. These errors have led to widespread concerns about the protocols for conducting and reporting
of computational research. As a result, a need has emerged for a template for reproducible development of
genomic signatures that incorporates full transparency, data sharing and statistical robustness.
Results: Here we present the first fully reproducible analysis of the data used to train and test MammaPrint, an
FDA-cleared prognostic test for breast cancer based on a 70-gene expression signature. We provide all the software
and documentation necessary for researchers to build and evaluate genomic classifiers based on these data. As an
example of the utility of this reproducible research resource, we develop a simple prognostic classifier that uses
only 16 genes from the MammaPrint signature and is equally accurate in predicting 5-year disease free survival.
Conclusions: Our study provides a prototypic example for reproducible development of computational algorithms
for learning prognostic biomarkers in the era of personalized medicine.
Keywords: Reproducible research, Gene expression analysis, Biomarkers, Top scoring pair, Prediction, Genomics,
Personalized medicine, Breast cancer, MammaPrint
* Correspondence: geman@jhu.edu; jleek@jhsph.edu
2 Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 North Wolfe
Street, Baltimore, MD 21205, USA
3 Institute for Computational Medicine, Johns Hopkins University, 3400 North Charles Street,
Baltimore, MD 21218, USA
Full list of author information is available at the end of the article

© 2013 Marchionni et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the
Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use,
distribution, and reproduction in any medium, provided the original work is properly cited.

Background
Currently, a number of molecular-based prognostic and predictive tests for breast cancer are offered
as laboratory services for clinical use [1,2]. Such assays, which include MammaPrint [3], OncotypeDX
[4], PAM50 Breast Cancer Intrinsic Subtype Classifier [5], MapQuant Dx [6] and Theros Breast Cancer
Index [7], are implemented by providing multiple gene expression measurements obtained from tissue
samples to multivariate classification algorithms. At present, published evidence on clinical
validity and utility for such assays as they are offered to patients is only available for
MammaPrint and OncotypeDX; for the remainder of these tests the evidence derives from analyses
performed in academic settings [2].

According to a recent report [8] from the Institute of Medicine (IOM), OncotypeDX was the most
widely used among these breast cancer assays, with more than 175,000 patients tested as of mid 2011,
followed by MammaPrint, used for 14,000 patients. OncotypeDX combines the expression levels of 21
genes and was developed to predict the risk of distant recurrence at 10 years for women with lymph
node negative, estrogen receptor (ER) positive breast cancer [4]. MammaPrint utilizes 70 genes to
report a good or bad prognosis for each patient, and was developed from microarray experiments to
predict 5-year metastatic recurrence of breast cancer as a first event among ER positive and
negative patients [9,10]. The MammaPrint algorithm is based on correlating the 70-gene expression
profile of a patient with a stored cancer profile in order to determine a risk score for the patient.

A relatively small fraction of published cancer prognostic markers have subsequently been introduced
in clinical practice, despite the large number of available studies focusing on biomarker
development. A major
hurdle hindering the translation of this research into clinically useful assays has been identified
in the lack of rigorous criteria to report and publish tumor prognostic marker studies [8]. This
issue has been addressed by introducing the REMARK guidelines, a set of recommendations for tumor
marker prognostic studies, which provides the necessary framework for reporting all relevant
information about prognostic marker development (i.e. study design, specimen and patient
characteristics, analytical and statistical methods) [11]. Another key issue in the development of
cancer biomarkers is the need for detailed and complete disclosure of all data and software
[8,12,13]. This need is not specific to the development of predictive signatures from
high-throughput molecular data but extends to many other branches of computational medicine and
biology [14,15]. Whereas the guidelines for transparency in genomic data sharing date back a decade
to the adoption of the Minimum Information About a Microarray Experiment (MIAME) standards [16], the
recent scandal leading to the decision to cancel three clinical trials based on microarray-based
gene expression screening tests has dramatically underscored the need for revised genomics research
criteria [17] that extend and/or integrate the REMARK and MIAME guidelines.

Maximizing the level of evidence on the spectrum of reproducibility requires complete, independent
replication [18]. As measured by this criterion, neither of the two successful breast cancer assays,
MammaPrint and OncotypeDX, provides a paradigmatic example of the way genomic predictors should be
developed. In the case of OncotypeDX, the prediction algorithm is described in detail and can be
reprogrammed, but the original datasets used for the implementation and validation [4] of the assay
were never placed in the public domain. Conversely, in the case of MammaPrint, although the original
discovery and validation datasets [3,19] are available, the pre-processing protocol and prediction
algorithm are only partially described.

Thus the entire development, including data and code, is not available for either MammaPrint or
OncotypeDX. However, in the case of MammaPrint it is possible to undertake a transparent re-analysis
of the data using an alternative approach, since the raw microarray data are available. We therefore
focus our efforts here on reproducing the results of MammaPrint. We collect and organize the
original MammaPrint discovery and validation data. We also coordinate the associated metadata for
these experiments and develop reproducible documents for their analysis. We reproduce and implement
the preprocessing described in the original manuscripts. These data represent a resource that can be
used by other investigators both to verify the original claims about the MammaPrint signature and to
build alternative predictors. As an example of the utility of these data, we use the MammaPrint
discovery and validation data to develop an alternative signature and prognostic test for breast
cancer, which is based on several two-gene comparisons [20,21]. This provides a detailed,
transparent and fully reproducible example of constructing a multi-gene classifier.

Methods
Data assembly and code
We collected the data from the original experiments used to identify [9] and develop [10] the
MammaPrint 70-gene prognostic signature, as provided in the additional files of the original
manuscripts. We also collected from ArrayExpress [22] the dataset used to retrain this signature on
the custom array currently used in the MammaPrint assay [3] as well as the independent validation
cohort using the same array [19]. All of these datasets have been organized in an open resource that
can be used to develop and compare prognostic signatures for breast cancer (available at
http://luigimarchionni.org/breastTSP.html [23] and through Bioconductor). This resource also
encompasses the R [24] code and libraries used to retrieve, pre-process, manipulate, annotate, and
analyze these data. The code, fully annotated and executable, is provided in Additional files 1 and
2. All the analyses performed in our study were based on de-identified publicly available data, and
they were performed in compliance with the Helsinki declaration. The research did not involve any
experiment on human subjects or animals and for this reason no ethical approval was necessary.

An example of reproducible signature development
In order to build our new classifier we selected the 78 patients originally used in the 70-gene
prognostic signature discovery and limited our analysis to the 70 genes contained in the original
signature. We made these decisions for two reasons: (a) to make our development process entirely
analogous to the process for MammaPrint and (b) so that our signature can be calculated on the basis
of the data from any current MammaPrint assay. To this end it should also be noted that the
MammaPrint microarray platform only includes the prognostic signature genes and a set of
housekeeping genes used for normalization purposes. These latter genes are designed not to change
across samples and were therefore not used to train our predictor. We adopted an extension of a
rank-based approach to classification called “top-scoring pairs” (TSP) for developing understandable
and powerful genomic signatures. This approach is invariant to all data preprocessing and
normalization steps that maintain the ordering within sample gene expression profiles. The TSP
algorithm selects the pair of genes whose expression levels switch their ranking most consistently
between the two prognostic groups (Figure 1). The original TSP algorithm [20] and extensions [25]
have previously been
successfully applied to differentiate tumor types [26], predict treatment response in breast cancer
[27] and acute myeloid leukemia [28], and grade prostate cancers [29].

Figure 1 Top Scoring Pair. A Top Scoring Pair (TSP) is formed by a pair of measurements that
consistently change ranking between samples from different prognostic groups.

Building the K-TSP classifier
We recorded the relative ordering of each pair of genes in the 70-gene MammaPrint signature in each
of the 78 training samples. In other words, for each pair of genes g and g', and for each sample j,
we record whether the expression of g in sample j is larger than the expression of g' in sample j or
vice-versa. The “signature” for the TSP classifier is the pair of genes that most consistently
changes its relative expression ordering between the two groups of patients, and the corresponding
decision rule for a new profile is determined entirely by the ordering between these two genes:
choose group one if the observed ordering was most often seen in group one and group two otherwise.
Here, the two groups of patients are those who recurred within 5 years (poor prognosis) and those
who did not recur (good prognosis). The K-TSP algorithm uses K pairs of genes. It proceeds by first
identifying the TSP, removing these two genes from the 70-gene signature, then searching for the
pair of genes among the 68 remaining that most often switch their ordering between groups, removing
these from the list, and so forth. Individually, each pair of genes “votes” for one of the two
groups based on the observed ordering. For a fixed number K of pairs, the final prognostic score is
the sum of the votes for the poor prognosis group among all K pairs. The higher the score, the more
evidence there is for poor prognosis.

Selecting the number of pairs
For each possible number of pairs K we measured the accuracy of the prognostic score on the training
set by calculating the area under the receiver operating characteristic curve (AUC) [30] determined
by considering all possible score thresholds for declaring poor prognosis. Here we used
re-substitution AUC for training, since the TSP approach is based on binary decisions and is not
prone to overfitting. The AUC increased with K until reaching a peak and then declined as further
pairs were added (Figure 2). We focused on values of K near the peak AUC, namely K = 6 to K = 10, and
only considered score thresholds achieving 100% sensitivity. The number of gene pairs K was then
chosen to maximize specificity, which is equivalent to choosing the maximum score threshold which
achieves 100% sensitivity. This resulted in the 8-TSP classifier (Figure 3) with score threshold
two. Such resubstitution estimates obtained from the training set of samples were used only for the
model optimization and do not reflect its performance, which in turn was assessed on an independent
cohort of patients (see below).

Figure 2 Resubstitution performance in the training set. Receiver Operating Characteristic (ROC)
analysis was performed in the training set and the Area Under the Curve (AUC) was used to select the
final number of TSPs. An 8-TSP classifier was chosen to maintain 100% training set sensitivity and
maximize specificity.

Figure 3 8-TSP breast cancer prognosis signature. Each of the 8 gene pairs votes independently;
patients with two or more votes are classified as poor prognosis.

Validation of the 8-TSP signature in an independent patient cohort
To evaluate the classifier on a new sample, the relative ordering of each of the K = 8 pairs of
genes is determined and the sample is assigned to the poor prognosis group if there are two or more
votes for poor prognosis (Figure 3), using the same procedures previously defined in the training
set of patients. The 8-TSP signature and the MammaPrint test were hence compared in terms of
classification performance, using standard measures such as accuracy, sensitivity, specificity, and
AUC, and in terms of survival, by Kaplan-Meier and Cox regression analyses.

Results and discussion
We compared our prognostic test to the MammaPrint test based on a large independent validation
cohort consisting of 307 patients from a European multi-center
study [19]. In this independent validation cohort our test achieved 91% sensitivity, 47%
specificity, and 69% overall accuracy (Figures 4A and 4B, and Additional files 1 and 2). Sensitivity
refers to correctly classifying poor prognosis patients and specificity refers to correctly
classifying good prognosis patients. For comparison, the MammaPrint prognostic test achieves 89%
sensitivity, 42% specificity, and 65% overall accuracy [19,31] in this same validation set. Such
performance in predicting metastatic recurrence within 5 years was reflected in the AUC estimates:
0.69 (95% CI: 0.64 − 0.74) and 0.59 (95% CI: 0.55 − 0.62) for the 8-TSP and MammaPrint respectively.
(Comparable results were obtained by PAM, a well-known classification method; see Additional files 1
and 2.) Finally, while in the prediction of a metastatic event within five years the 8-TSP
classifier performed better than the MammaPrint test, this latter assay maintained a better
performance at later time points as revealed in survival analyses. This finding probably indicates
that the additional features of the 70-gene signature not used in the 8-TSP classifier might carry
additional prognostic information beyond five years (see Additional files 1 and 2).

We have therefore built a prognostic classifier based on the genes from the MammaPrint signature
that is as accurate in predicting 5-year disease-free survival as the MammaPrint prognostic test.
Our classifier only requires the measurement of expression for 16 of the 70 genes used in
MammaPrint. Moreover, the new test is easy to interpret and is robust with respect to any
preprocessing of the expression data that maintains the ordering among expression levels within
sample profiles. Finally, all design decisions and choices of parameters were based entirely on the
training set. There was no “data leakage”: no test data was examined until all aspects of classifier
development were “locked up.” These are considered critical steps in developing reproducible and
accurate genomic signatures as defined by the IOM report [8]. The two key parameters are K, the
number of pairs of genes in the signature, and the score threshold. We only considered values of K
between 6 and 10 since these values maximized overall performance, and we only considered thresholds
that obtained 100% sensitivity. Under these design constraints, we selected K = 8 since this value
maximized specificity at 100% sensitivity (Figure 2). Our final classifier labels a sample as poor
prognosis if two or more among the 8 pairs vote for the poor prognosis group (Figure 3).

Our 8-TSP signature can be viewed as the combination of multiple coordinated biological processes.
Of the 70 genes originally identified in the study by van’t Veer and colleagues [9], 18 genes had
expression values positively associated with good prognosis, while 52 were associated with
metastatic recurrence. Four of the K = 8 pairs combine genes positively correlated with good
prognosis (RTN4RL1, LGP2, MS4A7, and GSTM3) with genes associated with bad prognosis (OXCT1, HRASLS,
Contig40831_RC, and MELK). These pairs represent a coordinated change from good prognosis expression
patterns to poor prognosis patterns across multiple gene
pairs. The remaining pairs comprise only genes originally associated with a poor prognosis (GPR180,
DTL, IGFBP5, SERF1A, GNAZ, RFC4, CDCA7, and UCHL5), suggesting that it is the quantitative level of
expression of these genes that is important for predicting prognosis.

Figure 4 8-TSP classification results in the validation set. Panel A) The 8-TSP results from the
first 150 patients in the validation set. Each column represents one of the 8 pairs (blue = good
prognosis vote, red = bad prognosis vote) and each row is a patient. Patients with bad prognosis
(top rows) have more votes for bad prognosis. Panel B) The 8-TSP results from the last 157 patients
in the validation set.

It is of note that each individual TSP involved in the final classification scheme can be viewed as
a separate molecular switch between the two prognostic groups, possibly entailing also a mechanistic
underpinning. To this end some of the pairs we have identified appear to have an additional
underlying mechanistic biological relationship. For instance one of the gene pairs, DTL-RFC4,
appears to be tightly associated with the regulation of the replication fork and the DNA damage
response. DTL and RFC4 physically interact and modulate the activity of the proliferating cell
nuclear antigen (PCNA) [32-34], which plays a central role in the coordination of these processes.
Similarly, another pair, GPR180-GNAZ, codes for proteins involved in G protein mediated cellular
signaling.
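The K-TSP procedure described in the Methods (scoring every gene pair by how consistently its ordering flips between the two groups, greedily keeping K gene-disjoint pairs, and counting votes against a score threshold) can be sketched in a few lines. This is only an illustrative sketch in Python, not the R code distributed with the article: the function names, the toy expression values, and the tie-breaking by enumeration order are all assumptions made for the example.

```python
from itertools import combinations
import math


def pair_score(X, y, i, j):
    """Score gene pair (i, j) on labelled training profiles.

    X: list of samples, each a list of expression values (one per gene).
    y: list of labels, 1 = poor prognosis, 0 = good prognosis.
    Returns (score, poor_if_less), where poor_if_less is True when the
    ordering x[i] < x[j] is seen more often in the poor prognosis group.
    """
    poor = [x for x, label in zip(X, y) if label == 1]
    good = [x for x, label in zip(X, y) if label == 0]
    p_poor = sum(x[i] < x[j] for x in poor) / len(poor)
    p_good = sum(x[i] < x[j] for x in good) / len(good)
    # A large difference means the ordering switches consistently between groups.
    return abs(p_poor - p_good), p_poor > p_good


def fit_ktsp(X, y, k):
    """Pick k gene-disjoint pairs, best-scoring first (ties: enumeration order)."""
    scored = []
    for i, j in combinations(range(len(X[0])), 2):
        score, poor_if_less = pair_score(X, y, i, j)
        scored.append((score, i, j, poor_if_less))
    scored.sort(key=lambda t: -t[0])  # stable sort keeps enumeration order on ties
    used, pairs = set(), []
    for _, i, j, poor_if_less in scored:
        if i in used or j in used:
            continue  # both genes of a chosen pair are removed from the pool
        pairs.append((i, j, poor_if_less))
        used.update((i, j))
        if len(pairs) == k:
            break
    return pairs


def poor_votes(pairs, x):
    """Prognostic score: number of pairs voting for poor prognosis."""
    return sum((x[i] < x[j]) == poor_if_less for i, j, poor_if_less in pairs)


def predict(pairs, x, threshold=2):
    """Return 1 (poor prognosis) if the vote count reaches the threshold."""
    return 1 if poor_votes(pairs, x) >= threshold else 0


# Hypothetical toy data: 6 samples, 4 genes (not taken from the study).
X_train = [
    [1.0, 5.0, 2.0, 3.0],
    [2.0, 6.0, 1.0, 4.0],
    [0.5, 4.0, 3.5, 3.0],
    [5.0, 1.0, 4.5, 2.0],
    [6.0, 2.0, 7.0, 3.0],
    [4.0, 0.5, 3.0, 1.0],
]
y_train = [1, 1, 1, 0, 0, 0]
pairs = fit_ktsp(X_train, y_train, k=2)
```

Because each vote depends only on which of two values is larger within a sample, applying any order-preserving transformation to a profile (for example `math.log` on positive values) leaves `poor_votes` unchanged, which is the invariance property noted in the text.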
Conclusions
Our goal was to provide a transparent example of the
manner in which a genomics-based cancer predictor
might be developed from training data and evaluated on
independent test data with sufficient detail and docu-
mentation to allow the full process to be replicated by
other researchers. Due to the unavailability of the original
data, it was not possible to carry out this process for
OncotypeDX, which is presently the most used and vali-
dated predictor of this kind. Consequently, we
performed a re-analysis of MammaPrint data. To this
end, we selected the same samples and end-point origin-
ally used for the implementation of this assay, although
we are aware that a stratified analysis across ER positive
and negative patients would be much more appropriate.
In order to illustrate the development process from end
to end, including a transparent decision rule, we have in-
troduced a more parsimonious classifier with sensitivity,
specificity, and overall accuracy very similar to the 70-
gene MammaPrint signature.
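The decision rule and the performance measures mentioned above reduce to simple counting. As a rough sketch, assuming hypothetical vote counts and outcome labels (none of these numbers come from the study, and the function names are illustrative), the largest score threshold that keeps 100% sensitivity and the resulting sensitivity, specificity, and accuracy could be computed like this:

```python
def max_threshold_full_sensitivity(votes, labels):
    """Largest threshold at which every poor prognosis sample is still called poor.

    votes: number of poor prognosis votes per sample.
    labels: 1 = poor prognosis, 0 = good prognosis.
    """
    return min(v for v, label in zip(votes, labels) if label == 1)


def performance(votes, labels, threshold):
    """Sensitivity, specificity and accuracy of the rule 'votes >= threshold'."""
    calls = [1 if v >= threshold else 0 for v in votes]
    tp = sum(1 for c, label in zip(calls, labels) if c == 1 and label == 1)
    tn = sum(1 for c, label in zip(calls, labels) if c == 0 and label == 0)
    n_poor = sum(labels)
    n_good = len(labels) - n_poor
    return {
        "sensitivity": tp / n_poor,   # poor prognosis samples correctly called
        "specificity": tn / n_good,   # good prognosis samples correctly called
        "accuracy": (tp + tn) / len(labels),
    }


# Hypothetical training-set vote counts for 4 poor and 4 good prognosis samples.
votes = [5, 3, 2, 6, 1, 0, 2, 4]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
threshold = max_threshold_full_sensitivity(votes, labels)
result = performance(votes, labels, threshold)
```

Raising the threshold can only move samples from "poor" to "good" calls, so among thresholds with 100% sensitivity the largest one gives the highest specificity, matching the selection rule stated in the Methods.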
Our analysis was performed in complete adherence to
the principles of transparent and reproducible research
[13,18], providing all data sources used, and the
complete code and software necessary for data prepro-
cessing, analysis and validation. To our knowledge, this
is one of the few, if not the first, development of a genomic signature adhering to these standards.

Additional files
Additional file 1: Fully reproducible vignette of the analysis.
Additional file 2: The archive contains the following files: “bmc_article.bst”: BMC series
bibliography style; “localFiles/contactAgendia”: instructions to obtain the hybridization mapping
information from Agendia; “objs/buyseEset.rda”: ExpressionSet for the Buyse cohort;
“objs/glasEset.rda”: ExpressionSet for the Glas cohort; “Supplement.bib”: Bibliography in BibTeX
format; “Supplement.Rnw”: Rnoweb/Sweave file containing code and text used to create the
“Supplement.tex” file; “Supplement.tex”: LaTeX file resulting from running Sweave on the
“Supplement.Rnw” file. All source code, data, and software packages used in the analyses are also
available for download online from: http://luigimarchionni.org/breastTSP.html.

Abbreviations
TSP: Top scoring pair; ROC: Receiver operating characteristic; AUC: Area under the curve; FDA: Food
and drug administration; IOM: Institute of medicine; ER: Estrogen receptor; REMARK: Reporting
recommendations for tumour marker prognostic studies; MIAME: Minimum information about a microarray
experiment; CI: Confidence intervals; RTN4RL1: Reticulon 4 receptor-like 1; LGP2: DHX58 DEXH
(ASP-GLU-X-HIS) box polypeptide 58; MS4A7: MS4A7 membrane-spanning 4-domains, subfamily A, member 7;
GSTM3: Glutathione S-Transferase MU 3 (BRAIN); OXCT1: 3-oxoacid coa transferase 1; HRASLS: HRAS-Like
suppressor; MELK: Maternal embryonic leucine zipper kinase; GPR180: G Protein-coupled receptor 180;
DTL: Denticleless E3 ubiquitin protein ligase homolog (drosophila); IGFBP5: Insulin-like growth
factor binding protein 5; SERF1A: Small edrk-rich factor 1A (TELOMERIC); GNAZ: Guanine nucleotide
binding protein (G protein), alpha Z polypeptide; RFC4: Replication factor C (activator 1) 4, 37KDA;
CDCA7: Cell division cycle associated 7; UCHL5: Ubiquitin carboxyl-terminal hydrolase L5;
PCNA: Proliferating cell nuclear antigen.

Competing interests
The authors declare that they have no competing interests.

Authors’ contributions
LM, JTL, and DG conceived the study; LM performed all the analysis and assembled all the datasets;
LM and BA implemented the software packages used in the analysis; LM, JTL and DG wrote the
manuscript. All authors read and approved the manuscript.

Acknowledgements
The authors express gratitude to Antonio C. Wolff for the invaluable comments, and Annuska M Glas
for the information on the datasets.

Funding
This work was supported by the Johns Hopkins Breast Cancer Program through funding from the Safeway
Research Foundation, and by the National Institute of Health (P30 CA006973 to LM, and R01 GM08308 to
JTL).

Author details
1 The Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, 1550
Orleans Street, Baltimore, MD 21231, USA.
2 Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 North Wolfe
Street, Baltimore, MD 21205, USA.
3 Institute for Computational Medicine, Johns Hopkins University, 3400 North Charles Street,
Baltimore, MD 21218, USA.
4 Department of Applied Mathematics and Statistics, Johns Hopkins University, 3400 North Charles
Street, Baltimore, MD 21218, USA.
5 Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21205, USA.

Received: 18 December 2012 Accepted: 4 May 2013
Published: 17 May 2013

References
1. Marchionni L, Wilson RF, Wolff AC, Marinopoulos S, Parmigiani G, Bass EB, Goodman SN: Systematic
review: gene expression profiling assays in early-stage breast cancer. Ann Intern Med 2008,
148(5):358–369.
2. Paik S: Is gene array testing to be considered routine now? Breast 2011, 20(Suppl 3):S87–S91.
3. Glas AM, Floore A, Delahaye LJ, Witteveen AT, Pover RC, Bakx N, Lahti-Domenici JS, Bruinsma TJ,
Warmoes MO, Bernards R, et al: Converting a breast cancer microarray signature into a
high-throughput diagnostic test. BMC Genomics 2006, 7:278.
4. Paik S, Shak S, Tang G, Kim C, Baker J, Cronin M, Baehner FL, Walker MG, Watson D, Park T, et al:
A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer.
N Engl J Med 2004, 351(27):2817–2826.
5. Parker JS, Mullins M, Cheang MC, Leung S, Voduc D, Vickery T, Davies S, Fauron C, He X, Hu Z,
et al: Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol 2009,
27(8):1160–1167.
6. Loi S, Haibe-Kains B, Desmedt C, Lallemand F, Tutt AM, Gillet C, Ellis P, Harris A, Bergh J,
Foekens JA, et al: Definition of clinically distinct molecular subtypes in estrogen
receptor-positive breast carcinomas through genomic grade. J Clin Oncol 2007, 25(10):1239–1246.
7. Ma XJ, Salunga R, Dahiya S, Wang W, Carney E, Durbecq V, Harris A, Goss P, Sotiriou C, Erlander M,
et al: A five-gene molecular grade index and HOXB13:IL17BR are complementary prognostic factors in
early stage breast cancer. Clin Cancer Res 2008, 14(9):2601–2608.
8. IOM (Institute of Medicine): Evolution of translational Omics: lessons learned and the path
forward. Washington, D.C: The National Academy Press; 2012.
9. van’t Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K,
Marton MJ, Witteveen AT, et al: Gene expression profiling predicts clinical outcome of breast
cancer. Nature 2002, 415(6871):530–536.
10. van de Vijver MJ, He YD, van’t Veer LJ, Dai H, Hart AA, Voskuil DW, Schreiber GJ, Peterse JL,
Roberts C, Marton MJ, et al: A gene-expression signature as a predictor of survival in breast
cancer. N Engl J Med 2002, 347(25):1999–2009.
11. McShane LM, Altman DG, Sauerbrei W, Taube SE, Gion M, Clark GM: Reporting recommendations for
tumor marker prognostic studies (REMARK). J Natl Cancer Inst 2005, 97(16):1180–1184.
12. Leek JT, Peng RD, Anderson RR: Personalized medicine: keep a way open for tailored treatments.
Nature 2012, 484(7394):318.
13. Baggerly K: Disclose all data in publications. Nature 2010, 467(7314):401.
14. Peng RD: Reproducible research and biostatistics. Biostatistics 2009, 10(3):405–408.
15. Peng RD, Dominici F, Zeger SL: Reproducible epidemiologic research. Am J Epidemiol 2006,
163(9):783–789.
16. Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W,
Ball CA, Causton HC, et al: Minimum information about a microarray experiment (MIAME)-toward
standards for microarray data. Nat Genet 2001, 29(4):365–371.
17. Goozner M: Duke scandal highlights need for genomics research criteria. J Natl Cancer Inst
2011, 103(12):916–917.
18. Peng RD: Reproducible research in computational science. Science 2012, 334(6060):1226–1227.
19. Buyse M, Loi S, van’t Veer L, Viale G, Delorenzi M, Glas AM, d'Assignies MS, Bergh J,
Lidereau R, Ellis P, et al: Validation and clinical utility of a 70-gene prognostic signature for
women with node-negative breast cancer. J Natl Cancer Inst 2006, 98(17):1183–1192.
20. Geman D, d'Avignon C, Naiman DQ, Winslow RL: Classifying gene expression profiles from pairwise
mRNA comparisons. Stat Appl Genet Mol Biol 2004, 3:Article 19.
21. Leek JT: The tspair package for finding top scoring pair classifiers in R. Bioinformatics 2009,
25(9):1203–1204.
22. Brazma A, Kapushesky M, Parkinson H, Sarkans U, Shojatalab M: Data storage and analysis in
ArrayExpress. Methods Enzymol 2006, 411:370–386.
23. A simple and reproducible breast cancer prognostic test.
http://luigimarchionni.org/breastTSP.html.
24. Ihaka R, Gentleman R: R: A language for data analysis and graphics. J Comput Graph Stat 1996,
5:299–314.
25. Tan AC, Naiman DQ, Xu L, Winslow RL, Geman D: Simple decision rules for classifying human
cancers from gene expression profiles. Bioinformatics 2005, 21(20):3896–3904.
26. Price ND, Trent J, El-Naggar AK, Cogdell D, Taylor E, Hunt KK, Pollock RE, Hood L, Shmulevich I,
Zhang W: Highly accurate two-gene classifier for differentiating gastrointestinal stromal tumors
and leiomyosarcomas. Proc Natl Acad Sci U S A 2007, 104(9):3414–3419.
27. Weichselbaum RR, Ishwaran H, Yoon T, Nuyten DS, Baker SW, Khodarev N, Su AW, Shaikh AY, Roach P,
Kreike B, et al: An interferon-related gene signature for DNA damage resistance is a predictive
marker for chemotherapy and radiation for breast cancer. Proc Natl Acad Sci U S A 2008,
105(47):18490–18495.
28. Raponi M, Lancet JE, Fan H, Dossey L, Lee G, Gojo I, Feldman EJ, Gotlib J, Morris LE,
Greenberg PL, et al: A 2-gene classifier for predicting response to the farnesyltransferase
inhibitor tipifarnib in acute myeloid leukemia. Blood 2008, 111(5):2589–2596.
29. Carro MS, Lim WK, Alvarez MJ, Bollo RJ, Zhao X, Snyder EY, Sulman EP, Anne SL, Doetsch F,
Colman H, et al: The transcriptional network for mesenchymal transformation of brain tumours.
Nature 2010, 463(7279):318–325.
30. van Belle G, Fisher LD, Heagerty PJ, Lumley T: Biostatistics: A methodology for the health
sciences. 2nd edition. Hoboken, New Jersey: John Wiley and Sons; 2004.
31. Tian S, Roepman P, Van't Veer LJ, Bernards R, de Snoo F, Glas AM: Biological functions of the
genes in the mammaprint breast cancer profile reflect the hallmarks of cancer. Biomark Insights
2010, 5:129–138.
32. Zhang G, Gibbs E, Kelman Z, O'Donnell M, Hurwitz J: Studies on the interactions between human
replication factor C and human proliferating cell nuclear antigen. Proc Natl Acad Sci U S A 1999,
96(5):1869–1874.
33. Ohta S, Shiomi Y, Sugimoto K, Obuse C, Tsurimoto T: A proteomics approach to identify
proliferating cell nuclear antigen (PCNA)-binding proteins in human cell lysates. Identification of
the human CHL12/RFCs2-5 complex as a novel PCNA-binding protein. J Biol Chem 2002,
277(43):40362–40367.
34. Jascur T, Fotedar R, Greene S, Hotchkiss E, Boland CR: N-methyl-N'-nitro-N-nitrosoguanidine
(MNNG) triggers MSH2 and Cdt2 protein-dependent degradation of the cell cycle and mismatch repair
(MMR) inhibitor protein p21Waf1/Cip1. J Biol Chem 2011, 286(34):29531–29539.
doi:10.1186/1471-2164-14-336
Cite this article as: Marchionni et al.: A simple and reproducible breast
cancer prognostic test. BMC Genomics 2013 14:336.