Abstract
Motivation: In the training of predictive models using high-dimensional genomic data, multiple studies’
worth of data are often combined to increase sample size and improve generalizability. A drawback of
this approach is that there may be different sets of features measured in each study due to variations
in expression measurement platform or technology. It is common practice to work only with the
intersection of features measured across all studies, which blindly discards potentially useful
feature information that is measured only in individual studies or subsets of studies.
Results: We characterize the loss in predictive performance incurred by using only the intersection of
feature information available across all studies when training predictors using gene expression data from
microarray and sequencing datasets. We study the properties of linear and polynomial regression for
imputing discarded features and demonstrate improvements in the external performance of prediction
functions through simulation and in gene expression data collected on breast cancer patients. To improve
this process, we propose a pairwise strategy that applies any imputation algorithm to two studies at a time
and averages imputed features across pairs. We demonstrate that the pairwise strategy is preferable to first
merging all datasets together and imputing any resulting missing features. Finally, we provide insights on
which subsets of intersected and study-specific features should be used so that missing-feature imputation
best promotes cross-study replicability.
Availability: The code is available at https://github.com/YujieWuu/Pairwise_imputation
Contact: patil@bu.edu
Supplementary information: Supplementary information is available at Bioinformatics online.
1 Introduction

Individual gene expression profiles have successfully been used to model the prognosis or risk of many diseases and disorders in personalized medicine (Van’t Veer et al., 2002; Wang et al., 2005). These predictive models capture disease identification (Tan and Gilbert, 2003; Fakoor et al., 2013), cancer subtyping (Huang et al., 2018; Gao et al., 2019), and risks of recurrence and relapse (Wang et al., 2004; Hartmann et al., 2005; Ascierto et al., 2012). Technological advancements have seen these studies graduate from custom chips to commercial tools to whole-genome sequencing. This has yielded larger-scale experiments and, over time, the ability to combine multiple gene expression studies of the same disease outcome measured in different patient cohorts. This abundance of data has led to the use of complex statistical prediction and machine learning algorithms for the development of gene signatures (Shipp et al., 2002; Ye et al., 2003; Pirooznia et al., 2008).

A critical issue facing the translation of these gene signatures into viable clinical tests is generalizability, or how well we expect the predictor to perform on a new patient or set of patients. Techniques such as cross-validation can overestimate how well a prediction model will generalize as compared with direct evaluation in a held-out test or validation dataset (Bernau et al., 2014). This discrepancy is often due to cross-study
heterogeneity in patient characteristics, measurement platforms, and study designs (Patil and Parmigiani, 2018).

To combat the effects of cross-study heterogeneity and increase training sample size to improve generalization, researchers have merged multiple studies (Xu et al., 2008). van Vliet et al. (2008) showed that pooling data sets together will result in more accurate classification and convergence of signature genes. However, a major challenge in combining multiple studies is that the same genes may not be measured in all studies. This may be due to differences in measurement platform or variations in the same platform when studies are conducted at different times. […] a detailed procedure for merging datasets by taking the intersected genes of all studies, followed by a batch effect removal procedure. Although omitting provides a simple approach for seamlessly merging studies, it comes with the potentially high cost of discarding important predictive information in features not contained in the intersection. Yasrebi et al. (2009) noted that if some genes that have high diagnostic power are not available for all studies, the aggregated data may not actually improve the final predictive model.

A solution to the data loss due to omission is imputation. Zhou et al. (2017) built LASSO models to impute missing genes across different studies assayed by two Affymetrix platforms for which the probe names of one platform are a proper subset of the other. Bobak et al. (2020) built several imputation models across studies that are measured using a variety of gene expression platforms. Both approaches proceed by first merging all studies together, then using the common genes in the intersection to impute missing genes. As the number of studies increases, the size of the intersection is likely to decrease, resulting in a smaller candidate feature pool and less accurate imputation of omitted genes. This makes merging before imputing a less attractive option when dealing with more than two studies, such as in the cases of the MetaGxData, CuratedOvarianData, or CuratedBreastData collections, where dozens of datasets may be available for combination (Gendoo et al., 2019; Ganzfried et al., 2013; Planey and Butte, 2013). Moreover, these previous approaches were focused on accurate imputation of missing genes and its effect on downstream analyses such as gene pathway enrichment analysis. Whether or not imputation can help improve a prediction model and make it more generalizable to external data in this context is mostly unstudied.

In this paper, we propose a pairwise strategy in which, instead of merging all available studies together at the outset to build an imputation model for missing gene features, we merge two studies at a time and perform imputation within the pair. We then repeat this imputation procedure for all possible pairs of studies and average imputed values for features that are missing across multiple pairs before training a prediction model. Inspired by the concept of knowledge transfer proposed by Vapnik and Izmailov (2015), which posits that some functional forms of the existing features could potentially capture missing information, we examine the ability of both linear and polynomial regression to impute missing features and use LASSO models (Hastie et al., 2009) both for imputation and for outcome prediction. Our strategy can be implemented using any imputation method applicable to the data types being studied, and we evaluate both traditional and machine learning-based imputation methods. Lastly, we consider the impact of using only features selected as “important” (highly associated with the predictive outcome of interest) across both studies within a given study pair (‘Core Imputation’) versus using all available features (‘All Imputation’) when building study-specific imputation models. Here, we revisit the question of whether some form of feature selection and a resulting smaller and more focused set of candidate features is preferable to applying regularization to a larger set of candidate features (Demir-Kavuk et al., 2011; Spooner et al., 2020).

The paper is organized as follows: Section 2 presents formal notation for the general pairwise strategy to impute study-specific missing genes across multiple studies, as well as the specific ‘Core’ and ‘All’ methods. Section 3 presents a simulation study that evaluates the performance of the proposed methods. Section 4 describes a real data analysis predicting the expression of the gene ESR1 across multiple curated breast cancer studies. Section 5 concludes with a discussion.

2 Methods

Let s = 1, 2, . . . , S index the studies for aggregated analysis, with n_s individuals and p_s genes in study s. Let X_s denote the gene expression data set for the sth study. X_s is an n_s × p_s matrix where each row represents an individual and each column represents the measurement values for a particular gene. Let Y_s be an n_s × 1 column vector of the response variable of the sth study. For a pair of studies s and j, denote by G_sj, G_{s/j} and G_{j/s} the sets of genes that are found in both studies, unique to study s, and unique to study j, respectively. Let |G_sj| = p_sj, where |·| denotes the cardinality of a set. It follows that |G_{s/j}| = p_s − p_sj and |G_{j/s}| = p_j − p_sj. Denote the gene expression matrix for a subset of genes G in study s as X_{s,G}. Throughout the manuscript, we assume that p_sj > 0 for every pair (s, j) (complete notation table in supplement S1).

Our goal is to impute study-specific missing genes to augment the candidate gene set used for building a predictive model. To this end, we propose a pairwise approach where imputation is applied for two studies at a time, as the available intersection of genes across any two studies will tend to be larger than that across all S studies. For studies s and j, the pairwise approach uses G_sj to construct imputation models of every gene in G_{s/j} separately using data from study s, based on which the expression profiles in G_{s/j} will be imputed for study j. The same procedure applies to the imputation of genes in G_{j/s} for study s. The imputed studies s and j then both contain the same set of genes G_s ∪ G_j (see Fig. 1). We repeat this pairwise approach for all S(S−1)/2 pairs of studies. If a particular gene in one study is imputed multiple times across pairs, we average its imputed values over all imputations. The result from a single imputation may be overfit to its pair and may not generalize well across pairs. We follow the philosophy of ensemble modeling (Zhang and Ma, 2012), which suggests that averaging across pairs will limit overfitting and outperform the single best imputation model. Training only with observed data and averaging genes imputed multiple times makes the pairwise strategy invariant to the order in which pairs of studies are constructed.

2.2 Imputation under incomplete validation set and sparse signals

The pairwise approach must be refined before it can be used for practical applications. The main limitation of the approach as stated is two-fold: (1) it assumes that all studies are for training and no validation study is present; (2) all genes that appear in at least one study should be included in the final prediction model. The first assumption makes cross-study validation inaccessible, while the second can lead to overfitting and increase the computational cost of the approach if p_s is large. Therefore, we propose the following ‘Core’ and ‘All’ variations of the generic pairwise approach, with ‘Core’ using selected cross-study genes to build the imputation model and ‘All’ using all available genes in the studies. The algorithm statements of the ‘Core’ and ‘All’ variations are provided in the supplement (S2.3).
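As a concrete illustration, the generic pairwise impute-then-average scheme described above can be sketched as follows. This is a minimal numpy/pandas sketch, not the authors' released implementation (the linked repository is the reference): per-gene ordinary least squares stands in for the LASSO or polynomial imputers, and the study names and gene labels are hypothetical.

```python
import numpy as np
import pandas as pd

def fit_ols(X, y):
    """Least-squares coefficients (with intercept) for one missing gene."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def pairwise_impute(studies):
    """Impute study-specific missing genes across all pairs of studies.

    `studies` maps study name -> DataFrame (rows = samples, cols = genes).
    For each ordered pair (s, j), genes present in s but absent from j are
    regressed on the genes shared by s and j, then predicted in j; a gene
    imputed in several pairs is averaged (the ensemble step).
    """
    pending = {name: {} for name in studies}  # study -> gene -> [predictions]
    for s, df_s in studies.items():
        for j, df_j in studies.items():
            if s == j:
                continue
            shared = df_s.columns.intersection(df_j.columns)
            for gene in df_s.columns.difference(df_j.columns):
                coef = fit_ols(df_s[shared].values, df_s[gene].values)
                pred = np.column_stack(
                    [np.ones(len(df_j)), df_j[shared].values]) @ coef
                pending[j].setdefault(gene, []).append(pred)
    return {
        name: pd.concat(
            [df, pd.DataFrame({g: np.mean(v, axis=0)
                               for g, v in pending[name].items()},
                              index=df.index)], axis=1)
        for name, df in studies.items()
    }
```

Because every imputer is trained only on observed data and repeated imputations are averaged, the output does not depend on the order in which pairs are visited, mirroring the invariance noted above.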
Pairwise Strategy for Imputing Predictive Features
2.2.1 ‘Core’ pairwise approach

Before introducing the methods, we make the assumption that the response variable is not available in the validation set. For each study, suppose not all genes are predictive of the outcome; due to the mixture of signal and noise, a common pre-processing step is to filter to a subset of genes that are most related to the outcome. For example, in each training study we can select the top q genes with the largest magnitudes of coefficient estimates from LASSO, where the response is the outcome and the predictors are the expression values of the genes.

The ‘Core’ pairwise approach takes two training sets, denoted T_i and T_j, as a pair, and imputes the missing genes in T_i, T_j and the validation set (V). In the preliminary screening stage, suppose in each training set that the top q predictive genes are selected for the final prediction model. Due to study heterogeneity, different sets of genes may be chosen from the two training sets, and we denote them as Q_i and Q_j, respectively. Furthermore, let Q_V be all the available genes in V, H = Q_i ∪ Q_j, H_1 = Q_i ∩ Q_j ∩ Q_V and H_2 = H \ H_1. The idea of the ‘Core’ pairwise approach is to impute the genes in H_2 that are not shared by all three studies using the genes in H_1 that are common across all studies. Note that for the ‘Core’ pairwise approach, the genes used for imputation are all predictive of the outcome in at least one of the training sets. To properly perform ‘Core’ imputation, three different scenarios need to be considered: (1) if a gene is found in only one of Q_i and Q_j and is also missing in Q_V, an imputation model will be built in the training set that has this gene available, and imputation will be performed for the other training set and V; (2) if this gene is not missing in Q_V, then no imputation is needed in V since we can use the original values of this gene; (3) if a gene is available in both Q_i and Q_j but is missing in Q_V, then we merge T_i and T_j together to train a single imputation model for this gene and impute in V. If we have S > 2 training sets, we can repeat the above procedure for all possible S(S−1)/2 pairs of training sets combined with the additional validation set V, and if a gene is imputed multiple times, we take the average over the multiple imputed values as the final imputation. We provide a more in-depth illustrative example as well as an algorithm statement in the supplement (S2).

2.2.2 ‘All’ pairwise approach

The ‘Core’ pairwise approach introduced above will only use the genes in H_1, which are the genes that are predictive of the outcome in the training sets, to impute the missing genes in H_2. However, it is possible that genes not selected for predicting the outcome (i.e. genes not in H) are still helpful for imputing the missing gene expression values. Therefore, another imputation strategy is to use the intersection of all available genes from the three studies instead of focusing only on the intersection of the top predictive genes.

Denote by Q_i^c and Q_j^c the remaining genes in T_i and T_j that are not in Q_i and Q_j, and let H^c = Q_i^c ∪ Q_j^c and H_int = (Q_i ∪ Q_i^c) ∩ (Q_j ∪ Q_j^c) ∩ Q_V. The idea of the ‘All’ pairwise approach is to use genes in H_int to impute the study-specific missing genes in H_2. Note that the genes in H_int are the intersection of all the available genes in T_i, T_j and V, and are thus not necessarily predictive of the outcome. Four scenarios require consideration: (1) if a gene is completely missing (e.g. not in Q_i nor in Q_i^c) in one of the training sets and in Q_V, an imputation model will be built for this gene in the training set that has this gene, and imputation will be performed for the other training set and V; (2) if this gene is available in Q_V, then no imputation is needed in V since we can use the original values of this gene; (3) if the gene is found in H^c (i.e. this gene is not predictive of the outcome in one of the training sets, but still exists) but is completely missing in Q_V, the training set that has this gene missing from its top-q predictive gene list can still use its original values, and then we merge the two training sets together to build a single imputation model for this gene and imputation will be performed in V; (4) if the gene is found in both H^c and Q_V, all studies will use their original values and […]
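The gene-set constructions shared by the two variants reduce to plain set operations. The helper below is a hypothetical illustration (not from the authors' code); `Qi`, `Qj`, `QV` and the complements follow the notation of Section 2.2.

```python
def core_and_all_sets(Qi, Qj, QV, Qi_c, Qj_c):
    """Gene-set algebra behind the 'Core' and 'All' pairwise variants.

    Qi, Qj: top-q predictive genes of the two training sets; QV: genes
    available in the validation set; Qi_c, Qj_c: the remaining
    (non-selected) genes of each training set.
    """
    H = Qi | Qj                    # candidate genes for the final model
    H1 = Qi & Qj & QV              # observed everywhere: 'Core' imputation inputs
    H2 = H - H1                    # genes that must be imputed somewhere
    Hc = Qi_c | Qj_c               # present in a training set but selected nowhere
    Hint = (Qi | Qi_c) & (Qj | Qj_c) & QV   # 'All' imputation inputs
    return H, H1, H2, Hc, Hint
```

Note that H1 ⊆ Hint always holds, which is exactly why ‘All’ can draw on more predictors than ‘Core’ when building each imputation model.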
3 Results

3.1 Simulation

3.1.1 Comparison between pairwise and merged approaches

We perform a simulation study to compare the performance of our proposed pairwise approach to the merged approach, where we first merge all studies together and use the intersection of variables across all studies to impute […]

Y = β1 X1 + … + β5 X5 + β1* X1* + … + β5* X5* + ε    (1)

where (X, X*)ᵀ ∼ MVN(0.1, Σ). […] line indicates that the p-value is greater than 0.01 but less than 0.05; and a blue line indicates that the p-value is greater than 0.05 (p-values are Bonferroni-adjusted for multiple comparisons). For the cases with p-values less than 0.05, we add a directed arrow to indicate the direction of the test, such that the method to which the arrow points has significantly smaller median prediction RMSE. Above each method, we report the proportion of simulation replicates for which that method obtained the smallest prediction RMSE in the validation set across the 300 simulation replicates. As shown in the figure, when the proportion of missingness is […] imputation methods. We also show the average difference in the number of intersected variables used to impute the study-specific missing variables between the merged imputation methods and the pairwise imputation methods as the cross points in Figure S1(b). The cross points show that as the proportion of missing variables increases, the pairwise imputation methods have increasingly larger numbers of intersected genes that can […] expression data, where each data set contains both genes that are predictive of the clinical outcome as well as genes that are irrelevant to the outcome.
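A data-generating sketch for Equation (1) follows, assuming the mean-0.1, unit-variance, covariance-0.5 structure for Σ stated in the sensitivity analyses, and standard normal noise (the exact noise scale falls in an extraction gap here, so it is an assumption):

```python
import numpy as np

def simulate_study(n, beta, beta_star, rho=0.5, rng=None):
    """One study under Equation (1): shared genes X and study-specific
    genes X*, jointly MVN with mean 0.1, unit variances and pairwise
    covariance rho; unit-variance Gaussian noise is assumed."""
    rng = rng if rng is not None else np.random.default_rng()
    p = len(beta) + len(beta_star)
    Sigma = np.full((p, p), rho)
    np.fill_diagonal(Sigma, 1.0)
    XX = rng.multivariate_normal(np.full(p, 0.1), Sigma, size=n)
    X, X_star = XX[:, :len(beta)], XX[:, len(beta):]
    Y = X @ beta + X_star @ beta_star + rng.normal(size=n)
    return X, X_star, Y
```

Dropping the `X_star` columns from a subset of studies then reproduces the study-specific missingness pattern that the pairwise and merged approaches are compared on.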
The data generation mechanism is as follows,

Yi = β1 X1,i + β2 X2,i + … + β5 X5,i + β6 X1,i* + … + β10 X5,i* + β11 Z1,i + … + β20 Z10,i + εi    (2)

[…]

3.1.2 Comparison between the ‘Core’ and ‘All’ pairwise methods

To mimic genomic data sets, we perform another simulation study where the signature genes for predicting the outcome of interest are sparse in the whole data set. For illustrative purposes, we have two training sets, and we will make predictions on another validation set that also has missing genes. Since Section 3.1.1 suggests that the pairwise strategy generally performs better than the merged strategy in terms of prediction RMSE, in the subsequent simulation study we provide a focused comparison of the ‘All’ and ‘Core’ pairwise strategies described in Section 2.2.

We generate data as follows, with the sample size of each study set to 100:

Yi = β1 X1,i + … + β20 X20,i + β21 X1,i* + … + β40 X20,i* + β41 Z1,i + … + β160 Z120,i + εi    (3)

where X1, …, X20, X1*, …, X20* jointly follow a multivariate normal distribution with mean 0.1, variance 1 and covariance 0.5. Z1, …, Z120 jointly follow a multivariate normal distribution with mean 0, variance 1 and correlation coefficient 0.2. We restrict the corresponding coefficients β41, …, β160 = 0 such that Z1, …, Z120 can be regarded as the genes that are irrelevant to the outcome of interest. In the simulation, X1–X20 are available to all data sets, and we deliberately set 10 of the genes among […] for both the training and validation sets. Therefore, for each study, we have 20 common predictive genes, 10 study-specific predictive genes and 70 study-specific irrelevant genes.

For preliminary feature screening, we applied LASSO and selected the top n, n = 30, 40, …, 100, genes with the largest absolute coefficients associated with the outcome. Figure 3.a shows the RMSE of prediction on the validation set for the omitting, ‘Core’ and ‘All’ imputation methods. Figures 3.b and 3.c show the paired Wilcoxon test results on the RMSE. Figure S3(a) shows boxplots of the log RMSE ratio of the ‘Core’ and ‘All’ imputation methods to the omitting method, and Figure S3(b) shows boxplots of the log RMSE ratio of the ‘Core’ imputation method to the ‘All’ imputation method. Across these figures, both the ‘Core’ and ‘All’ imputation methods have better prediction performance than the omitting method. When the number of top genes included is small, ‘All’ imputation has smaller prediction RMSE than the ‘Core’ imputation method, while when the number of genes included is large (greater than 60 or 70), ‘Core’ imputation works better. We hypothesize that LASSO may inevitably include noise among the top predictive genes while some signal will be neglected, and thus when the number of top genes included for prediction is small, ‘All’ imputation has the advantage of access to more informative genes to impute missing genes. However, when the number of genes included is large, most signals will be selected by LASSO and therefore both ‘Core’ and ‘All’ imputation will use approximately the same number of truly informative genes for imputation, while for ‘All’ imputation, more noise will be […]

Fig. 3. (a) RMSE of prediction on the validation set for the Omitting, ‘Core’ and ‘All’ imputation methods across the 300 simulation replicates. Left panel: β1 = … = β20 = 5, β1* = … = β20* = 10; Right panel: β1 = … = β20 = 10, β1* = … = β20* = 5. (b, c) Pairwise paired Wilcoxon tests on RMSE between the Omitting, ‘Core’ and ‘All’ imputation methods for scenarios when the X*’s have larger and smaller coefficients than the X’s, respectively. The p-values are adjusted using Bonferroni correction for multiple comparisons.

3.2 Sensitivity Analyses

Apart from using linear regression and polynomial regression as the imputation models, we also explore using more complex machine learning algorithms such as Random Forest, Support Vector Machines and Multiple Imputation […] of which imputation algorithm is used, the pairwise strategy consistently has smaller prediction RMSE than the corresponding merged approach.

Figures S7 and S8 show the prediction RMSE when three studies are used for imputation at a time rather than two. The pairwise strategy consistently has a lower prediction RMSE than imputation across three studies at a time. The difference in performance is due to the pairwise strategy retaining a larger set of intersected features to be used for imputation. As the number of studies used in the subset increases, the subset imputation approach will converge to the merging approach.

We varied the number of training sets across 3, 6 or 9 training datasets; results are shown in Figures S11 and S12. The pairwise strategy consistently has smaller prediction RMSE than the merged approach regardless of the number of training sets. As the number of training sets increases, the intersection used by the merged strategy will shrink, while the intersection used for any pair in the pairwise strategy will remain roughly the same size.

We also assess the impact of cross-study heterogeneity when generating data. For equation (1), X is still generated from a multivariate
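The preliminary screening step above can be sketched as follows. The paper ranks genes by LASSO coefficient magnitude; to keep this illustration dependency-free, absolute marginal correlation with the outcome stands in as the ranking score, so this is a stand-in rather than the authors' exact procedure.

```python
import numpy as np

def screen_top_genes(X, y, gene_names, n_top):
    """Preliminary feature screening: return the n_top genes most
    associated with the outcome. The paper ranks by LASSO coefficient
    magnitude; absolute marginal correlation is a simple stand-in."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])
    top = np.argsort(scores)[::-1][:n_top]
    return [gene_names[j] for j in top]
```

Running this per study with different `n_top` values reproduces the trade-off discussed above: small lists favor ‘All’ (more informative imputation inputs), large lists favor ‘Core’ (less noise admitted).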
normal distribution with mean 0.1, variance 1 and covariance 0.5. To generate X*, we first generate study-specific mean slopes γk ∼ N(0, τ²). In the kth study, Xj*, j = 1, …, 5 is generated as Xγk,jᵀ, where γk,j ∼ MVN(γk, 1). Therefore, τ² controls the study heterogeneity of the relationship between X and X* across studies, and larger τ² corresponds to more heterogeneous X–X* relationships across studies. For Equation (2), to generate Z, we additionally generate the […] are obtained from a multivariate normal distribution with mean 0.1 + µk, variance 1 and covariance 0.2. Figure 4 and Figures S13–S15 of the supplementary material show the corresponding results. We observe that even under study heterogeneity, both the pairwise and merged strategies have smaller prediction RMSE than the omitting method, and the pairwise strategy consistently has better prediction performance than the corresponding merged strategy. This is attributable to the added robustness of the ensemble approach implemented in the pairwise strategy, where the final imputation is an average of imputations from multiple prediction models. Ensembling in this manner can smooth over cross-study heterogeneity and exceed the advantage of the larger sample size used by the merged model (Guan et al., 2019).

Lastly, we compare the run times of the pairwise imputation and merged imputation strategies. Figure S20 in the supplementary material shows the run time for the linear pairwise and merged strategies when data is generated following Equation (1) across 3, 6, and 9 training studies. Since imputation models will be built multiple times, the pairwise approach takes longer, and this can be exacerbated by the total number of studies.

Fig. 4. Adding study heterogeneity in the X–X* relationship. The baseline method for comparison is Omitting.

3.3 Real data analysis

We apply the ‘Core’ and ‘All’ pairwise strategies with polynomial imputation to impute study-specific missing genes on microarray data sets […] 2034, 25055, 25065, because they all used the Affymetrix Human Genome U133A chip for microarray gene expression measurements. The sample sizes of these studies range from 54 to 286, with a total of 1328 patients across studies.

We apply the ‘Core’ and ‘All’ strategies to predict the expression level of the gene ESR1. In each experiment, we take four studies as training sets and a fifth study is chosen as the validation set. The imputation and the […] We restrict our analysis to the top 1000 most variable genes in each study. This induces heterogeneous missing patterns across our candidate studies, as the set of 1000 highest-variance genes varies across studies. The distribution of the variability of the genes for each study is shown in Figure S16 of the supplementary material. In general, the variability is similar across studies. Around 35% of the top 1000 variable genes are common across all 8 studies, with each study having around 650 study-specific genes that are missing in at least one of the 8 studies. We perform a principal component analysis (PCA) on the intersected genes and plot the first two principal components annotated by study, shown in Figure S17. We observe that even after batch effect correction there still exists some study heterogeneity, since there are two distinct clusters, each consisting of four studies.

Since not all genes are predictive of the outcome, for the screening step we fit a LASSO model to predict ESR1 based on the other gene expression levels in each study and select the genes with the largest magnitudes of coefficients. We then vary the number of top predictive genes we select in each study to predict the expression of ESR1. Figure 5.a shows the RMSE of prediction on the validation set for the omitting, ‘Core’ and ‘All’ strategies. When the number of top genes selected for predicting the outcome in each study is fewer than 400, the ‘All’ strategy has better performance than the omitting method. But as the number of predictive genes included in each study reaches 600, the RMSE from ‘Core’ and […] Figure 5.b shows the paired Wilcoxon test results on the RMSE between the three methods. Contrary to the boxplots of the marginal RMSE in Figure 5.a, the ‘All’ strategy consistently has a significantly smaller prediction RMSE than the omitting method. Figure 5.b therefore contains the paired […] shows the log RMSE ratio of the ‘Core’ and ‘All’ strategies to the omitting method as we vary the number of top predictive genes included for predicting the outcome, and Figure S6.b shows the log RMSE ratio of the ‘Core’ to the ‘All’. Consistent with the paired Wilcoxon test in Figure 5.b, […] predict ESR1 expression levels, the median log RMSE ratio of the ‘All’ imputation strategy is always smaller than 0, indicating that more than half of the experiments have a decrease in the RMSE by employing the ‘All’ […]

We also vary the number of training sets in each experiment. Figures S18 and S19 show the prediction RMSE in the testing set when we use 2 or 7 training sets. We observe that the ‘All’ strategy consistently yields smaller prediction RMSE and is robust against the number of top predictive genes that are included for analysis. The ‘Core’ strategy is less robust and is sometimes worse than the omitting method. This is likely due to ‘All’ using a larger gene set for building imputation models than ‘Core’, using genes that are not predictive of the outcome but still informative for imputing missing genes. When the number of top predictive genes increases to 500 or 600, the three methods have comparable performance. In this scenario, most truly predictive genes are already included in the data sets, and imputing the remainder does not impact the eventual prediction model trained.
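The per-study top-1000-variable-gene filter that induces the heterogeneous missingness pattern can be sketched as below (a minimal pandas illustration, not the authors' pipeline):

```python
import pandas as pd

def top_variable_genes(expr, k=1000):
    """Restrict one study (rows = samples, cols = genes) to its k
    highest-variance genes, as in the Section 3.3 pre-processing.
    Because each study keeps its own top-k list, the retained gene
    sets differ across studies, creating study-specific missing genes."""
    ranked = expr.var(axis=0).sort_values(ascending=False)
    return expr[ranked.index[:k]]
```

Applying this independently to each of the eight studies and then intersecting the resulting column sets reproduces the roughly 35% overlap described above.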
4 Discussion

In this paper, we propose a pairwise strategy to apply imputation methods which account for differing feature sets across multiple studies when the goal is to combine information across studies to build a predictive model. Compared with the traditionally convenient method of discarding non-intersected genes or the simpler approach of merging studies together and imputing using genes shared by all studies, our method maximizes the set of common genes available for imputation by basing it on the intersection between two studies at a time. Our simulation studies show that the pairwise method has significantly better performance than the omitting and merged methods in terms of the RMSE of prediction on an external validation set. This advantage is more pronounced when there are more studies or when there is cross-study heterogeneity in the inter-gene relationships, and the pairwise method exhibits the best performance no matter the underlying imputation model (e.g. regression, machine learning, multiple imputation).

Since only a subset of genes are likely to be relevant to the outcome of interest and because the external validation sets may also have genes […] the ‘Core’ and ‘All’ methods will decrease the RMSE of prediction compared to the omitting method, with ‘All’ demonstrating better performance than ‘Core’, especially when the number of genes included for prediction is small. In our real data examples, ‘All’ imputation again has better performance than ‘Core’, and ‘Core’ imputation tends to be more volatile than ‘All’ imputation. In examining the resulting imputation and prediction models, we find that we are less successful at selecting relevant features using LASSO than we were in simulation when the data generating mechanism […] we conjecture that this is because using features that are known to be predictive of the outcome across studies to build imputation models is more reliable and robust to cross-study heterogeneity than using a mixture of cross-study and study-specific features (‘All’ imputation). In ‘All’ imputation, it is possible that a study-specific feature which is only coincidentally predictive within that study will replace a more reliable cross-study feature from the intersection, and while the resulting imputation model would exhibit good performance for that study, it may […] selection can be more effective than penalization/regularization, and that even if the cross-study selected features are a subset of all features fed to the regularizing model, there may be study-specific features that the regularization prefers.

One limitation of the ‘Core’ and ‘All’ strategies is that neither method uses an optimal gene set to impute the study-specific missing genes. ‘Core’ imputation relies only on genes that are predictive of the outcome while completely neglecting other genes that might be informative of those missing genes even though they are not predictive of the outcome. ‘All’ imputation uses as many genes as possible for imputation, with many ‘noise’ genes being included; those additional ‘noise’ features will also lead to less precise imputation of the missing genes. Moreover, an implicit assumption of our imputation procedure is that the genes are missing at random (MAR). If the MAR assumption is violated, for instance, if the missingness mechanism also depends on the outcome, then the imputation might yield even worse predictive performance. We also conduct the bulk of our simulations with a linear model data generating mechanism, which […] explore more complex associations in the supplement and observe similar patterns).

The observed robustness of the pairwise imputation strategy, compared to merging, against study heterogeneity in the inter-gene relationships is likely introduced via the averaging approach we implemented to harmonize imputations of the same gene from models trained in different study pairs. This approach can be viewed as a simplified version of the multi-study stacking framework (Patil and Parmigiani, 2018), which utilizes ensemble […] We plan to investigate whether the original multi-study stacking framework can be used to further improve the performance of our imputation strategies as next steps.

To formally compare the performance of different imputation methods, we applied a pairwise Wilcoxon test on the median RMSE of prediction between different methods. Another future direction of research is on hypothesis tests for rigorous comparison of the performance of different methods that account for simulation variation, multiple comparisons and […]
PP and BR were supported by NSF-DMS grant 1810829; YW is supported by NSF-DMS grant 2113707.

Conflicts of Interest: None declared.

References

Ascierto, M. L., Kmieciak, M., Idowu, M. O., Manjili, R., Zhao, Y., Grimes, M., Dumur, C., Wang, E., Ramakrishnan, V., Wang, X.-Y., et al. (2012). A signature of immune function genes associated with recurrence-free survival in breast cancer patients. Breast Cancer Research and Treatment, 131(3), 871–880.

Bernau, C., Riester, M., Boulesteix, A.-L., Parmigiani, G., Huttenhower, C., Waldron, L., and Trippa, L. (2014). Cross-study validation for the assessment of prediction algorithms. Bioinformatics, 30(12), i105–i112.