9 - RF Behaviour
Abstract
Background: Random forests (RF) have been increasingly used in applications such as genome-wide association
and microarray studies where predictor correlation is frequently observed. Recent works on permutation-based
variable importance measures (VIMs) used in RF have come to apparently contradictory conclusions. We present an
extended simulation study to synthesize results.
Results: In the case when both predictor correlation was present and predictors were associated with the
outcome (HA), the unconditional RF VIM attributed a higher share of importance to correlated predictors, while
under the null hypothesis that no predictors are associated with the outcome (H0) the unconditional RF VIM was
unbiased. Conditional VIMs showed a decrease in VIM values for correlated predictors versus the unconditional VIMs under HA and were unbiased under H0. Scaled VIMs were clearly biased under both HA and H0.
Conclusions: Unconditional unscaled VIMs are a computationally tractable choice for large datasets and are
unbiased under the null hypothesis. Whether the observed increased VIMs for correlated predictors may be
considered a “bias” - because they do not directly reflect the coefficients in the generating model - or if it is a
beneficial attribute of these VIMs is dependent on the application. For example, in genetic association studies,
where correlation between markers may help to localize the functionally relevant variant, the increased importance
of correlated predictors may be an advantage. On the other hand, we show examples where this increased
importance may result in spurious signals.
indicate that the stronger the association with the response, the stronger the effect predictor correlation has on the performance of RF, which is in accordance with the findings of [3] and [4] with zero to high effects on the response, respectively. While the identification of only those predictors associated with the response was found to be aggravated in the presence of predictor correlation, the identification of sets of predictors both associated with the response and correlated with other predictors might be useful, e.g., in genome-wide association studies, where strong LD may be present between physically proximal genetic markers.
To study the reported phenomenon in more detail and to synthesize the results of the previous studies [3-5], we conducted a simulation study using a simple linear model containing twelve predictors, as first studied in [3]. This model contained a set of four strongly correlated predictors (r = 0.9) and eight uncorrelated predictors (r = 0) (Table 1). Three of the four correlated predictors and three of the eight uncorrelated predictors had non-zero coefficients. We examined the potential bias in RF and CIF during variable selection for the first splitting variable and across all the trees in the forest. We studied the impact of correlated predictors on the resulting variable importance measures generated by the two algorithms, including unscaled, unconditional permutation-based VIMs (RF and CIF), scaled permutation-based VIMs (RF) and conditional permutation-based VIMs (CIF).
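To make the simulation design concrete, the following is a minimal sketch of this generating model in Python/numpy; the sample size and the unit-variance noise term are illustrative assumptions, not the paper's exact settings, which are given only in part in the Methods.

```python
import numpy as np

# Correlation structure: x1-x4 pairwise r = 0.9, x5-x12 uncorrelated (r = 0).
V = np.zeros((12, 12))
V[:4, :4] = 0.9
np.fill_diagonal(V, 1.0)

# Coefficients under HA (Table 1); for the H0 scenario set all entries of b to zero.
b = np.array([5, 5, 2, 0, -5, -5, -2, 0, 0, 0, 0, 0], dtype=float)

rng = np.random.default_rng(1)
n = 1000                                  # illustrative sample size (assumption)
X = rng.multivariate_normal(np.zeros(12), V, size=n)
y = X @ b + rng.normal(size=n)            # unit-variance noise (assumption)
```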
Results and Discussion
In what follows, the results of the RF VIMs and the estimated coefficients of bivariate and multiple linear regression models are compared to the coefficients that were used to generate the data by means of a multiple linear regression model. Under the alternative hypothesis (HA) some predictors were influential, with the coefficients reported in the Methods section. Under the null hypothesis (H0) all predictors had a coefficient of zero, but the same block correlation structure.
The results are reported in terms of bias in the statistical sense: an estimator is unbiased if, on average, it reproduces the coefficient of the original model used for generating the data, and is biased if it systematically deviates from the original coefficient. However, a key point is whether or not machine learning algorithms, which are by their very nature often nonlinear and nonparametric, should be expected to approximate the linear generating model, or whether they contain additional information that may be exploited in the search for highly influential predictors.

Predictor selection frequencies
Under HA, correlated predictors were selected more frequently at the first split in trees grown with subsets of three and eight (mtry) randomly preselected predictors (first column of Figures 1 and 2), as reported in [3]. This was true for both algorithms. When one variable was selected for splitting (mtry = 1), CIF selected the four correlated predictors and the three uncorrelated predictors with non-zero coefficients more frequently than the uncorrelated predictors with coefficients of 0. This suggests the p-value splitting criterion in CIF is more sensitive to predictor association with the outcome than the Gini Index used by RF when mtry = 1, because CIF selected the associated predictors (or predictors strongly correlated with associated predictors) more frequently than RF at both the first split and across all splits in the forest (first row, Figures 1 and 2).
However, the pattern of selection frequencies using all splits showed the same pattern as reported in [4] for both algorithms (second column of Figures 1 and 2).
Table 1. Bias and 95% coverage for full and single-predictor models.

Predictor  True Value  Correlated Group  Full Model Bias  Full Model 95% Coverage  Single Model Bias  Single Model 95% Coverage
x1          5          T                  9.16e-04         94.6                     6.31               0.0
x2          5          T                 -6.09e-04         97.2                     6.31               0.0
x3          2          T                 -0.0014           94.6                     9.02               0.0
x4          0          T                  0.0011           96.2                    10.81               0.0
x5         -5          F                  6.77e-04         96.0                    -0.015             94.2
x6         -5          F                  2.75e-04         95.2                    -0.017             94.2
x7         -2          F                 -3.90e-04         93.8                    -0.0017            95.8
x8          0          F                 -5.26e-04         95.8                     0.0050            93.4
x9          0          F                  1.86e-04         94.6                    -0.012             94.6
x10         0          F                  2.47e-04         94.4                     0.012             95.8
x11         0          F                 -3.60e-04         95.2                     0.0096            94.4
x12         0          F                 -3.56e-06         95.8                     0.0040            95.8
Figure 1 Histograms of selection frequencies using random forest. First row: number of variables selected at each split (mtry) = 1, 2nd row:
mtry = 3, 3rd row: mtry = 8. First column shows the first split selection frequencies and 2nd column shows the selection frequencies across all
trees in all forests.
Uncorrelated strongly-associated predictors (x5 and x6) were selected slightly more frequently across all trees when the pool of potential predictor variables (mtry) was set to greater than one, because of competition between correlated predictors for selection into a tree (second and third rows, second column, Figures 1 and 2). Differences from results in [3] are due to different choices in tuning parameters regulating the depth of trees, as further discussed below.
As the number of variables to select from for splitting increased (i.e., as mtry went from 3 to 8), this preference for uncorrelated predictors was stronger, because the likelihood of more than one correlated predictor being in the pool of available predictors to choose from was greater. Further, we observed this same preference for uncorrelated predictors in the first split and across all splits in the tree under H0 for both algorithms (Additional files 1 and 2). In summary, the preference for correlated variables as the first splitting predictor was only induced when the correlated predictors were associated with the outcome. In the next section we explain why this phenomenon was observed here and in [3].
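Selection frequencies like those in Figures 1 and 2 can be tallied from any fitted forest. Below is a sketch using scikit-learn's RandomForestRegressor as a stand-in for the R implementations used in the paper, with max_features playing the role of mtry; node 0 of each tree is its first split. Note that scikit-learn's regression trees use variance reduction as the split criterion, so this is an analogue rather than a reproduction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# X, y as simulated above; max_features corresponds to mtry.
rf = RandomForestRegressor(n_estimators=500, max_features=3,
                           min_samples_leaf=20, random_state=0).fit(X, y)

# First-split selection frequency: the root of each tree is node 0.
first = np.array([t.tree_.feature[0] for t in rf.estimators_])
first_freq = np.bincount(first, minlength=12) / len(rf.estimators_)

# Selection frequency across all splits (leaf nodes are coded as negative).
splits = np.concatenate([t.tree_.feature[t.tree_.feature >= 0]
                         for t in rf.estimators_])
all_freq = np.bincount(splits, minlength=12) / len(splits)
print(first_freq.round(3), all_freq.round(3), sep="\n")
```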
Figure 2 Histograms of selection frequencies for each variable using conditional inference forest. First row: number of variables selected
at each split (mtry) = 1, 2nd row: mtry = 3, 3rd row: mtry = 8. First column shows the first split selection frequencies and 2nd column shows the
selection frequencies across all trees in all forests.
As previously reported [3], we found both algorithms showed increased selection of x4, which had a coefficient of 0 in the generating model, versus the uncorrelated predictors with coefficients of 0 (x8 ... x12). Here we give a more detailed illustration and explanation of this effect. Tree-building uses predictors sequentially in each tree, and single predictors that are strongly correlated with the outcome are more likely to be selected at the first split. Statistically speaking, in the first split of each tree the split criterion is a measure of the bivariate effect of each predictor on the response, or of the marginal correlation between that predictor and the response. In the case of predictors that are strongly correlated with influential predictors, this univariate or marginal effect is almost as strong as that of the originally influential predictor. In this case, as we show in detail, even though the predictor x4 has a coefficient of 0 in the generating model, the correlation between x4 and the outcome is ~0.79 because of the correlation between x4 and the correlated predictors (x1, x2 and x3), which have non-zero coefficients.
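The ~0.79 figure follows directly from the correlation matrix and the coefficient vector via the identities given later in the Methods section (m = Vb; cor(y, xi) = mi / var(y)^(1/2)); a quick numeric check in numpy:

```python
import numpy as np

V = np.eye(12)
V[:4, :4] = 0.9
np.fill_diagonal(V, 1.0)
b = np.array([5, 5, 2, 0, -5, -5, -2, 0, 0, 0, 0, 0], dtype=float)

m = V @ b                  # true bivariate slopes; m[3] = 10.8 for x4 (Table 2)
sd_y = np.sqrt(b @ m)      # var(y)^(1/2) of the linear predictor
print(m[3], m[3] / sd_y)   # -> 10.8 and ~0.786, the ~0.79 quoted for x4
```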
Thus, not only the first split of a tree, but also a bivariate linear regression model, indicated that predictors that are strongly correlated with an influential predictor - and can thus serve as a proxy for it - are predictive of the response. For example, within the context of a genetic association study, let us suppose there are two SNPs: SNP1 and SNP2. SNP1 is the true causal variant associated with the outcome, which may be disease status or another phenotype, and SNP2 is in linkage disequilibrium with SNP1. In this example, because SNP1 and SNP2 are correlated with each other, either predictor, considered on its own, may look associated with the outcome.
The fact that in the first split feature selection is only bivariate or marginal is native to tree-building and is not related to properties of the data generating model. However, in the case of linear models it is particularly intuitive that single features can be highly predictive when considered in isolation, even when they appear with small or zero weights in the underlying models having many predictors. Therefore, linear models are used both as the data generating process and as a means of illustration here.
Bias, 95% coverage and correlation between predictors and outcome under H0 and HA
RF and CIF consider predictors one at a time during tree-building. In particular, the model is a simple bivariate model at the first split, containing the outcome and a single predictor. To better understand the preference for correlated predictors at the first split, we examined the bias, 95% coverage and correlation between the outcome and predictors using the full twelve-predictor model and twelve single-predictor (bivariate) regression models. We also calculated the true (i.e., mathematically derived from the generating model equation) values of the coefficients and true correlations for bivariate models for each of the predictors, and compared those values to the mean values of the bivariate coefficients and correlations observed in our simulations. We further considered a model retaining the same correlation structure but where all coefficients were set to 0 (H0).
As expected, when using the full (twelve-predictor) linear regression model under HA, bias was small and centred around 0 (Table 1), indicating that the linear regression model can reproduce the original coefficients despite the high correlation, as expected since the generating model was linear. Ninety-five percent coverage ranged from 93.8% to 97.2%, and was centred around 95% for all predictors (Table 1). However, in the single-predictor (or bivariate) regression models, bias was large for correlated predictors, ranging from 6.31 to 10.81 higher than the true value from the generating model (Table 1). The 95% coverage for all four correlated predictors was 0 (Table 1). This effect is known in statistics as a "spurious correlation", where a correlated predictor serves as a proxy and appears to be influential as long as the originally influential predictor is not included in the model. For uncorrelated predictors, the magnitude of the bias ranged from 0.0017 to 0.017 and the coverage ranged around the expected 95%.
These results are helpful to understand the preference for correlated variables at the first split in trees. At the first split, RF and CIF are simply a bivariate model containing the outcome and one predictor. As illustrated by means of the bivariate regression models, from this bivariate or marginal point of view the correlated predictors are actually more strongly associated with the outcome than the uncorrelated predictors. This explains why they were preferred at the first split, which we illustrate further below.
Another way to assess the first-split preference of RF/CIF for the correlated predictors - especially for the correlated predictor which had a 0 coefficient in the generating model (x4) - is to calculate the mean bivariate or marginal correlation coefficient between each predictor and the outcome and compare it with the true value (Table 2). This showed again that, from the bivariate or marginal point of view, the correlated predictors (x1 - x4) were much more strongly correlated with the outcome (r ~ 0.80) than the uncorrelated predictors with -5 coefficients (x5 and x6; r = -0.36) or the uncorrelated predictor with a -2 coefficient (x7; r = -0.15). The observed values from our simulations were virtually identical to the true values, as expected. In addition, we calculated the true value of each coefficient using a single-variable model and compared this value with the mean observed value from our simulations. These two values were also nearly identical (Table 2).
Under H0, the magnitude of the bias for correlated predictors in the full model ranged from 0.00039 to 0.0015 and was similar in the single-variable models. For uncorrelated predictors the magnitudes were also similar, and 95% coverage was appropriate for all predictors (see Additional file 3). Since the generating model did not contain an intercept, we re-calculated bias and 95% coverage using a model containing an intercept and a model where the intercept was forced through the origin (see Methods).

Table 2. True and observed values for single-predictor model coefficients and correlations with the outcome.

Predictor  True Bivariate Model b  Mean Observed Bivariate Model b  True r(xi, y)  Mean Observed r(xi, y)
x1          11.30     11.31      0.82     0.82
x2          11.30     11.31      0.82     0.82
x3          11.00     11.02      0.80     0.80
x4          10.80     10.81      0.78     0.79
x5          -5.0      -5.02     -0.36    -0.36
x6          -5.0      -5.02     -0.36    -0.36
x7          -2.0      -2.002    -0.15    -0.15
x8           0.0       0.002     0.0     -3.6e-04
x9           0.0       0.005     0.0     -8.4e-04
x10          0.0      -0.012     0.0     -8.7e-04
x11          0.0       0.01      0.0      7.3e-04
x12          0.0       0.004     0.0      2.9e-04
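The bivariate-versus-full-model contrast in Tables 1 and 2 is straightforward to reproduce by simulation. Below is a sketch assuming statsmodels for the confidence intervals; the replicate count and sample size are illustrative assumptions, and the fit is made without an intercept to match the generating model.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
V = np.eye(12); V[:4, :4] = 0.9; np.fill_diagonal(V, 1.0)
b = np.array([5, 5, 2, 0, -5, -5, -2, 0, 0, 0, 0, 0], dtype=float)

n_rep, n = 500, 1000                       # illustrative settings (assumption)
est, hits = [], 0
for _ in range(n_rep):
    X = rng.multivariate_normal(np.zeros(12), V, size=n)
    y = X @ b + rng.normal(size=n)
    fit = sm.OLS(y, X[:, [3]]).fit()       # single-predictor (bivariate) model for x4
    est.append(fit.params[0])
    lo, hi = fit.conf_int(alpha=0.05)[0]
    hits += (lo <= b[3] <= hi)

print(np.mean(est) - b[3])                 # bias ~ +10.8, cf. Table 1
print(100 * hits / n_rep)                  # 95% coverage ~ 0.0 for x4
```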
Unconditional and unscaled RF variable importance measures
Unscaled VIMs for RF under HA using one variable to select from at each split (mtry = 1) showed behaviour intermediate to the bivariate models and the full regression model shown in the above section (Figure 3, column 1). Larger VIMs were observed for the correlated variables versus the uncorrelated predictors, regardless of the generating model coefficient, as in [3]. The correlated predictors with the largest coefficients (x1, x2) showed larger VIMs than the correlated predictors with coefficients of 2 (x3) or 0 (x4), but the latter showed higher VIMs than uncorrelated predictors with the same coefficients.
As the number of variables to select from at each split increased (mtry increased from 3 to 8), the VIMs for the correlated predictors approached the coefficients in the generating model (Figure 3, column 1). In other words, the VIMs for the correlated predictors with 2 or 0 coefficients (x3, x4) were reduced relative to those with larger coefficients (x1, x2, x5 and x6), regardless of correlation. This is particularly true when eight variables were randomly selected at each split. The VIMs for the correlated predictors were often larger than those for the uncorrelated due to the stronger association between correlated variables and the outcome when only a single correlated predictor is considered. As we showed above, this is often the case, because they are most often selected at the first split, as in [3].
To ascertain whether RF unconditional unscaled permutation-based VIMs were biased under predictor correlation under H0, we conducted the same analysis under H0 (Figure 3, column 2). This is an important aspect, because a bias under the null hypothesis would mean that one would be misled to consider some predictors that are correlated as influential, even if all predictors were random noise variables. We observed a negligible inflation in VIMs for the correlated predictors of approximately 0.014. This inflation was relatively invariant to the number of splitting variables used.
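An unscaled, unconditional permutation VIM can be approximated with scikit-learn's permutation_importance; note this permutes on whatever data is supplied rather than on each tree's out-of-bag sample, as the original RF VIM does, so it is an analogue of the measure studied here rather than a re-implementation.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# X, y as simulated above.
rf = RandomForestRegressor(n_estimators=500, max_features=3,
                           min_samples_leaf=20, random_state=0).fit(X, y)
vim = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(vim.importances_mean)   # unscaled VIM analogue: mean accuracy decrease
```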
Unconditional scaled RF variable importance measures
As previously shown [6], in the regression case the RF scaled variable importance measures grew larger with the number of trees grown (data not shown). In addition, we show here that they were also sensitive to predictor correlation (Figure 4). Under HA, VIMs for the uncorrelated but strongly-associated predictors (x5 and x6) were the largest because these variables were selected more frequently, and thus their empirical standard error was smaller. The scaled VIMs were calculated using:

observed VIM / empirical standard error(VIM)

for each VIM. Under H0, the scaled VIMs for the correlated predictors are inflated relative to the uncorrelated predictors, which was due to the fact that the unscaled VIMs of the correlated predictors are always non-negative (Figure 3). Thus, the magnitude of the scaled VIMs was dependent on the size of the forest [6,7] and the correlation between predictors.
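The forest-size dependence of the scaled VIM [6,7] comes from the scaling itself: dividing a per-tree mean by its empirical standard error introduces a sqrt(ntree) factor. A stylized numpy illustration, treating per-tree VIMs as i.i.d. draws (an assumption for illustration only):

```python
import numpy as np

rng = np.random.default_rng(3)
for ntree in (100, 1000, 10000):
    per_tree = rng.normal(loc=0.5, scale=1.0, size=ntree)  # stylized per-tree VIMs
    se = per_tree.std(ddof=1) / np.sqrt(ntree)             # empirical standard error
    print(ntree, per_tree.mean() / se)   # scaled VIM grows roughly like sqrt(ntree)
```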
Unconditional unscaled CIF variable importance measures
For CIF, unconditional VIMs under HA were very similar to those obtained using the unscaled VIM in RF (Figure 5; compare with Figure 3). Indeed, the rankings of the importance measures did not differ between the two algorithms under any value of mtry. Even the magnitude of the VIMs was similar between the two algorithms when using three or eight variables for splitting. In other words, under HA, the values of the unconditional VIMs from both algorithms reflected an intermediate step between the single-predictor linear regression models and the full twelve-predictor linear regression model, with the values for the correlated associated variables larger than for the uncorrelated associated ones.
Under H0, CIF showed an even smaller bias in inflation of VIMs for the correlated variables. The median difference between the VIMs for correlated versus uncorrelated variables when selecting from three predictors to split on at each node was 6.54·10^-5, and with eight predictors it was 4.75·10^-4. Therefore, we did not observe substantial bias in unconditional permutation-based VIMs for either algorithm under H0.

Conditional unscaled CIF variable importance measures
Using the conditional VIM from CIF and randomly selecting a single variable to split on at each node (mtry = 1), we observed a similar pattern to the unscaled unconditional VIMs from both CIF and RF (Figure 6). When the number of variables to randomly split on was larger (mtry = 3 or 8), we observed an inflation in the median VIMs for the uncorrelated strongly-associated variables (x5, x6) relative to those observed for the correlated variables that were strongly associated (x1, x2). The inflation in the median VIMs for the uncorrelated strongly-associated variables (x5, x6) is contrary to the inflation found for the correlated strongly-associated variables (x1, x2) for the unconditional unscaled VIMs. However, neither pattern is consistent with the full generating model. It should be noted that the variability of the conditional VIMs for the uncorrelated strongly-associated variables (x5, x6) is very high (Figure 6), so that the inflation is less pronounced than for the correlated strongly-associated variables (x1, x2) using the unconditional VIMs.
Figure 3 Boxplots of unscaled permutation-based VIMs using random forest. First column: unscaled VIMs of variables under HA using: 1st
row: mtry = 1, 2nd row: mtry = 3, 3rd row: mtry = 8. Second column VIMs of variables under H0. V = number of predictor (xi) in the model.
Figure 4 Boxplots of scaled permutation-based VIMs using random forest. First column: scaled VIMs of variables under HA using: 1st row:
mtry = 1, 2nd row: mtry = 3, 3rd row: mtry = 8. Second column VIMs of variables under H0. V = number of predictor (xi) in the model.
Figure 5 Boxplots of unconditional permutation-based VIMs using conditional inference forest. First column: unconditional VIMs of
variables under HA using: 1st row: mtry = 1, 2nd row: mtry = 3, 3rd row: mtry = 8. Second column VIMs of variables under H0. V = number of
predictor (xi) in the model.
Figure 6 Boxplots of conditional permutation-based VIMs using conditional inference forest. First column: conditional VIMs of variables
under HA using: 1st row: mtry = 1, 2nd row: mtry = 3, 3rd row: mtry = 8. Second column VIMs of variables under H0. V = number of predictor (xi)
in the model.
This increased variability, which was not found in [3], may be attributable to the use of larger trees in the present study and is currently the topic of further research. Under H0, the conditional VIMs showed a negligible bias similar to the unconditional permutation-based variable importance measure from CIF: with mtry = 3 it was 3.78·10^-5 and with mtry = 8 it was 1.52·10^-4.
Conclusions
Both algorithms, RF and CIF, more frequently selected correlated predictors at the first split in the tree, as previously reported [3]. As we showed, these predictors were more strongly associated at the bivariate or first-split level than the uncorrelated predictors associated with the outcome (here, x5 - x7). Across all splits in the forest under HA, CIF and RF showed variable selection frequencies relatively proportional to the effect sizes of the predictors. However, we observed a slight preference for selection of uncorrelated predictors, as shown in [4], and this was true under both HA and H0.
Unscaled permutation-based VIMs were found to reflect an intermediate step between the bivariate and multivariate linear regression model coefficients. We did observe a very slight inflation in these VIMs for correlated variables under H0. However, the difference between the median VIM for correlated versus uncorrelated predictors was less than 0.014. These results highlight the importance of using Monte Carlo simulation to assess the size of observed VIMs relative to those expected under the null hypothesis.
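One hedged way to implement such a Monte Carlo null reference is to permute the outcome (breaking all predictor-outcome association), refit, and collect the resulting VIMs; observed VIMs can then be compared against this null distribution. A sketch under the same illustrative settings as above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(4)
null_vims = []
for _ in range(50):                        # number of null replicates (illustrative)
    y_null = rng.permutation(y)            # enforce H0 by permuting the outcome
    rf0 = RandomForestRegressor(n_estimators=200, max_features=3,
                                min_samples_leaf=20, random_state=0).fit(X, y_null)
    pi = permutation_importance(rf0, X, y_null, n_repeats=5, random_state=0)
    null_vims.append(pi.importances_mean)

null_95 = np.percentile(np.array(null_vims), 95, axis=0)  # per-predictor threshold
```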
One of the key aims of the study was to determine whether the original RF VIM, whose behaviour in the presence of correlation is situated between that of bivariate and multivariate regression models, may be preferable to the conditional VIM, whose behaviour is situated closer to that of multiple regression models. The latter may be preferable in small studies where the aim is to uncover spurious correlations and identify a set of truly influential predictors among a set of correlated ones. However, in large-scale screening studies, such as genome-wide association studies, the original RF VIM may be better suited to identify regions of markers containing influential predictors, because in this case correlation is usually a consequence of physical proximity of loci and thus may help localise causal variants. Since RF and CIF are nonlinear and nonparametric methods, they may not be expected to exactly reproduce inferences drawn from linear regression models. In fact, we would argue that these algorithms may provide a more effective first-stage information-gathering scheme than a traditional regression model. This would be especially true as the number of predictors increases, such as in the current era of genome-wide association studies or microarray studies [8].

Methods
We investigated the behaviour of RF/CIF under predictor correlation using the same model studied in [3], which was of the form:

y = 5x1 + 5x2 + 2x3 + 0x4 - 5x5 - 5x6 - 2x7 + 0x8 + 0x9 + 0x10 + 0x11 + 0x12
The unconditional permutation-based VIM is computed from the difference in each tree's prediction accuracy before and after randomly permuting the values of a predictor. The differences between these two quantities are averaged across all the trees in the forest for a particular predictor. RF provides a scaled VIM where the scaling function used was:

observed VIM / empirical standard error(VIM).

The conditional VIM in CIF differs from the unconditional VIM by permuting one predictor within strata of sets of predictors that are correlated with that predictor [3].
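The conditional VIM itself is implemented in the R party package; the following is only a simplified Python sketch of the idea in [3]: permute one predictor within strata defined by binning its correlated partner variables, so that the permutation preserves the predictor's relationship to those partners. The quantile-binning scheme and the function shape are assumptions for illustration, not the algorithm used in the paper.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def conditional_vim(model, X, y, j, partners, n_bins=4, seed=0):
    """Simplified conditional VIM: permute column j only within strata
    formed by quantile-binning the correlated partner variables."""
    rng = np.random.default_rng(seed)
    base = mean_squared_error(y, model.predict(X))
    cuts = np.linspace(0, 1, n_bins + 1)[1:-1]
    bins = np.stack([np.digitize(X[:, p], np.quantile(X[:, p], cuts))
                     for p in partners], axis=1)
    Xp = X.copy()
    for s in np.unique(bins, axis=0):
        idx = np.where((bins == s).all(axis=1))[0]
        Xp[idx, j] = rng.permutation(Xp[idx, j])
    return mean_squared_error(y, model.predict(Xp)) - base

# e.g. conditional importance of x4 given its correlated partners x1-x3:
# cvim_x4 = conditional_vim(rf, X, y, j=3, partners=[0, 1, 2])
```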
RF and CIF used subsampling of 63.2% of the observations, and used a randomly-selected subset of variables for splitting (mtry) set to 1, 3 and 8, to be consistent with [3]. The minimum node size for both algorithms was set to 20, as small terminal node sizes are more sensitive to correlations between predictors [4].
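A rough scikit-learn analogue of these settings is sketched below (max_features for mtry, min_samples_leaf for the minimum node size, max_samples for the 63.2% sampling fraction). Note that scikit-learn draws bootstrap samples with replacement, whereas the forests in the paper subsampled without replacement, so this is an approximation rather than the paper's configuration.

```python
from sklearn.ensemble import RandomForestRegressor

forests = {
    mtry: RandomForestRegressor(n_estimators=500,
                                max_features=mtry,     # mtry in {1, 3, 8}
                                min_samples_leaf=20,   # minimum node size
                                max_samples=0.632,     # 63.2% of observations per tree
                                bootstrap=True,        # sampled WITH replacement here
                                random_state=0).fit(X, y)
    for mtry in (1, 3, 8)
}
```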
We compared the frequency of variable selection in RF and CIF for both the first split and across all splits in the forest. We computed bias and 95% coverage using linear regression models for the full model containing all 12 predictors, and additionally for models containing one predictor at a time. Because the generating model did not include an intercept, we calculated bias and 95% coverage using a model containing an intercept and a model where the intercept was forced through the origin. Bias was calculated as the mean difference between the estimated coefficients for a particular predictor and the true value. Coverage was calculated as the percentage of 95% confidence intervals for the estimated coefficients from the linear regression model that contained the true value, which should therefore be close to 95%. Correlation between individual predictors and the outcome was calculated using Pearson's correlation coefficient.
Calculating true values of the correlation between the outcome y and the predictors in the single-predictor models begins by letting

m = Vb,

where V is the correlation matrix of the predictors and b is the vector of coefficients for each variable [11]. Then, the true correlations are found from

cor(y, x) = m / var(y)^(1/2),

where

var(y)^(1/2) = (b^T m)^(1/2).

Additional file 1: Supplementary Figure 1.
[http://www.biomedcentral.com/content/supplementary/1471-2105-11-110-S1.PDF]
Additional file 2: Supplementary Figure 2.
[http://www.biomedcentral.com/content/supplementary/1471-2105-11-110-S2.PDF]
Additional file 3: Supplementary Table 1.
[http://www.biomedcentral.com/content/supplementary/1471-2105-11-110-S3.TXT]
Additional file 4: Supplementary Table 2.
[http://www.biomedcentral.com/content/supplementary/1471-2105-11-110-S4.TXT]

Acknowledgements
We are grateful to Dr. Yan Meng for her insightful comments on our manuscript. The Wellcome Trust provided support for KKN. This study utilized the high performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, MD (http://biowulf.nih.gov).

Author details
1 Statistical Genetics, Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford OX3 7BN, UK. 2 Department of Clinical Pharmacology, University of Oxford, Old Road Campus Research Building, Off Roosevelt Drive, Oxford OX3 7DQ, UK. 3 Genes, Cognition and Psychosis Program, Intramural Research Program, National Institute of Mental Health, National Institutes of Health, Room 4S-235, 10 Center Drive, Bethesda, Maryland, 20892, USA. 4 Mathematical and Statistical Computing Laboratory, Division of Computational Bioscience, Center for Information Technology, National Institutes of Health, Bethesda, Maryland, 20892, USA. 5 Department für Statistik, Ludwig-Maximilians-Universität München, Ludwigstr. 33, 80539 München, Germany. 6 Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Maria-Goeppert-Str 1, 23562, Lübeck, Germany.

Authors' contributions
KKN conceived of the study, defined the methods, performed all simulation and analysis and wrote the paper. JDM contributed to the design of the methods, wrote the paper and edited the paper. CS wrote and edited the paper. AZ edited the paper. All authors read and approved the final manuscript.

Received: 27 August 2009 Accepted: 27 February 2010
Published: 27 February 2010

References
1. Breiman L: Random forests. Machine Learning 2001, 45:5-32.
2. Hothorn T, Hornik K, Zeileis A: Unbiased recursive partitioning: A conditional inference framework. J Comp Graph Stat 2006, 15:651-674.
3. Strobl C, Boulesteix AL, Kneib T, Augustin T, Zeileis A: Conditional variable importance for random forests. BMC Bioinformatics 2008, 9:307.
4. Nicodemus KK, Malley JD: Predictor correlation impacts machine learning algorithms: implications for genomic studies. Bioinformatics 2009, 25(15):1884-1890.
5. Meng Y, Yu Y, Cupples LA, Farrer LA, Lunetta KL: Performance of random forest when SNPs are in linkage disequilibrium. BMC Bioinformatics 2009, 10:78.
6. Díaz-Uriarte R, Alvarez de Andrés S: Gene selection and classification of microarray data using random forest. BMC Bioinformatics 2006, 7:3.
7. Strobl C, Zeileis A: Exploring the statistical properties of a test for random forest variable importance. COMPSTAT 2008 - Proceedings in Computational Statistics, Physica Verlag, Heidelberg 2008, II:59-66.
8. Cordell HJ: Genome-wide association studies: Detecting gene-gene interactions that underlie human diseases. Nat Rev Genet 2009, 10:392-404.
9. Liaw A, Wiener M: Classification and regression by randomForest. R News 2002, 2:18-22.
doi:10.1186/1471-2105-11-110
Cite this article as: Nicodemus et al.: The behaviour of random forest
permutation-based variable importance measures under predictor
correlation. BMC Bioinformatics 2010 11:110.