Abstract

We describe a novel non-parametric statistical hypothesis test of relative dependence between a source variable and two candidate target variables. Such a test enables us to determine whether one source variable is significantly more dependent on a first target variable or a second. Dependence is measured via the Hilbert-Schmidt Independence Criterion (HSIC), resulting in a pair of empirical dependence measures (source-target 1, source-target 2). We test whether the first dependence measure is significantly larger than the second. Modeling the covariance between these HSIC statistics leads to a provably more powerful test than the construction of independent HSIC statistics by sub-sampling. The resulting test is consistent and unbiased, and (being based on U-statistics) has favorable convergence properties. The test can be computed in quadratic time, matching the computational complexity of standard empirical HSIC estimators. The effectiveness of the test is demonstrated on several real-world problems: we identify language groups from a multilingual corpus, and we show that tumor location is more dependent on gene expression than on chromosomal imbalances. Source code is available for download at https://github.com/wbounliphone/reldep.

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

1. Introduction

Tests of dependence are important tools in statistical analysis, and are widely applied in many data analysis contexts. Classical criteria include Spearman's ρ and Kendall's τ, which can detect non-linear monotonic dependencies. More recent research on dependence measurement has focused on non-parametric measures of dependence, which apply even when the dependence is nonlinear, or the variables are multivariate or non-Euclidean (for instance images, strings, and graphs). The statistics for such tests are diverse, and include kernel measures of covariance (Gretton et al., 2008; Zhang et al., 2011) and correlation (Dauxois & Nkiet, 1998; Fukumizu et al., 2008), distance covariances (which are instances of kernel tests) (Székely et al., 2007; Sejdinovic et al., 2013b), kernel regression tests (Cortes et al., 2009; Gunn & Kandola, 2002), rankings (Heller et al., 2013), and space partitioning approaches (Gretton & Gyorfi, 2010; Reshef et al., 2011; Kinney & Atwal, 2014). Specialization of such methods to univariate linear dependence can yield similar tests to classical approaches such as Darlington (1968); Bring (1996).

For many problems in data analysis, however, the question of whether dependence exists is secondary: there may be multiple dependencies, and the question becomes which dependence is the strongest. For instance, in neuroscience, multiple stimuli may be present (e.g. visual and audio), and it is of interest to determine which of the two has a stronger influence on brain activity (Trommershauser et al., 2011). In automated translation (Peters et al., 2012), it is of interest to determine whether documents in a source language are a significantly better match to those in one target language than to another target language, either as a measure of difficulty of the respective learning tasks, or as a basic tool for comparative linguistics.
A low variance consistent test of relative dependency
We present a statistical test which determines whether two target variables have a significant difference in their dependence on a third, source variable. The dependence between each of the target variables and the source is computed using the Hilbert-Schmidt Independence Criterion (Gretton et al., 2005; 2008).¹ Care must be taken in analyzing the asymptotic behavior of the test statistics, since the two measures of dependence will themselves be correlated: they are both computed with respect to the same source. Thus, we derive the joint asymptotic distribution of both dependencies. The derivation of our test utilizes classical results of U-statistics (Hoeffding, 1963; Serfling, 1981; Arcones & Gine, 1993). In particular, we make use of results by Hoeffding (1963) and Serfling (1981) to determine the asymptotic joint distributions of the statistics (see Theorem 4). Consequently, we derive the lowest variance unbiased estimator of the test statistic.

We prove our approach to have greater statistical power than constructing two uncorrelated statistics on the same data by subsampling, and testing on these. In experiments, we are able to successfully test which of two variables is most strongly related to a third, in synthetic examples, in a language group identification task, and in a task for identifying the relative strength of factors for Glioma type in a pediatric patient population.

To our knowledge, there do not exist competing non-parametric tests to determine which of two dependencies is strongest. One related area is that of multiple regression analysis (e.g. Sen & Srivastava, 2011). In this case a linear model is assumed, and it is determined whether individual inputs have a statistically significant effect on an output variable. The procedure does not address the question of whether the influence of one variable is higher than that of another to a statistically significant degree. The problem of variable selection has also been investigated in the case of nonlinear relations between the inputs and outputs (Cortes et al., 2009; 2012; Song et al., 2012); however, this again does not address which of two variables most strongly influences a third. A less closely related area is that of detecting three-variable interactions (Sejdinovic et al., 2013a), where it is determined whether there exists any factorization of the joint distribution over three variables. This test again does not address the issue of finding which connections are strongest, however.

¹ Dependency can also be tested with the correlation operator. However, Fukumizu et al. (2007) show that unlike the covariance operator, the asymptotic distribution of the norm of the correlation operator is unknown, so the construction of a computationally efficient test of relative dependence remains an open problem.

2. Definitions and description of HSIC

We base our underlying notion of dependence on the Hilbert-Schmidt Independence Criterion (Gretton et al., 2005; 2008; Song et al., 2012). All results in this section except for Problem 1 can be found in these previous works.

Definition 1. (Gretton et al., 2005, Definition 1, Lemma 1: Hilbert-Schmidt Independence Criterion) Let $P_{xy}$ be a Borel probability measure over $(\mathcal{X} \times \mathcal{Y}, \Gamma \times \Lambda)$ with $\Gamma$ and $\Lambda$ the respective Borel sets on $\mathcal{X}$ and $\mathcal{Y}$, and $P_x$ and $P_y$ the marginal distributions on domains $\mathcal{X}$ and $\mathcal{Y}$. Given separable RKHSs $\mathcal{F}$ and $\mathcal{G}$, the Hilbert-Schmidt Independence Criterion (HSIC) is defined as the squared HS-norm of the associated cross-covariance operator $C_{xy}$. When the kernels $k$, $l$ are associated uniquely with respective RKHSs $\mathcal{F}$ and $\mathcal{G}$ and bounded, HSIC can be expressed in terms of expectations of kernel functions

$$\mathrm{HSIC}(\mathcal{F},\mathcal{G},P_{xy}) := \|C_{xy}\|^2_{HS} = \mathbb{E}_{xx'yy'}[k(x,x')\,l(y,y')] + \mathbb{E}_{xx'}[k(x,x')]\,\mathbb{E}_{yy'}[l(y,y')] - 2\,\mathbb{E}_{xy}\big[\mathbb{E}_{x'}[k(x,x')]\,\mathbb{E}_{y'}[l(y,y')]\big]. \tag{1}$$

HSIC determines independence: $\mathrm{HSIC} = 0$ iff $P_{xy} = P_x P_y$ when kernels $k$ and $l$ are characteristic on their respective marginal domains (Gretton, 2015).

With this choice, the problem we would like to solve is described as follows:

Problem 1. Given separable RKHSs $\mathcal{F}$, $\mathcal{G}$, and $\mathcal{H}$ with $\mathrm{HSIC}(\mathcal{F},\mathcal{G},P_{xy}) > 0$ and $\mathrm{HSIC}(\mathcal{F},\mathcal{H},P_{xz}) > 0$, we test the null hypothesis $H_0 : \mathrm{HSIC}(\mathcal{F},\mathcal{G},P_{xy}) \le \mathrm{HSIC}(\mathcal{F},\mathcal{H},P_{xz})$ versus the alternative hypothesis $H_1 : \mathrm{HSIC}(\mathcal{F},\mathcal{G},P_{xy}) > \mathrm{HSIC}(\mathcal{F},\mathcal{H},P_{xz})$ at a given significance level $\alpha$.

We now describe the asymptotic behavior of the HSIC for dependent variables.

Theorem 1. (Song et al., 2012, Theorem 2: Unbiased estimator for $\mathrm{HSIC}(\mathcal{F},\mathcal{G},P_{xy})$) We denote by $S$ the set of observations $\{(x_1,y_1),\dots,(x_m,y_m)\}$ of size $m$ drawn i.i.d. from $P_{xy}$. The unbiased estimator $\mathrm{HSIC}_m(\mathcal{F},\mathcal{G},S)$ is given by

$$\mathrm{HSIC}_m(\mathcal{F},\mathcal{G},S) = \frac{1}{m(m-3)}\left[\operatorname{Tr}(\tilde{K}\tilde{L}) + \frac{\mathbf{1}^\top\tilde{K}\mathbf{1}\,\mathbf{1}^\top\tilde{L}\mathbf{1}}{(m-1)(m-2)} - \frac{2}{m-2}\,\mathbf{1}^\top\tilde{K}\tilde{L}\mathbf{1}\right] \tag{2}$$

where $\tilde{K}$ and $\tilde{L}$ are related to the kernel matrices $K$ and $L$ by $\tilde{K}_{ij} = (1-\delta_{ij})K_{ij}$ and $\tilde{L}_{ij} = (1-\delta_{ij})L_{ij}$, i.e. the diagonal entries are set to zero.
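The estimator in Eq. (2) can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' released code: the function names `gaussian_gram` and `hsic_unbiased`, the Gaussian kernel with a fixed bandwidth of 1, and the toy data are all assumptions made for the example.

```python
import numpy as np


def gaussian_gram(x, sigma=1.0):
    """Gaussian-kernel Gram matrix for samples x of shape (m, d).

    The kernel choice and bandwidth are illustrative assumptions.
    """
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))


def hsic_unbiased(K, L):
    """Unbiased HSIC estimate of Eq. (2) from symmetric m x m Gram matrices."""
    m = K.shape[0]
    if m < 4:
        raise ValueError("the unbiased estimator needs at least 4 samples")
    # K-tilde and L-tilde: the Gram matrices with their diagonals zeroed.
    Kt = K - np.diag(np.diag(K))
    Lt = L - np.diag(np.diag(L))
    ones = np.ones(m)
    trace_term = np.sum(Kt * Lt)  # equals Tr(Kt Lt) for symmetric matrices
    sum_term = (ones @ Kt @ ones) * (ones @ Lt @ ones) / ((m - 1) * (m - 2))
    cross_term = 2.0 * (ones @ Kt @ Lt @ ones) / (m - 2)
    return (trace_term + sum_term - cross_term) / (m * (m - 3))


# Toy data (hypothetical): y depends strongly on x, z is independent of x.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y = x + 0.1 * rng.normal(size=(200, 1))
z = rng.normal(size=(200, 1))

K = gaussian_gram(x)
hsic_xy = hsic_unbiased(K, gaussian_gram(y))
hsic_xz = hsic_unbiased(K, gaussian_gram(z))
print(hsic_xy, hsic_xz)
```

Because the estimator is unbiased, the value for the independent pair can be slightly negative; only the matrix products are needed, so the cost is quadratic in m.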
Theorem 2. (Song et al., 2012, Theorem 3: U-statistic of HSIC) This finite sample unbiased estimator of $\mathrm{HSIC}_m^{XY}$ can be written as a U-statistic,

$$\mathrm{HSIC}_m^{XY} = (m)_4^{-1} \sum_{(i,j,q,r)\in i_4^m} h_{ijqr} \tag{3}$$

where $(m)_4 := \frac{m!}{(m-4)!}$, the index set $i_4^m$ denotes the set of all 4-tuples drawn without replacement from the set $\{1,\dots,m\}$

$d_{ij} = d(z_i, z_j)$. Let $\mathrm{HSIC}_m^{XY}$ and $\mathrm{HSIC}_m^{XZ}$ be respectively the unbiased estimators of $\mathrm{HSIC}(\mathcal{F},\mathcal{G},P_{xy})$ and $\mathrm{HSIC}(\mathcal{F},\mathcal{H},P_{xz})$, written as a sum of U-statistics with respective kernels $h_{ijqr}$ and $g_{ijqr}$ as described in (4),

$$h_{ijqr} = \frac{1}{24}\sum_{(s,t,u,v)}^{(i,j,q,r)} k_{st}\,(l_{st} + l_{uv} - 2\,l_{su}),$$
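As a sanity check on the U-statistic form, the kernel $h_{ijqr}$ can be evaluated by brute force on a small sample and compared with the closed form of Eq. (2). The sketch below is illustrative only: the function names and the random Gaussian-kernel matrices are assumptions, and the enumeration over all 4-subsets is far too slow for real data.

```python
from itertools import combinations, permutations
from math import isclose

import numpy as np


def h_kernel(K, L, idx):
    """h_{ijqr}: average of k_st (l_st + l_uv - 2 l_su) over the 24 orderings of idx."""
    total = 0.0
    for s, t, u, v in permutations(idx):
        total += K[s, t] * (L[s, t] + L[u, v] - 2.0 * L[s, u])
    return total / 24.0


def hsic_ustat(K, L):
    """Eq. (3) by enumeration; h is symmetric, so averaging over 4-subsets suffices."""
    m = K.shape[0]
    values = [h_kernel(K, L, idx) for idx in combinations(range(m), 4)]
    return sum(values) / len(values)


def hsic_closed_form(K, L):
    """Eq. (2), repeated here so that the block is self-contained."""
    m = K.shape[0]
    Kt = K - np.diag(np.diag(K))
    Lt = L - np.diag(np.diag(L))
    ones = np.ones(m)
    total = np.sum(Kt * Lt)
    total += (ones @ Kt @ ones) * (ones @ Lt @ ones) / ((m - 1) * (m - 2))
    total -= 2.0 * (ones @ Kt @ Lt @ ones) / (m - 2)
    return total / (m * (m - 3))


# Small hypothetical sample with Gaussian kernels of unit bandwidth.
rng = np.random.default_rng(1)
x = rng.normal(size=7)
y = x + rng.normal(size=7)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
L = np.exp(-0.5 * (y[:, None] - y[None, :]) ** 2)

u_value = hsic_ustat(K, L)
c_value = hsic_closed_form(K, L)
print(u_value, c_value)
```

The two values agree up to floating-point error, which is the content of the theorem: the quadratic-time closed form and the fourth-order U-statistic are the same estimator.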
3.3. The dependent test is more powerful

While discarding half the samples leads to a consistent test, we might expect some loss of power over the approach in Section 3.1, due to the increase in variance with lower sample size. In this section, we prove the Section 3.1 test is more powerful than that of Section 3.2, regardless of $P_{xy}$ and $P_{xz}$.

We call the simple and consistent approach in Section 3.2 the independent approach, and the lower variance approach in Section 3.1 the dependent approach. The following theorem compares these approaches.

Theorem 6. The asymptotic relative efficiency (ARE) of the independent approach relative to the dependent approach is always greater than 1.

Remark 1. The asymptotic relative efficiency (ARE) is defined in e.g. Serfling (1981, Chap. 5, Section 1.15.4). If $m_A$ and $m_B$ are the sample sizes at which tests "perform equivalently" (i.e. have equal power), then the ratio $m_A/m_B$ represents the relative efficiency. When $m_A$ and $m_B$ tend to $+\infty$ and the ratio $m_A/m_B \to L$ (at equivalent performance), then the value $L$ represents the asymptotic relative efficiency of procedure B relative to procedure A. This example is relevant to our case since we are comparing two test statistics with different asymptotically Normal distributions.

The following lemma is used for the proof of Theorem 6.

Lemma 1. (Lower Variance) The variance of the dependent test statistic is smaller than the variance of the independent test statistic.

Proof. From the convergence of moments in the application of the central limit theorem (von Bahr, 1965), we have that $\sigma^2_{X'Y'} = 2\sigma^2_{XY}$. Then the variance summary in Eq. (11) is $\frac{1}{2}(\sigma^2_{XY} + \sigma^2_{XZ} - 2\sigma_{XYXZ})$ and the variance summary in Eq. (18) is $\frac{1}{2}(2\sigma^2_{XY} + 2\sigma^2_{XZ})$, where in both cases the statistic is scaled by $\sqrt{m}$. We have that the variance of the dependent test statistic is smaller than the variance of the independent test statistic when

$$\frac{1}{2}\left(\sigma^2_{XY} + \sigma^2_{XZ} - 2\sigma_{XYXZ}\right) < \frac{1}{2}\left(2\sigma^2_{XY} + 2\sigma^2_{XZ}\right) \iff -2\sigma_{XYXZ} < \sigma^2_{XY} + \sigma^2_{XZ} \tag{19}$$

which is implied by the positive definiteness of $\Sigma$ (take the quadratic form of $\Sigma$ with the vector $(1, 1)^\top$).

Proof of Theorem 6. The Type II error probability of the independent test at level $\alpha$ is

$$\Phi\left[\Phi^{-1}(1-\alpha) - \frac{m^{-1/2}\left(\mathrm{HSIC}(\mathcal{F},\mathcal{G},P_{xy}) - \mathrm{HSIC}(\mathcal{F},\mathcal{H},P_{xz})\right)}{\sqrt{\sigma^2_{X'Y'} + \sigma^2_{X''Z''}}}\right], \tag{20}$$

where we again make the most conservative possible assumption that $\mathrm{HSIC}(\mathcal{F},\mathcal{G},P_{xy}) - \mathrm{HSIC}(\mathcal{F},\mathcal{H},P_{xz}) = 0$ under the null. The Type II error probability of the dependent test at level $\alpha$ is

$$\Phi\left[\Phi^{-1}(1-\alpha) - \frac{m^{-1/2}\left(\mathrm{HSIC}(\mathcal{F},\mathcal{G},P_{xy}) - \mathrm{HSIC}(\mathcal{F},\mathcal{H},P_{xz})\right)}{\sqrt{\sigma^2_{XY} + \sigma^2_{XZ} - 2\sigma_{XYXZ}}}\right] \tag{21}$$

where $\Phi$ is the CDF of the standard normal distribution. The numerator in Eq. (20) is the same as the numerator in Eq. (21), and the denominator in Eq. (21) is smaller due to Lemma 1. The lower variance dependent test therefore has higher ARE, i.e., for a sufficient sample size $m > \tau$ for some distribution dependent $\tau \in \mathbb{N}^+$, the dependent test will be more powerful than the independent test.

4. Generalizing to more than two HSIC statistics

The generalization of the dependence test to more than three random variables follows from the earlier derivation by applying successive rotations to a higher dimensional joint Gaussian distribution over multiple HSIC statistics. We assume a sample $S$ of size $m$ over $n$ domains with kernels $k_1, \dots, k_n$ associated uniquely with respective reproducing kernel Hilbert spaces $\mathcal{F}_1, \dots, \mathcal{F}_n$. We define a generalized statistical test, $T_g(S) \to \{0, 1\}$, to test the null hypothesis $H_0 : \sum_{(x,y)\in\{1,\dots,n\}^2} v_{(x,y)}\,\mathrm{HSIC}(\mathcal{F}_x,\mathcal{F}_y,P_{xy}) \le 0$ versus the alternative hypothesis $H_m : \sum_{(x,y)\in\{1,\dots,n\}^2} v_{(x,y)}\,\mathrm{HSIC}(\mathcal{F}_x,\mathcal{F}_y,P_{xy}) > 0$, where $v$ is a vector of weights on each HSIC statistic. We may recover the test in the previous section by setting $v_{(1,2)} = +1$, $v_{(1,3)} = -1$ and $v_{(i,j)} = 0$ for all $(i,j) \in \{1,2,3\}^2 \setminus \{(1,2),(1,3)\}$.

The derivation of the test follows the general strategy used in the previous section: we construct a rotation matrix so as to project the joint Gaussian distribution onto the first axis, and read the p-value from a standard normal table. To construct the rotation matrix, we simply need to rotate $v$ such that it is aligned with the first axis. Such a rotation can be computed by composing $n$ 2-dimensional rotation matrices as in Algorithm 1.

5. Experiments

We apply our estimates of statistical dependence to three challenging problems. The first is a synthetic data experiment, in which we can directly control the relative degree of functional dependence between variates. The second experiment uses a multilingual corpus to determine the relative relations between European languages. The last experiment uses pediatric glioma data (Section 5.3).
Require: $v \in \mathbb{R}^n$
Ensure: $[Qv]_i = 0 \;\; \forall i \neq 1$, $Q^\top Q = I$
Q = Q_i Q
end for
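One way to realize the rotation described in Section 4 is to compose plane (Givens) rotations that zero one coordinate of v at a time. The sketch below is an assumption-laden illustration rather than a transcription of Algorithm 1, whose loop body is not fully available here: the function name `rotation_to_first_axis` is hypothetical, and this variant uses n - 1 rotations, each acting on the first axis and one other axis.

```python
import numpy as np


def rotation_to_first_axis(v):
    """Return an orthogonal Q with Q @ v = (||v||, 0, ..., 0).

    Q is accumulated as Q = Q_i @ Q, where each Q_i is a 2-dimensional
    rotation in the plane spanned by axis 0 and axis i (an assumed
    construction, chosen to satisfy the Ensure conditions).
    """
    w = np.asarray(v, dtype=float).copy()
    n = w.shape[0]
    Q = np.eye(n)
    for i in range(n - 1, 0, -1):
        r = np.hypot(w[0], w[i])
        if r == 0.0:
            continue  # both coordinates are already zero, nothing to rotate
        c, s = w[0] / r, w[i] / r
        Qi = np.eye(n)
        Qi[0, 0], Qi[0, i] = c, s
        Qi[i, 0], Qi[i, i] = -s, c
        w = Qi @ w  # moves the mass of coordinate i onto coordinate 0
        Q = Qi @ Q
    return Q


v = np.array([1.0, 1.0, -2.0])  # the weight vector quoted for Table 3
Q = rotation_to_first_axis(v)
print(Q @ v)
```

After the rotation, the first coordinate of Q applied to the vector of HSIC statistics is their v-weighted combination divided by the norm of v, so the multi-statistic test reduces to a one-dimensional Gaussian tail computation.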
[Figure 1: plots of the synthetic data. Panels: (a) γ1 = 0.3, (b) γ2 = 0.3, (c) γ3 = 0.6. Axis labels: t + γ1 N(0, 1); t cos(t) + γ2 N(0, 1) and t sin(t) + γ2 N(0, 1); t cos(t) + γ3 N(0, 1).]

[Figure: x-axis γ3.] [...] function of γ3 on the synthetic data described in Section 5.1. For values of γ3 > 0.3 the distribution in Fig. 1(a) is closer to 1(b) than to 1(c). The problem becomes difficult as γ3 → 0.3. As predicted by theory, the dependent test is significantly more powerful over almost all values of γ3 by a substantial margin.

[Figure 3: axis label HSIC_XZ.]

Figure 3. For the synthetic experiments described in Section 5.1, we plot empirical HSIC values for dependent and independent tests for 100 repeated draws with different sample sizes. Empirical p-values for each test show that the dependent distribution converges faster than the independent distribution even at low sample size, resulting in a more powerful statistical test.
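The power gap between the two tests can be illustrated numerically from the Type II error expressions of Section 3.3, which share a numerator and differ only in the denominator. The sketch below uses purely hypothetical variance and covariance values and a hypothetical helper `type2_error`; it is not derived from the paper's data.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def type2_error(numerator, denominator, alpha=0.05):
    """Phi(Phi^{-1}(1 - alpha) - numerator / denominator), the shared form of Eqs. (20)-(21)."""
    threshold = STD_NORMAL.inv_cdf(1.0 - alpha)
    return STD_NORMAL.cdf(threshold - numerator / denominator)


# Hypothetical values, chosen only to make the comparison concrete.
var_xy, var_xz, cov_xyxz = 1.0, 1.0, 0.5
numerator = 1.0  # stands in for the common numerator of Eqs. (20) and (21)

# Dependent test: the covariance between the two HSIC statistics is modeled.
den_dependent = sqrt(var_xy + var_xz - 2.0 * cov_xyxz)
# Independent test: each statistic uses half the sample, doubling its variance.
den_independent = sqrt(2.0 * var_xy + 2.0 * var_xz)

beta_dependent = type2_error(numerator, den_dependent)
beta_independent = type2_error(numerator, den_independent)
print(beta_dependent, beta_independent)
```

With these illustrative numbers the dependent test has the smaller denominator and hence the smaller Type II error, which matches the direction of the argument in Lemma 1 and Theorem 6.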
is higher than the dependence between groups. The results of these tests are in Table 3. As before, our test is able to distinguish language groups with high significance.

Source   Targets     p-value
da       de sv fi    < 10^-9
da       sv en fr    < 10^-9
de       sv en it    < 10^-5
fr       it es sv    < 10^-5
es       fr pt nl    0.0175

Table 3. Relative dependency test between four pairs of HSIC statistics for the multilingual corpus data. These tests show the ability of the relative dependence test to generalize to arbitrary numbers of HSIC statistics by constructing a rotation matrix using Algorithm 1. In all cases v = [1 1 −2].

5.3. Pediatric glioma data

Brain tumors are the most common solid tumors in children and have the highest mortality rate of all pediatric cancers. Despite advances in multimodality therapy, children with pediatric high-grade gliomas (pHGG) invariably have an overall survival of around 20% at 5 years. Depending on their location (e.g. brainstem, central nuclei, or supratentorial), pHGG present different characteristics in terms of radiological appearance, histology, and prognosis. The hypothesis is that pHGG have different genetic origins and oncogenic pathways depending on their location. Thus, the biological processes involved in the development of the tumor may be different from one location to another.

In order to evaluate such hypotheses, pre-treatment frozen tumor samples were obtained from 53 children with newly diagnosed pHGG from Necker Enfants Malades (Paris, France) (Puget et al., 2012). The 53 tumors are divided into 3 locations: supratentorial (HEMI), central nuclei (MIDL), and brain stem (DIPG). The final dataset is organized in 3 blocks of variables defined for the 53 tumors: X is a block of indicator variables describing the location category, the second data matrix Y provides the expression of 15 702 genes (GE), and the third data matrix Z contains the imbalances of 1229 segments (CGH) of chromosomes.

For X, we use a linear kernel, which is characteristic for indicator variables, and for Y and Z, the kernel was chosen to be the Gaussian kernel with σ selected as the median of pairwise distances. The p-value of our relative dependency test is < 10^-5. This shows that the tumor location in the brain is more dependent on gene expression than on chromosomal imbalances. By contrast with Section 5.1, the independent test was also able to find the same ordering of dependence, but with a p-value that is three orders of magnitude larger (p = 0.005). Figure 5 shows iso-curves of the Gaussian distributions estimated in the independent and dependent tests. The empirical relative dependency is consistent with findings in the medical literature, and provides additional statistical support for the importance of tumor location in Glioma (Gilbertson & Gutmann, 2007; Palm et al., 2009; Puget et al., 2012).

[Figure 5: x-axis HSIC_XY (×10^-3), y-axis HSIC_XZ.]

Figure 5. 2σ iso-curves of the Gaussian distributions estimated from the pediatric Glioma data. As before, the dependent test has a much lower variance than the independent test. The tests support a stronger dependence of tumor location on gene expression than on chromosomal imbalances.

6. Conclusions

We have described a novel statistical test that determines whether a source random variable is more strongly dependent on one target random variable or another. This test, built on the Hilbert-Schmidt Independence Criterion, is low variance, consistent, and unbiased. We have shown that our test is strictly more powerful than a test that does not exploit the covariance between HSIC statistics, and empirically achieves p-values several orders of magnitude smaller. We have empirically demonstrated the test performance on synthetic data, where the degree of dependence could be controlled; on the challenging problem of identifying language groups from a multilingual corpus; and for finding the most important determinant of Glioma type. The computation and memory requirements of the test are quadratic in the sample size, matching the performance of HSIC and related tests for dependence between two random variables. The test is therefore scalable to the wide range of problem instances where non-parametric dependency tests are currently applied. We have generalized the test framework to more than two HSIC statistics, and have given an algorithm to construct a consistent, low-variance, unbiased test in this setting.

Acknowledgements

We thank Ioannis Antonoglou for helpful discussions. The first author is supported by a fellowship from Centrale-Supélec. This work is partially funded by the European Commission through ERC Grant 259112 and FP7-MCCIG334380.

Gretton, A. and Gyorfi, L. Consistent nonparametric tests of independence. Journal of Machine Learning Research, 11:1391–1423, 2010.
Cortes, C., Mohri, M., and Rostamizadeh, A. Learning non-linear combinations of kernels. In Neural Information Processing Systems, 2009.

Cortes, C., Mohri, M., and Rostamizadeh, A. Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research, 13:795–828, 2012.

Darlington, Richard B. Multiple regression in psychological research and practice. Psychological bulletin, 69(3):161, 1968.

Dauxois, J. and Nkiet, G. M. Nonlinear canonical analysis and independence tests. Annals of Statistics, 26(4):1254–1278, 1998.

Fukumizu, K., Bach, F. R., and Gretton, A. Statistical consistency of kernel canonical correlation analysis. The Journal of Machine Learning Research, 8:361–383, 2007.

Fukumizu, K., Gretton, A., Sun, X., and Schölkopf, B. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems, pp. 489–496. MIT Press, 2008.

Gilbertson, R. J. and Gutmann, D. H. Tumorigenesis in the brain: location, location, location. Cancer research, 67(12):5579–5582, 2007.

Gray, R. D. and Atkinson, Q. D. Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature, 426(6965):435–439, 2003.

Gretton, A. A simpler condition for consistency of a kernel independence test. arXiv:1501.06103, 2015.

Gunn, S. R. and Kandola, J. S. Structural modelling with sparse kernels. Machine Learning, 48(1):137–163, 2002.

Heller, R., Heller, Y., and Gorfine, M. A consistent multivariate test of association based on ranks of distances. Biometrika, 100(2):503–510, 2013.

Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American statistical association, 58(301):13–30, 1963.

Kinney, J. B. and Atwal, G. S. Equitability, mutual information, and the maximal information coefficient. Proceedings of the National Academy of Sciences, 2014.

Koehn, P. Europarl: A parallel corpus for statistical machine translation. In MT summit, volume 5, pp. 79–86, 2005.

Palm, T., Figarella-Branger, D., Chapon, F., Lacroix, C., Gray, F., Scaravilli, F., Ellison, D. W., Salmon, I., Vikkula, M., and Godfraind, C. Expression profiling of ependymomas unravels localization and tumor grade-specific tumorigenesis. Cancer, 115(17):3955–3968, 2009.

Peters, C., Braschler, M., and Clough, P. Multilingual Information Retrieval: From Research to Practice. Springer, 2012.

Puget, S., Philippe, C., Bax, D., Job, B., Varlet, P., Junier, M. P., Andreiuolo, F., Carvalho, D., Reis, R., and Guerrini-Rousseau, L. Mesenchymal transition and PDGFRA amplification/mutation are key distinct oncogenic events in pediatric diffuse intrinsic pontine gliomas. PloS one, 7(2):e30313, 2012.

Song, L., Smola, A., Gretton, A., Bedo, J., and Borgwardt, K. Feature selection via dependence maximization. Journal of Machine Learning Research, 13:1393–1434, 2012.

Székely, G., Rizzo, M., and Bakirov, N. Measuring and testing dependence by correlation of distances. Annals of Statistics, 35(6):2769–2794, 2007.

Trommershauser, J., Kording, K., and Landy, M. S. Sensory Cue Integration. Oxford University Press, 2011.