A Low Variance Consistent Test of Relative Dependency

A low variance consistent test of relative dependency
Wacha Bounliphone WACHA . BOUNLIPHONE @ CENTRALESUPELEC . FR

CentraleSupélec & Inria, Grande Voie des Vignes, 92295 Châtenay-Malabry, France
Arthur Gretton ARTHUR . GRETTON @ GMAIL . COM
Gatsby Computational Neuroscience Unit, University College London, United Kingdom
Arthur Tenenhaus ARTHUR . TENENHAUS @ CENTRALESUPELEC . FR
CentraleSupélec, 3 rue Joliot-Curie, 91192 Gif-Sur-Yvette, France
Matthew B. Blaschko MATTHEW. BLASCHKO @ INRIA . FR
Inria & CentraleSupélec, Grande Voie des Vignes, 92295 Châtenay-Malabry, France
Abstract 1. Introduction
We describe a novel non-parametric statistical Tests of dependence are important tools in statistical anal-
hypothesis test of relative dependence between ysis, and are widely applied in many data analysis con-
a source variable and two candidate target vari- texts. Classical criteria include Spearman’s ρ and Kendall’s
ables. Such a test enables us to determine τ , which can detect non-linear monotonic dependencies.
whether one source variable is significantly more More recent research on dependence measurement has fo-
dependent on a first target variable or a sec- cused on non-parametric measures of dependence, which
ond. Dependence is measured via the Hilbert- apply even when the dependence is nonlinear, or the vari-
Schmidt Independence Criterion (HSIC), result- ables are multivariate or non-euclidean (for instance im-
ing in a pair of empirical dependence mea- ages, strings, and graphs). The statistics for such tests are
sures (source-target 1, source-target 2). We test diverse, and include kernel measures of covariance (Gret-
whether the first dependence measure is signif- ton et al., 2008; Zhang et al., 2011) and correlation (Daux-
icantly larger than the second. Modeling the ois & Nkiet, 1998; Fukumizu et al., 2008), distance co-
covariance between these HSIC statistics leads variances (which are instances of kernel tests) (Székely
to a provably more powerful test than the con- et al., 2007; Sejdinovic et al., 2013b), kernel regression
struction of independent HSIC statistics by sub- tests (Cortes et al., 2009; Gunn & Kandola, 2002), rank-
sampling. The resulting test is consistent and ings (Heller et al., 2013), and space partitioning approaches
unbiased, and (being based on U-statistics) has (Gretton & Gyorfi, 2010; Reshef et al., 2011; Kinney & At-
favorable convergence properties. The test can wal, 2014). Specialization of such methods to univariate
be computed in quadratic time, matching the linear dependence can yield similar tests to classical ap-
computational complexity of standard empiri- proaches such as Darlington (1968); Bring (1996).
cal HSIC estimators. The effectiveness of the For many problems in data analysis, however, the question
test is demonstrated on several real-world prob- of whether dependence exists is secondary: there may be
lems: we identify language groups from a mul- multiple dependencies, and the question becomes which
tilingual corpus, and we prove that tumor lo- dependence is the strongest. For instance, in neuroscience,
cation is more dependent on gene expression multiple stimuli may be present (e.g. visual and audio), and
than chromosomal imbalances. Source code is it is of interest to determine which of the two has a stronger
available for download at https://github. influence on brain activity (Trommershauser et al., 2011).
com/wbounliphone/reldep. In automated translation (Peters et al., 2012), it is of inter-
est to determine whether documents in a source language
are a significantly better match to those in one target lan-
Proceedings of the 32 nd International Conference on Machine guage than to another target language, either as a measure
Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copy- of difficulty of the respective learning tasks, or as a basic
right 2015 by the author(s). tool for comparative linguistics.
We present a statistical test which determines whether two 2. Definitions and description of HSIC
target variables have a significant difference in their de-
pendence on a third, source variable. The dependence be- We base our underlying notion of dependence on the
tween each of the target variables and the source is com- Hilbert-Schmidt Independence Criterion (Gretton et al.,
puted using the Hilbert-Schmidt Independence Criterion 2005; 2008; Song et al., 2012). All results in this section
(Gretton et al., 2005; 2008).1 Care must be taken in an- except for Problem 1 can be found in these previous works.
alyzing the asymptotic behavior of the test statistics, since Definition 1. (Gretton et al., 2005, Definition 1,Lemma 1:
the two measures of dependence will themselves be cor- Hilbert-Schmidt Independence Criterion)
related: they are both computed with respect to the same
Let Pxy be a Borel probability measure over over (X ×
source. Thus, we derive the joint asymptotic distribution
Y, Γ × Λ) with Γ and Λ the respective Borel sets on X and
of both dependencies. The derivation of our test utilizes
Y, and Px and Py the marginal distributions on domains
classical results of U -statistics (Hoeffding, 1963; Serfling,
X and Y. Given separable RKHSs F and G, the Hilbert-
1981; Arcones & Gine, 1993). In particular, we make use
Schmidt Independence Criterion (HSIC) is defined as the
of results by Hoeffding (1963) and Serfling (1981) to deter-
squared HS-norm of the associated cross-covariance oper-
mine the asymptotic joint distributions of the statistics (see
ator Cxy . When the kernels k, l are associated uniquely
Theorem 4). Consequently, we derive the lowest variance
withs respective RKHSs F and G and bounded, HSIC can
unbiased estimator of the test statistic.
be expressed in terms of expectations of kernel functions
We prove our approach to have greater statistical power
than constructing two uncorrelated statistics on the same HSIC(F, G, Pxy ) := kCxy k2HS
data by subsampling, and testing on these. In experiments, = Exx0 yy0 [k(x, x0 )l(y, y 0 )] + Exx0 [k(x, x0 )] Eyy0 [l(y, y 0 )]
we are able to successfully test which of two variables is − 2Exy [Ex0 [k(x, x0 )]Ey0 [l(y, y 0 )]] . (1)
most strongly related to a third, in synthetic examples, in a
language group identification task, and in a task for iden-
tifying the relative strength of factors for Glioma type in a HSIC determines independence: HSIC = 0 iff Pxy = Px Py
pediatric patient population. when kernels k and l are characteristic on their respective
marginal domains (Gretton, 2015).
To our knowledge, there do not exist competing non-
parametric tests to determine which of two dependencies With this choice, the problem we would like to solve is
is strongest. One related area is that of multiple regression described as follows:
analysis (e.g. (Sen & Srivastava, 2011)). In this case a lin- Problem 1. Given separable RKHSs F, G, and H with
ear model is assumed, and it is determined whether individ- HSIC(F, G, Pxy ) > 0 and HSIC(F, H, Pxz ) > 0,
ual inputs have a statistically significant effect on an output we test the null hypothesis H0 : HSIC(F, G, Pxy ) ≤
variable. The procedure does not address the question of HSIC(F, H, Pxz ) versus the alternative hypothesis H1 :
whether the influence of one variable is higher than that of HSIC(F, G, Pxy ) > HSIC(F, H, Pxz ) at a given sig-
another to a statistically significant degree. The problem of nificance level α.
variable selection has also been investigated in the case of
nonlinear relations between the inputs and outputs (Cortes We now describe the asymptotic behavior of the HSIC for
et al., 2009; 2012; Song et al., 2012), however this again dependent variables.
does not address which of two variables most strongly in- Theorem 1. (Song et al., 2012, Theorem 2: Unbiased
fluences a third. A less closely related area is that of detect- estimator for HSIC(F, G, Pxy )) We denote by S the set
ing three-variable interactions (Sejdinovic et al., 2013a), of observations {(x1 , y1 ), ..., (xm , ym )} of size m drawn
where it is determined whether there exists any factoriza- i.i.d. from Pxy . The unbiased estimator HSICm (F, G, S)
tion of the joint distribution over three variables. This test is given by
again does not address the issue of finding which connec-
tions are strongest, however. 1
HSICm (F, G, S) = × (2)
1 m(m − 3)
Dependency can also be tested with the correlation operator. " #
However, Fukumizu et al., (2007) show that unlike the covariance 10 K̃110 L̃1 2
operator, the asymptotic distribution of the norm of the correlation Tr(K̃L̃) + − 10 K̃L̃1
operator is unknown, so the construction of a computationally ef-
(m − 1)(m − 2) m − 2
ficient test of relative dependence remains an open problem.
where K̃ and L̃ are related to K and L by K̃ij = (1 −
δij )K̃ij and L̃ij = (1 − δij )L̃ij .
Theorem 2. (Song et al., 2012, Theorem 3: U-statistic of
XY
HSIC) This finite sample unbiased estimator of HSICm
XY XZ
can be written as a U-statistic, dij = d(zi , zj ). Let HSICm and HSICm be respec-
X tively the unbiased estimators of HSIC(F, G, Pxy ) and
XY
HSICm = (m)−1
4 hijqr (3) HSIC(F, H, Pxz ), written as a sum of U-statistics with
(i,j,q,r)∈im
4 respective kernels hijqr and gijqr as described in (4),
(i,j,q,r)
m! 1
, the index set im
X
where (m)4 := 4 denotes the hijqr = kst (lst + luv − 2lsu ),
(m − 4)! 24
set of all 4−tuples drawn without replacement from the set (s,t,u,v)
{1, . . . m}, and the kernel h of the U-statistic is defined as (i,j,q,r)

1 X
gijqr = kst (dst + duv − 2dsu ). (6)
(i,j,q,r) 24
1 X (s,t,u,v)
hijqr = kst (lst + luv − 2lsu ) (4)
24 Theorem 4. (Joint asymptotic distribution of HSIC) If
(s,t,u,v)
E[h2 ] < ∞ and E[g 2 ] < ∞, then
where the kernels k and l are associated uniquely with re- √

XY

HSICm HSIC(F, G, Pxy )
spective reproducing kernel Hilbert spaces F and G. m XZ −
HSICm HSIC(F, H, Pxz )
Theorem 3. (Gretton et al., 2008, Theorem 1: Asymptotic
2

d 0 , σXY σXY XZ
distribution of HSICm ) If E[h2 ] < ∞, and source and −→ N 0 σXY XZ 2
σXZ
, (7)
targets are not independent, then, under H1 , as m → ∞,
2 2
√ where σXY and σXZ are as in Theorem 3. The
XY d 2
m(HSICm − HSIC(F, G, Pxy )) −→ N (0, σXY
) empirical estimate of σXY XZ is σ̂XY XZ =
(5) 16 R XY XZ

XY XZ − HSICm HSICm , where
2
where σXY
2
= 16 Ei (Ej,q,r hijqr ) − HSIC(F, G, Pxy )) m
 
with Ej,q,r := ESj ,Sq ,Sr . Its empirical esti- 1 Xm X
(m − 1)−2
XY 2

mate is σ̂XY = 16 RXY − (HSICm ) where RXY XZ = 3 hijqr gijqr  .
2 m m
i=1 (j,q,r)∈i3 \{i}

m
1 X −1
X (8)
RXY = (m − 1)3 hijqr  and
m
i=1 (j,q,r)∈im
3 \{i}
m
the index set i3 \ {i} denotes the set of all 3−tuples drawn Proof. Eq. (8) is constructed with the definition of vari-
without replacement from the set {1, . . . m} \ {i}. ance of a U-statistic as given by Serfling, Ch. 5 (1981),
where one variable is fixed. Eq. (7) follows from the appli-
cation of Hoeffding, Theorem 7.1 (1963), which gives the
3. A test of relative dependence joint asymptotic distribution of U-statistics.
In this section we calculate two dependent HSIC statistics
and derive the joint asymptotic distribution of these depen- Based on the joint asymptotic distribution of HSIC de-
dent quantities, which is used to construct a consistent test scribed in Theorem 4, we can now describe a statistical
for Problem 1. We next construct a simpler consistent test, test to solve Problem 1: given a sample S1 as described
by computing two independent HSIC statistics on sample in Section 3.1, T (S1 ) : {(X × Y × Z)m } → {0, 1} is
subsets. While the simpler strategy is superficially attrac- used to test the null hypothesis H0 : HSIC(F, G, Pxy ) ≤
tive and less effort to implement, we prove the dependent HSIC(F, H, Pxz ) versus the alternative hypothesis H1 :
strategy is strictly more powerful. HSIC(F, G, Pxy ) > HSIC(F, H, Pxz ) at a given sig-
nificance level α. This is achieved by projecting the distri-
XY XZ
3.1. Joint asymptotic distribution of HSIC and test bution to 1D using the statistic HSICm − HSICm ,
and determining where the statistic falls relative to a
In the present section, we compute each HSIC estimate conservative estimate of the the 1 − α quantile of the
on the full dataset, and explicitly obtain the correlations null. We now derive this conservative estimate. A sim-
between the resulting empirical dependence measurements ple way of achieving this is to rotate the distribution by
XY XZ π
HSICm and HSICm . We denote by S1 = (X, Y, Z) 4 counter-clockwise about the origin, and to integrate
the joint sample of observations which are drawn i.i.d. with the resulting distribution projected onto the first axis (cf.
respective Borel probability measure Pxyz defined on the Fig.
√ 3). Denote XY
the asymptotically normal distribution of
XZ T
domain X × Y × Z. The kernels k, l and d are associated m[HSICm HSICm ] as N (µ, Σ). The distribution
uniquely with respective reproducing kernel Hilbert spaces resulting from rotation and projection is
F, G and H. Moreover, K, L and D ∈ Rm×m are kernel
N [Qµ]1 , [QΣQT ]11 ,

matrices containing kij = k(xi , xj ), lij = l(yi , yj ) and (9)
√
2 1 −1 Then for m > 1 and all δ > 0 with probability at least
where Q = is the rotation matrix by π4 and
2 1 1 1 − δ, for all pxyz , the generalization bound on the differ-
√ ence of empirical HSIC statistics is
2
[Qµ]1 = (HSIC(F, G, Pxy ) − HSIC(F, H, Pxz )) ,
2 | {HSIC(F, G, Pxy ) − HSIC(F, H, Pxz )}
(10) XY XZ

− HSICm − HSICm |
1
[QΣQT ]11 = (σXY2 2
+ σXZ − 2σXY XZ ). (11)
(r )
2 log(6/δ) C
≤2 + (14)
α2 m m
Following the empirical distribution from Eq. (9), a test
XY XZ
with statistic HSICm − HSICm has p-value
where α > 0.24 and C are constants.
!
XY XZ
(HSICm − HSICm )
p≤1−Φ p 2 2
, (12) Proof. In Gretton et al., (2005) a finite sample bound is
σXY + σXZ − 2σXY XZ given for a single HSIC statistic. Eq. (14) is proved by
using a union bound.
where Φ is the CDF of a standard normal distribution, and
we have made the most conservative possible assumption Corollary 1. HSICm XY
−HSIC XZ
m converges to the pop-
√
that HSIC(F, G, Pxy ) − HSIC(F, H, Pxz ) = 0 under ulation statistic at rate O( m).
the null (the null also allows for the difference in population
dependence measures to be negative). 3.2. A simple consistent test via uncorrelated HSICs
To implement the test in practice, the variances of From the result in Eq. (5), a simple, consistent test of
2 2 2
σXY , σXZ and σXY XZ may be replaced by their empir- relative dependence can be constructed as follows: split
ical estimates. The test will still be consistent for a large the samples from Px into two equal sized sets denoted
enough sample size, since the estimates will be sufficiently by X 0 and X 00 , and drop the second half of the sam-
well converged to ensure the test is calibrated. Eq. (8) is ple pairs with Y and the first half of the sample pairs
expensive to compute naı̈vely, because even computing the with Z. We will denote the remaining samples as Y 0
kernels hijqr and gijqr of the U -statistic itself is a non and Z 00 . We can now estimate the joint distribution of
trivial task. Following (Song et al., 2012, Section 2.5), √ X0Y 0 X 00 Z 00 T
m[HSICm/2 , HSICm/2 ] as
P first form a vector hXY with entries corresponding to
we
hijqr , and a vector hXZ with entries cor-
(j,q,r)∈im
3 \{i}P HSIC(F, G, Pxy ) σ2 0
responding to (j,q,r)∈im \{i} gijqr . Collecting terms in N , X0Y 0 2 , (15)
3
HSIC(F, H, Pxz ) 0 σX 00 Z 00
Eq. (4) related to kernel matrices K̃ and L̃, hXY can be

written as which we will write as N (µ0 , Σ0 ). Given this joint
distribution, we need to determine the distribution over
hXY = (m − 2)2 K̃ L̃ 1 − m(K̃1) (L̃1) (13) the half space defined by HSIC(F, G, Pxy ) <
HSIC(F, H, Pxz ). As in the previous section, we achieve
+ (m − 2) (Tr(K̃L̃))1 − K̃(L̃1) − L̃(K̃1) this by rotating the distribution by π4 counter-clockwise
about the origin, and integrating the resulting distribution
+ (1T L̃1)K̃1 + (1T K̃1)L̃1 − ((1T K̃)(L̃1))1
projected onto the first axis (cf. Fig. 3). The resulting pro-
where denotes the Hadamard product. Then RXY XZ jection of the rotated distribution onto the primary axis is
in Eq. (8) can be computed as RXY XZ = (4m)−1 (m −
N [Qµ0 ]1 , QΣ0 QT 11

(16)
1)−2 T
3 hXY hXZ . Using the order of operations implied by
the parentheses in Eq. (13), the computational cost of the where
cross covariance term is O(m2 ). Combining this with the √
unbiased estimator of HSIC in Eq. (2) leads to a final com- 0 2
putational complexity of O(m2 ). [Qµ ]1 = (HSIC(F, G, Pxy ) − HSIC(F, H, Pxz )) ,
2
(17)
In addition to the asymptotic consistency result, we provide
a finite sample bound on the deviation between the differ- 1
[QΣ0 QT ]11 = (σX2 2
0 Y 0 + σX 00 Z 00 ). (18)
ence of two population HSIC statistics and the difference 2
of two empirical HSIC estimates. From this empirically estimated distribution, it is straight-
Theorem 5 (Generalization bound on the difference of forward to construct a consistent test (cf. Eq. (12)). The
empirical HSIC statistics). Assume that k, l, and d are power of this test varies inversely with the variance of the
bounded almost everywhere by 1, and are non-negative. distribution in Eq. (16).
3.3. The dependent test is more powerful where we again make the most conservative possible as-
sumption that HSIC(F, G, Pxy ) − HSIC(F, H, Pxz ) =
While discarding half the samples leads to a consistent test,
0 under the null. The Type II error probability of the de-
we might expect some loss of power over the approach in
pendent test at level α is
Section 3.1, due to the increase in variance with lower sam-
ple size. In this section, we prove the Section 3.1 test is 
m−1/2 HSIC(F, G, Pxy )

more powerful than that of Section 3.2, regardless of Pxy  −1 −HSIC(F, H, Pxz ) 
and Pxz . Φ (1 − α) − pσ 2 + σ 2 − 2σ
Φ (21)


XY XZ XY XZ
We call the simple and consistent approach in Section 3.2,
the independent approach, and the lower variance approach
in Section 3.1, the dependent approach. The following the- where Φ is the CDF of the standard normal distribution.
orem compares these approaches. The numerator in Eq. (20) is the same as the numerator in
Eq. (21), and the denominator in Eq. (21) is smaller due to
Theorem 6. The asymptotic relative efficiency (ARE) of the Lemma 1. The lower variance dependent test therefore has
independent approach relative to the dependent approach higher ARE, i.e., for a sufficient sample size m > τ for
is always greater than 1. some distribution dependent τ ∈ N+ , the dependent test
Remark 1. The asymptotic relative efficiency (ARE) is de- will be more powerful than the independent test.
fined in e.g. Serfling (1981, Chap.5, Section 1.15.4). If mA
and mB are the sample sizes at which tests ”perform equiv- 4. Generalizing to more than two HSIC
mA
alently” (i.e. have equal power), then the ratio m repre-
B statistics
sents the relative efficiency. When mA and mB tend to +∞
mA
and the ratio m B
→ L (at equivalent performance), then The generalization of the dependence test to more
the value L represents the asymptotic relative efficiency of than three random variables follows from the earlier
procedure B relative to procedure A. This example is rele- derivation by applying successive rotations to a higher
vant to our case since we are comparing two test statistics dimensional joint Gaussian distribution over multiple
with different asymptotically Normal distributions. HSIC statistics. We assume a sample S of size m
over n domains with kernels k1 , . . . , kn associated
The following lemma is used for the proof of Theorem 6. uniquely with respective reproducing kernel Hilbert
Lemma 1. (Lower Variance) The variance of the depen- spaces F1 , . . . , Fn . We define a generalized statisti-
dent test statistic is smaller than the variance of the inde- cal test, P Tg (S) → {0, 1} to test the null hypothesis
pendent test statistic. H0 : (x,y)∈{1,...,n}2 v(x,y) HSIC(Fx , Fy , Pxy ) ≤
0P versus the alternative hypothesis Hm :
(x,y)∈{1,...,n} 2 v (x,y) HSIC(F x , F y , Pxy ) > 0, where
Proof. From the convergence of moments in the applica-
v is a vector of weights on each HSIC statistic. We
tion of the central limit theorem (von Bahr, 1965), we
2 2 may recover the test in the previous section by set-
have that σX 0 Y 0 = 2σXY . Then the variance summary
ting v(1,2) = +1 v(1,3) = −1 and v(i,j) = 0 for all
in Eq. (11) is 12 (σXY
2
+ σXZ2
− 2σXY XZ ) and the variance
(i, j) ∈ {1, 2, 3}2 \ {(1, 2), (1, 3)}.
summary in Equation (18) is 21 (2σXY 2 2
√ + 2σXZ ) where in
both cases the statistic is scaled by m. We have that the The derivation of the test follows the general strategy used
variance of the independent test statistic is smaller than the in the previous section: we construct a rotation matrix so
variance of the dependent test statistic when as to project the joint Gaussian distribution onto the first
axis, and read the p-value from a standard normal table. To
1 2 2 1 2 2 construct the rotation matrix, we simply need to rotate v
(σXY + σXZ − 2σXY XZ ) < (2σXY + 2σXZ )
2 2 such that it is aligned with the first axis. Such a rotation
2 2
⇐⇒ −2σXY XZ < σXY + σXZ (19) can be computed by composing n 2-dimensional rotation
matrices as in Algorithm 1.
which is implied by the positive definiteness of Σ.
5. Experiments
Proof of Theorem 6. The Type II error probability of the
independent test at level α is We apply our estimates of statistical dependence to three
challenging problems. The first is a synthetic data experi-
m−1/2 HSIC(F, G, Pxy )
 
ment, in which we can directly control the relative degree
 −1 −HSIC(F, H, Pxz )  of functional dependence between variates. The second ex-
Φ (1 − α) −
Φ  , (20)
p
σ2 + σ2  periment uses a multilingual corpus to determine the rela-
X0Y 0 X 00 Z 00
tive relations between European languages. The last exper-
Power of the tests for m=500 and 100 repeated tests

Algorithm 1 Successive rotation for generalized high-
dimensional relative tests of dependency (cf. Section 4) 1
Require: v ∈ Rn
Ensure: [Qv]i = 0 ∀i 6= 1, QT Q = I 0.8
Power of the tests

Q=I 0.6
for i = 2 to n do
vi
Qi = I; θ = − tan−1 [Qv] 1
0.4
[Qi ]11 = cos(θ); [Qi ]1i = − sin(θ)

[Qi ]i1 = sin(θ); [Qi ]ii = cos(θ) 0.2
dependent tests
Q = Qi Q independent tests
0
end for 0.5 1 1.5
γ3
2 2.5 3 3.5
2 10 10
t cos(t) + γ3 N (0, 1)
t sin(t) + γ2 N (0, 1)
1.5
Figure 2. Power of the dependent and independent test as a func-

sin(t) + γ1 N (0, 1)
5 5
1
0.5
0
0 0
tion of γ3 on the synthetic data described in Section 5.1. For val-
−0.5
−5 −5
ues of γ3 > 0.3 the distribution in Fig. 1(a) is closer to 1(b) than
to 1(c). The problem becomes difficult as γ3 → 0.3. As predicted
−1
−10 −10
−1.5
−2
−1 0 1 2 3 4 5 6 7
−15
−10 −5 0 5 10 15
−15
−15 −10 −5 0 5 10 15
by theory, the dependent test is significantly more powerful over
t + γ1 N (0, 1) t cos(t) + γ2 N (0, 1) t cos(t) + γ3 N (0, 1)
(a) γ1 = 0.3 (b) γ2 = 0.3 (c) γ3 = 0.6 almost all values of γ3 by a substantial margin.
Figure 1. Illustration of a synthetic dataset sampled from the dis-

tribution in Eq. (22).
5.2. Multilingual data
In this section, we demonstrate dependence testing to pre-
iment is a 3-block dataset which combines gene expression, dict the relative similarity of different languages. We use
comparative genomic hybridization, and a qualitative phe- a real world dataset taken from the parallel European Par-
notype measured on a sample of Glioma patients. liament corpus (Koehn, 2005). We choose 3000 random
documents in common written in: Finnish (fi), Italian (it),
5.1. Synthetic experiment French (fr), Spanish (es), Portuguese (pt), English (en),
We constructed 3 distributions as defined in Eq. (22) and Dutch (nl), German (de), Danish (da) and Swedish (sv).
illustrated in Figure 1. These languages can be broadly categorized into either the
Romance, Germanic or Uralic groups (Gray & Atkinson,
Let t ∼ U[(0, 2π)], (22) 2003). In this dataset, we considered each language as a
(a) x1 ∼ t + γ1 N (0, 1) y1 ∼ sin(t) + γ1 N (0, 1) random variable and each document as an observation.
(b) x2 ∼ t cos(t) + γ2 N (0, 1) y2 ∼ t sin(t) + γ2 N (0, 1) Our first goal is to test if the statistical dependence between
(c) x3 ∼ t cos(t) + γ3 N (0, 1) y3 ∼ t sin(t) + γ3 N (0, 1) two languages in the same group is greater than the sta-
tistical dependence between languages in different groups.
For pre-processing, we removed stop-words (http://
These distributions are specified so that we can control the
www.nltk.org) and performed stemming (http://
relative degree of functional dependence between the vari-
snowball.tartarus.org). We applied the TF-IDF
ates by varying the relative size of noise scaling parameters
model as a feature representation and used a Gaussian ker-
γ1 , γ2 and γ3 . The question is then whether the dependence
nel with the bandwidth σ set per language as the median
between (a) and (b) is larger than the dependence between
pairwise distance between documents.
(a) and (c). In these experiments, we fixed γ1 = γ2 = 0.3,
while we varied γ3 , and used a Gaussian kernel with band- In Table 1, a selection of tests between language groups
width σ selected as the median pairwise distance between (Germanic, Romance, and Uralic) is given: all p-values
data points. This kernel is sufficient to obtain good perfor- strongly support that our relative dependence test finds the
mance, although others choices exist (Gretton et al., 2012). different language groups with very high significance.
Figure 2 shows the power of the dependent and the inde- Further, if we focus on the Romance family, our test en-
pendent tests as we vary γ3 . It is clear from these results ables one to answer more fine-grained questions about the
that the dependent test is far more powerful than the inde- relative similarity of languages within the same group. As
pendent test over the great majority of γ3 values consid- before, we determine the ground truth similarities from the
ered. Figure 3 demonstrates that this superior test power topology of the tree of European languages determined by
arises due to the tighter and more concentrated distribution the linguistics community (Gray & Atkinson, 2003; Bouck-
of the dependent statistic. aert et al., 2012) as illustrated in Fig. 4 for the Romance
0.035 0.035 0.035

independent tests independent tests independent tests
dependent tests dependent tests dependent tests
0.03 0.03 0.03
0.025 0.025 0.025
HSICXZ
HSICXZ
HSICXZ
0.02 0.02 0.02
0.015 0.015 0.015
0.01 0.01 0.01

0.01 0.015 0.02 0.025 0.03 0.035 0.01 0.015 0.02 0.025 0.03 0.035 0.01 0.015 0.02 0.025 0.03 0.035
HSICXY HSICXY HSICXY
(a) m=500, γ3 = 0.7 (b) m=1000, γ3 = 0.7 (c) m=3000, γ3 = 0.7

pdep = 0.0189, pindep = 0.3492 pdep = 10−4 , pindep = 0.3690 pdep = 10−6 , pindep = 0.2876
0.035 0.035 0.035
independent tests independent tests independent tests
dependent tests dependent tests dependent tests
0.03 0.03 0.03
0.025 0.025 0.025

HSICXZ
HSICXZ
HSICXZ
0.02 0.02 0.02
0.015 0.015 0.015
0.01 0.01 0.01

0.01 0.015 0.02 0.025 0.03 0.035 0.01 0.015 0.02 0.025 0.03 0.035 0.01 0.015 0.02 0.025 0.03 0.035
HSICXY HSICXY HSICXY
(d) m=500, γ3 = 1.7 (e) m=1000, γ3 = 1.7 (f) m=3000, γ3 = 1.7

pdep = 10−9 , pindep = 0.982 pdep = 10−10 , pindep = 0.0326 pdep = 10−13 , pindep = 0.005
Figure 3. For the synthetic experiments described in Section 5.1, we plot empirical HSIC values for dependent and independent tests for
100 repeated draws with different sample sizes. Empirical p-values for each test show that the dependent distribution converges faster
than the independent distribution even at low sample size, resulting in a more powerful statistical test.
Source Target 1 Target 2 p-value

es pt fi 0.0066
fr it da 0.0418
it es fi 0.0169
pt es da 0.0173
de nl fi < 10−4
nl en es < 10−4
da sv fr < 10−6
sv en it < 10−4
en de es < 10−4
Table 1. A selection of relative dependency tests between two
pairs of HSIC statistics for the multilingual corpus data. Low p- Figure 4. Partial tree of Romance languages adapted from (Gray
values indicate a source is closer to target 1 than to target 2. In all & Atkinson, 2003).
cases, the test correctly identifies that languages within the same
group are more strongly related than those in different groups. Source Target 1 Target 2 p-value
fr es it 0.0157
fr pt it 0.1882
es fr it 0.2147
group. We have run the test on all triplets from the cor- es pt it < 10−4
pus for which the topology of the tree specifies a correct es pt fr < 10−4
ordering of the dependencies. In a fraction of a second (ex- pt fr it 0.7649
pt es it 0.0011
cluding kernel computation), we are able to recover certain
pt es fr < 10−8
features of the subtree of relationships between languages
present in the Romance language group (Table 2). The test Table 2. Relative dependency tests between Romance languages.
always indicates the correct relative similarity of languages The tests are ordered such that a low p-value corresponds with a
when nearby languages (pt,es) are compared with those fur- confirmation of the topology of the tree of Romance languages de-
ther away (ft,it), however errors are made when comparing termined by the linguistics community (Gray & Atkinson, 2003).
triplets of languages for which the nearest common ances-
tor is more than one link removed.
×10−3
In our next tests, we evaluate our more general framework 6
dependent test
5
for testing relative dependencies with more than two HSIC independent test
4
statistics. We chose four languages, and tested whether the
3
average dependence between languages in the same group
HSIC XZ
2
is higher than the dependence between groups. The results 1
of these tests are in Table 3. As before, our test is able to 0
distinguish language groups with high significance. −1
−2
0 1 2 3 4 5 6 ×10−3
Source Targets p-value HSIC XY
da de sv fi < 10−9
da sv en fr < 10−9 Figure 5. 2σ iso-curves of the Gaussian distributions estimated
de sv en it < 10−5 from the pediatric Glioma data. As before, the dependent test
fr it es sv < 10−5 has a much lower variance than the independent test. The tests
es fr pt nl 0.0175 support the stronger dependence on the tumor location to gene
expression than chromosomal imbalances.
Table 3. Relative dependency test between four pairs of HSIC
statistics for the multilingual corpus data. These tests show the
ability of the relative dependence test to generalize to arbitrary
numbers of HSIC statistics by constructing a rotation matrix using of dependence, but with a p-value that is three orders of
ing Algorithm 1. In all cases v = [1 1 −2]. magnitude larger (p = 0.005). Figure 5 shows iso-curves
of the Gaussian distributions estimated in the independent
and dependent tests. The empirical relative dependency is
5.3. Pediatric glioma data consistent with findings in the medical literature, and pro-
vides additional statistical support for the importance of
Brain tumors are the most common solid tumors in children tumor location in Glioma (Gilbertson & Gutmann, 2007;
and have the highest mortality rate of all pediatric cancers. Palm et al., 2009; Puget et al., 2012).
Despite advances in multimodality therapy, children with
pediatric high-grade gliomas (pHGG) invariably have an
overall survival of around 20% at 5 years. Depending on 6. Conclusions
their location (e.g. brainstem, central nuclei, or supraten- We have described a novel statistical test that determines
torial), pHGG present different characteristics in terms of whether a source random variable is more strongly de-
radiological appearance, histology, and prognosis. The hy- pendent on one target random variable or another. This
pothesis is that pHGG have different genetic origins and test, built on the Hilbert-Schmidt Independence Criterion,
oncogenic pathways depending on their location. Thus, the is low variance, consistent, and unbiased. We have shown
biological processes involved in the development of the tu- that our test is strictly more powerful than a test that does
mor may be different from one location to another. not exploit the covariance between HSIC statistics, and
In order to evaluate such hypotheses, pre-treatment frozen empirically achieves p-values several orders of magnitude
tumor samples were obtained from 53 children with newly smaller. We have empirically demonstrated the test perfor-
diagnosed pHGG from Necker Enfants Malades (Paris, mance on synthetic data, where the degree of dependence
France) from Puget et al, (2012). The 53 tumors are di- could be controlled; on the challenging problem of iden-
vided into 3 locations: supratentorial (HEMI), central nu- tifying language groups from a multilingual corpus; and
clei (MIDL), and brain stem (DIPG). The final dataset is or- for finding the most important determinant of Glioma type.
ganized in 3 blocks of variables defined for the 53 tumors: The computation and memory requirements of the test are
X is a block of indicator variables describing the location quadratic in the sample size, matching the performance of
category, the second data matrix Y provides the expression HSIC and related tests for dependence between two ran-
of 15 702 genes (GE). The third data matrix Z contains the dom variables. The test is therefore scalable to the wide
imbalances of 1229 segments (CGH) of chromosomes. range of problem instances where non-parametric depen-
dency tests are currently applied. We have generalized the
For X, we use a linear kernel, which is characteristic for test framework to more than two HSIC statistics, and have
indicator variables, and for Y and Z, the kernel was cho- given an algorithm to construct a consistent, low-variance,
sen to be the Gaussian kernel with σ selected as the median unbiased test in this setting.
of pairwise distances. The p-value of our relative depen-
dency test is < 10−5 . This shows that the tumor location
Acknowledgements
in the brain is more dependent on gene expression than on
chromosomal imbalances. By contrast with Section 5.1, We thank Ioannis Antonoglou for helpful discussions. The
the independent test was also able to find the same order- first author is supported by a fellowship from Centrale-
Supélec. This work is partially funded by the Euro- Gretton, A. and Gyorfi, L. Consistent nonparametric tests
pean Commission through ERC Grant 259112 and FP7- of independence. Journal of Machine Learning Re-
MCCIG334380. search, 11:1391–1423, 2010.
Gretton, A., Bousquet, O., Smola, A. J., and Schölkopf, B.

References Measuring statistical dependence with Hilbert-Schmidt
Arcones, M. A. and Gine, E. Limit theorems for U- norms. In Algorithmic Learning Theory, pp. 63–77,
processes. The Annals of Probability, pp. 1494–1542, 2005.
1993.
Gretton, A., Fukumizu, K., Teo, C.-H., Song, L.,
Bouckaert, R., Lemey, P., Dunn, M., Greenhill, S. J., Alek- Schölkopf, B., and Smola, A. J. A kernel statistical test
seyenko, A. V., Drummond, A. J., Gray, R. D., Suchard, of independence. In Neural Information Processing Sys-
M. A., and Atkinson, Q. D. Mapping the origins and ex- tems, pp. 585–592, 2008.
pansion of the Indo-European language family. Science,
337(6097):957–960, 2012. Gretton, A., Sejdinovic, D., Strathmann, H.and Balakrish-
nan, S., Pontil, M., Fukumizu, K., and Sriperumbudur,
Bring, J. A geometric approach to compare variables in B. K. Optimal kernel choice for large-scale two-sample
a regression model. The American Statistician, 50(1): tests. In Advances in Neural Information Processing Sys-
57–62, 1996. tems, pp. 1205–1213, 2012.
Cortes, C., Mohri, M., and Rostamizadeh, A. Learning Gunn, S. R. and Kandola, J. S. Structural modelling with
non-linear combinations of kernels. In Neural Informa- sparse kernels. Machine Learning, 48(1):137–163, 2002.
tion Processing Systems, 2009.
Heller, R., Heller, Y., and Gorfine, M. A consistent mul-
Cortes, C., Mohri, M., and Rostamizadeh, A. Algorithms tivariate test of association based on ranks of distances.
for learning kernels based on centered alignment. Jour- Biometrika, 100(2):503–510, 2013.
nal of Machine Learning Research, 13:795–828, 2012.
Hoeffding, W. Probability inequalities for sums of bounded
Darlington, Richard B. Multiple regression in psychologi- random variables. Journal of the American statistical
cal research and practice. Psychological bulletin, 69(3): association, 58(301):13–30, 1963.
161, 1968.
Kinney, J. B. and Atwal, G. S. Equitability, mutual infor-
Dauxois, J. and Nkiet, G. M. Nonlinear canonical analy- mation, and the maximal information coefficient. Pro-
sis and independence tests. Annals of Statistics, 26(4): ceedings of the National Academy of Sciences, 2014.
1254–1278, 1998.
Koehn, P. Europarl: A parallel corpus for statistical ma-
Fukumizu, K., Bach, F. R., and Gretton, A. Statisti- chine translation. In MT summit, volume 5, pp. 79–86,
cal consistency of kernel canonical correlation analysis. 2005.
The Journal of Machine Learning Research, 8:361–383,
2007. Palm, T., Figarella-Branger, D., Chapon, F., Lacroix, C.,
Gray, F., Scaravilli, F., Ellison, D. W., Salmon, I.,
Fukumizu, K., Gretton, A., Sun, X., and Schölkopf, B. Ker- Vikkula, M., and Godfraind, C. Expression profiling
nel measures of conditional dependence. In Advances of ependymomas unravels localization and tumor grade-
in Neural Information Processing Systems, pp. 489–496. specific tumorigenesis. Cancer, 115(17):3955–3968,
MIT Press, 2008. 2009.
Gilbertson, R. J. and Gutmann, D. H. Tumorigenesis in the Peters, C., Braschler, M., and Clough, P. Multilin-
brain: location, location, location. Cancer research, 67 gual Information Retrieval: From Research to Practice.
(12):5579–5582, 2007. Springer, 2012.
Gray, R. D. and Atkinson, Q. D. Language-tree divergence Puget, S., Philippe, C., Bax, D., Job, B., Varlet, P., Ju-
times support the Anatolian theory of Indo-European nier, M. P., Andreiuolo, F., Carvalho, D., Reis, R.,
origin. Nature, 426(6965):435–439, 2003. and Guerrini-Rousseau, L. Mesenchymal transition
and PDGFRA amplification/mutation are key distinct
Gretton, A. A simpler condition for consistency of a kernel oncogenic events in pediatric diffuse intrinsic pontine
independence test. arXiv:1501.06103, 2015. gliomas. PloS one, 7(2):e30313, 2012.
Reshef, D., Reshef, Y., Finucane, H., Grossman, S.,

McVean, G., Turnbaugh, P., Lander, E., Mitzenmacher,
M., and Sabeti, P. Detecting novel associations in large
datasets. Science, 334(6062), 2011.
Sejdinovic, D., Gretton, A., and Bergsma, W. A kernel
test for three-variable interactions. In Neural Informa-
tion Processing Systems, 2013a.
Sejdinovic, D., Sriperumbudur, B., Gretton, A., and Fuku-
mizu, K. Equivalence of distance-based and RKHS-
based statistics in hypothesis testing. Annals of Statis-
tics, 41(5):2263–2702, 2013b.
Sen, A. and Srivastava, M. Regression Analysis – Theory,

Methods, and Applications. Springer-Verlag, 2011.
Serfling, R. J. Approximation theorems of mathematical
statistics. Wiley Series in Probability and Statistics. Wi-
ley, 1981.
Song, L., Smola, A., Gretton, A., Bedo, J., and Borg-
wardt, K. Feature selection via dependence maximiza-
tion. Journal of Machine Learning Research, 13:1393–
1434, 2012.
Székely, G., Rizzo, M., and Bakirov, N. Measuring and
testing dependence by correlation of distances. Annals
of Statistics, 35(6):2769–2794, 2007.
Trommershauser, J., Kording, K., and Landy, M. S. Sen-
sory Cue Integration. Oxford University Press, 2011.
von Bahr, Bengt. On the convergence of moments in

the central limit theorem. The Annals of Mathematical
Statistics, 36(3):808–818, 06 1965.
Zhang, K., Peters, J., Janzing, D., B., and Schölkopf, B.
Kernel-based conditional independence test and applica-
tion in causal discovery. In 27th Conference on Uncer-
tainty in Artificial Intelligence, pp. 804–813, 2011.

A Low Variance Consistent Test of Relative Dependency

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

A Low Variance Consistent Test of Relative Dependency

Uploaded by

Copyright:

Available Formats

A low variance consistent test of relative dependency

Wacha Bounliphone WACHA . BOUNLIPHONE @ CENTRALESUPELEC . FR

{1, . . . m}, and the kernel h of the U-statistic is defined as (i,j,q,r)

Eq. (4) related to kernel matrices K̃ and L̃, hXY can be

Power of the tests for m=500 and 100 repeated tests

Power of the tests

[Qi ]11 = cos(θ); [Qi ]1i = − sin(θ)

Figure 2. Power of the dependent and independent test as a func-

Figure 1. Illustration of a synthetic dataset sampled from the dis-

0.035 0.035 0.035

0.03 0.03 0.03

0.025 0.025 0.025

0.02 0.02 0.02

0.015 0.015 0.015

0.01 0.01 0.01

(a) m=500, γ3 = 0.7 (b) m=1000, γ3 = 0.7 (c) m=3000, γ3 = 0.7

0.03 0.03 0.03

0.025 0.025 0.025

0.015 0.015 0.015

0.01 0.01 0.01

(d) m=500, γ3 = 1.7 (e) m=1000, γ3 = 1.7 (f) m=3000, γ3 = 1.7

Source Target 1 Target 2 p-value

Gretton, A., Bousquet, O., Smola, A. J., and Schölkopf, B.

Reshef, D., Reshef, Y., Finucane, H., Grossman, S.,

Sen, A. and Srivastava, M. Regression Analysis – Theory,

von Bahr, Bengt. On the convergence of moments in

You might also like