
Unbiased estimators for the variance of MMD estimators

Danica J. Sutherland
June 2019
arXiv:1906.02104v2 [stat.ML] 14 Jan 2021

Abstract
The maximum mean discrepancy (MMD) is a kernel-based distance between probability distributions
useful in many applications (Gretton et al. 2012), bearing a simple estimator with pleasing computational
and statistical properties. Being able to efficiently estimate the variance of this estimator is very helpful
for various problems in two-sample testing. Towards this end, Bounliphone et al. (2016) used the theory
of U-statistics to derive estimators for the variance of an MMD estimator, and differences between two
such estimators. Their estimator, however, drops lower-order terms, and is unnecessarily biased. We
show in this note – extending and correcting work of Sutherland et al. (2017) – that we can find a truly
unbiased estimator for the actual variance of both the squared MMD estimator and the difference of two
correlated squared MMD estimators, at essentially no additional computational cost.
We give only minimal background in this note; see Bounliphone et al. (2016) and Sutherland et al. (2017)
for uses of these estimators.
Given a positive semidefinite kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ corresponding to an RKHS $\mathcal{H}$, there exists a
feature map $\varphi : \mathcal{X} \to \mathcal{H}$ such that $k(X, Y) = \langle \varphi(X), \varphi(Y) \rangle$. The mean embedding (Muandet et al. 2017)
of a distribution $P_X$ is $\mu_X := \mathbb{E}_{X \sim P_X}[\varphi(X)] \in \mathcal{H}$, and the MMD is merely the distance between mean
embeddings:
\[
\mathrm{MMD}^2(P_X, P_Y) = \|\mu_X - \mu_Y\|^2
= \mathbb{E}_{X, X' \sim P_X}[k(X, X')] + \mathbb{E}_{Y, Y' \sim P_Y}[k(Y, Y')] - 2\, \mathbb{E}_{X \sim P_X, Y \sim P_Y}[k(X, Y)].
\]
Suppose we have independent samples $\mathbf{X} := \{X_i\}_{i=1}^m \sim P_X$, $\mathbf{Y} := \{Y_i\}_{i=1}^m \sim P_Y$, $\mathbf{Z} := \{Z_i\}_{i=1}^m \sim P_Z$.
The following is an unbiased estimator of $\mathrm{MMD}^2(P_X, P_Y)$ with nearly minimal variance among unbiased
estimators (Gretton et al. 2012):
\[
\widehat{\mathrm{MMD}}^2_U(\mathbf{X}, \mathbf{Y}) := \frac{1}{m(m-1)} \sum_{i \neq j} \left[ k(X_i, X_j) + k(Y_i, Y_j) - k(X_i, Y_j) - k(X_j, Y_i) \right]. \tag{1}
\]

Compared to the minimum-variance unbiased estimator (MVUE), terms of the form $k(X_i, Y_i)$ are dropped. This estimator, however, is a $U$-statistic, for which there is well-established theory (Serfling 1980, Chapter 5), including expressions for the variance.
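
For concreteness, here is a minimal NumPy sketch of (1) (ours, not part of the original note); the Gaussian kernel and the `bandwidth` parameter are illustrative assumptions, since any positive semidefinite kernel works:

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    # k(a, b) = exp(-||a - b||^2 / (2 bandwidth^2)); an illustrative kernel choice.
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * bandwidth**2))

def mmd2_u(X, Y, bandwidth=1.0):
    """Unbiased MMD^2 estimator (1); X and Y are (m, d) arrays."""
    m = X.shape[0]
    K_XX = gaussian_kernel(X, X, bandwidth)
    K_YY = gaussian_kernel(Y, Y, bandwidth)
    K_XY = gaussian_kernel(X, Y, bandwidth)
    # Zero the diagonals: the sums in (1) run over i != j only, which
    # also drops the k(X_i, Y_i) terms relative to the MVUE.
    np.fill_diagonal(K_XX, 0.0)
    np.fill_diagonal(K_YY, 0.0)
    np.fill_diagonal(K_XY, 0.0)
    return (K_XX.sum() + K_YY.sum() - 2.0 * K_XY.sum()) / (m * (m - 1))
```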
We assume Bochner integrability throughout this note, to be able to exchange expectations with inner
products in $\mathcal{H}$. This technical condition holds automatically for continuous bounded kernels or for continuous
kernels on compact domains, but must be verified in other situations.
In this note, we first employ that theory to derive expressions for the variance in terms of various
expectations of inner products in $\mathcal{H}$ (Section 1). Then, in Section 2, we derive unbiased estimators for these
expressions, yielding the final results (4) and (5), which are unbiased variance estimators evaluable in the
same $O(m^2)$ time it takes to evaluate (1).

1 Variance expressions
We will first derive expressions for the variances of $\widehat{\mathrm{MMD}}^2_U(\mathbf{X}, \mathbf{Y})$ and $\widehat{\mathrm{MMD}}^2_U(\mathbf{X}, \mathbf{Y}) - \widehat{\mathrm{MMD}}^2_U(\mathbf{X}, \mathbf{Z})$. This
section is quite similar to Appendix A of Bounliphone et al. (2016), but avoids unnecessarily dropping lower-order
terms (which provides almost no computational advantage, and may harm the accuracy for small
sample sizes, although it does make for a less tedious derivation). The result (2) of Section 1.1 is identical
to that of Sutherland et al. (2017).

1.1 Variance of the MMD estimator


Let $U_i$ denote the pair $(X_i, Y_i)$, and define the function
\[
h(U_1, U_2) := k(X_1, X_2) + k(Y_1, Y_2) - k(X_1, Y_2) - k(X_2, Y_1).
\]
Then
\[
\widehat{\mathrm{MMD}}^2_U(\mathbf{X}, \mathbf{Y}) = \frac{1}{m(m-1)} \sum_{i \neq j} h(U_i, U_j),
\]
which is a $U$-statistic on the joint data $\mathbf{U}$. Thus standard results (Serfling 1980, Section 5.2.1, Lemma A)
give us
\[
\mathrm{Var}\left[ \widehat{\mathrm{MMD}}^2_U(\mathbf{X}, \mathbf{Y}) \right] = V_m := \frac{4(m-2)}{m(m-1)} \zeta_1 + \frac{2}{m(m-1)} \zeta_2,
\]
where
\[
\zeta_1 := \mathrm{Var}_{U_1}\left[ \mathbb{E}_{U_2}[h(U_1, U_2)] \right], \qquad \zeta_2 := \mathrm{Var}_{U_1, U_2}[h(U_1, U_2)].
\]
The first-order term $\zeta_1$ is:
\begin{align*}
\zeta_1 &= \mathrm{Var}_{U_1}\left[ \mathbb{E}_{U_2}[h(U_1, U_2)] \right] \\
&= \mathrm{Var}_{X_1, Y_1}\left[ \mathbb{E}_{X_2} k(X_1, X_2) + \mathbb{E}_{Y_2} k(Y_1, Y_2) - \mathbb{E}_{Y_2} k(X_1, Y_2) - \mathbb{E}_{X_2} k(X_2, Y_1) \right] \\
&= \mathrm{Var}_{X, Y}\left[ \langle \varphi(X), \mu_X \rangle + \langle \varphi(Y), \mu_Y \rangle - \langle \varphi(X), \mu_Y \rangle - \langle \mu_X, \varphi(Y) \rangle \right] \\
&= \mathrm{Var}\left[ \langle \varphi(X), \mu_X \rangle \right] + \mathrm{Var}\left[ \langle \varphi(Y), \mu_Y \rangle \right] + \mathrm{Var}\left[ \langle \varphi(X), \mu_Y \rangle \right] + \mathrm{Var}\left[ \langle \mu_X, \varphi(Y) \rangle \right] \\
&\quad - 2\, \mathrm{Cov}\left( \langle \varphi(X), \mu_X \rangle, \langle \varphi(X), \mu_Y \rangle \right) - 2\, \mathrm{Cov}\left( \langle \varphi(Y), \mu_Y \rangle, \langle \mu_X, \varphi(Y) \rangle \right).
\end{align*}
Noting that
\begin{align*}
\mathrm{Var}\left[ \langle \varphi(A), \mu_B \rangle \right] &= \mathbb{E}\left[ \langle \varphi(A), \mu_B \rangle^2 \right] - \langle \mu_A, \mu_B \rangle^2 \\
\mathrm{Cov}\left( \langle \varphi(A), \mu_B \rangle, \langle \varphi(A), \mu_C \rangle \right) &= \mathbb{E}\left[ \langle \varphi(A), \mu_B \rangle \langle \varphi(A), \mu_C \rangle \right] - \langle \mu_A, \mu_B \rangle \langle \mu_A, \mu_C \rangle
\end{align*}
we have
\begin{align*}
\zeta_1 &= \mathbb{E}\left[ \langle \varphi(X), \mu_X \rangle^2 \right] - \langle \mu_X, \mu_X \rangle^2 \\
&\quad + \mathbb{E}\left[ \langle \varphi(Y), \mu_Y \rangle^2 \right] - \langle \mu_Y, \mu_Y \rangle^2 \\
&\quad + \mathbb{E}\left[ \langle \varphi(X), \mu_Y \rangle^2 \right] - \langle \mu_X, \mu_Y \rangle^2 \\
&\quad + \mathbb{E}\left[ \langle \varphi(Y), \mu_X \rangle^2 \right] - \langle \mu_Y, \mu_X \rangle^2 \\
&\quad - 2\, \mathbb{E}\left[ \langle \varphi(X), \mu_X \rangle \langle \varphi(X), \mu_Y \rangle \right] + 2 \langle \mu_X, \mu_X \rangle \langle \mu_X, \mu_Y \rangle \\
&\quad - 2\, \mathbb{E}\left[ \langle \varphi(Y), \mu_Y \rangle \langle \varphi(Y), \mu_X \rangle \right] + 2 \langle \mu_Y, \mu_Y \rangle \langle \mu_X, \mu_Y \rangle.
\end{align*}

We can similarly compute the second-order term $\zeta_2$ as:
\begin{align*}
\zeta_2 &= \mathrm{Var}\left[ h(U_1, U_2) \right] \\
&= \mathrm{Var}\left[ k(X_1, X_2) + k(Y_1, Y_2) - k(X_1, Y_2) - k(X_2, Y_1) \right] \\
&= \mathrm{Var}[k(X_1, X_2)] + \mathrm{Var}[k(Y_1, Y_2)] + \mathrm{Var}[k(X_1, Y_2)] + \mathrm{Var}[k(X_2, Y_1)] \\
&\quad - 2\, \mathrm{Cov}(k(X_1, X_2), k(X_1, Y_2)) - 2\, \mathrm{Cov}(k(X_1, X_2), k(X_2, Y_1)) \\
&\quad - 2\, \mathrm{Cov}(k(Y_1, Y_2), k(X_1, Y_2)) - 2\, \mathrm{Cov}(k(Y_1, Y_2), k(X_2, Y_1)) \\
&= \mathrm{Var}[k(X_1, X_2)] + \mathrm{Var}[k(Y_1, Y_2)] + 2\, \mathrm{Var}[k(X, Y)] \\
&\quad - 4\, \mathrm{Cov}(k(X_1, X_2), k(X_1, Y)) - 4\, \mathrm{Cov}(k(Y_1, Y_2), k(Y_1, X)) \\
&= \mathbb{E}\left[ k(X_1, X_2)^2 \right] - \langle \mu_X, \mu_X \rangle^2 \\
&\quad + \mathbb{E}\left[ k(Y_1, Y_2)^2 \right] - \langle \mu_Y, \mu_Y \rangle^2 \\
&\quad + 2\, \mathbb{E}\left[ k(X, Y)^2 \right] - 2 \langle \mu_X, \mu_Y \rangle^2 \\
&\quad - 4\, \mathbb{E}\left[ \langle \varphi(X), \mu_X \rangle \langle \varphi(X), \mu_Y \rangle \right] + 4 \langle \mu_X, \mu_X \rangle \langle \mu_X, \mu_Y \rangle \\
&\quad - 4\, \mathbb{E}\left[ \langle \varphi(Y), \mu_Y \rangle \langle \varphi(Y), \mu_X \rangle \right] + 4 \langle \mu_Y, \mu_Y \rangle \langle \mu_X, \mu_Y \rangle.
\end{align*}
Combining the two yields an expression for $\mathrm{Var}\left[ \widehat{\mathrm{MMD}}^2_U(\mathbf{X}, \mathbf{Y}) \right]$ of
\begin{align*}
V_m &= \frac{2}{m(m-1)} \left[ 2(m-2)\zeta_1 + \zeta_2 \right] \\
&= \frac{2}{m(m-1)} \Bigg[ 2(m-2)\, \mathbb{E}\left[ \langle \varphi(X), \mu_X \rangle^2 \right] - 2(m-2) \langle \mu_X, \mu_X \rangle^2 \\
&\qquad + 2(m-2)\, \mathbb{E}\left[ \langle \varphi(Y), \mu_Y \rangle^2 \right] - 2(m-2) \langle \mu_Y, \mu_Y \rangle^2 \\
&\qquad + 2(m-2)\, \mathbb{E}\left[ \langle \varphi(X), \mu_Y \rangle^2 \right] - 2(m-2) \langle \mu_X, \mu_Y \rangle^2 \\
&\qquad + 2(m-2)\, \mathbb{E}\left[ \langle \varphi(Y), \mu_X \rangle^2 \right] - 2(m-2) \langle \mu_Y, \mu_X \rangle^2 \\
&\qquad - 4(m-2)\, \mathbb{E}\left[ \langle \varphi(X), \mu_X \rangle \langle \varphi(X), \mu_Y \rangle \right] + 4(m-2) \langle \mu_X, \mu_X \rangle \langle \mu_X, \mu_Y \rangle \\
&\qquad - 4(m-2)\, \mathbb{E}\left[ \langle \varphi(Y), \mu_Y \rangle \langle \varphi(Y), \mu_X \rangle \right] + 4(m-2) \langle \mu_Y, \mu_Y \rangle \langle \mu_X, \mu_Y \rangle \\
&\qquad + \mathbb{E}\left[ k(X_1, X_2)^2 \right] - \langle \mu_X, \mu_X \rangle^2 \\
&\qquad + \mathbb{E}\left[ k(Y_1, Y_2)^2 \right] - \langle \mu_Y, \mu_Y \rangle^2 \\
&\qquad + 2\, \mathbb{E}\left[ k(X, Y)^2 \right] - 2 \langle \mu_X, \mu_Y \rangle^2 \\
&\qquad - 4\, \mathbb{E}\left[ \langle \varphi(X), \mu_X \rangle \langle \varphi(X), \mu_Y \rangle \right] + 4 \langle \mu_X, \mu_X \rangle \langle \mu_X, \mu_Y \rangle \\
&\qquad - 4\, \mathbb{E}\left[ \langle \varphi(Y), \mu_Y \rangle \langle \varphi(Y), \mu_X \rangle \right] + 4 \langle \mu_Y, \mu_Y \rangle \langle \mu_X, \mu_Y \rangle \Bigg]
\end{align*}
and so
\begin{align*}
V_m &= \frac{2}{m(m-1)} \Bigg[ 2(m-2)\, \mathbb{E}\left[ \langle \varphi(X), \mu_X \rangle^2 \right] - (2m-3) \langle \mu_X, \mu_X \rangle^2 \\
&\qquad + 2(m-2)\, \mathbb{E}\left[ \langle \varphi(Y), \mu_Y \rangle^2 \right] - (2m-3) \langle \mu_Y, \mu_Y \rangle^2 \\
&\qquad + 2(m-2) \left( \mathbb{E}\left[ \langle \varphi(X), \mu_Y \rangle^2 \right] + \mathbb{E}\left[ \langle \varphi(Y), \mu_X \rangle^2 \right] \right) - (4m-6) \langle \mu_X, \mu_Y \rangle^2 \tag{2} \\
&\qquad - 4(m-1)\, \mathbb{E}\left[ \langle \varphi(X), \mu_X \rangle \langle \varphi(X), \mu_Y \rangle \right] + 4(m-1) \langle \mu_X, \mu_X \rangle \langle \mu_X, \mu_Y \rangle \\
&\qquad - 4(m-1)\, \mathbb{E}\left[ \langle \varphi(Y), \mu_Y \rangle \langle \varphi(Y), \mu_X \rangle \right] + 4(m-1) \langle \mu_Y, \mu_Y \rangle \langle \mu_X, \mu_Y \rangle \\
&\qquad + \mathbb{E}\left[ k(X_1, X_2)^2 \right] + \mathbb{E}\left[ k(Y_1, Y_2)^2 \right] + 2\, \mathbb{E}\left[ k(X, Y)^2 \right] \Bigg].
\end{align*}

We can see that, compared to $\zeta_1$, (2) mostly just tweaks constants. The only new terms are the expectations
of squared kernels which, as we'll see later, will in fact not introduce any new computational expense,
as their estimators will combine with terms needed to correct biases in the other terms.

1.2 Variance of the difference of MMD estimators


Let $W_i := (X_i, Y_i, Z_i)$, and
\[
f(W_1, W_2) := \left( k(Y_1, Y_2) - k(X_1, Y_2) - k(X_2, Y_1) \right) - \left( k(Z_1, Z_2) - k(X_1, Z_2) - k(X_2, Z_1) \right).
\]
Then
\[
\widehat{\mathrm{MMD}}^2_U(\mathbf{X}, \mathbf{Y}) - \widehat{\mathrm{MMD}}^2_U(\mathbf{X}, \mathbf{Z}) = \frac{1}{m(m-1)} \sum_{i \neq j} f(W_i, W_j)
\]
is a $U$-statistic on $\mathbf{W}$, so again
\[
\mathrm{Var}\left[ \widehat{\mathrm{MMD}}^2_U(\mathbf{X}, \mathbf{Y}) - \widehat{\mathrm{MMD}}^2_U(\mathbf{X}, \mathbf{Z}) \right] = \nu_m := \frac{4(m-2)}{m(m-1)} \xi_1 + \frac{2}{m(m-1)} \xi_2,
\]
\[
\xi_1 := \mathrm{Var}_{W_1}\left[ \mathbb{E}_{W_2}[f(W_1, W_2)] \right], \qquad \xi_2 := \mathrm{Var}_{W_1, W_2}[f(W_1, W_2)].
\]
We can proceed as before, but with more terms. The first-order term $\xi_1$ is
\begin{align*}
\xi_1 &= \mathrm{Var}_{W_1}\left[ \mathbb{E}_{W_2}[f(W_1, W_2)] \right] \\
&= \mathrm{Var}_{X,Y,Z}\left[ \langle \varphi(Y), \mu_Y \rangle - \langle \varphi(X), \mu_Y \rangle - \langle \mu_X, \varphi(Y) \rangle - \langle \varphi(Z), \mu_Z \rangle + \langle \varphi(X), \mu_Z \rangle + \langle \mu_X, \varphi(Z) \rangle \right] \\
&= \mathrm{Var}[\langle \varphi(X), \mu_Y \rangle] + \mathrm{Var}[\langle \varphi(X), \mu_Z \rangle] + \mathrm{Var}[\langle \varphi(Y), \mu_X \rangle] + \mathrm{Var}[\langle \varphi(Y), \mu_Y \rangle] + \mathrm{Var}[\langle \varphi(Z), \mu_X \rangle] + \mathrm{Var}[\langle \varphi(Z), \mu_Z \rangle] \\
&\quad - 2\, \mathrm{Cov}(\langle \varphi(X), \mu_Y \rangle, \langle \varphi(X), \mu_Z \rangle) - 2\, \mathrm{Cov}(\langle \varphi(Y), \mu_X \rangle, \langle \varphi(Y), \mu_Y \rangle) - 2\, \mathrm{Cov}(\langle \varphi(Z), \mu_X \rangle, \langle \varphi(Z), \mu_Z \rangle) \\
&= \mathbb{E}\left[ \langle \varphi(X), \mu_Y \rangle^2 \right] - \langle \mu_X, \mu_Y \rangle^2 \\
&\quad + \mathbb{E}\left[ \langle \varphi(X), \mu_Z \rangle^2 \right] - \langle \mu_X, \mu_Z \rangle^2 \\
&\quad + \mathbb{E}\left[ \langle \varphi(Y), \mu_X \rangle^2 \right] - \langle \mu_Y, \mu_X \rangle^2 \\
&\quad + \mathbb{E}\left[ \langle \varphi(Y), \mu_Y \rangle^2 \right] - \langle \mu_Y, \mu_Y \rangle^2 \\
&\quad + \mathbb{E}\left[ \langle \varphi(Z), \mu_X \rangle^2 \right] - \langle \mu_Z, \mu_X \rangle^2 \\
&\quad + \mathbb{E}\left[ \langle \varphi(Z), \mu_Z \rangle^2 \right] - \langle \mu_Z, \mu_Z \rangle^2 \\
&\quad - 2\, \mathbb{E}\left[ \langle \varphi(X), \mu_Y \rangle \langle \varphi(X), \mu_Z \rangle \right] + 2 \langle \mu_X, \mu_Y \rangle \langle \mu_X, \mu_Z \rangle \\
&\quad - 2\, \mathbb{E}\left[ \langle \varphi(Y), \mu_X \rangle \langle \varphi(Y), \mu_Y \rangle \right] + 2 \langle \mu_X, \mu_Y \rangle \langle \mu_Y, \mu_Y \rangle \\
&\quad - 2\, \mathbb{E}\left[ \langle \varphi(Z), \mu_X \rangle \langle \varphi(Z), \mu_Z \rangle \right] + 2 \langle \mu_X, \mu_Z \rangle \langle \mu_Z, \mu_Z \rangle.
\end{align*}

The second-order term $\xi_2$ is
\begin{align*}
\xi_2 &= \mathrm{Var}\left[ -k(X_1, Y_2) - k(X_2, Y_1) + k(X_1, Z_2) + k(X_2, Z_1) + k(Y_1, Y_2) - k(Z_1, Z_2) \right] \\
&= \mathrm{Var}[k(X_1, Y_2)] + \mathrm{Var}[k(X_2, Y_1)] + \mathrm{Var}[k(X_1, Z_2)] + \mathrm{Var}[k(X_2, Z_1)] + \mathrm{Var}[k(Y_1, Y_2)] + \mathrm{Var}[k(Z_1, Z_2)] \\
&\quad - 2\, \mathrm{Cov}(k(X_1, Y_2), k(X_1, Z_2)) - 2\, \mathrm{Cov}(k(X_1, Y_2), k(Y_1, Y_2)) \\
&\quad - 2\, \mathrm{Cov}(k(X_2, Y_1), k(X_2, Z_1)) - 2\, \mathrm{Cov}(k(X_2, Y_1), k(Y_1, Y_2)) \\
&\quad - 2\, \mathrm{Cov}(k(X_1, Z_2), k(Z_1, Z_2)) - 2\, \mathrm{Cov}(k(X_2, Z_1), k(Z_1, Z_2)) \\
&= 2\, \mathrm{Var}[k(X, Y)] + 2\, \mathrm{Var}[k(X, Z)] + \mathrm{Var}[k(Y_1, Y_2)] + \mathrm{Var}[k(Z_1, Z_2)] \\
&\quad - 4\, \mathrm{Cov}(k(X, Y), k(X, Z)) - 4\, \mathrm{Cov}(k(X, Y_1), k(Y_1, Y_2)) - 4\, \mathrm{Cov}(k(X, Z_1), k(Z_1, Z_2)) \\
&= 2\, \mathbb{E}\left[ k(X, Y)^2 \right] - 2 \langle \mu_X, \mu_Y \rangle^2 \\
&\quad + 2\, \mathbb{E}\left[ k(X, Z)^2 \right] - 2 \langle \mu_X, \mu_Z \rangle^2 \\
&\quad + \mathbb{E}\left[ k(Y_1, Y_2)^2 \right] - \langle \mu_Y, \mu_Y \rangle^2 \\
&\quad + \mathbb{E}\left[ k(Z_1, Z_2)^2 \right] - \langle \mu_Z, \mu_Z \rangle^2 \\
&\quad - 4\, \mathbb{E}\left[ \langle \varphi(X), \mu_Y \rangle \langle \varphi(X), \mu_Z \rangle \right] + 4 \langle \mu_X, \mu_Y \rangle \langle \mu_X, \mu_Z \rangle \\
&\quad - 4\, \mathbb{E}\left[ \langle \varphi(Y), \mu_X \rangle \langle \varphi(Y), \mu_Y \rangle \right] + 4 \langle \mu_X, \mu_Y \rangle \langle \mu_Y, \mu_Y \rangle \\
&\quad - 4\, \mathbb{E}\left[ \langle \varphi(Z), \mu_X \rangle \langle \varphi(Z), \mu_Z \rangle \right] + 4 \langle \mu_X, \mu_Z \rangle \langle \mu_Z, \mu_Z \rangle.
\end{align*}
Combining the two gives an expression for $\mathrm{Var}\left[ \widehat{\mathrm{MMD}}^2_U(\mathbf{X}, \mathbf{Y}) - \widehat{\mathrm{MMD}}^2_U(\mathbf{X}, \mathbf{Z}) \right]$ of
\begin{align*}
\nu_m &= \frac{2}{m(m-1)} \left[ 2(m-2)\xi_1 + \xi_2 \right] \\
&= \frac{2}{m(m-1)} \Bigg[ 2(m-2)\, \mathbb{E}\left[ \langle \varphi(X), \mu_Y \rangle^2 \right] - 2(m-2) \langle \mu_X, \mu_Y \rangle^2 \\
&\qquad + 2(m-2)\, \mathbb{E}\left[ \langle \varphi(X), \mu_Z \rangle^2 \right] - 2(m-2) \langle \mu_X, \mu_Z \rangle^2 \\
&\qquad + 2(m-2)\, \mathbb{E}\left[ \langle \varphi(Y), \mu_X \rangle^2 \right] - 2(m-2) \langle \mu_Y, \mu_X \rangle^2 \\
&\qquad + 2(m-2)\, \mathbb{E}\left[ \langle \varphi(Y), \mu_Y \rangle^2 \right] - 2(m-2) \langle \mu_Y, \mu_Y \rangle^2 \\
&\qquad + 2(m-2)\, \mathbb{E}\left[ \langle \varphi(Z), \mu_X \rangle^2 \right] - 2(m-2) \langle \mu_Z, \mu_X \rangle^2 \\
&\qquad + 2(m-2)\, \mathbb{E}\left[ \langle \varphi(Z), \mu_Z \rangle^2 \right] - 2(m-2) \langle \mu_Z, \mu_Z \rangle^2 \\
&\qquad - 4(m-2)\, \mathbb{E}\left[ \langle \varphi(X), \mu_Y \rangle \langle \varphi(X), \mu_Z \rangle \right] + 4(m-2) \langle \mu_X, \mu_Y \rangle \langle \mu_X, \mu_Z \rangle \\
&\qquad - 4(m-2)\, \mathbb{E}\left[ \langle \varphi(Y), \mu_X \rangle \langle \varphi(Y), \mu_Y \rangle \right] + 4(m-2) \langle \mu_X, \mu_Y \rangle \langle \mu_Y, \mu_Y \rangle \\
&\qquad - 4(m-2)\, \mathbb{E}\left[ \langle \varphi(Z), \mu_X \rangle \langle \varphi(Z), \mu_Z \rangle \right] + 4(m-2) \langle \mu_X, \mu_Z \rangle \langle \mu_Z, \mu_Z \rangle \\
&\qquad + 2\, \mathbb{E}\left[ k(X, Y)^2 \right] - 2 \langle \mu_X, \mu_Y \rangle^2 \\
&\qquad + 2\, \mathbb{E}\left[ k(X, Z)^2 \right] - 2 \langle \mu_X, \mu_Z \rangle^2 \\
&\qquad + \mathbb{E}\left[ k(Y_1, Y_2)^2 \right] - \langle \mu_Y, \mu_Y \rangle^2 \\
&\qquad + \mathbb{E}\left[ k(Z_1, Z_2)^2 \right] - \langle \mu_Z, \mu_Z \rangle^2 \\
&\qquad - 4\, \mathbb{E}\left[ \langle \varphi(X), \mu_Y \rangle \langle \varphi(X), \mu_Z \rangle \right] + 4 \langle \mu_X, \mu_Y \rangle \langle \mu_X, \mu_Z \rangle \\
&\qquad - 4\, \mathbb{E}\left[ \langle \varphi(Y), \mu_X \rangle \langle \varphi(Y), \mu_Y \rangle \right] + 4 \langle \mu_X, \mu_Y \rangle \langle \mu_Y, \mu_Y \rangle \\
&\qquad - 4\, \mathbb{E}\left[ \langle \varphi(Z), \mu_X \rangle \langle \varphi(Z), \mu_Z \rangle \right] + 4 \langle \mu_X, \mu_Z \rangle \langle \mu_Z, \mu_Z \rangle \Bigg]
\end{align*}

so that
\begin{align*}
\nu_m &= \frac{2}{m(m-1)} \Bigg[ 2(m-2) \left( \mathbb{E}\left[ \langle \varphi(X), \mu_Y \rangle^2 \right] + \mathbb{E}\left[ \langle \varphi(X), \mu_Z \rangle^2 \right] + \mathbb{E}\left[ \langle \varphi(Y), \mu_X \rangle^2 \right] + \mathbb{E}\left[ \langle \varphi(Z), \mu_X \rangle^2 \right] \right) \\
&\qquad + 2(m-2) \left( \mathbb{E}\left[ \langle \varphi(Y), \mu_Y \rangle^2 \right] + \mathbb{E}\left[ \langle \varphi(Z), \mu_Z \rangle^2 \right] \right) \\
&\qquad - 2(2m-3) \left( \langle \mu_X, \mu_Y \rangle^2 + \langle \mu_X, \mu_Z \rangle^2 \right) - (2m-3) \left( \langle \mu_Y, \mu_Y \rangle^2 + \langle \mu_Z, \mu_Z \rangle^2 \right) \\
&\qquad + 4(m-1) \left( \langle \mu_X, \mu_Y \rangle \langle \mu_X, \mu_Z \rangle + \langle \mu_X, \mu_Y \rangle \langle \mu_Y, \mu_Y \rangle + \langle \mu_X, \mu_Z \rangle \langle \mu_Z, \mu_Z \rangle \right) \tag{3} \\
&\qquad - 4(m-1)\, \mathbb{E}\left[ \langle \varphi(X), \mu_Y \rangle \langle \varphi(X), \mu_Z \rangle \right] \\
&\qquad - 4(m-1)\, \mathbb{E}\left[ \langle \varphi(Y), \mu_X \rangle \langle \varphi(Y), \mu_Y \rangle \right] \\
&\qquad - 4(m-1)\, \mathbb{E}\left[ \langle \varphi(Z), \mu_X \rangle \langle \varphi(Z), \mu_Z \rangle \right] \\
&\qquad + 2\, \mathbb{E}\left[ k(X, Y)^2 \right] + 2\, \mathbb{E}\left[ k(X, Z)^2 \right] + \mathbb{E}\left[ k(Y_1, Y_2)^2 \right] + \mathbb{E}\left[ k(Z_1, Z_2)^2 \right] \Bigg].
\end{align*}

2 Estimators of terms
2.1 Sub-expressions
We now give unbiased estimators of the various terms in the variance results of Section 1.
Define an $m \times m$ matrix $K_{XY}$ by $(K_{XY})_{ij} = k(X_i, Y_j)$, and $K_{XZ}$, $K_{XX}$, $K_{YY}$, $K_{ZZ}$ similarly. Let
$\tilde{K}_{XX}$, $\tilde{K}_{YY}$, $\tilde{K}_{ZZ}$ be $K_{XX}$, $K_{YY}$, $K_{ZZ}$ with their diagonals set to zero. Let $\mathbf{1}$ be the $m$-vector of all ones.
We'll also use the falling factorial notation $(m)_k := m(m-1)\cdots(m-k+1)$.
For unbiased estimators, the important thing is to subtract off elements of sums which share data points.
For example,
\begin{align*}
\langle \mu_X, \mu_Y \rangle &= \langle \mathbb{E}_X \varphi(X), \mathbb{E}_Y \varphi(Y) \rangle = \mathbb{E}_{X,Y}\, k(X, Y) \approx \frac{1}{m^2} \sum_{i,j} k(X_i, Y_j) = \frac{1}{m^2} \mathbf{1}^\mathsf{T} K_{XY} \mathbf{1} \\
\langle \mu_X, \mu_X \rangle &= \langle \mathbb{E}_X \varphi(X), \mathbb{E}_{X'} \varphi(X') \rangle = \mathbb{E}_{X,X'}\, k(X, X') \approx \frac{1}{m(m-1)} \sum_{i \neq j} k(X_i, X_j) = \frac{1}{m(m-1)} \mathbf{1}^\mathsf{T} \tilde{K}_{XX} \mathbf{1}.
\end{align*}
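
In code, these two basic estimates are one-liners; a sketch (our own naming, not from the original note), given a NumPy cross-kernel matrix `K_XY` and a zero-diagonal `Kt_XX`:

```python
def mean_inner_xy(K_XY):
    # <mu_X, mu_Y> estimate: (1/m^2) 1^T K_XY 1; all m^2 pairs are independent.
    m = K_XY.shape[0]
    return K_XY.sum() / m**2

def mean_inner_xx(Kt_XX):
    # <mu_X, mu_X> estimate: (1/(m(m-1))) 1^T K~_XX 1; the zeroed diagonal
    # removes the i = j terms, which would otherwise bias the estimate upward.
    m = Kt_XX.shape[0]
    return Kt_XX.sum() / (m * (m - 1))
```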

It is also important to do so for products of these terms: this caused the bias present in the published
version of Sutherland et al. (2017). For instance, the square of an unbiased estimator for $\langle \mu_X, \mu_Y \rangle$ is not
unbiased for $\langle \mu_X, \mu_Y \rangle^2$, but the following is:

\begin{align*}
\langle \mu_X, \mu_Y \rangle^2 &= \langle \mu_X, \mu_Y \rangle \langle \mu_X, \mu_Y \rangle = \mathbb{E}_{X,X',Y,Y'}\left[ k(X, Y) k(X', Y') \right] \\
&\approx \frac{1}{m^2} \sum_{i,j} k(X_i, Y_j) \, \frac{1}{(m-1)^2} \sum_{i' \neq i} \sum_{j' \neq j} k(X_{i'}, Y_{j'}) \\
&= \frac{1}{m^2 (m-1)^2} \sum_{i,j} k(X_i, Y_j) \left[ \sum_{i',j'} k(X_{i'}, Y_{j'}) - \sum_{i'} k(X_{i'}, Y_j) - \sum_{j'} k(X_i, Y_{j'}) + k(X_i, Y_j) \right] \\
&= \frac{1}{m^2 (m-1)^2} \left[ \sum_{i,j,i',j'} (K_{XY})_{ij} (K_{XY})_{i'j'} - \sum_{i,j,i'} (K_{XY})_{ij} (K_{XY})_{i'j} - \sum_{i,j,j'} (K_{XY})_{ij} (K_{XY})_{ij'} + \sum_{i,j} (K_{XY})_{ij}^2 \right] \\
&= \frac{1}{m^2 (m-1)^2} \left[ \left( \mathbf{1}^\mathsf{T} K_{XY} \mathbf{1} \right)^2 - \mathbf{1}^\mathsf{T} K_{XY} K_{XY}^\mathsf{T} \mathbf{1} - \mathbf{1}^\mathsf{T} K_{XY}^\mathsf{T} K_{XY} \mathbf{1} + \| K_{XY} \|_F^2 \right] \\
&= \frac{1}{m^2 (m-1)^2} \left[ \left( \mathbf{1}^\mathsf{T} K_{XY} \mathbf{1} \right)^2 - \| K_{XY}^\mathsf{T} \mathbf{1} \|^2 - \| K_{XY} \mathbf{1} \|^2 + \| K_{XY} \|_F^2 \right].
\end{align*}
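
A direct transcription of the final line, as a sketch (function name ours):

```python
def mean_inner_xy_squared(K_XY):
    # Unbiased estimate of <mu_X, mu_Y>^2:
    # [ (1^T K 1)^2 - ||K^T 1||^2 - ||K 1||^2 + ||K||_F^2 ] / (m^2 (m-1)^2)
    m = K_XY.shape[0]
    total = K_XY.sum()            # 1^T K_XY 1
    row = K_XY.sum(axis=1)        # K_XY 1
    col = K_XY.sum(axis=0)        # K_XY^T 1
    frob = (K_XY**2).sum()        # ||K_XY||_F^2
    return (total**2 - col @ col - row @ row + frob) / (m**2 * (m - 1)**2)
```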

Similarly,
\begin{align*}
\langle \mu_X, \mu_X \rangle^2 &= \mathbb{E}_{X_1, X_2, X_3, X_4}\left[ k(X_1, X_2) k(X_3, X_4) \right] \\
&\approx \frac{1}{(m)_4} \sum_i \sum_{j \neq i} k(X_i, X_j) \sum_{a \notin \{i,j\}} \sum_{b \notin \{i,j,a\}} k(X_a, X_b) \\
&= \frac{1}{(m)_4} \sum_{ij} (\tilde{K}_{XX})_{ij} \sum_{a,b \notin \{i,j\}} (\tilde{K}_{XX})_{ab} \\
&= \frac{1}{(m)_4} \sum_{ij} (\tilde{K}_{XX})_{ij} \Bigg[ \sum_{ab} (\tilde{K}_{XX})_{ab} - \sum_a (\tilde{K}_{XX})_{ai} - \sum_a (\tilde{K}_{XX})_{aj} - \sum_b (\tilde{K}_{XX})_{ib} - \sum_b (\tilde{K}_{XX})_{jb} \\
&\qquad\qquad\qquad\qquad + (\tilde{K}_{XX})_{ii} + (\tilde{K}_{XX})_{ij} + (\tilde{K}_{XX})_{ji} + (\tilde{K}_{XX})_{jj} \Bigg] \\
&= \frac{1}{(m)_4} \sum_{ij} (\tilde{K}_{XX})_{ij} \left[ \sum_{ab} (\tilde{K}_{XX})_{ab} - 2 \sum_a (\tilde{K}_{XX})_{ai} - 2 \sum_a (\tilde{K}_{XX})_{aj} + 2 (\tilde{K}_{XX})_{ij} \right] \\
&= \frac{1}{(m)_4} \left[ \sum_{ijab} (\tilde{K}_{XX})_{ij} (\tilde{K}_{XX})_{ab} - 2 \sum_{ija} (\tilde{K}_{XX})_{ij} (\tilde{K}_{XX})_{ai} - 2 \sum_{ija} (\tilde{K}_{XX})_{ij} (\tilde{K}_{XX})_{aj} + 2 \sum_{ij} (\tilde{K}_{XX})_{ij}^2 \right] \\
&= \frac{1}{(m)_4} \left[ \left( \mathbf{1}^\mathsf{T} \tilde{K}_{XX} \mathbf{1} \right)^2 - 4 \| \tilde{K}_{XX} \mathbf{1} \|^2 + 2 \| \tilde{K}_{XX} \|_F^2 \right].
\end{align*}
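
The corresponding sketch (again ours) for this within-sample square:

```python
def mean_inner_xx_squared(Kt_XX):
    # Unbiased estimate of <mu_X, mu_X>^2:
    # [ (1^T K~ 1)^2 - 4 ||K~ 1||^2 + 2 ||K~||_F^2 ] / (m)_4
    m = Kt_XX.shape[0]
    m4 = m * (m - 1) * (m - 2) * (m - 3)
    row = Kt_XX.sum(axis=1)       # K~_XX 1 (symmetric, so also K~_XX^T 1)
    return (Kt_XX.sum()**2 - 4 * (row @ row) + 2 * (Kt_XX**2).sum()) / m4
```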

We also need
\begin{align*}
\langle \mu_X, \mu_X \rangle \langle \mu_X, \mu_Y \rangle &= \mathbb{E}_{X_1, X_2, X_3, Y}\left[ k(X_1, X_2) k(X_3, Y) \right] \\
&\approx \frac{1}{m (m)_3} \sum_i \sum_{j \neq i} k(X_i, X_j) \sum_{\ell \notin \{i,j\}} \sum_a k(X_\ell, Y_a) \\
&= \frac{1}{m (m)_3} \sum_{ij} (\tilde{K}_{XX})_{ij} \left[ \sum_{\ell a} (K_{XY})_{\ell a} - \sum_a (K_{XY})_{ia} - \sum_a (K_{XY})_{ja} \right] \\
&= \frac{1}{m (m)_3} \left[ \sum_{ij\ell a} (\tilde{K}_{XX})_{ij} (K_{XY})_{\ell a} - \sum_{ija} (\tilde{K}_{XX})_{ij} (K_{XY})_{ia} - \sum_{ija} (\tilde{K}_{XX})_{ij} (K_{XY})_{ja} \right] \\
&= \frac{1}{m (m)_3} \left[ \mathbf{1}^\mathsf{T} \tilde{K}_{XX} \mathbf{1} \, \mathbf{1}^\mathsf{T} K_{XY} \mathbf{1} - 2\, \mathbf{1}^\mathsf{T} \tilde{K}_{XX} K_{XY} \mathbf{1} \right]
\end{align*}
and
\begin{align*}
\langle \mu_X, \mu_Y \rangle \langle \mu_X, \mu_Z \rangle &\approx \frac{1}{m^3 (m-1)} \sum_i \sum_a \sum_{j \neq i} \sum_b k(X_i, Y_a) k(X_j, Z_b) \\
&= \frac{1}{m^3 (m-1)} \left[ \sum_{ijab} (K_{XY})_{ia} (K_{XZ})_{jb} - \sum_{iab} (K_{XY})_{ia} (K_{XZ})_{ib} \right] \\
&= \frac{1}{m^3 (m-1)} \left[ \mathbf{1}^\mathsf{T} K_{XY} \mathbf{1} \, \mathbf{1}^\mathsf{T} K_{XZ} \mathbf{1} - \mathbf{1}^\mathsf{T} K_{XY}^\mathsf{T} K_{XZ} \mathbf{1} \right].
\end{align*}
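
Sketches (ours) of these two product estimators, in the same matrix notation:

```python
def mean_inner_xx_times_xy(Kt_XX, K_XY):
    # <mu_X, mu_X><mu_X, mu_Y> estimate:
    # [ 1^T K~_XX 1 * 1^T K_XY 1 - 2 * 1^T K~_XX K_XY 1 ] / (m (m)_3)
    m = Kt_XX.shape[0]
    m3 = m * (m - 1) * (m - 2)
    cross = Kt_XX.sum(axis=1) @ K_XY.sum(axis=1)   # 1^T K~_XX K_XY 1 (K~ symmetric)
    return (Kt_XX.sum() * K_XY.sum() - 2 * cross) / (m * m3)

def mean_inner_xy_times_xz(K_XY, K_XZ):
    # <mu_X, mu_Y><mu_X, mu_Z> estimate:
    # [ 1^T K_XY 1 * 1^T K_XZ 1 - 1^T K_XY^T K_XZ 1 ] / (m^3 (m-1))
    m = K_XY.shape[0]
    cross = K_XY.sum(axis=1) @ K_XZ.sum(axis=1)    # terms sharing the same X_i
    return (K_XY.sum() * K_XZ.sum() - cross) / (m**3 * (m - 1))
```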

We also need some similar terms with $\varphi(X)$ shared:
\begin{align*}
\mathbb{E}\left[ \langle \varphi(X), \mu_X \rangle^2 \right] &\approx \frac{1}{(m)_3} \sum_i \sum_{j \neq i} \sum_{\ell \notin \{i,j\}} k(X_i, X_j) k(X_i, X_\ell) \\
&= \frac{1}{(m)_3} \sum_{ij} (\tilde{K}_{XX})_{ij} \sum_{\ell \notin \{i,j\}} k(X_i, X_\ell) \\
&= \frac{1}{(m)_3} \sum_{ij} \sum_{\ell \neq j} (\tilde{K}_{XX})_{ij} (\tilde{K}_{XX})_{i\ell} \\
&= \frac{1}{(m)_3} \left[ \sum_{ij\ell} (\tilde{K}_{XX})_{ij} (\tilde{K}_{XX})_{i\ell} - \sum_{ij} (\tilde{K}_{XX})_{ij}^2 \right] \\
&= \frac{1}{(m)_3} \left[ \mathbf{1}^\mathsf{T} \tilde{K}_{XX} \tilde{K}_{XX} \mathbf{1} - \| \tilde{K}_{XX} \|_F^2 \right] \\
&= \frac{1}{(m)_3} \left[ \| \tilde{K}_{XX} \mathbf{1} \|^2 - \| \tilde{K}_{XX} \|_F^2 \right],
\end{align*}
\begin{align*}
\mathbb{E}\left[ \langle \varphi(X), \mu_Y \rangle^2 \right] &\approx \frac{1}{m^2 (m-1)} \sum_i \sum_j \sum_{\ell \neq j} k(X_i, Y_j) k(X_i, Y_\ell) \\
&= \frac{1}{m^2 (m-1)} \left[ \sum_{ij\ell} (K_{XY})_{ij} (K_{XY})_{i\ell} - \sum_{ij} (K_{XY})_{ij}^2 \right] \\
&= \frac{1}{m^2 (m-1)} \left[ \| K_{XY} \mathbf{1} \|^2 - \| K_{XY} \|_F^2 \right],
\end{align*}
\begin{align*}
\mathbb{E}\left[ \langle \varphi(X), \mu_X \rangle \langle \varphi(X), \mu_Y \rangle \right] &\approx \frac{1}{m^2 (m-1)} \sum_i \sum_{j \neq i} \sum_\ell k(X_i, X_j) k(X_i, Y_\ell) \\
&= \frac{1}{m^2 (m-1)} \sum_{ij\ell} (\tilde{K}_{XX})_{ij} (K_{XY})_{i\ell} \\
&= \frac{1}{m^2 (m-1)} \mathbf{1}^\mathsf{T} \tilde{K}_{XX} K_{XY} \mathbf{1},
\end{align*}
and
\[
\mathbb{E}\left[ \langle \varphi(X), \mu_Y \rangle \langle \varphi(X), \mu_Z \rangle \right] \approx \frac{1}{m^3} \sum_{ij\ell} k(X_i, Y_j) k(X_i, Z_\ell) = \frac{1}{m^3} \mathbf{1}^\mathsf{T} K_{XY}^\mathsf{T} K_{XZ} \mathbf{1}.
\]
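
Sketches (ours) of these shared-$\varphi(X)$ expectations, with the same conventions as above:

```python
def e_phix_mux_sq(Kt_XX):
    # E[<phi(X), mu_X>^2] estimate: (||K~ 1||^2 - ||K~||_F^2) / (m)_3
    m = Kt_XX.shape[0]
    row = Kt_XX.sum(axis=1)
    return (row @ row - (Kt_XX**2).sum()) / (m * (m - 1) * (m - 2))

def e_phix_muy_sq(K_XY):
    # E[<phi(X), mu_Y>^2] estimate: (||K_XY 1||^2 - ||K_XY||_F^2) / (m^2 (m-1))
    m = K_XY.shape[0]
    row = K_XY.sum(axis=1)
    return (row @ row - (K_XY**2).sum()) / (m**2 * (m - 1))

def e_phix_mux_muy(Kt_XX, K_XY):
    # E[<phi(X), mu_X><phi(X), mu_Y>] estimate: 1^T K~_XX K_XY 1 / (m^2 (m-1))
    m = K_XY.shape[0]
    return (Kt_XX.sum(axis=1) @ K_XY.sum(axis=1)) / (m**2 * (m - 1))

def e_phix_muy_muz(K_XY, K_XZ):
    # E[<phi(X), mu_Y><phi(X), mu_Z>] estimate: 1^T K_XY^T K_XZ 1 / m^3
    m = K_XY.shape[0]
    return (K_XY.sum(axis=1) @ K_XZ.sum(axis=1)) / m**3
```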

Finally, we need a few terms that don't involve any mean embeddings:
\begin{align*}
\mathbb{E}\left[ k(X_1, X_2)^2 \right] &\approx \frac{1}{m(m-1)} \sum_{i \neq j} k(X_i, X_j)^2 = \frac{1}{m(m-1)} \| \tilde{K}_{XX} \|_F^2 \\
\mathbb{E}\left[ k(X, Y)^2 \right] &\approx \frac{1}{m^2} \| K_{XY} \|_F^2.
\end{align*}

2.2 Final MMD variance estimator

Recall the variance $V_m = \mathrm{Var}\left[ \widehat{\mathrm{MMD}}^2_U(\mathbf{X}, \mathbf{Y}) \right]$ of (2):
\begin{align*}
V_m &= \frac{4(m-2)}{m(m-1)} \mathbb{E}\left[ \langle \varphi(X), \mu_X \rangle^2 \right] - \frac{2(2m-3)}{m(m-1)} \langle \mu_X, \mu_X \rangle^2 \\
&\quad + \frac{4(m-2)}{m(m-1)} \mathbb{E}\left[ \langle \varphi(Y), \mu_Y \rangle^2 \right] - \frac{2(2m-3)}{m(m-1)} \langle \mu_Y, \mu_Y \rangle^2 \\
&\quad + \frac{4(m-2)}{m(m-1)} \left( \mathbb{E}\left[ \langle \varphi(X), \mu_Y \rangle^2 \right] + \mathbb{E}\left[ \langle \varphi(Y), \mu_X \rangle^2 \right] \right) - \frac{4(2m-3)}{m(m-1)} \langle \mu_X, \mu_Y \rangle^2 \\
&\quad - \frac{8}{m} \mathbb{E}\left[ \langle \varphi(X), \mu_X \rangle \langle \varphi(X), \mu_Y \rangle \right] + \frac{8}{m} \langle \mu_X, \mu_X \rangle \langle \mu_X, \mu_Y \rangle \\
&\quad - \frac{8}{m} \mathbb{E}\left[ \langle \varphi(Y), \mu_Y \rangle \langle \varphi(Y), \mu_X \rangle \right] + \frac{8}{m} \langle \mu_Y, \mu_Y \rangle \langle \mu_X, \mu_Y \rangle \\
&\quad + \frac{2}{m(m-1)} \mathbb{E}\left[ k(X_1, X_2)^2 \right] + \frac{2}{m(m-1)} \mathbb{E}\left[ k(Y_1, Y_2)^2 \right] + \frac{4}{m(m-1)} \mathbb{E}\left[ k(X, Y)^2 \right].
\end{align*}

Plugging in the estimators of Section 2.1, we at last get an estimator for the variance:
\begin{align*}
\hat{V}_m &= \frac{4}{m^2 (m-1)^2} \left[ \| \tilde{K}_{XX} \mathbf{1} \|^2 + \| \tilde{K}_{YY} \mathbf{1} \|^2 - \| \tilde{K}_{XX} \|_F^2 - \| \tilde{K}_{YY} \|_F^2 \right] \\
&\quad + \frac{2(2m-3)}{(m)_2 (m)_4} \left[ -\left( \mathbf{1}^\mathsf{T} \tilde{K}_{XX} \mathbf{1} \right)^2 - \left( \mathbf{1}^\mathsf{T} \tilde{K}_{YY} \mathbf{1} \right)^2 + 4 \| \tilde{K}_{XX} \mathbf{1} \|^2 + 4 \| \tilde{K}_{YY} \mathbf{1} \|^2 - 2 \| \tilde{K}_{XX} \|_F^2 - 2 \| \tilde{K}_{YY} \|_F^2 \right] \\
&\quad + \frac{4(m-2)}{m^3 (m-1)^2} \left[ \| K_{XY} \mathbf{1} \|^2 + \| K_{XY}^\mathsf{T} \mathbf{1} \|^2 - 2 \| K_{XY} \|_F^2 \right] \\
&\quad + \frac{4(2m-3)}{m^3 (m-1)^3} \left[ -\left( \mathbf{1}^\mathsf{T} K_{XY} \mathbf{1} \right)^2 + \| K_{XY}^\mathsf{T} \mathbf{1} \|^2 + \| K_{XY} \mathbf{1} \|^2 - \| K_{XY} \|_F^2 \right] \\
&\quad - \frac{8}{m^3 (m-1)} \mathbf{1}^\mathsf{T} \tilde{K}_{XX} K_{XY} \mathbf{1} + \frac{8}{m^2 (m)_3} \left[ \mathbf{1}^\mathsf{T} \tilde{K}_{XX} \mathbf{1} \, \mathbf{1}^\mathsf{T} K_{XY} \mathbf{1} - 2\, \mathbf{1}^\mathsf{T} \tilde{K}_{XX} K_{XY} \mathbf{1} \right] \\
&\quad - \frac{8}{m^3 (m-1)} \mathbf{1}^\mathsf{T} \tilde{K}_{YY} K_{XY}^\mathsf{T} \mathbf{1} + \frac{8}{m^2 (m)_3} \left[ \mathbf{1}^\mathsf{T} \tilde{K}_{YY} \mathbf{1} \, \mathbf{1}^\mathsf{T} K_{XY} \mathbf{1} - 2\, \mathbf{1}^\mathsf{T} \tilde{K}_{YY} K_{XY}^\mathsf{T} \mathbf{1} \right] \\
&\quad + \frac{2}{m^2 (m-1)^2} \left[ \| \tilde{K}_{XX} \|_F^2 + \| \tilde{K}_{YY} \|_F^2 \right] + \frac{4}{m^3 (m-1)} \| K_{XY} \|_F^2,
\end{align*}

which simplifies to
\begin{align*}
\hat{V}_m &= \frac{4}{(m)_4} \left[ \| \tilde{K}_{XX} \mathbf{1} \|^2 + \| \tilde{K}_{YY} \mathbf{1} \|^2 \right] + \frac{4(m^2 - m - 1)}{m^3 (m-1)^3} \left[ \| K_{XY} \mathbf{1} \|^2 + \| K_{XY}^\mathsf{T} \mathbf{1} \|^2 \right] \\
&\quad - \frac{8}{m^2 (m^2 - 3m + 2)} \left[ \mathbf{1}^\mathsf{T} \tilde{K}_{XX} K_{XY} \mathbf{1} + \mathbf{1}^\mathsf{T} \tilde{K}_{YY} K_{XY}^\mathsf{T} \mathbf{1} \right] \\
&\quad + \frac{8}{m^2 (m)_3} \left[ \left( \mathbf{1}^\mathsf{T} \tilde{K}_{XX} \mathbf{1} + \mathbf{1}^\mathsf{T} \tilde{K}_{YY} \mathbf{1} \right) \mathbf{1}^\mathsf{T} K_{XY} \mathbf{1} \right] \tag{4} \\
&\quad - \frac{2(2m-3)}{(m)_2 (m)_4} \left[ \left( \mathbf{1}^\mathsf{T} \tilde{K}_{XX} \mathbf{1} \right)^2 + \left( \mathbf{1}^\mathsf{T} \tilde{K}_{YY} \mathbf{1} \right)^2 \right] - \frac{4(2m-3)}{m^3 (m-1)^3} \left( \mathbf{1}^\mathsf{T} K_{XY} \mathbf{1} \right)^2 \\
&\quad - \frac{2}{m(m^3 - 6m^2 + 11m - 6)} \left[ \| \tilde{K}_{XX} \|_F^2 + \| \tilde{K}_{YY} \|_F^2 \right] - \frac{4(m-2)}{m^2 (m-1)^3} \| K_{XY} \|_F^2.
\end{align*}
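
As a concrete transcription, here is a NumPy sketch of (4) as printed above; this is our illustration rather than reference code, so it is worth validating numerically before use. (Note $m(m^3 - 6m^2 + 11m - 6) = (m)_4$, which the sketch uses.)

```python
import numpy as np

def mmd2_u_var(K_XX, K_YY, K_XY):
    """Unbiased variance estimate (4) for the MMD^2 estimator (1)."""
    m = K_XX.shape[0]
    Kt_XX = K_XX.copy(); np.fill_diagonal(Kt_XX, 0.0)   # K~_XX
    Kt_YY = K_YY.copy(); np.fill_diagonal(Kt_YY, 0.0)   # K~_YY
    m2 = m * (m - 1); m3 = m2 * (m - 2); m4 = m3 * (m - 3)   # falling factorials
    s_XX, s_YY, s_XY = Kt_XX.sum(), Kt_YY.sum(), K_XY.sum()  # 1^T K 1 sums
    r_XX, r_YY = Kt_XX.sum(axis=1), Kt_YY.sum(axis=1)        # K~ 1
    r_XY, c_XY = K_XY.sum(axis=1), K_XY.sum(axis=0)          # K_XY 1, K_XY^T 1
    f_XX, f_YY = (Kt_XX**2).sum(), (Kt_YY**2).sum()          # squared Frobenius norms
    f_XY = (K_XY**2).sum()
    return (
        4 / m4 * (r_XX @ r_XX + r_YY @ r_YY)
        + 4 * (m**2 - m - 1) / (m**3 * (m - 1)**3) * (r_XY @ r_XY + c_XY @ c_XY)
        - 8 / (m**2 * (m**2 - 3*m + 2)) * (r_XX @ r_XY + r_YY @ c_XY)
        + 8 / (m**2 * m3) * ((s_XX + s_YY) * s_XY)
        - 2 * (2*m - 3) / (m2 * m4) * (s_XX**2 + s_YY**2)
        - 4 * (2*m - 3) / (m**3 * (m - 1)**3) * s_XY**2
        - 2 / m4 * (f_XX + f_YY)
        - 4 * (m - 2) / (m**2 * (m - 1)**3) * f_XY
    )
```

A quick sanity check is to compare the average of this estimate over many simulated sample pairs against the empirical variance of $\widehat{\mathrm{MMD}}^2_U$ across those same pairs.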

2.3 Final difference of MMD variance estimator

The result (3) of Section 1.2 was that $\mathrm{Var}\left[ \widehat{\mathrm{MMD}}^2_U(\mathbf{X}, \mathbf{Y}) - \widehat{\mathrm{MMD}}^2_U(\mathbf{X}, \mathbf{Z}) \right]$ is
\begin{align*}
\nu_m &= \frac{4(m-2)}{m(m-1)} \left[ \mathbb{E}\left[ \langle \varphi(X), \mu_Y \rangle^2 \right] + \mathbb{E}\left[ \langle \varphi(X), \mu_Z \rangle^2 \right] + \mathbb{E}\left[ \langle \varphi(Y), \mu_X \rangle^2 \right] + \mathbb{E}\left[ \langle \varphi(Z), \mu_X \rangle^2 \right] \right] \\
&\quad + \frac{4(m-2)}{m(m-1)} \left[ \mathbb{E}\left[ \langle \varphi(Y), \mu_Y \rangle^2 \right] + \mathbb{E}\left[ \langle \varphi(Z), \mu_Z \rangle^2 \right] \right] \\
&\quad - \frac{4(2m-3)}{m(m-1)} \left[ \langle \mu_X, \mu_Y \rangle^2 + \langle \mu_X, \mu_Z \rangle^2 \right] - \frac{2(2m-3)}{m(m-1)} \left[ \langle \mu_Y, \mu_Y \rangle^2 + \langle \mu_Z, \mu_Z \rangle^2 \right] \\
&\quad + \frac{8}{m} \left[ \langle \mu_X, \mu_Y \rangle \langle \mu_X, \mu_Z \rangle + \langle \mu_X, \mu_Y \rangle \langle \mu_Y, \mu_Y \rangle + \langle \mu_X, \mu_Z \rangle \langle \mu_Z, \mu_Z \rangle \right] \\
&\quad - \frac{8}{m} \mathbb{E}\left[ \langle \varphi(X), \mu_Y \rangle \langle \varphi(X), \mu_Z \rangle \right] \\
&\quad - \frac{8}{m} \left( \mathbb{E}\left[ \langle \varphi(Y), \mu_X \rangle \langle \varphi(Y), \mu_Y \rangle \right] + \mathbb{E}\left[ \langle \varphi(Z), \mu_X \rangle \langle \varphi(Z), \mu_Z \rangle \right] \right) \\
&\quad + \frac{4}{m(m-1)} \left( \mathbb{E}\left[ k(X, Y)^2 \right] + \mathbb{E}\left[ k(X, Z)^2 \right] \right) + \frac{2}{m(m-1)} \left( \mathbb{E}\left[ k(Y_1, Y_2)^2 \right] + \mathbb{E}\left[ k(Z_1, Z_2)^2 \right] \right),
\end{align*}

which gives us the estimator
\begin{align*}
\hat{\nu}_m &= \frac{4(m-2)}{m^3 (m-1)^2} \left[ \| K_{XY} \mathbf{1} \|^2 + \| K_{XY}^\mathsf{T} \mathbf{1} \|^2 + \| K_{XZ} \mathbf{1} \|^2 + \| K_{XZ}^\mathsf{T} \mathbf{1} \|^2 - 2 \| K_{XY} \|_F^2 - 2 \| K_{XZ} \|_F^2 \right] \\
&\quad + \frac{4}{m^2 (m-1)^2} \left[ \| \tilde{K}_{YY} \mathbf{1} \|^2 - \| \tilde{K}_{YY} \|_F^2 + \| \tilde{K}_{ZZ} \mathbf{1} \|^2 - \| \tilde{K}_{ZZ} \|_F^2 \right] \\
&\quad - \frac{4(2m-3)}{m^3 (m-1)^3} \Big[ \left( \mathbf{1}^\mathsf{T} K_{XY} \mathbf{1} \right)^2 + \left( \mathbf{1}^\mathsf{T} K_{XZ} \mathbf{1} \right)^2 - \| K_{XY} \mathbf{1} \|^2 - \| K_{XZ} \mathbf{1} \|^2 - \| K_{XY}^\mathsf{T} \mathbf{1} \|^2 - \| K_{XZ}^\mathsf{T} \mathbf{1} \|^2 \\
&\qquad\qquad\qquad\qquad + \| K_{XY} \|_F^2 + \| K_{XZ} \|_F^2 \Big] \\
&\quad - \frac{2(2m-3)}{(m)_2 (m)_4} \left[ \left( \mathbf{1}^\mathsf{T} \tilde{K}_{YY} \mathbf{1} \right)^2 + \left( \mathbf{1}^\mathsf{T} \tilde{K}_{ZZ} \mathbf{1} \right)^2 - 4 \| \tilde{K}_{YY} \mathbf{1} \|^2 - 4 \| \tilde{K}_{ZZ} \mathbf{1} \|^2 + 2 \| \tilde{K}_{YY} \|_F^2 + 2 \| \tilde{K}_{ZZ} \|_F^2 \right] \\
&\quad + \frac{8}{m^4 (m-1)} \left[ \mathbf{1}^\mathsf{T} K_{XY} \mathbf{1} \, \mathbf{1}^\mathsf{T} K_{XZ} \mathbf{1} - \mathbf{1}^\mathsf{T} K_{XY}^\mathsf{T} K_{XZ} \mathbf{1} \right] \\
&\quad + \frac{8}{m^2 (m)_3} \left[ \mathbf{1}^\mathsf{T} \tilde{K}_{YY} \mathbf{1} \, \mathbf{1}^\mathsf{T} K_{XY} \mathbf{1} + \mathbf{1}^\mathsf{T} \tilde{K}_{ZZ} \mathbf{1} \, \mathbf{1}^\mathsf{T} K_{XZ} \mathbf{1} - 2\, \mathbf{1}^\mathsf{T} \tilde{K}_{YY} K_{XY}^\mathsf{T} \mathbf{1} - 2\, \mathbf{1}^\mathsf{T} \tilde{K}_{ZZ} K_{XZ}^\mathsf{T} \mathbf{1} \right] \\
&\quad - \frac{8}{m^4} \mathbf{1}^\mathsf{T} K_{XY}^\mathsf{T} K_{XZ} \mathbf{1} \\
&\quad - \frac{8}{m^3 (m-1)} \left[ \mathbf{1}^\mathsf{T} \tilde{K}_{YY} K_{XY}^\mathsf{T} \mathbf{1} + \mathbf{1}^\mathsf{T} \tilde{K}_{ZZ} K_{XZ}^\mathsf{T} \mathbf{1} \right] \\
&\quad + \frac{4}{m^3 (m-1)} \left[ \| K_{XY} \|_F^2 + \| K_{XZ} \|_F^2 \right] + \frac{2}{m^2 (m-1)^2} \left[ \| \tilde{K}_{YY} \|_F^2 + \| \tilde{K}_{ZZ} \|_F^2 \right].
\end{align*}

Combining like terms, we obtain
\begin{align*}
\hat{\nu}_m &= \frac{4(m^2 - m - 1)}{m^3 (m-1)^3} \left[ \| K_{XY} \mathbf{1} \|^2 + \| K_{XY}^\mathsf{T} \mathbf{1} \|^2 + \| K_{XZ} \mathbf{1} \|^2 + \| K_{XZ}^\mathsf{T} \mathbf{1} \|^2 \right] \\
&\quad + \frac{4}{(m)_4} \left[ \| \tilde{K}_{YY} \mathbf{1} \|^2 + \| \tilde{K}_{ZZ} \mathbf{1} \|^2 \right] - \frac{8}{m^3 (m-1)} \mathbf{1}^\mathsf{T} K_{XY}^\mathsf{T} K_{XZ} \mathbf{1} \\
&\quad - \frac{8}{m^2 (m^2 - 3m + 2)} \left[ \mathbf{1}^\mathsf{T} \tilde{K}_{YY} K_{XY}^\mathsf{T} \mathbf{1} + \mathbf{1}^\mathsf{T} \tilde{K}_{ZZ} K_{XZ}^\mathsf{T} \mathbf{1} \right] \tag{5} \\
&\quad - \frac{4(2m-3)}{m^3 (m-1)^3} \left[ \left( \mathbf{1}^\mathsf{T} K_{XY} \mathbf{1} \right)^2 + \left( \mathbf{1}^\mathsf{T} K_{XZ} \mathbf{1} \right)^2 \right] - \frac{2(2m-3)}{(m)_2 (m)_4} \left[ \left( \mathbf{1}^\mathsf{T} \tilde{K}_{YY} \mathbf{1} \right)^2 + \left( \mathbf{1}^\mathsf{T} \tilde{K}_{ZZ} \mathbf{1} \right)^2 \right] \\
&\quad + \frac{8}{m^4 (m-1)} \mathbf{1}^\mathsf{T} K_{XY} \mathbf{1} \, \mathbf{1}^\mathsf{T} K_{XZ} \mathbf{1} + \frac{8}{m^2 (m)_3} \left[ \mathbf{1}^\mathsf{T} \tilde{K}_{YY} \mathbf{1} \, \mathbf{1}^\mathsf{T} K_{XY} \mathbf{1} + \mathbf{1}^\mathsf{T} \tilde{K}_{ZZ} \mathbf{1} \, \mathbf{1}^\mathsf{T} K_{XZ} \mathbf{1} \right] \\
&\quad - \frac{4(m-2)}{m^2 (m-1)^3} \left[ \| K_{XY} \|_F^2 + \| K_{XZ} \|_F^2 \right] - \frac{2}{m(m^3 - 6m^2 + 11m - 6)} \left[ \| \tilde{K}_{YY} \|_F^2 + \| \tilde{K}_{ZZ} \|_F^2 \right].
\end{align*}
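
A corresponding sketch (again ours, under the same conventions as the previous block):

```python
import numpy as np

def mmd2_u_diff_var(K_XY, K_XZ, K_YY, K_ZZ):
    """Unbiased estimate (5) of the variance of the difference of MMD^2 estimates."""
    m = K_XY.shape[0]
    Kt_YY = K_YY.copy(); np.fill_diagonal(Kt_YY, 0.0)
    Kt_ZZ = K_ZZ.copy(); np.fill_diagonal(Kt_ZZ, 0.0)
    m2 = m * (m - 1); m3 = m2 * (m - 2); m4 = m3 * (m - 3)
    s_XY, s_XZ, s_YY, s_ZZ = K_XY.sum(), K_XZ.sum(), Kt_YY.sum(), Kt_ZZ.sum()
    r_XY, c_XY = K_XY.sum(axis=1), K_XY.sum(axis=0)
    r_XZ, c_XZ = K_XZ.sum(axis=1), K_XZ.sum(axis=0)
    r_YY, r_ZZ = Kt_YY.sum(axis=1), Kt_ZZ.sum(axis=1)
    f_XY, f_XZ = (K_XY**2).sum(), (K_XZ**2).sum()
    f_YY, f_ZZ = (Kt_YY**2).sum(), (Kt_ZZ**2).sum()
    return (
        4 * (m**2 - m - 1) / (m**3 * (m - 1)**3)
            * (r_XY @ r_XY + c_XY @ c_XY + r_XZ @ r_XZ + c_XZ @ c_XZ)
        + 4 / m4 * (r_YY @ r_YY + r_ZZ @ r_ZZ)
        - 8 / (m**3 * (m - 1)) * (r_XY @ r_XZ)          # 1^T K_XY^T K_XZ 1
        - 8 / (m**2 * (m**2 - 3*m + 2)) * (r_YY @ c_XY + r_ZZ @ c_XZ)
        - 4 * (2*m - 3) / (m**3 * (m - 1)**3) * (s_XY**2 + s_XZ**2)
        - 2 * (2*m - 3) / (m2 * m4) * (s_YY**2 + s_ZZ**2)
        + 8 / (m**4 * (m - 1)) * (s_XY * s_XZ)
        + 8 / (m**2 * m3) * (s_YY * s_XY + s_ZZ * s_XZ)
        - 4 * (m - 2) / (m**2 * (m - 1)**3) * (f_XY + f_XZ)
        - 2 / m4 * (f_YY + f_ZZ)
    )
```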

References
Bounliphone, Wacha, Eugene Belilovsky, Matthew B. Blaschko, Ioannis Antonoglou, and Arthur Gretton
(2016). “A Test of Relative Similarity For Model Selection in Generative Models.” In: International
Conference on Learning Representations (ICLR). arXiv: 1511.04581.
Gretton, Arthur, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alex J. Smola (2012). “A
Kernel Two-Sample Test.” In: The Journal of Machine Learning Research 13.
Muandet, Krikamol, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Schölkopf (2017). “Kernel
Mean Embedding of Distributions: A Review and Beyond.” In: Foundations and Trends in Machine
Learning 10.1-2, pp. 1–141. arXiv: 1605.09522.
Serfling, Robert J. (1980). Approximation Theorems of Mathematical Statistics. John Wiley & Sons.
Sutherland, Danica J., Hsiao-Yu Tung, Heiko Strathmann, Soumyajit De, Aaditya Ramdas, Alex Smola,
and Arthur Gretton (2017). “Generative Models and Model Criticism via Optimized Maximum Mean
Discrepancy.” In: International Conference on Learning Representations (ICLR). arXiv: 1611.04488.

