arXiv:1612.05845v1 [cs.IT] 18 Dec 2016

Dependence Measures Bounding the Exploration
Bias for General Measurements
Jiantao Jiao, EE Department, Stanford University, jiantao@stanford.edu
Yanjun Han, EE Department, Stanford University, yjhan@stanford.edu
Tsachy Weissman, EE Department, Stanford University, tsachy@stanford.edu

Abstract—We propose a framework to analyze and quantify the bias in adaptive data analysis. It generalizes the framework of Russo and Zou '15, applying to all measurements whose moment generating function exists and to all measurements with a finite p-norm. We introduce a new class of dependence measures that retain key properties of mutual information while more effectively quantifying the exploration bias for heavy-tailed distributions. We provide examples where our bounds are nearly tight in situations to which the original framework of Russo and Zou '15 does not apply.

I. INTRODUCTION
Suppose we have n measurements φi , 1 ≤ i ≤ n of
a dataset, and wish to select one of the measurements for
further processing. Settings of this flavor appear frequently
in applications such as model selection and reinforcement
learning, where the statistician wants to exploit the information
collected to infer the ground truth. We denote the expectation of each measurement φi by µi, i.e., Eφi = µi. We also denote the index of the selected measurement by the random variable T ∈ {1, 2, . . . , n}. The natural question is: how much does EφT deviate from EµT? This question is of practical importance, since a deviation of EφT from EµT corresponds to a misguided selection rule, or to generalization error, in choosing among the components of φ = (φ1, φ2, . . . , φn).
It was shown in Russo and Zou [1] that one can bound the
bias of the data exploration, i.e., the quantity E[φT − µT ] as
follows:
Theorem 1. [1, Prop. 3.1.] If φi − µi is σ-sub-Gaussian for
each i ∈ {1, 2, . . . , n}, then
    |E[φT − µT]| ≤ σ √(2 I(T; φ)),    (1)

where I(T ; φ) denotes the mutual information between T and
φ.
Moreover, Russo and Zou [1] argued that for certain selection rules T that are variants of selecting the maximum
among {φ1, φ2, . . . , φn}, this bound is tight for Gaussian and exponential distributions. Indeed, if the φi are i.i.d. N(0, σ²) and T = arg maxi {φ1, φ2, . . . , φn}, it is well known that

    φT / (σ √(2 ln n)) → 1  almost surely as n → ∞,    (2)

and I(T; φ) = H(T) = ln n.
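To see how tight (1) is in this Gaussian setting, the following short simulation (not part of the original paper; all constants are illustrative) draws n i.i.d. N(0, σ²) measurements, selects the maximum, and compares the resulting bias E[φT] with the Theorem 1 bound σ√(2 ln n), using I(T; φ) = H(T) = ln n for this selection rule.

import numpy as np

rng = np.random.default_rng(0)
sigma, trials = 1.0, 2000
for n in [10, 100, 1000, 10000]:
    phi = rng.normal(0.0, sigma, size=(trials, n))
    bias = phi.max(axis=1).mean()                 # E[phi_T - mu_T], since mu_i = 0
    bound = sigma * np.sqrt(2.0 * np.log(n))      # Theorem 1 with I(T; phi) = ln n
    print(f"n={n:6d}  E[phi_T]={bias:6.3f}  bound={bound:6.3f}  ratio={bias/bound:5.3f}")

The ratio approaches 1 slowly as n grows, consistent with (2).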

The interested readers are referred to Russo and Zou [1]
for variations on this bound and its applications. Our work is
motivated by the observations that
1) Theorem 1 assumes sub-Gaussian distributions1 ,
whereas in many real world applications, such as natural
language processing and e-commerce recommendation
systems, the measurements follow long-tail distributions
which are neither sub-Gaussian nor sub-exponential;
2) Lower bounds corresponding to Theorem 1 were proved
for some specific selection rules and assuming Gaussian
distributions in [1].
In this context, our main contributions are the following:
• We generalize Theorem 1 to all distributions with a nontrivial moment generating function. We show that for such
distributions, the bound on the right hand side of (1) is
replaced by a function f(I(T; φ)). For Gaussian random variables the function specializes to f(x) = σ√(2x).
• We introduce a new class of dependence measures
Iα (X; Y ) paralleling mutual information. Concretely, for
1 ≤ α < ∞, we define

    Iα(X; Y) = Dφα(PXY ‖ PX PY),    (3)

where Dφα(P‖Q) is the φ-divergence generated by the convex function φα(x) = |x − 1|^α. Clearly Iα(X; Y) ≥ 0, and Iα(X; Y) = 0 if and only if X and Y are independent. It satisfies the following.
Lemma 1. Suppose X takes values in a finite alphabet
with cardinality |X |, while Y is arbitrary. Then

    Iα(X; Y) ≤ 1 + Σ_{x∈X} [PX(x)]² [ (1/PX(x) − 1)^α − 1 ],    (4)

which is tight iff X is a function of Y. In particular,

    Iα(X; Y) ≤ ((|X| − 1)/|X|) [ (|X| − 1)^{α−1} + 1 ] < 1 + |X|^{α−1}    (5)

for 1 ≤ α ≤ 2.

• We show that these measures arise in bounding the exploration bias for distributions whose moment generating functions do not exist. We present theorems paralleling Theorem 1 for such heavy-tailed distributions, and construct examples implying that our bounds are essentially tight for a sequence of non-Gaussian heavy-tailed distributions. Our results imply that mutual information is not fundamental to bounding the exploration bias, and that one should apply different functionals for distributions with different tail behaviors. We conclude with connections to the literature on maximal inequalities and a generalization to random variables in arbitrary Orlicz spaces.

¹A generalization to sub-exponential distributions, which does not follow the form of the inequality in Theorem 1, was also presented in [1]. It is tightened by Corollary 3 of this paper.
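As a concrete illustration (this example is not from the paper; the joint distribution and the value of α are arbitrary), the following sketch evaluates Iα(X; Y) of (3) directly for a small joint pmf and compares it with the right-hand sides of (4) and (5).

import numpy as np

def I_alpha(P_xy, alpha):
    # phi_alpha-divergence between P_XY and P_X P_Y, with phi_alpha(t) = |t - 1|^alpha
    P_x = P_xy.sum(axis=1, keepdims=True)
    P_y = P_xy.sum(axis=0, keepdims=True)
    Q = P_x * P_y                                        # product of the marginals
    return float(np.sum(Q * np.abs(P_xy / Q - 1.0) ** alpha))

P_xy = np.array([[0.30, 0.05, 0.10],                     # |X| = 2 (rows), Y takes 3 values (columns)
                 [0.05, 0.25, 0.25]])
alpha = 1.5
P_x = P_xy.sum(axis=1)
bound4 = 1.0 + np.sum(P_x ** 2 * ((1.0 / P_x - 1.0) ** alpha - 1.0))   # right side of (4)
K = P_x.size
bound5 = (K - 1) / K * ((K - 1) ** (alpha - 1) + 1)                    # right side of (5)
print(I_alpha(P_xy, alpha), bound4, bound5)              # the three printed values are increasing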

II. PRELIMINARIES

The cumulant generating function of a random variable X is defined as ψ(λ) = ln E e^{λX}. We assume that there exists a λ > 0 such that E e^{λX} < ∞. It follows from Hölder's inequality that there exists an interval (0, b), 0 < b ≤ ∞, such that ψ(λ) < ∞ for all λ ∈ (0, b), and that ψ(λ) is convex on this interval.

A random variable is called σ-sub-Gaussian if E e^{λX} ≤ e^{λ²σ²/2} for all λ ∈ R. A random variable is called sub-exponential with parameter (σ, b) if E e^{λX} ≤ e^{λ²σ²/2} for all 0 ≤ λ < 1/b. A random variable is called sub-gamma on the right tail with variance factor σ² and scale parameter c, written Γ+(σ², c), if ψ(λ) ≤ λ²σ²/(2(1 − cλ)) for all 0 < λ < 1/c. We note that the χ² distribution with p degrees of freedom belongs to Γ+(2p, 2).

The β-norm of a random variable X for β ≥ 1 is defined as

    ‖X‖β = (E|X|^β)^{1/β} for 1 ≤ β < ∞,   ‖X‖∞ = ess sup |X|,    (8)

where the essential supremum is defined as ess sup X = inf{M : P(X > M) = 0}.

The convex conjugate of a function f is f*(y) = sup_{x∈X} {⟨x, y⟩ − f(x)}. The Fenchel–Young inequality states that for any function f and its convex conjugate f*, we have f(x) + f*(y) ≥ ⟨x, y⟩ for all x ∈ X, y ∈ X*, which follows from the definition of the convex conjugate. It follows from the Fenchel–Moreau theorem that f = f** if and only if f is convex and lower semi-continuous.

Csiszár [2], and independently Ali and Silvey [3], introduced φ-divergences, defined as follows.

Definition 1 (φ-divergence). The general form of φ-divergences is

    Dφ(P‖Q) = ∫ φ(dP/dQ) dQ,    (11)

where φ : R≥0 → R is a convex and lower semi-continuous function satisfying φ(1) = 0. It is clear that Dφ(P‖Q) = DKL(P‖Q) when φ(x) = x ln x − x + 1.

For two nonnegative sequences {an} and {bn}, we say an ≲ bn if there exists a constant C > 0 such that lim sup_n an/bn ≤ C. We say an ≳ bn if bn ≲ an.

III. MAIN RESULTS

Theorem 2. Suppose φi − µi has cumulant generating function upper bounded by ψi(λ) over the domain [0, bi), where 0 < bi ≤ ∞. Define the expected cumulant generating function ψ̄(λ) as

    ψ̄(λ) = E_T ψ_T(λ),  λ ∈ [0, min_i bi).    (12)

Then,

    E[φT − µT] ≤ (ψ̄)*⁻¹(I(T; φ)),    (13)

where (ψ̄)*⁻¹ is the inverse of the convex conjugate of the function ψ̄.

Theorem 3. Suppose φi − µi has its β-norm upper bounded by σi, where 1 < β ≤ ∞. Define α, the conjugate of β, via the relation 1/α + 1/β = 1. Then,

    |E[φT − µT]| ≤ ‖σT‖β Iα(T; φ)^{1/α}.    (14)

Moreover, for β = 2 we have

    |E[φT − µT]| ≤ ‖σT‖2 √(n − 1),    (15)

and for 2 < β ≤ ∞ (1 ≤ α < 2), we have

    |E[φT − µT]| ≤ ‖σT‖β (1 + n^{α−1})^{1/α}    (16)
                 ≤ 2^{1/α} ‖σT‖β n^{1/β}.    (17)

Corollary 1 ([1, Prop. 3.1]). Suppose φi − µi is σi-sub-Gaussian. Then,

    E[φT − µT] ≤ ‖σT‖2 √(2 I(T; φ)).    (18)

Proof: Applying Theorem 2 with ψi(λ) = λ²σi²/2, we have ψ̄(λ) = λ² E_T σ_T² / 2. Computing the convex conjugate of ψ̄(λ) leads us to

    (ψ̄)*⁻¹(x) = √(2x E_T σ_T²),    (19)

which proves the corollary.

Corollary 2. Suppose φi − µi is sub-exponential with parameter (σ, b). Then,

    E_T[φT − µT] ≤ σ √(2 I(T; φ))          if I(T; φ) ≤ σ²/(2b²),
    E_T[φT − µT] ≤ b I(T; φ) + σ²/(2b)     otherwise.    (21)

Corollary 3. Suppose φi − µi are sub-gamma random variables belonging to Γ+(σ², c). Then,

    E[φT − µT] ≤ √(2σ² I(T; φ)) + c I(T; φ).

Note that Corollary 3 is a strengthened version of [1, Prop. 3.2].
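As a numerical sanity check (not from the paper; the chi-square setup and all constants are illustrative), the next sketch treats centered χ²_p measurements, which are sub-gamma with (σ², c) = (2p, 2), and compares the generic Theorem 2 bound inf_λ (ψ̄(λ) + I)/λ with the closed form of Corollary 3.

import numpy as np

p, c = 5, 2.0
sigma2 = 2.0 * p                               # chi-square_p minus p is sub-gamma with (sigma^2, c) = (2p, 2)
I = np.log(50.0)                               # e.g. I(T; phi) = ln n for a max-type rule with n = 50

lam = np.linspace(1e-6, 1.0 / c - 1e-6, 200000)
psi_bar = lam ** 2 * sigma2 / (2.0 * (1.0 - c * lam))
generic = np.min((psi_bar + I) / lam)          # Theorem 2: (psi_bar)*^{-1}(I), evaluated on a grid
closed = np.sqrt(2.0 * sigma2 * I) + c * I     # Corollary 3
print(generic, closed)                         # the two agree up to the grid resolution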

IV. THE PATH FROM THEOREM 1 TO OUR MAIN RESULTS

Underlying the proof of [1, Prop. 3.1] is the Donsker–Varadhan variational representation of relative entropy stated below.

Lemma 2 (Donsker–Varadhan). Let P, Q be probability measures on X and let C denote the set of functions f : X → R such that EQ[e^{f(X)}] < ∞. If D(P‖Q) < ∞, then for every f ∈ C the expectation EP[f(X)] exists and, furthermore,

    D(P‖Q) = sup_{f∈C} { EP[f(X)] − ln EQ[e^{f(X)}] },    (22)

where the supremum is attained when f = ln dP/dQ. In other words, for every f ∈ C we have EP f ≤ ln EQ e^f + D(P‖Q). It leads to the following proof of Theorem 2.

Proof: (of Theorem 2) Applying Lemma 2 and setting P = Pφi|T=i, Q = Pφi, f = λ(φi − µi), λ ≥ 0, we have

    λ(E[φi | T = i] − µi) = EP f
                          ≤ ln EQ e^f + D(Pφi|T=i ‖ Pφi)
                          ≤ ψi(λ) + D(Pφ|T=i ‖ Pφ),

where in the last step we have used the fact that the cumulant generating function of φi − µi is upper bounded by ψi(λ), together with the data processing inequality for the relative entropy. Taking expectations with respect to T on both sides, we have

    λ E(φT − µT) ≤ ψ̄(λ) + I(T; φ),

which implies

    E(φT − µT) ≤ inf_{λ∈[0, min_i bi)} (ψ̄(λ) + I(T; φ))/λ = (ψ̄)*⁻¹(I(T; φ)),

where in the last step we have used [4, Lemma 2.4]. Similar arguments were also used in [7].

It is clear that the application of Lemma 2 relies on the existence of the cumulant generating function, but not on sub-Gaussianity. It is interesting to consider how one can generalize Theorem 2 to distributions whose moment generating functions do not exist. Intuitively, a natural generalization of Lemma 2 would lead to generalizations of Theorem 2. The generalization of Lemma 2 should involve some φ-divergence, since φ-divergences are the only decomposable divergences that satisfy the data processing inequality for alphabets of size at least three [5], and the data processing property is needed in the proof of Theorem 2.

The literature contains two generalization paths from the Donsker–Varadhan theorem: one goes through the Fenchel–Young inequality in convex duality theory, and the other goes through Hölder's inequality. It is intriguing that both generalizations lead to the same results presented in Theorem 3.

A. Generalization through Hölder's inequality

We first present the generalization path through Hölder's inequality investigated in [6]. Note that EP f = lim_{α→0+} (1/α) ln ∫ e^{αf} dP. Applying Hölder's inequality, we have

    ∫ e^{αf} dP = EQ[ e^{αf} (dP/dQ) ] ≤ (EQ e^{αβf})^{1/β} (EQ (dP/dQ)^γ)^{1/γ},    (29), (30)

where 1/β + 1/γ = 1, β > 1, γ > 1. Defining c = αβ > α, taking logarithms and dividing both sides by α, we have

    (1/α) ln ∫ e^{αf} dP ≤ (1/c) ln EQ e^{cf} + ((c − α)/(cα)) ln EQ (dP/dQ)^{c/(c−α)}.    (31)

It is clear that (31) is a generalization of Lemma 2. Indeed, taking α → 0+ and c = 1, and rearranging terms, we have EP f ≤ ln EQ e^f + D(P‖Q).

Now we present a proof of Theorem 3 using Hölder's inequality.

Proof: (of Theorem 3) Setting P = Pφi|T=i, Q = Pφi, Δi = φi − µi and noting that EQ Δi = 0, it follows from Hölder's inequality that

    |EP Δi| = | EQ[ Δi (dP/dQ) ] |    (33)
            = | EQ[ Δi (dP/dQ − 1) ] |    (34)
            ≤ (EQ |Δi|^β)^{1/β} (EQ |dP/dQ − 1|^α)^{1/α},    (35)

which implies

    |E[φi | T = i] − µi| ≤ σi Dφα(Pφi|T=i ‖ Pφi)^{1/α}    (36)
                         ≤ σi Dφα(Pφ|T=i ‖ Pφ)^{1/α},    (37)

where in the last step we used the data processing inequality of the φα-divergence. If β = ∞, we have α = 1 and

    |E[φi | T = i] − µi| ≤ max_i σi Dφ1(Pφ|T=i ‖ Pφ).    (38)

Taking expectations with respect to T on both sides, we have

    |E[φT − µT]| ≤ max_i σi I1(T; φ) = ‖σT‖∞ I1(T; φ).    (39), (40)

If 1 ≤ β < ∞, applying Young's inequality to (37), we have

    |E[φi | T = i] − µi| ≤ inf_{λ>0} (1/λ) [ λ^β σi^β / β + Dφα(Pφ|T=i ‖ Pφ) / α ].    (41)

Taking expectations on both sides with respect to T, and using E inf_λ Xλ ≤ inf_λ E Xλ, we have

    |E(φT − µT)| ≤ inf_{λ>0} (1/λ) [ λ^β ‖σT‖β^β / β + Iα(T; φ) / α ] = ‖σT‖β Iα(T; φ)^{1/α}.    (42), (43)

The remaining results in Theorem 3 follow from Lemma 1.
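The chain (33)-(37) is easy to check numerically. The following sketch (not from the paper; the alphabet size, β, and the distributions are arbitrary) verifies |E_P[Δ]| ≤ (E_Q|Δ|^β)^{1/β} Dφα(P‖Q)^{1/α} on a random finite-alphabet example with Δ centered under Q.

import numpy as np

rng = np.random.default_rng(1)
k, beta = 6, 3.0
alpha = beta / (beta - 1.0)
Q = rng.dirichlet(np.ones(k))
P = rng.dirichlet(np.ones(k))
values = rng.normal(size=k)
Delta = values - np.dot(Q, values)                     # E_Q[Delta] = 0

lhs = abs(np.dot(P, Delta))                            # |E_P[Delta]|
beta_norm = np.dot(Q, np.abs(Delta) ** beta) ** (1.0 / beta)
D_alpha = np.dot(Q, np.abs(P / Q - 1.0) ** alpha)      # D_{phi_alpha}(P || Q)
print(lhs, "<=", beta_norm * D_alpha ** (1.0 / alpha))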

B. Generalization through the Fenchel–Young inequality

We have the following natural variational representation of φ-divergences, which is well known in the literature [8]:

    Dφ(P‖Q) = ∫ φ(dP/dQ) dQ    (44)
            = sup_f { ∫ f dP − ∫ φ*(f) dQ }    (45)
            ≥ EP f − EQ φ*(f),    (46)

if EQ φ*(f) exists. The equality is attained when f = φ′(dP/dQ).

Note that specializing the variational representation above to φ(x) = x ln x − x + 1 cannot directly lead us to the Donsker–Varadhan result. Indeed, after taking φ(x) = x ln x − x + 1, we have

    D(P‖Q) = sup_f { EP[f(X)] − (EQ[e^{f(X)}] − 1) },    (47)

which is weaker than the Donsker–Varadhan result since ln EQ[e^{f(X)}] ≤ EQ e^{f(X)} − 1, and in particular (47) cannot be used to derive Theorem 2. The reason is that in the KL divergence case EQ e^f − 1 is the convex conjugate of the convex function D(·‖Q) when P takes values in the space of all measures, whereas ln EQ e^f is the convex conjugate of D(·‖Q) when P is constrained to be a probability measure: shrinking the primal space would decrease the convex conjugate. This phenomenon was already observed in the literature [9]. However, we can obtain the Donsker–Varadhan result from (47). Indeed, we have

    D(P‖Q) = sup_f { EP[f(X)] − (EQ[e^{f(X)}] − 1) }    (48)
           = sup_f { EP[f + λ] − (EQ[e^{f+λ}] − 1) }.    (49)

Setting EQ[e^{f(X)+λ}] = 1, we have λ = −ln EQ[e^{f(X)}], which recovers the Donsker–Varadhan representation. It suffices to verify that f + λ = f − ln EQ[e^{f(X)}] can still attain the value ln dP/dQ; indeed, it is equal to ln dP/dQ when f(X) = ln dP/dQ.

Analogously, one obtains the following variational representation of Dφα(P‖Q) for φα(x) = |x − 1|^α, α ≥ 1:

    (1/α) Dφα(P‖Q) = sup_f { EP[f] − EQ[f] − (1/β) EQ|f|^β },    (50)

where 1/α + 1/β = 1. Following a path similar to that in the proof of Theorem 2, by setting P = Pφi|T=i, Q = Pφi, f = λ(φi − µi), λ > 0, one arrives at the results of Theorem 3.

V. TIGHTNESS OF THE BOUNDS

Theorem 3 essentially shows that if all the φi − µi have bounded β-norm, then the exploration bias is upper bounded by n^{1/β} times the β-norm if 2 ≤ β < ∞. We now show through extreme value theory that this is essentially tight for certain heavy-tailed distributions.

Suppose all the φi, 1 ≤ i ≤ n, are i.i.d. with CDF

    F(x) = 1 − x0^β (ln x0)^c / (x^β (ln x)^c),  x ≥ x0 > e^{1/β},    (51)

where c > 1. The Type II (Fréchet type) extreme value distribution with parameter β > 0 is characterized by

    Φβ(x) = 0 for x < 0,   Φβ(x) = e^{−x^{−β}} for x ≥ 0.    (52)

Lemma 3 ([10, Thm. 1.2.1, Cor. 1.2.4]). The necessary and sufficient condition for a distribution function F(x) to fall into the domain of attraction of Φβ(x), 0 < β < ∞, is sup{x : F(x) < 1} = ∞ and

    lim_{t→∞} (1 − F(tx)) / (1 − F(t)) = x^{−β},  x > 0.    (53)

Then, there exists a sequence an → ∞ such that

    lim_{n→∞} P( max_i φi / an ≤ x ) = Φβ(x),    (54)

where an = F^←(1 − n^{−1}) and F^←(·) denotes the inverse function of F(x).

Defining T = arg max_i φi, we now care about the asymptotic distribution of φT as n → ∞. For the CDF in (51) it is easy to see that an is the solution to the equation

    x^β (ln x)^c = n x0^β (ln x0)^c,    (55)

which satisfies an ≳ n^{1/β} / (ln n)^{c/β}. Now we compute the β-norm of φi. We have

    E φi^β = ∫_0^∞ β x^{β−1} P(φi ≥ x) dx
           = ∫_0^{x0} β x^{β−1} dx + ∫_{x0}^∞ β x^{β−1} · x0^β (ln x0)^c / (x^β (ln x)^c) dx
           < ∞,    (56)–(58)

since c > 1. Also, it follows from [10] that

    lim_{n→∞} E[ φT / an ] = ∫ x dΦβ(x) = Γ(1 − 1/β),    (59)

which shows that E φT is of order an, which is at least n^{1/β}/(ln n)^{c/β}. At the same time, Theorem 3 implies that E[φT − µT] ≲ n^{1/β}. This shows that the bounds in Theorem 3 are essentially tight.
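The following rough simulation (not from the paper; it uses a plain Pareto law with tail index a = 2.2 instead of the CDF (51), and all constants are illustrative) shows the same behavior: with β = 2 and T = arg max_i φi, the empirical bias E[φT − µT] grows polynomially in n, within a slowly growing factor of the ‖σT‖2 √(n − 1) bound of Theorem 3.

import numpy as np

rng = np.random.default_rng(2)
a, beta, trials = 2.2, 2.0, 1000
mu = a / (a - 1.0)                                   # mean of a Pareto(a) variable on [1, inf)
sigma2 = np.sqrt(a / ((a - 1.0) ** 2 * (a - 2.0)))   # 2-norm of phi_i - mu_i, in closed form
for n in [10, 100, 1000, 10000]:
    phi = rng.pareto(a, size=(trials, n)) + 1.0      # classical Pareto samples with x_m = 1
    bias = phi.max(axis=1).mean() - mu               # E[phi_T - mu_T] for T = argmax
    bound = sigma2 * np.sqrt(n - 1.0)                # Theorem 3 with beta = 2
    print(f"n={n:6d}  bias={bias:8.2f}  bound={bound:8.2f}")

The bias grows roughly like n^{1/a} ≈ n^{0.45} while the bound grows like n^{1/2}, so the two differ only by a slowly growing factor, in line with the extreme value argument above.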

VI. DISCUSSIONS

A. "Soft" generalizations of the "hard" results

Theorem 2 can be viewed as a soft generalization of the following well-known argument [4], obtained by replacing ln n with I(T; φ). Suppose we have n random variables Zi such that E Zi = 0 and the moment generating function of Zi is upper bounded by ψi(λ), λ ∈ [0, b). Then,

    e^{λ E max_i Zi} ≤ E e^{λ max_i Zi} = E max_i e^{λ Zi} ≤ Σ_i E e^{λ Zi} ≤ n e^{max_i ψi(λ)}.    (60)–(63)

Taking logarithms, we have

    E max_i Zi ≤ inf_{λ∈(0,b)} (ln n + max_i ψi(λ)) / λ = (max_i ψi)*⁻¹(ln n).    (64)

Analogously, Theorem 3 can be viewed as a generalization of the following argument. For β ≥ 1,

    (E max_i |Zi|)^β ≤ E max_i |Zi|^β ≤ Σ_{i=1}^n E|Zi|^β ≤ n max_i E|Zi|^β,

and it follows that E max_i |Zi| ≤ n^{1/β} max_i ‖Zi‖β. However, we note that Theorem 3 is not a perfect generalization of this "hard" argument: we are only able to bound the right-hand side of Theorem 3 uniformly over the distribution of T when β ≥ 2, whereas the "hard" argument above applies equally to all β ≥ 1.

More generally [11], if ψ is a nonnegative, convex, strictly increasing function on R+ that satisfies ψ(0) = 0, then, for each σ > 0,

    ψ( E max_i |Zi| / σ ) ≤ E max_i ψ(|Zi|/σ) ≤ E Σ_{i≤n} ψ(|Zi|/σ) ≤ n max_i E ψ(|Zi|/σ).    (65), (66)

If σ is such that E ψ(|Zi|/σ) ≤ 1 for all i ≤ n, then we have E max_i |Zi| ≤ σ ψ⁻¹(n).

We note that the generalization of Hölder's inequality to Orlicz spaces could provide a "soft" generalization of the arguments above. For a general Orlicz function ψ : [0, ∞) → [0, ∞], i.e., a convex function vanishing at zero that is not identically 0 or ∞ on (0, ∞), define the Luxemburg norm of a random variable X as

    ‖X‖ψ = inf{ σ > 0 : E ψ(|X|/σ) ≤ 1 },    (68)

and the Amemiya norm of a random variable X as

    ‖X‖ψ^A = inf{ (1 + E ψ(|tX|)) / t : t > 0 }.    (69)

Then we have the generalized Hölder inequality.

Lemma 4 (Generalized Hölder inequality, [12]). Denote an Orlicz function by ψ and its convex conjugate by ψ*(u) = sup{uv − ψ(v) : v ≥ 0}. Then,

    E[XY] ≤ ‖X‖ψ ‖Y‖ψ*^A.    (70)

The following theorem applies to random variables whose Luxemburg norms are bounded.

Theorem 4. Suppose φi − µi has its Luxemburg norm upper bounded by σ. Then,

    |E[φT − µT]| ≤ σ ‖ dPT,φ / (dPT dPφ) − 1 ‖ψ*^A,    (71)

where the Amemiya norm of dPT,φ/(dPT dPφ) − 1 is evaluated under the product distribution PT Pφ.

Proof: Denoting P = Pφi|T=i, Q = Pφi, Δi = (φi − µi)/σ, we have, for any t > 0,

    |EP[Δi]| = | EQ[ Δi (dP/dQ) ] |    (72)
             = (1/t) | EQ[ Δi (t dP/dQ − 1) ] |    (73)
             ≤ (1/t) [ EQ ψ(|Δi|) + EQ ψ*( | t dPφi|T=i/dPφi − 1 | ) ]    (74)
             ≤ (1/t) [ 1 + EPφ ψ*( | t dPφ|T=i/dPφ − 1 | ) ],    (75)

where in the last step we have used E ψ(|φi − µi|/σ) ≤ 1 and the data processing property (i.e., the convexity of ψ*(|·|)). Taking expectations with respect to T on both sides and taking the infimum over t, we have

    |E[φT − µT]| ≤ σ inf{ (1 + E_{PT Pφ} ψ*( | t dPT,φ/(dPT dPφ) − 1 | )) / t : t > 0 }    (76)
                 = σ ‖ dPT,φ/(dPT dPφ) − 1 ‖ψ*^A,    (77)

where dPT,φ/(dPT dPφ) follows the product distribution PT Pφ.
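The "hard" bound E max_i |Zi| ≤ σ ψ⁻¹(n) discussed above is also easy to check numerically. In the sketch below (not from the paper; ψ and σ are illustrative choices), ψ(x) = e^{x²} − 1 and the Zi are standard Gaussians; σ = √(8/3) satisfies E ψ(|Zi|/σ) = (1 − 2/σ²)^{−1/2} − 1 = 1.

import numpy as np

rng = np.random.default_rng(3)
sigma = np.sqrt(8.0 / 3.0)                     # E[exp(Z^2/sigma^2)] = 2, so E[psi(|Z|/sigma)] = 1
for n in [10, 100, 1000]:
    Z = np.abs(rng.normal(size=(10000, n)))
    emax = Z.max(axis=1).mean()
    bound = sigma * np.sqrt(np.log(1.0 + n))   # psi^{-1}(n) = sqrt(log(1 + n))
    print(f"n={n:5d}  E max|Z_i|={emax:.3f}  bound={bound:.3f}")

The bound holds for every n, as the chain (65)-(66) guarantees.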

A natural question is: are there more natural "soft" generalizations of all the "hard" arguments above?

B. Connections with other generalizations of mutual information

There exist various generalizations of mutual information in the literature, and we refer the interested reader to [13]–[15] for references. The dependence measure Iα(X; Y) introduced in this paper seems to have received only scant attention in the existing literature. Some generalizations, such as Sibson's mutual information [16], involve minimizing over an auxiliary distribution QY, and the dependence measure in [15] involves minimizing jointly over QX QY. Even when power functions are used to define φ-divergences, the functions x^α, α ≥ 1, plus affine terms, were used much more frequently than |x − 1|^α, except for the cases α = 1 (total variation distance) and α = 2 (χ²-distance). For example, the usual definition of Rényi divergence involves the function x^α but not |x − 1|^α. It remains an interesting question whether other generalizations of mutual information could prove useful in bounding the exploration bias.

VII. ACKNOWLEDGEMENT

We are grateful to James Zou for discussing with us the results in [1], which inspired this work.

APPENDIX: PROOF OF LEMMA 1

We first prove a general result regarding the φ-mutual information Iφ(X; Y) ≜ Dφ(PXY ‖ PX PY) for a general convex function φ.

Lemma 5. Let X take values in a finite set X. Then, for convex φ : R≥0 → R,

    Iφ(X; Y) ≤ φ(0) (1 − Σ_{x∈X} [PX(x)]²) + Σ_{x∈X} [PX(x)]² φ(1/PX(x)).    (78)

If φ(t) is strictly convex at t = 1, then the upper bound is tight iff X is a deterministic function of Y.

Proof of Lemma 5: It follows from the generalization of the Gel'fand–Yaglom–Peres theorem for φ-divergences [17] that it suffices to consider Y being a discrete random variable. Note that

    0 ≤ PXY(x, y) / (PX(x) PY(y)) ≤ 1/PX(x);    (79)

then by the convexity of φ we know that

    φ( PXY(x, y) / (PX(x) PY(y)) ) ≤ (1 − PXY(x, y)/PY(y)) φ(0) + (PXY(x, y)/PY(y)) φ(1/PX(x)).    (80)

Summing over x, y in the definition of Iφ(X; Y) yields the desired result. When φ(t) is strictly convex at t = 1 and equality holds, we have PXY(x, y) ∈ {0, PY(y)} for all x, y, which means that X is a function of Y.

Now the first statement in Lemma 1 follows from Lemma 5 applied to φα(t) = |t − 1|^α. For the second statement, for α ∈ [1, 2] we define f(t) = t²[(1/t − 1)^α − 1] = t^{2−α}(1 − t)^α − t² on [0, 1]. We note that f(t) is not concave on [0, 1]. However, it is straightforward to see that f″(t) is increasing on [0, 1]:

    f″(t) = −(2 − α)(α − 1)(1/t − 1)^α − 2(2 − α)α (1/t − 1)^{α−1} + α(α − 1)(1/t − 1)^{α−2} − 2,    (81)

and, if |X| ≥ 2,

    f″(|X|⁻¹) ≤ α(α − 1)(|X| − 1)^{α−2} − 2 ≤ α(α − 1) − 2 ≤ 0.    (82), (83)

Hence, if we define

    g(t) ≜ f(t) − ( (2(|X| − 1)^α − 2)/|X| − α(|X| − 1)^{α−1} ) (t − 1/|X|) − ((|X| − 1)^α − 1)/|X|²,    (84)

it is straightforward to see that g(|X|⁻¹) = g′(|X|⁻¹) = 0 and g″(t) = f″(t) on [0, 1]. As a result, g″(|X|⁻¹) ≤ 0 and g″(t) is increasing on [0, 1], so the maximum of g(t) over t ∈ [0, 1] is attained at t = |X|⁻¹ or t = 1. We have g(|X|⁻¹) = 0, and

    g(1) = −1 − ( (2(|X| − 1)^α − 2)/|X| − α(|X| − 1)^{α−1} )(1 − 1/|X|) − ((|X| − 1)^α − 1)/|X|²    (85)
         = ( (α − 1)(|X| − 1)^α − (2 − α)(|X| − 1)^{α+1} − (|X| − 1)² ) / |X|²    (86)
         = ((|X| − 1)^α / |X|²) ( α − 1 − (2 − α)(|X| − 1) − (|X| − 1)^{2−α} ).    (87)

Obviously g(1) = 0 if |X| = 1, and if |X| ≥ 2 we have

    g(1) ≤ ((|X| − 1)^α / |X|²) ( α − 1 − (2 − α) · 1 − 1 ) = 2(α − 2)(|X| − 1)^α / |X|² ≤ 0.    (88), (89)

In summary, g(t) ≤ max{g(|X|⁻¹), g(1)} = 0 for any t ∈ [0, 1], and thus

    f(PX(x)) ≤ ((|X| − 1)^α − 1)/|X|² + ( (2(|X| − 1)^α − 2)/|X| − α(|X| − 1)^{α−1} )(PX(x) − 1/|X|).    (90)

Summing over x ∈ X completes the proof.

REFERENCES

[1] D. Russo and J. Zou, "How much does your data exploration overfit? Controlling bias via information usage," arXiv preprint arXiv:1511.05219, 2015.
[2] I. Csiszár, "Information-type measures of difference of probability distributions and indirect observations," Studia Sci. Math. Hungar., vol. 2, pp. 299–318, 1967.
[3] S. M. Ali and S. D. Silvey, "A general class of coefficients of divergence of one distribution from another," Journal of the Royal Statistical Society, Series B (Methodological), pp. 131–142, 1966.
[4] S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
[5] J. Jiao, T. A. Courtade, A. No, K. Venkat, and T. Weissman, "Information measures: the curious case of the binary alphabet," IEEE Transactions on Information Theory, vol. 60, no. 12, pp. 7616–7626, 2014.
[6] R. Atar, K. Chowdhary, and P. Dupuis, "Robust bounds on risk-sensitive functionals via Rényi divergence," SIAM/ASA Journal on Uncertainty Quantification, vol. 3, no. 1, pp. 18–33, 2015.
[7] T. A. Courtade and S. Verdú, "Cumulant generating function of codeword lengths in optimal lossless compression," in 2014 IEEE International Symposium on Information Theory, IEEE, 2014, pp. 2494–2498.
[8] X. Nguyen, M. J. Wainwright, and M. I. Jordan, "Estimating divergence functionals and the likelihood ratio by convex risk minimization," IEEE Transactions on Information Theory, vol. 56, no. 11, pp. 5847–5861, 2010.
[9] A. Ruderman, M. Reid, D. García-García, and J. Petterson, "Tighter variational representations of f-divergences via restriction to probability measures," arXiv preprint arXiv:1206.4664, 2012.
[10] L. De Haan and A. Ferreira, Extreme Value Theory: An Introduction. Springer Science & Business Media, 2007.
[11] D. Pollard, "Asymptopia," 2005. [Online]. Available: http://www.stat.yale.edu/~pollard/Courses/607.spring05/handouts/finite-max.pdf
[12] H. Hudzik and L. Maligranda, "Amemiya norm equals Orlicz norm in general," Indagationes Mathematicae, vol. 11, no. 4, pp. 573–585, 2000.
[13] J. Ziv and M. Zakai, "On functionals satisfying a data-processing theorem," IEEE Transactions on Information Theory, vol. 19, no. 3, pp. 275–283, 1973.
[14] S. Verdú, "α-mutual information," in Information Theory and Applications Workshop (ITA), IEEE, 2015, pp. 1–6.
[15] A. Lapidoth and C. Pfister, "Two measures of dependence," arXiv preprint arXiv:1607.02330, 2016.
[16] R. Sibson, "Information radius," Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, vol. 14, no. 2, pp. 149–160, 1969.
[17] G. L. Gilardoni, "On a Gel'fand–Yaglom–Peres theorem for f-divergences," arXiv preprint arXiv:0911.1934, 2009.