
[cs.IT] 18 Dec 2016

**Dependence Measures Bounding the Exploration Bias for General Measurements**

Jiantao Jiao, Yanjun Han, Tsachy Weissman

**EE Department, Stanford University**

jiantao@stanford.edu, yjhan@stanford.edu, tsachy@stanford.edu

**Abstract**—We propose a framework to analyze and quantify the bias in adaptive data analysis. It generalizes the framework proposed by Russo and Zou '15, applying to all measurements whose moment generating function exists, and to all measurements with a finite p-norm. We introduce a new class of dependence measures which retain key properties of mutual information while more effectively quantifying the exploration bias for heavy tailed distributions. We provide examples in which our bounds are nearly tight, in situations where the original framework of Russo and Zou '15 does not apply.

I. INTRODUCTION

Suppose we have n measurements φi, 1 ≤ i ≤ n, of a dataset, and wish to select one of the measurements for further processing. Settings of this flavor appear frequently in applications such as model selection and reinforcement learning, where the statistician wants to exploit the information collected to infer the ground truth. We denote the expectation of each measurement φi as µi, i.e., Eφi = µi. We also denote the index of the selected measurement as the random variable T ∈ {1, 2, . . . , n}. The natural question is: how much does EφT deviate from EµT? This question is of practical importance, since a deviation of EφT from EµT corresponds to a misguided rule or generalization error in selecting from the components of φ = (φ1, φ2, . . . , φn).

It was shown in Russo and Zou [1] that one can bound the bias of the data exploration, i.e., the quantity E[φT − µT], as follows:

Theorem 1. [1, Prop. 3.1.] If φi − µi is σ-sub-Gaussian for each i ∈ {1, 2, . . . , n}, then

|E[φT − µT]| ≤ σ √(2 I(T; φ)),   (1)

where I(T; φ) denotes the mutual information between T and φ.
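The bound (1) can be checked exhaustively on a tiny discrete model. The example below is our own construction for illustration, not one from [1]: Rademacher measurements are 1-sub-Gaussian with µi = 0, and since T = arg maxᵢ φi is a deterministic function of φ, we have I(T; φ) = H(T).

```python
import itertools
import math

# Toy check of Theorem 1 (our construction): phi_i i.i.d. Rademacher (+/-1),
# hence 1-sub-Gaussian with mu_i = 0, and T = argmax_i phi_i
# (ties broken toward the smallest index).
n, sigma = 3, 1.0
outcomes = list(itertools.product([-1, 1], repeat=n))   # uniform, prob 2^-n each
p = 1.0 / len(outcomes)

bias = 0.0                 # E[phi_T - mu_T]; here mu_T = 0
p_T = [0.0] * n            # marginal distribution of the selected index T
for phi in outcomes:
    T = max(range(n), key=lambda i: (phi[i], -i))
    bias += p * phi[T]
    p_T[T] += p

# T is a deterministic function of phi, so I(T; phi) = H(T) (in nats).
mi = -sum(q * math.log(q) for q in p_T if q > 0)
bound = sigma * math.sqrt(2 * mi)
print(f"bias = {bias:.4f}, bound = {bound:.4f}")
assert abs(bias) <= bound
```

Here the exact bias is 3/4 while the right-hand side of (1) is about 1.34, so the bound holds with room to spare at this small n.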

Moreover, Russo and Zou [1] argued that for certain selection rules T that are variants of selecting the maximum among {φ1, φ2, . . . , φn}, this bound is tight for i.i.d. Gaussian and exponential distributions. Indeed, if the φi ∼ N(0, σ²) are i.i.d. and T = arg maxᵢ {φ1, φ2, . . . , φn}, it is well known that

φT / (σ √(2 ln n)) → 1 almost surely as n → ∞,   (2)

and I(T; φ) = H(T) = ln n.
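For finite n, the √(2 ln n) scaling in (2) can be complemented by exact small-n values. The closed forms below for the expected maximum of two and three i.i.d. standard normals are classical facts (not results from this paper), and they sit below the bound σ√(2 ln n) obtained from Theorem 1 with I(T; φ) = H(T) ≤ ln n:

```python
import math

# Exact expected maxima of n i.i.d. N(0,1) variables for n = 2, 3
# (classical closed forms), compared against the sqrt(2 ln n) bound.
exact = {2: 1 / math.sqrt(math.pi),          # E max of 2 standard normals
         3: 3 / (2 * math.sqrt(math.pi))}    # E max of 3 standard normals
for n, e_max in exact.items():
    bound = math.sqrt(2 * math.log(n))       # Theorem 1 with sigma = 1
    print(f"n={n}: E[phi_T] = {e_max:.4f} <= {bound:.4f}")
    assert e_max <= bound
```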

The interested reader is referred to Russo and Zou [1] for variations on this bound and its applications. Our work is motivated by the following observations:

1) Theorem 1 assumes sub-Gaussian distributions¹, whereas in many real-world applications, such as natural language processing and e-commerce recommendation systems, the measurements follow long-tail distributions which are neither sub-Gaussian nor sub-exponential;

2) the lower bounds corresponding to Theorem 1 were proved in [1] only for some specific selection rules and under the assumption of Gaussian distributions.

In this context, our main contributions are the following:

• We generalize Theorem 1 to all distributions with a nontrivial moment generating function. We show that for such distributions, the bound on the right-hand side of (1) is replaced by a function f(I(T; φ)). For Gaussian random variables the function specializes to f(x) = σ√(2x).

• We introduce a new class of dependence measures Iα(X; Y) paralleling mutual information. Concretely, for 1 ≤ α < ∞, we define

Iα(X; Y) = Dφα(PXY ‖ PX PY),   (3)

where Dφα(P ‖ Q) is the φ-divergence generated by the convex function φα(x) = |x − 1|^α. Clearly Iα(X; Y) ≥ 0, and Iα(X; Y) = 0 if and only if X and Y are independent. It satisfies the following.

Lemma 1. Suppose X takes values in a finite alphabet with cardinality |X|, while Y is arbitrary. Then

Iα(X; Y) ≤ 1 + Σ_{x∈X} [PX(x)]² [(1/PX(x) − 1)^α − 1],   (4)

which is tight iff X is a function of Y. In particular,

Iα(X; Y) ≤ (|X| − 1)[(|X| − 1)^{α−1} + 1] / |X| < 1 + |X|^{α−1}   (5)

for 1 ≤ α ≤ 2.

• We show that these measures arise in bounding the exploration bias for distributions whose moment generating functions do not exist, and construct examples implying that our bounds are essentially tight for a sequence of non-Gaussian heavy tailed distributions.

¹A generalization to sub-exponential distributions, which does not follow the form of the inequality in Theorem 1, was also presented in [1]. It is tightened by Corollary 3 of this paper.
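The bounds (4) and (5) are easy to exercise numerically. In the sketch below (toy joint pmfs of our own choosing, not examples from the paper), the diagonal pmf makes X a deterministic function of Y, so (4) should hold with equality:

```python
import itertools

# Check of the dependence measure I_alpha in (3) against Lemma 1,
# on toy 2x2 joint pmfs of our own choosing.
def I_alpha(P, alpha):
    """I_alpha(X;Y) = D_{phi_alpha}(P_XY || P_X P_Y), phi_alpha(t) = |t-1|^alpha."""
    px = [sum(row) for row in P]
    py = [sum(col) for col in zip(*P)]
    return sum(px[i] * py[j] * abs(P[i][j] / (px[i] * py[j]) - 1) ** alpha
               for i, j in itertools.product(range(len(px)), range(len(py))))

def lemma1_bound(P, alpha):
    """Right-hand side of (4)."""
    px = [sum(row) for row in P]
    return 1 + sum(p ** 2 * ((1 / p - 1) ** alpha - 1) for p in px)

alpha = 1.5
# X a deterministic function of Y: (4) should be tight.
diag = [[0.5, 0.0], [0.0, 0.5]]
assert abs(I_alpha(diag, alpha) - lemma1_bound(diag, alpha)) < 1e-12

# A generic dependent pair: (4) and the cruder bound (5) should both hold.
P = [[0.3, 0.2], [0.1, 0.4]]
cardX = len(P)
b4 = lemma1_bound(P, alpha)
b5 = (cardX - 1) * ((cardX - 1) ** (alpha - 1) + 1) / cardX
assert I_alpha(P, alpha) <= b4 <= b5 + 1e-12
print(f"I_1.5 = {I_alpha(P, alpha):.4f}, bound (4) = {b4:.4f}, bound (5) = {b5:.4f}")
```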

Our results imply that mutual information is not fundamental to bounding the exploration bias, and that one should apply different functionals for distributions with different tail behaviors. We present theorems paralleling Theorem 1 for such heavy tailed distributions, and conclude with connections to the literature on maximal inequalities and a generalization to random variables in arbitrary Orlicz spaces.

II. PRELIMINARIES

The cumulant generating function of a random variable X is defined as

ψ(λ) = ln E e^{λX}.   (6)

We assume that there exists a λ > 0 such that E e^{λX} < ∞. It follows from Hölder's inequality that there exists an interval (0, b), 0 < b ≤ ∞, on which ψ(λ) < ∞, and that ψ(λ) is convex on this interval.

A random variable X is called σ-sub-Gaussian if E e^{λX} ≤ e^{λ²σ²/2} for all λ ∈ R. It is called sub-exponential with parameter (σ, b) if E e^{λX} ≤ e^{λ²σ²/2} for all 0 ≤ λ < 1/b. It is called sub-gamma on the right tail with variance factor σ² and scale parameter c, denoted X ∈ Γ₊(σ², c), if

ψ(λ) ≤ λ²σ² / (2(1 − cλ))  for all 0 < λ < 1/c.   (7)

We note that the χ² distribution with p degrees of freedom belongs to Γ₊(2p, 2).

The β-norm of a random variable X for β ≥ 1 is defined as

‖X‖β = (E|X|^β)^{1/β} for 1 ≤ β < ∞,  and  ‖X‖∞ = ess sup |X|,   (8)

where the essential supremum is defined as ess sup X = inf{M : P(X > M) = 0}.

For two nonnegative sequences {an} and {bn}, we say an ≲ bn if there exists a constant C > 0 such that lim supₙ an/bn ≤ C. We say an ≳ bn if bn ≲ an.

The convex conjugate f* of a function f is defined as

f*(y) = sup_{x∈X} {⟨x, y⟩ − f(x)}.   (9)

The Fenchel–Young inequality states that for any function f and its convex conjugate f* we have

f(x) + f*(y) ≥ ⟨x, y⟩   (10)

for all x ∈ X, y ∈ X*. It follows from the Fenchel–Moreau theorem that f = f** if and only if f is convex and lower semi-continuous.

Csiszár [2], and independently Ali and Silvey [3], introduced φ-divergences, defined as follows.

Definition 1 (φ-divergence). The general form of a φ-divergence is

Dφ(P ‖ Q) = ∫ φ(dP/dQ) dQ,   (11)

where φ : R≥0 → R is a convex, lower semi-continuous function satisfying φ(1) = 0.

It is clear that Dφ(P ‖ Q) = DKL(P ‖ Q) when φ(x) = x ln x − x + 1.

III. MAIN RESULTS

Theorem 2. Suppose φi − µi has cumulant generating function upper bounded by ψi(λ) over the domain [0, bi), where 0 < bi ≤ ∞. Define the expected cumulant generating function ψ̄(λ) as

ψ̄(λ) = E_T ψ_T(λ),  λ ∈ [0, minᵢ bi).   (12)

Then

E[φT − µT] ≤ (ψ̄*)⁻¹(I(T; φ)),   (13)

where (ψ̄*)⁻¹ is the inverse of the convex conjugate of the function ψ̄.

Theorem 3. Suppose φi − µi has its β-norm upper bounded by σi, where 1 < β ≤ ∞, and let α be the conjugate of β via the relation 1/α + 1/β = 1. Then

|E[φT − µT]| ≤ ‖σT‖β Iα(T; φ)^{1/α}.   (14)

Moreover, for β = 2 we have

|E[φT − µT]| ≤ ‖σT‖₂ √(n − 1),   (15)

and for 2 < β ≤ ∞ (1 ≤ α < 2) we have

|E[φT − µT]| ≤ ‖σT‖β (1 + n^{α−1})^{1/α}   (16)
≤ 2 ‖σT‖β n^{1/β}.   (17)

Corollary 1. [1, Prop. 3.1.] Suppose φi − µi is σi-sub-Gaussian. Then E[φT − µT] ≤ ‖σT‖₂ √(2 I(T; φ)).

Proof: Applying Theorem 2 with ψi(λ) = λ²σi²/2, we have

ψ̄(λ) = λ² E_T σ_T² / 2.   (18)

Computing the convex conjugate of ψ̄(λ) leads us to

(ψ̄*)⁻¹(x) = √(2x E_T σ_T²),   (19)

which proves the corollary.

Corollary 2. Suppose the φi − µi are sub-gamma random variables belonging to Γ₊(σ², c). Then

E[φT − µT] ≤ √(2σ² I(T; φ)) + c I(T; φ).   (20)

Corollary 3. Suppose φi − µi is sub-exponential with parameter (σ, b). Then

E[φT − µT] ≤ σ √(2 I(T; φ))  if I(T; φ) ≤ σ²/(2b²),  and  E[φT − µT] ≤ b I(T; φ) + σ²/(2b)  otherwise.   (21)

Note that Corollary 3 is a strengthened version of [1, Prop. 3.2.].

IV. THE PATH FROM THEOREM 1 TO OUR MAIN RESULTS

Underlying the proof of [1, Prop. 3.1.] is the Donsker–Varadhan variational representation of relative entropy stated below.

Lemma 2 (Donsker–Varadhan). Let P, Q be probability measures on X and let C denote the set of functions f : X → R such that E_Q[e^{f(X)}] < ∞. If D(P ‖ Q) < ∞, then for every f ∈ C the expectation E_P[f(X)] exists, and furthermore

D(P ‖ Q) = sup_{f∈C} { E_P[f(X)] − ln E_Q[e^{f(X)}] },   (22)

where the supremum is attained when f = ln(dP/dQ). In particular, for every f ∈ C,

E_P f ≤ ln E_Q e^f + D(P ‖ Q).   (23)

It is clear that the application of Lemma 2 relies on the existence of the cumulant generating function, but not on sub-Gaussianness. It leads to the following proof of Theorem 2.

Proof: (of Theorem 2) Applying Lemma 2 and setting P = P_{φi|T=i}, Q = P_{φi}, f = λ(φi − µi), λ > 0, we have

λ(E[φi | T = i] − µi) = E_P f   (24)
≤ ψi(λ) + D(P_{φ|T=i} ‖ P_φ),   (25)

where in the last step we have used (23), the fact that the cumulant generating function of φi − µi is upper bounded by ψi(λ), and the data processing property of the KL divergence. Taking the expectation with respect to T on both sides, we have

λ E(φT − µT) ≤ ψ̄(λ) + I(T; φ),   (26)

which implies, after rearranging terms,

E(φT − µT) ≤ inf_{λ∈[0, minᵢ bi)} (ψ̄(λ) + I(T; φ)) / λ   (27)
= (ψ̄*)⁻¹(I(T; φ)),   (28)

where in the last step we have used the variational formula for the inverse of the convex conjugate [4].

It is interesting to consider how one can generalize Theorem 2 to distributions whose moment generating functions do not exist. Intuitively, a natural generalization of Lemma 2 would lead to generalizations of Theorem 2, and such a generalization should involve some φ-divergence, since φ-divergences are the only decomposable divergences that satisfy the data processing inequality for alphabets of size at least three [5], and the data processing property is needed in the proof of Theorem 2. The literature consists of two generalization paths from the Donsker–Varadhan theorem: one is to go through the Fenchel–Young inequality in convex duality theory, and the other is to go through Hölder's inequality. It is intriguing that both generalizations lead to the same results, presented in Theorem 3.

A. Generalization through Hölder's inequality

We first present the generalization path through Hölder's inequality investigated in [6]. Note that E_P f = lim_{α→0⁺} (1/α) ln ∫ e^{αf} dP. Applying Hölder's inequality, we have

∫ e^{αf} dP = E_Q [e^{αf} (dP/dQ)]   (29)
≤ (E_Q e^{αβf})^{1/β} (E_Q (dP/dQ)^γ)^{1/γ},   (30)

where 1/β + 1/γ = 1, β > 1, γ > 1. Defining c = αβ > α, taking logarithms and dividing both sides by α, we have

(1/α) ln ∫ e^{αf} dP ≤ (1/c) ln E_Q e^{cf} + ((c − α)/(cα)) ln E_Q (dP/dQ)^{c/(c−α)}.   (31)

It is clear that (31) is a generalization of Lemma 2: taking α → 0⁺ and c = 1, we recover

E_P f ≤ ln E_Q e^f + D(P ‖ Q).   (32)

Similar arguments were also used in [7].
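Before turning to the proof of Theorem 3, the piecewise bound of Corollary 3 can be cross-checked by computing (ψ̄*)⁻¹ directly: for the sub-exponential envelope ψ(λ) = λ²σ²/2 on [0, 1/b), a grid minimization of (ψ(λ) + I)/λ as in (27) should reproduce the closed form. A minimal sketch (the test parameters σ, b, I are ours):

```python
import math

# Cross-check of Corollary 3: for psi(lambda) = lambda^2 sigma^2 / 2 on [0, 1/b),
# inf over lambda in (0, 1/b) of (psi(lambda) + I) / lambda should equal the
# piecewise closed form. Parameters are arbitrary test values.
def closed_form(I, sigma, b):
    if I <= sigma ** 2 / (2 * b ** 2):
        return sigma * math.sqrt(2 * I)
    return b * I + sigma ** 2 / (2 * b)

def numerical(I, sigma, b, steps=200_000):
    best = float("inf")
    for k in range(1, steps):
        lam = (k / steps) / b            # grid over the open interval (0, 1/b)
        best = min(best, (lam ** 2 * sigma ** 2 / 2 + I) / lam)
    return best

sigma, b = 1.0, 0.5
for I in (0.1, 2.0, 10.0):               # spans both regimes of the bound
    assert abs(closed_form(I, sigma, b) - numerical(I, sigma, b)) < 1e-3
    print(f"I={I}: bound = {closed_form(I, sigma, b):.4f}")
```

Note that the two branches agree at the crossover point I = σ²/(2b²), so the bound is continuous in I.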

Now we present a proof of Theorem 3 using Hölder's inequality.

Proof: (of Theorem 3) Setting P = P_{φi|T=i}, Q = P_{φi}, ∆i = φi − µi, and noting that E_Q ∆i = 0, it follows from Hölder's inequality that

|E_P ∆i| = |E_Q [∆i (dP/dQ)]|   (33)
= |E_Q [∆i ((dP/dQ) − 1)]|   (34)
≤ (E_Q |∆i|^β)^{1/β} (E_Q |(dP/dQ) − 1|^α)^{1/α},   (35)

which implies

|E[φi | T = i] − µi| ≤ σi Dφα(P_{φi|T=i} ‖ P_{φi})^{1/α}   (36)
≤ σi Dφα(P_{φ|T=i} ‖ P_φ)^{1/α},   (37)

where in the last step we used the data processing inequality of the φα-divergence.

If β = ∞, we have α = 1 and

|E[φi | T = i] − µi| ≤ maxᵢ σi Dφ₁(P_{φ|T=i} ‖ P_φ).   (38)

Taking expectations with respect to T on both sides, we have

|E[φT − µT]| ≤ maxᵢ σi I₁(T; φ)   (39)
= ‖σT‖∞ I₁(T; φ).   (40)

If 1 ≤ β < ∞, applying Young's inequality to (37), we have

|E[φi | T = i] − µi| ≤ inf_{λ>0} (1/λ) [ λ^β σi^β / β + Dφα(P_{φ|T=i} ‖ P_φ) / α ].   (41)

Taking expectations on both sides with respect to T, and using E infλ Xλ ≤ infλ E Xλ, we have

|E(φT − µT)| ≤ inf_{λ>0} (1/λ) [ λ^β ‖σT‖β^β / β + Iα(T; φ) / α ]   (42)
= ‖σT‖β Iα(T; φ)^{1/α}.   (43)

The remaining results in Theorem 3 follow from Lemma 1.
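The chain (33)–(43) can be traced numerically on an exhaustive toy model (again our own construction, not an example from the paper): for β = α = 2 and Rademacher measurements, both Iα(T; φ) and the bias are exactly computable, and the Lemma 1 bound I₂(T; φ) ≤ n − 1 used for (15) can be verified at the same time.

```python
import itertools

# Sanity check of Theorem 3 (our toy model): phi_i i.i.d. Rademacher,
# T = argmax, beta = alpha = 2, sigma_i = ||phi_i - mu_i||_2 = 1.
n, beta = 3, 2.0
alpha = beta / (beta - 1)
outcomes = list(itertools.product([-1, 1], repeat=n))
p = 1.0 / len(outcomes)
argmax = {phi: max(range(n), key=lambda i: (phi[i], -i)) for phi in outcomes}

bias = sum(p * phi[argmax[phi]] for phi in outcomes)      # E[phi_T - mu_T]
p_T = [sum(p for phi in outcomes if argmax[phi] == t) for t in range(n)]

# I_alpha(T; phi) = sum_{t, phi} P_T(t) P_phi(phi) |P(t,phi)/(P_T(t)P_phi(phi)) - 1|^alpha,
# where P(t, phi) = P_phi(phi) * 1{t = argmax(phi)} since T is deterministic given phi.
I_a = sum(p_T[t] * p * abs((1.0 if argmax[phi] == t else 0.0) / p_T[t] - 1) ** alpha
          for t in range(n) for phi in outcomes)

bound = 1.0 * I_a ** (1 / alpha)        # ||sigma_T||_beta = 1 here
print(f"bias = {bias:.4f}, I_2 = {I_a:.4f}, bound = {bound:.4f}")
assert abs(bias) <= bound
assert I_a <= n - 1 + 1e-9              # Lemma 1 with alpha = 2: I_2 <= |X| - 1
```

In this example I₂(T; φ) = n − 1 exactly, consistent with the tightness condition of Lemma 1 (T is a deterministic function of φ).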

B. Generalization through the Fenchel–Young inequality

We have the following natural variational representation of φ-divergences, which is well known in the literature [8]:

Dφ(P ‖ Q) = ∫ φ(dP/dQ) dQ   (44)
= ∫ sup_f { f (dP/dQ) − φ*(f) } dQ   (45)
≥ E_P f − E_Q φ*(f),   (46)

if E_Q φ*(f) exists. The equality is attained when f = φ′(dP/dQ).

Note that specializing the variational representation above to φ(x) = x ln x − x + 1 cannot directly lead us to the Donsker–Varadhan result. Indeed, after taking φ(x) = x ln x − x + 1, we have

D(P ‖ Q) = sup_f { E_P[f(X)] − E_Q[e^{f(X)} − 1] },   (47)

which is weaker than the Donsker–Varadhan result since ln E_Q[e^{f(X)}] ≤ E_Q e^{f(X)} − 1. The reason is that in the KL divergence case E_Q e^f − 1 is in fact the convex conjugate of the convex function D(· ‖ Q) when P takes values in the space of all measures, but ln E_Q e^f is the convex conjugate of D(· ‖ Q) when P is constrained to be a probability measure; indeed, shrinking the primal space decreases the convex conjugate. However, we can obtain the Donsker–Varadhan result from (46). Indeed, we have

D(P ‖ Q) = sup_f { E_P[f(X)] − E_Q[e^{f(X)} − 1] }   (48)
= sup_{f,λ} { E_P[f + λ] − E_Q[e^{f+λ} − 1] },   (49)

since shifting f by a constant λ does not change the supremum. Setting E_Q[e^{f(X)+λ}] = 1, we have λ = − ln E_Q[e^{f(X)}], and the objective becomes E_P[f] − ln E_Q[e^{f(X)}]. It suffices to verify that f + λ = f − ln E_Q[e^{f(X)}] can still attain the value ln(dP/dQ); indeed, it equals ln(dP/dQ) when e^{f(X)} = (dP/dQ)(X).

Analogously, one obtains the following variational representation of Dφα(P ‖ Q) for φα(x) = |x − 1|^α, α ≥ 1:

(1/α) Dφα(P ‖ Q) = sup_f { E_P[f] − E_Q[f] − (1/β) E_Q |f|^β },   (50)

where 1/α + 1/β = 1. Following a path similar to that in the proof of Theorem 2, by setting P = P_{φi|T=i}, Q = P_{φi}, f = λ(φi − µi), λ > 0, one arrives at the results of Theorem 3.

V. TIGHTNESS OF THE BOUNDS

Theorem 3 essentially shows that if all the φi − µi have bounded β-norm with 2 ≤ β < ∞, then the exploration bias is upper bounded by n^{1/β} times the β-norm, up to a constant factor. This phenomenon was already observed in the literature [9]. We now show through extreme value theory that this bound is essentially tight for certain heavy tailed distributions.

Suppose all the φi, 1 ≤ i ≤ n, are i.i.d. with CDF

F(x) = 1 − x₀^β (ln x₀)^c / (x^β (ln x)^c),  x ≥ x₀ > e^{1/β},   (51)

where c > 1 and 0 < β < ∞.

The Type II (Fréchet type) extreme value distribution with parameter β > 0 is characterized by

Φβ(x) = 0 for x < 0,  and  Φβ(x) = e^{−x^{−β}} for x ≥ 0.   (52)

Lemma 3. [10] The necessary and sufficient condition for a distribution function F(x) to fall into the domain of attraction of Φβ(x), 0 < β < ∞, is that sup{x : F(x) < 1} = ∞ and

lim_{t→∞} (1 − F(tx)) / (1 − F(t)) = x^{−β},  x > 0.   (53)

Defining T = arg maxᵢ φi, we now care about the asymptotic distribution of φT as n → ∞. For the CDF in (51), it follows from [10] that

lim_{n→∞} P(φT / an ≤ x) = Φβ(x),   (54)

where an = F^←(1 − n^{−1}) and F^←(·) denotes the inverse function of F(x). It is easy to see that an is the solution to the equation

x^β (ln x)^c = n x₀^β (ln x₀)^c,   (55)

which satisfies an ≳ n^{1/β} / (ln n)^{c/β}.

Now we compute the β-norm of φi. We have

E φi^β = ∫₀^∞ β x^{β−1} P(φi ≥ x) dx   (56)
= ∫₀^{x₀} β x^{β−1} dx + ∫_{x₀}^∞ β x^{β−1} · x₀^β (ln x₀)^c / (x^β (ln x)^c) dx   (57)
< ∞,   (58)

since c > 1. Hence ‖φi‖β < ∞, and Theorem 3 implies that E[φT − µT] ≲ n^{1/β}.

Moreover, it follows from [10] that

lim_{n→∞} E[φT / an] = ∫ x dΦβ(x) = Γ(1 − 1/β),   (59)

which shows that EφT is of order an, which is at least of order n^{1/β} / (ln n)^{c/β}. This shows that the bounds in Theorem 3 are essentially tight.
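The growth rate of an claimed after (55) can be checked numerically by solving (55) with bisection; the parameters x₀, β, c below are arbitrary choices of ours satisfying the constraints x₀ > e^{1/β} and c > 1:

```python
import math

# Numerical illustration of (55)-(56): a_n solves x^beta (ln x)^c = n x0^beta (ln x0)^c
# and should dominate n^(1/beta) / (ln n)^(c/beta). Parameters are our test choices.
beta, c, x0 = 2.0, 2.0, math.e

def a(n):
    target = n * x0 ** beta * math.log(x0) ** c
    lo, hi = x0, 1e12
    for _ in range(200):                 # bisection on the increasing left-hand side
        mid = math.sqrt(lo * hi)
        if mid ** beta * math.log(mid) ** c < target:
            lo = mid
        else:
            hi = mid
    return lo

for n in (10 ** 3, 10 ** 6):
    lower = n ** (1 / beta) / math.log(n) ** (c / beta)
    print(f"n={n}: a_n ~ {a(n):.1f} >= {lower:.1f}")
    assert a(n) >= lower
```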

VI. DISCUSSIONS

A. "Soft" generalizations of the "hard" results

Theorem 2 can be viewed as a soft generalization of the following well-known argument [4], obtained by replacing ln n with I(T; φ). Suppose we have n random variables Zi such that E Zi = 0 and the cumulant generating function of Zi is upper bounded by ψi(λ), λ ∈ [0, b). Then, for λ ∈ (0, b),

e^{λ E maxᵢ Zi} ≤ E e^{λ maxᵢ Zi} = E maxᵢ e^{λZi} ≤ Σᵢ₌₁ⁿ E e^{λZi} ≤ n e^{maxᵢ ψi(λ)}.   (60)

Taking logarithms, we have

E maxᵢ Zi ≤ inf_{λ∈(0,b)} (ln n + maxᵢ ψi(λ)) / λ = (maxᵢ ψi)*⁻¹(ln n).   (61)

Analogously, Theorem 3 can be viewed as a generalization of the following argument. For β ≥ 1,

(E maxᵢ |Zi|)^β ≤ E maxᵢ |Zi|^β ≤ Σᵢ₌₁ⁿ E|Zi|^β ≤ n maxᵢ E|Zi|^β.   (62)

It follows that

E maxᵢ |Zi| ≤ n^{1/β} maxᵢ ‖Zi‖β.   (63)

However, we note that Theorem 3 is not a perfect generalization of this "hard" argument: we are only able to bound the right-hand side of Theorem 3 uniformly over the distribution of T when β ≥ 2, but the "hard" argument presented above applies equally to all β ≥ 1.

More generally [11], if ψ is a nonnegative, convex, strictly increasing function on R₊ that satisfies ψ(0) = 0, then, for each σ > 0,

ψ(E maxᵢ≤ₙ |Zi|/σ) ≤ E maxᵢ≤ₙ ψ(|Zi|/σ) ≤ E Σᵢ≤ₙ ψ(|Zi|/σ) ≤ n maxᵢ≤ₙ E ψ(|Zi|/σ).   (64)

If σ is such that E ψ(|Zi|/σ) ≤ 1 for all i ≤ n, then we have

E maxᵢ≤ₙ |Zi| ≤ σ ψ⁻¹(n).   (65)

We note that the generalization of Hölder's inequality to Orlicz spaces provides a "soft" generalization of the arguments above. For a general Orlicz function ψ : [0, ∞) → [0, ∞], i.e., a convex function vanishing at zero that is not identically 0 or ∞ on (0, ∞), define the Luxemburg norm of a random variable X as

‖X‖ψ = inf{σ > 0 : E ψ(|X|/σ) ≤ 1},   (66)

and the Amemiya norm of a random variable X as

‖X‖ψ^A = inf{ (1 + E ψ(|tX|)) / t : t > 0 }.   (67)

Then we have the following generalized Hölder inequality.

Lemma 4 (Generalized Hölder inequality). [12] Denote an Orlicz function by ψ and its convex conjugate by ψ*(u) = sup{uv − ψ(v) : v ≥ 0}. Then

E[XY] ≤ ‖X‖ψ ‖Y‖ψ*^A.   (68)

The following theorem applies to random variables whose Luxemburg norms are bounded.

Theorem 4. Suppose φi − µi has its Luxemburg norm upper bounded by σ. Then

|E[φT − µT]| ≤ σ ‖ dP_{T,φ}/(dP_T dP_φ) − 1 ‖ψ*^A,   (69)

where the Amemiya norm is computed with (T, φ) following the product distribution P_T P_φ.

Proof: Denoting P = P_{φi|T=i}, Q = P_{φi}, ∆i = (φi − µi)/σ, we have, for any t > 0,

|E_P[∆i]| = |E_Q[∆i (dP/dQ)]|   (70)
= (1/t) |E_Q[∆i (t (dP/dQ) − 1)]|   (71)
≤ (1/t) ( E_Q ψ(|∆i|) + E_Q ψ*(|t (dP_{φi|T=i}/dP_{φi}) − 1|) )   (72)
≤ (1/t) ( 1 + E_{Pφ} ψ*(|t (dP_{φ|T=i}/dP_φ) − 1|) ),   (73)

where (71) uses E_Q ∆i = 0, (72) uses the Fenchel–Young inequality, and in the last step we have used E_Q ψ(|∆i|) ≤ 1 together with the data processing property (i.e., the convexity of ψ*(| · |)). Taking expectations with respect to T on both sides and taking the infimum over t, we have

|E[φT − µT]| ≤ σ inf{ (1 + E_{P_T P_φ} ψ*(|t (dP_{T,φ}/(dP_T dP_φ)) − 1|)) / t : t > 0 }   (74)
= σ ‖ dP_{T,φ}/(dP_T dP_φ) − 1 ‖ψ*^A.   (75)

A natural question is: are there more natural "soft" generalizations of all the "hard" arguments above?

B. Connections with other generalizations of mutual information

There exist various generalizations of mutual information in the literature, and we refer the interested reader to [13]–[15] for references. The dependence measure Iα(X; Y) introduced in this paper seems to have received only scant attention in the existing literature.

For example, the usual definition of the Rényi divergence involves the function x^α but not |x − 1|^α. Some generalizations, such as Sibson's mutual information [16], involve minimizing over an auxiliary distribution QY, and the dependence measure in [15] involves minimizing jointly over QX QY. Even when power functions are used to define φ-divergences, the functions x^α, α ≥ 1, plus some affine terms were used much more frequently than |x − 1|^α, except for the cases α = 1 (total variation distance) and α = 2 (χ²-distance). It remains an interesting question whether other generalizations of mutual information could prove useful in bounding the exploration bias.

ACKNOWLEDGEMENT

We are grateful to James Zou for discussing with us the results in [1], which inspired this work.

APPENDIX
PROOF OF LEMMA 1

We first prove a general result regarding the φ-mutual information Iφ(X; Y) ≜ Dφ(P_{XY} ‖ P_X P_Y) for a general convex function φ.

Lemma 5. Let X take values in a finite set X. Then for convex φ : R≥0 → R,

Iφ(X; Y) ≤ φ(0) (1 − Σ_{x∈X} [P_X(x)]²) + Σ_{x∈X} [P_X(x)]² φ(1/P_X(x)).   (76)

If φ(t) is strictly convex at t = 1, then the upper bound is tight iff X is a deterministic function of Y.

Proof of Lemma 5: It follows from the generalization of the Gel'fand–Yaglom–Peres theorem for φ-divergences [17] that it suffices to consider Y being a discrete random variable. Note that

0 ≤ P_{XY}(x, y) / (P_X(x) P_Y(y)) ≤ 1/P_X(x),   (77)

so by the convexity of φ we have

φ( P_{XY}(x, y) / (P_X(x) P_Y(y)) ) ≤ (1 − P_{XY}(x, y)/P_Y(y)) φ(0) + (P_{XY}(x, y)/P_Y(y)) φ(1/P_X(x)).   (78)

Summing over x, y in the definition of Iφ(X; Y) yields the desired result. When φ(t) is strictly convex at t = 1 and equality holds, we have P_{XY}(x, y) ∈ {0, P_Y(y)} for all x, y, which means that X is a function of Y.

The first statement in Lemma 1 now follows from Lemma 5 applied to φα(t) = |t − 1|^α, for which φα(0) = 1.

For the second statement, for α ∈ [1, 2] we define f(t) = t²[(1/t − 1)^α − 1] = t^{2−α}(1 − t)^α − t² on [0, 1], so that the right-hand side of (4) equals 1 + Σ_{x∈X} f(P_X(x)). We note that f(t) is not concave on [0, 1]. Indeed,

f″(t) = −(2 − α)(α − 1)(1/t − 1)^α − 2(2 − α)α (1/t − 1)^{α−1} + α(α − 1)(1/t − 1)^{α−2} − 2   (79)

satisfies f″(t) → +∞ as t → 1⁻ for α ∈ [1, 2), and it is straightforward to see that f″(t) is increasing on [0, 1]. Moreover, if |X| ≥ 2,

f″(|X|⁻¹) ≤ α(α − 1)(|X| − 1)^{α−2} − 2 ≤ α(α − 1) − 2 ≤ 0.   (80)

Hence, if we define

g(t) ≜ f(t) − ((|X| − 1)^α − 1)/|X|² − ( (2(|X| − 1)^α − 2)/|X| − α(|X| − 1)^{α−1} ) (t − 1/|X|),   (81)

it is straightforward to see that g(|X|⁻¹) = g′(|X|⁻¹) = 0 and g″(t) = f″(t) on [0, 1]. Since g″(|X|⁻¹) ≤ 0 and g″(t) is increasing on [0, 1], the maximum of g(t) over t ∈ [0, 1] is attained at t = |X|⁻¹ or t = 1. We have g(|X|⁻¹) = 0, and

g(1) = −1 − ((|X| − 1)^α − 1)/|X|² − ( (2(|X| − 1)^α − 2)/|X| − α(|X| − 1)^{α−1} ) (1 − 1/|X|)   (82)
= ( (α − 1)(|X| − 1)^α − (2 − α)(|X| − 1)^{α+1} − (|X| − 1)² ) / |X|²   (83)
= ((|X| − 1)^α / |X|²) ( α − 1 − (2 − α)(|X| − 1) − (|X| − 1)^{2−α} ).   (84)

Obviously g(1) = 0 if |X| = 1, and if |X| ≥ 2 we have

g(1) ≤ ((|X| − 1)^α / |X|²) ( α − 1 − (2 − α) · 1 − 1 )   (85)
= 2(α − 2)(|X| − 1)^α / |X|² ≤ 0.   (86)

In summary, we have g(t) ≤ max{g(|X|⁻¹), g(1)} = 0 for any t ∈ [0, 1], and thus

f(P_X(x)) ≤ ((|X| − 1)^α − 1)/|X|² + ( (2(|X| − 1)^α − 2)/|X| − α(|X| − 1)^{α−1} ) (P_X(x) − 1/|X|).   (87)

Since Σ_{x∈X} (P_X(x) − 1/|X|) = 0, summing over x ∈ X completes the proof.

REFERENCES

[1] D. Russo and J. Zou, "How much does your data exploration overfit? Controlling bias via information usage," arXiv preprint arXiv:1511.05219, 2015.
[2] I. Csiszár, "Information-type measures of difference of probability distributions and indirect observations," Studia Sci. Math. Hungar., vol. 2, pp. 299–318, 1967.
[3] S. M. Ali and S. D. Silvey, "A general class of coefficients of divergence of one distribution from another," Journal of the Royal Statistical Society, Series B (Methodological), pp. 131–142, 1966.
[4] S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
[5] J. Jiao, T. Courtade, A. No, K. Venkat, and T. Weissman, "Information measures: the curious case of the binary alphabet," IEEE Transactions on Information Theory, vol. 60, no. 12, pp. 7616–7626, 2014.
[6] R. Atar, K. Chowdhary, and P. Dupuis, "Robust bounds on risk-sensitive functionals via Rényi divergence," SIAM/ASA Journal on Uncertainty Quantification, vol. 3, no. 1, pp. 18–33, 2015.
[7] T. Courtade and S. Verdú, "Cumulant generating function of codeword lengths in optimal lossless compression," in 2014 IEEE International Symposium on Information Theory, IEEE, 2014, pp. 2494–2498.
[8] X. Nguyen, M. J. Wainwright, and M. I. Jordan, "Estimating divergence functionals and the likelihood ratio by convex risk minimization," IEEE Transactions on Information Theory, vol. 56, no. 11, pp. 5847–5861, 2010.
[9] A. Ruderman, M. Reid, D. García-García, and J. Petterson, "Tighter variational representations of f-divergences via restriction to probability measures," arXiv preprint arXiv:1206.4664, 2012.
[10] L. de Haan and A. Ferreira, Extreme Value Theory: An Introduction. Springer Science & Business Media, 2007.
[11] D. Pollard, Asymptopia, 2005. [Online]. Available: http://www.stat.yale.edu/~pollard/Courses/607.spring05/handouts/finite-max.pdf
[12] H. Hudzik and L. Maligranda, "Amemiya norm equals Orlicz norm in general," Indagationes Mathematicae, vol. 11, no. 4, pp. 573–585, 2000.
[13] J. Ziv and M. Zakai, "On functionals satisfying a data-processing theorem," IEEE Transactions on Information Theory, vol. 19, no. 3, pp. 275–283, 1973.
[14] S. Verdú, "α-mutual information," in Information Theory and Applications Workshop (ITA), IEEE, 2015, pp. 1–6.
[15] A. Lapidoth and C. Pfister, "Two measures of dependence," arXiv preprint arXiv:1607.02330, 2016.
[16] R. Sibson, "Information radius," Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, vol. 14, no. 2, pp. 149–160, 1969.
[17] G. Gilardoni, "On a Gel'fand–Yaglom–Peres theorem for f-divergences," arXiv preprint arXiv:0911.1934, 2009.
