
Lecture Notes 2

Probability Inequalities

Inequalities are useful for bounding quantities that might otherwise be hard to compute.
They will also be used in the theory of convergence.

Theorem 1 (The Gaussian Tail Inequality) Let $X \sim N(0,1)$. Then
$$
P(|X| > \epsilon) \le \frac{2 e^{-\epsilon^2/2}}{\epsilon}.
$$
If $X_1, \ldots, X_n \sim N(0,1)$ then
$$
P(|\overline{X}_n| > \epsilon) \le \frac{2}{\sqrt{n}\,\epsilon}\, e^{-n\epsilon^2/2}.
$$

Proof. The density of $X$ is $\phi(x) = (2\pi)^{-1/2} e^{-x^2/2}$. Hence,
$$
P(X > \epsilon) = \int_\epsilon^\infty \phi(s)\,ds
\le \int_\epsilon^\infty \frac{s}{\epsilon}\,\phi(s)\,ds
= \frac{1}{\epsilon} \int_\epsilon^\infty s\,\phi(s)\,ds
= -\frac{1}{\epsilon}\,\phi(s)\Big|_\epsilon^\infty
= \frac{\phi(\epsilon)}{\epsilon} \le \frac{e^{-\epsilon^2/2}}{\epsilon}.
$$
By symmetry,
$$
P(|X| > \epsilon) \le \frac{2 e^{-\epsilon^2/2}}{\epsilon}.
$$
Now let $X_1, \ldots, X_n \sim N(0,1)$. Then $\overline{X}_n = n^{-1} \sum_{i=1}^n X_i \sim N(0, 1/n)$. Thus, $\overline{X}_n \stackrel{d}{=} n^{-1/2} Z$ where $Z \sim N(0,1)$ and
$$
P(|\overline{X}_n| > \epsilon) = P(n^{-1/2}|Z| > \epsilon) = P(|Z| > \sqrt{n}\,\epsilon)
\le \frac{2}{\sqrt{n}\,\epsilon}\, e^{-n\epsilon^2/2}. \qquad \Box
$$
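To make the bound concrete, here is a minimal NumPy sketch that estimates $P(|\overline{X}_n| > \epsilon)$ by Monte Carlo and compares it with the bound of Theorem 1. The values of $n$, $\epsilon$, and the number of replications are arbitrary illustrative choices.

```python
import numpy as np

# Minimal Monte Carlo check of the Gaussian tail bound (Theorem 1).
# n, eps, and the number of replications are illustrative choices.
rng = np.random.default_rng(0)
n, eps, reps = 25, 0.5, 200_000

xbar = rng.standard_normal((reps, n)).mean(axis=1)     # one X-bar_n per replication
empirical = np.mean(np.abs(xbar) > eps)                # estimated P(|X-bar_n| > eps)
bound = 2.0 / (np.sqrt(n) * eps) * np.exp(-n * eps**2 / 2)

print(f"empirical tail probability: {empirical:.5f}")
print(f"Gaussian tail bound:        {bound:.5f}")
```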

Theorem 2 (Markov's inequality) Let $X$ be a non-negative random variable and suppose that $E(X)$ exists. For any $t > 0$,
$$
P(X > t) \le \frac{E(X)}{t}. \qquad (1)
$$
Proof. Since $X > 0$,
$$
E(X) = \int_0^\infty x\,p(x)\,dx = \int_0^t x\,p(x)\,dx + \int_t^\infty x\,p(x)\,dx
\ge \int_t^\infty x\,p(x)\,dx \ge t \int_t^\infty p(x)\,dx = t\,P(X > t). \qquad \Box
$$

Theorem 3 (Chebyshev's inequality) Let $\mu = E(X)$ and $\sigma^2 = \mathrm{Var}(X)$. Then,
$$
P(|X - \mu| \ge t) \le \frac{\sigma^2}{t^2}
\quad\text{and}\quad
P(|Z| \ge k) \le \frac{1}{k^2} \qquad (2)
$$
where $Z = (X - \mu)/\sigma$. In particular, $P(|Z| > 2) \le 1/4$ and $P(|Z| > 3) \le 1/9$.
Proof. We use Markov's inequality to conclude that
$$
P(|X - \mu| \ge t) = P(|X - \mu|^2 \ge t^2) \le \frac{E(X - \mu)^2}{t^2} = \frac{\sigma^2}{t^2}.
$$
The second part follows by setting $t = k\sigma$. $\Box$


If $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$ and $\overline{X}_n = n^{-1}\sum_{i=1}^n X_i$, then $\mathrm{Var}(\overline{X}_n) = \mathrm{Var}(X_1)/n = p(1-p)/n$ and
$$
P(|\overline{X}_n - p| > \epsilon) \le \frac{\mathrm{Var}(\overline{X}_n)}{\epsilon^2} = \frac{p(1-p)}{n\epsilon^2} \le \frac{1}{4 n \epsilon^2}
$$
since $p(1-p) \le \frac{1}{4}$ for all $p$.
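The following minimal NumPy sketch illustrates this distribution-free Chebyshev bound for Bernoulli sample means; $p$, $n$, $\epsilon$, and the number of replications are arbitrary illustrative choices.

```python
import numpy as np

# Simulation of the Chebyshev bound for Bernoulli sample means.
# p, n, eps and the number of replications are illustrative choices.
rng = np.random.default_rng(0)
p, n, eps, reps = 0.3, 100, 0.1, 200_000

xbar = rng.binomial(n, p, size=reps) / n          # one X-bar_n per replication
empirical = np.mean(np.abs(xbar - p) > eps)       # estimated P(|X-bar_n - p| > eps)
bound = 1.0 / (4 * n * eps**2)                    # distribution-free Chebyshev bound

print(f"empirical: {empirical:.4f}, Chebyshev bound: {bound:.4f}")
```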

Hoeffding's Inequality

Hoeffding's inequality is similar in spirit to Markov's inequality, but it is sharper.


We begin with the following important result.
Lemma 4 Suppose that $E(X) = 0$ and that $a \le X \le b$. Then
$$
E(e^{tX}) \le e^{t^2 (b-a)^2/8}.
$$
Recall that a function $g$ is convex if for each $x, y$ and each $\alpha \in [0,1]$,
$$
g(\alpha x + (1-\alpha) y) \le \alpha g(x) + (1-\alpha) g(y).
$$
Proof. Since $a \le X \le b$, we can write $X$ as a convex combination of $a$ and $b$, namely, $X = \alpha b + (1-\alpha) a$ where $\alpha = (X - a)/(b - a)$. By the convexity of the function $y \mapsto e^{ty}$ we have
$$
e^{tX} \le \alpha e^{tb} + (1-\alpha) e^{ta} = \frac{X - a}{b - a}\, e^{tb} + \frac{b - X}{b - a}\, e^{ta}.
$$
Take expectations of both sides and use the fact that $E(X) = 0$ to get
$$
E e^{tX} \le \frac{-a}{b - a}\, e^{tb} + \frac{b}{b - a}\, e^{ta} = e^{g(u)} \qquad (3)
$$
where $u = t(b-a)$, $g(u) = -\gamma u + \log(1 - \gamma + \gamma e^u)$ and $\gamma = -a/(b-a)$. Note that $g(0) = g'(0) = 0$. Also, $g''(u) \le 1/4$ for all $u > 0$. By Taylor's theorem, there is a $\xi \in (0, u)$ such that
$$
g(u) = g(0) + u g'(0) + \frac{u^2}{2} g''(\xi) = \frac{u^2}{2} g''(\xi) \le \frac{u^2}{8} = \frac{t^2 (b-a)^2}{8}.
$$
Hence, $E e^{tX} \le e^{g(u)} \le e^{t^2(b-a)^2/8}$. $\Box$
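As a numerical illustration of Lemma 4, the minimal NumPy sketch below estimates $E(e^{tX})$ by Monte Carlo for a (approximately) mean-zero bounded variable and compares it with $e^{t^2(b-a)^2/8}$. The centered uniform variable and the values of $t$ are arbitrary choices for illustration.

```python
import numpy as np

# Numerical check of Hoeffding's lemma (Lemma 4): for a mean-zero variable
# supported on an interval of width (b - a), E(exp(tX)) <= exp(t^2 (b-a)^2 / 8).
# The centered Uniform[a, b] variable below is an arbitrary illustration.
rng = np.random.default_rng(0)
a, b = -1.0, 2.0
x = rng.uniform(a, b, size=1_000_000)
x = x - x.mean()                      # center so that E(X) is (approximately) 0

for t in [0.5, 1.0, 2.0]:
    mgf = np.mean(np.exp(t * x))                  # estimated E(exp(tX))
    bound = np.exp(t**2 * (b - a)**2 / 8)
    print(f"t={t}: E(e^tX) ~ {mgf:.4f} <= bound {bound:.4f}")
```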

Next, we need to use Chernoff's method.

Lemma 5 Let $X$ be a random variable. Then
$$
P(X > \epsilon) \le \inf_{t \ge 0} e^{-t\epsilon}\, E(e^{tX}).
$$
Proof. For any $t > 0$,
$$
P(X > \epsilon) = P(e^{X} > e^{\epsilon}) = P(e^{tX} > e^{t\epsilon}) \le e^{-t\epsilon}\, E(e^{tX}).
$$
Since this is true for every $t \ge 0$, the result follows. $\Box$
Theorem 6 (Hoeffding's Inequality) Let $Y_1, \ldots, Y_n$ be iid observations such that $E(Y_i) = \mu$ and $a \le Y_i \le b$. Then, for any $\epsilon > 0$,
$$
P\bigl(|\overline{Y}_n - \mu| \ge \epsilon\bigr) \le 2 e^{-2n\epsilon^2/(b-a)^2}. \qquad (4)
$$

Corollary 7 If $X_1, X_2, \ldots, X_n$ are independent with $P(a \le X_i \le b) = 1$ and common mean $\mu$, then, with probability at least $1 - \delta$,
$$
|\overline{X}_n - \mu| \le \sqrt{\frac{c}{2n} \log\left(\frac{2}{\delta}\right)} \qquad (5)
$$
where $c = (b-a)^2$.
Proof. Without loss of generality, we assume that $\mu = 0$. First we have
$$
P(|\overline{Y}_n| \ge \epsilon) = P(\overline{Y}_n \ge \epsilon) + P(\overline{Y}_n \le -\epsilon)
= P(\overline{Y}_n \ge \epsilon) + P(-\overline{Y}_n \ge \epsilon).
$$
Next we use Chernoff's method. For any $t > 0$, we have, from Markov's inequality, that
$$
P(\overline{Y}_n \ge \epsilon) = P\left(\sum_{i=1}^n Y_i \ge n\epsilon\right)
= P\left(e^{t\sum_{i=1}^n Y_i} \ge e^{tn\epsilon}\right)
\le e^{-tn\epsilon}\, E\left(e^{t\sum_{i=1}^n Y_i}\right)
= e^{-tn\epsilon} \prod_i E(e^{tY_i}) = e^{-tn\epsilon} \bigl(E(e^{tY_1})\bigr)^n.
$$
From Lemma 4, $E(e^{tY_i}) \le e^{t^2(b-a)^2/8}$. So
$$
P(\overline{Y}_n \ge \epsilon) \le e^{-tn\epsilon}\, e^{t^2 n (b-a)^2/8}.
$$
This is minimized by setting $t = 4\epsilon/(b-a)^2$ giving
$$
P(\overline{Y}_n \ge \epsilon) \le e^{-2n\epsilon^2/(b-a)^2}.
$$
Applying the same argument to $P(-\overline{Y}_n \ge \epsilon)$ yields the result. $\Box$


Example 8 Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$. From Hoeffding's inequality,
$$
P(|\overline{X}_n - p| > \epsilon) \le 2 e^{-2n\epsilon^2}.
$$
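The following minimal NumPy sketch compares the Hoeffding bound of Example 8 with the Chebyshev bound for Bernoulli sample means, and computes the confidence interval of Corollary 7 with $a = 0$, $b = 1$ (so $c = 1$). The values of $p$, $n$, $\epsilon$, and $\delta$ are arbitrary illustrative choices.

```python
import numpy as np

# Hoeffding (Example 8) vs. Chebyshev for Bernoulli sample means, plus the
# Corollary 7 confidence interval; p, n, eps, delta are illustrative choices.
rng = np.random.default_rng(0)
p, n, eps, reps = 0.3, 100, 0.15, 200_000

xbar = rng.binomial(n, p, size=reps) / n
empirical = np.mean(np.abs(xbar - p) > eps)
hoeffding = 2 * np.exp(-2 * n * eps**2)      # valid since 0 <= X_i <= 1
chebyshev = 1 / (4 * n * eps**2)
print(f"empirical {empirical:.4f}, Hoeffding {hoeffding:.4f}, Chebyshev {chebyshev:.4f}")

# Corollary 7 with a = 0, b = 1 (so c = 1): a (1 - delta) confidence interval.
delta = 0.05
half_width = np.sqrt(np.log(2 / delta) / (2 * n))
print(f"with prob. >= {1 - delta}: |p-hat - p| <= {half_width:.4f}")
```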

The Bounded Difference Inequality

So far we have focused on sums of random variables. The following result extends Hoeffding's inequality to more general functions $g(x_1, \ldots, x_n)$. Here we consider McDiarmid's inequality, also known as the Bounded Difference inequality.

Theorem 9 (McDiarmid) Let $X_1, \ldots, X_n$ be independent random variables. Suppose that
$$
\sup_{x_1, \ldots, x_n, x_i'} \Bigl| g(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n) - g(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n) \Bigr| \le c_i \qquad (6)
$$
for $i = 1, \ldots, n$. Then
$$
P\Bigl( g(X_1, \ldots, X_n) - E\bigl(g(X_1, \ldots, X_n)\bigr) \ge \epsilon \Bigr)
\le \exp\left( -\frac{2\epsilon^2}{\sum_{i=1}^n c_i^2} \right). \qquad (7)
$$

Proof. Let $V_i = E(g \mid X_1, \ldots, X_i) - E(g \mid X_1, \ldots, X_{i-1})$. Then $g(X_1, \ldots, X_n) - E(g(X_1, \ldots, X_n)) = \sum_{i=1}^n V_i$ and $E(V_i \mid X_1, \ldots, X_{i-1}) = 0$. Using a similar argument as in Hoeffding's lemma we have
$$
E\bigl(e^{tV_i} \mid X_1, \ldots, X_{i-1}\bigr) \le e^{t^2 c_i^2 / 8}. \qquad (8)
$$
Now, for any $t > 0$,
$$
\begin{aligned}
P\Bigl( g(X_1, \ldots, X_n) - E(g(X_1, \ldots, X_n)) \ge \epsilon \Bigr)
&= P\left( \sum_{i=1}^n V_i \ge \epsilon \right)
= P\left( e^{t\sum_{i=1}^n V_i} \ge e^{t\epsilon} \right)
\le e^{-t\epsilon}\, E\left( e^{t \sum_{i=1}^n V_i} \right) \\
&= e^{-t\epsilon}\, E\left( e^{t \sum_{i=1}^{n-1} V_i}\, E\left( e^{tV_n} \,\Big|\, X_1, \ldots, X_{n-1} \right) \right) \\
&\le e^{-t\epsilon}\, e^{t^2 c_n^2/8}\, E\left( e^{t \sum_{i=1}^{n-1} V_i} \right) \\
&\;\;\vdots \\
&\le e^{-t\epsilon}\, e^{t^2 \sum_{i=1}^n c_i^2 / 8}.
\end{aligned}
$$
The result follows by taking $t = 4\epsilon / \sum_{i=1}^n c_i^2$. $\Box$
Example 10 If we take $g(x_1, \ldots, x_n) = n^{-1} \sum_{i=1}^n x_i$ then we get back Hoeffding's inequality.
Example 11 Suppose we throw $m$ balls into $n$ bins. What fraction of bins are empty? Let $Z$ be the number of empty bins and let $F = Z/n$ be the fraction of empty bins. We can write $Z = \sum_{i=1}^n Z_i$ where $Z_i = 1$ if bin $i$ is empty and $Z_i = 0$ otherwise. Then
$$
\mu = E(Z) = \sum_{i=1}^n E(Z_i) = n (1 - 1/n)^m = n\, e^{m \log(1 - 1/n)} \approx n\, e^{-m/n}
$$
and $\theta = E(F) = \mu/n \approx e^{-m/n}$. How close is $Z$ to $\mu$? Note that the $Z_i$'s are not independent so we cannot just apply Hoeffding. Instead, we proceed as follows.

Define variables $X_1, \ldots, X_m$ where $X_s = i$ if ball $s$ falls into bin $i$. Then $Z = g(X_1, \ldots, X_m)$. If we move one ball into a different bin, then $Z$ can change by at most 1. Hence, (6) holds with $c_i = 1$ and so
$$
P(|Z - \mu| > t) \le 2 e^{-2t^2/m}.
$$
Recall that the fraction of empty bins is $F = Z/n$ with mean $\theta = \mu/n$. We have
$$
P(|F - \theta| > t) = P(|Z - \mu| > nt) \le 2 e^{-2 n^2 t^2 / m}.
$$
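The following minimal NumPy sketch simulates the balls-into-bins experiment and compares the spread of the empty-bin fraction $F$ around its mean with the McDiarmid tail bound above. The values of $n$, $m$, $t$, and the number of replications are arbitrary illustrative choices.

```python
import numpy as np

# Simulation of Example 11: throw m balls into n bins and compare the
# deviation of the empty-bin fraction F from E(F) with the McDiarmid bound.
rng = np.random.default_rng(0)
n, m, reps, t = 100, 200, 20_000, 0.1

balls = rng.integers(0, n, size=(reps, m))        # X_s = bin of ball s, per replication
occupied = np.zeros((reps, n), dtype=bool)
rows = np.repeat(np.arange(reps), m)
occupied[rows, balls.ravel()] = True              # mark bins that received at least one ball
F = 1.0 - occupied.sum(axis=1) / n                # fraction of empty bins
theta = (1 - 1 / n) ** m                          # exact E(F)

empirical = np.mean(np.abs(F - theta) > t)
bound = 2 * np.exp(-2 * n**2 * t**2 / m)
print(f"E(F) = {theta:.4f}, empirical tail {empirical:.4f}, McDiarmid bound {bound:.4f}")
```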

Bounds on Expected Values

Theorem 12 (Cauchy-Schwarz inequality) If $X$ and $Y$ have finite variances then
$$
E|XY| \le \sqrt{E(X^2)\, E(Y^2)}. \qquad (9)
$$
The Cauchy-Schwarz inequality can be written as
$$
\mathrm{Cov}^2(X, Y) \le \sigma_X^2\, \sigma_Y^2.
$$

Recall that a function $g$ is convex if for each $x, y$ and each $\alpha \in [0,1]$,
$$
g(\alpha x + (1-\alpha) y) \le \alpha g(x) + (1-\alpha) g(y).
$$
If $g$ is twice differentiable and $g''(x) \ge 0$ for all $x$, then $g$ is convex. It can be shown that if $g$ is convex, then $g$ lies above any line that touches $g$ at some point, called a tangent line. A function $g$ is concave if $-g$ is convex. Examples of convex functions are $g(x) = x^2$ and $g(x) = e^x$. Examples of concave functions are $g(x) = -x^2$ and $g(x) = \log x$.
Theorem 13 (Jensen's inequality) If $g$ is convex, then
$$
E g(X) \ge g(E X). \qquad (10)
$$
If $g$ is concave, then
$$
E g(X) \le g(E X). \qquad (11)
$$
Proof. Let $L(x) = a + bx$ be a line, tangent to $g(x)$ at the point $E(X)$. Since $g$ is convex, it lies above the line $L(x)$. So,
$$
E g(X) \ge E L(X) = E(a + bX) = a + b E(X) = L(E(X)) = g(E X). \qquad \Box
$$

Example 14 From Jensen's inequality we see that $E(X^2) \ge (E X)^2$.

Example 15 (Kullback-Leibler Distance) Define the Kullback-Leibler distance between two densities $p$ and $q$ by
$$
D(p, q) = \int p(x) \log\frac{p(x)}{q(x)}\, dx.
$$
Note that $D(p, p) = 0$. We will use Jensen to show that $D(p, q) \ge 0$. Let $X \sim p$. Then
$$
-D(p, q) = E \log\frac{q(X)}{p(X)} \le \log E\left(\frac{q(X)}{p(X)}\right)
= \log \int p(x)\, \frac{q(x)}{p(x)}\, dx = \log \int q(x)\, dx = \log(1) = 0.
$$
So $-D(p, q) \le 0$ and hence $D(p, q) \ge 0$.
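As a small numerical illustration of Example 15, the sketch below computes the Kullback-Leibler distance between two discrete distributions (arbitrary illustrative choices) and confirms that it is non-negative and zero when the distributions coincide.

```python
import numpy as np

# Numerical illustration of Example 15: D(p, q) >= 0 and D(p, p) = 0.
# The two discrete distributions below are arbitrary choices.
def kl(p, q):
    """D(p, q) = sum_x p(x) log(p(x)/q(x)) for discrete distributions p, q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

print(f"D(p, q) = {kl(p, q):.4f} >= 0")
print(f"D(q, p) = {kl(q, p):.4f} >= 0 (note: D is not symmetric)")
print(f"D(p, p) = {kl(p, p):.4f}")
```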
Example 16 It follows from Jensen's inequality that three types of means can be ordered. Assume that $a_1, \ldots, a_n$ are positive numbers and define the arithmetic, geometric and harmonic means as
$$
a_A = \frac{1}{n}(a_1 + \cdots + a_n), \quad
a_G = (a_1 \cdots a_n)^{1/n}, \quad
a_H = \frac{1}{\frac{1}{n}\left(\frac{1}{a_1} + \cdots + \frac{1}{a_n}\right)}.
$$
Then $a_H \le a_G \le a_A$.
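A quick numerical check of this ordering, using an arbitrary set of positive numbers:

```python
import numpy as np

# Check of a_H <= a_G <= a_A from Example 16 for an arbitrary positive vector.
a = np.array([0.5, 2.0, 3.0, 8.0])

a_A = a.mean()                          # arithmetic mean
a_G = np.exp(np.mean(np.log(a)))        # geometric mean
a_H = 1.0 / np.mean(1.0 / a)            # harmonic mean

print(f"harmonic {a_H:.4f} <= geometric {a_G:.4f} <= arithmetic {a_A:.4f}")
```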

Suppose we have an exponential bound on $P(X_n > \epsilon)$. In that case we can bound $E(X_n)$ as follows.

Theorem 17 Suppose that $X_n \ge 0$ and that for every $\epsilon > 0$,
$$
P(X_n > \epsilon) \le c_1 e^{-c_2 n \epsilon^2} \qquad (12)
$$
for some $c_2 > 0$ and $c_1 > 1/e$. Then,
$$
E(X_n) \le \sqrt{\frac{C}{n}} \qquad (13)
$$
where $C = (1 + \log(c_1))/c_2$.


Proof. Recall that for any nonnegative random variable $Y$, $E(Y) = \int_0^\infty P(Y \ge t)\, dt$. Hence, for any $a > 0$,
$$
E(X_n^2) = \int_0^\infty P(X_n^2 \ge t)\, dt
= \int_0^a P(X_n^2 \ge t)\, dt + \int_a^\infty P(X_n^2 \ge t)\, dt
\le a + \int_a^\infty P(X_n^2 \ge t)\, dt.
$$
Equation (12) implies that $P(X_n^2 > t) \le c_1 e^{-c_2 n t}$. Hence,
$$
E(X_n^2) \le a + \int_a^\infty P(X_n^2 \ge t)\, dt \le a + c_1 \int_a^\infty e^{-c_2 n t}\, dt
= a + \frac{c_1 e^{-c_2 n a}}{c_2 n}.
$$
Set $a = \log(c_1)/(n c_2)$ and conclude that
$$
E(X_n^2) \le \frac{\log(c_1)}{n c_2} + \frac{1}{n c_2} = \frac{1 + \log(c_1)}{n c_2}.
$$
Finally, we have
$$
E(X_n) \le \sqrt{E(X_n^2)} \le \sqrt{\frac{1 + \log(c_1)}{n c_2}}. \qquad \Box
$$
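For a concrete instance, take $X_n = |\widehat{p}_n - p|$ for Bernoulli$(p)$ data: by Hoeffding (Example 8) it satisfies (12) with $c_1 = 2$ and $c_2 = 2$, so Theorem 17 gives $E|\widehat{p}_n - p| \le \sqrt{(1 + \log 2)/(2n)}$. The minimal NumPy sketch below checks this numerically; $p$ and the sample sizes are arbitrary illustrative choices.

```python
import numpy as np

# Check of Theorem 17 with X_n = |p-hat_n - p| for Bernoulli(p) data.
# By Hoeffding, (12) holds with c1 = 2, c2 = 2, so E|p-hat_n - p| <= sqrt((1+log 2)/(2n)).
rng = np.random.default_rng(0)
p, reps = 0.3, 200_000

for n in [10, 100, 1000]:
    phat = rng.binomial(n, p, size=reps) / n
    expected_dev = np.mean(np.abs(phat - p))            # estimated E|p-hat_n - p|
    bound = np.sqrt((1 + np.log(2)) / (2 * n))          # Theorem 17 with c1 = c2 = 2
    print(f"n={n}: E|p-hat - p| ~ {expected_dev:.4f} <= {bound:.4f}")
```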


Now we consider bounding the maximum of a set of random variables.
Theorem 18 Let $X_1, \ldots, X_n$ be random variables. Suppose there exists $\sigma > 0$ such that $E(e^{tX_i}) \le e^{t^2\sigma^2/2}$ for all $t > 0$. Then
$$
E\left( \max_{1 \le i \le n} X_i \right) \le \sigma \sqrt{2 \log n}. \qquad (14)
$$
Proof. By Jensen's inequality,
$$
\exp\left\{ t\, E\left( \max_{1 \le i \le n} X_i \right) \right\}
\le E\left( \exp\left\{ t \max_{1 \le i \le n} X_i \right\} \right)
= E\left( \max_{1 \le i \le n} \exp\{t X_i\} \right)
\le \sum_{i=1}^n E\bigl( \exp\{t X_i\} \bigr) \le n\, e^{t^2\sigma^2/2}.
$$
Thus,
$$
E\left( \max_{1 \le i \le n} X_i \right) \le \frac{\log n}{t} + \frac{t \sigma^2}{2}.
$$
The result follows by setting $t = \sqrt{2 \log n}/\sigma$. $\Box$
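The minimal NumPy sketch below illustrates Theorem 18 for standard normal variables (which satisfy the moment generating function condition with $\sigma = 1$): the average maximum of $n$ draws stays below $\sqrt{2 \log n}$. The values of $n$ and the number of replications are arbitrary illustrative choices.

```python
import numpy as np

# Simulation of Theorem 18 for standard normals (sigma = 1):
# E(max_i X_i) should stay below sqrt(2 log n).
rng = np.random.default_rng(0)
reps = 5_000

for n in [10, 100, 1000]:
    x = rng.standard_normal((reps, n))
    expected_max = x.max(axis=1).mean()        # estimated E(max_i X_i)
    bound = np.sqrt(2 * np.log(n))
    print(f"n={n}: E(max) ~ {expected_max:.3f} <= {bound:.3f}")
```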

OP and oP

In statistics, probability and machine learning, we make use of $o_P$ and $O_P$ notation.

Recall first that $a_n = o(1)$ means that $a_n \to 0$ as $n \to \infty$, and $a_n = o(b_n)$ means that $a_n/b_n = o(1)$.
$a_n = O(1)$ means that $a_n$ is eventually bounded, that is, for all large $n$, $|a_n| \le C$ for some $C > 0$. $a_n = O(b_n)$ means that $a_n/b_n = O(1)$.

We write $a_n \asymp b_n$ if both $a_n/b_n$ and $b_n/a_n$ are eventually bounded. In computer science this is written as $a_n = \Theta(b_n)$, but we prefer $a_n \asymp b_n$ since, in statistics, $\Theta$ often denotes a parameter space.
Now we move on to the probabilistic versions. Say that $Y_n = o_P(1)$ if, for every $\epsilon > 0$,
$$
P(|Y_n| > \epsilon) \to 0.
$$
Say that $Y_n = o_P(a_n)$ if $Y_n/a_n = o_P(1)$.
Say that $Y_n = O_P(1)$ if, for every $\epsilon > 0$, there is a $C > 0$ such that
$$
P(|Y_n| > C) \le \epsilon.
$$
Say that $Y_n = O_P(a_n)$ if $Y_n/a_n = O_P(1)$.

Lets use Hoeffdings inequality to show that sample proportions are OP (1/ n) within the
the true mean. Let Y1 , . . . , Yn be coin flips i.e. Yi {0, 1}. Let p = P(Yi = 1). Let
n

1X
Yi .
pbn =
n i=1

We will show that: pbn p = oP (1) and pbn p = OP (1/ n).


We have that
2
P(|b
pn p| > ) 2e2n 0
and so pbn p = oP (1). Also,



C
pn p| > C) = P |b
pn p| >
P( n|b
n
2

2e2C <
if we pick C large enough. Hence,

n(b
pn p) = OP (1) and so


1
pbn p = OP
.
n
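The minimal NumPy sketch below illustrates this behaviour: as $n$ grows, the probability that $|\widehat{p}_n - p|$ exceeds a fixed threshold shrinks, while the tail probability of the rescaled error $\sqrt{n}\,|\widehat{p}_n - p|$ stays roughly constant. The values of $p$, $C$, and the sample sizes are arbitrary illustrative choices.

```python
import numpy as np

# Simulation of the o_P(1) and O_P(1/sqrt(n)) claims for sample proportions.
rng = np.random.default_rng(0)
p, reps, C = 0.3, 100_000, 1.0

for n in [100, 1000, 10_000]:
    phat = rng.binomial(n, p, size=reps) / n
    raw = np.mean(np.abs(phat - p) > 0.01)                 # -> 0 (o_P(1) behaviour)
    scaled = np.mean(np.sqrt(n) * np.abs(phat - p) > C)    # stays bounded (O_P(1))
    print(f"n={n}: P(|p-hat - p| > 0.01) ~ {raw:.3f}, "
          f"P(sqrt(n)|p-hat - p| > {C}) ~ {scaled:.3f}")
```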

Make sure you can prove the following:
$$
O_P(1)\, o_P(1) = o_P(1)
$$
$$
O_P(1)\, O_P(1) = O_P(1)
$$
$$
o_P(1) + O_P(1) = O_P(1)
$$
$$
O_P(a_n)\, o_P(b_n) = o_P(a_n b_n)
$$
$$
O_P(a_n)\, O_P(b_n) = O_P(a_n b_n)
$$
