
Lecture Notes 2

Probability Inequalities

Inequalities are useful for bounding quantities that might otherwise be hard to compute.
They will also be used in the theory of convergence.

Theorem 1 (The Gaussian Tail Inequality) Let $X \sim N(0,1)$. Then
$$
P(|X| > \epsilon) \le \frac{2 e^{-\epsilon^2/2}}{\epsilon}.
$$
If $X_1, \ldots, X_n \sim N(0,1)$ then
$$
P(|\overline{X}_n| > \epsilon) \le \frac{2}{\sqrt{n}\,\epsilon}\, e^{-n\epsilon^2/2}.
$$

Proof. The density of $X$ is $\phi(x) = (2\pi)^{-1/2} e^{-x^2/2}$. Hence,
$$
P(X > \epsilon) = \int_\epsilon^\infty \phi(s)\,ds
\le \int_\epsilon^\infty \frac{s}{\epsilon}\,\phi(s)\,ds
= \frac{1}{\epsilon} \int_\epsilon^\infty s\,\phi(s)\,ds
= -\frac{1}{\epsilon}\,\phi(s)\Big|_\epsilon^\infty
= \frac{\phi(\epsilon)}{\epsilon} \le \frac{e^{-\epsilon^2/2}}{\epsilon}.
$$
By symmetry,
$$
P(|X| > \epsilon) \le \frac{2 e^{-\epsilon^2/2}}{\epsilon}.
$$
Now let $X_1, \ldots, X_n \sim N(0,1)$. Then $\overline{X}_n = n^{-1} \sum_{i=1}^n X_i \sim N(0, 1/n)$. Thus, $\overline{X}_n \stackrel{d}{=} n^{-1/2} Z$ where $Z \sim N(0,1)$ and
$$
P(|\overline{X}_n| > \epsilon) = P(n^{-1/2}|Z| > \epsilon) = P(|Z| > \sqrt{n}\,\epsilon)
\le \frac{2}{\sqrt{n}\,\epsilon}\, e^{-n\epsilon^2/2}. \qquad \Box
$$
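To make the bound concrete, here is a minimal NumPy sketch that estimates $P(|\overline{X}_n| > \epsilon)$ by Monte Carlo and compares it with the bound of Theorem 1. The values of $n$, $\epsilon$, and the number of replications are arbitrary illustrative choices.

```python
import numpy as np

# Minimal Monte Carlo check of the Gaussian tail bound (Theorem 1).
# n, eps, and the number of replications are illustrative choices.
rng = np.random.default_rng(0)
n, eps, reps = 25, 0.5, 200_000

xbar = rng.standard_normal((reps, n)).mean(axis=1)     # one X-bar_n per replication
empirical = np.mean(np.abs(xbar) > eps)                # estimated P(|X-bar_n| > eps)
bound = 2.0 / (np.sqrt(n) * eps) * np.exp(-n * eps**2 / 2)

print(f"empirical tail probability: {empirical:.5f}")
print(f"Gaussian tail bound:        {bound:.5f}")
```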

Theorem 2 (Markov's inequality) Let $X$ be a non-negative random variable and suppose that $E(X)$ exists. For any $t > 0$,
$$
P(X > t) \le \frac{E(X)}{t}. \qquad (1)
$$
Proof. Since $X > 0$,
$$
E(X) = \int_0^\infty x\,p(x)\,dx = \int_0^t x\,p(x)\,dx + \int_t^\infty x\,p(x)\,dx
\ge \int_t^\infty x\,p(x)\,dx \ge t \int_t^\infty p(x)\,dx = t\,P(X > t). \qquad \Box
$$

Theorem 3 (Chebyshev's inequality) Let $\mu = E(X)$ and $\sigma^2 = \mathrm{Var}(X)$. Then,
$$
P(|X - \mu| \ge t) \le \frac{\sigma^2}{t^2}
\quad\text{and}\quad
P(|Z| \ge k) \le \frac{1}{k^2} \qquad (2)
$$
where $Z = (X - \mu)/\sigma$. In particular, $P(|Z| > 2) \le 1/4$ and $P(|Z| > 3) \le 1/9$.
Proof. We use Markov's inequality to conclude that
$$
P(|X - \mu| \ge t) = P(|X - \mu|^2 \ge t^2) \le \frac{E(X - \mu)^2}{t^2} = \frac{\sigma^2}{t^2}.
$$
The second part follows by setting $t = k\sigma$. $\Box$


If $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$ and $\overline{X}_n = n^{-1}\sum_{i=1}^n X_i$, then $\mathrm{Var}(\overline{X}_n) = \mathrm{Var}(X_1)/n = p(1-p)/n$ and
$$
P(|\overline{X}_n - p| > \epsilon) \le \frac{\mathrm{Var}(\overline{X}_n)}{\epsilon^2} = \frac{p(1-p)}{n\epsilon^2} \le \frac{1}{4 n \epsilon^2}
$$
since $p(1-p) \le \frac{1}{4}$ for all $p$.
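The following minimal NumPy sketch illustrates this distribution-free Chebyshev bound for Bernoulli sample means; $p$, $n$, $\epsilon$, and the number of replications are arbitrary illustrative choices.

```python
import numpy as np

# Simulation of the Chebyshev bound for Bernoulli sample means.
# p, n, eps and the number of replications are illustrative choices.
rng = np.random.default_rng(0)
p, n, eps, reps = 0.3, 100, 0.1, 200_000

xbar = rng.binomial(n, p, size=reps) / n          # one X-bar_n per replication
empirical = np.mean(np.abs(xbar - p) > eps)       # estimated P(|X-bar_n - p| > eps)
bound = 1.0 / (4 * n * eps**2)                    # distribution-free Chebyshev bound

print(f"empirical: {empirical:.4f}, Chebyshev bound: {bound:.4f}")
```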

Hoeffding's Inequality

Hoeffding's inequality is similar in spirit to Markov's inequality, but it is sharper.


We begin with the following important result.
Lemma 4 Suppose that $E(X) = 0$ and that $a \le X \le b$. Then
$$
E(e^{tX}) \le e^{t^2 (b-a)^2/8}.
$$
Recall that a function $g$ is convex if for each $x, y$ and each $\alpha \in [0,1]$,
$$
g(\alpha x + (1-\alpha) y) \le \alpha g(x) + (1-\alpha) g(y).
$$
Proof. Since $a \le X \le b$, we can write $X$ as a convex combination of $a$ and $b$, namely, $X = \alpha b + (1-\alpha) a$ where $\alpha = (X - a)/(b - a)$. By the convexity of the function $y \mapsto e^{ty}$ we have
$$
e^{tX} \le \alpha e^{tb} + (1-\alpha) e^{ta} = \frac{X - a}{b - a}\, e^{tb} + \frac{b - X}{b - a}\, e^{ta}.
$$
Take expectations of both sides and use the fact that $E(X) = 0$ to get
$$
E e^{tX} \le \frac{-a}{b - a}\, e^{tb} + \frac{b}{b - a}\, e^{ta} = e^{g(u)} \qquad (3)
$$
where $u = t(b-a)$, $g(u) = -\gamma u + \log(1 - \gamma + \gamma e^u)$ and $\gamma = -a/(b-a)$. Note that $g(0) = g'(0) = 0$. Also, $g''(u) \le 1/4$ for all $u > 0$. By Taylor's theorem, there is a $\xi \in (0, u)$ such that
$$
g(u) = g(0) + u g'(0) + \frac{u^2}{2} g''(\xi) = \frac{u^2}{2} g''(\xi) \le \frac{u^2}{8} = \frac{t^2 (b-a)^2}{8}.
$$
Hence, $E e^{tX} \le e^{g(u)} \le e^{t^2(b-a)^2/8}$. $\Box$
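As a numerical illustration of Lemma 4, the minimal NumPy sketch below estimates $E(e^{tX})$ by Monte Carlo for a (approximately) mean-zero bounded variable and compares it with $e^{t^2(b-a)^2/8}$. The centered uniform variable and the values of $t$ are arbitrary choices for illustration.

```python
import numpy as np

# Numerical check of Hoeffding's lemma (Lemma 4): for a mean-zero variable
# supported on an interval of width (b - a), E(exp(tX)) <= exp(t^2 (b-a)^2 / 8).
# The centered Uniform[a, b] variable below is an arbitrary illustration.
rng = np.random.default_rng(0)
a, b = -1.0, 2.0
x = rng.uniform(a, b, size=1_000_000)
x = x - x.mean()                      # center so that E(X) is (approximately) 0

for t in [0.5, 1.0, 2.0]:
    mgf = np.mean(np.exp(t * x))                  # estimated E(exp(tX))
    bound = np.exp(t**2 * (b - a)**2 / 8)
    print(f"t={t}: E(e^tX) ~ {mgf:.4f} <= bound {bound:.4f}")
```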

Next, we need to use Chernoff's method.

Lemma 5 Let $X$ be a random variable. Then
$$
P(X > \epsilon) \le \inf_{t \ge 0} e^{-t\epsilon}\, E(e^{tX}).
$$
Proof. For any $t > 0$,
$$
P(X > \epsilon) = P(e^{X} > e^{\epsilon}) = P(e^{tX} > e^{t\epsilon}) \le e^{-t\epsilon}\, E(e^{tX}).
$$
Since this is true for every $t \ge 0$, the result follows. $\Box$
Theorem 6 (Hoeffding's Inequality) Let $Y_1, \ldots, Y_n$ be iid observations such that $E(Y_i) = \mu$ and $a \le Y_i \le b$. Then, for any $\epsilon > 0$,
$$
P\bigl(|\overline{Y}_n - \mu| \ge \epsilon\bigr) \le 2 e^{-2n\epsilon^2/(b-a)^2}. \qquad (4)
$$

Corollary 7 If $X_1, X_2, \ldots, X_n$ are independent with $P(a \le X_i \le b) = 1$ and common mean $\mu$, then, with probability at least $1 - \delta$,
$$
|\overline{X}_n - \mu| \le \sqrt{\frac{c}{2n} \log\left(\frac{2}{\delta}\right)} \qquad (5)
$$
where $c = (b-a)^2$.
Proof. Without loss of generality, we assume that $\mu = 0$. First we have
$$
P(|\overline{Y}_n| \ge \epsilon) = P(\overline{Y}_n \ge \epsilon) + P(\overline{Y}_n \le -\epsilon)
= P(\overline{Y}_n \ge \epsilon) + P(-\overline{Y}_n \ge \epsilon).
$$
Next we use Chernoff's method. For any $t > 0$, we have, from Markov's inequality, that
$$
P(\overline{Y}_n \ge \epsilon) = P\left(\sum_{i=1}^n Y_i \ge n\epsilon\right)
= P\left(e^{t\sum_{i=1}^n Y_i} \ge e^{tn\epsilon}\right)
\le e^{-tn\epsilon}\, E\left(e^{t\sum_{i=1}^n Y_i}\right)
= e^{-tn\epsilon} \prod_i E(e^{tY_i}) = e^{-tn\epsilon} \bigl(E(e^{tY_1})\bigr)^n.
$$
From Lemma 4, $E(e^{tY_i}) \le e^{t^2(b-a)^2/8}$. So
$$
P(\overline{Y}_n \ge \epsilon) \le e^{-tn\epsilon}\, e^{t^2 n (b-a)^2/8}.
$$
This is minimized by setting $t = 4\epsilon/(b-a)^2$ giving
$$
P(\overline{Y}_n \ge \epsilon) \le e^{-2n\epsilon^2/(b-a)^2}.
$$
Applying the same argument to $P(-\overline{Y}_n \ge \epsilon)$ yields the result. $\Box$


Example 8 Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$. From Hoeffding's inequality,
$$
P(|\overline{X}_n - p| > \epsilon) \le 2 e^{-2n\epsilon^2}.
$$
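The following minimal NumPy sketch compares the Hoeffding bound of Example 8 with the Chebyshev bound for Bernoulli sample means, and computes the confidence interval of Corollary 7 with $a = 0$, $b = 1$ (so $c = 1$). The values of $p$, $n$, $\epsilon$, and $\delta$ are arbitrary illustrative choices.

```python
import numpy as np

# Hoeffding (Example 8) vs. Chebyshev for Bernoulli sample means, plus the
# Corollary 7 confidence interval; p, n, eps, delta are illustrative choices.
rng = np.random.default_rng(0)
p, n, eps, reps = 0.3, 100, 0.15, 200_000

xbar = rng.binomial(n, p, size=reps) / n
empirical = np.mean(np.abs(xbar - p) > eps)
hoeffding = 2 * np.exp(-2 * n * eps**2)      # valid since 0 <= X_i <= 1
chebyshev = 1 / (4 * n * eps**2)
print(f"empirical {empirical:.4f}, Hoeffding {hoeffding:.4f}, Chebyshev {chebyshev:.4f}")

# Corollary 7 with a = 0, b = 1 (so c = 1): a (1 - delta) confidence interval.
delta = 0.05
half_width = np.sqrt(np.log(2 / delta) / (2 * n))
print(f"with prob. >= {1 - delta}: |p-hat - p| <= {half_width:.4f}")
```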

The Bounded Difference Inequality

So far we have focused on sums of random variables. The following result extends Hoeffding's inequality to more general functions $g(x_1, \ldots, x_n)$. Here we consider McDiarmid's inequality, also known as the Bounded Difference inequality.

Theorem 9 (McDiarmid) Let $X_1, \ldots, X_n$ be independent random variables. Suppose that
$$
\sup_{x_1, \ldots, x_n, x_i'} \Bigl| g(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n) - g(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n) \Bigr| \le c_i \qquad (6)
$$
for $i = 1, \ldots, n$. Then
$$
P\Bigl( g(X_1, \ldots, X_n) - E\bigl(g(X_1, \ldots, X_n)\bigr) \ge \epsilon \Bigr)
\le \exp\left( -\frac{2\epsilon^2}{\sum_{i=1}^n c_i^2} \right). \qquad (7)
$$

Proof. Let $V_i = E(g \mid X_1, \ldots, X_i) - E(g \mid X_1, \ldots, X_{i-1})$. Then $g(X_1, \ldots, X_n) - E(g(X_1, \ldots, X_n)) = \sum_{i=1}^n V_i$ and $E(V_i \mid X_1, \ldots, X_{i-1}) = 0$. Using a similar argument as in Hoeffding's lemma we have
$$
E\bigl(e^{tV_i} \mid X_1, \ldots, X_{i-1}\bigr) \le e^{t^2 c_i^2 / 8}. \qquad (8)
$$
Now, for any $t > 0$,
$$
\begin{aligned}
P\Bigl( g(X_1, \ldots, X_n) - E(g(X_1, \ldots, X_n)) \ge \epsilon \Bigr)
&= P\left( \sum_{i=1}^n V_i \ge \epsilon \right)
= P\left( e^{t\sum_{i=1}^n V_i} \ge e^{t\epsilon} \right)
\le e^{-t\epsilon}\, E\left( e^{t \sum_{i=1}^n V_i} \right) \\
&= e^{-t\epsilon}\, E\left( e^{t \sum_{i=1}^{n-1} V_i}\, E\left( e^{tV_n} \,\Big|\, X_1, \ldots, X_{n-1} \right) \right) \\
&\le e^{-t\epsilon}\, e^{t^2 c_n^2/8}\, E\left( e^{t \sum_{i=1}^{n-1} V_i} \right) \\
&\;\;\vdots \\
&\le e^{-t\epsilon}\, e^{t^2 \sum_{i=1}^n c_i^2 / 8}.
\end{aligned}
$$
The result follows by taking $t = 4\epsilon / \sum_{i=1}^n c_i^2$. $\Box$
Example 10 If we take $g(x_1, \ldots, x_n) = n^{-1} \sum_{i=1}^n x_i$ then we get back Hoeffding's inequality.
Example 11 Suppose we throw $m$ balls into $n$ bins. What fraction of bins are empty? Let $Z$ be the number of empty bins and let $F = Z/n$ be the fraction of empty bins. We can write $Z = \sum_{i=1}^n Z_i$ where $Z_i = 1$ if bin $i$ is empty and $Z_i = 0$ otherwise. Then
$$
\mu = E(Z) = \sum_{i=1}^n E(Z_i) = n (1 - 1/n)^m = n\, e^{m \log(1 - 1/n)} \approx n\, e^{-m/n}
$$
and $\theta = E(F) = \mu/n \approx e^{-m/n}$. How close is $Z$ to $\mu$? Note that the $Z_i$'s are not independent so we cannot just apply Hoeffding. Instead, we proceed as follows.

Define variables $X_1, \ldots, X_m$ where $X_s = i$ if ball $s$ falls into bin $i$. Then $Z = g(X_1, \ldots, X_m)$. If we move one ball into a different bin, then $Z$ can change by at most 1. Hence, (6) holds with $c_i = 1$ and so
$$
P(|Z - \mu| > t) \le 2 e^{-2t^2/m}.
$$
Recall that the fraction of empty bins is $F = Z/n$ with mean $\theta = \mu/n$. We have
$$
P(|F - \theta| > t) = P(|Z - \mu| > nt) \le 2 e^{-2 n^2 t^2 / m}.
$$
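The following minimal NumPy sketch simulates the balls-into-bins experiment and compares the spread of the empty-bin fraction $F$ around its mean with the McDiarmid tail bound above. The values of $n$, $m$, $t$, and the number of replications are arbitrary illustrative choices.

```python
import numpy as np

# Simulation of Example 11: throw m balls into n bins and compare the
# deviation of the empty-bin fraction F from E(F) with the McDiarmid bound.
rng = np.random.default_rng(0)
n, m, reps, t = 100, 200, 20_000, 0.1

balls = rng.integers(0, n, size=(reps, m))        # X_s = bin of ball s, per replication
occupied = np.zeros((reps, n), dtype=bool)
rows = np.repeat(np.arange(reps), m)
occupied[rows, balls.ravel()] = True              # mark bins that received at least one ball
F = 1.0 - occupied.sum(axis=1) / n                # fraction of empty bins
theta = (1 - 1 / n) ** m                          # exact E(F)

empirical = np.mean(np.abs(F - theta) > t)
bound = 2 * np.exp(-2 * n**2 * t**2 / m)
print(f"E(F) = {theta:.4f}, empirical tail {empirical:.4f}, McDiarmid bound {bound:.4f}")
```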

Bounds on Expected Values

Theorem 12 (Cauchy-Schwarz inequality) If $X$ and $Y$ have finite variances then
$$
E|XY| \le \sqrt{E(X^2)\, E(Y^2)}. \qquad (9)
$$
The Cauchy-Schwarz inequality can be written as
$$
\mathrm{Cov}^2(X, Y) \le \sigma_X^2\, \sigma_Y^2.
$$

Recall that a function $g$ is convex if for each $x, y$ and each $\alpha \in [0,1]$,
$$
g(\alpha x + (1-\alpha) y) \le \alpha g(x) + (1-\alpha) g(y).
$$
If $g$ is twice differentiable and $g''(x) \ge 0$ for all $x$, then $g$ is convex. It can be shown that if $g$ is convex, then $g$ lies above any line that touches $g$ at some point, called a tangent line. A function $g$ is concave if $-g$ is convex. Examples of convex functions are $g(x) = x^2$ and $g(x) = e^x$. Examples of concave functions are $g(x) = -x^2$ and $g(x) = \log x$.
Theorem 13 (Jensen's inequality) If $g$ is convex, then
$$
E g(X) \ge g(E X). \qquad (10)
$$
If $g$ is concave, then
$$
E g(X) \le g(E X). \qquad (11)
$$
Proof. Let $L(x) = a + bx$ be a line, tangent to $g(x)$ at the point $E(X)$. Since $g$ is convex, it lies above the line $L(x)$. So,
$$
E g(X) \ge E L(X) = E(a + bX) = a + b E(X) = L(E(X)) = g(E X). \qquad \Box
$$

Example 14 From Jensen's inequality we see that $E(X^2) \ge (E X)^2$.

Example 15 (Kullback-Leibler Distance) Define the Kullback-Leibler distance between two densities $p$ and $q$ by
$$
D(p, q) = \int p(x) \log\frac{p(x)}{q(x)}\, dx.
$$
Note that $D(p, p) = 0$. We will use Jensen to show that $D(p, q) \ge 0$. Let $X \sim p$. Then
$$
-D(p, q) = E \log\frac{q(X)}{p(X)} \le \log E\left(\frac{q(X)}{p(X)}\right)
= \log \int p(x)\, \frac{q(x)}{p(x)}\, dx = \log \int q(x)\, dx = \log(1) = 0.
$$
So $-D(p, q) \le 0$ and hence $D(p, q) \ge 0$.
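As a small numerical illustration of Example 15, the sketch below computes the Kullback-Leibler distance between two discrete distributions (arbitrary illustrative choices) and confirms that it is non-negative and zero when the distributions coincide.

```python
import numpy as np

# Numerical illustration of Example 15: D(p, q) >= 0 and D(p, p) = 0.
# The two discrete distributions below are arbitrary choices.
def kl(p, q):
    """D(p, q) = sum_x p(x) log(p(x)/q(x)) for discrete distributions p, q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

print(f"D(p, q) = {kl(p, q):.4f} >= 0")
print(f"D(q, p) = {kl(q, p):.4f} >= 0 (note: D is not symmetric)")
print(f"D(p, p) = {kl(p, p):.4f}")
```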
Example 16 It follows from Jensen's inequality that three types of means can be ordered. Assume that $a_1, \ldots, a_n$ are positive numbers and define the arithmetic, geometric and harmonic means as
$$
a_A = \frac{1}{n}(a_1 + \cdots + a_n), \quad
a_G = (a_1 \cdots a_n)^{1/n}, \quad
a_H = \frac{1}{\frac{1}{n}\left(\frac{1}{a_1} + \cdots + \frac{1}{a_n}\right)}.
$$
Then $a_H \le a_G \le a_A$.
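A quick numerical check of this ordering, using an arbitrary set of positive numbers:

```python
import numpy as np

# Check of a_H <= a_G <= a_A from Example 16 for an arbitrary positive vector.
a = np.array([0.5, 2.0, 3.0, 8.0])

a_A = a.mean()                          # arithmetic mean
a_G = np.exp(np.mean(np.log(a)))        # geometric mean
a_H = 1.0 / np.mean(1.0 / a)            # harmonic mean

print(f"harmonic {a_H:.4f} <= geometric {a_G:.4f} <= arithmetic {a_A:.4f}")
```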

Suppose we have an exponential bound on $P(X_n > \epsilon)$. In that case we can bound $E(X_n)$ as follows.

Theorem 17 Suppose that $X_n \ge 0$ and that for every $\epsilon > 0$,
$$
P(X_n > \epsilon) \le c_1 e^{-c_2 n \epsilon^2} \qquad (12)
$$
for some $c_2 > 0$ and $c_1 > 1/e$. Then,
$$
E(X_n) \le \sqrt{\frac{C}{n}} \qquad (13)
$$
where $C = (1 + \log(c_1))/c_2$.


Proof. Recall that for any nonnegative random variable $Y$, $E(Y) = \int_0^\infty P(Y \ge t)\, dt$. Hence, for any $a > 0$,
$$
E(X_n^2) = \int_0^\infty P(X_n^2 \ge t)\, dt
= \int_0^a P(X_n^2 \ge t)\, dt + \int_a^\infty P(X_n^2 \ge t)\, dt
\le a + \int_a^\infty P(X_n^2 \ge t)\, dt.
$$
Equation (12) implies that $P(X_n^2 > t) \le c_1 e^{-c_2 n t}$. Hence,
$$
E(X_n^2) \le a + \int_a^\infty P(X_n^2 \ge t)\, dt \le a + c_1 \int_a^\infty e^{-c_2 n t}\, dt
= a + \frac{c_1 e^{-c_2 n a}}{c_2 n}.
$$
Set $a = \log(c_1)/(n c_2)$ and conclude that
$$
E(X_n^2) \le \frac{\log(c_1)}{n c_2} + \frac{1}{n c_2} = \frac{1 + \log(c_1)}{n c_2}.
$$
Finally, we have
$$
E(X_n) \le \sqrt{E(X_n^2)} \le \sqrt{\frac{1 + \log(c_1)}{n c_2}}. \qquad \Box
$$
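For a concrete instance, take $X_n = |\widehat{p}_n - p|$ for Bernoulli$(p)$ data: by Hoeffding (Example 8) it satisfies (12) with $c_1 = 2$ and $c_2 = 2$, so Theorem 17 gives $E|\widehat{p}_n - p| \le \sqrt{(1 + \log 2)/(2n)}$. The minimal NumPy sketch below checks this numerically; $p$ and the sample sizes are arbitrary illustrative choices.

```python
import numpy as np

# Check of Theorem 17 with X_n = |p-hat_n - p| for Bernoulli(p) data.
# By Hoeffding, (12) holds with c1 = 2, c2 = 2, so E|p-hat_n - p| <= sqrt((1+log 2)/(2n)).
rng = np.random.default_rng(0)
p, reps = 0.3, 200_000

for n in [10, 100, 1000]:
    phat = rng.binomial(n, p, size=reps) / n
    expected_dev = np.mean(np.abs(phat - p))            # estimated E|p-hat_n - p|
    bound = np.sqrt((1 + np.log(2)) / (2 * n))          # Theorem 17 with c1 = c2 = 2
    print(f"n={n}: E|p-hat - p| ~ {expected_dev:.4f} <= {bound:.4f}")
```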


Now we consider bounding the maximum of a set of random variables.
Theorem 18 Let $X_1, \ldots, X_n$ be random variables. Suppose there exists $\sigma > 0$ such that $E(e^{tX_i}) \le e^{t^2\sigma^2/2}$ for all $t > 0$. Then
$$
E\left( \max_{1 \le i \le n} X_i \right) \le \sigma \sqrt{2 \log n}. \qquad (14)
$$
Proof. By Jensen's inequality,
$$
\exp\left\{ t\, E\left( \max_{1 \le i \le n} X_i \right) \right\}
\le E\left( \exp\left\{ t \max_{1 \le i \le n} X_i \right\} \right)
= E\left( \max_{1 \le i \le n} \exp\{t X_i\} \right)
\le \sum_{i=1}^n E\bigl( \exp\{t X_i\} \bigr) \le n\, e^{t^2\sigma^2/2}.
$$
Thus,
$$
E\left( \max_{1 \le i \le n} X_i \right) \le \frac{\log n}{t} + \frac{t \sigma^2}{2}.
$$
The result follows by setting $t = \sqrt{2 \log n}/\sigma$. $\Box$
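The minimal NumPy sketch below illustrates Theorem 18 for standard normal variables (which satisfy the moment generating function condition with $\sigma = 1$): the average maximum of $n$ draws stays below $\sqrt{2 \log n}$. The values of $n$ and the number of replications are arbitrary illustrative choices.

```python
import numpy as np

# Simulation of Theorem 18 for standard normals (sigma = 1):
# E(max_i X_i) should stay below sqrt(2 log n).
rng = np.random.default_rng(0)
reps = 5_000

for n in [10, 100, 1000]:
    x = rng.standard_normal((reps, n))
    expected_max = x.max(axis=1).mean()        # estimated E(max_i X_i)
    bound = np.sqrt(2 * np.log(n))
    print(f"n={n}: E(max) ~ {expected_max:.3f} <= {bound:.3f}")
```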

OP and oP

In statistics, probability and machine learning, we make use of $o_P$ and $O_P$ notation.

Recall first that $a_n = o(1)$ means that $a_n \to 0$ as $n \to \infty$, and $a_n = o(b_n)$ means that $a_n/b_n = o(1)$.
$a_n = O(1)$ means that $a_n$ is eventually bounded, that is, for all large $n$, $|a_n| \le C$ for some $C > 0$. $a_n = O(b_n)$ means that $a_n/b_n = O(1)$.

We write $a_n \asymp b_n$ if both $a_n/b_n$ and $b_n/a_n$ are eventually bounded. In computer science this is written as $a_n = \Theta(b_n)$, but we prefer $a_n \asymp b_n$ since, in statistics, $\Theta$ often denotes a parameter space.
Now we move on to the probabilistic versions. Say that $Y_n = o_P(1)$ if, for every $\epsilon > 0$,
$$
P(|Y_n| > \epsilon) \to 0.
$$
Say that $Y_n = o_P(a_n)$ if $Y_n/a_n = o_P(1)$.
Say that $Y_n = O_P(1)$ if, for every $\epsilon > 0$, there is a $C > 0$ such that
$$
P(|Y_n| > C) \le \epsilon.
$$
Say that $Y_n = O_P(a_n)$ if $Y_n/a_n = O_P(1)$.

Lets use Hoeffdings inequality to show that sample proportions are OP (1/ n) within the
the true mean. Let Y1 , . . . , Yn be coin flips i.e. Yi {0, 1}. Let p = P(Yi = 1). Let
n

1X
Yi .
pbn =
n i=1

We will show that: pbn p = oP (1) and pbn p = OP (1/ n).


We have that
2
P(|b
pn p| > ) 2e2n 0
and so pbn p = oP (1). Also,



C
pn p| > C) = P |b
pn p| >
P( n|b
n
2

2e2C <
if we pick C large enough. Hence,

n(b
pn p) = OP (1) and so


1
pbn p = OP
.
n
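The minimal NumPy sketch below illustrates this behaviour: as $n$ grows, the probability that $|\widehat{p}_n - p|$ exceeds a fixed threshold shrinks, while the tail probability of the rescaled error $\sqrt{n}\,|\widehat{p}_n - p|$ stays roughly constant. The values of $p$, $C$, and the sample sizes are arbitrary illustrative choices.

```python
import numpy as np

# Simulation of the o_P(1) and O_P(1/sqrt(n)) claims for sample proportions.
rng = np.random.default_rng(0)
p, reps, C = 0.3, 100_000, 1.0

for n in [100, 1000, 10_000]:
    phat = rng.binomial(n, p, size=reps) / n
    raw = np.mean(np.abs(phat - p) > 0.01)                 # -> 0 (o_P(1) behaviour)
    scaled = np.mean(np.sqrt(n) * np.abs(phat - p) > C)    # stays bounded (O_P(1))
    print(f"n={n}: P(|p-hat - p| > 0.01) ~ {raw:.3f}, "
          f"P(sqrt(n)|p-hat - p| > {C}) ~ {scaled:.3f}")
```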

Make sure you can prove the following:
$$
O_P(1)\, o_P(1) = o_P(1)
$$
$$
O_P(1)\, O_P(1) = O_P(1)
$$
$$
o_P(1) + O_P(1) = O_P(1)
$$
$$
O_P(a_n)\, o_P(b_n) = o_P(a_n b_n)
$$
$$
O_P(a_n)\, O_P(b_n) = O_P(a_n b_n)
$$
