
15-359: Probability and Computing

Inequalities

The worst form of inequality is to try to make unequal things equal. —Aristotle

1. Introduction

We have already seen several times the need to approximate event probabilities. Recall the birthday
problem, for example, or balls and bins calculations where we approximated the probability of the
“tail” of the binomial distribution P(X ≥ k), which is explicitly given by
P(X ≥ k) = Σ_{j=k}^{n} (n choose j) p^j (1 − p)^{n−j}

The need to approximate this type of probability comes up in many applications. The difficulty is
that the sum involving binomial coefficients is unwieldy; we’d like to replace it by a single factor
that we can use in further analysis.
We’ll now develop bounds more systematically, leading to methods that hold for general random
variables, not just the binomial. There is a tradition of naming inequalities after their inventors.
To be preserved in eponymity, devise an effective probability bound—there is still room for more!

2. Markov
We’ll begin with one of the simplest inequalities, called the Markov
inequality after Andrei A. Markov. Markov was actually a student of
Chebyshev, whom we’ll hear from in a moment. He is best known for
initiating the study of sequences of dependent random variables, now
known as Markov processes.
In spite of its simplicity, the Markov inequality is a very important
bound because it is used as a “subroutine” to derive more sophisticated
and effective bounds.
Let X be a non-negative, discrete random variable, and let c > 0 be
a positive constant. We want to derive a bound on the tail probability
P(X ≥ c); this is the total probability mass after (and including) the
point X = c. Since X is discrete, we can write E[X] = Σ_x x p_X(x), where the sum is over the set of

values x ≥ 0 taken by the random variable X. Now, we can bound this expectation from below as
E[X] = Σ_x x p_X(x)
     = Σ_{0 ≤ x < c} x p_X(x) + Σ_{x ≥ c} x p_X(x)
     ≥ Σ_{x ≥ c} x p_X(x)
     ≥ Σ_{x ≥ c} c p_X(x)
     = c Σ_{x ≥ c} p_X(x)
     = c P(X ≥ c)

This gives us the Markov inequality

P(X ≥ c) ≤ E[X] / c
Although we haven’t yet covered continuous random variables, you may well be able to see that
this same inequality holds more generally.
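As a quick sanity check (not part of the original notes), here is a minimal Python sketch comparing the Markov bound to an exact tail probability for a small non-negative discrete random variable, the sum of two fair dice; the variable names are ours.

from itertools import product

# X = sum of two fair dice, a non-negative discrete rv with E[X] = 7.
outcomes = [a + b for a, b in product(range(1, 7), repeat=2)]
c = 10
exact = sum(x >= c for x in outcomes) / len(outcomes)   # exact P(X >= 10) = 1/6
mean = sum(outcomes) / len(outcomes)                    # E[X] = 7
print(exact, "<=", mean / c)                            # 0.1667 <= 0.7

The bound holds, but it is loose, which is exactly the point made in the example below.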

2.1. Example

Suppose we flip a fair coin n times. What is the probability of getting more than (3/4)n heads? By
the Markov inequality, we have that

P(X ≥ 3n/4) ≤ E[X] / (3n/4)
            = (n/2) / (3n/4)
            = 2/3
There is something obviously “wrong” with this bound. In particular, it is independent of the
number of flips n. Intuitively, the probability should go to zero as n gets large.
But note that the Markov inequality involves only the mean E[X] of the random variable. It does
not depend at all on the “shape” or “spread” of the distribution, which must be related to how
fast the tail probabilities thin out. So, the inequality is typically quite weak. (This should not be
confused with the fact that it is “tight” in certain cases, as seen on the homework.)
Our next inequality does depend on the spread of the distribution.

3. Chebyshev
According to Kolmogorov [2], Pafnuty Chebyshev was one of the first
mathematicians to make use of “random quantities” and expected val-
ues, and to study rigorous inequalities for random variables that were
valid under general conditions. In 1867 he published the paper On
mean values that presented the inequality commonly associated with
his name.
Unlike the Markov inequality, the Chebyshev inequality involves the
“shape” of a distribution through its variance. The idea is to apply
the Markov inequality to the deviation of a random variable from its
mean.
For a general random variable X we wish to bound the probability of
the event {|X − E[X]| > a}. Note that this is the same as the event
{(X − E[X])² > a²}. Since Y = f(X) = (X − E[X])² is a non-negative random variable, we can
apply the Markov inequality to obtain

P(|X − E[X]| > a) = P((X − E[X])² > a²)
                  ≤ E[(X − E[X])²] / a²
                  = Var(X) / a²
As a special case, it follows that
P(|X − E[X]| ≥ a σ(X)) ≤ 1/a²
In particular, for an arbitrary random variable there is no more than a total probability mass of
1/4 lying two or more standard deviations away from the mean. An alternative formulation is the following:

P(|X − E[X]| ≥ a E[X]) ≤ Var(X) / (a² E[X]²)
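As a rough empirical illustration (ours, not from the notes), the sketch below simulates a sum of ten fair dice and checks the two-standard-deviation consequence of Chebyshev; the sample mean and standard deviation stand in for the true values.

import random
import statistics

# Simulate X = sum of 10 fair dice, then check the Chebyshev consequence
# P(|X - E[X]| >= 2*sigma) <= 1/4 against the empirical frequency.
samples = [sum(random.randint(1, 6) for _ in range(10)) for _ in range(100_000)]
mu = statistics.mean(samples)
sigma = statistics.stdev(samples)
frac = sum(abs(x - mu) >= 2 * sigma for x in samples) / len(samples)
print(f"empirical fraction: {frac:.4f}  (Chebyshev guarantees <= 0.25)")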

3.1. Example

Let’s return to our coin flipping example. Applying the Chebyshev inequality we get
P(X ≥ 3n/4) = P(X − n/2 ≥ n/4)
            ≤ P(|X − E[X]| ≥ (1/2) E[X])
            ≤ (n/4) / ((1/4)(n/2)²)
            = 4/n
This is better—unlike the Markov inequality, it suggests that the probability is becoming concen-
trated around the mean.

4. Chernoff
We now come to one of the most powerful bounding techniques, due to
Herman Chernoff. Chernoff is known for many contributions to both
practical and theoretical statistics. The so-called “Chernoff bounds”
originated in a paper on statistical hypothesis testing [1], but the re-
sulting bounds are very widely used in computer science.
Chernoff is also known for a method called “Chernoff faces” for visualiz-
ing high dimensional data. Each variable is associated with a particular
attribute in a cartoon face—for example, one variable might be associ-
ated with eyebrow slant, one with eye spacing, one with mouth shape,
etc. The characteristics of different data sets (or points) are then seen
at a glance.¹ On a personal note, Herman Chernoff was very kind to
me when I was an undergraduate, giving his time for weekly private
tutoring in probability and stochastic processes one spring semester.
He was extremely generous, and so it’s a pleasure for me to see his name remembered so often in
computer science.

Figure 1: Chernoff faces

Recall that we introduced a dependence on the “shape” of a distribution by squaring the random
variable and applying the Markov inequality. Here we exponentiate. By the Markov inequality, if
λ > 0, we have that
P(X ≥ c) = P(e^{λX} ≥ e^{λc})
         ≤ E[e^{λX}] / e^{λc}
         = e^{−λc} M_X(λ)

where M_X(λ) = E[e^{λX}] is the moment generating function of X.


¹ It's interesting to note that in his 1973 article on this topic Chernoff says that “At this time the cost of drawing
these faces is about 20 to 25 cents per face on the IBM 360-67 at Stanford University using the Calcomp Plotter.
Most of this cost is in the computing, and I believe that it should be possible to reduce it considerably.”

It is best to consider this inequality in the log domain. Taking logarithms gives

log P(X ≥ c) ≤ −λc + log M_X(λ)

Now, it’s possible to show that log MX (λ) is a convex function in λ. Since this inequality holds for
any λ, we can think of it as a “variational parameter” and select the value of λ to give the tightest
possible upper bound. See Figure 2.

Figure 2: Chernoff bounds for n = 30 independent Bernoulli trials for the event C_δ = {X | Σ_i X_i > np(1 + δ)}
with p = 1/2 and δ = 1/2. The classical Chernoff bound log P(X ∈ C_δ) < −npδ²/4 is the top horizontal line,
the tighter bound log P(X ∈ C_δ) < np(δ − (1 + δ) log(1 + δ)) is the second horizontal line, and the true log
probability is the lowest horizontal line. The curve shows the variational approximation
log P(X ∈ C_δ) < −λnp(1 + δ) + log M_X(λ), plotted against the variational parameter λ.
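The curve in Figure 2 can be reproduced, at least in spirit, with a few lines of code. The sketch below is ours (it uses a simple grid search rather than calculus) and relies on the fact that a sum of n independent Bernoulli(p) variables has moment generating function (1 − p + p e^λ)^n.

import math

n, p, delta = 30, 0.5, 0.5
c = n * p * (1 + delta)   # the threshold np(1 + delta) from Figure 2

def log_mgf(lam):
    # MGF of a sum of n independent Bernoulli(p) variables: (1 - p + p*e^lam)^n
    return n * math.log(1 - p + p * math.exp(lam))

# Grid search over the variational parameter lambda in (0, 3].
best = min(-lam * c + log_mgf(lam) for lam in (i / 1000 for i in range(1, 3001)))
print("optimized bound on log P(X >= np(1+delta)):", best)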

4.1. Example: Bounds for the Poisson

Let X ∼ Poisson(µ). Then by the general Chernoff bound, P(X ≥ n) ≤ e^{−nλ} M_X(λ). Now the
moment generating function for the Poisson is computed as

M_X(λ) = Σ_{k=0}^∞ e^{λk} p_X(k)
       = Σ_{k=0}^∞ e^{λk} e^{−µ} µ^k / k!
       = e^{−µ} e^{µ e^λ}
       = e^{µ(e^λ − 1)}

leading to the bound

log P(X ≥ n) ≤ −nλ + µ(e^λ − 1)

Minimizing over λ gives λ = log(n/µ) and the bound


log P(X ≥ n) ≤ −n log(n/µ) + n − µ

or, after exponentiating,


P(X ≥ n) ≤ e^{−µ} (µe/n)^n
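To see how sharp this is, here is a small check (ours, not from the notes) comparing the bound e^{−µ}(µe/n)^n with the exact Poisson tail, obtained by summing pmf terms.

import math

mu, n = 5.0, 15

# Exact tail P(X >= n) for X ~ Poisson(mu), via the complement of a finite sum.
exact = 1.0 - sum(math.exp(-mu) * mu**k / math.factorial(k) for k in range(n))

# Chernoff bound derived above: P(X >= n) <= e^{-mu} * (mu*e/n)^n.
bound = math.exp(-mu) * (mu * math.e / n) ** n
print(f"exact tail {exact:.2e}  <=  bound {bound:.2e}")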

4.2. Chernoff bounds for the binomial

More typically, Chernoff bounds are derived after making a convex upper bound to the convex
function log MX (λ), for which the minimum can be computed analytically in a convenient form.
The following bounds for the binomial illustrate this.
If X ∼ Binomial(n, p) then we can derive the bounds
P(X − np ≥ δ) ≤ e^{−2δ²/n}
P(X − np ≤ −δ) ≤ e^{−2δ²/n}

for δ > 0. To prove the first of these (the second is similar) we follow the usual protocol: exponentiate
and apply the Markov inequality. This gives
P(X − np ≥ δ) = P(e^{λ(X − np)} ≥ e^{λδ})
              ≤ e^{−λδ} E[e^{λ(Σ_{i=1}^n X_i − np)}]
              = e^{−λδ} E[e^{λ Σ_{i=1}^n (X_i − p)}]
              = e^{−λδ} E[Π_{i=1}^n e^{λ(X_i − p)}]
              = e^{−λδ} Π_{i=1}^n E[e^{λ(X_i − p)}]        (by independence)
              = e^{−λδ} Π_{i=1}^n ( p e^{λ(1−p)} + (1 − p) e^{−λp} )

Now, using the following inequality (which we won’t prove)


p e^{λ(1−p)} + (1 − p) e^{−λp} ≤ e^{λ²/8}

we get that
P(X − np ≥ δ) ≤ e^{−λδ} Π_{i=1}^n ( p e^{λ(1−p)} + (1 − p) e^{−λp} )
              ≤ e^{−λδ + nλ²/8}

This gives the convex function −λδ + nλ²/8 as a second, weaker upper bound to log P(X − np ≥ δ).
Minimizing over the parameter λ then gives λ = 4δ/n and the bound

P(X − np ≥ δ) ≤ e^{−2δ²/n}
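The inequality p e^{λ(1−p)} + (1 − p) e^{−λp} ≤ e^{λ²/8} that we used without proof (a special case of what is usually called Hoeffding's lemma) is easy to spot-check numerically; the short sketch below is ours, not part of the original notes, and simply scans a grid of (p, λ) values.

import math

# Spot-check p*e^{lam*(1-p)} + (1-p)*e^{-lam*p} <= e^{lam^2/8} on a grid.
worst = float("-inf")
for pi in range(1, 100):          # p = 0.01, ..., 0.99
    p = pi / 100
    for li in range(-400, 401):   # lam = -4.00, ..., 4.00
        lam = li / 100
        lhs = p * math.exp(lam * (1 - p)) + (1 - p) * math.exp(-lam * p)
        rhs = math.exp(lam * lam / 8)
        worst = max(worst, lhs - rhs)
print("maximum of lhs - rhs over the grid:", worst)   # <= 0, with equality at lam = 0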

4.3. Example

Let’s now return again to the simple binomial example. Recall that the Markov inequality gave
P(X ≥ 3n/4) ≤ 2/3, and the Chebyshev inequality gave P(X ≥ 3n/4) ≤ 4/n. Now, we apply the
Chernoff bound just derived to get

P(X ≥ 3n/4) = P(X − n/2 ≥ n/4)
            ≤ e^{−2(n/4)²/n}
            = e^{−n/8}

This bound goes to zero exponentially fast in n. This is as we expect intuitively, since the binomial
should become concentrated around its mean as n → ∞.
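Putting the three bounds side by side for a concrete n makes the difference vivid; the comparison below is our own illustration, not part of the original notes.

import math

# Compare the three bounds on P(X >= 3n/4) for X ~ Binomial(n, 1/2)
# against the exact tail probability.
n = 100
c = 3 * n // 4
exact = sum(math.comb(n, k) for k in range(c, n + 1)) / 2**n

markov    = (n / 2) / (3 * n / 4)   # 2/3
chebyshev = 4 / n                   # 4/n
chernoff  = math.exp(-n / 8)        # e^{-n/8}
print(f"exact {exact:.2e}, Markov {markov:.3f}, "
      f"Chebyshev {chebyshev:.3f}, Chernoff {chernoff:.2e}")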

4.4. Alternative forms of Chernoff bounds

There are other forms of the Chernoff bounds for binomial or Bernoulli trials that we'll state here.
If X_1, . . . , X_n are independent Bernoulli trials with X_i ∼ Bernoulli(p_i), let X = Σ_{i=1}^n X_i. Then

P(X > (1 + δ) E[X]) ≤ ( e^δ / (1 + δ)^{(1+δ)} )^{E[X]}

If, moreover, 0 < δ < 1, then

P(X > (1 + δ) E[X]) ≤ e^{−δ² E[X] / 3}
P(X < (1 − δ) E[X]) ≤ e^{−δ² E[X] / 2}

The bounds are slightly different but the idea of the proof is the same as we’ve already seen.

4.5. Inverse bounds

As an example of the use of Chernoff bounds, we can carry out an “inverse calculation” to see how
much beyond the mean a random variable must be to have a sufficiently small tail probability.
Specifically, for X ∼ Binomial(n, 1/2), we can ask how big does m have to be so that

P(X − n/2 > m) ≤ 1/n

Using the Chernoff bound we have that

P(X − n/2 > m) ≤ e^{−2m²/n}

In order for the righthand side to be equal to 1/n, we need that

−2m²/n = −log n

which we solve to get m = √(n log n / 2).
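As a quick check (ours, not from the notes) that this choice of m does the job, we can compute the exact tail for a moderate n:

import math

n = 200
m = math.sqrt(n * math.log(n) / 2)

# Exact P(X - n/2 > m) for X ~ Binomial(n, 1/2): sum the pmf over k > n/2 + m.
k_min = math.floor(n / 2 + m) + 1
tail = sum(math.comb(n, k) for k in range(k_min, n + 1)) / 2**n
print(f"m = {m:.1f}, exact tail = {tail:.4f}, 1/n = {1 / n:.4f}")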

4.6. Application to Algorithms

We now apply the Chernoff bound to the analysis of randomized quicksort. Recall that earlier
we showed randomized quicksort has expected running time O(n log n). Here we’ll show something
much stronger—that with high probability the running time is O(n log n). In particular, we will
show the following.
Theorem. Randomized quicksort runs in O(n log n) time with probability at least 1 − 1/n^b for
some constant b > 1.
To sketch the proof of this, suppose that we run the algorithm and stop when the depth of the
recursion tree is c log n for some constant c. The question we need to ask is: what is the probability
there is a leaf that is not a singleton set {si }?

Call a split (induced by some pivot) good if it breaks up the set S into two pieces S1 and S2 with
min(|S_1|, |S_2|) ≥ (1/3)|S|
Otherwise the split is bad. What is the probability of getting a good split? Assuming that all
elements are distinct, it's easy to see that the probability of a good split is 1/3, since the pivot must
fall in the middle third of the sorted order.
Now, for each good split, the size of the remaining subproblem shrinks to at most 2/3 of |S|. How many good splits
are needed in the worst case to get to a singleton set? A little calculation shows that we require

x = log n / log(3/2) = a log n

good splits. We can next reason about how large the constant c needs to be so that the probability
that we get fewer than a log n good splits is small.
Consider a single path from the root to a leaf in the recursion tree. The average number of good
splits is (1/3)c log n. By the Chernoff bound, we have that
P(number of good splits < (1 − δ)(1/3)c log n) ≤ e^{−(δ²/2)(1/3)c log n}

We can then choose c sufficiently large so that this righthand side is no larger than 1/n².
We arranged this so that the probability of having too few good splits on a single path was less
than 1/n2 . Thus, by the union bound, the probability that there is some path from root to a leaf
that has too few good splits is less than 1/n. It follows that with probability at least 1 − 1/n,
the algorithm finishes with a singleton set {si } at each leaf in the tree, meaning that the set S is
successfully sorted in time O(n log n).
This is a much stronger guarantee on the running time of the algorithm, which explains the effec-
tiveness of randomized quicksort in practice.
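A small simulation (ours, and only a rough illustration of the theorem rather than its proof) estimates how rarely the recursion depth of randomized quicksort exceeds c log n.

import math
import random

def depth(size):
    """Recursion depth of randomized quicksort on `size` distinct elements
    (only the subproblem sizes matter, so we never build the arrays)."""
    if size <= 1:
        return 0
    pivot_rank = random.randrange(size)              # uniformly random pivot
    return 1 + max(depth(pivot_rank), depth(size - pivot_rank - 1))

n, c, trials = 1000, 6, 1000
threshold = c * math.log(n)
deep = sum(depth(n) > threshold for _ in range(trials))
print(f"runs deeper than {threshold:.0f} levels: {deep} out of {trials}")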

4.7. Pointer to recent research

As a final note, in recent research in machine learning we have attempted to extend the scope
of Chernoff bounds to non-independent random variables, using some of the machinery of convex
optimization [4].

5. Jensen
Johan Ludwig Jensen of Denmark was self-taught in mathematics, and
never had an academic or full research position. Instead, he worked
beginning in 1880 for a telephone company to support himself suffi-
ciently in order to be able to work on mathematics in his spare time.
(Only much later did telephone workers at such companies as AT&T
get paid directly to do mathematics!)
Jensen studied convex functions, and is given credit for a particularly
important and useful inequality involving convex functions. A convex
function is one such as f(x) = x² + ax + b that “holds water.” Jensen's
inequality is best viewed geometrically. If f (x) and f (y) are two points
on the graph of f and we connect them by a chord, this line is traced
out by αf (x) + (1 − α)f (y) for 0 ≤ α ≤ 1. By convexity, this line lies above the graph of the
function. Thus²,

α f (x) + (1 − α) f (y) ≥ f (α x + (1 − α) y)

By induction, it follows easily that

α_1 f(x_1) + · · · + α_n f(x_n) ≥ f(α_1 x_1 + · · · + α_n x_n)

where α_i ≥ 0 and Σ_{i=1}^n α_i = 1.
If X is a finite random variable, this shows that for f convex,

f (E[X]) ≤ E [f (X)]

Similarly, for g a concave function,

g (E[X]) ≥ E [g(X)]

These inequalities are best remembered by simply drawing a picture of a convex function and a
chord between two points on the curve.
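A two-line check (ours, not from the notes) of the convex case for a small finite random variable:

# Check f(E[X]) <= E[f(X)] for the convex function f(x) = x^2
# and a finite random variable with the pmf below.
values = [1, 2, 5, 10]
probs  = [0.1, 0.4, 0.3, 0.2]

def f(x):
    return x**2

mean   = sum(p * v for p, v in zip(probs, values))
e_of_f = sum(p * f(v) for p, v in zip(probs, values))
print(f(mean), "<=", e_of_f)   # approximately 19.36 <= 29.2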

5.1. Example

As a simple use of Jensen’s inequality, we note the arithmetic/geometric mean inequality. Using
convexity of − log(x), if a and b are positive then
−(1/2) log(a) − (1/2) log(b) ≥ −log( (1/2)a + (1/2)b )

which implies

√(ab) ≤ (1/2)(a + b)
² It appears that Jensen, in fact, only studied the inequality f((x + y)/2) ≤ (f(x) + f(y))/2.

5.2. Example: Entropy

If X is a finite r.v. taking values vi with probability pi = P(X = vi ), the entropy of X, denoted
H(X), is defined as
H(X) = −Σ_{i=1}^n p_i log p_i

Clearly H(X) ≥ 0 since log(1/pi ) ≥ 0. Now, since − log is convex, we have by Jensen’s inequality
that
H(X) = −Σ_i p_i log p_i
     = Σ_i p_i log(1/p_i)
     ≤ log( Σ_{i=1}^n p_i · (1/p_i) )
     = log n

so that the entropy lies in the interval [0, log n].


The entropy is a lower bound on the amount of compression possible for a sequence of characters
vi that are independent and identically distributed according to the pmf of X.
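The bound is easy to verify numerically; the sketch below (ours, using natural logarithms as in the notes) computes the entropy of a pmf and compares it with log n.

import math

def entropy(probs):
    """Entropy H(X) = -sum p_i log p_i, using natural logarithms."""
    return -sum(p * math.log(p) for p in probs if p > 0)

pmf = [0.5, 0.25, 0.125, 0.125]
print(entropy(pmf), "<=", math.log(len(pmf)))   # about 1.21 <= log 4 (about 1.39)
print(entropy([0.25] * 4))                      # the uniform pmf attains log 4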

References

[1] Herman Chernoff (1952). A measure of asymptotic efficiency for tests of a hypothesis based
on the sum of observations. Annals of Mathematical Statistics, 23, 493–507.

[2] The MacTutor history of mathematics archive (2004). http://www-gap.dcs.st-and.ac.uk/~history.

[3] Rajeev Motwani and Prabhakar Raghavan (1995). Randomized Algorithms. Cambridge Uni-
versity Press.

[4] Pradeep Ravikumar and John Lafferty (2004). “Variational Chernoff bounds for graphical
models,” Uncertainty in Artificial Intelligence.

[5] Sheldon M. Ross (2002). Probability Models for Computer Science. Harcourt/Academic Press.
