
3. Moments and Deviations

• We examine techniques for bounding the tail distribution – the probability that a random variable assumes values that are far from its expectation
• In the analysis of algorithms, these bounds are the major tool for estimating the failure probability of algorithms and for establishing high-probability bounds on their running time


3.1. Markov's Inequality

• Markov's inequality is often too weak to yield useful results, but it is a fundamental tool in developing more sophisticated bounds

Theorem 3.1 [Markov's Inequality]:
Let X be a RV that assumes only nonnegative values. Then, for all a > 0,
  Pr(X ≥ a) ≤ E[X]/a


Proof: For a > 0, let
  I = 1, if X ≥ a
      0, otherwise
and note that, since X ≥ 0,
  I ≤ X/a.
Because I is a 0–1 random variable,
  E[I] = Pr(I = 1) = Pr(X ≥ a).
Taking expectations thus yields
  Pr(X ≥ a) = E[I] ≤ E[X/a] = E[X]/a.

• Let us use Markov's inequality to bound the probability of obtaining more than 3n/4 heads in a sequence of n fair coin flips
• Let
  X_i = 1, if the ith coin flip is heads
        0, otherwise
  and X = Σ_{i=1}^n X_i denote the number of heads in the n coin flips
• Since E[X_i] = Pr(X_i = 1) = 1/2, it follows that E[X] = Σ_{i=1}^n E[X_i] = n/2
• Applying Markov's inequality, we obtain
  Pr(X ≥ 3n/4) ≤ E[X]/(3n/4) = (n/2)/(3n/4) = 2/3

3.2. Variance and Moments of a RV

• Markov's inequality gives the best tail bound possible when all we know is the expectation of the RV and that the variable is nonnegative
• If more information about the distribution of the RV is available, the bound can be improved
• Additional information about a RV is often expressed in terms of its moments

Definition 3.1: The kth moment of a RV X is E[X^k]


• Given the first and second moments, one can compute the variance and standard deviation of the RV
• Intuitively, the variance and standard deviation offer a measure of how far the RV is likely to be from its expectation

Definition 3.2: The variance of a RV X is
  Var[X] = E[(X − E[X])²] = E[X²] − (E[X])²

The standard deviation of a RV X is
  σ[X] = √Var[X]


• The two forms of the variance in the definition are equivalent, as is easily seen by the linearity of expectations
• Keeping in mind that E[X] is a constant, we have
  E[(X − E[X])²] = E[X² − 2X·E[X] + (E[X])²]
                 = E[X²] − 2·E[X]·E[X] + (E[X])²
                 = E[X²] − (E[X])²

• If a RV X is constant – always assumes the same value – then its variance and standard deviation are zero
• More generally, if a RV X takes on the value k·E[X] with probability 1/k and the value 0 with probability 1 − 1/k, then Var[X] = (k − 1)·(E[X])² and σ[X] = √(k − 1)·E[X]
• The variance (and standard deviation) of a RV are small when the RV assumes values close to its expectation and are large when it assumes values far from its expectation
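• As a quick sanity check (our addition, not part of the original slides), the following Python snippet computes these quantities for the two-point RV above; the helper name two_point_stats is ours:

    import math

    def two_point_stats(mu, k):
        # X = k*mu with probability 1/k, and 0 otherwise, so E[X] = mu
        ex  = (k * mu) * (1 / k)            # = mu
        ex2 = (k * mu) ** 2 * (1 / k)       # = k * mu**2
        var = ex2 - ex ** 2                 # = (k - 1) * mu**2
        return ex, var, math.sqrt(var)      # sigma = sqrt(k - 1) * mu

    print(two_point_stats(mu=2.0, k=5))     # (2.0, 16.0, 4.0)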

Definition 3.3: The covariance of X and Y is
  Cov(X, Y) = E[(X − E[X])·(Y − E[Y])].

Theorem 3.2: For any two RVs X and Y,
  Var[X + Y] = Var[X] + Var[Y] + 2·Cov(X, Y).

Proof:
  Var[X + Y] = E[(X + Y − E[X + Y])²]
             = E[((X − E[X]) + (Y − E[Y]))²]
             = E[(X − E[X])²] + E[(Y − E[Y])²] + 2·E[(X − E[X])·(Y − E[Y])]
             = Var[X] + Var[Y] + 2·Cov(X, Y)

Theorem 3.3: If X and Y are two independent RVs, then
  E[X·Y] = E[X]·E[Y]
• This does not necessarily hold if the RVs are dependent
• Let X and Y each correspond to fair coin flips, both taking on the value 0 if the flip is heads and 1 if the flip is tails
• Then E[X] = E[Y] = 1/2
• If the two flips are independent, then X·Y is 1 with probability 1/4 and 0 otherwise, so E[X·Y] = 1/4
• Indeed E[X·Y] = E[X]·E[Y]

• Suppose instead that the coin flips are dependent in the following way:
– the coins are tied together, so X and Y either both come up heads or both come up tails
• Each coin considered individually is still a fair coin flip, but now X·Y is 1 with probability 1/2, and so E[X·Y] ≠ E[X]·E[Y]
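• A small simulation (our addition) illustrates the two cases; with independent coins the empirical E[XY] is about 1/4, with tied coins about 1/2:

    import random

    def avg_xy(tied, trials=100_000):
        # X, Y in {0, 1}: 1 if the flip is tails, 0 if heads
        total = 0
        for _ in range(trials):
            x = random.randint(0, 1)
            y = x if tied else random.randint(0, 1)
            total += x * y
        return total / trials

    print("independent E[XY] ~", avg_xy(tied=False))  # ~ 1/4 = E[X]*E[Y]
    print("tied        E[XY] ~", avg_xy(tied=True))   # ~ 1/2 != E[X]*E[Y]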

Corollary 3.4: If X and Y are independent RVs, then
  Cov(X, Y) = 0
and
  Var[X + Y] = Var[X] + Var[Y]

Proof:
  Cov(X, Y) = E[(X − E[X])·(Y − E[Y])]
            = E[X − E[X]]·E[Y − E[Y]]
            = 0
In the second equation we have used the fact that, since X and Y are independent, so are X − E[X] and Y − E[Y], and hence Theorem 3.3 applies. For the last equation we use the fact that, for any random variable Z,
  E[Z − E[Z]] = E[Z] − E[Z] = 0.
Since Cov(X, Y) = 0, we have
  Var[X + Y] = Var[X] + Var[Y].

3.2.1. Example: Variance of a BRV

• The variance of a binomial RV (BRV) X ~ B(n, p) can be determined by computing E[X²]; a direct calculation gives
  E[X²] = n(n − 1)p² + np
• We conclude that
  Var[X] = E[X²] − (E[X])² = n(n − 1)p² + np − (np)² = np(1 − p)

• An alternative derivation makes use of independence
• Recall that a BRV can be represented as the sum of n independent Bernoulli trials, each with success probability p
• Such a Bernoulli trial X_i has variance
  Var[X_i] = E[X_i²] − (E[X_i])² = p·1² + (1 − p)·0² − p² = p(1 − p)
• By Corollary 3.4, the variance of X is then
  Var[X] = n·p(1 − p)
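• A short empirical check (our addition, with arbitrarily chosen n and p) confirms the formula:

    import random

    n, p, trials = 50, 0.3, 50_000
    xs = [sum(random.random() < p for _ in range(n)) for _ in range(trials)]
    mean = sum(xs) / trials
    var  = sum((x - mean) ** 2 for x in xs) / trials
    print("empirical Var[X]:", round(var, 3))    # close to n*p*(1-p)
    print("n*p*(1-p):       ", n * p * (1 - p))  # = 10.5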

3.3. Chebyshev's Inequality

• Using the expectation and the variance of the RV, one can derive a significantly stronger tail bound known as Chebyshev's inequality

Theorem 3.6 [Chebyshev's Inequality]:
For any a > 0,
  Pr(|X − E[X]| ≥ a) ≤ Var[X]/a²


Proof: First observe that
  Pr(|X − E[X]| ≥ a) = Pr((X − E[X])² ≥ a²)
Since (X − E[X])² is a nonnegative random variable, we can apply Markov's inequality to prove:
  Pr((X − E[X])² ≥ a²) ≤ E[(X − E[X])²]/a² = Var[X]/a²


• The following useful variants of Chebyshev's inequality bound the deviation of the RV from its expectation in terms of a constant factor of σ[X] or E[X]

Corollary 3.7: For any t > 1,
  Pr(|X − E[X]| ≥ t·σ[X]) ≤ 1/t²
and
  Pr(|X − E[X]| ≥ t·E[X]) ≤ Var[X]/(t·E[X])²


• Let us use Chebyshev's inequality to bound the probability of obtaining more than 3n/4 heads in a sequence of n fair coin flips
• Recall that X_i = 1 if the ith coin flip is heads and 0 otherwise
• X = Σ_{i=1}^n X_i denotes the number of heads in the n coin flips
• To use Chebyshev's inequality we need to compute the variance of X
• Observe first that, since X_i is a 0–1 RV,
  E[X_i²] = E[X_i] = 1/2

• Thus,
  Var[X_i] = E[X_i²] − (E[X_i])² = 1/2 − 1/4 = 1/4
• Now, since X = Σ_{i=1}^n X_i and the X_i are independent, we can use Thm 3.5 to compute
  Var[X] = Σ_{i=1}^n Var[X_i] = n/4
• Applying Chebyshev's inequality then yields
  Pr(X ≥ 3n/4) ≤ Pr(|X − E[X]| ≥ n/4) ≤ Var[X]/(n/4)² = (n/4)/(n²/16) = 4/n

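• The contrast with Markov's bound is easy to see numerically; the following snippet (our addition) estimates the true tail probability for n = 64 and prints both bounds:

    import random

    def heads(n):
        # number of heads in n fair coin flips
        return sum(random.random() < 0.5 for _ in range(n))

    n, trials = 64, 100_000
    emp = sum(heads(n) >= 3 * n / 4 for _ in range(trials)) / trials
    print("empirical Pr[X >= 3n/4]:", emp)       # tiny in practice
    print("Markov bound:           ", 2 / 3)
    print("Chebyshev bound:        ", 4 / n)     # 0.0625 for n = 64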

• In fact, we can do slightly better
• Chebyshev's inequality yields that 4/n is actually a bound on the probability that X is either smaller than n/4 or larger than 3n/4
• By symmetry, the probability that X is greater than 3n/4 is then at most 2/n
• Chebyshev's inequality gives a significantly better bound than Markov's inequality for large n


3.3.1. Example: Coupon Collector's Problem

• Recall that the time X to collect n coupons has expectation E[X] = nH_n, where
  H_n = Σ_{i=1}^n 1/i = ln n + Θ(1)
• Hence Markov's inequality yields
  Pr(X ≥ 2nH_n) ≤ 1/2
• Recall again that X = Σ_{i=1}^n X_i, where the X_i are geometric random variables with parameter p_i = (n − i + 1)/n


• The X_i are independent, because the time to collect the ith coupon does not depend on how long it took to collect the previous i − 1 coupons
• Hence
  Var[X] = Σ_{i=1}^n Var[X_i],
so we need to find the variance of a geometric random variable

Lemma 3.8: The variance of a geometric RV with parameter p is
  (1 − p)/p²

• We simplify the argument by using the upper bound Var[X_i] ≤ 1/p_i² for a geometric RV, instead of the exact result of Lemma 3.8
• Then
  Var[X] ≤ Σ_{i=1}^n (n/(n − i + 1))² = Σ_{i=1}^n (n/i)² ≤ n²·Σ_{i=1}^∞ 1/i² = π²n²/6,
because
  Σ_{i=1}^∞ 1/i² = π²/6

• Now, by Chebyshev's inequality,
  Pr(X ≥ 2nH_n) ≤ Pr(|X − nH_n| ≥ nH_n) ≤ (π²n²/6)/(nH_n)² = π²/(6H_n²) = O(1/ln² n)
• Chebyshev's inequality again gives a much better bound than Markov's inequality
• But it is still a fairly weak bound, as we can see by considering instead a simple union bound argument

• Consider the probability of not obtaining the ith coupon after 2n ln n steps
• This probability is
  (1 − 1/n)^{2n ln n} ≤ e^{−2 ln n} = 1/n²
• By a union bound, the probability that some coupon has not been collected after 2n ln n steps is only 1/n
• In particular, the probability that not all coupons are collected after 2n ln n steps is at most 1/n – a bound that is significantly better than what can be achieved even with Chebyshev's inequality
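• A quick simulation (our addition) shows how conservative even the union bound is in practice:

    import math
    import random

    def coupon_time(n):
        # number of draws until all n coupon types have been seen
        seen, steps = set(), 0
        while len(seen) < n:
            seen.add(random.randrange(n))
            steps += 1
        return steps

    n, runs = 50, 5_000
    cutoff = 2 * n * math.log(n)
    late = sum(coupon_time(n) > cutoff for _ in range(runs)) / runs
    print("empirical Pr[X > 2n ln n]:", late)  # well below the bound
    print("union bound 1/n:          ", 1 / n)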

3.4. Application: A Randomized Algorithm for Computing the Median
• Given a set S of n elements drawn from a totally ordered universe, the median of S is an element m of S s.t.
– at least ⌊n/2⌋ elements in S are ≤ m and
– at least ⌊n/2⌋ + 1 elements in S are ≥ m
• If the elements in S are distinct, then m is the (⌈n/2⌉)th element in the sorted order of S
• The median can be found deterministically in O(n log n) steps by sorting, and a relatively complex deterministic algorithm computes it in O(n) time

• Let us assume that n is odd and that the elements in S are distinct
• The goal is to find two elements that are close together in the sorted order of S and that have the median lie between them
• Specifically, we seek two elements d, u ∈ S s.t.
1. d ≤ m ≤ u (the median is between d and u)
2. for C = {s ∈ S : d ≤ s ≤ u}, |C| = o(n/log n) (the # of elements between d and u is small)
• Sampling gives us a simple and efficient method for finding two such elements


• Once these two elements are identified, m can be found in linear time with the following steps:
– Count (in linear time) the number ℓ_d of elements of S that are smaller than d, and then sort (in sublinear, or o(n), time) the set C
– The set C can be sorted in time o(n) using any standard O(n log n)-comparison sorting algorithm, since |C| = o(n/log n)
– The (⌊n/2⌋ − ℓ_d + 1)th element in the sorted order of C is m, since there are exactly ⌊n/2⌋ elements in S that are smaller than that value (ℓ_d in the set {s ∈ S : s < d} and ⌊n/2⌋ − ℓ_d in C)

• To find d and u, we sample with replacement a multi-set R of ⌈n^{3/4}⌉ elements from S
• Each element in R is chosen uniformly at random from the set S, independent of previous choices
• Thus, the same element of S might appear more than once in the multi-set R
• Sampling w/o replacement might give marginally better bounds, but implementing and analyzing it are significantly harder
• We assume that an element can be sampled from S in constant time

• Since R is a random sample of S, we expect the median of R to be close to the median element of S
• We therefore choose d and u to be elements of R surrounding the median of R
• We require all the steps of our algorithm to work with high probability (w.h.p.), by which we mean with probability at least 1 − O(n^{−c}) for some constant c > 0
• To guarantee that w.h.p. the set C includes the median m, we fix d and u to be, respectively, the (⌊n^{3/4}/2 − √n⌋)th and the (⌈n^{3/4}/2 + √n⌉)th elements in the sorted order of R

• With this choice, the set C includes all the elements of S that are between the 2√n sample points surrounding the median of R
• The analysis will clarify that the choice of the size of R and the choices for d and u are tailored to guarantee both that
a) the set C is large enough to include m with high probability and
b) the set C is sufficiently small so that it can be sorted in sublinear time with high probability


RANDOMIZED MEDIAN ALGORITHM:

Input: A set S of n elements over a totally ordered universe
Output: The median element of S, denoted by m
1. Pick a (multi-)set R of ⌈n^{3/4}⌉ elements in S, chosen independently and uniformly at random with replacement
2. Sort the set R
3. Let d be the (⌊n^{3/4}/2 − √n⌋)th smallest element in the sorted set R
4. Let u be the (⌈n^{3/4}/2 + √n⌉)th smallest element in the sorted set R
5. By comparing every element in S to d and u, compute the set C = {x ∈ S : d ≤ x ≤ u} and the numbers ℓ_d = |{x ∈ S : x < d}| and ℓ_u = |{x ∈ S : x > u}|
6. if ℓ_d > n/2 or ℓ_u > n/2 then FAIL
7. if |C| ≤ 4n^{3/4} then sort the set C, otherwise FAIL
8. output the (⌊n/2⌋ − ℓ_d + 1)th element in the sorted order of C
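• The following Python sketch (our transcription, not code from the course) mirrors the pseudocode above; the exact rank rounding in steps 3–4 is one reasonable reading of the floors and ceilings:

    import math
    import random

    def randomized_median(S):
        # One Monte Carlo attempt; returns None on FAIL.
        # Assumes the elements of S are distinct and n = len(S) is odd.
        n = len(S)
        r = math.ceil(n ** 0.75)
        R = sorted(random.choice(S) for _ in range(r))  # steps 1-2
        k_d = max(1, math.floor(r / 2 - math.sqrt(n)))  # rank of d in R
        k_u = min(r, math.ceil(r / 2 + math.sqrt(n)))   # rank of u in R
        d, u = R[k_d - 1], R[k_u - 1]                   # steps 3-4
        C = [x for x in S if d <= x <= u]               # step 5
        l_d = sum(x < d for x in S)
        l_u = sum(x > u for x in S)
        if l_d > n / 2 or l_u > n / 2:                  # step 6: FAIL
            return None
        if len(C) > 4 * r:                              # step 7: FAIL
            return None
        C.sort()
        return C[n // 2 - l_d]                          # step 8 (0-based)

    S = random.sample(range(10 ** 6), 1001)
    print(randomized_median(S), sorted(S)[len(S) // 2])  # None or the median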


Theorem 3.9: The randomized median algorithm terminates in linear time, and if it does not output FAIL then it outputs the correct median element of the input set S.

Proof: Correctness follows because the algorithm could only give an incorrect answer if the median m were not in the set C. But then either ℓ_d > n/2 or ℓ_u > n/2, and thus step 6 guarantees that, in these cases, the algorithm outputs FAIL.
Similarly, as long as C is sufficiently small, the total work is only linear in the size of S. Step 7 therefore guarantees that the algorithm does not take more than linear time; if the sorting might take too long, the algorithm outputs FAIL without sorting.

• We identify three "bad" events such that, if none of them occurs, the algorithm does not fail:
  E_1: Y_1 = |{r ∈ R : r ≤ m}| < n^{3/4}/2 − √n;
  E_2: Y_2 = |{r ∈ R : r ≥ m}| < n^{3/4}/2 − √n;
  E_3: |C| > 4n^{3/4}

Lemma 3.10: The randomized median algorithm fails iff at least one of E_1, E_2, or E_3 occurs.

Proof: Failure in step 7 is equivalent to E_3. Failure in step 6 occurs iff ℓ_d > n/2 or ℓ_u > n/2. But ℓ_d > n/2 means that d, the (⌊n^{3/4}/2 − √n⌋)th smallest element of R, is larger than m; this happens iff fewer than n^{3/4}/2 − √n of the samples are ≤ m, which is exactly the event E_1. Similarly, ℓ_u > n/2 is equivalent to the event E_2.

Lemma 3.11:
  Pr(E_1) ≤ (1/4)·n^{−1/4}

Proof: Define a random variable X_i by
  X_i = 1, if the ith sample is ≤ the median
        0, otherwise
The X_i are independent, since the sampling is done with replacement. Because there are (n − 1)/2 + 1 elements in S that are ≤ the median, the probability that a randomly chosen element of S is ≤ the median can be written as
  Pr(X_i = 1) = ((n − 1)/2 + 1)/n = 1/2 + 1/(2n)

The event E_1 is equivalent to
  Y_1 = Σ_{i=1}^{n^{3/4}} X_i < n^{3/4}/2 − √n

Since Y_1 is the sum of Bernoulli trials, it is a BRV with parameters n^{3/4} and 1/2 + 1/(2n). Hence, using the earlier result Var[X] = np(1 − p) yields
  Var[Y_1] = n^{3/4}·(1/2 + 1/(2n))·(1/2 − 1/(2n)) = n^{3/4}·(1/4 − 1/(4n²)) < (1/4)·n^{3/4}

Applying Chebyshev's inequality then yields
  Pr(E_1) = Pr(Y_1 < n^{3/4}/2 − √n)
          ≤ Pr(|Y_1 − E[Y_1]| > √n)
          ≤ Var[Y_1]/n < (1/4)·n^{3/4}/n = (1/4)·n^{−1/4}

• We similarly obtain the same bound for the probability of the event E_2

Lemma 3.12: Pr(E_2) ≤ (1/4)·n^{−1/4}

Theorem 3.13: The probability that the randomized median algorithm fails is bounded by n^{−1/4}.
• The proof sums the probabilities of the bad events: Pr(E_1) + Pr(E_2) ≤ (1/2)·n^{−1/4} by Lemmas 3.11 and 3.12, and a similar argument bounds Pr(E_3) by (1/2)·n^{−1/4}
• Repeating the algorithm until it succeeds (finds the median), we obtain an iterative algorithm that never fails but has a random running time
• The samples taken in successive runs are independent, so the success of each run is independent of the others, and hence the number of runs until success is achieved is a geometric random variable
• This variation of the algorithm still has linear expected running time

• Randomized algorithms that may fail or return an incorrect answer are called Monte Carlo (MC) algorithms
• The running time of a MC algorithm often does not depend on the random choices made
• E.g., the median algorithm always terminates in linear time, regardless of its random choices
• A randomized algorithm that always returns the right answer is called a Las Vegas (LV) algorithm
• The median MC algorithm can be turned into a LV algorithm by repeating it until it succeeds, as sketched below
• Turning it into a LV algorithm means the running time is variable, although the expected running time is still linear
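• A minimal Las Vegas wrapper (our addition, reusing the randomized_median sketch above):

    def las_vegas_median(S):
        # Repeat the Monte Carlo attempt until it succeeds; the number of
        # attempts is geometric with success probability at least
        # 1 - n**(-1/4), so the expected running time remains linear.
        while True:
            m = randomized_median(S)
            if m is not None:
                return m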

4. Chernoff Bounds

• Chernoff bounds are extremely powerful, giving exponentially decreasing bounds on the tail distribution
• These bounds are derived by using Markov's inequality on the moment generating function of a random variable

4.1. Moment Generating Functions

Definition 4.1: The moment generating function (MGF) of a random variable X is
  M_X(t) = E[e^{tX}]

• We are mainly interested in the existence and properties of this function in the neighborhood of zero
• M_X(t) captures all of the moments of X


Theorem 4.1: Let X be a RV with MGF M_X(t). Under the assumption that exchanging the expectation and differentiation operands is legitimate, for all n ≥ 1 we then have
  E[X^n] = M_X^(n)(0),
where M_X^(n)(0) is the nth derivative of M_X(t) evaluated at t = 0.

Proof: Assuming that we can exchange the expectation and differentiation operands, then
  M_X^(n)(t) = E[X^n·e^{tX}].
Computed at t = 0, this expression yields
  M_X^(n)(0) = E[X^n].

• Expectation and differentiation operands can be exchanged whenever the MGF exists in a neighborhood of zero
• This holds for all distributions considered in this course
• Consider a RV X ~ Geom(p)
• Then, for t < −ln(1 − p),
  M_X(t) = E[e^{tX}] = Σ_{k=1}^∞ e^{tk}·(1 − p)^{k−1}·p = p·e^t/(1 − (1 − p)·e^t)

• It follows that
  M_X′(t) = p·e^t/(1 − (1 − p)·e^t)²  and
  M_X″(t) = p·e^t·(1 + (1 − p)·e^t)/(1 − (1 − p)·e^t)³
• Evaluating these derivatives at t = 0 and using Theorem 4.1 gives
  E[X] = 1/p  and  E[X²] = (2 − p)/p²,
  matching our previous calculations
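• These derivatives can also be checked symbolically; the following snippet (our addition, assuming the SymPy library is available) differentiates the geometric MGF at t = 0:

    import sympy as sp

    t, p = sp.symbols('t p', positive=True)
    M = p * sp.exp(t) / (1 - (1 - p) * sp.exp(t))   # geometric MGF

    EX  = sp.simplify(sp.diff(M, t).subs(t, 0))     # 1/p
    EX2 = sp.simplify(sp.diff(M, t, 2).subs(t, 0))  # (2 - p)/p**2
    print(EX, EX2, sp.simplify(EX2 - EX ** 2))      # variance: (1 - p)/p**2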



• The MGF of a RV – equivalently, all of the moments of the RV – uniquely defines its distribution

Theorem 4.2: Let X and Y be two RVs. If
  M_X(t) = M_Y(t)
for all t ∈ (−δ, δ) for some δ > 0, then X and Y have the same distribution.


Theorem 4.3: If X and Y are independent RVs, then
  M_{X+Y}(t) = M_X(t)·M_Y(t).

Proof:
  M_{X+Y}(t) = E[e^{t(X+Y)}] = E[e^{tX}·e^{tY}] = E[e^{tX}]·E[e^{tY}] = M_X(t)·M_Y(t)
Here we have used that X and Y are independent – and hence e^{tX} and e^{tY} are independent.

• Thus, if we know M_X(t) and M_Y(t) and if we recognize the function M_X(t)·M_Y(t) as the MGF of a known distribution, then that must be the distribution of X + Y when Theorem 4.2 applies

4.2. Deriving and Applying Chernoff Bounds

• The Chernoff bound for a RV X is obtained by applying Markov's inequality to e^{tX} for some well-chosen value t
• From Markov's inequality, we can derive the following useful inequality: for any t > 0,
  Pr(X ≥ a) = Pr(e^{tX} ≥ e^{ta}) ≤ E[e^{tX}]/e^{ta}
• In particular,
  Pr(X ≥ a) ≤ min_{t>0} E[e^{tX}]/e^{ta}

• Similarly, for any t < 0,
  Pr(X ≤ a) = Pr(e^{tX} ≥ e^{ta}) ≤ E[e^{tX}]/e^{ta}
• Hence,
  Pr(X ≤ a) ≤ min_{t<0} E[e^{tX}]/e^{ta}
• Bounds for specific distributions are obtained by choosing appropriate values for t
• The value of t that minimizes E[e^{tX}]/e^{ta} gives the best possible bounds
• However, often one chooses a value of t that gives a convenient form

4.2.1. Chernoff Bounds for the Sum of Poisson Trials

• We develop a common version of the Chernoff bound: for the tail distribution of a sum of independent 0–1 RVs, which are also known as Poisson trials
• The distributions of the RVs in Poisson trials are not necessarily identical
• Bernoulli trials are a special case of Poisson trials where the independent 0–1 RVs have the same distribution; i.e., all trials take on the value 1 with the same probability

• Recall that the binomial distribution gives the # of successes in n independent Bernoulli trials
• Our Chernoff bound will hold for the binomial distribution and also for the more general setting of the sum of Poisson trials
• Let X_1, …, X_n be a sequence of independent Poisson trials with Pr(X_i = 1) = p_i
• Let X = Σ_{i=1}^n X_i, and let
  μ = E[X] = Σ_{i=1}^n p_i


• For a given δ > 0, we are interested in bounds on Pr(X ≥ (1 + δ)μ) and Pr(X ≤ (1 − δ)μ)
• I.e., the probability that X deviates from its expectation μ by δμ or more
• To develop a Chernoff bound we need to compute the MGF of X
• We start with the MGF of each X_i:
  M_{X_i}(t) = E[e^{tX_i}] = p_i·e^t + (1 − p_i) = 1 + p_i·(e^t − 1) ≤ e^{p_i(e^t − 1)},
because for any y, 1 + y ≤ e^y



• Applying Thm 4.3, we take the product of the MGFs to obtain
  M_X(t) = Π_{i=1}^n M_{X_i}(t) ≤ Π_{i=1}^n e^{p_i(e^t − 1)} = exp((e^t − 1)·μ)

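• A tiny numeric check (our addition, with arbitrary p_i and t; math.prod needs Python 3.8+) compares the exact product against the exponential upper bound:

    import math

    ps = [0.1, 0.5, 0.3, 0.7]   # arbitrary success probabilities
    t = 0.4
    mgf   = math.prod(1 + p * (math.exp(t) - 1) for p in ps)
    bound = math.exp((math.exp(t) - 1) * sum(ps))
    print(mgf, "<=", bound)     # exact M_X(t) vs exp((e^t - 1) * mu)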
• We can now develop concrete versions of the Chernoff bound for a sum of Poisson trials
• We begin with bounds on the deviation above the mean


Theorem 4.4: Let X_1, …, X_n be independent Poisson trials s.t. Pr(X_i = 1) = p_i. Let X = Σ_{i=1}^n X_i and μ = E[X]. Then the following Chernoff bounds hold:
1. for any δ > 0,
  Pr(X ≥ (1 + δ)μ) < (e^δ/(1 + δ)^{1+δ})^μ;
2. for 0 < δ ≤ 1,
  Pr(X ≥ (1 + δ)μ) ≤ e^{−μδ²/3};
3. for R ≥ 6μ,
  Pr(X ≥ R) ≤ 2^{−R}.
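• To get a feel for these bounds, the following snippet (our addition, for an arbitrary binomial example; math.comb needs Python 3.8+) compares the exact tail of a Bin(n, p) variable with bounds 1 and 2:

    import math

    n, p, delta = 200, 0.5, 0.2
    mu = n * p

    # exact tail Pr[X >= (1 + delta) * mu] for X ~ Bin(n, p)
    k0 = math.ceil((1 + delta) * mu)
    exact = sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k)
                for k in range(k0, n + 1))

    bound1 = (math.exp(delta) / (1 + delta) ** (1 + delta)) ** mu  # bound 1
    bound2 = math.exp(-mu * delta ** 2 / 3)                        # bound 2
    print(exact, "<", bound1, "<=", bound2)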
