Ma 3/103 KC Border
Introduction to Probability and Statistics Winter 2017
What happens for other kinds of random variables? Fortunately we do not need to know the
details of the distribution to prove a Law of Averages. But we start with some preliminaries.
Markov’s Inequality bounds the probability of large values of a nonnegative random variable in terms of its expectation. It is a very crude bound, but it is just what we need (Pitman [9, p. 174]).

Markov’s Inequality: If X is a nonnegative random variable, then for every a > 0,
\[
P(X \geq a) \leq \frac{E X}{a}.
\]
[Figure 7.1 (not reproduced): two panels of point plots over the horizontal range 0 to 1, with vertical scales up to about 0.15 and 0.04 respectively.]
Proof: Use the fact that expectation is a positive linear operator. Recall that this implies that if X ⩾ Y, then E X ⩾ E Y.

Let $\mathbf{1}_{[a,\infty)}$ be the indicator function of the interval $[a,\infty)$. Note that
\[
X\,\mathbf{1}_{[a,\infty)}(X) =
\begin{cases}
X & \text{if } X \geq a\\
0 & \text{if } X < a.
\end{cases}
\]
Since X ⩾ 0, this implies
\[
X \;\geq\; X\,\mathbf{1}_{[a,\infty)}(X) \;\geq\; a\,\mathbf{1}_{[a,\infty)}(X).
\]
Okay, so
\begin{align*}
E X &\geq E\big( X\,\mathbf{1}_{[a,\infty)}(X) \big)\\
    &\geq E\big( a\,\mathbf{1}_{[a,\infty)}(X) \big)\\
    &= a\, E\big( \mathbf{1}_{[a,\infty)}(X) \big)\\
    &= a\, P\big( \mathbf{1}_{[a,\infty)}(X) = 1 \big)\\
    &= a\, P(X \geq a).
\end{align*}
Divide by a > 0 to get P(X ⩾ a) ⩽ E X / a.
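As a quick numerical illustration (a minimal Python sketch; the Exponential(1) distribution and the values of a are arbitrary choices, not part of the notes), we can compare P(X ⩾ a) with the Markov bound E X / a on simulated data:

```python
# Monte Carlo check of Markov's inequality P(X >= a) <= E[X]/a for a
# nonnegative random variable.  The Exponential(1) distribution is an
# arbitrary illustrative choice.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)   # X >= 0, E[X] = 1

for a in [0.5, 1.0, 2.0, 4.0]:
    tail = np.mean(x >= a)        # estimate of P(X >= a)
    bound = x.mean() / a          # Markov bound E[X]/a
    print(f"a = {a}: P(X >= a) ~ {tail:.4f} <= E[X]/a ~ {bound:.4f}")
```

The bound is indeed crude: for a = 4 the true tail probability is roughly e⁻⁴ ≈ 0.018, while the bound is 0.25.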
7.2.2 Chebychev
Chebychev’s Inequality bounds the deviation from the mean in terms of the number of standard deviations (Pitman [9, p. 191]).
7.2.2 Proposition (Chebychev’s Inequality, version 1) Let X be a random variable with finite mean µ and variance σ² (and standard deviation σ). For every a > 0,
\[
P(|X - \mu| \geq a) \leq \frac{\sigma^2}{a^2}.
\]
Proof:
\begin{align*}
P(|X - \mu| \geq a)
  &= P\big( (X - E X)^2 \geq a^2 \big) && \text{square both sides}\\
  &\leq \frac{E (X - E X)^2}{a^2} && \text{Markov's Inequality}\\
  &= \frac{\sigma^2}{a^2} && \text{definition of variance,}
\end{align*}
where the inequality follows from Markov’s Inequality applied to the random variable (X − E X)² and the constant a².
Recall that given a random variable X with finite mean µ and variance σ², the standardization of X is the random variable X* defined by (Pitman [9, p. 190])
\[
X^* = \frac{X - \mu}{\sigma}.
\]
Recall that E X* = 0 and Var X* = 1.

Then we may rewrite Chebychev’s Inequality as:
7.2.3 Proposition (Chebychev’s Inequality, version 2) (Pitman [9, p. 191]) Let X be a random variable with finite mean µ and variance σ² (and standard deviation σ), and standardization X*. For every k > 0,
\[
P(|X^*| \geq k) \leq \frac{1}{k^2}.
\]
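For a numerical sanity check (again a sketch; the Gamma(2, 1) distribution is an arbitrary choice), we can standardize a simulated sample and compare P(|X*| ⩾ k) with 1/k²:

```python
# Check of Chebychev's inequality (version 2): P(|X*| >= k) <= 1/k^2.
# The Gamma(2, 1) distribution is an arbitrary illustrative choice.
import numpy as np

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=1.0, size=1_000_000)

x_star = (x - x.mean()) / x.std()          # (sample) standardization of X
for k in [1, 2, 3, 5]:
    tail = np.mean(np.abs(x_star) >= k)    # estimate of P(|X*| >= k)
    print(f"k = {k}: P(|X*| >= k) ~ {tail:.5f} <= 1/k^2 = {1 / k**2:.5f}")
```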
Here the Sn are the partial sums of the sequence X1, X2, ... of independent and identically distributed random variables,
\[
S_n = X_1 + \cdots + X_n.
\]
Let µ = E X1 (= E Xi for any i, since the Xi are identically distributed). Since expectation is a linear operator, we have
\[
E S_n = n\mu, \qquad\text{so}\qquad E\Big(\frac{S_n}{n}\Big) = \mu.
\]
Let σ² = Var X1 (= Var Xi for any i, since the Xi are identically distributed). Since the random variables Xi are independent, we have
\[
\operatorname{Var} S_n = n\sigma^2 \qquad\text{and}\qquad \operatorname{SD} S_n = \sqrt{n}\,\sigma.
\]
Recall that for any random variable Y, we have Var(aY) = a² Var Y. Thus
\[
\operatorname{Var}\Big(\frac{S_n}{n}\Big) = \frac{1}{n^2}\operatorname{Var} S_n = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}.
\]
Thus the standard deviation (the square root of the variance) of Sn/n is given by
\[
\operatorname{SD}\Big(\frac{S_n}{n}\Big) = \frac{\sigma}{\sqrt{n}}.
\]
(Pitman [9, p. 194] calls this the Square Root Law.)
Note that the variance of the sum of n independent and identically distributed random variables grows linearly with n, so the standard deviation grows like √n; but the variance of the average is proportional to 1/n, so the standard deviation is proportional to 1/√n.
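This scaling is easy to see in a simulation (a sketch only; fair-coin tosses, so σ = 1/2, are an arbitrary choice): the empirical standard deviation of Sn/n tracks σ/√n.

```python
# The Square Root Law numerically: SD(S_n / n) = sigma / sqrt(n).
# Fair-coin indicator tosses (sigma = 1/2) are an arbitrary illustrative choice.
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.5
for n in [10, 100, 1_000, 10_000]:
    averages = rng.binomial(n, 0.5, size=20_000) / n     # 20,000 draws of S_n / n
    print(f"n = {n:6d}: empirical SD = {averages.std():.4f}, "
          f"sigma/sqrt(n) = {sigma / np.sqrt(n):.4f}")
```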
Aside: This is confusing to many people, and can lead to bad decision making. The Nobel prize-winning economist Paul Samuelson [10] reports that circa 1963 he offered “some lunch colleagues to bet each $200 to $100 that the side of coin they specified would not appear at the first toss.”

A “distinguished colleague” declined the bet, but said, “I’ll take you on if you promise to let me make 100 such bets.” Samuelson explains that, “He and many others give something like the following explanation. ‘One toss is not enough to make it reasonably sure that the law of averages will turn out in my favor. But in a hundred tosses of a coin, the law of large numbers will make it a darn good bet. I am, so to speak, virtually sure to come out ahead in such a sequence ...’”

Samuelson points out that this is not what the Law of Large Numbers guarantees. He says that his colleague should instead have asked to subdivide the risk into a sequence of 100 bets, each of which was $1 against $2.
Let’s compare the means, standard deviations, and probabilities of losing money:

Bet                               Expected Value    Std. Dev.     Probability of being a net loser
$200 v. $100 on one coin flip     $50               $111.80       0.5
100 bets of $200 v. $100          $5000             $1,118.03     0.0003
100 bets of $2 v. $1              $50               $11.10        0.0003
Betting $2 to $1 has expectation $0.50 and standard deviation $1.11, so 100 such bets has mean $50
and standard deviation $11.10, which is a lot less risky than the first case.
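A small Monte Carlo sketch (illustration only; it assumes a fair coin on every toss) of the chance of ending up a net loser under each bet:

```python
# Monte Carlo estimate of the probability of being a net loser under each of
# Samuelson's bets, assuming a fair coin on every toss.  Illustrative sketch only.
import numpy as np

rng = np.random.default_rng(3)
trials = 1_000_000

def prob_net_loss(win, lose, n_bets):
    # winnings over n_bets bets: +win on a correct call, -lose otherwise
    wins = rng.binomial(n_bets, 0.5, size=trials)
    total = wins * win - (n_bets - wins) * lose
    return np.mean(total < 0)

print("1 bet of $200 v. $100:   ", prob_net_loss(200, 100, 1))
print("100 bets of $200 v. $100:", prob_net_loss(200, 100, 100))
print("100 bets of $2 v. $1:    ", prob_net_loss(2, 1, 100))
```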
The Weak Law of Large Numbers asserts that for every ε > 0,
\[
P\Big( \Big|\frac{S_n}{n} - \mu\Big| \geq \varepsilon \Big) \longrightarrow 0 \quad\text{as } n \to \infty,
\]
where
\[
S_n = \sum_{i=1}^{n} X_i, \qquad n = 1, 2, 3, \ldots.
\]
Proof: Note that if S is the sample space for a single experiment, this statement never requires consideration of a sample space more complicated than $S^n$. The result is simply Chebychev’s Inequality (version 1) applied to the random variable Sn/n, which has mean µ and variance σ²/n:
\[
P\Big( \Big|\frac{S_n}{n} - \mu\Big| \geq \varepsilon \Big) \leq \frac{\sigma^2/n}{\varepsilon^2}.
\]
An important fact about this theorem is that it tells us how fast (in n) the probability of the deviation of the average from the expected value goes to zero: it is bounded by a constant times 1/n.
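To see the 1/n rate at work, here is a sketch (fair-coin tosses, µ = 1/2, σ² = 1/4, and ε = 0.05 are arbitrary choices) comparing the simulated deviation probability with the Chebychev bound:

```python
# Compare P(|S_n/n - mu| >= eps) with the Chebychev bound (sigma^2/n)/eps^2
# for fair-coin tosses (mu = 1/2, sigma^2 = 1/4).  Illustrative sketch only.
import numpy as np

rng = np.random.default_rng(4)
mu, var, eps, reps = 0.5, 0.25, 0.05, 100_000

for n in [100, 400, 1_600, 6_400]:
    averages = rng.binomial(n, 0.5, size=reps) / n
    empirical = np.mean(np.abs(averages - mu) >= eps)
    bound = (var / n) / eps**2
    print(f"n = {n:5d}: simulated probability ~ {empirical:.4f}, "
          f"Chebychev bound = {bound:.4f}")
```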
We say that Yn converges in probability to Y, written $\operatorname*{plim}_{n\to\infty} Y_n = Y$, if for every ε > 0, $\lim_{n\to\infty} P(|Y_n - Y| > \varepsilon) = 0$; or equivalently
\[
(\forall \varepsilon > 0)\ \Big[ \lim_{n\to\infty} P\big( |Y_n - Y| \leq \varepsilon \big) = 1 \Big];
\]
or equivalently
\[
(\forall \varepsilon > 0)(\forall \delta > 0)(\exists N)(\forall n \geq N)\ \big[ P\big( |Y_n - Y| > \varepsilon \big) < \delta \big].
\]
Then
\[
\operatorname*{plim}_{n\to\infty} \frac{S_n}{n} = \mu.
\]
Note that this formulation throws away the information on the rate of convergence.
This statement is taken from Feller [5, Theorem 1, p. 238], and the proof may be found on pages 238–240. (His statement is for the case µ = 0, but it trivially implies this apparent generalization.)
The next result is a converse to the strong law that shows that finiteness of the expectation
cannot be dispensed with. It may be found in Feller [5, Theorem 4, p. 241].
One implication of this is that if the expectation is not finite, the averages will become arbitrarily
large infinitely often with probability one.
With all these $P^n$’s and $P^\infty$’s running around, when dealing with sequences of independent and identically distributed random variables, I may simply write Prob, knowing that they all agree.

1 The set of events is called the product σ-algebra $E^n$, and I won’t go into details about that now.
More generally, for any sequence Y1, Y2, ... on the common probability space (S, E, P), we say that Yn converges almost surely to Y, written Yn → Y a.s., if
\[
P\{\, s \in S : Y_n(s) \to Y(s) \,\} = 1.
\]
Let
\[
A_n(\varepsilon) = \big( |Y_n - Y| \leq \varepsilon \big) = \{\, s \in S : |Y_n(s) - Y(s)| \leq \varepsilon \,\}.
\]
We can rewrite convergence in probability as
\[
\operatorname*{plim}_{n\to\infty} Y_n = Y
\iff
(\forall k)(\forall m)(\exists M)(\forall n \geq M)\ \big[ P\big( A_n(1/k) \big) > 1 - (1/m) \big].
\tag{1}
\]
Now observe that the event
\begin{align*}
(Y_n \to Y) &= \{\, s \in S : Y_n(s) \to Y(s) \,\}\\
&= \{\, s \in S : (\forall \varepsilon > 0)(\exists N)(\forall n \geq N)\ [\, |Y_n(s) - Y(s)| < \varepsilon \,] \,\}\\
&= \{\, s \in S : (\forall k)(\exists N)(\forall n \geq N)\ [\, |Y_n(s) - Y(s)| < 1/k \,] \,\}.
\end{align*}
In other words,
\[
Y_n \to Y \text{ a.s.} \iff
P\Big( \bigcap_{k=1}^{\infty} \bigcup_{N=1}^{\infty} \bigcap_{n=N}^{\infty} A_n(1/k) \Big) = 1.
\tag{2}
\]
The following argument uses the simple facts that for any countably additive probability measure P, if $P\big(\bigcap_j A_j\big) = 1$, then $P(A_j) = 1$ for all j; and if $P\big(\bigcup_{n=1}^{\infty} A_n\big) = 1$, then for each δ > 0 there exists N such that $P\big(\bigcup_{j=1}^{N} A_j\big) > 1 - \delta$.

So observe that
\begin{align*}
& P\Big( \bigcap_{k=1}^{\infty} \bigcup_{N=1}^{\infty} \bigcap_{n=N}^{\infty} A_n(1/k) \Big) = 1\\
\implies\ & (\forall k)\ \Big[ P\Big( \bigcup_{N=1}^{\infty} \bigcap_{n=N}^{\infty} A_n(1/k) \Big) = 1 \Big]\\
\implies\ & (\forall k)(\forall m)(\exists M)\ \Big[ P\Big( \bigcup_{N=1}^{M} \bigcap_{n=N}^{\infty} A_n(1/k) \Big) > 1 - (1/m) \Big].
\end{align*}
But $\bigcup_{N=1}^{M} \bigcap_{n=N}^{\infty} A_n(1/k) = \bigcap_{n=M}^{\infty} A_n(1/k)$ (because letting $E_N = \bigcap_{n=N}^{\infty} A_n(1/k)$, we have $E_1 \subset E_2 \subset \cdots \subset E_M$), so
\begin{align*}
\implies\ & (\forall k)(\forall m)(\exists M)\ \Big[ P\Big( \bigcap_{n=M}^{\infty} A_n(1/k) \Big) > 1 - (1/m) \Big]\\
\implies\ & (\forall k)(\forall m)(\exists M)(\forall n \geq M)\ \big[ P\big( A_n(1/k) \big) > 1 - (1/m) \big],
\end{align*}
so by (1),
\[
\implies\ \operatorname*{plim}_{n\to\infty} Y_n = Y.
\]
So we have proven the following proposition.

7.9.2 Proposition
\[
Y_n \xrightarrow{\ \text{a.s.}\ } Y \implies Y_n \xrightarrow{\ P\ } Y.
\]
\[
Y_1 = \mathbf{1}_{[0,\frac{1}{2})},\quad Y_2 = \mathbf{1}_{[\frac{1}{2},1)};\qquad
Y_3 = \mathbf{1}_{[0,\frac{1}{4})},\quad Y_4 = \mathbf{1}_{[\frac{1}{4},\frac{1}{2})},\quad
Y_5 = \mathbf{1}_{[\frac{1}{2},\frac{3}{4})},\quad Y_6 = \mathbf{1}_{[\frac{3}{4},1)};\qquad
Y_7 = \mathbf{1}_{[0,\frac{1}{8})},\ \text{etc.}
\]
Then $Y_n \xrightarrow{\ P\ } 0$, but $Y_n$ does not converge to 0 almost surely.
However we do have the following result.
7.9.3 Proposition If $Y_n \xrightarrow{\ P\ } Y$, then for some subsequence, $Y_{n_k} \xrightarrow{\ \text{a.s.}\ } Y$.
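To make the counterexample above concrete, here is a small sketch (the indexing below is mine, matching the blocks Y1–Y2, Y3–Y6, Y7–Y14, ...). For any fixed s the indices with Yn(s) = 1 never stop, even though P(Yn = 1) → 0; along the subsequence of block starts Y1, Y3, Y7, ... the values do converge to 0 for every s > 0, as Proposition 7.9.3 promises.

```python
# The "typewriter" sequence: block m consists of the 2^m indicators of the dyadic
# intervals [j/2^m, (j+1)/2^m).  P(Y_n = 1) -> 0, yet for every fixed s in [0, 1)
# exactly one indicator per block equals 1, so Y_n(s) = 1 infinitely often.
def Y(n, s):
    m, start = 1, 1
    while n >= start + 2**m:      # find the block m containing index n
        start += 2**m
        m += 1
    j = n - start                 # position within block m (0-based)
    return 1 if j / 2**m <= s < (j + 1) / 2**m else 0

s = 0.3                                            # any fixed point of [0, 1)
print([n for n in range(1, 200) if Y(n, s) == 1])  # one hit per block, forever
print([Y(2**m - 1, s) for m in range(1, 12)])      # block starts: eventually all 0
```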
Recalling that the indicator function 1(−∞,x] (y) is equal to one if y ⩽ x and zero otherwise,
we may rewrite this as
\[
F_n(x) = \frac{\sum_{i=1}^{n} \mathbf{1}_{(-\infty,x]}(X_i)}{n}.
\]
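As a function of x, Fn is a step function that jumps by 1/n at each observation (by k/n at a value repeated k times). A minimal sketch of this definition (the function name and sample values below are illustrative only):

```python
# Empirical cdf: F_n(x) = (number of observations <= x) / n.
import numpy as np

def empirical_cdf(sample):
    data = np.sort(np.asarray(sample, dtype=float))
    def F_n(x):
        # fraction of observations that are <= x
        return np.searchsorted(data, x, side="right") / len(data)
    return F_n

F = empirical_cdf([0.4, 2.1, 1.7, 0.4])       # made-up sample of size 4
print(F(0.0), F(0.4), F(1.7), F(3.0))         # 0.0 0.5 0.75 1.0
```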
7.10.2 Example Let Xi , i = 1, 2, be independent uniform random variables on the finite set
{0, 1, 2}. (One way to get such a random variable is to roll a die, divide the result by 3, and
take the remainder. Then 1 or 4 gives Xi = 1, 2 or 5 gives Xi = 2, and 3 or 6 gives Xi = 0.)
For the repeated experiments there are 9 possible outcomes:
(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2).
Each of these outcomes has an empirical cumulative distribution function F2 associated with it. There are, however, only 6 distinct functions, since different points in the sample space may give rise to the same empirical cdf. Table 7.1 shows the empirical cdf associated with each point in the sample space. You should check that if, for each x, you take the values of the empirical cdfs at x and weight them by the probabilities associated with each empirical cdf, the resulting function is the cdf of the distribution of each Xi:

[Plot of the common cdf of the Xi, which steps from 0 to 1/3 at x = 0, to 2/3 at x = 1, and to 1 at x = 2, not reproduced.]

□
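The check suggested in the example can be carried out mechanically; here is a short sketch (exact arithmetic with fractions, enumerating all nine equally likely sample points):

```python
# Verify the claim of Example 7.10.2: averaging the empirical cdfs F_2 over the
# nine equally likely sample points recovers the true cdf of the X_i.
from fractions import Fraction
from itertools import product

support = [0, 1, 2]                               # X_i is uniform on {0, 1, 2}

def F2(sample, x):                                # empirical cdf of a size-2 sample
    return Fraction(sum(1 for xi in sample if xi <= x), len(sample))

for x in support:
    avg = sum(F2(pt, x) for pt in product(support, repeat=2)) / 9
    true = Fraction(sum(1 for v in support if v <= x), 3)
    print(f"x = {x}: average of empirical cdfs = {avg}, true F(x) = {true}")
```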
Now assume that X1 , X2 , . . . , Xn are independent and identically distributed, with common
cumulative distribution function F . Since
\[
E\,\mathbf{1}_{(-\infty,x]}(X_i) = \operatorname{Prob}(X_i \leq x) = F(x),
\]
the Strong Law of Large Numbers implies that for each fixed x, the empirical cdf Fn(x) converges to F(x) with probability one.
Sample points          F2 on [0, 1)    F2 on [1, 2)    F2 on [2, ∞)
{(0, 0)}               1               1               1
{(0, 1), (1, 0)}       1/2             1               1
{(0, 2), (2, 0)}       1/2             1/2             1
{(1, 1)}               0               1               1
{(1, 2), (2, 1)}       0               1/2             1
{(2, 2)}               0               0               1

Table 7.1. Empirical cdfs at different points in the sample space for Example 7.10.2. (Each F2 is 0 to the left of 0.)
I shall omit the proof. It uses the Borel–Cantelli Lemma, [2, p. 44]. The practical implication
of the Glivenko–Cantelli Theorem is this:
• Since with probability one, the empirical cdf converges to the true cdf,
• we can estimate the cdf nonparametrically.
• Moreover, since the empirical cdf is also the cdf of a uniform random variable on the sample
values,
• with probability one, as the sample size n becomes large, resampling by drawing sample
points independently at random is almost the same as getting new independent sample points.
• This resampling scheme is called the bootstrap, and is the basis for many nonparametric statistical procedures. (A small sketch follows this list.)
• The Glivenko–Cantelli Theorem also guarantees that we can find rules for creating histograms of our data that converge to the underlying density from which the data are drawn.
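As promised, a minimal bootstrap sketch (the data, the statistic, and the number of resamples are arbitrary illustrative choices): drawing with replacement from the sample is exactly drawing from the empirical cdf, and the spread of the resampled statistic estimates its sampling variability.

```python
# Minimal bootstrap sketch: resample the data with replacement (i.e. draw from
# the empirical cdf) to estimate the sampling variability of a statistic.
import numpy as np

rng = np.random.default_rng(5)
data = rng.exponential(scale=2.0, size=200)       # stand-in for an observed sample

boot_medians = np.array([
    np.median(rng.choice(data, size=data.size, replace=True))
    for _ in range(5_000)                          # 5,000 bootstrap resamples
])
print("sample median:", np.median(data))
print("bootstrap standard error of the median:", boot_medians.std())
```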
• Herbert Sturges [14] argued that if the sample size n is of the form $2^k$, and the data are approximately Normal, we can choose k + 1 bins so that the number in bin j should be close to the binomial coefficient $\binom{k}{j}$. “For example, 16 items would be divided normally into 5 classes, with class frequencies, 1, 4, 6, 4, 1.” For sample sizes n that are not powers of two, he argued that the range of the data should be divided into the number of bins that is the largest “convenient” integer (e.g., a multiple of 5) that is less than or equal to
\[
1 + \log_2 n.
\]
• David Freedman and Persi Diaconis [6] argue that one should try to minimize the L2 -norm
of the difference between the normalized histogram and the density. This depends on unknown
parameters, but they argued that the following rule is simple and approximately correct:
Choose the cell width as twice the interquartile range of the data, divided by the
cube root of the sample size.
• David Scott [11] argues that the mean-square error minimizing bin width for large n is given by
\[
\left( \frac{6}{n \int_{-\infty}^{\infty} f'(x)^2\, dx} \right)^{1/3}.
\]
• Matt Wand [16] derived more sophisticated methods for binning data based on kernel
estimation theory.
• Kevin Knuth [7] has recently suggested an algorithmic method for binning based on a
Bayesian procedure. It reportedly works better for densities that are not unimodal.
Each of these methods will create histograms that approximate the density better and better
as the sample size increases. Both R and Mathematica have options for using the Sturges,
Freedman–Diaconis, and Scott rules in plotting histograms. Mathematica implements Wand’s
method and it is available as an R package (dpih). Mathematica also implements Knuth’s rule.
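For concreteness, here is a sketch computing the Sturges, Freedman–Diaconis, and Scott choices for one simulated sample (the Scott width below uses the Normal-reference evaluation of the formula above, which works out to about 3.49 s n^{-1/3}; NumPy's histogram_bin_edges supports the same named rules via bins="sturges", "fd", and "scott"):

```python
# Bin-count / bin-width rules applied to one simulated sample (sketch only).
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=1_000)
n = x.size

sturges_bins = int(np.ceil(1 + np.log2(n)))                              # Sturges
fd_width = 2 * (np.percentile(x, 75) - np.percentile(x, 25)) / n**(1/3)  # Freedman-Diaconis
scott_width = 3.49 * x.std() / n**(1/3)         # Scott, Normal-reference constant

print("Sturges bin count:", sturges_bins)
print("Freedman-Diaconis bin width:", round(fd_width, 3))
print("Scott bin width:", round(scott_width, 3))
print("numpy's Sturges choice:",
      len(np.histogram_bin_edges(x, bins="sturges")) - 1, "bins")
```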
Bibliography
[1] C. D. Aliprantis and K. C. Border. 2006. Infinite dimensional analysis: A hitchhiker’s guide,
3d. ed. Berlin: Springer–Verlag.
[2] L. Breiman. 1968. Probability. Reading, Massachusetts: Addison Wesley.
[3] K. L. Chung. 1974. A course in probability theory, 2d. ed. Number 21 in Probability and
Mathematical Statistics. Orlando, Florida: Academic Press.
[5] W. Feller. 1971. An introduction to probability theory and its applications, 2d. ed., vol-
ume 2. New York: Wiley.
[6] D. Freedman and P. Diaconis. 1981. On the histogram as a density estimator: L2 theory.
Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 57(4):453–476.
DOI: 10.1007/BF01025868
[8] R. J. Larsen and M. L. Marx. 2012. An introduction to mathematical statistics and its
applications, fifth ed. Boston: Prentice Hall.
[9] J. Pitman. 1993. Probability. Springer Texts in Statistics. New York, Berlin, and Heidel-
berg: Springer.
[10] P. A. Samuelson. 1963. Risk and uncertainty: A fallacy of large numbers. Scientia 98:108–
113. https://www.casact.org/pubs/forum/94sforum/94sf049.pdf
[14] H. A. Sturges. 1926. The choice of a class interval. Journal of the American Statistical
Association 21(153):65–66. http://www.jstor.org/stable/2965501
[15] B. L. van der Waerden. 1969. Mathematical statistics. Number 156 in Grundlehren der
mathematischen Wissenschaften in Einzeldarstellungen mit besonderer Berücksichtigung
der Anwendungsgebiete. New York, Berlin, and Heidelberg: Springer–Verlag. Translated
by Virginia Thompson and Ellen Sherman from Mathematische Statistik, published by
Springer-Verlag in 1965, as volume 87 in the series Grundlehren der mathematischen Wissenschaften.