
All of Statistics

My proposed solutions to the book:


All of Statistics - A Concise Course in Statistical Inference.
I am publishing the code and PDF to the following GitHub repository:
https://github.com/CoveredInChocolate/AllStatistics

Let me know if you find any mistakes! I am sure there are plenty...

Contents
1 Probability
2 Random Variables
3 Expectation
4 Inequalities
5 Convergence of Random Variables
6 Models, Statistical Inference and Learning
7 Estimating the CDF and Statistical Functionals
8 The Bootstrap

1 Probability
Exercises
1.1
Proving the Continuity of Probabilities.

1.8 Theorem (Continuity of Probabilities). If An → A, then

P(An) → P(A)

as n → ∞.

Proof. Suppose that An is monotone increasing, so that A1 ⊂ A2 ⊂ ⋯. Let A = lim_{n→∞} An = ⋃_{i=1}^∞ Ai. Define:

B1 = A1
B2 = {ω ∈ Ω : ω ∈ A2, ω ∉ A1}
B3 = {ω ∈ Ω : ω ∈ A3, ω ∉ A2, ω ∉ A1}
⋮

(1) Showing that the Bi are disjoint sets.

We have B1 = A1. Since B2 = {ω ∈ Ω : ω ∈ A2, ω ∉ A1}, we can rewrite this as B2 = A2 − A1
by definition of set difference. Since B3 = {ω ∈ Ω : ω ∈ A3, ω ∉ A2, ω ∉ A1} and since A1 ⊂ A2,
we can rewrite this as B3 = A3 − (A1 ∪ A2). In general, Bk = Ak − (⋃_{i=1}^{k−1} Ai).
Take arbitrary sets Bm, Bp for some m, p ∈ ℕ. Without loss of generality, assume m > p. Then:

Bm ∩ Bp = (Am − ⋃_{i=1}^{m−1} Ai) ∩ (Ap − ⋃_{i=1}^{p−1} Ai)
        = (Am ∩ (⋃_{i=1}^{m−1} Ai)^c) ∩ (Ap ∩ (⋃_{i=1}^{p−1} Ai)^c)
        = (Am ∩ A1^c ∩ A2^c ∩ … ∩ Ap^c ∩ … ∩ Am−1^c) ∩ (Ap ∩ A1^c ∩ A2^c ∩ … ∩ Ap−1^c)   [De Morgan's law]
        = Am ∩ A1^c ∩ A2^c ∩ … ∩ (Ap^c ∩ Ap) ∩ … ∩ Am−1^c   [reshuffling terms and repeated use of C ∩ C = C]
        = ∅,

since Ap^c ∩ Ap = ∅. Since Bm ∩ Bp = ∅, they are disjoint. (Quite certain this is a correct argument,
but it could have been made a bit easier by going directly to the A3 ∩ A2^c ∩ A1^c version of the sets.)

(2) Showing that

An = ⋃_{i=1}^n Ai = ⋃_{i=1}^n Bi

for each n. We show this with an induction argument. Note that A1 = B1, and A2 ∪ A1 = A2 since
A1 ⊂ A2. Note also that:

B2 ∪ B1 = (A2 ∩ A1^c) ∪ A1
        = (A2 ∪ A1) ∩ (A1^c ∪ A1)
        = (A2 ∪ A1) ∩ Ω
        = A2 ∪ A1
        = A2

So this is true for n = 2. Assume that Ak = ⋃_{i=1}^k Ai = ⋃_{i=1}^k Bi for some k ∈ ℕ. We show
that this extends to k + 1:

⋃_{i=1}^{k+1} Ai = Ak+1 ∪ (⋃_{i=1}^k Ai) = Ak+1 ∪ Ak = Ak+1

⋃_{i=1}^{k+1} Bi = Bk+1 ∪ (⋃_{i=1}^k Bi)
        = Bk+1 ∪ Ak
        = (Ak+1 ∩ Ak^c ∩ … ∩ A1^c) ∪ Ak
        = (Ak+1 ∩ (Ak ∪ … ∪ A1)^c) ∪ Ak
        = (Ak+1 ∩ Ak^c) ∪ Ak   [since A1 ⊂ … ⊂ Ak, so Ak ∪ … ∪ A1 = Ak]
        = (Ak+1 ∪ Ak) ∩ (Ak^c ∪ Ak)
        = (Ak+1 ∪ Ak) ∩ Ω
        = Ak+1 ∪ Ak
        = Ak+1

The result follows by induction.
(3) Showing that

⋃_{i=1}^∞ Bi = ⋃_{i=1}^∞ Ai.   (♦)

By definition:

A = lim_{n→∞} An = ⋃_{i=1}^∞ Ai.

By step (2):

A = lim_{n→∞} An = lim_{n→∞} ⋃_{i=1}^n Ai = lim_{n→∞} ⋃_{i=1}^n Bi = ⋃_{i=1}^∞ Bi.

Hence, (♦) is satisfied. Combining the three steps with countable additivity gives the conclusion:

P(An) = P(⋃_{i=1}^n Bi) = Σ_{i=1}^n P(Bi) → Σ_{i=1}^∞ P(Bi) = P(⋃_{i=1}^∞ Bi) = P(A)   as n → ∞.

1.2
Proving some well-known results using the axioms:

P(A) ≥ 0 for every event A   (A.1)
P(Ω) = 1   (A.2)
P(⋃_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai)   (A.3)

where A1, A2, … are disjoint in (A.3).


• Claim: P(∅) = 0.
Proof. Since Ω ∩ ∅ = ∅ and the empty set is disjoint with itself, we can set E1 = Ω and
Ek = ∅ for all k ≥ 2. Now assume for contradiction that P(∅) = a > 0 for some a ∈ ℝ. Then by
(A.3):

P(⋃_{i=1}^∞ Ei) = Σ_{i=1}^∞ P(Ei) = P(Ω) + Σ_{i=2}^∞ P(∅) = 1 + Σ_{i=2}^∞ a = ∞

On the other hand, the union of the Ei is just Ω, so by (A.2):

P(⋃_{i=1}^∞ Ei) = P(Ω) = 1

We have reached a contradiction. This shows that P(∅) = 0.


• Claim: A ∩ B = ∅ =⇒ P(A ∪ B) = P(A) + P(B).
Proof. Set E1 = A, E2 = B and Ek = ∅ for all k ≥ 3. Then:

P(A ∪ B) = P(⋃_{i=1}^∞ Ei) = Σ_{i=1}^∞ P(Ei) = P(A) + P(B) + Σ_{k=3}^∞ P(∅) = P(A) + P(B) + 0 = P(A) + P(B)

With this result, we can apply (A.3) indirectly to any finite union of disjoint sets: all other sets
are set to ∅ and then (A.3) is applied. Not giving a formal argument, but if A, B, C are mutually
disjoint we can set Ek = ∅ for all k ≥ 4:

P(A ∪ B ∪ C) = P(⋃_{i=1}^∞ Ei) = Σ_{i=1}^∞ P(Ei) = P(A) + P(B) + P(C) + 0 = P(A) + P(B) + P(C)

• Claim: A ⊂ B =⇒ P(A) ≤ P(B).
Proof. Since A ⊂ B, we can split B into two disjoint parts: B = A ∪ (B − A). By axioms (A.3) and (A.1):

P(B) = P(A ∪ (B − A)) = P(A) + P(B − A) ≥ P(A) =⇒ P(A) ≤ P(B)

• Claim: 0 ≤ P(A) ≤ 1.
Proof. Since A ⊂ Ω and by the previous claim, P(A) ≤ P(Ω) = 1. Combining this with axiom (A.1):

0 ≤ P(A) ≤ P(Ω) = 1 =⇒ 0 ≤ P(A) ≤ 1

• Claim: P(A^c) = 1 − P(A).
Proof. Since A and A^c are disjoint, the finite version of (A.3) gives:

P(A ∪ A^c) = P(A) + P(A^c)

Since Ω = A ∪ A^c, we also get by (A.2):

P(A ∪ A^c) = P(Ω) = 1

Putting them together:

P(A) + P(A^c) = 1 =⇒ P(A^c) = 1 − P(A)

1.3
Ω is a sample space and A1, A2, … are events. We define:

Bn = ⋃_{i=n}^∞ Ai,   Cn = ⋂_{i=n}^∞ Ai

(a) Claim: B1 ⊃ B2 ⊃ ⋯
Proof. This follows directly from the definition of Bn. First we see that:

B1 = ⋃_{i=1}^∞ Ai = A1 ∪ (⋃_{i=2}^∞ Ai) ⊃ ⋃_{i=2}^∞ Ai = B2 =⇒ B1 ⊃ B2

The same argument can be used for any k ∈ ℕ:

Bk = ⋃_{i=k}^∞ Ai = Ak ∪ (⋃_{i=k+1}^∞ Ai) ⊃ ⋃_{i=k+1}^∞ Ai = Bk+1 =⇒ Bk ⊃ Bk+1

From this, it follows that B1 ⊃ B2 ⊃ ⋯.


Claim: C1 ⊂ C2 ⊂ ⋯
Proof. This also follows directly from the definition. First note that:

C1 = ⋂_{i=1}^∞ Ai = A1 ∩ (⋂_{i=2}^∞ Ai) ⊂ ⋂_{i=2}^∞ Ai = C2 =⇒ C1 ⊂ C2

And, in general, for any k ∈ ℕ:

Ck = ⋂_{i=k}^∞ Ai = Ak ∩ (⋂_{i=k+1}^∞ Ai) ⊂ ⋂_{i=k+1}^∞ Ai = Ck+1 =⇒ Ck ⊂ Ck+1

This shows that C1 ⊂ C2 ⊂ ⋯.

(b) Claim:

ω ∈ ⋂_{n=1}^∞ Bn ⇐⇒ ω belongs to infinitely many of the events A1, A2, …

Proof.
⇒) Assume ω ∈ ⋂_{n=1}^∞ Bn, which means ω ∈ Bk = ⋃_{i=k}^∞ Ai for every k ∈ ℕ. So for every
k there is some index i ≥ k with ω ∈ Ai. Suppose ω belonged to only finitely many of the Ai.
Then there would be a largest index N with ω ∈ AN, and ω would not belong to
B_{N+1} = ⋃_{i=N+1}^∞ Ai, contradicting the assumption. Hence ω belongs to infinitely many of
the Ai, proving the first implication.

⇐) Now assume ω belongs to infinitely many of the Ai. Then for every k ∈ ℕ there is some i ≥ k
with ω ∈ Ai, so

ω ∈ Ai ⊂ ⋃_{i=k}^∞ Ai = Bk =⇒ ω ∈ Bk.

So ω is included in all Bk for k ∈ ℕ, which means ω is also in the intersection:

ω ∈ B1, B2, …, Bk, … =⇒ ω ∈ ⋂_{n=1}^∞ Bn

By implication both ways, we have proved the equivalence.
(c) Skipped.

1.4
Proving De Morgan's laws, first in the case i ∈ {1, 2, …, n}.
Claim:

(⋃_{i=1}^n Ai)^c = ⋂_{i=1}^n Ai^c

Proof. Starting with x ∈ (⋃_{i=1}^n Ai)^c, we have the following equivalences:

x ∈ (⋃_{i=1}^n Ai)^c ⇐⇒ x ∉ ⋃_{i=1}^n Ai
 ⇐⇒ x ∉ A1 and x ∉ A2 and … and x ∉ An
 ⇐⇒ x ∈ A1^c and x ∈ A2^c and … and x ∈ An^c
 ⇐⇒ x ∈ ⋂_{i=1}^n Ai^c

Claim:

(⋂_{i=1}^n Ai)^c = ⋃_{i=1}^n Ai^c

Proof. Starting with x ∈ (⋂_{i=1}^n Ai)^c, we have the following equivalences:

x ∈ (⋂_{i=1}^n Ai)^c ⇐⇒ x ∉ ⋂_{i=1}^n Ai
 ⇐⇒ ∃p ∈ {1, …, n} : x ∉ Ap
 ⇐⇒ ∃p ∈ {1, …, n} : x ∈ Ap^c
 ⇐⇒ x ∈ ⋃_{i=1}^n Ai^c

Proving these results for an arbitrary index set I is essentially the same argument: replace the
finite index set with i ∈ I throughout (the index set need not even be countable).

1.5
Tossing a fair coin until we get two heads in a row. Description of the sample space:

Ω = {(ω1, ω2, …, ωk−1, ωk) : ωi ∈ {H, T}, ωk−1 = ωk = H, and ωp−1 = H ⇒ ωp = T for p < k}

Now, to calculate the probability that there will be exactly k tosses, let us look at a few specific
examples:

k = 2 : {H,H} =⇒ P(2) = 1/2² = 1/4 = 0.25
k = 3 : {T,H,H} =⇒ P(3) = 1/2³ = 1/8 = 0.125

For k ≥ 4, the last three tosses have to be T,H,H, and we count the admissible prefixes:

k = 4 : prefixes H, T =⇒ P(4) = 2 · 1/2⁴ = 1/8 = 0.125
k = 5 : prefixes TT, TH, HT =⇒ P(5) = 3 · 1/2⁵ = 3/32 = 0.09375
k = 6 : prefixes TTT, TTH, THT, HTT, HTH =⇒ P(6) = 5 · 1/2⁶ = 5/64 = 0.078125
k = 7 : prefixes TTTT, TTTH, TTHT, THTT, HTTT, THTH, HTHT, HTTH =⇒ P(7) = 8 · 1/2⁷ = 1/16 = 0.0625

We can see a pattern emerging. The last three tosses are fixed, and for the first k − 3 tosses we
have to count all the combinations that do not contain two consecutive H. The details are quite
interesting, and it turns out that the number of admissible prefixes of the first k − 3 tosses follows
the Fibonacci sequence (as we can see above). Defining:

F(1) = 2, F(2) = 3, F(3) = 5, …

In general, we will then get:

P(K = k) = 1/2^k for k = 2, 3;   P(K = k) = F(k − 3)/2^k for k ≥ 4.
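The Fibonacci pattern can be checked by brute-force enumeration; here is a minimal Python sketch (the helper name `p_exact` is my own), counting the length-k sequences whose first run of two consecutive heads ends exactly at toss k:

```python
from itertools import product

def p_exact(k):
    """P(K = k): the first run of two consecutive heads ends exactly at toss k."""
    hits = sum(
        1
        for t in product("HT", repeat=k)
        if "".join(t).endswith("HH") and "HH" not in "".join(t)[:-1]
    )
    return hits / 2 ** k

# Matches the worked values 1/4, 1/8, 1/8, 3/32, 5/64, 1/16:
print([p_exact(k) for k in range(2, 8)])
```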

1.6
Let the sample space be Ω = {0, 1, 2, …} (which is {0} ∪ ℕ). We are going to prove that there
is no uniform distribution on this sample space. That is, if P(A) = P(B) whenever |A| = |B| (the
cardinality of the sets), then P cannot satisfy the axioms of probability.
Take some k ∈ ℕ and define A1 = {1, …, k}. If every finite set of this kind had probability 0, then
by countable additivity P(Ω) = Σ_i P({i}) = 0, contradicting P(Ω) = 1; so we may choose k such
that P(A1) > 0. Generally define Am = {mk + 1, mk + 2, …, mk + k}; then An ∩ Am = ∅ for
n ≠ m. (This can be checked by setting k = 100, n = 5 and m = 6, for instance.) Also |An| = |Am|,
so by uniformity P(An) = P(Am) = P(A1) > 0. This causes problems in the third axiom:

P(⋃_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai) = ∞,

since we are summing an infinite number of equal positive values. The axioms of probability are not
satisfied, so we cannot have a uniform distribution on {0, 1, 2, …}.

1.7
For events A1, A2, …, we are going to show that:

P(⋃_{n=1}^∞ An) ≤ Σ_{n=1}^∞ P(An).

Proof. Following the hint, we define:

Bn = An − ⋃_{i=1}^{n−1} Ai.

Next, we show that any Bm and Bn are disjoint for n, m ∈ ℕ with m > n.

Bm ∩ Bn = (Am − ⋃_{i=1}^{m−1} Ai) ∩ (An − ⋃_{i=1}^{n−1} Ai)
 = (Am ∩ (⋃_{i=1}^{m−1} Ai)^c) ∩ (An ∩ (⋃_{i=1}^{n−1} Ai)^c)
 = (Am ∩ ⋂_{i=1}^{m−1} Ai^c) ∩ (An ∩ ⋂_{i=1}^{n−1} Ai^c)
 = (Am ∩ Am−1^c ∩ … ∩ An^c ∩ … ∩ A2^c ∩ A1^c) ∩ (An ∩ An−1^c ∩ … ∩ A2^c ∩ A1^c)
 = Am ∩ Am−1^c ∩ … ∩ A2^c ∩ A1^c ∩ An−1^c ∩ … ∩ (An^c ∩ An)
 = ∅,

since An^c ∩ An = ∅.
Hence, the sets Bi are mutually disjoint. Next, we show that:

⋃_{n=1}^∞ An = ⋃_{n=1}^∞ Bn.

We will use an induction argument. For the first element we have B1 = A1, and:

⋃_{i=1}^2 Bi = B1 ∪ B2
 = A1 ∪ (A2 − A1)
 = A1 ∪ (A2 ∩ A1^c)
 = (A1 ∪ A2) ∩ (A1 ∪ A1^c)
 = (A1 ∪ A2) ∩ Ω
 = A1 ∪ A2
 = ⋃_{i=1}^2 Ai

So, we assume that for k ∈ ℕ we have:

⋃_{n=1}^k An = ⋃_{n=1}^k Bn,

and show that this applies to k + 1.

⋃_{i=1}^{k+1} Bi = Bk+1 ∪ (⋃_{i=1}^k Bi)
 = Bk+1 ∪ (⋃_{i=1}^k Ai)
 = (Ak+1 − ⋃_{i=1}^k Ai) ∪ (⋃_{i=1}^k Ai)
 = (Ak+1 ∩ (⋃_{i=1}^k Ai)^c) ∪ (⋃_{i=1}^k Ai)
 = (Ak+1 ∪ ⋃_{i=1}^k Ai) ∩ ((⋃_{i=1}^k Ai)^c ∪ ⋃_{i=1}^k Ai)
 = (⋃_{i=1}^{k+1} Ai) ∩ Ω
 = ⋃_{i=1}^{k+1} Ai

So, by induction, we have verified the equality for k + 1, and it follows that the finite unions are
equal for all n; letting n → ∞ gives the equality of the infinite unions as well.
By definition of Bn, we have Bn ⊂ An, which means P(Bn) ≤ P(An). Using this, we can prove
the main statement:

P(⋃_{n=1}^∞ An) = P(⋃_{n=1}^∞ Bn) = Σ_{n=1}^∞ P(Bn) ≤ Σ_{n=1}^∞ P(An).

1.8
Suppose that P(Ai) = 1 for each Ai. We are going to prove that:

P(⋂_{i=1}^∞ Ai) = 1.

Proof. We will prove this by induction. For events A1 and A2, we know that P(A1) = 1 and
P(A2) = 1. By the probability axioms:

A1 ⊂ A1 ∪ A2 ⊂ Ω =⇒ P(A1) ≤ P(A1 ∪ A2) ≤ P(Ω)

Using that P(A1) = 1 and P(Ω) = 1:

1 ≤ P(A1 ∪ A2) ≤ 1 =⇒ P(A1 ∪ A2) = 1

By Lemma 1.6:

P(A1 ∪ A2) = P(A1) + P(A2) − P(A1 ∩ A2)
1 = 2 − P(A1 ∩ A2)
P(A1 ∩ A2) = 1

Assume this holds for k ∈ ℕ, i.e.:

P(⋂_{i=1}^k Ai) = 1.

To simplify the notation, we call this intersection Wk, so P(Wk) = 1. Now we will show it also
applies to k + 1. Note that

A1 ∩ A2 ∩ … ∩ Ak ∩ Ak+1 = Wk ∩ Ak+1

Again, from the probability axioms:

Wk ⊂ Wk ∪ Ak+1 ⊂ Ω =⇒ P(Wk) ≤ P(Wk ∪ Ak+1) ≤ P(Ω)

Using that P(Wk) = 1 and P(Ω) = 1:

1 ≤ P(Wk ∪ Ak+1) ≤ 1 =⇒ P(Wk ∪ Ak+1) = 1

By Lemma 1.6, and using that P(Ai) = 1 for each i:

P(Wk ∪ Ak+1) = P(Wk) + P(Ak+1) − P(Wk ∩ Ak+1)
1 = 2 − P(Wk ∩ Ak+1)
P(Wk ∩ Ak+1) = 1
P((⋂_{i=1}^k Ai) ∩ Ak+1) = 1
P(⋂_{i=1}^{k+1} Ai) = 1

which verifies that equality holds for k + 1. By induction, P(⋂_{i=1}^k Ai) = 1 for every finite k.
Strictly speaking, induction only covers the finite intersections; for the infinite intersection, note
that by Exercise 1.7 and the complement rule:

P((⋂_{i=1}^∞ Ai)^c) = P(⋃_{i=1}^∞ Ai^c) ≤ Σ_{i=1}^∞ P(Ai^c) = 0,

so we can conclude that:

P(⋂_{i=1}^∞ Ai) = 1.

1.9
For a fixed B such that P(B) > 0, we will show that P(·|B) satisfies the axioms of probability.
Recalling the definition of conditional probability:

P(A|B) = P(A ∩ B)/P(B)

• Axiom 1. P(A|B) ≥ 0 for all A.

Proof. The set A ∩ B is an event with respect to P(·), so P(A ∩ B) = a ≥ 0. Then:

P(A|B) = P(A ∩ B)/P(B) = a/P(B) ≥ 0,

since P(B) > 0.

• Axiom 2. P(Ω|B) = 1.
Proof.

P(Ω|B) = P(Ω ∩ B)/P(B) = P(B)/P(B) = 1

• Axiom 3. If A1, A2, … are disjoint, then:

P(⋃_{i=1}^∞ Ai | B) = Σ_{i=1}^∞ P(Ai|B)

Proof. First a supporting argument. If we define Ci := Ai ∩ B for each i, then C1, C2, … are
mutually disjoint sets, and by (A.3):

P(⋃_{i=1}^∞ Ci) = Σ_{i=1}^∞ P(Ci).

This means:

P(⋃_{i=1}^∞ Ai | B) = P((⋃_{i=1}^∞ Ai) ∩ B)/P(B) = P(⋃_{i=1}^∞ Ci)/P(B) = Σ_{i=1}^∞ P(Ci)/P(B) = Σ_{i=1}^∞ P(Ai ∩ B)/P(B) = Σ_{i=1}^∞ P(Ai|B),

which proves the final axiom for the conditional probability.

1.10 Monty Hall

There are three doors, 1, 2, 3, and there is a prize behind one of them. We don't know which door
hides the prize, so the probability of selecting the correct door is evenly spread out:

P(1) = 1/3, P(2) = 1/3, P(3) = 1/3

To simplify the calculations, we assume we pick door 1 and that the host then opens door 3.
Now we will decide if it's better to switch or not. Write III for the event that the host opens
door 3. What is the probability of III, conditioned on where the prize is?

P(III|1) = 1/2   (correct door is selected; door 3 is opened at random)
P(III|2) = 1   (host must open door 3, never revealing the prize)
P(III|3) = 0   (host never reveals the prize)

Now, what is the probability that the prize is behind door number 2, given that door 3 was
opened? This can be calculated by Bayes' law:

P(Ak|B) = P(B|Ak)P(Ak) / Σ_{i=1}^n P(B|Ai)P(Ai)

In our case, the probability of door 2 containing the prize, given that door 3 was opened:

P(2|III) = P(III|2)P(2) / [P(III|1)P(1) + P(III|2)P(2) + P(III|3)P(3)]
 = (1)(1/3) / [(1/2)(1/3) + (1)(1/3) + (0)(1/3)]
 = (1/3) / (1/6 + 1/3 + 0)
 = (1/3) / (1/2)
 = 2/3

Probability that door 1 contains the prize, given that door 3 was opened:

P(1|III) = P(III|1)P(1) / [P(III|1)P(1) + P(III|2)P(2) + P(III|3)P(3)]
 = (1/2)(1/3) / [(1/2)(1/3) + (1)(1/3) + (0)(1/3)]
 = (1/6) / (1/2)
 = 1/3

This shows that we are better off switching doors than staying with the one we initially selected.
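As a sanity check (my own sketch, not part of the book), a quick simulation reproduces the 2/3 vs 1/3 split; note the win rates do not depend on how the host breaks ties when the first pick is correct:

```python
import random

def monty_trial(switch, rng):
    """Play one Monty Hall game; return True if the final pick wins the prize."""
    prize, pick = rng.randrange(3), rng.randrange(3)
    # host opens some door that is neither the contestant's pick nor the prize
    opened = next(d for d in range(3) if d not in (pick, prize))
    if switch:
        pick = next(d for d in range(3) if d not in (pick, opened))
    return pick == prize

rng = random.Random(0)  # seeded for reproducibility
n = 100_000
wins_switch = sum(monty_trial(True, rng) for _ in range(n)) / n
wins_stay = sum(monty_trial(False, rng) for _ in range(n)) / n
print(wins_switch, wins_stay)  # close to 2/3 and 1/3
```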

1.11
Claim. If A and B are independent, then A^c and B^c are independent.
Proof. By definition of independence, P(A ∩ B) = P(A)P(B). By a property of probabilities,
P(A^c) = 1 − P(A) and P(B^c) = 1 − P(B). Applying Lemma 1.6 to A^c ∪ B^c:

P(A^c ∪ B^c) = P(A^c) + P(B^c) − P(A^c ∩ B^c)   (2.11.1)

By the property of complements, and using De Morgan's law:

P(A ∩ B) = 1 − P((A ∩ B)^c) = 1 − P(A^c ∪ B^c)

which we can rewrite in the following way, and then apply the independence of A and B:

P(A^c ∪ B^c) = 1 − P(A ∩ B) = 1 − P(A)P(B)

By the property of complements, we can represent this as:

P(A^c ∪ B^c) = 1 − [1 − P(A^c)][1 − P(B^c)]
 = 1 − [1 − P(B^c) − P(A^c) + P(A^c)P(B^c)]
 = P(A^c) + P(B^c) − P(A^c)P(B^c)   (2.11.2)

By setting (2.11.1) equal to (2.11.2), we can deduce that P(A^c)P(B^c) = P(A^c ∩ B^c), which proves
independence.

Alternative Proof. A slightly more compact argument, though it still relies on Lemma 1.6:

P(A^c)P(B^c) = [1 − P(A)][1 − P(B)]
 = 1 − P(A) − P(B) + P(A)P(B)
 = 1 − [P(A) + P(B) − P(A)P(B)]
 = 1 − [P(A) + P(B) − P(A ∩ B)]   (independence)
 = 1 − P(A ∪ B)   (Lemma 1.6)
 = P((A ∪ B)^c)
 = P(A^c ∩ B^c)
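The result can also be sanity-checked on a small finite probability space; a quick Python sketch with two independent fair dice (the choice of events is mine):

```python
from itertools import product
from fractions import Fraction

space = list(product(range(1, 7), range(1, 7)))  # two fair dice, 36 equally likely outcomes

def P(event):
    """Exact probability of an event (a predicate on outcomes) under the uniform measure."""
    return Fraction(sum(1 for w in space if event(w)), len(space))

A = lambda w: w[0] % 2 == 0   # first die even: P(A) = 1/2
B = lambda w: w[1] > 4        # second die shows 5 or 6: P(B) = 1/3

# A and B are independent...
assert P(lambda w: A(w) and B(w)) == P(A) * P(B)
# ...and, as the claim says, so are their complements
assert P(lambda w: not A(w) and not B(w)) == P(lambda w: not A(w)) * P(lambda w: not B(w))
```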

1.12
Initially, we have three cards that we call G (green on both sides), R (red on both sides) and
M (mixed colors). Next we define r as 'red side' and g as 'green side'. We can write the joint
distribution as follows:

        G     M     R
 r      0    1/6   2/6  | 1/2
 g     2/6   1/6    0   | 1/2
       1/3   1/3   1/3

From the marginal distributions, we can see that the initial probability of selecting each card is:

P(G) = 1/3, P(R) = 1/3, P(M) = 1/3

And the probabilities of randomly viewing a colored card side:

P(g) = 1/2, P(r) = 1/2

Note that we do not have independence, since P(G ∩ g) = 2/6 while P(G)P(g) = (1/3)(1/2) = 1/6.
We want to find the probability that we have the green card, given that we observe a green side.
This can be calculated by:

P(G|g) = P(G ∩ g)/P(g) = (2/6)/(1/2) = 4/6 = 2/3

1.13
A fair coin is tossed until we have obtained both at least one H and at least one T.
(a) The sample space is:

Ω = {(ω1, …, ωK) : ωi ∈ {H, T}, ω1 = ω2 = ⋯ = ωK−1 ≠ ωK, K ≥ 2}

(the first K − 1 tosses are all equal, and toss K is the first one that differs).
(b) The event K = 3 can only happen in two ways, (H, H, T) and (T, T, H). So:

P(K = 3) = 2 · 1/2³ = 1/4

(It would be more complicated if it were e.g. K ≤ 3.)

1.14
Claim: If P(A) = 0 or P(A) = 1, then A is independent of every other event.

Proof. First consider the case P(A) = 0. Obviously, for any event E ⊂ Ω:

P(A)P(E) = 0 · P(E) = 0.

By the axioms of probability, A ∩ E ⊂ A, so P(A ∩ E) ≤ P(A) = 0, and since 0 ≤ P(A ∩ E) for
any event, we can conclude that P(A ∩ E) = 0. So:

P(A ∩ E) = 0 = P(A)P(E) =⇒ P(A ∩ E) = P(A)P(E)

which proves independence.


Now consider P(A) = 1. Then, for any event E ⊂ Ω:

P(A)P(E) = 1 · P(E) = P(E).

Since A ⊂ A ∪ E ⊂ Ω, then P(A) ≤ P(A ∪ E) ≤ P(Ω). We know the probabilities of A and Ω, so
1 ≤ P(A ∪ E) ≤ 1 means that P(A ∪ E) = 1. By Lemma 1.6:

P(A ∪ E) = P(A) + P(E) − P(A ∩ E)
1 = 1 + P(E) − P(A ∩ E)
P(A ∩ E) = P(E)

In conclusion:
P(A ∩ E) = P(E) = P(A)P(E) =⇒ P(A ∩ E) = P(A)P(E)
which proves independence.

Claim: If A is independent of itself, then P(A) = 0 or P(A) = 1.

Proof. If A is independent of itself, we can write:

P(A ∩ A) = P(A)P(A)
P(A) = P(A)2

since A ∩ A = A. We can write this as an equation, by defining p := P(A).

p = p2
p2 − p = 0
p(p − 1) = 0

This equation is solved when p = P(A) = 1 or p = P(A) = 0.

1.15
The probability that a child has blue eyes is 1/4, i.e. P(B) = 1/4, and there are three children.
We assume independence in the eye color of the children. Define Bi as the event that child i has
blue eyes, and ¬Bi the event that child i does NOT have blue eyes.
(a) Given that at least one child has blue eyes, what is the probability that two or more children
have blue eyes? Note that the conditioning event here is "at least one child", not a specific child.
We have:

P(at least one) = 1 − P(no child has blue eyes) = 1 − (3/4)³ = 37/64
P(two or more) = 3(1/4)²(3/4) + (1/4)³ = 9/64 + 1/64 = 10/64

Since {two or more} ⊂ {at least one}:

P(two or more | at least one) = P(two or more)/P(at least one) = (10/64)/(37/64) = 10/37

(b) Given that the youngest child (say child 1) has blue eyes, the conditioning does point at a
specific child, and by independence we only need one, or both, of the two remaining children to
have blue eyes:

P(B2 ∩ B3) = (1/4)(1/4) = 1/16
P(B2 ∩ ¬B3) = (1/4)(3/4) = 3/16
P(¬B2 ∩ B3) = (3/4)(1/4) = 3/16

Summing these up, the probability of two or more children having blue eyes is 7/16.
1.16
Proving Lemma 1.14. If A and B are independent events, then P(A|B) = P(A). Also, for any pair
of events A and B,
P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A).
Proof. Suppose A and B are independent. By definition of independence, P(A ∩ B) = P(A)P(B).
By definition of conditional probability:

P(A|B) = P(A ∩ B)/P(B) = P(A)P(B)/P(B) = P(A),

which proves the first statement. Again, by definition of conditional probability:

P(A|B) = P(A ∩ B)/P(B) =⇒ P(A ∩ B) = P(A|B)P(B)

P(B|A) = P(B ∩ A)/P(A) =⇒ P(A ∩ B) = P(B|A)P(A),
since A ∩ B = B ∩ A which proves the remaining two statements.

1.17
Show that
P(A ∩ B ∩ C) = P(A|B ∩ C)P(B|C)P(C).
Proof. Define E := B ∩ C, so A ∩ B ∩ C = A ∩ E. By Lemma 1.14:

P(A ∩ B ∩ C) = P(A ∩ E) = P(A|E)P(E) = P(A|B ∩ C)P(B ∩ C) (2.17.1)

By Lemma 1.14 again:


P(B ∩ C) = P(B|C)P(C),
which we can replace in equation (2.17.1) and get the desired result:

P(A ∩ B ∩ C) = P(A|B ∩ C)P(B|C)P(C)

1.18
Suppose k events form a partition of Ω, i.e. they are disjoint and ⋃_{i=1}^k Ai = Ω. Assume that
P(B) > 0. If P(A1|B) < P(A1), then P(Ai|B) > P(Ai) for some i = 2, 3, …, k.
Proof. Summing over all Ai, with and without conditioning on B, we get the following two
equalities (the conditional sum equals 1 because P(·|B) is itself a probability measure, cf.
Exercise 1.9):

Σ_{i=1}^k P(Ai|B) = 1,   Σ_{i=1}^k P(Ai) = 1

We can set these equal to each other:

Σ_{i=1}^k P(Ai|B) = Σ_{i=1}^k P(Ai)

Removing the terms for A1 from the sums:

P(A1|B) + Σ_{i=2}^k P(Ai|B) = P(A1) + Σ_{i=2}^k P(Ai)

Since P(A1|B) < P(A1), we have P(A1) − P(A1|B) = a > 0 for some a ∈ ℝ. By subtracting P(A1|B)
from both sides, we can write:

Σ_{i=2}^k P(Ai|B) = P(A1) − P(A1|B) + Σ_{i=2}^k P(Ai)
Σ_{i=2}^k P(Ai|B) = a + Σ_{i=2}^k P(Ai) > Σ_{i=2}^k P(Ai)

So, we get:

Σ_{i=2}^k P(Ai|B) > Σ_{i=2}^k P(Ai),

which means there must exist some i ∈ {2, 3, …, k} such that P(Ai|B) > P(Ai).
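A concrete example on a fair die (my own choice of partition and B, not from the book) illustrates the claim:

```python
from fractions import Fraction

# fair die: Ω = {1, ..., 6}, all outcomes equally likely
P = lambda E: Fraction(len(E), 6)

partition = [{1, 2}, {3, 4}, {5, 6}]   # A1, A2, A3
B = {3, 4, 5, 6}

prior = [P(Ai) for Ai in partition]            # each P(Ai) = 1/3
cond = [P(Ai & B) / P(B) for Ai in partition]  # P(Ai | B): 0, 1/2, 1/2

assert cond[0] < prior[0]                                # P(A1 | B) < P(A1)...
assert any(c > p for c, p in zip(cond[1:], prior[1:]))   # ...so another cell gains mass
```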

1.19
Define M as Mac users, W as Windows users, and L as Linux users. We have:

P(M ) = 0.3, P(W ) = 0.5, P(L) = 0.2

The computers are infected with a virus, which give the following probabilities:

P(V |M ) = 0.65, P(V |W ) = 0.82, P(V |L) = 0.50

A person is selected at random and we learn that her computer has the virus. What is the probability
that she is using Windows? That is, what is P(W|V)? We calculate this with Bayes' law:

P(W|V) = P(V|W)P(W) / [P(V|M)P(M) + P(V|W)P(W) + P(V|L)P(L)]
 = (0.82)(0.5) / [(0.65)(0.3) + (0.82)(0.5) + (0.5)(0.2)]
 = 0.41/0.705
 ≈ 0.5816
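The arithmetic can be double-checked directly (a small sketch of mine):

```python
prior = {"M": 0.3, "W": 0.5, "L": 0.2}    # P(platform)
lik = {"M": 0.65, "W": 0.82, "L": 0.50}   # P(V | platform)

evidence = sum(lik[k] * prior[k] for k in prior)   # P(V) = 0.705
posterior_w = lik["W"] * prior["W"] / evidence
print(round(posterior_w, 4))  # 0.5816
```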

1.20
We have a box containing 5 coins, C1 , . . . , C5 . Each coin is selected at random, so P(Ci ) = 1/5 for
all coins. They have the following probabilities of getting H when flipped (pi = P(H|Ci )):

p1 = 0, p2 = 1/4, p3 = 1/2, p4 = 3/4, p5 = 1

(a) We flip a coin and get a head. Calculate P(Ci|H) for all coins. First an intermediate calculation:

Σ_{i=1}^5 P(H|Ci)P(Ci) = (0)(1/5) + (1/4)(1/5) + (1/2)(1/5) + (3/4)(1/5) + (1)(1/5)
 = 0 + 1/20 + 2/20 + 3/20 + 4/20
 = 1/2

Calculating the posterior for each Ci:

P(C1|H) = P(H|C1)P(C1) / Σ_{i=1}^5 P(H|Ci)P(Ci) = 0 / (1/2) = 0
P(C2|H) = (1/20) / (1/2) = 1/10
P(C3|H) = (2/20) / (1/2) = 2/10

P(C4|H) = (3/20) / (1/2) = 3/10
P(C5|H) = (4/20) / (1/2) = 4/10

(b) We toss the coin again. Find the probability P(H2|H1) (Hi: getting H on toss i). We are
still using the same randomly drawn coin, so we must average over all coins, weighting by the
posterior from (a) and using that the tosses are independent given the coin:

P(H2|H1) = Σ_{i=1}^5 P(H2|Ci)P(Ci|H1)
 = (0)(0) + (1/4)(1/10) + (1/2)(2/10) + (3/4)(3/10) + (1)(4/10)
 = 1/40 + 4/40 + 9/40 + 16/40
 = 30/40 = 3/4
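The posterior calculation in (a), and the averaging step used for the second toss, can be verified with exact rational arithmetic (a sketch of mine; tosses are assumed independent given the coin):

```python
from fractions import Fraction

p = [Fraction(0), Fraction(1, 4), Fraction(1, 2), Fraction(3, 4), Fraction(1)]  # P(H | Ci)
prior = [Fraction(1, 5)] * 5                                                    # P(Ci)

evidence = sum(pi * pr for pi, pr in zip(p, prior))           # P(H1) = 1/2
posterior = [pi * pr / evidence for pi, pr in zip(p, prior)]  # P(Ci | H1): 0, 1/10, 1/5, 3/10, 2/5
next_head = sum(pi * po for pi, po in zip(p, posterior))      # P(H2 | H1) = 3/4
print(posterior, next_head)
```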

1.21
Proportion of H vs T for p = 0.3.
# 1.21 - Plotting proportion of H vs T
n = 1000
p = 0.3
coinTosses = sample(c("H", "T"), prob = c(p, 1 - p), size = n, replace = TRUE)
proportionHeads = rep(0, n)
headCount = 0
for (i in 1:n) {
  if (coinTosses[i] == "H") {
    headCount = headCount + 1
  }
  proportionHeads[i] = headCount / i
}

# PDF
pdf("~/ALLSTAT/ch1_2.21a.pdf")
plot(x = 1:n, y = proportionHeads, type = "l", ylim = c(0, 1),
     main = paste0("Proportion heads for p = ", p))
abline(h = p, col = "red", lty = "dashed")
dev.off()

Result
After some initial randomness due to few samples, we can see that the simulated results stabilize
around 0.3, as expected.

[Figure: proportion of heads over 1000 tosses, with a dashed red line at p = 0.3.]

Proportion of H vs T for p = 0.03.
# 1.21 - Plotting proportion of H vs T
n = 5000
p = 0.03
coinTosses = sample(c("H", "T"), prob = c(p, 1 - p), size = n, replace = TRUE)
proportionHeads = rep(0, n)
headCount = 0
for (i in 1:n) {
  if (coinTosses[i] == "H") {
    headCount = headCount + 1
  }
  proportionHeads[i] = headCount / i
}

# PDF
pdf("~/ALLSTAT/ch1_2.21b.pdf")
plot(x = 1:n, y = proportionHeads, type = "l", ylim = c(0, 1),
     main = paste0("Proportion heads for p = ", p))
abline(h = p, col = "red", lty = "dashed")
dev.off()

Result
After some initial randomness due to few samples, we can see that the simulated results stabilize
around 0.03, as expected. We ran 5000 simulations to make the 'convergence' clearer. Note that the
y-axis only goes up to 0.1 in this plot.

[Figure: proportion of heads over 5000 tosses, with a dashed red line at p = 0.03.]

1.22
Investigating some properties of Binomial random variables.
# 1.22 - Binomial Random Variables
REP = 10   # Number of simulations per n
p = 0.3    # Probability of H
sim10 = rep(0, REP)
sim100 = rep(0, REP)
sim1000 = rep(0, REP)
# Simulating n = 10
for (i in 1:REP) {
  # 1 means head
  coinTosses = sample(c(1, 0), prob = c(p, 1 - p), size = 10, replace = TRUE)
  sim10[i] = sum(coinTosses)
}
# Simulating n = 100
for (i in 1:REP) {
  coinTosses = sample(c(1, 0), prob = c(p, 1 - p), size = 100, replace = TRUE)
  sim100[i] = sum(coinTosses)
}
# Simulating n = 1000
for (i in 1:REP) {
  coinTosses = sample(c(1, 0), prob = c(p, 1 - p), size = 1000, replace = TRUE)
  sim1000[i] = sum(coinTosses)
}
df = data.frame(
  SIM10 = sim10,
  SIM100 = sim100,
  SIM1000 = sim1000
)
# Output
df
apply(df, 2, mean)

Result
As seen in the output below, the column means of the simulations are close to np, which would be 3, 30 and 300.
> df
SIM10 SIM100 SIM1000
1 1 32 282
2 2 30 298
3 3 33 319
4 3 28 290
5 3 29 284
6 2 34 286
7 4 37 310
8 2 30 294
9 2 30 297
10 2 31 326
> apply(df, 2, mean)
SIM10 SIM100 SIM1000
2.4 31.4 298.6

1.23
Simulating conditional probabilities. First we simulate an independent experiment.
# 1.23 - Simulating a fair die
options(digits = 8)
numberOfTosses = 10000
A = c(2, 4, 6)
B = c(1, 2, 3, 4)
AandB = intersect(A, B)  # c(2, 4)

dieTosses = sample(1:6, size = numberOfTosses, replace = TRUE)

# Calculating P(A), P(B), P(A)*P(B) and P(A cap B)
PA = sum(dieTosses %in% A) / numberOfTosses
PB = sum(dieTosses %in% B) / numberOfTosses
PAandB = sum(dieTosses %in% AandB) / numberOfTosses

# Output
PA
PB
PA * PB
PAandB

Result
As we can see from the output, the estimates are P(A) ≈ 1/2 and P(B) ≈ 2/3, and
P(A ∩ B) ≈ P(A)P(B) ≈ 1/3. The small differences are due to sampling variability.
> # Output
> PA
[1] 0.5005
> PB
[1] 0.6695
> PA*PB
[1] 0.33508475
> PAandB
[1] 0.3344

Now we will construct an experiment with dependent events. We will use a fair coin and a die.
Heads corresponds to 1 and leaves the die count unchanged; tails corresponds to 2 and doubles
the die count (so if we get tails and roll a 2, this gives us a 4). The joint distribution of the
resulting product is:

       1    2    3    4    5    6    7    8    9   10   11   12
 H   1/12 1/12 1/12 1/12 1/12 1/12   -    -    -    -    -    -
 T    -  1/12   -  1/12   -  1/12   -  1/12   -  1/12   -  1/12

Define the events A = {2, 3, 4, 5, 6} and B = {2, 4, 6, 8}, which gives A ∩ B = {2, 4, 6}. The
theoretical probabilities are:

P(A) = 8/12 = 2/3 ≈ 0.6667
P(B) = 7/12 ≈ 0.5833
P(A)P(B) = 7/18 ≈ 0.3889
P(A ∩ B) = 6/12 = 1/2 = 0.5

23
Simulating the dependent events:

# 1.23 - Simulating dependent events
options(digits = 8)
numberOfTosses = 10000
A = c(2, 3, 4, 5, 6)
B = c(2, 4, 6, 8)
AandB = intersect(A, B)  # c(2, 4, 6)

dieTosses = sample(1:6, size = numberOfTosses, replace = TRUE)
coinTosses = sample(1:2, size = numberOfTosses, replace = TRUE)
jointToss = dieTosses * coinTosses

# Calculating P(A), P(B), P(A)*P(B) and P(A cap B)
PA = sum(jointToss %in% A) / numberOfTosses
PB = sum(jointToss %in% B) / numberOfTosses
PAandB = sum(jointToss %in% AandB) / numberOfTosses

# Output
PA
PB
PA * PB
PAandB

> # Output
> PA
[1] 0.6691
> PB
[1] 0.5825
> PA*PB
[1] 0.38975075
> PAandB
[1] 0.4995

Repeating the theoretical values from above:

P(A) = 8/12 = 2/3 ≈ 0.6667
P(B) = 7/12 ≈ 0.5833
P(A)P(B) = 7/18 ≈ 0.3889
P(A ∩ B) = 1/2 = 0.5

The simulated results are very close to the theoretical calculations. We can see that we do not have
independence, as expected.

2 Random Variables
Exercises
2.1
Claim: P(X = x) = F(x⁺) − F(x⁻), for a discrete random variable X (say, integer-valued).
Proof. By definition of the CDF:

F(x⁺) = lim_{z↓x} F(z) = lim_{z↓x} P(X ≤ z),   F(x⁻) = lim_{y↑x} F(y) = lim_{y↑x} P(X ≤ y)

(so y < x with y → x, and z > x with z → x). Since X only takes integer values, there is no
probability mass strictly between x − 1 and x, or strictly between x and x + 1, so we can evaluate
the limits at z = x and y = x − 1:

F(x⁺) = P(X ≤ x) = P(X = x) + P(X ≤ x − 1),   F(x⁻) = P(X ≤ x − 1)

So:

P(X = x) = P(X = x) + P(X ≤ x − 1) − P(X ≤ x − 1)
 = P(X ≤ x) − P(X ≤ x − 1)
 = F(x⁺) − F(x⁻)

2.2
Let X be such that P(X = 2) = P(X = 3) = 1/10 and P(X = 5) = 8/10.

[Figure: the CDF of X, a step function with jumps of 1/10 at x = 2 and x = 3 and a jump of 8/10 at x = 5.]

By reading the plot, we can see that:

P(2 < X ≤ 4.8) = F(4.8) − F(2) = 2/10 − 1/10 = 1/10
P(2 ≤ X ≤ 4.8) = F(4.8) = 2/10
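These readings can be verified directly from the point masses (a small sketch of mine):

```python
pmf = {2: 0.1, 3: 0.1, 5: 0.8}

def F(x):
    """CDF of X: total mass at points <= x."""
    return sum(q for v, q in pmf.items() if v <= x)

# P(2 < X <= 4.8) = F(4.8) - F(2) = 1/10
# P(2 <= X <= 4.8) = F(4.8) - F(2-) = F(4.8) = 2/10, since there is no mass below 2
print(F(4.8) - F(2), F(4.8))
```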

2.3
Lemma 2.15. Let F be the CDF for a random variable X. Then:
1. P(X = x) = F(x) − F(x⁻)
2. P(x < X ≤ y) = F(y) − F(x)
3. P(X > x) = 1 − F(x)
4. If X is continuous, then
F(b) − F(a) = P(a < X < b) = P(a ≤ X < b) = P(a < X ≤ b) = P(a ≤ X ≤ b)

Proof. We will prove each statement in turn. (1) was proved in Exercise 2.1. We do (3) first,
since we need it to prove (2).
(3) By complements, A = {X > x} means A^c = {X ≤ x}, and it follows that:
P(X > x) = P(A) = 1 − P(A^c) = 1 − P(X ≤ x) = 1 − F(x).

(2) Assume x < y. We will need that {X > x} ∪ {X ≤ y} = Ω, and we will also use Lemma 1.6 (in
reverse):

P(x < X ≤ y) = P({X > x} ∩ {X ≤ y})
 = P(X > x) + P(X ≤ y) − P({X > x} ∪ {X ≤ y})
 = 1 − F(x) + F(y) − 1
 = F(y) − F(x)

(4) The argument is similar for all cases, so we will just do one: turning one non-strict inequality
into a strict one. For continuous random variables, pointwise probabilities are 0.
Define A := {a ≤ X} and B := {X < b}. First, we make the following observation:

P(A) = P(a ≤ X)
 = P({a = X} ∪ {a < X})
 = P(a = X) + P(a < X) − P({a = X} ∩ {a < X})
 = 0 + P(A′) − 0
 = P(A′)

where A′ = {a < X}. The first 0 is the pointwise probability, which vanishes since X is continuous,
and the second 0 arises because the two sets are disjoint. We have shown that P(A) = P(A′). Note
also that A ∪ B = Ω and A′ ∪ B = Ω (since a < b, every outcome satisfies X > a or X < b). Using
Lemma 1.6 twice:

P(a ≤ X < b) = P(A ∩ B)
 = P(A) + P(B) − P(A ∪ B)
 = P(A) + P(B) − P(Ω)
 = P(A′) + P(B) − P(A′ ∪ B)
 = P(A′ ∩ B)
 = P(a < X < b)
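Parts (1)–(3) can be checked on a small discrete distribution (my own toy example):

```python
from fractions import Fraction

pmf = {1: Fraction(1, 5), 2: Fraction(1, 2), 4: Fraction(3, 10)}

def F(x):
    """CDF: mass at points <= x."""
    return sum(q for v, q in pmf.items() if v <= x)

def F_minus(x):
    """Left limit F(x-): mass at points strictly below x."""
    return sum(q for v, q in pmf.items() if v < x)

assert F(2) - F_minus(2) == pmf[2]       # (1) P(X = 2) = F(2) - F(2-)
assert F(4) - F(1) == pmf[2] + pmf[4]    # (2) P(1 < X <= 4) = F(4) - F(1)
assert 1 - F(2) == pmf[4]                # (3) P(X > 2) = 1 - F(2)
```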

2.4
X has the probability density function (PDF):

fX(x) = 1/4 for 0 < x < 1;   3/8 for 3 < x < 5;   0 otherwise.

[Figure: plot of fX, constant at 1/4 on (0, 1) and at 3/8 on (3, 5), zero elsewhere.]

From the relatively simple structure, we can easily determine the area under the graph:

A = (1)(1/4) + (2)(3/8) = 2/8 + 6/8 = 1
(a) Finding the CDF by integrating the PDF. We will split the integral into several parts. First
for the case y ∈ (0, 1):

FX(y) = ∫_{−∞}^y fX(t) dt = (1/4) ∫_0^y 1 dt = y/4

When y = 1, we have FX(1) = 1/4. Next, we must consider the case y ∈ (1, 3). Here the PDF is 0,
so the CDF doesn't increase; it remains constant at 1/4:

FX(y) = 1/4

Next is the case y ∈ (3, 5). Consider the intermediate integral:

I1 = ∫_3^y (3/8) dt = (3/8)(y − 3) = (3y − 9)/8

For values y ∈ (3, 5) we start at 1/4, so the CDF in this region becomes:

FX(y) = (3y − 9)/8 + 1/4

So, the full expression for the CDF becomes:

FX(y) = 0 for y ≤ 0;   y/4 for y ∈ (0, 1);   1/4 for y ∈ [1, 3];   (3y − 9)/8 + 1/4 for y ∈ (3, 5);   1 for y ≥ 5.

Note that when y = 5 we get:

FX(5) = (3(5) − 9)/8 + 1/4 = 6/8 + 2/8 = 1

[Figure: plot of the CDF FX(y).]

(b) Defining Y = 1/X and finding the PDF of Y. Following the hint we are given, we will consider
the following three sets:

A1 = {1/5 ≤ y ≤ 1/3},   A2 = {1/3 ≤ y ≤ 1},   A3 = {y > 1}

where A1 corresponds to X ∈ (3, 5), A2 to X ∈ (1, 3) and A3 to X ∈ (0, 1). We can express the
CDF FY(y) in terms of FX(x) (using that X is positive and continuous):

FY(y) = P(Y ≤ y) = P(1/X ≤ y)
 = P(X ≥ 1/y)
 = 1 − P(X ≤ 1/y)
 = 1 − FX(1/y)

First, we consider A1: y ∈ [1/5, 1/3]. When we input 1/y to FX(·), it will be in (3, 5). So:

FY(y) = 1 − FX(1/y)
 = 1 − [(3(1/y) − 9)/8 + 1/4]
 = 1 − (3 − 9y)/(8y) − 1/4
 = 3/4 + (9y − 3)/(8y)
 = (15y − 3)/(8y)

Next, we consider A2: y ∈ [1/3, 1]. The input to FX(·) will be in (1, 3):

FY(y) = 1 − FX(1/y) = 1 − 1/4 = 3/4

Next, we consider A3: y > 1. The input to FX(·) will be in (0, 1):

FY(y) = 1 − FX(1/y) = 1 − (1/y)/4 = 1 − 1/(4y)

Also, whenever y < 1/5, then 1/y > 5, which means FX(1/y) = 1, and so:

FY(y) = 1 − FX(1/y) = 1 − 1 = 0.

This gives a full description of the CDF for FY (y).




FY(y) =
  0                  y < 1/5
  (15y − 3)/(8y)     1/5 ≤ y ≤ 1/3
  3/4                1/3 ≤ y ≤ 1
  1 − 1/(4y)         y > 1

Plot of CDF:
[Plot of the CDF FY(y) for 0 ≤ y ≤ 2.]

Finally, we can find the PDF of Y . We differentiate each of the parts in the CDF. When y ∈
(1/5, 1/3):
 
d/dy [ (15y − 3)/(8y) ] = 3/(8y²)
When y > 1:
 
d/dy [ 1 − 1/(4y) ] = 1/(4y²)
(All other parts are constant, so they become 0). This gives us the PDF and its plot:

fY(y) =
  0            y < 1/5
  3/(8y²)      1/5 ≤ y ≤ 1/3
  0            1/3 < y < 1
  1/(4y²)      y > 1

[Plot of the PDF fY(y) for 0 ≤ y ≤ 2.]
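As a cross-check (a Python sketch; the document otherwise uses R), we can sample X from its piecewise PDF by inverse-CDF sampling and compare the empirical CDF of Y = 1/X with the derived FY at a few points:

```python
import random

def sample_x():
    # Inverse-CDF sampling from the piecewise PDF: 1/4 on (0,1), 3/8 on (3,5).
    u = random.random()
    if u <= 0.25:
        return 4 * u            # inverts F_X(x) = x/4 on (0, 1)
    return 3 + (8 * u - 2) / 3  # inverts F_X(x) = (3x - 9)/8 + 1/4 on (3, 5)

random.seed(1)
ys = [1 / sample_x() for _ in range(200000)]

def emp_cdf(t):
    return sum(y <= t for y in ys) / len(ys)

# Derived CDF: F_Y(1/4) = 3/8, F_Y(1/2) = 3/4, F_Y(2) = 1 - 1/(4*2) = 7/8.
print(emp_cdf(0.25), emp_cdf(0.5), emp_cdf(2.0))
```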

2.5
Let X and Y be discrete RVs. Then X and Y are independent if and only if fX,Y (x, y) = fX (x)fY (y) for all x and y.
Proof.
⇒) Assume that X and Y are independent. That means that for any x, y, we have

P(X = x ∩ Y = y) = P(X = x)P(Y = y)

Starting with the definition of the joint pdf:

fX,Y (x, y) = P(X = x, Y = y)


= P(X = x ∩ Y = y)
= P(X = x)P(Y = y)
= fX (x)fY (y)

Which shows that fX,Y (x, y) = fX (x)fY (y) for all x and y.
⇐) Assume that fX,Y (x, y) = fX (x)fY (y) for all x and y. By definition:

fX,Y (x, y) = P(X = x, Y = y)


= P(X = x ∩ Y = y)

And,

fX (x)fY (y) = P(X = x)P(Y = y)

From our assumption, these are equal, so P(X = x ∩ Y = y) = P(X = x)P(Y = y) which shows
that X and Y are independent.
By implication both ways, the statement is proved.
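The factorization can be illustrated with a tiny discrete example (a Python sketch; the PMFs are hypothetical illustrative choices):

```python
from itertools import product

# Two independent discrete variables (hypothetical PMFs chosen for illustration).
fX = {0: 0.3, 1: 0.7}
fY = {0: 0.5, 1: 0.25, 2: 0.25}

# Build the joint PMF as the product of the marginals ...
joint = {(x, y): fX[x] * fY[y] for x, y in product(fX, fY)}

# ... then recover the marginals from the joint and verify the factorization.
for (x, y), p in joint.items():
    mx = sum(joint[(x, yy)] for yy in fY)  # marginal of X at x
    my = sum(joint[(xx, y)] for xx in fX)  # marginal of Y at y
    assert abs(p - mx * my) < 1e-12
print("factorization holds")
```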

2.6
Let X have distribution F and density f , and let A be a subset of the real line, e.g. A = (a, b) for
some a, b ∈ R and a < b. We have the indicator function

1 x∈A
IA (x) =
0 x 6∈ A

We will set Y = IA (X) and find the PDF and CDF of Y .


The exercise asks for a probability mass function, and in fact that is what we get: even though X is continuous, Y = IA(X) only takes the values 0 and 1, so Y is discrete. Setting p = P(X ∈ A) = ∫_A f(x) dx, we have Y ∼ Bernoulli(p), so the PMF is fY(1) = p and fY(0) = 1 − p, and the CDF is FY(y) = 0 for y < 0, FY(y) = 1 − p for 0 ≤ y < 1, and FY(y) = 1 for y ≥ 1.
(Note that if X ∼ U(0, 1) and A = (0, 1), then Y = 1 with probability 1; Y is not equal to X. And A = Q poses no problem either: Q has measure zero, so p = 0.)

2.7
Let X and Y be independent and suppose that X, Y ∼ U (0, 1). For Z = min(X, Y ) we will find
the density fZ (z) for Z. Following the hint, we will first find P(Z > z). Since every observation
satisfies x, y ∈ (0, 1), we can immediately see that P(Z > 0) = 1 and P(Z > 1) = 0. But what
happens for other values? The best way to find out is with some illustrations. Here are plots of the
cases P(Z > 1/2), P(Z > 1/4) and P(Z > 3/4).

[Three unit squares showing the regions {X > z, Y > z} for z = 1/2, 1/4 and 3/4.]

If we simulate lots of X and Y values, we see that about 1/4th of them will have both X
and Y values larger than 1/2, so P(Z > 1/2) = 1/4. Similarly, we get P(Z > 1/4) = 9/16 and
P(Z > 3/4) = 1/16. Confirming this with a simulation.
# 2.7 - Simulating U(0,1)
N = 100000; X = runif(N); Y = runif(N)

Z = pmin(X, Y)  # This is: Z = min{X, Y}

# Comparing simulated vs. theoretical results
sum(Z > 0.5) / N
1/4
sum(Z > 0.25) / N
9/16
sum(Z > 0.75) / N
1/16

> # Comparing simulated vs. theoretical results
> sum(Z > 0.5)/N
[1] 0.24888
> 1/4
[1] 0.25
> sum(Z > 0.25)/N
[1] 0.56156
> 9/16
[1] 0.5625
> sum(Z > 0.75)/N
[1] 0.06205
> 1/16
[1] 0.0625

By inspecting the images on the previous page, we can determine the ’shape’ of the probabilities.
For Z > 1/4 we remove the union of X 6 1/4 and Y 6 1/4. We define A = {X 6 z} and
B = {Y 6 z}, and can write the general case as:
P(Z > z) = 1 − P({X ≤ z} ∪ {Y ≤ z})
         = 1 − P(A ∪ B)
         = 1 − (P(A) + P(B) − P(A ∩ B))
         = 1 − (P(A) + P(B) − P(A)P(B))
Where we used Lemma 1.6, and the fact that X and Y are independent. By using the probability
law of complements, we can find the expression for P(Z 6 z).
P(Z 6 z) = P(X 6 z) + P(Y 6 z) − P(X 6 z)P(Y 6 z)
The CDF for a uniform distribution on U (a, b) is:
F(z) = (z − a)/(b − a)  ⟹  FX(z) = FY(z) = (z − 0)/(1 − 0) = z
Which means:
FZ (z) = FX (z) + FY (z) − FX (z)FY (z) = 2z − z 2
We can confirm our illustrations and simulated examples again by noting that:
 
FZ(1/2) = 1/2 + 1/2 − (1/2)(1/2) = 1 − 1/4 = 3/4
FZ(1/4) = 1/4 + 1/4 − (1/4)(1/4) = 8/16 − 1/16 = 7/16
FZ(3/4) = 3/4 + 3/4 − (3/4)(3/4) = 24/16 − 9/16 = 15/16
which gives the complementary results, as expected (since we simulated and illustrated P(Z > z)). The
PDF fZ(z) is the derivative of FZ(z). Including a PDF plot and a histogram of the simulated Zs.

fZ(z) = d/dz FZ(z) = 2 − 2z.

[Plot of fZ(z) = 2 − 2z on (0, 1), next to a histogram of the simulated Z values.]

2.8
The RV X has CDF F . Finding the CDF of X + = max{0, X}.
From the definition of CDF:

FX + (u) = P max(0, X) 6 u
 
= P {ω ∈ Ω : X(ω) 6 u and u > 0}

We must consider two cases. When u < 0:


 
FX + (u) = P {ω ∈ Ω : X(ω) 6 u and u > 0}
= P(∅)
=0

When u > 0:
 
FX + (u) = P {ω ∈ Ω : X(ω) 6 u and u > 0}
 
= P {ω ∈ Ω : X(ω) 6 u}
= P(X 6 u)
= FX (u)

So in summary:

FX+(u) =
  0        u < 0
  FX(u)    u ≥ 0
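A quick Monte Carlo check of this CDF (a Python sketch; taking X standard normal is an illustrative choice, any F works):

```python
import random

random.seed(2)
# Simulate X and X+ = max(0, X) with X ~ N(0, 1).
xs = [random.gauss(0, 1) for _ in range(200000)]
xplus = [max(0, x) for x in xs]

def emp(vals, u):
    # Empirical CDF of vals at u.
    return sum(v <= u for v in vals) / len(vals)

# For u < 0 the CDF of X+ is 0; for u >= 0 it equals F_X(u).
print(emp(xplus, -0.5))               # 0
print(emp(xplus, 0.8), emp(xs, 0.8))  # identical
```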

2.9
We have X ∼ Exp(β). The PDF is given by:
f(x) = (1/β) e^{−x/β},  x ≥ 0
Finding the CDF by integrating the PDF.
F(y) = ∫_{−∞}^{y} f(x) dx
     = (1/β) ∫_0^y e^{−x/β} dx
     = (1/β) [−β e^{−x/β}]_0^y
     = [−e^{−x/β}]_0^y
     = 1 − e^{−y/β}

To find the inverse F −1 (q) we set q = F (y) and solve for y.

q = 1 − e−y/β =⇒ y = −β log(1 − q) =⇒ F −1 (q) = −β log(1 − q)

2.10
If X and Y are independent, then g(X) and h(Y ) are independent for some functions g and h.
Proof. For a set A, write g⁻¹(A) = {x : g(x) ∈ A} for the preimage of A under g, and similarly for h. Then {g(X) ∈ A} = {X ∈ g⁻¹(A)}, and for any sets A and B:

P(g(X) ∈ A, h(Y) ∈ B) = P(X ∈ g⁻¹(A), Y ∈ h⁻¹(B))
                      = P(X ∈ g⁻¹(A)) P(Y ∈ h⁻¹(B))   (independence of X and Y)
                      = P(g(X) ∈ A) P(h(Y) ∈ B)

which is the condition for g(X) and h(Y) to be independent. (Note that g and h need not be invertible; preimages are all we need.)

2.11
Tossing a coin which has probability p of getting H. We let X denote the number of heads and Y
the number of tails.
(a) Showing that X and Y are dependent. First it will be helpful to consider a simplified case where
we have N = 1 and N = 2 coin tosses, and assuming p = 1/2.

N = 1 toss:  f(0, 1) = 1/2, f(1, 0) = 1/2, all other entries 0.
Marginals: fX(0) = fX(1) = 1/2 and fY(0) = fY(1) = 1/2.

N = 2 tosses:  f(0, 2) = 1/4, f(1, 1) = 1/2, f(2, 0) = 1/4, all other entries 0.
Marginals: fX(0) = 1/4, fX(1) = 1/2, fX(2) = 1/4, and the same for fY.

In the case of N = 2, we see that f (0, 2) = 1/4 while fX (0)fY (2) = 1/16. This will be the
inspiration for how we show it in the general case with N tosses and probability p of heads.
We only need to show one specific case where P(X = x, Y = y) 6= P(X = x)P(Y = y) to show that
these values are dependent. The total number of tosses will be N = X + Y and we compare the
cases where we get N heads. In that case, using the Multinomial distribution:
 
P(X = N, Y = 0) = [N! / (N! 0!)] p^N (1 − p)^0 = p^N,

but with the two Binomial distributions:


   
P(X = N) P(Y = 0) = [C(N, N) p^N (1 − p)^0] × [C(N, 0) p^0 (1 − p)^N] = p^N (1 − p)^N.

These are not equal, showing that X and Y are dependent variables.

(b) Now we have N ∼ Poisson(λ), where N = X + Y for X heads and Y tails. Show that these
values are now independent.
Current solution, but not sure it’s correct...
Assuming X = x and Y = y. Then:

P(X = x ∩ Y = y) = P(X = x ∩ Y = y ∩ N = x + y)
  = P(X = x ∩ Y = y | N = x + y) · P(N = x + y)
  = C(x + y, x) p^x (1 − p)^y · e^{−λ} λ^{x+y} / (x + y)!
  = [(x + y)! / (x! y!)] p^x (1 − p)^y · e^{−λ} λ^x λ^y / (x + y)!
  = e^{−λ} · (p^x λ^x / x!) · ((1 − p)^y λ^y / y!)
  = (e^{−λp} (λp)^x / x!) · (e^{−λ(1−p)} (λ(1 − p))^y / y!)
  = P(X = x | λp) · P(Y = y | λ(1 − p))
  = P(X = x) P(Y = y)

Which shows we have independence. We used:

e−λ = e−λ(p−p+1) = e−λp+λp−λ = e−λp eλp−λ = e−λp e−λ(1−p) .

Also, we assumed that P(X = x ∩ Y = y | N = x + y) is Binomial. This is not circular: given
N = n, the n tosses are IID Bernoulli(p) by the model, so the number of heads among them is
Binomial(n, p) by construction. The independence of X and Y is what comes out of the calculation,
not what goes in.
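The derivation can also be checked numerically; the following Python sketch (λ = 3 and p = 0.4 are arbitrary illustrative values) verifies the factorization on a grid of points:

```python
from math import comb, exp, factorial

lam, p = 3.0, 0.4  # illustrative parameter choices

def pois(k, mu):
    # Poisson PMF with mean mu.
    return exp(-mu) * mu**k / factorial(k)

def joint(x, y):
    # Joint PMF from the model: condition on N = x + y tosses, then Binomial heads.
    n = x + y
    return comb(n, x) * p**x * (1 - p)**y * pois(n, lam)

# Check the factorization P(X=x, Y=y) = Poisson(lam*p)(x) * Poisson(lam*(1-p))(y).
for x in range(8):
    for y in range(8):
        assert abs(joint(x, y) - pois(x, lam * p) * pois(y, lam * (1 - p))) < 1e-12
print("independent")
```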

2.12 Theorem 2.33


Suppose that the range of X and Y is a (possibly infinite) rectangle. If f (x, y) = g(x)h(y) for some
functions g and h (not necessarily probability density functions) then X and Y are independent.
Proof. From the joint PDF, we can find the marginal distributions:
Z Z
fX (x) = f (x, y)dy, fY (y) = f (x, y)dx

Applying these integrals to both sides of f(x, y) = g(x)h(y):

∫ f(x, y) dy = g(x) ∫ h(y) dy  ⟹  fX(x) = c_h g(x),  where c_h = ∫ h(y) dy
∫ f(x, y) dx = h(y) ∫ g(x) dx  ⟹  fY(y) = c_g h(y),  where c_g = ∫ g(x) dx

Since f integrates to 1 over the rectangle, 1 = ∫∫ f(x, y) dx dy = (∫ g(x) dx)(∫ h(y) dy) = c_g c_h. Therefore

fX(x) fY(y) = c_g c_h g(x) h(y) = g(x) h(y) = f(x, y),

and by the result in exercise 2.5 (stated there for the discrete case), X and Y are independent.

2.13
(a) Finding the PDF of Y = eX when X ∼ N (0, 1).

FY (y) = P(Y 6 y) = P(eX 6 y)


= P(X 6 log(y)) = FX (log(y))

We simply get the standard normal distribution with log(y) as input. Differentiating to get the
PDF:  
fY(y) = d/dy FX(log(y)) = fX(log(y)) · (1/y) = (1/y) · (1/√(2π)) exp(−(log y)²/2)
Plotting the function:

[Plot of fY(y) = (1/y) · fX(log(y)) for y ∈ (0, 8).]

(b) Plotting histogram of simulated results. Comparable to the plot above.

[Histogram of the simulated y values with a density overlay.]

x = rnorm(10000)
y = exp(x)
pdf("~/AllStatistics/files/ch2_2.13b.pdf",
    width = 4.7747, height = 4)
d = density(y)
hist(y, breaks = 200, xlim = c(0, 8),
     prob = TRUE)
lines(density(y), col = "red", xlim = c(0, 8))
dev.off()
2.14
We let (X, Y) be uniformly distributed on the unit disk: {(x, y) : x² + y² ≤ 1}. Find the CDF and
PDF of R = √(X² + Y²).
First, let’s write some code in order to simulate the data, which are random points contained within
the unit circle. And then use that to simulate R and view the histogram. Plots and code can be
found on the next page. By inspection of the histogram it is clear that fR (r) = 2r which means
FR (r) = r2 .
Following the general
√ recipe for transformation of multiple random variables. We will define the
function r(X, Y ) = X 2 + Y 2 .
FR(r) = P(R ≤ r) = P(r(X, Y) ≤ r) = P(X² + Y² ≤ r²) = ∫∫_{Ar} fX,Y(x, y) dx dy
The set is Ar = {(x, y) : x² + y² ≤ r²}, a disk of radius r contained in the unit disk, since x² + y² ≤ r² ≤ 1 implies √(x² + y²) ≤ 1.
Since the area of the unit circle is π, and since a uniform distribution means the probability is
equal everywhere, it follows that:
fX,Y(x, y) = 1/π.
As we saw, we must integrate this function over the disk {(x, y) : x² + y² ≤ r²} for some 0 ≤ r ≤ 1.
This is easiest in polar coordinates, so we introduce 0 ≤ θ ≤ 2π and 0 ≤ ρ ≤ r and change dx dy = ρ dρ dθ. So:

FR(r) = (1/π) ∫_0^{2π} ∫_0^r ρ dρ dθ
      = (1/π) ∫_0^{2π} [ρ²/2]_0^r dθ
      = (1/π) ∫_0^{2π} (r²/2) dθ
      = (r²/2π) ∫_0^{2π} 1 dθ
      = (r²/2π) [θ]_0^{2π}
      = (r²/2π) · 2π
      = r²

The CDF is FR (r) = r2 for 0 6 r 6 1. By differentiating this, we get the PDF.

fR(r) = d/dr FR(r) = 2r,
also in the interval 0 6 r 6 1. This confirms the results we found by simulating which are found
on the next page.

Plots and Code for 2.14
Result from simulation:
simulateUnitCircle = function(N) {
  TR = runif(N)
  UR = runif(N); VR = runif(N)
  t = 2 * pi * TR
  u = UR + VR
  u[u > 1] = 2 - u[u > 1]
  r = u
  X = r * cos(t)
  Y = r * sin(t)
  retVal = list()
  retVal$X = X
  retVal$Y = Y
  return(retVal)
}
simVal = simulateUnitCircle(1000)
X = simVal$X
Y = simVal$Y
pdf("~/AllStatistics/files/ch2_2.14.pdf",
    width = 4, height = 4)
plot(X, Y)
dev.off()

[Scatter plot of 1000 simulated points covering the unit disk.]

[Histogram of R, with density close to 2r on (0, 1).]

simVal = simulateUnitCircle(50000)
R = sqrt(simVal$X^2 + simVal$Y^2)
pdf("~/AllStatistics/files/ch2_2.14b.pdf",
    width = 4, height = 4)
hist(R, breaks = 40, prob = TRUE)
dev.off()

2.15
Let X have a continuous, strictly increasing CDF F , and set Y = F (X). Find the density of Y .
By definition of the CDF, F : R → [0, 1] which will be the domain of Y . Since F is strictly
increasing, then F (X1 ) < F (X2 ) if X1 < X2 , but the important property is that F must be
invertible. For any y ∈ [0, 1] we get the following:
FY (y) = P(Y 6 y) = P(F (X) 6 y)
= P(X 6 F −1 (y)) = F (F −1 (y)) = y
Since FY (y) = y, then the density is fY (y) = 1, the derivative. From these observations, we can
conclude that Y ∼ U (0, 1).
Next, if U ∼ U(0, 1) and X = F⁻¹(U), then X has F as its CDF:

P(X ≤ x) = P(F⁻¹(U) ≤ x) = P(U ≤ F(x)) = F(x)  ⟹  X ∼ F
Writing a program that generates Exp(β) values. Using the result found in exercise 2.9.

simExp = function(beta, N) {
  U = runif(N)
  return(-beta * log(1 - U))
}
# Simulating
expRND = simExp(beta = 2/3, N = 50000)
# Making plot
hist(expRND, breaks = 80,
     xlim = c(0, 5), prob = TRUE)
lines(density(expRND), col = "red", xlim = c(0, 5), lwd = 2)

Plot of the exponential distribution for β = 2/3.

[Histogram of expRND next to the theoretical density fX(x) = 1.5 exp(−1.5x).]

2.16
If X ∼ Poisson(λ1) and Y ∼ Poisson(λ2), and X and Y are independent, then the distribution of
X given that X + Y = n is Binomial(n, ρ) where ρ = λ1/(λ1 + λ2). (Changed some variable names.)
Proof. By definition of conditional probability, by applying hint 1, by independence and by using
the fact that X + Y is a Poisson distribution with λ1 + λ2 .
P(X = x | X + Y = n) = P(X = x ∩ X + Y = n) / P(X + Y = n)
                     = P(X = x ∩ Y = n − x) / P(X + Y = n)      (Hint)
                     = P(X = x) P(Y = n − x) / P(X + Y = n)     (Independence)
Each of these terms is known:

P(X = x) = e^{−λ1} λ1^x / x!
P(Y = n − x) = e^{−λ2} λ2^{n−x} / (n − x)!
P(X + Y = n) = e^{−(λ1+λ2)} (λ1 + λ2)^n / n!
We can invert the last equation:

1 / P(X + Y = n) = e^{λ1+λ2} · n! / (λ1 + λ2)^n
Putting it all together, we can complete the justification.
P(X = x | X + Y = n) = P(X = x) P(Y = n − x) / P(X + Y = n)
  = (e^{−λ1} λ1^x / x!) · (e^{−λ2} λ2^{n−x} / (n − x)!) · (e^{λ1+λ2} n! / (λ1 + λ2)^n)
  = (λ1^x / x!) · (λ2^{n−x} / (n − x)!) · (n! / ((λ1 + λ2)^x (λ1 + λ2)^{n−x}))
  = (n! / (x!(n − x)!)) (λ1/(λ1 + λ2))^x (λ2/(λ1 + λ2))^{n−x}
  = C(n, x) (λ1/(λ1 + λ2))^x (1 − λ1/(λ1 + λ2))^{n−x}
  = C(n, x) ρ^x (1 − ρ)^{n−x} ∼ Binomial(n, ρ)
for ρ = λ1 /(λ1 + λ2 ).
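A quick numeric check of this result (a Python sketch; λ1 = 2, λ2 = 5 and n = 6 are arbitrary illustrative values):

```python
from math import comb, exp, factorial

def pois(k, mu):
    # Poisson PMF with mean mu.
    return exp(-mu) * mu**k / factorial(k)

l1, l2, n = 2.0, 5.0, 6  # illustrative parameters
rho = l1 / (l1 + l2)

for x in range(n + 1):
    # Conditional probability computed directly from the two Poissons ...
    cond = pois(x, l1) * pois(n - x, l2) / pois(n, l1 + l2)
    # ... matches the Binomial(n, rho) PMF.
    binom = comb(n, x) * rho**x * (1 - rho) ** (n - x)
    assert abs(cond - binom) < 1e-12
print("Binomial(n, rho) confirmed")
```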

2.17
The joint probability density is:

fX,Y(x, y) =
  c(x + y²)   0 ≤ x ≤ 1, 0 ≤ y ≤ 1
  0           otherwise

Calculating P(X < 1/2 | Y = 1/2).


We begin by calculating the marginal distribution for fY (y).
fY(y) = ∫_0^1 (cx + cy²) dx = c ∫_0^1 x dx + cy² ∫_0^1 dx = c [x²/2]_0^1 + cy² [x]_0^1 = c (1/2 + y²)

Following the steps in example 2.38.

fX|Y(x|y) = fX,Y(x, y) / fY(y) = c(x + y²) / (c(1/2 + y²)) = (x + y²) / (1/2 + y²)

Calculating the conditional probability:


P(X < 1/2 | Y = 1/2) = ∫_0^{1/2} fX|Y(x | 1/2) dx
  = ∫_0^{1/2} (x + 1/4) / (1/2 + 1/4) dx
  = (4/3) ∫_0^{1/2} (x + 1/4) dx
  = (4/3) [x²/2 + x/4]_0^{1/2}
  = (4/3)(1/8 + 1/8) = (4/3)(1/4) = 1/3
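A numeric sanity check of the value 1/3 (a Python sketch using the midpoint rule on the conditional density):

```python
# Numerically integrate the conditional density f(x | y = 1/2) = (x + 1/4)/(3/4)
# over (0, 1/2) and compare with the exact value 1/3 derived above.
steps = 100000
h = 0.5 / steps
total = sum(((i + 0.5) * h + 0.25) / 0.75 for i in range(steps)) * h
print(total)  # ~ 0.3333
```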

2.18
Let X ∼ N (3, 16). Solve the following using the Normal table and using a computer package.
(a)
P(X < 7) = P(3 + 4Z < 7) = P(Z < (7 − 3)/4) = P(Z < 1) = 0.8413
From R:

> pnorm(1)
[1] 0.84134475
> 1 - pnorm(-0.25)
[1] 0.59870633

(b)
P(X > 2) = 1 − P(X ≤ 2) = 1 − P(3 + 4Z ≤ 2) = 1 − P(Z ≤ (2 − 3)/4)
         = 1 − P(Z < −1/4) = 1 − (1 − 0.5987) = 0.5987

(c) Find x such that P(X > x) = 0.05. By finding 0.95 in the table, we see that it is generated by
the value 1.645, so:

P(Z 6 1.645) = 0.95 =⇒ P(Z > 1.645) = 1 − 0.95 = 0.05

P(4Z > 6.58) = P(3 + 4Z > 9.58) = P(X > 9.58) = 0.05
So, x = 9.58. From R:
1 - pnorm(9.5794, mean = 3, sd = 4)
[1] 0.05000037
(d)

P(0 6 X < 4) = P(0 6 3 + 4Z < 4) = P(−3 6 4Z < 1)


= P(−3/4 6 Z < 1/4) = P(Z < 1/4) − P(Z < −3/4)
= 0.5987 − (1 − 0.7734) = 0.5987 − 0.2266
= 0.3721

From R:
> pnorm(0.25) - pnorm(-0.75)
[1] 0.37207897
#### Alternatively
> CH = rnorm(10000000, mean = 3, sd = 4)
> sum(CH < 4 & CH > 0)/10000000
[1] 0.3719199
(e) Find x such that P(|X| > |x|) = 0.05.
We can split this up into two pieces if we assume x > 0. Using that the events are disjoint:

P(|X| > |x|) = P({X > x} ∪ {X < −x}) = P(X > x) + P(X < −x)
             = 1 − P(X ≤ x) + P(X < −x) = 0.05
P(X ≤ x) − P(X < −x) = 0.95

Note the minus sign. Translating to the standard normal:


   
P(Z ≤ (x − 3)/4) − P(Z < (−x − 3)/4) = 0.95
I am not able to see how to solve this by just looking up values because of the dependence. By
using R, we can find that x = 9.611 which makes the lookup values 1.65275 and -3.15275. Verifying
in R.
> X = rnorm(10000000, mean = 3, sd = 4)
> x = 9.611
> sum(X < -x | X > x)/10000000
[1] 0.0500769
> sum(abs(X) > abs(x))/10000000
[1] 0.0500769

2.19
Proving equation (2.12). Let X be a random variable with pdf fX (x), and define the random
variable Y = r(X), where r is either strictly monotone increasing or decreasing, so r(·) has the
inverse function s(·) = r−1 (·). Then, we can express the pdf of Y as:
fY(y) = fX(s(y)) |ds(y)/dy|.
Proof. The proof will be in two parts. Assume first that r is strictly increasing, which means that
s is strictly increasing and its derivative is positive. From the CDF of Y , we get:

FY (y) = P(Y 6 y)
= P(r(X) 6 y)
= P(X 6 r−1 (y))
= P(X 6 s(y))
= FX (s(y))

Differentiating the CDF to get the PDF.


fY(y) = d/dy FY(y) = d/dy FX(s(y)) = fX(s(y)) · ds(y)/dy
Now assume that r is strictly decreasing, which means s is strictly decreasing and thus has a
negative derivative. From the CDF of Y , we now get:

FY (y) = P(Y 6 y)
= P(r(X) 6 y)
= P(X > r−1 (y))
= 1 − P(X 6 s(y))
= 1 − FX (s(y))
 
fY(y) = d/dy FY(y) = d/dy (1 − FX(s(y))) = −fX(s(y)) · ds(y)/dy = fX(s(y)) · (−ds(y)/dy)
Since ds/dy is negative here, its negative is positive. Covering both cases at once, we can write:

fY(y) = fX(s(y)) |ds(y)/dy|
A short note on why we get 1 − P(X 6 s(y)) when r is strictly decreasing. As an example, say
that X ∼ U (0, 5) and we define Y = r(X) where r(x) = 2 − 2x, which is strictly decreasing. Then
r−1 (y) = s(y) = 1 − y2 , which is also strictly decreasing. Then:
y
P(r(X) 6 y) = P(2 − 2X 6 y) = P(2 − y 6 2X) = P(1 − 6 X) = P(X > s(y))
2
In order to get the expression in terms of FX (x), we rewrite it:

P(X > s(y)) = 1 − P(X 6 s(y)) = 1 − FX (s(y)).
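The example above can also be checked by simulation; this Python sketch uses the same X ∼ U(0, 5) and Y = r(X) = 2 − 2X, for which the formula gives fY(y) = fX(s(y)) |ds/dy| = (1/5)(1/2) = 1/10 on (−8, 2):

```python
import random

random.seed(3)
# X ~ U(0, 5), Y = 2 - 2X (strictly decreasing), so s(y) = 1 - y/2,
# |ds/dy| = 1/2 and f_Y(y) = (1/5)(1/2) = 1/10 on (-8, 2).
ys = [2 - 2 * random.uniform(0, 5) for _ in range(200000)]

# Empirical density over a small interval around y = 0.
a, b = -1.0, 1.0
dens = sum(a < y < b for y in ys) / len(ys) / (b - a)
print(dens)  # ~ 0.1
```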

2.20
Let X, Y ∼ U (0, 1) be independent. Find the PDF for X − Y and X/Y . The joint density of X, Y
is
f (x, y) = 1, x, y ∈ (0, 1)
and 0 elsewhere.
We define Z = X − Y and note that Z takes values in (−1, 1), the two extremes. We split this range
into two cases: z ∈ (0, 1), corresponding to X > Y, and z ∈ (−1, 0), corresponding to X < Y.
When X > Y and z ∈ (0, 1):

FZ(z) = ∫_0^z ∫_y^1 1 dx dy = ∫_0^z [x]_y^1 dy
      = ∫_0^z (1 − y) dy = [y − y²/2]_0^z
      = z − z²/2
When X < Y and z ∈ (−1, 0):

FZ(z) = ∫_{−1}^z ∫_{−1}^y 1 dx dy = ∫_{−1}^z [x]_{−1}^y dy
      = ∫_{−1}^z (y + 1) dy = [y²/2 + y]_{−1}^z
      = z²/2 + z + 1/2
Summarizing the CDF, and differentiating to find the PDF. (The 1/2 carries over.)

FZ(z) =
  z²/2 + z + 1/2   z ∈ (−1, 0)
  z − z²/2 + 1/2   z ∈ (0, 1)
⟹
fZ(z) =
  z + 1   z ∈ (−1, 0)
  1 − z   z ∈ (0, 1)

[Histogram of the simulated X − Y values next to the triangular PDF fZ(z).]

Now we define Z = X/Y.

FZ(z) = P(Z ≤ z) = P(X/Y ≤ z) = P(X ≤ zY)

FZ(z) = ∫_0^1 ∫_0^{min(zy, 1)} 1 dx dy = ∫_0^1 min(zy, 1) dy =
  ∫_0^{1/z} zy dy + ∫_{1/z}^1 1 dy   when z ≥ 1
  ∫_0^1 zy dy                        when z < 1
Evaluating the integrals to get the CDF, and differentiating to get the PDF:

FZ(z) =
  1 − 1/(2z)   z ≥ 1
  z/2          z < 1
⟹
fZ(z) =
  1/(2z²)   z ≥ 1
  1/2       z < 1
Plots and simulation code.

[Histogram of the simulated X/Y values (truncated at 3) next to the PDF fZ(z).]

# Simulating X - Y
N = 10000
X = runif(N); Y = runif(N)
d1 = X - Y
hist(d1, breaks = 50, prob = TRUE,
     main = "Histogram of X - Y")

# Simulating X / Y
N = 1000000
X = runif(N); Y = runif(N)
d2 = X / Y
d3 = d2[abs(d2) < 3]              # Removing very rare large vals
d3 = d3 * length(d2) / length(d3) # Normalizing probability
hist(d3, breaks = 60, prob = TRUE,
     main = "Histogram of X / Y")

46
2.21
Let X1 , . . . , Xn ∼ Exp(β) be IID. Define Y = max(X1 , . . . , Xn ). Find the PDF of Y .
Recall the PDF of the exponential distribution:
fX(x) = (1/β) exp(−x/β),  x ≥ 0
Defining r(X1 , . . . , Xn ) = max(X1 , . . . , Xn ). Finding an expression for the CDF of Y .

FY (y) = P(Y 6 y) = P(max(X1 , . . . , Xn ) 6 y)

According to the hint, Y 6 y iff Xi 6 y for all i ∈ I, where I = {1, . . . , n} is the index set. We can
therefore express the CDF of Y in terms of each individual CDF:
FY(y) = P(X1 ≤ y, X2 ≤ y, . . . , Xn ≤ y) = P(X1 ≤ y) · · · P(Xn ≤ y) = (FX(y))^n = (1 − exp(−y/β))^n

Differentiating wrt. y to find the PDF.


fY(y) = d/dy FY(y) = d/dy (FX(y))^n = n (FX(y))^{n−1} fX(y)
      = n (1 − exp(−y/β))^{n−1} · (1/β) exp(−y/β)
      = (n/β) exp(−y/β) (1 − exp(−y/β))^{n−1}
Plots and simulation code for n = 10 and β = 0.2.

[Histogram of the simulated Y values next to the theoretical PDF fY(y).]

N = 100000
lambda = 5

exp1 = rexp(N, rate = lambda)
exp2 = rexp(N, rate = lambda)
exp3 = rexp(N, rate = lambda)
exp4 = rexp(N, rate = lambda)
exp5 = rexp(N, rate = lambda)
exp6 = rexp(N, rate = lambda)
exp7 = rexp(N, rate = lambda)
exp8 = rexp(N, rate = lambda)
exp9 = rexp(N, rate = lambda)
exp10 = rexp(N, rate = lambda)

Y = pmax(exp1, exp2, exp3, exp4, exp5,
         exp6, exp7, exp8, exp9, exp10)

hist(Y, breaks = 50, prob = TRUE)
lines(density(Y), col = "red", xlim = c(0, 2.5), lwd = 2)

3 Expectation
Exercises

3.1
Define X as the wealth after n games. The probability of winning and losing is the same for each
outcome so p = 1/2.
 
E[X] = (1/2) · 2c + (1/2) · (1/2)c = c + c/4 = (5/4)c
2 2 2 4 4

We expect to have 5/4 · c after n games. We can also verify this result with a simulation in R.
> games = sample ( c (2 , 0.5) , size = 1000000 , replace = TRUE )
> mean ( games )
[1] 1.250973
> 5/4
[1] 1.25

3.2
Claim. Var(X) = 0 if and only if P(X = c) = 1 for some constant c.
Proof.
⇒) Set c = µ and assume Var(X) = 0, which means that
Z
2
E[(X − µ) ] = 0 =⇒ (x − µ)2 dF (x) = 0.

This can only be 0 when x = µ = c for the entire domain of X. Hence P(X = c) = 1.
⇐) Assume P(X = c) = 1. When calculating the expectation:
µ = E[X] = ∫ c dF(x) = c

When calculating the variance:


Z Z
Var(X) = (x − µ)2 dF (x) = (c − c)2 dF (x) = 0,

since x = c for all x in the domain of X.

3.3
Let X1 , . . . , Xn ∼ U (0, 1) and define Y = max(X1 , . . . , Xn ). We will calculate E[Y ]. It is not stated
in the exercise, but we will assume that the Xi are independent. Finding the CDF for Y .

FY(y) = P(Y ≤ y)
      = P(max(X1, . . . , Xn) ≤ y)
      = P(X1 ≤ y, . . . , Xn ≤ y)
      = P(X1 ≤ y) P(X2 ≤ y) · · · P(Xn ≤ y)   (Independence)
      = (FX(y))^n

Differentiating to get the PDF for Y .


fY(y) = d/dy FY(y) = d/dy (FX(y))^n = n (FX(y))^{n−1} fX(y)

Since Xi are uniformly distributed, we know that FX (y) = y and fX (y) = 1, so:

fY (y) = ny n−1

Now we can calculate the expectation of Y .


E[Y] = ∫_0^1 y · n y^{n−1} dy = n ∫_0^1 y^n dy = n [y^{n+1}/(n + 1)]_0^1 = n/(n + 1)

Confirming this result with a numeric simulation in R.


> # 3.3
> N = 1000000
> U1 = runif ( N )
> U2 = runif ( N )
> U3 = runif ( N )
> U4 = runif ( N )
> U5 = runif ( N )
> U6 = runif ( N )
> U7 = runif ( N )
> U8 = runif ( N )
> U9 = runif ( N )
> U10 = runif ( N )
> Y = pmax ( U1 , U2 , U3 , U4 , U5 ,
+ U6 , U7 , U8 , U9 , U10 )
> mean ( Y )
[1] 0.9091151
> # Theoretical Result
> 10 / 11
[1] 0.9090909

As we can see, the theoretical result is very close to the simulated result for n = 10.

3.4 - Random Walk
A particle starts in the origin and jumps left, a step of -1, with probability p and jumps right, a
step of 1, with probability 1 − p. The expected location will be:

E[X] = (−1)p + (1)(1 − p) = −p + 1 − p = 1 − 2p

To calculate the variance, we start by finding the second moment:

E[X 2 ] = (−1)2 p + (1)2 (1 − p) = p + 1 − p = 1

So the variance is:

Var(X) = E[X 2 ] − E[X]2 = 1 − (1 − 2p)2 = 1 − (1 − 4p + 4p2 ) = 4p − 4p2
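A quick simulation check of these two formulas (a Python sketch; p = 0.3 is an arbitrary illustrative value):

```python
import random

random.seed(4)
p = 0.3  # illustrative probability of a -1 step
steps = [(-1 if random.random() < p else 1) for _ in range(200000)]

mean = sum(steps) / len(steps)
var = sum((s - mean) ** 2 for s in steps) / len(steps)
print(mean, 1 - 2 * p)        # empirical vs. 1 - 2p = 0.4
print(var, 4 * p - 4 * p**2)  # empirical vs. 4p - 4p^2 = 0.84
```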

3.5
Tossing a fair coin until we get H. Finding the expected number of tosses. The reasoning is as
follows. We get H on the first toss with probability 1/2, first H on the second toss with probability
1/22 = 1/4 and so on. The pattern becomes as follows, for the first 7 cases:

Tosses Outcome Probability


1 {H} 1/2
2 {T H} 1/4
3 {T T H} 1/8
4 {T T T H} 1/16
5 {T T T T H} 1/32
6 {T T T T T H} 1/64
7 {T T T T T T H} 1/128

Define T to be the number of tosses to get H.


     
E[T] = (1)(1/2) + (2)(1/4) + . . . + (k)(1/2^k) + . . .
     = Σ_{k=1}^{∞} k/2^k
     = 2

Not delving into the mathematics of the infinite sum, but it can be shown that it equals 2, which is
therefore the expected number of tosses to get an H. Here is a numeric approximation in R.
> sumApprox = 0
> for ( k in 1:1000) {
+ sumApprox = sumApprox + k / 2^ k
+ }
> sumApprox
[1] 2

3.6 Theorem - The Rule of the Lazy Statistician
Proving the following result for the discrete case. Let Y = r(X), then
X
E[Y ] = E[r(X)] = r(x)fX (x)
x

Proof. Set 𝒳, 𝒴 to be the sets of all values for X and Y. We have the function r : 𝒳 → 𝒴, and for y ∈ 𝒴 we define s(y) = {x : r(x) = y}, the preimage of y under r (r need not be invertible). By definition of the expectation:
E[Y] = Σ_{y∈𝒴} y · fY(y) = Σ_{y∈𝒴} y · P(Y = y)
     = Σ_{y∈𝒴} y Σ_{x∈s(y)} P(X = x)
     = Σ_{y∈𝒴} Σ_{x∈s(y)} y P(X = x)
     = Σ_{x∈𝒳} r(x) P(X = x)
     = Σ_{x∈𝒳} r(x) fX(x)

A simplified case to make the proof easier a bit easier to understand. We define the variables X
and Y = r(X) = X 2 , and use the following distributions.

x    P(X = x)        y    P(Y = y)
−1   1/4             0    1/4
 0   1/4             1    3/4
 1   1/2

with Y = r(X) = X².

Then the calculations above become:


E[Y] = Σ_y y · fY(y)               = (0)(1/4) + (1)(3/4)
     = Σ_y y Σ_{x∈s(y)} P(X = x)   = (0)(1/4) + (1)(1/4 + 1/2)
     = Σ_y Σ_{x∈s(y)} y P(X = x)   = (0)(1/4) + (1)(1/4) + (1)(1/2)
     = Σ_x r(x) P(X = x)           = (0)²(1/4) + (−1)²(1/4) + (1)²(1/2)
     = Σ_x r(x) fX(x)

In both cases, the expectation becomes 3/4.

3.7
X ∼ F is continuous, and we suppose that P(X > 0) = 1 and that E[X] exists. Show that

E[X] = ∫_0^∞ P(X > x) dx.
Proof. By definition of the expectation, and since the domain of X must be (0, ∞):
E[X] = ∫_0^∞ x · fX(x) dx

Using integration by parts, setting u = x and v′ = fX, which gives u′ = 1 and v = FX:

∫ u v′ = uv − ∫ u′ v,

which gives us:

E[X] = ∫_0^∞ x · fX(x) dx = [x FX(x)]_0^∞ − ∫_0^∞ FX(x) dx
     = [x FX(x)]_0^∞ − ∫_0^∞ P(X ≤ x) dx
     = [x FX(x)]_0^∞ − ∫_0^∞ (1 − P(X > x)) dx
     = [x FX(x)]_0^∞ − [x]_0^∞ + ∫_0^∞ P(X > x) dx
     = − lim_{x→∞} x(1 − FX(x)) + ∫_0^∞ P(X > x) dx
     = ∫_0^∞ P(X > x) dx

which proves the result. In the second to last step, we applied the hint regarding the limit.

3.8 - Theorem 3.17


Let X1, . . . , Xn be IID and let µ = E[Xi], σ² = Var(Xi). Then:

E[X̄] = µ,  Var(X̄) = σ²/n,  E[S²] = σ².

Proof. Start with the expectation of the sample mean:

E[X̄] = E[(1/n) Σ_{i=1}^n Xi] = (1/n) Σ_{i=1}^n E[Xi] = (1/n) · nµ = µ.

Next, the variance (using independence to split the variance of the sum):

Var(X̄) = Var((1/n) Σ_{i=1}^n Xi) = (1/n²) Σ_{i=1}^n Var(Xi) = (1/n²) · nσ² = σ²/n.

Finally, the expectation of the sample variance. This calculation is a lot more involved.

S² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)²  ⟹  (n − 1) E[S²] = E[Σ_{i=1}^n (Xi − X̄)²]

We will use that X̄ = (1/n) ΣXi means that ΣXi = nX̄. Running through the calculations:

(n − 1) E[S²] = E[Σ_{i=1}^n (Xi² − 2Xi X̄ + X̄²)]
  = E[Σ Xi²] − E[2X̄ Σ Xi] + E[X̄² Σ 1]      (X̄ does not depend on i)
  = Σ E[Xi²] − 2n E[X̄²] + n E[X̄²]          (ΣXi = nX̄)
  = n E[Xi²] − n E[X̄²]

By dividing both sides of the equality by n, we have shown:

((n − 1)/n) E[S²] = E[Xi²] − E[X̄²]          (3.8.1)
n
We now need the second moments. From the definition of the variance:

Var(Xi) = E[Xi²] − µ²  ⟹  E[Xi²] = Var(Xi) + µ² = σ² + µ²

From the previous results:

Var(X̄) = E[X̄²] − E[X̄]²  ⟹  E[X̄²] = Var(X̄) + E[X̄]² = σ²/n + µ²
Replacing each of these into (3.8.1) and isolating E[S²]:

E[S²] = (n/(n − 1)) [(σ² + µ²) − (σ²/n + µ²)]
      = (n/(n − 1)) (σ² − σ²/n)
      = (n/(n − 1)) · σ² (n − 1)/n
      = σ²

and we have finally shown the final result.
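The unbiasedness of S² can also be checked by simulation (a Python sketch; µ, σ, n and the repetition count are illustrative choices):

```python
import random

random.seed(5)
mu, sigma, n, reps = 1.0, 2.0, 5, 50000  # illustrative parameters

def sample_var(xs):
    # S^2 with the 1/(n - 1) factor, as in the theorem.
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

s2s = [sample_var([random.gauss(mu, sigma) for _ in range(n)]) for _ in range(reps)]
print(sum(s2s) / reps)  # ~ sigma^2 = 4
```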

3.9
Computer experiment; studying the effects of the mean over 10.000 simulations from the standard
normal distribution and the Cauchy distribution.
As we can see, the simulated mean of the normal distribution quickly stabilizes to a value around
0, which is the expected value. The Cauchy distribution is a very heavy-tailed distribution and its
expectation does not exist. Therefore, there is no mean it can stabilize to, and it keeps behaving
erratically even after 10,000 simulations. The jumps that can be seen are extreme outliers added to
the data, which are relatively common for the Cauchy distribution.
Results:

[Two panels, "Mean of N(0,1)" and "Mean of Cauchy", each showing the running mean over 10,000 simulations.]

# Setting the number of simulations
N = 10000

# Simulating N standard normal values
X = rnorm(N)
Xmean = cumsum(X) / 1:N
plot(1:N, Xmean, type = "l", ylim = c(-0.2, 0.2),
     main = "Mean of N(0,1)")
abline(h = 0, col = "red")

# Simulating N Cauchy values with loc = 0, scale = 1
Y = rcauchy(N)
Ymean = cumsum(Y) / 1:N
plot(1:N, Ymean, type = "l",
     main = "Mean of Cauchy")

3.10
Let X ∼ N (0, 1) and define Y = eX . Find E[Y ] and Var(Y ).
One possible way to solve this is defining r(X) = eX and evaluating:
E[Y] = E[r(X)] = ∫_ℝ e^x fX(x) dx = (1/√(2π)) ∫_ℝ exp(x − x²/2) dx.

Completing the square in the exponential term:

x − x²/2 = −(1/2)(x² − 2x) = −(1/2)(x² − 2x + 1 − 1) = −(1/2)(x − 1)² + 1/2

We can rewrite the integral:

E[Y] = e^{1/2} (1/√(2π)) ∫_ℝ exp(−(x − 1)²/2) dx

We will integrate this by substitution:

u = x − 1  ⟹  du/dx = 1  ⟹  du = dx

E[Y] = e^{1/2} (1/√(2π)) ∫_ℝ exp(−u²/2) du = e^{1/2} ≈ 1.6487

since (1/√(2π)) ∫_ℝ exp(−u²/2) du = 1 (the N(0, 1) density integrates to 1).
Confirming with numerical simulation.
> N = 10000000; X = rnorm(N); Y = exp(X)
> mean(Y)
[1] 1.648564
> exp(0.5)
[1] 1.648721
> var(Y)
[1] 4.676382
> exp(2) - exp(1)
[1] 4.670774

We can calculate the variance by first finding the second moment: E[Y 2 ].
Z Z  
E[Y²] = E[r(X)²] = ∫_ℝ e^{2x} fX(x) dx = (1/√(2π)) ∫_ℝ exp(2x − x²/2) dx.

Just as above, we complete the square and get: −(1/2)(x − 2)2 + 2 and move e2 outside the integral.
Then we use integration by substitution: u = x − 2 which gives du = dx. We end up with a similar
integral:

E[Y²] = e² (1/√(2π)) ∫_ℝ exp(−u²/2) du = e²
Now we can calculate the variance of Y .

Var(Y ) = E[Y 2 ] − E[Y ]2 = e2 − (e1/2 )2 = e2 − e1 ,

and as we can see in the simulation above, this corresponds to the simulated variance.

3.11
Simulating stocks. For each day, Yi can be either −1 or 1, each with probability 1/2. We have
Xn = Σ_{i=1}^n Yi, which is the cumulative value of the stock.
(a) Calculating the expectation and variance of Xn . First we calculate it for Yi .
E[Yi] = (−1)(1/2) + (1)(1/2) = −1/2 + 1/2 = 0
E[Yi²] = (−1)²(1/2) + (1)²(1/2) = 1/2 + 1/2 = 1
Var(Yi) = E[Yi²] − E[Yi]² = 1 − 0 = 1

Now for Xn (using independence of the Yi for the variance):

E[Xn] = E[Σ_{i=1}^n Yi] = Σ_{i=1}^n E[Yi] = 0
Var(Xn) = Var(Σ_{i=1}^n Yi) = Σ_{i=1}^n Var(Yi) = Σ_{i=1}^n 1 = n
(b) Simulating four stocks.
simStock <- function(N) {
  dailyMovements = sample(c(-1, 1), size = N, replace = TRUE)
  totalMovements = cumsum(dailyMovements)
  return(totalMovements)
}
N = 10000
s1 = simStock(N); s2 = simStock(N)
s3 = simStock(N); s4 = simStock(N)
plot(1:N, s1, type = "l",
     ylim = c(-119, 212),  # Adjust according to simulation results
     main = "Stock simulation")
lines(1:N, s2, type = "l", col = "red")
lines(1:N, s3, type = "l", col = "blue")
lines(1:N, s4, type = "l", col = "green")

[Plot of four simulated stock paths over 10,000 days.]

The simulations have mean 0, so they should vary around the x-axis. Since the variance is n, it
depends on the time: the longer the simulations last, the more they vary, which is what we see.
The standard deviation in this example is the square root of n, which is about 100, and that is
also what we see (it is expected that the paths will not sit exactly at the standard deviation).

3.12
Deriving the general expression for the expectation and variance for a whole bunch of probability
distributions.
• Bernoulli with parameter p. Possible values: 𝒳 = {0, 1}.

f(x) = p^x (1 − p)^{1−x}

Expectation.
E[X] = Σ_{x∈𝒳} x · f(x) = (0)(1 − p) + (1)p = p

Second moment.
E[X²] = Σ_{x∈𝒳} x² · f(x) = (0)²(1 − p) + (1)²p = p

Variance.
Var(X) = E[X²] − E[X]² = p − p² = p(1 − p)

• Poisson with parameter $\lambda$. Possible values: $X = \mathbb{N} \cup \{0\}$.
$$f(x) = e^{-\lambda} \frac{\lambda^x}{x!}$$
Expectation. The first term disappears; cancel $x$ against the factorial; switch to $k = x - 1$.
$$E[X] = \sum_{x \in X} x f(x) = \sum_{x=0}^{\infty} x \cdot e^{-\lambda} \frac{\lambda^x}{x!} = 0 + \lambda e^{-\lambda} \sum_{x=1}^{\infty} \frac{\lambda^{x-1}}{(x-1)!} = \lambda e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = \lambda e^{-\lambda} e^{\lambda} = \lambda$$
Second moment. We will split the sum into two Poisson sums.
$$E[X^2] = \sum_{x \in X} x^2 \cdot f(x) = \sum_{x=0}^{\infty} x^2 \cdot e^{-\lambda} \frac{\lambda^x}{x!} = 0 + \lambda e^{-\lambda} \sum_{x=1}^{\infty} x \cdot \frac{\lambda^{x-1}}{(x-1)!} = \lambda e^{-\lambda} \sum_{x=1}^{\infty} (x - 1 + 1) \cdot \frac{\lambda^{x-1}}{(x-1)!}$$
$$= \lambda e^{-\lambda} \sum_{x=1}^{\infty} \left[ (x-1)\frac{\lambda^{x-1}}{(x-1)!} + \frac{\lambda^{x-1}}{(x-1)!} \right] = \lambda e^{-\lambda} \left( \sum_{x=2}^{\infty} (x-1)\frac{\lambda^{x-1}}{(x-1)!} + \sum_{x=1}^{\infty} \frac{\lambda^{x-1}}{(x-1)!} \right)$$
$$= \lambda e^{-\lambda} \left( \lambda \sum_{x=2}^{\infty} \frac{\lambda^{x-2}}{(x-2)!} + \sum_{x=1}^{\infty} \frac{\lambda^{x-1}}{(x-1)!} \right) = \lambda e^{-\lambda} \left( \lambda \sum_{j=0}^{\infty} \frac{\lambda^j}{j!} + \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} \right) = \lambda e^{-\lambda} \left( \lambda e^{\lambda} + e^{\lambda} \right) = \lambda^2 + \lambda$$
The remaining calculation for the variance is easy.
$$\mathrm{Var}(X) = E[X^2] - E[X]^2 = \lambda^2 + \lambda - \lambda^2 = \lambda$$
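As a quick numeric sanity check (sketched in Python rather than the R used elsewhere in these solutions; λ = 3.7 is an arbitrary illustrative value), truncating the pmf sums reproduces both E[X] = λ and Var(X) = λ:

```python
import math

# Check E[X] = lambda and Var(X) = lambda for Poisson(lambda).
# lam = 3.7 is an arbitrary illustrative value, not taken from the text.
lam = 3.7

def pmf(x):
    # e^{-lam} * lam^x / x!, computed on the log scale to avoid overflow
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))

xs = range(100)  # tail mass beyond x = 100 is negligible for lam = 3.7
mean = sum(x * pmf(x) for x in xs)
second = sum(x * x * pmf(x) for x in xs)
var = second - mean * mean

assert abs(mean - lam) < 1e-9
assert abs(var - lam) < 1e-9
```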

• Uniform on $(a, b)$. Possible values: $X = (a, b)$.
$$f(x) = \frac{1}{b-a}$$
Expectation.
$$E[X] = \int_X x \cdot f(x) = \frac{1}{b-a} \int_a^b x\, dx = \frac{1}{b-a}\left[\frac{x^2}{2}\right]_a^b = \frac{1}{b-a} \cdot \frac{b^2 - a^2}{2} = \frac{1}{b-a} \cdot \frac{(b+a)(b-a)}{2} = \frac{b+a}{2}$$
Second moment.
$$E[X^2] = \int_X x^2 \cdot f(x) = \frac{1}{b-a} \int_a^b x^2\, dx = \frac{1}{b-a}\left[\frac{x^3}{3}\right]_a^b = \frac{1}{b-a} \cdot \frac{b^3 - a^3}{3} = \frac{1}{b-a} \cdot \frac{(b-a)(b^2 + ab + a^2)}{3} = \frac{b^2 + ab + a^2}{3}$$
Variance.
$$\mathrm{Var}(X) = E[X^2] - E[X]^2 = \frac{b^2 + ab + a^2}{3} - \frac{(b+a)^2}{4} = \frac{4b^2 + 4ab + 4a^2 - 3b^2 - 6ab - 3a^2}{12} = \frac{b^2 - 2ab + a^2}{12} = \frac{(b-a)^2}{12}$$

• Exponential with parameter $\beta$. Possible values: $X = \{x \in \mathbb{R} : x > 0\}$.
$$f(x) = \frac{1}{\beta}\exp(-x/\beta)$$
Expectation.
$$E[X] = \int_X x \cdot f(x) = \int_0^{\infty} x \cdot \frac{1}{\beta} e^{-x/\beta}\, dx$$
We will integrate this with the substitution $u = x/\beta$, so $du/dx = 1/\beta$, which gives us $dx = \beta\, du$. Note that the antiderivative of $ue^{-u}$ is $-ue^{-u} - e^{-u}$.
$$E[X] = \int_0^{\infty} x \cdot \frac{1}{\beta} e^{-x/\beta}\, dx = \beta \int_0^{\infty} u e^{-u}\, du = \beta\left[-ue^{-u} - e^{-u}\right]_0^{\infty} = \beta\left(\lim_{u\to\infty}(-ue^{-u} - e^{-u}) - (-1)\right) = \beta$$
Skipping the detailed evaluation of the limit: the first term can be shown to converge to 0 by using L'Hôpital's rule, and the second term in the limit obviously becomes 0.
Second moment. We will reuse the expression for the expectation in the calculations. Using integration by parts: $u = x^2$, so $u' = 2x$. Setting $v' = (1/\beta)e^{-x/\beta}$, which yields $v = -e^{-x/\beta}$. From the integration-by-parts formula:
$$\int uv' = uv - \int u'v$$
$$E[X^2] = \int_0^{\infty} x^2 \cdot \frac{1}{\beta} e^{-x/\beta}\, dx = \left[-x^2 e^{-x/\beta}\right]_0^{\infty} - \int_0^{\infty} -2x e^{-x/\beta}\, dx = 0 + \int_0^{\infty} 2x e^{-x/\beta}\, dx = 2\beta \int_0^{\infty} x \cdot \frac{1}{\beta} e^{-x/\beta}\, dx = 2\beta E[X] = 2\beta^2$$
The boundary term can be shown to converge to 0 by applying L'Hôpital's rule twice. We use a clever trick and substitute in the expectation to get the result. Finally, we calculate the variance.
$$\mathrm{Var}(X) = E[X^2] - E[X]^2 = 2\beta^2 - \beta^2 = \beta^2$$
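The closed forms E[X] = β and E[X²] = 2β² can be cross-checked numerically. A small midpoint-rule sketch in Python (the document's own code is in R; β = 1.5 is an arbitrary choice):

```python
import math

# Midpoint-rule check of E[X] = beta and E[X^2] = 2*beta^2 for Exp(beta).
# beta = 1.5 is an arbitrary illustrative choice.
beta = 1.5
f = lambda x: math.exp(-x / beta) / beta

h, upper = 1e-3, 60.0  # step size and truncation point; the tail beyond 60 is negligible
grid = [(k + 0.5) * h for k in range(int(upper / h))]
mean = sum(x * f(x) * h for x in grid)
second = sum(x * x * f(x) * h for x in grid)

assert abs(mean - beta) < 1e-4
assert abs(second - 2 * beta**2) < 1e-3
```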

• Gamma with parameters $\alpha, \beta$. Possible values: $X = \{x \in \mathbb{R} : x > 0\}$.
$$f(x) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} x^{\alpha-1} \exp(-x/\beta)$$
We will use the definition of the Gamma function, and the 'Gamma difference' chaining property:
$$\Gamma(\alpha+1) = \int_0^{\infty} t^{\alpha} e^{-t}\, dt, \qquad \Gamma(\alpha+1) = \alpha\Gamma(\alpha)$$
In step 3, we make the substitution $t = x/\beta$, which makes $x = \beta t$; $dt/dx = 1/\beta$, so $dx = \beta\, dt$.
$$E[X] = \int_X x \cdot f(x) = \int_0^{\infty} x \cdot \frac{1}{\Gamma(\alpha)\beta^{\alpha}} x^{\alpha-1} e^{-x/\beta}\, dx = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} \int_0^{\infty} x^{\alpha} e^{-x/\beta}\, dx = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} \int_0^{\infty} (\beta t)^{\alpha} e^{-t}\, \beta\, dt$$
$$= \frac{\beta^{\alpha+1}}{\Gamma(\alpha)\beta^{\alpha}} \int_0^{\infty} t^{\alpha} e^{-t}\, dt = \frac{\beta}{\Gamma(\alpha)} \int_0^{\infty} t^{\alpha} e^{-t}\, dt = \frac{\beta\Gamma(\alpha+1)}{\Gamma(\alpha)} = \frac{\beta\alpha\Gamma(\alpha)}{\Gamma(\alpha)} = \alpha\beta$$

For the Gamma distribution, we will not calculate the second moment separately. Instead we will go straight to the variance and calculate it directly. We will use the same substitution and the definition of the Gamma function again.
$$\mathrm{Var}(X) = E[X^2] - E[X]^2 = \int_0^{\infty} x^2 \cdot \frac{1}{\Gamma(\alpha)\beta^{\alpha}} x^{\alpha-1} e^{-x/\beta}\, dx - \alpha^2\beta^2 = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} \int_0^{\infty} x^{\alpha+1} e^{-x/\beta}\, dx - \alpha^2\beta^2$$
$$= \frac{1}{\Gamma(\alpha)\beta^{\alpha}} \int_0^{\infty} (\beta t)^{\alpha+1} e^{-t}\, \beta\, dt - \alpha^2\beta^2 = \frac{\beta^{\alpha+2}}{\Gamma(\alpha)\beta^{\alpha}} \int_0^{\infty} t^{\alpha+1} e^{-t}\, dt - \alpha^2\beta^2$$
Using the definition of the Gamma function,
$$= \frac{\beta^2\Gamma(\alpha+2)}{\Gamma(\alpha)} - \alpha^2\beta^2 = \frac{\beta^2\left(\Gamma(\alpha+2) - \alpha^2\Gamma(\alpha)\right)}{\Gamma(\alpha)} = \frac{\beta^2\left((\alpha+1)\Gamma(\alpha+1) - \alpha^2\Gamma(\alpha)\right)}{\Gamma(\alpha)} = \frac{\beta^2\left((\alpha+1)\alpha\Gamma(\alpha) - \alpha^2\Gamma(\alpha)\right)}{\Gamma(\alpha)}$$
$$= \beta^2\left((\alpha+1)\alpha - \alpha^2\right) = \beta^2\left(\alpha^2 + \alpha - \alpha^2\right) = \alpha\beta^2$$
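A Monte Carlo cross-check of E[X] = αβ and Var(X) = αβ², in Python rather than the document's R. Python's `random.gammavariate(alpha, beta)` uses the same shape/scale parametrization as the density above; α = 2, β = 3 are arbitrary illustrative values.

```python
import random
import statistics

# Monte Carlo check of E[X] = alpha*beta and Var(X) = alpha*beta^2.
# alpha = 2.0, beta = 3.0 are arbitrary illustrative parameters.
random.seed(42)
alpha, beta = 2.0, 3.0
draws = [random.gammavariate(alpha, beta) for _ in range(200_000)]

mean = statistics.fmean(draws)
var = statistics.variance(draws)

assert abs(mean - alpha * beta) < 0.1     # theoretical mean is 6
assert abs(var - alpha * beta**2) < 0.6   # theoretical variance is 18
```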

And the calculation for the Gamma distribution has been completed.

• Beta with parameters $\alpha, \beta$. Possible values: $X = (0, 1)$.
$$f(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1}(1-x)^{\beta-1}$$
In the following, we will use the identity:
$$\frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)} = \int_0^1 t^{\alpha-1}(1-t)^{\beta-1}\, dt$$
Expectation.
$$E[X] = \int_X x \cdot f(x) = \int_0^1 x \cdot \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1}(1-x)^{\beta-1}\, dx = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \int_0^1 x^{\alpha}(1-x)^{\beta-1}\, dx$$
$$= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \cdot \frac{\Gamma(\alpha+1)\Gamma(\beta)}{\Gamma(\alpha+\beta+1)} = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \cdot \frac{\alpha\Gamma(\alpha)\Gamma(\beta)}{(\alpha+\beta)\Gamma(\alpha+\beta)} = \frac{\alpha}{\alpha+\beta}$$

Variance. Just like for the Gamma distribution, we calculate the second moment directly in the variance calculation, using a lot of the same tricks as when calculating the expectation.
$$\mathrm{Var}(X) = E[X^2] - E[X]^2 = \int_0^1 x^2 \cdot \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1}(1-x)^{\beta-1}\, dx - \left(\frac{\alpha}{\alpha+\beta}\right)^2 = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \int_0^1 x^{\alpha+1}(1-x)^{\beta-1}\, dx - \left(\frac{\alpha}{\alpha+\beta}\right)^2$$
$$= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \cdot \frac{\Gamma(\alpha+2)\Gamma(\beta)}{\Gamma(\alpha+\beta+2)} - \left(\frac{\alpha}{\alpha+\beta}\right)^2 = \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)} \cdot \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \cdot \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)} - \left(\frac{\alpha}{\alpha+\beta}\right)^2 \qquad (3.12.1)$$
$$= \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)} - \frac{\alpha^2}{(\alpha+\beta)^2} = \frac{(\alpha^2+\alpha)(\alpha+\beta) - \alpha^2(\alpha+\beta+1)}{(\alpha+\beta)^2(\alpha+\beta+1)} = \frac{\alpha^3 + \alpha^2\beta + \alpha^2 + \alpha\beta - \alpha^3 - \alpha^2\beta - \alpha^2}{(\alpha+\beta)^2(\alpha+\beta+1)} = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$$
In step (3.12.1) we used the difference/chaining property of the Gamma function:
$$\Gamma(\alpha+2) = (\alpha+1)\Gamma(\alpha+1) = (\alpha+1)\alpha\Gamma(\alpha)$$
and
$$\Gamma(\alpha+\beta+2) = (\alpha+\beta+1)\Gamma(\alpha+\beta+1) = (\alpha+\beta+1)(\alpha+\beta)\Gamma(\alpha+\beta)$$
This concludes the last calculation. I found this exercise to be a little tedious, to be honest...
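As with the Gamma case, the Beta result is easy to cross-check by simulation; a Python sketch using the stdlib `random.betavariate` (α = 2, β = 5 are arbitrary illustrative values):

```python
import random
import statistics

# Monte Carlo check of E[X] = a/(a+b) and Var(X) = a*b / ((a+b)^2 (a+b+1)).
# a = 2.0, b = 5.0 are arbitrary illustrative parameters.
random.seed(1)
a, b = 2.0, 5.0
draws = [random.betavariate(a, b) for _ in range(200_000)]

mean_th = a / (a + b)
var_th = a * b / ((a + b)**2 * (a + b + 1))

assert abs(statistics.fmean(draws) - mean_th) < 0.01
assert abs(statistics.variance(draws) - var_th) < 0.005
```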

3.13
A variable X is generated by assuming a value in U (0, 1) if a coin toss is 0 (H), and X is in U (3, 4)
if the coin toss is 1 (T). The coin is fair, so each of these has a probability of 1/2.
(a) Calculating the mean of X. This will be a conditional expectation, and we define Y to be the coin toss. The conditional densities might depend on the outcome, but when we calculate them:
$$f_{X|Y}(x|Y=0) = \frac{1}{1-0} = 1, \qquad f_{X|Y}(x|Y=1) = \frac{1}{4-3} = 1,$$
we find that $f_{X|Y}(x|y) = 1$ either way (on the respective supports).

Now we can calculate the expected value for specific outcomes.
$$E[X|Y=0] = \int_X x \cdot f_{X|Y}(x|y) = \int_0^1 x\, dx = \left[\frac{x^2}{2}\right]_0^1 = \frac{1-0}{2} = \frac{1}{2}$$
$$E[X|Y=1] = \int_X x \cdot f_{X|Y}(x|y) = \int_3^4 x\, dx = \left[\frac{x^2}{2}\right]_3^4 = \frac{16-9}{2} = \frac{7}{2}$$
Using that $p_Y(y)$ is known, we can calculate the expected value of X:
$$E[X] = E[E[X|Y]] = \sum_{y \in \{0,1\}} E[X|Y=y]\,\mathbb{P}(Y=y) = \left(\frac{1}{2}\right)\left(\frac{1}{2}\right) + \left(\frac{7}{2}\right)\left(\frac{1}{2}\right) = \frac{1}{4} + \frac{7}{4} = \frac{8}{4} = 2$$
(b) Calculating the second moment.
$$E[X^2|Y=0] = \int_X x^2 \cdot f_{X|Y}(x|y) = \int_0^1 x^2\, dx = \left[\frac{x^3}{3}\right]_0^1 = \frac{1-0}{3} = \frac{1}{3}$$
$$E[X^2|Y=1] = \int_X x^2 \cdot f_{X|Y}(x|y) = \int_3^4 x^2\, dx = \left[\frac{x^3}{3}\right]_3^4 = \frac{4^3 - 3^3}{3} = \frac{37}{3}$$
$$E[X^2] = E[E[X^2|Y]] = \sum_{y \in \{0,1\}} E[X^2|Y=y]\,\mathbb{P}(Y=y) = \left(\frac{1}{3}\right)\left(\frac{1}{2}\right) + \left(\frac{37}{3}\right)\left(\frac{1}{2}\right) = \frac{1}{6} + \frac{37}{6} = \frac{38}{6} = \frac{19}{3}$$
Calculating the variance:
$$\mathrm{Var}(X) = E[X^2] - E[X]^2 = \frac{19}{3} - 4 = \frac{7}{3} \approx 2.333$$
And finally the standard deviation:
$$\mathrm{SD}(X) = \sqrt{\mathrm{Var}(X)} = \sqrt{\frac{7}{3}} \approx 1.5275$$
To verify these results, we did a numeric simulation.

N = 10000; Y = sample(c(0, 1), size = N, replace = TRUE); X = rep(0, N)
for (i in 1:N) {
  if (Y[i] == 0) X[i] = runif(1)           # coin toss 0 (H): X ~ U(0,1)
  else X[i] = runif(1, min = 3, max = 4)   # coin toss 1 (T): X ~ U(3,4)
}
> mean ( X )
[1] 2.01424
> var ( X )
[1] 2.332699
> sd ( X )
[1] 1.527318

3.14
Claim: Let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ be random variables and $a_1, \ldots, a_m$ and $b_1, \ldots, b_n$ be constants. Then:
$$\mathrm{Cov}\left(\sum_{i=1}^m a_i X_i, \sum_{j=1}^n b_j Y_j\right) = \sum_{i=1}^m \sum_{j=1}^n a_i b_j\, \mathrm{Cov}(X_i, Y_j).$$

Proof. This can be verified with a direct proof. We will do a simplified example with m = n = 2,
which is not a formal proof, but can easily be extended to one. We will repeatedly use:
Cov(A, B) = E[AB] − E[A]E[B].
Starting with the covariance, we apply the identity above. We have, for m = n = 2:
Cov(a1 X1 + a2 X2 , b1 Y1 + b2 Y2 ) = E[(a1 X1 + a2 X2 )(b1 Y1 + b2 Y2 )] − E[a1 X1 + a2 X2 ]E[b1 Y1 + b2 Y2 ]
= S1 − S2
Multiplying and rewriting the first term:
S1 = E[(a1 X1 + a2 X2 )(b1 Y1 + b2 Y2 )] = E[a1 b1 X1 Y1 + a1 b2 X1 Y2 + a2 b1 X2 Y1 + a2 b2 X2 Y2 ]
= a1 b1 E[X1 Y1 ] + a1 b2 E[X1 Y2 ] + a2 b1 E[X2 Y1 ] + a2 b2 E[X2 Y2 ]
The second term:
S2 = E[a1 X1 + a2 X2 ]E[b1 Y1 + b2 Y2 ] = (a1 E[X1 ] + a2 E[X2 ]) (b1 E[Y1 ] + b2 E[Y2 ])
= a1 b1 E[X1 ]E[Y1 ] + a1 b2 E[X1 ]E[Y2 ] + a2 b1 E[X2 ]E[Y1 ] + a2 b2 E[X2 ]E[Y2 ]
Subtracting:
S1 − S2 = a1 b1 E[X1 Y1 ] + a1 b2 E[X1 Y2 ] + a2 b1 E[X2 Y1 ] + a2 b2 E[X2 Y2 ]
− a1 b1 E[X1 ]E[Y1 ] − a1 b2 E[X1 ]E[Y2 ] − a2 b1 E[X2 ]E[Y1 ] − a2 b2 E[X2 ]E[Y2 ]
= a1 b1 (E[X1 Y1 ] − E[X1 ]E[Y1 ]) + a1 b2 (E[X1 Y2 ] − E[X1 ]E[Y2 ])
+ a2 b1 (E[X2 Y1 ] − E[X2 ]E[Y1 ]) + a2 b2 (E[X2 Y2 ] − E[X2 ]E[Y2 ])
= a1 b1 Cov(X1 , Y1 ) + a1 b2 Cov(X1 , Y2 ) + a2 b1 Cov(X2 , Y1 ) + a2 b2 Cov(X2 , Y2 )
$$= \sum_{i=1}^2 \sum_{j=1}^2 a_i b_j\, \mathrm{Cov}(X_i, Y_j)$$

3.15
Defining the joint distribution:
$$f_{X,Y}(x,y) = \begin{cases} \frac{1}{3}(x+y) & 0 \leq x \leq 1,\ 0 \leq y \leq 2 \\ 0 & \text{otherwise} \end{cases}$$

We will calculate Var(2X − 3Y + 8). Constants disappear in the variance, and by Theorem 2.30:

Var(2X − 3Y + 8) = 4Var(X) + 9Var(Y ) − 12Cov(X, Y )

Finding the marginal distributions:
$$f_X(x) = \int_0^2 \frac{1}{3}(x+y)\, dy = \left[\frac{1}{3}xy + \frac{1}{6}y^2\right]_0^2 = \frac{2x}{3} + \frac{2}{3}$$
$$f_Y(y) = \int_0^1 \frac{1}{3}(x+y)\, dx = \left[\frac{1}{6}x^2 + \frac{1}{3}xy\right]_0^1 = \frac{y}{3} + \frac{1}{6}$$
Calculating the expectation of X.
$$E[X] = \int_0^1 x\left(\frac{2x}{3} + \frac{2}{3}\right) dx = \int_0^1 \left(\frac{2x^2}{3} + \frac{2x}{3}\right) dx = \left[\frac{2x^3}{9} + \frac{2x^2}{6}\right]_0^1 = \frac{5}{9}$$
Second moment:
$$E[X^2] = \int_0^1 x^2\left(\frac{2x}{3} + \frac{2}{3}\right) dx = \int_0^1 \left(\frac{2x^3}{3} + \frac{2x^2}{3}\right) dx = \left[\frac{2x^4}{12} + \frac{2x^3}{9}\right]_0^1 = \frac{7}{18}$$
Variance of X:
$$\mathrm{Var}(X) = E[X^2] - E[X]^2 = \frac{7}{18} - \frac{25}{81} = \frac{13}{162}$$
Calculating the expectation of Y.
$$E[Y] = \int_0^2 y\left(\frac{y}{3} + \frac{1}{6}\right) dy = \int_0^2 \left(\frac{y^2}{3} + \frac{y}{6}\right) dy = \left[\frac{y^3}{9} + \frac{y^2}{12}\right]_0^2 = \frac{11}{9}$$
Second moment:
$$E[Y^2] = \int_0^2 y^2\left(\frac{y}{3} + \frac{1}{6}\right) dy = \int_0^2 \left(\frac{y^3}{3} + \frac{y^2}{6}\right) dy = \left[\frac{y^4}{12} + \frac{y^3}{18}\right]_0^2 = \frac{16}{9}$$
Variance of Y:
$$\mathrm{Var}(Y) = E[Y^2] - E[Y]^2 = \frac{16}{9} - \frac{121}{81} = \frac{23}{81}$$

From the definition of the covariance (these calculations are pretty large, so skipping the details):
$$\mathrm{Cov}(X,Y) = E[(X - E[X])(Y - E[Y])] = \int_0^2 \int_0^1 \left(x - \frac{5}{9}\right)\left(y - \frac{11}{9}\right)\left(\frac{x}{3} + \frac{y}{3}\right) dx\, dy = \frac{1}{486}\int_0^2 \left(-9y^2 + 20y - 11\right) dy = -\frac{1}{81}$$
Putting it all together:
$$\mathrm{Var}(2X - 3Y + 8) = 4\mathrm{Var}(X) + 9\mathrm{Var}(Y) - 12\mathrm{Cov}(X,Y) = 4\left(\frac{13}{162}\right) + 9\left(\frac{23}{81}\right) - 12\left(-\frac{1}{81}\right) = \frac{26}{81} + \frac{207}{81} + \frac{12}{81} = \frac{245}{81}$$
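The final value 245/81 ≈ 3.025 can be cross-checked by integrating the joint density on a grid; a midpoint-rule sketch in Python (the grid size 400 is an arbitrary choice):

```python
# Midpoint-rule check that Var(2X - 3Y + 8) = 245/81 under f(x,y) = (x+y)/3
# on [0,1] x [0,2].
n = 400
hx, hy = 1.0 / n, 2.0 / n

m1 = m2 = 0.0
for i in range(n):
    x = (i + 0.5) * hx
    for j in range(n):
        y = (j + 0.5) * hy
        w = (x + y) / 3.0 * hx * hy   # density times cell area
        g = 2 * x - 3 * y + 8
        m1 += g * w
        m2 += g * g * w

var = m2 - m1 * m1
assert abs(var - 245 / 81) < 1e-4
```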

3.16
Let r(x) and s(y) be functions of x and y. Then:
$$E[r(X)s(Y)|X] = r(X)E[s(Y)|X], \qquad E[r(X)|X] = r(X)$$
Proof. By definition of the conditional expectation in the continuous case, for any fixed x:
$$E[r(X)s(Y)|X=x] = \int_Y r(x)s(y)\, f_{Y|X}(y|x)\, dy = r(x)\int_Y s(y)\, f_{Y|X}(y|x)\, dy = r(x)E[s(Y)|X=x]$$
Since this holds for every x, we get $E[r(X)s(Y)|X] = r(X)E[s(Y)|X]$. For the special case, we can set $s(Y) = 1$:
$$E[r(X)|X=x] = r(x)\int_Y f_{Y|X}(y|x)\, dy = r(x)(1) = r(x),$$
which proves the statement. It is similar in the discrete case.

3.17
Proving that
Var(Y ) = E[Var(Y |X)] + Var(E[Y |X]).
Proof. Will disregard the hints.

$$\mathrm{Var}(Y) = E[(Y - E[Y])^2] = E\left[\left((Y - E[Y|X]) + (E[Y|X] - E[Y])\right)^2\right]$$
$$= E[(Y - E[Y|X])^2] + 2E[(Y - E[Y|X])(E[Y|X] - E[Y])] + E[(E[Y|X] - E[Y])^2] = E_1 + 2E_2 + E_3$$

Consider each of these in turn. Using the definition of the conditional variance, with µ(x) = E[Y |X].
$$E_1 = E[(Y - E[Y|X])^2] = E\left[E[(Y - E[Y|X])^2 \mid X]\right] = E\left[\int_Y (y - \mu(x))^2 f(y|x)\, dy\right] = E[\mathrm{Var}(Y|X)]$$

For E3 , we define M = E[Y |X], and µM = E[M ] = E[E[Y |X]] = E[Y ]. On this form, we see that
we can use the definition of variance.

E3 = E[(E[Y |X] − E[Y ])2 ] = E[(M − µM )2 ] = Var(M ) = Var(E[Y |X])

Finally, we consider $E_2$. Using iterated expectation and conditioning on X, both E[Y|X] and E[Y] become constants wrt. the inner expectation. Define C := E[Y|X] − E[Y] in order to clarify.
$$E_2 = E\left[E\left[(Y - E[Y|X])(E[Y|X] - E[Y]) \mid X\right]\right] = E\left[E[(Y - E[Y|X])\,C \mid X]\right] = E\left[C\,E[(Y - E[Y|X]) \mid X]\right] = E\left[C\,(E[Y|X] - E[Y|X])\right] = E[C \cdot 0] = 0$$

Collecting all the terms:

Var(Y ) = E1 + 2E2 + E3
= E[Var(Y |X)] + 2(0) + Var(E[Y |X])
= E[Var(Y |X)] + Var(E[Y |X])

and the result has been proved.

3.18
If E[X|Y = y] = c for some constant c, then X and Y are uncorrelated.
Proof. Assuming E[X|Y = y] = c for a constant c. Calculating the covariance:

Cov(X, Y ) = E[XY ] − E[X]E[Y ]


= E[E[XY |Y ]] − E[E[X|Y ]]E[Y ]
= E[E[X|Y ]Y ] − E[E[X|Y ]]E[Y ]
= E[cY ] − E[c]E[Y ]
= cE[Y ] − cE[Y ]
=0

Since the covariance is 0, then it follows that the correlation is 0.

3.19
Studying the sampling distribution of the mean for the Uniform(0,1) distribution. If X ∼ U(0,1) then E[X] = 1/2 and Var(X) = 1/12. Will not plot $f_X$ since it is simply the constant line at y = 1.
From Theorem 3.17, we have $E[\overline{X}_n] = \mu = 1/2$ and $\mathrm{Var}(\overline{X}_n) = \sigma^2/n = 1/(12n)$. Plotting these as functions of n, mean as blue and variance as red. The mean remains at 1/2, but the variance becomes much smaller as n increases.
[Figure: µ (blue) and σ² (red) plotted as functions of n.]

For each n = 1, 5, 25, 100 we run 1000 simulations and make a histogram of the results, which can be seen on the next page. The results are centered around the mean, 1/2, but the spread becomes smaller as we increase n. We are basically seeing the central limit theorem in action.

[Figure: histograms of the 1000 simulated sample means for n = 1, 5, 25, 100; all centered at 0.5, with the spread shrinking as n grows.]

simSampDist <- function(numSim) {
  retMean = rep(0, 1000); retVar = rep(0, 1000)
  for (i in 1:1000) {
    sim = runif(numSim)
    retMean[i] = mean(sim)
    retVar[i] = var(sim)
  }
  retVal = list()
  retVal$mn = retMean
  retVal$vr = retVar
  return(retVal)
}
s1 = simSampDist (1)
s5 = simSampDist (5)
s25 = simSampDist (25)
s100 = simSampDist (100)
# Creating plots
par ( mfrow = c (2 ,2) )
hist ( s1 $ mn , main = " n = 1 " , xlim = c (0 ,1) , prob = TRUE )
hist ( s5 $ mn , main = " n = 5 " , xlim = c (0 ,1) , prob = TRUE )
hist ( s25 $ mn , main = " n = 25 " , xlim = c (0 ,1) , prob = TRUE )
hist ( s100 $ mn , main = " n = 100 " , xlim = c (0 ,1) , prob = TRUE )

3.20
If a is a vector and X is a random vector with mean µ and variance Σ, then E[aT X] = aT µ and
Var(aT X) = aT Σa. If A is a matrix then E[AX] = Aµ and Var(AX) = AΣAT .
Proof. Since $a^T X = \sum_{i=1}^n a_i X_i$ is a scalar, linearity of expectation gives
$$E[a^T X] = E\left[\sum_{i=1}^n a_i X_i\right] = \sum_{i=1}^n a_i E[X_i] = \sum_{i=1}^n a_i \mu_i = a^T \mu.$$
For the variance, we use the bilinearity of the covariance (exercise 3.14 with $Y_i = X_i$ and $b = a$):
$$\mathrm{Var}(a^T X) = \mathrm{Cov}\left(\sum_{i=1}^n a_i X_i, \sum_{j=1}^n a_j X_j\right) = \sum_{i=1}^n \sum_{j=1}^n a_i a_j\, \mathrm{Cov}(X_i, X_j) = a^T \Sigma a,$$
since $\Sigma$ is the matrix with entries $\Sigma_{ij} = \mathrm{Cov}(X_i, X_j)$ (and $\Sigma_{ii} = \mathrm{Var}(X_i)$ on the diagonal).
The same general idea applies to matrices as well, except the notation becomes more extensive: row k of AX is $a_k^T X$ with $a_k^T$ the k-th row of A, so applying the scalar results entrywise gives $E[AX] = A\mu$ and $\mathrm{Var}(AX) = A\Sigma A^T$. Will skip the details for brevity.

3.21
Let X and Y be random variables. If E[Y |X] = X, then Cov(X, Y ) = Var(X).
Proof. Applying the covariance identity and using iterated expectation:

Cov(X, Y ) = E[XY ] − E[X]E[Y ]


= E[E[XY |X]] − E[X]E[E[Y |X]]
= E[XE[Y |X]] − E[X]E[X]
= E[X 2 ] − E[X]2
= Var(X)

and the result has been proved.

3.22
Let X ∼ U(0,1) and 0 < a < b < 1. Define:
$$Y = \begin{cases} 1 & 0 < X < b \\ 0 & \text{otherwise} \end{cases}, \qquad Z = \begin{cases} 1 & a < X < 1 \\ 0 & \text{otherwise} \end{cases}$$
(a) Are Y and Z independent? One way of thinking about independence is whether knowing something about Y tells us anything about Z, or vice versa. Here it does: Y = 0 means $X \geq b$, and since a < b this forces X > a, i.e. Z = 1. So observing Y = 0 pins down Z completely. To verify this formally, probabilities for X ∼ U(0,1) are just interval lengths, and
$$\mathbb{P}(Y = 0, Z = 0) = \mathbb{P}(X \geq b,\ X \leq a) = 0, \qquad \text{while} \qquad \mathbb{P}(Y = 0)\,\mathbb{P}(Z = 0) = (1-b)\,a > 0.$$
Since these are not equal, Y and Z are NOT independent.
(b) Finding E[Y|Z]. Since Y is an indicator, $E[Y|Z=z] = \mathbb{P}(Y=1|Z=z)$.
$$E[Y|Z=0] = \mathbb{P}(X < b \mid X \leq a) = 1, \quad \text{since } X \leq a < b \text{ already implies } Y = 1,$$
$$E[Y|Z=1] = \mathbb{P}(X < b \mid X > a) = \frac{\mathbb{P}(a < X < b)}{\mathbb{P}(X > a)} = \frac{b-a}{1-a}.$$
So E[Y|Z] equals 1 when Z = 0 and (b − a)/(1 − a) when Z = 1. Because Y and Z are dependent, Z really does affect the conditional expectation of Y.

3.23
Finding the moment generating function for the Poisson, Normal and Gamma distributions. Recall
the definition of the moment generating function:
$$M(t) = E[e^{tX}] = \begin{cases} \displaystyle\sum_x e^{tx} p(x) & \text{discrete with pmf } p(x) \\[2mm] \displaystyle\int_{-\infty}^{\infty} e^{tx} f(x)\, dx & \text{continuous with pdf } f(x) \end{cases}$$
As usual with the Poisson distribution, the following identity is useful:
$$\sum_{n=0}^{\infty} \frac{a^n}{n!} = e^a$$

• Poisson. Assuming some arbitrary λ.
$$M(t) = E[e^{tX}] = \sum_{n=0}^{\infty} e^{tn}\, e^{-\lambda} \frac{\lambda^n}{n!} = e^{-\lambda} \sum_{n=0}^{\infty} \frac{(\lambda e^t)^n}{n!} = e^{-\lambda} e^{\lambda e^t} = e^{\lambda e^t - \lambda} = \exp\left(\lambda(e^t - 1)\right)$$

• Normal. Assuming some arbitrary µ and σ². The usual way of deriving the MGF for a normal distribution is to first do it for the standard normal distribution, since this is a lot easier. Denote this by $M_Z(t)$. We will 'complete the square'.
$$M_Z(t) = E[e^{tZ}] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{tx} e^{-\frac{1}{2}x^2}\, dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left(-\frac{1}{2}x^2 + tx\right) dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left(-\frac{1}{2}x^2 + tx - \frac{t^2}{2} + \frac{t^2}{2}\right) dx$$
$$= e^{t^2/2}\, \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left(-\frac{1}{2}(x^2 - 2tx + t^2)\right) dx = e^{t^2/2}\, \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left(-\frac{1}{2}(x - t)^2\right) dx = e^{t^2/2}$$
The integration is done by substituting u = x − t, which leads to the usual standard pdf, which integrates to 1. Hence, we have the MGF for the standard normal distribution. Using it, we can find the MGF for a general normal distribution, writing X = µ + σZ:
$$M_X(t) = E[e^{tX}] = E[e^{t(\mu + \sigma Z)}] = e^{t\mu}\, E[e^{t\sigma Z}] = e^{t\mu}\, M_Z(t\sigma) = e^{t\mu} e^{(t\sigma)^2/2} = \exp\left(\frac{\sigma^2 t^2}{2} + \mu t\right)$$
So tedious!!

• Gamma. Assuming some arbitrary α and β (and t < 1/β, so the integral converges).
$$M(t) = E[e^{tX}] = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} \int_0^{\infty} e^{tx} x^{\alpha-1} e^{-x/\beta}\, dx = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} \int_0^{\infty} x^{\alpha-1} \exp\left(-\left(\frac{1}{\beta} - t\right)x\right) dx$$
$$= \frac{1}{\Gamma(\alpha)\beta^{\alpha}} \cdot \frac{\Gamma(\alpha)}{\left(\frac{1}{\beta} - t\right)^{\alpha}} = \frac{1}{\left[\beta\left(\frac{1}{\beta} - t\right)\right]^{\alpha}} = \frac{1}{(1 - \beta t)^{\alpha}} = (1 - \beta t)^{-\alpha}$$
where we used the definition of the Gamma function like in an earlier exercise.

3.24
Let $X_1, \ldots, X_n \sim \mathrm{Exp}(\beta)$. Find the MGF of $X_i$. Prove that $\sum_{i=1}^n X_i \sim \mathrm{Gamma}(n, \beta)$.
Finding the MGF for the exponential distribution with parameter β (for t < 1/β):
$$M(t) = E[e^{tX}] = \frac{1}{\beta} \int_0^{\infty} e^{tx} e^{-x/\beta}\, dx = \frac{1}{\beta} \int_0^{\infty} \exp\left(-(1/\beta - t)x\right) dx = \frac{1}{\beta} \cdot \frac{1}{\frac{1}{\beta} - t} = \frac{1}{\beta} \cdot \frac{\beta}{1 - \beta t} = \frac{1}{1 - \beta t}$$
All $X_i$ have the same MGF. If we now assume they are independent, and define $Y = \sum_{i=1}^n X_i$, then according to Lemma 3.31:
$$M_Y(t) = \prod_{i=1}^n M_X(t) = \prod_{i=1}^n \frac{1}{1 - \beta t} = \frac{1}{(1 - \beta t)^n} = (1 - \beta t)^{-n}$$
which is the MGF for a Gamma distribution with parameters n and β, as we can see from the previous exercise. Hence, $Y \sim \mathrm{Gamma}(n, \beta)$.
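The conclusion is easy to check empirically: sums of n IID exponentials should match the Gamma(n, β) moments nβ and nβ². A Monte Carlo sketch in Python (n = 5 and β = 2 are arbitrary illustrative values):

```python
import random
import statistics

# Monte Carlo check: a sum of n IID Exp(beta) draws should have the
# Gamma(n, beta) mean n*beta and variance n*beta^2.
# n = 5 and beta = 2.0 are arbitrary illustrative values.
random.seed(7)
n, beta = 5, 2.0
sums = [sum(random.expovariate(1.0 / beta) for _ in range(n))
        for _ in range(100_000)]

assert abs(statistics.fmean(sums) - n * beta) < 0.1
assert abs(statistics.variance(sums) - n * beta**2) < 0.7
```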

4 Inequalities
Practical Example
Example 1 - Markov’s Inequality.
You hear that the mean age of NYU students is 20 years, but you know quite a few students that
are older than 30. You decide to apply Markov’s inequality to bound the fraction of students above
30 by modeling age as a nonnegative random variable A.

$$\mathbb{P}(A \geq 30) \leq \frac{E[A]}{30} = \frac{20}{30} = \frac{2}{3},$$
so at most two thirds of the students are over 30. This isn't a very precise bound, but then we also only used the expectation.
Example 2 - Chebyshev's Inequality.
The previous bound is a little too weak. After investigating, we discover that the standard deviation of student age is actually just 3 years. Applying Chebyshev's inequality to this information, we obtain
$$\mathbb{P}(|A - E[A]| \geq 10) \leq \frac{\mathrm{Var}(A)}{100} = \frac{9}{100}.$$
So at least 91% of the students are under 30 years old (and above 10).
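The two bound computations above are a one-liner each; spelled out in Python:

```python
# The two bounds from the NYU-age examples above, computed directly.
mean_age, threshold = 20, 30

markov = mean_age / threshold                    # P(A >= 30) <= 2/3
chebyshev = 3**2 / (threshold - mean_age)**2     # P(|A - 20| >= 10) <= 9/100

assert abs(markov - 2 / 3) < 1e-12
assert abs(chebyshev - 9 / 100) < 1e-12
```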

Exercises
4.1
Let X ∼ Exp(β). As we found in exercise 3.12, µ = E[X] = β and σ 2 = Var(X) = β 2 . By applying
Chebyshev’s inequality:

$$\mathbb{P}(|X - \beta| \geq t) = \mathbb{P}(|X - \mu| \geq t) \leq \frac{\sigma^2}{t^2} = \frac{\beta^2}{t^2}$$
We are going to compare Chebyshev's inequality with the following inequality. Since σ > 0 and k > 1, we can apply Markov's inequality to get:
$$\mathbb{P}(|X - \mu| \geq k\sigma) = \mathbb{P}((X - \mu)^2 \geq k^2\sigma^2) \leq \frac{E[(X - \mu)^2]}{k^2\sigma^2} = \frac{\sigma^2}{k^2\sigma^2} = \frac{1}{k^2}$$
We recognize this as the corollary to the Chebyshev inequality. Comparing, and using that σ = β, and using t instead of k:
$$\mathbb{P}(|X - \beta| \geq t) \leq \frac{\beta^2}{t^2}, \qquad \mathbb{P}(|X - \beta| \geq t\beta) \leq \frac{1}{t^2}$$
If β > 1, the first bound becomes 'weaker', which makes sense since βt > t and we are considering a smaller interval. If β < 1 the reverse is true. It is also interesting to note that scaling the bounded threshold up by β changes the bound by a factor of β². This shows there is a nonlinear relationship for the bound when using the Exponential distribution.

4.2
Let X ∼ Poisson(λ). As seen, E[X] = λ and Var(X) = λ. Applying Chebyshev's inequality to show that $\mathbb{P}(X \geq 2\lambda) \leq 1/\lambda$:
$$\mathbb{P}(X \geq 2\lambda) = \mathbb{P}(X - \lambda \geq \lambda) \leq \mathbb{P}(|X - \lambda| \geq \lambda) \leq \frac{\lambda}{\lambda^2} = \frac{1}{\lambda}.$$
Which shows that $\mathbb{P}(X \geq 2\lambda) \leq 1/\lambda$.

4.3
Let $X_1, \ldots, X_n \sim \mathrm{Bernoulli}(p)$ and define
$$\overline{X} = \frac{1}{n}\sum_{i=1}^n X_i.$$
Bound $\mathbb{P}(|\overline{X} - p| > \epsilon)$ with Chebyshev's and Hoeffding's inequality. Show that when n is large, Hoeffding's bound is smaller than Chebyshev's bound. (Assuming ε > 0.)
For the Bernoulli distribution, $E[X_i] = p$ and $\mathrm{Var}(X_i) = p(1-p)$. Since we are using a statistic, $E[\overline{X}] = p$ and $\mathrm{Var}(\overline{X}) = p(1-p)/n$ as per Theorem 3.17. Applying Chebyshev's inequality:
$$\mathbb{P}(|\overline{X} - p| > \epsilon) \leq \frac{p(1-p)/n}{\epsilon^2} = \frac{p(1-p)}{n\epsilon^2} \leq \frac{1/4}{n\epsilon^2} = \frac{1}{4n\epsilon^2}$$
(using that the max value of $p - p^2$ is 1/4 for $p \in (0,1)$). Applying Hoeffding's inequality via Theorem 4.5:
$$\mathbb{P}(|\overline{X} - p| > \epsilon) \leq 2e^{-2n\epsilon^2} = \frac{2}{e^{2n\epsilon^2}}$$
To examine what happens as n grows, we divide the Hoeffding bound by the Chebyshev bound and call it $L_n$:
$$L_n = \frac{2}{e^{2n\epsilon^2}} \Big/ \frac{1}{4n\epsilon^2} = \frac{8n\epsilon^2}{e^{2n\epsilon^2}}$$
If we take the limit as $n \to \infty$, both the numerator and denominator diverge, so we apply L'Hôpital's rule and differentiate the numerator and denominator wrt. n:
$$\lim_{n\to\infty} L_n = \lim_{n\to\infty} \frac{8n\epsilon^2}{e^{2n\epsilon^2}} = \lim_{n\to\infty} \frac{8\epsilon^2}{2\epsilon^2 e^{2n\epsilon^2}} = 0$$
Since we lose the dependence on n in the numerator, the fraction goes to 0 as n grows. So as n becomes large, $L_n$ goes to 0, which means that the Hoeffding bound becomes smaller than the Chebyshev bound.
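Note that the ratio $L_n$ is only eventually below 1; for moderate n the Chebyshev bound can actually be the smaller one. A quick Python sketch (ε = 0.1 is an arbitrary illustrative value):

```python
import math

# Comparing the two bounds on P(|Xbar - p| > eps) as n grows.
# eps = 0.1 is an arbitrary illustrative value.
eps = 0.1
chebyshev = lambda n: 1.0 / (4 * n * eps**2)
hoeffding = lambda n: 2.0 * math.exp(-2 * n * eps**2)

# For moderate n Chebyshev can win, but Hoeffding decays exponentially in n.
assert hoeffding(50) > chebyshev(50)
assert hoeffding(1000) < chebyshev(1000)
```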

4.4
Let X1 , . . . , Xn ∼ Bernoulli(p).
(a) For some α > 0, define:
$$\epsilon = \sqrt{\frac{1}{2n}\log\left(\frac{2}{\alpha}\right)} \implies \epsilon^2 = \frac{1}{2n}\log\left(\frac{2}{\alpha}\right)$$
and set $\hat{p} = n^{-1}\sum_{i=1}^n X_i$. Now we define $C_n = (\hat{p} - \epsilon, \hat{p} + \epsilon)$ and will show that the probability that $C_n$ contains p is at least $1 - \alpha$, using Hoeffding's inequality.
$C_n$ containing p means that:
$$p \in C_n \iff \hat{p} - \epsilon \leq p \leq \hat{p} + \epsilon \iff -\epsilon \leq p - \hat{p} \leq \epsilon \iff |\hat{p} - p| \leq \epsilon$$
where we used that $|\hat{p} - p| = |p - \hat{p}|$ in the final step. So:
$$\mathbb{P}(p \in C_n) = \mathbb{P}(|\hat{p} - p| \leq \epsilon) = 1 - \mathbb{P}(|\hat{p} - p| > \epsilon)$$
(the probability that p is in the interval $C_n$ is equal to 1 minus the probability that it is outside the interval, basically). We can use the bound found in Theorem 4.5 again.
$$\mathbb{P}(|\hat{p} - p| > \epsilon) \leq 2\exp\left(-2n\epsilon^2\right) = 2\exp\left(-2n \cdot \frac{1}{2n}\log\left(\frac{2}{\alpha}\right)\right) = 2\exp\left(-\log\left(\frac{2}{\alpha}\right)\right) = 2\exp\left(\log\left(\frac{\alpha}{2}\right)\right) = 2 \cdot \frac{\alpha}{2} = \alpha$$
Rewriting this inequality in terms of $1 - \alpha$:
$$\mathbb{P}(|\hat{p} - p| > \epsilon) \leq \alpha \implies 1 - \alpha \leq 1 - \mathbb{P}(|\hat{p} - p| > \epsilon)$$
Which shows the desired result:
$$1 - \alpha \leq 1 - \mathbb{P}(|\hat{p} - p| > \epsilon) = \mathbb{P}(p \in C_n).$$

(b) Running some simulations. Doing 1000 simulations for each of n = 10, 50, 100, 2000 and checking how often p lies in the range $\hat{p} \pm \epsilon$. See code and plots on the next page.

# 4.4(b): p = 0.4, alpha = 0.05
simBern <- function(n) {  # Simulate Bernoulli
  NSIM = 1000  # Do 1000 simulations per n
  simulationList = matrix(rep(0, n * NSIM), nrow = NSIM)
  for (i in 1:NSIM) {
    simulationList[i, 1:n] = sample(c(1, 0), size = n,
                                    replace = TRUE, prob = c(0.4, 0.6))
  }
  return(simulationList)
}
countCoverage <- function(sim, alpha) {
  N = ncol(sim); NSIM = nrow(sim)
  covRate = rep(0, NSIM)
  for (i in 1:NSIM) {
    mn = mean(sim[i, 1:N])
    if (abs(0.4 - mn) < 0.05) {
      covRate[i] = 1
    }
  }
  hitRate = sum(covRate) / NSIM; print(hitRate)
}
# Bernoulli simulations
bsim10 = simBern (10) ; bsim50 = simBern (50)
bsim100 = simBern (100) ; bsim2000 = simBern (2000)
> countCoverage ( bsim10 , alpha =0.05)
[1] 0.268
> countCoverage ( bsim50 , alpha =0.05)
[1] 0.552
> countCoverage ( bsim100 , alpha =0.05)
[1] 0.684
> countCoverage ( bsim2000 , alpha =0.05)
[1] 1

Plotting the coverage rate vs. n. As n increases, the interval will contain p more often.

[Figure: "n vs. Coverage" - coverage rate (roughly 0.27 to 1.0) plotted against n = 10, 50, 100, 2000.]
(c) Investigating the relationship between n and the interval length $2|\hat{p} - p|$. When does it become smaller than 0.05? According to the simulation results, that happens roughly when n passes 200, based on our 1000-per-n simulation.
# 4.4(c): Plotting length of interval
IntervalLength <- function(sim) {
  N = ncol(sim); NSIM = nrow(sim)
  intLength = rep(0, NSIM)
  for (i in 1:NSIM) {
    mn = mean(sim[i, 1:N])
    intLength[i] = 2 * abs(0.4 - mn)
  }
  print(mean(intLength))
}
> IntervalLength(bsim10)
[1] 0.2408
> IntervalLength(bsim50)
[1] 0.10572
> IntervalLength(bsim100)
[1] 0.077
> IntervalLength(bsim200)
[1] 0.0547
> IntervalLength(bsim300)
[1] 0.04543333
> IntervalLength(bsim2000)
[1] 0.017746

[Figure: "n vs. Interval Length" - mean interval length (roughly 0.24 down to 0.02) plotted against n = 10, 50, 100, 200, 300, 2000.]

4.5 - Theorem 4.7: Mill's Inequality
Let Z ∼ N(0,1) and t > 0. Then,
$$\mathbb{P}(|Z| > t) \leq \sqrt{\frac{2}{\pi}} \cdot \frac{e^{-t^2/2}}{t}.$$
Proof. From the hint, we can use the symmetry property of the normal distribution, and when t > 0:
$$\mathbb{P}(|Z| > t) = 2\mathbb{P}(Z > t) \implies \frac{1}{2}\mathbb{P}(|Z| > t) = \mathbb{P}(Z > t) \implies \frac{t}{2}\mathbb{P}(|Z| > t) = t\,\mathbb{P}(Z > t)$$
As laid out in the proof of the Markov inequality, we get:
$$\frac{t}{2}\mathbb{P}(|Z| > t) = t\,\mathbb{P}(Z > t) = t\int_t^{\infty} f(x)\, dx \leq \int_t^{\infty} x f(x)\, dx = I$$
Calculating the integral I:
$$I = \int_t^{\infty} x f(x)\, dx = \frac{1}{\sqrt{2\pi}} \int_t^{\infty} x e^{-x^2/2}\, dx = \frac{1}{\sqrt{2\pi}}\left[-e^{-x^2/2}\right]_t^{\infty} = \frac{1}{\sqrt{2\pi}}\left(0 - (-e^{-t^2/2})\right) = \frac{1}{\sqrt{2\pi}} e^{-t^2/2}$$
Collecting our findings:
$$\frac{t}{2}\mathbb{P}(|Z| > t) \leq \frac{1}{\sqrt{2\pi}} e^{-t^2/2} \implies \mathbb{P}(|Z| > t) \leq \sqrt{\frac{2}{\pi}} \cdot \frac{e^{-t^2/2}}{t}$$
which concludes the proof.
An additional note on the fraction manipulation:
$$\frac{2}{\sqrt{2\pi}} = \frac{\sqrt{2}\sqrt{2}}{\sqrt{2}\sqrt{\pi}} = \frac{\sqrt{2}}{\sqrt{\pi}} = \sqrt{\frac{2}{\pi}}$$

4.6
Let Z ∼ N(0,1). We simulate the probability P(|Z| > t) and plot the results. See code and plot on the next page.
We will also include the Markov bound. Since the Markov inequality only applies to nonnegative random variables, we use it on |Z|. Finding E[|Z|] by using that the standard normal distribution is symmetric, so we multiply the expected value over the positive half-line by 2:
$$E[|Z|] = 2\left(\frac{1}{\sqrt{2\pi}} \int_0^{\infty} x e^{-x^2/2}\, dx\right) = \sqrt{\frac{2}{\pi}}\left[-e^{-x^2/2}\right]_0^{\infty} = \sqrt{\frac{2}{\pi}}\left(0 - (-1)\right) = \sqrt{\frac{2}{\pi}}$$
(This can easily be verified numerically.)

# Plotting P(|Z| > t) as a function of t
Z = rnorm(1000000)
t = 1:2000 / 400  # t goes from 0 to 5
P_Zgt = rep(0, length(t))

for (i in 1:length(t)) {
  P_Zgt[i] = sum(abs(Z) > t[i])
}
plot(t, P_Zgt / 1e6, type = "l",
     main = "P(|Z| > t) for t in (0,5)")

Results from the plot, when including bounds for the Markov and Mill inequalities:

[Figure: "P(|Z|>t) for t in (0,5)" - the simulated tail probability together with Mill's bound and Markov's bound.]

The Markov inequality gives a very loose bound, but as we can see the Mill inequality gives a bound that really hugs the P(|Z| > t) curve.

4.7
Let $X_1, \ldots, X_n \sim N(0,1)$. Bound $\mathbb{P}(|\overline{X}| > t)$ using Mill's inequality, where $\overline{X} = n^{-1}\sum_{i=1}^n X_i$, and compare to Chebyshev's bound.
We begin by applying Chebyshev's inequality. The sampling distribution of $\overline{X}$ gives $E[\overline{X}] = 0$ and $\mathrm{Var}(\overline{X}) = \sigma^2/n = 1/n$.
$$\mathbb{P}(|\overline{X}| > t) = \mathbb{P}(|\overline{X} - \mu| > t) \leq \frac{\sigma^2/n}{t^2} = \frac{1}{nt^2}$$
Now, the variable $\overline{X}$ no longer has a standard normal distribution, as we noted above. We have $\overline{X} \sim N(0, 1/n)$, i.e. $\overline{X}$ has the same distribution as $(1/\sqrt{n})Z$. Since we only require t > 0:
$$\mathbb{P}(|\overline{X}| > t) = \mathbb{P}\left(\frac{1}{\sqrt{n}}|Z| > t\right) = \mathbb{P}(|Z| > \sqrt{n}\,t)$$
So, by defining $k = \sqrt{n}\,t > 0$, we get:
$$\mathbb{P}(|\overline{X}| > t) = \mathbb{P}(|Z| > k) \leq \sqrt{\frac{2}{\pi}} \cdot \frac{e^{-k^2/2}}{k} = \sqrt{\frac{2}{\pi}} \cdot \frac{e^{-nt^2/2}}{\sqrt{n}\,t}$$
Mill's bound decays exponentially in n, while Chebyshev's bound only decays like 1/n. By taking the limit of their ratio and applying L'Hôpital's rule, we can easily see that Mill's bound will be much smaller as n grows.
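Independently of the bounds, the exact tail of $\overline{X} \sim N(0, 1/n)$ can be computed with `erfc` and compared to Chebyshev; a Python sketch (t = 0.5 is an arbitrary illustrative value):

```python
import math

# Exact tail of Xbar ~ N(0, 1/n): P(|Xbar| > t) = erfc(sqrt(n)*t / sqrt(2)),
# compared against the Chebyshev bound 1/(n*t^2). t = 0.5 is arbitrary.
t = 0.5
for n in [4, 16, 64]:
    exact = math.erfc(math.sqrt(n) * t / math.sqrt(2))
    chebyshev = 1.0 / (n * t * t)
    assert exact <= chebyshev
```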

5 Convergence of Random Variables
Definition 5.1 Types of convergence
Let $X_1, X_2, \ldots$ be a sequence of random variables and let X be another random variable. Let $F_n$ denote the CDF of $X_n$ and let F denote the CDF of X.
1. $X_n$ converges to X in probability, $X_n \xrightarrow{P} X$, if for all ε > 0,
$$\mathbb{P}(|X_n - X| > \epsilon) \to 0$$
as $n \to \infty$.
2. $X_n$ converges to X in distribution, $X_n \rightsquigarrow X$, if
$$\lim_{n\to\infty} F_n(t) = F(t)$$
at all t for which F is continuous.
Definition 5.2
$X_n$ converges to X in quadratic mean ($L_2$), written $X_n \xrightarrow{qm} X$, if
$$E[(X_n - X)^2] \to 0$$
as $n \to \infty$.
Relationship between the convergence types:
$$\xrightarrow{qm}\ \implies\ \xrightarrow{P}\ \implies\ \rightsquigarrow$$
Exercises
5.1
Let $X_1, \ldots, X_n$ be IID with finite mean µ and variance σ². Let $\overline{X}$ be the sample mean and $S_n^2$ the sample variance:
$$\overline{X} = \frac{1}{n}\sum_{i=1}^n X_i, \qquad S_n^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \overline{X})^2$$
(a) Showing that $E[S_n^2] = \sigma^2$ was done in exercise 3.8.
(b) Showing that $S_n^2 \xrightarrow{P} \sigma^2$. Directly, this means that for any ε > 0,
$$\mathbb{P}(|S_n^2 - \sigma^2| > \epsilon) \to 0 \quad \text{as } n \to \infty.$$
But we can do it in a different way. Following the hint, we rewrite $S_n^2$ so we can apply the law of large numbers. Distributing the sum:
$$S_n^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \overline{X})^2 = \frac{n}{n-1} \cdot \frac{1}{n}\sum_{i=1}^n \left(X_i^2 - 2X_i\overline{X} + \overline{X}^2\right) = \frac{n}{n-1}\left(\frac{1}{n}\sum_{i=1}^n X_i^2 - 2\overline{X}^2 + \overline{X}^2\right) = \frac{n}{n-1}\left(\frac{1}{n}\sum_{i=1}^n X_i^2 - \overline{X}^2\right)$$
By the WLLN, $\frac{1}{n}\sum_{i=1}^n X_i^2 \xrightarrow{P} E[X_i^2]$, and $\overline{X} \xrightarrow{P} \mu$. By Theorem 5.5(f) with the continuous function $g(x) = x^2$, $\overline{X}^2 \xrightarrow{P} \mu^2$. Finally, we use the hint with $c_n = n/(n-1)$, which obviously tends to 1, so we can apply Theorem 5.5(d) and conclude that
$$S_n^2 \xrightarrow{P} E[X_i^2] - \mu^2 = \mathrm{Var}(X_i) = \sigma^2.$$
(The hint says to use 5.5(e), but that is convergence in distribution, which I don't think is correct. Update: this is a misprint, which is noted in the errata2.pdf for the book.)

5.2
Let $X_1, X_2, \ldots$ be a sequence of random variables. Show that $X_n \xrightarrow{qm} b$ if and only if
$$\lim_{n\to\infty} E[X_n] = b \quad \text{and} \quad \lim_{n\to\infty} \mathrm{Var}(X_n) = 0.$$
Proof.
⇒) We assume $X_n \xrightarrow{qm} b$, which means $E[(X_n - b)^2] \to 0$ as $n \to \infty$. We can rewrite the expression:
$$E[(X_n - b)^2] = E[X_n^2 - 2bX_n + b^2] = E[X_n^2] - 2bE[X_n] + b^2 = E[X_n^2] - E[X_n]^2 + E[X_n]^2 - 2bE[X_n] + b^2 = \mathrm{Var}(X_n) + (E[X_n] - b)^2$$
By our assumption $E[(X_n - b)^2] \to 0$ as $n \to \infty$, and so $\mathrm{Var}(X_n) + (E[X_n] - b)^2 \to 0$. Since $\mathrm{Var}(X_n) \geq 0$ and $(E[X_n] - b)^2 \geq 0$, we can conclude that
$$\mathrm{Var}(X_n) \to 0, \qquad (E[X_n] - b)^2 \to 0 \implies E[X_n] \to b$$
as $n \to \infty$.
⇐) We assume
$$\lim_{n\to\infty} E[X_n] = b \quad \text{and} \quad \lim_{n\to\infty} \mathrm{Var}(X_n) = 0.$$
Then $\lim_{n\to\infty}\left(\mathrm{Var}(X_n) + (E[X_n] - b)^2\right) = 0$. By reversing the calculations from the first part:
$$\lim_{n\to\infty}\left(\mathrm{Var}(X_n) + (E[X_n] - b)^2\right) = 0 \implies \lim_{n\to\infty} E[(X_n - b)^2] = 0 \implies X_n \xrightarrow{qm} b.$$
By showing the implication both ways, the result is proved.

5.3
Let $X_1, \ldots, X_n$ be IID with $\mu = E[X_i]$ and finite variance. Show that $\overline{X} \xrightarrow{qm} \mu$.
Proof. Define:
$$\overline{X}_n = \frac{1}{n}\sum_{i=1}^n X_i.$$
By linearity of expectation, $E[\overline{X}_n] = \mu$ for every n, so certainly
$$\lim_{n\to\infty} E[\overline{X}_n] = \mu.$$
The variance is finite, so $\mathrm{Var}(X_i) = \sigma^2 < \infty$. This means that for the sample mean:
$$\lim_{n\to\infty} \mathrm{Var}(\overline{X}_n) = \lim_{n\to\infty} \frac{\sigma^2}{n} = 0.$$
With this, we can simply apply the result from exercise 5.2, which shows that $E[(\overline{X}_n - \mu)^2] \to 0$ as $n \to \infty$, proving $\overline{X}_n \xrightarrow{qm} \mu$.

5.4
Let $X_1, X_2, \ldots$ be a sequence of random variables such that
$$\mathbb{P}\left(X_n = \frac{1}{n}\right) = 1 - \frac{1}{n^2} \quad \text{and} \quad \mathbb{P}(X_n = n) = \frac{1}{n^2}.$$
Just noting the probabilities for each outcome as n grows:

P(Xn = 1/n):  p(1) = 0,  p(1/2) = 3/4,  p(1/3) = 8/9,  p(1/4) = 15/16,  ...
P(Xn = n):    p(1) = 1,  p(2) = 1/4,    p(3) = 1/9,    p(4) = 1/16,     ...

As n becomes large, the probability that we get n instead of 1/n becomes very small. But as n becomes large, the sequence will also yield extreme outliers with non-zero probability.

Starting by calculating the mean, second moment, and variance.
    
$$E[X_n] = \frac{1}{n}\left(1 - \frac{1}{n^2}\right) + n\cdot\frac{1}{n^2} = \frac{1}{n} - \frac{1}{n^3} + \frac{1}{n} = \frac{2}{n} - \frac{1}{n^3}$$
$$E[X_n^2] = \frac{1}{n^2}\left(1 - \frac{1}{n^2}\right) + n^2\cdot\frac{1}{n^2} = \frac{1}{n^2} - \frac{1}{n^4} + 1$$
$$\begin{aligned}
\mathrm{Var}(X_n) &= E[X_n^2] - E[X_n]^2 \\
&= \frac{1}{n^2} - \frac{1}{n^4} + 1 - \left(\frac{2}{n} - \frac{1}{n^3}\right)^2 \\
&= \frac{1}{n^2} - \frac{1}{n^4} + 1 - \left(\frac{4}{n^2} - \frac{4}{n^4} + \frac{1}{n^6}\right) \\
&= 1 - \frac{3}{n^2} + \frac{3}{n^4} - \frac{1}{n^6}
\end{aligned}$$
Results verified by simulation. (See code in 5.4.R). E.g. for n = 200 and 10M simulations:
# Simulated vs . Theoretical
> mean ( Xn )
[1] 0.009879878
> 2 / n - 1 / n ^3
[1] 0.009999875
> var ( Xn )
[1] 0.9759275
> 1 - 3 / n ^2 + 3 / n ^4 - 1 / n ^6
[1] 0.999925

From these expressions we can see that limn→∞ E[Xn ] = 0, but limn→∞ Var(Xn ) = 1. Since the
variance does not become 0, we know by the result in exercise 5.2 that this does NOT converge in
quadratic mean.
Checking whether $X_n$ converges in probability. Note that Chebyshev's inequality only gives an upper bound, so a bound that fails to tend to 0 does not rule out convergence; we argue directly instead. Fix $\epsilon > 0$. For every $n$ large enough that $1/n < \epsilon$, the only outcome with $|X_n - 0| \ge \epsilon$ is $X_n = n$, so
$$P(|X_n - 0| \ge \epsilon) = P(X_n = n) = \frac{1}{n^2} \to 0$$
as $n \to \infty$. Hence $X_n \xrightarrow{P} 0$: the sequence converges in probability even though it does not converge in quadratic mean.
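A quick exact computation for this example (a Python sketch; the repository's code is in R), tabulating the tail probability for a fixed $\epsilon$ with $1/n < \epsilon < n$ together with the second moment:

```python
def tail_and_second_moment(n):
    """Exact P(|X_n| > eps) for fixed eps with 1/n < eps < n, and E[X_n^2].

    P(X_n = 1/n) = 1 - 1/n^2 and P(X_n = n) = 1/n^2, so only the
    outcome X_n = n can exceed such an eps.
    """
    p_tail = 1 / n**2
    second_moment = (1 / n**2) * (1 - 1 / n**2) + n**2 * (1 / n**2)
    return p_tail, second_moment

for n in (10, 100, 1000):
    print(n, tail_and_second_moment(n))
# The tail probability tends to 0 while E[X_n^2] tends to 1.
```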

5.5
Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$. Prove that
$$\frac{1}{n}\sum_{i=1}^{n} X_i^2 \xrightarrow{P} p \quad\text{and}\quad \frac{1}{n}\sum_{i=1}^{n} X_i^2 \xrightarrow{qm} p.$$

Proof. Recalling how to calculate the expectation and second moment for a Bernoulli($p$) variable:
$$E[X] = (1)p + (0)(1-p) = p, \qquad E[X^2] = (1)^2 p + (0)^2(1-p) = p.$$
With these, we can calculate the variance:
$$\mathrm{Var}(X) = E[X^2] - E[X]^2 = p - p^2 = p(1-p).$$
We define $Y_i := X_i^2$ and exploit the simplicity of the Bernoulli distribution. In this case:
$$E[Y] = E[X^2] = p, \qquad E[Y^2] = E[X^4] = (1)^4 p + (0)^4(1-p) = p,$$
and again $\mathrm{Var}(Y) = E[Y^2] - E[Y]^2 = p - p^2 = p(1-p)$.
So we have $\mu = E[Y] = p$ and $\sigma^2 = p(1-p)$. For the sample mean of the $Y_i$:
$$E[\bar{Y}_n] = p, \qquad \mathrm{Var}(\bar{Y}_n) = \frac{\sigma^2}{n} = \frac{p(1-p)}{n}.$$
Since the variance tends to 0 as $n \to \infty$, we can use the result from exercise 5.3 to conclude that $\bar{Y}_n \xrightarrow{qm} p$, and by how we defined $Y_i$, $\frac{1}{n}\sum_{i=1}^{n} X_i^2 \xrightarrow{qm} p$. Since we have convergence in quadratic mean, it follows that we also have convergence in probability.
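A simulation sketch of the quadratic-mean statement (Python; the hypothetical value $p = 0.3$ is chosen just for illustration):

```python
import random
import statistics

def qm_gap(n, p=0.3, trials=1000, seed=2):
    """Estimate E[((1/n) * sum X_i^2 - p)^2] for X_i ~ Bernoulli(p).

    For Bernoulli variables X_i^2 = X_i, so the mean of squares is just
    the sample mean, with E[(Ybar - p)^2] = p(1 - p)/n.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        s = sum(1 for _ in range(n) if rng.random() < p)
        gaps.append((s / n - p) ** 2)
    return statistics.fmean(gaps)

# p(1-p)/n is 0.0042 for n = 50 and 0.000105 for n = 2000.
print(qm_gap(50), qm_gap(2000))
```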

5.6
The height of men has mean 68 inches and standard deviation 2.6 inches. We have n = 100. Finding
the approximate probability that the average height in the sample will be at least 68 inches.
We can approximate this probability with the CLT (central limit theorem). We want to find the probability that the sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is at least as big as the population mean:
$$P(\bar{X} \ge \mu) = 1 - P(\bar{X} \le \mu).$$
By the central limit theorem, using $n = 100$, $\mu = 68$ and $\sigma = 2.6$:
$$P(\bar{X} \le 68) = P(\bar{X} - 68 \le 0) = P\!\left(\frac{10(\bar{X} - 68)}{2.6} \le 0\right) \approx P(Z \le 0) = 0.5$$
(since the standard normal distribution is symmetric and centered at 0). So we get:
$$P(\bar{X} \ge 68) = 1 - P(\bar{X} \le 68) = 1 - 0.5 = 0.5.$$

5.7
Let $\lambda_n = \frac{1}{n}$ for all $n$ and let $X_n \sim \text{Poisson}(\lambda_n)$.
(a) Showing that $X_n \xrightarrow{P} 0$. By properties of the Poisson distribution, we have
$$\mu = E[X_n] = \frac{1}{n}, \qquad \sigma^2 = \mathrm{Var}(X_n) = \frac{1}{n}.$$
By fixing some $\epsilon > 0$ and applying Chebyshev's inequality, we get:
$$P\!\left(\left|X_n - \tfrac{1}{n}\right| \ge \epsilon\right) = P(|X_n - \mu| \ge \epsilon) \le \frac{\sigma^2}{\epsilon^2} = \frac{1}{n\epsilon^2}.$$
By taking the limit on both sides:
$$\lim_{n\to\infty} P\!\left(\left|X_n - \tfrac{1}{n}\right| \ge \epsilon\right) \le \lim_{n\to\infty} \frac{1}{n\epsilon^2} = 0,$$
and since $\mu = 1/n \to 0$, we have shown that $X_n \xrightarrow{P} 0$.
(b) We define $Y_n = nX_n$ and will show that $Y_n \xrightarrow{P} 0$. Finding the mean and variance:
$$E[Y_n] = nE[X_n] = n\cdot\frac{1}{n} = 1, \qquad \mathrm{Var}(Y_n) = n^2\,\mathrm{Var}(X_n) = n^2\cdot\frac{1}{n} = n.$$
The Chebyshev bound therefore blows up and tells us nothing about the behavior of $Y_n$ near 0. Note also that Theorem 5.5(f) (the continuous-mapping theorem) does not apply directly, since the map $g(x) = nx$ changes with $n$. Arguing directly instead: $Y_n = nX_n$ is nonzero exactly when $X_n \ge 1$, so for any fixed $\epsilon > 0$,
$$P(|Y_n - 0| \ge \epsilon) \le P(X_n \ge 1) = 1 - e^{-1/n} \to 0$$
as $n \to \infty$, and hence $Y_n \xrightarrow{P} 0$.

5.8
A program has $n = 100$ pages of code. Let $X_i \sim \text{Poisson}(1)$ be IID, denoting the number of errors on page $i$, and let $Y = \sum_{i=1}^{n} X_i$ denote the total number of errors. Use the CLT to approximate $P(Y < 90)$.
The mean and variance are µ = E[Xi ] = 1 and σ 2 = Var(Xi ) = 1. An important observation: the
CLT applies to sample means, but in this case we are simply summing up 100 independent Poisson
variables, and so Y ∼ Poisson(100), i.e. E[Y ] = 100 and Var(Y ) = 100.
Let us define
$$W = \frac{1}{n} Y = \frac{1}{n}\sum_{i=1}^{n} X_i,$$
then by the CLT, $W \approx N(1, 1/100)$, so $E[W] = 1$ and $\mathrm{Var}(W) = 1/100$.

By going the other way, we can find a normal approximation for $Y$ by using $Y = nW$ with $n = 100$:
$$E[Y] = nE[W] = 100, \qquad \mathrm{Var}(Y) = n^2\,\mathrm{Var}(W) = 100^2\cdot\frac{1}{100} = 100.$$
The standard deviation is $\sqrt{\mathrm{Var}(Y)} = 10$. Approximating the probability:
$$P(Y < 90) = P(Y - 100 < -10) = P\!\left(\frac{Y - 100}{10} < -1\right) \approx P(Z \le -1) \approx 0.1586.$$
Note: this exercise demonstrates that the CLT is just an approximation, and under certain conditions it might not be a very accurate one. In 5.8.R a simulation of these conditions was repeated one million times; numerically, the probability $P(Y < 90)$ turns out to be about 0.1467.
> # Approximating answer numerically
> length ( Y )
[1] 1000000
> sum ( Y < 90) / length ( Y )
[1] 0.146773
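The approximation error can also be seen by computing the exact Poisson CDF directly from the pmf; a Python sketch (in R, `ppois(89, 100)` would give the same exact value):

```python
import math

def poisson_cdf(k, lam):
    """P(Y <= k) for Y ~ Poisson(lam), summing the pmf term by term."""
    term, total = math.exp(-lam), 0.0
    for i in range(k + 1):
        total += term
        term *= lam / (i + 1)
    return total

exact = poisson_cdf(89, 100)                                  # P(Y < 90) = P(Y <= 89)
clt = 0.5 * (1 + math.erf((90 - 100) / (10 * math.sqrt(2))))  # the CLT value P(Z < -1)
print(exact, clt)
```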

5.9
Suppose that $P(X = 1) = P(X = -1) = 1/2$ and define:
$$X_n = \begin{cases} X & \text{with probability } 1 - \frac{1}{n} \\ e^n & \text{with probability } \frac{1}{n} \end{cases}$$
We will determine what kinds of convergence it satisfies.


For starters, we can get some information about $X$. The expectation:
$$E[X] = (-1)\cdot\tfrac{1}{2} + (1)\cdot\tfrac{1}{2} = 0.$$
Then, the second moment and the variance:
$$E[X^2] = (-1)^2\cdot\tfrac{1}{2} + (1)^2\cdot\tfrac{1}{2} = 1, \qquad \mathrm{Var}(X) = E[X^2] - E[X]^2 = 1 - 0 = 1.$$
Next we find the expectation of $X_n$. Since $X_n$ is defined in terms of $X$, which is itself a random variable, we can't calculate it directly; instead we use iterated expectation.

$$\begin{aligned}
E[X_n] &= E[E[X_n|X]] = E[X_n|X=1]P(X=1) + E[X_n|X=-1]P(X=-1) \\
&= \left[(1)\left(1-\tfrac{1}{n}\right) + e^n\cdot\tfrac{1}{n}\right]\tfrac{1}{2} + \left[(-1)\left(1-\tfrac{1}{n}\right) + e^n\cdot\tfrac{1}{n}\right]\tfrac{1}{2} \\
&= \frac{e^n}{n}
\end{aligned}$$
Calculating the second moment:
$$\begin{aligned}
E[X_n^2] &= E[E[X_n^2|X]] = E[X_n^2|X=1]P(X=1) + E[X_n^2|X=-1]P(X=-1) \\
&= \left[(1)^2\left(1-\tfrac{1}{n}\right) + e^{2n}\cdot\tfrac{1}{n}\right]\tfrac{1}{2} + \left[(-1)^2\left(1-\tfrac{1}{n}\right) + e^{2n}\cdot\tfrac{1}{n}\right]\tfrac{1}{2} \\
&= 1 - \frac{1}{n} + \frac{e^{2n}}{n}
\end{aligned}$$
Variance:
$$\mathrm{Var}(X_n) = E[X_n^2] - E[X_n]^2 = 1 - \frac{1}{n} + \frac{e^{2n}}{n} - \frac{e^{2n}}{n^2} = 1 - \frac{1}{n} + \left(\frac{1}{n} - \frac{1}{n^2}\right)e^{2n}$$

Verified by simulation for n = 5. (See 5.9.R).


> # Theoretical vs . Simulated
> mean ( Xn )
[1] 29.72278
> exp ( n ) / n
[1] 29.68263
> var ( Xn )
[1] 3528.798
> (1 - 1 / n ) + (1 / n - 1 / n ^2) * exp (2 * n )
[1] 3525.035

Both the expectation and variance tend to infinity as $n \to \infty$, so $X_n$ does NOT converge in quadratic mean: in particular, $E[(X_n - X)^2] = \frac{1}{n}E[(e^n - X)^2] \to \infty$. It does, however, converge in probability: $X_n$ differs from $X$ only on an event of probability $1/n$, so for any $\epsilon > 0$,
$$P(|X_n - X| \ge \epsilon) \le P(X_n = e^n) = \frac{1}{n} \to 0,$$
hence $X_n \xrightarrow{P} X$. (This is a standard example showing that convergence in probability does not require any control of the moments, so the diverging variance is no obstacle.) Convergence in distribution then follows from convergence in probability.

5.10
Let $Z \sim N(0,1)$ and let $t > 0$. Show that for any $k > 0$,
$$P(|Z| \ge t) \le \frac{E[|Z|^k]}{t^k}.$$
Proof. Following the method used in the proof of Markov's inequality. We also use the hint from 4.5, $P(|Z| \ge t) = 2P(Z \ge t)$, and what was found in 4.6, that the expected value of $|Z|$ is two times the integral from 0 to $\infty$. For any $t > 0$ and $k > 0$:
$$E[|Z|^k] = 2\int_0^\infty z^k f(z)\,dz = 2\int_t^\infty z^k f(z)\,dz + 2\int_0^t z^k f(z)\,dz \ge 2\int_t^\infty z^k f(z)\,dz \ge 2t^k\int_t^\infty f(z)\,dz = 2t^k P(Z \ge t) = t^k P(|Z| \ge t),$$
which implies
$$P(|Z| \ge t) \le \frac{E[|Z|^k]}{t^k}.$$
Comparing to Mill’s inequality and Markov’s inequality by extending the plot used in exercise 6.4.
Plotting the bounds for k = 2, 4, 6, 8. This method is not quite as precise as Mill’s inequality, but it
gets very close after t > 3. The bound is not as good for smaller t, where even the Markov bound
is better. (The dark green line is k = 8, and the Markov bound corresponds to k = 1)

[Figure: $P(|Z| > t)$ for $t \in (0, 5)$, comparing the exact tail probability with Mill's bound, Markov's bound, and the new bounds for $k = 2, 4, 6, 8$; the dark green line is $k = 8$.]

Code for making this plot is in 5.10.R.
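Since $E[|Z|^k] = (k-1)!!$ in closed form for even $k$, the bounds in the plot can also be checked numerically; a Python sketch (the plot itself comes from the R code):

```python
import math

def abs_moment_even(k):
    """E[|Z|^k] = (k - 1)!! for even k when Z ~ N(0, 1)."""
    out = 1
    for j in range(1, k, 2):
        out *= j
    return out

def exact_tail(t):
    """Exact P(|Z| > t) = 2(1 - Phi(t)) = erfc(t / sqrt(2))."""
    return math.erfc(t / math.sqrt(2))

t = 3.0
for k in (2, 4, 6, 8):
    print(k, abs_moment_even(k) / t**k)  # the bound E|Z|^k / t^k
print("exact", exact_tail(t))            # far below every moment bound at t = 3
```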

5.11
Let $X_n \sim N(0, 1/n)$ and let $X$ have the following CDF:
$$F(x) = \begin{cases} 0 & x < 0 \\ 1 & x \ge 0 \end{cases}$$
The variable $X$ has a point mass distribution at 0, and we want to see if $X_n$ converges to it. Note that $\mu = E[X_n] = 0$ and $\sigma^2 = \mathrm{Var}(X_n) = 1/n$ with $\lim_{n\to\infty}(1/n) = 0$, so by 5.2 this actually converges in quadratic mean to 0, which implies convergence in probability and in distribution.
Also showing it directly by using Chebyshev's inequality. For any $\epsilon > 0$:
$$P(|X_n - 0| \ge \epsilon) \le \frac{\sigma^2}{\epsilon^2} = \frac{1}{n\epsilon^2} \to 0$$
as $n \to \infty$. In conclusion, $X_n \xrightarrow{P} X$.
Next is investigating convergence in distribution which must be true, since convergence in prob-
ability implies convergence in distribution. We can also see the trend by plotting the CDF of Xn
for n = 1, 2, 5, 10, . . . , 5000 where n = 5000 is in black. Code for generating plot in 5.11.R.

[Figure: CDFs of $N(0, 1/n)$ for $n = 1, 2, 5, 10, \ldots, 5000$ ($n = 5000$ in black), steepening toward a unit step at 0.]

To show convergence in distribution, we find an expression for the CDF of $X_n$. By using the definition of the CDF and translating to the standard normal distribution:
$$F_n(z) = P(X_n \le z) = P\!\left(\frac{X_n - 0}{1/\sqrt{n}} \le \frac{z - 0}{1/\sqrt{n}}\right) = P(\sqrt{n}\,X_n \le \sqrt{n}\,z) = P(Z \le \sqrt{n}\,z),$$
where $Z \sim N(0,1)$. As $n \to \infty$, $F_n(z) \to 0$ for any $z < 0$ and $F_n(z) \to 1$ for any $z > 0$. (At $z = 0$ we have $F_n(0) = 1/2 \ne F(0)$, but 0 is a discontinuity point of $F$, so it does not count.) Hence $F_n \to F$ at every continuity point of $F$, and $X_n \rightsquigarrow X$ as $n \to \infty$.
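The closed form $F_n(z) = \Phi(\sqrt{n}\,z)$ can be evaluated directly; a Python sketch using `math.erf` (the plot above comes from the R code):

```python
import math

def Fn(z, n):
    """CDF of X_n ~ N(0, 1/n): F_n(z) = Phi(sqrt(n) * z)."""
    return 0.5 * (1 + math.erf(math.sqrt(n) * z / math.sqrt(2)))

for n in (1, 100, 10000):
    print(n, Fn(-0.1, n), Fn(0.0, n), Fn(0.1, n))
# F_n(z) -> 0 for z < 0 and -> 1 for z > 0, while F_n(0) = 1/2 for every n:
# convergence in distribution is only required at continuity points of F.
```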

5.12
Let $X_1, X_2, \ldots$ and $X$ be random variables that take positive integer values. Show that $X_n \rightsquigarrow X$ if and only if, for every integer $k$,
$$\lim_{n\to\infty} P(X_n = k) = P(X = k).$$

Proof. This is a discrete random variable.


⇒) Assume $X_n \rightsquigarrow X$. Convergence in distribution guarantees $F_n \to F$ only at continuity points of $F$, and the integers are exactly its jump points; however, since all the variables are integer-valued, $F_n(k) = F_n(k + \frac{1}{2})$ and $F(k) = F(k + \frac{1}{2})$, and $k + \frac{1}{2}$ is a continuity point. Hence for any integer $k$,
$$\lim_{n\to\infty} F_n(k) = F(k) \quad\text{and}\quad \lim_{n\to\infty} F_n(k-1) = F(k-1).$$
Since we have a discrete distribution,
$$F(k) - F(k-1) = \sum_{p \le k} P(X = p) - \sum_{p \le k-1} P(X = p) = P(X = k).$$
The same relation holds for $F_n$, and applying our assumption of convergence:
$$\lim_{n\to\infty} P(X_n = k) = \lim_{n\to\infty} F_n(k) - \lim_{n\to\infty} F_n(k-1) = F(k) - F(k-1) = P(X = k).$$

⇐) Now we assume that $\lim_{n\to\infty} P(X_n = k) = P(X = k)$ for all $k$, i.e. that the probability mass functions converge pointwise. Then, for any $k$:
$$F(k) = P(X \le k) = \sum_{z=1}^{k} P(X = z) = \sum_{z=1}^{k} \lim_{n\to\infty} P(X_n = z) = \lim_{n\to\infty} \sum_{z=1}^{k} P(X_n = z) = \lim_{n\to\infty} P(X_n \le k) = \lim_{n\to\infty} F_n(k).$$
(We can exchange the order of the sum and the limit since it is a finite sum.) In conclusion, $\lim_{n\to\infty} F_n(k) = F(k)$ for every integer $k$; since $F_n$ and $F$ are constant between integers, this gives convergence at every continuity point, and so $X_n \rightsquigarrow X$.

5.13
Let Z1 , Z2 , . . . be IID random variables with density f . Suppose P(Zi > 0) = 1 (the Zi are
nonnegative) and that λ = limx↓0 f (x) > 0. Define:

Xn = n min{Z1 , . . . , Zn }.

Show that $X_n \rightsquigarrow Z$ where $Z \sim \text{Exp}(\lambda)$.


First, let’s look at the CDF for a min function. Assume that we have some V1 , . . . , Vn that are IID
and Vi ∼ FV , then define
Wn = min(V1 , . . . , Vn ).
What is the CDF of FW of W ? By definition, we have:

FW (y) = P(W 6 y) = P(min(V1 , . . . , Vn ) 6 y)

This means that for at least one i, we have Vi 6 y. Another way of expressing this is to look at the
complement, which is equivalent. P(A) = 1 − P(Ac ), which is the case when ALL Vi > y. Using
this, and using independence:

FW (y) = P(W 6 y) = 1 − P(V1 > y, V2 > y, . . . , Vn > y)


= 1 − P(V1 > y)P(V2 > y) · · · P(Vn > y)
= 1 − [1 − FV (y)][1 − FV (y)] · · · [1 − FV (y)]
= 1 − [1 − FV (y)]n

We will use this result in the following.


Returning to the exercise. Define $Z_m := \min(Z_1, \ldots, Z_n)$. Then, using the result above:
$$F_n(y) = P(X_n \le y) = P(nZ_m \le y) = P\!\left(Z_m \le \frac{y}{n}\right) = 1 - \left[1 - F\!\left(\frac{y}{n}\right)\right]^n.$$
For the CDF, which is an integral, we can make an approximation with a step function (approximating the area under the curve by a rectangle):
$$F\!\left(\frac{y}{n}\right) = \int_0^{y/n} f(z)\,dz \approx f(c_n)\left(\frac{y}{n} - 0\right) = f(c_n)\,\frac{y}{n},$$
where $c_n$ is some point in the interval, e.g. the midpoint $c_n = \frac{y}{2n}$. When letting $n \to \infty$, $f(c_n) \to \lim_{x\downarrow 0} f(x) = \lambda$. So:
$$\lim_{n\to\infty} F_n(y) = \lim_{n\to\infty}\left(1 - \left[1 - F\!\left(\frac{y}{n}\right)\right]^n\right) = \lim_{n\to\infty}\left(1 - \left[1 - \frac{f(c_n)\,y}{n}\right]^n\right) = 1 - e^{-\lambda y}.$$
We recognise this as the exponential CDF with parameter $\lambda$, so if we set $Z \sim \text{Exp}(\lambda)$ with CDF $F_Z$, then we have shown that $F_n \to F_Z$ as $n \to \infty$, so $X_n \rightsquigarrow Z$.
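A simulation sketch of this limit (Python; the repository's code is in R), using $Z_i \sim U(0,1)$ as a hypothetical example, for which $\lambda = \lim_{x\downarrow 0} f(x) = 1$:

```python
import math
import random

def sample_Xn(n, rng):
    """One draw of X_n = n * min(Z_1, ..., Z_n) with Z_i ~ U(0, 1), so lambda = 1."""
    return n * min(rng.random() for _ in range(n))

rng = random.Random(3)
xs = [sample_Xn(200, rng) for _ in range(10000)]
empirical = sum(1 for x in xs if x <= 1.0) / len(xs)
print(empirical, 1 - math.exp(-1.0))  # both near 0.632, the Exp(1) CDF at 1
```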

5.14
Let $X_1, \ldots, X_n \sim U(0,1)$ and let $Y_n = \bar{X}_n^2$. We will find the limiting distribution of $Y_n$.
By the CLT, $\bar{X}_n \approx N(\mu, \sigma^2/n)$, and for the uniform distribution:
$$\mu = E[X] = \frac{1+0}{2} = \frac{1}{2}, \qquad \sigma^2 = \mathrm{Var}(X) = \frac{(1-0)^2}{12} = \frac{1}{12}.$$
And so $\bar{X}_n \approx N(1/2, 1/(12n))$. To find the distribution of $\bar{X}_n^2$, we apply the delta method with $g(x) = x^2$, which gives $N(g(\mu), (g'(\mu))^2\,\sigma^2/n)$. Calculating:
$$g(\mu) = (1/2)^2 = 1/4, \qquad (g'(\mu))^2\,\sigma^2 = \left(2\cdot\tfrac{1}{2}\right)^2\cdot\frac{1}{12} = \frac{1}{12}.$$
Which means that $\bar{X}_n^2 \approx N\!\left(\frac{1}{4}, \frac{1}{12n}\right)$ (note the $1/n$ factor on the variance, as for any CLT limit). Verified with a simulation in 5.14.R.
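A simulation sketch of the delta-method result (Python; the repository's 5.14.R does the equivalent in R):

```python
import random
import statistics

def xbar_squared(n, rng):
    """One draw of Y_n = (sample mean of n U(0,1) variables) squared."""
    return statistics.fmean(rng.random() for _ in range(n)) ** 2

rng = random.Random(4)
n = 500
ys = [xbar_squared(n, rng) for _ in range(4000)]
print(statistics.fmean(ys), n * statistics.variance(ys))
# The delta method says Y_n is approximately N(1/4, 1/(12n)), so these
# should be near 0.25 and 1/12 = 0.0833.
```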

5.15
Let
$$\begin{pmatrix} X_{11} \\ X_{21} \end{pmatrix}, \begin{pmatrix} X_{12} \\ X_{22} \end{pmatrix}, \ldots, \begin{pmatrix} X_{1n} \\ X_{2n} \end{pmatrix}$$
be IID random vectors with mean $\mu = (\mu_1, \mu_2)$ and variance $\Sigma$, with
$$\bar{X}_1 = \frac{1}{n}\sum_{i=1}^{n} X_{1i}, \qquad \bar{X}_2 = \frac{1}{n}\sum_{i=1}^{n} X_{2i}.$$
We will find the limiting distribution of $Y_n = \bar{X}_1/\bar{X}_2$.
Following example 5.16 closely: $Y_n = g(\bar{X}_1, \bar{X}_2)$ where
$$g(s_1, s_2) = \frac{s_1}{s_2}.$$
By the central limit theorem:
$$\sqrt{n}\begin{pmatrix} \bar{X}_1 - \mu_1 \\ \bar{X}_2 - \mu_2 \end{pmatrix} \rightsquigarrow N(0, \Sigma).$$
Finding the gradient of $g(s_1, s_2)$:
$$\nabla g(s_1, s_2) = \begin{pmatrix} \partial g/\partial s_1 \\ \partial g/\partial s_2 \end{pmatrix} = \begin{pmatrix} \dfrac{1}{s_2} \\[2mm] -\dfrac{s_1}{s_2^2} \end{pmatrix}.$$

Calculating the delta-transformed covariance matrix:
$$\begin{aligned}
\nabla g(\mu_1, \mu_2)^T\,\Sigma\,\nabla g(\mu_1, \mu_2)
&= \begin{pmatrix} \dfrac{1}{\mu_2} \\[2mm] -\dfrac{\mu_1}{\mu_2^2} \end{pmatrix}^{\!T}
\begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix}
\begin{pmatrix} \dfrac{1}{\mu_2} \\[2mm] -\dfrac{\mu_1}{\mu_2^2} \end{pmatrix} \\
&= \frac{\sigma_{11}}{\mu_2^2} - \frac{\sigma_{12}\mu_1}{\mu_2^3} - \frac{\sigma_{21}\mu_1}{\mu_2^3} + \frac{\sigma_{22}\mu_1^2}{\mu_2^4} \\
&= \frac{\mu_2^2\sigma_{11} - \mu_1\mu_2\sigma_{12} - \mu_1\mu_2\sigma_{21} + \mu_1^2\sigma_{22}}{\mu_2^4}.
\end{aligned}$$
With this we have found the limiting distribution:
$$\sqrt{n}\left(\frac{\bar{X}_1}{\bar{X}_2} - \frac{\mu_1}{\mu_2}\right) \rightsquigarrow N\!\left(0,\; \frac{\mu_2^2\sigma_{11} - \mu_1\mu_2\sigma_{12} - \mu_1\mu_2\sigma_{21} + \mu_1^2\sigma_{22}}{\mu_2^4}\right).$$

5.16
Finding an example where $X_n \rightsquigarrow X$ and $Y_n \rightsquigarrow Y$, but $X_n + Y_n \not\rightsquigarrow X + Y$.
Set $X \sim U(0,1)$ and $Y \sim U(-1,0)$, with $X$ and $Y$ independent. Then define $X_n \sim U(0,1)$ and $Y_n = -X_n$, and denote the CDFs by $X \sim F$, $Y \sim G$, $X_n \sim F_n$ and $Y_n \sim G_n$.
For $x \in (0,1)$, $F_n(x) = x = F(x)$, and for $x \in (-1,0)$,
$$G_n(x) = P(-X_n \le x) = P(X_n \ge -x) = 1 + x = G(x),$$
so trivially $X_n \rightsquigarrow X$ and $Y_n \rightsquigarrow Y$. However, $X_n + Y_n = X_n - X_n = 0$ for every $n$, so $X_n + Y_n \rightsquigarrow 0$, while $X + Y$ is not a point mass at 0 (its distribution is spread over $(-1,1)$). Hence $X_n + Y_n \not\rightsquigarrow X + Y$. (This can be seen in exercise 2.20 with $X - Y$, which corresponds to this one if we instead use $X + Z$ with $Z = -Y$. Also see 5.16.R.)

6 Models, Statistical Inference and Learning
Definitions
The bias of an estimator is defined as

bias(θbn ) = Eθ [θbn ] − θ.

The standard error, denoted by $\mathrm{se}$, is defined as:
$$\mathrm{se}(\hat\theta_n) = \sqrt{\mathrm{Var}(\hat\theta_n)}.$$

The MSE, mean squared error, is defined as:

MSE = Eθ [(θbn − θ)2 ].

The MSE can according to Theorem 6.9 be written as:

MSE = bias2 (θbn ) + Varθ (θbn ).

Exercises

6.1
Let $X_1, \ldots, X_n \sim \text{Poisson}(\lambda)$ and let $\hat\lambda = n^{-1}\sum_{i=1}^{n} X_i$. We will calculate the bias, se and MSE of the estimator.
For the Poisson distribution, we know that $E[X_i] = \lambda$ and $\mathrm{Var}(X_i) = \lambda$.
Calculating the bias:
$$\mathrm{bias}(\hat\lambda) = E_\theta\!\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] - \lambda = \frac{1}{n}\sum_{i=1}^{n} E_\theta[X_i] - \lambda = \lambda - \lambda = 0,$$
which shows that the mean is an unbiased estimator. Calculating the variance:
$$\mathrm{Var}(\hat\lambda) = \mathrm{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(X_i) = \frac{\lambda}{n}.$$
The standard error is the square root of the variance:
$$\mathrm{se}(\hat\lambda) = \sqrt{\frac{\lambda}{n}}.$$
Finally, the MSE:
$$\mathrm{MSE} = \mathrm{bias}^2(\hat\lambda) + \mathrm{Var}(\hat\lambda) = 0^2 + \frac{\lambda}{n} = \frac{\lambda}{n}.$$

6.2
Let X1 , . . . , Xn ∼ U (0, θ) and let θb = max(X1 , . . . , Xn ). Find bias, se and MSE.
For the U (0, θ) uniform distribution, E[Xi ] = θ/2 and Var(Xi ) = θ2 /12. However, we are using
the maximum value, which we don’t know the expectation of. In order to calculate it, we will need
the PDF, which requires the CDF. For each Xi the CDF is F (x) = x/θ. We assume independence
and will find the CDF of θb which we will denote as Fθ .

Fθ (x) = P(θb 6 x) = P(X1 6 x, X2 6 x, . . . , Xn 6 x)


= P(X1 6 x)P(X2 6 x) · · · P (Xn 6 x)
 x n
=
θ
xn
= n
θ
We find the PDF by differentiating the CDF:
$$f_{\hat\theta}(x) = F_{\hat\theta}'(x) = \frac{n x^{n-1}}{\theta^n}.$$
Now we can calculate the expectation. (A simulated verification is in 6.2.R.)
$$E[\hat\theta] = \int_0^\theta x \cdot \frac{n x^{n-1}}{\theta^n}\,dx = \frac{n}{\theta^n}\int_0^\theta x^n\,dx = \frac{n}{\theta^n}\left[\frac{x^{n+1}}{n+1}\right]_0^\theta = \frac{n}{n+1}\,\theta.$$
We will also need the variance, so we calculate the second moment:
$$E[\hat\theta^2] = \int_0^\theta x^2 \cdot \frac{n x^{n-1}}{\theta^n}\,dx = \frac{n}{\theta^n}\int_0^\theta x^{n+1}\,dx = \frac{n}{\theta^n}\left[\frac{x^{n+2}}{n+2}\right]_0^\theta = \frac{n}{n+2}\,\theta^2.$$
Now we can calculate the variance (skipping the algebra). (A simulated verification is in 6.2.R.)
$$\mathrm{Var}(\hat\theta) = E[\hat\theta^2] - E[\hat\theta]^2 = \frac{n}{n+2}\,\theta^2 - \left(\frac{n}{n+1}\right)^2\theta^2 = \frac{n}{(n+2)(n+1)^2}\,\theta^2.$$

Finding the bias:
$$\mathrm{bias}(\hat\theta) = E[\hat\theta] - \theta = \frac{n}{n+1}\,\theta - \theta = -\frac{1}{n+1}\,\theta.$$
This estimator is not unbiased, because we will often end up with a slightly smaller value. As we have shown, we could use the unbiased estimator $\frac{n+1}{n}\max(X_1, \ldots, X_n)$ instead.
Finding the standard error, which is just the square root of the variance:
$$\mathrm{se}(\hat\theta) = \sqrt{\mathrm{Var}(\hat\theta)} = \frac{\sqrt{n}}{\sqrt{n+2}\,(n+1)}\,\theta.$$
And finally, the MSE:
$$\mathrm{MSE} = \mathrm{bias}^2(\hat\theta) + \mathrm{Var}(\hat\theta) = \frac{1}{(n+1)^2}\,\theta^2 + \frac{n}{(n+2)(n+1)^2}\,\theta^2 = \frac{n+2}{(n+2)(n+1)^2}\,\theta^2 + \frac{n}{(n+2)(n+1)^2}\,\theta^2 = \frac{2n+2}{(n+2)(n+1)^2}\,\theta^2.$$

6.3
Let $X_1, \ldots, X_n \sim U(0,\theta)$ and let $\hat\theta = 2\bar{X}_n$. Find the bias, se and MSE.
Here the estimator is two times the mean, written more explicitly:
$$\hat\theta = \frac{2}{n}\sum_{i=1}^{n} X_i.$$
Now we can use the properties of the uniform distribution:
$$E[X_i] = \frac{\theta}{2} \qquad\text{and}\qquad \mathrm{Var}(X_i) = \frac{\theta^2}{12}.$$
Finding the bias:
$$\mathrm{bias}(\hat\theta) = E\!\left[\frac{2}{n}\sum_{i=1}^{n} X_i\right] - \theta = \frac{2}{n}\sum_{i=1}^{n} E[X_i] - \theta = 2\cdot\frac{\theta}{2} - \theta = 0.$$
In this case, we have an unbiased estimator. Calculating the variance:
$$\mathrm{Var}(\hat\theta) = \mathrm{Var}\!\left(\frac{2}{n}\sum_{i=1}^{n} X_i\right) = \frac{4}{n^2}\sum_{i=1}^{n}\mathrm{Var}(X_i) = \frac{4}{n}\cdot\frac{\theta^2}{12} = \frac{\theta^2}{3n}.$$

(Simulations for expectation and variance confirmed in 6.3.R).

The standard error, se, is just the square root of the variance:
$$\mathrm{se}(\hat\theta) = \sqrt{\mathrm{Var}(\hat\theta)} = \frac{\theta}{\sqrt{3n}}.$$
Calculating the MSE:
$$\mathrm{MSE} = \mathrm{bias}^2(\hat\theta) + \mathrm{Var}(\hat\theta) = 0^2 + \frac{\theta^2}{3n} = \frac{\theta^2}{3n}.$$

7 Estimating the CDF and Statistical Functionals
Definition 7.1. The empirical distribution function $\hat{F}_n$ is the CDF that puts mass $1/n$ at each data point $X_i$:
$$\hat{F}_n(x) = \frac{\sum_{i=1}^{n} I(X_i \le x)}{n},$$
where $I(\cdot)$ is the indicator function.

Exercises
7.1 - Theorem 7.3

$$E[\hat{F}_n(x)] = F(x)$$
Proof.
$$\begin{aligned}
E[\hat{F}_n(x)] &= E\!\left[\frac{\sum_{i=1}^{n} I(X_i \le x)}{n}\right] = \frac{1}{n}\sum_{i=1}^{n} E[I(X_i \le x)] \\
&= \frac{1}{n}\sum_{i=1}^{n}\big(1\cdot P(X_i \le x) + 0\cdot P(X_i > x)\big) \\
&= \frac{1}{n}\cdot n\,P(X_i \le x) = P(X_i \le x) = F(x).
\end{aligned}$$
The second identity:
$$\mathrm{Var}(\hat{F}_n(x)) = \frac{F(x)(1 - F(x))}{n}$$
Proof. Calculating the variance of $I(X_i \le x)$, starting with the second moment:
$$E[I(X_i \le x)^2] = (1)^2\cdot P(X_i \le x) + (0)^2\cdot P(X_i > x) = P(X_i \le x).$$
The variance:
$$\mathrm{Var}(I(X_i \le x)) = P(X_i \le x) - P(X_i \le x)^2 = P(X_i \le x)\big(1 - P(X_i \le x)\big).$$
Hence, by independence:
$$\mathrm{Var}(\hat{F}_n(x)) = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(I(X_i \le x)) = \frac{P(X_i \le x)(1 - P(X_i \le x))}{n} = \frac{F(x)(1 - F(x))}{n}.$$

The MSE:
$$\mathrm{MSE} = \frac{F(x)(1 - F(x))}{n} \to 0.$$
Proof. First we need the bias, but as the first result showed, this is an unbiased estimator:
$$\mathrm{bias}(\hat{F}_n(x)) = E[\hat{F}_n(x)] - F(x) = F(x) - F(x) = 0.$$
The MSE is the sum of the squared bias and the variance, but since the bias is 0, it is equal to the variance:
$$\mathrm{MSE} = \frac{F(x)(1 - F(x))}{n} \to 0 \quad\text{as } n \to \infty.$$
The empirical distribution function converges in probability to $F(x)$:
$$\hat{F}_n(x) \xrightarrow{P} F(x).$$
Proof. Recall that $\mu = E[\hat{F}_n(x)] = F(x)$ and $\sigma^2 = F(x)(1 - F(x))/n$. Using Chebyshev's inequality, for any $\epsilon > 0$:
$$\lim_{n\to\infty} P(|\hat{F}_n(x) - F(x)| \ge \epsilon) \le \lim_{n\to\infty}\frac{\sigma^2}{\epsilon^2} = \lim_{n\to\infty}\frac{F(x)(1 - F(x))}{n\epsilon^2} = 0.$$
This shows that $\hat{F}_n(x) \xrightarrow{P} F(x)$.
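A simulation sketch of Theorem 7.3 (Python; the repository's code is in R), using $U(0,1)$ data as a hypothetical example so that $F(x) = x$:

```python
import random
import statistics

def Fhat(x, data):
    """Empirical distribution function evaluated at x."""
    return sum(1 for d in data if d <= x) / len(data)

rng = random.Random(5)
n, x = 50, 0.3                       # U(0,1) data, so F(x) = 0.3
vals = [Fhat(x, [rng.random() for _ in range(n)]) for _ in range(20000)]
print(statistics.fmean(vals), statistics.variance(vals))
# Theorem 7.3: mean F(x) = 0.3 and variance F(x)(1 - F(x))/n = 0.0042.
```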

7.2
For X1 , . . . , Xn ∼ Bernoulli(p) and Y1 , . . . , Ym ∼ Bernoulli(q), we will find the plug-in estimator
and estimated standard error for p, an approximate 90 percent confidence interval for p, the plug-in
estimator and standard error for p − q and approximate 90 percent confidence interval for p − q.
(Oh, is that all?)
The plug-in estimator is really just a term for using $\hat{F}_n$ instead of $F$ for calculating functionals, such as the mean, median etc. For estimating $p$ in a Bernoulli distribution, it is simply the sample mean:
$$\hat{p} = \frac{1}{n}\sum_{i=1}^{n} X_i.$$
Finding the standard error by plugging $\hat{p}$ into the variance formula:
$$\widehat{\mathrm{se}}(\hat{p}) = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}.$$
By consulting a standard normal table we can see that 1.65 is approximately the value below which 0.95 of the mass lies, which corresponds to a two-sided 90% interval:
$$\hat{p} \pm z_{0.10/2}\,\widehat{\mathrm{se}}(\hat{p}) = \hat{p} \pm 1.65\cdot\widehat{\mathrm{se}}(\hat{p}).$$

Finding an estimate for $p - q$ with a plug-in estimator is simply subtracting the sample means:
$$\hat{p} - \hat{q} = \frac{1}{n}\sum_{i=1}^{n} X_i - \frac{1}{m}\sum_{j=1}^{m} Y_j.$$
The standard error for $\hat{q}$ is:
$$\widehat{\mathrm{se}}(\hat{q}) = \sqrt{\frac{\hat{q}(1 - \hat{q})}{m}},$$
and the combined standard error is:
$$\widehat{\mathrm{se}} = \sqrt{\widehat{\mathrm{se}}(\hat{p})^2 + \widehat{\mathrm{se}}(\hat{q})^2}.$$
In the same way, we can calculate a 90% confidence interval by using the same value from the standard normal table:
$$\hat{p} - \hat{q} \pm 1.65\cdot\widehat{\mathrm{se}}.$$

7.3
Program code found in 7.3.R. There is no technical appendix, so the confidence intervals are meant to be made with the Dvoretzky-Kiefer-Wolfowitz inequality.
For a 95% confidence interval, we define
$$\epsilon_n = \sqrt{\frac{1}{2n}\log\!\left(\frac{2}{\alpha}\right)}$$
where $\alpha = 0.05$, and specify the bounds as $(\hat{F}_n(x) - \epsilon_n,\; \hat{F}_n(x) + \epsilon_n)$. We simulate 100 $X \sim N(0,1)$ variables a total of 1000 times, calculate the confidence band for each simulation, and count how many of the 1000 simulated bands do not contain the true CDF. Illustration of what the simulations look like (black is the true CDF, the colored curves are simulations):

[Figure: "True CDF vs. simulations" — the standard normal CDF in black with simulated empirical CDFs overlaid, for x between −2 and 2.]
On running the simulation, approximately 95% of the simulation bounds contained the true CDF.
When redoing for the Cauchy distribution, the results are approximately 95% as well.
[1] " Proportion of normal distributions containing the CDF "
> sum ( checkNorm ) / length ( checkNorm )
[1] 0.952
[1] " Proportion of Cauchy distributions containing the CDF "
> sum ( checkCauchy ) / length ( checkCauchy )
[1] 0.949

7.4
Let $X_1, \ldots, X_n \sim F$ and let $\hat{F}_n(x)$ be the empirical distribution function. For a fixed $x$, we will use the CLT to find the limiting distribution of $\hat{F}_n(x)$.
We make the definition
$$Y_i = I(X_i \le x), \qquad \bar{Y}_n = \frac{1}{n}\sum_{i=1}^{n} Y_i,$$
to highlight the fact that $\hat{F}_n(x) = \bar{Y}_n$ is a sample mean. By the CLT:
$$\frac{\sqrt{n}(\bar{Y}_n - \mu)}{\sigma} \rightsquigarrow N(0, 1),$$
where $\mu = E[Y_i]$ and $\sigma^2 = \mathrm{Var}(Y_i)$, or the equivalent approximate statement $\bar{Y}_n \approx N(\mu, \sigma^2/n)$. Since $x$ is fixed, we can apply Theorem 7.3:
$$E[\bar{Y}_n] = E[\hat{F}_n(x)] = F(x), \qquad \mathrm{Var}(\bar{Y}_n) = \mathrm{Var}(\hat{F}_n(x)) = \frac{F(x)[1 - F(x)]}{n}.$$
In summary, substituting $\hat{F}_n(x)$ for $\bar{Y}_n$, we have shown:
$$\hat{F}_n(x) \approx N\!\left(F(x), \frac{F(x)[1 - F(x)]}{n}\right).$$
We can also apply Chebyshev's inequality in this case. For any $\epsilon > 0$,
$$\lim_{n\to\infty} P(|\hat{F}_n(x) - F(x)| \ge \epsilon) \le \lim_{n\to\infty}\frac{F(x)[1 - F(x)]}{n\epsilon^2} = 0,$$
which shows that $\hat{F}_n(x) \xrightarrow{P} F(x)$, which again implies convergence in distribution.

7.5 
Let $x$ and $y$ be two distinct points. Find $\mathrm{Cov}\big(\hat{F}_n(x), \hat{F}_n(y)\big)$.
First of all, introduce the short-hand notation (keeping the observation index visible):
$$I_x^{(i)} := I(X_i \le x), \qquad I_y^{(j)} := I(X_j \le y).$$
Assuming independence of the $X_i$, $\mathrm{Cov}(I_x^{(i)}, I_y^{(j)}) = 0$ whenever $i \ne j$. By definition:
$$\mathrm{Cov}\big(\hat{F}_n(x), \hat{F}_n(y)\big) = \mathrm{Cov}\!\left(\frac{1}{n}\sum_{i=1}^{n} I_x^{(i)},\; \frac{1}{n}\sum_{j=1}^{n} I_y^{(j)}\right) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{Cov}(I_x^{(i)}, I_y^{(j)}) = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Cov}(I_x^{(i)}, I_y^{(i)}).$$
The last step follows from the identity proved in exercise 3.14 together with independence, which makes every term with different indices 0, leaving just a single sum. Now an intermediate calculation, assuming $x < y$ (so that $X_i \le x$ implies $X_i \le y$, and hence $I_x^{(i)} I_y^{(i)} = I_x^{(i)}$):
$$\mathrm{Cov}(I_x^{(i)}, I_y^{(i)}) = E[I_x^{(i)} I_y^{(i)}] - E[I_x^{(i)}]E[I_y^{(i)}] = E[I_x^{(i)}]\big(1 - E[I_y^{(i)}]\big) = P(X_i \le x)\big(1 - P(X_i \le y)\big) = F(x)\big(1 - F(y)\big).$$
Returning to the main calculation:
$$\mathrm{Cov}\big(\hat{F}_n(x), \hat{F}_n(y)\big) = \frac{1}{n^2}\sum_{i=1}^{n} F(x)\big(1 - F(y)\big) = \frac{F(x)\big(1 - F(y)\big)}{n}.$$
This is provided that $x < y$; since the covariance is symmetric, the smaller of the two points can always play the role of $x$. Since $F$ is unknown, the plug-in estimator is:
$$\frac{\hat{F}_n(x)\big(1 - \hat{F}_n(y)\big)}{n}.$$
Note that this is very similar to the expression for the variance, $\mathrm{Var}(\hat{F}_n(x)) = F(x)(1 - F(x))/n$.
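A simulation sketch of the covariance formula (Python; the repository's code is in R), with $U(0,1)$ data so that $F(x) = x$, and the hypothetical points $x = 0.2$, $y = 0.7$:

```python
import random
import statistics

def Fhat(t, data):
    """Empirical distribution function evaluated at t."""
    return sum(1 for d in data if d <= t) / len(data)

rng = random.Random(6)
n, x, y = 40, 0.2, 0.7
pairs = [(Fhat(x, d), Fhat(y, d))
         for d in ([rng.random() for _ in range(n)] for _ in range(20000))]
mx = statistics.fmean(a for a, _ in pairs)
my = statistics.fmean(b for _, b in pairs)
cov = statistics.fmean((a - mx) * (b - my) for a, b in pairs)
print(cov, x * (1 - y) / n)  # F(x)(1 - F(y))/n = 0.0015
```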

7.6
Let $X_1, \ldots, X_n \sim F$ and let $\hat{F}_n$ be the empirical distribution function. For $a < b$, define $\theta = T(F) = F(b) - F(a)$ and set $\hat\theta = T(\hat{F}_n) = \hat{F}_n(b) - \hat{F}_n(a)$. Find the estimated standard error of $\hat\theta$ and an expression for the $1 - \alpha$ confidence interval for $\theta$.
Following example 7.15, the standard error of a difference comes from the variances:
$$\begin{aligned}
\mathrm{se}(\hat\theta) &= \sqrt{\mathrm{Var}\big(\hat{F}_n(b) - \hat{F}_n(a)\big)} \\
&= \Big(\mathrm{Var}(\hat{F}_n(b)) + \mathrm{Var}(\hat{F}_n(a)) - 2\,\mathrm{Cov}(\hat{F}_n(a), \hat{F}_n(b))\Big)^{1/2} &&\text{(prop. of variance)} \\
&= \frac{1}{\sqrt{n}}\Big(F(b)[1 - F(b)] + F(a)[1 - F(a)] - 2F(a)[1 - F(b)]\Big)^{1/2} &&\text{(Thm 7.3, Ex. 7.5)} \\
&= \sqrt{\frac{\big(F(b) - F(a)\big)\big[1 - \big(F(b) - F(a)\big)\big]}{n}} &&\text{(see notes)} \\
&\approx \sqrt{\frac{\big(\hat{F}_n(b) - \hat{F}_n(a)\big)\big[1 - \big(\hat{F}_n(b) - \hat{F}_n(a)\big)\big]}{n}} &&\text{(plug-in est.)} \\
&= \sqrt{\frac{\hat\theta(1 - \hat\theta)}{n}} &&\text{(def. of } \hat\theta\text{)}
\end{aligned}$$
Finding the confidence interval is easy: just find the corresponding $z_{\alpha/2}$ level and calculate
$$\hat\theta \pm z_{\alpha/2}\,\mathrm{se}(\hat\theta).$$
For the sake of clarity, here are the notes for the algebra step used above. Instead of $F(b)$ and $F(a)$ we just write $b$ and $a$:
$$\begin{aligned}
b(1 - b) + a(1 - a) - 2a(1 - b) &= b - b^2 + a - a^2 - 2a + 2ab \\
&= b - a - b^2 + 2ab - a^2 \\
&= b - a - (b - a)^2 \\
&= (b - a)\big(1 - (b - a)\big).
\end{aligned}$$

7.7
Code found in 7.7.R. Plot of empirical distribution and bounds:

[Figure: estimated CDF of the magnitude data with upper and lower confidence bounds, for x (mag) between 4.0 and 6.5.]

(The estimated CDF jumps because there are only 1000 points and the data itself contains repeated values.)
Finding the confidence interval for $F(4.9) - F(4.3)$. (Note: these results will differ depending on whether the empirical distribution is defined with $<$ or $\le$. These are the bounds for $\le$, as the definition says.)
’ 95% Confidence Interval for F (4.9) - F (4.3) : ’
( 0.495 , 0.557 )

7.8
Results for the ’Old Faithful’ dataset. Code found in 7.8.R.
Output from code:
> # Mean
> sprintf ( " Sample mean : %2.3 f " , Xbar )
[1] " Sample mean : 70.897 "

> # 90% confidence interval


> sprintf ( " 90 %% Confidence interval : (%.3 f , %.3 f ) " , CIlower , CIupper )
[1] " 90% Confidence interval : (48.576 , 93.218) "

> # Median
> sprintf ( " Median at : %2. f , with probability %2.2 f " , xvalMedian , Fhat ( xvalMedian , of
$ waiting ) )
[1] " Median at : 76 , with probability 0.53 "

7.9
Comparing two trials of antibiotic. 100 people are given standard and 90 recover, 100 are given
new and 85 recover.
We will assume that these are independent. We will also assume that the standard trial is
X1 , . . . , X100 ∼ Bernoulli(p1 ) and the new trial is Y1 , . . . , Y100 ∼ Bernoulli(p2 ). This is the same
conditions as in exercise 7.2.
If we define Xi = 1 and Yi = 1 for recoveries and Xi = 0 and Yi = 0 for no change, we can
estimate the probabilities with the sample mean:
$$\hat{p}_1 = \frac{1}{100}\sum_{i=1}^{100} X_i = \frac{90}{100} = 0.90, \qquad \hat{p}_2 = \frac{1}{100}\sum_{i=1}^{100} Y_i = \frac{85}{100} = 0.85.$$
If we define $\hat\theta = \hat{p}_1 - \hat{p}_2$, it becomes:
$$\hat\theta = \hat{p}_1 - \hat{p}_2 = 0.90 - 0.85 = 0.05.$$

Finding the standard errors for each $\hat{p}_i$:
$$\widehat{\mathrm{se}}(\hat{p}_1) = \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n}} = \sqrt{\frac{0.9\cdot 0.1}{100}} = 0.03 \implies \widehat{\mathrm{se}}(\hat{p}_1)^2 = 0.0009,$$
$$\widehat{\mathrm{se}}(\hat{p}_2) = \sqrt{\frac{\hat{p}_2(1 - \hat{p}_2)}{n}} = \sqrt{\frac{0.85\cdot 0.15}{100}} \approx 0.0357 \implies \widehat{\mathrm{se}}(\hat{p}_2)^2 = 0.001275.$$
Since the trials are independent, the covariance is 0. The standard error for $\hat\theta$:
$$\widehat{\mathrm{se}}(\hat\theta) = \sqrt{\mathrm{Var}(\hat{p}_1 - \hat{p}_2)} = \sqrt{\widehat{\mathrm{se}}(\hat{p}_1)^2 + \widehat{\mathrm{se}}(\hat{p}_2)^2} = \sqrt{0.002175} \approx 0.046637.$$

For constructing the confidence intervals, we find the corresponding $z_{0.90}$ and $z_{0.975}$ values and calculate. (Note that $\theta = p_1 - p_2$ is a difference of probabilities, so it can legitimately be negative; there is no reason to truncate the interval at 0.) 80% confidence interval:
$$\hat\theta \pm z_{0.90}\,\widehat{\mathrm{se}}(\hat\theta) = 0.05 \pm 1.28\cdot 0.046637 = (-0.0097,\; 0.1097).$$
95% confidence interval:
$$\hat\theta \pm z_{0.975}\,\widehat{\mathrm{se}}(\hat\theta) = 0.05 \pm 1.96\cdot 0.046637 = (-0.0414,\; 0.1414).$$
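The arithmetic above can be reproduced directly; a short Python check (the repository's code is in R):

```python
import math

p1, p2, n = 0.90, 0.85, 100
se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)  # combined standard error
theta = p1 - p2
ci80 = (theta - 1.28 * se, theta + 1.28 * se)
ci95 = (theta - 1.96 * se, theta + 1.96 * se)
print(round(se, 6), ci80, ci95)
```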

7.10
Code found in 7.10.R.
> theta # Estimate of theta
[1] 277.3962
> seTheta # Standard error of theta
[1] 388.522
> sprintf ( " CI : (%3.2 f , %3.2 f ) " , CIlwr , CIupr )
[1] " CI : (0.00 , 601.90) "

8 The Bootstrap
Exercises

