Let me know if you find any mistakes! I am sure there are plenty...
Contents
1 Probability
2 Random Variables
3 Expectation
4 Inequalities
1 Probability
Exercises
1.1
Proving the continuity of probabilities: for a monotone increasing sequence of events A1 ⊂ A2 ⊂ …, with A = ∪_{i=1}^∞ Ai, we show that
P(An) → P(A) as n → ∞.
(1) Define the disjoint sets:
B1 = A1
B2 = {ω ∈ Ω : ω ∈ A2, ω ∉ A1}
B3 = {ω ∈ Ω : ω ∈ A3, ω ∉ A2, ω ∉ A1}
⋮
Assume m > p. Then:
Bm ∩ Bp = (Am − ∪_{i=1}^{m−1} Ai) ∩ (Ap − ∪_{i=1}^{p−1} Ai)
= (Am ∩ (∪_{i=1}^{m−1} Ai)^c) ∩ (Ap ∩ (∪_{i=1}^{p−1} Ai)^c)
By De Morgan's law:
= Am ∩ A1^c ∩ A2^c ∩ … ∩ Ap^c ∩ … ∩ A_{m−1}^c ∩ Ap ∩ A1^c ∩ A2^c ∩ … ∩ A_{p−1}^c
Reshuffling terms and repeated use of C ∩ C = C:
= Am ∩ A1^c ∩ A2^c ∩ … ∩ (Ap^c ∩ Ap) ∩ … ∩ A_{m−1}^c
= ∅
since Ap^c ∩ Ap = ∅.
Since Bm ∩ Bp = ∅, they are disjoint. (Quite certain this is a correct argument, but it could have been made a bit easier by going directly to the A3 ∩ A2^c ∩ A1^c form of the sets.)
(2) Showing that
An = ∪_{i=1}^n Ai = ∪_{i=1}^n Bi
for each n. Showing this with an induction argument. Note that A1 = B1 , and A2 ∪ A1 = A2 since
A1 ⊂ A2 . Note also that:
B2 ∪ B1 = (A2 ∩ Ac1 ) ∪ A1
= (A2 ∪ A1 ) ∩ (Ac1 ∪ A1 )
= (A2 ∪ A1 ) ∩ (Ω)
= A2 ∪ A1
= A2
So this is true for n = 2. Assume that Ak = ∪_{i=1}^k Ai = ∪_{i=1}^k Bi for k ∈ N. Showing that this applies to k + 1.
∪_{i=1}^{k+1} Ai = A_{k+1} ∪ (∪_{i=1}^k Ai) = A_{k+1} ∪ Ak = A_{k+1}
∪_{i=1}^{k+1} Bi = B_{k+1} ∪ (∪_{i=1}^k Bi)
= B_{k+1} ∪ Ak
= [A_{k+1} ∩ Ak^c ∩ … ∩ A1^c] ∪ Ak
= [A_{k+1} ∩ (Ak ∪ … ∪ A1)^c] ∪ Ak
= [A_{k+1} ∩ Ak^c] ∪ Ak
= (A_{k+1} ∪ Ak) ∩ (Ak^c ∪ Ak)
= (A_{k+1} ∪ Ak) ∩ Ω
= A_{k+1} ∪ Ak
= A_{k+1}
(using that Ak ∪ … ∪ A1 = Ak, since A1 ⊂ … ⊂ Ak)
Result verified by induction argument.
(3) Showing that
∪_{i=1}^∞ Bi = ∪_{i=1}^∞ Ai. (♦)
By definition:
A = lim_{n→∞} An = ∪_{i=1}^∞ Ai.
By step (2):
A = lim_{n→∞} An = lim_{n→∞} ∪_{i=1}^n Ai = lim_{n→∞} ∪_{i=1}^n Bi = ∪_{i=1}^∞ Bi.
Hence, (♦) is satisfied. Finally, since the Bi are disjoint, countable additivity gives
P(An) = P(∪_{i=1}^n Bi) = Σ_{i=1}^n P(Bi) → Σ_{i=1}^∞ P(Bi) = P(∪_{i=1}^∞ Bi) = P(A),
which completes the proof.
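The convergence P(An) → P(A) can be illustrated numerically; a small Python sketch, where the measure P({k}) = 2^(−k) on the natural numbers and the sets An = {1, …, n} are assumptions of the example, not part of the exercise:

```python
def prob(n):
    """P(A_n) under the illustrative measure P({k}) = 2**-k, with A_n = {1, ..., n}."""
    return sum(2.0 ** -k for k in range(1, n + 1))

probs = [prob(n) for n in (1, 2, 5, 10, 50)]
# P(A_n) increases monotonically toward P(A) = 1
assert all(a < b for a, b in zip(probs, probs[1:]))
assert abs(prob(50) - 1.0) < 1e-12
```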
1.2
Proving some well known results by using the axioms:
P(A) ≥ 0 for every A ⊂ Ω (A.1)
P(Ω) = 1 (A.2)
P(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai) for disjoint A1, A2, … (A.3)
Now instead of using (A.3) directly, we use that the union of the infinite set of Ei becomes Ω:
P(∪_{i=1}^∞ Ei) = P(Ω) = 1 (by A.2)
With this result, we can apply (A.3) indirectly to any finite union of disjoint sets: all remaining sets are set to ∅ and then (A.3) is applied. Not giving a formal argument, but if A, B, C are mutually disjoint we can set E1 = A, E2 = B, E3 = C and Ek = ∅ for all k ≥ 4.
P(A ∪ B ∪ C) = P(∪_{i=1}^∞ Ei) = Σ_{i=1}^∞ P(Ei) = P(A) + P(B) + P(C) + 0 = P(A) + P(B) + P(C) (by A.3)
• Claim: 0 ≤ P(A) ≤ 1.
Proof. Since A ⊂ Ω and by the previous result: P(A) ≤ P(Ω) = 1. Combining this with axiom (A.1):
0 ≤ P(A) ≤ P(Ω) = 1 =⇒ 0 ≤ P(A) ≤ 1
• Claim: P(Ac ) = 1 − P(A).
Proof. Since A and Ac are disjoint and A ∪ Ac = Ω, we get by the finite version of (A.3):
P(A) + P(Ac) = P(A ∪ Ac) = P(Ω) = 1 =⇒ P(Ac) = 1 − P(A)
1.3
Ω is a sample space and A1 , A2 , . . . are events. We define:
Bn = ∪_{i=n}^∞ Ai,   Cn = ∩_{i=n}^∞ Ai
(a) Claim: B1 ⊃ B2 ⊃ . . .
Proof. This follows directly from the definition of Bn . First we see that:
B1 = ∪_{i=1}^∞ Ai = A1 ∪ (∪_{i=2}^∞ Ai) ⊃ ∪_{i=2}^∞ Ai = B2 =⇒ B1 ⊃ B2
The same argument applied to Bn = An ∪ B_{n+1} gives Bn ⊃ B_{n+1} for every n.
(b) Claim:
ω ∈ ∩_{n=1}^∞ Bn ⟺ ω ∈ A1, A2, …
Proof.
⇒) Assume ω ∈ ∩_{n=1}^∞ Bn, which means ω ∈ Bk for all k ∈ N. Expanding the intersection:
ω ∈ B1 ∩ B2 ∩ . . . ∩ Bk ∩ Bk+1 ∩ . . .
so Bk ∩ Bk+1 = (Ak ∪ Bk+1 ) ∩ Bk+1 = (Ak ∩ Bk+1 ) ∪ Bk+1 . So, each Bi can be decomposed into
(Ai ∩ Bi+1 ) ∪ Bi+1 and the Bi+1 is rewritten as its own intersection. Ultimately, we get:
ω ∈ B1 ∩ B2 ∩ B3 ∩ . . . ∩ Bk ∩ Bk+1 ∩ . . . =⇒
ω ∈ (A1 ∩ B2 ) ∪ B2 ∩ B3 ∩ . . . ∩ Bk ∩ Bk+1 ∩ . . . =⇒
ω ∈ (A1 ∩ B2 ) ∪ (A2 ∩ B3 ) ∪ B3 ∩ . . . ∩ Bk ∩ Bk+1 ∩ . . . =⇒
ω ∈ (A1 ∩ B2 ) ∪ (A2 ∩ B3 ) ∪ . . . ∪ (Ak ∩ Bk+1 ) ∪ (Ak+1 ∩ Bk+2 ) ∪ . . . =⇒
ω ∈ (Ak ∩ B_{k+1}) for all k ∈ N (by the assumption that ω ∈ Bk for all k ∈ N) =⇒
ω ∈ Ak for all k ∈ N
(c) Skipped.
1.4
Proving De Morgan's laws. First in the case when i ∈ {1, 2, …, n}.
Claim:
(∪_{i=1}^n Ai)^c = ∩_{i=1}^n Ai^c
Claim:
(∩_{i=1}^n Ai)^c = ∪_{i=1}^n Ai^c
Proving these results for an arbitrary index set is essentially the same, except for finding some way of enumerating the index set with i1, i2, … (since it is at most countably infinite), which I won't bother doing.
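The finite De Morgan laws can also be checked mechanically on small examples; a Python sketch, where Ω = range(20) and the three member sets are arbitrary choices of mine:

```python
# De Morgan's laws on explicit finite sets; omega plays the role of the sample space.
omega = set(range(20))
A = [set(range(0, 10)), set(range(5, 15)), {1, 2, 17}]

union = set().union(*A)
inter = omega.intersection(*A)

# (union of A_i)^c == intersection of the A_i^c
assert omega - union == set.intersection(*[omega - a for a in A])
# (intersection of A_i)^c == union of the A_i^c
assert omega - inter == set().union(*[omega - a for a in A])
```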
1.5
Tossing a fair coin until we get two heads in a row. Description of the sample space:
Ω = {(ω1, ω2, …, ω_{k−1}, ωk) : ωi ∈ {H, T}, ω_{k−1} = ωk = H, and ω_{p−1} = H ⇒ ωp = T for p < k}
Now, to calculate the probability that there will be exactly k tosses. Let us look at a few specific
examples:
k = 2: {H, H} =⇒ P(2) = 1/2^2 = 1/4 = 0.25
k = 3: {T, H, H} =⇒ P(3) = 1/2^3 = 1/8 = 0.125
For k > 3, the last three tosses have to be (T, H, H).
k = 4: the first toss can be H or T (2 sequences), so P(4) = 2 · 1/2^4 = 1/8 = 0.125
k = 5: the first two tosses can be TT, TH or HT (3 sequences), so P(5) = 3 · 1/2^5 = 3/32 = 0.09375
k = 6: the first three tosses can be TTT, TTH, THT, HTT or HTH (5 sequences), so P(6) = 5 · 1/2^6 = 5/64 = 0.078125
k = 7: the first four tosses can be TTTT, TTTH, TTHT, THTT, HTTT, THTH, HTHT or HTTH (8 sequences), so P(7) = 8 · 1/2^7 = 1/2^4 = 1/16 = 0.0625
We can see a pattern emerging. The last three tosses are fixed, and for the first k − 3 tosses, we have to count all the combinations that do not contain two consecutive H. The details are quite interesting, and it turns out that the number of valid combinations for the first k − 3 tosses follows the Fibonacci sequence (as we can see above). Defining:
F(1) = 2, F(2) = 3, F(3) = 5, …
In general, we will then get:
P(K = k) = 1/2^k for k = 2, 3
P(K = k) = F(k − 3)/2^k for k > 3
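The Fibonacci pattern can be verified by brute-force enumeration of all length-k toss sequences; a Python sketch (the function name is mine):

```python
from itertools import product

def first_hh_at(k):
    """Count length-k H/T sequences whose first HH run is completed on toss k."""
    count = 0
    for seq in product("HT", repeat=k):
        s = "".join(seq)
        # ends in HH, and no HH occurs among the first k-1 tosses
        if s.endswith("HH") and "HH" not in s[:-1]:
            count += 1
    return count

assert (first_hh_at(2), 2 ** 2) == (1, 4)   # P(2) = 1/4
assert (first_hh_at(3), 2 ** 3) == (1, 8)   # P(3) = 1/8
# for k > 3 the counts follow F = 2, 3, 5, 8, 13, ...
assert [first_hh_at(k) for k in range(4, 9)] == [2, 3, 5, 8, 13]
```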
1.6
Let the sample space be Ω = {0, 1, 2, …} (which is {0} ∪ N). We are going to prove that there is no uniform distribution on this sample space. That is, if P(A) = P(B) whenever |A| = |B| (the cardinality of the sets), then P cannot satisfy the axioms of probability.
The first two axioms can be satisfied. Take some k ∈ N and define Am = {(m − 1)k + 1, (m − 1)k + 2, …, mk}, so that An ∩ Am = ∅ for n ≠ m. (This can be checked by setting k = 100, n = 5 and m = 6, for instance.) By uniformity we may take P(A1) > 0: if instead P(A1) = 0, every singleton — and hence, by (A.3), all of Ω — would have probability 0, contradicting (A.2). Also, |An| = |Am| and P(An) = P(Am) > 0. This will cause problems in the third axiom:
P(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai) = ∞,
since we are summing an infinite number of positive values. The axioms of probability are not
satisfied, so we cannot have a uniform distribution on {0, 1, 2, . . .}.
1.7
For some events A1 , A2 , . . ., we are going to show that:
P(∪_{n=1}^∞ An) ≤ Σ_{n=1}^∞ P(An).
As in exercise 1.1, define Bn = An − ∪_{i=1}^{n−1} Ai. First, we show that any Bm and Bn are disjoint for n, m ∈ N and m > n.
Bm ∩ Bn = (Am − ∪_{i=1}^{m−1} Ai) ∩ (An − ∪_{i=1}^{n−1} Ai)
= (Am ∩ (∪_{i=1}^{m−1} Ai)^c) ∩ (An ∩ (∪_{i=1}^{n−1} Ai)^c)
= (Am ∩ ∩_{i=1}^{m−1} Ai^c) ∩ (An ∩ ∩_{i=1}^{n−1} Ai^c)
= Am ∩ A_{m−1}^c ∩ … ∩ An^c ∩ … ∩ A2^c ∩ A1^c ∩ An ∩ A_{n−1}^c ∩ … ∩ A2^c ∩ A1^c
= Am ∩ A_{m−1}^c ∩ … ∩ A2^c ∩ A1^c ∩ A_{n−1}^c ∩ … ∩ A2^c ∩ A1^c ∩ (An^c ∩ An)
= ∅
since An^c ∩ An = ∅.
Hence, the sets Bi are mutually disjoint. Next, we show that:
∪_{n=1}^∞ An = ∪_{n=1}^∞ Bn.
We will use an induction argument. For the first element we have B1 = A1, and:
∪_{i=1}^2 Bi = B1 ∪ B2
= A1 ∪ (A2 − A1)
= A1 ∪ (A2 ∩ A1^c)
= (A1 ∪ A2) ∩ (A1 ∪ A1^c)
= (A1 ∪ A2) ∩ Ω
= A1 ∪ A2
= ∪_{i=1}^2 Ai
So, we assume that for k ∈ N, we have:
∪_{n=1}^k An = ∪_{n=1}^k Bn.
Then:
∪_{i=1}^{k+1} Bi = B_{k+1} ∪ (∪_{i=1}^k Bi)
= B_{k+1} ∪ (∪_{i=1}^k Ai)
= (A_{k+1} − ∪_{i=1}^k Ai) ∪ (∪_{i=1}^k Ai)
= (A_{k+1} ∩ (∪_{i=1}^k Ai)^c) ∪ (∪_{i=1}^k Ai)
= (A_{k+1} ∪ (∪_{i=1}^k Ai)) ∩ ((∪_{i=1}^k Ai)^c ∪ (∪_{i=1}^k Ai))
= (∪_{i=1}^{k+1} Ai) ∩ Ω
= ∪_{i=1}^{k+1} Ai
So, by induction, we have verified the equality for k + 1, and it follows that the sets are equal for all indexes.
By definition of Bn, we have Bn ⊂ An, which means P(Bn) ≤ P(An). Using this, we can prove the main statement:
P(∪_{n=1}^∞ An) = P(∪_{n=1}^∞ Bn) = Σ_{n=1}^∞ P(Bn) ≤ Σ_{n=1}^∞ P(An).
1.8
Suppose that P(Ai) = 1 for each Ai. We are going to prove that:
P(∩_{i=1}^∞ Ai) = 1.
Proof. We will prove this by induction. For events A1 and A2 , we know that P(A1 ) = 1 and
P(A2 ) = 1. By the probability axioms:
A1 ⊂ A1 ∪ A2 ⊂ Ω =⇒ P(A1) ≤ P(A1 ∪ A2) ≤ P(Ω)
Using that P(A1 ) = 1 and P(Ω) = 1:
1 ≤ P(A1 ∪ A2) ≤ 1 =⇒ P(A1 ∪ A2) = 1
By Lemma 1.6:
P(A1 ∪ A2 ) = P(A1 ) + P(A2 ) − P(A1 ∩ A2 )
1 = 2 − P(A1 ∩ A2 )
P(A1 ∩ A2 ) = 1
Assuming this holds for k ∈ N, i.e.:
P(∩_{i=1}^k Ai) = 1.
To simplify the notation, we call this intersection Wk , so P(Wk ) = 1. Now we will show it also
applies to k + 1.
A1 ∩ A2 ∩ . . . ∩ Ak ∩ Ak+1 = Wk ∩ Ak+1
Again, from the probability axioms, P(Wk ∪ A_{k+1}) = 1, and by Lemma 1.6:
P(Wk ∩ A_{k+1}) = P(Wk) + P(A_{k+1}) − P(Wk ∪ A_{k+1}) = 1 + 1 − 1 = 1
which verifies that equality holds for k + 1. By induction, we can conclude that:
P(∩_{i=1}^∞ Ai) = 1.
1.9
For a fixed B such that P(B) > 0, we will show that P(·|B) satisfies the axioms of probability.
Recalling the definition of conditional probability:
P(A|B) = P(A ∩ B) / P(B)
• Axiom 1. P(A|B) ≥ 0.
Proof. With a := P(A ∩ B) ≥ 0 and P(B) > 0:
P(A|B) = P(A ∩ B)/P(B) = a/P(B) ≥ 0
• Axiom 2. P(Ω|B) = 1.
Proof.
P(Ω|B) = P(Ω ∩ B)/P(B) = P(B)/P(B) = 1
• Axiom 3. For disjoint A1, A2, …, define Ci = Ai ∩ B; the Ci are also disjoint. This means:
P(∪_{i=1}^∞ Ai | B) = P((∪_{i=1}^∞ Ai) ∩ B)/P(B) = P(∪_{i=1}^∞ Ci)/P(B) = Σ_{i=1}^∞ P(Ci)/P(B) = Σ_{i=1}^∞ P(Ai ∩ B)/P(B) = Σ_{i=1}^∞ P(Ai|B)
1.10
The Monty Hall problem: we initially select door 1, and the host then opens door III. Now, what is the probability that the prize is behind door number 2, given that door III was opened? This can be calculated by Bayes' law:
P(Ak|B) = P(B|Ak)P(Ak) / Σ_{i=1}^n P(B|Ai)P(Ai)
In our case, the probability of door 2 containing the prize, given that door III was opened:
P(2|III) = P(III|2)P(2) / [P(III|1)P(1) + P(III|2)P(2) + P(III|3)P(3)]
= (1)(1/3) / [(1/2)(1/3) + (1)(1/3) + (0)(1/3)]
= (1/3) / (1/6 + 1/3 + 0)
= (1/3) / (1/2)
= 2/3
Probability that door 1 contains the prize, given that door III was opened:
P(1|III) = P(III|1)P(1) / [P(III|1)P(1) + P(III|2)P(2) + P(III|3)P(3)]
= ((1/2)(1/3)) / [(1/2)(1/3) + (1)(1/3) + (0)(1/3)]
= (1/6) / (1/2)
= 1/3
This shows that we are better off switching doors than remaining with the one we initially selected.
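The two posteriors can be reproduced exactly with rational arithmetic; a Python sketch of the same Bayes computation (door labels as in the text):

```python
from fractions import Fraction

# We pick door 1; the host opens door 3.  P(host opens 3 | prize behind k):
host_opens_3 = {1: Fraction(1, 2), 2: Fraction(1), 3: Fraction(0)}
prior = Fraction(1, 3)                      # uniform prior over the doors

denom = sum(host_opens_3[k] * prior for k in (1, 2, 3))
posterior = {k: host_opens_3[k] * prior / denom for k in (1, 2, 3)}

assert posterior[2] == Fraction(2, 3)       # switching wins with probability 2/3
assert posterior[1] == Fraction(1, 3)       # staying wins with probability 1/3
```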
1.11
Claim. If A and B are independent, then Ac and B c are independent.
Proof. By definition of independence P(A ∩ B) = P(A)P(B). By properties of probabilities: P(Ac) = 1 − P(A) and P(Bc) = 1 − P(B). Applying Lemma 1.6 to Ac ∪ Bc:
P(Ac ∪ Bc) = P(Ac) + P(Bc) − P(Ac ∩ Bc) (2.11.1)
Which we can rewrite in the following way (using De Morgan), and then apply the independence of A and B:
P(Ac ∪ Bc) = P((A ∩ B)^c) = 1 − P(A ∩ B) = 1 − P(A)P(B) (2.11.2)
By setting (2.11.1) equal to (2.11.2) we can deduce that
P(Ac ∩ Bc) = P(Ac) + P(Bc) − 1 + P(A)P(B) = (1 − P(A))(1 − P(B)) = P(Ac)P(Bc)
which proves independence.
Alternative Proof. Slightly more compact argument, but it still relies on Lemma 1.6.
1.12
Initially, we have three cards that we call G (Green both sides), R (Red both sides) and M (mixed
colors). Next we define r as ’Red side’ and g as ’Green side’. We can define the joint distribution
as follows:
       G     M     R
r      0    1/6   2/6  | 1/2
g     2/6   1/6    0   | 1/2
      1/3   1/3   1/3  |  1
From the marginal distributions, we can see that the initial probabilities for selecting each card are:
P(G) = 1/3, P(R) = 1/3, P(M) = 1/3
And the probabilities for randomly viewing a colored card side:
P(g) = 1/2, P(r) = 1/2
Note that we do not have independence, since P(G ∩ g) = 2/6 while P(G)P(g) = (1/3)(1/2) = 1/6.
We want to find the probability that we have the green card given that we observe a green side. This can be calculated by:
P(G|g) = P(G ∩ g)/P(g) = (2/6)/(1/2) = 4/6 = 2/3
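The same answer falls out of enumerating the six equally likely (card, side) outcomes; a Python sketch:

```python
from fractions import Fraction

# Each card listed with its two sides; drawing a card and then a side
# uniformly gives 6 equally likely (card, side) outcomes.
cards = {"G": ["g", "g"], "R": ["r", "r"], "M": ["g", "r"]}
outcomes = [(name, side) for name, sides in cards.items() for side in sides]

p_g = Fraction(sum(s == "g" for _, s in outcomes), len(outcomes))
p_G_and_g = Fraction(sum(c == "G" and s == "g" for c, s in outcomes), len(outcomes))

assert p_g == Fraction(1, 2)
assert p_G_and_g / p_g == Fraction(2, 3)   # P(G | g)
```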
1.13
A fair coin is thrown until we have both H and T.
(a) The sample space is:
Ω = {(ω1, …, ωk) : ω1 = ω2 = … = ω_{k−1}, ωk ≠ ω_{k−1}, k ≥ 2} = {(H,T), (T,H), (H,H,T), (T,T,H), (H,H,H,T), …}
(b) The probability of getting 3 tosses: K = 3 can only happen in two ways, (H, H, T) and (T, T, H). So:
P(K = 3) = 2 · 1/2^3 = 1/4
(it would be more complicated if it were e.g. K ≤ 3).
1.14
Claim: If P(A) = 0 or P(A) = 1, then A is independent of every other event.
Proof. First assume P(A) = 0, and let E be any event. Then:
P(A)P(E) = 0 · P(E) = 0.
By the axioms of probability: A ∩ E ⊂ A, so P(A ∩ E) ≤ P(A) = 0, and since 0 ≤ P(A ∩ E) for any event, we can conclude that P(A ∩ E) = 0. So:
P(A ∩ E) = 0 = P(A)P(E).
Now assume P(A) = 1, so that P(Ac) = 0; by the same argument P(Ac ∩ E) = 0. In conclusion:
P(A ∩ E) = P(E) − P(Ac ∩ E) = P(E) = P(A)P(E) =⇒ P(A ∩ E) = P(A)P(E)
which proves independence.
Conversely, if A is independent of itself, then:
P(A ∩ A) = P(A)P(A)
P(A) = P(A)²
Setting p = P(A):
p = p² =⇒ p² − p = 0 =⇒ p(p − 1) = 0
so p = 0 or p = 1.
1.15
The probability that a child has blue eyes is 1/4, P(B) = 1/4. We assume independence in the eye
color of the children.
(a) Given that one child has blue eyes, what is the probability that 2 or more children have blue
eyes? Since it’s given that one child has blue eyes, we just need to find the probability that one, or
both, of the remaining children have blue eyes. Define Bi : as the event that child i has blue eyes,
and ¬Bi the event that child i does NOT have blue eyes. We only need to sum up the following
three probabilities:
1 1 1
P(B1 ∩ B2 ) = P(B1 )P(B2 ) = =
4 4 16
1 3 3
P(B1 ∩ ¬B2 ) = P(B1 )P(¬B2 ) = =
4 4 16
3 1 3
P(¬B1 ∩ B2 ) = P(¬B1 )P(B2 ) = =
4 4 16
7
By summing these up, we find that the probability of two or more children having blue eyes is .
16
(b) Which child has blue eyes is irrelevant. The probability will be the same: 7/16.
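The arithmetic can be double-checked with exact fractions; a small Python sketch:

```python
from fractions import Fraction

p = Fraction(1, 4)       # P(a child has blue eyes)
q = 1 - p
# Given one child has blue eyes, at least one of the other two must also:
answer = p * p + p * q + q * p
assert answer == Fraction(7, 16)
# equivalently, the complement: neither remaining child has blue eyes
assert answer == 1 - q * q
```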
1.16
Proving Lemma 1.14. If A and B are independent events, then P(A|B) = P(A). Also, for any pair
of events A and B,
P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A).
Proof. Suppose A and B are independent. By definition of independence, P(A ∩ B) = P(A)P(B).
By definition of conditional probability:
P(A|B) = P(A ∩ B)/P(B) = P(A)P(B)/P(B) = P(A),
P(A|B) = P(A ∩ B)/P(B) =⇒ P(A ∩ B) = P(A|B)P(B)
P(B|A) = P(B ∩ A)/P(A) =⇒ P(A ∩ B) = P(B|A)P(A),
since A ∩ B = B ∩ A which proves the remaining two statements.
1.17
Show that
P(A ∩ B ∩ C) = P(A|B ∩ C)P(B|C)P(C).
Proof. Define E := B ∩ C, so A ∩ B ∩ C = A ∩ E. By Lemma 1.14:
P(A ∩ B ∩ C) = P(A ∩ E) = P(A|E)P(E) = P(A|B ∩ C)P(B ∩ C) = P(A|B ∩ C)P(B|C)P(C).
1.18
Suppose k events A1, …, Ak form a partition of Ω, i.e. they are disjoint and ∪_{i=1}^k Ai = Ω. Assume that P(B) > 0. If P(A1|B) < P(A1) then P(Ai|B) > P(Ai) for some i = 2, 3, …, k.
Proof. By summing over all Ai with and without conditioning on B, we get the following two
equalities:
Σ_{i=1}^k P(Ai|B) = 1,   Σ_{i=1}^k P(Ai) = 1
Since P(A1 |B) < P(A1 ), then P(A1 ) − P(A1 |B) = a > 0 for some a ∈ R. By subtracting P(A1 |B)
from both sides, we can write:
Σ_{i=2}^k P(Ai|B) = P(A1) − P(A1|B) + Σ_{i=2}^k P(Ai)
Σ_{i=2}^k P(Ai|B) = a + Σ_{i=2}^k P(Ai) > Σ_{i=2}^k P(Ai)
So, we get:
Σ_{i=2}^k P(Ai|B) > Σ_{i=2}^k P(Ai)
which means there exists some i ∈ {2, 3, . . . , k} such that P(Ai |B) > P(Ai ).
1.19
Define M as Mac users, W as Windows users, and L as Linux users. We have:
P(M) = 0.3, P(W) = 0.5, P(L) = 0.2
The computers are infected with a virus (event V), which gives the following probabilities:
P(V|M) = 0.65, P(V|W) = 0.82, P(V|L) = 0.5
A person is selected at random and we learn that her computer has the virus. What is the probability that she is using Windows? Or, what is P(W|V)? We calculate this with Bayes' law.
P(W|V) = P(V|W)P(W) / [P(V|M)P(M) + P(V|W)P(W) + P(V|L)P(L)]
= (0.82)(0.5) / [(0.65)(0.3) + (0.82)(0.5) + (0.5)(0.2)]
≈ 0.5816
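The same Bayes computation, repeated numerically; a small Python sketch (the dictionary names are mine):

```python
# Priors over operating systems and infection rates, as given above.
priors = {"M": 0.30, "W": 0.50, "L": 0.20}
p_virus = {"M": 0.65, "W": 0.82, "L": 0.50}

denom = sum(p_virus[k] * priors[k] for k in priors)      # P(V)
p_w_given_v = p_virus["W"] * priors["W"] / denom          # Bayes' law

assert abs(denom - 0.705) < 1e-9
assert abs(p_w_given_v - 0.5816) < 5e-4
```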
1.20
We have a box containing 5 coins, C1, …, C5. Each coin is selected at random, so P(Ci) = 1/5 for all coins. They have the following probabilities of getting H when flipped (pi = P(H|Ci)):
p1 = 0, p2 = 1/4, p3 = 1/2, p4 = 3/4, p5 = 1
(a) We flip a coin and get a head. Calculate P(Ci|H) for all coins. First an intermediary calculation:
Σ_{i=1}^5 P(H|Ci)P(Ci) = P(H|C1)P(C1) + P(H|C2)P(C2) + P(H|C3)P(C3) + P(H|C4)P(C4) + P(H|C5)P(C5)
= (0)(1/5) + (1/4)(1/5) + (1/2)(1/5) + (3/4)(1/5) + (1)(1/5)
= 0 + 1/20 + 2/20 + 3/20 + 4/20
= 1/2
Calculating the posterior for each Ci:
P(C1|H) = P(H|C1)P(C1) / Σ_{i=1}^5 P(H|Ci)P(Ci) = 0/(1/2) = 0
P(C2|H) = (1/20)/(1/2) = 1/10
P(C3|H) = (2/20)/(1/2) = 2/10
P(C4|H) = (3/20)/(1/2) = 3/10
P(C5|H) = (4/20)/(1/2) = 4/10
(b) We toss the coin again. Finding the probability P(H2|H1) (Hi: getting H on toss i). We are still using the same random coin, so we must calculate the probability over all coins together:
P(H2|H1) = P(H1 ∩ H2)/P(H1) = Σ_{i=1}^5 pi² P(Ci) / (1/2) = (1/5)(0 + 1/16 + 4/16 + 9/16 + 16/16) / (1/2) = (3/8)/(1/2) = 3/4
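Both the posteriors in (a) and the answer in (b) can be reproduced with exact fractions; a Python sketch:

```python
from fractions import Fraction

p = [Fraction(i, 4) for i in range(5)]    # p_i = 0, 1/4, 1/2, 3/4, 1
prior = Fraction(1, 5)                    # each coin equally likely

p_h = sum(prior * pi for pi in p)         # P(H) on the first toss
assert p_h == Fraction(1, 2)

posterior = [prior * pi / p_h for pi in p]
assert posterior == [Fraction(i, 10) for i in range(5)]

# (b): P(H2 | H1) = sum_i P(C_i) p_i^2 / P(H1)
p_hh = sum(prior * pi ** 2 for pi in p)
assert p_hh / p_h == Fraction(3, 4)
```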
1.21
Proportion of H vs T for p = 0.3.
# 1.21 - Plotting proportion of H vs T
n = 1000
p = 0.3
coinTosses = sample(c("H", "T"), prob = c(p, 1 - p), size = n, replace = TRUE)
proportionHeads = rep(0, n)
headCount = 0
for (i in 1:n) {
  if (coinTosses[i] == "H") {
    headCount = headCount + 1
  }
  proportionHeads[i] = headCount / i
}

# PDF
pdf("~/ALLSTAT/ch1_2.21a.pdf")
plot(x = 1:n, y = proportionHeads, type = "l", ylim = c(0, 1),
     main = paste0("Proportion heads for p = ", p))
abline(h = p, col = "red", lty = "dashed")
dev.off()
Result
After some initial randomness due to few samples, we can see that the simulated proportion stabilizes around 0.3, as expected.
[Figure: proportion of heads versus 1:n, with a dashed red line at p = 0.3]
Proportion of H vs T for p = 0.03.
# 1.21 - Plotting proportion of H vs T
n = 5000
p = 0.03
coinTosses = sample(c("H", "T"), prob = c(p, 1 - p), size = n, replace = TRUE)
proportionHeads = rep(0, n)
headCount = 0
for (i in 1:n) {
  if (coinTosses[i] == "H") {
    headCount = headCount + 1
  }
  proportionHeads[i] = headCount / i
}

# PDF
pdf("~/ALLSTAT/ch1_2.21b.pdf")
plot(x = 1:n, y = proportionHeads, type = "l", ylim = c(0, 1),
     main = paste0("Proportion heads for p = ", p))
abline(h = p, col = "red", lty = "dashed")
dev.off()
Result
After some initial randomness due to few samples, we can see that the simulated proportion stabilizes around 0.03, as expected. Did a simulation of 5000 to make the 'convergence' clearer. Note that the y-axis only goes up to 0.1 in this plot.
[Figure: proportion of heads versus 1:n, with a dashed red line at p = 0.03]
1.22
Investigating some properties of Binomial random variables.
# 1.22 - Binomial Random Variables
REP = 10   # Number of simulations per n
p = 0.3    # Probability of H
sim10 = rep(0, REP)
sim100 = rep(0, REP)
sim1000 = rep(0, REP)

# Simulating 10
for (i in 1:REP) {
  # 1 means head
  coinTosses = sample(c(1, 0), prob = c(p, 1 - p), size = 10, replace = TRUE)
  sim10[i] = sum(coinTosses)
}

# Simulating 100
for (i in 1:REP) {
  coinTosses = sample(c(1, 0), prob = c(p, 1 - p), size = 100, replace = TRUE)
  sim100[i] = sum(coinTosses)
}

# Simulating 1000
for (i in 1:REP) {
  coinTosses = sample(c(1, 0), prob = c(p, 1 - p), size = 1000, replace = TRUE)
  sim1000[i] = sum(coinTosses)
}

df = data.frame(
  SIM10 = sim10,
  SIM100 = sim100,
  SIM1000 = sim1000
)

# Output
df
apply(df, 2, mean)
Results
As seen in the results, the means of the simulations are close to np, which would be 3, 30 and 300.
> df
SIM10 SIM100 SIM1000
1 1 32 282
2 2 30 298
3 3 33 319
4 3 28 290
5 3 29 284
6 2 34 286
7 4 37 310
8 2 30 294
9 2 30 297
10 2 31 326
> apply(df, 2, mean)
SIM10 SIM100 SIM1000
2.4 31.4 298.6
1.23
Simulating conditional probabilities. First we simulate an independent experiment.
# 1.23 - Simulating a fair die
options(digits = 8)
numberOfTosses = 10000
A = c(2, 4, 6)
B = c(1, 2, 3, 4)
AandB = intersect(A, B)  # c(2, 4)
# Estimate the probabilities as empirical frequencies
dieRolls = sample(1:6, size = numberOfTosses, replace = TRUE)
PA = mean(dieRolls %in% A)
PB = mean(dieRolls %in% B)
PAandB = mean(dieRolls %in% AandB)

# Output
PA
PB
PA * PB
PAandB
Results from the calculation. As we can see, the estimates are P(A) ≈ 1/2 and P(B) ≈ 2/3. Also P(A ∩ B) ≈ P(A)P(B) ≈ 1/3. Differences are due to sampling variability.
> # Output
> PA
[1] 0.5005
> PB
[1] 0.6695
> PA*PB
[1] 0.33508475
> PAandB
[1] 0.3344
Now we will construct an experiment with a conditional probability. We will use a fair coin and a
die. When we get heads, this will correspond to 1 and the die is unchanged. If we get tails, this
will be 2 and will double the die count; so if we get tails and we roll a 2, this will give us a 4.
       1     2     3     4     5     6     7     8     9    10    11    12
H    1/12  1/12  1/12  1/12  1/12  1/12    0     0     0     0     0     0
T      0   1/12    0   1/12    0   1/12    0   1/12    0   1/12    0   1/12
Define the events: A = {2, 3, 4, 5, 6} and B = {2, 4, 6, 8} which will give A ∩ B = {2, 4, 6}. The
theoretical probabilities are:
P(A) = 8/12 = 2/3 ≈ 0.667
P(B) = 7/12 ≈ 0.5833
P(A)P(B) = 7/18 ≈ 0.3889
P(A ∩ B) = 6/12 = 1/2 = 0.5
Simulating conditional probabilities.
# 1.23 - Simulating a conditional probability
options(digits = 8)
numberOfTosses = 10000
A = c(2, 3, 4, 5, 6)
B = c(2, 4, 6, 8)
AandB = intersect(A, B)  # c(2, 4, 6)
# Coin: 1 = heads (die unchanged), 2 = tails (die count doubled)
dieRolls = sample(1:6, size = numberOfTosses, replace = TRUE)
coinTosses = sample(1:2, size = numberOfTosses, replace = TRUE)
outcomes = dieRolls * coinTosses
PA = mean(outcomes %in% A)
PB = mean(outcomes %in% B)
PAandB = mean(outcomes %in% AandB)

# Output
PA
PB
PA * PB
PAandB
> # Output
> PA
[1] 0.6691
> PB
[1] 0.5825
> PA*PB
[1] 0.38975075
> PAandB
[1] 0.4995
The simulated results are very close to the theoretical calculations. We can see that we do not have
independence, as expected.
2 Random Variables
Exercises
2.1
Claim: P(X = x) = F(x⁺) − F(x⁻). (Discrete case.)
Proof. By definition of the CDF:
F(x⁺) = lim_{z↓x} F(z) = lim_{z↓x} P(X ≤ z),   F(x⁻) = lim_{y↑x} F(y) = lim_{y↑x} P(X ≤ y)
(so y < x with y → x, and z > x with z → x). By the right-continuity of F, F(x⁺) = F(x) = P(X ≤ x); and since X is discrete (integer-valued, say), we can set y = x − 1, so that F(x⁻) = P(X ≤ x − 1) = P(X < x). So:
F(x⁺) − F(x⁻) = P(X ≤ x) − P(X < x) = P(X = x).
2.2
Let X be such that P(X = 2) = P(X = 3) = 1/10 and P(X = 5) = 8/10. Here is a plot of the CDF.
[Figure: step-function CDF, equal to 0 for x < 2, 0.1 on [2, 3), 0.2 on [3, 5) and 1 for x ≥ 5]
2.3
Lemma 2.15 Let F be the CDF for a random variable X. Then:
1. P(X = x) = F(x) − F(x⁻)
2. P(x < X ≤ y) = F(y) − F(x)
3. P(X > x) = 1 − F(x)
4. If X is continuous, then
F(b) − F(a) = P(a < X < b) = P(a ≤ X < b) = P(a < X ≤ b) = P(a ≤ X ≤ b)
Proof. We will prove each statement in turn. (1.) was proved in exercise 2.1. Doing (3) first,
since we need it to prove (2).
(3) By definition of complements of sets, A = {X > x} means Ac = {X ≤ x}, and it follows that:
P(X > x) = P(A) = 1 − P(Ac) = 1 − P(X ≤ x) = 1 − F(x).
(2) Assume x < y. We will need that {X > x} ∪ {X ≤ y} = Ω, and we will also use Lemma 1.6 (in reverse).
P(x < X ≤ y) = P({X > x} ∩ {X ≤ y})
= P(X > x) + P(X ≤ y) − P({X > x} ∪ {X ≤ y})
= 1 − F(x) + F(y) − 1
= F(y) − F(x)
(4) Similar argument for all cases, so we will just do one. We just need to turn the inequalities into strict inequalities. For continuous random variables, pointwise probabilities are 0. Again, we will need to use {X > a} ∪ {X < b} = Ω.
Define A := {a ≤ X} and B := {X < b}. First, we make the following observation:
P(A) = P({a ≤ X})
= P({a = X} ∪ {a < X})
= P({a = X}) + P({a < X}) − P({a = X} ∩ {a < X})
= 0 + P(A′) − 0
= P(A′)
where A′ = {a < X}. We get the first 0 from the pointwise probability, since X is continuous, and the second 0 because the sets are disjoint. We have shown that P(A) = P(A′) and can use this to conclude the proof.
P(a ≤ X < b) = P({a ≤ X} ∩ {X < b})
= P(A ∩ B)
= P(A) + P(B) − P(A ∪ B)
= P(A) + P(B) − P(Ω)
= P(A′) + P(B) − P(A′ ∪ B)
= P(A′ ∩ B)
= P(a < X < b)
2.4
X has the probability density (PDF):
fX(x) = 1/4 for 0 < x < 1, 3/8 for 3 < x < 5, 0 otherwise
[Figure: plot of fX(x), equal to 1/4 on (0, 1) and 3/8 on (3, 5)]
From the relatively simple structure, we can easily determine the area under the graph:
A = (1/4)(1) + (3/8)(2) = 2/8 + 6/8 = 1
(a) Finding the CDF by integrating the PDF. We will split up the integral in several parts. First
for the case when y ∈ (0, 1):
FX(y) = ∫_{−∞}^y fX(t) dt = (1/4) ∫_0^y 1 dt = (1/4)[t]_0^y = y/4
When y = 1 we have FX(1) = 1/4. Next, we must consider the case y ∈ (1, 3). Here the PDF is 0, so the CDF doesn't increase; it remains constant at 1/4:
FX(y) = 1/4
Next is the case y ∈ (3, 5). Consider the intermediary integral:
I1 = ∫_3^y (3/8) dt = (3/8)[t]_3^y = (3y − 9)/8
For values y ∈ (3, 5) we start at 1/4, so the CDF in this region becomes:
FX(y) = (3y − 9)/8 + 1/4
So, the full expression for the CDF becomes:
FX(y) = 0 for y ≤ 0;  y/4 for y ∈ (0, 1);  1/4 for y ∈ [1, 3];  (3y − 9)/8 + 1/4 for y ∈ (3, 5);  1 for y ≥ 5
As a check at the right endpoint:
FX(5) = (3(5) − 9)/8 + 1/4 = 6/8 + 2/8 = 1
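This piecewise CDF can be cross-checked by numerically integrating the PDF; a Python sketch using a simple midpoint rule (the step count is an arbitrary choice):

```python
def pdf(x):
    """The piecewise density from the exercise."""
    if 0 < x < 1:
        return 0.25
    if 3 < x < 5:
        return 0.375
    return 0.0

def cdf(y):
    """The closed-form CDF derived above."""
    if y <= 0:
        return 0.0
    if y < 1:
        return y / 4
    if y <= 3:
        return 0.25
    if y < 5:
        return (3 * y - 9) / 8 + 0.25
    return 1.0

def integral(y, steps=100_000):
    """Midpoint-rule integral of pdf over (0, y)."""
    h = y / steps
    return sum(pdf((i + 0.5) * h) for i in range(steps)) * h

for y in (0.5, 2.0, 3.5, 4.7, 6.0):
    assert abs(integral(y) - cdf(y)) < 1e-3
```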
Plot of the CDF:
[Figure: piecewise-linear CDF rising to 1/4 on (0, 1), flat on (1, 3), rising to 1 on (3, 5)]
(b) Defining Y = 1/X and finding the PDF of Y. Following the hint we are given, we will consider the following three sets:
A1 = {1/5 ≤ y ≤ 1/3},   A2 = {1/3 ≤ y ≤ 1},   A3 = {y > 1}
where A1 corresponds to X ∈ (3, 5), A2 to (1, 3) and A3 to (0, 1). We can express the CDF FY(y) in terms of FX(x), using that X > 0:
FY(y) = P(Y ≤ y) = P(1/X ≤ y) = P(X ≥ 1/y) = 1 − P(X ≤ 1/y) = 1 − FX(1/y)
First, we consider A1: y ∈ [1/5, 1/3]; when we input 1/y to FX(·), it will be in (3, 5). So:
FY(y) = 1 − FX(1/y)
= 1 − ((3(1/y) − 9)/8 + 1/4)
= 1 − (3 − 9y)/(8y) − 1/4
= 3/4 + (9y − 3)/(8y)
= (15y − 3)/(8y)
Next, we consider A2: y ∈ [1/3, 1]. The input to FX(·) will be in (1, 3):
FY(y) = 1 − FX(1/y) = 1 − 1/4 = 3/4
Next, we consider A3: y > 1. The input to FX(·) will be in (0, 1):
FY(y) = 1 − FX(1/y) = 1 − (1/y)/4 = 1 − 1/(4y)
Also, whenever y < 1/5, then 1/y > 5, which means FX(1/y) = 1, and so:
FY(y) = 1 − FX(1/y) = 1 − 1 = 0.
Plot of CDF:
[Figure: CDF FY(y): 0 below 1/5, rising to 3/4 on [1/5, 1/3], flat at 3/4 on [1/3, 1], approaching 1 for y > 1]
Finally, we can find the PDF of Y. We differentiate each of the parts in the CDF. When y ∈ (1/5, 1/3):
d/dy [(15y − 3)/(8y)] = 3/(8y²)
When y > 1:
d/dy [1 − 1/(4y)] = 1/(4y²)
(All other parts are constant, so they become 0.) This gives us the PDF:
fY(y) = 0 for y < 1/5;  3/(8y²) for 1/5 ≤ y ≤ 1/3;  0 for 1/3 < y < 1;  1/(4y²) for y > 1
[Figure: plot of fY(y)]
2.5
Let X and Y be discrete RV. X and Y are independent if and only if fX,Y (x, y) = fX (x)fY (y) for
all x and y.
Proof.
⇒) Assume that X and Y are independent. That means that for any x, y, we have:
fX,Y(x, y) = P(X = x, Y = y) = P(X = x)P(Y = y) = fX(x)fY(y)
which shows that fX,Y(x, y) = fX(x)fY(y) for all x and y.
⇐) Assume that fX,Y(x, y) = fX(x)fY(y) for all x and y. By definition:
fX,Y(x, y) = P(X = x ∩ Y = y)
And,
fX(x)fY(y) = P(X = x)P(Y = y).
From our assumption, these are equal, so P(X = x ∩ Y = y) = P(X = x)P(Y = y), which shows that X and Y are independent.
By implication both ways, the statement is proved.
2.6
Let X have distribution F and density f , and let A be a subset of the real line, e.g. A = (a, b) for
some a, b ∈ R and a < b. We have the indicator function
IA(x) = 1 if x ∈ A, 0 if x ∉ A
2.7
Let X and Y be independent and suppose that X, Y ∼ U(0, 1). For Z = min(X, Y) we will find the density fZ(z). Following the hint, we will first find P(Z > z). Since all observations x, y ∈ (0, 1), we can immediately see that P(Z > 0) = 1 and P(Z > 1) = 0. But what happens for other values? The best way to find out is with some illustrations. Here are plots of the cases P(Z > 1/2), P(Z > 1/4) and P(Z > 3/4).
[Figure: three unit squares showing the regions {X > z, Y > z} for z = 1/2, 1/4 and 3/4]
If we simulate lots of X and Y values, we see that about 1/4th of them will have both X
and Y values larger than 1/2, so P(Z > 1/2) = 1/4. Similarly, we get P(Z > 1/4) = 9/16 and
P(Z > 3/4) = 1/16. Confirming this with a simulation.
# 2.7 - Simulating U(0,1)
N = 100000; X = runif(N); Y = runif(N)
Z = pmin(X, Y)
mean(Z > 1/2); mean(Z > 1/4); mean(Z > 3/4)  # approx. 1/4, 9/16, 1/16
By inspecting the images above, we can determine the 'shape' of the probabilities. For Z > 1/4 we remove the union of X ≤ 1/4 and Y ≤ 1/4. We define A = {X ≤ z} and B = {Y ≤ z}, and can write the general case as:
P(Z > z) = 1 − P({X ≤ z} ∪ {Y ≤ z})
= 1 − P(A ∪ B)
= 1 − [P(A) + P(B) − P(A ∩ B)]
= 1 − [P(A) + P(B) − P(A)P(B)]
Where we used Lemma 1.6, and the fact that X and Y are independent. By using the probability
law of complements, we can find the expression for P(Z ≤ z):
P(Z ≤ z) = P(X ≤ z) + P(Y ≤ z) − P(X ≤ z)P(Y ≤ z)
The CDF for a uniform distribution U(a, b) is:
F(z) = (z − a)/(b − a) =⇒ FX(z) = FY(z) = (z − 0)/(1 − 0) = z
Which means:
FZ(z) = FX(z) + FY(z) − FX(z)FY(z) = 2z − z²
We can confirm our illustrations and simulated examples again by noting that:
FZ(1/2) = 1/2 + 1/2 − 1/4 = 3/4
FZ(1/4) = 1/4 + 1/4 − 1/16 = 8/16 − 1/16 = 7/16
FZ(3/4) = 3/4 + 3/4 − 9/16 = 24/16 − 9/16 = 15/16
which gives us the complementary results, as expected (since we simulated and illustrated P(Z > z)). The PDF fZ(z) is the derivative of FZ(z). Including PDF-plot and histogram of the simulated Zs:
PDF fZ (z) is the derivative of FZ (z). Including PDF-plot and histogram of the simulated Zs.
d
fZ (z) = FZ (z) = 2 − 2z.
dz
Histogram of Z
2
2.0
1.5
1.5
fZ (z)
Density
1
1.0
0.5
0.5
0.0
0
0 0.2 0.4 0.6 0.8 1 0.0 0.2 0.4 0.6 0.8 1.0
z
Z
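The CDF FZ(z) = 2z − z² can also be checked by direct simulation, mirroring the R experiment above; a Python sketch (sample size and seed are arbitrary choices):

```python
import random

random.seed(0)
N = 200_000
# Z = min(X, Y) for independent X, Y ~ U(0, 1)
zs = [min(random.random(), random.random()) for _ in range(N)]

def F_emp(z):
    """Empirical CDF of the simulated Z values."""
    return sum(v <= z for v in zs) / N

for z in (0.25, 0.5, 0.75):
    assert abs(F_emp(z) - (2 * z - z * z)) < 0.01   # F_Z(z) = 2z - z^2
```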
2.8
The RV X has CDF F. Finding the CDF of X⁺ = max{0, X}.
From the definition of the CDF:
F_{X⁺}(u) = P(max(0, X) ≤ u)
When u < 0 the event is impossible, since X⁺ ≥ 0, so F_{X⁺}(u) = 0. When u ≥ 0:
F_{X⁺}(u) = P({ω ∈ Ω : X(ω) ≤ u}) = P(X ≤ u) = FX(u)
So in summary:
F_{X⁺}(u) = 0 for u < 0, and F_{X⁺}(u) = FX(u) for u ≥ 0.
2.9
We have X ∼ Exp(β). The PDF is given by:
f(x) = (1/β) e^{−x/β}, x > 0
Finding the CDF by integrating the PDF:
F(y) = ∫_{−∞}^y f(x) dx
= (1/β) ∫_0^y e^{−x/β} dx
= (1/β) [−β e^{−x/β}]_0^y
= [−e^{−x/β}]_0^y
= 1 − e^{−y/β}
34
2.10
If X and Y are independent, then g(X) and h(Y) are independent for any functions g and h.
Proof. Let X and Y be independent random variables, and let x and y be values in the ranges of g and h. Then:
P(g(X) = x ∩ h(Y) = y) = P(X ∈ g⁻¹({x}) ∩ Y ∈ h⁻¹({y})) = P(X ∈ g⁻¹({x})) P(Y ∈ h⁻¹({y})) = P(g(X) = x) P(h(Y) = y)
which proves independence.
2.11
Tossing a coin which has probability p of getting H. We let X denote the number of heads and Y
the number of tails.
(a) Showing that X and Y are dependent. First it will be helpful to consider a simplified case where
we have N = 1 and N = 2 coin tosses, and assuming p = 1/2.
N = 1 Tosses:
          Y = 0   Y = 1
X = 0       0      1/2   | 1/2
X = 1      1/2      0    | 1/2
           1/2     1/2   |  1

N = 2 Tosses:
          Y = 0   Y = 1   Y = 2
X = 0       0       0      1/4  | 1/4
X = 1       0      1/2      0   | 1/2
X = 2      1/4      0       0   | 1/4
           1/4     1/2     1/4  |  1
In the case of N = 2, we see that f (0, 2) = 1/4 while fX (0)fY (2) = 1/16. This will be the
inspiration for how we show it in the general case with N tosses and probability p of heads.
We only need to show one specific case where P(X = x, Y = y) ≠ P(X = x)P(Y = y) to show that these variables are dependent. The total number of tosses will be N = X + Y, and we compare the cases where we get N heads. In that case, using the Multinomial distribution:
P(X = N, Y = 0) = (N choose N, 0) p^N (1 − p)^0 = p^N,
while the marginals give P(X = N) = P(Y = 0) = p^N, so P(X = N)P(Y = 0) = p^{2N}.
For 0 < p < 1 these are not equal, showing that X and Y are dependent variables.
(b) Now we have N ∼ Poisson(λ), where N = X + Y for X heads and Y tails. Show that these
values are now independent.
Current solution, but not sure it’s correct...
Assuming X = x and Y = y. Then:
P(X = x ∩ Y = y) = P(X = x ∩ Y = y ∩ N = x + y)
= P(X = x ∩ Y = y | N = x + y) · P(N = x + y)
= (x + y choose x) p^x (1 − p)^y · e^{−λ} λ^{x+y}/(x + y)!
= ((x + y)!/(x! y!)) p^x (1 − p)^y · e^{−λ} λ^x λ^y/(x + y)!
= e^{−λ} · (p^x λ^x/x!) · ((1 − p)^y λ^y/y!)
= e^{−λp} (pλ)^x/x! · e^{−λ(1−p)} ((1 − p)λ)^y/y!
= P(X = x | λp) · P(Y = y | λ(1 − p))
= P(X = x) P(Y = y)
Also, we made the assumption that P(X = x ∩ Y = y | N = x + y) is Binomial. But I think this
is only true when we can assume that X and Y are independent... which is what we are trying to
show. Will review this later... hopefully!
2.12
We get g(x) and h(y) since they are equal to f(x, y) over the entire 'rectangle' and must therefore integrate to 1. This leaves us with: f(x, y) = g(x)h(y) = fX(x)fY(y), and by the results in exercise 2.5, this means that X and Y are independent.
2.13
(a) Finding the PDF of Y = e^X when X ∼ N(0, 1).
By definition of the CDF, FY(y) = P(e^X ≤ y) = P(X ≤ log(y)) = FX(log(y)): we simply get the standard normal CDF with log(y) as input. Differentiating to get the PDF:
fY(y) = d/dy FX(log(y)) = fX(log(y)) · (1/y) = (1/(y√(2π))) exp(−log(y)²/2)
Plotting the function:
[Figure: plot of fY(y) = (1/y) fX(log(y)) for y ∈ (0, 8)]
Histogram of y:
x = rnorm(10000)
y = exp(x)
hist(y, prob = TRUE)
lines(density(y), col = "red", xlim = c(0, 8))
[Figure: histogram of y with the density estimate overlaid]
2.14
We let (X, Y) be uniformly distributed on the unit disk: {(x, y) : x² + y² ≤ 1}. Find the CDF and PDF of R = √(X² + Y²).
First, let's write some code in order to simulate the data, which are random points contained within the unit disk, and then use that to simulate R and view the histogram. Plots and code can be found below. By inspection of the histogram it is clear that fR(r) = 2r, which means FR(r) = r².
Following the general recipe for transformation of multiple random variables, we will define the function r(X, Y) = √(X² + Y²). Then:
FR(r) = P(R ≤ r) = P(r(X, Y) ≤ r) = P(X² + Y² ≤ r²) = ∬_{Ar} fX,Y(x, y) dx dy
The set Ar = {(x, y) : x² + y² ≤ r²} is a disk of radius r contained in the unit disk, since x² + y² ≤ r² ≤ 1 =⇒ √(x² + y²) ≤ 1. Since the area of the unit disk is π, and since a uniform distribution means the probability is equal everywhere, it follows that:
fX,Y(x, y) = 1/π.
As we saw, we must integrate this function over the disk {(x, y) : x² + y² ≤ r²} for some 0 ≤ r ≤ 1. This is easiest when calculating the integral in polar coordinates, so we introduce 0 ≤ θ ≤ 2π and 0 ≤ ρ ≤ r and change dx dy = ρ dρ dθ. So:
FR(r) = (1/π) ∫_0^{2π} ∫_0^r ρ dρ dθ
= (1/π) ∫_0^{2π} [ρ²/2]_0^r dθ
= (1/π) ∫_0^{2π} (r²/2) dθ
= (r²/2π) ∫_0^{2π} (1) dθ
= (r²/2π) [θ]_0^{2π}
= (r²/2π) · 2π
= r²
fR(r) = d/dr FR(r) = 2r,
also in the interval 0 ≤ r ≤ 1. This confirms the results we found by simulating, which are shown below.
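The result FR(r) = r² implies, for example, E[R] = ∫_0^1 r · 2r dr = 2/3 and a median of √(1/2); a Python sketch checking both by rejection sampling (sample size and seed are arbitrary choices of mine):

```python
import math
import random

random.seed(1)
rs = []
while len(rs) < 100_000:
    x, y = random.uniform(-1, 1), random.uniform(-1, 1)
    if x * x + y * y <= 1:          # rejection sampling: keep points inside the disk
        rs.append(math.hypot(x, y))

mean_r = sum(rs) / len(rs)
assert abs(mean_r - 2 / 3) < 0.01                           # E[R] = 2/3
median_mass = sum(r <= math.sqrt(0.5) for r in rs) / len(rs)
assert abs(median_mass - 0.5) < 0.01                        # F_R(sqrt(1/2)) = 1/2
```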
Plots and Code for 2.14
Result from simulation:

simulateUnitCircle = function(N) {
  TR = runif(N)
  UR = runif(N); VR = runif(N)
  t = 2 * pi * TR
  u = UR + VR
  u[u > 1] = 2 - u[u > 1]
  r = u
  X = r * cos(t)
  Y = r * sin(t)
  retVal = list()
  retVal$X = X
  retVal$Y = Y
  return(retVal)
}

simVal = simulateUnitCircle(1000)
X = simVal$X
Y = simVal$Y
pdf("~/AllStatistics/files/ch2_2.14.pdf",
    width = 4, height = 4)
plot(X, Y)
dev.off()

[Figure: scatter plot of 1000 simulated points, uniform on the unit disk]

simVal = simulateUnitCircle(50000)
R = sqrt(simVal$X^2 + simVal$Y^2)
pdf("~/AllStatistics/files/ch2_2.14b.pdf",
    width = 4, height = 4)
hist(R, prob = TRUE)
dev.off()

[Figure: histogram of R, matching the density fR(r) = 2r]
2.15
Let X have a continuous, strictly increasing CDF F , and set Y = F (X). Find the density of Y .
By definition of the CDF, F : \mathbb{R} \to [0, 1], so [0, 1] is the range of Y. Since F is strictly increasing, F(x_1) < F(x_2) whenever x_1 < x_2, but the important property is that F must be invertible. For any y \in [0, 1] we get the following:
\[ F_Y(y) = P(Y \le y) = P(F(X) \le y) = P(X \le F^{-1}(y)) = F(F^{-1}(y)) = y \]
Since F_Y(y) = y, the density is f_Y(y) = 1, its derivative. From these observations, we can conclude that Y \sim U(0, 1).
Next, if U \sim U(0, 1) and X = F^{-1}(U), then X has F as its CDF. Since P(U \le u) = u,
\[ P(F^{-1}(U) \le x) = P(U \le F(x)) = F(x) \implies X \sim F \]
Writing a program that generates Exp(β) values. Using the result found in exercise 2.9.
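The same inverse-transform recipe can be sketched in Python as well (a minimal sketch; the helper name rexp_inverse is made up): for Exp(β), F(x) = 1 − e^{−x/β}, so F^{-1}(u) = −β log(1 − u).

```python
import math
import random

def rexp_inverse(n, beta, seed=1):
    """Generate Exp(beta) draws via X = F^{-1}(U) = -beta * log(1 - U)."""
    rng = random.Random(seed)
    return [-beta * math.log(1.0 - rng.random()) for _ in range(n)]

draws = rexp_inverse(200_000, beta=2.0)
print(sum(draws) / len(draws))  # should be close to beta = 2
```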
[Figure: the Exp(β) density f_X(x) and a histogram of the generated expRND values.]
2.16
If X ∼ Poisson(λ₁) and Y ∼ Poisson(λ₂), and X and Y are independent, then the distribution of X given that X + Y = n is Binomial(n, ρ) where ρ = λ₁/(λ₁ + λ₂). (Changed some variable names.)
Proof. By definition of conditional probability, by applying hint 1, by independence, and by using the fact that X + Y is Poisson distributed with parameter λ₁ + λ₂:
\begin{align*}
P(X = x \mid X + Y = n) &= \frac{P(X = x \cap X + Y = n)}{P(X + Y = n)} \\
&= \frac{P(X = x \cap Y = n - x)}{P(X + Y = n)} && \text{(Hint)} \\
&= \frac{P(X = x)P(Y = n - x)}{P(X + Y = n)} && \text{(Independence)}
\end{align*}
Each of these terms is known:
\[ P(X = x) = e^{-\lambda_1}\frac{\lambda_1^x}{x!}, \qquad P(Y = n - x) = e^{-\lambda_2}\frac{\lambda_2^{n-x}}{(n - x)!}, \qquad P(X + Y = n) = e^{-(\lambda_1+\lambda_2)}\frac{(\lambda_1 + \lambda_2)^n}{n!} \]
We can invert the last equation:
\[ \frac{1}{P(X + Y = n)} = e^{\lambda_1+\lambda_2}\cdot\frac{n!}{(\lambda_1 + \lambda_2)^n} \]
Putting it all together, we can complete the justification:
\begin{align*}
P(X = x \mid X + Y = n) &= e^{-\lambda_1}\frac{\lambda_1^x}{x!}\cdot e^{-\lambda_2}\frac{\lambda_2^{n-x}}{(n - x)!}\cdot e^{\lambda_1+\lambda_2}\cdot\frac{n!}{(\lambda_1 + \lambda_2)^n} \\
&= \frac{n!}{x!\,(n - x)!}\cdot\frac{\lambda_1^x\,\lambda_2^{n-x}}{(\lambda_1 + \lambda_2)^x(\lambda_1 + \lambda_2)^{n-x}} \\
&= \binom{n}{x}\left(\frac{\lambda_1}{\lambda_1 + \lambda_2}\right)^x\left(1 - \frac{\lambda_1}{\lambda_1 + \lambda_2}\right)^{n-x} = \binom{n}{x}\rho^x(1 - \rho)^{n-x},
\end{align*}
which is the pmf of a Binomial(n, ρ) distribution,
for ρ = λ1 /(λ1 + λ2 ).
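Since every quantity here is an explicit pmf, the identity can be checked exactly (no simulation needed) in a few lines of Python; λ₁, λ₂ and n below are arbitrary choices:

```python
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

lam1, lam2, n = 2.0, 3.0, 7
rho = lam1 / (lam1 + lam2)

for x in range(n + 1):
    # Conditional pmf built from the three Poisson pmfs above
    cond = (poisson_pmf(x, lam1) * poisson_pmf(n - x, lam2)
            / poisson_pmf(n, lam1 + lam2))
    binom = math.comb(n, x) * rho**x * (1 - rho)**(n - x)
    assert abs(cond - binom) < 1e-12
print("conditional pmf matches Binomial(n, rho)")
```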
2.17
The joint probability density is:
\[ f_{X,Y}(x, y) = \begin{cases} c\,(x + y^2) & 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise} \end{cases} \]
2.18
Let X ∼ N (3, 16). Solve the following using the Normal table and using a computer package.
(a)
\[ P(X < 7) = P(3 + 4Z < 7) = P\left(Z < \frac{7 - 3}{4}\right) = P(Z < 1) = 0.8413 \]
From R:
> pnorm(7, mean = 3, sd = 4)
[1] 0.8413447
(b)
\[ P(X > 2) = 1 - P(X \le 2) = 1 - P\left(Z \le \frac{2 - 3}{4}\right) = 1 - P(Z < -1/4) = 1 - (1 - 0.5987) = 0.5987 \]
(c) Find x such that P(X > x) = 0.05. By finding 0.95 in the table, we see that it is generated by
the value 1.645, so:
P(4Z > 6.58) = P(3 + 4Z > 9.58) = P(X > 9.58) = 0.05
So, x = 9.58. From R:
1 - pnorm(9.5794, mean = 3, sd = 4)
[1] 0.05000037
(d) Finding P(0 < X < 4). Standardizing:
\[ P(0 < X < 4) = \Phi\left(\frac{4 - 3}{4}\right) - \Phi\left(\frac{0 - 3}{4}\right) = \Phi(0.25) - \Phi(-0.75) \approx 0.5987 - 0.2266 = 0.3721 \]
From R:
> pnorm(0.25) - pnorm(-0.75)
[1] 0.37207897
#### Alternatively
> CH = rnorm(10000000, mean = 3, sd = 4)
> sum(CH < 4 & CH > 0)/10000000
[1] 0.3719199
(e) Find x such that P(|X| > |x|) = 0.05.
We can split this into two parts if we assume x > 0. Since the two events are disjoint:
\[ P(|X| > x) = P(\{X > x\} \cup \{X < -x\}) = P(X > x) + P(X < -x) = 1 - P(X \le x) + P(X < -x) = 0.05 \]
There is no closed-form solution, but solving numerically (for example with uniroot in R) gives x ≈ 9.6.
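Since Φ can be written via the error function, a small Python bisection (a sketch, not part of the original solution) pins down the value of x:

```python
import math

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tail(x, mu=3.0, sigma=4.0):
    """P(|X| > x) = P(X > x) + P(X < -x) for X ~ N(mu, sigma^2), x > 0."""
    return (1.0 - Phi((x - mu) / sigma)) + Phi((-x - mu) / sigma)

# tail() is decreasing in x, so bisect for tail(x) = 0.05
lo, hi = 0.0, 30.0
for _ in range(80):
    mid = 0.5 * (lo + hi)
    if tail(mid) > 0.05:
        lo = mid
    else:
        hi = mid
x_star = 0.5 * (lo + hi)
print(round(x_star, 3))  # x is approximately 9.6
```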
2.19
Proving equation (2.12). Let X be a random variable with pdf fX (x), and define the random
variable Y = r(X), where r is either strictly monotone increasing or decreasing, so r(·) has the
inverse function s(·) = r−1 (·). Then, we can express the pdf of Y as:
\[ f_Y(y) = f_X(s(y))\left|\frac{ds(y)}{dy}\right|. \]
Proof. The proof will be in two parts. Assume first that r is strictly increasing, which means that s is strictly increasing and its derivative is positive. From the CDF of Y, we get:
\[ F_Y(y) = P(Y \le y) = P(r(X) \le y) = P(X \le r^{-1}(y)) = P(X \le s(y)) = F_X(s(y)) \]
Differentiating gives f_Y(y) = f_X(s(y))\,\frac{ds(y)}{dy}, with ds(y)/dy > 0. Now assume instead that r is strictly decreasing, so s is also strictly decreasing:
\[ F_Y(y) = P(Y \le y) = P(r(X) \le y) = P(X \ge r^{-1}(y)) = 1 - P(X \le s(y)) = 1 - F_X(s(y)) \]
\[ f_Y(y) = \frac{d}{dy}F_Y(y) = \frac{d}{dy}\big[1 - F_X(s(y))\big] = -f_X(s(y))\frac{ds(y)}{dy} = f_X(s(y))\left(-\frac{ds(y)}{dy}\right) \]
Since the derivative is negative here, its negative is positive. Generalizing to both cases, we can write:
\[ f_Y(y) = f_X(s(y))\left|\frac{ds(y)}{dy}\right| \]
A short note on why we get 1 − P(X ≤ s(y)) when r is strictly decreasing. As an example, say that X ∼ U(0, 5) and we define Y = r(X) where r(x) = 2 − 2x, which is strictly decreasing. Then r^{-1}(y) = s(y) = 1 − y/2, which is also strictly decreasing. Then:
\[ P(r(X) \le y) = P(2 - 2X \le y) = P(2 - y \le 2X) = P\left(1 - \frac{y}{2} \le X\right) = P(X \ge s(y)) \]
To get the expression in terms of F_X, we rewrite it: P(X ≥ s(y)) = 1 − P(X < s(y)) = 1 − F_X(s(y)).
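The transformation formula itself can also be checked numerically. A small Python sketch (names are mine) applies f_Y(y) = f_X(s(y))|s'(y)| to this same example, using a finite-difference derivative; the result should be the constant 1/10 on (−8, 2):

```python
def f_X(x):
    """Density of X ~ U(0, 5)."""
    return 0.2 if 0.0 < x < 5.0 else 0.0

def s(y):
    """Inverse transformation s(y) = r^{-1}(y) for r(x) = 2 - 2x."""
    return 1.0 - y / 2.0

def f_Y(y, h=1e-6):
    """f_Y(y) = f_X(s(y)) |s'(y)| with a central-difference derivative."""
    ds = (s(y + h) - s(y - h)) / (2.0 * h)
    return f_X(s(y)) * abs(ds)

print(f_Y(0.0), f_Y(-5.0), f_Y(3.0))  # 0.1, 0.1, 0.0
```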
2.20
Let X, Y ∼ U (0, 1) be independent. Find the PDF for X − Y and X/Y . The joint density of X, Y
is
f (x, y) = 1, x, y ∈ (0, 1)
and 0 elsewhere.
We define Z = X − Y . We note that Z can assume values between -1 and 1, which are the two
extremes. This gives us Az = (−1, 1). We split this into two cases: when X > Y , then Az = (0, 1)
and when X < Y , which means Az = (−1, 0).
When X > Y and A_z = (0, 1), the probability contribution is
\[ \int_0^z\!\int_y^1 1\,dx\,dy = \int_0^z [x]_y^1\,dy = \int_0^z (1 - y)\,dy = \left[y - \frac{y^2}{2}\right]_0^z = z - \frac{z^2}{2} \]
When X < Y and A_z = (−1, 0):
\[ F_Z(z) = \int_{-1}^z\!\int_{-1}^y 1\,dx\,dy = \int_{-1}^z [x]_{-1}^y\,dy = \int_{-1}^z (y + 1)\,dy = \left[\frac{y^2}{2} + y\right]_{-1}^z = \frac{z^2}{2} + z + \frac{1}{2} \]
Summarizing the CDF, and differentiating to find the PDF. (The 1/2 from the X < Y case carries over.)
\[ F_Z(z) = \begin{cases} \frac{z^2}{2} + z + \frac{1}{2} & z \in (-1, 0) \\[1ex] z - \frac{z^2}{2} + \frac{1}{2} & z \in (0, 1) \end{cases} \implies f_Z(z) = \begin{cases} 1 + z & z \in (-1, 0) \\ 1 - z & z \in (0, 1) \end{cases} \]
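A quick seeded Monte Carlo check of the CDF in Python (analogous to the R histogram): for z = 0.5 the formula gives F_Z(0.5) = 0.5 − 0.125 + 0.5 = 0.875.

```python
import random

rng = random.Random(42)
n = 200_000
zs = [rng.random() - rng.random() for _ in range(n)]  # X - Y with X, Y ~ U(0, 1)

# CDF check: F_Z(0.5) = 0.5 - 0.5**2 / 2 + 1/2 = 0.875
emp = sum(z <= 0.5 for z in zs) / n
print(emp)
```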
[Figure: histogram of simulated X − Y alongside the triangular density f_Z(z).]
Now we define Z = X/Y.
\[ F_Z(z) = P(Z \le z) = P(X/Y \le z) = P(X \le zY) = \int_0^1\!\int_0^{\min(zy,\,1)} 1\,dx\,dy = \int_0^1 \min(zy, 1)\,dy \]
For z < 1 the inner bound is always zy; for z ≥ 1 it is zy up to y = 1/z and 1 afterwards. Evaluating the integrals to get the CDF, and differentiating to get the PDF:
\[ F_Z(z) = \begin{cases} \displaystyle\int_0^1 zy\,dy = \frac{z}{2} & z < 1 \\[2ex] \displaystyle\int_0^{1/z} zy\,dy + \int_{1/z}^1 1\,dy = 1 - \frac{1}{2z} & z \ge 1 \end{cases} \implies f_Z(z) = \begin{cases} \frac{1}{2} & 0 < z < 1 \\[1ex] \frac{1}{2z^2} & z \ge 1 \end{cases} \]
Plots and simulation code.
[Figure: histogram of simulated X/Y (truncated at 3) alongside f_Z(z).]
# Simulating X - Y
N = 10000
X = runif ( N ) ; Y = runif ( N )
d1 = X - Y
hist ( d1 , breaks = 50 , prob = TRUE ,
main = " Histogram of X - Y " )
# Simulating X / Y
N = 1000000
X = runif ( N ) ; Y = runif ( N )
d2 = X / Y
d3 = d2 [ abs ( d2 ) < 3] # Removing very rare large vals
d3 = d3 * length ( d2 ) / length ( d3 ) # Normalizing probability
hist ( d3 , breaks =60 , prob = TRUE ,
main = " Histogram of X / Y " )
2.21
Let X1 , . . . , Xn ∼ Exp(β) be IID. Define Y = max(X1 , . . . , Xn ). Find the PDF of Y .
Recall the PDF of the exponential distribution:
1
fX (x) = exp (−x/β) , x>0
β
Defining r(X_1, \ldots, X_n) = \max(X_1, \ldots, X_n) and finding an expression for the CDF of Y.
According to the hint, Y ≤ y iff X_i ≤ y for all i ∈ I, where I = \{1, \ldots, n\} is the index set. Using independence, we can therefore express the CDF of Y in terms of the individual CDFs:
\[ F_Y(y) = P(X_1 \le y, \ldots, X_n \le y) = \prod_{i=1}^n P(X_i \le y) = (F_X(y))^n = \big(1 - e^{-y/\beta}\big)^n \]
Differentiating gives the PDF:
\[ f_Y(y) = \frac{n}{\beta}\,e^{-y/\beta}\big(1 - e^{-y/\beta}\big)^{n-1}, \qquad y > 0. \]
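A seeded Python check (a sketch; variable names are mine): the theoretical median m solves F_Y(m) = 1/2, i.e. m = −β log(1 − 0.5^{1/n}), so about half of the simulated maxima should fall below it.

```python
import math
import random

beta, n_vars, n_sim = 1.0, 10, 100_000
rng = random.Random(7)

# Simulate Y = max(X_1, ..., X_n) with X_i ~ Exp(beta) (rate = 1/beta)
ys = [max(rng.expovariate(1.0 / beta) for _ in range(n_vars))
      for _ in range(n_sim)]

# Theoretical median from F_Y(y) = (1 - exp(-y/beta))^n
med = -beta * math.log(1.0 - 0.5 ** (1.0 / n_vars))
emp = sum(y <= med for y in ys) / n_sim
print(emp)  # should be close to 0.5
```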
[Figure: histogram of simulated Y = max(X_1, …, X_n) with the density f_Y(y) overlaid.]
N = 100000
lambda = 5
3 Expectation
Exercises
3.1
Define X as the wealth after one game, starting from c. The probability of winning and losing is the same, so p = 1/2; a win doubles the money and a loss halves it:
\[ E[X] = \frac{1}{2}\cdot 2c + \frac{1}{2}\cdot\frac{c}{2} = c + \frac{c}{4} = \frac{5}{4}c \]
Each game multiplies the wealth by an independent factor with expectation 5/4, so after n games the expected wealth is (5/4)^n c. We can verify the per-game factor with a simulation in R.
> games = sample ( c (2 , 0.5) , size = 1000000 , replace = TRUE )
> mean ( games )
[1] 1.250973
> 5/4
[1] 1.25
3.2
Claim. Var(X) = 0 if and only if P(X = c) = 1 for some constant c.
Proof.
⇒) Set c = µ and assume Var(X) = 0, which means that
\[ E[(X - \mu)^2] = 0 \implies \int (x - \mu)^2\,dF(x) = 0. \]
This can only be 0 when x = µ = c for the entire domain of X. Hence P(X = c) = 1.
⇐) Assume P(X = c) = 1. When calculating the expectation:
\[ \mu = E[X] = \int c\,dF(x) = c \]
and therefore Var(X) = E[(X − µ)²] = E[(c − c)²] = 0.
3.3
Let X1 , . . . , Xn ∼ U (0, 1) and define Y = max(X1 , . . . , Xn ). We will calculate E[Y ]. It is not stated
in the exercise, but we will assume that the Xi are independent. Finding the CDF for Y .
\begin{align*}
F_Y(y) &= P(Y \le y) = P(\max(X_1, \ldots, X_n) \le y) \\
&= P(X_1 \le y, \ldots, X_n \le y) \\
&= P(X_1 \le y)P(X_2 \le y)\cdots P(X_n \le y) && \text{(Independence)} \\
&= (F_X(y))^n
\end{align*}
Since the X_i are uniformly distributed, we know that F_X(y) = y and f_X(y) = 1 on (0, 1), so:
\[ f_Y(y) = n y^{n-1}, \qquad E[Y] = \int_0^1 y\cdot n y^{n-1}\,dy = \frac{n}{n+1}. \]
As we can see, the theoretical result (E[Y] = 10/11 ≈ 0.909 for n = 10) is very close to the simulated result.
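The value E[Y] = n/(n + 1) can be checked with a short seeded simulation in Python (analogous to the R experiment referenced above):

```python
import random

rng = random.Random(0)
n, n_sim = 10, 100_000
# Y = max of n independent U(0, 1) draws, simulated n_sim times
ys = [max(rng.random() for _ in range(n)) for _ in range(n_sim)]
print(sum(ys) / n_sim)  # theory: n/(n+1) = 10/11
```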
3.4 - Random Walk
A particle starts at the origin and jumps left (a step of −1) with probability p and jumps right (a step of +1) with probability 1 − p. After n jumps, the expected location is:
\[ E[X_n] = \sum_{i=1}^n E[\text{step}_i] = n\big((-1)p + (1)(1 - p)\big) = n(1 - 2p). \]
3.5
Tossing a fair coin until we get H. Finding the expected number of tosses. The reasoning is as
follows. We get H on the first toss with probability 1/2, first H on the second toss with probability
1/2² = 1/4, and in general the first H on toss k with probability 1/2^k. The expected number of tosses is therefore the infinite sum
\[ E[N] = \sum_{k=1}^{\infty} \frac{k}{2^k}. \]
Not delving into the mathematics of the infinite sum, but it can be shown that it equals 2, which is the expected number of tosses to get an H. Here is a numeric approximation in R.
> sumApprox = 0
> for ( k in 1:1000) {
+ sumApprox = sumApprox + k / 2^ k
+ }
> sumApprox
[1] 2
3.6 Theorem - The Rule of the Lazy Statistician
Proving the following result for the discrete case. Let Y = r(X), then
X
E[Y ] = E[r(X)] = r(x)fX (x)
x
Proof. Let \mathcal{X} and \mathcal{Y} be the sets of all values of X and Y. We have the function r : \mathcal{X} \to \mathcal{Y}; since r need not be one-to-one, define s(y) := r^{-1}(\{y\}) = \{x \in \mathcal{X} : r(x) = y\}, the preimage of y, for convenience. By definition of the expectation:
\begin{align*}
E[Y] &= \sum_{y \in \mathcal{Y}} y\,f_Y(y) = \sum_{y \in \mathcal{Y}} y\,P(Y = y) \\
&= \sum_{y \in \mathcal{Y}} y \sum_{x \in s(y)} P(X = x) \\
&= \sum_{y \in \mathcal{Y}} \sum_{x \in s(y)} y\,P(X = x) \\
&= \sum_{x \in \mathcal{X}} r(x)\,P(X = x) \\
&= \sum_{x \in \mathcal{X}} r(x)\,f_X(x)
\end{align*}
(In the second-to-last step we used that the preimages s(y) partition \mathcal{X}, and that y = r(x) for every x \in s(y).)
A simplified case to make the proof a bit easier to understand. We define the variables X and Y = r(X) = X², and use the following distributions:

  x    P(X = x)        y    P(Y = y)
 −1      1/4           0      1/4
  0      1/4           1      3/4
  1      1/2
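The simplified case can be computed mechanically in Python, evaluating E[r(X)] both ways (directly over the pmf of X, and via the induced pmf of Y):

```python
# pmf of X from the table above
pmf_x = {-1: 0.25, 0: 0.25, 1: 0.5}
r = lambda x: x * x  # Y = r(X) = X^2

# E[r(X)] computed directly over the pmf of X
lhs = sum(r(x) * p for x, p in pmf_x.items())

# E[Y] computed from the induced pmf of Y (summing over preimages)
pmf_y = {}
for x, p in pmf_x.items():
    pmf_y[r(x)] = pmf_y.get(r(x), 0.0) + p
rhs = sum(y * p for y, p in pmf_y.items())

print(lhs, rhs)  # both equal 0.75
```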
3.7
Let X be continuous with CDF F. Suppose that P(X > 0) = 1 and that E[X] exists. Show that
\[ E[X] = \int_0^\infty P(X > x)\,dx. \]
Proof. By definition of the expectation, and since the domain of X must be (0, ∞):
\[ E[X] = \int_0^\infty x\,f_X(x)\,dx = \int_0^\infty\left(\int_0^x dt\right)f_X(x)\,dx = \int_0^\infty\!\int_t^\infty f_X(x)\,dx\,dt = \int_0^\infty P(X > t)\,dt, \]
where switching the order of integration is justified because the integrand is nonnegative. This proves the result.
3.8
Let X_1, …, X_n be IID with mean µ = E[X_i] and variance σ² = Var(X_i), and let \bar X and S² denote the sample mean and sample variance. Show that
\[ E[\bar X] = \mu, \qquad \operatorname{Var}(\bar X) = \frac{\sigma^2}{n}, \qquad E[S^2] = \sigma^2. \]
Proof. Start with the expectation of the sample mean:
\[ E[\bar X] = E\left[\frac{1}{n}\sum_{i=1}^n X_i\right] = \frac{1}{n}\sum_{i=1}^n E[X_i] = \frac{1}{n}\sum_{i=1}^n \mu = \frac{n\mu}{n} = \mu. \]
Next, the variance of the sample mean, using independence:
\[ \operatorname{Var}(\bar X) = \frac{1}{n^2}\operatorname{Var}\left(\sum_{i=1}^n X_i\right) = \frac{1}{n^2}\sum_{i=1}^n \operatorname{Var}(X_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}. \]
Finally, the expectation of the sample variance. This calculation is a lot more involved.
\[ S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2 \implies (n - 1)E[S^2] = E\left[\sum_{i=1}^n (X_i - \bar X)^2\right] \]
We will use that \bar X = (1/n)\sum X_i means that \sum X_i = n\bar X. Running through the calculations:
\begin{align*}
(n - 1)E[S^2] &= E\left[\sum_{i=1}^n \big(X_i^2 - 2X_i\bar X + \bar X^2\big)\right] \\
&= E\left[\sum_{i=1}^n X_i^2\right] - E\left[2\bar X\sum_{i=1}^n X_i\right] + E\left[\bar X^2\sum_{i=1}^n 1\right] && (\bar X\text{ does not depend on } i) \\
&= nE\big[X_i^2\big] - 2nE\big[\bar X^2\big] + nE\big[\bar X^2\big] && (\textstyle\sum X_i = n\bar X) \\
&= nE\big[X_i^2\big] - nE\big[\bar X^2\big] \tag{3.8.1}
\end{align*}
From the variance identity Var(A) = E[A²] − E[A]²:
\[ E[X_i^2] = \operatorname{Var}(X_i) + E[X_i]^2 = \sigma^2 + \mu^2, \qquad E[\bar X^2] = \operatorname{Var}(\bar X) + E[\bar X]^2 = \frac{\sigma^2}{n} + \mu^2 \]
Replacing each of these into (3.8.1) and isolating E[S²]:
\begin{align*}
E[S^2] &= \frac{n}{n-1}\left(\sigma^2 + \mu^2 - \frac{\sigma^2}{n} - \mu^2\right) = \frac{n}{n-1}\left(\sigma^2 - \frac{\sigma^2}{n}\right) = \frac{n}{n-1}\cdot\sigma^2\cdot\frac{n-1}{n} = \sigma^2
\end{align*}
3.9
Computer experiment: studying the behavior of the running mean over 10,000 simulations from the standard normal distribution and from the Cauchy distribution.
As we can see, the running mean of the normal simulation quickly stabilizes to a value around 0, which is the expected value. The Cauchy distribution is a very heavy-tailed distribution and its expectation does not exist. Therefore, there is no mean it can stabilize to, and it keeps behaving erratically even after 10,000 simulations. The jumps that can be seen are extreme outliers added to the data, which are relatively common for the Cauchy distribution.
Results:
[Figure: running mean of 10,000 N(0, 1) draws (stabilizing near 0) and of 10,000 Cauchy draws (erratic, with large jumps).]
3.10
Let X ∼ N (0, 1) and define Y = eX . Find E[Y ] and Var(Y ).
One possible way to solve this is defining r(X) = eX and evaluating:
\[ E[Y] = E[r(X)] = \int_{\mathbb{R}} e^x f_X(x)\,dx = \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}}\exp\left(x - \frac{x^2}{2}\right)dx. \]
Completing the square in the exponential term:
\[ x - \frac{x^2}{2} = -\frac{1}{2}\big(x^2 - 2x\big) = -\frac{1}{2}\big(x^2 - 2x + 1 - 1\big) = -\frac{1}{2}(x - 1)^2 + \frac{1}{2} \]
We can rewrite the integral and integrate by substitution, u = x − 1, du = dx:
\[ E[Y] = e^{1/2}\,\underbrace{\frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}}\exp\left(-\frac{u^2}{2}\right)du}_{=1} = e^{1/2} \approx 1.6487 \]
Confirming with numerical simulation.
> N = 10000000; X = rnorm ( N ) ; Y = exp ( X )
> mean ( Y )
[1] 1.648564
> exp (0.5)
[1] 1.648721
> var ( Y )
[1] 4.676382
> exp (2) - exp (1)
[1] 4.670774
We can calculate the variance by first finding the second moment E[Y²]:
\[ E[Y^2] = E[r(X)^2] = \int_{\mathbb{R}} e^{2x} f_X(x)\,dx = \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}}\exp\left(2x - \frac{x^2}{2}\right)dx. \]
Just as above, we complete the square, 2x − x²/2 = −\frac{1}{2}(x − 2)² + 2, move e² outside the integral, and substitute u = x − 2, du = dx. We end up with a similar integral:
\[ E[Y^2] = e^2\,\underbrace{\frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}}\exp\left(-\frac{u^2}{2}\right)du}_{=1} = e^2 \]
Now we can calculate the variance of Y:
\[ \operatorname{Var}(Y) = E[Y^2] - E[Y]^2 = e^2 - e \approx 4.6708, \]
and as we can see in the simulation above, this corresponds to the simulated variance.
3.11
Simulating stocks. For each day, Y_i can be either −1 or 1, each with probability 1/2. We have X_n = \sum_{i=1}^n Y_i, which is the cumulative value of the stock.
(a) Calculating the expectation and variance of X_n. First we calculate them for Y_i:
\[ E[Y_i] = (-1)\tfrac{1}{2} + (1)\tfrac{1}{2} = 0, \qquad E[Y_i^2] = (-1)^2\tfrac{1}{2} + (1)^2\tfrac{1}{2} = 1, \qquad \operatorname{Var}(Y_i) = E[Y_i^2] - E[Y_i]^2 = 1 \]
Now for X_n, using linearity of expectation and independence of the Y_i for the variance:
\[ E[X_n] = E\left[\sum_{i=1}^n Y_i\right] = \sum_{i=1}^n E[Y_i] = 0, \qquad \operatorname{Var}(X_n) = \operatorname{Var}\left(\sum_{i=1}^n Y_i\right) = \sum_{i=1}^n \operatorname{Var}(Y_i) = n \]
(b) Simulating four stocks.
simStock <- function(N) {
  dailyMovements = sample(c(-1, 1), size = N, replace = TRUE)
  totalMovements = cumsum(dailyMovements)
  return(totalMovements)
}
N = 10000
s1 = simStock ( N ) ; s2 = simStock ( N )
s3 = simStock ( N ) ; s4 = simStock ( N )
plot (1: N , s1 , type = " l " ,
ylim = c ( -119 , 212) , # Adjust according to simulation results
main = " Stock simulation " )
lines (1: N , s2 , type = " l " , col = " red " )
lines (1: N , s3 , type = " l " , col = " blue " )
lines (1: N , s4 , type = " l " , col = " green " )
Stock simulation
The simulations have mean 0, so they should vary around the x-axis. Since the variance is n, it depends on time: the longer the simulation runs, the larger the typical deviation from 0 (the standard deviation is √n).
[Figure: four simulated stock paths over 10,000 steps.]
3.12
Deriving the general expression for the expectation and variance for a whole bunch of probability
distributions.
• Bernoulli with parameter p. Possible values: X = {0, 1}.
\[ f(x) = p^x(1 - p)^{1-x}, \qquad x \in \{0, 1\} \]
Expectation:
\[ E[X] = \sum_{x \in \mathcal{X}} x\,f(x) = (0)(1 - p) + (1)p = p \]
Second moment:
\[ E[X^2] = \sum_{x \in \mathcal{X}} x^2\,f(x) = (0)^2(1 - p) + (1)^2 p = p \]
Variance:
\[ \operatorname{Var}(X) = E[X^2] - E[X]^2 = p - p^2 = p(1 - p) \]
• Poisson with parameter λ. Possible values: \mathcal{X} = \{0, 1, 2, \ldots\}, with f(x) = e^{-\lambda}\lambda^x/x!. The expectation is E[X] = λ (the usual index-shift argument, as in the second-moment calculation below).
Second moment. We will split the sum into two Poisson sums.
\begin{align*}
E[X^2] &= \sum_{x=0}^{\infty} x^2\,e^{-\lambda}\frac{\lambda^x}{x!} = 0 + \lambda e^{-\lambda}\sum_{x=1}^{\infty} x\cdot\frac{\lambda^{x-1}}{(x-1)!} \\
&= \lambda e^{-\lambda}\sum_{x=1}^{\infty} (x - 1 + 1)\cdot\frac{\lambda^{x-1}}{(x-1)!} \\
&= \lambda e^{-\lambda}\left(\sum_{x=2}^{\infty} (x - 1)\frac{\lambda^{x-1}}{(x-1)!} + \sum_{x=1}^{\infty} \frac{\lambda^{x-1}}{(x-1)!}\right) \\
&= \lambda e^{-\lambda}\left(\lambda\sum_{x=2}^{\infty} \frac{\lambda^{x-2}}{(x-2)!} + \sum_{x=1}^{\infty} \frac{\lambda^{x-1}}{(x-1)!}\right) \\
&= \lambda e^{-\lambda}\left(\lambda\sum_{j=0}^{\infty} \frac{\lambda^j}{j!} + \sum_{k=0}^{\infty} \frac{\lambda^k}{k!}\right) \\
&= \lambda e^{-\lambda}\big(\lambda e^{\lambda} + e^{\lambda}\big) = \lambda^2 + \lambda
\end{align*}
The remaining calculation for the variance is easy.
Var(X) = E[X 2 ] − E[X]2 = λ2 + λ − λ2 = λ
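Both moments can be verified numerically from the pmf with a truncated sum in Python (λ = 3.5 is an arbitrary choice; 100 terms is far more than enough for convergence):

```python
import math

lam = 3.5
pmf = [math.exp(-lam) * lam**k / math.factorial(k) for k in range(100)]
m1 = sum(k * p for k, p in enumerate(pmf))          # E[X]
m2 = sum(k * k * p for k, p in enumerate(pmf))      # E[X^2]
print(m1, m2, m2 - m1 * m1)  # lam, lam^2 + lam, lam
```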
• Exponential with parameter β. Possible values: X = {x ∈ R : x > 0}.
\[ f(x) = \frac{1}{\beta}\exp(-x/\beta) \]
Expectation:
\[ E[X] = \int_0^\infty x\cdot\frac{1}{\beta}e^{-x/\beta}\,dx \]
We will integrate this with the substitution u = x/β, so du/dx = 1/β, which gives dx = β du. Note that the antiderivative of ue^{-u} is −ue^{-u} − e^{-u}.
\[ E[X] = \beta\int_0^\infty u e^{-u}\,du = \beta\Big[-ue^{-u} - e^{-u}\Big]_0^\infty = \beta\left(\lim_{u\to\infty}\big({-ue^{-u} - e^{-u}}\big) - (-1)\right) = \beta \]
Skipping the evaluation of the limit. The first term can be shown to converge to 0 by using
L’Hôpital’s rule. The second term in the limit obviously becomes 0.
Second moment. We will recover the expression for the expectation along the way. Integrating by parts:
\[ E[X^2] = \int_0^\infty x^2\cdot\frac{1}{\beta}e^{-x/\beta}\,dx = \Big[-x^2 e^{-x/\beta}\Big]_0^\infty + \int_0^\infty 2x\,e^{-x/\beta}\,dx = 0 + 2\beta\int_0^\infty x\cdot\frac{1}{\beta}e^{-x/\beta}\,dx = 2\beta E[X] = 2\beta^2 \]
The boundary term can be shown to converge to 0 by applying L'Hôpital's rule twice. We use a clever trick and substitute in the expression for the expectation. Finally, we calculate the variance:
\[ \operatorname{Var}(X) = E[X^2] - E[X]^2 = 2\beta^2 - \beta^2 = \beta^2 \]
• Gamma with parameters α, β. Possible values: X = {x ∈ R : x > 0}.
\[ f(x) = \frac{1}{\Gamma(\alpha)\beta^\alpha}x^{\alpha-1}\exp(-x/\beta) \]
We will use the definition of the Gamma function and its chaining (recurrence) property:
\[ \Gamma(\alpha + 1) = \int_0^\infty t^\alpha e^{-t}\,dt, \qquad \Gamma(\alpha + 1) = \alpha\Gamma(\alpha) \]
In step 3, we make the substitution t = x/β which makes x = βt. dt/dx = 1/β, so dx = βdt.
\begin{align*}
E[X] &= \int_0^\infty x\cdot\frac{1}{\Gamma(\alpha)\beta^\alpha}x^{\alpha-1}e^{-x/\beta}\,dx = \frac{1}{\Gamma(\alpha)\beta^\alpha}\int_0^\infty x^{\alpha}e^{-x/\beta}\,dx \\
&= \frac{1}{\Gamma(\alpha)\beta^\alpha}\int_0^\infty (\beta t)^{\alpha}e^{-t}\,\beta\,dt = \frac{\beta^{\alpha+1}}{\Gamma(\alpha)\beta^\alpha}\int_0^\infty t^{\alpha}e^{-t}\,dt \\
&= \frac{\beta\,\Gamma(\alpha + 1)}{\Gamma(\alpha)} = \frac{\beta\,\alpha\,\Gamma(\alpha)}{\Gamma(\alpha)} = \alpha\beta
\end{align*}
For the Gamma distribution, we will not calculate the second moment. Instead we will go straight
to the variance calculation and calculate it directly. We will use the same substitution and use the
definition of the Gamma function again.
\begin{align*}
\operatorname{Var}(X) &= E[X^2] - E[X]^2 = \int_0^\infty x^2\cdot\frac{1}{\Gamma(\alpha)\beta^\alpha}x^{\alpha-1}e^{-x/\beta}\,dx - \alpha^2\beta^2 \\
&= \frac{1}{\Gamma(\alpha)\beta^\alpha}\int_0^\infty x^{\alpha+1}e^{-x/\beta}\,dx - \alpha^2\beta^2 \\
&= \frac{1}{\Gamma(\alpha)\beta^\alpha}\int_0^\infty (\beta t)^{\alpha+1}e^{-t}\,\beta\,dt - \alpha^2\beta^2 = \frac{\beta^{\alpha+2}}{\Gamma(\alpha)\beta^\alpha}\int_0^\infty t^{\alpha+1}e^{-t}\,dt - \alpha^2\beta^2
\end{align*}
Using the definition of the Gamma function:
\begin{align*}
&= \frac{\beta^2\Gamma(\alpha+2)}{\Gamma(\alpha)} - \alpha^2\beta^2 = \frac{\beta^2\big(\Gamma(\alpha+2) - \alpha^2\Gamma(\alpha)\big)}{\Gamma(\alpha)} \\
&= \frac{\beta^2\big((\alpha+1)\alpha\Gamma(\alpha) - \alpha^2\Gamma(\alpha)\big)}{\Gamma(\alpha)} = \beta^2\big((\alpha+1)\alpha - \alpha^2\big) = \alpha\beta^2
\end{align*}
And the calculation for the Gamma distribution has been completed.
• Beta with parameters α, β. Possible values: \mathcal{X} = (0, 1), with density f(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}.
Expectation.
\begin{align*}
E[X] &= \int_0^1 x\cdot\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1 - x)^{\beta-1}\,dx = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\int_0^1 x^{\alpha}(1 - x)^{\beta-1}\,dx \\
&= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\cdot\frac{\Gamma(\alpha+1)\Gamma(\beta)}{\Gamma(\alpha+\beta+1)} = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\cdot\frac{\alpha\,\Gamma(\alpha)\Gamma(\beta)}{(\alpha+\beta)\,\Gamma(\alpha+\beta)} = \frac{\alpha}{\alpha+\beta}
\end{align*}
Variance. Just like for the Gamma distribution, we calculate the second moment directly in the
variance calculation. Will use a lot of the same tricks as when calculating the expectation.
\begin{align*}
\operatorname{Var}(X) &= E[X^2] - E[X]^2 = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\int_0^1 x^{\alpha+1}(1 - x)^{\beta-1}\,dx - \left(\frac{\alpha}{\alpha+\beta}\right)^2 \\
&= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\cdot\frac{\Gamma(\alpha+2)\Gamma(\beta)}{\Gamma(\alpha+\beta+2)} - \left(\frac{\alpha}{\alpha+\beta}\right)^2 \\
&= \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)} - \left(\frac{\alpha}{\alpha+\beta}\right)^2 \tag{3.12.1} \\
&= \frac{(\alpha^2+\alpha)(\alpha+\beta) - \alpha^2(\alpha+\beta+1)}{(\alpha+\beta)^2(\alpha+\beta+1)} \\
&= \frac{\alpha^3 + \alpha^2\beta + \alpha^2 + \alpha\beta - \alpha^3 - \alpha^2\beta - \alpha^2}{(\alpha+\beta)^2(\alpha+\beta+1)} = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}
\end{align*}
In step (3.12.1) we used the chaining property of the Gamma function:
\[ \Gamma(\alpha+2) = (\alpha+1)\Gamma(\alpha+1) = (\alpha+1)\alpha\Gamma(\alpha), \qquad \Gamma(\alpha+\beta+2) = (\alpha+\beta+1)(\alpha+\beta)\Gamma(\alpha+\beta). \]
This concludes the last calculation. I found this exercise to be a little tedious, to be honest...
3.13
A variable X is generated by assuming a value in U (0, 1) if a coin toss is 0 (H), and X is in U (3, 4)
if the coin toss is 1 (T). The coin is fair, so each of these has a probability of 1/2.
(a) Calculating the mean of X. This will be a conditional expectation, and we define Y to be the
coin toss. The conditional densities could depend on the outcome, but when we calculate them:
\[ f_{X|Y}(x \mid Y = 0) = \frac{1}{1 - 0} = 1, \qquad f_{X|Y}(x \mid Y = 1) = \frac{1}{4 - 3} = 1, \]
we find that f_{X|Y}(x \mid y) = 1 either way (on the respective intervals).
Now we can calculate the expected value for each outcome:
\[ E[X \mid Y = 0] = \int_0^1 x\,dx = \left[\frac{x^2}{2}\right]_0^1 = \frac{1}{2}, \qquad E[X \mid Y = 1] = \int_3^4 x\,dx = \left[\frac{x^2}{2}\right]_3^4 = \frac{16 - 9}{2} = \frac{7}{2} \]
Using that p_Y(y) is known, we can calculate the expected value of X:
\[ E[X] = E[E[X \mid Y]] = \sum_{y\in\{0,1\}} E[X \mid Y = y]\,P(Y = y) = \frac{1}{2}\cdot\frac{1}{2} + \frac{7}{2}\cdot\frac{1}{2} = \frac{1}{4} + \frac{7}{4} = 2 \]
(b) Calculating the second moment.
\[ E[X^2 \mid Y = 0] = \int_0^1 x^2\,dx = \frac{1}{3}, \qquad E[X^2 \mid Y = 1] = \int_3^4 x^2\,dx = \frac{4^3 - 3^3}{3} = \frac{37}{3} \]
\[ E[X^2] = E[E[X^2 \mid Y]] = \sum_{y\in\{0,1\}} E[X^2 \mid Y = y]\,P(Y = y) = \frac{1}{3}\cdot\frac{1}{2} + \frac{37}{3}\cdot\frac{1}{2} = \frac{38}{6} = \frac{19}{3} \]
Calculating the variance:
\[ \operatorname{Var}(X) = E[X^2] - E[X]^2 = \frac{19}{3} - 4 = \frac{7}{3} \approx 2.333 \]
And finally the standard deviation:
\[ \operatorname{SD}(X) = \sqrt{\operatorname{Var}(X)} = \sqrt{\frac{7}{3}} \approx 1.5275 \]
To verify these results, we did a numeric simulation.
N = 10000; Y = sample(c(0, 1), size = N, replace = TRUE); X = rep(0, N)
for (i in 1:N) {
  if (Y[i] == 0) X[i] = runif(1)                    # H: U(0, 1)
  else           X[i] = runif(1, min = 3, max = 4)  # T: U(3, 4)
}
> mean ( X )
[1] 2.01424
> var ( X )
[1] 2.332699
> sd ( X )
[1] 1.527318
3.14
Claim: Let X1 , . . . , Xm and Y1 , . . . , Yn be random variables and a1 , . . . , am and b1 , . . . , bn be
constants. Then:
\[ \operatorname{Cov}\left(\sum_{i=1}^m a_i X_i,\ \sum_{j=1}^n b_j Y_j\right) = \sum_{i=1}^m\sum_{j=1}^n a_i b_j\operatorname{Cov}(X_i, Y_j). \]
Proof. This can be verified with a direct proof. We will do a simplified example with m = n = 2,
which is not a formal proof, but can easily be extended to one. We will repeatedly use:
Cov(A, B) = E[AB] − E[A]E[B].
Starting with the covariance, we apply the identity above. We have, for m = n = 2:
Cov(a1 X1 + a2 X2 , b1 Y1 + b2 Y2 ) = E[(a1 X1 + a2 X2 )(b1 Y1 + b2 Y2 )] − E[a1 X1 + a2 X2 ]E[b1 Y1 + b2 Y2 ]
= S1 − S2
Multiplying and rewriting the first term:
S1 = E[(a1 X1 + a2 X2 )(b1 Y1 + b2 Y2 )] = E[a1 b1 X1 Y1 + a1 b2 X1 Y2 + a2 b1 X2 Y1 + a2 b2 X2 Y2 ]
= a1 b1 E[X1 Y1 ] + a1 b2 E[X1 Y2 ] + a2 b1 E[X2 Y1 ] + a2 b2 E[X2 Y2 ]
The second term:
S2 = E[a1 X1 + a2 X2 ]E[b1 Y1 + b2 Y2 ] = (a1 E[X1 ] + a2 E[X2 ]) (b1 E[Y1 ] + b2 E[Y2 ])
= a1 b1 E[X1 ]E[Y1 ] + a1 b2 E[X1 ]E[Y2 ] + a2 b1 E[X2 ]E[Y1 ] + a2 b2 E[X2 ]E[Y2 ]
Subtracting:
S1 − S2 = a1 b1 E[X1 Y1 ] + a1 b2 E[X1 Y2 ] + a2 b1 E[X2 Y1 ] + a2 b2 E[X2 Y2 ]
− a1 b1 E[X1 ]E[Y1 ] − a1 b2 E[X1 ]E[Y2 ] − a2 b1 E[X2 ]E[Y1 ] − a2 b2 E[X2 ]E[Y2 ]
= a1 b1 (E[X1 Y1 ] − E[X1 ]E[Y1 ]) + a1 b2 (E[X1 Y2 ] − E[X1 ]E[Y2 ])
+ a2 b1 (E[X2 Y1 ] − E[X2 ]E[Y1 ]) + a2 b2 (E[X2 Y2 ] − E[X2 ]E[Y2 ])
= a1 b1 Cov(X1 , Y1 ) + a1 b2 Cov(X1 , Y2 ) + a2 b1 Cov(X2 , Y1 ) + a2 b2 Cov(X2 , Y2 )
\[ = \sum_{i=1}^2\sum_{j=1}^2 a_i b_j\operatorname{Cov}(X_i, Y_j) \]
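Because sample covariance is bilinear in exactly the same way, the identity can be checked on fixed toy data in Python (all numbers below are arbitrary; m = n = 2):

```python
def cov(u, v):
    """Population-style covariance of two equal-length samples."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / n

X = [[1.0, 2.0, 4.0, 0.0], [3.0, 1.0, 2.0, 5.0]]  # two X-samples
Y = [[2.0, 2.0, 1.0, 3.0], [0.0, 4.0, 1.0, 1.0]]  # two Y-samples
a, b = [2.0, -1.0], [0.5, 3.0]

# Left side: covariance of the two linear combinations
lhs_u = [sum(a[i] * X[i][k] for i in range(2)) for k in range(4)]
lhs_v = [sum(b[j] * Y[j][k] for j in range(2)) for k in range(4)]
lhs = cov(lhs_u, lhs_v)

# Right side: double sum of weighted pairwise covariances
rhs = sum(a[i] * b[j] * cov(X[i], Y[j]) for i in range(2) for j in range(2))
print(lhs, rhs)  # identical
```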
3.15
Defining the joint distribution:
\[ f_{X,Y}(x, y) = \begin{cases} \frac{1}{3}(x + y) & 0 \le x \le 1,\ 0 \le y \le 2 \\ 0 & \text{otherwise} \end{cases} \]
We will calculate Var(2X − 3Y + 8). Constants disappear in the variance, and by Theorem 2.30:
\[ \operatorname{Var}(2X - 3Y + 8) = 4\operatorname{Var}(X) + 9\operatorname{Var}(Y) - 12\operatorname{Cov}(X, Y) \]
The marginal density of X is f_X(x) = \int_0^2 \frac{x + y}{3}\,dy = \frac{2x + 2}{3}. Expectation:
\[ E[X] = \int_0^1 x\cdot\frac{2x + 2}{3}\,dx = \left[\frac{2x^3}{9} + \frac{x^2}{3}\right]_0^1 = \frac{5}{9} \]
Second moment:
\[ E[X^2] = \int_0^1 x^2\cdot\frac{2x + 2}{3}\,dx = \left[\frac{x^4}{6} + \frac{2x^3}{9}\right]_0^1 = \frac{7}{18} \]
Variance of X:
\[ \operatorname{Var}(X) = E[X^2] - E[X]^2 = \frac{7}{18} - \frac{25}{81} = \frac{13}{162} \]
Calculating the expectation of Y. The marginal density is f_Y(y) = \int_0^1 \frac{x + y}{3}\,dx = \frac{1}{6} + \frac{y}{3}.
\[ E[Y] = \int_0^2 y\left(\frac{1}{6} + \frac{y}{3}\right)dy = \left[\frac{y^2}{12} + \frac{y^3}{9}\right]_0^2 = \frac{1}{3} + \frac{8}{9} = \frac{11}{9} \]
Second moment:
\[ E[Y^2] = \int_0^2 y^2\left(\frac{1}{6} + \frac{y}{3}\right)dy = \left[\frac{y^3}{18} + \frac{y^4}{12}\right]_0^2 = \frac{4}{9} + \frac{4}{3} = \frac{16}{9} \]
Variance of Y:
\[ \operatorname{Var}(Y) = E[Y^2] - E[Y]^2 = \frac{16}{9} - \frac{121}{81} = \frac{23}{81} \]
From the definition of the covariance, Cov(X, Y) = E[XY] − E[X]E[Y]. The cross moment is
\[ E[XY] = \int_0^1\!\int_0^2 xy\cdot\frac{x + y}{3}\,dy\,dx = \frac{1}{3}\int_0^1 x\left(2x + \frac{8}{3}\right)dx = \frac{1}{3}\left(\frac{2}{3} + \frac{4}{3}\right) = \frac{2}{3}, \]
so Cov(X, Y) = 2/3 − (5/9)(11/9) = 54/81 − 55/81 = −1/81. Putting everything together:
\[ \operatorname{Var}(2X - 3Y + 8) = 4\cdot\frac{13}{162} + 9\cdot\frac{23}{81} - 12\cdot\left(-\frac{1}{81}\right) = \frac{26 + 207 + 12}{81} = \frac{245}{81} \approx 3.025. \]
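Using the decomposition Var(2X − 3Y + 8) = 4 Var(X) + 9 Var(Y) − 12 Cov(X, Y), a midpoint-rule integration of the joint density in Python (grid size is an arbitrary choice) gives ≈ 3.025 = 245/81:

```python
# Midpoint-rule moments for f(x, y) = (x + y)/3 on [0, 1] x [0, 2]
n = 400
hx, hy = 1.0 / n, 2.0 / n
Ex = Ey = Ex2 = Ey2 = Exy = 0.0
for i in range(n):
    x = (i + 0.5) * hx
    for j in range(n):
        y = (j + 0.5) * hy
        w = (x + y) / 3.0 * hx * hy  # density times cell area
        Ex += x * w;  Ey += y * w
        Ex2 += x * x * w;  Ey2 += y * y * w
        Exy += x * y * w

var = 4 * (Ex2 - Ex**2) + 9 * (Ey2 - Ey**2) - 12 * (Exy - Ex * Ey)
print(var)  # close to 245/81
```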
3.16
Let r(x) and s(y) be functions of x and y. Then:
3.17
Proving that
Var(Y ) = E[Var(Y |X)] + Var(E[Y |X]).
Proof. Will disregard the hints. Write
\[ \operatorname{Var}(Y) = E\big[(Y - E[Y])^2\big] = E\Big[\big((Y - E[Y\mid X]) + (E[Y\mid X] - E[Y])\big)^2\Big] = E_1 + 2E_2 + E_3, \]
where E_1 = E[(Y − E[Y|X])²], E_2 = E[(Y − E[Y|X])(E[Y|X] − E[Y])] and E_3 = E[(E[Y|X] − E[Y])²].
Consider each of these in turn. Using the definition of the conditional variance, with µ(x) = E[Y|X = x]:
\[ E_1 = E\big[E[(Y - E[Y\mid X])^2\mid X]\big] = E\left[\int (y - \mu(x))^2 f(y\mid x)\,dy\right] = E[\operatorname{Var}(Y\mid X)] \]
For E_3, we define M = E[Y|X] and µ_M = E[M] = E[E[Y|X]] = E[Y]. In this form we can use the definition of variance: E_3 = E[(M − µ_M)²] = Var(M) = Var(E[Y|X]).
Finally, we consider E_2. Using iterated expectation and conditioning on X, E[Y|X] and E[Y] become constants with respect to the inner expectation. Define C := E[Y|X] − E[Y] to clarify:
\begin{align*}
E_2 &= E\big[E[(Y - E[Y\mid X])(E[Y\mid X] - E[Y])\mid X]\big] = E\big[E[(Y - E[Y\mid X])\,C\mid X]\big] \\
&= E\big[C\,E[Y - E[Y\mid X]\mid X]\big] = E\big[C\,(E[Y\mid X] - E[Y\mid X])\big] = E[C\cdot 0] = 0
\end{align*}
Var(Y ) = E1 + 2E2 + E3
= E[Var(Y |X)] + 2(0) + Var(E[Y |X])
= E[Var(Y |X)] + Var(E[Y |X])
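The decomposition can be verified exactly on a small finite joint pmf in Python (the pmf below is an arbitrary example):

```python
# Finite joint pmf p[(x, y)]; exact check of the total-variance decomposition
p = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 2): 0.4}

px = {}
for (x, y), pr in p.items():
    px[x] = px.get(x, 0.0) + pr

def e_y_given(x):
    return sum(y * pr for (xx, y), pr in p.items() if xx == x) / px[x]

def var_y_given(x):
    m = e_y_given(x)
    return sum((y - m) ** 2 * pr for (xx, y), pr in p.items() if xx == x) / px[x]

ey = sum(y * pr for (_, y), pr in p.items())
var_y = sum((y - ey) ** 2 * pr for (_, y), pr in p.items())

e_var = sum(px[x] * var_y_given(x) for x in px)          # E[Var(Y|X)]
var_e = sum(px[x] * (e_y_given(x) - ey) ** 2 for x in px)  # Var(E[Y|X])
print(var_y, e_var + var_e)  # both sides agree
```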
3.18
If E[X|Y = y] = c for some constant c, then X and Y are uncorrelated.
Proof. Assume E[X|Y = y] = c for a constant c, so E[X|Y] = c as a random variable and E[X] = E[E[X|Y]] = c. By iterated expectation,
\[ E[XY] = E[E[XY\mid Y]] = E[Y\,E[X\mid Y]] = E[cY] = c\,E[Y], \]
so Cov(X, Y) = E[XY] − E[X]E[Y] = cE[Y] − cE[Y] = 0, and X and Y are uncorrelated.
3.19
Studying the sampling distribution of the mean for the Uniform(0, 1) distribution. If X ∼ U(0, 1) then E[X] = 1/2 and Var(X) = 1/12. We will not plot f_X since it is simply the constant line at height 1.
From Theorem 3.17, E[\bar X_n] = µ = 1/2 and Var(\bar X_n) = σ²/n = 1/(12n).
Plotting these as functions of n, mean in blue and variance in red: the mean remains at 1/2, but the variance becomes much smaller as n increases.
[Figure: the mean 1/2 and the variance 1/(12n) as functions of n.]
For each n = 1, 5, 25, 100 we run 1000 simulations and make a histogram of the resulting sample means; see the next page. The results are centered around the mean, 1/2, but the spread becomes smaller as n increases. We are basically seeing the central limit theorem in action.
[Figure: histograms of 1000 simulated sample means for n = 1, 5, 25, 100.]
3.20
If a is a vector and X is a random vector with mean µ and variance Σ, then E[aT X] = aT µ and
Var(aT X) = aT Σa. If A is a matrix then E[AX] = Aµ and Var(AX) = AΣAT .
Proof. Since a^T X = \sum_i a_i X_i is a scalar, linearity of expectation gives
\[ E[a^T X] = E\left[\sum_{i=1}^n a_i X_i\right] = \sum_{i=1}^n a_i E[X_i] = \sum_{i=1}^n a_i\mu_i = a^T\mu. \]
For the variance, the bilinearity of covariance (exercise 3.14) gives
\[ \operatorname{Var}(a^T X) = \operatorname{Cov}\left(\sum_{i=1}^n a_i X_i,\ \sum_{j=1}^n a_j X_j\right) = \sum_{i=1}^n\sum_{j=1}^n a_i a_j\operatorname{Cov}(X_i, X_j) = \sum_{i=1}^n\sum_{j=1}^n a_i\,\Sigma_{ij}\,a_j = a^T\Sigma a, \]
where Σ is the covariance matrix of X, i.e. Σ_{ij} = Cov(X_i, X_j).
The same idea applies to a matrix A: the k-th entry of AX is a_k^T X, where a_k^T is the k-th row of A, so E[AX] = Aµ entrywise. Likewise, the (k, l) entry of Var(AX) is Cov(a_k^T X, a_l^T X) = a_k^T Σ a_l, which is exactly the (k, l) entry of AΣA^T.
3.21
Let X and Y be random variables. If E[Y |X] = X, then Cov(X, Y ) = Var(X).
Proof. Applying the covariance identity and using iterated expectation: E[Y] = E[E[Y|X]] = E[X], and
\[ E[XY] = E[E[XY\mid X]] = E[X\,E[Y\mid X]] = E[X^2], \]
so Cov(X, Y) = E[XY] − E[X]E[Y] = E[X²] − E[X]² = Var(X).
3.22
Let X ∼ U (0, 1) and 0 < a < b < 1. Define:
\[ Y = \begin{cases} 1 & 0 < X < b \\ 0 & \text{otherwise} \end{cases}, \qquad Z = \begin{cases} 1 & a < X < 1 \\ 0 & \text{otherwise} \end{cases} \]
(a) Are Y and Z independent? One way of thinking about independence is whether knowing something about Y tells us anything about Z, or vice versa. Here it does: since a < b, the event Y = 0 means X ≥ b, which forces X > a and hence Z = 1. Checking formally,
\[ P(Y = 0, Z = 0) = P(X \ge b,\ X \le a) = 0, \qquad P(Y = 0)P(Z = 0) = (1 - b)\,a > 0, \]
so the two probabilities disagree and Y and Z are not independent.
(b) Finding E[Y|Z]. Since Y is an indicator, E[Y|Z = z] = P(Y = 1|Z = z). Using that X ∼ U(0, 1), so probabilities are interval lengths:
\[ E[Y\mid Z = 0] = P(X < b\mid X \le a) = 1 \quad (\text{since } a < b), \]
\[ E[Y\mid Z = 1] = P(X < b\mid X > a) = \frac{P(a < X < b)}{P(X > a)} = \frac{b - a}{1 - a}. \]
The two values differ, which again confirms that Z carries information about Y.
3.23
Finding the moment generating function for the Poisson, Normal and Gamma distributions. Recall
the definition of the moment generating function:
\[ M(t) = E[e^{tX}] = \begin{cases} \displaystyle\sum_x e^{tx}p(x) & \text{discrete with pmf } p(x) \\[2ex] \displaystyle\int_{-\infty}^{\infty} e^{tx}f(x)\,dx & \text{continuous with pdf } f(x) \end{cases} \]
• Poisson. Assuming some arbitrary λ.
\[ M(t) = E[e^{tX}] = \sum_{n=0}^{\infty} e^{tn}\,e^{-\lambda}\frac{\lambda^n}{n!} = e^{-\lambda}\sum_{n=0}^{\infty}\frac{(\lambda e^t)^n}{n!} = e^{-\lambda}e^{\lambda e^t} = \exp\!\big(\lambda(e^t - 1)\big) \]
• Normal. Assuming some arbitrary µ and σ². The usual way of deriving the MGF for a normal distribution is to first do it for the standard normal distribution, since this is a lot easier. Denote this by M_Z(t). We will 'complete the square'.
\begin{align*}
M_Z(t) &= E[e^{tZ}] = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty e^{tx}e^{-x^2/2}\,dx = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty \exp\left(-\frac{x^2}{2} + tx\right)dx \\
&= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty \exp\left(-\frac{x^2}{2} + tx - \frac{t^2}{2} + \frac{t^2}{2}\right)dx \\
&= e^{t^2/2}\,\frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty \exp\left(-\frac{1}{2}(x - t)^2\right)dx = e^{t^2/2}
\end{align*}
The integration is done by substituting u = x − t which leads to the usual standard pdf, which
integrates to 1. Hence, we have the MGF for the standard normal distribution. Using the MGF
for the standard normal distribution, we can find it for a general normal distribution.
\[ M_X(t) = E[e^{tX}] = E[e^{t(\mu + \sigma Z)}] = e^{t\mu}\,E[e^{t\sigma Z}] = e^{t\mu}M_Z(t\sigma) = e^{t\mu}e^{(t\sigma)^2/2} = \exp\left(\frac{\sigma^2 t^2}{2} + \mu t\right) \]
So tedious!!
• Gamma. Assuming some arbitrary α and β.
\begin{align*}
M(t) &= E[e^{tX}] = \frac{1}{\Gamma(\alpha)\beta^\alpha}\int_0^\infty e^{tx}x^{\alpha-1}e^{-x/\beta}\,dx = \frac{1}{\Gamma(\alpha)\beta^\alpha}\int_0^\infty x^{\alpha-1}\exp\!\left(-\Big(\frac{1}{\beta} - t\Big)x\right)dx \\
&= \frac{1}{\Gamma(\alpha)\beta^\alpha}\cdot\frac{\Gamma(\alpha)}{\big(\frac{1}{\beta} - t\big)^{\alpha}} = \frac{1}{\big[\beta\big(\frac{1}{\beta} - t\big)\big]^{\alpha}} = \frac{1}{(1 - \beta t)^{\alpha}} = (1 - \beta t)^{-\alpha}, \qquad t < 1/\beta,
\end{align*}
where we used the definition of the Gamma function like in an earlier exercise.
3.24
Let X_1, \ldots, X_n ∼ Exp(β). Find the MGF of X_i. Prove that \sum_{i=1}^n X_i ∼ Gamma(n, β).
Finding the MGF for the exponential distribution with parameter β.
\begin{align*}
M(t) &= E[e^{tX}] = \frac{1}{\beta}\int_0^\infty e^{tx}e^{-x/\beta}\,dx = \frac{1}{\beta}\int_0^\infty \exp\!\big(-(1/\beta - t)x\big)\,dx \\
&= \frac{1}{\beta}\cdot\frac{1}{\frac{1}{\beta} - t} = \frac{1}{\beta}\cdot\frac{\beta}{1 - \beta t} = \frac{1}{1 - \beta t}, \qquad t < 1/\beta
\end{align*}
All X_i have the same MGF. If we now assume they are independent, and define Y = \sum_{i=1}^n X_i, then according to Lemma 3.31:
\[ M_Y(t) = \prod_{i=1}^n M_X(t) = \frac{1}{(1 - \beta t)^n} = (1 - \beta t)^{-n}, \]
which is the MGF for a Gamma distribution with parameters n and β, as we can see from the
previous exercise. Hence, Y ∼ Gamma(n, β).
4 Inequalities
Practical Example
Example 1 - Markov’s Inequality.
You hear that the mean age of NYU students is 20 years, but you know quite a few students that
are older than 30. You decide to apply Markov’s inequality to bound the fraction of students above
30 by modeling age as a nonnegative random variable A.
\[ P(A \ge 30) \le \frac{E[A]}{30} = \frac{2}{3}, \]
so at most two thirds of the students are over 30. This isn’t a very precise bound, but then we also
only use the expectation.
Example 2 - Chebyshev’s Inequality.
The previous bound is a little too weak. After investigating, we discover that the standard deviation of student age is actually just 3 years. Applying Chebyshev's inequality with this information, we obtain
\[ P(|A - E[A]| \ge 10) \le \frac{\operatorname{Var}(A)}{100} = \frac{9}{100}. \]
So at least 91% of the students are under 30 years old (and above 10).
Exercises
4.1
Let X ∼ Exp(β). As we found in exercise 3.12, µ = E[X] = β and σ 2 = Var(X) = β 2 . By applying
Chebyshev’s inequality:
\[ P(|X - \beta| \ge t) = P(|X - \mu| \ge t) \le \frac{\sigma^2}{t^2} = \frac{\beta^2}{t^2} \]
We are going to compare Chebyshev's inequality with the following inequality. Since σ > 0 and k > 0, we can apply Markov's inequality to (X − µ)² to get:
\[ P(|X - \mu| \ge k\sigma) = P\big((X - \mu)^2 \ge k^2\sigma^2\big) \le \frac{E[(X - \mu)^2]}{k^2\sigma^2} = \frac{\sigma^2}{k^2\sigma^2} = \frac{1}{k^2} \]
We recognize this as the corollary to Chebyshev's inequality. Comparing, using that σ = β, and writing t instead of k:
and using t instead of k:
\[ P(|X - \beta| \ge t) \le \frac{\beta^2}{t^2}, \qquad P(|X - \beta| \ge t\beta) \le \frac{1}{t^2} \]
If β > 1, the first bound becomes 'weaker', which makes sense since βt > t and the second bound concerns a more extreme deviation. If β < 1 the reverse is true. It is also interesting to note that scaling the bounded deviation by β changes the bound by a factor of β²: the bound depends nonlinearly on the scale for the Exponential distribution.
4.2
Let X ∼ Poisson(λ). As seen, E[X] = λ and Var(X) = λ. Applying Chebyshev's inequality to show that P(X ≥ 2λ) ≤ 1/λ:
\[ P(X \ge 2\lambda) = P(X - \lambda \ge \lambda) \le P(|X - \lambda| \ge \lambda) \le \frac{\lambda}{\lambda^2} = \frac{1}{\lambda}, \]
which shows that P(X ≥ 2λ) ≤ 1/λ.
4.3
Let X_1, \ldots, X_n ∼ Bernoulli(p) and define
\[ \bar X = \frac{1}{n}\sum_{i=1}^n X_i. \]
Bound P(|\bar X − p| > ε) with Chebyshev's and Hoeffding's inequalities, and show that when n is large, the Hoeffding bound is smaller than the Chebyshev bound. (Assuming ε > 0.)
For the Bernoulli distribution, E[Xi ] = p and Var(Xi ) = p(1 − p). Since we are using a statistic,
E[X] = p and Var(X) = p(1 − p)/n as per Theorem 3.17. Applying Chebyshev’s inequality:
8n2 82
lim Ln = lim 2 = lim 2 = 0
n→∞ n→∞ e2n n→∞ 22 e2n
Since we lose the dependence on n in the numerator, the fraction goes to 0 as n grows. So as n
becomes large, Ln becomes 0, which means that the Hoeffding bound becomes smaller than the
Chebyshev bound.
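The vanishing ratio is easy to illustrate numerically (ε = 0.1 is an arbitrary choice here):

```r
# Ratio of the Hoeffding bound to the Chebyshev bound: L_n = 8*n*eps^2*exp(-2*n*eps^2)
eps <- 0.1
n <- c(10, 100, 1000, 5000)
chebyshev <- 1 / (4 * n * eps^2)       # using p(1 - p) <= 1/4
hoeffding <- 2 * exp(-2 * n * eps^2)
round(rbind(chebyshev, hoeffding, ratio = hoeffding / chebyshev), 6)
```

Note that for small nε² the Hoeffding bound can actually be the larger of the two; it only wins once nε² is large, and then overwhelmingly so.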
4.4
Let X1 , . . . , Xn ∼ Bernoulli(p).
(a) For some α > 0, define:
ε = √( (1/(2n)) log(2/α) )  ⟹  ε² = (1/(2n)) log(2/α)
and set p̂ = n^{−1} Σ_{i=1}^n Xi. Now we define Cn = (p̂ − ε, p̂ + ε) and will show that the probability that Cn contains p is at least 1 − α, using Hoeffding's inequality.
Cn containing p means that:
p ∈ Cn ⟺ p̂ − ε ≤ p ≤ p̂ + ε ⟺ −ε ≤ p − p̂ ≤ ε ⟺ |p̂ − p| ≤ ε
P(p ∈ Cn) = P(|p̂ − p| ≤ ε) = 1 − P(|p̂ − p| > ε)
(the probability that p is in the interval Cn is 1 minus the probability that it is outside the interval, basically). We can use the bound found in Theorem 4.5 again; with the ε chosen above,
P(|p̂ − p| > ε) ≤ 2e^{−2nε²} = 2e^{−log(2/α)} = 2 · (α/2) = α
and therefore
1 − α ≤ 1 − P(|p̂ − p| > ε) = P(p ∈ Cn).
(b) Running some simulations: 1000 simulations for each of n = 10, 50, 100, 2000, checking how often p lies in the range p̂ ± ε.
See code and plots on the next page.
# 4.4(b): p = 0.4, alpha = 0.05
simBern <- function(n) {                 # Simulate Bernoulli samples
  NSIM <- 1000                           # Do 1000 simulations per n
  simulationList <- matrix(rep(0, n * NSIM), nrow = NSIM)
  for (i in 1:NSIM) {
    simulationList[i, 1:n] <- sample(c(1, 0), size = n,
                                     replace = TRUE, prob = c(0.4, 0.6))
  }
  return(simulationList)
}

countCoverage <- function(sim, alpha) {  # (alpha is unused here: the check
  N <- ncol(sim); NSIM <- nrow(sim)      #  uses a fixed half-width of 0.05)
  covRate <- rep(0, NSIM)
  for (i in 1:NSIM) {
    mn <- mean(sim[i, 1:N])
    if (abs(0.4 - mn) < 0.05) {
      covRate[i] <- 1
    }
  }
  hitRate <- sum(covRate) / NSIM; print(hitRate)
}
# Bernoulli simulations
bsim10 <- simBern(10); bsim50 <- simBern(50)
bsim100 <- simBern(100); bsim2000 <- simBern(2000)

> countCoverage(bsim10, alpha = 0.05)
[1] 0.268
> countCoverage(bsim50, alpha = 0.05)
[1] 0.552
> countCoverage(bsim100, alpha = 0.05)
[1] 0.684
> countCoverage(bsim2000, alpha = 0.05)
[1] 1
Plotting the coverage rate vs. n. As n increases, the interval will contain p more often.
[Plot: "n vs. Coverage" — coverage rate (from about 0.27 up to 1.0) against n = 10, 50, 100, 2000.]
(c) Investigating the relationship between n and the interval length 2|p̂ − p|. When does it become smaller than 0.05? According to the simulation results, that happens roughly when n passes 200, based on our 1000-per-n simulation.
# 4.4(c): Plotting length of interval
IntervalLength <- function(sim) {
  N <- ncol(sim); NSIM <- nrow(sim)
  intLength <- rep(0, NSIM)
  for (i in 1:NSIM) {
    mn <- mean(sim[i, 1:N])
    intLength[i] <- 2 * abs(0.4 - mn)
  }
  print(mean(intLength))
}
> IntervalLength(bsim10)
[1] 0.2408
> IntervalLength(bsim50)
[1] 0.10572
> IntervalLength(bsim100)
[1] 0.077
> IntervalLength(bsim200)
[1] 0.0547
> IntervalLength(bsim300)
[1] 0.04543333
> IntervalLength(bsim2000)
[1] 0.017746
4.5 - Theorem 4.7: Mill’s Inequality
Let Z ∼ N(0, 1) and t > 0. Then,
P(|Z| > t) ≤ √(2/π) · e^{−t²/2} / t.
Proof. From the hint, we can use the symmetry of the normal distribution: for t > 0,
P(|Z| > t) = 2P(Z > t) ⟹ (1/2)P(|Z| > t) = P(Z > t) ⟹ (t/2)P(|Z| > t) = tP(Z > t)
As laid out in the proof of the Markov inequality, we get:
(t/2)P(|Z| > t) = tP(Z > t) = t ∫_t^∞ f(x) dx ≤ ∫_t^∞ x f(x) dx
and this last integral can be evaluated directly:
∫_t^∞ x (1/√(2π)) e^{−x²/2} dx = [ −(1/√(2π)) e^{−x²/2} ]_t^∞ = (1/√(2π)) e^{−t²/2}.
Dividing both sides by t/2 gives
P(|Z| > t) ≤ (2/t)(1/√(2π)) e^{−t²/2} = √(2/π) · e^{−t²/2}/t.
4.6
Let Z ∼ N(0, 1). We simulate the probability P(|Z| > t) and plot the results. See code and plot on the next page.
We will also include the Markov bounds. Since the Markov inequality only applies to nonnegative random variables, we use it on |Z|. We find E[|Z|] by using that the standard normal density is symmetric, taking twice the integral over the positive half-line:
E[|Z|] = 2 ∫_0^∞ x · (1/√(2π)) e^{−x²/2} dx
       = √(2/π) [ −e^{−x²/2} ]_0^∞
       = √(2/π) (0 − (−1))
       = √(2/π)
(This can easily be verified numerically).
# Plotting P(|Z| > t) as a function of t
Z <- rnorm(1000000)
t <- 1:2000 / 400            # t goes from 0 to 5
P_Zgt <- rep(0, length(t))
for (i in 1:length(t)) {
  P_Zgt[i] <- sum(abs(Z) > t[i])
}
plot(t, P_Zgt / 1e6, type = "l",
     main = "P(|Z| > t) for t in (0,5)")
Results from the plot, including the bounds from the Markov and Mill inequalities:
[Plot: P(|Z| > t) for t ∈ (0, 5), together with Mill's bound and Markov's bound.]
The Markov inequality gives a very loose bound, but as we can see, Mill's inequality gives a bound that really hugs the P(|Z| > t) curve.
4.7
Let X1, . . . , Xn ∼ N(0, 1). Bound P(|X̄| > t) using Mill's inequality, where X̄ = n^{−1} Σ_{i=1}^n Xi, and compare to Chebyshev's bound.
We begin by applying Chebyshev's inequality. The sampling distribution of X̄ gives E[X̄] = 0 and Var(X̄) = σ²/n = 1/n:
P(|X̄| > t) = P(|X̄ − µ| > t) ≤ Var(X̄)/t² = 1/(nt²)
Now, the variable X̄ no longer has a standard normal distribution, as we noted above: X̄ ∼ N(0, 1/n). But √n X̄ ∼ N(0, 1), and since Mill's inequality only requires t > 0, we can apply it to √n X̄:
P(|X̄| > t) = P(√n |X̄| > √n t) ≤ √(2/π) · e^{−nt²/2} / (√n t).
This bound decays exponentially in n, whereas Chebyshev's bound decays only like 1/n.
5 Convergence of Random Variables
Definition 5.1 Types of convergence
Let X1 , X2 , . . . be a sequence of random variables and let X be another random variable. Let Fn
denote the CDF of Xn and let F denote the CDF of X.
1. Xn converges to X in probability, written Xn →P X, if for all ε > 0,
P(|Xn − X| > ε) → 0
as n → ∞.
2. Xn converges to X in distribution, written Xn ⇝ X, if
lim_{n→∞} Fn(t) = F(t)
at all t for which F is continuous.
3. Xn converges to X in quadratic mean, written Xn →qm X, if
E[(Xn − X)²] → 0
as n → ∞.
Relationship between the convergence types:
→qm ⟹ →P ⟹ ⇝
Exercises
5.1
Let X1, . . . , Xn be IID with finite mean µ and variance σ². Let X̄ be the sample mean and S²n the sample variance:
X̄ = (1/n) Σ_{i=1}^n Xi,  S²n = (1/(n−1)) Σ_{i=1}^n (Xi − X̄)²
We want to show S²n →P σ², i.e. that P(|S²n − σ²| > ε) → 0 as n → ∞.
But we can do it in a different way. Following the hint, we want to rewrite Sn2 so we can apply the
law of large numbers.
Doing the calculations on the next page.
S²n = (1/(n−1)) Σ_{i=1}^n (Xi − X̄)²
    = (n/(n−1)) · (1/n) Σ_{i=1}^n [ Xi² − 2XiX̄ + X̄² ]
Distributing the sum:
    = (n/(n−1)) [ (1/n) Σ Xi² − 2X̄ · (1/n) Σ Xi + X̄² ]
    = (n/(n−1)) [ (1/n) Σ Xi² − 2X̄² + X̄² ]
    = (n/(n−1)) [ (1/n) Σ Xi² − X̄² ]
Applying the WLLN to X̄ together with g(x) = x² gives X̄² →P µ², so the bracket has the same limit in probability as
    (1/n) Σ Xi² − µ²,
which in turn has the same limit as
    (1/n) Σ (Xi − µ)² = (1/n) Σ Xi² − 2µX̄ + µ²
(the two expressions differ by 2µ(µ − X̄), which tends to 0 in probability).
The last rewrite is a standard identity. Now define Yi = (Xi − µ)². By the WLLN,
Ȳ = (1/n) Σ_{i=1}^n (Xi − µ)² →P E[(Xi − µ)²] = Var(Xi) = σ².
Finally, we use the hint with cn = n/(n−1), which obviously tends to 1, so we can apply Theorem 5.5(d), and hence we have proved that S²n →P σ². (The hint says to use 5.5(e), but that is convergence in distribution, which I don't think is correct. Update: this is a misprint, which is noted in the errata2.pdf for the book.)
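A small simulation supports the result; the Exp(1) choice (true variance 1) and the replication counts are arbitrary:

```r
# Sample variance concentrating around the true variance (= 1) as n grows
set.seed(1)
for (n in c(10, 100, 10000)) {
  s2 <- replicate(500, var(rexp(n, rate = 1)))  # var() is the 1/(n-1) version
  cat(sprintf("n = %5d: mean(S2) = %.3f, sd(S2) = %.3f\n", n, mean(s2), sd(s2)))
}
```

The spread of S²n around σ² = 1 shrinks visibly as n increases, as convergence in probability predicts.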
5.2
qm
Let X1 , X2 . . . be a sequence of random variables. Show that Xn −−→ b if and only if
Proof.
qm
⇒) We assume Xn −−→ b, which means
E[(Xn − b)2 ] → 0, n → ∞.
84
By our assumption E[(Xn − b)2 ] → 0 as n → ∞, and so Var(Xn ) + (E[Xn ] − b)2 → 0 as n → ∞.
Since Var(Xn ) > 0 and (E[Xn ] − b)2 > 0, then we can conclude that
as n → ∞.
⇐) We assume
lim E[Xn ] = b, and lim Var[Xn ] = 0.
n→∞ n→∞
Then limn→∞ Var(Xn ) + (E[Xn ] − b)2 = 0. By reversing the calculations from the first part:
qm
lim Var(Xn ) + (E[Xn ] − b)2 = 0 =⇒ lim E[(Xn − b)2 ] = 0 =⇒ Xn −−→ b.
n→∞ n→∞
5.3
Let X1, . . . , Xn be IID with µ = E[Xi] and finite variance. Show that X̄n →qm µ.
Proof. Define:
X̄n = (1/n) Σ_{i=1}^n Xi.
By linearity of expectation, E[X̄n] = µ for every n (the WLLN gives X̄n →P µ, but convergence in probability alone says nothing about the moments), so
lim_{n→∞} E[X̄n] = µ.
The variance is finite, Var(Xi) = σ² < ∞, and for the sample mean this means:
lim_{n→∞} Var(X̄n) = lim_{n→∞} σ²/n = 0.
With this, we can simply apply the result from exercise 5.2, which shows that E[(X̄n − µ)²] → 0 as n → ∞, proving X̄n →qm µ.
5.4
Let X1, X2, . . . be a sequence of random variables such that
P(Xn = 1/n) = 1 − 1/n²,  and  P(Xn = n) = 1/n².
Does Xn converge in probability? In quadratic mean? As n becomes large, the probability of drawing n rather than 1/n becomes very small, but when it does happen the sequence produces an extreme outlier.
Starting by calculating the mean, second moment, and variance.
E[Xn] = (1/n)(1 − 1/n²) + n · (1/n²)
     = 1/n − 1/n³ + 1/n
     = 2/n − 1/n³
E[Xn²] = (1/n²)(1 − 1/n²) + n² · (1/n²)
      = 1/n² − 1/n⁴ + 1
From these expressions we can see that limn→∞ E[Xn ] = 0, but limn→∞ Var(Xn ) = 1. Since the
variance does not become 0, we know by the result in exercise 5.2 that this does NOT converge in
quadratic mean.
Checking whether Xn converges in probability. Fix some ε > 0 and apply Chebyshev's inequality with µ = E[Xn] and σ² = Var(Xn) = E[Xn²] − E[Xn]²:
P(|Xn − µ| > ε) ≤ σ²/ε² = (1 − 3/n² + 3/n⁴ − 1/n⁶)/ε²
Taking the limit n → ∞ on both sides:
lim_{n→∞} P(|Xn − µ| > ε) ≤ lim_{n→∞} (1 − 3/n² + 3/n⁴ − 1/n⁶)/ε² = 1/ε².
This bound does not tend to 0, but that alone is inconclusive, since Chebyshev only gives an upper bound. Arguing directly instead: for any ε > 0 and any n large enough that 1/n < ε,
P(|Xn − 0| > ε) = P(Xn = n) = 1/n² → 0,
so Xn DOES converge in probability, to 0, even though it does not converge in quadratic mean.
5.5
Let X1 , . . . , Xn ∼ Bernoulli(p). Prove that
(1/n) Σ_{i=1}^n Xi² →P p,  and  (1/n) Σ_{i=1}^n Xi² →qm p.
Proof. We define Yi := Xi² and exploit the simplicity of the Bernoulli distribution: since Xi only takes the values 0 and 1, Xi² = Xi, so Yi ∼ Bernoulli(p) with µ = E[Yi] = p and σ² = Var(Yi) = p(1 − p). For the sample mean of the Yi:
E[Ȳ] = p,  Var(Ȳ) = σ²/n = p(1 − p)/n.
Since the variance tends to 0 as n → ∞, we can use the result of exercise 5.3 to conclude that Ȳ →qm p, and by how we defined the Yi, (1/n) Σ_{i=1}^n Xi² →qm p. Since we have convergence in quadratic mean, it follows that we also have convergence in probability.
5.6
The height of men has mean 68 inches and standard deviation 2.6 inches. We have n = 100. Finding
the approximate probability that the average height in the sample will be at least 68 inches.
We can approximate this probability with the CLT (central limit theorem). We want the probability that the sample mean X̄ = (1/n) Σ_{i=1}^n Xi is at least as big as the population mean: P(X̄ ≥ µ).
P(X̄ ≥ µ) = 1 − P(X̄ ≤ µ)
By the central limit theorem, using n = 100, µ = 68 and σ = 2.6:
P(X̄ ≤ 68) = P(X̄ − 68 ≤ 0) = P( 10(X̄ − 68)/2.6 ≤ 0 ) ≈ P(Z ≤ 0) = 0.5
(since the standard normal distribution is symmetric and centered at 0). So we get
P(X̄ ≥ 68) ≈ 1 − 0.5 = 0.5.
5.7
Let λn = 1/n for all n and let Xn ∼ Poisson(λn).
(a) Showing that Xn →P 0. By properties of the Poisson distribution, we have
µ = E[Xn] = 1/n,  σ² = Var(Xn) = 1/n.
Fixing some ε > 0 and applying Chebyshev's inequality, we get:
P(|Xn − 1/n| > ε) = P(|Xn − µ| > ε) ≤ σ²/ε² = 1/(nε²)
Taking the limit on both sides:
lim_{n→∞} P(|Xn − 1/n| > ε) ≤ lim_{n→∞} 1/(nε²) = 0,
and since µ = 1/n → 0, we have shown that Xn →P 0.
(b) We define Yn = nXn and will show that Yn →P 0. Finding the mean and variance:
E[Yn] = E[nXn] = nE[Xn] = n · (1/n) = 1
Var(Yn) = Var(nXn) = n² Var(Xn) = n² · (1/n) = n
Chebyshev's inequality is of no use here, since the variance diverges, so we argue directly from the distribution instead. Yn = nXn is nonzero exactly when Xn ≥ 1, so for any ε > 0,
P(|Yn| > ε) ≤ P(Xn ≥ 1) = 1 − e^{−1/n} → 0
as n → ∞, and hence Yn →P 0. (Note that E[Yn] = 1 for every n: convergence in probability does not imply convergence of the means.)
5.8
A program has n = 100 pages of code. Let Xi ∼ Poisson(1) be IID, with Xi denoting the number of errors on page i, and let Y = Σ_{i=1}^n Xi denote the total number of errors. Use the CLT to approximate P(Y < 90).
The mean and variance are µ = E[Xi ] = 1 and σ 2 = Var(Xi ) = 1. An important observation: the
CLT applies to sample means, but in this case we are simply summing up 100 independent Poisson
variables, and so Y ∼ Poisson(100), i.e. E[Y ] = 100 and Var(Y ) = 100.
Let us define W = (1/n)Y, which means:
W = (1/n)Y = (1/n) Σ_{i=1}^n Xi,
a sample mean. Going the other way, we can find a normal approximation for Y by using that Y = nW, where n = 100:
E[Y] = E[nW] = nE[W] = n(1) = 100,
Var(Y) = Var(nW) = n² Var(W) = (100)² · (1/100) = 100
The standard deviation is √Var(Y) = 10. Approximating the probability:
P(Y < 90) = P(Y − 100 < −10) = P( (Y − 100)/10 < −1 ) ≈ P(Z ≤ −1) ≈ 0.1586
Note: this exercise demonstrates that the CLT is just an approximation, and under certain conditions it might not be a very accurate one. In 5.8.R a simulation repeating these conditions was run one million times; numerically, the probability P(Y < 90) turns out to be about 0.1467.
> # Approximating answer numerically
> length(Y)
[1] 1000000
> sum(Y < 90) / length(Y)
[1] 0.146773
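The file 5.8.R is not reproduced here, but the simulation it describes presumably amounts to something like the following sketch (the seed is an arbitrary choice):

```r
# Y ~ Poisson(100), since a sum of 100 iid Poisson(1) variables is Poisson(100)
set.seed(7)
Y <- rpois(1e6, lambda = 100)
mean(Y < 90)            # compare with the exact value ppois(89, 100)
```

The discrepancy with the CLT figure comes from the skewness and discreteness of the Poisson, which the normal approximation ignores (a continuity correction would land closer to the exact value).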
5.9
Suppose that P(X = 1) = P(X = −1) = 1/2 and define:
Xn = X   with probability 1 − 1/n,
Xn = e^n with probability 1/n.
E[Xn] = E[E[Xn | X]] = E[Xn | X = 1]P(X = 1) + E[Xn | X = −1]P(X = −1)
     = [ (1)(1 − 1/n) + e^n (1/n) ] (1/2) + [ (−1)(1 − 1/n) + e^n (1/n) ] (1/2)
     = e^n/n
Calculating the second moment the same way: since Xn² = 1 with probability 1 − 1/n and Xn² = e^{2n} with probability 1/n,
E[Xn²] = (1)(1 − 1/n) + e^{2n}(1/n) = 1 − 1/n + e^{2n}/n.
Both the expectation and the variance tend to infinity as n → ∞, so by the result in exercise 5.2, Xn does not converge in quadratic mean to any constant; and E[(Xn − X)²] = (1/n)E[(e^n − X)²] → ∞ rules out the natural limit X as well, so Xn does NOT converge in quadratic mean. It does, however, converge in probability — to the random variable X rather than to a constant — since P(Xn ≠ X) = 1/n → 0, so for every ε > 0, P(|Xn − X| > ε) ≤ 1/n → 0. (Chebyshev's inequality would have been of no help in proving this.)
Evaluating the density convergence later.
5.10
Let Z ∼ N(0, 1) and let t > 0. Show that for any k > 0,
P(|Z| > t) ≤ E[|Z|^k]/t^k.
Proof. Following the method used in the proof of Markov’s inequality. Will also use the hint from
4.5: P(|Z| > t) = 2P(Z > t) and what was found in 4.6 that the expected value of |Z| is two times
the integral from 0 to ∞.
For any t > 0 and k > 0:
E[|Z|^k] = 2 ∫_0^∞ z^k f(z) dz = 2 ∫_t^∞ z^k f(z) dz + 2 ∫_0^t z^k f(z) dz
        ≥ 2 ∫_t^∞ z^k f(z) dz ≥ 2t^k ∫_t^∞ f(z) dz = 2t^k P(Z > t) = t^k P(|Z| > t)
which implies
P(|Z| > t) ≤ E[|Z|^k]/t^k.
Comparing to Mill's inequality and Markov's inequality by extending the plot from exercise 4.6, plotting the bounds for k = 2, 4, 6, 8. This method is not quite as precise as Mill's inequality, but it gets very close once t > 3. The bound is not as good for smaller t, where even the Markov bound is better. (The dark green line is k = 8, and the Markov bound corresponds to k = 1.)
[Plot: P(|Z| > t) for t ∈ (0, 5), together with Mill's bound, Markov's bound, and the new bounds for k = 2, 4, 6, 8.]
5.11
Let Xn ∼ N(0, 1/n) and let X have the CDF
F(x) = 0 for x < 0,  F(x) = 1 for x ≥ 0.
The variable X has a point mass distribution at 0, and we want to see if Xn converges. Note that
µ = E[Xn ] = 0 and σ 2 = Var(Xn ) = 1/n and limn→∞ (1/n) = 0, so by 5.2, this actually converges
in quadratic mean to 0, which implies convergence in distribution.
Also showing it directly using Chebyshev's inequality. For any ε > 0:
P(|Xn − 0| > ε) ≤ σ²/ε² = 1/(nε²) → 0
as n → ∞. In conclusion, Xn →P X.
Next we investigate convergence in distribution, which must hold, since convergence in probability implies convergence in distribution. We can also see the trend by plotting the CDF of Xn for n = 1, 2, 5, 10, . . . , 5000, where n = 5000 is in black. Code for generating the plot is in 5.11.R.
[Plot: CDFs of the Xn steepening toward a unit step at 0 as n grows.]
To show convergence in distribution, we find an expression for the CDF of Xn. Using the definition of the CDF and standardising:
Fn(z) = P(Xn ≤ z) = P( (Xn − 0)/(1/√n) ≤ (z − 0)/(1/√n) ) = P(√n Xn ≤ √n z) = P(Z ≤ √n z),
where Z ∼ N(0, 1). As n → ∞, Fn(z) → 0 for any z < 0 and Fn(z) → 1 for any z > 0, so the limit agrees with F at every point where F is continuous, giving Fn → F and Xn ⇝ X as n → ∞.
5.12
Let X1, X2, . . . and X be random variables taking values in the positive integers. Show that Xn ⇝ X if and only if, for every integer k,
lim_{n→∞} P(Xn = k) = P(X = k).
⇒) Assume Xn ⇝ X. Since the variables are integer-valued, F is continuous at the half-integers, so Fn(k + 1/2) → F(k + 1/2) and Fn(k − 1/2) → F(k − 1/2). Hence
P(Xn = k) = Fn(k + 1/2) − Fn(k − 1/2) → F(k + 1/2) − F(k − 1/2) = P(X = k).
The same relation between point masses and the CDF holds for Fn as for F; here we additionally applied our assumption of convergence.
⇐) Now we assume that the limit of P(Xn = k) equals P(X = k) for all k, which means the probability mass functions converge pointwise. Then, for any k:
F(k) = P(X ≤ k) = Σ_{z=1}^k P(X = z)
     = Σ_{z=1}^k lim_{n→∞} P(Xn = z)
     = lim_{n→∞} Σ_{z=1}^k P(Xn = z)
     = lim_{n→∞} P(Xn ≤ k) = lim_{n→∞} Fn(k)
(We can change the order of the sum and limit since it is a finite sum). In conclusion, this shows
that limn→∞ Fn (k) = F (k) and so Xn X.
5.13
Let Z1 , Z2 , . . . be IID random variables with density f . Suppose P(Zi > 0) = 1 (the Zi are
nonnegative) and that λ = limx↓0 f (x) > 0. Define:
Xn = n min{Z1 , . . . , Zn }.
Let G denote the common CDF of the Zi. For y ≥ 0, the CDF of Xn is
Fn(y) = P(n min{Z1, . . . , Zn} ≤ y) = P(min{Z1, . . . , Zn} ≤ y/n).
This means that for at least one i we have Zi ≤ y/n. Another way of expressing this is through the complement, P(A) = 1 − P(Ac), which here is the event that ALL Zi > y/n. Using this, and using independence:
Fn(y) = 1 − P(Z1 > y/n)^n = 1 − [1 − G(y/n)]^n ≈ 1 − [1 − λy/n]^n → 1 − e^{−λy},
since G(y/n) ≈ λ · (y/n) for large n (because λ = lim_{x↓0} f(x)). We recognise this as the exponential CDF with parameter λ, so if we set Z ∼ Exp(λ) with CDF F_Z, then we have shown that Fn → F_Z as n → ∞, so Xn ⇝ Z.
5.14
Let X1, . . . , Xn ∼ U(0, 1) and let Yn = X̄n². We will find the limiting distribution of Yn.
By the CLT, X̄ ≈ N(µ, σ²/n), and for the uniform distribution:
µ = E[X] = (1 + 0)/2 = 1/2
σ² = Var(X) = (1 − 0)²/12 = 1/12
So X̄ ≈ N(1/2, 1/(12n)). To find the distribution of X̄², we apply the delta method with g(x) = x², which gives Yn ≈ N( g(µ), (g′(µ))² σ²/n ). Calculating: g(µ) = 1/4 and g′(µ) = 2µ = 1, so
Yn = X̄n² ≈ N( 1/4, 1/(12n) ).
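A simulation check of the delta-method answer (n = 200 and the replication count are arbitrary choices):

```r
# Mean and variance of Yn = (sample mean)^2 vs. the delta-method values 1/4 and 1/(12n)
set.seed(3)
n <- 200
Yn <- replicate(20000, mean(runif(n))^2)
c(mean(Yn), 1 / 4)            # should be close to each other
c(var(Yn), 1 / (12 * n))      # should be close to each other
```

Both the centre and the spread of the simulated Yn match the delta-method approximation well at this sample size.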
5.15
Let
(X11, X21), (X12, X22), . . . , (X1n, X2n)
be IID random vectors with mean µ = (µ1, µ2) and covariance matrix Σ, and define
X̄1 = (1/n) Σ_{i=1}^n X1i,  X̄2 = (1/n) Σ_{i=1}^n X2i.
We find the limiting distribution of X̄1/X̄2 by the multivariate delta method with g(x1, x2) = x1/x2, for which ∇g = (1/x2, −x1/x2²)ᵀ.
Calculating the delta-transformed variance, evaluated at (µ1, µ2):
∇g(µ1, µ2)ᵀ Σ ∇g(µ1, µ2) = (1/µ2, −µ1/µ2²) [ σ11 σ12 ; σ21 σ22 ] (1/µ2, −µ1/µ2²)ᵀ
 = σ11/µ2² − µ1σ12/µ2³ − µ1σ21/µ2³ + µ1²σ22/µ2⁴
 = (µ2²σ11 − µ1µ2σ12 − µ1µ2σ21 + µ1²σ22) / µ2⁴
With this we have found the limiting distribution:
√n ( X̄1/X̄2 − µ1/µ2 ) ⇝ N( 0, (µ2²σ11 − µ1µ2σ12 − µ1µ2σ21 + µ1²σ22)/µ2⁴ ).
5.16
Finding an example where Xn ⇝ X and Yn ⇝ Y, but Xn + Yn does not converge in distribution to X + Y.
Set X ∼ U(0, 1) and Y ∼ U(−1, 0), independent. Then define Xn ∼ U(0, 1) and Yn = −Xn, and denote the CDFs X ∼ F, Y ∼ G, Xn ∼ Fn and Yn ∼ Gn.
Then, for x ∈ (0, 1),
lim_{n→∞} Fn(x) = lim_{n→∞} x = x = F(x),
and likewise Gn = G (since −Xn ∼ U(−1, 0)), so Xn ⇝ X and Yn ⇝ Y. However, Xn + Yn = Xn − Xn = 0 for every n, while X + Y has a non-degenerate (triangular) distribution on (−1, 1), so Xn + Yn does not converge in distribution to X + Y.
6 Models, Statistical Inference and Learning
Definitions
The bias of an estimator is defined as
bias(θ̂n) = E_θ[θ̂n] − θ.
Exercises
6.1
Let X1, . . . , Xn ∼ Poisson(λ) and let λ̂ = n^{−1} Σ_{i=1}^n Xi. We will calculate the bias, se and MSE of the estimator.
For the Poisson distribution, we know that E[Xi] = λ and Var(Xi) = λ.
Calculating the bias:
bias(λ̂) = E_θ[ (1/n) Σ_{i=1}^n Xi ] − λ = (1/n) Σ_{i=1}^n E_θ[Xi] − λ = λ − λ = 0,
which shows that the sample mean is an unbiased estimator. Calculating the variance:
Var(λ̂) = Var( (1/n) Σ_{i=1}^n Xi ) = (1/n²) Σ_{i=1}^n Var(Xi) = λ/n
The standard error is its square root, se(λ̂) = √(λ/n) (estimated by √(λ̂/n)), and since the bias is 0, the MSE equals the variance: MSE(λ̂) = λ/n.
6.2
Let X1, . . . , Xn ∼ U(0, θ) and let θ̂ = max(X1, . . . , Xn). Find the bias, se and MSE.
For the U(0, θ) uniform distribution, E[Xi] = θ/2 and Var(Xi) = θ²/12. However, we are using the maximum, whose expectation we do not know yet. To calculate it we need the PDF, which requires the CDF. For each Xi the CDF is F(x) = x/θ on [0, θ]. Assuming independence, the CDF of θ̂, which we denote F_θ̂, is
F_θ̂(x) = P(max(X1, . . . , Xn) ≤ x) = Π_{i=1}^n P(Xi ≤ x) = (x/θ)^n,
and so the density is
f_θ̂(x) = F_θ̂′(x) = n x^{n−1}/θ^n.
Now we can calculate the expectation. (A simulated verification is in 6.2.R.)
E[θ̂] = ∫_0^θ x · (n x^{n−1}/θ^n) dx = (n/θ^n) ∫_0^θ x^n dx
     = (n/θ^n) [ x^{n+1}/(n+1) ]_0^θ = (n/(n+1)) (1/θ^n) θ^{n+1}
     = (n/(n+1)) θ
We will also need the variance, so we calculate the second moment:
E[θ̂²] = ∫_0^θ x² · (n x^{n−1}/θ^n) dx = (n/θ^n) ∫_0^θ x^{n+1} dx
      = (n/θ^n) [ x^{n+2}/(n+2) ]_0^θ = (n/(n+2)) θ²
Now we can calculate the variance (skipping the algebra). (A simulated verification is in 6.2.R.)
Var(θ̂) = E[θ̂²] − E[θ̂]²
       = (n/(n+2)) θ² − (n/(n+1))² θ²
       = n θ² / ( (n+2)(n+1)² )
Finding the bias:
bias(θ̂) = E[θ̂] − θ = (n/(n+1)) θ − θ = −θ/(n+1)
This estimator is biased: it systematically comes out slightly below θ. As the calculation shows, we could use the unbiased estimator ((n+1)/n) max(X1, . . . , Xn) instead.
Finding the standard error, which is just the square root of the variance:
se(θ̂) = √Var(θ̂) = √n θ / ( √(n+2) (n+1) )
Finally, the MSE is the squared bias plus the variance:
MSE(θ̂) = θ²/(n+1)² + n θ²/((n+2)(n+1)²) = 2θ² / ( (n+1)(n+2) ).
6.3
Let X1, . . . , Xn ∼ U(0, θ) and let θ̂ = 2X̄n. Find the bias, se and MSE.
Here the estimator is two times the sample mean, or written out explicitly:
θ̂ = (2/n) Σ_{i=1}^n Xi
with E[Xi] = θ/2 and Var(Xi) = θ²/12.
Finding the bias:
bias(θ̂) = E[ (2/n) Σ_{i=1}^n Xi ] − θ = (2/n) Σ_{i=1}^n E[Xi] − θ = 2 · (θ/2) − θ = θ − θ = 0
The variance is
Var(θ̂) = (4/n²) Σ_{i=1}^n Var(Xi) = (4/n²) · n · (θ²/12) = θ²/(3n),
and the standard error, se, is just its square root:
se(θ̂) = √Var(θ̂) = θ/√(3n)
Calculating the MSE: since the bias is 0, MSE(θ̂) = Var(θ̂) = θ²/(3n).
7 Estimating the CDF and Statistical Functionals
Definition 7.1 The empirical distribution function F̂n is the CDF that puts mass 1/n at each data point Xi:
F̂n(x) = ( Σ_{i=1}^n I(Xi ≤ x) ) / n
where I(·) is the indicator function.
Exercises
7.1 - Theorem 7.3
At any fixed x, E[F̂n(x)] = (1/n) Σ_{i=1}^n E[I(Xi ≤ x)] = (1/n) Σ_{i=1}^n P(Xi ≤ x) = F(x), so F̂n(x) is unbiased. For the variance:
Var(F̂n(x)) = Var( (1/n) Σ_{i=1}^n I(Xi ≤ x) ) = (1/n²) Σ_{i=1}^n Var[I(Xi ≤ x)]
           = (1/n²) Σ_{i=1}^n P(Xi ≤ x)(1 − P(Xi ≤ x))
           = P(Xi ≤ x)(1 − P(Xi ≤ x)) / n
           = F(x)(1 − F(x)) / n
MSE:
MSE = F(x)(1 − F(x))/n → 0
Proof. First we need the bias, but as the first result showed, this is an unbiased estimator:
bias(F̂n(x)) = E[F̂n(x)] − F(x) = F(x) − F(x) = 0
The MSE is the sum of the squared bias and the variance, but since the bias is 0, it equals the variance:
MSE = F(x)(1 − F(x))/n
As n → ∞, this tends to 0:
lim_{n→∞} MSE = lim_{n→∞} F(x)(1 − F(x))/n = 0
The empirical distribution function also converges in probability to F(x): by Chebyshev's inequality, for any ε > 0,
lim_{n→∞} P(|F̂n(x) − F(x)| > ε) ≤ lim_{n→∞} F(x)(1 − F(x))/(nε²) = 0.
This shows that F̂n(x) →P F(x).
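The variance formula above can also be checked by simulation (N(0, 1) data at x = 0, so F(x) = 1/2; the choices of n and replication count are arbitrary):

```r
# Empirical variance of Fn_hat(0) vs. F(0)(1 - F(0))/n = 0.25/n
set.seed(4)
n <- 50
Fn0 <- replicate(20000, mean(rnorm(n) <= 0))
c(var(Fn0), 0.25 / n)     # should be close to each other
```

F̂n(x) is just a binomial proportion, so this is the familiar p(1 − p)/n variance in disguise.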
7.2
For X1 , . . . , Xn ∼ Bernoulli(p) and Y1 , . . . , Ym ∼ Bernoulli(q), we will find the plug-in estimator
and estimated standard error for p, an approximate 90 percent confidence interval for p, the plug-in
estimator and standard error for p − q and approximate 90 percent confidence interval for p − q.
(Oh, is that all?)
The plug-in estimator is really just a term for using F̂n instead of F when calculating functionals such as the mean, median, etc. For estimating p in a Bernoulli distribution, it is simply the sample mean:
p̂ = (1/n) Σ_{i=1}^n Xi
Finding the standard error by plugging p̂ into the variance formula:
se(p̂) = √( p̂(1 − p̂)/n )
Consulting a standard normal table, z ≈ 1.645 leaves 0.95 of the mass below it, which corresponds to a two-sided 90% interval:
p̂ ± z_{α/2} se(p̂) = p̂ ± 1.645 · se(p̂)    (with α = 0.10)
Finding an estimator for p − q with the plug-in approach is simply subtracting the sample means:
p̂ − q̂ = (1/n) Σ_{i=1}^n Xi − (1/m) Σ_{j=1}^m Yj
Since the two samples are independent, the variances add, so the estimated standard error is
se(p̂ − q̂) = √( p̂(1 − p̂)/n + q̂(1 − q̂)/m ),
and an approximate 90 percent confidence interval is (p̂ − q̂) ± 1.645 · se(p̂ − q̂).
7.3
Program code found in 7.3.R. There is no technical appendix, so the confidence intervals are meant
to be made with the Dvoretzky-Kiefer-Wolfowitz inequality.
For a 95% confidence interval, we define
ε_n = √( (1/(2n)) log(2/α) )
where α = 0.05, and specify the bounds as (F̂n(x) − ε_n, F̂n(x) + ε_n). We simulate 100 X ∼ N(0, 1)
variables a total of 1000 times and calculate the confidence intervals for each bound, then we count
how many of the 1000 simulations make bounds that do not contain the true CDF. Illustration of
what the simulations look like. (Black is the true CDF, the colored are simulations).
On running the simulation, approximately 95% of the simulated bands contained the true CDF. Redoing it for the Cauchy distribution, the result is approximately 95% as well.
[1] "Proportion of normal distributions containing the CDF"
> sum(checkNorm) / length(checkNorm)
[1] 0.952
[1] "Proportion of Cauchy distributions containing the CDF"
> sum(checkCauchy) / length(checkCauchy)
[1] 0.949
7.4
Let X1, . . . , Xn ∼ F and let F̂n(x) be the empirical distribution function. For a fixed x, we will use the CLT to find the limiting distribution of F̂n(x).
We make the definition:
Ȳn = ( Σ_{i=1}^n I(Xi ≤ x) ) / n
to highlight the fact that F̂n(x) = Ȳn is a sample mean (of the indicator variables). By the CLT:
√n (Ȳn − µ)/σ ⇝ N(0, 1)
where µ = E[Ȳn] = F(x) and σ²/n = Var(Ȳn); equivalently,
Var(Ȳn) = Var(F̂n(x)) = F(x)[1 − F(x)]/n
In summary, replacing Ȳn by F̂n(x), we have shown:
F̂n(x) ≈ N( F(x), F(x)[1 − F(x)]/n )
We can also apply Chebyshev's inequality in this case. For any ε > 0,
lim_{n→∞} P(|F̂n(x) − F(x)| > ε) ≤ lim_{n→∞} F(x)[1 − F(x)]/(nε²) = 0,
which shows that F̂n(x) →P F(x), which again implies F̂n(x) ⇝ F(x).
7.5
Let x and y be two distinct points. Find Cov(F̂n(x), F̂n(y)).
First, introduce the shorthand (one indicator per observation):
Ix(i) := I(Xi ≤ x),  Iy(j) := I(Xj ≤ y).
Assuming independence of the Xi, Cov(Ix(i), Iy(j)) = 0 whenever i ≠ j. By definition:
Cov( F̂n(x), F̂n(y) ) = Cov( (1/n) Σ_{i=1}^n Ix(i), (1/n) Σ_{j=1}^n Iy(j) )
                    = (1/n²) Σ_{i=1}^n Σ_{j=1}^n Cov( Ix(i), Iy(j) )
                    = (1/n²) Σ_{i=1}^n Cov( Ix(i), Iy(i) )
The last step follows from the identity proved in exercise 3.14 together with independence, which makes every term with i ≠ j equal to 0, leaving a single sum. Now an intermediate calculation, assuming x < y (so that Ix(i) Iy(i) = Ix(i)):
Cov(Ix(i), Iy(i)) = E[Ix(i) Iy(i)] − E[Ix(i)]E[Iy(i)]
                 = E[Ix(i)] − E[Ix(i)]E[Iy(i)]      (since x < y)
                 = E[Ix(i)](1 − E[Iy(i)])
                 = P(Xi ≤ x)(1 − P(Xi ≤ y))
                 = F(x)(1 − F(y))
Returning to the main calculation:
Cov( F̂n(x), F̂n(y) ) = (1/n²) Σ_{i=1}^n Cov(Ix(i), Iy(i))
                    = (1/n²) · n · F(x)(1 − F(y))
                    = F(x)(1 − F(y)) / n.
This assumed x < y, but since the covariance is symmetric, the smaller of the two points can always play the role of x. Since F is unknown, the plug-in estimator is
F̂n(x)(1 − F̂n(y)) / n.
Note that this is very similar to the expression for the variance, Var(F̂n(x)) = F(x)(1 − F(x))/n.
7.6
Let X1, . . . , Xn ∼ F and let F̂n be the empirical distribution function. For a < b define θ = T(F) = F(b) − F(a) and set θ̂ = T(F̂n) = F̂n(b) − F̂n(a). Find the estimated standard error of θ̂ and an expression for the 1 − α confidence interval for θ.
Following example 7.15, the standard error of the difference is the square root of its variance:
se(θ̂) = √( Var(F̂n(b) − F̂n(a)) )
      = [ Var(F̂n(b)) + Var(F̂n(a)) − 2 Cov(F̂n(a), F̂n(b)) ]^{1/2}        (prop. of var.)
      = (1/√n) ( F(b)[1 − F(b)] + F(a)[1 − F(a)] − 2F(a)[1 − F(b)] )^{1/2}   (Thm 7.3, Ex. 7.5)
      = √( (F(b) − F(a))[1 − (F(b) − F(a))] / n )        (see notes)
      ≈ √( (F̂n(b) − F̂n(a))[1 − (F̂n(b) − F̂n(a))] / n )   (plug-in est.)
      = √( θ̂(1 − θ̂)/n )        (def. of θ̂)
Finding the confidence interval is easy: just find the corresponding z_{α/2} value and calculate
θ̂ ± z_{α/2} se(θ̂).
For the sake of clarity, here are the notes for the algebra step above. Writing b and a for F(b) and F(a):
b(1 − b) + a(1 − a) − 2a(1 − b) = b − b² + a − a² − 2a + 2ab
                              = (b − a) − (b² − 2ab + a²)
                              = (b − a) − (b − a)²
                              = (b − a)(1 − (b − a)).
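The whole calculation translates directly into a small R helper (the function and argument names here are hypothetical; x is the data vector):

```r
# Plug-in estimate and normal-approximation CI for theta = F(b) - F(a)
ciDiff <- function(x, a, b, alpha = 0.05) {
  n <- length(x)
  theta_hat <- mean(x <= b) - mean(x <= a)      # Fn_hat(b) - Fn_hat(a)
  se <- sqrt(theta_hat * (1 - theta_hat) / n)
  z <- qnorm(1 - alpha / 2)
  c(estimate = theta_hat, lower = theta_hat - z * se, upper = theta_hat + z * se)
}
set.seed(9)
ciDiff(rnorm(1000), a = -1, b = 1)   # estimate should be near pnorm(1) - pnorm(-1)
```

For standard normal data the true value is F(1) − F(−1) ≈ 0.683, so the estimate and interval should sit close to that.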
7.7
Code found in 7.7.R. Plot of empirical distribution and bounds:
[Plot: "Estimate of CDF" against x (mag), with confidence bounds.]
(The CDF jumps because there are only 1000 points and there are jumps in the data itself.)
Finding the confidence interval for F(4.9) − F(4.3). (Note: these results will differ depending on whether the empirical distribution is defined with < or ≤. These are the bounds for ≤, as the definition says.)
'95% Confidence Interval for F(4.9) - F(4.3):'
(0.495, 0.557)
7.8
Results for the ’Old Faithful’ dataset. Code found in 7.8.R.
Output from code:
> # Mean
> sprintf("Sample mean: %2.3f", Xbar)
[1] "Sample mean: 70.897"
> # Median
> sprintf("Median at: %2.f, with probability %2.2f", xvalMedian, Fhat(xvalMedian, of$waiting))
[1] "Median at: 76, with probability 0.53"
7.9
Comparing two trials of an antibiotic: 100 people are given the standard treatment and 90 recover; 100 are given the new treatment and 85 recover.
We will assume the two trials are independent, with the standard treatment X1, . . . , X100 ∼ Bernoulli(p1) and the new treatment Y1, . . . , Y100 ∼ Bernoulli(p2). These are the same conditions as in exercise 7.2.
If we let Xi = 1 and Yi = 1 denote recoveries and Xi = 0 and Yi = 0 no change, we can estimate the probabilities with the sample means:
p̂1 = (1/100) Σ_{i=1}^{100} Xi = 90/100 = 0.90
p̂2 = (1/100) Σ_{i=1}^{100} Yi = 85/100 = 0.85
For the confidence intervals for θ = p1 − p2, we find the corresponding z_{0.90} and z_{0.975} values and calculate (note that a difference of two proportions can be negative, so the lower limits need not be truncated at 0):
80% interval:  θ̂ ± z_{0.90} se(θ̂) = 0.05 ± 1.28 (0.046637) = (−0.0097, 0.1097)
95% interval:  θ̂ ± z_{0.975} se(θ̂) = 0.05 ± 1.96 (0.046637) = (−0.0414, 0.1414)
7.10
Code found in 7.10.R.
> theta     # Estimate of theta
[1] 277.3962
> seTheta   # Standard error of theta
[1] 388.522
> sprintf("CI: (%3.2f, %3.2f)", CIlwr, CIupr)
[1] "CI: (0.00, 601.90)"
8 The Bootstrap
Exercises