Lecture Notes

© All Rights Reserved


HEINRICH MATZINGER

Georgia Tech

E-mail: matzi@math.gatech.edu

October 7, 2014

Contents

1 Definition and basic properties
1.1 Events
1.2 Frequencies
1.3 Definition of probability
1.4 Direct consequences
1.5 Some inequalities
2.1 Law of total probability
2.2 Bayes' rule
3 Expectation
4.1 Matzinger's rule of thumb
5 Calculation with the Variance
5.1 Getting the big picture with the help of Matzinger's rule of thumb
6 Covariance and correlation
6.1 Correlation
7 Chebyshev's and Markov's inequalities
8 Combinatorics
9.1 Bernoulli variable
9.2 Binomial random variable
9.3 Geometric random variable
12 Distribution functions
15 Statistical testing
15.1 Looking up probabilities for the standard normal in a table
15.2 Two sample testing
16 Statistical estimation
16.1 An example
16.2 Estimation of variance and standard deviation
16.3 Maximum Likelihood estimation
16.4 Estimation of parameter for geometric random variables
17 Linear Regression
17.1 The case where the exact linear model is known
17.2 When and are not known
17.3 Where the formula for the estimates of and come from
17.4 Expectation and variance of
17.5 How precise are our estimates
17.6 Multiple factors and or polynomial regression
17.7 Other applications

1 Definition and basic properties

1.1 Events

Imagine that we throw a die which has 4 sides. The outcome of this experiment will be one of the four numbers: 1, 2, 3 or 4. The set of all possible outcomes in this case is:

$$\Omega = \{1, 2, 3, 4\}.$$

$\Omega$ is called the outcome space or sample space. Before doing the experiment we do not know what the outcome will be. Each possible outcome has a certain probability to occur.

We can use our die to make bets. Somebody might bet that the number will be even. We throw the die: if the number we see is 2 or 4, we say that the event "even" has occurred or has been observed. We can identify the event "even" with the set $\{2, 4\}$. This might seem a little abstract, but by identifying the event with a set, events become easier to handle: sets are well-known mathematical objects, whilst events as we know them from everyday language are not.

In a similar way one might bet that the outcome is a number greater than or equal to 3. This event is realized when we observe a 3 or a 4. The event "greater than or equal to 3" can thus be viewed as the set $\{3, 4\}$.

Another example is the event "odd". This is the set $\{1, 3\}$.

With this way of looking at things, events are simply subsets of $\Omega$. Take another example: a coin with a side 0 and a side 1. The outcome space or sample space in that case is:

$$\Omega = \{0, 1\}.$$

The events are the subsets of $\Omega$; in this case there are 4 of them:

$$\emptyset, \quad \{0\}, \quad \{1\}, \quad \{0, 1\}.$$

Example 1.1 It might at first seem very surprising that events can be viewed as sets. Consider for example the following sets: the set of bicycles which belong to a Georgia Tech student, the set of all skyscrapers in Atlanta, the set of all one-dollar bills which are currently in the US.

Let us give a couple of events: the event that after X-mas the unemployment rate is lower than now, the event that our favorite pet dies from a heart attack, the event that I go down with the flu next week.

At first, it seems that events are something very different from sets. Let us see in a real-world example how mathematicians view events as sets:

Assume that we are interested in where the American economy is going to stand exactly one year from now. More specifically, we look at unemployment and inflation and wonder if they will be above or below their current level. To describe the situation which we encounter a year from now, we introduce a two-digit variable Z = XY. Let X be equal to 1 if unemployment is higher in a year than its current level. If it is lower, let X be equal to 0. Similarly, let Y be equal to 1 if inflation is higher a year from now. If it is lower, let Y be equal to 0. The possible outcomes for Z are:

$$00, \quad 01, \quad 10, \quad 11.$$

This is the situation of a random experiment, where the outcome is one of the four possible numbers 00, 01, 10, 11. We do not know what the outcome will be, but each possibility can occur with a certain probability. Let A be the event that unemployment is higher in a year. This corresponds to the outcomes 10 and 11. We thus identify the event A with the set:

$$\{10, 11\}.$$

Let B be the event that inflation is higher a year from now. This corresponds to the outcomes 01 and 11. We thus view the event B as the set:

$$\{01, 11\}.$$

Recall that the intersection $A \cap B$ of two sets A and B is the set consisting of all elements contained in both A and B. In our example, the intersection of A and B is equal to $A \cap B = \{11\}$. Let C designate the event that unemployment goes up and that inflation goes up at the same time. This corresponds to the outcome 11. Thus, C is identified with the set $\{11\}$. In other words, $C = A \cap B$. The general rule which we must remember is:

For any events A and B, if C designates the event that A and B both occur at the same time, then $C = A \cap B$.

Let D be the event that unemployment or inflation will be up a year from now. (By "or" we mean that at least one of them is up.) This corresponds to the outcomes 01, 10, 11. Thus D gets identified with the set:

$$D = \{01, 10, 11\}.$$

Recall that the union $A \cup B$ of two sets A and B is defined to be the set consisting of all elements which are in A or in B. We see in our example that $D = A \cup B$. This is true in general. We must thus remember the following rule:

For any events A and B, if D designates the event that A or B occurs, then $D = A \cup B$.
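The two rules above can be tried out directly with set operations. The following is only a small sketch; representing each outcome of Z as a two-character string is our own choice and not part of the notes.

```python
# Outcomes of the two-digit variable Z = XY from Example 1.1,
# written as strings (an illustrative encoding).
omega = {"00", "01", "10", "11"}

# A: unemployment is higher in a year (first digit is 1).
A = {z for z in omega if z[0] == "1"}
# B: inflation is higher in a year (second digit is 1).
B = {z for z in omega if z[1] == "1"}

C = A & B  # "A and B both occur" is the intersection
D = A | B  # "A or B occurs" is the union

print(sorted(C))  # ['11']
print(sorted(D))  # ['01', '10', '11']
```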

1.2 Frequencies

Assume that we have a six-sided die. In this case the outcome space is

$$\Omega = \{1, 2, 3, 4, 5, 6\}.$$

The event "even" in this case is the set $\{2, 4, 6\}$, whilst the event "odd" is equal to $\{1, 3, 5\}$.

Instead of throwing the die only once, we throw it several times. As a result, instead of just a number, we get a sequence of numbers. When throwing the six-sided die I obtained the sequence:

$$1, 4, 3, 5, 2, 6, 3, 4, 5, 3, \ldots$$

When repeating the same experiment, which consists of throwing the die a couple of times, we are likely to obtain another sequence. The sequence we observe is a random sequence.

In this example we observe one 3 within the first 5 trials and three 3's within the first 10 trials. We write $n_{\{3\}}$ for the number of times we observe a 3 among the first n trials. In our example thus: for n = 5 we have $n_{\{3\}} = 1$, whilst for n = 10 we find $n_{\{3\}} = 3$.

Let A be an event. We denote by $n_A$ the number of times A occurred up to time n. Take for example A to be the event "even". In the above sequence, within the first 5 trials we obtained 2 even numbers. Thus for n = 5 we have $n_A = 2$. Within the first 10 trials we found 4 even numbers. Thus, for n = 10 we have $n_A = 4$. The proportion of even numbers $n_A/n$ for the first 5 trials is equal to 2/5 = 40%. For the first 10 trials, this proportion is 4/10 = 40%.
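The counts above can be checked mechanically. This is a minimal sketch; the helper name `count` is ours and purely illustrative.

```python
rolls = [1, 4, 3, 5, 2, 6, 3, 4, 5, 3]  # the sequence from the text

def count(event, n):
    """Number of times `event` (a set of outcomes) occurs in the first n rolls."""
    return sum(1 for r in rolls[:n] if r in event)

even = {2, 4, 6}
print(count({3}, 5), count({3}, 10))             # 1 3
print(count(even, 5) / 5, count(even, 10) / 10)  # 0.4 0.4
```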

1.3 Definition of probability

The basic definition of probability which we use is based on frequencies. For our definition of probability we need an assumption about the world surrounding us:

Let A designate an event. When we repeat the same random experiment independently many times, we observe that in the long run the proportion of times A occurs tends to stabilize. Whenever we repeat this experiment, the proportion $n_A/n$ in the long run tends to be the same number. A more mathematical way of formulating this is to say that $n_A/n$ converges to a number only depending on A, as n tends to infinity. This is our basic assumption.

Assumption As we keep repeating the same random experiment under the same conditions and such that each trial is independent of the previous ones, we find that: the proportion $n_A/n$ tends to a number which only depends on A, as $n \to \infty$.

We are now ready to give our definition of probability:

Definition 1.1 Let A be an event. Assume that we repeat the same random experiment under exactly the same conditions independently many times. Let $n_A$ designate the number of times the event A occurred within the first n repeats of the experiment. We define the probability of the event A to be the real number:

$$P(A) := \lim_{n \to \infty} \frac{n_A}{n}.$$

Thus, P(A) designates the probability of the event A. Take for example a four-sided, perfectly symmetric die. Because of symmetry, each side must have the same probability. In the long run we will see a 1 a fourth of the time, a 2 a fourth of the time, a 3 a fourth of the time and a 4 a fourth of the time. Thus, for the symmetric die the probability of each side is 0.25.
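The stabilization of $n_A/n$ can be observed in a simulation. The sketch below assumes that Python's pseudo-random generator is an adequate stand-in for a symmetric four-sided die; the function name and the seed are our own choices.

```python
import random

def relative_frequency(event, n, seed=0):
    """Proportion n_A / n of n simulated four-sided die throws that land in `event`."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n) if rng.randint(1, 4) in event)
    return hits / n

# The proportion for the event {1} should settle near 0.25 as n grows.
for n in (100, 10_000, 100_000):
    print(n, relative_frequency({1}, n))
```

The printed values depend on the seed, so no particular numbers are claimed here; only the trend towards 0.25 matters.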

1.4 Direct consequences

From our definition of probability there are several useful facts which follow immediately:

1. For any event A, we have that $P(A) \geq 0$.

2. For any event A, we have that $P(A) \leq 1$.

3. Let $\Omega$ designate the state space. Then $P(\Omega) = 1$.

Let us prove these elementary facts:

1. By definition $n_A/n \geq 0$. However, the limit of a sequence which is $\geq 0$ is also $\geq 0$. Since P(A) is by definition equal to the limit of the sequence $n_A/n$, we find that $P(A) \geq 0$.

2. By definition $n_A \leq n$. It follows that $n_A/n \leq 1$. The limit of a sequence which is always less than or equal to one must also be less than or equal to one. Thus, $P(A) = \lim_{n \to \infty} n_A/n \leq 1$.

3. By definition $n_\Omega = n$. Thus:

$$P(\Omega) = \lim_{n \to \infty} \frac{n_\Omega}{n} = \lim_{n \to \infty} \frac{n}{n} = \lim_{n \to \infty} 1 = 1.$$

The next two theorems are essential for solving many problems:

Theorem 1.1 Let A and B be disjoint events. Then:

$$P(A \cup B) = P(A) + P(B).$$

Proof. Let C be the event $C = A \cup B$. C is the event that A or B has occurred. Because A and B are disjoint, A and B cannot occur at the same time. Thus, when we count up to time n how many times C has occurred, we find that this is exactly equal to the number of times A has occurred plus the number of times B has occurred. In other words,

$$n_C = n_A + n_B. \tag{1.1}$$

From this it follows that:

$$P(C) = \lim_{n \to \infty} \frac{n_C}{n} = \lim_{n \to \infty} \frac{n_A + n_B}{n} = \lim_{n \to \infty} \left( \frac{n_A}{n} + \frac{n_B}{n} \right).$$

We know that the sum of limits is equal to the limit of the sum. Applying this to the right side of the last equality above yields:

$$\lim_{n \to \infty} \left( \frac{n_A}{n} + \frac{n_B}{n} \right) = \lim_{n \to \infty} \frac{n_A}{n} + \lim_{n \to \infty} \frac{n_B}{n} = P(A) + P(B).$$

Hence:

$$P(C) = P(A \cup B) = P(A) + P(B).$$

Let us give an example which might help us to understand why equation 1.1 holds. Imagine we are using a 6-sided die. Let A be the event that we observe a 2 or a 3. Thus $A = \{2, 3\}$. Let B be the event that we observe a 1 or a 5. Thus, $B = \{1, 5\}$. The two events A and B are disjoint: it is not possible to observe A and B at the same time, since $A \cap B = \emptyset$. Assume that we throw the die 10 times and obtain the sequence of numbers:

$$1, 3, 4, 6, 3, 4, 2, 5, 1, 2.$$

We have seen the event A four times: at the second, fifth, seventh and tenth trial. The event B is observed at the first, eighth and ninth trials. $C = A \cup B = \{2, 3, 1, 5\}$ is observed at the trials number 2, 5, 7, 10 and 1, 8, 9. We thus find in this case that $n_A = 4$, $n_B = 3$ and $n_C = 7$, which confirms equation 1.1.
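The same bookkeeping can be done by a few lines of code. This is only a sketch of the counting argument for this one sequence, not a proof.

```python
rolls = [1, 3, 4, 6, 3, 4, 2, 5, 1, 2]  # the ten throws from the text
A, B = {2, 3}, {1, 5}
assert A & B == set()  # A and B are disjoint

n_A = sum(r in A for r in rolls)        # times A occurred
n_B = sum(r in B for r in rolls)        # times B occurred
n_C = sum(r in (A | B) for r in rolls)  # times C = A union B occurred
print(n_A, n_B, n_C)  # 4 3 7
```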

Example 1.2 Assume that we are throwing a fair coin with sides 0 and 1. Let $X_i$ designate the number which we obtain when we flip the coin for the i-th time. Let A be the event that we observe right at the beginning the number 111. In other words:

$$A = \{X_1 = 1, X_2 = 1, X_3 = 1\}.$$

Let B designate the event that we observe the number 101 when we read our random sequence starting from the second trial. Thus:

$$B = \{X_2 = 1, X_3 = 0, X_4 = 1\}.$$

Assume that we want to calculate the probability that at least one of the two events A or B holds. In other words, we want to calculate the probability of the event $C = A \cup B$.

Note that A and B cannot both occur at the same time. The reason is that for A to hold it is necessary that $X_3 = 1$, and for B to hold it is necessary that $X_3 = 0$. However, $X_3$ cannot be equal to 0 and to 1 at the same time. Thus, A and B are disjoint events, so we are allowed to use theorem 1.1. Applying theorem 1.1, we find that:

$$P(A \cup B) = P(A) + P(B).$$

With a fair coin, each 3-digit number has the same probability. There are 8 three-digit numbers, so each one has probability 1/8. It follows that P(A) = 1/8 and P(B) = 1/8. Thus

$$P(A \cup B) = \frac{1}{8} + \frac{1}{8} = \frac{1}{4} = 25\%.$$
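For a fair coin the result of Example 1.2 can be confirmed by listing all 16 equally likely outcomes of the first four flips. The sketch below assumes exactly this finite model; the helper names `prob`, `in_A` and `in_B` are ours.

```python
from fractions import Fraction
from itertools import product

# All outcomes (X1, X2, X3, X4) of four fair coin flips, each equally likely.
flips = list(product([0, 1], repeat=4))

def prob(event):
    """Probability of `event`, a predicate on an outcome tuple."""
    return Fraction(sum(1 for w in flips if event(w)), len(flips))

def in_A(w):
    return w[0] == 1 and w[1] == 1 and w[2] == 1

def in_B(w):
    return w[1] == 1 and w[2] == 0 and w[3] == 1

print(prob(in_A), prob(in_B))               # 1/8 1/8
print(prob(lambda w: in_A(w) and in_B(w)))  # 0
print(prob(lambda w: in_A(w) or in_B(w)))   # 1/4
```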

The next theorem is useful for any pair of events A and B and not just disjoint events:

Theorem 1.2 Let A and B be two events. Then:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$

Proof. Let $C = A \cup B$. Let $D = B \setminus A$, that is, D consists of all the elements that are in B but not in A. We have by definition that $C = D \cup A$ and that D and A are disjoint. Thus we can apply theorem 1.1 and find:

$$P(C) = P(A) + P(D). \tag{1.2}$$

Furthermore, $A \cap B$ and D are disjoint and $B = (A \cap B) \cup D$. We can thus again apply theorem 1.1 and find that:

$$P(B) = P(A \cap B) + P(D). \tag{1.3}$$

Subtracting equation 1.3 from equation 1.2, we find:

$$P(C) - P(B) = P(A) - P(A \cap B).$$

By adding P(B) on both sides of the last equation, we find:

$$P(C) = P(A \cup B) = P(A) + P(B) - P(A \cap B).$$

This finishes the proof.

Problem 1.1 Let a and b designate two genes. Let the probability that a randomly picked person in the US has gene a be 20%. Let the probability for gene b be 30%. Finally, let the probability that the person has both genes at the same time be 10%. What is the probability to have at least one of the two genes?

Let us explain how we solve the above problem: let A, resp. B, designate the event that the randomly picked person has gene a, resp. b. We know that:

$$P(A) = 20\%, \qquad P(B) = 30\%, \qquad P(A \cap B) = 10\%.$$

The event to have at least one gene is the event $A \cup B$. By theorem 1.2 we have that $P(A \cup B) = P(A) + P(B) - P(A \cap B)$. Thus in our case: $P(A \cup B) = 20\% + 30\% - 10\% = 40\%$. This solves the above problem.

In many situations we will be considering the union of more than two events. The next theorem gives the formula for three events:

Theorem 1.3 Let A, B and C be three events. Then:

$$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C).$$

Proof. We already know the formula for the probability of the union of two events, so we are going to use this formula. Let D denote the union $D := B \cup C$. Then we find $A \cup B \cup C = A \cup D$ and hence

$$P(A \cup B \cup C) = P(A \cup D). \tag{1.4}$$

By theorem 1.2, the right side of the last equation above is equal to:

$$P(A \cup D) = P(A) + P(D) - P(A \cap D) = P(A) + P(B \cup C) - P(A \cap (B \cup C)). \tag{1.5}$$

By theorem 1.2 again:

$$P(B \cup C) = P(B) + P(C) - P(B \cap C).$$

We have $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$ and hence

$$P(A \cap (B \cup C)) = P((A \cap B) \cup (A \cap C)). \tag{1.6}$$

But the right side of the last equation above is the probability of the union of two events, and hence theorem 1.2 applies:

$$P((A \cap B) \cup (A \cap C)) = P(A \cap B) + P(A \cap C) - P((A \cap B) \cap (A \cap C)) = P(A \cap B) + P(A \cap C) - P(A \cap B \cap C). \tag{1.7}$$

Combining now equations 1.4, 1.5 and 1.6 with 1.7, we find

$$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C).$$
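The three-event formula can be sanity-checked on a small example. The sketch below assumes a fair six-sided die, so that the probability of an event is its size divided by 6; the three events are arbitrary illustrative choices of ours, not taken from the notes.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def P(event):
    """Probability of an event for a fair die: favourable outcomes over all outcomes."""
    return Fraction(len(event & omega), len(omega))

# Three overlapping events, chosen only for illustration.
A, B, C = {1, 2, 3}, {2, 4, 6}, {3, 4, 5}

lhs = P(A | B | C)
rhs = (P(A) + P(B) + P(C)
       - P(A & B) - P(A & C) - P(B & C)
       + P(A & B & C))
print(lhs, rhs)  # 1 1
```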

Often it is easier to calculate the probability of a complement than the probability of the event itself. In such a situation, the following theorem is useful:

Theorem 1.4 Let A be an event and let $A^c$ denote its complement. Then:

$$P(A) = 1 - P(A^c).$$

Proof. Note that the events A and $A^c$ are disjoint. Furthermore, by definition $A \cup A^c = \Omega$. Recall that for the sample space $\Omega$ we have that $P(\Omega) = 1$. We can thus apply theorem 1.1 and find that:

$$1 = P(\Omega) = P(A \cup A^c) = P(A) + P(A^c).$$

This implies that:

$$P(A) = 1 - P(A^c).$$

1.5 Some inequalities

Theorem 1.5 Let A and B be two events. Then:

$$P(A \cup B) \leq P(A) + P(B).$$

Proof. We know by theorem 1.2 that

$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$

Since $P(A \cap B) \geq 0$ we have that

$$P(A) + P(B) - P(A \cap B) \leq P(A) + P(B).$$

It follows that

$$P(A \cup B) \leq P(A) + P(B).$$

For several events a similar theorem holds:

Theorem 1.6 Let $A_1, \ldots, A_n$ be a collection of n events. Then

$$P(A_1 \cup A_2 \cup \ldots \cup A_n) \leq P(A_1) + \ldots + P(A_n).$$

Proof. By induction.

Another often used inequality is:

Theorem 1.7 Let $A \subset B$. Then:

$$P(A) \leq P(B).$$

Proof. If $A \subset B$, then for every n we have that $n_A \leq n_B$, hence also

$$\frac{n_A}{n} \leq \frac{n_B}{n}.$$

Thus:

$$\lim_{n \to \infty} \frac{n_A}{n} \leq \lim_{n \to \infty} \frac{n_B}{n}.$$

Hence

$$P(A) \leq P(B).$$

Imagine the following situation: in a population there are two illnesses a and b. We assume that 20% suffer from b, 15% suffer from a, whilst 10% suffer from both. Let A be the event that a person suffers from a and let B be the event that a person suffers from b. If a patient comes to a doctor and says that he suffers from illness b, how likely is he to have illness a also? (We assume that the patient has been tested for b but not yet tested for a.) We note that half of the population group suffering from b also suffers from a. Hence, when the doctor meets a patient suffering from b, there is a chance of 1 out of 2 that the person also suffers from a. This is called the conditional probability of A given B and is denoted by $P(A|B)$. The formula we used is $10\%/20\% = P(A \cap B)/P(B)$.

Definition 2.1 Let A, B be two events. Then we define the probability of A conditional on the event B, and write $P(A|B)$ for the number:

$$P(A|B) := \frac{P(A \cap B)}{P(B)}.$$

Definition 2.2 Let A, B be two events. We say that A and B are independent of each other iff

$$P(A \cap B) = P(A) \cdot P(B).$$

Note that, provided $P(B) \neq 0$, A and B are independent of each other if and only if $P(A|B) = P(A)$. In other words, A and B are independent of each other if and only if the realization of one of the events does not affect the conditional probability of the other.

Assume that we perform two random experiments independently of each other, in the sense that the two experiments do not interact. That is, the experiments have no influence on each other. Let A denote an event related to the first experiment, and let B denote an event related to the second experiment. We saw in class that in this situation the equation $P(A \cap B) = P(A) \cdot P(B)$ must hold. Thus, A and B are independent in the sense of the above definition. To show this we used an argument where we simulated the two random experiments by picking marbles from two bags.

There are also many cases where events related to the same experiment are independent in the sense of the above definition. For example, for a fair die, the events $A = \{1, 2\}$ and $B = \{2, 4, 6\}$ are independent.
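The claim about the fair die can be verified by direct computation. In this small sketch the function `P` is our own helper and assumes that all six faces are equally likely.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def P(event):
    """Probability of an event for a fair six-sided die."""
    return Fraction(len(event), len(omega))

A, B = {1, 2}, {2, 4, 6}
print(P(A & B), P(A) * P(B))    # 1/6 1/6
print(P(A & B) / P(B) == P(A))  # True: P(A|B) = P(A)
```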

There can also be more than two independent events at a time:

Definition 2.3 Let $A_1, A_2, \ldots, A_n$ be a finite collection of events. We say that $A_1, A_2, \ldots, A_n$ are all independent of each other iff

$$P\left(\bigcap_{i \in I} A_i\right) = \prod_{i \in I} P(A_i)$$

for every subset $I \subset \{1, 2, \ldots, n\}$.

The next example is very important for the test on Wednesday.

Example 2.1 Assume we flip the same coin independently three times. Let the coin be biased, so that side 1 has probability 60% and side 0 has probability 40%. What is the probability to observe the number 101? (By this we mean: what is the probability to first get a 1, then a 0 and finally, at the third trial, a 1 again?)

To solve this problem let $A_1$, resp. $A_3$, be the event that at the first, resp. third, trial we get a one. Let $A_2$ be the event that at the second trial we get a zero. Observing a 101 is thus equal to the event $A := A_1 \cap A_2 \cap A_3$. Because the trials are performed in an independent manner, it follows that the events $A_1, A_2, A_3$ are independent of each other. Thus we have that:

$$P(A_1 \cap A_2 \cap A_3) = P(A_1) \cdot P(A_2) \cdot P(A_3).$$

We have that:

$$P(A_1) = 60\%, \qquad P(A_2) = 40\%, \qquad P(A_3) = 60\%.$$

It follows that:

$$P(A_1 \cap A_2 \cap A_3) = 60\% \cdot 40\% \cdot 60\% = 0.144.$$

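2.1 Law of total probability

Lemma 2.1 Let A and B be two events. Then

$$P(A) = P(A \cap B) + P(A \cap B^c). \tag{2.1}$$

Furthermore, if B and $B^c$ both have probabilities that are not equal to zero, then

$$P(A) = P(A|B) \cdot P(B) + P(A|B^c) \cdot P(B^c). \tag{2.2}$$

Proof. Let $E := A \cap B$ and $D := A \cap B^c$. Since B and $B^c$ are disjoint, we have that D and E are disjoint. Furthermore, $A = E \cup D$, so that by theorem 1.1 we find:

$$P(A) = P(E \cup D) = P(E) + P(D).$$

Replacing E and D by $A \cap B$ and $A \cap B^c$ yields equation 2.1.

We can use 2.1 to find

$$P(A) = \frac{P(A \cap B) \cdot P(B)}{P(B)} + \frac{P(A \cap B^c) \cdot P(B^c)}{P(B^c)}.$$

By the definition of conditional probability, this is the same as

$$P(A) = P(A|B) P(B) + P(A|B^c) P(B^c),$$

which finishes the proof of equation 2.2.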

Let us give an example which should show that intuitively this law is very clear. Assume that in a town 90% of women are blond but only 20% of men. Assume we choose a person at random from this town. Each person is equally likely to be drawn. Let W be the event that the person is a woman and B be the event that the person is blond. The law of total probability can be written as

$$P(B) = P(B|W) P(W) + P(B|W^c) P(W^c). \tag{2.3}$$

In our case, the conditional probability of blond conditional on woman is $P(B|W) = 0.9$. On the other hand, $W^c$ is the event to draw a male and $P(B|W^c)$ is the conditional probability to have a blond given that the person is a man. In our case, $P(B|W^c) = 0.2$. So, when we put the numerical values into equation 2.3, we find

$$P(B) = 0.9 P(W) + 0.2 P(W^c). \tag{2.4}$$

Here P(W) is the probability that the chosen person is a woman. This is then the percentage of women in this population. Similarly, $P(W^c)$ is the proportion of men. In other words, equation 2.4 can be read as follows: the total proportion of blonds in the population is the weighted average between the proportion of blonds among the female and the male population.
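To see the weighted average at work, one can plug in a value for the proportion of women. The text does not specify P(W); the value 0.5 below is purely an assumption for illustration, and the function name is ours.

```python
def total_probability(p_b_given_w, p_b_given_not_w, p_w):
    """P(B) = P(B|W) P(W) + P(B|W^c) P(W^c), as in equation 2.3."""
    return p_b_given_w * p_w + p_b_given_not_w * (1 - p_w)

# P(W) = 0.5 is a hypothetical value; the notes leave it unspecified.
print(total_probability(0.9, 0.2, 0.5))
```

With this assumed P(W) the proportion of blonds comes out at about 0.55, halfway between 0.9 and 0.2, as one expects from an equally weighted average.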

2.2 Bayes' rule

Bayes' rule is useful when one would like to calculate a conditional probability of A given B, but one is given the opposite, that is, the probability of B given A. Let us next state Bayes' rule:

Lemma 2.2 Let A and B be events both having nonzero probabilities. Then

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}. \tag{2.5}$$

Proof. By the definition of conditional probability we have $P(B|A) = P(A \cap B)/P(A)$. We are now going to plug this expression into the right side of equation 2.5. We find:

$$\frac{P(B|A) \cdot P(A)}{P(B)} = \frac{P(A \cap B) P(A)}{P(A) \cdot P(B)} = \frac{P(A \cap B)}{P(B)} = P(A|B),$$

which establishes equation 2.5.

Let us give an example. Assume that 30% of men are interested in car races, but only 10% of women are. If I know that a person is interested in car races, what is the probability that it is a man? Again I imagine that we pick a person at random in the population. Let M be the event that the person is a man and C the event that she/he is interested in car races. We know $P(C|M) = 0.3$ and $P(C|M^c) = 0.1$. Now by Bayes' rule we have that the conditional probability that the person is a man given that he/she is interested in car races is:

$$P(M|C) = P(C|M) \frac{P(M)}{P(C)}. \tag{2.6}$$

We have that $P(C) = P(C|M) P(M) + P(C|M^c) P(M^c)$, which we can plug into 2.6 to find

$$P(M|C) = P(C|M) \frac{P(M)}{P(C|M) P(M) + P(C|M^c) P(M^c)}.$$

In the present numerical example, we find

$$P(M|C) = 0.3 \, \frac{P(M)}{0.3 P(M) + 0.1 P(M^c)},$$

where P(M) represents the proportion of men in the population, whilst $P(M^c)$ represents the proportion of women.
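The last formula is easy to evaluate once a value for P(M) is fixed. The notes leave P(M) open; the value 0.5 used below is only an assumption to make the sketch concrete, and the function name is ours.

```python
def posterior_man(p_c_given_m, p_c_given_w, p_m):
    """P(M|C) by Bayes' rule, with the law of total probability in the denominator."""
    p_c = p_c_given_m * p_m + p_c_given_w * (1 - p_m)
    return p_c_given_m * p_m / p_c

# P(M) = 0.5 is an assumed value, not given in the text.
print(posterior_man(0.3, 0.1, 0.5))
```

Under this assumption the result is about 0.75: among the people interested in car races, roughly three out of four are men.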

3 Expectation

Imagine a firm which every year makes a profit. It is not known in advance what the profit of the firm is going to be. This means that the profit is random: we can assign to each possible outcome a certain probability. Assume that from year to year the probabilities for the profit of our firm do not change. Assume also that from one year to the next the profits are independent. What is the long-term average yearly profit equal to?

For this let us look at a specific model. Assume the firm could make 1, 2, 3 or 4 million profit with the following probabilities:

x          1     2     3     4
P(X = x)   0.1   0.4   0.3   0.2

(The model here is not very realistic since there are only a few possible outcomes. We chose it merely to be able to illustrate our point.) Let $X_i$ denote the profit in year i. Hence, we have that $X, X_1, X_2, \ldots$ are i.i.d. random variables.

To calculate the long-term average yearly profit consider the following. In 10% of the years, in the long run, we make 1 million. If we take a period of n years, where n is large, we thus find that in about 0.1n years we make 1 million. In 40% of the years we make 2 million in the long run. Hence, in a period of n years, this means that in about 0.4n years we make 2 million. This corresponds to an amount of money equal to about 0.4n times 2 million. Similarly, for n large the money made during the years where we earned 3 million is about $3 \cdot 0.3n$, whilst for the years where we made 4 million we get $4 \cdot 0.2n$. The total during this n-year period is thus about

$$n \, (1 \cdot 0.1 + 2 \cdot 0.4 + 3 \cdot 0.3 + 4 \cdot 0.2) = n \, (1 \cdot P(X = 1) + 2 \cdot P(X = 2) + 3 \cdot P(X = 3) + 4 \cdot P(X = 4)) = 2.6 \, n$$

million. Dividing by n, we see that in the long run the yearly average profit is 2.6 million. This long-term average is called expected value or expectation and is denoted by E[X]. Let us formalize this concept:

In general, if X denotes the outcome of a random experiment, then we call X a random variable.

Definition 3.1 Let us consider a random experiment with a finite number of possible outcomes, where the state space is

$$\Omega = \{x_1, x_2, \ldots, x_s\}.$$

(In the profit example above, we would have $\Omega = \{1, 2, 3, 4\}$.) Let X denote the outcome of this random experiment. For $x \in \Omega$, let $p_x$ denote the probability that the outcome of our random experiment is x. That is:

$$p_x := P(X = x).$$

(In the last example above, we have for example $p_1 = 0.1$ and $p_2 = 0.4$.) We define the expected value E[X]:

$$E[X] := \sum_{x \in \Omega} x \, p_x.$$

In other words, to calculate the expected value of a random variable, we simply multiply the probabilities with the corresponding values and then take the sum over all possible outcomes. Let us see yet another example for expectation.

Example 3.1 Let X denote the value which we obtain when we throw a fair coin with side 0 and side 1. Then we find that:

$$E[X] = 0.5 \cdot 1 + 0.5 \cdot 0 = 0.5.$$
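Definition 3.1 translates almost literally into code. The following sketch stores a finite distribution as a dictionary from values to probabilities (our own representation) and uses exact fractions to avoid rounding; the profit probabilities 0.1, 0.4, 0.3, 0.2 are the ones used in the firm example.

```python
from fractions import Fraction

def expectation(pmf):
    """E[X] = sum over x of x * P(X = x) for a finite distribution."""
    assert sum(pmf.values()) == 1  # probabilities must add up to one
    return sum(x * p for x, p in pmf.items())

profit = {1: Fraction(1, 10), 2: Fraction(4, 10), 3: Fraction(3, 10), 4: Fraction(2, 10)}
coin = {0: Fraction(1, 2), 1: Fraction(1, 2)}

print(expectation(profit))  # 13/5, that is 2.6
print(expectation(coin))    # 1/2
```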

When we keep repeating the same random experiment independently and under the same conditions, in the long run we will see that the average value which we observe converges to the expectation. This is what we saw in the firm/profit example above. Let us formalize this. This fact is actually a theorem which is called the Law of Large Numbers. This theorem goes as follows:

Theorem 3.1 Assume we repeat the same random experiment under the same conditions independently many times. Let $X_i$ denote the random variable which is the outcome of the i-th experiment. Then:

$$\lim_{n \to \infty} \frac{X_1 + X_2 + \ldots + X_n}{n} = E[X_1]. \tag{3.1}$$

This simply means that in the long run, the average is going to be equal to the expectation.

Proof. Let $\Omega$ denote the state space of the random variables $X_i$:

$$\Omega = \{x_1, x_2, \ldots, x_s\}.$$

By regrouping the same terms together, we find:

$$X_1 + X_2 + \ldots + X_n = x_1 n_{x_1} + x_2 n_{x_2} + \ldots + x_s n_{x_s}.$$

(Remember that $n_{x_i}$ denotes the number of times we observe the value $x_i$ in the finite sequence $X_1, X_2, \ldots, X_n$.) Thus:

$$\lim_{n \to \infty} \frac{X_1 + X_2 + \ldots + X_n}{n} = \lim_{n \to \infty} \left( x_1 \frac{n_{x_1}}{n} + \ldots + x_s \frac{n_{x_s}}{n} \right).$$

By definition

$$P(X_1 = x_i) = \lim_{n \to \infty} \frac{n_{x_i}}{n}.$$

Since the limit of a sum is the sum of the limits, we find

$$\lim_{n \to \infty} \left( x_1 \frac{n_{x_1}}{n} + \ldots + x_s \frac{n_{x_s}}{n} \right) = x_1 \lim_{n \to \infty} \frac{n_{x_1}}{n} + \ldots + x_s \lim_{n \to \infty} \frac{n_{x_s}}{n} = x_1 P(X = x_1) + \ldots + x_s P(X = x_s) = E[X_1].$$

So, we can now generalize our firm profit example. Imagine for this that the profit a firm makes every month is random. Imagine also that the earnings from month to month are independent of each other and also have the same probabilities. In this case we can view the sequence of earnings, month by month, as a sequence of repeats of the same random experiment. Because of theorem 3.1, in the long run the average monthly income will be equal to the expectation.
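The Law of Large Numbers can be watched in a simulation of the firm example. This sketch assumes the profit probabilities 0.1, 0.4, 0.3, 0.2 for the values 1, 2, 3, 4 (so that the expectation is 1·0.1 + 2·0.4 + 3·0.3 + 4·0.2 = 2.6) and uses Python's pseudo-random generator in place of real independent repetitions; the function name and the seed are ours.

```python
import random

values = [1, 2, 3, 4]
probs = [0.1, 0.4, 0.3, 0.2]

def running_average(n, seed=0):
    """Average of n simulated i.i.d. profits drawn from the model above."""
    rng = random.Random(seed)
    draws = rng.choices(values, weights=probs, k=n)
    return sum(draws) / n

# The average should approach the expectation 2.6 as n grows.
for n in (10, 1_000, 100_000):
    print(n, running_average(n))
```

The exact printed numbers depend on the seed; what one should observe is that the averages for large n lie close to 2.6.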

Let us next give a few useful lemmas in connection with expectation. The first lemma deals with the situation where we take an i.i.d. sequence of random outcomes $X_1, X_2, X_3, \ldots$ and multiply each one of them by a constant a. Let $Y_i$ denote the number $X_i$ multiplied by a: hence $Y_i := a X_i$. Then the long-term average of the $X_i$'s multiplied by a equals the long-term average of the $Y_i$'s. Let us state this fact in a formal way:

Lemma 3.1 Let X denote the outcome of a random experiment. (Thus X is a so-called random variable.) Let a be a real (non-random) number. Then:

$$E[aX] = a E[X].$$

Proof. Let us repeat the same experiment independently many times. Let $X_i$ denote the outcome of the i-th trial. Let $Y_i := a X_i$. Then by the law of large numbers, we have that

$$\lim_{n \to \infty} \frac{Y_1 + \ldots + Y_n}{n} = E[Y_1] = E[aX_1].$$

However:

$$\lim_{n \to \infty} \frac{Y_1 + \ldots + Y_n}{n} = \lim_{n \to \infty} \frac{aX_1 + \ldots + aX_n}{n} = \lim_{n \to \infty} a \left( \frac{X_1 + \ldots + X_n}{n} \right) = a \lim_{n \to \infty} \frac{X_1 + \ldots + X_n}{n} = a E[X_1].$$

Comparing the two expressions gives $E[aX_1] = a E[X_1]$.

The next lemma is extremely important when dealing with the expectation of sums of random variables. It states that the sum of the expectations is equal to the expectation of the sum. We can think of a simple real-life example which shows why this should be true. Imagine that Matzinger is the owner of two firms (wishful thinking, since Matzinger is a poor professor). Let $X_i$ denote the profit made by his first firm in year i. Let $Y_i$ denote the profit made by his second firm in year i. We assume that from year to year the probabilities do not change for both firms and the profits are independent (from year to year). In other words, $X, X_1, X_2, X_3, \ldots$ are i.i.d. variables and so are $Y, Y_1, Y_2, \ldots$. Let $Z_i$ denote the total profit Matzinger makes in year i, so that $Z_i = X_i + Y_i$. Now obviously the long-term average yearly profit of Matzinger is the long-term average yearly profit from the first firm plus the long-term average profit from the second firm. In mathematical writing this gives:

$$E[X + Y] = E[X] + E[Y].$$

As a matter of fact, E[X + Y] denotes the long-term average profit of Matzinger. On the other hand, E[X] denotes the average profit of the first firm, whilst E[Y] denotes the average profit of the second firm.

Let us next formalize all of this:

Lemma 3.2 Let X, Y denote the outcomes of two random experiments.

Then:

E[X + Y ] = E[X] + E[Y ].

Proof. Let us repeat the two random experiments independently many times. Let $X_i$ denote the outcome of the i-th trial of the first random experiment. Let $Y_i$ be equal to the outcome of the i-th trial of the second random experiment. For all $i \in \mathbb{N}$, let $Z_i := X_i + Y_i$. Then by the law of large numbers, we have that:

\[ \lim_{n\to\infty} \frac{Z_1 + \dots + Z_n}{n} = E[Z_1] = E[X_1 + Y_1]. \]

However:

\[
\begin{aligned}
\lim_{n\to\infty} \frac{Z_1 + \dots + Z_n}{n}
&= \lim_{n\to\infty} \frac{X_1 + Y_1 + X_2 + Y_2 + \dots + X_n + Y_n}{n} \\
&= \lim_{n\to\infty} \left( \frac{(X_1 + \dots + X_n) + (Y_1 + \dots + Y_n)}{n} \right) \\
&= \lim_{n\to\infty} \frac{X_1 + \dots + X_n}{n} + \lim_{n\to\infty} \frac{Y_1 + \dots + Y_n}{n} = E[X_1] + E[Y_1].
\end{aligned}
\]

This proves that E[X1 +Y1] = E[X1 ]+E[Y1 ] and finishes this proof. It is very important

to note that we do not need for the above theorem to have X and Y being independent

of each other.


Lemma 3.3 Let X, Y denote the outcomes of two independent random experiments.

Then:

\[ E[X \cdot Y] = E[X] \cdot E[Y]. \]

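Proof. We assume that X takes values in a countable set $\Omega_X$, whilst Y takes on values from the countable set $\Omega_Y$. We have that

\[ E[XY] = \sum_{x \in \Omega_X,\, y \in \Omega_Y} xy\, P(X = x, Y = y). \tag{3.2} \]

Since X and Y are independent, we have $P(X = x, Y = y) = P(X = x)P(Y = y)$. Plugging the last equality into 3.2, we find

\[ E[XY] = \sum_{x \in \Omega_X,\, y \in \Omega_Y} xy\, P(X = x)P(Y = y) = \sum_{x \in \Omega_X} x P(X = x) \sum_{y \in \Omega_Y} y P(Y = y) = E[X]E[Y]. \]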
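Lemmas 3.2 and 3.3 can be checked exactly on a small model. The following is a minimal Python sketch, assuming (purely for illustration) that X and Y are two independent fair six-sided dice; the helper `expectation` is a hypothetical name, not something from these notes. It also shows that the sum rule survives dependence (taking Y := X) whilst the product rule does not.

```python
from fractions import Fraction
from itertools import product

# Illustrative assumption: X and Y are two independent fair six-sided dice.
FACES = range(1, 7)
P_FACE = Fraction(1, 6)

def expectation(g):
    """E[g(X, Y)], computed exactly by summing over the joint distribution."""
    return sum(g(x, y) * P_FACE * P_FACE for x, y in product(FACES, FACES))

e_x = expectation(lambda x, y: x)
e_y = expectation(lambda x, y: y)
e_sum = expectation(lambda x, y: x + y)
e_prod = expectation(lambda x, y: x * y)

# Dependent case: take Y := X, so the product XY becomes X squared.
e_x_plus_x = sum((x + x) * P_FACE for x in FACES)
e_x_squared = sum(x * x * P_FACE for x in FACES)

print(e_sum == e_x + e_y)         # Lemma 3.2 for independent X, Y
print(e_prod == e_x * e_y)        # Lemma 3.3 for independent X, Y
print(e_x_plus_x == e_x + e_x)    # Lemma 3.2 still holds when Y = X
print(e_x_squared == e_x * e_x)   # the product rule fails when Y = X
```

Exact fractions are used instead of floating point so that the equalities can be tested without rounding tolerance.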

In some problems we are only interested in the expectation of a random variable. For

example, consider insurance policies for mobile telephones sold by a big phone company.

Say Xi is the amount which will be paid during the coming year to the i-th customer

due to his/her phone breaking down. It seems reasonable to assume that the Xi s are

independent of each other. (We assume no phone viruses). We also assume that they all

follow the same random model. So, by the Law of Large Numbers we have that for n

large, the average is approximately equal to the expectation:

\[ \frac{X_1 + X_2 + \dots + X_n}{n} \approx E[X_i]. \]

Hence, when n is really large, there is no risk involved for the phone company: they

know how much they will have to pay total: on a per customer basis, they will have to

spend an amount very close to E[X1 ]. In other words, they only need one real number

from the probability model for the claims: that is the expectation E[Xi ]. Now, in many

other applications knowing only the expected value will not be enough: we will also need

a measure of the dispersion. This means that we will also want to know how much on

average the variables fluctuate from their long term average E[X1 ].

Let us give an example. Matzinger as a child used to walk with his mother every day on the shores

of Lake Geneva. Now, there is a place where there is a scale to measure the height of the water. So,

hydrologists measure the water level and then analyze this data. Assume that Xi denotes the water level

on a specific day in year i. (We assume that we always measure on the same day of the year, like for

example on the first of January). For the current discussion we assume that the model does not change


over time (no global warming). We furthermore assume that from one year to the next the values are

independent. Say the random model would be given as follows:

x          |  4  |  5  |  6  |  7  |  8  |  9
P(X = x)   | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 | 1/6

How much does the water level fluctuate on average from year to year? Note that the long term average,

that is the expectation is equal to

\[ E[X_i] = 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{6} + 6 \cdot \frac{1}{6} + 7 \cdot \frac{1}{6} + 8 \cdot \frac{1}{6} + 9 \cdot \frac{1}{6} = 6.5 \]

Now, when the water level is 6 or 7, we are 0.5 away from the long-term average of 6.5. In such a year i, we will say that the fluctuation $f_i$ is 0.5. In other words, we measure for each year i how far we are from $E[X_i]$. This observed fluctuation in year i is then equal to

\[ f_i := |X_i - 6.5| = |X_i - E[X_i]|. \]

In our model, $f_i = 0.5$ happens with a probability of 1/3, that is, in the long run, in one third of the years. When the water level is either 8 or 5, we are 1.5 away from the long-term average of 6.5. This also has a probability of 1/3. Finally, with water levels of 4 or 9, we are 2.5 away from the long-term average, and again this will happen in a third of the years in the long run. So, if this model holds, the long-term average fluctuation will always tend to be about

\[ E[f_i] = E[|X_i - E[X_i]|] = 2.5 \cdot \frac{1}{3} + 1.5 \cdot \frac{1}{3} + 0.5 \cdot \frac{1}{3} = 1.5 \]

after many years. To understand why, simply consider the fluctuations $f_1, f_2, f_3, \dots$. By the Law of Large Numbers applied to them, we get that for n large, the average fluctuation is approximately equal to its expectation:

\[ \frac{f_1 + f_2 + \dots + f_n}{n} \approx E[f_i] = E[|X_i - E[X_i]|]. \tag{4.1} \]

So, no matter what, after many years we will always know what the average fluctuation is approximately equal to: the expression on the right side of 4.1. The quantity

\[ \text{Long term average fluctuation} = E[|X_i - E[X_i]|] \tag{4.2} \]

is a measure of the dispersion (around the expectation) in our model. It should be obvious

why this dispersion is important: if it is small people of Geneva will be safe. If it is big,

they will often have to deal with flooding. So, in some sense, we can view the value given

in 4.2 as a measure of risk: if the dispersion is 0, then there is no risk and the random

number is not random but always equal to the fixed value E[X1 ]!

In modern statistics, however, one most often considers a number which represents the same idea, but can be slightly different from 4.2. The number we will use most often is not the average fluctuation, but instead the square root of the average squared fluctuation. This number is called the standard deviation of a random variable. We usually denote it by $\sigma$, so

\[ \sigma_X := \sqrt{E[(X - E[X])^2]}. \]

The long-term average squared fluctuation of a random variable X is also called the variance, and will be denoted by $\mathrm{VAR}[X]$, so that

\[ \mathrm{VAR}[X] := E[(X - E[X])^2]. \]

With this definition the standard deviation is simply the square root of the variance:

\[ \sigma_X = \sqrt{\mathrm{VAR}[X]}. \]

In most cases, $\sigma_X$ and our other measure of dispersion, given by $E[|X - E[X]|]$, are close to each other.

Let us go back to our example. The variance is the average squared fluctuation. We thus get:

\[ \mathrm{VAR}[X_i] = E[f_i^2] = 2.5^2 \cdot \frac{1}{3} + 1.5^2 \cdot \frac{1}{3} + 0.5^2 \cdot \frac{1}{3} \approx 2.92 \]

and

\[ \sigma_{X_i} = \sqrt{\mathrm{VAR}[X_i]} \approx 1.71. \]

So, we see the average fluctuation size was $E[|X_i - E[X_i]|] = 1.5$, whilst the standard deviation is (only) about 14% bigger.
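The numbers of the water-level example can be recomputed mechanically. Below is a minimal Python sketch using the model from the text (levels 4 to 9, each with probability 1/6); the helper name `expectation` and the dictionary layout are illustrative choices.

```python
from fractions import Fraction
from math import sqrt

# Water-level model from the text: levels 4..9, each with probability 1/6.
model = {level: Fraction(1, 6) for level in range(4, 10)}

def expectation(g, dist):
    """Exact E[g(X)] for a finite distribution given as {value: probability}."""
    return sum(g(x) * p for x, p in dist.items())

mean = expectation(lambda x: x, model)                      # E[X]
mean_abs_dev = expectation(lambda x: abs(x - mean), model)  # E[|X - E[X]|]
variance = expectation(lambda x: (x - mean) ** 2, model)    # VAR[X]
std_dev = sqrt(variance)                                    # sigma_X

print(mean, mean_abs_dev, variance, round(std_dev, 3))
print(round(std_dev / float(mean_abs_dev), 3))  # ratio of the two dispersion measures
```

The last line shows how much larger the standard deviation is than the average absolute fluctuation in this particular model.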

Now, the standard deviation is most often used for determining the order of magnitude of a random imprecision. So, we don't care about knowing that number absolutely exactly: instead we just want the order of magnitude. In other words, in most applications, $E[|X_i - E[X_i]|]$ and $\sigma_{X_i}$ are sufficiently close to each other that it does not matter which one of the two we take! But it will turn out that the standard deviation allows for certain calculations which the other measure of dispersion in 4.2 does not allow for. So, we will work more often with the standard deviation than the other.

4.1

Matzinger's rule of thumb

Most variables, most of the time, take values no further than two standard deviations from their expected values. We could thus write, in a loose way:

\[ X \approx E[X] \pm 2\sigma_X. \]

To understand where this rule comes from, simply think of the following: for example, the average American household income is around 70,000. How many households make more than twice that much, that is, above 140,000? Certainly not a very large portion of the population. Now, in our case the argument is not about an average, but about the average fluctuation. Still, it is an average. So, what is true for averages should also be true for an average of fluctuations.

We will see below Chebycheff's rule, which covers the worst possible scenario. The probability for any random variable to be further than 2 standard deviations from its expected value could be as much as 25%, but never more:

\[ P(|Z - E[Z]| \ge 2\sigma_Z) \le 0.25. \]

The above inequality holds for any random variable, so it represents in some sense the worst case. It will be proven in our section on Chebycheff.

For normal variables, the probability to be further than two standard deviations from the expectation is much smaller: it is about 0.05. We will see in the section on the central limit theorem that any sum of many independent random contributions is approximately normal as soon as they follow about the same model. Now, 0.05 is much smaller than 0.25. In real life, in many cases, one will be in between these two possibilities. This rule of thumb is extremely useful when analyzing data and trying to get the big picture!
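The two reference values quoted above (about 0.05 for a normal variable, 0.25 in the worst case) can be reproduced numerically. This is a minimal Python sketch; it relies on the standard identity P(|N(0,1)| ≥ k) = erfc(k/√2), and the function names are illustrative.

```python
from math import erfc, sqrt

def normal_two_sided_tail(k):
    """P(|N(0,1)| >= k) for a standard normal variable."""
    return erfc(k / sqrt(2))

def chebycheff_bound(k):
    """Worst-case bound on P(|Z - E[Z]| >= k * sigma_Z), valid for any Z."""
    return 1 / k ** 2

for k in (1, 2, 3):
    print(k, round(normal_two_sided_tail(k), 4), round(chebycheff_bound(k), 4))
```

For k = 2 the normal tail is roughly 0.0455, which is the "about 0.05" of the text, against the worst-case bound 0.25.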

to:

\[ \mathrm{VAR}[X] := E\left[(X - E[X])^2\right]. \]

\[ \sigma_X := \sqrt{\mathrm{VAR}[X]}. \]

The standard deviation is a measure for the typical order of magnitude of how far away

the value we get after doing the experiment once, is from E[X].

Lemma 5.1 Let a be a non-random number and X the outcome of a random experiment.

Then:

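\[ \mathrm{VAR}[aX] = a^2\, \mathrm{VAR}[X]. \]

Proof. We have:

\[
\begin{aligned}
\mathrm{VAR}[aX] &= E[(aX - E[aX])^2] = E[(aX - aE[X])^2] \\
&= E[a^2 (X - E[X])^2] = a^2 E[(X - E[X])^2] = a^2\, \mathrm{VAR}[X],
\end{aligned}
\]

which proves that $\mathrm{VAR}[aX] = a^2\, \mathrm{VAR}[X]$.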

Lemma 5.2 Let X be the outcome of a random experiment, (in other words a random

variable). Then:

\[ \mathrm{VAR}[X] = E[X^2] - (E[X])^2. \]

Proof. We have that

\[ E[(X - E[X])^2] = E[X^2 - 2XE[X] + E[X]^2] = E[X^2] - 2E[XE[X]] + E[E[X]^2]. \tag{5.1} \]

Now E[X] is a constant and constants can be taken out of the expectation. This implies that

\[ E[XE[X]] = E[X]E[X] = E[X]^2. \tag{5.2} \]

On the other hand, the expectation of a constant is the constant itself. Thus, since $E[X]^2$ is a constant, we find:

\[ E[E[X]^2] = E[X]^2. \tag{5.3} \]

Using equations 5.2 and 5.3 with 5.1 we find

\[ E[(X - E[X])^2] = E[X^2] - 2E[X]^2 + E[X]^2 = E[X^2] - E[X]^2. \]

This proves that $\mathrm{VAR}[X] = E[X^2] - E[X]^2$.

Lemma 5.3 Let X and Y be the outcomes of two random experiments, which are independent of each other. Then:

\[ \mathrm{VAR}[X + Y] = \mathrm{VAR}[X] + \mathrm{VAR}[Y]. \]

Proof. We have:

\[
\begin{aligned}
\mathrm{VAR}[X + Y] &= E[((X + Y) - E[X + Y])^2] = E[(X + Y - E[X] - E[Y])^2] \\
&= E[(X - E[X])^2 + 2(X - E[X])(Y - E[Y]) + (Y - E[Y])^2] \\
&= E[(X - E[X])^2] + 2E[(X - E[X])(Y - E[Y])] + E[(Y - E[Y])^2].
\end{aligned}
\]

Since X and Y are independent, $(X - E[X])$ is also independent of $(Y - E[Y])$. Thus, we can use lemma 3.3, which says that the expectation of a product equals the product of the expectations in case the variables are independent. We find:

\[ E[(X - E[X])(Y - E[Y])] = E[X - E[X]] \cdot E[Y - E[Y]]. \]

Furthermore:

\[ E[X - E[X]] = E[X] - E[E[X]] = E[X] - E[X] = 0. \]

Thus

\[ E[(X - E[X])(Y - E[Y])] = 0. \]

Applying this to the above formula for $\mathrm{VAR}[X + Y]$, we get:

\[
\begin{aligned}
\mathrm{VAR}[X + Y] &= E[(X - E[X])^2] + 2E[(X - E[X])(Y - E[Y])] + E[(Y - E[Y])^2] \\
&= E[(X - E[X])^2] + E[(Y - E[Y])^2] = \mathrm{VAR}[X] + \mathrm{VAR}[Y].
\end{aligned}
\]

This finishes our proof.


5.1

Getting the big picture with the help of Matzinger's rule of thumb

We mentioned that most of the time, any random variable takes values no further than two times its standard deviation from its expectation. We can apply this, together with our calculation rules for the variance, to understand how insurance, hedging of investments, and even statistical estimation work. Let X1, X2, ... be a sequence of random variables which all follow the same model and are independent of each other. Let Z be the sum of n such variables:

\[ Z = X_1 + X_2 + \dots + X_n. \]

We find that

\[ E[Z] = E[X_1 + X_2 + \dots + X_n] = E[X_1] + E[X_2] + \dots + E[X_n] = nE[X_1]. \]

Similarly, we can use the fact that the variance of a sum of independent variables is the sum of the variances to find:

\[ \mathrm{VAR}[Z] = \mathrm{VAR}[X_1 + X_2 + \dots + X_n] = \mathrm{VAR}[X_1] + \mathrm{VAR}[X_2] + \dots + \mathrm{VAR}[X_n] = n\,\mathrm{VAR}[X_1]. \]

Using the last equation above with the fact that the standard deviation is the square root of the variance, we find:

\[ \sigma_Z = \sqrt{\mathrm{VAR}[Z]} = \sqrt{n\,\mathrm{VAR}[X_1]} = \sqrt{n}\,\sigma_{X_1}. \]

In other words: the sum of n independent variables has its expectation grow like n times a constant, but its standard deviation grows only like the square root of n times a constant! This is everything you need to know for understanding how insurance and other risk reducers work.

Let us see different examples of what these random numbers Xi could represent:

• Say you are an insurance company specializing in providing life insurance. Let Xi be the claim in the current year of the i-th client. You have n clients, so the total claim which you as a company will have to pay is Z = X1 + X2 + ... + Xn.

• You buy houses which you flip and then try to sell at a profit. You have bought houses all over the US. Assuming the economy and real estate market stay very stable, we can assume that the selling prices will be independent of each other. So, let Xi represent the profit (or loss) for the i-th house which you are currently renovating. This profit or loss is random, since you don't know exactly what it will be until you sell. Assume that you currently have n houses which you are renovating. Then Z = X1 + ... + Xn is your total profit or loss with the n houses you are currently holding. This is a random variable, since its outcome is not known in advance.

So, again, it is all based on the following two equations, which hold when the $X_i$ are independent and follow the same model:

\[ \sigma_{X_1 + \dots + X_n} = \sigma_{X_1} \sqrt{n} \]

\[ E[X_1 + \dots + X_n] = nE[X_1] \]

So for example with n = 1000000, we get

\[ \sigma_Z = 1000\,\sigma_{X_i} \]

whilst

\[ E[Z] = 1000000\,E[X_i] \]

so $\sigma_Z$ becomes negligible compared to E[Z]. So, if we think that most of the time a variable is within two standard deviations of its expectation, we find

\[ Z \approx 1000000\,E[X_1] \pm 2 \cdot 1000\,\sigma_{X_1} \]

so, compared to the order of magnitude of Z, the fluctuation becomes almost negligible!
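The scaling argument can be tabulated for a few portfolio sizes. The following Python sketch uses a hypothetical single-claim model (expectation 100, standard deviation 300; these numbers are assumptions chosen only for illustration) and applies the two formulas E[Z] = nE[X1] and σ_Z = √n σ_X1.

```python
from math import sqrt

def sum_of_iid(n, mean_one, sd_one):
    """Expectation and standard deviation of Z = X_1 + ... + X_n for i.i.d. terms."""
    return n * mean_one, sqrt(n) * sd_one

# Hypothetical single-claim model: expectation 100, standard deviation 300.
for n in (1, 100, 10_000, 1_000_000):
    mean_z, sd_z = sum_of_iid(n, 100.0, 300.0)
    # The relative fluctuation sd_z / mean_z shrinks like 1 / sqrt(n).
    print(n, mean_z, sd_z, sd_z / mean_z)
```

With a single client the fluctuation is three times the expected claim; with a million clients it is 0.3% of it.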

Two random variables are dependent when their joint distribution is not simply the product of their marginal distributions. But the degree of dependence can vary from strong dependence to loose dependence. One measure of the degree of dependence of random variables is the covariance. For random variables X and Y we define the covariance as follows:

\[ \mathrm{COV}[X, Y] = E[(X - E[X])(Y - E[Y])] \]

Lemma:

For random variables X and Y there is also another equivalent formula for the covariance:

\[ \mathrm{COV}[X, Y] = E[XY] - E[X]E[Y] \]

Proof:

\[
\begin{aligned}
E[(X - E[X])(Y - E[Y])] &= E[XY - YE[X] - XE[Y] + E[X]E[Y]] \\
&= E[XY] - E[X]E[Y] - E[X]E[Y] + E[X]E[Y] \\
&= E[XY] - E[X]E[Y]
\end{aligned}
\]

Lemma:

For independent random variables X and Y,

COV [X, Y ] = 0

Proof:

\[ \mathrm{COV}[X, Y] = E[XY] - E[X]E[Y] \]

For independent X and Y, E[XY ] = E[X]E[Y ]. Hence COV [X, Y ] = 0

Lemma:

COV [X, X] = V AR[X]

Proof:

\[ \mathrm{COV}[X, X] = E[X^2] - E[X]^2 = \mathrm{VAR}[X] \]

Lemma:

Assume that a is a constant and let X and Y be two random variables. Then

COV [X + a, Y ] = COV [X, Y ]

Proof:

\[
\begin{aligned}
\mathrm{COV}[X + a, Y] &= E[(X + a - E[X + a])(Y - E[Y])] \\
&= E[YX + Ya - YE[X + a] - XE[Y] - aE[Y] + E[Y]E[X + a]] \\
&= E[XY] + aE[Y] - E[Y]E[X + a] - E[X]E[Y] - aE[Y] + E[Y]E[X + a] \\
&= E[XY] - E[X]E[Y] \\
&= \mathrm{COV}[X, Y]
\end{aligned}
\]

Lemma:

Let a be a constant and let X and Y be random variables. Then

COV [aX, Y ] = aCOV [X, Y ]

Proof:

\[
\begin{aligned}
\mathrm{COV}[aX, Y] &= E[(aX - E[aX])(Y - E[Y])] \\
&= E[aXY - YE[aX] - aXE[Y] + E[aX]E[Y]] \\
&= aE[XY] - aE[X]E[Y] - aE[X]E[Y] + aE[X]E[Y] \\
&= aE[XY] - aE[X]E[Y] \\
&= a(E[XY] - E[X]E[Y]) \\
&= a\,\mathrm{COV}[X, Y]
\end{aligned}
\]

Lemma:

For any random variables X, Y and Z we have:

COV [Z + X, Y ] = COV [Z, Y ] + COV [X, Y ]

Proof:

\[
\begin{aligned}
\mathrm{COV}[Z + X, Y] &= E[(X + Z - E[X + Z])(Y - E[Y])] \\
&= E[YX] + E[YZ] - E[X]E[Y] - E[Z]E[Y] \\
&= E[YX] - E[X]E[Y] + E[YZ] - E[Z]E[Y] \\
&= \mathrm{COV}[X, Y] + \mathrm{COV}[Z, Y]
\end{aligned}
\]

6.1

Correlation

\[ \mathrm{COR}[X, Y] = \frac{\mathrm{COV}[X, Y]}{\sqrt{\mathrm{VAR}[X]\,\mathrm{VAR}[Y]}} \]

One can prove that the correlation is always between −1 and 1. When the variables are independent, the correlation is zero. The correlation is one when Y can be written as Y = a + bX, where a, b are constants such that b > 0. If the correlation is −1, then the variable Y can be written as Y = a + bX, where b is a negative constant and a is any constant.

An important property of the correlation is that when we multiply a variable by a positive constant, the correlation does not change: COR[aX, Y] = COR[X, Y] for a > 0. (For a < 0 only the sign of the correlation flips.) This implies that a change of units does not affect correlation.
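The covariance lemmas and the unit-invariance of the correlation can be verified on a small joint distribution. The sketch below is in Python; the joint table `joint` and the constants 5 and 3 are hypothetical values chosen for illustration, and covariance is computed through the formula COV[U, V] = E[UV] − E[U]E[V].

```python
from fractions import Fraction
from math import sqrt

# Hypothetical joint distribution of (X, Y), given as {(x, y): probability}.
joint = {
    (0, 0): Fraction(3, 10),
    (0, 1): Fraction(1, 10),
    (1, 0): Fraction(2, 10),
    (1, 1): Fraction(4, 10),
}

def expectation(g, dist):
    """Exact E[g(X, Y)] over the joint distribution."""
    return sum(g(x, y) * p for (x, y), p in dist.items())

def cov(u, v, dist):
    """COV[U, V] = E[UV] - E[U]E[V], where U = u(X, Y) and V = v(X, Y)."""
    e_uv = expectation(lambda x, y: u(x, y) * v(x, y), dist)
    return e_uv - expectation(u, dist) * expectation(v, dist)

def cor(u, v, dist):
    """COR[U, V] = COV[U, V] / sqrt(VAR[U] VAR[V]), using VAR[U] = COV[U, U]."""
    return cov(u, v, dist) / sqrt(cov(u, u, dist) * cov(v, v, dist))

def X(x, y):
    return x

def Y(x, y):
    return y

def X_rescaled(x, y):
    return 5 * x + 3   # change of units: aX + b with a = 5 > 0 and b = 3

print(cov(X, Y, joint), cov(X_rescaled, Y, joint))   # second is 5 times the first
print(cor(X, Y, joint), cor(X_rescaled, Y, joint))   # both agree
```

Shifting by 3 leaves the covariance untouched and scaling by 5 multiplies it by 5, while the correlation is unchanged by both.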

For this, assume the dividend paid (per share) next year to be a random variable X. Let the expected amount of money paid be equal to E[X] = 2 dollars. Then the probability that the dividend pays more than 100 dollars cannot be more than 2/100 = E[X]/100, since otherwise the expectation would have to be bigger than 2. In other words, for a random variable X which cannot take negative values, the probability that the random variable is bigger than a is at most E[X]/a. This is the content of the next lemma:

Lemma 7.1 Assume that a > 0 is a constant and let X be a random variable taking on only non-negative values, i.e. $P(X \ge 0) = 1$. Then,

\[ P(X \ge a) \le \frac{E[X]}{a}. \]

Proof. To simplify the notation, we assume that the variable takes on only integer values. The result remains valid otherwise. We have that

\[ E[X] = 0 \cdot P(X = 0) + 1 \cdot P(X = 1) + 2 \cdot P(X = 2) + 3 \cdot P(X = 3) + \dots \tag{7.1} \]

Note that the sum on the right side of the above equality contains only non-negative terms. If we leave out some of these terms, the value can only decrease or stay equal. We are going to keep just the terms $x \cdot P(X = x)$ for x greater than or equal to a. This way equation 7.1 becomes

\[ E[X] \ge x_a P(X = x_a) + (x_a + 1) P(X = x_a + 1) + (x_a + 2) P(X = x_a + 2) + \dots \tag{7.2} \]

where $x_a$ denotes the smallest natural number which is larger than or equal to a. Note that $x_a + i \ge a$ for any natural number i. With this we obtain that the right side of 7.2 is larger than or equal to

\[ a\left(P(X = x_a) + P(X = x_a + 1) + P(X = x_a + 2) + \dots\right) = aP(X \ge a), \]

and hence

\[ E[X] \ge aP(X \ge a). \]

The last inequality above implies:

\[ P(X \ge a) \le \frac{E[X]}{a}. \]

The inequality given in the last lemma is called the Markov inequality. It is very useful: in many real-world situations it is difficult to estimate all the probabilities (the probability distribution) for a random variable. However, it might be easier to estimate the expectation, since that is just one number. If we know the expectation of a random variable, we can at least get upper bounds on the probability to be far away from the expectation.

Let us next present the Chebycheff inequality:

Lemma 7.2 If X is a random variable with expectation E[X] and variance VAR[X] and a > 0 is a non-random number, then

\[ P(|X - E[X]| \ge a) \le \frac{\mathrm{VAR}[X]}{a^2}. \]

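Proof. Note that $|X - E[X]| \ge a$ implies $(X - E[X])^2 \ge a^2$ and vice versa. Hence,

\[ P(|X - E[X]| \ge a) = P((X - E[X])^2 \ge a^2). \tag{7.3} \]

Let $Y := (X - E[X])^2$. Then equality 7.3 reads

\[ P(|X - E[X]| \ge a) = P(Y \ge a^2). \tag{7.4} \]

Since Y takes only non-negative values, the Markov inequality gives

\[ P(Y \ge a^2) \le \frac{E[Y]}{a^2} = \frac{E[(X - E[X])^2]}{a^2} = \frac{\mathrm{VAR}[X]}{a^2}. \]

Using the last chain of inequalities above with equalities 7.3 and 7.4 we find

\[ P(|X - E[X]| \ge a) \le \frac{\mathrm{VAR}[X]}{a^2}. \]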

Let us consider one more example. Assume the total expected claim at the end of next year for an insurance company is 1 000 000 dollars. What is the risk that the insurance company has to pay more than 5 000 000 dollars as total claim at the end of next year? The answer goes as follows: let Z be the total claim at the end of next year. By the Markov inequality, we find

\[ P(Z \ge 5\,000\,000) \le \frac{E[Z]}{5\,000\,000} = \frac{1}{5} = 20\%. \]

Hence, we know that the probability to have to pay more than five million is at most 20%. To derive this, the only information needed was the expectation of Z.

When the standard deviation is also available, one can usually get better bounds using the Chebycheff inequality. Assume in the example above that the expected total claim is as before, but let the standard deviation of the total claim be one million. Then we have

\[ \mathrm{VAR}[Z] = (1\,000\,000)^2. \]

Note that for Z to be above 5 000 000 we need Z − E[Z] to be above 4 000 000. Hence,

\[ P(Z \ge 5\,000\,000) = P(Z - E[Z] \ge 4\,000\,000) \le P(|Z - E[Z]| \ge 4\,000\,000). \]

Using Chebycheff, we get

\[ P(|Z - 1\,000\,000| \ge 4\,000\,000) \le \frac{\mathrm{VAR}[Z]}{(4\,000\,000)^2} = \frac{1}{16} = 0.0625. \]

It follows that the probability that the total claim is above five million is at most 6.25 percent. This is a lot less than the bound we had found using Markov's inequality.
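The two bounds of the insurance example can be computed side by side. A minimal Python sketch follows; the function names are illustrative, and capping the bounds at 1 is an added convenience (a probability bound above 1 carries no information), not something stated in the lemmas.

```python
def markov_bound(expectation, a):
    """Upper bound on P(X >= a) for a non-negative variable X and a > 0."""
    return min(1.0, expectation / a)

def chebycheff_bound(variance, a):
    """Upper bound on P(|X - E[X]| >= a) for a > 0."""
    return min(1.0, variance / a ** 2)

expected_claim = 1_000_000   # E[Z] from the example
sd_claim = 1_000_000         # standard deviation of Z from the example
threshold = 5_000_000

print(markov_bound(expected_claim, threshold))
print(chebycheff_bound(sd_claim ** 2, threshold - expected_claim))
```

Note that the Chebycheff bound is applied to the distance threshold − E[Z] = 4 000 000, exactly as in the derivation above.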

Combinatorics

Let

\[ \Omega = \{x_1, x_2, \dots, x_s\} \]

denote the state space of a random experiment. Let each possible outcome have the same probability. Let E be an event. Then,

\[ P(E) = \frac{\text{number of outcomes in } E}{\text{total number of outcomes}} = \frac{|E|}{s}. \]

To see why, recall that

\[ P(\Omega) = 1. \]

Now

\[ P(\Omega) = P(X \in \{x_1, \dots, x_s\}) = P(\{X = x_1\} \cup \dots \cup \{X = x_s\}) = \sum_{t=1,\dots,s} P(X = x_t). \]

Since all outcomes have the same probability,

\[ \sum_{t=1,\dots,s} P(X = x_t) = sP(X = x_1). \]

Thus,

\[ P(X = x_1) = \frac{1}{s}. \]

Now if:

\[ E = \{y_1, \dots, y_j\} \]

We find that:

\[ P(E) = P(X \in E) = P(\{X = y_1\} \cup \dots \cup \{X = y_j\}) = \sum_{i=1}^{j} P(X = y_i) = \frac{j}{s}. \]

Next we present one of the main principles used in combinatorics:

Lemma 8.1 Let m1 , m2 , . . . , mr denote a given finite sequence of natural numbers. Assume that we have to make a sequence of r choices. At the s-th choice, assume that we

have ms possibilities to choose from. Then the total number of possibilities is:

\[ m_1 \cdot m_2 \cdots m_r \]

Why this lemma holds can best be understood when thinking of a tree, where at each node which is s − 1 steps away from the root we have $m_s$ new branches.

Example 8.1 Assume we first throw a coin with a side 0 and a side 1. Then we throw a four-sided die. Eventually we throw the coin again. For example, we could get the number 031. How many different numbers are there which we could get? The answer is: first we have two possibilities. For the second choice we have four, and eventually we have again two. Thus, $m_1 = 2$, $m_2 = 4$, $m_3 = 2$. This implies that the total number of possibilities is:

\[ m_1 \cdot m_2 \cdot m_3 = 2 \cdot 4 \cdot 2 = 16. \]
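Example 8.1 can be checked by listing every outcome explicitly. A minimal Python sketch, assuming the faces of the four-sided die are labelled 1 to 4 (the notes do not state the labels):

```python
from itertools import product

coin = "01"     # the two sides of the coin
die = "1234"    # faces of the four-sided die (labels are an assumption)

# Each outcome is one choice per stage: coin, then die, then coin.
outcomes = ["".join(choice) for choice in product(coin, die, coin)]

print(len(outcomes), len(coin) * len(die) * len(coin))
print("031" in outcomes)
```

The enumeration walks the same tree as in the explanation of Lemma 8.1, with 2, 4 and 2 branches at the three stages.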

Recall that the product of all natural numbers which are less than or equal to k is denoted by k!. The number k! is called k-factorial.

Lemma 8.2 There are

\[ k! \]

possibilities to put k different objects in a linear order. Thus there are k! permutations of k elements.

To realize why the last lemma above holds, we use lemma 8.1. To place k different objects in a row, we first choose the first object which we will place down. For this we have k possibilities. For the second object, there remain k − 1 objects to choose from. For the third, there are k − 2 possibilities to choose from. And so on and so forth. This then gives that the total number of possibilities is equal to $k \cdot (k - 1) \cdots 2 \cdot 1$.

Lemma 8.3 There are:

\[ \frac{n!}{(n - k)!} \]

possibilities to pick k out of n different objects, when the order in which we pick them matters.

For the first object, we have n possibilities. For the second object we pick, we have n − 1 remaining objects to choose from. For the last object which we pick (that is, the k-th which we pick), we have n − k + 1 remaining objects to choose from. Thus the total number of possibilities is equal to:

\[ n \cdot (n - 1) \cdots (n - k + 1) \]

which is equal to:

\[ \frac{n!}{(n - k)!}. \]

The number $\frac{n!}{(n-k)!}$ is also equal to the number of words of length k written with an n-letter alphabet, when we require that the words never contain the same letter twice.

Lemma 8.4 There are:

\[ \frac{n!}{k!\,(n - k)!} \]

subsets of size k in a set of size n.

The reason why the last lemma holds is the following: there are k! ways of putting a given subset of size k into different orders. Thus, there are k! times more ways to pick k elements in order than there are subsets of size k.

Lemma 8.5 There are:

\[ 2^n \]

subsets of any size in a set of size n.

The reason why the last lemma above holds is the following: we can identify the subsets with binary vectors with n entries. For example, let n = 5. Let the set we consider be {1, 2, 3, 4, 5}. Take the binary vector:

(1, 1, 1, 0, 0).

This vector would correspond to the subset containing the first three elements of the set, thus to the subset:

{1, 2, 3}.

So, for every non-zero entry in the vector we pick the corresponding element in the set. It is clear that this correspondence between subsets of a set of size n and binary vectors of dimension n is one to one. Thus, there is the same number of subsets as there are binary vectors of length n. The total number of binary vectors of dimension n, however, is $2^n$.
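The counting formulas of this section can be compared against brute-force enumeration. The Python sketch below uses the standard-library helpers `math.factorial`, `math.perm` and `math.comb` (the latter two require Python 3.8 or newer); the values n = 5 and k = 3 are arbitrary illustrative choices.

```python
from itertools import combinations, permutations
from math import comb, factorial, perm

n, k = 5, 3
items = range(1, n + 1)

orderings = list(permutations(items))             # all linear orders: n!
ordered_picks = list(permutations(items, k))      # ordered picks: n! / (n - k)!
subsets_of_size_k = list(combinations(items, k))  # subsets of size k: n! / (k! (n - k)!)
all_subsets = [s for size in range(n + 1)
               for s in combinations(items, size)]  # subsets of any size: 2^n

print(len(orderings), factorial(n))
print(len(ordered_picks), perm(n, k))
print(len(subsets_of_size_k), comb(n, k))
print(len(all_subsets), 2 ** n)
```

Each printed pair should agree, the first number coming from enumeration and the second from the closed formula.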

9.1

Bernoulli variable

Let a coin have a side 0 and a side 1. Let p be the probability of side 1 and 1 − p be the probability of side 0. Let X designate the random number we obtain when we flip this coin. Thus, with probability p the random variable X takes on the value 1 and with probability 1 − p it takes on the value 0. The random variable X is called a Bernoulli variable with parameter p. It is named after the famous Swiss mathematician Bernoulli.

For a Bernoulli variable X with parameter p we have:

• E[X] = p.

• VAR[X] = p(1 − p).

Let us show this:

\[ E[X] = 1 \cdot p + 0 \cdot (1 - p) = p. \]

For the variance we find:

\[ \mathrm{VAR}[X] = E[X^2] - (E[X])^2 = 1^2 \cdot p + 0^2 \cdot (1 - p) - (E[X])^2 = p - p^2 = p(1 - p). \]

9.2

Again, let a coin have a side 0 and a side 1. Let p be the probability of side 1 and 1 − p be the probability of side 0. We toss this coin independently n times and count the number of 1s observed. The number Z of 1s observed after n coin tosses is equal to

\[ Z := X_1 + X_2 + \dots + X_n \]

where $X_i$ designates the result of the i-th toss. (Hence the $X_i$ are independent Bernoulli variables with parameter p.) The random variable Z is called a binomial variable with parameters p and n. For the binomial random variable with parameters p and n we find:

• E[Z] = np

• VAR[Z] = np(1 − p)

• For k ≤ n, we have: $P(Z = k) = \binom{n}{k} p^k (1 - p)^{n-k}$.

Let us show the above statements:

\[ E[Z] = E[X_1 + \dots + X_n] = E[X_1] + \dots + E[X_n] = nE[X_1] = np. \]

Also:

\[ \mathrm{VAR}[Z] = \mathrm{VAR}[X_1 + \dots + X_n] = \mathrm{VAR}[X_1] + \dots + \mathrm{VAR}[X_n] = n\,\mathrm{VAR}[X_1] = np(1 - p). \]

Let us calculate next the probability P(Z = k). We start with an example. Take n = 3 and k = 2. We want to calculate the probability to observe exactly two ones among the first three coin tosses. To observe exactly two ones out of three successive trials there are exactly three possibilities:

• Let A be the event: X1 = 1, X2 = 1, X3 = 0

• Let B be the event: X1 = 1, X2 = 0, X3 = 1

• Let C be the event: X1 = 0, X2 = 1, X3 = 1.

Each of these possibilities has probability $p^2(1 - p)$. As a matter of fact, since the trials are independent we have for example:

\[ P(X_1 = 1, X_2 = 1, X_3 = 0) = P(X_1 = 1)P(X_2 = 1)P(X_3 = 0) = p^2(1 - p). \]

The three different possibilities are disjoint from each other. Thus,

\[ P(Z = 2) = P(A \cup B \cup C) = P(A) + P(B) + P(C) = 3p^2(1 - p). \]

Here 3 is the number of realizations where we have exactly two ones within the first three coin tosses. This is equal to the number of different ways there are to choose two different objects out of three items. In other words, the number three stands in our formula for 3 choose 2.

We can now generalize to n trials and a number k ≤ n. There are n choose k possible outcomes for which among the first n coin tosses there appear exactly k ones. Each of these outcomes has probability:

\[ p^k (1 - p)^{n-k}. \]

This gives then:

\[ P(Z = k) = \binom{n}{k} p^k (1 - p)^{n-k}. \]
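The binomial formulas can be checked numerically, including the n = 3, k = 2 case worked out above. This is a minimal Python sketch; the value p = 0.3 is an arbitrary illustrative choice and `binomial_pmf` is a hypothetical helper name.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(Z = k) for a binomial variable Z with parameters n and p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 3, 0.3
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
mean = sum(k * q for k, q in enumerate(pmf))
variance = sum(k ** 2 * q for k, q in enumerate(pmf)) - mean ** 2

print(pmf[2], 3 * p ** 2 * (1 - p))   # P(Z = 2) against 3 p^2 (1 - p)
print(sum(pmf), mean, variance)       # compare with 1, n p and n p (1 - p)
```

The mean and variance are computed directly from the probabilities, so agreement with np and np(1 − p) is an independent check of the two derivations via sums of Bernoulli variables.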

9.3

Again, let a coin have a side 0 and a side 1. Let p be the probability of side 1 and 1 − p be the probability of side 0. We toss this coin independently many times. Let $X_i$ designate the result of the i-th coin toss. Let T designate the number of trials it takes until we first observe a 1. For example, if we have:

X1 = 0, X2 = 0, X3 = 1

we would have that T = 3. If we observe on the other hand:

X1 = 0, X2 = 1

we have that T = 2. T is a random variable. As we are going to show, we have:

• For k > 0, we have $P(T = k) = p(1 - p)^{k-1}$.

• E[T] = 1/p

• VAR[T] = $(1 - p)/p^2$

A random variable T for which $P(T = k) = p(1 - p)^{k-1}$, $k \in \mathbb{N}$, is called a geometric random variable with parameter p. Let us next prove the above statements: for T to be equal to k we need to observe k − 1 times a zero followed by a one. Thus:

\[ P(T = k) = P(X_1 = 0, X_2 = 0, \dots, X_{k-1} = 0, X_k = 1) = P(X_1 = 0) \cdot P(X_2 = 0) \cdots P(X_{k-1} = 0) \cdot P(X_k = 1) = (1 - p)^{k-1} p. \]
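The three statements about the geometric variable can be checked numerically. Here is a minimal Python sketch, assuming the infinite series are truncated after a fixed number of terms; the truncation length, the value p = 0.25 and the function names are illustrative choices.

```python
def geometric_pmf(k, p):
    """P(T = k): k - 1 zeros followed by a one."""
    return p * (1 - p) ** (k - 1)

def truncated_moments(p, terms=10_000):
    """Approximate E[T] and VAR[T] by summing the first `terms` terms of the series."""
    mean = sum(k * geometric_pmf(k, p) for k in range(1, terms + 1))
    second = sum(k ** 2 * geometric_pmf(k, p) for k in range(1, terms + 1))
    return mean, second - mean ** 2

p = 0.25
mean, variance = truncated_moments(p)
print(mean, 1 / p)                   # compare with E[T] = 1/p
print(variance, (1 - p) / p ** 2)    # compare with VAR[T] = (1 - p)/p^2
```

Because the terms decay geometrically, the truncation error is far below floating-point precision for this choice of p.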

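Let us calculate the expectation of T. We find:

\[ E[T] = \sum_{k=1}^{\infty} k p (1 - p)^{k-1}. \]

Consider the function

\[ x \mapsto f(x) = \sum_{k=1}^{\infty} k x^{k-1}. \]

We have that

\[ f(x) = \sum_{k=1}^{\infty} \frac{d(x^k)}{dx} = \frac{d\left(\sum_{k=1}^{\infty} x^k\right)}{dx} = \frac{d\left(x/(1 - x)\right)}{dx} = \frac{1}{1 - x} + \frac{x}{(1 - x)^2} = \frac{1}{(1 - x)^2}. \tag{9.1} \]

Thus,

\[ \sum_{k=1}^{\infty} k (1 - p)^{k-1} = f(1 - p) = \frac{1}{p^2} \]

and

\[ E[T] = p \sum_{k=1}^{\infty} k (1 - p)^{k-1} = p \cdot \frac{1}{p^2} = \frac{1}{p}. \]

Next,

\[ E[T^2] = \sum_{k=1}^{\infty} k^2 p (1 - p)^{k-1}. \]

Consider the function

\[ x \mapsto g(x) = \sum_{k=1}^{\infty} k^2 x^{k-1}. \]

We find:

\[ g(x) = \sum_{k=1}^{\infty} k \frac{d(x^k)}{dx} = \frac{d\left(x \sum_{k=1}^{\infty} k x^{k-1}\right)}{dx} = \frac{d\left(x/(1 - x)^2\right)}{dx} = \frac{1 + x}{(1 - x)^3}. \]

Hence

\[ E[T^2] = p\, g(1 - p) = \frac{2 - p}{p^2}. \]

Now,

\[ \mathrm{VAR}[T] = E[T^2] - (E[T])^2 = \frac{2 - p}{p^2} - \left(\frac{1}{p}\right)^2 = \frac{1 - p}{p^2}. \]

10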

So far we have only been studying discrete random variables. Let us see how continuous

random variables are defined.


Definition 10.1 Let X be a number generated by a random experiment. (Such a random number is also called a random variable.) X is a continuous random variable if there exists a non-negative piecewise continuous function

\[ f : \mathbb{R} \to \mathbb{R}^+, \quad x \mapsto f(x) \]

such that for any interval $I = [i_1, i_2] \subset \mathbb{R}$ we have that:

\[ P(X \in I) = \int_I f(x)\,dx. \]

The function f(.) is called the density function of X or simply the density of X.

Note that the notation $\int_I f(x)\,dx$ stands for:

\[ \int_I f(x)\,dx = \int_{i_1}^{i_2} f(x)\,dx. \tag{10.1} \]

Recall also that integrals like the one appearing in equation 10.1 are defined to be equal to the area under the curve f(.) and above the interval I.

Remark 10.1 Let f(.) be a piecewise continuous function from R into R. Then there exists a continuous random variable X such that f(.) is the density of X if and only if all of the following conditions are satisfied:

1. f is everywhere non-negative.

2. $\int_{\mathbb{R}} f(x)\,dx = 1$.

Examples of continuous random variables:

• The uniform variable in the interval I = [i1, i2], where i1 < i2. The density f(.) is equal to 1/|i2 − i1| everywhere in the interval I. Anywhere outside the interval I, f(.) is equal to zero.

• The standard normal variable has density:

\[ f(x) := \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}. \]

A standard normal random variable is often denoted by N(0, 1).

• Let $\mu \in \mathbb{R}$, $\sigma > 0$ be given numbers. The density of the normal variable with expectation $\mu$ and standard deviation $\sigma$ is defined to be equal to:

\[ f(x) := \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu)^2/(2\sigma^2)}. \]

11

\[ f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu)^2/(2\sigma^2)}. \]

Hence there are two parameters which determine a normal distribution: $\mu$ and $\sigma$. We write $N(\mu, \sigma)$ for a normal variable with parameters $\mu$ and $\sigma$.

If we analyze the density function f(x), we see that for any value a, we have that $f(\mu + a) = f(\mu - a)$. Hence the function f(.) is symmetric about the point $\mu$. This implies that the expected value has to be $\mu$:

\[ E[N(\mu, \sigma)] = \mu. \]

One could also show this by verifying that

\[ E[N(\mu, \sigma) - \mu] = \int_{-\infty}^{\infty} (x - \mu)\, \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu)^2/(2\sigma^2)}\,dx = 0. \]

By integration by parts, one can show that

\[ \int_{-\infty}^{\infty} (x - \mu)^2\, \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu)^2/(2\sigma^2)}\,dx = \sigma^2 \]

and hence the variance is $\sigma^2$:

\[ \mathrm{VAR}[N(\mu, \sigma)] = \sigma^2. \]

Note that the function f(x) decreases when we go away from $\mu$: the shape is a bell shape with maximum at $\mu$ and width $\sigma$. (Go onto the Internet to see the graph of a normal density plotted.)

Let us give next a few very useful facts about normal variables:

• Let a and b be two constants such that a ≠ 0. Let X be a normal variable with parameters $\mu_X$ and $\sigma_X$. Let Y be the random variable defined by affine transformation from X in the following way: Y := aX + b. Then Y is also normal. The parameters of Y are

\[ E[Y] = \mu_Y = a\mu_X + b \]

and

\[ \sigma_Y = |a|\,\sigma_X. \]

This we obtain simply from the fact that these parameters are the expectation and standard deviation of their respective variables.

• Let X and Y be normal variables independent of each other. Let Z := X + Y. Then Z is also normal. The same result is true for sums of more than two independent normal variables.

• If X is normal, then

\[ Z := \frac{X - E[X]}{\sqrt{\mathrm{VAR}[X]}} \]

is a standard normal.

For the last point above, note that for any random variable X (not necessarily normal) we have that if $Z = (X - E[X])/\sigma_X$, then Z has expectation zero and standard deviation 1. This is a simple, straightforward calculation:

\[ E[Z] = E\left[\frac{X - E[X]}{\sigma_X}\right] = \frac{1}{\sigma_X}\left(E[X] - E[E[X]]\right) \tag{11.1} \]

but since E[E[X]] = E[X], equality 11.1 implies that E[Z] = 0. Also

\[ \mathrm{VAR}[Z] = \mathrm{VAR}\left[\frac{X - E[X]}{\sigma_X}\right] = \frac{\mathrm{VAR}[X]}{\sigma_X^2} = 1. \]

Now if X is normal, then we saw that $Z = (X - E[X])/\sigma_X$ is also normal, since Z is just obtained from X by multiplying and adding constants. But Z has expectation 0 and standard deviation 1, and hence it is standard normal.

One can use normal variables to model financial processes and many others. Let us

consider an example. Assume that a portfolio consists of three stocks. Let Xi denote the

value of stock number i in one year from now. We assume that the three stocks in the

portfolio are

& all independent of each others and normally distributed so that i = E[Xi ]

and i = V AR[Xi ] for i = 1, 2, 3. Let

1 = 100, 2 = 110, 3 = 120

and let

1 = 10, 2 = 20, 3 = 20.

The value of the portfolio after one year is Z = X1 + X2 + X3 and E[Z] = E[X1 ]+ E[X2 ]+

E[X3 ] = 330. : Question: What is the probability that the value of the portfolio after a

year is above 360?

Answer: We have that

\[ VAR[Z] = VAR[X_1] + VAR[X_2] + VAR[X_3] = 100 + 400 + 400 = 900 \]

and hence

\[ \sigma_Z = \sqrt{VAR[Z]} = 30. \]

We are now going to calculate

\[ P(Z \ge 360). \]

For this we want to transform the probability into a probability involving a standard normal, since for the standard normal we have tables available. We find

\[ P(Z \ge 360) = P\left(\frac{Z - E[Z]}{\sigma_Z} \ge \frac{360 - E[Z]}{\sigma_Z}\right). \qquad (11.2) \]

Note that

\[ (360 - E[Z])/\sigma_Z = 1 \]

and also $(Z - E[Z])/\sigma_Z$ is standard normal. Using this in equation 11.2, we find that the probability that the portfolio after a year is above 360 is equal to

\[ P(Z \ge 360) = P(N(0,1) \ge 1) = 1 - \Phi(1) = 0.1587, \]

where $\Phi(1) = P(N(0,1) \le 1) = 0.8413$ can be found in a table for the standard normal.
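The portfolio calculation can also be checked numerically. The following is a minimal sketch, not part of the original notes: instead of a table it evaluates the standard normal distribution function through the error function from Python's `math` module, and the helper name `phi` is our own choice.

```python
import math

def phi(z):
    """Distribution function of the standard normal, written via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Parameters of the three independent normal stocks from the example.
mus = [100.0, 110.0, 120.0]
sigmas = [10.0, 20.0, 20.0]

mean_z = sum(mus)                               # E[Z] = 330
sd_z = math.sqrt(sum(s * s for s in sigmas))    # sqrt(900) = 30
threshold = 360.0

z_score = (threshold - mean_z) / sd_z           # standardized threshold
p_above = 1.0 - phi(z_score)                    # P(Z >= 360)

print(mean_z, sd_z, z_score, round(p_above, 4))
```

The printed probability should agree with the table value $1 - 0.8413$ up to rounding.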

12 Distribution functions

The distribution function $F_X$ of a random variable $X$ is defined in the following way:

\[ F_X(s) := P(X \le s) \]

for all $s \in \mathbb{R}$.

Let us next mention a few properties of the distribution function:

$F_X$ is an increasing function. This means that for any two numbers $s < t$ in $\mathbb{R}$, we have that $F_X(s) \le F_X(t)$.

$\lim_{s \to \infty} F_X(s) = 1$.

$\lim_{s \to -\infty} F_X(s) = 0$.

We leave the proof of the three facts above to the reader.

Imagine next that $X$ is a continuous random variable with density function $f_X$. Then we have for all $s \in \mathbb{R}$ that:

\[ F_X(s) = P(X \le s) = \int_{-\infty}^{s} f_X(t)\,dt. \]

Taking the derivative on both sides of the above equation, we find that:

\[ \frac{dF_X(s)}{ds} = f_X(s). \]

In other words, for a continuous random variable $X$, the derivative of the distribution function is equal to the density of $X$. Hence, in this case, the distribution function is differentiable and thus also continuous. Another implication is that the distribution function uniquely determines the density function of $X$. This implies that the distribution function uniquely determines all the probabilities of events which can be defined in terms of $X$.

Assume next that the random variable $X$ has a finite state space:

\[ \Omega_X = \{s_1, s_2, \ldots, s_r\} \]

such that $s_1 < s_2 < \ldots < s_r$. Then the distribution function $F_X$ is a step function. Left of $s_1$, $F_X$ is equal to zero. Right of $s_r$ it is equal to one. Between $s_i$ and $s_{i+1}$, that is on the interval $[s_i, s_{i+1}[$, the distribution function is constantly equal to:

\[ \sum_{j \le i} P(X = s_j). \]

To sum up: for continuous random variables the distribution functions are differentiable functions, whilst for discrete random variables the distribution functions are step functions.

Let us next show how we can use the distribution function to simulate random variables. The situation is the following: our computer can generate a uniform random variable $U$ in the interval $[0, 1]$. (This is a random variable with density equal to 1 in $[0, 1]$ and 0 everywhere else.) We want to generate a random variable with a given probability density function $f_X$, using $U$. We do this in the following manner: we plug the random number $U$ into the map $\mathrm{inv}F_X$. (Here $\mathrm{inv}F_X$ designates the inverse map of $F_X(\cdot)$.) The next lemma says that this method really produces a random variable with the desired density function.

Lemma 12.1 Let $f_X$ denote the density function of a continuous random variable and let $F_X$ designate its distribution function. Let $Y$ designate the random variable obtained by plugging the uniform random variable $U$ into the inverse distribution function:

\[ Y := \mathrm{inv}F_X(U). \]

Then, the density of $Y$ is equal to $f_X$.

Proof. $F_X(\cdot)$ is an increasing function. Thus for any number $s$ we have that

\[ Y \le s \]

is equivalent to

\[ F_X(Y) \le F_X(s). \]

Hence:

\[ P(Y \le s) = P(F_X(Y) \le F_X(s)). \]

Now, $F_X(Y) = U$, thus

\[ P(Y \le s) = P(U \le F_X(s)). \]

We know that $F_X(s) \in [0, 1]$. Using the fact that $U$ has density function equal to one in the interval $[0, 1]$, we find:

\[ P(U \le F_X(s)) = \int_0^{F_X(s)} 1\,dt = F_X(s). \]

Thus

\[ P(Y \le s) = F_X(s). \]

This shows that the distribution function $F_Y$ of $Y$ is equal to $F_X$. Taking the derivative with respect to $s$ of both $F_Y(s)$ and $F_X(s)$ yields:

\[ f_Y(s) = f_X(s). \]

Hence, $X$ and $Y$ have the same density function. This finishes the proof.
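To illustrate the lemma, here is a small sketch of the simulation method for one concrete density. The choice of the exponential density $f_X(s) = \lambda e^{-\lambda s}$ for $s \ge 0$ is our own assumption for the illustration (it is not taken from the notes); its distribution function $F_X(s) = 1 - e^{-\lambda s}$ has the explicit inverse $\mathrm{inv}F_X(u) = -\ln(1-u)/\lambda$. The function names and the seed are hypothetical.

```python
import math
import random

def inv_exponential_cdf(u, rate):
    """Inverse of F(s) = 1 - exp(-rate * s); valid for 0 <= u < 1."""
    return -math.log(1.0 - u) / rate

def sample_exponential(n, rate, rng):
    """Plug n uniform numbers from [0, 1) into the inverse distribution function."""
    return [inv_exponential_cdf(rng.random(), rate) for _ in range(n)]

rng = random.Random(0)          # fixed seed so the run is reproducible
rate = 2.0
samples = sample_exponential(100_000, rate, rng)

# Compare the simulated values with what the target distribution predicts.
sample_mean = sum(samples) / len(samples)                 # should be near 1/rate
s = 0.5
empirical_cdf = sum(1 for y in samples if y <= s) / len(samples)
exact_cdf = 1.0 - math.exp(-rate * s)                     # F_X(0.5)

print(sample_mean, empirical_cdf, exact_cdf)
```

If the method works as the lemma claims, the empirical frequency of $\{Y \le s\}$ should be close to $F_X(s)$ for any $s$ we try.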

13

variables

Definition 13.1 Let $X$ be a continuous random variable with density function $f_X(\cdot)$. Then, we define the expectation $E[X]$ of $X$ to be:

\[ E[X] := \int_{-\infty}^{\infty} s f_X(s)\,ds. \]

Next we are going to prove that the law of large numbers also holds for continuous random variables.

Theorem 13.1 Let $X_1, X_2, \ldots$ be a sequence of i.i.d. continuous random variables all with the same density function $f_X(\cdot)$. Then,

\[ \lim_{n \to \infty} \frac{X_1 + X_2 + \ldots + X_n}{n} = E[X_1]. \]

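Proof. Let $\varepsilon > 0$ be a fixed number. Let us approximate the continuous variables $X_i$ by discrete variables $X_i^\varepsilon$. For this we let $X_i^\varepsilon$ be the largest integer multiple of $\varepsilon$ which is still smaller than or equal to $X_i$. In this way, we always get that

\[ |X_i - X_i^\varepsilon| < \varepsilon. \]

This implies that:

\[ \left| \frac{X_1 + X_2 + \ldots + X_n}{n} - \frac{X_1^\varepsilon + X_2^\varepsilon + \ldots + X_n^\varepsilon}{n} \right| < \varepsilon. \]

However, the variables $X_i^\varepsilon$ are discrete. So for them the law of large numbers has already been proven, and we find:

\[ \lim_{n \to \infty} \frac{X_1^\varepsilon + X_2^\varepsilon + \ldots + X_n^\varepsilon}{n} = E[X_1^\varepsilon]. \]

We have that

\[ E[X_i^\varepsilon] = \sum_{z \in \mathbb{Z}} z\varepsilon \, P(X_i^\varepsilon = z\varepsilon). \qquad (13.1) \]

However, by definition:

\[ P(X_i^\varepsilon = z\varepsilon) = P(X_i \in [z\varepsilon, (z+1)\varepsilon[). \]

The expression on the right side of the last equality is equal to

\[ \int_{z\varepsilon}^{(z+1)\varepsilon} f_X(s)\,ds. \]

Thus

\[ E[X_i^\varepsilon] = \sum_{z \in \mathbb{Z}} z\varepsilon \int_{z\varepsilon}^{(z+1)\varepsilon} f_X(s)\,ds. \]

As $\varepsilon$ tends to zero, the expression on the right side of the last equality above tends to:

\[ \int_{-\infty}^{\infty} s f_X(s)\,ds. \]

This implies that by taking $\varepsilon$ fixed and sufficiently small, we have that, for large enough $n$, the fraction

\[ \frac{X_1 + X_2 + \ldots + X_n}{n} \]

is as close as we want to

\[ \int_{-\infty}^{\infty} s f_X(s)\,ds. \]

Hence

\[ \frac{X_1 + X_2 + \ldots + X_n}{n} \]

actually converges to

\[ \int_{-\infty}^{\infty} s f_X(s)\,ds. \]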
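The statement of the law of large numbers for continuous variables can be observed in a quick simulation. This is only a sketch under our own assumptions: we take the $X_i$ uniform on $[0, 1]$, so that the density is 1 on that interval and $\int s f_X(s)\,ds = 1/2$; the function name and the seed are hypothetical.

```python
import random

def average_of_uniforms(n, rng):
    """Average of n i.i.d. uniform draws on [0, 1]; here E[X_1] = 1/2."""
    return sum(rng.random() for _ in range(n)) / n

rng = random.Random(1)          # fixed seed so the run is reproducible

# The averages should drift towards 1/2 as n grows.
averages = {n: average_of_uniforms(n, rng) for n in (10, 1_000, 100_000)}
for n, avg in averages.items():
    print(n, avg)
```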

The linearity of expectation holds in the same way as for discrete random variables.

This is the content of the next lemma.

Lemma 13.1 Let X and Y be two continuous random variables and let a be a number.

Then

E[X + Y ] = E[X] + E[Y ]

and

E[aX] = aE[X]

Proof. The proof goes as in the discrete case: the only thing used for the proof in the discrete case is the law of large numbers. Since the law of large numbers also holds in the continuous case, exactly the same proof works for the continuous case.

14

The Central Limit Theorem (CLT) is one of the most important theorems in probability. Roughly speaking, it says that if we build the sum of many independent random variables, no matter what these little contributions are, we will always get approximately a normal distribution. This is very important in everyday life, because you often have situations where a lot of little independent things add up. So you end up observing something which is approximately a normal random variable. For example, when you make a measurement you are in this situation most of the time, that is, whenever you don't make one big measurement error. In that case, you have a lot of little imprecisions which add up to give you your measurement error. Most of the time, these imprecisions can be seen as close to being independent of each other. This then implies: unless you make one big error, you will end up with a measurement error which is close to a normal variable.

Let $X_1, X_2, X_3, \ldots$ be a sequence of independent, identically distributed random variables. (This means that they are the outcome of the same random experiment repeated several times independently.) Let $\mu$ denote the expectation $\mu := E[X_1]$ and let $\sigma$ denote the standard deviation $\sigma := \sqrt{VAR[X_1]}$. Let $Z$ denote the sum

\[ Z := X_1 + X_2 + X_3 + \ldots + X_n. \]

Then, by the calculation rules we learned for expectation and variance, it follows that:

\[ E[Z] = n\mu \]

and the standard deviation $\sigma_Z$ of $Z$ is equal to:

\[ \sigma_Z = \sigma\sqrt{n}. \]

When you subtract from a random variable its mean and divide by its standard deviation, you always get a new variable with zero expectation and variance equal to one. Thus the standardized sum:

\[ \frac{Z - n\mu}{\sigma\sqrt{n}} \]

has expectation zero and standard deviation 1. The central limit theorem says that on top of this, for large $n$, the expression

\[ \frac{Z - n\mu}{\sigma\sqrt{n}} \]

is close to being a standard normal variable. Let us now formulate the central limit theorem:

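Theorem 14.1 Let

\[ X_1, X_2, X_3, \ldots \]

be a sequence of independent, identically distributed random variables with $E[X_1] = \mu$ and $\sqrt{VAR[X_1]} = \sigma$. Then, for large $n$, the normalized sum $Y$:

\[ Y := \frac{X_1 + \ldots + X_n - n\mu}{\sigma\sqrt{n}} \]

is close to being a standard normal random variable.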

This version of the Central Limit Theorem is not yet very precise. As a matter of fact, what does "close to being a standard normal random variable" mean? We certainly understand what it means for two points to be close to each other. But we have not yet discussed the concept of closeness for random variables. Let us do this by using the example of a six-sided die. Let us assume that we have a six-sided die which is not perfectly symmetric. For $i \in \{1, 2, \ldots, 6\}$, let $p_i$ denote the probability of side $i$:

\[ P(X = i) = p_i \]

where $X$ denotes the number which we get when we throw this die once. A perfectly symmetric die would have the probabilities $p_i$ all equal to $1/6$. Say our die is not exactly symmetric but close to a perfectly symmetric die. What does this mean? This means that for all $i \in \{1, 2, \ldots, 6\}$ we have that $p_i$ is close to $1/6$.

For the die example we have a finite number of outcomes. For a continuous random variable, on the other hand, we are interested in the probabilities of intervals. By this I mean that we are interested, for a given interval $I$, in the probability that the random experiment gives a result in $I$. If $X$ denotes our continuous random variable, this means that we are interested in the probabilities of type:

\[ P(X \in I). \]

We are now ready to explain what we mean by: two continuous random variables $X$ and $Y$ have their probability laws close to each other. By "$X$ and $Y$ are close" (have probability laws which are close to each other) we mean: for each interval $I$ we have that the real number $P(Y \in I)$ is close to the real number $P(X \in I)$. For the interval $I = [i_1, i_2]$ with $i_1 < i_2$, we have that

\[ P(X \in I) = P(X \le i_2) - P(X < i_1). \]

It follows that if we know all the probabilities for semi-infinite intervals, we can determine the probabilities of type $P(X \in I)$. Thus, for two continuous random variables $X$ and $Y$ to be close to each other (with respect to their probability law), it is enough to ask that for all $x \in \mathbb{R}$ the real number $P(X \le x)$ is close to the real number $P(Y \le x)$.

Now that we have clarified the concept of closeness in distribution for continuous random variables, we are ready to formulate the CLT in a more precise way. Saying that

\[ Z := \frac{X_1 + \ldots + X_n - n\mu}{\sigma\sqrt{n}} \]

is close to a standard normal random variable $N(0, 1)$ means that for every $z \in \mathbb{R}$ we have that:

\[ P(Z \le z) \]

is close to

\[ P(N(0, 1) \le z). \]

In other words, as $n$ goes to infinity, $P(Z \le z)$ converges to $P(N(0, 1) \le z)$. Let us give a more precise version of the CLT than what we have done so far:

Theorem 14.2 Let

\[ X_1, X_2, X_3, \ldots \]

be a sequence of independent, identically distributed random variables. Let $E[X_1] = \mu$ and $\sqrt{VAR[X_1]} = \sigma$. Then, for any $z \in \mathbb{R}$, we have that:

\[ \lim_{n \to \infty} P\left(\frac{X_1 + \ldots + X_n - n\mu}{\sigma\sqrt{n}} \le z\right) = P(N(0, 1) \le z). \]
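The convergence in Theorem 14.2 can be made visible with a simulation. The sketch below is an illustration under assumptions of our own: the $X_i$ are taken uniform on $[0, 1]$ (so $\mu = 1/2$ and $\sigma^2 = 1/12$), and $n = 30$, the number of trials, the point $z = 1$ and the seed are arbitrary choices. The standard normal distribution function is evaluated through `math.erf`.

```python
import math
import random

def phi(z):
    """Distribution function of the standard normal, written via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def standardized_sum(n, rng):
    """(X_1 + ... + X_n - n*mu) / (sigma * sqrt(n)) for uniform[0, 1] draws."""
    mu = 0.5
    sigma = math.sqrt(1.0 / 12.0)
    total = sum(rng.random() for _ in range(n))
    return (total - n * mu) / (sigma * math.sqrt(n))

rng = random.Random(2)          # fixed seed so the run is reproducible
n, trials, z = 30, 20_000, 1.0

# Empirical frequency of the event {standardized sum <= z}.
hits = sum(1 for _ in range(trials) if standardized_sum(n, rng) <= z)
empirical = hits / trials

print(empirical, phi(z))
```

The two printed numbers should be close, which is exactly the notion of closeness between probability laws discussed above.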

15

Statistical testing

Assume that you read in the newspaper that 50% of the population in Atlanta smokes. You don't believe that number, so you start a survey. You ask 100 randomly chosen people, and find that 70 out of the hundred smoke. Now, you want to know if the result of your survey constitutes strong evidence against the 50% claimed by the newspaper. If the true percentage of the population of Atlanta which smokes were 50%, you would expect to find in your survey a number closer to 50 people. However, it could be that although the true percentage is 50%, you still observe a figure as high as 70, just by chance. So, the procedure is the following: determine the probability of getting 70 people or more in your survey who smoke, given that the percentage is really 50%. If that probability is very small, you decide to reject the idea that 50% of the population in Atlanta smokes. In general one takes a fixed level $\alpha > 0$ and rejects the idea one wants to test if the probability is smaller than $\alpha$. Most of the time statisticians work with $\alpha$ equal to 0.05 or 0.1. So, if the probability of getting 70 people or more in our survey who smoke is smaller than $\alpha = 0.05$ (the probability given that 50% of the population smokes), then statisticians will say: we reject the hypothesis that 50% of the population in Atlanta smokes. We do this on the level $\alpha = 0.05$, based on the evidence of our survey.

How do we calculate the probability of observing 70 or more people in our survey who smoke if the percentage is really 50% of the Atlanta population? For this it is important how we choose the people for our survey. The correct way to choose them is the following: take a complete list of the inhabitants of Atlanta. Number them. Choose 100 of them with replacement and with equal probability. This means that a person could appear twice.

Let $X_i$ be equal to one if the $i$-th person chosen is a smoker, and zero otherwise. Then, if we choose the people following the procedure above, we find that the $X_i$'s are i.i.d. and that $P(X_i = 1) = p$, where $p$ designates the true percentage of people in Atlanta who smoke. Then also $E[X_i] = p$. The total number $Z$ of people in our survey who smoke can now be expressed as

\[ Z := X_1 + X_2 + \ldots + X_{100}. \]

Let $P_{50\%}(\cdot)$ designate the probability given that the true percentage of smokers is really 50%. Testing if 50% in Atlanta smoke can now be described as follows:

Calculate the probability:

\[ P_{50\%}(X_1 + \ldots + X_{100} \ge 70). \]

If the above probability is smaller than $\alpha = 0.05$, we reject the hypothesis that 50% of the population smokes in Atlanta (we reject it on the $\alpha = 0.05$ level). Otherwise, we keep the hypothesis. When we keep the hypothesis, this means that the result of our survey does not constitute strong evidence against the hypothesis: the result of the survey does not contradict the hypothesis.

Note that we could also have done the test on the $\alpha = 0.1$ level. In that case we would reject the hypothesis if that probability is smaller than 0.1.

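Next we explain how we can calculate approximately the probability $P_{50\%}(Z \ge 70)$ using the CLT. Here $n = 100$, and $\mu = E[X_1]$ and $\sigma = \sqrt{VAR[X_1]}$ are computed under the hypothesis that 50% smoke. Simply note that, by basic algebra, the inequality

\[ Z \ge 70 \]

is equivalent to

\[ Z - n\mu \ge 70 - n\mu, \]

which is itself equivalent to:

\[ \frac{Z - n\mu}{\sigma\sqrt{n}} \ge \frac{70 - n\mu}{\sigma\sqrt{n}}. \]

Equivalent inequalities must also have the same probability. Hence:

\[ P_{50\%}(Z \ge 70) = P_{50\%}\left(\frac{Z - n\mu}{\sigma\sqrt{n}} \ge \frac{70 - n\mu}{\sigma\sqrt{n}}\right). \qquad (15.1) \]

By the CLT,

\[ \frac{Z - n\mu}{\sigma\sqrt{n}} \]

is close to being a standard normal random variable $N(0, 1)$. Thus, the probability on the right side of equality 15.1 is approximately equal to

\[ P\left(N(0, 1) \ge \frac{70 - n\mu}{\sigma\sqrt{n}}\right). \qquad (15.2) \]

If the probability in expression 15.2 is smaller than 0.05, then we reject the hypothesis that 50% of the Atlanta population smokes (on the $\alpha = 0.05$ level). We can look up the probability that the standard normal $N(0, 1)$ is smaller than the number $(70 - n\mu)/(\sigma\sqrt{n})$ in a table; subtracting it from one gives the probability in 15.2. We have tables for the standard normal variable $N(0, 1)$.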
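For the smoking survey the approximation 15.2 can be evaluated directly. The sketch below relies on one fact that the notes do not spell out: a variable taking the values 0 and 1 with $P(X_i = 1) = p$ has variance $p(1-p)$, so under the 50% hypothesis $\sigma = 0.5$. The helper `phi` (the standard normal distribution function computed with `math.erf` instead of a table) and the variable names are our own.

```python
import math

def phi(z):
    """Distribution function of the standard normal, written via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n = 100                 # survey size
p0 = 0.5                # percentage claimed by the newspaper
observed = 70           # smokers found in the survey
alpha = 0.05            # level of the test

mu = p0                                  # E[X_i] under the hypothesis
sigma = math.sqrt(p0 * (1.0 - p0))       # sd of a 0/1 variable with P(X_i = 1) = p0

# Standardized threshold (70 - n*mu) / (sigma * sqrt(n)) from expression 15.2.
threshold = (observed - n * mu) / (sigma * math.sqrt(n))
p_value = 1.0 - phi(threshold)           # approximate P_50%(Z >= 70)
reject = p_value < alpha

print(threshold, p_value, reject)
```

With these numbers the threshold is four standard deviations above the mean, so the approximate probability is far below 0.05 and the 50% hypothesis is rejected.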


15.1

Let $z \in \mathbb{R}$. Let $\Phi(z)$ denote the probability that a standard normal variable is smaller than or equal to $z$. Thus:

\[ \Phi(z) := P(N(0, 1) \le z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx. \]

For example, let $z > 0$ be a number. Say we want to find the probability

\[ P(N(0, 1) \ge z). \qquad (15.3) \]

The table for the standard normal gives the values of $\Phi(z)$ for $z > 0$, thus we have to try to express probability 15.3 in terms of $\Phi(z)$. For this note that:

\[ P(N(0, 1) \ge z) = 1 - P(N(0, 1) < z). \]

Furthermore, $P(N(0, 1) < z)$ is equal to $P(N(0, 1) \le z) = \Phi(z)$. Thus we find that:

\[ P(N(0, 1) \ge z) = 1 - \Phi(z). \]

Let us next explain how, if $z < 0$, we can find the probability:

\[ P(N(0, 1) \le z). \]

Note that $N(0, 1)$ is symmetric around the origin. Thus,

\[ P(N(0, 1) \le z) = P(N(0, 1) \ge |z|). \]

This brings us back to the previously studied case. We find

\[ P(N(0, 1) \le z) = 1 - \Phi(|z|). \]

Finally, let $z > 0$ again. What is the probability

\[ P(-z \le N(0, 1) \le z) \]

equal to? For this problem note that

\[ P(-z \le N(0, 1) \le z) = 1 - P(N(0, 1) \le -z) - P(N(0, 1) \ge z). \]

Thus, we find that:

\[ P(-z \le N(0, 1) \le z) = 1 - (1 - \Phi(z)) - (1 - \Phi(z)) = 2\Phi(z) - 1. \]
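The three table rules above translate directly into small helper functions. This is a sketch with names of our own choosing; `phi` plays the role of the table and is computed with `math.erf`, which is an assumption about how one would replace the printed table on a computer.

```python
import math

def phi(z):
    """Distribution function of the standard normal, written via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def upper_tail(z):
    """P(N(0,1) >= z) for z > 0, i.e. 1 - Phi(z)."""
    return 1.0 - phi(z)

def lower_tail_negative(z):
    """P(N(0,1) <= z) for z < 0, using only Phi at |z| as a table for z > 0 would."""
    return 1.0 - phi(abs(z))

def central(z):
    """P(-z <= N(0,1) <= z) for z > 0, i.e. 2*Phi(z) - 1."""
    return 2.0 * phi(z) - 1.0

print(upper_tail(1.0), lower_tail_negative(-1.3), central(1.96))
```

Evaluating `central(1.96)` gives a value very close to 0.95, which is why 1.96 appears as the critical value of two-sided tests on the 95% level.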


15.2

Let us give an example to introduce this subject. Assume that we are testing a new fuel for a certain type of rocket. We would like to know if the new fuel gives a different initial velocity to the rocket. The initial velocity with the old fuel is denoted by $\mu_X$, whilst $\mu_Y$ is the initial velocity with the new fuel. We fire the rocket five times with the old fuel and measure the initial velocity each time. We find:

\[ X_1 = 100, \quad X_2 = 102, \quad X_3 = 97, \quad X_4 = 100, \quad X_5 = 101 \qquad (15.4) \]

(here $X_i$ denotes the initial velocity measured whilst firing the rocket for the $i$-th time with the old fuel). Then we fire the rocket five times with the new fuel. Every time we measure the initial velocity. We find

\[ Y_1 = 101, \quad Y_2 = 103, \quad Y_3 = 99, \quad Y_4 = 102, \quad Y_5 = 100. \qquad (15.5) \]

The sample means are

\[ \bar{X} := \frac{X_1 + X_2 + X_3 + X_4 + X_5}{5} = 100 \]

and

\[ \bar{Y} := \frac{Y_1 + Y_2 + Y_3 + Y_4 + Y_5}{5} = 101. \]

When we measure the initial velocities we find different values even when we use the same fuel. The reason is that our measurement instruments are not very precise, so we get the true value plus a measurement error. The model is as follows:

\[ X_i = \mu_X + \varepsilon_i^X \]

and

\[ Y_i = \mu_Y + \varepsilon_i^Y. \]

Furthermore, $\varepsilon_1^X, \varepsilon_2^X, \ldots$ are i.i.d. random errors and so are $\varepsilon_1^Y, \varepsilon_2^Y, \ldots$. We assume that the measurement instrument is well calibrated so that

\[ E[\varepsilon_i^X] = E[\varepsilon_i^Y] = 0 \]

for all $i = 1, 2, \ldots$. Here $\mu_X$ and $\mu_Y$ are unknown constants (in our example $\mu_X$ is the initial speed when we use the old fuel, whilst $\mu_Y$ is the initial speed when we use the new fuel). We find that

\[ E[X_i] = E[\mu_X + \varepsilon_i^X] = E[\mu_X] + E[\varepsilon_i^X] = \mu_X + 0 = \mu_X, \]

and similarly

\[ E[Y_i] = \mu_Y. \]

So our testing problem can be described as follows: we want to figure out, based on our data 15.4 and 15.5, if the new fuel gives a different initial speed than the old fuel. We observed

\[ \bar{Y} - \bar{X} = 1 > 0. \]

This means that in the second sample, obtained with the new fuel, the initial speed is on average one unit higher than in the first sample, obtained with the old fuel. But is this evidence enough to conclude that the new fuel provides a higher initial speed, or could this difference just be due to the measurement errors? As a matter of fact, since we make measurement errors, it could be that even if the second fuel does not provide a higher initial speed (i.e. $\mu_X = \mu_Y$), the second average is higher than the first due to the random errors and bad luck. In our present setting we can never be absolutely sure, but we try to see if there is statistically significant evidence for arguing that $\mu_X$ and $\mu_Y$ are not equal.

The exact method to do this depends on whether we know the standard deviations of the errors or not, and on whether they are identical for the two samples. We will need the expectation and standard deviation of the means. This is what we calculate in the next paragraph.

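Expectation and standard deviation of the means. Let the standard deviations of the errors be denoted by

\[ \sigma_X := \sqrt{VAR[\varepsilon_i^X]}, \qquad \sigma_Y := \sqrt{VAR[\varepsilon_i^Y]}. \]

Let $Z := \bar{Y} - \bar{X}$. We find that the standard deviation of $Z$ is given by

\[ \sigma_Z = \sigma_{\bar{Y} - \bar{X}} = \sqrt{VAR[\bar{Y} - \bar{X}]} = \sqrt{VAR[\bar{Y}] + VAR[\bar{X}]} \qquad (15.6) \]

where the last equality above was obtained using the facts that $\bar{X}$ and $\bar{Y}$ are independent of each other, and that the variance of a sum of independent variables is equal to the sum of the variances. Now

\[ VAR[\bar{X}] = VAR\left[\frac{X_1 + \ldots + X_n}{n}\right] = \frac{VAR[X_1 + \ldots + X_n]}{n^2} = \frac{VAR[X_1] + VAR[X_2] + \ldots + VAR[X_n]}{n^2} = \frac{n\,VAR[X_1]}{n^2} = \frac{VAR[X_1]}{n} = \frac{\sigma_X^2}{n} \]

and similarly

\[ VAR[\bar{Y}] = \frac{\sigma_Y^2}{n}. \]

Thus

\[ \sigma_Z = \sigma_{\bar{Y} - \bar{X}} = \sqrt{\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{n}}. \qquad (15.7) \]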

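If $\sigma_X = \sigma_Y$ (which should be the case when we use the same measurement instrument), then equation 15.7 can be rewritten as

\[ \sigma_{\bar{Y} - \bar{X}} = \sqrt{\frac{\sigma^2}{n} + \frac{\sigma^2}{n}} = \sigma\sqrt{2/n}, \qquad (15.8) \]

where $\sigma = \sigma_X = \sigma_Y$. If the two samples had different sizes, we would find by a similar calculation

\[ \sigma_Z = \sqrt{\frac{\sigma_X^2}{n_1} + \frac{\sigma_Y^2}{n_2}} \qquad (15.9) \]

where $n_1$ is the size of the first sample and $n_2$ is the size of the second sample. Furthermore we have for the expectation

\[ E[\bar{Y} - \bar{X}] = E[\bar{Y}] - E[\bar{X}] = E\left[\frac{Y_1 + Y_2 + \ldots + Y_n}{n}\right] - E\left[\frac{X_1 + \ldots + X_n}{n}\right] \]

\[ = \frac{E[Y_1 + \ldots + Y_n]}{n} - \frac{E[X_1 + \ldots + X_n]}{n} = \frac{E[Y_1] + E[Y_2] + \ldots + E[Y_n]}{n} - \frac{E[X_1] + E[X_2] + \ldots + E[X_n]}{n} \]

\[ = E[Y_1] - E[X_1] = \mu_Y - \mu_X. \]

To summarize, we found that

\[ E[\bar{Y} - \bar{X}] = \mu_Y - \mu_X. \qquad (15.10) \]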

A simplified method. Let us first explain a rough method, in order to convey the idea in a simple way. This method is, up to a small detail, the same as what is really used in practice. At this stage we are ready to explain how we could proceed to know if we have strong evidence for the case $\mu_Y - \mu_X \ne 0$. We are going to use the rule of thumb which says that in most cases, for most variables, the values we typically observe are within a distance of at most 2 times the standard deviation from the expected value. We apply this rule to $Z = \bar{Y} - \bar{X}$. If there were no difference between the new and old fuel, then $\mu_Y - \mu_X$ would be equal to zero and hence $E[Z] = 0$ (see equation 15.10). We can then check if the value we observe for $Z$ is within 2 times the standard deviation $\sigma_Z$ of zero. Thus in our case we check if the value 1 is within 2 times the standard deviation $\sigma_Z$. This is the same as checking if

\[ \frac{\bar{Y} - \bar{X}}{\sigma_Z} \qquad (15.11) \]

is not more than 2 in absolute value. If it is more than 2, we would think that $\mu_Y$ is probably not equal to $\mu_X$. In that case, we say that we reject the hypothesis that $\mu_X = \mu_Y$. The expression 15.11 is called the test statistic. What we did here is check if the value taken by the test statistic is within the interval $[-c_r, c_r]$, where we took $c_r = 2$. The number $c_r$ is called the critical value for our test. If we do not know $\sigma_X$ and $\sigma_Y$, we estimate them and replace them by their estimates in the formulas 15.7, 15.8, 15.9. (To see how to estimate a standard deviation go to subsection 16.2.) We then use that value for the test statistic instead of 15.11.

The method described here differs from the one really used only as far as the critical value is concerned. However, even with the way the test is usually done in practice, the critical value will not be very far from 2. Let us next explain in detail the different methods used in practice. They depend on whether the standard deviation is known or not. Also, to perform a statistical test in a precise way, we need to specify the level of confidence for the test. The higher the level of confidence, the bigger the critical value will be. Let us see the details in the next paragraphs:

The case with identical, known standard deviation. Assume that the standard deviations $\sigma_X$ and $\sigma_Y$ are known to us and identical. This is typically the case when the measurement instruments used for both samples are identical. In this case, we denote by $\sigma$ the value $\sigma = \sigma_X = \sigma_Y$. If we often work with the same measurement instruments, we will know the typical size of the measurement error, hence we will know $\sigma$ from experience. Assume here that the measurement errors are normal. Then, the test statistic

\[ \frac{\bar{Y} - \bar{X}}{\sigma_Z} = \frac{Y_1 + \ldots + Y_n - X_1 - X_2 - \ldots - X_n}{n\sigma_Z} \qquad (15.12) \]

is also normal. As a matter of fact, as can be seen in 15.12, the test statistic can be written as a sum of independent normal variables divided by a constant. We know that sums of independent normal variables are again normal. Furthermore, dividing a normal by a constant gets you a normal again. If $\mu_X = \mu_Y$, then the expectation of the test statistic is zero:

\[ E\left[\frac{\bar{Y} - \bar{X}}{\sigma_Z}\right] = \frac{E[\bar{Y} - \bar{X}]}{\sigma_Z} = \frac{\mu_Y - \mu_X}{\sigma_Z} = 0. \]

Similarly, the variance of the test statistic is one. This can be seen from:

\[ VAR\left[\frac{\bar{Y} - \bar{X}}{\sigma_Z}\right] = \frac{VAR[\bar{Y} - \bar{X}]}{\sigma_Z^2} = 1. \]

Hence, if $\mu_X = \mu_Y$, then the test statistic is a normal variable with expectation 0 and variance 1. In other words, the test statistic is a standard normal variable. So, in this case, the critical value $c_r$ at a confidence level $p$ is the number $c_r > 0$ satisfying

\[ P(-c_r \le N(0, 1) \le c_r) = p. \]

By symmetry around the origin, this implies (see subsection 15.1) that

\[ \Phi(c_r) = (1 + p)/2. \qquad (15.13) \]

The value $c_r$ which satisfies equation 15.13 can be found in a table for standard normal variables. For example, for $p = 95\%$ one finds $c_r = 1.96$.

Let us get back to our example with the rocket. Assume that the typical size of the measurement error when we make one measurement is 3. In other words, let $\sigma = 3$. Assume we want to test on the 95%-level if there is a statistically significant difference between the means in our samples. That is, we want to test the hypothesis $\mu_Y = \mu_X$ against the hypothesis $\mu_Y \ne \mu_X$ on the 95%-level. For this we simply need to check if the test statistic $(\bar{Y} - \bar{X})/\sigma_Z$ lies in the interval $[-c_r, c_r]$. The constant $\sigma_Z = \sigma_{\bar{Y} - \bar{X}}$ has been calculated in 15.8, where it was found:

\[ \sigma_Z := \sigma\sqrt{2/n}. \]

Hence, with our value $\sigma = 3$ and $n = 5$, the test statistic takes on the value

\[ \frac{\bar{Y} - \bar{X}}{\sigma_Z} = \frac{\bar{Y} - \bar{X}}{\sigma\sqrt{2/n}} = \frac{1}{3\sqrt{0.4}} \approx \frac{1}{1.90} \approx 0.53. \]

The value of the test statistic lies within the interval $[-c_r, c_r]$, where $c_r = 1.96$. Hence in this situation we cannot reject the hypothesis that $\mu_X = \mu_Y$ on the 95% confidence level. In other words, we do not have enough statistical evidence to reject the idea that $\mu_X = \mu_Y$. This means that our data does not seem to imply that the new fuel is better or worse than the old. Note that this does not necessarily mean that $\mu_X$ and $\mu_Y$ must be identical. It could be that the difference is so small that it gets masked by our measurement errors.
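The test with known standard deviation can be retraced step by step. The sketch below uses the rocket data 15.4 and 15.5, the value $\sigma = 3$ and the critical value 1.96 from the text; the variable names are our own.

```python
import math

old_fuel = [100, 102, 97, 100, 101]   # data 15.4
new_fuel = [101, 103, 99, 102, 100]   # data 15.5
sigma = 3.0                           # known sd of one measurement error
critical = 1.96                       # two-sided critical value on the 95% level

n = len(old_fuel)
mean_x = sum(old_fuel) / n
mean_y = sum(new_fuel) / n

sigma_z = sigma * math.sqrt(2.0 / n)        # formula 15.8
statistic = (mean_y - mean_x) / sigma_z     # test statistic 15.11
reject = abs(statistic) > critical          # reject mu_X = mu_Y if outside [-c_r, c_r]

print(mean_x, mean_y, round(sigma_z, 3), round(statistic, 3), reject)
```

The statistic comes out at roughly 0.53, well inside the interval, so the hypothesis $\mu_X = \mu_Y$ is not rejected.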

The way the test was done is called a two-sided test. If we were interested just in knowing if the new fuel is better, then we would do a one-sided test. (It could be that a company might change to a new fuel, but only if it is proven to be better. In that case, the only interesting thing is to know if the new fuel is better and not if it is different.) In the case of a one-sided test the confidence interval would be $]-\infty, c_r]$, where the critical value $c_r$ is determined by

\[ P(N(0, 1) \le c_r) = \Phi(c_r) = p \]

on the confidence level $p$. Here, as before, $\Phi(\cdot)$ designates the distribution function of a standard normal variable.

In this example, we assumed the measurement errors to be normal. If this is not the case, but we have many measurements, the above method still applies due to the Central Limit Theorem.

Case when the standard deviations are known but not equal, or different sample sizes. If the standard deviations in the two samples are different (because of different measurement instruments, maybe), then all the above remains the same except that we use a different formula for $\sigma_Z$. The formula then used for $\sigma_Z$ is formula 15.7. The same goes when the samples have different sizes from each other. The formula used in that case is 15.9.

The case with unknown, but equal standard deviation. Assume that $\sigma = \sigma_X = \sigma_Y$, but $\sigma$ is unknown to us. Then instead of $\sigma_Z$, we use an estimate for $\sigma_Z$. For this note that

\[ \sigma_Z = \sqrt{\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{n}}. \qquad (15.14) \]

We will estimate $\sigma_X^2$ and $\sigma_Y^2$ and plug the values into the formula 15.14 instead of the real values. The estimates we use (see subsection 16.2) are

\[ s_X^2 := \frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \ldots + (X_n - \bar{X})^2}{n - 1} \]

for $\sigma_X^2$ and

\[ s_Y^2 := \frac{(Y_1 - \bar{Y})^2 + (Y_2 - \bar{Y})^2 + \ldots + (Y_n - \bar{Y})^2}{n - 1} \]

for $\sigma_Y^2$. This then gives as estimate for $\sigma_Z$ the following expression:

\[ \sqrt{(s_X^2 + s_Y^2)/n}. \]

Our test statistic is obtained by replacing $\sigma_Z$ by its estimate in the previously used test statistic. Hence the test statistic for the case of unknown standard deviation is:

\[ \frac{\bar{Y} - \bar{X}}{\sqrt{(s_X^2 + s_Y^2)/n}}. \qquad (15.15) \]

The distribution of the test statistic is no longer normal. It is slightly modified. One can prove that when $\mu_X = \mu_Y$ and the measurement errors are normal, the test statistic has a Student t-distribution with $2n - 2$ degrees of freedom. So our testing procedure is almost as before, only that we have to find the critical value $c_r$ in a different table. This time we have to find it in a table for the t-distribution with $2n - 2$ degrees of freedom. So, if we test on the confidence level $p$, then $c_r$ is defined to be the number such that

\[ P(-c_r \le T_{2n-2} \le c_r) = p. \]

We reject the hypothesis $\mu_X = \mu_Y$ on the level $p$ if the test statistic 15.15 takes a value outside $[-c_r, c_r]$.

Let us get back to our rocket example. Say we want to test $\mu_X = \mu_Y$ on the level $p = 95\%$. We find

\[ s_X^2 = \frac{0 + 2^2 + 3^2 + 0 + 1}{4} = \frac{14}{4} = 3.5 \]

and

\[ s_Y^2 = \frac{0 + 2^2 + 2^2 + 1 + 1}{4} = \frac{10}{4} = 2.5. \]

With $n = 5$ the test statistic takes on the value

\[ \frac{\bar{Y} - \bar{X}}{\sqrt{(s_X^2 + s_Y^2)/n}} = \frac{1}{\sqrt{6/5}} = \frac{1}{\sqrt{1.2}} \approx 0.9. \]

Reading the table for the t-distribution with $2n - 2 = 8$ degrees of freedom, we find the critical value for a two-sided test on the 95% level to be $c_r = 2.306$. We see that the value taken by the test statistic is well within the interval $[-c_r, c_r]$ and hence we cannot reject the hypothesis that the new fuel has no effect. More precisely, our data does not contain significant evidence on the 95%-level that there is a difference between the two fuels, i.e. that $\mu_X \ne \mu_Y$.
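The computation for the case of unknown standard deviation can be reproduced as follows. This is a sketch with our own function and variable names; the critical value 2.306 is the standard table entry for the t-distribution with 8 degrees of freedom on the two-sided 95% level and is typed in by hand rather than computed.

```python
import math

def sample_variance(values):
    """Estimate of the variance: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

old_fuel = [100, 102, 97, 100, 101]   # data 15.4
new_fuel = [101, 103, 99, 102, 100]   # data 15.5
n = len(old_fuel)

s2_x = sample_variance(old_fuel)
s2_y = sample_variance(new_fuel)
mean_diff = sum(new_fuel) / n - sum(old_fuel) / n

# Test statistic 15.15 with the estimated standard deviation of the difference.
statistic = mean_diff / math.sqrt((s2_x + s2_y) / n)

critical = 2.306                      # t table, 2n - 2 = 8 degrees of freedom, 95% two-sided
reject = abs(statistic) > critical

print(s2_x, s2_y, round(statistic, 3), reject)
```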

16 Statistical estimation

16.1 An example

Imagine that we want to measure the distance $d$ between two points $y$ and $z$. Every time we repeat the measurement we make a measurement error. In order to improve the precision we make several measurements and then take the average value measured. Let $X_i$ designate measurement number $i$ and $\varepsilon_i$ the error number $i$. We have that:

\[ X_i = d + \varepsilon_i. \]

We assume that the measurement errors are i.i.d. such that

\[ E[\varepsilon_i] = 0 \]

and

\[ VAR[\varepsilon_i] = \sigma^2. \]

The standard deviation $\sigma$ of the measurement instrument is supposed to be known to us. Imagine that we make 4 measurements and find in meters the four values:

\[ 100, \ 102, \ 99, \ 101. \]

We see that the distance $d$ must be around 101 meters. However, the exact value of the distance $d$ remains unknown to us, since each of the four measurements above contains an error. So, we can only estimate what the true distance is equal to. Typically we take the average of the measurements as estimate for $d$. We write $\hat{d}$ for our estimate of $d$. In the case we decide to use the average of our measurements as estimate for $d$, we have that:

\[ \hat{d} = \frac{X_1 + X_2 + X_3 + X_4}{4}. \]

The advantage of taking four measurements of the same distance instead of only one is that the probability of a large error is reduced. The errors in the different measurements tend to even each other out when we compute the average. As a matter of fact, assume we make $n$ measurements and then take the average. In this case:

\[ \hat{d} := \frac{X_1 + \ldots + X_n}{n}. \]

We find:

\[ E[\hat{d}] = (1/n)\left(E[X_1] + \ldots + E[X_n]\right) = (1/n)\left(nE[X_1]\right) = E[X_1] = E[d + \varepsilon_1] = E[d] + E[\varepsilon_1] = d + 0 = d. \]

An estimator which has its expectation equal to the true value we want to estimate is called an unbiased estimator.

Let us calculate:

\[ VAR[\hat{d}] = VAR\left[\frac{X_1 + \ldots + X_n}{n}\right] = \frac{1}{n^2}\left(VAR[X_1] + \ldots + VAR[X_n]\right) = \frac{1}{n^2}\left(n\,VAR[X_1]\right) = VAR[X_1]/n. \]

Thus, the standard deviation of $\hat{d}$ is equal to

\[ \sqrt{VAR[X_1]/n} = \sigma/\sqrt{n}. \]

The standard deviation of the average $\hat{d}$ is thus $\sqrt{n}$ times smaller than the standard deviation of the error when we make one measurement. This justifies taking several measurements and taking the average, since it reduces the size of a typical error by a factor $\sqrt{n}$.

When we make a measurement and give an estimate of what the distance is, it is important that we know the order of magnitude of the error. Imagine for example that the order of magnitude of the error is 100 meters. The situation would then be: our estimate of the distance is 101 meters, and the precision of this estimate is plus/minus 100 meters. In this case our estimate of the distance is almost useless because of the huge imprecision. This is why we try to always give the precision of the estimate. Since the errors are random, theoretically even very large errors are always possible. Very large errors, however, have small probability. Hence one tries to give an upper bound on the size of the error which holds with a given probability. Typically one uses the probabilities 95% or 99%. The type of statement one wishes to make is for example: our estimate for the distance is 101 meters. Furthermore, with 95% probability the true distance is within 2 meters of our estimate. In this case the interval $[99, 103]$ is called the 95% confidence interval for $d$. With 95% probability, $d$ should lie within this interval. More precisely, we look for a real number $a > 0$ such that:

\[ P(\hat{d} - a \le d \le \hat{d} + a) = 95\% \]

or equivalently:

\[ P(-a \le \hat{d} - d \le a) = 95\%. \]

We have

\[ 95\% = P\left(-a \le \frac{X_1 + \ldots + X_n}{n} - d \le a\right) = P\left(-a \le \frac{d + \varepsilon_1 + \ldots + d + \varepsilon_n - nd}{n} \le a\right) = P\left(-a \le \frac{\varepsilon_1 + \ldots + \varepsilon_n}{n} \le a\right). \]

Now, either way we assume that the errors I are normal or that n is big enough so that

the sum 1 + . . . + n is approximately

normal due to the central limit theorem. Dividing

then gives:

$

$

%

%

a n

a n

1 + . . . + n

a n

a n

P

N (0, 1)

95% = P

We thus find the number b > 0 from the table for the standard normal random variable such that:

$$95\% = P(-b \le N(0,1) \le b).$$

Hence:

$$95\% = \Phi(b) - (1 - \Phi(b)) = 2\Phi(b) - 1$$

where $\Phi(\cdot)$ designates the distribution function of the standard normal variable. Then, we find a > 0 solving:

$$b = \frac{a\sqrt{n}}{\sigma}.$$

The 95% confidence interval for d is then $[\hat{d} - a, \hat{d} + a]$. This means that although we don't know the exact value of d, we can say that with 95% probability d lies in the interval $[\hat{d} - a, \hat{d} + a]$.
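The recipe above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `confidence_interval`, the sample measurements and the value sigma = 1.0 are hypothetical, and the table lookup for b is replaced by the inverse distribution function of the standard normal from the standard library.

```python
from math import sqrt
from statistics import NormalDist, mean


def confidence_interval(measurements, sigma, level=0.95):
    """Return (d_hat - a, d_hat + a) when the error standard deviation sigma is known."""
    n = len(measurements)
    d_hat = mean(measurements)  # the average of the measurements estimates d
    # b solves 2 * Phi(b) - 1 = level, that is Phi(b) = (1 + level) / 2
    b = NormalDist().inv_cdf((1 + level) / 2)
    # b = a * sqrt(n) / sigma, hence a = b * sigma / sqrt(n)
    a = b * sigma / sqrt(n)
    return d_hat - a, d_hat + a


# Hypothetical measurements of the same distance (in meters)
measurements = [100.2, 101.5, 99.8, 101.1, 102.4]
low, high = confidence_interval(measurements, sigma=1.0)
print(f"95% confidence interval for d: [{low:.2f}, {high:.2f}]")
```

Note how the half-width a shrinks like one over the square root of n: four times as many measurements halve the width of the interval.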

16.2

Assume that we are in the same situation as in the previous subsection. The only difference is that instead of trying to determine the distance, we want to find out how precise our measurement instrument is. In other words, we try to determine the standard deviation $\sigma = \sqrt{VAR[\varepsilon_i]}$. For this we make several measurements of the distance between two points y and z. We choose the points so that we know the distance d between them. Again, if $X_i$ designates the i-th measurement we have $X_i = d + \varepsilon_i$. Define the random variable $Z_i$ in the following way:

$$Z_i := (X_i - d)^2 = \varepsilon_i^2.$$

Thus, since the errors have expectation zero:

$$E[Z_i] = VAR[\varepsilon_i].$$

We have argued that if we have a number of independent copies of the same random variable, a good way to estimate the expectation is to take the average. Thus to estimate the expectation $E[Z_i]$, we take the average:

$$\hat{E}[Z_i] := \frac{Z_1 + \ldots + Z_n}{n} = \frac{(X_1 - d)^2 + \ldots + (X_n - d)^2}{n}.$$

The estimate for $\sigma$ is then simply the square root of the estimate for the variance. Thus, our estimator for $\sigma = \sqrt{VAR[\varepsilon_i]}$ is:

$$\hat{\sigma} = \sqrt{\frac{(X_1 - d)^2 + \ldots + (X_n - d)^2}{n}}.$$

If the distance d should not be known, we simply take an estimate $\hat{d}$ for d instead of d. In that case our estimate for $\sigma$ is

$$\hat{\sigma} = \sqrt{\frac{(X_1 - \hat{d})^2 + \ldots + (X_n - \hat{d})^2}{n - 1}}$$

where

$$\hat{d} := \frac{X_1 + \ldots + X_n}{n}.$$

(Note that in the case that d is unknown, instead of dividing by n we usually divide by n − 1. This is a little detail which I am not going to explain. For large n, it is not important since then n/(n − 1) is close to 1.)
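The two estimators of this subsection can be written out as follows. This is a minimal sketch: the function names `sigma_hat_known_d` and `sigma_hat_unknown_d` and the three sample measurements are made up for illustration.

```python
from math import sqrt


def sigma_hat_known_d(xs, d):
    """Estimate sigma when the true distance d is known: divide by n."""
    return sqrt(sum((x - d) ** 2 for x in xs) / len(xs))


def sigma_hat_unknown_d(xs):
    """Estimate sigma when d is unknown: use the average d_hat and divide by n - 1."""
    n = len(xs)
    d_hat = sum(xs) / n
    return sqrt(sum((x - d_hat) ** 2 for x in xs) / (n - 1))


# Hypothetical measurements of a distance whose true value is 101 meters
xs = [99.0, 101.0, 103.0]
print(sigma_hat_known_d(xs, d=101.0))
print(sigma_hat_unknown_d(xs))
```

With only three measurements the two results differ noticeably; as n grows the factor n/(n − 1) tends to 1 and the two estimates become nearly equal.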

16.3

Imagine the following situation: we have two 6-sided dice. Let X designate the number

we obtain when we throw the first die. Let Y designate the number we obtain when we

throw the second one. Assume that the first die is regular whilst the second is skewed.

We have:

(P (X = 1), P (X = 2), . . . , P (X = 6)) = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6).

(Note that $1/6 \approx 0.167$.) Assume furthermore that:

(P (Y = 1), . . . , P (Y = 6)) = (0.01, 0.3, 0.2, 0.1, 0.1, 0.29).

Imagine that we are playing the following game: I choose one of the two dice from a bag. Then I throw it and get a number between 1 and 6. I don't tell you which die I used, but I tell you the number obtained. You have to guess which die I used based on the number which I tell you. (This guessing is what statisticians call estimating.) For example, I tell you that I obtained the number 1. With the first die, the probability to obtain a 1 is about 0.167, whilst with the second die it is 0.01. The probability to obtain a 1 is thus much smaller with the second die. Having obtained a 1 thus makes us think that it is likelier that the die used is the first die. Our guess will thus be the first die. Of course we could be wrong, but based on what we know the first die appears to be likelier.


If on the other hand, after throwing the die we obtain a 2, we guess that it was the second die which got used. The reason is that with the second die a 2 has a probability of 0.3, which is larger than the probability to see a 2 with the first die. Again, our guess might be wrong, but when we observe a 2, the second die seems likelier. The method of guessing described here is called maximum likelihood estimation. It consists in guessing (estimating) the possibility which makes the observed result most likely. In other words, we choose the possibility for which the probability of the observed outcome is highest.

Let us look at it in a slightly more abstract way. Let I designate the first die and II the second. For x = 1, 2, . . . , 6, let P(x, I) designate the probability that the number we obtain by throwing the first die equals x. Thus:

P(x, I) := P(X = x).

Let P(x, II) designate the probability that the number we obtain by throwing the second die equals x. Thus:

P(x, II) := P(Y = x).

For example, P(1, I) is the probability that the first die gives a 1 and P(1, II) is the probability that the second die gives a 1, whilst P(2, II) designates the probability that the second die gives a 2.

Let $\theta$ be a (non-random) variable which can take one of two values: I or II. Statisticians call $\theta$ the parameter. In this example, guessing which die we are using is the same as trying to figure out whether $\theta$ equals I or II. We consider the probability function $P(\cdot, \cdot)$ with two entries:

$$(x, \theta) \mapsto P(x, \theta).$$

Formally, what we did can be described as follows: given that we observe an outcome x, we take the $\theta$ which maximizes $P(x, \theta)$ as our guess for which die was used. Our maximum likelihood estimate $\hat{\theta}$ of $\theta$ is the $\theta$ maximizing $P(x, \theta)$, where x is the observed outcome. This is a general method, and can be used in many different settings. Let us give another example of maximum likelihood estimation, based on the same principle.
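The guessing rule for the two dice is small enough to write down directly. In the sketch below the dictionary `P` holds the two probability rows given above, and the helper name `ml_guess` is hypothetical.

```python
# P[theta][x - 1] is P(x, theta): probability that die theta shows the number x
P = {
    "I": [1 / 6] * 6,                        # regular die
    "II": [0.01, 0.3, 0.2, 0.1, 0.1, 0.29],  # skewed die
}


def ml_guess(x):
    """Return the die theta that maximizes P(x, theta) for the observed number x."""
    return max(P, key=lambda theta: P[theta][x - 1])


for x in range(1, 7):
    print(f"observed {x}: guess die {ml_guess(x)}")
```

The guess is die II exactly for those outcomes whose probability under the skewed die exceeds 1/6.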

16.4

Assume that p > 0 is unknown. We want to estimate p (in other words we want to try to guess what p is approximately equal to). Say we observe:

$$(T_1, T_2, T_3, T_4, T_5) = (6, 7, 5, 8, 8).$$

Based on this evidence, what should our estimate $\hat{p}$ for p be? (Hence what should our guess for the unknown p be?) We can use the maximum likelihood method. For this, the estimate $\hat{p}$ is the $p \in [0, 1]$ for which the probability to observe (6, 7, 5, 8, 8) is maximal. For independent observations, the probability

$$P((T_1, T_2, T_3, T_4, T_5) = (6, 7, 5, 8, 8)) \tag{16.1}$$

is equal to

$$P(T_1 = 6) \cdot P(T_2 = 7) \cdot \ldots \cdot P(T_5 = 8).$$

For a geometric random variable T with parameter p we have that:

$$P(T = k) = p(1 - p)^{k-1}.$$

Thus the probability 16.1 is equal to:

$$p(1-p)^5 \cdot p(1-p)^6 \cdot \ldots \cdot p(1-p)^7 = \exp(\ln(p) + 5\ln(1-p) + \ldots + \ln(p) + 7\ln(1-p)). \tag{16.2}$$

We want to find p maximizing the last expression. This is the same as maximizing the expression:

$$\ln(p) + 5\ln(1-p) + \ldots + \ln(p) + 7\ln(1-p),$$

since exp(.) is an increasing function. To find the maximum, we take the derivative with respect to p and set it equal to 0. This gives:

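$$0 = \frac{d\,(\ln(p) + 5\ln(1-p) + \ldots + \ln(p) + 7\ln(1-p))}{dp} = \frac{1}{p} - \frac{5}{1-p} + \ldots + \frac{1}{p} - \frac{7}{1-p}.$$

Multiplying by p(1 − p) and collecting terms gives

$$n(1 - p) = (5 + \ldots + 7)p$$

where n designates the number of observations. (In the special example considered here n = 5.) We find:

$$n = (6 + \ldots + 8)p = p(T_1 + T_2 + \ldots + T_n)$$

and hence:

$$\frac{1}{p} = \frac{6 + 7 + 5 + 8 + 8}{5} = \frac{T_1 + T_2 + \ldots + T_n}{n}. \tag{16.3}$$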

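Our estimate $\hat{p}$ of p is the p which maximizes expression 16.2. This is the p which satisfies equation 16.3. Thus our estimate is:

$$\hat{p} := \left(\frac{T_1 + T_2 + \ldots + T_n}{n}\right)^{-1} = \left(\frac{6 + 7 + 5 + 8 + 8}{5}\right)^{-1} = \frac{5}{34} \approx 0.147.$$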
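As a sanity check on the derivation above, the closed-form estimate can be compared with a brute-force search over a grid of candidate values of p. The sketch assumes independent geometric observations; the function names and the grid resolution of 0.001 are arbitrary choices made for illustration.

```python
from math import log


def log_likelihood(p, observations):
    """Logarithm of expression 16.2: sum of ln(p) + (t - 1) * ln(1 - p)."""
    return sum(log(p) + (t - 1) * log(1 - p) for t in observations)


def geometric_mle(observations):
    """Closed-form estimate from equation 16.3: p_hat = n / (T_1 + ... + T_n)."""
    return len(observations) / sum(observations)


observations = [6, 7, 5, 8, 8]
p_hat = geometric_mle(observations)

# Brute-force check: evaluate the log-likelihood on a grid of candidate values
grid = [k / 1000 for k in range(1, 1000)]
p_grid = max(grid, key=lambda p: log_likelihood(p, observations))

print(f"closed form: {p_hat:.4f}, grid search: {p_grid:.3f}")
```

Both approaches should agree up to the grid resolution, which supports the claim that setting the derivative to zero locates the maximum.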

17 Linear Regression

17.1 The case where the exact linear model is known

Imagine a situation where you have a chain of shops. The shops can have different sizes and the profit seems to be to some extent a function of the size. The chain owns n shops. Let $x_i$ denote the size of the i-th shop and $Y_i$ its profit. Now you assume that there are two constants $\alpha$ and $\beta$ so that the following relationship holds:

$$Y_i = \alpha + \beta x_i + \varepsilon_i,$$

where we assume that $\varepsilon_1, \varepsilon_2, \ldots$ are i.i.d. random variables with expectation zero:

$$E[\varepsilon_i] = 0.$$

Let $\sigma$ denote the standard deviation of the variables $\varepsilon_i$. Often it will also be assumed that the variables $\varepsilon_i$ are normal. Now, we have that the expected profit is equal to

$$E[Y_i] = E[\alpha + \beta x_i + \varepsilon_i] = E[\alpha] + E[\beta x_i] + E[\varepsilon_i] = \alpha + \beta x_i.$$

In other words, the expected profit is a linear function of the size: $E[Y] = \alpha + \beta x$, where x is the size and Y is the profit of a shop. So, if you draw a curve representing expected profit as a function of size, you would get a straight line.

Say for your chain of shops, you would have the relationship $Y_i = 3 + 4x_i + \varepsilon_i$. So, in this case $\alpha = 3$ and $\beta = 4$. Say you own a shop of size 5. Then for that shop, the expected profit given that the size is 5 would be $E[Y|x = 5] = 3 + 4 \cdot 5 = 23$. Here we denote by $E[Y|x = 5]$ the expectation given that the size is 5. Now why would the profit of that shop be random? Well, very simple. It could be that you own many shops, but this one shop with size 5 is going to open next month. So, nobody knows in advance the exact profit. One can forecast it, maybe give a confidence interval, but nobody knows in advance the exact value! Hence, the profit behaves like a random variable. If you are told to predict (estimate) what the profit will be, you will give the expected value $\alpha + \beta \cdot 5 = 23$. Of course, this requires that you know the constants $\alpha$ and $\beta$. Now, if you know the standard deviation $\sigma$ of $\varepsilon$, then you can also give a confidence interval. First, using Matzi's rule of thumb, you could simply say that most probably the profit of the shop will be within two standard deviations of the expected profit. In our case, we could thus say that typically the profit is $23 \pm 2\sigma$ and hence most likely between $23 - 2\sigma$ and $23 + 2\sigma$. If for example $\sigma = 3$, then most likely the profit for our shop will be between 17 and 29. Now, this is using a rule of thumb which says that random variables take values not further than twice the standard deviation from their expectation most of the time. But this is not very precise. So, we could actually give a confidence interval. That is, we could give an interval so that with for example 95% probability the profit will be in that interval. If we assume that the errors are normal, then we have that $\varepsilon/\sigma$ is standard normal. Hence, $(Y - \alpha - \beta x)/\sigma = \varepsilon/\sigma$ is standard normal. Hence,

$$P\left(-c \le \frac{Y - \alpha - \beta x}{\sigma} \le c\right) = P(-c \le N(0,1) \le c).$$

The above allows us to give an interval (think of a confidence interval) so that the profit of the new shop will be in that interval with a given probability. For example, at 95% confidence the interval is going to be

$$[\alpha + \beta \cdot 5 - c_{0.95}\,\sigma,\ \alpha + \beta \cdot 5 + c_{0.95}\,\sigma] = [23 - c_{0.95}\,\sigma,\ 23 + c_{0.95}\,\sigma],$$

where $c_{0.95}$ denotes the constant so that a standard normal is between plus and minus that constant with probability 0.95. We have seen how to calculate such a constant.
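For the shop of size 5, the interval with known constants can be computed as follows. The sketch assumes normal errors and reuses the example value sigma = 3 from the rule-of-thumb discussion; the function name `profit_interval` is hypothetical, and the constant c is obtained from the standard library instead of a table.

```python
from statistics import NormalDist


def profit_interval(alpha, beta, sigma, x, level=0.95):
    """Interval containing the profit of a shop of size x with the given probability."""
    # c is the constant with P(-c <= N(0,1) <= c) = level
    c = NormalDist().inv_cdf((1 + level) / 2)
    center = alpha + beta * x  # expected profit
    return center - c * sigma, center + c * sigma


low, high = profit_interval(alpha=3, beta=4, sigma=3, x=5)
print(f"95% interval for the profit: [{low:.2f}, {high:.2f}]")
```

The result is slightly narrower than the rule-of-thumb interval [17, 29], because the exact constant for 95% is a little below 2.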

Imagine next a situation where $\alpha$ and $\beta$ are known, and $\sigma$ is not known. Then we want to estimate $\sigma$ based on our data. Note that $Y_i = \alpha + \beta x_i + \varepsilon_i$ and hence:

$$\varepsilon_i = Y_i - \alpha - \beta x_i. \tag{17.1}$$

When the data $(Y_i, x_i)$ is known and $\alpha$, $\beta$ are known as well, then we can figure out the values of the $\varepsilon_i$'s using formula 17.1. Note that $\sigma$ designates the standard deviation of the errors $\varepsilon_i$. But in previous chapters we have learned how to estimate a standard deviation. So, this is what we are going to do, using the $\varepsilon_i$'s to estimate the standard deviation $\sigma$:

$$\hat{\sigma} := \sqrt{\frac{\varepsilon_1^2 + \ldots + \varepsilon_n^2}{n}}. \tag{17.2}$$

Let us give an example. Say we have five shops and as before $\alpha = 3$ and $\beta = 4$. The data for the shops is given in the table below:

x_i    1    2    3    4    6
Y_i    8   10   17   17   27

Now, for example, $\varepsilon_1 = Y_1 - 3 - 4x_1 = 8 - 3 - 4 = 1$. So for each i = 1, 2, . . . , 5 we can calculate the corresponding $\varepsilon_i$. We get the values:

x_i                    1    2    3    4    6
Y_i − α − βx_i         1   −1    2   −2    0

Thus:

$$\hat{\sigma} := \sqrt{\frac{1^2 + (-1)^2 + 2^2 + (-2)^2 + 0^2}{5}} = \sqrt{2} \approx 1.41. \tag{17.3}$$

We can now use Matzinger's rule of thumb, which says that a random variable most of the time takes values not further than two standard deviations from its expectation. So, that tells us that for our shop, the profit should be within $23 \pm 2\hat{\sigma} \approx 23 \pm 2.83$. So, typically the profit would be in the interval

[20.1716, 25.8284].

The above interval is just to have a rough idea of which area the profit will most likely be in. For a more precise approach with an explicit confidence level $\gamma$, we would take the interval:

$$[23 - c_\gamma\,\sigma,\ 23 + c_\gamma\,\sigma] \tag{17.4}$$

where $c_\gamma$ is the constant so that a standard normal is with probability $\gamma$ between $-c_\gamma$ and $+c_\gamma$:

$$\gamma = P(-c_\gamma \le N(0,1) \le c_\gamma).$$
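The worked example with five shops (errors computed from 17.1, the estimate 17.2, and the rough two-standard-deviation interval for the shop of size 5) can be reproduced with a few lines of Python. The variable names are illustrative only.

```python
from math import sqrt

alpha, beta = 3, 4        # known constants of the model
xs = [1, 2, 3, 4, 6]      # shop sizes
ys = [8, 10, 17, 17, 27]  # observed profits

# Formula 17.1: epsilon_i = Y_i - alpha - beta * x_i
residuals = [y - alpha - beta * x for x, y in zip(xs, ys)]

# Formula 17.2: divide by n, because alpha and beta are known exactly
sigma_hat = sqrt(sum(e ** 2 for e in residuals) / len(residuals))

# Rule of thumb: expected profit of the shop of size 5, plus or minus two standard deviations
center = alpha + beta * 5
rough = (center - 2 * sigma_hat, center + 2 * sigma_hat)

print(residuals, sigma_hat, rough)
```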

Now if we do not know the standard deviation, we can replace the true standard deviation by its estimate. The coefficient $c_\gamma$ from the normal table then has to be replaced by a coefficient $t^{n}_{\gamma/2}$ from the Student table. So, the confidence interval if we have to estimate the standard deviation becomes (here for a shop of size 10):

$$[\alpha + \beta \cdot 10 - t^{n}_{\gamma/2}\,\hat{\sigma},\ \alpha + \beta \cdot 10 + t^{n}_{\gamma/2}\,\hat{\sigma}]. \tag{17.5}$$

17.2

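If $\alpha$ and $\beta$ are not known, then we estimate them using least squares. The estimates $\hat{\alpha}$ and $\hat{\beta}$ are given by the two following equations:

$$\bar{y} = \hat{\alpha} + \hat{\beta}\bar{x}$$

and

$$\hat{\beta} := \frac{\sum_{i=1}^{n} (x_i - \bar{x})\,y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2},$$

where $\bar{x}$ and $\bar{y}$ denote the averages of the $x_i$ and of the $y_i$. Put the value of $\hat{\beta}$ from the second equation into the first to calculate $\hat{\alpha}$.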

Now, in principle, all the things we did in the last subsection, where $\alpha$ and $\beta$ were known, will be done here. The difference is mainly that instead of $\alpha$ and $\beta$ we use the estimates $\hat{\alpha}$ and $\hat{\beta}$. But then we act as if the estimates were the true values. (For the confidence interval there will be a small adjustment.) In other words, given some real data $(x_i, Y_i)$ for i = 1, 2, . . . , n, you could estimate $\alpha$ and $\beta$. Then forget that your estimates $\hat{\alpha}$ and $\hat{\beta}$ are only estimates. Act as if they were the true $\alpha$ and $\beta$ and do everything we did in the section above. In this way, you can figure out how to estimate the standard deviation, get a confidence interval and so on and so forth. Let us summarize:

1. To estimate the expected profit of a new shop of size $x_0$, we used $\alpha + \beta x_0$ in the previous section. Now, however, $\alpha$ and $\beta$ are not known. So, we simply take the estimates for $\alpha$ and $\beta$ and act as if they were the true values. Our estimate for the expected profit of a shop of size $x_0$, when $\alpha$ and $\beta$ are not known, is:

$$\hat{E}[Y|x_0] := \hat{\alpha} + \hat{\beta}x_0.$$

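2. To estimate the standard deviation, we had used the $\varepsilon_i$'s, which are equal to $\varepsilon_i = Y_i - \alpha - \beta x_i$. Now, $\alpha$ and $\beta$ are not known here, so we replace them by their respective estimates. So our estimated random errors are

$$\hat{\varepsilon}_i := Y_i - \hat{\alpha} - \hat{\beta}x_i.$$

For estimating the standard deviation $\sigma$, we now simply replace $\varepsilon_i$ by the estimate $\hat{\varepsilon}_i$ in formula 17.2, and divide by n − 2 instead of n. Hence, the estimated $\sigma$ is defined to be:

$$\hat{\sigma} := \sqrt{\frac{\hat{\varepsilon}_1^2 + \hat{\varepsilon}_2^2 + \ldots + \hat{\varepsilon}_n^2}{n - 2}}. \tag{17.6}$$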

3. Let us see how we give a rough confidence interval using Matzi's rule of thumb. (That rule of thumb is: variables mostly take values not further than two times the standard deviation from their expectation.) So, in the formula $\alpha + \beta x_0 \pm 2\sigma$ we simply replace $\alpha$, $\beta$, $\sigma$ by their respective estimates: so the rough confidence interval for the profit of a shop with size $x_0$ would be

$$[\hat{\alpha} + \hat{\beta}x_0 - 2\hat{\sigma},\ \hat{\alpha} + \hat{\beta}x_0 + 2\hat{\sigma}]$$

where $\hat{\sigma}$ is our estimate given in 17.6.

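4. For an exact confidence interval we take the same as in 17.5, but again replacing $\alpha$, $\beta$ and $\sigma$ by their respective estimates. (Here for estimating $\sigma$ we take 17.6.) Also, there is an additional factor equal to

$$\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i} (x_i - \bar{x})^2}}.$$

This factor is needed because we have additional uncertainty, since we do not know $\alpha + \beta x_0$ but only have an estimate for it. Also, for large n, this factor becomes close to 1. So, all this being said, our confidence interval at the $\gamma$ confidence level is (for $x_0 = 10$, as in 17.5)

$$\left[\hat{\alpha} + \hat{\beta} \cdot 10 - t^{n}_{\gamma/2}\,\hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i} (x_i - \bar{x})^2}},\ \ \hat{\alpha} + \hat{\beta} \cdot 10 + t^{n}_{\gamma/2}\,\hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i} (x_i - \bar{x})^2}}\right].$$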
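The four steps summarized above can be put together in a short sketch. It reuses the five-shop data from the previous subsection purely as an illustration; all function names are hypothetical. The coefficient from the Student table is not computed here: it is assumed to be looked up by the reader and passed in as `coefficient`.

```python
from math import sqrt


def fit_line(xs, ys):
    """Least-squares estimates (alpha_hat, beta_hat) from the two equations above."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta_hat = sum((x - x_bar) * y for x, y in zip(xs, ys)) / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


def estimated_sigma(xs, ys, alpha_hat, beta_hat):
    """Formula 17.6: estimated errors, squared, summed, divided by n - 2."""
    residuals = [y - alpha_hat - beta_hat * x for x, y in zip(xs, ys)]
    return sqrt(sum(e ** 2 for e in residuals) / (len(xs) - 2))


def extra_factor(xs, x0):
    """Additional factor sqrt(1 + 1/n + (x0 - x_bar)^2 / sum (x_i - x_bar)^2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)


def interval(xs, ys, x0, coefficient=2.0, use_factor=False):
    """coefficient=2 without the factor gives the rule-of-thumb interval of step 3.
    For step 4, pass a coefficient read from the Student table and use_factor=True."""
    alpha_hat, beta_hat = fit_line(xs, ys)
    half_width = coefficient * estimated_sigma(xs, ys, alpha_hat, beta_hat)
    if use_factor:
        half_width *= extra_factor(xs, x0)
    center = alpha_hat + beta_hat * x0
    return center - half_width, center + half_width


xs = [1, 2, 3, 4, 6]
ys = [8, 10, 17, 17, 27]
alpha_hat, beta_hat = fit_line(xs, ys)
print(f"alpha_hat = {alpha_hat:.3f}, beta_hat = {beta_hat:.3f}")
print("rough interval for a shop of size 5:", interval(xs, ys, x0=5))
```

For this data the estimates land near the values 3 and 4 used earlier, as one would hope when the model is appropriate.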

from

So a typical situation is that we have data:

x_1  x_2  . . .  x_n
y_1  y_2  . . .  y_n

We can assume that these points were generated by a model like the one described at the beginning of this section:

$$y_i = \alpha + \beta x_i + \varepsilon_i$$

for all i = 1, 2, . . . , n, where $\alpha$ and $\beta$ do not depend on i. Again, $\varepsilon_1, \varepsilon_2, \ldots$ are i.i.d. with expectation 0 and standard deviation $\sigma$. Typically, $\alpha$, $\beta$ and $\sigma$ are not known to us. So how can we figure them out? Now when we have many data points, we want to try to find a straight line which is close to all the points. Consider any straight line y = a + bx. We could try to find such a line so that the sum of the distances to all the points $(x_i, y_i)$ is small. This would correspond to searching for a straight line which minimizes:

$$\sum_{i=1}^{n} |y_i - a - bx_i|.$$

Note that in the above sum, the $y_i$'s and the $x_i$'s are given numbers, so we only need to find a and b minimizing the above expression. Now, absolute values are a mess to calculate with. So, instead, we will take the sum of the squared distances:

$$d^2(a, b) := \sum_{i=1}^{n} (y_i - a - bx_i)^2$$

and find a and b minimizing $d^2(a, b)$. This will yield very nice explicit formulas. To find those formulas we simply take the derivative with respect to a and with respect to b and set each equal to 0. This yields:

$$\frac{d \sum_{i=1}^{n} (y_i - a - bx_i)^2}{da} = -2\sum_{i=1}^{n} (y_i - a - bx_i).$$

Setting the expression on the right side of the last equation above equal to 0, we find:

$$\bar{y} = a + b\bar{x},$$

where

$$\bar{y} := \frac{y_1 + \ldots + y_n}{n}$$

and

$$\bar{x} := \frac{x_1 + \ldots + x_n}{n}.$$

Then, we take the derivative with respect to b and set it equal to 0:

$$\frac{d \sum_{i=1}^{n} (y_i - a - bx_i)^2}{db} = -2\sum_{i=1}^{n} x_i(y_i - a - bx_i).$$

So, setting the expression on the right side of the last equation above equal to 0 yields:

$$\sum_{i=1}^{n} x_i y_i - a\sum_{i=1}^{n} x_i - b\sum_{i=1}^{n} x_i x_i = 0. \tag{17.7}$$

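Now shift the x-coordinates by $\bar{x}$, that is, consider the values $\tilde{x}_i := x_i - \bar{x}$. Such a shift moves the points and the line together, so the slope b is not changed. Also, the squared distances are not affected by a shift in x-coordinates. So, the same formula as 17.7 must hold for the values $\tilde{x}_i$ (with a now denoting the intercept of the line in the shifted coordinates). Hence, formula 17.7 is equivalent to

$$\sum_{i=1}^{n} \tilde{x}_i y_i - a\sum_{i=1}^{n} \tilde{x}_i - b\sum_{i=1}^{n} (\tilde{x}_i)^2 = 0.$$

But $\sum_{i=1}^{n} \tilde{x}_i = 0$. Hence, we have

$$\sum_{i=1}^{n} \tilde{x}_i y_i - b\sum_{i=1}^{n} (\tilde{x}_i)^2 = 0$$

and therefore

$$b = \frac{\sum_{i=1}^{n} \tilde{x}_i y_i}{\sum_{i=1}^{n} (\tilde{x}_i)^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})\,y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.$$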

We have now found a system of two equations for a and b, which determines which straight line y = a + bx gets closest to the data points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. By closest, we mean that the sum of the squared vertical distances between the points and the line should be minimal. So, the system of two equations is:

$$\bar{y} = a + b\bar{x} \tag{17.8}$$

$$b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})\,y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2}. \tag{17.9}$$

Solving the above system of two equations in a and b yields the straight line y = a + bx which is closest (in our sense of sum of squared distances) to our points.
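One way to convince oneself that the system 17.8 and 17.9 really picks out the minimizer is to check numerically that both partial derivatives vanish at the solution and that nudging a or b increases the sum of squared distances. The sketch below does this for the five-shop data used earlier; the function names and the perturbation size 0.1 are arbitrary illustrative choices.

```python
def d2(a, b, xs, ys):
    """Sum of squared vertical distances between the points and the line y = a + b*x."""
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))


def closest_line(xs, ys):
    """Solve the system 17.8 and 17.9 for (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = sum((x - x_bar) * y for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
    a = y_bar - b * x_bar
    return a, b


xs = [1, 2, 3, 4, 6]
ys = [8, 10, 17, 17, 27]
a, b = closest_line(xs, ys)

# Both partial derivatives of d2 should vanish (up to rounding) at the solution
grad_a = -2 * sum(y - a - b * x for x, y in zip(xs, ys))
grad_b = -2 * sum(x * (y - a - b * x) for x, y in zip(xs, ys))
print(grad_a, grad_b)

# Moving away from (a, b) in any of these directions should increase d2
best = d2(a, b, xs, ys)
for da, db in [(0.1, 0), (-0.1, 0), (0, 0.1), (0, -0.1)]:
    print(d2(a + da, b + db, xs, ys) > best)
```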

We will use these values for a and b, which minimize the sum of squared distances, as our estimates for $\alpha$ and $\beta$. An explanation why this is a good idea can be found below in the subsection entitled: how precise is our estimate. So, we have that the estimates $\hat{\alpha}$ and $\hat{\beta}$ are the only solution to 17.8 and 17.9. Hence, they are given by the following two equations:

$$\bar{y} = \hat{\alpha} + \hat{\beta}\bar{x}$$

and

$$\hat{\beta} := \frac{\sum_{i=1}^{n} (x_i - \bar{x})\,y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.$$

17.4

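We can calculate the variance of our estimate $\hat{\beta}$. Recall that $\hat{\beta}$ is given by:

$$\hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})\,y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.$$

We are going to take the variance on both sides of the last equation above, and use the fact that the $x_i$'s are constants and not random. Recall that constants which multiply a random variable can be taken out of the variance after squaring, and that the variance of a sum of independent random variables is the sum of the variances. This leads to

$$VAR[\hat{\beta}] = VAR\left[\frac{\sum_{i=1}^{n} (x_i - \bar{x})\,y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right] = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2\,VAR[y_i]}{\left(\sum_{i=1}^{n} (x_i - \bar{x})^2\right)^2} = \frac{VAR[y_i]}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.$$

So, we get finally:

$$VAR[\hat{\beta}] = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

and

$$\sigma_{\hat{\beta}} = \frac{\sigma}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}. \tag{17.10}$$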

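Next we want to calculate the expectation of the estimate $\hat{\beta}$. Recall that the error terms $\varepsilon_i$ have zero expectation: $E[\varepsilon_i] = 0$ and hence

$$E[Y_i] = E[\alpha + \beta x_i + \varepsilon_i] = E[\alpha] + E[\beta x_i] + E[\varepsilon_i] = \alpha + \beta x_i.$$

Thus:

$$E[\hat{\beta}] = E\left[\frac{\sum_{i=1}^{n} (x_i - \bar{x})\,y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right] = \frac{\sum_{i=1}^{n} (x_i - \bar{x})\,E[y_i]}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(\alpha + \beta x_i)}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

$$= \alpha\,\frac{\sum_{i=1}^{n} (x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} + \beta\,\frac{\sum_{i=1}^{n} (x_i - \bar{x})\,x_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \beta.$$

The last equality holds because $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$ and $\sum_{i=1}^{n} (x_i - \bar{x})\,x_i = \sum_{i=1}^{n} (x_i - \bar{x})^2$.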

In other words, the expected value of the estimator $\hat{\beta}$ is $\beta$ itself. This has a very important application: $\hat{\beta}$ is a random number itself, since it depends on the $\varepsilon_i$'s, which we have assumed to be random. Now for any random variable Z, we measure the approximate average distance from its expectation (= dispersion) by the standard deviation of the variable. So, how far $\hat{\beta}$ is from $E[\hat{\beta}] = \beta$ on average, when we keep repeating the experiment, is given by $\sigma_{\hat{\beta}}$. But the distance between $\hat{\beta}$ and $\beta$ is the estimation error of our estimate. So, in other words, the average size of the estimation error (when we estimate $\beta$) is given by $\sigma_{\hat{\beta}}$, for which we have a closed expression given in equation 17.10 above.
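The phrase "when we keep repeating the experiment" can be taken literally in a small simulation. The sketch below assumes normal errors and picks alpha = 3, beta = 4, sigma = 3, the five shop sizes used earlier, 20000 repetitions and a fixed random seed; all of these are illustrative choices, not values prescribed by the theory.

```python
import random
from math import sqrt
from statistics import mean, pstdev


def beta_hat(xs, ys):
    """Least-squares slope estimate."""
    x_bar = sum(xs) / len(xs)
    return sum((x - x_bar) * y for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)


def simulate(alpha, beta, sigma, xs, repetitions, seed=0):
    """Repeat the experiment: draw new errors, recompute beta_hat each time."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(repetitions):
        ys = [alpha + beta * x + rng.gauss(0, sigma) for x in xs]
        estimates.append(beta_hat(xs, ys))
    return estimates


xs = [1, 2, 3, 4, 6]
estimates = simulate(alpha=3, beta=4, sigma=3, xs=xs, repetitions=20000)

# Formula 17.10: sigma / sqrt(sum (x_i - x_bar)^2)
x_bar = sum(xs) / len(xs)
theoretical_std = 3 / sqrt(sum((x - x_bar) ** 2 for x in xs))

print(f"average of beta_hat: {mean(estimates):.3f} (true beta is 4)")
print(f"spread of beta_hat:  {pstdev(estimates):.3f} (formula 17.10 gives {theoretical_std:.3f})")
```

The average of the simulated estimates should be close to the true slope, and their spread close to the value from 17.10, up to simulation noise.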


17.5

17.6

17.7

Multiple factors and or polynomial regression

Other applications

