
ICICI Centre for Mathematical Sciences

Mathematical Finance 2004-5


A Probability Primer
Dr Amber Habib

We will sum up the most essential aspects of Probability, and the transition
to Statistical Inference. We avoid technical distractions. The gaps we leave
can be filled in by consulting one of the references provided at the end.

The figures have been produced using Mathematica.

Contents
1 Sample Space, Random Variables

2 Probability

3 Probability Distributions

4 Binomial Distribution

5 Normal Distribution

6 Expectation

7 Variance

8 Lognormal Distribution

9 Bivariate Distributions

10 Conditional Probability and Distributions

11 Independence

12 Chi Square Distribution

13 Random Samples

14 Sample Mean and Variance

15 Large Sample Approximations

16 Point Estimation

17 Method of Moments

18 Maximum Likelihood Estimators

19 Sampling Distributions

20 Confidence Intervals

1 Sample Space, Random Variables

The sample space is the collection of objects whose statistical properties


are to be studied. Each such object is called an outcome, and a collection
of outcomes is called an event.

• Mathematically, the sample space is a set, an outcome is a member of this


set, and an event is a subset of it.

For example, our sample space may be the collection of stocks listed on
the National Stock Exchange. We may ask statistical questions about this
space, such as: What is the average change in the value of a listed stock
(an outcome) over the past year? Alternately, our sample space might be
the history of a single stock: All its daily closing prices, say, over the last
5 years. Our question might be: What is the average weekly change of the
value over the last year? Or, how large is the variation in the prices over its
5 year history? We could ask more subtle questions: How likely is it that the
stock will have a sudden fall in value over the next month?

• Typical notation is to use S for sample space, lower case letters such as x
for individual outcomes, and capital letters such as E for events.

In a particular context, we would be interested in some specific numerical


property of the outcomes: such as the closing price on October 25, 2004,
of all the stocks listed on the NSE. This property allots a number to each
outcome, so it can be viewed as a function whose domain is the sample space
S and range is in the real numbers R.

• A function X : S → R is called a random variable.



Our interest is in taking a particular random variable X and studying how


its values are distributed. What is the average? How much variation is there
in its values? Are very large values unlikely enough to be ignored?

2 Probability

We present two viewpoints on the meaning of probability. Both are relevant


to Finance and, interestingly, they lead to the same mathematical definition
of probability!

Viewpoint 1: The probability of an event should predict the relative fre-


quency of its occurrence. That is, suppose we say the probability of a random
stock having increased in value over the last month is 0.6. Then, if we look
at 100 different stocks, about 60 of them should have increased in value. The
prediction should be more accurate if we look at larger numbers of stocks.

Viewpoint 2: The probability of an event reflects our (subjective) opinion


about how likely it is to occur, in comparison to other events. Thus, if we
allot probability 0.4 to event A and 0.2 to event B, we are expressing the
opinion that A is twice as likely as B.

Viewpoint 1 is appropriate when we are analyzing the historical data to


predict the future. Viewpoint 2 is useful in analyzing how an individual may
act when faced with certain information. Both viewpoints are captured by
the following mathematical formulation:

• Let A be the collection of all subsets of the sample space S: we call it the
event algebra.

• A probability function is a function P : A → [0, 1] such that:


1. P(S) = 1.

2. P(∪_{i=1}^∞ Ei) = Σ_{i=1}^∞ P(Ei), if the Ei are pairwise disjoint events (i.e., i ≠ j
implies Ei ∩ Ej = ∅).

Let P : A → [0, 1] be a probability function. Then it automatically has the


following properties:

1. P(∅) = 0.

2. P(∪_{i=1}^n Ei) = Σ_{i=1}^n P(Ei), if the Ei are pairwise disjoint events.

3. P(A^c) = 1 − P(A).

4. P(∪_i Ei) ≤ Σ_i P(Ei), for any collection of events Ei.

3 Probability Distributions

We return to our main question: How likely are different values (or ranges
of values) of a random variable X : S → R?

If we just plug in the definition of a random variable, we realize that our


question can be phrased as follows: What is the probability of the event
whose outcomes correspond to a given value (or range of values) of X?

Thus, suppose I ask, what is the probability that X takes on a value greater
than 100? This is to be interpreted as: what is the probability of the event
whose outcomes t all satisfy X(t) > 100? That is,
P(X > 100) = P({t : X(t) > 100}).
It is convenient to consider two types of random variables, ones whose values
vary discretely (discrete random variables) and those whose values vary
continuously (continuous random variables).1

Examples:
1. Let us allot +1 to the stocks whose value rose on a given day, and −1 to
those whose value fell. Then we have created a random variable whose
possible values are ±1. This is a discrete random variable.
2. Let us allot the whole number +n to the stocks whose value rose by
between n and n + 1 on a given day, and −n to those whose value
fell by between n and n + 1. Then we have created a random variable
whose possible values are all the integers. This is also a discrete random
variable.
¹ These are not the only types. But these types contain all the ones we need.

3. If, in the previous example, we let X be the actual change in value, it


is still discrete (since all changes are in multiples of Rs 0.01). However,
now the values are so close that it is simpler to ignore the discreteness
and model X as a continuous random variable.

Discrete Random Variables

Let S be the sample space, A the event algebra, and P : A → [0, 1] a


probability function. Let X : S → R be a discrete random variable, with
range x1 , x2 , . . . (the range can be finite or infinite). The probability of any
particular value x is

P(X = x) = P({t ∈ S : X(t) = x}).

• We call these values the probability distribution or probability den-


sity of X and denote them by fX : R → [0, 1],

fX (x) = P(X = x).

One can find the probability of a range of values of X by just summing up


the probabilities of all the individual values in that range. For instance,
P(a < X < b) = Σ_{xi ∈ (a,b)} fX(xi).

In particular, summing over the entire range gives


Σ_i fX(xi) = 1,

since the total probability must be 1.

Discrete Uniform Distribution: Consider an X whose range is {0, 1, 2, . . . , n}


and each value is equally likely. Then

fX(x) = 1/(n + 1) if x ∈ {0, 1, 2, . . . , n}, and fX(x) = 0 otherwise.

Continuous Random Variables

Suppose the values of a random variable X vary continuously over some


range, such as [0, 1]. Then, it is not particularly useful to ask for the likeli-
hood of X taking on any individual value such as 1/2 - since there are in-
finitely many choices, each individual choice is essentially impossible and has
probability zero. From a real life viewpoint also, since exact measurements
of a continuously varying quantity are impossible, it is only reasonable to ask
for the probability of an observation lying in a range, such as (0.49, 0.51),
rather than its having the exact value 0.5.

The notion of a probability distribution of a continuous random variable is


developed with this in mind. Recall that in the discrete context, probability
of a range was obtained by summing over that range. So, in the continuous
case, we seek to obtain probability of a range by integrating over it.

• Given a continuous random variable X, we define its probability density


to be a function fX : R → [0, ∞) such that for any a, b with a ≤ b,
P(a ≤ X ≤ b) = ∫_a^b fX(x) dx.

In particular, ∫_{−∞}^{∞} fX(x) dx = 1.

Important. The number fX (x) does not represent the probability that
X = x. Individual values of fX have no significance, only the integrals of fX
do! (Contrast this with the discrete case.)

Continuous Uniform Distribution:2 Suppose we want a random variable


X that represents a quantity which varies over [0, 1] without any bias: this is
taken to mean that P(X ∈ [a, b]) should not depend on the location of [a, b]
but only on its length.

[Diagram: the interval [0, 1] split into subintervals I1, I2, I3 of equal length, with
P(X ∈ I1) = P(X ∈ I2) = P(X ∈ I3).]

² Usually, this is just called the uniform distribution.

This is achieved by taking the following probability density:



fX(x) = 1 for 0 ≤ x ≤ 1, and fX(x) = 0 otherwise.

For then, with 0 ≤ a ≤ b ≤ 1,

P(a ≤ X ≤ b) = ∫_a^b 1 dx = b − a.

[Graph: the density fX, equal to 1 on [0, 1] and 0 elsewhere.]

Cumulative Probability Distribution

• The cumulative probability distribution of a random variable X, de-


noted FX , is defined by:

FX (x) = P(X ≤ x).

It is easy to see that


FX(x) = Σ_{xi ≤ x} fX(xi)           if X is discrete with range x1, x2, . . .

FX(x) = ∫_{−∞}^x fX(t) dt           if X is continuous

Now we look at two distributions, one discrete and one continuous, which
are especially important.

4 Binomial Distribution

Consider a random variable X which can take on only two values, say 0 and 1
(the choice of values is not important). Suppose the probability of the value
0 is q and of 1 is p. Then we have:

1. 0 ≤ p, q ≤ 1,
2. p + q = 1.

Question: Suppose we observe X n times. What are the likely distributions


of 0’s and 1’s? Specifically, we ask: What is the probability of observing the
value 1 exactly k times?

We calculate as follows. Let us consider all possible combinations of n 0’s


and 1’s:

Number of combinations with k 1’s = Ways of picking k slots out of n = C(n, k).

Probability of each individual combination with k 1’s = p^k (1 − p)^(n−k).

Therefore, P(1 occurs k times) = C(n, k) p^k (1 − p)^(n−k).

• A random variable Y has a binomial distribution with parameters n


and p if it has range 0, 1, . . . , n and its probability distribution is:
 
fY(k) = C(n, k) p^k (1 − p)^(n−k),   k = 0, 1, . . . , n.

We call Y a binomial random variable and write Y ∼ B(n, p).

As illustrated above, binomial distributions arise naturally wherever we are


faced with a sequence of choices. In Finance, the binomial distribution is
part of the Binomial Tree Model for pricing options.
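By way of illustration (a small Python sketch; the parameter values n = 10 and p = 0.2, 0.5 are those of Figure 1), the probabilities fY(k) can be computed directly from the formula above and checked to sum to 1:

from math import comb

def binomial_pmf(n, p, k):
    # f_Y(k) = C(n, k) p^k (1 - p)^(n - k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

for p in (0.2, 0.5):
    probs = [binomial_pmf(10, p, k) for k in range(11)]
    print("p =", p, "sum of probabilities =", round(sum(probs), 6))
    print([round(q, 3) for q in probs])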

Figure 1 illustrates the binomial distributions with p = 0.2 and p = 0.5,


using different types of squares for each. Both have n = 10.

Exercise. In Figure 1, identify which points correspond to p = 0.2 and


which to p = 0.5.


Figure 1: Two binomial distributions.

5 Normal Distribution

This kind of probability distribution is at once the most common in nature,


among the easiest to work with mathematically, and theoretically at the heart
of Probability and Statistical Inference. Among its remarkable properties is
that quantities arising as the combined effect of many small independent influences
tend to be governed by it. When in doubt about the nature of a distribution, assume it is (nearly)
normal, and you will usually get good results!

We first define the standard normal distribution. This is a probability


density of the form
fX(x) = (1/√(2π)) e^(−x²/2).

It has the following ‘bell-shaped’ graph:

[Graph: the bell-shaped curve of the standard normal density, plotted from −3 to 3.]

Exercise. Can you explain the factor 1/√(2π)? (Hint: Think about the

requirement that total probability should be 1.)

Note that the graph is symmetric about the y-axis. The axis of symmetry can
be moved to another position m, by replacing x by x − m. In the following
diagram, the dashed line represents the standard normal distribution:

[Graph: the standard normal density (dashed) and the same curve shifted to be centred at m.]

fX(x) = (1/√(2π)) e^(−(x−m)²/2).

Also, starting from the standard normal distribution, we can create one with
a similar shape but bunched more tightly around the y-axis (when s < 1) or spread
more widely (when s > 1). We achieve this by replacing x with x/s:

fX(x) = (1/(s√(2π))) e^(−(x/s)²/2).

By combining both kinds of changes, we reach the definition of a general


normal distribution:

• A random variable X has a normal distribution with parameters µ, σ,


if its density function has the form:

fX(x) = (1/(σ√(2π))) e^(−((x−µ)/σ)²/2).
We call X a normal random variable and write X ∼ N(µ, σ).

The axis of symmetry of this distribution is determined by µ and its clustering


about the axis of symmetry is controlled by σ.

Exercise. Will increasing σ make the graph more tightly bunched around
the axis of symmetry? What will it do to the peak height of the graph?

In the empirical sciences, errors in observation tend to be normally dis-


tributed: they are clustered around zero, small errors are common, and very
large errors are very rare. Regarding this, observe from the graph that by
±3 the density of the standard normal distribution has essentially become
zero: in fact the probability of a standard normal variable taking on a value
outside [−3, 3] is just 0.0027. In theoretical work, the normal distribution is
the main tool in determining whether the gap between theoretical predictions
and observed reality can be attributed solely to errors in observation.
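The 0.0027 figure is easy to verify with a short Python sketch, using the standard identity Φ(x) = (1 + erf(x/√2))/2 for the standard normal cumulative distribution:

from math import erf, sqrt

def std_normal_cdf(x):
    # Phi(x) = (1 + erf(x / sqrt(2))) / 2
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# probability of a standard normal variable falling outside [-3, 3]
p_outside = 2 * (1 - std_normal_cdf(3.0))
print(round(p_outside, 4))   # 0.0027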

6 Expectation

If we have some data consisting of numbers xi , each occurring fi times, then


the average of this data is defined to be:
x̄ = (Sum of all the data)/(Number of data points) = (Σ_i fi xi)/(Σ_i fi) = Σ_i (fi / Σ_j fj) xi.

Now, if we have a discrete random variable X, then fX (xi ) predicts the


relative frequency with which xi will occur in a large number of observations
of X, i.e., we view fX(xi) as a prediction of fi / Σ_j fj. And then,

Σ_i fX(xi) xi

becomes a predictor for the average x̄ of the observations of X.



• The expectation of a discrete random variable is defined to be


E[X] = Σ_i fX(xi) xi.

On replacing the sum by an integral we arrive at the notion of expectation


of a continuous random variable:

• The expectation of a continuous random variable is defined to be


E[X] = ∫_{−∞}^{∞} x fX(x) dx.

Expectation is also called mean and denoted by µX or just µ.

Exercise. Make the following calculations:


1. X has the discrete uniform distribution with range 0, . . . , n. Then
E[X] = n/2.
2. X has the uniform distribution on [0, 1]. Then E[X] = 1/2.
3. X ∼ B(n, p) implies E[X] = np.
4. X ∼ N(µ, σ) implies E[X] = µ.

Some elementary properties of expectation:


1. E[c] = c, for any constant c.3
2. E[cX] = c E[X], for any constant c.

Suppose X : S → R is a random variable and g : R → R is any function.


Then their composition g ◦ X : S → R, defined by (g ◦ X)(w) = g(X(w)), is
a new random variable which we will call g(X).

Example. Let g(x) = x^r. Then g ◦ X is denoted X^r. □

Suppose X is discrete with range {xi }. Then, the range of g(X) is {g(xi)}.
Therefore we can calculate the expectation of g(X) as follows:4
E[g(X)] = Σ_i g(xi) P(g(X) = g(xi)) = Σ_i g(xi) fX(xi).
³ A constant c can be viewed as a random variable whose range consists of the single
value c.
⁴ Our calculation is valid if g is one-one. With slightly more effort we can make it valid
for any g.

If X is continuous, one has the analogous formula:


E[g(X)] = ∫_{−∞}^{∞} g(x) fX(x) dx.

Example. Let g(x) = x². Then

E[X²] = Σ_i xi² fX(xi)                  if X is discrete,

E[X²] = ∫_{−∞}^{∞} x² fX(x) dx          if X is continuous. □

With these facts in hand, the following result is easy to prove.

• Let X be any random variable, and g, h two real functions. Then


E[g(X) + h(X)] = E[g(X)] + E[h(X)].
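As a small numerical illustration of these formulas (a Python sketch using the B(10, 0.2) distribution of Figure 1; the rounding is only to keep the output tidy), the sums defining E[X] and E[X²] can be evaluated directly:

from math import comb

n, p = 10, 0.2
f = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

mean = sum(k * f[k] for k in f)                 # E[X]   = sum_i x_i f_X(x_i)
second_moment = sum(k**2 * f[k] for k in f)     # E[X^2] = sum_i x_i^2 f_X(x_i)

print(round(mean, 6))                           # 2.0, i.e. n*p
print(round(second_moment - mean**2, 6))        # 1.6, i.e. np(1-p); see Section 7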

7 Variance

Given some data {xi}_{i=1}^n, its average x̄ is seen as a central value about which
the data is clustered. The significance of the average is greater if the clus-
tering is tight, less otherwise. To measure the tightness of the clustering, we
use the variance of the data:
s² = (1/n) Σ_i (xi − x̄)².

Variance is just the average of the squared distance from each data point to
the average (of the data).

Therefore, in the analogous situation where we have a random variable X, if


we wish to know how close to its expectation its values are likely to be, we
again define a quantity called the variance of X:
var[X] = E[(X − E[X])²].

Alternate notation for variance is σX² or just σ². The quantity σX or σ, the
square root of the variance, is called the standard deviation of X. Its
advantage is that it is in the same units as X.

Exercise. Will a larger value of variance indicate tighter clustering around


the mean?

Sometimes, it is convenient to use the following alternative formula for vari-


ance:

• var[X] = E[X²] − E[X]².

This is obtained as follows.

var[X] = E[(X − E[X])²]
       = E[X² − 2E[X]X + E[X]²]
       = E[X²] − 2E[E[X]X] + E[E[X]²]
       = E[X²] − 2E[X]² + E[X]²
       = E[X²] − E[X]².

Elementary properties of variance:

1. var[X + a] = var[X], if a is any constant.

2. var[aX] = a2 var[X], if a is any constant.

Exercise. Will it be correct to say that σaX = a σX for any constant a?

Exercise. Let X be a random variable with expectation µ and standard


deviation σ. Then Z = (X − µ)/σ has expectation 0 and standard deviation 1.

Exercise. Suppose X has the discrete uniform distribution with range {0, . . . , n}.
We have seen that E[X] = n/2. Show that its variance is n(n + 2)/12.

Exercise. Suppose X has the continuous uniform distribution with range
[0, 1]. We have seen that E[X] = 1/2. Show that its variance is 1/12.

Example. Suppose X ∼ B(n, p). We know E[X] = np. Therefore,


E[X(X − 1)] = Σ_{k=0}^n k(k − 1) C(n, k) p^k (1 − p)^(n−k)
            = Σ_{k=2}^n (n! / ((k − 2)!(n − k)!)) p^k (1 − p)^(n−k)
            = n(n − 1) Σ_{k=2}^n C(n − 2, k − 2) p^k (1 − p)^(n−k)
            = n(n − 1)p² Σ_{i=0}^{n−2} C(n − 2, i) p^i (1 − p)^((n−2)−i)
            = n(n − 1)p²

And so,

var[X] = E[X²] − E[X]²
       = E[X(X − 1)] + E[X] − E[X]²
       = n(n − 1)p² + np(1 − np)
       = np(1 − p)

Example. Suppose X ∼ N(µ, σ). We know E[X] = µ. Therefore,


var[X] = (1/(σ√(2π))) ∫_{−∞}^{∞} (x − µ)² e^(−((x−µ)/σ)²/2) dx

       = (σ²/√(2π)) ∫_{−∞}^{∞} z² e^(−z²/2) dz

Now we integrate by parts:

∫_{−∞}^{∞} z² e^(−z²/2) dz = ∫_{−∞}^{∞} z (z e^(−z²/2)) dz

                           = [−z e^(−z²/2)]_{−∞}^{∞} + ∫_{−∞}^{∞} e^(−z²/2) dz

                           = 0 + √(2π)

Therefore
var[X] = σ². □

Figure 2: Normal approximation to a binomial distribution.

Thus, if X ∼ N(µ, σ), then Z = (X − µ)/σ has expectation 0 and standard
deviation 1. In fact, Z is again normally distributed and hence is the stan-
dard normal distribution (it has parameters 0 and 1). Through this link, all
questions about normal distributions can be converted to questions about
the standard normal distribution. For instance, let X, Z be as above. Then:

P(X ≤ a) = P((X − µ)/σ ≤ (a − µ)/σ) = P(Z ≤ (a − µ)/σ).

Now we can also illustrate our earlier statement about how the normal dis-
tribution serves as a substitute for other distributions. Figure 2 compares
a binomial distribution (n = 100 and p = 0.2) with the normal distribution
with the same mean and variance (µ = np = 20 and σ² = np(1 − p) = 16).

Generally, the normal distribution is a good approximation to the binomial


distribution if n is large. One criterion that is often used is that, for a
reasonable approximation, we should have both np and n(1 − p) greater than
5.
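A rough check of the approximation in Figure 2 can be made by comparing the binomial probabilities with the normal density of the same mean and standard deviation; the following Python sketch does this at a few values of k:

from math import comb, exp, pi, sqrt

n, p = 100, 0.2
mu, sigma = n * p, sqrt(n * p * (1 - p))        # 20 and 4

def binom_pmf(k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def normal_pdf(x):
    return exp(-((x - mu) / sigma) ** 2 / 2) / (sigma * sqrt(2 * pi))

for k in (12, 16, 20, 24, 28):
    print(k, round(binom_pmf(k), 4), round(normal_pdf(k), 4))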

Figure 3: Lognormal density functions: (A) µ = 0, σ = 1, (B) µ = 1, σ = 1, (C)
µ = 0, σ = 0.4.

8 Lognormal Distribution

• If Z ∼ N(µ, σ) then X = e^Z is called a lognormal random variable with


parameters µ and σ.5

Exercise. Let Z ∼ N(µ, σ). Then


E[e^(tZ)] = e^(µt + σ²t²/2).

The t = 1, 2 cases of the Exercise immediately give the following:

• Let X be a lognormal variable with parameters µ and σ. Then


E[X] = e^(µ + σ²/2),    var[X] = e^(2µ + 2σ²) − e^(2µ + σ²).

The lognormal distribution is used in Finance to model the variation of stock


prices with time. Without explaining how, we show in Figure 4 an example
of the kind of behaviour predicted by this model.
⁵ The name comes from “The log of X is normal.”

Figure 4: A simulation of the lognormal model for variation of stock prices with
time.
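To give a feel for Figure 4, here is a minimal Python sketch of one common way such paths are simulated: daily log-returns are drawn as independent normal variables, so each day's price is the previous price multiplied by a lognormal factor. The drift 0.0002, volatility 0.02, starting price 100 and horizon of 250 days are illustrative choices only:

import math
import random

random.seed(1)

mu, sigma = 0.0002, 0.02     # illustrative daily drift and volatility of log-returns
price = 100.0                # illustrative starting price
path = [price]

for day in range(250):
    # each day's price is the previous price times e^(N(mu, sigma)), a lognormal factor
    price *= math.exp(random.gauss(mu, sigma))
    path.append(price)

print(round(path[-1], 2), round(min(path), 2), round(max(path), 2))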

9 Bivariate Distributions

So far we have dealt with individual random variables, i.e. with models for
particular features of a population. The next step is to study the relationships
that exist between different features of a population. For instance, an investor
might like to know the nature and strength of the connection between her
portfolio and a stock market index such as the NIFTY. If the NIFTY goes
up, is her portfolio likely to do the same? How much of a rise is it reasonable
to expect?

This leads us to the study of pairs of random variables and the probabilities
associated with their joint values. Are high values of one associated with
high or low values of the other? Is there a significant connection at all?

• Let S be a sample space and X, Y : S → R two random variables. Then


we say that X, Y are jointly distributed.

• Let X, Y be jointly distributed discrete random variables. The joint dis-
tribution fX,Y of X and Y is defined by

fX,Y (x, y) = P(X = x, Y = y).

Since fX,Y is a function of two variables, we call it a bivariate distribution.


It can be used to find any probability associated with X and Y . Let X have
range {xi } and Y have range {yj }. Then:
1. P(a ≤ X ≤ b, c ≤ Y ≤ d) = Σ_{xi ∈ [a,b]} Σ_{yj ∈ [c,d]} fX,Y(xi, yj).

2. fX(x) = Σ_j fX,Y(x, yj) and fY(y) = Σ_i fX,Y(xi, y).

3. Σ_{i,j} fX,Y(xi, yj) = 1.

We will be interested in various combinations of X and Y . Therefore, con-


sider a function g : R2 → R. We use it to define a new random variable
g(X, Y ) : S → R by

g(X, Y )(w) = g(X(w), Y (w)).

The expectation of this new random variable can be obtained, as usual, by


multiplying its values with their probabilities:
E[g(X, Y)] = Σ_{i,j} g(xi, yj) fX,Y(xi, yj).

We create analogous definitions when X, Y are jointly distributed continuous


random variables.

• Let X, Y be jointly distributed continuous random variables. Their joint


probability density fX,Y is a function of two variables whose integrals give
the probability of X, Y lying in any range:
P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_a^b (∫_c^d fX,Y(x, y) dy) dx.

Then the following are easy to prove:


1. ∫_{−∞}^{∞} fX,Y(x, y) dy = fX(x) and ∫_{−∞}^{∞} fX,Y(x, y) dx = fY(y).

2. ∫_{−∞}^{∞} ∫_{−∞}^{∞} fX,Y(x, y) dx dy = 1.

• Let X, Y be jointly distributed continuous random variables and g : R2 →


R. Then the expectation of g(X, Y ) is given by
E[g(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) fX,Y(x, y) dx dy.

Some important special cases:


1. Suppose X, Y are jointly distributed discrete random variables. Then
E[X + Y] = Σ_{i,j} (xi + yj) fX,Y(xi, yj)
         = Σ_i xi (Σ_j fX,Y(xi, yj)) + Σ_j yj (Σ_i fX,Y(xi, yj))
         = Σ_i xi fX(xi) + Σ_j yj fY(yj) = E[X] + E[Y]

There is a similar proof when X, Y are continuous. Thus expectation


always distributes over sums.
2. E[(X − µX )(Y − µY )] is called the covariance of X and Y and is
denoted by cov[X, Y ] or σXY . If large values of X tend to go with large
values of Y then covariance is positive. If they go with small values of
Y , covariance is negative. A zero covariance indicates that X and Y
are unrelated. (See Figure 5.)
3. We have the following identity:6
var[X + Y ] = var[X] + var[Y ] + 2cov[X, Y ].

Exercise. Show that cov[X, Y ] = E[XY ] − E[X]E[Y ].

• The correlation coefficient of X, Y is defined to be


ρ = ρX,Y = cov[X, Y] / (σX σY).
⁶ Compare this with the identity connecting the dot product and length of vectors:
||u + v||² = ||u||² + ||v||² + 2u · v. This motivates us to think of covariance as a kind of dot
product between different random variables, with variance as squared length. Geometric
analogies then lead to useful statistical insights and even proofs, such as the statement
below that |ρ| ≤ 1.

Figure 5: Observed values of two jointly distributed standard normal variables
with covariances 0.95, −0.6 and 0 respectively.

The advantage of the correlation coefficient is that it is not affected by the


units used in the measurement. If we replace X by X′ = aX + b and Y by
Y′ = cY + d (with a, c > 0), we will have
ρX′,Y′ = ρX,Y.

An interesting fact is that |cov[X, Y ]| ≤ σX σY .7 This immediately implies:

• −1 ≤ ρX,Y ≤ 1.

Exercise. Suppose a die is tossed once. Let X take on the value 1 if the
result is ≤ 4 and the value 0 otherwise. Similarly, let Y take on the value 1
if the result is even and the value 0 otherwise.
1. Show that the values of fX,Y are given by the following table (rows indexed by X, columns by Y):

   X\Y     Y = 0    Y = 1
   X = 0    1/6      1/6
   X = 1    1/3      1/3

2. Show that µX = 2/3 and µY = 1/2.


3. Show that cov[X, Y ] = 0.
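The numbers in this exercise can be verified by brute force, since the die has only six equally likely outcomes; a short Python sketch:

outcomes = [1, 2, 3, 4, 5, 6]                 # each with probability 1/6

def X(t): return 1 if t <= 4 else 0
def Y(t): return 1 if t % 2 == 0 else 0

f = {}                                        # the joint distribution f_{X,Y}
for t in outcomes:
    key = (X(t), Y(t))
    f[key] = f.get(key, 0) + 1/6

mu_X = sum(x * p for (x, y), p in f.items())
mu_Y = sum(y * p for (x, y), p in f.items())
cov = sum((x - mu_X) * (y - mu_Y) * p for (x, y), p in f.items())

print(f)                                      # probabilities 1/6, 1/6, 1/3, 1/3
print(round(mu_X, 4), round(mu_Y, 4), round(cov, 10))   # 0.6667, 0.5, 0.0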

10 Conditional Probability and Distributions

Consider a sample space S with event algebra A and a probability function


P : A → [0, 1]. Let A, B ∈ A be events. If we know B has occurred,
⁷ This is the analogue of the geometric fact that |u · v| ≤ ||u|| ||v||.

what is the probability that A has also occurred? We reason that since we
know B has occurred, in effect B has become our sample space. Therefore
all probabilities of events inside B should be scaled by 1/P(B), to keep the
total probability at 1. As for the occurrence of A, the points outside B
are irrelevant, so our answer should be P(A ∩ B) times the correcting factor
1/P(B).

• The conditional probability of A, given B, is defined to be

P(A|B) = P(A ∩ B) / P(B).

We apply this idea to random variables:

• Let X, Y be jointly distributed random variables (both discrete or both


continuous). Then the conditional probability distribution of Y , given
X = x, is defined to be

fY|X=x(y) = fX,Y(x, y) / fX(x).

In the discrete case, we have fY|X=x(y) = P(Y = y|X = x).

Note that the conditional probability distribution is a valid probability dis-


tribution in its own right. For example, in the discrete case, we have

1. 0 ≤ fY |X=x (y) ≤ 1,
2. Σ_i fY|X=x(yi) = Σ_i fX,Y(x, yi) / fX(x) = fX(x) / fX(x) = 1.

Since fY |X=x is a probability distribution, we can use it to define expectations.

• The conditional expectation of Y , given X = x, is defined to be


E[Y|X = x] = Σ_i yi fY|X=x(yi)                 if X, Y are discrete

E[Y|X = x] = ∫_{−∞}^{∞} y fY|X=x(y) dy         if X, Y are continuous


Figure 6: Regression curve for two normal variables with ρ = 0.7.

• Note that E[Y |X = x] is a function of x. It is also denoted by µY |X=x or


µY |x , and is called the curve of regression of Y on X.

The function E[Y |X = x] creates a new random variable E[Y |X]. Below,
we calculate the expectation of this new random variable in the continuous
case:
E[E[Y|X]] = ∫_{−∞}^{∞} E[Y|X = x] fX(x) dx
          = ∫_{−∞}^{∞} ∫_{−∞}^{∞} y fY|X=x(y) fX(x) dy dx
          = ∫_{−∞}^{∞} ∫_{−∞}^{∞} y fX,Y(x, y) dy dx
          = E[Y]
Similar calculations can be carried out in the discrete case, so that we have
the general result:
E[Y ] = E[E[Y |X]].
This result is useful when we deal with experiments carried out in stages,
and we have information on how the results of one stage depend on those of
the previous ones.
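In the discrete case the identity can be checked by direct summation. The following Python sketch does this for the joint distribution of the die-toss exercise of Section 9 (X = 1 if the result is at most 4, Y = 1 if it is even):

# joint distribution of the die-toss example: keys are (x, y) pairs
f = {(0, 0): 1/6, (0, 1): 1/6, (1, 0): 1/3, (1, 1): 1/3}

f_X = {x: sum(p for (xx, y), p in f.items() if xx == x) for x in (0, 1)}

def cond_exp_Y(x):
    # E[Y | X = x] = sum_y y f_{X,Y}(x, y) / f_X(x)
    return sum(y * p for (xx, y), p in f.items() if xx == x) / f_X[x]

E_Y = sum(y * p for (x, y), p in f.items())
E_of_cond = sum(cond_exp_Y(x) * f_X[x] for x in (0, 1))

print(round(E_Y, 10), round(E_of_cond, 10))   # both 0.5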

11 Independence

Let X, Y be jointly distributed random variables. We consider Y to be


independent of X, if knowledge of the value taken by X tells us nothing
about the value taken by Y . Mathematically, this means:
fY |X=x (y) = fY (y).
This is easily rearranged to:
fX,Y (x, y) = fX (x)fY (y).
Note that the last expression is symmetric in X, Y .

• Jointly distributed X, Y are independent if we have the identity:


fX,Y (x, y) = fX (x)fY (y).

Exercise. If X, Y are independent random variables and g : R2 → R any


function of the form g(x, y) = m(x)n(y), then
E[g(X, Y )] = E[m(X)]E[n(Y )].

Exercise. If X, Y are independent, then cov[X, Y ] = 0.

A common error is to think zero covariance implies independence – in fact it


only indicates the possibility of independence.

Exercise. If X, Y are independent, then var[X + Y ] = var[X] + var[Y ].

We return again to the normal distribution. The following facts about it are
the key to its wide usability:
1. If X ∼ N(µ, σ) and a ≠ 0, then aX ∼ N(aµ, |a|σ).

2. If Xi ∼ N(µi, σi), i = 1, . . . , n, are pairwise independent, then Σ_i Xi ∼ N(µ, σ), where

   µ = Σ_i µi,    σ² = Σ_i σi².

Of course, the mean and variance would add up like this for any collection
of independent random variables. The important feature here is the preser-
vation of normality.


Figure 7: Chi square distributions with various degrees of freedom n.

12 Chi Square Distribution

We have remarked that measurements of continuously varying quantities tend


to follow normal distributions. Therefore, a consistent method of measure-
ment will then amount to looking at a sequence of normal variables. The
errors in the measurements can be treated as a sequence of independent nor-
mal variables with mean zero. By scaling units, we can take them to be
standard normal.

Suppose we consider independent standard normal variables X1 , . . . , Xn . If


we view them as representing a sequence of errors, it is natural to ask if, in
total, the errors are large or small. The sum Σ_i Xi won’t do as a measure of
this, because individual Xi could take on large values yet cancel out to give
a small value of the sum. This problem can be avoided by summing either
|Xi| or Xi². The latter choice is more tractable.

• Let X1 , . . . , Xn be independent standard normal variables. The variable


X = Σ_{i=1}^n Xi²

is called a chi square random variable with n degrees of freedom. We
write X ∼ χ²(n), and say it has the chi square distribution.

Exercise. Let X be a standard normal variable. Show that:


1. E[X²] = 1.

2. E[X⁴] = 3.

3. var[X²] = 2.

Figure 8: Density functions of chi square and normal distributions

Exercise. Let X ∼ χ²(n). Show that E[X] = n and var[X] = 2n.

Figure 7 shows chi square distributions with different degrees of freedom.


Note that as the degrees of freedom increase, the distributions look more and
more normal. Figure 8 compares the χ²(30) and N(30, √60) distributions:
they are very close to each other, and so a chi square distribution χ²(n) with
n ≥ 30 is usually replaced by a normal approximation N(n, √(2n)).

• If X ∼ χ²(m) and Y ∼ χ²(n) are independent, then X + Y is a sum of
m + n squared independent standard normal variables. Therefore X + Y ∼
χ²(m + n).

• The converse is also true. If X, Y are independent, X ∼ χ²(m) and X + Y ∼
χ²(m + n), then Y ∼ χ²(n).
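A quick simulation also illustrates the definition and the exercise above: sums of n squared standard normals have sample mean close to n and sample variance close to 2n. A Python sketch (the values n = 5 and 100000 repetitions are arbitrary):

import random

random.seed(0)

def chi_square_draw(n):
    # one draw of X = X_1^2 + ... + X_n^2 with X_i independent standard normal
    return sum(random.gauss(0, 1) ** 2 for _ in range(n))

n, reps = 5, 100000
draws = [chi_square_draw(n) for _ in range(reps)]

mean = sum(draws) / reps
var = sum((d - mean) ** 2 for d in draws) / (reps - 1)
print(round(mean, 2), round(var, 2))   # close to n = 5 and 2n = 10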

13 Random Samples

The subject of statistical inference is concerned with the task of using ob-
served data to draw conclusions about the distribution of properties in a
population. The main obstruction is that it may not be possible (or even
desirable) to observe all the members of the population. We are forced to
draw conclusions by observing only a fraction of the members, and these
conclusions are necessarily probabilistic rather than certain.

In such situations, probability enters at two levels. Let us illustrate this


by a familiar example - the use of opinion polls. An opinion poll consists
of a rather small number of interviews, and from the opinions observed in
these, the pollster extrapolates the likely distribution of opinions in the whole
population. Thus, one may come across a poll which, after mailing forms to
10,000 people and getting responses from about half of them, concludes that
42% (of a population of 200 million) support candidate A, while 40% support
B, and the remainder are undecided. These conclusions are not certain and
come with error limits, say of 3%. Typically, this is done by reviewing all the
possible distributions of opinions, and finding out which one is most likely
to lead to the observations. This is one level at which probability is used.
The other level is in evaluating the confidence with which we can declare our
conclusions - the error limit of 3% is not certain either, but has a probability
attached, say 90%. If we wish to have an error bar we are more certain about
(say 95%), we would have to raise the bar (say, to 5%).

Sampling is the act of choosing certain members of a population, and then


measuring their properties. Since the results depend on the choices and
are not known beforehand, each measurement is naturally represented by a
random variable. We will also make the usually reasonable assumptions that
the measurements do not disturb the population, and that each choice is
independent of the previous ones.

• A random sample is a finite sequence of jointly distributed random vari-


ables X1 , . . . , Xn such that

1. Each Xi has the same probability density: fXi = fXj for all i, j.
2. The Xi are independent: fX1,...,Xn(x1, . . . , xn) = Π_i fXi(xi).

The common density function fX for all the Xi is called the density function
of the population. We also say that we are sampling from a population of
type X.

14 Sample Mean and Variance

Broadly, the task of Statistics is to estimate the type of a population, as


well as the associated parameters. Thus we might first try to establish that
a population can be reasonably described by a binomial variable, and then
estimate n and p. The parameters of a population can be estimated in various
ways, but are most commonly approached through the mean and variance.
For instance, if we have estimates µ̂ and σ̂ 2 for the mean and variance of
a binomial population, we could then use µ = np and σ 2 = np(1 − p) to
estimate n and p.

• Let X1 , . . . , Xn be a random sample. Its sample mean is a random


variable X̄ defined by
X̄ = (1/n) Σ_{i=1}^n Xi.
Observed values of the sample mean are used as estimates of the population
mean µ. Therefore, we need to be reassured that, on average at least, we
will see the right value:
E[X̄] = (1/n) Σ_{i=1}^n E[Xi] = (1/n) Σ_{i=1}^n µ = µ.

Moreover, we would like the variance of X̄ to be small so that its values are
more tightly clustered around µ. We have
var[X̄] = (1/n²) Σ_{i=1}^n var[Xi] = σ²/n,

where σ 2 is the population variance. Thus the variance of the sample mean
goes to zero as the sample size increases: the sample mean becomes a more
reliable estimator of the population mean.

E[X̄] = µ,    var[X̄] = σ²/n

• Let X1 , . . . , Xn be a random sample. Its sample variance is a random


variable S² defined by

S² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)².

To check that S² has the right average, first note the identity

Σ_i (Xi − X̄)² = Σ_i (Xi² + X̄² − 2Xi X̄) = Σ_i Xi² − n X̄².

Taking expectations,

E[Σ_i (Xi − X̄)²] = Σ_i E[Xi²] − n E[X̄²]
                 = Σ_i (σ² + µ²) − n(var[X̄] + E[X̄]²)
                 = n(σ² + µ²) − σ² − nµ² = (n − 1)σ²

Dividing by n − 1 gives

E[S²] = σ²

A rather longer calculation, which we do not include, shows that (for a normal population):

var[S²] = 2σ⁴/(n − 1)

Again, we see that sample variance has the right average (σ²) and clusters
more tightly around it if we use larger samples.
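These properties are easy to see in a simulation. The following Python sketch uses an arbitrary population, uniform on [0, 1] (so µ = 1/2 and σ² = 1/12), and samples of size n = 25:

import random

random.seed(2)

n, reps = 25, 20000
xbars, s2s = [], []

for _ in range(reps):
    sample = [random.random() for _ in range(n)]            # uniform population on [0, 1]
    xbar = sum(sample) / n
    s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)     # sample variance
    xbars.append(xbar)
    s2s.append(s2)

mean_xbar = sum(xbars) / reps
var_xbar = sum((x - mean_xbar) ** 2 for x in xbars) / (reps - 1)

print(round(mean_xbar, 3))         # close to mu = 0.5
print(round(var_xbar, 5))          # close to sigma^2 / n = (1/12)/25 = 0.0033
print(round(sum(s2s) / reps, 4))   # close to sigma^2 = 1/12 = 0.0833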

15 Large Sample Approximations

Chebyshev’s Inequality. Let X be any random variable, with mean µ and


standard deviation σ. Then for any r > 0,

P(|X − µ| ≥ rσ) ≤ 1/r².

Proof. We give the proof for a discrete random variable X with range {xi }.
σ² = Σ_i (xi − µ)² fX(xi)
   ≥ Σ_{i : |xi−µ| ≥ rσ} (xi − µ)² fX(xi)
   ≥ r²σ² Σ_{i : |xi−µ| ≥ rσ} fX(xi)
   = r²σ² P(|X − µ| ≥ rσ)

Rearranging gives the desired inequality. □

Weak Law of Large Numbers. Let X1 , . . . , Xn be a random sample from


a population with mean µ and standard deviation σ. Then for any c > 0:

P(|X̄ − µ| ≥ c) ≤ σ²/(nc²).
In particular, the probability diminishes to zero as n → ∞.

Proof. Apply Chebyshev’s inequality to X̄, noting that it has mean µ and
standard deviation σ/√n:

P(|X̄ − µ| ≥ rσ/√n) ≤ 1/r².

Substituting c = rσ/√n gives the Weak Law. □

The Weak Law allows us to make estimates of the mean of arbitrary accuracy
and certainty, by just increasing the sample size.

A remarkable feature of the sample mean is that as the sample size increases,
its distribution looks more and more like a normal distribution, regardless of
the population distribution! We give an illustration of this in Figure 9.

Central Limit Theorem. Let X1 , . . . , Xn be a random sample from a


population with mean µ and variance σ². Then the sample mean X̄ is more
nearly normally distributed as n increases: the distribution of

(X̄ − µ)/(σ/√n)

tends to the standard normal distribution N(0, 1) as n → ∞.

When we work with large samples,⁸ the Central Limit Theorem allows us to
use the normal approximation to make sharper error estimates (relative to
the Weak Law). We will not prove this result, but we shall use it frequently.

⁸ A commonly used definition of “large” is n ≥ 30.

Figure 9: The histogram represents the means of 5000 samples of size 100, from
a binomial population with parameters n = 10 and p = 0.2. It is matched against
a normal distribution with µ = 2 and σ = √0.016.

Example. Suppose we are trying to estimate the mean of a population


whose variance is known to be σ 2 = 1.6 (as in Figure 9). Using samples
of size 100, we want error bars that are 90% certain. Then the Weak Law
suggests we find a c such that

1.6/(100c²) = 0.1, or c = 0.4.

Thus we are 90% sure that any particular observed mean is within 0.4 of
the actual population mean.

If we consider the particular data that led to the histogram of Figure 9,


we find that in fact 90% of the observed sample means are within 0.21 of
the actual mean, which is 2. Additionally, 99.9% of the observed means are
within 0.5 of 2. So the Weak Law estimates are correct, but inefficient.

On the other hand, taking 100 as a large sample size, we see from the Central
Limit Theorem that Z = 7.905(X̄ − µ) should be approximately a standard
normal variable. Therefore we are 90% sure that its observed values are
within 1.65 of 0. So we are 90% sure that any observed sample mean is
within 1.65/7.905 = 0.209 of the actual population mean. Similarly, we are 99.9%
sure that any observed mean is within 0.5 of the actual. Thus, the Central
Limit Theorem provides estimates which are close to reality (in this case,
identical). □
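The experiment behind Figure 9 is easy to reproduce. The following Python sketch draws 5000 sample means of size 100 from a B(10, 0.2) population and reports how far the middle 90% of them lie from the true mean 2 (the exact number varies slightly from run to run):

import random

random.seed(3)

def binomial_draw(n, p):
    # a draw from B(n, p): the number of successes in n trials
    return sum(1 for _ in range(n) if random.random() < p)

means = []
for _ in range(5000):
    sample = [binomial_draw(10, 0.2) for _ in range(100)]
    means.append(sum(sample) / 100)

deviations = sorted(abs(m - 2.0) for m in means)
print(round(deviations[int(0.9 * len(deviations))], 3))   # roughly 0.21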

16 Point Estimation

We now start our enquiry into the general process of estimating statisti-
cal parameters – so far we have become familiar with estimating mean and
variance via sample mean and sample variance.

• A statistic of a random sample X1 , . . . , Xn is any function

Θ = f (X1 , . . . , Xn ).

Thus a statistic is itself a random variable. X̄ and S 2 are examples of statis-


tics. The probability distribution of a statistic is called its sampling dis-
tribution.

• A statistic Θ is called an estimator of a population parameter θ if values


of the statistic are used as estimates of θ.

X̄ is an estimator of the population mean, and S 2 is an estimator of the


population variance.

• If a statistic Θ is an estimator of a population parameter θ, then any


observed value of Θ is denoted θ̂ and called a point estimate of θ.

According to this convention, observed values of X̄ should be denoted by µ̂,


and of S 2 by σ̂ 2 . We will also denote them by x̄ and s2 , respectively.

• A statistic Θ is an unbiased estimator of a population parameter θ if

E[Θ] = θ.

X̄ is an unbiased estimator of the population mean, and S 2 is an unbiased


estimator of the population variance.


Exercise. Suppose we use S = √(S²) as an estimator of the standard devia-
tion of the population. Is it an unbiased estimator?

In general, if Θ is an estimator of θ, the gap E[Θ] − θ is called its bias.

• Θ is an asymptotically unbiased estimator of θ if the bias goes to 0 as


the sample size goes to infinity.

Exercise. Show the following statistics are asymptotically unbiased estima-


tors:
1. (1/n) Σ_i (Xi − X̄)², of σ².

2. S = √(S²), of σ.

• Θ is a consistent estimator of θ if for any c > 0,


 
P |Θ − θ| ≥ c → 0, as n → ∞,

where n is the sample size.

The Weak Law of Large Numbers states that X̄ is a consistent estimator


of µ. The proof of the Weak Law can be trivially generalized to yield the
following result:

Theorem. Let Θ be an unbiased estimator for θ, such that var[Θ] → 0 as


the sample size n → ∞. Then Θ is a consistent estimator of θ. 2

It follows that S 2 is a consistent estimator of σ 2 .

17 Method of Moments

Having created the abstract notion of an estimator of a population param-


eter, we now face the problem of creating such estimators. The Method of
Moments is based on the idea that parameters of a population can be re-
covered from its moments, and for the moments there are certain obvious
estimators.

• Given a random variable X, its rth moment is defined to be

µr = E[X^r].

Note that µX = µ1 and σX² = µ2 − µ1².

• Consider a population with probability distribution fX . Then the statistic


Mr = (1/n) Σ_{i=1}^n Xi^r

is the Method of Moments estimator for the population moment µr .

We have M1 = X̄. Further, the identity σX² = µ2 − µ1² leads to the use of
M2 − M1² as an estimator of σ².

Exercise. M2 − M1² = (1/n) Σ_i (Xi − X̄)².

The Method of Moments is easy to implement. Its drawback is that it comes


with no guarantees about the performance of the estimators that it creates.
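For instance, for a binomial population with both n and p unknown (as mentioned in Section 14), equating moments gives np = M1 and np(1 − p) = M2 − M1², which can be solved for n and p. A Python sketch of this recipe, with simulated data purely for illustration:

import random

random.seed(4)

# simulated data from B(n = 12, p = 0.3); we now pretend n and p are unknown
data = [sum(1 for _ in range(12) if random.random() < 0.3) for _ in range(5000)]

m1 = sum(data) / len(data)                  # M_1
m2 = sum(x * x for x in data) / len(data)   # M_2

var_hat = m2 - m1 ** 2                      # estimates np(1 - p)
p_hat = 1 - var_hat / m1                    # since np(1 - p)/(np) = 1 - p
n_hat = m1 / p_hat                          # since np/p = n

print(round(p_hat, 3), round(n_hat, 2))     # close to 0.3 and 12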

18 Maximum Likelihood Estimators

Maximum Likelihood Estimation (MLE) will be our main tool for generating
estimators. The reason lies in the following guarantee:

Fact. Maximum Likelihood Estimators are asymptotically minimum vari-


ance and unbiased. Thus, for large samples, they are essentially the best
estimators.

Before describing the method, we ask the reader to consider the following
graph and question:

[Graph: two conjectured probability densities for the population, plotted on the same axes.]

Question. The graphs above represent two conjectured distributions for a


population. If we pick a random member of the population and it turns out
to be 2, which of these conjectures appears more plausible? What if the
random member turns out to be −1? 1?

• Consider a population whose probability density f ( · ; t) depends on a pa-


rameter t. We draw a random sample X1 , . . . , Xn from this population. The
corresponding likelihood function is defined to be
L(x1, . . . , xn; t) = Π_i f(xi; t),

where xi is a value of Xi . Thus L is the joint probability density of the


random sample.

• Suppose we observe the values x1 , . . . , xn of a random sample X1 , . . . , Xn .


The corresponding maximum likelihood estimate of t is the value t̂ that
maximizes L(x1 , . . . , xn ; t).

Example. Suppose we have a binomial population with n = 1 and we wish


to estimate p using a sample of size m. The corresponding likelihood function
is:
Y1
L(x1 , . . . , xm ; p) = pxi (1 − p)1−xi
i
xi
Y Y
= ( p)( (1 − p))
i: xi =1 i: xi =0
P P
= p i xi
(1 − p) m− i xi

= p (1 − p)m(1−x̄) .
mx̄

To maximize L, we use Calculus:


dL/dp = 0 =⇒ p = x̄.

So the sample mean X̄ is the ML Estimator of p. □
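A numerical version of this maximization is also instructive: the following Python sketch evaluates the log-likelihood on a grid of p values for some illustrative 0/1 data and finds the maximum at the sample mean:

from math import log

data = [1, 0, 0, 1, 1, 1, 0, 1, 0, 1]    # illustrative observations from a B(1, p) population
m, xbar = len(data), sum(data) / len(data)

def log_likelihood(p):
    # ln L = m x̄ ln(p) + m(1 - x̄) ln(1 - p)
    return m * xbar * log(p) + m * (1 - xbar) * log(1 - p)

grid = [k / 1000 for k in range(1, 1000)]
best_p = max(grid, key=log_likelihood)
print(best_p, xbar)                      # both 0.6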

Example. Suppose we have a normal population whose σ is known, and we


want the maximum likelihood estimator of µ. The likelihood function is
L(x1, . . . , xn; µ) = Π_i (1/(σ√(2π))) e^(−((xi−µ)/σ)²/2)
                    = (1/(σ√(2π)))^n e^(−(1/2) Σ_i ((xi−µ)/σ)²)

Matters simplify if we take the logarithms:


ln L(x1, . . . , xn; µ) = −n ln(σ√(2π)) − (1/2) Σ_i ((xi − µ)/σ)²

We differentiate with respect to µ:


0 = d(ln L)/dµ = Σ_i (xi − µ)/σ²  =⇒  µ = x̄

So the sample mean is the ML Estimator of µ. □

Exercise. Suppose we have a normal population whose µ is known. Show


that the ML Estimator of σ² is ((n − 1)/n) S².
The MLE technique can also be applied when many parameters t1 , . . . , tk
have to be estimated. Then the likelihood function has the form

L(x1 , . . . , xn ; t1 , . . . , tk )

and we again seek the values t̂1 , . . . , t̂k which maximize it.

Exercise. Suppose we have a normal population and neither parameter is


known. Find their ML Estimators.

19 Sampling Distributions

In earlier sections, we have explored the characteristics of the sample mean


and variance. We could give their expectation and variance, and for large
samples the Central Limit Theorem gives a close approximation to the actual
distribution of X̄. Other than this, we do not have an explicit description of
the sampling distributions of X̄ and S 2 . However, if the population is known
to be normal, we can do better.

Theorem. Let X1 , . . . , Xn be a random sample from a normal population.


Then

1. X̄ and S² are independent.

2. X̄ ∼ N(µ, σ/√n).

3. (n − 1)S²/σ² ∼ χ²(n − 1).

Proof. We skip the proof of the first item – it is hard and no elementary
Statistics book contains it! The next item is trivial, since linear combinations
of independent normal variables are again normal. For the last claim, we start
with the identity
(n − 1)S² = Σ_i Xi² − n X̄² = Σ_i (Xi − µ)² − n(X̄ − µ)².

Hence,

Σ_i ((Xi − µ)/σ)² = (n − 1)S²/σ² + ((X̄ − µ)/(σ/√n))².

Let us recall the following fact: If Z, Y are independent, Z ∼ χ²(k) and
Y + Z ∼ χ²(k + m), then Y ∼ χ²(m). We can apply this to the last equation
with k = 1 and k + m = n, to get (n − 1)S²/σ² ∼ χ²(n − 1). □
σ

Exercise. Let S 2 be the sample variance of a sample of size n from an


N(µ, σ) population. Show that

var[S²] = 2σ⁴/(n − 1).

Figure 10: This diagram illustrates the independence of X̄ and S² for a normal
population. It shows (x̄, s²) pairs for a thousand samples (each of size 100) from
a standard normal population.

Figure 11: The graphs compare t distributions with the standard normal distri-
bution. We have A: t(1), B: t(10), C: N(0, 1).

t Distribution

• Consider independent random variables Y, Z, where Y ∼ χ²(n) and Z ∼
N(0, 1). Then the random variable

T = Z / √(Y/n)

is said to have a t distribution with n degrees of freedom. We write T ∼
t(n). This definition is motivated by the following example:

Example. Suppose we have a random sample X1 , . . . , Xn from a normal


population, and we seek to estimate µ. Then we have:

Z = (X̄ − µ)/(σ/√n) ∼ N(0, 1).

We can’t use Z directly because σ is also unknown. One option is to replace


σ by its estimator S, but then the distribution is no longer normal. However,
we recall that
Y = (n − 1)S²/σ² ∼ χ²(n − 1).

Further, since X̄, S² are independent, so are Y and Z. Therefore

Z / √(Y/(n − 1)) ∼ t(n − 1)

To round off the example, note that

Z / √(Y/(n − 1)) = (X̄ − µ)/(S/√n). □

Figure 11 shows that the t distribution is only needed for small sample sizes.
Beyond n = 30, the t distributions are essentially indistinguishable from the
standard normal one, and so we can directly use the observed value s (of S)
for σ and work with the standard normal distribution.


Figure 12: The first graph shows various F distributions (A: F(3, 10), B: F(10, 10),
C: F(50, 50)). The second graph compares the F(200, 200) distribution (A) and the
normal distribution with the same mean and standard deviation (B).

F Distribution

• Suppose X1, X2 are independent random variables, and X1 ∼ χ²(n1), X2 ∼
χ²(n2). Then the random variable

F = (X1/n1) / (X2/n2)
is said to have an F distribution with n1 , n2 degrees of freedom. We write
F ∼ F(n1 , n2 ).

Example. Suppose we wish to compare the variances of two normal popu-


lations. We could do this by estimating their ratio. Therefore, for i = 1, 2,
let Si² be the sample variance of a random sample of size ni from the ith
population. Then

(ni − 1)Si²/σi² ∼ χ²(ni − 1).

Therefore, on taking the ratio and simplifying,

(S1²/σ1²) / (S2²/σ2²) ∼ F(n1 − 1, n2 − 1)

Given observations of Si, we can use this relation to estimate σ1/σ2. □

From Figure 12 we see that F distributions also become more normal with
larger sample sizes. However, they do so much more slowly than χ² or t
distributions.

You may have noted that we have not provided the density functions for the
χ2 , t and F distributions. Their density functions are not very complicated,
but they are rarely used directly. What we really need are their direct in-
tegrals (to get probabilities) and these do not have closed form formulae!
Instead, we have to look up the values from tables (provided at the end of
every Statistics book) or use computational software such as Mathematica.

Remark. From the definition of the F distribution, it is clear that F ∼


F(m1 , m2 ) implies 1/F ∼ F(m2 , m1 ). So tables for the F distribution are
printed only for the case m1 ≤ m2 .

20 Confidence Intervals

An interval estimate for a population parameter t is an interval [t̂1 , t̂2 ]


where t is predicted to lie. The numbers t̂1 , t̂2 are obtained from a random
sample: so they are values of statistics T1 and T2 . Let
P(T1 < t < T2) = 1 − α.
Then 1 − α is called the degree of confidence and t̂1 , t̂2 are the lower
and upper confidence limits. The interval (t̂1 , t̂2 ) is called a (1 − α)100%
confidence interval.

To obtain a confidence interval for a parameter t we need to create a statistic


T in which t is the only unknown. If T is of one of the standard types
(normal, chi square, etc.) we can use tables or statistical software to find
t1 , t2 such that
P(t1 < T < t2 ) = 1 − α.
By rearranging t1 < T < t2 in the form T1 < t < T2 we obtain a (1 − α)100%
confidence interval for t.

Confidence Interval For Mean

Example. Suppose we need to estimate the mean µ of a normal population


whose standard deviation σ is known. If we have a random sample of size n,
then

X̄ ∼ N(µ, σ/√n).

Figure 13: The point z such that P(Z > z) = α/2, also satisfies P(|Z| < z) = 1−α.

If we desire a degree of confidence 1 − α, we need a c such that



P(|X̄ − µ| < c) = 1 − α.

Then (x̄−c, x̄+c) will be a confidence interval for µ with degree of confidence
1 − α.

Define Z = (X̄ − µ)/(σ/√n). This is a standard normal variable. Also,

P(|X̄ − µ| < c) = P(|Z| < c√n/σ).

The problem can be finished off by locating z = c√n/σ such that (See Figure 13):

P(Z > z) = α/2.

For example, suppose we have n = 150, σ = 6.2 and α = 0.01. Then we have
z = 2.575, since

∫_{2.575}^{∞} (1/√(2π)) e^(−x²/2) dx = 0.005.

From z = c√n/σ we obtain

c = 2.575 × 6.2/√150 = 1.30.

Figure 14: A one-sided confidence interval.

Hence (x̄ − 1.30, x̄ + 1.30) is a 99% confidence interval for µ. □
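These numbers are easy to reproduce in Python using statistics.NormalDist for the standard normal quantile; the observed mean x̄ = 62.5 below is an invented value, used only to display a final interval:

from math import sqrt
from statistics import NormalDist

n, sigma, alpha = 150, 6.2, 0.01
z = NormalDist().inv_cdf(1 - alpha / 2)     # P(Z > z) = alpha/2; about 2.576
c = z * sigma / sqrt(n)                     # half-width of the interval

xbar = 62.5                                 # hypothetical observed sample mean
print(round(z, 3), round(c, 2))             # 2.576 and 1.3
print((round(xbar - c, 2), round(xbar + c, 2)))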

Let us note some features of this example:

1. We found a confidence interval which was symmetric about the point


estimate x̄. We can also find asymmetric ones. In particular, we have
the one-sided confidence interval (−∞, x̄ + z σ/√n), where z is chosen
to satisfy P(Z > z) = α. (See Figure 14.)

2. We assumed a normal population. This can be bypassed with the help


of the Central Limit Theorem and a large sample size n, for then X̄
will be normally distributed.

3. We assumed σ was somehow known. If not, we can work with the


t distribution T = (X̄ − µ)/(S/√n). Further, if n ≥ 30, we can just use the
observed value s of S for σ.

The next example illustrates the last item.

Example. Suppose we have a small random sample of size n from a normal


population, N(µ, σ). Then

T = (X̄ − µ)/(S/√n) ∼ t(n − 1).

To obtain a confidence interval for µ, with degree of confidence 1 − α, we


seek a value t such that

P( |T | ≤ t ) = 1 − α.

For then, the interval



{µ : −t ≤ (x̄ − µ)/(s/√n) ≤ t} = (x̄ − (s/√n) t, x̄ + (s/√n) t)

is the required confidence interval for µ.

For instance, suppose the random sample consists of the following numbers:
2.3, 1.9, 2.1, 2.8, 2.3, 3.6, 1.4, 1.8, 2.1, 3.2, 2.0 and 1.9. Then we have

x̄ = 2.28 and s2 = 0.407.

Also, T ∼ t(11). If we want a confidence level of 95%, then we can take


t = 2.201 since P(T ≥ 2.201) = 0.025 = α/2 (and the density of T is symmetric about
the y-axis). So the 95% confidence interval is (1.88, 2.68). □

Confidence Interval For Variance

Consider the task of estimating the variance σ 2 of a normal population. If


we have a random sample of size n, then

(n − 1)S²/σ² ∼ χ²(n − 1).

We take x1, x2 such that

P((n − 1)S²/σ² ≥ x1) = 1 − α/2   and   P((n − 1)S²/σ² ≥ x2) = α/2,

so that P(x1 < (n − 1)S²/σ² < x2) = 1 − α. Therefore,

P(((n − 1)/x2) S² < σ² < ((n − 1)/x1) S²) = 1 − α.

Thus, (((n − 1)/x2) s², ((n − 1)/x1) s²) is a (1 − α)100% confidence interval for σ².

Figure 15: Choosing a confidence interval for variance, using a chi square distri-
bution.

Confidence Interval For Difference of Means

We now consider the problem of comparing two different populations: We


do this by estimating either the difference or the ratio of their population
parameters such as mean and variance.

Example. Consider independent normal populations with parameters (µi , σi ),


i = 1, 2, respectively. We will estimate µ1 − µ2 , assuming the variances are
known. Now,

X̄i ∼ N(µi, σi/√ni).

Assuming the random samples to be independent, we have

X̄1 − X̄2 ∼ N(µ1 − µ2, σ), where σ = √(σ1²/n1 + σ2²/n2).

Therefore,

Z = ((X̄1 − X̄2) − (µ1 − µ2))/σ ∼ N(0, 1).
We are now in a familiar situation. We choose z so that

P(Z > z) = α/2.

Then,
P(|Z| < z) = 1 − α.
Hence,
P(|(X̄1 − X̄2) − (µ1 − µ2)| < σz) = 1 − α.

And so, (x̄1 − x̄2 − σz, x̄1 − x̄2 + σz) is a (1 − α)100% confidence interval for
the difference of means. □

If the variances are not known, we can take large samples (ni ≥ 30) and
use s2i for σi2 . If even that is not possible, we need to at least assume that
σ1 = σ2 = σ. Then
Z = ((X̄1 − X̄2) − (µ1 − µ2)) / (σ √(1/n1 + 1/n2)) ∼ N(0, 1).

For σ², we have the pooled estimator

Sp² = ((n1 − 1)S1² + (n2 − 1)S2²) / (n1 + n2 − 2),

a weighted average of the two sample variances. We have

(ni − 1)Si²/σ² ∼ χ²(ni − 1),   i = 1, 2.

Hence

Y = Σ_{i=1}^2 (ni − 1)Si²/σ² ∼ χ²(n1 + n2 − 2).

Therefore

T = Z / √(Y/(n1 + n2 − 2)) ∼ t(n1 + n2 − 2).

Now T can also be expressed as:

T = ((X̄1 − X̄2) − (µ1 − µ2)) / (Sp √(1/n1 + 1/n2))

So we see that the t distribution can be used to find confidence intervals for
µ1 − µ2 . For example, suppose we find t such that

P(T ≥ t) = α/2.

Then
P(|T | ≤ t) = 1 − α.
This can be rearranged into
P(|(X̄1 − X̄2) − (µ1 − µ2)| ≤ t Sp √(1/n1 + 1/n2)) = 1 − α.

Therefore,
(x̄1 − x̄2 − t sp √(1/n1 + 1/n2),  x̄1 − x̄2 + t sp √(1/n1 + 1/n2))
is a (1 − α)100% confidence interval for µ1 − µ2 .

Confidence Interval For Ratio Of Variances

For comparing the variances of two normal populations, it is convenient to


look at their ratio through the F distribution.

Example. Suppose we have independent random samples from N(µi , σi )


populations, with i = 1, 2. Let ni be the sample size and Si² the sample
variance for the sample from the ith population. Then

(ni − 1)Si²/σi² ∼ χ²(ni − 1),

and hence

F = (S1²/σ1²) / (S2²/σ2²) ∼ F(n1 − 1, n2 − 1).

Therefore, we look for f1, f2 such that

P(F > f1) = 1 − α/2   and   P(F > f2) = α/2,

so that

P(f1 < F < f2) = 1 − α.

This yields the (1 − α)100% confidence interval

((s1²/s2²)(1/f2), (s1²/s2²)(1/f1))

for σ1²/σ2². □

References
1. A.M. Mood, F.A. Graybill and D.C. Boes, Introduction to the Theory
of Statistics, Third Edition, Tata McGraw-Hill, New Delhi.
2. J.E. Freund, Mathematical Statistics, Fifth Edition, Prentice-Hall India.
