ICICI Centre for Mathematical Sciences
Mathematical Finance 2004-5
A Probability Primer
Dr Amber Habib
We will sum up the most essential aspects of Probability, and the transition
to Statistical Inference. We avoid technical distractions. The gaps we leave
can be ﬁlled in by consulting one of the references provided at the end.
The ﬁgures have been produced using Mathematica.
Contents
1 Sample Space, Random Variables
2 Probability
3 Probability Distributions
4 Binomial Distribution
5 Normal Distribution
6 Expectation
7 Variance
8 Lognormal Distribution
9 Bivariate Distributions
10 Conditional Probability and Distributions
11 Independence
12 Chi Square Distribution
13 Random Samples
14 Sample Mean and Variance
15 Large Sample Approximations
16 Point Estimation
17 Method of Moments
18 Maximum Likelihood Estimators
19 Sampling Distributions
20 Confidence Intervals
1 Sample Space, Random Variables
The sample space is the collection of objects whose statistical properties
are to be studied. Each such object is called an outcome, and a collection
of outcomes is called an event.
• Mathematically, the sample space is a set, an outcome is a member of this
set, and an event is a subset of it.
For example, our sample space may be the collection of stocks listed on
the National Stock Exchange. We may ask statistical questions about this
space, such as: What is the average change in the value of a listed stock
(an outcome) over the past year? Alternately, our sample space might be
the history of a single stock: All its daily closing prices, say, over the last
5 years. Our question might be: What is the average weekly change of the
value over the last year? Or, how large is the variation in the prices over its
5 year history? We could ask more subtle questions: How likely is it that the
stock will have a sudden fall in value over the next month?
• Typical notation is to use S for sample space, lower case letters such as x
for individual outcomes, and capital letters such as E for events.
In a particular context, we would be interested in some speciﬁc numerical
property of the outcomes: such as the closing price on October 25, 2004,
of all the stocks listed on the NSE. This property allots a number to each
outcome, so it can be viewed as a function whose domain is the sample space
S and range is in the real numbers R.
• A function X : S →R is called a random variable.
Our interest is in taking a particular random variable X and studying how
its values are distributed. What is the average? How much variation is there
in its values? Are very large values unlikely enough to be ignored?
2 Probability
We present two viewpoints on the meaning of probability. Both are relevant
to Finance and, interestingly, they lead to the same mathematical deﬁnition
of probability!
Viewpoint 1: The probability of an event should predict the relative frequency
of its occurrence. That is, suppose we say the probability of a random
stock having increased in value over the last month is 0.6. Then, if we look
at 100 diﬀerent stocks, about 60 of them should have increased in value. The
prediction should be more accurate if we look at larger numbers of stocks.
Viewpoint 2: The probability of an event reﬂects our (subjective) opinion
about how likely it is to occur, in comparison to other events. Thus, if we
allot probability 0.4 to event A and 0.2 to event B, we are expressing the
opinion that A is twice as likely as B.
Viewpoint 1 is appropriate when we are analyzing the historical data to
predict the future. Viewpoint 2 is useful in analyzing how an individual may
act when faced with certain information. Both viewpoints are captured by
the following mathematical formulation:
• Let A be the collection of all subsets of the sample space S: we call it the
event algebra.
• A probability function is a function P : A → [0, 1] such that:
1. $P(S) = 1$.
2. $P\left(\bigcup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} P(E_i)$, if the $E_i$ are pairwise disjoint events (i.e., $i \neq j$ implies $E_i \cap E_j = \emptyset$).
Let P : A → [0, 1] be a probability function. Then it automatically has the
following properties:
1. $P(\emptyset) = 0$.
2. $P\left(\bigcup_{i=1}^{n} E_i\right) = \sum_{i=1}^{n} P(E_i)$, if the $E_i$ are pairwise disjoint events.
3. $P(A^c) = 1 - P(A)$.
4. $P\left(\bigcup_i E_i\right) \le \sum_i P(E_i)$, for any collection of events $E_i$.
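These properties can be checked by brute force on a small finite sample space. The sketch below assumes a fair six-sided die with equally likely outcomes (an illustrative choice, not part of the notes) and tests the complement rule and the two-event cases of additivity and subadditivity over every pair of events.

```python
from fractions import Fraction
from itertools import chain, combinations

# Hypothetical sample space: the six faces of a fair die.
S = frozenset(range(1, 7))

def P(event):
    """Probability of an event (a subset of S) when all outcomes are equally likely."""
    return Fraction(len(event), len(S))

def all_events(space):
    """The event algebra: every subset of the sample space."""
    items = sorted(space)
    subsets = chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))
    return [frozenset(s) for s in subsets]

events = all_events(S)

# Property 3: P(A^c) = 1 - P(A) for every event A.
complement_ok = all(P(S - A) == 1 - P(A) for A in events)

# Property 4 for two events: P(A u B) <= P(A) + P(B).
subadditive_ok = all(P(A | B) <= P(A) + P(B) for A in events for B in events)

# Property 2 for two events: equality holds when A and B are disjoint.
additive_ok = all(
    P(A | B) == P(A) + P(B) for A in events for B in events if not (A & B)
)

print(complement_ok, subadditive_ok, additive_ok)
```

Exact fractions are used so that the equalities are tested without rounding error.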
3 Probability Distributions
We return to our main question: How likely are diﬀerent values (or ranges
of values) of a random variable X : S →R?
If we just plug in the deﬁnition of a random variable, we realize that our
question can be phrased as follows: What is the probability of the event
whose outcomes correspond to a given value (or range of values) of X?
Thus, suppose I ask, what is the probability that X takes on a value greater
than 100? This is to be interpreted as: what is the probability of the event
whose outcomes t all satisfy X(t) > 100? That is,
P(X > 100) = P({t : X(t) > 100}).
It is convenient to consider two types of random variables: those whose values
vary discretely (discrete random variables) and those whose values vary
continuously (continuous random variables). (These are not the only types,
but they include all the ones we need.)
Examples:
1. Let us allot +1 to the stocks whose value rose on a given day, and −1 to
those whose value fell. Then we have created a random variable whose
possible values are ±1. This is a discrete random variable.
2. Let us allot the whole number +n to the stocks whose value rose by
between n and n + 1 on a given day, and −n to those whose value
fell by between n and n + 1. Then we have created a random variable
whose possible values are all the integers. This is also a discrete random
variable.
3. If, in the previous example, we let X be the actual change in value, it
is still discrete (since all changes are in multiples of Rs 0.01). However,
now the values are so close that it is simpler to ignore the discreteness
and model X as a continuous random variable.
Discrete Random Variables
Let S be the sample space, A the event algebra, and P : A → [0, 1] a
probability function. Let X : S → R be a discrete random variable, with
range $x_1, x_2, \ldots$ (the range can be finite or infinite). The probability of any
particular value x is
$$P(X = x) = P(\{t \in S : X(t) = x\}).$$
• We call these values the probability distribution or probability density
of X and denote them by $f_X : \mathbb{R} \to [0, 1]$,
$$f_X(x) = P(X = x).$$
One can ﬁnd the probability of a range of values of X by just summing up
the probabilities of all the individual values in that range. For instance,
$$P(a < X < b) = \sum_{x_i \in (a,b)} f_X(x_i).$$
In particular, summing over the entire range gives
$$\sum_i f_X(x_i) = 1,$$
since the total probability must be 1.
Discrete Uniform Distribution: Consider an X whose range is {0, 1, 2, . . . , n}
and each value is equally likely. Then
$$f_X(x) = \begin{cases} 0 & x \notin \{0, 1, 2, \ldots, n\} \\ 1/(n+1) & \text{else.} \end{cases}$$
Continuous Random Variables
Suppose the values of a random variable X vary continuously over some
range, such as [0, 1]. Then, it is not particularly useful to ask for the likelihood
of X taking on any individual value such as 1/2: since there are infinitely
many choices, each individual choice is essentially impossible and has
probability zero. From a real life viewpoint also, since exact measurements
of a continuously varying quantity are impossible, it is only reasonable to ask
for the probability of an observation lying in a range, such as (0.49, 0.51),
rather than its having the exact value 0.5.
The notion of a probability distribution of a continuous random variable is
developed with this in mind. Recall that in the discrete context, probability
of a range was obtained by summing over that range. So, in the continuous
case, we seek to obtain probability of a range by integrating over it.
• Given a continuous random variable X, we define its probability density
to be a function $f_X : \mathbb{R} \to [0, \infty)$ such that for any a, b with a ≤ b,
$$P(a \le X \le b) = \int_a^b f_X(x)\,dx.$$
In particular,
$$\int_{-\infty}^{\infty} f_X(x)\,dx = 1.$$
Important. The number $f_X(x)$ does not represent the probability that
X = x. Individual values of $f_X$ have no significance, only the integrals of $f_X$
do! (Contrast this with the discrete case.)
Continuous Uniform Distribution: (Usually, this is just called the uniform
distribution.) Suppose we want a random variable
X that represents a quantity which varies over [0, 1] without any bias: this is
taken to mean that P(X ∈ [a, b]) should not depend on the location of [a, b]
but only on its length.
[Figure: three intervals $I_1$, $I_2$, $I_3$ inside [0, 1], with $P(X \in I_1) = P(X \in I_2) = \frac{1}{2} P(X \in I_3)$.]
This is achieved by taking the following probability density:
$$f_X(x) = \begin{cases} 1 & 0 \le x \le 1 \\ 0 & \text{else.} \end{cases}$$
For then, with 0 ≤ a ≤ b ≤ 1,
$$P(a \le X \le b) = \int_a^b 1\,dx = b - a.$$
[Figure: graph of $f_X$ over [0, 1].]
Cumulative Probability Distribution
• The cumulative probability distribution of a random variable X, denoted
$F_X$, is defined by:
$$F_X(x) = P(X \le x).$$
It is easy to see that
$$F_X(x) = \begin{cases} \sum_{x_i \le x} f_X(x_i) & X \text{ is discrete with range } x_1, x_2, \ldots \\ \int_{-\infty}^{x} f_X(t)\,dt & X \text{ is continuous.} \end{cases}$$
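As a small illustration, the two cases of the cumulative distribution can be written out for the uniform distributions defined earlier. This is only a sketch: the function names are hypothetical, and the continuous case uses the closed form of the integral of the uniform density rather than numerical integration.

```python
from fractions import Fraction

def discrete_uniform_pmf(x, n):
    """f_X(x) for the discrete uniform distribution on {0, 1, ..., n}."""
    return Fraction(1, n + 1) if x in range(n + 1) else Fraction(0)

def discrete_uniform_cdf(x, n):
    """F_X(x): sum of f_X(x_i) over all values x_i <= x."""
    return sum(discrete_uniform_pmf(k, n) for k in range(n + 1) if k <= x)

def continuous_uniform_cdf(x):
    """F_X(x) for the uniform density on [0, 1]: the integral of f_X up to x."""
    return min(max(x, 0.0), 1.0)

# P(a <= X <= b) for a continuous X is F_X(b) - F_X(a).
prob = continuous_uniform_cdf(0.51) - continuous_uniform_cdf(0.49)
print(discrete_uniform_cdf(2.5, 4), prob)
```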
Now we look at two distributions, one discrete and one continuous, which
are especially important.
4 Binomial Distribution
Consider a random variable X which can take on only two values, say 0 and 1
(the choice of values is not important). Suppose the probability of the value
0 is q and of 1 is p. Then we have:
1. 0 ≤ p, q ≤ 1,
2. p + q = 1.
Question: Suppose we observe X n times. What are the likely distributions
of 0's and 1's? Specifically, we ask: What is the probability of observing the
value 1 exactly k times?
We calculate as follows, assuming the n observations are independent of each
other. Let us consider all possible combinations of n 0's and 1's:
Number of combinations with k 1's = Ways of picking k slots out of n = $\binom{n}{k}$.
Probability of each individual combination with k 1's = $p^k (1-p)^{n-k}$.
Therefore, $P(\text{1 occurs } k \text{ times}) = \binom{n}{k} p^k (1-p)^{n-k}$.
• A random variable Y has a binomial distribution with parameters n
and p if it has range 0, 1, . . . , n and its probability distribution is:
$$f_Y(k) = \binom{n}{k} p^k (1-p)^{n-k}, \qquad k = 0, 1, \ldots, n.$$
We call Y a binomial random variable and write Y ∼ B(n, p).
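The formula translates directly into code. The following sketch uses only the Python standard library; the name `binomial_pmf` is chosen here for illustration. It also checks numerically that the probabilities add up to 1.

```python
from math import comb

def binomial_pmf(k, n, p):
    """f_Y(k) = C(n, k) p^k (1 - p)^(n - k) for k = 0, 1, ..., n."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n = 10
for p in (0.2, 0.5):
    total = sum(binomial_pmf(k, n, p) for k in range(n + 1))
    print(f"p = {p}: total probability = {total:.12f}")
```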
As illustrated above, binomial distributions arise naturally wherever we are
faced with a sequence of choices. In Finance, the binomial distribution is
part of the Binomial Tree Model for pricing options.
Figure 1 illustrates the binomial distributions with p = 0.2 and p = 0.5,
using diﬀerent types of squares for each. Both have n = 10.
Exercise. In Figure 1, identify which points correspond to p = 0.2 and
which to p = 0.5.
Figure 1: Two binomial distributions.
5 Normal Distribution
This kind of probability distribution is at once the most common in nature,
among the easiest to work with mathematically, and theoretically at the heart
of Probability and Statistical Inference. Among its remarkable properties is
that any phenomenon occurring on a large scale tends to be governed by
it. When in doubt about the nature of a distribution, assume it is (nearly)
normal, and you will usually get good results!
We first define the standard normal distribution. This is a probability
density of the form
$$f_X(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}.$$
It has the following 'bell-shaped' graph:
[Figure: graph of the standard normal density for x between −3 and 3.]
Exercise. Can you explain the factor $\frac{1}{\sqrt{2\pi}}$? (Hint: Think about the
requirement that total probability should be 1.)
Note that the graph is symmetric about the y-axis. The axis of symmetry can
be moved to another position m, by replacing x by x − m. In the following
diagram, the dashed line represents the standard normal distribution:
[Figure: the standard normal density (dashed) and the same curve shifted so that its axis of symmetry is at x = m.]
$$f_X(x) = \frac{1}{\sqrt{2\pi}}\, e^{-(x-m)^2/2}.$$
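Also, starting from the standard normal distribution, we can create one with
a similar shape but bunched more tightly around the y-axis. We achieve this
by replacing x with x/s for a suitable constant s > 0 (and dividing by s, so
that the total probability remains 1):
[Figure: the standard normal density and a rescaled version, for x between −3 and 3.]
$$f_X(x) = \frac{1}{\sqrt{2\pi}\,s}\, e^{-\frac{1}{2}(x/s)^2}.$$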
By combining both kinds of changes, we reach the deﬁnition of a general
normal distribution:
• A random variable X has a normal distribution with parameters µ, σ,
if its density function has the form:
$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}.$$
We call X a normal random variable and write X ∼ N(µ, σ).
The axis of symmetry of this distribution is determined by µ and its clustering
about the axis of symmetry is controlled by σ.
Exercise. Will increasing σ make the graph more tightly bunched around
the axis of symmetry? What will it do to the peak height of the graph?
In the empirical sciences, errors in observation tend to be normally distributed:
they are clustered around zero, small errors are common, and very
large errors are very rare. Regarding this, observe from the graph that by
±3 the density of the standard normal distribution has essentially become
zero: in fact the probability of a standard normal variable taking on a value
outside [−3, 3] is just 0.0027. In theoretical work, the normal distribution is
the main tool in determining whether the gap between theoretical predictions
and observed reality can be attributed solely to errors in observation.
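The figure 0.0027 can be reproduced numerically. The sketch below relies on the standard identity expressing the normal cumulative distribution through the error function, $\Phi(z) = \frac{1}{2}\left(1 + \operatorname{erf}(z/\sqrt{2})\right)$, which is not derived in these notes; the function names are illustrative.

```python
from math import erf, exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma), with sigma the standard deviation."""
    z = (x - mu) / sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2 * pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ N(mu, sigma), via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# Probability that a standard normal variable falls outside [-3, 3].
tail = 1 - (normal_cdf(3) - normal_cdf(-3))
print(f"P(|Z| > 3) = {tail:.4f}")
print(f"density at 3 = {normal_pdf(3):.5f}, density at 0 = {normal_pdf(0):.5f}")
```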
6 Expectation
If we have some data consisting of numbers $x_i$, each occurring $f_i$ times, then
the average of this data is defined to be:
$$\bar{x} = \frac{\text{Sum of all the data}}{\text{Number of data points}} = \frac{\sum_i f_i x_i}{\sum_i f_i} = \sum_i \left(\frac{f_i}{\sum_j f_j}\right) x_i.$$
Now, if we have a discrete random variable X, then $f_X(x_i)$ predicts the
relative frequency with which $x_i$ will occur in a large number of observations
of X, i.e., we view $f_X(x_i)$ as a prediction of $f_i / \sum_j f_j$. And then, $\sum_i f_X(x_i)\,x_i$
becomes a predictor for the average $\bar{x}$ of the observations of X.
• The expectation of a discrete random variable is defined to be
$$E[X] = \sum_i f_X(x_i)\,x_i.$$
On replacing the sum by an integral we arrive at the notion of expectation
of a continuous random variable:
• The expectation of a continuous random variable is defined to be
$$E[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx.$$
Expectation is also called mean and denoted by $\mu_X$ or just $\mu$.
Exercise. Make the following calculations:
1. X has the discrete uniform distribution with range 0, . . . , n. Then
$E[X] = \frac{n}{2}$.
2. X has the uniform distribution on [0, 1]. Then E[X] = 1/2.
3. X ∼ B(n, p) implies E[X] = np.
4. X ∼ N(µ, σ) implies E[X] = µ.
Some elementary properties of expectation:
1. E[c] = c, for any constant c. (A constant c can be viewed as a random
variable whose range consists of the single value c.)
2. E[cX] = c E[X], for any constant c.
Suppose X : S → R is a random variable and g : R → R is any function.
Then their composition g ◦ X : S → R, defined by (g ◦ X)(w) = g(X(w)), is
a new random variable which we will call g(X).
Example. Let $g(x) = x^r$. Then g ◦ X is denoted $X^r$.
Suppose X is discrete with range $\{x_i\}$. Then, the range of g(X) is $\{g(x_i)\}$.
Therefore we can calculate the expectation of g(X) as follows:
$$E[g(X)] = \sum_i g(x_i)\,P\big(g(X) = g(x_i)\big) = \sum_i g(x_i)\,f_X(x_i).$$
(Our calculation is valid if g is one-one. With slightly more effort we can make
it valid for any g.)
If X is continuous, one has the analogous formula:
$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx.$$
Example. Let $g(x) = x^2$. Then
$$E[X^2] = \begin{cases} \sum_i x_i^2\, f_X(x_i) & X \text{ is discrete,} \\ \int_{-\infty}^{\infty} x^2 f_X(x)\,dx & X \text{ is continuous.} \end{cases}$$
With these facts in hand, the following result is easy to prove.
• Let X be any random variable, and g, h two real functions. Then
E[g(X) + h(X)] = E[g(X)] + E[h(X)].
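For a discrete random variable these formulas are finite sums, so they are easy to check by machine. The sketch below represents a distribution as a dictionary from values to probabilities (a representation assumed here for illustration) and verifies the additivity result for one choice of g and h on a discrete uniform distribution.

```python
from fractions import Fraction

def expectation(pmf, g=lambda x: x):
    """E[g(X)] = sum_i g(x_i) f_X(x_i), for a discrete X given as {value: probability}."""
    return sum(g(x) * p for x, p in pmf.items())

n = 6
uniform = {x: Fraction(1, n + 1) for x in range(n + 1)}  # discrete uniform on {0, ..., n}

mean = expectation(uniform)                             # E[X]
second_moment = expectation(uniform, lambda x: x * x)   # E[X^2]

# Additivity: E[g(X) + h(X)] = E[g(X)] + E[h(X)] with g(x) = x^2, h(x) = 3x.
lhs = expectation(uniform, lambda x: x * x + 3 * x)
rhs = second_moment + 3 * mean
print(mean, second_moment, lhs == rhs)
```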
7 Variance
Given some data $\{x_i\}_{i=1}^{n}$, its average $\bar{x}$ is seen as a central value about which
the data is clustered. The significance of the average is greater if the clustering
is tight, less otherwise. To measure the tightness of the clustering, we
use the variance of the data:
$$s^2 = \frac{1}{n} \sum_i (x_i - \bar{x})^2.$$
Variance is just the average of the squared distance from each data point to
the average (of the data).
Therefore, in the analogous situation where we have a random variable X, if
we wish to know how close to its expectation its values are likely to be, we
again define a quantity called the variance of X:
$$\mathrm{var}[X] = E[(X - E[X])^2].$$
Alternate notation for variance is $\sigma_X^2$ or just $\sigma^2$. The quantity $\sigma_X$ or $\sigma$, the
square root of the variance, is called the standard deviation of X. Its
advantage is that it is in the same units as X.
Exercise. Will a larger value of variance indicate tighter clustering around
the mean?
Sometimes, it is convenient to use the following alternative formula for variance:
• $\mathrm{var}[X] = E[X^2] - E[X]^2$.
This is obtained as follows.
$$\begin{aligned}
\mathrm{var}[X] &= E[(X - E[X])^2] \\
&= E[X^2 - 2E[X]X + E[X]^2] \\
&= E[X^2] - 2E[\,E[X]X\,] + E[\,E[X]^2\,] \\
&= E[X^2] - 2E[X]^2 + E[X]^2 \\
&= E[X^2] - E[X]^2.
\end{aligned}$$
Elementary properties of variance:
1. var[X + a] = var[X], if a is any constant.
2. $\mathrm{var}[aX] = a^2\,\mathrm{var}[X]$, if a is any constant.
Exercise. Will it be correct to say that $\sigma_{aX} = a\,\sigma_X$ for any constant a?
Exercise. Let X be a random variable with expectation µ and standard
deviation σ. Then $Z = \frac{X - \mu}{\sigma}$ has expectation 0 and standard deviation 1.
Exercise. Suppose X has the discrete uniform distribution with range {0, . . . , n}.
We have seen that E[X] = n/2. Show that its variance is $\frac{1}{12}\,n(n + 2)$.
Exercise. Suppose X has the continuous uniform distribution with range
[0, 1]. We have seen that E[X] = 1/2. Show that its variance is 1/12.
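The variance of the discrete uniform distribution can be checked exactly for small n. The sketch below (helper names are illustrative) computes the variance both from the definition and from the alternative formula, compares the result with n(n + 2)/12, and also checks the scaling property on a transformed variable aX + b.

```python
from fractions import Fraction

def mean(pmf):
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """var[X] = E[(X - E[X])^2], straight from the definition."""
    m = mean(pmf)
    return sum((x - m) ** 2 * p for x, p in pmf.items())

def variance_shortcut(pmf):
    """var[X] = E[X^2] - E[X]^2."""
    return sum(x * x * p for x, p in pmf.items()) - mean(pmf) ** 2

def discrete_uniform(n):
    return {x: Fraction(1, n + 1) for x in range(n + 1)}

def affine(pmf, a, b):
    """Distribution of aX + b (assumes a != 0, so distinct values stay distinct)."""
    return {a * x + b: p for x, p in pmf.items()}

for n in range(1, 6):
    print(n, variance(discrete_uniform(n)), Fraction(n * (n + 2), 12))
```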
Example. Suppose X ∼ B(n, p). We know E[X] = np. Moreover,
$$\begin{aligned}
E[X(X-1)] &= \sum_{k=0}^{n} k(k-1) \binom{n}{k} p^k (1-p)^{n-k} \\
&= \sum_{k=2}^{n} \frac{n!}{(k-2)!\,(n-k)!}\, p^k (1-p)^{n-k} \\
&= \sum_{k=2}^{n} n(n-1) \binom{n-2}{k-2} p^k (1-p)^{n-k} \\
&= n(n-1)p^2 \sum_{i=0}^{n-2} \binom{n-2}{i} p^i (1-p)^{(n-2)-i} \\
&= n(n-1)p^2.
\end{aligned}$$
And so,
$$\begin{aligned}
\mathrm{var}[X] &= E[X^2] - E[X]^2 = E[X(X-1)] + E[X] - E[X]^2 \\
&= n(n-1)p^2 + np(1 - np) \\
&= np(1-p).
\end{aligned}$$
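Example. Suppose X ∼ N(µ, σ). We know E[X] = µ. Therefore,
$$\begin{aligned}
\mathrm{var}[X] &= \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} (x-\mu)^2\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx \\
&= \frac{\sigma^2}{\sqrt{2\pi}} \int_{-\infty}^{\infty} z^2 e^{-z^2/2}\,dz.
\end{aligned}$$
Now we integrate by parts:
$$\begin{aligned}
\int_{-\infty}^{\infty} z^2 e^{-z^2/2}\,dz &= \int_{-\infty}^{\infty} z\left(z e^{-z^2/2}\right) dz \\
&= \left[-z e^{-z^2/2}\right]_{-\infty}^{\infty} + \int_{-\infty}^{\infty} e^{-z^2/2}\,dz \\
&= 0 + \sqrt{2\pi}.
\end{aligned}$$
Therefore
$$\mathrm{var}[X] = \sigma^2.$$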
Figure 2: Normal approximation to a binomial distribution.
Thus, if X ∼ N(µ, σ), then $Z = \frac{X - \mu}{\sigma}$ has expectation 0 and standard
deviation 1. In fact, Z is again normally distributed and hence has the standard
normal distribution (it has parameters 0 and 1). Through this link, all
questions about normal distributions can be converted to questions about
the standard normal distribution. For instance, let X, Z be as above. Then:
$$P(X \le a) = P\left(\frac{X - \mu}{\sigma} \le \frac{a - \mu}{\sigma}\right) = P\left(Z \le \frac{a - \mu}{\sigma}\right).$$
Now we can also illustrate our earlier statement about how the normal distribution
serves as a substitute for other distributions. Figure 2 compares
a binomial distribution (n = 100 and p = 0.2) with the normal distribution
with the same mean and variance (µ = np = 20 and $\sigma^2 = np(1-p) = 16$, so σ = 4).
Generally, the normal distribution is a good approximation to the binomial
distribution if n is large. One criterion that is often used is that, for a
reasonable approximation, we should have both np and n(1−p) greater than
5.
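The comparison in Figure 2 can be reproduced numerically. The sketch below evaluates the binomial probabilities for n = 100, p = 0.2 and the normal density with the matching mean and standard deviation at each integer k, and reports the largest gap between the two; comparing the density at integer points is a simplification chosen here for illustration.

```python
from math import comb, exp, pi, sqrt

n, p = 100, 0.2
mu = n * p                       # mean of B(n, p)
sigma = sqrt(n * p * (1 - p))    # standard deviation of B(n, p)

def binomial_pmf(k):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def normal_pdf(x):
    z = (x - mu) / sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2 * pi))

# Largest pointwise gap between the binomial probabilities and the normal density.
max_gap = max(abs(binomial_pmf(k) - normal_pdf(k)) for k in range(n + 1))
print(f"mu = {mu:.1f}, sigma = {sigma:.1f}, largest gap = {max_gap:.4f}")
```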
Figure 3: Lognormal density functions: (A) µ = 0, σ = 1, (B) µ = 1, σ = 1, (C)
µ = 0, σ = 0.4.
8 Lognormal Distribution
• If Z ∼ N(µ, σ) then $X = e^Z$ is called a lognormal random variable with
parameters µ and σ. (The name comes from "The log of X is normal.")
Exercise. Let Z ∼ N(µ, σ). Then
$$E[e^{tZ}] = e^{\mu t + \frac{1}{2}\sigma^2 t^2}.$$
The t = 1, 2 cases of the Exercise immediately give the following:
• Let X be a lognormal variable with parameters µ and σ. Then
$$E[X] = e^{\mu + \frac{1}{2}\sigma^2}, \qquad \mathrm{var}[X] = e^{2\mu + 2\sigma^2} - e^{2\mu + \sigma^2}.$$
The lognormal distribution is used in Finance to model the variation of stock
prices with time. Without explaining how, we show in Figure 4 an example
of the kind of behaviour predicted by this model.
Figure 4: A simulation of the lognormal model for variation of stock prices with
time.
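The mean and variance formulas for the lognormal distribution can be checked by numerical integration. The sketch below approximates $E[e^{tZ}]$ with a midpoint rule over the range µ ± 12σ; the integration range, step count and function names are choices made here for illustration, not part of the notes.

```python
from math import exp, pi, sqrt

def normal_pdf(z, mu, sigma):
    u = (z - mu) / sigma
    return exp(-0.5 * u * u) / (sigma * sqrt(2 * pi))

def lognormal_moment(t, mu, sigma, steps=20000):
    """Midpoint-rule approximation of E[X^t] = E[e^(tZ)], where Z ~ N(mu, sigma)."""
    lo, hi = mu - 12 * sigma, mu + 12 * sigma
    h = (hi - lo) / steps
    midpoints = (lo + (i + 0.5) * h for i in range(steps))
    return sum(exp(t * z) * normal_pdf(z, mu, sigma) * h for z in midpoints)

mu, sigma = 0.0, 0.4   # curve (C) in Figure 3
mean = lognormal_moment(1, mu, sigma)
var = lognormal_moment(2, mu, sigma) - mean**2

print(mean, exp(mu + sigma**2 / 2))
print(var, exp(2 * mu + 2 * sigma**2) - exp(2 * mu + sigma**2))
```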
9 Bivariate Distributions
So far we have dealt with individual random variables, i.e. with models for
particular features of a population. The next step is to study the relationships
that exist between diﬀerent features of a population. For instance, an investor
might like to know the nature and strength of the connection between her
portfolio and a stock market index such as the NIFTY. If the NIFTY goes
up, is her portfolio likely to do the same? How much of a rise is it reasonable
to expect?
This leads us to the study of pairs of random variables and the probabilities
associated with their joint values. Are high values of one associated with
high or low values of the other? Is there a signiﬁcant connection at all?
• Let S be a sample space and X, Y : S → R two random variables. Then
we say that X, Y are jointly distributed.
• Let X, Y be jointly distributed discrete random variables. The joint distribution
$f_{X,Y}$ of X and Y is defined by
$$f_{X,Y}(x, y) = P(X = x,\ Y = y).$$
Since $f_{X,Y}$ is a function of two variables, we call it a bivariate distribution.
It can be used to find any probability associated with X and Y. Let X have
range $\{x_i\}$ and Y have range $\{y_j\}$. Then:
1. $P(a \le X \le b,\ c \le Y \le d) = \sum_{x_i \in [a,b]} \sum_{y_j \in [c,d]} f_{X,Y}(x_i, y_j)$.
2. $f_X(x) = \sum_j f_{X,Y}(x, y_j)$ and $f_Y(y) = \sum_i f_{X,Y}(x_i, y)$.
3. $\sum_{i,j} f_{X,Y}(x_i, y_j) = 1$.
We will be interested in various combinations of X and Y. Therefore, consider
a function $g : \mathbb{R}^2 \to \mathbb{R}$. We use it to define a new random variable
g(X, Y) : S → R by
$$g(X, Y)(w) = g(X(w), Y(w)).$$
The expectation of this new random variable can be obtained, as usual, by
multiplying its values with their probabilities:
$$E[g(X, Y)] = \sum_{i,j} g(x_i, y_j)\, f_{X,Y}(x_i, y_j).$$
We create analogous deﬁnitions when X, Y are jointly distributed continuous
random variables.
• Let X, Y be jointly distributed continuous random variables. Their joint
probability density $f_{X,Y}$ is a function of two variables whose integrals give
the probability of X, Y lying in any range:
$$P(a \le X \le b,\ c \le Y \le d) = \int_a^b \int_c^d f_{X,Y}(x, y)\,dy\,dx.$$
Then the following are easy to prove:
1. $\int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy = f_X(x)$ and $\int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx = f_Y(y)$.
2. $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx\,dy = 1$.
• Let X, Y be jointly distributed continuous random variables and $g : \mathbb{R}^2 \to \mathbb{R}$.
Then the expectation of g(X, Y) is given by
$$E[g(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y)\, f_{X,Y}(x, y)\,dx\,dy.$$
Some important special cases:
1. Suppose X, Y are jointly distributed discrete random variables. Then
$$\begin{aligned}
E[X + Y] &= \sum_{i,j} (x_i + y_j)\, f_{X,Y}(x_i, y_j) \\
&= \sum_i x_i \Big(\sum_j f_{X,Y}(x_i, y_j)\Big) + \sum_j y_j \Big(\sum_i f_{X,Y}(x_i, y_j)\Big) \\
&= \sum_i x_i f_X(x_i) + \sum_j y_j f_Y(y_j) = E[X] + E[Y].
\end{aligned}$$
There is a similar proof when X, Y are continuous. Thus expectation
always distributes over sums.
2. $E[(X - \mu_X)(Y - \mu_Y)]$ is called the covariance of X and Y and is
denoted by cov[X, Y] or $\sigma_{XY}$. If large values of X tend to go with large
values of Y then covariance is positive. If they go with small values of
Y, covariance is negative. A zero covariance indicates the absence of any
such linear tendency; by itself it does not guarantee that X and Y
are unrelated. (See Figure 5.)
3. We have the following identity:
$$\mathrm{var}[X + Y] = \mathrm{var}[X] + \mathrm{var}[Y] + 2\,\mathrm{cov}[X, Y].$$
(Compare this with the identity connecting the dot product and length of vectors:
$\|u + v\|^2 = \|u\|^2 + \|v\|^2 + 2\,u \cdot v$. This motivates us to think of covariance as a kind of dot
product between different random variables, with variance as squared length. Geometric
analogies then lead to useful statistical insights and even proofs, such as the statement
below that $|\rho| \le 1$.)
Exercise. Show that cov[X, Y] = E[XY] − E[X]E[Y].
• The correlation coefficient of X, Y is defined to be
$$\rho = \rho_{X,Y} = \frac{\mathrm{cov}[X, Y]}{\sigma_X \sigma_Y}.$$
Figure 5: Observed values of two jointly distributed standard normal variables
with covariances 0.95, −0.6 and 0 respectively.
The advantage of the correlation coefficient is that it is not affected by the
units used in the measurement. If we replace X by $X' = aX + b$ and Y by
$Y' = cY + d$, with a, c > 0, we will have
$$\rho_{X',Y'} = \rho_{X,Y}.$$
An interesting fact is that $|\mathrm{cov}[X, Y]| \le \sigma_X \sigma_Y$. (This is the analogue of the
geometric fact that $|u \cdot v| \le \|u\|\,\|v\|$.) This immediately implies:
• $-1 \le \rho_{X,Y} \le 1$.
Exercise. Suppose a die is tossed once. Let X take on the value 1 if the
result is ≤ 4 and the value 0 otherwise. Similarly, let Y take on the value 1
if the result is even and the value 0 otherwise.
1. Show that the values of $f_{X,Y}$ are given by:
X\Y    0      1
0      1/6    1/6
1      1/3    1/3
2. Show that $\mu_X = 2/3$ and $\mu_Y = 1/2$.
3. Show that cov[X, Y] = 0.
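The values stated in this exercise can be confirmed by enumerating the six outcomes. The sketch below builds the joint distribution as a dictionary keyed by (x, y) pairs, a representation assumed here for illustration, and computes the means and the covariance with exact fractions.

```python
from fractions import Fraction

def X(w):
    """1 if the die shows at most 4, else 0."""
    return 1 if w <= 4 else 0

def Y(w):
    """1 if the die shows an even number, else 0."""
    return 1 if w % 2 == 0 else 0

# Joint distribution f_{X,Y}: add 1/6 for each face landing on the pair (X(w), Y(w)).
joint = {}
for w in range(1, 7):
    key = (X(w), Y(w))
    joint[key] = joint.get(key, Fraction(0)) + Fraction(1, 6)

mu_x = sum(x * p for (x, y), p in joint.items())
mu_y = sum(y * p for (x, y), p in joint.items())
cov = sum((x - mu_x) * (y - mu_y) * p for (x, y), p in joint.items())

print(joint)
print(mu_x, mu_y, cov)
```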
10 Conditional Probability and Distributions
Consider a sample space S with event algebra A and a probability function
P : A → [0, 1]. Let A and B be events, with P(B) > 0. If we know B has occurred,
what is the probability that A has also occurred? We reason that since we
know B has occurred, in effect B has become our sample space. Therefore
all probabilities of events inside B should be scaled by 1/P(B), to keep the
total probability at 1. As for the occurrence of A, the points outside B
are irrelevant, so our answer should be P(A ∩ B) times the correcting factor
1/P(B).
• The conditional probability of A, given B, is defined to be
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$
We apply this idea to random variables:
• Let X, Y be jointly distributed random variables (both discrete or both
continuous). Then the conditional probability distribution of Y, given
X = x, is defined to be
$$f_{Y \mid X = x}(y) = \frac{f_{X,Y}(x, y)}{f_X(x)}.$$
In the discrete case, we have $f_{Y \mid X = x}(y) = P(Y = y \mid X = x)$.
Note that the conditional probability distribution is a valid probability distribution
in its own right. For example, in the discrete case, we have
1. $0 \le f_{Y \mid X = x}(y) \le 1$,
2. $\sum_i f_{Y \mid X = x}(y_i) = \sum_i \frac{f_{X,Y}(x, y_i)}{f_X(x)} = \frac{f_X(x)}{f_X(x)} = 1$.
Since $f_{Y \mid X = x}$ is a probability distribution, we can use it to define expectations.
• The conditional expectation of Y, given X = x, is defined to be
$$E[Y \mid X = x] = \begin{cases} \sum_i y_i\, f_{Y \mid X = x}(y_i) & X, Y \text{ are discrete,} \\ \int_{-\infty}^{\infty} y\, f_{Y \mid X = x}(y)\,dy & X, Y \text{ are continuous.} \end{cases}$$
Figure 6: Regression curve for two normal variables with ρ = 0.7.
• Note that $E[Y \mid X = x]$ is a function of x. It is also denoted by $\mu_{Y \mid X = x}$ or
$\mu_{Y \mid x}$, and is called the curve of regression of Y on X.
The function $E[Y \mid X = x]$ creates a new random variable $E[Y \mid X]$. Below,
we calculate the expectation of this new random variable in the continuous
case:
$$\begin{aligned}
E[E[Y \mid X]] &= \int_{-\infty}^{\infty} E[Y \mid X = x]\, f_X(x)\,dx \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} y\, f_{Y \mid X = x}(y)\, f_X(x)\,dy\,dx \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} y\, f_{X,Y}(x, y)\,dy\,dx \\
&= E[Y].
\end{aligned}$$
Similar calculations can be carried out in the discrete case, so that we have
the general result:
$$E[Y] = E[E[Y \mid X]].$$
This result is useful when we deal with experiments carried out in stages,
and we have information on how the results of one stage depend on those of
the previous ones.
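In the discrete case the identity can be verified directly on a small table. The joint distribution below is a made-up example of a two-stage experiment (X is the first-stage result, Y the second), chosen only to illustrate the calculation.

```python
from fractions import Fraction as F

# Hypothetical joint distribution f_{X,Y}(x, y), keyed by (x, y).
joint = {(0, 0): F(1, 8), (0, 1): F(3, 8), (1, 0): F(1, 4), (1, 1): F(1, 4)}

def marginal_x(x):
    """f_X(x): sum the joint distribution over all values of y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def cond_exp_y(x):
    """E[Y | X = x]: sum of y * f_{X,Y}(x, y) / f_X(x) over y."""
    return sum(y * p for (xi, y), p in joint.items() if xi == x) / marginal_x(x)

x_values = sorted({x for x, _ in joint})

e_y = sum(y * p for (_, y), p in joint.items())                 # E[Y] directly
tower = sum(cond_exp_y(x) * marginal_x(x) for x in x_values)    # E[E[Y | X]]

print(e_y, tower)
```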
11 Independence
Let X, Y be jointly distributed random variables. We consider Y to be
independent of X, if knowledge of the value taken by X tells us nothing
about the value taken by Y. Mathematically, this means:
$$f_{Y \mid X = x}(y) = f_Y(y).$$
This is easily rearranged to:
$$f_{X,Y}(x, y) = f_X(x)\, f_Y(y).$$
Note that the last expression is symmetric in X, Y.
• Jointly distributed X, Y are independent if we have the identity:
$$f_{X,Y}(x, y) = f_X(x)\, f_Y(y).$$
Exercise. If X, Y are independent random variables and $g : \mathbb{R}^2 \to \mathbb{R}$ any
function of the form g(x, y) = m(x)n(y), then
$$E[g(X, Y)] = E[m(X)]\,E[n(Y)].$$
Exercise. If X, Y are independent, then cov[X, Y ] = 0.
A common error is to think zero covariance implies independence – in fact it
only indicates the possibility of independence.
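A standard counterexample, not taken from these notes, makes the point concrete: let X be uniform on {−1, 0, 1} and let Y = X². The sketch below computes the covariance exactly and then tests the product identity that defines independence.

```python
from fractions import Fraction

# Joint distribution of (X, Y) with X uniform on {-1, 0, 1} and Y = X^2.
joint = {(x, x * x): Fraction(1, 3) for x in (-1, 0, 1)}

def marginal(index):
    """Marginal distribution of the coordinate at position `index` (0 for X, 1 for Y)."""
    out = {}
    for pair, p in joint.items():
        out[pair[index]] = out.get(pair[index], Fraction(0)) + p
    return out

f_x, f_y = marginal(0), marginal(1)

e_x = sum(x * p for x, p in f_x.items())
e_y = sum(y * p for y, p in f_y.items())
e_xy = sum(x * y * p for (x, y), p in joint.items())
cov = e_xy - e_x * e_y

# Independence would require f_{X,Y}(x, y) = f_X(x) f_Y(y) for every pair (x, y).
independent = all(
    joint.get((x, y), Fraction(0)) == f_x[x] * f_y[y] for x in f_x for y in f_y
)
print(cov, independent)
```

Here Y is completely determined by X, yet the covariance vanishes because the dependence is not linear.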
Exercise. If X, Y are independent, then var[X + Y ] = var[X] + var[Y ].
We return again to the normal distribution. The following facts about it are
the key to its wide usability:
1. If X ∼ N(µ, σ) and a ≠ 0, then aX ∼ N(aµ, |a|σ).
2. If $X_i \sim N(\mu_i, \sigma_i)$ with i = 1, . . . , n are independent, then
$\sum_i X_i \sim N(\mu, \sigma)$, where
$$\mu = \sum_i \mu_i, \qquad \sigma^2 = \sum_i \sigma_i^2.$$
Of course, the mean and variance would add up like this for any collection
of independent random variables. The important feature here is the preservation
of normality.
Figure 7: Chi square distributions with various degrees of freedom n (left panel: n = 1, 2, 3; right panel: n = 10, 20).
12 Chi Square Distribution
We have remarked that measurements of continuously varying quantities tend
to follow normal distributions. Therefore, a consistent method of measurement
will then amount to looking at a sequence of normal variables. The
errors in the measurements can be treated as a sequence of independent normal
variables with mean zero. By scaling units, we can take them to be
standard normal.
Suppose we consider independent standard normal variables $X_1, \ldots, X_n$. If
we view them as representing a sequence of errors, it is natural to ask if, in
total, the errors are large or small. The sum $\sum_i X_i$ won't do as a measure of
this, because individual $X_i$ could take on large values yet cancel out to give
a small value of the sum. This problem can be avoided by summing either
$|X_i|$ or $X_i^2$. The latter choice is more tractable.
• Let $X_1, \ldots, X_n$ be independent standard normal variables. The variable
$$X = \sum_{i=1}^{n} X_i^2$$
is called a chi square random variable with n degrees of freedom. We
write $X \sim \chi^2(n)$, and say it has the chi square distribution.
Exercise. Let X be a standard normal variable. Show that:
1. $E[X^2] = 1$.
2. $E[X^4] = 3$.
3. $\mathrm{var}[X^2] = 2$.
Exercise. Let $X \sim \chi^2(n)$. Show that E[X] = n and var[X] = 2n.
Figure 8: Density functions of chi square and normal distributions.
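The moments in these exercises can be sanity-checked numerically before proving them. The sketch below approximates the moments of a standard normal variable with a midpoint rule on [−12, 12] (the range and step count are arbitrary choices), and then uses the fact that means and variances of independent variables add up to obtain the chi square mean and variance.

```python
from math import exp, pi, sqrt

def std_normal_moment(r, steps=40000, width=12.0):
    """Midpoint-rule approximation of E[X^r] for a standard normal X."""
    h = 2 * width / steps
    midpoints = (-width + (i + 0.5) * h for i in range(steps))
    return sum(z**r * exp(-z * z / 2) / sqrt(2 * pi) * h for z in midpoints)

m2 = std_normal_moment(2)      # E[X^2]
m4 = std_normal_moment(4)      # E[X^4]
var_sq = m4 - m2**2            # var[X^2] = E[X^4] - E[X^2]^2

def chi_square_mean_var(n):
    """Mean and variance of a sum of n independent copies of X^2."""
    return n * m2, n * var_sq

print(m2, m4, var_sq)
print(chi_square_mean_var(30))
```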
Figure 7 shows chi square distributions with diﬀerent degrees of freedom.
Note that as the degrees of freedom increase, the distributions look more and
more normal. Figure 8 compares the $\chi^2(30)$ and $N(30, \sqrt{60})$ distributions:
they are very close to each other, and so a chi square distribution $\chi^2(n)$ with
n ≥ 30 is usually replaced by a normal approximation $N(n, \sqrt{2n})$.
• If $X \sim \chi^2(m)$ and $Y \sim \chi^2(n)$ are independent, then X + Y is a sum of
m + n squared independent standard normal variables. Therefore $X + Y \sim \chi^2(m + n)$.
• The converse is also true. If X, Y are independent, $X \sim \chi^2(m)$ and
$X + Y \sim \chi^2(m + n)$, then $Y \sim \chi^2(n)$.
13 Random Samples
The subject of statistical inference is concerned with the task of using observed
data to draw conclusions about the distribution of properties in a
population. The main obstruction is that it may not be possible (or even
desirable) to observe all the members of the population. We are forced to
draw conclusions by observing only a fraction of the members, and these
conclusions are necessarily probabilistic rather than certain.
In such situations, probability enters at two levels. Let us illustrate this
by a familiar example: the use of opinion polls. An opinion poll consists
of a rather small number of interviews, and from the opinions observed in
these, the pollster extrapolates the likely distribution of opinions in the whole
population. Thus, one may come across a poll which, after mailing forms to
10,000 people and getting responses from about half of them, concludes that
42% (of a population of 200 million) support candidate A, while 40% support
B, and the remainder are undecided. These conclusions are not certain and
come with error limits, say of 3%. Typically, this is done by reviewing all the
possible distributions of opinions, and finding out which one is most likely
to lead to the observations. This is one level at which probability is used.
The other level is in evaluating the confidence with which we can declare our
conclusions: the error limit of 3% is not certain either, but has a probability
attached, say 90%. If we wish to have an error bar we are more certain about
(say 95%), we would have to raise the bar (say, to 5%).
Sampling is the act of choosing certain members of a population, and then
measuring their properties. Since the results depend on the choices and
are not known beforehand, each measurement is naturally represented by a
random variable. We will also make the usually reasonable assumptions that
the measurements do not disturb the population, and that each choice is
independent of the previous ones.
• A random sample is a finite sequence of jointly distributed random
variables $X_1, \dots, X_n$ such that
1. Each $X_i$ has the same probability density: $f_{X_i} = f_{X_j}$ for all $i, j$.
2. The $X_i$ are independent: $f_{X_1,\dots,X_n}(x_1, \dots, x_n) = \prod_i f_{X_i}(x_i)$.
The common density function $f_X$ for all the $X_i$ is called the density function
of the population. We also say that we are sampling from a population of
type X.
14 Sample Mean and Variance
Broadly, the task of Statistics is to estimate the type of a population, as
well as the associated parameters. Thus we might ﬁrst try to establish that
a population can be reasonably described by a binomial variable, and then
estimate n and p. The parameters of a population can be estimated in various
ways, but are most commonly approached through the mean and variance.
For instance, if we have estimates $\hat\mu$ and $\hat\sigma^2$ for the mean and variance of
a binomial population, we could then use $\mu = np$ and $\sigma^2 = np(1 - p)$ to
estimate n and p.
• Let $X_1, \dots, X_n$ be a random sample. Its sample mean is a random
variable $\bar X$ defined by
\[ \bar X = \frac{1}{n} \sum_{i=1}^{n} X_i. \]
Observed values of the sample mean are used as estimates of the population
mean µ. Therefore, we need to be reassured that, on average at least, we
will see the right value:
\[ E[\bar X] = \frac{1}{n} \sum_{i=1}^{n} E[X_i] = \frac{1}{n} \sum_{i=1}^{n} \mu = \mu. \]
Moreover, we would like the variance of $\bar X$ to be small so that its values are
more tightly clustered around µ. We have
\[ \mathrm{var}[\bar X] = \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{var}[X_i] = \frac{\sigma^2}{n}, \]
where $\sigma^2$ is the population variance. Thus the variance of the sample mean
goes to zero as the sample size increases: the sample mean becomes a more
reliable estimator of the population mean.
\[ E[\bar X] = \mu, \qquad \mathrm{var}[\bar X] = \frac{\sigma^2}{n} \]
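The two boxed facts can be illustrated numerically. This is a rough sketch under assumed values (a normal population with µ = 2 and σ = 3, samples of size 25, and an arbitrary seed); none of these numbers come from the notes.

```python
import random
import statistics

rng = random.Random(1)       # fixed seed, chosen arbitrarily
n, trials = 25, 20000        # hypothetical sample size and number of samples
mu, sigma = 2.0, 3.0         # hypothetical population mean and standard deviation

# Each trial draws a random sample of size n from N(mu, sigma) and records its mean.
means = [
    statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
    for _ in range(trials)
]

avg_of_means = statistics.fmean(means)      # should be close to mu = 2
var_of_means = statistics.variance(means)   # should be close to sigma**2 / n = 0.36
print(avg_of_means, var_of_means)
```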
• Let $X_1, \dots, X_n$ be a random sample. Its sample variance is a random
variable $S^2$ defined by
\[ S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar X)^2. \]
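\[ \sum_i (X_i - \bar X)^2 = \sum_i \big( X_i^2 + \bar X^2 - 2 X_i \bar X \big) = \sum_i X_i^2 - n \bar X^2. \]
\[
\begin{aligned}
E\Big[ \sum_i (X_i - \bar X)^2 \Big] &= \sum_i E[X_i^2] - n E[\bar X^2] \\
&= \sum_i (\sigma^2 + \mu^2) - n \big( \mathrm{var}[\bar X] + E[\bar X]^2 \big) \\
&= n(\sigma^2 + \mu^2) - \sigma^2 - n\mu^2 = (n-1)\sigma^2.
\end{aligned}
\]
\[ E[S^2] = \sigma^2 \]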
A rather longer calculation, which we do not include, shows that for a
normal population:
\[ \mathrm{var}[S^2] = \frac{2\sigma^4}{n-1} \]
Again, we see that the sample variance has the right average ($\sigma^2$) and clusters
more tightly around it if we use larger samples.
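The role of the divisor n − 1 can be seen in a small simulation. The sketch below assumes a normal population with σ² = 4 and samples of size 5 (both arbitrary choices) and compares the average of the sums of squares divided by n − 1 and by n.

```python
import random
import statistics

rng = random.Random(2)    # fixed seed, chosen arbitrarily
n, trials = 5, 20000      # hypothetical sample size and number of samples
sigma2 = 4.0              # variance of the assumed N(0, 2) population

unbiased, biased = [], []
for _ in range(trials):
    xs = [rng.gauss(0.0, 2.0) for _ in range(n)]
    x_bar = sum(xs) / n
    ss = sum((x - x_bar) ** 2 for x in xs)   # sum of squared deviations
    unbiased.append(ss / (n - 1))            # the sample variance S^2
    biased.append(ss / n)                    # same sum, divided by n instead

mean_unbiased = statistics.fmean(unbiased)   # near sigma2 = 4
mean_biased = statistics.fmean(biased)       # near (n - 1) / n * sigma2 = 3.2
print(mean_unbiased, mean_biased)
```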
15 Large Sample Approximations
Chebyshev's Inequality. Let X be any random variable, with mean µ and
standard deviation σ. Then for any $r > 0$,
\[ P\big( |X - \mu| \ge r\sigma \big) \le \frac{1}{r^2}. \]
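Proof. We give the proof for a discrete random variable X with range $\{x_i\}$.
\[
\begin{aligned}
\sigma^2 &= \sum_i (x_i - \mu)^2 f_X(x_i) \\
&\ge \sum_{i:\,|x_i - \mu| \ge r\sigma} (x_i - \mu)^2 f_X(x_i) \\
&\ge r^2 \sigma^2 \sum_{i:\,|x_i - \mu| \ge r\sigma} f_X(x_i) \\
&= r^2 \sigma^2 \, P\big( |X - \mu| \ge r\sigma \big).
\end{aligned}
\]
Rearranging gives the desired inequality. □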
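For a concrete discrete variable the two sides of Chebyshev's inequality can be computed exactly. The sketch below uses a Binomial(10, 0.2) variable as an assumed example; the helper names `binomial_pmf` and `tail_probability` are hypothetical.

```python
from math import comb, sqrt

def binomial_pmf(n, p):
    """Probabilities P(X = k), k = 0..n, for X ~ Binomial(n, p)."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

def tail_probability(pmf, r):
    """Exact P(|X - mu| >= r * sigma) for a variable taking values 0..len(pmf)-1."""
    mu = sum(k * q for k, q in enumerate(pmf))
    sigma = sqrt(sum((k - mu) ** 2 * q for k, q in enumerate(pmf)))
    return sum(q for k, q in enumerate(pmf) if abs(k - mu) >= r * sigma)

pmf = binomial_pmf(10, 0.2)
for r in (1.5, 2.0, 3.0):
    # Left column: exact tail probability. Right column: Chebyshev bound 1 / r^2.
    print(r, tail_probability(pmf, r), 1 / r**2)
```

The exact tail probabilities are typically far below the bound, which is why the inequality is described later as correct but inefficient.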
Weak Law of Large Numbers. Let $X_1, \dots, X_n$ be a random sample from
a population with mean µ and standard deviation σ. Then for any $c > 0$:
\[ P\big( |\bar X - \mu| \ge c \big) \le \frac{\sigma^2}{n c^2}. \]
In particular, the probability diminishes to zero as $n \to \infty$.
Proof. Apply Chebyshev's inequality to $\bar X$, noting that it has mean µ and
standard deviation $\sigma/\sqrt{n}$:
\[ P\Big( |\bar X - \mu| \ge \frac{r\sigma}{\sqrt{n}} \Big) \le \frac{1}{r^2}. \]
Substituting $c = r\sigma/\sqrt{n}$ gives the Weak Law. □
The Weak Law allows us to make estimates of the mean of arbitrary accuracy
and certainty, by just increasing the sample size.
A remarkable feature of the sample mean is that as the sample size increases,
its distribution looks more and more like a normal distribution, regardless of
the population distribution! We give an illustration of this in Figure 9.
Central Limit Theorem. Let $X_1, \dots, X_n$ be a random sample from a
population with mean µ and variance $\sigma^2$. Then the distribution of the
standardized sample mean approaches the standard normal distribution as n increases:
\[ \lim_{n \to \infty} \frac{\bar X - \mu}{\sigma/\sqrt{n}} \sim N(0, 1). \]
When we work with large samples (a commonly used definition of "large" is
$n \ge 30$), the Central Limit Theorem allows us to use the normal
approximation to make sharper error estimates (relative to the Weak Law).
We will not prove this result, but we shall use it frequently.
Figure 9: The histogram represents the means of 5000 samples of size 100, from
a binomial population with parameters n = 10 and p = 0.2. It is matched against
a normal distribution with $\mu = 2$ and $\sigma = \sqrt{0.016}$.
Example. Suppose we are trying to estimate the mean of a population
whose variance is known to be $\sigma^2 = 1.6$ (as in Figure 9). Using samples
of size 100, we want error bars that are 90% certain. Then the Weak Law
suggests we find a c such that
\[ \frac{1.6}{100 c^2} = 0.1, \quad \text{or} \quad c = 0.4. \]
Thus we are at least 90% sure that any particular observed mean is within 0.4 of
the actual population mean.
If we consider the particular data that led to the histogram of Figure 9,
we ﬁnd that in fact 90% of the observed sample means are within 0.21 of
the actual mean, which is 2. Additionally, 99.9% of the observed means are
within 0.5 of 2. So the Weak Law estimates are correct, but ineﬃcient.
On the other hand, taking 100 as a large sample size, we see from the Central
Limit Theorem that $Z = 7.905(\bar X - \mu)$ should be approximately a standard
normal variable. Therefore we are 90% sure that its observed values are
within 1.65 of 0. So we are 90% sure that any observed sample mean is
within $1.65/7.905 = 0.209$ of the actual population mean. Similarly, we are 99.9%
sure that any observed mean is within 0.5 of the actual. Thus, the Central
Limit Theorem provides estimates which are close to reality (in this case,
identical). □
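The two error bars in this example can be recomputed directly from the bound $\sigma^2/(nc^2)$ of the Weak Law and from the standard normal quantile. The sketch below uses only the values stated in the example ($\sigma^2 = 1.6$, n = 100, 90% certainty); the variable names are illustrative. It uses the exact quantile rather than the rounded table value 1.65, so the last digit may differ slightly from a hand calculation.

```python
from math import sqrt
from statistics import NormalDist

sigma2, n, confidence = 1.6, 100, 0.90
alpha = 1 - confidence

# Weak Law: choose c with sigma^2 / (n c^2) = alpha.
c_weak = sqrt(sigma2 / (n * alpha))

# Central Limit Theorem: choose z with P(|Z| < z) = confidence,
# then c = z * sigma / sqrt(n).
z = NormalDist().inv_cdf(1 - alpha / 2)
c_clt = z * sqrt(sigma2 / n)

print(c_weak, c_clt)
```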
16 Point Estimation
We now start our enquiry into the general process of estimating statistical
parameters. So far we have become familiar with estimating mean and
variance via sample mean and sample variance.
• A statistic of a random sample $X_1, \dots, X_n$ is any function
\[ \Theta = f(X_1, \dots, X_n). \]
Thus a statistic is itself a random variable. $\bar X$ and $S^2$ are examples of
statistics. The probability distribution of a statistic is called its sampling
distribution.
• A statistic Θ is called an estimator of a population parameter θ if values
of the statistic are used as estimates of θ.
$\bar X$ is an estimator of the population mean, and $S^2$ is an estimator of the
population variance.
• If a statistic Θ is an estimator of a population parameter θ, then any
observed value of Θ is denoted $\hat\theta$ and called a point estimate of θ.
According to this convention, observed values of $\bar X$ should be denoted by $\hat\mu$,
and of $S^2$ by $\hat\sigma^2$. We will also denote them by $\bar x$ and $s^2$, respectively.
• A statistic Θ is an unbiased estimator of a population parameter θ if
\[ E[\Theta] = \theta. \]
$\bar X$ is an unbiased estimator of the population mean, and $S^2$ is an unbiased
estimator of the population variance.
Exercise. Suppose we use $S = \sqrt{S^2}$ as an estimator of the standard
deviation of the population. Is it an unbiased estimator?
In general, if Θ is an estimator of θ, the gap E[Θ] −θ is called its bias.
• Θ is an asymptotically unbiased estimator of θ if the bias goes to 0 as
the sample size goes to inﬁnity.
Exercise. Show the following statistics are asymptotically unbiased
estimators:
1. $\frac{1}{n} \sum_i (X_i - \bar X)^2$, of $\sigma^2$.
2. $S = \sqrt{S^2}$, of σ.
• Θ is a consistent estimator of θ if for any $c > 0$,
\[ P\big( |\Theta - \theta| \ge c \big) \to 0 \quad \text{as } n \to \infty, \]
where n is the sample size.
The Weak Law of Large Numbers states that $\bar X$ is a consistent estimator
of µ. The proof of the Weak Law can be trivially generalized to yield the
following result:
Theorem. Let Θ be an unbiased estimator for θ, such that $\mathrm{var}[\Theta] \to 0$ as
the sample size $n \to \infty$. Then Θ is a consistent estimator of θ. □
It follows that $S^2$ is a consistent estimator of $\sigma^2$.
17 Method of Moments
Having created the abstract notion of an estimator of a population
parameter, we now face the problem of creating such estimators. The Method of
Moments is based on the idea that parameters of a population can be
recovered from its moments, and for the moments there are certain obvious
estimators.
• Given a random variable X, its $r$th moment is defined to be
\[ \mu_r = E[X^r]. \]
Note that $\mu_X = \mu_1$ and $\sigma_X^2 = \mu_2 - \mu_1^2$.
• Consider a population with probability distribution $f_X$. Then the statistic
\[ M_r = \frac{1}{n} \sum_{i=1}^{n} X_i^r \]
is the Method of Moments estimator for the population moment $\mu_r$.
We have $M_1 = \bar X$. Further, the identity $\sigma_X^2 = \mu_2 - \mu_1^2$ leads to the use of
$M_2 - M_1^2$ as an estimator of $\sigma^2$.
Exercise. Show that $M_2 - M_1^2 = \frac{1}{n} \sum_i (X_i - \bar X)^2$.
The Method of Moments is easy to implement. Its drawback is that it comes
with no guarantees about the performance of the estimators that it creates.
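As a small illustration of how easy the method is to implement, the sketch below computes $M_r$ and the resulting estimates of the mean and variance. The function names and the data are hypothetical, chosen only for illustration.

```python
def sample_moment(xs, r):
    """Method of Moments estimate M_r = (1/n) * sum of x_i ** r."""
    return sum(x**r for x in xs) / len(xs)

def moment_estimates(xs):
    """Return (M_1, M_2 - M_1**2), the estimates of the mean and the variance."""
    m1 = sample_moment(xs, 1)
    m2 = sample_moment(xs, 2)
    return m1, m2 - m1**2

data = [1.0, 2.0, 3.0, 4.0]   # hypothetical observations
mean_hat, var_hat = moment_estimates(data)
print(mean_hat, var_hat)
```

Note that the variance estimate obtained this way divides by n rather than n − 1, so it is not the sample variance $S^2$.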
18 Maximum Likelihood Estimators
Maximum Likelihood Estimation (MLE) will be our main tool for generating
estimators. The reason lies in the following guarantee:
Fact. Maximum Likelihood Estimators are asymptotically minimum variance
and unbiased. Thus, for large samples, they are essentially the best
estimators.
Before describing the method, we ask the reader to consider the following
graph and question:
[Figure: graphs of two conjectured density functions for a population.]
Question. The graphs above represent two conjectured distributions for a
population. If we pick a random member of the population and it turns out
to be 2, which of these conjectures appears more plausible? What if the
random member turns out to be −1? 1?
• Consider a population whose probability density $f(\,\cdot\,; t)$ depends on a
parameter t. We draw a random sample $X_1, \dots, X_n$ from this population. The
corresponding likelihood function is defined to be
\[ L(x_1, \dots, x_n; t) = \prod_i f(x_i; t), \]
where $x_i$ is a value of $X_i$. Thus L is the joint probability density of the
random sample.
• Suppose we observe the values $x_1, \dots, x_n$ of a random sample $X_1, \dots, X_n$.
The corresponding maximum likelihood estimate of t is the value $\hat t$ that
maximizes $L(x_1, \dots, x_n; t)$.
Example. Suppose we have a binomial population with n = 1 and we wish
to estimate p using a sample of size m. The corresponding likelihood function
is:
\[
\begin{aligned}
L(x_1, \dots, x_m; p) &= \prod_i \binom{1}{x_i} p^{x_i} (1-p)^{1-x_i}
 = \Big( \prod_{i:\,x_i = 1} p \Big) \Big( \prod_{i:\,x_i = 0} (1-p) \Big) \\
&= p^{\sum_i x_i} (1-p)^{m - \sum_i x_i}
 = p^{m\bar x} (1-p)^{m(1-\bar x)}.
\end{aligned}
\]
To maximize L, we use Calculus:
\[ \frac{dL}{dp} = 0 \implies p = \bar x. \]
So the sample mean $\bar X$ is the ML Estimator of p. □
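The conclusion of this example can be checked numerically by maximizing the likelihood over a grid of candidate values of p. This is only a sketch: the 0/1 data are made up, and a crude grid search stands in for the calculus argument.

```python
from math import log

def log_likelihood(xs, p):
    """Log of L(x_1..x_m; p) = p**(sum x_i) * (1 - p)**(m - sum x_i) for 0/1 data."""
    successes = sum(xs)
    return successes * log(p) + (len(xs) - successes) * log(1 - p)

xs = [1, 0, 1, 1, 0, 0, 1, 1]               # hypothetical observations
grid = [k / 1000 for k in range(1, 1000)]   # candidate values of p in (0, 1)

# Maximizing log L is equivalent to maximizing L, since log is increasing.
p_hat = max(grid, key=lambda p: log_likelihood(xs, p))
x_bar = sum(xs) / len(xs)
print(p_hat, x_bar)
```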
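Example. Suppose we have a normal population whose σ is known, and we
want the maximum likelihood estimator of µ. The likelihood function is
\[
L(x_1, \dots, x_n; \mu) = \prod_i \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2} \left( \frac{x_i - \mu}{\sigma} \right)^2}
 = \frac{1}{(\sigma\sqrt{2\pi})^n} \, e^{-\frac{1}{2} \sum_i \left( \frac{x_i - \mu}{\sigma} \right)^2}.
\]
Matters simplify if we take the logarithms:
\[ \ln L(x_1, \dots, x_n; \mu) = -n \ln(\sigma\sqrt{2\pi}) - \frac{1}{2} \sum_i \left( \frac{x_i - \mu}{\sigma} \right)^2. \]
We differentiate with respect to µ:
\[ 0 = \frac{d(\ln L)}{d\mu} = \frac{1}{\sigma^2} \sum_i (x_i - \mu) \implies \mu = \bar x. \]
So the sample mean is the ML Estimator of µ. □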
Exercise. Suppose we have a normal population whose µ is known. Show
that the ML Estimator of $\sigma^2$ is $\frac{1}{n} \sum_i (X_i - \mu)^2$.
The MLE technique can also be applied when many parameters $t_1, \dots, t_k$
have to be estimated. Then the likelihood function has the form
\[ L(x_1, \dots, x_n; t_1, \dots, t_k) \]
and we again seek the values $\hat t_1, \dots, \hat t_k$ which maximize it.
Exercise. Suppose we have a normal population and neither parameter is
known. Find their ML Estimators.
19 Sampling Distributions
In earlier sections, we have explored the characteristics of the sample mean
and variance. We could give their expectation and variance, and for large
samples the Central Limit Theorem gives a close approximation to the actual
distribution of $\bar X$. Other than this, we do not have an explicit description of
the sampling distributions of $\bar X$ and $S^2$. However, if the population is known
to be normal, we can do better.
Theorem. Let $X_1, \dots, X_n$ be a random sample from a normal population.
Then
1. $\bar X$ and $S^2$ are independent.
2. $\bar X \sim N(\mu, \sigma/\sqrt{n})$.
3. $(n-1) \dfrac{S^2}{\sigma^2} \sim \chi^2(n-1)$.
Proof. We skip the proof of the first item: it is hard and no elementary
Statistics book contains it! The next item is trivial, since linear combinations
of independent normal variables are again normal. For the last claim, we start
with the identity
\[ (n-1) S^2 = \sum_i X_i^2 - n \bar X^2 = \sum_i (X_i - \mu)^2 - n (\bar X - \mu)^2. \]
Hence,
\[ \sum_i \left( \frac{X_i - \mu}{\sigma} \right)^2 = (n-1) \frac{S^2}{\sigma^2} + \left( \frac{\bar X - \mu}{\sigma/\sqrt{n}} \right)^2. \]
Let us recall the following fact: If Z, Y are independent, $Z \sim \chi^2(k)$ and
$Y + Z \sim \chi^2(k + m)$, then $Y \sim \chi^2(m)$. We can apply this to the last equation
with $k = 1$ and $k + m = n$, to get $(n-1) \dfrac{S^2}{\sigma^2} \sim \chi^2(n-1)$. □
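The identity used in the proof holds for every sample, not just on average, so it can be checked on a single simulated sample. The sketch below assumes arbitrary values µ = 1, σ = 2, n = 8 and an arbitrary seed.

```python
import random
from math import sqrt

rng = random.Random(3)         # fixed seed, chosen arbitrarily
mu, sigma, n = 1.0, 2.0, 8     # hypothetical population parameters and sample size
xs = [rng.gauss(mu, sigma) for _ in range(n)]

x_bar = sum(xs) / n
s2 = sum((x - x_bar) ** 2 for x in xs) / (n - 1)   # sample variance S^2

# Left side: sum of squared standardized observations.
lhs = sum(((x - mu) / sigma) ** 2 for x in xs)
# Right side: (n - 1) S^2 / sigma^2 plus the squared standardized sample mean.
rhs = (n - 1) * s2 / sigma**2 + ((x_bar - mu) / (sigma / sqrt(n))) ** 2

print(lhs, rhs)   # the two sides agree up to floating point rounding
```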
Exercise. Let $S^2$ be the sample variance of a sample of size n from an
$N(\mu, \sigma)$ population. Show that
\[ \mathrm{var}[S^2] = \frac{2\sigma^4}{n-1}. \]
Figure 10: This diagram illustrates the independence of $\bar X$ and $S^2$ for a normal
population. It shows $(\bar x, s^2)$ pairs for a thousand samples (each of size 100) from
a standard normal population.
Figure 11: The graphs compare t distributions with the standard normal
distribution. We have A: t(1), B: t(10), C: N(0, 1).
t Distribution
• Consider independent random variables Y, Z, where $Y \sim \chi^2(n)$ and
$Z \sim N(0, 1)$. Then the random variable
\[ T = \frac{Z}{\sqrt{Y/n}} \]
is said to have a t distribution with n degrees of freedom. We write
$T \sim t(n)$. This definition is motivated by the following example:
Example. Suppose we have a random sample $X_1, \dots, X_n$ from a normal
population, and we seek to estimate µ. Then we have:
\[ Z = \frac{\bar X - \mu}{\sigma/\sqrt{n}} \sim N(0, 1). \]
We can't use Z directly because σ is also unknown. One option is to replace
σ by its estimator S, but then the distribution is no longer normal. However,
we recall that
\[ Y = (n-1) \frac{S^2}{\sigma^2} \sim \chi^2(n-1). \]
Further, since $\bar X, S^2$ are independent, so are Y and Z. Therefore
\[ \frac{Z}{\sqrt{Y/(n-1)}} \sim t(n-1). \]
To round off the example, note that
\[ \frac{Z}{\sqrt{Y/(n-1)}} = \frac{\bar X - \mu}{S/\sqrt{n}}. \]
□
Figure 11 shows that the t distribution is only needed for small sample sizes.
Beyond n = 30, the t distributions are essentially indistinguishable from the
standard normal one, and so we can directly use the observed value s (of S)
for σ and work with the standard normal distribution.
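The difference between small and large samples can be seen by simulating $T = (\bar X - \mu)/(S/\sqrt{n})$ and counting how often $|T|$ exceeds 1.96, which a standard normal variable does about 5% of the time. This is a sketch only; the sample sizes 5 and 50, the trial counts and the seed are assumptions.

```python
import random
from math import sqrt

def t_statistic(xs, mu):
    """(x_bar - mu) / (s / sqrt(n)) for one observed sample."""
    n = len(xs)
    x_bar = sum(xs) / n
    s = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    return (x_bar - mu) / (s / sqrt(n))

def tail_fraction(rng, n, trials, cutoff=1.96):
    """Fraction of simulated samples of size n from N(0, 1) with |T| > cutoff."""
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if abs(t_statistic(xs, 0.0)) > cutoff:
            hits += 1
    return hits / trials

rng = random.Random(4)                  # fixed seed, chosen arbitrarily
small = tail_fraction(rng, 5, 20000)    # T ~ t(4): noticeably more than 5%
large = tail_fraction(rng, 50, 5000)    # T ~ t(49): close to the normal value
print(small, large)
```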
Figure 12: The first graph shows various F distributions (A: F(3, 10), B: F(10, 10),
C: F(50, 50)). The second graph compares the F(200, 200) distribution (A) and the
normal distribution with the same mean and standard deviation (B).
F Distribution
• Suppose $X_1, X_2$ are independent random variables, and $X_1 \sim \chi^2(n_1)$,
$X_2 \sim \chi^2(n_2)$. Then the random variable
\[ F = \frac{X_1/n_1}{X_2/n_2} \]
is said to have an F distribution with $n_1, n_2$ degrees of freedom. We write
$F \sim F(n_1, n_2)$.
Example. Suppose we wish to compare the variances of two normal
populations. We could do this by estimating their ratio. Therefore, for i = 1, 2,
let $S_i^2$ be the sample variance of a random sample of size $n_i$ from the $i$th
population. Then
\[ (n_i - 1) \frac{S_i^2}{\sigma_i^2} \sim \chi^2(n_i - 1). \]
Therefore, on taking the ratio and simplifying,
\[ \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \sim F(n_1 - 1, n_2 - 1). \]
Given observations of $S_i$, we can use this relation to estimate $\sigma_1/\sigma_2$. □
From Figure 12 we see that F distributions also become more normal with
larger sample sizes. However, they do so much more slowly than $\chi^2$ or t
distributions.
You may have noted that we have not provided the density functions for the
$\chi^2$, t and F distributions. Their density functions are not very complicated,
but they are rarely used directly. What we really need are their direct
integrals (to get probabilities) and these do not have closed form formulae!
Instead, we have to look up the values from tables (provided at the end of
every Statistics book) or use computational software such as Mathematica.
Remark. From the definition of the F distribution, it is clear that
$F \sim F(m_1, m_2)$ implies $1/F \sim F(m_2, m_1)$. So tables for the F distribution are
printed only for the case $m_1 \le m_2$.
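The definition of F and the reciprocal property in the Remark can be illustrated by simulation. The sketch below builds F draws from sums of squared standard normals; the degrees of freedom (3, 10), the trial count and the seed are arbitrary assumptions, and medians are compared because they are less sensitive to the long right tail.

```python
import random
import statistics

def chi_square_draw(rng, dof):
    """One draw from chi-square(dof): a sum of dof squared standard normals."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dof))

def f_draw(rng, n1, n2):
    """One draw of (X1/n1) / (X2/n2) with independent X1 ~ chi2(n1), X2 ~ chi2(n2)."""
    return (chi_square_draw(rng, n1) / n1) / (chi_square_draw(rng, n2) / n2)

rng = random.Random(5)          # fixed seed, chosen arbitrarily
n1, n2, trials = 3, 10, 20000   # hypothetical degrees of freedom and trial count

draws = [f_draw(rng, n1, n2) for _ in range(trials)]    # F(n1, n2)
reciprocals = [1 / f for f in draws]                    # should behave like F(n2, n1)
fresh = [f_draw(rng, n2, n1) for _ in range(trials)]    # direct draws from F(n2, n1)

median_recip = statistics.median(reciprocals)
median_fresh = statistics.median(fresh)
print(median_recip, median_fresh)   # the two medians should be close
```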
20 Conﬁdence Intervals
An interval estimate for a population parameter t is an interval $[\hat t_1, \hat t_2]$
where t is predicted to lie. The numbers $\hat t_1, \hat t_2$ are obtained from a random
sample: so they are values of statistics $T_1$ and $T_2$. Let
\[ P(T_1 < t < T_2) = 1 - \alpha. \]
Then $1 - \alpha$ is called the degree of confidence and $\hat t_1, \hat t_2$ are the lower
and upper confidence limits. The interval $(\hat t_1, \hat t_2)$ is called a $(1 - \alpha)100\%$
confidence interval.
To obtain a confidence interval for a parameter t we need to create a statistic
T in which t is the only unknown. If T is of one of the standard types
(normal, chi square, etc.) we can use tables or statistical software to find
$t_1, t_2$ such that
\[ P(t_1 < T < t_2) = 1 - \alpha. \]
By rearranging $t_1 < T < t_2$ in the form $T_1 < t < T_2$ we obtain a $(1 - \alpha)100\%$
confidence interval for t.
Conﬁdence Interval For Mean
Example. Suppose we need to estimate the mean µ of a normal population
whose standard deviation σ is known. If we have a random sample of size n,
then
\[ \bar X \sim N(\mu, \sigma/\sqrt{n}). \]
If we desire a degree of confidence $1 - \alpha$, we need a c such that
\[ P\big( |\bar X - \mu| < c \big) = 1 - \alpha. \]
Then $(\bar x - c, \bar x + c)$ will be a confidence interval for µ with degree of confidence
$1 - \alpha$.
Figure 13: The point z such that $P(Z > z) = \alpha/2$ also satisfies $P(|Z| < z) = 1 - \alpha$.
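Define $Z = \dfrac{\bar X - \mu}{\sigma/\sqrt{n}}$. This is a standard normal variable. Also,
\[ P\big( |\bar X - \mu| < c \big) = P\Big( |Z| < \frac{c\sqrt{n}}{\sigma} \Big). \]
The problem can be finished off by locating $z = \dfrac{c\sqrt{n}}{\sigma}$ such that (see Figure
13):
\[ P(Z > z) = \frac{\alpha}{2}. \]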
For example, suppose we have n = 150, σ = 6.2 and α = 0.01. Then we have
z = 2.575, since
\[ \frac{1}{\sqrt{2\pi}} \int_{2.575}^{\infty} e^{-x^2/2} \, dx = 0.005. \]
From $z = \dfrac{c\sqrt{n}}{\sigma}$ we obtain
\[ c = \frac{2.575 \times 6.2}{\sqrt{150}} = 1.30. \]
Figure 14: A one-sided confidence interval.
Hence $(\bar x - 1.30, \bar x + 1.30)$ is a 99% confidence interval for µ. □
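The half-width c in this example can be reproduced with the standard normal quantile function from Python's standard library instead of a table. The function name `mean_ci_half_width` is hypothetical; the inputs are the ones quoted in the example.

```python
from math import sqrt
from statistics import NormalDist

def mean_ci_half_width(sigma, n, alpha):
    """Half-width c of the (1 - alpha) interval for mu when sigma is known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z with P(Z > z) = alpha / 2
    return z * sigma / sqrt(n)

c = mean_ci_half_width(sigma=6.2, n=150, alpha=0.01)
print(round(c, 2))
```

Given an observed sample mean `x_bar`, the interval would then be `(x_bar - c, x_bar + c)`.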
Let us note some features of this example:
1. We found a confidence interval which was symmetric about the point
estimate $\bar x$. We can also find asymmetric ones. In particular, we have
the one-sided confidence interval $(-\infty, \bar x + z \frac{\sigma}{\sqrt{n}})$, where z is chosen
to satisfy $P(Z > z) = \alpha$. (See Figure 14.)
2. We assumed a normal population. This can be bypassed with the help
of the Central Limit Theorem and a large sample size n, for then $\bar X$
will be approximately normally distributed.
3. We assumed σ was somehow known. If not, we can work with the
t distribution $T = \dfrac{\bar X - \mu}{S/\sqrt{n}}$. Further, if $n \ge 30$, we can just use the
observed value s of S for σ.
The next example illustrates the last item.
Example. Suppose we have a small random sample of size n from a normal
population, $N(\mu, \sigma)$. Then
\[ T = \frac{\bar X - \mu}{S/\sqrt{n}} \sim t(n-1). \]
To obtain a confidence interval for µ, with degree of confidence $1 - \alpha$, we
seek a value t such that
\[ P(|T| \le t) = 1 - \alpha. \]
For then, the interval
\[ \Big\{ \mu : -t \le \frac{\bar x - \mu}{s/\sqrt{n}} \le t \Big\} = \Big[ \bar x - \frac{s}{\sqrt{n}} t, \; \bar x + \frac{s}{\sqrt{n}} t \Big] \]
is the required confidence interval for µ.
For instance, suppose the random sample consists of the following numbers:
2.3, 1.9, 2.1, 2.8, 2.3, 3.6, 1.4, 1.8, 2.1, 3.2, 2.0 and 1.9. Then we have
\[ \bar x = 2.28 \quad \text{and} \quad s^2 = 0.391. \]
Also, $T \sim t(11)$. If we want a confidence level of 95%, then we can take
t = 2.201 since $P(T \ge 2.201) = 0.025 = \alpha/2$ (and the t distribution is symmetric
about the y-axis). So the 95% confidence interval is (1.89, 2.68). □
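The numbers in this example can be recomputed from the twelve observations. The sketch below takes the critical value t = 2.201 from the text, since the Python standard library has no t quantile function. Working at full precision can change the last digit relative to a hand calculation that rounds the mean before squaring.

```python
from math import sqrt
from statistics import mean, variance

data = [2.3, 1.9, 2.1, 2.8, 2.3, 3.6, 1.4, 1.8, 2.1, 3.2, 2.0, 1.9]
t_crit = 2.201   # value with P(T >= t) = 0.025 for t(11), as quoted in the text

n = len(data)
x_bar = mean(data)
s2 = variance(data)                    # sample variance, divides by n - 1
half_width = t_crit * sqrt(s2 / n)     # t * s / sqrt(n)
interval = (x_bar - half_width, x_bar + half_width)

print(round(x_bar, 2), round(s2, 3), interval)
```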
Conﬁdence Interval For Variance
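Consider the task of estimating the variance $\sigma^2$ of a normal population. If
we have a random sample of size n, then
\[ (n-1) \frac{S^2}{\sigma^2} \sim \chi^2(n-1). \]
We take $x_1, x_2$ such that
\[ P\Big( (n-1) \frac{S^2}{\sigma^2} \ge x_1 \Big) = 1 - \frac{\alpha}{2} \quad \text{and} \quad P\Big( (n-1) \frac{S^2}{\sigma^2} \ge x_2 \Big) = \frac{\alpha}{2}, \]
so that $P\Big( x_1 < (n-1) \dfrac{S^2}{\sigma^2} < x_2 \Big) = 1 - \alpha$. Therefore,
\[ P\Big( \frac{n-1}{x_2} S^2 < \sigma^2 < \frac{n-1}{x_1} S^2 \Big) = 1 - \alpha. \]
Thus, $\Big( \dfrac{n-1}{x_2} s^2, \; \dfrac{n-1}{x_1} s^2 \Big)$ is a $(1 - \alpha)100\%$ confidence interval for $\sigma^2$.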
Figure 15: Choosing a confidence interval for variance, using a chi square
distribution.
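Once the cut-offs $x_1, x_2$ have been read from a chi square table, the interval is a one-line computation. In the sketch below the function name `variance_ci` and the inputs (n = 12, s² = 0.39, α = 0.05) are hypothetical; the cut-offs 3.816 and 21.920 are the 2.5% and 97.5% points of $\chi^2(11)$ as given in standard tables, supplied here by hand as an assumption.

```python
def variance_ci(s2, n, x1, x2):
    """Interval ((n-1) s^2 / x2, (n-1) s^2 / x1), with x1 < x2 chi-square(n-1) cut-offs."""
    return (n - 1) * s2 / x2, (n - 1) * s2 / x1

# Hypothetical inputs; x1 and x2 are table values for chi-square(11).
low, high = variance_ci(s2=0.39, n=12, x1=3.816, x2=21.920)
print(low, high)
```

The interval is not symmetric about $s^2$, reflecting the skewness of the chi square distribution.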
Conﬁdence Interval For Diﬀerence of Means
We now consider the problem of comparing two diﬀerent populations: We
do this by estimating either the diﬀerence or the ratio of their population
parameters such as mean and variance.
Example. Consider independent normal populations with parameters $(\mu_i, \sigma_i)$,
i = 1, 2, respectively. We will estimate $\mu_1 - \mu_2$, assuming the variances are
known. Now,
\[ \bar X_i \sim N\Big( \mu_i, \frac{\sigma_i}{\sqrt{n_i}} \Big). \]
Assuming the random samples to be independent, we have
\[ \bar X_1 - \bar X_2 \sim N(\mu_1 - \mu_2, \sigma), \quad \text{where } \sigma = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}. \]
Therefore,
\[ Z = \frac{(\bar X_1 - \bar X_2) - (\mu_1 - \mu_2)}{\sigma} \sim N(0, 1). \]
We are now in a familiar situation. We choose z so that
\[ P(Z > z) = \alpha/2. \]
Then,
\[ P(|Z| < z) = 1 - \alpha. \]
Hence,
\[ P\big( |(\bar X_1 - \bar X_2) - (\mu_1 - \mu_2)| < \sigma z \big) = 1 - \alpha. \]
And so, $(\bar x_1 - \bar x_2 - \sigma z, \; \bar x_1 - \bar x_2 + \sigma z)$ is a $(1 - \alpha)100\%$ confidence interval for
the difference of means. □
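The steps of this example translate directly into a short function. This is a sketch: the function name `diff_means_ci` and the numerical inputs are hypothetical, chosen only to show the computation.

```python
from math import sqrt
from statistics import NormalDist

def diff_means_ci(x1_bar, x2_bar, sigma1, sigma2, n1, n2, alpha):
    """(1 - alpha) interval for mu1 - mu2 when both standard deviations are known."""
    sigma = sqrt(sigma1**2 / n1 + sigma2**2 / n2)   # s.d. of X1_bar - X2_bar
    z = NormalDist().inv_cdf(1 - alpha / 2)         # z with P(Z > z) = alpha / 2
    d = x1_bar - x2_bar
    return d - sigma * z, d + sigma * z

# Hypothetical observed means, known standard deviations and sample sizes.
low, high = diff_means_ci(10.0, 8.0, 2.0, 3.0, 50, 60, 0.05)
print(low, high)
```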
If the variances are not known, we can take large samples ($n_i \ge 30$) and
use $s_i^2$ for $\sigma_i^2$. If even that is not possible, we need to at least assume that
$\sigma_1 = \sigma_2 = \sigma$. Then
\[ Z = \frac{(\bar X_1 - \bar X_2) - (\mu_1 - \mu_2)}{\sigma \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim N(0, 1). \]
For $\sigma^2$, we have the pooled estimator
\[ S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}, \]
which is a weighted average of the two sample variances. We have
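\[ (n_i - 1) \frac{S_i^2}{\sigma^2} \sim \chi^2(n_i - 1), \quad i = 1, 2. \]
Hence
\[ Y = \sum_{i=1}^{2} (n_i - 1) \frac{S_i^2}{\sigma^2} \sim \chi^2(n_1 + n_2 - 2). \]
Therefore
\[ T = \frac{Z}{\sqrt{Y/(n_1 + n_2 - 2)}} \sim t(n_1 + n_2 - 2). \]
Now T can also be expressed as:
\[ T = \frac{(\bar X_1 - \bar X_2) - (\mu_1 - \mu_2)}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \]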
So we see that the t distribution can be used to find confidence intervals for
$\mu_1 - \mu_2$. For example, suppose we find t such that
\[ P(T \ge t) = \alpha/2. \]
Then
\[ P(|T| \le t) = 1 - \alpha. \]
This can be rearranged into
\[ P\Big( |(\bar X_1 - \bar X_2) - (\mu_1 - \mu_2)| \le t \, S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \Big) = 1 - \alpha. \]
Therefore,
\[ \Big( \bar x_1 - \bar x_2 - t \, s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}, \; \bar x_1 - \bar x_2 + t \, s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \Big) \]
is a $(1 - \alpha)100\%$ confidence interval for $\mu_1 - \mu_2$.
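A sketch of the pooled computation follows. It divides by $n_1 + n_2 - 2$ in the pooled variance, which is the divisor that makes $T = Z/\sqrt{Y/(n_1 + n_2 - 2)}$ equal to the expression in $S_p$. The function names and all numerical inputs are hypothetical, and the critical value 2.086 for t(20) is taken from a standard t table as an assumption.

```python
from math import sqrt

def pooled_variance(s1_sq, s2_sq, n1, n2):
    """Pooled estimate of the common variance, with n1 + n2 - 2 degrees of freedom."""
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)

def pooled_t_ci(x1_bar, x2_bar, s1_sq, s2_sq, n1, n2, t_crit):
    """Interval for mu1 - mu2; t_crit satisfies P(T >= t_crit) = alpha / 2 for t(n1 + n2 - 2)."""
    sp = sqrt(pooled_variance(s1_sq, s2_sq, n1, n2))
    half_width = t_crit * sp * sqrt(1 / n1 + 1 / n2)
    d = x1_bar - x2_bar
    return d - half_width, d + half_width

# Hypothetical summary statistics; 2.086 is the table value for t(20) and alpha = 0.05.
sp2 = pooled_variance(4.0, 9.0, 10, 12)
low, high = pooled_t_ci(5.0, 3.5, 4.0, 9.0, 10, 12, t_crit=2.086)
print(sp2, low, high)
```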
Conﬁdence Interval For Ratio Of Variances
For comparing the variances of two normal populations, it is convenient to
look at their ratio through the F distribution.
Example. Suppose we have independent random samples from $N(\mu_i, \sigma_i)$
populations, with i = 1, 2. Let $n_i$ be the sample size and $S_i^2$ the sample
variance for the sample from the $i$th population. Then
\[ (n_i - 1) \frac{S_i^2}{\sigma_i^2} \sim \chi^2(n_i - 1), \]
and hence
\[ F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \sim F(n_1 - 1, n_2 - 1). \]
Therefore, we look for $f_1, f_2$ such that
\[ P(F > f_1) = 1 - \alpha/2 \quad \text{and} \quad P(F > f_2) = \alpha/2, \]
so that
\[ P(f_1 < F < f_2) = 1 - \alpha. \]
This yields the $(1 - \alpha)100\%$ confidence interval
\[ \Big( \frac{s_1^2}{s_2^2} \frac{1}{f_2}, \; \frac{s_1^2}{s_2^2} \frac{1}{f_1} \Big) \]
for $\sigma_1^2/\sigma_2^2$. □
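The last step can be sketched as follows, with the cut-offs $f_1, f_2$ supplied by the user from an F table. The function name `variance_ratio_ci` and the sample variances are hypothetical. The value 3.72 is used as an approximate upper 2.5% point of F(10, 10), which is an assumption taken from standard tables; the lower cut-off is obtained from it through the relation $1/F \sim F(10, 10)$.

```python
def variance_ratio_ci(s1_sq, s2_sq, f1, f2):
    """Interval ((s1^2/s2^2)/f2, (s1^2/s2^2)/f1) for sigma1^2 / sigma2^2."""
    ratio = s1_sq / s2_sq
    return ratio / f2, ratio / f1

# Hypothetical setting: two samples of size 11, so F ~ F(10, 10), alpha = 0.05.
f2 = 3.72      # approximate table value with P(F > f2) = 0.025 (assumed)
f1 = 1 / f2    # lower cut-off, using 1/F ~ F(10, 10)
low, high = variance_ratio_ci(4.0, 9.0, f1, f2)
print(low, high)
```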
References
1. A.M. Mood, F.A. Graybill and D.C. Boes, Introduction to the Theory
of Statistics, Third Edition, Tata McGraw-Hill, New Delhi.
2. J.E. Freund, Mathematical Statistics, Fifth Edition, Prentice-Hall India.