We will sum up the most essential aspects of Probability, and the transition
to Statistical Inference. We avoid technical distractions. The gaps we leave
can be filled in by consulting one of the references provided at the end.
Contents
1 Sample Space, Random Variables
2 Probability
3 Probability Distributions
4 Binomial Distribution
5 Normal Distribution
6 Expectation
7 Variance
8 Lognormal Distribution
9 Bivariate Distributions
11 Independence
12 Chi Square Distribution
13 Random Samples
14 Sample Mean and Variance
15 Large Sample Approximations
16 Point Estimation
17 Method of Moments
18 Maximum Likelihood Estimators
19 Sampling Distributions
20 Confidence Intervals

1 Sample Space, Random Variables
For example, our sample space may be the collection of stocks listed on
the National Stock Exchange. We may ask statistical questions about this
space, such as: What is the average change in the value of a listed stock
(an outcome) over the past year? Alternatively, our sample space might be
the history of a single stock: all its daily closing prices, say, over the last
5 years. Our question might be: What is the average weekly change of the
value over the last year? Or, how large is the variation in the prices over its
5-year history? We could ask more subtle questions: How likely is it that the
stock will have a sudden fall in value over the next month?
• Typical notation is to use S for sample space, lower case letters such as x
for individual outcomes, and capital letters such as E for events.
2 Probability
• Let A be the collection of all subsets of the sample space S: we call it the
event algebra.
The probability function P assigns to each event a number between 0 and 1, with the following properties:
1. P(∅) = 0.
2. P(E₁ ∪ · · · ∪ Eₙ) = P(E₁) + · · · + P(Eₙ), if the Eᵢ are pairwise disjoint events.
3. P(Aᶜ) = 1 − P(A).
4. P(⋃ᵢ Eᵢ) ≤ Σᵢ P(Eᵢ), for any collection of events Eᵢ.
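These properties are easy to check on a finite sample space. Here is a minimal Python sketch; the die sample space, the events, and the helper prob are our own illustrative constructions, not taken from the text:

```python
from fractions import Fraction

# A fair die: each outcome in S has probability 1/6.
S = {1, 2, 3, 4, 5, 6}

def prob(event):
    """P(E) = |E| / |S| for equally likely outcomes."""
    return Fraction(len(event & S), len(S))

E1, E2 = {1, 2}, {3, 4}   # disjoint events
A = {2, 4, 6}             # "even result"

assert prob(set()) == 0                              # property 1
assert prob(E1 | E2) == prob(E1) + prob(E2)          # property 2 (disjoint union)
assert prob(S - A) == 1 - prob(A)                    # property 3 (complement)
assert prob(E1 | A) <= prob(E1) + prob(A)            # property 4 (union bound)
```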
3 Probability Distributions
We return to our main question: How likely are different values (or ranges
of values) of a random variable X : S → R?
Thus, suppose I ask, what is the probability that X takes on a value greater
than 100? This is to be interpreted as: what is the probability of the event
whose outcomes t all satisfy X(t) > 100? That is,
P(X > 100) = P({t : X(t) > 100}).
It is convenient to consider two types of random variables, ones whose values
vary discretely (discrete random variables) and those whose values vary
continuously (continuous random variables).¹
Examples:
1. Let us allot +1 to the stocks whose value rose on a given day, and −1 to
those whose value fell. Then we have created a random variable whose
possible values are ±1. This is a discrete random variable.
2. Let us allot the whole number +n to the stocks whose value rose by
between n and n + 1 on a given day, and −n to those whose value
fell by between n and n + 1. Then we have created a random variable
whose possible values are all the integers. This is also a discrete random
variable.
¹ These are not the only types. But these types contain all the ones we need.
Important. The number fX (x) does not represent the probability that
X = x. Individual values of fX have no significance, only the integrals of fX
do! (Contrast this with the discrete case.)
[Figure: the uniform density on [0, 1]; any three subintervals I₁, I₂, I₃ of equal length satisfy P(X ∈ I₁) = P(X ∈ I₂) = P(X ∈ I₃).]²
² Usually, this is just called the uniform distribution.
[Figure: graph of the uniform density fX, equal to 1 on [0, 1] and 0 elsewhere.]
Now we look at two distributions, one discrete and one continuous, which
are especially important.
4 Binomial Distribution
Consider a random variable X which can take on only two values, say 0 and 1
(the choice of values is not important). Suppose the probability of the value
1 is p and the probability of the value 0 is q. Then:
1. 0 ≤ p, q ≤ 1,
2. p + q = 1.
Number of combinations with k 1's = ways of picking k slots out of n = C(n, k), the binomial coefficient.
[Figure: bar chart of a binomial probability mass function.]
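Each particular sequence with k 1's has probability pᵏqⁿ⁻ᵏ, so the count above gives the binomial probabilities C(n, k) pᵏqⁿ⁻ᵏ. A quick check in Python; the choice n = 10, p = 0.5 is purely illustrative:

```python
from scipy.stats import binom

n, p = 10, 0.5   # illustrative parameters
for k in range(n + 1):
    # f(k) = C(n, k) * p^k * (1 - p)^(n - k)
    print(k, binom.pmf(k, n, p))

# The pmf sums to 1, as any probability distribution must.
assert abs(sum(binom.pmf(k, n, p) for k in range(n + 1)) - 1) < 1e-12
```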
5 Normal Distribution
[Figure: graph of the standard normal density.]
5 NORMAL DISTRIBUTION 10
Exercise. Can you explain the factor 1/√(2π)? (Hint: Think about the
requirement that total probability should be 1.)
Note that the graph is symmetric about the y-axis. The axis of symmetry can
be moved to another position m, by replacing x by x − m. In the following
diagram, the dashed line represents the standard normal distribution:
[Figure: the normal density with mean m, with the standard normal density shown dashed.]
fX(x) = (1/√(2π)) e^(−(x−m)²/2).
Also, starting from the standard normal distribution, we can create one with
a similar shape but bunched more tightly around the y-axis. We achieve this
by replacing x with x/s:
[Figure: a normal density bunched more tightly around the y-axis than the standard normal density.]
fX(x) = (1/(√(2π) s)) e^(−(x/s)²/2).
Exercise. Will increasing σ make the graph more tightly bunched around
the axis of symmetry? What will it do to the peak height of the graph?
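A short numerical sketch of these manipulations, using scipy.stats.norm; the values m = 1 and s = 0.5 below are illustrative choices:

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-3, 3, 7)

# Standard normal density: (1/sqrt(2*pi)) * exp(-x^2/2).
f0 = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
assert np.allclose(f0, norm.pdf(x))

# Shift the axis of symmetry to m, and rescale by s:
m, s = 1.0, 0.5
f = np.exp(-((x - m) / s)**2 / 2) / (np.sqrt(2 * np.pi) * s)
assert np.allclose(f, norm.pdf(x, loc=m, scale=s))

# Smaller s bunches the graph more tightly and raises the peak, 1/(s*sqrt(2*pi)).
print(norm.pdf(m, m, 0.5), norm.pdf(m, m, 2.0))
```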
6 Expectation
If X is a discrete random variable with range {xᵢ}, its expectation is defined as
E[X] = Σᵢ fX(xᵢ) xᵢ.
Suppose X is discrete with range {xᵢ}. Then the range of g(X) is {g(xᵢ)}.
Therefore we can calculate the expectation of g(X) as follows:⁴
E[g(X)] = Σᵢ g(xᵢ) P(g(X) = g(xᵢ)) = Σᵢ g(xᵢ) fX(xᵢ).
³ A constant c can be viewed as a random variable whose range consists of the single value c.
⁴ Our calculation is valid if g is one-one. With slightly more effort we can make it valid for any g.
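As a small illustration, the formula E[g(X)] = Σᵢ g(xᵢ)fX(xᵢ) is easy to compute directly; the pmf below is an illustrative choice:

```python
# Discrete X with range {-1, 0, 1} and an illustrative pmf f_X.
f_X = {-1: 0.25, 0: 0.5, 1: 0.25}

def expectation(g, f):
    """E[g(X)] = sum over i of g(x_i) * f_X(x_i)."""
    return sum(g(x) * p for x, p in f.items())

print(expectation(lambda x: x, f_X))       # E[X] = 0
print(expectation(lambda x: x**2, f_X))    # E[X^2] = 0.5
```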
7 Variance
Given some data {x₁, . . . , xₙ}, its average x̄ is seen as a central value about
which the data is clustered. The significance of the average is greater if the
clustering is tight, less otherwise. To measure the tightness of the clustering,
we use the variance of the data:
s² = (1/n) Σᵢ (xᵢ − x̄)².
Variance is just the average of the squared distance from each data point to
the average (of the data).
Exercise. Suppose X has the discrete uniform distribution with range {0, . . . , n}.
We have seen that E[X] = n/2. Show that its variance is n(n + 2)/12.
Exercise. Suppose X has the continuous uniform distribution with range
[0, 1]. We have seen that E[X] = 1/2. Show that its variance is 1/12.
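Both exercises are easy to check numerically; here is a sketch (the values of n and the Monte Carlo sample size are arbitrary):

```python
import numpy as np

# Discrete uniform on {0, ..., n}: compare direct computation with n(n+2)/12.
for n in (4, 10, 25):
    xs = np.arange(n + 1)
    var = np.mean((xs - n / 2)**2)   # each value has probability 1/(n+1)
    assert np.isclose(var, n * (n + 2) / 12)

# Continuous uniform on [0, 1]: Monte Carlo estimate of the variance, expected 1/12.
samples = np.random.default_rng(0).uniform(size=10**6)
print(samples.var(), 1 / 12)
```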
Therefore,
var[X] = σ².
Thus, if X ∼ N(µ, σ), then Z = (X − µ)/σ has expectation 0 and standard
deviation 1. In fact, Z is again normally distributed, and hence has the
standard normal distribution (its parameters are 0 and 1). Through this link, all
questions about normal distributions can be converted to questions about
the standard normal distribution. For instance, let X, Z be as above. Then:
P(X ≤ a) = P((X − µ)/σ ≤ (a − µ)/σ) = P(Z ≤ (a − µ)/σ).
Now we can also illustrate our earlier statement about how the normal
distribution serves as a substitute for other distributions. Figure 2 compares
a binomial distribution (n = 100 and p = 0.2) with the normal distribution
with the same mean and variance (µ = np = 20 and σ² = np(1 − p) = 16).
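A quick way to see the quality of this approximation is to compare cumulative probabilities. In the sketch below the cutoff 24 and the continuity correction of 0.5 are our own illustrative choices:

```python
from scipy.stats import binom, norm

n, p = 100, 0.2
mu, sigma = n * p, (n * p * (1 - p)) ** 0.5   # mu = 20, sigma = 4

# Compare an exact binomial probability with its normal approximation
# (using a continuity correction of 0.5).
exact = binom.cdf(24, n, p)
approx = norm.cdf((24 + 0.5 - mu) / sigma)
print(exact, approx)   # both are close to 0.87
```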
[Figure: three density curves, labelled A, B, C.]
8 Lognormal Distribution
Figure 4: A simulation of the lognormal model for variation of stock prices with
time.
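The simulation behind such a figure can be sketched as follows: if the daily log-returns of a stock are independent normal variables, then the price after t days is a lognormal variable. All parameters below are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(8)

# Illustrative lognormal model: daily log-returns are i.i.d. N(mu, sigma).
mu, sigma, days, S0 = 0.0005, 0.02, 250, 100.0

log_returns = rng.normal(mu, sigma, size=days)
prices = S0 * np.exp(np.cumsum(log_returns))   # simulated price path
print(prices[-1])
```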
9 Bivariate Distributions
So far we have dealt with individual random variables, i.e. with models for
particular features of a population. The next step is to study the relationships
that exist between different features of a population. For instance, an investor
might like to know the nature and strength of the connection between her
portfolio and a stock market index such as the NIFTY. If the NIFTY goes
up, is her portfolio likely to do the same? How much of a rise is it reasonable
to expect?
This leads us to the study of pairs of random variables and the probabilities
associated with their joint values. Are high values of one associated with
high or low values of the other? Is there a significant connection at all?
2. ∫_{−∞}^{∞} ∫_{−∞}^{∞} fX,Y(x, y) dx dy = 1.
[Figure: three scatter plots of samples of (X, Y) pairs.]
• −1 ≤ ρX,Y ≤ 1.
Exercise. Suppose a die is tossed once. Let X take on the value 1 if the
result is ≤ 4 and the value 0 otherwise. Similarly, let Y take on the value 1
if the result is even and the value 0 otherwise.
1. Show that the values of fX,Y are given by the following table:

   X\Y    0     1
    0    1/6   1/6
    1    1/3   1/3
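The table can be verified by brute-force enumeration of the six die outcomes; a minimal sketch:

```python
from fractions import Fraction

# Enumerate the die outcomes and tabulate f_{X,Y}.
counts = {}
for roll in range(1, 7):
    x = 1 if roll <= 4 else 0          # X = 1 when the result is <= 4
    y = 1 if roll % 2 == 0 else 0      # Y = 1 when the result is even
    counts[(x, y)] = counts.get((x, y), 0) + 1

f_XY = {xy: Fraction(c, 6) for xy, c in counts.items()}
print(f_XY)   # {(1,0): 1/3, (1,1): 1/3, (0,0): 1/6, (0,1): 1/6}
```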
Suppose we know that the event B has occurred; what is the probability that
A has also occurred? We reason that since we know B has occurred, in effect
B has become our sample space. Therefore all probabilities of events inside B
should be scaled by 1/P(B), to keep the total probability at 1. As for the
occurrence of A, the points outside B are irrelevant, so our answer should be
P(A ∩ B) times the correcting factor 1/P(B):
P(A|B) = P(A ∩ B)/P(B).
fY|X=x(y) = fX,Y(x, y)/fX(x).
1. 0 ≤ fY|X=x(y) ≤ 1,
2. Σᵢ fY|X=x(yᵢ) = Σᵢ fX,Y(x, yᵢ)/fX(x) = fX(x)/fX(x) = 1.
The function E[Y |X = x] creates a new random variable E[Y |X]. Below,
we calculate the expectation of this new random variable in the continuous
case:
E[E[Y|X]] = ∫_{−∞}^{∞} E[Y|X = x] fX(x) dx
          = ∫_{−∞}^{∞} ∫_{−∞}^{∞} y fY|X=x(y) fX(x) dy dx
          = ∫_{−∞}^{∞} ∫_{−∞}^{∞} y fX,Y(x, y) dy dx
          = E[Y].
Similar calculations can be carried out in the discrete case, so that we have
the general result:
E[Y ] = E[E[Y |X]].
This result is useful when we deal with experiments carried out in stages,
and we have information on how the results of one stage depend on those of
the previous ones.
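A small simulation makes the result concrete. The two-stage experiment below, a Poisson first stage followed by a normal second stage, is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10**6

# Stage 1: X is Poisson with mean 5 (illustrative).
X = rng.poisson(5, size=n)
# Stage 2: given X = x, Y is normal with mean 2x, so E[Y | X = x] = 2x.
Y = rng.normal(loc=2 * X, scale=1.0)

# E[Y] should match E[E[Y|X]] = E[2X] = 10.
print(Y.mean())   # close to 10
```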
11 Independence
We return again to the normal distribution. The following facts about it are
the key to its wide usability:
1. If X ∼ N(µ, σ) and a ≠ 0, then aX ∼ N(aµ, |a|σ).
2. If Xᵢ ∼ N(µᵢ, σᵢ), i = 1, . . . , n, are pairwise independent, then
Σᵢ Xᵢ ∼ N(µ, σ), where
µ = Σᵢ µᵢ,  σ² = Σᵢ σᵢ².
Of course, the mean and variance would add up like this for any collection
of independent random variables. The important feature here is the
preservation of normality.
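A simulation sketch of this preservation of normality; the parameters µᵢ and σᵢ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
mus, sigmas = [1.0, -2.0, 0.5], [1.0, 2.0, 0.5]   # illustrative parameters

# Draw independent X_i ~ N(mu_i, sigma_i) and form their sum.
total = sum(rng.normal(m, s, size=10**6) for m, s in zip(mus, sigmas))

# The sum should have mean sum(mu_i) and variance sum(sigma_i^2).
print(total.mean(), sum(mus))                   # approx -0.5
print(total.var(), sum(s**2 for s in sigmas))   # approx 5.25
```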
12 Chi Square Distribution
[Figure: chi square densities for n = 1, 2, 3 (left) and for n = 10, 20 (right).]
[Figure: comparison of a chi square density with a normal density.]
3. var[X²] = 2.
Exercise. Let X ∼ χ²(n). Show that E[X] = n and var[X] = 2n.
• The converse is also true: if X, Y are independent, X ∼ χ²(m) and
X + Y ∼ χ²(m + n), then Y ∼ χ²(n).
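Assuming the usual characterization of χ²(n) as a sum of n squared independent standard normals, the exercise is easy to check by simulation:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 5, 10**6

# A chi square variable with n degrees of freedom: sum of n squared
# independent standard normals.
X = (rng.standard_normal((reps, n)) ** 2).sum(axis=1)

print(X.mean(), n)        # E[X] = n
print(X.var(), 2 * n)     # var[X] = 2n
```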
13 Random Samples
The subject of statistical inference is concerned with the task of using
observed data to draw conclusions about the distribution of properties in a
population. The main obstruction is that it may not be possible (or even
desirable) to observe all the members of the population. We are forced to
draw conclusions by observing only a fraction of the members, and these
conclusions are necessarily probabilistic rather than certain.
1. Each Xi has the same probability density: fXi = fXj for all i, j.
2. The Xᵢ are independent: fX₁,...,Xₙ(x₁, . . . , xₙ) = ∏ᵢ fXᵢ(xᵢ).
The common density function fX for all the Xi is called the density function
of the population. We also say that we are sampling from a population of
type X.
14 Sample Mean and Variance
Moreover, we would like the variance of X̄ to be small so that its values are
more tightly clustered around µ. We have
var[X̄] = (1/n²) Σᵢ₌₁ⁿ var[Xᵢ] = σ²/n,
where σ 2 is the population variance. Thus the variance of the sample mean
goes to zero as the sample size increases: the sample mean becomes a more
reliable estimator of the population mean.
E[X̄] = µ,  var[X̄] = σ²/n.
Σᵢ (Xᵢ − X̄)² = Σᵢ (Xᵢ² + X̄² − 2XᵢX̄) = Σᵢ Xᵢ² − nX̄².
E[ Σᵢ (Xᵢ − X̄)² ] = Σᵢ E[Xᵢ²] − n E[X̄²]
                 = Σᵢ (σ² + µ²) − n(var[X̄] + E[X̄]²)
                 = n(σ² + µ²) − σ² − nµ² = (n − 1)σ².
E[S²] = σ²,  var[S²] = 2σ⁴/(n − 1).
Again, we see that sample variance has the right average (σ²) and clusters
more tightly around it if we use larger samples.
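Both behaviours can be seen in a simulation. The normal population below is an illustrative choice (for a normal population the stated formula for var[S²] holds exactly):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n, reps = 10.0, 2.0, 25, 20000

samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1)
s2 = samples.var(axis=1, ddof=1)           # S^2 uses the 1/(n-1) convention

print(xbar.mean(), mu)                     # E[X-bar] = mu
print(xbar.var(), sigma**2 / n)            # var[X-bar] = sigma^2 / n
print(s2.mean(), sigma**2)                 # E[S^2] = sigma^2
print(s2.var(), 2 * sigma**4 / (n - 1))    # var[S^2] = 2 sigma^4 / (n - 1)
```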
15 Large Sample Approximations
Chebyshev's inequality. If X has mean µ and standard deviation σ, then for any r > 0,
P(|X − µ| ≥ rσ) ≤ 1/r².
Proof. We give the proof for a discrete random variable X with range {xᵢ}.
σ² = Σᵢ (xᵢ − µ)² fX(xᵢ)
   ≥ Σ_{i : |xᵢ−µ| ≥ rσ} (xᵢ − µ)² fX(xᵢ)
   ≥ r²σ² Σ_{i : |xᵢ−µ| ≥ rσ} fX(xᵢ)
   = r²σ² P(|X − µ| ≥ rσ).
Rearranging gives the desired inequality. □
The Weak Law allows us to estimate the mean with arbitrary accuracy and
certainty, just by increasing the sample size.
A remarkable feature of the sample mean is that as the sample size increases,
its distribution looks more and more like a normal distribution, regardless of
the population distribution! We give an illustration of this in Figure 9.
When we work with large samples,⁸ the Central Limit Theorem allows us to
treat the sample mean as an (approximately) normal random variable.
⁸ A commonly used definition of “large” is n ≥ 30.
Figure 9: The histogram represents the means of 5000 samples of size 100, from
a binomial population with parameters n = 10 and p = 0.2. It is matched against
a normal distribution with µ = 2 and σ = √0.016.
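The experiment of Figure 9 is easy to reproduce; a sketch:

```python
import numpy as np

rng = np.random.default_rng(5)

# 5000 samples of size 100 from a binomial population with n = 10, p = 0.2,
# as in Figure 9.
means = rng.binomial(10, 0.2, size=(5000, 100)).mean(axis=1)

# The sample means should look N(mu, sigma) with mu = 2, sigma = sqrt(0.016).
print(means.mean(), 2.0)
print(means.std(), np.sqrt(0.016))
```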
1.6²/(100c²) = 0.1, or c = 0.506.
Thus we are 90% sure that any particular observed mean is within 0.506 of
the actual population mean.
On the other hand, taking 100 as a large sample size, we see from the Central
Limit Theorem that Z = 7.905(X̄ − µ) should be approximately a standard
normal variable. Therefore we are 90% sure that its observed values are
within 1.65 of 0. So we are 90% sure that any observed sample mean is
within 1.65/7.905 = 0.209 of the actual population mean. Similarly, we are 99.9%
sure that any observed mean is within 0.5 of the actual. Thus, the Central
Limit Theorem provides estimates which are close to reality (in this case,
identical). □
16 Point Estimation
We now start our enquiry into the general process of estimating statisti-
cal parameters – so far we have become familiar with estimating mean and
variance via sample mean and sample variance.
An estimator of a parameter θ is a statistic
Θ = f(X₁, . . . , Xₙ).
It is called unbiased if
E[Θ] = θ.
Exercise. Suppose we use S = √(S²) as an estimator of the standard deviation
of the population. Is it an unbiased estimator?
17 Method of Moments
The r-th moment of X is µᵣ = E[Xʳ].
Note that µX = µ₁ and σX² = µ₂ − µ₁².
We have M₁ = X̄, where Mᵣ = (1/n) Σᵢ Xᵢʳ denotes the r-th sample moment.
Further, the identity σX² = µ₂ − µ₁² leads to the use of M₂ − M₁² as an
estimator of σ².
Exercise. Show that M₂ − M₁² = (1/n) Σᵢ (Xᵢ − X̄)².
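A sketch of the method in action; the normal population and its parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)
data = rng.normal(loc=3.0, scale=1.5, size=10**5)   # illustrative population

M1 = data.mean()          # first sample moment
M2 = (data**2).mean()     # second sample moment

mu_hat = M1               # method-of-moments estimator of the mean
var_hat = M2 - M1**2      # method-of-moments estimator of sigma^2
print(mu_hat, var_hat)    # close to 3.0 and 2.25
```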
18 Maximum Likelihood Estimators
Maximum Likelihood Estimation (MLE) will be our main tool for generating
estimators. The reason lies in the following guarantee:
Before describing the method, we ask the reader to consider the following
graph and question:
[Figure: a density curve.]
L(x₁, . . . , xₘ; p) = p^(mx̄) (1 − p)^(m(1−x̄)).
L(x1 , . . . , xn ; t1 , . . . , tk )
and we again seek the values t̂1 , . . . , t̂k which maximize it.
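As a sanity check, we can maximize the Bernoulli likelihood above numerically and confirm that the maximizer agrees with the sample mean x̄; the true p = 0.3 and the sample size are illustrative:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(7)
xs = rng.binomial(1, 0.3, size=200)   # Bernoulli sample, true p = 0.3

# Negative log of L(x_1, ..., x_m; p) = p^(m xbar) * (1 - p)^(m (1 - xbar)).
m, xbar = len(xs), xs.mean()
def neg_log_L(p):
    return -(m * xbar * np.log(p) + m * (1 - xbar) * np.log(1 - p))

res = minimize_scalar(neg_log_L, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, xbar)   # the maximizer agrees with the sample mean
```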
19 Sampling Distributions
For large samples, the Central Limit Theorem gives a close approximation to the actual
distribution of X̄. Other than this, we do not have an explicit description of
the sampling distributions of X̄ and S². However, if the population is known
to be normal, we can do better.
Theorem. Let X₁, . . . , Xₙ be a random sample from an N(µ, σ) population. Then:
1. X̄ and S² are independent random variables.
2. X̄ ∼ N(µ, σ/√n).
3. (n − 1)S²/σ² ∼ χ²(n − 1).
Proof. We skip the proof of the first item – it is hard and no elementary
Statistics book contains it! The next item is trivial, since linear combinations
of independent normal variables are again normal. For the last claim, we start
with the identity
(n − 1)S² = Σᵢ Xᵢ² − nX̄² = Σᵢ (Xᵢ − µ)² − n(X̄ − µ)².
Hence,
Σᵢ ((Xᵢ − µ)/σ)² = (n − 1)S²/σ² + ((X̄ − µ)/(σ/√n))².
Let us recall the following fact: If Z, Y are independent, Z ∼ χ²(k) and
Y + Z ∼ χ²(k + m), then Y ∼ χ²(m). We can apply this to the last equation
with k = 1 and k + m = n, to get (n − 1)S²/σ² ∼ χ²(n − 1). □
var[S²] = 2σ⁴/(n − 1).
Figure 10: This diagram illustrates the independence of X̄ and S² for a normal
population. It shows (x̄, s²) pairs for a thousand samples (each of size 100) from
a standard normal population.
Figure 11: The graphs compare t distributions with the standard normal distribution.
We have A: t(1), B: t(10), C: N(0, 1).
t Distribution
• Consider independent random variables Y, Z, where Y ∼ χ²(n) and Z ∼
N(0, 1). Then the random variable
T = Z/√(Y/n)
is said to have a t distribution with n degrees of freedom. We write T ∼ t(n).
For a random sample from a normal population,
Z = (X̄ − µ)/(σ/√n) ∼ N(0, 1),
and Y = (n − 1)S²/σ² ∼ χ²(n − 1) is independent of Z. Therefore
Z/√(Y/(n − 1)) = (X̄ − µ)/(S/√n) ∼ t(n − 1).
Figure 11 shows that the t distribution is only needed for small sample sizes.
Beyond n = 30, the t distributions are essentially indistinguishable from the
standard normal one, and so we can directly use the observed value s (of S)
for σ and work with the standard normal distribution.
19 SAMPLING DISTRIBUTIONS 40
Figure 12: The first graph shows various F distributions (A: F(3, 10), B: F(10, 10),
C: F(50, 50)). The second graph compares the F(200, 200) distribution (A) and the
normal distribution with the same mean and standard deviation (B).
F Distribution
• Suppose X₁, X₂ are independent random variables, with X₁ ∼ χ²(n₁) and
X₂ ∼ χ²(n₂). Then the random variable
F = (X₁/n₁)/(X₂/n₂)
is said to have an F distribution with n1 , n2 degrees of freedom. We write
F ∼ F(n1 , n2 ).
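A simulation sketch of this definition; the degrees of freedom are illustrative, and we use the standard fact that the mean of F(n₁, n₂) is n₂/(n₂ − 2) for n₂ > 2:

```python
import numpy as np

rng = np.random.default_rng(9)
n1, n2, reps = 5, 8, 200_000   # illustrative degrees of freedom

X1 = (rng.standard_normal((reps, n1)) ** 2).sum(axis=1)   # chi2(n1)
X2 = (rng.standard_normal((reps, n2)) ** 2).sum(axis=1)   # chi2(n2)
F = (X1 / n1) / (X2 / n2)

# For n2 > 2, the F(n1, n2) mean is n2 / (n2 - 2).
print(F.mean(), n2 / (n2 - 2))
```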
From Figure 12 we see that F distributions also become more normal with
larger sample sizes. However, they do so much more slowly than χ² or t
distributions.
You may have noted that we have not provided the density functions for the
χ², t and F distributions. Their density functions are not very complicated,
but they are rarely used directly. What we really need are their integrals
(to get probabilities), and these do not have closed form formulae!
Instead, we have to look up the values from tables (provided at the end of
every Statistics book) or use computational software such as Mathematica.
20 Confidence Intervals
Figure 13: The point z such that P(Z > z) = α/2, also satisfies P(|Z| < z) = 1−α.
Suppose we seek c such that P(|X̄ − µ| < c) = 1 − α. Then (x̄ − c, x̄ + c)
will be a confidence interval for µ with degree of confidence 1 − α.
Define Z = (X̄ − µ)/(σ/√n). This is a standard normal variable. Also,
P(|X̄ − µ| < c) = P(|Z| < c√n/σ).
The problem can be finished off by locating z = c√n/σ such that (see Figure 13):
P(Z > z) = α/2.
For example, suppose we have n = 150, σ = 6.2 and α = 0.01. Then we have
z = 2.575, since
∫_{2.575}^{∞} (1/√(2π)) e^(−x²/2) dx = 0.005.
From z = c√n/σ we obtain
c = (2.575 × 6.2)/√150 = 1.30.
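The same computation in Python; the observed mean x̄ = 42.0 is a hypothetical value, since the example does not supply one:

```python
from scipy.stats import norm

n, sigma, alpha = 150, 6.2, 0.01   # values from the example above
xbar = 42.0                        # hypothetical observed sample mean

z = norm.ppf(1 - alpha / 2)        # z with P(Z > z) = alpha/2; about 2.575
c = z * sigma / n**0.5             # about 1.30
print((xbar - c, xbar + c))        # 99% confidence interval for mu
```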
T = (X̄ − µ)/(S/√n) ∼ t(n − 1).
P(|T| ≤ t) = 1 − α.
For instance, suppose the random sample consists of the following numbers:
2.3, 1.9, 2.1, 2.8, 2.3, 3.6, 1.4, 1.8, 2.1, 3.2, 2.0 and 1.9. Then we have
n = 12, x̄ ≈ 2.28 and s ≈ 0.62.
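Assuming a 95% confidence level (the text does not specify one), the corresponding t interval for these twelve observations can be computed as follows:

```python
import numpy as np
from scipy.stats import t

data = np.array([2.3, 1.9, 2.1, 2.8, 2.3, 3.6, 1.4, 1.8, 2.1, 3.2, 2.0, 1.9])
n, xbar, s = len(data), data.mean(), data.std(ddof=1)

alpha = 0.05                             # assumed confidence level
tcrit = t.ppf(1 - alpha / 2, df=n - 1)   # t with P(T > t) = alpha/2
c = tcrit * s / n**0.5
print((xbar - c, xbar + c))              # roughly (1.89, 2.68)
```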
(n − 1)S²/σ² ∼ χ²(n − 1).
We take x₁, x₂ such that
P((n − 1)S²/σ² ≥ x₁) = 1 − α/2 and P((n − 1)S²/σ² ≥ x₂) = α/2,
so that P(x₁ < (n − 1)S²/σ² < x₂) = 1 − α. Therefore,
P((n − 1)S²/x₂ < σ² < (n − 1)S²/x₁) = 1 − α.
Thus, ((n − 1)s²/x₂, (n − 1)s²/x₁) is a (1 − α)100% confidence interval for σ².
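A sketch of this interval in Python; the values n = 12 and s² ≈ 0.39 reuse the sample above, and α = 0.05 is an assumption:

```python
from scipy.stats import chi2

n, s2, alpha = 12, 0.39, 0.05   # illustrative values (s^2 from the data above)

# x1, x2 with P(chi2 >= x1) = 1 - alpha/2 and P(chi2 >= x2) = alpha/2.
x1 = chi2.ppf(alpha / 2, df=n - 1)
x2 = chi2.ppf(1 - alpha / 2, df=n - 1)

lo, hi = (n - 1) * s2 / x2, (n - 1) * s2 / x1
print((lo, hi))   # (1 - alpha)100% confidence interval for sigma^2
```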
Figure 15: Choosing a confidence interval for variance, using a chi square distri-
bution.
Therefore,
Z = ((X̄₁ − X̄₂) − (µ₁ − µ₂))/σ ∼ N(0, 1).
We are now in a familiar situation. We choose z so that
P(|Z| < z) = 1 − α.
Hence,
P(|(X̄₁ − X̄₂) − (µ₁ − µ₂)| < σz) = 1 − α.
And so, (x̄₁ − x̄₂ − σz, x̄₁ − x̄₂ + σz) is a (1 − α)100% confidence interval for
the difference of means. □
If the variances are not known, we can take large samples (nᵢ ≥ 30) and
use sᵢ² for σᵢ². If even that is not possible, we need to at least assume that
σ₁ = σ₂ = σ. Then
Z = ((X̄₁ − X̄₂) − (µ₁ − µ₂))/(σ√(1/n₁ + 1/n₂)) ∼ N(0, 1).
So we see that the t distribution can be used to find confidence intervals for
µ1 − µ2 . For example, suppose we find t such that
P(T ≥ t) = α/2.
Then
P(|T | ≤ t) = 1 − α.
This can be rearranged into
P(|(X̄₁ − X̄₂) − (µ₁ − µ₂)| ≤ t Sₚ √(1/n₁ + 1/n₂)) = 1 − α.
Therefore,
(x̄₁ − x̄₂ − t sₚ √(1/n₁ + 1/n₂), x̄₁ − x̄₂ + t sₚ √(1/n₁ + 1/n₂))
is a (1 − α)100% confidence interval for µ₁ − µ₂.
References
1. A.M. Mood, F.A. Graybill and D.C. Boes, Introduction to the Theory
of Statistics, Third Edition, Tata McGraw-Hill, New Delhi.
2. J.E. Freund, Mathematical Statistics, Fifth Edition, Prentice-Hall India.