A Probability and Statistics Companion
John J. Kinney
Copyright ©2009 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or
by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as
permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior
written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to
the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax
(978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should
be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ
07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in
preparing this book, they make no representations or warranties with respect to the accuracy or
completeness of the contents of this book and specifically disclaim any implied warranties of
merchantability or fitness for a particular purpose. No warranty may be created or extended by sales
representatives or written sales materials. The advice and strategies contained herein may not be suitable
for your situation. You should consult with a professional where appropriate. Neither the publisher nor
author shall be liable for any loss of profit or any other commercial damages, including but not limited to
special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our
Customer Care Department within the United States at (800) 762-2974, outside the United States at (317)
572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may
not be available in electronic formats. For more information about Wiley products, visit our web site at
www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Kinney, John J.
A probability and statistics companion / John J. Kinney.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-470-47195-1 (pbk.)
1. Probabilities. I. Title.
QA273.K494 2009
519.2–dc22
2009009329
Typeset in 10/12pt Times by Thomson Digital, Noida, India.
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
For Cherry, Kaylyn, and James
Contents
Preface xv
1. Probability and Sample Spaces 1
Why Study Probability? 1
Probability 2
Sample Spaces 2
Some Properties of Probabilities 8
Finding Probabilities of Events 11
Conclusions 16
Explorations 16
2. Permutations and Combinations: Choosing the Best Candidate;
Acceptance Sampling 18
Permutations 19
Counting Principle 19
Permutations with Some Objects Alike 20
Permuting Only Some of the Objects 21
Combinations 22
General Addition Theorem and Applications 25
Conclusions 35
Explorations 35
3. Conditional Probability 37
Introduction 37
Some Notation 40
Bayes’ Theorem 45
Conclusions 46
Explorations 46
4. Geometric Probability 48
Conclusion 56
Explorations 57
5. Random Variables and Discrete Probability Distributions—Uniform,
Binomial, Hypergeometric, and Geometric Distributions 58
Introduction 58
Discrete Uniform Distribution 59
Mean and Variance of a Discrete Random Variable 60
Intervals, σ, and German Tanks 61
Sums 62
Binomial Probability Distribution 64
Mean and Variance of the Binomial Distribution 68
Sums 69
Hypergeometric Distribution 70
Other Properties of the Hypergeometric Distribution 72
Geometric Probability Distribution 72
Conclusions 73
Explorations 74
6. Seven-Game Series in Sports 75
Introduction 75
Seven-Game Series 75
Winning the First Game 78
How Long Should the Series Last? 79
Conclusions 81
Explorations 81
7. Waiting Time Problems 83
Waiting for the First Success 83
The Mythical Island 84
Waiting for the Second Success 85
Waiting for the rth Success 87
Mean of the Negative Binomial 87
Collecting Cereal Box Prizes 88
Heads Before Tails 88
Waiting for Patterns 90
Expected Waiting Time for HH 91
Expected Waiting Time for TH 93
An Unfair Game with a Fair Coin 94
Three Tosses 95
Who Pays for Lunch? 96
Expected Number of Lunches 98
Negative Hypergeometric Distribution 99
Mean and Variance of the Negative Hypergeometric 101
Negative Binomial Approximation 103
The Meaning of the Mean 104
First Occurrences 104
Waiting Time for c Special Items to Occur 104
Estimating k 105
Conclusions 106
Explorations 106
8. Continuous Probability Distributions: Sums, the Normal
Distribution, and the Central Limit Theorem; Bivariate
Random Variables 108
Uniform Random Variable 109
Sums 111
A Fact About Means 111
Normal Probability Distribution 113
Facts About Normal Curves 114
Bivariate Random Variables 115
Variance 119
Central Limit Theorem: Sums 121
Central Limit Theorem: Means 123
Central Limit Theorem 124
Expected Values and Bivariate Random Variables 124
Means and Variances of Means 124
A Note on the Uniform Distribution 126
Conclusions 128
Explorations 129
9. Statistical Inference I 130
Estimation 131
Confidence Intervals 131
Hypothesis Testing 133
β and the Power of a Test 137
p-Value for a Test 139
Conclusions 140
Explorations 140
10. Statistical Inference II: Continuous Probability
Distributions II—Comparing Two Samples 141
The Chi-Squared Distribution 141
Statistical Inference on the Variance 144
Student t Distribution 146
Testing the Ratio of Variances: The F Distribution 148
Tests on Means from Two Samples 150
Conclusions 154
Explorations 154
11. Statistical Process Control 155
Control Charts 155
Estimating σ Using the Sample Standard Deviations 157
Estimating σ Using the Sample Ranges 159
Control Charts for Attributes 161
np Control Chart 161
p Chart 163
Some Characteristics of Control Charts 164
Some Additional Tests for Control Charts 165
Conclusions 168
Explorations 168
12. Nonparametric Methods 170
Introduction 170
The Rank Sum Test 170
Order Statistics 173
Median 174
Maximum 176
Runs 180
Some Theory of Runs 182
Conclusions 186
Explorations 187
13. Least Squares, Medians, and the Indy 500 188
Introduction 188
Least Squares 191
Principle of Least Squares 191
Influential Observations 193
The Indy 500 195
A Test for Linearity: The Analysis of Variance 197
A Caution 201
Nonlinear Models 201
The Median–Median Line 202
When Are the Lines Identical? 205
Determining the Median–Median Line 207
Analysis for Years 1911–1969 209
Conclusions 210
Explorations 210
14. Sampling 211
Simple Random Sampling 212
Stratification 214
Proportional Allocation 215
Optimal Allocation 217
Some Practical Considerations 219
Strata 221
Conclusions 221
Explorations 221
15. Design of Experiments 223
Yates Algorithm 230
Randomization and Some Notation 231
Confounding 233
Multiple Observations 234
Design Models and Multiple Regression Models 235
Testing the Effects for Significance 235
Conclusions 238
Explorations 238
16. Recursions and Probability 240
Introduction 240
Conclusions 250
Explorations 250
17. Generating Functions and the Central Limit Theorem 251
Means and Variances 253
A Normal Approximation 254
Conclusions 255
Explorations 255
Bibliography 257
Where to Learn More 257
Index 259
Preface
Courses in probability and statistics are becoming very popular, both at the college
and at the high school level, primarily because they are crucial in the analysis of data
derived from samples and designed experiments and in statistical process control
in manufacturing. Curiously, while these topics have put statistics at the forefront
of scientific investigation, they are given very little emphasis in textbooks for these
courses.
This book has been written to provide instructors with material on these important
topics so that they may be included in introductory courses. In addition, it provides
instructors with examples that go beyond those commonly used. I have developed
these examples from my own long experience with students and with teachers in
teacher enrichment programs. It is hoped that these examples will be of interest in
themselves, thus increasing student motivation in the subjects and providing topics
students can investigate in individual projects.
Although some of these examples can be regarded as advanced, they are presented
here in ways to make them accessible at the introductory level. Examples include a
problem involving a run of defeats in baseball, a method of selecting the best candidate
from a group of applicants for a position, and an interesting set of problems involving
the waiting time for an event to occur.
Connections with geometry are frequent. The fact that the medians of a triangle
meet at a point becomes an extremely useful fact in the analysis of bivariate data;
problems in conditional probability, often a challenge for students, are solved using
only the area of a rectangle. Graphs allow us to see many solutions visually, and the
computer makes graphic illustrations and heretofore exceedingly difficult computa-
tions quick and easy.
Students searching for topics to investigate will find many examples in this book.
I think then of the book as providing both supplemental applications and novel
explanations of some significant topics, and trust it will prove a useful resource for
both teachers and students.
It is a pleasure to acknowledge the many contributions of Susanne Steitz-Filler,
my editor at John Wiley & Sons. I am most deeply grateful to my wife, Cherry; again,
she has been indispensable.
John Kinney
Colorado Springs
April 2009
Chapter 1
Probability and Sample
Spaces
CHAPTER OBJECTIVES:
• to introduce the theory of probability
• to introduce sample spaces
• to show connections with geometric series, including a way to add them without a formula
• to show a use of the Fibonacci sequence
• to use the binomial theorem
• to introduce the basic theorems of probability.
WHY STUDY PROBABILITY?
There are two reasons to study probability. One reason is that this branch of math-
ematics contains many interesting problems, some of which have very surprising
solutions. Part of its fascination is that some problems that appear to be easy are, in
fact, very difficult, whereas some problems that appear to be difficult are, in fact, easy
to solve. We will show examples of each of these types of problems in this book.
Some problems have very beautiful solutions.
The second, and compelling, reason to study probability is that it is the mathe-
matical basis for the statistical analysis of experimental data and the analysis of sample
survey data. Statistics, although relatively new in the history of mathematics, has
become a central part of science. Statistics can tell experimenters what observations
to take so that conclusions to be drawn from the data are as broad as possible. In
sample surveys, statistics tells us how many observations to take (usually, and counter-
intuitively, relatively small samples) and what kinds of conclusions can be taken from
the sample data.
Each of these areas of statistics is discussed in this book, but first we must establish
the probabilistic basis for statistics.
Some of the examples at the beginning may appear to have little or no practical
application, but these are needed to establish ideas since understanding problems
involving actual data can be very challenging without doing some simple problems
first.
PROBABILITY
A brief introduction to probability is given here with an emphasis on some unusual
problems to consider for the classroom. We follow this chapter with chapters on
permutations and combinations, conditional probability, geometric probability, and
then with a chapter on random variables and probability distributions.
We begin with a framework for thinking about problems that involve randomness
or chance.
SAMPLE SPACES
An experimenter has four doses of a drug under testing and four doses of an inert
placebo. If the drugs are randomly allocated to eight patients, what is the probability
that the experimental drug is given to the first four patients?
This problem appears to be very difficult. One of the reasons for this is that
we lack a framework in which to think about the problem. Most students lack a
structure for thinking about probability problems in general and so one must be
created. We will see that the problem above is in reality not as difficult as one might
presume.
Probability refers to the relative frequency with which events occur where there is some element of randomness or chance. We begin by enumerating, or showing,
the set of all the possible outcomes when an experiment involving randomness is
performed. This set is called a sample space.
We will not solve the problem involving the experimental drug here but instead
will show other examples involving a sample space.
EXAMPLE 1.1 A Production Line
Items coming off a production line can be classified as either good (G) or defective (D). We
observe the next item produced.
Here the set of all possible outcomes is
S = {G, D}
since one of these sample points must occur.
Now suppose we inspect the next five items that are produced. There are now 32 sample
points that are shown in Table 1.1.
Table 1.1
Point Good Runs Point Good Runs
GGGGG 5 1 GGDDD 2 2
GGGGD 4 2 GDGDD 2 4
GGGDG 4 3 DGGDD 2 3
GGDGG 4 3 DGDGD 2 5
GDGGG 4 3 DDGGD 2 3
DGGGG 4 2 DDGDG 2 4
DGGGD 3 3 DDDGG 2 2
DGGDG 3 4 GDDGD 2 4
DGDGG 3 4 GDDDG 2 3
DDGGG 3 2 DGDDG 2 4
GDDGG 3 3 GDDDD 1 2
GDGDG 3 5 DGDDD 1 3
GDGGD 3 4 DDGDD 1 3
GGDDG 3 3 DDDGD 1 3
GGDGD 3 4 DDDDG 1 2
GGGDD 3 2 DDDDD 0 1
We have shown in the second column the number of good items that occur with each
sample point. If we collect these points together we find the distribution of the number of good
items in Table 1.2.
It is interesting to see that these frequencies are exactly those that occur in the binomial
expansion of
2^5 = (1 + 1)^5 = 1 + 5 + 10 + 10 + 5 + 1 = 32
This is not coincidental; we will explain this subsequently.
The sample space also shows the number of runs that occur. A run is a sequence of like
adjacent results of length 1 or more, so the sample point GGDGG contains three runs while
the sample point GDGDD has four runs.
It is also interesting to see, in Table 1.3, the frequencies with which various numbers of
runs occur.
Table 1.2
Good Frequency
0 1
1 5
2 10
3 10
4 5
5 1
Table 1.3
Runs Frequency
1 2
2 8
3 12
4 8
5 2
We see a pattern, but not one as simple as the binomial expansion we saw previously. Only 2 of the 32 sequences (GDGDG and DGDGD) alternate perfectly, so like adjacent results are almost certain to occur somewhere in the sequence that is the sample point. The mean number of runs is 3. If a group is asked to write down a sequence
of, say, G’s and D’s, they are likely to write down too many runs; like symbols are very likely
to occur together. In a baseball season of 162 games, it is virtually certain that runs of several
wins or losses will occur. These might be noted as remarkable in the press; they are not. We
will explore the topic of runs more thoroughly in Chapter 12.
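The frequencies in Tables 1.2 and 1.3, and the mean of 3 runs, are easy to verify by brute force. The short Python sketch below (an illustration added here, not part of the original text) enumerates the 32 sample points and tallies the number of good items and the number of runs in each.

from itertools import product

good_freq = {}   # number of good items -> frequency
run_freq = {}    # number of runs -> frequency

for point in product("GD", repeat=5):        # the 32 sample points
    goods = point.count("G")
    # a new run begins at the first symbol and wherever the symbol changes
    runs = 1 + sum(1 for a, b in zip(point, point[1:]) if a != b)
    good_freq[goods] = good_freq.get(goods, 0) + 1
    run_freq[runs] = run_freq.get(runs, 0) + 1

print(sorted(good_freq.items()))  # [(0, 1), (1, 5), (2, 10), (3, 10), (4, 5), (5, 1)]
print(sorted(run_freq.items()))   # [(1, 2), (2, 8), (3, 12), (4, 8), (5, 2)]
print(sum(r * f for r, f in run_freq.items()) / 32)  # mean number of runs = 3.0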
One usually has a number of choices for the sample space. In this example, we could
choose the sample space that has 32 points or the sample space {0, 1, 2, 3, 4, 5} indicating the
number of good items or the set {1, 2, 3, 4, 5} showing the number of runs. So we have three
possible useful sample spaces.
Is there a “correct” sample space? The answer is “no”. The sample space chosen for an
experiment depends upon the probabilities one wishes to calculate. Very often one sample
space will be much easier to deal with than another for a problem, so alternative sample spaces
provide different ways for viewing the same problem. As we will see, the probabilities assigned
to these sample points are quite different.
We should also note that good and defective items usually do not come off production lines
at random. Items of the same sort are likely to occur together. The frequency of defective items
is usually extremely small, so the sample points are by no means equally likely. We will return
to this when we consider acceptance sampling in Chapter 2 and statistical process control in
Chapter 11.
᭿
EXAMPLE 1.2 Random Arrangements
The numbers 1, 2, 3, and 4 are arranged in a line at random.
The sample space here consists of all the possible orders, as shown below.
S = { 1234*   2134*   3124*   4123
      1243*   2143    3142    4132*
      1324*   2314*   3214*   4231*
      1342*   2341    3241*   4213*
      1423*   2413    3412    4312
      1432*   2431*   3421    4321 }
S here contains 24 elements, the number of possible linear orders, or arrangements of
4 distinct items. These arrangements are called permutations. We will consider permutations
more generally in Chapter 2.
A well-known probability problem arises from the above permutations. Suppose the
“natural” order of the four integers is 1234. If the four integers are arranged randomly, how
many of the integers occupy their own place? For example, in the order 3214, the integers 2
and 4 are in their own place. By examining the sample space above, it is easy to count the
permutations in which at least one of the integers is in its own place. These are marked with
an asterisk in S. We find 15 such permutations, so 15/24 = 0.625 of the permutations have at
least one integer in its own place.
Now what happens as we increase the number of integers? This leads to the well-known
“hat check” problem that involves n people who visit a restaurant and each check a hat, receiving
a numbered receipt. Upon leaving, however, the hats are distributed at random. So the hats are
distributed according to a random permutation of the integers 1, 2, . . . , n. What is the probability that at least one diner gets his own hat?
If there are four diners, we see that this probability is 62.5%. Increasing
the number of diners complicates the problem greatly if one is thinking of listing all the orders
and counting the appropriate orders as we have done here. It is possible, however, to find the
answer without proceeding in this way. We will show this in Chapter 2.
It is perhaps surprising, and counterintuitive, to learn that the probability for 100 people differs little from that for 4 people! In fact, the probability approaches 1 − 1/e = 0.632121 as n increases. (To six decimal places, this is also the value for 10 diners.) This is our
first, but by no means our last, encounter with e = 2.71828 . . . , the base of the system of
natural logarithms. The occurrence of e in probability, however, has little to do with natural
logarithms.
᭿
The next example also involves e.
EXAMPLE 1.3 Running Sums
A box contains slips of paper numbered 1, 2, and 3, respectively. Slips are drawn one at a time,
replaced, and a cumulative running sum is kept until the sum equals or exceeds 4.
This is an example of a waiting time problem; we wait until an event occurs. The event
can occur in two, three, or four drawings. (It must occur no later than the fourth drawing.)
The sample space is shown in Table 1.4, where n is the number of drawings and the sample
points show the order in which the integers were selected.
Table 1.4
n Orders
(1,3),(3,1),(2,2)
2
(2,3),(3,2),(3,3)
(1,1,2),(1,1,3),(1,2,1),(1,2,2)
3
(1,2,3),(2,1,1),(2,1,2),(2,1,3)
4 (1,1,1,1),(1,1,1,2),(1,1,1,3)
Table 1.5
n Expected value
1 2.00
2 2.25
3 2.37
4 2.44
5 2.49
We will show later that the expected number of drawings is 2.37.
What happens as the number of slips of paper increases? The approach used here becomes increasingly difficult. Table 1.5 shows exact results for small values of n, where we draw until the sum equals or exceeds n + 1.
As the value of n increases, the expected length of the game increases, but at a decreasing rate. It is too difficult to show here, but the expected length of the game approaches e = 2.71828 . . . as n increases.
This does, however, make a very interesting classroom exercise either by generating ran-
dom numbers within the specified range or by a computer simulation. The result will probably
surprise students of calculus and be an interesting introduction to e for other students.
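As one possible version of that exercise, here is a minimal Python sketch (illustrative only; the function name and trial count are arbitrary) that draws from {1, 2, . . . , n} until the running sum equals or exceeds n + 1 and averages the number of draws. For n = 3 the average settles near 2.37, and it creeps toward e as n grows.

import random

def average_draws(n, trials=100_000):
    """Estimate the expected number of draws from {1, ..., n}
    needed for the running sum to reach at least n + 1."""
    total = 0
    for _ in range(trials):
        s = draws = 0
        while s < n + 1:
            s += random.randint(1, n)
            draws += 1
        total += draws
    return total / trials

for n in (1, 2, 3, 4, 5, 20):
    print(n, round(average_draws(n), 3))   # values rise toward e = 2.71828...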
᭿
EXAMPLE 1.4 An Infinite Sample Space
Examples 1.1, 1.2, and 1.3 are examples of finite sample spaces, since they contain a finite
number of elements. We now consider an infinite sample space.
We observe a production line until a defective (D) item appears. The sample space now is
infinite since the event may never occur. The sample space is shown below (where G denotes
a good item).
S = {D, GD, GGD, GGGD, . . .}
We note that S in this case is a countable set, that is, a set that can be put in one-to-one
correspondence with the set of positive integers. Countable sample spaces often behave as if
they were finite. Uncountably infinite sample spaces are also encountered in probability, but
we will not consider these here.
᭿
EXAMPLE 1.5 Tossing a Coin
We toss a coin five times and record the tosses in order. Since there are two possibilities on each toss, there are 2^5 = 32 sample points. A sample space is shown below.
S = { TTTTT  TTTTH  TTTHT  TTHTT  THTTT  HTTTT
      HHTTT  HTHTT  HTTHT  TTTHH  THTHT  TTHHT
      HTTHH  HTTTH  THTTH  TTHTH  HHTTH  THTHH
      THHTH  THHHT  TTHHH  HTHTH  THHTT  HHHTT
      HTHHT  HHTHT  HHHHT  HHHTH  HHTHH  HTHHH
      THHHH  HHHHH }
It is also possible in this example simply to count the number of heads, say, that occur. In that case, the sample space is

S1 = {0, 1, 2, 3, 4, 5}

Both S and S1 are sets that contain all the possibilities when the experiment is performed and so are sample spaces. So we see that the sample space is not uniquely defined. Perhaps one can think of other sets that describe the sample space in this case.
᭿
EXAMPLE 1.6 AP Statistics
A class in advanced placement statistics consists of three juniors (J) and four seniors (S). It is
desired to select a committee of size two. An appropriate sample space is
S = {JJ, JS, SJ, SS}
where we have shown the class of the students selected in order. One might also simply count
the number of juniors on the committee and use the sample space
S1 = {0, 1, 2}

Alternatively, one might consider the individual students selected so that the sample space, shown below, becomes

S2 = {J1J2, J1J3, J2J3, S1S2, S1S3, S1S4, S2S3, S2S4, S3S4,
      J1S1, J1S2, J1S3, J1S4, J2S1, J2S2, J2S3, J2S4, J3S1,
      J3S2, J3S3, J3S4}
S2 is as detailed a sample space as one can think of, if order of selection is disregarded, so one might think that these 21 sample points are equally likely to occur provided no priority is given to any of the particular individuals. So we would expect that each of the points in S2 would occur about 1/21 of the time. We will return to assigning probabilities to the sample points in S and S2 later in this chapter.
᭿
EXAMPLE 1.7 Let’s Make a Deal
On the television program Let’s Make a Deal, a contestant is shown three doors, only one of
which hides a valuable prize. The contestant chooses one of the doors and the host then opens
one of the remaining doors to show that it is empty. The host then asks the contestant if she
wishes to change her choice of doors from the one she selected to the remaining door.
Let W denote a door with the prize and E1 and E2 the empty doors. Supposing that the contestant switches choices of doors (which, as we will see in a later chapter, she should do), and we write the contestant’s initial choice and then the door she finally ends up with, the sample space is

S = {(W, E1), (W, E2), (E1, W), (E2, W)}
᭿
EXAMPLE 1.8 A Birthday Problem
A class in calculus has 10 students. We are interested in whether or not at least two of the
students share the same birthday. Here the sample space, showing all possible birthdays, might consist of points each listing the 10 birthdays. We can only show part of the sample space since it contains 365^10 = 4.1969 × 10^25 points! Here
S = {(March 10, June 15, April 24, . . .), (May 5, August 2, September 9, . . . .)}
It may seem counterintuitive, but we can calculate the probability that at least two of the
students share the same birthday without enumerating all the points in S. We will return to this
problem later.
᭿
Now we continue to develop the theory of probability.
SOME PROPERTIES OF PROBABILITIES
Any subset of a sample space is called an event. In Example 1.1, the occurrence of a good item is an event. In Example 1.2, the set of sample points where the number 3 is to the left of the number 2 is an event. In Example 1.4, the set of sample points where the first defective item occurs in an even number of items is an event. In Example 1.5, the set of sample points where exactly four heads occur is an event.
We wish to calculate the relative likelihood, or probability, of these events. If we
try an experiment n times and an event occurs t times, then the relative likelihood
of the event is t/n. We see that relative likelihoods, or probabilities, are numbers
between 0 and 1. If A is an event in a sample space, we write P(A) to denote the
probability of the event A.
Probabilities are governed by these three axioms:
1. P(S) = 1.
2. 0 ≤ P(A) ≤ 1.
3. If events A and B are disjoint, so that A ∩ B = ∅, then
P(A ∪ B) = P(A) +P(B).
Axioms 1 and 2 are fairly obvious; the probability assigned to the entire sample
space must be 1 since by definition of the sample space some point in the sample
space must occur and the probability of an event must be between 0 and 1. Now if an
event A occurs with probability P(A) and an event B occurs with probability P(B)
and if the events cannot occur together, then the relative frequency with which one
or the other occurs is P(A) +P(B). For example, if a prospective student decides to
attend University A with probability 2/5 and to attend University B with probability
1/5, she will attend one or the other (but not both) with probability 2/5 +1/5 = 3/5.
This explains Axiom 3.
It is also very useful to consider an event, say A, as being composed of distinct points, say a_i, with probabilities p(a_i). By Axiom 3 we can add these individual probabilities to compute P(A), so

P(A) = p(a_1) + p(a_2) + · · · + p(a_n)
It is perhaps easiest to consider a finite sample space, but our conclusions also
apply to a countably infinite sample space. Example 1.4 involved a countably infinite
sample space; we will encounter several more examples of these sample spaces in
Chapter 7.
Disjoint events are also called mutually exclusive events.
Let Ā denote the set of points in the sample space where event A does not occur. Note that A and Ā are mutually exclusive so

P(S) = P(A ∪ Ā) = P(A) + P(Ā) = 1

and so we have

Fact 1. P(Ā) = 1 − P(A).
Axiom 3 concerns events that are mutually exclusive. What if they are not mutually
exclusive?
Refer to Figure 1.1.
[Figure 1.1: Venn diagram of events A and B with overlapping region A ∩ B]
If we find P(A) +P(B) by adding the probabilities of the distinct points in those
events, then we have counted P(A ∩ B) twice so
Fact 2. P(A ∪ B) = P(A) + P(B) − P(A ∩ B),
where Fact 2 applies whether events A and B are disjoint or not.
Fact 2 is known as the addition theorem for two events. It can be generalized to
three or more events in Fact 3:
Fact 3. (General addition theorem).
P(A1 ∪ A2 ∪ · · · ∪ An) = Σ P(A_i) − Σ P(A_i ∩ A_j) + Σ P(A_i ∩ A_j ∩ A_k) − · · · ± P(A1 ∩ A2 ∩ · · · ∩ An)

where the first sum is over all the events A_i, the second over all pairs of distinct events, the third over all triples of distinct events, and so on.
We simply state this theorem here. We prefer to prove it using techniques devel-
oped in Chapter 2, so we delay the proof until then.
Now we turn to events that can occur together.
EXAMPLE 1.9 Drawing Marbles
Suppose we have a jar of marbles containing five red and seven green marbles. We draw
them out one at a time (without replacing the drawn marble) and want the probability that
the first drawn marble is red and the second green. Clearly, the probability the first is red is
5/12. As for the second marble, the contents of the jar have now changed, and the probability
the second marble is green given that the first marble is red is now 7/11. We conclude that the
probability the first marble is red and the second green is 5/12 · 7/11 = 35/132. The fact that
the composition of the jar changes with the selection of the first marble alters the probability
of the color of the second marble.
The probability the second marble is green given that the first is red, 7/11, differs from
the probability the marble is green, 7/12.
We call the probability the second marble is green, given the first is red, the
conditional probability of a green, given a red. We say in general that
P(A ∩ B) = P(A) · P(B|A)
where we read P(B|A) as the conditional probability of event B given that event A has occurred.
This is called the multiplication rule.
Had the first marble been replaced before making the second drawing, the probability
of drawing a green marble on the second drawing would have been the same as drawing a
green marble on the first drawing, 7/12. In this case, P(B|A) = P(B), and the events are called
independent.
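A quick simulation makes both numbers concrete. The sketch below (illustrative only, not from the text) draws two marbles without replacement from a jar of five red and seven green marbles and estimates P(first red and second green) ≈ 35/132 as well as the conditional probability P(second green | first red) ≈ 7/11.

import random

jar = ["R"] * 5 + ["G"] * 7
trials = 200_000
both = first_red = 0

for _ in range(trials):
    first, second = random.sample(jar, 2)   # two marbles drawn without replacement
    if first == "R":
        first_red += 1
        if second == "G":
            both += 1

print(both / trials)      # about 35/132 = 0.2652: P(first red and second green)
print(both / first_red)   # about 7/11 = 0.6364: P(second green | first red)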
᭿
We will study conditional probability in Chapter 3.
Independent events and disjoint events are commonly confused. Independent
events refer to events that can occur together; disjoint events cannot occur together.
We refer now to events that do not have probability 0 (such events are encountered in
nondenumerably infinite sample spaces).
If events are independent, then they cannot be disjoint since they must be able to
occur together; if events are disjoint, then they cannot be independent because they
cannot occur together.
FINDING PROBABILITIES OF EVENTS
The facts about probabilities, as shown in the previous section, are fairly easy. The
difficulty arises when we try to apply them.
The first step in any probability problem is to define an appropriate sample space.
More than one sample space is possible; it is usually the case that if order is considered,
then the desired probabilities can be found, because that is the most detailed sample
space one can write, but it is not always necessary to consider order.
Let us consider the examples for which we previously found sample
spaces.
EXAMPLE 1.10 A Binomial Problem
In Example 1.1, we examined an item emerging from a production line and observed the result.
It might be sensible to assign the probabilities to the events as P(G) = 1/2 and P(D) = 1/2
if we suppose that the production line is not a very good one. This is an example of a binomial
event (where one of the two possible outcomes occurs at each trial) but it is not necessary to
assign equal probabilities to the two outcomes.
It is a common error to presume, because a sample space has n points, that each point
has probability 1/n. For another example, when a student takes a course she will either
pass it or fail it, but it would not be usual to assign equal probabilities to the two events.
But if we toss a fair coin, then we might have P(Head) = 1/2 and P(Tail) = 1/2. We
might also consider a loaded coin where P(H) = p and P(T) = q = 1 −p where, of course,
0 ≤ p ≤ 1.
It is far more sensible to suppose in our production line example that P(G) = 0.99 and
P(D) = 0.01 and even these assumptions assume a fairly poor production line. In that case,
and assuming the events are independent, we then find that
P(GGDDG) = P(G) · P(G) · P(D) · P(D) · P(G) = (0.99)^3 · (0.01)^2 = 0.000097
Also, since the sample points are disjoint, we can compute the probability we see exactly two defective items as

P(GGDDG) + P(DGDGG) + P(DGGDG) + P(DGGGD) + P(GDGGD) + P(GGDGD)
+ P(GGGDD) + P(GDGDG) + P(GDDGG) + P(DDGGG) = 0.00097

Note that the probability above must be 10 · P(GGDDG) = 10 · (0.99)^3 · (0.01)^2 since each of the 10 orders shown above has the same probability. Note also that 10 · (0.99)^3 · (0.01)^2 is a term in the binomial expansion (0.99 + 0.01)^5.
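These values are quick to confirm on a computer; the following sketch (an illustration under the stated probabilities, not part of the book) computes the probability of exactly two defective items among five.

from math import comb

p_good, p_def = 0.99, 0.01
n, k = 5, 2                           # five items, exactly two defective

one_order = p_good**3 * p_def**2      # e.g., the order GGDDG
all_orders = comb(n, k) * one_order   # 10 equally likely orders

print(one_order)    # about 0.000097
print(all_orders)   # about 0.00097, a term of the expansion (0.99 + 0.01)^5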
᭿
We will consider more problems involving the binomial theorem in Chapter 5.
EXAMPLE 1.11 More on Arrangements
In Example 1.2, we considered all the possible permutations of four objects. Thinking that
these permutations occur at random, we assign probability 1/24 to each of the sample points.
The event “3 occurs to the left of 2” then consists of the points
{3124, 3142, 4132, 1324, 3214, 1342, 3241, 3412, 4312, 1432, 3421, 4321}
Since there are 12 of these and since they are mutually disjoint and since each has proba-
bility 1/24, we find
P(3 occurs to the left of 2) = 12/24 = 1/2
We might have seen this without so much work if we considered the fact that in a random
permutation, 3 is as likely to be to the left of 2 as to its right. As you were previously warned,
easy looking problems are often difficult while difficult looking problems are often easy. It is
all in the way one considers the problem.
᭿
EXAMPLE 1.12 Using a Geometric Series
Example 1.4 is an example of a waiting time problem; that is, we do not have a determined
number of trials, but we wait for an event to occur. If we consider the manufacturing process
to be fairly poor and the items emerging from the production line are independent, then one
possible assignment of probabilities is shown in Table 1.6.
Table 1.6
Event     Probability
D         0.01
GD        0.99 · 0.01 = 0.0099
GGD       (0.99)^2 · 0.01 = 0.009801
GGGD      (0.99)^3 · 0.01 = 0.009703
. . .     . . .
We should check that the probabilities add up to 1. We find that (using S for sum now)

S = 0.01 + 0.99 · 0.01 + (0.99)^2 · 0.01 + (0.99)^3 · 0.01 + · · ·

and so

0.99S = 0.99 · 0.01 + (0.99)^2 · 0.01 + (0.99)^3 · 0.01 + · · ·

and subtracting one series from another we find

S − 0.99S = 0.01

or

0.01S = 0.01

and so S = 1.
This is also a good opportunity to use the geometric series formula to find the sum, but we will have use for the above trick in later chapters for series that are not geometric.
What happens if we assign arbitrary probabilities to defective items and good items? This
would certainly be the case with an effective production process. If we let P(D) = p and
P(G) = 1 −p = q, then the probabilities appear as shown in Table 1.7.
Table 1.7
Event     Probability
D         p
GD        qp
GGD       q^2 p
GGGD      q^3 p
. . .     . . .
Again, have we assigned a probability of 1 to the entire sample space? Letting S stand for sum again, we have

S = p + qp + q^2 p + q^3 p + · · ·

and so

qS = qp + q^2 p + q^3 p + · · ·

and subtracting, we find

S − qS = p

so

(1 − q)S = p

or

pS = p

meaning that S = 1.
This means that our assignment of probabilities is correct for any value of p.
Now let us find the probability that the first defective item occurs at an even-numbered observation. Let this event be denoted by E.
P(E) = qp + q^3 p + q^5 p + · · ·

and so

q^2 · P(E) = q^3 p + q^5 p + · · ·

and subtracting we find

P(E) − q^2 · P(E) = qp

from which it follows that

(1 − q^2) · P(E) = qp

and so

P(E) = qp / (1 − q^2) = qp / [(1 − q)(1 + q)] = q / (1 + q)
If the process produces items with the above probabilities, this becomes 0.99/(1 + 0.99) ≈ 0.497. One might presume that the probability the first defective item
occurs at an even-numbered observation is the same as the probability the first defective item
occurs at an odd-numbered observation. This cannot be correct, however, since the probability
the first defective item occurs at the first observation (an odd-numbered observation) is p. It
is easy to show that the probability the first defective item occurs at an odd-numbered obser-
vation is 1/(1 +q), and for a process with equal probabilities, such as tossing a fair coin, this
is 2/3.
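The algebra is easy to double-check numerically. A small sketch (illustrative only, not from the text) sums the series for the probability that the first defective item appears at an even-numbered observation and compares it with q/(1 + q), both for the production line with q = 0.99 and for a fair coin with q = 1/2.

def p_first_success_even(q, terms=2000):
    """Sum q*p + q^3*p + q^5*p + ... with p = 1 - q."""
    p = 1 - q
    return sum(q ** (2 * k + 1) * p for k in range(terms))

for q in (0.99, 0.5):
    print(q, p_first_success_even(q), q / (1 + q))   # the two values agree

# For the fair coin the complementary (odd-numbered) probability is 1/(1 + q) = 2/3.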
᭿
EXAMPLE 1.13 Relating Two Sample Spaces
Example 1.5 considers a binomial event where we toss a coin five times. In the first sample
space, S, we wrote out all the possible orders in which the tosses could occur. This is of course
impossible if we tossed the coin, say, 10,000 times! In the second sample space, S1, we simply looked at the number of heads that occurred. The difference is that the sample points are not equally likely.
In the first sample space, where we enumerated the result of each toss, using the fact that the tosses are independent, and assuming that the coin is loaded, where P(H) = p and P(T) = 1 − p = q, we find, to use two examples, that

P(TTTTT) = q^5 and P(HTTHH) = p^3 q^2

Now we can relate the two sample spaces. In S1, P(0) = P(0 heads) = P(TTTTT) = q^5.
Now P(1 head) is more complex since the single head can occur in one of the five possible places. Since these sample points are mutually disjoint, P(1 head) = 5 · p · q^4.
There are 10 points in S where two heads appear. Each of these points has probability p^2 · q^3, so P(2 heads) = 10 · p^2 · q^3.
We find, similarly, that P(3 heads) = 10 · p^3 · q^2, P(4 heads) = 5 · p^4 · q, and, finally, P(5 heads) = p^5. So the sample points in S1 are far from being equally likely. If we add all these probabilities, we find

q^5 + 5 · p · q^4 + 10 · p^2 · q^3 + 10 · p^3 · q^2 + 5 · p^4 · q + p^5

which we recognize as the binomial expansion (q + p)^5 that is 1 since q = 1 − p.
In a binomial situation (where one of the two possible outcomes occurs at each trial) with n observations, we see that the probabilities are the individual terms in the binomial expansion (q + p)^n.
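The correspondence between the detailed sample space S and the condensed space S1 can be checked mechanically. The sketch below (illustrative; the value of p is arbitrary) builds the probabilities of 0 through 5 heads both ways: by enumerating the 32 ordered outcomes and from the binomial terms.

from itertools import product
from math import comb

p = 0.3            # an arbitrary loaded coin, P(H) = p
q = 1 - p

# probabilities from the detailed sample space S of 32 ordered outcomes
from_S = [0.0] * 6
for outcome in product("HT", repeat=5):
    heads = outcome.count("H")
    from_S[heads] += p ** heads * q ** (5 - heads)

# probabilities from the condensed space S1 = {0, 1, 2, 3, 4, 5}
from_S1 = [comb(5, k) * p ** k * q ** (5 - k) for k in range(6)]

print(from_S)        # the two lists agree term by term
print(from_S1)
print(sum(from_S1))  # 1.0, the binomial expansion (q + p)^5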
᭿
EXAMPLE 1.14 Committees and Probability
In Example 1.6, we chose a committee of two students from a class with three juniors and four
seniors. The sample space we used is
S = {JJ, JS, SJ, SS}
How should probabilities be assigned to the sample points? First we realize that each
sample point refers to a combination of events so that JJ means choosing a junior first and
then choosing another junior. So JJ really refers to J ∩ J whose probability is
P(J ∩ J) = P(J) · P(J|J)
by the multiplication rule. Now P(J) = 3/7 since there are three juniors and we regard the
selection of the students as equally likely. Now, with one of the juniors selected, we have only
two juniors to choose from, so P(J|J) = 2/6 and so
P(J ∩ J) = (3/7) · (2/6) = 1/7

In a similar way, we find

P(J ∩ S) = (3/7) · (4/6) = 2/7
P(S ∩ J) = (4/7) · (3/6) = 2/7
P(S ∩ S) = (4/7) · (3/6) = 2/7
These probabilities add up to 1 as they should.
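Because the 21 pairs in S2 are equally likely, the same probabilities can also be found by counting pairs. A small sketch (illustrative only, not part of the book):

from itertools import combinations

students = ["J1", "J2", "J3", "S1", "S2", "S3", "S4"]
pairs = list(combinations(students, 2))          # the 21 equally likely committees

def frac(cond):
    return sum(1 for pair in pairs if cond(pair)) / len(pairs)

print(len(pairs))                                            # 21
print(frac(lambda p: p[0][0] == "J" and p[1][0] == "J"))     # 3/21 = 1/7, two juniors
print(frac(lambda p: {p[0][0], p[1][0]} == {"J", "S"}))      # 12/21 = 4/7 = 2/7 + 2/7
print(frac(lambda p: p[0][0] == "S" and p[1][0] == "S"))     # 6/21 = 2/7, two seniors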
᭿
EXAMPLE 1.15 Let’s Make a Deal
Example 1.7 is the Let’s Make a Deal problem. It has been widely written about since it is easy
to misunderstand the problem. The contestant chooses one of the doors that we have labeled W, E1, and E2. We suppose again that the contestant switches doors after the host exhibits one of the nonchosen doors to be empty.
If the contestant chooses W, then the host has two choices of empty doors to exhibit. Suppose he chooses these with equal probabilities. Then W ∩ E1 means that the contestant initially chooses W, the host exhibits E2, the contestant switches doors and ends up with E1. The probability of this is then

P(W ∩ E1) = P(W) · P(E1|W) = (1/3) · (1/2) = 1/6

In an entirely similar way,

P(W ∩ E2) = P(W) · P(E2|W) = (1/3) · (1/2) = 1/6
Using the switching strategy, the only way the contestant loses is by selecting the winning
door first (and then switching to an empty door), so the probability the contestant loses is the
sum of these probabilities, 1/6 +1/6 = 1/3, which is just the probability of choosing W in
the first place. It follows that the probability of winning under this strategy is 2/3!
Another way to see this is to calculate the probabilities of the two ways of winning, namely, P(E1 ∩ W) and P(E2 ∩ W). In either of these, an empty door is chosen first. This means that the host has only one choice for exhibiting an empty door. So each of these probabilities is simply the probability of choosing the specified empty door first, which is 1/3. The sum of these probabilities is 2/3, as we found before.
After the contestant selects a door, the probability the winning door is one not chosen is
2/3. The fact that one of these is shown to be empty does not change this probability.
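Exploration 2 at the end of this chapter asks for exactly this experiment; a minimal simulation sketch (illustrative only, not part of the book) compares the two strategies.

import random

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        doors = ["W", "E", "E"]
        random.shuffle(doors)
        choice = random.randrange(3)
        # the host opens an empty door the contestant did not choose
        host = next(i for i in range(3) if i != choice and doors[i] == "E")
        if switch:
            choice = next(i for i in range(3) if i not in (choice, host))
        wins += doors[choice] == "W"
    return wins / trials

print(play(switch=True))    # about 2/3
print(play(switch=False))   # about 1/3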
᭿
EXAMPLE 1.16 A Birthday Problem
To think about the birthday problem in Example 1.8, we will use the fact that P(Ā) = 1 − P(A). So if A denotes the event that the birthdays are all distinct, then Ā denotes the event that at least two of the birthdays are the same.
To find P(A), note that the first person can have any birthday in the 365 possible birthdays,
the next can choose any day of the 364 remaining, the next has 363 choices, and so on.
If there are 10 students in the class, then
P(at least two birthdays are the same) = 1 − (365/365) · (364/365) · (363/365) · · · · · (356/365) = 0.116948
If there are n students, we find

P(at least two birthdays are the same) = 1 − (365/365) · (364/365) · (363/365) · · · · · ((366 − (n − 1))/365)
This probability increases as n increases. It is slightly more than 1/2 if n = 23, while if
n = 40, it is over 0.89. These calculations can be made with your graphing calculator.
This result may be surprising, but note that any two people in the group can share a
birthday; this is not the same as finding someone whose birthday matches, say, your birthday.
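For readers who prefer code to a graphing calculator, a sketch of the same computation (illustrative only, not part of the book):

def p_shared_birthday(n):
    """Probability that at least two of n people share a birthday (365 equally likely days)."""
    p_all_distinct = 1.0
    for k in range(n):
        p_all_distinct *= (365 - k) / 365
    return 1 - p_all_distinct

for n in (10, 23, 40):
    print(n, round(p_shared_birthday(n), 6))   # about 0.116948, 0.507297, 0.891232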
᭿
CONCLUSIONS
This chapter has introduced the idea of the probability of an event and has given us
a framework, called a sample space, in which to consider probability problems. The
axioms on which probability is based have been stated and some theorems resulting from them have been derived.
EXPLORATIONS
1. Consider all the arrangements of the integers 1, 2, 3, and 4. Count the number
of derangements, that is, the number of arrangements in which no integer
occupies its own place. Speculate on the relative frequency of the number of
derangements as the number of integers increases.
2. Simulate the Let’s Make a Deal problem by taking repeated selections from
three cards, one of which is designated to be the prize. Compare two strategies:
(1) never changing the selection and (2) always changing the selection.
3. A hat contains tags numbered 1, 2, 3, 4, 5, 6. Two tags are selected. Show the
sample space and then compare the probability that the number on the second
tag exceeds the number on the first tag when (a) the first tag is not replaced
before the second tag is drawn and (b) the first tag is replaced before the second
tag is drawn.
4. Find the probability of (a) exactly three heads and (b) at most three heads
when a fair coin is tossed five times.
5. If p is the probability of obtaining a 5 at least once in n tosses of a fair die,
what is the least value of n so that p ≥ 1/2?
6. Simulate drawing integers from the set 1, 2, 3 until the sum exceeds 4. Com-
pare your mean value to the expected value given in the text.
7. Toss a fair coin 100 times and find the frequencies of the number of runs.
Repeat the experiment as often as you can.
8. Use a computer to simulate tossing a coin 1000 times and find the frequencies
of the number of runs produced.
Chapter 2
Permutations and
Combinations: Choosing the
Best Candidate; Acceptance
Sampling
CHAPTER OBJECTIVES:
• to discuss permutations and combinations
• to use the binomial theorem
• to show how to select the best candidate for a position
• to encounter an interesting occurrence of e
• to show how sampling can improve the quality of a manufactured product
• to use the principle of maximum likelihood
• to apply permutations and combinations to other practical problems.
An executive in a company has an opening for an executive assistant. Twenty can-
didates have applied for the position. The executive is constrained by company rules
that say that candidates must be told whether they are selected or not at the time of an
interview. How should the executive proceed so as to maximize the chance that the
best candidate is selected?
Manufacturers of products commonly submit their product to inspection before
the product is shipped to a consumer. This inspection usually measures whether or
not the product meets the manufacturer’s as well as the consumer’s specifications.
If the product inspection is destructive, however (such as determining the length of
time a light bulb will burn), then all the manufactured product cannot be inspected.
Even if the inspection is not destructive or harmful to the product, inspection of all the
product manufactured is expensive and time consuming. If the testing is destructive,
it is possible to inspect only a random sample of the product produced. Can random
sampling improve the quality of the product sold?
We will consider each of these problems, as well as several others, in this
chapter. First we must learn to count points in sets, so we discuss permutations and
combinations as well as some problems solved using them.
PERMUTATIONS
A permutation is a linear arrangement of objects, or an arrangement of objects in a
row, in which the order of the objects is important.
For example, if we have four objects, which we will denote by a, b, c, and d, there
are 24 distinct linear arrangements as shown in Table 2.1
Table 2.1
abcd abdc acbd acdb adbc adcb
bacd badc bcda bcad bdac bdca
cabd cadb cbda cbad cdab cdba
dabc dacb dbca dbac dcab dcba
In Chapter 1, we showed all the permutations of the set {1, 2, 3, 4} and, of course,
found 24 of these. To count the permutations, we need a fundamental principle first.
Counting Principle
Fundamental Counting Principle. If an event A can occur in n ways and an event
B can occur in m ways, then A and B can occur in n · m ways.
The proof of this can be seen in Figure 2.1, where we have taken n = 2 and
m = 3. There are 2 · 3 = 6 paths from the starting point. It is easy to see that the
branches of the diagram can be generalized.
The counting principle can be extended to three or more events by simply multi-
plying the number of ways subsequent events can occur. The number of permutations
of the four objects shown above can be counted as follows, assuming for the moment
that all of the objects to be permuted are distinct.
Since there are four positions to be filled to determine a unique permutation,
we have four choices for the letter or object in the leftmost position. Proceeding to
the right, there are three choices of letter or object to be placed in the next position.
This gives 4 · 3 possible choices in total. Now we are left with two choices for the
object in the next position and finally with only one choice for the rightmost position.
So there are 4 · 3 · 2 · 1 = 24 possible permutations of these four objects. We have
made repeated use of the counting principle.
We denote 4 · 3 · 2 · 1 as 4! (read as “4 factorial”).
[Figure 2.1: tree diagram with two first-stage branches, each splitting into three second-stage branches labeled 1, 2, 3, giving 2 · 3 = 6 paths]
It follows that there are n! possible permutations of n distinct objects.
The number of permutations of n distinct objects grows rapidly as n increases as
shown in Table 2.2.
Table 2.2
n n!
1 1
2 2
3 6
4 24
5 120
6 720
7 5,040
8 40,320
9 362,880
10 3,628,800
Permutations with Some Objects Alike
Sometimes not all of the objects to be permuted are distinct. For example, suppose
we have 3, A’s, 4 B’s, and 5 C’s to be permuted, or 12 objects all together. There are
not 12! permutations, since the A’s are not distinguishable from each other, nor are
the B’s, nor are the C’s.
Suppose we let G be the number of distinct permutations and that we have a list
of these permutations. Now number the A’s from 1 to 3.These can be permuted in 3!
ways; so, if we permute the A’s in each itemin our list, the list nowhas 3!Gitems. Now
label the B’s from1 to 4 and permute the B’s in each itemin the list in all 4! ways. The
list now has 4!3!G items. Finally, number the 5 C’s and permute these for each item
in the list. The list now contains 5!4!3!G items. But now each of the items is distinct,
so the list has 12! items. We see that 5!4!3!G = 12!, so G = 12!/(5!4!3!) or G = 27,720, and this is considerably less than 12! = 479,001,600.
Permuting Only Some of the Objects
Now suppose that we have n distinct objects and we wish to permute r of them, where r ≤ n. We now have r boxes to fill. This can be done in

n · (n − 1) · (n − 2) · · · [n − (r − 1)] = n · (n − 1) · (n − 2) · · · (n − r + 1)

ways. If r < n, this expression is not a factorial, but can be expressed in terms of factorials by multiplying and dividing by (n − r)! We see that

n · (n − 1) · (n − 2) · · · (n − r + 1) = [n · (n − 1) · (n − 2) · · · (n − r + 1) · (n − r)!] / (n − r)! = n! / (n − r)!
We will have little use for this formula. We derived it so that we can count the
number of samples that can be chosen from a population, which we do subsequently.
For the formula to work for any value of r, we define 0! = 1.
We remark now that the 20 applicants to the executive faced with choosing a new assistant could appear in 20! = 2,432,902,008,176,640,000 different orders.
Selecting the best of the group by making a random choice means that the best
applicant has a 1/20 = 0.05 chance of being selected, a fairly low probability. So the
executive must create a better procedure. The executive can, as we will see, choose
the best candidate with a probability approaching 1/3, but that is something we will
discuss much later.
There are 52! distinct arrangements of a deck of cards. This number is of the order 8 · 10^67. It is surprising to find, if we could produce 10,000 distinct permutations of these per second, that it would take about 2 · 10^56 years to enumerate all of these.
We usually associate impossible events with infinite sets, but this is an example of a
finite set for which this event is impossible.
For example, suppose we have four objects (a, b, c, and d again) and that we
wish to permute only two of these. We have four choices for the leftmost position and
three choices for the second position, giving 4 · 3 = 12 permutations.
Applying the formula we have n = 4 and r = 2, so

4P2 = 4!/(4 − 2)! = 4!/2! = (4 · 3 · 2!)/2! = 4 · 3 = 12

giving the correct result.
Permutations are often the basis for a sample space in a probability problem.
Here are two examples.
EXAMPLE 2.1 Lining Up at a Counter
Jim, Sue, Bill, and Kate stand in line at a ticket counter. Assume that all the possible permuta-
tions, or orders, are equally likely. There are 4! = 24 of these permutations. If we want to find
the probability that Sue is in the second place, we must count the number of ways in which she
could be in the second place. To count these, first put her there—there is only one way to do
this. This leaves three choices for the first place, two choices for the third place, and, finally,
only one choice for the fourth place. There are then 3 · 1 · 2 · 1 = 6 ways for Sue to occupy
the second place. So
P(Sue is in the second place) = (3 · 1 · 2 · 1)/(4 · 3 · 2 · 1) = 6/24 = 1/4
This is certainly no surprise. We would expect that any of the four people has a probability
of 1/4 to be in any of the four positions.
᭿
EXAMPLE 2.2 Arranging Marbles
Five red and seven blue marbles are arranged in a row. We want to find the probability that both
the end marbles are red.
Number the marbles from 1 to 12, letting the red marbles be numbered from 1 to 5 for
convenience. The sample space consists of all the possible permutations of 12 distinct objects,
so the sample space contains 12! points, each of which, we will assume, is equally likely. Now
we must count the number of points in which the end points are both red. We have five choices
for the marble at the left end and four choices for the marble at the right end. The remaining
marbles, occupying places between the ends, can be arranged in 10! ways, so
P(end marbles are both red) = (5 · 4 · 10!)/12! = (5 · 4 · 10!)/(12 · 11 · 10!) = (5 · 4)/(12 · 11) = 5/33
᭿
COMBINATIONS
If we have n distinct objects and we choose only r of them, we denote the number of possible samples, where the order in which the sample items are selected is of no importance, by C(n, r), which we read as “n choose r”. We want to find a formula for this quantity and first we consider a special case. Return to the problem of counting the number of samples of size 3 that can be chosen from the set {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. We denote this number by C(10, 3). Let us suppose that we have a list of all these C(10, 3) samples. Each sample contains three distinct numbers and each sample could be permuted in 3! different ways. Were we to do this, the result might look like Table 2.3, in which only some of the possible samples are listed; then each sample is permuted in all possible ways, so each sample gives 3! permutations.
There are two ways in which to view the contents of the table, which, if shown
in its entirety, would contain all the permutations of 10 objects taken 3 at a time.
Table 2.3
Sample Permutations
{1,4,7} 1,4,7 1,7,4 4,1,7 4,7,1 7,1,4 7,4,1
{2,4,9} 2,4,9 2,9,4 4,2,9 4,9,2 9,2,4 9,4,2
{6,7,10} 6,7,10 6,10,7 7,6,10 7,10,6 10,6,7 10,7,6
. . .        . . .
First, using our formula for the number of permutations of 10 objects taken 3 at a time, the table must contain 10!/(10 − 3)! permutations. However, since each of the C(10, 3) combinations can be permuted in 3! ways, the total number of permutations must also be 3! · C(10, 3). It then follows that

3! · C(10, 3) = 10!/(10 − 3)!

From this, we see that

C(10, 3) = 10!/(7! · 3!) = (10 · 9 · 8 · 7!)/(7! · 3 · 2 · 1) = 120
This process is easily generalized. If we have C(n, r) distinct samples, each of these can be permuted in r! ways, yielding all the permutations of n objects taken r at a time, so

r! · C(n, r) = n!/(n − r)!

or

C(n, r) = n!/[r! · (n − r)!]

C(52, 5) then represents the total number of possible poker hands. This is 2,598,960. This number is small enough so that one could, given enough time, enumerate each of these.
This calculation by hand would appear this way:

C(52, 5) = 52!/[5!(52 − 5)!] = 52!/(5!47!) = (52 · 51 · 50 · 49 · 48 · 47!)/(5 · 4 · 3 · 2 · 1 · 47!)
         = (52 · 51 · 50 · 49 · 48)/(5 · 4 · 3 · 2 · 1) = 2,598,960
Notice that the factors of 47! cancel from both the numerator and the denominator of the fraction above. This cancellation always occurs and a calculation rule is that C(n, r) has r factors in the numerator and r factors in the denominator, so

C(n, r) = n(n − 1)(n − 2) · · · [n − (r − 1)] / r!

This makes the calculation by hand fairly simple. We will solve some interesting problems after stating and proving some facts about these binomial coefficients.
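On a computer the count of poker hands, and the calculation rule just described, are one-line checks (an illustrative sketch, not part of the book):

from math import comb, factorial

print(comb(52, 5))                                      # 2598960 possible poker hands
print(factorial(52) // (factorial(5) * factorial(47)))  # the same value from n!/(r!(n - r)!)
print(52 * 51 * 50 * 49 * 48 // factorial(5))           # the r-factors-over-r! calculation rule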
It is also true that

C(n, r) = C(n, n − r)

An easy way to see this is to notice that if r objects are selected from n objects, then n − r objects are left unselected. Every change in an item selected produces a change in the set of items that are not selected, and so the equality follows.
It is also true that C(n, r) = C(n − 1, r − 1) + C(n − 1, r). To prove this, suppose you are a member of a group of n people and that a committee of size r is to be selected. Either you are on the committee or you are not on the committee. If you are selected for the committee, there are C(n − 1, r − 1) further choices to be made. There are C(n − 1, r) committees that do not include you. So C(n, r) = C(n − 1, r − 1) + C(n − 1, r).
Many other facts are known about the numbers C(n, r), which are also called binomial coefficients because they occur in the binomial theorem. The binomial theorem states that

(a + b)^n = Σ C(n, i) a^(n−i) b^i, the sum running from i = 0 to i = n

This is fairly easy to see. Consider (a + b)^n = (a + b) · (a + b) · · · · · (a + b), where there are n factors on the right-hand side. To find the product (a + b)^n, we must choose either a or b from each of the factors on the right-hand side. There are C(n, i) ways to select i b's (and hence n − i a's). The product (a + b)^n consists of the sum of all such terms.
Many other facts are known about the binomial coefficients. We cannot explore all these here, but we will show an application, among others, to acceptance sampling.
EXAMPLE 2.3 Arranging Some Like Objects
Let us return to the problem first encountered when we counted the permutations of objects,
some of which are alike. Specifically, we wanted to count the number of distinct permutations
of 3 A’s, 4 B’s, and 5 C’s, where the individual letters are not distinguishable from one another.
We found the answer was G = 12!/(5!4!3!) = 27,720.
Here’s another way to arrive at the answer.
From the 12 positions in the permutation, choose 3 for the A's. This can be done in C(12, 3) ways. Then from the remaining nine positions, choose four for the B's. This can be done in C(9, 4) ways. Finally, there are five positions left for the 5 C's. So the total number of permutations must be C(12, 3) · C(9, 4) · C(5, 5), which can be simplified to

(12! · 9! · 5!)/(3! · 9! · 4! · 5! · 5! · 0!) = 12!/(5!4!3!), as before.
᭿
Note that we have used combinations to count permutations!
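Both counts are easy to verify with a few lines of Python; this sketch assumes Python 3.8 or later for math.comb, and the variable names are ours.

```python
from math import comb, factorial

# Count the arrangements of 3 A's, 4 B's, and 5 C's in two ways.
direct = factorial(12) // (factorial(5) * factorial(4) * factorial(3))
by_positions = comb(12, 3) * comb(9, 4) * comb(5, 5)   # choose positions for the A's, then B's, then C's

print(direct, by_positions)   # 27720 27720
```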
General Addition Theorem and Applications
In Chapter 1, we discussed some properties of probabilities including the addition
theorem for two events: P(A ∪ B) = P(A) +P(B) −P(A ∩ B). What if we have
three or more events? This addition theorem can be generalized and we call this,
following Chapter 1,
Fact 3. (General addition theorem).

$P(A_1 \cup A_2 \cup \cdots \cup A_n) = \sum_{i=1}^{n} P(A_i) - \sum_{i \ne j} P(A_i \cap A_j) + \sum_{i \ne j \ne k} P(A_i \cap A_j \cap A_k) - \cdots \pm P(A_1 \cap A_2 \cap \cdots \cap A_n)$
We could not prove this in Chapter 1 since our proof involves combinations.
To prove the general addition theorem, we use a different technique from the one
we used to prove the theorem for two events. Suppose a sample point is contained in
exactly k of the events $A_i$. For convenience, number the events so that the sample point is in the first k events. Now we show that the probability of the sample point is counted exactly once on the right-hand side of the theorem.
The point is contained on the right-hand side

$\binom{k}{1} - \binom{k}{2} + \binom{k}{3} - \cdots \pm \binom{k}{k}$

times. But consider the binomial expansion of

$0 = [1 + (-1)]^k = 1^k - \binom{k}{1} + \binom{k}{2} - \binom{k}{3} + \cdots \mp \binom{k}{k}$

which shows that

$\binom{k}{1} - \binom{k}{2} + \binom{k}{3} - \cdots \pm \binom{k}{k} = 1$
So the sample point is counted exactly once, proving the theorem.
The principle we used here is that of inclusion and exclusion and is of great
importance in discrete mathematics. It could also have been used in the case k = 2.
EXAMPLE 2.4 Checking Hats
Now we return to Example 1.2, where n diners have checked their hats and we seek the
probability that at least one diner is given his own hat at the end of the evening. Let the events
$A_i$ denote the event “diner i gets his own hat,” so we seek $P(A_1 \cup A_2 \cup \cdots \cup A_n)$ using the general addition theorem.
Suppose diner i gets his own hat. There are (n − 1)! ways for the remaining hats to be distributed, given the correct hat to diner i, so $P(A_i) = (n-1)!/n!$. There are $\binom{n}{1}$ ways for a single diner to be selected.
In a similar way, if diners i and j get their own hats, the remaining hats can be distributed in (n − 2)! ways, so $P(A_i \cap A_j) = (n-2)!/n!$. There are $\binom{n}{2}$ ways for two diners to be chosen.
Clearly, this argument can be continued. We then find that

$P(A_1 \cup A_2 \cup \cdots \cup A_n) = \binom{n}{1}\frac{(n-1)!}{n!} - \binom{n}{2}\frac{(n-2)!}{n!} + \binom{n}{3}\frac{(n-3)!}{n!} - \cdots \pm \binom{n}{n}\frac{(n-n)!}{n!}$

which simplifies easily to

$P(A_1 \cup A_2 \cup \cdots \cup A_n) = \frac{1}{1!} - \frac{1}{2!} + \frac{1}{3!} - \cdots \pm \frac{1}{n!}$
Table 2.4 shows some numerical results from this formula.
Table 2.4
n p
1 1.00000
2 0.50000
3 0.66667
4 0.62500
5 0.63333
6 0.63194
7 0.63214
8 0.63212
9 0.63212
It is perhaps surprising that, while the probabilities fluctuate a bit, they appear to approach
a limit. To six decimal places, the probability that at least one diner gets his own hat is 0.632121 for n ≥ 9. It can also be shown that this limit is 1 − 1/e as n increases.
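A short Python sketch (our own, not the book's) reproduces Table 2.4 from the alternating series and compares the result with 1 − 1/e:

```python
from math import factorial, e

def prob_at_least_one_match(n):
    """P(at least one diner gets his own hat) = 1/1! - 1/2! + 1/3! - ... +/- 1/n!"""
    return sum((-1) ** (k + 1) / factorial(k) for k in range(1, n + 1))

for n in range(1, 10):
    print(n, round(prob_at_least_one_match(n), 5))
print("limit 1 - 1/e =", round(1 - 1 / e, 6))
```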
᭿
EXAMPLE 2.5 Aces and Kings
Now we can solve the problem involving a real drug and a placebo given at the beginning of
Chapter 1. To make an equivalent problem, suppose we seek the probability that when the cards
are turned up one at a time in a shuffled deck of 52 cards all the aces will turn up before any
of the kings. This is the same as finding the probability all the users of the real drug will occur
before any of the users of the placebo.
The first insight into the card problem is that the remaining 44 cards have abso-
lutely nothing to do with the problem. We need to only concentrate on the eight aces and
kings.
Assume that the aces are indistinguishable from one another and that the kings are indis-
tinguishable from one another. There are then $\binom{8}{4} = 70$ possible orders for these cards; only one of them has all the aces preceding all the kings, so the probability is 1/70.
᭿
EXAMPLE 2.6 Poker
We have seen that there are $\binom{52}{5} = 2{,}598{,}960$ different hands that can be dealt in playing poker.
We will calculate the probabilities of several different hands. We will see that the special hands
have very low probabilities of occurring.
Caution is advised in calculating the probabilities: choose the values of the cards first and
then the actual cards. Order is not important. Here are some of the possible hands and their
probabilities.
(a) Royal flush. This is a sequence of 10 through ace in a single suit. Since there are four
of these, the probability of a royal flush is $4 / \binom{52}{5} = 0.000001539$.
(b) Four of a kind. This hand contains all four cards of a single value plus another card
that must be of another value. Since there are 13 values to choose from for the four
cards of a single value (and only one way to select them) and then 12 possible values
for the fifth card, and then 4 choices for a card of that value, the probability of this
hand is $13 \cdot 12 \cdot 4 / \binom{52}{5} = 0.0002401$.
(c) Straight. This is a sequence of five cards regardless of suit. There are nine possible sequences, 2 through 6, 3 through 7, ..., 10 through ace. Since the suits are not important, there are four choices for each of the five cards in the sequence, so the probability of a straight is $9 \cdot 4^5 / \binom{52}{5} = 0.003546$.
(d) Two pairs. There are $\binom{13}{2}$ choices for the values of the pairs and then $\binom{4}{2}$ choices for the two cards in the first pair and $\binom{4}{2}$ choices for the two cards in the second pair. Finally, there are 11 choices for the value of the fifth card and 4 choices for that card. So the probability of two pairs is $\binom{13}{2} \cdot \binom{4}{2} \cdot \binom{4}{2} \cdot 11 \cdot 4 / \binom{52}{5} = 0.04754$.
(e) Other special hands are three of a kind (three cards of one value and two other
cards of different values), full house (one pair and one triple), and one pair. The
probabilities of these hands are 0.02113, 0.0014406, and 0.422569, respectively.
The most common hand is the one with five different values. This has probability $\binom{13}{5} \cdot 4^5 / \binom{52}{5} = 0.507083$. The probability of a hand with at least one pair is then $1 - 0.507083 = 0.492917$. ᭿
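The counting arguments above translate directly into code. Here is a hedged Python sketch using math.comb; the counts follow the text's conventions (for example, nine straights), and the names are ours.

```python
from math import comb

hands = comb(52, 5)                                    # 2,598,960 possible hands

royal_flush  = 4
four_of_kind = 13 * 12 * 4
straight     = 9 * 4 ** 5                              # nine sequences, any suit for each card
two_pairs    = comb(13, 2) * comb(4, 2) ** 2 * 11 * 4
five_values  = comb(13, 5) * 4 ** 5                    # five different values

for name, count in [("royal flush", royal_flush), ("four of a kind", four_of_kind),
                    ("straight", straight), ("two pairs", two_pairs),
                    ("five different values", five_values)]:
    print(name, count / hands)
print("at least one pair", 1 - five_values / hands)
```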
Now we show another example, the one concerning auto racing.
EXAMPLE 2.7 Race Cars
One hundred race cars, numbered from 1 to 100, are running around a race course. We observe
a sample of five noting the numbers on the cars and then calculate the median (the number in
the middle when the sample is arranged in order). If the median is m, then we must choose two
that are less than m and then two that are greater than m. This can be done in $\binom{m-1}{2} \cdot \binom{100-m}{2}$ ways. So the probability that the median is m is

$\dfrac{\binom{m-1}{2} \cdot \binom{100-m}{2}}{\binom{100}{5}}$
A graph of this function of m is shown in Figure 2.2.
The most likely value of m is 50 or 51, each having probability 0.0191346.
[Figure 2.2: probability of each possible median m, with the median on the horizontal axis and the probability on the vertical axis]
The race car problem is hardly a practical one. A more practical problem is this: we have
taken a random sample of size 5 and we find that the median of the sample is 8. How many
cars are racing around the track; that is, what is n?
This problem actually arose during World War II. The Germans numbered all kinds of
war materiel and their parts. When we captured some tanks, say, we could then estimate the
total number of tanks they had from the serial numbers on the captured tanks.
Here we will consider maximum likelihood estimation: we will estimate n as the value
that makes the sample median we observed most likely.
If there are n tanks, then the probability the median of a sample of 5 tanks is m is

$\dfrac{\binom{m-1}{2} \cdot \binom{n-m}{2}}{\binom{n}{5}}$
Table 2.5
n Probability
10 0.0833333
11 0.136364
12 0.159091
13 0.16317
14 0.157343
15 0.146853
16 0.134615
17 0.122172
18 0.110294
19 0.0993292
20 0.0893963
21 0.0804954
22 0.0725678
23 0.0655294
24 0.0592885
25 0.0537549
Now let us compute a table of values of n and these probabilities, letting m = 8 for various
values of n. This is shown in Table 2.5.
A graph is helpful (Figure 2.3).
We see that the maximum probability occurs when n = 13, so we have found the maximum
likelihood estimator for n. The mathematical solution of this problem would be a stretch for
most students in this course. The computer is of great value here in carrying out a fairly simple
idea.
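A Python sketch of that simple idea (ours, using math.comb) rebuilds Table 2.5 and picks out the maximizing n for the observed median m = 8:

```python
from math import comb

def prob_median(m, n):
    """P(the median of a sample of 5 is m) when the tanks are numbered 1, 2, ..., n."""
    if m - 1 < 2 or n - m < 2:
        return 0.0
    return comb(m - 1, 2) * comb(n - m, 2) / comb(n, 5)

m = 8
probs = {n: prob_median(m, n) for n in range(10, 26)}
best = max(probs, key=probs.get)
print(best, round(probs[best], 5))   # 13 0.16317, matching Table 2.5
```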
It should be added here that this is not the optimal solution for the German tank problem.
It should be clear that the maximum value in the sample carries more information about n than
does the median. We will return to this problem in Chapter 12.
᭿
Now we return to the executive who is selecting the best assistant.
[Figure 2.3: the probability of the observed median plotted against n, for n = 10, 11, ..., 25]
EXAMPLE 2.8 Choosing an Assistant
An executive in a company has an opening for an executive assistant. Twenty candidates have
applied for the position. The executive is constrained by company rules that say that candidates
must be told whether they are selected or not at the time of an interview. How should the
executive proceed so as to maximize the chance that the best candidate is selected?
We have already seen that a random choice of candidate selects the best one with proba-
bility 1/20 = 0.05, so it is not a very sensible strategy.
It is probably clear, assuming the candidates appear in random order, that we should deny
the job to a certain number of candidates while noting which one was best in this first group;
then we should choose the next candidate who is better than the best candidate in the first group
(or the last candidate if a better candidate does not appear).
This strategy has surprising consequences. To illustrate what follows from it, let us consider
a small example of four candidates whom we might as well number 1, 2, 3, 4 with 4 being the
best candidate and 1 the worst candidate. The candidates can appear in 4! = 24 different orders
that are shown in Table 2.6.
Table 2.6
Order Pass 1 Pass 2 Order Pass 1 Pass 2
1234 2 3 3124 4 4
1243 2 4 3142 4 4
1324 3 4 3241 4 4
1342 3 4 3214 4 4
1423 4 3 3412 4 2
1432 4 2 3421 4 1
2134 3 3 4123 3 3
2143 4 4 4132 2 2
2314 3 4 4231 1 1
2341 3 4 4213 3 3
2413 4 3 4312 2 2
2431 4 1 4321 1 1
The column headed “Pass 1” indicates the final choice when the first candidate is passed
by and the next candidate ranking higher than the first candidate is selected.
Similarly, the column headed “Pass 2” indicates the final choice when the first two can-
didates are passed by and the next candidate ranking higher than the first two candidates is
selected.
It is only sensible to pass by one or two candidates since we will choose the fourth
candidate if we let three candidates pass by and the probability of choosing the best candidate
is then 1/4.
So we let one or two candidates pass, noting the best of these. Then we choose the next
candidate better than the best in the group we passed. Suppose we interview one candidate,
reject him or her, noting his or her ranking. So if the candidates appeared in the order 3214,
then we would pass by the candidate ranked 3; the next best candidate is 4. These rankings
appear in Table 2.6 under the column labeled “Pass 1”. If we examine the rankings and their
frequencies in that column, we get Table 2.7.
Table 2.7 Passing the First Candidate
Ranking Frequency
1 2
2 4
3 7
4 11
Interestingly, the most frequent choice is 4! and the average of the rankings is 3.125, so
we do fairly well.
If we interview the first two candidates, noting the best, and then choose the candidate
with the better ranking (or the last candidate), we find the rankings in Table 2.6 labeled “Pass
2.” A summary of our choice is shown in Table 2.8.
Table 2.8 Passing the First Two Candidates
Ranking Frequency
1 4
2 4
3 6
4 10
We do somewhat less well, but still better than a random choice. The average ranking here
is 2.917.
A little forethought would reduce the number of permutations we have to list. Consider
the plan to pass the first candidate by. If 4 appears in the first position, we will not choose
the best candidate; if 4 appears in the second position, we will choose the best candidate; if 3
appears in the first position, we will choose the best candidate; so we did not need to list 17 of
the 24 permutations. Similar comments will apply to the plan to pass the first two candidates
by.
It is possible to list the permutations of five or six candidates and calculate the average
choice; beyond that this procedure is not very sensible. The results for five candidates are shown
in Table 2.9, where the first candidate is passed by.
Table 2.9 Five Candidates Passing the First One
Ranking Frequency
1 6
2 12
3 20
4 32
5 50
The plan still does very well. The average rank selected is 3.90 and we see with the plans
presented here that we get the highest ranked candidate at least 42% of the time.
Generalizing the plan, however, is not so easy. It can be shown that the optimal plan
passes the first [n/e] candidates (where the brackets indicate the greatest integer function)
and the probability of selecting the best candidate out of the n candidates approaches 1/e as
n increases. So we allow the first candidate to pass by until n = 6 and then let the first two
candidates pass by until n = 9, and so on.
᭿
This problem makes an interesting classroom exercise that is easily simulated
with a computer that can produce random permutations of n integers.
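One possible simulation, written in Python (the function names and parameters are our own choices): random permutations of 1, ..., n are generated, the first k candidates are passed by, and the strategy's success rate is estimated.

```python
import random

def choose(order, k):
    """Pass the first k candidates, then take the first one better than all of them
    (or the last candidate if none is better)."""
    best_passed = max(order[:k])
    for candidate in order[k:]:
        if candidate > best_passed:
            return candidate
    return order[-1]

def success_rate(n, k, trials=100_000):
    wins = 0
    for _ in range(trials):
        order = list(range(1, n + 1))
        random.shuffle(order)
        if choose(order, k) == n:          # did we get the best candidate?
            wins += 1
    return wins / trials

for k in (1, 2, 3):
    print(k, success_rate(10, k))
```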
EXAMPLE 2.9 Acceptance Sampling
Now let us discuss an acceptance sampling plan.
Suppose a lot of 100 items manufactured in an industrial plant actually contains items that
do not meet either the manufacturer’s or the buyer’s specifications. Let us denote these items
by calling them D items while the remainder of the manufacturer’s output, those items that do
meet the manufacturer’s and the buyer’s specifications, we will call G items.
Now the manufacturer wishes to inspect a random sample of the items produced by the
production line. It may be that the inspection process destroys the product or that the inspection
process is very costly, so the manufacturer uses sampling and so inspects only a portion of the
manufactured items.
As an example, suppose the lot of 100 items actually contains 10 D items and 90 G items
and that we select a random sample of 5 items from the entire lot produced by the manufacturer. There are $\binom{100}{5} = 75{,}287{,}520$ possible samples. Suppose we want the probability that
the sample contains exactly three of the D items. Since we assume that each of the samples is
equally likely, this probability is
$P(D = 3) = \dfrac{\binom{10}{3} \cdot \binom{90}{2}}{\binom{100}{5}} = 0.00638353$
making it fairly unlikely that this sample will find three of the items that do not meet
specifications.
It may be of interest to find the probabilities for all the possible values of D. This is often called the probability distribution of the random variable D. That is, we want to find the values of the function

$f(d) = \dfrac{\binom{10}{d} \cdot \binom{90}{5-d}}{\binom{100}{5}}$ for d = 0, 1, 2, 3, 4, 5.
A graph of this function is shown in Figure 2.4.
What should the manufacturer do if items not meeting specifications are discovered in the
sample? Normally, one of two courses is followed: either the D items found in the sample are
[Figure 2.4: the probability distribution f(d) plotted against d]
replaced by G items or the entire lot is inspected and any D items found in the entire lot are
replaced by G items. The last course is usually followed if the sample does not exhibit too
many D items, and, of course, can only be followed if the sampling is not destructive.
If the sample does not contain too many D items, the lot is accepted and sent to the
buyer, perhaps after some D items in the sample are replaced by G items. Otherwise, the lot is
rejected. Hence, the process is called acceptance sampling.
We will explore the second possibility noted above here, namely, that if any D items at
all are found in the sample, then the entire lot is inspected and any D items in it are replaced
with G items. So, the entire delivered lot consists of G items when the sample detects any D
items at all. This clearly will improve the quality of the lot of items sold, but it is not clear how
much of an improvement will result. The process has some surprising consequences and we
will now explore this procedure.
To be specific, let us suppose that the lot is accepted only if the sample contains no D
items whatsoever. Let us also assume that we do not know how many D items are in the lot,
so we will suppose that there are d of these in the lot.
The lot is then accepted with probability
$P(D = 0) = \dfrac{\binom{100-d}{5}}{\binom{100}{5}}$
This is a decreasing function of d; the larger d is, the more likely the sample will contain
some D items and hence the lot will not be accepted. A graph of this function is shown in
Figure 2.5.
[Figure 2.5: P(D = 0) plotted against d]
Finally, we consider the average percentage of D items delivered to the customer with
this acceptance sampling plan. This is often called the average outgoing quality (AOQ) in the
quality control literature.
The average of a quantity is found by multiplying the values of that quantity by the
probability of that quantity and adding the results. So if a random variable is D, say, whose
specific values are d, then the average value of D is

$\sum_{\text{all values of } d} d \cdot P(D = d)$
Here we wish to find the average value of the percentage of D items delivered to the buyer, or the average of the quantity d/100. This is the average outgoing quality.

$\mathrm{AOQ} = \sum_{\text{all values of } d} \frac{d}{100} \cdot P(D = d)$
But we have a very special circumstance here. The delivered lot has percentage D items
of d/100 only if the sample contains no D items whatsoever; otherwise, the lot has 0% D items
due to the replacement plan. So the average outgoing quality is
$\mathrm{AOQ} = \frac{d}{100} \cdot P(D = 0) + \frac{0}{100} \cdot P(D \ne 0)$

so

$\mathrm{AOQ} = \frac{d}{100} \cdot P(D = 0)$

or

$\mathrm{AOQ} = \frac{d}{100} \cdot \dfrac{\binom{100-d}{5}}{\binom{100}{5}}$
A graph of this function is shown in Figure 2.6.
We notice that the graph attains a maximum value; this may not have been anticipated! This means that regardless of the quality of the lot, there is a maximum for the average percentage of D items that can be delivered to the customer! This maximum can be found using a computer and the above graph. Table 2.10 shows the values of the AOQ near the maximum value.
We see that the maximum AOQ occurs when d = 16, so the maximum average percentage of D items that can be delivered to the customer is 0.066!
[Figure 2.6: AOQ plotted against d]
Table 2.10
d AOQ
14 0.06476
15 0.06535
16 0.06561
17 0.06556
18 0.06523
Sampling here has had a dramatic impact on the average percentage of D items delivered
to the customer.
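The calculations above are easy to script. Here is a hedged Python sketch for this plan (a lot of 100, a sample of 5, and acceptance only when the sample contains no D items); the names are ours.

```python
from math import comb

LOT, SAMPLE = 100, 5

def p_accept(d):
    """P(D = 0): the sample of 5 contains no D items when the lot holds d of them."""
    return comb(LOT - d, SAMPLE) / comb(LOT, SAMPLE)

def aoq(d):
    """Average outgoing quality: d/100 is delivered only when the lot is accepted."""
    return d / LOT * p_accept(d)

best_d = max(range(LOT - SAMPLE + 1), key=aoq)
print(best_d, round(aoq(best_d), 5))   # 16 0.06561, matching Table 2.10
```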
᭿
This is just one example of how probability and statistics can assist in delivering
high-quality product to consumers. There are many other techniques used that are
called in general statistical process control methods, or SPC; these have found wide
use in industry today. Statistical process control is the subject of Chapter 11.
CONCLUSIONS
We have explored permutations and combinations in this chapter and have applied
them to several problems, most notably a plan for choosing the best candidate from
a group of applicants for a position and acceptance sampling where we found that
sampling does improve the quality of the product sold and actually puts a limit on the
percentage of unacceptable product sold.
We will continue our discussion of production methods and the role probability
and statistics can play in producing more acceptable product in the chapter on quality
control and statistical process control.
We continue in the following two chapters with a discussion of conditional
probability and geometric probability. Each of these topics fits well into a course in
geometry.
EXPLORATIONS
1. Use the principle of inclusion and exclusion to prove the general addition
theorem for two events.
2. Find the probability a poker hand has
(a) exactly two aces;
(b) exactly one pair.
3. What is the probability a bridge hand (13 cards from a deck of 52 cards) does
not contain a heart?
4. Simulate Example 2.8: Choose random permutations of the integers from
1 to 10 with 10 being the most qualified assistant. Compare the probabilities
of making the best choice by letting 1, 2, or 3 applicants go by and then
choosing the best applicant better than the best of those passed by.
5. Simulate Example 2.4: Choose random permutations of the integers from 1 to
5 and count the number of instances of an integer being in its own place. Then
count the number of derangements, that is, permutations where no integer
occupies its own place.
6. In Example 2.7, how would you use the maximum of the sample in order to
estimate the maximum of the population?
7. A famous correspondence between the Chevalier de Mere and Blaise Pascal,
each of whom made important contributions to the theory of probability, contained the question: “Which is more likely—at least one 6 in 4 rolls of a fair
die or at least one sum of 12 in 24 rolls of a pair of dice?”
Show that the two events are almost equally likely. Which is more likely?
Chapter 3
Conditional Probability
CHAPTER OBJECTIVES:
• to consider some problems involving conditional probability
• to show diagrams of conditional probability problems
• to show how these probability problems can be solved using only the area of a rectangle
• to show connections with geometry
• to show how a test for cancer and other medical tests can be misinterpreted
• to analyze the famous Let’s Make a Deal problem.
INTRODUCTION
A physician tells a patient that a test for cancer has given a positive response (indicating
the possible presence of the disease). In this particular test, the physician knows that
the test is 95% accurate both for patients who have cancer and for patients who do
not have cancer. The test appears at first glance to be quite accurate. How is it then,
based on this test alone, that this patient almost certainly does not have cancer? We
will explain this.
I have a class of 65 students. I regard one of my students as an expert since
she never makes an error. I regret to report that the remaining students are terribly
ignorant of my subject, and so guess at each answer. I gave a test with six true–false
questions; a paper I have selected at random has each of these questions answered
correctly. What is the probability that I have selected the expert’s paper? The answer
may be surprising.
Each of these problems can be solved using what is usually known in introductory
courses as Bayes’ theorem. We will not need this theorem at all to solve these problems.
No formulas are necessary, except that for the area of a rectangle, so these problems
can actually be explained to many students of mathematics.
These problems often cause confusion since it is tempting to interchange two
probabilities. In the cancer example, the probability that the test is positive if the
patient has cancer (0.95 in our example) is quite different from the probability that
the patient has cancer if the test is positive (only 0.087!). We will explain how this
apparent difference arises and how to calculate the conditional probability, which is
probably the one of interest.
Finally, and regrettably, in the second example, despite the large size of my class,
the probability I am looking at the expert’s paper is only 1/2. Now let us see how
to tackle these problems in a simple way. We begin with two simple examples and
then solve the problems above. We will also discuss the famous Let’s Make a Deal
problem.
EXAMPLE 3.1 For the Birds
A pet store owner purchases bird seed from two suppliers, buying 3/4 of his bird seed
from one supplier and the remainder from another. The germination rate of the seed from
the first supplier is 75% whereas the germination rate from the second supplier is 60%. If
seed is randomly selected and germinates, what is the probability that it came from the first
supplier?
We will show how to make this problem visual and we will see that it is fairly simple to
analyze. We begin with a square of side 1, as shown in Figure 3.1. The total area of the square
is 1, so portions of the total area represent probabilities.
Along the horizontal axis we have shown the point at 3/4, indicating the proportion of
seed purchased from the first supplier. The remainder of the axis is of length 1/4, indicating
the proportion of the seed purchased from the second supplier.
Along the vertical axis, we have shown a horizontal line at 75%, the proportion of the
first supplier’s seed that germinates. The area of the rectangle formed is 3/4 · 0.75, which
is the probability that the seed came from the first supplier and germinates. This is shaded
in Figure 3.1. We have also shown a horizontal line at the 60% mark along the vertical axis,
indicating that this is the percentage of the second supplier’s seed that germinates. The area of the
unshaded rectangle with height 0.60 and whose base is along the horizontal axis is 1/4 · 0.60,
[Figure 3.1: unit square with 3/4 marked on the horizontal axis and 0.75 and 0.6 on the vertical axis]
representing the probability that the seed came from the second supplier and germinates. Now
the total area of the two rectangles is 3/4 · 0.75 +1/4 · 0.60 = 0.7125; this is the probability
that the seed germinates regardless of the supplier.
We now want to find the probability that germinating seed came from the first supplier.
This is the portion of the two rectangles that is shaded or

$\dfrac{3/4 \cdot 0.75}{3/4 \cdot 0.75 + 1/4 \cdot 0.60} = 0.789474$

differing from 3/4, the percentage from the first supplier.
᭿
This problem and the others considered in this chapter are known as conditional
probability problems. They are usually solved using Bayes’ theorem, but, as we have
shown above, all that is involved is areas of rectangles.
We emphasize here that the condition is the crucial distinction. It matters greatly
whether we are given the condition that the seed came from the first supplier (in
which case the germination rate is 75%) or that the seed germinated (in which case
the probability it came from the first supplier is 78.474%). Confusing these two
conditions can lead to erroneous conclusions, as some of our examples will show.
EXAMPLE 3.2 Driver’s Ed
In a high school, 60% of the students take driver’s education. Of these, 4% have an accident
in a year. Of the remaining students, 8% have an accident in a year. A student has an accident;
what is the probability he or she took driver’s ed?
Note that the probability a driver’s ed student has an accident (4%) may well differ from
the probability he took driver’s ed if he had an accident. Let us see if this is so. The easiest way
to look at this is again to draw a picture. In Figure 3.2 we show the unit square as we did in the
first example.
Along the horizontal axis, we have shown 60% of the axis for students who take driver’s
ed while the remainder of the horizontal axis, 40%, represents students who do not take driver’s
education. Now of the 60% who take driver’s ed, 4% have an accident, so we show a line on
the vertical axis at 4%.
[Figure 3.2: unit square with 60% marked on the horizontal axis and 4% and 8% on the vertical axis]
The area of this rectangle, 0.6 · 0.04 = 0.024, represents students who both take driver’s
ed and have an accident. (The scale has been exaggerated here so that the rectangles we are
interested in can be seen more easily.) This rectangle has been shaded.
Above the 40% who do not take driver’s ed, we show a line at 8%.
This rectangle on the right has area 0.4 · 0.08 = 0.032, representing students who do not
take driver’s ed and who have an accident. This is unshaded in the figure.
The area of the two rectangles then represents the proportion of students who have an
accident. This is 0.024 +0.032 = 0.056.
Now we want the probability that a student who has an accident has taken driver’s ed. This is the portion of the area of the two rectangles that arises from the left-hand rectangle, or 0.024/0.056 = 3/7 ≈ 42.9%. It follows that 4/7 ≈ 57.1% of the students who have accidents did not take driver’s ed.
It is clear then that the probability a student has an accident if he took driver’s ed (4%) differs markedly from the probability a student who had an accident took driver’s ed (42.9%).
᭿
It is not uncommon for these probabilities to be mixed up. Note that the first refers to the group of students who took driver’s ed whereas the second relates to students who have had accidents.
SOME NOTATION
Now we pause in our examples to consider some notation.
From this point on, we will use the notation, as we have done previously, P(A|B)
to denote “the probability that an event A occurs given that an event B has occurred.”
In the bird seed example, we let S1 and S2 denote the suppliers and let G denote
the event that the seed germinates.
We then have P(G|S1) = 0.75 and P(G|S2) = 0.60, and we found that P(S1|G) = 0.789474. Note that the condition has a great effect on the probability.
EXAMPLE 3.3 The Expert in My Class
Recall that our problem is this: I have a class of 65 students. I regard one of my students as an
expert since she never makes an error. I regret to report that the remaining students are terribly
ignorant of my subject, and so guess at each answer. I gave a test with six true–false questions;
a paper I have selected at random has each of these questions answered correctly.
What is the probability that I have selected the expert’s paper? One might think this is
certain, but it is not.
This problem can be solved in much the same way that we solved the previous two
problems.
Begin again with a square of side 1. On the horizontal axis, we indicate the probability we
have chosen the expert (1/65) and the probability we have chosen an ordinary student (64/65).
The chance of choosing the expert is small, so we have exaggerated the scale in Figure 3.3 for
reasons of clarity. The relative areas then should not be judged visually.
[Figure 3.3: unit square with 1/65 marked on the horizontal axis and 1 and 1/64 on the vertical axis]
Now the expert answers the six true–false questions on my examination with probability 1. This has been indicated on the vertical axis. The area of the corresponding rectangle is then

$\frac{1}{65} \cdot 1 = \frac{1}{65}$

This represents the probability that I chose the expert and that all the questions are answered correctly.
Now the remainder of my students are almost totally ignorant of my subject and so they guess the answers to each of my questions. The probability that one of these students answers all the six questions correctly is then $(1/2)^6 = 1/64$. This has been indicated on the right-hand vertical scale in Figure 3.3. The rectangle corresponding to the probability that a nonexpert is chosen and answers all six of the questions correctly then has area

$\frac{64}{65} \cdot \frac{1}{64} = \frac{1}{65}$
Let us introduce some notation here. Let E denote the expert and Ē denote a nonexpert, and let A denote the event that the student responds correctly to each question. We found that P(A|E) = 1 while we want P(E|A), so again the condition is crucial.
So the total area, shown as two rectangles in Figure 3.3, corresponding to the probability that all six of the questions are answered correctly, is

P(A) = P(A and E) + P(A and Ē)

$P(A) = \frac{1}{65} \cdot 1 + \frac{64}{65} \cdot \frac{1}{64} = \frac{2}{65}$
The portion of this area corresponding to the expert is

$P(E \mid A) = \dfrac{\frac{1}{65} \cdot 1}{\frac{1}{65} \cdot 1 + \frac{64}{65} \cdot \frac{1}{64}} = \frac{1}{2}$
The shaded areas in Figure 3.3 are in reality equal, but they do not appear to be
equal.
᭿
Recall that the statement P(A) = P(A and E) + P(A and Ē) is often known as the Law of Total Probability.
We add the probabilities here of mutually exclusive events, namely, A and E and A and Ē (represented by the nonoverlapping rectangles). The law can be extended to more than two mutually exclusive events.
EXAMPLE 3.4 The Cancer Test
We now return to the medical example at the beginning of this chapter. We interpret the statement that the cancer test is accurate for 95% of patients with cancer and 95% accurate for patients without cancer as $P(T^+ \mid C) = 0.95$ and $P(T^+ \mid \bar{C}) = 0.05$, where C indicates the presence of cancer and $T^+$ means that the test indicates a positive result or the presence of cancer. $P(T^+ \mid \bar{C})$ is known as the false positive rate for the test since it produces a positive result for noncancer patients. Supposing that a small percentage of the population has cancer, we assume in this case that P(C) = 0.005. This assumption will prove crucial in our conclusions.
A picture will clarify the situation, although, again, the small probabilities involved force us
to exaggerate the picture somewhat. Figure 3.4 shows along the horizontal axis the probability
that a person has cancer. Along the vertical axis are the probabilities that a person shows a
positive test for each of the two groups of patients.
[Figure 3.4: unit square with 0.005 marked on the horizontal axis and 0.95 and 0.05 on the vertical axis]
It is clear that the probability that a person shows a positive test is

$P(T^+) = 0.95 \cdot 0.005 + 0.05 \cdot 0.995 = 0.0545$

The portion of this area corresponding to the people who actually have cancer is then

$P(C \mid T^+) = \dfrac{0.95 \cdot 0.005}{0.95 \cdot 0.005 + 0.05 \cdot 0.995} = \dfrac{0.00475}{0.0545} = 0.087$
This is surprisingly low. We emphasize, however, that the test should not be relied upon
alone; one should have other indications of the disease as well.
We also note here that the probability that a person testing positive actually has cancer
highly depends upon the true proportion of people in the population who are actually cancer
patients. Let us suppose that this true proportion is r, so that r represents the incidence rate
of the disease. Replacing the proportion 0.005 by r and the proportion 0.995 by 1 −r in
the above calculation, we find that the proportion of people who test positive and actually have
the disease is
$P(C \mid T^+) = \dfrac{r \cdot 0.95}{r \cdot 0.95 + (1-r) \cdot 0.05}$

This can be simplified to

$P(C \mid T^+) = \dfrac{0.95\,r}{0.05 + 0.90\,r} = \dfrac{19r}{1 + 18r}$
A graph of this function is shown in Figure 3.5.
We see that the test is quite reliable when the incidence rate for the disease is large. Most
diseases, however, have small incidence rates, so the false positive rate for these tests is a very
important number.
Now suppose that the test has probability p indicating the disease among patients who
actually have the disease and that the test indicates the presence of the disease with probability
1 − p among patients who do not have the disease (p = 0.95 in our example). It is also interesting
[Figure 3.5: P(C|T⁺) plotted against the incidence rate r]
[Figure 3.6: surface showing P(C|T⁺) as a function of r and p]
to examine the probability $P(C \mid T^+)$ as a function of both the incidence rate of the disease, r, and p. Now

$P(C \mid T^+) = \dfrac{r \cdot p}{r \cdot p + (1-r)(1-p)}$
᭿
The surface showing this probability as a function of both r and p is shown in
Figure 3.6.
Clearly, the accuracy of the test increases as r and p increase. Ours has a low
probability since we are dealing with the lower left corner of the surface.
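A small Python sketch of this posterior probability (the function name is our own choice):

```python
def p_cancer_given_positive(r, p):
    """P(C | T+) when the incidence rate is r and the test is correct with probability p."""
    return r * p / (r * p + (1 - r) * (1 - p))

print(round(p_cancer_given_positive(0.005, 0.95), 3))   # 0.087, as in the example
print(round(p_cancer_given_positive(0.20, 0.95), 3))    # much higher for a common disease
```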
EXAMPLE 3.5 Let’s Make a Deal
In this TV game show, a contestant is presented with three doors, one of which contains a
valuable prize while the other two are empty. The contestant is allowed to choose one door.
Regardless of the choice made, at least one, and perhaps two, of the remaining doors is empty.
The show host, say Monty Hall, opens one door and shows that it is empty. He now offers
the contestant the opportunity to change the choice of doors; should the contestant switch, or
doesn’t it matter?
It matters. The contestant who switches has probability 2/3 of winning the prize. If the
contestant does not switch, the probability is 1/3 that the prize is won.
In thinking about the problem, note that when the empty door is revealed, the game does not suddenly become an even choice between two doors, one of which contains the prize. The problem here
is that sometimes Monty Hall has one choice of door to show empty and sometimes he has two
choices of doors that are empty. This must be accounted for in analyzing the problem.
An effective classroom strategy at this point is to try the experiment several times, perhaps
using large cards that must be shuffled thoroughly before each trial; some students can use the
“never switch” strategy whereas others can use the “always switch” strategy and the results
compared. This experiment alone is enough to convince many people that the switching strategy
is the superior one; we analyze the problem using geometry.
To be specific, let us call the door that the contestant chooses as door 1 and the door that
the host opens as door 2. The symmetry of the problem tells us that this is a proper analysis of
the general situation.
Now we need some notation. Let $P_i$, i = 1, 2, 3, denote the event “prize is behind door i” and let D be the event “door 2 is opened.” We assume that $P(P_1) = P(P_2) = P(P_3) = 1/3$.
Then, $P(D \mid P_1) = 1/2$, since in that case the host then has a choice of two doors to open; $P(D \mid P_2) = 0$, since the host will not open the door showing the prize; and $P(D \mid P_3) = 1$, since in this case door 2 is the only one that can be opened to show no prize behind it.
[Figure 3.7: unit square divided at 1/3 and 2/3 on the horizontal axis into regions for the three doors, with 1/2 marked on the vertical axis]
Our unit square is shown in Figure 3.7. It is clear that the shaded area in Figure 3.7
represents the probability that door 2 is opened. The probability that the contestant wins if he
switches is then the proportion of this area corresponding to door 3. This is
$P(P_3 \mid D) = \dfrac{\frac{1}{3} \cdot 1}{\frac{1}{3} \cdot \frac{1}{2} + \frac{1}{3} \cdot 0 + \frac{1}{3} \cdot 1} = \dfrac{2}{3}$
᭿
Another, perhaps more intuitive, way to view the problem is this: when the first
choice is made, the contestant has probability 1/3 of winning the prize. The probability
that the prize is behind one of the other doors is 2/3. Revealing one of the doors to
be empty does not alter these probabilities; hence, the contestant should switch.
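The classroom experiment suggested above is also easy to simulate. Here is one possible Python sketch (ours) comparing the two strategies:

```python
import random

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        prize = random.randrange(3)
        choice = random.randrange(3)
        # the host opens an empty door that is not the contestant's choice
        opened = next(d for d in range(3) if d != choice and d != prize)
        if switch:
            choice = next(d for d in range(3) if d != choice and d != opened)
        wins += (choice == prize)
    return wins / trials

print("never switch :", play(False))   # about 1/3
print("always switch:", play(True))    # about 2/3
```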
BAYES’ THEOREM
Bayes’ theorem is stated here although, as we have seen, problems involving it can
be done geometrically.
Bayes’ theorem: If $S = A_1 \cup A_2 \cup \cdots \cup A_n$, where $A_i$ and $A_j$ have no sample points in common if $i \ne j$, then if B is an event,

$P(A_i \mid B) = \dfrac{P(A_i \cap B)}{P(B)}$

$P(A_i \mid B) = \dfrac{P(A_i) \cdot P(B \mid A_i)}{P(A_1) \cdot P(B \mid A_1) + P(A_2) \cdot P(B \mid A_2) + \cdots + P(A_n) \cdot P(B \mid A_n)}$

or

$P(A_i \mid B) = \dfrac{P(A_i) \cdot P(B \mid A_i)}{\sum_{j=1}^{n} P(A_j) \cdot P(B \mid A_j)}$
In Example 3.5 (Let’s Make a Deal), $A_i$ is the event “Prize is behind door i” for i = 1, 2, 3. (We used $P_i$ in Example 3.5.) B was the event “door 2 is opened.”
CONCLUSIONS
The problems considered above should be interesting and practical for our students. I
think our students should have mastery of these problems and others like them since
data of this kind are frequently encountered in various fields.
Mathematically, the analyses given above are equivalent to those found by using
a probability theorem known as Bayes’ theorem. The geometric model given here
shows that this result need not be known since it follows so simply from the area
of a rectangle. The analysis given here should make these problems accessible to
elementary students of mathematics.
EXPLORATIONS
1. Three methods, say A, B, and C, are sometimes used to teach an industrial
worker a skill. The methods fail to instruct with rates of 20%, 10%, and 30%,
respectively. Cost considerations restrict method A to be used twice as often as B, which is used twice as often as C. If a worker fails to learn the skill, what is
the probability that she was taught by method A?
2. Binary symbols (0 or 1) sent over a communication line are sometimes inter-
changed. The probability that a 0 is changed to 1 is 0.1 while the probability
that a 1 is changed to 0 is 0.2. The probability that a 0 is sent is 0.4 and the
probability that a 1 is sent is 0.6. If a 1 is received, what is the probability that
a 0 was sent?
3. Sample surveys are often subject to error because the respondent might not
truthfully answer a sensitive question such as “Do you use illegal drugs?” A
procedure known as randomized response is sometimes used. Here is how that
works. A respondent is asked to flip a coin and not reveal the result. If the coin
comes up heads, the respondent answers the sensitive question, otherwise he
Explorations 47
responds to an innocuous question such as “Is your Social Security number
even?” So if the respondent answers “Yes,” it is not known to which question
he is responding. Show, however, that with a large number of respondents, the
frequency of illegal drug use can be determined.
4. Combine cards from several decks and create another deck of cards with, say,
12 aces and 40 other cards. Have students select a card and not reveal whether it
is an ace or not. If an ace is chosen, have the students answer the question about
illegal drugs in the previous exploration and otherwise answer an innocuous
question. Then approximate the use of illegal drugs.
5. A certain brand of lie detector is accurate with probability 0.92; that is, if a
person is telling the truth, the detector indicates he is telling the truth with
probability 0.92, while if he is lying, the detector indicates he is lying with
probability 0.92. Assume that 98% of the subjects of the test are truthful. What
is the probability that a person is lying if the detector indicates he is lying?
Chapter 4
Geometric Probability
CHAPTER OBJECTIVES:
• to use connections between geometry and probability
• to see how the solution of a quadratic equation solves a geometric problem
• to use linear inequalities in geometric problems
• to show some unusual problems for geometry.
EXAMPLE 4.1 Meeting at the Library
Joan and Jim agree to meet at the library after school between 3 and 4 p.m. Each agrees to wait
no longer than 15 min for the other. What is the probability that they will meet?
This at first glance does not appear to be a geometric problem, but it is.
We show the situation in Figure 4.1. For convenience, we take the interval between 3 and
4 p.m. to be the interval between 0 and 1, so both Joan and Jim’s waiting time becomes 1/4 of
an hour. We suppose that each person arrives at some random time, so these arrival times are
points somewhere in a square of side 1. Let X denote Joan’s arrival time and Y denote Jim’s
arrival time.
If Joan arrives first, then Jim’s arrival time Y must be greater than Joan’s arrival time X.
So Y > X or Y −X > 0 and if they are to meet, then Y −X < 1/4 or Y < X+1/4. This is
the region below the line Y = X+1/4 and has intercepts at (0, 1/4) and (3/4, 1) and is the
top line in Figure 4.1.
If Jim arrives first, then X > Y and X−Y > 0 and if they are to meet, then X−Y < 1/4
or Y > X−1/4. This is the region above the line Y = X−1/4 and has intercepts at (1/4, 0)
and (1, 3/4) and is the lower line in Figure 4.1.
They both meet then when Y < X+1/4 and when Y > X−1/4. This is the region
between the lines and is shaded in the figure.
Since the area of the square is 1, the shaded area must represent the probability that Joan
and Jim meet. The easiest way to compute this is to subtract the areas of the two triangles
[Figure 4.1: unit square with the band between the lines Y = X + 1/4 and Y = X − 1/4 shaded; the lines have intercepts (0, 1/4), (3/4, 1) and (1/4, 0), (1, 3/4)]
from 1, the area of the square. The triangles are equal in size. This gives us the probability that
they meet as

$P(\text{Joan and Jim meet}) = 1 - 2 \cdot \frac{1}{2} \cdot \left(\frac{3}{4}\right)^2 = 1 - \frac{9}{16} = \frac{7}{16} = 0.4375$
So they meet less than half the time.
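A quick Monte Carlo check of this answer in Python (our own sketch, not from the text):

```python
import random

def meet_probability(wait=0.25, trials=200_000):
    """Estimate P(|X - Y| < wait) for arrival times X, Y uniform on (0, 1)."""
    hits = sum(abs(random.random() - random.random()) < wait for _ in range(trials))
    return hits / trials

print(meet_probability())   # close to 7/16 = 0.4375
```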
᭿
It is, perhaps, surprising that the solution to our problem is geometric. This is
one of the many examples of the solution of probability problems using geometry.
We now show some more problems of this sort.
EXAMPLE 4.2 How Long Should They Wait?
Now suppose that it is really important that they meet, so we want the probability that they
meet to be, say, 3/4. How long should each wait for the other?
Let us say that each waits for a time t (measured as a fraction of the hour) for the other. The situation is shown in Figure 4.2.
We know then that the shaded area is 3/4 or that the nonshaded area is 1/4.
This means that

$1 - 2 \cdot \frac{1}{2} \cdot (1-t)^2 = \frac{3}{4}$

or that

$1 - (1-t)^2 = \frac{3}{4}$
[Figure 4.2: unit square with the band between the lines Y = X + t and Y = X − t shaded; the lines have intercepts (0, t), (1 − t, 1) and (t, 0), (1, 1 − t)]
so

$(1-t)^2 = \frac{1}{4}$ or $1 - t = \frac{1}{2}$

so

$t = \frac{1}{2}$

So each must wait up to 30 min for the other for them to meet with probability 3/4.
᭿
EXAMPLE 4.3 A General Graph
In the previous example, we specified the probability that Joan and Jim meet. Now suppose
we want to know how the waiting time, say t hours for each person, affects the probability that
they meet.
The probability that they meet is

$p = 1 - 2 \cdot \frac{1}{2} \cdot (1-t)^2$

or

$t^2 - 2t + p = 0$
᭿
Figure 4.3 is a graph of this quadratic function where t is restricted to be between
0 and 1.
[Figure 4.3: probability of meeting plotted against the waiting time t]
EXAMPLE 4.4 Breaking a Wire
I have a piece of wire of length L. I break it into three pieces. What is the probability that I can
form a triangle from the three pieces of wire?
Suppose the pieces of wire after the wire is broken have lengths x, y, and L −x −y.
To form a triangle, the sum of the lengths of any two sides must exceed the length of the
remaining side. So,
x +y > L −x −y or x +y > L/2
and
x +(L −x −y) > y so y < L/2
and
y +(L −x −y) > x so x < L/2
It is easy to see the simultaneous solution of these inequalities geometrically, as shown in
Figure 4.4.
If x < L/2, then x must be to the left of the vertical line at L/2. If y < L/2, then y must
be below the horizontal line at L/2. Finally, x +y > L/2; this is the region above the line
x +y = L/2.
[Figure 4.4: square of side L with the triangle bounded by x = L/2, y = L/2, and x + y = L/2 shaded]
The resulting region is the triangle shaded in the figure. Its area is clearly 1/8 of the square;
so the probability that I can form a triangle from the pieces of wire is 1/8.
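A short simulation in Python supports this answer; it follows the example's model, in which the two lengths x and y are chosen independently and uniformly on (0, L), as in Figure 4.4. The function name is ours.

```python
import random

def triangle_probability(trials=200_000, L=1.0):
    """Pick x and y uniformly on (0, L) and test whether x, y, and L - x - y form a triangle."""
    hits = 0
    for _ in range(trials):
        x, y = random.uniform(0, L), random.uniform(0, L)
        # triangle inequalities: x + y > L/2, x < L/2, y < L/2
        if x + y > L / 2 and x < L / 2 and y < L / 2:
            hits += 1
    return hits / trials

print(triangle_probability())   # close to 1/8 = 0.125
```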
᭿
EXAMPLE 4.5 Shooting Fouls
A, B, and C are basketball players. A makes 40% of her foul shots, B makes 60% of her foul
shots, and C makes 80% of her foul shots. A takes 50% of the team’s foul shots, B takes 30%
of the team’s foul shots, and C takes 20% of the team’s foul shots. What is the probability that
a foul shot is made?
The situation can be seen in Figure 4.5, as we have done in all our previous examples.
The region labeled “A” is the region where player A is shooting and she makes the shot;
the situation is similar for players B and C. So the probability that a shot is made is then the
sum of the areas for the three players or
(0.5)(0.4) +(0.3)(0.6) +(0.2)(0.8) = 0.54
[Figure 4.5: unit square divided at 0.5 and 0.8 on the horizontal axis into regions A, B, and C with heights 0.4, 0.6, and 0.8]
᭿
EXAMPLE 4.6 Doing the Dishes
We conclude this chapter with a problem that at first glance does not involve geometry. But it
does!
My daughter, Kaylyn, and I share doing the dinner dishes. To make the situation interesting,
I have in my pocket two red marbles and one green marble. We agree to select two marbles
at random. If the colors match, I do the dishes; otherwise Kaylyn does the dishes. Is the game
fair?
Of course the game is not fair. It is clear from the triangle in Figure 4.6 that Kaylyn does
the dishes 2/3 of the time, corresponding to two sides of the triangle, while I do the dishes 1/3
of the time, corresponding to the base of the triangle.
[Figure 4.6: triangle with vertices labeled R, R, and G]
[Figure 4.7: square with vertices labeled R, R, G, and G]
How can the game be made fair? It may be thought that I have too many red marbles in
my pocket and that adding a green marble will rectify things.
However, examine Figure 4.7 where we have two green and two red marbles. There are
six possible samples of two marbles; two of these contain marbles of the same color while four
contain marbles of different colors. Adding a green marble does not change the probabilities
at all!
᭿
Why is this so? Part of the explanation lies in the fact that while the number of
red marbles and green marbles is certainly important, it is the number of sides and
diagonals of the figure involved that is crucial. It is the geometry of the situation that
explains the fairness, or unfairness, of the game.
It is interesting to find, in the above example, that if we have three red marbles
and one green marble, then the game is fair. The unfairness of having two red and one
green marbles in my pocket did not arise from the presumption that I had too many
red marbles in my pocket. I had too few!
Increasing the number of marbles in my pocket is an interesting challenge.
Figure 4.8 shows three red and three green marbles, but it is not a fair situation.
[Figure 4.8: hexagon with vertices labeled R, R, R, G, G, and G, with all sides and diagonals drawn]
Table 4.1
R G R +G
3 1 4
6 3 9
10 6 16
15 10 25
The lines in Figure 4.8 show that there are 15 possible samples to be selected, 6
of which have both marbles of the same color while 9 of the samples contain marbles
of different colors. With a total of six marbles, there is no way in which the game can
be made fair.
Table 4.1 shows some of the first combinations of red and green marbles that
produce a fair game.
There are no combinations of 5 through 8, 10 through 15, or 17 through 25
marbles for which the game is fair. We will prove this now.
One might notice that the numbers of red and green marbles in Table 4.1
are triangular numbers, that is, they are sums of consecutive positive integers
1 +2 = 3, 1 +2 +3 = 6, 1 +2 +3 +4 = 10, and so on. The term triangular
comes from the fact that these numbers can be shown as equilateral triangles:
[diagram: the triangular numbers 1, 3, and 6 shown as triangular arrays of dots]
We also note that 1 +2 +3 +· · · +k = k(k +1)/2. To see that this is so, note
that the formula works for k = 2 and also note that if the formula is correct for the
sum of k positive integers, then
$1 + 2 + 3 + \cdots + k + (k+1) = \frac{k(k+1)}{2} + (k+1) = \frac{(k+2)(k+1)}{2}$
which is the formula for the sum of k +1 positive integers. This proves the
formula.
Let us see why R and G must be triangular numbers. For the game to be fair,

$\binom{R}{2} + \binom{G}{2} = \frac{1}{2}\binom{R+G}{2}$

or

$2R(R-1) + 2G(G-1) = (R+G)(R+G-1)$
and this can easily be simplified to

$R + G = (R - G)^2$

This is one equation in two unknowns. But we also know that both R and G must be positive integers. So let R − G = k. Then, since $R + G = (R - G)^2$, it follows that $R + G = k^2$.
Solving these simultaneously gives $2R = k + k^2$, or R = k(k + 1)/2, and so G = k(k − 1)/2, showing that R and G are consecutive triangular numbers.
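A brute force Python check (our own sketch) confirms which marble counts make the game fair, reproducing the consecutive triangular numbers of Table 4.1:

```python
from math import comb

fair_pairs = []
for total in range(2, 30):
    for g in range(1, total):
        r = total - g
        # the game is fair when P(same colour) = 1/2, i.e. C(r,2) + C(g,2) = C(r+g,2)/2
        if 2 * (comb(r, 2) + comb(g, 2)) == comb(total, 2):
            fair_pairs.append((r, g))

print(fair_pairs)   # pairs with r + g = 4, 9, 16, 25: consecutive triangular numbers
```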
EXAMPLE 4.7 Randomized Response
We show here a geometric solution to the randomized response exploration in Chapter 2.
Those who do sample surveys are often interested in asking sensitive questions such
as “Are you an illegal drug user?” or “Have you ever committed a serious crime?” Asking
these questions directly would involve self-incrimination and would likely not produce honest
answers, so it would be unlikely that we could determine the percentage of drug users or
dangerous criminals with any degree of accuracy.
Here is a procedure for obtaining responses to sensitive survey questions that has proved
to be quite accurate. Suppose we have two questions:
1. Is your Social Security number even?
2. Are you an illegal drug user?
The interviewer then asks the person being interviewed to flip a fair coin (and not show
the result to the interviewer). If the coin comes up heads, he is to answer the first question; if
the coin comes up tails, he is to answer the second question.
So if the person answers “Yes,” we have no way of knowing whether his Social Security
number is even or if he is a drug user. But it is possible, if we draw a picture of the situation,
to estimate the proportion of drug users. Figure 4.9 should be very useful.
The square in the bottom left-hand corner has area 1/4, representing those who when
interviewed showed a head on the coin (probability 1/2) and who then answered the first ques-
tion “Yes” (also with probability 1/2). The rectangle on the bottom right-hand side represents
[Figure 4.9: unit square with 1/2 marked on the horizontal axis and 1/2 and p on the vertical axis]
those who when interviewed showed a tail on the coin (probability 1/2) and who answered
the drug question “Yes.” So the total area represents the proportion of those interviewed who
responded “Yes.”
Suppose that the proportion of people answering “Yes” is, say, 0.30. Then, comparing areas,
we have
1/4 +(1/2) · p = 0.30
so
p = 2(0.30 −1/4)
or
p = 0.10
So our estimate from this survey is that 10% of the population uses drugs.
᭿
This estimate is of course subject to some sampling variability, namely, in the proportion showing heads on the coin. This should not differ much from 1/2 if the sample is large, but could vary considerably from 1/2 in a small sample.
It is useful to see how our estimate of p, the true proportion who should answer
“Yes” to the sensitive question, varies with the proportion answering “Yes” in the
survey, say $p_s$. We have that

$1/4 + (1/2) \cdot p = p_s$

so that

$p = 2(p_s - 1/4)$

A graph of this straight line is shown in Figure 4.10 where we assume that $1/4 \le p_s \le 3/4$ since if $p_s = 1/4$, p = 0, and if $p_s = 3/4$, then p = 1.
[Figure 4.10: the estimate p plotted against the survey proportion p_s]
CONCLUSION
We have shown several examples in this chapter where problems in probability can
be used in geometry. These are unusual examples for a geometry class and they may
well motivate the student to draw graphs and draw conclusions from them. They are
also examples of practical situations producing reasons to find areas and equations of
straight lines, motivation that is often lacking in our classrooms.
EXPLORATIONS
Here are some interesting situations for classroom projects or group work.
1. Change the waiting times so that they are different for Joan and for Jim at the
library.
2. In the foul shooting example, suppose we do not know the frequency with
which C makes a foul shot, but we know the overall percentage of foul shots
made by the team. How good a foul shooter is C?
3. Suppose in randomized response that we have the subject interviewed draw
a ball from a sack of balls numbered from 1 to 100. If the number drawn
(unknown to the interviewer) is less than or equal to 60, he is to answer the
sensitive question; if the number is between 61 and 80, he is to simply say “Yes,”
otherwise, “No.” What is the estimate of the “Yes” answers to the sensitive
question?
4. Two numbers are chosen at random between 0 and 20.
(a) What is the probability that their sum is less than 25?
(b) What is the probability that the sum of their squares is less than 25?
5. In Exploration 4, draw pairs of random numbers with your computer or calcu-
lator and estimate the probability that the product of the numbers is less than
50.
Chapter 5
Random Variables and
Discrete Probability
Distributions—Uniform,
Binomial, Hypergeometric,
and Geometric Distributions
CHAPTER OBJECTIVES:
• to introduce random variables and probability distribution functions
• to discuss uniform, binomial, hypergeometric, and geometric probability distribution functions
• to discover some surprising results when random variables are added
• to encounter the “bell-shaped” curve for the first (but not the last!) time
• to use the binomial theorem with both positive and negative exponents.
INTRODUCTION
Suppose a hat contains slips of paper with the numbers 1 through 5 on them. A slip
is drawn at random and the number on the slip observed. Since the result cannot be
known in advance, the number is called a random variable. In general, a random
variable is a variable that takes on values on the points of a sample space.
Random variables are generally denoted by capital letters, such as X, Y, Z, and
so on. If we see the number 3 in the slip of paper experiment, we say that X = 3.
It is important to distinguish between the variable itself, say X, and one of its values,
usually denoted by small letters, say x.
Discrete random variables are those random variables that take on a finite, or
perhaps a countably infinite, number of values so the associated sample space has
a finite, or countably infinite, number of points. We discuss several discrete random
variables in this chapter. Later, we will discuss several random variables that can
take on an uncountably infinite number of values; these random variables are called
continuous random variables.
If the random variable is discrete, we call the function giving the probability the
random variable takes on any of its values, say P(X = x), the probability distribution
function of the random variable X. We will often abbreviate this as the PDF for the
random variable X.
Random variables occur with different properties and characteristics; we will
discuss some of these in this chapter. We begin with the probability distri-
bution function suggested by the example of drawing a slip of paper from a
hat.
DISCRETE UNIFORM DISTRIBUTION
The experiment of drawing one of five slips of paper from a hat at random suggests
that the probability of observing any of the numbers 1 through 5 is 1/5, that is,
P(X = x) = 1/5 for x = 1, 2, 3, 4, 5
is the PDF for the random variable X. This is called a discrete uniform probability
distribution function.
Not any function can serve as a probability distribution function. All discrete
probability distribution functions have these properties:
If f(x) = P(X = x) is the PDF for a random variable X, then
1) f(x) ≥ 0
2) Σ_{all x} f(x) = 1
These properties arise from the fact that probabilities must be nonnegative and
since some event must occur in the sample space, the sum of the probabilities over
the entire sample space must be 1.
In general, the discrete uniform probability distribution function is defined as
f(x) = P(X = x) = 1/n for x = 1, 2, ..., n
It is easy to verify that f(x) satisfies the properties of a discrete probability
distribution function.
Mean and Variance of a Discrete Random Variable
We pause now to introduce two numbers by which random variables can be charac-
terized, the mean and the variance.
The mean or the expected value of a discrete random variable X is denoted by μ_x or E(X) and is defined as
μ_x = E(X) = Σ_{all x} x · f(x)
This then is a weighted average of the values of X and the probabilities associated
with its values. In our example, we find
μ_x = E(X) = 1 · (1/5) + 2 · (1/5) + 3 · (1/5) + 4 · (1/5) + 5 · (1/5) = 3
In general, for the discrete uniform distribution we have
μ_x = E(X) = Σ_{all x} x · f(x) = Σ_{x=1}^{n} x · (1/n) = (1/n) · n(n + 1)/2 = (n + 1)/2
In our case, where n = 5, we have μ_x = E(X) = (5 + 1)/2 = 3, as before.
Since the expected value is a sum, we have
E(X±Y) = E(X) ±E(Y)
if X and Y are random variables defined on the same sample space.
If the random variable X is a constant, say X = c, then
E(X) = Σ_{all x} x · f(x) = c · Σ_{all x} f(x) = c · 1 = c
We now turn to another descriptive measure of a probability distribution, the
variance. This is a measure of how variable the probability distribution is. To mea-
sure this, we might subtract the mean value from each of the values of X and find
the expected value of the result. The thinking here is that values that depart markedly
from the mean value show that the probability distribution is quite variable. Un-
fortunately, E(X−μ) = E(X) −E(μ) = μ −μ = 0 for any random variable. The
problem here is that positive deviations from the mean exactly cancel out negative
deviations, producing 0 for any random variable. So we square each of the deviations
to prevent this and find the expected value of those deviations. We call this quantity
the variance and denote it by
σ² = Var(X) = E(X − μ)²
This can be expanded as
σ² = Var(X) = E(X − μ)² = E(X² − 2μX + μ²) = E(X²) − 2μE(X) + μ²
or
σ² = E(X²) − μ²
We will often use this form of σ² for computation. For the general discrete uniform distribution, we have
σ² = Σ_{x=1}^{n} x² · (1/n) − ((n + 1)/2)² = (1/n) · n(n + 1)(2n + 1)/6 − (n + 1)²/4
which is easily simplified to σ² = (n² − 1)/12. If n = 5, this becomes σ² = 2.
The positive square root of the variance, σ, is called the standard deviation.
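As a quick check of these formulas, here is a minimal Python sketch (our own illustration, not part of the text) that computes the mean and variance of the discrete uniform distribution directly from the definitions and compares them with (n + 1)/2 and (n² − 1)/12.

```python
# Mean and variance of the discrete uniform distribution on 1, 2, ..., n,
# computed from the definitions and compared with the closed forms.
def uniform_mean_var(n):
    f = 1 / n                                              # P(X = x) for every x
    mean = sum(x * f for x in range(1, n + 1))             # E(X)
    var = sum((x - mean) ** 2 * f for x in range(1, n + 1))  # E(X - mu)^2
    return mean, var

for n in (5, 10):
    mean, var = uniform_mean_var(n)
    print(n, mean, (n + 1) / 2, var, (n ** 2 - 1) / 12)
# For n = 5 this gives mean 3 and variance 2, as in the text.
```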
Intervals, σ, and German Tanks
Probability distributions that contain extreme values (with respect to the mean) in gen-
eral have larger standard deviations than distributions whose values are mostly close to
the mean. This will be dealt with later in this book when we consider confidence intervals.
For now, let us look at some uniform distributions and the percentage of those
distributions contained in an interval centered at the mean. Since μ = (n + 1)/2 and σ = √((n² − 1)/12) for the uniform distribution on x = 1, 2, . . . , n, consider the interval μ ± σ, which in this case is the interval
(n + 1)/2 ± √((n² − 1)/12)
We have used the mean plus or minus one standard deviation here.
If n = 10, for example, this is the interval (2.628, 8.372) that contains the values
3, 4, 5, 6, 7, 8 or 60% of the distribution.
Suppose now that we add and subtract k standard deviations from the mean for the general uniform discrete distribution; then the length of this interval is 2k · √((n² − 1)/12) and the percentage of the distribution contained in that interval is
2k · √((n² − 1)/12) / n = 2k · √((n² − 1)/(12n²))
The factor (n² − 1)/n², however, rapidly approaches 1 as n increases. So the percentage of the distribution contained in the interval is approximately 2k/√12 = k/√3 for reasonably large values of n. If k = 1, this is about 57.7% and if k = 1.5, this is about 86.6%; k of course must be less than √3 or else the entire distribution is covered by the interval.
So the more standard deviations we add to the mean, the more of the distribution
we cover, and this is true for probability distributions in general. This really seems to
be a bit senseless however since we know the value of n and so it is simple to figure
out what percentage of the distribution is contained in any particular interval.
We will return to confidence intervals later.
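Because n is known here, the exact coverage of the interval μ ± kσ is easy to tabulate; the short sketch below (ours, not the author's) counts the uniform values falling inside the interval and compares the result with the approximation k/√3.

```python
import math

# Exact coverage of mu +/- k*sigma for the discrete uniform distribution on
# 1, ..., n, compared with the large-n approximation k / sqrt(3).
def coverage(n, k):
    mu = (n + 1) / 2
    sigma = math.sqrt((n ** 2 - 1) / 12)
    inside = sum(1 for x in range(1, n + 1) if abs(x - mu) <= k * sigma)
    return inside / n

print(coverage(10, 1))                        # 0.6, the 60% quoted for n = 10
print(coverage(1000, 1), 1 / math.sqrt(3))    # both close to 0.577
```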
For now, what if we do not know the value of n? This is an entirely different matter.
This problem actually arose during World War II. The Germans numbered much of the military material they put into the battlefield. They numbered parts of motorized vehicles and parts of aircraft and tanks. So, when tanks, for example, were captured, the Germans unwittingly gave out information about how many tanks were in the field!
If we assume the tanks are numbered 1, 2, . . . , n (here is our uniform distribution!) and we have captured tanks numbered 7, 13, and 42, what is n?
This is not a probability problem but a statistical one since we want to estimate
the value of n from a sample. We will have much more to say about statistical problems
in later chapters and this one in particular. The reader may wish to think about this
problem and make his or her own estimate. Notice that the estimate must exceed 42,
but by how much? We will return to this problem later.
We now turn to another extremely useful situation, that of sums.
Sums
Suppose in the discrete uniform distribution with n = 5 (our slips of paper example)
we draw two slips of paper. Suppose further that the first slip is replaced before the
second is drawn so that the sampling for the first and second slips is done under
exactly the same conditions. What happens if we add the two numbers found? Does
the sum also have a uniform distribution?
We might think the answer to this is “yes” until we look at the sample space for
the sum shown below.
Sample Sum Sample Sum
1,1 2 3,4 7
1,2 3 3,5 8
1,3 4 4,1 5
1,4 5 4,2 6
1,5 6 4,3 7
2,1 3 4,4 8
2,2 4 4,5 9
2,3 5 5,1 6
2,4 6 5,2 7
2,5 7 5,3 8
3,1 4 5,4 9
3,2 5 5,5 10
3,3 6
Now we realize that sums of 4, 5, or 6 are fairly likely. Here is the probability
distribution of the sum:
X       2     3     4     5     6     7     8     9     10
f(x)  1/25  2/25  3/25  4/25  5/25  4/25  3/25  2/25  1/25
and the graph is as shown in Figure 5.1.
Figure 5.1 (horizontal axis: sum; vertical axis: frequency)
Even more surprising things occur when we increase the number of drawings, say to 5. Although the sample space now contains 5^5 = 3125 points, enumerating these is a daunting task to say the least. Other techniques can be used however to produce the graph in Figure 5.2.
This gives a “bell-shaped” curve. As we will see, this is far from uncommon
in probability theory; in fact, it is to be expected when sums are considered—and
the sums can arise from virtually any distribution or combination of these dis-
tributions! We will discuss this further in the chapter on continuous probability
distributions.
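One such technique is straightforward simulation. The sketch below is our own (the author does not specify how Figure 5.2 was produced); it draws five slips with replacement many times and tallies the sums, and the tallies show the bell shape of Figure 5.2.

```python
import random
from collections import Counter

# Simulate the sum of five draws, with replacement, from the slips 1..5
# and tally how often each possible sum occurs.
random.seed(1)
trials = 10_000
counts = Counter(sum(random.randint(1, 5) for _ in range(5))
                 for _ in range(trials))

for total in sorted(counts):
    print(total, counts[total])   # frequencies rise to a peak near 15, then fall
```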
Figure 5.2 (horizontal axis: sum; vertical axis: frequency)
BINOMIAL PROBABILITY DISTRIBUTION
The next example of a discrete probability distribution is called the binomial
distribution.
One of the most commonly occurring random variables is the one that takes one
of two values each time the experiment is performed. Examples of this include tossing
a coin, the result of which is a head or a tail; a newborn child is a female or male; a
vaccination against the flu is successful or nonsuccessful. Examples of this situation
are very common. We call these experiments binomial since, at each trial, the result
is one of two outcomes, which, for convenience, are called success (S) or failure (F).
We will make two further assumptions: that the trials are independent and that
the probabilities of success or failure remain constant from trial to trial. In fact,
we let P(S) = p and P(F) = q = 1 −p for each trial. It is common to let the
random variable X denote the number of successes in n independent trials of the
experiment.
Let us consider a particular example. Suppose we inspect an item as it is coming
off a production line. The item is good (G) or defective (D). If we inspect five items,
the sample space then consists of all the possible sequences of five items, each G or
D. The sample space then contains 2^5 = 32 sample points. We also suppose as above
that P(G) = p and P(D) = q = 1 −p, and if we let X denote the number of good
items, then we see that the possible values of X are 0, 1, 2, 3, 4, or 5. Now we must
calculate the probabilities of each of these events.
If X = 0, then, unfortunately, none of the items are good so, using the indepen-
dence of the events and the associated sample point,
P(X = 0) = P(DDDDD) = P(D) · P(D) · P(D) · P(D) · P(D) = q^5
How can X = 1? Then we must have exactly one good item and four defective
items. But that event can occur in five different ways since the good item can occur
at any one of the five trials. So
P(X = 1) = P(GDDDD or DGDDD or DDGDD or DDDGD or DDDDG)
= P(GDDDD) + P(DGDDD) + P(DDGDD) + P(DDDGD) + P(DDDDG)
= pq^4 + pq^4 + pq^4 + pq^4 + pq^4 = 5pq^4
Now P(X = 2) is somewhat more complicated since two good items and three defective items can occur in a number of ways. Any particular order will have probability q^3 · p^2 since the trials of the experiment are independent.
We also note that the number of orders in which there are exactly two good items must be C(5, 2), the number of ways in which we can select two positions for the
two good items from five positions in total. We conclude that
P(X = 2) = C(5, 2) p^2 q^3 = 10p^2 q^3
In an entirely similar way, we find that
P(X = 3) = C(5, 3) p^3 q^2 = 10p^3 q^2
and
P(X = 4) = C(5, 4) p^4 q = 5p^4 q and P(X = 5) = p^5
If we add all these probabilities together, we find
P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) + P(X = 4) + P(X = 5)
= q^5 + 5pq^4 + 10p^2 q^3 + 10p^3 q^2 + 5p^4 q + p^5
which we recognize as
(q + p)^5 = 1 since q + p = 1
Note that the coefficients in the binomial expansion add up to 32, so all the points
in the sample space have been used.
The occurrence of the binomial theorem here is one reason the probability dis-
tribution of X is called the binomial probability distribution.
The above situation can be generalized. Suppose now that we have
n independent trials, that X denotes the number of successes, and that
P(S) = p and P(F) = q = 1 −p. We see that
P(X = x) = C(n, x) p^x q^(n−x) for x = 0, 1, 2, . . . , n
This is the probability distribution function for the binomial random variable in
general.
We note that P(X = x) ≥ 0 and
Σ_{x=0}^{n} P(X = x) = Σ_{x=0}^{n} C(n, x) p^x q^(n−x) = (q + p)^n = 1,
so the properties of a discrete probability distribution function are satisfied.
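A few lines of Python (our own sketch, using only the standard library's math.comb) make the formula concrete: they evaluate the binomial PDF for n = 5, p = 0.3 and confirm that the probabilities sum to 1.

```python
from math import comb

# Binomial PDF: P(X = x) = C(n, x) * p^x * q^(n - x).
def binom_pmf(x, n, p):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 5, 0.3
probs = [binom_pmf(x, n, p) for x in range(n + 1)]
print(probs)        # the six probabilities for n = 5, p = 0.3
print(sum(probs))   # 1 (up to rounding), as required of a PDF
```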
Graphs of binomial distributions are interesting. We show some here where we
have chosen p = 0.3 for various values of n (Figures 5.3, 5.4, and 5.5).
Figure 5.3 Binomial distribution, n = 5, p = 0.3.
Figure 5.4 Binomial distribution, n = 15, p = 0.3.
Figure 5.5 Binomial distribution, n = 30, p = 0.3.
The graphs indicate that as n increases, the probability distributions become more "bell shaped" and strongly resemble what we will call, in Chapter 8, a continuous normal curve. This is in fact the case, although this fact will not be pursued here. One reason for not pursuing this is that exact calculations involving the
binomial distribution are possible using a statistical calculator or a computer alge-
bra system, and so we do not need to approximate these probabilities with a normal
curve, which we study in the chapter on continuous distributions. Here are some
examples.
EXAMPLE 5.1 A Production Line
A production line has been producing good parts with probability 0.85. A sample of 20 parts is
taken, and it is found that 4 of these are defective. Assuming a binomial model, is this a cause
for concern?
Let X denote the number of good parts in a sample of 20. Assuming that the probability of a good part is 0.85, we find that the probability the sample has at most 16 good parts is
P(X ≤ 16) = Σ_{x=0}^{16} C(20, x) (0.85)^x (0.15)^(20−x) = 0.352275
So this event is not unusual and one would probably conclude that the production line
was behaving normally and that although the percentage of defective parts has increased, the
sample is not a cause for concern.
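The cumulative probability used above is easy to reproduce; here is a minimal sketch (ours, standard library only) of that calculation.

```python
from math import comb

# P(X <= 16) for a binomial with n = 20 trials and success probability 0.85.
p_at_most_16 = sum(comb(20, x) * 0.85 ** x * 0.15 ** (20 - x)
                   for x in range(17))
print(round(p_at_most_16, 6))   # about 0.352275, as in Example 5.1
```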
EXAMPLE 5.2 A Political Survey
A sample survey of 100 voters is taken where actually 45% of the voters favor a certain
candidate. What is the probability that the sample will contain between 40% and 60% of voters
who favor the candidate?
We presume that a binomial model is appropriate. Note that the sample proportion of voters, say p_s, can be expressed in terms of the number of voters, say X, who favor the candidate. In fact, p_s = X/100, so
P(0.40 ≤ p_s ≤ 0.60) = P(0.40 ≤ X/100 ≤ 0.60) = P(40 ≤ X ≤ 60)
= Σ_{x=40}^{60} C(100, x) (0.45)^x (0.55)^(100−x) = 0.864808
So if candidates in fact did not know the true proportion of voters willing to vote for
them, the survey would be of little comfort since it indicates that they could either win or lose.
Increasing the sample size will, of course, increase the probability that the sample proportion
is within the range of 40–60%.
EXAMPLE 5.3 Seed Germinations
A biologist studying the germination rate of a certain type of seed finds that 90% of the seeds
germinate. She has a box of 45 seeds.
The probability that all the seeds germinate, letting X denote the number of seeds germi-
nating, is
P(X = 45) = C(45, 45) (0.90)^45 (0.10)^0 = (0.90)^45 = 0.0087280
So this is not a very probable event although the germination rate is fairly high.
The probability that at least 40 of the seeds germinate is
P(X ≥ 40) = Σ_{x=40}^{45} C(45, x) (0.90)^x (0.10)^(45−x) = 0.70772
Now suppose that the seller of the seeds wishes to advertise a "guarantee" that at least k of the seeds germinate. What should k be if the seller wishes to disappoint at most 5% of the buyers of the seeds?
Here, we wish to determine k so that P(X ≥ k) = 0.05, or P(X ≤ k) = 0.95. This can be
done only approximately.
Using a statistical calculator, we find the following in Table 5.1:
Table 5.1
k P(X ≤ k)
40 0.472862
41 0.671067
42 0.840957
43 0.947632
44 0.991272
45 1
It appears that one should fix the guarantee at 43 seeds. Since the variable is discrete, the
cumulative probabilities increase in discrete jumps, so it is not possible to find a 5% error rate
exactly. The table in fact indicates that this probability could be approximately 16%, 5%, or
1%, but no value in between these values.
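Table 5.1 itself can be generated with a few lines of Python; the sketch below (ours, not the author's) computes the cumulative probabilities P(X ≤ k) for the seed example.

```python
from math import comb

# Cumulative probabilities P(X <= k) for n = 45 seeds, germination rate 0.90.
def binom_cdf(k, n, p):
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(k + 1))

for k in range(40, 46):
    print(k, round(binom_cdf(k, 45, 0.90), 6))   # reproduces Table 5.1
```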
Mean and Variance of the Binomial Distribution
It will be shown in the next section that the following formulas apply to the binomial
distribution with n trials and p the probability of success at any trial.
μ = E(X) = Σ_{x=0}^{n} x · C(n, x) p^x (1 − p)^(n−x) = np
σ² = Var(X) = E(X − μ)² = Σ_{x=0}^{n} (x − μ)² · C(n, x) p^x (1 − p)^(n−x) = npq
In our above example, we calculate the mean value, E(X) = μ = 45 · 0.90 = 40.5, and the variance, Var(X) = npq = 45 · 0.90 · 0.10 = 4.05, meaning that the standard deviation is σ = √4.05 = 2.01246. We have seen that the standard deviation is a measure of the variation in the distribution. To illustrate this,
we calculate
P(μ − σ ≤ X ≤ μ + σ) = P(40.5 − 2.01246 ≤ X ≤ 40.5 + 2.01246)
= P(38.48754 ≤ X ≤ 42.51246)
= Σ_{x=38}^{43} C(45, x) (0.90)^x (0.10)^(45−x) = 0.871934.
Notice that we must round off some of the results since X can only assume integer
values. We also find that
P(μ − 2σ ≤ X ≤ μ + 2σ) = P(40.5 − 2 · 2.01246 ≤ X ≤ 40.5 + 2 · 2.01246)
= P(36.47508 ≤ X ≤ 44.52492)
= Σ_{x=36}^{45} C(45, x) (0.90)^x (0.10)^(45−x) = 0.987970.
We will return to these intervals in the chapter on continuous distributions.
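The one- and two-standard-deviation probabilities above can be checked numerically; the sketch below (our illustration) sums the binomial probabilities over the rounded intervals used in the text.

```python
from math import comb, sqrt

n, p = 45, 0.90
mu, sigma = n * p, sqrt(n * p * (1 - p))
print(mu, sigma)   # 40.5 and about 2.01246

def binom_sum(lo, hi):
    # Sum of binomial probabilities for lo <= X <= hi.
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x)
               for x in range(lo, hi + 1))

print(round(binom_sum(38, 43), 6))   # about 0.871934, the mu +/- sigma interval
print(round(binom_sum(36, 45), 6))   # about 0.987970, the mu +/- 2 sigma interval
```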
Sums
In our study of the discrete uniform distribution, when we took sums of independent
observations, graphs of those sums became "bell shaped", and we indicated that graphs
of sums in general became shaped that way. Could it be then that binomial probabil-
ities could be considered to be sums? The answer to this is “yes”. The reason is as
follows:
Consider a binomial process with n trials and probability of success at any particular trial, p. We define a random variable now for each one of the n trials as follows:
X_i = 1 if the ith trial is a success, and X_i = 0 if the ith trial is a failure
so X_i is 1 only if the ith trial is a success; it follows that the sum of the X_i's counts the total number of successes in n trials. That is,
X_1 + X_2 + X_3 + · · · + X_n = X
so the binomial random variable X is in fact a sum.
This explains the "bell shaped" curve we see when we graph the binomial distribution. The identity X_1 + X_2 + X_3 + · · · + X_n = X also provides an easy way to calculate the mean and variance of X. We find that E(X_i) = 1 · p + 0 · q = p and, since the expected value of a sum is the sum of the expected values,
E(X) = E(X_1 + X_2 + X_3 + · · · + X_n) = E(X_1) + E(X_2) + E(X_3) + · · · + E(X_n)
= p + p + p + · · · + p = np
as we saw earlier.
We will show later that the variance of a sum of independent random variables is the sum of the variances. Since Var(X_i) = E(X_i²) − [E(X_i)]² and E(X_i²) = p,
Var(X_i) = p − p² = p(1 − p) = pq
and it follows that
Var(X) = Var(X_1) + Var(X_2) + Var(X_3) + · · · + Var(X_n)
= pq + pq + pq + · · · + pq = npq
HYPERGEOMETRIC DISTRIBUTION
We now consider another very useful and frequently occurring discrete probability
distribution.
The binomial distribution assumes that the probability of an event remains con-
stant from trial to trial. This is not always an accurate assumption. We actually have
encountered this situation when we studied acceptance sampling in the previous chap-
ter; now we make the situation formal.
As an example, suppose that a small manufacturer has produced 11 machine parts
in a day. Unknown to him, the lot contains three parts that are not acceptable (D),
while the remaining parts are acceptable and can be sold (G). A sample of three parts
is taken, the sampling being done without replacement; that is, a selected part is
not replaced so that it cannot be sampled again.
This means that after the first part is selected, no matter whether it is good or
defective, the probability that the second part is good depends on the quality of the
first part. So the binomial model does not apply.
We can, for example, find the probability that the sample of three contains two
good and one unacceptable part. This event could occur in three ways.
P(2G, 1D) = P(GGD) + P(GDG) + P(DGG)
= (8/11) · (7/10) · (3/9) + (8/11) · (3/10) · (7/9) + (3/11) · (8/10) · (7/9) = 0.509
Notice that there are three ways for the event to occur, and each has the same
probability. Since the order of the parts is irrelevant, we simply need to choose two of
the good items and one of the defective items. This can be done in C(8, 2) · C(3, 1) = 84 ways. So
P(2G, 1D) = [C(8, 2) · C(3, 1)] / C(11, 3) = 84/165 = 0.509
as before. This is called a hypergeometric probability distribution function.
We can generalize this as follows. Suppose a manufactured lot contains D defective items and N − D good items. Let X denote the number of unacceptable items in a sample of n items. Then
P(X = x) = [C(D, x) · C(N − D, n − x)] / C(N, n), x = 0, 1, . . . , Min(n, D)
Since the sum covers all the possibilities,
Σ_{x=0}^{Min(n,D)} C(D, x) · C(N − D, n − x) = C(N, n),
the probabilities sum to 1 as they should.
It can be shown that the mean value is μ_x = n · (D/N), surprisingly like n · p in the binomial. The nonreplacement does not affect the mean value. It does affect the variance, however. The demonstration will not be given here, but
Var(X) = n · (D/N) · (1 − D/N) · (N − n)/(N − 1)
is like the binomial npq except for the factor (N − n)/(N − 1), which is often called a finite population correction factor.
To continue our example, we find the following values for the probability distribution function:
x           0       1      2     3
P(X = x)  56/165  28/55  8/55  1/165
Now, altering our sampling procedure from our previous discussion of acceptance
sampling, suppose our sampling is destructive and we can only replace the defective
items that occur in the sample. Then, for example, if we find one defective item in the
sample, we sell 2/11 defective product. So the average defective product sold under
this sampling plan is
(56/165) · (3/11) + (28/55) · (2/11) + (8/55) · (1/11) + (1/165) · (0/11) = 19.8%,
an improvement over the 3/11 = 27.3% had we proceeded without the sampling plan.
The improvement is substantial, but not as dramatic as what we saw in our first
encounter with acceptance sampling.
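The hypergeometric probabilities for this 11-part lot can be computed directly from the formula; the Python sketch below (our own, using exact fractions) reproduces the table above and the 19.8% figure.

```python
from math import comb
from fractions import Fraction

N, D, n = 11, 3, 3   # lot size, defective items in the lot, sample size

def hyper_pmf(x):
    # P(X = x) = C(D, x) * C(N - D, n - x) / C(N, n)
    return Fraction(comb(D, x) * comb(N - D, n - x), comb(N, n))

probs = [hyper_pmf(x) for x in range(min(n, D) + 1)]
print(probs)                                     # 56/165, 28/55, 8/55, 1/165
sold = sum(p * Fraction(D - x, N) for x, p in enumerate(probs))
print(float(sold))                               # about 0.198, i.e., 19.8%
```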
Other Properties of the Hypergeometric Distribution
Although the nonreplacement in sampling creates quite a different mathematical situ-
ation than that we encountered with the binomial distribution, it can be shown that the
hypergeometric distribution approaches the binomial distribution as the population
size increases.
It is also true that the graphs of the hypergeometric distribution show the same
“bell-shaped” characteristic that we have encountered several times now, and it will
be encountered again.
We end this chapter with another probability distribution that we have actually
seen before, the geometric distribution. There are hundreds of other discrete probabil-
ity distributions. Those considered here are a sample of these, although the sampling
has been purposeful; we have discussed some of the most common distributions.
GEOMETRIC PROBABILITY DISTRIBUTION
In the binomial distribution, we have a fixed number of trials, and the random variable
is the number of successes. In many situations, however, we wait for the first success,
and the number of trials to achieve that success is the random variable.
In Examples 1.4 and 1.12, we discussed a sample space in which we sampled items emerging from a production line that can be characterized as good (G) or defective (D). We discussed a waiting time problem, namely, waiting for a defective item to occur. We presumed that the selections are independent and showed the following sample space:
S = {D, GD, GGD, GGGD, . . .}
Later, we showed that no matter the size of the probability an item was good or
defective, the probability assigned to the entire sample space is 1.
Notice that in the binomial random variable, we have a fixed number of trials,
say n, and a variable number of successes. In waiting time problems, we have a given
number of successes (here 1); the number of trials to achieve those successes is the
random variable.
Let us begin with the following waiting time problem. In taking a driver's test, suppose that the probability the test is passed is 0.8, the trials are independent, and the probability of passing the test remains constant. Let the random variable X denote the number of trials necessary up to and including the trial on which the test is passed. Then, applying the assumptions we have made and letting q = 1 − p, we find the sample space (where T and F indicate, respectively, that the test has been passed or failed), values of X, and probabilities shown in Table 5.2.
Table 5.2
Sample point   X   Probability
T              1   p = 0.8
FT             2   qp = (0.2)(0.8)
FFT            3   q²p = (0.2)²(0.8)
FFFT           4   q³p = (0.2)³(0.8)
. . .
We see that if the first success occurs at trial number x, then it must be preceded
by exactly x −1 failures, so we conclude that
P(X = x) = q^(x−1) · p for x = 1, 2, 3, . . .
is the probability distribution function for the random variable X. This is called a
geometric probability distribution function.
The probabilities are obviously positive and their sum is
S = p + qp + q²p + q³p + · · · = 1
which we showed in Chapter 2 by multiplying the series by q and subtracting one
series from another. The occurrence of a geometric series here explains the use of the
word “geometric” in describing the probability distribution.
We also state here that E[X] = 1/p, a fact that will be proven in Chapter 7.
For example, if we toss a fair coin with p = 1/2, then the expected waiting time
for the first head to occur is 1/(1/2) = 2 tosses. The expected waiting time for our
new driver to pass the driving test is 1/(8/10) = 1.25 attempts.
We will consider the generalization of this problem in Chapter 7.
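A quick simulation (ours) supports the claim that E[X] = 1/p: it generates geometric waiting times for a fair coin and for the driving-test example and averages them.

```python
import random

# Average number of trials until the first success when each trial
# succeeds independently with probability p.
def average_waiting_time(p, trials=100_000):
    total = 0
    for _ in range(trials):
        tosses = 1
        while random.random() >= p:   # failure; keep trying
            tosses += 1
        total += tosses
    return total / trials

random.seed(2)
print(average_waiting_time(0.5))   # close to 2, the fair-coin case
print(average_waiting_time(0.8))   # close to 1.25, the driving-test case
```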
CONCLUSIONS
This has been a brief introduction to four of the most commonly occurring discrete
random variables. There are many others that occur in practical problems, but those
discussed here are the most important.
We will soon return to random variables, but only to random variables whose
probability distributions are continuous. For now, we pause and consider an appli-
cation of our work so far by considering seven-game series in sports. Then we will
return to discrete probability distribution functions.
EXPLORATIONS
1. For a geometric random variable with parameters p and q, let r denote the
probability that the first success occurs in no more than n trials.
(a) Show that r = 1 − q^n.
(b) Now let r vary and show a graph of r and n.
2. A sample of size 2 has been selected from the uniform distribution 1, 2, · · · , n,
but it is not known whether n = 5 or n = 6. It is agreed that if the sum of
the sample is 6 or greater, then it will be decided that n = 6. This decision
rule is subject to two kinds of errors: we could decide that n = 6 while in
reality n = 5, or we could decide that n = 5 while in reality n = 6. Find the
probabilities of each of these errors.
3. A lot of 100 manufactured items contains an unknown number, k, of defective
items. Items are selected from the lot and inspected, and the inspected items
are not replaced before the next item is drawn. The second defective item is
the fifth item drawn. What is k? (Try various values of k and select the value
that makes the event most likely.)
4. A hypergeometric distribution has N = 100, with D = 10 special items. Sam-
ples of size 4 are selected. Find the probability distribution of the number of
special items in the sample and then compare these probabilities with those
from a binomial distribution with p = 0.10.
5. Flip a fair coin, or use a computer to select random numbers 0 or 1, and verify
that the expected waiting time for a head to appear is 2.
Chapter 6
Seven-Game Series in Sports
CHAPTER OBJECTIVES:
• to consider seven-game play-off series in sports
• to discover when the winner of the series is in fact the better team
• to find the influence of winning the first game on winning the series
• to discuss the effect of extending the series beyond seven games.
INTRODUCTION
We pause now in our formal development of probability and statistics to concentrate
on a particular application of the theory and ideas we have developed so far. Seven-
game play-off series in sports such as basketball play-offs and the World Series present
a waiting time problem. In this case, we wait until one team has won a given number
of games. We analyze this problem in some depth.
SEVEN-GAME SERIES
In a seven-game series, the series ends when one team has won four games. This is
another waiting time problem and gives us a finite sample space as opposed to the
infinite sample spaces we have considered.
Let the teams be A and B with probabilities of winning an individual game as p
and q, where p +q = 1. We also assume that p and q remain constant throughout
the series and that the games are independent; that is, winning or losing a game has
no effect on winning or losing the next game.
We look first at the sample space. The series can last four, five, six, or seven
games. We show here the ways in which team A can win the series.
Four games: AAAA
Five games: BAAAA, ABAAA, AABAA, AAABA
Six games: BBAAAA, BABAAA, BAABAA, BAAABA, ABBAAA, ABABAA, ABAABA, AABBAA, AABABA, AAABBA
Seven games: BBBAAAA, BBABAAA, BBAABAA, BBAAABA, BABBAAA, BABABAA, BABAABA, BAABBAA, BAABABA, BAAABBA, ABBBAAA, ABBABAA, ABBAABA, ABABBAA, ABABABA, ABAABBA, AABBBAA, AABBABA, AABABBA, AAABBBA
To write out the points where B wins the series, interchange the letters A and B
above. Note that the number of ways in which the series can be played in n games is easily counted. The last game must be won by A, say, so in the previous n − 1 games, A must win exactly three games and this can be done in C(n − 1, 3) ways. For example, there are C(5, 3) = 10 ways in which A can win the series in six games. So there are
C(4 − 1, 3) + C(5 − 1, 3) + C(6 − 1, 3) + C(7 − 1, 3) = 35
ways for A to win the series and so 70 ways in which the series can be played.
These points are not equally likely, however, so we assign probabilities now to
the sample points, where either team can win the series:
P(4 game series) = p^4 + q^4
P(5 game series) = 4p^4 q + 4q^4 p
P(6 game series) = 10p^4 q^2 + 10q^4 p^2
P(7 game series) = 20p^4 q^3 + 20q^4 p^3
These probabilities add up to 1 when the substitution q = 1 − p is made. We also see that
P(A wins the series) = p^4 + 4p^4 q + 10p^4 q^2 + 20p^4 q^3
and this can be simplified to
P(A wins the series) = 35p^4 − 84p^5 + 70p^6 − 20p^7
by again making the substitution q = 1 − p.
This formula gives some interesting results. In Table 6.1, we show p, the prob-
ability that team A wins a single game, and P, the probability that team A wins the
series.
Table 6.1
p P(A wins the series)
0.40 0.2898
0.45 0.3917
0.46 0.4131
0.47 0.4346
0.48 0.4563
0.49 0.4781
0.50 0.5000
0.51 0.5219
0.52 0.5437
0.53 0.5654
0.54 0.5869
0.55 0.6083
0.60 0.7102
0.70 0.8740
0.80 0.9667
It can be seen, if the teams are fairly evenly matched, that the probability of
winning the series does not differ much from the probability of winning a single
game!
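Table 6.1 follows directly from the polynomial above; the short Python sketch below (ours) evaluates P(A wins the series) for several values of p.

```python
# P(A wins a seven-game series) = 35p^4 - 84p^5 + 70p^6 - 20p^7.
def p_wins_series(p):
    return 35 * p ** 4 - 84 * p ** 5 + 70 * p ** 6 - 20 * p ** 7

for p in (0.40, 0.45, 0.50, 0.55, 0.60, 0.70, 0.80):
    print(p, round(p_wins_series(p), 4))   # reproduces Table 6.1
```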
The series is then not very discriminatory in the sense that the winner of the series
is not necessarily the stronger team. The graph in Figure 6.1 shows the probability
of winning the series and the probability of winning a single game. The maximum
difference occurs when the probability of winning a single game is 0.739613 or
0.260387. The difference is shown in Figure 6.1.
What is the expected length of the series? To find this, we calculate
A = 4(p^4 + q^4) + 5(4p^4 q + 4q^4 p) + 6(10p^4 q^2 + 10q^4 p^2) + 7(20p^4 q^3 + 20q^4 p^3)
Figure 6.1 Probability of winning the series and the probability of winning a single game.
This can be simplified to
A = 4(1 + p + p^2 + p^3 − 13p^4 + 15p^5 − 5p^6)
A graph of A as a function of p is shown in Figure 6.2.
Figure 6.2 Expected length of the series.
The graph shows that in the range 0.45 ≤ p ≤ 0.55, the average length of the series stays close to 5.8 games.
WINNING THE FIRST GAME
Since the probability of winning the series is not much different from that of winning
a single game, we consider the probability that the winner of the first game wins the
series. From the sample space, we see that
P(winner of the first game wins the series) = p^3 + 3p^3 q + 6p^3 q^2 + 10p^3 q^3
and this can be written as a function of p as
P(winner of the first game wins the series) = p^3 (20 − 45p + 36p^2 − 10p^3)
A graph of this probability as a function of p is shown in Figure 6.3.
Figure 6.3 Probability that the winner of the first game wins the series.
The straight line gives the probability that an individual game is won. The other
graph shows the probability that the winner of the first game wins the series. The
graphs intersect at p = 0.347129; so, if p is greater than this, then the winner of the
first game is more likely to win the series.
How Long Should the Series Last?
We have found that the winner of a seven-game series is not necessarily the better
team. In fact, if the teams are about evenly matched, the probability that the weaker
team wins the series is about equal to the probability that the team wins an individual
game.
Perhaps a solution to this is to increase the length of the series so that the proba-
bility that the winner of the series is in fact the stronger team is increased.
How long, then, should the series last?
We presume the series is over when, in a series of n games, one team has won
(n +1)/2 games.
If n = 7, then the winner must win four games.
If n = 9, then the winner must win five games.
Now we find a formula for the probability that a given team wins a series of
n games. Call the winner of the series A and suppose the probability that A wins an
individual game is p and the probability that the loser of the series wins an individual
game is q. If A wins the series, then A has won (n +1)/2 games and has won
(n + 1)/2 − 1 = (n − 1)/2 games previously to winning the last game. During this
time, the loser of the series has won x games (where x can be as low as 0 and at
most (n −1)/2 games). So there are (n −1)/2 +x = (n +2x −1)/2 games played
before the final game. It follows that
P(A wins the series) = p^((n+1)/2) Σ_{x=0}^{(n−1)/2} C((n + 2x − 1)/2, x) q^x
If n = 7, this becomes
P(A wins the series) = p^4 [1 + C(4, 1) q + C(5, 2) q^2 + C(6, 3) q^3]
and if n = 9, this becomes
P(A wins the series) = p^5 [1 + C(5, 1) q + C(6, 2) q^2 + C(7, 3) q^3 + C(8, 4) q^4]
Increasing the number of games to nine changes slightly the probability that A
wins the series, as shown in Table 6.2.
Table 6.2 Probability of Winning the Series
p      Seven-game series   Nine-game series
0.45   0.391712            0.378579
0.46   0.413058            0.402398
0.47   0.434611            0.426525
0.48   0.456320            0.450886
0.49   0.478134            0.475404
0.50   0.500000            0.500000
0.51   0.521866            0.524596
0.52   0.543680            0.549114
0.53   0.565389            0.573475
0.54   0.586942            0.597602
0.55   0.608288            0.621421
The difference between the two series can be seen in Figure 6.4.
Figure 6.4 Probability that A wins the series as a function of p, for seven- and nine-game series.
The steeper curve is that for a nine-game series.
Now suppose we want P(A wins the series) to be some high probability, say 0.95.
This requires that the graph, such as one given in Figure 6.4, should pass through the
point (p, 0.95) for some value of n. Suppose again that p is the probability that A
wins an individual game and that p > 0.5. How many games should be played?
To solve this in theory, put P(A wins the series) = 0.95 and solve the resulting
equation for n. The equation to be solved is then
p^((n+1)/2) Σ_{x=0}^{(n−1)/2} C((n + 2x − 1)/2, x) q^x = 0.95
where we know p and q.
This cannot be done exactly. So, we defined a function h[p, n] as follows:
h[p, n] = p^((n+1)/2) Σ_{x=0}^{(n−1)/2} C((n + 2x − 1)/2, x) q^x
and then experimented with values of p and n until a probability of about 0.95 was
achieved. The results are shown in Table 6.3.
Table 6.3
p Number of games (n)
0.60 63
0.59 79
0.58 99
0.57 131
0.56 177
0.55 257
0.54 401
0.53 711
0.52 1601
0.51 > 4000
It is apparent that the length of the series becomes far too great, even if p = 0.60,
so that the teams are really quite unevenly matched. As the teams approach parity,
the number of games required grows quite rapidly. In the case where p = 0.51, the
number of games exceeds 4000! In fact h[0.51, 4000] = 0.90, so even with 4000
games (which is difficult to imagine!) we have not yet achieved a probability of 0.95!
This should be overly convincing that waiting until a team has won (n +1)/2
games is not a sensible plan for deciding which team is stronger. We will not pursue
it here, but tennis turns out to be a fairly good predictor of who the better player is;
this is because the winner of a game must win at least two points in a row and the
winner of a set must win at least two games in a row.
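The experimentation described above is easy to reproduce; the Python sketch below (ours) implements h[p, n] with exact binomial coefficients so that entries of Tables 6.2 and 6.3 can be checked.

```python
from math import comb

def h(p, n):
    # P(A wins a series of n games) when A wins each game with probability p;
    # the series ends when one team has won (n + 1)/2 games.
    q = 1 - p
    half = (n - 1) // 2
    return p ** (half + 1) * sum(comb(half + x, x) * q ** x
                                 for x in range(half + 1))

print(round(h(0.55, 7), 6))    # 0.608288, agreeing with Tables 6.1 and 6.2
print(round(h(0.55, 257), 3))  # about 0.95, the target used for Table 6.3
```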
CONCLUSIONS
We have studied seven-game series in sports and have concluded that the first team
to win four games is by no means necessarily the stronger team. The winner of the
first game has a decided advantage in winning the series.
Extending the number of games so that the winner of a lengthy series is most probably the better team is quite unfeasible.
EXPLORATIONS
1. Verify that the probabilities given for the seven-game series where p is the
probability that a given team wins a game and q = 1 −p is the probability
that a given team loses a game add up to 1.
2. Compare the actual record of the lengths of the World Series that have been
played to date with the probabilities of the lengths assuming the teams to be
evenly matched. Does the actual record suggest that the teams are not evenly
matched?
3. Compare the actual expected length of the series with the theoretical expected
length. Do the data suggest that the teams are not evenly matched?
4. Find data showing the winner of the first game in a seven-game series and
determine how often that team wins the series.
Chapter 7
Waiting Time Problems
CHAPTER OBJECTIVES:
• to develop the negative binomial probability distribution
• to apply the negative binomial distribution to a quality control problem
• to show some practical applications of geometric series
• to show how to sum some series which are not geometric (without calculus)
• to encounter the Fibonacci sequence when tossing a coin
• to show an unfair game with a fair coin
• to introduce the negative hypergeometric distribution
• to consider an (apparent) logical contradiction.
We now turn our attention to some problems usually not considered in an introduc-
tory course in probability and statistics, namely, waiting time problems. As we have
seen in problems involving the binomial probability distribution, we consider a fixed
number of trials and calculate the probability of a given number of “successes.” Now
we consider problems where we wait for a success, or a given number of successes,
or some pattern of successes and failures.
WAITING FOR THE FIRST SUCCESS
Recall, from Chapter 5, that in a binomial situation we have an experiment with one of
the two outcomes, which, for lack of better terms, we call “success” and “failure.” We
must also have a constant probability of success, say p, and consequently a constant
probability of failure, say q where of course p +q = 1. It is also necessary for the
experiments, or trials, to be independent. In this situation, it is common to define a
random variable, X, denoting the number of successes in n independent trials. We
have seen that the probability distribution function is
P(X = x) = C(n, x) p^x q^(n−x), x = 0, 1, 2, . . . , n
and we have verified that the resulting probabilities add up to 1.
Tossing a two-sided loaded coin is a perfect binomial model.
Now, however, we assume the binomial presumptions, but we do not have a fixed
number of trials, rather we fix the number of successes and then the number of trials
becomes the random variable.
Let us begin by waiting for the first success, as we did in Chapter 5.
The sample space and associated probabilities are shown in Table 7.1. S denotes
a success and F denotes a failure.
Table 7.1
Sample point   Probability   Number of trials
S              p             1
FS             qp            2
FFS            q²p           3
FFFS           q³p           4
. . .
Now, again using the symbol X, which now denotes the number of trials necessary to achieve the first success, it is apparent that, since we must have x − 1 failures followed by a success,
P(X = x) = q^(x−1) p, x = 1, 2, 3, . . .
where the values for X now have no bound. We called this a geometric probability
distribution.
We have shown that the probabilities sum to 1 so we have established that we
have a true probability distribution function. Now we consider a very specific example
and after that we will generalize this problem.
THE MYTHICAL ISLAND
Here is an example of our waiting time problem. On a mythical island, couples are
allowed to have children until a male child is born. What effect, if any, does this have
on the male:female ratio on the island? The answer may be surprising.
Suppose that the probabilities of a male birth or a female birth are each 1/2 (which
is not the case in actuality) and that the births are independent of each other.
The sample space now is shown in Table 7.2.
We know that S = p + qp + q²p + q³p + · · · = 1 and here p = q = 1/2, so we have a probability distribution. We now want to know the expected number of males
Table 7.2
Sample point   Probability
M              1/2
FM             (1/2) · (1/2) = 1/4
FFM            (1/2) · (1/2) · (1/2) = 1/8
FFFM           (1/2) · (1/2) · (1/2) · (1/2) = 1/16
. . .
in a family. To find this, let
A_M = 1 · (1/2) + 1 · (1/4) + 1 · (1/8) + 1 · (1/16) + · · ·
then
(1/2) A_M = 1 · (1/4) + 1 · (1/8) + 1 · (1/16) + · · ·
and subtracting the second series from the first series,
(1/2) A_M = 1/2
so A_M = 1. Now the average family size is
A = 1 · (1/2) + 2 · (1/4) + 3 · (1/8) + 4 · (1/16) + · · ·
and so
(1/2) A = 1 · (1/4) + 2 · (1/8) + 3 · (1/16) + 4 · (1/32) + · · ·
Subtracting these series gives
A − (1/2) A = 1 · (1/2) + 1 · (1/4) + 1 · (1/8) + 1 · (1/16) + · · · = 1, so A = 2.
Since the average number of male children in a family is 1, so is the average
number of females, giving the male:female ratio as 1:1, just as it would be if the
restrictive rule did not apply!
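A small simulation (ours, not from the text) makes this conclusion easy to believe: it generates many families under the stopping rule and reports the average family size and the overall male:female ratio.

```python
import random

# Each family has children until the first male is born; each birth is
# male or female with probability 1/2, independently.
random.seed(3)
families = 100_000
males = females = 0
for _ in range(families):
    while random.random() < 0.5:   # a female is born; the family continues
        females += 1
    males += 1                     # the family stops at the first male

print((males + females) / families)   # average family size, close to 2
print(males / females)                # male:female ratio, close to 1
```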
We now generalize the waiting time problem beyond the first success.
WAITING FOR THE SECOND SUCCESS
The sample space and associated probabilities when we wait for the second success
are shown in Table 7.3. Note that we are not necessarily waiting for two successes in
a row. We will consider that problem later in this chapter.
It is clear that among the first x −1 trials, we must have exactly one success.
We conclude that if the second success occurs at the xth trial, then the first x −1
trials must contain exactly one success (and is the usual binomial process), and this
is followed by the second success when the experiment ends.
Table 7.3
Sample point   Probability   Number of trials
SS             p²            2
FSS            qp²           3
SFS            qp²           3
FFSS           q²p²          4
FSFS           q²p²          4
SFFS           q²p²          4
. . .
We conclude that
P(X = x) = C(x − 1, 1) p q^(x−2) · p
or
P(X = x) = C(x − 1, 1) p^2 q^(x−2), x = 2, 3, 4, · · ·
Again we must check that we have assigned a probability of 1 to the entire sample
space.
Adding up the probabilities we see that
Σ_{x=2}^{∞} P(X = x) = p^2 Σ_{x=2}^{∞} C(x − 1, 1) q^(x−2)
Call the summation T so
T = Σ_{x=2}^{∞} C(x − 1, 1) q^(x−2) = Σ_{x=2}^{∞} (x − 1) q^(x−2) = 1 + 2q + 3q^2 + 4q^3 + 5q^4 + · · ·
Now qT = q + 2q^2 + 3q^3 + 4q^4 + · · · and subtracting qT from T it follows that (1 − q)T = 1 + q + q^2 + q^3 + · · ·, which is a geometric series. So
(1 − q)T = 1/(1 − q) and so T = 1/(1 − q)^2 = 1/p^2
and since
Σ_{x=2}^{∞} P(X = x) = p^2 Σ_{x=2}^{∞} C(x − 1, 1) q^(x−2)
it follows that
Σ_{x=2}^{∞} P(X = x) = p^2 · (1/p^2) = 1
We used the process of multiplying the series T = 1 + 2q + 3q^2 + 4q^3 + · · · by q in order to sum the series because this process will occur again. We could also have noted that T = 1/(1 − q)^2 = (1 − q)^(−2) = 1/p^2 by the binomial expansion with
a negative exponent. We will generalize this in the next section and see that our last
two examples are special cases of what we will call the negative binomial distribution.
WAITING FOR THE rth SUCCESS
We now generalize the two special cases we have done and consider waiting for the
rth success where r = 1, 2, 3, . . . . Again the random variable X denotes the waiting
time or the total number of trials necessary to achieve the rth success.
It is clear that if the xth trial is the rth success, then the previous x −1 trials must
contain exactly r −1 successes by a binomial process and in addition the rth success
occurs on the xth trial. So
Σ_{x=r}^{∞} P(X = x) = Σ_{x=r}^{∞} C(x − 1, r − 1) p^(r−1) q^(x−r) · p = p^r Σ_{x=r}^{∞} C(x − 1, r − 1) q^(x−r)
As in the special cases, consider
T = Σ_{x=r}^{∞} C(x − 1, r − 1) q^(x−r) = 1 + C(r, r − 1) q + C(r + 1, r − 1) q^2 + C(r + 2, r − 1) q^3 + · · ·
But this is the expansion of (1 − q)^(−r), so
Σ_{x=r}^{∞} P(X = x) = p^r Σ_{x=r}^{∞} C(x − 1, r − 1) q^(x−r) = p^r (1 − q)^(−r) = p^r p^(−r) = 1
so our assignment of probabilities produces a probability distribution function. The function is known as the negative binomial distribution due to the occurrence of a binomial expansion with a negative exponent and is defined the way we have above as
P(X = x) = C(x − 1, r − 1) p^r q^(x−r), x = r, r + 1, r + 2, ...
When r = 1 we wait for the first success and the probability distribution becomes P(X = x) = pq^(x−1), x = 1, 2, 3, ..., as we saw above; if r = 2, the probability distribution becomes P(X = x) = (x − 1) p^2 q^(x−2), x = 2, 3, 4, ..., again as we found above.
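The negative binomial probabilities are simple to compute; here is a minimal Python sketch (our own) for the PDF just derived, together with a check that the probabilities sum to 1.

```python
from math import comb

# Negative binomial PDF: probability that the r-th success occurs on trial x.
def neg_binom_pmf(x, r, p):
    return comb(x - 1, r - 1) * p ** r * (1 - p) ** (x - r)

r, p = 2, 0.5
probs = [neg_binom_pmf(x, r, p) for x in range(r, 60)]
print(probs[:4])    # 0.25, 0.25, 0.1875, 0.125 for x = 2, 3, 4, 5
print(sum(probs))   # very close to 1, as the derivation shows
```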
MEAN OF THE NEGATIVE BINOMIAL
The expected value of the negative binomial random variable is easy to find.
E[X] = Σ_{x=r}^{∞} x · C(x − 1, r − 1) p^r q^(x−r) = p^r · r Σ_{x=r}^{∞} C(x, r) q^(x−r)
= p^r · r [1 + C(r + 1, 1) q + C(r + 2, 2) q^2 + C(r + 3, 3) q^3 + · · ·]
The quantity in the square brackets is the expansion of (1 − q)^(−(r+1)) and so
E[X] = p^r · r · (1 − q)^(−(r+1)) = r/p
If p is fixed, this is a linear function of r as might be expected. If we wait for the
first head in tossing a fair coin, r = 1 and p = 1/2 so our average waiting time is two
tosses.
COLLECTING CEREAL BOX PRIZES
A brand of cereal promises one of six prizes in a box of cereal. On average, how many
boxes must a person buy in order to collect all the prizes?
Here r above remains at 1 but the value of p changes as we collect the
coupons. The first box yields the first prize, but then the average waiting time
to find the next prize is 1/(5/6), the average waiting time for the next prize is
1/(4/6), and so on, giving the total number of boxes bought on average to be
1 +1/(5/6) +1/(4/6) +1/(3/6) +1/(2/6) +1/(1/6) = 14.7 boxes. Note that the
waiting time increases as the number of prizes collected increases.
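This is the classical coupon collector calculation; the sketch below (ours) adds the geometric waiting times for each new prize and also checks the answer by simulation.

```python
import random

prizes = 6

# Expected number of boxes: the sum of waiting times 1/p as p decreases.
expected = sum(1 / ((prizes - k) / prizes) for k in range(prizes))
print(expected)   # 14.7 boxes

# Simulation: buy boxes until every prize has been seen at least once.
random.seed(4)
trials = 20_000
total = 0
for _ in range(trials):
    seen = set()
    while len(seen) < prizes:
        seen.add(random.randrange(prizes))
        total += 1
print(total / trials)   # close to 14.7
```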
HEADS BEFORE TAILS
Here is another game with a coin, this time a loaded one.
Let p be the probability a head occurs when the coin is tossed. A running count
of the heads and tails is kept; we want the probability that the heads count reaches
three before the tails count reaches four. Let us call this event “3 heads before 4
tails”.
If the event is to occur, we must throw the third head on the last trial and this
must be preceded by at most three tails. So if we let x denote the number of tails then
the random variable x must be 0, 1, 2, or 3. So the tails must occur in the first 2 + x
trials (we need two heads and x tails) and of course we must have three heads in the
final result.
The probability of this is then
P(3 heads before 4 tails) = p^3 Σ_{x=0}^{3} (1 − p)^x C(2 + x, 2)
= p^3 [1 + 3q + 6q^2 + 10q^3] = p^3 [6(1 − p)^2 − 3p + 10(1 − p)^3 + 4]
= 36p^5 − 45p^4 − 10p^6 + 20p^3
A graph of this function is shown in Figure 7.1 for various values of p.
Figure 7.1 (horizontal axis: p; vertical axis: probability)
Let us generalize the problem so that we want the probability that a heads occur
before b tails.
The last toss must be a head. Then, of the preceding tosses, exactly a −1 must
be heads and x must be tails, where x is at most b −1, so
P(a heads before b tails) = Σ_{x=0}^{b−1} p^a q^x C(a − 1 + x, x)
Now let us fix the number of tails, say let b = 5. (Note that a was fixed above at 3.) So
P(a heads before 5 tails) = Σ_{x=0}^{4} p^a q^x C(a − 1 + x, x)
A graph of this function is in Figure 7.2 where we have taken p = 0.6.
Finally, we show in Figure 7.3 a graph of the surface when both a and p are
allowed to vary.
Figure 7.2 (horizontal axis: a; vertical axis: probability, with p = 0.6)
Figure 7.3 (the probability surface as both a and p vary)
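The formula for P(a heads before b tails) is easy to evaluate; the Python sketch below (ours) computes it for the case worked above and for the b = 5 family graphed in Figure 7.2.

```python
from math import comb

# Probability that a heads occur before b tails when the coin shows a head
# with probability p: the last toss is a head, preceded by a - 1 heads and
# at most b - 1 tails.
def heads_before_tails(a, b, p):
    q = 1 - p
    return sum(p ** a * q ** x * comb(a - 1 + x, x) for x in range(b))

print(heads_before_tails(3, 4, 0.5))   # 0.65625 for a fair coin
for a in (1, 3, 5, 10):
    print(a, round(heads_before_tails(a, 5, 0.6), 4))
```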
WAITING FOR PATTERNS
We have considered the problem of waiting for the rth head in coin tossing; now
we turn to some unusual problems involving waiting for more general patterns in
binomial trials. We will encounter some interesting mathematical consequences and
we will show an unfair game with a fair coin.
We begin with a waiting time problem that involves waiting for two successes
in a row. Note that this differs from waiting for the second success that is a negative
binomial random variable. Let us begin with the points in the sample space where we
have grouped the sample points by the number of experiments necessary. These are
shown in Table 7.4.
Table 7.4
Sample points                              Number of points
HH                                         1
THH                                        1
TTHH, HTHH                                 2
TTTHH, THTHH, HTTHH                        3
TTTTHH, TTHTHH, THTTHH, HTTTHH, HTHTHH     5
. . .
We see that the number of points is 1, 1, 2, 3, 5, . . . . The Fibonacci sequence
begins with 1, 1; subsequent terms are found by adding the two immediately preceding
terms. Can it be that the number of points follows the Fibonacci sequence? Of course,
we cannot conclude that just because the pattern holds in the first few cases.
But the Fibonacci pattern does hold here! Here is why: if two heads in a row
occur on the nth toss, then either the sequence begins with T followed by HH in
n −1 tosses or the sequence begins with HT (to avoid the pattern HH) followed by
HH in n −2 tosses. So the number of points in the sample space is found by writing
T before every point giving HH in n −1 tosses and writing HT before every point
giving HH in n −2 tosses. So the total number of points in the sample space for
the occurrence of HH in n tosses is the sum of the number of points for which HH
occurs in n −1 tosses and the number of points in which HH occurs in n −2 tosses,
the Fibonacci sequence.
Here, for example, using the points in the previous table, are the points for which
HH occurs for the first time at the seventh toss:
T|TTTTHH
T|TTHTHH
T|THTTHH
T|HTTTHH
T|HTHTHH
HT|TTTHH
HT|THTHH
HT|HTTHH
The assignment of probabilities to the sample points is easy, but the pattern they
follow is difficult and cannot be simply stated.
EXPECTED WAITING TIME FOR HH
To calculate this expectation, let a_n denote the probability that the event HH occurs at the nth trial. Then, using the argument leading to the Fibonacci series above, it follows that
a_n = q a_{n−1} + qp a_{n−2}
for n > 2 and we take a_1 = 0 and a_2 = p^2.
This formula is a recursion and we will study this sort of formula in Chapter 16.
Now multiply this recursion through by n and sum this result from n = 3 to
infinity to find
Σ_{n=3}^{∞} n a_n = q Σ_{n=3}^{∞} n a_{n−1} + qp Σ_{n=3}^{∞} n a_{n−2}
which we can also write as
Σ_{n=3}^{∞} n a_n = q Σ_{n=3}^{∞} [(n − 1) + 1] a_{n−1} + qp Σ_{n=3}^{∞} [(n − 2) + 2] a_{n−2}
or
Σ_{n=3}^{∞} n a_n = q Σ_{n=3}^{∞} (n − 1) a_{n−1} + q Σ_{n=3}^{∞} a_{n−1} + qp Σ_{n=3}^{∞} (n − 2) a_{n−2} + 2qp Σ_{n=3}^{∞} a_{n−2}
This simplifies, since Σ_{n=1}^{∞} n a_n = E[N] and Σ_{n=1}^{∞} a_n = 1, to
E[N] − 2a_2 = q E[N] + q + qp E[N] + 2qp
or
(1 − q − qp) E[N] = 2p^2 + q + 2qp = 1 + p
so E[N] = (1 + p)/(1 − q − qp) = (1 + p)/p^2. This can also be written as E[N] = 1/p^2 + 1/p.
It might be thought that this expectation would be just 1/p^2, but that is not quite the case. For a fair coin, this expectation is six tosses.
It is fairly easy to see that if we wait for HHH then we get a "super" Fibonacci
sequence in which we start with 1, 1, 1 and obtain subsequent terms by adding the
previous three terms in the sequence.
As a second example, let us consider waiting for the pattern TH. Table 7.5 shows
some of the sample points.
Now the number of points in the sample space is simple. Suppose TH occurs on
the nth toss. The sample point then begins with 0, 1, 2, 3, ..., n −2 H’s. So there are
n −1 points for which TH occurs on the nth toss.
Table 7.5
Sample point
TH
TTH
HTH
TTTH
HTTH
HHTH
TTTTH
HTTTH
HHTTH
HHHTH
.
.
.
This observation makes the sample space fairly easy to write out and it also makes
the calculation of probabilities fairly simple. Note that the probability of any sample
point has the factor qp for the sequence TH at the nth toss.
If the sample point begins with no heads, then its probability has a factor of q^(n−2).
If the sample point begins with H, then its probability has a factor of pq^(n−3).
If the sample point begins with HH, then its probability has a factor of p^2 q^(n−4).
This pattern continues until we come to the sample point with n − 2 H's followed by TH. The probability of this point has a factor of p^(n−2).
Consider the case for n = 5 shown in the sample space above. The probabilities of the points add to
qp(q^3 + pq^2 + p^2 q + p^3)
This can be recognized as qp(q^4 − p^4)/(q − p). This pattern continues, and by letting X denote the total number of tosses necessary, we find that
P(X = n) = qp(q^(n−1) − p^(n−1))/(q − p) for n = 2, 3, 4, . . .
The formula would not work for q = p = 1/2.
In that case the sample points are equally likely, each having probability (1/2)^n and, as we have seen, there are n − 1 of them. It follows in the case where q = p = 1/2 that
P(X = n) = (n − 1)/2^n for n = 2, 3, 4, . . .
EXPECTED WAITING TIME FOR TH
Using the formula above,
E[N] = Σ_{n=2}^{∞} n · qp(q^(n−1) − p^(n−1))/(q − p)
To calculate this sum, consider
S = Σ_{n=2}^{∞} n · q^(n−1) = 2q + 3q^2 + 4q^3 + · · ·
Now S + 1 = 1 + 2q + 3q^2 + 4q^3 + · · · and we have seen that the left-hand side of this equation is 1/p^2. So S = 1/p^2 − 1.
E[N] = [qp/(q − p)] · [(1/p^2 − 1) − (1/q^2 − 1)] = 1/(qp).
The formula above applies only if p ≠ q. In the case p = q, we first consider
P(X = n + 1)/P(X = n) = n/(2(n − 1))
So
Σ_{n=2}^{∞} 2(n − 1) P(X = n + 1) = Σ_{n=2}^{∞} n P(X = n).
We can write this as
2 Σ_{n=2}^{∞} [(n + 1) − 2] P(X = n + 1) = Σ_{n=2}^{∞} n P(X = n)
and from this it follows that 2E[N] − 4 = E[N], so E[N] = 4.
This may be a surprise. We found that the average waiting time for HH with a
fair coin is six. (By symmetry, the average waiting time for TT is six tosses and the
average waiting time for HT is four tosses.)
There is apparently no intuitive reason for this to be so, but these results can
easily be verified by simulation to provide some evidence that they are correct.
We will simply state the average waiting times for each of the patterns of length
3 with a fair coin in Table 7.6.
Table 7.6
Pattern Average waiting time
HHH 14
THH 8
HTH 10
TTH 8
THT 10
HTT 8
HHT 8
TTT 14
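As noted above, these average waiting times are easy to verify by simulation; the sketch below (our own) tosses a fair coin until a chosen pattern appears and averages the number of tosses required.

```python
import random

# Average number of fair-coin tosses until a given pattern first appears.
def average_wait(pattern, trials=50_000):
    total = 0
    for _ in range(trials):
        tosses = ""
        while not tosses.endswith(pattern):
            tosses += random.choice("HT")
        total += len(tosses)
    return total / trials

random.seed(5)
print(average_wait("HH"))    # close to 6
print(average_wait("TH"))    # close to 4
print(average_wait("HHH"))   # close to 14, as in Table 7.6
```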
We continue this chapter with an intriguing game.
AN UNFAIR GAME WITH A FAIR COIN
Here is a game with a fair coin. In fact, it is an unfair game with a fair coin!
Consider the patterns HH, TH, HT, and TT for two tosses of a fair coin. If you
choose one of these patterns, I will choose another and then we toss a fair coin
until one of these patterns occurs. The winner is the person who chose the first-
occurring pattern. For example, if you choose HH, I will choose TH. If we see the
sequence HTTTTH, then I win since the pattern TH occurred before the pattern HH.
My probability of beating you is 3/4! Here is why.
If the first two tosses are HH, you win.
If the first two tosses are TH, I win.
If the first two tosses are HT then this can be followed by any number of T’s,
but eventually H will occur and I win.
If the first two tosses are TT then this can be followed by any number of T’s but
eventually H will occur and I will win.
If you are to win, you must toss HH on the first two tosses; this is the only way
you can win and it has probability 1/4. If a T is tossed at any time, I will win,
so my probability of winning is 3/4.
The fact that the patterns HH, TH, HT, and TT are equally likely for a fair coin, which may lead a game player to think that any choice is equally good, is irrelevant; the choice of pattern is crucial, as is the fact that he makes the first choice.
If you choose TT, then the only way you can win is by tossing two tails on
the first two tosses. I will choose HT and I will win 3/4 of the time. A sensi-
ble choice for you is either TH or HT, and then my probability of beating you is
only 1/2.
If we consider patterns with three tosses, as shown in the table above, my minimum
probability of beating you is 2/3! (And if you do not choose well, I can increase that
probability to 3/4 or 7/8!). This can be tried by simulation.
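One way to try it is a short simulation, sketched below in Python (an illustrative choice of language; the function names are ours). It plays the game repeatedly for a few pairings and estimates how often the second pattern appears first; the estimates should be near 7/8, 3/4, and 2/3.

```python
import random

def first_pattern(pat_a, pat_b):
    # Toss a fair coin until pat_a or pat_b occurs; return the winner.
    tosses = ""
    while True:
        tosses += random.choice("HT")
        if tosses.endswith(pat_a):
            return pat_a
        if tosses.endswith(pat_b):
            return pat_b

def prob_b_beats_a(pat_a, pat_b, trials=100_000):
    return sum(first_pattern(pat_a, pat_b) == pat_b for _ in range(trials)) / trials

print(prob_b_beats_a("HHH", "THH"))
print(prob_b_beats_a("TTH", "HTT"))
print(prob_b_beats_a("HTH", "HHT"))
```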
It is puzzling to note that no matter what pattern you choose, I will probably beat
you. This means that if we play the game twice, and I win in the first game, then you
can choose the pattern I chose on the first game, and I can still probably beat you.
Probabilities then are not transitive, so if pattern A beats pattern B and pat-
tern B beats pattern C, then it does not follow that pattern A will necessarily beat
pattern C.
THREE TOSSES
This apparent paradox, that probabilities are not transitive, continues with patterns
of length 3. We showed above the average waiting times for each
of the eight patterns that can occur when a fair coin is tossed (Table 7.6).
Now let us play the coin game again where A is the first player and B is the
second player. We show the probabilities that B beats A in Table 7.7.
Note that, letting ">" mean "beats (probably)", Table 7.7 gives the cycle
TTH > THH > HHT > HTT > TTH!
Table 7.7
A’s Choice B’s Choice P (B beats A)
HHH THH 7/8
HHT THH 3/4
HTH HHT 2/3
HTT HHT 2/3
THH TTH 2/3
THT TTH 2/3
TTH HTT 3/4
TTT HTT 7/8
Nor is it true that a pattern with a shorter average waiting time will necessarily
beat a pattern with a longer waiting time. It can be shown that the average waiting
time for THTH is 20 tosses and the average waiting time for HTHH is 18 tosses.
Nonetheless, the probability that THTH occurs before HTHH is 9/14.
Probability contains many apparent contradictions.
WHO PAYS FOR LUNCH?
Three friends, whom we will call A, B, and C, go to lunch regularly. The payer at
each lunch is selected randomly until someone pays for lunch for the second time.
On average, how many times will the group go to lunch?
Let X denote the number of lunches the group enjoys. Clearly, X = 2, 3, or 4.
We calculate the probabilities of each of these values.
If X = 2, then we have a choice of any of the three to pay for the first lunch.
Then the same person must pay for the second lunch as well. The probability of this
is 1/3.
If X = 3, then we have a choice of any of the three to pay for the first lunch.
Then we must choose one of the two who did not pay for the first lunch and finally,
we must choose one of the two previous payers to pay for the third lunch. There are
then 3 · 2 · 2 = 12 ways in which this can be done and since each way has probability
1/27, the probability that X = 3 is 12/27 = 4/9.
Finally, if X = 4 then any of the three can pay for the first lunch; either of the
other two must pay for the second lunch and the one remaining must pay for the third
lunch. The fourth lunch can be paid by any of the three so this gives 3 · 2 · 1 · 3 = 18
ways in which this can be done. Since each has probability (1/3)^4, the probability
that X = 4 is 18/81 = 2/9.
These probabilities add up to 1 as they should.
The expected number of lunches is then E(X) =2·3/9+3·4/9+4·2/9=26/9.
This is a bit under three, so they might as well go to lunch three times and forget the
random choices except that sometimes someone never pays.
Now what happens as the size of the group increases? Does the randomness affect
the number of lunches taken?
Suppose that there are four people in the group, A, B, C, and D. Then X = 2, 3, 4,
or 5. We calculate the probabilities in much the same way as we did when there were
three for lunch.
P(X = 2) = (4 · 1)/(4 · 4) = 1/4

P(X = 3) = (4 · 3 · 2)/(4 · 4 · 4) = 3/8

P(X = 4) = (4 · 3 · 2 · 3)/(4 · 4 · 4 · 4) = 9/32

And finally,

P(X = 5) = (4 · 3 · 2 · 1 · 4)/(4 · 4 · 4 · 4 · 4) = 3/32
and these sum to 1 as they should.
Then E(X) = 2 · 1/4 + 3 · 3/8 + 4 · 9/32 + 5 · 3/32 = 103/32 = 3.21875.
This is now a bit under 4, so the randomness is having some effect.
To establish a general formula for P(X = x) for n lunchers, note that the first
payer can be any of the n people, the next must be one of the n −1 people, the third
one of the n −2 people, and so on. The next to the last payer is one of the n −(x −2)
people and the last payer must be one of the x −1 people who have paid once.
This means that
P(X = x) = (n/n) · ((n − 1)/n) · ((n − 2)/n) · ((n − 3)/n) · · · · · ((n − (x − 2))/n) · ((x − 1)/n)

This can be rewritten as

P(X = x) = (1/n^x) · C(n, x − 1) · (x − 1) · (x − 1)!, x = 2, 3, . . . , n + 1

where C(n, x − 1) denotes the binomial coefficient.
If there are 10 diners, the probabilities of having x lunches are shown in
Table 7.8.
Table 7.8
x P(X = x)
2 0.1
3 0.18
4 0.216
5 0.2016
6 0.1512
7 0.09072
8 0.042336
9 0.0145152
10 0.00326592
11 0.00036288
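These tabled values can be checked directly from the formula for P(X = x). The following short Python sketch (our own illustrative code, not part of the text's Mathematica work) evaluates the distribution for n = 10 and its expectation.

```python
from math import comb, factorial

def lunch_pmf(n):
    # P(X = x) = C(n, x-1) * (x-1) * (x-1)! / n**x for x = 2, ..., n+1
    return {x: comb(n, x - 1) * (x - 1) * factorial(x - 1) / n ** x
            for x in range(2, n + 2)}

pmf = lunch_pmf(10)
print(pmf[2], pmf[3], pmf[4])               # 0.1, 0.18, 0.216, as in Table 7.8
print(sum(pmf.values()))                     # the probabilities sum to 1
print(sum(x * p for x, p in pmf.items()))    # expected number of lunches, about 4.66
```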
[Figure 7.4: graph of the probabilities for 100 diners (vertical axis: Probability; horizontal axis labeled Diners).]
With a computer algebra system, it is possible to compute the probabilities for
various values of X with some accuracy. In Figure 7.4 we show a graph of the probabilities for 100 diners.
EXPECTED NUMBER OF LUNCHES
It is also interesting to compute the expectations for increasing numbers of diners and
to study the effect the randomness has. The computer algebra system Mathematica
was used to produce Table 7.9 where n is the number of lunchers and the expected
number of lunch parties is computed. The number of diners now increases beyond
any sensible limit, providing some mathematical challenges.
Table 7.9
n Expectation
2 2.500
3 2.889
4 3.219
5 3.510
6 3.775
10 4.660
15 5.546
20 6.294
30 7.550
50 9.543
100 13.210
150 16.025
200 18.398
300 22.381
400 25.738
450 27.258
We find

E[X] = Σ_{x=2}^{n+1} (1/n^x) · C(n, x − 1) · (x − 1) · x!
It is interesting to note that adding one person to the dinner party has less and
less of an effect as n increases. A graph of the expected number of lunches as a function
of the number of diners is shown in Figure 7.5.
[Figure 7.5: Expected No. of Dinners versus No. of People.]
A least squares straight line regression gives Expected No. of
Lunches = 8.86266 +0.04577 · No. of People. A statistical test indicates
that the fit is almost perfect for the calculated points.
NEGATIVE HYPERGEOMETRIC DISTRIBUTION
A manufacturer has a lot of 400 items, 50 of which are special. The items are inspected
one at a time until 10 of the special items have been found. If the inspected items are
not replaced in the lot, the random variable representing the number of items inspected
leads to the negative hypergeometric distribution.
The problem here is our final example of a waiting time problem. Had the inspected
items been replaced, a questionable quality control procedure to say the least,
we would encounter the negative binomial distribution, which we have seen when
waiting for the rth success in a binomial process. Recall that if Y is the waiting time
for the rth success then
P(Y = y) = C(y − 1, r − 1) p^(r−1) (1 − p)^(y−r) · p, y = r, r + 1, · · ·
We showed previously that the expected value of Y is E(Y) = r/p, a fact we will
return to later.
Now we define the negative hypergeometric random variable. To be specific,
suppose the lot of N items contains k special items. We want to sample, without
replacement, until we find c of the special items. Then the sampling process stops.
Again Y is the random variable denoting the number of trials necessary. Since the first
y −1 trials comprise a hypergeometric process and the last trial must find a special
item,
P(Y = y) = [C(k, c − 1) · C(N − k, y − c) / C(N, y − 1)] · [k − (c − 1)]/[N − (y − 1)], y = c, c + 1, . . . , N − (k − c)
Note that the process can end in as few as c trials. The maximum number of trials
must occur when the first N − k trials contain all the nonspecial items followed by
all c special items.
Some elementary simplifications show that
P(Y = y) = C(y − 1, c − 1) · C(N − y, k − c) / C(N, k), y = c, c + 1, . . . , N − (k − c)
In the special case we have been considering, N = 400, k = 50, and c = 10.
With a computer algebra system such as Mathematica®, we can easily calculate
all the values in the probability distribution function and draw a graph that is shown
in Figure 7.6.
[Figure 7.6: Probability versus y for the negative hypergeometric distribution.]
Some of the individual probabilities are shown in Table 7.10.
The mean value of Y is E(Y) = 4010/51 = 78.6275 and the variance is 425.454.
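These values can be reproduced directly from the simplified formula for P(Y = y). Here is one way to do so in Python (an illustrative sketch; the book's own computations were done with Mathematica).

```python
from math import comb

def neg_hypergeom_pmf(y, N, k, c):
    # P(Y = y) = C(y-1, c-1) * C(N-y, k-c) / C(N, k)
    return comb(y - 1, c - 1) * comb(N - y, k - c) / comb(N, k)

N, k, c = 400, 50, 10
ys = range(c, N - (k - c) + 1)
pmf = {y: neg_hypergeom_pmf(y, N, k, c) for y in ys}

print(sum(pmf.values()))                          # should be 1
mean = sum(y * p for y, p in pmf.items())
variance = sum(y * y * p for y, p in pmf.items()) - mean ** 2
print(mean, variance)                             # about 78.63 and 425.45
print(c * (N + 1) / (k + 1))                      # the closed form c(N+1)/(k+1)
```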
Note that had we been using a negative binomial model the mean would be

c/p = c/(k/N) = 10/(50/400) = 80
The fact that this negative binomial mean is always greater than that for the
negative hypergeometric has some implications. This fact will be shown below, but
first we establish formulas for the mean and variance of the negative hypergeometric
random variable.
Table 7.10
y Probability
40 0.00286975
45 0.00530886
50 0.00843214
55 0.0118532
60 0.0150666
65 0.0175912
70 0.0190899
75 0.0194303
80 0.0186812
85 0.0170618
90 0.0148697
95 0.0124115
100 0.00995123
105 0.00768271
110 0.00572262
115 0.00411924
120 0.0028691
MEAN AND VARIANCE OF THE NEGATIVE
HYPERGEOMETRIC
We use a recursion. If we calculate P(Y = y)/P(Y = y − 1), we find after simplification that

Σ_{y=c}^{N−k+c} (y − c)(N − y + 1)P(Y = y) = Σ_{y=c}^{N−k+c} (y − 1)(N − y + 1 − k + c)P(Y = y − 1)

where we have also indicated a sum over all the possible values for Y.
This becomes

Σ_{y=c}^{N−k+c} [(N + 1 + c)y − y^2 − c(N + 1)]P(Y = y) = Σ_{y=c}^{N−k+c} [(N − k + c)(y − 1) − (y − 1)^2]P(Y = y − 1)
and this can be further expanded and simplified to

(N + 1 + c)E(Y) − E(Y^2) − c(N + 1) = (N − k + c)[E(Y) − (N − k + c)P(Y = N − k + c)] − [E(Y^2) − (N − k + c)^2 P(Y = N − k + c)]

When this is simplified we find

E(Y) = c(N + 1)/(k + 1)
In our special case this gives E(Y) =10(400+1)/(50+1) =4010/51=78.6275.
Before proceeding to the variance, we establish the fact that the mean of the
negative binomial always exceeds that of the negative hypergeometric.
Since N > k, N + Nk > k + Nk, so that cN/k > c(N + 1)/(k + 1), establishing
the result.
Although it is true that the negative binomial distribution is the limiting distribu-
tion for the negative hypergeometric distribution, this fact must be used sparingly if
approximations are to be drawn. We will return to this point later.
Now we establish a formula for the variance of the negative hypergeometric
random variable. We start with the previous recursion and multiply by y:

Σ_{y=c}^{N−k+c} y(y − c)(N − y + 1)P(Y = y) = Σ_{y=c}^{N−k+c} y(y − 1)(N − y + 1 − k + c)P(Y = y − 1)

The left-hand side reduces quite easily to

(N + c + 1)E(Y^2) − E(Y^3) − c(N + 1)E(Y)

The right-hand side must be first expanded in terms of y − 1 and then it can be simplified to

(N + c − k)E(Y) + (N − 1 + c − k)E(Y^2) − E(Y^3)
Then, putting the sides together, we find

E(Y^2) = (cN + 2c + N − k) c(N + 1) / [(k + 2)(k + 1)]

from which it follows that

Var(Y) = [c(N + 1) / ((k + 2)(k + 1)^2)] · [(N − k)(k − c + 1)]
In our special case, we find Var(Y) = 425.454 as before.
NEGATIVE BINOMIAL APPROXIMATION
It might be thought that the negative binomial distribution is a good approximation
to the negative hypergeometric distribution. This is true as the values in Table 7.11
indicate.
Table 7.11
Y Neghyper Negbin Neghyper −Negbin
90 0.014870 0.0135815 0.0012883
95 0.012412 0.011657 0.0007543
100 0.009951 0.009730 0.0002209
105 0.007683 0.007921 −0.000239
110 0.005723 0.006305 −0.000582
115 0.004119 0.004916 −0.000797
120 0.002869 0.003763 −0.000894
125 0.001936 0.002831 −0.000896
130 0.001266 0.002097 −0.000831
135 0.000803 0.001531 −0.000728
140 0.000494 0.001103 −0.000609
145 0.000295 0.000785 −0.000490
150 0.000171 0.000552 −0.000381
155 0.000096 0.000384 −0.000288
160 0.000053 0.000265 −0.000212
Figure 7.7 also shows that the values for the two probability distribution functions
are very close.
It was noted above that the mean of the negative binomial distribution always
exceeds that of the negative hypergeometric distribution. This difference is
c(N − k)/[k(k + 1)]
and this can lead to quite different results if probabilities are to be estimated.
[Figure 7.7: Probability versus y for the negative hypergeometric (NegHyper) and negative binomial (NegBin) distributions.]
Figure 7.7 shows a comparison between the negative hypergeometric with
N = 400, k = 50, and c = 10 and the negative binomial. The means differ by
1.3725 units. Although the graphs appear to be different, the actual differences are
quite negligible as the values in Table 7.11, calculated from the right-hand tails, show.
The problem of course is the fact that the negative binomial assumes a constant
probability of selecting a special item, while in fact this probability constantly changes
with each item selected.
THE MEANING OF THE MEAN
We have seen that the expected value of Y is given by
E(Y) = c(N + 1)/(k + 1)
This has some interesting implications, one of which we now show.
First Occurrences
If we let c = 1 in E(Y), we look at the expected number of drawings until the first
special item is found. Table 7.12 shows some expected waiting times for a lot with
N = 1000 and various values of k.
The graph in Figure 7.8 shows these results.
Waiting Time for c Special Items to Occur
Now consider a larger case. In E(Y) let N = 1000 and k = 100. Table 7.13 shows
the expected waiting time for c special items to occur.
Table 7.12
k E(Y)
10 91.0000
20 47.6667
30 32.2903
40 24.4146
50 19.6275
60 16.4098
70 14.0986
80 12.3580
90 11.0000
100 9.91089
[Figure 7.8: E(Y) versus k.]
Note that, not surprisingly, if c is some percentage of k, then E(Y) is
approximately the same percentage of N, a result easily seen from the formula for
E(Y).
The graph in Figure 7.9 shows this result as well.
Estimating k
Above we have assumed that k is known, a dubious assumption at best. On the basis
of a sample, how can we estimate k? To be specific, suppose that a sample of 100
from a population of 400 shows 5 special items. What is the maximum likelihood
estimate of k?
This is not a negative hypergeometric situation but a hypergeometric situation.
Consider prob[k] = C(k, 5) · C(400 − k, 95) / C(400, 100). The ratio prob[k + 1]/prob[k] can be
simplified to ((k + 1)(305 − k))/((k − 4)(400 − k)). Seeing where this is > 1 gives
k < 19.05.
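The algebra can be checked numerically; the sketch below (Python, our own illustrative code) compares the exact ratio prob[k + 1]/prob[k] with the simplified expression and shows that it exceeds 1 exactly when k < 19.05.

```python
from math import comb

def prob(k, N=400, s=100, special_in_sample=5):
    # Hypergeometric probability of the observed sample for a given k.
    return comb(k, special_in_sample) * comb(N - k, s - special_in_sample) / comb(N, s)

for k in range(15, 25):
    exact = prob(k + 1) / prob(k)
    simplified = (k + 1) * (305 - k) / ((k - 4) * (400 - k))
    print(k, round(exact, 4), round(simplified, 4), exact > 1)
```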
Table 7.13
c E(Y)
10 99.1089
20 198.218
30 297.327
40 396.436
50 495.545
60 594.653
70 693.762
80 792.871
90 891.980
100 991.089
[Figure 7.9: E(Y) versus c.]
The sample then has 5% special items and that is somewhat greater than the
maximum likelihood estimator for the percentage of special items in the population.
Now suppose the population is of size N, the sample is of size s, and the sample
contains s_p special items. It is easy to show that k̂, the maximum likelihood estimator
for k, the unknown number of special items in the population, is

k̂ = (s_p(N + 1) − s)/s

and further

k̂/N = (s_p/s)(1 + 1/N) − 1/N
showing that the proportion of the population estimated is very close to the percentage
of the special items in the sample, especially as the population size N increases, not really
much of a surprise.
Here we have used the hypergeometric distribution to estimate k. We find exactly
the same estimate if we use the negative hypergeometric waiting time distribution.
CONCLUSIONS
This chapter has contained a wide variety of waiting time problems. These are
often not considered in introductory courses in probability and statistics and yet they
offer interesting situations, both in their mathematical analysis and in their practical
applications.
EXPLORATIONS
1. Show that if X_1 is the waiting time for the first binomial success and X_2 is the
waiting time for the second binomial success, then X_1 + X_2 has a negative
binomial distribution with r = 2.
2. Show that the expected waiting time for the pattern HT with a fair coin is four
tosses.
3. (a) Suppose there are 4 people going for lunch as in the text. How many
lunches could a person who never pays for lunch expect?
(b) Repeat part (a) for a person who pays for lunch exactly once.
4. A lot of 50 items contains 5 defective items. Find the waiting times for the
second defective item to occur if sampled items are not replaced before the
next item is selected.
5. Two fair coins are tossed and any that come up heads are put aside. This
is repeated until all the coins have come up heads. Show that the expected
number of (group) tosses is 8/3.
6. Professor Banach has two jars of candy on his desk and when a student visits,
he or she is asked to select a jar and have a piece of candy. At some point, one
of the jars is found to be empty. On average, how many pieces of candy are in
the other jar?
7. (a) A coin, loaded to come up heads with probability 0.6, is thrown until a
head appears. What is the probability an odd number of tosses is necessary?
(b) If the coin is fair, explain why the probabilities of odd or even numbers
of tosses are not equal.
8. Suppose X is a negative binomial random variable with p the probability of
success at any trial. Suppose the rth success occurs at trial t. Find the value of
p that makes this event most likely to occur.
Chapter 8
Continuous Probability
Distributions: Sums, the
Normal Distribution, and the
Central Limit Theorem;
Bivariate Random Variables
CHAPTER OBJECTIVES:
• to study random variables taking on values in a continuous interval or intervals
• to see how events with probability zero can and do occur
• to discover the surprising behavior of sums and means
• to use the normal distribution in a variety of settings
• to explain why the normal curve is called "normal"
• to discuss the central limit theorem
• to introduce bivariate random variables.
So far we have considered discrete random variables, that is, random variables defined
on a discrete or countably infinite set. We now turn our attention to continuous
random variables, that is, random variables defined on an infinite, uncountable, set
such as an interval or intervals.
The simplest of these random variables is the uniform random variable.
UNIFORM RANDOM VARIABLE
Suppose we have a spinner on a wheel labeled with all the numbers between 0 and 1 on
its circular border. The arrow is spun and the number the arrow stops at is the random
variable, X. The random variable X can now take on any value in the interval from 0
to 1. What shall we take as the probability distribution function, f(x) = P(X = x)?
We suppose the wheel is fair, that is, it is as likely to stop at any particular number as
at any other. The wheel is shown in Figure 8.1.
[Figure 8.1: the spinner wheel, with 1/4, 1/2, 3/4, and 1 marked on its border.]
Clearly, we can not make P(X = x) very large, since we have an infinity of values
to use. Suppose we let
P(X = x) = 0.00000000000000000001 = 10^(−20)
The problem with this is that it can be shown that the wheel contains more than
10^20 points, so we have used up more than the total probability of 1!
We are forced to conclude that P(X = x) = 0.
Now suppose that the wheel is loaded so that P(X ≥ 1/2) = 3P(X ≤ 1/2), making
it three times as likely that the arrow ends up in the left-hand half of the wheel as
in the right-hand half of the wheel. What is P(X = x) now?
Again, we conclude that P(X = x) = 0. It is curious that we cannot distinguish
a loaded wheel from a fair one!
The difficulty here lies not in the answer to our question, but in the question itself.
If we consider any random variable defined on a continuous interval, then P(X = x)
will always be 0. So we ask a different question, what is P(a ≤ x ≤ b)? That is, what
is the probability that the random variable is contained in an interval? To make this
meaningful, we define the probability density function, f(x), which has the following
properties:
1. f(x) ≥ 0.
2. The total area under f(x) is 1.
3. The area under f(x) between x = a and x = b is P(a ≤ x ≤ b).
So areas under f(x) are probabilities. This means that f(x) must always be
positive, since probabilities cannot be negative. Also, f(x) must enclose a total area
of 1, the total probability for the sample space.
What is f(x) for the fair wheel? Since, for the fair wheel, P(a ≤ x ≤ b) = b −a,
it follows that
f(x) = 1, 0 ≤ x ≤ 1.
Notice that f(x) ≥ 0 and that the total area under f(x) is 1. Here we call f(x) a
uniform probability density function. Its graph is shown in Figure 8.2.
[Figure 8.2: the uniform density f(x) = 1 on 0 ≤ x ≤ 1.]
So, for example, to find P(1/3 ≤ X ≤ 3/4), we find the area under the curve
between 1/3 and 3/4. This is (3/4 − 1/3) · 1 = 5/12. Since f(x) is so simple, areas
under the curve, and hence probabilities, are easy to find.
For the loaded wheel, where P(0 ≤X≤1/2) =1/4 and so P(1/2 ≤X ≤1) =3/4,
consider (among many other choices) the triangular distribution
f(x) = 2x, 0 ≤ x ≤ 1
The distribution is shown in Figure 8.3.
[Figure 8.3: the triangular density f(x) = 2x on 0 ≤ x ≤ 1.]
Areas can be found using triangles and it is easy to see, since f(1/2) = 1, that
P(0 ≤ X ≤ 1/2) = 1/4.
SUMS
Now let us return to the fair wheel (where f(x) = 1, 0 ≤ x ≤ 1) and consider spinning
the wheel twice and adding the numbers the pointer stops at. What is the probability
density function for the sum? One might think that since we are adding two uniform
random variables the sum will also be uniform, but that is not the case, as we saw in
the discrete case in Chapter 5.
The sums obtained will then be between 0 and 2. Consider the probability of
getting a small sum. For that to occur, both numbers must be small. Similarly, to get
a sum near 2, a large sum, both spins must be near 1. So either of these possibilities
is unlikely.
If X_1 and X_2 represent the outcomes on the individual spins, then the expected
value of each is E(X_1) = 1/2 and E(X_2) = 1/2. While we cannot prove this in the
continuous case, here is a fact and an example that may prove convincing.
A Fact About Means
If a probability distribution or a probability density function has a point of symmetry,
then that point is the expected value of the random variable.
As an example, consider the discrete random variable X that assumes the values
1, 2, 3, 4, 5 with probabilities f(1), f(2), f(3), f(4), and f(5), where X = 3 is a
point of symmetry and f(1) = f(5) and f(2) = f(4). Now
E(X) = 1·f(1) +2·f(2) +3·f(3) +4·f(4) +5·f(5) = 6·f(1) +6·f(2)+3·f(3)
But
f(1) +f(2) +f(3) +f(4) +f(5) = 2 · f(1) +2 · f(2) +f(3) = 1
so

6 · f(1) + 6 · f(2) + 3 · f(3) = 3[2 · f(1) + 2 · f(2) + f(3)] = 3

the point of symmetry.
This is far from a general explanation or proof, but this approach can easily be
generalized to a more general discrete probability distribution. We will continue to
use this as a fact. It is also true for a continuous probability distribution; we cannot
supply a proof here.
It is always true that E(X_1 + X_2) = E(X_1) + E(X_2) (a fact we will prove later)
and since 1/2 is a point of symmetry, it follows that

E(X_1 + X_2) = 1/2 + 1/2 = 1
It is more likely to find a sum near 1 than to find a sum near either of the extreme
values, namely, 0 or 2, as we shall see.
It can be shown that the probability distribution of X = X_1 + X_2 is

f_2(x) = x for 0 ≤ x ≤ 1
f_2(x) = 2 − x for 1 ≤ x ≤ 2
We should check that f_2(x) is a probability density function. The function is
always positive. The easiest way to find its total area is to use the area of two triangles.
Each of these has base of length 1 and height 1 as well, so the total area is
2 · (1/2) · 1 · 1 = 1; so f_2(x) is a probability density function.
A graph of f_2(x) is shown in Figure 8.4.
[Figure 8.4: the density f_2(x) of the sum of two spins.]
Since areas represent probabilities, using the area of a triangle, we find that
P(1/2 ≤ X ≤ 3/2) = 3/4, so the sums do in fact cluster around their expected value.
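A quick simulation supports this; the following Python fragment (an illustrative sketch, not part of the text) adds two uniform spins many times and estimates the probability that the sum falls between 1/2 and 3/2.

```python
import random

trials = 100_000
hits = sum(0.5 <= random.random() + random.random() <= 1.5 for _ in range(trials))
print(hits / trials)   # should be close to 3/4
```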
If we increase the number of spins to 3 and record the sum, it can be shown that
the probability density function for X = X_1 + X_2 + X_3 is

f_3(x) = (1/2)x^2 for 0 ≤ x ≤ 1
f_3(x) = 3/4 − (x − 3/2)^2 for 1 ≤ x ≤ 2
f_3(x) = (1/2)(x − 3)^2 for 2 ≤ x ≤ 3
This is also a probability density function, although finding the total area, or
finding probabilities by finding areas, is usually done using calculus. We can find
here, for example, that P(1/2 ≤ x ≤ 5/2) = 23/24. Probabilities, or areas, are now
difficult for us to find and we cannot do this by simple geometry. However, we will
soon find a remarkable approximation to this probability density function that will
enable us to determine areas and hence probabilities.
A graph of f_3(x) is shown in Figure 8.5.
The graph in Figure 8.5 resembles a “bell-shaped” or normal curve that occurs
frequently in probability and statistics. We have encountered this curve before.
[Figure 8.5: the density f_3(x) of the sum of three spins.]
If we were to continue adding spins, we would find that the graphs come closer
and closer to a normal curve. This is an illustration of the central limit theorem. We
will discuss the normal curve first and then state the central limit theorem.
NORMAL PROBABILITY DISTRIBUTION
The normal probability distribution is a bell-shaped curve that is completely specified
by its mean μ and its standard deviation σ. Its probability density function is
f(x) = (1/(σ√(2π))) e^(−(1/2)((x−μ)/σ)^2), −∞ < μ < ∞, σ > 0, −∞ < x < ∞
Note that for any normal curve, the curve is symmetric about its mean value.
A typical normal distribution is shown in Figure 8.6, which is a normal curve
with mean 3 and standard deviation √2.
We abbreviate any normal curve by specifying its mean and standard deviation
and we write, for a general normal random variable X, X ∼ N(μ, σ). This is read, "X
is distributed normally with mean μ and standard deviation σ”. For the normal curve
[Figure 8.6: the normal density with mean 3 and standard deviation √2.]
in Figure 8.6, we write X ∼ N(3, √2). It can be shown, but with some difficulty, that
the total area of any normal curve is 1. Since the function is always positive, areas
under the curve represent probabilities. These cannot be calculated easily either.
Statistical calculators, however, can calculate these areas. Here are some examples
from this normal curve.

(a) P(2 ≤ X ≤ 6) = 0.743303
(b) P(4 ≤ X ≤ 8) = 0.239547
(c) P(X > 4 | X > 2) = P(X > 4 and X > 2)/P(X > 2) = P(X > 4)/P(X > 2) = 0.23975/0.76025 = 0.315357
Textbooks commonly include a table of the standard normal curve, that is,
N(0, 1). Since computers and calculators compute areas under the standard normal
curve, we do not need to include such a table in this book. The use of this standard
normal curve to calculate areas under any normal curve is based on this fact.
Fact. If X ∼ N(μ, σ) and if Z = (X−μ)/σ, then
Z ∼ N(0, 1).
This means that areas under any normal curve can be calculated using a single,
standard, normal curve. For example, using our example (a) above,
P(2 ≤ X ≤ 6) = P((2 − 3)/√2 ≤ (X − 3)/√2 ≤ (6 − 3)/√2) = P(−0.70711 ≤ Z ≤ 2.1213) = 0.743303
as before. The fact that any normal curve can be transformed into a standard normal
curve is a remarkable fact (and a fact not true for the other probability density functions
we will encounter). There are, however, many other uses of the standard normal curve
and we will meet these when we study other statistical topics.
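For readers working without a statistical calculator, such areas can also be found with the error function; a short Python sketch follows (our own illustration, using the standardization Z = (X − μ)/σ described above).

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    # P(X <= x) for X ~ N(mu, sigma)
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

mu, sigma = 3, sqrt(2)
print(normal_cdf(6, mu, sigma) - normal_cdf(2, mu, sigma))              # about 0.7433, example (a)
print(normal_cdf(8, mu, sigma) - normal_cdf(4, mu, sigma))              # about 0.2395, example (b)
print((1 - normal_cdf(4, mu, sigma)) / (1 - normal_cdf(2, mu, sigma)))  # about 0.3154, example (c)
```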
Facts About Normal Curves
To show that the standard deviation does in fact measure the dispersion of the distri-
bution, we calculate several probabilities for a standard N(0, 1) distribution.
(a) P(−1 ≤ Z ≤ 1) = 0.6827
(b) P(−2 ≤ Z ≤ 2) = 0.9545
(c) P(−3 ≤ Z ≤ 3) = 0.9973
So the more standard deviations we use, the more of the distribution we enclose.
EXAMPLE 8.1 IQ Scores
A standard test for the intelligence quotient (IQ) produces scores that are approximately nor-
mally distributed with mean 100 and standard deviation 10. So, using the facts above and as-
sumingZ=(IQ−100)/10, we see that P(90≤IQ≤110) =0.6827, P(80≤IQ≤120) =0.9545;
and P(70 ≤ IQ ≤ 130) = 0.9973.
Also we can calculate the percentage of the population whose IQ values are greater
than 140, as P(IQ ≥ 140) = 1 − P(Z ≤ 4) = 1 − 0.99997 = 0.00003, a rarity. However, in a
population of about 300,000,000 in the United States, this still gives about 10,000 people with
this IQ or greater.
᭿
There are many other applications for the normal distribution. One of the reasons
for this, but not the only one, is that sums of different random variables or means
of several random variables become normal. We had an indication of this when we
added uniformly distributed random variables. This is a consequence of the central
limit theorem, which we will discuss subsequently.
But before we do that, we turn our attention to bivariate random variables.
BIVARIATE RANDOM VARIABLES
We want to look at sums when we add the observations from the spinning wheel. We
have been concerned with a single random variable, but note that the observations
we want to add come from different spins and so are different random
variables. So we turn our attention to several random variables at once and first consider
two of them.
Suppose we have a sample space and we have defined two random variables,
which we will call X and Y, on the sample points. Now we must determine the
probabilities that the random variables assume values together. We need not show
the sample space, but the probabilities with which the random variables take on their
respective values together are shown in Table 8.1. This is called the joint probability
distribution function.
The table is to be read this way: X can take on the values 1, 2, and 3 while the
random variable Y can take on the values 1 and 2. The entries in the body of the table
Table 8.1
X
Y 1 2 3
1 1/12 1/12 1/3
2 1/3 1/12 1/12
are the probabilities that X and Y assume their values simultaneously. For example,
P(X = 1 and Y = 2) = 1/3 and P(X = 3 and Y = 2) = 1/12
Note that the probabilities in the table add up to 1 as they should.
Now consider the random variable X+Y. This random variable can take on the
values 2, 3, 4, or 5.
There is only one way for X+Y to be 2, namely, each of the variables must be 1.
So the probability that X+Y = 2 is 1/12 and we write P(X+Y = 2) = 1/12.
There are two mutually exclusive ways for X+Y to be 3,
namely, X = 1 and Y = 2 or X = 2 and Y = 1. So
P(X+Y =3) =P(X=1 and Y = 2) +P(X = 2 and Y = 1) = 1/3 +1/12 = 5/12.
It is easy to check that there are two mutually exclusive ways for X+Y to be 4
and this has probability 5/12. Finally, the probability that X+Y = 5 is 1/12. These
probabilities add up to 1 as they should.
This means that the random variable X+Y has the following probability distri-
bution function:
f(x + y) = 1/12 if x + y = 2
f(x + y) = 5/12 if x + y = 3
f(x + y) = 5/12 if x + y = 4
f(x + y) = 1/12 if x + y = 5
where x and y denote values of the random variables X and Y.
This random variable then has a mean value. We find that
E(X + Y) = 2 · (1/12) + 3 · (5/12) + 4 · (5/12) + 5 · (1/12) = 7/2
How does this value relate to the expected values of the variables X and Y taken
separately?
First, we must find the probability distribution functions of the variables
alone. What, for example, is the probability that X = 1? We know that
P(X = 1 and Y = 1) = 1/12 and P(X = 1 and Y = 2) = 1/3. These events are
mutually exclusive and are the only events for which X = 1. So
P(X = 1) = 1/12 +1/3 = 5/12
Notice that this is the sum of the probabilities in the column in Table 8.1 for which
X = 1.
In a similar way,
P(X = 2) = P(X = 2 and Y = 1) +P(X = 2 and Y = 2) = 1/12 +1/12 = 1/6,
which is the sum of the probabilities in the column of Table 8.1 for which X = 2.
Finally, summing the probabilities in Table 8.1 for which X = 3 gives
P(X = 3) = 1/3 + 1/12 = 5/12.
So the probability distribution for the random variable X alone can be found
by adding up the entries in the columns; the probability distribution for the random
variable Y alone can be found by adding up the probabilities in the rows of Table 8.1.
Specifically,
P(Y = 1) = P(X = 1 and Y = 1) +P(X = 2 and Y = 1) +P(X = 3 and Y = 1)
= 1/12 +1/12 +1/3 = 1/2
In a similar way, we find P(Y = 2) = 1/2.
We then found the following probability distributions for the individual variables:
f(x) = 5/12 if x = 1
f(x) = 1/6 if x = 2
f(x) = 5/12 if x = 3

and

g(y) = 1/2 if y = 1
g(y) = 1/2 if y = 2
These distributions occur in the margins of the table and are called marginal
distributions. We have expanded Table 8.1 to show these marginal distributions in
Table 8.2.
Table 8.2
X
Y 1 2 3 g(y)
1 1/12 1/12 1/3 1/2
2 1/3 1/12 1/12 1/2
f(x) 5/12 1/6 5/12 1
Note that, where the sums are over all the values of X and Y,

Σ_x P(X = x, Y = y) = P(Y = y) = g(y)

and

Σ_y P(X = x, Y = y) = P(X = x) = f(x)
These random variables also have expected values. We find
E(X) = 1 · (5/12) + 2 · (1/6) + 3 · (5/12) = 2

E(Y) = 1 · (1/2) + 2 · (1/2) = 3/2
Now we note that E(X + Y) = 7/2 = 2 + 3/2 = E(X) + E(Y).
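The calculations with Table 8.1 are easy to verify by machine. The following Python sketch (our own illustrative code) stores the joint probabilities, recovers the marginal distributions by summing rows and columns, and checks that E(X + Y) = E(X) + E(Y).

```python
# Joint probabilities from Table 8.1, keyed by (x, y).
joint = {(1, 1): 1/12, (2, 1): 1/12, (3, 1): 1/3,
         (1, 2): 1/3,  (2, 2): 1/12, (3, 2): 1/12}

f = {x: sum(p for (xx, y), p in joint.items() if xx == x) for x in (1, 2, 3)}  # marginal of X
g = {y: sum(p for (x, yy), p in joint.items() if yy == y) for y in (1, 2)}     # marginal of Y

E_X = sum(x * p for x, p in f.items())
E_Y = sum(y * p for y, p in g.items())
E_sum = sum((x + y) * p for (x, y), p in joint.items())
print(E_X, E_Y, E_sum)   # 2.0, 1.5, 3.5, so E(X + Y) = E(X) + E(Y)
```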
This is not a peculiarity of this special case, but is, in fact, true for any two
variables X and Y. Here is a proof.
E(X + Y) = Σ_x Σ_y (x + y)P(X = x, Y = y)
= Σ_x Σ_y x P(X = x, Y = y) + Σ_x Σ_y y P(X = x, Y = y)
= Σ_x x Σ_y P(X = x, Y = y) + Σ_y y Σ_x P(X = x, Y = y)
= Σ_x x f(x) + Σ_y y g(y) = E(X) + E(Y)
This is easily extended to any number of random variables:
E(X+Y +Z +· · · ) = E(X) +E(Y) +E(Z) +· · ·
When more than one random variable is defined on the same sample space,
they may be related in several ways: they may be totally dependent as, for example,
if X = Y or if X = Y −4, they may be totally independent of each other, or they
may be partially dependent on each other. In the latter case, the variables are called
correlated. This will be dealt with when we consider the subject of regression later.
It is important to emphasize that
E(X+Y +Z +· · · ) = E(X) +E(Y) +E(Z) +· · ·
no matter what the relationships are between the several variables, since no condition
was used in the proof above.
Note in the example we have been considering that
P(X = 1 and Y = 1) = 1/12 ≠ P(X = 1) · P(Y = 1) = 5/12 · 1/2 = 5/24
so X and Y are not independent.
Now consider an example where X and Y are independent of each other. We
show another joint probability distribution function in Table 8.3.
Table 8.3
X
y 1 2 3 g(y)
1 5/24 1/12 5/24 1/2
2 5/24 1/12 5/24 1/2
f(x) 5/12 1/6 5/12 1
Note that P(X = x, Y = y) = P(X = x)P(Y = y) in each case, so the random
variables are independent. We can calculate, for example,
P(X = 1 and Y = 1) = 5/24 = P(X = 1) · P(Y = 1) = 5/12 · 1/2
The other entries in the table can be checked similarly. Here we have shown the
marginal distributions of the random variables X and Y , f(x) and g(y), in the
margins of the table.
Now consider the random variable X · Y and in particular its expected value.
Using the fact that X and Y are independent, we sum the values of X · Y multiplied
by their probabilities to find the expected value of the product of X and Y:
E(X · Y) = 1 · 1 · (5/12)(1/2) + 2 · 1 · (1/6)(1/2) + 3 · 1 · (5/12)(1/2) + 1 · 2 · (5/12)(1/2) + 2 · 2 · (1/6)(1/2) + 3 · 2 · (5/12)(1/2) = 3
but we also see that this can be written as
E(X · Y) = [1 · (5/12) + 2 · (1/6) + 3 · (5/12)] · [1 · (1/2) + 2 · (1/2)] = 2 · (3/2) = 3 = E(X)E(Y)
and we see that the quantities in parentheses are E(X) and E(Y), respectively.
This is true of the general case, that is, if X and Y are independent, then
E(X· Y) = E(X) · E(Y)
A proof can be fashioned by generalizing the example above.
This can be extended to any number of random variables: if X, Y, Z, · · · are mutually
independent, then

E(X · Y · Z · · · ·) = E(X) · E(Y) · E(Z) · · · ·
Although the examples given here involve discrete random variables, the results con-
cerning the expected values are true for continuous random variables as well.
Variance
The variance of a random variable X is

Var(X) = E(X − μ)^2

where μ = E(X).
This definition holds for both discrete and continuous random variables. Now
E(X − μ)^2 = E(X^2 − 2μX + μ^2) = E(X^2) − 2μE(X) + E(μ^2)
and since μ is a constant,

E(X − μ)^2 = E(X^2) − 2μ^2 + μ^2

so

E(X − μ)^2 = E(X^2) − μ^2
The variance is a measure of the dispersion of a random variable as we have seen
with the normal random variable, but, as we will show when we study the design
of experiments, it can often be partitioned into parts that explain the source of the
variation in experimental results.
We conclude our consideration of expected values with a result concerning the
variance of a sum of independent random variables. By definition,
Var(X + Y) = E[(X + Y) − (μ_x + μ_y)]^2 = E[(X − μ_x) + (Y − μ_y)]^2
= E(X − μ_x)^2 + 2E(X − μ_x)(Y − μ_y) + E(Y − μ_y)^2
Consider the middle term. Since X and Y are independent,
E(X − μ_x)(Y − μ_y) = E(X · Y − μ_x · Y − X · μ_y + μ_x · μ_y)
= E(X · Y) − μ_x · E(Y) − E(X) · μ_y + E(μ_x · μ_y)
= E(X · Y) − μ_x · μ_y − μ_x · μ_y + μ_x · μ_y
= E(X) · E(Y) − μ_x · μ_y − μ_x · μ_y + μ_x · μ_y
= μ_x · μ_y − μ_x · μ_y = 0
So
Var(X + Y) = E(X − μ_x)^2 + E(Y − μ_y)^2 = Var(X) + Var(Y)
Note that this result depends heavily on the independence of the random variables.
As an example, consider the distributions given in Table 8.3 and
repeated here in Table 8.4.
Table 8.4
X
Y 1 2 3 g(y)
1 5/24 1/12 5/24 1/2
2 5/24 1/12 5/24 1/2
f(x) 5/12 1/6 5/12 1
We find that
E(X + Y) = 2 · (5/24) + 3 · (7/24) + 4 · (7/24) + 5 · (5/24) = 84/24 = 7/2
So
Var(X + Y) = (2 − 7/2)^2 · (5/24) + (3 − 7/2)^2 · (7/24) + (4 − 7/2)^2 · (7/24) + (5 − 7/2)^2 · (5/24) = 13/12
But
E(X) = 1 · (5/12) + 2 · (1/6) + 3 · (5/12) = 2

and

E(Y) = 1 · (1/2) + 2 · (1/2) = 3/2
so
Var(X) = (1 − 2)^2 · (5/12) + (2 − 2)^2 · (1/6) + (3 − 2)^2 · (5/12) = 5/6
and
Var(Y) = (1 − 3/2)^2 · (1/2) + (2 − 3/2)^2 · (1/2) = 1/4
and so
Var(X) + Var(Y) = 5/6 + 1/4 = 13/12
We could also calculate the variance of the sum by using the formula
Var(X + Y) = E[(X + Y)^2] − [E(X + Y)]^2
Here
E[(X + Y)^2] = 2^2 · (5/24) + 3^2 · (7/24) + 4^2 · (7/24) + 5^2 · (5/24) = 40/3
We previously calculated E(X+Y) = 7/2, so we find
Var(X + Y) = 40/3 − (7/2)^2 = 13/12
as before.
Now we can return to the spinning wheel.
CENTRAL LIMIT THEOREM: SUMS
We have seen previously that the bell-shaped curve arises when discrete random
variables are added together. Now we look at continuous random variables. Suppose
that we have n independent spins of the fair wheel, denoted by X_i, and we let the
random variable X denote the sum so that

X = X_1 + X_2 + X_3 + · · · + X_n
For the individual observations, we know that the expected value is E(X_i) = 1/2 and
the variance can be shown to be Var(X_i) = 1/12. In addition, we know that

E(X) = E(X_1 + X_2 + X_3 + · · · + X_n)
= E(X_1) + E(X_2) + E(X_3) + · · · + E(X_n)
= 1/2 + 1/2 + 1/2 + · · · + 1/2
= n/2
and since the spins are independent,
Var(X) = Var(X_1 + X_2 + X_3 + · · · + X_n)
= Var(X_1) + Var(X_2) + Var(X_3) + · · · + Var(X_n)
= 1/12 + 1/12 + 1/12 + · · · + 1/12
= n/12
The central limit theorem states that in this case, X has, approximately, a normal
probability distribution with mean n/2 and standard deviation √(n/12). We abbreviate
this by writing X ∼ N(n/2, √(n/12)). If n = 3, this becomes

X ∼ N(3/2, 1/2)
The value of n in the central limit theorem, as we have already stated, need not be
very large. To show how close the approximation is, we show the graph of N(3/2, 1/2) in
Figure 8.7 and then, in Figure 8.8, the graphs of N(3/2, 1/2) and f_3(x) superimposed.
We previously calculated, using f_3(x), that P(1/2 ≤ x ≤ 5/2) = 23/24 = 0.95833.
Using the normal curve with mean 3/2 and standard deviation 1/2, we find the
[Figure 8.7: the N(3/2, 1/2) density.]
[Figure 8.8: the N(3/2, 1/2) density and f_3(x) superimposed.]
approximation to this probability to be 0.95450. As the number of spins increases,
the approximation becomes better and better.
CENTRAL LIMIT THEOREM: MEANS
Understanding the central limit theorem is absolutely essential to understanding the
material on statistical inference that follows in this book. In many instances, we know
the mean from a random sample and we wish to make some inference or draw some
conclusion about the mean of the population from which the sample was selected.
So we turn our attention to means.
First, suppose X is some random variable and k is a constant. Then, supposing
that X is a discrete random variable,
E(X/k) = Σ_S (x/k) P(X = x) = (1/k) Σ_S x · P(X = x) = (1/k) E(X)
Here the summation is over all the values in the sample space, S. Therefore, if the
variable is divided by a constant, so is the expected value. Now, denoting E(X) by μ,
Var(X/k) = Σ_S ((x/k) − (μ/k))^2 P(X = x) = (1/k^2) Σ_S (x − μ)^2 P(X = x) = (1/k^2) Var(X)
The divisor, k, this time reduces the variance by its square.
CENTRAL LIMIT THEOREM
If X̄ denotes the mean of a sample of size n from a probability density function with
mean μ and standard deviation σ, then X̄ is approximately N(μ, σ/√n).
This theorem is the basis for much of statistical inference, our ability to
draw conclusions from samples and experimental data. Statistical inference
is the subject of the next two chapters.
EXPECTED VALUES AND BIVARIATE RANDOM
VARIABLES
We now expand our knowledge of bivariate random variables. But before we can
discuss the distribution of sample means, we pause to consider the calculation of
means and variances of means.
Means and Variances of Means
We will now discuss the distribution of sample means. Suppose, as before, that we
have a sum of independent random variables so that X = X_1 + X_2 + X_3 + · · · + X_n.
Suppose also that for each of these random variables, E(X_i) = μ and Var(X_i) = σ^2.
The mean of these random variables is

X̄ = (X_1 + X_2 + X_3 + · · · + X_n)/n
Using the facts we just established, we find that

E(X̄) = E(X_1 + X_2 + X_3 + · · · + X_n)/n = [E(X_1) + E(X_2) + E(X_3) + · · · + E(X_n)]/n = nμ/n = μ
So the expected value of the mean of a number of random variables with the same
mean is the mean of the individual random variables.
We also find that

Var(X̄) = Var(X_1 + X_2 + X_3 + · · · + X_n)/n^2 = [Var(X_1) + Var(X_2) + Var(X_3) + · · · + Var(X_n)]/n^2 = nσ^2/n^2 = σ^2/n
While the mean of the sample means is the mean of the distribution of the individual
X_i's, the variance is reduced by a factor of n. This shows that the larger the
sample size, the smaller the Var(X̄). This has important implications for sampling.
We show some examples.
EXAMPLE 8.2 Means of Random Variables
(a) The probability that an individual observation from the uniform random variable with
f(x) = 1 for 0 ≤ x ≤ 1 is between 1/3 and 2/3 is (2/3 −1/3) = 1/3.
What is the probability that the mean of a sample of 12 observations from this distribution
is between 1/3 and 2/3?
For the uniform random variable, μ = 1/2 and σ^2 = 1/12; so, using the central limit
theorem for means, we know that E(X̄) = 1/2 and Var(X̄) = Var(X)/n = (1/12)/12 = 1/144.
This means that the standard deviation of X̄ is 1/12.
The central limit theorem for means then states that X̄ ∼ N(1/2, 1/12). Using a statistical
calculator, we find

P(1/3 ≤ X̄ ≤ 2/3) = 0.9545
So, while an individual observation falls between 1/3 and 2/3 only about 1/3 of the time,
the sample mean is almost certain to do so.
(b) How large a sample must be selected from a population with mean 10 and standard
deviation 2 so that the probability that the sample mean is within 1 unit of the population
mean is 0.95?
Let the sample size be n. We know that X̄ ∼ N(10, 2/√n). We want n so that

P(9 ≤ X̄ ≤ 11) = 0.95

We let Z = (X̄ − 10)/(2/√n). So we have

P((9 − 10)/(2/√n) ≤ (X̄ − 10)/(2/√n) ≤ (11 − 10)/(2/√n)) = 0.95
or
P(−1/(2/√n) ≤ Z ≤ 1/(2/√n)) = 0.95
But we know, for a standard normal variable, that P(−1.96 ≤ Z ≤ 1.96) = 0.95, so we
conclude that
1/(2/√n) = 1.96

or

√n/2 = 1.96

so

√n = 2 · 1.96 = 3.92
so

n = (3.92)^2 = 15.366
meaning that a sample of 16 is necessary. If we were to round down to 15, the probability
would be less than 0.95.
᭿
A NOTE ON THE UNIFORM DISTRIBUTION
We have stated that for the uniform distribution, E(X) = 1/2 and Var(X) = 1/12, but
we have not proved this. The reason is that these calculations for a continuous random
variable require calculus, and that is not a prerequisite here. We give an example that
may be convincing.
Suppose we have a discrete random variable, X, where P(X = x) = 1/100 for
x = 0.01, 0.02, . . . , 1.00. This is a discrete approximation to the continuous uniform
distribution. We will need the following formulas:
1 + 2 + · · · + n = Σ_{i=1}^{n} i = n(n + 1)/2

and

1^2 + 2^2 + · · · + n^2 = Σ_{i=1}^{n} i^2 = n(n + 1)(2n + 1)/6
We know that for a discrete random variable,
E(X) = Σ_S x · P(X = x)

which becomes in this case

E(X) = Σ_{x=0.01}^{1.00} x · (1/100) = (1/100) · Σ_{x=0.01}^{1.00} x
Now assuming x = i/100 allows the variable i to assume integer values:

E(X) = (1/100) · Σ_{i=1}^{100} (i/100) = (1/100)^2 · Σ_{i=1}^{100} i = (1/100)^2 · (100)(101)/2 = 0.505
To calculate the variance, we first find E(X^2). This is

E(X^2) = Σ_{x=0.01}^{1.00} x^2 · P(X = x) = Σ_{x=0.01}^{1.00} x^2 · (1/100) = (1/100) · Σ_{x=0.01}^{1.00} x^2
and again assuming x = i/100,

E(X^2) = (1/100) · Σ_{i=1}^{100} (i/100)^2 = (1/100)^3 · Σ_{i=1}^{100} i^2 = (1/100)^3 · (100)(101)(201)/6 = 0.33835
Now it follows that
Var(X) = E(X^2) − [E(X)]^2 = 0.33835 − (0.505)^2 = 0.083325 ≈ 1/12.
If we were to increase the number of subdivisions in the interval from 0 to 1 to
10,000 (with each point then having probability 0.0001), we find that
E(X) = 0.50005
and
Var(X) = 0.08333333325
This may offer some indication, for the continuous uniform random variable, that
E(X) = 1/2 and Var(X) = 1/12.
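Readers can reproduce this discrete approximation with a few lines of code; here is one possible Python version (our own sketch, equivalent to the hand calculation above).

```python
def discrete_uniform_moments(m):
    # Mean and variance of X with P(X = i/m) = 1/m for i = 1, ..., m.
    xs = [i / m for i in range(1, m + 1)]
    mean = sum(xs) / m
    variance = sum(x * x for x in xs) / m - mean ** 2
    return mean, variance

print(discrete_uniform_moments(100))     # (0.505, 0.083325), as above
print(discrete_uniform_moments(10_000))  # approaches (1/2, 1/12)
```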
We conclude this chapter with an approximation of a probability from a proba-
bility density function.
EXAMPLE 8.3 Areas Without Calculus
Here is a method by which probabilities—areas under a continuous probability density
function—can be approximated without using calculus.
Suppose we have part of a probability density function that is part of the parabola

f(x) = (1/2)x^2, 0 ≤ x ≤ 1

and we wish to approximate P(0 ≤ x ≤ 1). The exact value of this probability is 1/6.
The graph of this function is shown in Figure 8.9.
[Figure 8.9: the curve f(x) = (1/2)x^2 on 0 ≤ x ≤ 1, with two of the approximating rectangles shown.]
We will approximate the probability, or the area under the curve, by a series of rectangles,
each of width 0.1, using the right-hand ends of the rectangles at the points 0.1, 0.2, . . . , 1.
We use the height of the curve at the midpoints of these rectangles and use the total area of
these rectangles as an approximation to the area, A, under the curve. Two of the approximating
rectangles are shown in Figure 8.9. This gives us
A = (0.1) · (1/2) · (0.05)^2 + (0.1) · (1/2) · (0.15)^2 + (0.1) · (1/2) · (0.25)^2 + · · · + (0.1) · (1/2) · (0.95)^2 = 0.16625
which is not a bad approximation to the actual area, 0.1666 . . .
᭿
This is a common technique in calculus, but the approximation here is surprisingly
good. Increasing the number of rectangles only improves the approximation.
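The midpoint-rectangle idea is easy to automate. The sketch below (Python, our own illustrative code) reproduces the value 0.16625 and shows how the approximation improves with more rectangles.

```python
def midpoint_area(f, a, b, rectangles):
    # Approximate the area under f on [a, b] using rectangles of equal width,
    # each with height taken at the midpoint of its base.
    width = (b - a) / rectangles
    return sum(f(a + (i + 0.5) * width) * width for i in range(rectangles))

f = lambda x: 0.5 * x ** 2
print(midpoint_area(f, 0, 1, 10))     # 0.16625, as in the text
print(midpoint_area(f, 0, 1, 1000))   # much closer to 1/6
```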
CONCLUSIONS
This chapter introduces continuous random variables, that is, random variables that
can assume values in an interval or intervals.
We have found some surprising things when we added independent random vari-
ables, producing the normal probability distribution. We then stated the central limit
theorem, a basic result when we study statistical inference in succeeding chapters.
We discussed some methods for approximating continuous distributions by dis-
crete distributions, adding some credence to the statements we have made without
proof for continuous distributions.
EXPLORATIONS
1. Mathematics scores on the Scholastic Aptitude Test (SAT) are normally dis-
tributed with mean 500 and standard deviation 100.
(a) Find the probability that an individual’s score exceeds 620.
(b) Find the probability that an individual’s score exceeds 620, given that the
individual’s score exceeds 500.
(c) What score, or greater, can we expect to occur with probability 0.90?
2. A manufactured part is useful only if a measurement is between 0.25 and
0.38 in. The measurements follow a normal distribution with mean 0.30 and
standard deviation 0.03 in.
(a) What proportion of the parts meet specifications?
(b) Suppose the mean measurement could be changed, but the standard de-
viation cannot be changed. To what value should the mean be changed to
maximize the proportion of parts that meet specifications?
3. Upper and lower warning limits are often set for manufactured products. If
X ∼ N(μ, σ), these are commonly set at μ ±1.96σ. If the mean of the pro-
cess increases by one standard deviation, what effect does this have on the
proportion of parts outside the warning limits?
4. Suppose

f(x) = x, 0 ≤ x ≤ 1
f(x) = 2 − x, 1 ≤ x ≤ 2

(X is the sum of two uniformly distributed random variables).
(a) Find P(1/2 ≤ X ≤ 3/4).
(b) What is the probability that at least two of the three independent obser-
vations are greater than 1/2?
5. The joint probability distribution for random variables X and Y is given by
f(x, y) = k, x = 0, 1, 2, 3 and y = 0, 1, 2, . . . , 3 − x.
(a) Find k.
(b) Find the marginal distributions for X and Y.
6. Use the central limit theorem to approximate the probability that the sum is
34 when 12 dice are thrown.
7. The maximum weight an elevator can carry is 1600 lb. If the weights of
individuals using the elevator are N(150, 10), what is the probability that the
elevator will be overloaded?
8. A continuous probability distribution is defined as
f(x) = x, 0 < x < 1
f(x) = k, 1 < x < 2
f(x) = k(3 − x), 2 < x < 3
Find k.
Chapter 9
Statistical Inference I
CHAPTER OBJECTIVES:
• to study statistical inference: how conclusions can be drawn from samples
• to study both point and interval estimation
• to learn about two types of errors in hypothesis testing
• to study operating characteristic curves and the influence of sample size on our conclusions.
We now often encounter the results of sample surveys. We read about political polls
on how we feel about various issues; television networks and newspapers conduct
surveys about the popularity of politicians and how elections are likely to turn out.
These surveys are normally performed with what might be thought of as a relatively
small sample of the population being surveyed. These sample sizes are in reality
quite adequate for drawing inferences or conclusions; it all depends on how accurate
one wishes the survey to be.
How is it that a sample from a population can give us information about
that population? After all, some samples may be quite typical of the population
from which they are chosen, while other samples may be very unrepresentative
of the population from which they are chosen. The answer depends heavily on the theory
of probability that we have developed. We explore some ideas here and explain some
of the basis for statistical inference: the process of drawing conclusions from samples.
Statistical inference is usually divided into two parts: estimation and hypothesis
testing. We consider each of these topics now.
ESTIMATION
Assuming we do not already know the answer, suppose we wish to guess the age
of your favorite teacher. Most people would give an exact response: 52, 61, 48, and
so on. These are estimates of the age and, since they are exact, are called point
estimates.
Now suppose it is very important that our response be correct. We are unlikely
to estimate the correct age exactly, so a point estimate may not be a good response.
How else can one respond to the question? Perhaps a better response is, “I think the
teacher is between 45 and 60.” We might feel, in some sense, that the interval might
have a better chance of being correct, that is, in containing the true age, than a point
estimate. But we must be very careful in interpreting this, as we will see.
The response in our example is probably based on observation and does not
involve a random sample, so we cannot assign any probability or likelihood to the
interval. We now turn to the situation where we have a random sample and wish to
create an interval estimate. Such intervals are called confidence intervals.
CONFIDENCE INTERVALS
EXAMPLE 9.1 A Confidence Interval
Consider a normal random variable X, whose variance σ^2 is known but whose mean μ is
unknown. The central limit theorem tells us that the random variable representing the mean of
a sample of n random observations, X̄, has a N(μ, σ/√n) distribution.
So, one could say that
P(−1.645 ≤ (X̄ − μ)/(σ/√n) ≤ 1.645) ≈ 0.90
This is true because (X̄ − μ)/(σ/√n) is a normally distributed random variable. We could
have chosen many other true statements such as

P(−1.96 ≤ (X̄ − μ)/(σ/√n) ≤ 1.96) ≈ 0.95
or
P(−1.282 ≤ (X̄ − μ)/(σ/√n)) ≈ 0.90
We have, of course, an infinite number of choices for this normal random variable. The
probabilities in the statements above are called confidence coefficients. There is an infinity of
choices for the confidence coefficient. Once a confidence coefficient is selected and a sym-
metric interval is decided, the normal z-values can be found by using computer or hand-held
calculator or from published tables. The probability statements are all based on the fact that
(X̄ − μ)/(σ/√n) is a legitimate random variable that will vary from sample to sample.
Now rearrange the inequality in the statement
P(−1.645 ≤ (X̄ − μ)/(σ/√n) ≤ 1.645) ≈ 0.90
to read
−1.645 · σ/√n ≤ X̄ − μ ≤ 1.645 · σ/√n
We see further that
−1.645 · σ/√n − X̄ ≤ −μ ≤ 1.645 · σ/√n − X̄
Now if we solve for μ and rearrange the inequalities, we find that
X̄ − 1.645 · σ/√n ≤ μ ≤ X̄ + 1.645 · σ/√n
Note now that the end points of the inequality above are both known, since X̄ and n are
known from the sample and we presume that σ is known.
The interval we calculated above is called a 90% confidence interval.
We have avoided stating these results as probability statements. The reason for this is that the
statement
P(−1.645 ≤ (X̄ − μ)/(σ/√n) ≤ 1.645) ≈ 0.90
is a statement about a random variable and is a legitimate probability statement. The result
X̄ − 1.645 · σ/√n ≤ μ ≤ X̄ + 1.645 · σ/√n
however, is not a probability statement about μ. Although μ is unknown, it is a constant and
so it is either in the interval from X̄ − 1.645 · σ/√n to X̄ + 1.645 · σ/√n or it is not. So the
probability that μ is in the interval is either 0 or 1. What meaning, then, are we to give to the
0.90 with which we began?
We interpret the final result in this way: 90% of all possible intervals calculated will
contain the unknown, constant value μ, and 10% of all possible intervals calculated will not
contain the unknown, constant value μ.
᭿
EXAMPLE 9.2 Several Confidence Intervals
To illustrate the ideas in Example 9.1, we drew 20 samples of size 3 from a normal distribution
with mean 2 and standard deviation 2. The sample means are then approximately normally
distributed with mean 2 and standard deviation 2/√3 = 1.1547.
A graph of the 20 confidence intervals generated is shown in Figure 9.1.
As it happened, exactly 19 of the 20 confidence intervals contain the mean, indicated by
the vertical line. This occurrence is not always to be expected.
᭿
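A simulation in the spirit of Example 9.2 is sketched below in Python (our own illustration). It repeatedly draws samples of size 3 from a normal distribution with mean 2 and standard deviation 2, forms the interval X̄ ± z·σ/√n, and counts how often the interval contains the true mean; the choice z = 1.96, giving a 95% interval, is ours for illustration.

```python
import random

def z_interval(sample, sigma, z=1.96):
    # A confidence interval for the mean when sigma is known.
    n = len(sample)
    xbar = sum(sample) / n
    half_width = z * sigma / n ** 0.5
    return xbar - half_width, xbar + half_width

mu, sigma, n, repetitions = 2, 2, 3, 10_000
covered = 0
for _ in range(repetitions):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    low, high = z_interval(sample, sigma)
    covered += low <= mu <= high
print(covered / repetitions)   # close to 0.95 for z = 1.96
```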
Figure 9.1 Confidence intervals.
We have considered the estimation of some unknown mean μ by using a random
sample and constructing a confidence interval. We will consider other confidence
intervals subsequently, but now we turn to the other major part of statistical inference,
hypothesis testing.
HYPOTHESIS TESTING
If we choose a random sample from a probability distribution and calculate the sample
mean, X̄ = 32.4, can we confidently believe that the sample was chosen from a
population with μ = 26? The answer of course lies in both the variability of X̄ and
the confidence we need to place in our conclusion.
Two examples will be discussed here, the first from a discrete distribution and
the second from a continuous distribution. We confess now that each of the examples
is somewhat artificial and is used primarily to introduce some ideas so that they can
be brought out clearly. It is crucial that these ideas be clearly understood before we
proceed to more realistic problems.
EXAMPLE 9.3 Germinating Bulbs
A horticulturist is experimenting with an altered bulb for a large plant. From previous expe-
rience, she knows that the percentage of these bulbs that germinate is either 50% or 75%. To
decide which germination rate is correct, she plans an experiment involving 15 of these altered
bulbs and records the number of bulbs that germinate.
We assume that the number of bulbs that germinate follows a binomial model, that is, a
bulb either germinates or it does not, the bulbs behave independently, and the probability of
germination is constant. If in fact the probability is 50% that a bulb germinates and if X is the
random variable denoting the number of bulbs that germinate, then
$$P(X = x) = \binom{15}{x}(0.50)^x\cdot(0.50)^{15-x} \quad \text{for } x = 0, 1, 2, \ldots, 15$$
while if the probability is 75% that a bulb germinates, then
$$P(X = x) = \binom{15}{x}(0.75)^x\cdot(0.25)^{15-x} \quad \text{for } x = 0, 1, 2, \ldots, 15$$
We should first consider the probabilities of all the possible outcomes from the experiment.
These are shown in Table 9.1.
Table 9.1 Probabilities for Example 9.3
x p = 0.50 p = 0.75
0 0.0000 0.0000
1 0.0004 0.0000
2 0.0032 0.0000
3 0.0139 0.0000
4 0.0417 0.0001
5 0.0916 0.0007
6 0.1527 0.0034
7 0.1964 0.0131
8 0.1964 0.0393
9 0.1527 0.0917
10 0.0916 0.1651
11 0.0417 0.2252
12 0.0139 0.2252
13 0.0032 0.1559
14 0.0004 0.0668
15 0.0000 0.0134
The statements that 50% of the bulbs germinate or 75% of the bulbs germinate are called
hypotheses. They are conjectures about the behavior of the bulbs. We will formalize these
hypotheses as
H_0: p = 0.50
H_a: p = 0.75
We have called H_0: p = 0.50 the null hypothesis and H_a: p = 0.75 the alternative
hypothesis. Now we must decide between them. If we decide that the null hypothesis is correct,
then we accept the null hypothesis and reject the alternative hypothesis. Conversely, if we
reject the null hypothesis, then we accept the alternative hypothesis. How should we decide?
The decision process is called hypothesis testing.
In this case, we would certainly look at the number of bulbs that germinate. If in fact 75%
of the bulbs germinate, then we would expect a large number of the bulbs to germinate.
It would appear, if a large number of bulbs germinate, say 11 or more, that we would then
reject the null hypothesis (that p = 0.50) and accept the alternative hypothesis (that p = 0.75).
In carrying out this test, we cannot reach a decision with certainty because our conclusion
is based on a sample, a small one at that in this instance. What are the risks involved? There
are two risks or errors that we can make: we could reject the null hypothesis when it is actually
true or we could accept the null hypothesis when it is false. Let us consider each of these.
Rejecting the null hypothesis when it is true is called a Type I error. In this case, we reject
the null hypothesis when the number of germinating bulbs is 11 or more. The probability this
occurs when the null hypothesis is true is
P(Type I error) = 0.0417 +0.0139 +0.0032 +0.0004 +0.0000 = 0.0592
So about 6% of the time, bulbs that have a germination rate of 50% will behave as if the
germination rate were 75%.
Accepting the null hypothesis when it is false is called a Type II error. In this case, we
accept the null hypothesis when the number of germinating bulbs is 10 or less. The probability
this occurs when the null hypothesis is false is
P(Type II error) = 0.0000 +0.0000 +0.0000 +0.0000 +0.0001 +0.0007
+0.0033 +0.0131 +0.0393 +0.0917 +0.1651
= 0.3133
So about 31% of the time, bulbs with a germination rate of 75% will behave as though
the germination rate were only 50%.
The experiment will always result in some value of X. We must decide in advance which
values of X cause us to accept the null hypothesis and which values of X cause us to reject the
null hypothesis.
The values of X that cause us to reject H_0 comprise what we call the critical region for
the test.
In this case, large values of X are more likely to come from a distribution with p = 0.75
than from a distribution with p = 0.50. We have used the critical region X ≥ 11 here.
So it is reasonable to conclude that if X ≥ 11, then p = 0.75.
The errors calculated above are usually denoted by α and β. In general then
α = P(H_0 is rejected if it is true)
where α is often called the size or the significance level of the test.
The size of the Type II error is denoted by β. In general then
β = P(H_0 is accepted if it is false).
In this case, with the critical region X ≥ 11, we find α = 0.0592 and β = 0.3133.
Note that α and β are calculated under quite different assumptions, since α presumes
the null hypothesis true and β presumes the null hypothesis false, so they bear no particular
relationship to one another.
It is of course possible to decrease α by reducing the critical region to say X ≥ 12. This
produces α = 0.0175, but unfortunately, the Type II error increases to 0.5385. The only way
to decrease both α and β simultaneously is to increase the sample size.
Finally, note that both α and β can change only in discrete jumps as the critical region changes. It is not possible,
for example, to find a critical region that would produce an α between the values 0.0175 and 0.0592. It is possible
to decrease both α and β by increasing the sample size, as we now show.
᭿
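The error probabilities in Example 9.3 can be checked directly from the binomial model. The following is a minimal Python sketch using SciPy, offered as an alternative to the Minitab and Mathematica calculations used in this book; the critical region X ≥ 11 is the one chosen above.

```python
from scipy.stats import binom

n = 15                       # bulbs planted
p0, pa = 0.50, 0.75          # null and alternative germination rates
critical = 11                # reject H0 when X >= 11

alpha = binom.sf(critical - 1, n, p0)   # P(X >= 11 | p = 0.50)
beta = binom.cdf(critical - 1, n, pa)   # P(X <= 10 | p = 0.75)

print(f"alpha = {alpha:.4f}")   # about 0.0592
print(f"beta  = {beta:.4f}")    # about 0.3135 (the text's 0.3133 uses rounded table values)
```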
EXAMPLE 9.4 Increasing the Sample Size
Suppose now that the experimenter in the previous example has 100 bulbs with which to
experiment. We could work out all the probabilities for the 101 possible values for X and
would most certainly use a computer to do this. For this sample size, the binomial distribution
is very normal-like with a maximum at the expected value, np, and standard deviation
√(n·p·(1 − p)). In either case here, n = 100. If p = 0.50, we find the probability distribution
centered about np = 100 · 0.50 = 50 with standard deviation √(100 · 0.50 · 0.50) = 5. Since
we are seeking a critical region in the upper tail of this distribution, we look at values of X at least
one standard deviation from the mean, so we start at X = 55. We show some probabilities in
Table 9.2.
Table 9.2
Critical region α β
X ≥ 56 0.1356 0.0000
X ≥ 57 0.0967 0.0000
X ≥ 58 0.0443 0.0001
X ≥ 59 0.0284 0.0001
X ≥ 60 0.0176 0.0003
X ≥ 61 0.0105 0.0007
X ≥ 62 0.0060 0.0014
We see that α decreases as we move to the right on the probability distribution and that β
increases. We have suggested various critical regions here and determined the resulting values
of the errors. This raises the possibility that the size of one of the errors, say α, is chosen
in advance and then a critical region found that produces this value of α. The consequence
of this is shown in Table 9.2. It is not possible, for example, to choose α = 0.05 and find an
appropriate critical region. This is because the random variable is discrete in this case. If the
random variable were continuous, then it is possible to specify α in advance. We show how
this is so in the next example.
᭿
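Exact binomial calculations of this kind can be scripted rather than tabulated by hand. The following Python sketch, assuming SciPy, scans candidate critical regions X ≥ c for the n = 100 experiment and prints both error probabilities.

```python
from scipy.stats import binom

n, p0, pa = 100, 0.50, 0.75

# alpha = P(X >= c | p0); beta = P(X <= c - 1 | pa)
for c in range(56, 63):
    alpha = binom.sf(c - 1, n, p0)
    beta = binom.cdf(c - 1, n, pa)
    print(f"X >= {c}: alpha = {alpha:.4f}, beta = {beta:.4f}")
```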
EXAMPLE 9.5 Breaking Strength
The breaking strength of steel wires used in elevator cables is a crucial characteristic of these
cables. The cables can be assumed to come from a population with known σ = 400 lb. Before
accepting a shipment of these steel wires, an engineer wants to be confident that μ > 10,000 lb.
A sample of 16 wires is selected and their mean breaking strength X̄ is measured.
It would appear sensible to test the null hypothesis H_0: μ = 10,000 lb against the alternative
H_a: μ < 10,000. A test will be based on the sample mean, X̄. The central limit theorem
tells us that
$$\bar{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$
In this case, we have
$$\bar{X} \sim N\left(\mu, \frac{400}{\sqrt{16}}\right) = N(\mu, 100)$$
If the critical region has size 0.05, so that α = 0.05, then we would select a critical region
in the left tail of the normal curve. The situation is shown in Figure 9.2.
Figure 9.2 Distribution plot: Normal, Mean = 10,000, StDev = 100. The shaded left-tail area of 0.05 lies below approximately 9836.
The value of the shaded area in Figure 9.2 is 0.05, so the z-score is −1.645. This means that
the critical value of X̄ satisfies −1.645 = (x̄ − 10,000)/100, or x̄ = 10,000 − 164.5 = 9835.5. So
the null hypothesis should be rejected if the sample mean is less than 9835.5 lb.
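A minimal Python check of this critical value, assuming SciPy is available:

```python
from scipy.stats import norm

sigma_xbar = 400 / 16 ** 0.5             # 400 / sqrt(16) = 100
critical = norm.ppf(0.05, loc=10_000, scale=sigma_xbar)
print(round(critical, 1))                 # 9835.5
```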
β AND THE POWER OF A TEST
What is β, the size of the Type II error in this example? We recall that
β = P(H_0 is accepted if it is false)
or
β = P(H_0 is accepted if the alternative is true)
We could calculate β easily in our first example since in that case we had a specific
alternative to deal with (namely, p = 0.75). However, in this case, we have an infinity of
alternatives (μ < 10,000) to deal with.
The size of β depends upon which of these specific alternatives is chosen. We will show
some examples. We use the notation β_alt to denote the value of β when a particular alternative
is selected.
First, consider
$$\beta_{9800} = P(\bar{X} > 9835.5 \text{ if } \mu = 9800) = P\left(Z > \frac{9835.5-9800}{100} = 0.355\right) = 0.361295$$
So for this test the probability we accept H_0: μ = 10,000 if in fact μ = 9800 is over 36%.
Now let us try some other values for the alternative.
$$\beta_{9900} = P(\bar{X} > 9835.5 \text{ if } \mu = 9900) = P\left(Z > \frac{9835.5-9900}{100} = -0.645\right) = 0.740536$$
So almost 3/4 of the time this test will accept H_0: μ = 10,000 if in fact μ = 9900.
β, then, depends strongly upon the alternative hypothesis.
We can in fact show this dependence in Figure 9.3 in a curve that plots β_alt against the
specific alternative.
Figure 9.3 β_alt plotted against the specific alternative (alternatives 9700 to 10,000 on the horizontal axis, β on the vertical axis).
This curve is often called the operating characteristic curve for the test.
This curve is not part of a normal curve; in fact, it has no algebraic equation, each
point on it being calculated in the same manner as we have done in the previous two
examples.
We know that β_alt = P(accept H_0 if it is false), so 1 − β_alt = P(reject H_0 if it is false).
This is the probability that the null hypothesis is correctly rejected. 1 − β_alt is called the
power of the test for a specific alternative.
A graph of the power of the test is shown in Figure 9.4. Figure 9.4 also shows the power
of the test if the sample size were to be increased from 16 to 100. The graph indicates that the
sample of 100 is more likely to reject a false H_0 than is the sample of 16.
᭿
Figure 9.4 Power curves for the test with n = 16 and n = 100 (alternatives 9700 to 10,000 on the horizontal axis, power on the vertical axis).
p-VALUE FOR A TEST
We have discussed selecting a critical region for a test in advance and the disadvantages
of proceeding in that way. We abandoned that approach for what would appear to be a
more reasonable one, that is, selecting α in advance and calculating the critical region
that results. Selecting α in advance puts a great burden upon the experimenter. How
is the experimenter to know what value of α to choose? Should 5% or 6% or 10%
or 22% be selected? The choice often depends upon the sensitivity of the experiment
itself. If the experiment involves a drug to combat a disease, then α should be very
small; however, if the experiment involves a component of a nonessential mechanical
device, then the experimenter might tolerate a somewhat larger value of α. Now we
abandon that approach as well. But then we have a new problem: if we do not have a
critical region and if we do not have α either, then how can we proceed?
Suppose, to be specific, that we are testing H_0: μ = 22 against H_1: μ ≠ 22
with a sample of n = 25 and we know that σ = 5. The experimenter reports that the
observed X̄ = 23.72. We can calculate the probability that a sample of 25 would give this result,
or a result greater than 23.72, if the true mean were 22. This is found to be
$$P(\bar{X} \ge 23.72 \text{ if } \mu = 22) = P(Z \ge 1.72) = 0.0427162$$
Since the test is two sided, the phrase “a result more extreme” is interpreted to
mean P(|Z| ≥ 1.72) = 2 · 0.0427162 = 0.0854324. This is called the p-value for the
test.
This allows the experimenter to make the final decision, either to accept or reject
the null hypothesis, depending entirely upon the size of this probability. If the p-value
is very large, one would normally accept the null hypothesis, while if it is very small,
one would know that the result is in one of the extremities of the distribution and
reject the null hypothesis. The decision of course is up to the experimenter.
Here is a set of rules regarding the calculation of the p-value. We assume that z
is the observed value of Z and that the null hypothesis is H_0: μ = μ_0:

Alternative        p-value
μ > μ_0            P(Z > z)
μ < μ_0            P(Z < z)
μ ≠ μ_0            P(|Z| > |z|)
p-values have become popular because they can be computed easily. Tables
are of limited use and generally allow only approximations to p-values.
Statistical computer programs commonly calculate p-values.
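As an illustration of these rules, here is a minimal Python sketch, assuming SciPy, that reproduces the two-sided p-value for the test of H_0: μ = 22 above.

```python
from scipy.stats import norm

mu0, sigma, n, xbar = 22, 5, 25, 23.72
z = (xbar - mu0) / (sigma / n ** 0.5)      # observed z = 1.72

p_upper = norm.sf(z)                       # alternative mu > mu0
p_lower = norm.cdf(z)                      # alternative mu < mu0
p_two_sided = 2 * norm.sf(abs(z))          # alternative mu != mu0

print(round(z, 2), round(p_two_sided, 6))  # 1.72, about 0.085432
```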
CONCLUSIONS
We have introduced some basic ideas regarding statistical inference—the process by
which we draw inferences from samples—using confidence intervals or hypothesis
testing.
We will continue our discussion of statistical inference in the next chapter where
we look at the situation when nothing is known about the population being sam-
pled. We will also learn about comparing two samples drawn from possibly different
populations.
EXPLORATIONS
1. Use a computer to select 100 samples of size 5 each from a N(0, 1) distribution.
Compute the mean of each sample and then find how many of these are within
the interval −0.73567 to 0.73567. How many means are expected to be in
this interval?
2. In the past, 20% of the production of a sensitive component has been unsatisfactory.
A sample of 20 components shows 6 items that must be reworked. The manu-
facturer is concerned that the percentage of components that must be reworked
has increased to 30%.
(a) Form appropriate null and alternative hypotheses.
(b) Let X denote the number of items in a sample that must be reworked. If
the critical region is {x | x ≥ 9}, find the sizes of both Type I and Type II
errors.
(c) Choose other critical regions and discuss the implications of these on both
types of errors.
Chapter 10
Statistical Inference II:
Continuous Probability
Distributions II—Comparing
Two Samples
CHAPTER OBJECTIVES:
• to test hypotheses on the population variance
• to expand our study of statistical inference to include hypothesis tests for a mean when
the population standard deviation is not known
• to test hypotheses on two variances
• to compare samples from two distributions
• to introduce the Student t, chi-squared, and F probability distributions
• to show some applications of the above topics.
In the previous chapter, we studied testing hypotheses on a single mean, but we
presumed, while we did not know the population mean, that we did know the standard
deviation. Now we examine the situation where nothing is known about the sampled
population. First, we look at the population variance.
THE CHI-SQUARED DISTRIBUTION
We have studied the probability distribution of the sample mean through the central
limit theorem. To expand this inference when the population standard deviation σ is
unknown, we must be able to make inferences about the population variance. We
begin with a specific example.
One hundred samples of size 5 were selected from a N(0, 1) distribution and in
each case the sample variance s² was calculated. In general,
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$$
This can be shown to be
$$s^2 = \frac{n\sum_{i=1}^{n}x_i^2 - \left(\sum_{i=1}^{n}x_i\right)^2}{n(n-1)}$$
In this case, n = 5.
Here are some of the samples, the sample mean, and the sample variance:

Sample                                                       x̄            s²
−0.815117, −0.377894, 1.12863, −0.550026, −0.969271    −0.316736    0.705356
0.172919, −0.989559, 0.448382, 0.778734, −1.37236      −0.192377    0.878731
...
0.0723954, 0.633443, −0.437318, −0.684865, −1.33975    −0.351219    0.561233
A graph of the values of s² is shown in Figure 10.1.
Figure 10.1 Histogram of the 100 sample variances.
It is clear that the graph has a somewhat long right-hand tail and of course must be
nonnegative. It is certainly not normal due to these facts. Our sample showed a mean
value for the sample variances to be 1.044452, and the variance of these variances
was 0.521731.
Now we state, without proof, the following theorem:
Theorem 10.1 If samples are selected from a N(μ, σ) distribution and the sample
variance s² is calculated for each sample, then (n − 1)s²/σ² follows the chi-squared
probability distribution with n − 1 degrees of freedom (χ²_{n−1}).
Also
$$E(s^2) = \sigma^2 \quad\text{and}\quad \operatorname{Var}(s^2) = \frac{2\sigma^4}{n-1}$$
The proof of this theorem can be found in most texts on mathematical statistics. The
theorem says that the probability distribution depends on the sample size n and that
the distribution has a parameter n −1, which we call the degrees of freedom.
We would then expect, since σ² = 1, that E(s²) = 1; our sample showed a mean
value of the variances of 1.044452. We would also expect the variance of the sample
variances to be 2σ⁴/(n − 1) = 2 · 1/4 = 1/2; our variances showed a variance of
0.521731, so our samples tend to agree with the results stated in the theorem.
Now we show, in Figure 10.2, the chi-squared distribution with n − 1 = 4 degrees
of freedom (which we abbreviate as χ²₄).
Figure 10.2 The chi-squared distribution with 4 degrees of freedom.
Finally, in Figure 10.3, we superimpose the χ²₄ distribution on our histogram of
sample variances.
Figure 10.3 The chi-squared density superimposed on the histogram of sample variances.
Note that the exponent 2 carries no meaning whatsoever; it is simply a symbol and
alerts us to the fact that the random variable is nonnegative. It is possible, but probably
useless, to find the probability distribution for χ, as we could find the probability
distribution for the square root of a normal random variable.
We close this section with an admonition: while the central limit theorem affords
us the luxury of sampling from any probability distribution, the probability distribution
of χ² depends heavily on the fact that the samples arise from a normal distribution.
STATISTICAL INFERENCE ON THE VARIANCE
While the mean value of the diameter of a manufactured part is very important for the
part to fit into a mechanism, the variance of the diameter is also crucial so that parts
do not vary widely from their target value. A sample of 12 parts showed a sample
variance s² = 0.0025. Is this evidence that the true variance σ² exceeds 0.0010?
To solve the problem, we must calculate some probabilities.
One difficulty with the chi-squared distribution, and indeed with almost all prac-
tical continuous probability distributions, is the fact that areas, or probabilities, are
very difficult to compute and so we rely on computers to do that work for us. The
computer system Mathematica and the statistical program Minitab® have both been
used in this book for these calculations and the production of graphs.
Here are some examples where we need some points on χ²₁₁:
We find that P(χ²₁₁ < 4.57) = 0.05 and so P((n − 1)s²/σ² < 4.57) = 0.05,
which means that P(σ² > (n − 1)s²/4.57) = 0.05, and in this case we find
P(σ² > 11 · 0.0025/4.57) = P(σ² > 0.0060175) = 0.05, and so we have a
confidence interval for σ².
Hypothesis tests are carried out in a similar manner. In this case, as in many
other industrial examples, we are concerned that the variance may be too large; small
variances are of course desirable. For example, consider the hypotheses from the
previous example, H_0: σ² = 0.0010 and H_A: σ² > 0.0010.
From our data, where s² = 0.0025, and from the example above, we
see that this value for s² is in the rejection region. We also find that
χ²₁₁ = (n − 1)s²/σ² = 11 · 0.0025/0.0010 = 27.5 and we can calculate that
P(χ²₁₁ > 27.5) = 0.00385934, and so we have a p-value for the test.
Without a computer this could only be done with absolutely accurate tables.
The situation is shown in Figure 10.4.
Figure 10.4 Chi-squared distribution with 11 degrees of freedom; the area to the right of 27.5 is 0.00386.
Figure 10.5 Chi-squared distribution with 4 degrees of freedom; the lower and upper 2.5% points are 0.484 and 11.1.
Now we look at some graphs with other degrees of freedom in Figures 10.5 and
10.6.
The chi-squared distribution becomes more and more “normal-like” as the
degrees of freedom increase. Figure 10.7 shows a graph of the distribution with
30 degrees of freedom.
Figure 10.6 Chi-squared distribution with 4 degrees of freedom; the area to the right of 9.49 is 0.05.
Figure 10.7 Chi-squared distribution with 30 degrees of freedom; the area to the right of 43.8 is 0.05.
We see that P(χ²₃₀ > 43.8) = 0.05. It can be shown that E(χ²ₙ) = n and that
Var(χ²ₙ) = 2n. If we approximate χ²₃₀ by a normal distribution with mean 30 and
standard deviation √60 = 7.746, we find that the point with 5% of the curve in the
right-hand tail is 30 + 1.645√60 = 42.7, so the approximation is not too bad. The
approximation is not very good, however, for small degrees of freedom.
Now we are able to consider inferences about the sample mean when σ is
unknown.
STUDENT t DISTRIBUTION
In the previous chapter, we have used the central limit theorem to calculate
confidence intervals and to carry out tests of hypotheses. In noting that the ran-
dom variable (x̄ − μ)/(σ/√n) follows a N(0, 1) distribution, we rely heavily on
the fact that the standard deviation of the population, σ, is known. However, in
many practical situations, both population parameters, μ and σ, are unknown.
In 1908, W. S. Gosset, writing under the pseudonym “Student,” discovered the
following:
Theorem 10.2 A N(0, 1) random variable divided by the square root
of a chi-squared random variable divided by its degrees of freedom (say n) follows a
Student t distribution with n degrees of freedom (t_n).
Since (x̄ − μ)/(σ/√n) is a N(0, 1) random variable, and (n − 1)s²/σ² is a χ²_{n−1}
variable, it follows that
$$\frac{(\bar{x}-\mu)/(\sigma/\sqrt{n})}{\sqrt{\dfrac{(n-1)s^2}{\sigma^2(n-1)}}}$$
follows a t_{n−1}
probability distribution. But
$$\frac{\dfrac{\bar{x}-\mu}{\sigma/\sqrt{n}}}{\sqrt{\dfrac{(n-1)s^2}{\sigma^2(n-1)}}} = \frac{\bar{x}-\mu}{s/\sqrt{n}}$$
So we have a random variable that involves μ alone and values calculated from
a sample. It is far too simple to say that we simply replace σ by s in the central limit
theorem; much more than that has transpired. The sampling can arise from virtually
any distribution due to the fact that the central limit theorem is almost immune to the
underlying distribution.
We show two typical Student t distributions in Figure 10.8 (which was produced
by Minitab).
Figure 10.8 Student t distributions with 4 and 20 degrees of freedom.
The curves have 4 and 20 degrees of freedom, respectively, and each appears to
be “normal-like.” Here is a table of values v such that the probability that a t_n random
variable exceeds v is 0.05. The number of degrees of freedom is n.
n v
5 2.02
10 1.81
20 1.72
30 1.70
40 1.68
This is some evidence that the 0.05 point approaches that of the N(0, 1) dis-
tribution, 1.645, but the degrees of freedom, depending on the sample size, remain
crucial.
Again a computer is essential in doing calculations as the following example
shows.
EXAMPLE 10.1 A Sample
A sample of size 14 showed that x̄ = 29.8 and s² = 123. Find the p-value for the test of
H_0: μ = 26 against H_a: μ ≠ 26.
Here we find that
$$t_{13} = \frac{29.8-26}{\sqrt{123/14}} = 4.797$$
and P(t₁₃ > 4.797) = 1.7439 × 10⁻⁴, a rare event indeed.
᭿
TESTING THE RATIO OF VARIANCES: THE F
DISTRIBUTION
So far we have considered inferences from single samples, but we will soon turn to
comparing two samples, possibly arising from two different populations. So we now
investigate comparing variances from two different samples. We need the following
theorem whose proof can be found in texts on mathematical statistics.
Theorem 10.3 The ratio of two independent chi-squared random variables, each divided
by its degrees of freedom, say n in the numerator and m in the denominator, follows
the F distribution with n and m degrees of freedom.
We know from a previous section that (n − 1)s²/σ² is a χ²_{n−1} random variable,
so if we have two samples, with degrees of freedom n and m, then using the theorem,
$$\frac{\dfrac{n s_1^2/\sigma_1^2}{n}}{\dfrac{m s_2^2/\sigma_2^2}{m}} = \frac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2} = F[n, m]$$
Figure 10.9 shows a graph of a typical F distribution, this with 7 and 9 degrees
of freedom.
The graph also shows the upper and lower 2.5% points.
Now notice that the reciprocal of an F random variable is also an F random
variable but with the degrees of freedom interchanged, so
1
F[n,m]
= F[m, n]. This
fact can be used in finding critical values. Suppose P[F[n, m] < v] = α. This is
equivalent to P[1/F[n, m] > 1/v] or P[F[m, n] > 1/v] = α. So the reciprocal of
the lower α point on F[n, m] is the upper α point on the F[m, n] distribution.
Figure 10.9 F distribution with 7 and 9 degrees of freedom; the lower and upper 2.5% points are 0.207 and 4.20.
EXAMPLE 10.2 Two Samples
Independent samples from two normal distributions gave the following data:
n₁ = 22, s₁² = 100, n₂ = 13, and s₂² = 200. We seek a 90% confidence interval for the
ratio of the true variances, σ₁²/σ₂².
We know that
$$\frac{s_2^2/\sigma_2^2}{s_1^2/\sigma_1^2} = \frac{200}{\sigma_2^2}\cdot\frac{\sigma_1^2}{100} = \frac{2\sigma_1^2}{\sigma_2^2} = F[12, 21]$$
so we find, from Figure 10.10, that
P[0.395 < 2σ₁²/σ₂² < 2.25] = 0.90, which can also be written as
$$P\left[0.1975 < \frac{\sigma_1^2}{\sigma_2^2} < 1.125\right] = 0.90$$
It is interesting to note, while one sample variance is twice the other, that the confidence
interval contains 1, so the hypothesis H_0: σ₁² = σ₂² against H_a: σ₁² ≠ σ₂² would be accepted
with α = 0.10.
This is an indication of the great variability of the ratio of variances.
᭿
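The two F points used in Example 10.2 (and shown in Figure 10.10) can be computed directly. A short Python sketch, assuming SciPy:

```python
from scipy.stats import f

dfn, dfd = 12, 21                 # numerator and denominator degrees of freedom
lower = f.ppf(0.05, dfn, dfd)     # about 0.395
upper = f.ppf(0.95, dfn, dfd)     # about 2.25

s1_sq, s2_sq = 100, 200
# multiplying by s1^2/s2^2 = 0.5 converts the F points into the interval for sigma1^2/sigma2^2
print(lower * s1_sq / s2_sq, upper * s1_sq / s2_sq)   # about 0.1975 and 1.125
```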
We now turn to other hypotheses involving two samples.
Figure 10.10 F distribution with 12 and 21 degrees of freedom; the lower and upper 5% points are 0.395 and 2.25.
TESTS ON MEANS FROM TWO SAMPLES
It is often necessary to compare two samples with each other. We may wish to compare
methods of formulating a product; different versions of standardized tests may be
compared to decide if they in fact test equally able candidates equally; production
methods may be compared with respect to the quality of the products each method
produces. Two samples may be compared by comparing a variety of sample statistics;
we consider comparing only means and variances here.
EXAMPLE 10.3 Two Production Lines
Two production lines, called X and Y, are making specialized heaters. Samples are selected
from each production line, the thermostats set at 140°, and then the actual temperature in the
heater is measured. The results of the sampling are given in Table 10.1.
Graphs should be drawn from the data so that we may make an initial visual inspection.
Comparative dot plots of the samples are shown in Figure 10.11.
The data from the production line X appear to be much more variable than that from
production line Y. It also appears that the samples are centered about different points, so we
calculate some sample statistics and we find that
n_x = 15,  x̄ = 138.14,  s_x = 6.95
n_y = 25,  ȳ = 144.02,  s_y = 3.15
Table 10.1
X Y
147.224 135.648
121.482 140.083
142.691 140.970
127.155 138.990
147.766 148.490
131.562 145.757
139.844 145.740
139.585 146.324
142.966 145.472
140.058 145.332
139.553 140.822
137.973 145.022
137.343 145.496
137.151 147.103
139.809 144.753
145.598
145.471
145.319
143.103
145.676
141.644
144.381
146.797
139.065
147.352
Figure 10.11 Comparative dot plots of the data for Example 10.3.
The real question here is how these statistics would vary were we to select large numbers of
pairs of samples. The central limit theorem can be used if assumptions can be made concerning
the population variances.
We formulate the problem as follows. We wish to test the hypothesis H_0: μ_X = μ_Y
against H_1: μ_X ≠ μ_Y.
We know that E[X̄ − Ȳ] = μ_X − μ_Y and that
$$\operatorname{Var}[\bar{X}-\bar{Y}] = \frac{\sigma_X^2}{n_x} + \frac{\sigma_Y^2}{n_y}$$
and that each of the variables X̄ and Ȳ is individually approximately normally distributed by
the central limit theorem. It can be shown that the difference between normal variables is also
normal, so
$$z = \frac{(\bar{X}-\bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{\dfrac{\sigma_X^2}{n_x} + \dfrac{\sigma_Y^2}{n_y}}}$$
is a N(0, 1) variable.
The statistic z can be used to test hypotheses or to construct confidence intervals if the
variances are known. Consider for the moment that we can assume that the populations have
equal variances, say σ_X² = σ_Y² = 30.0. Then
$$z = \frac{(138.14 - 144.02) - 0}{\sqrt{\dfrac{30}{15} + \dfrac{30}{25}}} = -3.287$$
The p-value for this test is approximately 0.001, so the null hypothesis would most
probably not be accepted.
We could also use z to construct a confidence interval. Here
$$P\left[(\bar{X}-\bar{Y}) - 1.96\sqrt{\frac{\sigma_X^2}{n_x} + \frac{\sigma_Y^2}{n_y}} \le \mu_X - \mu_Y \le (\bar{X}-\bar{Y}) + 1.96\sqrt{\frac{\sigma_X^2}{n_x} + \frac{\sigma_Y^2}{n_y}}\right] = 0.95$$
which becomes in this case the interval from −9.3862 to −2.3739. Since 0 is not in this interval,
the hypothesis of equal means is rejected.
᭿
Larger samples reduce the width of the confidence interval.
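A minimal Python sketch of the known-variance calculation above, assuming SciPy and the equal-variance value 30.0 used in the example:

```python
from scipy.stats import norm

xbar, ybar = 138.14, 144.02
nx, ny = 15, 25
var_x = var_y = 30.0                       # assumed known and equal, as in the text

se = (var_x / nx + var_y / ny) ** 0.5
z = (xbar - ybar) / se
p_value = 2 * norm.sf(abs(z))              # two-sided test of equal means

ci = (xbar - ybar - 1.96 * se, xbar - ybar + 1.96 * se)
print(round(z, 3), round(p_value, 4), ci)  # -3.287, about 0.001, (-9.386, -2.374)
```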
Knowledge of the population variances in the previous example may be regarded
as artificial or unusual, although it is not infrequent, when data have been gathered
over a long period of time, that some idea of the size of the variance is known. Now
we consider the case where the population variances are unknown. There are two
cases: the unknown variances are equal or they are not equal. We give examples of
the procedure in each case.
EXAMPLE 10.4 Using Pooled Variances
If the population variances are known to be equal, with true value σ², then we form an estimate
of this common value, which we denote by s_p², where
$$s_p^2 = \frac{(n_x - 1)s_X^2 + (n_y - 1)s_Y^2}{n_x + n_y - 2}$$
Here s_p² is often called the pooled variance. The sampling must now be done from normal
distributions.
We replace each of the unknown, but equal, variances with the pooled variance.
Then it is known that
$$t_{n_X + n_Y - 2} = \frac{(\bar{X}-\bar{Y}) - (\mu_X - \mu_Y)}{s_p\sqrt{\dfrac{1}{n_X} + \dfrac{1}{n_Y}}}$$
In this case, we find that
$$s_p^2 = \frac{14\cdot 6.95^2 + 24\cdot 3.15^2}{15 + 25 - 2} = 24.0625$$
and so
$$t_{38} = \frac{(138.14 - 144.02) - 0}{\sqrt{24.0625}\,\sqrt{\dfrac{1}{15} + \dfrac{1}{25}}} = -3.670$$
The p-value for the test is then about 0.0007 leading to the rejection of the hypothesis
that the population means are equal.
᭿
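Here is a short Python sketch of the pooled-variance test, again assuming SciPy; it reproduces the statistic and p-value from the summary statistics of Example 10.3.

```python
from scipy.stats import t

xbar, sx, nx = 138.14, 6.95, 15
ybar, sy, ny = 144.02, 3.15, 25

df = nx + ny - 2
sp2 = ((nx - 1) * sx**2 + (ny - 1) * sy**2) / df        # pooled variance, 24.0625
t_stat = (xbar - ybar) / (sp2 * (1/nx + 1/ny)) ** 0.5
p_value = 2 * t.sf(abs(t_stat), df)                     # two-sided p-value

print(round(sp2, 4), round(t_stat, 3), round(p_value, 4))  # 24.0625, -3.67, about 0.0007
```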
EXAMPLE 10.5 Unequal Variances
Now we must consider the case where the population variances are unknown and cannot be pre-
sumed to be equal. Regrettably, we do not know the exact probability distribution of any statistic
involving the sample data in this case. This unsolved problem is known as the Behrens–Fisher
problem; several approximations are known. An approximation due to Welch is given here.
The variable
$$T = \frac{(\bar{X}-\bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{\dfrac{s_X^2}{n_x} + \dfrac{s_Y^2}{n_y}}}$$
is approximately a t variable with ν degrees of freedom, where
$$\nu = \frac{\left(\dfrac{s_X^2}{n_x} + \dfrac{s_Y^2}{n_y}\right)^2}{\dfrac{\left(s_X^2/n_x\right)^2}{n_x - 1} + \dfrac{\left(s_Y^2/n_y\right)^2}{n_y - 1}}$$
It cannot be emphasized too strongly that the exact probability distribution of T is un-
known.
᭿
Using the data in Example 10.3, we find ν = 17.660, so we must use a t variable
with 17 degrees of freedom. (We must always use the greatest integer less than or equal
to ν; otherwise, the sample sizes are artificially increased.) This gives T₁₇ = −3.09,
a result quite comparable to previous results. The Welch approximation will make a
very significant difference if the population variances are quite disparate.
It is not difficult to see that, as n_x → ∞ and n_y → ∞,
$$T = \frac{(\bar{X}-\bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{\dfrac{s_X^2}{n_x} + \dfrac{s_Y^2}{n_y}}} \approx \frac{(\bar{X}-\bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{\dfrac{\sigma_X^2}{n_x} + \dfrac{\sigma_Y^2}{n_y}}} = z$$
Certainly if each of the sample sizes exceeds 30, the normal approximation will be a
very good one. However, it is very dangerous to assume normality for small samples if
their population variances are quite different. In that case, the normal approximation
is to be avoided. Computer programs such as Minitab make it easy to do the exact
calculations involved, regardless of sample size. This is a safe and prudent route to
follow in this circumstance.
The tests given here heavily depend on the relationship between the variances
as well as the normality of the underlying distributions. If these assumptions are not
true, no exact or approximate tests are known for this situation.
Often T is used, but the minimum of n_x − 1 and n_y − 1 is used for the degrees
of freedom. If this advice is followed, then in the example we have been using,
T = −3.09 with 14 degrees of freedom, giving a p-value of 0.008. Our approximation
in Example 10.5 allows us to use 17 degrees of freedom, making the p-value
0.007. The difference is not great in this case.
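The Welch approximation is easy to script. The following Python sketch, assuming SciPy, computes T, the approximate degrees of freedom ν, and the two-sided p-value from the summary statistics of Example 10.3; small differences from the text's ν = 17.660 come from rounding the sample standard deviations.

```python
from scipy.stats import t

xbar, sx, nx = 138.14, 6.95, 15
ybar, sy, ny = 144.02, 3.15, 25

vx, vy = sx**2 / nx, sy**2 / ny
T = (xbar - ybar) / (vx + vy) ** 0.5

nu = (vx + vy) ** 2 / (vx**2 / (nx - 1) + vy**2 / (ny - 1))  # Welch degrees of freedom
df = int(nu)                                                 # greatest integer <= nu

p_value = 2 * t.sf(abs(T), df)
print(round(T, 2), round(nu, 3), df, round(p_value, 4))      # about -3.09, 17.5, 17, 0.0066
```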
CONCLUSIONS
Three new, and very useful, probability distributions have been introduced here that
have been used in testing a single variance, a single mean when the population variance
is unknown, and two variances. This culminates in several tests on the difference
between two means when the population variances are unknown, both when they
are assumed to be equal and when they are assumed to be unequal.
We now study a very important application of much of what we have discussed
in statistical process control.
EXPLORATIONS
1. Two samples of students taking the Scholastic Aptitude Test (SAT) gave the
following data:
n₁ = 34,  x̄₁ = 563,  s₁² = 80.64
n₂ = 28,  x̄₂ = 602,  s₂² = 121.68
Assume that the samples arise from normal populations.
(a) Test H_0: μ₁ = 540 against H_1: μ₁ > 540.
(b) Find a 90% confidence interval for σ₂².
(c) Test H_0: μ₁ = μ₂ assuming
(i) σ₁² = σ₂². Is this assumption supported by the data?
(ii) σ₁² ≠ σ₂². Is this assumption supported by the data?
Chapter 11
Statistical Process Control
CHAPTER OBJECTIVES:
• to introduce control charts
• to show how to estimate the unknown σ
• to show control charts for means, proportion defective, and number of defectives
• to learn about acceptance sampling and how this may improve the quality of manufactured products.
Statistics has become an important subject because, among other contributions to
our lives, it has improved the quality of the products we purchase and use. Statistical
analysis has become a centerpiece of manufacturing. In this chapter, we want to
explore some of the ways in which statistics does this. We first look at control charts.
CONTROL CHARTS
EXAMPLE 11.1 Data from a Production Line
A manufacturing process is subject to periodic inspection. Samples of five items are selected
from the production line periodically. A size measurement is taken and the mean of the five
observations is recorded. The means are plotted on a graph as a function of time. The result is
shown in Figure 11.1.
The means shown in Figure 11.1 certainly show considerable variation. Observation 15,
for example, appears to be greater than any of the other observations while the 19th observation
appears to be less than the others. The variability of the means appears to have decreased after
observation 20 as well. But, are these apparent changes in the process statistically significant?
If so, the process may have undergone some significant changes that may lead to an investigation
of the production process. However, processes show random variation arising from a number
Figure 11.1 Means from a production line.
of sources, and we may simply be observing this random behavior that occurs in all production
processes.
It would help, of course, if we were to know the true mean of the observations, μ, as well as
the standard deviation, σ, but neither of these quantities is known. How can we proceed in judging
the process?
Forty samples, each of size 5, were selected. Some of the samples chosen and summary
statistics from these samples are given in Table 11.1. The observations are denoted by
x1, x2, x3, x4, and x5.
Table 11.1
Standard
x1 x2 x3 x4 x5 Mean Deviation Range
14.4357 16.3964 22.5443 23.2557 14.7566 18.2777 4.29191 8.81991
21.1211 17.5008 19.6445 19.9313 21.2559 19.8907 1.51258 3.75511
19.9250 17.3215 19.9991 13.4993 23.4052 18.8300 3.68069 9.90589
22.8812 18.8140 21.7421 17.3827 20.5589 20.2758 2.20938 5.49851
19.8692 22.2468 16.5286 20.3042 21.7208 20.1339 2.24052 5.71821
20.5668 20.9625 21.3229 19.1754 21.2247 20.6504 0.87495 2.14750
24.5589 18.9837 23.1692 19.1203 20.5269 21.2718 2.49118 5.57512
23.0998 18.7467 19.7165 17.8197 16.4408 19.1647 2.50961 6.65895
20.8545 18.8997 18.8628 21.7777 15.8875 19.2564 2.26620 5.89023
...
19.4660 19.8596 19.6703 17.3478 20.9661 19.4619 1.31656 3.61823
For each sample, the mean, standard deviation, and range (the difference between the
largest observation and the smallest observation) were calculated.
We know the central limit theorem shows that the sample mean x̄ follows a normal
distribution with mean μ and standard deviation σ/√n. In this case, n is the sample size 5. It
is natural to estimate μ by the mean of the sample means, x̿. In this case, using all 40 sample
means, x̿ = 19.933. Now if we could estimate σ, the standard deviation, we might say that a
mean greater than x̿ + 3(σ/√n) or a mean less than x̿ − 3(σ/√n) would be unusual since, for
a normal distribution, we know that the probability an observation is outside these limits is
0.0026998, a very unlikely event. While we do not know σ, we might assume, correctly, that
the sample standard deviations and the sample ranges may aid us in estimating σ. We now show
two ways to estimate σ based on each of these sample statistics.
᭿
ESTIMATING σ USING THE SAMPLE STANDARD
DEVIATIONS
In Table 11.1, the standard deviation is calculated for each of the samples shown
there. The sample standard deviations give us some information about the process
standard deviation σ. It is known that the mean of the sample standard deviations, s̄,
can be adjusted to provide an estimate for the standard deviation σ. In fact,
σ̂ = s̄ / c₄
where the adjustment factor c₄ depends on the sample size. The adjustment c₄ ensures
that the expected value of σ̂ is the unknown σ. Such estimates are called unbiased.
Table 11.2 shows some values of this quantity.
Table 11.2 shows some values of this quantity.
Table 11.2
n        c₄
2 0.797885
3 0.886227
4 0.921318
5 0.939986
6 0.951533
7 0.959369
8 0.965030
9 0.969311
10 0.972659
11 0.975350
12 0.977559
13 0.979406
14 0.980971
15 0.982316
16 0.983484
17 0.984506
In this case, s̄ = 1.874 and c₄ = 0.939986, so our estimate is
σ̂ = 1.874 / 0.939986 = 1.99365
This means that the limits we suggested, x̿ ± 3(σ̂/√n), become
19.933 − 3 · 1.99365/√5 = 17.2582 and 19.933 + 3 · 1.99365/√5 = 22.6078.
These are called upper and lower control limits (UCL and LCL). If we show these
on the graph in Figure 11.1, we produce Figure 11.2.
Figure 11.2 Means from a production line, with x̿ = 19.933, UCL = 22.608, and LCL = 17.259; σ estimated from the sample standard deviations.
None of the observations are outside the control limits.
The use of three standard deviations as control limits is a very conservative choice.
The fact that the limits are exceeded so rarely is an argument in its favor, since the
production line would be shut down or investigated very infrequently. It is possible,
of course, to select other control limits. The limits x̿ − 2 · σ̂/√n and x̿ + 2 · σ̂/√n
would be exceeded roughly 4.55% of the time. In this case, these limits are
19.933 − 2 · 1.99365/√5 = 18.1498
and
19.933 + 2 · 1.99365/√5 = 21.7161
The resulting control chart is shown in Figure 11.3.
Now the 15th observation is a bit beyond the upper control limit.
Figure 11.3 Means from a production line, with x̿ = 19.933, +2SL = 21.716, and −2SL = 18.150; σ estimated from the sample standard deviations.
ESTIMATING σ USING THE SAMPLE RANGES
The sample ranges can also be used to estimate σ. The mean range R̄ must be adjusted
to provide an unbiased estimate of σ. It is a fact that an unbiased estimate of σ is
σ̂ = R̄ / d₂
where d₂ depends on the sample size. Table 11.3 gives some values of d₂.
Table 11.3
n        d₂
2 1.128
3 1.693
4 2.059
5 2.326
6 2.534
7 2.704
8 2.847
9 2.970
10 3.078
In this case, R̄ = 4.536 and d₂ = 2.326, so our estimate of σ is
σ̂ = 4.536 / 2.326 = 1.95013
and this produces two sigma control limits at 19.933 − 2 · 1.95013/√5 = 18.189
and 19.933 + 2 · 1.95013/√5 = 21.678. The resulting control chart is shown in
Figure 11.4.
Figure 11.4 Means from a production line, with x̿ = 19.933, +2SL = 21.678, and −2SL = 18.189; σ estimated from the sample ranges.
The two control charts do not differ much in this case. It is easier on the production
floor to calculate R, but both methods are used with some frequency.
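Both estimates of σ, and the resulting limits, are simple to compute from the summary statistics. A minimal Python sketch, assuming NumPy and using the c₄ and d₂ values for samples of size 5 from Tables 11.2 and 11.3:

```python
import numpy as np

xbarbar = 19.933          # mean of the 40 sample means
sbar, rbar = 1.874, 4.536 # mean sample standard deviation and mean range
n = 5
c4, d2 = 0.939986, 2.326  # unbiasing constants for samples of size 5

for label, sigma_hat in [("from s-bar", sbar / c4), ("from R-bar", rbar / d2)]:
    for k in (2, 3):      # two and three sigma limits
        half = k * sigma_hat / np.sqrt(n)
        print(f"{label}: {k}-sigma limits {xbarbar - half:.3f} to {xbarbar + half:.3f}")
```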
The control charts here were produced using Minitab, which allows several methods
for estimating σ as well as great flexibility in using various multiples of σ as control
limits. It is possible to produce control charts for statistics other than the sample mean,
but we will not discuss those charts here. These are examples of control charts for
variables.
It is natural in our example to use the sample mean as a statistic since we made
measurements on each of the samples as they emerged from the production process.
It may be, however, that the production items are either usable or defective; in that
case, we call the resulting control charts control charts for attributes.
CONTROL CHARTS FOR ATTRIBUTES
EXAMPLE 11.2 Metal Plating Data
Thirty samples of size 50 each are selected from a manufacturing process involving plating a
metal. The data in Table 11.4 give the number of parts that showed a plating defect.
Table 11.4
Sample number Defects Sample number Defects
1 6 16 5
2 4 17 6
3 7 18 1
4 3 19 6
5 1 20 7
6 3 21 12
7 3 22 7
8 1 23 5
9 5 24 2
10 5 25 3
11 2 26 2
12 11 27 2
13 2 28 5
14 3 29 4
15 2 30 4
We are interested in the number of defects in each sample and how this varies from sample
to sample as the samples are taken over a period of time. The control chart involved is usually
called an np chart. We now show how this is constructed.
᭿
np Control Chart
Let the random variable X denote the number of parts showing plating defects. The
random variable X is a binomial random variable whose probability distribution in
general is given by
$$P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, \ldots, n$$
The random variable X is a binomial random variable because a part either shows
a defect or it does not show a defect, and we assume, correctly or incorrectly, that the
parts are produced independently and with constant probability of a defect p. Since
50 observations were taken in each sample, n = 50 and so X takes on integer values
from 0 to 50. We know that the mean value of X is np and the standard deviation is
√(np(1 − p)).
Reasonable control limits then might be from LCL = X̄ − 3√(np(1 − p)) to
UCL = X̄ + 3√(np(1 − p)). We can find from the data that the total number of defective
parts is 129, so the mean number of defective parts is X̄ = 129/30 = 4.30, but
we do not know the value of p.
A reasonable estimate for p is the total number of defects divided by the total
number of parts sampled, or 129/[(30)(50)] = 0.086. This gives estimates for the control
limits as
LCL = X̄ − 3√(n p̂(1 − p̂)) = 4.30 − 3√(50(0.086)(1 − 0.086)) = −1.6474
and
UCL = X̄ + 3√(n p̂(1 − p̂)) = 4.30 + 3√(50(0.086)(1 − 0.086)) = 10.25
Since X, the number of parts showing defects, cannot be negative, the lower
control limit is taken as 0. The resulting control chart, produced by Minitab, is shown
in Figure 11.5.
Figure 11.5 np chart for the defects data, with center line 4.3, UCL = 10.25, and LCL = 0; samples 12 and 21 are flagged.
It appears that samples 12 and 21 show the process to be out of control. Except
for these points, the process is in good control. Figure 11.6 shows the control chart
using two sigma limits; none of the points, other than those for samples 12 and 21,
are out of control.
Figure 11.6 np chart for the defects data with two sigma limits: center line 4.3, +2SL = 8.26, −2SL = 0.34; samples 12 and 21 are flagged.
p Chart
Due to cost and customer satisfaction, the proportion of the production that is defective is
also of great importance. Denote this random variable by p_s; we know that p_s = X/n.
Since the random variable X is the number of parts showing defects in our example,
X is binomial, and sample size is n, it follows that
$$E(p_s) = E\left(\frac{X}{n}\right) = \frac{E(X)}{n} = \frac{np}{n} = p$$
and
$$\operatorname{Var}(p_s) = \operatorname{Var}\left(\frac{X}{n}\right) = \frac{\operatorname{Var}(X)}{n^2} = \frac{np(1-p)}{n^2} = \frac{p(1-p)}{n}$$
We see that control limits, using three standard deviations, are
$$\mathrm{LCL} = p - 3\sqrt{\frac{p(1-p)}{n}} \quad\text{and}\quad \mathrm{UCL} = p + 3\sqrt{\frac{p(1-p)}{n}}$$
but, of course, we do not know the value for p.
A reasonable estimate for p would be the overall proportion defective considering
all the samples. This is the estimate used in the previous section, that is, 0.086. This
gives control limits as
$$\mathrm{LCL} = 0.086 - 3\sqrt{\frac{0.086(1-0.086)}{50}} = -0.032948$$
and
$$\mathrm{UCL} = 0.086 + 3\sqrt{\frac{0.086(1-0.086)}{50}} = 0.2049$$
Zero is used for the lower control limit. The resulting control chart is shown in
Figure 11.7.
Figure 11.7 p chart for the defects data: center line p̄ = 0.086, UCL = 0.2049, LCL = 0; samples 12 and 21 are flagged.
This chart gives exactly the same information as the np chart with three sigma limits shown in Figure 11.5.
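The np and p chart limits are straightforward to compute from the defect counts. A minimal Python sketch, assuming NumPy and using the totals from Table 11.4:

```python
import numpy as np

total_defects, samples, n = 129, 30, 50
xbar = total_defects / samples                 # mean defects per sample, 4.30
p_hat = total_defects / (samples * n)          # overall proportion defective, 0.086

# np chart limits (counts); a negative LCL is replaced by 0
sd_np = np.sqrt(n * p_hat * (1 - p_hat))
print("np chart:", max(0, xbar - 3 * sd_np), xbar + 3 * sd_np)   # 0 and about 10.25

# p chart limits (proportions)
sd_p = np.sqrt(p_hat * (1 - p_hat) / n)
print("p chart:", max(0, p_hat - 3 * sd_p), p_hat + 3 * sd_p)    # 0 and about 0.2049
```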
SOME CHARACTERISTICS OF CONTROL CHARTS
EXAMPLE 11.3 Control Limits
Examine the control chart shown in Figure 11.8. Ignore the numbers 1, 2, and 6 on the chart
for the moment, since they will be explained later, but notice that one, two, and three sigma
units are displayed.
Figure 11.8 Control chart for Example 11.3, with center line x̄ = 42.12 and one, two, and three sigma limits (±1SL = 39.87, 44.38; ±2SL = 37.61, 46.64; ±3SL = 35.35, 48.89); points failing tests 1, 2, and 6 are marked with the test number.
This control chart was constructed by selecting 50 samples of size 5 from a normal
distribution with mean 40 and standard deviation 5; the next 50 samples of size 5 were chosen
from a normal distribution with mean 43.6 and standard deviation 5. The control chart was
constructed by taking the mean of the samples. It is apparent from the control chart that
something occurred at or around the 50th sample, as indeed it did. But despite the fact that the
mean had increased by 3.6, or 72% of the standard deviation, the chart shows only a single point
outside three sigma limits, namely, the 83rd, although the 71st and 26th samples are close to
the control line. After the 60th sample, the observations become well behaved again, although
it is apparent that the mean has increased. The change here was a relatively large one, but using
the three sigma criterion alone, we might not be alarmed.
᭿
The control charts we have considered offer great insight into production pro-
cesses since they track the production process in a periodic manner. They have some
disadvantages as well; many are slow to discover a change in the production process,
especially when the process mean changes by a small proportion of the standard de-
viation, or when the standard deviation itself changes. In the above example, it is
difficult to detect even a large change in the process mean with any certainty. It would
be very desirable to detect even small changes in the process mean, and this can be
done, but it may take many observations to do so. In the meantime, the process may
be essentially out of control without the knowledge of the operators. For this reason,
additional tests are performed on the data. We describe some of these now.
SOME ADDITIONAL TESTS FOR CONTROL CHARTS
First, consider the probability of a false reading, that is, an observation falling beyond
the three sigma limits entirely due to chance rather than to a real change in the process.
Assuming that the observations are normally distributed, an assumption justified for
the sample mean if the sample size is moderate, then the probability of an observation
outside the three sigma limits is 0.0027. Such observations are then very rare and
when one occurs, we are unlikely to conclude that the observations occurred by
chance alone. However, if 50 observations are made, the probability that at least one
of them is outside the three sigma limits is 1 − (1 − 0.0027)⁵⁰ = 0.12644. This
probability increases rapidly, as the data in Table 11.5 indicate. Here n represents the
number of observations.
Table 11.5
n Probability
50 0.126444
100 0.236899
200 0.417677
300 0.555629
400 0.660900
500 0.741233
600 0.802534
700 0.849314
800 0.885011
900 0.912252
1000 0.933039
So an extreme observation becomes almost certain as production continues, al-
though the process in reality has not changed.
Minitab offers eight additional tests on the data; we will describe four of them
here, namely, tests 1, 2, 5, and 6. When these tests become significant, Minitab in-
dicates this by putting the appropriate number of the test on the control chart. This
explains those numbers on the chart in Figure 11.8.
In general, since the probabilities of these events are quite small, the following
situations are to be regarded as cautionary flags for the production process. The
calculation of the probabilities involved in most of these tests relies upon the binomial
or normal probability distributions. The default values of the constant k in each of
these tests can be changed easily in Minitab.
Test 1. Points more than k sigma units from the centerline.
We have used k = 3, but Table 11.6 shows probabilities with which a single point
is more than k standard deviations from the target mean.
In Figure 11.8, while only one point is beyond the three sigma limit, several are
beyond the two sigma limits. These points are indicated by a symbol on the
control chart.
Table 11.6
k Probability
1.0 0.317311
1.5 0.133614
2.0 0.045500
2.5 0.012419
3.0 0.002700
Test 2. k points in a row on the same side of the centerline.
It is common to use k = 9. The probability that nine points in a row are on the same
side of the centerline is 2(1/2)⁹ = 0.00390625. Table 11.7 shows this probability for
k = 7, 8, . . . , 11.
Table 11.7
k Probability
7 0.01563
8 0.00781
9 0.00391
10 0.00195
11 0.00098
This test fails at samples 13, 14, 16, 17, 18, 28, 60, 61, and 91.
Test 5. At least k out of k +1 points in a row more than two sigmas from the
centerline.
The quantity k is commonly chosen as 2. Since the probability that one point
is more than two sigmas above the centerline is 0.0227501 and since the number of
observations outside these limits is a binomial random variable, the probability that
at least two out of three observations are more than two sigmas from the centerline
and either above or below the centerline is
$$2\sum_{x=2}^{3}\binom{3}{x}(0.0227501)^x(0.9772499)^{3-x} = 0.003035$$
Table 11.8 gives values of this probability for other values of k.
This event occurred at observation 44 in Figure 11.8.
Table 11.8
k Probability
2 0.003035
3 0.000092
4 0.000003
Test 6. At least k out of k + 1 points in a row more than one sigma from the centerline.
The value of k is commonly chosen as 4. Since the probability that one point is
more than one sigma above or below the centerline is 0.158655 and since the number
of observations outside these limits is a binomial random variable, the probability that
at least four out of five observations are more than one sigma from the centerline is
$$2\sum_{x=4}^{5}\binom{5}{x}(0.158655)^x(1-0.158655)^{5-x} = 0.00553181$$
Table 11.9 gives values of this probability for other values of k.
Table 11.9
k Probability
2 0.135054
3 0.028147
4 0.005532
This test failed at samples 8, 9, and 11 in our data set.
Computers allow us to calculate the probabilities for each of the tests with relative
ease. One could even approximate the probability for the test in question and find a
value for k. In test 2, for example, where we seek sequences of points that are on
the same side of the centerline, if we wanted a probability of approximately 0.02,
Table 11.7 indicates that a run of seven points is sufficient.
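These test probabilities are easy to tabulate. A minimal Python sketch, assuming SciPy, reproduces the flavor of Tables 11.6–11.9; small differences from the printed values reflect rounding in the text.

```python
from scipy.stats import norm, binom

# Test 1: a single point more than k sigmas from the centerline
for k in (1.0, 1.5, 2.0, 2.5, 3.0):
    print("test 1, k =", k, round(2 * norm.sf(k), 6))

# Test 2: k points in a row on the same side of the centerline
for k in range(7, 12):
    print("test 2, k =", k, round(2 * 0.5**k, 5))

# Tests 5 and 6: at least k of k + 1 points beyond m sigmas (same side)
def run_test(k, m):
    p = norm.sf(m)                          # P(one point beyond m sigmas on one side)
    return 2 * binom.sf(k - 1, k + 1, p)    # P(at least k of k + 1), either side

print("test 5, k = 2:", round(run_test(2, 2), 6))
print("test 6, k = 4:", round(run_test(4, 1), 6))
```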
The additional tests provide some increased sensitivity for the control chart, but
they decrease its simplicity. Simplicity is a desirable feature when production workers
monitor the production process.
CONCLUSIONS
We have explored here only a few of the ideas that make statistical analysis, and
statistical process control in particular, an important part of manufacturing. This topic also shows
the power of the computer and especially computer programs dedicated to statistical
analysis such as Minitab.
EXPLORATIONS
1. Create a data set similar to that given in Table 11.1. Select samples of size 6
from a N(25, 3) distribution.
(a) Find the mean, standard deviation, and the range for each sample.
(b) Use both the sample standard deviations and then the range to estimate σ
and show the resulting control charts.
Explorations 169
(c) Carry out the four tests given in the section “Some Additional Tests for
Control Charts” and discuss the results.
2. Generate 50 samples of size 40 from a binomial random variable with the
probability of a defective item, p = 0.05. For each sample, show the number
of defective items.
(a) Construct an np control chart and discuss the results.
(b) Construct a p chart and discuss the results.
Chapter 12
Nonparametric Methods
CHAPTER OBJECTIVES:
• to learn about hypothesis tests that are not dependent on the parameters of a probability distribution
• to learn about the median and other order statistics
• to use the median in testing hypotheses
• to use runs of successes or failures in hypothesis testing.
INTRODUCTION
The Colorado Rockies National League baseball team early in the 2008 season lost
seven games in a row. Is this unusual or would we expect a sequence of seven wins
or losses in a row sometime in the regular season of 162 baseball games?
We will answer this question and others related to sequences of successes or
failures in this chapter.
We begin with a nonparametric test comparing two groups of teachers. In general,
nonparametric statistical methods refer to statistical methods that do not depend upon
the parameters of a distribution, such as the mean and the variance (both of which occur
in the definition of the normal distribution, for example), or on the distribution itself.
We consider two groups of teachers, each one of whom was being considered
for an award.
THE RANK SUM TEST
EXAMPLE 12.1 Two Groups of Teachers
Twenty-six teachers, from two different schools, were under consideration for a prestigious
award (one from each school). Each teacher was ranked by a committee. The results are shown
Table 12.1
Teacher Score Rank
A∗ 34.5 1
B∗ 32.5 2
C∗ 29.5 3.5
D 29.5 3.5
E∗ 28.6 5
F 27.5 6
G∗ 25.0 7
H 24.5 8
I 21.0 9
J 20.5 10
K 19.6 11
L 19.0 12
M 17.0 13.5
N 17.0 13.5
O∗ 16.5 15
P∗ 15.0 16
Q 14.0 17
R∗ 13.0 19
S 13.0 19
T 13.0 19
U 12.5 21
V 9.5 22
W 9.5 23
X 8.5 24
Y 8.0 25
Z 7.0 26
in Table 12.1. Names have been omitted (since this is a real case) and have been replaced by
letters. The scores are shown in decreasing order.
The stars (*) after the teachers’ names indicate that they came from School I, while those
teachers without stars came from School II.
We notice that five of the teachers from School I are clustered near the top of the list.
Could this have occurred by chance or is there a real difference between the teachers at
the two schools?
The starred group has 8 teachers, while the unstarred group has 18 teachers. Certainly, the
inequality of the sample sizes makes some difference. The usual parametric test, say, on the
equality of the means of the two groups is highly dependent upon the samples being selected
from normal populations and needs some assumptions concerning the population variances in
addition. In this case, these are dubious assumptions to say the least.
᭿
So how can we proceed?
Here is one possible procedure called the rank sum test.
To carry this test out, we first rank the teachers in order, from 1 to 26. These
rankings are shown in the right-hand column of Table 12.1. Note that teachers C and
D each have scores of 29.5, so instead of ranking them as 3 and 4, each is given
rank 3.5, the arithmetic mean of the ranks 3 and 4. There are two other instances
where the scores are tied, and we have followed the same procedure with those
ranks.
Now the ranks of each group are added up; the starred group has a rank sum then
of
1 + 2 + 3 + 5 + 7 + 15 + 16 + 18.5 = 67.5
The sum of all the ranks is
1 + 2 + 3 + · · · + 26 = (26 · 27)/2 = 351
so the rank sum for the unstarred teacher group must be 351 − 67.5 = 283.5.
There is a considerable discrepancy in these rank sums but, on the other hand, the
sample sizes are quite disparate, and so some difference is to be expected. How much
of a difference is needed then before we conclude the difference to be statistically
significant?
Were we to consider all the possible rankings of the group of eight teachers, say,
we might then be able to conclude whether or not a sum of 67.5 was statistically
significant. It turns out that this is possible. Here is how we can do this.
Suppose then that we examine all the possible rank sums for our eight teachers.
These teachers can then be ranked in $\binom{26}{8} = 1{,}562{,}275$ ways. It turns out that there are
only 145 different rank sums, however. Smaller samples provide interesting classroom
examples and can be carried out by hand. In our case, we need a computer to deal
with this.
Here is the set of all possible rank sums. Note that the eight teachers could have
rank sums ranging from 1 + 2 + 3 + · · · + 8 = 36 to 19 + 20 + 21 + · · · + 26 = 180.
A table of all the possible values of rank sums is given in Table 12.2.
Table 12.2
{36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,
61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,
87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,
109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,
128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,
147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,
166,167,168,169,170,171,172,173,174,175,176,177,178,179,180}
These rank sums, however, do not occur with equal frequency. Table 12.3 shows
these frequencies.
Our eight teachers had a rank sum of 67.5. We can then add up the frequencies of the
rank sums that are at most 67.5. This is 1 + 1 + 2 + 3 + 5 + 7 + · · · + 2611 = 17244.
So we would conclude that the probability that the rank sum is at most 67.5 is
17244/1562275 = 0.01103775, so a rank sum as small as the one we observed is
very rare.
Table 12.3
{1,1,2,3,5,7,11,15,22,29,40,52,70,89,116,146,186,230,288,351,432,521,631,752,
900,1060,1252,1461,1707,1972,2281,2611,2991,3395,3853,4338,4883,5453,6087,
6748,7474,8224,9042,9879,10783,11702,12683,13672,14721,15765,16862,17946,
19072,20171,21304,22394,23507,24563,25627,26620,27611,28512,29399,30186,30945,
31590,32200,32684,33125,33434,33692,33814,33885,33814,33692,33434,33125,32684,
32200,31590,30945,30186,29399,28512,27611,26620,25627,24563,23507,22394,21304,
20171,19072,17946,16862,15765,14721,13672,12683,11702,10783,9879,9042,8224,
7474,6748,6087,5453,4883,4338,3853,3395,2991,2611,2281,1972,1707,1461,1252,
1060,900,752,631,521,432,351,288,230,186,146,116,89,70,52,40,29,22,15,11,7,5,
3,2,1,1}
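The frequencies in Table 12.3 and the tail probability above can be reproduced without listing all 1,562,275 samples. One way (a sketch in Python; the array names are ours) is a dynamic-programming count over the ranks, treating the observed rank sum of 67.5 by counting all integer rank sums of 67 or less.

```python
# Count, for each possible rank sum, the number of ways to choose 8 of the ranks 1..26.
# ways[c][s] = number of c-subsets of the ranks considered so far whose sum is s.
N, K = 26, 8
max_sum = sum(range(N - K + 1, N + 1))          # 19 + 20 + ... + 26 = 180
ways = [[0] * (max_sum + 1) for _ in range(K + 1)]
ways[0][0] = 1
for rank in range(1, N + 1):
    for c in range(min(rank, K), 0, -1):        # descend so each rank is used at most once
        for s in range(max_sum, rank - 1, -1):
            ways[c][s] += ways[c - 1][s - rank]

dist = ways[K]                                   # dist[s] = frequency of rank sum s
total = sum(dist)                                # 1,562,275
tail = sum(dist[s] for s in range(36, 68))       # rank sums of at most 67.5
print(total, tail, tail / total)                 # 1562275 17244 0.01103...
```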
It is interesting to see a graph of all the possible rank sums, as shown in
Figure 12.1.
[Figure 12.1: frequencies of the possible rank sums; horizontal axis "Rank sums" (36 to 180, with 67.5 marked), vertical axis "Frequency".]
Our sum of 67.5 then lies at the far left-hand end of this obviously normal-looking curve,
a perhaps surprising appearance of the normal curve. It arises in many places
in probability where one would not usually expect it! Teachers A and D won the
awards.
We now turn to some other nonparametric tests and procedures.
ORDER STATISTICS
EXAMPLE 12.2 Defects in Manufactured Products
The number of defects in several samples of a manufactured product was observed to be
1, 6, 5, 5, 4, 3, 2, 2, 4, 6, 7
The mean number of defects is then
(1 + 6 + 5 + 5 + 4 + 3 + 2 + 2 + 4 + 6 + 7)/11 = 4.09
■
This mean value is quite sensitive to each of the sample values. For example, if
the last observation had been 16 rather than 7, the mean value becomes 4.91, almost
a whole unit larger due to this single observation.
So we seek a measure of central tendency in the data that is not so sensitive.
Median
Consider then arranging the data in order of magnitude so that the data become
1, 2, 2, 3, 4, 4, 5, 5, 6, 6, 7
The median is the value in the middle when the data are arranged in order (or
the mean of the two middlemost values if the number of observations is even). Here
that value is 4. Now if the final observation becomes 17, for example (or any other
value larger than 6), the median remains at 4.
The median is an example of an order statistic—a value determined by the data when
the data are arranged in order of magnitude. The smallest value, the minimum, is an
order statistic, as is the largest value, the maximum.
While there is no central limit theorem as there is for the mean, we explore now
the probability distribution of the median.
EXAMPLE 12.3 Samples and the Median
Samples of size 3 are selected, without replacement, from the set 1,2,3,4,5,6,7.
There are then $\binom{7}{3} = 35$ possible samples. These are shown in Table 12.4, where the median
has been calculated for each sample. The samples are arranged in order. Were the samples not
Table 12.4
Sample Median Sample Median
1,2,3 2 2,3,6 3
1,2,4 2 2,3,7 3
1,2,5 2 2,4,5 4
1,2,6 2 2,4,6 4
1,2,7 2 2,4,7 4
1,3,4 3 2,5,6 5
1,3,5 3 2,5,7 5
1,3,6 3 2,6,7 6
1,3,7 3 3,4,5 4
1,4,5 4 3,4,6 4
1,4,6 4 3,4,7 4
1,4,7 4 3,5,6 5
1,5,6 5 3,5,7 5
1,5,7 5 3,6,7 6
1,6,7 6 4,5,6 5
2,3,4 3 4,5,7 5
2,3,5 3 4,6,7 6
5,6,7 6
arranged in order, each sample would produce 3! = 6 samples, each with the same median, so
we will consider only the ordered samples.
This produces the probability distribution function for the median as shown in Table 12.5.
Table 12.5
Median Probability
2 5/35
3 8/35
4 9/35
5 8/35
6 5/35
■
The probabilities are easy to find. If the median is k, then one observation must
be chosen from the k − 1 integers less than k and the remaining observation must be
selected from the 7 −k integers greater than k, so the probability the median is k then
becomes
P(median = k) = \frac{\binom{k-1}{1}\binom{7-k}{1}}{\binom{7}{3}}, \qquad k = 2, 3, 4, 5, 6
The expected value of the median is then
2 · 5/35 + 3 · 8/35 + 4 · 9/35 + 5 · 8/35 + 6 · 5/35 = 4
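Table 12.5 and this expected value are easy to reproduce by brute force; here is a brief sketch (Python is our choice, and the names are ours).

```python
from itertools import combinations
from collections import Counter

# enumerate all 35 samples of size 3 from {1, ..., 7} and record each median
counts = Counter(sorted(sample)[1] for sample in combinations(range(1, 8), 3))
total = sum(counts.values())                        # 35 samples in all
for m in sorted(counts):
    print(m, counts[m], "/", total)                 # 2:5, 3:8, 4:9, 5:8, 6:5 out of 35
print("expected median =", sum(m * c for m, c in counts.items()) / total)   # 4.0
```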
A graph of this probability distribution is shown in Figure 12.2.
[Figure 12.2: the probability distribution of the median (medians 2 through 6 on the horizontal axis, probability on the vertical axis).]
This does not show much pattern. In fact, the points here lie on the parabola
y = (x − 1)(7 − x)/35
We need a much larger population and sample to show some characteristics of the
distribution of the median, and we need a computer for this. The only way for us
to show the distribution of the median is to select larger samples than in the above
example and to do so from larger populations.
Figure 12.3 shows the distribution of the median when samples of size 5 are
selected from 1, 2, 3, . . . , 100.
The points on the graph in Figure 12.3 are actually proportional to the fourth-degree
polynomial y = (x − 1)(x − 2)(100 − x)(99 − x).
[Figure 12.3: the probability distribution of the median for samples of size 5 from 1, 2, . . . , 100 (median on the horizontal axis, probability on the vertical axis).]
Finally, in Figure 12.4 we show the median in samples of size 7 from the popu-
lation 1, 2, 3, . . . , 100.
[Figure 12.4: the probability distribution of the median for samples of size 7 from 1, 2, . . . , 100 (median on the horizontal axis, probability on the vertical axis).]
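The same counting argument that gave the formula above yields the exact distribution for these larger cases: for samples of size 7 from 1, 2, . . . , 100, P(median = k) = $\binom{k-1}{3}\binom{100-k}{3}/\binom{100}{7}$. A few lines of code (a sketch, independent of any particular statistics package) trace out the bell-shaped curves of Figures 12.3 and 12.4.

```python
from math import comb

N, n = 100, 7                      # population 1..N, odd sample size n
half = (n - 1) // 2                # observations on each side of the median
probs = {k: comb(k - 1, half) * comb(N - k, half) / comb(N, n)
         for k in range(half + 1, N - half + 1)}
print(sum(probs.values()))         # 1.0, a check that the probabilities add to one
print(max(probs, key=probs.get))   # the most probable median, near the middle of 1..100
```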
There is now little doubt that the median becomes normally distributed. We will
not pursue the theory of this but instead consider another order statistic, the maximum.
Maximum
The maximum value in a data set as well as the minimum, the smallest value in the
data set, is of great value in statistical quality control where the range, the differ-
ence between the maximum and the minimum, is easily computed on the produc-
tion floor from a sample. The range was used in Chapter 11 on statistical process
control.
EXAMPLE 12.4 An Automobile Race
We return now to two examples that we introduced in Chapter 4.
A latecomer to an automobile race observes cars numbered 6,17, and 45. Assuming the
cars to be numbered from 1 to n, what is n, that is, how many cars are in the race?
The question may appear to be a trivial one, but it is not. In World War II, the Germans
numbered their tanks in the field. When some tanks were captured and their numbers noted,
we were able to estimate the total number of tanks they had.
Clearly, we should use the sample in some way in estimating the value for n.
We might use the median that we have been discussing or the mean for the samples. If
we refer back to Example 12.3, we saw that the expected value of the median is 4, not a very
accurate estimate of the maximum, 7.
Table 12.6 below shows the probability distribution for the mean of the samples in
Example 12.3.
Table 12.6
Mean Frequency
2 1
7/3 1
8/3 2
3 3
10/3 4
11/3 4
4 5
13/3 4
14/3 4
5 4
16/3 3
17/3 1
6 1
So the expected value of the mean is
(1/35)[2 · 1 + (7/3) · 1 + (8/3) · 2 + 3 · 3 + (10/3) · 4 + (11/3) · 4 + 4 · 5 + (13/3) · 4
+ (14/3) · 4 + 5 · 3 + (16/3) · 2 + (17/3) · 1 + 6 · 1] = 4
■
This of course was to be expected from our previous experience with the sample
mean and the central limit theorem.
Both the median and the mean could be expected, without surprise, to be estima-
tors of the population mean, or a measure of central tendency rather than estimators
of the maximum of the population, which is what we seek.
It is intuitively clear that if the race cars we saw were numbered 6, 17 and 45,
then there are at least 45 cars in the race. But how many more?
Let us look at the probability distribution of the maximum of each sample in
Example 12.3. If we do that, we find the probability distribution shown in Table 12.7.
Table 12.7
Maximum Frequency
3 1
4 3
5 6
6 10
7 15
We find the expected value of the maximum to be
(1/35)[3 · 1 + 4 · 3 + 5 · 6 + 6 · 10 + 7 · 15] = 6
This is better than the previous estimators, but it cannot be the best we can do. For
one thing, the maximum only achieves the largest value in the population, 7, with
probability 15/35, so with probability 20/35 the maximum of the sample is less
than the maximum of the population; and yet the maximum of the sample must carry more
information with it than the other two statistics.
We are left with the question, “How should the maximum of the sample be used
in estimating the maximum of the population?”
In Figure 12.5, the numbers observed on the cars are shown on a number line.
[Figure 12.5: a number line showing 1, the observed car numbers 6, 17, and 45, and the unknown maximum n.]
Now we observe that while we saw three cars, we know that cars 1, 2, . . . , 5
(a total of 5 cars) must be on the track, as are cars numbered 7, 8, . . . , 16 (a total of
10 cars), as well as cars numbered 18, 19, . . . , 44 (a total of 27 cars), so there are
5 + 10 + 27 = 42 cars that we did not observe.
The three cars we did observe then divide the number line into three parts, which
we will call gaps. The average size of the gaps is then 42/3 = 14. Since this is the
average gap, it would appear sensible that this is also the gap between the sample
maximum, 45, and the unknown population maximum, n. Adding the average gap to
the sample maximum gives 45 + 14 = 59 as our estimate for n.
Before investigating the properties of this estimator, let us note a simple fact. We
made heavy use of the numbers 6 and 17 above, but the sizes of these values turn out
to be irrelevant. Now suppose that those cars were numbered 14 and 36. The average
gap would then be (13 + 21 + 8)/3 = 42/3 = 14, the same average gap we found
before. In fact, the average gap will be 14 regardless of the size of the two numbers
as long as they are less than 45. The reason for this is quite simple: since we observed
car number 45, and since we observed 3 cars in total, there must be 42 cars we did
not see, giving the average gap as 14. One can also see this by moving the two
numbers less than 45 back and forth along the number line in Figure 12.5.
So our estimate for n is 45 + (45 − 3)/3 = 59.
Now to generalize this, keeping our sample size at 3 and supposing the largest
observation in the sample is m, our estimate then becomes
m + \frac{m - 3}{3} = \frac{4m - 3}{3}
Let us see how this works for the samples in Example 12.3. Table 12.8 shows the
maximum of each sample and our gap estimator.
Table 12.8
Maximum Gap estimator Frequency
3 3 1
4 13/3 3
5 17/3 6
6 7 10
7 25/3 15
This gives the expected value of the gap estimator as
(1/35) · (3 · 1 + (13/3) · 3 + (17/3) · 6 + 7 · 10 + (25/3) · 15) = 7
the result we desired.
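A short enumeration confirms this expected value; the sketch below (in Python, with names of our choosing) applies the gap estimator to all 35 samples of Example 12.3.

```python
from itertools import combinations
from fractions import Fraction

M, k = 7, 3
samples = list(combinations(range(1, M + 1), k))
# gap estimator: the sample maximum plus the average gap, m + (m - k)/k
estimates = [Fraction(max(s)) + Fraction(max(s) - k, k) for s in samples]
expected = sum(estimates, Fraction(0)) / len(samples)
print(expected)            # 7, the population maximum, so the estimator is unbiased here
```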
In fact, this estimator can be shown to be unbiased (that is, its expected value
is the value to be estimated) regardless of the sample size. Suppose our sample is of
size k and is selected from the population 1, 2, . . . , M and that the sample maximum
is m. Then it can be shown, but not easily, that the expected value for m is not quite
M. In fact,
E[m] = \sum_{m=k}^{M} m \, \frac{\binom{m-1}{k-1}}{\binom{M}{k}} = \frac{k(M + 1)}{k + 1}
So,
E\left[\frac{(k + 1)m - k}{k}\right] = M
We still have a slight problem. In our example, for some samples, the gap method
asked us to estimate n as 17/3, obviously not an integer. In that case, we would
probably select the integer nearest to 17/3, or 6.
If we do this, our estimator is now distributed as shown in Table 12.9.
We have denoted by square brackets the nearest integer function. The expected
value of this estimator is then
(1/35) · (3 · 1 + 4 · 3 + 6 · 6 + 7 · 10 + 8 · 15) = 6.89
Table 12.9
Maximum [Gap estimation] Frequency
3 3 1
4 4 3
5 6 6
6 7 10
7 8 15
Now if we use the nearest integer function here, we estimate M as 7 again. This
estimator appears to be close to unbiased, but that is not at all easy to show.
This method was actually used in World War II and proved to be surprisingly
accurate in solving the German Tank problem.
Runs
Finally, in this chapter, we consider runs of luck (or runs of ill-fortune).
EXAMPLE 12.5 Losing Streaks in Baseball
The Colorado Rockies, a National League baseball team, lost seven games in a row early in
the 2008 season. Is this unusual or can a sequence of seven losses (or wins) be expected during
a regular season of 162 games?
A run is a series of like events in a sequence of events. For example, a computer program
(this can also be done with a spreadsheet) gave the following sequence of 20 1’s and 0’s:
{0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0}
The sequence starts with a run of two zeros, followed by six ones, three zeros, and so on.
There are nine runs in all and their lengths are given in Table 12.10.
The mean run length is then
(1/9)(1 · 3 + 2 · 4 + 3 · 1 + 4 · 0 + 5 · 0 + 6 · 1) = 2.22
Table 12.10
Length Frequency
1 3
2 4
3 1
4 0
5 0
6 1
It is not at all clear from this small example that this is a typical situation at all. So we
simulated 1000 sequences of 20 ones and zeros and counted the total number of runs and their
lengths.
To write a computer program to do this, it is easiest to count the total number of runs first. To
do this, scan the sequence from left to right, comparing adjacent entries. When they differ, add
one to the number of runs, continuing until the sequence is completely scanned. For example, in the sequence
{0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0}
we find the first time adjacent entries are not equal occurs at the third entry, then at the ninth
entry, then at the thirteenth entry, and so on. This count is one less than the total number of runs,
since the first run of two zeros is not counted.
To find the lengths of the individual runs, start with a vector of ones of length the total
number of runs. In this example, we begin with the vector
{1, 1, 1, 1, 1, 1, 1, 1, 1}
Scan the sequence again, adding one to the current entry of this vector when adjacent entries
are alike, and moving on to the next entry of the vector when adjacent entries differ, continuing
in this way through the vector of ones.
In our example, this will produce the vector {2, 6, 3, 2, 2, 1, 2, 1, 1}. (The sum of the
entries must be 20.)
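The scanning procedure just described might be coded as follows (a sketch; the function name is ours).

```python
def run_lengths(seq):
    """Return the lengths of the runs in seq, scanning from left to right."""
    lengths = [1]                      # the first entry starts the first run
    for prev, cur in zip(seq, seq[1:]):
        if cur == prev:
            lengths[-1] += 1           # same symbol: the current run grows
        else:
            lengths.append(1)          # a change: a new run begins
    return lengths

seq = [0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0]
lengths = run_lengths(seq)
print(len(lengths), lengths)           # 9 runs: [2, 6, 3, 2, 2, 1, 2, 1, 1]
```

The same function, applied to a simulated season of 162 wins and losses, gives the run counts used in the simulations below.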
Figure 12.6 shows a bar chart of the number of runs from the 1000 samples.
[Figure 12.6: bar chart of the number of runs in the 1000 simulated sequences (number of runs on the horizontal axis, frequency on the vertical axis).]
Now we return to our baseball example where the Colorado Rockies lost seven games in
a row in the 2008 regular baseball season.
We simulated 300 seasons, each with 162 games. Here is a typical simulated year where
0 represents a loss and 1 represents a win.
{0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1,
1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0,
0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0,
1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1,
1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1,
1, 0, 1, 0, 0, 1, 1}
We then counted the number of runs in the 300 samples. Table 12.11 shows the number
of runs and Figure 12.7 shows a graph of these results.
Table 12.11
{74, 88, 89, 85, 79, 92, 81, 82, 78, 83, 86, 86, 83, 73, 78, 73, 86, 75, 83,
79, 87, 81, 86, 91, 81, 74, 80, 83, 82, 83, 76, 84, 76, 84, 79, 78, 71, 72,
86, 82, 87, 87, 90, 91, 80, 75, 87, 72, 73, 84, 84, 91, 81, 83, 73, 82, 86,
80, 82, 88, 86, 84, 84, 76, 81, 85, 79, 69, 72, 85, 81, 80, 82, 85, 88, 77,
86, 68, 84, 76, 82, 75, 80, 85, 80, 84, 84, 76, 87, 89, 94, 84, 84, 81, 67,
92, 78, 75, 78, 68, 92, 73, 75, 79, 79, 83, 84, 77, 83, 85, 80, 75, 82, 83,
84, 76, 93, 94, 72, 73, 90, 89, 91, 83, 99, 76, 91, 87, 93, 88, 86, 96, 74,
78, 83, 75, 80, 80, 78, 73, 88, 76, 76, 85, 84, 91, 75, 74, 89, 82, 86, 83,
87, 85, 84, 86, 72, 83, 80, 81, 74, 82, 85, 85, 81, 71, 78, 81, 72, 85, 80,
87, 83, 78, 75, 79, 83, 79, 76, 79, 76, 94, 85, 80, 87, 70, 81, 82, 84, 90,
92, 85, 66, 73, 81, 74, 87, 90, 77, 77}
The mean number of runs is 81.595 with standard deviation 6.1587.
The simulation showed 135 + 49 + 26 + 15 + 55 + 5 + 3 + 2 + 1 = 291 samples having winning or losing streaks of 7 games or more out of a total of 16,319 runs, or a probability
of 0.017832.
[Figure 12.7: the number of runs per simulated season (horizontal axis "Runs", vertical axis "Frequency").]
Table 12.12 shows the frequency with which runs of a given length occurred, and
Figure 12.8 is a graph of these frequencies.
This example shows the power of the computer in investigating the theory of runs. The
mathematical theory is beyond our scope here, but there are some interesting theoretical results
we can find using some simple combinatorial ideas.
■
Some Theory of Runs
Consider, for a moment, the arrangement below of ∗’s contained in cells limited by
bars (|’s):
|| ∗∗ || ∗ ∗ ∗ ||| ∗ | ∗ ∗ ∗∗ |
There are eight cells, only four of which are occupied. There are nine bars in
total, but we need one at each end of the sequence to define the first and last cell.
Consider the bars to be all alike and the stars to be all alike.
[Figure 12.8: frequencies of the run lengths in the simulated seasons (run length on the horizontal axis, frequency on the vertical axis).]
Table 12.12
Run length Frequency
1 8201
2 4086
3 2055
4 983
5 506
6 247
7 135
8 49
9 26
10 15
11 5
12 5
13 3
14 2
15 1
How many different arrangements are there? We cannot alter the bars at the
ends, so we have 7 bars and 10 stars to arrange. This can be done in
(7 + 10)!/(7! 10!) = 19,448
different ways.
In a moment, the cells will become places into which we put symbols for runs. But,
of course, we cannot have any empty cells, so we pursue that situation next.
We have 10 stars in our example and 8 cells. If we wish to have each of the cells
occupied, we could, for example, have this arrangement.
| ∗ | ∗∗ | ∗ | ∗∗ | ∗ | ∗ | ∗ | ∗ |
There are other possibilities now. How many such arrangements are there?
First, place a star in each cell. This leaves us with 2 stars and 7 bars, which can
be arranged in (7 + 2)!/(7! 2!) = 36 different ways.
Now we generalize the situation. Suppose there are n cells and r stars. The n cells
are defined by n + 1 bars, but two of these are fixed at the ends, so we have n − 1
bars that can be arranged along with the r stars. This can be done in
\frac{(n - 1 + r)!}{(n - 1)!\, r!} = \binom{n - 1 + r}{n - 1}
different ways.
Now if each cell is to contain at least one star, place one star in each cell. This
leaves us with r − n stars to put into n cells. This can then be done in
\binom{n - 1 + r - n}{n - 1} = \binom{r - 1}{n - 1}
ways.
This formula can then be used to count the number of runs from two sets of
objects, say x’s and y’s. We must distinguish, however, between an even number of
runs and an odd number of runs.
Suppose that there are $n_x$ x’s and $n_y$ y’s and that we have an even number of runs,
so say there are 2k runs. Then there must be k runs of the x’s and k runs of the y’s.
There are $\binom{n_x + n_y}{n_x}$ different arrangements of the x’s and y’s. Let us fill the k cells
with x’s first, leaving no cell empty. This can be done in $\binom{n_x - 1}{k - 1}$ ways. Then we must
fill the k cells for the y’s so that no cell is empty. This can be done in $\binom{n_y - 1}{k - 1}$ ways.
So we can fill all the cells in $\binom{n_x - 1}{k - 1}\binom{n_y - 1}{k - 1}$ ways. This is also the number of ways we
can fill all the cells if we were to choose the y’s first.
So if R is the random variable denoting the number of runs, then
P(R = 2k) = \frac{2\binom{n_x - 1}{k - 1}\binom{n_y - 1}{k - 1}}{\binom{n_x + n_y}{n_x}}
Now consider the case where the number of runs is odd, so R = 2k + 1. This
means that the number of runs of one of the letters is k and the number of runs of the
other letter is k + 1. It follows that
P(R = 2k + 1) = \frac{\binom{n_x - 1}{k - 1}\binom{n_y - 1}{k} + \binom{n_x - 1}{k}\binom{n_y - 1}{k - 1}}{\binom{n_x + n_y}{n_x}}
We will not show it, but it can be shown that
E(R) = \frac{2 n_x n_y}{n_x + n_y} + 1

Var(R) = \frac{2 n_x n_y (2 n_x n_y - n_x - n_y)}{(n_x + n_y)^2 (n_x + n_y - 1)}
In the baseball example, assuming that $n_x = n_y = 81$, the formulas give
E(R) = 82 and
Var(R) = \frac{2 \cdot 81 \cdot 81 \cdot (2 \cdot 81 \cdot 81 - 81 - 81)}{(81 + 81)^2 (81 + 81 - 1)} = 40.248
so the standard deviation is $\sqrt{40.248} = 6.344$, results that are very close to those
found in our simulation.
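These formulas are easy to evaluate directly. Here is a hedged sketch (the function name is ours): with $n_x = n_y = 81$ it gives E(R) = 82 and Var(R) ≈ 40.25, and for the small case $n_x = 5$, $n_y = 8$ treated in the next example it reproduces the frequencies of Table 12.13 when the probabilities are multiplied by $\binom{13}{5} = 1287$.

```python
from math import comb

def runs_pmf(nx, ny):
    """P(R = r) for the number of runs in a random arrangement of nx x's and ny y's."""
    total = comb(nx + ny, nx)
    pmf = {}
    for k in range(1, min(nx, ny) + 1):
        pmf[2 * k] = 2 * comb(nx - 1, k - 1) * comb(ny - 1, k - 1) / total
        odd = comb(nx - 1, k - 1) * comb(ny - 1, k) + comb(nx - 1, k) * comb(ny - 1, k - 1)
        if odd:
            pmf[2 * k + 1] = odd / total
    return pmf

pmf = runs_pmf(81, 81)
mean = sum(r * p for r, p in pmf.items())
var = sum(r * r * p for r, p in pmf.items()) - mean ** 2
print(mean, var)                        # approximately 82 and 40.248

for r, p in sorted(runs_pmf(5, 8).items()):
    print(r, round(p * comb(13, 5)))    # the frequencies shown in Table 12.13
```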
EXAMPLE 12.6
Suppose $n_x = 5$ and $n_y = 8$. This is a fairly large example to do by hand since there are
$\binom{5+8}{5} = 1287$ different orders in which the letters could appear.
Table 12.13 shows all the probabilities of the numbers of runs multiplied by 1287, and
Figure 12.9 shows a graph of the resulting probabilities.
■
Table 12.13
Runs Frequency (probability × 1287)
2 2
3 11
4 56
5 126
6 252
7 294
8 280
9 175
10 70
11 21
[Figure 12.9: the frequencies of Table 12.13 plotted against the number of runs.]
EXAMPLE 12.7 Even and Odd Runs
And now we return to our baseball example. It is possible to separate the baseball seasons for
which the number of runs was even from those for which it was odd. Figures 12.10 and 12.11
show the respective distributions.
[Figure 12.10: distribution of the number of runs for the simulated seasons having an even number of runs.]
[Figure 12.11: distribution of the number of runs for the simulated seasons having an odd number of runs.]
It is interesting to compare some of the statistics for the years with an even number of runs
with those having an odd number of runs. There are 149 years with an even number of runs
with mean 80.9799 and standard deviation 6.0966. There are 151 years with an odd number
of runs with mean 81.5033 and standard deviation 6.4837.
■
CONCLUSIONS
In this chapter, we have introduced some nonparametric statistical methods; that is,
methods that do not make distributional assumptions. We have used the median and
the maximum in particular and have used them in statistical testing; there are many
other nonparametric tests and the reader is referred to many texts in this area. We
pursued the theory of runs and applied the results to winning and losing streaks in
baseball.
EXPLORATIONS
1. Find all the samples of size 4 chosen without replacement from the set
1, 2, 3, ..., 8. Find the probability distributions of
(a) the sample mean,
(b) the sample median,
(c) the sample maximum.
2. Choose all the possible samples of size 3 selected without replacement from
the set 1, 2, 3, . . . , 10. Use the “average gap” method for each sample to
estimate the maximum of the population and show its probability distribution.
3. The text indicates a procedure that can be used with a computer to count the
number of runs in a sample. Produce 200 samples of size 10 (using the symbols
0 and 1) and show the probability distribution of the number of runs.
4. Show all the permutations of five integers. Suppose the integers 1, 2, and 3
are from population I and the integers 4 and 5 are from population II. Find the
probability distribution of all the possible rank sums.
Chapter 13
Least Squares, Medians, and
the Indy 500
CHAPTER OBJECTIVES:
• to show two procedures for approximating bivariate data with straight lines, one of which uses medians
• to find some surprising connections between geometry and data analysis
• to find the least squares regression line without calculus
• to see an interesting use of an elliptic paraboloid
• to show how the equations of straight lines and their intersections can be used in a practical situation
• to use properties of the medians of triangles in data analysis.
INTRODUCTION
We often summarize a set of data by a single number such as the mean, median,
range, or standard deviation. We now turn our attention to
the analysis of a data set with two variables by an equation. We ask, “Can a bivariate
data set be described by an equation?” As an example, consider the following data
set that represents a very small study of blood pressure and age.
Age Blood pressure
35 114
45 124
55 143
65 158
75 166
We seek here an equation that approximates and summarizes the data. First, let
us look at a graph of the data. This is shown in Figure 13.1.
[Figure 13.1: blood pressure plotted against age for the five data points.]
It would appear that the data could be well approximated by a straight line.
We might guess at some straight lines that fit the data well. For example,
we might try the straight line y = 60 + 1.2x, where y is blood pressure and x is age.
How well does this line fit the data? Let us consider the predictions this line makes,
call them $\hat{y}_i$, and the observed values, say $y_i$. We have shown these values and the
discrepancies, $y_i - \hat{y}_i$, in Table 13.1.
Table 13.1
x   y   $\hat{y}$   $y - \hat{y}$
35 114 102 12
45 124 114 10
55 143 126 17
65 158 138 20
75 166 150 16
The discrepancies, or what are commonly called errors or residuals, happen
to be all positive in this case, but that is not always so. So how are we to measure
the adequacy of this straight line approximation or fit? Sometimes, of course, the
positive residuals will offset the negative residuals, so adding up the residuals can
be quite misleading. To avoid this complication, it is customary to square the residuals
before adding them up. If we do that in this case we get 1189, but we do not know if
that can be improved upon. So let us try some other combinations of straight lines.
First, suppose the line is of the form $y_i = \alpha + \beta x_i$. Although the details of the
calculations have not been shown, Table 13.2 shows some values for the sum of
squares, SS = $\sum_{i=1}^{5} (y_i - \hat{y}_i)^2$, and various choices for α and β.
One could continue in this way, trying various combinations of α and β until a
minimum is reached. The minimum in this case, as we shall soon see, occurs when
α = 65.1 and β = 1.38, producing a minimum sum of squares of 31.6. But trial
and error is a very inefficient way to determine the minimum sum of squares and
is feasible in this case only because the data set consists of just five data points. It is
clearly impractical, even with a computer, when we consider subsequently an
Table 13.2
α β SS
55 1 4981
60 1.2 1189
65 1.3 139.25
70 1.4 212
75 1.4 637
example consisting of all the winning speeds at the Indianapolis 500-mile race, a data
set consisting of 91 data points.
In our small case, it is possible to examine a surface showing the sum of squares
(SS) as a function of α and β. A graph of SS = $\sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$ is shown in
Figure 13.2. It can be shown that the surface is an elliptic paraboloid, that is, the
intersections of the surface with vertical planes are parabolas and the intersections of
the surface with horizontal planes are ellipses.
It is clear that SS does reach a minimum, although it is graphically difficult to
determine the exact values of α and β that produce that minimum. We now show an
[Figure 13.2: the sum of squares SS plotted as a function of α and β; the surface is an elliptic paraboloid.]
algebraic method for determining the values of α and β that minimize the sum of
squares.
LEAST SQUARES
Minimizing the sum of squares of the deviations of the predicted values from the
observed values is known as the principle of least squares.
We now show how to determine the values of α and β that minimize the sum of
squares.
Principle of Least Squares
Estimate α and β by those values that minimize SS = $\sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$.
Suppose that we have a set of data $\{x_i, y_i\}$ where i = 1, · · · , n. Our straight line
is $y_i = \alpha + \beta x_i$ and our sum of squares is SS = $\sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$. The values
that minimize this sum of squares are denoted by $\hat{\alpha}$ and $\hat{\beta}$.
The least squares line then estimates the value of y for a particular value of x
as $\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$. So the principle of least squares says that we estimate the intercept
(α) and the slope (β) of the line by those values that minimize the sum of squares of
the residuals $y_i - \alpha - \beta x_i$, the differences between the observed values, the $y_i$, and
the values predicted by the line, $\alpha + \beta x_i$.
We now look at the situation in general. We begin with
SS = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2
which can be written as
SS = \sum_{i=1}^{n} y_i^2 + n\alpha^2 + \beta^2 \sum_{i=1}^{n} x_i^2 - 2\alpha \sum_{i=1}^{n} y_i - 2\beta \sum_{i=1}^{n} x_i y_i + 2\alpha\beta \sum_{i=1}^{n} x_i
From Figure 13.2 and the above equation, we find that if we hold β fixed, the
intersection is a parabola that has a minimum value.
So we hold β fixed and write SS as a function of α alone:
SS = n\alpha^2 - 2\alpha\left(\sum_{i=1}^{n} y_i - \beta \sum_{i=1}^{n} x_i\right) + \sum_{i=1}^{n} y_i^2 + \beta^2 \sum_{i=1}^{n} x_i^2 - 2\beta \sum_{i=1}^{n} x_i y_i
Now, factoring out n, noting that $\sum_{i=1}^{n} y_i / n = \bar{y}$, with a similar
result for $\bar{x}$, and adding and subtracting $n(\bar{y} - \beta\bar{x})^2$ (thus completing the square), we
can write
SS = n[\alpha^2 - 2\alpha(\bar{y} - \beta\bar{x}) + (\bar{y} - \beta\bar{x})^2] - n(\bar{y} - \beta\bar{x})^2 + \sum_{i=1}^{n} y_i^2 + \beta^2 \sum_{i=1}^{n} x_i^2 - 2\beta \sum_{i=1}^{n} x_i y_i
or
SS = n[\alpha - (\bar{y} - \beta\bar{x})]^2 - n(\bar{y} - \beta\bar{x})^2 + \sum_{i=1}^{n} y_i^2 + \beta^2 \sum_{i=1}^{n} x_i^2 - 2\beta \sum_{i=1}^{n} x_i y_i
Now since β is fixed and n is positive, the minimum value for SS occurs when
$\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$. Note that in our example, $\bar{y} = 141$ and $\bar{x} = 55$, giving the estimate for $\hat{\alpha}$
as $141 - 55\hat{\beta}$. However, we now have a general form for our estimate of α.
Now we find an estimate for β. Hold α fixed. We can write, using the above result
for $\hat{\alpha}$,
SS = \sum_{i=1}^{n} (y_i - (\bar{y} - \beta\bar{x}) - \beta x_i)^2 = \sum_{i=1}^{n} [(y_i - \bar{y}) - \beta(x_i - \bar{x})]^2
   = \beta^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 - 2\beta \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) + \sum_{i=1}^{n} (y_i - \bar{y})^2
Now factor out $\sum_{i=1}^{n} (x_i - \bar{x})^2$ and complete the square to find
SS = \sum_{i=1}^{n} (x_i - \bar{x})^2 \left[\beta^2 - 2\beta\,\frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} + \left(\frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)^2\right] - \sum_{i=1}^{n} (x_i - \bar{x})^2 \left(\frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)^2 + \sum_{i=1}^{n} (y_i - \bar{y})^2
which can be written as
SS = \sum_{i=1}^{n} (x_i - \bar{x})^2 \left[\beta - \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right]^2 - \sum_{i=1}^{n} (x_i - \bar{x})^2 \left(\frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)^2 + \sum_{i=1}^{n} (y_i - \bar{y})^2
showing that the minimum value of SS is attained when
\hat{\beta} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
We have found that the minimum value of SS is achieved when
\hat{\beta} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
and
\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}
We have written these as estimates since these were determined using the principle
of least squares.
Expanding the expression above for $\hat{\beta}$, we find that it can be written as
\hat{\beta} = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}
In our example, we find
\hat{\beta} = \frac{5(40155) - (275)(705)}{5(16125) - (275)^2} = 1.38
and
\hat{\alpha} = 141 - 1.38(55) = 65.1
The expressions for $\hat{\alpha}$ and $\hat{\beta}$ are called least squares estimates and the straight
line they produce is called the least squares regression line.
For a given value of x, say $x_i$, the predicted value for $y_i$, $\hat{y}_i$, is
\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i
Now we have equations that can be used with any data set, although a computer
is of great value for a large data set. Many statistical computer programs (such as
Minitab) produce the least squares line from a data set.
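For readers working without such a program, a few lines of code will do the arithmetic; the sketch below (in Python, with names of our choosing) applies the formulas derived above to the blood pressure data and recovers the estimates 65.1 and 1.38.

```python
x = [35, 45, 55, 65, 75]                 # age
y = [114, 124, 143, 158, 166]            # blood pressure
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# least squares estimates from the formulas derived above
beta_hat = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
           sum((xi - xbar) ** 2 for xi in x)
alpha_hat = ybar - beta_hat * xbar
print(alpha_hat, beta_hat)               # 65.1 and 1.38
```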
INFLUENTIAL OBSERVATIONS
It turns out that the principle of least squares does not treat all the observations equally.
We investigate to see why this is so.
Since $\sum_{i=1}^{n} (x_i - \bar{x})^2$ is constant for a given data set and
\hat{\beta} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
we can write
\hat{\beta} = \sum_{i=1}^{n} \frac{(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
So, setting $a_i = (x_i - \bar{x}) \big/ \sum_{i=1}^{n}(x_i - \bar{x})^2$, our formula for $\hat{\beta}$ becomes
\hat{\beta} = \sum_{i=1}^{n} a_i (y_i - \bar{y}) \quad \text{where} \quad a_i = \frac{x_i - \bar{x}}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
Now the value of $a_i$ depends on the value of $x_i$, so it appears that $\hat{\beta}$ is a weighted
sum of the deviations $y_i - \bar{y}$. But this expression for $\hat{\beta}$ can be simplified even further.
Now $\hat{\beta} = \sum_{i=1}^{n} a_i(y_i - \bar{y})$, and this can be written as $\hat{\beta} = \sum_{i=1}^{n} a_i y_i - \bar{y}\sum_{i=1}^{n} a_i$. Now notice that
\sum_{i=1}^{n} a_i = \sum_{i=1}^{n} \frac{x_i - \bar{x}}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{1}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\sum_{i=1}^{n}(x_i - \bar{x}) = 0
since $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$.
So
\hat{\beta} = \sum_{i=1}^{n} a_i y_i \quad \text{where} \quad a_i = \frac{x_i - \bar{x}}{\sum_{i=1}^{n}(x_i - \bar{x})^2}.
This shows that the least squares estimate for the slope, $\hat{\beta}$, is a weighted average
of the y values. Notice that the values for $a_i$ depend heavily upon the value of $x_i - \bar{x}$;
the farther $x_i$ is from $\bar{x}$, the larger in absolute value is $a_i$. This is why we pointed out that
least squares does not treat all the data points equally. We also note now that if for
some point $x_i = \bar{x}$, then that point has absolutely no influence whatsoever on $\hat{\beta}$. This
is a fact to which we will return when we consider another straight line to fit a data
set, called the median–median line.
To continue with our example, we find that $\sum_{i=1}^{n} (x_i - \bar{x})^2 = 1000$, giving the
following values for $a_i$:
a_1 = (35 - 55)/1000 = -1/50
a_2 = (45 - 55)/1000 = -1/100
a_3 = (55 - 55)/1000 = 0
a_4 = (65 - 55)/1000 = 1/100
a_5 = (75 - 55)/1000 = 1/50
Note that $\sum_{i=1}^{n} a_i = 0$, as we previously noted.
So
\hat{\beta} = \sum_{i=1}^{n} a_i y_i = -\frac{1}{50}\cdot 114 - \frac{1}{100}\cdot 124 + 0\cdot 143 + \frac{1}{100}\cdot 158 + \frac{1}{50}\cdot 166 = 1.38.
Then $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} = 141 - 1.38(55) = 65.1$ as before.
We nowpresent a fairly large data set. As we shall see, unlike our little blood pres-
sure example, dealing with it presents some practical as well as some mathematical
difficulties.
THE INDY 500
We now turn to a much larger data set, namely, the winning speeds at the Indianapolis
500-mile automobile race conducted every spring. The first race occurred in 1911 and
since then it has been held every year, except 1917 and 1918 and 1942–1946 when
the race was suspended due to World War I and World War II, respectively. The data
are provided in Table 13.3.
A graph of the data, produced by the computer algebra program Mathematica, is
shown in Figure 13.3.
[Figure 13.3: the winning speeds plotted by year, 1911–2008 (year on the horizontal axis, speed on the vertical axis).]
Table 13.3
Year Speed Year Speed Year Speed
1911 74.602 1944 * 1976 148.725
1912 78.719 1945 * 1977 161.331
1913 75.933 1946 * 1978 161.363
1914 82.474 1947 116.338 1979 158.899
1915 89.840 1948 119.814 1980 142.862
1916 84.001 1949 121.327 1981 139.084
1917 * 1982 162.029
1918 * 1950 124.002 1983 162.117
1919 88.050 1951 126.244 1984 163.612
1920 88.618 1952 128.922 1985 152.982
1921 89.621 1953 128.740 1986 170.722
1922 94.484 1954 130.840 1987 162.175
1923 90.954 1955 128.213 1988 144.809
1924 98.234 1956 128.490 1989 167.581
1925 101.127 1957 135.601 1990 185.981
1926 95.904 1958 133.791 1991 176.457
1927 97.545 1959 135.857 1992 134.477
1928 99.482 1960 138.767 1993 157.207
1929 97.585 1961 139.130 1994 160.872
1930 100.448 1962 140.293 1995 153.616
1931 96.629 1963 143.137 1996 147.956
1932 104.144 1964 147.350 1997 145.827
1933 104.162 1965 150.686 1998 145.155
1934 104.863 1966 144.317 1999 153.176
1935 106.240 1967 151.207 2000 167.607
1936 109.069 1968 152.882 2001 141.574
1937 113.580 1969 156.867 2002 166.499
1938 117.200 1970 155.749 2003 156.291
1939 115.035 1971 157.735 2004 138.518
1940 114.277 1972 162.962 2005 157.603
1941 115.117 1973 159.036 2006 157.085
1942 * 1974 158.589 2007 151.774
1943 * 1975 149.213 2008 143.562
The least squares regression line is Speed = −1649.13 +0.908574 ×Year.
The calculations for determining $\hat{\alpha}$ and $\hat{\beta}$ are quite formidable for this (and
almost any other real) data set. The calculations and many of the graphs in this
chapter were made by the statistical program Minitab or the computer algebra
system Mathematica.
In Figure 13.3, one can notice the years in which the race was not held and the fact
that the data appear to be linear until 1970 or so when the winning speeds appear to
become quite variable and scattered. We will consider the data in three parts, using the
partitions the war years provide. We will also use these partitions when we consider
another line to fit to the data, called the median–median line. Some of the statistics
for the data are shown in Table 13.4.
Table 13.4
Years Mean Median Variance
1911–1916 80.93 80.60 32.22
1919–1941 101.84 100.45 80.78
1947–2008 148.48 149.95 214.63
Clearly, the speeds during 1947–2008 have not only increased the mean winning
speed but also had a large influence on the variance of the speeds. We will return to
this discussion later. The principle of least squares can be used with any data set, no
matter whether it is truly linear or not. So we begin our discussion on whether the
line fits the data well or not.
A TEST FOR LINEARITY: THE ANALYSIS OF VARIANCE
It is interesting to fit a straight line to a data set, but the line may or may not fit the
data well. One could take points, for example, on the boundary of a semicircle and
not get a fit that was at all satisfactory. So we seek a test or procedure that will give
us some information on how well the line fits the data.
Is our regression line a good approximation to the data? The answer depends on
the accuracy one desires in the line. A good idea, always, is to compare the observed
values of y with the values predicted by the line. In the following table, let $\hat{y}$ denote
the predicted value for y. If we do this for our blood pressure data set, we find the
following values.
$x_i$   $y_i$   $\hat{y}_i$
35 114 113.4
45 124 127.2
55 143 141.0
65 158 154.8
75 166 168.6
If the predicted values are sufficient for the experimenter, then the model is a
good one and no further investigation is really necessary. If the experimenter wishes
to have a measure of the adequacy of the fit of the model, then a test is necessary. We
give here what is commonly known as the analysis of variance.
First, consider what we called the total sum of squares, or SST. This is the sum
of squares of the deviations of the y values from their mean, $\bar{y}$:
SST = \sum_{i=1}^{n} (y_i - \bar{y})^2
For our data set, since $\bar{y} = 141$, this is
SST = (114-141)^2 + (124-141)^2 + (143-141)^2 + (158-141)^2 + (166-141)^2 = 1936.0
In using the principle of least squares, we considered the sum of squares of the
deviations between the observed y values and their predicted values from the line. We simply
called this quantity SS above. We usually call this the sum of squares due to error, or
SSE, since it measures the discrepancies between the observed and predicted values
of y:
SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
For our data set, this is
SSE = (114-113.4)^2 + (124-127.2)^2 + (143-141)^2 + (158-154.8)^2 + (166-168.6)^2 = 31.6
This is of course the quantity we wished to minimize when we used the principle
of least squares and this is its minimum value.
Finally, consider SSR or what we call the sum of squares due to regression. This
is the sum of squares of the deviations between the predicted values, $\hat{y}_i$, and the mean
of the y values, $\bar{y}$:
SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2
For our data set, this is
SSR = (113.4-141)^2 + (127.2-141)^2 + (141-141)^2 + (154.8-141)^2 + (168.6-141)^2 = 1904.4
We notice here that these sums of squares add together as 1904.4 +31.6 = 1936
or SST = SSR +SSE.
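Continuing the computational sketch used earlier for the least squares estimates, these three sums of squares can be checked directly (the variable names are ours).

```python
x = [35, 45, 55, 65, 75]
y = [114, 124, 143, 158, 166]
n = len(x)
ybar = sum(y) / n
alpha_hat, beta_hat = 65.1, 1.38                          # the least squares estimates found above
y_hat = [alpha_hat + beta_hat * xi for xi in x]           # predicted values

sst = sum((yi - ybar) ** 2 for yi in y)                   # 1936.0
sse = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))   # 31.6
ssr = sum((yhi - ybar) ** 2 for yhi in y_hat)             # 1904.4
print(sst, sse, ssr, ssr / sst)                           # the last value is r^2 = SSR/SST
```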
Note that SSR is a large portion of SST. This indicates that the least squares line
is a good fit for the data. Now the identity SST = SSR +SSE in this case is not a
coincidence; this is always true!
To prove this, consider the identity
y_i - \bar{y} = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})
Now square both sides and sum over all the values giving
\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + 2\sum_{i=1}^{n} (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2
Note for the least squares line,
\hat{\beta} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
and
\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}
Now $\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i = \bar{y} - \hat{\beta}\bar{x} + \hat{\beta} x_i = \bar{y} + \hat{\beta}(x_i - \bar{x})$, so $\hat{y}_i - \bar{y} = \hat{\beta}(x_i - \bar{x})$ and
the middle term above (ignoring the 2) is
\sum_{i=1}^{n} (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) = \sum_{i=1}^{n} [(y_i - \bar{y}) - \hat{\beta}(x_i - \bar{x})]\,\hat{\beta}(x_i - \bar{x})
  = \hat{\beta}\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}) - \hat{\beta}^2\sum_{i=1}^{n} (x_i - \bar{x})^2 = 0
since
\hat{\beta} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
So,
SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + 2\sum_{i=1}^{n} (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2
becomes
\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2
We could say that $\sum_{i=1}^{n} (y_i - \bar{y})^2$ represents the total sum of squares of the
observations around their mean value. We denote this by SST.
We could also say that $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ represents the sum of squares due to error
or SSE. This is the quantity that the principle of least squares seeks to minimize.
Finally, we could say that $\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$ represents the sum of squares due to
regression or SSR. So the identity above,
\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2
can be abbreviated as SST = SSE + SSR, as we have seen in our example previously.
This partition of the total sum of squares is often called an analysis of variance
partition, although it has little to do with a variance.
If the data show a strong linear relationship, we expect SSR = $\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$ to
be large since we expect the deviations $\hat{y}_i - \bar{y}$ to account for a large proportion of the total
sum of squares $\sum_{i=1}^{n} (y_i - \bar{y})^2$. On the other hand, if the data are actually not linear, we
expect the sum of squares due to error, $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, to be a large proportion of
the total sum of squares $\sum_{i=1}^{n} (y_i - \bar{y})^2$.
This suggests that we look at the ratio SSR/SST, which is called the
coefficient of determination and is denoted by $r^2$. Its square root, r, is called the
correlation coefficient.
For our data, $r^2 = 1904.4/1936 = 0.98368$; so r = 0.9918. (The positive square
root of $r^2$ is used if y increases with increasing x, while the negative square root of
$r^2$ is used if y decreases with increasing x.)
Since SSR is only a part of SST, it follows that
0 ≤ r^2 ≤ 1
so that
−1 ≤ r ≤ 1
It would be nice if this number by itself would provide an accurate test for the
adequacy of the regression, but unfortunately this is not so. There are data sets for
which r is large but the fit is poor and there are data sets for which r is small and the
fit is very good.
It is interesting to compare the prewar years (1911–1941) with the postwar years
(1947–2008). The results from the analysis of variance are provided in Table 13.5
and the least squares regression lines are provided in Table 13.6.
Table 13.5
Years        SSE    SSR    SST     r^2
1911–1941    293    3777   4020    0.9396
1947–2008    8114   4978   13092   0.3802
Table 13.6
Years Least Squares Line
1911–1941 Speed = −2357.99 +1.27454 × Year
1947–2008 Speed = −841.651 +0.500697 × Year
[Figure 13.4: residuals plotted by year (horizontal axis "Year", vertical axis "Residual").]
From the values for $r^2$, we conclude that the linear fit is very good for the years
1911–1941 and that the fit is somewhat less good for the years 1947–2008. The graph
in Figure 13.4 shows the scatter plot of all the data and the least squares line.
A Caution
Conclusions based on the value of $r^2$ alone can be very misleading. Consider the
following data set.
x 13 4 10 14 9 11
y 17.372 7.382 9.542 21.2 7.932 11.652
x 15 7 12 6 12 8 16
y 25.092 6.212 14.262 6.102 0.500 6.822 29.702
The analysis of variance gives $r^2 = 0.59$, which might be regarded as fair but
certainly not a large value. The data here have not been presented in increasing
values for x, so it is difficult to tell whether the regression is linear or not. The graph
in Figure 13.5 persuades us that a linear fit is certainly not appropriate! The lesson
here is always draw a graph!
NONLINEAR MODELS
The procedures used in simple linear regression can be applied to a variety of nonlinear
models by making appropriate transformations of the data. This could be done in the
example above, where the relationship is clearly quadratic. For example, if we wish
[Figure 13.5: the y values plotted against x for the data set above; the pattern is clearly curved rather than linear.]
to fit a model of the form $y_i = a \cdot 10^{b x_i}$ to a data set, we can take logarithms to find
$\log y_i = \log a + b x_i$.
Then our simple linear regression procedure would give us estimates of
b and log a. We would then take the antilog to estimate a. Here are some other
models, including the quadratic relationship mentioned above, and the appropriate
transformations:
1. $y_i = a \cdot 10^{b/x_i}$. Logarithms give $\log y_i = \log a + b/x_i$, so we fit our straight
line to the data set $\{1/x_i, \log y_i\}$.
2. $y_i = a \cdot x_i^{b}$. Taking logarithms gives $\log y_i = \log a + b \log x_i$, so the linear
model is fitted to the data set $\{\log x_i, \log y_i\}$.
3. $y_i = a + b \log x_i$ uses the data set $\{\log x_i, y_i\}$.
4. $y_i = 1/(a + b \cdot 10^{-x_i})$ can be transformed into $1/y_i = a + b \cdot 10^{-x_i}$. The linear
model is then fitted to the data set $\{10^{-x_i}, 1/y_i\}$.
5. The model $y_i = x_i/(a x_i - b)$ can be transformed into $1/y_i = a - b/x_i$.
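As an illustration of the first kind of transformation, the following sketch fits y = a · 10^{bx} by regressing log y on x and then taking the antilog of the fitted intercept. The data here are made up purely for illustration, and the variable names are ours.

```python
from math import log10

# hypothetical data roughly following y = a * 10**(b*x)
x = [1, 2, 3, 4, 5]
y = [2.1, 4.0, 7.9, 16.2, 31.5]

# fit log10(y) = log10(a) + b*x by ordinary least squares
ly = [log10(v) for v in y]
n = len(x)
xbar, lybar = sum(x) / n, sum(ly) / n
b = sum((xi - xbar) * (li - lybar) for xi, li in zip(x, ly)) / \
    sum((xi - xbar) ** 2 for xi in x)
log_a = lybar - b * xbar
a = 10 ** log_a                       # take the antilog to recover a
print(a, b)                           # roughly a ≈ 1.06 and b ≈ 0.29 for these made-up values
```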
THE MEDIAN–MEDIAN LINE
Data sets are often greatly influenced by very large or very small observations; we
explored this to some extent in Chapter 12. For example, suppose salaries in a small
manufacturing plant are as follows:
$12,500, $13,850, $21,900, $26,200, $65,600
The mean of these salaries is $28,010, but four out of five workers receive less
than this average salary! The mean is highly influenced by the largest salary. The
median salary (the salary in the middle or the mean of the two middlemost salaries
when the salaries are arranged in order) is $21,900. We could even replace the two
highest salaries with salaries equal to or greater than $21,900, and the median would
remain at $21,900 while the mean might be heavily influenced.
For this reason, the median is called a robust statistic since it is not influenced
by extremely large (or small) values.
The median can also be used in regression. We now describe the median–median
line. This line is much easier to calculate than the least squares line and enjoys some
surprising connections with geometry.
We begin with a general set of data $\{x_i, y_i\}$ for i = 1, ..., n. To find the median–
median line, first divide the data into three parts (which usually contain roughly the
same number of data points). In each part, determine the median of the x-values and
the median of the y-values (this will rarely be one of the data points). These three
median points are then plotted as the vertices of a triangle: $P_1$ comes from the part
with the smallest x-values, $P_2$ from the part with the middle x-values, and $P_3$ from
the part with the largest x-values.
The median–median line is determined by drawing a line parallel to the baseline
(the line joining $P_1$ and $P_3$) and at 1/3 of the distance from the baseline toward $P_2$.
The idea is that since $P_1$ and $P_3$ together represent 2/3 of the data and $P_2$ represents
1/3 of the data, the line should be moved 1/3 of the distance from the baseline toward
$P_2$. The general situation is shown in Figure 13.6.
[Figure 13.6: the triangle formed by P_1, P_2, and P_3; the median–median line is parallel to the base P_1P_3, one third of the way toward P_2.]
Now for some geometric facts.
The medians of a triangle meet at a point that is 1/3 of the distance from the
baseline. Figure 13.7 shows these medians and their meeting point.
If the triangle is determined by the points $P_1(x_1, y_1)$, $P_2(x_2, y_2)$, and $P_3(x_3, y_3)$,
then the medians meet at the point $(\bar{x}, \bar{y})$ where $\bar{x} = (1/3)(x_1 + x_2 + x_3)$ and
$\bar{y} = (1/3)(y_1 + y_2 + y_3)$. (The means here refer to the coordinates of the median points,
not the data set in general.) It is also true that the meeting point of the medians is 1/3
of the distance from the baseline. Let us prove these facts before making use of them
in determining the median–median line.
[Figure 13.7: the three medians of the triangle P_1P_2P_3 and their meeting point.]
Figure 13.8 shows the situation in general. We have taken the baseline along the
x-axis, with the leftmost vertex of the triangle at the point (0, 0). This simplifies the
calculations greatly and does not limit the generality of our arguments.
[Figure 13.8: the triangle with vertices (0, 0), (x_1, 0), and (x_2, y_2); the medians m_1, m_2, and m_3 pass through the side midpoints (x_2/2, y_2/2) and ((x_1 + x_2)/2, y_2/2).]
First, the point $(\bar{x}, \bar{y})$ is the point $((x_1 + x_2)/3, \; y_2/3)$, so the point $(\bar{x}, \bar{y})$ is 1/3
of the distance from the baseline toward the vertex $P_2$. Now we must show that the
medians meet at that point.
The equations of the three medians shown are as follows:
m_1: \; y = \frac{y_2}{x_1 + x_2}\,x
m_2: \; y = -\frac{x_1 y_2}{x_2 - 2x_1} + \frac{y_2}{x_2 - 2x_1}\,x
m_3: \; y = -\frac{x_1 y_2}{2x_2 - x_1} + \frac{2y_2}{2x_2 - x_1}\,x
It is easy to verify that the point $((x_1 + x_2)/3, \; y_2/3)$ lies on each of the lines
and hence is the point of intersection of these median lines.
These facts suggest two ways to determine the equation of the median–median
line. They are as follows:
1. First, determine the slope of the line joining $P_1$ and $P_3$. This is the slope of
the median–median line. Second, determine the point $(\bar{x}, \bar{y})$. Finally, find the
median–median line using the slope and a point on the line.
2. Determine the slope of the line joining $P_1$ and $P_3$. Then determine the equations
of two of the medians and solve them simultaneously to determine the point
of intersection. The median–median line can be found using the slope and a
point on the line.
To these two methods, suggested by the facts above, we add a third method.
3. Determine the intercept of the line joining $P_1$ and $P_3$ and the intercept of
the line through $P_2$ with the slope of the line through $P_1$ and $P_3$. The intercept
of the median–median line is the average of twice the first intercept plus the
second intercept (and its slope is the slope of the line joining $P_1$ and $P_3$).
Method 1 is by far the easiest of the three methods although all are valid. Methods
2 and 3 are probably useful only if one wants to practice finding equations of lines
and doing some algebra! Method 2 is simply doing the proof above with actual data.
We will not prove that method 3 is valid but the proof is fairly easy.
When Are the Lines Identical?
It turns out that if $x_2 = \bar{x}$, then the least squares line and the median–median line are
identical. To show this, consider the diagram in Figure 13.9 where we have taken the
base of the triangle along the x-axis and the vertex of the triangle at $(x_1/2, y_2)$ since
\bar{x} = \frac{0 + x_1 + \frac{x_1}{2}}{3} = \frac{x_1}{2}
[Figure 13.9: the triangle with vertices (0, 0), (x_1, 0), and (x_1/2, y_2); the medians meet at (x_1/2, y_2/3).]
Now
\hat{\beta} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
Here
\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \left(0 - \frac{x_1}{2}\right)\left(0 - \frac{y_2}{3}\right) + \left(\frac{x_1}{2} - \frac{x_1}{2}\right)\left(y_2 - \frac{y_2}{3}\right) + \left(x_1 - \frac{x_1}{2}\right)\left(0 - \frac{y_2}{3}\right) = 0
so $\hat{\beta} = 0$. Also, $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} = \bar{y} = y_2/3$. So the least squares line is $y = y_2/3$.
But this is also the median–median line since the median–median line passes through
$(\bar{x}, \bar{y})$ and has slope 0.
It is not often that $x_2 = \bar{x}$ exactly, but if these values are close, we expect the least
squares line and the median–median line to also be close.
We now proceed to an example using the Indianapolis 500-mile race winning
speeds.
The reason for using these speeds as an example is that the data are divided
naturally into three parts due to the fact that the race was not held during World War I
(1917 and 1918) and World War II (1942–1946). We admit that the three periods of
data are far from equal in size. The data have been given above by year; we now show
the median points for each of the three time periods (Table 13.7 and Figure 13.10).
Table 13.7
Period Median (years) Median (speed)
1911–1916 1913.5 80.60
1919–1941 1930 100.45
1947–2008 1977.5 149.95
[Figure 13.10: the three median points P_1 = (1913.5, 80.6), P_2 = (1930, 100.45), and P_3 = (1977.5, 149.95), together with the median–median line.]
Determining the Median–Median Line
The equation of the median–median line will now be determined by each of the three
methods described above.
• Method 1
The slope of the median–median line is the slope of the line joining $P_1$ and $P_3$:
\frac{149.95 - 80.6}{1977.5 - 1913.5} = 1.08359
The point $(\bar{x}, \bar{y})$ is the point
\left(\frac{1913.5 + 1930 + 1977.5}{3}, \; \frac{80.6 + 100.45 + 149.95}{3}\right) = (1940.333, \; 110.333)
So the equation of the median–median line is
\frac{y - 110.333}{x - 1940.333} = 1.08359
This can be simplified to y = −1992.19 + 1.08359x, where y is the speed
and x is the year.
• Method 2
We determine the equations of two of the median lines and show that they
intersect at the point $(\bar{x}, \bar{y})$. The line from $P_1$ to the midpoint of the line joining
$P_2$ and $P_3$ (the point (1953.75, 125.2)) is y = 1.10807x − 2039.69. The line from
$P_3$ to the midpoint of the line joining $P_1$ and $P_2$ (the point (1921.75, 90.525))
is y = 1.06592x − 1957.91. These lines intersect at (1940.33, 110.33), thus
producing the same median–median line as in method 1.
• Method 3
Here we find the intercept of the line joining $P_1$ and $P_3$. This is easily found to
be −1992.85. Then the intercept of the line through $P_2$ with the slope of the line
joining $P_1$ and $P_3$ (1.08359) is −1990.88. Then weighting the first intercept
twice as much as the second intercept, we find the intercept for the median–
median line to be
\frac{2(-1992.85) + (-1990.88)}{3} = -1992.19
So again we find the same median–median line. The least squares line for the
three median points is y = −2067.73 + 1.12082x.
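Method 1 translates directly into code. The sketch below (Python, with names of our choosing) takes the three median points, uses the slope of the line joining P_1 and P_3 and the centroid of the three points, and returns the slope and intercept of the median–median line; applied to the Indy 500 median points it gives approximately 1.08359 and −1992.19.

```python
def median_median_line(p1, p2, p3):
    """Method 1: slope of the line P1P3, passed through the centroid of the three points."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    slope = (y3 - y1) / (x3 - x1)                       # slope of the line joining P1 and P3
    xc, yc = (x1 + x2 + x3) / 3, (y1 + y2 + y3) / 3     # the point (x-bar, y-bar)
    intercept = yc - slope * xc
    return slope, intercept

# the three median points for the Indy 500 data
print(median_median_line((1913.5, 80.6), (1930, 100.45), (1977.5, 149.95)))
# approximately (1.08359, -1992.19)
```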
These lines appear to be somewhat different, as shown in Figure 13.11. They meet
at the point (1958, 129) approximately. It is difficult to compare these lines since the
analysis of variance partition
\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2
[Figure 13.11: the median–median line and the least squares line plotted with the winning speeds by year.]
or SST = SSE +SSR no longer holds, because the predictions are no longer
those given by least squares. For the least squares line, we find
                    SST      SSE      SSR      r^2
Least squares line  68,428   12,323   56,105   0.819918
It is true, however, that
\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + 2\sum_{i=1}^{n} (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2
The total sum of squares remains at $\sum_{i=1}^{n} (y_i - \bar{y})^2 = 68{,}428$, but the middle
term is no longer zero.
We find, in fact, that in this case $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 75{,}933$,
$2\sum_{i=1}^{n} (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) = -25{,}040$, and
$\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = 17{,}535$, so the middle term has considerable influence.
There are huge residuals from the predictions using either line, especially in
the later years. However, Figure 13.3 shows that the speeds become very variable
and apparently deviate greatly from a possible straight line relationship after
1970.
We can calculate all the residuals from both the median–median line and the least
squares line. Plots of these are difficult to compare. We show these in Figures 13.12
and 13.13.
It is not clear what causes this deviation from a straight line in these years, but in
1972 wings were allowed on the cars, making the aerodynamic design of the car of
greater importance than the power of the car. In 1974, the amount of gasoline a car
could carry was limited, producing more pit stops; practice time was also reduced.
Figure 13.12 Median–median line residuals.
Figure 13.13 Least squares residuals.
Both of these measures were taken to preserve fuel. For almost all the races, some
time is spent under the yellow flag. At present, all cars must follow a pace car and
are not allowed to increase their speed or pass other cars. Since the time under the
yellow flag is variable, this is no doubt a cause of some of the variability of the speeds
in the later years. It is possible to account for the time spent on the race under the
yellow flag, but that discussion is beyond our scope here. The interested reader should
consult a reference on the analysis of covariance.
These considerations prompt an examination of the speeds from the early years
only. We have selected the period 1911–1969.
ANALYSIS FOR YEARS 1911–1969
For these years, we find the least squares regression line to be Speed = −2315.89 + 1.2544 · Year with r² = 0.980022, a remarkable fit.
For the median–median line, we use the points (1913.5, 80.60) , (1930, 100.45),
and (1958, 135.601), the last point being the median point for the years 1947 through
2008. We find the median–median line to be y = 1.23598x −2284.63. The lines are
very closely parallel, but have slightly different intercepts. Predictions based upon
them will be very close. These lines are shown in Figure 13.14.
Figure 13.14 The least squares and median–median lines for 1911–1969 (speed versus year).
CONCLUSIONS
The data from the winning speeds at the Indianapolis 500-mile race provide a fairly
realistic exercise when one is confronted with a genuine set of data. Things rarely
work out as well as they do with textbook cases of arranged or altered data sets.
We find that in this case both the median–median line and the least squares line approximate the data well for the early years; neither is acceptable for the later years, when we speculate that alterations in the aerodynamics of the cars and the time spent under the yellow flag produce speeds that vary considerably from a straight-line prediction.
EXPLORATIONS
1. Using the three median points for the Indy 500 data, show that method 3 is a
valid procedure for finding the median–median line.
2. Using the three median points for the Indy 500 data, find the least squares line
for these points.
3. Find the analysis of variance partition for the least squares line in Explo-
ration 2.
4. Analyze the Indy 500 data for 1998–2008 by finding both the median–median
and the least squares lines. Show the partitions of the total sum of squares,
SST, in each case.
Chapter 14
Sampling
CHAPTER OBJECTIVES:
• to show some properties of simple random sampling
• to introduce stratified sampling
• to find some properties of stratified sampling
• to see how proportional allocation works
• to discuss optimal allocation
• to find some properties of proportional and optimal allocation.
One of the primary reasons that statistics has become of great importance in sci-
ence and engineering is the knowledge we now have concerning sampling and the
conclusions that can be drawn from samples. It is perhaps a curious and counter-
intuitive fact that knowledge about a population or group can be found with great
accuracy by examining a sample—only part of the population or group.
Almost all introductory courses in statistics discuss only simple random sam-
pling. In simple random sampling every item in the population is given an equal
chance of occurring in the sample, so every item in the population is treated exactly
equally. It may come as a surprise to learn that simple random sampling can often be
improved upon; that is, other sampling procedures may well be more efficient in pro-
viding information about the population from which the sample is selected. In these
procedures, not all the sampled items are treated equally! In addition to simple ran-
dom sampling, we will discuss stratified sampling and both proportional allocation
and optimal allocation within stratified sampling.
We start with a very small example so that ideas become clear.
EXAMPLE 14.1 High and Middle Schools
An urban school district is interested in discovering some characteristics of some of its high
and middle schools. We emphasize from the beginning that we assume we know the entire
population. In practice, however, we would never know this (or else sampling is at best an idle
exercise). The details of the population are given in Table 14.1.
Table 14.1
School enrollment School type
1667 High
2002 High
1493 High
1802 High
1535 High
731 Middle
834 Middle
699 Middle
The mean enrollment is μ_x = 1345.375 and the standard deviation is σ_x = 481.8737. We want to show how these statistics can be estimated by taking a sample of the schools. The subscript x is used to distinguish the population from the samples we will select. We first consider simple random sampling.
■
SIMPLE RANDOM SAMPLING
In simple random sampling, each item in the population, in this case a school, has the same chance of appearing in the sample as any other school.
We decide to choose five of the eight schools as a random sample. If we give each of the eight schools equal probability of occurring in the sample, then we have a simple random sample. There are C(8, 5) = 56 simple random samples, where C(n, r) denotes the number of combinations of n items taken r at a time. These are shown in Table 14.2.
We note here that the sample size 5 is relatively very large compared to the size of
the population, 8, but this serves our illustrative purposes. In many cases, the sample
size can be surprisingly small relative to the size of the population. We cannot
discuss sample size here, but refer the reader to many technical works on the subject.
Note that each of the schools occurs in C(7, 4) = 35 of the random samples, so each of the schools is treated equally in the sampling process.
Since we are interested in the mean enrollment, we find the mean of each of the
samples to find the 56 mean enrollments given in Table 14.3.
Call these values x̄. The mean of these values is μ_x̄ = 1345.375 with standard deviation σ_x̄ = 141.078. We see that the mean of the means is the mean of the population, so the sample mean, x̄, is an unbiased estimator of the true population mean, μ_x.
Table 14.2
{1667, 2002, 1493, 1802, 1535}, {1667, 2002, 1493, 1802, 731}, {1667, 2002, 1493, 1802, 834},
{1667, 2002, 1493, 1802, 699}, {1667, 2002, 1493, 1535, 731}, {1667, 2002, 1493, 1535, 834}
{1667, 2002, 1493, 1535, 699}, {1667, 2002, 1493, 731, 834},{1667, 2002, 1493, 731, 699},
{1667, 2002, 1493, 834, 699}, {1667, 2002, 1802, 1535, 731}, {1667, 2002, 1802, 1535, 834},
{1667, 2002, 1802, 1535, 699}, {1667, 2002, 1802, 731, 834}, {1667, 2002, 1802, 731, 699},
{1667, 2002, 1802, 834, 699}, {1667, 2002, 1535, 731, 834}, {1667, 2002, 1535, 731, 699}
{1667, 2002, 1535, 834, 699}, {1667, 2002, 731, 834, 699}, {1667, 1493, 1802, 1535, 731},
{1667, 1493, 1802, 1535, 834}{1667, 1493, 1802, 1535, 699}, {1667, 1493, 1802, 731, 834}
{1667, 1493, 1802, 731, 699}, {1667, 1493, 1802, 834, 699}, {1667, 1493, 1535, 731, 834}
{1667, 1493, 1535, 731, 699}, {1667, 1493, 1535, 834, 699}, {1667, 1493, 731, 834, 699}
{1667, 1802, 1535, 731, 834}, {1667, 1802, 1535, 731, 699}, {1667, 1802, 1535, 834, 699},
{1667, 1802, 731, 834, 699}, {1667, 1535, 731, 834, 699}, {2002, 1493, 1802, 1535, 731},
{2002, 1493, 1802, 1535, 834}, {2002, 1493, 1802, 1535, 699}, {2002, 1493, 1802, 731, 834},
{2002, 1493, 1802, 731, 699}, {2002, 1493, 1802, 834, 699}, {2002, 1493, 1535, 731, 834},
{2002, 1493, 1535, 731, 699}, {2002, 1493, 1535, 834, 699}, {2002, 1493, 731, 834, 699},
{2002, 1802, 1535, 731, 834}, {2002, 1802, 1535, 731, 699}, {2002, 1802, 1535, 834, 699},
{2002, 1802, 731, 834, 699}, {2002, 1535, 731, 834, 699}, {1493, 1802, 1535, 731, 834},
{1493, 1802, 1535, 731, 699}, {1493, 1802, 1535, 834, 699}, {1493, 1802, 731, 834, 699}
{1493, 1535, 731, 834, 699}, {1802, 1535, 731, 834, 699}
Table 14.3
{1699.8, 1539, 1559.6, 1532.6, 1485.6, 1506.2, 1479.2, 1345.4, 1318.4, 1339, 1547.4,
1568, 1541, 1407.2, 1380.2, 1400.8, 1353.8, 1326.8, 1347.4, 1186.6, 1445.6, 1466.2,
1439.2, 1305.4, 1278.4, 1299, 1252, 1225, 1245.6, 1084.8, 1313.8, 1286.8, 1307.4,
1146.6, 1093.2, 1512.6, 1533.2, 1506.2, 1372.4, 1345.4, 1366, 1319, 1292, 1312.6,
1151.8, 1380.8, 1353.8, 1374.4, 1213.6, 1160.2, 1279, 1252, 1272.6, 1111.8,
1058.4, 1120.2}
So we write E(x̄) = μ_x.
Also, the process of averaging has reduced the standard deviation considerably. These
values of x̄ are best seen in the graph shown in Figure 14.1. Despite the fact that
the population is flat, that is, each enrollment occurs exactly once, the graph of the
means begins to resemble the normal curve. This is a consequence of the central limit
theorem.
We must be cautious, however, in calculating the standard deviation of the means. From the central limit theorem, one might expect the variance of the means to be σ²_x/n, where σ²_x is the variance of the population. This is not exactly so in this case. The reason is that we are sampling without replacement from a finite population, while the central limit theorem deals with samples from an infinite population.
Figure 14.1 Frequencies of the 56 sample means.
If we are selecting a sample of size n from a finite population of size N, then the variance of the sample mean is given by

    σ²_x̄ = (N − n)/(N − 1) · σ²_x/n
where σ²_x is the variance of the finite population. Note that in calculating this, we must also find

    σ²_x = Σ_{i=1}^N (x_i − μ)²/N
where μ is the mean of the population. Note the divisor of N since we are not dealing
with a sample, but with the entire population. Many statistical computer programs
assume that the data used are that of a sample, rather than that of a population and use
a divisor of N −1. So the variance of the population calculated by using a computer
program must be multiplied by (N −1)/N to find the correct result.
Here

    σ²_x̄ = (N − n)/(N − 1) · σ²_x/n = (8 − 5)/(8 − 1) · 481.8737²/5 = 19,903.0486
and so

    σ_x̄ = √19,903.0486 = 141.078

exactly the value we calculated using all the sample means.
The factor (N −n)/(N −1) is often called the finite population correction factor.
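These results can be verified by brute force. A minimal sketch, assuming Python (not part of the original text):

```python
from itertools import combinations
from statistics import mean, pstdev

# Population of school enrollments from Table 14.1.
enrollments = [1667, 2002, 1493, 1802, 1535, 731, 834, 699]
N, n = len(enrollments), 5

# All C(8, 5) = 56 simple random samples and their means.
sample_means = [mean(s) for s in combinations(enrollments, n)]
print(len(sample_means))        # 56
print(mean(sample_means))       # 1345.375, the population mean (unbiasedness)
print(pstdev(sample_means))     # about 141.078

# The same standard deviation from the finite population correction factor.
var_x = pstdev(enrollments) ** 2           # population variance (divisor N)
print(((N - n) / (N - 1) * var_x / n) ** 0.5)   # about 141.078
```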
STRATIFICATION
At first glance it might appear that it is impossible to do better than simple random
sampling where each of the items in the population is given the same chance as
any other item in the population in appearing in the sample. This, however, is not the
case! The reason for this is that the population is divided into two recognizable groups,
high schools and middle schools. These groups are called strata, and in addition to
occurring in unequal numbers, they have quite different characteristics. Table 14.4
shows some data from these strata.
Table 14.4
Stratum Number Mean Standard deviation
High Schools 5 1699.800 185.887
Middle Schools 3 754.667 57.5982
The strata then differ markedly in mean values but, more importantly as we shall see, they also differ considerably in variability. We can capitalize on these differences and construct a sampling procedure that is unbiased but has a much smaller standard deviation than simple random sampling.
Stratified random sampling takes these different characteristics into account. It
turns out that it is most efficient to take simple random samples within each stratum.
The question is how to determine the sample size, or allocations, within each stratum.
We will show two ways to allocate the sample sizes, called proportional allocation
and optimal allocation, respectively.
Proportional Allocation
In proportional allocation, we choose samples of sizes n1 and n2 from strata of sizes N1 and N2 so that the proportions in the sample reflect exactly the proportions in the population. That is, we want

    n1/n2 = N1/N2

Since n = n1 + n2, it follows that n1 · N2 = n2 · N1 = (n − n1) · N1, and from this it follows that

    n1 = n · N1/(N1 + N2)

and so

    n2 = n · N2/(N1 + N2)
In proportional allocation, the sample sizes taken in each stratum are then propor-
tional to the sizes of each stratum. In this case, keeping a total sample of size 5, we take
the proportion 5/(5 +3) = 5/8 from the high school stratum and 3/(5 +3) = 3/8
from the middle school stratum; so we take 5 · 5/8 = 3.125 observations from the
high school stratum and 5 · 3/8 = 1.875 observations from the middle school stra-
tum. We cannot do this exactly of course, so we choose three items from the high
school stratum and two items from the middle school stratum. The sampling within
each stratum must be done using simple random sampling. We found C(5, 3) · C(3, 2) = 30 different samples, which are shown in Table 14.5.
Table 14.5
{1667, 2002, 1493, 731, 834}, {1667, 2002, 1493, 731, 699}, {1667, 2002, 1493, 834, 699},
{1667, 2002, 1802, 731, 834}, {1667, 2002, 1802, 731, 699}, {1667, 2002, 1802, 834, 699},
{1667, 2002, 1535, 731, 834}, {1667, 2002, 1535, 731, 699}, {1667, 2002, 1535, 834, 699},
{1667, 1493, 1802, 731, 834}, {1667, 1493, 1802, 731, 699}, {1667, 1493, 1802, 834, 699},
{1667, 1493, 1535, 731, 834}, {1667, 1493, 1535, 731, 699}, {1667, 1493, 1535, 834, 699},
{1667, 1802, 1535, 731, 834}, {1667, 1802, 1535, 731, 699}, {1667, 1802, 1535, 834, 699},
{2002, 1493, 1802, 731, 834}, {2002, 1493, 1802, 731, 699}, {2002, 1493, 1802, 834, 699},
{2002, 1493, 1535, 731, 834}, {2002, 1493, 1535, 731, 699}, {2002, 1493, 1535, 834, 699},
{2002, 1802, 1535, 731, 834}, {2002, 1802, 1535, 731, 699}, {2002, 1802, 1535, 834, 699},
{1493, 1802, 1535, 731, 834}, {1493, 1802, 1535, 731, 699}, {1493, 1802, 1535, 834, 699}
Now we might be tempted to calculate the mean of each of these samples and proceed with that set of 30 numbers. However, this will give a biased estimate of the population mean, since the observations in the two strata are not given the same probability of appearing in the sample (each high school appears with probability 3/5, each middle school with probability 2/3). To fix this, we weight the mean of the three observations in the high school stratum with a factor of 5 (the size of the high school stratum) and the mean of the two observations from the middle school stratum with a weight of 3 (the size of the middle school stratum) and then divide the result by 8 to find the weighted mean of each sample. For example, for the first sample, we find the weighted mean to be
    [5 · (1667 + 2002 + 1493)/3 + 3 · (731 + 834)/2]/8 = 1368.85
which differs somewhat from 1345.40, which is the unweighted mean of the first
sample. The set of weighted means is shown in Table 14.6.
A graph of these means is shown in Figure 14.2.
This set of means has mean 1345.375, the true mean of the population; so this
estimate for the population mean is also unbiased. But the real gain here is in the
standard deviation. The standard deviation of this set of means is 48.6441. This is
a large reduction from 141.078, the standard deviation of the set of simple random
samples.
Table 14.6
{1368.85, 1343.54, 1362.85, 1433.23, 1407.92, 1427.23, 1377.6, 1352.29, 1371.6,
1327.19, 1301.88, 1321.19, 1271.56, 1246.25, 1265.56, 1335.94, 1310.63, 1329.94,
1396.98, 1371.67, 1390.98, 1341.35, 1316.04, 1335.35, 1405.73, 1380.42, 1399.73,
1299.69, 1274.38, 1293.69}
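The proportional allocation calculation can be checked the same way. A sketch, again assuming Python:

```python
from itertools import combinations
from statistics import mean, pstdev

high = [1667, 2002, 1493, 1802, 1535]   # high school stratum (size 5)
middle = [731, 834, 699]                # middle school stratum (size 3)

# Proportional allocation: 3 high schools and 2 middle schools per sample,
# each stratum sampled by simple random sampling; weight by stratum sizes.
weighted_means = [
    (5 * mean(h) + 3 * mean(m)) / 8
    for h in combinations(high, 3)
    for m in combinations(middle, 2)
]
print(len(weighted_means))      # 30
print(mean(weighted_means))     # 1345.375, so the estimator is unbiased
print(pstdev(weighted_means))   # about 48.64
```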
Figure 14.2 Frequencies of the 30 weighted means.
This procedure has then cut the standard deviation by about 66% while remaining unbiased. Clearly, stratified sampling has resulted in much greater efficiency in estimating the population mean. One of the reasons for this is the difference in the standard deviations within the strata. The high school stratum has standard deviation 185.887, whereas the middle school stratum has a much smaller standard deviation, 57.5982. This discrepancy can be utilized further in the form of stratification known as optimal allocation.
Optimal Allocation
We now describe another way of allocating the observations between the strata. The
very name, optimal allocation, indicates that this is in some sense the best allocation
we can devise. We will see that this is usually so and that the standard deviation of
the means created this way is even less than that for proportional allocation.
Optimal allocation is derived, in principle, from proportional allocation in which
strata with large standard deviations are sampled more frequently than those with
smaller standard deviations.
Suppose now that we have two strata: the first stratum of size N1 with standard deviation σ1 and the second stratum of size N2 with standard deviation σ2. If the total size of the sample to be selected is n, where n1 items are selected from the first stratum and n2 items from the second stratum, then n = n1 + n2, where these sample sizes are determined so that

    n1/n2 = (N1/N2) · (σ1/σ2)
Here the population proportion, N1/N2, is weighted by the ratio of the standard deviations, σ1/σ2.
Since n2 = n − n1, we have

    n1 · N2σ2 = n2 · N1σ1 = (n − n1) · N1σ1
and this means that we choose

    n1 = n · N1σ1/(N1σ1 + N2σ2)

items from the first stratum and

    n2 = n · N2σ2/(N1σ1 + N2σ2)

items from the second stratum.
In this case, then, we would select

    5 · 5(185.887)/[5(185.887) + 3(57.5982)] = 4.216

items from the first (high school) stratum and

    5 · 3(57.5982)/[5(185.887) + 3(57.5982)] = 0.784

items from the second (middle school) stratum.
The best we can then do is to select four items from the high school stratum and one item from the middle school stratum. This gives C(5, 4) · C(3, 1) = 15 possible samples, which are shown in Table 14.7.
Table 14.7
{1667, 2002, 1493, 1802, 731}, {1667, 2002, 1493, 1802, 834}, {1667, 2002, 1493, 1802, 699},
{1667, 2002, 1493, 1535, 731}, {1667, 2002, 1493, 1535, 834}, {1667, 2002, 1493, 1535, 699},
{1667, 2002, 1802, 1535, 731}, {1667, 2002, 1802, 1535, 834}, {1667, 2002, 1802, 1535, 699}
{1667, 1493, 1802, 1535, 731}, {1667, 1493, 1802, 1535, 834}, {1667, 1493, 1802, 1535, 699},
{2002, 1493, 1802, 1535, 731}, {2002, 1493, 1802, 1535, 834}, {2002, 1493, 1802, 1535, 699}
Now again, as we did with proportional allocation, we do not simply calculate the mean of each of these samples but instead calculate a weighted mean that reflects the differing probabilities with which the observations have been collected. We weight the mean of the high school observations with a factor of 5 (the size of that stratum) while we weight the observation from the middle school stratum with its size, 3, before dividing by 8. For example, for the first sample, the weighted mean becomes

    [5 · (1667 + 2002 + 1493 + 1802)/4 + 3 · 731]/8 = 1362.25
The complete set of weighted means is
{1362.25, 1400.88, 1350.25, 1320.53, 1359.16, 1308.53, 1368.81, 1407.44,
1356.81, 1289.28, 1327.91, 1277.28, 1341.63, 1380.25, 1329.63}
A graph of these means is shown in Figure 14.3.
Figure 14.3 Frequencies of the 15 weighted means.
The mean of these weighted means is 1345.375, the mean of the population, so
once more our estimate is unbiased, but the standard deviation is 36.1958, a reduction
of about 26% from that for proportional allocation.
We summarize the results of the three sampling plans discussed here in
Table 14.8.
Table 14.8
Sampling Number Mean Standard deviation
Population 8 1345.375 481.874
Simple random 56 1345.375 141.078
Proportional allocation 30 1345.375 48.6441
Optimal allocation 15 1345.375 36.1958
SOME PRACTICAL CONSIDERATIONS
Our example here contains two strata, but in practice one could have many more.
Suppose the strata are as follows:
Stratum    Number    Standard deviation
  1          N1             σ1
  2          N2             σ2
  ⋮          ⋮              ⋮
  k          Nk             σk
Suppose N1 + N2 + ··· + Nk = N and that we wish to select a sample of size n. In proportional allocation, we want the sample sizes in each stratum to reflect the sizes of the strata, so we want

    n1/N1 = n2/N2 = ··· = nk/Nk
The solution to this set of equations is to choose ni = n · Ni/N observations from the ith stratum. Then the ratio of the number of observations in the ith stratum to the number of observations in the jth stratum is

    ni/nj = (n · Ni/N)/(n · Nj/N) = Ni/Nj

the ratio of the number of items in stratum i to the number of items in stratum j.
The numbers Ni are usually known, at least approximately, so one can come close to proportional allocation in most cases.
Optimal allocation, however, poses a different problem, because the number of observations per stratum depends on the standard deviations. In the general case, optimal allocation selects

    ni = n · Niσi/(N1σ1 + N2σ2 + ··· + Nkσk)

observations from the ith stratum. This requires knowledge of, or a close approximation to, the standard deviations. In the case of two strata, however, we need only know the ratio of the standard deviations. The number of items to be selected from the first stratum is
    n · N1σ1/(N1σ1 + N2σ2) = n · N1(σ1/σ2)/[N1(σ1/σ2) + N2]
So the ratio provides all the information needed. For example, if we know that σ1/σ2 = 2, N1 = 10, and N2 = 15, and if we wish to select a sample of 7, then we select

    7 · (10 · 2)/(10 · 2 + 15) = 4

items from the first stratum and
    n · N2σ2/(N1σ1 + N2σ2) = n · N2/[N1(σ1/σ2) + N2] = 7 · 15/(10 · 2 + 15) = 3
items from the second stratum. It is of course unusual for these sample sizes to be
integers, so we do the best we can. Usually the total sample size is determined by the
cost of selecting each item. In the general case, if the ratio of the standard deviations
to each other is known, or can be approximated, then an allocation equivalent to, or
approximately equal to, optimal allocation can be achieved.
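Both allocation rules can be wrapped in one small function. A sketch, assuming Python; the function name allocations is ours, not the text's:

```python
def allocations(n, sizes, sds=None):
    """Proportional allocation when sds is None; otherwise optimal allocation.
    Returns the (generally non-integer) sample size for each stratum."""
    if sds is None:
        sds = [1.0] * len(sizes)        # equal weights reduce to proportional allocation
    weights = [N * s for N, s in zip(sizes, sds)]
    total = sum(weights)
    return [n * w / total for w in weights]

# The school example: strata of sizes 5 and 3, standard deviations 185.887 and 57.5982.
print(allocations(5, [5, 3]))                       # [3.125, 1.875]   (proportional)
print(allocations(5, [5, 3], [185.887, 57.5982]))   # about [4.216, 0.784]  (optimal)
```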
It is clear that stratification, of either variety, reduces the standard deviation
and so increases greatly the accuracy with which predictions can be made. It is of-
ten the case that proportional and optimal allocation do not differ very much with
respect to the reduction in the standard deviation, although it can be shown that

    σ²(simple random sampling) ≥ σ²(proportional allocation) ≥ σ²(optimal allocation)

is usually the case. Instances where this inequality does not hold are very unlikely to be encountered in practice.
STRATA
Stratification is usually a very efficient technique in sampling. It is important to realize
that the strata must exist in a recognizable form before the sampling is done. The strata
are then groups with recognizable features. In political sampling, for example, strata
might consist of urban and rural residents, but within these strata we might sample
home owners, apartment dwellers, condominium owners, and so on as substrata.
In addition, in national political polling, the strata might differ from state to state.
In any event, the strata cannot be made up without the group having some known
characteristics. Stratified sampling has been known to provide very accurate estimates
in elections; generally the outcome is known, except in extremely tight races, well
before the polls close and all the votes have been cast!
CONCLUSIONS
This has been a very brief introduction to varieties of samples that can be chosen
from a population. Much more is known about sampling and the interested reader is
encouraged to sample some of the many specialized texts on sampling techniques.
EXPLORATIONS
1. The data in the following table show the populations of several counties in
Colorado, some of them urban and some rural.
Population County type
14,046 Rural
5,881 Rural
4,511 Rural
9,538 Rural
550,478 Urban
278,231 Urban
380,273 Urban
148,751 Urban
211,272 Urban
(a) Show all the simple random samples of size 4 and draw graphs of the sample
means and sample standard deviations.
(b) Draw stratified samples of size 4 by
(i) proportional allocation;
(ii) optimal allocation.
(c) Calculate weighted means for each of the samples in part (b).
(d) Discuss the differences in the above sampling plans and make inferences.
Chapter 15
Design of Experiments
CHAPTER OBJECTIVES:
• to learn how planning “what observations have to be taken in an experiment” can greatly improve the efficiency of the experiment
• to consider interactions between factors studied in an experiment
• to study factorial experiments
• to consider what to do when the effects in an experiment are confounded
• to look at experimental data geometrically
• to encounter some interesting three-dimensional geometry.
A recent book by David Salsburg is titled The Lady Tasting Tea and is subtitled
How Statistics Revolutionized Science in the Twentieth Century. The title refers to a
famous experiment conducted by R. A. Fisher. The subtitle makes quite a claim, but it
is largely true. Did statistics revolutionize science and if so, how? The answer lies in
our discovery of how to decide what observations to make in a scientific experiment.
If observations are taken correctly we now know that conclusions can be drawn from
them that are not possible with only random observations. It is our knowledge of
the planning (or the design) of experiments that allows experimenters to carry out
efficient experiments in the sense that valid conclusions may then be drawn. It is our
object here to explore certain designed experiments and to provide an introduction to
this subject. Our knowledge of the design of experiments and the design of sample
surveys are the two primary reasons for studying statistics; yet, we can only give a
limited introduction to either of these topics in our introductory course. We begin
with an example.
EXAMPLE 15.1 Computer Performance
A study of the performance of a computer is being made. Only two variables (called factors)
are being considered in the study: Speed (S) and RAM (R). Each of these variables is being
studied at two values (called levels). The levels of Speed are 133 MHz and 400 MHz and the
levels of RAM are 128 MB and 256 MB.
If we study each level of Speed with each level of RAM and if we make one observation
for each factor combination, we need to make four observations. The observation or response
here is the time the computer takes, in microseconds, to perform a complex calculation. These
times are shown in Table 15.1.
Table 15.1
Speed (MHz)
RAM (MB) 133 400
128 27 10
256 18 9
What are we to make of the data? By examining the columns, it appears that the time to
perform the calculation is decreased by increasing speed and, by examining the rows, it appears
that the time to perform the calculation is decreased by increasing RAM. Since the factors are
studied together, it may be puzzling to decide exactly what influence each of these factors has
alone. It might appear that the factors should be studied separately, but, as we will see later, it is
particularly important that the factors be studied together. Studies involving only one factor are
called one-factor-at-a-time experiments and are rarely performed. One reason for the lack of
interest in such experiments is the fact that factors often behave differently when other factors
are introduced into the experiment. When this occurs, we say the factors interact with each
other. It may be very important to detect such interactions, but obviously, we cannot detect
such interactions unless the factors are studied together. We will address this subsequently.
Now we proceed to assess the influence of each of the factors. It is customary, since each
of these factors is at two levels or values, to code these as −1 and +1. The choice of coding
will make absolutely no difference whatever in the end. The data are then shown in Table 15.2
with the chosen codings. The overall mean of the data is 16, shown in parentheses.
One way to assess the influence of the factor speed is to compare the mean of the compu-
tation times at the +1 level with the mean of the computation times at the −1 level. We divide
this difference by 2, the distance between the codings −1 and +1, to find
    Effect of Speed = (1/2)[(10 + 9)/2 − (27 + 18)/2] = −6.5
Table 15.2
Speed
RAM −1 +1
−1 27 10
+1 18 9
(16)
This means that we decrease the computation time on average by 6.5 units as we move
from the −1 level to the +1 level.
Now, what is the effect of RAM? It would appear that we should compare the mean
computation times at the +1 level with those at the −1 level, and again divide by 2 to find
    Effect of RAM = (1/2)[(18 + 9)/2 − (27 + 10)/2] = −2.5
This means that we decrease computation times on average by 2.5 units as we move from
the −1 level to the +1 level.
One more effect can be studied, namely, the interaction between the factors Speed and
RAM.
In Figure 15.1a, we show the computation times at the two levels of RAM for speed at its
two levels. As we go from the −1 level of RAM to the +1 level, the computation times change
differently for the different levels of speed, producing lines that are not parallel. Since the lines
are not quite parallel, this is a sign of a mild interaction between the factors speed and RAM.
In Figure 15.1b, we show two factors that have a very high interaction, that is, the performance
of one factor heavily depends upon the level of the other factor.
Figure 15.1 Computation time versus Speed at the two levels of RAM: (a) the mild interaction in the computer data; (b) two factors with a very high interaction.
The size of the interaction is also a measurement of the effect of the combination of levels
of the main factors (Speed and RAM). We denote this effect by Speed · RAM. How are we to
compute this effect? We do this by comparing the mean where the product of the coded signs
is +1 with the mean where the product of the coded signs is −1 to find
    Effect of Speed · RAM = (1/2)[(27 + 9)/2 − (10 + 18)/2] = 2
So we conclude that the computation time for the factor-level combinations where the product of the coded signs is +1 tends to be two units greater, on average, than the computation time where the product of the coded signs is −1.
How can we put all this information together? We can use these computed effects in the
following model:
Observation = 16 −6.5 · Speed −2.5 · RAM+2 · Speed · RAM
This is called a linear model. Linear models are of great importance in statistics, espe-
cially in the areas of regression and design of experiments. In this case, we can use the signs
of the factors (either +1 or −1) in the linear model. For example, if we use speed = −1 and
RAM = −1, we find
16 −6.5(−1) −2.5(−1) +2(−1)(−1) = 27
exactly the observation at the corner! This also works for the remaining corners in Table 15.2:
16 −6.5(−1) −2.5(+1) +2(−1)(+1) = 18
16 −6.5(+1) −2.5(−1) +2(+1)(−1) = 10
and
16 −6.5(+1) −2.5(+1) +2(+1)(+1) = 9
The linear model then explains exactly each of the observations!
■
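These effect calculations are easy to automate. A minimal sketch, assuming Python with NumPy (ours, not the book's code):

```python
import numpy as np

# Observations from Table 15.2, one row per run: (Speed code, RAM code, time).
runs = [(-1, -1, 27), (+1, -1, 10), (-1, +1, 18), (+1, +1, 9)]

def effect(sign_of_run):
    """Half the difference between the mean response where the sign is +1
    and the mean response where the sign is -1."""
    plus = [y for s, r, y in runs if sign_of_run(s, r) == +1]
    minus = [y for s, r, y in runs if sign_of_run(s, r) == -1]
    return (np.mean(plus) - np.mean(minus)) / 2

mu = np.mean([y for _, _, y in runs])          # 16
speed = effect(lambda s, r: s)                 # -6.5
ram = effect(lambda s, r: r)                   # -2.5
inter = effect(lambda s, r: s * r)             # 2

# The linear model reproduces every corner observation exactly.
for s, r, y in runs:
    assert mu + speed * s + ram * r + inter * s * r == y
```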
We now explore what happens when another factor is added to the experiment.
EXAMPLE 15.2 Adding a Factor to the Computer Experiment
Suppose that we now wish to study two different brands of computers so we add the factor
Brand to the experiment. Again, we code the two brands as −1 and +1. Now we need a
three-dimensional cube to see the computation times resulting from all the combinations of the
factors. We show these in Figure 15.2.
Now we have three main effects S, R, and B, in addition to three two-factor interactions, which we abbreviate as SR, SB, and RB, and one three-factor interaction SBR.

Figure 15.2 Cube plot of the computation times at the corners of the 2³ design.

To calculate
the main effects, we use the planes that form the boundaries of the cube. To find the main S
effect, for example, we compare the mean of the plane where S is positive with the mean of
the plane where S is negative and take 1/2 of this difference as usual. This gives
    S = (1/2)[(9 + 10 + 6 + 20)/4 − (18 + 27 + 13 + 29)/4] = −5.25
Similarly, we find
    R = (1/2)[(6 + 9 + 13 + 18)/4 − (10 + 27 + 29 + 20)/4] = −5
and
    B = (1/2)[(29 + 13 + 6 + 20)/4 − (9 + 10 + 18 + 27)/4] = 0.5
The calculation of the interactions now remains. To calculate the SR interaction, it would appear consistent with our previous calculations if we were to compare the plane where SR is positive to the plane where SR is negative. This gives

    SR = (1/2)[(27 + 9 + 6 + 29)/4 − (18 + 10 + 13 + 20)/4] = 1.25
The planes shown in Figure 15.3 may be helpful in visualizing the calculations of the remaining
two-factor interactions.
We find
    SB = (1/2)[(27 + 18 + 6 + 20)/4 − (10 + 9 + 13 + 29)/4] = 1.25
and
    BR = (1/2)[(10 + 13 + 6 + 27)/4 − (9 + 18 + 20 + 29)/4] = −2.50
These two-factor interactions geometrically are equivalent to collapsing the cube along each
of its major axes and analyzing the data from the squares that result.
Figure 15.3 The planes used in calculating the two-factor interactions: (a) SR interaction; (b) SB interaction; (c) BR interaction.
This leaves the three-factor interaction SBR to be calculated. But we have used up every
one of the 12 planes that pass through the cube! Consistent with our previous calculations, if
we look at the points where SBR is positive, we find a tetrahedron within the cube as shown in
Figure 15.4.
Figure 15.4 Positive tetrahedron.
We also find a negative tetrahedron as shown in Figure 15.5.
Now, we compare the means of the computation times in the positive tetrahedron with
that of the negative tetrahedron:
    SBR = (1/2)[(18 + 10 + 6 + 29)/4 − (27 + 9 + 13 + 20)/4] = −0.75
Figure 15.5 Negative tetrahedron.
Now, as in the case of two factors, we form the linear model
Observation = 16.5 −5.25S −5R +0.5B +1.25SR +1.25SB −2.5RB −0.75SBR
where the mean of the eight observations is 16.5.
Again, we find that the model predicts each of the corner observations exactly. We show
the signs of the factors (in the order S, R, B) to the left of each calculation.
(+, +, +) 16.5 −5.25 −5 +0.5 +1.25 +1.25 −2.5 −0.75 = 6
(−, +, +) 16.5 +5.25 −5 +0.5 −1.25 −1.25 −2.5 +0.75 = 13
(+, −, +) 16.5 −5.25 +5 +0.5 −1.25 +1.25 +2.5 +0.75 = 20
(−, −, +) 16.5 +5.25 +5 +0.5 +1.25 −1.25 +2.5 −0.75 = 29
(+, +, −) 16.5 −5.25 −5 −0.5 +1.25 −1.25 +2.5 +0.75 = 9
(−, +, −) 16.5 +5.25 −5 −0.5 −1.25 +1.25 +2.5 −0.75 = 18
(+, −, −) 16.5 −5.25 +5 −0.5 −1.25 −1.25 −2.5 −0.75 = 10
(−, −, −) 16.5 +5.25 +5 −0.5 +1.25 +1.25 −2.5 +0.75 = 27
■
We would like to extend our discussion to four or more factors, but then the nice
geometric interpretation we have given in the previous examples becomes impossible.
Fortunately, there is another way to calculate the effects that applies to any number
of factors. The process is called the Yates algorithm after Frank Yates, a famous
statistician who discovered it.
YATES ALGORITHM
The calculation of the effects in our examples depends entirely upon the signs of the
factors. To calculate the R effect, for example, one would only need to know the sign
of R for each of the observations and use the mean of those observations for which R
is positive and also the mean of the observations for which the sign of R is negative.
So all the effects could be calculated from a table showing the signs of the factors.
It is also possible to calculate all the effects from a table using the Yates algorithm.
To do this, the observations must be put in what we will call a natural order. Although
the order of the factors is of no importance, we will follow the order S, R, B as we have
done previously. We make a column for each of the main factors and make columns of
the signs as follows. Under S, make a column starting with the signs −, +, −, +, · · · ;
under R, make a column with the pattern −, −, +, +, −, −, +, +, · · · ; finally under
B, make a column −, −, −, −, +, +, +, +, −, · · · . If there are more factors, the next
column starts with eight minus signs followed by eight plus signs and this pattern is
repeated; the next factor would begin with 16 minus signs followed by 16 plus signs
and this pattern continues. The columns are of course long enough to accommodate
each of the observations. The result in our example is shown in Table 15.3.
Table 15.3
S R B Observations 1 2 3 Effect (÷8)
− − − 27 37 64 132 16.5 μ
+ − − 10 27 68 −42 −5.25 S
− + − 18 49 −26 −40 −5 R
+ + − 9 19 −16 10 1.25 SR
− − + 29 −17 −10 4 0.50 B
+ − + 20 −9 −30 10 1.25 SB
− + + 13 −9 8 −20 −2.50 RB
+ + + 6 −7 2 −6 −0.75 SBR
To calculate the effects, proceed as follows. Going down the column of observations, consider the observations in pairs. Add these to find the first entries in the column labeled 1. Here we find 27 + 10 = 37, 18 + 9 = 27, 29 + 20 = 49, and 13 + 6 = 19.
Now consider the same pairs but subtract the top entry from the bottom entry to find
10 −27 = −17, 9 −18 = −9, 20 −29 = −9, and 6 −13 = −7. This completes
column 1.
Now perform exactly the same calculations on the entries in column 1 to find
column 2. Finally, follow the same pattern on the entries of column 2 to find the
entries in column 3. The effects are found by dividing the entries in column 3 by 8.
We find the same model here that we found above using geometry.
Note that the factors described in the rightmost column can be determined from
the + signs given in the columns beneath the main factors.
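The Yates algorithm is also easy to program for any number of factors. A sketch, assuming Python (the function name yates is ours):

```python
def yates(observations):
    """Yates algorithm for a 2**k factorial experiment. The observations must
    be supplied in the natural (standard) order described in the text; the
    result lists the mean followed by the effects in standard order."""
    col = list(observations)
    n = len(col)
    k = n.bit_length() - 1                  # number of factors, n = 2**k
    for _ in range(k):
        sums = [col[i] + col[i + 1] for i in range(0, n, 2)]
        diffs = [col[i + 1] - col[i] for i in range(0, n, 2)]
        col = sums + diffs
    return [c / n for c in col]

# The 2**3 computer experiment of Table 15.3:
print(yates([27, 10, 18, 9, 29, 20, 13, 6]))
# [16.5, -5.25, -5.0, 1.25, 0.5, 1.25, -2.5, -0.75]
```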
We want to show an example with four factors using the Yates algorithm since
the geometry is impossible, but first we make some comments on the experimental
design and introduce some notation.
RANDOMIZATION AND SOME NOTATION
Each of our examples is an instance of what are called full factorial experiments. These experiments make observations for each combination of each level of each factor; no combinations are omitted. In our examples, these are usually denoted by the symbol 2ᵏ, the 2 indicating that each factor is at 2 levels and k indicating the number of factors. Example 15.1 is then a 2² experiment while Example 15.2 is a 2³ experiment.
Factorial experiments are very efficient in the sense that although the factors are observed together and would appear to be hopelessly mixed up, they can be shown to be equivalent to one-factor-at-a-time experiments where each factor and their interactions are observed separately 2ᵏ times! Of course, one-factor-at-a-time experiments using only the main effects cannot draw conclusions about interactions since they are never observed.
Both the design and the occurrence of randomization are keys to the statistical
analysis of an experiment. In factorial experiments, the order of the observations
should be determined by some random scheme including repetitive observations for
the same factor combinations should these occur.
Now we show an example with four factors.
EXAMPLE 15.3 A 2⁴ Factorial Experiment
In Statistics for Experimenters by George Box, William Hunter, and J. Stuart Hunter, a chemical
process development study is described using four main factors. These are A: catalyst charge
(in pounds), B: temperature (in degrees centigrade), C: pressure (in pounds per square inch),
and D: concentration (in percentage).
The chemical process is being created and the experimenters want to know what factors
and their interactions should be considered when the actual process is implemented. The results
are shown in Table 15.4.
Table 15.4
A B C D Observations 1 2 3 4 Effect (÷16)
− − − − 71 132 304 600 1156 72.25 μ
+ − − − 61 172 296 556 −64 −4 A
− + − − 90 129 283 −32 192 12 B
+ + − − 82 167 273 −32 8 0.50 AB
− − + − 68 111 −18 78 −18 −1.125 C
+ − + − 61 172 −14 114 6 0.375 AC
− + + − 87 110 −17 2 −10 −0.625 BC
+ + + − 80 163 −15 6 −6 −0.375 ABC
− − − + 61 −10 40 −8 −44 −2.75 D
+ − − + 50 −8 38 −10 0 0 AD
− + − + 89 −7 61 4 36 2.25 BD
+ + − + 83 −7 53 2 4 0.25 ABD
− − + + 59 −11 2 −2 −2 −0.125 CD
+ − + + 51 −6 0 −8 −2 −0.125 ACD
− + + + 85 −8 5 −2 −6 −0.375 BCD
+ + + + 78 −7 1 −4 −2 −0.125 ABCD
This gives the linear model
    Observations = 72.25 − 4A + 12B + 0.50AB − 1.125C + 0.375AC − 0.625BC − 0.375ABC
                   − 2.75D + 0AD + 2.25BD + 0.25ABD − 0.125CD − 0.125ACD − 0.375BCD − 0.125ABCD
As in the previous examples, the model predicts each of the observations exactly.
■
CONFOUNDING
It is often true that the effects for the higher order interactions become smaller in ab-
solute value as the number of factors in the interaction increases. So these interactions
have something, but often very little, to do with the prediction of the observed values.
If it is not necessary to estimate some of these higher order interactions, some very
substantial gains can be made in the experiment, for then we do not have to observe
all the combinations of factor levels! This then decreases the size, and hence the cost,
of the experiment without substantially decreasing the information to be gained from
the experiment.
We show this through an example.
EXAMPLE 15.4 Confounding
Consider the data in Example 15.3 again, as given in Table 15.4, but suppose that we have only
the observations for which ABCD = +1. This means that we have only half the data given in
Table 15.4. We show these data in Table 15.5, where we have arranged the data in standard
order for the three effects A, B, and C (since if the signs for these factors are known, then the
sign of D is determined).
Table 15.5
A B C D Observations 1 2 3 Effect (÷8)
− − − − 71 121 292 577 72.125 μ+ABCD
+ − − + 50 171 285 −35 −4.375 A+BCD
− + − + 89 120 −28 95 11.875 B+ACD
+ + − − 82 165 −7 3 0.375 AB+CD
− − + + 59 −21 50 −7 −0.875 C+ABD
+ − + − 61 −7 45 21 2.625 AC+BD
− + + − 87 2 14 −5 −0.625 BC+AD
+ + + + 78 −9 −11 −25 −3.125 ABC+D
Note that the effects are not the same as those found in Table 15.4. To the mean μ the
effect ABCD has been added to find μ +ABCD = 72.25 −0.125 = 72.125. Similarly, we
find A +BCD = −4 −0.375 = −4.375. The other results in Table 15.5 may be checked in
a similar way. Since we do not have all the factor level combinations, we would not expect to
find the results we found in Table 15.4. The effects have all been somewhat interfered with or
confounded. However, there is a pattern. To each effect has been added the effect or interaction
that is missing from ABCD. For example, if we consider AB, then CD is missing and is added
to AB. The formula ABCD = +1 is called a generator.
One could also confound by using the generator ABCD = −1. Then one would find
μ −ABCD, A −BCD, and so on. We will not show the details here.
Note also that if the generator ABCD = +1 is used, the experiment is a full factorial
experiment in factors A, B, and C.
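The confounding pattern can be checked by machine. A sketch, assuming Python and reusing the yates() function from the earlier sketch (an assumption; it is not restated here):

```python
# Build the ABCD = +1 half fraction of Table 15.4 and compute the confounded effects.
obs = [71, 61, 90, 82, 68, 61, 87, 80,      # Table 15.4 observations,
       61, 50, 89, 83, 59, 51, 85, 78]      # in standard (Yates) order

half = []
for i, y in enumerate(obs):
    bits = [(i >> j) & 1 for j in range(4)]               # levels of A, B, C, D (0 means -1)
    signs = [2 * b - 1 for b in bits]
    if signs[0] * signs[1] * signs[2] * signs[3] == +1:   # keep the ABCD = +1 half
        abc_index = bits[0] + 2 * bits[1] + 4 * bits[2]   # standard order in A, B, C
        half.append((abc_index, y))

half_obs = [y for _, y in sorted(half)]
print(half_obs)         # [71, 50, 89, 82, 59, 61, 87, 78], as in Table 15.5
print(yates(half_obs))  # [72.125, -4.375, 11.875, 0.375, -0.875, 2.625, -0.625, -3.125]
```

This reproduces the confounded effects of Table 15.5, with C + ABD equal to −0.875.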
Each time a factor is added to a full factorial experiment where each factor is at two levels,
the number of factor level combinations is doubled, greatly increasing the cost and the time to
carry out the experiment. If we can allow the effects to be confounded, then we can decrease
the size and the cost of the experiment.
In this example, the confounding divides the size of the experiment in half and so is
called a half fraction of a 2⁴ factorial experiment and is denoted as a 2⁴⁻¹ factorial experiment to distinguish this from a full 2³ experiment. These experiments are known as
fractional factorial experiments. These experiments can give experimenters much informa-
tion with great efficiency provided, of course, that the confounded effects are close to the true
effects. This is generally true for confounding using higher order interactions.
Part of the price to be paid here is that all the effects, including the main effects, are
confounded. It is generally poor practice to confound main effects with lower order interactions.
If we attempt to confound the experiment described in Example 15.2 by using the generator
SBR = +1, we find the results given in Table 15.6.
Table 15.6
S B R Observations (1) (2) Effect (÷4)
− + − 29 39 63 15.75 μ +SBR
− − + 10 24 −31 −7.75 S +BR
+ − − 18 −19 −15 −3.75 R +SB
+ + + 6 −12 7 1.75 B +SR
The problem here is that the main effects are confounded with second-order interactions.
This is generally a very poor procedure to follow. Fractional factorial experiments are only
useful when the number of factors is fairly large resulting in the main effects being confounded
with high-order interactions. These high-order interactions are normally small in absolute value
and are very difficult to interpret in any event. When we do have a large number of factors,
however, then fractional factorial experiments become very useful and can reduce the size of
an experiment in a very dramatic fashion. In that case, multiple generators may define the
experimental procedure leading to a variety of confounding patterns.
■
MULTIPLE OBSERVATIONS
We have made only a single observation for each combination of factor levels in each
of our examples. In reality, one would make multiple observations whenever possible.
This has the effect of increasing the accuracy of the estimation of the effects, but we
will not explore that in detail here. We will show an example where we have multiple
observations; to do this, we return to Example 15.1, where we studied the effects of
the factors Speed and RAM on computer performance. In Table 15.7, we have used
three observations for each factor level combination.
Our additional observations happen to leave the means of each cell, as well as
the overall mean, unchanged. Now what use is our linear model that was
Observation = 16 −6.5 · Speed −2.5 · RAM+2 · Speed · RAM?
Table 15.7
Speed
RAM −1 +1
−1 27,23,31 10,8,12
+1 18,22,14 9,11,7
(16)
This linear model predicts the mean for each factor combination but not the individual
observations. To predict the individual observations, we add a random component ε
to the model to find
Observation = 16 −6.5 · Speed −2.5 · RAM+2 · Speed · RAM+ε
The values of ε will then vary with the individual observations. For the purpose of
statistical inference, which we cannot consider here, it is customary to assume that
the random variable ε is normally distributed with mean 0, but a consideration of that
belongs in a more advanced course.
DESIGN MODELS AND MULTIPLE REGRESSION
MODELS
The linear models developed here are also known as multiple regression models. If
a multiple regression computer program is used with the data given in any of our
examples and the main effects and interactions are used as the independent variables,
then the coefficients found here geometrically are exactly the coefficients found using
the multiple regression program. The effects found here then are exactly the same as
those found using the principle of least squares.
TESTING THE EFFECTS FOR SIGNIFICANCE
We have calculated the effects in factorial designs, and we have examined their size,
but we have not determined whether these effects have statistical significance. We
show how this is done for the results in Example 15.3, the 2⁴ factorial experiment.
The size of each effect is shown in Table 15.8.
The effects can be regarded as Student t random variables. To test any of the
effects for statistical significance, we must determine the standard error for each
effect. To do this, we first determine a set of effects that will be used to calculate this
standard error. Here, let us use the third- and fourth-order interactions; we will then
test each of the main effects and second-order interactions for statistical significance.
We proceed as follows.
Table 15.8
Size Effect
72.25 μ
−4 A
12 B
0.50 AB
−1.125 C
0.375 AC
−0.625 BC
−0.375 ABC
−2.75 D
0 AD
2.25 BD
0.25 ABD
−0.125 CD
−0.125 ACD
−0.375 BCD
−0.125 ABCD
First, find a quantity called a sum of squares that is somewhat similar to the sums of squares we used in Chapter 13. This is the sum of the squares of the effects that will be used to determine the standard error. Here this is

    SS = (ABC)² + (ABD)² + (ACD)² + (BCD)² + (ABCD)²
       = (−0.375)² + (0.250)² + (−0.125)² + (−0.375)² + (−0.125)² = 0.375
We define the degrees of freedom (df) as the number of effects used to find the sum of squares. This is 5 in this case.
The standard error is the square root of the mean square of the effects. The formula for finding the standard error is

    Standard error = √(SS/df)

We find here that the standard error = √(0.375/5) = 0.27386. We can then find a Student t variable with df degrees of freedom.
To choose an example, we test the hypothesis that the effect AB is 0:

    t₅ = (AB − 0)/Standard error = (0.5000 − 0)/0.27386 = 1.8258
The p-value for the test is the probability that this value of t is exceeded or that t is
at most −1.8258. This is 0.127464, so we probably would not decide that this is a
statistically significant effect.
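The same test is easy to reproduce by machine. A sketch, assuming Python with SciPy (the function name p_value is ours):

```python
from math import sqrt
from scipy import stats

# Standard error built from the third- and fourth-order interaction effects.
higher_order = [-0.375, 0.250, -0.125, -0.375, -0.125]   # ABC, ABD, ACD, BCD, ABCD
df = len(higher_order)                                    # 5
se = sqrt(sum(e ** 2 for e in higher_order) / df)         # about 0.27386

def p_value(effect):
    """Two-sided p-value for the hypothesis that the true effect is zero."""
    t = effect / se
    return 2 * stats.t.sf(abs(t), df)

print(round(p_value(0.50), 6))    # AB: about 0.127
print(round(p_value(12.0), 10))   # B:  about 1.2e-07
```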
It is probably best to use a computer program or a statistical calculator to de-
termine the p-values since only crude estimates of the p-values can be made using
tables. We used the computer algebra program Mathematica to determine the t values
and the corresponding p-values in Table 15.9.
Table 15.9
Effect    Size      t₅         p-Value
μ         72.25     263.821    1.48·10⁻¹¹
A         −4        −14.606    2.72·10⁻⁵
B         12        43.818     1.17·10⁻⁷
AB        0.50      1.826      0.127
C         −1.125    −4.108     0.009
AC        0.375     1.369      0.229
BC        −0.625    −2.282     0.071
D         −2.75     −10.042    1.68·10⁻⁴
AD        0         0          1.00
BD        2.25      8.216      4.34·10⁻⁴
CD        −0.125    −0.456     0.667
Using the critical value of 0.05 as our level of significance, we would conclude
that μ, A, B, C, D, and BD are of statistical significance while the other interactions
can be safely ignored. This may have important consequences for the experimenter
as future experiments are planned.
Statistical computer programs are of great value in analyzing experimental de-
signs. Their use is almost mandatory when the number of main factors is 5 or more.
These programs also provide graphs that can give great insight into the data. For
example, the statistical computer program Minitab generates a graph of the main ef-
fects and interactions for the example we have been considering, which is shown in
Figure 15.6.
Significant effects are those that are not close to the straight line shown. While
the mean is not shown, we would draw the same conclusions from the graph as we
did above.
Most experiments have more than one observation for each factor level com-
bination. In that case, the determination of the statistical significance of the factors
and interactions is much more complex than in the case we have discussed. This is
a topic commonly considered in more advanced courses in statistics. Such courses
might also consider other types of experimental designs that occur in science and
engineering.
Figure 15.6 Normal probability plot of the effects (response is Obs, Alpha = 0.05), with A, B, D, and BD flagged as significant; Lenth's PSE = 1.125.
CONCLUSIONS
This has been a very brief introduction to the design of experiments. Much more is
known about this subject and the interested reader is referred to more advanced books
on this subject. We have made use of the geometry here in analyzing experimental
data since that provides a visual display of the data and the conclusions we can draw
from the data.
EXPLORATIONS
1. Check that the model given after Table 15.4 predicts each of the observations
exactly.
2. Show that collapsing the cubes shown in Figure 15.3 to squares leads to the
determination of the two-factor interactions.
3. In Example 15.4, confound using ABCD = −1 and analyze the resulting data.
4. The following data represent measurements of the diameter of a product pro-
duced at two different plants and on three different shifts.
             Shift 1       Shift 2       Shift 3
Plant 1      66.2, 64.7    66.3, 65.7    64.2, 64.7
Plant 2      65.3, 61.5    65.4, 63.4    66.2, 67.2
Analyze the data and state any conclusions that can be drawn. Find all the main
effects and the interactions and show how the data can be predicted using a linear
model.
Chapter 16
Recursions and Probability
CHAPTER OBJECTIVES:
• to learn about recursive functions
• to apply recursive functions to permutations and combinations
• to use recursive functions to find the expected value of the binomial distribution
• to learn how to gamble (perhaps wisely)
• to consider the occurrence of HH in coin tossing.
INTRODUCTION
Since they are very useful in probability, we consider functions whose values depend
upon other values of the same function. Such functions are called recursive. These
functions are investigated here in general, and then we show their application to
probability and probability distribution functions.
EXAMPLE 16.1 A General Recursion
We begin with a nonprobability example. Suppose we define a function on the positive integers,
f(x), where
f(x + 1) = 2f(x), x = 1, 2, 3, . . .
If we have a starting place, say f(1) = 1, then we can determine any of the subsequent
values for the function. For example,
f(2) = 2f(1) = 2 · 1 = 2, f(3) = 2f(2) = 2 · 2 = 4, f(4) = 2 · f(3) = 2 · 4 = 8
and so on. It is easy to see here, since subsequent values of f are twice that of the preceding
value, that the values of f are powers of 2, and that in fact
f(x) = 2ˣ⁻¹ for x = 2, 3, 4, . . .
The relationship f(x +1) = 2f(x) is called a recursion or difference equation, and
the formula f(x) = 2ˣ⁻¹ is its solution. The solution was probably evident from the start.
To verify that f(x) = 2ˣ⁻¹ for x = 2, 3, 4, . . . is the solution, note that if f(x) = 2ˣ⁻¹, then f(x + 1) = 2ˣ = 2 · 2ˣ⁻¹ = 2f(x).
Finding solutions for recursions, however, is not always so easy. Consider
f(x +1) = 2f (x) +f(x −1) for x = 1, 2, 3 . . .
The solution is far from evident. Let us start by finding some of the values for f, but this
time we need two starting points, say f(0) = 1 and f(1) = 2. Then, applying the recursion,
we find
f(2) = 2f(1) +f(0) = 2 · 2 +1 = 5
f(3) = 2f(2) +f(1) = 2 · 5 +2 = 12
f(4) = 2f(3) +f(2) = 2 · 12 +5 = 29,
and so on.
■
It is easy to write a computer program to produce a table of these results.
Analytic methods exist to produce solutions for the recursions we consider here,
and while we will not explore them, we invite the reader to check that the solu-
tions we present are, in fact, solutions of the recursions. We did this in our first
example.
In this case, the solution of the recursion is
    f(x) = [(1 + √2)ˣ⁺¹ − (1 − √2)ˣ⁺¹]/(2√2),   x = 0, 1, 2, . . .
Those √2’s look troublesome at first glance, but they all disappear! Table 16.1
shows some of the values of f(x) obtained from this solution.
Table 16.1
x f(x)
0 1
1 2
2 5
3 12
4 29
5 70
6 169
7 408
8 985
9 2378
10 5741
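As noted above, a short program produces this table, and much more, immediately. A sketch, assuming Python:

```python
# The recursion f(x+1) = 2 f(x) + f(x-1) with f(0) = 1 and f(1) = 2.
def f(n):
    a, b = 1, 2            # f(0), f(1)
    for _ in range(n):
        a, b = b, 2 * b + a
    return a

print([f(x) for x in range(11)])   # 1, 2, 5, 12, 29, 70, 169, 408, 985, 2378, 5741
print(f(100))                      # the 39-digit value quoted in the text
```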
The values now increase rather rapidly. A computer program finds that
f(100) = 161, 733, 217, 200, 188, 571, 081, 311, 986, 634, 082, 331, 709
It is difficult to think about computing this in any other way. Now we turn to
probability, the subject of all our remaining applications.
EXAMPLE 16.2 Permutations
An easy application of recursions is with permutations—arrangements of objects in a row. If we have n distinct items to arrange, suppose we denote the number of distinct permutations by ₙPₙ. Since we know there are n! possible permutations, it follows that

    ₙPₙ = n!

It is also clear that

    n! = n · (n − 1)!

so

    ₙPₙ = n · ₙ₋₁Pₙ₋₁,   a recursion

Since we know that 1! = 1, we can use the above result to find all subsequent values. Ordinarily, in finding 4! we would have to multiply 4 · 3 · 2 · 1, and in finding 5! we would have to calculate 5 · 4 · 3 · 2 · 1. It is much easier and faster to use the facts that 4! = 4 · 3! and 5! = 5 · 4! since it is easy to calculate 3! and 4!.
■
It is easy to continue and produce a table of factorials very quickly. We now show more examples of recursions and their uses in probability.
EXAMPLE 16.3 Combinations
A combination is the number of distinct samples, say of size r, that can be selected from a set of n distinct items. This number is denoted by C(n, r). We know that

    C(n, r) = n!/[r! · (n − r)!],   where 0 ≤ r ≤ n

To find a recursion, note that

    C(n, r + 1) = n!/[(r + 1)! · (n − r − 1)!]

so

    C(n, r + 1)/C(n, r) = n!/[(r + 1)! · (n − r − 1)!] · r!(n − r)!/n! = (n − r)/(r + 1)

which we can write in recursive form as

    C(n, r + 1) = (n − r)/(r + 1) · C(n, r)
■
Using this recursion will allow us to avoid calculation of any factorials! To see
this, note that \binom{n}{1} = n. So it follows that
\binom{n}{2} = [(n − 1)/(1 + 1)] · \binom{n}{1}
or
\binom{n}{2} = [(n − 1)/2] · n = n(n − 1)/2
We can continue this by calculating
\binom{n}{3} = [(n − 2)/(2 + 1)] · \binom{n}{2}
or
\binom{n}{3} = [(n − 2)/3] · [n(n − 1)/2] = n(n − 1)(n − 2)/(3 · 2)
and
\binom{n}{4} = [(n − 3)/(3 + 1)] · \binom{n}{3}
or
\binom{n}{4} = [(n − 3)/4] · [n(n − 1)(n − 2)/(3 · 2)] = n(n − 1)(n − 2)(n − 3)/(4 · 3 · 2)
To take a specific example, suppose n = 5. Since \binom{5}{1} = 5, the recursion can be
used repeatedly to find
\binom{5}{2} = [(5 − 1)/(1 + 1)] · \binom{5}{1}
or
\binom{5}{2} = (4/2) · 5 = 10
and
\binom{5}{3} = [(5 − 2)/(2 + 1)] · \binom{5}{2}
or
\binom{5}{3} = (3/3) · 10 = 10
and we could continue this. When the numbers become large, it is a distinct advantage
not to have to calculate the large factorials involved. We continue now with some
probability distribution functions.
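Before moving on, here is a minimal sketch (Python assumed, not from the text) of the combination recursion in code; starting from \binom{n}{1} = n, a whole row of binomial coefficients is produced without computing a single factorial:

```python
# Generate C(n, r) for r = 1, ..., n from the recursion
# C(n, r + 1) = (n - r)/(r + 1) * C(n, r), starting at C(n, 1) = n.
from fractions import Fraction

def binomial_row(n):
    row = [Fraction(n)]                   # C(n, 1) = n
    for r in range(1, n):
        row.append(row[-1] * Fraction(n - r, r + 1))
    return [int(c) for c in row]

print(binomial_row(5))                    # [5, 10, 10, 5, 1]
```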
EXAMPLE 16.4 Binomial Probability Distribution Function
It is frequently the case in probability that one value of a probability distribution function can
be found using some other value of the probability distribution function. Consider the binomial
probability distribution function as an example.
For the binomial probability distribution function, we know
f(x) = P(X = x) = \binom{n}{x} p^x q^{n−x}, x = 0, 1, . . . , n
Now
P(X = x + 1) = \binom{n}{x+1} p^{x+1} q^{n−(x+1)}
so if we divide P(X = x + 1) by P(X = x), we find
P(X = x + 1) / P(X = x) = [\binom{n}{x+1} p^{x+1} q^{n−(x+1)}] / [\binom{n}{x} p^x q^{n−x}] = [n! / ((x + 1)!(n − x − 1)!)] · [x!(n − x)! / n!] · (p/q)
which simplifies to
P(X = x + 1) / P(X = x) = [(n − x)/(x + 1)] · (p/q)
and this can be written as
P(X = x + 1) = [(n − x)/(x + 1)] · (p/q) · P(X = x)
The recursion is very useful; for example, we know that P(X = 0) = q^n. The recursion,
using x = 0, then tells us that
P(X = 1) = [(n − 0)/(0 + 1)] · (p/q) · P(X = 0)
P(X = 1) = n · (p/q) · q^n = n · p · q^{n−1}
which is the correct result for X = 1.
This can be continued to find P(X = 2). The recursion tells us that
P(X = 2) = [(n − 1)/(1 + 1)] · (p/q) · P(X = 1)
P(X = 2) = [(n − 1)/2] · (p/q) · n · p · q^{n−1}
P(X = 2) = [n(n − 1)/2] · p^2 · q^{n−2}
P(X = 2) = \binom{n}{2} · p^2 · q^{n−2}
and again this is the correct result for P(X = 2).
∎
We can continue in this way, creating all the values of the probability distribution
function.
The advantage in doing this is that the quantities occurring in the values of the
probability distribution function do not need to be calculated each time. The value,
for example, of n! is never calculated at all. To take a specific example, suppose that
p = 0.6 so that q = 0.4 and that n = 12.
Letting X denote the number of successes in 12 trials, we start with
P(X = 0) = 0.4^{12} = 0.000016777
The recursion is
P(X = x + 1) = [(n − x)/(x + 1)] · (p/q) · P(X = x)
which in this case is
P(X = x + 1) = [(12 − x)/(x + 1)] · (0.6/0.4) · P(X = x)
so
P(X = 1) = 12 · 1.5 · 0.000016777 = 0.00030199
and
P(X = 2) = (11/2) · 1.5 · 0.00030199 = 0.0024914
We can continue and find the complete probability distribution function in
Table 16.2.
Table 16.2
x P(X = x)
0 0.000017
1 0.000302
2 0.002491
3 0.012457
4 0.042043
5 0.100902
6 0.176579
7 0.227030
8 0.212841
9 0.141894
10 0.063852
11 0.017414
12 0.002177
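A table such as this one can be produced by iterating the recursion from P(X = 0) = q^n; the sketch below assumes Python and is not part of the text:

```python
# Binomial probabilities for n = 12, p = 0.6 from the recursion
# P(X = x + 1) = (n - x)/(x + 1) * (p/q) * P(X = x), with P(X = 0) = q**n.

def binomial_pdf(n, p):
    q = 1.0 - p
    prob = q ** n                         # P(X = 0)
    table = [prob]
    for x in range(n):
        prob *= (n - x) / (x + 1) * (p / q)
        table.append(prob)
    return table

for x, pr in enumerate(binomial_pdf(12, 0.6)):
    print(x, round(pr, 6))                # should agree with Table 16.2
```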
EXAMPLE 16.5 Finding the Mean of the Binomial
We actually used a recursion previously in finding the mean and the variance of the negative
hypergeometric distribution in Chapter 7.
One interesting application of the recursion
P(X = x + 1) = [(n − x)/(x + 1)] · (p/q) · P(X = x)
lies in finding the mean of the binomial probability distribution. Rearranging the recursion and
summing the recursion from 0 to n − 1 gives
\sum_{x=0}^{n−1} q(x + 1)P(X = x + 1) = \sum_{x=0}^{n−1} p(n − x)P(X = x)
which can be written and simplified as
qE[X] = np \sum_{x=0}^{n−1} P(X = x) − p \sum_{x=0}^{n−1} x · P(X = x)
which becomes
qE[X] = np[1 − P(X = n)] − p[E[X] − nP(X = n)]
or
qE[X] = np − np · P(X = n) − pE[X] + np · P(X = n)
and this simplifies easily to
E[X] = np
∎
This derivation is probably no simpler than the standard derivation that evaluates
E[X] = \sum_{x=0}^{n} x · P(X = x),
but it is shown here since it can be used with any discrete probability distribution,
usually providing a derivation easier than the direct calculation.
EXAMPLE 16.6 Hypergeometric Probability Distribution Function
We show one more discrete probability distribution and a recursion.
Suppose that a lot of N manufactured products contains D items that are special in some
way. The sample is of size n. Let the random variable X denote the number of special items in
the sample, which is selected without replacement; that is, a sampled item is not returned to the
lot before the next item is drawn. The probability distribution function is
P(X = x) = [\binom{D}{x} · \binom{N−D}{n−x}] / \binom{N}{n}, x = 0, 1, 2, . . . , Min[n, D]
A recursion is easily found since, by considering P(X = x)/P(X = x − 1) and simplifying,
we find
P(X = x) = [(D − x + 1)(n − x + 1) / (x(N − D − n + x))] · P(X = x − 1)
To choose a specific example, suppose N = 100, D = 20, and n = 10.
Then
P(X = 0) = (N − D)!(N − n)! / ((N − D − n)! N!) = 80! · 90! / (70! · 100!) = (80 · 79 · 78 · · · 71) / (100 · 99 · 98 · · · 91) = 0.0951163
Applying the recursion,
P(X = 1) = [D · n / (N − D − n + 1)] · P(X = 0) = (20 · 10 / 71) · 0.0951163 = 0.267933
and
P(X = 2) = [(D − 1)(n − 1) / (2(N − D − n + 2))] · P(X = 1) = (19 · 9 / (2 · 72)) · 0.267933 = 0.318171
∎
This could be continued to give all the values of the probability distribution
function. Note that although the definition of P(X = x) involves combinations, and
hence factorials, we never computed a factorial!
The recursion can also be used, in a way entirely similar to that we used with
the binomial distribution, to find that the mean value is E[X] = (n · D)/N. Variances
can also be produced this way.
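As a sketch (Python assumed, not from the text), the hypergeometric recursion with N = 100, D = 20, and n = 10 can be iterated the same way, with P(X = 0) found as a product of ratios rather than from factorials:

```python
# Hypergeometric probabilities from the recursion
# P(X = x) = (D - x + 1)(n - x + 1) / (x (N - D - n + x)) * P(X = x - 1).

def hypergeometric_pdf(N, D, n):
    p0 = 1.0                              # P(X = 0) as a product of ratios
    for i in range(n):
        p0 *= (N - D - i) / (N - i)
    table = [p0]
    for x in range(1, min(n, D) + 1):
        table.append(table[-1] * (D - x + 1) * (n - x + 1) / (x * (N - D - n + x)))
    return table

probs = hypergeometric_pdf(100, 20, 10)
print([round(p, 6) for p in probs[:3]])   # about [0.095116, 0.267933, 0.318171]
```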
EXAMPLE 16.7 Gambler’s Ruin
We now turn to an interesting probability situation, that of the Gambler’s Ruin.
Two players, A and B, play a gambling game until one player is out of money; that is, the
player is ruined. Suppose that $1 is gained or lost at each play, A starts with $a, B starts
with $b, and a + b = N. Let p_k be the probability that A (or B) is ruined with a fortune of $k.
Suppose that the player has $k and the probability of winning $1 on any given play is p
and the probability of losing on any given play is then 1 − p = q. If the player wins on the
next play, then his or her fortune is $(k + 1), while if the next play produces a loss, then his or
her fortune is $(k − 1). So
p_k = p · p_{k+1} + q · p_{k−1}, where p_0 = 1 and p_N = 0
The solution of this recursion depends on whether p and q are equal or not.
If p ≠ q, then the solution is
p_k = [(q/p)^k − (q/p)^N] / [1 − (q/p)^N] for k = 0, 1, 2, . . . , N
If p = q, then the solution is
p_k = 1 − k/N for k = 0, 1, 2, . . . , N
Let us consider the fair game (where p = q = 1/2) and player A whose initial fortune is
$a. The probability the player is ruined is then
p_a = 1 − a/N = 1 − a/(a + b) = b/(a + b) = 1 / (a/b + 1)
This means that if A is playing against a player with a relatively large fortune (so b ≫ a),
then p_a → 1 and the player faces almost certain ruin. The only question is how long the
player will last. This is the case in casino gambling where the house's fortune is much greater
than that of the player. Note that this is for playing a fair game, which most casino games
are not.
Now let us look at the case where p ≠ q, where the solution is
p_k = [(q/p)^k − (q/p)^N] / [1 − (q/p)^N] for k = 0, 1, 2, . . . , N
The probability of ruin now depends on two ratios: q/p as well as a/b. Individual
calculations are not difficult and some results are given in Table 16.3. Here, we have selected
q = 15/28 and p = 13/28, so the game is slightly unfair to A.
So the probability of ruin is quite high almost without regard for the relative fortunes. A
graph of the situation is also useful and is shown in Figure 16.1, where p denotes the probability
that the favored player, here B, wins.
∎
Table 16.3
$a $b p_a
15 10 0.782808
15 20 0.949188
20 10 0.771473
20 20 0.945937
25 20 0.944355
25 25 0.972815
30 25 0.972426
30 30 0.986521
35 30 0.986427
35 35 0.993364
40 35 0.993341
40 40 0.996744
Figure 16.1 [plot of p against the fortunes a and b]
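The entries of Table 16.3 are easy to reproduce from the unequal-probability formula; the following sketch (Python assumed, exact fractions used to avoid rounding) is an illustration, not part of the text:

```python
# Probability that A, starting with $a against an opponent with $b, is
# ruined when the probability of winning each $1 play is p (here p != q).
from fractions import Fraction

def ruin_probability(a, b, p):
    q = 1 - p
    N = a + b
    r = q / p
    return float((r ** a - r ** N) / (1 - r ** N))

p = Fraction(13, 28)                      # the game slightly unfair to A
print(round(ruin_probability(15, 10, p), 6))   # about 0.782808
print(round(ruin_probability(40, 40, p), 6))   # about 0.996744
```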
EXAMPLE 16.8 Finding a Pattern in Binomial Trials
It is interesting to explore patterns when binomial trials are performed. A perfect model for this is
tossing a coin, possibly a loaded one, and looking at the pattern of heads and tails. We previously
considered waiting for the first occurrence of the pattern HH in Chapter 7. In this example, we
look only for the occurrence of the pattern HH, not necessarily the first occurrence. First, we
need to define when this pattern “occurs”. Consider the sequence TTHHHTTTHHHHHH. Scan
the sequence from left to right and we see that HH occurs on the 4th flip. Then begin the
sequence all over again. The pattern occurs again on the 10th, 12th, and 14th trials and at no
other trials.
Let u_n denote the probability the sequence HH occurs at the nth trial. If a sequence ends
in HH, then either the pattern occurs at the nth trial or it occurs at the (n − 1)st trial and is
followed by H. So, since all possible sequences ending in HH have probability p^2, we have
the recursion
u_n + p · u_{n−1} = p^2 for n ≥ 2, u_1 = 0
The reader can check that the solution to this recursion is
u_n = [p / (1 + p)] · [p + (−p)^n] for n ≥ 2
Some of the values this gives are
u_2 = p^2;  u_3 = qp^2;  u_4 = p^2(p^2 − p + 1);  u_5 = [p/(1 + p)](p − p^5);
u_6 = [p/(1 + p)](p + p^6);  u_7 = [p/(1 + p)](p − p^7)
and so on, a really beautiful pattern.
Since (−p)^n becomes small as n becomes large, it is evident that u_n → p^2/(1 + p).
If p = 1/2, then u_n → 1/6. This limit occurs fairly rapidly as the values in Table 16.4
show.
∎
Table 16.4
n u_n
2 0.25
4 0.1875
6 0.171875
8 0.167969
10 0.166992
12 0.166748
14 0.166687
16 0.166672
18 0.166668
20 0.166667
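A short sketch (Python assumed, not from the text) iterates the recursion in the form u_n = p^2 − p · u_{n−1} and shows the approach to 1/6 for a fair coin:

```python
# Probability u_n that the pattern HH is completed at trial n, from the
# recursion u_n + p * u_{n-1} = p**2 with u_1 = 0.

def pattern_probabilities(p, n_max):
    u = [None, 0.0]                       # u[1] = 0; index 0 is unused
    for n in range(2, n_max + 1):
        u.append(p ** 2 - p * u[n - 1])
    return u

u = pattern_probabilities(0.5, 20)
for n in range(2, 21, 2):
    print(n, round(u[n], 6))              # approaches 1/6, as in Table 16.4
```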
CONCLUSIONS
Recursions, or difference equations, are very useful in probability and can often be
used to model situations in which the formation of probability functions is challenging.
As we have seen, the recursions can usually be solved and then calculations made.
EXPLORATIONS
1. Verify that the solution for f(x + 1) = 2f(x) + f(x − 1), f(0) = 1, f(1) = 2 is
that given in the text.
2. Show how to use the recursion \binom{n}{r+1} = [(n − r)/(r + 1)] · \binom{n}{r}.
3. Establish a recursion relating \binom{n+1}{r} to \binom{n}{r} and show an application of the result.
4. Use Example 16.6 to find a recursion for the hypergeometric distribution and
use it to find its mean value.
5. The Poisson distribution is defined as f(x) = e^{−μ} μ^x / x! for x = 0, 1, 2, . . .
and μ > 0. Find the recursion for the values of f(x) and use it to establish the
fact that E(X) = μ.
Chapter 17
Generating Functions and the
Central Limit Theorem
CHAPTER OBJECTIVES:
• to develop the idea of a generating function here and show how these functions can be
used in probability modeling
• to use generating functions to investigate the behavior of sums
• to see the development of the central limit theorem.
EXAMPLE 17.1 Throwing a Fair Die
Let us suppose that we throw a fair die once. Consider the function
g(t) = (1/6)t + (1/6)t^2 + (1/6)t^3 + (1/6)t^4 + (1/6)t^5 + (1/6)t^6
and the random variable X denoting the face that appears.
The coefficient of t^k in g(t) is P(X = k) for k = 1, 2, . . . , 6. This is shown in
Figure 17.1.
Consider tossing the die twice with X_1 and X_2, the relevant random variables. Now
calculate [g(t)]^2. This is
[g(t)]^2 = (1/36)[t^2 + 2t^3 + 3t^4 + 4t^5 + 5t^6 + 6t^7 + 5t^8 + 4t^9 + 3t^{10} + 2t^{11} + t^{12}]
The coefficient of t^k is now P(X_1 + X_2 = k). A graph of this is interesting and is shown
in Figure 17.2.
This process can be continued, the coefficients of [g(t)]^3 giving the probabilities associated
with the sum when three fair dice are thrown. The result is shown in Figure 17.3.
This “normal-like” pattern continues. Figure 17.4 shows the sums when 24 fair dice are
thrown.
Since the coefficients represent probabilities, g[t] and its powers are called generating
functions.
∎
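The coefficients of [g(t)]^n can be computed by repeated polynomial multiplication, that is, by convolving the list of coefficients with itself; the sketch below assumes Python and is offered only as an illustration:

```python
# Coefficients of [g(t)]**2 for a fair die, found by convolving the
# coefficient list of g(t) with itself; die[k] is the probability of face k + 1.

def convolve(a, b):
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

die = [1 / 6] * 6                         # coefficients of t, t^2, ..., t^6
two_dice = convolve(die, die)             # coefficient of t**(k + 2) is P(X1 + X2 = k + 2)
print([round(36 * c) for c in two_dice])  # [1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]
```

Repeating the convolution gives the coefficients of [g(t)]^3, [g(t)]^4, and so on.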
Figure 17.1 [plot of Probability against X_1]
Figure 17.2 [plot of Probability against X_1 + X_2]
Figure 17.3 [plot of Probability against X_1 + X_2 + X_3]
Figure 17.4 [plot of Frequency against Sum]
MEANS AND VARIANCES
This behavior of sums is a consequence of the central limit theorem, which states
that the probability distribution of sums of independent random variables approaches
a normal distribution. If these summands, say X_1, X_2, X_3, . . . , X_n, have means
μ_1, μ_2, μ_3, . . . , μ_n and variances σ_1^2, σ_2^2, σ_3^2, . . . , σ_n^2, then
X_1 + X_2 + X_3 + · · · + X_n has expectation μ_1 + μ_2 + μ_3 + · · · + μ_n and variance
σ_1^2 + σ_2^2 + σ_3^2 + · · · + σ_n^2.
Our example illustrates this nicely. Each of the X_i's has the same uniform distribution
with μ_i = 7/2 and σ_i^2 = 35/12, i = 1, 2, 3, . . . , n.
So we find the following means and variances in Table 17.1 for various numbers
of summands.
Table 17.1
n μ σ^2
1 7/2 35/12
2 7 35/6
3 21/2 35/4
24 84 70
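The entries of Table 17.1 follow from the mean and variance of a single fair die, since means add and, for independent summands, variances add; a small check (Python assumed, not from the text):

```python
# Mean and variance of one fair die, and of the sum of n dice
# (means add, and variances add for independent summands).

faces = range(1, 7)
probs = [1 / 6] * 6

mu = sum(k * p for k, p in zip(faces, probs))                  # 7/2
var = sum((k - mu) ** 2 * p for k, p in zip(faces, probs))     # 35/12

for n in (1, 2, 3, 24):
    print(n, n * mu, n * var)             # reproduces the rows of Table 17.1
```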
EXAMPLE 17.2 Throwing a Loaded Die
The summands in the central limit theorem need not all have the same mean or variance.
Suppose the die is loaded and the generating function is
h(t) = t/10 + t^2/5 + t^3/20 + t^4/20 + t^5/5 + 2t^6/5
A graph of this probability distribution, with variable X again, is shown in Figure 17.5.
When we look at sums now, the normal-like behavior does not appear quite so soon.
Figure 17.6 shows the sum of three of the loaded dice.
But the normality does appear. Figure 17.7 shows the sum of 24 dice.
Figure 17.5 [plot of Frequency against X_1]
Figure 17.6 [plot of Frequency against X_1 + X_2 + X_3]
Figure 17.7 [plot of Frequency against Sum]
The pattern for the means and variances of the sums continues:
μ_1 = 17/4 and μ_24 = 102, while σ_1^2 = 279/80 and σ_24^2 = 837/10
∎
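As a check on these values, the mean and variance of the loaded die can be computed directly from the coefficients of h(t); the sketch below assumes Python and uses exact fractions:

```python
# Mean and variance of the loaded die whose generating function is
# h(t) = t/10 + t^2/5 + t^3/20 + t^4/20 + t^5/5 + 2 t^6/5.
from fractions import Fraction as F

faces = range(1, 7)
probs = [F(1, 10), F(1, 5), F(1, 20), F(1, 20), F(1, 5), F(2, 5)]

mu = sum(k * p for k, p in zip(faces, probs))                  # 17/4
var = sum((k - mu) ** 2 * p for k, p in zip(faces, probs))     # 279/80

print(mu, var)                            # 17/4 and 279/80
print(24 * mu, 24 * var)                  # 102 and 837/10 for the sum of 24 dice
```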
A NORMAL APPROXIMATION
Let us return now to the case of the fair die and the graph of the sum of 24 tosses of
the die as shown in Figure 17.4.
Since E[X_1] = 7/2 and Var[X_1] = 35/12, we know that
E[X_1 + X_2 + X_3 + · · · + X_24] = 24 · (7/2) = 84
and
Var[X_1 + X_2 + X_3 + · · · + X_24] = 24 · (35/12) = 70
The visual evidence in Figure 17.4 suggests that the distribution becomes very
normal-like, so we compare the probabilities when 24 dice are thrown with values
of a normal distribution with mean 84 and standard deviation √70 = 8.3666. A
comparison is shown in Table 17.2. S denotes the sum.
So the normal approximation is excellent.
This is yet another illustration of the central limit theorem. Normality, in fact,
occurs whenever a result can be considered as the sum of independent random vari-
ables. We find that many human characteristics such as height, weight, and IQ are
normally distributed. If the measurement can be considered to be the result of the sum
of factors, then the central limit theorem assures us that the result will be normal.
Table 17.2
S Probability Normal
72 0.0172423 0.0170474
73 0.0202872 0.0200912
74 0.0235250 0.0233427
75 0.0268886 0.0267356
76 0.0302958 0.0301874
77 0.0336519 0.0336014
78 0.0368540 0.036871
79 0.0397959 0.0398849
80 0.0423735 0.0425331
81 0.0444911 0.0447139
82 0.0460669 0.0463396
83 0.0470386 0.0473433
84 0.047367 0.0476827
85 0.0470386 0.0473433
86 0.0460669 0.0463396
87 0.0444911 0.0447139
88 0.0423735 0.0425331
89 0.0397959 0.0398849
90 0.0368540 0.036871
91 0.0336519 0.0336014
92 0.0302958 0.0301874
93 0.0268886 0.0267356
94 0.0235250 0.0233427
95 0.0202872 0.0200912
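The normal values in Table 17.2 appear to be the normal density with mean 84 and standard deviation √70 evaluated at S; the sketch below (Python assumed, not from the text) computes both the exact probability, by convolution, and that density for a few values of S:

```python
# Compare the exact distribution of the sum of 24 fair dice with the normal
# density having mean 84 and standard deviation sqrt(70).
import math

def convolve(a, b):
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

probs = [1 / 6] * 6
for _ in range(23):
    probs = convolve(probs, [1 / 6] * 6)  # sums of 24 dice run from 24 to 144

mu, sigma = 84.0, math.sqrt(70.0)
for s in (72, 78, 84):
    exact = probs[s - 24]
    normal = math.exp(-(s - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
    print(s, round(exact, 7), round(normal, 7))   # compare with Table 17.2
```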
CONCLUSIONS
Probability generating functions are very useful in calculating otherwise formidable
probabilities. They are also helpful in indicating limiting probability distributions.
We have shown here primarily the occurrence of the central limit theorem in finding
the limiting normal distributions when sums of independent random variables are
considered.
EXPLORATIONS
1. Load a die in any way you would like so long as it is not fair, investigate the
sums when the die is tossed a few times, and then investigate the behavior of
the die when it is tossed a large number of times.
2. Find the mean and variance of the loaded die in Exploration 1 and verify the
mean and variance of the sums found in Exploration 1.
3. Show that the normal curve is a very good approximation for the sums on your
loaded die when tossed a large number of times.
4. Show that the function p(t) = (q + pt)^n generates probabilities for the binomial
random variable with parameters q, p, and n.
5. Use the generating function in Exploration 4 to find the mean and variance of
the binomial random variable.
6. The geometric random variable describes the waiting time until the first success
in a binomial situation with parameters p and q. Its probability distribution
function is f(x) = pq^{x−1}, x = 1, 2, . . . Show that its probability generating
function is pt/(1 − qt) and use this to find the mean and the variance of the
geometric random variable.
Bibliography
WHERE TO LEARN MORE
There is now a vast literature on the theory of probability. A few of the following
references are cited in the text; other titles that may be useful to the instructor or
student are included here as well.
1. G. E. P. Box, W. G. Hunter, and J. Stuart Hunter, Statistics for Experimenters, John Wiley & Sons,
1978.
2. F. N. David and D. E. Barton, Combinatorial Chance, Charles Griffin & Company Limited, 1962.
3. J. W. Drane, S. Cao, L. Wang, and T. Postelnicu, Limiting forms of probability mass functions via
recurrence formulas, The American Statistician, 1993, 47(4), 269–274.
4. N.R. Draper and H. Smith, Applied Regression Analysis, second edition, John Wiley & Sons, 1981.
5. A. J. Duncan, Quality Control and Industrial Statistics, fifth edition, Richard D. Irwin, Inc., 1986.
6. W. Feller, An Introduction to Probability and Its Applications, Volumes I and II, John Wiley & Sons,
1968.
7. B. V. Gnedenko, The Theory of Probability, fifth edition, Chelsea Publishing Company, 1989.
8. S. Goldberg, Probability: An Introduction, Prentice-Hall, Inc., 1960.
9. S. Goldberg, Introduction to Difference Equations, Dover Publications, 1986.
10. E. L. Grant and R. S. Leavenworth, Statistical Quality Control, sixth edition, McGraw-Hill, 1988.
11. R. P. Grimaldi, Discrete and Combinatorial Mathematics, fifth edition, Addison-Wesley Publishing
Co., Inc., 2004.
12. A. Hald, A History of Probability and Statistics and Their Applications Before 1750, John Wiley &
Sons, 1990.
13. N. L. Johnson, S. Kotz, and A. W. Kemp, Univariate Discrete Distributions, second edition, John
Wiley & Sons, 1992.
14. N. L. Johnson, S. Kotz, and N. Balakrishnan, Continuous Univariate Distributions, Volumes 1 and 2,
second edition, John Wiley & Sons, 1994.
15. J. J. Kinney, Tossing coins until all are heads, Mathematics Magazine, May 1978, pp. 184–186.
16. J. J. Kinney, Probability: An Introduction with Statistical Applications, John Wiley & Sons, 1997.
17. J. J. Kinney, Statistics for Science and Engineering, Addison-Wesley Publishing Co., Inc., 2002.
18. F. Mosteller, Fifty Challenging Problems in Probability, Addison-Wesley Publishing Co., Inc., 1965.
Reprinted by Dover Publications.
19. S. Ross, A First Course in Probability, sixth edition, Prentice-Hall, Inc., 2002.
20. D. Salsburg, The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century,
W. H. Freeman and Company, 2001.
21. J. V. Uspensky, Introduction to Mathematical Probability, McGraw-Hill Book Company, Inc., 1937.
22. W. A. Whitworth, Choice and Chance, fifth edition, Hafner Publishing Company, 1965.
23. S. Wolfram, Mathematica: A System for Doing Mathematics by Computer, Addison-Wesley Publish-
ing Co., Inc., 1991.
Index
Acceptance sampling 32
Addition Theorem 10, 25
α 135
Alternative hypothesis 134
Analysis of variance 197
Average outgoing quality 34
Axioms of Probability 8
Bayes’ theorem 45
β 137
Binomial coefficients 24
Binomial probability
distribution 64, 244
Mean 69
Variance 69
Binomial theorem 3, 11, 15, 24, 83
Birthday problem 8, 16
Bivariate random
variables 115
Cancer test 42
Central limit theorem 121, 123, 213
Cereal Box Problem 88
Chi squared distribution 141
Choosing an assistant 30
Combinations 22, 242
Conditional probability 10, 39, 40
Confidence coefficient 131, 149
Confidence interval 131
Confounding 233
Control charts 155
Mean 159, 160
np 161
p 163
Correlation coefficient 200
Counting principle 19
Critical region 135
Degrees of freedom 143
Derangements 17
Difference equation 241
Discrete uniform distribution 59
Drivers' ed 39
e 5, 6, 26, 31
Estimation 130, 177
Estimating σ 157, 159
Expectation 6
Binomial distribution 69
Geometric distribution 73
Hypergeometric distribution 71
Negative Binomial distribution 88
Negative Hypergeometric
distribution 102
F ratio distribution 148
Factor 224
Factorial experiment 231, 232
Finite population correction
factor 71, 214
Fractional factorial experiment 234
Gambler's ruin 248
General addition theorem 10, 25
Generating functions 252
Geometric probability 48
Geometric probability distribution 72, 84
Geometric series 12, 13
German tanks 28, 62, 177
Hat problem 5, 26
Heads before tails 88
Hypergeometric probability distribution
70, 105, 247
Hypothesis testing 133
Inclusion and exclusion 26
Independent events 11
Indianapolis 500 data 196
Influential observations 193
Interaction 225
Joint probability distributions 115
Least squares 191
Let’s Make a Deal 8, 15, 17, 44
Lower control limit 158
Lunch problem 96
Marginal distributions 117
Maximum 176
Mean 60
Binomial distribution 69, 246
Geometric distribution 73
Hypergeometric distribution 71
Negative Binomial distribution 88
Negative Hypergeometric
distribution 174
Mean of means 124
Means - two samples 150
Median 28, 174
Median - median line 202, 207
Meeting at the library 48
Minimum 174, 177
Multiple regression 235
Multiplication rule 10
Mutually exclusive events 9
Mythical island 84
Negative binomial distribution 87, 103
Negative hypergeometric distribution 99
Nonlinear models 201
Nonparametric methods 170
Normal probability distribution 113
Null hypothesis 134
Operating characteristic curve 138
Optimal allocation 217
Order statistics 173, 174
p-value for a test 139
Patterns in binomial trials 90
Permutations 5, 12, 19, 242
Some objects alike 20
Poisson distribution 250
Poker 27
Pooled variances 152
Power of a test 137
Probability 2, 8
Probability distribution 32, 59
Proportional allocation 215
Race cars 28, 64
Random variable 58
Randomized response 46, 55
Range 156, 176
Rank sum test 170
Ratio of variances 148
Recursion 91, 101, 241
Residuals 189
Runs 3, 180, 182
Theory 182
Mean 184
Variance 184
Sample space 2, 4, 6, 7, 10, 14
Sampling 211
Seven game series in sports 75
Significance level 135
Simple random sampling 211
Standard deviation 60
Strata 214, 221
Student t distribution 146
Sums 62, 69, 111, 121
Type I error 135
Type II error 135
Unbiased estimator 212
Uniform random variable 59, 109
Upper control limit 158
Variance 60, 119
Binomial distribution 69
Hypergeometric distribution 71
Negative Binomial distribution 88
Negative Hypergeometric
distribution 102
Waiting at the library 48
Waiting time problems 83
World series 76
Yates algorithm 230
A Probability and Statistics Companion

A Probability and Statistics Companion
John J. Kinney

Copyright ©2009 by John Wiley & Sons, Inc. All rights reserved. Published by John Wiley & Sons, Inc., Hoboken, New Jersey Published simultaneously in Canada. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission. Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002. Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com. Library of Congress Cataloging-in-Publication Data: Kinney, John J. A probability and statistics companion / John J. Kinney. p. cm. Includes bibliographical references and index. ISBN 978-0-470-47195-1 (pbk.) 1. Probabilities. I. Title. QA273.K494 2009 519.2–dc22 2009009329 Typeset in 10/12pt Times by Thomson Digital, Noida, India. Printed in the United States of America 10 9 8 7 6 5 4 3 2 1

For Cherry, Kaylyn, and James

Permutations and Combinations: Choosing the Best Candidate. Acceptance Sampling Permutations 19 Counting Principle 19 Permutations with Some Objects Alike Permuting Only Some of the Objects 18 20 21 25 Combinations Conclusions Explorations 22 General Addition Theorem and Applications 35 35 3.Contents Preface 1. Conditional Probability Introduction Some Notation Bayes’ Theorem Conclusions Explorations 46 46 37 40 45 37 vii . Probability and Sample Spaces Why Study Probability? Probability Sample Spaces 2 2 8 11 1 xv 1 Some Properties of Probabilities Finding Probabilities of Events Conclusions Explorations 16 16 2.

Seven-Game Series in Sports Introduction 75 75 75 Seven-Game Series Winning the First Game Conclusions Explorations 81 81 78 How Long Should the Series Last? 79 7. Binomial. Geometric Probability Conclusion Explorations 56 57 5. σ. Hypergeometric. Random Variables and Discrete Probability Distributions—Uniform. and Geometric Distributions Introduction 58 58 Discrete Uniform Distribution 59 Mean and Variance of a Discrete Random Variable Intervals.viii Contents 48 4. Waiting Time Problems Waiting for the First Success The Mythical Island 84 85 87 83 83 Waiting for the Second Success Waiting for the rth Success . and German Tanks 61 Sums 62 60 Binomial Probability Distribution 64 Mean and Variance of the Binomial Distribution Sums 69 68 Hypergeometric Distribution 70 Other Properties of the Hypergeometric Distribution 72 72 Geometric Probability Distribution Conclusions Explorations 73 74 6.

and the Central Limit Theorem. Bivariate Random Variables Uniform Random Variable Sums 111 111 113 114 115 109 108 A Fact About Means Normal Probability Distribution Facts About Normal Curves Bivariate Random Variables Variance 119 .Contents ix Mean of the Negative Binomial Collecting Cereal Box Prizes Heads Before Tails Waiting for Patterns 88 90 88 87 Expected Waiting Time for HH Expected Waiting Time for TH An Unfair Game with a Fair Coin Three Tosses 95 96 91 93 94 Who Pays for Lunch? Expected Number of Lunches 98 99 101 Negative Hypergeometric Distribution Mean and Variance of the Negative Hypergeometric Negative Binomial Approximation The Meaning of the Mean 104 104 103 First Occurrences 104 Waiting Time for c Special Items to Occur Estimating k 105 Conclusions Explorations 106 106 8. the Normal Distribution. Continuous Probability Distributions: Sums.

Statistical Inference II: Continuous Probability Distributions II—Comparing Two Samples The Chi-Squared Distribution 141 144 141 Statistical Inference on the Variance Student t Distribution 146 Testing the Ratio of Variances: The F Distribution Tests on Means from Two Samples Conclusions Explorations 154 154 150 148 .x Contents Central Limit Theorem: Sums Central Limit Theorem: Means Central Limit Theorem 124 121 123 Expected Values and Bivariate Random Variables Means and Variances of Means 124 126 124 A Note on the Uniform Distribution Conclusions Explorations 128 129 9. Statistical Inference I Estimation 131 131 133 137 130 Confidence Intervals Hypothesis Testing β and the Power of a Test p-Value for a Test Conclusions Explorations 140 140 139 10.

Medians. and the Indy 500 Introduction Least Squares 188 191 193 188 191 Principle of Least Squares Influential Observations The Indy 500 A Caution 195 A Test for Linearity: The Analysis of Variance 201 201 197 Nonlinear Models The Median–Median Line 202 When Are the Lines Identical? 205 Determining the Median–Median Line 207 .Contents xi 155 11. Nonparametric Methods Introduction 170 170 170 The Rank Sum Test Order Statistics 173 Median 174 Maximum 176 Runs 180 Some Theory of Runs Conclusions Explorations 186 187 182 13. Statistical Process Control Control Charts 155 157 Estimating σ Using the Sample Standard Deviations Estimating σ Using the Sample Ranges Control Charts for Attributes np Control Chart p Chart 163 161 164 165 161 159 Some Characteristics of Control Charts Some Additional Tests for Control Charts Conclusions Explorations 168 168 12. Least Squares.

Recursions and Probability Introduction Conclusions Explorations 240 250 250 240 . Sampling Simple Random Sampling Stratification 214 215 217 212 210 210 209 211 Proportional Allocation Optimal Allocation Some Practical Considerations Strata 221 221 221 219 Conclusions Explorations 15. Design of Experiments Yates Algorithm 230 231 223 Randomization and Some Notation Confounding 233 234 Multiple Observations Design Models and Multiple Regression Models Testing the Effects for Significance Conclusions Explorations 238 238 235 235 16.xii Contents Analysis for Years 1911–1969 Conclusions Explorations 14.

Contents xiii 251 17. Generating Functions and the Central Limit Theorem Means and Variances A Normal Approximation Conclusions Explorations Bibliography Where to Learn More Index 257 255 255 253 254 257 259 .

John Kinney Colorado Springs April 2009 xv . problems in conditional probability. Curiously. Although some of these examples can be regarded as advanced. primarily because they are crucial in the analysis of data derived from samples and designed experiments and in statistical process control in manufacturing. Students searching for topics to investigate will find many examples in this book. again. while these topics have put statistics at the forefront of scientific investigation. This book has been written to provide instructors with material on these important topics so that they may be included in introductory courses. I am most deeply grateful to my wife. It is a pleasure to acknowledge the many contributions of Susanne Steitz-Filler. Cherry. Connections with geometry are frequent. she has been indispensable. they are given very little emphasis in textbooks for these courses. and the computer makes graphic illustrations and heretofore exceedingly difficult computations quick and easy. a method of selecting the best candidate from a group of applicants for a position. often a challenge for students. it provides instructors with examples that go beyond those commonly used. It is hoped that these examples will be of interest in themselves. both at the college and at the high school level. Examples include a problem involving a run of defeats in baseball. they are presented here in ways to make them accessible at the introductory level. thus increasing student motivation in the subjects and providing topics students can investigate in individual projects. I have developed these examples from my own long experience with students and with teachers in teacher enrichment programs. In addition. and trust it will prove a useful resource for both teachers and students. The fact that the medians of a triangle meet at a point becomes an extremely useful fact in the analysis of bivariate data. Graphs allow us to see many solutions visually. my editor at John Wiley & Sons. and an interesting set of problems involving the waiting time for an event to occur. I think then of the book as providing both supplemental applications and novel explanations of some significant topics. are solved using only the area of a rectangle.Preface Courses in probability and statistics are becoming very popular.

1 . Inc. without a formula WHY STUDY PROBABILITY? There are two reasons to study probability. in fact. A Probability and Statistics Companion. relatively small samples) and what kinds of conclusions can be taken from the sample data. including a way to add them r to show a use of the Fibonacci sequence r to use the binomial theorem r to introduce the basic theorems of probability. Statistics. We will show examples of each of these types of problems in this book. and counterintuitively. very difficult. Part of its fascination is that some problems that appear to be easy are. has become a central part of science. reason to study probability is that it is the mathematical basis for the statistical analysis of experimental data and the analysis of sample survey data. Statistics can tell experimenters what observations to take so that conclusions to be drawn from the data are as broad as possible.Chapter 1 Probability and Sample Spaces CHAPTER OBJECTIVES: r to introduce the theory of probability r to introduce sample spaces r to show connections with geometric series. easy to solve. John J. statistics tells us how many observations to take (usually. Kinney Copyright © 2009 by John Wiley & Sons. One reason is that this branch of mathematics contains many interesting problems. In sample surveys. whereas some problems that appear to be difficult are. some of which have very surprising solutions. The second. and compelling. Some problems have very beautiful solutions. although relatively new in the history of mathematics. in fact.

We begin by enumerating.1 A Production Line Items coming off a production line can be classified as either good (G) or defective (D). SAMPLE SPACES An experimenter has four doses of a drug under testing and four doses of an inert placebo. D} since one of these sample points must occur. but first we must establish the probabilistic basis for statistics. Most students lack a structure for thinking about probability problems in general and so one must be created. We will not solve the problem involving the experimental drug here but instead will show other examples involving a sample space. geometric probability.2 Chapter 1 Probability and Sample Spaces Each of these areas of statistics is discussed in this book. We will see that the problem above is in reality not as difficult as one might presume. If the drugs are randomly allocated to eight patients. We begin with a framework for thinking about problems that involve randomness or chance. Here the set of all possible outcomes is S = {G. conditional probability. or showing. Some of the examples at the beginning may appear to have little or no practical application. This set is called a sample space. . We follow this chapter with chapters on permutations and combinations. but these are needed to establish ideas since understanding problems involving actual data can be very challenging without doing some simple problems first. One of the reasons for this is that we lack a framework in which to think about the problem. PROBABILITY A brief introduction to probability is given here with an emphasis on some unusual problems to consider for the classroom. and then with a chapter on random variables and probability distributions. Probability refers to the relative frequency with which events occur where there is some element or randomness or chance. Now suppose we inspect the next five items that are produced. EXAMPLE 1. what is the probability that the experimental drug is given to the first four patients? This problem appears to be very difficult. We observe the next item produced. There are now 32 sample points that are shown in Table 1. the set of all the possible outcomes when an experiment involving randomness is performed.1.

Table 1. so the sample point GGDGG contains three runs while the sample point GDGDD has four runs.2 Good 0 1 2 3 4 5 Frequency 1 5 10 10 5 1 . the frequencies with which various numbers of runs occur. It is also interesting to see.3. A run is a sequence of like adjacent results of length 1 or more. If we collect these points together we find the distribution of the number of good items in Table 1. The sample space also shows the number of runs that occur.2. in Table 1. It is interesting to see that these frequencies are exactly those that occur in the binomial expansion of 25 = (1 + 1)5 = 1 + 5 + 10 + 10 + 5 + 1 = 32 This is not coincidental. we will explain this subsequently.Sample Spaces Table 1.1 Point GGGGG GGGGD GGGDG GGDGG GDGGG DGGGG DGGGD DGGDG DGDGG DDGGG GDDGG GDGDG GDGGD GGDDG GGDGD GGGDD Good 5 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 Runs 1 2 3 3 3 2 3 4 4 2 3 5 4 3 4 2 Point GGDDD GDGDD DGGDD DGDGD DDGGD DDGDG DDDGG GDDGD GDDDG GDDGD GDDDD DGDDD DDGDD DDDGD DDDDG DDDDD Good 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 0 3 Runs 2 4 3 5 3 4 2 4 3 4 2 3 3 3 2 1 We have shown in the second column the number of good items that occur with each sample point.

like symbols are very likely to occur together. The sample space here consists of all the possible orders. The sample space chosen for an experiment depends upon the probabilities one wishes to calculate. 3. they are not. In a baseball season of 162 games. We should also note that good and defective items usually do not come off production lines at random. So we have three possible useful sample spaces. so alternative sample spaces provide different ways for viewing the same problem. as shown below. We will explore the topic of runs more thoroughly in Chapter 12. G’s and D’s. 2. Very often one sample space will be much easier to deal with than another for a problem. they are likely to write down too many runs. the probabilities assigned to these sample points are quite different. Is there a “correct” sample space? The answer is “no”.3 Runs 1 2 3 4 5 Frequency 2 8 12 8 2 We see a pattern but not one as simple as the binomial expansion we saw previously.4 Chapter 1 Probability and Sample Spaces Table 1. 2. we could choose the sample space that has 32 points or the sample space {0. These might be noted as remarkable in the press. 5} showing the number of runs. S= ⎧ ⎫ ∗ ∗ ∗ ⎪1234 2134 3124 4123 ⎪ ⎪ ⎪ ⎪ ⎪1243∗ 2143 3142 4132∗ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨1324∗ 2314∗ 3214∗ 4231∗ ⎪ ⎬ ⎪1342∗ 2341 3241∗ 4213∗ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪1423∗ 2413 3412 4312 ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎩ ⎭ ∗ ∗ 1432 2431 3421 4321 . If a group is asked to write down a sequence of. 2. As we will see. EXAMPLE 1. In this example. 4. So we see that like adjacent results are almost certain to occur somewhere in the sequence that is the sample point.2 Random Arrangements The numbers 1. The frequency of defective items is usually extremely small. 3. 5} indicating the number of good items or the set {1. and 4 are arranged in a line at random. say. 1. Items of the same sort are likely to occur together. We will return to this when we consider acceptance sampling in Chapter 2 and statistical process control in Chapter 11. 3. One usually has a number of choices for the sample space. so the sample points are by no means equally likely. it is virtually certain that runs of several wins or losses will occur. 4. The mean number of runs is 3.

Now what happens as we increase the number of integers? This leads to the well-known “hat check” problem that involves n people who visit a restaurant and each check a hat.1).(2.3). however. the number of possible linear orders.1. By examining the sample space above. .3 Running Sums A box contains slips of paper numbered 1. to learn that the proportion for 100 people differs little from that for 4 people! In fact.1. We will consider permutations more generally in Chapter 2.1).(3. how many of the integers occupy their own place? For example.1.) The sample space is shown in Table 1. Increasing the number of diners complicates the problem greatly if one is thinking of listing all the orders and counting the appropriate orders as we have done here.625 of the permutations has at least one integer in its own place.1. or arrangements of 4 distinct items.(2. (To six decimal places. encounter with e = 2. and 3. Table 1. . has little to do with natural logarithms.4.Sample Spaces 5 S here contains 24 elements. 2. A well-known probability problem arises from the above permutations. Suppose the “natural” order of the four integers is 1234. This is an example of a waiting time problem. What proportion of the diners gets his own hat? If there are four diners.1. 2.632121 as n increases. the hats are distributed at random. . the integers 2 and 4 are in their own place. We find 15 such permutations. it is easy to count the permutations in which at least one of the integers is in its own place.(1.(1.1. we see that 62.2).) This is our first.1. So the hats are distributed according to a random permutation of the integers 1.1.2). or four drawings.(3. It is perhaps surprising. we wait until an event occurs. .5% of the diners receive their own hats. and a cumulative running sum is kept until the sum equals or exceeds 4. the base of the system of natural logarithms. .4 n 2 3 4 Orders (1.71828 . The event can occur in two. however. respectively. EXAMPLE 1. We will show this in Chapter 2.1. The occurrence of e in probability.2). so 15/24 = 0. These are marked with an asterisk in S. n. this is the exact result for 10 diners.3) . in the order 3214. and counterintuitive. If the four integers are arranged randomly. . (It must occur no later than the fourth drawing.3). . It is possible.(3. These arrangements are called permutations.(1.2).1).3). three.(1.2.3) (1. receiving a numbered receipt.(2. replaced.1.3).2.(1. The next example also involves e. the proportion approaches 1 − 1/e = 0. Slips are drawn one at a time. where n is the number of drawings and the sample points show the order in which the integers were selected.2) (2.2) (1. however.(2. Upon leaving.3) (1. to find the answer without proceeding in this way.2. but by no means our last.1).1.

While the value of n increases.4 An Infinite Sample Space Examples 1. but the expected length of the game approaches e = 2. This does.71828 .5 shows exact results for small values of n. however. that is.37. We now consider an infinite sample space. the expected length of the game increases.1. We observe a production line until a defective (D) item appears. where we draw until the sum equals or exeeds n + 1. . .00 2. EXAMPLE 1. The sample space is shown below (where G denotes a good item). Table 1.37 2. What happens as the number of slips of paper increases? The approach used here becomes increasingly difficult. The sample space now is infinite since the event may never occur. It is too difficult to show here. Uncountable infinite sample spaces are also encountered in probability. since they contain a finite number of elements. 1. but we will not consider these here. S= ⎧ ⎫ ⎪ D ⎪ ⎪ ⎪ ⎪ ⎪ GD ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ GGD ⎪ ⎪ ⎪ ⎪ ⎨ ⎬ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎩ GGGD . and 1.2. . as n increases.25 2. . but at a decreasing rate. The result will probably surprise students of calculus and be an interesting introduction to e for other students. Countable sample spaces often behave as if they were finite.6 Chapter 1 Probability and Sample Spaces Table 1.3 are examples of finite sample spaces. ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎭ We note that S in this case is a countable set. a set that can be put in one-to-one correspondence with the set of positive integers.49 We will show later that the expected number of drawings is 2. make a very interesting classroom exercise either by generating random numbers within the specified range or by a computer simulation.5 n 1 2 3 4 5 Expected value 2. .44 2.

J3 S1 . S3 S4 . JS. J3 S2 . SJ. one might consider the individual students selected so that the sample space. SS} where we have shown the class of the students selected in order.Sample Spaces 7 EXAMPLE 1. 1. 4. S1 S2 . J1 S3 . J2 S3 . J3 S3 . 3. 2} Alternatively. becomes S2 = {J1 J2 . J2 J3 . J1 S2 . if order of selection is disregarded. One might also simply count the number of juniors on the committee and use the sample space S1 = {0. A sample space is shown below.6 AP Statistics A class in advanced placement statistics consists of three juniors (J) and four seniors (S). We will return to assigning probabilities to the sample points in S and S2 later in this chapter. shown below. So we would expect that each of the points in S2 would occur about 1/21 of the time. Since there are two possibilities on each toss. S2 S4 . 1. S= ⎧ ⎪TTTTT ⎪ ⎪ ⎪HHTTT ⎪ ⎪ ⎪ ⎪ ⎨ TTTTH HTHTT HTTTH TTTHT HTTHT THTTH TTHTT TTTHH TTHTH HTHTH HHHTH THTTT THTHT HHHHT THHTT HHTHH HTTHH ⎪ THTHH ⎬ HHHTT ⎪ ⎪ ⎪ ⎪ HTHHH ⎪ ⎪ HTTTT ⎪ ⎪ ⎪ ⎪ TTHHT ⎪ ⎪ ⎫ ⎪ ⎪THHTH THHHT TTHHH ⎪ ⎪ ⎪HTHHT HHTHT HHHHT ⎪ ⎪ ⎪ ⎪ ⎩THHHH HHHHH ⎪ ⎪ ⎭ It is also possible in this example simply to count the number of heads. Perhaps one can think of other sets that describe the sample space in this case. In that case. so one might think that these 21 sample points are equally likely to occur provided no priority is given to any of the particular individuals. J2 S1 . J3 S4 } S2 is as detailed a sample space one can think of. say. J2 S4 . J1 S4 . 5} Both S and S1 are sets that contain all the possibilities when the experiment is performed and so are sample spaces. that occur. S1 S4 . there are 25 = 32 sample points. EXAMPLE 1. An appropriate sample space is S = {JJ. 2. J1 S1 . So we see that the sample space is not uniquely defined. S1 S3 . .5 Tossing a Coin We toss a coin five times and record the tosses in order. It is desired to select a committee of size two. J1 J3 . the sample space is S1 = {0. J2 S2 . S2 S3 .

3. If A is an event in a sample space. E2 ). a contestant is shown three doors. the sample point where exactly four heads occur is an event. and we write the contestant’s initial choice and then the door she finally ends up with. We are interested in whether or not at least two of the students share the same birthday. . (E2 . Probabilities are governed by these three axioms: 1. The host then asks the contestant if she wishes to change her choice of doors from the one she selected to the remaining door. are numbers between 0 and 1. 2. Let W denote a door with the prize and E1 and E2 the empty doors. or probability. We wish to calculate the relative likelihood. then . might consist of components with 10 items each. then the relative likelihood of the event is t/n. If events A and B are P(A ∪ B) = P(A) + P(B).8 A Birthday Problem A class in calculus has 10 students. Here the sample space. E1 ). or probabilities. we write P(A) to denote the probability of the event A. We see that relative likelihoods.). . P(S) = 1. The contestant chooses one of the doors and the host then opens one of the remaining doors to show that it is empty.196 9 × 1025 points! Here S = {(March 10. W)} EXAMPLE 1. of these events. (W.3. .4. so that A ∩ B = ∅. the sample point where the first defective item occurs in an even number of items is an event. but we can calculate the probability that at least two of the students share the same birthday without enumerating all the points in S. June 15. 0 ≤ P(A) ≤ 1.8 Chapter 1 Probability and Sample Spaces EXAMPLE 1. as we will see in a later chapter. In Example 1. SOME PROPERTIES OF PROBABILITIES Any subset of a sample space is called an event. In Example 1. April 24. . . W). showing all possible birthdays.)} It may seem counterintuitive. she should do).2. the occurrence of a good item is an event.1. We will return to this problem later. disjoint. August 2. September 9. . In Example 1. In Example 1. only one of which hides a valuable prize. We can only show part of the sample space since it contains 36510 = 4. Supposing that the contestant switches choices of doors (which. . (May 5. Now we continue to develop the theory of probability. the sample space is S = {(W. If we try an experiment n times and an event occurs t times. the sample point where the number 3 is to the left of the number 2 is an event.7 Let’s Make a Deal On the television program Let’s Make a Deal. (E1 .

say A. if a prospective student decides to attend University A with probability 2/5 and to attend University B with probability 1/5. say ai . Now if an event A occurs with probability P(A) and an event B occurs with probability P(B) and if the events cannot occur together. Axiom 3 concerns events that are mutually exclusive. Example 1.1. but our conclusions also apply to a countably infinite sample space. the probability assigned to the entire sample space must be 1 since by definition of the sample space some point in the sample space must occur and the probability of an event must be between 0 and 1. then the relative frequency with which one or the other occurs is P(A) + P(B).Some Properties of Probabilities 9 Axioms 1 and 2 are fairly obvious.1 . What if they are not mutually exclusive? Refer to Figure 1. she will attend one or the other (but not both) with probability 2/5 + 1/5 = 3/5. P(A) = 1 − P(A). This explains Axiom 3. For example. Note that A and A are mutually exclusive so P(S) = P(A ∪ A) = P(A) + P(A) = 1 and so we have Fact 1. A A B B Figure 1. as being composed of distinct points. Disjoint events are also called mutually exclusive events. we will encounter several more examples of these sample spaces in Chapter 7.4 involved a countable infinite sample space.with probabilities p(ai ). It is also very useful to consider an event. Let A denote the points in the sample space where event A does not occur. By Axiom 3 we can add these individual probabilities to compute P(A) so n P(A) = i=1 p(ai ) It is perhaps easiest to consider a finite sample space.

7/12. It can be generalized to three or more events in Fact 3: Fact 3. differs from the probability the marble is green. PA ∪ B) = P(A) + P(B) − P(A ∩ B). We prefer to prove it using techniques developed in Chapter 2. As for the second marble. . The probability the second marble is green given that the first is red.9 Drawing Marbles Suppose we have a jar of marbles containing five red and seven green marbles. n n P(A1 ∪ A2 ∪ · · · ∪ An ) = i=n P(Ai ) − i = j=1 / n P(Ai ∩ Aj ) + i = j = k=1 / / n P(Ai ∩ Aj ∩ Ak ) − · · · ± i = j = ··· = n=1 / / / P(Ai ∩ Aj ∩ · · · ∩ An ) We simply state this theorem here. so we delay the proof until then. 7/11. and the probability the second marble is green given that the first marble is red is now 7/11. This is called the multiplication rule. then we have counted P(A ∩ B) twice so Fact 2. EXAMPLE 1. given the first is red. We draw them out one at a time (without replacing the drawn marble) and want the probability that the first drawn marble is red and the second green. the contents of the jar have now changed. where Fact 2 applies whether events A and B are disjoint or not. Now we turn to events that can occur together. Clearly. given a red. The fact that the composition of the jar changes with the selection of the first marble alters the probability of the color of the second marble. the probability the first is red is 5/12.10 Chapter 1 Probability and Sample Spaces If we find P(A) + P(B) by adding the probabilities of the distinct points in those events. Fact 2 is known as the addition theorem for two events. the conditional probability of a green. We conclude that the probability the first marble is red and the second green is 5/12 · 7/11 = 35/132. We say in general that P(A ∩ B) = P(A) · P(B|A) where we read P(B|A) as the conditional probability of event B given that event A has occurred. (General addition theorem). We call the probability the second marble is green.

Independent events and disjoint events are commonly confused. are fairly easy.01)2 = 0. of course. then they cannot be disjoint since they must be able to occur together. It is a common error to presume. if events are disjoint. the probability of drawing a green marble on the second drawing would have been the same as drawing a green marble on the first drawing. then we might have P(Head) = 1/2 and P(Tail) = 1/2. as shown in the previous section. It might be sensible to assign the probabilities to the events as P(G) = 1/2 and P(D) = 1/2 if we suppose that the production line is not a very good one. In that case.10 A Binomial Problem In Example 1. But if we toss a fair coin.1.000097 . 7/12. EXAMPLE 1. but it would not be usual to assign equal probabilities to the two events. disjoint events cannot occur together. The first step in any probability problem is to define an appropriate sample space. because a sample space has n points.01 and even these assumptions assume a fairly poor production line. 0 ≤ p ≤ 1. P(B|A) = P(B). then the desired probabilities can be found. and assuming the events are independent. but it is not always necessary to consider order. We might also consider a loaded coin where P(H) = p and P(T ) = q = 1 − p where. For another example.99 and P(D) = 0. If events are independent. Let us consider the examples for which we previously found sample spaces. In this case. Independent events refer to events that can occur together. FINDING PROBABILITIES OF EVENTS The facts about probabilities. then they cannot be independent because they cannot occur together.99)3 · (0. we then find that P(GGDDG) = P(G) · P(G) · P(D) · P(D) · P(G) = (0. We refer now to events that do not have probability 0 (such events are encountered in nondenumerably infinite sample spaces). because that is the most detailed sample space one can write. It is far more sensible to suppose in our production line example that P(G) = 0. when a student takes a course she will either pass it or fail it. More than one sample space is possible. and the events are called independent. This is an example of a binomial event (where one of the two possible outcomes occurs at each trial) but it is not necessary to assign equal probabilities to the two outcomes. it is usually the case that if order is considered. we examined an item emerging from a production line and observed the result. We will study conditional probability in Chapter 3.Finding Probabilities of Events 11 Had the first marble been replaced before making the second drawing. The difficulty arises when we try to apply them. that each point has probability 1/n.

12

Chapter 1

Probability and Sample Spaces

Also, since the sample points are disjoint, we can compute the probability we see exactly two defective items as P(GGDDG) +P(DGDGG)+P(DGGDG)+P(DGGGD)+P(GDGGD)+P(GGDGD) +P(GGGDD) + P(GDGDG) + P(GDDGG) + P(DGDGG) = 0.00097 Note that the probability above must be 10 · P(GGDDG) = 10 · (0.99)3 · (0.01)2 since each of the 10 orders shown above has the same probability. Note also that 10 · (0.99)3 · (0.01)2 is a term in the binomial expansion (0.99 + 0.01)10 .

We will consider more problems involving the binomial theorem in Chapter 5.

EXAMPLE 1.11

More on Arrangements

In Example 1.2, we considered all the possible permutations of four objects. Thinking that these permutations occur at random, we assign probability 1/24 to each of the sample points. The events “3 occurs to the left of 2” then consists of the points {3124, 3142, 4132, 1324, 3214, 1342, 3241, 3412, 4312, 1432, 3421, 4321} Since there are 12 of these and since they are mutually disjoint and since each has probability 1/24, we find P(3 occurs to the left of 2) = 12/24 = 1/2 We might have seen this without so much work if we considered the fact that in a random permutation, 3 is as likely to be to the left of 2 as to its right. As you were previously warned, easy looking problems are often difficult while difficult looking problems are often easy. It is all in the way one considers the problem.

EXAMPLE 1.12

Using a Geometric Series

Example 1.4 is an example of a waiting time problem; that is, we do not have a determined number of trials, but we wait for an event to occur. If we consider the manufacturing process to be fairly poor and the items emerging from the production line are independent, then one possible assignment of probabilities is shown in Table 1.6.

Table 1.6
Event     Probability
D         0.01
GD        0.99 · 0.01 = 0.0099
GGD       (0.99)^2 · 0.01 = 0.009801
GGGD      (0.99)^3 · 0.01 = 0.009703
...       ...



We should check that the probabilities add up to 1. We find that (using S for the sum now)

S = 0.01 + 0.99 · 0.01 + (0.99)^2 · 0.01 + (0.99)^3 · 0.01 + · · ·

and so

0.99S = 0.99 · 0.01 + (0.99)^2 · 0.01 + (0.99)^3 · 0.01 + · · ·

and subtracting one series from the other we find S − 0.99S = 0.01, or 0.01S = 0.01, and so S = 1. This is also a good opportunity to use the geometric series to find the sum, but we will have to use the above trick in later chapters for series that are not geometric.

What happens if we assign arbitrary probabilities to defective items and good items? This would certainly be the case with an effective production process. If we let P(D) = p and P(G) = 1 − p = q, then the probabilities appear as shown in Table 1.7.

Table 1.7
Event     Probability
D         p
GD        qp
GGD       q^2 p
GGGD      q^3 p
...       ...

Again, have we assigned a probability of 1 to the entire sample space? Letting S stand for the sum again, we have

S = p + qp + q^2 p + q^3 p + · · ·

and so

qS = qp + q^2 p + q^3 p + · · ·

and subtracting, we find S − qS = p, so (1 − q)S = p or pS = p, meaning that S = 1. This means that our assignment of probabilities is correct for any value of p.




Now let us find the probability that the first defective item occurs at an even-numbered observation. Let the event be denoted by E. Then

P(E) = qp + q^3 p + q^5 p + · · ·

and so

q^2 · P(E) = q^3 p + q^5 p + · · ·

and subtracting we find P(E) − q^2 · P(E) = qp, from which it follows that (1 − q^2) · P(E) = qp and so

P(E) = qp/(1 − q^2) = qp/[(1 − q)(1 + q)] = q/(1 + q)

If the process produces items with the above probabilities, this becomes 0.99/(1 + 0.99) ≈ 0.4975. One might presume that the probability the first defective item occurs at an even-numbered observation is the same as the probability the first defective item occurs at an odd-numbered observation. This cannot be correct, however, since the probability the first defective item occurs at the first observation (an odd-numbered observation) is p. It is easy to show that the probability the first defective item occurs at an odd-numbered observation is 1/(1 + q), and for a process with equal probabilities, such as tossing a fair coin, this is 2/3.
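The even/odd split is easy to verify numerically. The sketch below (Python, assumed) evaluates q/(1 + q) and 1/(1 + q) for the production-line values and checks the first by simulation.

```python
import random

p, q = 0.01, 0.99                 # P(D) and P(G) from the example

print(q / (1 + q), 1 / (1 + q))   # about 0.4975 and 0.5025

trials, even = 100_000, 0
for _ in range(trials):
    n = 1
    while random.random() > p:    # keep observing good items until a defective appears
        n += 1
    even += (n % 2 == 0)

print(even / trials)              # close to q/(1 + q)
```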

EXAMPLE 1.13

Relating Two Sample Spaces

Example 1.5 considers a binomial event where we toss a coin five times. In the first sample space, S, we wrote out all the possible orders in which the tosses could occur. This is of course impossible if we tossed the coin, say, 10,000 times! In the second sample space, S1, we simply looked at the number of heads that occurred. The difference is that the sample points are not equally likely. In the first sample space, where we enumerated the result of each toss, using the fact that the tosses are independent, and assuming that the coin is loaded, where P(H) = p and P(T) = 1 − p = q, we find, to use two examples, that

P(TTTTT) = q^5 and P(HTTHH) = p^3 q^2

Now we can relate the two sample spaces. In S1, P(0) = P(0 heads) = P(TTTTT) = q^5. Now P(1 head) is more complex since the single head can occur in one of the five possible places. Since these sample points are mutually disjoint, P(1 head) = 5 · p · q^4. There are 10 points in S where two heads appear. Each of these points has probability p^2 · q^3, so P(2 heads) = 10 · p^2 · q^3. We find, similarly, that P(3 heads) = 10 · p^3 · q^2, P(4 heads) = 5 · p^4 · q, and, finally, P(5 heads) = p^5. So the sample points in S1 are far from being equally likely. If we add all these probabilities, we find

q^5 + 5 · p · q^4 + 10 · p^2 · q^3 + 10 · p^3 · q^2 + 5 · p^4 · q + p^5

which we recognize as the binomial expansion of (q + p)^5, that is, 1 since q = 1 − p.



In a binomial situation (where one of the two possible outcomes occurs at each trial) with n observations, we see that the probabilities are the individual terms in the binomial expansion of (q + p)^n.
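As a check on the five-toss case, the sketch below (Python, assumed; the loaded-coin value p = 0.6 is only an illustration, not taken from the text) computes the probabilities of 0 through 5 heads and confirms that they add to 1.

```python
from math import comb

p = 0.6              # an illustrative value of P(H)
q = 1 - p
probs = [comb(5, k) * p**k * q**(5 - k) for k in range(6)]

print(probs)         # P(0 heads), ..., P(5 heads)
print(sum(probs))    # 1.0, the binomial expansion of (q + p)^5
```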

EXAMPLE 1.14

Committees and Probability

In Example 1.6, we chose a committee of two students from a class with three juniors and four seniors. The sample space we used is

S = {JJ, JS, SJ, SS}

How should probabilities be assigned to the sample points? First we realize that each sample point refers to a combination of events, so that JJ means choosing a junior first and then choosing another junior. So JJ really refers to J ∩ J, whose probability is P(J ∩ J) = P(J) · P(J|J) by the multiplication rule. Now P(J) = 3/7 since there are three juniors and we regard the selection of the students as equally likely. Now, with one of the juniors selected, we have only two juniors to choose from, so P(J|J) = 2/6 and so

P(J ∩ J) = 3/7 · 2/6 = 1/7

In a similar way, we find

P(J ∩ S) = 3/7 · 4/6 = 2/7
P(S ∩ J) = 4/7 · 3/6 = 2/7
P(S ∩ S) = 4/7 · 3/6 = 2/7

These probabilities add up to 1 as they should.
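The same four probabilities can be checked by enumerating ordered choices of two distinct students; the sketch below (Python, assumed) does exactly that.

```python
from itertools import permutations
from fractions import Fraction

students = ["J"] * 3 + ["S"] * 4              # three juniors, four seniors
pairs = list(permutations(range(7), 2))       # ordered choices of two students

def prob(first, second):
    hits = sum(1 for a, b in pairs
               if students[a] == first and students[b] == second)
    return Fraction(hits, len(pairs))

for outcome in ("JJ", "JS", "SJ", "SS"):
    print(outcome, prob(outcome[0], outcome[1]))   # 1/7, 2/7, 2/7, 2/7
```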

EXAMPLE 1.15

Let’s Make a Deal

Example 1.7 is the Let's Make a Deal problem. It has been widely written about since it is easy to misunderstand the problem. The contestant chooses one of the doors that we have labeled W, E1, and E2. We suppose again that the contestant switches doors after the host exhibits one of the nonchosen doors to be empty. If the contestant chooses W, then the host has two choices of empty doors to exhibit. Suppose he chooses these with equal probabilities. Then W ∩ E1 means that the contestant initially chooses W, the host exhibits E2, and the contestant switches doors and ends up with E1. The probability of this is then

P(W ∩ E1) = P(W) · P(E1|W) = 1/3 · 1/2 = 1/6

In an entirely similar way,

P(W ∩ E2) = P(W) · P(E2|W) = 1/3 · 1/2 = 1/6




Using the switching strategy, the only way the contestant loses is by selecting the winning door first (and then switching to an empty door), so the probability the contestant loses is the sum of these probabilities, 1/6 + 1/6 = 1/3, which is just the probability of choosing W in the first place. It follows that the probability of winning under this strategy is 2/3! Another way to see this is to calculate the probabilities of the two ways of winning, namely, P(E1 ∩ W) and P(E2 ∩ W). In either of these, an empty door is chosen first. This means that the host has only one choice for exhibiting an empty door. So each of these probabilities is simply the probability of choosing the specified empty door first, which is 1/3. The sum of these probabilities is 2/3, as we found before. After the contestant selects a door, the probability the winning door is one not chosen is 2/3. The fact that one of these is shown to be empty does not change this probability.
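The switching argument is also easy to verify by simulation; the following sketch (Python, assumed) plays the game many times under both strategies.

```python
import random

def play(switch):
    doors = ["W", "E1", "E2"]
    prize = "W"
    choice = random.choice(doors)
    # The host opens an empty, nonchosen door.
    opened = random.choice([d for d in doors if d != choice and d != prize])
    if switch:
        choice = next(d for d in doors if d != choice and d != opened)
    return choice == prize

trials = 100_000
print(sum(play(False) for _ in range(trials)) / trials)  # about 1/3
print(sum(play(True) for _ in range(trials)) / trials)   # about 2/3
```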

EXAMPLE 1.16

A Birthday Problem

To think about the birthday problem in Example 1.8, we will use the fact that P(Ā) = 1 − P(A). So if A denotes the event that the birthdays are all distinct, then Ā denotes the event that at least two of the birthdays are the same. To find P(A), note that the first person can have any birthday among the 365 possible birthdays, the next can have any of the 364 remaining, the next has 363 choices, and so on. If there are 10 students in the class, then

P(at least two birthdays are the same) = 1 − (365/365) · (364/365) · (363/365) · · · (356/365) = 0.116948

If there are n students, we find

P(at least two birthdays are the same) = 1 − (365/365) · (364/365) · (363/365) · · · ((365 − (n − 1))/365)

This probability increases as n increases. It is slightly more than 1/2 if n = 23, while if n = 40, it is over 0.89. These calculations can be made with your graphing calculator. This result may be surprising, but note that any two people in the group can share a birthday; this is not the same as finding someone whose birthday matches, say, your birthday.
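The calculator computation just mentioned is also a few lines on a computer; this sketch (Python, assumed) evaluates the probability for several class sizes.

```python
def p_shared_birthday(n):
    prob_distinct = 1.0
    for k in range(n):
        prob_distinct *= (365 - k) / 365
    return 1 - prob_distinct

for n in (10, 23, 40):
    print(n, round(p_shared_birthday(n), 6))   # 0.116948, 0.507297, 0.891232
```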

CONCLUSIONS
This chapter has introduced the idea of the probability of an event and has given us a framework, called a sample space, in which to consider probability problems. The axioms on which probability is based have been stated, and some theorems resulting from them have been derived.

EXPLORATIONS
1. Consider all the arrangements of the integers 1, 2, 3, and 4. Count the number of derangements, that is, the number of arrangements in which no integer occupies its own place. Speculate on the relative frequency of the number of derangements as the number of integers increases.

2. Find the probability of (a) exactly three heads and (b) at most three heads when a fair coin is tossed five times.

3. A hat contains tags numbered 1, 2, 3, 4, 5, 6. Two tags are selected. Show the sample space and then compare the probability that the number on the second tag exceeds the number on the first tag when (a) the first tag is not replaced before the second tag is drawn and (b) the first tag is replaced before the second tag is drawn.

4. If p is the probability of obtaining a 5 at least once in n tosses of a fair die, what is the least value of n so that p ≥ 1/2?

5. Toss a fair coin 100 times and find the frequencies of the number of runs. Repeat the experiment as often as you can.

6. Use a computer to simulate tossing a coin 1000 times and find the frequencies of the number of runs produced.

7. Simulate drawing integers from the set 1, 2, 3 until the sum exceeds 4. Compare your mean value to the expected value given in the text.

8. Simulate the Let's Make a Deal problem by taking repeated selections from three cards, one of which is designated to be the prize. Compare two strategies: (1) never changing the selection and (2) always changing the selection.

Chapter 2

Permutations and Combinations: Choosing the Best Candidate; Acceptance Sampling

CHAPTER OBJECTIVES:
• to discuss permutations and combinations
• to use the binomial theorem
• to show how to select the best candidate for a position
• to encounter an interesting occurrence of e
• to show how sampling can improve the quality of a manufactured product
• to use the principle of maximum likelihood
• to apply permutations and combinations to other practical problems

An executive in a company has an opening for an executive assistant. Twenty candidates have applied for the position. The executive is constrained by company rules that say that candidates must be told whether they are selected or not at the time of an interview. How should the executive proceed so as to maximize the chance that the best candidate is selected?

Manufacturers of products commonly submit their product to inspection before the product is shipped to a consumer. This inspection usually measures whether or not the product meets the manufacturer's as well as the consumer's specifications. If the product inspection is destructive, however (such as determining the length of time a light bulb will burn), then all the manufactured product cannot be inspected.

Even if the inspection is not destructive or harmful to the product, inspection of all the product manufactured is expensive and time consuming. If the testing is destructive, it is possible to inspect only a random sample of the product produced. Can random sampling improve the quality of the product sold?

We will consider each of these problems, as well as several others, in this chapter. First we must learn to count points in sets, so we discuss permutations and combinations as well as some problems solved using them.

PERMUTATIONS

A permutation is a linear arrangement of objects, or an arrangement of objects in a row, in which the order of the objects is important. For example, if we have four objects, which we will denote by a, b, c, and d, assuming for the moment that all of the objects to be permuted are distinct, there are 24 distinct linear arrangements as shown in Table 2.1.

Table 2.1
abcd  bacd  cabd  dabc
abdc  badc  cadb  dacb
acbd  bcda  cbda  dbca
acdb  bcad  cbad  dbac
adbc  bdac  cdab  dcab
adcb  bdca  cdba  dcba

In Chapter 1, we showed all the permutations of the set {1, 2, 3, 4} and, of course, found 24 of these. The number of permutations of the four objects shown above can be counted as follows. To count the permutations, we need a fundamental principle first.

Counting Principle

Fundamental Counting Principle. If an event A can occur in n ways and an event B can occur in m ways, then A and B can occur in n · m ways.

The proof of this can be seen in Figure 2.1, where we have taken n = 2 and m = 3. There are 2 · 3 = 6 paths from the starting point. It is easy to see that the branches of the diagram can be generalized. The counting principle can be extended to three or more events by simply multiplying the number of ways subsequent events can occur.

Now we have four choices for the letter or object in the leftmost position. Proceeding to the right, there are three choices of letter or object to be placed in the next position. This gives 4 · 3 possible choices in total. Now we are left with two choices for the object in the next position and finally with only one choice for the rightmost position. Since there are four positions to be filled to determine a unique permutation, there are 4 · 3 · 2 · 1 = 24 possible permutations of these four objects. We denote 4 · 3 · 2 · 1 as 4! (read as "4 factorial"). We have made repeated use of the counting principle.

Figure 2.1 [tree diagram: event A has 2 branches, each followed by 3 branches for event B, giving 2 · 3 = 6 paths]

It follows that there are n! possible permutations of n distinct objects. The number of permutations of n distinct objects grows rapidly as n increases, as shown in Table 2.2.

Table 2.2
n      n!
1      1
2      2
3      6
4      24
5      120
6      720
7      5,040
8      40,320
9      362,880
10     3,628,800

Permutations with Some Objects Alike

Sometimes not all of the objects to be permuted are distinct. For example, suppose we have 3 A's, 4 B's, and 5 C's to be permuted, or 12 objects all together. There are not 12! permutations, since the A's are not distinguishable from each other, nor are the B's, nor are the C's. Suppose we let G be the number of distinct permutations and that we have a list of these permutations. Now number the A's from 1 to 3. These can be permuted in 3! ways, so, if we permute the A's in each item in our list, the list now has 3!G items. Now label the B's from 1 to 4 and permute the B's in each item in the list in all 4! ways. The list now has 4!3!G items. Finally, number the 5 C's and permute these for each item in the list. The list now contains 5!4!3!G items. But now each of the items is distinct,

so the list has 12! items. We see that 5!4!3!G = 12!, so

G = 12!/(5!4!3!) = 27,720

and this is considerably less than 12! = 479,001,600.

Permuting Only Some of the Objects

Now suppose that we have n distinct objects and we wish to permute r of them, where r ≤ n. We now have r boxes to fill. This can be done in

n · (n − 1) · (n − 2) · · · [n − (r − 1)] = n · (n − 1) · (n − 2) · · · (n − r + 1)

ways. If r < n, this expression is not a factorial, but can be expressed in terms of factorials by multiplying and dividing by (n − r)!. We see that

n · (n − 1) · (n − 2) · · · (n − r + 1) = n · (n − 1) · (n − 2) · · · (n − r + 1) · (n − r)!/(n − r)! = n!/(n − r)!

We will have little use for this formula. We derived it so that we can count the number of samples that can be chosen from a population, which we do subsequently. For the formula to work for any value of r, we define 0! = 1.

For example, suppose we have four objects (a, b, c, and d again) and that we wish to permute only two of these. We have four choices for the leftmost position and three choices for the second position, giving 4 · 3 = 12 permutations. Applying the formula we have n = 4 and r = 2, so

4P2 = 4!/(4 − 2)! = 4!/2! = (4 · 3 · 2!)/2! = 4 · 3 = 12

giving the correct result.

We remark now that the 20 applicants to the executive faced with choosing a new assistant could appear in 20! = 2,432,902,008,176,640,000 different orders. Selecting the best of the group by making a random choice means that the best applicant has a 1/20 = 0.05 chance of being selected, a fairly low probability. The executive can, as we will see, choose the best candidate with a probability approaching 1/3, but that is something we will discuss much later. So the executive must create a better procedure.

There are 52! distinct arrangements of a deck of cards. This number is of the order 8 · 10^67. It is surprising to find, if we could produce 10,000 distinct permutations of these per second, that it would take about 2 · 10^56 years to enumerate all of these. We usually associate impossible events with infinite sets, but this is an example of a finite set for which this event is impossible.
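Both counts above are quick to confirm on a computer; this sketch (Python, assumed) reproduces the 27,720 distinct arrangements of the letters, the 4P2 example, and the number of orders for the 20 applicants.

```python
from math import factorial, perm

# Distinct permutations of 3 A's, 4 B's, and 5 C's
g = factorial(12) // (factorial(5) * factorial(4) * factorial(3))
print(g)                    # 27720

print(perm(4, 2))           # 12, permutations of 2 of 4 distinct objects
print(factorial(20))        # 2432902008176640000 orders for the 20 applicants
```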

Permutations are often the basis for a sample space in a probability problem. Here are two examples.

EXAMPLE 2.1
Lining Up at a Counter

Jim, Sue, Bill, and Kate stand in line at a ticket counter. We would expect that any of the four people has a probability of 1/4 to be in any of the four positions. The sample space consists of all the possible permutations of the four people. There are 4! = 24 of these permutations, each of which, we will assume, is equally likely. If we want to find the probability that Sue is in the second place, we must count the number of ways in which she could be in the second place. To count these, first put her there—there is only one way to do this. This leaves three choices for the first place, two choices for the third place, and, finally, only one choice for the fourth place. There are then 3 · 1 · 2 · 1 = 6 ways for Sue to occupy the second place. So

P(Sue is in the second place) = (3 · 1 · 2 · 1)/(4 · 3 · 2 · 1) = 6/24 = 1/4

This is certainly no surprise.

EXAMPLE 2.2
Arranging Marbles

Five red and seven blue marbles are arranged in a row. We want to find the probability that both the end marbles are red. Number the marbles from 1 to 12, letting the red marbles be numbered from 1 to 5 for convenience. The sample space consists of all the possible permutations of 12 distinct objects, so the sample space contains 12! points. Assume that all the possible permutations, or orders, are equally likely. Now we must count the number of points in which the end points are both red. We have five choices for the marble at the left end and four choices for the marble at the right end. The remaining marbles, occupying places between the ends, can be arranged in 10! ways, so

P(end marbles are both red) = (5 · 4 · 10!)/12! = (5 · 4 · 10!)/(12 · 11 · 10!) = (5 · 4)/(12 · 11) = 5/33

COMBINATIONS

If we have n distinct objects and we choose only r of them, where the order in which the sample items are selected is of no importance, we denote the number of possible samples by C(n, r), which we read as "n choose r". We want to find a formula for this quantity and first we consider a special case. Return to the problem of counting the number of samples of size 3 that can be chosen from the set {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. We denote this number by C(10, 3). Let us suppose that we have a list of all these C(10, 3) samples. Each sample contains three distinct numbers and each sample could be permuted in 3! different ways, so each sample gives 3! permutations. If, then, each sample is permuted in all possible ways, the result might look like Table 2.3, in which only some of the possible samples are listed. There are two ways in which to view the contents of the table.

Table 2.3 [partial: each sample of size 3, such as {1,4,7}, {2,7,9}, or {6,7,10}, listed together with its 3! = 6 permutations]

First, the table, which, if shown in its entirety, would contain all the permutations of 10 objects taken 3 at a time, so the table must contain 10!/(10 − 3)! permutations, using our formula for the number of permutations of 10 objects taken 3 at a time. However, since each of the C(10, 3) combinations can be permuted in 3! ways, the total number of permutations must also be 3! · C(10, 3). It then follows that

3! · C(10, 3) = 10!/(10 − 3)!

From this, we see that

C(10, 3) = 10!/(7! · 3!) = (10 · 9 · 8 · 7!)/(7! · 3 · 2 · 1) = 120

This process is easily generalized. If we have samples of size r chosen from n distinct objects, each of these samples can be permuted in r! ways, yielding all the permutations of n objects taken r at a time, so

r! · C(n, r) = n!/(n − r)!   or   C(n, r) = n!/[r! · (n − r)!]

C(52, 5) then represents the total number of possible poker hands. This is 2,598,960. This number is small enough so that one could, given enough time, enumerate each of these. This calculation by hand would appear this way:

C(52, 5) = 52!/[5!(52 − 5)!] = 52!/(5!47!) = (52 · 51 · 50 · 49 · 48 · 47!)/(5 · 4 · 3 · 2 · 1 · 47!) = (52 · 51 · 50 · 49 · 48)/(5 · 4 · 3 · 2 · 1) = 2,598,960
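Here is a small computational companion to these counts (Python, assumed), using the standard library's comb function.

```python
from math import comb, factorial

print(comb(10, 3))                      # 120 samples of size 3 from 10 objects
print(factorial(10) // factorial(7))    # 720 = 3! * 120 ordered samples
print(comb(52, 5))                      # 2598960 possible poker hands
```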

Notice that the factors of 47! cancel from both the numerator and the denominator of the fraction above. This cancellation always occurs, and a calculation rule is that C(n, r) has r factors in the numerator and in the denominator. Specifically,

C(n, r) = n(n − 1)(n − 2) · · · [n − (r − 1)]/r!

This makes the calculation by hand fairly simple.

It is also true that

C(n, r) = C(n, n − r)

An easy way to see this is to notice that if r objects are selected from n objects, then n − r objects are left unselected. Every change in an item selected produces a change in the set of items that are not selected, and so the equality follows.

It is also true that

C(n, r) = C(n − 1, r − 1) + C(n − 1, r)

To prove this, suppose you are a member of a group of n people and that a committee of size r is to be selected. Either you are on the committee or you are not on the committee. If you are selected for the committee, there are C(n − 1, r − 1) further choices to be made. There are C(n − 1, r) committees that do not include you. So

C(n, r) = C(n − 1, r − 1) + C(n − 1, r)

Many other facts are known about the numbers C(n, r), which are also called binomial coefficients because they occur in the binomial theorem. The binomial theorem states that

(a + b)^n = sum for i = 0 to n of C(n, i) · a^(n − i) · b^i

This is fairly easy to see. Consider (a + b)^n = (a + b) · (a + b) · · · · · (a + b), where there are n factors on the right-hand side. To find the product (a + b)^n, we must choose either a or b from each of the factors on the right-hand side. There are C(n, i) ways to select i b's (and hence n − i a's). The product (a + b)^n consists of the sum of all such terms.

Many other facts are known about the binomial coefficients. We cannot explore all these here, but we will show an application, among others, to acceptance sampling. We will solve some interesting problems after stating and proving some facts about these binomial coefficients.

EXAMPLE 2.3
Arranging Some Like Objects

Let us return to the problem first encountered when we counted the permutations of objects, some of which are alike. Specifically, we wanted to count the number of distinct permutations of 3 A's, 4 B's, and 5 C's, where the individual letters are not distinguishable from one another. We found the answer was G = 12!/(5!4!3!) = 27,720. Here's another way to arrive at the answer. From the 12 positions in the permutation, choose 3 for the A's. This can be done in C(12, 3) ways. Then from the remaining nine positions, choose four for the B's. This can be done in C(9, 4) ways.

Finally, there are five positions left for the 5 C's. So the total number of permutations must be

C(12, 3) · C(9, 4) · C(5, 5) = [12!/(3! · 9!)] · [9!/(4! · 5!)] · [5!/(5! · 0!)] = 12!/(5!4!3!)

Note that we have used combinations to count permutations!

General Addition Theorem and Applications

In Chapter 1, we discussed some properties of probabilities including the addition theorem for two events: P(A ∪ B) = P(A) + P(B) − P(A ∩ B). What if we have three or more events? This addition theorem can be generalized, and we call this, following Chapter 1, the general addition theorem:

P(A1 ∪ A2 ∪ · · · ∪ An) = Σ P(Ai) − Σ P(Ai ∩ Aj) + Σ P(Ai ∩ Aj ∩ Ak) − · · · ± P(A1 ∩ A2 ∩ · · · ∩ An)

where the second sum is taken over all pairs of distinct indices, the third over all triples of distinct indices, and so on.

We could not prove this in Chapter 1 since our proof involves combinations. To prove the general addition theorem, we use a different technique from the one we used to prove the theorem for two events. Suppose a sample point is contained in exactly k of the events Ai. For convenience, number the events so that the sample point is in the first k events. Now we show that the probability of the sample point is contained exactly once in the right-hand side of the theorem. The point is counted on the right-hand side

C(k, 1) − C(k, 2) + C(k, 3) − · · · ± C(k, k)

times. But consider the binomial expansion

0 = [1 + (−1)]^k = 1 − C(k, 1) + C(k, 2) − C(k, 3) + · · · ∓ C(k, k)

which shows that

C(k, 1) − C(k, 2) + C(k, 3) − · · · ± C(k, k) = 1

So the sample point is counted exactly once, proving the theorem.

The principle we used here is that of inclusion and exclusion and is of great importance in discrete mathematics. It could also have been used in the case k = 2.

EXAMPLE 2.4
Checking Hats

Now we return to the problem of Chapter 1 where n diners have checked their hats and we seek the probability that at least one diner is given his own hat at the end of the evening. Let the events Ai denote the event "diner i gets his own hat," so we seek P(A1 ∪ A2 ∪ · · · ∪ An) using the general addition theorem. Suppose diner i gets his own hat. There are (n − 1)! ways for the remaining hats to be distributed, so P(Ai) = (n − 1)!/n!. There are C(n, 1) ways for a single diner to be selected. In a similar way, if diners i and j get their own hats, given the correct hat to diner i, the remaining hats can be distributed in (n − 2)! ways, so P(Ai ∩ Aj) = (n − 2)!/n!. There are C(n, 2) ways for two diners to be chosen. Clearly, this argument can be continued. We then find that

P(A1 ∪ A2 ∪ · · · ∪ An) = C(n, 1) · (n − 1)!/n! − C(n, 2) · (n − 2)!/n! + C(n, 3) · (n − 3)!/n! − · · · ± C(n, n) · (n − n)!/n!

which simplifies easily to

P(A1 ∪ A2 ∪ · · · ∪ An) = 1/1! − 1/2! + 1/3! − · · · ± 1/n!

Table 2.4 shows some numerical results from this formula.

Table 2.4
n     p
1     1.00000
2     0.50000
3     0.66667
4     0.62500
5     0.63333
6     0.63194
7     0.63214
8     0.63212
9     0.63212

It is perhaps surprising that, while the probabilities fluctuate a bit, they appear to approach a limit. It can be shown that this limit is 1 − 1/e. To six decimal places, the probability that at least one diner gets his own hat is 0.632121 for n ≥ 9.
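The inclusion-exclusion sum is easy to evaluate by machine; this sketch (Python, assumed) reproduces Table 2.4 and the limiting value 1 − 1/e.

```python
from math import factorial, e

def p_at_least_one_match(n):
    # 1/1! - 1/2! + 1/3! - ... up to 1/n!
    return sum((-1) ** (k + 1) / factorial(k) for k in range(1, n + 1))

for n in range(1, 10):
    print(n, round(p_at_least_one_match(n), 5))

print(1 - 1 / e)    # 0.632120..., the limiting value
```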

EXAMPLE 2.5
Aces and Kings

Now we can solve the problem involving a real drug and a placebo given at the beginning of Chapter 1. This is the same as finding the probability that all the users of the real drug will occur before any of the users of the placebo. To make an equivalent problem, suppose we seek the probability that when the cards are turned up one at a time in a shuffled deck of 52 cards all the aces will turn up before any of the kings. The first insight into the card problem is that the remaining 44 cards have absolutely nothing to do with the problem. We need to only concentrate on the eight aces and kings. Assume that the aces are indistinguishable from one another and that the kings are indistinguishable from one another. There are then C(8, 4) = 70 possible orders for these cards. Order is not important, and since the suits are not important, only one of them has all the aces preceding all the kings, so the probability is 1/70.

EXAMPLE 2.6
Poker

We have seen that there are C(52, 5) = 2,598,960 different hands that can be dealt in playing poker. We will calculate the probabilities of several different hands. We will see that the special hands have very low probabilities of occurring. Caution is advised in calculating the probabilities: choose the values of the cards first and then the actual cards. Here are some of the possible hands and their probabilities.

(a) Royal flush. This is a sequence of 10 through ace in a single suit. Since there are four of these, the probability of a royal flush is 4/C(52, 5) = 0.000001539.

(b) Four of a kind. This hand contains all four cards of a single value plus another card that must be of another value. Since there are 13 values to choose from for the four cards of a single value (and only one way to select them), then 12 possible values for the fifth card, and then 4 choices for a card of that value, the probability of this hand is 13 · 12 · 4/C(52, 5) = 0.0002401.

(c) Straight. This is a sequence of five cards regardless of suit. There are nine possible sequences, 2 through 6, 3 through 7, ..., 10 through ace. Since there are nine possible sequences and four choices for each of the five cards in the sequence, the probability of a straight is 9 · 4^5/C(52, 5) = 0.003546.

(d) Two pairs. There are C(13, 2) choices for the values of the pairs and then C(4, 2) choices for the two cards in the first pair and C(4, 2) choices for the two cards in the second pair. Finally, there are 11 choices for the value of the fifth card and 4 choices for that card. So the probability of two pairs is C(13, 2) · C(4, 2) · C(4, 2) · 11 · 4/C(52, 5) = 0.04754.

(e) Other special hands are three of a kind (three cards of one value and two other cards of different values), full house (one pair and one triple), and one pair. The probabilities of these hands are 0.02113, 0.0014406, and 0.422569, respectively. The most common hand is the one with five different values. This has probability C(13, 5) · 4^5/C(52, 5) = 0.507083. The probability of a hand with at least one pair is then 1 − 0.507083 = 0.492917.

EXAMPLE 2.7
Race Cars

One hundred race cars, numbered from 1 to 100, are running around a race course. We observe a sample of five, noting the numbers on the cars, and then calculate the median (the number in the middle when the sample is arranged in order). If the median is m, then we must choose two that are less than m and then two that are greater than m. This can be done in C(m − 1, 2) · C(100 − m, 2) ways. So the probability that the median is m is

C(m − 1, 2) · C(100 − m, 2)/C(100, 5)

A graph of this function of m is shown in Figure 2.2. The most likely value of m is 50 or 51, each having probability 0.0191346.

Figure 2.2 [probability of each possible median m, plotted for m between 20 and 80; the curve peaks at the center of the range]

The race car problem is hardly a practical one. A more practical problem is this. How many cars are racing around the track, that is, what is n? This problem actually arose during World War II. The Germans numbered all kinds of war materiel and their parts. When we captured some tanks, we could then estimate the total number of tanks they had from the serial numbers on the captured tanks. Here we will consider maximum likelihood estimation: we will estimate n as the value that makes the sample median we observed most likely. Suppose, that is, we have taken a random sample of size 5 and we find that the median of the sample is 8. If there are n tanks, then the probability the median of a sample of 5 tanks is m is

C(m − 1, 2) · C(n − m, 2)/C(n, 5)

Now let us compute a table of values of n and these probabilities, letting m = 8 for various values of n. This is shown in Table 2.5.

Table 2.5
n     Probability
10    0.0833333
11    0.136364
12    0.159091
13    0.16317
14    0.157343
15    0.146853
16    0.134615
17    0.122172
18    0.110294
19    0.0993292
20    0.0893963
21    0.0804954
22    0.0725678
23    0.0655294
24    0.0592885
25    0.0537549

A graph is helpful (Figure 2.3).

Figure 2.3 [the probabilities in Table 2.5 plotted against n for n = 10 to 25]

We see that the maximum probability occurs when n = 13, so we have found the maximum likelihood estimator for n. The computer is of great value here in carrying out a fairly simple idea. It should be added here that this is not the optimal solution for the German tank problem. It should be clear that the maximum value in the sample carries more information about n than does the median, but the mathematical solution of this problem would be a stretch for most students in this course. We will return to this problem in Chapter 12.
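Table 2.5 is exactly the kind of computation the text suggests handing to a computer; this sketch (Python, assumed) rebuilds it and picks out the maximizing value of n.

```python
from math import comb

def likelihood(n, m=8, sample_size=5):
    # Probability that the median of a sample of 5 from 1..n equals m
    return comb(m - 1, 2) * comb(n - m, 2) / comb(n, sample_size)

probs = {n: likelihood(n) for n in range(10, 26)}
for n, p in probs.items():
    print(n, round(p, 6))

print(max(probs, key=probs.get))   # 13, the maximum likelihood estimate
```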

Now we return to the executive who is selecting the best assistant.

EXAMPLE 2.8
Choosing an Assistant

An executive in a company has an opening for an executive assistant. Twenty candidates have applied for the position. The executive is constrained by company rules that say that candidates must be told whether they are selected or not at the time of an interview. How should the executive proceed so as to maximize the chance that the best candidate is selected? We have already seen that a random choice of candidate selects the best one with probability 1/20 = 0.05, so it is not a very sensible strategy.

It is probably clear, assuming the candidates appear in random order, that we should deny the job to a certain number of candidates while noting which one was best in this first group. If we interview one candidate, noting his or her ranking, and reject him or her, then we should choose the next candidate who is better than the best candidate in the first group (or the last candidate if a better candidate does not appear).

This strategy has surprising consequences. To illustrate what follows from it, let us consider a small example of four candidates whom we might as well number 1, 2, 3, 4, with 4 being the best candidate and 1 the worst candidate. The candidates can appear in 4! = 24 different orders that are shown in Table 2.6. So we let one or two candidates pass, noting the best of these. Then we choose the next candidate better than the best in the group we passed. It is only sensible to pass by one or two candidates since we will choose the fourth candidate if we let three candidates pass by, and the probability of choosing the best candidate is then 1/4. So if the candidates appeared in the order 3214, then we would pass by the candidate ranked 3; the next best candidate is 4. These rankings appear in Table 2.6 under the column labeled "Pass 1."

Table 2.6
Order   Pass 1   Pass 2      Order   Pass 1   Pass 2
1234    2        3           3124    4        4
1243    2        4           3142    4        4
1324    3        4           3241    4        4
1342    3        4           3214    4        4
1423    4        3           3412    4        2
1432    4        2           3421    4        1
2134    3        3           4123    3        3
2143    4        4           4132    2        2
2314    3        4           4231    1        1
2341    3        4           4213    3        3
2413    4        3           4312    2        2
2431    4        1           4321    1        1

The column headed "Pass 1" indicates the final choice when the first candidate is passed by and the next candidate ranking higher than the first candidate is selected. Similarly, the column headed "Pass 2" indicates the final choice when the first two candidates are passed by and the next candidate ranking higher than the first two candidates is selected. If we examine the rankings and their frequencies in the "Pass 1" column, we get Table 2.7.

Table 2.7 Passing the First Candidate
Ranking    Frequency
1          2
2          4
3          7
4          11

Interestingly, the most frequent choice is 4! The average of the rankings is 3.125, so we do fairly well. A little forethought would reduce the number of permutations we have to list. Consider the plan to pass the first candidate by. If 4 appears in the first position, we will not choose the best candidate. If 4 appears in the second position, we will choose the best candidate; and if 3 appears in the first position, we will choose the best candidate, so we did not need to list 17 of the 24 permutations. Beyond that this procedure is not very sensible. Similar comments will apply to the plan to pass the first two candidates by, noting the best, and then choose the candidate with the better ranking (or the last candidate).

If we interview the first two candidates, we find the rankings in Table 2.8, which summarizes the choices in the column of Table 2.6 labeled "Pass 2."

Table 2.8 Passing the First Two Candidates
Ranking    Frequency
1          4
2          4
3          6
4          10

We do somewhat less well. The average ranking here is 2.917, but still better than a random choice.

It is possible to list the permutations of five or six candidates and calculate the average choice. The results for five candidates are shown in Table 2.9, where the first candidate is passed by.

Table 2.9 Five Candidates Passing the First One
Ranking    Frequency
1          6
2          12
3          20
4          32
5          50

The plan still does very well. The average rank selected is 3.90, and we see with the plans presented here that we get the highest ranked candidate at least 42% of the time.
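The same bookkeeping can be done by brute force for any small number of candidates; this sketch (Python, assumed) recomputes the "pass k" frequencies behind Tables 2.7, 2.8, and 2.9.

```python
from itertools import permutations
from collections import Counter

def chosen(order, k):
    best_passed = max(order[:k])
    for candidate in order[k:]:
        if candidate > best_passed:
            return candidate
    return order[-1]                  # no better candidate ever appeared

for n, k in [(4, 1), (4, 2), (5, 1)]:
    freq = Counter(chosen(p, k) for p in permutations(range(1, n + 1)))
    print(n, k, dict(sorted(freq.items())))
# (4,1): {1: 2, 2: 4, 3: 7, 4: 11}    (4,2): {1: 4, 2: 4, 3: 6, 4: 10}
# (5,1): {1: 6, 2: 12, 3: 20, 4: 32, 5: 50}
```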

Generalizing the plan, however, is not so easy. So we allow the first candidate to pass by until n = 6 and then let the first two candidates pass by until n = 9, and so on. It can be shown that the optimal plan passes the first [n/e] candidates (where the brackets indicate the greatest integer function) and that the probability of selecting the best candidate out of the n candidates approaches 1/e as n increases. This problem makes an interesting classroom exercise that is easily simulated with a computer that can produce random permutations of n integers.

EXAMPLE 2.9
Acceptance Sampling

Now let us discuss an acceptance sampling plan. Suppose a lot of 100 items manufactured in an industrial plant actually contains items that do not meet either the manufacturer's or the buyer's specifications. Let us denote these items by calling them D items, while the remainder of the manufacturer's output, those items that do meet the manufacturer's and the buyer's specifications, we will call G items. Now the manufacturer wishes to inspect a random sample of the items produced by the production line. It may be that the inspection process destroys the product or that the inspection process is very costly, so the manufacturer uses sampling and so inspects only a portion of the manufactured items.

As an example, suppose the lot of 100 items actually contains 10 D items and 90 G items and that we select a random sample of 5 items from the entire lot produced by the manufacturer. There are C(100, 5) = 75,287,520 possible samples. Suppose we want the probability that the sample contains exactly three of the D items. Since we assume that each of the samples is equally likely, this probability is

P(D = 3) = C(10, 3) · C(90, 2)/C(100, 5) = 0.00638353

making it fairly unlikely that this sample will find three of the items that do not meet specifications. It may be of interest to find the probabilities for all the possible values of D. That is, we want to find the values of the function

f(d) = C(10, d) · C(90, 5 − d)/C(100, 5)   for d = 0, 1, 2, 3, 4, 5

This is often called the probability distribution of the random variable D. A graph of this function is shown in Figure 2.4.

What should the manufacturer do if items not meeting specifications are discovered in the sample? Normally, one of two courses is followed: either the D items found in the sample are

replaced by G items, or the entire lot is inspected and any D items found in the entire lot are replaced by G items. The last course is usually followed if the sample does not exhibit too many D items and, of course, can only be followed if the sampling is not destructive. If the sample does not contain too many D items, the lot is accepted and sent to the buyer, perhaps after some D items in the sample are replaced by G items. Otherwise, the lot is rejected. The process is called acceptance sampling. This clearly will improve the quality of the lot of items sold, but it is not clear how much of an improvement will result. The process has some surprising consequences, and we will now explore this procedure.

Figure 2.4 [the probability distribution of D, the number of D items in the sample, for d = 0, 1, ..., 5]

We will explore the second possibility noted above here. To be specific, let us suppose that the lot is accepted only if the sample contains no D items whatsoever, namely, that if any D items at all are found in the sample, then the entire lot is inspected and any D items in it are replaced with G items. Hence, the entire delivered lot consists of G items when the sample detects any D items at all. Let us also assume that we do not know how many D items are in the lot, so we will suppose that there are d of these in the lot. The lot is then accepted with probability

P(D = 0) = C(100 − d, 5)/C(100, 5)

This is a decreasing function of d; the larger d, the more likely the sample will contain some D items and hence the lot will not be accepted. A graph of this function is shown in Figure 2.5.

Figure 2.5 [the probability the lot is accepted, P(D = 0), plotted as a decreasing function of d]

Finally, we consider the average percentage of D items delivered to the customer with this acceptance sampling plan. This is often called the average outgoing quality (AOQ) in the quality control literature. The average of a quantity is found by multiplying the values of that quantity by the probability of that quantity and adding the results. So if a random variable is D, whose specific values are d, then the average value of D is

Σ (over all values of d) d · P(D = d)

Here we wish to find the average value of the percentage of D items delivered to the buyer, or the average of the quantity d/100. This is the average outgoing quality,

AOQ = Σ (over all values of d) (d/100) · P(D = d)

But we have a very special circumstance here. The delivered lot has percentage of D items equal to d/100 only if the sample contains no D items whatsoever; otherwise, the lot has 0% D items due to the replacement plan. So the average outgoing quality is

AOQ = (d/100) · P(D = 0) + 0 · P(D ≠ 0)

so

AOQ = (d/100) · P(D = 0)   or   AOQ = (d/100) · C(100 − d, 5)/C(100, 5)

A graph of this function is shown in Figure 2.6. We notice that the graph attains a maximum value. This means that regardless of the quality of the lot, there is a maximum for the average percentage of D items that can be delivered to the customer! This may not have been anticipated. This maximum can be found using a computer and the above graph. We see that the maximum AOQ occurs when d = 16, so the maximum average percentage of D items that can be delivered to the customer is 0.066.

Figure 2.6 [AOQ as a function of d; the curve rises to a maximum near d = 16 and then falls]
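The maximum of the AOQ curve is easy to locate numerically; this sketch (Python, assumed) evaluates the formula above for every d and reports the largest value.

```python
from math import comb

def aoq(d, lot=100, sample=5):
    # Average outgoing quality when the lot is accepted only on a clean sample
    return (d / lot) * comb(lot - d, sample) / comb(lot, sample)

values = {d: aoq(d) for d in range(0, 96)}
best = max(values, key=values.get)
print(best, round(values[best], 5))     # d = 16, AOQ about 0.0656
```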

Table 2.10 shows the values of the AOQ near the maximum value.

Table 2.10
d     AOQ
14    0.06476
15    0.06535
16    0.06561
17    0.06556
18    0.06523

Sampling here has had a dramatic impact on the average percentage of D items delivered to the customer. This is just one example of how probability and statistics can assist in delivering high-quality product to consumers. We will continue our discussion of production methods and the role probability and statistics can play in producing more acceptable product in the chapter on quality control and statistical process control. There are many other techniques used that are called in general statistical process control, or SPC, methods; these have found wide use in industry today. Statistical process control is the subject of Chapter 11.

CONCLUSIONS

We have explored permutations and combinations in this chapter and have applied them to several problems, most notably a plan for choosing the best candidate from a group of applicants for a position and acceptance sampling, where we found that sampling does improve the quality of the product sold and actually puts a limit on the percentage of unacceptable product sold. We continue in the following two chapters with a discussion of conditional probability and geometric probability. Each of these topics fits well into a course in geometry.

EXPLORATIONS

1. Use the principle of inclusion and exclusion to prove the general addition theorem for two events.

2. Find the probability a poker hand has (a) exactly two aces, (b) exactly one pair.

3. What is the probability a bridge hand (13 cards from a deck of 52 cards) does not contain a heart?

4. Simulate Example 2.8: Choose random permutations of the integers from 1 to 10 with 10 being the most qualified assistant. Compare the probabilities of making the best choice by letting 1, 2, or 3 applicants go by and then choosing the best applicant better than the best of those passed by.

5. Simulate Example 2.4: Choose random permutations of the integers from 1 to 5 and count the number of instances of an integer being in its own place. Then count the number of derangements, that is, permutations where no integer occupies its own place.

6. In Example 2.7, how would you use the maximum of the sample in order to estimate the maximum of the population?

7. A famous correspondence between the Chevalier de Mere and Blaise Pascal, each of whom made important contributions to the theory of probability, contained the question, "Which is more likely—at least one 6 in 4 rolls of a fair die or at least one sum of 12 in 24 rolls of a pair of dice?" Show that the two events are almost equally likely. Which is more likely?

Chapter 3

Conditional Probability

CHAPTER OBJECTIVES:
• to consider some problems involving conditional probability
• to show diagrams of conditional probability problems
• to show how these probability problems can be solved using only the area of a rectangle
• to show connections with geometry
• to show how a test for cancer and other medical tests can be misinterpreted
• to analyze the famous Let's Make a Deal problem

INTRODUCTION

A physician tells a patient that a test for cancer has given a positive response (indicating the possible presence of the disease). In this particular test, the physician knows that the test is 95% accurate both for patients who have cancer and for patients who do not have cancer. The test appears at first glance to be quite accurate. How is it then, based on this test alone, that this patient almost certainly does not have cancer? We will explain this.

I have a class of 65 students. I regard one of my students as an expert since she never makes an error. I regret to report that the remaining students are terribly ignorant of my subject, and so guess at each answer. I gave a test with six true-false questions; a paper I have selected at random has each of these questions answered correctly. What is the probability that I have selected the expert's paper? The answer may be surprising.

Each of these problems can be solved using what is usually known in introductory courses as Bayes' theorem. We will not need this theorem at all to solve these problems. No formulas are necessary, except that for the area of a rectangle, so these problems can actually be explained to many students of mathematics.

These problems often cause confusion since it is tempting to interchange two probabilities. In the cancer example, the probability that the test is positive if the patient has cancer (0.95 in our example) is quite different from the probability that the patient has cancer if the test is positive (only 0.087!). We will explain how this apparent difference arises and how to calculate the conditional probability. Similarly, in the second example, despite the large size of my class, the probability I am looking at the expert's paper is only 1/2. We will also discuss the famous Let's Make a Deal problem. We begin with two simple examples and then solve the problems above. Now let us see how to tackle these problems in a simple way.

EXAMPLE 3.1
For the Birds

A pet store owner purchases bird seed from two suppliers, buying 3/4 of his bird seed from one supplier and the remainder from another. The germination rate of the seed from the first supplier is 75% whereas the germination rate from the second supplier is 60%. If seed is randomly selected and germinates, what is the probability that it came from the first supplier? We will show how to make this problem visual, and we will see that it is fairly simple to analyze.

We begin with a square of side 1, as shown in Figure 3.1. The total area of the square is 1, so portions of the total area represent probabilities. Along the horizontal axis we have shown the point at 3/4, indicating the proportion of seed purchased from the first supplier. The remainder of the axis is of length 1/4, indicating the proportion of seed purchased from the second supplier. Along the vertical axis, we have shown a horizontal line at 75%, the proportion of the first supplier's seed that germinates. The area of the rectangle formed is 3/4 · 0.75, which is the probability that the seed came from the first supplier and germinates. This is shaded in Figure 3.1. We have also shown a horizontal line at the 60% mark along the vertical axis, indicating that this is the percentage of the second supplier's seed that germinates. The area of the unshaded rectangle with height 0.60 and whose base is along the horizontal axis is 1/4 · 0.60,

Figure 3.1 [the unit square with a vertical division at 3/4 and horizontal lines at 0.75 and 0.6]

representing the probability that the seed came from the second supplier and germinates. Now the total area of the two rectangles is 3/4 · 0.75 + 1/4 · 0.60 = 0.7125; this is the probability that the seed germinates regardless of the supplier. We now want to find the probability that germinating seed came from the first supplier. This is the portion of the two rectangles that is shaded, or

(3/4 · 0.75)/(3/4 · 0.75 + 1/4 · 0.60) = 0.789474

We emphasize here that the condition is the crucial distinction. It matters greatly whether we are given the condition that the seed came from the first supplier (in which case the germination rate is 75%) or that the seed germinated (in which case the probability it came from the first supplier is 78.9474%). Confusing these two conditions can lead to erroneous conclusions, as some of our examples will show. This problem and the others considered in this chapter are known as conditional probability problems. They are usually solved using Bayes' theorem, but, as we have shown above, all that is involved is areas of rectangles.

EXAMPLE 3.2
Driver's Ed

In a high school, 60% of the students take driver's education. Of these, 4% have an accident in a year. Of the remaining students, 8% have an accident in a year. A student has an accident; what is the probability he or she took driver's ed? Note that the probability a driver's ed student has an accident (4%) may well differ from the probability he took driver's ed if he had an accident. Let us see if this is so. The easiest way to look at this is again to draw a picture.

In Figure 3.2 we show the unit square as we did in the first example. Along the horizontal axis, we have shown 60% of the axis for students who take driver's ed while the remainder of the horizontal axis, 40%, represents students who do not take driver's education. Now of the 60% who take driver's ed, 4% have an accident, so we show a line on the vertical axis at 4%.

Figure 3.2 [the unit square with a vertical division at 60% and horizontal lines at 4% and 8%]

The area of this rectangle, 0.6 · 0.04 = 0.024, represents students who both take driver's ed and have an accident. This rectangle has been shaded. (The scale has been exaggerated here so that the rectangles we are interested in can be seen more easily; the relative areas should not be judged visually.) Above the 40% who do not take driver's ed, we show a line at 8%. This rectangle on the right has area 0.4 · 0.08 = 0.032, representing students who do not take driver's ed and who have an accident. This is unshaded in the figure. The area of the two rectangles then represents the proportion of students who have an accident, or 0.024 + 0.032 = 0.056. Now we want the probability that a student who has an accident has taken driver's ed. This is the portion of that area that arises from the left-hand rectangle, or 0.024/0.056 = 3/7, about 42.9%. It follows that 4/7, about 57.1%, of the students who have accidents did not take driver's ed. It is clear then that the probability a student has an accident if he took driver's ed (4%) differs markedly from the probability a student who had an accident took driver's ed (about 42.9%). Note that the first refers to the group of students who took driver's ed, whereas the second relates to students who have had accidents. The condition has a great effect on the probability, and it is not uncommon for these probabilities to be mixed up.

SOME NOTATION

Now we pause in our examples to consider some notation. From this point on, we will use the notation P(A|B) to denote "the probability that an event A occurs given that an event B has occurred." In the bird seed example, we let S1 and S2 denote the suppliers and let G denote the event that the seed germinates. We then have P(G|S1) = 0.75 and P(G|S2) = 0.60, and we found that P(S1|G) = 0.789474, so again the condition is crucial.

EXAMPLE 3.3
The Expert in My Class

Recall that our problem is this: I have a class of 65 students. I regard one of my students as an expert since she never makes an error. I regret to report that the remaining students are terribly ignorant of my subject, and so guess at each answer. I gave a test with six true-false questions; a paper I have selected at random has each of these questions answered correctly. What is the probability that I have selected the expert's paper? One might think this is certain, but it is not.

This problem can be solved in much the same way that we solved the previous two problems. Begin again with a square of side 1. On the horizontal axis, we indicate the probability we have chosen the expert (1/65) and the probability we have chosen an ordinary student (64/65). The chance of choosing the expert is small, so we have exaggerated the scale in Figure 3.3 for reasons of clarity.

Now the expert answers the six true-false questions on my examination with probability 1. This has been indicated on the vertical axis. The area of the corresponding rectangle is then

(1/65) · 1 = 1/65

This represents the probability that I chose the expert and that all the questions are answered correctly. Now the remainder of my students are almost totally ignorant of my subject and so they guess the answers to each of my questions. The probability that one of these students answers all the six questions correctly is then (1/2)^6 = 1/64. This has been indicated on the right-hand vertical scale in Figure 3.3. The rectangle corresponding to the probability that a nonexpert answers all six of the questions correctly then has area

(64/65) · (1/64) = 1/65

Figure 3.3 [the unit square with a vertical division at 1/65 and a horizontal line at 1/64 on the right-hand scale]

Let us introduce some notation here. Let E denote the expert and Ē denote a nonexpert, and let A denote the event that the student responds correctly to each question. We found that P(A|E) = 1 while we want P(E|A), so again the condition is crucial. So the total area, shown as two rectangles in Figure 3.3, corresponding to the probability that all six of the questions are answered correctly, is

P(A) = P(A and E) + P(A and Ē) = (1/65) · 1 + (64/65) · (1/64) = 2/65

The portion of this area corresponding to the expert is

P(E|A) = [(1/65) · 1]/[(1/65) · 1 + (64/65) · (1/64)] = 1/2

The shaded areas in Figure 3.3 are in reality equal, although, again, the small probabilities involved force us to exaggerate the picture somewhat.

Recall that the statement P(A) = P(A and E) + P(A and Ē) is often known as the Law of Total Probability. We add the probabilities here of mutually exclusive events, namely, A and E and A and Ē (represented by the nonoverlapping rectangles), but they do not appear to be equal. The law can be extended to more than two mutually exclusive events.

EXAMPLE 3.4
The Cancer Test

We now return to the medical example at the beginning of this chapter. We interpret the statement that the cancer test is accurate for 95% of patients with cancer and 95% accurate for patients without cancer as

P(T+|C) = 0.95 and P(T+|C̄) = 0.05

where C indicates the presence of cancer, C̄ its absence, and T+ means that the test indicates a positive result or the presence of cancer. P(T+|C̄) is known as the false positive rate for the test since it produces a positive result for noncancer patients. Supposing that a small percentage of the population has cancer, we assume in this case that P(C) = 0.005. This assumption will prove crucial in our conclusions.

A picture will clarify the situation. Figure 3.4 shows along the horizontal axis the probability that a person has cancer, 0.005. Along the vertical axis are the probabilities that a person shows a positive test for each of the two groups of patients, 0.95 and 0.05. Again, the small probabilities involved force us to exaggerate the picture somewhat.

Figure 3.4 [the unit square with a vertical division at 0.005 and horizontal lines at 0.95 and 0.05]
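The rectangle areas just described are quick to evaluate; this sketch (Python, assumed) computes the probability of cancer given a positive test for the stated incidence rate and for a few other rates.

```python
def p_cancer_given_positive(r, p=0.95):
    # r = incidence rate of the disease, p = test accuracy for both groups
    return (r * p) / (r * p + (1 - r) * (1 - p))

print(p_cancer_given_positive(0.005))        # about 0.087
for r in (0.001, 0.01, 0.1, 0.5):
    print(r, round(p_cancer_given_positive(r), 3))
```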

It is clear that the probability that a person shows a positive test is

P(T+) = 0.005 · 0.95 + 0.995 · 0.05 = 0.0545

The portion of this area corresponding to the people who actually have cancer is then

P(C|T+) = (0.005 · 0.95)/(0.005 · 0.95 + 0.995 · 0.05) = 0.00475/0.0545 = 0.087

This is surprisingly low. We emphasize, however, that the test should not be relied upon alone; one should have other indications of the disease as well. We also note here that the probability that a person testing positive actually has cancer highly depends upon the true proportion of people in the population who are actually cancer patients. Most diseases, however, have small incidence rates, so the false positive rate for these tests is a very important number.

Let us suppose that this true proportion is r, so that r represents the incidence rate of the disease. Replacing the proportion 0.005 by r and the proportion 0.995 by 1 − r in the above calculation, we find that the proportion of people who test positive and actually have the disease is

P(C|T+) = (r · 0.95)/(r · 0.95 + (1 − r) · 0.05)

This can be simplified to

P(C|T+) = 19r/(1 + 18r)

A graph of this function is shown in Figure 3.5. We see that the test is quite reliable when the incidence rate for the disease is large. It is also interesting

Figure 3.5 [P(C|T+) plotted as a function of the incidence rate r, rising steeply from 0 toward 1]

Now

P(C | T+) = r · p / (r · p + (1 − r) · (1 − p)).

The surface showing this probability as a function of both r and p is shown in Figure 3.6. Clearly, the accuracy of the test increases as r and p increase; ours has a low probability since we are dealing with the lower left corner of the surface.

[Figure 3.6: surface plot of P(C | T+) as a function of r and p.]

EXAMPLE 3.5 Let's Make a Deal

In this TV game show, a contestant is presented with three doors, one of which contains a valuable prize while the other two are empty. The contestant is allowed to choose one door. The show host, say Monty Hall, opens one door and shows that it is empty. He now offers the contestant the opportunity to change the choice of doors. Should the contestant switch, or doesn't it matter? It matters. If the contestant does not switch, the probability is 1/3 that the prize is won, regardless of the choice made. The contestant who switches has probability 2/3 of winning the prize. In thinking about the problem, note that when the empty door is revealed, the game does not suddenly become choosing one of the two doors that contains the prize; at least one, and perhaps two, of the remaining doors is empty. The problem here is that sometimes Monty Hall has one choice of door to show empty and sometimes he has two choices of doors that are empty. This must be accounted for in analyzing the problem.
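The 1/3 versus 2/3 claim can also be checked by simulation before doing any analysis. Here is a minimal Python sketch (not from the text; the door labels and trial count are arbitrary choices).

```python
import random

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        prize = random.randrange(3)       # door hiding the prize
        choice = random.randrange(3)      # contestant's first pick
        # Host opens an empty door other than the contestant's choice
        opened = random.choice([d for d in range(3) if d != choice and d != prize])
        if switch:
            # Move to the one remaining closed door
            choice = next(d for d in range(3) if d != choice and d != opened)
        wins += (choice == prize)
    return wins / trials

print(play(switch=False))   # close to 1/3
print(play(switch=True))    # close to 2/3
```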

An effective classroom strategy at this point is to try the experiment several times, perhaps using large cards that must be shuffled thoroughly before each trial; some students can use the "never switch" strategy whereas others can use the "always switch" strategy and the results compared. This experiment alone is enough to convince many people that the switching strategy is the superior one.

Now we analyze the problem using geometry. To be specific, let us call the door that the contestant chooses door 1 and the door that the host opens door 2. Now we need some notation. Let Pi, i = 1, 2, 3, denote the event "prize is behind door i" and let D be the event "door 2 is opened." We assume that P1 = P2 = P3 = 1/3. Then P(D|P1) = 1/2, since in that case the host has a choice of two doors to open; P(D|P2) = 0, since the host will not open the door showing the prize; and P(D|P3) = 1, since in this case door 2 is the only one that can be opened to show no prize behind it. Our unit square is shown in Figure 3.7. It is clear that the shaded area in Figure 3.7 represents the probability that door 2 is opened. The probability that the contestant wins if he switches is then the proportion of this area corresponding to door 3. This is

P(P3 | D) = [(1/3) · 1] / [(1/3) · (1/2) + (1/3) · 0 + (1/3) · 1] = 2/3.

The symmetry of the problem tells us that this is a proper analysis of the general situation; hence, the contestant should switch.

[Figure 3.7: unit square divided into thirds labeled A1, A2, A3 along the horizontal axis, with heights 1/2, 0, and 1.]

Another, perhaps more intuitive, way to view the problem is this: when the first choice is made, the contestant has probability 1/3 of winning the prize. The probability that the prize is behind one of the other doors is 2/3. Revealing one of the doors to be empty does not alter these probabilities.

BAYES' THEOREM

Bayes' theorem is stated here although, as we have seen, problems involving it can be done geometrically.

Bayes' theorem: If S = A1 ∪ A2 ∪ ··· ∪ An, where Ai and Aj have no sample points in common if i ≠ j, then if B is an event,

P(Ai | B) = P(Ai ∩ B) / P(B)

or

P(Ai | B) = P(Ai) · P(B | Ai) / [P(A1) · P(B | A1) + P(A2) · P(B | A2) + ··· + P(An) · P(B | An)]

or

P(Ai | B) = P(Ai) · P(B | Ai) / Σ (from j = 1 to n) P(Aj) · P(B | Aj).

In Example 3.5 (Let's Make a Deal), Ai is the event "Prize is behind door i" for i = 1, 2, 3. (We used Pi in Example 3.5.) B was the event "door 2 is opened."

CONCLUSIONS

The problems considered above should be interesting and practical for our students. I think our students should have mastery of these problems and others like them since data of this kind are frequently encountered in various fields. Mathematically, the analyses given above are equivalent to those found by using a probability theorem known as Bayes' theorem. The geometric model given here shows that this result need not be known since it follows so simply from the area of a rectangle. The analysis given here should make these problems accessible to elementary students of mathematics.

EXPLORATIONS

1. Three methods, say A, B, and C, are sometimes used to teach an industrial worker a skill. The methods fail to instruct with rates of 20%, 10%, and 30%, respectively. Cost considerations restrict method A to be used twice as often as B, which is used twice as often as C. If a worker fails to learn the skill, what is the probability that she was taught by method A?

2. Binary symbols (0 or 1) sent over a communication line are sometimes interchanged. The probability that a 0 is changed to 1 is 0.1, while the probability that a 1 is changed to 0 is 0.2. The probability that a 0 is sent is 0.4 and the probability that a 1 is sent is 0.6. If a 1 is received, what is the probability that a 0 was sent?

3. Sample surveys are often subject to error because the respondent might not truthfully answer a sensitive question such as "Do you use illegal drugs?" A procedure known as randomized response is sometimes used. Here is how that works. A respondent is asked to flip a coin and not reveal the result. If the coin comes up heads, the respondent answers the sensitive question; otherwise he

responds to an innocuous question such as "Is your Social Security number even?" So if the respondent answers "Yes," it is not known to which question he is responding. Show, however, that with a large number of respondents, the frequency of illegal drug use can be determined.

4. Combine cards from several decks and create another deck of cards with, say, 12 aces and 40 other cards. Have students select a card and not reveal whether it is an ace or not. If an ace is chosen, have the students answer the question about illegal drugs in the previous exploration and otherwise answer an innocuous question. Then approximate the use of illegal drugs.

5. A certain brand of lie detector is accurate with probability 0.92; that is, if a person is telling the truth, the detector indicates he is telling the truth with probability 0.92, while if he is lying, the detector indicates he is lying with probability 0.92. Assume that 98% of the subjects of the test are truthful. What is the probability that a person is lying if the detector indicates he is lying?
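Explorations 1, 2, and 5 are direct applications of the theorem above. For readers who want to check their answers numerically, here is a hedged Python sketch of a general Bayes' theorem helper; it is illustrated with the Let's Make a Deal numbers from Example 3.5 rather than with the explorations themselves, and the function name is only a suggestion.

```python
def bayes(priors, likelihoods):
    """Return P(A_i | B) for every i, given the P(A_i) and P(B | A_i)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)            # P(B), by the Law of Total Probability
    return [j / total for j in joint]

# Example 3.5: A_i = "prize is behind door i", B = "door 2 is opened"
priors = [1/3, 1/3, 1/3]
likelihoods = [1/2, 0, 1]         # P(B | A1), P(B | A2), P(B | A3)
print(bayes(priors, likelihoods)) # [1/3, 0, 2/3]; switching wins with probability 2/3
```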

Chapter 4 Geometric Probability

CHAPTER OBJECTIVES:
- to use connections between geometry and probability
- to see how the solution of a quadratic equation solves a geometric problem
- to use linear inequalities in geometric problems
- to show some unusual problems for geometry.

EXAMPLE 4.1 Meeting at the Library

Joan and Jim agree to meet at the library after school between 3 and 4 p.m. Each agrees to wait no longer than 15 min for the other. What is the probability that they will meet? This at first glance does not appear to be a geometric problem, but it is. We suppose that each person arrives at some random time. Let X denote Joan's arrival time and Y denote Jim's arrival time. For convenience, we take the interval between 3 and 4 p.m. to be the interval between 0 and 1, so both Joan and Jim's waiting time becomes 1/4 of an hour, and these arrival times are points somewhere in a square of side 1. We show the situation in Figure 4.1.

If Joan arrives first, then Jim's arrival time Y must be greater than Joan's arrival time X. So Y > X, or Y − X > 0, and if they are to meet, then Y − X < 1/4, or Y < X + 1/4. This is the region below the line Y = X + 1/4, which has intercepts at (0, 1/4) and (3/4, 1) and is the top line in Figure 4.1. If Jim arrives first, then X > Y and X − Y > 0, and if they are to meet, then X − Y < 1/4, or Y > X − 1/4. This is the region above the line Y = X − 1/4, which has intercepts at (1/4, 0) and (1, 3/4) and is the lower line in Figure 4.1. They both meet, then, when Y < X + 1/4 and when Y > X − 1/4. This is the region between the lines and is shaded in the figure. Since the area of the square is 1, the shaded area must represent the probability that Joan and Jim meet.

[Figure 4.1: unit square with the lines Y = X + 1/4 and Y = X − 1/4; the shaded band between them is the meeting region.]
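One way to make the geometric claim concrete is to simulate it. The Python sketch below is not from the text (the trial count is arbitrary); it draws random arrival times and counts how often they fall within a quarter hour of each other, and its estimate can be compared with the exact area computed next.

```python
import random

def estimate_meeting_probability(wait=0.25, trials=100_000):
    meets = 0
    for _ in range(trials):
        x = random.random()   # Joan's arrival time, as a fraction of the hour
        y = random.random()   # Jim's arrival time
        if abs(x - y) < wait:
            meets += 1
    return meets / trials

print(estimate_meeting_probability())   # close to 7/16 = 0.4375
```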

The easiest way to compute this is to subtract the areas of the two triangles from 1, the area of the square. The triangles are equal in size. This gives us the probability that they meet as

P(Joan and Jim meet) = 1 − 2 · (1/2) · (3/4)² = 1 − 9/16 = 7/16 = 0.4375,

so they meet less than half the time. It is, perhaps, surprising that the solution to our problem is geometric. This is one of the many examples of the solution of probability problems using geometry. We now show some more problems of this sort.

EXAMPLE 4.2 How Long Should They Wait?

Now suppose that it is really important that they meet, so we want the probability that they meet to be, say, 3/4. How long should each wait for the other? Let us say that each waits time t for the other, measured as a fraction of the hour. The situation is shown in Figure 4.2. We know then that the shaded area is 3/4, or that the nonshaded area is 1/4. This means that

1 − 2 · (1/2) · (1 − t)² = 3/4

or that

1 − (1 − t)² = 3/4

so (1 − t)² = 1/4, or 1 − t = 1/2, so t = 1/2. So each must wait up to 30 min for the other for them to meet with probability 3/4.

[Figure 4.2: unit square with the lines through (0, t), (1 − t, 1) and (t, 0), (1, 1 − t) bounding the shaded meeting region.]

EXAMPLE 4.3 A General Graph

In the previous example, we specified the probability that Joan and Jim meet. Now suppose we want to know how the waiting time, say t hours for each person, affects the probability that they meet. The probability that they meet is

p = 1 − 2 · (1/2) · (1 − t)²

or

t² − 2t + p = 0.

Figure 4.3 is a graph of this quadratic function where t is restricted to be between 0 and 1.
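Solving p = 1 − (1 − t)² for t gives t = 1 − √(1 − p), so the waiting time needed for any desired meeting probability can be read off directly. A small Python sketch with illustrative values:

```python
from math import sqrt

def waiting_time(p):
    """Waiting time t (as a fraction of the hour) giving meeting probability p,
    from p = 1 - (1 - t)**2."""
    return 1 - sqrt(1 - p)

for p in (0.5, 0.75, 0.9):
    print(p, round(waiting_time(p), 3))
# p = 0.75 gives t = 0.5, the 30 minutes found above
```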

6 Geometric Probability 51 0. then x must be to the left of the vertical line at L/2.8 0. So.4 0. this is the region above the line x + y = L/2. L L/2 Figure 4. If x < L/2.4 Breaking a Wire I have a piece of wire of length L.4.2 0.8 1 Figure 4. What is the probability that I can form a triangle from the three pieces of wire? Suppose the pieces of wire after the wire is broken have lengths x.4 0 L /2 L .4 t 0. I break it into three pieces. Finally.Chapter 4 1 Probability 0. y. the sum of the lengths of any two sides must exceed the length of the remaining side. and L − x − y. x + y > L/2. then y must be below the horizontal line at L/2. as shown in Figure 4. x+y>L−x−y and x + (L − x − y) > y and y + (L − x − y) > x so x < L/2 so y < L/2 or x + y > L/2 It is easy to see the simultaneous solution of these inequalities geometrically.2 0. To form a triangle.3 EXAMPLE 4.6 0. If y < L/2.

corresponding to the base of the triangle. otherwise Kaylyn does the dishes.52 Chapter 4 Geometric Probability The resulting region is the triangle shaded in the figure. The region labeled “A” is the region where player A is shooting and she makes the shot. Its area is clearly 1/8 of the square.3)(0. A makes 40% of her foul shots. and C takes 20% of the team’s foul shots. B makes 60% of her foul shots. the situation is similar for players B and C.6 that Kaylyn does the dishes 2/3 of the time. I have in my pocket two red marbles and one green marble. Is the game fair? Of course the game is not fair. It is clear from the triangle in Figure 4.8 0. corresponding to two sides of the triangle. so the probability that I can form a triangle from the pieces of wire is 1/8. But it does! My daughter.8) = 0. We agree to select two marbles at random. If the colors match.5 0.2)(0.54 1 0. as we have done in all our previous examples. and C makes 80% of her foul shots.8 1 EXAMPLE 4. I do the dishes. .6 0. To make the situation interesting. A takes 50% of the team’s foul shots.5 0 0. and C are basketball players. Kaylyn.5)(0.6) + (0. So the probability that a shot is made is then the sum of the areas for the three players or (0.6 Doing the Dishes We conclude this chapter with a problem that at first glance does not involve geometry. B takes 30% of the team’s foul shots.4) + (0.5.4 B A C Figure 4.5 Shooting Fouls A. B. and I share doing the dinner dishes. while I do the dishes 1/3 of the time. What is the probability that a foul shot is made? The situation can be seen in Figure 4. EXAMPLE 4.

It is interesting to find. or unfairness.6 Figure 4.7 where we have two green and two red marbles.8 R G . There are six possible samples of two marbles. Adding a green marble does not change the probabilities at all! Why is this so? Part of the explanation lies in the fact that while the number of red marbles and green marbles is certainly important.Chapter 4 G Geometric Probability 53 G G R R R R Figure 4. It is the geometry of the situation that explains the fairness. of the game. but it is not a fair situation. Figure 4. two of these contain marbles of the same color while four contain marbles of different colors. in the above example.8 shows three red and three green marbles. G G R R Figure 4. examine Figure 4. then the game is fair.7 How can the game be made fair? It may be thought that I have too many red marbles in my pocket and that adding a green marble will rectify things. However. The unfairness of having two red and one green marbles in my pocket did not arise from the presumption that I had too many red marbles in my pocket. that if we have three red marbles and one green marble. it is the number of sides and diagonals of the figure involved that is crucial. I had too few! Increasing the number of marbles in my pocket is an interesting challenge.
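The search for fair combinations can also be done by brute force. The Python sketch below is not from the text; it counts same-color and different-color pairs directly, and the fair combinations it finds with up to 15 red marbles are exactly the consecutive triangular numbers discussed next.

```python
from itertools import combinations
from fractions import Fraction

def match_probability(reds, greens):
    """Probability that two marbles drawn at random have the same color."""
    marbles = ['R'] * reds + ['G'] * greens
    pairs = list(combinations(marbles, 2))
    same = sum(a == b for a, b in pairs)
    return Fraction(same, len(pairs))

print(match_probability(2, 1))   # 1/3: the original game is unfair
print(match_probability(3, 1))   # 1/2: one more red marble makes it fair

fair = [(r, g) for r in range(1, 16) for g in range(1, r + 1)
        if match_probability(r, g) == Fraction(1, 2)]
print(fair)                      # [(3, 1), (6, 3), (10, 6), (15, 10)]
```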

Let us see why R and G must be triangular numbers. One might notice that the numbers of red and green marbles in Table 4.1 R 3 6 10 15 G 1 3 6 10 R+G 4 9 16 25 The lines in Figure 4. With a total of six marbles. there is no way in which the game can be made fair. Table 4. 6 of which have both marbles of the same color while 9 of the samples contain marbles of different colors. There are no combinations of 5 through 8. 1 + 2 + 3 + 4 = 10. The term triangular comes from the fact that these numbers can be shown as equilateral triangles: r r r r r r r r r r 1 3 6 We also note that 1 + 2 + 3 + · · · + k = k(k + 1)/2. they are sums of consecutive positive integers 1 + 2 = 3. For the game to be fair.8 show that there are 15 possible samples to be selected. 1 + 2 + 3 = 6. and so on. or 17 through 25 marbles for which the game is fair.54 Chapter 4 Geometric Probability Table 4. then 1 + 2 + 3 + · · · + k + (k + 1) = (k + 2)(k + 1) k(k + 1) + (k + 1) = 2 2 which is the formula for the sum of k + 1 positive integers. note that the formula works for k = 2 and also note that if the formula is correct for the sum of k positive integers. To see that this is so. that is.1 are triangular numbers. R G + 2 2 or 2R(R − 1) + 2G(G − 1) = (R + G)(R + G − 1) = 1 R+G 2 2 .1 shows some of the first combinations of red and green marbles that produce a fair game. We will prove this now. 10 through 15. This proves the formula.

Are you an illegal drug user? The interviewer then asks the person being interviewed to flip a fair coin (and not show the result to the interviewer). he is to answer the second question. he is to answer the first question. to estimate the proportion of drug users. Here is a procedure for obtaining responses to sensitive survey questions that has proved to be quite accurate. since R + G = (R − G)2 .9 0 1/2 1 . so it would be unlikely that we could determine the percentage of drug users or dangerous criminals with any degree of accuracy. Solving these simultaneously gives 2R = k + k2 or R = k(k + 1)/2 and so G = k(k − 1)/2 showing that R and G are consecutive triangular numbers. EXAMPLE 4.Chapter 4 Geometric Probability 55 and this can easily be simplified to R + G = (R − G)2 This is one equation in two unknowns. Is your Social Security number even? 2. So if the person answers “Yes. Then. Figure 4. But we also know that both R and G must be positive integers. Suppose we have two questions: 1. Those who do sample surveys are often interested in asking sensitive questions such as “Are you an illegal drug user?” or “Have you ever committed a serious crime?” Asking these questions directly would involve self-incrimination and would likely not produce honest answers. it follows that R + G = k2 .9 should be very useful. if we draw a picture of the situation. if the coin comes up tails. representing those who when interviewed showed a head on the coin (probability 1/2) and who then answered the first question “Yes” (also with probability 1/2). The square in the bottom left-hand corner has area 1/4. But it is possible. If the coin comes up heads.” we have no way of knowing whether his Social Security number is even or if he is a drug user. The rectangle on the bottom right-hand side represents 1 1/2 p Figure 4. So let R − G = k.7 Randomized Response We show here a geometric solution to the randomized response exploration in Chapter 2.

but could vary considerably from 1/2 in a small sample. It is useful to see how our estimate of p. 0.4 0.6 p 0.” Suppose that proportion of people answering “Yes” is. then p = 1.2 0. p = 0. comparing areas. 1 0.5 Ps 0.” So the total area represents the proportion of those interviewed who responded “Yes.3 0.8 0. namely. varies with the proportion answering “Yes” in the survey.4 0. These are unusual examples for a geometry class and they may . and if ps = 3/4.30 − 1/4) or p = 0.10 where we assume that 1/4 ≤ ps ≤ 3/4 since if ps = 1/4. say.30. in the proportion showing heads on the coin.56 Chapter 4 Geometric Probability those who when interviewed showed a tail on the coin (probability 1/2) and who answered the drug question “Yes. We have that 1/4 + (1/2) · p = ps so that p = 2(ps − 1/4) A graph of this straight line is shown in Figure 4.7 Figure 4.10 CONCLUSION We have shown several examples in this chapter where problems in probability can be used in geometry.30 so p = 2(0.10 So our estimate from this survey is that 10% of the population uses drugs.6 0. Then . the true proportion who should answer “Yes” to the sensitive question. This estimate is of course due to some sampling variability. we have 1/4 + (1/2) · p = 0. This should not differ much from 1/2 if the sample is large. say ps .

he is to simply say “Yes. he is to answer the sensitive question. EXPLORATIONS Here are some interesting situations for classroom projects or group work. . In Exploration 4. suppose we do not know the frequency with which C makes a foul shot. 1. but we know the overall percentage of foul shots made by the team. In the foul shooting example.Explorations 57 well motivate the student to draw graphs and draw conclusions from them. Change the waiting times so that they are different for Joan and for Jim at the library. draw pairs of random numbers with your computer or calculator and estimate the probability that the product of the numbers is less than 50.” What is the estimate of the “Yes” answers to the sensitive question? 4.” otherwise. Two numbers are chosen at random between 0 and 20. If the number drawn (unknown to the interviewer) is less than or equal to 60. 2. “No. Suppose in randomized response that we have the subject interviewed draw a ball from a sack of balls numbered from 1 to 100. motivation that is often lacking in our classrooms. How good a foul shooter is C? 3. They are also examples of practical situations producing reasons to find areas and equations of straight lines. if the number is between 61 and 80. (a) What is the probability that their sum is less than 25? (b) What is the probability that the sum of their squares is less than 25? 5.

distribution functions INTRODUCTION Suppose a hat contains slips of paper with the numbers 1 through 5 on them. In general. Kinney Copyright © 2009 by John Wiley & Sons. Z. Hypergeometric. say X. and one of its values. such as X. 58 . A slip is drawn at random and the number on the slip observed. we say that X = 3. hypergeometric. and Geometric Distributions CHAPTER OBJECTIVES: r to introduce random variables and probability distribution functions r to discuss uniform. usually denoted by small letters. the number is called a random variable. It is important to distinguish between the variable itself. Since the result cannot be known in advance. Inc. binomial. and so on. John J. Random variables are generally denoted by capital letters. a random variable is a variable that takes on values on the points of a sample space.Chapter 5 Random Variables and Discrete Probability Distributions—Uniform. If we see the number 3 in the slip of paper experiment. A Probability and Statistics Companion. Y. and geometric probability r to discover some surprising results when random variables are added r to encounter the “bell-shaped” curve for the first (but not the last!) time r to use the binomial theorem with both positive and negative exponents. say x. Binomial.

Discrete random variables are those random variables that take on a finite, or perhaps a countably infinite, number of values, so the associated sample space has a finite, or countably infinite, number of points. Random variables occur with different properties and characteristics; we will discuss some of these in this chapter. Later, we will discuss several random variables that can take on an uncountably infinite number of values; these random variables are called continuous random variables.

If the random variable is discrete, we call the function giving the probability that the random variable takes on any of its values, say P(X = x), the probability distribution function of the random variable X. We will often abbreviate this as the PDF for the random variable X. Not any function can serve as a probability distribution function. All discrete probability distribution functions have these properties: if f(x) = P(X = x) is the PDF for a random variable X, then

1) f(x) ≥ 0
2) the sum of f(x) over all x is 1.

These properties arise from the fact that probabilities must be nonnegative and, since some event must occur in the sample space, the sum of the probabilities over the entire sample space must be 1. We discuss several discrete random variables in this chapter. We begin with the probability distribution function suggested by the example of drawing a slip of paper from a hat.

DISCRETE UNIFORM DISTRIBUTION

The experiment of drawing one of five slips of paper from a hat at random suggests that the probability of observing any of the numbers 1 through 5 is 1/5, that is,

P(X = x) = 1/5 for x = 1, 2, 3, 4, 5

is the PDF for the random variable X. This is called a discrete uniform probability distribution function. In general, the discrete uniform probability distribution function is defined as

f(x) = P(X = x) = 1/n for x = 1, 2, ..., n.

It is easy to verify that f(x) satisfies the properties of a discrete probability distribution function.

Unfortunately. In our example. we have μx = E(X) = (5 + 1)/2 = 3. we have E(X ± Y ) = E(X) ± E(Y ) if X and Y are random variables defined on the same sample space. This is a measure of how variable the probability distribution is. producing 0 for any random variable. Since the expected value is a sum. The thinking here is that values that depart markedly from the mean value show that the probability distribution is quite variable. E(X − μ) = E(X) − E(μ) = μ − μ = 0 for any random variable. To measure this. then E(X) = all x x · f (x) = c all x f (x) = c · 1 = c We now turn to another descriptive measure of a probability distribution. we find μx = E(X) = 1 · 1 1 1 1 1 +2· +3· +4· +5· =3 5 5 5 5 5 In general. If the random variable X is a constant. we have discrete uniform distribution μx = E(X) = all x x · f (x) = n x=1 x· 1 1 = n n n x= x=1 1 n · n+1 n(n + 1) = 2 2 In our case where n = 5. the variance. The problem here is that positive deviations from the mean exactly cancel out negative deviations. So we square each of the deviations to prevent this and find the expected value of those deviations. We call this quantity the variance and denote it by σ 2 = Var(X) = E(X − μ)2 . The mean or the expected value of a discrete random variable X is denoted by μx or E(X) and is defined as μx = E(X) = all x x · f (x) This then is a weighted average of the values of X and the probabilities associated with its values. as before.60 Chapter 5 Random Variables and Discrete Probability Distributions Mean and Variance of a Discrete Random Variable We pause now to introduce two numbers by which random variables can be characterized. we might subtract the mean value from each of the values of X and find the expected value of the result. say X = c. the mean and the variance.
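These definitions are easy to apply directly. A short Python sketch for the slips-of-paper distribution (variable names are only illustrative) computes the mean and the variance by summing over the five values:

```python
values = [1, 2, 3, 4, 5]
f = {x: 1/5 for x in values}                        # uniform PDF on 1..5

mu = sum(x * f[x] for x in values)                  # E(X)
var = sum((x - mu) ** 2 * f[x] for x in values)     # E[(X - mu)^2]
print(mu, var)                                      # 3.0 and 2.0
```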

and German Tanks Probability distributions that contain extreme values (with respect to the mean) in general have larger standard deviations than distributions whose values are mostly close to the mean. The positive square root of the variance. If k = 1. this becomes σ 2 = 2.7% and if k = 1. σ. this √ about 57. let us look at some uniform distributions and the percentage of those distributions contained in an interval centered at the mean. 8 or 60% of the distribution. we have n σ2 = x=1 x2 · 1 − n n+1 2 2 = 1 n · n(n + 1)(2n + 1) (n + 1)2 − 6 4 which is easily simplified to σ 2 = (n2 − 1)/12. 8. 5. 4. So the percent√ √ age of the distribution contained in the interval is approximately 2k/ 12 = k/ 3 for reasonably large values of n. Since μ = (n + 1)/2 and σ = (n2 − 1)/12 for the uniform distribution on x = 1. . for example. is called the standard deviation. this is the interval (2. . this is is about 86. This will be dealt later in this book when we consider confidence intervals. k of course must be less than 3 or else the entire distribution is covered by the interval. 7.Discrete Uniform Distribution 61 This can be expanded as σ 2 = Var(X) = E(X − μ)2 = E(X2 − 2μX + μ2 ) = E(X2 ) − 2μE(X) + μ2 or σ 2 = E(X2 ) − μ2 We will often use this form of σ 2 for computation. rapidly approaches 1 as n increases. Intervals. Suppose now that we add and subtract k standard deviations from the mean for the general uniform discrete distribution. For now. however. then the length of this interval is 2k (n2 − 1)/12 and the percentage of the distribution contained in that interval is 2k n2 − 1 12 = 2k n n2 − 1 12n2 The factor (n2 − 1)/n2 . .372) that contains the values 3. σ. 6.6%.5. If n = 5. For the general discrete uniform distribution. .628. consider the interval μ ± σ that in this case is the interval n+1 ± 2 n2 − 1 12 We have used the mean plus or minus one standard deviation here. n. 2. . If n = 10.

For now.4 4. . . were captured. This really seems to be a bit senseless however since we know the value of n and so it is simple to figure out what percentage of the distribution is contained in any particular interval.2 3.13. that of sums. They numbered parts of motorized vehicles and parts of aircraft and tanks. n (here is our uniform distribution!) and we have captured tanks numbered 7.5 5.4 5. What happens if we add the two numbers found.1 3. This problem actually arose during World War II. Suppose further that the first slip is replaced before the second is drawn so that the sampling for the first and second slips is done under exactly the same conditions.4 1.5 Sum 7 8 5 6 7 8 9 6 7 8 9 10 .1 2. .3 Sum 2 3 4 5 6 3 4 5 6 7 4 5 6 Sample 3.2 1.3 4.5 2.4 3. for example.3 5. We will return to confidence intervals later. what if we do not know the value of n? This is an entirely different matter. and 42. Sample 1. Does the sum also have a uniform distribution? We might think the answer to this is “yes” until we look at the sample space for the sum shown below. We will have much more to say about statistical problems in later chapters and this one in particular. The reader may wish to think about this problem and make his or her own estimate. the Germans unwittingly gave out information about how many tanks were in the field! If we assume the tanks are numbered 1.2 4.2 5. but by how much? We will return to this problem later.3 2.5 4.5 3. Notice that the estimate must exceed 42.4 2. what is n? This is not a probability problem but a statistical one since we want to estimate the value of n from a sample. The Germans numbered much of the military material it put into the battlefield.62 Chapter 5 Random Variables and Discrete Probability Distributions So the more standard deviations we add to the mean. the more of the distribution we cover. and this is true for probability distributions in general..3 1.1 4. Sums Suppose in the discrete uniform distribution with n = 5 (our slips of paper example) we draw two slips of paper. when tanks. So.2 2. 2.1 1.1 5. We now turn to another extremely useful situation.

1. say to 5. 5. As we will see. it is to be expected when sums are considered—and the sums can arise from virtually any distribution or combination of these distributions! We will discuss this further in the chapter on continuous probability distributions. Although the sample space now contains 55 = 3125 points. This gives a “bell-shaped” curve.1 Even more surprising things occur when we increase the number of drawings. enumerating these is a daunting task to say the least. Frequency 300 200 100 10 Figure 5.Discrete Uniform Distribution 63 Now we realize that sums of 4. this is far from uncommon in probability theory. or 6 are fairly likely. in fact. Other techniques can be used however to produce the graph in Figure 5. 5 Frequency 4 3 2 4 6 Sum 8 10 Figure 5.2 15 Sum 20 25 . Here is the probability distribution of the sum: X f (x) 2 3 4 5 6 7 8 9 10 1 2 3 4 5 4 3 2 1 25 25 25 25 25 25 25 25 25 and the graph is as shown in Figure 5.2.
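Enumeration and simulation are two such techniques. The Python sketch below is not from the text (the number of simulated draws is arbitrary); it reproduces the distribution of the sum of two draws exactly and approximates the distribution of the sum of five draws, whose frequencies rise and fall in the bell shape of Figure 5.2.

```python
from itertools import product
from collections import Counter
import random

# Exact distribution of the sum of two draws (with replacement) from 1..5
two = Counter(sum(pair) for pair in product(range(1, 6), repeat=2))
print({s: f"{count}/25" for s, count in sorted(two.items())})

# Simulated distribution of the sum of five draws
five = Counter(sum(random.choices(range(1, 6), k=5)) for _ in range(100_000))
for s in sorted(five):
    print(s, five[s])
```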

the sample space then consists of all the possible sequences of five items. unfortunately. the result of which is a head or a tail. or 5. If we inspect five items. 2. for convenience. we let P(S) = p and P(F ) = q = 1 − p for each trial.64 Chapter 5 Random Variables and Discrete Probability Distributions BINOMIAL PROBABILITY DISTRIBUTION The next example of a discrete probability distribution is called the binomial distribution. Suppose we inspect an item as it is coming off a production line. none of the items are good so. The sample space then contains 25 = 32 sample points. We call these experiments binomial since. Examples of this situation are very common. We also note that the number of orders in which there are exactly two good items 5 must be 2 or the number of ways in which we can select two positions for the . 4. Examples of this include tossing a coin. a newborn child is a female or male. 3. In fact. We also suppose as above that P(G) = p and P(D) = q = 1 − p. It is common to let the random variable X denote the number of successes in n independent trials of the experiment. the result is one of two outcomes. a vaccination against the flu is successful or nonsuccessful. then. Any particular order will have probability q3 p2 since the trials of the experiment are independent. If X = 0. are called success (S) or failure (F ). at each trial. using the independence of the events and the associated sample point. Now we must calculate the probabilities of each of these events. then we see that the possible values of X are 0. P(X = 0) = P(DDDDD) = P(D) · P(D) · P(D) · P(D) · P(D) = q5 How can X = 1? Then we must have exactly one good item and four defective items. So P(X = 1) = P(GDDDD or DGDDD or DDGDD or DDDGD or DDDDG = P(GDDDD) + P(DGDDD) + P(DDGDD) + P(DDDGD) + P(DDDDG) = pq4 + pq4 + pq4 + pq4 + pq4 = 5pq4 Now P(X = 2) is somewhat more complicated since two good items and three defective items can occur in a number of ways. each G or D. 1. But that event can occur in five different ways since the good item can occur at any one of the five trials. which. One of the most commonly occurring random variables is the one that takes one of two values each time the experiment is performed. Let us consider a particular example. and if we let X denote the number of good items. We will make two further assumptions: that the trials are independent and that the probabilities of success or failure remain constant from trial to trial. The item is good (G) or defective (D).

One reason for not pursuing this is that exact calculations involving the . We note that P(X = x) 0 and n P(X = x) = n x=0 x=0 x p q so the properties of a discrete probability distribution function are satisfied. We show some here where we have chosen p = 0. We see that P(X = x) = n x n−x p q x for x = 0. . we find P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) + P(X = 4) + P(X = 5) = q5 + 5pq4 + 10p2 q3 + 10p3 q2 + 5p4 q + p5 = (q + p) = 1 5 which we recognize as since q + p = 1 Note that the coefficients in the binomial expansion add up to 32. so all the points in the sample space have been used. This is in fact the case. n x n−x = (q+p)n = 1. that X denotes the number of successes. in Chapter 8. 1. and 5. the probability distributions become more “bell shaped” and strongly resemble what we will call. The graphs indicate that as n increases. although this fact will not be pursued here. we find that P(X = 3) = and P(X = 4) = 5 4 p q = 5p4 q and P(X = 5) = p5 4 5 3 2 p q = 10p3 q2 3 If we add all these probabilities together. Graphs of binomial distributions are interesting. 2. The occurrence of the binomial theorem here is one reason the probability distribution of X is called the binomial probability distribution. 5. .5). and that P(S) = p and P(F ) = q = 1 − p. We conclude that P(X = 2) = 5 2 3 p q = 10p2 q3 2 In an entirely similar way.3. a continuous normal curve. n This is the probability distribution function for the binomial random variable in general. . Suppose now that we have n independent trials. .Binomial Probability Distribution 65 two good items from five positions in total.3 for various values of n (Figures 5.4. The above situation can be generalized.

A sample of 20 parts is taken.15 0.125 0.5 p = 0.4 p = 0. n = 15. n = 5.1 0. and so we do not need to approximate these probabilities with a normal curve. Binomial distribution. Here are some examples.35 0.2 0.025 5 10 15 X 20 25 30 Probability Figure 5.15 0.3. Assuming a binomial model. Binomial distribution n = 30.3. 0. is this a cause for concern? .15 0.1 0. 0. and it is found that 4 of these are defective.2 Probability 0.1 A Production Line A production line has been producing good parts with probability 0. EXAMPLE 5. Binomial distribution.3 0.05 0 Random Variables and Discrete Probability Distributions Probability 0 1 2 X 3 4 5 Figure 5. which we study in the chapter on continuous distributions.05 2 4 6 8 X 10 12 14 Figure 5.66 Chapter 5 0.3.25 0.1 0. binomial distribution are possible using a statistical calculator or a computer algebra system.85.075 0.3 p = 0.05 0.
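Any statistical calculator or computer algebra system will do such calculations; as one possibility (my choice, not the text's), the scipy library in Python gives the probability used in the solution that follows. Four defective parts in a sample of 20 means at most 16 good parts.

```python
from scipy.stats import binom

# P(X <= 16) good parts when n = 20 and each part is good with probability 0.85
print(binom.cdf(16, 20, 0.85))                        # about 0.3523

# The same probability by direct summation of the binomial PDF
print(sum(binom.pmf(x, 20, 0.85) for x in range(17)))
```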

85. can be expressed in terms of the number of voters.864808 x So if candidates in fact did not know the true proportion of voters willing to vote for them. What is the probability that the sample will contain between 40% and 60% of voters who favor the candidate? We presume that a binomial model is appropriate. ps = X/100.90)45 = 0.40 ≤ ps ≤ 0.55)100−x = 0.40 ≤ X ≤ 0.45)x (0.352275 x So this event is not unusual and one would probably conclude that the production line was behaving normally and that although the percentage of defective parts has increased. the sample is not a cause for concern.85)x (0. Assuming that the probability a good part is 0. . She has a box of 45 seeds.60 100 = P(40 ≤ X ≤ 60) 60 = x=40 100 (0. increase the probability that the sample proportion is within the range of 40–60%. In fact.60) = P 0. Note that the sample proportion of voters. so P(0. of course.15)20−x = 0.Binomial Probability Distribution 67 Let X denote the number of good parts in a sample of 20. we find the probability the sample has at most 16 good parts is 16 P(X ≤ 16) = x=0 20 (0. letting X denote the number of seeds germinating. say ps .0087280 So this is not a very probable event although the germination rate is fairly high. Increasing the sample size will.90)45 (0. The probability that all the seeds germinate.10)0 45 = (0. EXAMPLE 5. who favor the candidate. the survey would be of little comfort since it indicates that they could either win or lose.2 A Political Survey A sample survey of 100 voters is taken where actually 45% of the voters favor a certain candidate. say X.3 Seed Germinations A biologist studying the germination rate of a certain type of seed finds that 90% of the seeds germinate. EXAMPLE 5. is P(X = 45) = 45 (0.

671067 0.90 · 0. we calculate the mean value.1: Table 5. or 1%. but no value in between these values. so it is not possible to find a 5% error rate exactly.95.5 and variance. Mean and Variance of the Binomial Distribution It will be shown in the next section that the following formulas apply to the binomial distribution with n trials and p the probability of success at any trial. or P(X ≤ k) = 0. This can be done only approximately. To illustrate this.991272 1 It appears that one should fix the guarantee at 43 seeds. 05. Since the variable is discrete. E(X) = μ = 45 · 0.90 = 40. The table in fact indicates that this probability could be approximately 16%.70772 Now suppose that the seller of the seeds wishes to advertise a “guarantee” that at least k of the seeds germinates. we wish to determine k so that P(X ≥ k) = 0. we find the following in Table 5.05 = 2. . 5%. What should k be if the seller wishes to disappoint at most 5% of the buyers of the seeds? Here. We have seen that the standard deviation is a measure of the variation in the distribution. the cumulative probabilities increase in discrete jumps.10 = 4.90)x (0. meaning that the stan√ dard deviation is σ = 4.840957 0.05.472862 0. Using a statistical calculator. n μ = E(X) = x=0 x· n x p (1 − p)n−x = np x n σ 2 = Var(X) = E(X − μ)2 = x=0 (x − μ)2 · n x p (1 − p)n−x = npq x In our above example.10)45−x x = 0.68 Chapter 5 Random Variables and Discrete Probability Distributions The probability that at least 40 of the seeds germinate is 45 P(X ≥ 40) = x=40 45 (0. Var(X) = npq = 45 · 0.947632 0.1 k 40 41 42 43 44 45 P(X ≤ k) 0.01246.

The reason is as follows: Consider a binomial process with n trials and probability of success at any particular trial. We also find that P(μ − 2σ ≤ X ≤ μ + 2σ) = P(40. x We will return to these intervals in the chapter on continuous distributions.51246) 43 = 38 45 (0.5 − 2 · 2.5 + 2 · 2. That is.10)45−x = 0. We define a random variable now for each one of the n trials as follows: Xi = 1 0 if the ith trial is a success if the ith trial is a failure and Xi then is 1 only if the ith trial is a success.5 − 2.01246 ≤ X ≤ 40.90)x (0.871934. The identity X1 + X2 + X3 + · · · + Xn = X also provides an easy way to calculate the mean and variance of X. when we took sums of independent observations. Sums In our study of the discrete uniform distribution. and we indicated that graphs of sums in general became shaped that way.47508 ≤ X ≤ 44. Could it be then that binomial probabilities could be considered to be sums? The answer to this is “yes”. X1 + X2 + X3 + · · · + Xn = X so the binomial random variable X is in fact a sum.Binomial Probability Distribution 69 we calculate P(μ − σ ≤ X ≤ μ + σ) = P(40. p.10)45−x = 0.48754 ≤ X ≤ 42.01246) = P(36.01246) = P(38.01246 ≤ X ≤ 40. This explains the “bell shaped” curve we see when we graph the binomial distribution. x Notice that we must round off some of the results since X can only assume integer values.90)x (0.987970. We find that E(Xi ) = 1 · p + 0 · q = p and .52492) 45 = 36 45 (0. graphs of those sums became “bell shaped”.5 + 2. it follows that the sum of the Xi ’s counts the total number of successes in n trials.

E(Xi²) = p. We will show later that the variance of a sum of independent random variables is the sum of the variances, and

Var(Xi) = E(Xi²) − [E(Xi)]² = p − p² = p(1 − p) = pq,

so it follows that

Var(X) = Var(X1) + Var(X2) + Var(X3) + ··· + Var(Xn) = pq + pq + pq + ··· + pq = npq.

HYPERGEOMETRIC DISTRIBUTION

We now consider another very useful and frequently occurring discrete probability distribution. The binomial distribution assumes that the probability of an event remains constant from trial to trial. This is not always an accurate assumption. As an example, suppose that a small manufacturer has produced 11 machine parts in a day. Unknown to him, the lot contains three parts that are not acceptable (D), while the remaining parts are acceptable and can be sold (G). A sample of three parts is taken, and a selected part is not replaced so that it cannot be sampled again; that is, the sampling is done without replacement. This means that after the first part is selected, no matter whether it is good or defective, the probability that the second part is good depends on the quality of the first part. So the binomial model does not apply. We actually have encountered this situation when we studied acceptance sampling in the previous chapter; now we make the situation formal.

We can, for example, find the probability that the sample of three contains two good and one unacceptable part. This event could occur in three ways:

P(2G, 1D) = P(GGD) + P(GDG) + P(DGG)
= (8/11) · (7/10) · (3/9) + (8/11) · (3/10) · (7/9) + (3/11) · (8/10) · (7/9) = 0.509.

Notice that there are three ways for the event to occur. Since the order of the parts is irrelevant, we simply need to choose two of

suppose our sampling is destructive and we can only replace the defective items that occur in the sample. . D) Since the sum covers all the possibilities Min(n. Min(n. .509 165 as before. The nonreplacement does not affect the mean value. however. but Var(X) = n · D N · 1− D N · N −n N −1 is like the binomial npq except for the factor (N − n)/(N − 1). surprisingly like n · p in the binomial. So 8 2 8 2 · 2 1 P(2G. n the probabilities sum to 1 as they should. Then. N n x = 0. It does effect the variance. altering our sampling procedure from our previous discussion of acceptance sampling. if we find one defective item in the sample. we find the following values for the probability distribution function: x P(X = x) 0 56 165 1 28 55 2 8 55 3 1 165 Now. The demonstration will not be given here. we sell 2/11 defective product. 1D) = · 11 3 3 1 = 84 = 0. for example. We can generalize this as follows. . Suppose a manufactured lot contains D defective items and N − D good items. . It can be shown that the mean value is μx = n · (D/N). So the average defective product sold under this sampling plan is 56 165 · 3 28 + 11 55 · 2 8 + 11 55 · 1 1 + 11 165 · 0 = 19. To continue our example. 1. which is often called a finite population correction factor. This can be done in ways. Let X denote the number of unacceptable items in a sample of n items. This is D x P(X = x) = · N −D n−x .D) x=0 D x · N −D n−x = N . This is called a hypergeometric probability distribution function.8% 11 .Hypergeometric Distribution 71 = 56 the good items and one of the defective items.
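These hypergeometric probabilities, including the 0.509 found earlier, can be checked directly with binomial coefficients. A Python sketch (the function and argument names are just labels):

```python
from math import comb

def hypergeom_pmf(x, N, D, n):
    """P(X = x) defective items in a sample of n from a lot of N items, D defective."""
    return comb(D, x) * comb(N - D, n - x) / comb(N, n)

# The lot of 11 parts with 3 defective, sample of size 3
for x in range(4):
    print(x, hypergeom_pmf(x, N=11, D=3, n=3))
# 56/165, 84/165, 24/165, 1/165; the value 84/165 is the 0.509 found above
```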

This is an improvement over the 3/11 = 27.3% defective product sold had we proceeded without the sampling plan. The improvement is substantial, but not as dramatic as what we saw in our first encounter with acceptance sampling.

Other Properties of the Hypergeometric Distribution

Although the nonreplacement in sampling creates quite a different mathematical situation than that we encountered with the binomial distribution, it can be shown that the hypergeometric distribution approaches the binomial distribution as the population size increases. It is also true that the graphs of the hypergeometric distribution show the same "bell-shaped" characteristic that we have encountered several times now.

GEOMETRIC PROBABILITY DISTRIBUTION

In the binomial distribution, we have a fixed number of trials, say n, and the random variable is the number of successes. In many situations, however, we have a given number of successes, and the number of trials to achieve those successes is the random variable. In waiting time problems, we wait for the first success, and the number of trials to achieve that success is the random variable; that is, we have a given number of successes (here 1) and a variable number of trials. So far we have discussed some of the most common distributions; there are hundreds of other discrete probability distributions, and those considered here are a sample of these. We end this chapter with another probability distribution that we have actually seen before, namely, the geometric distribution, and it will be encountered again.

In Examples 1.4 and 1.12, we discussed a sample space in which we sampled items emerging from a production line that can be characterized as good (G) or defective (D). We discussed a waiting time problem, waiting for a defective item to occur. Later, we showed that no matter the size of the probability that an item was good or defective, the probability assigned to the entire sample space is 1, although the sampling has been purposeful. We presumed that the selections are independent and showed the following sample space:

S = { D, GD, GGD, GGGD, ... }

X 1 2 3 4 . Then. we find the following sample space (where T and F indicate respectively. values of X. We will consider the generalization of this problem in Chapter 7.8.8 qp = (0.Conclusions 73 Let us begin with the following waiting time problem.25 attempts. The probabilities are obviously positive and their sum is S = p + qp + q2 p + q3 p + · · · = 1 which we showed in Chapter 2 by multiplying the series by q and subtracting one series from another. For example. The expected waiting time for our new driver to pass the driving test is 1/(8/10) = 1. This is called a geometric probability distribution function. . We also state here that E[X] = 1/p. that the test is passed and test has been failed). Let the random variable X denote the number of trials necessary up to and including when the test is passed. . .8) q2 p = (0. then the expected waiting time for the first head to occur is 1/(1/2) = 2 tosses. applying the assumptions we have made and letting q = 1 − p. . Table 5.2)(0. . and the probability of passing the test remains constant. a fact that will be proven in Chapter 7. . . if we toss a fair coin with p = 1/2.8) q3 p = (0. so we conclude that P(X = x) = qx−1 · p for x = 1. The occurrence of a geometric series here explains the use of the word “geometric” in describing the probability distribution. but those discussed here are the most important. . In taking a driver’s test. 3. . is the probability distribution function for the random variable X. There are many others that occur in practical problems. CONCLUSIONS This has been a brief introduction to four of the most commonly occurring discrete random variables. the trials are independent.2)2 (0. 2.2)3 (0. We see that if the first success occurs at trial number x.2 Sample Point T FT FFT FFFT . then it must be preceded by exactly x − 1 failures. and probabilities in Table 5. suppose that the probability the test is passed is 0. . Probability p = 0.8) .2.
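The claim that E[X] = 1/p is easy to check empirically. A Python simulation sketch for the driving-test example with p = 0.8 (the trial count is arbitrary):

```python
import random

def attempts_until_success(p):
    """Number of trials up to and including the first success."""
    n = 1
    while random.random() >= p:
        n += 1
    return n

p = 0.8
results = [attempts_until_success(p) for _ in range(100_000)]
print(sum(results) / len(results))   # close to 1/p = 1.25 attempts
```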

Then we will return to discrete probability distribution functions. or we could decide that n = 5 while in reality n = 6. For now. The second defective item is the fifth item drawn. 3. A hypergeometric distribution has N = 100. 5. Find the probability distribution of the number of special items in the sample and then compare these probabilities with those from a binomial distribution with p = 0. This decision rule is subject to two kinds of errors: we could decide that n = 6 while in reality n = 5. let r denote the probability that the first success occurs in no more than n trials. k. 2. of defective items. 2. Items are selected from the lot and inspected. but only to random variables whose probability distributions are continuous.) 4. but it is not known whether n = 5 or n = 6. A lot of 100 manufactured items contains an unknown number. EXPLORATIONS 1. For a geometric random variable with parameters p and q. What is k? (Try various values of k and select the value that makes the event most likely. n. · · · . A sample of size 2 has been selected from the uniform distribution 1. (a) Show that r = 1 − qn . Samples of size 4 are selected. Flip a fair coin. . with D = 10 special items. or use a computer to select random numbers 0 or 1. and verify that the expected waiting time for a head to appear is 2. we pause and consider an application of our work so far by considering seven-game series in sports. and the inspected items are not replaced before the next item is drawn.74 Chapter 5 Random Variables and Discrete Probability Distributions We will soon return to random variables. It is agreed that if the sum of the sample is 6 or greater. then it will be decided that n = 6. (b) Now let r vary and show a graph of r and n. Find the probabilities of each of these errors.10.

five. A Probability and Statistics Companion. We analyze this problem in some depth. or seven games. Sevengame play-off series in sports such as basketball play-offs and the World Series present a waiting time problem. This is another waiting time problem and gives us a finite sample space as opposed to the infinite sample spaces we have considered. INTRODUCTION We pause now in our formal development of probability and statistics to concentrate on a particular application of the theory and ideas we have developed so far. We also assume that p and q remain constant throughout the series and that the games are independent. The series can last four. Let the teams be A and B with probabilities of wining an individual game as p and q. 75 .Chapter 6 Seven-Game Series in Sports CHAPTER OBJECTIVES: r r r r to consider seven-game play-off series in sports to discover when the winner of the series is in fact the better team to find the influence of winning the first game on winning the series to discuss the effect of extending the series beyond seven games. Kinney Copyright © 2009 by John Wiley & Sons. six. We look first at the sample space. the series ends when one team has won four games. We show here the ways in which team A can win the series. we wait until one team has won a given number of games. SEVEN-GAME SERIES In a seven-game series. John J. that is. Inc. winning or losing a game has no effect on winning or losing the next game. where p + q = 1. In this case.

so we assign probabilities now to the sample points. So 3 there are 4−1 + 5−1 + 6−1 + 7−1 = 35 ways for A to win the series and so 3 3 3 3 70 ways in which the series can be played. This formula gives some interesting results. . Note that the number of ways in which the series can be played in n games is easily counted. In Table 6. we show p. say. interchange the letters A and B above. the probability that team A wins the series.1. For 3 example. where either team can win the series: P(4 game series) = p4 + q4 P(5 game series) = 4p4 q + 4q4 p P(6 game series) = 10p4 q2 + 10q4 p2 P(7 game series) = 20p4 q3 + 20q4 p3 These probabilities add up to 1 when the substitution q = 1 − p is made. however. so in the previous n − 1 games. A must win exactly three games and this can be done in n−1 ways. there are 5 = 10 ways in which A can win the series in six games.76 Chapter 6 Seven-Game Series in Sports Four games AAAA Five games BAAAA ABAAA AABAA AAABA Six games BBAAAA BABAAA BAABAA BAAABA ABBAAA AABBAA AAABBA ABABAA ABAABA AABABA Seven games BBBAAAA BBABAAA BBAABAA BBAAABA BABBAAA BABABAA BABAABA BAABBAA BAABABA BAAABBA ABBBAAA ABBABAA ABBAABA ABABBAA ABABABA ABAABBA AABBBAA AABBABA AABABBA AAABBBA To write out the points where B wins the series. These points are not equally likely. The last game must be won by A. We also see that P(A wins the series) = p4 + 4p4 q + 10p4 q2 + 20p4 q3 and this can be simplified to P(A wins the series) = 35p4 − 84p5 + 70p6 − 20p7 by again making the substitution q = 1 − p. and P. the probability that team A wins a single game.
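The entries in Table 6.1 follow from this formula and can also be checked by simulation. A Python sketch, not from the text, with an arbitrary trial count:

```python
import random

def p_win_series(p):
    """P(team A wins a best-of-seven series), from the formula above."""
    return 35*p**4 - 84*p**5 + 70*p**6 - 20*p**7

def simulate(p, trials=200_000):
    wins = 0
    for _ in range(trials):
        a = b = 0
        while a < 4 and b < 4:
            if random.random() < p:
                a += 1
            else:
                b += 1
        wins += (a == 4)
    return wins / trials

for p in (0.50, 0.55, 0.60):
    print(p, round(p_win_series(p), 4), round(simulate(p), 4))
# p = 0.55 gives about 0.6083, as in Table 6.1
```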

we calculate A = 4 p4 + q4 + 5 4p4 q + 4q4 p + 6 10p4 q2 + 10q4 p2 +7 20p4 q3 + 20q4 p3 1 0.Seven-Game Series Table 6. .52 0.6 0.2898 0.5000 0.48 0.8 P wins 0. The difference is shown in Figure 6.260387.53 0. that the probability of winning the series does not differ much from the probability of winning a single game! The series is then not very discriminatory in the sense that the winner of the series is not necessarily the stronger team.80 P(A wins the series) 0.40 0.8740 0.60 0.8 1 Figure 6.50 0. if the teams are fairly evenly matched. What is the expected length of the series? To find this.1 p 0.4 p 0.54 0.49 0.4346 0.5869 0.9667 77 It can be seen.1.4781 0.46 0.1 Probability of winning the series and the probability of winning a single game.739613 or 0. The maximum difference occurs when the probability of winning a single game is 0.7102 0. The graph in Figure 6.2 0.5437 0.3917 0.4563 0.6083 0.45 0.47 0.5219 0.4 0.1 shows the probability of winning the series and the probability of winning a single game.2 0.6 0.70 0.4131 0.5654 0.55 0.51 0.

5 5.3 Probability that the winner of the first game wins the series.25 5 4.8 1 Figure 6. we see that P(winner of the first game wins the series) = p3 + 3p3 q + 6p3 q2 + 10p3 q3 and this can be written as a function of p as P(winner of the first game wins the series) = p3 (20 − 45p + 36p2 − 10p3 ) A graph of A as a function of p is shown in Figure 6.5 4.6 0.78 Chapter 6 Seven-Game Series in Sports This can be simplified to A = 4 1 + p + p2 + p3 − 13p4 + 15p5 − 5p6 A graph of A as a function of p is shown in Figure 6. 5.75 5.55.4 p 0.45 ≤ p ≤ 0. The graph shows that in the range 0.75 4.6 0.2.4 p 0.2 0.4 0. 1 0.8 games.8 Probability 0.2 0.8 1 Length Figure 6.2 Expected length of the series.2 0. we consider the probability that the winner of the first game wins the series. WINNING THE FIRST GAME Since the probability of winning the series is not much different from that of winning a single game.25 0. From the sample space. . the average length of the series is almost always about 5.3.6 0.

So there are (n − 1)/2 + x = (n + 2x − 1)/2 games played before the final game. if p is greater than this. so. In fact. . as shown in Table 6. this becomes P(Awins the series) = p4 1 + and if n = 9. The other graph shows the probability that the winner of the first game wins the series. During this time. It follows that (n−1)/2 P(Awins the series) = p(n+1)/2 x=0 (n + 2x − 1)/2 x q x If n = 7. then the winner of the first game is more likely to win the series. If n = 7. Call the winner of the series A and suppose the probability that A wins an individual game is p and the probability that the loser of the series wins an individual game is q. How Long Should the Series Last? We have found that the winner of a seven-game series is not necessarily the better team. If n = 9. in a series of n games. this becomes P(Awins the series) = p5 1 + 5 6 2 7 3 8 4 q+ q + q + q 1 2 3 4 4 5 2 6 3 q+ q + q 1 2 3 Increasing the number of games to nine changes slightly the probability that A wins the series.347129. If A wins the series. then the winner must win five games. the probability that the weaker team wins the series is about equal to the probability that the team wins an individual game. Perhaps a solution to this is to increase the length of the series so that the probability that the winner of the series is in fact the stronger team is increased. then the winner must win four games.Winning the First Game 79 The straight line gives the probability that an individual game is won. The graphs intersect at p = 0.2. Now we find a formula for the probability that a given team wins a series of n games. should the series last? We presume the series is over when. then A has won (n + 1)/2 games and has won (n + 1)/2 − 1 = (n − 1)2 games previously to winning the last game. one team has won (n + 1)/2 games. How long. if the teams are about evenly matched. the loser of the series has won x games (where x can be as low as 0 and at most (n − 1)/2 games). then.

say 0.413058 0.549114 0.450886 0.8 1 Figure 6.573475 0.51 0.95 x where we know p and q.621421 The difference between the two series can be seen in Figure 6. Now suppose we want P(A wins the series) to be some high probability.49 0.4 The steeper curve is that for a nine-game series.4. such as one given in Figure 6.378579 0.434611 0.4 0.6 p 0.475404 0.391712 0.95 and solve the resulting equation for n. we defined a function h[p. n] as follows: (n−1)/2 h[p.500000 0. 0.2 0.54 0.456320 0.543680 0. This requires that the graph.597602 0.80 Chapter 6 Seven-Game Series in Sports Table 6.2 0. How many games should be played? To solve this in theory.53 0.586942 0.46 0. put P(A wins the series) = 0.50 0.48 0.45 0.5. So.95) for some value of n.8 0.521866 0. n] = p (n+1)/2 x=0 (n + 2x − 1)/2 x q x .6 0. The equation to be solved is then (n−1)/2 p(n+1)/2 x=0 (n + 2x − 1)/2 x q = 0.95.2 Probability of Winning the Series p 0. should pass through the point (p.478134 0.52 0.55 Seven-game series 0. Suppose again that p is the probability that A wins an individual game and that p > 0.500000 0.402398 0.47 0. This cannot be done exactly.565389 0.608288 Nine-game series 0.573475 0.4 0.4. 1 P(A wins) 0.426525 0.

We then experimented with values of p and n until a probability of about 0.95 was achieved. The results are shown in Table 6.3.

Table 6.3

p       Number of games (n)
0.60      63
0.59      79
0.58      99
0.57     131
0.56     177
0.55     257
0.54     401
0.53     711
0.52    1601
0.51    4000*

* Even at n = 4000 the probability 0.95 has not been reached.

It is apparent that the length of the series becomes far too great, even in the case where p = 0.60, so that the teams are really quite unevenly matched. As the teams approach parity, the number of games required grows quite rapidly. If p = 0.51, the number of games exceeds 4000! In fact h[0.51, 4000] = 0.90, so even with 4000 games (which is difficult to imagine!) we have not yet achieved a probability of 0.95! This should be overly convincing that waiting until a team has won (n + 1)/2 games is not a sensible plan for deciding which team is stronger.

CONCLUSIONS

We have studied seven-game series in sports and have concluded that the first team to win four games is by no means necessarily the stronger team. The winner of the first game has a decided advantage in winning the series. Extending the number of games so that the winner of a lengthy series is most probably the better team is quite unfeasible. We will not pursue it here, but tennis turns out to be a fairly good predictor of who the better player is; this is because the winner of a game must win at least two points in a row and the winner of a set must win at least two games in a row.

EXPLORATIONS

1. Verify that the probabilities given for the seven-game series, where p is the probability that a given team wins a game and q = 1 − p is the probability that a given team loses a game, add up to 1.
2. Compare the actual record of the lengths of the World Series that have been played to date with the probabilities of the lengths assuming the teams to be evenly matched. Does the actual record suggest that the teams are not evenly matched?
3. Find data showing the winner of the first game in a seven-game series and determine how often that team wins the series. Do the data suggest that the teams are not evenly matched?
4. Compare the actual expected length of the series with the theoretical expected length.

Chapter 7

Waiting Time Problems

CHAPTER OBJECTIVES:
• to develop the negative binomial probability distribution
• to apply the negative binomial distribution to a quality control problem
• to show some practical applications of geometric series
• to show how to sum some series which are not geometric (without calculus)
• to encounter the Fibonacci sequence when tossing a coin
• to show an unfair game with a fair coin
• to introduce the negative hypergeometric distribution
• to consider an (apparent) logical contradiction

We now turn our attention to some problems usually not considered in an introductory course in probability and statistics, namely, waiting time problems.

WAITING FOR THE FIRST SUCCESS

Recall, from Chapter 5, that in a binomial situation we have an experiment with one of two outcomes, which, for lack of better terms, we call "success" and "failure." We must also have a constant probability of success, say p, and consequently a constant probability of failure, say q, where of course p + q = 1. It is also necessary for the experiments, or trials, to be independent. As we have seen in problems involving the binomial probability distribution, we consider a fixed number of trials and calculate the probability of a given number of "successes." In this situation, it is common to define a random variable, X, denoting the number of successes in n independent trials. Now we consider problems where we wait for a success, or a given number of successes, or some pattern of successes and failures.

We have seen that the probability distribution function is

P(X = x) = C(n, x) p^x q^(n−x), x = 0, 1, 2, ..., n

and we have verified that the resulting probabilities add up to 1. Tossing a two-sided loaded coin is a perfect binomial model. Now, however, we do not have a fixed number of trials; rather, we fix the number of successes and then the number of trials becomes the random variable.

Let us begin by waiting for the first success. We assume the binomial presumptions, as we did in Chapter 5, again using the symbol X, which now denotes the number of trials necessary to achieve the first success. The sample space and associated probabilities are shown in Table 7.1, where S denotes a success and F denotes a failure.

Table 7.1

Sample point    Number of trials    Probability
S               1                   p
FS              2                   qp
FFS             3                   q^2 p
FFFS            4                   q^3 p
...             ...                 ...

Now, it is apparent that since we must have x − 1 failures followed by a success,

P(X = x) = q^(x−1) p, x = 1, 2, 3, ...

where the values for X now have no bound. We called this a geometric probability distribution. We know that

S = p + qp + q^2 p + q^3 p + · · · = 1

so we have shown that the probabilities sum to 1 and have established that we have a true probability distribution function. Now we consider a very specific example and after that we will generalize this problem.

THE MYTHICAL ISLAND

Here is an example of our waiting time problem. On a mythical island, couples are allowed to have children until a male child is born. Suppose that the probabilities of a male birth or a female birth are each 1/2 (which is not the case in actuality) and that the births are independent of each other. What effect, if any, does this have on the male:female ratio on the island? The answer may be surprising. Here p = q = 1/2, so we again have a probability distribution. The sample space now is shown in Table 7.2.

Table 7.2

Sample point    Probability
M               1/2
FM              1/4
FFM             1/8
FFFM            1/16
...             ...

We now want to know the expected number of males in a family. To find this, let

AM = 1 · (1/2) + 1 · (1/4) + 1 · (1/8) + 1 · (1/16) + · · ·

then

(1/2) AM = 1 · (1/4) + 1 · (1/8) + 1 · (1/16) + · · ·

and subtracting the second series from the first series,

(1/2) AM = 1/2, so AM = 1

Now the average family size is

A = 1 · (1/2) + 2 · (1/4) + 3 · (1/8) + 4 · (1/16) + · · ·

and so

(1/2) A = 1 · (1/4) + 2 · (1/8) + 3 · (1/16) + 4 · (1/32) + · · ·

Subtracting these series gives

A − (1/2) A = 1 · (1/2) + 1 · (1/4) + 1 · (1/8) + 1 · (1/16) + · · · = 1

so A = 2. Since the average number of male children in a family is 1, so is the average number of females, giving the male:female ratio as 1:1, just as it would be if the restrictive rule did not apply!
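The island argument can be checked with a short simulation; this is our sketch, and the seed and the number of families are arbitrary.

import random

random.seed(1)

def family():
    # Children are born until the first boy; each birth is a boy with probability 1/2.
    girls = 0
    while random.random() < 0.5:   # a girl arrives
        girls += 1
    return 1, girls                # exactly one boy, preceded by some number of girls

families = 100_000
boys = girls = 0
for _ in range(families):
    b, g = family()
    boys += b
    girls += g

print(boys / families, girls / families)   # both near 1, so the ratio stays about 1:1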

We now generalize the waiting time problem beyond the first success.

WAITING FOR THE SECOND SUCCESS

The sample space and associated probabilities when we wait for the second success are shown in Table 7.3. Note that we are not necessarily waiting for two successes in a row; we will consider that problem later in this chapter.

Table 7.3

Sample point    Number of trials    Probability
SS              2                   p^2
FSS             3                   q p^2
SFS             3                   q p^2
FFSS            4                   q^2 p^2
FSFS            4                   q^2 p^2
SFFS            4                   q^2 p^2
...             ...                 ...

We conclude that if the second success occurs at the xth trial, then the first x − 1 trials must contain exactly one success (and this is the usual binomial process), and this is followed by the second success when the experiment ends. It is clear that among the first x − 1 trials, we must have exactly one success. We conclude that

P(X = x) = C(x − 1, 1) p q^(x−2) · p

or

P(X = x) = C(x − 1, 1) p^2 q^(x−2), x = 2, 3, 4, ...

Again we must check that we have assigned a probability of 1 to the entire sample space. Adding up the probabilities we see that

sum[x = 2 to ∞] P(X = x) = p^2 sum[x = 2 to ∞] C(x − 1, 1) q^(x−2)

Call the summation T, so

T = sum[x = 2 to ∞] C(x − 1, 1) q^(x−2) = sum[x = 2 to ∞] (x − 1) q^(x−2) = 1 + 2q + 3q^2 + 4q^3 + 5q^4 + · · ·

Now

qT = q + 2q^2 + 3q^3 + 4q^4 + · · ·

and subtracting qT from T it follows that

(1 − q)T = 1 + q + q^2 + q^3 + · · ·

which is a geometric series. So (1 − q)T = 1/(1 − q) and so T = 1/(1 − q)^2 = 1/p^2, and since

sum[x = 2 to ∞] P(X = x) = p^2 sum[x = 2 to ∞] C(x − 1, 1) q^(x−2)

it follows that

sum[x = 2 to ∞] P(X = x) = p^2 · (1/p^2) = 1

We used the process of multiplying the series T = 1 + 2q + 3q^2 + 4q^3 + · · · by q in order to sum the series because this process will occur again. We could also have noted that T = 1/(1 − q)^2 = (1 − q)^(−2) = 1/p^2 by the binomial expansion with a negative exponent.

We will generalize this in the next section and see that our last two examples are special cases of what we will call the negative binomial distribution.

WAITING FOR THE rth SUCCESS

We now generalize the two special cases we have done and consider waiting for the rth success, where r = 1, 2, 3, .... Again the random variable X denotes the waiting time or the total number of trials necessary to achieve the rth success. It is clear that if the xth trial is the rth success, then the previous x − 1 trials must contain exactly r − 1 successes by a binomial process and in addition the rth success occurs on the xth trial. So

P(X = x) = C(x − 1, r − 1) p^(r−1) q^(x−r) · p = C(x − 1, r − 1) p^r q^(x−r), x = r, r + 1, r + 2, ...

The function is known as the negative binomial distribution due to the occurrence of a binomial expansion with a negative exponent and is defined the way we have above as

P(X = x) = C(x − 1, r − 1) p^r q^(x−r), x = r, r + 1, r + 2, ...

When r = 1 we wait for the first success and the probability distribution becomes

P(X = x) = p q^(x−1), x = 1, 2, 3, ...

again as we found above; if r = 2, the probability distribution becomes

P(X = x) = (x − 1) p^2 q^(x−2), x = 2, 3, 4, ...

as we saw above.

As in the special cases, we must check that the probabilities sum to 1. Consider

T = sum[x = r to ∞] C(x − 1, r − 1) q^(x−r) = 1 + C(r, r − 1) q + C(r + 1, r − 1) q^2 + C(r + 2, r − 1) q^3 + · · ·

But this is the expansion of (1 − q)^(−r), so

sum[x = r to ∞] P(X = x) = p^r sum[x = r to ∞] C(x − 1, r − 1) q^(x−r) = p^r (1 − q)^(−r) = p^r p^(−r) = 1

so our assignment of probabilities produces a probability distribution function.

MEAN OF THE NEGATIVE BINOMIAL

The expected value of the negative binomial random variable is easy to find.

E[X] = sum[x = r to ∞] x · C(x − 1, r − 1) p^r q^(x−r) = p^r · r · sum[x = r to ∞] C(x, r) q^(x−r)
     = p^r · r · [1 + C(r + 1, 1) q + C(r + 2, 2) q^2 + C(r + 3, 3) q^3 + · · ·]

The quantity in the square brackets is the expansion of (1 − q)^(−(r+1)) and so

E[X] = p^r · r · (1 − q)^(−(r+1)) = r/p

If p is fixed, this is a linear function of r, as might be expected. If we wait for the first head in tossing a fair coin, r = 1 and p = 1/2, so our average waiting time is two tosses.

COLLECTING CEREAL BOX PRIZES

A brand of cereal promises one of six prizes in a box of cereal. On average, how many boxes must a person buy in order to collect all the prizes? Here r above remains at 1, but the value of p changes as we collect the coupons. The first box yields the first prize, but then the average waiting time to find the next prize is 1/(5/6). After that, the average waiting time for the next prize is 1/(4/6), and so on, giving the total number of boxes bought on average to be

1 + 1/(5/6) + 1/(4/6) + 1/(3/6) + 1/(2/6) + 1/(1/6) = 14.7 boxes

Note that the waiting time increases as the number of prizes collected increases.
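Both results in this section are easy to confirm numerically. The sketch below is our code: it checks that the negative binomial probabilities sum to 1 and have mean r/p, and then repeats the cereal box calculation.

from math import comb

def neg_binomial_pmf(x, r, p):
    # P(the rth success occurs on trial x) = C(x - 1, r - 1) p^r q^(x - r)
    return comb(x - 1, r - 1) * p**r * (1 - p) ** (x - r)

r, p = 3, 0.4
xs = range(r, 500)
probs = [neg_binomial_pmf(x, r, p) for x in xs]
print(sum(probs))                                   # essentially 1
print(sum(x * pr for x, pr in zip(xs, probs)))      # essentially r/p = 7.5

# The cereal box problem: each new prize has a geometric waiting time with
# mean 1/p, and p falls as prizes accumulate.
print(sum(6 / (6 - j) for j in range(6)))           # 14.7 boxes on average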

HEADS BEFORE TAILS

Here is another game with a coin, this time a loaded one. Let p be the probability that a head occurs when the coin is tossed. A running count of the heads and tails is kept, and we want the probability that the heads count reaches three before the tails count reaches four. Let us call this event "3 heads before 4 tails". If the event is to occur, we must throw the third head on the last trial, and this must be preceded by at most three tails. So if we let x denote the number of tails, then the random variable x must be 0, 1, 2, or 3. The tails must occur in the first 2 + x trials (we need two heads and x tails) and of course we must have three heads in the final result. The probability of this is then

P(3 heads before 4 tails) = p^3 sum[x = 0 to 3] C(2 + x, 2) (1 − p)^x
                          = p^3 [1 + 3q + 6q^2 + 10q^3]
                          = p^3 [4 − 3p + 6(1 − p)^2 + 10(1 − p)^3]
                          = 36p^5 − 45p^4 − 10p^6 + 20p^3

A graph of this function is shown in Figure 7.1 for various values of p.

Figure 7.1 (probability as a function of p)

Let us generalize the problem so that we want the probability that a heads occur before b tails. The last toss must be a head. Then, of the preceding tosses, exactly a − 1 must be heads and x must be tails, where x is at most b − 1. So

P(a heads before b tails) = sum[x = 0 to b−1] C(a − 1 + x, x) p^a q^x

Now let us fix the number of tails, say let b = 5. (Note that a was fixed above at 3.) So

P(a heads before 5 tails) = sum[x = 0 to 4] C(a − 1 + x, x) p^a q^x

A graph of this function is in Figure 7.2, where we have taken p = 0.6.

Figure 7.2 (probability as a function of a, with p = 0.6 and b = 5)

Finally, we show in Figure 7.3 a graph of the surface when both a and p are allowed to vary.
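The general formula can be evaluated directly; a small Python check (the function name is ours) reproduces the value of the polynomial above at p = 0.5.

from math import comb

def p_heads_before_tails(a, b, p):
    # P(the heads count reaches a before the tails count reaches b);
    # the last toss is the a-th head, preceded by x < b tails.
    q = 1 - p
    return sum(comb(a - 1 + x, x) * p**a * q**x for x in range(b))

print(p_heads_before_tails(3, 4, 0.5))                  # 0.65625
p = 0.5
print(20 * p**3 - 45 * p**4 + 36 * p**5 - 10 * p**6)    # also 0.65625, the polynomial form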

75 0. These are shown in Table 7. .75 0. . Let us begin with the points in the sample space where we have grouped the sample points by the number of experiments necessary. now we turn to some unusual problems involving waiting for more general patterns in binomial trials. Note that this differs from waiting for the second success that is a negative binomial random variable.25 0 0 5 10 a 15 20 0 Probability 1 0.3 WAITING FOR PATTERNS We have considered the problem of waiting for the rth head in coin tossing.4 Sample point HH THH TTHH HTHH TTTHH THTHH HTTHH TTTTHH TTHTHH THTTHH HTTTHH HTHTHH .90 Chapter 7 Waiting Time Problems 1 0. We begin with a waiting time problem that involves waiting for two successes in a row.4. We will encounter some interesting mathematical consequences and we will show an unfair game with a fair coin. . . .5 0. Number of points 1 1 2 3 5 . Table 7.5 0.25 p Figure 7.

The Fibonacci sequence begins with 1. This formula is a recursion and we will study this sort of formula in Chapter 16. using the points in the previous table. . 3. 1. for example. subsequent terms are found by adding the two immediately preceding terms. Now multiply this recursion through by n and sum this result from n = 3 to infinity to find ∞ ∞ ∞ nan = q n=3 n=3 nan−1 + qp n=3 nan−2 . But the Fibonacci pattern does hold here! Here is why: if two heads in a row occur on the nth toss. then either the sequence begins with T followed by HH in n − 1 tosses or the sequence begins with HT (to avoid the pattern HH) followed by HH in n − 2 tosses. So the number of points in the sample space is found by writing T before every point giving HH in n − 1 tosses and writing HT before every point giving HH in n − 2 tosses. Then. 2. . we cannot conclude that just because the pattern holds in the first few cases. are the points for which HH occurs for the first time at the seventh toss: T |TTTTHH T |TTHTHH T |THTTHH T |HTTTHH T |HTHTHH HT |TTTHH HT |THTHH HT |HTTHH The assignment of probabilities to the sample points is easy. the Fibonacci sequence.Expected Waiting Time for HH 91 We see that the number of points is 1. but the pattern they follow is difficult and cannot be simply stated. 1. Can it be that the number of points follows the Fibonacci sequence? Of course. 5. So the total number of points in the sample space for the occurrence of HH in n tosses is the sum of the number of points for which HH occurs in n − 1 tosses and the number of points in which HH occurs in n − 2 tosses. let an denote the probability that the event HH occurs at the nth trial. . Here. using the argument leading to the Fibonacci series above. EXPECTED WAITING TIME FOR HH To calculate this expectation. it follows that an = qan−1 + qpan−2 for n > 2 and we take a1 = 0 and a2 = p2 . .

This can also be written as E[N] = 1/p2 + 1/p. 2. . It might be thought that this expectation would be just 1/p2 . It is fairly easy to see that if we want for HHH then we get a “super” Fibonacci sequence in which we start with 1. Now the number of points in the sample space is simple. 1 and obtain subsequent terms by adding the previous three terms in the sequence.5 shows some of the sample points. Table 7. . . . this expectation is six tosses. 3.92 Chapter 7 Waiting Time Problems which we can also write as ∞ ∞ ∞ nan = q n=3 n=3 ∞ [(n − 1) + 1]an−1 + qp n=3 ∞ ∞ [(n − 2) + 2]an−2 or ∞ ∞ nan = q n=3 n=3 (n − 1)an−1 + q n=3 an−1 + qp n=3 (n − 2)an−2 + 2qp n=3 an−2 This simplifies. to E[N] − 2a2 = qE[N] + q + qpE[N] + 2qp or so (1 − q − qp)E[N] = 2p2 + q + 2qp = 1 + p E[N] = (1 + p)/(1 − q − qp) = (1 + p)/p2 . 1.. but that is not quite the case. Table 7. Suppose TH occurs on the nth toss. since ∞ ∞ nan = E[N] and n=1 n=1 an = 1. n − 2 H’s.5 Sample point TH TTH HTH TTTH HTTH HHTH TTTTH HTTTH HHTTH HHHTH . 1. let us consider waiting for the pattern TH. So there are n − 1 points for which TH occurs on the nth toss. The sample point then begins with 0. As a second example.. For a fair coin..

. This pattern continues until we come to the sample point with n − 2 H’s followed by TH. The probability of this point has a factor of pn−2 . and by letting X denote the total number of tosses necessary. 4. . This pattern continues. It follows in the case where q = p = 1/2 that P(X = n) = n−1 2n for n = 2. then its probability has a factor of qn−2 . 3. 3. EXPECTED WAITING TIME FOR TH Using the formula above. there are n − 1 of them.Expected Waiting Time for TH 93 This observation makes the sample space fairly easy to write out and it also makes the calculation of probabilities fairly simple. The formula would not work for q = p = 1/2. In that case the sample points are equally likely. If the sample point begins with HH. . The probabilities of the points add to qp(q3 + pq2 + p2 q + p3 ) This can be recognized as qp(q4 − p4 )/(q − p). 4. then its probability has a factor of p2 qn−4 . If the sample point begins with H. If the sample point begins with no heads. . consider ∞ S= n=2 n · qn−1 = 2q + 3q2 + 4q3 + · · · . . each having probability (1/2)n and. . as we have seen. Consider the case for n = 5 shown in the sample space above. then its probability has a factor of pqn−3 . we find that P(X = n) = qp(qn−1 − pn−1 ) q−p for n = 2. ∞ E[N] = n=2 n· qp(qn−1 − pn−1 ) q−p To calculate this sum. Note that the probability of any sample point has the factor qp for the sequence TH at the nth toss.

We will simply state the average waiting times for each of the patterns of length 3 with a fair coin in Table 7. it is an unfair game with a fair coin! .) There is apparently no intuitive reason for this to be so. the average waiting time for TT is six tosses and the average waiting time for HT is four tosses. In fact. Table 7. This may be a surprise.6. qp The formula above applies only if p = q. In the case p = q. We can write this as ∞ ∞ 2 n=2 [(n + 1) − 2]P(X = n + 1) = n=2 nP(X = n) and from this it follows that 2E[N] − 4 = E[N] so E[N] = 4. but these results can easily be verified by simulation to provide some evidence that they are correct.we first consider / P(X = n + 1) n = P(X = n) 2(n − 1) So ∞ ∞ 2(n − 1)P(X = n + 1) = n=2 n=2 nP(X = n). We found that the average waiting time for HH with a fair coin is six. AN UNFAIR GAME WITH A FAIR COIN Here is a game with a fair coin. E[N] = 1 qp −1− q − p p2 1 −1 q2 = 1 .6 Pattern HHH THH HTH TTH THT HTT HHT TTT Average waiting time 14 8 10 8 10 8 8 14 We continue this chapter with an intriguing game. So S = 1/p2 − 1.94 Chapter 7 Waiting Time Problems Now S + 1 = 1 + 2q + 3q2 + 4q3 + · · · and we have seen that the left-hand side of this equation is 1/p2 . (By symmetry.

We show the probabilities that B beats A in Table 7. If a T is tossed at any time. HT . If the first two tosses are TT then this can be followed by any number of T ’s but eventually H will occur and I will win. If you choose TT . I win. This means that if we play the game twice. I will choose TH. THREE TOSSES This apparent paradox. I will choose HT and I will win 3/4 of the time. If we consider patterns with three tosses. then I win since the pattern TH occurred before the pattern HH. I will win. The fact that the patterns HH. but eventually H will occur and I win. Probabilities then are not transitive. My probability of beating you is 3/4! Here is why. I can increase that probability to 3/4 or 7/8!). Now let us play the coin game again where A is the first player and B is the second player. If you choose one of these patterns. continues with patterns involving patterns of length 3. the choice of pattern is crucial. If the first two tosses are HT then this can be followed by any number of T ’s.Three Tosses 95 Consider the patterns HH. you win. this is the only way you can win and it has probability 1/4. and then my probability of beating you is only 1/2. and I win in the first game. The winner is the person who chose the firstoccurring pattern. Note that letting “>” means “beats” (probably) TTH > HTT > HHT > THH > TTH! . A sensible choice for you is either TH or HT. then it does not follow that pattern A will necessarily beat pattern C. so my probability of winning is 3/4. then you can choose the pattern I chose on the first game.7. I will probably beat you. if you choose HH. you must toss HH on the first two tosses. We showed above the average waiting times for each of the eight patterns that can occur when a fair coin is tossed (Table 7. TH. as shown in the table above. For example.6). then the only way you can win is by tossing two tails on the first two tosses. TH. HT . This can be tried by simulation. It is puzzling to note that no matter what pattern you choose. and TT for two tosses of a fair coin. If you are to win. I will choose another and then we toss a fair coin until one of these patterns occurs. my minimum probability of beating you is 2/3! (And if you do not choose well. If the first two tosses are HH. so if pattern A beats pattern B and pattern B beats pattern C. If the first two tosses are TH. that probabilities are not transitive. and I can still probably beat you. If we see the sequence HTTTTH. and TT are equally likely for a fair coin may be observed by a game player who may think that any choice is equally good is irrelevant. as is the fact that he is allowed to make the first choice.
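The winning probabilities for the second player can also be checked by simulation. A minimal sketch, assuming a fair coin and 100,000 plays per matchup (the function names are ours), is:

import random

random.seed(3)

def winner(pat_a, pat_b):
    # Toss a fair coin until one of the two patterns appears; report which player won.
    recent = ""
    while True:
        recent += random.choice("HT")
        if recent.endswith(pat_a):
            return "A"
        if recent.endswith(pat_b):
            return "B"

def prob_b_beats_a(pat_a, pat_b, trials=100_000):
    return sum(winner(pat_a, pat_b) == "B" for _ in range(trials)) / trials

print(prob_b_beats_a("HHH", "THH"))   # close to 7/8
print(prob_b_beats_a("HTH", "HHT"))   # close to 2/3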

we must choose one of the two previous payers to pay for the third lunch. and C. the probability that X = 4 is 18/81 = 2/9.7 A’s Choice HHH HHT HTH HTT THH THT TTH TTT B’s Choice THH THH HHT HHT TTH TTH HTT HTT P (B beats A) 7/8 3/4 2/3 2/3 2/3 2/3 3/4 7/8 Nor is it true that a pattern with a shorter average waiting time will necessarily beat a pattern with a longer waiting time. The fourth lunch can be paid by any of the three so this gives 3 · 2 · 1 · 3 = 18 ways in which this can be done. On average. This is a bit under three. WHO PAYS FOR LUNCH? Three friends. go to lunch regularly. There are then 3 · 2 · 2 = 12 ways in which this can be done and since each way has probability 1/27. If X = 3. Now what happens as the size of the group increases? Does the randomness affect the number of lunches taken? . the probability that X = 3 is 12/27 = 4/9. We calculate the probabilities of each of these values. Since each has probability (1/3)4 . either of the other two must pay for the second lunch and the one remaining must pay for the third lunch. how many times will the group go to lunch? Let X denote the number of dinners the group enjoys. It can be shown that the average waiting time for THTH is 20 tosses and the average waiting time for HTHH is 18 tosses. The expected number of lunches is then E(X) = 2·3/9+3·4/9+4·2/9 = 26/9. so they might as well go to lunch three times and forget the random choices except that sometimes someone never pays. Nonetheless. The payer at each lunch is selected randomly until someone pays for lunch for the second time. Finally. whom we will call A. Then the same person must pay for the second lunch as well. Probability contains many apparent contradictions. Clearly. X = 2. the probability that THTH occurs before HTHH is 9/14. or 4. If X = 2.96 Chapter 7 Waiting Time Problems Table 7. 3. B. then we have a choice of any of the three to pay for the first lunch. Then we must choose one of the two who did not pay for the first lunch and finally. These probabilities add up to 1 as they should. then we have a choice of any of the three to pay for the first lunch. if X = 4 then any of the three can pay for the first lunch. The probability of this is 1/3.

Suppose that there are four people in the group, A, B, C, and D. We calculate the probabilities in much the same way as we did when there were three for lunch. Then X = 2, 3, 4, or 5.

P(X = 2) = (4 · 1)/(4 · 4) = 1/4
P(X = 3) = (4 · 3 · 2)/(4 · 4 · 4) = 3/8
P(X = 4) = (4 · 3 · 2 · 3)/(4 · 4 · 4 · 4) = 9/32

And finally,

P(X = 5) = (4 · 3 · 2 · 1 · 4)/(4 · 4 · 4 · 4 · 4) = 3/32

and these sum to 1 as they should. Then E(X) = 2 · 1/4 + 3 · 3/8 + 4 · 9/32 + 5 · 3/32 = 103/32 = 3.21875. This is now a bit under 4, so the randomness is having some effect.

To establish a general formula for P(X = x) for n lunchers, note that the first payer can be any of the n people, the next must be one of the n − 1 people, the third one of the n − 2 people, and so on. The next to the last payer is one of the n − (x − 2) people and the last payer must be one of the x − 1 people who have paid once. This means that

P(X = x) = (n/n) · ((n − 1)/n) · ((n − 2)/n) · ((n − 3)/n) · · · ((n − (x − 2))/n) · ((x − 1)/n)

This can be rewritten as

P(X = x) = (1/n^x) C(n, x − 1) (x − 1) (x − 1)!,   x = 2, 3, ..., n + 1

If there are 10 diners, the probabilities of having x lunches are shown in Table 7.8.

Table 7.8

x      P(X = x)
2      0.1
3      0.18
4      0.216
5      0.2016
6      0.1512
7      0.09072
8      0.042336
9      0.0145152
10     0.00326592
11     0.00036288
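The general formula and the two expectations above are easy to confirm numerically; a short Python check (our function names) is:

from math import comb, factorial

def lunch_pmf(x, n):
    # P(the group gets exactly x lunches with n diners), x = 2, ..., n + 1
    return comb(n, x - 1) * factorial(x - 1) * (x - 1) / n**x

def expected_lunches(n):
    return sum(x * lunch_pmf(x, n) for x in range(2, n + 2))

print(sum(lunch_pmf(x, 10) for x in range(2, 12)))   # 1.0
print(expected_lunches(3))                           # 26/9 = 2.888...
print(expected_lunches(4))                           # 103/32 = 3.21875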

219 3.660 5.98 Chapter 7 Waiting Time Problems 0. The number of diners now increases beyond any sensible limit.025 18.546 6.02 0. providing some mathematical challenges.550 9. The computer algebra system Mathematica was used to produce Table 7.03 0.500 2.258 . it is possible to compute the probabilities for various values of X with some accuracy.9 n 2 3 4 5 6 10 15 20 30 50 100 150 200 300 400 450 Expectation 2.01 10 20 30 40 50 Figure 7.4 we show a graph of the probabilities for 100 diners.398 22.4 Diners With a computer algebra system. Table 7.05 0.9 where n is the number of lunchers and the expected number of lunch parties is computed.04 0.738 27.294 7.775 4.543 13.381 25.210 16. EXPECTED NUMBER OF LUNCHES It is also interesting to compute the expectations for increasing numbers of diners and to study the effect the randomness has.889 3. In Figure 7.510 3.06 Probability 0.

Now we define the negative hypergeometric random variable.86266 + 0. of Dinners 12 10 8 6 4 0 20 40 60 80 100 Figure 7.5. a questionable quality control procedure to say the least. 50 of which are special. We want to sample. NEGATIVE HYPERGEOMETRIC DISTRIBUTION A manufacturer has a lot of 400 items. of Lunches = 8. without replacement. The problem here is our final example of a waiting time problem. · · · r−1 We showed previously that the expected value of Y is E(Y ) = r/p. we would encounter the negative binomial distribution. a fact we will return to later. r + 1. Had the inspected items been replaced. Expected No. suppose the lot of N items contains k special items. A graph of the expected number of lunches as a function of the number of diners is shown in Figure 7. until we find c of the special items.Negative Hypergeometric Distribution 99 We find n+1 E[X] = x=2 n 1 (x − 1)x! x x−1 n It is interesting to note that adding one person to the dinner party has less and less an effect as n increases.04577 · No. To be specific. Again Y is the random variable denoting the number of trials necessary. the random variable representing the number of special items found leads to the negative hypergeometric distribution. of People. Since the first . y = r. of People A least squares straight line regression gives Expected No. The items are inspected one at a time until 10 of the special items have been found.5 No. Recall that if Y is the waiting time for the rth success then P(Y = y) = y − 1 r−1 p (1 − p)y−r · p. which we have seen when waiting for the rth success in a binomial process. A statistical test indicates that the fit is almost perfect for the calculated points. Then the sampling process stops. If the inspected items are not replaced in the lot.

01 0. . .005 40 60 Figure 7.454.015 Probability 0.6 80 y 100 120 140 Some of the individual probabilities are shown in Table 7. The maximum number of trials must occur when the first N − k trials contain all the nonspecial items followed by all c special items. k N −k · c−1 y−c N y−1 k − (c − 1) . . c + 1. . N − (k − c) N − (y − 1) P(Y = y) = · Note that the process can end in as few as c trials. . we can easily calculate all the values in the probability distribution function and draw a graph that is shown in Figure 7. Some elementary simplifications show that y−1 N −y · c−1 k−c . N = 400. . 0.100 Chapter 7 Waiting Time Problems y − 1 trials comprise a hypergeometric process and the last trial must find a special item. . and c = 10.6. Note that had we been using a negative binomial model the mean would be c c 10 = = = 80 p k/N 50/400 The fact that this negative binomial mean is always greater than that for the negative hypergeometric has some implications. . With a computer algebra system such as Mathematica r . y = c.6275 and the variance is 425. c + 1.10. The mean value of Y is E(Y ) = 4010/51 = 78. y = c. but first we establish formulas for the mean and variance of the negative hypergeometric random variable. k = 50. N − (k − c) N k P(Y = y) = In the special case we have been considering. This fact will be shown below. .

0148697 0.0118532 0.Mean and Variance of the Negative Hypergeometric Table 7.0186812 0.00411924 0.0150666 0.00995123 0.00286975 0.0124115 0.0190899 0. If we calculate P(Y = y)/P(Y = y − 1).00843214 0. This becomes N−k+c [(N + 1 + c)y − y2 − c(N + 1)]P(Y = y) y=c N−k+c = y=c [(N − k + c)(y − 1) − (y − 1)2 ]P(Y = y − 1) .00572262 0.0170618 0.00768271 0. we find after simplification that N−k+c (y − c)(N − y + 1)P(Y = y) y=c N−k+c = y=c (y − 1)(N − y + 1 − k + c)P(Y = y − 1) where we have also indicated a sum over all the possible values for Y.0194303 0.0028691 101 MEAN AND VARIANCE OF THE NEGATIVE HYPERGEOMETRIC We use a recursion.10 y 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 Probability 0.00530886 0.0175912 0.

we establish the fact that the mean of the negative binomial always exceeds that of the negative hypergeometric. We start with the previous recursion and multiply by y: N−k+c y(y − c)(N − y + 1)P(Y = y) y=c N−k+c = y=c y(y − 1)(N − y + 1 − k + c)P(Y = y − 1) The left-hand side reduces quite easily to (N + c + 1)E(Y 2 ) − E(Y 3 ) − c(N + 1)E(Y ) The right-hand side must be first expanded in terms of y − 1 and then it can be simplified to (N + c − k)E(Y ) + (N − 1 + c − k)E(Y 2 ) − E(Y 3 ) Then. this fact must be used sparingly if approximations are to be drawn. N + Nk > k + Nk so that cN/k > c(N + 1)/k + 1 establishing the result. putting the sides together. Although it is true that the negative binomial distribution is the limiting distribution for the negative hypergeometric distribution.102 Chapter 7 Waiting Time Problems and this can be further expanded and simplified to (N + 1 + c)E(Y ) − E(Y 2 ) − c(N + 1) = (N − k + c)[E(Y ) −(N − k + c)P(Y = N − k + c) − [E(Y 2 ) − (N − k + c)2 P(Y = N − k + c)] When this is simplified we find E(Y ) = c(N + 1) k+1 In our special case this gives E(Y ) = 10(400+1)/(50+1) = 4010/51 = 78. Now we establish a formula for the variance of the negative hypergeometric random variable. We will return to this point later. Since N > k. we find E(Y 2 ) = from which it follows that Var(Y ) = c(N + 1) [(N − k)(k − c + 1)] (k + 2)(k + 1)2 (cN + 2c + N − k)c(N + 1) (k + 2)(k + 1) In our special case. Before proceeding to the variance.6275.454 as before. we find Var(Y ) = 425. .

002831 0.000212 Figure 7.014870 0.009730 0. 0. This difference is c(N − k) k(k + 1) and this can lead to quite different results if probabilities are to be estimated.001936 0.000096 0.000831 −0.000053 Negbin 0.000797 −0.000239 −0.001531 0.0007543 0.000552 0.000288 −0.005 NegHyper 30 60 90 y 120 150 Figure 7.004119 0.001266 0. It was noted above that the mean of the negative binomial distribution always exceeds that of the negative hypergeometric distribution.0002209 −0.007683 0. This is true as the values in Table 7.000803 0.000894 −0. Table 7.0012883 0.000295 0.002097 0.7 also shows that the values for the two probability distribution functions are very close.011657 0.015 Probability 0.012412 0.002869 0.001103 0.000265 Neghyper − Negbin 0.0135815 0.000381 −0.000384 0.000171 0.007921 0.000785 0.Negative Binomial Approximation 103 NEGATIVE BINOMIAL APPROXIMATION It might be thought that the negative binomial distribution is a good approximation to the negative hypergeometric distribution.000494 0.000490 −0.7 .005723 0.11 indicate.11 Y 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 Neghyper 0.000896 −0.000609 −0.006305 0.000728 −0.004916 0.000582 −0.009951 0.003763 0.01 NegBin 0.

13 shows the expected waiting time for c special items to occur.6667 32.12 k 10 20 30 40 50 60 70 80 90 100 E(Y ) 91. and c = 10 and the negative binomial. In E(Y ) let N = 1000 and k = 100.4098 14. k = 50.104 Chapter 7 Waiting Time Problems Figure 7.7 shows a comparison between the negative hypergeometric with N = 400. Waiting Time for c Special Items to Occur Now consider a larger case.3580 11. the actual differences are quite negligible as the values in Table 7. calculated from the right-hand tails. THE MEANING OF THE MEAN We have seen that the expected value of Y is given by E(Y ) = c(N + 1) k+1 This has some interesting implications.3725 units. The means differ by 1. Table 7.91089 . we look at the expected number of drawings until the first special item is found. show.4146 19.0000 9. The problem of course is the fact that the negative binomial assumes a constant probability of selecting a special item. Table 7. First Occurrences If we let c = 1 in E(Y ).12 shows some expected waiting times for a lot with N = 1000 and various values of k.0986 12. Table 7.0000 47.2903 24. The graph in Figure 7. Although the graphs appear to be different. while in fact this probability constantly changes with each item selected. one of which we now show.8 shows these results.6275 16.11 .

then E(Y ) is approximately the same percentage of N. suppose that a sample of 100 from a population of 400 shows 5 special items.871 891. if c is some percentage of k. Consider prob[k] = k 5 · 400 − k 95 400 100 .653 693. a result easily seen from the formula for E(Y ) The graph in Figure 7.762 792.980 991. The ratio prob[k = 1]/prob[k] can be simplified to ((k + 1)(305 − k))/((k − 4)(400 − k)). On the basis of a sample. how can we estimate k? To be specific.9 shows this result as well.The Meaning of the Mean 105 80 60 k 40 20 20 40 60 E(Y) 80 100 Figure 7. not surprisingly.05.218 297.545 594.089 .327 396.436 495. a dubious assumption at best. Estimating k Above we have assumed that k is known. Seeing where this is > 1 gives k < 19.8 Note that. Table 7. What is the maximum likelihood estimate of k? This is not a negative hypergeometric situation but a hypergeometric situation.1089 198.13 c 10 20 30 40 50 60 70 80 90 100 E(Y ) 99.

Show that the expected waiting time for the pattern HT with a fair coin is four tosses. Show that if X1 is the waiting time for the first binomial success and X2 is the waiting time for the second binomial success. the unknown number if special items in the population. is k= and further k sp = N s 1+ 1 N − 1 N sp (N + 1) − s s showing that the proportion of the population estimated is very close to the percentage of the special items in the sample. the sample is of size s. It is easy to show that k the maximum likelihood estimator for k. Here we have used the hypergeometric distribution to estimate k. We find exactly the same estimate if we use the negative hypergeometric waiting time distribution. and the sample contains sp special items. not really much of a surprise. especially as the sample size increases. then X1 + X2 has a negative binomial distribution with r = 2. Now suppose the population is of size N. CONCLUSIONS This chapter has contained a wide variety of waiting time problems. both in their mathematical analysis and in their practical applications. .106 Chapter 7 Waiting Time Problems 1000 800 c 600 400 200 20 40 60 E(Y) 80 100 Figure 7. These are often not considered in introductory courses in probability and statistics and yet they offer interesting situations. 2.9 The sample then has 20% special items and that is somewhat greater than the maximum likelihood estimator for the percentage of special items in the population. EXPLORATIONS 1.

one of the jars is found to be empty. Find the waiting times for the second defective item to occure if sampled items are not replaced before the next item is selected. 6.Explorations 107 3. 5. A lot of 50 items contains 5 defective items. Professor Banach has two jars of candy on his desk and when a student visits. What is the probability an odd number of tosses is necessary? (b) If the coin is fair.6. . is thrown until a head appears. At some point. he or she is asked to select a jar and have a piece of candy. (a) Suppose there are 4 people going for lunch as in the text. loaded to come up heads with probability 0. (a) A coin. explain why the probabilities of odd or even numbers of tosses are not equal. Show that the expected number of (group) tosses is 8/3. Suppose X is a negative binomial random variable with p the probability of success at any trial. 8. how many pieces of candy are in the other jar? 7. 4. Two fair coins are tossed and any that comes up heads is put aside. How many lunches could a person who never pays for lunch expect? (b) Repeat part (a) for a person who pays for lunch exactly once. On average. Find the value of p that makes this event most likely to occur. Suppose the rth success occurs at trial t. This is repeated until all the coins have come up heads.

The simplest of these random variables is the uniform random variable. and the Central Limit Theorem. that is. random variables defined on an infinite. Inc. set such as an interval or intervals. Bivariate Random Variables CHAPTER OBJECTIVES: r r r r r r r to study random variables taking on values in a continuous interval or intervals to see how events with probability zero can and do occur to discover the surprising behavior of sums and means to use the normal distribution in a variety of settings to explain why the normal curve is called “normal” to discuss the central limit theorem to introduce bivariate random variables. So far we have considered discrete random variables.Chapter 8 Continuous Probability Distributions: Sums. A Probability and Statistics Companion. uncountable. 108 . Kinney Copyright © 2009 by John Wiley & Sons. John J. We now turn our attention to continuous random variables. random variables defined on a discrete or countably infinite set. that is. the Normal Distribution.

it is as likely to stop at any particular number as at any other. What is P(X = x) now? Again. since we have an infinity of values to use. we can not make P(X = x) very large. what is the probability that the random variable is contained in an interval? To make this meaningful. f (x) = P(X = x)? We suppose the wheel is fair. X.00000000000000000001 = 10−20 The problem with this is that it can be shown that the wheel contains more than 1020 points. but in the question itself. which has the following properties: . Now suppose that the wheel is loaded so that P(X ≥ 1/2) = 3P(X ≥ 1/2). The random variable X can now take on any value in the interval from 0 to 1. What shall we take as the probability distribution function. we define the probability density function. so we have used up more than the total probability of 1! We are forced to conclude that P(X = x) = 0. f (x). If we consider any random variable defined on a continuous interval. what is P(a ≤ x ≤ b)? That is. The arrow is spun and the number the arrow stops at is the random variable. we conclude that P(X = x) = 0.1 1/2 Clearly. So we ask a different question. The wheel is shown in Figure 8. 1 3/4 1/4 Figure 8. making it three times as likely that the arrow ends up in the left-hand half of the wheel as in the right-hand half of the wheel. that is. then P(X = x) will always be 0.Uniform Random Variable 109 UNIFORM RANDOM VARIABLE Suppose we have a spinner on a wheel labeled with all the numbers between 0 and 1 on its circular border. It is curious that we cannot distinguish a loaded wheel from a fair one! The difficulty here lies not in the answer to our question.1. Suppose we let P(X = x) = 0.

P(a ≤ x ≤ b) = b − a. Also. for the fair wheel. f (x) must enclose a total area of 1. we find the area under the curve between 1/3 and 3/4.2 So. What is f (x) for the fair wheel? Since. 2 1.2 0.6 0.3 . consider (among many other choices) the triangular distribution f (x) = 2x. to find P(1/3 ≤ X ≤ 3/4). The area under f (x) between x = a and x = b is P(a ≤ x ≤ b). For the loaded wheel. where P(0 ≤ X ≤ 1/2) = 1/4 and so P(1/2 ≤ X ≤ 1) = 3/4. for example.2. f (x) ≥ 0.3. 2 1. 3. it follows that f (x) = 1. Here we call f (x) a uniform probability density function.5 0. 2. the total probability for the sample space.5 0.110 Chapter 8 Continuous Probability Distributions 1. since probabilities cannot be negative.8 1 Figure 8. This means that f (x) must always be positive. Its graph is shown in Figure 8.2 0. So areas under f (x) are probabilities. This is (3/4 − 1/3) · 1 = 5/12.6 0. 0≤x≤1 The distribution is shown in Figure 8. areas under the curve. probabilities.8 1 Figure 8. Notice that f (x) ≥ 0 and that the total area under f (x) is 1. Since f (x) is quite easy.5 f (x ) 1 0. are also easy to find. The total area under f (x) is 1.4 x 0.5 f (x) 1 0. 0 ≤ x ≤ 1.4 x 0.

For that to occur. 0 ≤ x ≤ 1) and consider spinning the wheel twice and adding the numbers the pointer stops at. and f (5). where X = 3 is a point of symmetry and f (1) = f (5) and f (2) = f (4). a large sum. As an example. If X1 and X2 represent the outcomes on the individual spins. f (4). This is far from a general explanation or proof. f (3). then the expected value of each is E(X1 ) = 1/2 and E(X2 ) = 1/2. It is always true that E(X1 + X2 ) = E(X1 ) + E(X2 ) (a fact we will prove later) and since 1/2 is a point of symmetry. but that is not the case. as we saw in the discrete case in Chapter 5. to get a sum near 2.Sums 111 Areas can be found using triangles and it is easy to see. Similarly. it follows that E(X1 + X2 ) = 1/2 + 1/2 = 1 . both spins must be near 1. here is a fact and an example that may prove convincing. SUMS Now let us return to the fair wheel (where f (x) = 1. since f (1/2) = 1. but this approach can easily be generalized to a more general discrete probability distribution. f (2). Now E(X) = 1 ·f (1) + 2 ·f (2) + 3 ·f (3) + 4 ·f (4) + 5 ·f (5) = 6 ·f (1) + 6 ·f (2)+3 ·f (3) But f (1) + f (2) + f (3) + f (4) + f (5) = 2 · f (1) + 2 · f (2) + f (3) = 1 so 6 · f (1) + 6 · f (2) + 3 · f (3) = 3 the point of symmetry. It is also true for a continuous probability distribution. both numbers must be small. 4. we cannot supply a proof here. So either of these possibilities is unlikely. A Fact About Means If a probability distribution or a probability density function has a point of symmetry. We will continue to use this as a fact. 3. What is the probability density function for the sum? One might think that since we are adding two uniform random variables the sum will also be uniform. that P(0 ≤ X ≤ 1/2) = 1/4. The sums obtained will then be between 0 and 2. 2. then that point is the expected value of the random variable. 5 with probabilities f (1). Consider the probability of getting a small sum. consider the discrete random variable X that assumes the values 1. While we cannot prove this in the continuous case.

namely. We have encountered this curve before. However. The function is always positive.2 0. 1 0. as we shall see.5 2 Figure 8.5 resembles a “bell-shaped” or normal curve that occurs frequently in probability and statistics.112 Chapter 8 Continuous Probability Distributions It is more likely to find a sum near 1 than to find a sum near either of the extreme values. A graph of f2 (x) is shown in Figure 8. it can be shown that the probability density function for X = X1 + X2 +X3 is ⎧ ⎪ 1 x2 ⎪ for 0 ≤ x ≤ 1 ⎪ ⎪2 ⎪ ⎪ ⎨ 3 2 3 f3 (x) = for 1 ≤ x ≤ 2 − x− ⎪4 2 ⎪ ⎪ ⎪1 ⎪ ⎪ ⎩ (x − 3)2 for 2 ≤ x ≤ 3 2 This is also a probability density function.5 1 x 1. so the total area is 2 · 1/2 · 1 · 1 = 1. Each of these have base of length 1 and height 1 as well. The easiest way to find its total area is to use the area of two triangles. although finding the total area. or areas. or finding probabilities by finding areas. for example. We can find here. 0 or 2.6 0. we will soon find a remarkable approximation to this probability density function that will enable us to determine areas and hence probabilities .8 f2(x) 0. It can be shown that the probability distribution of X = X1 + X2 is f2 (x) = x for 0 ≤ x ≤ 1 2 − x for 1 ≤ x ≤ 2 We should check that f2 (x) is a probability density function.5. are now difficult for us to find and we cannot do this by simple geometry. so the sums do in fact cluster around their expected value. so f2 (x) is a probability density function.4. that P(1/2 ≤ x ≤ 5/2) = 23/24. using the area of a triangle. The graph in Figure 8. If we increase the number of spins to 3 and record the sum.4 Since areas represent probabilities. we find that P(1/2 ≤ X ≤ 3/2) = 3/4. Probabilities. is usually done using calculus. A graph of f3 (x) is shown in Figure 8. .4 0.

We will discuss the normal curve first and then state the central limit theorem. This is an illustration of the central limit theorem. “X is distributed normally with mean μ and standard deviation σ”.1 0.4 0.05 −2 2 x 4 6 8 Figure 8. the curve is symmetric about its mean value. for a general normal random variable X.2 0. −∞ ≤ x ≤ ∞ Note that for any normal curve. which is a normal curve with mean 3 and standard deviation 2.3 0. σ 0.5 x 2 2.6 0. For the normal curve 0. we would find that the graphs come closer and closer to a normal curve. NORMAL PROBABILITY DISTRIBUTION The normal probability distribution is a bell-shaped curve that is completely specified by its mean μ and its standard deviation σ.5 1 1. X ∼ N(μ. σ).Normal Probability Distribution 0.25 0. A typical normal distribution is√ shown in Figure 8. Its probability density function is 1 1 2 f (x) = √ e− 2 (x−μ) . This is read. σ 2π −∞ ≤ μ ≤ ∞.15 0. We abbreviate any normal curve by specifying its mean and standard deviation and we write.7 0.1 0.5 If we were to continue adding spins.5 3 113 Figure 8.6 .2 0.5 f3(x) 0.6.

It can be shown. This means that areas under any normal curve can be calculated using a single.315357 Textbooks commonly include a table of the standard normal curve. we calculate several probabilities for a standard N(0. then Z ∼ N(0. P(2 ≤ X ≤ 6) = P 2−3 X−3 6−3 √ = √ = √ 2 2 2 = P (−0. the more of the distribution we enclose. can calculate these areas. 2). Since the function is always positive. normal curve.707 11 ≤ Z ≤ 2. areas under the curve represent probabilities. that is. but with some difficulty. Fact. The use of this standard normal curve to calculate areas under any normal curve is based on this fact. (a) P(−1 ≤ Z ≤ 1) = 0.743303 (b) P(4 ≤ X ≤ 8) = 0. we do not need to include such a table in this book. Facts About Normal Curves To show that the standard deviation does in fact measure the dispersion of the distribution. using our example (a) above. Since computers and calculators compute areas under the standard normal curve. standard. 1) distribution. There are. Here are some examples from this normal curve. For example.239547 P(X > 4 and X > 2) P(X > 4) 0.743303 as before. . N(0. These cannot be calculated easily either.114 Chapter 8 Continuous Probability Distributions √ in Figure 8. we write X ∼ N(3. however.9973 So the more standard deviations we use. 1). Statistical calculators.23975 (c) P(X > 4|X > 2) = = = P(X > 2) P(X > 2) 0.6.76025 = 0. (a) P(2 ≤ X ≤ 6) = 0. The fact that any normal curve can be transformed into a standard normal curve is a remarkable fact (and a fact not true for the other probability density functions we will encounter). that the total area of any normal curve is 1. If X ∼ N(μ. however. many other uses of the standard normal curve and we will meet these when we study other statistical topics. 121 3) = 0. 1).6827 (b) P(−2 ≤ Z ≤ 2) = 0.9545 (c) P(−3 ≤ Z ≤ 3) = 0. σ) and if Z = (X − μ)/σ.

The table is to be read this way: X can take on the values 1. One of the reasons for this. Also we can calculate the percentage of the population whose IQ values are greater than 140. So. This is called the joint probability distribution function. 000 in the United States. we see that P(90 ≤ IQ ≤ 110) = 0. 000.9968 = 0. in a population of about 300. However. this gives about 1. as P(IQ ≥ 140) = 1 − P(Z ≤ 4) = 1 − 0. Now we must determine the probabilities that the random variables assume values together. using the facts above and assuming Z = (IQ−100)/10. 000. we turn our attention to bivariate random variables. a rarity. but not the only one. We had an indication of this when we added uniformly distributed random variables.1 X Y 1 2 1 1/12 1/3 2 1/12 1/12 3 1/3 1/12 . on the sample points. This is a consequence of the central limit theorem. which we will discuss subsequently.1 IQ Scores A standard test for the intelligence quotient (IQ) produces scores that are approximately normally distributed with mean 100 and standard deviation 10. 000 people with this IQ or greater. We need not show the sample space. There are many other applications for the normal distribution. is that sums of different random variables or means of several random variables become normal. The entries in the body of the table Table 8.9973.9545. P(80 ≤ IQ ≤ 120) = 0. 2.6827. and P(70 ≤ IQ ≤ 130) = 0.1.Bivariate Random Variables 115 EXAMPLE 8. We have been concerned with a single random variable. but the probabilities with which the random variables take on their respective values together are shown in Table 8. and 3 while the random variable Y can take on the values 1 and 2. So we turn our attention to different random variables and first consider two different random variables. But before we do that. but first note that the observations we want to add are those from different observations and so are different random variables. Suppose we have a sample space and we have defined two random variables. BIVARIATE RANDOM VARIABLES We want to look at sums when we add the observations from the spinning wheel. which we will call X and Y.0032.

This random variable then has a mean value. the probability that X + Y = 5 is 1/12. So P(X + Y = 3) = P(X = 1 and Y = 2) + P(X = 2 and Y = 1) = 1/3 + 1/12 = 5/12.1 for which X = 3 is 1/3 + 1/12 = 5/12. We find that 1 5 5 1 7 E(X + Y ) = 2 · +3· +4· +5· = 12 12 12 12 2 How does this value relate to the expected values of the variables X and Y taken separately? First. It is easy to check that there are two mutually exclusive ways for X + Y to be 4 and this has probability 5/12. This random variable can take on the values 2. X=1 and Y =2 or X=2 and Y = 1.1 for which X = 1. There are two mutually exclusive ways for X + Y to be 3. So the probability distribution for the random variable X alone can be found by adding up the entries in the columns. There is only one way for X + Y to be 2. the probability distribution for the random variable Y alone can be found by adding up the probabilities in the rows of Table 8. . In a similar way. This means that the random variable X + Y has the following probability distribution function: ⎧ ⎪ 1/12 if x + y = 2 ⎪ ⎪ ⎪ ⎨ 5/12 if x + y = 3 f (x + y) = ⎪ 5/12 if x + y = 4 ⎪ ⎪ ⎪ ⎩ 1/12 if x + y = 5 where x and y denote values of the random variables X and Y. namely. Finally. is the probability that X = 1? We know that P(X = 1 and Y = 1) = 1/12 and P(X = 1 and Y = 2) = 1/3.116 Chapter 8 Continuous Probability Distributions are the probabilities that X and Y assume their values simultaneously. So the probability that X + Y = 2 is 1/12 and we write P(X + Y = 2) = 1/12. P(X = 2) = P(X = 2 and Y = 1) + P(X = 2 and Y = 2) = 1/12 + 1/12 = 1/6. we must find the probability distribution functions of the variables alone. each of the variables must be 1. namely. P(X = 1 and Y = 2) = 1/3 and P(X = 3 and Y = 2) = 1/12 Note that the probabilities in the table add up to 1 as they should. Now consider the random variable X + Y.1 for which X = 2 Finally.1. For example. These probabilities add up to 1 as they should. for example. which is the sum of the probabilities in the column of Table 8. summing the probabilities in Table 8. So P(X = 1) = 1/12 + 1/3 = 5/12 Notice that this is the sum of the probabilities in the column in Table 8. or 5. 3. What. 4. These events are mutually exclusive and are the only events for which X = 1.

P(Y = 1) = P(X = 1 and Y = 1) + P(X = 2 and Y = 1) + P(X = 3 and Y = 1) = 1/12 + 1/12 + 1/3 = 1/2 In a similar way. P(X = x. . Table 8.Bivariate Random Variables 117 Specifically.2 X Y 1 2 f (x) 1 1/12 1/3 5/12 2 1/12 1/12 1/6 3 1/3 1/12 5/12 g(y) 1/2 1/2 1 Note that where the sums are over all the values of X and Y. Y = y) = P(X = x) = f (x) y These random variables also have expected values.1 to show these marginal distributions in Table 8. We then found the following probability distributions for the individual variables: ⎧ ⎪ 5/12 if x = 1 ⎨ if x = 2 f (x) = 1/6 ⎪ ⎩ 5/12 if x = 3 and g(y) = 1/2 1/2 if y = 1 if y = 2 These distributions occur in the margins of the table and are called marginal distributions. we find P(Y = 2) = 1/2. We find E(X) = 1 · 5 1 5 +2· +3· =2 12 6 12 1 1 3 E(Y ) = 1 · + 2 · = 2 2 2 7 2 Now we note that E(X + Y ) = =2+ 3 2 = E(X) + E(Y ). Y = y) = P(Y = y) = g(y) x and P(X = x.2. We have expanded Table 8.

Now consider an example where X and Y are independent of each other. Y = y) + P(X = x. in fact. if X = Y or if X = Y − 4. Y = y) = x xf (x) + y y g(y) = E(X) + E(Y ) This is easily extended to any number of random variables: E(X + Y + Z + · · · ) = E(X) + E(Y ) + E(Z) + · · · When more than one random variable is defined on the same sample space. This will be dealt with when we consider the subject of regression later. they may be totally independent of each other.118 Chapter 8 Continuous Probability Distributions This is not a peculiarity of this special case. Y = y) xP(X = x. the variables are called correlated. We show another joint probability distribution function in Table 8. It is important to emphasize that E(X + Y + Z + · · · ) = E(X) + E(Y ) + E(Z) + · · · no matter what the relationships are between the several variables.3 X y 1 2 f (x) 1 5/24 5/24 5/12 2 1/12 1/12 1/6 3 5/24 5/24 5/12 g(y) 1/2 1/2 1 . Y = y) + x y x y = = x yP(X = x. or they may be partially dependent on each other. true for any two variables X and Y. they may be related in several ways: they may be totally dependent as. Note in the example we have been considering that P(X = 1 and Y = 1) = 1/12 = P(X = 1) · P(Y = 1) = 5/12 · 1/2 = 5/24 / so X and Y are not independent. E(X + Y ) = x y (x + y)P(X = x. for example. Y = y) y y x x y P(X = x.3. Here is a proof. In the latter case. but is. since no condition was used in the proof above. Table 8.

P(X = 1 and Y = 1) = 5/24 = P(X = 1) · P(Y = 1) = 1/12 · 1/2 The other entries in the table can be checked similarly.· · · are mutually independent in pairs. in the margins of the table. for example. Variance The variance of a random variable. This is true of the general case. f (x) and g(y). Now consider the random variable X · Y and in particular its expected value. Now E(X − μ)2 = E(X2 − 2μX + μ2 ) = E(X2 ) − 2μE(X) + E(μ2 ) . Y = y) = P(X = x)P(Y = y) in each case.Bivariate Random Variables 119 Note that P(X = x. We can calculate.Z. This can be extended to any number of random variables: if X.Y . Using the fact that X and Y are independent. we sum the values of X · Y multiplied by their probabilities to find the expected value of the product of X and Y : E(X · Y ) = 1 · 1 · 1 2 5 1 1 5 +1·2· · +1·3· 12 6 2 12 1 1 5 1 · +2·3· · 6 2 12 2 · · 1 5 +2·1· 2 12 · 1 2 +2 · 2 · =3 but we also see that this can be written as 5 1 5 E(X · Y ) = 1 · +2· +3· 12 6 12 =2· · 1· 1 1 +2· 2 2 3 = 3 = E(X)E(Y ) 2 and we see that the quantities in parentheses are E(X) and E(Y ). if X and Y are independent.is Var(X) = E(X − μ)2 where μ = E(X). that is. X. respectively. then E(X · Y · Z · · · · ) = E(X) · E(Y ) · E(Z) · · · · Although the examples given here involve discrete random variables. Here we have shown the marginal distributions of the random variables X and Y . This definition holds for both discrete and continuous random variables. then E(X · Y ) = E(X) · E(Y ) A proof can be fashioned by generalizing the example above. so the random variables are independent. the results concerning the expected values are true for continuous random variables as well.


Variance

The variance of a random variable X is

Var(X) = E(X − μ)²

where μ = E(X). This definition holds for both discrete and continuous random variables. Now

E(X − μ)² = E(X² − 2μX + μ²) = E(X²) − 2μE(X) + E(μ²)

and since μ is a constant, E(X) = μ and E(μ²) = μ², so

E(X − μ)² = E(X²) − 2μ² + μ² = E(X²) − μ²

The variance is a measure of the dispersion of a random variable, as we have seen with the normal random variable, but, as we will show when we study the design of experiments, it can often be partitioned into parts that explain the source of the variation in experimental results.

We conclude our consideration of expected values with a result concerning the variance of a sum of independent random variables. By definition,

Var(X + Y) = E[(X + Y) − (μ_x + μ_y)]² = E[(X − μ_x) + (Y − μ_y)]²
           = E(X − μ_x)² + 2E[(X − μ_x)(Y − μ_y)] + E(Y − μ_y)²

Consider the middle term. Since X and Y are independent,

E[(X − μ_x)(Y − μ_y)] = E(X · Y − μ_x · Y − X · μ_y + μ_x · μ_y)
                      = E(X · Y) − μ_x · E(Y) − E(X) · μ_y + μ_x · μ_y
                      = E(X) · E(Y) − μ_x · μ_y − μ_x · μ_y + μ_x · μ_y
                      = μ_x · μ_y − μ_x · μ_y = 0

So

Var(X + Y) = E(X − μ_x)² + E(Y − μ_y)² = Var(X) + Var(Y)

Note that this result depends heavily on the independence of the random variables. As an example, consider the joint distribution given in Table 8.3 and repeated here in Table 8.4.
Table 8.4
                Y
  X          1        2        f(x)
  1        5/24     5/24      5/12
  2        1/12     1/12      1/6
  3        5/24     5/24      5/12
  g(y)     1/2      1/2       1


We find that

E(X + Y) = 2 · 5/24 + 3 · 7/24 + 4 · 7/24 + 5 · 5/24 = 84/24 = 7/2

So

Var(X + Y) = (2 − 7/2)² · 5/24 + (3 − 7/2)² · 7/24 + (4 − 7/2)² · 7/24 + (5 − 7/2)² · 5/24 = 13/12

But

E(X) = 1 · 5/12 + 2 · 1/6 + 3 · 5/12 = 2

and

E(Y) = 1 · 1/2 + 2 · 1/2 = 3/2

so

Var(X) = (1 − 2)² · 5/12 + (2 − 2)² · 1/6 + (3 − 2)² · 5/12 = 5/6

and

Var(Y) = (1 − 3/2)² · 1/2 + (2 − 3/2)² · 1/2 = 1/4

and so

Var(X) + Var(Y) = 5/6 + 1/4 = 13/12 = Var(X + Y)

We could also calculate the variance of the sum by using the formula

Var(X + Y) = E[(X + Y)²] − [E(X + Y)]²

Here

E[(X + Y)²] = 2² · 5/24 + 3² · 7/24 + 4² · 7/24 + 5² · 5/24 = 40/3

We previously calculated E(X + Y) = 7/2, so we find

Var(X + Y) = 40/3 − (7/2)² = 13/12

as before. Now we can return to the spinning wheel.

CENTRAL LIMIT THEOREM: SUMS
We have seen previously that the bell-shaped curve arises when discrete random variables are added together. Now we look at continuous random variables. Suppose


that we have n independent spins of the fair wheel, denoted by Xi, and we let the random variable X denote the sum so that

X = X1 + X2 + X3 + · · · + Xn

For the individual observations, we know that the expected value is E(Xi) = 1/2 and the variance can be shown to be Var(Xi) = 1/12. In addition, we know that

E(X) = E(X1 + X2 + X3 + · · · + Xn) = E(X1) + E(X2) + E(X3) + · · · + E(Xn) = 1/2 + 1/2 + 1/2 + · · · + 1/2 = n/2

and since the spins are independent,

Var(X) = Var(X1 + X2 + X3 + · · · + Xn) = Var(X1) + Var(X2) + Var(X3) + · · · + Var(Xn) = 1/12 + 1/12 + 1/12 + · · · + 1/12 = n/12

The central limit theorem states that in this case X has, approximately, a normal probability distribution with mean n/2 and standard deviation √(n/12). We abbreviate this by writing X ∼ N(n/2, √(n/12)). If n = 3, this becomes X ∼ N(3/2, 1/2).

The value of n in the central limit theorem, as we have already stated, need not be very large. To show how close the approximation is, we show the graph of N(3/2, 1/2) in Figure 8.7 and then, in Figure 8.8, the graphs of N(3/2, 1/2) and f3(x) superimposed. We previously calculated, using f3(x), that P(1/2 ≤ X ≤ 5/2) = 23/24 = 0.95833. Using the normal curve with mean 3/2 and standard deviation 1/2, we find the

Figure 8.7

Figure 8.8

approximation to this probability to be 0.95450. As the number of spins increases, the approximation becomes better and better.
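For readers who want to experiment, here is a brief Python simulation (added for illustration; it is not part of the original text) of the sum of three spins of the fair wheel. It compares the simulated probability with the exact value 23/24 and with the normal approximation N(3/2, 1/2).

    # Simulate the sum of three spins of the fair wheel (three Uniform(0,1) values).
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    sums = rng.uniform(0, 1, size=(100_000, 3)).sum(axis=1)

    simulated = np.mean((sums >= 0.5) & (sums <= 2.5))
    normal_approx = norm.cdf(2.5, loc=1.5, scale=0.5) - norm.cdf(0.5, loc=1.5, scale=0.5)

    print(simulated)       # close to 23/24 = 0.95833
    print(normal_approx)   # 0.95450, as in the text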

CENTRAL LIMIT THEOREM: MEANS
Understanding the central limit theorem is absolutely essential to understanding the material on statistical inference that follows in this book. In many instances, we know the mean from a random sample and we wish to make some inference, or draw some conclusion, about the mean of the population from which the sample was selected. So we turn our attention to means.

First, suppose X is some random variable and k is a constant. Then, supposing that X is a discrete random variable,

E(X/k) = Σ_S (x/k) · P(X = x) = (1/k) Σ_S x · P(X = x) = (1/k) E(X)

Here the summation is over all the values in the sample space, S. Therefore, if the variable is divided by a constant, so is the expected value. Now, denoting E(X) by μ,

Var(X/k) = Σ_S (x/k − μ/k)² · P(X = x) = (1/k²) Σ_S (x − μ)² · P(X = x) = (1/k²) Var(X)

The divisor, k, this time reduces the variance by its square.


CENTRAL LIMIT THEOREM
If X̄ denotes the mean of a sample of size n from a probability density function with mean μ and standard deviation σ, then X̄ ∼ N(μ, σ/√n). This theorem is the basis for much of statistical inference, our ability to draw conclusions from samples and experimental data. Statistical inference is the subject of the next two chapters.
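A quick simulation makes the theorem concrete. The Python sketch below (an added illustration, not part of the original text) draws many samples of size 30 from a decidedly non-normal population, an exponential distribution with mean 1 and standard deviation 1, and checks that the sample means behave like the theorem predicts.

    # Sampling distribution of the mean for samples of size n = 30
    # from an exponential population (mu = 1, sigma = 1).
    import numpy as np

    rng = np.random.default_rng(2)
    n, reps = 30, 20_000
    means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

    print(means.mean())        # close to mu = 1
    print(means.std(ddof=1))   # close to sigma / sqrt(n) = 1/sqrt(30) = 0.1826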

EXPECTED VALUES AND BIVARIATE RANDOM VARIABLES
We now expand our knowledge of bivariate random variables. But before we can discuss the distribution of sample means, we pause to consider the calculation of means and variances of means.

Means and Variances of Means
We will now discuss the distribution of sample means. Suppose, as before, that we have a sum of independent random variables so that X = X1 + X2 + X3 + · · · + Xn. Suppose also that for each of these random variables, E(Xi) = μ and Var(Xi) = σ². The mean of these random variables is

X̄ = (X1 + X2 + X3 + · · · + Xn)/n

Using the facts we just established, we find that

E(X̄) = E(X1 + X2 + X3 + · · · + Xn)/n = [E(X1) + E(X2) + E(X3) + · · · + E(Xn)]/n = nμ/n = μ

So the expected value of the mean of a number of random variables with the same mean is the mean of the individual random variables. We also find that

Var(X̄) = Var(X1 + X2 + X3 + · · · + Xn)/n² = [Var(X1) + Var(X2) + Var(X3) + · · · + Var(Xn)]/n² = nσ²/n² = σ²/n

While the mean of the sample means is the mean of the distribution of the individual Xi's, the variance is reduced by a factor of n. This shows that the larger the sample size, the smaller the Var(X̄). This has important implications for sampling. We show some examples.

EXAMPLE 8.2  Means of Random Variables

(a) The probability that an individual observation from the uniform random variable with f(x) = 1 for 0 ≤ x ≤ 1 is between 1/3 and 2/3 is (2/3 − 1/3) = 1/3. What is the probability that the mean of a sample of 12 observations from this distribution is between 1/3 and 2/3?

For the uniform random variable, μ = 1/2 and σ² = 1/12, so, using the central limit theorem for means, we know that E(X̄) = 1/2 and Var(X̄) = Var(X)/n = (1/12)/12 = 1/144. This means that the standard deviation of X̄ is 1/12. The central limit theorem for means then states that X̄ ∼ N(1/2, 1/12). Using a statistical calculator, we find

P(1/3 ≤ X̄ ≤ 2/3) = 0.9545

So, while an individual observation falls between 1/3 and 2/3 only about 1/3 of the time, the sample mean is almost certain to do so.

(b) How large a sample must be selected from a population with mean 10 and standard deviation 2 so that the probability that the sample mean is within 1 unit of the population mean is 0.95?

Let the sample size be n. We know that X̄ ∼ N(10, 2/√n). We want n so that P(9 ≤ X̄ ≤ 11) = 0.95. We let Z = (X̄ − 10)/(2/√n). So we have

P((9 − 10)/(2/√n) ≤ (X̄ − 10)/(2/√n) ≤ (11 − 10)/(2/√n)) = 0.95

or

P(−1/(2/√n) ≤ Z ≤ 1/(2/√n)) = 0.95

But we know, for a standard normal variable, that P(−1.96 ≤ Z ≤ 1.96) = 0.95, so we conclude that

1/(2/√n) = 1.96, so √n/2 = 1.96 or √n = 2 · 1.96 = 3.92

so

n = (3.92)² = 15.366

meaning that a sample of 16 is necessary. If we were to round down to 15, the probability would be less than 0.95.
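Both parts of Example 8.2 are easy to reproduce with software. The Python sketch below (an added illustration using the scipy library rather than the statistical calculator mentioned in the text) evaluates the probability in part (a) and the sample size in part (b).

    import math
    from scipy.stats import norm

    # Part (a): mean of 12 uniform observations, X-bar ~ N(1/2, 1/12)
    p = norm.cdf(2/3, loc=0.5, scale=1/12) - norm.cdf(1/3, loc=0.5, scale=1/12)
    print(p)                      # 0.9545

    # Part (b): smallest n with P(9 <= X-bar <= 11) = 0.95 when sigma = 2
    z = norm.ppf(0.975)           # 1.96
    n = (z * 2 / 1) ** 2          # (z * sigma / margin)^2
    print(n, math.ceil(n))        # 15.37, so a sample of 16 is needed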

A NOTE ON THE UNIFORM DISTRIBUTION

We have stated that for the uniform distribution, E(X) = 1/2 and Var(X) = 1/12, but we have not proved this. The reason is that these calculations for a continuous random variable require calculus, and that is not a prerequisite here. We give an example that may be convincing.

Suppose we have a discrete random variable, X, where P(X = x) = 1/100 for x = 0.01, 0.02, · · · , 1.00. This is a discrete approximation to the continuous uniform distribution. We will need the following formulas:

1 + 2 + · · · + n = Σ_{i=1}^{n} i = n(n + 1)/2

and

1² + 2² + · · · + n² = Σ_{i=1}^{n} i² = n(n + 1)(2n + 1)/6

We know that for a discrete random variable,

E(X) = Σ_S x · P(X = x)

which becomes in this case

E(X) = Σ_{x=0.01}^{1.00} x · (1/100) = (1/100) · Σ_{x=0.01}^{1.00} x

Now assuming x = i/100 allows the variable i to assume integer values:

E(X) = (1/100) · Σ_{i=1}^{100} (i/100) = (1/100)² · Σ_{i=1}^{100} i = (1/100)² · (100)(101)/2 = 0.505

To calculate the variance, we first find E(X²):

E(X²) = Σ_{x=0.01}^{1.00} x² · P(X = x) = (1/100) · Σ_{x=0.01}^{1.00} x²

and again assuming x = i/100,

E(X²) = (1/100) · Σ_{i=1}^{100} (i/100)² = (1/100)³ · Σ_{i=1}^{100} i² = (1/100)³ · (100)(101)(201)/6 = 0.33835

Now it follows that

Var(X) = E(X²) − [E(X)]² = 0.33835 − (0.505)² = 0.083325 ≈ 1/12

If we were to increase the number of subdivisions in the interval from 0 to 1 to 10,000 (with each point then having probability 0.0001), we find that E(X) = 0.50005 and Var(X) = 0.08333333325. This may offer some indication, without proof, that E(X) = 1/2 and Var(X) = 1/12 for the continuous uniform random variable.

We conclude this chapter with an approximation of a probability from a probability density function.

EXAMPLE 8.3  Areas Without Calculus

Here is a method by which probabilities (areas under a continuous probability density function) can be approximated without using calculus. Suppose we have part of a probability density function that is part of the parabola

f(x) = (1/2)x²,   0 ≤ x ≤ 1

and we wish to approximate P(0 ≤ x ≤ 1). The graph of this function is shown in Figure 8.9.

Figure 8.9

We will approximate the probability, or the area under the curve, A, by a series of rectangles, each of width 0.1, with right-hand ends at the points 0.1, 0.2, · · · , 0.9, 1. Two of the approximating rectangles are shown in Figure 8.9. We use the height of the curve at the midpoints of these rectangles, the points 0.05, 0.15, · · · , 0.95, and use the total area of these rectangles as an approximation to the area under the curve. This gives us

A = (0.1) · (1/2) · (0.05)² + (0.1) · (1/2) · (0.15)² + (0.1) · (1/2) · (0.25)² + · · · + (0.1) · (1/2) · (0.95)² = 0.16625

which is not a bad approximation to the actual area. The exact value of this probability is 1/6 = 0.1666 · · ·. Increasing the number of rectangles only improves the approximation. This is a common technique in calculus, but the approximation here is surprisingly good.
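The rectangle sum in Example 8.3 is also easy to program. The short Python sketch below (added for illustration; it is not part of the original text) reproduces the midpoint-rectangle approximation, which can then be compared with the exact area 1/6.

    # Midpoint-rectangle approximation to the area under f(x) = x**2 / 2 on [0, 1].
    width = 0.1
    midpoints = [0.05 + width * k for k in range(10)]   # 0.05, 0.15, ..., 0.95

    area = sum(width * (m ** 2) / 2 for m in midpoints)
    print(area)    # 0.16625, compared with the exact value 1/6 = 0.1667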

CONCLUSIONS

This chapter introduces continuous random variables, that is, random variables that can assume values in an interval or intervals. We have found some surprising things when we added independent random variables, producing the normal probability distribution. We then stated the central limit theorem, a basic result when we study statistical inference in succeeding chapters. We discussed some methods for approximating continuous distributions by discrete distributions, adding some credence to the statements we have made without proof for continuous distributions.

EXPLORATIONS

1. Mathematics scores on the Scholastic Aptitude Test (SAT) are normally distributed with mean 500 and standard deviation 100.
(a) Find the probability that an individual's score exceeds 620.
(b) Find the probability that an individual's score exceeds 620, given that the individual's score exceeds 500.
(c) What score, or greater, can we expect to occur with probability 0.90?

2. A manufactured part is useful only if a measurement is between 0.25 and 0.38 in. The measurements follow a normal distribution with mean 0.30 in. and standard deviation 0.03 in.
(a) What proportion of the parts meet specifications?
(b) Suppose the mean measurement could be changed, but the standard deviation cannot be changed. To what value should the mean be changed to maximize the proportion of parts that meet specifications?

3. Upper and lower warning limits are often set for manufactured products. If X ∼ N(μ, σ), these are commonly set at μ ± 1.96σ. If the mean of the process increases by one standard deviation, what effect does this have on the proportion of parts outside the warning limits?

4. The maximum weight an elevator can carry is 1600 lb. If the weights of individuals using the elevator are N(150, 10), what is the probability that the elevator will be overloaded?

5. A continuous probability distribution is defined as
f(x) = x, 0 < x < 1;   k, 1 < x < 2;   k(3 − x), 2 < x < 3
Find k.

6. Suppose f(x) = x, 0 ≤ x ≤ 1, and f(x) = 2 − x, 1 ≤ x ≤ 2. (X is the sum of two uniformly distributed random variables.)
(a) Find P(1/2 ≤ X ≤ 3/4).
(b) What is the probability that at least two of three independent observations are greater than 1/2?

7. The joint probability distribution for random variables X and Y is given by f(x, y) = k for x = 0, 1, 2, · · · and y = 0, 1, 2, · · · , 3 − x.
(a) Find k.
(b) Find the marginal distributions for X and Y.

8. Use the central limit theorem to approximate the probability that the sum is 34 when 12 dice are thrown.

Chapter 9

Statistical Inference I

CHAPTER OBJECTIVES:
• to study statistical inference: how conclusions can be drawn from samples
• to study both point and interval estimation
• to learn about two types of errors in hypothesis testing
• to study operating characteristic curves and the influence of sample size on our conclusions

We now often encounter the results of sample surveys. We read about political polls on how we feel about various issues; television networks and newspapers conduct surveys about the popularity of politicians and how elections are likely to turn out. These surveys are normally performed with what might be thought of as a relatively small sample of the population being surveyed. These sample sizes are in reality quite adequate for drawing inferences or conclusions; it all depends on how accurate one wishes the survey to be.

How is it that a sample from a population can give us information about that population? After all, some samples may be quite typical of the population from which they are chosen, while other samples may be very unrepresentative of the population from which they are chosen. We explore some ideas here and explain some of the basis for statistical inference: the process of drawing conclusions from samples. This depends heavily on the theory of probability that we have developed. Statistical inference is usually divided into two parts: estimation and hypothesis testing. We consider each of these topics now.

ESTIMATION

Assuming we do not already know the answer, suppose we wish to guess the age of your favorite teacher. Most people would give an exact response: 52, 61, or 48, and so on. These are estimates of the age and are called point estimates, since they are exact. We are unlikely to estimate the correct age exactly, so a point estimate may not be a good response. How else can one respond to the question? Perhaps a better response is, "I think the teacher is between 45 and 60." We might feel, in some sense, that the interval might have a better chance of being correct, that is, of containing the true age, than a point estimate. Now suppose it is very important that our response be correct. The response in our example is probably based on observation and does not involve a random sample. We now turn to the situation where we have a random sample and wish to create an interval estimate. Such intervals are called confidence intervals.

CONFIDENCE INTERVALS

EXAMPLE 9.1  A Confidence Interval

Consider a normal random variable X, whose variance σ² is known but whose mean μ is unknown. The central limit theorem tells us that the random variable representing the mean of a sample of n random observations, X̄, has a N(μ, σ/√n) distribution. So one could say that

P(−1.96 ≤ (X̄ − μ)/(σ/√n) ≤ 1.96) = 0.95

But we must be very careful in interpreting this. We could have chosen many other true statements such as

P(−1.645 ≤ (X̄ − μ)/(σ/√n) ≤ 1.645) = 0.90

or

P(−1.282 ≤ (X̄ − μ)/(σ/√n) ≤ 1.282) = 0.80

There is, of course, an infinite number of choices for this normal random variable. The probabilities in the statements above are called confidence coefficients, and there is an infinity of choices for the confidence coefficient. Once a confidence coefficient is selected and a symmetric interval is decided, the normal z-values can be found by using a computer or hand-held

calculator or from published tables. Now rearrange the inequality in the statement

P(−1.645 ≤ (X̄ − μ)/(σ/√n) ≤ 1.645) = 0.90

to read

−1.645 ≤ (X̄ − μ)/(σ/√n) ≤ 1.645

We see further that

−1.645 · σ/√n ≤ X̄ − μ ≤ 1.645 · σ/√n

and

−1.645 · σ/√n − X̄ ≤ −μ ≤ 1.645 · σ/√n − X̄

Now if we solve for μ and rearrange the inequalities, we find that

X̄ − 1.645 · σ/√n ≤ μ ≤ X̄ + 1.645 · σ/√n

Note now that the end points of the inequality above are both known, since X̄ and n are known from the sample and we presume that σ is known. The interval we calculated above is called a 90% confidence interval. What meaning, then, are we to give to the 0.90 with which we began? We interpret the final result in this way: 90% of all possible intervals calculated will contain the unknown, constant value μ, and 10% of all possible intervals calculated will not contain the unknown, constant value μ. The reason for this is that the statement

P(−1.645 ≤ (X̄ − μ)/(σ/√n) ≤ 1.645) = 0.90

is a statement about a random variable and is a legitimate probability statement. The result

X̄ − 1.645 · σ/√n ≤ μ ≤ X̄ + 1.645 · σ/√n

however, is not a probability statement about μ. Although μ is unknown, it is a constant, and so it is either in the interval from X̄ − 1.645 · σ/√n to X̄ + 1.645 · σ/√n or it is not. So the probability that μ is in the interval is either 0 or 1. The probability statements are all based on the fact that (X̄ − μ)/(σ/√n) is a legitimate random variable that will vary from sample to sample.

EXAMPLE 9.2  Several Confidence Intervals

To illustrate the ideas in Example 9.1, we drew 20 samples of size 3 from a normal distribution with mean 2 and standard deviation 2. The sample means are then normally distributed with mean 2 and standard deviation 2/√3 = 1.1547. A graph of the 20 confidence intervals generated is shown in Figure 9.1. As it happened, exactly 19 of the 20 confidence intervals contain the mean, indicated by the vertical line. This occurrence is not always to be expected.

Figure 9.1  Confidence intervals.
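A simulation along the lines of Example 9.2 takes only a few lines of code. The Python sketch below (an added illustration, not part of the original text) draws 20 samples of size 3 from a N(2, 2) population, computes a 90% confidence interval from each, and counts how many intervals contain the true mean.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(3)
    mu, sigma, n = 2.0, 2.0, 3
    z = norm.ppf(0.95)                 # 1.645 for 90% intervals

    covered = 0
    for _ in range(20):
        xbar = rng.normal(mu, sigma, size=n).mean()
        lower = xbar - z * sigma / np.sqrt(n)
        upper = xbar + z * sigma / np.sqrt(n)
        covered += (lower <= mu <= upper)

    print(covered)   # on average about 18 of the 20 intervals contain mu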

We have considered the estimation of some unknown mean μ by using a random sample and constructing a confidence interval. We will consider other confidence intervals subsequently, but now we turn to the other major part of statistical inference, hypothesis testing.

HYPOTHESIS TESTING

If we choose a random sample from a probability distribution and calculate the sample mean, say X̄ = 32, can we confidently believe that the sample was chosen from a population with μ = 26? The answer of course lies in both the variability of X̄ and the confidence we need to place in our conclusion. Two examples will be discussed here, the first from a discrete distribution and the second from a continuous distribution. We confess now that each of the examples is somewhat artificial and is used primarily to introduce some ideas so that they can be brought out clearly. It is crucial that these ideas be clearly understood before we proceed to more realistic problems.

EXAMPLE 9.3  Germinating Bulbs

A horticulturist is experimenting with an altered bulb for a large plant. From previous experience, she knows that the percentage of these bulbs that germinate is either 50% or 75%. To decide which germination rate is correct, she plans an experiment involving 15 of these altered bulbs and records the number of bulbs that germinate. We assume that the number of bulbs that germinate follows a binomial model; that is, a bulb either germinates or it does not, the bulbs behave independently, and the probability of germination is constant. If in fact the probability is 50% that a bulb germinates and if X is the random variable denoting the number of bulbs that germinate, then

P(X = x) = (15 choose x) (0.50)^x (0.50)^(15−x)   for x = 0, 1, 2, · · · , 15

while if the probability is 75% that a bulb germinates, then

P(X = x) = (15 choose x) (0.75)^x (0.25)^(15−x)   for x = 0, 1, 2, · · · , 15

We should first consider the probabilities of all the possible outcomes from the experiment. These are shown in Table 9.1.

Table 9.1  Probabilities for Example 9.3
  x      p = 0.50    p = 0.75
  0      0.0000      0.0000
  1      0.0004      0.0000
  2      0.0032      0.0000
  3      0.0139      0.0000
  4      0.0417      0.0001
  5      0.0916      0.0007
  6      0.1527      0.0034
  7      0.1964      0.0131
  8      0.1964      0.0393
  9      0.1527      0.0917
  10     0.0916      0.1651
  11     0.0417      0.2252
  12     0.0139      0.2252
  13     0.0032      0.1559
  14     0.0004      0.0668
  15     0.0000      0.0134

The statements that 50% of the bulbs germinate or that 75% of the bulbs germinate are called hypotheses. They are conjectures about the behavior of the bulbs. We will formalize these hypotheses as

H0: p = 0.50
Ha: p = 0.75

We have called H0: p = 0.50 the null hypothesis and Ha: p = 0.75 the alternative hypothesis. Now we must decide between them. If we decide that the null hypothesis is correct, then we accept the null hypothesis and reject the alternative hypothesis; on the contrary, if we reject the null hypothesis, then we accept the alternative hypothesis. How should we decide? The decision process is called hypothesis testing.

In coming to this test, we would certainly look at the number of bulbs that germinate. If in fact 75% of the bulbs germinate, then we would expect a large number of the bulbs to germinate. It would appear, if a large number of bulbs germinate, say 11 or more, that we would then reject the null hypothesis (that p = 0.50) and accept the alternative hypothesis (that p = 0.75). In this case, we cannot reach a decision with certainty because our conclusion is based on a sample, and a small one at that in this instance. What are the risks involved? There are two risks or errors that we can make: we could reject the null hypothesis when it is actually true, or we could accept the null hypothesis when it is false. Let us consider each of these.

Rejecting the null hypothesis when it is true is called a Type I error. In this case, we reject the null hypothesis when the number of germinating bulbs is 11 or more. Accepting the null hypothesis when it is false is called a Type II error. In this case, we accept the null hypothesis when the number of germinating bulbs is 10 or less. The values of X that cause us to reject H0 comprise what we call the critical region for the test. The experiment will always result in some value of X, so we must decide in advance which values of X cause us to accept the null hypothesis and which values of X cause us to reject the null hypothesis. In this case, large values of X are more likely to come from a distribution with p = 0.75 than from a distribution with p = 0.50, so it is reasonable to conclude that if X ≥ 11, then p = 0.75. We have used the critical region X ≥ 11 here.

The probability a Type I error occurs when the null hypothesis is true is

P(Type I error) = 0.0417 + 0.0139 + 0.0032 + 0.0004 + 0.0000 = 0.0592

So about 6% of the time, bulbs with a germination rate of 50% will behave as though the germination rate were 75%. In general, then,

α = P(H0 is rejected if it is true)

where α is often called the size or the significance level of the test.

The probability a Type II error occurs when the null hypothesis is false is

P(Type II error) = 0.0000 + 0.0000 + 0.0000 + 0.0000 + 0.0001 + 0.0007 + 0.0034 + 0.0131 + 0.0393 + 0.0917 + 0.1651 = 0.3133

So about 31% of the time, bulbs that have a germination rate of 75% will behave as if the germination rate were only 50%. In general, then,

β = P(H0 is accepted if it is false)

The errors calculated above are usually denoted by α and β, so with the critical region X ≥ 11 we find α = 0.0592 and β = 0.3133. Note that α and β are calculated under quite different assumptions, since α presumes the null hypothesis true and β presumes the null hypothesis false, so they bear no particular relationship to one another. It is of course possible to decrease α by reducing the critical region to, say, X ≥ 12. This produces α = 0.0175, but unfortunately the Type II error then increases to 0.5385. Finally, note that both α and β increase or decrease in finite amounts; it is not possible, for example, to find a critical region that would produce α between the values 0.0175 and 0.0592. The only way to decrease both α and β simultaneously is to increase the sample size, as we now show.
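The error probabilities just computed from Table 9.1 can also be found directly from the binomial distribution. The Python sketch below (an added illustration, not part of the original text) does this for the critical region X ≥ 11.

    # Error probabilities for the germination test of Example 9.3
    # (critical region X >= 11 with n = 15 bulbs).
    from scipy.stats import binom

    n, crit = 15, 11
    alpha = 1 - binom.cdf(crit - 1, n, 0.50)   # P(X >= 11 | p = 0.50)
    beta = binom.cdf(crit - 1, n, 0.75)        # P(X <= 10 | p = 0.75)
    print(alpha, beta)                         # about 0.059 and 0.313, as in the text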

EXAMPLE 9.4  Increasing the Sample Size

Suppose now that the experimenter in the previous example has 100 bulbs with which to experiment. We could work out all the probabilities for the 101 possible values for X and would most certainly use a computer to do this. For this sample size, the binomial distribution is very normal-like in either case here, with a maximum at the expected value, np, and standard deviation √(n · p · (1 − p)). If p = 0.50, we find the probability distribution centered about np = 100 · 0.50 = 50 with standard deviation √(100 · 0.50 · 0.50) = 5. Since we are seeking a critical region in the upper tail of this distribution, we look at values of X at least one standard deviation from the mean, so we start at X = 55. We show some probabilities in Table 9.2.

Table 9.2
  Critical region      α         β
  X ≥ 56            0.1356    0.0000
  X ≥ 57            0.0967    0.0000
  X ≥ 58            0.0666    0.0001
  X ≥ 59            0.0443    0.0001
  X ≥ 60            0.0284    0.0003
  X ≥ 61            0.0176    0.0007
  X ≥ 62            0.0105    0.0014

We see that α decreases as we move to the right on the probability distribution and that β increases. We have suggested various critical regions here and determined the resulting values of the errors. This raises the possibility that the size of one of the errors, say α, is chosen in advance and then a critical region is found that produces this value of α. It is not possible, for example, to choose α = 0.05 and find an appropriate critical region here. This is because the random variable is discrete in this case. If the random variable were continuous, then it is possible to specify α in advance. We show how this is so in the next example.

EXAMPLE 9.5  Breaking Strength

The breaking strength of steel wires used in elevator cables is a crucial characteristic of these cables. Before accepting a shipment of these steel wires, an engineer wants to be confident that μ > 10,000 lb. The cables can be assumed to come from a population with known σ = 400 lb. A sample of 16 wires is selected and their mean breaking strength X̄ is measured. A test will be based on the sample mean, X̄. It would appear sensible to test the null hypothesis H0: μ = 10,000 lb against the alternative Ha: μ < 10,000. The central limit theorem tells us that

X̄ ∼ N(μ, σ/√n)

In this case, we have

X̄ ∼ N(μ, 400/√16) = N(μ, 100)

If the critical region has size 0.05, then we would select a critical region in the left tail of the normal curve, so that α = 0.05. The situation is shown in Figure 9.2.

Figure 9.2

The value of the shaded area in Figure 9.2 is 0.05. This means that the critical value of X̄ satisfies −1.645 = (x̄ − 10,000)/100, or x̄ = 10,000 − 164.5 = 9835.5 lb. So the null hypothesis should be rejected if the sample mean is less than 9835.5 lb.

β AND THE POWER OF A TEST

What is β, the size of the Type II error, in this example? We recall that

β = P(H0 is accepted if it is false)

or

β = P(H0 is accepted if the alternative is true)

We could calculate β easily in our first example since in that case we had a specific alternative (namely, p = 0.75) to deal with. However, in this case, we have an infinity of alternatives (μ < 10,000) to deal with. The size of β depends upon which of these specific alternatives is chosen. We use the notation β_alt to denote the value of β when a particular alternative is selected. We will show some examples. First, consider

β_9800 = P(X̄ > 9835.5 if μ = 9800) = P(Z > (9835.5 − 9800)/100 = 0.355) = 0.361295

So for this test the probability we accept H0: μ = 10,000 if in fact μ = 9800 is over 36%. Now let us try some other values for the alternative:

β_9900 = P(X̄ > 9835.5 if μ = 9900) = P(Z > (9835.5 − 9900)/100 = −0.645) = 0.740536

So almost 3/4 of the time this test will accept H0: μ = 10,000 if in fact μ = 9900. β then depends heavily upon the alternative hypothesis. We can in fact show this dependence in Figure 9.3 in a curve that plots β_alt against the specific alternative.

Figure 9.3

This curve is often called the operating characteristic curve for the test. This curve is not part of a normal curve; in fact, it has no algebraic equation, each point on it being calculated in the same manner as we have done in the previous two examples. The quantity 1 − β_alt is called the power of the test for a specific alternative. We know that

β_alt = P(accept H0 if it is false)

so

1 − β_alt = P(reject H0 if it is false)

This is the probability that the null hypothesis is correctly rejected. A graph of the power of the test is shown in Figure 9.4. Figure 9.4 also shows the power of the test if the sample size were to be increased from 16 to 100. The graph indicates that the sample of 100 is more likely to reject a false H0 than is the sample of 16.

Figure 9.4
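Each point on the operating characteristic and power curves is a normal probability of the kind computed above. The Python sketch below (an added illustration, not part of the original text) computes β_alt and the power 1 − β_alt for a few alternatives in the breaking-strength example, using the critical value 9835.5 and a sample of size 16.

    # beta(alt) = P(X-bar > 9835.5 | mu = alt), with X-bar ~ N(alt, sigma/sqrt(n)).
    from scipy.stats import norm

    def beta(alt, n=16, sigma=400, critical=9835.5):
        return norm.sf(critical, loc=alt, scale=sigma / n ** 0.5)

    for alt in (9800, 9900, 9950):
        print(alt, round(beta(alt), 4), round(1 - beta(alt), 4))
    # beta(9800) = 0.3613 and beta(9900) = 0.7405, matching the text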

p-VALUE FOR A TEST

We have discussed selecting a critical region for a test in advance and the disadvantages of proceeding in that way. We abandoned that approach for what would appear to be a more reasonable one, selecting α in advance and calculating the critical region that results. Selecting α in advance, however, puts a great burden upon the experimenter. How is the experimenter to know what value of α to choose? Should 5% or 6% or 10% or 22% be selected? The choice often depends upon the sensitivity of the experiment itself. If the experiment involves a drug to combat a disease, then α should be very small, while if the experiment involves a component of a nonessential mechanical device, then the experimenter might tolerate a somewhat larger value of α. Now we abandon that approach as well. But then we have a new problem: if we do not have a critical region and if we do not have α either, then how can we proceed?

Suppose, to be specific, that we are testing H0: μ = 22 against H1: μ ≠ 22 with a sample of n = 25 and we know that σ = 5. The experimenter reports that the observed X̄ = 23.72. We could calculate the probability that a sample of 25 would give this result, or a result greater than 23.72, if the true mean were 22. This is found to be

P(X̄ ≥ 23.72 if μ = 22) = P(Z ≥ 1.72) = 0.0427162

Since the test is two sided, the phrase "a result more extreme" is interpreted to mean

P(|Z| ≥ 1.72) = 2 · 0.0427162 = 0.0854324

This is called the p-value for the test. This allows the experimenter to make the final decision, either to accept or reject the null hypothesis, depending entirely upon the size of this probability. If the p-value is very large, one would normally accept the null hypothesis, while if it is very small, one would know that the result is in one of the extremities of the distribution and reject the null hypothesis. The decision of course is up to the experimenter.

Here is a set of rules regarding the calculation of the p-value. We assume that z is the observed value of Z and that the null hypothesis is H0: μ = μ0:

  Alternative      p-value
  μ > μ0          P(Z > z)
  μ < μ0          P(Z < z)
  μ ≠ μ0          P(|Z| > z)

The p-values have become popular because they can be easily computed. Statistical computer programs commonly calculate p-values. Tables offer great limitations, and their use generally allows only approximations to p-values.
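The p-value in the example above can be computed as follows; the Python sketch below is an added illustration of the two-sided rule in the table above.

    # Two-sided p-value for H0: mu = 22 when x-bar = 23.72, sigma = 5, n = 25.
    from scipy.stats import norm

    z = (23.72 - 22) / (5 / 25 ** 0.5)    # observed z = 1.72
    p_one_sided = norm.sf(z)              # P(Z >= 1.72) = 0.0427
    p_two_sided = 2 * p_one_sided         # 0.0854
    print(z, p_one_sided, p_two_sided)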

CONCLUSIONS

We have introduced some basic ideas regarding statistical inference, the process by which we draw inferences from samples, using confidence intervals or hypothesis testing. We will continue our discussion of statistical inference in the next chapter, where we look at the situation when nothing is known about the population being sampled. We will also learn about comparing two samples drawn from possibly different populations.

EXPLORATIONS

1. Use a computer to select 100 samples of size 5 each from a N(0, 1) distribution. Compute the mean of each sample and then find how many of these means are within the interval −0.73567 to 0.73567. How many means are expected to be in this interval?

2. In the past, 20% of the production of a sensitive component was unsatisfactory. The manufacturer is concerned that the percentage of components that must be reworked has increased to 30%. A sample of 20 components shows 6 items that must be reworked.
(a) Form appropriate null and alternative hypotheses.
(b) Let X denote the number of items in a sample that must be reworked. If the critical region is {x | x ≥ 9}, find the sizes of both Type I and Type II errors.
(c) Choose other critical regions and discuss the implications of these on both types of errors.

Chapter 10

Statistical Inference II: Continuous Probability Distributions II—Comparing Two Samples

CHAPTER OBJECTIVES:
• to test hypotheses on the population variance
• to expand our study of statistical inference to include hypothesis tests for a mean when the population standard deviation is not known
• to test hypotheses on two variances
• to compare samples from two distributions
• to introduce the Student t, chi-squared, and F probability distributions
• to show some applications of the above topics

In the previous chapter, we studied testing hypotheses on a single mean, but we presumed, while we did not know the population mean, that we did know the standard deviation. Now we examine the situation where nothing is known about the sampled population. First, we look at the population variance.

THE CHI-SQUARED DISTRIBUTION

We have studied the probability distribution of the sample mean through the central limit theorem. To expand this inference when the population standard deviation σ is unknown, we must be able to make inferences about the population variance.

778734. 1. 0. −0.377894. 0.684865.92 1. . n (xi − xi )2 ¯ n−1 n 2 s2 = This can be shown to be n i=1 n s2 = i=1 2 xi − xi i=1 n(n − 1) In this case.561233 A graph of the values of s2 is shown in Figure 10.82 2.42 2.1. −1.02 0. Our sample showed a mean value for the sample variances to be 1.32 0.52 1.1 If samples are selected from a N(μ.316736 −0. −1.351219 s2 0. We begin with a specific example. −0. −0. . 0. 20 15 10 5 Figure 10.969271 0.0723954. It is certainly not normal due to these facts.448382.044452.521731.550026.12863. without proof.37236 .815117.33975 x ¯ −0.705356 0.989559.878731 . 1) distribution and on each case the sample variance s2 was calculated. 0. 0.192377 .1 0. −0. In general.02 It is clear that the graph has a somewhat long right-hand tail and of course must be nonnegative. the following theorem: Theorem 10.62 0. we must be able to make inferences about the population variance. −0. Here are some of the samples. .142 Chapter 10 Statistical Inference II: Continuous Probability Distributions II unknown. −0. . Now we state.12 2. n = 5. the sample mean.437318.72 3.172919.633443. −0.22 1. and the sample variance: Sample −0. σ) distribution and the sample variance s2 is calculated for each sample. One hundred samples of size 5 were selected from a N(0. and the variance of these variances was 0. then (n − 1)s2 /σ 2 follows the chi-squared 2 probability distribution with n − 1 degrees of freedom (χn−1 ).

it is simply a symbol and alerts us to the fact that the random variable is nonnegative.05 Figure 10. We would then expect.020.521731. our variances showed a variance of 0. we superimpose the χ4 distribution onour histogram of sample variances.2 0.122.125 0.075 0.422.The Chi-Squared Distribution 143 Also E(s2 ) = σ 2 and Var(s2 ) = 2σ 4 n−1 The proof of this theorem can be found in most texts on mathematical statistics.02 Note that exponent 2 carries no meaning whatsoever.620.05 0. since σ 2 = 1.22 1. which we call the degrees of freedom. It is possible.that E(s2 ) = 1. our sample showed a mean value of the variances to be 1.320.521. 0. The theorem says that the probability distribution depends on the sample size n and that the distribution has a parameter n − 1.044452.2 2 Finally. in Figure 10. We would also expect the variance of the sample variances to be 2σ 4 /(n − 1) = 2 · 1/4 = 1/2. so our samples tend to agree with the results stated in the theorem. the chi-squared distribution with n − 1 = 4 degrees 2 of freedom (which we abbreviate as χ4 ). the probability distribution of χ2 highly depends on the fact that the samples arise from a normal distribution. as we could find the probability distribution for the square root of a normal random variable.72 3.15 0.1 0.3 0. 0. but probably useless.175 0. We close this section with an admonition: while the central limit theorem affords us the luxury of sampling from any probability distribution.15 0.2. Probability .3. Now we show in Figure 10.822. to find the probability distribution for χ.1 0.921.025 2 4 x2 6 8 10 Figure 10.

STATISTICAL INFERENCE ON THE VARIANCE

While the mean value of the diameter of a manufactured part is very important for the part to fit into a mechanism, the variance of the diameter is also crucial so that parts do not vary widely from their target value. In this case, as in many other industrial examples, small variances are of course desirable.

For example, consider the hypotheses

H0: σ² = 0.0010 and HA: σ² > 0.0010

A sample of 12 parts showed a sample variance s² = 0.0025. Is this evidence that the true variance σ² exceeds 0.0010? To solve the problem, we must calculate some probabilities. We find that

χ²₁₁ = (n − 1)s²/σ² = 11 · 0.0025/0.0010 = 27.5

and we can calculate that

P(χ²₁₁ > 27.5) = 0.00386

and so we have a p-value for the test. Since this probability is far below 0.05, this value of s² is in the rejection region, and we conclude that the true variance exceeds 0.0010. The situation is shown in Figure 10.4.

Figure 10.4

Confidence intervals are constructed in a similar manner. Here is an example where we need a point on χ²₁₁: we find that P(χ²₁₁ < 4.57) = 0.05, so P((n − 1)s²/σ² < 4.57) = 0.05, which means that P(σ² > (n − 1)s²/4.57) = 0.05. In this case we find P(σ² > 11 · 0.0025/4.57) = P(σ² > 0.0060175) = 0.05, and so we have a confidence interval for σ².

One difficulty with the chi-squared distribution, and indeed with almost all practical continuous probability distributions, is the fact that areas, or probabilities, are very difficult to compute, and so we rely on computers to do that work for us. Without a computer this could only be done with absolutely accurate tables. The computer system Mathematica and the statistical program Minitab have both been used in this book for these calculations and the production of graphs.
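The chi-squared probabilities used in this section are exactly the kind of computer calculation mentioned above. The Python sketch below (an added illustration; the text itself uses Mathematica and Minitab) reproduces the test statistic, the p-value, and the lower 5% point of the chi-squared distribution with 11 degrees of freedom.

    # Chi-squared test of H0: sigma^2 = 0.0010 vs HA: sigma^2 > 0.0010
    # with n = 12 and s^2 = 0.0025.
    from scipy.stats import chi2

    n, s2, sigma2_0 = 12, 0.0025, 0.0010
    statistic = (n - 1) * s2 / sigma2_0       # 27.5
    p_value = chi2.sf(statistic, df=n - 1)    # P(chi-squared_11 > 27.5) = 0.00386
    print(statistic, p_value)
    print(chi2.ppf(0.05, df=n - 1))           # lower 5% point, about 4.57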

025 0.484 X 11. The chi-squared distribution becomes more and more “normal-like” as the degrees of freedom increase.6 .15 Density 0.20 0.025 0.1 Figure 10.10 0. Distribution plot Chi square.05 0.Statistical Inference on the Variance 145 Distribution plot Chi square. df = 4 0.15 Density 0. df = 4 0.49 Figure 10.5 Now we look at some graphs with other degrees of freedom in Figures 10.10 0.5 and 10.05 0. Figure 10.00 0 X 9.00 0.7 shows a graph of the distribution with 30 degrees of freedom.20 0.6.05 0.

1) distribution. are unknown. W. we rely heavily on x the fact that the standard deviation of the population. If we approximate χ2 by a normal distribution with mean 30 and Var(χn 30 √ standard deviation 60 = 7. writing under the pseudonym “Student” discovered the following: Theorem 10. The approximation is not very good.05 0. Now we are able to consider inferences about the sample mean when σ is unknown.7.8 Figure 10. so the approximation is not too bad.7 2 2 We see that P(χ30 > 43.06 0.G.746.we find that the point with 5% of the curve in the √ right-hand tail is 30 + 1. It can be shown that E(χn ) = n and that 2 ) = 2n.05 0.00 X 43.01 0. in many practical situations. it follows that [(¯ − μ)/(σ/ n)]/ (n − 1)s2 /σ 2 (n − 1) follows a tn−1 x . σ. and (n − 1)s2 /σ 2 is a χn−1 x √ variable.04 Density 0. √ 2 Since (¯ − μ)/(σ/ n) is a N(0. In 1908.146 Chapter 10 Statistical Inference II: Continuous Probability Distributions II Distribution plot Chi square. However. is known. μ and σ. however.03 0. 1) random variable divided by the square root of a chi-squared random variable divided by its degrees of freedom (say n) follows a Student t distribution with n degrees of freedom (tn ). we have used the central limit theorem to calculate confidence intervals and to carry out tests of hypotheses. STUDENT t DISTRIBUTION In the previous chapter. for small degrees of freedom.2 The ratio of a N(0. df=30 0. 1) random variable.05. Gossett.645 60 = 42.02 0. both population parameters. In noting that the ran√ dom variable (¯ − μ)/(σ/ n) follows a N(0.8) = 0.

0 -2.68 .0 -5.5 5.72 1.3 Density 0.02 1. much more than that has transpired.5 0.70 1.4 df 4 20 = x−μ ¯ √ s/ n 0.05. Distribution plot T 0.1 0.8 (which was produced by Minitab).0 X 2. We show two typical Student t distributions in Figure 10. and each appears to be “normal-like.0 Figure 10. The number of degrees of freedom is n. n 5 10 20 30 40 v 2. But x−μ ¯ √ σ/ n (n − 1)s2 σ 2 (n − 1) So we have a random variable that involves μ alone and values calculated from a sample. respectively.” Here is a table of values of v so that the probability the table value exceeds v is 0. The sampling can arise from virtually any distribution due to the fact that the central limit theorem is almost immune to the underlying distribution.2 0.8 The curves have 4 and 20 degrees of freedom.Student t Distribution 147 probability distribution. It is far too simple to say that we simply replace σ by s in the central limit theorem.81 1.

n] distribution.148 Chapter 10 Statistical Inference II: Continuous Probability Distributions II This is some evidence that the 0.9 shows a graph of a typical F distribution.but the degrees of freedom. Find the p-value for the test ¯ H0 : μ = 26 and Ha : μ = 26. 1. follows the F distribution with n and m degrees of freedom. Suppose P[F [n. n]. Now notice that the reciprocal of an F random variable is also an F random 1 variable but with the degrees of freedom interchanged.645. then using the theorem. n] > 1/v] = α.797 and P (t13 > 4.7439 · 10−4 . EXAMPLE 10. This fact can be used in finding critical values. This is equivalent to P[1/F [n. so if we have two samples.8 and s2 = 123. say n in the numerator and m in the denominator.797) = 1.8−26 Here we find that t13 = √123/14 = 4.m] = F [m. Theroem 10.05 point approaches that of the N(0. this with 7 and 9 degrees of freedom. Again a computer is essential in doing calculations as the following example shows. m] 2 s2 ms2 2m 2 σ2 σ2 Figure 10. 1) distribution. so F [n. 2 We know from a previous section that (n − 1)s2 /σ 2 is a χn−1 random variable. a rare event indeed. The graph also shows the upper and lower 2. TESTING THE RATIO OF VARIANCES: THE F DISTRIBUTION So far we have considered inferences from single samples. depending on the sample size. possibly arising from two different populations. remain crucial. m] > 1/v] or P[F [m. but we will soon turn to comparing two samples. with degrees of freedom n and m. So the reciprocal of the lower α point on F [n. / 29.5% points. m] is the upper α point on the F [m. So we now investigate comparing variances from two different samples. . We need the following theorem whose proof can be found in texts on mathematical statistics. m] < v] = α.1 A Sample A sample of size 14 showed that x = 29.3 The ratio of two independent chi-squared random variables. 2 2 ns1 s1 2 2 σ1 n σ1 = 2 = F [n. divided by their degrees of freedom.

which can also be written as P 0. We now turn to other hypotheses involving two samples.4 0.395 < 2σ1 /σ2 < 2.10.7 0.20 0. from Figure 10.9 EXAMPLE 10. We know that 2 s2 2 200 σ2 = 2 2 s1 σ2 2 σ1 · 2 σ1 2σ 2 = 21 = F [12. This is an indication of the great variability of the ratio of variances.1 0.2 0.0 00.025 Figure 10.Testing the Ratio of Variances: the F Distribution 149 Distribution plot F. We seek a 90% confidence interval for the 2 2 ratio of the true variances. n2 = 13.5 Density 0.207 X 4.1975 < 2 σ1 < 1. df2 = 9 0. s1 = 100.90.10. and s2 = 200.3 0. σ1 /σ2 . 21] 100 σ2 so we find.90 2 σ2 It is interesting to note. .2 Two Samples Independent samples from two normal distributions gave the following data: 2 2 n1 = 22. df1 = 7.025 0.25] = 0. that 2 2 P[0. while one sample variance is twice the other. that the confidence 2 2 2 interval contains 1 so the hypothesis H0 : σ1 = σ2 against Ha : σ1 = σ2 would be accepted / 2 with α = 0.6 0.125 = 0.

Samples are selected from each production line. The results of the sampling are given in Table 10. Graphs should be drawn from the data so that we may make an initial visual inspection. different versions of standardized tests may be compared to decide if they in fact test equally able candidates equally.3 0.1 0. the thermostats set at 140o .2 0.14.05 Figure 10. Comparative dot plots of the samples are shown in Figure 10.15 .05 0.11. y = 144. and then the actual temperature in the heater is measured. production methods may be compared with respect to the quality of the products each method produces. EXAMPLE 10. so we calculate some sample statistics and we find that nx = 15.3 Two Production Lines Two production lines.9 0. The data from the production line X appear to be much more variable than that from production line Y . Two samples may be compared by comparing a variety of sample statistics.10 TESTS ON MEANS FROM TWO SAMPLES It is often necessary to compare two samples with each other.4 0.25 0.7 0.0 0 0. sx = 6. sy = 3. x = 138. we consider comparing only means and variances here. called X and Y.95 ny = 25.395 X 2.150 Chapter 10 Statistical Inference II: Continuous Probability Distributions II Distribution plot F.8 0. It also appears that the samples are centered about different points.6 Density 0. df2 = 21 0.1.5 0. are making specialized heaters. df1 = 12.02. We may wish to compare methods of formulating a product.

990 148.797 139. We wish to test the hypothesis H0 : μX = μY against H1 : μX = μY .482 142.058 139.381 146.11 The real question here is how these statistics would vary were we to select large numbers of pairs of samples.490 145.224 121.103 144.324 145.332 140.844 139.155 147.065 147.740 146.Tests on Means from Two Samples Table 10.691 127. We formulate the problem as follows.343 137.585 142.472 145. / ¯ ¯ We know that E[ X − Y ] = μX − μY and that σ2 σ2 ¯ ¯ Var[X − Y ] = X + Y nx ny .766 131.973 137.103 145.352 151 Comparative dot plots for Example 10.151 139.553 137.471 145.809 Y 135.822 145.644 144.757 145.1 X 147.319 143.676 141.648 140.3 X Y 124 128 132 136 Data 140 144 148 Figure 10.753 145.970 138.496 147.562 139.598 145.022 145. The central limit theorem can be used if assumptions can be made concerning the population variances.966 140.083 140.

Here P (X − Y ) − 1. so the null hypothesis would most probably not be accepted. There are two cases: the unknown variances are equal or they are not equal.95 which becomes in this case the interval from −9. but equal.14 − 144. We could also use z to construct a confidence interval. with true value σ 2 .001.3862 to −2. although it is not infrequent. We replace each of the unknown. where 2 sp = 2 2 (nx − 1)sX + (ny − 1)sY nx + ny − 2 2 Here sp is often called the pooled variance. Then it is known that tnX +nY −2 = (X − Y ) − (μX − μY ) sp 1 1 + nX nY . Then z= (138. It can be shown that the difference between normal variables is also normal so ¯ ¯ (X − Y ) − (μX − μY ) z= 2 σX σ2 + Y nx ny is a N(0. 1) variable. say σX = σY = 30. variances with the pooled variance. EXAMPLE 10. which we denote by sp . then we form an estimate 2 of this common value. when data have been gathered over a long period of time. Since 0 is not in this interval.0.4 Using Pooled Variances If the population variances are known to be equal. that some idea of the size of the variance is known. Now we consider the case where the population variances are unknown. The sampling must now be done from normal distributions.3739. Larger samples reduce the width of the confidence interval.96 2 σ2 σX + Y ≤ μX − μY ≤ (X − Y ) + 1. Consider for the moment that we can assume that the populations have 2 2 equal variances.152 Chapter 10 Statistical Inference II: Continuous Probability Distributions II ¯ ¯ and that each of the variables X and Y is individually approximately normally distributed by the central limit theorem. The statistic z can be used to test hypotheses or to construct confidence intervals if the variances are known.287 The p-value for this test is approximately 0. Knowledge of the population variances in the previous example may be regarded as artificial or unusual. the hypothesis of equal means is rejected. We give examples of the procedure in each case.02) − 0 30 30 + 15 25 = −3.96 nx ny 2 σ2 σX + Y nx ny = 0.

An approximation due to Welch is given here. (We must always use the greatest integer less than or equal to ν. we do not know the exact probability distribution of any statistic involving the sample data in this case.Tests on Means from Two Samples In this case.09.02) − 0 = −3. the normal approximation will be a very good one. Using the data in Example 10. several approximations are known.0007 leading to the rejection of the hypothesis that the population means are equal. The Welch approximation will make a very significant difference if the population variances are quite disparate. Regrettably. as nx → ∞ and ny → ∞. However.) This gives T17 = −3.14 − 144. a result quite comparable to previous results. otherwise. we find that 2 sp = 153 14 · 6.670 √ 1 1 24. The variable T = (X − Y ) − (μX − μY ) 2 s2 sX + Y nx ny is approximately a t variable with ν degrees of freedom.5. the sample sizes are artificially increased. It is not difficult to see that . T = ¯ ¯ (X − Y ) − (μX − μY ) 2 sX s2 + Y nx ny → ¯ ¯ (X − Y ) − (μX − μY ) 2 σX σ2 + Y nx ny =z Certainly if each of the sample sizes exceeds 30.152 = 24.660 so we must use a t variable with 17 degrees of freedom. This unsolved problem is known as the Behrens–Fisher problem. we find ν = 17. where 2 s2 sX + Y nx ny 2 2 ν= 2 2 sX sY nx ny + nx − 1 ny − 1 2 It cannot be emphasized too strongly that the exact probability distribution of T is unknown.0625 15 + 25 and so t38 = The p-value for the test is then about 0. EXAMPLE 10. it is very dangerous to assume normality for small samples if .5 Unequal Variances Now we must consider the case where the population variances are unknown and cannot be presumed to be equal.952 + 24 · 3.0625 15 + 25 − 2 (138.

The difference is not great in this case. CONCLUSIONS Three new. Often T is used. We now study a very important application of much of what we have discussed in statistical process control. x2 = 602.09 with 14 degrees of freedom giving a p − value of 0. no exact or approximate tests are known for this situation. and two variances. Computer programs such as Minitab make it easy to do the exact calculations involved. but the minimum of nx − 1 and ny − 1 is used for the degrees of freedom. x1 = 563. the normal approximation is to be avoided. then in the example we have been using. Our approximation in Example 10.154 Chapter 10 Statistical Inference II: Continuous Probability Distributions II their population variances are quite different. EXPLORATIONS 1. (a) Test H0 : μ1 = 540 against H1 : μ1 > 540 2 (b) Find a 90% confidence interval for σ2 . This culminates in several tests on the difference between two means when the population variances are unknown and both when the population variances are assumed to be equal and when they are assumed to be unequal. s1 = 80. If this advice is followed. a single mean when the population variance is unknown. making the p–value 0. In that case. regardless of sample size. probability distributions have been introduced here that have been used in testing a single variance. Is this assumption supported by the data? (ii) σ1 / 2 . This is a safe and prudent route to follow in this circumstance. and very useful. Two samples of students taking the Scholastic Aptitude Test (SAT) gave the following data: 2 n1 = 34. (c) Test H0 : μ1 = μ2 assuming 2 2 (i) σ1 = σ2 .007.68 Assume that the samples arise from normal populations. The tests given here heavily depend on the relationship between the variances as well as the normality of the underlying distributions.5 allows us to use 17 degrees of freedom. T = −3.008. Is this assumption supported by the data? 2 = σ 2 .64 2 n2 = 28. s2 = 121. If these assumptions are not true.

The result is shown in Figure 11. The variability of the means appears to have decreased after observation 20 as well. The means shown in Figure 11. In this chapter. Inc. it has improved the quality of the products we purchase and use. We first look at control charts. for example. Samples of five items are selected from the production line periodically. Kinney Copyright © 2009 by John Wiley & Sons. A size measurement is taken and the mean of the five observations is recorded. the process may have undergone some significant changes that may lead to an investigation of the production process. Statistical analysis has become a centerpiece of manufacturing. CONTROL CHARTS EXAMPLE 11. However. and number of defectives to learn about acceptance sampling and how this may improve the quality of manufactured products. among other contributions to our lives.1.1 Data from a Production Line A manufacturing process is subject to periodic inspection.Chapter 11 Statistical Process Control CHAPTER OBJECTIVES: r r r r to introduce control charts to show how to estimate the unknown σ to show control charts for means. But. The means are plotted on a graph as a function of time. processes show random variation arising from a number A Probability and Statistics Companion.1 certainly show considerable variation. Statistics has become an important subject because. are these apparent changes in the process statistically significant? If so. 155 . proportion defective. John J. appears to be greater than any of the other observations while the 19th observation appears to be less than the others. Observation 15. we want to explore some of the ways in which statistics does this.

x3. .8197 21.9250 22. x4.1 of sources.4619 Range 8.8300 20.9991 21.9625 18.3215 18. .2559 23.68069 2.0998 20.5668 24.3042 19. . 19.90589 5.5008 17.49118 2. x2.57512 6.29191 1.7421 16.1647 19.3827 20. .7467 18.1 Standard Deviation 4.7208 21.81991 3.50961 2.9837 18. if we were to know the true mean of the observations μ as well as the standard deviation σ. and x5.7566 21.5443 19.2564 . 19.20938 2. Table 11.5589 21. .8875 .1692 19.89023 .4993 17. .2468 20.4357 21.5589 23.8812 19. .3964 17. were selected.8545 .49851 5. of course.26620 . . 17.14750 5.6703 x4 23.87495 2. . It would help.65895 5.31656 x1 14. . How can we proceed in judging the process? Forty samples. The observations are denoted by x1.2758 20.24052 0.1754 19. Some of the samples chosen and summary statistics from these samples are given in Table 11.6504 21. .7165 18. neither of these quantities is known.4408 15.3229 23.5269 16. . 1.3478 x5 14.4052 20.8692 20.71821 2.1211 19.1339 20.9313 13.2777 19.2247 20.7777 . .51258 3. 19.6445 19.5286 21.8907 18. each of size 5.2557 19.61823 .156 Chapter 11 Statistical Process Control Means from a production line 22 21 Mean 20 19 18 4 8 12 16 20 Index 24 28 32 36 40 Figure 11. . 19.9661 Mean 18.4660 x2 16. 3.8628 .8596 x3 22.75511 9.1203 17. and we may simply be observing this random behavior that occurs in all production processes. . 20.2718 19. .8140 22.8997 .1.

For each sample, the mean, standard deviation, and range (the difference between the largest observation and the smallest observation) were calculated. We know the central limit theorem shows that the sample mean x̄ follows a normal distribution with mean μ and standard deviation σ/√n. Now if we could estimate σ, we might say that a mean greater than x̄ + 3(σ/√n) or a mean less than x̄ − 3(σ/√n) would be unusual since, for a normal distribution, we know that the probability an observation is outside these limits is 0.0026998, a very unlikely event.

It is natural to estimate μ by the mean of the sample means, x̄. In this case, using all 40 sample means, x̄ = 19.933. While we do not know σ, we might assume, correctly, that the sample standard deviations and the sample ranges may aid us in estimating σ. We now show two ways to estimate σ based on each of these sample statistics.

ESTIMATING σ USING THE SAMPLE STANDARD DEVIATIONS

In Table 11.1, the standard deviation is calculated for each of the samples shown there. The sample standard deviations give us some information about the process standard deviation σ. It is known that the mean of the sample standard deviations, s̄, can be adjusted to provide an estimate for the standard deviation σ:

    σ̂ = s̄/c4

where the adjustment factor c4 depends on the sample size; in this case, n is the sample size 5. The adjustment c4 ensures that the expected value of σ̂ is the unknown σ. Such estimates are called unbiased. Table 11.2 shows some values of this quantity.

Table 11.2
    n     c4
    2     0.797885
    3     0.886227
    4     0.921318
    5     0.939986
    6     0.951533
    7     0.959369
    8     0.965030
    9     0.969311
    10    0.972659
    11    0.975350
    12    0.977559
    13    0.979406
    14    0.980971
    15    0.982316
    16    0.983484
    17    0.984506
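The values of c4 in Table 11.2 come from a standard closed form that the chapter does not quote, namely c4 = √(2/(n − 1)) Γ(n/2)/Γ((n − 1)/2); treat the formula and the short computation below as a supplementary check rather than part of the text's development.

    # Closed form for the unbiasing constant c4 (supplementary; Python assumed).
    from math import gamma, sqrt

    def c4(n):
        return sqrt(2.0 / (n - 1)) * gamma(n / 2) / gamma((n - 1) / 2)

    print(c4(5))    # about 0.939986, as in Table 11.2
    print(c4(10))   # about 0.972659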

In this case, s̄ = 1.874 and c4 = 0.939986, so our estimate is

    σ̂ = 1.874/0.939986 = 1.99365

This means that the limits we suggested, x̄ ± 3(σ/√n), become

    19.933 − 3 · 1.99365/√5 = 17.2582   and   19.933 + 3 · 1.99365/√5 = 22.6078

These are called lower and upper control limits (LCL and UCL). If we show these on the graph in Figure 11.1, we produce Figure 11.2.

Figure 11.2 Means from a production line, with σ estimated from the sample standard deviations (UCL = 22.608, x̄ = 19.933, LCL = 17.259).

None of the observations are outside the control limits. The use of three standard deviations as control limits is a very conservative choice. The fact that the limits are exceeded so rarely is an argument in its favor, since the production line would be shut down or investigated very infrequently. It is possible, of course, to select other control limits. The limits x̄ − 2 · σ/√n and x̄ + 2 · σ/√n would be exceeded roughly 4.55% of the time. In this case, these limits are

    19.933 − 2 · 1.99365/√5 = 18.1498   and   19.933 + 2 · 1.99365/√5 = 21.716

The resulting control chart is shown in Figure 11.3. Now the 15th observation is a bit beyond the upper control limit.
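The arithmetic for these limits is easy to reproduce. The following sketch (Python assumed; the book's charts are drawn with Minitab) recovers both the three sigma and the two sigma limits from the summary statistics quoted above.

    # Control limits for Example 11.1 from the mean of the sample standard deviations.
    from math import sqrt

    xbarbar = 19.933     # mean of the 40 sample means
    sbar    = 1.874      # mean of the 40 sample standard deviations
    c4      = 0.939986   # Table 11.2, n = 5
    n       = 5

    sigma_hat = sbar / c4                       # about 1.994
    for k in (3, 2):
        half_width = k * sigma_hat / sqrt(n)
        print(k, xbarbar - half_width, xbarbar + half_width)
    # k = 3: about 17.26 and 22.61    k = 2: about 18.15 and 21.72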

Figure 11.3 Means from a production line, with σ estimated from the sample standard deviations and two sigma limits (+2SL = 21.716, x̄ = 19.933, −2SL = 18.150).

ESTIMATING σ USING THE SAMPLE RANGES

The sample ranges can also be used to estimate σ. The mean range R̄ must be adjusted to provide an unbiased estimate of σ. It is a fact that an unbiased estimate of σ is

    σ̂ = R̄/d2

where d2 depends on the sample size. Table 11.3 gives some values of d2.

Table 11.3
    n     d2
    2     1.128
    3     1.693
    4     2.059
    5     2.326
    6     2.534
    7     2.704
    8     2.847
    9     2.970
    10    3.078

In this case, R̄ = 4.536 and d2 = 2.326, so our estimate of σ is

    σ̂ = 4.536/2.326 = 1.95013

and this produces two sigma control limits at

    19.933 − 2 · 1.95013/√5 = 18.189   and   19.933 + 2 · 1.95013/√5 = 21.678

The resulting control chart is shown in Figure 11.4.

Figure 11.4 Means from a production line, with σ estimated from the sample ranges (+2SL = 21.678, x̄ = 19.933, −2SL = 18.189).

The two control charts do not differ much in this case. It is easier on the production floor to calculate R̄, but both methods are used with some frequency. The control charts here were produced using Minitab, which allows several methods for estimating σ as well as great flexibility in using various multiples of σ as control limits.

It is natural in our example to use the sample mean as a statistic, since we made measurements on each of the samples as they emerged from the production process. These are examples of control charts for variables. It is possible to produce control charts for statistics other than the sample mean, but we will not discuss those charts here. It may be, however, that the production items are either usable or defective; in that case, we call the resulting control charts control charts for attributes.

CONTROL CHARTS FOR ATTRIBUTES

EXAMPLE 11.2 Metal Plating Data

Thirty samples of size 50 each are selected from a manufacturing process involving plating a metal. The data in Table 11.4 give the number of parts that showed a plating defect. We are interested in the number of defects in each sample and how this varies from sample to sample as the samples are taken over a period of time.

Table 11.4
    Sample number   Defects        Sample number   Defects
    1               6              16              5
    2               4              17              6
    3               7              18              1
    4               3              19              6
    5               1              20              7
    6               3              21              12
    7               3              22              7
    8               1              23              5
    9               5              24              2
    10              5              25              3
    11              2              26              2
    12              11             27              2
    13              2              28              5
    14              3              29              4
    15              2              30              4

np Control Chart

Let the random variable X denote the number of parts showing plating defects. X is a binomial random variable because a part either shows a defect or it does not, and we assume, correctly or incorrectly, that the parts are produced independently and with constant probability of a defect p. The probability distribution of X in general is given by

    P(X = x) = C(n, x) p^x (1 − p)^(n−x),   x = 0, 1, . . . , n

Since 50 observations were taken in each sample, n = 50, and so X takes on integer values from 0 to 50. We know that the mean value of X is np and the standard deviation is √(np(1 − p)). The control chart involved is usually called an np chart. We now show how this is constructed.

Reasonable control limits then might be from LCL = X̄ − 3√(np(1 − p)) to UCL = X̄ + 3√(np(1 − p)), but we do not know the value of p. We can find from the data that the total number of defective parts is 129, so the mean number of defective parts is X̄ = 129/30 = 4.30. A reasonable estimate for p is the total number of defects divided by the total number of parts sampled, or 129/(30)(50) = 0.086. This gives estimates for the control limits as

    LCL = X̄ − 3√(np(1 − p)) = 4.30 − 3√(50(0.086)(1 − 0.086)) = −1.6474

and

    UCL = X̄ + 3√(np(1 − p)) = 4.30 + 3√(50(0.086)(1 − 0.086)) = 10.25

Since X, the number of parts showing defects, cannot be negative, the lower control limit is taken as 0. The resulting control chart, produced by Minitab, is shown in Figure 11.5.

Figure 11.5 np chart for the defects data, with three sigma limits (UCL = 10.3, centerline np = 4.3, LCL = 0).

It appears that samples 12 and 21 show the process to be out of control. Except for these points, none of the points are out of control, and the process is in good control. Figure 11.6 shows the control chart using two sigma limits.
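These limits, and the identification of the out-of-control samples, can be verified directly from the counts in Table 11.4. The sketch below is an illustration in Python (an assumption; the chart itself was produced by Minitab).

    # np chart limits for the plating data of Example 11.2.
    from math import sqrt

    defects = [6, 4, 7, 3, 1, 3, 3, 1, 5, 5, 2, 11, 2, 3, 2,
               5, 6, 1, 6, 7, 12, 7, 5, 2, 3, 2, 2, 5, 4, 4]
    n = 50
    p_hat = sum(defects) / (len(defects) * n)     # 129/1500 = 0.086
    center = n * p_hat                            # 4.3
    sd = sqrt(n * p_hat * (1 - p_hat))

    print(center - 3 * sd, center + 3 * sd)       # about -1.65 (taken as 0) and 10.25
    print([i + 1 for i, d in enumerate(defects) if d > center + 3 * sd])   # [12, 21]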

Figure 11.6 np chart for the defects data, with two sigma limits (+2SL = 8.26, centerline np = 4.3, −2SL = 0.34).

p Chart

Due to cost and customer satisfaction, the proportion of the production that is defective is also of great importance. Since the random variable X is the number of parts showing defects in our example, and the sample size is n, we know that the sample proportion defective is X/n. Denote this random variable as ps, that is, ps = X/n. Since X is binomial, it follows that

    E(ps) = E(X/n) = E(X)/n = np/n = p

and

    Var(ps) = Var(X/n) = Var(X)/n² = np(1 − p)/n² = p(1 − p)/n

We see that control limits, using three standard deviations, are

    LCL = p − 3√(p(1 − p)/n)   and   UCL = p + 3√(p(1 − p)/n)

but, of course, we do not know the value for p. A reasonable estimate for p would be the overall proportion defective considering all the samples, that is, 0.086. This is the estimate used in the previous section. This

gives control limits as

    LCL = 0.086 − 3√(0.086(1 − 0.086)/50) = −0.032948

and

    UCL = 0.086 + 3√(0.086(1 − 0.086)/50) = 0.2049

Zero is used for the lower control limit. The resulting control chart is shown in Figure 11.7.

Figure 11.7 p chart for the defects data (UCL = 0.2049, p̄ = 0.086, LCL = 0).

This chart gives exactly the same information as the chart shown in Figure 11.6.

SOME CHARACTERISTICS OF CONTROL CHARTS

EXAMPLE 11.3 Control Limits

Examine the control chart shown in Figure 11.8; notice that one, two, and three sigma units are displayed. Ignore the numbers 1, 2, and 6 on the chart for the moment, since they will be explained later.

6 and standard deviation 5. and this can be done. After the 60th sample. In the meantime. The control charts we have considered offer great insight into production processes since they track the production process in a periodic manner. SOME ADDITIONAL TESTS FOR CONTROL CHARTS First.5 +2SL = 46.5 40. an observation falling beyond the three sigma limits entirely due to chance rather than to a real change in the process. it is difficult to detect even a large change in the process mean with any certainty. but using only the three sigma criterion alone. .0 1 11 21 31 41 51 61 Observation 71 81 91 Figure 11.61 –3SL = 35. the next 50 samples of size 5 were chosen from a normal distribution with mean 43. The change here was a relatively large one. although the 71st and 26th samples are close to the control line. For this reason. They have some disadvantages as well.0 6 6 2 6 37. the chart shows only a single point outside three sigma limits. the 83rd. But despite the fact that the mean had increased by 3. the observations become well behaved again. although it is apparent that the mean has increased.Some Additional Tests for Control Charts 165 Control chart for Example 11.87 –2SL = 37.64 2 Individual value 45. In the above example. additional tests are performed on the data. The control chart was constructed by taking the mean of the samples.6 or 72% of the standard deviation.12 –1SL = 39.89 47.0 2 2 2 2 2 6 2 6 2 2 2 +1SL = 44. It would be very desirable to detect even small changes in the process mean.3 50. especially when the process mean changes by a small proportion of the standard deviation. that is. as indeed it did.35 42. We describe some of these now.38 _ X = 42. the process may be essentially out of control without the knowledge of the operators. but it may take many observations to do so. consider the probability of a false reading. or when the standard deviation itself changes.0 1 +3SL = 48. namely.8 This control chart was constructed by selecting 50 samples of size 5 from a normal distribution with mean 40 and standard deviation 5. It is apparent from the control chart that something occurred at or around the 50th sample. we might not be alarmed. many are slow to discover a change in the production process.5 35.

Here n represents the number of observations. However. In Figure 11. 12644. This explains those numbers on the chart in Figure 11. the probability that at least one of them is outside the three sigma limits is 1 − (1 − 0. 2.885011 0. We have used k = 3. The default values of the constant k in each of these tests can be changed easily in Minitab. as the data in Table 11. but Table 11.5 indicate.236899 0. . Minitab offers eight additional tests on the data.741233 0.660900 0. In general.933039 So an extreme observation becomes almost certain as production continues.126444 0.849314 0. This probability increases rapidly. an assumption justified for the sample mean if the sample size is moderate. Table 11.8.5 n 50 100 200 300 400 500 600 700 800 900 1000 Probability 0.912252 0.555629 0. Such observations are then very rare and when one occurs. if 50 observations are made. The calculation of the probabilities involved in most of these tests relies upon the binomial or normal probability distributions. These points are indicated by the symbol on the control chart. Points more than k sigma units from the centerline. and 6.802534 0. Minitab indicates this by putting the appropriate number of the test on the control chart. since the probabilities of these events are quite small. then the probability of an observation outside the three sigma limits is 0.When these tests become significant. Test 1. while only one point is beyond the three sigma limit.166 Chapter 11 Statistical Process Control Assuming that the observations are normally distributed.0027)50 = 0. although the process in reality has not changed.8.417677 0.0027. 5. the following situations are to be regarded as cautionary flags for the production process. we are unlikely to conclude that the observations occurred by chance alone. we will describe four of them here. tests 1. several are beyond the two sigma limits. namely.6 shows probabilities with which a single point is more than k standard deviations from the target mean.
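Table 11.5 is simply the complement rule applied to the single-point probability 0.0027; the short loop below (Python assumed) reproduces its entries.

    # Probability of at least one point beyond the three sigma limits in n
    # independent, normally distributed points (reproduces Table 11.5).
    p_single = 0.0027
    for n in (50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000):
        print(n, 1 - (1 - p_single) ** n)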

000092 0.012419 0. . .00390625.Some Additional Tests for Control Charts Table 11.6 k 1.and 91.002700 167 Test 2.00195 0. . 8.7 shows this probability for k = 7.000003 .7 k 7 8 9 10 11 Probability 0. .003035 0.045500 0. At least k out of k + 1 points in a row more than two sigmas from the centerline.00781 0. 61.133614 0.8 k 2 3 4 Probability 0. 18. The probability that nine points in a row are on the same side of the centerline is 2 (1/2)9 = 0.5 2.9771499)3−x = 0.0227501)x (0.01563 0. 16.0 Probability 0. Table 11.0 2. 60. The quantity k is commonly chosen as 2. k points in a row on the same side of the centerline. 17.003035 x Table 11. 14. This event occurred at observation 44 in Figure 11. 28. Since the probability that one point is more than two sigmas above the centerline is 0. Table 11. 11.00098 This test fails at samples 13.8. the probability that at least two out of three observations are more than two sigmas from the centerline and either above or below the centerline is 3 2 x=2 3 (0.317311 0.8 gives values of this probability for other values of k.0227501 and since the number of observations outside these limits is a binomial random variable.0 1. Test 5. It is common to use k = 9.00391 0.5 3. Table 11.
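The probabilities quoted for these tests are short binomial calculations. Here is a sketch of two of them (Python assumed); any small difference from the tabled values comes only from rounding the normal tail probabilities.

    # Probabilities behind two of the supplementary control chart tests.
    from math import comb

    # Test 2: nine points in a row on the same side of the centerline
    print(2 * (1 / 2) ** 9)                                   # 0.00390625

    # Test 5: at least 2 of 3 points more than two sigmas out, on the same side
    p = 0.0227501
    print(2 * sum(comb(3, x) * p**x * (1 - p)**(3 - x) for x in (2, 3)))   # about 0.003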

7 indicates that a run of seven points is sufficient.005532 This test failed at samples 8.158655)x (1 − 0.00553181 x Table 11. The value of k is commonly chosen as 4. 3) distribution.028147 0. Simplicity is a desirable feature when production workers monitor the production process. (a) Find the mean. for example. Select samples of size 6 from a N(25. the probability that at least four out of five observations are more than two sigmas from the centerline is 5 2 x=4 5 (0.158655 and since the number of observations outside these limits is a binomial random variable. and 11 in our data set. and the range for each sample. where we seek sequences of points that are on the same side of the centerline. Since the probability that one point is more than one sigma above or below the centerline is 0. One could even approximate the probability for the test in question and find a value for k. In test 2. (b) Use both the sample standard deviations and then the range to estimate σ and show the resulting control charts. Table 11. an important part of manufacturing.9 k 2 3 4 Probability 0.168 Chapter 11 Statistical Process Control Test 6. Table 11. CONCLUSIONS We have explored here only a few of the ideas that make statistical analysis and statistical process control. EXPLORATIONS 1. .158655)5−x = 0.135054 0. This topic also shows the power of the computer and especially computer programs dedicated to statistical analysis such as Minitab.1. The additional tests provide some increased sensitivity for the control chart. but they decrease its simplicity.02. standard deviation. if we wanted a probability of approximately 0. At least k out of k + 1 points in a row more than one sigma from the centerline. Create a data set similar to that given in Table 11. 9. Computers allow us to calculate the probabilities for each of the tests with relative ease.9 gives values of this probability for other values of k.

(a) Construct an np control chart and discuss the results. show the number of defective items. p = 0.Explorations 169 (c) Carry out the four tests given in the section “Some Additional Tests for Control Chart” and discuss the results. 2. .05. Generate 50 samples of size 40 from a binomial random variable with the probability of a defective item. For each sample. (b) Construct a p chart and discuss the results.
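For Exploration 1, a data set similar to that given in Table 11.1 can be simulated in a few lines. This sketch assumes Python with numpy (the exploration can of course be done with Minitab or a spreadsheet instead) and reads N(25, 3) as a normal population with mean 25 and standard deviation 3.

    # A starting point for Exploration 1: 40 samples of size 6 from N(25, 3).
    import numpy as np

    rng = np.random.default_rng(1)
    samples = rng.normal(loc=25, scale=3, size=(40, 6))

    xbars  = samples.mean(axis=1)
    sds    = samples.std(axis=1, ddof=1)
    ranges = samples.max(axis=1) - samples.min(axis=1)

    c4, d2 = 0.951533, 2.534        # Table 11.2 and Table 11.3 values for n = 6
    print(sds.mean() / c4)          # estimate of sigma from the standard deviations
    print(ranges.mean() / d2)       # estimate of sigma from the ranges
    print(xbars.mean())             # centerline for the control charts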

Chapter 12

Nonparametric Methods

CHAPTER OBJECTIVES:
• to learn about hypothesis tests that are not dependent on the parameters of a probability distribution
• to learn about the median and other order statistics
• to use the median in testing hypotheses
• to use runs of successes or failures in hypothesis testing

INTRODUCTION

The Colorado Rockies National League baseball team early in the 2008 season lost seven games in a row. Is this unusual, or would we expect a sequence of seven wins or losses in a row sometime in the regular season of 162 baseball games? We will answer this question and others related to sequences of successes or failures in this chapter.

In general, nonparametric statistical methods refer to statistical methods that do not depend upon the parameters of a distribution, such as the mean and the variance (both of which occur in the definition of the normal distribution, for example), or on the distribution itself. We begin with a nonparametric test comparing two groups of teachers.

THE RANK SUM TEST

EXAMPLE 12.1 Two Groups of Teachers

We begin with two groups of teachers, from two different schools, each one of whom was being considered for an award. Twenty-six teachers were under consideration for a prestigious award (one from each school). Each teacher was ranked by a committee. The results are shown

in Table 12.1. The stars (*) after the teachers' names indicate that they came from School I, while those teachers without stars came from School II; the teachers from School I are A, B, C, E, G, O, P, and R. Names have been omitted (since this is a real case) and have been replaced by letters. The scores are shown in decreasing order.

Table 12.1 The 26 teachers (A through Z, stars marking those from School I), their scores in decreasing order (from 34.0 down to 12.0), and their ranks from 1 to 26.

We notice that five of the teachers from School I are clustered near the top of the list. Could this have occurred by chance, or is there a real difference between the teachers at the two schools? The starred group has 8 teachers, while the unstarred group has 18 teachers. Certainly, the inequality of the sample sizes makes some difference. The usual parametric test, say on the equality of the means of the two groups, is highly dependent upon the samples being selected from normal populations and needs some assumptions concerning the population variances in addition. In this case, these are dubious assumptions to say the least. So how can we proceed? Here is one possible procedure, called the rank sum test.

To carry this test out, we first rank the teachers in order, from 1 to 26. These rankings are shown in the right-hand column of Table 12.1. Note that teachers C and

D each have scores of 29.5, so instead of ranking them as 3 and 4, each is given rank 3.5, the arithmetic mean of the ranks 3 and 4. There are two other instances where the scores are tied, and we have followed the same procedure with those ranks.

Now the ranks of each group are added up. In our case, the starred group has a rank sum of

    1 + 2 + 3.5 + 5 + 7 + 15 + 16 + 18 = 67.5

The sum of all the ranks is

    1 + 2 + 3 + · · · + 26 = (26 · 27)/2 = 351

so the rank sum for the unstarred teacher group must be 351 − 67.5 = 283.5. There is a considerable discrepancy in these rank sums. In our case, however, the sample sizes are quite disparate, and so some difference is to be expected. How much of a difference is needed then before we conclude the difference to be statistically significant?

Were we to consider all the possible rankings of the group of eight teachers, we might then be able to conclude whether or not a sum of 67.5 was statistically significant. Here is how we can do this. Note that the eight teachers could have rank sums ranging from 1 + 2 + 3 + · · · + 8 = 36 to 19 + 20 + 21 + · · · + 26 = 180, and these teachers can be ranked in C(26, 8) = 1,562,275 ways. Suppose then that we examine all the possible rank sums for our eight teachers. It turns out that this is possible, but we need a computer to deal with this; smaller samples provide interesting classroom examples and can be carried out by hand. It turns out that there are only 145 different rank sums. A table of all the possible values of rank sums is given in Table 12.2.

Table 12.2 {36, 37, 38, 39, . . . , 179, 180}

These rank sums, however, do not occur with equal frequency; Table 12.3 shows these frequencies. We can then find the rank sums that are at most 67.5. This is 1 + 1 + 2 + 3 + 5 + 7 + · · · + 2611 = 17,244. So we would conclude that the probability the rank sum is at most 67.5 is 17244/1562275 = 0.01103775, so it is very rare that a rank sum is less than the rank sum we observed.
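The tail probability just computed can also be approximated by simulation. The sketch below (Python assumed) draws eight of the ranks 1 through 26 at random; it ignores the tied ranks (3.5, 13.5, and so on), which changes the answer only slightly.

    # Monte Carlo check of P(rank sum of 8 teachers <= 67.5); exact value 0.011038.
    import random

    random.seed(1)
    trials = 200_000
    hits = sum(sum(random.sample(range(1, 27), 8)) <= 67.5 for _ in range(trials))
    print(hits / trials)    # close to 0.011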

Table 12.3 The frequency of each possible rank sum from 36 to 180. The frequencies begin 1, 1, 2, 3, 5, 7, . . . , rise to a maximum in the middle of the range, and fall symmetrically back to 1 at 180.

It is interesting to see a graph of all the possible rank sums, as shown in Figure 12.1.

Figure 12.1 Frequencies of the possible rank sums from 36 to 180.

Our sum of 67.5 is then in the far left-hand end of this obviously normal curve, showing a perhaps surprising occurrence of the normal curve: it arises in many places in probability where one would not usually expect it! Teachers A and D won the awards.

We now turn to some other nonparametric tests and procedures.

ORDER STATISTICS

EXAMPLE 12.2 Defects in Manufactured Products

The number of defects in several samples of a manufactured product was observed to be

    1, 6, 5, 5, 4, 3, 2, 2, 4, 6, 7

The mean number of defects is then

    (1 + 6 + 5 + 5 + 4 + 3 + 2 + 2 + 4 + 6 + 7)/11 = 4.09

6 3. 3. The smallest value.3.6.5 3. There are then 7 = 35 possible samples.5.5 1.5. almost a whole unit larger due to this single observation. the minimum.5 1.2.2.6 3.5.3 Samples and the Median Samples of size 3 are selected.3. for example (or any other value larger than 6).6 2. 4. 2. For example. 6.7 3.4.7 1. the median remains at 4.4. 7 The median is the value in the middle when the data are arranged in order (or the mean of the two middlemost values if the number of observations is even) .5.5.2. Now if the final observation becomes 17.6. The samples are arranged in order. without replacement. Here that value is 4.7 1.4.3. 5.6. Median Consider then arranging the data in order of magnitude so that the data become 1.4 Sample 1.7 1.4.6.174 Chapter 12 Nonparametric Methods This mean value is quite sensitive to each of the sample values.6 1.7 2.3.3.3.5.7 4.6 2.5. These are shown in Table 12.7 4.5 1. from the set 1.6 4.4.7 3. we explore now the probability distribution of the median.91.7.6 1.3.2. The median is an example of an order statistic—values of the data that are determined by the data when the data are arranged in order of magnitude. Were the samples not Table 12.4 2.5.6 1. So we seek a measure of central tendency in the data that is not so sensitive.2. the mean value becomes 4.4 where the median 3 has been calculated for each sample.5.7 Median 3 3 4 4 4 5 5 6 4 4 4 5 5 6 5 5 6 6 .7 2. While there is no central limit theorem as there is for the mean.5 2. is an order statistic as is the largest value.3 1.3.7 5. or the maximum. 5. 2.6 1.7 3.4.4 1.7 2.2. 6. if the last observation had been 16 rather than 7.5 Median 2 2 2 2 2 3 3 3 3 4 4 4 5 5 6 3 3 Sample 2.7 1. EXAMPLE 12.3.6. 4.4.4.7 2.4 1.4.6.4.6 2.

the points here are points on the parabola y = (x − 1)(7 − x) We need a much larger population and sample to show some characteristics of the distribution of the median. then one observation must be chosen from the k − 1 integers less than k and the remaining observation must be selected from the 7 − k integers greater than k.5 Median 2 3 4 5 6 Probability 5/35 8/35 9/35 8/35 5/35 The probabilities are easy to find. each with the same median.24 0. 4. If the median is k.18 0. 0. 7 3 k = 2.5. so we will consider only the ordered samples. The only way for us Probability .2 0. Table 12. 5.2.14 2 3 4 5 Median 6 Figure 12.16 0. This produces the probability distribution function for the median as shown in Table 12. 6 The expected value of the median is then 2 · 5/35 + 3 · 8/35 + 4 · 9/35 + 5 · 8/35 + 6 · 5/35 = 4 A graph of this probability distribution is shown in Figure 12. and we need a computer for this. each sample would produce 3! = 6 samples. In fact.Order Statistics 175 arranged in order. so the probability the median is k then becomes P(median = k) = k−1 7−k · 1 1 .2 This does not show much pattern. 3.22 0.

. the smallest value in the data set. is of great value in statistical quality control where the range. the maximum.01 0. The points on the graph in Figure 12. 5 106 Probability 4 106 3 106 2 106 1 106 20 40 60 Median 80 100 Figure 12. 0. The range was used in Chapter 11 on statistical process control.005 Probability 20 40 Median 60 80 Figure 12. . 100.02 0. .015 0. .3 are actually points on the fourth-degree polynomial y = (x − 1)(x − 2)(100 − x)(99 − x). Maximum The maximum value in a data set as well as the minimum. in Figure 12. . 2. Figure 12. . 2.3 shows the distribution of the median when samples of size 5 are selected from 1. the difference between the maximum and the minimum. 3. .3 Finally. 100. .4 we show the median in samples of size 7 from the population 1. .4 There is now little doubt that the median becomes normally distributed. 3. is easily computed on the production floor from a sample. We will not pursue the theory of this but instead consider another order statistic.176 Chapter 12 Nonparametric Methods to show the distribution of the median is to select larger samples than in the above example and to do so from larger populations.

we saw that the expected value of the median is 4. we were able to estimate the total number of tanks they had. or a measure of central tendency rather than estimators of the maximum of the population. what is n. the Germans numbered their tanks in the field. When some tanks were captured and their numbers noted. 7. If we refer back to Example 12.4 An Automobile Race We return now to two examples that we introduced in Chapter 4. not a very accurate estimate of the maximum. but it is not. Assuming the cars to be numbered from 1 to n. how many cars are in the race? The question may appear to be a trivial one. But how many more? . We might use the median that we have been discussing or the mean for the samples. which is what we seek. without surprise. 17 and 45. It is intuitively clear that if the race cars we saw were numbered 6. and 45. In World War II. we should use the sample in some way in estimating the value for n. Clearly.6 below shows the probability distribution for the mean of the samples in Example 12.3. Table 12. then there are at least 45 cars in the race.3. A latecomer to an automobile race observes cars numbered 6.6 Mean 2 7/3 8/3 3 10/3 11/3 4 13/3 14/3 5 16/3 17/3 6 Frequency 1 1 2 3 4 4 5 4 4 4 3 1 1 So the expected value of the mean is (1/35)[2 · 1 + (7/3) · 1 + (8/3) · 2 + 3 · 3 + (10/3) · 4 + (11/3) · 4 + 4 · 5 + (13/3) · 4 + (14/3) · 4 + 5 · 3 + (16/3) · 2 + (17/3) · 1 + 6 · 1] = 4 This of course was to be expected from our previous experience with the sample mean and the central limit theorem. Both the median and the mean could be expected. to be estimators of the population mean. Table 12.17.Order Statistics 177 EXAMPLE 12. that is.

178 Chapter 12 Nonparametric Methods Let us look at the probability distribution of the maximum of each sample in Example 12. The average gap would then be (13 + 21 + 8)/3 = 42/3 = 14. the same average gap we found before. we find the probability distribution shown in Table 12. we know that cars 1. . but can not be the best we can do. For one thing. so there are 5 + 10 + 27 = 42 cars that we did not observe. If we do that. 5 (a total of 5 cars) must be on the track as are cars numbered 7. the maximum only achieves the largest value in the population.5 1 6 17 45 n Now we observe that while we saw three cars. Before investigating the properties of this estimator. Now suppose that those cars were numbered 14 and 36. . 7. The three cars we did observe then divide the number line into three parts. the average gap will be 14 regardless of the size of the two numbers as long as they are less than 45. the maximum of the samples is less than that for the population and yet the maximum of the sample must carry more information with it than the other two values. . 2. but these sizes of these values turn out to be irrelevant. . We are left with the question. . . . Table 12. 19.5. . . We made heavy use of the numbers 6 and 17 above. giving the average gap as 14.7. “How should the maximum of the sample be used in estimating the maximum of the population?” In Figure 12. let us note a simple fact. 44 ( a total of 27 cars). One can also see this by moving the numbers for the two numbers less than 45 back and forth along the number line in Figure 12. it would appear sensible that this is also the gap between the sample maximum. . which we will call gaps. . Since this is the average gap. In fact. there must be 42 cars we did not see. and since we observed 3 cars in total. Adding the average gap to the sample maximum gives 45 + 14 = 59 as our estimate for n. The average size of the gaps is then 42/3 = 14.5.3. so with probability 20/35. 16 (a total of 10 cars). n. . the numbers observed on the cars are shown in a number line. 8. with probability 15/35. Figure 12. as well as cars numbered 18. 45. The reason for this is quite simple: since we observed car number 45.7 Maximum 3 4 5 6 7 Frequency 1 3 6 10 15 We find the expected value of the maximum to be 1 [3 · 1 + 4 · 3 + 5 · 6 + 6 · 10 + 7 · 15] = 6 35 This is better than the previous estimators. and the unknown population maximum.
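The "average gap" estimate described above depends only on the largest number observed; the small function below (Python assumed) makes that explicit.

    # The average-gap estimate of n from the observed car numbers.
    def gap_estimate(observed):
        k, m = len(observed), max(observed)
        return m + (m - k) / k      # sample maximum plus the average gap

    print(gap_estimate([6, 17, 45]))    # 59.0
    print(gap_estimate([14, 36, 45]))   # also 59.0; only the maximum matters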

. Then it can be shown.3.Order Statistics 179 So our estimate for n is 45 + (45 − 3)/3 = 59.8 shows the maximum of each sample and our gap estimator. 2. its expected value is the value to be estimated) regardless of the sample size. . M E[m] = m=k m· m−1 k−1 M m = k(M + 1) k+1 So. In that case. . this estimator can be shown to be unbiased (that is. If we do this. Suppose our sample is of size k and is selected from the population 1. Table 12. M and that the sample maximum is m. our estimator is now distributed as shown in Table 12. the gap method asked us to estimate n as 17/3.89 . In fact. . We have denoted by square brackets the greatest integer function. E (k + 1)m − k =M k We still have a slight problem. keeping our sample size at 3 and supposing the largest observation in the sample is m. The expected value of this estimator is then (1/35) · (3 · 1 + 4 · 3 + 6 · 6 + 7 · 10 + 8 · 15) = 6. Table 12.9. that the expected value for m is not quite M. In fact. but not easily.8 Maximum 3 4 5 6 7 Gap estimator 3 13/3 17/3 7 25/3 Frequency 1 3 6 10 15 This gives the expected value of the gap estimator as (1/35) · (3 · 1 + (13/3) · 3 + (17/3) · 6 + 7 · 10 + (25/3) · 15) = 7 the result we desired. we would probably select the integer nearest to 17/3 or 6. obviously not an integer. Now to generalize this. our estimate then becomes m+ m−3 4m − 3 = 3 3 Let us see how this works for the samples in Example 12. for some samples. In our example.

5 Losing Streaks in Baseball The Colorado Rockies. we consider runs of luck (or runs of ill-fortune). 0. For example. 0. 0. we estimate M as 7 again. and so on.10 Length 1 2 3 4 5 6 Frequency 3 4 1 0 0 1 . but that is not at all easy to show. 1.9 Maximum 3 4 5 6 7 [Gap estimation] 3 4 6 7 8 Frequency 1 3 6 10 15 Now if we use the greatest integer function here. This estimator appears to be close to unbiased. This method was actually used in World War II and proved to be surprisingly accurate in solving the German Tank problem.22 Table 12. 0. a computer program (this can also be done with a spreadsheet) gave the following sequence of 20 1’s and 0’s: {0. 1. 1. 0. 1. Runs Finally. 1. a National League baseball team. 1. 1.180 Chapter 12 Nonparametric Methods Table 12. EXAMPLE 12. in this chapter.10 The expected number of runs is then (1/9)(1 · 3 + 2 · 4 + 3 · 1 + 4 · 0 + 5 · 0 + 6 · 1) = 2. followed by six ones. There are nine runs in all and their lengths are given in Table 12. three zeros. 0. 0. 1. 1. 0} The sequence starts with a run of two zeros. lost seven games in a row early in the 2008 season. 1. 0. Is this unusual or can a sequence of seven losses (or wins) be expected during a regular season of 162 games? A run is a series of like events in a sequence of events.

1. 175 150 125 100 75 50 25 4 5 6 7 8 9 10111213141516 17 Runs Figure 12. add one to the number of runs until the sequence is completely scanned. 1. 0. 1. 1. 0. 0. 1. 1. 1. 1. then at the ninth entry. 0. Here is a typical simulated year where 0 represents a loss and 1 represents a win. 1. 0. 1. 1. 0} we find the first time adjacent entries are not equal occurs at the third entry. 0. When they differ. 1. 0. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 6. 1}. 0.) Figure 12. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. it is easiest to count the total number of runs first. 0. 0. 1. This gives one less than the total number of entries since the first run of two zeros is not counted. 0. 0. 1. 1. 1. 0. 1. 0. 1. 0. 1. 1. 1. 1. 1. 0. For example. 0. 1. 1. and so on. 1. 1. 0. 0. 1. 1. So we simulated 1000 sequences of 20 ones and zeros and counted the total number of runs and their lengths. this will produce the vector {2. 0. 1. 2. 0. 1. 1. 1. comparing adjacent entries. 0. 1. 1. (The sum of the entries must be 20. 0. start with a vector of ones of length the total number of runs. 0. 1. 1. 0. 0. then at the thirteenth entry. 1. 0. 1. 1. 1. each with 162 games. 0. 1. 1. 1. 1. 1} Scan the sequence again. 0. To do this. 0. 1. 0. 2. 2. 1. 1. 3. 1} We then counted the number of runs in the 300 samples. 1. 0. 1. 0. 1. and then adding one until the entries differ and continuing though the vector of ones. adding one when adjacent entries are alike and skipping to the next one in the ones vector when adjacent entries differ. 0. 0. 1. 1. 0. 0. 0. scan the sequence from left to right. 1. 1. 1. 0. 1. 0. 1.11 shows the number of runs and Figure 12. 0. To find the lengths of the individual runs.7 shows a graph of these results. 1.6 shows a bar chart of the number of runs from the 1000 samples. 1. 0. 0. 1. In this example. 0. We simulated 300 seasons. 1. In our example. 0. 0. 0. 0. 1. in the sequence {0. 0. 1. we begin with the vector {1. 1. 1. 0. 0. 0. 1. 0. 1. 1. 1. 0.Order Statistics 181 It is not at all clear from this small example that this is a typical situation at all. 1. 0. 1. 0. To write a computer program to do this. Table 12. 1. 0. 0. Frequency . 0. 1. 1. 0. {0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 1.6 Now we return to our baseball example where the Colorado Rockies lost seven games in a row in the 2008 regular baseball season.

the arrangement below of ∗’s contained in cells limited by bars (|’s): || ∗∗ || ∗ ∗ ∗ ||| ∗ | ∗ ∗ ∗∗ | There are eight cells. 85. 81. 85.5 70 75 80 85 90 95 Runs Figure 12. 78. 85. Frequency . 78. 80. 76. 94. 85. 84. 73. 86. 86. 73. 85. only four of which are occupied. 81. 83. 78. 86. 89. 79. 81. 83. 84. 77. 84. 94.8 is a graph of these frequencies. 76. 70. 87. 82. 72. 84. 82.5 5 2. 74.5 10 7. 87. 75. 84. 79. 91. 73. 76. 94. 71. 87. 92. 81. 82. 83. 76. 84.7 Table 12. 82. 84. 68. 86. 99. 73. 74. 72. 87. 75. 89. 79. 77. 80. 80. 96. 76. 82. 77} The mean number of runs is 81. 73. 73. but we need one at each end of the sequence to define the first and last cell. 85. 72.1587. 88. 78. 71. 82. 84. 92. 78. 80. 77. 81. 87. The simulation showed 135 + 49 + 26 + 15 + 55 + 5 + 3 + 2 + 1 = 291 samples having winning or losing streaks of 7 games or more out of a total of 16. There are nine bars in total. 91. 82. 92.11 {74. This example shows the power of the computer in investigating the theory of runs. 82. 88. 75. 91. 75. 93. 74. 76. 82. 80. 72. Some Theory of Runs Consider for a moment. 84. 83. 86. 87. 83. 80. The mathematical theory is beyond our scope here. 81. 80. 78. 72. 85. 84. 83. 93. 80. 91. 86. 88. 75.182 Chapter 12 Nonparametric Methods Table 12. 75. 83. 84. 68. 82. 83. but there are some interesting theoretical results we can find using some simple combinatorial ideas. 83. 79. 319 runs or a probability of 0. 88. 83. 80. 88. 74. 15 12. 76. 90. 91. 69.12 shows the frequency with which runs of a given length occurred. 75. 76. 89. 92. 86. and Figure 12. 80. 81. 85. 90. 90. 90. 86. 79. 73. 83. 73. 72. 81. 86. 83. 83. 87.017832. 84. 81. 81. Consider the bars to be all alike and the stars to be all alike.595 with standard deviation 6. 87. 76. 83. 75. 81. 85. 87. 85. 91. 76. 79. 84. 79. 75. 84. 74. 80. 84. 66. 80. 82. 79. 84. 83. 86. 76. 78. 78. 85. 87. 81. 78. 86. 67. 79. 89. 85. 85.

the cells will become places into which we put symbols for runs. so we pursue that situation.12 Run length 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Frequency 8201 4086 2055 983 506 247 135 49 26 15 5 5 3 2 1 How many different arrangements are there? We can not alter the bars at the ends. of course. | ∗ | ∗∗ | ∗ | ∗∗ | ∗ | ∗ | ∗ | ∗ | There are other possibilities now. we can not have any empty cells.Order Statistics 8000 Frequency 6000 4000 2000 183 Figure 12. for example. we could. place a star in each cell. At present. How many such arrangements are there? First. so we have 7 bars and 10 stars to arrange. But. have this arrangement. If we wish to have each of the cells occupied. This can be done in (7+10)! = 19. 7!2! .8 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Run length Table 12. We have 10 stars in our example and 8 cells. 448 7!10! different ways. This leaves us with 2 stars and 7 bars which can be arranged in (7+2)! = 36 different ways.

So if R is the random variable denoting the number of runs. but two of these are fixed at the ends. Then there must be k runs of the x’s and k runs of the y’s. so R = 2k + 1. place one star in each cell.184 Chapter 12 Nonparametric Methods Now we generalize the situation. This can be done in (n − 1 + r)! = (n − 1)!r! n−1+r n−1 different ways. We must distinguish. leaving no cell empty. Suppose there are n cells and r stars. between an even number of runs and an odd number of runs. then 2 P(R = 2k) = nx − 1 n y − 1 k−1 k−1 nx + n y nx Now consider the case where the number of runs is odd. The n cells are defined by n + 1 bars. +n There are nxnx y different arrangements of the x’s and y’s. This formula can then be used to count the number of runs from two sets of objects. but it can be shown that E(R) = Var(R) = 2nx ny +1 nx + n y 2nx ny (2nx ny − nx − ny ) (nx + ny )2 (nx + ny − 1) . This can be done in ways. It follows that P(R = 2k + 1) = nx − 1 k−1 ny − 1 nx − 1 + k k nx + n y nx ny − 1 k−1 We will not show it. however. so we have n − 1 bars that can be arranged along with the r stars. Then we must ny −1 k−1 fill the k cells for the y’s so that no cell is empty. Now if each cell is to contain at least one star. So we can fill all the cells in nx −1 ny −1 ways. This leaves us with r − n stars to put into n cells. Let us fill the k cells with x’s first. This means that the number of runs of one of the letters is k and the number of runs of the other letter is k + 1. so say there are 2k runs. This can then be done in n−1+r−n n−1 = r−1 n−1 ways. Suppose that there are nx x’s and ny y’s and that we have an even number of runs. This is also the number of ways we k−1 k−1 can fill all the cells if were we to choose the y’s first. say x’s and y’s. This can be done in nx −1 k−1 ways.
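These formulas for the distribution, mean, and variance of the number of runs are easy to put to work. The sketch below (Python assumed) reproduces E(R) = 82 and a standard deviation of about 6.34 for a 162-game season with 81 wins and 81 losses, and, multiplied by C(13, 5) = 1287, it also reproduces the counts of Example 12.6. The factor of 2 in the even case reflects the fact that the sequence can begin with either letter.

    # Runs distribution for nx symbols of one kind and ny of the other.
    from math import comb

    def mean_runs(nx, ny):
        return 2 * nx * ny / (nx + ny) + 1

    def var_runs(nx, ny):
        return 2 * nx * ny * (2 * nx * ny - nx - ny) / ((nx + ny) ** 2 * (nx + ny - 1))

    def prob_runs(r, nx, ny):
        total = comb(nx + ny, nx)
        if r % 2 == 0:                       # even number of runs, r = 2k
            k = r // 2
            return 2 * comb(nx - 1, k - 1) * comb(ny - 1, k - 1) / total
        k = (r - 1) // 2                     # odd number of runs, r = 2k + 1
        return (comb(nx - 1, k - 1) * comb(ny - 1, k)
                + comb(nx - 1, k) * comb(ny - 1, k - 1)) / total

    print(mean_runs(81, 81), var_runs(81, 81) ** 0.5)        # 82.0 and about 6.34
    print([round(1287 * prob_runs(r, 5, 8)) for r in range(2, 12)])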

This is a fairly large example to do by hand since there are 5+8 = 1287 different orders in which the letters could appear.13 Runs 2 3 4 5 6 7 8 9 10 Probability 2 11 56 126 294 280 175 70 21 300 250 Frequency 200 150 100 50 4 6 Runs 8 10 Figure 12.6 Suppose nx = 5 and ny = 8.13 shows all the probabilities of the numbers of runs multiplied by 1287. results that are very close to those found in our simulation.Order Statistics 185 In the baseball example. 344. 5 Table 12. and Figure 12. the formulas give E(R) = 82 and 2 · 81 · 81 · (2 · 81 · 81 − 81 − 81) = 40.9 shows a graph of the resulting probabilities. assuming that nx = ny = 81. Var(R) = EXAMPLE 12.248 = 6. Table 12.9 .248 (81 + 81)2 (81 + 81 − 1) √ so the standard deviation is 40.

that is. we have introduced some nonparametric statistical methods. Figures 12.9799 and standard deviation 6.11 show the respective distributions.10 14 12 Frequency 10 8 6 4 2 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 97 99 Run length Figure 12. there are many other nonparametric tests and the reader is referred to many texts in this area.186 Chapter 12 Nonparametric Methods EXAMPLE 12.10 and 12. We pursued the theory of runs and applied the results to winning and losing streaks in baseball.5 5 2.0966. methods that do not make distributional assumptions. It is possible to separate the baseball seasons for which the number of runs was even from those for which it was odd. We have used the median and the maximum in particular and have used them in statistical testing. There are 151 years with an odd number of runs with mean 81. 15 12.11 It is interesting to compare some of the statistics for the years with an even number of runs with those having an odd number of runs. There are 149 years with an even number of runs with mean 80.5 Frequency 10 7. .5033 and standard deviation 6.7 Even and Odd Runs And now we return to our baseball example.4837. CONCLUSIONS In this chapter.5 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 Run length Figure 12.

8.. and 3 are from population I and the integers 4 and 5 are from population II. Use the “average gap” method for each sample to estimate the maximum of the population and show its probability distribution.. . (c) the sample maximum. 3. Show all the permutations of five integers. (b) the sample median. 3. The text indicates a procedure that can be used with a computer to count the number of runs in a sample. 2. . 2. 4. 2. Suppose the integers 1. Find all the samples of size 4 chosen without replacement from the set 1. .Explorations 187 EXPLORATIONS 1. Choose all the possible samples of size 3 selected without replacement from the set 1. .. . Find the probability distribution of all the possible rank sums. 10. . Produce 200 samples of size 10 (using the symbols 0 and 1) and show the probability distribution of the number of runs. 2. 3. Find the probability distributions of (a) the sample mean.

Chapter 13

Least Squares, Medians, and the Indy 500

CHAPTER OBJECTIVES:
• to show two procedures for approximating bivariate data with straight lines, one of which uses medians
• to find some surprising connections between geometry and data analysis
• to find the least squares regression line without calculus
• to see an interesting use of an elliptic paraboloid
• to show how the equations of straight lines and their intersections can be used in a practical situation
• to use the properties of median lines in triangles that can be used in data analysis

INTRODUCTION

We often summarize a set of data by a single number such as the mean, median, standard deviation, range, and many other measures. We now turn our attention to the analysis of a data set with two variables by an equation. We ask, "Can a bivariate data set be described by an equation?" As an example, consider the following data set that represents a very small study of blood pressure and age.

    Age             35    45    55    65    75
    Blood pressure  114   124   143   158   166

38. To avoid this complication. of course. So let us try some other combinations of straight lines. producing a minimum sum of squares of 65. it is customary to square the residuals before adding them up.1 x 35 45 55 65 75 y 114 124 143 158 166 y 102 114 126 138 150 y−y 12 10 17 20 16 The discrepancies.1. say yi . ˆ Table 13. The minimum in this case. It is clearly nearly impossible.1. call them yi . y − y. as we shall soon see. ˆ i=1 One could continue in this way. For example. and various choices for α and β. and the observed values. in Table 13. let us look at a graph of the data. happen to be all positive in this case. when we consider subsequently an Blood pressure .but we do not know if that can be improved upon.2x.2 shows some values for the sum of squares. occurs when α = 65. the positive residuals will offset the negative residuals.Introduction 189 We seek here an equation that approximates and summarizes the data. How well does this line fit the data? Let us consider the predictions this line makes. where y is blood pressure and x is age. Table 13. or what are commonly called errors or residuals. Although the details of the calculations have not been shown. We might guess some straight lines that might fit the data well. but that is not always so. even with a computer. This is shown in Figure 13. so adding up the residuals can be quite misleading. SS = 5 (y − y)2 . we might try the straight line y = 60 + 1. But trial and error is a very inefficient way to determine the minimum sum of squares and is feasible in this case because the data set consists of only five data points. suppose the line is of the form yi = α + βxi .1. 160 150 140 130 120 40 50 Age 60 70 Figure 13.1 and β = 1. First. trying various combinations of α and β until a minimum is reached. First. So how are we to measure the adequacy of this straight line approximation or fit? Sometimes. If we do that in this case we get 1189.1 It would appear that the data could be well approximated by a straight line.We have shown these values and the ˆ discrepancies.

3 1.5 1. b 55 a 65 75 85 95 3000 SS 30 Figure 13.4 1. that is. the intersections of the surface with vertical planes are parabolas and the intersections of the surface with horizontal planes are ellipses.25 212 637 example consisting of all the winning speeds at the Indianapolis 500-mile race.2 α 55 60 65 70 75 β 1 1. It can be shown that the surface is an elliptic paraboloid. although it is graphically difficult to determine the exact values of α and β that produce that minimum. it is possible to examine a surface showing the sum of squares (SS) as a function of α and β. a data set consisting of 91 data points.1 1. We now show an 1.2 1.6 1.190 Chapter 13 Least Squares. A graph of SS = n (yi − α − βxi )2 is shown in i=1 Figure 13. and the Indy 500 Table 13.2. Medians. It is clear that SS does reach a minimum.3 1.4 SS 4981 1189 139.2 .4 1.2 1. In our small case.

yi − α − βxi . The values i=1 ˆ ˆ that minimize this sum of squares are denoted by α and β. We now show how to determine the values of α and β that minimize the sum of squares. LEAST SQUARES Minimizing the sum of squares of the deviations of the predicted values from the observed values is known as the principle of least squares. So we hold β fixed and write SS as a function of α alone: n n n n 2 yi + β 2 i=1 i=1 2 xi − 2β i=1 n SS = nα2 − 2α i=1 yi − β i=1 xi + xi yi . We now look at the situation in general.2 and the above equation. i=1 Suppose that we have a set of data {xi . α + βxi . Our straight line is yi = α + βxi and our sum of squares is SS = n (yi − α − βxi )2 . the intersection is a parabola that has a minimum value. The least squares line then estimates the value of y for a particular value of x ˆ ˆ as yi = α + βxi . the yi . · · · . and the values predicted by the line. we find that if we hold β fixed. So the principle of least squares says that we estimate the intercept ˆ (α) and the slope (β) of the line by those values that minimize the sum of squares of the residuals. the differences between the observed values. Principle of Least Squares Estimate α and β by those values that minimize SS = n (yi − α − βxi )2 . We begin with n SS = i=1 (yi − α − βxi )2 which can be written as n n 2 yi + nα2 + β2 i=1 i=1 2 xi − 2α i=1 n n n SS = yi − 2β i=1 xi yi + 2αβ i=1 xi From Figure 13.Least Squares 191 algebraic method for determining the values of α and β that minimize the sum of squares. yi } where i = 1. n.

and adding and subtracting n(y − βx)2 (thus completing the square). Now we find an estimate for β. using the above result ˆ for α.192 Chapter 13 Least Squares. we can write n SS = n[α2 − 2α(y − βx) + (y − βx)2 ] − n(y − βx)2 + i=1 n n 2 xi − 2β i=1 i=1 2 yi +β2 xi yi or n n 2 yi i=1 n 2 xi i=1 SS = n[α − (y − βx)] − n(y − βx) + 2 2 +β 2 − 2β i=1 xi yi Now since β is fixed and n is positive. with a similar i=1 result for x. the minimum value for SS occurs when ˆ ˆ ˆ α = y −βx. and the Indy 500 Now factor out the factor of n and noting that n yi /n = y. y = 141 and x = 55. We can write. Hold α fixed. Medians. n n SS = i=1 (yi − (y − βx) − βxi )2 = i=1 n n [(yi − y) − β(xi − x)]2 n = β2 i=1 (xi − x)2 − 2β i=1 (xi − x)(yi − y) + i=1 (yi − y)2 Now factor out n n i=1 (xi − x)2 and complete the square to find n i=1 (xi − x)(yi − y) n 2 i=1 (xi − x) 2 n i=1 (xi − x)(yi − y) n 2 i=1 (xi − x) n 2 SS = i=1 (xi − x)2 β2 − 2 n + − i=1 (xi − x)2 n i=1 (xi − x)(yi − y) n 2 i=1 (xi − x) + i=1 (yi − y)2 . However. we now have a general form for our estimate of α. giving the estimate for α ˆ as 141 − 55β. Note that in our example.

INFLUENTIAL OBSERVATIONS It turns out that the principle of least squares does not treat all the observations equally. say xi . Many statistical computer programs (such as Minitab) produce the least squares line from a data set. We investigate to see why this is so. yi is ˆ ˆ ˆ yi = α + βxi ˆ Now we have equations that can be used with any data set. the predicted value for yi .1 ˆ ˆ The expressions for α and β are called least squares estimates and the straight line they produce is called the least squares regression line. although a computer is of great value for a large data set. For a given value of x.Influential Observations 193 which can be written as n SS = i=1 (xi − x)2 β − n 2 n i=1 (xi − x)(yi − y) n 2 i=1 (xi − x) n i=1 (xi − x)(yi − y) n 2 i=1 (xi − x) 2 n − i=1 (xi − x)2 + i=1 (yi − y)2 showing that the minimum value of SS is attained when ˆ β= n i=1 (xi − x)(yi − y) n 2 i=1 (xi − x) n i=1 (xi − x)(yi − y) n 2 i=1 (xi − x) We have found that the minimum value of SS is achieved when ˆ β= and ˆ ˆ α = y − βx We have written these as estimates since these were determined using the principle of least squares.38(55) = 65. ˆ Expanding the expression above for β. we find 5(40155) − (275)(705) ˆ β= = 1. we find that it can be written as ˆ β= n n i=1 xi yi − 2 n n xi i=1 n n i=1 xi i=1 yi 2 n − i=1 xi In our example. .38 5(16125) − (275)2 and ˆ α = 141 − 1.

This is why we pointed out that least squares does not treat all the data points equally. ˆ Now β = n ai (yi − y). Notice that the values for ai highly depend upon the value for xi − x. and the Indy 500 Since n i=1 (xi − x)2 is constant for a given data set and n i=1 (xi − x)(yi − y) n 2 i=1 (xi − x) ˆ β= we can write n ˆ β= i=1 (xi − x)(yi − y) n 2 i=1 (xi − x) ˆ − x)2 . the larger or smaller is ai . n ˆ β= i=1 ai yi where ai = (xi − x) . . and this can be written as i=1 ˆ = n ai yi − y n ai . We also note now that if for ˆ some point xi = x. But this expression for β can be simplified even further. called the median–median line. Medians. is a weighted average of the y values.194 Chapter 13 Least Squares. n 2 i=1 (xi − x) ˆ This shows that the least squares estimate for the slope. This is a fact to which we will return when we consider another straight line to fit a data set. so it appears that β is a weighted ˆ sum of the deviations yi − y. the farther xi is from x. then that point has absolutely no influence whatsoever on β. Now notice that β i=1 i=1 n n ai = i=1 n i=1 (xi i=1 (xi − x) = n 2 i=1 (xi − x) 1 n 2 i=1 (xi − x) n (xi − x) = 0 i=1 since So − x) = 0. our formula for β becomes (xi − x) n 2 i=1 (xi − x) So assuming ai = (xi − x)/ n n i=1 (xi ˆ β= i=1 ai (yi − y) where ai = ˆ Now the value of ai depends on the value of xi . β.
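The estimates found above for the blood pressure data, β̂ = 1.38 and α̂ = 65.1, can be checked in a few lines; the sketch below assumes Python, though Minitab or any spreadsheet gives the same result.

    # Least squares slope and intercept for the blood pressure data.
    ages = [35, 45, 55, 65, 75]
    bps  = [114, 124, 143, 158, 166]

    xbar = sum(ages) / len(ages)
    ybar = sum(bps) / len(bps)
    beta_hat = (sum((x - xbar) * (y - ybar) for x, y in zip(ages, bps))
                / sum((x - xbar) ** 2 for x in ages))
    alpha_hat = ybar - beta_hat * xbar
    print(beta_hat, alpha_hat)    # 1.38 and 65.1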

except 1917 and 1918 and 1942–1946 when the race was suspended due to World War I and World War II. the winning speeds at the Indianapolis 500-mile automobile race conducted every spring.3. 38. As we shall see. The first race occurred in 1911 and since then it has been held every year.38(55) = 65.The Indy 500 195 To continue with our example. The data are provided in Table 13. 180 160 Speed 140 120 100 80 1920 1940 1960 Year 1980 2000 Figure 13.3. 1 50 ˆ β= i=1 ai yi = − · 114 − 1 100 · 124 + 0 · 143 + 1 100 · 158 + 1 · 166 50 = 1. as we previously noted.1 as before. is shown in Figure 13. THE INDY 500 We now turn to a much larger data set. produced by the computer algebra program Mathematica.3 . respectively. ˆ ˆ Then α = y − βx = 141 − 1. dealing with it presents some practical as well as some mathematical difficulties. we find that following values for ai : a1 = n i=1 (xi − x)2 = 1000. unlike our little blood pressure example. We now present a fairly large data set. A graph of the data. giving the 1 (35 − 55) =− 1000 50 (45 − 55) 1 a2 = =− 1000 100 (55 − 55) a3 = =0 1000 (65 − 55) 1 a4 = = 1000 100 (75 − 55) 1 a5 = = 1000 50 Note that So n n i=1 ai = 0. namely.

144 104.457 134.857 138. one can notice the years in which the race was not held and the fact that the data appear to be linear until 1970 or so when the winning speeds appear to become quite variable and scattered.908574 × Year.001 * * 88.954 98.162 104.117 163. Medians.767 139.499 156.244 128.956 145.196 Chapter 13 Least Squares.331 161.175 144. ˆ ˆ The calculations for determining α and β are quite formidable for this (and almost any other real) data set.084 162.574 166.3 Year 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 Speed 74.213 Year 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 Speed 148.618 89.589 149.629 104.050 88.363 158.581 185.069 113.904 97.933 82.293 143.899 142.477 157.872 153.213 128.982 170. We will consider the data in three parts.616 147.922 128.3.490 135.291 138.277 115.338 119.602 78.234 101.200 115.085 151.791 135.840 84.137 147. In Figure 13.722 162.585 100.155 153.603 157.240 109.867 155.882 156.814 121.840 128.036 158.719 75.601 133.350 150.827 145.13 + 0.035 114.735 162.686 144. using the . and the Indy 500 Table 13.207 160.317 151.749 157.621 94.002 126. The calculations and many of the graphs in this chapter were made by the statistical program Minitab or the computer algebra system Mathematica.484 90.127 95.482 97.448 96.029 162.981 176.862 139.740 130.176 167.545 99.580 117.774 143.327 124.809 167.962 159.130 140.562 The least squares regression line is Speed = −1649.612 152.207 152.863 106.474 89.607 141.518 157.117 * * Year 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 Speed * * * 116.725 161.

In the following table.4.95 Variance 32.0 154.4 Years 1911–1916 1919–1941 1947–2008 Mean 80. If the experimenter wishes to have a measure of the adequacy of the fit of the model.22 80.78 214. on the boundary of a semicircle and not get a fit that was at all satisfactory. called the median–median line. Some of the statistics for the data are shown in Table 13. xi 35 45 55 65 75 yi 114 124 143 158 166 yi ˆ 113.84 148. A TEST FOR LINEARITY: THE ANALYSIS OF VARIANCE It is interesting to fit a straight line to a data set. but the line may or may not fit the data well. let y denote ˆ the predicted value for y. If we do this for our blood pressure data set. So we begin our discussion on whether the line fits the data well or not. is to compare the observed values of y with the values predicted by the line. for example.8 168. Table 13. no matter whether it is truly linear or not. then a test is necessary.45 149. A good idea.A Test for Linearity: The Analysis of Variance 197 partitions the war years provide.6 If the predicted values are sufficient for the experimenter.63 Clearly.4 127.2 141. then the model is a good one and no further investigation is really necessary. we find the following values.60 100. Is our regression line a good approximation to the data? The answer depends on the accuracy one desires in the line. We will return to this discussion later. always. So we seek a test or procedure that will give us some information on how well the line fits the data. One could take points. . We will also use these partitions when we consider another line to fit to the data. the speeds during 1947–2008 have not only increased the mean winning speed but also had a large influence on the variance of the speeds.93 101.48 Median 80. We give here what is commonly known as the analysis of variance. The principle of least squares can be used with any data set.

since it measures the discrepancies between the observed and predicted values of y: n SSE = i=1 (yi − yi )2 ˆ For our data set.198 Chapter 13 Least Squares. this is SST = (114−141)2 + (124−141)2 + (143−141)2 + (158−141)2 + (166−141)2 = 1936. 4 We notice here that these sums of squares add together as 1904. or SSE.4−141)2 +(127. this is SSR = (113.2−141)2 +(141−141)2 +(154. Finally.6−141)2 = 1904. . and the mean ˆ of the y values. We usually call this the sum of squares due to error.4)2 +(124−127. and the Indy 500 First. since y = 141. consider SSR or what we call the sum of squares due to regression.6 = 1936 or SST = SSR + SSE. This is the sum of squares of the deviations between the predicted values.4 + 31. or SST. y : n SST = i=1 (yi − y)2 For our data set. we considered the sum of squares of the deviations of the observed y values and their predicted values from the line.6)2 = 31.8−141)2 +(168.8)2 +(166−168. this is SSE = (114−113. This is the sum of squares of the deviations of the y values from their mean. Medians. y: n SSR = i=1 (yi − y)2 ˆ For our data set.2)2 +(143−141)2 +(158−154. We simply called this quantity SS above. yi .0 In using the principle of least squares. consider what we called the total sum of squares. 6 This is of course the quantity we wished to minimize when we used the principle of least squares and this is its minimum value.

this is always true! To prove this. ˆ β= and ˆ ˆ α = y − βx ˆ ˆ ˆ ˆ ˆ ˆ Now yi = α + βxi = y − βx + βxi = y + β(xi − x). consider the identity yi − y = (yi − yi ) + (yi − y) ˆ ˆ Now square both sides and sum over all the values giving n n n n (yi − y)2 = i=1 i=1 (yi − yi )2 + 2 ˆ i−1 (yi − yi )(yi − y) + ˆ ˆ i=1 (yi − y)2 ˆ Note for the least squares line.A Test for Linearity: The Analysis of Variance 199 Note that SSR is a large portion of SST. so yi − y = β(xi − x) and ˆ ˆ the middle term above (ignoring the 2) is n n n i=1 (xi − x)(yi − y) n 2 i=1 (xi − x) (yi − yi )(yi − y) = ˆ ˆ i−1 i−1 ˆ ˆ [(yi − y) − β(xi − x)]β(xi − x) n n ˆ =β i−1 ˆ (yi − y)(xi − x) − β2 i−1 (xi − x)2 =0 since ˆ β= So. This indicates that the least squares line is a good fit for the data. . n n n n n i=1 (xi − x)(yi − y) n 2 i=1 (xi − x) SST = i=1 (yi − y)2 = i=1 (yi − yi )2 + 2 ˆ i−1 (yi − yi )(yi − y) + ˆ ˆ i=1 (yi − y)2 ˆ becomes n n n (yi − y)2 = i=1 i=1 (yi − yi )2 + ˆ i=1 (yi − y)2 ˆ n 2 We could say that i=1 (yi − y) represents the total sum of squares of the observations around their mean value. Now the identity SST = SSR + SSE in this case is not a coincidence. We denote this by SST.

5 and the least squares regression lines are provided in Table 13.6. we i=1 expect the sum of squares due to error. The results from the analysis of variance are provided in Table 13. It is interesting to compare the prewar years (1911–1941) with the postwar years (1947–2008). Finally. and the Indy 500 We could also say that n (yi − yi )2 represents the sum of squares due to error ˆ i=1 or SSE. This partition of the total sum of squares is often called an analysis of variance partition. although it has little to do with a variance. to be a large proportion of ˆ i=1 the total sum of squares n (yi − y)2 . but unfortunately this is not so. For our data. Table 13. n n n (yi − y)2 = i=1 i=1 (yi − yi )2 + ˆ i=1 (yi − y)2 ˆ can be abbreviated as SST = SSE + SSR.4/1936 = 0.) Since SSR is only a part of SST. Its square root.9918. is called the correlation coefficient.200 Chapter 13 Least Squares. we expect SSR = n (yi − y)2 to i=1 ˆ be large since we expect the deviations yi − y to be a large proportion of the total ˆ sum of squares n (yi − y)2 . There are data sets for which r is large but the fit is poor and there are data sets for which r is small and the fit is very good. if the data are actually not linear. On the contrary. as we have seen in our example previously. So the identity above. Medians. so r = 0.3802 . it follows that 0 ≤ r2 ≤ 1 so that −1 ≤ r ≤ 1 It would be nice if this number by itself would provide an accurate test for the adequacy of the regression. 98368. i=1 This suggests that we look at the ratio SSR/SST. which is called the coefficient of determination and is denoted by r2 .5 Years 1911–1941 1947–2008 SSE 8114 293 SSR 4978 3777 SST 13092 4020 r2 0. we could say that n (yi − y)2 represents the sum of squares due to i=1 ˆ regression or SSR. (The positive square root of r2 is used if y increases with increasing x while the negative square root of r 2 is used if y decreases with increasing x. r2 = 1904.9396 0. This is the quantity that the principle of least squares seeks to minimize. If the data show a strong linear relationship. n (yi − yi )2 . r.

102 9 7.Nonlinear Models Table 13.092 4 7. Consider the following data set.4 From the values for r 2 .5 persuades us that a linear fit is certainly not appropriate! The lesson here is always draw a graph! NONLINEAR MODELS The procedures used in simple linear regression can be applied to a variety of nonlinear models by making appropriate transformations of the data.652 8 6.212 10 9.59. which might be regarded as fair but certainly not a large value.99 + 1. x y x y 13 17. This could be done in the example above. The data here have not been presented in increasing values for x.822 16 29. so it is difficult to tell whether the regression is linear or not. we conclude that the linear fit is very good for the years 1911–1941 and that the fit is somewhat less good for the years 1947–2008.6 Years 1911–1941 1947–2008 Least Squares Line Speed = −2357.262 14 21.372 15 25. A Caution Conclusions based on the value of r 2 alone can be very misleading.542 12 14.702 The analysis of variance gives r 2 = 0. For example.500697 × Year 201 30 20 Residual 10 1940 –10 –20 1960 Year 1980 2000 Figure 13.651 + 0.27454 × Year Speed = −841. The graph in Figure 13. where the relationship is clearly quadratic. The graph in Figure 13.382 7 6.2 6 6.932 12 0.4 shows the scatter plot of all the data and the least squares line.500 11 11. if we wish .

5. $65. 200. . 850. b 2. and the median would remain at $21. suppose salaries in a small manufacturing plant are as follows: $12. 4. yi = a + b log xi uses the data set {log xi . Taking logarithms gives log yi = log a + b log xi . 1/yi }. 500. but four out of five workers receive less than this average salary! The mean is highly influenced by the largest salary. $13. we can take logarithms to find log yi = log a + bxi .202 Chapter 13 Least Squares. yi = 1/(a + b · 10−xi ) can be transformed into 1/yi = a + b · 10−xi . We would then take the antilog to estimate a. $21.900 while the mean might be heavily influenced. The linear model is then fitted to the data set {10−xi . log yi }.900.410. The median salary (the salary in the middle or the mean of the two middlemost salaries when the salaries are arranged in order) is $21. Logarithms give log yi = log a + b/xi . so the linear model is fitted to the data set {log xi . The model yi = xi /(axi − b) can be transformed into 1/yi = a − b/xi . 600 The mean of these salaries is $27.900. yi = a · 10b/xi . For example. THE MEDIAN–MEDIAN LINE Data sets are often greatly influenced by very large or very small observations. and the Indy 500 30 25 20 y 15 10 5 6 8 10 x 12 14 16 Figure 13. yi }. yi = a · xi . log yi }. so we fit our straight line to the data set {1/xi . Here are some other models. and the appropriate transformations: 1. Medians. including the quadratic relationship mentioned above. 900. 3. we explored this to some extent in Chapter 12. Then our simple linear regression procedure would give us estimates of b and log a.5 to fit a model of the form yi = a · 10bxi to a data set. $26. We could even replace the two highest salaries with salaries equal to or greater than $21.

We begin with a general set of data {xi . If the triangle is determined by the points P1 (x1 .6 P1 Now for some geometric facts. Let us prove these facts before making use of them in determining the median–median line. P1 contains the smallest x-values. This line is much easier to calculate than the least squares line and enjoys some surprising connections with geometry.7 shows these medians and their meeting point. y1 ). and P3 (x3 . P2 (x2 . The concept is that since P1 and P3 contain 2/3 of the data and P2 contains 1/3 of the data. The medians of a triangle meet at a point that is 1/3 of the distance from the baseline. y) where x = 1/3(x1 + x2 + x3 ) and y = 1/3(y1 + y2 + y3 ). yi } for i = 1. the median is called a robust statistic since it is not influenced by extremely large (or small) values.The Median–Median Line 203 For this reason.6. the line should be moved 1/3 of the distance from the baseline toward P2 . y3 ). y2 ). and P3 contains the largest x-values.. These points are then plotted as the vertices of a triangle. . The general situation is shown in Figure 13. P3 P2 Median–median line Figure 13.) It is also true that the meeting point of the medians is 1/3 of the distance from the baseline. not the data set in general. n.. . P2 contains the middle x-values.. We now describe the median–median line. then the medians meet at the point (x. The median–median line is determined by drawing a line parallel to the baseline (the line joining P1 and P3 ) and at 1/3 of the distance from the baseline toward P2 . first divide the data into three parts (which usually contain roughly the same number of data points). To find the median– median line. (The means here refer to the coordinates of the median points. The median can also be used in regression. Figure 13. In each part. determine the median of the x-values and the median of the y-values (this will rarely be one of the data points).

y) is 1/3 of the distance from the baseline toward the vertex. and the Indy 500 P2 P3 Figure 13.x x1 + x 2 x 1 y2 y2 y=− + . This simplifies the calculations greatly and does not limit the generality of our arguments. with the leftmost vertex of the triangle at the point (0.-. (y2 /3)). y) is the point ((x1 + x2 )/3. Now we must show that the medians meet at that point.8 shows the situation in general.y2) m3 -.-.-> (x2/2.-. (x2.8 (0.0) First the point (x.> (x1.y2/2) m1-. .x x1 − 2x2 x2 − 2x1 y= y=− x 1 y2 2y2 + . (y2 /3)) lies on each of the lines and hence is the point of intersection of these median lines.0) Figure 13. We have taken the baseline along the x-axis. so the point (x.-.204 Chapter 13 Least Squares.-. Medians.y2/2) ((x1+x2)/2.7 P1 Figure 13. P2. 0).x 2x2 − x1 2x2 − x1 It is easy to verify that the point ((x1 + x2 )/3.-> m2-. The equations of the three medians shown are as follows: m1: m2: m3: y2 .

then the least squares line and the median–median line are identical. suggested by the facts above.y2) (x1/2.0) . The intercept of the median–median line is the average of twice the first intercept plus the second intercept (and its slope is the slope of the line joining P1 and P3 ). When Are the Lines Identical? It turns out that if x2 = x.y2/3) Figure 13. Method 1 is by far the easiest of the three methods although all are valid.9 (0. 3. y). find the median–median line using the slope and a point on the line. consider the diagram in Figure 13. They are as follows: 1. Determine the intercept of the line joining P1 and P3 and the intercept of the line through P2 with the slope of the line through P1 and P3 .The Median–Median Line 205 These facts suggest two ways to determine the equation of the median–median line. Methods 2 and 3 are probably useful only if one wants to practice finding equations of lines and doing some algebra! Method 2 is simply doing the proof above with actual data. First. 2. Finally. determine the slope of the line joining P1 and P3 . This is the slope of the median–median line. The median–median line can be found using the slope and a point on the line. Second. y2 ) since 0 + x1 + 3 x1 2 = x1 2 x= (x1/2.9 where we have taken the base of the triangle along the x-axis and the vertex of the triangle at (x1 /2. To show this.0) (x1.Then determine the equations of two of the medians and solve them simultaneously to determine the point of intersection. determine the point (x. To these two methods. We will not prove that method 3 is valid but the proof is fairly easy. we add a third method. Determine the slope of the line joining P1 and P3 .

Table 13. but if these values are close. 149. y) and has slope 0.7 and Figure 13. It is not frequent that x2 = x.60 100.10 P11913.206 Chapter 13 Least Squares.5 Median (speed) 80. Medians. So the least squares line is y = y2 /3.5 1930 1977.95 P31976.45 Median–median line--> Figure 13. Also. We admit that the three periods of data are far from equal in size.6 .7 Period 1911–1916 1919–1941 1947–2008 Median (years) 1913. we expect the least squares line and the median–median line to also be close. and the Indy 500 Now ˆ β= Here n n i=1 (xi − x)(yi − y) n 2 i=1 (xi − x) (xi − x)(yi − y) = 0 − i=1 x1 2 x1 2 0− y2 x1 x1 + − 3 2 2 y2 3 =0 y2 − y2 3 + x1 − 0− ˆ ˆ ˆ so β = 0. The reason for using these speeds as an example is that the data are divided naturally into three parts due to the fact that the race was not held during World War I (1917 and 1918) and World War II (1942–1946). The data have been given above by year.10).5.45 149. α = y − βx = y = y2 /3. We now proceed to an example using the Indianapolis 500-mile race winning speeds.5.95 P21930. 100. But this is also the median–median line since the median–median line passes through (x. 80. we now show the median points for each of the three time periods (Table 13.

The Median–Median Line

207

Determining the Median–Median Line
The equation of the median–median line will now be determined by each of the three methods described above. r Method 1 The slope of the median–median line is the slope of the line joining P1 and P3 . 149.95 − 80.6 = 1. 08359 1977.5 − 1913.5 The point (x, y) is the point 1913.5+1930+1977.5 80.6+100.45+149.95 , 3 3 So the equation of the median–median line is y − 110.333 = 1. 08359 x − 1940.333 This can be simplified to y = −1992.19 + 1. 08359 x, where y is the speed and x is the year. r Method 2 We determine the equations of two of the median lines and show that they intersect at the point (x, y). The line from P1 to the midpoint of the line joining P2 and P3 (the point 1973.75, 125.2) is y = 1.10807x−2039.69. The line from P3 to the midpoint of the line joining P1 and P2 (the point 1921.75, 90.525) is y = 1.06592x − 1957.91. These lines intersect at (1940.33, 110.33), thus producing the same median–median line as in method 1. r Method 3 Here we find the intercept of the line joining P1 and P3. This is easily found to be −1992.85. Then the intercept of the line through P2 with slope of the line joining P1 and P3 (1.08359) is −1990.88. Then weighting the first intercept twice as much as the second intercept, we find the intercept for the median– median line to be 2(−1992.85) + (−1990.88) = −1992.19 3 So again we find the same median–median line. The least squares line for the three median points is y = −2067.73 + 1.12082x. These lines appear to be somewhat different, as shown in Figure 13.11. They meet at the point (1958, 129) approximately. It is difficult to compare these lines since the analysis of variance partition
n n n

= (1940.333, 110.333)

(yi − y) =
2 i=1 i=1

(yi − yi ) + ˆ
2 i=1

(yi − y)2 ˆ

208

Chapter 13

Least Squares, Medians, and the Indy 500
180 160 Speed 140 120 100 80 1920 1940

Figure 13.11

1960 Year

1980

2000

or SST = SSE + SSR no longer holds, because the predictions are no longer those given by least squares. For the least squares line, we find

SST Least squares line 68,428

SSE 12323

SSR 56105

r2 0.819918

It is true, however, that
n n n n

(yi − y)2 =
i=1 i=1

(yi − yi )2 + 2 ˆ
i−1

(yi − yi )(yi − y) + ˆ ˆ
i=1

(yi − y)2 ˆ

The total sum of squares remains at n (yi − y)2 = 68, 428, but the middle i=1 term is no longer zero. We find, in fact, that in this case n ˆ 2 2 n (yi − yi )(yi − y) = −25040, ˆ ˆ and i=1 (yi − yi ) = 75933, i−1 n (yi − y)2 = 17535, so the middle term has considerable influence. ˆ i=1 There are huge residuals from the predictions using either line, especially in the later years. However, Figure 13.3 shows that speeds become very variable and apparently deviate greatly from a possible straight line relationship during 1911–1970. We can calculate all the residuals from both the median–median line and the least squares line. Plots of these are difficult to compare. We show these in Figures 13.12 and 13.13. It is not clear what causes this deviance from a straight line in these years, but in 1972 wings were allowed on the cars, making the aerodynamic design of the car of greater importance than the power of the car. In 1974, the amount of gasoline a car could carry was limited, producing more pit stops; practice time was also reduced.

Analysis for years 1911–1969
30 20 Residual 10 1940 –10 –20 –30 1960 1980 Year 2000

209

Figure 13.12

Median–median line residuals.

30 20 Residual 10 1940 –10 –20 1960 Year 1980 2000

Figure 13.13

Least squares residuals.

Both of these measures were taken to preserve fuel. For almost all the races, some time is spent under the yellow flag. At present, all cars must follow a pace car and are not allowed to increase their speed or pass other cars. Since the time under the yellow flag is variable, this is no doubt a cause of some of the variability of the speeds in the later years. It is possible to account for the time spent on the race under the yellow flag, but that discussion is beyond our scope here. The interested reader should consult a reference on the analysis of covariance. These considerations prompt an examination of the speeds from the early years only. We have selected the period 1911–1969.

ANALYSIS FOR YEARS 1911–1969
For these years, we find the least squares regression line to be Speed = −2315.89 + 1.2544× Year and r 2 = 0.980022, a remarkable fit. For the median–median line, we use the points (1913.5, 80.60) , (1930, 100.45), and (1958, 135.601), the last point being the median point for the years 1947 through 2008. We find the median–median line to be y = 1.23598x − 2284.63. The lines are very closely parallel, but have slightly different intercepts. Predictions based upon them will be very close. These lines are shown in Figure 13.14.

210

Chapter 13

Least Squares, Medians, and the Indy 500
200 180 Speed 160 140 120 100 1940 1960 Year 1980 2000

Figure 13.14

CONCLUSIONS
The data from the winning speeds at the Indianapolis 500-mile race provide a fairly realistic exercise when one is confronted with a genuine set of data. Things rarely work out as well as they do with textbook cases of arranged or altered data sets. We find that in this case, the median–median line is a fine approximation of the data for the early years of the data and that the least squares line is also fine for the early years of the data; neither is acceptable for the later years when we speculate that alterations in the aerodynamics of the cars and time spent under the yellow flag produce speeds that vary considerably from a straight line prediction.

EXPLORATIONS
1. Using the three median points for the Indy 500 data, show that method 3 is a valid procedure for finding the median–median line. 2. Using the three median points for the Indy 500 data, find the least squares line for these points. 3. Find the analysis of variance partition for the least squares line in Exploration 2. 4. Analyze the Indy 500 data for 1998–2008 by finding both the median–median and the least squares lines. Show the partitions of the total sum of squares, SST, in each case.

Chapter

14

Sampling
CHAPTER OBJECTIVES:

r r r r r r

to show some properties of simple random sampling to introduce stratified sampling to find some properties of stratified sampling to see how proportional allocation works to discuss optimal allocation to find some properties of proportional and optimal allocation.

One of the primary reasons that statistics has become of great importance in science and engineering is the knowledge we now have concerning sampling and the conclusions that can be drawn from samples. It is perhaps a curious and counterintuitive fact that knowledge about a population or group can be found with great accuracy by examining a sample—only part of the population or group. Almost all introductory courses in statistics discuss only simple random sampling. In simple random sampling every item in the population is given an equal chance of occurring in the sample, so every item in the population is treated exactly equally. It may come as a surprise to learn that simple random sampling can often be improved upon; that is, other sampling procedures may well be more efficient in providing information about the population from which the sample is selected. In these procedures, not all the sampled items are treated equally! In addition to simple random sampling, we will discuss stratified sampling and both proportional allocation and optimal allocation within stratified sampling. We start with a very small example so that ideas become clear.

A Probability and Statistics Companion, John J. Kinney Copyright © 2009 by John Wiley & Sons, Inc.

211

We emphasize from the beginning that we assume we know the entire population.1. so each of the schools is treated equally in the sampling process. SIMPLE RANDOM SAMPLING In simple random sampling. in this case a school.2. we find the mean of each of the samples to find the 56 mean enrollments given in Table 14. We first consider simple random sampling. The mean of these values is μx = 1345. There are 8 = 56 simple random samples. These are shown 5 in Table 14. We cannot discuss sample size here.1 High and Middle Schools An urban school district is interested in discovering some characteristics of some of its high and middle schools. Call these values of x. The details of the population are given in Table 14. x. but refer the reader to many technical works on the subject. then we have a simple random sample. In practice.375 with standard deviation σx = 141. is an unbiased estimator of the true population mean.1 School enrollment 1667 2002 1493 1802 1535 731 834 699 School type High High High High High Middle Middle Middle The mean enrollment is μx = 1345. we would never know this (or else sampling is at best an idle exercise).375 and the standard deviation is σx = 481. We want to show how these statistics can be estimated by taking a sample of the schools. the sample size can be surprisingly small in relative to the size of the population. but this serves our illustrative purposes. however. . We note here that the sample size 5 is relatively very large compared to the size of the population. 8.078. each item in the population. So the sample mean. The subscript x is used to distinguish the population from the samples we will select. In many cases. We decide to choose five of the eight schools as a random sample.We see that the mean of the means is the mean of the population. Since we are interested in the mean enrollment.3. If we give each of the eight schools equal probability of occurring in the sample. 7 Note that each of the schools occurs in 4 = 35 of the random samples. Table 14.212 Chapter 14 Sampling EXAMPLE 14. has an equal chance of appearing in the sample as any of the other schools.8737.

{1667. 731. {1667. 834. 1802. 1493. 731. 1380. 834.4. 699}. 699}.2. 1802. So we write E(x) = μx Also. 1547. 1802. 834} {1667. 1493. Despite the fact that the population is flat. 1312. 1535. 1493. 699}. 1305. 1493.6.4.6. 2 2 From the central limit theorem. {2002. 1535. 1120. 731}. 1802.6. 1493. 699}. 1093. 1353. {1667. 1535. 1535. 731.8. 1802. 1493.8. 1535.6. 834. {1667. 1535. 699}. 699}. {1493. 834.2. 1278. the graph of the means begins to resemble the normal curve. each enrollment occurs exactly once. 1326.2. 731. 731. 1493.8. {1667. These values of x are best seen in the graph shown in Figure 14. 1802. 1535. 731. 731}. 1252. 834. 1802. {1667. 1345. 1372. the process of averaging has reduced the standard deviation considerably. 731.2. 834. 1506. 1535. 1493. 1292. {1667. 1479. {1667. 834}. 834.4.2. 834}. 1272. {1667. 1802. 1535.2. 699} Table 14.1. 834}. 1802. 1318. 1535. 1802. 699}. {1667. 1539. {2002. 1493.8. 2002. 1535. We must be cautious. 1493. 1493. {2002. 1146.6. 1380. {1667. {1493. 699}. 1485. 731. 731. 699}. 1535.2 213 {1667. 834. 1466. 731. 834. 1058. 1535. 2002. 2002.6. 834. 2002. 1535. 1802. {2002. 699}. 1535.6. 731. 1493. 834}. 1151. 1802. 1802. 731. 1802. 1213. 1559. 2002. {1667. The central limit theorem deals with samples from an infinite population. 699}. 834. 699}. {1667. 1299. {2002. {2002. 2002.2. 1802. 731. This is not exactly so in this case. 699}. in calculating the standard deviation of the means. 1493. 2002. 834. The reason for this is that we are sampling without replacement from a finite population. 834}. This is a consequence of the central limit theorem. 1313. 1252. 1286. 1186. 699}. 731. 1802. 1802. 1535. 1493. {1802. {1667. 834. 1493. 699}. however. 1535. 1279. 1802. 1535. 731. 2002. {1667. {1667.8. 1802. {2002. 1535. 1407. {1667. one might expect this to be σx /n. 731. 2002. {1667. 1366. 699}. 1493. 1160. {2002. 699} {1493. 1493. 834}. 1535. 834}.4. 1535. 2002. 1084.2} μx . 731}. 1374. 731. 699}. 1568. 1802. 1535. 1802. 834. 1802. 731. 731.8.Simple Random Sampling Table 14.6. {1667. 1802.4. 2002. {1493.4. 1439. {2002. 2002. 834. 1111. {1667. {1667. 1802. 1535. 1535. {2002. 1493. {1493. 1493. {1667. 699}. 834}. {2002. 1802. {1667. 1493.4. {1667. 834} {1667. 731. 2002. 834}. 1493. 1445.6. 1353. 731. 699}. 731. 1535. 1493. {1667. 1802. 834}.2. 1535. 1802. 699}. 699}. 1535}. 1535. 1512. 699}. 1493. 731. 1493. 1535. 2002. 1802. 834. 1493.4. 731. {1667. 834. 2002. 1535. 1493. 1319. 731. 2002. {2002. 1802. 834. {2002. 731. 1400. {2002. 1225. 2002. 699} {1667. 731. 699}. 1541.2. 699}. 2002. 1802.4. 834.8. 1347. 834. 1535. 834} {1667. 731. that is. 1339. 1493. {2002. 699}.4. 1493. 699}. 699}. {1667. 1307.8. 2002. 1533. 699}. 834}{1667. 1535. where σx is the variance of the population.8. 1493.2.8. 1506.8. 731}. 1245. 1345. 699} {1667.{1667. 1802. 1802. 1535. 1535. 731. 1802.6.3 {1699. 1493. 699}. 1802.4. 834}. 699}. 731}. 731. If we are selecting a sample . 1535. 1532. 699}. 2002.6.

048 6 = 141. STRATIFICATION At first glance it might appear that it is impossible to do better than simple random sampling where each of the items in the population is given the same chance as any other item in the population in appearing in the sample. The factor (N − n)/(N − 1) is often called the finite population correction factor.1 Mean of size n from a finite population of size N. and in addition to . rather than that of a population and use a divisor of N − 1. is not the case! The reason for this is that the population is divided into two recognizable groups.5 10981178 1258 13381418 14981578 1658 Figure 14. Note the divisor of N since we are not dealing with a sample.5 Frequency 10 7. high schools and middle schools. Here 2 σx = 2 8 − 5 481. Many statistical computer programs assume that the data used are that of a sample. 078 exactly the value we calculated using all the sample means. however. These groups are called strata.873672 N − n σx · = · = 19903. then the variance of the sample mean is given by 2 σx = 2 N − n σx · N −1 n 2 where σx is the variance of the finite population. Note that in calculating this.214 Chapter 14 Sampling 15 12.5 5 2. but with the entire population. we must also find N 2 σx = i=1 (xi − μ)2 N where μ is the mean of the population. So the variance of the population calculated by using a computer program must be multiplied by (N − 1)/N to find the correct result. 048 6 N −1 n 8−1 5 and so σx = √ 19903. This.

The question is how to determine the sample size. it follows that n1 · N2 = n2 · N1 = (n − n1 ) · N1 and from this it follows that N1 n1 = n · N1 + N 2 and so n2 = n · N2 N1 + N 2 In proportional allocation.4 Stratum High Schools Middle Schools Number 5 3 Mean 1699. within each stratum. so we choose three items from the high school stratum and two items from the middle school stratum. Table 14. In this case. That is. we choose samples of sizes n1 and n2 from strata of sizes N1 and N2 so that the proportions in the sample reflect exactly the proportion in the population. but. The sampling within . Stratified random sampling takes these different characteristics into account. Table 14.800 754. respectively. they have quite different characteristics. they vary considerably in variability. We will show two ways to allocate the sample sizes. or allocations.125 observations from the high school stratum and 5 · 3/8 = 1. Proportional Allocation In proportional allocation. It turns out that it is most efficient to take simple random samples within each stratum. so we take 5 · 5/8 = 3.875 observations from the middle school stratum.4 shows some data from these strata. more importantly as we shall see.887 57.5982 The strata then differ markedly in mean values.667 Standard deviation 185. called proportional allocation and optimal allocation. the sample sizes taken in each stratum are then proportional to the sizes of each stratum. We cannot do this exactly of course. we take the proportion 5/(5 + 3) = 5/8 from the high school stratum and 3/(5 + 3) = 3/8 from the middle school stratum. we want N1 n1 = n2 N2 Since n = n1 + n2 .Stratification 215 occurring in unequal numbers. We can capitalize on these differences and compose a sampling procedure that is unbiased but has a much smaller standard deviation than that in simple random sampling. keeping a total sample of size 5.

The standard deviation of this set of means is 48. {1667. {1493. 834.98. 731. 1407. {1667. 1493. 1535. 731. {1667. 1535. 699}. 699}. the true mean of the population. 1535. 834. 834}. 731. 1493. 1802. 2002.35. {2002. 731. 1802.6. {2002. 834}. 1802. 731. 1493. {1667. for the first sample. {1667.73. 834. 731. 1802. 1265.67.98. 1352. 731. 699}.078.6. 1802.23.42.69. 699}. 1301. 834. 834}. {2002. 731. 2002. we find the weighted mean to be 5· (1667 + 2002 + 1493) (731 + 834) +3· 3 2 = 1368.29. 1493. 1493. 1405. 699}. 731. 1399. {2002. 1493. 731. A graph of these means is shown in Figure 14.04.35. 1802. 1390. 1493. 1535. {2002. 1493. 1493.6441. 1535. 699}. 1535. 834. 1535. 834}. Table 14. 834. The set of weighted means is shown in Table 14. 699}. 699}. 1380. 1493.2. 1299. {1667. {1667.23.85. 1535. {2002. 731. 1493. {1493. 1802. 699}. 1362. 1535. 1802. 834}. 1535.92. 699}. 834. 1802. 1493. the standard deviation of the set of simple random samples. 1802. 1802. which is the unweighted mean of the first sample. 699}. 1535.56. 699}.6 {1368. 699} Now we might be tempted to calculate the mean of each of these samples and proceed with that set of 30 numbers. 1246. 1802.6. 834. But the real gain here is in the standard deviation. 2002. 834}. 1327. 1377. {2002. 1293. 834}. 2002. 699}. 1321. 1535. {1667. 1316. 2002. We found different samples. which are shown in Table 14. 731. 1535. {1493. this will give a biased estimate of the population mean since the observations in the high school stratum were given 3/2 the probability of appearing in the sample as the observations in the middle school stratum. {1667.216 Chapter 14 Sampling 5 3 each stratum must be done using simple random sampling. 1802. {1667. we weight the mean of the three observations in the high school stratum with a factor of 5 (the size of the high school stratum) and the mean of the two observations from the middle school stratum with a weight of 3 (the size of the high school stratum) and then divide the result by 8 to find the weighted mean of each sample.88.85. 699}. 699}.94. {1667. {2002. {2002. 1371.25. 1396. 1493. 1329. 834}. 1371. 731. 2002.69} . 731. Table 14. 1427. 731. 1802. 731. 1274. 1802.54. 1271.63. 699}. so this estimate for the population mean is also unbiased. 1335.375.19. This set of means has mean 1345. 1802. 731. 1535.5 · 3 2 = 30 {1667. 1310.56.5. 731. 1335. 834. 731. For example. {1667.40. 1343. {1667. 2002. 731. 1535. 1433.19. 1493. This is a large reduction from 141. {1667. 2002. 699}. However. 1802. 834. {1667. To fix this.94. 1341. 1535.85 8 which differs somewhat from 1345. 1535. 1493. 834}. 2002. 1535. 834}. {1667. {1667.73. 1802. 699}.38. 699}.

Stratification 5 4 Frequency 3 2 1 217 1280 1297 1314 1331 1348 13651382 13991416 1433 Figure 14. Clearly.2 Mean This procedure has then cut the standard deviation by about 66% while remaining unbiased. where n1 items are selected from the first stratum and n2 items from the second stratum. If the total size of the sample to be selected is n. is weighted by the ratio of the standard deviations. 57. N1 /N2 . in principle. The very name. indicates that this is in some sense the best allocation we can devise. optimal allocation. We will see that this is usually so and that the standard deviation of the means created this way is even less than that for proportional allocation. This discrepancy can be utilized further in stratification known as optimal allocation Optimal Allocation We now describe another way of allocating the observations between the strata. The high school stratum has standard deviation 185. whereas the middle school stratum has a much smaller standard deviation. Optimal allocation is derived. stratified sampling has resulted in much greater efficiency in estimating the population mean. σ1 /σ2 . One of the reasons for this is the difference in the standard deviations within the strata. then n = n1 + n2 . from proportional allocation in which strata with large standard deviations are sampled more frequently than those with smaller standard deviations. we have n1 N2 σ2 = n2 N1 σ1 = (n − n1 )N1 σ1 .887. where these samples sizes are determined so that n1 N1 σ 1 = · n2 N2 σ2 Here the population proportion.5982. Suppose now that we have two strata: first stratum of size N1 and standard deviation σ1 and second stratum of size N2 and standard deviation σ2 . Since n2 = n − n1 .

1802. 1535. We weight the mean of the high school observations with a factor of 5 (the size of the stratum) while we weight the observation from the middle school stratum with its size. 1802.44.5982) 3(57.887) + 3(57. 2002. 1535. 1493.784 5(185. {1667. 1289. 731}. 1308. {1667. 1802. 1493. The best we can then do is to select four items from the high school stratum 5 and one item from the middle school stratum. 1802. 1341. {1667. 1802. 1535.3. 1802. 1535. 1400. 1493. 1535. 1493. . 834}.81.7 {1667. 1493.53. 1356.28. 1535. 699}. 834}. 1802. 1493.887) = 4. In this case. 1359. 2002. {2002. 1535. before dividing by 8. 2002. which are shown in Table 14.25. 1535.63} A graph of these means is shown in Figure 14.81.63.218 Chapter 14 Sampling and this means that we choose n1 = n · items from the first stratum and n2 = n · items from the second stratum.88. 1802. 2002. 699}. 1350. {1667. 1327. 731}. 1535. 1535. 1493. 1320. For example. 1493. for the first sample. 1407. 25 8 The complete set of weighted means is {1362. 1277. {1667. Table 14. {2002.53.7. 1535. 1329. 1493. {1667. 216 5(185. 1493. 3. 1802. 2002. {1667.5982) N 2 σ2 N 1 σ1 + N 2 σ 2 N1 σ1 N1 σ1 + N 2 σ2 items from the first (high school) stratum and 5· items from the second (middle school) stratum.91. 1802. 834}. {2002. 731}.16. the weighted mean becomes 5· (1667 + 2002 + 1493 + 1802) + 3 · 731 4 = 1362. 699} Now again. 699}. 2002.28.887) + 3(57. 834}. 1493. as we did with proportional allocation. {1667. 699} {1667.5982) = 0. 1493. 1368. 1802. {1667.25. 1380. 2002. This gives 4 · 3 = 15 possible 1 samples. we do not simply calculate the mean of each of these samples but instead calculate a weighted mean that reflects the differing probabilities with which the observations have been collected. 1802.25. 834}. 2002. 2002. {1667. then we would select 5· 5(185. 731}. 1535. 731}.

6441 36. so once more our estimate is unbiased. . We summarize the results of the three sampling plans discussed here in Table 14. .1958 SOME PRACTICAL CONSIDERATIONS Our example here contains two strata. the mean of the population. Table 14.874 141. but in practice one could have many more. Nk Standard deviation σ1 σ2 σk .1958.375 1345. Suppose the strata are as follows: Stratum 1 2 . k Number N1 N2 .375 1345. . .8 Sampling Population Simple random Proportional allocation Optimal allocation Number 8 56 30 15 Mean 1345.375. but the standard deviation is 36.078 48.375 Standard deviation 481.375 1345.3 Mean The mean of these weighted means is 1345.Some Practical Considerations 4 3 Frequency 2 1 219 1310 1327 1344 1361 1378 1395 Figure 14.8. a reduction of about 26% from that for proportional allocation.

The numbers Ni are usually known. however. or can be approximated. poses a different problem. In the case of two strata. so we want n2 nk n1 = = ··· = N1 N2 Nk The solution to this set of equations is to choose ni = n · Ni /N observations from the ith stratum. of either variety. or close approximation. The number of items to be selected from the first stratum is N1 σ1 /σ2 N 1 σ1 =n· n· N 1 σ1 + N 2 σ 2 N1 σ1 /σ2 + N2 So the ratio provides all the information needed. because the number of observations per stratum depends on the standard deviations. It is clear that stratification. if we know that σ1 /σ2 = 2 . reduces the standard deviation and so increases greatly the accuracy with which predictions can be made. at least approximately. In the general case. In proportional allocation. or approximately equal to. however. so we do the best we can. In the general case. we want the sample sizes in each stratum to reflect the sizes of the stratum. to the standard deviations. then we select 10 · 2 =4 7· 10 · 2 + 15 from the first stratum and N2 15 N 2 σ2 =n· =7· =3 n· N 1 σ1 + N 2 σ2 N1 σ1 /σ2 + N2 10 · 2 + 15 items from the second stratum. optimal allocation can be achieved. For example. optimal allocation selects n· N i σi N1 σ1 + N 2 σ2 + · · · + N k σk from the ith stratum. It is often the case that proportional and optimal allocation do not differ very much with . we need only to know the ratio of the standard deviations. This requires knowledge. so one can come close to proportional allocation in most cases. Optimal allocation. if the ratio of the standard deviations to each other is known. then an allocation equivalent to. Then the ratio of the number of observations in the ith stratum to the number of observations in the jth stratum is Ni n· ni N = Ni = Nj nj Nj n· N the ratio of the number of items in stratum i to the number of items in stratum j. It is of course unusual for these sample sizes to be integers. and N2 = 15 and if we wish to select a sample of 7. Usually the total sample size is determined by the cost of selecting each item. N1 = 10.220 Chapter 14 Sampling Suppose N1 + N2 + · · · + Nk = N and that we wish to select a sample of size n.

In addition. the strata cannot be made up without the group having some known characteristics.231 380. Instances where this inequality does not hold are very unlikely to be encountered in practice. In political sampling. In any event. except in extremely tight races. Stratified sampling has been known to provide very accurate estimates in elections.751 211. Much more is known about sampling and the interested reader is encouraged to sample some of the many specialized texts on sampling techniques. some of them urban and some rural. in national political polling.881 4. Population 14.478 278. It is important to realize that the strata must exist in a recognizable form before the sampling is done. for example. but within these strata we might sample home owners. and so on as substrata. apartment dwellers.511 9. generally the outcome is known. well before the polls close and all the votes have been cast! CONCLUSIONS This has been a very brief introduction to varieties of samples that can be chosen from a population. STRATA Stratification is usually a very efficient technique in sampling. strata might consist of urban and rural residents. The strata are then groups with recognizable features. the strata might differ from state to state. EXPLORATIONS 1.046 5.538 550.273 148.Explorations 221 respect to the reduction in the standard deviation although it can be shown in general that 2 2 2 σSmple random sampling ≥ σProportional allocation ≥ σOptimal allocation is usually the case.272 County type Rural Rural Rural Rural Urban Urban Urban Urban Urban . condominium owners. The data in the following table show the populations of several counties in Colorado.

(ii) optimal allocation. . (b) Draw stratified samples of size 4 by (i) proportional allocation.222 Chapter 14 Sampling (a) Show all the simple random samples of size 4 and draw graphs of the sample means and sample standard deviations. (c) Calculate weighted means for each of the samples in part (b). (d) Discuss the differences in the above sampling plans and make inferences.

but it is largely true. Did statistics revolutionize science and if so. A. The subtitle makes quite a claim. A Probability and Statistics Companion. A recent book by David Salsburg is titled The Lady Tasting Tea and is subtitled How Statistics Revolutionized Science in the Twentieth Century. John J. yet. It is our knowledge of the planning (or the design) of experiments that allows experimenters to carry out efficient experiments in the sense that valid conclusions may then be drawn. The title refers to a famous experiment conducted by R. It is our object here to explore certain designed experiments and to provide an introduction to this subject. Inc. we can only give a limited introduction to either of these topics in our introductory course. how? The answer lies in our discovery of how to decide what observations to make in a scientific experiment. Our knowledge of the design of experiments and the design of sample surveys are the two primary reasons for studying statistics.Chapter 15 Design of Experiments CHAPTER OBJECTIVES: r to learn how planning “what observations have to be taken in an experiment” can greatly r r r r r improve the efficiency of the experiment to consider interactions between factors studied in an experiment to study factorial experiments to consider what to do when the effects in an experiment are confounded to look at experimental data geometrically to encounter some interesting three-dimensional geometry. Kinney Copyright © 2009 by John Wiley & Sons. 223 . We begin with an example. Fisher. If observations are taken correctly we now know that conclusions can be drawn from them that are not possible with only random observations.

2 Speed RAM −1 +1 −1 27 18 +1 10 9 (16) . The overall mean of the data is 16.1 Speed (MHz) RAM (MB) 128 256 133 27 18 400 10 9 What are we to make of the data? By examining the columns. since each of these factors is at two levels or values. The data are then shown in Table 15. Now we proceed to assess the influence of each of the factors. Only two variables (called factors) are being considered in the study: Speed (S) and RAM (R).2 with the chosen codings. but. by examining the rows. It is customary.5 Table 15.1 Computer Performance A study of the performance of a computer is being made. If we study each level of Speed with each level of RAM and if we make one observation for each factor combination. to perform a complex calculation.224 Chapter 15 Design of Experiments EXAMPLE 15. Since the factors are studied together. We divide this difference by 2. shown in parentheses. Studies involving only one factor are called one-factor-at-a-time experiments and are rarely performed. it appears that the time to perform the calculation is decreased by increasing RAM. The levels of Speed are 133 MHz and 400 MHz and the levels of RAM are 128 MB and 256 MB. Each of these variables is being studied at two values (called levels).The choice of coding will make absolutely no difference whatever in the end. it is particularly important that the factors be studied together.1. as we will see later. One way to assess the influence of the factor speed is to compare the mean of the computation times at the +1 level with the mean of the computation times at the −1 level. It might appear that the factors should be studied separately. One reason for the lack of interest in such experiments is the fact that factors often behave differently when other factors are introduced into the experiment. we cannot detect such interactions unless the factors are studied together. When this occurs. we need to make four observations. the distance between the codings −1 and +1. We will address this subsequently. The observation or response here is the time the computer takes. It may be very important to detect such interactions. we say the factors interact with each other. to find Effect of Speed = 1 2 10 + 9 27 + 18 − 2 2 = −6. Table 15. to code these as −1 and +1. it appears that the time to perform the calculation is decreased by increasing speed and. These times are shown in Table 15. in microseconds. but obviously. it may be puzzling to decide exactly what influence each of these factors has alone.

5 This means that we decrease computation times on average by 2.5 units as we move from the −1 level to the +1 level. In Figure 15. where the product is +1. that is. we show two factors that have a very high interaction.5 · Speed − 2. and again divide by 2 to find Effect of RAM = 1 2 18 + 9 27 + 10 − 2 2 = −2. How are we to compute this effect? We do this by comparing the mean where the product of the coded signs is +1 with the mean where the product of the coded signs is −1 to find Effect of speed · RAM = 1 2 27 + 9 10 + 18 − 2 2 =2 So we conclude that the computation Speed where the combination of levels of the factors. the interaction between the factors Speed and RAM. namely. Now. we show the computation times at the two levels of RAM for speed at its two levels. this is a sign of a mild interaction between the factors speed and RAM. we can use the signs RAM .5 units as we move from the −1 level to the +1 level. How can we put all this information together? We can use these computed effects in the following model: Observation = 16 − 6.5 15 12. One more effect can be studied. Linear models are of great importance in statistics. We denote this effect by Speed · RAM. As we go from the −1 level of RAM to the +1 level. In Figure 15.1b. especially in the areas of regression and design of experiments.Chapter 15 Design of Experiments 225 This means that we decrease the computation time on average by 6. tends to be two units more in average than the computation Speed where the combination of levels of the factors is −1.1 The size of the interaction is also a measurement of the effect of the combination of levels of the main factors (Speed and RAM). In this case.5 · RAM + 2 · Speed · RAM This is called a linear model.5 20 17. what is the effect of RAM? It would appear that we should compare the mean computation times at the +1 level with those at the −1 level. the computation times change differently for the different levels of speed. producing lines that are not parallel.5 10 −1 1 Speed (a) 1 25 RAM 20 15 10 5 −1 1 Speed (b) −1 1 Figure 15. −1 25 22. Since the lines are not quite parallel.1a. the performance of one factor heavily depends upon the level of the other factor.

if we use speed = −1 and RAM = −1.2. we find 16 − 6. Again. which we abbreviate as SR. in addition to three two-factor interactions. EXAMPLE 15.and one three-factor interaction SBR.5(+1) − 2. Now we have three main effects S.5(−1) − 2.226 Chapter 15 Design of Experiments of the factors (either +1 or −1) in the linear model. Now we need a three-dimensional cube to see the computation times resulting from all the combinations of the factors. we code the two brands as −1 and +1. To calculate Cube plot (data means) for data means 13 6 18 1 9 29 R 20 1 B 27 –1 –1 S 10 1 –1 Figure 15.5(−1) + 2(−1)(−1) = 27 exactly the observation at the corner! This also works for the remaining corners in Table 15. R.5(+1) − 2. and RB.2 Adding a Factor to the Computer Experiment Suppose that we now wish to study two different brands of computers so we add the factor Brand to the experiment. We show these in Figure 15.5(−1) − 2. For example. SB.5(+1) + 2(+1)(+1) = 9 The linear model then explains exactly each of the observations! We now explore what happens when another factor is added to the experiment.2: 16 − 6. and B.5(+1) + 2(−1)(+1) = 18 16 − 6.5(−1) + 2(+1)(−1) = 10 and 16 − 6.2 .

3 may be helpful in visualizing the calculations of the remaining two-factor interactions. we use the planes that form the boundaries of the cube.Chapter 15 Design of Experiments 227 the main effects. We find 1 2 27 + 18 + 6 + 20 10 + 9 + 13 + 29 − 4 4 SB = = 1. for example.25 and 1 2 10 + 13 + 6 + 27 9 + 18 + 20 + 29 − 4 4 BR = = −2.25 The planes shown in Figure 15. This gives 9 + 10 + 6 + 20 18 + 27 + 13 + 29 − 4 4 S= 1 2 = −5. This gives 27 + 9 + 6 + 29 18 + 10 + 13 + 20 − 4 4 SR = 1 2 = 1. it would appear consistent with our previous calculations if we were to compare the plane where SR is positive to the plane where SR is negative.5 The calculation of the interactions now remains.50 These two-factor interactions geometrically are equivalent to collapsing the cube along each of its major axes and analyzing the data from the squares that result.25 Similarly. we compare the mean of the plane where S is positive with the mean of the plane where S is negative and take 1/2 of this difference as usual. . we find 6 + 9 + 13 + 18 10 + 27 + 29 + 20 − 4 4 R= 1 2 = −5 and 1 2 29 + 13 + 6 + 20 9 + 10 + 18 + 27 − 4 4 B= = 0. To find the main S effect. To calculate the SR interaction.

S R B (c) BR interaction. But we have used up every one of the 12 planes that pass through the cube! Consistent with our previous calculations.228 Chapter 15 Design of Experiments S R B R S B (a) SR interaction. Figure 15. if we look at the points where SBR is positive. (b) SB interaction. .4.3 This leaves the three-factor interaction SBR to be calculated. we find a tetrahedron within the cube as shown in Figure 15.

75 (–1. We also find a negative tetrahedron as shown in Figure 15. we compare the means of the computation times in the positive tetrahedron with that of the negative tetrahedron: SBR = 1 2 18 + 10 + 6 + 29 27 + 9 + 13 + 20 − 4 4 = −0. −1.1) (1. 1.5 Negative tetrahedron. –1) S (–1.5. −1) R S (1.1) (−1. 1. 1) (−1.4 Positive tetrahedron. –1. –1) R B Figure 15. –1.Chapter 15 Design of Experiments 229 (1. −1.1. −1) B Figure 15. Now. 1. 1) (1. .

25 − 5 + 0. It is also possible to calculate all the effects from a table using the Yates algorithm.5 + 5. Fortunately.75 = 10 (−. we will follow the order S. (+.75 = 6 (−. R. −. R. the observations must be put in what we will call a natural order. +. the next column starts with eight minus signs followed by eight plus signs and this pattern is repeated. −.5 − 1. finally under B.5 − 1.25 − 1. The process is called the Yates algorithm after Frank Yates.75 = 29 (+. We make a column for each of the main factors and make columns of the signs as follows.25 + 5 + 0.5RB − 0. The columns are of course long enough to accommodate each of the observations. there is another way to calculate the effects that applies to any number of factors.5 + 0.5 + 0. +) 16.75 = 13 (+.25 − 5 + 0.5 + 0.25 + 2. B) to the left of each calculation.25 − 5 − 0.25 + 5 − 0. −) 16.25 + 1. make a column with the pattern −.25S − 5R + 0.5 + 0. −.5 + 5.5 + 5. · · · . −.5 − 5. as in the case of two factors. So all the effects could be calculated from a table showing the signs of the factors.25 + 5 − 0. −. Under S.25 − 1.5 + 1.5B + 1. +.25 + 1. +.75SBR where the mean of the eight observations is 16. Although the order of the factors is of no importance.5 − 0.25 − 2. +.5 − 0. but then the nice geometric interpretation we have given in the previous examples becomes impossible.25SB − 2.75 = 27 We would like to extend our discussion to four or more factors. YATES ALGORITHM The calculation of the effects in our examples depends entirely upon the signs of the factors. +. To do this. +.5 + 1. we find that the model predicts each of the corner observations exactly.25 − 1. one would only need to know the sign of R for each of the observations and use the mean of those observations for which R is positive and also the mean of the observations for which the sign of R is negative. +) 16. +. −. −. we form the linear model Observation = 16.25 − 1. +.25 + 1.25 + 1. the next factor would begin with 16 minus signs followed by 16 plus signs and this pattern continues. +. −) 16.25 + 5 + 0. . −.25 + 2.5 − 5. +.5 − 5. +.75 = 20 (−. · · · . for example. If there are more factors.25 + 2.25 − 2. make a column starting with the signs −.230 Chapter 15 Design of Experiments Now.5 − 0. B as we have done previously. +.5 + 1.5 − 1.75 = 9 (−. −) 16.25 − 5 − 0.75 = 18 (+. under R.5 − 1. −.25 − 2. make a column −. Again. · · · . We show the signs of the factors (in the order S. −. The result in our example is shown in Table 15. +) 16. +.3. −) 16. +.5.5 + 1.25 − 2.5 − 5. a famous statistician who discovered it. To calculate the R effect. +) 16.25 + 2.5 + 5.5 − 5. −.25SR + 1. −.5 − 0.

These experiments make observations for each combination of each level of each factor. Factorial experiments are very efficient in the sense that although the factors are observed together and would appear to be hopelessly mixed up. one-factor-at-a-time experiments using only the main effects cannot draw conclusions about interactions since they are never observed. 9 − 18 = −9. Now perform exactly the same calculations on the entries in column 1 to find column 2.Randomization and Some Notation Table 15. but first we make some comments on the experimental design and introduce some notation.50 1.2 is a 23 experiment.3 S − + − + − + − + R − − + + − − + + B − − − − + + + + Observations 27 10 18 9 29 20 13 6 1 37 27 49 19 −17 −9 −9 −7 2 64 68 −26 −16 −10 −30 8 2 3 132 −42 −40 10 4 10 −20 −6 Effect (÷8) 16.25 −2. Now consider the same pairs but subtract the top entry from the bottom entry to find 10 − 27 = −17. the order of the observations . This completes column 1. RANDOMIZATION AND SOME NOTATION Each of our examples are instances of what are called full factorial experiments. Example 15. Going down the column of observations. no combinations are omitted. Both the design and the occurrence of randomization are keys to the statistical analysis of an experiment.1 is then a 22 experiment while example 15. We want to show an example with four factors using the Yates algorithm since the geometry is impossible. Add these to find the first entries in the column labeled 1.25 −5 1. In factorial experiments. Here we find 27 + 10 = 37.5 −5. In our examples. Finally. consider the observations in pairs. 20 − 29 = −9.75 231 μ S R SR B SB RB SBR To calculate the effects. We find the same model here that we found above using geometry. and 13 + 6 = 19. 29 + 20 = 49. and 6 − 13 = −7.25 0.50 −0. Note that the factors described in the rightmost column can be determined from the + signs given in the columns beneath the main factors. these are usually denoted by the symbol 2k . follow the same pattern on the entries of column 2 to find the entries in column 3. 18 + 9 = 27. proceed as follows. they can be shown to be equivalent of one-factor-at-a-time experiments where each factor and their interactions are observed separately 2k times! Of course. the 2 indicating that each factor is at 2 levels and k indicating the number of factors. The effects are found by dividing the entries in column 3 by 8.




should be determined by some random scheme, including the order of any repeated observations for the same factor combination should these occur. Now we show an example with four factors.
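To make the bookkeeping concrete, here is a short sketch of the Yates algorithm in Python (the function name and layout are our own, not from the text); run on the data of Table 15.3 it reproduces the effects found there.

```python
def yates(observations):
    """Yates algorithm: the observations must be in natural (standard) order.
    Returns the effects, i.e., the final column divided by the number of runs."""
    col = list(observations)
    n = len(col)
    k = n.bit_length() - 1                    # number of factors, n = 2**k
    for _ in range(k):                        # build columns 1, 2, ..., k
        sums  = [col[i] + col[i + 1] for i in range(0, n, 2)]
        diffs = [col[i + 1] - col[i] for i in range(0, n, 2)]
        col = sums + diffs
    return [c / n for c in col]

# observations of Table 15.3 in natural order (S changes fastest, then R, then B)
obs = [27, 10, 18, 9, 29, 20, 13, 6]
labels = ["mean", "S", "R", "SR", "B", "SB", "RB", "SBR"]
for name, effect in zip(labels, yates(obs)):
    print(f"{name:>4}: {effect:6.2f}")
# expected: 16.50, -5.25, -5.00, 1.25, 0.50, 1.25, -2.50, -0.75
```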

EXAMPLE 15.3

A 2⁴ Factorial Experiment

In Statistics for Experimenters by George Box, William Hunter, and J. Stuart Hunter, a chemical process development study is described using four main factors. These are A : catalyst charge (in pounds), B : temperature (in degrees centigrade), C : pressure (in pounds per square inch), and D : concentration (in percentage). The chemical process is being created and the experimenters want to know what factors and their interactions should be considered when the actual process is implemented. The results are shown in Table 15.4.

Table 15.4

 A   B   C   D   Observations     1     2     3      4    Effect (÷16)
 −   −   −   −        71         132   304   600   1156      72.25     μ
 +   −   −   −        61         172   296   556    −64      −4        A
 −   +   −   −        90         129   283   −32    192      12        B
 +   +   −   −        82         167   273   −32      8       0.50     AB
 −   −   +   −        68         111   −18    78    −18      −1.125    C
 +   −   +   −        61         172   −14   114      6       0.375    AC
 −   +   +   −        87         110   −17     2    −10      −0.625    BC
 +   +   +   −        80         163   −15     6     −6      −0.375    ABC
 −   −   −   +        61         −10    40    −8    −44      −2.75     D
 +   −   −   +        50          −8    38   −10      0       0        AD
 −   +   −   +        89          −7    61     4     36       2.25     BD
 +   +   −   +        83          −7    53     2      4       0.25     ABD
 −   −   +   +        59         −11     2    −2     −2      −0.125    CD
 +   −   +   +        51          −6     0    −8     −2      −0.125    ACD
 −   +   +   +        85          −8     5    −2     −6      −0.375    BCD
 +   +   +   +        78          −7     1    −4     −2      −0.125    ABCD

This gives the linear model

Observation = 72.25 − 4A + 12B + 0.50AB − 1.125C + 0.375AC − 0.625BC − 0.375ABC − 2.75D + 0AD + 2.25BD + 0.25ABD − 0.125CD − 0.125ACD − 0.375BCD − 0.125ABCD

As in the previous examples, the model predicts each of the observations exactly.
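As a check, each effect can also be obtained directly from the table of ± signs: it is the dot product of its sign column with the observations, divided by 16. The sketch below is our own illustration, not code from the text; it rebuilds the sign columns of Table 15.4 and confirms that the fitted model reproduces every observation.

```python
from itertools import product

# observations of Table 15.4 in standard order (A changes fastest, then B, C, D)
obs = [71, 61, 90, 82, 68, 61, 87, 80, 61, 50, 89, 83, 59, 51, 85, 78]
runs = [dict(zip("ABCD", (a, b, c, d)))
        for d, c, b, a in product([-1, 1], repeat=4)]

labels = ["", "A", "B", "AB", "C", "AC", "BC", "ABC",
          "D", "AD", "BD", "ABD", "CD", "ACD", "BCD", "ABCD"]

def sign(run, label):
    """Sign of an interaction column: the product of the signs of its letters."""
    s = 1
    for letter in label:
        s *= run[letter]
    return s

# effect = (sign column) . (observations) / 16; the empty label gives the mean
effects = {lab: sum(sign(r, lab) * y for r, y in zip(runs, obs)) / 16
           for lab in labels}

# the model reproduces each observation exactly
for r, y in zip(runs, obs):
    predicted = sum(effects[lab] * sign(r, lab) for lab in labels)
    assert abs(predicted - y) < 1e-9
print(effects["A"], effects["B"], effects["BD"])   # -4.0  12.0  2.25
```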



CONFOUNDING
It is often true that the effects for the higher order interactions become smaller in absolute value as the number of factors in the interaction increases. So these interactions have something, but often very little, to do with the prediction of the observed values. If it is not necessary to estimate some of these higher order interactions, some very substantial gains can be made in the experiment, for then we do not have to observe all the combinations of factor levels! This then decreases the size, and hence the cost, of the experiment without substantially decreasing the information to be gained from the experiment. We show this through an example.

EXAMPLE 15.4

Confounding

Consider the data in Example 15.3 again, as given in Table 15.4, but suppose that we have only the observations for which ABCD = +1. This means that we have only half the data given in Table 15.4. We show these data in Table 15.5, where we have arranged the data in standard order for the three effects A, B, and C (since if the signs for these factors are known, then the sign of D is determined).

Table 15.5

 A   B   C   D   Observations     1     2     3    Effect (÷8)
 −   −   −   −        71         121   292   577     72.125    μ + ABCD
 +   −   −   +        50         171   285   −35     −4.375    A + BCD
 −   +   −   +        89         120   −28    95     11.875    B + ACD
 +   +   −   −        82         165    −7     3      0.375    AB + CD
 −   −   +   +        59         −21    50    −7     −0.875    C + ABD
 +   −   +   −        61          −7    45    21      2.625    AC + BD
 −   +   +   −        87           2    14    −5     −0.625    BC + AD
 +   +   +   +        78          −9   −11   −25     −3.125    ABC + D

Note that the effects are not the same as those found in Table 15.4. To the mean μ the effect ABCD has been added to find μ + ABCD = 72.25 − 0.125 = 72.125. Similarly, we find A + BCD = −4 − 0.375 = −4.375. The other results in Table 15.5 may be checked in a similar way. Since we do not have all the factor level combinations, we would not expect to find the results we found in Table 15.4. The effects have all been somewhat interfered with, or confounded. However, there is a pattern: to each effect has been added the effect or interaction that is missing from ABCD. For example, if we consider AB, then CD is missing and so CD is added to AB. The formula ABCD = +1 is called a generator. One could also confound by using the generator ABCD = −1. Then one would find μ − ABCD, A − BCD, and so on. We will not show the details here. Note also that if the generator ABCD = +1 is used, the experiment is a full factorial experiment in factors A, B, and C.
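The confounding pattern can be seen numerically with a small sketch (ours, not the book's): keep only the runs satisfying ABCD = +1 and apply the Yates calculation to them as if they formed a full 2³ experiment in A, B, and C. The "effects" that come out are the confounded sums of Table 15.5, for example μ + ABCD = 72.125.

```python
from itertools import product

def yates(y):
    n = len(y)
    for _ in range(n.bit_length() - 1):
        y = [y[i] + y[i + 1] for i in range(0, n, 2)] + \
            [y[i + 1] - y[i] for i in range(0, n, 2)]
    return [v / n for v in y]

obs = [71, 61, 90, 82, 68, 61, 87, 80, 61, 50, 89, 83, 59, 51, 85, 78]
runs = [(a, b, c, d) for d, c, b, a in product([-1, 1], repeat=4)]

# keep the half fraction defined by the generator ABCD = +1, then list it in
# standard order for A, B, C (A changes fastest)
half = sorted(((s, y) for s, y in zip(runs, obs) if s[0] * s[1] * s[2] * s[3] == 1),
              key=lambda sy: (sy[0][2], sy[0][1], sy[0][0]))
half_obs = [y for _, y in half]

print(half_obs)         # [71, 50, 89, 82, 59, 61, 87, 78]
print(yates(half_obs))  # 72.125, -4.375, 11.875, 0.375, -0.875, 2.625, -0.625, -3.125
```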




Each time a factor is added to a full factorial experiment where each factor is at two levels, the number of factor level combinations is doubled, greatly increasing the cost and the time needed to carry out the experiment. If we can allow the effects to be confounded, then we can decrease the size and the cost of the experiment. In this example, the confounding divides the size of the experiment in half and so is called a half fraction of a 2⁴ factorial experiment; it is denoted as a 2⁴⁻¹ factorial experiment to distinguish it from a full 2³ experiment. These experiments are known as fractional factorial experiments. They can give experimenters much information with great efficiency provided, of course, that the confounded effects are close to the true effects. This is generally true for confounding using higher order interactions. Part of the price to be paid here is that all the effects, including the main effects, are confounded. It is generally poor practice to confound main effects with lower order interactions. If we attempt to confound the experiment described in Example 15.2 by using the generator SBR = +1, we find the results given in Table 15.6.

Table 15.6

 S   R   B   Observations     1     2    Effect (÷4)
 −   −   +        29          39    63     15.75      μ + SBR
 +   −   −        10          24   −31     −7.75      S + BR
 −   +   −        18         −19   −15     −3.75      R + SB
 +   +   +         6         −12     7      1.75      B + SR

The problem here is that the main effects are confounded with second-order interactions. This is generally a very poor procedure to follow. Fractional factorial experiments are useful only when the number of factors is fairly large, so that the main effects are confounded with high-order interactions. These high-order interactions are normally small in absolute value and are very difficult to interpret in any event. When we do have a large number of factors, however, fractional factorial experiments become very useful and can reduce the size of an experiment in a very dramatic fashion. In that case, multiple generators may define the experimental procedure, leading to a variety of confounding patterns.

MULTIPLE OBSERVATIONS
We have made only a single observation for each combination of factor levels in each of our examples. In reality, one would make multiple observations whenever possible. This has the effect of increasing the accuracy of the estimation of the effects, but we will not explore that in detail here. We will show an example where we have multiple observations; to do this, we return to Example 15.1, where we studied the effects of the factors Speed and RAM on computer performance. In Table 15.7, we have used three observations for each factor level combination. Our additional observations happen to leave the means of each cell, as well as the overall mean, unchanged. Now what use is our linear model that was Observation = 16 − 6.5 · Speed − 2.5 · RAM + 2 · Speed · RAM?

Table 15.7

                      RAM
  Speed         −1              +1
   −1       27, 23, 31      18, 22, 14
   +1       10,  8, 12       9, 11,  7

(overall mean 16)


This linear model predicts the mean for each factor combination but not the individual observations. To predict the individual observations, we add a random component ε to the model to find

Observation = 16 − 6.5 · Speed − 2.5 · RAM + 2 · Speed · RAM + ε

The values of ε will then vary with the individual observations. For the purposes of statistical inference it is customary to assume that the random variable ε is normally distributed with mean 0, but a consideration of that belongs in a more advanced course.
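To see what the random component does, here is a small simulation sketch (our own illustration; the error standard deviation of 3 is an arbitrary choice, not a value from the text). Each simulated observation is the cell mean given by the fitted model plus a normally distributed ε.

```python
import random

def cell_mean(speed, ram):
    # fitted model from Example 15.1
    return 16 - 6.5 * speed - 2.5 * ram + 2 * speed * ram

random.seed(1)
for speed in (-1, 1):
    for ram in (-1, 1):
        obs = [cell_mean(speed, ram) + random.gauss(0, 3) for _ in range(3)]
        print(speed, ram, [round(o, 1) for o in obs])
```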

DESIGN MODELS AND MULTIPLE REGRESSION MODELS
The linear models developed here are also known as multiple regression models. If a multiple regression computer program is used with the data given in any of our examples and the main effects and interactions are used as the independent variables, then the coefficients found here geometrically are exactly the coefficients found using the multiple regression program. The effects found here then are exactly the same as those found using the principle of least squares.
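That equivalence is easy to check numerically. In the sketch below (ours, not the book's), the columns of ±1 signs for the main effects and the interaction of the three-observation data in Table 15.7 are used as regressors; numpy's least-squares routine returns exactly the coefficients 16, −6.5, −2.5, and 2 found earlier.

```python
import numpy as np

# the twelve observations of Table 15.7 with their factor settings
data = [(-1, -1, y) for y in (27, 23, 31)] + \
       [(-1,  1, y) for y in (18, 22, 14)] + \
       [( 1, -1, y) for y in (10,  8, 12)] + \
       [( 1,  1, y) for y in ( 9, 11,  7)]

speed = np.array([d[0] for d in data], dtype=float)
ram   = np.array([d[1] for d in data], dtype=float)
y     = np.array([d[2] for d in data], dtype=float)

# regression matrix: constant, the two main effects, and the interaction
X = np.column_stack([np.ones_like(y), speed, ram, speed * ram])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)   # [16.  -6.5 -2.5  2. ]
```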

TESTING THE EFFECTS FOR SIGNIFICANCE
We have calculated the effects in factorial designs, and we have examined their size, but we have not determined whether these effects have statistical significance. We show how this is done for the results in Example 15.3, the 24 factorial experiment. The size of each effect is shown in Table 15.8. The effects can be regarded as Student t random variables. To test any of the effects for statistical significance, we must determine the standard error for each effect. To do this, we first determine a set of effects that will be used to calculate this standard error. Here, let us use the third- and fourth-order interactions; we will then test each of the main effects and second-order interactions for statistical significance. We proceed as follows.



Table 15.8

 Effect    Size
 μ         72.25
 A         −4
 B         12
 AB        0.50
 C         −1.125
 AC        0.375
 BC        −0.625
 ABC       −0.375
 D         −2.75
 AD        0
 BD        2.25
 ABD       0.25
 CD        −0.125
 ACD       −0.125
 BCD       −0.375
 ABCD      −0.125

First, find a quantity called a sum of squares that is somewhat similar to the sums of squares we used in Chapter 13. This is the sum of the squares of the effects that will be used to determine the standard error. Here this is

SS = (ABC)² + (ABD)² + (ACD)² + (BCD)² + (ABCD)²
   = (−0.375)² + (0.250)² + (−0.125)² + (−0.375)² + (−0.125)²
   = 0.375

We define the degrees of freedom (df) as the number of effects used to find the sum of squares; this is 5 in this case. The standard error is the square root of the mean square of the effects, so the formula for finding the standard error is

Standard error = √(SS/df)

We find here that the standard error = √(0.375/5) = 0.27386. We can then find a Student t variable with df degrees of freedom. To choose an example, we test the hypothesis that the effect AB is 0:

t5 = (AB − 0)/Standard error = (0.5000 − 0)/0.27386 = 1.8258
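A few lines of Python reproduce this arithmetic (a sketch of the same calculation, not code from the text):

```python
import math

# the third- and fourth-order interaction effects from Table 15.8
higher_order = {"ABC": -0.375, "ABD": 0.250, "ACD": -0.125,
                "BCD": -0.375, "ABCD": -0.125}

ss = sum(e ** 2 for e in higher_order.values())   # 0.375
df = len(higher_order)                            # 5
se = math.sqrt(ss / df)                           # 0.27386...

t_AB = (0.50 - 0) / se
print(round(se, 5), round(t_AB, 4))               # 0.27386 1.8257 (the text rounds to 1.8258)
```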

The p-value for the test is the probability that this value of t is exceeded or that t is at most −1.8258. This is 0.127, so we probably would not decide that this is a statistically significant effect. We used the computer algebra program Mathematica to determine the t values and the corresponding p-values in Table 15.9.

Table 15.9

 Effect    Size       t5         p-Value
 μ         72.25      263.821    1.48·10⁻¹¹
 A         −4         −14.606    2.72·10⁻⁵
 B         12          43.818    1.17·10⁻⁷
 AB        0.50         1.826    0.127
 C         −1.125      −4.108    0.009
 AC        0.375        1.369    0.229
 BC        −0.625      −2.282    0.071
 D         −2.75      −10.042    1.68·10⁻⁴
 AD        0            0        1.00
 BD        2.25         8.216    4.34·10⁻⁴
 CD        −0.125      −0.456    0.667

Using the critical value of 0.05 as our level of significance, we would conclude that μ, A, B, C, D, and BD are of statistical significance while the other interactions can be safely ignored. This may have important consequences for the experimenter as future experiments are planned. It is probably best to use a computer program or a statistical calculator to determine the p-values since only crude estimates of the p-values can be made using tables.
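The t values and p-values are easy to reproduce. Here is a sketch of that check using scipy (our own illustration, assuming scipy is available; the book used Mathematica):

```python
from scipy import stats

se, df = 0.27386, 5
effects = {"mu": 72.25, "A": -4, "B": 12, "AB": 0.50, "C": -1.125,
           "AC": 0.375, "BC": -0.625, "D": -2.75, "AD": 0,
           "BD": 2.25, "CD": -0.125}

for name, size in effects.items():
    t = size / se
    p = 2 * stats.t.sf(abs(t), df)      # two-sided p-value
    print(f"{name:>3}  t = {t:8.3f}   p = {p:.3g}")
```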

Statistical computer programs are of great value in analyzing experimental designs. Their use is almost mandatory when the number of main factors is 5 or more. These programs also provide graphs that can give great insight into the data. For example, the statistical computer program Minitab generates a normal plot of the main effects and interactions for the example we have been considering; this is shown in Figure 15.6. Significant effects are those that are not close to the straight line shown.

Figure 15.6 Normal plot of the effects (response is Obs, Alpha = 0.05). The effects labeled in the plot (A, B, D, and BD) fall away from the line and are flagged as significant; Lenth's PSE = 1.125.

While the mean is not shown, we would draw the same conclusions from the graph as we did above. Most experiments have more than one observation for each factor level combination. In that case, the determination of the statistical significance of the factors and interactions is much more complex than in the case we have discussed. This is a topic commonly considered in more advanced courses in statistics. Such courses might also consider other types of experimental designs that occur in science and engineering.

CONCLUSIONS

This has been a very brief introduction to the design of experiments. We have made use of the geometry here in analyzing experimental data since that provides a visual display of the data and the conclusions we can draw from the data. Much more is known about this subject, and the interested reader is referred to more advanced books on this subject.

EXPLORATIONS

1. In Example 15.4, confound using ABCD = −1 and analyze the resulting data.
2. Check that the model given after Table 15.4 predicts each of the observations exactly.
3. The following data represent measurements of the diameter of a product produced at two different plants and on three different shifts.

                         Shift
                 1              2              3
   Plant 1   66.7  66.4     64.7  65.3     65.7  65.5
   Plant 2   64.2  64.4     63.3  61.2     67.2  66.2

Analyze the data and state any conclusions that can be drawn. Find all the main effects and the interactions and show how the data can be predicted using a linear model.

4. Show that collapsing the cubes shown in Figure 15.3 to squares leads to the determination of the two-factor interactions.

where f (x + 1) = 2f (x). f (3) = 2f (2) = 2 · 2 = 4. we consider functions whose values depend upon other values of the same function. then we can determine any of the subsequent values for the function. . f (2) = 2f (1) = 2 · 1 = 2. If we have a starting place. . f (x).1 A General Recursion We begin with a nonprobability example. . These functions are investigated here in general. EXAMPLE 16. 240 . INTRODUCTION Since they are very useful in probability. and then we show their application to probability and probability distribution functions. and that in fact f (x) = 2x−1 for x = 2. x = 0. 2. since subsequent values of f are twice that of the preceding value. . Suppose we define a function on the positive integers. A Probability and Statistics Companion. . Kinney Copyright © 2009 by John Wiley & Sons. . say f (1) = 1.Chapter 16 Recursions and Probability CHAPTER OBJECTIVES: r r r r r to learn about recursive functions to apply recursive functions to permutations and combinations to use recursive functions to find the expected value of the binomial distribution to learn how to gamble (perhaps wisely) consider the occurrence of HH in coin tossing. 3. It is easy to see here. John J. 1. For example. 4. Inc. that the values of f are powers of 2. Such functions are called recursive. f (4) = 2 · f (3) = 2 · 4 = 8 and so on.

4. . however. say f (0) = 1 and f (1) = 2. applying the recursion. Consider f (x + 1) = 2f (x) + f (x − 1) for x = 1. .1 shows some of the values of f (x) obtained from this solution. we find f (2) = 2f (1) + f (0) = 2 · 2 + 1 = 5 f (3) = 2f (2) + f (1) = 2 · 5 + 2 = 12 f (4) = 2f (3) + f (2) = 2 · 12 + 5 = 29. 1. is not always so easy. we invite the reader to check that the solutions we present are. and the formula f (x) = 2x−1 is its solution. To verify that f (x) = 2x−1 for x = 2. 2 2 √ Those 2’s look troublesome at first glance. We did this in our first example. and while we will not explore them. and so on. Analytic methods exist to produce solutions for the recursions we consider here. then f (x + 1) = 2x = 2 · 2x−1 = 2f (x). . in fact. but this time we need two starting points. 2. solutions of the recursions. 3. . The solution is far from evident. 3 . is the solution. . note that if f (x) = 2x−1 . 2. Finding solutions for recursions. It is easy to write a computer program to produce a table of these results. Table 16. Then.Introduction 241 The relationship f (x + 1) = 2f (x) is called a recursion or difference equation. the solution of the recursion is √ √ (1 + 2)x+1 − (1 − 2)x+1 √ f (x) = . The solution was probably evident from the start. . . but they all disappear! Table 16.1 x 0 1 2 3 4 5 6 7 8 9 10 f (x) 1 2 5 12 29 70 169 408 985 2378 5741 . x = 0. . Let us start by finding some of the values for f . In this case.

331. 081. a recursion Since we know that 1! = 1. 188. 217. We continue now and show more examples of recursions and their uses in probability.2 Permutations An easy application of recursions is with permutations—an arrangement of objects in a row. we can use the above result to find that and subsequent values. 200. A computer program finds that f (100) = 161. 571. note that n r+1 so n r+1 n r n! r!(n − r)! n−r · = (r + 1)! · (n − r − 1)! n! r+1 = n! (r + 1)! · (n − r − 1)! = which we can write in recursive form as n r+1 = n−r · r+1 n r . 709 It is difficult to think about computing this in any other way. If we have n distinct items to arrange. 986. EXAMPLE 16. 733. it follows that n Pn. and in finding 5! we would have to calculate 5 · 4 · 3 · 2 · 1. It is easy to continue and produce a table of factorials very quickly and easily. in finding 4! we would have to multiply 4 · 3 · 2 · 1.242 Chapter 16 Recursions and Probability The values now increase rather rapidly. EXAMPLE 16. that can be selected from a set of n distinct items. It is much easier and faster to use the facts that 4! = 4 · 3! and 5! = 5 · 4! since it is easy to calculate 3! and 4!. We know that r n r = n! r! · (n − r)! where 0 ≤ r ≤ n To find a recursion. suppose we denote the number of distinct permutations by n Pn . Now we turn to probability. 311. This number is denoted by n . the subject of all our remaining applications. Since we know there are n! possible permutations. = n ·n−1 Pn−1 . say of size r. 634. 082. Ordinarily.3 Combinations A combination is the number of distinct samples. = n! It is also clear that n! = n · (n − 1)! so n Pn.

We continue now with some probability distribution functions. note that n = n. it is a distinct advantage not to have to calculate the large factorials involved. . When the numbers become large. suppose n = 5. the recursion can be and we could continue this. So it follows that 1 n 2 or n 2 = n−1 n(n − 1) ·n= 2 2 n−2 n · 2+1 2 = n−1 n · 1+1 1 We can continue this by calculating n 3 or n 3 and n 4 or n 4 = n − 3 n(n − 1)(n − 2) n(n − 1)(n − 2) · (n − 3) · = 4 3·2 4·3·2 5 1 = = n − 2 n(n − 1) n(n − 1)(n − 2) · = 3 2 3·2 n−3 n · 3+1 3 = To take a specific example. Since used repeatedly to find 5 2 or 5 2 and 5 3 or 5 3 = 3 · 10 = 10 3 = 5−2 5 · 2+1 2 = 4 · 5 = 10 2 = 5−1 5 · 1+1 1 = 5.Introduction 243 Using this recursion will allow us to avoid calculation of any factorials! To see this.

for example.244 Chapter 16 Recursions and Probability EXAMPLE 16. For the binomial probability distribution function. x f (x) = P(X = x) = x = 0. . we find n px+1 qn−(x+1) x+1 n x n−x pq x P(X = x + 1) = P(X = x) = x!(n − x)! p n! · · (x + 1)!(n − x − 1)! n! q which simplifies to n−x p P(X = x + 1) = · P(X = x) x+1 q and this can be written as P(X = x + 1) = n−x p · · P(X = x) x+1 q The recursion is very useful. then tells us that P(X = 1) = n−0 p · · P(X = 0) 0+1 q p P(X = 1) = n · · qn = n · p · qn−1 q which is the correct result for X = 1. using x = 0. we know that P(X = 0) = qn . n Now n px+1 qn−(x+1) x+1 P(X = x + 1) = so if we divide P(X = x + 1) by P(X = x). 1. Consider the binomial probability distribution function as an example. we know n x n−x pq . . . .4 Binomial Probability Distribution Function It is frequently the case in probability that one value of a probability distribution function can be found using some other value of the probability distribution function. . The recursion.

We can continue in this way. The recursion tells us that P(X = 2) = P(X = 2) = P(X = 2) = P(X = 2) = n−1 p · · P(X = 1) 1+1 q n−1 p · · n · p · qn−1 2 q n(n − 1) 2 n−2 ·p ·q 2 n 2 · p2 · qn−2 245 and again this is the correct result for P(X = 2).00030199 and 11 ∗ 1.6 · · P(X = x) x + 1 0.00001677 7 The recursion is P(X = x + 1) = which in this case is P(X = x + 1) = so P(X = 1) = 12 ∗ 1. for example. To take a specific example. we start with P(X = 0) = 0. The value.00030199 = 0.412 = 0.5 ∗ 0. suppose that p = 0. of n! is never calculated at all. The advantage in doing this is that the quantities occurring in the values of the probability distribution function do not need to be calculated each time.00001677 7 = 0. creating all the values of the probability distribution function.2.5 ∗ 0.0024914 2 12 − x 0. .Introduction This can be continued to find P(X = 2).4 n−x p · · P(X = x) x+1 q P(X = 2) = We can continue and find the complete probability distribution function in Table 16.6 so that q = 0. Letting X denote the number of successes in 12 trials.4 and that n = 12.

227030 0.017414 0.000017 0.212841 0.246 Chapter 16 Recursions and Probability Table 16. One interesting application of the recursion P(X = x + 1) = n−x p · · P(X = x) x+1 q lies in finding the mean of the binomial probability distribution.141894 0.042043 0.002491 0.012457 0.2 x 0 1 2 3 4 5 6 7 8 9 10 11 12 P(X = x) 0.100902 0.5 Finding the Mean of the Binomial We actually used a recursion previously in finding the mean and the variance of the negative hypergeometric distribution in Chapter 7.000302 0.176579 0.002177 EXAMPLE 16.063852 0. Rearranging the recursion and summing the recursion from 0 to n − 1 gives n−1 n−1 q(x + 1)P(X = x + 1) = x=0 x=0 p(n − x)P(X = x) which can be written and simplified as n−1 n−1 qE[X] = np x=0 P(X = x) − p x=0 x · P(X = x) which becomes qE[X] = np[1 − P(X = n)] − p[E[X] − nP(X = n)] or qE[X] = np − npqn − pE[X] + npqn and this simplifies easily to E[X] = np .
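The recursion P(X = x + 1) = ((n − x)/(x + 1)) · (p/q) · P(X = x) used in this example is easy to program. The sketch below is our own illustration, not code from the text; it rebuilds the whole distribution for n = 12 and p = 0.6 starting from P(X = 0) = q¹², and checks that the probabilities sum to 1 and that the mean is np.

```python
def binomial_by_recursion(n, p):
    q = 1 - p
    probs = [q ** n]                      # P(X = 0) = q^n
    for x in range(n):                    # P(X = x+1) from P(X = x)
        probs.append(probs[-1] * (n - x) / (x + 1) * p / q)
    return probs

probs = binomial_by_recursion(12, 0.6)
print(probs[0], probs[1])                 # about 1.677e-05 and 3.020e-04, as in the example
print(sum(probs))                         # 1.0
print(sum(x * pr for x, pr in enumerate(probs)))   # 7.2 = n * p
```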

a sampled item is not returned to the lot before the next item is drawn. we find P(X = x) = (D − x + 1)(n − x + 1) P(X = x − 1) x(N − D − n + x) To choose a specific example. P(X = 1) = and P(X = 2) = 19 · 9 (D − 1) · (n − 1) P(X = 1) = · 0. 1. and hence factorials. . D] A recursion is easily found since by considering P(X = x)/P(X = x − 1) and simplifying. and n = 10. D = 20.0951163 = 0.318171 2(N − D − n + 2) 2 · 72 D·n 20 · 10 P(X = 0) = · 0. Note that although the definition of P(X = x) involves combinations. . in a way entirely similar to that we used with the binomial distribution. that is. Then (N − D)!(N − n)! 80!90! 80 · 79 · 78 · · · · · 71 P(X = 0) = = = = 0. suppose N = 100. x = 0. EXAMPLE 16.0951163 (N − D − n)!N! 70!100! 100 · 99 · 98 · · · · · 91 Applying the recursion. 2.Introduction 247 This derivation is probably no simpler than the standard derivation that evaluates n E[X] = x=0 x · P(X = x). . but it is shown here since it can be used with any discrete probability distribution. Suppose that a lot of N manufactured products contains D items that are special in some way. Let the random variable X denote the number of special items in the sample that is selected with nonreplacement. we never computed a factorial! The recursion can also be used.267933 (N − D − n + 1) 71 This could be continued to give all the values of the probability distribution function. The probability distribution function is D x N −D n−x N n P(X = x) = . The sample is of size n. Variances can also be produced this way. .6 Hypergeometric Probability Distribution Function We show one more discrete probability distribution and a recursion.267933 = 0. to find that the mean value is E[X] = (n · D)/N. . Min[n. usually providing a derivation easier than the direct calculation.
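The same idea applies here. Below is a sketch (ours, not the book's) of the hypergeometric recursion with N = 100, D = 20, and n = 10, starting from P(X = 0) and reproducing the values quoted in this example.

```python
def hypergeometric_by_recursion(N, D, n):
    # P(X = 0) = (N-D)!(N-n)! / ((N-D-n)! N!), computed as a running product
    p0 = 1.0
    for i in range(n):
        p0 *= (N - D - i) / (N - i)
    probs = [p0]
    # recursion: P(X = x) = (D-x+1)(n-x+1) / (x(N-D-n+x)) * P(X = x-1)
    for x in range(1, min(n, D) + 1):
        probs.append(probs[-1] * (D - x + 1) * (n - x + 1) / (x * (N - D - n + x)))
    return probs

probs = hypergeometric_by_recursion(100, 20, 10)
print([round(v, 6) for v in probs[:3]])    # [0.095116, 0.267933, 0.318171]
print(round(sum(x * v for x, v in enumerate(probs)), 6))   # 2.0 = n*D/N
```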

248 Chapter 16 Recursions and Probability EXAMPLE 16. while if the next play produces a loss.3. . so the game is slightly unfair to A. This is the case in casino gambling where the house’s fortune is much greater than that of the player. 2. then his or her fortune is $ (k − 1). . 1. the player is ruined. Suppose that $1 is gained or lost at each play. we have selected q = 15/28 and p = 13/28. . here B. . The only question is how long the player will last. So the probability of ruin is quite high almost without regard for the relative fortunes. Two players. N 1− The probability of ruin now depends on two ratios—that of q/p as well as a/b. . . If the player wins on the next play. and a + b = N. 1. Suppose that the player has $k and the probability of winning $1 on any given play is p and the probability of losing on any given play is then 1 − p = q. . . Individual calculations are not difficult and some results are given in Table 16. then pa −→ 1 and the player faces almost certain ruin. . which most casino games are not. N Let us consider the fair game (where p = q = 1/2) and player A whose initial fortune is $a. Let pk be the probability that A (or B) is ruined with a fortune of $k.1. 1. wins. The probability the player is ruined is then pa = 1 − a a b 1 =1− = = a N a+b a+b +1 b This means that if A is playing against a player with a relatively large fortune (so b a). . that of the Gambler’s Ruin. Here. Note that this is for playing a fair game. then his or her fortune is $ (k + 1). . and A starts with $a . So pk = ppk+1 + qpk−1 where p0 = 1 and pN = 0 The solution of this recursion depends on whether p and q are equal or not. 2. where p denotes the probability that favored player. 2. Now let us look at the case where p = q. . N 1− If p = q then the solution is pk = 1 − k N for k = 0. B starts with $b.7 Gambler’s Ruin We now turn to an interesting probability situation. . If p = q then the solution is / q p k − q p pk = q p N N for k = 0. where the solution is / q p k − q p pk = q p N N for k = 0. that is. A graph of the situation is also useful and shown in Figure 16. play a gambling game until one player is out of money. A and B.

986521 0.771473 0. we look only for the occurrence of the pattern HH.993341 0.7 0. In this example. then either the pattern occurs at the nth trial or it occurs at the (n − 1)st trial and is followed by H.996744 249 1 p 0. Consider the sequence TTHHHTTTHHHHHH.12th. If a sequence ends in HH. u1 = 0 The reader can check that the solution to this recursion is p [p + (−p)n ] for n 2 un = 1+p .986427 0. Then begin the sequence all over again.944355 0.972426 0.993364 0. we have the recursion un + pun−1 = p2 for n 2.945937 0.949188 0. possibly a loaded one. not necessarily the first occurrence. The pattern occurs again on the 10th. We previously considered waiting for the first occurrence of the pattern HH in Chapter 7.8 Finding a Pattern in Binomial Trials It is interesting to explore patterns when binomial trials are performed.9 0. A perfect model for this is tossing a coin. Scan the sequence from left to right and we see that HH occurs on the 4th flip. and looking at the pattern of heads and tails.1 EXAMPLE 16. since all possible sequences ending in HH have probability p2 .8 0. First.6 15 40 30 20 25 a 30 20 35 40 10 b Figure 16. and 14th trials and at no other trials. we need to define when this pattern “occurs”.782808 0. Let un denote the probability the sequence HH occurs at the nth trial.3 $a 15 15 20 20 25 25 30 30 35 35 40 40 $b 10 20 10 20 20 25 25 30 30 35 35 40 pa 0.Introduction Table 16. So.972815 0.

the recursions can usually be solved and then calculations made. 1+p p p + p6 . 1. u7 = p p − p7 1+p u5 = p p − p5 .25 0. u4 = p2 (p2 − p + 1).6 to find a recursion for the hypergeometric distribution and use it to find its mean value. or difference equations. The Poisson distribution is defined as f (x) = e−μ μx /x! for x = 0. a really beautiful pattern. r and show an application of the result.1875 0.166668 0. are very useful in probability and can often be used to model situations in which the formation of probability functions is challenging. Show how to use the recursion r+1 = (n − r)/(r − 1) n . n 2.4 n 2 4 6 8 10 12 14 16 18 20 un 0. This limit occurs fairly rapidly as the values in Table 16.166748 0. Use Example 16.166992 0. 1+p and so on. then un − > 1/6. it is evident that un − > p2 /(1 + p). Table 16. 3. f (1) = 2 is that given in the text.166687 0. As we have seen.4 show.250 Chapter 16 Recursions and Probability Some of the values this gives are u2 = p 2 . Establish a recursion for n+1 r n r . 4. u6 = u3 = qp2 . Verify that the solution for f (x + 12f (x) + f (x − 1). EXPLORATIONS 1.166672 0.166667 CONCLUSIONS Recursions. 2. Find the recursion for the values of f (x) and use it to establish the fact that E(X) = μ.171875 0. Since (−p)n becomes small as n becomes large. · · · and μ > 0. f (0) = 1. If p = 1/2.167969 0. 5.

A Probability and Statistics Companion. Now calculate [g(t)]2 . 251 . Kinney Copyright © 2009 by John Wiley & Sons. Since the coefficients represent probabilities. the coefficients of [g(t)]3 giving the probabilities associated with the sum when three fair dice are thrown.1 Throwing a Fair Die Let us suppose that we throw a fair die once.4 shows the sums when 24 fair dice are thrown. .1. . This is 1 2 [g(t)]2 = [t + 2t 3 + 3t 4 + 4t 5 + 5t 6 + 6t 7 + 5t 8 + 4t 9 + 3t 10 + 2t 11 + t 12 ] 36 The coefficient of t k is now P(X1 + X2 = k).2. The coefficient of t k in g(t) is P(X = k) for k = 1.Chapter 17 Generating Functions and the Central Limit Theorem CHAPTER OBJECTIVES: r to develop the idea of a generating function here and show how these functions can be r to use generating functions to investigate the behavior of sums r to see the development of the central limit theorem. This “normal-like” pattern continues. Consider tossing the die twice with X1 and X2 . Consider the function g(t) = (1/6)t + (1/6)t 2 + (1/6)t 3 + (1/6)t 4 + (1/6)t 5 + (1/6)t 6 and the random variable X denoting the face that appears. . A graph of this is interesting and is shown in Figure 17. 2. . Inc. Figure 17. The result is shown in Figure 17. used in probability modeling EXAMPLE 17. This process can be continued. This is shown in Figure 17. 6. John J.3. the relevant random variables. g[t] and its powers are called generating functions.

Figure 17.1 Probabilities for a single throw of a fair die (X1 = 1, ..., 6).
Figure 17.2 Probabilities for the sum of two fair dice (X1 + X2).
Figure 17.3 Probabilities for the sum of three fair dice (X1 + X2 + X3).
Figure 17.4 Frequencies for the sum of 24 fair dice.
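Raising the generating function to a power amounts to repeated convolution of the probability vector, which is easy to do numerically. The sketch below is our own illustration, not code from the text; it rebuilds the distributions pictured in these figures and confirms the mean and variance of the sum of 24 fair dice.

```python
def convolve(p, q):
    """Distribution of the sum of two independent variables given as lists of
    probabilities for the values 0, 1, 2, ... (index = value)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

# one fair die: values 1..6, so index 0 carries probability 0
die = [0.0] + [1.0 / 6] * 6

dist = [1.0]                 # distribution of an empty sum
for _ in range(24):
    dist = convolve(dist, die)

mean = sum(k * p for k, p in enumerate(dist))
var = sum(k * k * p for k, p in enumerate(dist)) - mean ** 2
print(round(mean, 6), round(var, 6))   # 84.0 and 70.0, i.e., 24*(7/2) and 24*(35/12)
print(round(dist[84], 6))              # probability that the 24-dice sum equals 84
```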

.1 0. . Suppose the die is loaded and the generating function is h(t) = t t2 t3 t4 t5 2t 6 + + + + + 10 5 20 20 5 5 A graph of this probability distribution.2 Throwing a Loaded Die The summands in the central limit theorem need not all have the same mean or variance. . then X1 + X2 + X3 + · · · + Xn has expectation μ1 + μ2 + μ3 + · · · + μn and variance 2 2 2 2 σ1 + σ2 + σ3 + · · · + σn .4 0. σ2 . 2 2 2 2 have means μ1 . .35 0. . If these summands. So we find the following means and variances in Table 17. 3.3 0. is shown is Figure 17. 2.1 for various numbers of summands. μ2 . . . n.6 shows the sum of three of the loaded dice. Figure 17. When we look at sums now. . .7 shows the sum of 24 dice.25 0. Figure 17. i = 1.Means and Variances 253 MEANS AND VARIANCES This behavior of sums is a consequence of the central limit theorem. Table 17. with variable X again.1 n 1 2 3 24 μ 7/2 7 21/2 84 σ2 35/12 35/6 35/4 70 EXAMPLE 17. Each of the Xi s has the same uniform distribution with μi = 7/2 and σi2 = 35/12. . say X1 . σn . Xn .2 0. 0. . X3 . . μn and variances σ1 . . .15 0. which states that the probability distribution of sums of independent random variables approaches a normal distribution. μ3 .05 1 2 3 X1 4 5 6 Figure 17. But the normality does appear.5. Our example illustrates this nicely. σ3 . the normal-like behavior does not appear quite so soon.5 Frequency . X2 . . .

If the measurement can be considered to be the result of the sum of factors. Since E[X1 ] = 7/2 and Var[X1 ] = 35/12.01 20 40 60 80 Sum 100 120 Figure 17.125 0.4 suggests that the distribution becomes very normal-like so we compare the probabilities when 24 dice are thrown with values √ of a normal distribution with mean 84 and standard deviation 70 = 8.05 0. then the central limit theorem assures us that the result will be normal.5 5 7. S denotes the sum.7 The pattern for the mean and variances of the sums continues.254 Chapter 17 Generating Functions 0. and IQ are normally distributed.15 0. μ1 = 17 and μ24 = 102 4 2 while σ1 = 279 837 2 and σ24 = 80 10 A NORMAL APPROXIMATION Let us return now to the case of the fair die and the graph of the sum of 24 tosses of the die as shown in Figure 17. occurs whenever a result can be considered as the sum of independent random variables. Normality.025 2.04 0. in fact. .5 15 X1 + X2 +X3 Frequency Frequency 0.3666. weight.02 0.2.075 0. A comparison is shown in Table 17.4.03 0.1 0. We find that many human characteristics such as height. This is yet another illustration of the central limit theorem.5 10 12. we know that E[X1 + X2 + X3 + · · · + X24 ] = 24 · and Var[X1 + X2 + X3 + · · · + X24 ] = 24 · 35 = 70 12 7 = 84 2 The visual evidence in Figure 17. So the normal approximation is excellent.6 Figure 17.

.0460669 0.2 S 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 Probability 0.0444911 0.0200912 0.0463396 0.0425331 0.0336014 0.0397959 0.0460669 0.0368540 0.0170474 0. investigate the sums when the die is tossed a few times.0473433 0.047367 0. and then investigate the behavior of the die when it is tossed a large number of times.0302958 0.0470386 0.0235250 0.0235250 0.0301874 0.0397959 0.0447139 0.0476827 0.0233427 0. EXPLORATIONS 1.0268886 0.0336519 0.0172423 0.0425331 0.0398849 0.0336519 0.0200912 255 CONCLUSIONS Probability generating functions are very useful in calculating otherwise formidable probabilities. Load a die in any way you would like so long as it is not fair. We have shown here primarily the occurrence of the central limit theorem in finding the limiting normal distributions when sums of independent random variables are considered.036871 0. They are also helpful in indicating limiting probability distributions.0336014 0.0447139 0.0444911 0.Explorations Table 17.0301874 0.0423735 0.0268886 0. 2.0470386 0. Find the mean and variance of the loaded die in Exploration 1 and verify the mean and variance of the sums found in Exploration 1.0233427 0.0302958 0.036871 0.0463396 0.0473433 0.0267356 0.0398849 0.0202872 Normal 0.0423735 0.0368540 0.0267356 0.0202872 0.

5. . . 2. and n. .256 Chapter 17 Generating Functions 3. Use the generating function in Exploration 4 to find the mean and variance of the binomial random variable. x = 1. Show that the normal curve is a very good approximation for the sums on your loaded die when tossed a large number of times. . 4. The geometric random variable describes the waiting time until the first success in a binomial situation with parameters p and q. Show that the function p(t) = (q + pt)n generates probabilities for the binomial random variable with parameters q. Show that its probability generating function is pt/(1 − qt) and use this to find the mean and the variance of the geometric random variable. Its probability distribution function is f (x) = pqx−1 . 6. p.

Univariate Discrete Distributions. Kemp. John Wiley & Sons. S. 9. Kinney. L. second edition. Balakrishnan. Inc. Grimaldi. S. Hunter. Gnedenko. Prentice-Hall. 1993. Inc. 19. Mosteller. and N. A First Course in Probability. A few of the following references are cited in the text. Goldberg. 1997. Inc. John Wiley & Sons. Kinney Copyright © 2009 by John Wiley & Sons. 1. S. fifth edition. G. fifth edition. Kinney. G. Applied Regression Analysis. Volumes 1 and 2. pp. McGraw-Hill. Grant and R. 1994. Reprinted by Dover Publications. sixth edition. 1989. J. Johnson. John Wiley & Sons. J. John Wiley & Sons.. E. and A. 1990. Inc. 2002. 1896. 3. 257 . 7. sixth edition. E. 2. 8. 2004. E. 2002. N. Inc.. Introduction to Difference Equations.. Barton. J. V. A History of Probability and Statistics and Their Applications Before 1750. Feller. Ross. Addison-Wesley Publishing Co. The Theory of Probability. and J. Duncan. J. Richard D. Addison-Wesley Publishing Co. Addison-Wesley Publishing Co. John Wiley & Sons. 13. J.. Leavenworth. N.. P. 14. 15. Smith. 1978. 4. 1962. R. F. Box. Kinney. Mathematics Magazine. Inc. Kotz. F. B. Drane. 1992. 1986. W. An Introduction to Probability and Its Applications. Continuous Univariate Distributions. May 1978. S.Bibliography WHERE TO LEARN MORE There is now a vast literature on the theory of probability. Combinatorial Chance. L. Hald. 6. Stuart Hunter. 184–186. P. 10. S. Inc.. A. W. John Wiley & Sons. 18. Statistics for Science and Engineering. 5. The American Statistician. 1968. Cao. L. Prentice-Hall. 1988. and T. W. 47(4).. J. L. S. Volumes I and II. Discrete and Combinatorial Mathematics. Irwin. Tossing coins until all are heads. J. David and D. Dover Publications. Fifty Challenging Problems in Probability. Quality Control and Industrial Statistics. second edition. other titles that may be useful to the instructor or student are included here as well. N. Charles Griffin & Company Limited. 11. Statistics for Experimenters. 1981. Chelsea Publishing Company. S. Goldberg. Kotz. 1960. 269–274. Draper and H. Probability: An Introduction. John J. 17. 16. Johnson. 1965. Probability: An Introduction with Statistical Applications. J. A. second edition. Limiting forms of probability mass functions via recurrence formulas. Postelnicu. A Probability and Statistics Companion. 12. N. John Wiley & Sons. fifth edition.. Wang. Statistical Quality Control. W..R.

Wolfram. Freeman and Company. W. . D. 23. J. S. McGraw-Hill Book Company. A. Addison-Wesley Publishing Co. W. 1991. Uspensky.. Inc. V.. Whitworth. Hafner Publishing Company. The Lady Tasting Tea: How Science Revolutionized Science in the Twentieth Century. H. Introduction to Mathematical Probability. 1965. fifth edition. 1937. 2001.. Choice and Chance. Inc. 21. Salsburg. 22.258 Bibliography 20. Mathematica: A System for Doing Mathematics by Computer.

232 Finite population correction factor 71.Index Acceptance sampling 32 Addition Theorem 10. 259 . 26 Heads before tails 88 Hypergeometric probability distribution 70. 11. 13 German tanks 28. 6. 31 Estimation 130. 177 Estimating σ 157. 123. 177 Hat problem 5. 247 Hypothesis testing 133 A Probability and Statistics Companion. 244 Mean 69 Variance 69 Binomial theorem 3. 105. 16 Bivariate random variables 115 Cancer test 42 Central limit theorem 121. 25 Generating functions 252 Geometric probability 48 Geometric probability distribution 72. 160 np 161 p 163 Correlation coefficient 200 Counting principle 19 Critical region 135 Degrees of freedom 143 Derangements 17 Difference equation 241 Discrete uniform distriution Drivers’ed 39 59 e 5. John J. Inc. 213 Cereal Box Problem 88 Chi squared distribution 141 Choosing an assistant 30 Combinations 22. 242 Conditional probability 10. 159 Expectation 6 Binomial distribution 69 Geometric distribution 73 Hypergeometric distribution 71 Negative Binomial distribution 88 Negative Hypergeometric distribution 102 F ratio distribution 148 Factor 224 Factorial experiment 231. 149 Confidence interval 131 Confounding 233 Control charts 155 Mean 159. 39. Kinney Copyright © 2009 by John Wiley & Sons. 24. 84 Geometric series 12. 83 Birthday problem 8. 62. 25 α 135 Alternative hypothesis 134 Analysis of variance 197 Average outgoing quality 34 Axioms of Probability 8 Bayes’ theorem 45 β 137 Binomial coefficients 24 Binomial probability distributuion 64. 26. 40 Confidence coefficient 131. 15. 214 Fractional factorial experiment 234 Ganbler’s ruin 248 General addition theorem 10.

4. 44 Lower control limit 158 Lunch problem 96 Marginal distributions 117 Maximum 176 Mean 60 Binomial distribution 69. 19. 55 Range 156. 177 Multiple regression 235 Multiplication rule 10 Mutually exclusive events 9 Mythical island 84 Negative binomial distribution 87. 121 Type I error 135 Type II error 135 Unbiased estimator 212 Uniform random variable 59. 6. 182 Theory 182 Mean 184 Variance 184 Sample space 2. 8 Probability distribution 32. 174 Median . 103 Negative hypergeometric distribution 99 Nonlinear models 201 Nonparametric methods 170 Normal probability distribution 113 Null hypothesis 134 Operating characteristic curve Optimal allocation 217 Order statistics 173.median line 202. 14 Sampling 211 Seven game series in sports 75 Significance level 135 Simple random sampling 211 Standard deviation 60 Strata 214.260 Index Pooled variances 152 Power of a test 137 Probability 2. 12. 59 Proportional allocation 215 115 Race cars 28. 221 Student t distribution 146 Sums 62.two samples 150 Median 28. 246 Geometric distribution 73 Hypergeometric distribution 71 Negative Binomial distribution 88 Negative Hypergeometric distribution 174 Mean of means 124 Means . 241 Residuals 189 Runs 3. 17. 180. 64 Random variable 58 Randomized response 46. 7. 69. 242 Some objects alike 20 Poisson distribution 250 Poker 27 138 . 174 p-value for a test 139 Patterns in binomial trials 90 Permutations 5. 207 Meeting at the library 48 Minimum 174. 15. 10. 101. 176 Rank sum test 170 Ratio of variances 148 Recursion 91. 111. 109 Upper control limit 158 Variance 60. 119 Binomial distribution 69 Hypergeometric distribution 71 Negative Binomial distribution 88 Negative Hypergeometric distribution 102 Waiting at the library 48 Waiting time problems 83 World series 76 Yates algorithm 230 Inclusion and exclusion 26 Independent events 11 Indianapolis 500 data 196 Influential observations 193 Interaction 225 Joint probability distributions Least squares 191 Let’s Make a Deal 8.