
Spring, 2014

Lecture Notes

Joshua M. Tebbs

Department of Statistics

University of South Carolina

TABLE OF CONTENTS STAT/MATH 511, J. TEBBS

Contents

3 Modeling Random Behavior 1

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

3.2 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

3.2.1 Sample spaces and events . . . . . . . . . . . . . . . . . . . . . . 4

3.2.2 Unions and intersections . . . . . . . . . . . . . . . . . . . . . . . 7

3.2.3 Axioms and rules . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3.2.4 Conditional probability and independence . . . . . . . . . . . . . 10

3.2.5 Additional probability rules . . . . . . . . . . . . . . . . . . . . . 12

3.3 Random variables and distributions . . . . . . . . . . . . . . . . . . . . . 15

3.4 Discrete random variables . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.4.1 Binomial distribution . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.4.2 Geometric distribution . . . . . . . . . . . . . . . . . . . . . . . . 28

3.4.3 Negative binomial distribution . . . . . . . . . . . . . . . . . . . . 30

3.4.4 Hypergeometric distribution . . . . . . . . . . . . . . . . . . . . . 32

3.4.5 Poisson distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3.5 Continuous random variables . . . . . . . . . . . . . . . . . . . . . . . . 37

3.5.1 Exponential distribution . . . . . . . . . . . . . . . . . . . . . . . 42

3.5.2 Gamma distribution . . . . . . . . . . . . . . . . . . . . . . . . . 46

3.5.3 Normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 49

3.6 Reliability and lifetime distributions . . . . . . . . . . . . . . . . . . . . . 53

3.6.1 Weibull distribution . . . . . . . . . . . . . . . . . . . . . . . . . 54

3.6.2 Reliability functions . . . . . . . . . . . . . . . . . . . . . . . . . 57

4 Statistical Inference 63

4.1 Populations and samples . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

4.2 Parameters and statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . 65


4.3 Point estimators and sampling distributions . . . . . . . . . . . . . . . . 67

4.4 Sampling distributions involving Ȳ . . . . . . . . . . . . . . . . . . 69

4.4.1 Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . . . 71

4.4.2 t distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

4.4.3 Normal quantile-quantile (qq) plots . . . . . . . . . . . . . . . . . 77

4.5 Confidence intervals for a population mean . . . . . . . . . . . . . . 78

4.5.1 Known population variance σ² . . . . . . . . . . . . . . . . . . 78

4.5.2 Unknown population variance σ² . . . . . . . . . . . . . . . . . 85

4.5.3 Sample size determination . . . . . . . . . . . . . . . . . . . . . . 88

4.6 Confidence interval for a population proportion p . . . . . . . . . . . 90

4.7 Confidence interval for a population variance σ² . . . . . . . . . . . 95

4.8 Confidence intervals for the difference of two population means μ₁ − μ₂:
Independent samples . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

4.8.1 Equal variance case: σ₁² = σ₂² . . . . . . . . . . . . . . . . . . 101

4.8.2 Unequal variance case: σ₁² ≠ σ₂² . . . . . . . . . . . . . . . . 105

4.9 Confidence interval for the difference of two population proportions p₁ − p₂:
Independent samples . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

4.10 Confidence interval for the ratio of two population variances σ₂²/σ₁²:
Independent samples . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

4.11 Confidence intervals for the difference of two population means μ₁ − μ₂:
Dependent samples (Matched-pairs) . . . . . . . . . . . . . . . . . . . 116

4.12 One-way analysis of variance . . . . . . . . . . . . . . . . . . . . . . . . . 120

4.12.1 Overall F test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

4.12.2 Follow-up analysis: Tukey pairwise confidence intervals . . . . . 130

6 Linear regression 133

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

6.2 Simple linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

6.2.1 Least squares estimation . . . . . . . . . . . . . . . . . . . . . . . 136


6.2.2 Model assumptions and properties of least squares estimators . . 139

6.2.3 Estimating the error variance . . . . . . . . . . . . . . . . . . . . 140

6.2.4 Inference for β₀ and β₁ . . . . . . . . . . . . . . . . . . . . . . 142

6.2.5 Confidence and prediction intervals for a given x = x₀ . . . . . . 145

6.3 Multiple linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

6.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

6.3.2 Matrix representation . . . . . . . . . . . . . . . . . . . . . . . . . 151

6.3.3 Estimating the error variance . . . . . . . . . . . . . . . . . . . . 154

6.3.4 The hat matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156

6.3.5 Analysis of variance for linear regression . . . . . . . . . . . . . . 156

6.3.6 Inference for individual regression parameters . . . . . . . . . . . 161

6.3.7 Confidence and prediction intervals for a given x = x₀ . . . . . . 164

6.4 Model diagnostics (residual analysis) . . . . . . . . . . . . . . . . . . . . 165

7 Factorial Experiments 174

7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174

7.2 Example: A 2² experiment with replication . . . . . . . . . . . . . . 176

7.3 Example: A 2⁴ experiment without replication . . . . . . . . . . . . 186


CHAPTER 3 STAT 509, J. TEBBS

3 Modeling Random Behavior

Complementary reading: Chapter 3 (VK); Sections 3.1-3.5 and 3.9.

3.1 Introduction

TERMINOLOGY: Statistics is the development and application of theory and methods to the collection, design, analysis, and interpretation of observed information from planned and unplanned studies.

"Statisticians get to play in everyone else's back yard." (John Tukey, Princeton)

Here are some examples where statistics could be used:

1. In a reliability (time to event) study, an engineer is interested in quantifying the

time until failure for a jet engine fan blade.

2. In an agricultural study in Iowa, researchers want to know which of four fertilizers

(which vary in their nitrogen contents) produces the highest corn yield.

3. In a clinical trial, physicians want to determine which of two drugs is more effective

for treating HIV in the early stages of the disease.

4. In a public health study, epidemiologists want to know whether smoking is linked

to a particular demographic class in high school students.

5. A food scientist is interested in determining how different feeding schedules (for pigs) could affect the spread of salmonella during the slaughtering process.

6. A pharmacist posits that administering caffeine to premature babies in the ICU at

Richland Hospital will reduce the incidence of necrotizing enterocolitis.

7. A research dietician wants to determine if academic achievement is related to body

mass index (BMI) among African American students in the fourth grade.


8. An economist, as part of President Obama's re-election campaign, is trying to forecast the monthly unemployment and under-employment rates for 2012.

REMARK: Statisticians use their skills in mathematics and computing to formulate

statistical models and analyze data for a specific problem at hand. These models are

then used to estimate important quantities of interest (to the researcher), to test the

validity of important conjectures, and to predict future behavior. Being able to identify

and model sources of variability is an important part of this process.

TERMINOLOGY: A deterministic model is one that makes no attempt to explain

variability. For example, in chemistry, the ideal gas law states that

PV = nRT,

where P = pressure of a gas, V = volume, n = the amount of substance of gas (number of moles), R = the universal gas constant, and T = temperature. In circuit analysis, Ohm's

law states that

V = IR,

where V = voltage, I = current, and R = resistance.

In both of these models, the relationship among the variables is completely determined without any ambiguity.

In real life, this is rarely true for the obvious reason: there is natural variation that

arises in the measurement process.

For example, a common electrical engineering experiment involves setting up a

simple circuit with a known resistance R. For a given current I, different students

will then calculate the voltage V .

With a sample of n = 20 students, conducting the experiment in succession,

we might very well get 20 dierent measured voltages!

A deterministic model is too simplistic for real life; it does not acknowledge

the inherent variability that arises in the measurement process.


A probabilistic (or stochastic or statistical) model might look like

V = IR + ε,

where ε is a random term that accounts for measurement error.
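A quick way to see the difference between the two models is to simulate the experiment. The sketch below is illustrative only: the resistance, current, and error standard deviation are hypothetical values, not taken from these notes.

```python
import random

# Hypothetical setup: R = 100 ohms, I = 0.05 A, so the deterministic
# model V = IR says every student should report exactly 5.0 volts.
random.seed(509)
R, I = 100.0, 0.05

# Probabilistic model V = IR + eps, with eps ~ Normal(0, 0.1) measurement
# error (the error standard deviation is an assumption for illustration).
voltages = [I * R + random.gauss(0.0, 0.1) for _ in range(20)]

print(len(voltages))        # 20
print(len(set(voltages)))   # 20 distinct readings, scattered around 5.0
```

Under the deterministic model, all 20 values would be identical; the random term ε is what produces 20 different measured voltages.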

PREDICTION: Statistical models can also be used to predict future outcomes. For

example, suppose that I am trying to predict

Y = MATH 141 final course percentage

for incoming freshmen enrolled in MATH 141. For each freshman student, I will record

the following variables:

x₁ = SAT MATH score
x₂ = high school GPA.

A deterministic model would be

Y = f(x₁, x₂),

for some function f : R² → [0, 100]. This model suggests that for a student with values x₁ and x₂, we could compute Y exactly if the function f was known. A statistical model for Y might look something like this:

Y = β₀ + β₁x₁ + β₂x₂ + ε,

where ε is a random term that accounts for not only measurement error (e.g., incorrect student information, grading errors, etc.) but also

(a) all of the other variables not accounted for (e.g., major, leisure habits, natural

ability, etc.) and

(b) the error induced by assuming a linear relationship between Y and {x₁, x₂} when, in fact, the relationship may not be linear.

In this example, with certain (probabilistic) assumptions on ε and a mathematically sensible way to estimate the unknown β₀, β₁, and β₂ (i.e., coefficients of the linear function), we can produce point predictions of Y on a student-by-student basis; we can also characterize numerical uncertainty with our predictions.


3.2 Probability

3.2.1 Sample spaces and events

TERMINOLOGY: Probability is a measure of one's belief in the occurrence of a future

event. Here are some events to which we may wish to assign a probability:

• tomorrow's temperature exceeding 80 degrees

• manufacturing a defective part

• concluding one fertilizer is superior to another when it isn't

• the NASDAQ losing 5 percent of its value

• you being diagnosed with prostate/cervical cancer in the next 20 years.

TERMINOLOGY: Many real life phenomena can be envisioned as a random experiment. The set of all possible outcomes for a given random experiment is called the sample space, denoted by S.

The number of outcomes in S is denoted by n_S.

Example 3.1. In each of the following random experiments, we write out a corresponding sample space.

(a) The Michigan state lottery calls for a three-digit integer to be selected:

S = {000, 001, 002, ..., 998, 999}.

The size of the set of all possible outcomes is n_S = 1000.

(b) A USC undergraduate student is tested for chlamydia (0 = negative, 1 = positive):

S = {0, 1}.

The size of the set of all possible outcomes is n_S = 2.


(c) Four equally qualified applicants (a, b, c, d) are competing for two positions. If the positions are identical (so that selection order does not matter), then

S = {ab, ac, ad, bc, bd, cd}.

The size of the set of all possible outcomes is n_S = 6. If the positions are different (e.g., project leader, assistant project leader, etc.), then

S = {ab, ba, ac, ca, ad, da, bc, cb, bd, db, cd, dc}.

In this case, the size of the set of all possible outcomes is n_S = 12.

TERMINOLOGY: Suppose that S is a sample space for a random experiment. We say that A is an event in S if

A ⊆ S.

GOAL: We would like to develop a mathematical framework so that we can assign probability to an event A. This will quantify how likely the event is. The probability that the event A occurs is denoted by P(A).

INTUITIVE: Suppose that a sample space S contains n_S < ∞ outcomes, each of which is equally likely. If the event A contains n_A outcomes, then

P(A) = n_A / n_S.

This is called an equiprobability model. Its main requirement is that all outcomes in

S are equally likely.

Important: If the outcomes in S are not equally likely, then this result is not

applicable.

Example 3.2. In the random experiments from Example 3.1, we use the previous result

to assign probabilities to events (if applicable).

(a) The Michigan state lottery calls for a three-digit integer to be selected:

S = {000, 001, 002, ..., 998, 999}.


The size of the set of all possible outcomes is n_S = 1000. Let the event

A = {000, 005, 010, 015, ..., 990, 995}
  = {winning number is a multiple of 5}.

There are n_A = 200 outcomes in A. It is reasonable to assume that each outcome in S is equally likely. Therefore,

P(A) = 200/1000 = 0.20.

(b) A USC undergraduate student is tested for chlamydia (0 = negative, 1 = positive):

S = {0, 1}.

The size of the set of all possible outcomes is n_S = 2. However, is it reasonable to assume that each outcome in S (0 = negative, 1 = positive) is equally likely? The prevalence of chlamydia among college age students is much less than 50 percent (in SC, this prevalence is probably somewhere between 5-12 percent). Therefore, it would be illogical to assign probabilities using an equiprobability model.

(c) Four equally qualified applicants (a, b, c, d) are competing for two positions. If the positions are identical (so that selection order does not matter), then

S = {ab, ac, ad, bc, bd, cd}.

The size of the set of all possible outcomes is n_S = 6. If A is the event that applicant d is selected for one of the two positions, then

A = {ad, bd, cd}
  = {applicant d is chosen}.

There are n_A = 3 outcomes in A. If each of the 4 applicants has the same chance of being selected (an assumption), then each of the n_S = 6 outcomes in S is equally likely. Therefore,

P(A) = 3/6 = 0.50.
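The equiprobability calculations in parts (a) and (c) can be checked by brute-force enumeration. A minimal Python sketch (the language choice is mine, not the notes'):

```python
from itertools import combinations

# (a) Michigan lottery: three-digit integers 000-999, all equally likely.
S = range(1000)
A = [s for s in S if s % 5 == 0]           # winning number is a multiple of 5
print(len(A) / len(S))                      # 0.2

# (c) Two identical positions among four applicants; order does not matter.
S2 = list(combinations("abcd", 2))          # n_S = 6 unordered pairs
A2 = [pair for pair in S2 if "d" in pair]   # applicant d is chosen
print(len(A2) / len(S2))                    # 0.5
```

Enumerating the sample space and counting is exactly the n_A / n_S rule; it is valid here only because every outcome is assumed equally likely.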


INTERPRETATION: In general, what does P(A) really measure? There are two main interpretations:

• P(A) measures the likelihood that A will occur on any given experiment.

• If the experiment is performed many times, then P(A) can be interpreted as the percentage of times that A will occur over the long run. This is called the relative frequency interpretation.

If we are using the former interpretation, then it is common to use a decimal representation; e.g.,

P(A) = 0.50.

If we are using the latter, it is commonly accepted to say something like "the event A will occur 50 percent of the time." This gives the impression that A will occur, on average, 1 out of every 2 times the experiment is performed. This does not mean that the event will occur exactly 1 out of every 2 times the experiment is performed.

3.2.2 Unions and intersections

TERMINOLOGY: The null event, denoted by ∅, is an event that contains no outcomes. The null event has probability P(∅) = 0.

TERMINOLOGY: The union of two events A and B is the set of all outcomes in either set or both. We denote the union of two events A and B by

A ∪ B = {ω : ω ∈ A or ω ∈ B}.

TERMINOLOGY: The intersection of two events A and B is the set of all outcomes in both sets. We denote the intersection of two events A and B by

A ∩ B = {ω : ω ∈ A and ω ∈ B}.

TERMINOLOGY: If the events A and B contain no common outcomes, we say the events are mutually exclusive. In this case,

P(A ∩ B) = P(∅) = 0.


Example 3.3. A primary computer system is backed up by two secondary systems. They operate independently of each other; i.e., the failure of one has no effect on any of the others. We are interested in the readiness of these three systems, which we can envision as an experiment. The sample space for this experiment can be described as

S = {yyy, yny, yyn, ynn, nyy, nny, nyn, nnn},

where y means the system is ready and n means the system is not ready. Define

A = {primary system is operable} = {yyy, yny, yyn, ynn}
B = {first backup system is operable} = {yyy, yyn, nyy, nyn}.

The union of A and B is

A ∪ B = {yyy, yny, yyn, ynn, nyy, nyn}.

In words, A ∪ B occurs if the primary system or the first backup system is operable (or both). The intersection of A and B is

A ∩ B = {yyy, yyn}

and occurs if the primary system and the first backup system are both operable.

Important: Can we compute P(A), P(B), P(A ∪ B), and P(A ∩ B) in this example? For example, if we wrote

P(A ∪ B) = n_{A∪B} / n_S = 6/8 = 0.75,

we would be making the assumption that each outcome in S is equally likely! This would only be true if each system (primary and both backups) functions with probability equal to 1/2. This is extremely unlikely! Therefore, we cannot compute probabilities in this example without additional information about the system-specific failure rates.
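The sample space, union, and intersection in Example 3.3 can be enumerated mechanically. A short Python sketch, representing each outcome as a string of y's and n's:

```python
from itertools import product

# Sample space for three systems, each ready ('y') or not ('n'): 8 outcomes.
S = {"".join(t) for t in product("yn", repeat=3)}

A = {s for s in S if s[0] == "y"}   # primary system is operable
B = {s for s in S if s[1] == "y"}   # first backup system is operable

print(len(A | B))      # 6 outcomes in the union
print(sorted(A & B))   # ['yyn', 'yyy']

# Caution: 6/8 = 0.75 is P(A or B) only if all 8 outcomes were equally
# likely, which the text argues is an unrealistic assumption here.
```

Set operations (`|` for union, `&` for intersection) mirror the event algebra directly; counting outcomes, however, gives probabilities only under the equiprobability assumption.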

Example 3.4. Hemophilia is a sex-linked hereditary blood defect of males characterized by delayed clotting of the blood which makes it difficult to control bleeding. When a woman is a carrier of classical hemophilia, there is a 50 percent chance that a male child


will inherit this disease. If a carrier gives birth to two males (not twins), what is the

probability that either will have the disease? both will have the disease?

Solution. We can envision the process of having two male children as an experiment with sample space

S = {++, +−, −+, −−},

where + means the male offspring has the disease; − means the male does not have the disease. To compute the probabilities requested in this problem, we will assume that each outcome in S is equally likely. Define the events:

A = {first child has disease} = {++, +−}
B = {second child has disease} = {++, −+}.

The union and intersection of A and B are, respectively,

A ∪ B = {either child has disease} = {++, +−, −+}
A ∩ B = {both children have disease} = {++}.

The probability that either male child will have the disease is

P(A ∪ B) = n_{A∪B} / n_S = 3/4 = 0.75.

The probability that both male children will have the disease is

P(A ∩ B) = n_{A∩B} / n_S = 1/4 = 0.25.
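The same counts fall out of a brute-force enumeration under the equally-likely assumption. A small Python check:

```python
from itertools import product

# Two male children; '+' = has hemophilia, '-' = does not.
# All four outcomes are assumed equally likely (independent births, P(+) = 1/2).
S = list(product("+-", repeat=2))   # (+,+), (+,-), (-,+), (-,-)

A = {s for s in S if s[0] == "+"}   # first child has disease
B = {s for s in S if s[1] == "+"}   # second child has disease

print(len(A | B) / len(S))   # 0.75  (either child has disease)
print(len(A & B) / len(S))   # 0.25  (both children have disease)
```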

3.2.3 Axioms and rules

KOLMOGOROV AXIOMS: For any sample space S, a probability P must satisfy

(1) 0 ≤ P(A) ≤ 1, for any event A

(2) P(S) = 1

(3) If A₁, A₂, ..., Aₙ are pairwise mutually exclusive events, then

P(A₁ ∪ A₂ ∪ ⋯ ∪ Aₙ) = P(A₁) + P(A₂) + ⋯ + P(Aₙ).

• The term "pairwise mutually exclusive" means that Aᵢ ∩ Aⱼ = ∅, for all i ≠ j.

• The event A₁ ∪ A₂ ∪ ⋯ ∪ Aₙ, in words, is read "at least one Aᵢ occurs."

3.2.4 Conditional probability and independence

IDEA: In some situations, we may be fortunate enough to have prior knowledge about

the likelihood of other events related to the event of interest. We can then incorporate

this information into a probability calculation.

TERMINOLOGY: Let A and B be events in a sample space S with P(B) > 0. The conditional probability of A, given that B has occurred, is

P(A|B) = P(A ∩ B) / P(B).

Similarly,

P(B|A) = P(A ∩ B) / P(A).

Example 3.5. In a company, 36 percent of the employees have a degree from a SEC

university, 22 percent of the employees that have a degree from the SEC are also engineers,

and 30 percent of the employees are engineers. An employee is selected at random.

(a) Compute the probability that the employee is an engineer and is from the SEC.

(b) Compute the conditional probability that the employee is from the SEC, given that

s/he is an engineer.

Solution: Define the events

A = {employee is an engineer}
B = {employee is from the SEC}.


From the information in the problem, we are given: P(A) = 0.30, P(B) = 0.36, and P(A|B) = 0.22. In part (a), we want P(A ∩ B). Note that

0.22 = P(A|B) = P(A ∩ B) / P(B) = P(A ∩ B) / 0.36.

Therefore,

P(A ∩ B) = 0.22(0.36) = 0.0792.

In part (b), we want P(B|A). From the definition of conditional probability:

P(B|A) = P(A ∩ B) / P(A) = 0.0792/0.30 = 0.264.
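The arithmetic in Example 3.5 is a direct rearrangement of the conditional probability definition. A quick Python check:

```python
# Known quantities from Example 3.5.
P_A = 0.30          # P(engineer)
P_B = 0.36          # P(SEC degree)
P_A_given_B = 0.22  # P(engineer | SEC degree)

# (a) Rearranging P(A|B) = P(A and B) / P(B) gives the intersection.
P_A_and_B = P_A_given_B * P_B
print(round(P_A_and_B, 4))    # 0.0792

# (b) P(SEC degree | engineer) = P(A and B) / P(A).
P_B_given_A = P_A_and_B / P_A
print(round(P_B_given_A, 3))  # 0.264
```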

IMPORTANT: Note that, in this example, the conditional probability P(B|A) and the unconditional probability P(B) are not equal.

• In other words, knowledge that A has occurred has changed the likelihood that B occurs.

• In other situations, it might be that the occurrence (or non-occurrence) of a companion event has no effect on the probability of the event of interest. This leads us to the definition of independence.

TERMINOLOGY: When the occurrence or non-occurrence of B has no effect on whether or not A occurs, and vice-versa, we say that the events A and B are independent. Mathematically, we define A and B to be independent if and only if

P(A ∩ B) = P(A)P(B).

Note that if A and B are independent,

P(A|B) = P(A ∩ B) / P(B) = P(A)P(B) / P(B) = P(A)

and

P(B|A) = P(B ∩ A) / P(A) = P(B)P(A) / P(A) = P(B).

Note: These results only apply if A and B are independent. In other words, if A and B are not independent, then these rules do not apply.


Example 3.6. In an engineering system, two components are placed in a series; that is, the system is functional as long as both components are. Each component is functional with probability 0.95. Define the events

A₁ = {component 1 is functional}
A₂ = {component 2 is functional}

so that P(A₁) = 0.95 and P(A₂) = 0.95. The probability that the system is functional is given by P(A₁ ∩ A₂).

• If the components operate independently, then A₁ and A₂ are independent events so that

P(A₁ ∩ A₂) = P(A₁)P(A₂) = 0.95(0.95) = 0.9025.

• If the components do not operate independently; e.g., failure of one component wears on the other, then we cannot compute P(A₁ ∩ A₂) without additional knowledge.
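Under the independence assumption, the series-system reliability is just a product. A one-line Python check of the number above:

```python
# Series system: functional only if both components are functional.
# Assumes independence, as in the first bullet of Example 3.6.
p1 = p2 = 0.95

p_system = p1 * p2
print(round(p_system, 4))   # 0.9025
```

Note that the product rule fails without independence; dependent component failures would require the joint probability directly.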

EXTENSION: The notion of independence extends to any finite collection of events A₁, A₂, ..., Aₙ. Mutual independence means that the probability of the intersection of any sub-collection of A₁, A₂, ..., Aₙ equals the product of the probabilities of the events in the sub-collection. For example, if A₁, A₂, A₃, and A₄ are mutually independent, then

P(A₁ ∩ A₂) = P(A₁)P(A₂)
P(A₁ ∩ A₂ ∩ A₃) = P(A₁)P(A₂)P(A₃)
P(A₁ ∩ A₂ ∩ A₃ ∩ A₄) = P(A₁)P(A₂)P(A₃)P(A₄).

3.2.5 Additional probability rules

TERMINOLOGY: Suppose that S is a sample space and that A is an event. The complement of A, denoted by A', is the set of all outcomes in S not in A. That is,

A' = {ω ∈ S : ω ∉ A}.


1. Complement rule: Suppose that A is an event. Then

P(A') = 1 − P(A).

2. Additive law: Suppose that A and B are two events. Then

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

3. Multiplicative law: Suppose that A and B are two events. Then

P(A ∩ B) = P(B|A)P(A) = P(A|B)P(B).

4. Law of Total Probability (LOTP): Suppose that A and B are two events. Then

P(A) = P(A|B)P(B) + P(A|B')P(B').

5. Bayes' Rule: Suppose that A and B are two events. Then

P(B|A) = P(A|B)P(B) / P(A) = P(A|B)P(B) / [P(A|B)P(B) + P(A|B')P(B')].

Example 3.7. The probability that train 1 is on time is 0.95. The probability that train 2 is on time is 0.93. The probability that both are on time is 0.90. Define the events

A₁ = {train 1 is on time}
A₂ = {train 2 is on time}.

We are given that P(A₁) = 0.95, P(A₂) = 0.93, and P(A₁ ∩ A₂) = 0.90.

(a) What is the probability that train 1 is not on time?

P(A₁') = 1 − P(A₁) = 1 − 0.95 = 0.05.

(b) What is the probability that at least one train is on time?

P(A₁ ∪ A₂) = P(A₁) + P(A₂) − P(A₁ ∩ A₂)
           = 0.95 + 0.93 − 0.90 = 0.98.


(c) What is the probability that train 1 is on time given that train 2 is on time?

P(A₁|A₂) = P(A₁ ∩ A₂) / P(A₂) = 0.90/0.93 ≈ 0.968.

(d) What is the probability that train 2 is on time given that train 1 is not on time?

P(A₂|A₁') = P(A₁' ∩ A₂) / P(A₁') = [P(A₂) − P(A₁ ∩ A₂)] / [1 − P(A₁)]
          = (0.93 − 0.90) / (1 − 0.95) = 0.60.

(e) Are A₁ and A₂ independent events?

Answer: They are not independent because

P(A₁ ∩ A₂) ≠ P(A₁)P(A₂).

Equivalently, note that P(A₁|A₂) ≠ P(A₁). In other words, knowledge that A₂ has occurred changes the likelihood that A₁ occurs.
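All five parts of Example 3.7 reduce to arithmetic on the three given probabilities. A compact Python check:

```python
# Given probabilities from Example 3.7.
P_A1 = 0.95     # P(train 1 on time)
P_A2 = 0.93     # P(train 2 on time)
P_both = 0.90   # P(both on time)

# (a) complement rule
print(round(1 - P_A1, 2))                       # 0.05
# (b) additive law
print(round(P_A1 + P_A2 - P_both, 2))           # 0.98
# (c) conditional probability P(A1 | A2)
print(round(P_both / P_A2, 3))                  # 0.968
# (d) P(A2 | A1') = [P(A2) - P(both)] / [1 - P(A1)]
print(round((P_A2 - P_both) / (1 - P_A1), 2))   # 0.6
# (e) independence check: does P(both) equal P(A1)P(A2)?
print(abs(P_both - P_A1 * P_A2) < 1e-9)         # False, so not independent
```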

Example 3.8. An insurance company classifies people as accident-prone and non-accident-prone. For a fixed year, the probability that an accident-prone person has an accident is 0.4, and the probability that a non-accident-prone person has an accident is 0.2. The population is estimated to be 30 percent accident-prone. Define the events

A = {policy holder has an accident}
B = {policy holder is accident-prone}.

We are given that P(B) = 0.3, P(A|B) = 0.4, and P(A|B') = 0.2.

(a) What is the probability that a new policy-holder will have an accident?

Solution: By the Law of Total Probability,

P(A) = P(A|B)P(B) + P(A|B')P(B')
     = 0.4(0.3) + 0.2(0.7) = 0.26.


(b) Suppose that the policy-holder does have an accident. What is the probability that s/he was accident-prone?

Solution: We want P(B|A). By Bayes' Rule,

P(B|A) = P(A|B)P(B) / [P(A|B)P(B) + P(A|B')P(B')]
       = 0.4(0.3) / [0.4(0.3) + 0.2(0.7)] ≈ 0.46.
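The LOTP and Bayes' Rule computations in Example 3.8 can be verified directly. A short Python check:

```python
# Given quantities from Example 3.8.
P_B = 0.3            # P(accident-prone)
P_A_given_B = 0.4    # P(accident | accident-prone)
P_A_given_Bc = 0.2   # P(accident | not accident-prone)

# (a) Law of Total Probability: average over the two groups.
P_A = P_A_given_B * P_B + P_A_given_Bc * (1 - P_B)
print(round(P_A, 2))          # 0.26

# (b) Bayes' Rule: reverse the conditioning.
P_B_given_A = P_A_given_B * P_B / P_A
print(round(P_B_given_A, 4))  # 0.4615 (about 0.46)
```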

3.3 Random variables and distributions

TERMINOLOGY: A random variable Y is a variable whose value is determined by

chance. The distribution of a random variable consists of two parts:

1. an elicitation of the set of all possible values of Y (called the support)

2. a function that describes how to assign probabilities to events involving Y .

NOTATION: By convention, we denote random variables by upper case letters towards the end of the alphabet; e.g., W, X, Y, Z, etc. A possible value of Y (i.e., a value in the support) is denoted generically by the lower case version y. In words, the symbol

P(Y = y)

is read, "the probability that the random variable Y equals the value y." The symbol

F_Y(y) ≡ P(Y ≤ y)

is read, "the probability that the random variable Y is less than or equal to the value y." This probability is called the cumulative distribution function of Y and will be discussed later.

TERMINOLOGY: If a random variable Y can assume only a finite (or countable) number of values, we call Y a discrete random variable. If it makes more sense to envision Y as assuming values in an interval of numbers, we call Y a continuous random variable.


Example 3.9. Classify the following random variables as discrete or continuous and specify the support of each random variable.

V = number of unbroken eggs in a randomly selected carton (dozen)
W = pH of an aqueous solution
X = length of time between accidents at a factory
Y = whether or not you pass this class
Z = number of aircraft arriving tomorrow at CAE.

• The random variable V is discrete. It can assume values in

{v : v = 0, 1, 2, ..., 12}.

• The random variable W is continuous. It most certainly assumes values in

{w : −∞ < w < ∞}.

Of course, with most solutions, it is more likely that W is not negative (although this is possible) and not larger than, say, 15 (a very reasonable upper bound). However, the choice of {w : −∞ < w < ∞} is not mathematically incongruous with these practical constraints.

• The random variable X is continuous. It can assume values in

{x : x > 0}.

The key feature here is that a time cannot be negative. In theory, it is possible that X can be very large.

• The random variable Y is discrete. It can assume values in

{y : y = 0, 1},

where I have arbitrarily labeled 1 for passing and 0 for failing. Random variables that can assume exactly 2 values (e.g., 0, 1) are called binary.


• The random variable Z is discrete. It can assume values in

{z : z = 0, 1, 2, ...}.

I have allowed for the possibility of a very large number of aircraft arriving.

3.4 Discrete random variables

TERMINOLOGY: Suppose that Y is a discrete random variable. The function

p_Y(y) = P(Y = y)

is called the probability mass function (pmf) for Y . The pmf p_Y(y) is a function that assigns probabilities to each possible value of Y .

PROPERTIES: A pmf p_Y(y) for a discrete random variable Y satisfies the following:

1. 0 < p_Y(y) < 1, for all possible values of y

2. The sum of the probabilities, taken over all possible values of Y , must equal 1; i.e.,

Σ_{all y} p_Y(y) = 1.

Example 3.10. A mail-order computer business has six telephone lines. Let Y denote the number of lines in use at a specific time. Suppose that the probability mass function (pmf) of Y is given by

y        0     1     2     3     4     5     6
p_Y(y)   0.10  0.15  0.20  0.25  0.20  0.06  0.04

• Figure 3.1 (left) displays p_Y(y), the probability mass function (pmf) of Y .

• The height of the bar above y is equal to p_Y(y) = P(Y = y).

• If y is not equal to 0, 1, 2, 3, 4, 5, 6, then p_Y(y) = 0.


Figure 3.1: PMF (left) and CDF (right) of Y in Example 3.10.

• Figure 3.1 (right) displays the cumulative distribution function (cdf) of Y ,

F_Y(y) = P(Y ≤ y).

• F_Y(y) is a nondecreasing function.

• 0 ≤ F_Y(y) ≤ 1; this makes sense since F_Y(y) = P(Y ≤ y) is a probability!

• The cdf F_Y(y) in this example (Y is discrete) takes a step at each possible value of Y and stays constant otherwise.

• The height of the step at a particular y is equal to p_Y(y) = P(Y = y).

Here is the table I gave you above, but now I have added the values of the cumulative distribution function:

y        0     1     2     3     4     5     6
p_Y(y)   0.10  0.15  0.20  0.25  0.20  0.06  0.04
F_Y(y)   0.10  0.25  0.45  0.70  0.90  0.96  1.00


(a) What is the probability that exactly two lines are in use?

p_Y(2) = P(Y = 2) = 0.20.

(b) What is the probability that at most two lines are in use?

P(Y ≤ 2) = P(Y = 0) + P(Y = 1) + P(Y = 2)
         = p_Y(0) + p_Y(1) + p_Y(2)
         = 0.10 + 0.15 + 0.20 = 0.45.

Note: This is also equal to F_Y(2) = 0.45.

(c) What is the probability that at least five lines are in use?

P(Y ≥ 5) = P(Y = 5) + P(Y = 6)
         = p_Y(5) + p_Y(6) = 0.06 + 0.04 = 0.10.

It is also important to note that in part (c), we could have computed

P(Y ≥ 5) = 1 − P(Y ≤ 4) = 1 − F_Y(4) = 1 − 0.90 = 0.10.

TERMINOLOGY: Let Y be a discrete random variable with pmf p_Y(y). The expected value of Y is given by

μ = E(Y) = Σ_{all y} y p_Y(y).

The expected value for a discrete random variable Y is simply a weighted average of the possible values of Y. Each value y is weighted by its probability p_Y(y). In statistical applications, μ = E(Y) is commonly called the population mean.

Example 3.11. In Example 3.10, we examined the distribution of Y, the number of lines in use at a specified time. The probability mass function (pmf) of Y is given by

y        0     1     2     3     4     5     6
p_Y(y)  0.10  0.15  0.20  0.25  0.20  0.06  0.04


The expected value of Y is

μ = E(Y) = Σ_{all y} y p_Y(y)
         = 0(0.10) + 1(0.15) + 2(0.20) + 3(0.25) + 4(0.20) + 5(0.06) + 6(0.04)
         = 2.64.

Interpretation: On average, we would expect 2.64 lines to be in use at the specified time.

Interpretation: Over the long run, if we observed many values of Y at this specified time (and this pmf was applicable each time), then the average of these Y observations would be close to 2.64.

Interpretation: Place an × at μ = 2.64 in Figure 3.1 (left). This represents the balance point of the probability mass function.
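The notes use R for computation; as an independent arithmetic cross-check (outside the course software), the weighted-average calculation above can be sketched in a few lines of Python:

```python
# Expected value of a discrete random variable: a weighted average of the
# support values, each weighted by its pmf probability (Examples 3.10/3.11).
support = [0, 1, 2, 3, 4, 5, 6]
pmf = [0.10, 0.15, 0.20, 0.25, 0.20, 0.06, 0.04]

# Sanity check: the probabilities must sum to 1.
assert abs(sum(pmf) - 1.0) < 1e-9

mu = sum(y * p for y, p in zip(support, pmf))
print(round(mu, 2))  # 2.64
```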

FUNCTIONS: Let Y be a discrete random variable with pmf p_Y(y). Suppose that g is a real-valued function. Then, g(Y) is a random variable and

E[g(Y)] = Σ_{all y} g(y) p_Y(y).

PROPERTIES OF EXPECTATIONS: Let Y be a discrete random variable with pmf p_Y(y). Suppose that g, g_1, g_2, ..., g_k are real-valued functions, and let c be any real constant. Expectations satisfy the following (linearity) properties:

(a) E(c) = c

(b) E[c g(Y)] = c E[g(Y)]

(c) E[Σ_{j=1}^{k} g_j(Y)] = Σ_{j=1}^{k} E[g_j(Y)].

Note: These rules are also applicable if Y is continuous (coming up).

Example 3.12. In a one-hour period, the number of gallons of a certain toxic chemical that is produced at a local plant, say Y, has the following pmf:

y       0    1    2    3
p_Y(y) 0.2  0.3  0.3  0.2



Figure 3.2: PMF (left) and CDF (right) of Y in Example 3.12.

(a) Compute the expected number of gallons produced during a one-hour period.

Solution: The expected value of Y is

μ = E(Y) = Σ_{all y} y p_Y(y) = 0(0.2) + 1(0.3) + 2(0.3) + 3(0.2) = 1.5.

We would expect 1.5 gallons of the toxic chemical to be produced per hour (on average).

(b) The cost (in hundreds of dollars) to produce Y gallons is given by the cost function g(Y) = 3 + 12Y + 2Y². What is the expected cost in a one-hour period?

Solution: We want to compute E[g(Y)]. We first compute E(Y²):

E(Y²) = Σ_{all y} y² p_Y(y) = 0²(0.2) + 1²(0.3) + 2²(0.3) + 3²(0.2) = 3.3.

Therefore,

E[g(Y)] = E(3 + 12Y + 2Y²) = 3 + 12E(Y) + 2E(Y²)
        = 3 + 12(1.5) + 2(3.3) = 27.6.

The expected hourly cost is $2,760.00.
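The linearity properties used in part (b) are easy to verify numerically. A minimal Python sketch (an aside, not part of the course's R code) computes E[g(Y)] both directly from the definition and via linearity, and confirms they agree:

```python
# Expected cost E[g(Y)] for g(Y) = 3 + 12Y + 2Y^2 (Example 3.12), computed
# two ways: directly as sum of g(y) p_Y(y), and via linearity 3 + 12E(Y) + 2E(Y^2).
support = [0, 1, 2, 3]
pmf = [0.2, 0.3, 0.3, 0.2]

EY = sum(y * p for y, p in zip(support, pmf))        # E(Y)  = 1.5
EY2 = sum(y**2 * p for y, p in zip(support, pmf))    # E(Y^2) = 3.3
direct = sum((3 + 12*y + 2*y**2) * p for y, p in zip(support, pmf))
via_linearity = 3 + 12*EY + 2*EY2

assert abs(direct - via_linearity) < 1e-9
print(round(via_linearity, 1))  # 27.6 (hundreds of dollars)
```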


TERMINOLOGY: Let Y be a discrete random variable with pmf p_Y(y) and expected value E(Y) = μ. The population variance of Y is given by

σ² ≡ var(Y) ≡ E[(Y − μ)²] = Σ_{all y} (y − μ)² p_Y(y).

The population standard deviation of Y is the positive square root of the variance:

σ = √σ² = √var(Y).

FACTS: The population variance σ² satisfies the following:

(a) σ² ≥ 0. σ² = 0 if and only if the random variable Y has a degenerate distribution; i.e., all the probability mass is located at one support point.

(b) The larger (smaller) σ² is, the more (less) spread in the possible values of Y about the population mean μ = E(Y).

(c) σ² is measured in (units)² and σ is measured in the original units.

COMPUTING FORMULA: Let Y be a random variable with population mean E(Y) = μ. An alternative computing formula for the population variance is

var(Y) = E[(Y − μ)²] = E(Y²) − [E(Y)]².

This formula is easy to remember and makes subsequent calculations easier.

Example 3.13. In Example 3.12, we examined the pmf for the number of gallons of a certain toxic chemical that is produced at a local plant, denoted by Y. The pmf of Y is

y       0    1    2    3
p_Y(y) 0.2  0.3  0.3  0.2

In Example 3.12, we computed

E(Y) = 1.5
E(Y²) = 3.3.


The population variance of Y is

σ² = var(Y) = E(Y²) − [E(Y)]² = 3.3 − (1.5)² = 1.05.

The population standard deviation of Y is

σ = √σ² = √1.05 ≈ 1.025.
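As a quick cross-check (in Python rather than the notes' R, purely as an aside), the definitional formula E[(Y − μ)²] and the computing formula E(Y²) − [E(Y)]² give the same answer for this pmf:

```python
# Population variance of Y (Example 3.13), computed two equivalent ways:
# the definition E[(Y - mu)^2] and the computing formula E(Y^2) - [E(Y)]^2.
support = [0, 1, 2, 3]
pmf = [0.2, 0.3, 0.3, 0.2]

mu = sum(y * p for y, p in zip(support, pmf))                 # 1.5
var_def = sum((y - mu)**2 * p for y, p in zip(support, pmf))  # definitional
var_cf = sum(y**2 * p for y, p in zip(support, pmf)) - mu**2  # computing formula

assert abs(var_def - var_cf) < 1e-9
print(round(var_cf, 2), round(var_cf**0.5, 3))  # 1.05 1.025
```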

3.4.1 Binomial distribution

BERNOULLI TRIALS: Many experiments can be envisioned as consisting of a sequence

of trials, where

1. each trial results in a success or a failure,

2. the trials are independent, and

3. the probability of success, denoted by p, 0 < p < 1, is the same on every trial.

Example 3.14. We give examples of situations that can be thought of as observing Bernoulli trials.

When circuit boards used in the manufacture of Blu-ray players are tested, the long-run percentage of defective boards is 5 percent.

circuit board = trial
defective board is observed = success
p = P(success) = P(defective board) = 0.05.

Ninety-eight percent of all air traffic radar signals are correctly interpreted the first time they are transmitted.

radar signal = trial
signal is correctly interpreted = success
p = P(success) = P(correct interpretation) = 0.98.


Albino rats used to study the hormonal regulation of a metabolic pathway are injected with a drug that inhibits body synthesis of protein. The probability that a rat will die from the drug before the study is complete is 0.20.

rat = trial
rat dies before study is over = success
p = P(success) = P(dies early) = 0.20.

TERMINOLOGY: Suppose that n Bernoulli trials are performed. Define

Y = the number of successes (out of n trials performed).

We say that Y has a binomial distribution with number of trials n and success probability p. Shorthand notation is Y ∼ b(n, p).

PMF: If Y ∼ b(n, p), then the probability mass function of Y is given by

p_Y(y) = C(n, y) p^y (1 − p)^(n−y),  y = 0, 1, 2, ..., n
       = 0, otherwise,

where C(n, y) = n!/[y!(n − y)!] denotes the binomial coefficient.

MEAN/VARIANCE: If Y ∼ b(n, p), then

E(Y) = np
var(Y) = np(1 − p).

Example 3.15. In an agricultural study, it is determined that 40 percent of all plots

respond to a certain treatment. Four plots are observed. In this situation, we interpret

plot of land = trial

plot responds to treatment = success

p = P(success) = P(responds to treatment) = 0.4.


Figure 3.3: PMF (left) and CDF (right) of Y ∼ b(n = 4, p = 0.4) in Example 3.15.

If the Bernoulli trial assumptions hold (independent plots, same response probability for each plot), then

Y = the number of plots which respond ∼ b(n = 4, p = 0.4).

(a) What is the probability that exactly two plots respond?

P(Y = 2) = p_Y(2) = C(4, 2)(0.4)²(1 − 0.4)^(4−2) = 6(0.4)²(0.6)² = 0.3456.

(b) What is the probability that at least one plot responds?

P(Y ≥ 1) = 1 − P(Y = 0) = 1 − C(4, 0)(0.4)⁰(1 − 0.4)^(4−0) = 1 − 1(1)(0.6)⁴ = 0.8704.

(c) What are E(Y) and var(Y)?

E(Y) = np = 4(0.4) = 1.6
var(Y) = np(1 − p) = 4(0.4)(0.6) = 0.96.
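These by-hand answers can be reproduced directly from the pmf formula. A short Python sketch (an independent check; the course itself uses R):

```python
# Binomial pmf computed from its formula, C(n,y) p^y (1-p)^(n-y), and used
# to reproduce the Example 3.15 answers for Y ~ b(n = 4, p = 0.4).
from math import comb

def binom_pmf(y, n, p):
    return comb(n, y) * p**y * (1 - p)**(n - y)

n, p = 4, 0.4
assert abs(binom_pmf(2, n, p) - 0.3456) < 1e-9      # part (a): P(Y = 2)
assert abs(1 - binom_pmf(0, n, p) - 0.8704) < 1e-9  # part (b): P(Y >= 1)

# Moments computed from the pmf agree with the formulas np and np(1-p).
mean = sum(y * binom_pmf(y, n, p) for y in range(n + 1))
var = sum(y**2 * binom_pmf(y, n, p) for y in range(n + 1)) - mean**2
assert abs(mean - n * p) < 1e-9           # E(Y) = 1.6
assert abs(var - n * p * (1 - p)) < 1e-9  # var(Y) = 0.96
print(mean, var)
```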


Example 3.16. An electronics manufacturer claims that 10 percent of its power supply

units need servicing during the warranty period. To investigate this claim, technicians

at a testing laboratory purchase 30 units and subject each one to an accelerated testing

protocol to simulate use during the warranty period. In this situation, we interpret

power supply unit = trial

supply unit needs servicing during warranty period = success

p = P(success) = P(supply unit needs servicing) = 0.1.

If the Bernoulli trial assumptions hold (independent units, same probability of needing service for each unit), then

Y = the number of units requiring service during warranty period ∼ b(n = 30, p = 0.1).

Note: Instead of computing probabilities by hand, we will use R.

BINOMIAL R CODE: Suppose that Y ∼ b(n, p).

p_Y(y) = P(Y = y)    F_Y(y) = P(Y ≤ y)
dbinom(y,n,p)        pbinom(y,n,p)

(a) What is the probability that exactly five of the 30 power supply units require servicing during the warranty period?

p_Y(5) = P(Y = 5) = C(30, 5)(0.1)⁵(1 − 0.1)^(30−5)
       = dbinom(5,30,0.1) = 0.1023048.

(b) What is the probability that at most five of the 30 power supply units require servicing during the warranty period?

F_Y(5) = P(Y ≤ 5) = Σ_{y=0}^{5} C(30, y)(0.1)^y (1 − 0.1)^(30−y)
       = pbinom(5,30,0.1) = 0.9268099.


Figure 3.4: PMF (left) and CDF (right) of Y ∼ b(n = 30, p = 0.1) in Example 3.16.

(c) What is the probability that at least five of the 30 power supply units require service?

P(Y ≥ 5) = 1 − P(Y ≤ 4) = 1 − Σ_{y=0}^{4} C(30, y)(0.1)^y (1 − 0.1)^(30−y)
         = 1-pbinom(4,30,0.1) = 0.1754949.

(d) What is P(2 ≤ Y ≤ 8)?

P(2 ≤ Y ≤ 8) = Σ_{y=2}^{8} C(30, y)(0.1)^y (1 − 0.1)^(30−y).

One way to get this in R is to use the command:

> sum(dbinom(2:8,30,0.1))
[1] 0.8142852

The dbinom(2:8,30,0.1) command creates a vector containing p_Y(2), p_Y(3), ..., p_Y(8), and the sum command adds them. Another way to calculate this probability in R is

> pbinom(8,30,0.1)-pbinom(1,30,0.1)
[1] 0.8142852
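The equivalence of the two R approaches (summing pmf values versus differencing the cdf) is just algebra, and can be checked with nothing more than the pmf formula. A pure-Python sketch, no packages assumed:

```python
# Part (d) of Example 3.16 computed two ways, mirroring the two R commands:
# summing pmf values p_Y(2) + ... + p_Y(8), and differencing cumulative sums
# (pbinom(8,...) - pbinom(1,...)).
from math import comb

def pmf(y, n=30, p=0.1):
    return comb(n, y) * p**y * (1 - p)**(n - y)

sum_of_pmfs = sum(pmf(y) for y in range(2, 9))  # sum(dbinom(2:8,30,0.1))
cdf_diff = sum(pmf(y) for y in range(0, 9)) - sum(pmf(y) for y in range(0, 2))

assert abs(sum_of_pmfs - cdf_diff) < 1e-12
print(sum_of_pmfs)  # ≈ 0.8142852
```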


3.4.2 Geometric distribution

NOTE: The geometric distribution also arises in experiments involving Bernoulli trials:

1. Each trial results in a success or a failure.

2. The trials are independent.

3. The probability of success, denoted by p, 0 < p < 1, is the same on every trial.

TERMINOLOGY: Suppose that Bernoulli trials are continually observed. Define

Y = the number of trials to observe the first success.

We say that Y has a geometric distribution with success probability p. Shorthand notation is Y ∼ geom(p).

PMF: If Y ∼ geom(p), then the probability mass function of Y is given by

p_Y(y) = (1 − p)^(y−1) p,  y = 1, 2, 3, ...
       = 0, otherwise.

MEAN/VARIANCE: If Y ∼ geom(p), then

E(Y) = 1/p
var(Y) = (1 − p)/p².

Example 3.17. Biology students are checking the eye color of fruit flies. For each fly, the probability of observing white eyes is p = 0.25. In this situation, we interpret

fruit fly = trial
fly has white eyes = success
p = P(success) = P(white eyes) = 0.25.


Figure 3.5: PMF (left) and CDF (right) of Y ∼ geom(p = 0.25) in Example 3.17.

If the Bernoulli trial assumptions hold (independent flies, same probability of white eyes for each fly), then

Y = the number of flies needed to find the first white-eyed fly ∼ geom(p = 0.25).

(a) What is the probability the first white-eyed fly is observed on the fifth fly checked?

p_Y(5) = P(Y = 5) = (1 − 0.25)^(5−1)(0.25) = (0.75)⁴(0.25) ≈ 0.079.

(b) What is the probability the first white-eyed fly is observed before the fourth fly is examined? Note: For this to occur, we must observe the first white-eyed fly (success) on either the first, second, or third fly.

F_Y(3) = P(Y ≤ 3) = P(Y = 1) + P(Y = 2) + P(Y = 3)
       = (1 − 0.25)^(1−1)(0.25) + (1 − 0.25)^(2−1)(0.25) + (1 − 0.25)^(3−1)(0.25)
       = 0.25 + 0.1875 + 0.140625 ≈ 0.578.


GEOMETRIC R CODE: Suppose that Y ∼ geom(p).

p_Y(y) = P(Y = y)    F_Y(y) = P(Y ≤ y)
dgeom(y-1,p)         pgeom(y-1,p)

> dgeom(5-1,0.25) ## Part (a)
[1] 0.07910156
> pgeom(3-1,0.25) ## Part (b)
[1] 0.578125
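The y-1 shift in the R calls is worth pausing on: R's dgeom counts the failures before the first success, while this course's Y counts the trials. Coding the course's pmf directly (a Python aside, not the course code) makes the convention explicit and reproduces the same numbers:

```python
# Geometric pmf p_Y(y) = (1-p)^(y-1) p for Y = trial number of first success,
# reproducing the Example 3.17 answers without the dgeom(y-1, p) shift.
def geom_pmf(y, p):
    return (1 - p)**(y - 1) * p

p = 0.25
part_a = geom_pmf(5, p)                          # P(Y = 5)
part_b = sum(geom_pmf(y, p) for y in (1, 2, 3))  # P(Y <= 3)

assert abs(part_a - 0.07910156) < 1e-7
assert abs(part_b - 0.578125) < 1e-9
print(part_a, part_b)
```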

3.4.3 Negative binomial distribution

NOTE: The negative binomial distribution also arises in experiments involving Bernoulli

trials:

1. Each trial results in a success or a failure.

2. The trials are independent.

3. The probability of success, denoted by p, 0 < p < 1, is the same on every trial.

TERMINOLOGY: Suppose that Bernoulli trials are continually observed. Define

Y = the number of trials to observe the rth success.

We say that Y has a negative binomial distribution with waiting parameter r and success probability p. Shorthand notation is Y ∼ nib(r, p).

REMARK: Note that the negative binomial distribution is a mere generalization of the geometric. If r = 1, then the nib(r, p) distribution reduces to the geom(p).

PMF: If Y ∼ nib(r, p), then the probability mass function of Y is given by

p_Y(y) = C(y − 1, r − 1) p^r (1 − p)^(y−r),  y = r, r + 1, r + 2, ...
       = 0, otherwise.


MEAN/VARIANCE: If Y ∼ nib(r, p), then

E(Y) = r/p
var(Y) = r(1 − p)/p².

Example 3.18. At an automotive paint plant, 15 percent of all batches sent to the lab for chemical analysis do not conform to specifications. In this situation, we interpret

batch = trial
batch does not conform = success
p = P(success) = P(not conforming) = 0.15.

If the Bernoulli trial assumptions hold (independent batches, same probability of nonconforming for each batch), then

Y = the number of batches needed to find the third nonconforming batch ∼ nib(r = 3, p = 0.15).

(a) What is the probability the third nonconforming batch is observed on the tenth batch sent to the lab?

p_Y(10) = P(Y = 10) = C(10 − 1, 3 − 1)(0.15)³(1 − 0.15)^(10−3)
        = C(9, 2)(0.15)³(0.85)⁷ ≈ 0.039.

(b) What is the probability that no more than two nonconforming batches will be observed among the first 30 batches sent to the lab? Note: This means the third nonconforming batch must be observed on the 31st batch observed, the 32nd, the 33rd, etc.

P(Y ≥ 31) = 1 − P(Y ≤ 30)
          = 1 − Σ_{y=3}^{30} C(y − 1, 3 − 1)(0.15)³(0.85)^(y−3) ≈ 0.151.


Figure 3.6: PMF (left) and CDF (right) of Y ∼ nib(r = 3, p = 0.15) in Example 3.18.

NEGATIVE BINOMIAL R CODE: Suppose that Y ∼ nib(r, p).

p_Y(y) = P(Y = y)      F_Y(y) = P(Y ≤ y)
dnbinom(y-r,r,p)       pnbinom(y-r,r,p)

> dnbinom(10-3,3,0.15) ## Part (a)
[1] 0.03895012
> 1-pnbinom(30-3,3,0.15) ## Part (b)
[1] 0.1514006
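As with the geometric, R's dnbinom counts failures, hence the y-r shift above. Coding the course's trial-count pmf directly reproduces both answers; a Python sketch offered as an aside:

```python
# Negative binomial pmf p_Y(y) = C(y-1, r-1) p^r (1-p)^(y-r) for Y = trial
# number of the r-th success, reproducing the Example 3.18 answers.
from math import comb

def nib_pmf(y, r, p):
    return comb(y - 1, r - 1) * p**r * (1 - p)**(y - r)

r, p = 3, 0.15
part_a = nib_pmf(10, r, p)                                # P(Y = 10)
part_b = 1 - sum(nib_pmf(y, r, p) for y in range(r, 31))  # P(Y >= 31)

assert abs(part_a - 0.03895012) < 1e-7
assert abs(part_b - 0.1514006) < 1e-6
print(part_a, part_b)
```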

3.4.4 Hypergeometric distribution

SETTING: Consider a population of N objects and suppose that each object belongs to one of two dichotomous classes: Class 1 and Class 2. For example, the objects (classes) might be people (infected/not), parts (conforming/not), plots of land (respond to treatment/not), etc. In the population of interest, we have

N = total number of objects
r = number of objects in Class 1
N − r = number of objects in Class 2.

Envision taking a sample of n objects from the population (objects are selected at random and without replacement). Define

Y = the number of objects in Class 1 (out of the n selected).

We say that Y has a hypergeometric distribution and write Y ∼ hyper(N, n, r).

PMF: If Y ∼ hyper(N, n, r), then the probability mass function of Y is given by

p_Y(y) = C(r, y) C(N − r, n − y) / C(N, n),  for y ≤ r and n − y ≤ N − r
       = 0, otherwise.

MEAN/VARIANCE: If Y ∼ hyper(N, n, r), then

E(Y) = n(r/N)
var(Y) = n(r/N)[(N − r)/N][(N − n)/(N − 1)].

Example 3.19. A supplier ships parts to a company in lots of 100 parts. The company has an acceptance sampling plan which adopts the following acceptance rule:

"...sample 5 parts at random and without replacement. If there are no defectives in the sample, accept the entire lot; otherwise, reject the entire lot."

In this example, the population size is N = 100. The sample size is n = 5. Define the random variable

Y = the number of defectives in the sample ∼ hyper(N = 100, n = 5, r).


Figure 3.7: PMF (left) and CDF (right) of Y ∼ hyper(N = 100, n = 5, r = 10) in Example 3.19.

(a) If r = 10, what is the probability that the lot will be accepted? Note: The lot will be accepted only if Y = 0.

p_Y(0) = P(Y = 0) = C(10, 0) C(90, 5) / C(100, 5) = 1(43949268)/75287520 ≈ 0.584.

(b) If r = 10, what is the probability that at least 3 of the 5 parts sampled are defective?

P(Y ≥ 3) = 1 − P(Y ≤ 2)
         = 1 − [C(10, 0)C(90, 5)/C(100, 5) + C(10, 1)C(90, 4)/C(100, 5) + C(10, 2)C(90, 3)/C(100, 5)]
         ≈ 1 − (0.584 + 0.339 + 0.070) = 0.007.

HYPERGEOMETRIC R CODE: Suppose that Y ∼ hyper(N, n, r).

p_Y(y) = P(Y = y)      F_Y(y) = P(Y ≤ y)
dhyper(y,r,N-r,n)      phyper(y,r,N-r,n)


> dhyper(0,10,100-10,5) ## Part (a)

[1] 0.5837524

> 1-phyper(2,10,100-10,5) ## Part (b)

[1] 0.006637913
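The hypergeometric pmf is just a ratio of binomial coefficients, so the R answers can be reproduced with integer arithmetic alone. A Python cross-check, offered as an aside to the course's R code:

```python
# Hypergeometric pmf p_Y(y) = C(r,y) C(N-r, n-y) / C(N, n), reproducing the
# acceptance-sampling answers of Example 3.19 (N = 100, n = 5, r = 10).
from math import comb

def hyper_pmf(y, N, n, r):
    return comb(r, y) * comb(N - r, n - y) / comb(N, n)

N, n, r = 100, 5, 10
accept = hyper_pmf(0, N, n, r)                                  # part (a)
at_least_3 = 1 - sum(hyper_pmf(y, N, n, r) for y in range(3))   # part (b)

assert abs(accept - 0.5837524) < 1e-6
assert abs(at_least_3 - 0.006637913) < 1e-7
print(accept, at_least_3)
```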

3.4.5 Poisson distribution

NOTE: The Poisson distribution is commonly used to model counts, such as

1. the number of customers entering a post office in a given hour
2. the number of α-particles discharged from a radioactive substance in one second
3. the number of machine breakdowns per month
4. the number of insurance claims received per day
5. the number of defects on a piece of raw material.

TERMINOLOGY: In general, we define

Y = the number of occurrences in a unit interval of time (or space).

A Poisson distribution for Y emerges if these occurrences obey the following rules:

(i) The numbers of occurrences in non-overlapping intervals (of time or space) are independent random variables.
(ii) The probability of an occurrence in a sufficiently short interval is proportional to the length of the interval.
(iii) The probability of 2 or more occurrences in a sufficiently short interval is zero.

We say that Y has a Poisson distribution and write Y ∼ Poisson(λ). A process that produces occurrences according to these rules is called a Poisson process.


Figure 3.8: PMF (left) and CDF (right) of Y ∼ Poisson(λ = 2.5) in Example 3.20.

PMF: If Y ∼ Poisson(λ), then the probability mass function of Y is given by

p_Y(y) = λ^y e^(−λ) / y!,  y = 0, 1, 2, ...
       = 0, otherwise.

MEAN/VARIANCE: If Y ∼ Poisson(λ), then

E(Y) = λ
var(Y) = λ.

Example 3.20. Let Y denote the number of times per month that a detectable amount of radioactive gas is recorded at a nuclear power plant. Suppose that Y follows a Poisson distribution with mean λ = 2.5 times per month.

(a) What is the probability that a detectable amount of gas is recorded exactly three times in a given month?

P(Y = 3) = p_Y(3) = (2.5)³e^(−2.5)/3! = 15.625e^(−2.5)/6 ≈ 0.214.


(b) What is the probability that a detectable amount of gas is recorded no more than four times in a given month?

P(Y ≤ 4) = Σ_{y=0}^{4} (2.5)^y e^(−2.5)/y!
         = (2.5)⁰e^(−2.5)/0! + (2.5)¹e^(−2.5)/1! + (2.5)²e^(−2.5)/2! + (2.5)³e^(−2.5)/3! + (2.5)⁴e^(−2.5)/4!
         ≈ 0.891.

POISSON R CODE: Suppose that Y ∼ Poisson(λ).

p_Y(y) = P(Y = y)    F_Y(y) = P(Y ≤ y)
dpois(y,λ)           ppois(y,λ)

> dpois(3,2.5) ## Part (a)
[1] 0.213763
> ppois(4,2.5) ## Part (b)
[1] 0.891178
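The Poisson pmf needs only exp and factorial, so both answers can be checked from scratch. A Python sketch (an aside; the course computes these with R):

```python
# Poisson pmf p_Y(y) = lam^y e^(-lam) / y!, reproducing the Example 3.20
# answers with lam = 2.5 detectable recordings per month.
from math import exp, factorial

def pois_pmf(y, lam):
    return lam**y * exp(-lam) / factorial(y)

lam = 2.5
part_a = pois_pmf(3, lam)                          # P(Y = 3)
part_b = sum(pois_pmf(y, lam) for y in range(5))   # P(Y <= 4)

assert abs(part_a - 0.213763) < 1e-6
assert abs(part_b - 0.891178) < 1e-6
print(part_a, part_b)
```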

3.5 Continuous random variables

RECALL: A random variable Y is called continuous if it can assume any value in an interval of real numbers.

Contrast this with a discrete random variable, whose values can be counted.

For example, if Y = time (measured in seconds), then the set of all possible values of Y is

{y : y > 0}.

If Y = temperature (measured in deg C), the set of all possible values of Y (ignoring absolute zero and physical upper bounds) might be described as

{y : −∞ < y < ∞}.

Neither of these sets of values can be counted.


IMPORTANT: Assigning probabilities to events involving continuous random variables is different than in discrete models. We do not assign positive probability to specific values (e.g., Y = 3, etc.) like we did with discrete random variables. Instead, we assign positive probability to events which are intervals (e.g., 2 < Y < 4, etc.).

TERMINOLOGY: Every continuous random variable we will discuss in this course has a probability density function (pdf), denoted by f_Y(y). This function has the following characteristics:

1. f_Y(y) ≥ 0, that is, f_Y(y) is nonnegative.

2. The area under any pdf is equal to 1, that is,

∫_{−∞}^{∞} f_Y(y) dy = 1.

3. If y_0 is a specific value of interest, then the cumulative distribution function (cdf) of Y is given by

F_Y(y_0) = P(Y ≤ y_0) = ∫_{−∞}^{y_0} f_Y(y) dy.

4. If y_1 and y_2 are specific values of interest (y_1 < y_2), then

P(y_1 ≤ Y ≤ y_2) = ∫_{y_1}^{y_2} f_Y(y) dy = F_Y(y_2) − F_Y(y_1).

5. If y_0 is a specific value, then P(Y = y_0) = 0. In other words, in continuous probability models, specific points are assigned zero probability (see #4 above; this will make perfect mathematical sense). An immediate consequence of this is that if Y is continuous,

P(y_1 ≤ Y ≤ y_2) = P(y_1 ≤ Y < y_2) = P(y_1 < Y ≤ y_2) = P(y_1 < Y < y_2)

and each is equal to

∫_{y_1}^{y_2} f_Y(y) dy.

This is not true if Y has a discrete distribution, because positive probability is assigned to specific values of Y.



Figure 3.9: PDF (left) and CDF (right) of Y in Example 3.21.

IMPORTANT: Evaluating a pdf at a specific value y_0, that is, computing f_Y(y_0), does not give you a probability! This simply gives you the height of the pdf f_Y(y) at y = y_0.

Example 3.21. Suppose that Y has the pdf

f_Y(y) = 3y²,  0 < y < 1
       = 0, otherwise.

Find the cumulative distribution function (cdf) of Y.

Solution. For 0 < y < 1,

F_Y(y) = ∫_{−∞}^{y} f_Y(t) dt = ∫_0^y 3t² dt = t³ |_0^y = y³.

Therefore, the cdf of Y is

F_Y(y) = 0,   y < 0
       = y³,  0 ≤ y < 1
       = 1,   y ≥ 1.


(a) Calculate P(Y < 0.3).

Method 1: PDF:
P(Y < 0.3) = ∫_0^0.3 3y² dy = y³ |_0^0.3 = (0.3)³ − 0³ = 0.027.

Method 2: CDF:
P(Y < 0.3) = F_Y(0.3) = (0.3)³ = 0.027.

(b) Calculate P(Y > 0.8).

Method 1: PDF:
P(Y > 0.8) = ∫_0.8^1 3y² dy = y³ |_0.8^1 = 1³ − (0.8)³ = 0.488.

Method 2: CDF:
P(Y > 0.8) = 1 − P(Y ≤ 0.8) = 1 − F_Y(0.8) = 1 − (0.8)³ = 0.488.

(c) Calculate P(0.3 < Y < 0.8).

Method 1: PDF:
P(0.3 < Y < 0.8) = ∫_0.3^0.8 3y² dy = y³ |_0.3^0.8 = (0.8)³ − (0.3)³ = 0.485.

Method 2: CDF:
P(0.3 < Y < 0.8) = F_Y(0.8) − F_Y(0.3) = (0.8)³ − (0.3)³ = 0.485.

TERMINOLOGY: Let Y be a continuous random variable with pdf f_Y(y). The expected value (or mean) of Y is given by

μ = E(Y) = ∫_{−∞}^{∞} y f_Y(y) dy.

NOTE: The limits of the integral in this definition, while technically correct, will always be the lower and upper limits corresponding to the nonzero part of the pdf.


FUNCTIONS: Let Y be a continuous random variable with pdf f_Y(y). Suppose that g is a real-valued function. Then, g(Y) is a random variable and

E[g(Y)] = ∫_{−∞}^{∞} g(y) f_Y(y) dy.

TERMINOLOGY: Let Y be a continuous random variable with pdf f_Y(y) and expected value E(Y) = μ. The population variance of Y is given by

σ² ≡ var(Y) ≡ E[(Y − μ)²] = ∫_{−∞}^{∞} (y − μ)² f_Y(y) dy.

The computing formula is still

var(Y) = E(Y²) − [E(Y)]².

The population standard deviation of Y is the positive square root of the variance:

σ = √σ² = √var(Y).

Example 3.22. Find the mean and variance of Y in Example 3.21.

Solution. The mean of Y is

μ = E(Y) = ∫_0^1 y f_Y(y) dy = ∫_0^1 y · 3y² dy = ∫_0^1 3y³ dy = 3y⁴/4 |_0^1 = 3/4.

To find var(Y), we will use the computing formula var(Y) = E(Y²) − [E(Y)]². We already have E(Y) = 3/4.

E(Y²) = ∫_0^1 y² f_Y(y) dy = ∫_0^1 y² · 3y² dy = ∫_0^1 3y⁴ dy = 3y⁵/5 |_0^1 = 3/5.

Therefore,

σ² = var(Y) = E(Y²) − [E(Y)]² = 3/5 − (3/4)² = 0.0375.

The population standard deviation is

σ = √0.0375 ≈ 0.194.
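The same moments can be recovered numerically from the pdf, which is a useful sanity check whenever an antiderivative is harder to find than it is here. A Python sketch (an aside to the notes):

```python
# Mean and variance of Y from Example 3.22 by midpoint-rule integration of
# y^k * f_Y(y) with f_Y(y) = 3y^2 on (0, 1), checked against 3/4 and 0.0375.
def moment(k, steps=100000):
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        y = (i + 0.5) * h
        total += y**k * 3 * y**2 * h   # y^k * f_Y(y) * dy
    return total

EY, EY2 = moment(1), moment(2)
var = EY2 - EY**2

assert abs(EY - 0.75) < 1e-6      # E(Y) = 3/4
assert abs(var - 0.0375) < 1e-6   # var(Y) = 3/5 - (3/4)^2
print(EY, var)
```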


QUANTILES: Suppose that Y is a continuous random variable with cdf F_Y(y) and let 0 < p < 1. The pth quantile of the distribution of Y, denoted by φ_p, solves

F_Y(φ_p) = P(Y ≤ φ_p) = ∫_{−∞}^{φ_p} f_Y(y) dy = p.

The median of the distribution of Y is the p = 0.5 quantile. That is, the median φ_0.5 solves

F_Y(φ_0.5) = P(Y ≤ φ_0.5) = ∫_{−∞}^{φ_0.5} f_Y(y) dy = 0.5.

NOTE: Another name for the pth quantile is the 100pth percentile.

REMARK: When Y is discrete, there are some potential problems with the definition that φ_p solves F_Y(φ_p) = P(Y ≤ φ_p) = p. The reason is that there may be many values of φ_p that satisfy this equation. By convention, in discrete distributions, the pth quantile φ_p is taken to be the smallest value satisfying F_Y(φ_p) = P(Y ≤ φ_p) ≥ p.

3.5.1 Exponential distribution

TERMINOLOGY: A random variable Y is said to have an exponential distribution with parameter λ > 0 if its pdf is given by

f_Y(y) = λe^(−λy),  y > 0
       = 0, otherwise.

Shorthand notation is Y ∼ exponential(λ). Important: The exponential distribution is used to model the distribution of positive quantities (e.g., lifetimes, etc.).

MEAN/VARIANCE: If Y ∼ exponential(λ), then

E(Y) = 1/λ
var(Y) = 1/λ².

CDF: Suppose that Y ∼ exponential(λ). Then, the cdf of Y exists in closed form and is given by

F_Y(y) = 0,  y ≤ 0
       = 1 − e^(−λy),  y > 0.


Figure 3.10: Exponential pdfs with different values of λ (λ = 1, λ = 1/2, λ = 1/5).

Example 3.23. Extensive experience with fans of a certain type used in diesel engines has suggested that the exponential distribution provides a good model for time until failure (i.e., lifetime). Suppose that the lifetime of a fan, denoted by Y (measured in 10000s of hours), follows an exponential distribution with λ = 0.4.

(a) What is the probability that a fan lasts longer than 30,000 hours?

Method 1: PDF:
P(Y > 3) = ∫_3^∞ 0.4e^(−0.4y) dy = 0.4[−(1/0.4)e^(−0.4y)] |_3^∞ = e^(−0.4(3)) = e^(−1.2) ≈ 0.301.


Figure 3.11: PDF (left) and CDF (right) of Y ∼ exponential(λ = 0.4) in Example 3.23.

Method 2: CDF:
P(Y > 3) = 1 − P(Y ≤ 3) = 1 − F_Y(3) = 1 − [1 − e^(−0.4(3))] = e^(−1.2) ≈ 0.301.

(b) What is the probability that a fan will last between 20,000 and 50,000 hours?

Method 1: PDF:
P(2 < Y < 5) = ∫_2^5 0.4e^(−0.4y) dy = 0.4[−(1/0.4)e^(−0.4y)] |_2^5
             = −[e^(−0.4(5)) − e^(−0.4(2))] = e^(−0.8) − e^(−2) ≈ 0.314.

Method 2: CDF:
P(2 < Y < 5) = F_Y(5) − F_Y(2) = [1 − e^(−0.4(5))] − [1 − e^(−0.4(2))]
             = e^(−0.8) − e^(−2) ≈ 0.314.

MEMORYLESS PROPERTY: Suppose that Y ∼ exponential(λ), and let r and s be positive constants. Then

P(Y > r + s | Y > r) = P(Y > s).

If Y measures time (e.g., time to failure, etc.), then the memoryless property says that the distribution of additional lifetime (s time units beyond time r) is the same as the original distribution of the lifetime. In other words, the fact that Y has made it to time r has been "forgotten." For example, in Example 3.23,

P(Y > 5 | Y > 2) = P(Y > 3) ≈ 0.301.
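The memoryless identity follows from P(Y > r + s | Y > r) = P(Y > r + s)/P(Y > r) and the survival function e^(−λy), since e^(−λ(r+s))/e^(−λr) = e^(−λs). A Python sketch verifying this numerically for the fan lifetimes of Example 3.23 (an aside, not course code):

```python
# Memoryless property check for Y ~ exponential(lam = 0.4):
# P(Y > r + s | Y > r) = P(Y > s), using the survival function e^(-lam*y).
from math import exp, isclose

def surv(y, lam=0.4):   # P(Y > y) = 1 - F_Y(y)
    return exp(-lam * y)

r, s = 2, 3
conditional = surv(r + s) / surv(r)   # P(Y > 5 | Y > 2)
assert isclose(conditional, surv(s))  # equals P(Y > 3) = e^(-1.2)
print(conditional)  # ≈ 0.301
```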

POISSON RELATIONSHIP: Suppose that we are observing occurrences over time according to a Poisson process with rate λ. Define the random variable

W = the time until the first occurrence.

Then, W ∼ exponential(λ). It is also true that the time between any two occurrences in a Poisson process follows this same exponential distribution (these are called interarrival times).

Example 3.24. Suppose that customers arrive at a check-out according to a Poisson process with mean λ = 12 per hour. What is the probability that we will have to wait longer than 10 minutes to see the first customer? Note: 10 minutes is 1/6th of an hour.

Solution. The time until the first arrival, say W, follows an exponential distribution with λ = 12, so the cdf of W, for w > 0, is

F_W(w) = 1 − e^(−12w).


The desired probability is

P(W > 1/6) = 1 − P(W ≤ 1/6) = 1 − F_W(1/6)
           = 1 − [1 − e^(−12(1/6))]
           = e^(−2) ≈ 0.135.

EXPONENTIAL R CODE: Suppose that Y ∼ exponential(λ).

F_Y(y) = P(Y ≤ y)    φ_p
pexp(y,λ)            qexp(p,λ)

> 1-pexp(1/6,12) ## Example 3.24
[1] 0.1353353
> qexp(0.9,12) ## 0.9 quantile
[1] 0.1918821

NOTE: The command qexp(0.9,12) gives the 0.90 quantile (90th percentile) of the exponential(λ = 12) distribution. In Example 3.24, this means that 90 percent of the waiting times will be less than approximately 0.192 hours (only 10 percent will exceed this).

3.5.2 Gamma distribution

TERMINOLOGY: The gamma function is a real function of t, defined by

Γ(t) = ∫_0^∞ y^(t−1) e^(−y) dy,

for all t > 0. The gamma function satisfies the recursive relationship

Γ(α) = (α − 1)Γ(α − 1),

for α > 1. Therefore, if α is an integer, then

Γ(α) = (α − 1)!.


Figure 3.12: Gamma pdfs with different values of α and λ (α = 1.5, λ = 1/2; α = 2, λ = 1/3; α = 2.5, λ = 1/5).

TERMINOLOGY: A random variable Y is said to have a gamma distribution with parameters α > 0 and λ > 0 if its pdf is given by

f_Y(y) = [λ^α / Γ(α)] y^(α−1) e^(−λy),  y > 0
       = 0, otherwise.

Shorthand notation is Y ∼ gamma(α, λ).

By changing the values of α and λ, the gamma pdf can assume many shapes. This makes the gamma distribution popular for modeling positive random variables (it is more flexible than the exponential).

Note that when α = 1, the gamma pdf reduces to the exponential(λ) pdf.


MEAN/VARIANCE: If Y ∼ gamma(α, λ), then

E(Y) = α/λ
var(Y) = α/λ².

CDF: The cdf of a gamma random variable does not exist in closed form. Therefore, probabilities involving gamma random variables and gamma quantiles must be computed numerically (e.g., using R, etc.).

GAMMA R CODE: Suppose that Y ∼ gamma(α, λ).

F_Y(y) = P(Y ≤ y)    φ_p
pgamma(y,α,λ)        qgamma(p,α,λ)

Example 3.25. When a certain transistor is subjected to an accelerated life test, the lifetime Y (in weeks) is well modeled by a gamma distribution with α = 4 and λ = 1/6.

(a) Find the probability that a transistor will last at least 50 weeks.

P(Y ≥ 50) = 1 − P(Y < 50) = 1 − F_Y(50)
          = 1-pgamma(50,4,1/6)
          = 0.03377340.

(b) Find the probability that a transistor will last between 12 and 24 weeks.

P(12 ≤ Y ≤ 24) = F_Y(24) − F_Y(12)
               = pgamma(24,4,1/6)-pgamma(12,4,1/6)
               = 0.4236533.

(c) Twenty percent of the transistor lifetimes will be below which time? Note: I am asking for the 0.20 quantile (20th percentile) of the lifetime distribution.

> qgamma(0.2,4,1/6)
[1] 13.78072


Figure 3.13: PDF (left) and CDF (right) of Y ∼ gamma(α = 4, λ = 1/6) in Example 3.25.

3.5.3 Normal distribution

TERMINOLOGY: A random variable Y is said to have a normal distribution if its
pdf is given by

f_Y(y) = (1 / (σ√(2π))) e^(−(1/2)((y−μ)/σ)²),  −∞ < y < ∞
       = 0,  otherwise.

Shorthand notation is Y ~ N(μ, σ²). Another name for the normal distribution is the
Gaussian distribution.

MEAN/VARIANCE: If Y ~ N(μ, σ²), then

E(Y) = μ
var(Y) = σ².

REMARK: The normal distribution serves as a very good model for a wide range of
measurements; e.g., reaction times, fill amounts, part dimensions, weights/heights,
measures of intelligence/test scores, economic indicators, etc.


Figure 3.14: Normal pdfs with different values of μ and σ². Curves shown: μ = 0, σ = 1; μ = 2, σ = 2; μ = 1, σ = 3. [Plots omitted; axes y vs. f(y).]

CDF: The cdf of a normal random variable does not exist in closed form. Probabilities

involving normal random variables and normal quantiles can be computed numerically

(e.g., using R, etc.).

NORMAL R CODE: Suppose that Y ~ N(μ, σ²).

F_Y(y) = P(Y ≤ y)  →  pnorm(y,μ,σ)
φ_p (quantile)     →  qnorm(p,μ,σ)

Example 3.26. The time it takes for a driver to react to the brake lights on a decelerating
vehicle is critical in helping to avoid rear-end collisions. A recently published study
suggests that this time during in-traffic driving, denoted by Y (measured in seconds),
follows a normal distribution with mean μ = 1.5 and variance σ² = 0.16.


Figure 3.15: PDF (left) and CDF (right) of Y ~ N(μ = 1.5, σ² = 0.16) in Example 3.26. [Plots omitted; axes y vs. f(y) and y vs. F(y).]

(a) What is the probability that reaction time is less than 1 second?

P(Y < 1) = F_Y(1)
         = pnorm(1,1.5,sqrt(0.16))
         = 0.1056498.

(b) What is the probability that reaction time is between 1.1 and 2.5 seconds?

P(1.1 ≤ Y ≤ 2.5) = F_Y(2.5) − F_Y(1.1)
                 = pnorm(2.5,1.5,sqrt(0.16))-pnorm(1.1,1.5,sqrt(0.16))
                 = 0.835135.

(c) Five percent of all reaction times will exceed which time? Note: I am asking for

the 0.95 quantile (95th percentile) of the reaction time distribution.

> qnorm(0.95,1.5,sqrt(0.16))

[1] 2.157941
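The normal cdf can also be evaluated without R through the error function, since Φ((y−μ)/σ) = (1/2)[1 + erf((y−μ)/(σ√2))]. A stdlib Python sketch (illustrative, not part of the notes) reproducing parts (a)-(c), with the quantile in (c) found by bisection:

```python
import math

def norm_cdf(y, mu, sigma):
    """Normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2.0))))

def norm_quantile(p, mu, sigma):
    """Invert the cdf by bisection on a wide bracket."""
    lo, hi = mu - 10 * sigma, mu + 10 * sigma
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if norm_cdf(mid, mu, sigma) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

mu, sigma = 1.5, math.sqrt(0.16)
part_a = norm_cdf(1, mu, sigma)                               # approx 0.1056
part_b = norm_cdf(2.5, mu, sigma) - norm_cdf(1.1, mu, sigma)  # approx 0.8351
part_c = norm_quantile(0.95, mu, sigma)                       # approx 2.158
```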


EMPIRICAL RULE: For any N(μ, σ²) distribution,

about 68% of the distribution is between μ − σ and μ + σ
about 95% of the distribution is between μ − 2σ and μ + 2σ
about 99.7% of the distribution is between μ − 3σ and μ + 3σ.

This is also called the 68-95-99.7% rule. This rule allows us to make statements
like this (referring to Example 3.26, where μ = 1.5 and σ = 0.4):

About 68 percent of all reaction times will be between 1.1 and 1.9 seconds.
About 95 percent of all reaction times will be between 0.7 and 2.3 seconds.
About 99.7 percent of all reaction times will be between 0.3 and 2.7 seconds.
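These percentages follow directly from the standard normal cdf; a quick numerical check (an illustration using the math module's erf, not part of the notes' R workflow):

```python
import math

def phi(z):
    # standard normal cdf
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

within_1 = phi(1) - phi(-1)   # about 0.6827
within_2 = phi(2) - phi(-2)   # about 0.9545
within_3 = phi(3) - phi(-3)   # about 0.9973
```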

TERMINOLOGY: A random variable Z is said to have a standard normal distribution
if its pdf is given by

f_Z(z) = (1/√(2π)) e^(−z²/2),  −∞ < z < ∞
       = 0,  otherwise.

Shorthand notation is Z ~ N(0, 1). A standard normal distribution is a special normal
distribution, that is, a normal distribution with mean μ = 0 and variance σ² = 1. The
variable Z is called a standard normal random variable.

RESULT: If Y ~ N(μ, σ²), then

Z = (Y − μ)/σ ~ N(0, 1).

The result says that Z follows a standard normal distribution; i.e., Z ~ N(0, 1). In this
context, Z is called the standardized value of Y.

Important: Therefore, any normal random variable Y ~ N(μ, σ²) can be converted
to a standard normal random variable Z by applying this transformation.


IMPLICATION: Any probability calculation involving a normal random variable
Y ~ N(μ, σ²) can be transformed into a calculation involving Z ~ N(0, 1). More
specifically, if Y ~ N(μ, σ²), then

P(y_1 < Y < y_2) = P((y_1 − μ)/σ < Z < (y_2 − μ)/σ)
                 = F_Z((y_2 − μ)/σ) − F_Z((y_1 − μ)/σ).

Q: Why is this important?

Because it is common for textbooks (like yours) to table the cumulative distribution
function F_Z(z) for various values of z. See Table 1 (pp 593-594, VK).

Therefore, the preceding standardization result makes it possible to find probabilities
involving normal random variables by hand without using calculus or software.

Because R will compute normal probabilities directly, I view hand calculation to
be unnecessary and outdated.

See the examples in the text (Section 3.5) for illustrations of this approach.

3.6 Reliability and lifetime distributions

TERMINOLOGY: Reliability analysis deals with failure time (i.e., lifetime, time-to-

event) data. For example,

T = time from start of product service until failure

T = time of sale of a product until a warranty claim

T = number of hours in use/cycles until failure.

We call T a lifetime random variable if it measures the time to an event; e.g.,

failure, death, eradication of some infection/condition, etc. Engineers are often involved

with reliability studies in practice, because reliability is related to product quality.


NOTE: There are many well known lifetime distributions, including

exponential

Weibull

lognormal

Others: gamma, inverse Gaussian, Gompertz-Makeham, Birnbaum-Saunders, extreme value, log-logistic, etc.

The normal (Gaussian) distribution is rarely used to model lifetime variables.

3.6.1 Weibull distribution

TERMINOLOGY: A random variable T is said to have a Weibull distribution with
parameters β > 0 and η > 0 if its pdf is given by

f_T(t) = (β/η)(t/η)^(β−1) e^(−(t/η)^β),  t > 0
       = 0,  otherwise.

Shorthand notation is T ~ Weibull(β, η).

We call

β = shape parameter
η = scale parameter.

By changing the values of β and η, the Weibull pdf can assume many shapes. The
Weibull distribution is very popular among engineers in reliability applications.

Note that when β = 1, the Weibull pdf reduces to the exponential(λ = 1/η) pdf.

MEAN/VARIANCE: If T ~ Weibull(β, η), then

E(T) = η Γ(1 + 1/β)
var(T) = η² { Γ(1 + 2/β) − [Γ(1 + 1/β)]² }.


Figure 3.16: Weibull pdfs with different values of β and η. Curves shown: β = 2, η = 5; β = 2, η = 10; β = 3, η = 10. [Plots omitted; axes t vs. f(t).]

CDF: Suppose that T ~ Weibull(β, η). Then, the cdf of T exists in closed form and is
given by

F_T(t) = 0,  t ≤ 0
       = 1 − e^(−(t/η)^β),  t > 0.

Example 3.27. Suppose that the lifetime of a rechargeable battery, denoted by T
(measured in hours), follows a Weibull distribution with parameters β = 2 and η = 10.

(a) What is the mean time to failure?

E(T) = 10 Γ(3/2) ≈ 8.862 hours.

(b) What is the probability that a battery is still functional at time t = 20?

P(T ≥ 20) = 1 − P(T < 20) = 1 − F_T(20)
          = 1 − [1 − e^(−(20/10)²)] ≈ 0.018.


Figure 3.17: PDF (left) and CDF (right) of T ~ Weibull(β = 2, η = 10) in Example 3.27. [Plots omitted.]

(c) What is the probability that a battery is still functional at time t = 20 given that

the battery is functional at time t = 10?

P(T ≥ 20 | T ≥ 10) = P(T ≥ 20 and T ≥ 10) / P(T ≥ 10)
                   = P(T ≥ 20) / P(T ≥ 10)
                   = [1 − F_T(20)] / [1 − F_T(10)]
                   = e^(−(20/10)²) / e^(−(10/10)²) ≈ 0.050.

(d) What is the 99th percentile of this lifetime distribution? We set

F_T(φ_0.99) = 1 − e^(−(φ_0.99/10)²) set= 0.99.

Solving for φ_0.99 gives φ_0.99 ≈ 21.460 hours. Only one percent of the battery lifetimes
will exceed this value.

WEIBULL R CODE: Suppose that T ~ Weibull(β, η).

F_T(t) = P(T ≤ t)  →  pweibull(t,β,η)
φ_p (quantile)     →  qweibull(p,β,η)


> 10*gamma(3/2) ## Part (a)

[1] 8.86227

> 1-pweibull(20,2,10) ## Part (b)

[1] 0.01831564

> (1-pweibull(20,2,10))/(1-pweibull(10,2,10)) ## Part (c)

[1] 0.04978707

> qweibull(0.99,2,10) ## Part (d)

[1] 21.45966
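Because the Weibull cdf has a closed form, the R output above can be checked directly; the quantile even inverts analytically, φ_p = η(−ln(1 − p))^(1/β). A stdlib Python sketch (an illustration, not part of the notes):

```python
import math

def weibull_cdf(t, beta, eta):
    """Closed-form Weibull cdf."""
    return 1.0 - math.exp(-((t / eta) ** beta)) if t > 0 else 0.0

def weibull_quantile(p, beta, eta):
    # Solve 1 - exp(-(t/eta)^beta) = p for t
    return eta * (-math.log(1.0 - p)) ** (1.0 / beta)

beta, eta = 2, 10
mean_ttf = eta * math.gamma(1 + 1 / beta)        # part (a): about 8.862
surv_20 = 1.0 - weibull_cdf(20, beta, eta)       # part (b): about 0.018
cond = surv_20 / (1.0 - weibull_cdf(10, beta, eta))  # part (c): about 0.050
q99 = weibull_quantile(0.99, beta, eta)          # part (d): about 21.460
```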

3.6.2 Reliability functions

DESCRIPTION: We now describe some different, but equivalent, ways of defining the
distribution of a (continuous) lifetime random variable T.

The cumulative distribution function (cdf)

F_T(t) = P(T ≤ t).

This can be interpreted as the proportion of units that have failed by time t.

The survivor function

S_T(t) = P(T > t) = 1 − F_T(t).

This can be interpreted as the proportion of units that have not failed by time t;
e.g., the unit is still functioning, a warranty claim has not been made, etc.

The probability density function (pdf)

f_T(t) = (d/dt) F_T(t) = −(d/dt) S_T(t).

Also, recall that

F_T(t) = ∫_0^t f_T(u) du
and
S_T(t) = ∫_t^∞ f_T(u) du.


TERMINOLOGY: The hazard function is defined as

h_T(t) = lim_{ε→0} P(t ≤ T < t + ε | T ≥ t) / ε.

The hazard function is not a probability; rather, it is a probability rate. Therefore, it
is possible that a hazard function may exceed one.

REMARK: The hazard function (or hazard rate) is a very important characteristic of

a lifetime distribution. It indicates the way the risk of failure varies with time.

Distributions with increasing hazard functions are seen in units for whom some kind of

aging or wear out takes place. Certain types of units (e.g., electronic devices, etc.)

may display a decreasing hazard function, at least in the early stages of their lifetimes.

NOTE: It is insightful to note that

h_T(t) = lim_{ε→0} P(t ≤ T < t + ε | T ≥ t) / ε
       = lim_{ε→0} (1/ε) P(t ≤ T < t + ε) / P(T ≥ t)
       = [1 / P(T ≥ t)] lim_{ε→0} [F_T(t + ε) − F_T(t)] / ε
       = f_T(t) / S_T(t).

We can therefore describe the distribution of the continuous lifetime random variable T
by using either f_T(t), F_T(t), S_T(t), or h_T(t).

Example 3.28. In this example, we find the hazard function for T ~ Weibull(β, η).
Recall that the pdf of T is

f_T(t) = (β/η)(t/η)^(β−1) e^(−(t/η)^β),  t > 0
       = 0,  otherwise.

The cdf of T is

F_T(t) = 0,  t ≤ 0
       = 1 − e^(−(t/η)^β),  t > 0.

The survivor function of T is

S_T(t) = 1 − F_T(t) = 1,  t ≤ 0
                    = e^(−(t/η)^β),  t > 0.


Figure 3.18: Weibull hazard functions with η = 1. Upper left: β = 3. Upper right: β = 1.5. Lower left: β = 1. Lower right: β = 0.5. [Plots omitted; axes t vs. h(t).]

Therefore, the hazard function, for t > 0, is

h_T(t) = f_T(t) / S_T(t)
       = [(β/η)(t/η)^(β−1) e^(−(t/η)^β)] / e^(−(t/η)^β)
       = (β/η)(t/η)^(β−1).

Plots of Weibull hazard functions are given in Figure 3.18. It is easy to show

h_T(t) is increasing if β > 1 (wear out; population of units gets weaker with aging)
h_T(t) is constant if β = 1 (constant hazard; exponential distribution)
h_T(t) is decreasing if β < 1 (infant mortality; population of units gets stronger with
aging).
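These three regimes can be checked numerically from the hazard formula h_T(t) = (β/η)(t/η)^(β−1) derived above (an illustrative Python sketch, not part of the notes' R code):

```python
def weibull_hazard(t, beta, eta):
    """Weibull hazard rate (beta/eta)(t/eta)^(beta-1), for t > 0."""
    return (beta / eta) * (t / eta) ** (beta - 1)

ts = [0.5, 1.0, 2.0, 4.0]
increasing = [weibull_hazard(t, 3.0, 1.0) for t in ts]  # beta > 1: wear out
constant = [weibull_hazard(t, 1.0, 1.0) for t in ts]    # beta = 1: exponential
decreasing = [weibull_hazard(t, 0.5, 1.0) for t in ts]  # beta < 1: infant mortality

assert increasing == sorted(increasing)
assert all(abs(h - 1.0) < 1e-12 for h in constant)
assert decreasing == sorted(decreasing, reverse=True)
```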


Example 3.29. We consider the data in Example 3.23 of Vining and Kowalski (pp 162).
The data are times, denoted by T (measured in months), to the first failure for 20 electric
carts used for internal delivery and transportation in a large manufacturing facility.

0.9 1.5 2.3 3.2 3.9 5.0 6.2 7.5 8.3 10.4
11.1 12.6 15.0 16.3 19.3 22.6 24.8 31.5 38.1 53.0

From these data, maximum likelihood estimates of β and η are computed to be

β̂ ≈ 1.110
η̂ ≈ 15.271.

Note: Maximum likelihood estimation is a mathematical technique used to find
values of β and η that most closely agree with the observed data. R computes these
estimates automatically. In Figure 3.19, we display the (estimated) PDF f_T(t), CDF
F_T(t), survivor function S_T(t), and hazard function h_T(t) based on these estimates.

REMARK: Note that the estimate β̂ ≈ 1.110 is larger than 1. This suggests that there
is wear out taking place among the carts; that is, the population of carts gets weaker
as time passes.

(a) Using the estimated Weibull(β̂ ≈ 1.110, η̂ ≈ 15.271) distribution for the
cart lifetimes, find the probability that a cart will still be functioning after t = 36 months.

P(T ≥ 36) = 1 − P(T < 36) = 1 − F_T(36)
          = 1 − [1 − e^(−(36/15.271)^1.110)] ≈ 0.075.

(b) Use the estimated distribution to find the 90th percentile of the cart lifetimes.

F_T(φ_0.90) = 1 − e^(−(φ_0.90/15.271)^1.110) set= 0.90.

Solving for φ_0.90 gives φ_0.90 ≈ 32.373 months. Only ten percent of the cart lifetimes will
exceed this value.


Figure 3.19: Weibull functions with β̂ ≈ 1.110 and η̂ ≈ 15.271. Upper left: PDF. Upper
right: CDF. Lower left: Survivor function. Lower right: Hazard function. [Plots omitted.]

> 1-pweibull(36,1.110,15.271) ## Part (a)

[1] 0.07497392

> qweibull(0.90,1.110,15.271) ## Part (b)

[1] 32.37337
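The same closed-form Weibull cdf works with the estimated parameters; a stdlib Python check of (a) and (b) (illustrative, using the fitted values reported above):

```python
import math

beta_hat, eta_hat = 1.110, 15.271

# Part (a): P(T >= 36) = exp(-(36/eta)^beta), from the closed-form survivor function
surv_36 = math.exp(-((36 / eta_hat) ** beta_hat))
# Part (b): the 0.90 quantile, inverting the cdf analytically
q90 = eta_hat * (-math.log(1 - 0.90)) ** (1 / beta_hat)
```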

TERMINOLOGY: A quantile-quantile plot (qq plot) is a graphical display that can

help assess the appropriateness of a distribution. Here is how the plot is constructed:

On the vertical axis, we plot the observed data, ordered from low to high.


Figure 3.20: Weibull qq plot for the electric cart data in Example 3.29. The observed data
are plotted versus the theoretical quantiles from a Weibull distribution with β̂ ≈ 1.110
and η̂ ≈ 15.271. [Plot omitted; axes: Weibull percentiles vs. observed values.]

On the horizontal axis, we plot the (ordered) theoretical quantiles from the
distribution (model) assumed for the observed data.

Our intuition should suggest the following:

If the observed data agree with the distribution's theoretical quantiles, then the
qq plot should look like a straight line (the distribution is a good choice).

If the observed data do not agree with the theoretical quantiles, then the qq
plot should have curvature in it (the distribution is not a good choice).

Interpretation: The Weibull qq plot in Figure 3.20 looks like a straight line. This
suggests that a Weibull distribution is a good fit for the electric cart lifetime data.


4 Statistical Inference

Complementary reading: Chapter 4 (VK); Sections 4.1-4.8. Read also Sections 3.6-3.7.

4.1 Populations and samples

OVERVIEW: This chapter is about statistical inference. This deals with making

(probabilistic) statements about a population of individuals based on information that

is contained in a sample taken from the population.

Example 4.1. Suppose that we wish to study the performance of lithium batteries used
in a certain calculator. The purpose of our study is to determine the mean lifetime of
these batteries so that we can place a limited warranty on them in the future. Since this
type of battery has not been used in this calculator before, no one (except the Oracle) can
tell us the distribution of Y, the battery's lifetime. In fact, not only is the distribution
not known, but all parameters which index this distribution aren't known either.

TERMINOLOGY: A population refers to the entire group of individuals (e.g., parts,

people, batteries, etc.) about which we would like to make a statement (e.g., proportion

defective, median weight, mean lifetime, etc.).

It is generally accepted that the entire population cannot be measured. It is too
large and/or it would be too time consuming to do so.

To draw inferences (make probabilistic statements) about a population, we therefore

observe a sample of individuals from the population.

We will assume that the sample of individuals constitutes a random sample.

Mathematically, this means that all observations are independent and follow the

same probability distribution. Informally, this means that each sample (of the same

size) has the same chance of being selected. Our hope is that a random sample of

individuals is representative of the entire population of individuals.


NOTATION: We will denote a random sample of observations by

Y_1, Y_2, ..., Y_n.

That is, Y_1 is the value of Y for the first individual in the sample, Y_2 is the value of
Y for the second individual in the sample, and so on. The sample size tells us how
many individuals are in the sample and is denoted by n. Statisticians refer to the set of
observations Y_1, Y_2, ..., Y_n generically as data. Lower case notation y_1, y_2, ..., y_n is used
when citing numerical values (or when referring to realizations of the upper case versions).

BATTERY DATA: Consider the following random sample of n = 50 battery lifetimes
y_1, y_2, ..., y_50 (measured in hours):

4285 2066 2584 1009 318 1429 981 1402 1137 414

564 604 14 4152 737 852 1560 1786 520 396

1278 209 349 478 3032 1461 701 1406 261 83

205 602 3770 726 3894 2662 497 35 2778 1379

3920 1379 99 510 582 308 3367 99 373 454

In Figure 4.1, we display a histogram of the battery lifetime data. We see that the

(empirical) distribution of the battery lifetimes is skewed towards the high side.

Which continuous probability distribution seems to display the same type of pattern
that we see in the histogram?

An exponential(λ) model seems reasonable here (based on the histogram shape).

What is λ?

In this example, λ is called a (population) parameter. It describes the theoretical
distribution which is used to model the entire population of battery lifetimes.

In general, (population) parameters which index probability distributions (like the

exponential) are unknown.

All of the probability distributions that we discussed in Chapter 3 are meant to

describe (model) population behavior.


Figure 4.1: Histogram of battery lifetime data (measured in hours). [Plot omitted; axes: lifetime (in hours) vs. count.]

4.2 Parameters and statistics

TERMINOLOGY: A parameter is a numerical quantity that describes a population.

In general, population parameters are unknown. Some very common examples are:

μ = population mean
σ² = population variance
p = population proportion.

CONNECTION: All of the probability distributions that we talked about in Chapter 3

were indexed by population (model) parameters. For example,

the N(,

2

) distribution is indexed by two parameters, the population mean and

the population variance

2

.


the Poisson(λ) distribution is indexed by one parameter, the population mean λ.

the Weibull(β, η) distribution is indexed by two parameters, the shape parameter
β and the scale parameter η.

the b(n, p) distribution is indexed by one parameter, the population proportion of
successes p.

TERMINOLOGY: Suppose that Y_1, Y_2, ..., Y_n is a random sample from a population.

The sample mean is

Ȳ = (1/n) Σ_{i=1}^n Y_i.

The sample variance is

S² = [1/(n − 1)] Σ_{i=1}^n (Y_i − Ȳ)².

The sample standard deviation is the positive square root of the sample variance;
i.e.,

S = √S² = √{ [1/(n − 1)] Σ_{i=1}^n (Y_i − Ȳ)² }.

Important: Unlike their population analogues, these quantities can be computed from
a sample of data Y_1, Y_2, ..., Y_n.

TERMINOLOGY: A statistic is a numerical quantity that can be calculated from a

sample of data. Some very common examples are:

Ȳ = sample mean
S² = sample variance
p̂ = sample proportion.

For example, with the battery lifetime data (a random sample of n = 50 lifetimes),

ȳ = 1274.14 hours
s² = 1505156 (hours)²
s ≈ 1226.85 hours.


> mean(battery) ## sample mean

[1] 1274.14

> var(battery) ## sample variance

[1] 1505156

> sd(battery) ## sample standard deviation

[1] 1226.848
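Python's standard library computes the same summaries; statistics.variance and statistics.stdev use the n − 1 divisor, matching the definitions of S² and S above. A sketch with the battery data (an illustration alongside the R calls):

```python
import statistics

# The n = 50 battery lifetimes (in hours) from the notes
battery = [4285, 2066, 2584, 1009, 318, 1429, 981, 1402, 1137, 414,
           564, 604, 14, 4152, 737, 852, 1560, 1786, 520, 396,
           1278, 209, 349, 478, 3032, 1461, 701, 1406, 261, 83,
           205, 602, 3770, 726, 3894, 2662, 497, 35, 2778, 1379,
           3920, 1379, 99, 510, 582, 308, 3367, 99, 373, 454]

ybar = statistics.mean(battery)    # sample mean: 1274.14
s2 = statistics.variance(battery)  # sample variance: about 1505156
s = statistics.stdev(battery)      # sample standard deviation: about 1226.85
```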

SUMMARY: The table below succinctly summarizes the salient differences between a
population and a sample (a parameter and a statistic):

Group of individuals Numerical quantity Status

Population (Not observed) Parameter Unknown

Sample (Observed) Statistic Calculated from sample data

Statistical inference deals with making (probabilistic) statements about a population

of individuals based on information that is contained in a sample taken from the popu-

lation. We do this by

(a) estimating unknown population parameters with sample statistics

(b) quantifying the uncertainty (variability) that arises in the estimation process.

These are both necessary to construct confidence intervals and to perform hypothesis
tests, two important exercises discussed in this chapter.

4.3 Point estimators and sampling distributions

NOTATION: To keep our discussion as general as possible (as the material in this
subsection can be applied to many situations), we will let θ denote a population parameter.
For example, θ could denote a population mean, a population variance, a population
proportion, a Weibull or gamma model parameter, etc. It could also denote a
parameter in a regression context (Chapters 6-7).


TERMINOLOGY: A point estimator θ̂ is a statistic that is used to estimate a
population parameter θ. Common examples of point estimators are:

Ȳ → a point estimator for μ (population mean)
S² → a point estimator for σ² (population variance)
S → a point estimator for σ (population standard deviation).

CRUCIAL POINT: It is important to note that, in general, an estimator θ̂ is a statistic,
so it depends on the sample of data Y_1, Y_2, ..., Y_n.

The data Y_1, Y_2, ..., Y_n come from the sampling process; e.g., different random
samples will yield different data sets Y_1, Y_2, ..., Y_n.

In this light, because the sample values Y_1, Y_2, ..., Y_n will vary from sample to
sample, the value of θ̂ will too! It therefore makes sense to think about all possible
values of θ̂; that is, the distribution of θ̂.

TERMINOLOGY: The distribution of an estimator θ̂ (a statistic) is called its sampling
distribution. A sampling distribution describes mathematically how θ̂ would vary in
repeated sampling. We will study many sampling distributions in this chapter.

TERMINOLOGY: We say that θ̂ is an unbiased estimator of θ if and only if

E(θ̂) = θ.

In other words, the mean of the sampling distribution of θ̂ is equal to θ. Note that
unbiasedness is a characteristic describing the center of a sampling distribution. This
deals with accuracy.

RESULT: Mathematics shows that when Y_1, Y_2, ..., Y_n is a random sample,

E(Ȳ) = μ
E(S²) = σ².

That is, Ȳ and S² are unbiased estimators of their population analogues.


GOAL: Not only do we desire to use point estimators θ̂ that are unbiased, we would
also like for them to have small variability. In other words, when θ̂ misses θ, we would
like for it to not miss by much. This deals with precision.

MAIN POINT: Accuracy and precision are the two main mathematical characteristics
that arise when evaluating the quality of a point estimator θ̂. We desire point estimators
which are unbiased (perfectly accurate) and have small variance (highly precise).

TERMINOLOGY: The standard error of a point estimator θ̂ is equal to

se(θ̂) = √var(θ̂).

In other words, the standard error is equal to the standard deviation of the sampling
distribution of θ̂. An estimator's standard error measures the amount of variability in
the point estimator θ̂. Therefore,

smaller se(θ̂) ⟹ more precise.

4.4 Sampling distributions involving Ȳ

NOTE: This subsection summarizes Sections 3.6-3.7 (VK).

Result 1: Suppose that Y_1, Y_2, ..., Y_n is a random sample from a N(μ, σ²) distribution.
The sample mean Ȳ has the following sampling distribution:

Ȳ ~ N(μ, σ²/n).

This result reminds us that

E(Ȳ) = μ.

That is, the sample mean Ȳ is an unbiased estimator of the population mean μ.

This result also shows that the standard error of Ȳ (as a point estimator) is

se(Ȳ) = √var(Ȳ) = √(σ²/n) = σ/√n.


Figure 4.2: Braking time example. Population distribution: Y ~ N(μ = 1.5, σ² = 0.16).
Also depicted are the sampling distributions of Ȳ when n = 5 and n = 25. [Plots omitted.]

Example 4.2. In Example 3.26 (notes), we examined the distribution of

Y = time (in seconds) to react to brake lights during in-traffic driving.

We assumed that

Y ~ N(μ = 1.5, σ² = 0.16).

We call this the population distribution, because it describes the distribution of values
of Y for all individuals in the population (here, in-traffic drivers).

Question. Suppose that we take a random sample of n = 5 drivers with times
Y_1, Y_2, ..., Y_5. What is the distribution of the sample mean Ȳ?

Solution. If the sample size is n = 5, then with μ = 1.5 and σ² = 0.16, we have

Ȳ ~ N(μ, σ²/n), i.e., Ȳ ~ N(1.5, 0.032).


This distribution describes the values of Ȳ we would expect to see in repeated sampling,
that is, if we repeatedly sampled n = 5 individuals from this population of in-traffic
drivers and calculated the sample mean Ȳ each time.

Question. Suppose that we take a random sample of n = 25 drivers with times
Y_1, Y_2, ..., Y_25. What is the distribution of the sample mean Ȳ?

Solution. If the sample size is n = 25, then with μ = 1.5 and σ² = 0.16, we have

Ȳ ~ N(μ, σ²/n), i.e., Ȳ ~ N(1.5, 0.0064).

The sampling distributions of Ȳ when n = 5 and when n = 25 are shown in Figure 4.2.
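Result 1 can also be illustrated by simulation: repeatedly draw n = 25 reaction times from the N(1.5, 0.16) population and compute Ȳ each time; the sample means should center at μ = 1.5 with variance near σ²/n = 0.0064. A seeded stdlib Python sketch (illustrative, not from the notes):

```python
import random
import statistics

random.seed(42)
mu, sigma, n, reps = 1.5, 0.4, 25, 20000

# Simulate the sampling distribution of the sample mean
means = []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(statistics.mean(sample))

center = statistics.mean(means)      # near mu = 1.5
spread = statistics.variance(means)  # near sigma^2/n = 0.0064
```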

4.4.1 Central Limit Theorem

Result 2: Suppose that Y_1, Y_2, ..., Y_n is a random sample from a population distribution
with mean μ and variance σ² (not necessarily a normal distribution). When the sample
size n is large, we have

Ȳ ~ AN(μ, σ²/n).

The symbol AN is read "approximately normal." This result is called the Central Limit
Theorem (CLT).

Result 1 guarantees that when the underlying population distribution is N(μ, σ²),
the sample mean

Ȳ ~ N(μ, σ²/n).

The Central Limit Theorem (Result 2) says that even if the population distribution
is not normal (Gaussian), the sampling distribution of the sample mean Ȳ will be
approximately normal (Gaussian) when the sample size is sufficiently large.

Example 4.3. The time to death for rats injected with a toxic substance, denoted by
Y (measured in days), follows an exponential distribution with λ = 1/5. That is,

Y ~ exponential(λ = 1/5).


Figure 4.3: Rat death times. Population distribution: Y ~ exponential(λ = 1/5). Also
depicted are the sampling distributions of Ȳ when n = 5 and n = 25. [Plots omitted; axes: time to death (in days) vs. PDF.]

This is the population distribution, that is, this distribution describes the time to
death for all rats in the population.

In Figure 4.3, I have shown the exponential(1/5) population distribution (solid
curve). I have also depicted the theoretical sampling distributions of Ȳ when n = 5
and when n = 25.

Main point: Notice how the sampling distribution of Ȳ begins to (albeit distantly)
resemble a normal distribution when n = 5. When n = 25, the sampling
distribution of Ȳ looks very much to be normal (Gaussian). This is precisely what is
conferred by the CLT. The larger the sample size n, the better a normal (Gaussian)
distribution approximates the true sampling distribution of Ȳ.


Example 4.4. When a batch of a certain chemical product is prepared, the amount of
a particular impurity in the batch (measured in grams) is a random variable Y with the
following population parameters:

μ = 4.0 g
σ² = (1.5 g)².

Suppose that n = 50 batches are prepared (independently). What is the probability that
the sample mean impurity amount Ȳ is greater than 4.2 grams?

Solution. With n = 50, μ = 4, and σ² = (1.5)², the CLT says that

Ȳ ~ AN(μ, σ²/n), i.e., Ȳ ~ AN(4, 0.045).

Therefore,

P(Ȳ > 4.2) = 1 − P(Ȳ < 4.2)
           ≈ 1-pnorm(4.2,4,sqrt(0.045)) = 0.1728893.

Important: Note that in making this (approximate) probability calculation, we never
made an assumption about the underlying population distribution shape.
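The CLT approximation here can be checked by Monte Carlo. The true population shape is unspecified in the example; as one illustrative choice with the stated mean 4 and variance (1.5)², the sketch below (an assumption for demonstration, not from the notes) draws batches from a right-skewed gamma population and estimates P(Ȳ > 4.2) directly:

```python
import random
import statistics

random.seed(7)
n, reps = 50, 20000
# Gamma population chosen so that mean = shape*scale = 4 and
# variance = shape*scale^2 = 2.25 (an illustrative skewed choice)
shape = 4.0 ** 2 / 1.5 ** 2
scale = 1.5 ** 2 / 4.0

count = 0
for _ in range(reps):
    ybar = statistics.mean(random.gammavariate(shape, scale) for _ in range(n))
    if ybar > 4.2:
        count += 1

p_hat = count / reps  # should land near the CLT answer 0.1729
```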

4.4.2 t distribution

Result 3: Suppose that Y_1, Y_2, ..., Y_n is a random sample from a N(μ, σ²) distribution.
Result 1 says the sample mean Ȳ has the following sampling distribution:

Ȳ ~ N(μ, σ²/n).

If we standardize Ȳ, we obtain

Z = (Ȳ − μ)/(σ/√n) ~ N(0, 1).

Replacing the population standard deviation σ with the sample standard deviation S,
we get a new sampling distribution:

t = (Ȳ − μ)/(S/√n) ~ t(n − 1),

a t distribution with degrees of freedom ν = n − 1.


Figure 4.4: Probability density functions of N(0, 1), t(2), and t(10). [Plots omitted.]

FACTS: The t distribution has the following characteristics:

It is continuous and symmetric about 0 (just like the standard normal distribution).

It is indexed by a value ν called the degrees of freedom.

In practice, ν is often an integer (related to the sample size).

As ν → ∞, t(ν) → N(0, 1); thus, when ν becomes larger, the t(ν) and the N(0, 1)
distributions look more alike.

When compared to the standard normal distribution, the t distribution, in general,
is less peaked and has more probability (area) in the tails.

The t pdf formula is complicated and is unnecessary for our purposes. R will
compute t probabilities and quantiles from the t distribution.


t R CODE: Suppose that T ~ t(ν).

F_T(t) = P(T ≤ t)  →  pt(t,ν)
φ_p (quantile)     →  qt(p,ν)

Example 4.5. Hollow pipes are to be used in an electrical wiring project. In testing
1-inch pipes, the data below were collected by a design engineer. The data are
measurements of Y, the outside diameter of this type of pipe (measured in inches). These
n = 25 pipes were randomly selected and measured, all in the same location.

1.296 1.320 1.311 1.298 1.315

1.305 1.278 1.294 1.311 1.290

1.284 1.287 1.289 1.292 1.301

1.298 1.287 1.302 1.304 1.301

1.313 1.315 1.306 1.289 1.291

From their extensive experience, the manufacturers of this pipe claim that the population
distribution is normal (Gaussian) and that the mean outside diameter is μ = 1.29 inches.
Under this assumption (which may or may not be true), calculate the value of

t = (ȳ − μ)/(s/√n).

Solution. We use R to find the sample mean ȳ and the sample standard deviation s:

> mean(pipes) ## sample mean

[1] 1.29908

> sd(pipes) ## sample standard deviation

[1] 0.01108272

With n = 25, we have

t = (1.29908 − 1.29)/(0.01108272/√25) ≈ 4.096.
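The same t statistic can be computed from the raw measurements with Python's standard library (an illustration alongside the R calls above):

```python
import math
import statistics

# The n = 25 outside diameters (in inches) from the notes
pipes = [1.296, 1.320, 1.311, 1.298, 1.315,
         1.305, 1.278, 1.294, 1.311, 1.290,
         1.284, 1.287, 1.289, 1.292, 1.301,
         1.298, 1.287, 1.302, 1.304, 1.301,
         1.313, 1.315, 1.306, 1.289, 1.291]

n = len(pipes)
ybar = statistics.mean(pipes)            # 1.29908
s = statistics.stdev(pipes)              # about 0.01108
t = (ybar - 1.29) / (s / math.sqrt(n))   # about 4.096
```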


Figure 4.5: t(24) probability density function. A mark at t = 4.096 has been added. [Plot omitted.]

Analysis. If the manufacturers' claim is true (that is, if μ = 1.29 inches), then

t = (ȳ − μ)/(s/√n)

comes from a t(24) distribution. The t(24) pdf is displayed above in Figure 4.5.

Key question: Does t = 4.096 seem like a value you would expect to see from this
distribution? If not, what might this suggest? Recall that t was computed under the
assumption that μ = 1.29 inches (the manufacturers' claim).

Question. The value t = 4.096 is what percentile of the t(24) distribution?

> pt(4.096,24)

[1] 0.9997934

Answer: t = 4.096 is approximately the 99.98th percentile of the t(24) distribution.


4.4.3 Normal quantile-quantile (qq) plots

IMPORTANT: Result 3 says that if Y₁, Y₂, ..., Yₙ is a random sample from a N(μ, σ²) distribution, then

t = (Ȳ − μ)/(S/√n) ~ t(n − 1).

An obvious question therefore arises: What if Y₁, Y₂, ..., Yₙ are non-normal (i.e., non-Gaussian)?

Answer: The t distribution result still approximately holds, even if the underlying population distribution is not perfectly normal. The approximation improves when

- the sample size is larger
- the population distribution is more symmetric (not highly skewed).

Because the normality assumption (for the population distribution) is not absolutely

critical for this t sampling distribution result to hold, we say that Result 3 is robust to

the normality assumption.

REMARK: Robustness is a nice property; it assures us that the underlying assumption

of normality is not an absolute requirement for the t distribution result to hold. Other

sampling distribution results (coming up) are not always robust to normality departures.

TERMINOLOGY: Just as we used Weibull qq plots to assess the Weibull model assumption in the last chapter, we can use a normal quantile-quantile (qq) plot to assess the normal distribution assumption. The plot is constructed as follows:

- On the vertical axis, we plot the observed data, ordered from low to high.
- On the horizontal axis, we plot the (ordered) theoretical quantiles from the distribution (model) assumed for the observed data (here, normal).

ILLUSTRATION: Figure 4.6 shows the normal qq plot for the pipe diameter data in Example 4.5. The ordered data do not match up perfectly with the normal quantiles, but the plot doesn't set off any serious alarms (insofar as a departure from normality is concerned).


Figure 4.6: Normal qq plot for the pipe diameter data in Example 4.5. The observed data are plotted versus the theoretical quantiles from a normal distribution. The line added passes through the first and third theoretical quartiles.

4.5 Confidence intervals for a population mean μ

4.5.1 Known population variance σ²

SETTING: To get things started, we will assume that Y₁, Y₂, ..., Yₙ is a random sample from a N(μ, σ²) population distribution. We will assume that

- the population variance σ² is known (largely unrealistic).
- the goal is to estimate the population mean μ.


We already know that Ȳ is an unbiased (point) estimator for μ, that is,

E(Ȳ) = μ.

However, reporting Ȳ alone does not acknowledge that there is variability attached to this estimator. For example, in Example 4.5, with the n = 25 measured pipes, reporting

ȳ ≈ 1.299 in

as an estimate of the population mean μ does not account for the fact that

- the 25 pipes measured were drawn randomly from a population of all pipes, and
- different samples would give different sets of pipes (and different values of ȳ).

In other words, using a point estimator only ignores important information; namely, how variable the population of pipes is.

REMEDY: To avoid this problem (i.e., to account for the uncertainty in the sampling procedure), we therefore pursue the topic of interval estimation (also known as confidence intervals). The main difference between a point estimate and an interval estimate is that

- a point estimate is a "one-shot" guess at the value of the parameter; this ignores the variability in the estimate.
- an interval estimate (i.e., confidence interval) is an interval of values. It is formed by taking the point estimate and then adjusting it downwards and upwards to account for the point estimate's variability. The end result is an interval estimate.

DERIVATION: We start our discussion by revisiting Result 1 in the last subsection. Recall that if Y₁, Y₂, ..., Yₙ is a random sample from a N(μ, σ²) distribution, then the sampling distribution of Ȳ is

Ȳ ~ N(μ, σ²/n)


Figure 4.7: N(0, 1) pdf. The upper 0.025 and lower 0.025 areas have been shaded. The associated quantiles are z_{0.025} ≈ 1.96 and −z_{0.025} ≈ −1.96, respectively.

and therefore

Z = (Ȳ − μ)/(σ/√n) ~ N(0, 1).

We now introduce new notation that identies quantiles from this distribution.

NOTATION: Let z_{α/2} denote the upper α/2 quantile from the N(0, 1) distribution. Because the N(0, 1) distribution is symmetric about z = 0, we know that −z_{α/2} is the lower α/2 quantile. For example, if α = 0.05 (see Figure 4.7), we know that

z_{0.05/2} = z_{0.025} ≈ 1.96
−z_{0.05/2} = −z_{0.025} ≈ −1.96.

To see where these values come from, you can use Table 1 (VK, pp 593-594) or R:


> qnorm(0.975,0,1) ## z_{0.025} ## upper 0.025 quantile

[1] 1.959964

> qnorm(0.025,0,1) ## -z_{0.025} ## lower 0.025 quantile

[1] -1.959964
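Python's standard library can reproduce these normal quantiles via statistics.NormalDist (a cross-check sketch; the notes themselves use R's qnorm):

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)
print(std_normal.inv_cdf(0.975))   # upper 0.025 quantile z_{0.025}, about 1.959964
print(std_normal.inv_cdf(0.025))   # lower 0.025 quantile, about -1.959964
```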

DERIVATION: In general, for any value of α, 0 < α < 1, we can write

1 − α = P(−z_{α/2} < Z < z_{α/2})
      = P(−z_{α/2} < (Ȳ − μ)/(σ/√n) < z_{α/2})
      = P(−z_{α/2} σ/√n < Ȳ − μ < z_{α/2} σ/√n)
      = P(z_{α/2} σ/√n > μ − Ȳ > −z_{α/2} σ/√n)
      = P(Ȳ + z_{α/2} σ/√n > μ > Ȳ − z_{α/2} σ/√n)
      = P(Ȳ − z_{α/2} σ/√n < μ < Ȳ + z_{α/2} σ/√n).

We call

(Ȳ − z_{α/2} σ/√n, Ȳ + z_{α/2} σ/√n)

a 100(1 − α) percent confidence interval for the population mean μ. This is sometimes written (more succinctly) as

Ȳ ± z_{α/2} σ/√n.

Note the form of the interval:

point estimate (Ȳ) ± quantile (z_{α/2}) × standard error (σ/√n).

Many confidence intervals we will study follow this same general form.

Here is how we interpret this interval: We say

"We are 100(1 − α) percent confident that the population mean μ is in this interval."


Unfortunately, the word "confident" does not mean "probability." The term "confidence" in confidence interval means that if we were able to sample from the population over and over again, each time computing a 100(1 − α) percent confidence interval for μ, then 100(1 − α) percent of the intervals we would compute would contain the population mean μ.

That is, "confidence" refers to long-term behavior of many intervals; not probability for a single interval. Because of this, we call 100(1 − α) the confidence level. Typical confidence levels are

- 90 percent (α = 0.10) ⇒ z_{0.05} ≈ 1.645
- 95 percent (α = 0.05) ⇒ z_{0.025} ≈ 1.96
- 99 percent (α = 0.01) ⇒ z_{0.005} ≈ 2.576.

The length of the 100(1 − α) percent confidence interval

Ȳ ± z_{α/2} σ/√n

is equal to

2 z_{α/2} σ/√n.

Therefore,

- the larger the sample size n, the smaller the interval length.
- the larger the population variance σ², the larger the interval length.
- the larger the confidence level 100(1 − α), the larger the interval length.

Clearly, shorter confidence intervals are preferred. They are more informative!
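A quick numerical illustration of the first point (a sketch with an illustrative σ = 1, not a calculation from the notes): the interval length is proportional to 1/√n, so quadrupling the sample size halves the length.

```python
from math import sqrt
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)            # z_{0.025}, about 1.96

def interval_length(n, sigma=1.0):
    """Length of the known-sigma z interval: 2 * z_{alpha/2} * sigma / sqrt(n)."""
    return 2 * z * sigma / sqrt(n)

# Quadrupling n (25 -> 100) halves the interval length
print(interval_length(25) / interval_length(100))   # 2.0
```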

Example 4.6. Civil engineers have found that the ability to see and read a sign at night depends in part on its "surround luminance"; i.e., the light intensity near the sign. The data below are n = 30 measurements of the random variable Y, the surround luminance (in candela per m²). The 30 measurements constitute a random sample from all signs in a large metropolitan area.


10.9 1.7 9.5 2.9 9.1 3.2 9.1 7.4 13.3 13.1

6.6 13.7 1.5 6.3 7.4 9.9 13.6 17.3 3.6 4.9

13.1 7.8 10.3 10.3 9.6 5.7 2.6 15.1 2.9 16.2

Based on past experience, the engineers assume a normal population distribution (for the population of all signs) with known population variance σ² = 20.

Question. Find a 90 percent confidence interval for μ, the mean surround luminance.

Solution. We first use R to calculate the sample mean ȳ:

> mean(intensity) ## sample mean

[1] 8.62

For a 90 percent confidence level; i.e., with α = 0.10, we use

z_{0.10/2} = z_{0.05} ≈ 1.645.

This can be determined from Table 1 (VK, pp 593-594) or from R:

> qnorm(0.95,0,1) ## z_{0.05} ## upper 0.05 quantile

[1] 1.644854

With n = 30 and σ² = 20, a 90 percent confidence interval for the mean surround luminance μ is

ȳ ± z_{α/2} σ/√n = 8.62 ± 1.645 (√20/√30) = (7.28, 9.96) candela/m².
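The notes carry out this z interval in R; the same calculation can be sketched with Python's standard library as a cross-check, using the luminance data and the known σ² = 20 from the example:

```python
from math import sqrt
from statistics import NormalDist, mean

# Surround luminance data from Example 4.6 (candela per m^2)
intensity = [10.9, 1.7, 9.5, 2.9, 9.1, 3.2, 9.1, 7.4, 13.3, 13.1,
             6.6, 13.7, 1.5, 6.3, 7.4, 9.9, 13.6, 17.3, 3.6, 4.9,
             13.1, 7.8, 10.3, 10.3, 9.6, 5.7, 2.6, 15.1, 2.9, 16.2]

n, sigma2 = len(intensity), 20.0               # known population variance
z = NormalDist().inv_cdf(0.95)                 # z_{0.05}, about 1.645
half = z * sqrt(sigma2 / n)                    # margin of error
ybar = mean(intensity)                         # 8.62
print(round(ybar - half, 2), round(ybar + half, 2))   # 7.28 9.96
```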

Interpretation: We are 90 percent confident that the mean surround luminance μ for all signs in the population is between 7.28 and 9.96 candela/m².

Further analysis: Recall that the engineers claimed that the population of all light

intensities is described by a normal (Gaussian) distribution. In Figure 4.8, we display a

normal qq plot to check this assumption. The qq plot might cause some concern about

the normality assumption. There is some mild evidence of disagreement with the normal

model (although with a small sample, it is always hard to be sure).


Figure 4.8: Normal qq plot for the light intensity data in Example 4.6. The observed data are plotted versus the theoretical quantiles from a normal distribution. The line added passes through the first and third theoretical quartiles.

Is this a serious cause for concern?: Probably not. Recall that even if the population distribution (here, the distribution of all light intensity measurements in the city) is not perfectly normal, we still have

Ȳ ~ AN(μ, σ²/n),

for n large, by the Central Limit Theorem. Therefore, our 90 percent confidence interval is still approximately valid. A sample of size n = 30 is pretty large. In other words, at n = 30, the CLT approximation above is usually "kicking in" rather well unless the underlying population distribution is grossly skewed (and I mean very grossly). This is not the case here.


4.5.2 Unknown population variance σ²

SETTING: We will continue to assume that Y₁, Y₂, ..., Yₙ is a random sample from a N(μ, σ²) population distribution.

- Our goal is the same; namely, to write a 100(1 − α) percent confidence interval for the population mean μ.
- However, we will no longer make the (rather unrealistic) assumption that the population variance σ² is known.

RECALL: If you look back in the notes at the known σ² case, you will see that to derive a 100(1 − α) percent confidence interval for μ, we started with the following distributional result:

Z = (Ȳ − μ)/(σ/√n) ~ N(0, 1).

This led us to the following confidence interval formula:

Ȳ ± z_{α/2} σ/√n.

The obvious problem is that, because σ² is now unknown, we can not calculate the interval. Not to worry; we just need a different starting point. Recall that

t = (Ȳ − μ)/(S/√n) ~ t(n − 1),

where S is the sample standard deviation (a point estimator for the population standard deviation σ). This result is all we need; in fact, it is straightforward to reproduce the known σ² derivation and tailor it to this (now more realistic) case. A 100(1 − α) percent confidence interval for μ is given by

Ȳ ± t_{n−1,α/2} S/√n.

The symbol t_{n−1,α/2} denotes the upper α/2 quantile from a t distribution with ν = n − 1 degrees of freedom. This value can be easily obtained from R. For those of you that like probability tables, VK tables the t distributions in Table 2 (pp 595).


We see that the interval again has the same form:

point estimate (Ȳ) ± quantile (t_{n−1,α/2}) × standard error (S/√n).

We interpret the interval in the same way.

"We are 100(1 − α) percent confident that the population mean μ is in this interval."

Example 4.7. Acute exposure to cadmium produces respiratory distress and kidney and liver damage (and possibly death). For this reason, the level of airborne cadmium dust and cadmium oxide fume in the air, denoted by Y (measured in milligrams of cadmium per m³ of air), is closely monitored. A random sample of n = 35 measurements from a large factory are given below:

0.044 0.030 0.052 0.044 0.046 0.020 0.066

0.052 0.049 0.030 0.040 0.045 0.039 0.039

0.039 0.057 0.050 0.056 0.061 0.042 0.055

0.037 0.062 0.062 0.070 0.061 0.061 0.058

0.053 0.060 0.047 0.051 0.054 0.042 0.051

Based on past experience, engineers assume a normal population distribution (for the

population of all cadmium measurements).

Question. Find a 99 percent confidence interval for μ, the mean level of airborne cadmium.

Solution. We first use R to calculate the sample mean ȳ and the sample standard deviation s:

> mean(cadmium) ## sample mean

[1] 0.04928571

> sd(cadmium) ## sample standard deviation

[1] 0.0110894


For a 99 percent confidence level; i.e., with α = 0.01, we use

t_{34,0.01/2} = t_{34,0.005} ≈ 2.728.

Note that Table 2 (VK, pp 595) is not helpful here; the ν = 34 degrees of freedom quantiles are not listed. From R, we get:

> qt(0.995,34) ## t_{34,0.005} ## upper 0.005 quantile

[1] 2.728394

With n = 35, ȳ ≈ 0.049, and s ≈ 0.011, a 99 percent confidence interval for the population mean level of airborne cadmium is

ȳ ± t_{n−1,α/2} s/√n = 0.049 ± 2.728 (0.011/√35) = (0.044, 0.054) mg/m³.

Interpretation: We are 99 percent confident that the population mean level of airborne cadmium μ is between 0.044 and 0.054 mg/m³.
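The notes obtain this interval from R's t.test; the same arithmetic can be sketched in Python's standard library. Since the standard library has no t quantile function, the quantile t_{34,0.005} ≈ 2.728394 reported by qt(0.995,34) above is hard-coded here:

```python
from math import sqrt
from statistics import mean, stdev

# Airborne cadmium data from Example 4.7 (mg per m^3)
cadmium = [0.044, 0.030, 0.052, 0.044, 0.046, 0.020, 0.066,
           0.052, 0.049, 0.030, 0.040, 0.045, 0.039, 0.039,
           0.039, 0.057, 0.050, 0.056, 0.061, 0.042, 0.055,
           0.037, 0.062, 0.062, 0.070, 0.061, 0.061, 0.058,
           0.053, 0.060, 0.047, 0.051, 0.054, 0.042, 0.051]

n = len(cadmium)
t_quantile = 2.728394                     # t_{34,0.005}, from qt(0.995,34) in R
half = t_quantile * stdev(cadmium) / sqrt(n)
ybar = mean(cadmium)
print(round(ybar - half, 4), round(ybar + half, 4))   # 0.0442 0.0544
```

These endpoints agree (to rounding) with the t.test output shown below.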

NOTE: It is possible to implement the t interval procedure entirely in R:

> # Calculate t interval directly

> t.test(cadmium,conf.level=0.99)$conf.int

[1] 0.04417147 0.05439996

Further analysis: Recall that the engineers claimed that the population of all airborne

cadmium concentrations is described by a normal (Gaussian) distribution. In Figure

4.9, we display the normal qq plot to check this assumption. The qq plot looks pretty

supportive of the normal assumption, although there is mild evidence of a slight departure

in the upper tail.

ROBUSTNESS: The t confidence interval is based on the population distribution being normal (Gaussian). However, this interval is robust to departures from normality; i.e., even if the population distribution is non-normal (non-Gaussian), we can still use the t confidence interval and get approximately valid results.


Figure 4.9: Normal qq plot for the cadmium data in Example 4.7. The observed data are plotted versus the theoretical quantiles from a normal distribution. The line added passes through the first and third theoretical quartiles.

4.5.3 Sample size determination

MOTIVATION: In the planning stages of an experiment or investigation, it is often of interest to determine how many individuals are needed to write a confidence interval with a given level of precision. For example, we might want to construct a 95 percent confidence interval for a population mean μ so that the interval length is no more than 5 units (e.g., days, inches, dollars, etc.). Of course, collecting data almost always costs money! Therefore, one must be cognizant not only of the statistical issues associated with sample size determination, but also of the practical issues like cost, time spent in data collection, personnel training, etc.


SETTING: Suppose that Y₁, Y₂, ..., Yₙ is a random sample from a N(μ, σ²) population, where σ² is known. In this (known σ²) situation, recall that a 100(1 − α) percent confidence interval for μ is given by

Ȳ ± z_{α/2} (σ/√n),

where B = z_{α/2} (σ/√n), say. The quantity B is called the margin of error.

FORMULA: In the setting described above, it is possible to determine the sample size n necessary once we specify these three pieces of information:

- the value of σ² (or an educated guess at its value; e.g., from past information, etc.)
- the confidence level, 100(1 − α)
- the margin of error, B.

This is true because

B = z_{α/2} (σ/√n) ⟺ n = (z_{α/2} σ/B)².

Example 4.8. In a biomedical experiment, we would like to estimate the population mean remaining life μ of healthy rats that are given a certain dose of a toxic substance. Suppose that we would like to write a 95 percent confidence interval for μ with a margin of error equal to B = 2 days. From past studies, remaining rat lifetimes have been approximated by a normal distribution with standard deviation σ = 8 days. How many rats should we use for the experiment?

Solution. With z_{0.05/2} = z_{0.025} ≈ 1.96, B = 2, and σ = 8, the desired sample size to estimate μ is

n = (z_{α/2} σ/B)² = (1.96 × 8/2)² ≈ 61.46.

We would sample n = 62 rats to achieve these goals.
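The sample size rule always rounds up (a fractional rat is not possible); a minimal Python sketch of the calculation, with the function name being mine rather than the notes':

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_mean(sigma, B, conf=0.95):
    """Smallest n so the known-sigma z interval for mu has margin of error <= B."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # z_{alpha/2}
    return ceil((z * sigma / B) ** 2)              # round up to the next whole unit

print(sample_size_for_mean(sigma=8, B=2))          # 62
```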

> qnorm(0.975,0,1)

[1] 1.959964


4.6 Confidence interval for a population proportion p

SITUATION: We now switch gears and focus on a new parameter: the population proportion p. This parameter emerges when the characteristic we measure on each individual is binary (i.e., only 2 outcomes possible). Here are some examples:

- p = proportion of defective circuit boards
- p = proportion of customers who are satisfied
- p = proportion of payments received on time
- p = proportion of HIV positives in SC.

To start our discussion, we need to recall the Bernoulli trial assumptions for each

individual in the sample:

1. each individual results in a success or a failure,

2. the individuals are independent, and

3. the probability of success, denoted by p, 0 < p < 1, is the same for every

individual.

In our examples above,

- success ↔ circuit board defective
- success ↔ customer satisfied
- success ↔ payment received on time
- success ↔ HIV positive individual.

RECALL: If the individual success/failure statuses in the sample adhere to the Bernoulli

trial assumptions, then

Y = the number of successes out of n sampled individuals

follows a binomial distribution, that is, Y ~ b(n, p). The statistical problem at hand is

to use the information in Y to estimate p.


POINT ESTIMATOR: A natural point estimator for p, the population proportion, is

p̂ = Y/n,

the sample proportion. This statistic is simply the proportion of "successes" in the sample (out of n individuals).

PROPERTIES: Fairly simple arguments can be used to show the following results:

E(p̂) = p
se(p̂) = √(p(1 − p)/n).

The first result says that the sample proportion p̂ is an unbiased estimator of the population proportion p. The second (standard error) result quantifies the precision of p̂ as an estimator of p.
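As a small numerical illustration of the standard error formula (a sketch; the values p = 0.5 and n = 100 are illustrative, not from the notes): se(p̂) = √(0.5 × 0.5/100) = 0.05.

```python
from math import sqrt

def se_phat(p, n):
    """Standard error of the sample proportion: sqrt(p(1-p)/n)."""
    return sqrt(p * (1 - p) / n)

print(se_phat(0.5, 100))    # 0.05
# The standard error shrinks like 1/sqrt(n): quadrupling n halves it
print(se_phat(0.5, 400))    # 0.025
```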

SAMPLING DISTRIBUTION: Knowing the sampling distribution of p̂ is critical if we are going to formalize statistical inference procedures for p. In this situation, we appeal to an approximate result (conferred by the CLT) which says that

p̂ ~ AN(p, p(1 − p)/n),

when the sample size n is large.

RESULT: An approximate 100(1 − α) percent confidence interval for p is given by

p̂ ± z_{α/2} √(p̂(1 − p̂)/n).

This interval should be used only when the sample size n is large. A common "rule of thumb" (to use this interval formula) is to require

n p̂ ≥ 5
n(1 − p̂) ≥ 5.

Under these conditions, the CLT should adequately approximate the true sampling distribution of p̂, thereby making the confidence interval formula above approximately valid.


Note again the form of the interval:

point estimate (p̂) ± quantile (z_{α/2}) × standard error (√(p̂(1 − p̂)/n)).

We interpret the interval in the same way.

"We are 100(1 − α) percent confident that the population proportion p is in this interval."

The value z_{α/2} is the upper α/2 quantile from the N(0, 1) distribution.

Example 4.9. One source of water pollution is gasoline leakage from underground

storage tanks. In Pennsylvania, a random sample of n = 74 gasoline stations is selected

and the tanks are inspected; 10 stations are found to have at least one leaking tank.

Question. Calculate a 95 percent confidence interval for p, the population proportion of gasoline stations with at least one leaking tank.

Solution. In this situation, we interpret

- gasoline station = "individual trial"
- at least one leaking tank = "success"
- p = population proportion of stations with at least one leaking tank.

For 95 percent confidence, we need z_{0.05/2} = z_{0.025} ≈ 1.96. The sample proportion of stations with at least one leaking tank is

p̂ = 10/74 ≈ 0.135.

Therefore, an approximate 95 percent confidence interval for p is

0.135 ± 1.96 √(0.135(1 − 0.135)/74) = (0.057, 0.213).

Interpretation: We are 95 percent confident that the population proportion of stations in Pennsylvania with at least one leaking tank is between 0.057 and 0.213.
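A quick Python cross-check of this interval (a sketch; the notes do the arithmetic by hand):

```python
from math import sqrt
from statistics import NormalDist

n, y = 74, 10                          # stations sampled, stations with a leaking tank
phat = y / n                           # sample proportion, about 0.135
z = NormalDist().inv_cdf(0.975)        # z_{0.025}, about 1.96
half = z * sqrt(phat * (1 - phat) / n) # margin of error
print(round(phat - half, 3), round(phat + half, 3))   # 0.057 0.213
```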


CLT approximation check: We have

n p̂ = 74 (10/74) = 10
n(1 − p̂) = 74 (1 − 10/74) = 64.

Both of these are larger than 5 ⇒ we can feel comfortable in using this interval formula.

QUESTION: Suppose that we would like to write a 100(1 − α) percent confidence interval for p, a population proportion. We know that

p̂ ± z_{α/2} √(p̂(1 − p̂)/n)

is an approximate 100(1 − α) percent confidence interval for p. What sample size n should we use?

SAMPLE SIZE DETERMINATION: To determine the necessary sample size, we first need to specify two pieces of information:

- the confidence level 100(1 − α)
- the margin of error:

B = z_{α/2} √(p̂(1 − p̂)/n).

A small problem arises. Note that B depends on p̂. Unfortunately, p̂ can only be calculated once we know the sample size n. We overcome this problem by replacing p̂ with p₀, an a priori guess at its value. The last expression becomes

B = z_{α/2} √(p₀(1 − p₀)/n).

Solving this equation for n, we get

n = (z_{α/2}/B)² p₀(1 − p₀).

This is the desired sample size n to find a 100(1 − α) percent confidence interval for p with a prescribed margin of error (roughly) equal to B. I say "roughly," because there may be additional uncertainty arising from our use of p₀ (our best guess).


CONSERVATIVE APPROACH: If there is no sensible guess for p available, use p₀ = 0.5. In this situation, the resulting value for n will be as large as possible. Put another way, using p₀ = 0.5 gives the most conservative solution (i.e., the largest sample size, n). This is true because

n = n(p₀) = (z_{α/2}/B)² p₀(1 − p₀),

when viewed as a function of p₀, is maximized when p₀ = 0.5.

Example 4.10. You have been asked to estimate the proportion of raw material (in a certain manufacturing process) that is being "scrapped"; e.g., the material is so defective that it can not be reworked. If this proportion is larger than 10 percent, this will be deemed (by management) to be an unacceptable continued operating cost and a substantial process overhaul will be performed. Past experience suggests that the scrap rate is about 5 percent, but recent information suggests that this rate may be increasing.

Question. You would like to write a 95 percent confidence interval for p, the population proportion of raw material that is to be scrapped, with a margin of error equal to B = 0.02. How many pieces of material should you ask to be sampled?

Solution. For 95 percent confidence, we need z_{0.05/2} = z_{0.025} ≈ 1.96. In providing an initial guess, we have options; we could use

- p₀ = 0.05 (historical scrap rate)
- p₀ = 0.10 (critical mass value)
- p₀ = 0.50 (most conservative choice).

For these choices, we have

n = (1.96/0.02)² × 0.05(1 − 0.05) ≈ 457
n = (1.96/0.02)² × 0.10(1 − 0.10) ≈ 865
n = (1.96/0.02)² × 0.50(1 − 0.50) ≈ 2401.

As we can see, the guessed value of p₀ has a substantial impact on the final sample size calculation.
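These three calculations can be reproduced with a short Python helper (a sketch; the function name is mine, and the result is rounded up to the next whole piece of material):

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_proportion(p0, B, conf=0.95):
    """Sample size n = (z_{alpha/2}/B)^2 * p0*(1 - p0), rounded up."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return ceil((z / B) ** 2 * p0 * (1 - p0))

# The three guesses considered in Example 4.10
print([sample_size_for_proportion(p0, B=0.02) for p0 in (0.05, 0.10, 0.50)])
# [457, 865, 2401]
```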


4.7 Confidence interval for a population variance σ²

MOTIVATION: In many situations, one is concerned not with the mean of an underlying (continuous) population distribution, but with the variance σ² instead. If σ² is excessively large, this could point to a potential problem with a manufacturing process, for example, where there is too much variation in the measurements produced. In a laboratory setting, chemical engineers might wish to estimate the variance σ² attached to a measurement system (e.g., scale, caliper, etc.). In field trials, agronomists are often interested in comparing the variability levels for different cultivars or genetically-altered varieties. In clinical trials, physicians are often concerned if there are substantial differences in the variation levels of patient responses at different clinic sites.

NEW RESULT: Suppose that Y₁, Y₂, ..., Yₙ is a random sample from a N(μ, σ²) distribution. The quantity

Q = (n − 1)S²/σ² ~ χ²(n − 1),

a χ² distribution with ν = n − 1 degrees of freedom.

FACTS: The χ² distribution has the following characteristics:

- It is continuous, skewed to the right, and always positive.
- It is indexed by a value ν called the degrees of freedom. In practice, ν is often an integer (related to the sample size).
- The χ² pdf formula is unnecessary for our purposes. R will compute χ² probabilities and quantiles from the χ² distribution.

χ² R CODE: Suppose that Q ~ χ²(ν).

F_Q(q) = P(Q ≤ q): pchisq(q,ν)
pth quantile: qchisq(p,ν)

NOTE: Table 3 (VK, pp 596-597) catalogues some quantiles from some χ² distributions.


Figure 4.10: χ² probability density functions with different degrees of freedom (panels show df = 5, 10, and 20).

GOAL: Suppose that Y₁, Y₂, ..., Yₙ is a random sample from a N(μ, σ²) distribution. We would like to write a 100(1 − α) percent confidence interval for σ².

NOTATION: Let χ²_{n−1,α/2} denote the lower α/2 quantile and let χ²_{n−1,1−α/2} denote the upper α/2 quantile of the χ²(n − 1) distribution; i.e., χ²_{n−1,α/2} and χ²_{n−1,1−α/2} satisfy

P(Q < χ²_{n−1,α/2}) = α/2
P(Q > χ²_{n−1,1−α/2}) = α/2,

respectively. Note that, unlike the N(0, 1) and t distributions, the χ² distribution is not symmetric. Therefore, different notation is needed to identify the quantiles of χ² distributions (this is nothing to get worried about).


DERIVATION: Because Q ~ χ²(n − 1), we write

1 − α = P(χ²_{n−1,α/2} < Q < χ²_{n−1,1−α/2})
      = P(χ²_{n−1,α/2} < (n − 1)S²/σ² < χ²_{n−1,1−α/2})
      = P(1/χ²_{n−1,α/2} > σ²/[(n − 1)S²] > 1/χ²_{n−1,1−α/2})
      = P((n − 1)S²/χ²_{n−1,α/2} > σ² > (n − 1)S²/χ²_{n−1,1−α/2})
      = P((n − 1)S²/χ²_{n−1,1−α/2} < σ² < (n − 1)S²/χ²_{n−1,α/2}).

This argument shows that

((n − 1)S²/χ²_{n−1,1−α/2}, (n − 1)S²/χ²_{n−1,α/2})

is a 100(1 − α) percent confidence interval for the population variance σ². We interpret the interval in the same way.

"We are 100(1 − α) percent confident that the population variance σ² is in this interval."

IMPORTANT: A 100(1 − α) percent confidence interval for the population standard deviation σ arises from simply taking the square root of the endpoints of the σ² interval. That is,

(√[(n − 1)S²/χ²_{n−1,1−α/2}], √[(n − 1)S²/χ²_{n−1,α/2}])

is a 100(1 − α) percent confidence interval for the population standard deviation σ. In practice, this interval may be preferred over the σ² interval, because standard deviation is a measure of variability in terms of the original units (e.g., dollars, inches, days, etc.). The variance is measured in squared units (e.g., dollars², in², days², etc.).

Example 4.11. Indoor swimming pools are noted for their poor acoustical properties.

Suppose your goal is to design a pool in such a way that


- the population mean time that it takes for a low-frequency sound to die out is μ = 1.3 seconds
- the population standard deviation for the distribution of "die-out" times is σ = 0.6 seconds.

Computer simulations of a preliminary design are conducted to see whether these standards are being met; here are data from n = 20 independently-run simulations. The data are obtained on the time (in seconds) it takes for the low-frequency sound to die out.

1.34 2.56 1.28 2.25 1.84 2.35 0.77 1.84 1.80 2.44

0.86 1.29 0.12 1.87 0.71 2.08 0.71 0.30 0.54 1.48

Question. Find a 95 percent confidence interval for the population standard deviation of times σ. What does this interval suggest about whether the preliminary design conforms to specifications (with respect to variability)?

Solution. For 95 percent confidence (i.e., using α = 0.05), we need to use

χ²_{n−1,α/2} = χ²_{19,0.025} ≈ 8.907
χ²_{n−1,1−α/2} = χ²_{19,0.975} ≈ 32.852.

I got these quantiles from R:

> qchisq(0.025,19) ## chi^2_{19,0.025}

[1] 8.906516

> qchisq(0.975,19) ## chi^2_{19,0.975}

[1] 32.85233

We also need to calculate the sample variance s². From R, we get s² ≈ 0.555.

> var(sounds) ## sample variance

[1] 0.554666


Figure 4.11: Normal qq plot for the swimming pool sound time data in Example 4.11. The observed data are plotted versus the theoretical quantiles from a normal distribution. The line added passes through the first and third theoretical quartiles.

A 95 percent confidence interval for σ² is

((n − 1)s²/χ²_{n−1,1−α/2}, (n − 1)s²/χ²_{n−1,α/2}) = (19(0.555)/32.852, 19(0.555)/8.907) = (0.321, 1.184).

Interpretation: We are 95 percent confident that the population variance σ² is between 0.321 and 1.184 sec².
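This calculation is easy to mirror in Python (a sketch: the χ² quantiles are hard-coded from the qchisq output above, since Python's standard library has no χ² quantile function, and s² = 0.554666 is taken from the var(sounds) output):

```python
from math import sqrt

n = 20
s2 = 0.554666                 # sample variance, from var(sounds) in R
chi_lo = 8.906516             # chi^2_{19,0.025}, from qchisq(0.025,19)
chi_hi = 32.85233             # chi^2_{19,0.975}, from qchisq(0.975,19)

# CI for sigma^2: the smaller endpoint uses the LARGER chi^2 quantile
var_lo = (n - 1) * s2 / chi_hi
var_hi = (n - 1) * s2 / chi_lo
print(round(var_lo, 3), round(var_hi, 3))               # about (0.321, 1.183)

# CI for sigma: square roots of the endpoints
print(round(sqrt(var_lo), 3), round(sqrt(var_hi), 3))   # about (0.566, 1.088)
```

The tiny discrepancy from the (0.321, 1.184) and (0.567, 1.088) reported in the notes comes from the notes rounding s² to 0.555 before dividing.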

A 95 percent confidence interval for σ (which is what we originally wanted) is

(√0.321, √1.184) = (0.567, 1.088).

Interpretation: We are 95 percent confident that the population standard deviation σ is between 0.567 and 1.088 sec. From this interval, it appears as though the pool design specifications are being met (with respect to the population standard deviation level). Note that σ = 0.6 is contained in the confidence interval for σ.

Major warning: Unlike the z and t confidence intervals for a population mean μ, the χ² interval for σ² (and for σ) is not robust to departures from normality. If the underlying population distribution is non-normal (non-Gaussian), then the confidence interval formulas for σ² and σ are not to be used. Therefore, it is very important to check the normality assumption with these interval procedures (e.g., use a qq-plot).

Analysis: With only n = 20 measurements, it is somewhat hard to tell, but the qq-plot in Figure 4.11 looks fairly straight. Small sample sizes make interpreting qq-plots more difficult (e.g., the analyst may look for patterns that are not really there).

4.8 Confidence intervals for the difference of two population means μ₁ − μ₂: Independent samples

REMARK: In practice, it is very common to compare the same characteristic (mean, proportion, variance) from two different distributions. For example, we may wish to compare

- the mean starting salaries of male and female engineers (compare μ₁ and μ₂)
- the proportion of "scrap" produced from two manufacturing processes (compare p₁ and p₂)
- the variance of sound levels from two indoor swimming pool designs (compare σ₁² and σ₂²).

Our previous work is applicable only for a single distribution (i.e., a single mean μ, a single proportion p, and a single variance σ²). We therefore need to extend these procedures to handle two distributions. We start with comparing two means.


TWO-SAMPLE PROBLEM: Suppose that we have two independent samples:

Sample 1 : Y

11

, Y

12

, ..., Y

1n

1

N(

1

,

2

1

) random sample

Sample 2 : Y

21

, Y

22

, ..., Y

2n

2

N(

2

,

2

2

) random sample.

GOAL: Construct a 100(1−α) percent confidence interval for the difference of population means μ₁ − μ₂.

POINT ESTIMATORS: We define the statistics

    Ȳ₁₊ = (1/n₁) Σ_{j=1}^{n₁} Y_{1j} = sample mean for sample 1
    Ȳ₂₊ = (1/n₂) Σ_{j=1}^{n₂} Y_{2j} = sample mean for sample 2
    S₁² = [1/(n₁ − 1)] Σ_{j=1}^{n₁} (Y_{1j} − Ȳ₁₊)² = sample variance for sample 1
    S₂² = [1/(n₂ − 1)] Σ_{j=1}^{n₂} (Y_{2j} − Ȳ₂₊)² = sample variance for sample 2.

4.8.1 Equal variance case: σ₁² = σ₂²

GOAL: We want to write a confidence interval for μ₁ − μ₂, but how this interval is constructed depends on the values of σ₁² and σ₂². In particular, we consider two cases:

- σ₁² = σ₂²; that is, the two population variances are equal
- σ₁² ≠ σ₂²; that is, the two population variances are not equal.

We first consider the equal variance case. Addressing this case requires us to start with the following (sampling) distribution result:

    T = [(Ȳ₁₊ − Ȳ₂₊) − (μ₁ − μ₂)] / √[Sₚ² (1/n₁ + 1/n₂)] ~ t(n₁ + n₂ − 2),

where

    Sₚ² = [(n₁ − 1)S₁² + (n₂ − 1)S₂²] / (n₁ + n₂ − 2).


[Figure: density curves f(y) of Distribution 1 and Distribution 2, plotted for y between 40 and 70.]

Figure 4.12: Two normal distributions with σ₁² = σ₂².

Some comments are in order:

- For this sampling distribution to hold (exactly), we need
  - the two samples to be independent
  - the two population distributions to be normal (Gaussian)
  - the two population distributions to have the same variance; i.e., σ₁² = σ₂².

- The statistic Sₚ² is called the pooled sample variance estimator of the common population variance, say, σ². Algebraically, it is simply a weighted average of the two sample variances S₁² and S₂² (where the weights are functions of the sample sizes n₁ and n₂).

- The sampling distribution T ~ t(n₁ + n₂ − 2) should suggest to you that confidence interval quantiles will come from this t distribution; note that this distribution depends on the sample sizes from both samples.

In particular, because T ~ t(n₁ + n₂ − 2), we can find the value t_{n₁+n₂−2, α/2} that satisfies

    P(−t_{n₁+n₂−2, α/2} < T < t_{n₁+n₂−2, α/2}) = 1 − α.

Substituting T into the last expression and performing algebraic manipulations, we obtain

    (Ȳ₁₊ − Ȳ₂₊) ± t_{n₁+n₂−2, α/2} √[Sₚ² (1/n₁ + 1/n₂)].

This is a 100(1−α) percent confidence interval for the mean difference μ₁ − μ₂.

We see that the interval again has the same form:

    point estimate ± quantile × standard error
    = (Ȳ₁₊ − Ȳ₂₊) ± t_{n₁+n₂−2, α/2} × √[Sₚ² (1/n₁ + 1/n₂)].

We interpret the interval in the same way.

    We are 100(1−α) percent confident that the population mean difference μ₁ − μ₂ is in this interval.

Important: In two-sample situations, it is often of interest to see if the means μ₁ and μ₂ are different.

- If the confidence interval for μ₁ − μ₂ includes 0, this does not suggest that the means μ₁ and μ₂ are different.
- If the confidence interval for μ₁ − μ₂ does not include 0, it does.

Example 4.12. In the vicinity of a nuclear power plant, environmental engineers from the EPA would like to determine if there is a difference between the mean weight in fish (of the same species) from two locations. Independent samples are taken from each location and the following weights (in ounces) are observed:


[Figure: side-by-side boxplots of weight (in ounces) for Location 1 and Location 2.]

Figure 4.13: Boxplots of fish weights by location in Example 4.12.

Location 1: 21.9 18.5 12.3 16.7 21.0 15.1 18.2 23.0 36.8 26.6

Location 2: 22.0 20.6 15.4 17.9 24.4 15.6 11.4 17.5

Question. Construct a 90 percent confidence interval for the mean difference μ₁ − μ₂. Here, μ₁ (μ₂) denotes the population mean weight of all fish at location 1 (2).

Solution. In order to visually assess the equal variance assumption, we use boxplots to display the data in each sample; see Figure 4.13. The equal variance assumption, based on the figure, looks reasonable; note that the spread in each distribution looks roughly the same (save the outlier in Location 1).

NOTE: Instead of resorting to hand calculation (as we have often done in previous examples), we will use R to calculate the confidence interval directly:

> t.test(loc.1,loc.2,conf.level=0.90,var.equal=TRUE)$conf.int

[1] -1.940438 7.760438

Therefore, a 90 percent confidence interval for the mean difference μ₁ − μ₂ is

    (−1.940, 7.760) oz.

Interpretation: We are 90 percent confident that the mean difference μ₁ − μ₂ is between −1.940 and 7.760 oz. Note that this interval includes 0. Therefore, we do not have sufficient evidence that the population mean fish weights μ₁ and μ₂ are different.
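The notes use R throughout; as a cross-check on the R output above, the equal-variance interval can also be computed from the formulas directly. Here is a sketch in Python (standard library only). The t quantile t_{16, 0.05} ≈ 1.7459 is hardcoded as an assumption here (in R it comes from qt(0.95, 16)); everything else follows the pooled-variance formula.

```python
from math import sqrt
from statistics import mean, variance

# Fish weight data from Example 4.12 (ounces)
loc1 = [21.9, 18.5, 12.3, 16.7, 21.0, 15.1, 18.2, 23.0, 36.8, 26.6]
loc2 = [22.0, 20.6, 15.4, 17.9, 24.4, 15.6, 11.4, 17.5]

n1, n2 = len(loc1), len(loc2)
ybar1, ybar2 = mean(loc1), mean(loc2)

# Pooled sample variance: S_p^2 = [(n1-1)S1^2 + (n2-1)S2^2] / (n1+n2-2)
sp2 = ((n1 - 1) * variance(loc1) + (n2 - 1) * variance(loc2)) / (n1 + n2 - 2)

# t quantile t_{16, 0.05} (assumed value; R: qt(0.95, 16))
t_quantile = 1.7459

half_width = t_quantile * sqrt(sp2 * (1 / n1 + 1 / n2))
lower = (ybar1 - ybar2) - half_width
upper = (ybar1 - ybar2) + half_width
print(lower, upper)  # close to R's (-1.940438, 7.760438)
```

The point estimate Ȳ₁₊ − Ȳ₂₊ = 21.01 − 18.10 = 2.91 oz sits at the center of the interval, as it must.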

ROBUSTNESS: Some comments are in order about the robustness properties of the two-sample confidence interval

    (Ȳ₁₊ − Ȳ₂₊) ± t_{n₁+n₂−2, α/2} √[Sₚ² (1/n₁ + 1/n₂)]

for the mean difference μ₁ − μ₂.

- We should only use this interval if there is strong evidence that the population variances σ₁² and σ₂² are equal (or at least close). Otherwise, we should use a different interval (coming up).

- Like the one-sample t interval for μ, the two-sample t interval (and the unequal variance version coming up) is robust to normality departures. This means that we can feel comfortable with the interval even if the underlying population distributions are not perfectly normal (Gaussian).

4.8.2 Unequal variance case: σ₁² ≠ σ₂²

REMARK: When σ₁² ≠ σ₂², the problem of constructing a 100(1−α) percent confidence interval for μ₁ − μ₂ becomes more difficult theoretically. However, we can still write an approximate confidence interval.

FORMULA: An approximate 100(1−α) percent confidence interval for μ₁ − μ₂ is given by

    (Ȳ₁₊ − Ȳ₂₊) ± t_{ν, α/2} √(S₁²/n₁ + S₂²/n₂),

where the degrees of freedom ν is calculated as

    ν = (S₁²/n₁ + S₂²/n₂)² / [ (S₁²/n₁)²/(n₁ − 1) + (S₂²/n₂)²/(n₂ − 1) ].

This interval is always approximately valid, as long as

- the two samples are independent
- the two population distributions are approximately normal (Gaussian).

No one in their right mind would calculate this interval by hand (particularly nasty is the formula for ν). R will produce the interval on request.
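To show that the ν formula is at least mechanical, here is a sketch in Python (the notes themselves use R). The inputs are illustrative: the sample variances and sizes for the recycling data of Example 4.13, s₁² ≈ 0.3071923 with n₁ = 25 and s₂² ≈ 1.878411 with n₂ = 20, values that R reports in Example 4.15.

```python
# Welch (unequal-variance) degrees of freedom -- a sketch.
# Illustrative inputs: sample variances/sizes from the recycling data
# (see Examples 4.13 and 4.15).
s1_sq, n1 = 0.3071923, 25
s2_sq, n2 = 1.878411, 20

v1 = s1_sq / n1   # S1^2 / n1
v2 = s2_sq / n2   # S2^2 / n2

# nu = (v1 + v2)^2 / [ v1^2/(n1-1) + v2^2/(n2-1) ]
nu = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
print(nu)  # roughly 24 for these inputs
```

Note that ν is generally not an integer; R's t.test simply uses this fractional value when looking up the t quantile.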

Example 4.13. You are part of a recycling project that is examining how much paper is being discarded (not recycled) by employees at two large plants. These data are obtained on the amount of white paper thrown out per year by employees (data are in hundreds of pounds). Samples of employees at each plant were randomly selected.

Plant 1: 3.01 2.58 3.04 1.75 2.87 2.57 2.51 2.93 2.85 3.09

1.43 3.36 3.18 2.74 2.25 1.95 3.68 2.29 1.86 2.63

2.83 2.04 2.23 1.92 3.02

Plant 2: 3.79 2.08 3.66 1.53 4.07 4.31 2.62 4.52 3.80 5.30

3.41 0.82 3.03 1.95 6.45 1.86 1.87 3.78 2.74 3.81

Question. Are there differences in the mean amounts of white paper discarded by employees at the two plants? Answer this question by finding a 95 percent confidence interval for the mean difference μ₁ − μ₂. Here, μ₁ (μ₂) denotes the population mean amount of white paper discarded per employee at Plant 1 (2).

[Figure: side-by-side boxplots of discarded weight (in 100s lb) for Plant 1 and Plant 2.]

Figure 4.14: Boxplots of discarded white paper amounts (in 100s lb) in Example 4.13.

Solution. In order to visually assess the equal variance assumption, we again use boxplots to display the data in each sample; see Figure 4.14. For these data, the equal variance assumption would be highly suspect; the spread in the distribution of Plant 2 values is much larger than that of Plant 1. We again use R to calculate the (unequal variance) confidence interval for μ₁ − μ₂:

> t.test(plant.1,plant.2,conf.level=0.95,var.equal=FALSE)$conf.int

[1] -1.35825799 -0.01294201

Therefore, a 95 percent confidence interval for the mean difference μ₁ − μ₂ is

    (−1.358, −0.013) hundreds of pounds.

Interpretation: We are 95 percent confident that the mean difference μ₁ − μ₂ is between −135.8 and −1.3 lb. This interval does not include 0. Therefore, we have evidence that the population mean weights (μ₁ and μ₂) at the two plants are different.

REMARK: In this subsection, we have presented two confidence intervals for μ₁ − μ₂. One assumes σ₁² = σ₂² (equal variance assumption) and one assumes σ₁² ≠ σ₂² (unequal variance assumption). If you are unsure about which interval to use, go with the unequal variance interval. The penalty for using it when σ₁² = σ₂² is much smaller than the penalty for using the equal variance interval when σ₁² ≠ σ₂².

4.9 Confidence interval for the difference of two population proportions p₁ − p₂: Independent samples

NOTE: We also can extend our confidence interval procedure for a single population proportion p to two populations. Define

    p₁ = population proportion of successes in Population 1
    p₂ = population proportion of successes in Population 2.

For example, we might want to compare the proportion of

- defective circuit boards for two different suppliers
- satisfied customers before and after a product design change (e.g., Facebook, etc.)
- on-time payments for two classes of customers
- HIV positives for individuals in two demographic classes.

POINT ESTIMATORS: We assume that there are two independent random samples of individuals (one sample from each population to be compared). Define

    Y₁ = number of successes in Sample 1 (out of n₁ individuals) ~ b(n₁, p₁)
    Y₂ = number of successes in Sample 2 (out of n₂ individuals) ~ b(n₂, p₂).

The point estimators for p₁ and p₂ are the sample proportions, defined by

    p̂₁ = Y₁/n₁    and    p̂₂ = Y₂/n₂.

GOAL: We would like to write a 100(1−α) percent confidence interval for p₁ − p₂, the difference of two population proportions.

IMPORTANT: To accomplish this goal, we need the following distributional result. When the sample sizes n₁ and n₂ are large,

    Z = [(p̂₁ − p̂₂) − (p₁ − p₂)] / √[p̂₁(1 − p̂₁)/n₁ + p̂₂(1 − p̂₂)/n₂] ~ AN(0, 1).

If this sampling distribution holds approximately, then

    (p̂₁ − p̂₂) ± z_{α/2} √[p̂₁(1 − p̂₁)/n₁ + p̂₂(1 − p̂₂)/n₂]

is an approximate 100(1−α) percent confidence interval for p₁ − p₂.

For the Z sampling distribution to hold approximately (and therefore for the interval above to be useful), we need

- the two samples to be independent
- the sample sizes n₁ and n₂ to be large; common rules of thumb are to require

    nᵢp̂ᵢ ≥ 5    and    nᵢ(1 − p̂ᵢ) ≥ 5,

for each sample i = 1, 2. Under these conditions, the CLT should adequately approximate the true sampling distribution of Z, thereby making the confidence interval formula above approximately valid.

Note again the form of the interval:

    point estimate ± quantile × standard error
    = (p̂₁ − p̂₂) ± z_{α/2} × √[p̂₁(1 − p̂₁)/n₁ + p̂₂(1 − p̂₂)/n₂].

We interpret the interval in the same way.

    We are 100(1−α) percent confident that the population proportion difference p₁ − p₂ is in this interval.

The value z_{α/2} is the upper α/2 quantile from the N(0, 1) distribution.

Important: In two-sample situations, it is often of interest to see if the proportions p₁ and p₂ are different.

- If the confidence interval for p₁ − p₂ includes 0, this does not suggest that the proportions p₁ and p₂ are different.
- If the confidence interval for p₁ − p₂ does not include 0, it does.

Example 4.14. A programmable lighting control system is being designed. The purpose of the system is to reduce electricity consumption costs in buildings. The system eventually will entail the use of a large number of transceivers (a device comprised of both a transmitter and a receiver). Two types of transceivers are being considered. In life testing, 200 transceivers (randomly selected) were tested for each type.

    Transceiver 1: 20 failures were observed (out of 200)
    Transceiver 2: 14 failures were observed (out of 200).

Question. Define p₁ (p₂) to be the population proportion of Transceiver 1 (Transceiver 2) failures. Write a 95 percent confidence interval for p₁ − p₂. Is there a significant difference between the failure rates p₁ and p₂?

Solution. For 95 percent confidence, we need z_{0.05/2} = z_{0.025} ≈ 1.96. The sample proportions of defective transceivers are

    p̂₁ = 20/200 = 0.10    and    p̂₂ = 14/200 = 0.07.

Therefore, an approximate 95 percent confidence interval for p₁ − p₂ is

    (0.10 − 0.07) ± 1.96 √[0.10(1 − 0.10)/200 + 0.07(1 − 0.07)/200] = (−0.025, 0.085).

Interpretation: We are 95 percent confident that the difference of the population failure rates for the two transceivers is between −0.025 and 0.085. Because this interval includes 0, we do not have sufficient evidence that the two failure rates p₁ and p₂ are different.

CLT approximation check: We have

    n₁p̂₁ = 200(20/200) = 20        n₂p̂₂ = 200(14/200) = 14
    n₁(1 − p̂₁) = 200(1 − 20/200) = 180    n₂(1 − p̂₂) = 200(1 − 14/200) = 186.

All of these quantities are larger than 5 ⟹ we can feel comfortable in using this interval formula.
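The hand calculation above can be reproduced in a few lines. Here is a sketch in Python's standard library (the notes otherwise use R); NormalDist().inv_cdf(0.975) supplies the z quantile rather than the rounded value 1.96.

```python
from math import sqrt
from statistics import NormalDist

# Transceiver failure data from Example 4.14
y1, n1 = 20, 200   # failures / tested, Transceiver 1
y2, n2 = 14, 200   # failures / tested, Transceiver 2

p1_hat, p2_hat = y1 / n1, y2 / n2
z = NormalDist().inv_cdf(0.975)   # upper 0.025 quantile of N(0, 1), about 1.96

# Standard error of p1_hat - p2_hat
se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)

lower = (p1_hat - p2_hat) - z * se
upper = (p1_hat - p2_hat) + z * se
print(round(lower, 3), round(upper, 3))  # -0.025 0.085
```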

4.10 Confidence interval for the ratio of two population variances σ₂²/σ₁²: Independent samples

IMPORTANCE: You will recall that when we wrote a confidence interval for μ₁ − μ₂, the difference of the population means (with independent samples), we proposed two different intervals:

- one interval that assumed σ₁² = σ₂²
- one interval that assumed σ₁² ≠ σ₂².

We now propose a confidence interval procedure that can be used to determine which assumption is more appropriate. This confidence interval is used to compare the population variances in two independent samples.

TWO-SAMPLE PROBLEM: Suppose that we have two independent samples:

    Sample 1: Y₁₁, Y₁₂, ..., Y_{1n₁} ~ N(μ₁, σ₁²) random sample
    Sample 2: Y₂₁, Y₂₂, ..., Y_{2n₂} ~ N(μ₂, σ₂²) random sample.

GOAL: Construct a 100(1−α) percent confidence interval for the ratio of population variances σ₂²/σ₁².

IMPORTANT: To accomplish this, we need the following sampling distribution result:

    Q = (S₁²/σ₁²) / (S₂²/σ₂²) ~ F(n₁ − 1, n₂ − 1),

an F distribution with (numerator) ν₁ = n₁ − 1 and (denominator) ν₂ = n₂ − 1 degrees of freedom.

FACTS: The F distribution has the following characteristics:

- continuous, skewed right, and always positive
- indexed by two degree of freedom parameters ν₁ and ν₂; these are usually integers and are often related to sample sizes
- The F pdf formula is complicated and is unnecessary for our purposes. R will compute probabilities and quantiles from the F distribution.

F R CODE: Suppose that Q ~ F(ν₁, ν₂).

    F_Q(q) = P(Q ≤ q): pf(q, ν₁, ν₂)        quantile for probability p: qf(p, ν₁, ν₂)

NOTATION: Let F_{n₁−1, n₂−1, α/2} and F_{n₁−1, n₂−1, 1−α/2} denote the lower and upper quantiles, respectively, of the F(n₁ − 1, n₂ − 1) distribution; i.e., these values satisfy

    P(Q < F_{n₁−1, n₂−1, α/2}) = α/2
    P(Q > F_{n₁−1, n₂−1, 1−α/2}) = α/2,

respectively. Similar to the χ² distribution, the F distribution is not symmetric. Therefore, different notation is needed to identify the quantiles of F distributions.

[Figure: F probability density functions f(q) for degree of freedom pairs (5, 5), (5, 10), and (10, 10), plotted for q between 0 and 8.]

Figure 4.15: F probability density functions with different degrees of freedom.

DERIVATION: Because

    Q = (S₁²/σ₁²) / (S₂²/σ₂²) ~ F(n₁ − 1, n₂ − 1),

we can write

    1 − α = P( F_{n₁−1, n₂−1, α/2} < Q < F_{n₁−1, n₂−1, 1−α/2} )
          = P( F_{n₁−1, n₂−1, α/2} < (S₁²/σ₁²)/(S₂²/σ₂²) < F_{n₁−1, n₂−1, 1−α/2} )
          = P( (S₂²/S₁²) F_{n₁−1, n₂−1, α/2} < σ₂²/σ₁² < (S₂²/S₁²) F_{n₁−1, n₂−1, 1−α/2} ).

This shows that

    ( (S₂²/S₁²) F_{n₁−1, n₂−1, α/2} , (S₂²/S₁²) F_{n₁−1, n₂−1, 1−α/2} )

is a 100(1−α) percent confidence interval for the ratio of the population variances σ₂²/σ₁². We interpret the interval in the same way.

    We are 100(1−α) percent confident that the ratio σ₂²/σ₁² is in this interval.

- If the confidence interval for σ₂²/σ₁² includes 1, this does not suggest that the variances σ₁² and σ₂² are different.
- If the confidence interval for σ₂²/σ₁² does not include 1, it does.

Example 4.15. We consider again the recycling project in Example 4.13 that examined the amount of white paper discarded per employee at two large plants. The data (presented in Example 4.13) were obtained on the amount of white paper thrown out per year by employees (data are in hundreds of pounds). Samples of employees at each plant (n₁ = 25 and n₂ = 20) were randomly selected. The boxplots in Figure 4.14 did suggest that the population variances may be different.

Question. Find a 95 percent confidence interval for σ₂²/σ₁², the ratio of the population variances. Here, σ₁² (σ₂²) denotes the population variance of the amount of white paper discarded by employees at Plant 1 (Plant 2).

Solution. We use R to get the sample variances s₁² and s₂²:

> var(plant.1) ## sample variance (Plant 1)

[1] 0.3071923

> var(plant.2) ## sample variance (Plant 2)

[1] 1.878411

That is, s₁² ≈ 0.307 and s₂² ≈ 1.878. We also use R to get the necessary F quantiles:

> qf(0.025,24,19) ## F_{24,19,0.025} ## lower 0.025 quantile

[1] 0.4264113

> qf(0.975,24,19) ## F_{24,19,0.975} ## upper 0.025 quantile

[1] 2.452321

The lower bound of a 95 percent confidence interval for σ₂²/σ₁² is

    (s₂²/s₁²) F_{24, 19, 0.025} = (1.878/0.307)(0.4264113) = 2.607.

The upper bound of a 95 percent confidence interval for σ₂²/σ₁² is

    (s₂²/s₁²) F_{24, 19, 0.975} = (1.878/0.307)(2.452321) = 14.995.

[Figure: normal qq-plots (sample quantiles vs. theoretical quantiles) for the Plant 1 and Plant 2 recycle data.]

Figure 4.16: Normal quantile-quantile (qq) plots for employee recycle data for two plants in Example 4.15.

Interpretation: We are 95 percent confident that the ratio of the population variances σ₂²/σ₁² is between 2.607 and 14.995. This interval does not include 1. Therefore, we have sufficient evidence that the population variances (σ₁² and σ₂²) at the two plants are different.
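The bound computation above is just two multiplications; here is a sketch in Python as a cross-check. The four inputs are hardcoded from the R output quoted in this example — the sample variances var(plant.1) and var(plant.2), and the quantiles qf(0.025, 24, 19) and qf(0.975, 24, 19).

```python
# F interval for sigma2^2 / sigma1^2 -- a sketch using values reported
# by R in Example 4.15. All four numbers are hardcoded from that output.
s1_sq = 0.3071923    # sample variance, Plant 1
s2_sq = 1.878411     # sample variance, Plant 2
f_lo = 0.4264113     # F_{24,19,0.025}, lower 0.025 quantile
f_hi = 2.452321      # F_{24,19,0.975}, upper 0.025 quantile

ratio = s2_sq / s1_sq        # s2^2 / s1^2
lower = ratio * f_lo
upper = ratio * f_hi
print(round(lower, 3), round(upper, 3))  # 2.607 14.995
```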

Discussion: This finding supports our use of the unequal-variance interval for μ₁ − μ₂ in Example 4.13. Some statisticians recommend using this equal/unequal variance test before deciding which confidence interval to use for μ₁ − μ₂. Some statisticians (including your authors) do not.

Major warning: Like the χ² interval for a single population variance σ², the two-sample F interval for the ratio of two variances is not robust to departures from normality. If the underlying population distributions are non-normal (non-Gaussian), then this interval should not be used.

Discussion: Figure 4.16 (above) displays the normal qq-plots for data from the two plants. I am somewhat worried about the normal assumption for Plant 2.

4.11 Confidence intervals for the difference of two population means μ₁ − μ₂: Dependent samples (Matched pairs)

Example 4.16. Creatine is an organic acid that helps to supply energy to cells in the body, primarily muscle. Because of this, it is commonly used by those who are weight training to gain muscle mass. Does it really work? Suppose that we are designing an experiment involving USC male undergraduates who exercise/lift weights regularly.

assign 15 students to take creatine.

assign 15 students an innocuous substance that looks like creatine (but has no

positive/negative eect on performance).

For each student, we will record

Y = maximum bench press weight (MBPW).

We will then have two samples of data (with n

1

= 15 and n

2

= 15):

Sample 1 (Creatine): Y

11

, Y

12

, ..., Y

1n

1

Sample 2 (Control): Y

21

, Y

22

, ..., Y

2n

2

.

To compare the population means

1

= population mean MBPW for students taking creatine

2

= population mean MBPW for students not taking creatine,

we could construct a two-sample t condence interval for

1

2

using

(Y

1+

Y

2+

) t

n

1

+n

2

2,/2

S

2

p

_

1

n

1

+

1

n

2

_

PAGE 116

CHAPTER 4 STAT 509, J. TEBBS

or

(Y

1+

Y

2+

) t

,/2

S

2

1

n

1

+

S

2

2

n

2

,

depending on our underlying assumptions about

2

1

and

2

2

.

Design 2 (Matched pairs): Recruit 15 students who are representative of the population of USC male undergraduates who exercise/lift weights.

- Each student will be assigned first to take either creatine or the control substance. For each student, we will then record his value of Y (MBPW).
- After a period of recovery (e.g., 1 week), we will then have each student take the other treatment (creatine/control) and record his value of Y again (but now on the other treatment).
- In other words, for each individual student, we will measure Y under both conditions.

NOTE: In Design 2, because MBPW measurements are taken on the same student, the difference between the measurements (creatine/control) should be less variable than the difference between a creatine measurement on one student and a control measurement on a different student.

- In other words, the student-to-student variation inherent in the latter difference is not present in the difference between MBPW measurements taken on the same individual student.

MATCHED PAIRS: In general, by obtaining a pair of measurements on a single individual (e.g., student, raw material, machine, etc.), where one measurement corresponds to "Treatment 1" and the other measurement corresponds to "Treatment 2," you eliminate variation among the individuals. This allows you to compare the two experimental conditions (e.g., creatine/control, biodegradability treatments, operators, etc.) under more homogeneous conditions where only variation within individuals is present (that is, the variation arising from the difference in the two experimental conditions).

Table 4.1: Creatine example. Sources of variation in the two independent sample and matched pairs designs.

    Design                     Sources of Variation
    Two Independent Samples    among students, within students
    Matched Pairs              within students

ADVANTAGE: When you remove extra variability, this enables you to do a better job at comparing the two experimental conditions (treatments). By "better job," I mean you can more precisely estimate the difference between the treatments (excess variability that naturally arises among individuals is not getting in the way). This gives you a better chance of identifying a difference between the treatments if one really exists.

NOTE: In matched pairs experiments, it is important to randomize the order in which treatments are assigned. This may eliminate common patterns that may be seen when always following, say, Treatment 1 with Treatment 2. In practice, the experimenter could flip a fair coin to determine which treatment is applied first.

IMPLEMENTATION: Data from matched pairs experiments are analyzed by examining the difference in responses of the two treatments. Specifically, compute

    Dⱼ = Y_{1j} − Y_{2j},

for each individual j = 1, 2, ..., n. After doing this, we have essentially created a one-sample problem, where our data are

    D₁, D₂, ..., Dₙ,

the so-called data differences. The one-sample 100(1−α) percent confidence interval

    D̄ ± t_{n−1, α/2} (S_D/√n),

where D̄ and S_D are the sample mean and sample standard deviation of the differences, respectively, is an interval estimate for

    μ_D = mean difference between the 2 treatments.

Table 4.2: Creatine data. Maximum bench press weight (in lbs) for creatine and control treatments with 15 students. Note: These are not real data.

    Student j   Creatine MBPW   Control MBPW   Difference (Dⱼ = Y_{1j} − Y_{2j})
     1              230             200              30
     2              140             155             −15
     3              215             205              10
     4              190             190               0
     5              200             170              30
     6              230             225               5
     7              220             200              20
     8              255             260              −5
     9              220             240             −20
    10              200             195               5
    11               90             110             −20
    12              130             105              25
    13              255             230              25
    14               80              85              −5
    15              265             255              10

INTERPRETATION: The parameter μ_D describes the difference in means for the two treatment groups. If there are no differences between the two treatments, μ_D = 0.

- If the confidence interval for μ_D includes 0, this does not suggest that the two treatment means are different.
- If the confidence interval for μ_D does not include 0, it does.

To analyze the creatine data, let's compute a 95 percent confidence interval for μ_D:

> t.test(diff,conf.level=0.95)$conf.int

[1] -3.227946 15.894612

Interpretation: We are 95 percent confident that the mean difference in MBPW is between −3.2 and 15.9 lb. Because this interval includes 0, this does not suggest that taking creatine leads to a larger mean MBPW.
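The matched-pairs interval is just a one-sample t interval on the differences from Table 4.2, and it can be reproduced without R. Here is a sketch in Python's standard library; the t quantile t_{14, 0.025} ≈ 2.1448 is hardcoded as an assumption (R: qt(0.975, 14)).

```python
from math import sqrt
from statistics import mean, stdev

# Matched-pairs differences D_j = Y_1j - Y_2j from Table 4.2 (creatine - control)
d = [30, -15, 10, 0, 30, 5, 20, -5, -20, 5, -20, 25, 25, -5, 10]

n = len(d)
d_bar = mean(d)    # sample mean of the differences
s_d = stdev(d)     # sample standard deviation of the differences

# t quantile t_{14, 0.025} (assumed value; R: qt(0.975, 14))
t_quantile = 2.1448

half_width = t_quantile * s_d / sqrt(n)
print(d_bar - half_width, d_bar + half_width)  # close to R's (-3.227946, 15.894612)
```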

4.12 One-way analysis of variance

REVIEW: So far in this chapter, we have discussed confidence intervals for a single population mean μ and for the difference of two population means μ₁ − μ₂. When there are two means, we have recently seen that the design of the experiment/study completely determines how the data are to be analyzed.

- When the two samples are independent, this is called a (two) independent-sample design.
- When the two samples are obtained on the same individuals (so that the samples are dependent), this is called a matched pairs design.

Confidence interval procedures for μ₁ − μ₂ depend on the design of the study.

TERMINOLOGY: More generally, the purpose of an experiment is to investigate differences between or among two or more treatments. In a statistical framework, we do this by comparing the population means of the responses to each treatment.

- In order to detect treatment mean differences, we must try to control the effects of error so that any variation we observe can be attributed to the effects of the treatments rather than to differences among the individuals.

BLOCKING: Designs involving meaningful grouping of individuals, that is, blocking, can help reduce the effects of experimental error by identifying systematic components of variation among individuals.

- The matched pairs design for comparing two treatments is an example of such a design. In this situation, individuals themselves are treated as blocks.

The analysis of data from experiments involving blocking will not be covered in this course (see, e.g., STAT 506, STAT 525, and STAT 706). We focus herein on a simpler setting, that is, a one-way classification model. This is an extension of the two independent-sample design to more than two treatments.

ONE-WAY CLASSIFICATION: Consider an experiment to compare t ≥ 2 treatments set up as follows:

- We obtain a random sample of individuals and randomly assign them to treatments. Samples corresponding to the treatment groups are independent (i.e., the individuals in each treatment sample are unrelated).
- In observational studies (where no treatment is physically applied to individuals), individuals are inherently different to begin with. We therefore simply take random samples from each treatment population.
- We do not attempt to group individuals according to some other factor (e.g., location, gender, weight, variety, etc.). This would be an example of blocking.

MAIN POINT: In a one-way classification design, the only way in which individuals are classified is by the treatment group assignment. Hence, such an arrangement is called a one-way classification. When individuals are thought to be basically alike (other than the possible effect of treatment), experimental error consists only of the variation among the individuals themselves; that is, there are no other systematic sources of variation.

Example 4.17. Four types of mortars: (1) ordinary cement mortar (OCM), (2) polymer impregnated mortar (PIM), (3) resin mortar (RM), and (4) polymer cement mortar (PCM), were subjected to a compression test to measure strength (MPa). Here are the strength measurements taken on different mortar specimens (36 in all).

OCM: 51.45 42.96 41.11 48.06 38.27 38.88 42.74 49.62

PIM: 64.97 64.21 57.39 52.79 64.87 53.27 51.24 55.87 61.76 67.15

RM: 48.95 62.41 52.11 60.45 58.07 52.16 61.71 61.06 57.63 56.80

PCM: 35.28 38.59 48.64 50.99 51.52 52.85 46.75 48.31

Side-by-side boxplots of the data are in Figure 4.17.

[Figure: side-by-side boxplots of strength (MPa) for the OCM, PIM, RM, and PCM specimens.]

Figure 4.17: Boxplots of strength measurements (MPa) for four mortar types.

In this example,

- Treatment = mortar type (OCM, PIM, RM, and PCM). There are t = 4 treatment groups.
- Individuals = mortar specimens.
- This is an example of an observational study, not an experiment. That is, we do not physically apply a treatment here; instead, the mortar specimens are inherently different to begin with. We simply take random samples of each mortar type.

QUERY: An initial question that we might have is the following:

    Are the treatment (mortar type) population means equal? Or, are the treatment population means different?

This question can be answered by performing a hypothesis test, that is, by testing

    H₀: μ₁ = μ₂ = μ₃ = μ₄
    versus
    Hₐ: the population means μᵢ are not all equal.

GOAL: We now develop a statistical procedure that allows us to test this type of hypothesis in a one-way classification model.

4.12.1 Overall F test

NOTATION: Let t denote the number of treatments to be compared. Define

    Y_{ij} = response on the jth individual in the ith treatment group

for i = 1, 2, ..., t and j = 1, 2, ..., nᵢ.

- nᵢ is the number of replications for treatment i.
- When n₁ = n₂ = ··· = n_t = n, we say the design is balanced; otherwise, the design is unbalanced.
- Let N = n₁ + n₂ + ··· + n_t denote the total number of individuals measured. If the design is balanced, then N = nt.

Define

    Ȳ_{i+} = (1/nᵢ) Σ_{j=1}^{nᵢ} Y_{ij}
    Sᵢ² = [1/(nᵢ − 1)] Σ_{j=1}^{nᵢ} (Y_{ij} − Ȳ_{i+})²
    Ȳ_{++} = (1/N) Σ_{i=1}^{t} Σ_{j=1}^{nᵢ} Y_{ij}.

The statistics Ȳ_{i+} and Sᵢ² are simply the sample mean and sample variance, respectively, of the ith sample. The statistic Ȳ_{++} is the sample mean of all the data (across all t treatment groups).

STATISTICAL HYPOTHESIS: Our goal is to develop a procedure to test

    H₀: μ₁ = μ₂ = ··· = μ_t
    versus
    Hₐ: the population means μᵢ are not all equal.

- The null hypothesis H₀ says that there is no treatment difference; that is, all treatment population means are the same.
- The alternative hypothesis Hₐ says that a difference among the t population means exists somewhere (but does not specify how the means are different).
- When performing a hypothesis test, the goal is to decide which hypothesis is more supported by the observed data.

ASSUMPTIONS: We have independent random samples from t ≥ 2 normal distributions, each of which has the same variance (but possibly different means):

    Sample 1: Y₁₁, Y₁₂, ..., Y_{1n₁} ~ N(μ₁, σ²)
    Sample 2: Y₂₁, Y₂₂, ..., Y_{2n₂} ~ N(μ₂, σ²)
    ...
    Sample t: Y_{t1}, Y_{t2}, ..., Y_{tn_t} ~ N(μ_t, σ²).

The procedure we develop is formulated by deriving two estimators for σ². These two estimators are formed by (1) looking at the variance of the observations within samples, and (2) looking at the variance of the sample means across the t samples.

WITHIN ESTIMATOR: To estimate $\sigma^2$ within samples, we take a weighted average (weighted by the sample sizes) of the $t$ sample variances; that is, we pool all variance estimates together to form one estimate. Define
$$SS_{\text{res}} = (n_1-1)S_1^2 + (n_2-1)S_2^2 + \cdots + (n_t-1)S_t^2 = \sum_{i=1}^{t}\underbrace{\sum_{j=1}^{n_i}(Y_{ij} - \overline{Y}_{i+})^2}_{(n_i-1)S_i^2}.$$

We call $SS_{\text{res}}$ the residual sum of squares. Mathematics shows that
$$E\left(\frac{SS_{\text{res}}}{\sigma^2}\right) = N - t \implies E(MS_{\text{res}}) = \sigma^2,$$
where
$$MS_{\text{res}} = \frac{SS_{\text{res}}}{N-t}.$$

IMPORTANT: $MS_{\text{res}}$ is an unbiased estimator of $\sigma^2$ regardless of whether or not $H_0$ is true. We call $MS_{\text{res}}$ the residual mean squares.

ACROSS ESTIMATOR: To derive the across-sample estimator, we assume a common sample size $n_1 = n_2 = \cdots = n_t = n$ (to simplify notation). Recall that if a sample arises from a normal population, then the sample mean is also normally distributed, i.e.,
$$\overline{Y}_{i+} \sim \mathcal{N}\left(\mu_i, \frac{\sigma^2}{n}\right).$$

NOTE: If all the treatment population means are equal, that is,
$$H_0: \mu_1 = \mu_2 = \cdots = \mu_t = \mu, \text{ say},$$
is true, then
$$\overline{Y}_{i+} \sim \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right).$$

If $H_0$ is true, then the $t$ sample means $\overline{Y}_{1+}, \overline{Y}_{2+}, ..., \overline{Y}_{t+}$ are a random sample of size $t$ from a normal distribution with mean $\mu$ and variance $\sigma^2/n$. The sample variance of this random sample is given by
$$\frac{1}{t-1}\sum_{i=1}^{t}(\overline{Y}_{i+} - \overline{Y}_{++})^2$$
and has expectation
$$E\left[\frac{1}{t-1}\sum_{i=1}^{t}(\overline{Y}_{i+} - \overline{Y}_{++})^2\right] = \frac{\sigma^2}{n}.$$

Therefore,
$$MS_{\text{trt}} = \frac{1}{t-1}\underbrace{\sum_{i=1}^{t} n(\overline{Y}_{i+} - \overline{Y}_{++})^2}_{SS_{\text{trt}}}$$
is an unbiased estimator of $\sigma^2$; i.e., $E(MS_{\text{trt}}) = \sigma^2$, when $H_0$ is true.

TERMINOLOGY: We call $SS_{\text{trt}}$ the treatment sums of squares and $MS_{\text{trt}}$ the treatment mean squares. $MS_{\text{trt}}$ is our second point estimator for $\sigma^2$. Recall that $MS_{\text{trt}}$ is an unbiased estimator of $\sigma^2$ only when $H_0: \mu_1 = \mu_2 = \cdots = \mu_t$ is true (this is important!). If we have different sample sizes, we simply adjust $MS_{\text{trt}}$ to
$$MS_{\text{trt}} = \frac{1}{t-1}\underbrace{\sum_{i=1}^{t} n_i(\overline{Y}_{i+} - \overline{Y}_{++})^2}_{SS_{\text{trt}}}.$$
This is still an unbiased estimator for $\sigma^2$ when $H_0$ is true.

MOTIVATION: When $H_0$ is true (i.e., the treatment means are the same), then
$$E(MS_{\text{trt}}) = \sigma^2 \qquad E(MS_{\text{res}}) = \sigma^2.$$
These two facts suggest that when $H_0$ is true,
$$F = \frac{MS_{\text{trt}}}{MS_{\text{res}}} \approx 1.$$
When $H_0$ is not true (i.e., the treatment means are different), then
$$E(MS_{\text{trt}}) > \sigma^2 \qquad E(MS_{\text{res}}) = \sigma^2.$$
These two facts suggest that when $H_0$ is not true,
$$F = \frac{MS_{\text{trt}}}{MS_{\text{res}}} > 1.$$

SAMPLING DISTRIBUTION: When $H_0$ is true, the F statistic
$$F = \frac{MS_{\text{trt}}}{MS_{\text{res}}} \sim F(t-1, N-t).$$

DECISION: We reject $H_0$ and conclude the treatment population means are different if the F statistic is far out in the right tail of the $F(t-1, N-t)$ distribution. Why? Because a large value of $F$ is not consistent with $H_0$ being true! Large values of $F$ (far out in the right tail) are more consistent with $H_a$.

Figure 4.18: The $F(3, 32)$ probability density function. This is the distribution of $F$ in Example 4.17 if $H_0$ is true. A mark at $F = 16.848$ has been added.

MORTAR DATA: We now use R to calculate the F statistic for the strength/mortar type data in Example 4.17.

> anova(lm(strength ~ mortar.type))
Analysis of Variance Table

            Df  Sum Sq Mean Sq F value    Pr(>F)
mortar.type  3 1520.88  506.96  16.848 9.576e-07 ***
Residuals   32  962.86   30.09

CONCLUSION: $F = 16.848$ is not an observation we would expect from the $F(3, 32)$ distribution (the distribution of $F$ when $H_0$ is true); see Figure 4.18. Therefore, we reject $H_0$ and conclude the population mean strengths for the four mortar types are different. In other words, the evidence from the data suggests that $H_a$ is true; not $H_0$.

TERMINOLOGY: As we have just seen (from the recent R analysis), it is common to display one-way classification results in an ANOVA table. The form of the ANOVA table for the one-way classification is given below:

Source       df     SS          MS                                   F
Treatments   t-1    SS_trt      MS_trt = SS_trt/(t-1)                F = MS_trt/MS_res
Residuals    N-t    SS_res      MS_res = SS_res/(N-t)
Total        N-1    SS_total

It is easy to show that
$$SS_{\text{total}} = SS_{\text{trt}} + SS_{\text{res}}.$$

- $SS_{\text{total}}$ measures how observations vary about the overall mean, without regard to treatments; that is, it measures the total variation in all the data. $SS_{\text{total}}$ can be partitioned into two components:
  - $SS_{\text{trt}}$ measures how much of the total variation is due to the treatments.
  - $SS_{\text{res}}$ measures what is left over, which we attribute to inherent variation among the individuals.
- Degrees of freedom (df) add down.
- Mean squares (MS) are formed by dividing sums of squares by the corresponding degrees of freedom.
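As a quick arithmetic check of how the table entries fit together, here is a short Python sketch (the notes do everything in R; the numbers below are simply the Df and Sum Sq values reported by anova() above):

```python
# Sums of squares and degrees of freedom from the mortar ANOVA table
ss_trt, df_trt = 1520.88, 3     # treatments: t - 1 = 3
ss_res, df_res = 962.86, 32     # residuals:  N - t = 32

# Mean squares: divide each SS by its degrees of freedom
ms_trt = ss_trt / df_trt
ms_res = ss_res / df_res

# Overall F statistic
F = ms_trt / ms_res

# SS and df "add down" to the Total row
ss_total = ss_trt + ss_res      # SS_total = SS_trt + SS_res
df_total = df_trt + df_res      # N - 1 = 35

print(round(ms_trt, 2), round(ms_res, 2), round(F, 3))  # 506.96 30.09 16.848
```

This reproduces the Mean Sq and F value columns of the R output exactly.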

TERMINOLOGY: The probability value (p-value) for a hypothesis test measures how much evidence we have against $H_0$. It is important to remember the following:
$$\text{the smaller the p-value} \implies \text{the more evidence against } H_0.$$

MORTAR DATA: For the strength/mortar type data in Example 4.17 (from the R output), we see that
$$\text{p-value} = 0.0000009576.$$

This is obviously quite small, which suggests that we have an enormous amount of evidence against $H_0$.

- In this example, this p-value is calculated as the area to the right of $F = 16.848$ on the $F(3, 32)$ probability density function.
- Therefore, this probability is interpreted as follows: If $H_0$ is true, this is the probability that we would get a test statistic equal to or larger than $F = 16.848$. Since this is extremely unlikely (p-value = 0.0000009576), this strongly suggests that $H_0$ is not true.
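To see concretely that the p-value really is a right-tail area, we can integrate the $F(3, 32)$ density numerically. This is only a sketch in Python using the standard library (in R the same area comes directly from pf()); the integration limits and step count are choices of the sketch, not part of the notes:

```python
import math

def f_pdf(x, d1, d2):
    """Density of the F(d1, d2) distribution at x > 0."""
    const = (math.gamma((d1 + d2) / 2)
             / (math.gamma(d1 / 2) * math.gamma(d2 / 2))
             * (d1 / d2) ** (d1 / 2))
    return const * x ** (d1 / 2 - 1) * (1 + d1 * x / d2) ** (-(d1 + d2) / 2)

def right_tail(f_obs, d1, d2, upper=400.0, steps=50_000):
    """Trapezoid-rule approximation of P(F >= f_obs); the density is
    negligible beyond `upper` for these degrees of freedom."""
    h = (upper - f_obs) / steps
    total = 0.5 * (f_pdf(f_obs, d1, d2) + f_pdf(upper, d1, d2))
    for k in range(1, steps):
        total += f_pdf(f_obs + k * h, d1, d2)
    return total * h

p_value = right_tail(16.848, 3, 32)
print(p_value)  # approximately 9.6e-07, matching Pr(>F) = 9.576e-07 in the R output
```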

P-VALUE RULES: Probability values are used in more general hypothesis test settings in statistics (not just in one-way classification).

Q: How low does a p-value have to get before we reject $H_0$?
A: Unfortunately, there is no right answer to this question. What is commonly done is the following.

- First choose a significance level $\alpha$ that is small. This represents the probability that we will reject a true $H_0$, that is,
$$\alpha = P(\text{Reject } H_0 \mid H_0 \text{ true}).$$
Common values of $\alpha$ chosen beforehand are $\alpha = 0.10$, $\alpha = 0.05$ (the most common), and $\alpha = 0.01$.
- The smaller $\alpha$ is chosen to be, the more evidence one requires to reject $H_0$. This is a true statement because of the following well-known decision rule:
$$\text{p-value} < \alpha \implies \text{reject } H_0.$$
- Therefore, the value of $\alpha$ chosen by the experimenter determines how low the p-value must get before $H_0$ is ultimately rejected.
- For the strength/mortar type data, there is no ambiguity in our decision. For other situations (e.g., p-value = 0.052), the decision may not be as clear cut.


4.12.2 Follow-up analysis: Tukey pairwise confidence intervals

RECALL: In a one-way classification, the overall F test is used to test the hypotheses:
$$H_0: \mu_1 = \mu_2 = \cdots = \mu_t \quad \text{versus} \quad H_a: \text{the population means } \mu_i \text{ are not all equal}.$$

QUESTION: If we do reject $H_0$, have we really learned anything that is all that relevant? All we have learned is that at least one of the population treatment means is different. We have no idea which one(s) or how many. In this light, rejecting $H_0$ is largely an uninformative conclusion.

FOLLOW-UP ANALYSES: If $H_0$ is rejected, that is, we conclude at least one population treatment mean $\mu_i$ is different, the obvious game becomes determining which one(s) and how they are different. To do this, we will construct Tukey pairwise confidence intervals for all population treatment mean differences $\mu_i - \mu_{i'}$, $1 \leq i < i' \leq t$. If there are $t$ treatments, then there are
$$\binom{t}{2} = \frac{t(t-1)}{2}$$
pairwise confidence intervals to construct. For example, with the strength/mortar type example, there are $t = 4$ treatments and 6 pairwise intervals:
$$\mu_1 - \mu_2, \quad \mu_1 - \mu_3, \quad \mu_1 - \mu_4, \quad \mu_2 - \mu_3, \quad \mu_2 - \mu_4, \quad \mu_3 - \mu_4,$$
where

- $\mu_1$ = population mean strength for mortar type OCM
- $\mu_2$ = population mean strength for mortar type PIM
- $\mu_3$ = population mean strength for mortar type RM
- $\mu_4$ = population mean strength for mortar type PCM.

PROBLEM: If we construct multiple confidence intervals (here, 6 of them), and if we construct each one at the $100(1-\alpha)$ percent confidence level, then the overall confidence level for all 6 intervals will be less than $100(1-\alpha)$ percent. In statistics, this is known as the multiple comparisons problem.

GOAL: Construct confidence intervals for all pairwise differences $\mu_i - \mu_{i'}$, $1 \leq i < i' \leq t$, and have our overall confidence level still be at $100(1-\alpha)$ percent.

SOLUTION: Simply increase the confidence level associated with each individual interval! Tukey's method is designed to do this. Even better, R automates the construction of Tukey's intervals. The intervals are of the form:
$$(\overline{Y}_{i+} - \overline{Y}_{i'+}) \pm q_{t,N-t,\alpha}\sqrt{MS_{\text{res}}\left(\frac{1}{n_i} + \frac{1}{n_{i'}}\right)},$$
where $q_{t,N-t,\alpha}$ is the Tukey quantile which gives an overall confidence level of $100(1-\alpha)$ percent (overall for the set of all possible pairwise intervals).
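The interval formula above is easy to evaluate once the Tukey quantile is in hand. The Python sketch below is illustrative only: the quantile value q and the sample sizes are hypothetical placeholders, not the values R used for the mortar data (the design there is unbalanced, and R looks up the correct quantile internally); only MS_res = 30.09 is taken from the mortar ANOVA table.

```python
import math

def tukey_interval(ybar_i, ybar_ip, q, ms_res, n_i, n_ip):
    """Tukey pairwise interval for mu_i - mu_i', following the formula above
    (q is the Tukey quantile for the chosen overall confidence level)."""
    diff = ybar_i - ybar_ip
    margin = q * math.sqrt(ms_res * (1.0 / n_i + 1.0 / n_ip))
    return diff - margin, diff + margin

# Hypothetical illustration: sample means 60.0 and 57.5, q = 2.9, n_i = n_i' = 9
# are placeholders chosen for this sketch.
lwr, upr = tukey_interval(60.0, 57.5, q=2.9, ms_res=30.09, n_i=9, n_ip=9)
```

If the resulting interval covers 0, that pair of means would be declared not different.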

MORTAR DATA: For the strength/mortar type data in Example 4.17, the R output below gives all pairwise intervals. Note that the overall confidence level is 95 percent.

> TukeyHSD(aov(lm(strength ~ mortar.type)),conf.level=0.95)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = lm(strength ~ mortar.type))

$mortar.type
            diff       lwr       upr     p adj
PCM-OCM  2.48000 -4.950955  9.910955 0.8026758
PIM-OCM 15.21575  8.166127 22.265373 0.0000097
RM-OCM  12.99875  5.949127 20.048373 0.0001138
PIM-PCM 12.73575  5.686127 19.785373 0.0001522
RM-PCM  10.51875  3.469127 17.568373 0.0016850
RM-PIM  -2.21700 -8.863448  4.429448 0.8029266

ANALYSIS: In the R output, the columns labeled lwr and upr give, respectively, the lower and upper limits of the pairwise confidence intervals.

- We are (at least) 95 percent confident that the difference between the population mean strengths for the PCM and OCM mortars is between $-4.95$ and $9.91$ MPa.
  - Note that this confidence interval includes 0, which suggests that these two population means are not different.
  - An equivalent finding is that the adjusted p-value, given in the p adj column, is large.
- We are (at least) 95 percent confident that the difference between the population mean strengths for the PIM and OCM mortars is between 8.17 and 22.27 MPa.
  - Note that this confidence interval does not include 0, which suggests that these two population means are different.
  - An equivalent finding is that the adjusted p-value, given in the p adj column, is small.
- Interpretations for the remaining 4 confidence intervals are formed similarly. The main point is this:
  - If a pairwise confidence interval (for two population means) includes 0, then these population means are declared not to be different.
  - If a pairwise interval does not include 0, then the population means are declared to be different.
  - The conclusions we make for all possible pairwise comparisons are at the $100(1-\alpha)$ percent confidence level.
- Therefore, for the strength/mortar type data, the following pairs of population means are declared to be different: PIM-OCM, RM-OCM, PIM-PCM, RM-PCM. The following pairs of population means are declared to be not different: PCM-OCM, RM-PIM.

CHAPTER 6 STAT 509, J. TEBBS

6 Linear regression

Complementary reading: Chapter 6 (VK); Sections 6.1-6.4.

6.1 Introduction

IMPORTANCE: A problem that arises in engineering, economics, medicine, and other areas is that of investigating the relationship between two (or more) variables. In such settings, the goal is to model a continuous random variable $Y$ as a function of one or more independent variables, say, $x_1, x_2, ..., x_k$. Mathematically, we can express this model as
$$Y = g(x_1, x_2, ..., x_k) + \epsilon,$$
where $g: \mathbb{R}^k \to \mathbb{R}$. This is called a regression model.

The presence of the (random) error $\epsilon$ conveys the fact that the relationship between the dependent variable $Y$ and the independent variables $x_1, x_2, ..., x_k$ through $g$ is not deterministic. Instead, the error term $\epsilon$ absorbs all variation in $Y$ that is not explained by $g(x_1, x_2, ..., x_k)$.

LINEAR MODELS: In this course, we will consider models of the form
$$Y = \underbrace{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k}_{g(x_1, x_2, ..., x_k)} + \epsilon,$$
that is, $g$ is a linear function of $\beta_0, \beta_1, ..., \beta_k$. We call this a linear regression model.

- The response variable $Y$ is assumed to be random (but we do get to observe its value).
- The regression parameters $\beta_0, \beta_1, ..., \beta_k$ are assumed to be fixed and unknown.
- The independent variables $x_1, x_2, ..., x_k$ are assumed to be fixed (not random).
- The error term $\epsilon$ is assumed to be random (and not observed).

DESCRIPTION: More precisely, we call a regression model a linear regression model if the regression parameters enter the $g$ function in a linear fashion. For example, each of the models below is a linear regression model:
$$Y = \underbrace{\beta_0 + \beta_1 x}_{g(x)} + \epsilon$$
$$Y = \underbrace{\beta_0 + \beta_1 x + \beta_2 x^2}_{g(x)} + \epsilon$$
$$Y = \underbrace{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2}_{g(x_1, x_2)} + \epsilon.$$

Main point: The term "linear" does not refer to the shape of the regression function $g$. It refers to how the regression parameters $\beta_0, \beta_1, ..., \beta_k$ enter the $g$ function.

6.2 Simple linear regression

TERMINOLOGY: A simple linear regression model includes only one independent variable $x$ and is of the form
$$Y = \beta_0 + \beta_1 x + \epsilon.$$
The regression function
$$g(x) = \beta_0 + \beta_1 x$$
is a straight line with intercept $\beta_0$ and slope $\beta_1$. If $E(\epsilon) = 0$, then
$$E(Y) = E(\beta_0 + \beta_1 x + \epsilon) = \beta_0 + \beta_1 x + E(\epsilon) = \beta_0 + \beta_1 x.$$
Therefore, we have these interpretations for the regression parameters $\beta_0$ and $\beta_1$:

- $\beta_0$ quantifies the mean of $Y$ when $x = 0$.
- $\beta_1$ quantifies the change in $E(Y)$ brought about by a one-unit change in $x$.

Example 6.1. As part of a waste removal project, a new compression machine for processing sewage sludge is being studied. In particular, engineers are interested in the following variables:

- $Y$ = moisture content of compressed pellets (measured as a percent)
- $x$ = machine filtration rate (kg-DS/m/hr).

Engineers collect $n = 20$ observations of $(x, Y)$; the data are given below.

Obs      x     Y    Obs      x     Y
  1  125.3  77.9     11  159.5  79.9
  2   98.2  76.8     12  145.8  79.0
  3  201.4  81.5     13   75.1  76.7
  4  147.3  79.8     14  151.4  78.2
  5  145.9  78.2     15  144.2  79.5
  6  124.7  78.3     16  125.0  78.1
  7  112.2  77.5     17  198.8  81.5
  8  120.2  77.0     18  132.5  77.0
  9  161.2  80.1     19  159.6  79.0
 10  178.9  80.2     20  110.7  78.6

Table 6.1: Sewage data. Moisture ($Y$, measured as a percentage) and machine filtration rate ($x$, measured in kg-DS/m/hr). There are $n = 20$ observations.

Figure 6.1 displays the data in a scatterplot. This is the most common graphical display for bivariate data like those seen above. From the plot, we see that

- the variables $Y$ and $x$ are positively related, that is, an increase in $x$ tends to be associated with an increase in $Y$.
- the variables $Y$ and $x$ are linearly related, although there is a large amount of variation that is unexplained.
- this is an example where a simple linear regression model may be adequate.

Figure 6.1: Scatterplot of pellet moisture $Y$ (measured as a percentage) as a function of machine filtration rate $x$ (measured in kg-DS/m/hr).

6.2.1 Least squares estimation

TERMINOLOGY: When we say "fit a regression model," we mean that we would like to estimate the regression parameters in the model with the observed data. Suppose that we collect $(x_i, Y_i)$, $i = 1, 2, ..., n$, and postulate the simple linear regression model
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i,$$
for each $i = 1, 2, ..., n$. Our first goal is to estimate $\beta_0$ and $\beta_1$. Formal assumptions for the error terms $\epsilon_i$ will be given later.

LEAST SQUARES: A widely-accepted method of estimating the model parameters $\beta_0$ and $\beta_1$ is least squares. The method of least squares says to choose the values of $\beta_0$ and $\beta_1$ that minimize
$$Q(\beta_0, \beta_1) = \sum_{i=1}^{n} [Y_i - (\beta_0 + \beta_1 x_i)]^2.$$
Denote the least squares estimators by $b_0$ and $b_1$, respectively, that is, the values of $\beta_0$ and $\beta_1$ that minimize $Q(\beta_0, \beta_1)$. A two-variable minimization argument can be used to find closed-form expressions for $b_0$ and $b_1$. Taking partial derivatives of $Q(\beta_0, \beta_1)$, we obtain
$$\frac{\partial Q(\beta_0, \beta_1)}{\partial \beta_0} = -2\sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 x_i) \overset{\text{set}}{=} 0$$
$$\frac{\partial Q(\beta_0, \beta_1)}{\partial \beta_1} = -2\sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 x_i)x_i \overset{\text{set}}{=} 0.$$
Solving for $\beta_0$ and $\beta_1$ gives the least squares estimators
$$b_0 = \overline{Y} - b_1\overline{x}$$
$$b_1 = \frac{\sum_{i=1}^{n}(x_i - \overline{x})(Y_i - \overline{Y})}{\sum_{i=1}^{n}(x_i - \overline{x})^2} = \frac{SS_{xy}}{SS_{xx}}.$$
In real life, it is rarely necessary to calculate $b_0$ and $b_1$ by hand, although VK (Example 6.4, pp 379) does give an example of hand calculation. R automates the entire model fitting process and subsequent analysis.
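Although R does this for us, the closed-form expressions for $b_0$ and $b_1$ are easy to verify directly. Here is a Python sketch that applies them to the $n = 20$ sewage observations from Table 6.1 (the notes fit this model in R with lm()):

```python
# Sewage data from Table 6.1: filtration rate x and moisture percentage Y
x = [125.3, 98.2, 201.4, 147.3, 145.9, 124.7, 112.2, 120.2, 161.2, 178.9,
     159.5, 145.8, 75.1, 151.4, 144.2, 125.0, 198.8, 132.5, 159.6, 110.7]
y = [77.9, 76.8, 81.5, 79.8, 78.2, 78.3, 77.5, 77.0, 80.1, 80.2,
     79.9, 79.0, 76.7, 78.2, 79.5, 78.1, 81.5, 77.0, 79.0, 78.6]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# SS_xy and SS_xx, as in the least squares formulas above
ss_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
ss_xx = sum((xi - xbar) ** 2 for xi in x)

b1 = ss_xy / ss_xx       # slope estimate
b0 = ybar - b1 * xbar    # intercept estimate

print(round(b0, 5), round(b1, 5))  # 72.95855 0.04103, matching the R output
```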

Example 6.1 (continued). We now use R to calculate the equation of the least squares regression line for the sewage sludge data in Example 6.1. Here is the output:

> fit = lm(moisture~filtration.rate)
> fit

lm(formula = moisture ~ filtration.rate)

Coefficients:
    (Intercept)  filtration.rate
       72.95855          0.04103

From the output, we see the least squares estimates (to 3 dp) for the sewage data are
$$b_0 = 72.959 \qquad b_1 = 0.041.$$

Figure 6.2: Scatterplot of pellet moisture $Y$ (measured as a percentage) as a function of filtration rate $x$ (measured in kg-DS/m/hr). The least squares line has been added.

Therefore, the equation of the least squares line that relates moisture percentage $Y$ to the filtration rate $x$ is
$$\widehat{Y} = 72.959 + 0.041x,$$
or, in other words,
$$\widehat{\text{moisture}} = 72.959 + 0.041(\text{filtration rate}).$$

NOTE: Your authors call the least squares line the prediction equation. This is because we can predict the value of $Y$ (moisture) for any value of $x$ (filtration rate). For example, when the filtration rate is $x = 150$ kg-DS/m/hr, we would predict the moisture percentage to be
$$\widehat{Y} = 72.959 + 0.041(150) \approx 79.11.$$

6.2.2 Model assumptions and properties of least squares estimators

INTEREST: We wish to investigate the properties of $b_0$ and $b_1$ as estimators of the true regression parameters $\beta_0$ and $\beta_1$ in the simple linear regression model
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i,$$
for $i = 1, 2, ..., n$. To do this, we need assumptions on the error terms $\epsilon_i$. Specifically, we will assume throughout that

- $E(\epsilon_i) = 0$, for $i = 1, 2, ..., n$
- $\text{var}(\epsilon_i) = \sigma^2$, for $i = 1, 2, ..., n$, that is, the variance is constant
- the random variables $\epsilon_i$ are independent
- the random variables $\epsilon_i$ are normally distributed.

IMPLICATION: Under these assumptions,
$$Y_i \sim \mathcal{N}(\beta_0 + \beta_1 x_i, \sigma^2).$$

Fact 1. The least squares estimators $b_0$ and $b_1$ are unbiased estimators of $\beta_0$ and $\beta_1$, respectively, that is,
$$E(b_0) = \beta_0 \qquad E(b_1) = \beta_1.$$

Fact 2. The least squares estimators $b_0$ and $b_1$ have the following sampling distributions:
$$b_0 \sim \mathcal{N}(\beta_0, c_{00}\sigma^2) \quad \text{and} \quad b_1 \sim \mathcal{N}(\beta_1, c_{11}\sigma^2),$$
where
$$c_{00} = \frac{1}{n} + \frac{\overline{x}^2}{SS_{xx}} \quad \text{and} \quad c_{11} = \frac{1}{SS_{xx}}.$$
Knowing these sampling distributions is critical if we want to write confidence intervals and perform hypothesis tests for $\beta_0$ and $\beta_1$.

6.2.3 Estimating the error variance

GOAL: In the simple linear regression model
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i,$$
where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, we now turn our attention to estimating $\sigma^2$, the error variance.

TERMINOLOGY: In the simple linear regression model, define the $i$th fitted value by
$$\widehat{Y}_i = b_0 + b_1 x_i,$$
where $b_0$ and $b_1$ are the least squares estimators. Each observation has its own fitted value. Geometrically, an observation's fitted value is the (perpendicular) projection of its $Y$ value, upward or downward, onto the least squares line.

TERMINOLOGY: We define the $i$th residual by
$$e_i = Y_i - \widehat{Y}_i.$$
Each observation has its own residual. Geometrically, an observation's residual is the vertical distance (i.e., length) between its $Y$ value and its fitted value.

- If an observation's $Y$ value is above the least squares regression line, its residual is positive.
- If an observation's $Y$ value is below the least squares regression line, its residual is negative.

INTERESTING: In the simple linear regression model (provided that the model includes an intercept term $\beta_0$), we have the following algebraic result:
$$\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} (Y_i - \widehat{Y}_i) = 0,$$
that is, the sum of the residuals (from a least squares fit) is equal to zero.

Obs      x     Y    Yhat       e      Obs      x     Y    Yhat       e
  1  125.3  77.9  78.100  -0.200      11  159.5  79.9  79.503   0.397
  2   98.2  76.8  76.988  -0.188      12  145.8  79.0  78.941   0.059
  3  201.4  81.5  81.223   0.277      13   75.1  76.7  76.040   0.660
  4  147.3  79.8  79.003   0.797      14  151.4  78.2  79.171  -0.971
  5  145.9  78.2  78.945  -0.745      15  144.2  79.5  78.876   0.624
  6  124.7  78.3  78.075   0.225      16  125.0  78.1  78.088   0.012
  7  112.2  77.5  77.563  -0.062      17  198.8  81.5  81.116   0.384
  8  120.2  77.0  77.891  -0.891      18  132.5  77.0  78.396  -1.396
  9  161.2  80.1  79.573   0.527      19  159.6  79.0  79.508  -0.508
 10  178.9  80.2  80.299  -0.099      20  110.7  78.6  77.501   1.099

Here $\text{Yhat} = \widehat{Y} = b_0 + b_1 x$ and $e = Y - \widehat{Y}$.

Table 6.2: Sewage data. Fitted values and residuals from the least squares fit.

SEWAGE DATA: In Table 6.2, I have used R to calculate the fitted values and residuals for each of the $n = 20$ observations in the sewage sludge data set.

TERMINOLOGY: We define the residual sum of squares by
$$SS_{\text{res}} \equiv \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \widehat{Y}_i)^2.$$

Fact 3. In the simple linear regression model,
$$MS_{\text{res}} = \frac{SS_{\text{res}}}{n-2}$$
is an unbiased estimator of $\sigma^2$, that is, $E(MS_{\text{res}}) = \sigma^2$. The quantity
$$\widehat{\sigma} = \sqrt{MS_{\text{res}}} = \sqrt{\frac{SS_{\text{res}}}{n-2}}$$
estimates $\sigma$ and is called the residual standard error.

SEWAGE DATA: For the sewage data in Example 6.1, we use R to calculate $MS_{\text{res}}$:

> fitted.values = predict(fit)
> residuals = moisture-fitted.values
> # Calculate MS_res
> sum(residuals^2)/18
[1] 0.4426659

For the sewage data, an (unbiased) estimate of the error variance $\sigma^2$ is
$$MS_{\text{res}} \approx 0.443.$$
The residual standard error is
$$\widehat{\sigma} = \sqrt{MS_{\text{res}}} = \sqrt{0.4426659} \approx 0.6653.$$
This estimate can also be seen in the following R output:

> summary(fit)
lm(formula = moisture ~ filtration.rate)

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     72.958547   0.697528 104.596  < 2e-16 ***
filtration.rate  0.041034   0.004837   8.484 1.05e-07 ***

Residual standard error: 0.6653 on 18 degrees of freedom
Multiple R-squared: 0.7999, Adjusted R-squared: 0.7888
F-statistic: 71.97 on 1 and 18 DF, p-value: 1.052e-07
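These quantities can also be checked by hand. The Python sketch below refits the line to the Table 6.1 data and computes $SS_{\text{res}}$, $MS_{\text{res}}$, and the residual standard error (the notes obtain these with predict() and summary() in R):

```python
import math

# Sewage data from Table 6.1
x = [125.3, 98.2, 201.4, 147.3, 145.9, 124.7, 112.2, 120.2, 161.2, 178.9,
     159.5, 145.8, 75.1, 151.4, 144.2, 125.0, 198.8, 132.5, 159.6, 110.7]
y = [77.9, 76.8, 81.5, 79.8, 78.2, 78.3, 77.5, 77.0, 80.1, 80.2,
     79.9, 79.0, 76.7, 78.2, 79.5, 78.1, 81.5, 77.0, 79.0, 78.6]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
ss_xx = sum((xi - xbar) ** 2 for xi in x)
ss_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = ss_xy / ss_xx
b0 = ybar - b1 * xbar

# Residuals e_i = Y_i - (b0 + b1*x_i), then SS_res = sum of squared residuals
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
ss_res = sum(e ** 2 for e in residuals)

ms_res = ss_res / (n - 2)        # unbiased estimator of sigma^2
sigma_hat = math.sqrt(ms_res)    # residual standard error

print(round(ms_res, 4), round(sigma_hat, 4))  # 0.4427 0.6653
```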

6.2.4 Inference for $\beta_0$ and $\beta_1$

INTEREST: In the simple linear regression model
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i,$$
the regression parameters $\beta_0$ and $\beta_1$ are unknown. It is therefore of interest to construct confidence intervals and perform hypothesis tests for these parameters.

- In practice, inference for the slope parameter $\beta_1$ is of primary interest because of its connection to the independent variable $x$ in the model.
- Inference for $\beta_0$ is less meaningful, unless one is explicitly interested in the mean of $Y$ when $x = 0$. We will not pursue this.

CONFIDENCE INTERVAL FOR $\beta_1$: Under our model assumptions, the following sampling distribution arises:
$$t = \frac{b_1 - \beta_1}{\sqrt{MS_{\text{res}}/SS_{xx}}} \sim t(n-2).$$
This result can be used to derive a $100(1-\alpha)$ percent confidence interval for $\beta_1$, which is given by
$$b_1 \pm t_{n-2,\alpha/2}\sqrt{MS_{\text{res}}/SS_{xx}}.$$
The value $t_{n-2,\alpha/2}$ is the upper $\alpha/2$ quantile from the $t(n-2)$ distribution.

- Note the form of the interval:
$$\underbrace{b_1}_{\text{point estimate}} \pm \underbrace{t_{n-2,\alpha/2}}_{\text{quantile}} \times \underbrace{\sqrt{MS_{\text{res}}/SS_{xx}}}_{\text{standard error}}.$$
- We interpret the interval in the same way: "We are $100(1-\alpha)$ percent confident that the population regression slope $\beta_1$ is in this interval."
- When interpreting the interval, of particular interest to us is the value $\beta_1 = 0$.
  - If $\beta_1 = 0$ is in the confidence interval, this suggests that $Y$ and $x$ are not linearly related.
  - If $\beta_1 = 0$ is not in the confidence interval, this suggests that $Y$ and $x$ are linearly related.

HYPOTHESIS TEST FOR $\beta_1$: If our interest was to test
$$H_0: \beta_1 = \beta_{1,0} \quad \text{versus} \quad H_a: \beta_1 \neq \beta_{1,0},$$
where $\beta_{1,0}$ is a fixed value (often, $\beta_{1,0} = 0$), we would focus our attention on
$$t = \frac{b_1 - \beta_{1,0}}{\sqrt{MS_{\text{res}}/SS_{xx}}}.$$

REASONING: If $H_0: \beta_1 = \beta_{1,0}$ is true, then $t$ arises from a $t(n-2)$ distribution. We can therefore judge the amount of evidence against $H_0$ by comparing $t$ to this distribution. R automatically calculates $t$ and produces a p-value for the test above. Remember that small p-values are evidence against $H_0$.

Example 6.1 (continued). We now use R to test
$$H_0: \beta_1 = 0 \quad \text{versus} \quad H_a: \beta_1 \neq 0,$$
for the sewage sludge data in Example 6.1. Note that $\beta_{1,0} = 0$.

> summary(fit)
lm(formula = moisture ~ filtration.rate)

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     72.958547   0.697528 104.596  < 2e-16 ***
filtration.rate  0.041034   0.004837   8.484 1.05e-07 ***

ANALYSIS: Figure 6.3 shows the $t(18)$ distribution, that is, the distribution of $t$ when $H_0: \beta_1 = 0$ is true for the sewage sludge example. Clearly, $t = 8.484$ is not an expected outcome from this distribution (p-value = 0.000000105)!! In other words, there is strong evidence that the moisture percentage is linearly related to machine filtration rate.

ANALYSIS: A 95 percent confidence interval for $\beta_1$ is calculated as follows:
$$b_1 \pm t_{18,0.025}\,\text{se}(b_1) = 0.0410 \pm 2.1009(0.0048) = (0.0309, 0.0511).$$
We are 95 percent confident that the population regression slope $\beta_1$ is between 0.0309 and 0.0511. Note that this interval does not include 0.

> qt(0.975,18)
[1] 2.100922
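The confidence interval arithmetic above is worth doing once by hand. This Python sketch reproduces it from the rounded values printed in the notes (estimate 0.0410, standard error 0.0048, and the quantile qt(0.975,18) = 2.1009):

```python
b1 = 0.0410          # least squares slope estimate (rounded, from summary(fit))
se_b1 = 0.0048       # standard error of b1 (rounded, from summary(fit))
t_quantile = 2.1009  # upper 0.025 quantile of t(18), from qt(0.975, 18)

# Interval: point estimate +/- quantile * standard error
margin = t_quantile * se_b1
ci = (b1 - margin, b1 + margin)

print(round(ci[0], 4), round(ci[1], 4))  # 0.0309 0.0511
```

Since the interval excludes 0, the data support a linear relationship between $Y$ and $x$.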

Figure 6.3: $t(18)$ probability density function. A mark at $t = 8.484$ has been added.

6.2.5 Confidence and prediction intervals for a given $x = x_0$

INTEREST: Consider the simple linear regression model
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i,$$
where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. We are often interested in using the fitted model to learn about the response variable $Y$ at a certain setting for the independent variable $x = x_0$, say. For example, in our sewage sludge example, we might be interested in the moisture percentage $Y$ when the filtration rate is $x = 150$ kg-DS/m/hr. Two potential goals arise:

- We might be interested in estimating the mean response of $Y$ when $x = x_0$. This mean response is denoted by $E(Y|x_0)$. This value is the mean of the following probability distribution:
$$Y(x_0) \sim \mathcal{N}(\beta_0 + \beta_1 x_0, \sigma^2).$$
- We might be interested in predicting a new response $Y$ when $x = x_0$. This predicted response is denoted by $Y^*(x_0)$. This value is a new outcome from
$$Y(x_0) \sim \mathcal{N}(\beta_0 + \beta_1 x_0, \sigma^2).$$

In the first problem, we are interested in estimating the mean of the response variable $Y$ at a certain value of $x$. In the second problem, we are interested in predicting the value of a new random variable $Y$ at a certain value of $x$. Conceptually, the second problem is far more difficult than the first.

GOALS: We would like to create $100(1-\alpha)$ percent intervals for the mean $E(Y|x_0)$ and for the new value $Y^*(x_0)$. The former is called a confidence interval (since it is for a mean response) and the latter is called a prediction interval (since it is for a new random variable).

POINT ESTIMATOR/PREDICTOR: To construct either interval, we start with the same quantity:
$$\widehat{Y}(x_0) = b_0 + b_1 x_0,$$
where $b_0$ and $b_1$ are the least squares estimates from the fit of the model.

- In the confidence interval for $E(Y|x_0)$, we call $\widehat{Y}(x_0)$ a point estimator.
- In the prediction interval for $Y^*(x_0)$, we call $\widehat{Y}(x_0)$ a point predictor.
- The primary difference in the intervals arises in assessing the variability of $\widehat{Y}(x_0)$.

CONFIDENCE INTERVAL: A $100(1-\alpha)$ percent confidence interval for the mean $E(Y|x_0)$ is given by
$$\widehat{Y}(x_0) \pm t_{n-2,\alpha/2}\sqrt{MS_{\text{res}}\left[\frac{1}{n} + \frac{(x_0 - \overline{x})^2}{SS_{xx}}\right]}.$$

PREDICTION INTERVAL: A $100(1-\alpha)$ percent prediction interval for the new response $Y^*(x_0)$ is given by
$$\widehat{Y}(x_0) \pm t_{n-2,\alpha/2}\sqrt{MS_{\text{res}}\left[1 + \frac{1}{n} + \frac{(x_0 - \overline{x})^2}{SS_{xx}}\right]}.$$

COMPARISON: The two intervals are identical except for the extra "1" in the standard error part of the prediction interval. This extra "1" arises from the additional uncertainty associated with predicting a new response from the $\mathcal{N}(\beta_0 + \beta_1 x_0, \sigma^2)$ distribution. Therefore, a $100(1-\alpha)$ percent prediction interval for $Y^*(x_0)$ will be wider than the corresponding $100(1-\alpha)$ percent confidence interval for $E(Y|x_0)$.

INTERVAL LENGTH: The length of both intervals clearly depends on the value of $x_0$. In fact, the standard error of $\widehat{Y}(x_0)$ will be smallest when $x_0 = \overline{x}$ and will get larger the farther $x_0$ is from $\overline{x}$ in either direction. This implies that the precision with which we estimate $E(Y|x_0)$ or predict $Y^*(x_0)$ decreases the farther we get away from $\overline{x}$. This makes intuitive sense, namely, we would expect to have the most confidence in our fitted model near the center of the observed data.

TERMINOLOGY: It is sometimes desired to estimate $E(Y|x_0)$ or predict $Y^*(x_0)$ based on the fit of the model for values of $x_0$ outside the range of $x$ values used in the experiment/study. This is called extrapolation and can be very dangerous. In order for our inferences to be valid, we must believe that the straight line relationship holds for $x$ values outside the range where we have observed data. In some situations, this may be reasonable. In others, we may have no theoretical basis for making such a claim without data to support it.

Example 6.1 (continued). In our sewage sludge example, suppose that we are interested in estimating $E(Y|x_0)$ and predicting a new $Y^*(x_0)$ when the filtration rate is $x_0 = 150$ kg-DS/m/hr.

- $E(Y|x_0)$ denotes the mean moisture percentage for compressed pellets when the machine filtration rate is $x_0 = 150$ kg-DS/m/hr. In other words, if we were to repeat the experiment over and over again, each time using a filtration rate of $x_0 = 150$ kg-DS/m/hr, then $E(Y|x_0)$ denotes the mean value of $Y$ (moisture percentage) that would be observed.
- $Y^*(x_0)$ denotes a possible value of $Y$ for a single run of the machine when the filtration rate is set at $x_0 = 150$ kg-DS/m/hr.

R automates the calculation of confidence and prediction intervals, as seen below.

> predict(fit,data.frame(filtration.rate=150),level=0.95,interval="confidence")
       fit      lwr      upr
  79.11361 78.78765 79.43958
> predict(fit,data.frame(filtration.rate=150),level=0.95,interval="prediction")
       fit     lwr      upr
  79.11361 77.6783 80.54893

- Note that the point estimate (point prediction) is easily calculated:
$$\widehat{Y}(x_0 = 150) = 72.959 + 0.041(150) \approx 79.11361.$$
- A 95 percent confidence interval for $E(Y|x_0 = 150)$ is (78.79, 79.44). When the filtration rate is $x_0 = 150$ kg-DS/m/hr, we are 95 percent confident that the mean moisture percentage is between 78.79 and 79.44 percent.
- A 95 percent prediction interval for $Y^*(x_0 = 150)$ is (77.68, 80.55). When the filtration rate is $x_0 = 150$ kg-DS/m/hr, we are 95 percent confident that the moisture percentage for a single run of the experiment will be between 77.68 and 80.55 percent.
- Figure 6.4 shows 95 percent confidence bands for $E(Y|x_0)$ and 95 percent prediction bands for $Y^*(x_0)$. These are not simultaneous bands (i.e., these are not bands for the entire population regression function).



Figure 6.4: Scatterplot of pellet moisture Y (measured as a percentage) as a function of machine filtration rate x (measured in kg-DS/m/hr). The least squares regression line has been added. Ninety-five percent confidence/prediction bands have been added.

6.3 Multiple linear regression

6.3.1 Introduction

PREVIEW: We have already considered the simple linear regression model

Y_i = β_0 + β_1 x_i + ε_i,

for i = 1, 2, ..., n, where ε_i ~ N(0, σ²). We now extend this basic model to include multiple independent variables x_1, x_2, ..., x_k. This is much more realistic because, in practice, often Y depends on many different factors (i.e., not just one). Specifically, we consider models of the form

Y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ··· + β_k x_ik + ε_i,

for i = 1, 2, ..., n. We call this a multiple linear regression model.

- There are now p = k + 1 regression parameters β_0, β_1, ..., β_k. These are unknown and are to be estimated with the observed data.

Schematically, we can envision the observed data as follows:

Individual    Y      x_1    x_2    ···   x_k
    1        Y_1    x_11   x_12   ···   x_1k
    2        Y_2    x_21   x_22   ···   x_2k
    ⋮         ⋮      ⋮      ⋮            ⋮
    n        Y_n    x_n1   x_n2   ···   x_nk

- Each of the n individuals contributes a response Y and a value of each of the independent variables x_1, x_2, ..., x_k.

- We continue to assume that ε_i ~ N(0, σ²).

- We also assume that the independent variables x_1, x_2, ..., x_k are fixed and measured without error.

PREVIEW: To fit the multiple linear regression model

Y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ··· + β_k x_ik + ε_i,

we again use the method of least squares. Simple computing formulae for the least squares estimators are no longer available (as they were in simple linear regression). This is hardly a big deal because we will use computing to automate all analyses. For instructional purposes, it is advantageous to express multiple linear regression models in terms of matrices and vectors. This streamlines notation and makes the presentation easier.


6.3.2 Matrix representation

MATRIX REPRESENTATION: Consider the multiple linear regression model

Y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ··· + β_k x_ik + ε_i,

for i = 1, 2, ..., n. Define the column vectors

Y = (Y_1, Y_2, ..., Y_n)′,  β = (β_0, β_1, ..., β_k)′,  ε = (ε_1, ε_2, ..., ε_n)′,

and the matrix

    ⎡ 1  x_11  x_12  ···  x_1k ⎤
X = ⎢ 1  x_21  x_22  ···  x_2k ⎥
    ⎢ ⋮   ⋮     ⋮           ⋮  ⎥
    ⎣ 1  x_n1  x_n2  ···  x_nk ⎦ .

With these definitions, the model above can be expressed equivalently as

Y = Xβ + ε.

In this equivalent representation,

- Y is an n × 1 (random) vector of responses
- X is an n × p (fixed) matrix of independent variable measurements (p = k + 1)
- β is a p × 1 (fixed) vector of unknown population regression parameters
- ε is an n × 1 (random) vector of unobserved errors.

LEAST SQUARES: The notion of least squares is the same as it was in the simple linear regression model. To fit a multiple linear regression model, we want to find the values of β_0, β_1, ..., β_k that minimize

Q(β_0, β_1, ..., β_k) = Σ_{i=1}^{n} [Y_i − (β_0 + β_1 x_i1 + β_2 x_i2 + ··· + β_k x_ik)]²,

or, in matrix notation, the value of β that minimizes

Q(β) = (Y − Xβ)′(Y − Xβ).

Because Q(β) is a scalar function of the p = k + 1 elements of β, it is possible to use calculus to determine the values of the p elements that minimize it. Formally, we can take p partial derivatives with respect to each of β_0, β_1, ..., β_k and set these equal to zero. Using the calculus of matrices, we can write this resulting system of p equations (and p unknowns) as follows:

X′Xβ = X′Y.

These are called the normal equations. Provided that X′X is invertible, the solution is

b = (X′X)⁻¹X′Y = (b_0, b_1, b_2, ..., b_k)′.

This is the least squares estimator of β. The fitted regression model is

Ŷ = Xb,

or, equivalently,

Ŷ_i = b_0 + b_1 x_i1 + b_2 x_i2 + ··· + b_k x_ik,

for i = 1, 2, ..., n.

TECHNICAL NOTE: For the least squares estimator

b = (X′X)⁻¹X′Y

to be unique, we need X to be of full column rank; i.e., r(X) = p = k + 1. This will occur when there are no linear dependencies among the columns of X. If r(X) < p, then X′X does not have a unique inverse. In this case, the normal equations cannot be solved uniquely.
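As a quick numerical sketch (not part of the original notes, which use R; here Python/NumPy with made-up data), the least squares estimator b = (X′X)⁻¹X′Y can be computed directly from the normal equations, although in practice a numerically stabler routine such as numpy.linalg.lstsq is preferred:

```python
import numpy as np

# Hypothetical data: n = 5 responses and k = 2 independent variables
y = np.array([3.0, 5.0, 7.1, 8.9, 11.2])
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([0.5, 0.3, 0.8, 0.2, 0.9])

# Build the n x p design matrix X = [1, x1, x2], p = k + 1
X = np.column_stack([np.ones_like(x1), x1, x2])

# Solve the normal equations X'X b = X'Y
b = np.linalg.solve(X.T @ X, X.T @ y)

# The numerically preferred equivalent
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(b)                        # least squares estimates (b0, b1, b2)
print(np.allclose(b, b_lstsq))  # True: both routes give the same solution
```

Both routes solve the same minimization; solving the normal equations mirrors the algebra in the notes, while lstsq avoids explicitly forming X′X.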

Example 6.2. The taste of matured cheese is related to the concentration of several chemicals in the final product. In a study from the LaTrobe Valley of Victoria, Australia, samples of cheddar cheese were analyzed for their chemical composition and were subjected to taste tests. For each specimen, the taste Y was obtained by combining the scores from several tasters. Data were collected on the following variables:

Y = taste score (TASTE)
x_1 = concentration of acetic acid (ACETIC)
x_2 = concentration of hydrogen sulfide (H2S)
x_3 = concentration of lactic acid (LACTIC).

Variables ACETIC and H2S were both measured on the log scale. The variable LACTIC has not been transformed. Table 6.3 contains concentrations of the various chemicals in n = 30 specimens of cheddar cheese and the observed taste score.

Specimen TASTE ACETIC H2S LACTIC Specimen TASTE ACETIC H2S LACTIC

1 12.3 4.543 3.135 0.86 16 40.9 6.365 9.588 1.74

2 20.9 5.159 5.043 1.53 17 15.9 4.787 3.912 1.16

3 39.0 5.366 5.438 1.57 18 6.4 5.412 4.700 1.49

4 47.9 5.759 7.496 1.81 19 18.0 5.247 6.174 1.63

5 5.6 4.663 3.807 0.99 20 38.9 5.438 9.064 1.99

6 25.9 5.697 7.601 1.09 21 14.0 4.564 4.949 1.15

7 37.3 5.892 8.726 1.29 22 15.2 5.298 5.220 1.33

8 21.9 6.078 7.966 1.78 23 32.0 5.455 9.242 1.44

9 18.1 4.898 3.850 1.29 24 56.7 5.855 10.20 2.01

10 21.0 5.242 4.174 1.58 25 16.8 5.366 3.664 1.31

11 34.9 5.740 6.142 1.68 26 11.6 6.043 3.219 1.46

12 57.2 6.446 7.908 1.90 27 26.5 6.458 6.962 1.72

13 0.7 4.477 2.996 1.06 28 0.7 5.328 3.912 1.25

14 25.9 5.236 4.942 1.30 29 13.4 5.802 6.685 1.08

15 54.9 6.151 6.752 1.52 30 5.5 6.176 4.787 1.25

Table 6.3: Cheese data. ACETIC, H2S, and LACTIC are independent variables. The response variable is TASTE.

MODEL: Researchers postulate that each of the three chemical composition variables x_1, x_2, and x_3 is important in describing the taste and consider the multiple linear regression model

Y_i = β_0 + β_1 x_i1 + β_2 x_i2 + β_3 x_i3 + ε_i,

for i = 1, 2, ..., 30. We now use R to fit this model using the method of least squares:

> fit = lm(taste~acetic+h2s+lactic)

> fit

Coefficients:

(Intercept) acetic h2s lactic

-28.877 0.328 3.912 19.670

This output gives the values of the least squares estimates

b = (b_0, b_1, b_2, b_3)′ = (−28.877, 0.328, 3.912, 19.670)′.

Therefore, the fitted least squares regression model is

Ŷ = −28.877 + 0.328x_1 + 3.912x_2 + 19.670x_3,

or, in other words,

predicted TASTE = −28.877 + 0.328 ACETIC + 3.912 H2S + 19.670 LACTIC.

6.3.3 Estimating the error variance

GOAL: Consider the multiple linear regression model

Y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ··· + β_k x_ik + ε_i,

for i = 1, 2, ..., n, where ε_i ~ N(0, σ²). We have just seen how to estimate β_0, β_1, ..., β_k via least squares and how to automate this procedure in R. Our next task is to estimate the error variance σ².


TERMINOLOGY: Define the residual sum of squares by

SS_res = Σ_{i=1}^{n} (Y_i − Ŷ_i)² = Σ_{i=1}^{n} e_i².

In matrix notation, we can write this as

SS_res = (Y − Ŷ)′(Y − Ŷ) = (Y − Xb)′(Y − Xb) = e′e.

- The n × 1 vector Ŷ = Xb contains the least squares fitted values.
- The n × 1 vector e = Y − Ŷ contains the least squares residuals.

RESULT: In the multiple linear regression model

Y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ··· + β_k x_ik + ε_i,

where ε_i ~ N(0, σ²),

MS_res = SS_res/(n − p)

is an unbiased estimator of σ², that is, E(MS_res) = σ². The quantity

σ̂ = √MS_res = √(SS_res/(n − p))

estimates σ and is called the residual standard error.

CHEESE DATA: For the cheese data in Example 6.2, we use R to calculate MS_res:

> fitted.values = predict(fit)
> residuals = taste-fitted.values
> # Calculate MS_res
> sum(residuals^2)/26
[1] 102.6299


6.3.4 The hat matrix

TERMINOLOGY: Consider the linear regression model Y = Xβ + ε and define

H = X(X′X)⁻¹X′.

H is called the hat matrix. Important quantities in linear regression can be written as functions of the hat matrix.

- The vector of fitted values can be written as Ŷ = Xb = X(X′X)⁻¹X′Y = HY.
- The vector of residuals can be written as e = Y − Ŷ = Y − HY = (I − H)Y, where I is the n × n identity matrix.
- Interestingly, it turns out that Ŷ and e are orthogonal vectors in Rⁿ.
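These hat-matrix facts are easy to check numerically. The following sketch (Python/NumPy rather than the notes' R, with randomly generated stand-in data) verifies that H is symmetric and idempotent (HH = H) and that the fitted values Ŷ = HY are orthogonal to the residuals e = (I − H)Y:

```python
import numpy as np

# Hypothetical small data set: n = 8 observations, k = 2 covariates
rng = np.random.default_rng(0)
n, k = 8, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # n x p design, p = k + 1
y = rng.normal(size=n)

# Hat matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.inv(X.T @ X) @ X.T

y_hat = H @ y                  # fitted values
e = (np.eye(n) - H) @ y        # residuals

print(np.allclose(H, H.T))     # True: H is symmetric
print(np.allclose(H @ H, H))   # True: H is idempotent
print(abs(y_hat @ e) < 1e-10)  # True: fitted values and residuals are orthogonal
```

Symmetry and idempotency are exactly what make H an orthogonal projection onto the column space of X, which is why Ŷ′e = 0.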

6.3.5 Analysis of variance for linear regression

IDENTITY: Algebraically, it can be shown that

Σ_{i=1}^{n} (Y_i − Ȳ)² = Σ_{i=1}^{n} (Ŷ_i − Ȳ)² + Σ_{i=1}^{n} (Y_i − Ŷ_i)²,

that is, SS_total = SS_reg + SS_res.

- SS_total is the total sum of squares. SS_total is the numerator of the sample variance of Y_1, Y_2, ..., Y_n. It measures the total variation in the response data.
- SS_reg is the regression sum of squares. SS_reg measures the variation in the response data explained by the linear regression model.
- SS_res is the residual sum of squares. SS_res measures the variation in the response data not explained by the linear regression model.


Table 6.4: Analysis of variance table for linear regression.

Source       df       SS          MS                         F
Regression   k        SS_reg      MS_reg = SS_reg/k          F = MS_reg/MS_res
Residual     n − p    SS_res      MS_res = SS_res/(n − p)
Total        n − 1    SS_total

ANOVA TABLE: We can combine all of this information to produce an analysis of variance (ANOVA) table. Such tables are standard in regression analysis.

- The degrees of freedom (df) add down.
  - SS_total can be viewed as a statistic that has lost a degree of freedom for having to estimate the overall mean of Y with the sample mean Ȳ. Recall that n − 1 is our divisor in the sample variance S².
  - There are k degrees of freedom associated with SS_reg because there are k independent variables.
  - The degrees of freedom for SS_res can be thought of as the divisor needed to create an unbiased estimator of σ². Recall that MS_res = SS_res/(n − p) = SS_res/(n − k − 1) is an unbiased estimator of σ².
- The sums of squares (SS) also add down. This follows from the algebraic identity noted earlier.
- Mean squares (MS) are the sums of squares divided by their degrees of freedom.
- The F statistic is formed by taking the ratio of MS_reg and MS_res. More on this in a moment.
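The decomposition SS_total = SS_reg + SS_res can be verified on any data set fit by least squares with an intercept. A short sketch (Python/NumPy, simulated data in place of a real example):

```python
import numpy as np

# Simulated data with k = 2 independent variables (n = 10)
rng = np.random.default_rng(1)
n, k = 10, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

# Least squares fit and fitted values
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

ss_total = np.sum((y - y.mean()) ** 2)
ss_reg = np.sum((y_hat - y.mean()) ** 2)
ss_res = np.sum((y - y_hat) ** 2)

print(np.isclose(ss_total, ss_reg + ss_res))  # True: the sums of squares add down
print(ss_res / (n - (k + 1)))                 # MS_res, the unbiased estimate of sigma^2
```

The identity holds exactly (up to floating point) whenever the model contains an intercept, since the residuals then sum to zero.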

COEFFICIENT OF DETERMINATION: Since

SS_total = SS_reg + SS_res,

the proportion of the total variation in the data explained by the linear regression model is

R² = SS_reg/SS_total.

This statistic is called the coefficient of determination. Clearly,

0 ≤ R² ≤ 1.

The larger the R², the better the regression model explains the variability in the data.

IMPORTANT: It is critical to understand what R² does and does not measure. Its value is computed under the assumption that the multiple linear regression model is correct and assesses how much of the variation in the data may be attributed to that relationship rather than to inherent variation.

- If R² is small, it may be that there is a lot of random inherent variation in the data, so that, although the multiple linear regression model is reasonable, it can explain only so much of the observed overall variation.
- Alternatively, R² may be close to 1; e.g., in a simple linear regression model fit, but this may not be the best model. In fact, R² could be very high, but ultimately not relevant because it assumes the simple linear regression model is correct. In reality, a better model may exist (e.g., a quadratic model, etc.).

F STATISTIC: The F statistic in the ANOVA table is used to test

H_0: β_1 = β_2 = ··· = β_k = 0
versus
H_a: at least one of the β_j is nonzero.

In other words, F tests whether or not at least one of the independent variables x_1, x_2, ..., x_k is important in describing the response Y. If H_0 is rejected, we do not know which one or how many of the β_j's are nonzero; only that at least one is.

SAMPLING DISTRIBUTION: When H_0: β_1 = β_2 = ··· = β_k = 0 is true,

F = MS_reg/MS_res ~ F(k, n − p).

Therefore, we can gauge the evidence against H_0 by comparing F to this distribution. Values of F far out in the (right) upper tail are evidence against H_0. R automatically produces the value of F and produces the corresponding p-value. Recall that small p-values are evidence against H_0 (the smaller the p-value, the more evidence).

Example 6.2 (continued). For the cheese data in Example 6.2, we fit the multiple linear regression model

Y_i = β_0 + β_1 x_i1 + β_2 x_i2 + β_3 x_i3 + ε_i,

for i = 1, 2, ..., 30. The ANOVA table, obtained using SAS, is shown below.

Analysis of Variance

Sum of Mean

Source DF Squares Square F Value Pr > F

Regression 3 4994.50861 1664.83620 16.22 <.0001

Residual 26 2668.37806 102.62993

Corrected Total 29 7662.88667

The F statistic is used to test

H_0: β_1 = β_2 = β_3 = 0
versus
H_a: at least one of the β_j is nonzero.

ANALYSIS: Based on the F statistic (F = 16.22) and the corresponding probability value (p-value < 0.0001), we have strong evidence to reject H_0. See also Figure 6.5.

Interpretation: We conclude that at least one of the independent variables (ACETIC, H2S, LACTIC) is important in describing taste.
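As a quick arithmetic check of how this F statistic arises from the ANOVA quantities above (a Python sketch, not part of the original notes):

```python
# ANOVA quantities from the cheese example (n = 30, k = 3, p = 4)
ss_reg, df_reg = 4994.50861, 3
ss_res, df_res = 2668.37806, 26

ms_reg = ss_reg / df_reg  # mean square for regression, 1664.83620
ms_res = ss_res / df_res  # mean square for residual, about 102.63
f_stat = ms_reg / ms_res  # compared against the F(3, 26) distribution

print(round(f_stat, 2))   # 16.22
```

This reproduces the F value reported in the table; the p-value then comes from the upper tail of the F(3, 26) distribution.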

NOTE: The coefficient of determination is

R² = SS_reg/SS_total = 4994.51/7662.89 ≈ 0.652.

Interpretation: About 65.2 percent of the variability in the taste data is explained by the linear regression model that includes ACETIC, H2S, and LACTIC. The remaining 34.8 percent of the variability in the taste data is explained by other sources.


Figure 6.5: F(3, 26) probability density function. A mark at F = 16.22 has been added.

IMPORTANT: If we fit this model in R, we get the following:

fit = lm(taste~acetic+h2s+lactic)

anova(fit)

Response: taste

Df Sum Sq Mean Sq F value Pr(>F)

acetic 1 2314.14 2314.14 22.5484 6.528e-05 ***

h2s 1 2147.11 2147.11 20.9209 0.0001035 ***

lactic 1 533.26 533.26 5.1959 0.0310870 *

Residuals 26 2668.38 102.63

NOTE: The convention used by R is to split up the regression sum of squares

SS_reg = 4994.50861

into sums of squares for each of the three independent variables ACETIC, H2S, and LACTIC, as they are added sequentially to the model (these are called sequential sums of squares). The sequential sums of squares for the independent variables add to the SS_reg for the model (up to rounding error), that is,

SS_reg = 4994.51 = 2314.14 + 2147.11 + 533.26
       = SS(ACETIC) + SS(H2S) + SS(LACTIC).

In words,

- SS(ACETIC) is the sum of squares added when compared to a model that includes only an intercept term.
- SS(H2S) is the sum of squares added when compared to a model that includes an intercept term and ACETIC.
- SS(LACTIC) is the sum of squares added when compared to a model that includes an intercept term, ACETIC, and H2S.

In other words, we can use the sequential sums of squares to assess the impact of adding independent variables ACETIC, H2S, and LACTIC to the model in sequence.
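A one-line sanity check (a Python sketch, not part of the notes) that the sequential sums of squares from the anova() output do add up to SS_reg:

```python
# Sequential sums of squares from the R anova() output (cheese example)
ss_acetic, ss_h2s, ss_lactic = 2314.14, 2147.11, 533.26
ss_reg = 4994.51  # regression sum of squares for the full model (rounded)

# The three sequential pieces should recover SS_reg, up to rounding
print(abs((ss_acetic + ss_h2s + ss_lactic) - ss_reg) < 0.01)  # True
```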

6.3.6 Inference for individual regression parameters

IMPORTANCE: Consider our multiple linear regression model

Y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ··· + β_k x_ik + ε_i,

for i = 1, 2, ..., n, where ε_i ~ N(0, σ²). Confidence intervals and hypothesis tests for β_j can help us assess the importance of using the independent variable x_j in a model with the other independent variables. That is, inference regarding β_j is always conditional on the other variables being included in the model.

CONFIDENCE INTERVALS: A 100(1 − α) percent confidence interval for β_j, for j = 0, 1, 2, ..., k, is given by

b_j ± t_{n−p, α/2} √(σ̂² c_jj),

where

σ̂² = MS_res = SS_res/(n − p)

is our unbiased estimate of σ² and

c_jj = (X′X)⁻¹_jj

is the corresponding diagonal element of the (X′X)⁻¹ matrix.

HYPOTHESIS TESTS: Hypothesis tests for

H_0: β_j = 0 versus H_a: β_j ≠ 0

can be performed by examining the p-value output provided in R.

- If H_0: β_j = 0 is not rejected, then x_j is not important in describing Y in the presence of the other independent variables.
- If H_0: β_j = 0 is rejected, this means that x_j is important in describing Y even after including the effects of the other independent variables.

> summary(fit)

Call: lm(formula = taste ~ acetic + h2s + lactic)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -28.877 19.735 -1.463 0.15540

acetic 0.328 4.460 0.074 0.94193

h2s 3.912 1.248 3.133 0.00425 **

lactic 19.670 8.629 2.279 0.03109 *

Residual standard error: 10.13 on 26 degrees of freedom

Multiple R-squared: 0.6518, Adjusted R-squared: 0.6116

F-statistic: 16.22 on 3 and 26 DF, p-value: 3.810e-06


OUTPUT: The Estimate output gives the values of the least squares estimates:

b_0 ≈ −28.877,  b_1 ≈ 0.328,  b_2 ≈ 3.912,  b_3 ≈ 19.670.

Therefore, the fitted least squares regression model is

Ŷ = −28.877 + 0.328x_1 + 3.912x_2 + 19.670x_3,

or, in other words,

predicted TASTE = −28.877 + 0.328 ACETIC + 3.912 H2S + 19.670 LACTIC.

The Std.Error output gives

19.735 = se(b_0) = √(c_00 σ̂²)
 4.460 = se(b_1) = √(c_11 σ̂²)
 1.248 = se(b_2) = √(c_22 σ̂²)
 8.629 = se(b_3) = √(c_33 σ̂²),

where c_jj = (X′X)⁻¹_jj and

σ̂² = MS_res = SS_res/(30 − 4) = (10.13)² ≈ 102.63

is the square of the Residual standard error. The t value output gives the t statistics

t = −1.463 = (b_0 − 0)/√(c_00 σ̂²)
t =  0.074 = (b_1 − 0)/√(c_11 σ̂²)
t =  3.133 = (b_2 − 0)/√(c_22 σ̂²)
t =  2.279 = (b_3 − 0)/√(c_33 σ̂²).

These t statistics can be used to test H_0: β_i = 0 versus H_a: β_i ≠ 0, for i = 0, 1, 2, 3.

Two-sided probability values are in Pr(>|t|). At the α = 0.05 level,

- we do not reject H_0: β_0 = 0 (p-value = 0.155). Interpretation: In the model which includes all three independent variables, the intercept term β_0 is not statistically different from zero.
- we do not reject H_0: β_1 = 0 (p-value = 0.942). Interpretation: ACETIC does not significantly add to a model that includes H2S and LACTIC.
- we reject H_0: β_2 = 0 (p-value = 0.004). Interpretation: H2S does significantly add to a model that includes ACETIC and LACTIC.
- we reject H_0: β_3 = 0 (p-value = 0.031). Interpretation: LACTIC does significantly add to a model that includes ACETIC and H2S.

CONFIDENCE INTERVALS: Ninety-five percent confidence intervals for the regression parameters β_0, β_1, β_2, and β_3, respectively, are

b_0 ± t_{26, 0.025} se(b_0) = −28.877 ± 2.056(19.735) ⟹ (−69.45, 11.70)
b_1 ± t_{26, 0.025} se(b_1) =   0.328 ± 2.056(4.460)  ⟹ (−8.84, 9.50)
b_2 ± t_{26, 0.025} se(b_2) =   3.912 ± 2.056(1.248)  ⟹ (1.35, 6.48)
b_3 ± t_{26, 0.025} se(b_3) =  19.670 ± 2.056(8.629)  ⟹ (1.93, 37.41).

The conclusions reached from interpreting these intervals are the same as those reached using the hypothesis test p-values. Note that the β_2 and β_3 intervals do not include zero. Those for β_0 and β_1 do.
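The interval endpoints are straightforward arithmetic; here is a short check (a Python sketch, using the estimates, standard errors, and the critical value t_{26, 0.025} = 2.056 quoted in the notes):

```python
# Estimates and standard errors from the R summary() output (cheese example)
t_crit = 2.056  # t quantile with 26 df, alpha/2 = 0.025
params = {
    "beta0": (-28.877, 19.735),
    "beta1": (0.328, 4.460),
    "beta2": (3.912, 1.248),
    "beta3": (19.670, 8.629),
}

# b_j +/- t_crit * se(b_j) for each parameter
for name, (b, se) in params.items():
    lower, upper = b - t_crit * se, b + t_crit * se
    print(name, round(lower, 2), round(upper, 2))
```

Running this reproduces the four intervals above; the beta2 and beta3 intervals exclude zero, while those for beta0 and beta1 contain it.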

6.3.7 Confidence and prediction intervals for a given x = x_0

GOALS: We would like to create 100(1 − α) percent intervals for the mean E(Y|x_0) and for the new value Y*(x_0). As in the simple linear regression case, the former is called a confidence interval (since it is for a mean response) and the latter is called a prediction interval (since it is for a new random variable).

CHEESE DATA: Suppose that we are interested in estimating E(Y|x_0) and predicting a new Y*(x_0) when ACETIC = 5.5, H2S = 6.0, and LACTIC = 1.4, so that

x_0 = (5.5, 6.0, 1.4)′.

We use R to compute the following:

> predict(fit,data.frame(acetic=5.5,h2s=6.0,lactic=1.4),level=0.95,interval="confidence")

fit lwr upr

23.93552 20.04506 27.82597

> predict(fit,data.frame(acetic=5.5,h2s=6.0,lactic=1.4),level=0.95,interval="prediction")

fit lwr upr

23.93552 2.751379 45.11966

Note that the point estimate/prediction is

Ŷ(x_0) = b_0 + b_1 x_10 + b_2 x_20 + b_3 x_30
       = −28.877 + 0.328(5.5) + 3.912(6.0) + 19.670(1.4) ≈ 23.936.
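The same prediction can be checked by hand (a Python sketch using the rounded coefficients, so the result differs from R's 23.93552 only in the last decimals):

```python
# Rounded least squares estimates from the cheese fit
b0, b1, b2, b3 = -28.877, 0.328, 3.912, 19.670

# New settings: ACETIC = 5.5, H2S = 6.0, LACTIC = 1.4
y_hat = b0 + b1 * 5.5 + b2 * 6.0 + b3 * 1.4

print(round(y_hat, 3))  # 23.937 (R reports 23.93552 using unrounded coefficients)
```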

- A 95 percent confidence interval for E(Y|x_0) is (20.05, 27.83). When ACETIC = 5.5, H2S = 6.0, and LACTIC = 1.4, we are 95 percent confident that the mean taste rating is between 20.05 and 27.83.

- A 95 percent prediction interval for Y*(x_0), when x = x_0, is (2.75, 45.12). When ACETIC = 5.5, H2S = 6.0, and LACTIC = 1.4, we are 95 percent confident that the taste rating for a new cheese specimen will be between 2.75 and 45.12.

6.4 Model diagnostics (residual analysis)

IMPORTANCE: We now discuss certain diagnostic techniques for linear regression. The term "diagnostics" refers to the process of checking the model assumptions. This is an important exercise because if the model assumptions are violated, then our analysis (and all subsequent interpretations) could be compromised.

MODEL ASSUMPTIONS: We first recall the model assumptions on the error terms in the linear regression model

Y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ··· + β_k x_ik + ε_i,

for i = 1, 2, ..., n. Specifically, we have made the following assumptions:


- E(ε_i) = 0, for i = 1, 2, ..., n
- var(ε_i) = σ², for i = 1, 2, ..., n, that is, the variance is constant
- the random variables ε_i are independent
- the random variables ε_i are normally distributed.

RESIDUALS: In checking our model assumptions, we first have to deal with the obvious problem; namely, the error terms ε_i in the model are never observed. However, from the fit of the model, we can calculate the residuals

e_i = Y_i − Ŷ_i,

where the ith fitted value

Ŷ_i = b_0 + b_1 x_i1 + b_2 x_i2 + ··· + b_k x_ik.

We can think of the residuals e_1, e_2, ..., e_n as proxies for the error terms ε_1, ε_2, ..., ε_n, and, therefore, we can use the residuals to check our model assumptions instead.

QQ PLOT FOR NORMALITY: To check the normality assumption (for the errors) in linear regression, it is common to display the qq-plot of the residuals.

- Recall that if the plotted points follow a straight line (approximately), this supports the normality assumption.
- Substantial deviation from linearity is not consistent with the normality assumption.
- The plot in Figure 6.6 supports the normality assumption for the errors in the multiple linear regression model for the cheese data.

RESIDUAL PLOT: By the phrase "residual plot," I mean the plot of the residuals (on the vertical axis) versus the predicted values (on the horizontal axis). This plot is simply the scatterplot of the residuals and the predicted values.


Figure 6.6: Cheese data. Normal qq-plot of the least squares residuals.

- Advanced linear model arguments show that if the model does a good job at describing the data, then the residuals and fitted values are independent.
- This means that a plot of the residuals versus the fitted values should reveal no noticeable patterns; that is, the plot should appear to be random in nature (e.g., a random scatter of points).
- On the other hand, if there are definite (non-random) patterns in the residual plot, this suggests that the model is inadequate in some way or it could point to a violation in the model assumptions.
- The plot in Figure 6.7 does not suggest any obvious model inadequacies! It looks completely random in appearance.


Figure 6.7: Cheese data. Residual plot for the multiple linear regression model fit. A horizontal line at zero has been added.

COMMON VIOLATIONS: Although there are many ways to violate the statistical assumptions associated with linear regression, the most common violations are

- non-constant variance (heteroscedasticity)
- misspecifying the true regression function
- correlated observations over time.

Example 6.3. An electric company is interested in modeling peak hour electricity demand (Y) as a function of total monthly energy usage (x). This is important for planning purposes because the generating system must be large enough to meet the maximum demand imposed by customers. Data for n = 53 residential customers for a given month are shown in Figure 6.8.


Figure 6.8: Electricity data. Left: Scatterplot of peak demand (Y, measured in kWh) versus monthly usage (x, measured in kWh) with least squares simple linear regression line superimposed. Right: Residual plot for the simple linear regression model fit.

Problem: There is a clear problem with non-constant variance here. Note how the residual plot fans out like the bell of a trumpet. This violation may have been missed by looking at the scatterplot alone, but the residual plot highlights it.

Remedy: A common course of action to handle non-constant variance is to apply a transformation to the response variable Y. Common transformations are logarithmic (ln Y) and square-root (√Y).

ELECTRICITY DATA: A square root transformation is commonly applied to address non-constant variance. Consider the simple linear regression model

W_i = β_0 + β_1 x_i + ε_i,

for i = 1, 2, ..., 53, where W_i = √Y_i. It is straightforward to fit this transformed model in R as before. We simply regress W on x (instead of regressing Y on x).

> fit.2 = lm(sqrt(peak.demand) ~ monthly.usage)


Figure 6.9: Electricity data. Left: Scatterplot of the square root of peak demand (√Y) versus monthly usage (x, measured in kWh) with the least squares simple linear regression line superimposed. Right: Residual plot for the simple linear regression model fit with transformed response.

> fit.2

Coefficients:

(Intercept) monthly.usage

0.580831 0.000953

ANALYSIS: Figure 6.9 above shows the scatterplot (left) and the residual plot (right) from fitting the transformed model. The fanning out shape that we saw previously (in the untransformed model) is now largely absent. The fitted transformed model is

Ŵ = 0.580831 + 0.000953x,

or, in other words, the predicted square root of peak demand is 0.580831 + 0.000953(monthly usage).

Further analyses can be carried out with the transformed model; e.g., testing whether peak demand (on the square root scale) is linearly related to monthly usage, etc.


Figure 6.10: Windmill data. Left: Scatterplot of DC output Y versus wind velocity (x, measured in mph) with least squares simple linear regression line superimposed. Right: Residual plot for the simple linear regression model fit.

Example 6.4. A research engineer is investigating the use of a windmill to generate electricity. He has collected data on the direct current (DC) output Y from his windmill and the corresponding wind velocity (x, measured in mph). Data for n = 25 observation pairs are shown in Figure 6.10.

Problem: There is a clear quadratic relationship between DC output and wind velocity, so a simple linear regression model fit (as shown above) is inappropriate. The residual plot shows a pronounced quadratic pattern; this pattern is not accounted for in fitting a straight line model.

Remedy: Fit a multiple linear regression model with two independent variables: wind velocity x and its square x², that is, consider the quadratic regression model

Y_i = β_0 + β_1 x_i + β_2 x_i² + ε_i,

for i = 1, 2, ..., 25. It is straightforward to fit a quadratic model in R. We simply regress Y on x and x².
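A quadratic fit of this kind can also be sketched in Python/NumPy (with simulated data standing in for the windmill measurements; np.polyfit returns coefficients from the highest power down):

```python
import numpy as np

# Simulated stand-in for the windmill data: DC output rises with wind
# velocity but levels off (a concave quadratic), plus a little noise
rng = np.random.default_rng(2)
x = np.linspace(3.0, 10.0, 25)  # wind velocity (mph)
y = -1.0 + 0.7 * x - 0.04 * x**2 + rng.normal(scale=0.05, size=x.size)

# Equivalent of regressing Y on x and x^2: fit a degree-2 polynomial
b2, b1, b0 = np.polyfit(x, y, deg=2)  # highest power first

print(b0, b1, b2)  # estimates close to the true -1.0, 0.7, -0.04
```

With the true curve quadratic, the fitted coefficients land near the generating values, which is exactly what the quadratic regression model is designed to capture.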


Figure 6.11: Windmill data. Left: Scatterplot of DC output Y versus wind velocity (x, measured in mph) with least squares quadratic regression curve superimposed. Right: Residual plot for the quadratic regression model fit.

> wind.velocity.sq = wind.velocity^2

> fit.2 = lm(DC.output ~ wind.velocity + wind.velocity.sq)

> fit.2

Coefficients:

(Intercept) wind.velocity wind.velocity.sq

-1.15590 0.72294 -0.03812

The fitted quadratic regression model is

Ŷ = −1.15590 + 0.72294x − 0.03812x²,

or, in other words, the predicted DC output is −1.15590 + 0.72294(wind velocity) − 0.03812(wind velocity)².

Note that the residual plot from the quadratic model fit, shown above, now looks quite good. The quadratic trend has disappeared (because the model now incorporates it).



Figure 6.12: Global temperature data. Left: Time series plot of the temperature Y measured one time per year. The independent variable x is year, measured as 1900, 1901, ..., 1997. A simple linear regression model fit has been superimposed. Right: Residual plot from the simple linear regression model fit.

Example 6.5. The data in Figure 6.12 (left) are temperature readings (in deg C) on land-air average temperature anomalies, collected once per year from 1900-1997. To emphasize that the data are collected over time, I have used straight lines to connect the observations; this is called a time series plot.

- Unfortunately, it is all too common that people fit linear regression models to time series data and then blindly use them for prediction purposes.
- It takes neither a meteorologist nor an engineering degree to know that temperature observations collected over time are probably correlated. Not surprisingly, residuals from a simple linear regression display clear correlation over time.
- Regression techniques (as we have learned in this chapter) are generally not appropriate when analyzing time series data for this reason. More advanced modeling techniques are needed.


7 Factorial Experiments

Complementary reading: Chapter 7 (VK); Sections 7.1-7.2.

7.1 Introduction

REMARK: In engineering experiments, particularly those carried out in industrial set-
tings, there are often several factors of interest, and the goal is to assess the effects of
these factors on a continuous response Y (e.g., yield, lifetime, fill weights, etc.). A fac-
torial treatment structure is an efficient way of defining treatments in these types of
experiments.

- One example of a factorial treatment structure uses k factors, where each factor
has two levels. This is called a 2^k factorial experiment.
- Factorial experiments are often used in the early stages of experimental work. For
this reason, factorial experiments are also called factor screening experiments.

Example 7.1. A nickel-titanium alloy is used to make components for jet turbine
aircraft engines. Cracking is a potentially serious problem in the final part, as it can lead
to nonrecoverable failure. A test is run at the parts producer to determine the effect of
k = 4 factors on cracks: pouring temperature (A), titanium content (B), heat treatment
method (C), and amount of grain refiner used (D).

- Factor A has 2 levels: low temperature and high temperature
- Factor B has 2 levels: low content and high content
- Factor C has 2 levels: Method 1 and Method 2
- Factor D has 2 levels: low amount and high amount.

The response variable in the experiment is

Y = length of largest crack (in mm) induced in a piece of sample material.


NOTE: In this example, there are 4 factors, each with 2 levels. Thus, there are

2 x 2 x 2 x 2 = 2^4 = 16

different treatment combinations. These are listed here:

a1b1c1d1   a1b2c1d1   a2b1c1d1   a2b2c1d1
a1b1c1d2   a1b2c1d2   a2b1c1d2   a2b2c1d2
a1b1c2d1   a1b2c2d1   a2b1c2d1   a2b2c2d1
a1b1c2d2   a1b2c2d2   a2b1c2d2   a2b2c2d2

For example, the treatment combination a1b1c1d1 holds each factor at its low level, the
treatment combination a1b1c2d2 holds Factors A and B at their low level and Factors
C and D at their high level, and so on.

TERMINOLOGY: In a 2^k factorial experiment, one replicate of the experiment uses 2^k
runs, one at each of the 2^k treatment combinations.

- Therefore, in Example 7.1, one replicate of the experiment would require 16 runs
(one at each treatment combination listed above).
- Two replicates would require 32 runs, three replicates would require 48 runs, and
so on.
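The list of treatment combinations generalizes to any k. As an illustrative sketch (the notes use R for their computing; this Python helper and its name are our own), the 2^k label strings can be enumerated programmatically:

```python
# Hypothetical helper (not from the notes): enumerate all 2^k treatment
# combinations of a 2^k factorial experiment as label strings like "a1b1c1d1".
from itertools import product

def treatment_combinations(k):
    # Factor letters a, b, c, ...; each factor at levels 1 (low) and 2 (high).
    letters = "abcdefghij"[:k]
    return ["".join(f"{f}{lvl}" for f, lvl in zip(letters, levels))
            for levels in product((1, 2), repeat=k)]

combos = treatment_combinations(4)
print(len(combos))             # 16 runs in one replicate
print(combos[0], combos[-1])   # a1b1c1d1 a2b2c2d2
```

For k = 4 this reproduces the 16 combinations listed above; for k = 10 it would produce 1024.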

TERMINOLOGY: There are different types of effects of interest in factorial experiments:
main effects and interaction effects. For example, in a 2^4 factorial experiment,

- there is 1 effect that does not depend on any of the factors.
- there are 4 main effects: A, B, C, and D.
- there are 6 two-way interaction effects: AB, AC, AD, BC, BD, and CD.
- there are 4 three-way interaction effects: ABC, ABD, ACD, and BCD.
- there is 1 four-way interaction effect: ABCD.


OBSERVATION: Note that 1 + 4 + 6 + 4 + 1 = 16. In other words, with 16 observations
(from one 2^4 replicate), we can estimate the 4 main effects and we can estimate all of the
11 interaction effects. We will have 1 observation left to estimate the overall mean of Y,
that is, the effect that depends on none of the 4 factors.

GENERALIZATION: In a 2^k factorial experiment, there is/are

- C(k,0) = 1 overall mean (the mean of Y ignoring all factors)
- C(k,1) = k main effects
- C(k,2) = k(k-1)/2 two-way interaction effects
- C(k,3) three-way interaction effects, and so on,

where C(k,j) denotes the binomial coefficient "k choose j." Note that

C(k,0) + C(k,1) + C(k,2) + ... + C(k,k) = sum_{j=0}^{k} C(k,j) = 2^k,

and additionally that C(k,0), C(k,1), ..., C(k,k) are the entries in the (k + 1)th row of Pascal's
Triangle. Observe also that 2^k grows quickly in size as k increases. For example, if there
are k = 10 factors (A, B, C, D, E, F, G, H, I, and J, say), then performing just one
replicate of the experiment would require 2^10 = 1024 runs! In real life, rarely would this
type of experiment be possible.

7.2 Example: A 2^2 experiment with replication

NOTE: We first consider 2^k factorial experiments where k = 2; that is, there are only
two factors, denoted by A and B. This is called a 2^2 experiment. We illustrate with an
agricultural example.

Example 7.2. Predicting corn yield prior to harvest is useful for making feed supply and
marketing decisions. Corn must have an adequate amount of nitrogen (Factor A) and
phosphorus (Factor B) for profitable production and also for environmental concerns.


Table 7.1: Corn yield data (bushels/plot).

Treatment combination   Yield (Y)             Treatment sample mean
a1b1                    35, 26, 25, 33, 31    30
a1b2                    39, 33, 41, 31, 36    36
a2b1                    37, 27, 35, 27, 34    32
a2b2                    49, 39, 39, 47, 46    44

Experimental design: In a 2 x 2 = 2^2 factorial experiment, two levels of nitrogen
(a1 = 10 and a2 = 15) and two levels of phosphorus (b1 = 2 and b2 = 4) were used.
Applications of nitrogen and phosphorus were measured in pounds per plot. Twenty
small (quarter-acre) plots were available for experimentation, and the four treatment
combinations a1b1, a1b2, a2b1, and a2b2 were randomly assigned to plots. Note that
there are 5 replications.

Response: The response variable is

Y = yield per plot (measured in # bushels).

Side-by-side boxplots of the data in the table above are in Figure 7.1.

Naive analysis: One silly way to analyze these data would be to simply regard each
of the combinations a1b1, a1b2, a2b1, and a2b2 as a treatment and perform a one-way
ANOVA with t = 4 treatment groups like we did in Chapter 4. This would produce the
following ANOVA table:

> anova(lm(yield ~ treatment))

Analysis of Variance Table

Df Sum Sq Mean Sq F value Pr(>F)

treatment 3 575 191.67 9.5833 0.0007362 ***

Residuals 16 320 20.00


Figure 7.1: Boxplots of corn yields (bushels/plot) for four treatment groups (a1b1, a1b2, a2b1, a2b2).

OMNIBUS CONCLUSION: The value F = 9.5833 is not what we would expect from an
F(3, 16) distribution, the distribution of F when

H0: μ11 = μ12 = μ21 = μ22

is true (p-value = 0.0007). Therefore, we would conclude that at least one of the factorial
treatment population means is different.

REMARK: As we have discussed before in one-way classification experiments, an overall
F test provides very little information. With a factorial treatment structure, it is possible
to explore the data further; in particular, we can learn about the main effects due to
nitrogen (Factor A) and due to phosphorus (Factor B). We can also learn about the
interaction between nitrogen and phosphorus.


PARTITION: Let us first recall the treatment sum of squares from the one-way ANOVA:
SS_trt = 575. The way we learn more about specific effects is to partition SS_trt into the
following pieces: SS_A, SS_B, and SS_AB. By partition, I mean that we will write

SS_trt = SS_A + SS_B + SS_AB.

In words,

- SS_A is the sum of squares due to the main effect of A (nitrogen)
- SS_B is the sum of squares due to the main effect of B (phosphorus)
- SS_AB is the sum of squares due to the interaction effect of A and B (nitrogen and
phosphorus).
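This partition can be verified by hand from the cell means in Table 7.1. Here is an illustrative Python computation (the notes carry out the analysis in R; all variable names below are our own):

```python
# Compute SS_A, SS_B, and SS_AB directly from the corn yield data in
# Table 7.1 and verify the partition SS_trt = SS_A + SS_B + SS_AB.
data = {  # treatment combination -> replicate yields (bushels/plot)
    ("a1", "b1"): [35, 26, 25, 33, 31],
    ("a1", "b2"): [39, 33, 41, 31, 36],
    ("a2", "b1"): [37, 27, 35, 27, 34],
    ("a2", "b2"): [49, 39, 39, 47, 46],
}
n = 5                                           # replicates per cell
cell = {k: sum(v) / n for k, v in data.items()} # cell means: 30, 36, 32, 44
grand = sum(cell.values()) / 4                  # grand mean = 35.5

ss_trt = n * sum((m - grand) ** 2 for m in cell.values())
a_mean = {a: (cell[(a, "b1")] + cell[(a, "b2")]) / 2 for a in ("a1", "a2")}
b_mean = {b: (cell[("a1", b)] + cell[("a2", b)]) / 2 for b in ("b1", "b2")}
ss_a = 2 * n * sum((m - grand) ** 2 for m in a_mean.values())
ss_b = 2 * n * sum((m - grand) ** 2 for m in b_mean.values())
ss_ab = ss_trt - ss_a - ss_b

print(ss_trt, ss_a, ss_b, ss_ab)  # 575.0 125.0 405.0 45.0
```

These values (575 = 125 + 405 + 45) match the ANOVA tables that follow.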

We can use R to write this partition in a richer ANOVA table (mathematical details

omitted):

> fit = lm(yield ~ nitrogen*phosphorus)

> anova(fit)

Analysis of Variance Table

Df Sum Sq Mean Sq F value Pr(>F)

nitrogen 1 125 125 6.25 0.0236742 *

phosphorus 1 405 405 20.25 0.0003635 ***

nitrogen:phosphorus 1 45 45 2.25 0.1530877

Residuals 16 320 20

The F statistics F_A, F_B, and F_AB can be used to test for main effects and an interaction
effect, respectively. Each F statistic above has an F(1, 16) distribution under the as-
sumption that the associated effect is zero. Small p-values (e.g., p-value < 0.05) indicate
that the effect is nonzero. Effects with large p-values can be treated as not significant.


Figure 7.2: F(1, 16) probability density function. A mark at F_AB = 2.25 has been added.

ANALYSIS: When analyzing data from an experiment with a 2^2 factorial treatment
structure, the first task is to judge whether the two factors interact; here, whether or not
the nitrogen/phosphorus contribution is real. From the ANOVA table, we see that

F_AB = 2.25 (p-value = 0.153).

This value of F_AB is not all that unreasonable when compared to the F(1, 16) distribution,
the distribution of F_AB when nitrogen and phosphorus do not interact. In other words,
we do not have substantial evidence that nitrogen and phosphorus interact.

NOTE: An interaction plot is a graphical display that can help us assess (visually)
whether two factors interact. In this plot, the levels of Factor A (say) are marked on the
horizontal axis. The sample means of the treatments are plotted against the levels of A,
and the points corresponding to the same level of Factor B are joined by straight lines.


Figure 7.3: Interaction plot for nitrogen and phosphorus in Example 7.2.

- If Factors A and B do not interact at all, the interaction plot should display parallel
lines. That is, the effect of one factor stays constant across the levels of the other
factor. This is essentially what it means to have no interaction.
- If the interaction plot displays a departure from parallelism (including an over-
whelming case where the lines intersect), then this is visual evidence of interaction.
That is, the effect of one factor depends on the levels of the other factor.
- The F test that uses F_AB provides numerical evidence of interaction. The interac-
tion plot provides visual evidence.

CONCLUSION: We do not have strong evidence that nitrogen and phosphorus interact.
The F_AB statistic is not significant and the interaction plot does not show a substantial
departure from parallelism.


GENERAL STRATEGY: The following are guidelines for analyzing data from 2^2 fac-
torial experiments. Start by looking at whether or not the interaction contribution is
significant. This can be done by using an interaction plot and an F test that uses F_AB.

- If the interaction is significant, then formal analysis of main effects is not all
that meaningful because their interpretations depend on the interaction. In this
situation, the best approach is to just redo the entire analysis as a one-way ANOVA
with 4 treatments. Tukey pairwise confidence intervals can help you formulate an
ordering among the 4 treatment population means.
- If the interaction is not significant, I prefer to refit the model without the
interaction term present and then examine the main effects. This can be done
numerically by examining the sizes of F_A and F_B, respectively.

ANALYSIS: Here is the ANOVA table for the corn yield data, leaving out the nitro-
gen/phosphorus interaction term:

> fit = lm(yield ~ nitrogen + phosphorus)

> anova(fit)

Analysis of Variance Table

Df Sum Sq Mean Sq F value Pr(>F)

nitrogen 1 125 125.00 5.8219 0.027403 *

phosphorus 1 405 405.00 18.8630 0.000442 ***

Residuals 17 365 21.47

Comparing this to the ANOVA table with interaction, note that the interaction sum of
squares, SS_AB = 45, has now been absorbed into the residual sum of squares.

- The main effect of nitrogen (Factor A) is significant in describing yield (F_A =
5.8219, p-value = 0.0274).
- The main effect of phosphorus (Factor B) is strongly significant in describing yield
(F_B = 18.8630, p-value = 0.0004).


Figure 7.4: Left: Side-by-side boxplots of yield (bushels/plot) for nitrogen (Factor A). Right: Side-by-side boxplots of yield (bushels/plot) for phosphorus (Factor B).

CONFIDENCE INTERVALS: A 95 percent confidence interval for μ_A1 - μ_A2, the differ-
ence in means for the two levels of nitrogen (Factor A), is given by

(Ybar_A1 - Ybar_A2) ± t_{17, 0.025} sqrt( MS_res (1/10 + 1/10) ).

A 95 percent confidence interval for μ_B1 - μ_B2, the difference in means for the two levels
of phosphorus (Factor B), is given by

(Ybar_B1 - Ybar_B2) ± t_{17, 0.025} sqrt( MS_res (1/10 + 1/10) ).

The R code online can be used to calculate these intervals:

- 95% CI for μ_A1 - μ_A2: (-9.37, -0.62) bushels/plot
- 95% CI for μ_B1 - μ_B2: (-13.37, -4.63) bushels/plot

Note that neither of these intervals includes zero. This is expected because F_A and F_B
are both significant at the α = 0.05 level.
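The interval arithmetic can be reproduced by hand. Here is a hedged Python sketch (the notes provide R code online; the critical value t_{17, 0.025} = 2.1098 is hardcoded below, and the level means 33, 38, 31, and 40 follow from the cell means in Table 7.1):

```python
# Confidence-interval arithmetic for the corn yield data (illustrative Python;
# the notes do this in R). Cell means come from Table 7.1.
from math import sqrt

cell = {"a1b1": 30, "a1b2": 36, "a2b1": 32, "a2b2": 44}  # treatment means
ms_res = 365 / 17        # residual mean square from the no-interaction fit
t_crit = 2.1098          # t quantile: 17 df, upper tail area 0.025 (hardcoded)

margin = t_crit * sqrt(ms_res * (1 / 10 + 1 / 10))
diff_A = (cell["a1b1"] + cell["a1b2"]) / 2 - (cell["a2b1"] + cell["a2b2"]) / 2
diff_B = (cell["a1b1"] + cell["a2b1"]) / 2 - (cell["a1b2"] + cell["a2b2"]) / 2

ci_A = (diff_A - margin, diff_A + margin)  # about (-9.37, -0.63)
ci_B = (diff_B - margin, diff_B + margin)  # about (-13.37, -4.63)
print(ci_A, ci_B)
```

Up to rounding, these agree with the intervals reported above.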


REGRESSION: In Example 7.2, there were two levels of nitrogen (a1 = 10 and a2 = 15)
and two levels of phosphorus (b1 = 2 and b2 = 4) used in the experiment. These levels,
which we generically called "low" and "high" when analyzing the data using ANOVA,
are actually numerical in nature (measured in pounds per plot). In this light, there is
nothing to prevent us from fitting the following multiple linear regression model
using the numerical values of nitrogen and phosphorus:

Y_i = β0 + β1 x_i1 + β2 x_i2 + ε_i,

where the independent variables are

x1 = nitrogen amount (10 or 15 pounds)
x2 = phosphorus amount (2 or 4 pounds).

Doing so in R gives the following output:

> fit = lm(yield ~ nitrogen + phosphorus)

> summary(fit)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 9.5000 6.1297 1.550 0.139599

nitrogen 1.0000 0.4144 2.413 0.027403 *

phosphorus 4.5000 1.0361 4.343 0.000442 ***

Residual standard error: 4.634 on 17 degrees of freedom

Multiple R-squared: 0.5922, Adjusted R-squared: 0.5442

F-statistic: 12.34 on 2 and 17 DF, p-value: 0.0004886

This output gives the values of the least squares estimates

b = (b0, b1, b2)' = (9.5, 1.0, 4.5)'.


Therefore, the fitted least squares regression model for the corn yield data is

Yhat = 9.5 + 1.0x1 + 4.5x2,

or, in other words,

estimated yield = 9.5 + 1.0(nitrogen) + 4.5(phosphorus).

This equation can subsequently be used to make predictions about future yields based on
given values of nitrogen and phosphorus. In doing so, be careful about extrapolation;
for example, you would not want to make a prediction when x1 = 25 and x2 = 10. These
values are not representative of those used in the actual experiment, so this model may
not be a good description of yield for these values of nitrogen and phosphorus.
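For illustration, here is the fitted equation as a small Python function (coefficients copied from the R output above; the function name is ours, not the notes'):

```python
# Illustrative Python version of the fitted corn-yield model
# Yhat = 9.5 + 1.0*x1 + 4.5*x2 (bushels/plot), x1 = nitrogen, x2 = phosphorus.
def predicted_yield(nitrogen, phosphorus):
    return 9.5 + 1.0 * nitrogen + 4.5 * phosphorus

print(predicted_yield(10, 2))  # 28.5, near the a1b1 sample mean of 30
print(predicted_yield(15, 4))  # 42.5, near the a2b2 sample mean of 44
```

Both inputs above lie inside the range of the design, so no extrapolation is involved.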

INTERESTING: I have below constructed the analysis of variance table for the multiple
linear regression fit:

> anova(fit)

Analysis of Variance Table

Response: yield

Df Sum Sq Mean Sq F value Pr(>F)

nitrogen 1 125 125.00 5.8219 0.027403 *

phosphorus 1 405 405.00 18.8630 0.000442 ***

Residuals 17 365 21.47

You will note that this table is identical to the two-way ANOVA table (without inter-
action) on pp 182 (notes). This is no coincidence! In fact, the two-way ANOVA model
(without interaction) and the multiple linear regression model

Y_i = β0 + β1 x_i1 + β2 x_i2 + ε_i

are actually identical models. Therefore, fitting each one gives the same analysis and the
same conclusions.


7.3 Example: A 2^4 experiment without replication

Example 7.3. A chemical product is produced in a pressure vessel. A factorial exper-
iment is carried out to study the factors thought to influence the filtration rate of this
product. The four factors are temperature (A), pressure (B), concentration of formalde-
hyde (C), and stirring rate (D). Each factor is present at two levels (e.g., "low" and
"high"). A 2^4 experiment is performed with one replication; the data are shown below.

       Factor                     Filtration rate
Run    A   B   C   D   Run label  (Y, gal/hr)
 1     -   -   -   -   a1b1c1d1    45
 2     +   -   -   -   a2b1c1d1    71
 3     -   +   -   -   a1b2c1d1    48
 4     +   +   -   -   a2b2c1d1    65
 5     -   -   +   -   a1b1c2d1    68
 6     +   -   +   -   a2b1c2d1    60
 7     -   +   +   -   a1b2c2d1    80
 8     +   +   +   -   a2b2c2d1    65
 9     -   -   -   +   a1b1c1d2    43
10     +   -   -   +   a2b1c1d2   100
11     -   +   -   +   a1b2c1d2    45
12     +   +   -   +   a2b2c1d2   104
13     -   -   +   +   a1b1c2d2    75
14     +   -   +   +   a2b1c2d2    86
15     -   +   +   +   a1b2c2d2    70
16     +   +   +   +   a2b2c2d2    96

NOTATION: When discussing factorial experiments, it is common to use the symbol
"-" to denote the low level of a factor and the symbol "+" to denote the high level. For
example, the first row of the table above indicates that each factor (A, B, C, and D) is
run at its low level. The response Y for this run is 45 gal/hr.


NOTE: In this experiment, there are k = 4 factors, so there are 15 effects to estimate:

- the 4 main effects: A, B, C, and D
- the 6 two-way interactions: AB, AC, AD, BC, BD, and CD
- the 4 three-way interactions: ABC, ABD, ACD, and BCD
- the 1 four-way interaction: ABCD.

In this 2^4 experiment, we have 16 values of Y and 15 effects to estimate. This means that
only one Y observation is left, but this observation is used to estimate the overall mean.
This leaves us with no observations (and therefore no degrees of freedom) to perform
statistical tests. This is an obvious problem! Why? Because we have no way to judge
which main effects are significant, and we cannot learn about how these factors interact.

TERMINOLOGY: A single replicate of a 2^k factorial experiment is called an unrepli-
cated factorial. With only one replicate, as in Example 7.3, there is no internal error
estimate, so we cannot perform statistical tests to judge significance. What do we do?

- One approach to the analysis of an unreplicated factorial is to assume that certain
higher-order interactions are negligible and then combine their mean squares to
estimate the error.
- This is an appeal to the sparsity of effects principle; that is, most systems
are dominated by some of the main effects and low-order interactions, and most
high-order interactions are negligible.
- To learn about which effects may be negligible, we can fit the full ANOVA model
and obtain the SS attached to each of these 15 effects (see the output below).
- Effects with large SS can be retained. Effects with small SS can be discarded.
A smaller model with only the large effects can then be fit. This smaller model
will have an error estimate formed by taking all of the effects with small SS and
combining them together.


ANALYSIS: Here is the R output summarizing the fit of the full model:

> # Fit full model

> fit = lm(filtration ~ A*B*C*D)

> anova(fit)

Analysis of Variance Table

Df Sum Sq Mean Sq F value Pr(>F)

A 1 1870.56 1870.56

B 1 39.06 39.06

C 1 390.06 390.06

D 1 855.56 855.56

A:B 1 0.06 0.06

A:C 1 1314.06 1314.06

B:C 1 22.56 22.56

A:D 1 1105.56 1105.56

B:D 1 0.56 0.56

C:D 1 5.06 5.06

A:B:C 1 14.06 14.06

A:B:D 1 68.06 68.06

A:C:D 1 10.56 10.56

B:C:D 1 27.56 27.56

A:B:C:D 1 7.56 7.56

Residuals 0 0.00

Warning message: ANOVA F-tests on an essentially perfect fit are unreliable

NOTE: From this table, it is easy to see that the effects A, C, D, AC, and AD are far
more significant than the others. For example, the smallest SS in this set is 390.06
(Factor C), which is over 5 times larger than the largest remaining SS (68.06). As a next
step, we therefore consider fitting a smaller model with these 5 effects only. This will
free up 10 degrees of freedom that can be used to estimate the error variance.
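The "keep the large SS" screening step can be sketched numerically. Here is an illustrative Python snippet (not part of the notes; the SS values are copied from the full-model R output above):

```python
# Screening step sketch: rank the 15 effect sums of squares from the full 2^4
# fit and keep the dominant ones (SS values from the R output above).
ss = {"A": 1870.56, "B": 39.06, "C": 390.06, "D": 855.56,
      "AB": 0.06, "AC": 1314.06, "BC": 22.56, "AD": 1105.56,
      "BD": 0.56, "CD": 5.06, "ABC": 14.06, "ABD": 68.06,
      "ACD": 10.56, "BCD": 27.56, "ABCD": 7.56}

ranked = sorted(ss, key=ss.get, reverse=True)
print(ranked[:5])                        # ['A', 'AC', 'AD', 'D', 'C']
pooled = sum(ss[e] for e in ranked[5:])  # SS pooled into the error estimate
print(pooled)                            # about 195.1
```

The pooled SS of the ten discarded effects (about 195.1) matches, up to rounding, the residual SS of 195.13 in the smaller-model fit below.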

ANALYSIS: Here is the R output summarizing the fit of the smaller model
that includes only A, C, D, AC, and AD:


Figure 7.5: Left: Interaction plot for temperature (Factor A) and concentration of formaldehyde (Factor C). Right: Interaction plot for temperature (Factor A) and stirring rate (Factor D).

> # Fit smaller model

> fit = lm(filtration ~ A + C + D + A:C + A:D)

> anova(fit)

Analysis of Variance Table

Df Sum Sq Mean Sq F value Pr(>F)

A 1 1870.56 1870.56 95.865 1.928e-06 ***

C 1 390.06 390.06 19.990 0.001195 **

D 1 855.56 855.56 43.847 5.915e-05 ***

A:C 1 1314.06 1314.06 67.345 9.414e-06 ***

A:D 1 1105.56 1105.56 56.659 1.999e-05 ***

Residuals 10 195.13 19.51

NOTE: It is clear that these five effects are each significant (note that the p-values are
all very close to zero). Interaction plots for temperature (Factor A) and concentration of
formaldehyde (Factor C) and for temperature (Factor A) and stirring rate (Factor D) are in
Figure 7.5. These plots depict the strong pairwise interaction that exists.


REGRESSION: In Example 7.3, there were no numerical values attached to the levels
of temperature (Factor A), concentration of formaldehyde (Factor C), and stirring rate
(Factor D). Therefore, if we wanted to fit a regression model (e.g., for prediction purposes,
etc.), we can use the following variables with arbitrary numerical codings assigned:

x1 = temperature (-1 = low; 1 = high)
x2 = concentration of formaldehyde (-1 = low; 1 = high)
x3 = stirring rate (-1 = low; 1 = high).

With these values, we can fit the multiple linear regression model

Y_i = β0 + β1 x_i1 + β2 x_i2 + β3 x_i3 + β4 x_i1 x_i2 + β5 x_i1 x_i3 + ε_i.

Doing so in R gives the following output:

> fit = lm(filtration ~ temp + conc + stir.rate + temp:conc + temp:stir.rate)

> summary(fit)

Estimate Std. Error t value Pr(>|t|)

(Intercept) 70.062 1.104 63.444 2.30e-14 ***

temp 10.812 1.104 9.791 1.93e-06 ***

conc 4.938 1.104 4.471 0.00120 **

stir.rate 7.313 1.104 6.622 5.92e-05 ***

temp:conc -9.062 1.104 -8.206 9.41e-06 ***

temp:stir.rate 8.312 1.104 7.527 2.00e-05 ***

Residual standard error: 4.417 on 10 degrees of freedom

Multiple R-squared: 0.966, Adjusted R-squared: 0.9489

F-statistic: 56.74 on 5 and 10 DF, p-value: 5.14e-07

This output gives the values of the least squares estimates

b = (b0, b1, b2, b3, b4, b5)' = (70.062, 10.812, 4.938, 7.313, -9.062, 8.312)'.


Therefore, the fitted least squares regression model for the filtration rate data is

Yhat = 70.062 + 10.812x1 + 4.938x2 + 7.313x3 - 9.062x1x2 + 8.312x1x3,

or, in other words,

estimated filtration rate = 70.062 + 10.812(temperature) + 4.938(concentration)
+ 7.313(stirring rate) - 9.062(temperature x concentration)
+ 8.312(temperature x stirring rate),

with each variable coded as -1 (low) or 1 (high). This fitted regression model can be used
to write confidence intervals or prediction intervals for future values of temperature,
concentration, and stirring rate.
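For illustration, here is the fitted equation as a small Python function (coefficients copied from the R output above; the function name is ours, not the notes'):

```python
# Illustrative Python version of the fitted filtration-rate model.
def predicted_filtration(x1, x2, x3):
    """x1 = temperature, x2 = concentration, x3 = stirring rate,
    each coded -1 (low) or +1 (high); returns gal/hr."""
    return (70.062 + 10.812 * x1 + 4.938 * x2 + 7.313 * x3
            - 9.062 * x1 * x2 + 8.312 * x1 * x3)

print(predicted_filtration(-1, -1, -1))  # about 46.25 (observed run 1: 45)
print(predicted_filtration(+1, +1, +1))  # about 92.38 (observed run 16: 96)
```

The fitted values differ a little from the observed runs because the discarded small effects have been pooled into error.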

ANOVA TABLE: I have constructed the analysis of variance table for the multiple linear
regression fit in this example:

> anova(fit)

Analysis of Variance Table

Df Sum Sq Mean Sq F value Pr(>F)

temp 1 1870.56 1870.56 95.865 1.928e-06 ***

conc 1 390.06 390.06 19.990 0.001195 **

stir.rate 1 855.56 855.56 43.847 5.915e-05 ***

temp:conc 1 1314.06 1314.06 67.345 9.414e-06 ***

temp:stir.rate 1 1105.56 1105.56 56.659 1.999e-05 ***

Residuals 10 195.13 19.51

Note that this table is identical to the ANOVA table on pp 189 (notes).

REMARK: In this chapter, we have only just scratched the surface when it comes
to discussing factorial treatment structures. Specialized courses in experimental design,
such as STAT 506, would delve into more advanced designs and analysis techniques.
For example, a design that often arises in industrial experiments is that of running
a 2^k factorial experiment in fewer than 2^k runs; these are called fractional factorial
experiments. In these experiments, the engineer acknowledges a priori that the highest-
order interactions are negligible and the goal is to assess main effects and lower-order
interactions only. For those who are interested, Section 7.3 (VK) introduces this concept.
An illustrative example is Example 7.7 on pp 499 (VK).

