
The table below is taken from The Statistical Sleuth by Ramsey and Schafer. We will discuss it in class.

                         Allocation of Units to Groups

Selection          By Randomization             Not by Randomization
of Units
---------------------------------------------------------------------------------------
At Random          A random sample is           Random samples are        Inferences to
                   selected from one            selected from existing    the populations
                   population; units are        distinct populations.     can be drawn.
                   then randomly assigned
                   to different treatment
                   groups.

Not at Random      A group of study             Collections of
                   units is found;              available units from
                   units are then               distinct groups are
                   randomly assigned            examined.
                   to treatment groups.

                   Causal inferences
                   can be drawn.
3.1 Observational versus Randomized Studies

We will begin by considering examples from each cell in the above table. First, we will consider units

that are subjects (distinct individuals). Notice that I am deliberately not defining the response or, if applicable, treatments.

• Upper left: The units are high school freshmen and the population is all high school freshmen in

Wisconsin. A random sample of freshmen is selected from this population. Once the sample is

obtained, the students are divided into treatment groups by randomization.

• Upper right: All high schools in Wisconsin are classified as public or private. The high school

freshmen at these two types of schools form the two populations. Independent random samples of

freshmen are selected from each population.

• Lower left: All freshmen at Sun Prairie high school are selected for study. The students are divided into treatment groups by randomization.


• Lower right: Freshmen at Sun Prairie high school are compared to freshmen at Edgewood high

school.

Next, I will consider units that are trials.

• Upper left: A golfer wants to compare two drivers. The trials, individual shots, are assigned to driver

by randomization; we get random samples by assuming a spinner model for each driver.

• Upper right: A golfer has one driver and he wants to compare his ability playing at sea level versus

playing at an altitude of 5,000 feet. We get random samples by assuming a spinner model at each site.

• Lower left: Same as upper left, but we no longer assume spinner models.

• Lower right: Same as upper right, but we no longer assume spinner models.

Before we get into inference procedures, formulas for tests and estimation, I want to introduce some

issues of scientific importance.

On each unit (subject or trial) we plan to obtain a response. Typically, the response exhibits some

variation as we move from unit to unit. (If there is no variation, we will not need to do Statistics.) We

invent the notion of factors as the source of the variation. For example, in its last five basketball games, the

Milwaukee Bucks scored 102, 93, 102, 104 and 91 points. Possible factors include strength of opponent,

location of game, and length of time since the previous game. Note the following. These are natural factors

for a basketball fan to suggest. If you know nothing about basketball, then you will be ill-equipped to

speculate on the identity of factors. As we will see later in these notes, if a scientist is bad at suggesting

factors, then he/she will likely not learn much from collecting and analyzing data.

From the list of possible factors, the researcher chooses one for special status; it is called the study

factor. (In regression and ANOVA, the researcher may choose to have several study factors.) For example,

for the Bucks I might choose location of game as my study factor. Next, I must specify the levels of the study

factor. If there are two levels, then we have the “two sample problem” which is the title of this chapter. As

a result, I might choose my levels to be “home” and “away.” Note that this is not the only possible choice. I

could use the four time zones to be the levels, or the 28 arenas (if I am correct in my belief that the Clippers

and Lakers share an arena; 29 if I am incorrect). If the researcher selects more than two levels and the levels

are on an interval or ratio scale, then methods of regression might be used.

After specifying the study factor(s), all other factors are collectively referred to as background factors.

I want to mention two ways to handle background factors. (This is not an exhaustive list.)

First, you can block on a background factor. In the basketball example, I could block on the opponent.

To keep this simple, I will focus on eight games with four opponents, as shown below.

Opponent       Home     Away    H − A
Dallas          102       97        5
Denver          116      113        3
Toronto         104      102        2
Washington       99      100       −1
Mean         105.25   103.00     2.25
SD             7.46     6.98     2.50

Looking ahead a bit, if the home and away were independent random samples, then the following analysis

could be appropriate. (I have placed the Home scores in C1, the Away scores in C2, and Home − Away in

C3.)


MTB > twos c1 c2;
SUBC> pool.

TWOSAMPLE T FOR C1 VS C2

N MEAN STDEV SE MEAN

C1 4 105.25 7.46 3.7

C2 4 103.00 6.98 3.5

But it is more appropriate to analyze the differences as a one sample problem, as with the following analysis.

MTB > ttest c3

         N    MEAN   STDEV  SE MEAN      T   P VALUE
C3       4    2.25    2.50     1.25   1.80      0.17

         N    MEAN   STDEV  SE MEAN   95.0 PERCENT C.I.
C3       4    2.25    2.50     1.25   ( -1.73,  6.23)

Notice that for the differences, the P-value is much smaller and the confidence interval is much narrower.
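Since the Minitab output above is abridged, here is a quick Python sketch of the paired (one-sample) analysis of the four differences. The critical value 3.182 (the t quantile with 3 degrees of freedom) is hard-coded rather than looked up; this is illustrative, not Minitab's implementation.

```python
from math import sqrt
from statistics import mean, stdev

home = [102, 116, 104, 99]
away = [97, 113, 102, 100]
diffs = [h - a for h, a in zip(home, away)]  # [5, 3, 2, -1]

n = len(diffs)
d_bar = mean(diffs)          # 2.25
s_d = stdev(diffs)           # 2.50
se = s_d / sqrt(n)           # 1.25
t = d_bar / se               # 1.80

# 95% CI uses the t quantile with n - 1 = 3 d.f.; 3.182 is t_{.975, 3}.
t_crit = 3.182
ci = (d_bar - t_crit * se, d_bar + t_crit * se)  # about (-1.73, 6.23)
print(t, ci)
```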

Blocking is not always effective. For example, the Bucks’ data above were selected deliberately for illustration only. There are 14 teams that the Bucks have played both home and away thus far. The analysis of all the data is very different from that above.

MTB > twos c1 c2;
SUBC> pool.

TWOSAMPLE T FOR C1 VS C2

N MEAN STDEV SE MEAN

C1 14 94.60 10.20 2.73

C2 14 101.71 7.16 1.91

The analysis with blocking is below.

MTB > ttest c3


TEST OF MU = 0 VS MU N.E. 0

N MEAN STDEV SE MEAN T P VALUE

C3 14 -7.07 13.04 3.49 -2.03 0.063

         N    MEAN   STDEV   95.0 PERCENT C.I.
C3      14   -7.07   13.04   (-14.60,   0.46)

Note from above that the independent-samples analysis gives a smaller P-value and a narrower confidence interval than the blocked (paired data) analysis. (The correct analysis is the one with blocking; we will discuss this further in lecture.)

It is my opinion that novice statisticians frequently are overly optimistic about the value of blocking. Unless you are pretty certain that the factor has a big impact on the response, it is usually better not to block.

A second way to deal with a background factor is by controlling for it. This means that you keep the

value of the factor constant throughout the study, or at least for the data you analyze. In a study of her cat’s

consumption of two flavors of treats, Dawn (a former student) controlled the cat’s intake of other food, and

tried to control his activity level. In addition, she presented either treat at the same time each day, in an

attempt to control for time of day effect as well as the cat’s general level of hunger.

After blocking and controlling, if either or both of these are used, there are still lots of background

factors. If units are assigned to study factor by randomization, there is some reason to believe that the

effects of these background factors will be “balanced” between the levels of the study factor (this notion can

be made more precise). But if units are associated with the level of a study factor, it is very possible that the

background factors will severely bias the study. Some examples follow.

1. Yesterday I heard a talk on the effects of “coaching” for the SAT. The subjects are students. The

response is change in verbal score from PSAT to SAT. The study factor is coaching with levels “yes”

and “no.” What are some possible background factors?

2. The subjects are people. The response is whether the person develops a particular disease of interest.

The study factor is smoking, with levels “yes” and “no.” What are some possible background factors?

3. The subjects are men. The response is whether the man develops a particular disease of interest.

The study factor is whether the man has had a vasectomy, with levels “yes” and “no.” What are some

possible background factors?

The following hypothetical example illustrates the possible effect of a background factor.

A company with 200 employees decides it must reduce its work force by one-half. The following table

reveals the relationship between gender and outcome.

                      Outcome
Gender     Released   Not released   Total      p̂
Female           60             40     100    0.60
Male             40             60     100    0.40
Total           100            100     200

Now suppose that the value of a background factor, job type, is available for each person. One could

take the data above and stratify it according to job type, as I have done below.

3.1. OBSERVATIONAL VERSUS RANDOMIZED STUDIES 41

Job A
                      Outcome
Gender     Released   Not released   Total      p̂
Female           56             24      80    0.70
Male             16              4      20    0.80
Total            72             28     100

Job B
                      Outcome
Gender     Released   Not released   Total      p̂
Female            4             16      20    0.20
Male             24             56      80    0.30
Total            28             72     100

Note that in the original table, the female release rate is 0.20 larger than the male release rate, but in

each component table (i.e. for each job) the female release rate is 0.10 smaller than the male release rate!

This consistent (across component tables) reversal of the direction of the relationship is called Simpson’s

Paradox.
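The reversal can be checked directly by aggregating the two job tables. A small Python sketch; the dictionary layout is mine, not from any text.

```python
# (released, not released) counts by gender, stratified by job type.
tables = {
    "Job A": {"Female": (56, 24), "Male": (16, 4)},
    "Job B": {"Female": (4, 16), "Male": (24, 56)},
}

def rate(released, not_released):
    """Proportion released."""
    return released / (released + not_released)

# Within each job, the female rate is 0.10 smaller than the male rate...
for job in tables:
    for gender in ("Female", "Male"):
        print(job, gender, rate(*tables[job][gender]))

# ...but aggregated over jobs, the female rate is 0.20 larger.
for gender in ("Female", "Male"):
    rel = sum(tables[j][gender][0] for j in tables)
    not_rel = sum(tables[j][gender][1] for j in tables)
    print(gender, "overall:", rate(rel, not_rel))
```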

We can gain insight into the “why” behind Simpson’s Paradox by examining the following two tables.

                Job                                       Outcome
Gender        A      B   Total           Job    Released   Not released   Total
Female       80     20     100           A            72             28     100
Male         20     80     100           B            28             72     100
Total       100    100     200           Total       100            100     200

The background factor (job) is statistically related to the study factor (gender) and response (outcome).

If the background factor fails to be statistically related to either the study factor or response, then Simpson’s

Paradox will not occur. (This issue will be addressed in a future homework assignment.)

If a background factor is strongly (statistically) related to the response, then you probably want to block

on it or control it. If a background factor is strongly (statistically) related to the study factor, then it will be

difficult to separate statistically the effect of the study factor from the effect of the background factor.

There is a sampling issue for observational studies that I want to address. Years ago I saw a variation of

the following example in a really bad introductory Statistics book. Each person in a population of college

students can be assigned a value on each of two dichotomous variables. The first is GPA: high (A) or low (Ac); the second is whether the person smokes tobacco (B) or not (Bc). We can imagine a table of

population counts (I will follow the notation in Wardrop, Chapter 8).

             B        Bc     Total
A          NAB      NABc       NA
Ac        NAcB     NAcBc      NAc
Total       NB       NBc        N

There are several ways to view this table. You can view it as a single population with two dichotomous

variables per person (as I have done above). In this case, inference would focus on estimating probabilities

and conditional probabilities. Secondly, you could view smoking status as the response and GPA as the

study factor, with levels high and low. This means that we have two distinct populations—high and low

GPA. Inference would focus on the proportion of smokers in each GPA population. Thirdly, we can reverse

the roles of smoking and GPA. This gives two distinct populations—smokers and nonsmokers. Inference


would focus on the proportion of high GPA in each smoking group. (The bad book thought that this last

perspective was the only one possible and compounded its error by suggesting a causal link—smoking

leads to bad grades! One could just as easily argue that anxiety over low grades leads to smoking or that

a background factor, time spent partying, is such that a large amount of time spent partying is linked to

smoking and low grades.)

A critical point that is often overlooked is the importance of how a sample is selected. Let us imagine

three possible sampling schemes. We can take a random sample from the overall population of college

students; we can take independent random samples from the populations of smokers and nonsmokers; or

we can take independent random samples from the populations of high and low GPA. Suppose that the

population counts and population proportions are given by the following tables.

Population counts                          Population proportions

        Smoker?                                    Smoker?
GPA     Yes      No     Total              GPA      Yes      No    Total
High    600    4400      5000              High    0.06    0.44     0.50
Low    1400    3600      5000              Low     0.14    0.36     0.50
Total  2000    8000     10000              Total   0.20    0.80     1.00

The tables of conditional probabilities of smoking status given GPA and GPA given smoking status are

below.

Smoking status given GPA                   GPA given smoking status

        Smoker?                                    Smoker?
GPA     Yes      No    Total               GPA      Yes      No
High   0.12    0.88     1.00               High    0.30    0.55
Low    0.28    0.72     1.00               Low     0.70    0.45
                                           Total   1.00    1.00

Note that 0.12 and 0.28 are p1 and p2 for the perspective of smoking being the response, and that 0.30 and

0.55 are p1 and p2 for the perspective of GPA being the response.

I will consider three ways to sample. First, suppose we select a random sample (with replacement) of

size 1000 from the overall population. I did this on my computer and obtained the data below.

Smoker?

GPA Yes No Total

High 60 437 497

Low 137 366 503

Total 197 803 1000

Next, I used these data to estimate the three tables above; the population proportions and the two tables of

conditional probabilities. The results are below.

Estimated population proportions:

        Smoker?
GPA     Yes       No     Total
High   0.060   0.437     0.497
Low    0.137   0.366     0.503
Total  0.197   0.803     1.000

Estimated conditional probabilities of smoking status given GPA:

        Smoker?
GPA     Yes       No     Total
High   0.121   0.879     1.000
Low    0.272   0.728     1.000

Estimated conditional probabilities of GPA given smoking status:

        Smoker?
GPA     Yes       No
High   0.305   0.544
Low    0.695   0.456
Total  1.000   1.000

By inspection, all estimates are quite close to the population proportions. As the comparisons based on the

last two tables suggest, if we have a random sample from the overall population, it is valid to pretend we

have either: (a) independent random samples from the high and low GPA populations, or (b) independent

random samples from the smoking and nonsmoking populations.

Second, suppose that I select independent random samples (with replacement) of size 500 each from the

smoking and nonsmoking populations. I did this on my computer and obtained the results shown below.


Smoker?

GPA Yes No Total

High 157 288 445

Low 343 212 555

Total 500 500 1000

The estimate of the proportion with high GPA among smokers is 157/500 = 0.314, and among nonsmokers it is 288/500 = 0.576. These numbers are reasonably close to the population proportions, 0.30 and 0.55, respectively.

But now suppose that we pretend we have independent random samples from the GPA populations; what

happens? Our estimate of smoking given high GPA is 157/445 = 0.353, which is considerably larger than

0.12, the population proportion. And our estimate of smoking given low GPA is 343/555 = 0.618, which

is considerably larger than 0.28, the population proportion.

As a result, we conclude that it is improper to pretend we have independent random samples from the

GPA populations. The reason for the strong bias shown above is simple. By taking equal sample sizes from

each smoking population, we are grossly oversampling the smokers in the overall population, and also in

the two GPA populations.
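The size of this bias can be anticipated without simulating. Here is a Python sketch that uses expected (rather than simulated) counts under the equal-sample-sizes plan; the variable names are mine.

```python
# Population conditional proportions from the tables above.
p_high_given_smoker = 0.30
p_high_given_nonsmoker = 0.55
p_smoker_given_high_true = 0.12   # true value in the overall population

# Expected counts when 500 units are sampled from each smoking group.
n_smoke, n_nonsmoke = 500, 500
high_smoke = p_high_given_smoker * n_smoke           # 150 expected
high_nonsmoke = p_high_given_nonsmoker * n_nonsmoke  # 275 expected

# Pretending these form a random sample of the high-GPA population
# gives a badly biased estimate of the smoking rate among high GPA:
pretend = high_smoke / (high_smoke + high_nonsmoke)  # about 0.35, not 0.12
print(pretend, p_smoker_given_high_true)
```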

Suppose, however, that I had selected samples of size 200 from the smokers and 800 from the nonsmokers. (These sample sizes match the proportions of smokers and nonsmokers in the overall population.) I did

this and obtained the data below.

Smoker?

GPA Yes No Total

High 62 434 496

Low 138 366 504

Total 200 800 1000

If one divides each of the values in the table by 1000, one obtains a very good estimate of the table of

population proportions. This is a general rule: if we sample from subpopulations in proportion to occurrence

in the overall population, then it is ok to pretend we have a random sample from the overall population.

Before returning to the two sample problem, I want to digress into a common error on sampling. The

point is that it is very important to be careful about units.

Suppose that a small college has a freshman class of 500 students. Each student enrolls in five courses,

as detailed below.

The college calculates that there are 2500 students in the 91 sections offered, for a mean of 27.5 students

per section. A rival college reports that for every student, the mean class size is 136. Both computations are

correct. What do you think?
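As a hint, the two computations weight sections differently: the per-student mean counts each section once per enrolled student, so large sections dominate. A Python sketch with made-up section sizes (the college's actual schedule is not reproduced here):

```python
# Hypothetical section sizes, chosen only to illustrate the effect:
# one large lecture plus several small sections.
sizes = [100, 10, 10]

# Per-section mean: each section counts once.
per_section = sum(sizes) / len(sizes)                 # 40.0

# Per-student mean: each section is weighted by its enrollment,
# because every one of its students reports that class size.
per_student = sum(s * s for s in sizes) / sum(sizes)  # 85.0
print(per_section, per_student)
```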


Later in this chapter we will consider dependent samples, which arise from pairing.

3.2 Dichotomous Response, Independent Samples

Data from studies of this section can be presented in the following manner.

                 Variable 2
Variable 1      B     Bc   Total
A               a      b      n1
Ac              c      d      n2
Total          m1     m2       n

This table is meant to be very general. It can be used for sampling from one population with two dichoto-

mous responses per unit (remember the GPA and smoking example earlier). This table can be used for

independent random samples from two populations. Finally, it can be used with a study with randomization.

At some point in the analysis I usually view the one population, two responses problem as a problem on conditional probabilities, so I will modify the above table to the following form, which I find easier to understand.

Study          Response
factor        S      F   Total
Level 1       a      b      n1
Level 2       c      d      n2
Total        m1     m2       n

In order to analyze such data, statisticians typically begin by arguing that the marginal totals can (or

should) be viewed as fixed numbers. This can be a bit of a stretch, so some discussion is merited.

For the one population, two responses model, only the value n is fixed in advance by the researcher; all

other entries in the table are the observed values of random variables. The statistician then argues that one

should perform analysis after conditioning on the other marginal totals. Here is an abridged version of the

argument statisticians give. Suppose that we have the following marginal totals.

                 Variable 2
Variable 1      B     Bc   Total
A               a      b      60
Ac              c      d      40
Total          70     30     100

Only the total number of units, n = 100, is fixed by the sampling plan. But what do we learn from the

other totals? Well, we get evidence that B is much more common than Bc, and evidence that A is somewhat

more common than Ac . But we don’t learn anything about a relationship between A and B, which is, after

all, the primary purpose of the investigation. Note that with the above margins, we could have a = 60

which would provide evidence of a very strong positive association between A and B; or we could have

a = 30 which would provide evidence of a very strong negative association between A and B; or we could

have a = 42 which would provide no evidence of an association between A and B. In short, knowledge of

the marginal totals does not provide the researcher with evidence of the strength or direction of association

between A and B; hence, it probably won’t hurt to condition on the margins. Plus there is the added bonus

that conditioning on the margins makes the math much easier.

In the table below, define p̂1 = a/n1, q̂1 = b/n1, p̂2 = c/n2, and q̂2 = d/n2.

Study          Response
factor        S      F   Total
Level 1       a      b      n1
Level 2       c      d      n2
Total        m1     m2       n

The confidence interval for p1 − p2 is

    p̂1 − p̂2 ± z √( p̂1 q̂1 / n1 + p̂2 q̂2 / n2 ).        (3.1)

This is an approximate interval, based on using a normal curve approximation. Minitab will not evaluate

this formula for us.
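Since Minitab will not evaluate formula (3.1), a short Python function can. This is a sketch of the normal-approximation interval, not a Minitab feature; the function name is mine.

```python
from math import sqrt
from statistics import NormalDist

def ci_diff_props(a, b, c, d, conf=0.95):
    """Approximate CI for p1 - p2 from a 2x2 table, as in formula (3.1)."""
    n1, n2 = a + b, c + d
    p1, p2 = a / n1, c / n2
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # 1.96 for 95%
    half = z * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) - half, (p1 - p2) + half

# The Wardrop exercise data used later in this section:
print(ci_diff_props(46, 66, 30, 99))
```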

For hypothesis testing, there are several possible approaches. The null hypothesis is p1 = p2; there are three possible alternatives, obtained by replacing ‘=’ in the null hypothesis by >, <, or ≠. An exact P-value

can be obtained by using the hypergeometric distribution. This distribution is not in Minitab. (See me if

you want a macro for Version 9.) Approximate probabilities can be obtained by a normal or chi-squared

approximation. The normal approximation can be written two ways. First, as z = x/σ, where

    x = p̂1 − p̂2   and   σ = √( m1 m2 / ( n1 n2 (n − 1) ) ),

so that

    z = √(n − 1) (ad − bc) / √( n1 n2 m1 m2 ).

Some people modify z slightly and use

    z′ = √n (ad − bc) / √( n1 n2 m1 m2 ).

Finally, others use χ² = (z′)². If z or z′ is used, P-values are obtained from the standard normal curve. Strictly, χ² can be used only for the alternative ≠, and the P-value is obtained by using the chi-squared curve with one degree of freedom. Minitab presents the analysis only for χ².

Exercise 2 on page 252 of Wardrop presents the following data.

Study Response

factor S F Total

Level 1 46 66 112

Level 2 30 99 129

Total 76 165 241

Below is a Minitab analysis of these data.

MTB > read c1 c2
DATA> 46 66
DATA> 30 99
DATA> end
MTB > chis c1 c2


           C1       C2   Total
  1        46       66     112
        35.32    76.68

  2        30       99     129
        40.68    88.32

Total      76      165     241

ChiSq = 3.229 + 1.488 +
        2.804 + 1.292 = 8.813
df = 1

MTB > cdf 8.813;
SUBC> chis 1.
    8.8130    0.9970
MTB > subt 0.9970 1 k1
ANSWER = 0.0030
MTB > let k2=sqrt(8.813)
MTB > cdf k2 k3
MTB > subt k3 1 k4
ANSWER = 0.0015
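The same chi-squared computation can be checked in Python. This sketch recomputes the expected counts and uses the fact that, with one degree of freedom, the chi-squared tail area equals a two-sided standard normal tail area at √χ².

```python
from math import sqrt
from statistics import NormalDist

obs = [[46, 66], [30, 99]]
row = [sum(r) for r in obs]        # 112, 129
col = [sum(c) for c in zip(*obs)]  # 76, 165
n = sum(row)                       # 241

# Sum of (observed - expected)^2 / expected over the four cells.
chisq = sum((obs[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
            for i in range(2) for j in range(2))          # 8.813

# With 1 d.f.: P-value = 2 * (one-sided normal tail at sqrt(chisq)).
p_two_sided = 2 * (1 - NormalDist().cdf(sqrt(chisq)))     # about 0.0030
print(chisq, p_two_sided)
```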

3.3 Numerical Response, Independent Samples

I am “jumping over” multicategory responses, ordered or not, and proceeding to numerical responses. We will return to multicategory responses soon.

The most commonly used procedures compare the populations by comparing their means. For reference, see Sections 16.1 and 16.2 of Wardrop. In fact, the presentation below is a compression of the ideas in Section 16.2.

We assume that we have independent random samples from two populations. The first population has (unknown) mean µ1 and standard deviation σ1. The second population has (unknown) mean µ2 and standard deviation σ2. The data from the first population are denoted

    y1,1 , y1,2 , y1,3 , . . . , y1,n1 ,

and are summarized by their mean ȳ1· and standard deviation s1. Similarly, data from the second population are denoted

    y2,1 , y2,2 , y2,3 , . . . , y2,n2 ,

and are summarized by their mean ȳ2· and standard deviation s2 . (Most authors suppress the comma in the

subscript, and I might forget and do that too on occasion. I like the commas for later work when we need to

know whether, for example y111 is the 11th observation from the first sample or the first observation from

the 11th sample.)

Attention focuses on the difference µ1 − µ2, which is estimated by ȳ1· − ȳ2·. For inference, we will need the sampling distribution of this estimate. Some basic results in mathematical statistics indicate that

    W = [ (Ȳ1· − Ȳ2·) − (µ1 − µ2) ] / √( σ1²/n1 + σ2²/n2 )

has a standard normal sampling distribution when the two populations are normal. The mathematical problem arises in trying to deal with the unknown σ’s in the denominator.

The first approach is to assume that they are equal and to estimate the common value by sp, where

    sp² = [ (n1 − 1)s1² + (n2 − 1)s2² ] / ( n1 + n2 − 2 ).

Next, we substitute this estimate into W to yield W1,

    W1 = [ (Ȳ1· − Ȳ2·) − (µ1 − µ2) ] / ( sp √( 1/n1 + 1/n2 ) ).

If one assumes that the two populations are normal pdfs, then probabilities for W1 can be obtained from the t distribution with n1 + n2 − 2 degrees of freedom.

The following example is taken from exercise 1 on page 401 of Wardrop.

MTB > set c1

DATA> 321 323 329 330 331 332 337 337 343 347

DATA> end

MTB > set c2

DATA> 301 315 316 317 321 321 323 3(327)

DATA> end

MTB > desc c1 c2

          N    MEAN  MEDIAN  STDEV  SEMEAN   MIN   MAX      Q1      Q3
C1       10  333.00  331.50   8.18    2.59   321   347  327.50  338.50
C2       10  319.50  321.00   7.93    2.51   301   327  315.75  327.00

MTB > twos c1 c2;
SUBC> pool.

TWOSAMPLE T FOR C1 VS C2

N MEAN STDEV SE MEAN

C1 10 333.00 8.18 2.6

C2 10 319.50 7.93 2.5
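The pooled analysis can be reproduced outside Minitab. A Python sketch for the page 401 data; the critical value 2.101 (the t quantile with 18 degrees of freedom) is hard-coded rather than looked up.

```python
from math import sqrt
from statistics import mean, stdev

c1 = [321, 323, 329, 330, 331, 332, 337, 337, 343, 347]
c2 = [301, 315, 316, 317, 321, 321, 323, 327, 327, 327]  # 3(327) expanded

n1, n2 = len(c1), len(c2)
s1, s2 = stdev(c1), stdev(c2)
# Pooled standard deviation, as in the formula for sp above.
sp = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

diff = mean(c1) - mean(c2)           # 13.5
se = sp * sqrt(1 / n1 + 1 / n2)
t = diff / se                        # about 3.75, with 18 d.f.

t_crit = 2.101                       # t_{.975} with 18 d.f., hard-coded
ci = (diff - t_crit * se, diff + t_crit * se)   # roughly (5.9, 21.1)
print(t, ci)
```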

The second approach is to make no assumption about the two standard deviations; simply estimate each population standard deviation by its corresponding sample standard deviation. Making this change to W, we get W2,

    W2 = [ (Ȳ1· − Ȳ2·) − (µ1 − µ2) ] / √( s1²/n1 + s2²/n2 ).


Assuming normal pdfs, in this case, does not solve the problem. With normal pdfs, the sampling distribution

of W2 can be approximated by, but does not equal, a t distribution.

There are different opinions about the degrees of freedom in the approximating distribution. Minitab

uses a horrendously messy formula for the degrees of freedom, but since we don’t need to evaluate it, the

fact that it is horrendous is no problem. (See formulas 16.6 and 16.7 on page 592 of Wardrop if you want to

see it!) The above data will be reanalyzed under this second situation.

TWOSAMPLE T FOR C1 VS C2

N MEAN STDEV SE MEAN

C1 10 333.00 8.18 2.6

C2 10 319.50 7.93 2.5

The only difference in the two analyses is that the latter has 17 degrees of freedom, while the former has 18. The values of T are identical because for a balanced study W1 = W2.
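The “horrendously messy” degrees-of-freedom formula is the Welch-Satterthwaite approximation, which is short in code. A sketch using the sample summaries printed above; Minitab truncates the result down to an integer.

```python
def welch_df(s1, n1, s2, n2):
    """Welch-Satterthwaite approximate degrees of freedom."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# Sample standard deviations from the page 401 data:
df = welch_df(8.18, 10, 7.93, 10)
print(df, int(df))   # truncates to 17, as in the second analysis
```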

It is instructive to consider some artificial data.

N MEAN STDEV SE MEAN

C11 10 50.00 9.40 3.0

C12 10 45.00 9.40 3.0

N MEAN STDEV

C11 10 50.00 9.40

C12 20 45.00 9.40


          N    MEAN   STDEV
C11      10   50.00    1.00
C12      10   45.00  100.00

          N    MEAN   STDEV
C11      10   50.00    1.00
C12      20   45.00  100.00

          N    MEAN   STDEV
C11      10   50.00  100.00
C12      20   45.00    1.00

I want to remark on a dumb, but increasingly popular, approach. The suggestion is to use the t distribution with r − 1 degrees of freedom, where r is the minimum of n1 and n2. The main virtue of this method

is that we avoid having to calculate the d.f. with the horrendous formula; but if it is done by computer, what

is the problem?

Finally, if n1 and n2 are both large and you must analyze the data by hand, you might as well use the

standard normal curve for reference instead of bothering with calculating the degrees of freedom. By the

“minimum” approach in the previous paragraph, if each sample size is 30 (or more), then we know that we

have at least 29 (or more) d.f. As a result, we might be willing to use the standard normal curve instead of

the t curve.

I want to explore the issue of robustness for the above procedures. I performed a simulation study with

1000 runs. For each run I selected independent random samples with n1 = n2 = 10 from exponential (1) pdfs. For each pair of samples I calculated two 95% confidence intervals for µ1 − µ2; one with pooling and

one without. The results were virtually identical and very close to what one would expect for normal pdfs.

In particular, when pooling, 42 intervals were incorrect (4.2%) and the mean width of the intervals is 1.800.

When not pooling, 38 intervals were incorrect (3.8%) and the mean width of the intervals is 1.838.

I now want to address a strange property of the above procedures. Recall the earlier data from page 401

of Wardrop. Let us now suppose that the largest observation from the first population, 347, is replaced by

357. This increases the mean of the first sample by one and clearly has no effect on the second sample. Thus,


we have evidence that µ1 is even larger (compared to the evidence in the original data) and no evidence about µ2. Thus, it seems “logical” that our estimate of µ1 − µ2 should “increase,” and certainly not decrease. But

look at the analysis below.

MTB > twos c1 c2;

SUBC> pool.

TWOSAMPLE T FOR C1 VS C2

N MEAN STDEV

C1 10 334.0 10.4

C2 10 319.50 7.93

Earlier, the lower bound for the confidence interval was 5.9; now it has decreased to 5.8! This is very

strange!

The same phenomenon occurs without pooling, as shown below. In this case, the lower bound decreases

from 5.9 to 5.7.

MTB > twos c1 c2

TWOSAMPLE T FOR C1 VS C2

N MEAN STDEV

C1 10 334.0 10.4

C2 10 319.50 7.93
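The strange behavior is easy to reproduce. A Python sketch comparing the pooled interval’s lower bound before and after replacing 347 by 357; the helper function and the hard-coded t value 2.101 (18 d.f.) are mine.

```python
from math import sqrt
from statistics import mean, stdev

def pooled_ci(x, y, t_crit=2.101):   # t_{.975} with 18 d.f., hard-coded
    """Pooled two-sample 95% CI for the difference of means."""
    n1, n2 = len(x), len(y)
    sp = sqrt(((n1 - 1) * stdev(x)**2 + (n2 - 1) * stdev(y)**2)
              / (n1 + n2 - 2))
    se = sp * sqrt(1 / n1 + 1 / n2)
    d = mean(x) - mean(y)
    return d - t_crit * se, d + t_crit * se

c2 = [301, 315, 316, 317, 321, 321, 323, 327, 327, 327]
c1_old = [321, 323, 329, 330, 331, 332, 337, 337, 343, 347]
c1_new = [321, 323, 329, 330, 331, 332, 337, 337, 343, 357]  # 347 -> 357

print(pooled_ci(c1_old, c2))   # lower bound about 5.9
print(pooled_ci(c1_new, c2))   # lower bound drops to about 5.8
```

The larger observation inflates s1, and the wider interval more than absorbs the one-point increase in the mean difference.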

The Mann-Whitney-Wilcoxon procedure is an alternative to the above procedures. It assumes that the pdfs differ by a shift; see the picture in class. Mann-Whitney (Wilcoxon is usually suppressed to avoid confusion with the one-sample procedure) is a generalization of the normal case with equal standard deviations.

(Discuss.)

The idea behind Mann-Whitney will be illustrated with a small set of artificial data.

Sample 1: 8 9 12 15

Sample 2: 4 7 9 13

The data are combined into one set and sorted, and ranks are assigned to the overall data, as below.

Data: 4 7 8 9 9 12 13 15

Ranks: 1 2 3 4.5 4.5 6 7 8

Note that tied values are given mean ranks. The test statistic is the sum of the ranks of the data in the first sample; for these data it is W = 3 + 4.5 + 6 + 8 = 21.5.
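The ranking scheme can be sketched in a few lines of Python; the function below is illustrative, not Minitab’s mann implementation.

```python
def rank_sum(sample1, sample2):
    """Sum of ranks of sample1 in the combined data; ties get mean ranks."""
    combined = sorted(sample1 + sample2)
    m = len(combined)
    # Rank of a value = mean of the first and last 1-based positions
    # it occupies in the sorted combined data.
    ranks = {v: (combined.index(v) + 1 + m - combined[::-1].index(v)) / 2
             for v in set(combined)}
    return sum(ranks[v] for v in sample1)

print(rank_sum([8, 9, 12, 15], [4, 7, 9, 13]))   # 21.5
```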

I put the above data into c3 and c4 and ran the following Minitab command.

MTB > mann c3 c4

C3 N = 4 Median = 10.500

C4 N = 4 Median = 8.000

Point estimate for ETA1-ETA2 is 2.500

97.0 pct c.i. for ETA1-ETA2 is (-5.000,11.000)

W = 21.5

Test of ETA1 = ETA2 vs. ETA1 n.e. ETA2 is significant at 0.3865

The test is significant at 0.3836 (adjusted for ties)

I also ran this command for the earlier data in c1 and c2.

MTB > mann c1 c2

C1 N = 10 Median = 331.50

C2 N = 10 Median = 321.00

Point estimate for ETA1-ETA2 is 13.00

95.5 pct c.i. for ETA1-ETA2 is (5.00,21.00)

W = 146.5

Test of ETA1 = ETA2 vs. ETA1 n.e. ETA2 is significant at 0.0019

The test is significant at 0.0019 (adjusted for ties)

I replaced the largest observation in the first sample, 347, by 999. The analysis is below.

MTB > mann c1 c2

C1 N = 10 Median = 331.5

C2 N = 10 Median = 321.0

Point estimate for ETA1-ETA2 is 13.0

95.5 pct c.i. for ETA1-ETA2 is (5.0,22.0)

W = 146.5

Test of ETA1 = ETA2 vs. ETA1 n.e. ETA2 is significant at 0.0019

The test is significant at 0.0019 (adjusted for ties)

Next, consider the output below.

MTB > twos c1 c2

TWOSAMPLE T FOR C1 VS C2

N MEAN STDEV SE MEAN

C1 10 398 211 67

C2 10 319.50 7.93 2.5


3.4 Paired Data

Paired data arise in two ways:

• Subdividing (or reusing) units, or
• Matching similar units.

Below are some examples of subdividing units.

1. The classic “before and after” studies, in which a response is obtained before and after some event

(diet, exercise, training, etc.). Note that these studies are observational; i.e. there is no randomization.

2. We want to compare two brands of tires to see how they wear on the front wheels of front-wheel-drive

cars. Each car is given one tire of each brand on its front wheels. For each car the location (left or right) is

assigned at random to the brand.

Regarding the second example, if we have, say, 20 cars for study and we randomize we might end up with,

say, Brand A being on 12 left wheels and 8 right. If we decide to force these two numbers to be identical (10

each in this example), we get what is called a cross-over design, which I don’t plan to cover in these notes.

Below are some examples of matching similar units.

1. Sixty students are available for a comparison of two teaching materials. Students are paired based on

some criterion (IQ, background in area, GPA, etc.). In each of the 30 pairs students are assigned to

material by randomization.

2. This example is invalid, as demonstrated later in these notes, but is popular and is advocated in some

introductory texts. Two classes have 30 students each. Class 1 will use teaching material A, and class

2 will use teaching material B. Students are paired across classes (i.e. each student in Class 1 is paired

with a student in Class 2). This is invalid.

Matching similar units is valid only if there is randomization.

Read Section 8.5 of Wardrop.

The standard approach is to calculate differences and then use a one sample procedure. Page 405 of Wardrop

presents data on 25-yard backstroke and breaststroke times. Below are the first ten pairs; see Wardrop for

complete listing of data.

Pair: 1 2 3 4 5 6 7 8 9 10

Bk. 40.0 39.5 39.5 41.0 39.0 38.0 38.5 38.5 39.0 39.5

Br. 37.0 37.5 37.5 37.0 38.0 38.0 38.5 40.5 39.0 39.0

Diff. 3.0 2.0 2.0 4.0 1.0 0.0 0.0 −2.0 0.0 0.5


C1
-3.5 -2.0  0.0  0.0  0.0  0.5  0.5  1.0  1.0  1.0  1.5  1.5  1.5
 1.5  2.0  2.0  2.0  2.0  2.0  2.5  2.5  2.5  3.0  4.0  7.0

          N    MEAN   STDEV  SE MEAN   95.0 PERCENT C.I.
C1       25   1.440   1.933    0.387   (  0.642,  2.238)

          N    MEAN   STDEV  SE MEAN      T   P VALUE
C1       25   1.440   1.933    0.387   3.73    0.0011

ACHIEVED

N MEDIAN CONFIDENCE CONFIDENCE INTERVAL POSITION

C1 25 1.500 0.8922 ( 1.000, 2.000) 9

0.9500 ( 1.000, 2.000) NLI

0.9567 ( 1.000, 2.000) 8

          N  BELOW  EQUAL  ABOVE    P-VALUE   MEDIAN
C1       25      2      3     20     0.0001    1.500

ESTIMATED ACHIEVED

N MEDIAN CONFIDENCE CONFIDENCE INTERVAL

C1 25 1.50 95.0 ( 1.00, 2.00)


               N FOR   WILCOXON              ESTIMATED
          N     TEST   STATISTIC   P-VALUE     MEDIAN
C1       25       22       220.5     0.002      1.500
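The t-based numbers in the output above can be verified directly from the 25 differences. A quick Python sketch:

```python
from math import sqrt
from statistics import mean, stdev

# The 25 backstroke minus breaststroke differences, sorted.
diffs = [-3.5, -2.0, 0.0, 0.0, 0.0, 0.5, 0.5, 1.0, 1.0, 1.0, 1.5, 1.5, 1.5,
         1.5, 2.0, 2.0, 2.0, 2.0, 2.0, 2.5, 2.5, 2.5, 3.0, 4.0, 7.0]

n = len(diffs)                       # 25
se = stdev(diffs) / sqrt(n)          # about 0.387
t = mean(diffs) / se                 # about 3.73, with 24 d.f.
print(mean(diffs), stdev(diffs), t)
```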

I now want to suggest and investigate an inappropriate way to analyze data. I will do this via computer

simulation. Suppose that I have independent random samples of size 10 each from two standard normal

pdfs. For example, I generated such data on Minitab and got the results below.

Sample 1

0.30 −0.22 −0.74 0.05 0.79 −1.70 1.15 0.02 −1.85 −0.02

Sample 2

−0.45 0.40 −1.09 0.41 −0.91 0.24 −1.97 −0.83 1.63 −0.39

Now, let’s sort each sample.

Sample 1, Sorted

−1.85 −1.70 −0.74 −0.22 −0.02 0.02 0.05 0.30 0.79 1.15

Sample 2, Sorted

−1.97 −1.09 −0.91 −0.83 −0.45 −0.39 0.24 0.40 0.41 1.63

Now, let’s pair the sorted data, matching the smallest values in each set, the second smallest values, and so

on. Then, after pairing, subtract the values in the second sample from the corresponding values in the first

sample. The 10 differences are below.

Differences of Sorted Data

0.12 −0.61 0.17 0.61 0.43 0.41 −0.19 −0.10 0.38 −0.48

I constructed two 95% confidence intervals for the difference of the means. Using the two independent samples, pooled estimate of variance (an appropriate analysis), I obtained [−0.85, 1.00]. Using the differences of the sorted data, I obtained [−0.22, 0.37]. We will see that this latter analysis is incorrect. At this point,

however, the latter analysis looks superior: both intervals are correct (they contain 0) and the second interval

is much more precise.

I repeated the above steps 1,000 times. For each pair of samples I constructed the 95% confidence

interval (pooled) for the difference of the means. Of these intervals, 945 (94.5%) were correct. This is as

expected. But for each pair of samples I also sorted the data and formed pairs of the sorted data. Then I

calculated differences. Using the one sample t procedure, 517 (51.7%) of the intervals were correct! This

horrible performance demonstrates that the pairing is not valid!
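The simulation is easy to replicate. A Python sketch (my own setup, with a fixed seed and hard-coded t critical values); the exact counts will differ from the 945 and 517 reported above, but the pattern is the same.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)
T18, T9 = 2.101, 2.262          # t_{.975} with 18 and 9 d.f., hard-coded
runs, n = 1000, 10
ok_pooled = ok_sorted = 0

for _ in range(runs):
    x = [random.gauss(0, 1) for _ in range(n)]
    y = [random.gauss(0, 1) for _ in range(n)]

    # Appropriate analysis: pooled two-sample interval; true diff is 0.
    sp = sqrt((stdev(x)**2 + stdev(y)**2) / 2)   # equal n simplification
    half = T18 * sp * sqrt(2 / n)
    if abs(mean(x) - mean(y)) <= half:
        ok_pooled += 1

    # Invalid analysis: sort, pair, and use the one-sample t on differences.
    d = [a - b for a, b in zip(sorted(x), sorted(y))]
    half = T9 * stdev(d) / sqrt(n)
    if abs(mean(d)) <= half:
        ok_sorted += 1

print(ok_pooled / runs, ok_sorted / runs)   # roughly 0.95 versus 0.5
```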
