Chapter 2
The Two Sample Problem
2.1 Observational Versus Randomized Studies
The table below is taken from The Statistical Sleuth by Ramsey and Schafer. We will discuss it in class.
Selection of Units (rows): At Random; Not at Random
Allocation of Units to Groups (columns): By Randomization; Not by Randomization

At Random, By Randomization: A random sample is selected from one population; units are then randomly assigned to different treatment groups.
At Random, Not by Randomization: Random samples are selected from existing distinct populations.
Not at Random, By Randomization: A group of study units is found; units are then randomly assigned to treatment groups.
Not at Random, Not by Randomization: Collections of available units from distinct groups are examined.
By Randomization (both rows): Causal inferences can be drawn.
At Random (both columns): Inferences to the populations can be drawn.
We will begin by considering examples from each cell in the above table. First, we will consider units that are subjects (distinct individuals). Notice that I am deliberately not defining the response or, if applicable, treatments.
• Upper left: The population is all high school freshmen in Wisconsin. A random sample is selected from this population. Once the sample is obtained, the students are divided into treatment groups by randomization.
• Upper right: All high schools in Wisconsin are classified as public or private. The high school freshmen at these two types of schools form the two populations. Independent random samples of freshmen are selected from each population.
• Lower left: All freshmen at Sun Prairie high school are selected for study. The students are divided into treatment groups by randomization.
• Lower right: Freshmen at Sun Prairie high school are compared to freshmen at Edgewood high school.
Next, I will consider units that are trials.
• Upper left: A golfer wants to compare two drivers. The trials, individual shots, are assigned to driver by randomization; we get random samples by assuming a spinner model for each driver.
• Upper right: A golfer has one driver and he wants to compare his ability playing at sea level versus playing at an altitude of 5,000 feet. We get random samples by assuming a spinner model at each site.
• Lower left: Same as upper left, but we no longer assume spinner models.
• Lower right: Same as upper right, but we no longer assume spinner models.
Before we get into inference procedures (formulas for tests and estimation), I want to introduce some issues of scientific importance. On each unit (subject or trial) we plan to obtain a response. Typically, the response exhibits some variation as we move from unit to unit. (If there is no variation, we will not need to do Statistics.) We invent the notion of factors as the source of the variation. For example, in its last five basketball games, the Milwaukee Bucks scored 102, 93, 102, 104 and 91 points. Possible factors include strength of opponent, location of game, and length of time since the previous game. Note the following. These are natural factors for a basketball fan to suggest. If you know nothing about basketball, then you will be ill equipped to speculate on the identity of factors. As we will see below, if a scientist is bad at suggesting factors, then he/she will likely not learn much from collecting and analyzing data.

From the list of possible factors, the researcher chooses one for special status; it is called the study factor. (In regression and ANOVA, the researcher may choose to have several study factors.) For example, for the Bucks I might choose location of game as my study factor. Next, I must specify the levels of the study factor. If there are two levels, then we have the “two sample problem,” which is the title of this chapter. Accordingly, I might choose my levels to be “home” and “away.” Note that this is not the only possible choice. I could use the four time zones to be the levels, or the 28 arenas (if I am correct in my belief that the Clippers and Lakers share an arena; 29 if I am incorrect). If the researcher selects more than two levels and the levels are on an interval or ratio scale, then methods of regression might be used.

After specifying the study factor(s), all other factors are collectively referred to as background factors. I want to mention two ways to handle background factors. (This is not an exhaustive list.) First, you can block on a background factor. In the basketball example, I could block on the opponent. To keep this simple, I will focus on eight games with four opponents, as shown below.
Opponent       Home     Away    H − A
Dallas          102       97        5
Denver          116      113        3
Toronto         104      102        2
Washington       99      100       −1
Mean         105.25   103.00     2.25
SD             7.46     6.98     2.50
Looking ahead a bit, if the home and away scores were independent random samples, then the following analysis could be appropriate.
twos c1 c2;
pool.
%
TWOSAMPLE T FOR C1 VS C2
      N     MEAN     SD
C1    4   105.25   7.46
C2    4   103.00   6.98
95 PCT CI FOR MU C1 - MU C2: (-10.2, 14.7)
TTEST MU C1 = MU C2 (VS NE): T= 0.44  P=0.67  DF= 6
POOLED STDEV = 7.22
But it is more appropriate to analyze the differences as a one sample problem, as with the following analysis.
ttest c3
%
TEST OF MU = 0 VS MU N.E. 0
      N   MEAN     SD      T   P VALUE
C3    4   2.25   2.50   1.80      0.17
tint c3
%
      N   MEAN     SD   95 % C.I.
C3    4   2.25   2.50   (-1.73, 6.23)
Notice that for the differences, the P-value is much smaller and the confidence interval is much narrower.

Blocking is not always effective. For example, the Bucks’ data above were selected deliberately for illustration only. There are 14 teams that the Bucks have played home and away thus far. The analysis of all the data is very different from that above.
twos c1 c2;
pool.
%
TWOSAMPLE T FOR C1 VS C2
      N     MEAN      SD
C1   14    94.60   10.20
C2   14   101.71    7.16
95 PCT CI FOR MU C1 - MU C2: (-13.9, -0.2)
TTEST MU C1 = MU C2 (VS NE): T= -2.12  P=0.043  DF= 26
POOLED STDEV = 8.81
The analysis with blocking is below.
ttest c3
%
TEST OF MU = 0 VS MU N.E. 0
      N    MEAN      SD       T   P VALUE
C3   14   -7.07   13.04   -2.03     0.063
tint c3
%
      N    MEAN      SD   95 % C.I.
C3   14   -7.07   13.04   (-14.60, 0.46)
Note from above that the independent samples analysis, compared to the block (paired data) analysis, gives a smaller P-value and a narrower confidence interval.
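The two kinds of analysis can be sketched in a few lines of Python. This is only an illustration on the four-opponent data from the first table; the function names `pooled_t` and `paired_t` are my own labels, not Minitab commands, and only the t statistics are computed (P-values would need a t table or a statistics package).

```python
import math
import statistics

# Scores against Dallas, Denver, Toronto, Washington (from the first table above).
home = [102, 116, 104, 99]
away = [97, 113, 102, 100]

def pooled_t(x, y):
    """Two-sample t statistic with a pooled standard deviation (independent samples)."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * statistics.variance(x)
           + (n2 - 1) * statistics.variance(y)) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (statistics.mean(x) - statistics.mean(y)) / se

def paired_t(x, y):
    """One-sample t statistic computed on the within-pair differences."""
    d = [a - b for a, b in zip(x, y)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))

t_independent = pooled_t(home, away)  # referred to t with n1 + n2 - 2 = 6 d.f.
t_paired = paired_t(home, away)       # referred to t with n - 1 = 3 d.f.
print(round(t_independent, 2), round(t_paired, 2))
```

Both statistics share the numerator 2.25; they differ only in the standard error, which is where blocking helps or hurts.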
It is my opinion that novice statisticians frequently are overly optimistic about the value of blocking. Unless you are pretty certain that the factor has a big impact on the response, it is usually better not to block.

A second way to deal with a background factor is by controlling for it. This means that you keep the value of the factor constant throughout the study, or at least for the data you analyze. In a study of her cat’s consumption of two flavors of treats, Dawn (a former student) controlled the cat’s intake of other food, and tried to control his activity level. In addition, she presented either treat at the same time each day, in an attempt to control for time of day effect as well as the cat’s general level of hunger.

After blocking and controlling, if either or both of these are used, there are still lots of background factors. If units are assigned to levels of the study factor by randomization, there is some reason to believe that the effects of these background factors will be “balanced” between the levels of the study factor (this notion can be made more precise). But if units are merely associated with the level of a study factor (that is, not assigned by randomization), it is very possible that the background factors will severely bias the study. Some examples follow.
1. Yesterday I heard a talk on the effects of “coaching” for the SAT. The subjects are students. The response is change in verbal score from PSAT to SAT. The study factor is coaching with levels “yes” and “no.” What are some possible background factors?
2. The subjects are people. The response is whether the person develops a particular disease of interest. The study factor is smoking, with levels “yes” and “no.” What are some possible background factors?
3. The subjects are men. The response is whether the man develops a particular disease of interest. The study factor is whether the man has had a vasectomy, with levels “yes” and “no.” What are some possible background factors?
The following hypothetical example illustrates the possible effect of a background factor. A company with 200 employees decides it must reduce its work force by one-half. The following table reveals the relationship between gender and outcome.
                  Outcome
Gender   Released   Not released   Total     p̂
Female         60             40     100   0.60
Male           40             60     100   0.40
Total         100            100     200
Now suppose that the value of a background factor, job type, is available for each person. One could take the data above and stratify it according to job type, as I have done below.
Job A
                  Outcome
Gender   Released   Not released   Total     p̂
Female         56             24      80   0.70
Male           16              4      20   0.80
Total          72             28     100

Job B
                  Outcome
Gender   Released   Not released   Total     p̂
Female          4             16      20   0.20
Male           24             56      80   0.30
Total          28             72     100
Note that in the original table, the female release rate is 0.20 larger than the male release rate, but in each component table (i.e. for each job) the female release rate is 0.10 smaller than the male release rate! This consistent (across component tables) reversal of the direction of the relationship is called Simpson’s Paradox.

We can gain insight into the “why” behind Simpson’s Paradox by examining the following two tables.
              Job
Gender     A     B   Total
Female    80    20     100
Male      20    80     100
Total    100   100     200
               Outcome
Job     Released   Not released   Total
A             72             28     100
B             28             72     100
Total        100            100     200
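The reversal can be checked directly from the stratified counts. The sketch below is only an illustration; the dictionary layout and the helper name `release_rate` are my own choices.

```python
# (released, not released) counts from the Job A and Job B tables above.
counts = {
    "A": {"Female": (56, 24), "Male": (16, 4)},
    "B": {"Female": (4, 16), "Male": (24, 56)},
}

def release_rate(released, not_released):
    """Proportion released among all people in the cell."""
    return released / (released + not_released)

# Release rate for each gender within each job type.
within = {job: {g: release_rate(*cell) for g, cell in by_gender.items()}
          for job, by_gender in counts.items()}

# Collapse over job type to recover the original (unstratified) table.
overall = {}
for g in ("Female", "Male"):
    released = sum(counts[job][g][0] for job in counts)
    not_released = sum(counts[job][g][1] for job in counts)
    overall[g] = release_rate(released, not_released)

print("overall:", overall)
print("within job:", within)
```

The collapsed rates favor men, while the rates within each job favor women; the difference comes entirely from how unevenly the genders are spread over the two jobs.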
The background factor (job) is statistically related to the study factor (gender) and response (outcome). If the background factor fails to be statistically related to either the study factor or response, then Simpson’s Paradox will not occur. (This issue will be addressed in a future homework assignment.)

If a background factor is strongly (statistically) related to the response, then you probably want to block on it. If a background factor is strongly (statistically) related to the study factor, then it will be difficult to separate statistically the effect of the study factor from the effect of the background factor.

There is a sampling issue for observational studies that I want to address. Years ago I saw a variation of the following example in a really bad introductory Statistics book. Each person in a population of college students can be assigned a value on each of two dichotomous variables. The first is GPA: high ($A$) or low ($A^c$); the second is whether the person smokes tobacco ($B$) or not ($B^c$). We can imagine a table of population counts (I will follow the notation in Wardrop, Chapter 8).
There are several ways to view this table. You can view it as a single population with two dichotomous variables per person (as I have done above). In this case, inference would focus on estimating probabilities and conditional probabilities. Secondly, you could view smoking status as the response and GPA as the study factor, with levels high and low. This means that we have two distinct populations: high and low GPA. Inference would focus on the proportion of smokers in each GPA population. Thirdly, we can reverse the roles of smoking and GPA. This gives two distinct populations: smokers and nonsmokers. Inference would focus on the proportion of high GPA in each smoking group. (The bad book thought that this last perspective was the only one possible and compounded its error by suggesting a causal link: smoking leads to bad grades! One could just as easily argue that anxiety over low grades leads to smoking, or that a background factor, time spent partying, is linked to both smoking and low grades.)

A critical point that is often overlooked is the importance of how a sample is selected. Let us imagine three possible sampling schemes. We can take a random sample from the overall population of college students; we can take independent random samples from the populations of smokers and nonsmokers; or we can take independent random samples from the populations of high and low GPA. Suppose that the population counts are given by the following table.
           Smoker?
GPA      Yes     No   Total
High     600   4400    5000
Low     1400   3600    5000
Total   2000   8000   10000
The table of population proportions is below.
           Smoker?
GPA      Yes     No   Total
High    0.06   0.44    0.50
Low     0.14   0.36    0.50
Total   0.20   0.80    1.00
The table of conditional probabilities of smoking status given GPA is below.
           Smoker?
GPA      Yes     No   Total
High    0.12   0.88    1.00
Low     0.28   0.72    1.00
Note that 0.12 and 0.28 are $p_1$ and $p_2$ for the perspective of smoking being the response.

The table of conditional probabilities of GPA given smoking status is below.
           Smoker?
GPA      Yes     No
High    0.30   0.55
Low     0.70   0.45
Total   1.00   1.00
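The three population tables above all come from the same table of counts; only the divisor changes. A minimal sketch follows, in which the nested-dictionary layout and the variable names are my own.

```python
# Population counts from the table above, indexed as counts[gpa][smoker].
counts = {"High": {"Yes": 600, "No": 4400},
          "Low": {"Yes": 1400, "No": 3600}}

total = sum(sum(row.values()) for row in counts.values())

# Population proportions: divide every cell by the grand total.
joint = {g: {s: c / total for s, c in row.items()} for g, row in counts.items()}

# Smoking status given GPA: divide each cell by its row (GPA) total.
smoke_given_gpa = {g: {s: c / sum(row.values()) for s, c in row.items()}
                   for g, row in counts.items()}

# GPA given smoking status: divide each cell by its column (smoking) total.
col_totals = {s: sum(counts[g][s] for g in counts) for s in ("Yes", "No")}
gpa_given_smoke = {g: {s: counts[g][s] / col_totals[s] for s in ("Yes", "No")}
                   for g in counts}

print(joint)
print(smoke_given_gpa)
print(gpa_given_smoke)
```

Which of the two conditional tables a sample can estimate depends on which totals the sampling scheme fixes in advance.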
Note that 0.30 and 0.55 are $p_1$ and $p_2$ for the perspective of GPA being the response.

I will consider three ways to sample. First, suppose we select a random sample (with replacement) of size 1000 from the overall population. I did this on my computer and obtained the data below.
           Smoker?
GPA      Yes     No   Total
High      60    437     497
Low      137    366     503
Total    197    803    1000
Next, I used these data to estimate the three tables above; the population proportions and the two tables of conditional probabilities. The results are below.
            Smoker?
GPA       Yes      No   Total
High    0.060   0.437   0.497
Low     0.137   0.366   0.503
Total   0.197   0.803   1.000

            Smoker?
GPA       Yes      No   Total
High    0.121   0.879   1.000
Low     0.272   0.728   1.000

            Smoker?
GPA       Yes      No
High    0.305   0.544
Low     0.695   0.456
Total   1.000   1.000
By inspection, all estimates are quite close to the population proportions. As the comparisons based on the last two tables suggest, if we have a random sample from the overall population, it is valid to pretend we have either: (a) independent random samples from the high and low GPA populations, or (b) independent random samples from the smoking and nonsmoking populations.

Second, suppose that I select independent random samples (with replacement) of size 500 each from the smoking and nonsmoking populations. I did this on my computer and obtained the results shown below.
           Smoker?
GPA      Yes     No   Total
High     157    288     445
Low      343    212     555
Total    500    500    1000
The estimate of the proportion with high GPA among smokers is 157/500 = 0.314, and among nonsmokers it is 288/500 = 0.576. These numbers are reasonably close to the population proportions, 0.30 and 0.55, respectively. But now suppose that we pretend we have independent random samples from the GPA populations; what happens? Our estimate of smoking given high GPA is 157/445 = 0.353, which is considerably larger than 0.12, the population proportion. And our estimate of smoking given low GPA is 343/555 = 0.618, which is considerably larger than 0.28, the population proportion.

As a result, we conclude that it is improper to pretend we have independent random samples from the GPA populations. The reason for the strong bias shown above is simple. By taking equal sample sizes from each smoking population, we are grossly oversampling the smokers in the overall population, and also in the two GPA populations.

Suppose, however, that I had selected samples of size 200 from the smokers and 800 from the nonsmokers. (These sample sizes match the proportions of smokers and nonsmokers in the overall population.) I did this and obtained the data below.
           Smoker?
GPA      Yes     No   Total
High      62    434     496
Low      138    366     504
Total    200    800    1000
If one divides each of the values in the table by 1000, one obtains a very good estimate of the table of population proportions. This is a general rule: if we sample from subpopulations in proportion to occurrence in the overall population, then it is ok to pretend we have a random sample from the overall population.

Before returning to the two sample problem, I want to digress into a common error on sampling. The point is that it is very important to be careful about units. Suppose that a small college has a freshman class of 500 students. Each student enrolls in five courses, as detailed below.
1. Social Studies 101, which is taught in one section of 500 students.
2. Science 102, which is taught in five sections of 100 students each.
3. Math 103, which is taught in 10 sections of 50 students each.
4. English 104, which is taught in 25 sections of 20 students each.
5. Humanities 105, which is taught in 50 sections of 10 students each.
The college calculates that there are 2500 students in the 91 sections offered, for a mean of 27.5 students per section. A rival college reports that for every student, the mean class size is 136. Both computations are correct. What do you think?
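Both figures can be reproduced from the course list above; the only thing that changes is the unit being averaged over. The sketch below is an illustration, with variable names of my own choosing.

```python
# (number of sections, students per section) for the five courses listed above.
courses = [(1, 500), (5, 100), (10, 50), (25, 20), (50, 10)]

sections = sum(k for k, _ in courses)
enrollments = sum(k * size for k, size in courses)

# Unit = section: each section counts once.
mean_per_section = enrollments / sections

# Unit = student enrollment: a section of size s is experienced by s students,
# so each section is weighted by its own size.
mean_per_student = sum(k * size * size for k, size in courses) / enrollments

print(sections, enrollments, round(mean_per_section, 1), mean_per_student)
```

The second average is larger because large sections are, by definition, the ones most students sit in.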
2.2 Dichotomous response, independent samples
Later in this chapter we will consider dependent sam ples, which arise from pairing. Data from studies of this section can be presented in the following manner.
                Variable 2
Variable 1     B     B^c   Total
A              a     b     n_1
A^c            c     d     n_2
Total          m_1   m_2   n
This table is meant to be very general. It can be used for sampling from one population with two dichotomous responses per unit (remember the GPA and smoking example earlier). This table can be used for independent random samples from two populations. Finally, it can be used with a study with randomization. At some point in the analysis I usually view the one population, two responses problem as a problem on conditional probabilities, so I will modify the above table to the following form, which I find easier to understand.
Study          Response
factor      S     F     Total
Level 1     a     b     n_1
Level 2     c     d     n_2
Total       m_1   m_2   n
In order to analyze such data, statisticians typically begin by arguing that the marginal totals can (or should) be viewed as fixed numbers. This can be a bit of a stretch, so some discussion is merited.
For the one population, two responses model, only the value $n$ is fixed in advance by the researcher; all other entries in the table are the observed values of random variables. The statistician then argues that one should perform analysis after conditioning on the other marginal totals. Here is an abridged version of the argument statisticians give. Suppose that we have the following marginal totals.
                Variable 2
Variable 1     B     B^c   Total
A              a     b        60
A^c            c     d        40
Total         70     30      100
Only the total number of units, $n = 100$, is fixed by the sampling plan. But what do we learn from the other totals? Well, we get evidence that $B$ is much more common than $B^c$, and evidence that $A$ is somewhat more common than $A^c$. But we don’t learn anything about a relationship between $A$ and $B$, which is, after all, the primary purpose of the investigation. Note that with the above margins, we could have $a = 60$, which would provide evidence of a very strong positive association between $A$ and $B$; or we could have $a = 30$, which would provide evidence of a very strong negative association between $A$ and $B$; or we could have $a = 42$, which would provide no evidence of an association between $A$ and $B$. In short, knowledge of the marginal totals does not provide the researcher with evidence of the strength or direction of association between $A$ and $B$; hence, it probably won’t hurt to condition on the margins. Plus there is the added bonus that conditioning on the margins makes the math much easier.

In the table below, define $\hat{p}_1 = a/n_1$, $\hat{q}_1 = b/n_1$, $\hat{p}_2 = c/n_2$, and $\hat{q}_2 = d/n_2$.
Study          Response
factor      S     F     Total
Level 1     a     b     n_1
Level 2     c     d     n_2
Total       m_1   m_2   n
The confidence interval for $p_1 - p_2$ is

$$\hat{p}_1 - \hat{p}_2 \pm z \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}. \tag{2.1}$$
This is an approximate interval, based on using a normal curve approximation. Minitab will not evaluate this formula for us.

For hypothesis testing, there are several possible approaches. The null hypothesis is $p_1 = p_2$; there are three possible alternatives, obtained by replacing ‘=’ in the null hypothesis by $>$, $<$, or $\neq$. An exact P-value can be obtained by using the hypergeometric distribution. This distribution is not in Minitab. (See me if you want a macro for Version 9.) Approximate probabilities can be obtained by a normal or chi-squared approximation. The normal approximation can be written two ways. First, as $z = x/\sigma$, where

$$x = \hat{p}_1 - \hat{p}_2 \quad \text{and} \quad \sigma = \sqrt{\frac{m_1 m_2}{n_1 n_2 (n - 1)}}.$$

This expression can be rewritten as

$$z = \frac{\sqrt{n - 1}\,(ad - bc)}{\sqrt{n_1 n_2 m_1 m_2}}.$$

Some people modify $z$ slightly and use

$$z' = \frac{\sqrt{n}\,(ad - bc)}{\sqrt{n_1 n_2 m_1 m_2}}.$$

Finally, others use

$$\chi^2 = (z')^2.$$

If $z$ or $z'$ is used, P-values are obtained from the standard normal curve. Literally, $\chi^2$ can be used only for the alternative $\neq$, and the P-value is obtained by using the chi-squared curve with one degree of freedom. Minitab presents the analysis only for $\chi^2$.

Exercise 2 on page 252 of Wardrop presents the following data.
Study          Response
factor      S     F     Total
Level 1    46    66       112
Level 2    30    99       129
Total      76   165       241
Below is a Minitab analysis of these data.
read c1 c2
46 66
30 99
end
chis c1 c2
%
Expected counts are printed below observed counts
            C1      C2   Total
    1       46      66     112
          35.3    76.7
    2       30      99     129
          40.7    88.3
Total       76     165     241
ChiSq = 3.230 + 1.488 +
        2.804 + 1.292 = 8.813
df = 1
cdf 8.813;
chis 1.
    8.8130    0.9970
subt 0.9970 1 k1
ANSWER = 0.0030
let k2=sqrt(8.813)
cdf k2 k3
subt k3 1 k4
ANSWER = 0.0015
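The same numbers can be obtained without Minitab. The sketch below is an illustration using only Python's standard library: it evaluates interval (2.1) with $z = 1.96$ (my assumption for a 95% interval, since the notes leave $z$ general), the normal statistic with $\sqrt{n-1}$, the modified statistic with $\sqrt{n}$ (named `z_mod` here), and its square. The helper `upper_normal` is my own name.

```python
import math

# Observed counts from Exercise 2: rows are Level 1 and Level 2, columns S and F.
a, b, c, d = 46, 66, 30, 99
n1, n2 = a + b, c + d          # row totals
m1, m2 = a + c, b + d          # column totals
n = n1 + n2

p1, p2 = a / n1, c / n2

# Approximate confidence interval (2.1), assuming z = 1.96 for 95% confidence.
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
ci = (p1 - p2 - 1.96 * se, p1 - p2 + 1.96 * se)

# Test statistics conditional on the margins.
root = math.sqrt(n1 * n2 * m1 * m2)
z = math.sqrt(n - 1) * (a * d - b * c) / root
z_mod = math.sqrt(n) * (a * d - b * c) / root
chi2 = z_mod ** 2

def upper_normal(x):
    """Area under the standard normal curve to the right of x."""
    return 0.5 * math.erfc(x / math.sqrt(2))

# A chi-squared variable with 1 d.f. is a squared standard normal, so its
# upper tail area equals the two-sided normal tail area.
p_chi2 = 2 * upper_normal(z_mod)
p_one_sided = upper_normal(z_mod)

print(round(chi2, 3), round(p_chi2, 4), round(p_one_sided, 4))
print(tuple(round(v, 3) for v in ci))
```

The two tail areas correspond to the two ANSWER lines in the session above: 0.0030 for the alternative $\neq$ and 0.0015 for the alternative $>$.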
2.3 Numerical response, independent samples
I am “jumping over” multicategory responses, ordered or not, and proceeding to numerical. We will return to multicategory responses soon.

The most commonly used procedures compare the populations by comparing their means. For reference, see 16.1 and 16.2 of Wardrop. In fact, the presentation below is a compression of the ideas in 16.2.

We assume that we have independent random samples from two populations. The first population has (unknown) mean $\mu_1$ and standard deviation $\sigma_1$. The second population has (unknown) mean $\mu_2$ and standard deviation $\sigma_2$. The data from the first population are denoted

$$y_{1,1}, y_{1,2}, y_{1,3}, \ldots, y_{1,n_1},$$

and are summarized by their mean $\bar{y}_{1\cdot}$ and standard deviation $s_1$. Similarly, data from the second population are denoted

$$y_{2,1}, y_{2,2}, y_{2,3}, \ldots, y_{2,n_2},$$

and are summarized by their mean $\bar{y}_{2\cdot}$ and standard deviation $s_2$. (Most authors suppress the comma in the subscript, and I might forget and do that too on occasion. I like the commas for later work when we need to know whether, for example, $y_{111}$ is the 11th observation from the first sample or the first observation from the 11th sample.)

Attention focuses on the difference $\mu_1 - \mu_2$, which is estimated by $\bar{y}_{1\cdot} - \bar{y}_{2\cdot}$. For inference, we will need the sampling distribution of this estimate. Some basic results in mathematical statistics indicate that the standardized form of the estimate is
$$W = \frac{(\bar{Y}_{1\cdot} - \bar{Y}_{2\cdot}) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}.$$
The mathematical problem arises in trying to deal with the unknown $\sigma$’s in the denominator. The first approach is to assume that they are equal and to estimate them by $s_p$, where

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}.$$

Next, we substitute this estimate into $W$ to yield $W_1$,

$$W_1 = \frac{(\bar{Y}_{1\cdot} - \bar{Y}_{2\cdot}) - (\mu_1 - \mu_2)}{s_p \sqrt{1/n_1 + 1/n_2}}.$$
If one assumes that the two populations are normal pdfs, then probabilities for $W_1$ can be obtained from the t distribution with $(n_1 + n_2 - 2)$ degrees of freedom. The following example is taken from exercise 1 on page 401 of Wardrop.
set c1
321 323 329 330 331 332 337 337 343 347
end
set c2
301 315 316 317 321 321 323 3(327)
end
desc c1 c2
      N    MEAN   MEDIAN   STDEV
C1   10   333.0   331.50    8.18
C2   10   319.5   321.00    7.93
twos c1 c2;
pool.
TWOSAMPLE T FOR C1 VS C2
      N    MEAN   STDEV
C1   10   333.0    8.18
C2   10   319.5    7.93
95 PCT CI FOR MU C1 - MU C2: (5.9, 21.1)
TTEST MU C1 = MU C2 (VS NE): T= 3.75  P=0.0015  DF= 18
POOLED STDEV = 8.06
The second approach is to make no assumption about the two standard deviations; simply estimate each population standard deviation by its corresponding sample standard deviation. Making this change to $W$, we get $W_2$,
$$W_2 = \frac{(\bar{Y}_{1\cdot} - \bar{Y}_{2\cdot}) - (\mu_1 - \mu_2)}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}.$$
Assuming normal pdfs, in this case, does not solve the problem. With normal pdfs, the sampling distribution of $W_2$ can be approximated by, but does not equal, a t distribution. There are different opinions about the degrees of freedom in the approximating distribution. Minitab uses a horrendously messy formula for the degrees of freedom, but since we don’t need to evaluate it, the fact that it is horrendous is no problem. (See formulas 16.6 and 16.7 on page 592 of Wardrop if you want to see it!)

The above data will be reanalyzed under this second situation.
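twos c1 c2
%
TWOSAMPLE T FOR C1 VS C2
      N     MEAN   STDEV
C1   10   333.00    8.18
C2   10   319.50    7.93
95 PCT CI FOR MU C1 - MU C2: (5.9, 21.1)
TTEST MU C1 = MU C2 (VS NE): T= 3.75  P=0.0016  DF= 17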
The only difference in the two analyses is that the latter has 17 degrees of freedom, while the former has 18 (which changes the P-value slightly, from 0.0015 to 0.0016). The values of T are identical because for a balanced study $W_1 = W_2$. It is instructive to consider some artificial data.
twos c11 c12
%
TWOSAMPLE T FOR C11 VS C12
       N    MEAN   STDEV
C11   10   50.00    9.40
C12   10   45.00    9.40
95 PCT CI FOR MU C11 - MU C12: (-3.8, 13.8)
TTEST MU C11 = MU C12 (VS NE): T= 1.19  P=0.25  DF= 18
Contrast this output with the following four.
twos c11 c12
%
TWOSAMPLE T FOR C11 VS C12
       N    MEAN   STDEV
C11   10   50.00    9.40
C12   20   45.00    9.40
95 PCT CI FOR MU C11 - MU C12: (-2.7, 12.7)
TTEST MU C11 = MU C12 (VS NE): T= 1.37  P=0.19  DF= 18

twos c11 c12
%
TWOSAMPLE T FOR C11 VS C12
       N    MEAN   STDEV
C11   10   50.00    1.00
C12   10      45     100
95 PCT CI FOR MU C11 - MU C12: (-66.55, 77)
TTEST MU C11 = MU C12 (VS NE): T= 0.16  P=0.88  DF= 9

twos c11 c12
%
TWOSAMPLE T FOR C11 VS C12
       N    MEAN   STDEV
C11   10   50.00    1.00
C12   20      45     100
95 PCT CI FOR MU C11 - MU C12: (-41.83, 52)
TTEST MU C11 = MU C12 (VS NE): T= 0.22  P=0.83  DF= 19

twos c11 c12
TWOSAMPLE T FOR C11 VS C12
       N    MEAN   STDEV
C11   10      50     100
C12   20   45.00    1.00
95 PCT CI FOR MU C11 - MU C12: (-67, 76.55)
TTEST MU C11 = MU C12 (VS NE): T= 0.16  P=0.88  DF= 9
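The T values and degrees of freedom in the five artificial outputs can be reproduced from the summary statistics alone. The sketch below uses the usual Satterthwaite approximation for the degrees of freedom; I am assuming this is the "horrendous" formula in question, and that Minitab truncates the result to an integer, which is consistent with every DF value printed above. The function name `welch` is my own.

```python
import math

def welch(mean1, sd1, n1, mean2, sd2, n2):
    """Unpooled t statistic and approximate d.f. from summary statistics."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2      # estimated variances of the two means
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# (mean1, sd1, n1, mean2, sd2, n2) for the five artificial outputs, in order.
cases = [
    (50, 9.4, 10, 45, 9.4, 10),
    (50, 9.4, 10, 45, 9.4, 20),
    (50, 1.0, 10, 45, 100.0, 10),
    (50, 1.0, 10, 45, 100.0, 20),
    (50, 100.0, 10, 45, 1.0, 20),
]
results = [welch(*case) for case in cases]
for t, df in results:
    print(f"T = {t:.2f}, approximate d.f. = {df:.2f}")
```

When one standard deviation dominates, the approximate d.f. is essentially the d.f. of that sample alone (9 or 19 here), regardless of the size of the other sample.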
I want to remark on a dumb, but increasingly popular, approach. The suggestion is to use the t distribution with $r - 1$ degrees of freedom, where $r$ is the minimum of $n_1$ and $n_2$. The main virtue of this method is that we avoid having to calculate the d.f. with the horrendous formula; but if it is done by computer, what is the problem?

Finally, if $n_1$ and $n_2$ are both large and you must analyze the data by hand, you might as well use the standard normal curve for reference instead of bothering with calculating the degrees of freedom. By the “minimum” approach in the previous paragraph, if each sample size is 30 (or more), then we know that we have at least 29 d.f. As a result, we might be willing to use the standard normal curve instead of the t curve.
I want to explore the issue of robustness for the above procedures. I performed a simulation study with 1000 runs. For each run I selected independent random samples with $n_1 = n_2 = 10$ from exponential(1) pdfs. For each pair of samples I calculated two 95% confidence intervals for $\mu_1 - \mu_2$; one with pooling and one without. The results were virtually identical and very close to what one would expect for normal pdfs. In particular, when pooling, 42 intervals were incorrect (4.2%) and the mean width of the intervals was 1.800. When not pooling, 38 intervals were incorrect (3.8%) and the mean width of the intervals was 1.838.
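A simulation of this kind is easy to sketch. The version below covers only the pooled interval, uses Python's standard library, and hard-codes the tabled value 2.101 for the 0.975 quantile of the t distribution with 18 d.f.; the seed is arbitrary, so the counts will not match the figures quoted above exactly.

```python
import math
import random
import statistics

T_975_DF18 = 2.101  # from a t table; valid only for n1 + n2 - 2 = 18 d.f.

def pooled_interval(x, y):
    """95% pooled-variance confidence interval for mu1 - mu2 (requires 18 d.f.)."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * statistics.variance(x)
           + (n2 - 1) * statistics.variance(y)) / (n1 + n2 - 2)
    half = T_975_DF18 * math.sqrt(sp2 * (1 / n1 + 1 / n2))
    diff = statistics.mean(x) - statistics.mean(y)
    return diff - half, diff + half

rng = random.Random(1)      # arbitrary seed, for reproducibility only
runs, n = 1000, 10
misses, total_width = 0, 0.0
for _ in range(runs):
    x = [rng.expovariate(1.0) for _ in range(n)]
    y = [rng.expovariate(1.0) for _ in range(n)]
    lo, hi = pooled_interval(x, y)
    # Both populations have mean 1, so the true difference is 0.
    if not (lo <= 0.0 <= hi):
        misses += 1
    total_width += hi - lo

miss_rate = misses / runs
mean_width = total_width / runs
print(miss_rate, round(mean_width, 3))
```

An unpooled version would need the t quantile for a different (non-integer) d.f. in each run, which is why it is omitted from this stdlib-only sketch.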
I now want to address a strange property of the above procedures. Recall the earlier data from page 401 of Wardrop. Let us now suppose that the largest observation from the first sample, 347, is replaced by 357. This increases the mean of the first sample by one and clearly has no effect on the second sample. Thus, we have evidence that $\mu_1$ is even larger (compared to the evidence in the original data) and no evidence about $\mu_2$. Thus, it seems “logical” that our estimate of $\mu_1 - \mu_2$ should “increase,” and certainly not decrease. But look at the analysis below.
twos c1 c2;
pool.
TWOSAMPLE T FOR C1 VS C2
      N     MEAN   STDEV
C1   10    334.0    10.4
C2   10   319.50    7.93
95 PCT CI FOR MU C1 - MU C2: (5.8, 23.2)
TTEST MU C1 = MU C2 (VS NE): T= 3.51  P=0.0025  DF= 18
POOLED STDEV = 9.25
Earlier, the lower bound for the confidence interval was 5.9; now it has decreased to 5.8! This is very strange!

The same phenomenon occurs without pooling, as shown below. In this case, the lower bound decreases from 5.9 to 5.7.

twos c1 c2
TWOSAMPLE T FOR C1 VS C2
      N     MEAN   STDEV
C1   10    334.0    10.4
C2   10   319.50    7.93
95 PCT CI FOR MU C1 - MU C2: (5.7, 23.3)
TTEST MU C1 = MU C2 (VS NE): T= 3.51  P=0.0029  DF= 16

The Mann-Whitney-Wilcoxon procedure is an alternative to the above procedures. It assumes that the pdfs differ in a shift; see the picture in class. Mann-Whitney (Wilcoxon is usually suppressed to avoid confusion with the one-sample procedure) is a generalization of the normal case with equal standard deviations. (Discuss.)

The idea behind Mann-Whitney will be illustrated with a small set of artificial data.

Sample 1:    8    9   12   15
Sample 2:    4    7    9   13

The data are combined into one set and sorted, and ranks are assigned to the overall data, as below.

Data:      4     7     8     9
Ranks:     1     2     3   4.5
Data:      9    12    13    15
Ranks:   4.5     6     7     8

Note that tied values are given mean ranks. The test statistic is the sum of the ranks of the data in the first sample; for these data it is W = 3 + 4.5 + 6 + 8 = 21.5.

I put the above data into c3 and c4 and ran the following Minitab command.

mann c3 c4
%
Mann-Whitney Confidence Interval and Test
C3   N = 4   Median = 10.5
C4   N = 4   Median = 8.0
Point estimate for ETA1-ETA2 is 2.5
97.0 pct c.i. for ETA1-ETA2 is (-5.000,11.000)
W = 21.5
Test of ETA1 = ETA2 vs. ETA1 n.e. ETA2 is significant at 0.3865
The test is significant at 0.3836 (adjusted for ties)
Cannot reject at alpha = 0.05

I also ran this command for the earlier data in c1 and c2.

mann c1 c2
%
Mann-Whitney Confidence Interval and Test
C1   N = 10   Median = 331.5
C2   N = 10   Median = 321.0
Point estimate for ETA1-ETA2 is 13.0
95.5 pct c.i. for ETA1-ETA2 is (5.00,21.00)
W = 146.5
Test of ETA1 = ETA2 vs. ETA1 n.e. ETA2 is significant at 0.0019
The test is significant at 0.0019 (adjusted for ties)
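The rank-sum statistic W is simple enough to compute by hand or in a few lines of code. The sketch below is an illustration with function names of my own (`midranks`, `rank_sum`, `shift_estimate`); it also computes the median of all pairwise differences, which I am assuming is how the "point estimate for ETA1-ETA2" is defined, since it agrees with the 2.5 reported for the small data set.

```python
import statistics

def midranks(values):
    """Map each distinct value to its rank, giving tied values their mean rank."""
    order = sorted(values)
    rank_of = {}
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and order[j] == order[i]:
            j += 1
        # Sorted positions i+1, ..., j hold the same value; use their mean.
        rank_of[order[i]] = (i + 1 + j) / 2
        i = j
    return rank_of

def rank_sum(sample1, sample2):
    """W: sum of the combined-sample ranks of the observations in sample1."""
    rank_of = midranks(list(sample1) + list(sample2))
    return sum(rank_of[v] for v in sample1)

def shift_estimate(sample1, sample2):
    """Median of all pairwise differences, first sample minus second."""
    return statistics.median(x - y for x in sample1 for y in sample2)

small1, small2 = [8, 9, 12, 15], [4, 7, 9, 13]
c1 = [321, 323, 329, 330, 331, 332, 337, 337, 343, 347]
c2 = [301, 315, 316, 317, 321, 321, 323, 327, 327, 327]

w_small = rank_sum(small1, small2)
w_c1c2 = rank_sum(c1, c2)
w_outlier = rank_sum(c1[:-1] + [999], c2)   # largest value replaced by 999
shift_small = shift_estimate(small1, small2)
print(w_small, w_c1c2, w_outlier, shift_small)
```

Because only the ranks enter W, moving the largest observation further out cannot change it.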
I replaced the largest observation in the first sample, 347, by 999. The analysis is below.

mann c1 c2
%
Mann-Whitney Confidence Interval and Test
C1   N = 10   Median = 331.5
C2   N = 10   Median = 321.0
Point estimate for ETA1-ETA2 is 13.0
95.5 pct c.i. for ETA1-ETA2 is (5.0,22.0)
W = 146.5
Test of ETA1 = ETA2 vs. ETA1 n.e. ETA2 is significant at 0.0019
The test is significant at 0.0019 (adjusted for ties)

2.4 Paired data
Paired data arises in two ways:
• Subdividing (or reusing) units, or
• Matching similar units.
Below are some examples of subdividing units.
1. The classic “before and after” studies, in which a response is obtained before and after some event (diet, exercise, training, etc.). Note that these studies are observational; i.e. there is no randomization.
2. We want to compare two brands of tires to see how they wear on the front wheels of front-wheel-drive cars. Each car is given one tire of each brand for its front. For each car the location (left or right) is assigned at random to the brand.
Regarding the second example, if we have, say, 20 cars for study and we randomize, we might end up with, say, Brand A being on 12 left wheels and 8 right. If we decide to force these two numbers to be identical (10 each in this example), we get what is called a crossover design, which I don’t plan to cover in these notes.
Below are some examples of matching similar units.
1. Sixty students are available for a comparison of two teaching materials. Students are paired based on some criterion (IQ, background in area, GPA, etc.). In each of the 30 pairs students are assigned to material by randomization.
2. This example is invalid, as demonstrated later in these notes, but is popular and is advocated in some introductory texts. Two classes have 30 students each. Class 1 will use teaching material A, and class 2 will use teaching material B. Students are paired across classes (i.e. each student in Class 1 is paired with a student in Class 2). This is invalid.
Matching similar units is valid only if there is randomization.
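As a rough illustration of the valid design in the first matching example (60 students, 30 pairs, randomization within each pair), the sketch below flips a fair coin for each pair. The student labels, the way pairs are formed (adjacent labels), and the function name randomize_pairs are all hypothetical; in practice the pairs would come from the matching criterion.

```python
import random


def randomize_pairs(pairs, seed=None):
    """Within each pair, assign one member to material A and the other to B at random."""
    rng = random.Random(seed)
    assignment = {}
    for first, second in pairs:
        if rng.random() < 0.5:
            assignment[first], assignment[second] = "A", "B"
        else:
            assignment[first], assignment[second] = "B", "A"
    return assignment


# Hypothetical labels S01..S60; adjacent labels stand in for matched students.
students = [f"S{i:02d}" for i in range(1, 61)]
pairs = [(students[i], students[i + 1]) for i in range(0, 60, 2)]

assignment = randomize_pairs(pairs, seed=0)
for first, second in pairs[:3]:
    print(first, assignment[first], second, assignment[second])
```

Every pair contributes exactly one student to each material, so each material is used by 30 students, and which member of a pair gets which material is decided by chance rather than by class membership.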
2.4.1 Dichotomous response
Read Section 8.5 of Wardrop.
2.4.2 Numerical response
The standard approach is to calculate differences and then use a one-sample procedure. Page 405 of Wardrop presents data on 25-yard backstroke (Bk.) and breaststroke (Br.) times. Below are the first ten pairs; see Wardrop for the complete listing of the data.
Pair:      1      2      3      4      5
Bk.     40.0   39.5   39.5   41.0   39.0
Br.     37.0   37.5   37.5   37.0   38.0
Diff.    3.0    2.0    2.0    4.0    1.0

Pair:      6      7      8      9     10
Bk.     38.0   38.5   38.5   39.0   39.5
Br.     38.0   38.5   40.5   39.0   39.0
Diff.    0.0    0.0   −2.0    0.0    0.5
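The differencing step for the first ten pairs can be sketched as follows; each difference is backstroke time minus breaststroke time, which is the sign convention implied by the table. The variable names are my own.

```python
# First ten pairs from the table (times for the 25-yard swims).
backstroke = [40.0, 39.5, 39.5, 41.0, 39.0, 38.0, 38.5, 38.5, 39.0, 39.5]
breaststroke = [37.0, 37.5, 37.5, 37.0, 38.0, 38.0, 38.5, 40.5, 39.0, 39.0]

# Paired analysis: reduce each pair to a single number, Bk. minus Br.
diffs = [bk - br for bk, br in zip(backstroke, breaststroke)]
print(diffs)
```

Once the pairs are reduced to differences, any one-sample procedure (t, sign, Wilcoxon signed-rank) can be applied to the single column of differences.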
The sorted 25 differences are printed and analyzed below.
prin c1
C1
  -3.5   -2.0    0.0    0.0    0.0
   0.5    0.5    1.0    1.0    1.0
   1.5    1.5    1.5    1.5    2.0
   2.0    2.0    2.0    2.0    2.5
   2.5    2.5    3.0    4.0    7.0

tint c1
        N     MEAN    STDEV   95.0% C.I.
C1     25     1.44    1.933   (0.64,2.24)

ttest c1
TEST OF MU = 0 VS MU N.E. 0
        N     MEAN    STDEV       T   P VALUE
C1     25     1.44    1.933    3.73    0.0011
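The summary numbers in the tint and ttest output can be checked by hand from the 25 differences. The sketch below is one way to do it with the Python standard library. Two assumptions are labeled in the comments: the first two sorted values are taken to be negative, and the t critical value 2.064 (24 degrees of freedom) is typed in from a t table rather than computed.

```python
import statistics

# The 25 sorted differences. Assumption: the first two values are negative
# (-3.5 and -2.0); the list is sorted, and this choice reproduces the
# reported mean of 1.44.
diffs = [-3.5, -2.0, 0.0, 0.0, 0.0, 0.5, 0.5, 1.0, 1.0, 1.0,
         1.5, 1.5, 1.5, 1.5, 2.0, 2.0, 2.0, 2.0, 2.0, 2.5,
         2.5, 2.5, 3.0, 4.0, 7.0]

T_CRIT_24 = 2.064  # upper 2.5% point of t with 24 df, taken from a t table

n = len(diffs)
mean = statistics.mean(diffs)
stdev = statistics.stdev(diffs)      # sample standard deviation (divisor n - 1)
se = stdev / n ** 0.5                # estimated standard error of the mean
t_stat = mean / se                   # test statistic for MU = 0
ci = (mean - T_CRIT_24 * se, mean + T_CRIT_24 * se)

print(round(mean, 2), round(stdev, 3), round(t_stat, 2))
print(round(ci[0], 2), round(ci[1], 2))
```

The P value is not computed here because that needs the t distribution function, which the standard library does not provide.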

sint c1
SIGN CONF INT FOR MED
                     ACHIEVED
        N   MEDIAN       CONF   CONF INT      POS
C1     25    1.500      0.892   (1.0, 2.0)      9
                        0.950   (1.0, 2.0)    NLI
                        0.957   (1.0, 2.0)      8

stest c1
SIGN TEST OF MED = 0 VS N.E. 0
        N   BELOW   EQUAL   ABOVE   PVALUE   MEDIAN
C1     25       2       3      20   0.0001    1.500
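The counts and P value in the sign test output can be reproduced with an exact binomial calculation. The sketch below assumes the usual convention that differences equal to 0 are dropped and that the two-sided P value is twice the smaller tail; as in the sorted listing, the first two differences are taken to be negative.

```python
from math import comb

# The 25 sorted differences (first two assumed negative; see the sorted listing).
diffs = [-3.5, -2.0, 0.0, 0.0, 0.0, 0.5, 0.5, 1.0, 1.0, 1.0,
         1.5, 1.5, 1.5, 1.5, 2.0, 2.0, 2.0, 2.0, 2.0, 2.5,
         2.5, 2.5, 3.0, 4.0, 7.0]

below = sum(d < 0 for d in diffs)
equal = sum(d == 0 for d in diffs)
above = sum(d > 0 for d in diffs)

# Zeros carry no information about the sign, so only the nonzero
# differences are used; under the null each is + or - with probability 1/2.
n_used = below + above
smaller = min(below, above)
tail = sum(comb(n_used, k) for k in range(smaller + 1)) / 2 ** n_used
p_value = min(1.0, 2 * tail)

print(below, equal, above, round(p_value, 4))
```

With 2 below, 3 equal, and 20 above, the test is based on 22 signs, and the rounded P value agrees with the 0.0001 in the output.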

wint c1
               EST
        N      MED   CONF   CONF INT
C1     25     1.50   95.0   (1.0, 2.0)
wtest c1
TEST OF MED = 0 VS MED N.E. 0
            N FOR    WILC               EST
        N    TEST    STAT   PVALUE      MED
C1     25      22   220.5    0.002      1.5
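The Wilcoxon signed-rank statistic in the wtest output can also be checked by hand. The sketch below assumes the standard recipe: drop the zero differences, rank the absolute values (mean ranks for ties), and sum the ranks attached to the positive differences. As before, the first two sorted differences are taken to be negative; the helper name mean_rank is my own.

```python
# The 25 sorted differences (first two assumed negative; see the sorted listing).
diffs = [-3.5, -2.0, 0.0, 0.0, 0.0, 0.5, 0.5, 1.0, 1.0, 1.0,
         1.5, 1.5, 1.5, 1.5, 2.0, 2.0, 2.0, 2.0, 2.0, 2.5,
         2.5, 2.5, 3.0, 4.0, 7.0]

nonzero = [d for d in diffs if d != 0]          # zeros are not ranked
abs_sorted = sorted(abs(d) for d in nonzero)


def mean_rank(value):
    """Mean rank of value among the sorted absolute differences."""
    first = abs_sorted.index(value) + 1          # rank of the first tied copy
    count = abs_sorted.count(value)              # number of tied copies
    return first + (count - 1) / 2


# Signed-rank statistic: sum of the ranks of the positive differences.
w_plus = sum(mean_rank(abs(d)) for d in nonzero if d > 0)
print(len(nonzero), w_plus)  # 22 and 220.5
```

The two negative differences, −2.0 and −3.5, carry ranks 12.5 and 20, so the positive ranks sum to 253 − 32.5 = 220.5, where 253 = 22(23)/2 is the total of all 22 ranks.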
I now want to suggest and investigate an inappropriate way to analyze data. I will do this via computer simulation. Suppose that I have independent random samples of size 10 each from two standard normal pdfs. For example, I generated such data on Minitab and got the results below.
Sample 1
 0.30   −0.22   −0.74    0.05    0.79
−1.70    1.15    0.02   −1.85   −0.02
Sample 2
−0.45    0.40   −1.09    0.41   −0.91
 0.24   −1.97   −0.83    1.63   −0.39
Now, let's sort each sample.
Sample 1, Sorted
−1.85   −1.70   −0.74   −0.22   −0.02
 0.02    0.05    0.30    0.79    1.15
Sample 2, Sorted
−1.97   −1.09   −0.91   −0.83   −0.45
−0.39    0.24    0.40    0.41    1.63
Now, let's pair the sorted data, matching the smallest values in each set, the second smallest values, and so on. Then, after pairing, subtract the values in the second sample from the corresponding values in the first sample. The 10 differences are below.
Differences of Sorted Data
 0.12   −0.61    0.17    0.61    0.43
 0.41   −0.19   −0.10    0.38   −0.48
I constructed two 95% confidence intervals for the difference of the means. Using the procedure for two independent samples with a pooled estimate of variance (an appropriate analysis), I obtained [−0.85, 1.00]. Using the differences of the sorted data, I obtained [−0.22, 0.37]. We will see that this latter analysis is incorrect. At this point, however, the latter analysis looks superior: both intervals are correct (they contain 0, the true difference of the means) and the second interval is much more precise.
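Both intervals can be recomputed from the 20 simulated values. The sketch below uses the standard pooled-variance t interval and the one-sample t interval on the sorted differences. The critical values 2.101 (18 degrees of freedom) and 2.262 (9 degrees of freedom) are typed in from a t table, and the variable names are my own.

```python
import statistics

sample1 = [0.30, -0.22, -0.74, 0.05, 0.79, -1.70, 1.15, 0.02, -1.85, -0.02]
sample2 = [-0.45, 0.40, -1.09, 0.41, -0.91, 0.24, -1.97, -0.83, 1.63, -0.39]

T_CRIT_18 = 2.101  # upper 2.5% point of t with 18 df, from a t table
T_CRIT_9 = 2.262   # upper 2.5% point of t with 9 df, from a t table

# Appropriate analysis: two independent samples, pooled variance.
n1, n2 = len(sample1), len(sample2)
pooled_var = ((n1 - 1) * statistics.variance(sample1)
              + (n2 - 1) * statistics.variance(sample2)) / (n1 + n2 - 2)
diff = statistics.mean(sample1) - statistics.mean(sample2)
half = T_CRIT_18 * (pooled_var * (1 / n1 + 1 / n2)) ** 0.5
pooled_ci = (diff - half, diff + half)

# Inappropriate analysis: sort each sample, pair by rank, difference,
# then treat the differences as a single random sample.
sorted_diffs = [a - b for a, b in zip(sorted(sample1), sorted(sample2))]
half_d = T_CRIT_9 * statistics.stdev(sorted_diffs) / len(sorted_diffs) ** 0.5
mean_d = statistics.mean(sorted_diffs)
sorted_ci = (mean_d - half_d, mean_d + half_d)

print([round(d, 2) for d in sorted_diffs])
print([round(v, 2) for v in pooled_ci], [round(v, 2) for v in sorted_ci])
```

Both intervals are centered at the same point, the difference of the sample means. Sorting makes the paired values track each other closely, so the differences have a small standard deviation and the second interval comes out much narrower.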
I repeated the above steps 1,000 times. For each pair of samples I constructed the 95% confidence interval (pooled) for the difference of the means. Of these intervals, 945 (94.5%) were correct. This is as expected. But for each pair of samples I also sorted the data and formed pairs of the sorted data. Then I calculated differences. Using the one-sample t procedure, 517 (51.7%) of the intervals were correct! This horrible performance demonstrates that the pairing is not valid!
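A simulation along these lines can be sketched in Python. This is not the Minitab program used for the notes: the random number generator, the seed, and the function names are my own choices, and the t critical values are typed in from a t table, so the counts will differ somewhat from 945 and 517.

```python
import random
import statistics

T_CRIT_18 = 2.101  # upper 2.5% point of t with 18 df (pooled, n1 = n2 = 10)
T_CRIT_9 = 2.262   # upper 2.5% point of t with 9 df (10 differences)


def pooled_interval(x, y):
    """95% pooled-variance t interval for the difference of means (n1 = n2 = 10)."""
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * statistics.variance(x)
                  + (n2 - 1) * statistics.variance(y)) / (n1 + n2 - 2)
    half = T_CRIT_18 * (pooled_var * (1 / n1 + 1 / n2)) ** 0.5
    diff = statistics.mean(x) - statistics.mean(y)
    return diff - half, diff + half


def sorted_pair_interval(x, y):
    """95% one-sample t interval applied (invalidly) to differences of sorted data."""
    d = [a - b for a, b in zip(sorted(x), sorted(y))]
    half = T_CRIT_9 * statistics.stdev(d) / len(d) ** 0.5
    center = statistics.mean(d)
    return center - half, center + half


def coverage(reps=1000, n=10, seed=1):
    """Fraction of intervals containing the true difference, 0, for each method."""
    rng = random.Random(seed)
    pooled_hits = sorted_hits = 0
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [rng.gauss(0, 1) for _ in range(n)]
        lo, hi = pooled_interval(x, y)
        pooled_hits += lo <= 0 <= hi
        lo, hi = sorted_pair_interval(x, y)
        sorted_hits += lo <= 0 <= hi
    return pooled_hits / reps, sorted_hits / reps


pooled_rate, sorted_rate = coverage()
print(pooled_rate, sorted_rate)
```

The pooled rate should land near the nominal 95%, while the rate for the sorted-pairs intervals should fall far short of it, in line with the results reported above.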