Chapter 3
Departures from the Normality Assumption: The t procedures are robust to departures
from normality. Data depart from normality when their distribution is not symmetric and
bell- or mound-shaped, or when the tails are too long or too short. While this is
subjective to some extent, assessing how severe the departure from normality is forms an
important part of your training.
Departures from the Equal Variances Assumption: These departures can be more
serious. This condition is best checked by looking at histograms of both samples as well
as the sample standard deviations. Often, so long as the sample sizes are similar, the
uncertainty measures will still be reasonably close to the true ones.
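A quick numerical version of this check is to compare the two sample standard deviations directly. Here is a minimal sketch with simulated data (the samples and the ratio-of-2 rule of thumb are illustrations, not data or rules from these notes):

```python
import numpy as np

# Hypothetical samples (illustration only)
rng = np.random.default_rng(2)
y1 = rng.normal(50, 4, size=30)
y2 = rng.normal(55, 9, size=30)

s1 = y1.std(ddof=1)   # sample standard deviation, group 1
s2 = y2.std(ddof=1)   # sample standard deviation, group 2

# A large ratio of the larger SD to the smaller one (a common rule of
# thumb is "beyond about 2") suggests the equal-variance assumption is
# questionable; histograms of both samples complete the picture.
ratio = max(s1, s2) / min(s1, s2)
print(s1, s2, ratio)
```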
Departures from Independence: This can be caused by serial or spatial correlation or by
cluster effects. These assumptions can usually be easily checked by considering the
experimental design and data collecting procedures. Data that fail to meet the
independence assumption require methods other than those presented here.
What are some examples of data sets that violate the independence assumption?
Resistance: A statistical procedure is resistant if it does not change very much when a
small part of the data changes. The t procedures are not resistant to outliers. Outliers
should be identified, the analysis performed both with and without them, and both sets of
results reported in the published results of the experiment.
Transformations of the Data: Sometimes departures from normality can be corrected by
transformations. The most common transformation is the log transform, and there are two
common versions: the natural log and the log base 10. It is common for writers to use
"log" to mean both. In my notes, I will often use ln to mean the natural log and log to
mean the log base 10, or log to mean either one. Your text will use log to mean the
natural log and log10 to mean the log base 10, though the use of log10 will be rare. Please
do not let this confuse or upset you.
If the distribution of the log-transformed values log(Y) is symmetric, then

mean[log(Y)] = median[log(Y)],

and, because the log is a monotone (order-preserving) function,

median[log(Y)] = log[median(Y)].

In words: the median, or 50th percentile, of the log-transformed values is the log of the
50th percentile of the original values. So, when we transform back to the original scale,
we are drawing inferences about the median.
If we denote the averages of the two log-transformed samples by Z1 and Z2, then the
difference Z2 − Z1 estimates

log[ median(Y2) / median(Y1) ],

where Y1 and Y2 represent the two populations. That is, Z2 − Z1 is an estimate of the
log of the ratio of the two population medians.
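The back-transformation above can be checked numerically. Below is a minimal sketch with simulated log-normal samples (the samples and parameters are illustrations, not data from these notes): exponentiating the difference of the averages of the logs gives an estimate of the ratio of the medians, which we compare against the ratio of the sample medians.

```python
import numpy as np

# Hypothetical log-normal samples (illustration only)
rng = np.random.default_rng(0)
y1 = rng.lognormal(mean=1.0, sigma=0.5, size=200)
y2 = rng.lognormal(mean=1.4, sigma=0.5, size=200)

z1 = np.log(y1).mean()   # average of the log-transformed sample 1
z2 = np.log(y2).mean()   # average of the log-transformed sample 2

# exp(Z2 - Z1) estimates median(Y2) / median(Y1)
ratio_est = np.exp(z2 - z1)

# Compare with the ratio of the sample medians computed directly
ratio_direct = np.median(y2) / np.median(y1)
print(ratio_est, ratio_direct)
```

Both quantities estimate the same population ratio of medians (here the true ratio is exp(0.4), about 1.49), so they should be close for samples of this size.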
Other transformations: There are many transformations one can try. There are rules of
thumb, but it usually boils down to trial and error. Some common transformations are
square root, reciprocal, and the arcsine.
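The three transformations named above can be sketched in a few lines. The example data are hypothetical; the usual pairings (square root for counts, reciprocal for severe right skew, arcsine of the square root for proportions) are standard rules of thumb, not prescriptions from these notes:

```python
import numpy as np

# Hypothetical right-skewed counts and proportions (illustration only)
counts = np.array([1.0, 4.0, 9.0, 16.0, 100.0])
props = np.array([0.05, 0.20, 0.50, 0.80, 0.95])

sqrt_t = np.sqrt(counts)               # square root: often used for counts
recip_t = 1.0 / counts                 # reciprocal: severe right skew (no zeros allowed)
arcsine_t = np.arcsin(np.sqrt(props))  # arcsine of the square root: proportions in [0, 1]
print(sqrt_t, recip_t, arcsine_t)
```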
Chapter 4
Alternatives to t-tools
Welch’s t-Test for Comparing Two Normal Populations with Unequal Variance.
As mentioned earlier, the standard error for the estimate of the difference between two
population means when the variances are not assumed equal is given by:
SE_W(y1 − y2) = [ s1^2/n1 + s2^2/n2 ]^(1/2).
The exact degrees of freedom, and hence the exact distribution of the statistic, are not
known in this case. The best approximation, Satterthwaite's approximation, is given by

dfW = [ SE_W(y2 − y1) ]^4 / ( [ SE(y2) ]^4/(n2 − 1) + [ SE(y1) ]^4/(n1 − 1) ),

where SE(yi) = si/(ni)^(1/2) is the standard error of the i-th sample mean.
The t statistic and p-value are then calculated in exactly the same way as for the pooled
variance test.
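The Welch standard error and Satterthwaite degrees of freedom above can be computed directly and checked against scipy's built-in Welch test. The samples below are hypothetical (illustration only):

```python
import numpy as np
from scipy import stats

# Hypothetical samples with unequal variances (illustration only)
rng = np.random.default_rng(1)
y1 = rng.normal(10, 2, size=15)
y2 = rng.normal(12, 5, size=20)

s1, s2 = y1.std(ddof=1), y2.std(ddof=1)
n1, n2 = len(y1), len(y2)

# Welch standard error: SE_W = sqrt(s1^2/n1 + s2^2/n2)
se_w = np.sqrt(s1**2 / n1 + s2**2 / n2)

# Satterthwaite approximation to the degrees of freedom
df_w = se_w**4 / ((s1**2 / n1)**2 / (n1 - 1) + (s2**2 / n2)**2 / (n2 - 1))

# t statistic and two-sided p-value, just as for the pooled test
t = (y2.mean() - y1.mean()) / se_w
p = 2 * stats.t.sf(abs(t), df_w)

# scipy's Welch test (equal_var=False) should agree
t_sp, p_sp = stats.ttest_ind(y2, y1, equal_var=False)
print(t, df_w, p)
```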
Wilcoxon Rank-Sum Test
Say you believe that students who go home to their families for Thanksgiving Weekend
actually do better on their exams because they need to decompress more than they need
to study. Say you took a random sample of 8 students who went home for Thanksgiving
and 8 who stayed in Missoula and studied, and then obtained their final exam scores out
of a total of 200 possible.
To perform the rank-sum test, we rank the two samples together as one combined
sample. Once we have the ranks, the test statistic is

z = ( W − μW ) / SD(W),

where W is the sum of the ranks in one of the groups.
Home        Studied       Rank
90.0400                     1
            94.7618         2
94.9900                     3
95.9400                     4
            102.0240        5
104.4400                    6
106.8800                    7
113.2500                    8
            115.4934        9
119.2100                   10
            123.5596       11
129.6706                   12
            131.0900       13
            137.9134       14
            142.4956       15
            183.4077       16
We now sum the ranks for the ‘home’ data. This yields W=1+3+4+6+7+8+10+12=51.
Under the null hypothesis of no difference, the rank sum W for a group of size n1 (with
n2 observations in the other group) has

μW = n1(n1 + n2 + 1)/2
SD(W) = [ n1 n2 (n1 + n2 + 1)/12 ]^(1/2).

Here n1 = n2 = 8, so μW = 8(17)/2 = 68 and SD(W) = [ 8·8·17/12 ]^(1/2) ≈ 9.52, giving

z = (51 − 68)/9.52 ≈ −1.79.

We then compare this z-statistic to the quantiles of the standard normal distribution to
obtain a p-value in the usual way. Here the two-sided p-value is about 0.07, so at the
0.05 level we fail to reject the null hypothesis of no difference between the two groups.
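The whole rank-sum calculation for this example fits in a few lines. The home-group ranks are taken from the worked example above; the null mean and standard deviation use the standard rank-sum formulas μW = n1(n1+n2+1)/2 and SD(W) = [n1 n2 (n1+n2+1)/12]^(1/2):

```python
import numpy as np
from scipy import stats

n1 = n2 = 8
home_ranks = [1, 3, 4, 6, 7, 8, 10, 12]  # ranks of the 'home' scores (from the notes)
W = sum(home_ranks)                       # rank sum for the home group

mu_w = n1 * (n1 + n2 + 1) / 2                    # null mean of W
sd_w = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)   # null SD of W

z = (W - mu_w) / sd_w                    # normal-approximation z statistic
p = 2 * stats.norm.sf(abs(z))            # two-sided p-value from the normal quantiles
print(W, z, p)
```

With W = 51 this gives z ≈ −1.79 and a two-sided p-value of about 0.07, matching the hand calculation.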