C.13 Critical Values for the Spearman Rank Correlation Coefficient, r_s  526
C.14 Critical Values for the Kolmogorov-Smirnov Test  527
C.15 Critical Values for the Lilliefors Test  528
References  529
Index  531
Our goal in writing this book was to generate an accessible and relatively complete
introduction for undergraduates to the use of statistics in the biological sciences. The
text is designed for a one-quarter or one-semester class in introductory statistics for
the life sciences. The target audience is sophomore and junior biology, environmental
studies, biochemistry, and health sciences majors. The assumed background is some
coursework in biology as well as a foundation in algebra but not calculus. Examples
are taken from many areas in the life sciences including genetics, physiology, ecology,
agriculture, and medicine.
This text emphasizes the relationships among probability, probability distributions,
and hypothesis testing. We highlight the expected value of various test statistics under
the null and research hypotheses as a way to understand the methodology of hypoth-
esis testing. In addition, we have incorporated nonparametric alternatives to many
situations along with the standard parametric analysis. These nonparametric tech-
niques are included because undergraduate student projects often have small sample
sizes that preclude parametric analysis and because the development of the nonpara-
metric tests is readily understandable for students with modest math backgrounds.
The nonparametric tests can be skipped or skimmed without loss of continuity.
We have tried to include interesting and easily understandable examples with each
concept. The problems at the end of each chapter have a range of difficulty and come
from a variety of disciplines. Some are real-life examples and most others are realistic
in their design and data values. Throughout the text we have included short “Concept
Checks” that allow readers to immediately gauge their mastery of the topic presented.
Their answers are found at the ends of appropriate chapters. The end-of-chapter
problems are randomized within each chapter to require the student to choose the
appropriate analysis. Many undergraduate texts present a technique and immediately
give all the problems that can be solved with it. This approach prevents students
from having to make the real-life decision about the appropriate analysis. We believe
this decision making is a critical skill in statistical analysis and have provided a large
number of opportunities to practice it.
The material for this text derives principally from a required undergraduate biostatistics
course one of us (Glover) taught for more than twenty years and from a
second course in nonparametric statistics and field data analysis that the other of
us taught.
Supplemental Materials
We have provided several different resources to supplement the text in various ways.
There is a set of Additional Appendices that are available online at
http://waveland.com/Glover-Mitchell/Appendices.pdf. These are coordinated with this text and
contain further information on several topics, including:
• a method for determining confidence intervals for the difference between medians
of independent samples based on the Wilcoxon rank-sum test;

Also available is a set of 300 additional problems to supplement those in the text.
This material is available online for both students and instructors at
http://waveland.com/Glover-Mitchell/ExtraProblems.pdf.
An Answer Manual for instructors is available free on CD from the publisher.
Included on this CD are
• a PDF file containing both questions and answers for all the problems in the text
and another file containing only the answers for all the problems in the text;
• a PDF file containing the supplementary problems mentioned above and another
file containing both the questions and the answers to all of the supplementary
problems; and
Acknowledgments
For the preparation of this third edition, thanks are due to the following people:
Don Rosso and Dakota West at Waveland Press for their support and guidance; Ann
Warner of Hobart and William Smith Colleges, for her meticulous word processing of
the early drafts of this manuscript; and the students of Hobart and William Smith
Colleges for their many comments and suggestions, particularly Aline Gadue for her
careful scrutiny of the first edition.
Thomas J. Glover
Kevin J. Mitchell
Geneva, NY
1
Introduction to Data Analysis
Concepts in Chapter 1:
• Scientific Method and Statistical Analysis
• Parameters: Descriptive Characteristics of Populations
• Statistics: Descriptive Characteristics of Samples
• Variable Types: Continuous, Discrete, Ranked, and Categorical
• Measures of Central Tendency: Mean, Median, and Mode
• Measures of Dispersion: Range, Variance, Standard Deviation, and Standard
Error
• Descriptive Statistics for Frequency Data
• Effects of Coding on Descriptive Statistics
• Tables and Graphs
• Quartiles and Box Plots
• Accuracy, Precision, and the 30–300 Rule
1.1 Introduction
The modern study of the life sciences includes experimentation, data gathering, and
interpretation. This text offers an introduction to the methods used to perform these
fundamental activities.
The design and evaluation of experiments, known as the scientific method, is
utilized in all scientific fields and is often implied rather than explicitly outlined in
many investigations. The components of the scientific method include observation,
formulation of a potential question or problem, construction of a hypothesis, followed
by a prediction, and the design of an experiment to test the prediction. Let’s consider
these components briefly.
Observation

Careful observation may suggest a cause and effect relationship. For example, suppose upon investigating a remote
Fijian island community you realized that the vast majority of the adults suffer from
hypertension (abnormally elevated blood pressures, with the systolic over 165 mmHg
and the diastolic over 95 mmHg). Note that the individual observations here are quantitative,
while the percentage that are hypertensive is based on a qualitative evaluation
of the sample. From these preliminary observations one might formulate the question:
Why are so many adults in this population hypertensive?
Formulation of a Hypothesis
A hypothesis is a tentative explanation for the observations made. A good hypothesis
suggests a cause and effect relationship and is testable.
The Fijian community may demonstrate hypertension because of diet, lifestyle,
genetic makeup, or a combination of these factors. Because we've noticed extraordinary
consumption of octopus in their diet, and because octopods have a very high cholesterol
content, we might hypothesize that the high level of hypertension is caused by diet.
Making a Prediction
If the hypothesis is properly constructed, it can and should be used to make predic-
tions. Predictions are based on deductive reasoning and take the form of an “if-then”
statement. For example, a good prediction based on the hypothesis above would be:
If the hypertension is caused by a high cholesterol diet, then changing the diet to a low
cholesterol one should lower the incidence of hypertension.
The criteria for a valid (properly stated) prediction are:
the diet. If the group with the low cholesterol diet exhibits significantly lower levels
of hypertension, the hypothesis is supported by the data. On the other hand, if the
change in diet has no effect on hypertension, then a new or revised hypothesis should
be formulated and the experimental procedure redesigned. Finally, the generalizations
that are drawn by relating the data to the hypothesis can be stated as conclusions.
While the steps outlined above may seem straightforward, they often require
considerable insight and sophistication to apply properly. In our example, how the
groups are chosen is not a trivial problem. They must be constructed without bias and
must be large enough to give the researcher an acceptable level of confidence in the
results. Further, how large a change is significant enough to support the hypothesis?
What is statistically significant may not be biologically significant.
A foundation in statistical methods will help you design and interpret experiments
properly. The field of statistics is broadly defined as the methods and procedures for
collecting, classifying, summarizing, and analyzing data, and utilizing the data to test
scientific hypotheses. The term statistics is derived from the Latin for state, and orig-
inally referred to information gathered in various censuses that could be numerically
summarized to describe aspects of the state, for example, bushels of wheat per year,
or number of military-aged men. Over time statistics has come to mean the scientific
study of numerical data based on natural phenomena. Statistics applied to the life
sciences is often called biostatistics or biometry. The foundations of biostatistics
go back several hundred years, but statistical analysis of biological systems began
in earnest in the late nineteenth century as biology became more quantitative and
experimental.
The average clutch size of 25 female green turtles laying eggs on Heron Island or the
variability in clutch size of 50 clutches of tiger snake eggs collected in southeastern
Queensland are examples of statistics.
While such statistics are not equal to the population parameters, it is hoped that
they are sufficiently close to the population parameters to be useful or that the poten-
tial error involved can be quantified. Sample statistics along with an understanding
of probability form the foundation for inferences about population parameters. See
Figure 1.1 for review.
Chapter 1 provides techniques for organizing sample data. Chapters 2 through 4
present the necessary probability concepts, and the remaining chapters outline various
techniques to test a wide range of predictions from hypotheses.
Concept Checks. At the end of several of the sections in each chapter we include one or
two questions designed as a rapid check of your mastery of a central idea of the section's
content. These questions will be most helpful if you do each as you encounter it in the
text. Answers to these questions are given at the end of each chapter just before the exercises.
Concept Check 1.1. Which of the following are populations and which are samples?
(a) The weights of 25 randomly chosen eighth grade boys in the Detroit public school
system.
(b) The number of eggs found in each osprey nest on Mt. Desert Island in Maine.
(c) The heights of 15 redwood trees measured in the Muir Woods National Monument,
an old growth coast redwood forest.
(d) The lengths of all the blind cave fish, Astyanax mexicanus, in a small cavern system
in central Mexico.
(a) Continuous variables or interval data can assume any value in some
(possibly unbounded) interval of real numbers. Common examples include
length, weight, temperature, volume, and height. They arise from measure-
ment.
(b) Discrete variables assume only isolated values. Examples include clutch
size, trees per hectare, arms per sea star, or items per quadrat. They arise
from counting.
2. Ranked (ordinal) variables are not measured but nonetheless have a natural
ordering. For example, candidates for political office can be ranked by individual
voters. Or students can be arranged by height from shortest to tallest and
correspondingly ranked without ever being measured. The rank values have no
inherent meaning outside the “order” that they provide. That is, a candidate
ranked 2 is not twice as preferable as the person ranked 1. (Compare this with
measurement variables where a plant 2 feet tall is twice as tall as a plant 1 foot
tall. With measurement variables such ratios are meaningful, while with ordinal
variables they are not.)
3. Categorical data are qualitative data. Some examples are species, gender,
genotype, phenotype, healthy/diseased, and marital status. Unlike with ranked
data, there is no “natural” ordering that can be assigned to these categories.
When measurement variables are collected for either a population or a sample, the
numerical values have to be abstracted or summarized in some way. The summary de-
scriptive characteristics of a population of objects are called population parameters
or just parameters. The calculation of a parameter requires knowledge of the measurement
variable's value for every member of the population. These parameters are
usually denoted by Greek letters and do not vary within a population. The summary
descriptive characteristics of a sample of objects, that is, a subset of the population,
are called statistics. Sample statistics can have different values, depending on how
the sample of the population was chosen. Statistics are denoted by various symbols,
but (almost) never by Greek letters.
X1 = 1, X2 = 6, X3 = 4, X4 = 5, X5 = 6, X6 = 3, X7 = 8, X8 = 7. (1.1)
We would denote the population size with a capital N. In our theoretical population,
N = 8. The population mean µ would be

µ = (1 + 6 + 4 + 5 + 6 + 3 + 8 + 7)/8 = 5.
FORMULA 1.1. The algebraic shorthand formula for a population mean is

µ = (Σ_{i=1}^{N} X_i) / N.
The Greek letter Σ ("sigma") indicates summation. The subscript i = 1 indicates
to start with the first observation, and the superscript N means to continue until and
including the Nth observation. The subscript and superscript may represent other
starting and stopping points for the summation within the population or sample. For
the example above,

Σ_{i=2}^{5} X_i = X_2 + X_3 + X_4 + X_5 = 6 + 4 + 5 + 6 = 21.
If sigma notation is new to you or if you wish a quick review of its properties, read
Appendix A.1 before continuing.
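Sigma notation translates directly into code; a minimal sketch using the small theoretical population above:

```python
# The theoretical population from the text, X_1 through X_8.
X = [1, 6, 4, 5, 6, 3, 8, 7]
N = len(X)

total = sum(X)          # sum_{i=1}^{N} X_i
mu = total / N          # the population mean

# A partial sum with other start and stop points, e.g. sum_{i=2}^{5} X_i.
# Python lists are 0-indexed, so X_2 through X_5 is X[1:5].
partial = sum(X[1:5])

print(mu)       # 5.0
print(partial)  # 21
```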
FORMULA 1.2. The sample mean is defined by

X̄ = (Σ_{i=1}^{n} X_i) / n,

where n is the sample size. The sample mean is usually reported to one more decimal place
than the data and always has appropriate units associated with it.
The symbol X̄ (read "X bar") indicates that the observations of a subset of size n
from a population have been averaged. X̄ is fundamentally different from µ because
samples from a population can have different values for their sample mean; that is,
they can vary from sample to sample within the population. The population mean,
however, is constant for a given population.
Again consider the small theoretical population 1, 6, 4, 5, 6, 3, 8, 7. A sample of size
3 may consist of 5, 3, 4 with X̄ = 4, or 6, 8, 4 with X̄ = 6.
Actually there are 56 possible samples of size 3 that could be drawn from the
population in (1.1). Among them are samples whose sample mean is the same as the
population mean, that is, X̄ = µ; for example:

Sample          Sum               X̄
X3, X6, X7      4 + 3 + 8 = 15    5
X2, X3, X4      6 + 4 + 5 = 15    5
X5, X3, X4      6 + 4 + 5 = 15    5
X8, X6, X4      7 + 3 + 5 = 15    5
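All C(8,3) = 56 samples can be enumerated by brute force to count how many have X̄ = µ; a minimal sketch:

```python
from itertools import combinations

# Brute-force enumeration of every possible sample of size 3 drawn
# without replacement from the theoretical population, counting how
# many have a sample mean equal to the population mean.
X = [1, 6, 4, 5, 6, 3, 8, 7]
mu = sum(X) / len(X)                  # 5.0

samples = list(combinations(X, 3))    # positions are distinct, so 56 samples
matching = [s for s in samples if sum(s) / 3 == mu]

print(len(samples))                   # 56
for s in matching:
    print(s, sum(s) / 3)
```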
Median
The second measure of central tendency is the median. The median is the “middle”
value of an ordered list of observations. Though this idea is simple enough, it will
prove useful to define it in terms of an even simpler notion. The depth of a value
is its position relative to the nearest extreme (end) when the data are listed in order
from smallest to largest.
EXAMPLE 1.1. The table below gives the circumferences at chest height (CCH) (in
cm) and their corresponding depths for 15 sugar maples, Acer saccharum, measured
in a forest in southeastern Ohio.
CCH 18 21 22 29 29 36 37 38 56 59 66 70 88 93 120
Depth 1 2 3 4 5 6 7 8 7 6 5 4 3 2 1

The deepest observation is the eighth, so the median is X̃ = 38 cm.
EXAMPLE 1.2. The table below gives CCH (in cm) for 12 cypress pines, Callitris
preissii, measured near Brown Lake on North Stradbroke Island.
CCH 17 19 31 39 48 56 68 73 73 75 80 122
Depth 1 2 3 4 5 6 6 5 4 3 2 1

The two deepest observations, 56 cm and 68 cm, both have depth 6, so the median is
their average: X̃ = (56 + 68)/2 = 62 cm.
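The depth-based definition can be turned into a short function; a minimal sketch using the data from Examples 1.1 and 1.2:

```python
# Median from an ordered list: with an odd n the median is the single
# deepest value; with an even n it is the average of the two deepest.
def median(ordered):
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

maples = [18, 21, 22, 29, 29, 36, 37, 38, 56, 59, 66, 70, 88, 93, 120]
pines = [17, 19, 31, 39, 48, 56, 68, 73, 73, 75, 80, 122]

print(median(maples))  # 38 (the value at depth 8)
print(median(pines))   # 62.0 (average of 56 and 68, both at depth 6)
```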
Mode
The mode is defined as the most frequently occurring value in a data set. The mode
of Example 1.2 would be 73 cm, while Example 1.1 would have a mode of 29 cm.
In symmetrical distributions the mean, median, and mode are coincident. Bimodal
distributions may indicate a mixture of samples from two populations, for example,
weights of males and females. While the mode is not often used in biological research,
reporting the number of modes, if more than one, can be informative.
Each measure of central tendency has different features. The mean is a purposeful
measure only for a quantitative variable, whether it is continuous (for example, height)
or discrete (for example, clutch size). The median can be calculated whenever a
variable can be ranked (including when the variable is quantitative). Finally, the
mode can be calculated for categorical variables, as well as for quantitative and ranked
variables.
The sample median expresses less information than the sample mean because it
utilizes only the ranks and not the actual values of each measurement. The median,
however, is resistant to the effects of outliers. Extreme values or outliers in a sample
can drastically affect the sample mean, while having little effect on the median.
Consider Example 1.2 with X = 58.4 cm and X̃ = 62 cm. Suppose X12 had been mis-
takenly recorded as 1220 cm instead of 122 cm. The mean X would become 149.9 cm
while the median X̃ would remain 62 cm.
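This resistance to outliers is easy to demonstrate; a sketch using the Example 1.2 data with and without the hypothetical recording error:

```python
# Cypress pine CCH data (cm) as recorded, and with X_12 mistakenly
# entered as 1220 instead of 122.
def mean(data):
    return sum(data) / len(data)

def median(ordered):
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

cch = [17, 19, 31, 39, 48, 56, 68, 73, 73, 75, 80, 122]
bad = [17, 19, 31, 39, 48, 56, 68, 73, 73, 75, 80, 1220]

print(round(mean(cch), 1), median(cch))  # 58.4 62.0
print(round(mean(bad), 1), median(bad))  # 149.9 62.0 -- the median is unmoved
```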
Sample 1 Sample 2
8.9 3.1
9.6 17.0
11.2 9.9
9.4 5.1
9.9 18.0
10.9 3.8
10.4 10.0
11.0 2.9
9.7 21.2
SOLUTION. Upon investigation we see that both samples are the same size and
have the same mean, X̄₁ = X̄₂ = 10.11 kg. In fact, both samples have the same
median. To see this, arrange the data sets in rank order as in Table 1.1. We have
n = 9, so X̃ = X_{(n+1)/2} = X₅, which is 9.9 kg for both samples.

Neither of the samples has a mode. So by all the descriptors in Section 1.4 these
samples appear to be identical. Clearly they are not. The difference in the samples
lies in their dispersion or variability.
TABLE 1.1. The two samples arranged in rank order, with depths

Depth   Sample 1   Sample 2
1       8.9        2.9
2       9.4        3.1
3       9.6        3.8
4       9.7        5.1
5       9.9        9.9
4       10.4       10.0
3       10.9       17.0
2       11.0       18.0
1       11.2       21.2
Range

The simplest measure of dispersion or "spread" of the data is the range.

FORMULAS 1.3. The difference between the largest and smallest observations in a group
of data is called the range:

Sample range = X_n − X_1
Population range = X_N − X_1

When the data are ordered from smallest to largest, the values X_n and X_1 are called the
sample range limits.
Variance

To develop a measure that uses all the data to form an index of dispersion, consider
the following. Suppose we express each observation as a distance from the mean,
X_i − X̄. If we square each of these deviations and add them, the result
is the sum of these squared deviates and is referred to as the corrected sum of
squares, denoted by CSS. Each observation is corrected or adjusted for its distance
from the mean.
FORMULA 1.4. The corrected sum of squares is utilized in the formula for the sample
variance,

s² = Σ_{i=1}^{n} (X_i − X̄)² / (n − 1).

The sample variance is usually reported to two more decimal places than the data and has
units that are the square of the measurement units.
This calculation is not as intuitive as the mean or median, but it is a very good
indicator of scatter or dispersion. If the above formula had n instead of n − 1 in
the denominator, it would be exactly the average squared distance from the mean.
Returning to Example 1.3, the variance of Sample 1 is 0.641 kg² and the variance of
Sample 2 is 49.851 kg², reflecting the larger "spread" in Sample 2.
A sample variance is an unbiased estimator of a parameter called the population
variance.
FORMULA 1.5. A population variance is denoted by σ² ("sigma squared") and is defined
by

σ² = Σ_{i=1}^{N} (X_i − µ)² / N.
It really is the average squared deviation from the mean for the population. The
n − 1 in Formula 1.4 makes it an unbiased estimate of the population parameter. (See
Appendix A.2 for a proof.) Remember that "unbiased" means that the average of all
possible values of s² for a certain size sample will be equal to the population value σ².
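The unbiasedness claim can be verified exactly for the small population above by enumerating every possible sample of size 3 drawn with replacement (the setting the proof assumes); a sketch:

```python
from itertools import product

# Average s^2 over all 8^3 = 512 equally likely ordered samples of
# size 3 drawn WITH replacement; it should equal sigma^2 exactly.
X = [1, 6, 4, 5, 6, 3, 8, 7]
N = len(X)
mu = sum(X) / N
sigma2 = sum((x - mu) ** 2 for x in X) / N   # population variance

def s2(sample):
    xbar = sum(sample) / len(sample)
    return sum((x - xbar) ** 2 for x in sample) / (len(sample) - 1)

all_samples = list(product(X, repeat=3))
avg_s2 = sum(s2(s) for s in all_samples) / len(all_samples)

print(sigma2)             # 4.5
print(round(avg_s2, 10))  # 4.5 -- the average of s^2 equals sigma^2
```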
Formulas 1.4 and 1.5 are theoretical formulas and are rather tedious to apply
directly. Computational formulas utilize the fact that most calculators with statistical
registers simultaneously calculate n, ΣX_i, and ΣX_i².
FORMULA 1.6. The corrected sum of squares Σ(X_i − X̄)² may be computed more simply
as

CSS = ΣX_i² − (ΣX_i)²/n.

ΣX_i² is the uncorrected sum of squares and (ΣX_i)²/n is the correction term.
To verify Formula 1.6, using the properties in Appendix A.1, notice that

Σ(X_i − X̄)² = Σ(X_i² − 2X_iX̄ + X̄²) = ΣX_i² − 2X̄ΣX_i + nX̄².

Remember that X̄ = ΣX_i/n, so nX̄ = ΣX_i; hence

Σ(X_i − X̄)² = ΣX_i² − 2X̄(nX̄) + nX̄² = ΣX_i² − 2nX̄² + nX̄² = ΣX_i² − nX̄².

Substituting ΣX_i/n for X̄ yields

Σ(X_i − X̄)² = ΣX_i² − n(ΣX_i/n)² = ΣX_i² − n(ΣX_i)²/n² = ΣX_i² − (ΣX_i)²/n.
FORMULA 1.7. Use of the computational formula for the corrected sum of squares gives
the computational formula for the sample variance

s² = [ΣX_i² − (ΣX_i)²/n] / (n − 1).
Returning to Example 1.3, Sample 2,

ΣX_i = 91, ΣX_i² = 1318.92, n = 9,

so

s² = [1318.92 − (91)²/9] / (9 − 1) = (1318.92 − 920.11)/8 = 398.81/8 = 49.851 kg².
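The computational and definitional formulas can be checked against each other on the Sample 2 data; a minimal sketch:

```python
# Sample 2 from Example 1.3 (weights in kg).
sample2 = [3.1, 17.0, 9.9, 5.1, 18.0, 3.8, 10.0, 2.9, 21.2]
n = len(sample2)
xbar = sum(sample2) / n

# Corrected sum of squares two ways: Formula 1.4's numerator and
# the computational Formula 1.6.
css_def = sum((x - xbar) ** 2 for x in sample2)
css_comp = sum(x * x for x in sample2) - sum(sample2) ** 2 / n

s2 = css_comp / (n - 1)
print(round(css_def, 2), round(css_comp, 2))  # 398.81 398.81
print(round(s2, 3))                           # 49.851
```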
Remember, the numerator must always be a positive number because it is a sum of
squared deviations. Because the variance has units that are the square of the measurement
units, such as squared kilograms above, it has no direct physical interpretation.
With a similar derivation, the population variance computational formula can be
shown to be

σ² = [ΣX_i² − (ΣX_i)²/N] / N.

Again, this formula is rarely used since most populations are too large to census
directly.
Standard Deviation

FORMULAS 1.8. A more "natural" calculation is the standard deviation, which is the
positive square root of the population or sample variance, respectively:

σ = √{[ΣX_i² − (ΣX_i)²/N] / N}    and    s = √{[ΣX_i² − (ΣX_i)²/n] / (n − 1)}.

These descriptive statistics have the same units as the original observations and are, in a
sense, the average deviation of observations from their mean.
Again, consider Example 1.3. For Sample 1, s = √0.641 = 0.80 kg, and for Sample 2,
s = √49.851 = 7.06 kg.
The standard deviation of a sample is relatively easy to interpret and clearly reflects
the greater variability in Sample 2 compared to Sample 1. Like the mean, the standard
deviation is usually reported to one more decimal place than the data and always has
appropriate units associated with it. Both the variance and standard deviation can be
used to demonstrate differences in scatter between samples or populations.
27  32  30  41  35     X_i's

These five fish have an average length of 33.0 cm. Some are smaller and others larger than
this mean. To get a sense of this variability, let's subtract the average from each data point,
(X_i − 33) = x_i, generating what is called the deviate for each value. The data when rescaled
by subtracting the mean become

−6  −1  −3  +8  +2     x_i's

When we add these deviations, their sum is 0, so their mean is also 0. To quantify these
deviations and, therefore, the sample's variability, we square these deviates to prevent them
from always summing to 0.
(−6)² + (−1)² + (−3)² + (8)² + (2)² = 36 + 1 + 9 + 64 + 4 = 114.

This calculation is called the corrected or rescaled sum of squares (squared deviates).
If we averaged these calculations by dividing the corrected sum of squares by the sample
size n = 5, we would have a measure of the average squared distance of the observations
from their mean. This measure is called the sample variance. However, with samples this
sum is divided by n − 1 rather than n, so that the sample variance is an unbiased
estimate of the population variance.
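The deviate calculations above can be checked directly; a minimal sketch:

```python
# The five fish lengths (cm) and their deviates from the mean.
lengths = [27, 32, 30, 41, 35]
xbar = sum(lengths) / len(lengths)   # 33.0 cm

deviates = [x - xbar for x in lengths]
print(deviates)                      # [-6.0, -1.0, -3.0, 8.0, 2.0]
print(sum(deviates))                 # 0.0 -- deviates always sum to 0
print(sum(d * d for d in deviates))  # 114.0 -- the corrected sum of squares
```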
Standard Error
The most important statistic of central tendency is the sample mean. However, the
mean varies from sample to sample (see page 7). We now develop a method to measure
the variability of the sample mean.
The variance and standard deviation are measures of dispersion or scatter of the
values of the X’s in a sample or population. Because means utilize a number of X’s
in their calculation, they tend to be less variable than the individual X’s. An extreme
value of X (large or small) contributes only one nth of its value to the sample mean
and is, therefore, somewhat dampened out.
A measure of the variability of X̄ then depends on two factors: the variability
in the X's and the number of X's averaged to generate the mean X̄. We utilize two
statistics to estimate this variability.
statistics to estimate this variability.
FORMULAS 1.9. The variance of the sample mean is defined to be

s²/n,

and the standard deviation of the sample mean or, more commonly, the standard error is

SE = s/√n.
The standard error is the more important of these two statistics. Its utility will
become clear in Chapter 4 when the Central Limit Theorem is outlined. The standard
error is usually reported to one more decimal place than the data, or if n is large, to
two more places.
EXAMPLE 1.4. Calculate the variance of the sample mean and the standard error
for the data sets in Example 1.3.
SOLUTION. The sample sizes are both n = 9. For Sample 1, s² = 0.641 kg², so the
variance of the sample mean is

s²/n = 0.641/9 = 0.071 kg²,

and the standard deviation is s = 0.80 kg, so the standard error is

SE = s/√n = 0.80/√9 = 0.27 kg.

For Sample 2, s² = 49.851 kg², so the variance of the sample mean is

s²/n = 49.851/9 = 5.539 kg²,

and s = 7.06 kg, so the standard error is

SE = s/√n = 7.06/√9 = 2.35 kg.
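The computations in Examples 1.3 and 1.4 can be reproduced in a few lines; a minimal sketch:

```python
from math import sqrt

# Mean, variance, standard deviation, variance of the sample mean,
# and standard error for a sample.
def describe(data):
    n = len(data)
    xbar = sum(data) / n
    s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)
    return xbar, s2, sqrt(s2), s2 / n, sqrt(s2) / sqrt(n)

sample1 = [8.9, 9.6, 11.2, 9.4, 9.9, 10.9, 10.4, 11.0, 9.7]
sample2 = [3.1, 17.0, 9.9, 5.1, 18.0, 3.8, 10.0, 2.9, 21.2]

for data in (sample1, sample2):
    xbar, s2, s, var_mean, se = describe(data)
    print(round(xbar, 2), round(s2, 3), round(s, 2),
          round(var_mean, 3), round(se, 2))
```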
Concept Check 1.2. The following data are the carapace (shell) lengths in centimeters of
a sample of adult female green turtles, Chelonia mydas, measured while nesting at Heron
Island in Australia’s Great Barrier Reef. Calculate the following descriptive statistics for this
sample: sample mean, sample median, corrected sum of squares, sample variance, standard
deviation, standard error, and range. Remember to use the appropriate number of decimal
places in these descriptive statistics and to include the correct units with all statistics.
1.6 Descriptive Statistics for Frequency Tables

X_i (plants/quadrat)    f_i
0                       268
1                       316
2                       135
3                       61
4                       15
5                       3
6                       1
7                       1
To calculate the sample descriptive statistics using Formulas 1.2, 1.7, and 1.8 would
be quite arduous, involving sums and sums of squares of 800 numbers. Fortunately,
the following formulas limit the drudgery for these calculations.
It is clear that X_1 = 0 occurs f_1 = 268 times, X_2 = 1 occurs f_2 = 316 times, etc.,
and that the sum of observations in the first category is f_1X_1, the sum in the second
category is f_2X_2, etc. The sum of all observations is, therefore,

f_1X_1 + f_2X_2 + · · · + f_cX_c = Σ_{i=1}^{c} f_iX_i,

where c denotes the number of categories. The total number of observations is
n = Σ_{i=1}^{c} f_i, and as a result:
FORMULA 1.10. The sample mean for a grouped data set is given by

X̄ = Σ_{i=1}^{c} f_iX_i / Σ_{i=1}^{c} f_i.
Similarly, the computational formula for the sample variance for a grouped data set
can be derived directly from

s² = Σ_{i=1}^{c} f_i(X_i − X̄)² / (n − 1).
FORMULA 1.11. The sample variance for a grouped data set is given by

s² = [Σ_{i=1}^{c} f_iX_i² − (Σ_{i=1}^{c} f_iX_i)²/n] / (n − 1),

where n = Σ_{i=1}^{c} f_i.
To apply Formulas 1.10 and 1.11, we need to calculate only three sums:

• The sample size n = Σf_i
• The sum of observations Σf_iX_i
• The uncorrected sum of squared observations Σf_iX_i²
Returning to Example 1.5, it is now straightforward to calculate X, s2 , and s.
X_i     f_i     f_iX_i     f_iX_i²
0       268     0          0
1       316     316        316
2       135     270        540
3       61      183        549
4       15      60         240
5       3       15         75
6       1       6          36
7       1       7          49

Totals  800     857        1805
Note that column 4 in the table above is generated by first squaring X_i and then
multiplying by f_i, not by squaring the values in column 3. In other words,
f_iX_i² ≠ (f_iX_i)².
The sample mean is

X̄ = Σ_{i=1}^{c} f_iX_i / Σ_{i=1}^{c} f_i = 857/800 = 1.1 plants/quadrat,

the sample variance is

s² = [Σ_{i=1}^{c} f_iX_i² − (Σ_{i=1}^{c} f_iX_i)²/n] / (n − 1)
   = [1805 − (857)²/800] / (800 − 1) = 1.11 (plants/quadrat)²,

and the sample standard deviation is

s = √1.11 = 1.1 plants/quadrat.
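Formulas 1.10 and 1.11 are easy to apply in code as well; a sketch using the frequency table above:

```python
from math import sqrt

# Grouped-data formulas applied to the plants-per-quadrat table.
freq = {0: 268, 1: 316, 2: 135, 3: 61, 4: 15, 5: 3, 6: 1, 7: 1}

n = sum(freq.values())                              # sum f_i
sum_fx = sum(f * x for x, f in freq.items())        # sum f_i X_i
sum_fx2 = sum(f * x * x for x, f in freq.items())   # sum f_i X_i^2

xbar = sum_fx / n                                   # Formula 1.10
s2 = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)          # Formula 1.11

print(n, sum_fx, sum_fx2)                # 800 857 1805
print(round(xbar, 1))                    # 1.1
print(round(s2, 2), round(sqrt(s2), 1))  # 1.11 1.1
```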
Example 1.5 summarized data for a discrete variable taking on whole number
values from 0 to 7. Continuous variables can also be presented as grouped data in
frequency tables.
EXAMPLE 1.6. The following data were collected by randomly sampling a large
population of rainbow trout, Salmo gairdnerii. The variable of interest is weight in
pounds.
Xi (lb) fi f i Xi fi Xi2
1 2 2 2
2 1 2 4
3 4 12 36
4 7 28 112
5 13 65 325
6 15 90 540
7 20 140 980
8 24 192 1536
9 7 63 567
10 9 90 900
11 2 22 242
12 4 48 576
13 2 26 338
Rainbow trout have weights that can range from almost 0 to 20 lb or more. Moreover,
their weights can take on any value in that interval. For example, a particular
trout may weigh 7.3541 lb. When data are grouped as in Example 1.6, intervals are
implied for each class. A fish in the 3-lb class weighs somewhere between 2.50 and
3.49 lb, and a fish in the 9-lb class weighs between 8.50 and 9.49 lb. Fish were weighed
to the nearest pound, allowing analysis of grouped data for a continuous measurement
variable. In Example 1.6,
X̄ = Σ_{i=1}^{c} f_iX_i / Σ_{i=1}^{c} f_i = 780/110 = 7.1 lb

and

s² = [Σ_{i=1}^{c} f_iX_i² − (Σ_{i=1}^{c} f_iX_i)²/n] / (n − 1)
   = [6158 − (780)²/110] / (110 − 1) = 5.75 lb².

Therefore,

s = √5.75 = 2.4 lb.
Again, consider that calculation time is saved by working with 13 classes instead
of 110 individual observations. Whether measuring the rainbow trout to the nearest
pound was appropriate will be considered in Section 1.10.
Additive Coding
Additive coding involves the addition or subtraction of a constant from each observa-
tion in a data set. Suppose the data gathered in Example 1.6 were collected using a
scale that weighed the fish 2 lb too low. We could go back to the data and add 2 lb to
each observation and recalculate the descriptive statistics. A more efficient tack would
be to realize that if a fixed amount c is added or subtracted from each observation in
a data set, the sample mean will be increased or decreased by that amount, but the
variance will be unchanged.
To see why, if X̄_c is the coded mean, then

X̄_c = Σ(X_i + c)/n = (ΣX_i + Σc)/n = (ΣX_i + nc)/n = ΣX_i/n + c = X̄ + c.

If s²_c is the coded sample variance, then

s²_c = Σ[(X_i + c) − (X̄ + c)]²/(n − 1) = Σ(X_i + c − X̄ − c)²/(n − 1) = Σ(X_i − X̄)²/(n − 1) = s²,

and therefore s_c = s.

If the scale weighed 2 lb light in Example 1.6, the new, corrected statistics would
be X̄_c = 7.1 + 2.0 = 9.1 lb, s²_c = 5.75 lb², and s_c = 2.4 lb.
Multiplicative Coding
Multiplicative coding involves multiplying or dividing each observation in a data set by
a constant. Suppose the data in Example 1.6 were to be presented at an international
conference and, therefore, had to be presented in metric units (kilograms) rather than
English units (pounds). Since 1 kg equals 2.20 lb, we could convert the observations to
kilograms by multiplying each observation by 1/2.20 or 0.45 kg/lb. Again, the more
efficient approach would be to realize the following.
If each of the observations in a data set is multiplied by a fixed quantity c, the new
mean is c times the old mean because

X̄_c = ΣcX_i/n = cΣX_i/n = cX̄.

Further, the new variance is c² times the old variance because

s²_c = Σ(cX_i − cX̄)²/(n − 1) = Σ[c(X_i − X̄)]²/(n − 1) = c²Σ(X_i − X̄)²/(n − 1) = c²s²,

and from this it follows that the new standard deviation is c times the old standard
deviation, s_c = cs. (Remember, too, that division is just multiplication by a fraction.)
To convert the summary statistics of Example 1.6 to metric we simply utilize the
formulas above with c = 0.45 kg/lb:

X̄_c = cX̄ = (0.45 kg/lb)(7.1 lb) = 3.20 kg.
s²_c = c²s² = (0.45 kg/lb)²(5.75 lb²) = 1.164 kg².
s_c = cs = (0.45 kg/lb)(2.4 lb) = 1.08 kg.
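Both coding rules can be verified numerically; a minimal sketch on a small made-up data set (the values here are illustrative, not from the text):

```python
# Additive coding shifts the mean and leaves the variance alone;
# multiplicative coding scales the mean by c and the variance by c^2.
data = [4.0, 7.0, 8.0, 9.0, 12.0]   # hypothetical observations

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    xb = mean(xs)
    return sum((x - xb) ** 2 for x in xs) / (len(xs) - 1)

shifted = [x + 2 for x in data]      # additive coding, c = 2
scaled = [0.45 * x for x in data]    # multiplicative coding, c = 0.45

print(mean(shifted) == mean(data) + 2)                   # True
print(abs(var(shifted) - var(data)) < 1e-12)             # True
print(abs(mean(scaled) - 0.45 * mean(data)) < 1e-12)     # True
print(abs(var(scaled) - 0.45 ** 2 * var(data)) < 1e-12)  # True
```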
Our understanding of the effects of coding on descriptive statistics can sometimes
help determine the nature of experimental manipulations of variables.
SOLUTION. We have two choices here: the effect of the fertilizer could be additive,
increasing each value by 50 g (X_i + 50), or it could be multiplicative, doubling each
value (2X_i). In the first case we expect the yield of the new variety with fertilizer
to be 150 g + 50 g = 200 g. In the second case we expect the yield of the new variety
with fertilizer to be 2 × 150 g = 300 g. To differentiate between these possibilities
we must look at the variance in yield of the original variety with and without
fertilizer. If the effect of the fertilizer is additive, the variances with and without
fertilizer should be similar because additive coding doesn't affect the variance:
X_i + 50 yields s^2, the original sample variance. If the effect is to double the yield,
the variance of yields with fertilizer should be four times the variance without
fertilizer because multiplicative coding increases the variance by the square of the
constant used in coding: 2X_i yields 4s^2, so doubling the yield increases the sample
variance fourfold.
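The diagnostic in this solution can be checked numerically. Below is a minimal Python sketch (our language choice, not the text's) using hypothetical yield data, since the original variety's individual yields are not listed.

```python
# Additive coding X + 50 leaves the sample variance unchanged;
# multiplicative coding 2X multiplies it by 2^2 = 4.
import statistics

yields = [140.0, 150.0, 155.0, 160.0, 145.0]   # hypothetical yields in g

s2 = statistics.variance(yields)
s2_add = statistics.variance([x + 50 for x in yields])   # fertilizer adds 50 g
s2_mult = statistics.variance([2 * x for x in yields])   # fertilizer doubles yield

assert abs(s2_add - s2) < 1e-9        # additive: variance unchanged
assert abs(s2_mult - 4 * s2) < 1e-9   # multiplicative: variance quadrupled
```

Comparing the observed fertilized variance against s^2 and 4s^2 is what distinguishes the two hypothesized effects.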
TABLE 1.2. The relative frequencies, cumulative frequencies, and relative cumu-
lative frequencies for Example 1.5

X_i              f_i         (f_i/n)(100)   Σ_{i=1}^r f_i   (Σ_{i=1}^r f_i / n)(100)
Plants/quadrat   Frequency   Relative       Cumulative      Relative cumulative
                             frequency      frequency       frequency
relative frequencies. See Figure 1.2. In a bar graph the bar heights are the relative
frequencies. The bars are of equal width and spaced equidistantly along the horizontal
axis. Because these data are discrete, that is, because they can only take certain values
along the horizontal axis, the bars do not touch each other.
[FIGURE 1.2. A bar graph of the relative frequencies for Example 1.5. Vertical
axis: Relative frequency (0 to 40%); horizontal axis: Plants/quadrat (0 to 8).]
The data in Example 1.6 can be summarized in a similar fashion with relative
frequency, cumulative frequency, and relative cumulative frequency columns. See
Table 1.3.
TABLE 1.3. The relative frequencies, cumulative frequencies, and relative cumu-
lative frequencies for Example 1.6

X_i   f_i   Relative        Cumulative   Relative cumulative
            frequency (%)   frequency    frequency (%)
 1      2      1.82            2             1.82
 2      1      0.91            3             2.73
 3      4      3.64            7             6.36
 4      7      6.36           14            12.73
 5     13     11.82           27            24.55
 6     15     13.64           42            38.18
 7     20     18.18           62            56.36
 8     24     21.82           86            78.18
 9      7      6.36           93            84.55
10      9      8.18          102            92.73
11      2      1.82          104            94.55
12      4      3.64          108            98.18
13      2      1.82          110           100.00
Σ     110    100.00
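The three derived columns of such a table can be recomputed from the frequency counts alone. A short Python sketch (the language is our choice, not the text's), using the counts from Table 1.3:

```python
# Build relative, cumulative, and relative cumulative frequency columns
# from the raw frequency counts f_i of Table 1.3.
from itertools import accumulate

freqs = {1: 2, 2: 1, 3: 4, 4: 7, 5: 13, 6: 15, 7: 20, 8: 24,
         9: 7, 10: 9, 11: 2, 12: 4, 13: 2}     # X_i -> f_i
n = sum(freqs.values())                         # 110 fish in all

cum = list(accumulate(freqs.values()))          # running totals of f_i
for (x, f), c in zip(freqs.items(), cum):
    rel = 100 * f / n                           # relative frequency (%)
    rel_cum = 100 * c / n                       # relative cumulative frequency (%)
    print(f"{x:2d} {f:3d} {rel:6.2f} {c:4d} {rel_cum:7.2f}")
```

The printed rows reproduce the table; for instance the X_i = 8 row gives a cumulative frequency of 86 and a relative cumulative frequency of 78.18%.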
Because the data in Example 1.6 are continuous measurement data, with each class
implying a range of possible values for X_i (for example, X_i = 3 implies each fish
weighed between 2.50 lb and 3.49 lb), the pictorial representation of the data set is
a histogram, not a bar graph. Histograms have the observation classes along the
horizontal axis. The area of each strip represents the relative frequency. (If the classes
of the histogram are of equal width, as they often are, then the heights of the strips
will represent the relative frequency, as in a bar graph.) See Figure 1.3. The strips
in this case touch each other because each X value corresponds to a range of possible
values.
[Histogram: vertical axis Relative frequency (0 to 25%); horizontal axis Weight in
pounds (2 to 14).]
FIGURE 1.3. A histogram for the relative frequencies for Example 1.6.
While the categories in a bar graph are predetermined because the data are dis-
crete, the classes representing ranges of continuous data values must be selected by
the investigator. In fact, it is sometimes revealing to create more than one histogram
of the same data by employing classes of different widths.
EXAMPLE 1.8. The list below gives snowfall measurements for 50 consecutive years
(1951–2000) in Syracuse, NY (in inches per year). The data have been rearranged
in order of increasing annual snowfall. Create a histogram using classes of width
30 inches and then create a histogram using narrower classes of width 15 inches.
(Source: http://neisa.unh.edu/Climate/IndicatorExcelFiles.zip)
71.7 73.4 77.8 81.6 84.1 84.1 84.3 86.7 91.3 93.8
93.9 94.4 97.5 97.6 98.1 99.1 99.9 100.7 101.0 101.9
102.1 102.2 104.8 108.3 108.5 110.2 111.0 113.3 114.2 114.3
116.2 119.2 119.5 122.9 124.0 125.7 126.6 130.1 131.7 133.1
135.3 145.9 148.1 149.2 153.8 160.9 162.6 166.1 172.9 198.7
SOLUTION. Use the same scale for the horizontal axis (inches of annual snowfall) in
both histograms. Remember that the area of a strip represents the relative frequency
of the associated class. Since the snowfall classes of the second histogram (15 in) are
one-half as wide as those of the first histogram (30 in), the vertical scale must be
multiplied by a factor of 2 so that equal areas in each histogram represent the same
relative frequencies. Thus, a single year in the second histogram is represented by a
strip half as wide but twice as tall as in the first histogram, as indicated in the key
in the upper left corner of each diagram.
In this case, the narrower classes of the second histogram provide more informa-
tion. For example, nearly one-third of all recent winters in Syracuse have produced
snowfalls in the 90–105 inch range, and one year had a very large snowfall of
approximately 200 in. While one could garner this same information from the data
themselves, normally one would use a (single) histogram to summarize data and
not list the entire data set.
[Histogram with classes of width 30 in: vertical axis Relative frequency (0 to 50%);
horizontal axis Snowfall in inches per year (0 to 210); key: one box = 1 yr.]

[Histogram with classes of width 15 in: vertical axis Relative frequency (0 to 35%);
horizontal axis Snowfall in inches per year (0 to 210); key: one box = 1 yr.]
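The class counts behind both histograms can be recomputed directly from the 50 snowfall observations. A short Python sketch (our language choice; half-open classes of the form [lower, lower + width) are an assumption about how boundary values are assigned):

```python
# Count the Syracuse snowfall observations into classes of width 30 in
# and width 15 in, as in the two histograms of Example 1.8.
snow = [71.7, 73.4, 77.8, 81.6, 84.1, 84.1, 84.3, 86.7, 91.3, 93.8,
        93.9, 94.4, 97.5, 97.6, 98.1, 99.1, 99.9, 100.7, 101.0, 101.9,
        102.1, 102.2, 104.8, 108.3, 108.5, 110.2, 111.0, 113.3, 114.2, 114.3,
        116.2, 119.2, 119.5, 122.9, 124.0, 125.7, 126.6, 130.1, 131.7, 133.1,
        135.3, 145.9, 148.1, 149.2, 153.8, 160.9, 162.6, 166.1, 172.9, 198.7]

def class_counts(data, width, lo=0, hi=210):
    """Count observations in the half-open classes [lo, lo+width), ..."""
    return {e: sum(1 for x in data if e <= x < e + width)
            for e in range(lo, hi, width)}

wide = class_counts(snow, 30)     # classes of width 30 in
narrow = class_counts(snow, 15)   # classes of width 15 in

print(wide[90])     # years with 90 <= snowfall < 120  -> 25
print(narrow[90])   # years with 90 <= snowfall < 105  -> 15
```

The narrow count confirms the observation in the solution: 15 of the 50 winters, nearly one-third, fall in the 90–105 inch class.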
The sample IQR describes the spread of the middle 50% of the sample, that is, the
di↵erence between the first and third quartiles. As such, it is a measure of variability
and is commonly reported with the median.
EXAMPLE 1.9. Find the first and third quartiles and the IQR for the cypress pine
data in Example 1.2.
CCH 17 19 31 39 48 56 68 73 73 75 80 122
Depth 1 2 3 4 5 6 6 5 4 3 2 1
SOLUTION. The median depth is (12 + 1)/2 = 6.5, so there are six observations below
the median. The quartile depth is the median depth of these six observations:
(6 + 1)/2 = 3.5. So the first quartile is Q1 = (31 + 39)/2 = 35 cm. Similarly, the
depth for the third quartile is also 3.5 (from the right), so Q3 = (73 + 75)/2 = 74 cm.
Finally, the IQR = Q3 − Q1 = 74 − 35 = 39 cm.
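The depth method of this example is mechanical enough to script. A minimal Python sketch (our language choice) applied to the same sorted CCH data:

```python
# Quartiles and IQR by the depth method of Example 1.9, for n = 12.
cch = [17, 19, 31, 39, 48, 56, 68, 73, 73, 75, 80, 122]  # sorted CCH data

n = len(cch)
median_depth = (n + 1) / 2        # 6.5, so 6 observations lie below the median
quartile_depth = (6 + 1) / 2      # 3.5: average the 3rd and 4th values in

q1 = (cch[2] + cch[3]) / 2        # 3rd and 4th observations from the left
q3 = (cch[-3] + cch[-4]) / 2      # 3rd and 4th observations from the right
iqr = q3 - q1

print(q1, q3, iqr)   # -> 35.0 74.0 39.0
```

Note that other software may use slightly different quartile conventions; this sketch follows the text's depth rule exactly.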