John D. Nagy
Scottsdale Community College
1 Introduction to Biostatistics
  1.1 What Does “Biostatistics” Mean?
  1.2 Variation is the root cause of uncertainty
  1.3 Population and Sample
  1.4 Data in Biology
    1.4.1 Representing Data Mathematically
    1.4.2 Classification of Random Variables
  1.5 Summarizing Data
    1.5.1 The Data Frame
    1.5.2 The Frequency Distribution
    1.5.3 Box plots
    1.5.4 Bar graphs
  1.6 Exercises
iv CONTENTS
2. Finite and infinite sums: When we write a sum of a finite series of numbers
$x_1, x_2, \ldots, x_n$ we will write
$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n. \qquad (1)$$
$$n! = 1 \times 2 \times \cdots \times n, \qquad (4)$$
5. Combinations: The number of ways one can choose k objects out of a set of n
objects without replacement and without regard to order is
$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}. \qquad (5)$$
The notation $\binom{n}{k}$ is read “n choose k”.
6. Logarithms: The standard (base 10) logarithm of x will be written log x, and the
natural logarithm (base e) will be written ln x. By definition,
$$\log x = y \text{ if and only if } 10^{y} = x, \qquad (6)$$
and
$$\ln x = y \text{ if and only if } e^{y} = x. \qquad (7)$$
0.1. USEFUL NOTATION v
Figure 1: The standard normal or Gaussian distribution function N(x), plotted for x between −4 and 4.
Vertical lines at x = −2 and x = 2 represent integration boundaries for an example in the text.
7. Absolute value: The absolute value of a number x is written |x|. If x > 0 then
|x| = |−x| = x.
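The notation for sums, factorials, combinations and logarithms can be checked numerically. The following is a minimal Python sketch using only the standard library; the list `x` and the values of `n` and `k` are made-up illustrations, not data from this manual.

```python
import math

# Hypothetical numbers x_1, ..., x_n, chosen only for illustration.
x = [2.0, 3.5, 1.5, 4.0]

# Equation (1): the finite sum x_1 + x_2 + ... + x_n.
total = sum(x)

# Equation (4): n! = 1 * 2 * ... * n, here with n = 5.
n, k = 5, 2
n_factorial = math.factorial(n)

# Equation (5): "n choose k", computed directly and from the factorial formula.
n_choose_k = math.comb(n, k)
from_formula = math.factorial(n) // (math.factorial(k) * math.factorial(n - k))

# Logarithms: log x is base 10, ln x is base e.
log_1000 = math.log10(1000.0)  # the y for which 10**y = 1000
ln_7 = math.log(7.0)           # the y for which e**y = 7

print(total, n_factorial, n_choose_k, from_formula)
print(log_1000, math.exp(ln_7))
```

Raising e to the power `ln_7` should recover 7 up to floating-point rounding, which is the "if and only if" statement in the definition of the natural logarithm.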
If the function f (x) is never negative, as will always be the case in the problems we
will consider in this manual, you may think of the integral of f (x) as the area of
the region bounded by the graph of f on top, the horizontal or so-called “x” axis
on the bottom, and vertical lines through a and b on each side. For example, figure
1 shows a graph of the standard normal or Gaussian curve (sometimes called the
“bell curve”). If this curve is f (x), then the area of the region between x = −2
and x = 2 is the integral $\int_{-2}^{2} f(x)\,dx$. Similarly, the area under the normal curve
from x = 2 to infinity on the right is $\int_{2}^{\infty} f(x)\,dx$, and the area under the curve from
x = −2 to negative infinity on the left is $\int_{-\infty}^{-2} f(x)\,dx$. The normal curve has the
property that
$$\int_{-\infty}^{\infty} f(x)\,dx = 1. \qquad (9)$$
Therefore,
$$\int_{-\infty}^{-2} f(x)\,dx + \int_{-2}^{2} f(x)\,dx + \int_{2}^{\infty} f(x)\,dx = 1. \qquad (10)$$
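The three areas in equation (10) can be evaluated numerically. The sketch below is one way to do so in Python; it assumes the standard identity that the area under the standard normal curve to the left of x equals (1 + erf(x/√2))/2, which is not stated in this manual, and the helper name `normal_cdf` is chosen here for illustration.

```python
import math

def normal_cdf(x):
    """Area under the standard normal curve from -infinity to x."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

left_tail = normal_cdf(-2.0)                 # integral from -infinity to -2
middle = normal_cdf(2.0) - normal_cdf(-2.0)  # integral from -2 to 2
right_tail = 1.0 - normal_cdf(2.0)           # integral from 2 to infinity

print(round(left_tail, 4), round(middle, 4), round(right_tail, 4))
print(left_tail + middle + right_tail)       # total area, as in equation (10)
```

The two tails are equal because the curve is symmetric about x = 0, and nearly all of the area lies between the two vertical lines drawn in Figure 1.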
Chapter 1
Introduction to Biostatistics
2. Individual variation: Individuals themselves may vary over time. For example,
suppose we were measuring the weights of wolves in a pack. Our results would
depend in part on when the animals last ate. In another (true) example, in prepa-
ration for a retinal angiogram a patient’s blood pressure was measured and found
to be worrisomely high. The patient indicated that he had to urinate, which he
therefore was allowed to do. Afterwards his blood pressure was well within the
healthy range.
Variation in a study is not always bad, and in fact the statistical subdisciplines of
experimental and survey design focus primarily on converting “bad” variation into “good”
variation, or at least variation that will not damage our study. One form of variation –
called bias – is extremely bad. We all have an idea of what this word means in common
usage, but to an empirical scientist bias means something very specific.
Definition 1.2 Bias or systematic error is a form of error that causes our sample
or experiment to generate a systematically incorrect estimate of a statistic.
In some cases bias can be recognized immediately, but most of the time it is very hard
to spot.
Example 1.2: Students in an introductory biology lab used a digital caliper to measure the length
of 20 mesquite leaves. After taking their measurements they discovered that when they closed the
caliper completely, instead of reading 0 mm it read 1.2 mm. They had failed to zero the caliper
properly before taking their measurements, so each reading was 1.2 mm larger than the leaf actually
was. This is an example of a type of bias called “instrumentation bias.”
Example 1.3: Another group of students in the same class as those in Example 1.2 calibrated
their caliper properly before making measurements. But they tended to be sloppy in the use of
the caliper, so their readings often deviated astonishingly far from the actual length of the leaf.
However, it appeared that they were equally likely to overestimate the leaf length as they were to
underestimate it. In that case, their results were hampered by instrumentation error, but since these
errors were not systematic, they were not biased.
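The difference between Examples 1.2 and 1.3 can be illustrated with a small simulation. This is only a sketch under stated assumptions: the true leaf length of 30 mm, the 2 mm spread of the sloppy readings, and the use of normally distributed errors are all hypothetical choices, and only the 1.2 mm offset comes from Example 1.2.

```python
import random

def mean_error(offset_mm, noise_sd_mm, n_readings, seed=0):
    """Average of (reading - true length) over simulated caliper readings."""
    rng = random.Random(seed)
    true_length = 30.0  # hypothetical true leaf length in mm
    errors = []
    for _ in range(n_readings):
        # Each reading is the truth plus a fixed offset plus random noise.
        reading = true_length + offset_mm + rng.gauss(0.0, noise_sd_mm)
        errors.append(reading - true_length)
    return sum(errors) / n_readings

# Example 1.2: un-zeroed caliper, every reading 1.2 mm too large (systematic error).
biased = mean_error(offset_mm=1.2, noise_sd_mm=0.0, n_readings=20)

# Example 1.3: sloppy readings with no offset (random error only).
sloppy = mean_error(offset_mm=0.0, noise_sd_mm=2.0, n_readings=100_000)

print(biased, sloppy)
```

With a systematic offset the average error stays at 1.2 mm no matter how many leaves are measured. With purely random error the average error shrinks toward zero as the number of readings grows, even though individual readings may be far off.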
Example 1.4: Suppose that Denmark-Wahnfried et al. (from Example 1.1) included only men
of European descent in their study, but they are trying to make inferences about all men in the
United States. The study’s results suggest that pattern baldness may predict prostate cancer risk
because men with prostate cancer tended to be more bald on the crown by age 30 than those
without prostate cancer. (For now we will ignore the statistical uncertainty of their results.) But
we know from previous studies that men of primarily European descent (“white”) have a relatively
low prostate cancer risk compared to men of primarily African descent (“black”) (see Figure 1.1).
That fact alone appears to make a study based only on white men biased. However, in reality it
may or may not be biased. The fact that white and black men do not have the same incidence of
prostate cancer is only part of the story. The study looked at pattern baldness and prostate cancer.
So, even though black men are more susceptible to prostate cancer than white men, the proportion
of men with prostate cancer who were bald on the crown by age 30 could still be the same for both
black and white men, in which case pattern baldness would predict prostate cancer risk equally well
for both black and white men.
Figure 1.1: Incidence (cases per 100,000 men of a given race, vertical axis) of prostate cancer by year
(horizontal axis) in men of primarily European descent (“White”, grey bars) and primarily African
descent (“Black”, open bars) in the United States between 1973 and 1996. Data are from the
Surveillance, Epidemiology and End Results Program’s population-based registries in Atlanta, Detroit,
Seattle-Puget Sound, San Francisco-Oakland, Connecticut, Iowa, New Mexico, Utah and Hawaii,
available at the Centers for Disease Control and Prevention’s National Center for Health Statistics
website (www.cdc.gov/nchs/fastats/cancer.htm).
Bias can arise in various ways in a biological study. One common way we’ve already
seen in Example 1.4. In this case at least one subset of the population (African American
men) was not sampled. Such a situation results in selection bias, in which case the
sampling procedure introduces uncertainty that masks the true relationships among the
1.3. POPULATION AND SAMPLE 5
studied the dangers for at least hundreds if not thousands of years, and we understand
them pretty well. In fact, humanity’s efforts in this direction led to a whole new field of
intellectual endeavor – it’s called statistics.
Definition 1.3 A statistical population is the entire set of elements (humans, cells,
wolves, bees, all possible experiments of a particular type, etc.) that we wish to describe
or about which we wish to make inferences. If the population is both finite and well
defined, we often denote the population size by N .
Definition 1.4 A statistical sample is the subset of elements from the population that
we actually study. The sample size is the precise number of elements in the sample and
is typically denoted n.
So, the sample is what we use to draw conclusions about the population. In a well-
designed study, both will be clearly defined, although in most cases it will be impossible
to assign a value to N because the population size is either not well-defined or not finite.
Note that the population can be well defined when the population size is not.
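The relationship between N and n can be made concrete with a toy example. The sketch below assumes a hypothetical finite population of 500 wolves identified by number, and it picks the sample at random without replacement, which is one common choice and not the only way to form a sample.

```python
import random

# Hypothetical finite, well-defined population: ID numbers for 500 wolves.
population = list(range(1, 501))
N = len(population)  # population size

# The sample is the subset of the population we actually study.
rng = random.Random(1)  # fixed seed so the sketch is repeatable
n = 25                  # sample size
sample = rng.sample(population, n)  # n distinct elements, drawn without replacement

print(N, n, sorted(sample)[:5])
```

Every element of `sample` is also an element of `population`, and no element appears twice, so the sample is a subset of the population in the sense of Definition 1.4.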
Definition 1.5 A continuous random variable is a variable that can take any real
value in a given interval. For example, the length of a milkweed bug in Example 1.6 is
a continuous random variable. I’ll call data represented as a continuous random variable
continuous data.
Definition 1.7 A data frame is a data table in which columns represent variables and
rows represent elements on which data were taken.
To get a feel for how to build a data frame, let’s practice with the following simple
study.
Example 1.7: Human blood consists of two major components: 1) the formed elements including
red blood corpuscles, white blood cells and platelets, and 2) plasma. Clinicians frequently measure
the proportion of blood (by volume) that consists of formed elements, which is called the hematocrit,
usually expressed as a percent. In a study conducted by Dr. Kim Cooper in one of his histology
classes, each of 49 students measured their own hematocrit. The data are presented simply in Table
1.1. Since hematocrit can take on any value between 0% and 100%, these are continuous data.
Table 1.1 is nice and concise, but it is not a data frame and one should not record
raw data in this way. One can use something like Table 1.1 to summarize results, but
recording raw data in that form makes analysis by computer very difficult.
So, how do we build a data frame for this study? Note that there are a total of 49
students in the study, so our data frame will have 49 rows plus a header row. For every
student, two variables were recorded: gender and hematocrit. So, our data frame will
have two columns plus an identification column. Table 1.2 shows an example of a data
frame, actually two halves of the data frame side-by-side, for this data set.
Table 1.2. Data frame for the hematocrit study on 49 histology students conducted by
Dr. Kim Cooper.
Student  Gender  Hematocrit   Student  Gender  Hematocrit
1        Female  43.0         26       Female  43.0
2        Female  38.6         27       Female  41.0
3        Female  36.4         28       Female  41.0
4        Female  40.4         29       Female  41.0
5        Female  41.5         30       Female  37.0
6        Female  28.0         31       Female  40.6
7        Female  37.0         32       Female  36.0
8        Female  42.5         33       Female  40.0
9        Female  40.5         34       Male    37.0
10       Female  39.1         35       Male    40.8
11       Female  42.5         36       Male    44.0
12       Female  40.5         37       Male    40.0
13       Female  44.5         38       Male    44.0
14       Female  40.9         39       Male    44.0
15       Female  39.0         40       Male    45.0
16       Female  40.0         41       Male    47.5
17       Female  41.0         42       Male    42.0
18       Female  40.0         43       Male    46.0
19       Female  41.0         44       Male    41.0
20       Female  45.0         45       Male    44.0
21       Female  40.0         46       Male    44.0
22       Female  42.5         47       Male    45.5
23       Female  36.5         48       Male    46.9
24       Female  40.0         49       Male    44.0
25       Female  47.0
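To see why this layout suits computer analysis, here is a minimal Python sketch that stores a few rows of Table 1.2 as a data frame and writes them out as CSV text. Only five of the 49 rows are copied, to keep the example short; the variable names and the choice of the CSV format are illustrative assumptions, not part of the original study.

```python
import csv
import io

# A few rows copied from Table 1.2 (students 1-3 and 34-35); the full
# data frame would hold all 49 rows in the same layout.
rows = [
    {"Student": 1, "Gender": "Female", "Hematocrit": 43.0},
    {"Student": 2, "Gender": "Female", "Hematocrit": 38.6},
    {"Student": 3, "Gender": "Female", "Hematocrit": 36.4},
    {"Student": 34, "Gender": "Male", "Hematocrit": 37.0},
    {"Student": 35, "Gender": "Male", "Hematocrit": 40.8},
]

# Write the data frame as CSV text: one header row, then one row per student.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["Student", "Gender", "Hematocrit"],
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Because each column is one variable, pulling out a subset is a one-liner.
male_hematocrits = [r["Hematocrit"] for r in rows if r["Gender"] == "Male"]
print(male_hematocrits)
```

Each row is one element (a student) and each column is one variable, so filtering by gender or extracting the hematocrit column requires no restructuring of the data.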
Table 1.3. Absolute and relative frequencies of the hematocrit data in Table 1.2.
Interval widths are 4%, n = 49. Intervals include endpoints.
Interval         Midpoint (%)  Frequency  Rel. Frequency
28.0% – 31.9%    30            1          0.02
32.0% – 35.9%    34            0          0
36.0% – 39.9%    38            9          0.18
40.0% – 43.9%    42            25         0.51
44.0% – 47.9%    46            14         0.29
the first (data) row, third column from the left (value = 1) expresses the number of hematocrit
measurements that fell in the interval 28.0% to 31.9%, inclusive. The second column indicates the
midpoint of the interval. The fourth column, titled “Rel. Frequency,” represents the proportion
of all observations that fell in the appropriate interval. This table tells us that 51% of students in
the sample had hematocrits between 40.0% and 43.9%.
The choice of interval width is arbitrary. In practice, one chooses the width that
yields the number of intervals that expresses the data best. Typically, using fewer than
four or more than 20 intervals is not useful.
The number of observations in an interval is the absolute frequency or simply the
frequency. The proportion of observations in an interval is called the relative frequency.
I will denote the absolute frequency in the i-th interval as $f_i$, and the corresponding
relative frequency as $\hat{f}_i$. Specifically,
$$\hat{f}_i = \frac{f_i}{n}. \qquad (1.1)$$
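The frequencies in Table 1.3 can be recomputed from the raw data with equation (1.1). The sketch below is one way to do it in Python; the hematocrit values are copied from Table 1.2 in student order, while the variable names and the binning rule (each interval starts at its lower endpoint and is 4 percentage points wide) are choices made here for illustration.

```python
# Hematocrit values (%) copied from Table 1.2, students 1 through 49.
hematocrit = [
    43.0, 38.6, 36.4, 40.4, 41.5, 28.0, 37.0, 42.5, 40.5, 39.1,
    42.5, 40.5, 44.5, 40.9, 39.0, 40.0, 41.0, 40.0, 41.0, 45.0,
    40.0, 42.5, 36.5, 40.0, 47.0, 43.0, 41.0, 41.0, 41.0, 37.0,
    40.6, 36.0, 40.0, 37.0, 40.8, 44.0, 40.0, 44.0, 44.0, 45.0,
    47.5, 42.0, 46.0, 41.0, 44.0, 44.0, 45.5, 46.9, 44.0,
]
n = len(hematocrit)

lower, width, n_intervals = 28.0, 4.0, 5  # intervals start at 28.0%, 32.0%, ...

# Absolute frequency f_i: count the observations falling in interval i.
freq = [0] * n_intervals
for h in hematocrit:
    i = int((h - lower) // width)
    freq[i] += 1

# Relative frequency from equation (1.1): f_i divided by n.
rel_freq = [f / n for f in freq]

for i in range(n_intervals):
    start = lower + i * width
    print(f"from {start:.1f}%  midpoint {start + width / 2:.0f}  "
          f"f = {freq[i]:2d}  rel = {rel_freq[i]:.2f}")
```

Changing `width` and `n_intervals` shows how strongly the shape of the frequency distribution depends on the arbitrary choice of interval width.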
Figure 1.2: Absolute frequency histogram for the hematocrit data in Table 1.2. n = 49, interval widths
equal to 2%. The horizontal axis is hematocrit (%); the vertical axis is frequency.
Figure 1.3: Absolute frequency histogram for the hematocrit data in Table 1.2. n = 49, interval widths
equal to 1%. The horizontal axis is hematocrit (%); the vertical axis is frequency.
Figure 1.4: Box plot of hematocrits of 49 histology students. The box represents the middle
50% of the distribution, the heavy line in the middle of the box is the median, and the “whiskers” show
the range, with one extremely low outlier. The vertical axis is hematocrit in percent.
Box plots are very handy when comparing two distributions (Figure 1.5). In the
hematocrit example, a parallel box plot like Figure 1.5 gives a much clearer view of
the difference between the hematocrits of men and women. Certainly, the “average”
(median) hematocrit in men is higher than in women, at least in this sample. But, more
importantly, the boxes spanning the middle 50% of each distribution, whose widths are
called the interquartile ranges, do not overlap. Therefore, one can conclude that 75% of
the men had hematocrits that were larger than those of 75% of the women.
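The numbers behind a parallel box plot like Figure 1.5 can be computed directly from Table 1.2. The sketch below uses Python's `statistics.quantiles` with `method="inclusive"`. That quartile rule is an assumption made here, since the manual does not say which rule its plotting software used, and other conventions can shift the quartiles by a few tenths of a percent, which matters in a comparison this close.

```python
import statistics

# Hematocrit values (%) copied from Table 1.2.
female = [
    43.0, 38.6, 36.4, 40.4, 41.5, 28.0, 37.0, 42.5, 40.5, 39.1,
    42.5, 40.5, 44.5, 40.9, 39.0, 40.0, 41.0, 40.0, 41.0, 45.0,
    40.0, 42.5, 36.5, 40.0, 47.0, 43.0, 41.0, 41.0, 41.0, 37.0,
    40.6, 36.0, 40.0,
]  # students 1-33
male = [
    37.0, 40.8, 44.0, 40.0, 44.0, 44.0, 45.0, 47.5, 42.0, 46.0,
    41.0, 44.0, 44.0, 45.5, 46.9, 44.0,
]  # students 34-49

def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

f_min, f_q1, f_med, f_q3, f_max = five_number_summary(female)
m_min, m_q1, m_med, m_q3, m_max = five_number_summary(male)

print("female:", f_min, f_q1, f_med, f_q3, f_max)
print("male:  ", m_min, m_q1, m_med, m_q3, m_max)

# The boxes overlap only if the bottom of the male box does not clear
# the top of the female box.
boxes_overlap = m_q1 <= f_q3
print("boxes overlap:", boxes_overlap)
```

Under this quartile rule the first quartile for men lies just above the third quartile for women, which is the non-overlap described in the text.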
Example 1.10: Researchers in northern Minnesota studying foxes (species unspecified by the
source) measure the number of kits in 64 litters. They construct a frequency table (Table 1.4) of
the data and graph it as a bar graph (Figure 1.6).
1.5. SUMMARIZING DATA 13
Figure 1.5: Box plot comparison of hematocrits of 33 female and 16 male histology students. The
vertical axis is hematocrit in percent; the horizontal axis separates the Female and Male groups.
Table 1.4. Absolute and relative frequencies of litter sizes in a study of foxes (species
unspecified by source).
# Kits  Frequency  Rel. Frequency
3       10         0.16
4       27         0.42
5       22         0.34
6       4          0.063
7       1          0.016
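For a discrete variable such as litter size, no intervals are needed: each distinct value gets its own frequency. The short Python sketch below recomputes the relative frequencies from the absolute frequencies in Table 1.4; the dictionary layout and variable names are illustrative choices.

```python
# Absolute frequencies copied from Table 1.4: {number of kits: number of litters}.
freq = {3: 10, 4: 27, 5: 22, 6: 4, 7: 1}

n = sum(freq.values())  # total number of litters

# Relative frequency of each litter size: its frequency divided by n.
rel_freq = {kits: f / n for kits, f in freq.items()}

for kits in sorted(freq):
    print(kits, freq[kits], round(rel_freq[kits], 3))
```

The frequencies add up to the 64 litters in the study, and the relative frequencies add up to 1, which is a quick consistency check on any frequency table.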
Figure 1.6: Absolute frequency distribution of number of kits per litter for the data in Table 1.4. The
horizontal axis is the number of kits (3 to 7); the vertical axis is frequency.
1.6 Exercises
1. List as many possible causes of uncertainty in the Denmark-Wahnfried et al. study
as you can.
2. Explain how it is possible that measurement error does not necessarily lead to
measurement bias.