
Biostatistics Lecture Notes

John D. Nagy
Scottsdale Community College

November 25, 2003


Contents

0.1 Useful Notation

1 Introduction to Biostatistics
    1.1 What Does “Biostatistics” Mean?
    1.2 Variation is the root cause of uncertainty
    1.3 Population and Sample
    1.4 Data in Biology
        1.4.1 Representing Data Mathematically
        1.4.2 Classification of Random Variables
    1.5 Summarizing Data
        1.5.1 The Data Frame
        1.5.2 The Frequency Distribution
        1.5.3 Box plots
        1.5.4 Bar graphs
    1.6 Exercises

0.1 Useful Notation


The following notation and concepts are assumed:

1. Functions: y = f(x) is read “y is a function of the variable x,” or simply “f of
x.” The function f assigns a unique value y to every x over which f is defined.

2. Finite and infinite sums: When we write a sum of a finite series of numbers
$x_1, x_2, \ldots, x_n$ we will write

$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \ldots + x_n. \qquad (1)$$

If the sum is infinitely long, then we write

$$\sum_{i=1}^{\infty} x_i = x_1 + x_2 + \ldots. \qquad (2)$$

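We can also have double sums, as follows:

$$\begin{aligned}
\sum_{i=1}^{n} \sum_{j=1}^{m} x_{ij} &= \sum_{i=1}^{n} x_{i1} + \sum_{i=1}^{n} x_{i2} + \ldots + \sum_{i=1}^{n} x_{im} \qquad (3) \\
&= (x_{11} + x_{21} + \ldots + x_{n1}) + \ldots + (x_{1m} + x_{2m} + \ldots + x_{nm}).
\end{aligned}$$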

3. Probability: The notation P{A} is read “probability of A,” where A is some
random event.

4. Factorials: The notation n! is called “n factorial” and is defined

$$n! = 1 \times 2 \times \ldots \times n, \qquad (4)$$

where n is a positive integer. By definition 0! = 1.

5. Combinations: The number of ways one can choose k objects out of a set of n
objects without replacement and without regard to order is

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}. \qquad (5)$$

The notation $\binom{n}{k}$ is read “n choose k”.

6. Logarithms: The standard (base 10) logarithm of x will be written log x, and the
natural logarithm (base e) will be written ln x. By definition,

$$\log x = y \text{ if and only if } 10^{y} = x, \qquad (6)$$

and

$$\ln x = y \text{ if and only if } e^{y} = x. \qquad (7)$$

Figure 1: The standard normal or Gaussian distribution function N(x), plotted for x from −4
to 4. Vertical lines at x = −2 and x = 2 represent integration boundaries for an example
in the text.

7. Absolute value: The absolute value of a number x is written |x|. If x > 0 then
|x| = |−x| = x.

8. Integrals: The integral of a function f(x) from x = a on the left to x = b on the
right is written

$$\int_{a}^{b} f(x)\,dx. \qquad (8)$$

If the function f(x) is never negative, as will always be the case in the problems we
will consider in this manual, you may think of the integral of f(x) as the area of
the region bounded by the graph of f on top, the horizontal or so-called “x” axis
on the bottom, and vertical lines through a and b on each side. For example, Figure
1 shows a graph of the standard normal or Gaussian curve (sometimes called the
“bell curve”). If this curve is f(x), then the area of the region between x = −2
and x = 2 is the integral $\int_{-2}^{2} f(x)\,dx$. Similarly, the area under the normal curve
from x = 2 to infinity on the right is $\int_{2}^{\infty} f(x)\,dx$, and the area under the curve from
x = −2 to negative infinity on the left is $\int_{-\infty}^{-2} f(x)\,dx$. The normal curve has the
property that

$$\int_{-\infty}^{\infty} f(x)\,dx = 1. \qquad (9)$$

Therefore,

$$\int_{-\infty}^{-2} f(x)\,dx + \int_{-2}^{2} f(x)\,dx + \int_{2}^{\infty} f(x)\,dx = 1. \qquad (10)$$
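
For readers who like to check notation numerically, here is a minimal Python sketch of equations (1) through (10). It is an illustration only, not part of the original notes: the small lists, the values n = 5 and k = 2, and the helper name `normal_cdf` are all made up for the example, and the normal-curve areas are computed from the error function in Python's standard `math` module.

```python
import math

# Finite sum, equation (1): x_1 + x_2 + ... + x_n for a small made-up list.
x = [2.0, 3.5, 1.5]
finite_sum = sum(x)

# Double sum, equation (3): adding up the whole table row by row gives the
# same total as adding up the column sums.
table = [[1, 2, 3],
         [4, 5, 6]]                      # n = 2 rows, m = 3 columns
row_first = sum(sum(row) for row in table)
col_first = sum(sum(row[j] for row in table) for j in range(3))

# Factorials and combinations, equations (4) and (5), with n = 5 and k = 2.
n, k = 5, 2
n_choose_k = math.factorial(n) // (math.factorial(k) * math.factorial(n - k))

# Logarithms, equations (6) and (7).
log_check = math.log10(1000.0)           # the y for which 10**y = 1000
ln_check = math.log(math.e ** 2)         # the y for which e**y = e**2


def normal_cdf(z):
    """Area under the standard normal curve from -infinity to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# The three areas in equation (10).
left = normal_cdf(-2.0)
middle = normal_cdf(2.0) - normal_cdf(-2.0)
right = 1.0 - normal_cdf(2.0)
total_area = left + middle + right
```

Printing `middle` shows that roughly 95% of the area under the standard normal curve lies between x = −2 and x = 2, and `total_area` recovers the 1 on the right-hand side of equation (10) up to rounding.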
Chapter 1

Introduction to Biostatistics

1.1 What Does “Biostatistics” Mean?


In their influential text entitled Biometry, Sokal and Rohlf (1995) define biometry as
“the application of statistical methods to the solution of biological problems.” Another
very influential text by Zar (1999) defines biostatistics as “statistics applied to
biological problems”. So we can see that “biometry” and “biostatistics” are the same
thing.
So, what are statistics? This word has at least three different meanings. First,
statistics can be collections of numbers like the number of people who own Humvees or the
number of people who voted for the current President of the United States. (Depending
on the President, these two statistics might be nearly identical.) Such statistics when
applied to populations of organisms, including humans, are often called “vital statistics.”
Another distinct definition of “a statistic” is a calculation from data, like the average
(mean) blood pressure of women on hormone replacement therapy. Finally, statistics
also refers to an intellectual field spanning the disciplines of empirical science and applied
mathematics. This concept is described very well by Stigler (1986) in his excellent history
of statistical thought:

Modern statistics provides a quantitative technology for empirical science;
it is a logic and methodology for the measurement of uncertainty and
for an examination of the consequences of that uncertainty in the planning
and interpretation of experimentation and observation.

There are three key aspects of this definition:


1. Quantitative technology: By this Stigler means much more than computers, programs
and other high-tech things. He means nothing less than an entire intellectual
discipline involving a body of mathematical and computational techniques.

2. Empirical science: “Empirical” means “relying or based on experiment or experience”
(according to Webster’s New World Dictionary). In science, we tend to
separate the empirical from the theoretical. Empirical science generates and interprets
facts, whereas theory ties facts together.

3. Uncertainty: Uncertainty is inherent in every empirical study. Therefore, every
scientist, empiricist and theoretician alike,¹ must understand how this uncertainty
arises. Since statistics were invented primarily to allow us to examine “the consequences
of that uncertainty,” all scientists must understand at least some rudimentary
statistics.

Example 1.1: In a recent study (Denmark-Wahnfried et al. 2000), researchers investigated a
possible link between male pattern baldness (androgenic alopecia in the lingo) and prostate cancer.
Such a link is not unlikely since prostate cancer and male pattern baldness are modulated by the
same hormones. The researchers combined data from two separate “case-control” studies, a study
design we will explore in detail later. The combined data set contained 134 and 145 men with and
without prostate cancer, respectively. Participants were asked to indicate on the Hamilton Scale
of Baldness, a set of generic pictures of pattern baldness, what picture best represented them at
age 30. Results suggest that men with prostate cancer were twice as likely to have been bald on
the crown of the skull by age 30 compared to healthy men. On the other hand, men who were
balding in front, but not on the crown, by age 30 appeared to have a lower risk of prostate cancer.
However, in no portion of the study were the results unequivocal. More precisely, we would say
that their results were not “statistically significant,” meaning that results similar to theirs are not
unlikely to have occurred by chance even if there is no relationship at all between prostate cancer
and androgenic alopecia.

1.2 Variation is the root cause of uncertainty


In Example 1.1, causes of uncertainty include physiological differences among men, the
physiological state of any given study participant, errors in measurement, mistakes in
interpretation of a patient’s response, the choice of whom to include in the study, etc. All of
these uncertainties cause variation in the data. This concept is so important that statisticians
have a very precise definition for it.

Definition 1.1 Variation is deviation of a measurement, count or other statistic from
the “average” of that statistic among all subjects under study or possible outcomes of an
experiment.

Here is a non-exhaustive list of potential causes of variation in any biological study.

1. Population variation: Since no two individuals in any biological sample or experiment
have identical genes, lifetime environmental exposures, ages, etc., we
expect and observe that there are consistent differences among individuals. For
example, suppose one man in Denmark-Wahnfried et al.’s study is a vegetarian
and another is not. For biological reasons we would not expect them to have the
same risk of prostate cancer even if their hair loss was exactly the same.

¹ One might wonder why a theoretician needs to know about uncertainty that arises in empirical
research. Any such mystery will be removed once one recognizes that a theoretician’s work must always
relate to empirical research in one way or other.

2. Individual variation: Individuals themselves may vary over time. For example,
suppose we were measuring the weights of wolves in a pack. Our results would
depend in part on when the animals last ate. In another (true) example, in preparation
for a retinal angiogram a patient’s blood pressure was measured and found
to be worrisomely high. The patient indicated that he had to urinate, which he
was then allowed to do. Afterwards his blood pressure was well within the
healthy range.

3. Sample variation: Variation can be caused by the set of objects or individuals
we choose to study in the first place. For example, if Denmark-Wahnfried et al.
had chosen 279 other men as their sample, they almost certainly would not have
obtained exactly the same results as they did.

4. Observer or instrumentation error: Mistakes in measurement also produce
variation. For example, some older participants in Denmark-Wahnfried et al.’s
study may have incorrectly reported their hair loss at age 30 because their memories
were inaccurate.
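
Sample variation (item 3 above) is easy to see in a small simulation. The sketch below is an illustration under made-up assumptions, not data from any study: it invents a population of 10,000 systolic blood pressures with a mean of 120 mm Hg and a standard deviation of 15 mm Hg, then draws five different samples of 50 individuals each and records the sample means.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch gives the same numbers each run

# Hypothetical population: 10,000 systolic blood pressures (mm Hg).
population = [random.gauss(120.0, 15.0) for _ in range(10_000)]

# Draw five different samples of n = 50 without replacement and
# record the mean of each one.
sample_means = []
for _ in range(5):
    sample = random.sample(population, 50)
    sample_means.append(statistics.mean(sample))

# How far apart the largest and smallest sample means are.
spread = max(sample_means) - min(sample_means)
```

Every sample comes from exactly the same population, yet the five means differ from one another. That difference is variation caused purely by which individuals happened to be chosen.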

Variation in a study is not always bad, and in fact the statistical subdisciplines of
experimental and survey design focus primarily on converting “bad” variation into “good”
variation, or at least variation that will not damage our study. One form of variation –
called bias – is extremely bad. We all have an idea of what this word means in common
usage, but to an empirical scientist bias means something very specific.

Definition 1.2 Bias or systematic error is a form of error that causes our sample
or experiment to generate a systematically incorrect estimate of a statistic.

In some cases bias can be recognized immediately, but most of the time it is very hard
to spot.
Example 1.2: Students in an introductory biology lab used a digital caliper to measure the length
of 20 mesquite leaves. After taking their measurements they discovered that when they closed the
caliper completely, instead of reading 0 mm it read 1.2 mm. They had failed to zero the caliper
properly before taking their measurements, so each reading was 1.2 mm larger than the leaf actually
was. This is an example of a type of bias called “instrumentation bias.”

Example 1.3: Another group of students in the same class as those in Example 1.2 calibrated
their caliper properly before making measurements. But they tended to be sloppy in the use of
the caliper, so their readings often deviated astonishingly far from the actual length of the leaf.
However, it appeared that they were equally likely to overestimate the leaf length as they were to
underestimate it. In that case, their results were hampered by instrumentation error, but since these
errors were not systematic, they were not biased.
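
The difference between Examples 1.2 and 1.3 can be sketched with a short simulation. The numbers below are assumptions chosen for illustration: a single hypothetical leaf with a true length of 30 mm, tiny random error for the careful group, large random error for the sloppy group, and 2,000 readings per group (far more than the 20 leaves in the example) so that the averages settle down. Only the 1.2 mm offset comes from Example 1.2.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is repeatable

true_length = 30.0   # hypothetical true leaf length (mm)
offset = 1.2         # caliper reads 1.2 mm when fully closed (Example 1.2)

# Example 1.2: careful readings, but every one is shifted by the offset.
biased = [true_length + offset + random.gauss(0.0, 0.05) for _ in range(2000)]

# Example 1.3: properly zeroed caliper, sloppy technique. Errors are large
# but equally likely to fall above or below the true length.
sloppy = [true_length + random.gauss(0.0, 3.0) for _ in range(2000)]

# Average error of each group relative to the true length.
bias_of_biased = statistics.mean(biased) - true_length
bias_of_sloppy = statistics.mean(sloppy) - true_length

# Spread of the readings within each group.
spread_of_biased = statistics.stdev(biased)
spread_of_sloppy = statistics.stdev(sloppy)
```

In this sketch the miscalibrated caliper gives tightly clustered readings whose average is off by about 1.2 mm, while the sloppy readings scatter widely around an average that is close to the true length. Large error and bias are different things.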

Example 1.4: Suppose that Denmark-Wahnfried et al. (from Example 1.1) included only men
of European descent in their study, but they are trying to make inferences about all men in the
United States. The study’s results suggest that pattern baldness may predict prostate cancer risk
because men with prostate cancer tended to be more bald on the crown by age 30 than those
without prostate cancer. (For now we will ignore the statistical uncertainty of their results.) But,
we know from previous studies that men of primarily European descent (“white”) have a relatively
low prostate cancer risk compared to men of primarily African descent (“black”) (see Figure 1.1).
That fact alone appears to make a study based only on white men biased. However, in reality it
may or may not be biased. The fact that white and black men do not have the same incidence of
prostate cancer is only part of the story. The study looked at pattern baldness and prostate cancer.
So, even though black men are more susceptible to prostate cancer than white men, the proportion
of men with prostate cancer who were bald on the crown by age 30 could still be the same for both
black and white men, in which case pattern baldness would predict prostate cancer risk equally well
for both black and white men.

Figure 1.1: Incidence (cases per 100,000 men of a given race, vertical axis) of prostate cancer by
year (horizontal axis) in men of primarily European descent (“White”, grey bars) and primarily
African descent (“Black”, open bars) in the United States between 1973 and 1996. Data are from
the Surveillance, Epidemiology and End Results Program’s population-based registries in Atlanta,
Detroit, Seattle-Puget Sound, San Francisco-Oakland, Connecticut, Iowa, New Mexico, Utah and
Hawaii, available at the Centers for Disease Control and Prevention’s National Center for Health
Statistics web-site (www.cdc.gov/nchs/fastats/cancer.htm).

Bias can arise in various ways in a biological study. One common way we’ve already
seen in Example 1.4. In this case at least one subset of the population (African American
men) was not sampled. Such a situation results in selection bias, in which case the
sampling procedure introduces uncertainty that masks the true relationships among the
variables one is studying.
Selection bias is not the only form of bias, unfortunately. Problems also can arise
through measurement bias, a situation in which an instrument or person tends to read
consistently either too high or too low, as in Example 1.2. But, as Example 1.3 shows,
instrumentation error and instrumentation bias are not the same thing.
Statistics, in the sense of a calculation on data, can also be biased. However, this
is not a procedural, but rather a mathematical, problem that an empirical scientist can
rely on professional mathematical statisticians to resolve.

1.3 Population and Sample


Denmark-Wahnfried et al. were interested in the relationship between pattern baldness
and prostate cancer for all men, both now and in the future (and presumably in the past).
But, they haven’t studied anywhere near the perhaps 3 billion men and boys currently
on the planet (their 279 participants are about 0.00001% of them), let alone the uncounted
billions (trillions? more?) that have existed or will exist. Does this fact invalidate their study?
Evidently not, or else all of science would be invalid, yet we build cell phones and have been to
the moon, events that are hardly accidents. But essentially no study ever looks at every
subject or performs every possible experiment of interest for obvious practical reasons.
Suppose we wanted to study everyone in the United States who exists right now. That’s
actually a fairly well-defined population of about 260 million people. Further suppose
that our study required a single measurement from our subjects. If we could make this
measurement on one person per second (think about how unrealistically fast that is for
something like blood pressure), then it would take us about 8.2 years to complete our
measurements. In the meantime, we’d have to deal with all those individuals who died
or were born to women waiting in line to have their blood pressure taken.
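
The arithmetic behind these figures can be checked in a few lines of Python. This is only a back-of-the-envelope sketch: the population sizes are the rough figures quoted in the text, and the length of a year (365.25 days) is my own assumption.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60   # assumes a 365.25-day year

# One measurement per second on about 260 million people.
us_population = 260e6
years_needed = us_population / SECONDS_PER_YEAR

# Share of the world's males included in the baldness study.
sample_size = 134 + 145                    # men with and without prostate cancer
world_males = 3e9                          # "perhaps 3 billion" from the text
percent_studied = 100 * sample_size / world_males
```

Here `years_needed` works out to a little over eight years of nonstop measuring, and `percent_studied` is a vanishingly small fraction of one percent.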
And that’s an easy study because the population is well-defined. In most cases that
won’t be true. Suppose, for example, we were interested in the effect of vitamin A on
overall cancer risk in all people. We have no idea exactly how many people lived in the
past, or even exactly when “people” arose through evolution. Nor do we know how many
will live in the future.
Another wrinkle arises because of individual variation. Suppose, just for the sake
of argument, that we are only interested in a well defined group of people, and we find
some way to study everyone in that group. We set up a controlled clinical trial where
we randomly assign half of our population to the treatment group, who get a vitamin
A supplement with their breakfasts, and the other half to the control group, who get a
placebo pill. Then after 20 years of this we determine who does and does not have cancer.
But, cancer risk is associated with age, so if we waited a year or two before starting our
study, we might have exactly the same people participating in it, but our results would
be different because of individual variation in cancer risk over time.
The solution is to admit that we can’t study everyone we wish to make inferences
about, and go about developing techniques to allow us to use a sample to represent
everyone. Certainly there is danger in that approach. What if our sample just so happens
to be a bunch of oddballs? While that would cause a problem, one thing is certain. We’ve
studied the dangers for at least hundreds if not thousands of years, and we understand
them pretty well. In fact, humanity’s efforts in this direction led to a whole new field of
intellectual endeavor – it’s called statistics.

Definition 1.3 A statistical population is the entire set of elements (humans, cells,
wolves, bees, all possible experiments of a particular type, etc.) that we wish to
describe or about which we wish to make inferences. If the population is both finite and
well defined, we often denote the population size by N.

Definition 1.4 A statistical sample is the subset of elements from the population that
we actually study. The sample size is the precise number of elements in the sample and
is typically denoted n.

So, the sample is what we use to draw conclusions about the population. In a well-
designed study, both will be clearly defined, although in most cases it will be impossible
to assign a value to N because the population size is either not well-defined or not finite.
Note that the population can be well defined when the population size is not.

1.4 Data in Biology


Quantitative empirical research involves gathering, summarizing and analyzing data.
Our goal at this point is to obtain a formal understanding of how to represent and
summarize data as a first step towards analysis.

1.4.1 Representing Data Mathematically


Quantitative or categorical biological data are represented mathematically as random
variables. In general a variable is a mathematical symbol representing an unspecified
value in a given set of possibilities, usually quantities. For example, the variable Y could
represent prostate cancer risk for an unspecified man in the U.S., or the number of species
of beetles living in an unspecified tree in Brazil, or the mass of the sun at an unspecified
time. A random variable is a symbol representing the outcome of a random event,
which I will define for now as an event for which the outcome cannot be predicted given
information at hand. The converse is a deterministic process, which can be predicted
to within a fairly tight precision from initial conditions, at least in principle.²
Example 1.5: The rate of a chemical reaction given a reasonably large initial concentration of
reactants is often modelled as a deterministic process because most such reactions are predictable
within relatively strict limits of precision. Therefore, we might represent the concentration of
reactant i as $y_i$ and that of the product as x. Variables x and $y_i$ are not random variables.
Random variables are represented in various texts as italicized upper or lower case
letters. In this manual I will follow a convention used in other texts and represent random
variables with upper case, like Y, and deterministic variables with lower case, like y.
If I wish to specify a measurement in a set of measurements represented by Y I will use
$Y_i$, which can also be called the i-th measurement.

² There is also a third possibility, called chaos, which can be described by deterministic equations
and therefore is not random, but in reality cannot be predicted either.

Example 1.6: As part of a study investigating how the growth of milkweed bugs is affected by
their food supply I measured the lengths of five bugs in my backyard. The values were, in the order
in which I measured them, 9, 10, 7, 12 and 14 millimeters. Let Y represent the length of a milkweed
bug in my backyard population. I only measured five of these bugs, so n = 5, $Y_1 = 9$ and $Y_4 = 12$.
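
If you store such measurements on a computer, keep in mind that many programming languages number list positions from 0, whereas the subscripts in this manual start at 1. The short Python sketch below holds the five lengths from Example 1.6; the helper name `measurement` is made up for this illustration.

```python
# Lengths (mm) of the five milkweed bugs, in the order they were measured.
Y = [9, 10, 7, 12, 14]
n = len(Y)  # sample size


def measurement(i):
    """Return the i-th measurement using the text's 1-based subscript."""
    return Y[i - 1]
```

With this helper, `measurement(1)` and `measurement(4)` return the first and fourth measurements from the example.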

1.4.2 Classification of Random Variables


When first faced with a random variable it is useful to take note of its type. The two
types we will work with in this text are the following:

Definition 1.5 Continuous random variables are measurements that can take any
real value in a given interval. For example,
the length of a milkweed bug in Example 1.6 is a continuous random variable. I’ll call
data represented as a continuous random variable continuous data.

Definition 1.6 Discrete or meristic random variables are usually measurements
or counts that take on integer values. For example, the number of wolves in a pack is a
discrete random variable; a pack with 4.38 wolves makes no sense. Data represented as
a discrete random variable I’ll call discrete data.

1.5 Summarizing Data


The two common ways of summarizing data are tables and graphs. Neither method
is “better” than the other; in some cases a data table is superior, and in others a figure
is. The point you are trying to get across determines which you should use and how it
should be constructed.

1.5.1 The Data Frame


Tables can be very useful if they are well constructed. However, a poorly designed
table not only is confusing but may actually make the analysis more difficult. The most
important data table you will make in a given study is the data frame, which is a table
of raw data specially designed for analysis. When collecting raw data it is dreadfully
important not to just throw the data into some table willy-nilly. Most of the time you
will want to use a standard data frame so that whatever statistical computational software
you use knows what to look for where.

Definition 1.7 A data frame is a data table in which columns represent variables and
rows represent elements on which data were taken.

Table 1.1. Hematocrits (%) of 49 histology students segregated by gender.

Women                        Men
43.0   39.1   41.0   41.0    37.0   42.0
38.6   42.5   45.0   41.0    40.8   46.0
36.4   40.5   40.0   37.0    44.0   41.0
40.0   44.5   42.5   40.6    40.0   44.0
41.5   40.9   36.5   36.0    44.0   44.0
28.0   39.0   40.0   40.0    44.0   45.5
37.0   40.0   47.0           45.0   46.9
42.5   41.0   43.0           47.5   44.0
40.5   40.0   41.0

To get a feel for how to build a data frame, let’s practice with the following simple
study.
Example 1.7: Human blood consists of two major components: 1) the formed elements including
red blood corpuscles, white blood cells and platelets, and 2) plasma. Clinicians frequently measure
the proportion of blood (by volume) that consists of formed elements, which is called the hematocrit,
usually expressed as a percent. In a study conducted by Dr. Kim Cooper in one of his histology
classes, each of 49 students measured their own hematocrit. The data are presented simply in Table
1.1. Since hematocrit can take on any value between 0% and 100%, these are continuous data.
Table 1.1 is nice and concise, but it is not a data frame and one should not record
raw data in this way. One can use something like Table 1.1 to summarize results, but it
will make analysis by computer very difficult.
So, how do we build a data frame for this study? Note that there are a total of 49
students in the study, so our data frame will have 49 rows plus a header row. For every
student, two variables were recorded: gender and hematocrit. So, our data frame will
have two columns plus an identification column. Table 1.2 shows an example of a data
frame, actually two halves of the data frame side-by-side, for this data set.
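
To make the idea concrete, here is a minimal Python sketch that turns a summary layout into a data frame layout. It is only an illustration: it uses the first three hematocrits recorded for each gender rather than the full data set, it renumbers those six students 1 through 6, and it writes the result as comma-separated text with Python's standard `csv` module, which is one common format among many that statistical software can read.

```python
import csv
import io

# Summary layout: one list of hematocrits (%) per gender (first values only).
summary = {
    "Female": [43.0, 38.6, 36.4],
    "Male": [37.0, 40.8, 44.0],
}

# Data frame layout: one row per student, one column per variable,
# plus an identification column.
rows = []
student = 1
for gender, values in summary.items():
    for hematocrit in values:
        rows.append({"Student": student, "Gender": gender, "Hematocrit": hematocrit})
        student += 1

# Write the data frame as CSV text with a header row.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Student", "Gender", "Hematocrit"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Each row of `rows` now describes exactly one student, so software can pick out a variable by its column name instead of guessing what each position in a summary table means.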

1.5.2 The Frequency Distribution


Typically the set of raw numbers reveals very little. So, our first task is to create summaries
of the data set that may reveal patterns. One of the most useful summaries is
called the frequency distribution. In fact, the vast majority of the time we study frequency
distributions of our data set and statistics derived from it, not the raw data themselves.
This fact appears to be somewhat counter-intuitive and leads to an early pitfall in a
beginning student’s study of statistics.
Frequency distributions of continuous data can be displayed in a number of different
ways, including frequency tables, histograms and box plots. All are easily produced with
standard statistical software from a data frame. At the moment we will focus on their
interpretation.
Table 1.2. Data frame for the hematocrit study on 49 histology students conducted by
Dr. Kim Cooper.

Student  Gender  Hematocrit    Student  Gender  Hematocrit
      1  Female        43.0         26  Female        43.0
      2  Female        38.6         27  Female        41.0
      3  Female        36.4         28  Female        41.0
      4  Female        40.4         29  Female        41.0
      5  Female        41.5         30  Female        37.0
      6  Female        28.0         31  Female        40.6
      7  Female        37.0         32  Female        36.0
      8  Female        42.5         33  Female        40.0
      9  Female        40.5         34  Male          37.0
     10  Female        39.1         35  Male          40.8
     11  Female        42.5         36  Male          44.0
     12  Female        40.5         37  Male          40.0
     13  Female        44.5         38  Male          44.0
     14  Female        40.9         39  Male          44.0
     15  Female        39.0         40  Male          45.0
     16  Female        40.0         41  Male          47.5
     17  Female        41.0         42  Male          42.0
     18  Female        40.0         43  Male          46.0
     19  Female        41.0         44  Male          41.0
     20  Female        45.0         45  Male          44.0
     21  Female        40.0         46  Male          44.0
     22  Female        42.5         47  Male          45.5
     23  Female        36.5         48  Male          46.9
     24  Female        40.0         49  Male          44.0
     25  Female        47.0

Example 1.8: Table 1.3 is a frequency table of the hematocrit data in Table 1.2. The cell in
the first (data) row, third column from the left (value = 1) expresses the number of hematocrit
measurements that fell in the interval 28.0% to 31.9%, inclusive. The second column indicates the
midpoint of the interval. The fourth column, titled “Rel. Frequency” (relative frequency),
represents the proportion of all observations that fell in the appropriate interval. This table
tells us that 51% of students in the sample had hematocrits between 40.0% and 43.9%.

Table 1.3. Absolute and relative frequencies of the hematocrit data in Table 1.2.
Interval widths are 4%, n = 49. Intervals include endpoints.

Interval          Midpoint (%)   Frequency   Rel. Frequency
28.0% – 31.9%     30              1          0.02
32.0% – 35.9%     34              0          0
36.0% – 39.9%     38              9          0.18
40.0% – 43.9%     42             25          0.51
44.0% – 47.9%     46             14          0.29

The choice of interval widths is arbitrary, and as a practical issue is chosen based on
how many intervals express the data best. Typically, fewer than four or more than 20
intervals is not useful.
The number of observations in an interval is the absolute frequency or simply the
frequency. The proportion of observations in an interval is called the relative frequency.
I will denote the absolute frequency in the i-th interval as $f_i$, and the corresponding
relative frequency as $\hat{f}_i$. Specifically,

$$\hat{f}_i = \frac{f_i}{n}. \qquad (1.1)$$

Therefore, in the hematocrit example, $f_4 = 25$ and $\hat{f}_4 = 0.51$.
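
The frequency table can be rebuilt from the data frame with a few lines of Python. The sketch below is one possible way to do it, not the procedure used to produce Table 1.3; the function name `frequency_table` and its parameters are made up for the illustration. It treats each interval as running from its lower limit up to, but not including, the next lower limit, which for data recorded to one decimal place is the same as “28.0% to 31.9%, inclusive.”

```python
# Hematocrit values (%) for students 1 through 49, copied from Table 1.2.
hematocrit = [
    43.0, 38.6, 36.4, 40.4, 41.5, 28.0, 37.0, 42.5, 40.5, 39.1,
    42.5, 40.5, 44.5, 40.9, 39.0, 40.0, 41.0, 40.0, 41.0, 45.0,
    40.0, 42.5, 36.5, 40.0, 47.0, 43.0, 41.0, 41.0, 41.0, 37.0,
    40.6, 36.0, 40.0, 37.0, 40.8, 44.0, 40.0, 44.0, 44.0, 45.0,
    47.5, 42.0, 46.0, 41.0, 44.0, 44.0, 45.5, 46.9, 44.0,
]


def frequency_table(values, start, width, k):
    """Count values in k intervals [start + j*width, start + (j+1)*width).

    Returns the absolute frequencies f_i and the relative frequencies
    f_i / n from equation (1.1).
    """
    counts = [0] * k
    for v in values:
        j = int((v - start) // width)      # index of the interval holding v
        if not 0 <= j < k:
            raise ValueError(f"{v} falls outside the intervals")
        counts[j] += 1
    n = len(values)
    return counts, [f / n for f in counts]


f, f_hat = frequency_table(hematocrit, start=28.0, width=4.0, k=5)
```

Because Python lists are numbered from 0, the fourth interval of the text is `f[3]`, which holds 25 observations, and `f_hat[3]` rounds to 0.51. Changing `width` and `k` shows how strongly the table depends on the choice of intervals.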


One useful graphical representation of the frequency distribution for continuous data
is the histogram.
Example 1.9: Figure 1.2 is a histogram of the hematocrit data in Table 1.2. The horizontal axis
represents the scale of hematocrit measurements, and the vertical axis is the frequency, either
relative or absolute (the latter in this example). As useful as histograms are, they can’t be used
uncritically because their shape, and the information they convey, depend on the interval widths.
For example, Figure 1.3 shows the same hematocrit data with smaller intervals.

1.5.3 Box plots


A third very useful way to represent distributions of continuous data is the box and
whisker plot, or simply box plot. A box plot of the data from Table 1.2 is presented in
Figure 1.4. The line in the middle of the “box” is a measure of the distribution’s central
tendency, in this case the median but sometimes the mean. The box itself represents
the middle 50% of the distribution, and the outer lines, or “whiskers,” show the range
of values in the data set, in this case, minus one extremely low value of 28.0%. That
“outlier” is shown as an open circle towards the bottom.

Figure 1.2: Absolute frequency histogram for the hematocrit data in Table 1.2 (horizontal axis:
hematocrit (%); vertical axis: frequency). n = 49, interval widths equal to 2%.

Figure 1.3: Absolute frequency histogram for the hematocrit data in Table 1.2 (horizontal axis:
hematocrit (%); vertical axis: frequency). n = 49, interval widths equal to 1%.

Figure 1.4: Box plot of hematocrits of 49 histology students. The middle box represents the middle
50% of the distribution, the heavy line in the middle of the box is the median and the “whiskers” show
the range, with one extremely low outlier. The vertical axis is hematocrit in percent.

Box plots are very handy when comparing two distributions (Figure 1.5). In the
hematocrit example, a parallel box plot like Figure 1.5 allows a much more powerful
view into the difference between hematocrits of men and women. Certainly, “average”
(median) hematocrit in men is higher than in women, at least in this sample. But, more
importantly, the inner 50% ranges, also called the interquartile ranges, do not overlap.
Therefore, one can conclude that about 75% of the men had hematocrits that were larger than
those of about 75% of the women.
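
The numbers behind a box plot can be computed directly. The sketch below uses the `quantiles` function from Python's standard `statistics` module with `method="inclusive"`, which interpolates linearly between data points. That choice is an assumption on my part: the notes do not say which quartile convention the plotting software used, and other conventions give slightly different box edges. The helper name `box_summary` is made up for this illustration.

```python
import statistics

# Hematocrits (%) from Table 1.2: students 1-33 are female, 34-49 are male.
women = [
    43.0, 38.6, 36.4, 40.4, 41.5, 28.0, 37.0, 42.5, 40.5, 39.1, 42.5,
    40.5, 44.5, 40.9, 39.0, 40.0, 41.0, 40.0, 41.0, 45.0, 40.0, 42.5,
    36.5, 40.0, 47.0, 43.0, 41.0, 41.0, 41.0, 37.0, 40.6, 36.0, 40.0,
]
men = [
    37.0, 40.8, 44.0, 40.0, 44.0, 44.0, 45.0, 47.5,
    42.0, 46.0, 41.0, 44.0, 44.0, 45.5, 46.9, 44.0,
]


def box_summary(values):
    """Return (first quartile, median, third quartile) of the values."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q2, q3


women_q1, women_median, women_q3 = box_summary(women)
men_q1, men_median, men_q3 = box_summary(men)

# The boxes overlap if the top of the women's box reaches the bottom of the men's.
boxes_overlap = women_q3 >= men_q1
```

Under this convention the top of the women's box is at 41.5% and the bottom of the men's box is at 41.75%, so the boxes do not overlap, but only barely. With a gap that small, a different quartile convention could change the picture, so it is worth checking which one your software uses.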

1.5.4 Bar graphs


So far everything presented in this section applies to continuous data. The concept of
a frequency distribution can be extended to discrete data in a natural and obvious way,
as we are about to explore (Example 1.10, Table 1.4). However, as a technical matter
one should not use histograms or box plots to represent the distribution of discrete data
unless the variable has enough classes that a continuous approximation is appropriate.
Instead we represent the frequency distribution of discrete data with a bar graph, which
looks like a histogram except that the bars do not touch (Figure 1.6).

Example 1.10: Researchers in northern Minnesota studying foxes (species unspecified by the
source) measure the number of kits in 64 litters. They construct a frequency table (Table 1.4) of
the data and graph it as a bar graph (Figure 1.6).

Figure 1.5: Box plot comparison of hematocrits of 33 female and 16 male histology students.

Table 1.4. Absolute and relative frequencies of litter sizes in a study of foxes (species
unspecified by source).

# Kits   Frequency   Rel. Frequency
3        10          0.16
4        27          0.42
5        22          0.34
6         4          0.063
7         1          0.016

Figure 1.6: Absolute frequency distribution of number of kits per litter for the data in Table 1.4
(horizontal axis: number of kits; vertical axis: frequency).
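
As a quick check on Table 1.4, the relative frequencies follow from the counts in a couple of lines of Python, and a rough bar graph can even be drawn as text. This is a sketch for illustration only; the variable names and the text-based drawing are my own choices, not the way Figure 1.6 was produced.

```python
# Litter-size counts from Table 1.4: number of kits -> number of litters.
litters = {3: 10, 4: 27, 5: 22, 6: 4, 7: 1}

n = sum(litters.values())                        # total number of litters
rel_freq = {kits: f / n for kits, f in litters.items()}

# Text bar graph: one separate bar per discrete class, one '#' per litter.
lines = [f"{kits} kits | {'#' * f} ({f})" for kits, f in litters.items()]
bar_graph = "\n".join(lines)
```

Running `print(bar_graph)` draws one bar per litter size. Because each class is a distinct integer, the bars stand apart rather than touching as they would in a histogram.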

1.6 Exercises
1. List as many possible causes of uncertainty in the Denmark-Wahnfried et al. study
as you can.

2. Explain how it is possible that measurement error does not necessarily lead to
measurement bias.

3. What are the population and population size in Denmark-Wahnfried et al.’s study?
Are both well defined? What is the most precise statement you can make about N
in this study?

4. What exactly is the sample in Denmark-Wahnfried et al.’s study? What is n?

5. Suppose Y represents the number of gorillas killed by poachers in a year, X is the
exact number of beetle species living in a tree in the Brazilian rainforest, and Z(t)
is the mass of the sun at time t. Which of these is (are) deterministic and which is
(are) random? Explain your response.

6. Classify the following variables as random or deterministic, discrete or continuous:
(a) the number of men with prostate cancer in the U.S.; (b) the chin-shape of a
child born to parents with a mid-chin fissure, a trait known to follow Mendel’s laws
and be passed via an autosomal dominant allele; (c) the mass of a breast tumor at
time t; (d) the number of walking legs on a horseshoe crab; (e) the weight of an
adult African elephant to the nearest 5 tons.
