
Laboratory: Case-Control Studies

“How many do I need?” is one of the most common questions addressed to an epidemiologist.
The epidemiologist answers with “What question are you attempting to answer?” Sample size
depends on the purpose of the study. More often than not the investigator has not precisely
determined what question is to be answered. It is essential that this be done before sample size
calculations can be performed. There are several key reasons to perform sample size
calculations: 1) it forces a specification of expected outcomes, 2) it leads to a stated recruitment
goal, 3) it encourages development of appropriate timetables and budgets, 4) it discourages the
conduct of small, inconclusive trials, and perhaps most importantly, 5) it reduces the unnecessary
use of animals in animal experiments.

When you read studies, you will come across common mistakes that are related to sample size.
These include: 1) a failure to discuss a sample size, 2) unrealistic assumptions about disease
incidence, etc., 3) a failure to explore sample size for a range of input values, 4) a failure to state
power for a completed study with negative results (often referred to as post-hoc power which
opens up another major debate), and 5) a failure to account for attrition by increasing the sample
size above the calculated size.

Variability is an important consideration when calculating sample sizes. The greater the
variability among subjects, the harder it is to demonstrate a significant difference. Making
multiple measurements on subjects and averaging them can increase the precision and may help
decrease variability (by decreasing random error). Paired measurements (e.g. baseline and after
treatment) can be used to measure change which can reduce the variability of the outcome and
enable a smaller sample size.
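As a rough illustration of the point about averaging repeated measurements, the sketch below (hypothetical true value and measurement-error size) compares single measurements with the average of four repeats per subject; averaging cuts the standard deviation roughly in half, by a factor of 1/√4:

```python
import random
import statistics

random.seed(0)

def measure(true_value=100.0, sd=10.0):
    """One noisy measurement of a subject (hypothetical values)."""
    return random.gauss(true_value, sd)

# Compare single measurements with the average of 4 repeats per subject.
singles = [measure() for _ in range(5000)]
averaged = [statistics.mean(measure() for _ in range(4)) for _ in range(5000)]

sd_single = statistics.stdev(singles)    # close to 10
sd_avg = statistics.stdev(averaged)      # close to 10 / sqrt(4) = 5
print(round(sd_single, 1), round(sd_avg, 1))
```
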

Hypothesis Testing

A common use of statistics is to test whether the data that we observed are consistent with the
null hypothesis. Of course, we never expect the data to be exactly the same as the null
hypothesis as discussed in relation to the Central Limit Theorem. However, if the data we
observe would be extremely rare under the distribution proposed by the null hypothesis, then we
reject the null hypothesis as being inconsistent with the data and instead accept an alternate hypothesis.

In the application of hypothesis testing procedures, one must bear in mind that a result that is
declared to be statistically significant may not necessarily be of practical significance. In
particular, if the sample size is large, the difference between an observed sample statistic and the
hypothesized population parameter may be highly significant, although the actual difference
between them is negligible. For example, a difference between an observed proportion of
p̂ = 0.84 and a hypothesized value of 0.80 would be significant at the 5% level if the number of
observations exceeds 385. By the same token, a non-significant difference may be of practical
importance. This might be the case in a medical research program concerning the treatment of a
life-threatening disease, where any improvement in the probability of survival beyond a certain
time will be important, no matter how small. Generally speaking, the result of a statistical test of
the validity of H0 will often only be part of the evidence against H0. This evidence will usually

need to be combined with other, non-statistical, evidence before an appropriate course of action
can be determined.
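The p̂ = 0.84 versus 0.80 claim can be checked with a short sketch (assuming a one-sample z-test with the null-hypothesis standard error) that searches for the smallest n at which the difference becomes significant at the two-sided 5% level:

```python
import math

def z_statistic(p_hat, p0, n):
    """One-sample z statistic for a proportion, using the
    null-hypothesis standard error sqrt(p0 * (1 - p0) / n)."""
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# Smallest n at which p_hat = 0.84 vs p0 = 0.80 is significant
# at the two-sided 5% level (critical value 1.96).
n = 1
while z_statistic(0.84, 0.80, n) < 1.96:
    n += 1
print(n)  # 385
```
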

Type I and Type II Errors

                                  Possible condition of null hypothesis
                                  True                   False

Possible    Fail to reject H0     Significance (1-α)     Type II error (β)
action      Reject H0             Type I error (α)       Power (1-β)

Significance (1-α): Also called a confidence level, it refers to the probability of correctly
failing to reject the null hypothesis when it is in fact true, i.e. the probability that a
difference due only to chance is not declared statistically significant.

Type I error (α): The probability of rejecting a true null hypothesis, or incorrectly concluding
that a difference exists (the null hypothesis is not appropriate) when in fact a difference really
does not exist in the population. A false positive decision.

Power (1-β): The probability of correctly rejecting the null hypothesis when it is in fact false,
and thus, the probability of observing a difference in the sample when an equal or greater
difference is present in the population.

Type II error (β): The probability of accepting a false null hypothesis, or incorrectly concluding
that a difference does not exist (the null hypothesis is appropriate) when in fact a difference
really does exist in the population. A false negative decision.

Precision and Validity
In order to make unbiased inferences about the associations between putative causes and
outcomes, we must measure the various factors with as little error as possible. The true value of
the factor in the population is known as the parameter. This population value is not typically
measurable, so instead, we obtain an estimate of the parameter through sampling of the
population. The overall goal of an epidemiologic study is accuracy in estimation: to estimate
the value of the parameter that is the object of measurement with little error. Sources of error
can be classified as either random or systematic. Our goal is to reduce both types of error.

Precision (lack of random error)

Precision is a reduction in random error, or that variation attributed to sampling. Precision is
also referred to as reliability and repeatability, depending on the book you read. I prefer to use
precision (and possibly reliability) because it more clearly describes the nature of the problem:
how far from the true value are the randomly sampled data points, strictly due to random error.
Repeatability, at least to me, implies whether two observers would generate the same results
when measuring the same samples (for instance, comparing interpretations of clinical pathology results).

As an example of precision, consider the prevalence of a disease in a herd of 1000 animals. Let
the true prevalence be 10%, or 100 out of the 1000. If we randomly sample 10 individuals in the
herd, will we always get 1 out of 10 with the disease (and therefore an estimate of prevalence
equal to 10%)? How much variability will there be in our estimate due strictly to random error?
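The question above can be answered by simulation; the sketch below (hypothetical herd of 1000 with true prevalence 10%) repeats the 10-animal survey many times and looks at how far the prevalence estimates stray from 10% purely by chance:

```python
import random

random.seed(1)
herd = [1] * 100 + [0] * 900   # true prevalence 10% in a herd of 1000

# Repeat the survey many times: sample 10 animals, estimate prevalence.
estimates = []
for _ in range(10_000):
    sample = random.sample(herd, 10)
    estimates.append(sum(sample) / 10)

print(min(estimates), max(estimates))   # estimates range widely around 0.10
```
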

How do we improve the precision of our study?

1. Increase the sample size.

By increasing sample size, we reduce the random error because we gain confidence that our
sample is more representative of the population from which the sample was taken. There are
formulas that relate the size of a study to the study design, the study population, and the
desired precision (or power). We will discuss these methods in more detail later in this section.

2. Increase the efficiency of the study.
What do we mean by this? We want to maximize the information that each individual in the
study provides to our inferences. For example, suppose we are planning a prospective study
to determine the influence of exposure X on disease Y. We randomly enroll the subjects, and
these subjects ultimately fall into two groups, those exposed to X and those unexposed to X.
The goal is to determine precisely the influence of X on Y. Is it better to enroll more
subjects? Not necessarily. Suppose we enroll 10,000 subjects, but it turns out that only 50
were exposed to X (because of the way in which subjects were enrolled). We do not have
much information about exposure, and therefore, our ability to determine the relationship
between X and Y will be imprecise. When evaluating the efficiency of a study, we must
consider both the amount of information that each subject provides as well as the cost of
acquiring this information.

Validity (lack of systematic error)

Validity is the ability to measure what we actually think we are measuring. Bias then refers to
the presence of a systematic error. It does not include the error associated with sampling
variability, but instead, is generally considered a flaw in the design, analysis or both. It is
important to recognize that only rarely can data that are biased by the sampling design be
adjusted for in the analysis. It is therefore critical to anticipate and control for biases at the
outset of the study. When discussing validity, we usually separate it into two components,
internal validity and external validity.

The p-value represents the probability of obtaining an outcome which is at least as extreme as the
one which is actually observed, given that the null hypothesis is correct. All of the testing that
is performed IS BASED ENTIRELY on the assumption that the null hypothesis is correct. We
are using the distribution of the null hypothesis in order to determine the probability of observing
the data that we actually saw, and thus, we can only reject or fail to reject the null hypothesis,
but we can never prove that it is true. We are assuming it is true!!! In addition, the p-value tells us
nothing about the probability of other alternate hypotheses being more appropriate.

Confidence Intervals

A confidence interval is a range that we construct in order to capture the desired parameter value
with a certain probability. The higher this probability of containing the true parameter value, the
wider the interval must be for the same data. The interpretation of a confidence interval is based
on the idea of repeated sampling. Suppose that repeated samples of n binary observations are
taken, and a 95% confidence interval for p is obtained for each of the data sets. Then, 95% of
these intervals would be expected to contain the true success probability. It is important to
recognize that this does not mean that there is a probability of 0.95 that p lies in the confidence
interval. This is because p is fixed in any particular case and so it either lies in the confidence
interval or it does not. The limits of the interval are the random variables. Therefore, the correct
view of a confidence interval is based on the proportion of calculated intervals that would be
expected to contain p, rather than the probability that p is contained within any particular interval.
As an example, suppose a survey of dairy farmers is conducted to estimate the proportion of
herds that vaccinate against BVD. Within a specific geographic region, suppose that out of 300
farms, 123 said that they vaccinate. The 95% confidence interval would be calculated as
follows: First, we need a point estimate of the population parameter. In this case, we use:

p̂ = 123/300 = 0.41

The variance is then calculated as:

Var(p̂) = Var(y/n) = (1/n²) Var(y) = (1/n²) np(1 − p) = p̂(1 − p̂)/n = 0.41(0.59)/300 = 0.000806

Using the normal approximation, we then have the confidence interval calculated as:

p̂ ± Z1-α/2 √(p̂(1 − p̂)/n), or 0.41 ± 1.96(0.028) = (0.36, 0.46). How would we interpret this?

A primary motivation for believing that Bayesian thinking is important is that it facilitates a
common-sense interpretation of statistical conclusions. For instance, a Bayesian (probability)
interval for an unknown quantity of interest can be directly regarded as having high probability
of containing the unknown quantity, in contrast to a frequentist (confidence) interval, which may
strictly be interpreted only in relation to a sequence of similar inferences that might be made in
repeated practice.

Central Limit Theorem (CLT)

In a skewed distribution, the frequency distribution is asymmetrical, with some values being
disproportionately more or less frequent. This would seem to preclude the use of the Normal
distribution to model a given skewed variable. The sampling distribution of the mean that we
developed above, however, circumvents the apparent problem. Regardless of the population's
distribution, the distribution of sample means drawn from that population becomes less and less
skewed as n grows. The CLT states that the distribution of sample means from repeated samples of n
independent observations approaches a Normal distribution, with a mean of µ and a variance of σ²/n:

E(X̄) = µ and Var(X̄) = σ²/n

This implies that as n increases, the variance of X decreases. These parameters can be
transformed back into the Standard Normal, such that
(X̄ − µ) / (σ/√n) is N(0,1).
Notice that this theorem requires no assumption about the form of the original distribution. It
works for any configuration. What makes this theorem so important is that it holds regardless of
the shape or form of the population distribution. If the sample size is large enough, the sampling
distribution of the means is always Normal. The major importance of this result is for statistical
inference, as the CLT provides us with the ability to make even more precise statements. Instead
of relying on Tchebychev's theorem, which ensures that at least 75% of the observations are
within 2 standard deviations of the mean, we know from mathematical statistics that the
proportion of observations falling within 2 or 3 standard deviations of the mean for a Normal
distribution is 0.95 or 0.997, respectively.
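The CLT can be demonstrated directly; the sketch below draws repeated samples from a heavily skewed parent distribution (exponential, chosen here for illustration) and checks that the sample means behave like N(µ, σ²/n):

```python
import random
import statistics

random.seed(3)

# Heavily skewed parent distribution: exponential with mean 1, variance 1.
n = 50
means = [statistics.mean(random.expovariate(1.0) for _ in range(n))
         for _ in range(5000)]

m = statistics.mean(means)       # close to mu = 1
v = statistics.variance(means)   # close to sigma^2 / n = 1/50 = 0.02
print(round(m, 2), round(v, 3))
```
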

One place where the Normal approximation can be used is with the Binomial distribution. As n
becomes large, the Binomial can become difficult to calculate. Under the CLT, the Normal can
be used to approximate the Binomial, and this approximation improves as n increases. As a
general rule, the approximation should only be used when n * p (or n * q) is greater than 5.

The CLT states that the distribution of

W = (Y − np) / √(np(1 − p)) = (p̂ − p) / √(p(1 − p)/n) is N(0,1). Thus, if n is sufficiently large,
the distribution of Y (the number of successes) is approximately N[np, np(1 − p)].

Remember that the disease outcomes we are measuring are often rare. This may be especially
true if we are studying mortality due to some factor. Thus, the Normal approximation may not
be the ideal way in which to estimate the Binomial, and instead, exact methods may be required.
This will become clear when we discuss confidence intervals later in this section.
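The breakdown of the approximation for rare outcomes can be seen numerically; the sketch below (illustrative values with n·p = 2, well below the rule-of-thumb threshold of 5) compares the exact Binomial probability of observing no cases with its Normal approximation:

```python
import math
from statistics import NormalDist

def binom_cdf(k, n, p):
    """Exact Binomial P(Y <= k)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k + 1))

# Rare outcome: n * p = 2, well below the rule-of-thumb threshold of 5.
n, p = 100, 0.02
exact = binom_cdf(0, n, p)   # P(no cases) ~ 0.133
approx = NormalDist(n * p, math.sqrt(n * p * (1 - p))).cdf(0)
print(round(exact, 4), round(approx, 4))
```

The two answers disagree badly, which is why exact methods are preferred when the outcome is rare.
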

Sample size to estimate a proportion (e.g., prevalence)
When one wants to investigate the presence of disease, we are dealing with a presence/absence
(binomial) situation because each selected element is either infected or not infected. One
property of the Binomial distribution is that the variance equals p * (1 - p), where p is expressed
as a proportion (or it is p * (100 - p), when p is expressed as a percentage). Of course, the
standard deviation is the square root of p * (1 - p). We can then use the following formula to
determine the sample size.

n = (Z1-α/2)² P̂ Q̂ / L²  ≈  N (Z1-α/2)² P̂ Q̂ / [L²(N − 1) + (Z1-α/2)² P̂ Q̂]

n = estimated sample size for the study.

Z1-α/2 = value of Z which provides α/2 in each tail of normal curve if a 2-tailed test. If a 1-
tailed test is used, this should say 1 - α. If α, the type I error, is 0.05, then the 2-
tailed Z is 1.96. α specifies the probability of declaring a difference to be
statistically significant when no real difference exists in the population.
P̂ = the best guess of the prevalence of the disease in the population.
Q̂ = 1 − P̂
L = allowable error or required precision.
N = population size. If the population size is large, this is irrelevant (equivalent to
sampling with replacement).

A property of the variance of a proportion is that it is symmetric around its maximum at p = 0.5
(or 50%). Hence, the sample size is maximal when p is estimated to be 50%, and this value
should be used when there is no idea of the actual proportion.

Be careful with the value you select for L, especially at low or high prevalences. For example,
one is typically not interested in whether the prevalence equals 4% ± 10%, but rather 4% ± 2%.

Example: a farmer raising veal calves asks you to determine the prevalence of salmonellosis due
to Salmonella dublin in veal calves. It is a large unit with well over 1000 calves. The prevalence
on this farm is known to range from 0 to 80%. It is prudent to put the estimated prevalence at
50% because nothing is known about the actual prevalence except that it is between 0 and 80%.
By choosing 50%, you will end up with the largest possible sample size for a desired value of L
and Z. Now, you need to put values on L and Z, so assume you choose 5% and 1.96. Thus, you
are trying to estimate with 95% confidence a true prevalence of 50% plus or minus 5%. Now,
calculate n. Answer: 385 (or 278 if you use the finite population correction for N = 1000).
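Both answers can be reproduced with a small helper (a sketch of the formula above, rounding up to the next whole animal):

```python
import math

def n_proportion(p_hat, L, z=1.96, N=None):
    """Sample size to estimate a proportion within +/- L, with an
    optional finite population correction for population size N."""
    pq = p_hat * (1 - p_hat)
    if N is None:
        return math.ceil(z**2 * pq / L**2)
    return math.ceil(N * z**2 * pq / (L**2 * (N - 1) + z**2 * pq))

print(n_proportion(0.5, 0.05))          # 385
print(n_proportion(0.5, 0.05, N=1000))  # 278
```
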

Sample size to estimate differences between proportions
Suppose you want to compare two antibiotics. Two groups of animals are infected with an
appropriate pathogen and the percentages of recovery in both groups are compared. Question:
how many animals should be included in each group? The following formula can be used
(Fleiss, 1981):

n = [Z1-α/2 √(2 P Q) + Z1-β √(Pe Qe + Pc Qc)]² / (Pe − Pc)²

n = estimated sample size for each of the exposed and unexposed groups.
Z1-α/2 = value of Z which provides α/2 in each tail of normal curve if a 2-tailed test. If a 1-
tailed test is used, this should say 1 - α. If α, the type I error, is 0.05, then the 2-
tailed Z is 1.96. α specifies the probability of declaring a difference to be
statistically significant when no real difference exists in the population.
Z1-β = value of Z which provides β in the tail of normal curve. If β, the type II error, is 0.2
then the Z value is 0.84. β specifies the probability of declaring a difference to be
statistically nonsignificant when there is a real difference in the population.
Pe = estimate of response rate in exposed group or exposure rate in cases.
Pc = estimate of response rate in unexposed group or exposure rate in noncases.
P = (Pe + Pc) / 2
Q = 1 − P
Qe = 1 − Pe and Qc = 1 − Pc

Example: a pharmaceutical company has developed a brand new antibiotic against pathogen X.
No other antibiotics are available, so no comparison can be made with existing antibiotic
treatments. However, it is known from field data that 70% recover from the disease (although
effects on production are tremendous). It is expected (hoped) that after use of the antibiotic 95%
of the animals will improve and that the duration of the disease will be shorter as well. Thus,
p1 = 0.70 and p2 = 0.95. Choosing a two-sided confidence of 95% and a power of 80%, it follows
that the values for Z1-α/2 and Z1-β are 1.96 and 0.84, respectively. Using the formula, calculate n
for each group. Answer: 36 (or 43 in EpiInfo, which applies a continuity correction).
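The answer of 36 per group can be verified with a sketch of the Fleiss formula (without continuity correction):

```python
import math

def n_two_proportions(pe, pc, z_alpha=1.96, z_beta=0.84):
    """Fleiss (1981) sample size per group for comparing two proportions
    (no continuity correction)."""
    p_bar = (pe + pc) / 2
    q_bar = 1 - p_bar
    num = (z_alpha * math.sqrt(2 * p_bar * q_bar)
           + z_beta * math.sqrt(pe * (1 - pe) + pc * (1 - pc))) ** 2
    return math.ceil(num / (pe - pc) ** 2)

print(n_two_proportions(0.95, 0.70))  # 36 animals per group
```
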

Sample size to detect a disease in a population
Suppose one is interested in the percentage of farms that are infected with a pathogen X
(prevalence based on whether or not the pathogen is present at a farm). A farm is considered to
be infected when at least one of its animals is infected. First the appropriate number of farms as
units of concern is randomly chosen and secondly the status of the farm (infected or not) is
determined. The proportion of infected animals per farm is not of major interest. Ideally, we
would screen all the animals on a farm, but often this is not necessary. Suppose that it is known
that if the disease is present, about 50% of the animals are likely to be positive. By sampling one
animal you have a 50% probability of correctly concluding that a truly positive farm is positive.
In general, one aims at a higher probability to classify a positive farm correctly (e.g., 95%). By
selecting 2 animals, the probability is increased up to 75% (25% of drawings show two
negatives), 3 animals yield 87.5%, 4 animals 93.75% and 5 animals 96.875%. Thus, between 4
and 5 animals should be sampled to detect positive farms with a probability of 95%, if 50% of
the animals are truly diseased.

The calculations can be put into a general formula (Cannon and Roe, 1982):

n = [1 − (1 − p)^(1/d)] * [N − (d − 1)/2]
n = sample size
p = probability of finding at least one case (= confidence level, e.g., 0.95)
d = number of (detectable) cases in the population
N = population size

If the test that is used to evaluate the status of the animals is not 100% sensitive, d is equal to the
number of diseased or infected animals multiplied by the sensitivity of the test. It is assumed
that no false-positives are present or that they are ruled out by confirmatory tests.

Example: suppose you want to detect whether or not a flock of N = 1000 animals is positive with
regard to pathogen X. If X is present, you suspect that about 50% of the animals will be
infected, thus d = 500. Setting p to 0.95, n equals: (1 − 0.05^(1/500)) * (1000 − 499/2) = 4.48
(rounded as 5). Is this number much affected by N? (No, because the prevalence is rather high).
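A sketch of the Cannon and Roe calculation (the function name is ours, chosen for illustration):

```python
def n_detect(p, d, N):
    """Cannon and Roe (1982): sample size needed to detect, with
    confidence p, at least one of d detectable cases among N animals."""
    return (1 - (1 - p) ** (1 / d)) * (N - (d - 1) / 2)

n = n_detect(0.95, 500, 1000)
print(round(n, 2))  # 4.48, so sample 5 animals
```
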

A similar formula can be used to determine the maximum number of positives (d) in a population
given that all samples (n) tested negative:

d = [1 − (1 − p)^(1/n)] * [N − (n − 1)/2]

Example: suppose that 1,000 slaughter cows tested negative to E. coli O157:H7 and the total
number of cows slaughtered amounted to 1 million. What is the maximal prevalence in the
‘population’ of slaughter cows? What is the maximal prevalence if the sensitivity of the test is
only 85%? Answers: if p is set to 0.95, then d equals 2989, which is about 0.3%; 0.3% / 0.85 ≈ 0.35%.
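The same formula, solved for d, can be sketched as follows (again with an illustrative function name):

```python
def max_positives(p, n, N):
    """Maximum number of positives d consistent with n negative samples,
    at confidence level p (the detection formula solved for d)."""
    return (1 - (1 - p) ** (1 / n)) * (N - (n - 1) / 2)

d = max_positives(0.95, 1000, 1_000_000)
print(round(d))                               # about 2990 cows, ~0.3%
print(round(d / 1_000_000 / 0.85 * 100, 2))   # ~0.35% at 85% sensitivity
```
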

Sample size formula to estimate a mean
To calculate sample sizes for a mean obtained from a Normal distribution, we need to estimate
both the mean and its standard deviation (S). Secondly, we need to define an interval L,
indicating that with a probability of (1 - α)% the true population mean will be within the interval
of the sample mean ± L. This interval is therefore an indication of the precision of the estimate.
Thirdly, we have to decide on (1 - α). There is no general rule for this decision, but 95% is
commonly used. From this we can determine the corresponding number of units (Z) of S. For
example, if (1 - α) = 95%, then the significance level is 5% and Z0.05 = 1.96. This value indicates
that 95% of all the sample means will fall within the interval: mean ± 1.96 * S / √n.

The confidence interval (CI) for a mean, obtained by drawing elements from a Normal
distribution, can be written as:

CI = x̄ ± Z1-α/2 * S / √n
If the part after the ± sign is denoted as L, n is then calculated as:

n = (Z1-α/2)² * Ŝ² / L²
n = estimated sample size for the study.

Z1-α/2 = value of Z which provides α/2 in each tail of normal curve if a 2-tailed test. If a 1-
tailed test is used, this should say 1 - α. If α, the type I error, is 0.05, then the 2-
tailed Z is 1.96. α specifies the probability of declaring a difference to be
statistically significant when no real difference exists in the population.
Ŝ = estimate of standard deviation in the population.
L = allowable error or required precision.

Example: suppose you want to estimate growth/day in veal calves. Growth should be around
1000 g / day and an estimate of the S is 250 g. How many calves should be weighed to check
whether or not the mean growth of a large unit is between 950 and 1050 g / day? A confidence
level of 95% is required. (If you want more practice, also try calculating the sample size
required when (1) SD increases to 375; (2) the confidence level decreases to 90%; (3) L is
changed from 50 to 100). The answers you get should be: 96; 216; 67; 24.
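The main answer and two of the practice answers can be checked with a short sketch of the formula (rounding to the nearest whole calf):

```python
import math

def n_mean(S, L, z=1.96):
    """Sample size to estimate a mean within +/- L."""
    return (z * S / L) ** 2

print(round(n_mean(250, 50)))    # 96 calves
print(round(n_mean(375, 50)))    # 216 (larger SD)
print(round(n_mean(250, 100)))   # 24 (wider allowable error)
```
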

Sample size to estimate differences between means
Sometimes an investigator is interested in differences between groups of animals, e.g.,
differences in milk production between mastitic cows and healthy cows. Suppose milk
production is a Normally distributed variable. Now we can perform a one-tailed test, because
we know that mastitic cows will show reduced milk production. This has an impact on the value
of Z1-α: the one-sided value of Z0.05 is equal to the two-sided Z0.10.

In order to make the sample size calculation we need:

- an estimation of the difference between both groups: X̄e − X̄c = δ
- the standard deviation of the trait
- the significance level α
- the power of the test, i.e., the probability (1 − β) of obtaining a significant result if the true
  difference equals δ.

The sample size formula is written as:

n = 2 * [(Z1-α/2 + Z1-β) * S / (X̄e − X̄c)]²
n = estimated sample size for each of the exposed and unexposed groups.
Z1-α/2 = value of Z which provides α/2 in each tail of normal curve if a 2-tailed test is used or
α in one tail if a 1-tailed test is used. If α, the type I error, is 0.05, then the 2-tailed Z
is 1.96. α specifies the probability of declaring a difference to be statistically
significant when no real difference exists in the population.
Z1-β = value of Z which provides β in the tail of normal curve. If β, the type II error, is 0.2
then the Z value is 0.84. β specifies the probability of declaring a difference to be
statistically nonsignificant when there is a real difference in the population.
S = estimate of standard deviation common to both exposed and unexposed groups.
X̄e = estimate of mean outcome in exposed group or cases.

X̄c = estimate of mean outcome in unexposed group or noncases.

The constant, 2, in the formula arises from the assumption that the S is equal in both groups. The
calculated sample size is the number of individuals per group.

Example: suppose daily milk production in healthy cows is 25 liters and in mastitic cows it is
assumed to be reduced by 10%. Thus δ = 2.5 liters. The standard deviation of daily milk
production is known to be 6 liters. Given a one-sided α of 0.05, Z0.05 = 1.645. A power of 80%
gives Z0.20 = 0.84. Determine the sample size per group. For more practice vary the value of S, δ
and α to your liking. Answer for the example: 72 per group.
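Using exact Normal quantiles from the standard library (rather than the rounded table values), a sketch of the calculation:

```python
import math
from statistics import NormalDist

def n_diff_means(delta, S, alpha=0.05, power=0.80, one_sided=True):
    """Sample size per group to detect a mean difference delta,
    assuming a common standard deviation S in both groups."""
    z_alpha = NormalDist().inv_cdf(1 - alpha if one_sided else 1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * S / delta) ** 2)

print(n_diff_means(2.5, 6))  # 72 cows per group
```
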

In conclusion, remember that formulae do exist to estimate sample sizes, and these should
always be applied in prospective studies. These sample size calculations are only guidelines to
how many units should be investigated. The result is not a strict number, as the assumptions underlying
the calculations will almost never exactly mirror the true values. Many other formulae for
calculation of sample sizes do exist (e.g., for a different hypothesis, a somewhat different design
or different types of data). However, the general principle is always the same and, therefore,
only the most basic and most frequently used sample size formulae were presented here.