Chapter 13. Reading the Medical Literature

Purpose of the Chapter

This final chapter has several purposes. Most importantly, it ties together concepts and skills presented in
previous chapters and applies these concepts very specifically to reading medical journal articles.
Throughout the text, we have attempted to illustrate the strengths and weaknesses of some of the studies
discussed, but this chapter focuses specifically on those attributes of a study that indicate whether we, as
readers of the medical literature, can use the results with confidence. The chapter begins with a brief
summary of major types of medical studies. Next, we examine the anatomy of a typical journal article in
detail, and we discuss the contents of each component—abstract or summary, introduction, methods,
results, discussion, and conclusions. In this examination, we also point out common shortcomings, sources
of bias, and threats to the validity of studies.

Clinicians read the literature for many different reasons. Some articles are of interest because you want only
to be aware of advances in a field. In these instances, you may decide to skim the article with little interest in
how the study was designed and carried out. In such cases, it may be possible to depend on experts in the
field who write review articles to provide a relatively superficial level of information. On other occasions,
however, you want to know whether the conclusions of the study are valid, perhaps so that they can be used
to determine patient care or to plan a research project. In these situations, you need to read and evaluate the
article with a critical eye in order to detect poorly done studies that arrive at unwarranted conclusions.

To assist readers in their critical reviews, we present a checklist for evaluating the validity of a journal article.
The checklist notes some of the characteristics of a well-designed and well-written article. The checklist is
based on our experiences with medical students, house staff, journal clubs, and interactions with physician
colleagues. It also reflects the opinions expressed in an article describing how journal editors and
statisticians can interact to improve the quality of published medical research (Marks et al, 1988). A number
of authors have found that only a minority of published studies meet the criteria for scientific adequacy. The
checklist should assist you in using your time most effectively by allowing you to differentiate valid articles
from poorly done studies so that you can concentrate on the more productive ones.

Two guidelines recently published increase our optimism that the quality of the published literature will
continue to improve. The International Conference on Harmonization (ICH) E9 guideline “Statistical
Principles for Clinical Trials” (1999) addresses issues of statistical methodology in the design, conduct,
analysis, and evaluation of clinical trials. Application of the principles is intended to facilitate the general
acceptance of analyses and conclusions drawn from clinical trials.

The International Committee of Medical Journal Editors published the Uniform Requirements for Manuscripts
Submitted to Biomedical Journals in 1997. Under Statistics, the document states:

Describe statistical methods with enough detail to enable a knowledgeable reader with access to the original
data to verify the reported results. … When data are summarized in the results section, specify the statistical
methods used to analyze them.

The requirements also recommend using confidence intervals and avoiding sole reliance on P values.

Review of Major Study Designs

Chapter 2 introduced the major types of study designs used in medical research, broadly divided into
experimental studies (including clinical trials); observational studies (cohort, case–control, cross-
sectional/surveys, case–series); and meta-analyses. Each design has certain advantages over the others as
well as some specific disadvantages; they are briefly summarized in the following paragraphs. (A more
detailed discussion is presented in Chapter 2.)

Clinical trials provide the strongest evidence for causation because they are experiments and, as such, are
subject to the fewest problems or biases. Trials with randomized controls are the study type of
choice when the objective is to evaluate the effectiveness of a treatment or a procedure. Drawbacks to using
clinical trials include their expense and the generally long time needed to complete them.

Cohort studies are the best observational study design for investigating the causes of a condition, the course
of a disease, or risk factors. Causation cannot be proved with cohort studies, because they do not involve
interventions. Because they are longitudinal studies, however, they incorporate the correct time sequence to
provide strong evidence for possible causes and effects. In addition, in cohort studies that are prospective, as
opposed to historical, investigators can control many sources of bias. Cohort studies have disadvantages, of
course. If they take a long time to complete, they are frequently weakened by patient attrition. They are also
expensive to carry out if the disease or outcome is rare (so that a large number of subjects needs to be
followed) or requires a long time to develop.

Case–control studies are an efficient way to study rare diseases, examine conditions that take a long time to
develop, or investigate a preliminary hypothesis. They are the quickest and generally the least expensive
studies to design and carry out. Case–control studies also are the most vulnerable to possible biases,
however, and they depend entirely on high-quality existing records. A major issue in case–control studies is
the selection of an appropriate control group. Some statisticians have recommended the use of two control
groups: one similar in some ways to the cases (such as having been hospitalized or treated during the same
period) and another made up of healthy subjects.


Cross-sectional studies and surveys are best for determining the status of a disease or condition at a
particular point in time; they are similar to case–control studies in being relatively quick and inexpensive to
complete. Because cross-sectional studies provide only a snapshot in time, they may lead to misleading
conclusions if interest focuses on a disease or other time-dependent process.

Case–series studies are the weakest kinds of observational studies and represent a description of typically
unplanned observations; in fact, many would not call them studies at all. Their primary use is to provide
insights for research questions to be addressed by subsequent, planned studies.

Studies that focus on outcomes can be experimental or observational. Clinical outcomes remain the major
focus, but emphasis is increasingly placed on functional status and quality-of-life measures. It is important to
use properly designed and evaluated methods to collect outcome data. Evidence-based medicine makes
great use of outcome studies.

Meta-analysis may likewise focus on clinical trials or observational studies. Meta-analyses differ from the
traditional review articles in that they attempt to evaluate the quality of the research and quantify the
summary data. They are helpful when the available evidence is based on studies with small sample sizes or
when studies come to conflicting conclusions. Meta-analyses do not, however, take the place of well-
designed clinical trials.

The Abstract & Introduction Sections of a Research Report

Journal articles almost always include an abstract or summary of the article prior to the body of the article
itself. Most of us are guilty of reading only the abstract on occasion, perhaps because we are in a great hurry
or have only a cursory interest in the topic. This practice is unwise when it is important to know whether the
conclusions stated in the article are justified and can be used to make decisions. This section discusses the
abstract and introduction portions of a research report and outlines the information they should contain.

The Abstract
The major purposes of the abstract are (1) to tell readers enough about the article so they can decide whether
to read it in its entirety and (2) to identify the focus of the study. The International Committee of Medical
Journal Editors (1997) recommended that the abstract “state the purposes of the study or investigation, basic
procedures (selection of study subjects or experimental animals; observational and analytic methods), main
findings (specific data and their statistical significance, if possible) and the principal conclusions.” An
increasing number of journals, especially those we consider to be of high quality, now use structured
abstracts in which authors succinctly provide the above-mentioned information in separate, easily identified
paragraphs (Haynes et al, 1990).

We suggest asking two questions to decide whether to read the article: (1) If the study has been properly
designed and analyzed, would the results be important and worth knowing? (2) If the results are statistically
significant, does the magnitude of the change or effect also have clinical significance; if the results are not
statistically significant, was the sample size sufficiently large to detect a meaningful difference or effect? If the
answers to these questions are yes, then it is worthwhile to continue to read the report. Structured abstracts
are a boon to the busy reader and frequently contain enough information to answer these two questions.

The Introduction or Abstract

At one time, the following topics were discussed (or should have been discussed) in the introduction section;
however, with the advent of the structured abstract, many of these topics are now addressed directly in that
section. The important issue is that the information be available and easy to identify.

Reason for the Study

The introduction section of a research report is usually fairly short. Generally, the authors briefly mention
previous research that indicates the need for the present study. In some situations, the study is a natural
outgrowth or the next logical step of previous studies. In other circumstances, previous studies have been
inadequate in one way or another. The overall purpose of this information is twofold: to provide the
necessary background information to place the present study in its proper context and to provide reasons for
doing the present study. In some journals, the main justification for the study is given in the discussion
section of the article instead of in the introduction.

Purpose of the Study

Regardless of the placement of background information on the study, the introduction section is where the
investigators communicate the purpose of their study. The purpose of the study is frequently presented in the
last paragraph or last sentences at the end of the introduction. The purpose should be stated clearly and
succinctly, in a manner analogous to a 15-second summary of a patient case. For example, in the study
described in Chapter 5, Dennison and colleagues (1997, p. 15) do this very well; they stated their objective as

To evaluate, in a population-based sample of healthy children, fruit juice consumption and its effects on
growth parameters during early childhood.

This statement concisely communicates the population of interest (healthy children), the focus of the study
or independent variable (fruit juice consumption), and the outcome (effects on growth). As readers, we
should be able to determine whether the purpose for the study was conceived prior to data collection or if it
evolved after the authors viewed their data; the latter situation is much more likely to capitalize on chance
findings. The lack of a clearly stated research question is the most common reason medical manuscripts are
rejected by journal editors (Marks et al, 1988).

Population Included in the Study

In addition to stating the purpose of the study, the structured abstract or introduction section may include
information on the study’s location, length of time, and subjects. Alternatively, this information may be
contained in the methods section. This information helps readers decide whether the location of the study
and the type of subjects included in the study are applicable in the readers’ own practice environment.

The time covered by a study gives important clues regarding the validity of the results. If the study on a
particular therapy covers too long a period, patients entering at the beginning of the study may differ in
important ways from those entering at the end. For example, major changes may have occurred in the way
the disease in question is diagnosed, and patients entering near the end of the study may have had their
disease diagnosed at an earlier stage than did patients who entered the study early (see the section titled
“Detection Bias”). If the purpose of the study is to examine sequelae of a condition or procedure, the
period covered by the study must be sufficiently long to detect consequences.

The Method Section of a Research Report

The method section contains information about how the study was done. Simply knowing the study design
provides a great deal of information, and this information is often given in a structured abstract. In addition,
the method section contains information regarding subjects who participated in the study or, in animal or
inanimate studies, information on the animals or materials. The procedures used should be described in
sufficient detail that the reader knows how measurements were made. If methods are novel or require
interpretation, information should be given on the reliability of the assessments. The study outcomes should
be specified along with the criteria used to assess them. The method section also should include information
on the sample size for the study and on the statistical methods used to analyze the data; this information is
often placed at the end of the method section. Each of these topics is discussed in detail in this section.

How well the study has been designed is of utmost importance. The most critical statistical errors, according
to a statistical consultant to the New England Journal of Medicine, involve improper research design:
“Whereas one can correct incorrect analytical techniques with a simple reanalysis of the data, an error in
research design is almost always fatal to the study—one cannot correct for it subsequent to data collection”
(Marks et al, 1988, p. 1004). Many statistical advances have occurred in recent years, especially in the
methods used to design, conduct, and analyze clinical trials, and investigators should offer evidence that
they have obtained expert advice.

Subjects in the Study

Methods for Choosing Subjects

Authors should provide several critical pieces of information about the subjects included in their study so
that we readers can judge the applicability of the study results. Of foremost importance is how the patients
were selected for the study and, if the study is a clinical trial, how treatment assignments were made.

Randomized selection or assignment greatly enhances the generalizability of the results and avoids biases
that otherwise may occur in patient selection (see the section titled “Bias Related to Subject Selection”).
Some authors believe it is sufficient merely to state that subjects were randomly selected or treatments were
randomly assigned, but most statisticians recommend that the type of randomization process be specified as
well. Authors who report the randomization methods provide some assurance that randomization actually
occurred, because some investigators have a faulty view of what constitutes randomization. For example, an
investigator may believe that assigning patients to the treatment and the control on alternate days makes the
assignment random. As we emphasized in Chapter 4, however, randomization involves one of the precise
methods that ensure that each subject (or treatment) has a known probability of being selected.
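
The contrast between alternate-day assignment and a genuine randomization method can be sketched in a few lines of Python. This is only an illustration under our own assumptions: permuted-block randomization is one of several acceptable methods, and the function names, the block size of 4, and the use of a seed are hypothetical choices rather than anything prescribed in this text.

```python
import random


def alternate_day_assignment(enrollment_days):
    """Systematic (NOT random) assignment: the calendar day decides the group,
    so anyone who knows the day knows the assignment in advance."""
    return ["treatment" if day % 2 == 0 else "control" for day in enrollment_days]


def permuted_block_assignment(n_subjects, block_size=4, seed=None):
    """Permuted-block randomization for two arms (hypothetical helper).

    Each block holds equal numbers of treatment and control slots in a
    random order, so every subject has a known 1-in-2 chance of either arm.
    """
    if block_size % 2 != 0:
        raise ValueError("block_size must be even for two equal arms")
    rng = random.Random(seed)  # a seed makes the allocation list reproducible
    assignments = []
    while len(assignments) < n_subjects:
        block = ["treatment"] * (block_size // 2) + ["control"] * (block_size // 2)
        rng.shuffle(block)
        assignments.extend(block)
    return assignments[:n_subjects]


if __name__ == "__main__":
    print(alternate_day_assignment([1, 2, 3, 4]))
    print(permuted_block_assignment(8, block_size=4, seed=2024))
```

A report that names the method (for example, permuted blocks of a stated size generated before enrollment began) lets the reader confirm that the allocation could not be predicted from the enrollment date.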

Eligibility Criteria

The authors should present information to illustrate that major selection biases (discussed in the section
titled “Bias Related to Subject Selection”) have been avoided, an aspect especially important in
nonrandomized trials. The issue of which patients serve as controls was discussed in Chapter 2 in the context
of case–control studies. In addition, the eligibility criteria for both inclusion and exclusion of subjects in the
study must be specified in detail. We should be able to state, given any hypothetical subject, whether this
person would be included in or excluded from the study. Sauter and coworkers (2002) gave the following
information on patients included in their study:

Patients undergoing CHE in our Surgical Department were consecutively included into the study provided
that they did not meet one of the following exclusion criteria: (a) inflammatory bowel disease, history of
intestinal surgery, or diarrhea within the preceding 2 years, (b) body weight > 90 kg, (c) pregnancy, (d)
abnormal liver function tests …, (e) diabetes mellitus, (f) history of radiation of the abdominal region, and (g)
drug therapy with antibiotics, lipid lower agents, laxatives, and cholestyramine.
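
One way to check whether eligibility criteria are specified in enough detail is to ask whether they could be written as an unambiguous rule. The Python sketch below encodes the exclusion criteria quoted above. The record structure and every field name are hypothetical, and criterion (d) is reduced to a single yes/no flag because the quoted passage omits its details.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    # Hypothetical fields, one per quoted exclusion criterion.
    bowel_disease_surgery_or_diarrhea_past_2y: bool = False  # criterion (a)
    weight_kg: float = 70.0                                   # criterion (b)
    pregnant: bool = False                                    # criterion (c)
    abnormal_liver_tests: bool = False                        # criterion (d), details elided in the quote
    diabetes_mellitus: bool = False                           # criterion (e)
    abdominal_radiation_history: bool = False                 # criterion (f)
    on_excluded_drug_therapy: bool = False                    # criterion (g)


def exclusion_reasons(p):
    """Return the list of quoted criteria that exclude this patient."""
    reasons = []
    if p.bowel_disease_surgery_or_diarrhea_past_2y:
        reasons.append("(a) bowel disease, intestinal surgery, or diarrhea in past 2 years")
    if p.weight_kg > 90:
        reasons.append("(b) body weight > 90 kg")
    if p.pregnant:
        reasons.append("(c) pregnancy")
    if p.abnormal_liver_tests:
        reasons.append("(d) abnormal liver function tests")
    if p.diabetes_mellitus:
        reasons.append("(e) diabetes mellitus")
    if p.abdominal_radiation_history:
        reasons.append("(f) history of abdominal radiation")
    if p.on_excluded_drug_therapy:
        reasons.append("(g) excluded drug therapy")
    return reasons


def is_eligible(p):
    """A patient is eligible only if no exclusion criterion applies."""
    return not exclusion_reasons(p)


if __name__ == "__main__":
    print(is_eligible(Patient()))                      # no exclusions apply
    print(exclusion_reasons(Patient(weight_kg=95.0)))  # excluded by criterion (b)
```

If a criterion cannot be turned into a rule of this kind without guessing, the report has probably not described it in enough detail.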

Patient Follow-Up

For similar reasons, sufficient information must be given regarding the procedures the investigators used to
follow up patients, and they should state the numbers lost to follow-up. Some articles include this
information under the results section instead of in the methods section.

The description of follow-up and dropouts should be sufficiently detailed to permit the reader to draw a
diagram of the information. Occasionally, an article presents such a diagram, as was done by Hébert and
colleagues in their study of elderly residents in Canada (1997), reproduced in Figure 13–1. Such a diagram
makes very clear the number of patients who were eligible, those who were not eligible because of specific
reasons, the dropouts, and so on.

Figure 13–1. Flow of the subjects through the study, a representative sample of elderly people living at home in
Sherbrooke, Canada, 1991–1993. (Reproduced, with permission, from Figure 1 in Hébert R, Brayne C,
Spiegelhalter D: Incidence of functional decline and improvement in a community-dwelling very elderly
population. Am J Epidemiol 1997;145:935–944.)

Bias Related to Subject Selection

Bias in studies should not happen; it is an error related to selecting subjects or procedures or to measuring a
characteristic. Biases are sometimes called measurement errors or systematic errors to distinguish them
from random error (random variation), which occurs any time a sample is selected from a population. This
section discusses selection bias, a type of bias common in medical research.

Selection biases can occur in any study, but they are easier to control in clinical trials and cohort designs. It is
important to be aware of selection biases, even though it is not always possible to predict exactly how their
presence affects the conclusions. Sackett (1979) enumerated 35 different biases. We discuss some of the
major ones that seem especially important to the clinician. If you are interested in a more detailed
discussion, consult the article by Sackett and the text by Feinstein (1985), which devotes several chapters to
the discussion of bias (especially Chapter 4, in the section titled “The Meaning of the Term Probability,” and
Chapters 15–17).

Prevalence or Incidence Bias

Prevalence (Neyman) bias occurs when a condition is characterized by early fatalities (some subjects die
before they are diagnosed) or silent cases (cases in which the evidence of exposure disappears when the
disease begins). Prevalence bias can result whenever a time gap occurs between exposure and selection of
study subjects and the worst cases have died. A cohort study begun prior to the onset of disease is able to
detect occurrences properly, but a case–control study that begins at a later date consists only of the people
who did not die. This bias can be prevented in cohort studies and avoided in case–control studies by limiting
eligibility for the study to newly diagnosed or incident cases. The practice of limiting eligibility is common in
population-based case–control studies in cancer epidemiology.

To illustrate prevalence or incidence bias, let us suppose that two groups of people are being studied: those
with a risk factor for a given disease (eg, hypertension as a risk factor for stroke) and those without the risk
factor. Suppose 1000 people with hypertension and 1000 people without hypertension have been followed
for 10 years. At this point, we might have the situation shown in Table 13–1.

Table 13–1. Illustration of Prevalence Bias: Actual Situation.

Number of Patients in 10-Year Cohort Study

                        Alive with         Dead from   Alive with No
                        Cerebrovascular    Stroke      Cerebrovascular
                        Disease                        Disease
With hypertension       50                 250         700
Without hypertension    80                 20          900

A cohort study begun 10 years ago would conclude correctly that patients with hypertension are more likely
to develop cerebrovascular disease than patients without hypertension (300 to 100) and far more likely to die
from it (250 to 20).

Suppose, however, a case–control study is undertaken at the end of the 10-year period without limiting
eligibility to newly diagnosed cases of cerebrovascular disease. Then the situation illustrated in Table 13–2
results.

Table 13–2. Illustration of Prevalence Bias: Result with Case–Control Design.

Number of Patients in Case–Control Study at End of 10 Years

Patients                With Cerebrovascular   Without Cerebrovascular
                        Disease                Disease
With hypertension       50                     700
Without hypertension    80                     900

The odds ratio is calculated as (50 × 900)/(80 × 700) = 0.80, making it appear that hypertension is actually a
protective factor for the disease! The bias introduced in an improperly designed case–control study of a
disease that kills off one group faster than the other can lead to a conclusion exactly the opposite of the
correct conclusion that would be obtained from a well-designed case–control study or a cohort study.
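
The arithmetic behind Tables 13–1 and 13–2 can be checked with a short Python sketch. The counts are taken from the two tables. The function names are our own, and the final calculation, which counts the patients who died of stroke as cases, is an added assumption meant only to show what a case–control design that captured every case would find.

```python
def relative_risk(exposed_events, exposed_total, unexposed_events, unexposed_total):
    """Risk in the exposed group divided by risk in the unexposed group."""
    return (exposed_events / exposed_total) / (unexposed_events / unexposed_total)


def odds_ratio(exposed_cases, exposed_controls, unexposed_cases, unexposed_controls):
    """Cross-product ratio of a 2 x 2 table."""
    return (exposed_cases * unexposed_controls) / (exposed_controls * unexposed_cases)


# Cohort view (Table 13-1): cerebrovascular disease = alive with disease + dead from stroke.
cohort_rr = relative_risk(50 + 250, 1000, 80 + 20, 1000)

# Case-control view restricted to survivors (Table 13-2).
survivor_or = odds_ratio(50, 700, 80, 900)

# Assumption for illustration: all cases, including deaths, are counted.
all_cases_or = odds_ratio(50 + 250, 700, 80 + 20, 900)

print(f"Relative risk in the cohort:       {cohort_rr:.2f}")     # 3.00
print(f"Odds ratio among survivors only:   {survivor_or:.2f}")   # 0.80
print(f"Odds ratio when deaths are counted: {all_cases_or:.2f}")  # 3.86
```

The same exposure appears harmful in two of the calculations and protective in the third; the only difference is that the survivors-only table silently omits the 270 patients who died.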

Admission Rate Bias

Admission rate bias (Berkson’s fallacy) occurs when the study admission rates differ, which causes major
distortions in risk ratios. As an example, admission rate bias can occur in studies of hospitalized patients
when patients (cases) who have the risk factor are admitted to the hospital more frequently than either the
cases without the risk factor or the controls with the risk factor.

This fallacy was first pointed out by Berkson (1946) in evaluating an earlier study that had concluded that
tuberculosis might have a protective effect on cancer. This conclusion was reached after a case–control study
found a negative association between tuberculosis and cancer: The frequency of tuberculosis among
hospitalized cancer patients was less than the frequency of tuberculosis among the hospitalized control
patients who did not have cancer. These counterintuitive results occurred because a smaller proportion of
patients who had both cancer and tuberculosis were hospitalized and thus available for selection as cases in
the study; chances are that patients with both diseases were more likely to die than patients with cancer or
tuberculosis alone.
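
A small numeric sketch shows how differing admission rates alone can manufacture an association. All of the numbers below are invented for illustration and are not taken from Berkson's paper: we assume a community in which tuberculosis and cancer are unrelated, and we assume that a smaller percentage of people with both diseases reach the hospital.

```python
def odds_ratio(exposed_cases, exposed_controls, unexposed_cases, unexposed_controls):
    """Cross-product ratio of a 2 x 2 table."""
    return (exposed_cases * unexposed_controls) / (exposed_controls * unexposed_cases)


# Hypothetical community of 10,000 people in which tuberculosis (TB) and
# cancer occur independently, so the true odds ratio is 1.
community = {"tb_cancer": 100, "tb_only": 900, "cancer_only": 900, "neither": 8100}

# Hypothetical percentage of each group that is hospitalized.
percent_hospitalized = {"tb_cancer": 20, "tb_only": 50, "cancer_only": 50, "neither": 10}

# Counts available to a hospital-based case-control study.
hospital = {group: n * percent_hospitalized[group] // 100 for group, n in community.items()}


def tb_cancer_odds_ratio(counts):
    """Odds ratio relating TB (exposure) to cancer (case status)."""
    return odds_ratio(
        counts["tb_cancer"], counts["tb_only"], counts["cancer_only"], counts["neither"]
    )


print(f"Odds ratio in the community: {tb_cancer_odds_ratio(community):.2f}")  # 1.00
print(f"Odds ratio in the hospital:  {tb_cancer_odds_ratio(hospital):.2f}")   # 0.08
```

Under these assumed rates, a study confined to hospitalized patients would suggest that tuberculosis strongly protects against cancer, even though the two conditions are unrelated in the community from which the patients came.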

It is important to be aware of admission rate bias because many case–control studies reported in the medical
literature use hospitalized patients as sources for both cases and controls. The only way to control for this
bias is to include an unbiased control group, best accomplished by choosing controls from a wide variety of
disease categories or from a population of healthy subjects. Some statisticians suggest using two control
groups in studies in which admission bias is a potential problem.

Nonresponse Bias and the Volunteer Effect

Several steps discussed in Chapter 11 can be taken to reduce potential bias when subjects fail to respond to a
survey. Bias that occurs when patients either volunteer or refuse to participate in studies is similar to
nonresponse bias. This effect was studied in the nationwide Salk polio vaccine trials in 1954 by using two
different study designs to evaluate the effectiveness of the vaccine (Meier, 1989). In some communities,
children were randomly assigned to receive either the vaccine or a placebo injection. Some communities,
however, refused to participate in a randomized trial; they agreed, instead, that second graders could be
offered the vaccination and first and third graders could constitute the controls. In analysis of the data,
researchers found that families who volunteered their children for participation in the nonrandomized study
tended to be better educated and to have a higher income than families who refused to participate. They also
tended to be absent from school with a higher frequency than nonparticipants.

Although in this example we might guess how absence from school could bias results, it is not always easy to
determine how selection bias affects the outcome of the study; it may cause the experimental treatment to
appear either better or worse than it should. Investigators should therefore reduce the potential for
nonresponse bias as much as possible by using all possible means to increase the response rate and obtain
the participation of most eligible patients. Using databases reduces response bias, but sometimes other
sources of bias are present, that is, reasons that a specific group or selected information is underrepresented
in the database.

Membership Bias

Membership bias is essentially a problem of preexisting groups. It also arises because one or more of the
same characteristics that cause people to belong to the groups are related to the outcome of interest. For
example, investigators have not been able to perform a clinical trial to examine the effects of smoking; some
researchers have claimed it is not smoking itself that causes cancer but some other factor that simply
happens to be more common in smokers. As readers of the medical literature, we need to be aware of
membership bias because it cannot be prevented, and it makes the study of the effect of potential risk factors
related to life-style very difficult.

A problem similar to membership bias is called the healthy worker effect; it was recognized in epidemiology
when workers in a hazardous environment were unexpectedly found to have a higher survival rate than the
general public. After further investigation, the cause of this incongruous finding was determined: Good health
is a prerequisite in persons who are hired for work, but being healthy enough to work is not a requirement for
persons in the general public.

Procedure Selection Bias

Procedure selection bias occurs when treatment assignments are made on the basis of certain characteristics
of the patients, with the result that the treatment groups are not really similar. This bias frequently occurs in
studies that are not randomized and is especially a problem in studies using historical controls. A good
example is the comparison of a surgical versus a medical approach to a problem such as coronary artery
disease. In early studies comparing surgical and medical treatment, patients were not randomized, and the
evidence pointed to the conclusion that patients who received surgery were healthier than those treated
medically; that is, only healthier patients were subjected to the risks associated with the surgery. The
Coronary Artery Surgery Study (CASS, 1983) was undertaken in part to resolve these questions. It is important
to be aware of procedure selection bias because many published studies describe a series of patients, some
treated one way and some another way, and then proceed to make comparisons and draw inappropriate
conclusions as a result.

Procedures Used in the Study and Common Procedural Biases

Terms and Measurements

The procedures used in the study are also described in the method section. Here authors provide definitions
of measures used in the study, especially any operational definitions developed by the investigators. If
unusual instruments or methods are used, the authors should provide a reference and a brief description. For
example, the study of screening for domestic violence in emergency department patients by Lapidus and
colleagues (2002) defined domestic violence as “past or current physical, sexual, emotional, or verbal harm to
a woman caused by a spouse, partner, or family member.” Domestic violence screening was defined as
“assessing an individual to determine if she has been a victim of domestic violence.”

The journal Stroke has the practice of presenting the abbreviations and acronyms used in the article in a box.
This makes the abbreviations clear and also easy to refer to in reading other sections of the article. For
example, in reporting their study of sleep-disordered breathing and stroke, Good and colleagues (1996)
presented a list of abbreviations at the top of the column that describe the subjects and methods in the

Several biases may occur in the measurement of various patient characteristics and in the procedures used or
evaluated in the study. Some of the more common biases are described in the following subsections.

Procedure Bias

Procedure bias, discussed by Feinstein (1985), occurs when groups of subjects are not treated in the same
manner. For example, the procedures used in an investigation may lead to detection of other problems in
patients in the treatment group and make these problems appear to be more prevalent in this group. As
another example, the patients in the treatment group may receive more attention and be followed up more
vigorously than those in another group, thus stimulating greater compliance with the treatment regimen. The
way to avoid this bias is by carrying out all maneuvers except the experimental factor in the same way in all
groups and examining all outcomes using similar procedures and criteria.

Recall Bias

Recall bias may occur when patients are asked to recall certain events, and subjects in one group are more
likely to remember the event than those in the other group. For example, people take aspirin commonly and
for many reasons, but patients diagnosed as having peptic ulcer disease may recall the ingestion of aspirin
with greater accuracy than those without gastrointestinal problems. In the study of the relationship between
juice consumption and growth, Dennison and associates (1997) asked parents to keep a daily journal of all
the liquid consumed by their children; a properly maintained journal helps reduce recall bias.

Insensitive-Measure Bias

Measuring instruments may not be able to detect the characteristic of interest or may not be properly
calibrated. For example, routine x-ray films are an insensitive method for detecting osteoporosis because
bone loss of approximately 30% must occur before a roentgenogram can detect it. Newer densitometry
techniques are more sensitive and thus avoid insensitive-measure bias.

Detection Bias

Detection bias can occur because a new diagnostic technique is introduced that is capable of detecting the
condition of interest at an earlier stage. Survival for patients diagnosed with the new procedure
inappropriately appears to be longer, merely because the condition was diagnosed earlier.

A spin-off of detection bias, called the Will Rogers phenomenon (because of his attention to human
phenomena), was described by Feinstein and colleagues (1985). They found that a cohort of subjects with
lung cancer first treated in 1953–1954 had lower 6-month survival rates for patients with each of the three
main stages (localized, nodal involvement, and metastases) as well as for the total group than did a 1977
cohort treated at the same institutions. Newer imaging procedures were used with the later group; however,
according to the old diagnostic classification, this group had a prognostically favorable zero-time shift in that
their disease was diagnosed at an earlier stage. In addition, by detecting metastases in the 1977 group that
were missed in the earlier group, the new technologic approaches resulted in stage migration; that is,
members of the 1977 cohort were diagnosed as having a more advanced stage of the disease, whereas they
would have been diagnosed as having earlier-stage disease in 1953–1954. The individuals who migrated from
the earlier-stage group to the later-stage group tended to have the poorest survival in the earlier-stage group;
so removing them resulted in an increase in survival rates in the earlier group. At the same time, these
individuals, now assigned to the later-stage group, were better off than most other patients in this group, and
their addition to the group resulted in an increased survival in the later-stage group as well. The authors
stated that the 1953–1954 and 1977 cohorts actually had similar survival rates when patients in the 1977
group were classified according to the criteria that would have been in effect had there been no advances in
diagnostic techniques.
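
The arithmetic behind stage migration can be seen in a small numerical sketch. The survival values below are invented for illustration and are not data from Feinstein and colleagues; the function name and the threshold are likewise hypothetical. Moving the poorest-prognosis patients out of the earlier-stage group raises the mean in both groups, while the overall mean does not change.

```python
from statistics import mean

# Hypothetical survival times in months; NOT data from Feinstein and colleagues.
early_stage = [10, 30, 40, 50, 60]
late_stage = [2, 4, 6, 8]

def migrate(early, late, threshold):
    """Reclassify early-stage patients whose survival is below `threshold` as late stage."""
    movers = [s for s in early if s < threshold]
    stayers = [s for s in early if s >= threshold]
    return stayers, late + movers

new_early, new_late = migrate(early_stage, late_stage, threshold=20)

overall_before = mean(early_stage + late_stage)
overall_after = mean(new_early + new_late)

print("early-stage mean:", mean(early_stage), "->", mean(new_early))
print("late-stage mean: ", mean(late_stage), "->", mean(new_late))
print("overall mean:    ", overall_before, "->", overall_after)
```

In this toy example the same nine patients are merely relabeled, which is why stage-specific survival improves without any change in the survival of the cohort as a whole.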

Compliance Bias

Compliance bias occurs when patients find it easier or more pleasant to comply with one treatment than with
another. For example, in the treatment of hypertension, a comparison of α-methyldopa versus
hydrochlorothiazide may demonstrate inaccurate results because some patients do not take α-methyldopa
owing to its unpleasant side effects, such as drowsiness, fatigue, or impotence in male patients.

Assessing Study Outcomes


Variation in Data

In many clinics, a nurse collects certain information about a patient (eg, height, weight, date of birth, blood
pressure, pulse) and records it on the medical record before the patient is seen by a physician. Suppose a
patient’s blood pressure is recorded as 140/88 on the chart; the physician, taking the patient’s blood pressure
again as part of the physical examination, observes a reading of 148/96. Which blood pressure reading is
correct? What factors might be responsible for the differences in the observation? We use blood pressure and
other clinical information to examine sources of variation in data and ways to measure the reliability of
observations. Two classic articles in the Canadian Medical Association Journal (McMaster University Health
Sciences Centre, Department of Clinical Epidemiology and Biostatistics, 1980a; 1980b) discuss sources of
clinical disagreement and ways disagreement can be minimized.

Factors That Contribute to Variation in Clinical Observations

Variation in clinical observations and measurements, meaning variability in measurements on the same subject,
can be classified into three categories: (1) variation in the characteristic being measured, (2) variation introduced
by the examiner, and (3) variation owing to the instrument or method used. It is especially important to
control variation due to the last two factors as much as possible so that the reported results will
generalize as intended.

Substantial variability may occur in the measurement of biologic characteristics. For example, a person’s
blood pressure is not the same from one time to another, so repeated readings differ. A patient’s
description of symptoms to two different physicians may vary because the patient may forget something.
Medications and illness can also affect the way a patient behaves and what information he or she remembers
to tell a nurse or physician.

Even when no change occurs in the subject, different observers may report different measurements. When
examination of a characteristic requires visual acuity, such as the reading on a sphygmomanometer or the
features on an x-ray film, differences may result from the varying visual abilities of the observers. Such
differences can also play a role when hearing (detecting heart sounds) or feeling (palpating internal organs) is
required. Some individuals are simply more skilled than others in history taking or performing certain
examinations.

Variability also occurs when the characteristic being measured is a behavioral attribute. Two examples are
measurements of functional status and measurements of pain; here the additional component of patient or
observer interpretation can increase apparent variability. In addition, observers may tend to observe and
record what they expect based on other information about the patient. These factors point out the need for a
standardized protocol for data collection.

The instrument used in the examination can be another source of variation. For instance, mercury column
sphygmomanometers are less inherently variable than aneroid models. In addition, the environment in
which the examination takes place, including lighting and noise level, presence of other individuals, and
room temperature, can produce apparent differences. Methods for measuring behavior-related
characteristics such as functional status or pain usually consist of a set of questions answered by patients and
hence are not as precise as instruments that measure physical characteristics.

Several steps can be taken to reduce variability. Taking a history when the patient is calm and not heavily
medicated and checking with family members when the patient is incapacitated are both useful in
minimizing errors that result from a patient’s illness or the effects of medication. Collecting information and
making observations in a proper environment is also a good strategy. Recognizing one’s own strengths and
weaknesses helps one evaluate the need for other opinions. Blind assessment, especially of subjective
characteristics, guards against errors resulting from preconceptions. Repeating questionable aspects of the
examination or asking a colleague to perform a key aspect (blindly, of course) reduces the possibility of error.
Having well-defined operational guidelines for using classification scales helps people use them in a
consistent manner. Ensuring that instruments are properly calibrated and correctly used eliminates many
errors and thus reduces variation.

Ways to Measure Reliability and Validity

A common strategy to ensure the reliability or reproducibility of measurements, especially for research
purposes, is to replicate the measurements and evaluate the degree of agreement. We discussed intra- and
interrater reliability in Chapter 5 and discussed reliability and validity in detail in Chapter 11.

One approach to establishing the reliability of a measure is to repeat the measurements at a different time or
by a different person and compare the results. When the outcome is nominal, the kappa statistic is used;
when the scale of measurement is numerical, the statistic used to examine the relationship is the correlation
coefficient (Chapter 8) or the intraclass correlation (Chapter 11).
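
As a reminder of how agreement on a nominal outcome is quantified, the following minimal sketch computes Cohen's kappa for two raters. The ratings are invented (two hypothetical readers classifying ten films as abnormal or normal), and the function name is our own. Kappa compares the observed proportion of agreement with the agreement expected by chance from each rater's marginal frequencies.

```python
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters classifying the same subjects on a nominal scale."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    # Observed proportion of subjects on which the raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    # Undefined (division by zero) if both raters use a single identical category.
    return (observed - expected) / (1 - expected)

# Hypothetical ratings, for illustration only.
rater_1 = ["abn", "abn", "abn", "abn", "nor", "nor", "nor", "nor", "nor", "nor"]
rater_2 = ["abn", "abn", "abn", "nor", "nor", "nor", "nor", "nor", "nor", "abn"]

kappa = cohen_kappa(rater_1, rater_2)
print(round(kappa, 2))  # observed 0.80, expected 0.52, kappa about 0.58
```

Here the raters agree on 80% of the films, but because 52% agreement would be expected by chance alone, kappa is considerably lower than the raw agreement.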

Hébert and colleagues (1997) used the Functional Autonomy Measurement System (SMAF) instrument to
measure cognitive functioning and depression. They evaluated its validity by comparing the SMAF score with
the nursing time required for care (r = 0.88) and by comparing disability scores for residents living in settings of
different levels of care. The high correlation between nursing time and score indicates that patients with
higher (more dependent) scores required more nursing time, a reasonable expectation. Another indication of
validity is higher disability scores among residents living in settings where they were provided with a high
level of care and lower scores among residents who live independently.


Another aspect of assessing the outcome is related to ways of increasing the objectivity and decreasing the
subjectivity of the assessment. In studies involving the comparison of two treatments or procedures, the
most effective method for achieving objective assessment is to have both patient and investigator be
unaware of which method was used. If only the patient is unaware, the study is called blind; if both patient
and investigator are unaware, it is called double-blind.


Ballard and colleagues (1998) studied the effect of antenatal thyrotropin-releasing hormone in preventing
lung disease in preterm infants in a randomized study. Experimental subjects were given the hormone, and
controls were given placebo. The authors stated:

The women were randomly assigned within centers to the treatment or placebo group in permuted blocks of
four. The study was double-blinded, and only the pharmacies at the participating centers had the
randomization schedule.
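
For readers unfamiliar with the term, permuted blocks of four means that within every consecutive set of four subjects, two are assigned to each arm in random order, so the groups never drift far out of balance. The sketch below shows one way such a schedule might be generated; it is an illustration under our own assumptions (function name, arm labels, seed), not the procedure actually used by Ballard and colleagues.

```python
import random

def permuted_block_schedule(n_blocks, block_size=4, arms=("treatment", "placebo"), seed=None):
    """Build an assignment list in which every block contains each arm equally often."""
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)          # a seeded generator makes the schedule reproducible
    per_arm = block_size // len(arms)
    schedule = []
    for _block in range(n_blocks):
        block = [arm for arm in arms for _copy in range(per_arm)]
        rng.shuffle(block)             # random order within the block
        schedule.extend(block)
    return schedule

schedule = permuted_block_schedule(n_blocks=5, seed=1)
for start in range(0, len(schedule), 4):
    print(schedule[start:start + 4])
```

In a double-blind trial, a list like this would be held only by a party not involved in assessing patients, as the pharmacies were in the study quoted above.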

Blinding helps to reduce a priori biases on the part of both patient and physician. Patients who are aware of
their treatment assignment may imagine certain side effects or expect specific benefits, and their
expectations may influence the outcome. Similarly, investigators who know which treatment has been
assigned to a given patient may be more watchful for certain side effects or benefits. Although we might
suspect an investigator who is not blinded to be more favorable to the new treatment or procedure, just the
opposite may happen; that is, the investigator may bend over backward to keep from being prejudiced by his
or her knowledge and therefore may err in favor of the control.

Knowledge of treatment assignment may be somewhat less influential when the outcome is straightforward,
as is true for mortality. With mortality, it is difficult to see how outcome assessment can be biased. Many
examples exist in which the outcome appears to be objective, however, even though its evaluation contains
subjective components. Many clinical studies attempt to ascribe reasons for mortality or morbidity, and
judgment begins to play a role in these cases. For example, mortality is often an outcome of interest in
studies involving organ transplantation, and investigators wish to differentiate between deaths from failure
of the organ and deaths from an unrelated cause. If the patient dies in an automobile accident, for example,
investigators can easily decide that the death is not due to organ rejection; but in most situations, the
decision is not so easy.

The issue of blinding becomes more important as the outcome becomes less amenable to objective
determination. Research that focuses on quality-of-life outcomes, such as chest pain status, activity
limitation, or recreational status, requires subjective measures. Although patients cannot be blinded in many
studies, the subjective outcomes can be assessed by a person, such as another physician, a psychologist, or a
physical therapist, who is blind to the treatment the patient received.

Data Quality and Monitoring

The method section is also the place where steps taken to ensure the accuracy of the data should be
described. Increased variation and possibly incorrect conclusions can occur if the correct observation is
made but is incorrectly recorded or coded. Dennison and colleagues (1997, p. 16) stated: “All questionnaire
data were dual-entered and verified before being entered into a . . . database.” Dual or duplicate entry
decreases the likelihood of errors because it is unusual for the same entry error to occur twice.
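
The logic of dual entry is simply to key the same forms twice and compare the two versions field by field. The following sketch illustrates the comparison step; the record layout, field names, and values are hypothetical and are not taken from Dennison and colleagues.

```python
def find_entry_discrepancies(first_pass, second_pass):
    """Return (record_id, field, value_1, value_2) for every mismatch between two keyings."""
    discrepancies = []
    for record_id in sorted(set(first_pass) | set(second_pass)):
        rec1 = first_pass.get(record_id, {})
        rec2 = second_pass.get(record_id, {})
        for field in sorted(set(rec1) | set(rec2)):
            v1, v2 = rec1.get(field), rec2.get(field)
            if v1 != v2:
                discrepancies.append((record_id, field, v1, v2))
    return discrepancies

# Two hypothetical keyings of the same questionnaires; record 101 has a transposed value.
entry_1 = {101: {"age_months": 30, "juice_oz": 12}, 102: {"age_months": 54, "juice_oz": 6}}
entry_2 = {101: {"age_months": 30, "juice_oz": 21}, 102: {"age_months": 54, "juice_oz": 6}}

flagged = find_entry_discrepancies(entry_1, entry_2)
print(flagged)
```

Each flagged item would then be checked against the original paper form before the record is accepted.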

Multicenter studies provide additional data quality challenges. It is important that an accurate and complete
protocol be developed to ensure that data are handled the same way in all centers. Gelber and colleagues
(1997) studied data collected from 63 centers in North America in setting normative values for cardiovascular
autonomic nervous system tests. They reported that:

All site personnel were trained by a member of the Autonomic Nervous System (ANS) Reading Center in the
use of the equipment and testing methodology. . . . All data were analyzed at a single Autonomic Reading
Center. The analysis program contains internal checks which alert the analyzing technician to any aberrant
data points overlooked during the editing process and warns the technician when the results suggest that the
test may have been performed improperly. The analysis of each study was reviewed by the director of the
ANS Reading Center.

In addition to standardized training, the data entry process itself was monitored for potential errors.

Determining an Appropriate Sample Size

Specifying the sample size needed to detect a difference or an effect of a given magnitude is one of the most
crucial pieces of information in the report of a medical study. Recall that missing a significant difference is
called a type II error, and this error can happen when the sample size is too small. We provide many
illustrations in the chapters that discuss specific statistical methods, especially Chapters 5, 6, 7, 8, 10, and 11.

Determination of sample size is referred to as power analysis or as determining the power of a study. An
assessment of power is essential in negative studies, studies that fail to find an expected difference or
relationship; we feel so strongly about this point that we recommend that readers disregard negative studies
that do not provide information on power.

Harper (1997) studied the use of paracervical block to diminish cramping associated with cryosurgery. She
stated:

To have a power of 80% to detect a difference of 20 mm on the visual analog scale at the 0.05 level of
significance (assuming a standard deviation of 30 mm), the power analysis a priori showed that 35 women
would be needed in each cohort. The first 35 women who met the inclusion and exclusion criteria for
cryosurgery were treated in the usual manner with no anesthetic block given before cryosurgery. The
variances of the actual responses were greater than anticipated in the a priori power analysis, leading to the
subsequent enrollment of the next five women qualifying for the study for a total of 40 women in the usual
treatment group. This increase in enrollment maintained the power of the study.

Thus, as a result of analysis of data, the investigator opted to increase the sample size to maintain power.

We repeatedly emphasize the need to perform a power analysis prior to beginning a study and have
illustrated how to estimate power using statistical programs for that purpose. Investigators planning
complicated studies or studies involving a number of variables are especially advised to contact a statistician
for assistance.
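
The figures Harper reported can be checked approximately with the usual normal-approximation formula for comparing two means, n per group = 2[(z for α/2 + z for β)σ/δ]². The sketch below assumes a two-sided test and equal group sizes, which the quotation does not state explicitly, and the function name is our own; dedicated power programs based on the t distribution give slightly larger values.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate sample size per group for comparing two means (normal approximation)."""
    z = NormalDist()                       # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)     # two-sided critical value
    z_beta = z.inv_cdf(power)              # value corresponding to the desired power
    return 2 * ((z_alpha + z_beta) * sd / delta) ** 2

# Values from the quotation: difference of 20 mm, standard deviation of 30 mm.
n = n_per_group(delta=20, sd=30)
print(round(n, 1), ceil(n))
```

The approximation gives a little over 35 subjects per group, in line with the 35 women per cohort stated in the article; whether one rounds up to 36 depends on the convention used.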

Evaluating the Statistical Methods


Statistical methods are the primary focus of this text, and only a brief summary and some common problems
are listed here. At the risk of oversimplification, the use of statistics in medicine can be summarized as
follows: (1) to answer questions concerning differences; (2) to answer questions concerning associations; and
(3) to control for confounding issues or to make predictions. If you can determine the type of question
investigators are asking (from the stated purpose of the study) and the types and numbers of measures used
in the study (numerical, ordinal, nominal), then the appropriate statistical procedure should be relatively
easy to specify. Tables 10–1 and 10–2 in Chapter 10 and the flowcharts in Appendix C were developed to
assist with this process. Some common biases in evaluating data are discussed in the next sections.

Fishing Expedition

A fishing expedition is the name given to studies in which the investigators do not have clear-cut research
questions guiding the research. Instead, data are collected, and a search is carried out for results that are
significant. The problem with this approach is that it capitalizes on chance occurrences and leads to
conclusions that may not hold up if the study is replicated. Unfortunately, such studies are rarely repeated,
and incorrect conclusions can remain a part of accepted wisdom.

Multiple Significance Tests

Multiple tests in statistics, just as in clinical medicine, result in increased chances of making a type I, or false-
positive, error when the results from one test are interpreted as being independent of the results from
another. For example, a factorial design for analysis of variance in a study involving four groups measured on
three independent variables has the possibility of 18 comparisons (6 comparisons among the four groups on
each of three variables), ignoring interactions. If each comparison is made for P ≤ 0.05, the probability of
finding one or more comparisons significant by chance is considerably greater than 0.05. The best way to
guard against this bias is by performing the appropriate global test with analysis of variance prior to making
individual group comparisons (Chapter 7) or using an appropriate method to analyze multiple variables
(Chapter 10).
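
To see how quickly the error rate grows, the sketch below computes the chance of at least one false-positive result among 18 comparisons, each tested at 0.05. It assumes the comparisons are independent, which is not strictly true when the same groups are compared repeatedly, so the figure is an approximation. The Bonferroni threshold is shown only as a simple point of reference; it is not the approach recommended in the text, which is a global test followed by appropriate multiple-comparison procedures.

```python
def familywise_error(n_tests, alpha=0.05):
    """Chance of at least one false positive among n independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(n_tests, alpha=0.05):
    """Per-test significance level that keeps the overall error rate at or below alpha."""
    return alpha / n_tests

print(round(familywise_error(18), 3))   # 0.603
print(round(bonferroni_alpha(18), 4))   # 0.0028
```

Under these assumptions, the chance of at least one spuriously significant comparison is about 60%, not 5%.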

A similar problem can occur in a clinical trial if too many interim analyses are done. Sometimes it is
important to analyze the data at certain stages during a trial to learn if one treatment is clearly superior or
inferior. Many trials are stopped early when an interim analysis determines that one therapy is markedly
superior to another. For instance, the Women’s Health Initiative trial on estrogen plus progestin (Writing
Group for WHI, 2002) and the study of finasteride and the development of prostate cancer (Thompson et al,
2003) were both stopped early. In the estrogen study, the conclusion was that overall health risks outweighed
the benefits, whereas finasteride was found to prevent or delay prostate cancer in a significant number of
men. In these situations, it is unethical to deny patients the superior treatment or to continue to subject them
to a risky treatment. Interim analyses should be planned as part of the design of the study, and the overall
probability of a type I error (the α level) should be adjusted to compensate for the multiple comparisons.

Migration Bias


Migration bias occurs when patients who drop out of the study are also dropped from the analysis. The
tendency to drop out of a study may be associated with the treatment (eg, its side effects), and dropping
these subjects from the analysis can make a treatment appear more or less effective than it really is.
Migration bias can also occur when patients cross over from the treatment arm to which they were assigned
to another treatment. For example, in crossover studies comparing surgical and medical treatment for
coronary artery disease, patients assigned to the medical arm of the study sometimes require subsequent
surgical treatment for their condition. In such situations, the appropriate method is to analyze the patient
according to his or her original group; this is referred to as analysis based on the intention-to-treat principle.
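
The difference between analyzing patients by assigned group and by treatment actually received can be illustrated with a few invented records. The data below are hypothetical and deliberately small; the tuple layout and function name are assumptions made for this sketch.

```python
from collections import defaultdict

# (assigned arm, arm actually received, died during follow-up); invented records.
patients = [
    ("medical", "medical", True),
    ("medical", "medical", False),
    ("medical", "surgical", False),    # crossed over to surgery
    ("medical", "surgical", False),    # crossed over to surgery
    ("surgical", "surgical", True),
    ("surgical", "surgical", False),
    ("surgical", "surgical", False),
    ("surgical", "surgical", False),
]

def death_rates(records, group_index):
    """Death rate per arm, grouping by assigned arm (index 0) or received arm (index 1)."""
    deaths, totals = defaultdict(int), defaultdict(int)
    for record in records:
        arm = record[group_index]
        totals[arm] += 1
        deaths[arm] += record[2]       # True counts as 1, False as 0
    return {arm: deaths[arm] / totals[arm] for arm in totals}

intention_to_treat = death_rates(patients, group_index=0)
as_treated = death_rates(patients, group_index=1)
print("intention to treat:", intention_to_treat)
print("as treated:        ", as_treated)
```

In this example the two arms have identical death rates when patients are analyzed as randomized, whereas grouping by treatment received makes surgery look far better, because the patients who crossed over were those who survived long enough to have the operation.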

Entry Time Bias

Entry time bias may occur when time-related variables, such as survival or time to remission, are counted
differently for different arms of the study. For example, consider a study comparing survival for patients
randomized to a surgical versus a medical treatment in a clinical trial. Patients randomized to the medical
treatment who die at any time after randomization are counted as treatment failures; the same rule must be
followed with patients randomized to surgery, even if they die prior to the time surgery is performed.
Otherwise, a bias exists in favor of the surgical treatment.

The Results Section of a Research Report

The results section of a medical report contains just that: results of (or findings from) the research directed at
questions posed in the introduction. Typically, authors present tables or graphs (or both) of quantitative data
and also report the findings in the text. Findings generally consist of both descriptive statistics (means,
standard deviations, risk ratios, etc) and results of statistical tests that were performed. Results of statistical
tests are typically given as either P values or confidence limits; authors seldom give the value of the statistical
test itself but, rather, give the P value associated with the statistical test. The two major aspects for readers
evaluating the results section are the adequacy of information and the sufficiency of the evidence to
withstand possible threats to the validity of the conclusions.

Assessing the Data Presented

Authors should provide adequate information about measurements made in the study. At a minimum, this
information should include sample sizes and either means and standard deviations for numerical measures
or proportions for nominal measures. For example, in describing the effect of sex, race, and age on estimating
percentage body fat from body mass index, Jackson and colleagues (2002) specified the independent
variables in the method section along with how they were coded for the regression analysis. They also gave
the equation of the standard error of the estimate that they used to evaluate the fit of the regression models.
In the results section, they included a table of means and standard deviations of the variables broken down
by sex and race.


In addition to presenting adequate information on the observations in the study, good medical reports use
tables and graphs appropriately. As we outlined in Chapter 3, tables and graphs should be clearly labeled so
that they can be interpreted without referring to the text of the article. Furthermore, they should be properly
constructed, using the methods illustrated in Chapter 3.

Assuring the Validity of the Data

A good results section should have the following three characteristics. First, authors of medical reports
should provide information about the baseline measures of the group (or groups) involved in the study as did
Jackson and colleagues (2002) in Table 2 of their article. Tables like this one typically give information on the
gender, age, and any important risk factors for subjects in the different groups and are especially important in
observational studies. Even with randomized studies, it is always a good idea to show that, in fact, the
randomization worked and the groups were similar. Investigators often perform statistical tests to
demonstrate the lack of significant difference on the baseline measures. If it turns out that the groups are not
equivalent, it may be possible to make adjustments for any important differences by one of the covariance
adjusting methods discussed in Chapter 10.

Second, readers should be alert for the problem of multiple comparisons in studies in which many statistical
tests are performed. Multiple comparisons can occur because a group of subjects is measured at several
points in time, for which repeated-measures analysis of variance should be used. They also occur when many
study outcomes are of interest; investigators should use multivariate procedures in these situations. In
addition, multiple comparisons result when investigators perform many subgroup comparisons, such as
between men and women, among different age groups, or between groups defined by the absence or
presence of a risk factor. Again, either multivariate methods or clearly stated a priori hypotheses are needed.
If investigators find unexpected differences that were not part of the original hypotheses, these should be
advanced as tentative conclusions only and should be the basis for further research.

Third, it is important to watch for inconsistencies between information presented in tables or graphs and
information discussed in the text. Such inconsistencies may be the result of typographic errors, but
sometimes they are signs that the authors have reanalyzed and rewritten the results or that the researchers
were not very careful in their procedures. In any case, more than one inconsistency should alert us to watch
for other problems in the study.

The Discussion & Conclusion Sections of a Research Report

The discussion and conclusion section(s) of a medical report may be one of the easier sections for clinicians
to assess. The first and most important point to watch for is consistency among comments in the discussion,
questions posed in the introduction, and data presented in the results. In addition, authors should address
the consistency or lack of same between their findings and those of other published results. Careful readers
will find that a surprisingly large number of published studies do not address the questions posed in their
introduction. A good habit is to refer to the introduction and briefly review the purpose of the study just prior
to reading the discussion and conclusion.

The second point to consider is whether the authors extrapolated beyond the data analyzed in the study. For
example, are there recommendations concerning dosage levels not included in the study? Have conclusions
been drawn that require a longer period of follow-up than that covered by the study? Have the results been
generalized to groups of patients not represented by those included in the study?

Finally, note whether the investigators point out any shortcomings of the study, especially those that affect
the conclusions drawn, and discuss research questions that have arisen from the study or that still remain
unanswered. No one is in a better position to discuss these issues than the researchers who are intimately
involved with the design and analysis of the study they have reported.

A Checklist for Reading the Literature

It is a rare article that meets all the criteria we have included in the following lists. Many articles do not even
provide enough information to make a decision about some of the items in the checklist. Nevertheless,
practitioners do not have time to read all the articles published, so they must make some choices about
which ones are most important and best presented. Bordage and Dawson (2003) developed a set of
guidelines for preparing a study and writing a research grant that contains many topics that are relevant to
reading an article as well. The companion to our text, the book by Greenberg and colleagues (2000), is
recommended for suggestions in reading the epidemiologic literature. Greenhalgh (1997b) presents a
collection of articles on various topics published in the British Medical Journal, and the Journal of the
American Medical Association has published a series of excellent articles under the general title of “Users’
Guides to the Medical Literature.”

The following checklist is fairly exhaustive, and some readers may not want to use it unless they are
reviewing an article for their own purposes or for a report. The items on the checklist are included in part as a
reminder to the reader to look for these characteristics. Its primary purpose is to help clinicians decide
whether a journal article is worth reading and, if so, what issues are important when deciding if the results
are useful. The items in italics can often be found in a structured abstract. An asterisk (*) designates items
that we believe are the most critical; these items are the ones readers should use when a less comprehensive
checklist is desired.

Reading the Structured Abstract

*A. Is the topic of the study important and worth knowing about?

*B. What is the purpose of the study? Is the focus on a difference or a relationship? The purpose should be
clearly stated; one should not have to guess.


C. What is the main outcome from the study? Does the outcome describe something measured on a
numerical scale or something counted on a categorical scale? The outcome should be clearly stated.

D. Is the population of patients relevant to your practice—can you use these results in the care of your
patients? The population in the study affects whether or not the results can be generalized.

E. If statistically significant, do the results have clinical significance as well?

Reading the Introduction

*A. What research has already been done on this topic and what outcomes were reported? The study should
add new information.

Reading the Methods

*A. Is the appropriate study design used (clinical trial, cohort, case-control, cross-sectional, meta-analysis)?

B. Does the study cover an adequate period of time? Is the follow-up period long enough?

*C. Are the criteria for inclusion and exclusion of subjects clear? How do these criteria limit the applicability
of the conclusions? The criteria also affect whether or not the results can be generalized.

*D. Were subjects randomly sampled (or randomly assigned)? Was the sampling method adequately
described?
E. Are standard measures used? Is a reference to any unusual measurement/procedure given if needed? Are
the measures reliable/replicable?

F. What other outcomes (or dependent variables) and risk factors (or independent variables) are in the study?
Are they clearly defined?

*G. Are statistical methods outlined? Are they appropriate? (The first question is easy to check; the second
may be more difficult to answer.)

*H. Is there a statement about power—the number of patients that are needed to find the desired outcome? A
statement about sample size is essential in a negative study.

I. In a clinical trial:

1. How are subjects recruited?

*2. Are subjects randomly assigned to the study groups? If not:

a. How are patients selected for the study to avoid selection biases?


b. If historical controls are used, are methods and criteria the same for the experimental group; are
cases and controls compared on prognostic factors?

*3. Is there a control group? If so, is it a good one?

4. Are appropriate therapies included?

5. Is the study blind? Double-blind? If not, should it be?

6. How is compliance ensured/evaluated?

*7. If some cases are censored, is a survival method such as Kaplan–Meier or the Cox model used?

J. In a cohort study:

*1. How are subjects recruited?

2. Are subjects randomly selected from an eligible pool?

*3. How rigorously are subjects followed? How many dropouts does the study have and who are they?

*4. If some cases are censored, is a survival method such as Kaplan–Meier curves used?

K. In a case–control study:

*1. Are subjects randomly selected from an eligible pool?

2. Is the control group a good one (bias-free)?

3. Are records reviewed independently by more than one person (thereby increasing the reliability of
data)?

L. In a cross-sectional (survey, epidemiologic) study:

1. Are the questions unbiased?

*2. Are subjects randomly selected from an eligible pool?

*3. What is the response rate?

M. In a meta-analysis:

*1. How is the literature search conducted?

2. Are the criteria for inclusion and exclusion of studies clearly stated?

*3. Is an effort made to reduce publication bias (because negative studies are less likely to be
published)?

*4. Is there information on how many studies are needed to change the conclusion?

Reading the Results

*A. Do the reported findings answer the research questions?

*B. Are actual values reported—means, standard deviations, proportions—so that the magnitude of
differences can be judged by the reader?

C. Are many P values reported, thus increasing the chance that some findings are bogus?

*D. Are groups similar on baseline measures? If not, how did investigators deal with these differences
(confounding factors)?

E. Are the graphs and tables, and their legends, easy to read and understand?

*F. If the topic is a diagnostic procedure, is information on both sensitivity and specificity (1 − false-positive rate)
given? If predictive values are given, is the dependence on prevalence emphasized?

Reading the Conclusion and Discussion

*A. Are the research questions adequately discussed?

*B. Are the conclusions justified? Do the authors extrapolate more than they should, for example, beyond the
length of time subjects were studied or to populations not included in the study?

C. Are the conclusions of the study discussed in the context of other relevant research?

D. Are shortcomings of the research addressed?

Questions 1–65 are available as an interactive quiz (see Table of Contents page).

Questions 66–70

These questions constitute a set of extended matching items. For each of the situations outlined here, select
the most appropriate statistical method to use in analyzing the data from the choices a–i that follow. Each
choice may be used more than once.

a. Independent-groups t test
b. Chi-square test
c. Wilcoxon rank sum test
d. Pearson correlation
e. Analysis of variance
f. Mantel–Haenszel chi-square
g. Multiple regression
h. Paired t test
i. Odds ratio

66. Investigating average body weight before and after a supervised exercise program
67. Investigating gender of the head of household in families of patients whose medical costs are covered by
insurance, Medicaid, or self
68. Investigating a possible association between exposure to an environmental pollutant and miscarriage
69. Investigating blood cholesterol levels in patients who follow a diet either low or moderate in fat and who
take either a drug to lower cholesterol or a placebo
70. Investigating physical functioning in patients with diabetes on the basis of demographic characteristics
and level of diabetic control

Questions 71–75

These questions constitute a set of multiple true–false items. For each of the statements, determine whether
the statement is true or false.

Table 13–12 contains the variables used by Lamas and colleagues (1992) to predict aspirin therapy before
myocardial infarction (MI). Refer to the table to answer the following questions.


Table 13–12. Logistic-Regression Model to Predict Aspirin Therapy before Infarction.

Variable                                    Odds Ratio (95% CI)

PTCA before MI                              2.66 (1.57–4.51)
Catheterization before MI                   2.22 (1.59–3.10)
Previous MI                                 1.95 (1.49–2.54)
CABG before MI                              1.80 (1.27–2.55)
White race                                  1.47 (0.97–2.23)
Randomization after 1/28/88                 1.43 (1.11–1.85)
Angina before MI                            1.22 (0.94–1.60)
Hypertension                                1.19 (0.95–1.50)
Male sex                                    1.19 (0.86–1.64)
Married status                              1.03 (0.79–1.35)
Age                                         1.02 (1.01–1.04)
Education after high school                 0.97 (0.77–1.22)
Orthopedic disease                          0.97 (0.55–1.69)
Type of hospital (academic vs community)    0.92 (0.71–1.18)
Diabetes                                    0.83 (0.62–1.08)

CI = confidence interval; PTCA = percutaneous transluminal coronary angioplasty; MI = myocardial infarction; CABG = coronary-artery
bypass grafting.

Source: Adapted and used, with permission, from Table 2 in Lamas GA, Pfeffer MA, Hamm P, Wertheimer J, Rouleau JL, Braunwald E: Do the
results of randomized clinical trials of cardiovascular drugs influence medical practice? N Engl J Med 1992;327:241–247.

71. Patients who had had a previous MI were significantly more likely to take aspirin.
72. Race was a more significant predictor of aspirin therapy than age.
73. Older patients were significantly more likely to take aspirin.
74. Diabetic patients were significantly less likely to take aspirin.
75. The type of hospital was significantly associated with aspirin use.
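When reading a table of odds ratios such as the one above, two points are useful: an odds ratio is significant at the 0.05 level only if its 95% confidence interval excludes 1, and the size of the odds ratio says nothing by itself about significance. The sketch below illustrates both points with hypothetical values that are not taken from Table 13–12. The function names are illustrative, and the P value is only an approximation that assumes the interval is symmetric on the log scale.

```python
import math
from statistics import NormalDist

def ci_excludes_one(lower, upper):
    """True if a confidence interval for an odds ratio does not contain 1."""
    return lower > 1 or upper < 1

def approx_p_value(odds_ratio, lower, upper, level=0.95):
    """Approximate two-sided P value recovered from an odds ratio and its CI.

    Assumes the CI was built as exp(log(OR) +/- z * SE) on the log scale.
    """
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(odds_ratio) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical results: (label, odds ratio, lower limit, upper limit)
hypothetical = [
    ("large OR, wide CI", 1.50, 0.95, 2.37),
    ("small OR, narrow CI", 1.05, 1.02, 1.08),
]
for label, or_value, low, high in hypothetical:
    print(label, ci_excludes_one(low, high),
          round(approx_p_value(or_value, low, high), 3))
```

In this hypothetical pair, the larger odds ratio is not significant because its interval includes 1, whereas the smaller odds ratio is significant because its interval is narrow and lies entirely above 1.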

