
12PM403 9/1/01 4:02 pm Page 69

Palliative Medicine 2001; 15: 69–75

Issues in research

Describing the subjects in a study


RM Pickering Lecturer in Medical Statistics, Medical Statistics Group, University of Southampton

Address for correspondence: Dr RM Pickering, Medical Statistics Group, Health Care Research Unit, South Academic Block, Southampton General Hospital, Tremona Road, Southampton SO16 6YD, UK. E-mail: rmp@soton.ac.uk

© Arnold 2001 0269–2163(01)PM403OA

Introduction

Most articles reporting the analysis of data will at some point use statistics to describe background or socio-demographic features of the subjects included in the study. An important reason for doing this is to give the reader some idea of the extent to which the findings from a study can be generalized to their own local situation. Sometimes it is appropriate to report the findings from a study entirely in descriptive terms as well.

The production of descriptive statistics is a relatively straightforward matter: most statistical packages produce all the statistics one could possibly desire, and usually more. Some choice has to be made concerning the most pertinent statistics to present in a particular situation, and these then have to be included in the paper in a manner that is easy for the reader to assimilate. There may be constraints on the amount of space available, and it is in any case a good idea to make statistical display as concise as possible.

The purpose of this article is to provide a brief review of widely used descriptive statistics, and to give some guidance as to how they may be incorporated into text or tables in the context of some examples from Palliative Medicine.

Describing data

Descriptive statistics aim to give an idea of the variability seen in a measurement taken across a group of subjects: another name for this variability is the distribution of values. A more complete impression of variability can be conveyed by a graph, but this takes up considerable space and so key features of the distribution are presented using only one or two summary statistics. The two features that are usually felt to be of most interest are where the centre of the distribution lies and how spread out it is. The centre of a distribution is usually given as the mean or the median, while the spread of the distribution may be expressed as the standard deviation (SD), the range or the inter-quartile range (IQR). Definitions and properties of these techniques are given in standard statistics books.¹

Figure 1a shows a typical symmetric distribution for a quantitative variable. The mean might be used here to describe where the centre of the distribution lies and the standard deviation to give an idea of its spread around the central value. Standard deviations are particularly useful where the distribution is approximately symmetrical around its mean and follows a bell-shaped (Normal) distribution, in which case around 95% of subjects will be included within the range 2SD below and above the mean. Another reason for presenting standard deviations is that they are required in calculations of sample size, and descriptive statistics for a particular variable may be used from published reports to plan future studies.
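As a minimal sketch of the summary statistics described above, the following Python fragment computes the mean, median, SD, range and IQR for a small set of ages. The ages are invented for illustration and do not come from any study cited here, and the function name `describe` is likewise hypothetical. Python's standard `statistics` module is assumed; other packages may use slightly different quartile conventions, so the IQR limits can differ a little between tools.

```python
import statistics


def describe(values):
    """Return centre and spread summaries for a list of numbers."""
    ordered = sorted(values)
    # Quartile cut points; the quartile convention is the module default.
    q1, _, q3 = statistics.quantiles(ordered, n=4)
    mean = statistics.mean(ordered)
    sd = statistics.stdev(ordered)  # sample SD (n - 1 denominator)
    # Count the values lying within 2SD below and above the mean.
    within_2sd = sum(mean - 2 * sd <= v <= mean + 2 * sd for v in ordered)
    return {
        "n": len(ordered),
        "mean": mean,
        "median": statistics.median(ordered),
        "sd": sd,
        "range": (ordered[0], ordered[-1]),
        "iqr": (q1, q3),
        "share_within_2sd": within_2sd / len(ordered),
    }


# Hypothetical ages in years, invented for illustration only.
ages = [42, 51, 55, 58, 60, 62, 63, 64, 65, 66, 68, 70, 72, 75, 78, 83]
summary = describe(ages)
```

With a roughly symmetrical sample such as this one, the mean and median agree closely and most values fall within 2SD of the mean, in line with the rule of thumb for a Normal distribution.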

Downloaded from pmj.sagepub.com at The University of Iowa Libraries on May 30, 2015

Figure 1 Centre and spread of a distribution

There is an obvious interest in presenting the minimum and maximum values obtained in a sample, that is the range of values. In describing the spread of subjects’ ages, for example, knowing that the study included people aged between 42 and 83 years is more immediately helpful than knowing that the standard deviation of age was 10.8. If data values have been incorrectly recorded, there may be impossibly high or low values and examination of the range can give the first indication that there are errors in the data.

When a distribution is skewed (Figure 1b) things are more complicated. The mean will be overly influenced by the tail of values to the right, and just one or two extreme values (or ‘outliers’) may have the effect of pulling the mean substantially to the right of what seems visually to be the centre of the distribution. An alternative is to present the median, which is defined as having 50% of values falling above it and 50% below. Figure 1b suggests that the value for which the curve is at its highest may also be useful in locating the centre of the distribution, but in practice this can prove awkward: there may be no single most frequent value. The median is often recommended as the preferred statistic for locating the centre of the distribution when data are skewed, but sometimes the mean can be more helpful. If the attribute being displayed takes only a limited number of values, such as a scale comprising a sum of subitems, it can happen that the majority of subjects take the same value. In such a situation the medians of two groups may fall in the same category even though differences in the tails of the distribution suggest that, overall, one group takes higher values than the other. The mean would be sensitive to such differences in distribution. Another example where the mean might be chosen to locate the centre of a skewed distribution is where an economic evaluation is planned, for example in comparing lengths of stay between groups. If unit cost (assumed equal over days) is applied to a reduction in mean length of stay, then cost savings can be estimated per given number of subjects.

It is also more difficult to decide which measure of spread to use in the context of a skewed distribution. It is clear from Figure 1b that no single number can adequately describe the spread of data around the central value, because there is greater spread to the right than to the left. One straightforward indication of spread that could be used is the range, the minimum to maximum values obtained in the study. Another possibility is the inter-quartile range, or IQR, which covers the central 50% of the distribution. One disadvantage of using the IQR is that it is not as widely known as other statistics, but many statistical packages now produce it. Standard deviations may be presented when variables follow a distribution which is skewed, and again can form the basis of approximate power calculations for future studies.

There are statistics that emphasize other aspects of a distribution, such as the degree of skew, and
these may be presented if they are particularly relevant to the research question. Generally, however, statistics describing the distribution of quantitative variables tend to be limited to measures of central tendency and spread.

Descriptive statistics in text

Sometimes only one or two characteristics may be important, or available, in describing the characteristics of subjects in a study, in which case descriptive statistics can be included in the text, for example:

    The sample comprised 33 (52%) men, and 30 (48%) women. Ages ranged from 55 to 80 years, with a mean age of 71 years. Figure 1 depicts the marital status and Figure 2 shows carer status of those in the sample.²

An alternative here would be to include statistics describing the age and gender distributions along with frequencies for marital and carer status in a table.

Descriptive statistics in tables

Often there are too many characteristics to be conveniently described in text, or the need to compare several subgroups of subjects makes a tabular presentation more convenient. An example of such a presentation is shown in Table 1, which summarizes the distribution of six categorical variables and two quantitative variables in a sample whose size is stated in the title to be 36.³ The categorical variables, so called because they represent a characteristic that can be one of several types or categories, are best described by presenting the number (and then percentage) for each category. As the categorical variables are in the majority in Table 1, the title indicates that numbers (%) are being presented unless indicated otherwise. It is best to give numbers first and then percentages, unless the study is very large, to emphasize the numbers available. For example, only four subjects in the study described in Table 1 had no children, and primarily to state the frequency as a percentage, 11%, gives a false sense of precision (the 95% confidence interval around the estimated percentage lies from 4% to 26%). Where a categorical variable represents only two types it is possible to present either numbers (%) for each one, as is done for males and females in Table 1, or just the number (%) where a symptom, for example, is present.

Table 1 Characteristics of outpatients in three palliative care settings in Wales, number (%), unless otherwise stated (n=36)³

Site
    Holme Tower                  15 (42%)
    Velindre                     14 (39%)
    Llandough                     7 (19%)
Sex
    Male                         17 (47%)
    Female                       19 (53%)
Marital status
    Married                      26 (72%)
    Widowed                       4 (11%)
    Divorced                      3 (8%)
    Single                        2 (6%)
    Unknown                       1 (3%)
Number of children
    None                          4 (11%)
    1                             6 (17%)
    2–4                          15 (41%)
    >4                            5 (14%)
    Unknown                       6 (17%)
Living arrangement
    Live alone at home            9 (25%)
    Live with family             24 (67%)
    Unknown                       3 (8%)
Primary cancer
    Breast cancer                11 (31%)
    Prostate cancer               7 (19%)
    Gastrointestinal cancer       4 (11%)
    Gynaecological cancer         3 (8%)
    Lung cancer                   2 (6%)
    Head and neck cancer          2 (6%)
    Others                        6 (16%)
    Missing                       1 (3%)
Age (years)
    Mean (SD)                    64 (10.81)
    Range                        42–83
Months since diagnosis
    Mean (SD)                    61 (55.7)
    Range                        2–228

There are some subjects in the study described in Table 1 where information on a categorical variable is unknown, and the authors have treated this as a category in its own right. Alternatively, the percentages could have been calculated only amongst cases where the information was known, in which case the numbers could have been presented over
the denominator of cases available, for example 4/30 (13%) of subjects with no children. A footnote to the revised table could have expanded on the missing information. Particularly where a high percentage of cases have unknown values, it may be best to leave ‘unknown’ as a category in the table, as done in Table 1.

The two quantitative variables in Table 1 are described by mean (SD) and range: the statistics being presented are indicated in the left-hand column of the table. Were a table to include the same descriptive statistics for several quantitative variables, then the statistics being used could be indicated in the column headings or the title. Different statistics for each quantitative variable can easily be incorporated if the table is constructed as Table 1, although the reader might wonder why this had been done. If there are subjects who have been excluded from the calculation of descriptive statistics for a quantitative variable because of missing information, this can be clarified by including an additional row under Mean (SD) and Range, labelled ‘n’ which gives the number of subjects for whom information is available.

In the distributions described in Table 1, the age distribution would appear to be symmetrical with its mean falling close to the centre of its range and its mean ± 2SD extending close to the limits of the range, while the distribution of months since diagnosis is clearly skewed to the right. Mean months since diagnosis is much closer to the lower than the upper extreme of its range, and its mean minus 2SD extends to negative values. Distributions of lengths of time often follow this pattern.

A choice arises when describing the distribution of an ordered variable taking a limited number of values, such as a scale constructed from subitems or, for example, number of children in Table 1. Such variables can be described either in terms of central value and a measure of spread, or as a categorical variable, by presenting numbers (%) for frequently occurring categories and combining categories with lower frequencies. The latter approach was chosen for number of children in Table 1. Sometimes this type of variable may be grouped into only two categories, if, for example, scores above a particular value are conventionally accepted as indicating an ‘abnormal’ state. Some scales usually extend from the minimum to the maximum values that are theoretically available even in quite small samples, and stating the range alone is not helpful in distinguishing between the distributions in different situations. Alternative ways of describing the distribution of a variable put different emphases on the data and will be more or less illuminating in different circumstances.

When scales constructed from subitems are being described, it is helpful to be able to interpret the findings in the light of the minimum and maximum values that the scale can theoretically assume and the clinical meaning of these extremes. This type of information often appears in the Methods section of a paper, where the use of the tool is described, but it is useful to state the potential range either as a footnote or along with the name of the scale in any tables where it appears.

Graphical displays

A picture tells a thousand words – and displaying a distribution in graphical form rather than by summary statistics alone does convey a more complete impression of variability in data. Figure 2 displays the distribution of Karnofsky status from subjects entered into a randomized controlled trial. In the original paper⁴ Karnofsky status was presented as number (%) for each value in a table of socio-demographic and medical characteristics. Figure 2a displays the same information as a bar chart, from which a tendency for the distribution to be skewed to the left is apparent, which was not so obvious from the original tabular presentation. In Figure 2b the distribution is shown in the form of a box and whisker plot. The central horizontal line indicates the position of the median of the distribution; the ‘box’ covers the IQR, that is the central 50% of the distribution; while the ‘whiskers’ extend to the minimum and maximum values recorded in the study. The skew to the left is also apparent in Figure 2b. Were there any degree of bi-modality (two peaks) in the distribution, this would be evident from the bar chart but not from the box and whisker plot. Displaying a distribution graphically takes up more space than summary statistics and so is better reserved for the main findings from a study. In particular, more space might be devoted to describe a difference in a primary outcome variable between two groups than in describing the distribution of baseline characteristics.


Figure 2 Distribution of Karnofsky status (n = 363) in a randomized controlled trial comparing comprehensive palliative care to conventional care⁴
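A box and whisker plot of the kind shown in Figure 2b is drawn from five numbers: the minimum, lower quartile, median, upper quartile and maximum. The sketch below computes them for a set of hypothetical scores on a 0 to 100 scale in steps of 10. The data are invented and are not those of the trial, and the quartile convention (`method="inclusive"`) is an assumption, since packages differ.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


# Hypothetical scores, invented for illustration, with a tail to the left.
scores = [40, 50, 60, 60, 70, 70, 70, 80, 80, 80, 80, 90, 90, 90, 90]
low, q1, median, q3, high = five_number_summary(scores)

# The lower whisker reaches further from the median than the upper one.
lower_reach = median - low
upper_reach = high - median
```

A lower whisker that reaches further than the upper one is what signals the skew to the left. Because only five numbers are retained, two peaks in the data could go unnoticed, which is why bi-modality is better judged from a bar chart.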

Describing loss of subjects to a study

In order to assess the generalizability of results it is useful to see how the subjects included in the eventual analysis relate to the patient base from which they were drawn. The means by which subjects were selected is clearly important and should be stated in the Methods section of a paper. The impact of eligibility criteria and refusal rates in reducing the number of available cases could usefully be shown, and there may be further loss of subjects during the study. This can be described in text, as in this excerpt, which describes refusal rates to a needs assessment study in chronic obstructive airways disease:

    One-hundred-and-fifty-one people with COAD were contacted and asked to take part in the study. Sixty-three (42%) individuals agreed to participate and were interviewed. Five (3%) individuals were too ill to participate and three (2%) had recently died. Eighty (53%) declined over the telephone or did not respond to the letter.²

Table 2 Study recruitment and attrition rates to a study comparing patient and proxy symptom assessments in advanced cancer patients⁵

Patient attrition rates                        Frequency
Patients consecutively sampled                 121
Patient ineligibility
    Unableᵃ                                     28
    Deceased                                     4
    Other                                        3
    Refused                                      1
Patients entered                                85
Patient attrition
    Unableᵃ                                      9
    Discharge                                    5
    Death                                        4
    Not completed in specified time frame       15
    Otherᵇ                                       3
Patients completed                              49

ᵃ Patient unable to participate due to impaired cognitive functioning, language barrier or deterioration in condition.
ᵇ Assessments could not be completed by all three raters (n=3).

Only two types of refusal and recent death were relevant and this was easy to explain in text. When greater numbers of subjects are lost to the study or the patterns or reasons for loss are more complicated it is easier to assimilate the information if shown in tabular format or as a flow chart. Table 2 starts by stating the total number of patients admitted to an acute unit during the recruitment period of a study⁵ comparing methods of assessing symptoms, and then details the impact of eligibility criteria in reducing the numbers available. Amongst patients entered into the trial there was a further loss of subjects. It is important to recognize that the eligibility criteria resulted in 85/121 (70%) of the
total patient base being entered, and that some findings relate to only 49/121 (40%). Examining the reasons stated in Table 2, it would seem likely that the loss of subjects resulted in analyses being performed on a more favourable subgroup, and this may typically be the case.

Guidelines concerning the information that should be included when reporting randomized controlled trials⁶ have recommended that the progress of patients through a trial should be presented as a flow chart. These charts are so obviously helpful that they are being included more widely in an increasing variety of studies. For example, Figure 3 shows the information contained in Table 2 as a flow chart. The chart may take up more space than the tabular presentation but it is easier to see at a glance how subjects are lost. The chart can easily be adapted to situations in which subjects are lost at more than two stages or from several sub-groups.

Figure 3 Study recruitment and attrition rates to a study comparing patient and proxy symptom assessments in advanced cancer patients⁵

In addition to the complete loss of subjects expressed by attrition rates, there may be information on specific variables that is unavailable as, for example, shown in Table 1, and this will further reduce the number of subjects available for analysis in which these variables feature. This is particularly problematic where many factors are being controlled statistically. Although individual factors may only be missing in a few cases, the numbers mount up so
that there may be a substantial reduction in numbers available when all factors are taken into account. It may be that several analyses are being reported, all based on different numbers of cases depending on which variables are involved. The reader should be able to establish how many cases each reported analysis was based on. As with subjects who are excluded altogether from a study, those for whom information is unavailable on a sporadic basis often comprise a less favourable subgroup.

Comparing baseline characteristics between groups

In reports of randomized trials, tables describing baseline characteristics in each arm of the trial have the additional role of demonstrating whether or not randomization has been successful in producing similar groups, and thus whether treatment comparisons are likely to be confounded by other differences. Statistical tests of significance should not be used to decide whether such differences need to be taken into account.¹ If treatment allocation is properly randomized we know that any difference in baseline characteristics must be due to chance, and that is the only question a significance test is addressing. The question facing the researcher is whether or not the magnitude of any difference is sufficient to confound treatment comparisons, and this depends on the strength of the relationship between the characteristic in question and the outcome, as well as the difference it exhibits between treatment groups. A statistical test for baseline differences does not address this question, and there may be insufficient numbers available to achieve significance even when taking a baseline characteristic into account. This may materially alter the conclusions drawn. It is better to use descriptive statistics to judge whether there are any clinically important differences in baseline characteristics, and if so compare outcome in controlled analyses, or alternatively make allowance for known prognostic factors irrespective of the differences they exhibit at baseline. The same issues arise when the groups being compared are not based on random allocation, only in this case there are more likely to be other differences confounding any comparison. Statistical significance tests are used to address the primary objective of the study, such as whether or not an outcome differs depending on a subject characteristic, or a treatment received, not to establish whether or not a factor can safely be ignored when other comparisons are being drawn.

Conclusions

Describing the subjects in a study is the first step in most papers reporting analysis of data. It is important in establishing the generalizability of research findings and, in the context of comparative studies, flags the need for comparisons to be made taking account of differences between groups. The objective is to describe the main features of the distribution, while alerting the reader to problematic or unusual features of the data. Usually the space available in a paper constrains the presentation of several alternative statistics for location or spread and, in any case, too many statistics can confuse rather than offer further insight. The attrition of subjects during a study should also be described and gives a more direct method of relating study subjects to the patient base from which they were drawn.

References

1 Altman DG. Practical statistics for medical research. London: Chapman & Hall, 1991.
2 Skilbeck J, Mott L, Page H, Smith D, Hjelmeland-Ahmedzai S, Clark D. Palliative care in chronic obstructive airway disease: a needs assessment. Palliat Med 1998; 12: 245–54.
3 Pratheepawanit N, Salek MS, Finlay IG. The applicability of quality-of-life assessment in palliative care: comparing two quality-of-life measures. Palliat Med 1999; 13: 325–34.
4 Jordhøy MS, Kaasa S, Fayers P, Øvreness T, Underland G, Ahlner-Elmqvist M. Challenges in palliative care research; recruitment, attrition and compliance: experience from a randomized controlled trial. Palliat Med 1999; 13: 299–310.
5 Nekolaichuk CL, Bruera E, Spachynski K, MacEachern T, Hanson J, Maguire TO. A comparison of patient and proxy symptom assessments in advanced cancer patients. Palliat Med 1999; 13: 311–23.
6 Altman DG. Better reporting of randomised controlled trials: the CONSORT statement. Br Med J 1996; 313: 570–71.
