International Journal of Epidemiology, Vol. 26, No. 5. © International Epidemiological Association 1997.

A Review of Data-Derived Methods for Assigning Causes of Death from Verbal Autopsy Data

BARNABY C REEVES* AND MARIA QUIGLEY†

* Department of Social Medicine, University of Bristol, Canynge Hall, Whiteladies Road, Bristol BS8 2PR, UK.
† Department of Epidemiology & Population Sciences, London School of Hygiene and Tropical Medicine, Keppel Street, London WC1E 7HT, UK.

Reeves B C (Department of Social Medicine, University of Bristol, Canynge Hall, Whiteladies Road, Bristol BS8 2PR, UK) and Quigley M. A review of data-derived methods for assigning causes of death from verbal autopsy data. International Journal of Epidemiology 1997; 26: 1080–1089.

Background. Verbal autopsy (VA) is an indirect method for estimating cause-specific mortality. In most previous studies, cause of death has been assigned from verbal autopsy data using expert algorithms or by physician review. Both of these methods may have poor validity. In addition, physician review is time consuming and has to be carried out by doctors. A range of methods exists for deriving classification rules from data. Such rules are quick and simple to apply and in many situations perform as well as experts.

Methods. This paper has two aims. First, it considers the advantages and disadvantages of the three main methods for deriving classification rules empirically: (a) linear and other discriminant techniques, (b) probability density estimation and (c) decision trees and rule-based methods. Second, it reviews the factors which need to be taken into account when choosing a classification method for assigning cause of death from VA data.

Results. Four main factors influence the choice of classification method: (a) the purpose for which a classifier is being developed, (b) the number of validated causes of death assigned to each case, (c) the characteristics of the VA data and (d) the need for a classifier to be comprehensible. When the objective is to estimate mortality from a single cause of death, logistic regression should be used. When the objective is to determine patterns of mortality, the choice of method will depend on the above factors in ways which are elaborated in the paper.

Conclusion. Choice of classification method for assigning cause of death needs to be considered when designing a VA validation study. Comparison of the performance of classifiers derived using different methods requires a large VA dataset, which is not currently available.

Keywords: verbal autopsy, cause of death, data-derived, classification, validation, expert algorithm
Verbal autopsy (VA) is an indirect method for
estimating cause-specific mortality. The method uses
information obtained from a close relative or caretaker
of a deceased person about the circumstances, symp-
toms and signs during the terminal illness to assign a
cause of death (CoD). In the mid-1970s, the World Health
Organization (WHO) recommended lay-reporting of
information about health issues by people without
formal medical training and subsequently published a
suggested death record which was probably the first formal VA questionnaire.1

A recent review2 examined methodological issues
affecting the use of VA to estimate cause-specific
mortality rates. It highlighted the importance of having
a validated CoD in order to evaluate the performance of
VA, for example in terms of sensitivity, specificity and
predictive power. It also identified choice of method for
assigning CoD using VA data as an important issue, and
contrasted physician review, expert algorithms and
data-derived (statistical) methods. However, it did not
consider in detail the advantages and disadvantages of
different methods, nor did it discuss the range of data-
derived methods that exist.3,4
Physician review, which assigns CoD on the basis of
specialist medical expertise, has face validity because
of its similarity to medical history taking. However,
it is unlikely to be feasible if VA is widely adopted
because it is time-consuming and requires doctors, who
are a scarce and expensive resource, to do it. Moreover,
physician review may not be the most valid method of
assigning CoD.5–7 There are also likely to be inconsistencies within reviewers over time, between reviewers within a study, and between reviewers across studies due, for example, to differing expertise and biases.8
Expert algorithms can be considered to represent a
consensus of physician reviewers. They overcome the
inconsistency and the time-consuming nature of physician review, but are still subject to the criticism that they may not be valid.6,9,10 For example, there is a tendency to include an essential sign in an expert algorithm for a disease, irrespective of whether or not the sign discriminates between different diseases; fever is an essential sign for a diagnosis of malaria in a child,11 but it discriminates poorly between malaria and other diagnoses because it is also a common sign in other conditions.6
The objectives of this paper are threefold: first, to
identify and describe the main data-derived classifica-
tion methods that are available for assigning CoD on
the basis of VA data; second, to consider factors which
may influence the choice of method; third, to discuss
issues concerned with the evaluation of classifiers. The
paper will not compare the performance of different
data-derived methods of classification because there are
no VA datasets of sufficiently large size to permit a
meaningful comparison.2,6
A review of classification methods is important
and timely, because: (i) In several fields, data-derived
methods have been shown to perform at least as well as
physician review or expert algorithms.11–14 (ii) Data-
derived methods are relatively cheap, quick and simple
compared to physician review. (iii) The choice of
method for assigning CoD may have implications for
the design of a VA project. Several large studies are cur-
rently being conducted in different parts of the world
under the co-ordination of the WHO, and consideration
of the way in which CoD will be assigned may influence
decisions about their development. (iv) The applicabil-
ity of these methods to a wide range of other problems
in medicine3,11–16 is a further reason for examining their
respective advantages and disadvantages.
TYPES OF CLASSIFICATION METHOD
The term 'classification method' is used to describe a particular analytical technique, and the term 'classifier' to describe a specific classification rule which has been derived from empirical data using one or other classification method. Classification methods refer to classes
and attributes; in the context of VA, classes are the
validated CoD and attributes are signs, symptoms and
other data about the deceased which are collected using
the VA questionnaire. Observations are located in
multi-dimensional attribute space, their position being
determined by their particular scores on each attribute
(Figure 1). The prior class probabilities for an obser-
vation correspond to the proportions of observations in
the population of interest which belong to each class;
the posterior class probabilities are the conditional
probabilities of an observation belonging to each class, given its attributes.

FIGURE 1 Graphical illustration of (i) linear discriminant, (ii) probability density estimation and (iii) decision tree classification methods. Data points represent duration of illness and age for 156 childhood deaths from malaria (n = 78), measles (n = 39) and malnutrition (n = 39). The lines in (i) represent the discriminant functions; a new case, a 23-month-old child dying after 12 days of illness, falls in the measles sector. In (ii), the new case is classified as dying from malnutrition, since this is the commonest cause of death amongst the nearest five neighbours. In (iii) the case is also classified as dying from malnutrition, since the child was older than 18 months (age cutoff criterion) and had been ill for more than 8 days prior to death (duration of illness cutoff criterion)
The process of deriving a classifier requires a dataset
containing attributes and validated class membership
for a number of observations. The aim of all classifica-
tion methods is to partition attribute space into regions
populated only (or predominantly) by a single class of observation.i There are three main classification methods: (a) linear and other discrimination techniques, (b) probability density estimation, and (c) decision-trees and rule-based methods. These methods are illustrated graphically in Figure 1 for a subset of a VA dataset;5 two variables, age at death and duration of illness, are shown for all cases validated as having died from measles, malnutrition and malaria.ii In addition,
because of the syndromic nature of some disease
definitions, we also consider how combinations of
attributes can be used to discriminate between CoD.

i The description 'partitioning attribute space' focuses on the role of classifiers in making a definitive decision, i.e. assigning a single class to an observation. Classifiers can also be used probabilistically, i.e. to rank order possible class membership for an observation according to the posterior class probabilities.

ii These data are used only for illustration. Extracting a subset of cases invalidates the classifiers described for any real-life population.
In the simplest case, with only two attributes and two
classes, linear discrimination divides the two dimen-
sional attribute space by fitting a line which maximizes
the separation of clusters of data points for each class
(by either least squares17 or maximum likelihood methods18). The principle can be extended to more than
two classes by fitting additional lines, and to more
than two attributes by fitting planes or hyperplanes
rather than lines to partition attribute space. Quadratic
and logistic discriminant functions can also be used.
When only two classes are considered, logistic regres-
sion is equivalent to a logistic discriminant function18
and, if interactions are allowed, to a quadratic discrimin-
ant function. Discriminant methods assume that mem-
bers of a class form a single cluster in attribute space;
the region of attribute space which corresponds to a
class must therefore be spatially continuous.
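By way of illustration, a minimal sketch of a linear discriminant classifier is given below. It assumes a present-day Python environment with scikit-learn (not one of the packages listed in the Appendix), and the attribute values are invented rather than taken from the Kenya VA dataset.

```python
# A minimal sketch of linear discriminant classification, assuming
# scikit-learn; attribute values are invented for illustration.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Attributes: age at death (months), duration of illness (days).
X = np.array([[10, 3], [14, 4], [30, 10], [25, 12], [28, 20], [36, 25]])
y = np.array(["malaria", "malaria", "measles", "measles",
              "malnutrition", "malnutrition"])

lda = LinearDiscriminantAnalysis().fit(X, y)

# A new case: a 23-month-old child who was ill for 12 days.
new_case = np.array([[23, 12]])
print(lda.predict(new_case))        # assigned CoD
print(lda.predict_proba(new_case))  # posterior class probabilities
```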
Local probability density methods are exemplified by
the nearest-neighbour procedure, which assumes that
a new case is likely to be located near to other cases of
the same class in attribute space. Posterior class
probabilities are calculated for each location in attribute
space, by examining the attributes and classes of the
nearest n cases.iii New cases are assigned to classes
according to the class probabilities for their locations in
attribute space. Unlike discriminant methods, there is
no requirement that all regions of attribute space in
which one class has the highest posterior class prob-
ability should be continuous; however, if members of a
class form multiple clusters, there must be a threshold
number of observations in each cluster for the method
to be able to identify the cluster. Supervised learning neural networks19 are also usually considered to be probability density methods.

iii The number of nearest neighbours with known class membership against which a new case is compared can be varied systematically, and the optimal value chosen by cross-validation methods.
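A nearest-neighbour classifier of this kind can be sketched as follows, again assuming scikit-learn and invented attribute values; with n_neighbors=5, a new case is assigned the commonest validated CoD amongst its five nearest neighbours, as in Figure 1(ii).

```python
# A minimal k-nearest-neighbour sketch, assuming scikit-learn;
# the attribute values are invented for illustration.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[10, 3], [12, 5], [14, 4], [30, 10], [25, 12],
              [28, 20], [36, 25], [22, 14], [20, 16], [26, 18]])
y = np.array(["malaria"] * 3 + ["measles"] * 2 + ["malnutrition"] * 5)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

new_case = np.array([[23, 12]])
print(knn.predict(new_case))        # commonest class amongst 5 neighbours
print(knn.predict_proba(new_case))  # empirical posterior probabilities
```

In practice the number of neighbours would be chosen by cross-validation (footnote iii), and attributes measured on different scales would first be standardized so that no single attribute dominates the distance calculation.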
Decision-tree and rule-based methods partition attri-
bute space into successively smaller regions, according
to attribute-defined criteria.20,21 Potential splits in
the data are inspected, based on each attribute in turn.
(Continuous attributes are dichotomized, and splits are
inspected for all possible cutpoints.) The best split is
chosen using maximum likelihood methods,iv and two
subsets of observations (branches in the decision tree)
are created. Splitting continues until branches become
pure (i.e. include observations belonging to only one
class), branches cannot be made less impure, or until
some stopping criterion (e.g. the number of obser-
vations in a branch) is reached.v As with probability density methods, there is no requirement that all regions of attribute space in which one class has the highest conditional probability should be continuous.

iv Criteria other than maximum likelihood ratio have been used to choose the optimal split for each branch in the tree.

v Although decision trees can be derived which give almost perfect classification of the training observations, such trees usually predict the class membership of new observations poorly (see Evaluating the Performance of Classifiers). Peripheral branches must be pruned to improve prediction.20,21
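The following sketch grows a small decision tree, assuming scikit-learn and invented data. Note one substitution: scikit-learn's entropy criterion is an information-based stand-in for the maximum likelihood criterion described above, and min_samples_leaf is a crude stopping rule standing in for the pruning discussed in footnote v.

```python
# A minimal decision-tree sketch, assuming scikit-learn; invented data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.array([[10, 3], [12, 5], [14, 4], [30, 10], [25, 12],
              [28, 20], [36, 25], [22, 14], [20, 16], [26, 18]])
y = np.array(["malaria"] * 3 + ["measles"] * 2 + ["malnutrition"] * 5)

# criterion="entropy" is an information-based splitting criterion;
# min_samples_leaf is a simple stopping rule (see footnote v on pruning).
tree = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=2).fit(X, y)

# Each split is a dichotomous rule on a single attribute cutpoint.
print(export_text(tree, feature_names=["age_months", "illness_days"]))
print(tree.predict([[23, 12]]))
```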
Combinations of variables are the basis of almost all published VA classifiers, irrespective of whether they have been derived by experts9,10,22 or by logistic regression.6 Logistic regression and other discriminant classifiers represent combinations of attributes using 'and' operators (weighted by their respective coefficients; these methods can represent 'or' combinations by allowing interactions).vi Expert algorithms are not constrained in this way, although only Mobley et al.10 have reported sensitivities and specificities for other combinations, e.g. 'or' operators, or combinations of 'and' and 'or' operators. Examples of expert algorithms which have used combinations of 'and' and 'or' operators are given in Table 1.

vi Discriminant methods can give rise to 'or' combinations in another way; the terms in a logistic or discriminant equation may, by chance, sum to the same total for combinations of different values of predictor variables and their weights (for example, see Quigley et al.,6 Table 4).
Combining variables in expert algorithms makes use of two principles: (i) Medical syndromes and disease entities are frequently defined in terms of combinations of signs and symptoms. (ii) 'Or' combinations increase the sensitivity of a classifier, while 'and' combinations increase specificity; in principle, a classification rule can therefore be built by combining all relevant attributes for a disease with 'or' operators, and then combining these with additional attributes which have high specificity by 'and' operators.
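The effect of the two operators can be checked with a small worked example; the two signs, their values and the validated CoD below are all invented for illustration.

```python
# A toy demonstration that 'or' combinations raise sensitivity while
# 'and' combinations raise specificity; all data are invented.
def sens_spec(predictions, truths):
    tp = sum(p and t for p, t in zip(predictions, truths))
    tn = sum(not p and not t for p, t in zip(predictions, truths))
    return tp / sum(truths), tn / sum(not t for t in truths)

sign1 = [True, True, False, True, False, False, False, True]
sign2 = [True, False, True, False, False, True, False, True]
died_of_target = [True, True, True, False, False, False, True, False]

or_rule = [a or b for a, b in zip(sign1, sign2)]
and_rule = [a and b for a, b in zip(sign1, sign2)]

print(sens_spec(or_rule, died_of_target))   # (0.75, 0.25): more sensitive
print(sens_spec(and_rule, died_of_target))  # (0.25, 0.75): more specific
```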
Description of signs and symptoms as 'essential', 'confirmatory' or 'supportive'9,10,22,23
acknowledges that
some are not always present, and that they provide
differing strengths of evidence for a diagnosis. It is
therefore surprising that there has not been greater
use of or combinations of variables in VA expert
algorithms.
Ways of combining raw data need not be limited to
Boolean operators. It might make good medical sense,
for example, to create a combined variable which rep-
resented 'essential sign and at least two of four supportive signs'. It is legitimate to choose combinations
by post hoc inspection of the data, providing that the
performance of the combination, or any classifier which
uses the combination, is subsequently evaluated on new
data.
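A combined variable of the kind just described can simply be computed as a new attribute before any classifier is derived; the sign names in this sketch are hypothetical.

```python
# Constructing a combined attribute 'essential sign present and at least
# two of four supportive signs present'; sign names are hypothetical.
def combined_attribute(case):
    supportive = ["sign_a", "sign_b", "sign_c", "sign_d"]
    n_supportive = sum(case[s] for s in supportive)
    return case["essential_sign"] and n_supportive >= 2

case = {"essential_sign": True, "sign_a": True, "sign_b": False,
        "sign_c": True, "sign_d": False}
print(combined_attribute(case))  # True: essential sign plus two supportive
```

The new attribute can then be offered to any of the classification methods above alongside the raw signs.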
In the case of expert algorithms, the combination of
variables chosen is the classifier. However, combined
variables can also be used as new attributes when de-
riving a classifier. Decision tree methods, particularly,
can perform much better if variables are combined in
meaningful ways, and using combined variables can
help to produce comprehensible rules.20,21 This is be-
cause it is difficult for a decision tree classifier to encode
some simple arithmetic transformations, e.g. a ratio or a
difference, in a rule-based structure. Conversely, it is
difficult for statistical classifiers to encode some of the
rule-based structures which form the basis of decision
trees. Neither method can easily encode a combination
such as the example described above (i.e. 'essential sign and at least two of four supportive signs'). Therefore, if
there are good knowledge-based reasons for creating
new variables from combinations of others, they should
be investigated with all classification methods.
FACTORS INFLUENCING CHOICE OF
CLASSIFICATION METHOD
There are four main factors which influence the choice
of classification method: (i) the purpose for which a
classifier is being developed, (ii) the number of
validated CoD which can be assigned to each case, (iii)
characteristics of VA data, (iv) need for a classifier to
be comprehensible.
Figure 2 summarizes the ways in which these factors
influence the choice of classification method.
The Purpose for which a Classifier is Developed
Three roles for VA studies have been identified: (i) to
describe patterns of mortality, to inform health policy
and allocation of resources, (ii) to compare patterns of
mortality (or mortality from a single cause) over time,
between geographical areas or between control and
intervention groups in a trial, (iii) to help formulate
effective interventions to prevent deaths from common
causes in the future.24
With respect to choice of classification method, the
key distinction is whether a classifier is required to
estimate mortality for several causes or for a single CoD.
Role (i) requires a classifier to estimate mortality for
several CoD, but roles (ii) and (iii) might involve either
patterns of mortality or a single CoD. This distinction is
important because, although all classification methods
can be applied in both circumstances, logistic regres-
sion is the obvious choice when deriving a classifier to
estimate mortality for a single cause. It makes few
assumptions about the distributional characteristics of
attribute data, is familiar to epidemiologists and is
widely available (Appendix). It also allows the trade-
off between the sensitivity and specificity of the
classifier to be varied.
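A sketch of this use of logistic regression follows, assuming scikit-learn and an invented relationship between two hypothetical signs and malaria mortality; lowering the probability cutoff raises sensitivity at the cost of specificity.

```python
# Logistic regression for a single CoD, assuming scikit-learn; the data
# and the sign-mortality relationship are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
fever = rng.integers(0, 2, n)
convulsions = rng.integers(0, 2, n)
# Invented relationship: malaria deaths tend to have fever and convulsions.
p = 1 / (1 + np.exp(-(-2.0 + 1.5 * fever + 2.0 * convulsions)))
died_of_malaria = rng.random(n) < p

X = np.column_stack([fever, convulsions])
clf = LogisticRegression().fit(X, died_of_malaria)
probs = clf.predict_proba(X)[:, 1]

# Varying the cutoff trades sensitivity against specificity.
for cutoff in (0.3, 0.5, 0.7):
    pred = probs >= cutoff
    sens = (pred & died_of_malaria).sum() / died_of_malaria.sum()
    spec = (~pred & ~died_of_malaria).sum() / (~died_of_malaria).sum()
    print(f"cutoff {cutoff}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```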
The Number of Validated CoD
The classification methods described above are only ap-
plicable in a straightforward way when a case is valid-
ated as having a single CoD. This is a constraint, given
current practice of assigning up to two validated CoD
per death.5,23 The problem is serious only when the aim
is to estimate patterns of mortality since, if a single CoD
is of interest, co-primary or related CoD can be ignored.
TABLE 1 Examples of different Boolean combinations of signs and symptoms which have been used as expert algorithms for particular causes of death

Boolean combination | Cause of death | Expert algorithm | Ref.
and | Acute lower respiratory tract infection | Cough and dyspnoea | 9
and–or | Pneumonia | Cough and (dyspnoea or tachypnoea) | 10
and | Measles | Age ≥ 120 days and rash | 10
and–and | Measles | Age ≥ 120 days and rash and fever ≥ 3 days | 9
and | Malaria | Fever and convulsions | 10
and–or | Malaria | Fever and (convulsions or loss of consciousness) | 10
If more than one validated CoD is allowed, there are four possible ways of developing a classifier: (i) treat each combination of validated CoD as a separate CoD,2 (ii) devise classification methods which can cope with more than one validated CoD,6 (iii) devise a hierarchy, detailing which CoD take precedence over others,25 (iv) omit cases with more than one validated CoD when deriving a classifier.7
Treating all combinations of validated CoD as new
classes is theoretically the best solution, since it effect-
ively treats all cases as having a single CoD and makes
no assumption that a case with two validated CoD will
have attributes characteristic of both causes. However,
because combinations of CoD are likely to be rare, a
very large sample size is required to derive appropriate
classifiers for each combination and to estimate their
performance accurately. VA studies have, to date, been
too small to estimate mortality from single CoD with
sufficient precision, so studies of sufficient size to use
this approach are unlikely to be carried out.2,6
Logistic regression has been adapted to cope with
cases with two validated CoD by deriving separate
classifiers for each CoD.6 In each analysis, all cases
were classified dichotomously as having the validated
CoD of interest or not; cases with two validated CoD,
e.g. measles and severe anaemia, were classified as
having died from measles when deriving a measles
classifier, and as having died from anaemia when deriv-
ing an anaemia classifier. Performance was calculated
without distinguishing between cases with one or two
validated CoD.
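This adaptation amounts to deriving one dichotomous classifier per CoD, with a case counted as a positive in every analysis whose CoD it carries. A minimal sketch, assuming scikit-learn and invented data:

```python
# One-vs-rest logistic classifiers for multiple validated CoD, assuming
# scikit-learn; the data and attribute values are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each case: an attribute vector and a *set* of validated CoD.
X = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1],
              [1, 1, 1], [0, 1, 0], [1, 0, 0], [0, 0, 0]])
validated = [{"measles"}, {"measles", "anaemia"}, {"anaemia"}, {"malaria"},
             {"measles"}, {"anaemia"}, {"malaria"}, {"malaria"}]

classifiers = {}
for cod in ("measles", "anaemia", "malaria"):
    # A case with two validated CoD counts as a positive in both analyses.
    y = np.array([cod in v for v in validated])
    classifiers[cod] = LogisticRegression().fit(X, y)

# A new case may be assigned more than one CoD.
new_case = np.array([[1, 1, 0]])
assigned = [cod for cod, clf in classifiers.items()
            if clf.predict_proba(new_case)[0, 1] >= 0.5]
print(assigned)
```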
This method has limitations. An overall error rate
cannot be defined, since cases with two validated CoD
may be correctly classified in one but not all analyses,
and cases with single validated CoD may be assigned
multiple CoD only one of which matches the validated
cause. Because there is relatively little penalty to
assigning an additional CoD to a case, the method tends
to overestimate the number of cases assigned multiple
CoD.6 The extent to which this happens can be adjusted
by varying the trade-off between the sensitivity and
specificity for each classifier, although it is not clear
how this affects the estimates of cause-specific death
rates.
FIGURE 2 Diagram summarizing the way in which the objective of a classifier, the number of validated causes
of death, the size of the dataset and the characteristics of attribute data affect the choice of classification
method
The method also takes no account of the possibility
that deaths from a single CoD may be characterized
by different attributes to deaths where the same CoD
is co-primary or a contributory factor. A classifier
designed to assign the CoD, irrespective of whether it is
the only CoD, may perform less well than if separate
classifiers are trained to assign the CoD to cases where
the CoD is (a) the sole and (b) joint CoD. Finally, al-
though there are constraints when attempting to classify
cases to several classes simultaneously, the simultaneous
contrast between the attributes of cases with different
validated CoD may contribute additional information
for discriminating CoD.
Using a hierarchy is likely to introduce bias. This
solution is therefore not recommended. All cases with
the same two validated CoD are unlikely to have the
same underlying cause; assigning the same reference
CoD to all these cases will therefore, arbitrarily, in-
crease the prior probability of one and decrease the
other. Even if the hierarchy were applied consistently,
the extent of bias might vary in different circumstances,
preventing comparisons being made;7,25 for example, in
cases assigned both malaria and acute respiratory
infection as validated CoD, the proportion in whom
malaria was the underlying CoD might vary according
to season.
Omitting cases with two validated CoD reduces the
sample size and is also likely to bias estimates of
performance; it, too, is therefore not recommended.
Cases with a single validated CoD may not be represent-
ative of all cases where the cause was the underlying
CoD and this might result in a classifier performing
better on cases assigned a single target CoD than cases
assigned the target CoD in combination with another
CoD.7
The above discussion emphasises the importance of
the decision about whether VA studies should allow
two validated CoD.2,25 It is a decision which needs to be
taken when designing a VA study, and has implications
for the way in which criteria for validated CoD are
formulated.
Characteristics of VA Data
Other characteristics of VA data which may affect
choice of classification method include: (i) the scales
used to measure attributes, (ii) rare or undetermined
validated CoD, and (iii) inappropriate priors.
Most VA datasets include different types of attribute
data, e.g. presence or absence of fever (categorical),
severity or duration of diarrhoea (ordinal), age or dura-
tion of illness (continuous). Linear and quadratic
discriminants using maximum likelihood methods (the
basis of most implementations of discriminant analysis
in statistical software) assume that attribute data are
continuous and distributed normally; although the
methods appear robust when all variables are categor-
ical,26 caution must be exercised when the assumptions
are violated, particularly when datasets contain mixed
types of data.27,28 Continuous and ordinal data can be
categorized, although this wastes information. Logistic
regression, probability density estimation and decision
trees make no assumptions about the distributions of
attribute data.
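Where a method expects numerical inputs, mixed VA attributes can be coded as indicator variables and integer grades, as in this sketch (the attribute names are hypothetical):

```python
# Encoding mixed VA attribute types numerically; names are hypothetical.
import numpy as np

def encode(case):
    severity_scale = {"none": 0, "mild": 1, "moderate": 2, "severe": 3}
    return np.array([
        1 if case["fever"] else 0,                   # categorical -> indicator
        severity_scale[case["diarrhoea_severity"]],  # ordinal -> grade
        case["age_months"],                          # continuous, left as-is
        case["illness_days"],
    ], dtype=float)

case = {"fever": True, "diarrhoea_severity": "moderate",
        "age_months": 23, "illness_days": 12}
print(encode(case))  # [ 1.  2. 23. 12.]
```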
It is not always possible to validate a CoD, even when
death occurs in hospital and clinical and laboratory
findings are available.5,23 There may also be cases which
are assigned very rare validated CoD. When deriving
a classifier, undetermined and rare CoD are usually
pooled as a class other. The appropriateness of this
approach depends on the classification method used.
A logistic regression classifier designed to assign a
single CoD makes no distinction between cases valid-
ated as having undetermined or rare CoD and cases
validated as having specified CoD different to the CoD
of interest. Therefore, pooling of cases with rare and
undetermined CoD should not compromise the
classifier.
Classification methods which assign several CoD
simultaneously treat the class 'other' in exactly the
same way as any other validated CoD. Discriminant
methods, therefore, assume that members of the class
form a single cluster in attribute space. However, there
is no reason why this should be true, since they may
have a variety of CoD characterized by different signs
and symptoms. Probability density estimation methods
can cope with members of a class forming more than
one cluster, but require a threshold number of cases
in each cluster. Decision trees can assign cases in
scattered patches of attribute space to the class 'other'
since each final split of a branch can be between a
designated CoD and other CoD.
An alternative strategy might be to assume that cases
with 'other' CoD must have low posterior class prob-
abilities for all designated CoD. This strategy would
require the omission of all cases classified as having
'other' CoD when deriving a classifier, but including
them when evaluating the classifier. Cases with post-
erior class probabilities below a chosen criterion level
for all designated CoD could then be classified as
'other'. This strategy should be feasible for all methods,
since all can estimate posterior class probabilities as
well as determining class membership.
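This reject-option strategy can be sketched directly; any classifier which exposes posterior probabilities would serve, and the 0.5 criterion below is an arbitrary assumed value, as are the data.

```python
# Classify as 'other' when no designated CoD reaches a posterior
# probability criterion; sketch assuming scikit-learn, invented data.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [1, 1], [0, 1]])
y = np.array(["malaria", "malaria", "measles", "measles",
              "malaria", "measles"])  # 'other' cases excluded from training

clf = LogisticRegression().fit(X, y)

def assign(case, criterion=0.5):
    probs = clf.predict_proba([case])[0]
    if probs.max() < criterion:
        return "other"  # no designated CoD is sufficiently likely
    return clf.classes_[probs.argmax()]

print(assign([1, 0]), assign([0, 0]))
```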
VA validation studies have, to date, been based
exclusively on hospital deaths.5,9,10,29,30 This practice
may lead to prior probabilities for disease which are
different to those in the population for which the VA
procedure is being developed, i.e. rural areas in which
validation of CoD is not possible.2,5 Only discriminant
methods formally allow alternative priors to be sup-
plied. Probability density estimation and decision tree
methods, which estimate posterior class probabilities
directly, can recalculate class probabilities on the basis
of revised priors but may give misleading results.
Furthermore, the structure of a decision tree classifier
may be highly dependent on the prior probabilities of
classes in the dataset from which it is derived.21,31,vii

vii This problem goes far deeper than the choice of classification method because, even when methods allow alternative priors to be supplied, appropriate values are unlikely to be known since the true priors are the very statistics which VA procedures aim to estimate.
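The recalculation of posteriors under revised priors follows from Bayes' theorem: each posterior is rescaled by the ratio of the new prior to the training prior and the results renormalized. A worked sketch with invented values:

```python
# Adjusting posterior class probabilities for revised priors; all values
# are invented. Note footnote vii: appropriate new priors are rarely known.
import numpy as np

posterior_train = np.array([0.60, 0.30, 0.10])  # classifier output under training priors
prior_train = np.array([0.50, 0.30, 0.20])      # class proportions in training data
prior_new = np.array([0.20, 0.50, 0.30])        # assumed priors in target population

adjusted = posterior_train * prior_new / prior_train
adjusted /= adjusted.sum()  # renormalize so probabilities sum to one
print(adjusted)  # probabilities shift towards classes with larger new priors
```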
Need for a Classifier to be Comprehensible
Comprehensibility, i.e. the extent to which a classifier
appears to embody expert knowledge and can be easily
interpreted, may be important for three reasons: (i)
users with medical knowledge are likely to seek to
interpret a VA classifier in the framework of their
existing knowledge and may distrust a classifier which
does not appear to fit or which is unintelligible, (ii)
when CoD are difficult to distinguish, a comprehensible
classifier may provide insight about how the CoD can
be discriminated, and (iii) comprehensible classifiers
are likely to be simpler and more easily implemented.
Classifiers derived using different classification
methods are not equally comprehensible. Probability
density estimation classifiers are the most incompre-
hensible since the importance of the available attributes,
and the way in which the information is combined, is
not made explicit. Decision tree classifiers are the most
comprehensible since each split represents a simple
dichotomous rule using a single cutoff criterion for an
attribute. Whether or not there is a trade-off between
performance and comprehensibility remains an em-
pirical question.
The premium attached to comprehensibility may vary
in different circumstances. When applying a classifier
to VA data obtained from a large household survey to
estimate national mortality statistics, comprehensibility
might only be important with respect to the credibility
of the statistics for users. However, when used by inter-
viewers with some medical knowledge, a classifier
might need to be comprehensible to maintain morale.
Comprehensible classifiers might also provide insight
about the natural history of different CoD.13
Because comprehensible classifiers are also often
simple, they may help to reduce the number of un-
determined CoD. By encoding a simple classifier in a
VA questionnaire, cases with undetermined CoD could
be identified at the time of interview. Protocols for
interviewers could be established, e.g. to elicit further
information from respondents using standardized
prompts or to test the consistency of previous responses
(in a similar way to the use of filter questions in VA
questionnaires2,9,10,23,32).
EVALUATING THE PERFORMANCE OF
CLASSIFIERS
The need to evaluate a classifier is common to all
methods. Performance can be measured by several
indices, e.g. sensitivity, specificity, predictive power,
overall proportion classified correctly. Estimation of
these indices requires a classifier to be applied to a
dataset with validated class membership, allowing
inspection of the number and types of misclassification.
However, measuring performance is not simple.
Classifiers perform optimistically on data from which
they are derived because datasets inevitably contain
noise.3,31 A classifier is to some extent based on the
noise, as well as information relevant to classification
of new cases; this can be a particular problem for
decision trees if they are not adequately pruned (see footnote v).
Consequently, it is important to measure the per-
formance of a classifier in a way which adjusts for this
bias. Three methods can be used to obtain relatively
bias-free estimates of accuracy, namely train-and-test,
cross-validation and bootstrap.31 Comparisons of train-and-test and cross-validation have been made on various datasets.33
Train-and-test is the simplest and most valid method.
Available validated cases are divided randomly into
training and test subsets (usually in a ratio of about
3:1). The classifier is derived from the larger, training
sample, and is then used to assign classes to cases in the
second test sample. A comparison of the validated and
assigned classes of test cases provides an unbiased es-
timate of the performance of the classifier. It is import-
ant that dividing the data should cause only a slight
loss of efficiency in deriving the classifier, and that there
should be sufficient cases in each class in the test
sample to provide a reliable estimate of accuracy. Thus
the method is only suitable for large datasets (usually where n ≥ 1500). Classifiers for assigning CoD based on VA data,6 for diagnosing acute abdominal pain12 and myocardial infarction14
provide examples of evaluation
using the train-and-test method.
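The procedure can be sketched as follows, assuming scikit-learn; the dataset is a random stand-in, and test_size=0.25 gives the 3:1 training:test ratio mentioned above.

```python
# Train-and-test evaluation with a 3:1 split, assuming scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.random((2000, 5))     # stand-in attribute data
y = rng.integers(0, 3, 2000)  # stand-in validated CoD codes

# test_size=0.25 gives the 3:1 training:test ratio described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Accuracy on the held-out test sample estimates performance without bias.
print(clf.score(X_test, y_test))
```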
Cross-validation requires the dataset to be divided
into m subsamples. (When m = n, cross-validation is
equivalent to the leave-one-out method.34) Each
subsample is used in turn as test data to evaluate a
classifier derived from the remaining (m-1) subsamples.
The optimal classifier is finally derived using all of the
data, and an unbiased estimate of the accuracy of the
classifier is obtained from the average of the per-
formance estimates for each of the m subsamples. This
method is suitable for samples of moderate size.
Disadvantages include the need for repeated training
(which may be a problem for methods which are
computationally intensive such as neural networks) and
exaggerated scatter of the performance estimates for the
m subsamples, resulting in a confidence interval for
overall accuracy of the classifier which is too wide.
Additionally, the method does not work well with
decision tree classifiers because a knowledge-based
judgement is often required to prune a tree to an
optimal number of branches;20,21 it is difficult to apply
such a judgement uniformly to the different classifiers
trained on each of the (m-1) subsamples. Cross-
validation has been used to evaluate a decision tree
classifier for diagnosing critical congenital heart
disease in babies.16
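m-fold cross-validation, with the final classifier fitted on all of the data, can be sketched as follows (again with scikit-learn and stand-in data):

```python
# m-fold cross-validation, assuming scikit-learn; stand-in data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.random((300, 5))
y = rng.integers(0, 3, 300)

clf = LogisticRegression(max_iter=1000)
# Each of m = 10 subsamples is used in turn as test data for a classifier
# trained on the remaining nine; setting cv equal to n gives leave-one-out.
scores = cross_val_score(clf, X, y, cv=10)
print(scores.mean())  # averaged performance estimate

final_classifier = clf.fit(X, y)  # the final classifier uses all the data
```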
The bootstrap method is a non-parametric procedure
for estimating error rates. It re-uses the original data by
creating many new datasets of the same size as the
original by random sampling from the original dataset
with replacement.35 Some cases will be omitted from
each bootstrap sample, and others will appear more
than once. The aim is to replicate derivation of the
classifier a large number of times (ideally about 200),
testing the classifier obtained in each replication
against the cases which are unused in the bootstrap
sample. An overall estimate of the accuracy is obtained
from the average of estimates for the replications, and
the final classifier is based on the original dataset. As
for cross-validation, the need to prune trees by hand
makes it difficult to evaluate a decision tree classifier
using the bootstrap method.
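A sketch of the bootstrap evaluation described, testing each replicate classifier on the cases left out of its bootstrap sample, using scikit-learn and stand-in data:

```python
# Bootstrap estimation of classifier accuracy, assuming scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.random((150, 5))
y = rng.integers(0, 3, 150)
n = len(y)

scores = []
for _ in range(200):                 # ~200 replications, per the text
    idx = rng.integers(0, n, n)      # sample n cases with replacement
    out_of_bag = np.setdiff1d(np.arange(n), idx)  # cases unused this round
    clf = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    scores.append(clf.score(X[out_of_bag], y[out_of_bag]))

print(np.mean(scores))               # overall accuracy estimate
final_classifier = LogisticRegression(max_iter=1000).fit(X, y)
```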
There is a trade-off between cross-validation and
bootstrap methods with respect to error and bias. The
bootstrap method gives narrower, more appropriate
confidence limits about the overall performance es-
timate, but the estimate itself is biased optimistically.
Cross-validation is recommended for moderate-sized samples and bootstrapping for smaller ones.31
When the aim is to determine patterns of mortality,
one might argue that accurate classification of indi-
vidual cases is not essential providing a classifier gives
accurate estimates of the proportions of deaths which
are attributable to each important CoD; this might
happen if misclassifications for different CoD balanced
each other. This argument might be valid when using a
classifier to survey patterns of mortality in the same
population from which the VA dataset used to derive
the classifier was obtained. However, it is unlikely to
be valid when using VA to investigate changes in
patterns of mortality, since the balance between dif-
ferent types of misclassification is unlikely to remain
the same. For example, consider using a classifier to
estimate the effect of an intervention to reduce deaths
from a particular cause; change in mortality from the
target CoD would almost certainly be underestimated,
since the intervention would be very unlikely to have
any effect on those misclassified as dying from the
target CoD but who in fact died from a different CoD.
CONCLUSION
Physician review is unlikely to be a feasible method for
assigning CoD from VA data. Expert algorithms are
easily applied but they may have poor validity. A range
of classification methods exist which can be used
to derive classifiers empirically from VA data. Such
classifiers appear to be at least as valid as physician
review or expert algorithms, and can be applied quickly
and simply to large datasets.
The choice of classification method for assigning
CoD needs to be considered when designing a VA
validation study. Four main factors influence this
choice: (i) the purpose for which a classifier is being
developed, (ii) the number of validated CoD assigned to
each case, (iii) the characteristics of the VA data and
(iv) the need for a classifier to be comprehensible.
When the objective is to estimate mortality from a
single CoD, logistic regression should be used; when
the objective is to determine patterns of mortality, the
choice of method will depend on the above factors.
Comparison of the performance of classifiers derived
using different methods requires a large VA dataset,
which is not currently available.
ACKNOWLEDGEMENTS
The authors are grateful to: Dr Bob Snow for
permission to use the Kenya VA dataset; Dr Gilly
Maude for comments on the manuscript; Dr David
Spiegelhalter for statistical advice on different classi-
fication methods and comments on the manuscript.
Barnaby Reeves was supported by an Advanced
Studentship from the Medical Research Council (UK)
when carrying out the work which forms the basis for
this paper. Maria Quigley is currently funded by the
Medical Research Council (UK).
REFERENCES
1. World Health Organization. Lay Reporting of Health Information. Geneva: WHO, 1978.
2. Chandramohan D, Maude G H, Rodrigues L C, Hayes R J. Verbal autopsies for adult deaths: issues in their development and validation. Int J Epidemiol 1994; 23: 213–22.
3. Hand D J. Statistical methods in diagnosis. Statistical Methods in Medical Research 1992; 1: 49–67.
4. Henery R J. Classification. In: Michie D, Spiegelhalter D J, Taylor C C (eds). Machine Learning, Neural and Statistical Classification. Hemel Hempstead: Ellis Horwood, 1994, pp. 6–16.
5. Snow R W, Armstrong J R M, Forster D et al. Childhood deaths in Africa: uses and limitations of verbal autopsies. Lancet 1992; 340: 351–55.
6. Quigley M A, Armstrong Schellenberg J R M, Snow R W. Algorithms for verbal autopsies: a validation study in Kenyan children. Bull World Health Organ 1996; 74: 147–54.
7. Reeves B C. A Comparison of the Feasibility and Performance of Data-Derived Classification Rules for Assigning Causes of Death from Verbal Autopsies. Unpublished MSc dissertation, University of London, 1994.
8. Tversky A, Kahneman D. Judgement under uncertainty: heuristics and biases. In: Kahneman D, Slovic P, Tversky A (eds). Judgement under Uncertainty: Heuristics and Biases. Cambridge: Cambridge University Press, 1982, pp. 3–20.
9. Kalter H D, Gray R H, Black R E, Gultiano S A. Validation of postmortem interviews to ascertain selected causes of death in children. Int J Epidemiol 1990; 19: 380–86.
10. Mobley C C, Boerma J T, Titus S, Lohrke B, Shangula K, Black R E. Validation study of a verbal autopsy method for causes of childhood mortality in Namibia. J Trop Paed 1996; 42: 365–69.
11. Redd S C, Kazembe P N, Luby S P et al. Clinical algorithm for treatment of Plasmodium falciparum malaria in children. Lancet 1996; 347: 223–27.
12. de Dombal F T, Leaper D J, Staniland J R, McCann A P, Horrocks J C. Computer-aided diagnosis of acute abdominal pain. Br Med J 1972; I: 9–13.
13. Bratko I, Mozetic I, Lavrac N. KARDIO: A Study in Deep and Qualitative Knowledge for Expert Systems. Cambridge, MA: MIT Press, 1989.
14. Baxt W G, Skora J. Prospective validation of artificial neural network trained to identify myocardial infarction. Lancet 1996; 347: 12–15.
15. Spiegelhalter D J, Knill-Jones R P. Statistical and knowledge-based approaches to clinical decision-support systems, with an application in gastroenterology. J R Statist Soc Ser A 1984; 147: 35–77.
16. Chiogna M, Spiegelhalter D J, Franklin R C G, Bull K. An empirical comparison of expert-derived and data-derived classification trees. Stat Med 1994; 15: 157–69.
17. Fisher R A. The use of multiple measurements in taxonomic problems. Ann Eugen 1936; 8: 376–86.
18. Mitchell J M O. Classical statistical methods. In: Michie D, Spiegelhalter D J, Taylor C C (eds). Machine Learning, Neural and Statistical Classification. Hemel Hempstead: Ellis Horwood, 1994, pp. 17–28.
19. Minsky M L, Papert S. Perceptrons. Cambridge, MA: MIT Press, 1969.
20. Breiman L, Friedman J H, Olshen R A, Stone C J. Classification and Regression Trees. Monterey, CA: Wadsworth & Brooks, 1984, pp. 18–58.
21. Feng C, Michie D. Machine learning of rules and trees. In: Michie D, Spiegelhalter D J, Taylor C C (eds). Machine Learning, Neural and Statistical Classification. Hemel Hempstead: Ellis Horwood, 1994, pp. 50–83.
22. Bang A T, Bang R A, the SEARCH Team. Diagnosis of causes of childhood deaths in developing countries by verbal autopsy: suggested criteria. Bull World Health Organ 1992; 70: 499–507.
23. World Health Organization. Infant and Young Child Verbal Autopsy Questionnaire, Version 5. Geneva: Epidemiology and Methodological Unit, WHO, 1994.
24. World Health Organization. The Measurement of Overall and Cause-Specific Mortality in Infants and Children. Report of a joint WHO/UNICEF informal consultation, December 1992. Geneva: WHO, 1993, p. 3.
25. London School of Hygiene and Tropical Medicine. Second Workshop on Verbal Autopsy Tools for Adult Death. Unpublished report. London: London School of Hygiene and Tropical Medicine, 1996.
26. Moore D H. Evaluation of five discrimination procedures for binary variables. J Am Statist Assoc 1973; 68: 399–404.
27. Krzanowski W J. Discrimination and classification using both binary and continuous variables. J Am Stat Assoc 1975; 70: 782–90.
28. Titterington D M, Murray G D, Murray L S et al. Comparison of discrimination techniques applied to a complex data set of head injured patients. J R Stat Soc Ser A 1981; 144: 145–75.
29. Pacqué-Margolis S, Pacqué M, Dukuly Z, Boateng J, Taylor H R. Application of the verbal autopsy during a clinical trial. Soc Sci Med 1990; 31: 585–91.
30. Mirza N M, Macharia W M, Wafula E M, Agwanda R O, Onyango F E. Verbal autopsy: a tool for determining cause of death in a community. East Afr Med J 1990; 67: 693–98.
31. Henery R J. Methods for comparison. In: Michie D, Spiegelhalter D J, Taylor C C (eds). Machine Learning, Neural and Statistical Classification. Hemel Hempstead: Ellis Horwood, 1994, pp. 107–24.
32. Gray R H. Interview based diagnosis of morbidity and causes of death. In: Boerma J T (ed.). Measurement of Maternal and Child Mortality, Morbidity and Health Care: Interdisciplinary Approaches. Liège: Editions Derouaux-Ordina, 1992, pp. 61–84.
33. Statlog partners. Dataset descriptions and results. In: Michie D, Spiegelhalter D J, Taylor C C (eds). Machine Learning, Neural and Statistical Classification. Hemel Hempstead: Ellis Horwood, 1994, pp. 131–74.
34. Lachenbruch P, Mickey R. Discriminant Analysis. New York: Hafner Press, 1975.
35. Efron B. Estimating the error rate of a prediction rule: improvements on cross-validation. J Am Stat Assoc 1983; 78: 316–31.
36. Venables W N, Ripley B D. Modern Applied Statistics with S-Plus. New York: Springer Verlag, 1994.
(Revised version received February 1997)
APPENDIX
Availability of the Classification Methods Described in some Common Statistical Software Packagesa

BMDP (1990). Discriminant methods: use program 7M for linear discriminant functions; use program LR for logistic regression. Probability density estimation methods: not available. Decision tree methods: not available.

EGRET Revision 4. Discriminant methods: linear and quadratic discriminant functions are not available; for logistic regression, select UNCONDITIONAL LOGISTIC options from the regression module. Probability density estimation methods: not available. Decision tree methods: not available.

Minitab Release 11. Discriminant methods: use the DISCRIMINANT command for linear and quadratic discriminant functions; use BLOGISTIC for logistic regression. Probability density estimation methods: not available. Decision tree methods: not available.

SAS Version 6. Discriminant methods: use PROC DISCRIM for linear and quadratic discriminant functions; use PROC CATMOD or PROC LOGISTIC for logistic regression. Probability density estimation methods: use PROC DISCRIM for nearest neighbour or kernel methods. Decision tree methods: not available.

S-plus Version 3.3 (Windows), 3.4 (UNIX). Discriminant methods: use the DISCR command for linear and quadratic discriminant functions; there is no simple logistic regression command.b,c Probability density estimation methods: not available.c Decision tree methods: the TREE command creates classification trees based on a maximum likelihood splitting criterion.

SPSS Release 4. Discriminant methods: use the DISCRIMINANT procedure for linear and quadratic discriminant functions and for logistic regression. Probability density estimation methods: the Advanced Statistics User's Guide provides a MATRIX program for the nearest neighbour method. Decision tree methods: CHAID (using a χ2-based splitting criterion) is available as an optional module.21

STATA Release 4. Discriminant methods: linear and quadratic discriminant functions are not available; use the LOGISTIC command for logistic regression. Probability density estimation methods: not available. Decision tree methods: not available.

Systat Version 6. Discriminant methods: use MGLH for linear and quadratic discriminant functions; no simple command is available for logistic regression.b Probability density estimation methods: not available. Decision tree methods: not available.

a The above classification methods can be implemented in GLIM by skilled users.
b Logistic regression can be carried out manually using the options available under general linear modelling.
c Venables and Ripley36 provide S-plus programs for a variety of classification methods.
