
Health Policy 71 (2005) 315–331

A review and comparison of classification algorithms for medical decision making

Paul R. Harper∗
School of Mathematics, University of Southampton, SO17 1BJ, Southampton, UK
∗ Tel.: +44 23 8059 2660; fax: +44 23 8059 5147. E-mail address: p.r.harper@maths.soton.ac.uk (P.R. Harper).

Abstract

Within a health care setting, it is often desirable from both clinical and operational perspectives to capture the uncertainty and variability amongst a patient population, for example to predict individual patient outcomes, risks or resource needs. Homogeneity brings the benefits of increased certainty in individual patient needs and resource utilisation, thus providing an opportunity for both improved clinical diagnosis and more efficient planning and management of health care resources. A number of classification algorithms are considered and evaluated for their relative performances and practical usefulness on different types of health care datasets. The algorithms are evaluated using four criteria: accuracy, computational time, comprehensibility of the results and ease of use of the algorithm for relatively statistically naive medical users. The research has shown that there is not necessarily a single best classification tool; instead, the best performing algorithm will depend on the features of the dataset to be analysed. These features, with particular emphasis on health care data, are discussed in the paper.
© 2004 Elsevier Ireland Ltd. All rights reserved.

Keywords: Classification algorithms; Clinical decision making; Neural networks; CART

1. Introduction

As a direct consequence of individuality, patients typically differ in a number of medical, physical and socio-economic characteristics, for example by age, severity of illness, complications and speed of recovery. Groups of patients with health care needs, whether they represent those who suffer from a particular disease or those who are rushed into hospital as emergencies, might be considered as groups of patients with similar needs. In fact, these groups are typically heterogeneous and require more detailed modelling for classification. Resulting health care needs and corresponding resources vary from patient to patient. The subsequent uncertainty and variability in the health care system can place great stress on the efficient and effective planning and management of resources.

From both a clinical and operational perspective, it is desirable to be able to divide this heterogeneous group into smaller homogeneous (in terms of some measure) sub-groups. Homogeneity brings the benefits of increased certainty in clinical diagnosis, predicting individual patient needs and resource utilisation. For example, given an individual patient we can classify them into a patient sub-group in which we know, from past experience and data, that their length of stay (LoS) in hospital is likely to be within a certain range of time with a given confidence. The LoS for this patient group will typically substantially differ from the predicted LoS of other groups. The purpose of classification in this example would be to produce tight LoS bands with high confidence. Thus, with the added knowledge and confidence of how long individual patients are likely to stay in the hospital system, the potential for improved efficiency and effectiveness in hospital planning and management is vast.


An important criterion for a good classification procedure is that it not only produces accurate classifiers (within the limits of the data) but that it also provides insight and understanding into the predictive structure of the data [1]. For example, finding which socio-economic and medical characteristics contribute to the risk of a particular disease not only provides valuable assistance in classifying individuals into risk groups with some certainty, but more generally has advanced the knowledge and understanding of the disease.

In this paper, four algorithms have been considered, with the intention of representing the spectrum of classical statistical and more recent advances in computer-based approaches:

• discriminant analysis (DA)
• regression models (multiple and logistic)
• tree-based algorithms (CART)
• artificial neural networks

In considering suitable datasets, the primary criterion was that chosen datasets were of real-world interest to the medical profession. Each algorithm has been evaluated using four criteria: accuracy, computing time taken to produce results, comprehensibility of the results and the ease of use of the algorithm to relatively naive medical users. To assess user-friendliness and comprehensibility, the models were presented to managerial and clinical staff within different NHS Trusts across the South of England, including Portsmouth Hospitals NHS Trust, Southampton University Hospitals NHS Trust and The Royal Berkshire and Battle NHS Trust.

In Section 2, the general classification problem is presented with a review of previous classification comparisons. Section 3 provides information on each of the algorithms and Section 4 describes the chosen datasets. Results of the comparison study and a summary of general findings are presented in Sections 5 and 6, respectively.

2. The general classification problem

There are two elements to a general classification problem. Measurements are made on some case or object (for example, a patient in a hospital setting with measurements including age, sex, clinical diagnosis, LoS, outcome, etc.) and based on these measurements a prediction is made as to which class a case is in. The prediction is made following a pre-defined classification rule.

In mathematical terms, we define X to be the measurement space containing x = (x1, x2, . . ., xm), the measurement vector, where each xi is a measurement taken on a case. The method should, given any x in X, have a classification rule to assign one of the classes (1, 2, 3, . . ., J) to x, where J is the number of classes. The classifiers are based on past experience using a combination of expert knowledge and past data with their relevant outcomes. For example, the classifiers to be defined could come from a hospital database combined with the expert knowledge of the consultants, specialty managers and other medical staff. Each measured variable is continuous, nominal or ordinal in nature. A variable is continuous if the measured value is a real number (e.g. LoS, age, height). A variable is nominal if it is a finite categorical set with no natural ordering (e.g. sex, hospital ward, clinical diagnosis). A variable is ordinal if it is a finite categorical set with a natural order.

There exist many different classification algorithms, but their relative merits and practical usefulness for health care problems in particular remain unclear. Thus, a need arises to evaluate their relative performances. Intrasubject comparisons have been considered in the past, for example, within statistics [2], within symbolic learning [3] and within neural networks [4]. Other authors, for example [5], have compared different algorithms for non-health care datasets, but little or no research has been conducted on the relative merits of various techniques for health care problems and in particular consideration of the practical usefulness for (use by and interpretation of) medical personnel.
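To make the notation concrete, the sketch below shows a classification rule in the sense described above: a function that maps a measurement vector x onto one of J classes. The variables, thresholds and class labels are hypothetical illustrations, not rules from the paper; in practice such rules are learned from data by the algorithms reviewed in Section 3.

```python
# Illustrative only: a hand-written classification rule d: X -> {1, ..., J}
# assigning a measurement vector x to a predicted LoS band.
# Variable names and thresholds are hypothetical, not taken from the paper.

def classify_los_band(x):
    """x is a dict of measurements taken on one case (one point in X)."""
    if x["emergency"] and x["age"] >= 75:
        return 3          # class 3: long expected stay
    if x["intent"] == "day case":
        return 1          # class 1: short expected stay
    return 2              # class 2: intermediate expected stay

patient = {"age": 81, "emergency": True, "intent": "inpatient"}
print(classify_los_band(patient))   # -> 3
```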

3. Algorithms

Four algorithms have been considered. A brief synopsis on each is provided below.

3.1. Discriminant analysis

(Canonical) discriminant analysis is a technique using least squares methods to separate data into two or more groups. Data points are characterized by several variables; the optimal discriminant function is assumed to be a linear function of the variables and is determined by maximizing the between group sum of squares for fixed within group sum of squares. This technique is well described (for example, in [19]) and a detailed discussion will not be given here.

DA is primarily used to classify cases into the values of a categorical dependent, usually a dichotomy. If it is effective for a set of data, the classification table of correct and incorrect estimates will yield a high percentage correct. In general, there are several purposes for DA:

• To investigate differences between groups.
• To determine the most parsimonious way to distinguish between groups.
• To discard variables which are little related to group distinctions.
• To classify cases into groups.
• To test theory by observing whether cases are classified as predicted.

Discriminant analysis shares all the usual assumptions of correlation, requiring linear and homoscedastic relationships and untruncated interval or near interval data. Like multiple regression, it also assumes proper model specification (inclusion of all important independents and exclusion of extraneous variables). DA is an earlier alternative to logistic regression (see Section 3.2), which is now frequently used in place of DA as it usually involves fewer violations of assumptions, is robust and has coefficients which many find easier to interpret.

3.2. Regression models (multiple and logistic)

Regression analysis is concerned with investigating the relationship between several variables in the presence of random error. In particular, we build a model in which one of the variables (the dependent variable) is expressed as a linear combination of the remaining variables (which are referred to as the independent or explanatory variables). The method of least squares is used to estimate the parameters of the model from a given dataset. This process of estimation is often referred to as fitting the model. The following description is intended to provide an introduction to the topic. The reader is referred to [6] for a comprehensive examination of the subject.

We seek a model which will enable us to represent the relationship between the dependent variable, y, and the set of independent variables, x1, x2, . . ., xk. In general, this relationship will not be completely deterministic and will contain a random experimental error term, ε. In order to investigate the relationship, a value of y, yj, is determined when x1, x2, . . ., xk take the values x1j, x2j, . . ., xkj, respectively. This is repeated for j = 1, . . ., n. The general model used to represent the relationship is of the form:

yj = Φ(x1j, x2j, . . ., xkj) + εj,   j = 1, . . ., n

where Φ denotes some function of the x's and contains unknown parameters which have to be estimated. Attention is usually limited to the general linear model, in which Φ is a linear function of the unknown parameters, such that:

yj = α + β1x1j + β2x2j + · · · + βkxkj + εj,   j = 1, . . ., n

In some situations, it is more appropriate for the model to include a multiplicative error term rather than the additive term used above. The general model then becomes:

yj = Φ(x1j, x2j, . . ., xkj)δj

where δj denotes the random experimental error term. This model is often appropriate for data for which the relationship between the variables is clearly non-linear. In this case, a suitable logarithmic transformation can sometimes be applied to achieve an approximately linear relationship with an additive error term.
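The sketch below illustrates how the general linear model above can be fitted by least squares, including the logarithmic transformation mentioned for a multiplicative error term. The synthetic data and variable names are illustrative assumptions only; the study itself fitted its regression models in SPSS.

```python
# A minimal sketch of least-squares fitting of the general linear model,
# using a log transformation to handle a multiplicative error term.
# Synthetic data; coefficients and variables are not values from the paper.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(0, 80, n)                                  # e.g. age
x2 = rng.integers(0, 2, n)                                  # e.g. emergency flag
y = np.exp(0.5 + 0.01 * x1 + 0.8 * x2) * rng.lognormal(0, 0.3, n)  # multiplicative error

# Log transform gives an approximately linear model with additive error:
# log(y_j) = alpha + beta1 * x1_j + beta2 * x2_j + e_j
X = np.column_stack([np.ones(n), x1, x2])                   # design matrix with intercept
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)        # least-squares estimates
alpha, beta1, beta2 = coef
y_pred = np.exp(X @ coef)                                   # back-transform predictions
print(alpha, beta1, beta2)
```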

Binomial (or binary) logistic regression is a form of regression that is used when the dependent is a dichotomy and the independents are continuous variables, categorical variables or both. Multinomial logistic regression exists to handle the case of dependents with more classes. Logistic regression applies maximum likelihood estimation after transforming the dependent into a logit variable (the natural log of the odds of the dependent occurring or not). In this way, logistic regression estimates the probability of a certain event occurring.

Logistic regression has many analogies to ordinary least squares (OLS) regression: the standardized logit coefficients correspond to beta weights and a pseudo R2 statistic is available to summarize the strength of the relationship. Unlike OLS regression, however, logistic regression does not assume linearity of relationship between the independent variables and the dependent, does not require normally distributed variables, does not assume homoscedasticity and, in general, has less stringent requirements. The success of the logistic regression can be assessed by looking at the classification table, showing correct and incorrect classifications of the dichotomous, ordinal or polytomous dependent. Goodness-of-fit tests are available as indicators of success, as is the Wald statistic and other tests of the model's significance.

3.3. Tree-based algorithms (CART)

Classification and regression trees (CART) is a classification method that has been successfully used in many health care applications. Example applications include creating case-mix groups [7], minimum data requirements [8], cancer survival groups [9], intensive care [10] and hospital inpatients [11]. The reader is directed to [1] for a comprehensive description of the CART algorithm. Although there are variants of tree-based methods with different splitting criteria, CART has been selected for this study since it is widely used in medical decision making (evidenced by the number of published applications using CART, a small sample of which is referenced above), and furthermore because recent published studies have shown that different splitting criteria have insignificant impact on the accuracy of the classifier, with no single splitting criterion in the literature proving to be universally better than the rest [12].

The first step in producing a tree is to decide which variable is to be predicted (LoS, operation times, survival rates, etc.). If the variable is ordinal then the variance is used to measure the purity [1] in the group; if the variable is categorical then deviance is used. An algorithm is used to split the original dataset into sub-populations of increasing purity (decreasing variance or deviance). At each junction of the tree is a node. A terminal node is a node at the end of a branch of the tree. At each node in the tree, the algorithm searches through each of the independent variables in turn. For each variable, it finds the best binary split that produces a node with the smallest variance or deviance. Then, it selects the variable that has produced the best binary split (best of the best). The parent node will thus be split on this variable with the split as defined. This may have the effect of leaving one of the child nodes with a higher variance/deviance than the parent node. The algorithm however continues to branch from each of these child nodes until defined stopping rules have been fulfilled.

An issue in CART analysis is when to stop the partitioning, i.e. when do we say that the variance has not significantly reduced? It would be possible to create a tree where each terminal node has zero variance by having just one case in each node. However, this would be statistically irrelevant and practically useless. It is necessary, therefore, to introduce stopping rules so that terminal nodes have sufficient size to yield reasonable and statistically robust results. Stopping rules include:

• Stop when nodes contain a certain number of cases.
• Stop when the reduction of variance is below a certain threshold.
• Stop when a maximum number of terminal nodes (or layers) have been produced.

Care must be exercised when defining stopping rules, which should account for the number of cases in the dataset. A terminal node with less than 30 cases, for example, can be expected to yield little predictive power and lack statistical robustness. Standard bounds are no less than 50 cases per node, a significance level of 1% on the reduction of variance in order to split a node, and a maximum of around 10 terminal nodes. Once a tree is constructed, statistical summaries can be produced at the terminal nodes, which can be used to form the classes.
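As a small illustration of the split search described above, the sketch below finds, for a single continuous explanatory variable, the binary split that most reduces the variance of a continuous dependent variable, while respecting a minimum-node-size stopping rule. It is a simplified, assumption-laden fragment (one variable, variance criterion only) and not the CART implementation used in the study (SPSS AnswerTree).

```python
# Simplified sketch: best binary split of one continuous variable by
# variance reduction. Real CART repeats this search over every variable and
# recurses on the child nodes until the stopping rules are met.
import numpy as np

def best_split(x, y, min_node_size=50):
    """Return (threshold, variance reduction) for the split x <= threshold."""
    parent_ss = np.var(y) * len(y)                   # total sum of squares in the node
    best = (None, 0.0)
    for threshold in np.unique(x)[:-1]:
        left, right = y[x <= threshold], y[x > threshold]
        if len(left) < min_node_size or len(right) < min_node_size:
            continue                                 # respect the node-size stopping rule
        child_ss = np.var(left) * len(left) + np.var(right) * len(right)
        reduction = parent_ss - child_ss
        if reduction > best[1]:
            best = (threshold, reduction)
    return best

rng = np.random.default_rng(1)
age = rng.uniform(18, 95, 500)
los = np.where(age > 70, 8, 3) + rng.exponential(2, 500)   # synthetic LoS data
print(best_split(age, los))                                 # threshold should be near 70
```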

There are four components required to construct a regression tree:

1. A set of questions of the form: does xi belong to the set A? The answer to such questions induces a split of the predictor space into cases associated with A and those with the complement of A. The sub-samples form the nodes.
2. A goodness of split criterion Φ(s, t) that can be evaluated at any split s at any node t.
3. A means of determining the appropriate size of the tree.
4. Statistical summaries at terminal nodes of the tree, for example, node averages and frequency distributions.

Once a tree has been produced, it should be validated to give an estimate of the accuracy of its classifications. The same data that is used to construct the tree cannot be used to test the classifications, as the estimate will be over-optimistic. This is overcome by splitting the data into two sets, A and B. The cases in A must be independent and identically distributed to the cases in B. This has the drawback that it reduces the sample size used in the construction of the tree. Set A can be used to train the data and build the tree (training set) and set B to test the robustness and validity by forcing the data through the tree (testing set). For smaller samples, there is a technique called V-fold cross-validation. For larger samples, standard statistical methods can be used to compare the values of the training set and test set at any node using significance tests.

3.4. Artificial neural networks

Artificial neural networks are parallel computing devices consisting of many interconnected simple processors. In essence, they are attempting to mimic the behaviour of the human brain. Although each processor is quite simplistic, a collection of these units (a network) gives rise to a powerful computational tool. Each processor in the network is only aware of the signals it periodically receives and the signal it periodically sends to other processors, and yet such simple local processors are capable of performing complex tasks when placed together in a large network of orchestrated cooperation.

Artificial neural networks have their roots in work performed in the early part of the 20th century [13], but only during the 1990s, after the breaking of some theoretical barriers and the advances in computing power, have these networks been widely accepted as useful tools. A plethora of books and papers have been published on artificial neural networks. This section aims to provide the reader with a general overview of the topic. For a more comprehensive and exhaustive description, the following references are useful starting points: [13–16].

The neural network is the collection of units that are connected in some pattern to allow communication between the units. These units, also referred to as neurons or nodes, are simple processors whose computing ability is restricted to a rule for combining input signals and an activation rule that takes the combined input to calculate an output signal. Output signals may be sent to other units along connections known as weights. The weights usually excite or inhibit the signal that is being communicated. The net input of weighted signals received by a unit j is given by

netj = w0 + Σi=1..n wi xi

where w0 is the biasing signal, wi the weight on input connection ij, xi the magnitude of signal on input connection ij and n is the number of input connections to unit j. An illustrative schematic of a single network unit with three incoming signals is shown in Fig. 1.

Fig. 1. A single network unit.
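The net input formula above can be written directly in code. The sketch below computes netj for a unit with three incoming signals (as in Fig. 1) and then applies a sigmoid, one of the common transfer functions described in the following paragraphs; all weights and inputs are arbitrary illustrative values.

```python
# Illustrative single unit: net input net_j = w0 + sum_i w_i * x_i,
# followed by a sigmoid transfer function (see the text that follows).
import math

def unit_output(x, w, w0):
    net = w0 + sum(wi * xi for wi, xi in zip(w, x))   # weighted net input
    return 1.0 / (1.0 + math.exp(-net))               # sigmoid transfer function

x = [0.5, -1.2, 3.0]        # three incoming signals (arbitrary values)
w = [0.8, 0.1, -0.4]        # connection weights (arbitrary values)
print(unit_output(x, w, w0=0.2))
```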



The summed value, net, is passed on to a second processor within the unit, the transfer function, which computes the output value of unit j, determined by:

oj = f(netj)

where oj is the unit output and f(netj) is the output transfer function. Common transfer functions include the step, sigmoid and hyperbolic tangent sigmoid functions.

One of the intriguing aspects of neural networks is that although they have units with limited computing capability, when many of these units are connected together the complete network is capable of performing a complicated task. Fig. 2 illustrates an example of a simple two-layered feedforward network (the input layer is not normally regarded as a layer as its purpose is simply to enter the input values).

Fig. 2. A two-layered feedforward network.

As opposed to a traditional and rigorous programming approach, where the developer blueprints every command to be executed, a neural network is left to itself to learn the underlying theories of the problem and the procedures required for solving it. This process of learning is known as network training. The knowledge obtained by the network is stored in its weights and biasing values. A popular training algorithm for feedforward networks is the backpropagation algorithm. Feedforward networks that are trained using this approach are commonly known as backpropagation networks (BPN). The general concept of this technique is to constantly perform corrections to the individual weights and bias values, with the objective of reducing the output error; each correction corresponds to the magnitude that the weight or bias has contributed to the output error. In other words, the algorithm redistributes the "blame" to the individual weights and biases according to their contributions to the overall error.

The backpropagation algorithm is a supervised learning method in which the developer first selects a set of training data from historical records. The training data would consist of samples of inputs, I, together with a corresponding set of targeted output(s), t. For each data point in the training set, the algorithm sweeps through the network twice. The forward sweep first propagates the input vectors through the network to compute the output values at the output layer (with the weights and biasing values of the network initialised to small arbitrary values). The unit's total output error, e, obtained by summing the individual differences between the network's output and the targeted outputs (t − o), is then in turn propagated backwards through the network to determine how the weights are to be changed during training. Common error functions used include the sum of squared error (SSE) and the mean squared error (MSE). Further discussions on the backpropagation algorithm may be found in [17].

An important consideration affecting the likely success or failure of a neural network is the choice of the number of units in the input, hidden and output layers. With prior knowledge of the dataset, respectable initial values may be chosen. For most problems however, the initial values are unclear and often further complicated by the curse of dimensionality, which states that the number of data points required for training increases non-linearly as each input variable is added.

The choice of the number of units to be used in the hidden layers also strongly affects the generalisation capability of the network. Generalisation in the neural network context refers to the ability of a network to arrive at a configuration that is able to correctly process input data that has never been presented to it before. As a general rule of thumb [15], start with a two-layered network comprising, in the hidden layer, 30–50% of the total number of units in the input and output layers. While a lack of units in the hidden layers would obviously cause the network to be insufficiently powerful to model the problem at hand, the presence of too many units may cause the network to overfit. Under such circumstances, the network has begun to memorise the training set and will not allow for flexibility in processing new input sets. Overfitting is observed when the test error is higher than the final training error.
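To make the forward sweep and error calculation concrete, the sketch below pushes one input vector through a small two-layered feedforward network (one hidden layer plus an output layer, as in Fig. 2) and computes the squared error against a target. Sizes, weights and data are arbitrary assumptions; the backward sweep (the weight updates themselves) is omitted, and the study itself used NeuroSolutions rather than hand-written code.

```python
# Forward sweep of a tiny feedforward network and its squared error.
# All values are illustrative; no training (backward sweep) is shown.
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

rng = np.random.default_rng(2)
n_in, n_hidden, n_out = 4, 2, 1          # rule of thumb: hidden ~ 30-50% of (in + out)

# Weights and biasing values initialised to small arbitrary values
W1, b1 = rng.normal(0, 0.1, (n_hidden, n_in)), np.zeros(n_hidden)
W2, b2 = rng.normal(0, 0.1, (n_out, n_hidden)), np.zeros(n_out)

x = np.array([61.0, 1.0, 0.0, 1.0])      # one input vector (e.g. age plus binary flags)
t = np.array([1.0])                      # targeted output for this training case

hidden = sigmoid(W1 @ x + b1)            # net input then transfer function, hidden layer
o = sigmoid(W2 @ hidden + b2)            # output layer
sse = np.sum((t - o) ** 2)               # sum of squared error for this case
print(o, sse)
```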

Table 1
Dataset descriptions

Dataset  Description
A        An intensive care dataset containing routinely collected ICU data within the UK. Information on a number of socio-economic and medical variables. Primary interest in predicting LoS and outcome.
B        Routinely collected data from a hospital patient management system for predicting LoS on the ward.
C        A comprehensive maternity dataset collected as part of a commissioned study on predicting complications at birth. Contains a number of socio-economic and medical variables.
D        A large diabetes dataset, containing information on various diabetic complications collected for over 30 years from a leading unit in the UK.

4. Datasets

In considering suitable datasets, the primary criterion was that chosen datasets were of interest to the medical profession. In order to evaluate how the different classification approaches perform on different types of data, a number of datasets were chosen with the intention of representing those with different sizes (number of records), number of variables (fields), level of variance or deviance (an indicator of how "messy" the data is) and ratio of continuous to categorical variables in the dataset.

The four selected datasets are described in Table 1. A summary of the classification studies is shown in Table 2.

The algorithms are evaluated using four criteria. Two are objectively measurable: the accuracy and the computing time taken to produce results. There are also two subjective criteria: the comprehensibility of the results and the ease of use of the algorithm to relatively naive medical users.

4.1. Accuracy

There is no generally accepted measure or agreement on the appropriate loss function or accuracy of a classification tool. Therefore, to compare the relative performances of each of the four techniques, accuracy has been measured as the percentage of cases (patients) that the algorithm classifies correctly with a categorical dependent variable, or the correlation coefficient (r) between the observed and predicted responses with a continuous dependent variable.

4.2. Run-time speed

Training and testing times were calculated. Training time is the time taken to learn plus the time taken to classify the training data. Test time is the time it takes to classify the test data. All algorithms were performed on the same PC with a Pentium III processor of speed 600 MHz with 128 MB RAM.

4.3. Comprehensibility and ease of use

This is a measure of the extent to which the algorithm produces comprehensible results that are easy to interpret and understand, particularly by medical staff. This was measured by consulting various medical personnel within the participating NHS Trusts. They were also asked to estimate the ease of use of each technique. This is based on the amount of time required to understand the algorithm and prepare the data, the amount of tuning necessary and the time required to produce correct results.

Table 2
Summary of classification studies

Study  Dataset  Dependent variable (description)   Nature  Number of records  Variance/deviance  Number of variables
1      A        LoS in ICU (days)                  Cts     582                11.8 (mean 2.4)    7 (2 Cts, 5 Cat)
2      A        Outcome (death or survival)        Cat     582                0.13 (13%)         7 (2 Cts, 5 Cat)
3      B        LoS in hospital (days)             Cts     17,974             56.6 (mean 4.3)    5 (2 Cts, 3 Cat)
4      C        Chance of complicated delivery     Cat     2,402              0.24 (24%)         16 (8 Cts, 8 Cat)
5      D        Predicting onset of retinopathy    Cat     4,056              0.33 (33%)         14 (12 Cts, 2 Cat)

Cts: continuous variable; Cat: categorical variable.
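The two objective criteria of Section 4.1 can be computed as in the sketch below: the proportion of correctly classified cases for a categorical dependent and the Pearson correlation coefficient r between observed and predicted responses for a continuous dependent. The example arrays are invented purely to show the calculations; they are not results from the studies in Table 2.

```python
# Accuracy measures used in the comparison: proportion correctly classified
# (categorical dependent) and Pearson r (continuous dependent).
import numpy as np

def proportion_correct(observed_class, predicted_class):
    return np.mean(np.asarray(observed_class) == np.asarray(predicted_class))

def pearson_r(observed, predicted):
    return np.corrcoef(observed, predicted)[0, 1]

outcome_obs  = np.array([1, 0, 0, 1, 1, 0, 1, 0])   # invented categorical example
outcome_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0])
print(proportion_correct(outcome_obs, outcome_pred))   # -> 0.75

los_obs  = np.array([2.0, 5.5, 1.0, 9.0, 3.5])       # invented continuous example
los_pred = np.array([2.5, 4.0, 1.5, 7.0, 3.0])
print(pearson_r(los_obs, los_pred))
```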

5. Results of the comparison study

Initial time was spent randomly splitting each of the four datasets into training and testing datasets (with approximately half the total number of observations in each). The same training and testing datasets were used for each of the classification techniques in each study. The discriminant and regression models were built using SPSS (version 10). The CART trees were constructed using SPSS AnswerTree (version 2.1). NeuroSolutions (version 3) was used to build, train and test the neural networks. For each study, the models were trained and then tested. Their accuracy and run-time were recorded.

Table 3 summarises how each of the classification techniques performed in each of the five studies. Recall that with a categorical dependent variable, the percentage of correctly classified cases is presented (rescaled as a number between 0 and 1). With a continuous dependent variable, the value of the correlation coefficient (r) is given. Ninety-five percent confidence intervals (±) are provided, calculated using the Binomial distribution for categorical studies and Fisher's Z transformation applied to correlation coefficients for continuous studies. The time taken to test the models is shown in seconds. The Appendix contains more detailed information with illustrative results for each of the four techniques.

5.1. Discriminant analysis

Discriminant analysis is used to classify cases into the values of a categorical dependent. A large amount of set-up time was therefore required to split the continuous dependent variables into the necessary groups. For this purpose, the groups were derived using percentile splits, with a total of ten groups constructed.

DA consistently gave the lowest accuracy and took a longer time to run than both the regression and CART approaches, although it was faster than the neural network tool. DA's poor performance may be explained by its linear structure, which seems unable to tune itself to the structure of the datasets. In contrast, for example, regression models may be tuned to the dataset through the use of interaction terms. In DA, the assumptions of a linear structure appear too restrictive. DA performed particularly poorly for predicting LoS (studies 1 and 3). The LoS data has high skew and kurtosis, which appears to severely disrupt the performance of the discriminants.

The resulting discriminant functions and their coefficients were not always easy to interpret and often shed little light on the structure of the data. A survey of health care professionals revealed that they often struggled to make clinical sense of the discriminant functions. This is a further downfall of the technique.

Table 3
Performance of the different classification techniques (accuracy ± 95% CI; time in seconds)

Study  Discriminant        Regression          CART                Neural nets
       Accuracy      Time  Accuracy      Time  Accuracy      Time  Accuracy      Time
1      0.233 ± 0.09  12    0.292 ± 0.08  1     0.320 ± 0.08  1     0.328 ± 0.08  39
2      0.670 ± 0.05  6     0.814 ± 0.04  1     0.870 ± 0.04  2     0.871 ± 0.04  39
3      0.451 ± 0.02  165   0.620 ± 0.01  5     0.604 ± 0.01  20    0.339 ± 0.01  620
4      0.539 ± 0.03  7     0.675 ± 0.02  2     0.780 ± 0.02  6     0.794 ± 0.02  165
5      0.728 ± 0.02  3     0.745 ± 0.02  2     0.792 ± 0.01  12    0.738 ± 0.01  129
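The confidence intervals in Table 3 are described in the text as Binomial-based for categorical studies and Fisher's Z-based for continuous studies. The sketch below shows one standard way to obtain such 95% intervals (normal approximation to the Binomial for a proportion, and Fisher's z transformation for a correlation); the paper does not give its exact formulas, so treat this as an assumed reconstruction, using test-set sizes of roughly half of each dataset (the Appendix reports N = 8987 for study 3).

```python
# Approximate 95% confidence intervals of the kind reported in Table 3.
# Formulas are standard textbook versions, assumed rather than quoted from the paper.
import numpy as np

Z95 = 1.96

def proportion_ci(p, n):
    """Normal approximation to the Binomial for a classification rate p on n cases."""
    half_width = Z95 * np.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

def correlation_ci(r, n):
    """Fisher's z transformation for a correlation coefficient r on n cases."""
    z = np.arctanh(r)
    lo, hi = z - Z95 / np.sqrt(n - 3), z + Z95 / np.sqrt(n - 3)
    return np.tanh(lo), np.tanh(hi)

print(proportion_ci(0.814, 291))    # study 2: roughly 0.814 +/- 0.04
print(correlation_ci(0.620, 8987))  # study 3: roughly 0.620 +/- 0.01
```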

5.2. Regression models

Models with both additive and multiplicative error terms were examined and the best accuracy recorded. When the number of variables is large, and thus running a full-factorial model is unfeasible, regression models ideally require that the user has at minimum a general understanding of the variables and may therefore evaluate suitable interaction terms. As a consequence, a large amount of time should be afforded to the consideration of appropriate model terms. Various models may then be built and evaluated.

The accuracy was typically comparable with that of the CART and neural network approaches. Regression tended to do better when the number of records (data points) was large, and it actually produced the best correlation coefficient, r, for predicting hospital LoS (0.62). For studies 1 and 3 (continuous dependent variables), the best results were obtained after a logarithmic transformation was applied to the dependent, although a number of other transformations were also evaluated.

Careful selection of interaction terms was considered, and this process was greatly aided by the results from CART. Tree-based structures graphically illustrate the variables that cause the tree to split and branch, with nested variables (parent and resulting child branches further down the tree) providing a good initial indication of potential interaction terms to include in the regression model. Further detail of the usefulness of CART in the appreciation of possible interaction terms may be found in the Appendix. Once the desired model had been chosen, the actual run-time was the quickest of all of the techniques. Even with 17,974 records (study 3) it only took 5 s to run.

Regression performs relatively poorly compared to CART and neural networks when there is a high number of categorical independent variables (studies 1, 2 and 4). This is consistent with what is known about the properties of the discriminant and logistic regression algorithms; high skew (>1) and kurtosis (>7), along with the presence of binary/categorical variables, disrupt the performance of these algorithms [5]. Such conditions are well suited to symbolic learning algorithms, as observed by the performances of CART and neural networks (study 4).

When considering ease of use and interpretability, regression models are fairly straightforward to interpret and understand, although there is a large set-up time and a need to run a number of different models accounting for interaction terms and transformations to the dependent variable as necessary.

5.3. CART

The performance of CART was consistently as good as, or better than, the regression and neural network models, and always more accurate than discriminant analysis. It appeared to perform well over all datasets, indicating that the number of records or number of variables does not affect its relative accuracy compared to the other techniques. It performed particularly well on the datasets with high skew and kurtosis, which are furthest from the (multivariate) normal.

Symbolic learning tools, like CART, are generally non-parametric. That is, they do not make any assumptions about the underlying distributions. This is why we observe a consistently good performance relative to the other approaches. Although run time is a potential concern for tree-based methods, the observed times were low even for larger datasets and those with a high proportion of categorical explanatory variables.

The CART output is simple and straightforward to interpret. Surveyed health care professionals particularly favoured the clear pictorial way in which the tree was constructed and captured on screen. It was found that this technique was the easiest to interpret clinically, allowing the user to discover which variables were of importance and, furthermore, their relative importance, and quantifying the point at which to split on each variable. Such a method can challenge perceived beliefs or reinforce existing clinical judgements. A great practical advantage of this tool is the ability to combine expert clinical/managerial knowledge with the power of statistical analysis. From any node in the tree, it is possible for the user to split the node on a user-selected explanatory variable. The CART algorithm may then be continued from the created child nodes. The combination of CART and local expert clinical knowledge proved to be a powerful tool during discussions with, and use of the tool by, the surveyed medical staff.

5.4. Neural networks

The performance of the neural net varied throughout the study depending on the features of the dataset, although overall it performed well. A large amount of initial effort was required to train and validate the models. Run-time was by far the slowest of the studied classification tools because of the large number of parameters, with the neural net taking over 10 min to train the largest dataset (study 3). Improvement in the speed of neural networks is a large research area (for example, see [18]).

A major drawback of neural networks is that they are difficult to set up to produce good results. To run backpropagation properly, a number of parameters need to be adjusted; for example, an important decision concerns the number of layers and the initial step-size and momentum rates of the backpropagation network. Any small changes in these parameters can decrease the performance substantially. After running a number of networks and gaining an appreciation of the tool, some insight was gained into choosing suitable starting conditions. These appeared to indicate an initial step-size of 0.7, a momentum rate of 0.5 and a network with two hidden layers, although there is no guarantee that the results are the best that could be achieved. An automatic method of parameter selection for backpropagation is another important current research topic.

Accuracy was restricted (especially when comparing the results with those obtained from CART and regression models) when the network was handling datasets with high variance or deviance (studies 3 and 5). The tool was equally or more accurate than all of the other tools in studies 1 and 2, with the smallest dataset (582 records).

As with regression and DA, neural networks are not always easy to interpret. Coupled with the large set-up time, long run-times and multitude of possible network configurations, this tool is perhaps best suited to an experienced user. Extreme care should be exercised if intending to give the tool to a health care professional with limited statistical and neural network knowledge.

6. Summary of findings and general conclusions

In order to capture the uncertainty and variability amongst the patient population, a number of classification techniques have been considered and evaluated for their relative performances and practical usefulness in predicting various health care indicators using a number of real-life datasets. The selected algorithms have been chosen to cover a range of statistical methods for medical decision making, spanning both commonly used traditional approaches (regression analysis) and more recent and less widely used techniques (neural networks).

This research has indicated that in practice there is no single best classification tool; instead, the best technique will depend on the features of the dataset to be analysed and any preferences of end-users. The research has made a start in investigating what these features are, with particular emphasis on health care data. A summary of the main findings is as follows:

• Overall the results were promising and each tool made a statistically significant contribution in each study (values of r and the percentage correctly classified were all significant at the 95% level), although in study 1 the variability explained by the models was considerably less than in the other studies.
• Regression models consistently had the fastest run-times, although the difference in times compared to CART and DA is likely to be insignificant in practice. Neural networks require significantly more time to train and validate models.
• CART, regression and neural network classification approaches gave similar accuracies, although CART was the only tool to give consistently good results. DA performed poorly throughout the study.
• CART was well suited to datasets with large skew (>1) and kurtosis (>7) and where there was a large proportion of categorical independent variables. CART makes no assumption about the underlying distribution, which is why it performed consistently well. In contrast, these conditions limit the performance of discriminant and regression models, where the data is furthest from the (multivariate) normal.
• Neural networks produced the best accuracy when dealing with smaller datasets, but performed slightly disappointingly when handling dependent variables with high levels of variability or deviance.
• If ease of use and human understanding are a high priority, symbolic algorithms such as CART should be chosen.
• A number of health care professionals were surveyed for ease of use and interpretability of the four techniques. The main concern focussed on the form of the DA discriminant function and the associated coefficients and weights of the regression and neural network models, respectively. Typically these were seen to be difficult to interpret and often shed little light on the structure of the data.

The CART tree-based structure provides a good initial indication of potential interaction terms to include in the regression-based models for studies with larger numbers of independent variables. Although it is likely that the same accuracy would have eventually been obtained for regression models without this prior information, it is certainly true that this led to reduced run-times for regression models; we therefore acknowledge that this might cause slight bias in run-time results for regression models compared to other techniques.

A survey of hospital staff from the participating NHS Trusts has revealed that tree-based tools, such as CART, do have a greater practical appeal than the other tested techniques. This is a measure of the extent to which the CART algorithm produces comprehensible results that are generally easier for medical staff to interpret than the results of other algorithms, and of the time it took for hospital staff to understand the technique, prepare the data and actually perform the analysis to produce correct and meaningful results. The CART analysis was performed by the staff themselves, after an initial training session (less than half a day).

In practice, clearly a balance must be struck between the accuracy and interpretability of a proposed technique. Accuracy is undoubtedly important, especially when considering a number of health care variables such as predicting death or survival. We might however wish to avoid a situation in which we are obtaining accurate predictions, but where the form of the classifier is complex and little confidence and knowledge is gained on the data structure. Such a black box approach is limited in producing interpretable classification rules, both for understanding the prognostic structure and for the planning and management of health care in general.

Appendix A

This appendix contains detailed results from each of the adopted techniques in the study. It is not possible to provide results for each classification tool for each study. Instead, only one study is shown for each tool, with the intention of illustrating typical model inputs, outputs and issues.

A.1. Discriminant analysis

DA is used to classify cases into the values of a categorical dependent. For studies 1 and 3, it was therefore necessary to divide the continuous dependent variable into a number of groups. Using percentile points, the distribution of values (LoS) was divided into 10 groups (1–10%, 11–20%, etc.). DA was then used to predict membership of each of the 10 groups. For studies 2, 4 and 5, the dependent variable was in the necessary categorical (group) form.

Study 5 is used to illustrate the results from DA. In this study, we are predicting the onset of diabetic retinopathy using 13 explanatory variables. Diabetic patients may suffer from a number of long-term complications. Retinopathy is a complication of the eyes, which can eventually lead to blindness. It can be treated successfully if detected in time. The dependent variable was a nominal zero/one variable defining retinopathy. The list of independent variables included sex, height, systolic blood pressure, diastolic blood pressure, number of diabetic years, glucose level (HbA), type of diabetes, creatinine level, cholesterol level and age of onset of diabetes.

The tables presented below are taken from SPSS (version 10) output.

Descriptive statistics.

Variable   Description                         Min    Max    Mean   S.D.
S1         Sex of patient (M/F)                0      1      Cat    Cat
DTYPE      Diabetes type (IDDM/NIDDM)          0      1      Cat    Cat
TOESCORE   Toescore                            6.48   14.30  10.5   1.8
HEIGHT     Height of patient                   131.0  200.0  167.8  9.7
SBPMEAN    Systolic blood pressure             90.0   220.0  143.7  19.8
DBPMEAN    Diastolic blood pressure            51.8   116.7  80.2   9.5
HBAMEAN    Glucose level                       4.8    14.9   8.7    1.5
CHOLMEAN   Cholesterol level                   2.1    12.4   5.8    1.1
CRETMEAN   Creatinine level                    50.5   644.6  98.3   35.7
BMIMEAN    Body mass index                     18.3   49.0   28.1   5.0
DIAB YEA   Years as a diabetic                 0.0    66.5   9.5    11.1
AGE        Age of patient                      10.4   93.3   61.2   16.5
AGE DIAG   Age of onset of diabetes            0.4    90.6   51.7   20.2
RET        Retinopathy (dependent variable)    0      1      Cat    Cat

Standardized canonical discriminant function coefficients (Function 1):

S1         0.225
DTYPE      0.254
TOESCORE   0.362
HEIGHT     −0.292
SBPMEAN    0.286
DBPMEAN    0.003
HBAMEAN    0.215
CHOLMEAN   −0.083
CRETMEAN   −0.038
BMIMEAN    −0.110
DIAB YEA   0.853
AGE        −0.146

Classification (a, b).

                            Predicted membership
RET                         0       1       Total
Original         Count  0   756     225     981
                        1   236     501     737
                 %      0   77.1    22.9    100.0
                        1   32.0    68.0    100.0
Cross-validated  Count  0   755     226     981
                        1   242     495     737
                 %      0   77.0    23.0    100.0
                        1   32.8    67.2    100.0

a 73.2% of original grouped cases classified correctly. b 72.8% of cross-validated grouped cases classified correctly.

A.2. Regression models

Regression analysis is concerned with investigating the relationship between several variables in the presence of random error. In particular, we build a model in which the dependent variable is expressed as a linear combination of the independent or explanatory variables.

The technique is illustrated using study 3, where we predict LoS on the ward based on routinely collected data from a hospital patient management system. A number of socio-economic and medical variables are routinely collected; four variables (age, intent, status and sex) were selected for this study. The level of variance in LoS is high (variance of 56.6 with a mean LoS of 4.3) and the number of records large (17,974). SPSS (version 10) output is presented. The best result was obtained after a logarithmic transformation was applied to LoS, although a number of other transformations were evaluated.

Descriptive statistics.

Variable   Description                         Min  Max  Mean  S.D.
X1         Sex of patient (M/F)                0    1    Cat   Cat
AGE        Age of patient                      0    98   56.6  18.2
I1         Intent (day-case/inpatient)         0    1    10.5  1.8
S1         Status (emergency/elective)         0    1    Cat   Cat
LOS        Hospital LoS (dependent variable)   1    220  4.3   7.5

The following additional and interaction terms were defined:

Variable   Interaction
AGE2       AGE × AGE
F1         AGE × I1
F2         AGE × X1
F3         AGE × S1
F4         I1 × S1
F5         I1 × X1
F6         I1 × S1
F7         AGE × I1 × X1

The results from CART gave a helpful insight into possible interaction terms. Age, intent (I1) and status (S1) were important explanatory variables. Only a three-way interaction term (excluding sex) was used in the model. The results from the regression analysis are consistent with the results from CART.

Model summary.

Model  R       R2     Adjusted R2  Standard error of the estimate
1      0.593a  0.352  0.352        0.7705
2      0.628b  0.394  0.394        0.7449
3      0.630c  0.396  0.396        0.7439
4      0.630d  0.397  0.397        0.7436

a Predictors: (constant), F1.
b Predictors: (constant), F1, F4.
c Predictors: (constant), F1, F4, I1.
d Predictors: (constant), F1, F4, I1, F3.

ANOVAe

Model Sum of squares d.f. Mean square F Sig.

1 Regression 2896.515 1 2896.515 4879.140 0.000a


Residual 5328.033 8975 0.594 – –
Total 8224.548 8976 – – –
2 Regression 3244.575 2 1622.287 2923.390 0.000b
Residual 4979.974 8974 0.555 – –
Total 8224.548 8976 – – –
3 Regression 3259.545 3 1086.515 1963.604 0.000c
Residual 4965.003 8973 0.553 – –
Total 8224.548 8976 – – –
4 Regression 3263.621 4 815.905 1475.592 0.000d
Residual 4960.927 8972 0.553 – –
Total 8224.548 8976 – – –
a Predictors: (constant), F1.
b Predictors: (constant), F1, F4.
c Predictors: (constant), F1, F4, I1.
d Predictors: (constant), F1, F4, I1, F3.
e Dependent Variable: LOGLOS.

The fitted model was:

LOGLOS = (0.3940 × I1) + (0.0084 × AGE × I1) + (0.0039 × AGE × S1) + (0.2270 × I1 × S1).

Correlations.

                          LOS     LOSPRED
LOS   Pearson             1.000   0.620a
      Sig. (two-tailed)   –       0.000
      N                   8987    8987

a Correlation is significant at the 0.01 level (two-tailed).

A.3. Regression trees (CART)

The CART method produces a tree that, by answering a series of yes/no questions, can be used to classify each patient. An algorithm is used to split the original dataset into sub-populations of increasing purity (decreasing variance or deviance). CART is demonstrated using study 4. Given a number of explanatory variables, we wish to predict the probability of a pregnant woman having a complicated delivery. Complications include the need to induce the baby, caesarean section and stillbirth. A successful classification could help to flag women at high risk, for whom we might offer more dedicated care throughout labour and during delivery.

There were 2402 records (1201 in both the training and testing datasets) and 16 variables of interest (8 continuous and 8 categorical).
Descriptive statistics.

Variable Description Min Max Mean S.D.

SMOKING If woman smokes (Y/N) 0 1 Cat Cat


EPILEPSY History of epilepsy (Y/N) 0 1 Cat Cat
HYPERTEN History of hypertension (Y/N) 0 1 Cat Cat
PARITY Previous number of children 0 8 0.9 1.0
CAESAREAN Previous number of caesareans 0 2 0.02 0.21
HEIGHT Height of woman 1.4 1.9 1.6 0.06
WEIGHTMO Weight of woman 39.0 146.0 66.7 13.2
DIABETES Is diabetic (Y/N) 0 1 Cat Cat
BP Blood pressure 40.0 116.0 72.7 9.4
AGEMO Age of woman 18.3 49.0 28.1 5.0
RACEMO Race of mother (two categories) 0 1 Cat Cat
GEST WKS Gestation (in weeks) 23.0 42.0 39.3 1.9
B INDEX Body mass index of woman 15.8 57.0 24.8 4.7
NO BABES Number of expected babies 1 2 1.02 0.14
S1 Sex of baby (M/F) 0 1 Cat Cat
SPONTDEL Spontaneous delivery (dependent variable) 0 1 Cat Cat

Node   Path                                                   No. of records   SpontDel Y (%)   SpontDel N (%)
1 All data 1201 75 25
12 Parity = 0; gestation <= 3 66 58 42
13 Parity = 0; gestation > 37 172 66 34
10 Parity = 0; gestation <= 40; age <= 26 92 63 37
11 Parity = 0; gestation <= 40; age > 26 85 49 51
8 Parity = 0; gestation > 40; age <= 25 45 51 49
9 Parity = 0; gestation > 40; age > 25 85 46 54
30 Parity > 0; caesarean = 0; BMI <= 20; sex = M 51 92 8
31 Parity > 0; caesarean = 0; BMI <= 20; sex = F 50 98 2
29 Parity > 0; caesarean = 0; weight <= 64; BMI > 20 213 93 7
24 Parity > 0; caesarean = 0; weight <= 81; gest <= 39 95 94 6
25 Parity > 0; caesarean = 0; weight <= 81; gest > 39 126 95 5
23 Parity > 0; caesarean = 0; weight > 81; gest <= 39 69 86 14
19 Parity > 0; caesarean > 0 52 55 45
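The terminal-node summaries above can be read directly as classification rules. The sketch below encodes two of the printed paths (nodes 30/31 and 19) to show how a new case would be assigned a probability of spontaneous delivery; it covers only those paths and is an illustration of how the table is used, not the full fitted tree.

```python
# Reading the CART output: walk a printed path for a new case and return the
# terminal node's proportion of spontaneous deliveries. Only two paths from
# the table above are encoded, purely for illustration.
def prob_spontaneous_delivery(parity, previous_caesareans, bmi, baby_sex):
    if parity > 0 and previous_caesareans == 0 and bmi <= 20:
        return 0.98 if baby_sex == "F" else 0.92    # nodes 31 and 30
    if parity > 0 and previous_caesareans > 0:
        return 0.55                                  # node 19
    raise NotImplementedError("remaining paths of the tree not encoded here")

print(prob_spontaneous_delivery(parity=1, previous_caesareans=0, bmi=19.5, baby_sex="F"))
```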

Classification.

                      Predicted membership
                      Y       N       Total
Original   Count  Y   814     81      895
                  N   173     123     296
           %      Y   90.9    9.1     100.0
                  N   58.8    41.2    100.0

Overall classification rate of 78.0%.

A.4. Artificial neural networks

NeuroSolutions (version 3) was used to train, cross-validate and test each network. NeuroSolutions makes use of Windows™ 'point & click' technology to make the software easy to use, enabling non-experienced users to begin building and testing models with little training and in minimal time. Caution however should be exercised when selecting suitable network initial conditions, such as step-sizes and numbers of hidden layers. Only with experience of using the software is it possible to get a feel for appropriate model configurations.

We illustrate the neural network approach in predicting ICU LoS (study 1). Routinely collected variables of interest include the patient's age, sex, outcome, source (A/E, HDU, Theatre or Ward), admission status (elective or emergency) and hospital speciality (ENT, general surgery, medicine, orthopaedics, thoracic medicine, trauma or vascular surgery). Categorical variables with multiple classes (g) were re-assigned to (g−1) binary groups (0–1).

The dataset contained 582 records. Half was used to train the network and half to test the model. Ten percent of the training data was used for cross-validation. Missing values were removed as requested by the software. The best accuracy was achieved using a multilayer perceptron network with one hidden layer, a supervised learning control with 1000 epochs (iterations over the training set), a step-size of 0.7 and a momentum rate of 0.5.
Descriptive statistics.
Variable Description Min Max Mean S.D.

AGE Age 1 97 60.7 19.8


X1 Sex male (Y/N); else female 0 1 Cat Cat
DAYS ICU LoS (dependent variable) 0.1 25.2 2.4 3.4
S1 ENT (Y/N) 0 1 Cat Cat
S2 Gen surgery (Y/N) 0 1 Cat Cat
S3 Medicine (Y/N) 0 1 Cat Cat
S4 Orthopaedics (Y/N) 0 1 Cat Cat
S5 Thoracic (Y/N) 0 1 Cat Cat
S6 Trauma (Y/N); if N to all then vascular 0 1 Cat Cat
A1 Elective (Y/N); else emergency 0 1 Cat Cat
P1 A/E (Y/N) 0 1 Cat Cat
P2 HDU (Y/N) 0 1 Cat Cat
P3 Theatre (Y/N); if N to all then ward 0 1 Cat Cat
O1 Outcome alive (Y/N); else died 0 1 Cat Cat

Performance DAYS ICU

MSE 11.6452
NMSE 0.9286
MAE 1.8410
Min abs error 0.0152
Max abs error 23.6978
r 0.3280
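The performance measures reported above (MSE, NMSE, MAE, minimum/maximum absolute error and r) can be computed as in the sketch below. The definitions shown are the usual ones, with NMSE taken as the MSE normalised by the variance of the observed values; the exact conventions used by NeuroSolutions are not stated in the paper, so this is an assumed reconstruction with invented example data.

```python
# Usual definitions of the reported error measures; example data is invented.
import numpy as np

def performance(observed, predicted):
    err = observed - predicted
    return {
        "MSE":  np.mean(err ** 2),
        "NMSE": np.mean(err ** 2) / np.var(observed),   # assumed normalisation
        "MAE":  np.mean(np.abs(err)),
        "Min abs error": np.min(np.abs(err)),
        "Max abs error": np.max(np.abs(err)),
        "r":    np.corrcoef(observed, predicted)[0, 1],
    }

observed  = np.array([0.5, 2.4, 1.0, 7.8, 3.1, 12.0])   # ICU LoS in days (made up)
predicted = np.array([1.1, 2.0, 1.5, 5.0, 2.8, 6.5])
print(performance(observed, predicted))
```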

References

[1] Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and regression trees. London: Chapman & Hall; 1984.
[2] Remme J, Habbema JDF, Hermans J. A simulative comparison of linear, quadratic and kernel discrimination. Journal of Statistics and Computer Simulation 1980;11:87–106.
[3] Clark P, Boswell R. Rule induction with CN2: some recent improvements. In: Proceedings of ESWL'91, Porto, Portugal; 1991. p. 151–163.
[4] Xu L, Krzyzak A, Oja E. Neural nets for dual subspace pattern recognition method. International Journal of Neural Systems 1991;2:169–84.
[5] King RD, Feng C, Sutherland A. Statlog: comparison of classification algorithms on large real-world problems. Applied Artificial Intelligence 1995;9:289–333.
[6] Draper NR, Smith H. Applied regression analysis. New York: Wiley; 1966.
[7] Smith ME, et al. Case-mix groups for hospital-based home care. Medical Care 1992;30:1–16.
[8] Hornberger JC, Habraken H, Bloch DA. Minimum data needed on patient preferences for acute, efficient medical decision-making. Medical Care 1995;33:297–310.
[9] Garbe C, et al. Primary cutaneous melanoma: identification of prognostic groups and estimation of individual prognosis for 5,093 patients. Cancer 1995;75:2484–91.
[10] Ridley S. Classification trees: a possible method for iso-resource grouping in intensive care. Anaesthesia 1998;53:833–40.
[11] Harper PR, Shahani AK. Modelling for the planning and management of bed capacities in hospitals. Journal of the Operational Research Society 2002;53:11–9.
[12] Berzal F, et al. On the quest for easy-to-understand splitting rules. Data and Knowledge Engineering 2003;44:31–48.
[13] Anderson JA. An introduction to neural networks. Cambridge, MA: MIT Press; 1995.
[14] Haykin S. Neural networks: a comprehensive foundation. London: Prentice-Hall; 1999.
[15] Callan R. The essence of neural networks. London: Prentice-Hall; 1999.
[16] Kay JW, Titterington DM. Statistics and neural networks: advances at the interface. Oxford: Oxford University Press; 1999.
[17] Mehrotha K, et al. Elements of artificial neural networks. Cambridge, MA: MIT Press; 1997.
[18] Fahlman S. Distributed connectionist systems for AI: prospects and problems. In: Concepts and characteristics of knowledge-based systems, proceedings of IFIP 10/WG 10.1 workshop. Mount Fuji, Japan; 1999. p. 22–35.
[19] Manly BFJ. Multivariate statistical methods. London: Chapman & Hall; 1998.
