Abstract
Within a health care setting, it is often desirable from both clinical and operational perspectives to capture the uncertainty
and variability amongst a patient population, for example to predict individual patient outcomes, risks or resource needs.
Homogeneity brings the benefits of increased certainty in individual patient needs and resource utilisation, thus providing
an opportunity for both improved clinical diagnosis and more efficient planning and management of health care resources.
A number of classification algorithms are considered and evaluated for their relative performances and practical usefulness
on different types of health care datasets. The algorithms are evaluated using four criteria: accuracy, computational time,
comprehensibility of the results and ease of use of the algorithm to relatively statistically naive medical users. The research
has shown that there is not necessarily a single best classification tool, but instead the best performing algorithm will de-
pend on the features of the dataset to be analysed, with particular emphasis on health care data, which are discussed in
the paper.
© 2004 Elsevier Ireland Ltd. All rights reserved.
doi:10.1016/j.healthpol.2004.05.002
316 P.R. Harper / Health Policy 71 (2005) 315–331
past experience and data, that their length of stay (LoS) in hospital is likely to be within a certain range of time with a given confidence. The LoS for this patient group will typically differ substantially from the predicted LoS of other groups. The purpose of classification in this example would be to produce tight LoS bands with high confidence. Thus, with the added knowledge and confidence of how long individual patients are likely to stay in the hospital system, the potential for improved efficiency and effectiveness in hospital planning and management is vast.

An important criterion for a good classification procedure is that it not only produces accurate classifiers (within the limits of the data) but that it also provides insight and understanding into the predictive structure of the data [1]. For example, finding which socio-economic and medical characteristics contribute to the risk of a particular disease not only provides valuable assistance in classifying individuals into risk groups with some certainty, but more generally has advanced the knowledge and understanding of the disease.

In this paper, four algorithms have been considered, with the intention of representing the spectrum from classical statistical methods to more recent computer-based approaches:

• discriminant analysis (DA)
• regression models (multiple and logistic)
• tree-based algorithms (CART)
• artificial neural networks

In considering suitable datasets, the primary criterion was that the chosen datasets were of real-world interest to the medical profession. Each algorithm has been evaluated using four criteria: accuracy, computing time taken to produce results, comprehensibility of the results and the ease of use of the algorithm to relatively naive medical users. To assess user-friendliness and comprehensibility, the models were presented to managerial and clinical staff within different NHS Trusts across the South of England, including Portsmouth Hospitals NHS Trust, Southampton University Hospitals NHS Trust and The Royal Berkshire and Battle NHS Trust.

In Section 2, the general classification problem is presented with a review of previous classification comparisons. Section 3 provides information on each of the algorithms and Section 4 describes the chosen datasets. Results of the comparison study and a summary of general findings are presented in Sections 5 and 6, respectively.

2. The general classification problem

There are two elements to a general classification problem. Measurements are made on some case or object (for example, a patient in a hospital setting, with measurements including age, sex, clinical diagnosis, LoS, outcome, etc.) and, based on these measurements, a prediction is made as to which class a case is in. The prediction is made following a pre-defined classification rule.

In mathematical terms, we define X to be the measurement space containing x = (x1, x2, ..., xm), the measurement vector, where each xi is a measurement taken on a case. The method should, given any x in X, have a classification rule to assign one of the classes (1, 2, 3, ..., J) to x, where J is the number of classes. The classifiers are based on past experience using a combination of expert knowledge and past data with their relevant outcomes. For example, the classifiers to be defined could come from a hospital database combined with the expert knowledge of the consultants, specialty managers and other medical staff. Each measured variable is continuous, nominal or ordinal in nature. A variable is continuous if the measured value is a real number (e.g. LoS, age, height). A variable is nominal if it takes values in a finite categorical set with no natural ordering (e.g. sex, hospital ward, clinical diagnosis). A variable is ordinal if it takes values in a finite categorical set with a natural order.

There exist many different classification algorithms, but their relative merits and practical usefulness for health care problems in particular remain unclear. Thus, a need arises to evaluate their relative performances. Intrasubject comparisons have been considered in the past, for example within statistics [2], within symbolic learning [3] and within neural networks [4]. Other authors, for example [5], have compared different algorithms for non-health care datasets, but little or no research has been conducted on the relative merits of the various techniques for health care problems, and in particular on their practical usefulness for (use by, and interpretation of results by) medical personnel.
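To make the setting concrete, the rule-based view above can be sketched in code: a case is a measurement vector x = (x1, ..., xm) and a classifier is a pre-defined rule assigning one of the classes 1, ..., J. The patient fields, thresholds and J = 3 LoS bands below are purely illustrative, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Patient:
    age: float        # continuous variable
    sex: str          # nominal variable
    emergency: bool   # nominal variable (admission status)

def los_band(p: Patient) -> int:
    """A pre-defined classification rule assigning one of J = 3 LoS bands."""
    if p.emergency and p.age >= 75:
        return 3      # long expected stay
    if p.emergency or p.age >= 60:
        return 2      # medium expected stay
    return 1          # short expected stay

print(los_band(Patient(age=80, sex="F", emergency=True)))   # -> 3
print(los_band(Patient(age=30, sex="M", emergency=False)))  # -> 1
```

In practice, of course, such a rule is not hand-written but derived from past data combined with expert knowledge, as the section describes.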
Discriminant analysis shares all the usual assumptions of correlation, requiring linear and homoscedastic relationships and untruncated interval or near-interval data. Like multiple regression, it also assumes proper model specification (inclusion of all important independents and exclusion of extraneous variables). DA is an earlier alternative to logistic regression (see Section 3.2), which is now frequently used in place of DA as it usually involves fewer violations of assumptions, is robust and has coefficients which many find easier to interpret.

3.2. Regression models (multiple and logistic)

Regression analysis is concerned with investigating the relationship between several variables in the presence of random error. In particular, we build a model of the form

yj = Φ(x1j, x2j, ..., xkj) + δj

where Φ is a function of the explanatory variables and δj an additive random error term. In some situations, it is more appropriate for the model to include a multiplicative error term rather than the additive term used above. The general model then becomes:

yj = Φ(x1j, x2j, ..., xkj) δj

where δj denotes the random experimental error term. This model is often appropriate for data for which the relationship between the variables is clearly non-linear. In this case, a suitable logarithmic transformation can sometimes be applied to achieve an approximately linear relationship with an additive error term.

Binomial (or binary) logistic regression is a form of regression that is used when the dependent is a dichotomy and the independents are continuous variables, categorical variables or both. Multinomial logistic regression exists to handle the case of dependents with more classes. Logistic regression applies maximum likelihood estimation after transforming the dependent into a logit variable (the natural log of the odds of the dependent occurring or not). In this way, logistic regression estimates the probability of a certain event occurring.

Logistic regression has many analogies to ordinary least squares (OLS) regression: the standardized logit coefficients correspond to beta weights and a pseudo R2 statistic is available to summarize the strength of the relationship. Unlike OLS regression, however, logistic regression does not assume linearity of the relationship between the independent variables and the dependent, does not require normally distributed variables, does not assume homoscedasticity and, in general, has less stringent requirements. The success of the logistic regression can be assessed by looking at the classification table, showing correct and incorrect classifications of the dichotomous, ordinal or polytomous dependent. Goodness-of-fit tests are available as indicators of success, as are the Wald statistic and other tests of the model's significance.

3.3. Tree-based algorithms (CART)

Classification and regression trees (CART) is a classification method that has been successfully used in many health care applications. Example applications include creating case-mix groups [7], minimum data requirements [8], cancer survival groups [9], intensive care [10] and hospital inpatients [11]. The reader is directed to [1] for a comprehensive description of the CART algorithm. Although there are variants of tree-based methods with different splitting criteria, CART has been selected for this study, since it is widely used in medical decision making (evidenced by the number of published applications using CART, a small sample of which is referenced above) and, furthermore, because recent published studies have shown that different splitting criteria have an insignificant impact on the accuracy of the classifier, with no single splitting criterion in the literature proving to be universally better than the rest [12].

The first step in producing a tree is to decide which variable is to be predicted (LoS, operation times, survival rates, etc.). If the variable is ordinal then the variance is used to measure the purity [1] in the group; if the variable is categorical then deviance is used. An algorithm is used to split the original dataset into sub-populations of increasing purity (decreasing variance or deviance). At each junction of the tree is a node. A terminal node is a node at the end of a branch of the tree. At each node in the tree, the algorithm searches through each of the independent variables in turn. For each variable, it finds the best binary split that produces a node with the smallest variance or deviance. Then, it selects the variable that has produced the best binary split (best of the best). The parent node will thus be split on this variable with the split as defined. This may have the effect of leaving one of the child nodes with a higher variance/deviance than the parent node. The algorithm however continues to branch from each of these child nodes until defined stopping rules have been fulfilled.

An issue in CART analysis is when to stop the partitioning, i.e. when do we say that the variance has not significantly reduced? It would be possible to create a tree where each terminal node has zero variance by having just one case in each node. However, this would be statistically irrelevant and practically useless. It is necessary, therefore, to introduce stopping rules so that terminal nodes have sufficient size to yield reasonable and statistically robust results. Stopping rules include:

• Stop when nodes contain a certain number of cases.
• Stop when reduction of variance is below a certain threshold.
• Stop when a maximum number of terminal nodes (or layers) have been produced.

Care must be exercised when defining stopping rules, which should account for the number of cases in the dataset. A terminal node with fewer than 30 cases, for example, can be expected to yield little predictive power and lack statistical robustness. Standard bounds are no fewer than 50 cases per node, a significance level of 1% on the reduction of variance in order to split a node, and a maximum of around 10 terminal nodes. Once a tree is constructed, statistical summaries can be produced at the terminal nodes, which can be used to form the classes.

There are four components required to construct a regression tree:

1. A set of questions of the form: does xi belong to the set A? The answer to such questions induces a
split of the predictor space into cases associated with A and those with the complement of A. The sub-samples form the nodes.
2. A goodness-of-split criterion Φ(s, t) that can be evaluated at any split s at any node t.
3. A means of determining the appropriate size of the tree.
4. Statistical summaries at terminal nodes of the tree, for example node averages and frequency distributions.

Once a tree has been produced, it should be validated to give an estimate of the accuracy of its classifications. The same data that is used to construct the tree cannot be used to test the classifications, as the estimate will be over-optimistic. This is overcome by splitting the data into two sets, A and B. The cases in A must be independent and identically distributed to the cases in B. This has the drawback that it reduces the sample size used in the construction of the tree. Set A can be used to build the tree (training set) and set B to test the robustness and validity by forcing the data through the tree (testing set). For smaller samples, there is a technique called V-fold cross-validation. For larger samples, standard statistical methods can be used to compare the values of the training set and test set at any node using significance tests.

3.4. Artificial neural networks

Artificial neural networks are parallel computing devices consisting of many interconnected simple processors. In essence, they are attempting to mimic the behaviour of the human brain. Although each processor is quite simplistic, a collection of these units (a network) gives rise to a powerful computational tool. Each processor in the network is only aware of signals it periodically receives and the signal it periodically sends to other processors, and yet such simple local processors are capable of performing complex tasks when placed together in a large network of orchestrated cooperation.

Artificial neural networks have their roots in work performed in the early part of the 20th century [13], but only during the 1990s, after the breaking of some theoretical barriers and the advances in computing power, have these networks been widely accepted as useful tools. A plethora of books and papers have been published on artificial neural networks. This section aims to provide the reader with a general overview of the topic. For a more comprehensive and exhaustive description, the following references are useful starting points: [13–16].

The neural network is the collection of units that are connected in some pattern to allow communication between the units. These units, also referred to as neurons or nodes, are simple processors whose computing ability is restricted to a rule for combining input signals and an activation rule that takes the combined input to calculate an output signal. Output signals may be sent to other units along connections known as weights. The weights usually excite or inhibit the signal that is being communicated. The net input of weighted signals received by a unit j is given by

netj = w0 + Σi=1..n wi xi

where w0 is the biasing signal, wi the weight on input connection ij, xi the magnitude of signal on input connection ij and n is the number of input connections to unit j. An illustrative schematic of a single network unit with three incoming signals is shown in Fig. 1.
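The unit computation just described can be sketched directly. The logistic sigmoid used here as the activation rule is one common choice; the paper does not fix a particular activation function, and the weights below are illustrative.

```python
import math

def unit_output(weights, inputs, bias):
    """Compute one unit's output: net_j = w0 + sum_i wi*xi, then activation."""
    net = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-net))   # logistic sigmoid activation rule

# A single unit with three incoming signals, as in the schematic of Fig. 1.
y = unit_output(weights=[0.5, -0.3, 0.8], inputs=[1.0, 2.0, 0.5], bias=0.1)
print(round(y, 4))
```

Here the net input is 0.1 + 0.5 − 0.6 + 0.4 = 0.4, which the sigmoid maps into (0, 1).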
Table 1
Dataset descriptions
Dataset | Description
A | An intensive care dataset containing routinely collected ICU data within the UK. Information on a number of socio-economic and medical variables. Primary interest in predicting LoS and outcome.
B | Routinely collected data from a hospital patient management system for predicting LoS on the ward.
C | A comprehensive maternity dataset collected as part of a commissioned study on predicting complications at birth. Contains a number of socio-economic and medical variables.
D | A large diabetes dataset, containing information on various diabetic complications collected for over 30 years from a leading unit in the UK.
Table 2
Summary of classification studies

Study | Dataset | Dependent variable | Nature | Number of records | Variance/deviance | Number of variables
1 | A | LoS in ICU (days) | Cts | 582 | 11.8 (mean 2.4) | 7 (2 Cts, 5 Cat)
2 | A | Outcome (death or survival) | Cat | 582 | 0.13 (13%) | 7 (2 Cts, 5 Cat)
3 | B | LoS in hospital (days) | Cts | 17,974 | 56.6 (mean 4.3) | 5 (2 Cts, 3 Cat)
4 | C | Chance of complicated delivery | Cat | 2,402 | 0.24 (24%) | 16 (8 Cts, 8 Cat)
5 | D | Predicting onset of retinopathy | Cat | 4,056 | 0.33 (33%) | 14 (12 Cts, 2 Cat)

Cts: continuous variable; Cat: categorical variable.
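The split search described in Section 3.3 can be sketched as follows for a continuous dependent variable. The case-weighted average used to combine the child variances is an assumption (the paper does not spell out how child impurities are combined), and the data are invented for illustration.

```python
def variance(ys):
    """Population variance, the purity measure for a continuous dependent."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def best_split(xs, ys):
    """Search every binary split x <= t on one predictor; return the
    (threshold, combined child variance) pair giving the purest children."""
    best = None
    for t in sorted(set(xs))[:-1]:                  # candidate thresholds
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * variance(left)
                 + len(right) * variance(right)) / len(ys)
        if best is None or score < best[1]:
            best = (t, score)
    return best

# Illustrative data: LoS jumps for the older patients.
ages = [30, 35, 40, 62, 70, 80]
los = [2.0, 2.5, 2.2, 8.0, 9.5, 9.0]
print(best_split(ages, los))   # best split is age <= 40
```

Repeating this search over every independent variable and keeping the "best of the best" gives the node-splitting step of the algorithm.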
nique. This is based on the amount of time required to understand the algorithm and prepare the data, the amount of tuning necessary and the time required to produce correct results.

5. Results of the comparison study

Initial time was spent randomly splitting each of the four datasets into training and testing datasets (with approximately half the total number of observations in each). The same training and testing datasets were used for each of the classification techniques in each study. The discriminant and regression models were built using SPSS (version 10). The CART trees were constructed using SPSS AnswerTree (version 2.1). NeuroSolutions (version 3) was used to build, train and test the neural networks. For each study, the models were trained and then tested. Their accuracy and run-time were recorded.

Table 3 summarises how each of the classification techniques performed in each of the five studies. Recall that with a categorical dependent variable, the percentage of correctly classified cases is presented (rescaled as a number between 0 and 1). With a continuous dependent variable, the value of the correlation coefficient (r) is given. Ninety-five percent confidence intervals (+/−) are provided, calculated using the Binomial distribution for categorical studies and Fisher's Z transformation applied to the correlation coefficients for continuous studies. The time taken to test the models is shown in seconds. The Appendix contains more detailed information with illustrative results for each of the four techniques.

Table 3
Performance of the different classification techniques

Study | Discriminant | Regression | CART | Neural Nets

5.1. Discriminant analysis

Discriminant analysis is used to classify cases into the values of a categorical dependent. A large amount of set-up time was therefore required to split the continuous dependent variables into the necessary groups. For this purpose, the groups were derived using percentile splits, with a total of ten groups constructed.

DA consistently gave the lowest accuracy and took a longer time to run than both the regression and CART approaches, although it was faster than the neural network tool. DA's poor performance may be explained by its linear structure, which seems unable to tune itself to the structure of the datasets. In contrast, for example, regression models may be tuned to the dataset through the use of interaction terms. In DA, the assumptions of a linear structure appear too restrictive. DA performed particularly poorly for predicting LoS (studies 1 and 3). The LoS data has high skew and kurtosis, which appears to severely disrupt the performance of the discriminants.

The resulting discriminant functions and their coefficients were not always easy to interpret and often shed little light on the structure of the data. A survey of health care professionals revealed that they often struggled to make clinical sense of the discriminant functions. This is a further downfall of the technique.

5.2. Regression models

Models with both additive and multiplicative error terms were examined and the best accuracy recorded. When the number of variables is large, and thus running a full-factorial model is unfeasible, regression models ideally require that the user has at minimum a general understanding of the variables and may therefore evaluate suitable interaction terms. As a consequence, a large amount of time should be afforded to the consideration of appropriate model terms. Various models may then be built and evaluated.

The accuracy was typically comparable with that of the CART and neural network approaches. It tended to do better when the number of records (data points) was large, and it actually produced the best correlation coefficient, r, for predicting hospital LoS (0.62). For studies 1 and 3 (continuous dependent variables), the best results were obtained after a logarithmic transformation was applied to the dependent, although a number of other transformations were also evaluated.

Interaction terms were carefully selected, a process greatly aided by the results from CART. Tree-based structures graphically illustrate the variables that cause the tree to split and branch, with nested variables (parent and resulting child branches further down the tree) providing a good initial indication of potential interaction terms to include in the regression model. Further detail of the usefulness of CART in the appreciation of possible interaction terms may be found in the Appendix. Once the desired model had been chosen, the actual run-time was the quickest of all of the techniques. Even with 17,974 records (study 3) it took only 5 s to run.

Regression performs relatively poorly compared to CART and neural networks when there is a high number of categorical independent variables (studies 1, 2 and 4). This is consistent with what is known about the properties of the discriminant and logistic regression algorithms; high skew (>1) and kurtosis (>7), along with the presence of binary/categorical variables, disrupt the performance of these algorithms [5]. Such conditions are well suited to symbolic learning algorithms, as observed by the performances of CART and neural networks (study 4).

When considering ease of use and interpretability, regression models are fairly straightforward to interpret and understand, although there is a large set-up time and a need to run a number of different models, accounting for interaction terms and transformations to the dependent variable as necessary.

5.3. CART

The performance of CART was consistently as good as, or better than, the regression and neural network models, and it was always more accurate than discriminant analysis. It appeared to perform well over all datasets, indicating that the number of records or number of variables does not affect its relative accuracy compared to the other techniques. It performed particularly well on datasets with high skew and kurtosis, i.e. those furthest from the (multivariate) normal.

Symbolic learning tools, like CART, are generally non-parametric; that is, they do not make any assumptions about the underlying distributions. This is why we observe a consistently good performance relative to the other approaches. Although run time is a potential concern for tree-based methods, the observed times were low even for the larger datasets and those with a high proportion of categorical explanatory variables.

The CART output is simple and straightforward to interpret. Surveyed health care professionals particularly favoured the clear pictorial way in which the tree was constructed and captured on screen. It was found that this technique was the easiest to interpret clinically, allowing the user to discover which variables were of importance, their relative importance, and the point at which to split on each variable. Such a method can challenge perceived beliefs or reinforce existing clinical judgements. A great practical advantage of this tool is the ability to combine expert clinical/managerial knowledge with the power of statistical analysis. From any node in the tree, it is possible for the user to split the node on a user-selected explanatory variable. The CART algorithm may then be continued from the created child nodes. The combination of CART and local expert clinical knowledge proved to be a powerful tool during discussions with, and use of the tool by, the surveyed medical staff.

5.4. Neural networks

The performance of the neural net varied throughout the study depending on the features of the dataset, although overall it performed well. A large amount of initial effort was required to train and validate the models. Run-time was by far the slowest of the studied classification tools because of the large number of parameters, with the neural net taking over 10 min to train on the largest dataset (study 3). Improvement in the speed of neural networks is a large research area (for example, see [18]).
A major drawback of neural networks is that they are difficult to set up to produce good results. To run backpropagation properly, a number of parameters needed to be adjusted; for example, an important decision concerns the number of layers and the initial step-size and momentum rates of the backpropagation network. Small changes in these parameters can decrease the performance substantially. After running a number of networks and gaining an appreciation of the tool, insight was gained into choosing suitable starting conditions. These appeared to indicate an initial step-size of 0.7, a momentum rate of 0.5 and a network with two hidden layers, although there is no guarantee that the results are the best that could be achieved. An automatic method of parameter selection for backpropagation is another important current research topic.

Accuracy was restricted (especially when comparing the results with those obtained from the CART and regression models) when the network was handling datasets with high variance or deviance (studies 3 and 5). The tool was equally or more accurate than all of the other tools in studies 1 and 2, with the smallest dataset (582 records).

As with regression and DA, neural networks are not always easy to interpret. Coupled with the large set-up time, long run-times and multitude of possible network configurations, this tool is perhaps best suited to an experienced user. Extreme care should be exercised if intending to give the tool to a health care professional with limited statistical and neural network knowledge.

6. Summary of findings and general conclusions

In order to capture the uncertainty and variability amongst the patient population, a number of classification techniques have been considered and evaluated for their relative performances and practical usefulness in predicting various health care indicators using a number of real-life datasets. The selected algorithms have been chosen to cover a range of statistical methods for medical decision making, spanning both commonly used traditional approaches (regression analysis) and more recent and less widely used techniques (neural networks).

This research has indicated that in practice there is no single best classification tool; instead, the best technique will depend on the features of the dataset to be analysed and any preferences of the end-users. The research has made a start in investigating what these features are, with particular emphasis on health care data. A summary of the main findings is as follows:

• Overall the results were promising and each tool made a statistically significant contribution in each study (values of r and the percentage correctly classified were all significant at the 95% level), although in study 1 the variability explained by the models was considerably less than in the other studies.
• Regression models consistently had the fastest run-times, although the difference in times compared to CART and DA is likely to be insignificant in practice. Neural networks require significantly more time to train and validate models.
• The CART, regression and neural network classification approaches gave similar accuracies, although CART was the only tool to give consistently good results. DA performed poorly throughout the study.
• CART was well suited to datasets with large skew (>1) and kurtosis (>7) and where there was a large proportion of categorical independent variables. CART makes no assumption about the underlying distribution, which is why it performed consistently well. In contrast, these conditions limit the performance of the discriminant and regression models, for which such data are furthest from the (multivariate) normal.
• Neural networks produced the best accuracy when dealing with the smaller datasets, but performed slightly disappointingly when handling dependent variables with high levels of variability or deviance.
• If ease of use and human understanding are a high priority, symbolic algorithms, such as CART, should be chosen.
• A number of health care professionals were surveyed on the ease of use and interpretability of the four techniques. The main concerns focussed on the form of the DA discriminant function and on the associated coefficients and weights of the regression and neural network models, respectively. Typically these were seen to be difficult to interpret and often shed little light on the structure of the data.
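The skew (>1) and kurtosis (>7) thresholds cited in these findings can be checked for any dataset with ordinary moment estimates. In this sketch the toy LoS sample is invented, and since the paper does not state whether Pearson kurtosis (normal = 3) or excess kurtosis (normal = 0) is meant, both are reported.

```python
def moments(xs):
    """Moment-based skewness and (Pearson) kurtosis of a sample."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2       # Pearson convention; subtract 3 for excess
    return skew, kurtosis

# A heavily right-skewed toy LoS sample: most stays short, one very long.
los = [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 30]
s, k = moments(los)
print(f"skew={s:.2f}, kurtosis={k:.2f}, excess={k - 3:.2f}")
```

This sample exceeds both thresholds, placing it in the regime where, per the findings above, CART would be expected to outperform the discriminant and regression models.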
Appendix A (Continued)
The following additional and interaction terms were defined:

Variable | Interaction
AGE2 | AGE × AGE
F1 | AGE × I1
F2 | AGE × X1
F3 | AGE × S1
F4 | I1 × S1
F5 | I1 × X1
F6 | I1 × S1
F7 | AGE × I1 × X1

The results from CART gave a helpful insight into possible interaction terms. Age, intent (I1) and status (S1) were important explanatory variables. Only a three-way interaction term (excluding sex) was used in the model. The results from the regression analysis are consistent with the results from CART.

Model summary.

Model | R | R2 | Adjusted R2 | Standard error of the estimate
1 | 0.593a | 0.352 | 0.352 | 0.7705
2 | 0.628b | 0.394 | 0.394 | 0.7449
3 | 0.630c | 0.396 | 0.396 | 0.7439
4 | 0.630d | 0.397 | 0.397 | 0.7436

a Predictors: (constant), F1.
b Predictors: (constant), F1, F4.
c Predictors: (constant), F1, F4, I1.
d Predictors: (constant), F1, F4, I1, F3.
LOGLOS = (0.3940 × I1) + (0.0084 × AGE × I1) + (0.0039 × AGE × S1) + (0.2270 × I1 × S1).

Correlations.

 | LOS | LOSPRED
LOS, Pearson | 1.000 | 0.620a
LOS, Sig. (two-tailed) | – | 0.000
LOS, N | 8987 | 8987

a Correlation is significant at the 0.01 level (two-tailed).

A.3. Regression trees (CART)

…each patient. An algorithm is used to split the original dataset into sub-populations of increasing purity (decreasing variance or deviance). CART is demonstrated using study 4. Given a number of explanatory variables, we wish to predict the probability of a pregnant woman having a complicated delivery. Complications include the need to induce the baby, caesarean section and stillbirth. A successful classification could help to flag women at high risk, for whom we might offer more dedicated care throughout labour and during delivery.

There were 2402 records (1201 in each of the training and testing datasets) and 16 variables of interest (8 continuous and 8 categorical).

Classification.

 | Predicted membership: Y | N | Total
Original count, Y | 814 | 81 | 895
Original count, N | 173 | 123 | 296
Original %, Y | 90.9 | 9.1 | 100.0
Original %, N | 58.8 | 41.2 | 100.0

Overall classification rate of 78.0%.

A.4. Artificial neural networks

NeuroSolutions (version 3) was used to train, cross-validate and test each network. NeuroSolutions makes use of Windows™ 'point & click' technology to make the software easy to use, enabling non-experienced users to begin building and testing models with little training and in minimal time. Caution, however, should be exercised when selecting suitable network initial conditions, such as step-sizes and numbers of hidden layers. Only with experience of using the software is it possible to get a feel for appropriate model configurations.
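The step-size and momentum settings discussed in this appendix enter the usual backpropagation weight update, Δw(t) = −η·g(t) + α·Δw(t−1). The sketch below applies that update rule with the values η = 0.7 and α = 0.5 reported in the study; the gradient sequence is invented, and NeuroSolutions' internals are not reproduced here.

```python
def momentum_update(w, grad, prev_delta, step=0.7, momentum=0.5):
    """One backpropagation weight update: new delta is the negative
    gradient scaled by the step-size plus a fraction (the momentum rate)
    of the previous delta."""
    delta = -step * grad + momentum * prev_delta
    return w + delta, delta

w, prev = 0.2, 0.0
for grad in [0.4, 0.1, -0.2]:   # illustrative gradients over three epochs
    w, prev = momentum_update(w, grad, prev)
print(round(w, 4))
```

The momentum term smooths successive updates, which is one reason small changes to these two rates can alter performance so markedly.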
We illustrate the neural network approach in predicting ICU LoS (study 1). Routinely collected variables of interest include the patient's age, sex, outcome, source (A/E, HDU, Theatre or Ward), admission status (elective or emergency) and hospital speciality (ENT, general surgery, medicine, orthopaedics, thoracic medicine, trauma or vascular surgery). Categorical variables with multiple classes (g) were re-assigned to (g − 1) binary groups (0–1).

The dataset contained 582 records. Half was used to train the network and half to test the model. Ten percent of the training data was used for cross-validation. Missing values were removed as required by the software. The best accuracy was achieved using a multilayer perceptron network with one hidden layer, a supervised learning control with 1000 epochs (iterations over the training set), a step-size of 0.7 and a momentum rate of 0.5.
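The (g − 1) re-coding described above is standard dummy (indicator) coding: a categorical variable with g classes becomes g − 1 binary columns, with one class acting as the reference level. A sketch using the ICU source variable; the choice of the first class as reference is an assumption.

```python
def dummy_code(value, classes):
    """Map one categorical value with g classes to g - 1 binary indicators."""
    reference, *rest = classes           # first class is the reference level
    return [1 if value == c else 0 for c in rest]

sources = ["A/E", "HDU", "Theatre", "Ward"]   # g = 4 classes
print(dummy_code("Theatre", sources))  # -> [0, 1, 0]
print(dummy_code("A/E", sources))      # -> [0, 0, 0] (reference class)
```

Using g − 1 rather than g indicators avoids feeding the network a redundant, perfectly collinear input.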
MSE 11.6452
NMSE 0.9286
MAE 1.8410
Min abs error 0.0152
Max abs error 23.6978
r 0.3280
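The error measures listed above can be recomputed from paired observed and predicted values. In this sketch, NMSE is taken as MSE divided by the variance of the observed values (a common convention; the paper does not define it), r is the Pearson correlation, and the data are invented.

```python
import math

def metrics(obs, pred):
    """MSE, NMSE, MAE, min/max absolute error and Pearson r for paired data."""
    n = len(obs)
    errs = [p - o for o, p in zip(obs, pred)]
    mse = sum(e * e for e in errs) / n
    mae = sum(abs(e) for e in errs) / n
    mo, mp = sum(obs) / n, sum(pred) / n
    var_o = sum((o - mo) ** 2 for o in obs) / n
    sd_p = math.sqrt(sum((p - mp) ** 2 for p in pred) / n)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred)) / n
    return {"MSE": mse,
            "NMSE": mse / var_o,          # MSE normalised by target variance
            "MAE": mae,
            "Min abs error": min(abs(e) for e in errs),
            "Max abs error": max(abs(e) for e in errs),
            "r": cov / (math.sqrt(var_o) * sd_p)}

obs = [2.0, 3.0, 5.0, 10.0]
pred = [2.5, 2.5, 6.0, 8.0]
print(metrics(obs, pred))
```

An NMSE near 1 (as reported for this network) indicates the model explains little more variance than simply predicting the mean.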