
Public Health

Public reporting of surgeon outcomes: low numbers of
procedures lead to false complacency
Kate Walker, Jenny Neuburger, Oliver Groene, David A Cromwell, Jan van der Meulen

The English National Health Service published outcome information for individual surgeons for ten specialties in
June, 2013. We looked at whether individual surgeons do sufficient numbers of procedures to be able to reliably
identify those with poor performance. For some specialties, the number of procedures that a surgeon does each year
is low and, as a result, the chance of identifying a surgeon with increased mortality rates is also low. Therefore, public
reporting of individual surgeons’ outcomes could lead to false complacency. We recommend use of outcomes that are
fairly frequent, considering the hospital as the unit of reporting when numbers are low, and avoiding interpretation
of no evidence of poor performance as evidence of acceptable performance.

From the summer of 2013, outcomes of some surgical
procedures will be reported for individual surgeons as
part of the English National Health Service (NHS)
Commissioning Board’s new policy.1 This policy follows
the example of the Society for Cardiothoracic Surgery in
Great Britain and Ireland (SCTS)2 and several US states
(eg, New York3), which report mortality for adult cardiac
procedures by surgeon. The aim is to allow patients to
choose their surgeon, and to allow clinicians to improve outcomes of care. However, when overall numbers of
specific procedures are low, correct identification of a
surgeon with poor performance is challenging, even if
mortality is high.4 The danger is that low numbers mask
poor performance and lead to false complacency.
We examine this issue in relation to reporting of
surgical mortality for individual surgeons for adult
cardiac surgery, plus key procedures in three
other specialties: oesophagectomy or gastrectomy for
oesophagogastric cancer; bowel cancer resection; and
hip fracture surgery. We address three questions.
First, what number of procedures is necessary for
reliable detection of poor performance? Second, how
many surgeons in each specialty actually do this number
of procedures in a period of 1, 3, or 5 years? Third, what
is the probability that a surgeon identified as a statistical
outlier has truly poor performance? Finally, we offer
recommendations about how surgeon performance can
be assessed in a meaningful way. We used postoperative
mortality as an example to address these questions,
because it is the outcome that will be reported for
English surgeons this summer.

Number of procedures
The number of adult cardiac surgeries done in NHS
hospitals is fairly high: 50% of cardiac surgeons do
between 60 and 170 per year.2 Many other procedures are
done less frequently, which means that statistical power is
poor and that poorly performing surgeons are unlikely to
be correctly identified. In this context, statistical power is
the probability that a surgeon with poor performance will
be detected as a statistical outlier—ie, as significantly
worse than average. For example, 80% power means that,

of ten surgeons who are truly performing poorly, on
average eight would be identified.
Bowel cancer resection illustrates the issue of low
numbers. Postoperative mortality is about 5% (table 1).8
Therefore, if 20 operations were done in a year—a high
number for this procedure—only one patient would be
expected to die after surgery. With low numbers, the play
of chance (ie, the role of uncontrollable factors) might be
greater than the effect of a surgeon’s performance on the
number of deaths. Conversely, if a surgeon’s performance
was average, the chance that more than one patient of
20 would die after surgery can be calculated as about 25%
with basic statistical principles.
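The "about 25%" figure follows directly from the binomial distribution; a minimal sketch in Python (our illustration, not code from the original analysis):

```python
import math

def binom_pmf(k, n, p):
    """Probability of exactly k deaths among n procedures at mortality rate p."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Chance that more than one of 20 patients dies when the surgeon's
# true mortality rate equals the national average of 5%
p_more_than_one = 1 - binom_pmf(0, 20, 0.05) - binom_pmf(1, 20, 0.05)  # ≈ 0.26
```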
Statistical power is determined by the expected number
of deaths, which is a combination of numbers of the
procedures and the mortality (panel 1). To calculate how
many procedures would be necessary to achieve different
statistical power thresholds, we used the national overall
mortality rate for adult cardiac surgery, bowel cancer
resection, oesophagectomy or gastrectomy, and hip
fracture surgery to calculate the expected numbers of
deaths,6,7,9 and deemed a doubling of these rates as poor
performance (panel 1). The numbers necessary for each
power threshold exceed the annual number of procedures
typically done by surgeons in English NHS hospitals
(table 1). The differences are particularly large for bowel
cancer surgery and oesophagectomy or gastrectomy: the
annual median number of bowel cancer surgeries is
roughly a tenth of that necessary for 60% power, and the
median number of oesophagectomy or gastrectomy
procedures is about a tenth of that needed for 70% power
(table 1). Hip fracture surgery has the highest mortality,
and therefore the same power is achieved for fewer
operations than are necessary for other procedures
(table 1).
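The power calculation described here (and in panel 1) can be sketched as follows. This is our illustrative implementation, assuming a one-sided test at the 2.5% level (one tail of the 5% two-sided threshold) and the normal approximation to the binomial; the exact numbers in table 1 may differ because the authors' precise method is not fully specified here.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power_to_detect_doubling(n, p0, z_crit=1.96):
    """Power to flag a surgeon whose true mortality is 2*p0, given
    n procedures tested against the national rate p0 at a 5% two-sided
    significance level (normal approximation to the binomial)."""
    p1 = 2 * p0
    # Smallest death count that would be flagged under the null (rate p0)
    critical = n * p0 + z_crit * math.sqrt(n * p0 * (1 - p0))
    # Probability of exceeding that count when the true rate is doubled
    z = (n * p1 - critical) / math.sqrt(n * p1 * (1 - p1))
    return normal_cdf(z)

def procedures_needed(p0, target_power):
    """Smallest number of procedures giving at least target_power."""
    n = 1
    while power_to_detect_doubling(n, p0) < target_power:
        n += 1
    return n
```

Under this approximation, a higher baseline mortality needs fewer procedures for the same power, which is why hip fracture surgery fares best in table 1.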

Published Online July 5, 2013

Department of Health Services Research and Policy, London School of Hygiene and Tropical Medicine, London, UK (K Walker PhD, J Neuburger PhD, O Groene PhD, D A Cromwell PhD, Prof J van der Meulen PhD)

Correspondence to: Dr Jenny Neuburger, Department of Health Services Research and Policy, London School of Hygiene and Tropical Medicine, London WC1H 9SH, UK

Proportion of surgeons who do the necessary
number of procedures
We estimated the proportion of surgeons who do a
sufficient number of procedures to achieve 60%, 70%,
and 80% power to detect poor performance (table 2).2,5
These proportions are calculated for reporting periods
of 1, 3, and 5 years, assuming that the overall rate of mortality remains constant with time. The SCTS reports surgeon-level mortality with 3 years of data.2 Its data show that about three-quarters of surgeons do sufficient numbers of cardiac operations to achieve 60% statistical power (table 2). The proportion of surgeons who do sufficient numbers of procedures to identify poor performance is much lower for procedures other than cardiac and hip fracture surgery (table 2).

Gains in statistical power can clearly be achieved by extension of the period during which data are obtained: as length of time increases, so does the proportion of surgeons who do the necessary number of procedures (table 2). However, pooling of data from long periods will adversely affect the timeliness of reported data. It assumes that individual surgeon skills and practice patterns are largely stable, which might not be the case. Moreover, such pooling could mask a recent deterioration in a surgeon's performance. It also introduces challenges related to the reporting of outcomes of junior surgeons, retired surgeons, and surgeons who are temporarily appointed overseas. We recommend pooling of data over a period of time to increase power, but recognise that a balance needs to be struck between statistical power and timeliness (panel 2).

Table 1: Mortality after four surgical procedures, the number of procedures that occur annually, and how many would be necessary to detect poor performance with different statistical powers. Rows: hip fracture surgery†, oesophagectomy or gastrectomy‡, bowel cancer resection§, and cardiac surgery¶; columns: mortality (%), median annual number of procedures*, and the number of procedures necessary to detect poor performance at 60%, 70%, and 80% power.

5% significance level. Poor performance defined as double the national overall mortality rate. *On the basis of hospital episode statistics5 for the 3-year period from April, 2009, to March, 2012 (except for cardiac surgery, for which reported numbers2 are used). †30-day mortality (March 1, 2010–Feb 28, 2011).6 ‡90-day mortality (Oct 1, 2007–June 30, 2009).7 §90-day mortality (Aug 1, 2010–July 31, 2011).8 ¶In-hospital mortality (April 1, 2008–March 31, 2011).9

Panel 1: Calculation of statistical power
We calculate statistical power with four numbers (under the assumption that underlying statistical distributions can be approximated with the normal distribution):
1 National overall mortality rate
2 Mortality rate at which performance is deemed to be poor
3 Statistical threshold used to test whether an individual surgeon's rate is consistent with the national overall mortality rate
4 Number of procedures done by the surgeon
In this report, we define poor performance as double the national overall mortality rate. This definition is arbitrary, but would in practice represent a fairly large difference in performance. To detect smaller differences would necessitate larger numbers to achieve the same levels of statistical power. We use a 5% significance level to calculate the poor performance threshold, which corresponds to the commonly used 95% control limit on funnel plots (figure). In fact, many audits use wider limits, such as 99·8%, or even higher, to detect an outlier. As levels of significance increase, so limits widen, reducing the statistical power to detect poor performance.

Correct identification of poor performance
Not all surgeons identified as statistical outliers will
truly have poor performance. The proportion correctly
identified as having poor performance is known as the
positive predictive value.10 This proportion depends on the significance threshold, how many
procedures a surgeon does, and the prevalence of poor
performance. With standard diagnostic reasoning, it
can be calculated that, if one in 20 cardiac surgeons
truly had poor performance, 63% would be correctly
identified on the basis of the median number of
procedures in 3 years. The equivalent positive predictive
values for the other procedures, with the same assumptions, are 62% for hip fracture surgery, 57% for
oesophagectomy or gastrectomy, and 38% for bowel
cancer resection. Therefore, a large proportion of surgeons identified as outliers would be falsely accused of
poor performance. In reality, the proportion of surgeons with poor performance will probably be lower
than one in 20, with the result that the proportion of outliers that are false accusations would be substantially higher.
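The positive predictive values quoted above follow from Bayes' rule. In this sketch (our illustration), the 80% sensitivity and the 2.5% one-sided false-positive rate are assumed values, not figures taken from the article:

```python
def positive_predictive_value(sensitivity, false_positive_rate, prevalence):
    """Proportion of flagged surgeons who truly perform poorly (Bayes' rule)."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * false_positive_rate
    return true_pos / (true_pos + false_pos)

# One in 20 surgeons truly poor, 80% power, 2.5% one-sided false-positive rate
high = positive_predictive_value(0.80, 0.025, 1 / 20)   # ≈ 0.63
# The same test when poor performance is rarer (one in 100)
low = positive_predictive_value(0.80, 0.025, 1 / 100)   # ≈ 0.24
```

As the paragraph notes, when poor performance is rarer, a far larger share of flagged surgeons are false accusations.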

Improving statistical power
There are options for improvements in statistical power
other than the pooling of data over time, but each
introduces problems of its own. First, data for different
procedures could be pooled. However, when outcomes
differ between procedures, this approach could prevent
fair comparisons of outcomes. We grouped gastrectomy,
which has a mortality of 6·9%, with oesophagectomy,
which has a mortality of 5·7%.7 Additionally, cardiac
procedures were grouped together, combining coronary
bypass surgery with replacement or repairs of cardiac
valves, as is done by SCTS.2 Adjustment for patients’
risk profiles (ie, case-mix adjustment) might not be
sufficient to remove biases due to surgeons who do
varying mixes of procedures. Another difficulty caused
by the pooling of data is that poor surgeon performance
for one specific procedure could be masked.
Second, the control limits used to identify poor
performance could be lowered. We used a 5% significance level, which corresponds to 95% control limits
(figure), and is the lowest commonly used threshold.
However, lowering of the threshold would also lead to
an increased proportion of surgeons falsely identified
as having poor performance.
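The control limits on a funnel plot can be approximated with the binomial standard error around the national rate; a hypothetical sketch, not the audit's actual method:

```python
import math

def funnel_limits(n, overall_rate, z=1.96):
    """Approximate control limits around the national mortality rate for a
    provider with n procedures (z=1.96 gives 95% limits, z≈3.09 gives 99.8%)."""
    se = math.sqrt(overall_rate * (1 - overall_rate) / n)
    lower = max(0.0, overall_rate - z * se)
    upper = min(1.0, overall_rate + z * se)
    return lower, upper
```

For example, at a 5% national rate the upper 95% limit sits near 15% mortality for a surgeon with 20 bowel cancer resections but near 8% for a unit with 200: small caseloads must show very high mortality before they are flagged.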
Third, an alternative outcome measure could be
selected, such as surgical complications or emergency readmission. Although increased numbers of events would
raise statistical power, measurement error due to
incomplete or inconsistent recording would tend to
reduce it. We recommend use of outcome measures that are fairly frequent (panel 2). Additionally, for specialties in which most surgeons still do not do sufficient numbers of procedures to achieve acceptable power, we recommend that reporting should be at the level of the surgical team or hospital, not the surgeon (panel 2).

Table 2: Proportion of surgeons who do sufficient numbers of different procedures every year to identify cases of poor performance. Rows: hip fracture surgery*, oesophagectomy or gastrectomy*, bowel cancer resection*, and cardiac surgery†, each shown for 1-year, 3-year, and 5-year reporting periods.

Poor performance is defined as double the national overall mortality rate, with a 5% significance level. *Data for numbers of procedures come from the hospital episode statistics for the 3-year period from April, 2009, to March, 2012.5 We selected procedures with the appropriate International Classification of Diseases (version 10) diagnosis codes and Office of Population Censuses and Surveys Classification of Interventions and Procedures-4.4 procedure codes. Procedures were allocated to a consultant if they were contracted under a relevant specialty. †Data used from the Society for Cardiothoracic Surgery in Great Britain and Ireland published data.2

Implications of the new policy
Reporting of outcomes for individual surgeons for
cardiac surgery in the UK has largely been viewed as a
success.11,12 As we have shown, numbers of cardiac
surgeries are sufficient to allow the process of detection
to operate with reasonable statistical power. However, we
believe that consultant-level reporting could be far less
effective for other specialties. The concern about false
identification of poor performance has received much
attention in view of the stigma attached to poor performance.13 The potential collateral damage of a false accusation could include reputational damage, increases in
indemnity insurance premiums, or even prosecution.
Public reporting could also affect surgeon behaviour,
leading to selection of low-risk patients for surgery.14
Inaccurate estimates of surgeon performance could
also cause unnecessary alarm in patients. A mortality
estimate of 10% for a surgeon could worry patients, even
if the estimate is based on such small numbers that no
statistical evidence of poor performance is available. One option to overcome this issue would be to use hierarchical modelling techniques that would shrink the surgeons' mortality estimates, especially when based on small numbers, towards the overall mortality.15 However, these hierarchical models would not overcome the problem of low statistical power.

Panel 2: Recommendations for public reporting of surgeon outcomes
Measurement of outcomes
• Pool data over time when annual numbers are low, but also consider timeliness of data
• Select outcome measures for which the outcome event is fairly frequent
• For specialties in which most surgeons do not achieve 60% power, the unit of reporting should be the team, hospital, or trust
Presentation of results
• Report surgeon outcomes with appropriate statistical techniques, such as funnel plots
• Avoid interpreting no evidence of poor performance as evidence of acceptable performance
• Report surgeon outcomes with appropriate health warnings, such as guidance on interpreting outcomes based on low numbers and on data quality issues
• Report surgeon outcomes alongside unit or hospital outcomes to guide interpretation

Figure: Funnel plot showing risk-adjusted 90-day mortality after bowel cancer resection in different trusts. The plot shows adjusted 90-day mortality (%) against the number of procedures in each trust, with the median mortality and the 95% and 99·8% control limits. For reporting for individual surgeons, one dot would represent one surgeon and numbers of procedures would be much lower than they are here. Trusts falling above the control limits are deemed to be outliers. In our estimates, we have used 95% control limits. Reproduced from the Health and Social Care Information Centre's national bowel cancer audit,8 by permission of the Health and Social Care Information Centre.
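The hierarchical shrinkage mentioned above can be illustrated with a beta-binomial posterior mean. This is our sketch of the general idea, and the prior strength of 50 is an arbitrary illustrative value, not one from the article:

```python
def shrunken_mortality(deaths, n, overall_rate, prior_strength=50):
    """Shrink a surgeon's observed mortality towards the national rate.
    prior_strength acts as an effective prior sample size: the fewer
    procedures a surgeon has done, the more the estimate is pulled back."""
    alpha = overall_rate * prior_strength
    beta = (1 - overall_rate) * prior_strength
    return (deaths + alpha) / (n + alpha + beta)

# 2 deaths in 20 procedures: raw rate 10%, but the shrunken estimate
# stays much closer to the 5% national rate
estimate = shrunken_mortality(2, 20, 0.05)
```

With more procedures, the data dominate the prior and the estimate moves towards the raw rate, which is why shrinkage tempers alarming estimates based on small numbers without curing low power.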
A second implication has received much less
attention.16 With low numbers of procedures, an unintended result of reporting for individual surgeons
could be false complacency. For most surgeons, power
will be insufficient to detect poor performance, and this
absence of evidence could be falsely interpreted as
evidence of acceptable performance. Therefore, rather
than stimulating quality improvement and early
responses to local concerns about quality of care,
publicly reported figures that identify no problems
could lead to inaction.



Our analyses draw attention to the need for great care
in presentation of estimates for individual surgeons.
Analyses should be presented in such a way as to avoid
false complacency, false accusation, and unnecessary
alarm to patients (panel 2).

Wider issues
Several wider issues have been raised previously about
the reporting of surgeon outcomes, mainly related to
adequate adjustment for patient case mix, the accuracy
with which the responsible surgeon can be identified,
and the shared responsibility for the care of patients
within teams.17 Operative mortality, including unavoidable deaths, might not be a good proxy for preventable
mortality. Of particular relevance is the mean proportion
of deaths that can be prevented: if this proportion is low,
mortality is a poor test to predict avoidable mortality.18
Additionally, mortality is not the only outcome that
concerns patients; others include avoidance of serious
complications, being treated with dignity, return of
function, and freedom from recurrent symptoms.19
Case-mix adjustment aims to account for differences in
age, disease severity, or other factors in comparisons of
surgeon outcomes. Validated methods for case-mix
adjustment do not exist for all the procedures for which
outcomes have to be reported in 2013. Even when they do
exist, these methods might not fully adjust for case-mix
differences. Therefore, some surgeons treating patients
at high risk could be wrongly identified as an outlier,20
and underperforming surgeons treating low-risk patients
will be less likely to be detected.
Identification of the surgeon responsible for a
procedure is not always straightforward. Some procedures are not allocated to any surgeon, whereas others
are done by more than one surgeon. Inconsistencies
between units in how procedures are allocated to
surgeons could further undermine these comparisons.
A final issue is the appropriate organisational level for
reporting outcomes. Reporting for individual surgeons
ignores the effect of the multidisciplinary team and the
context in which especially complex surgery is done.
Many aspects of care other than the surgeon’s performance will affect the outcome, such as timeliness of
referral and diagnosis, perioperative care, and follow-up
care after discharge. For example, whether complications resulting from surgery lead to a patient's death can depend on the way clinical units monitor patients' vital status and respond to adverse occurrences.21
We recommend that surgeon outcomes are reported
alongside unit outcomes to guide interpretation (panel 2).
Contributors
KW and JN conceived this report. All authors were involved in the
design. KW, JN, and OG collected and analysed data. All authors
interpreted results. KW, JN, and OG wrote the report, with contributions
from DAC and JvdM.
Conflicts of interest
We are all involved in national clinical audits, but report that we have no
conflicts of interest.


Acknowledgments
We thank Rob Wakeman for providing information on surgeon volume
for hip fracture surgery. No specific funding was received for this report.
The salaries of KW, JN, and OG were funded by a grant from the Royal
College of Surgeons of England.
References
1 NHS Commissioning Board. Everyone counts: planning for patients 2013/14. 2012. everyonecounts-planning.pdf (accessed June 28, 2013).
2 Society for Cardiothoracic Surgery in Great Britain and Ireland. UK surgeons' results. default.aspx (accessed June 28, 2013).
3 Hannan EL, Cozzens K, King SB, Walford G, Shah NR. The New York State cardiac registries: history, contributions, limitations, and lessons for future efforts to assess and publicly report healthcare outcomes. J Am Coll Cardiol 2012; 59: 2309–16.
4 Altman DG, Bland JM. Absence of evidence is not evidence of absence. BMJ 1995; 311: 485.
5 Health and Social Care Information Centre. Hospital episode statistics. (accessed June 28, 2013).
6 National Hip Fracture Database. National report 2012. 2012. http://Report%202012.pdf?OpenElement (accessed June 28, 2013).
7 Health and Social Care Information Centre. National oesophago-gastric cancer audit 2012. (accessed June 28, 2013).
8 Health and Social Care Information Centre. National bowel cancer audit 2012. ge=1#top (accessed June 28, 2013).
9 National Institute for Cardiovascular Outcomes Research. Adult cardiac surgery. 2013. Adultcardiacsurgery (accessed June 28, 2013).
10 Altman DG, Bland JM. Diagnostic tests 2: predictive values. BMJ 1994; 309: 102.
11 Bridgewater B, Hickey GL, Cooper G, Deanfield J, Roxburgh J. Publishing cardiac surgery mortality rates: lessons for other specialties. BMJ 2013; 346: f1139.
12 Bridgewater B. Mortality data in adult cardiac surgery for named surgeons: retrospective examination of prospectively collected data on coronary artery surgery and aortic valve replacement. BMJ 2005; 330: 506–10.
13 Lilford R, Mohammed AM, Spiegelhalter D, Thomson R. Use and misuse of process and outcome data in managing performance of acute medical care: avoiding institutional stigma. Lancet 2004; 363: 1147–54.
14 Shahian DM, Edwards FH, Jacobs JP, et al. Public reporting of cardiac surgery performance: part 1—history, rationale, consequences. Ann Thorac Surg 2011; 92: S2–11.
15 Dimick JB, Staiger DO, Birkmeyer JD. Ranking hospitals on surgical mortality: the importance of reliability adjustment. Health Serv Res 2010; 45: 1614–29.
16 Dimick JB, Welch HG, Birkmeyer JD. Surgical mortality as an indicator of hospital quality: the problem with small sample size. JAMA 2004; 292: 847–51.
17 Johal A, Cromwell D, van der Meulen J. Hospital episode statistics and revalidation: creating the evidence to support revalidation. Jan 9, (accessed June 28, 2013).
18 Girling AJ, Hofer TP, Wu J, et al. Case-mix adjusted hospital mortality is a poor proxy for preventable mortality: a modelling study. BMJ Qual Saf 2012; 21: 1052–56.
19 Shahian DM, Normand SL. Autonomy, beneficence, justice, and the limits of provider profiling. J Am Coll Cardiol 2012; 59: 2383–86.
20 Spiegelhalter DJ. Handling over-dispersion of performance indicators. Qual Saf Health Care 2005; 14: 347–51.
21 Ghaferi AA, Birkmeyer JD, Dimick JB. Hospital volume and failure to rescue with high-risk surgery. Med Care 2011; 49: 1076–81.