Public Health

Public reporting of surgeon outcomes: low numbers of procedures lead to false complacency
Kate Walker, Jenny Neuburger, Oliver Groene, David A Cromwell, Jan van der Meulen

The English National Health Service published outcome information for individual surgeons for ten specialties in June, 2013. We looked at whether individual surgeons do sufficient numbers of procedures to be able to reliably identify those with poor performance. For some specialties, the number of procedures that a surgeon does each year is low and, as a result, the chance of identifying a surgeon with increased mortality rates is also low. Therefore, public reporting of individual surgeons’ outcomes could lead to false complacency. We recommend use of outcomes that are fairly frequent, considering the hospital as the unit of reporting when numbers are low, and avoiding interpretation of no evidence of poor performance as evidence of acceptable performance.

Published Online July 5, 2013
S0140-6736(13)61491-9

Department of Health Services Research and Policy, London School of Hygiene and Tropical Medicine, London, UK (K Walker PhD, J Neuburger PhD, O Groene PhD, D A Cromwell PhD, Prof J van der Meulen PhD)

Correspondence to: Dr Jenny Neuburger, Department of Health Services Research and Policy, London School of Hygiene and Tropical Medicine, London WC1H 9SH, UK

From the summer of 2013, outcomes of some surgical procedures will be reported for individual surgeons as part of the English National Health Service (NHS) Commissioning Board’s new policy.1 This policy follows the example of the Society for Cardiothoracic Surgery in Great Britain and Ireland (SCTS)2 and several US states (eg, New York3), which report mortality for adult cardiac procedures by surgeon. The aim is to allow patients to choose their surgeon and clinicians to improve outcomes of care. However, when overall numbers of specific procedures are low, correct identification of a surgeon with poor performance is challenging, even if mortality is high.4 The danger is that low numbers mask poor performance and lead to false complacency. We examine this issue in relation to reporting of surgical mortality for individual surgeons for adult cardiac surgery, plus key procedures in three other specialties: oesophagectomy or gastrectomy for oesophagogastric cancer; bowel cancer resection; and hip fracture surgery. We address three questions. First, what number of procedures is necessary for reliable detection of poor performance? Second, how many surgeons in each specialty actually do this number of procedures in a period of 1, 3, or 5 years? Third, what is the probability that a surgeon identified as a statistical outlier has truly poor performance? Finally, we offer recommendations about how surgeon performance can be assessed in a meaningful way. We used postoperative mortality as an example to address these questions, because it is the outcome that will be reported for English surgeons this summer.

Number of procedures
The number of adult cardiac surgeries done in NHS hospitals is fairly high: 50% of cardiac surgeons do between 60 and 170 per year.2 Many other procedures are done less frequently, which means that statistical power is poor and that poorly performing surgeons are unlikely to be correctly identified. In this context, statistical power is the probability that a surgeon with poor performance will be detected as a statistical outlier (ie, as significantly worse than average). For example, 80% power means that, of ten surgeons who are truly performing poorly, on average eight would be identified.

Bowel cancer resection illustrates the issue of low numbers. Postoperative mortality is about 5% (table 1).8 Therefore, if 20 operations were done in a year (a high number for this procedure), only one patient would be expected to die after surgery. With low numbers, the play of chance (ie, the role of uncontrollable factors) might be greater than the effect of a surgeon’s performance on the number of deaths. Conversely, if a surgeon’s performance was average, the chance that more than one patient of 20 would die after surgery can be calculated as about 25% with basic statistical principles.

Statistical power is determined by the expected number of deaths, which is a combination of the number of procedures and the mortality (panel 1). To calculate how many procedures would be necessary to achieve different statistical power thresholds, we used the national overall mortality rate for adult cardiac surgery, bowel cancer resection, oesophagectomy or gastrectomy, and hip fracture surgery to calculate the expected numbers of deaths,6,7,9 and deemed a doubling of these rates as poor performance (panel 1). The numbers necessary for each power threshold exceed the annual number of procedures typically done by surgeons in English NHS hospitals (table 1). The differences are particularly large for bowel cancer surgery and oesophagectomy or gastrectomy: the annual median number of bowel cancer surgeries is roughly a tenth of that necessary for 60% power, and the median number of oesophagectomy or gastrectomy procedures is about a tenth of that needed for 70% power (table 1). Hip fracture surgery has the highest mortality, and therefore the same power is achieved for fewer operations than are necessary for other procedures (table 1).
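The chance quoted for bowel cancer resection can be checked with a simple binomial calculation. The sketch below is illustrative only: it assumes that each of the 20 operations carries the same 5% risk of death and that outcomes are independent, and the function name prob_more_than is hypothetical rather than taken from the report.

```python
from math import comb


def prob_more_than(k: int, n: int, p: float) -> float:
    """Probability of more than k deaths in n operations, each with risk p."""
    # Sum the binomial probabilities of 0..k deaths, then take the complement.
    p_at_most_k = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))
    return 1 - p_at_most_k


# Assumed inputs from the bowel cancer example: 20 operations, 5% mortality.
expected_deaths = 20 * 0.05
chance = prob_more_than(1, 20, 0.05)

print(f"Expected deaths: {expected_deaths:.1f}")
print(f"Chance of more than one death: {chance:.1%}")
```

Under these assumptions the chance of more than one death is roughly 26%, which is in line with the figure of about 25% given in the text.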

Proportion of surgeons who do the necessary number of procedures
We estimated the proportion of surgeons who do a sufficient number of procedures to achieve 60%, 70%, and 80% power to detect poor performance (table 2).2,5 These proportions are calculated for reporting periods of 1, 3, and 5 years.

We recommend pooling of data over a period of time to increase power, but recognise that a balance needs to be struck between statistical power and timeliness (panel 2).

                                 National         Median     Number of procedures necessary
                                 postoperative    annual     to detect poor performance
Procedure                        mortality (%)    number*    60% power   70% power   80% power
Hip fracture surgery             8·4%†            31         56          75          102
Oesophagectomy or gastrectomy    6·1%‡            11         79          109         148
Bowel cancer resection           5·1%§            9          95          132         179
Cardiac surgery                  2·7%¶            128        192         256         352

5% significance level. Poor performance defined as double the national overall mortality rate. *On the basis of hospital episode statistics5 for the 3-year period from April, 2009, to March, 2012 (except for cardiac surgery, for which reported numbers2 are used). †30-day mortality (March 1, 2010–Feb 28, 2011).6 ‡90-day mortality (Oct 1, 2007–June 30, 2009).7 §90-day mortality (Aug 1, 2010–July 31, 2011).8 ¶In-hospital mortality (April 1, 2008–March 31, 2011).9

Table 1: Mortality after four surgical procedures, the number of procedures that occur annually, and how many would be necessary to detect poor performance with different statistical powers

Correct identification of poor performance
Not all surgeons identified as statistical outliers will truly have poor performance. The proportion correctly identified as having poor performance is known as the positive predictive value.10 The number correctly identified depends on the significance threshold, how many procedures a surgeon does, and the prevalence of poor performance. With standard diagnostic reasoning, it can be calculated that, if one in 20 cardiac surgeons truly had poor performance, 63% of those flagged as outliers would be correctly identified on the basis of the median number of procedures in 3 years. The equivalent positive predictive values for the other procedures, with the same assumptions, are 62% for hip fracture surgery, 57% for oesophagectomy or gastrectomy, and 38% for bowel cancer resection. Therefore, a large proportion of surgeons identified as outliers would be falsely accused of poor performance. In reality, the proportion of surgeons with poor performance will probably be lower than one in 20, with the result that the proportion of outliers that are false accusations would be substantially higher.
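The positive predictive value described in the text can be sketched with the usual diagnostic-test formula. The inputs below are assumptions for illustration and are not stated in the report: a power of 80%, and a false positive rate of 2·5% (the upper tail of a two-sided 5% test, since only surgeons above the limit are flagged). The function name is hypothetical.

```python
def positive_predictive_value(power: float, prevalence: float,
                              false_positive_rate: float) -> float:
    """Share of flagged surgeons who truly have poor performance."""
    true_positives = power * prevalence
    false_positives = false_positive_rate * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


# Assumed inputs: 80% power and a 2.5% chance of flagging an average surgeon.
ppv_1_in_20 = positive_predictive_value(0.80, 1 / 20, 0.025)
ppv_1_in_50 = positive_predictive_value(0.80, 1 / 50, 0.025)

print(f"Poor performance in 1 of 20 surgeons: PPV = {ppv_1_in_20:.0%}")
print(f"Poor performance in 1 of 50 surgeons: PPV = {ppv_1_in_50:.0%}")
```

With these assumed inputs, the value for a prevalence of one in 20 is close to the 63% quoted for cardiac surgery, and it falls sharply when poor performance is rarer, which is the point made at the end of the paragraph.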

Panel 1: Calculation of statistical power
We calculate statistical power with four numbers (under the assumption that underlying statistical distributions can be approximated with the normal distribution):
1 National overall mortality
2 Mortality rate at which performance is deemed to be poor
3 Statistical threshold used to test whether an individual surgeon’s rate is consistent with the national overall mortality rate
4 Number of procedures done by the surgeon

In this report, we define poor performance as double the national overall mortality rate. This definition is arbitrary, but would in practice represent a fairly large difference in performance. To detect smaller differences would necessitate larger numbers to achieve the same levels of statistical power.

We use a 5% significance level to calculate the poor performance threshold, which corresponds to the commonly used 95% control limit on funnel plots (figure). In fact, many audits use wider limits, such as 99·8%, or even higher, to detect an outlier. As the control limits become more stringent, they widen, which reduces the statistical power to detect poor performance.
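The four inputs listed in panel 1 can be combined in a short calculation. The sketch below is one possible reading, not the authors' own formula, which the report does not give: it assumes a two-sided test with a normal approximation to the binomial distribution, and the function names power_to_detect and procedures_needed are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_to_detect(n: int, national_rate: float,
                    poor_ratio: float = 2.0, alpha: float = 0.05) -> float:
    """Chance that a surgeon with poor performance lies above the control limit."""
    p0 = national_rate
    p1 = poor_ratio * national_rate  # rate at which performance is deemed poor
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Upper control limit around the national rate for n procedures.
    limit = p0 + z_alpha * sqrt(p0 * (1 - p0) / n)
    # Probability that an observed rate centred on p1 exceeds that limit.
    return 1 - STD_NORMAL.cdf((limit - p1) / sqrt(p1 * (1 - p1) / n))


def procedures_needed(target_power: float, national_rate: float,
                      poor_ratio: float = 2.0, alpha: float = 0.05) -> int:
    """Smallest n that reaches the target power under the same approximation."""
    p0 = national_rate
    p1 = poor_ratio * national_rate
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = STD_NORMAL.inv_cdf(target_power)
    root_n = (z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(p1 * (1 - p1))) / (p1 - p0)
    return ceil(root_n ** 2)


for label, rate in [("Hip fracture surgery", 0.084), ("Cardiac surgery", 0.027)]:
    needed = [procedures_needed(target, rate) for target in (0.6, 0.7, 0.8)]
    print(label, needed)
```

Under these assumptions the results fall within a few procedures of the numbers in table 1 but do not match them exactly, so the authors' calculation probably differs in some detail. Passing alpha=0.002 (99·8% control limits) to power_to_detect shows how wider limits reduce power for the same number of procedures.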

Improving statistical power
There are options for improvements in statistical power other than the pooling of data over time, but each introduces problems of its own.

First, data for different procedures could be pooled. However, when outcomes differ between procedures, this approach could prevent fair comparisons of outcomes. We grouped gastrectomy, which has a mortality of 6·9%, with oesophagectomy, which has a mortality of 5·7%.7 Additionally, cardiac procedures were grouped together, combining coronary bypass surgery with replacement or repairs of cardiac valves, as is done by SCTS.2 Adjustment for patients’ risk profiles (ie, case-mix adjustment) might not be sufficient to remove biases due to surgeons who do varying mixes of procedures. Another difficulty caused by the pooling of data is that poor surgeon performance for one specific procedure could be masked.

Second, the control limits used to identify poor performance could be lowered. We used a 5% significance level, which corresponds to 95% control limits (figure), and is the lowest commonly used threshold. However, lowering of the threshold would also lead to an increased proportion of surgeons falsely identified as having poor performance.

Third, an alternative outcome measure could be selected, such as surgical complications or emergency readmission. Although increased numbers of events would raise statistical power, measurement error due to incomplete or inconsistent recording would tend to reduce it.

The proportions in table 2 are calculated on the assumption that the overall rate of mortality remains constant with time. The SCTS reports surgeon-level mortality with 3 years of data.2 Its data show that about three-quarters of surgeons do sufficient numbers of cardiac operations to achieve 60% statistical power (table 2). The proportion of surgeons who do sufficient numbers of procedures to identify poor performance is much lower for procedures other than cardiac and hip fracture surgery (table 2).

Gains in statistical power can clearly be achieved by extension of the period during which data are obtained: as length of time increases, so does the proportion of surgeons who do the necessary number of procedures (table 2). However, pooling of data from long periods will adversely affect the timeliness of reported data. Pooling also assumes that individual surgeon skills and practice patterns are largely stable, which might not be the case. Moreover, such pooling could mask a recent deterioration in a surgeon’s performance. It also introduces challenges related to the reporting of outcomes of junior surgeons, retired surgeons, and surgeons who are temporarily appointed overseas.

Panel 2: Recommendations for public reporting of surgeon outcomes
Measurement of outcomes
• Pool data over time when annual numbers are low, but also consider timeliness of data
• Select outcome measures for which the outcome event is fairly frequent
• For specialties in which most surgeons do not achieve 60% power, the unit of reporting should be the team, hospital, or trust
Presentation of results
• Report surgeon outcomes with appropriate statistical techniques, such as funnel plots
• Avoid interpreting no evidence of poor performance as evidence of acceptable performance
• Report surgeon outcomes with appropriate health warnings, such as interpretation of outcomes with low numbers and data quality issues
• Report surgeon outcomes alongside unit or hospital outcomes to guide interpretation

                                    60% power   70% power   80% power
1-year reporting period
  Hip fracture surgery*             4%          1%          0
  Oesophagectomy or gastrectomy*    0           0           0
  Bowel resection*                  0           0           0
  Cardiac surgery†                  16%         1%          0
3-year reporting period
  Hip fracture surgery*             73%         62%         42%
  Oesophagectomy or gastrectomy*    9%          0           0
  Bowel resection*                  17%         4%          0
  Cardiac surgery†                  75%         69%         56%
5-year reporting period
  Hip fracture surgery*             84%         79%         70%
  Oesophagectomy or gastrectomy*    34%         17%         5%
  Bowel resection*                  37%         24%         11%
  Cardiac surgery†                  80%         77%         72%

Poor performance is defined as double the national overall mortality rate, with a 5% significance level. *Data for numbers of procedures come from the hospital episode statistics for the 3-year period from April, 2009, to March, 2012.5 We selected procedures with the appropriate International Classification of Diseases (version 10) diagnosis codes and Office of Population Censuses and Surveys Classification of Interventions and Procedures-4.4 procedure codes. Procedures were allocated to a consultant if they were contracted under a relevant specialty. †Data used from Society for Cardiothoracic Surgery in Great Britain and Ireland published data.2

Table 2: Proportion of surgeons who do sufficient numbers of different procedures every year to identify cases of poor performance
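The effect of the reporting period can be illustrated with the figures in table 1. The sketch below estimates how many whole years of pooled data a surgeon at the median annual volume would need to reach 60% power. It assumes a constant annual volume and unchanged mortality, and the helper years_of_pooling is hypothetical.

```python
from math import ceil

# (median annual number, procedures needed for 60% power), copied from table 1.
TABLE_1 = {
    "Hip fracture surgery": (31, 56),
    "Oesophagectomy or gastrectomy": (11, 79),
    "Bowel cancer resection": (9, 95),
    "Cardiac surgery": (128, 192),
}


def years_of_pooling(annual_number: int, procedures_needed: int) -> int:
    """Whole years of data needed, assuming the same volume every year."""
    return ceil(procedures_needed / annual_number)


years_needed = {
    name: years_of_pooling(annual, needed)
    for name, (annual, needed) in TABLE_1.items()
}

for name, years in years_needed.items():
    print(f"{name}: {years} year(s) of pooled data")
```

On these assumptions, a surgeon at the median volume would need about 2 years of data for hip fracture or cardiac surgery, but 8 years for oesophagectomy or gastrectomy and 11 years for bowel cancer resection, which helps to explain why so few surgeons in those specialties reach the threshold even with 5 years of data.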


We recommend use of outcome measures that are fairly frequent (panel 2). Additionally, for specialties in which most surgeons still do not do sufficient numbers of procedures to achieve acceptable power, we recommend that reporting should be at the level of the surgical team or hospital, not the surgeon (panel 2).

Figure: Funnel plot showing risk-adjusted 90-day mortality after bowel cancer resection in different trusts
[Figure not reproduced. Vertical axis: adjusted 90-day mortality (%). Horizontal axis: number of procedures in trust (0–250). Lines: median mortality, 95% control limits, and 99·8% control limits.]
For reporting for individual surgeons, one dot would represent one surgeon and numbers of procedures would be much lower than they are here. Trusts falling above the control limits are deemed to be outliers. In our estimates, we have used 95% control limits. Reproduced from Health and Social Care Information Centre’s national bowel cancer audit,8 by permission of the Health and Social Care Information Centre.

Implications of the new policy
Reporting of outcomes for individual surgeons for cardiac surgery in the UK has largely been viewed as a success.11,12 As we have shown, numbers of cardiac surgeries are sufficient to allow the process of detection to operate with reasonable statistical power. However, we believe that consultant-level reporting could be far less effective for other specialties.

The concern about false identification of poor performance has received much attention in view of the stigma attached to poor performance.13 The potential collateral damage of a false accusation could include reputational damage, increases in indemnity insurance premiums, or even prosecution. Public reporting could also affect surgeon behaviour, leading to selection of low-risk patients for surgery.14 Inaccurate estimates of surgeon performance could also cause unnecessary alarm in patients. A mortality estimate of 10% for a surgeon could worry patients, even if the estimate is based on such small numbers that no statistical evidence of poor performance is available. One option to overcome this issue would be to use hierarchical modelling techniques that would shrink the surgeons’ mortality estimates, especially when based on small numbers, towards the overall mortality.15 However, these hierarchical models would not overcome the problem of low statistical power.

A second implication has received much less attention.16 With low numbers of procedures, an unintended result of reporting for individual surgeons could be false complacency. For most surgeons, power will be insufficient to detect poor performance, and this absence of evidence could be falsely interpreted as evidence of acceptable performance. Therefore, rather than stimulating quality improvement and early responses to local concerns about quality of care, publicly reported figures that identify no problems could lead to inaction.

Our analyses draw attention to the need for great care in presentation of estimates for individual surgeons. Analyses should be presented in such a way as to avoid false complacency, false accusation, and unnecessary alarm to patients (panel 2).

Acknowledgments
We thank Rob Wakeman for providing information on surgeon volume for hip fracture surgery. No specific funding was received for this report. The salaries of KW, JN, and OG were funded by a grant from the Royal College of Surgeons of England.

References
1 NHS Commissioning Board. Everyone counts: planning for patients 2013/14. 2012. everyonecounts-planning.pdf (accessed June 28, 2013).
2 Society for Cardiothoracic Surgery in Great Britain and Ireland. UK surgeons’ results. default.aspx (accessed June 28, 2013).
3 Hannan EL, Cozzens K, King SB, Walford G, Shah NR. The New York State cardiac registries: history, contributions, limitations, and lessons for future efforts to assess and publicly report healthcare outcomes. J Am Coll Cardiol 2012; 59: 2309–16.
4 Altman DG, Bland JM. Absence of evidence is not evidence of absence. BMJ 1995; 311: 485.
5 Health and Social Care Information Centre. Hospital episode statistics. (accessed June 28, 2013).
6 National Hip Fracture Database. National Report 2012. 2012. http:// 22A244F2ED802579C900553993/$file/NHFD%20National%20Report%202012.pdf?OpenElement (accessed June 28, 2013).
7 Health and Social Care Information Centre. National oesophago-gastric cancer audit 2012. searchcatalogue?productid=7335&q=%22National+OesophagoGastric+Cancer+Audits%22&sort=Relevance&size=10&page=1#top (accessed June 28, 2013).
8 Health and Social Care Information Centre. National bowel cancer audit 2012. 27&q=title%3a%22bowel+cancer%22&sort=Relevance&size=10&page=1#top (accessed June 28, 2013).
9 National Institute for Cardiovascular Outcomes Research. Adult Cardiac Surgery. 2013. Adultcardiacsurgery (accessed June 28, 2013).
10 Altman DG, Bland JM. Diagnostic tests 2: predictive values. BMJ 1994; 309: 102.
11 Bridgewater B, Hickey GL, Cooper G, Deanfield J, Roxburgh J. Publishing cardiac surgery mortality rates: lessons for other specialties. BMJ 2013; 346: f1139.
12 Bridgewater B. Mortality data in adult cardiac surgery for named surgeons: retrospective examination of prospectively collected data on coronary artery surgery and aortic valve replacement. BMJ 2005; 330: 506–10.
13 Lilford R, Mohammed AM, Spiegelhalter D, Thomson R. Use and misuse of process and outcome data in managing performance of acute medical care: avoiding institutional stigma. Lancet 2004; 363: 1147–54.
14 Shahian DM, Edwards FH, Jacobs JP, et al. Public reporting of cardiac surgery performance: part 1 - history, rationale, consequences. Ann Thorac Surg 2011; 92: S2–11.
15 Dimick JB, Staiger DO, Birkmeyer JD. Ranking hospitals on surgical mortality: the importance of reliability adjustment. Health Serv Res 2010; 45: 1614–29.
16 Dimick JB, Welch HG, Birkmeyer JD. Surgical mortality as an indicator of hospital quality: the problem with small sample size. JAMA 2004; 292: 847–51.
17 Johal A, Cromwell D, van der Meulen J. Hospital episode statistics and revalidation: creating the evidence to support revalidation. Jan 9, 2013. docs/hospital-episode-statistics-and-revalidation-report-2013 (accessed June 28, 2013).
18 Girling AJ, Hofer TP, Wu J, et al. Case-mix adjusted hospital mortality is a poor proxy for preventable mortality: a modelling study. BMJ Qual Saf 2012; 21: 1052–56.
19 Shahian DM, Normand SL. Autonomy, beneficence, justice, and the limits of provider profiling. J Am Coll Cardiol 2012; 59: 2383–86.
20 Spiegelhalter DJ. Handling over-dispersion of performance indicators. Qual Saf Health Care 2005; 14: 347–51.
21 Ghaferi AA, Birkmeyer JD, Dimick JB. Hospital volume and failure to rescue with high-risk surgery. Med Care 2011; 49: 1076–81.

Wider issues
Several wider issues have been raised previously about the reporting of surgeon outcomes, mainly related to adequate adjustment for patient case mix, the accuracy with which the responsible surgeon can be identified, and the shared responsibility for the care of patients within teams.17

Operative mortality, including unavoidable deaths, might not be a good proxy for preventable mortality. Of particular relevance is the mean proportion of deaths that can be prevented: if this proportion is low, mortality is a poor test to predict avoidable mortality.18 Additionally, mortality is not the only outcome that concerns patients; others include avoidance of serious complications, being treated with dignity, return of function, and freedom from recurrent symptoms.19

Case-mix adjustment aims to account for differences in age, disease severity, or other factors in comparisons of surgeon outcomes. Validated methods for case-mix adjustment do not exist for all the procedures for which outcomes have to be reported in 2013. Even when they do exist, these methods might not fully adjust for case-mix differences. Therefore, some surgeons treating patients at high risk could be wrongly identified as outliers,20 and underperforming surgeons treating low-risk patients will be less likely to be detected.

Identification of the surgeon responsible for a procedure is not always straightforward. Some procedures are not allocated to any surgeon, whereas others are done by more than one surgeon. Inconsistencies between units in how procedures are allocated to surgeons could further undermine these comparisons.

A final issue is the appropriate organisational level for reporting outcomes. Reporting for individual surgeons ignores the effect of the multidisciplinary team and the context in which especially complex surgery is done. Many aspects of care other than the surgeon’s performance will affect the outcome, such as timeliness of referral and diagnosis, perioperative care, and follow-up care after discharge. For example, complications resulting from surgery might result in a patient’s death, depending on the way clinical units monitor patients’ vital status and respond to adverse occurrences.21 We recommend that surgeon outcomes are reported alongside unit outcomes to guide interpretation (panel 2).

Contributors
KW and JN conceived this report. All authors were involved in the design. KW, JN, and OG collected and analysed data. All authors interpreted results. KW, JN, and OG wrote the report, with contributions from DAC and JvdM.

Conflicts of interest
We are all involved in national clinical audits, but report that we have no conflicts of interest.
