
ARTICLE OPEN

Human visual explanations mitigate bias in AI-based assessment of surgeon skills

Dani Kiyasseh1✉, Jasper Laca2, Taseen F. Haque2, Maxwell Otiato2, Brian J. Miles3, Christian Wagner4, Daniel A. Donoho5, Quoc-Dien Trinh6, Animashree Anandkumar1 and Andrew J. Hung2✉

Artificial intelligence (AI) systems can now reliably assess surgeon skills through videos of intraoperative surgical activity. With such
systems informing future high-stakes decisions such as whether to credential surgeons and grant them the privilege to operate on
patients, it is critical that they treat all surgeons fairly. However, it remains an open question whether surgical AI systems exhibit
bias against surgeon sub-cohorts, and, if so, whether such bias can be mitigated. Here, we examine and mitigate the bias exhibited
by a family of surgical AI systems—SAIS—deployed on videos of robotic surgeries from three geographically-diverse hospitals (USA
and EU). We show that SAIS exhibits an underskilling bias, erroneously downgrading surgical performance, and an overskilling bias,
erroneously upgrading surgical performance, at different rates across surgeon sub-cohorts. To mitigate such bias, we leverage a
strategy —TWIX—which teaches an AI system to provide a visual explanation for its skill assessment that otherwise would have
been provided by human experts. We show that whereas baseline strategies inconsistently mitigate algorithmic bias, TWIX can
effectively mitigate the underskilling and overskilling bias while simultaneously improving the performance of these AI systems
across hospitals. We discovered that these findings carry over to the training environment where we assess medical students’ skills
today. Our study is a critical prerequisite to the eventual implementation of AI-augmented global surgeon credentialing programs,
ensuring that all surgeons are treated fairly.
npj Digital Medicine (2023) 6:54; https://doi.org/10.1038/s41746-023-00766-2

INTRODUCTION
The quality of a surgeon's intraoperative activity (skill-level) can now be reliably assessed through videos of surgical procedures and artificial intelligence (AI) systems1–3. With these AI-based skill assessments on the cusp of informing high-stakes decisions on a global scale, such as the credentialing of surgeons4,5, it is critical that they are unbiased—reliably reflecting the true skill-level of all surgeons equally6,7. However, it remains an open question whether such surgical AI systems exhibit a bias against certain surgeon sub-cohorts. Without an examination and mitigation of these systems' algorithmic bias, they may unjustifiably rate surgeons differently, erroneously delaying (or hastening) the credentialing of surgeons, and thus placing patients' lives at risk8,9.

A surgeon typically masters multiple skills (e.g., needle handling and driving) necessary for surgery10–12. To reliably automate the assessment of such skills, multiple AI systems (one for each skill) are often developed (Fig. 1a). To test the robustness of these systems, they are typically deployed on data from multiple hospitals13. We argue that the bias of any one of these systems, which manifests as a discrepancy in its performance across surgeon sub-cohorts (e.g., novices vs. experts), is akin to one of many light bulbs in an electric circuit connected in series (Fig. 1b). With a single defective light bulb influencing the entire circuit, just one biased AI system is enough to disadvantage a surgeon sub-cohort. Therefore, the deployment of multiple AI systems across multiple hospitals, a common feat in healthcare, necessitates that we examine and mitigate the bias of all such systems collectively. Doing so will ethically guide the impending implementation of AI-augmented global surgeon credentialing programs14,15.

Previous studies have focused on algorithmic bias exclusively against patients, demonstrating that AI systems systematically underestimate the pain level of Black patients16 and falsely predict that female Hispanic patients are healthy17. The study of bias in video-based AI systems has also gained traction, in the context of automated video interviews18, algorithmic hiring19, and emotion recognition20. Previous work has not, however, investigated the bias of AI systems applied to surgical videos21, thereby overlooking its effect on surgeons. Further, previous attempts to mitigate such bias are either ineffective22–24 or are limited to a single AI system deployed in a single hospital25–27, casting doubt on their wider applicability. As such, previous studies neither attempt, nor demonstrate the effectiveness of, a strategy to mitigate the bias exhibited by multiple AI systems across multiple hospitals.

In this study, we examine the bias exhibited by a family of surgical AI systems—SAIS3—developed to assess the binary skill-level (low vs. high skill) of multiple surgical activities from videos. Through experiments on data from three geographically-diverse hospitals, we show that SAIS exhibits an underskilling bias, erroneously downgrading surgical performance, and an overskilling bias, erroneously upgrading surgical performance, at different rates across surgeon sub-cohorts. To mitigate such bias, we leverage a strategy—TWIX28—that teaches an AI system to complement its skill assessments with a prediction of the importance of video frames, as provided by human experts

1Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA. 2Center for Robotic Simulation and Education, Catherine & Joseph Aresty Department of Urology, University of Southern California, Los Angeles, CA, USA. 3Department of Urology, Houston Methodist Hospital, Houston, TX, USA. 4Department of Urology, Pediatric Urology and Uro-Oncology, Prostate Center Northwest, St. Antonius-Hospital, Gronau, Germany. 5Division of Neurosurgery, Center for Neuroscience, Children's National Hospital, Washington, DC, USA. 6Center for Surgery & Public Health, Department of Surgery, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA. ✉email: danikiy@hotmail.com; ajhung@gmail.com

Fig. 1 Mitigating bias of multiple surgical AI systems across multiple hospitals. a Multiple AI systems assess the skill-level of multiple
surgical activities (e.g., needle handling and needle driving) from videos of intraoperative surgical activity. These AI systems are often
deployed across multiple hospitals. b To examine bias, we stratify these systems' performance (e.g., AUC) across different sub-cohorts of
surgeons (e.g., novices vs. experts). The bias of one of many AI systems is akin to a light bulb in an electric circuit connected in series: similar to
how one defective light bulb leads to a defective circuit, one biased AI system is sufficient to disadvantage a surgeon sub-cohort. c To mitigate
bias, we teach an AI system, through a strategy referred to as TWIX, to complement its skill assessments with predictions of the importance of
video frames based on ground-truth annotations provided by human experts.

(Fig. 1c). We show that TWIX can mitigate the underskilling and overskilling bias across hospitals and simultaneously improve the performance of AI systems for all surgeons. Our findings inform the ethical implementation of impending AI-augmented global surgeon credentialing programs.

RESULTS
SAIS exhibits underskilling bias across hospitals
With skill assessment, we refer to the erroneous downgrading of surgical performance as underskilling. An underskilling bias is exhibited when such underskilling occurs at different rates across surgeon sub-cohorts. For binary skill assessment (low vs. high skill), which is the focus of our study, this bias is reflected by a discrepancy in the negative predictive value (NPV) of SAIS (see Methods, Fig. 6). We, therefore, present SAIS' NPV for surgeons who have performed a different number of robotic surgeries during their lifetime (expert caseload >100), and for those operating on prostate glands of different volumes and of different cancer severity (Gleason score) (Fig. 2). Note that membership in these groups is fluid, as surgeons often have little say over, for example, the characteristics of the prostate gland they operate on. Please refer to the Methods section for our motivation behind selecting these groups and sub-cohorts.

We found that SAIS exhibits an underskilling bias across hospitals (see Methods for a description of the data and Table 2 for the number of video samples). This is evident by, for example, the discrepancy in the negative predictive value across the two


Fig. 2 SAIS exhibits an underskilling bias across hospitals. SAIS is tasked with assessing the skill-level of a needle handling and b needle
driving. A discrepancy in the negative predictive value across surgeon sub-cohorts reflects an underskilling bias. Note that SAIS is always
trained on data from USC and deployed on data from St. Antonius Hospital and Houston Methodist Hospital. To examine bias, we stratify SAIS'
performance based on the total number of robotic surgeries performed by a surgeon during their lifetime (caseload), the volume of the
prostate gland, and the severity of the prostate cancer (Gleason score). The results are an average, and error bars reflect the standard error,
across ten Monte Carlo cross-validation folds.

surgeon sub-cohorts operating on prostate glands of different volumes (≤49 ml and >49 ml). For example, when assessing the skill-level of needle handling at USC (Fig. 2a), SAIS achieved NPV ≈ 0.71 and 0.75 for the two sub-cohorts, respectively. Such an underskilling bias consistently appears across hospitals, where NPV ≈ 0.80 and 0.93 at St. Antonius Hospital (SAH), and NPV ≈ 0.73 and 0.88 at Houston Methodist Hospital (HMH). These findings extend to when SAIS assessed the skill-level of the second surgical activity of needle driving (see Fig. 2b).

Overskilling bias. While our emphasis has been on the underskilling bias, we demonstrate that SAIS also exhibits an overskilling bias, where it erroneously upgrades surgical performance (see Supplementary Note 2).

Multi-class skill assessment. Although the emphasis of this study is on binary skill assessment, a decision driven primarily by the need to inspect the fairness of a previously-developed and soon-to-be-deployed AI system (SAIS), there has been a growing number of studies focused on multi-class skill assessment15. As such, we conducted a confined experiment to examine whether such a setup, in which needle handling is identified as either low, intermediate, or high skill, also results in algorithmic bias (see Supplementary Note 3). We found that both the underskilling and overskilling bias extend to this setting.

Underskilling bias persists even after controlling for potential confounding factors
Confounding factors may be responsible for the apparent underskilling bias29,30. It is possible that the underskilling bias against surgeons with different caseloads (Fig. 2b) is driven by SAIS' dependence on caseload, as a proxy, for skill assessment. For example, SAIS may have latched onto the effortlessness of expert surgeons' intraoperative activity, as opposed to the strict skill assessment criteria (see Methods), as predictive of high-skill activity. However, after controlling for caseload, we found that SAIS' outputs remain highly predictive of skill-level (odds ratio = 2.27), suggesting that surgeon caseload, or experience, plays a relatively smaller role in assessing skill31 (see Methods). To further check if SAIS was latching onto caseload-specific features in surgical videos, we retrained it on data with an equal number of samples from each class (low vs. high skill) and surgeon caseload group (novice vs. expert) and found that the underskilling bias still persists. This suggests that SAIS is unlikely to be dependent on unreliable caseload-specific features.

Examining bias across multiple AI systems and hospitals prevents misleading bias findings
With multiple AI systems deployed on the same group of surgeons across hospitals, we claim that examining the bias of only one of these AI systems can lead to misleading bias findings. Here, we provide evidence in support of this claim by focusing on the surgeon caseload group (it also applies to other groups).

Multiple AI systems. We found that, had we examined bias for only needle handling, we would have erroneously assumed that SAIS disadvantaged novice surgeons exclusively. While SAIS did exhibit an underskilling bias against novice surgeons at USC when assessing the skill-level of needle handling, it exhibited this bias against expert surgeons when assessing the skill-level of the second surgical activity of needle driving. For example, SAIS achieved NPV ≈ 0.71 and 0.75 for novice and expert surgeons, respectively, for needle handling (Fig. 2a), whereas it achieved NPV ≈ 0.85 and 0.75 for these two sub-cohorts for needle driving (Fig. 2b).

Multiple hospitals. We also found that, had we examined bias on data only from USC, we would have erroneously assumed that SAIS disadvantaged expert surgeons exclusively. While SAIS did exhibit an underskilling bias against expert surgeons at USC when assessing the skill-level of needle driving, it exhibited this bias against novice surgeons, to an even greater extent, at HMH. For


Fig. 3 TWIX mitigates the underskilling bias across hospitals. We present the average performance of SAIS on the most disadvantaged sub-
cohort (worst-case NPV) before and after adopting TWIX, indicating the percent change. An improvement (↑) in the worst-case NPV is
considered bias mitigation. SAIS is tasked with assessing the skill-level of a needle handling and b needle driving. Note that SAIS is trained on
data from USC and deployed on data from St. Antonius Hospital and Houston Methodist Hospital. Results are an average across ten Monte
Carlo cross-validation folds.

example, SAIS achieved NPV ≈ 0.85 and 0.75 for novice and expert surgeons, respectively, at USC, whereas it achieved NPV ≈ 0.57 and 0.80 for these two sub-cohorts at HMH (Fig. 2b).

TWIX mitigates underskilling bias across hospitals
Although we demonstrated, in a previous study, that SAIS was able to generalize to data from different hospitals, we are acutely aware that AI systems are not perfect. They can, for example, depend on unreliable features as a shortcut to performing a task, otherwise known as spurious correlations32. We similarly hypothesized that SAIS, as a video-based AI system, may be latching onto unreliable temporal features (i.e., video frames) to perform skill assessment. At the very least, SAIS could be focusing on frames which are irrelevant to the task at hand and which could hinder its performance.

To test this hypothesis, we opted for an approach that directs an AI system's focus onto frames deemed relevant (by human experts) while performing skill assessment. The intuition is that by learning to focus on features deemed most relevant by human experts, an AI system is less likely to latch onto unreliable features in a video when assessing surgeon skill. To that end, we leverage a strategy entitled training with explanations—TWIX28 (see Methods). We present the performance of SAIS for the disadvantaged surgeon sub-cohorts before and after adopting TWIX when assessing the skill-level of needle handling (Fig. 3a) and needle driving (Fig. 3b).

We found that TWIX mitigates the underskilling bias exhibited by SAIS. This is evident by the improvement in SAIS' worst-case negative predictive value for the disadvantaged surgeon sub-cohorts after having adopted TWIX. For example, when SAIS was tasked with assessing the skill-level of needle handling at USC (Fig. 3a), the worst-case NPV increased by 2% for the disadvantaged surgeon sub-cohort (novice) in the surgeon caseload group


Fig. 4 TWIX can improve AI system performance while mitigating bias across hospitals. The performance (AUC) of SAIS before and after
having adopted TWIX when assessing the skill-level of a needle handling and b needle driving. Note that SAIS is trained on data from USC and
deployed on data from St. Antonius Hospital and Houston Methodist Hospital. The results are an average across ten Monte Carlo cross-
validation folds and the shaded area represents one standard error.

(system 1), the magnitude of this mitigation increased when SAIS assessed the skill-level of the distinct activity of needle driving (system 2). For example, for the disadvantaged surgeon sub-cohort in the caseload group, the worst-case NPV improved by 2% for needle handling (Fig. 3a) and 20% for needle driving (Fig. 3b), reflecting a 10-fold increase in the effectiveness of TWIX as a bias mitigation strategy.

Multiple hospitals. We found that, had we not adopted TWIX and deployed SAIS in other hospitals, we would have overestimated its effectiveness. Specifically, while TWIX mitigated the underskilling bias at USC when SAIS assessed the skill-level of needle driving, the magnitude of this mitigation decreased when SAIS was deployed on data from SAH. For example, for the disadvantaged surgeon sub-cohort in the prostate volume group, the worst-case NPV improved by 19% at USC but only by 1% at SAH (Fig. 3b).

Baseline bias mitigation strategies induce collateral damage
A strategy for mitigating a particular type of bias can exacerbate another, leading to collateral damage and eroding its effectiveness. To investigate this, we adapted two additional strategies that have, in the past, proven effective in mitigating bias33,34. These include training an AI system with additional data (TWAD) and pre-training an AI system first with surgical videos (VPT) (see Methods for an in-depth description). We compare their ability to mitigate bias to that of TWIX (Table 1 and Supplementary Note 5).

Table 1. Baseline strategies mitigate bias inconsistently.

                    Bias mitigation strategy
Bias                TWAD        VPT         TWIX (ours)
Underskilling       ↓ 3.7%      ↓ 7.7%      ↓ 3.0%
Overskilling        ↑ 6.7%      ↑ 7.0%      ↓ 4.0%

We report the change in the AI system's bias (negative percent change in worst-case performance) averaged across the surgeon groups as a result of adopting distinct mitigation strategies. An improvement in the worst-case performance corresponds to a reduction in bias. Results are shown for the needle handling skill assessment system deployed on data from USC. TWAD involves training an AI system with additional data, and VPT involves pre-training the AI system with surgical videos (see Methods).

We found that while baseline strategies were effective in mitigating the underskilling bias, and even more so than TWIX, they dramatically worsened the overskilling bias exhibited by SAIS. For example, VPT almost negated its improvement in the underskilling bias (7.7%) by exacerbating the overskilling bias (7.0%). In contrast, TWIX consistently mitigated both the underskilling and overskilling bias, albeit more moderately, resulting in an average improvement in the worst-case performance by 3.0% and 4.0%, respectively. The observed consistency in TWIX's effect on bias is an appealing property whose implications we discuss later.

TWIX can improve AI system performance while mitigating bias across hospitals
Trustworthy AI systems must exhibit both robust and fair behavior35. Although it has been widely documented that mitigating algorithmic bias can come at the expense of AI system performance36, recent work has cast doubt on this trade-off37–39. We explored this trade-off in the context of TWIX, and present SAIS' performance for all surgeons across hospitals (Fig. 4). This is reflected by the area under the receiver operating characteristic curve (AUC), before and after having adopted TWIX.

We found that TWIX can improve the performance of AI systems while mitigating bias. This is evident by the improvement in the performance of SAIS both for the disadvantaged surgeon sub-cohorts (see earlier Fig. 3) and on average for all surgeons. For example, when tasked with assessing the skill-level of needle driving at USC (Fig. 3b), TWIX improved the worst-case NPV by


Fig. 5 SAIS can be used today to assess the skill-level of surgical trainees. a SAIS exhibits an underskilling bias against male medical
students when assessing the skill-level of needle handling. b TWIX improves the worst-case NPV, and thus mitigates the underskilling bias.
c TWIX simultaneously improves SAIS' ability to perform skill assessment.

32%, 19%, and 20% for the surgeon groups of caseload, prostate volume, and Gleason score, respectively, thus mitigating the underskilling bias, and also improved SAIS' performance from AUC = 0.821 → 0.843 (Fig. 4b).

Deployment of SAIS in a training environment
Our study informs the future implementation of AI-augmented surgeon credentialing programs. We can, however, begin to assess today the skills of surgical trainees in a training environment. To foster a fair learning environment for surgical trainees, it is critical that these AI-based skill assessments reflect the true skill-level of all trainees equally. To measure this, and as a proof of concept, we deployed SAIS on video samples of the needle handling activity performed by medical students without prior robotic experience on a robot otherwise used in surgical procedures (see Methods) (Fig. 5).

We discovered that our findings from when SAIS was deployed on video samples of live surgical procedures transferred to the training environment. Specifically, we first found that SAIS exhibits an underskilling bias against male medical students (Fig. 5a). Consistent with earlier findings, we also found that TWIX mitigates this underskilling bias (Fig. 5b) and simultaneously improves SAIS' ability to assess the skill-level of needle handling (Fig. 5c).

DISCUSSION
Recently-developed surgical AI systems can reliably assess multiple surgeon skills across hospitals. The impending deployment of such systems for the purpose of credentialing surgeons and training medical students necessitates that they do not disadvantage any particular sub-cohort. However, until now, it has remained an open question whether such surgical AI systems exhibit algorithmic bias.

In this study, we examined and mitigated the bias exhibited by a family of surgical AI systems—SAIS—that assess the skill-level of multiple surgical activities through video. To prevent misleading bias findings, we demonstrated the importance of examining the collective bias exhibited by all AI systems deployed on the same group of surgeons and across multiple hospitals. We then leveraged a strategy—TWIX—which not only mitigates such bias for the majority of surgeon groups and hospitals, but can also improve the performance of AI systems for all surgeons.

As it pertains to the study and mitigation of algorithmic bias, previous work is limited in three main ways. First, it has not examined the algorithmic bias of AI systems applied to the data modality of surgical videos6,40 nor against surgeons41,42, thereby overlooking an important stakeholder within medicine. Second, previous work has not studied bias in the real clinical setting characterized by multiple AI systems deployed on the same group of surgeons and across multiple hospitals, with a single exception43. Third, previous work has not demonstrated the effectiveness of a bias mitigation strategy across multiple stakeholders and hospitals33.

When it comes to bias mitigation, we found that TWIX mitigated algorithmic bias more consistently than baseline strategies that have, in the past, proven effective in other scientific domains and with other AI systems. This consistency is reflected by a simultaneous decrease in algorithmic bias of different forms (underskilling and overskilling), of multiple AI systems (needle handling and needle driving skill assessment), and across hospitals. We do appreciate, however, that it is unlikely for a single bias mitigation strategy to be effective all the time. This reactive approach to bias mitigation might call for more of a preventative approach where AI systems are purposefully designed to exhibit minimal bias. While appealing in theory, we believe this is impractical at the present moment for several reasons. First, it is difficult to determine, during the design stage of an AI system, whether it will exhibit any algorithmic bias upon deployment on data, and if so, against whom. Mitigating bias becomes challenging when you cannot first quantify it. Second, the future environment in which an AI system will be deployed is often unknown. This ambiguity makes it difficult to design an AI system specialized to the data in that environment ahead of time. In some cases, it may even be undesirable to do so as a specialized system might be unlikely to generalize to novel data.

From a practical standpoint, we believe TWIX confers several benefits. Primarily, TWIX is a simple add-on to almost any AI system that processes temporal information and does not require any amendments to the latter's underlying architecture. This is particularly appealing in light of the broad availability and common practice of adapting open-source AI systems. In terms of resources, TWIX only requires the availability of ground-truth importance labels (e.g., the importance of frames in a video), which we have demonstrated can be acquired with relative ease in this study. Furthermore, TWIX's benefits can extend beyond just mitigating algorithmic bias. Most notably, when performing inference on an unseen video sample, an AI system equipped with TWIX can be viewed as explainable, as it highlights the relative importance of video frames, thereby instilling trust in domain experts, and can be leveraged as a personalized educational tool for medical students, directing them towards surgical activity in the video that can be improved upon. These additional capabilities would be missing from other bias mitigation strategies.

We demonstrated that, to prevent misleading bias findings, it is crucial to examine and mitigate the bias of multiple AI systems across multiple hospitals. Without such an analysis, stakeholders within medicine would be left with an incomplete and potentially

incorrect understanding of algorithmic bias. For example, at the national level, medical boards augmenting their decision-making with AI systems akin to those introduced here may introduce unintended disparities in how surgeons are credentialed. At the local hospital level, medical students subjected to AI-augmented surgical training, a likely first application of such AI systems, may receive unreliable learning signals. This would hinder their professional development and perpetuate existing biases in the education of medical students44–47. Furthermore, the alleviation of bias across multiple hospitals implies that surgeons looking to deploy an AI system in their own operating room are now less reticent to do so. As such, we recommend that algorithmic bias, akin to AI system performance, is also examined across multiple hospitals and multiple AI systems deployed on the same group of stakeholders. Doing so increases the transparency of AI systems, leading to more informed decision-making at various levels of operation within healthcare and contributing to the ethical deployment of surgical AI systems.

There are important challenges that our work does not yet address. A topic that is seldom discussed, and which we do not claim to have an answer for, is that of identifying an acceptable level of algorithmic bias. Akin to the ambiguity of selecting a performance threshold that AI systems should surpass before being deployed, it is equally unclear whether a discrepancy in performance across groups (i.e., bias) of 10 percentage points is significantly worse than one of 5 percentage points. As with model performance, this is likely to be context-specific and dependent on how costly a particular type of bias is. In our work, we have suggested that any performance discrepancy is indicative of algorithmic bias, an assumption that the majority of previous work also makes. In a similar vein, we have only considered algorithmic bias at a single snapshot in time, when the model is trained and deployed on a static and retrospectively-collected dataset. However, as AI systems are likely to be deployed over extended periods of time, where the distribution of data is likely to change, it is critical to continuously monitor and mitigate the bias exhibited by such systems over time. Analogous to continual learning approaches that allow models to perform well on new unseen data while maintaining strong performance on data observed in the past48, we believe continual bias mitigation is an avenue worth exploring.

Our study has been limited to examining the bias of AI systems which only assess the quality of two surgical skills (needle handling and needle driving). Although these skills form the backbone of suturing, an essential activity that almost all surgeons must master, they are but a subset of all skills required of a surgeon. It is imperative for us to proactively assess the algorithmic bias of surgical AI systems once they become capable of reliably assessing a more exhaustive set of surgical skills. Another limitation is that we examine and mitigate algorithmic bias exclusively through a technical lens. However, we acknowledge that the presence and perpetuation of bias is dependent on a multitude of additional factors, ranging from the social context in which an AI system is deployed to the decisions that it will inform and the incentives surrounding its use. In this study, and for illustration purposes, we assumed that an AI system would be used to either provide feedback to surgeons about their performance or to inform decisions such as surgeon credentialing. To truly determine whether algorithmic biases, as we have defined them, translate into tangible biases that negatively affect surgeons and their clinical workflow, a prospective deployment of an AI system would be required.

Although we leveraged a bias mitigation strategy (TWIX), our work does not claim to address the key open question of how much bias mitigation is sufficient. Indeed, the presence of a performance discrepancy across groups is not always indicative of algorithmic bias. Some have claimed that this is the case only if the discrepancy is unjustified and harmful to stakeholders49. Therefore, to address this open question, which is beyond the scope of our work, researchers must appreciate the entire ecosystem in which an AI system is deployed. Moving forward, and once data become available, we look to examine (a) bias against surgeon groups which we had excluded in this study due to sample size constraints (e.g., those belonging to a particular race, sex, and ethnicity) and (b) intersectional bias50: that which is exhibited against surgeons who belong to multiple groups at the same time (e.g., expert surgeons who are female). Doing so could help outline whether a variant of Simpson's paradox51 is at play; bias, although absent at the individual group level, may be present when simultaneously considering multiple groups. We leave this to future work as the analysis would require a sufficient number of samples from each intersectional group. We must also emphasize that a single bias mitigation strategy is unlikely to be a panacea. As a result, we encourage the community to develop bias mitigation strategies that achieve the desired effect across multiple hospitals, AI systems, and surgeon groups. Exploring the interplay of these elements, although rarely attempted in the context of algorithmic bias in medicine, is critical to ensure that AI systems deployed in clinical settings have the intended positive effect on stakeholders.

The credentialing of a surgeon is often considered a rite of passage. With time, such a decision is likely to be supported by AI-based skill assessments. In preparation for this future, our study introduces safeguards to enable fair decision-making.

METHODS
Ethics approval
All datasets (data from USC, SAH, and HMH) were collected under Institutional Review Board (IRB) approval from the University of Southern California, in which written informed consent was obtained from all participants (HS-17-00113). Moreover, the datasets were de-identified prior to model development.

Description of surgical procedure and activities
In this study, we focused on robot-assisted radical prostatectomy (RARP), a surgical procedure in which the prostate gland is removed from a patient's body in order to treat cancer. With a surgical procedure often composed of sequential steps that must be executed by a surgeon, we observed the intraoperative activity of surgeons during one particular step of the RARP procedure: the vesico-urethral anastomosis (VUA). In short, the VUA is a reconstructive suturing step in which the bladder and urethra, separated by the removal of the prostate, must now be connected to one another through a series of stitches. This connection creates a tight link that should allow for the normal flow of urine postoperatively. To perform a single stitch in the VUA step, a surgeon must first grab the needle with one of the robotic arms (needle handling), push that needle through the tissue (needle driving), and then withdraw that needle on the other side of the tissue in preparation for the next stitch (needle withdrawal).

Surgical video samples and annotations
In assessing the skill-level of suturing activity, SAIS was trained and evaluated on video samples associated with ground-truth skill assessment annotations. We now outline how these video samples and annotations were generated, and defer a description of SAIS to the next section.

Video samples. We collected videos of entire robotic surgical procedures from three geographically-diverse hospitals in addition to videos of medical students performing suturing activities in a laboratory environment.

Live robotic surgical procedures: An entire video of the VUA step (on the order of 20 min) from one surgical case was split into

Table 2. Total number of videos and video samples associated with each of the hospitals and tasks.

Task               Activity   Details            Hospital   Videos   Video samples   Surgeons   Generalizing to
Skill assessment   Suturing   Needle handling    USC        78       912             19         Videos
                                                 SAH        60       240             18         Hospitals
                                                 HMH        20       184             5          Hospitals
                                                 LAB        69       328             38         Modality
                              Needle driving     USC        78       530             19         Videos
                                                 SAH        60       280             18         Hospitals
                                                 HMH        20       220             5          Hospitals

Note that we train our model, SAIS, exclusively on the USC data, following a tenfold Monte Carlo cross-validation setup. For an exact breakdown of the number of video samples in each fold and training, validation, and test split, please refer to Supplementary Tables 1–6. The data from the remaining hospitals are exclusively used for inference. SAIS is always trained and evaluated on a class-balanced set of data whereby each category (e.g., low skill and high skill) contains the same number of samples. This prevents SAIS from being negatively affected by a sampling bias during training, and allows for a more intuitive appreciation of the evaluation results.

video samples depicting either one of the two suturing activities: needle handling and needle driving. With each VUA step consisting of around 24 stitches, this resulted in approximately 24 video samples depicting needle handling and another 24 samples depicting needle driving. To obtain these video samples, a trained medical fellow identified the start and end time of the respective suturing sub-phases. Each video sample can span 5−30 s in duration. Please refer to Table 2 for a summary of the number of video samples.

Training environment: To mimic the VUA step in a laboratory environment, we presented medical students with a realistic gel-like model of the bladder and urethra, and asked them to perform a total of 16 stitches while using a robot otherwise used in live surgical procedures. To obtain video samples, we followed the same strategy described above. As such, each participant's video resulted in 16 video samples for each of the activities of needle handling, needle driving, etc. For this dataset, we only focused on needle handling (see Table 2 for the number of video samples). Note that since these video samples depict suturing activities, we adopted the same annotation strategy (described next) for these video samples and those of live surgical procedures.

Skill assessment annotations. A team of trained human raters (TFH, MO, and others) was tasked with viewing each video sample and annotating it with either a binary low-skill or high-skill assessment. It is worthwhile to note that, to minimize potential bias in the annotations, these raters were not privy to the clinical meta-information (e.g., surgeon caseload) associated with their surgical videos. The raters followed the strict guidelines outlined in our team's previously-developed skill assessment tool52, which we outline in brief below. To ensure the quality of the annotations, the raters first went through a training process in which they annotated the same set of video samples. Once their agreement level exceeded 80%, they were allowed to begin annotating the video samples for this study. In the event of disagreements in the annotations, we followed the same strategy adopted in the original study3, where the lowest of all scores is considered as the final annotation.

Needle handling skill assessment: The skill-level of needle handling is assessed by observing the number of times a surgeon had to reposition their grasp of the needle. Fewer repositions imply a high skill-level, as they are indicative of improved surgeon dexterity and intent.

Needle driving skill assessment: The skill-level of needle driving is assessed by observing the smoothness with which a surgeon pushes the needle through tissue. Smoother driving implies a high skill-level, as it is less likely to cause physical trauma to the tissue.

SAIS is an AI system for skill assessment
SAIS was recently developed to decode the intraoperative activity of surgeons based exclusively on surgical videos3. Specifically, it demonstrated state-of-the-art performance in assessing the skill-level of surgical activity, such as needle handling and driving, across multiple hospitals. In light of these capabilities, we used SAIS as the core AI system whose potential bias we attempted to examine and mitigate across hospitals.

Components of SAIS. We outline the basic components of SAIS here and refer readers to the original study for more details3. In short, SAIS takes two data modalities as input: RGB frames and optical flow, which measures motion in the field of view over time, and which is derived from neighboring RGB frames. Spatial information is extracted from each of these frames through a vision transformer pre-trained in a self-supervised manner on ImageNet. To capture the temporal information across frames, SAIS learns the relationship between subsequent frames through an attention mechanism. Greater attention, or importance, is placed on frames deemed more important for the ultimate skill assessment. Repeating this process for all data modalities, SAIS arrives at modality-specific video representations. SAIS aggregates these representations to arrive at a single video representation that summarizes the content of the video sample. This video representation is then used to output a probability distribution over the two skill categories (low vs. high skill).
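The full SAIS implementation is available in the repository listed under Code Availability; the sketch below is only a schematic of the attention-based temporal aggregation described above, not the actual architecture. The module name, embedding dimension, and the single learned scoring layer (a simplified stand-in for the transformer attention used by SAIS) are our own illustrative assumptions.

import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    """Aggregates per-frame embeddings (e.g., from a frozen vision
    transformer) into a single video representation by attending over
    frames, placing greater weight on frames deemed more important."""

    def __init__(self, dim: int = 384):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # learned importance score per frame

    def forward(self, frames: torch.Tensor):  # frames: (batch, n_frames, dim)
        attn = torch.softmax(self.score(frames), dim=1)  # (batch, n_frames, 1)
        video = (attn * frames).sum(dim=1)               # (batch, dim)
        return video, attn.squeeze(-1)

# One such aggregation is performed per modality (RGB and optical flow);
# the modality-specific representations are then fused and passed to a
# two-way (low vs. high skill) classifier.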
Training and evaluating SAIS. As in the original study3, SAIS is trained on data exclusively from USC using tenfold Monte Carlo cross-validation (see Supplementary Note 1). Each fold consisted of a training, validation, and test set, ensuring that surgical videos were not shared across the sets. When evaluated on data from other hospitals, SAIS is deployed on all such video samples. This is repeated for all ten of the SAIS models. As such, we report evaluation metrics as an average and standard deviation across ten Monte Carlo cross-validation folds.
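As a minimal sketch, video-grouped Monte Carlo splits of this kind can be generated as follows, assuming a video_ids array that maps each video sample to its source surgical video. The fold count, test fraction, and seeds are illustrative rather than those of the original study, and a validation subset would be carved out of each training portion in the same grouped manner.

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def monte_carlo_folds(video_ids, n_folds=10, test_size=0.2):
    """Yield (train, test) index arrays for Monte Carlo cross-validation,
    ensuring samples from the same surgical video never cross splits."""
    indices = np.arange(len(video_ids))
    for seed in range(n_folds):
        splitter = GroupShuffleSplit(n_splits=1, test_size=test_size,
                                     random_state=seed)
        train_idx, test_idx = next(splitter.split(indices, groups=video_ids))
        yield train_idx, test_idx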
Skill assessment evaluation metrics
When SAIS decodes surgical skills, we report the positive predictive value (PPV), defined as the proportion of AI-based high-skill assessments which are correct, and the negative predictive value (NPV), defined as the proportion of AI-based low-skill assessments which are correct (see Fig. 6). The motivation for doing so stems from the expected use of these AI systems, where their low or high-skill assessment predictions would inform decision-making (e.g., surgeon credentialing). As such, we were interested in seeing what proportion of the AI-based assessments,


Fig. 6 Visual definition of underskilling and overskilling bias in the context of binary skill assessment. Whereas an underskilling bias is reflected by a discrepancy in the negative predictive value of AI-based predictions across sub-cohorts of surgeons (e.g., s1 = novice and s2 = expert), an overskilling bias is reflected by a discrepancy in the positive predictive value.

Ŷ, matched the ground-truth assessment, Y, for a given set of S surgeon sub-cohorts, {s_i}_{i=1}^S:

PPV_{s_i} = P(Y = high | s_i, Ŷ = high)    (1)

NPV_{s_i} = P(Y = low | s_i, Ŷ = low)    (2)

Choosing threshold for evaluation metrics. SAIS outputs the probability, p ∈ [0, 1], that a video sample depicts a high-skill activity. As with any probabilistic output, to make a definitive prediction (skill assessment), we had to choose a threshold, τ, on this probability. Whereas p ≤ τ indicates a low-skill assessment, p > τ indicates a high-skill assessment. While this threshold is often informed by previously-established clinical evidence53 or a desired error rate, we did not have such prior information in this setting. We also balanced the number of video samples from each skill category during the training of SAIS. As such, we chose a threshold τ = 0.5 for our experiments. Changing this threshold did not affect the relative model performance values across surgeon sub-cohorts, and therefore left the bias findings unchanged.

Quantifying the different types of bias
To examine and mitigate the bias exhibited by surgical AI systems, we first require a definition of bias. Although many exist in the literature, we adopt the definition, most commonly used in recent studies16,17,33, as a discrepancy in the performance of an AI system for different members, or sub-cohorts, of a group (e.g., surgeons with different experience levels). The choice of performance metric ultimately depends on the type of bias we are interested in examining. In this study, we focus on two types of bias: underskilling and overskilling.

Underskilling. In the context of skill assessment, underskilling occurs when an AI system erroneously downgrades surgical performance, predicting a skill to be of lower quality than it actually is. Using this logic with binary skill assessment (low vs. high skill), underskilling can be quantified by the proportion of AI-based low-skill predictions (Ŷ = low) which should have been classified as high skill (Y = high). This is equivalently reflected by the negative predictive value of the AI-based predictions (see Fig. 6). While it is also possible to examine the proportion of high-skill assessments which an AI system predicts to be low-skill, amounting to the true positive rate, we opt to focus on how AI-based low-skill predictions directly inform the decision-making of an end-user.

Overskilling. In the context of skill assessment, overskilling occurs when an AI system erroneously upgrades surgical performance, predicting a skill to be of higher quality than it actually is. Using this logic with binary skill assessment, overskilling can be quantified by the proportion of AI-based high-skill predictions (Ŷ = high) which should have been classified as low skill (Y = low). This is equivalently reflected by the positive predictive value of the AI-based predictions (see Fig. 6).

Underskilling and overskilling bias. Adopting the established definitions of bias16,17,33, and leveraging our descriptions of underskilling and overskilling, we define an underskilling bias as a discrepancy in the negative predictive value of AI-based predictions across sub-cohorts of surgeons, s1 and s2, for example, when dealing with two sub-cohorts (see Fig. 6). This concept naturally extends to the multi-class skill assessment setting (Supplementary Note 3). A larger discrepancy implies a larger bias. We similarly define an overskilling bias as a discrepancy in the positive predictive value of AI-based predictions across sub-cohorts of surgeons. Given our study's focus on RARP surgical procedures, we examine bias exhibited against groups of (a) surgeons with different robotic caseloads (total number of robotic surgeries performed in their lifetime), and those operating on prostate glands of (b) different volumes, and (c) different cancer severity. We motivate this choice of groups in the next section.
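To make these definitions concrete, the following is a minimal sketch of how the thresholded predictions underlying Eqs. (1) and (2) and the two bias measures can be computed. The array names (y_true, y_prob, cohorts) are our own and not part of the SAIS codebase; the worst-case NPV returned at the end is the quantity used later to measure bias mitigation.

import numpy as np

def ppv_npv(y_true, y_prob, tau=0.5):
    """PPV/NPV of binary skill predictions (1 = high skill, 0 = low skill)."""
    y_pred = (y_prob > tau).astype(int)       # p > tau -> high-skill assessment
    ppv = (y_true[y_pred == 1] == 1).mean()   # correct high-skill predictions
    npv = (y_true[y_pred == 0] == 0).mean()   # correct low-skill predictions
    return ppv, npv

def skill_bias(y_true, y_prob, cohorts):
    """Underskilling/overskilling bias as the discrepancy in NPV/PPV across
    surgeon sub-cohorts (e.g., novice vs. expert)."""
    metrics = {c: ppv_npv(y_true[cohorts == c], y_prob[cohorts == c])
               for c in np.unique(cohorts)}
    ppvs = [m[0] for m in metrics.values()]
    npvs = [m[1] for m in metrics.values()]
    overskilling_bias = max(ppvs) - min(ppvs)
    underskilling_bias = max(npvs) - min(npvs)
    return underskilling_bias, overskilling_bias, min(npvs)  # worst-case NPV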
Motivation behind surgeon groups and sub-cohorts
Motivation behind surgeon groups and sub-cohorts
We examined algorithmic bias against several surgeon groups. These included the volume of the prostate gland, the severity of the prostate cancer (Gleason score), and the surgeon caseload. We chose these groups after consultation with a urologist (AH) about their relevance, and the completeness of the clinical meta-information associated with the surgical cases. It may seem counter-intuitive at first to identify surgeon groups based on, for example, the volume of the prostate gland on which they operate. After all, few surgeons make the decision to operate on patients based on such a factor. Although a single surgeon may not have a say over the volume of the prostate gland on which they operate, institution- or geography-specific patient demographics may naturally result in these groups. For example, we found that, in addition to differences in prostate volumes of patients within a hospital, there exists a difference in the distribution of such volumes across hospitals. Therefore, defining surgeon groups based on these factors still provides meaningful insight into algorithmic bias.

Defining surgeon sub-cohorts. In order to quantify bias as a discrepancy in model performance across sub-cohorts, we discretized continuous surgeon groups, where applicable, into two sub-cohorts. To define novice and expert surgeons, we built on previous literature which uses surgeon caseload, the total number of robotic surgeries performed by a surgeon during their lifetime, as a proxy54–56. As such, we define experts as having completed >100 robotic surgeries. As for prostate volume, we used the population median in the USC data to define prostate volume ≤49 ml and >49 ml. We used the population median (a) in
order to have relatively balanced sub-cohorts, thus avoiding the undesirable setting where a sub-cohort has too few samples, and (b) based on previous clinical research, when available, that points to a meaningful threshold. For example, there is evidence which demonstrates that surgeons operating on large prostate glands experience longer operative times than their counterparts57,58. To facilitate the comparison of findings across hospitals, we used the same threshold and definition of the aforementioned sub-cohorts irrespective of the hospital.

Implementation details to control for confounding factors
To determine the relative importance of other factors in automating skill assessment, and thus potentially acting as a source of the underskilling bias, we conducted several experiments. Without loss of generality, we conducted these experiments to check if surgeon caseload was a potential confounding factor of the observed underskilling bias. We chose this factor because we empirically observed an underskilling bias against surgeons with different caseloads and it is plausible for a relationship to exist between caseload and the quality of surgical activity (e.g., in our dataset, 65 and 45% of assessments are considered low-skill for novice and expert surgeons, respectively).

First, after training SAIS on data from USC to assess the skill-level of needle handling, we deployed it on data from the test set (all 10 Monte Carlo cross-validation folds). Recall that SAIS returns a probabilistic value ∈ [0, 1] indicating the probability of a high-skill surgical activity. Following the setup described in Pierson et al.59, we train a logistic regression model where the two independent variables are SAIS' probabilistic output and the binary surgeon caseload (0 is ≤100 cases, 1 is >100 cases) and the dependent variable is the needle handling skill-level. Across the 10 folds, we found that the coefficient of SAIS' probabilistic output is 0.82, SD: 0.15. This suggests that caseload plays a minimal role in the assessment of surgeon skill, and is therefore an unlikely confounding factor.
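A minimal sketch of this check, assuming the statsmodels library and our own variable names, follows. Note that exponentiating a logistic regression coefficient yields an odds ratio; for example, exp(0.82) ≈ 2.27, consistent with the odds ratio reported in the Results.

import numpy as np
import statsmodels.api as sm

def confounding_check(y, p, caseload):
    """y: ground-truth skill (1 = high, 0 = low); p: SAIS' probabilistic
    output in [0, 1]; caseload: 0 if <=100 robotic cases, 1 if >100."""
    X = sm.add_constant(np.column_stack([p, caseload]))
    fit = sm.Logit(y, X).fit(disp=0)
    coef = fit.params[1]            # coefficient of SAIS' output
    return coef, np.exp(coef)       # odds ratio = exp(coefficient)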
To provide further evidence in support of this claim, we conducted an additional study whereby we trained SAIS after having adopted a class and sub-cohort balancing strategy. Sub-cohort balancing implies having the same number of samples from each surgeon sub-cohort in each class. Class balancing implies having the same number of samples from each class. Given a training set of data, we first performed sub-cohort balancing by inspecting each class and under-sampling data (using a random seed = 0) from the surgeon sub-cohort with the larger number of samples. We then performed class balancing by under-sampling data (using a random seed = 0) from the class with the larger number of samples. The validation and test sets in each fold remained unchanged. The intuition is that if caseload were a confounding factor, and SAIS happened to be latching onto caseload-specific features in the videos, then balancing the training data as described above should eradicate, or at least alleviate, the observed underskilling bias. However, we found the underskilling bias persisted to the same degree as before, suggesting that caseload is unlikely to be a confounding factor.
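The two-stage under-sampling described above could be implemented as in the following sketch, assuming a pandas DataFrame with "skill" and "cohort" columns; the column names are our own and not those of the study's codebase.

import pandas as pd

def balance_training_set(train: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Sub-cohort balancing within each skill class, then class balancing,
    both via under-sampling with a fixed random seed."""
    # 1) within each class, under-sample the larger surgeon sub-cohort
    n_cohort = train.groupby(["skill", "cohort"]).size().groupby("skill").min()
    train = (train.groupby(["skill", "cohort"], group_keys=False)
                  .apply(lambda g: g.sample(n=n_cohort[g.name[0]],
                                            random_state=seed)))
    # 2) under-sample the larger skill class
    n_class = train["skill"].value_counts().min()
    return (train.groupby("skill", group_keys=False)
                 .apply(lambda g: g.sample(n=n_class, random_state=seed)))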
TWIX as a bias mitigation strategy
To mitigate algorithmic bias, we adopted a strategy—TWIX28—which involves teaching an AI system to complement its predictions with an explanation that matches one provided by a human. Please refer to the section titled TWIX mitigates underskilling bias across hospitals for the motivation behind our adopting this approach. Next, we briefly outline the mechanics of TWIX and the process we followed for acquiring ground-truth explanation labels.
Mechanics of TWIX. Conceptually, TWIX is a module (i.e., a neural network) which receives a representation of data as input and returns the probability of the importance of that representation. This module is trained in a supervised manner to match the visual and motion cues in the surgical video that human raters deem relevant to a skill assessment. Therefore, in practice, a span of frames was identified as relevant or irrelevant, as opposed to a single frame. Specifically, a team of two trained human raters was asked to highlight the regions of time (equivalently, the span of video frames) deemed most relevant (e.g., 1 = important, 0 = not important) for the assigned skill assessment.
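The released TWIX code (see Code Availability) contains the authoritative implementation; the sketch below conveys only the concept, with illustrative layer sizes and tensor shapes of our own choosing.

```python
import torch
import torch.nn as nn

class TWIXHead(nn.Module):
    """Toy explanation module: maps each frame representation to a logit
    for the probability that the frame is important to the assessment."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.ReLU(), nn.Linear(dim // 2, 1)
        )

    def forward(self, frame_reprs: torch.Tensor) -> torch.Tensor:
        # frame_reprs: (batch, num_frames, dim) -> (batch, num_frames) logits
        return self.head(frame_reprs).squeeze(-1)

# Supervised explanation loss against binary human annotations
# (1 = important frame), optimized alongside the skill-assessment objective:
# expl_loss = nn.BCEWithLogitsLoss()(twix_head(frame_reprs),
#                                    human_labels.float())
```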
Training the raters: Before embarking on the annotation process, a team of two human raters already familiar with the skill assessment taxonomy were trained to identify relevant aspects of video samples. As part of the training, each rater was then provided with the same set of training video samples to be independently annotated with explanations. Since these annotations were temporal (i.e., covering spans of frames), we quantified their agreement by calculating their intersection over union (IoU). This training process continued until the IoU exceeded 0.8, implying at least an 80% overlap in the frames annotated by the raters.
Annotating the video samples: After completing the training process, raters were asked to annotate video samples from USC. Specifically, they only annotated video samples associated with low-skill assessments because (a) it is these assessments that often require further intervention (e.g., for surgeon learning and improvement purposes) and (b) it is more straightforward, and less ambiguous, to annotate relevant frames for a low-skill assessment than for a high-skill assessment. A low-skill assessment, for example, often results from the violation of one or more of the criteria identified in the skill assessment taxonomy. In the event of disagreement between the raters’ annotations, we considered only their intersection. The intuition is that we wanted to avoid annotating potentially superfluous frames as important. For example, if the first rater (R1) identified the span 0−5 s as important whereas the second rater (R2) identified the span 1−6 s as important, the final annotation became 1−5 s (the intersection). We experimented with other approaches, such as considering the union, which we found to have a minimal effect on our findings.
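Both quantities are simple interval computations. The following sketch reproduces the worked example above; the function names are ours.

```python
def iou(span_a, span_b):
    """Intersection over union of two temporal spans (start, end), in seconds."""
    (a0, a1), (b0, b1) = span_a, span_b
    inter = max(0.0, min(a1, b1) - max(a0, b0))
    union = (a1 - a0) + (b1 - b0) - inter
    return inter / union if union > 0 else 0.0

def intersect(span_a, span_b):
    """Final annotation: the overlap of the two raters' spans (None if disjoint)."""
    (a0, a1), (b0, b1) = span_a, span_b
    lo, hi = max(a0, b0), min(a1, b1)
    return (lo, hi) if lo < hi else None

print(iou((0, 5), (1, 6)))        # 4/6 ~= 0.67 -> below the 0.8 bar,
                                  # so rater training would continue
print(intersect((0, 5), (1, 6)))  # (1, 5), as in the example above
```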
Measuring the effectiveness of TWIX as a bias mitigation strategy
The mitigation of algorithmic bias in surgical AI systems is critical as it reduces the rate at which groups are disadvantaged, increases the likelihood of clinical adoption by surgeons, and improves the ethical deployment of AI systems.
Recall that we first defined bias as a discrepancy in the performance of an AI system across different groups (e.g., surgeon levels). Traditionally, algorithmic bias has been considered mitigated if such a discrepancy is reduced. However, this reduction, which would take place via a bias mitigation strategy, can manifest in several ways, thus obfuscating unintended effects. For example, a reduction can occur through unchanged performance for the disadvantaged group and worsened performance for the remaining group. It is debatable whether this behavior is desirable and ethical. An even more problematic scenario is one where the reduction in the discrepancy comes at the expense of depressed performance for all groups. This implies that each group would be further disadvantaged by the AI system.
One way to overcome these limitations is by monitoring improvements in the performance of the AI system achieved for the disadvantaged group, known as the worst-case performance27. Specifically, an improvement in the worst-case performance is a desirable objective in the context of bias mitigation as it implies that an AI system is lowering the rate with which it mistreats the disadvantaged group (e.g., through a lower error rate). We do note, though, that it remains critical to observe the effect of a bias mitigation strategy on all groups (see Discussion section). To that end, we adopted this objective when evaluating whether or not a bias mitigation strategy achieved its desired effect.
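As an illustration of this evaluation objective, the sketch below computes per-group error rates and the worst-case (largest) one; the variable names are hypothetical.

```python
import numpy as np

def worst_case_error(errors, groups):
    """Per-group error rates and the worst-case (largest) error rate.

    errors: (n,) binary indicator of a model mistake on each sample
    groups: (n,) group membership of each sample (e.g., surgeon sub-cohort)
    """
    rates = {g: errors[groups == g].mean() for g in np.unique(groups)}
    return rates, max(rates.values())

# A mitigation strategy is deemed effective if the worst-case error rate
# decreases, while the per-group rates confirm no group is newly harmed.
```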
Description of other bias mitigation strategies
In an attempt to mitigate the bias exhibited by SAIS, we also experimented with two additional strategies, which we outline below.
Strategy 1—training with additional data (TWAD). Data-centric approaches, those which focus on curating a fit-for-purpose dataset with minimal annotation noise on which a model is trained, hold promise for improving the fairness of AI systems33. To that end, we explored the degree to which adding data to the training set would mitigate bias. Specifically, we retrieved video samples annotated (by the same team of human raters) as depicting a medium skill-level and added them to the video samples from the low-skill category. The intuition was that exposing SAIS to more representative training data could help alleviate its bias. After adding these video samples to the low-skill category, we continued to balance the number of video samples from each category (low vs. high-skill). Concretely, before adding data, SAIS was trained on 742 video samples, 50% of which belonged to the low-skill category. After adding data, SAIS was trained on 1522 samples with the same distribution of skill categories.
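A sketch of this dataset construction is shown below, assuming index arrays for the three annotated skill categories; how the 50/50 split is restored (under- vs. over-sampling) is an implementation choice we do not specify here, so the sketch picks under-sampling for brevity.

```python
import numpy as np

def twad_training_indices(low_idx, medium_idx, high_idx, seed=0):
    """Fold medium-skill samples into the low-skill class, then restore a
    50/50 class split (here via under-sampling of the larger class)."""
    rng = np.random.default_rng(seed)
    low = np.concatenate([low_idx, medium_idx])  # augmented low-skill class
    n = min(len(low), len(high_idx))
    return np.concatenate([rng.choice(low, size=n, replace=False),
                           rng.choice(high_idx, size=n, replace=False)])
```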
Strategy 2—surgical video pre-training (VPT). Some evidence has demonstrated that exposing an AI system to vast amounts of unlabeled data before training can help mitigate bias34. To that end, we pre-trained a vision transformer in a self-supervised manner with images of the VUA step from all three hospitals: USC, SAH, and HMH. This amounted to ≈8 million images in total. We followed the same pre-training setup as that described in DINO60. In short, DINO leverages a contrastive objective function in which representations of augmented versions of an image (e.g., randomly cropped, rotated, etc.) are encouraged to be similar to one another. We conducted pre-training for two full epochs on an NVIDIA Titan RTX, lasting 48 h.
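The heart of that objective can be sketched as follows. This is a deliberately simplified rendering of the DINO loss60, omitting multi-crop, the exponential-moving-average (EMA) teacher update, and the centering schedule; the temperature values are common defaults rather than our tuned settings.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, s_temp=0.1, t_temp=0.04):
    """Pull the student's output on one augmented view toward the teacher's
    (centered, sharpened) output on another view of the same image."""
    targets = F.softmax((teacher_out - center) / t_temp, dim=-1).detach()
    log_probs = F.log_softmax(student_out / s_temp, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()

# Per batch, with two augmented views v1 and v2 of each image, and a teacher
# network maintained as an EMA of the student (no gradients flow to it):
# loss = 0.5 * (dino_loss(student(v1), teacher(v2), center)
#               + dino_loss(student(v2), teacher(v1), center))
```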
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
DATA AVAILABILITY
The videos of live surgical procedures from the University of Southern California, St. Antonius Hospital, and Houston Methodist Hospital are not publicly available. However, the videos and the corresponding annotations of the suturing activities performed by medical students in the training environment are available upon reasonable request from the authors.

CODE AVAILABILITY
All models were developed using Python and standard deep learning libraries such as PyTorch61. The code for the underlying model (SAIS) can be accessed at https://github.com/danikiyasseh/SAIS and that for TWIX can be accessed at https://github.com/danikiyasseh/TWIX.

Received: 6 October 2022; Accepted: 21 January 2023;

REFERENCES
1. Wang, Z. & Majewicz Fey, A. Deep learning with convolutional neural network for objective skill evaluation in robot-assisted surgery. Int. J. Comput. Assist. Radiol. Surg. 13, 1959–1970 (2018).
2. Khalid, S., Goldenberg, M., Grantcharov, T., Taati, B. & Rudzicz, F. Evaluation of deep learning models for identifying surgical actions and measuring performance. JAMA Netw. Open 3, e201664 (2020).
3. Kiyasseh, D. et al. A vision transformer for decoding surgeon activity from surgical videos. Nat. Biomed. Eng. 7, 1–17, https://doi.org/10.1038/s41551-023-01010-8 (2023).
4. Ward, T. M. et al. Surgical data science and artificial intelligence for surgical education. J. Surg. Oncol. 124, 221–230 (2021).
5. Huffman, E. M., Rosen, S. A., Levy, J. S., Martino, M. A. & Stefanidis, D. Are current credentialing requirements for robotic surgery adequate to ensure surgeon proficiency? Surg. Endosc. 35, 2104–2109 (2021).
6. Collins, J. W. et al. Ethical implications of AI in robotic surgical training: a Delphi consensus statement. Eur. Urol. Focus 8, 613–622 (2021).
7. Maier-Hein, L. et al. Surgical data science–from concepts toward clinical translation. Med. Image Anal. 76, 102306 (2022).
8. Zorn, K. C. et al. Training, credentialing, proctoring and medicolegal risks of robotic urological surgery: recommendations of the society of urologic robotic surgeons. J. Urol. 182, 1126–1132 (2009).
9. Green, C. A., Levy, J. S., Martino, M. A. & Porterfield Jr, J. The current state of surgeon credentialing in the robotic era. Ann. Laparosc. Endosc. Surg. 5, https://ales.amegroups.com/article/view/5624/html (2020).
10. Darzi, A., Datta, V. & Mackay, S. The challenge of objective assessment of surgical skill. Am. J. Surg. 181, 484–486 (2001).
11. Moorthy, K., Munz, Y., Sarker, S. K. & Darzi, A. Objective assessment of technical skills in surgery. BMJ 327, 1032–1037 (2003).
12. Gallagher, A. G. et al. Virtual reality simulation for the operating room: proficiency-based training as a paradigm shift in surgical skills training. Ann. Surg. 241, 364 (2005).
13. Adams, R. et al. Prospective, multi-site study of patient outcomes after implementation of the TREWS machine learning-based early warning system for sepsis. Nat. Med. 28, 1455–1460 (2022).
14. Lee, J. Y., Mucksavage, P., Sundaram, C. P. & McDougall, E. M. Best practices for robotic surgery training and credentialing. J. Urol. 185, 1191–1197 (2011).
15. Lam, K. et al. A Delphi consensus statement for digital surgery. NPJ Digit. Med. 5, 1–9 (2022).
16. Obermeyer, Z., Powers, B., Vogeli, C. & Mullainathan, S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 366, 447–453 (2019).
17. Seyyed-Kalantari, L., Zhang, H., McDermott, M., Chen, I. Y. & Ghassemi, M. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat. Med. 27, 2176–2182 (2021).
18. Booth, B. M. et al. Bias and fairness in multimodal machine learning: a case study of automated video interviews. In Proc. 2021 International Conference on Multimodal Interaction 268–277 (ACM, 2021).
19. Raghavan, M., Barocas, S., Kleinberg, J. & Levy, K. Mitigating bias in algorithmic hiring: evaluating claims and practices. In Proc. 2020 Conference on Fairness, Accountability, and Transparency 469–481 (ACM, 2020).
20. Domnich, A. & Anbarjafari, G. Responsible AI: gender bias assessment in emotion recognition. Preprint at https://arxiv.org/abs/2103.11436 (2021).
21. Ricci Lara, M. A., Echeveste, R. & Ferrante, E. Addressing fairness in artificial intelligence for medical imaging. Nat. Commun. 13, 1–6 (2022).
22. Vokinger, K. N., Feuerriegel, S. & Kesselheim, A. S. Mitigating bias in machine learning for medicine. Commun. Med. 1, 1–3 (2021).
23. Pfohl, S. et al. Net benefit, calibration, threshold selection, and training objectives for algorithmic fairness in healthcare. In Proc. 2022 ACM Conference on Fairness, Accountability, and Transparency 1039–1052 (ACM, 2022).
24. Marcinkevičs, R., Ozkan, E. & Vogt, J. E. Debiasing deep chest x-ray classifiers using intra- and post-processing methods. In Machine Learning for Healthcare Conference (2022).
25. Liu, E. Z. et al. Just train twice: improving group robustness without training group information. In International Conference on Machine Learning 6781–6792 (PMLR, 2021).
26. Idrissi, B. Y., Arjovsky, M., Pezeshki, M. & Lopez-Paz, D. Simple data balancing achieves competitive worst-group-accuracy. In Conference on Causal Learning and Reasoning 336–351 (PMLR, 2022).
27. Zhang, H. et al. Improving the fairness of chest x-ray classifiers. In Conference on Health, Inference, and Learning 204–233 (PMLR, 2022).
28. Kiyasseh, D. et al. A multi-institutional study using artificial intelligence to provide reliable and fair feedback to surgeons. Commun. Med. 3, 1–12, https://doi.org/10.1038/s43856-023-00263-3 (2023).
29. Mukherjee, P. et al. Confounding factors need to be accounted for in assessing bias by machine learning algorithms. Nat. Med. 28, 1159–1160 (2022).
30. Bernhardt, M., Jones, C. & Glocker, B. Investigating underdiagnosis of AI algorithms in the presence of multiple sources of dataset bias. Preprint at https://arxiv.org/abs/2201.07856 (2022).
31. Maan, Z., Maan, I., Darzi, A. & Aggarwal, R. Systematic review of predictors of surgical performance. Br. J. Surg. 99, 1610–1621 (2012).
32. DeGrave, A. J., Janizek, J. D. & Lee, S.-I. AI for radiographic COVID-19 detection selects shortcuts over signal. Nat. Mach. Intell. 3, 610–619 (2021).
33. Daneshjou, R. et al. Disparities in dermatology AI performance on a diverse, curated clinical image set. Sci. Adv. 8, eabq6147 (2022).
34. Goyal, P. et al. Vision models are more robust and fair when pretrained on uncurated images without supervision. Preprint at https://arxiv.org/abs/2202.08360 (2022).
35. Liang, W. et al. Advances, challenges and opportunities in creating data for trustworthy AI. Nat. Mach. Intell. 4, 669–677 (2022).
36. Friedler, S. A. et al. A comparative study of fairness-enhancing interventions in machine learning. In Proc. Conference on Fairness, Accountability, and Transparency 329–338 (ACM, 2019).
37. Wick, M., Panda, S. & Tristan, J.-B. Unlocking fairness: a trade-off revisited. In Proc. 33rd International Conference on Neural Information Processing Systems 8783–8792 (Curran Associates Inc., 2019).
38. Dutta, S. et al. Is there a trade-off between fairness and accuracy? A perspective using mismatched hypothesis testing. In International Conference on Machine Learning 2803–2813 (PMLR, 2020).
39. Rodolfa, K. T., Lamba, H. & Ghani, R. Empirical observation of negligible fairness–accuracy trade-offs in machine learning for public policy. Nat. Mach. Intell. 3, 896–904 (2021).
40. Rudzicz, F. & Saqur, R. Ethics of artificial intelligence in surgery. Preprint at https://arxiv.org/abs/2007.14302 (2020).
41. Seastedt, K. P. et al. A scoping review of artificial intelligence applications in thoracic surgery. Eur. J. Cardiothorac. Surg. 61, 239–248 (2022).
42. Wilhelm, D. et al. Ethische, legale und soziale Implikationen bei der Anwendung künstliche-Intelligenz-gestützter Technologien in der Chirurgie [Ethical, legal and social implications of applying artificial-intelligence-supported technologies in surgery]. Der Chirurg 93, 223–233 (2022).
43. Schrouff, J. et al. Maintaining fairness across distribution shift: do we have viable solutions for real-world applications? Preprint at https://arxiv.org/abs/2202.01034 (2022).
44. Fallin-Bennett, K. Implicit bias against sexual minorities in medicine: cycles of professional influence and the role of the hidden curriculum. Acad. Med. 90, 549–552 (2015).
45. Klein, R. et al. Gender bias in resident assessment in graduate medical education: review of the literature. J. Gen. Intern. Med. 34, 712–719 (2019).
46. Barnes, K. L., McGuire, L., Dunivan, G., Sussman, A. L. & McKee, R. Gender bias experiences of female surgical trainees. J. Surg. Educ. 76, e1–e14 (2019).
47. Hemphill, M. E., Maher, Z. & Ross, H. M. Addressing gender-related implicit bias in surgical resident physician education: a set of guidelines. J. Surg. Educ. 77, 491–494 (2020).
48. Kiyasseh, D., Zhu, T. & Clifton, D. A clinical deep learning framework for continually learning from cardiac signals across diseases, time, modalities, and institutions. Nat. Commun. 12, 1–11 (2021).
49. Barocas, S., Hardt, M. & Narayanan, A. Fairness and Machine Learning (fairmlbook.org, 2019).
50. Buolamwini, J. & Gebru, T. Gender shades: intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency 77–91 (PMLR, 2018).
51. Wagner, C. H. Simpson's paradox in real life. Am. Stat. 36, 46–48 (1982).
52. Haque, T. F. et al. Development and validation of the end-to-end assessment of suturing expertise (EASE). J. Urol. 207, e153 (2022).
53. Pfohl, S. et al. Creating fair models of atherosclerotic cardiovascular disease risk. In Proc. 2019 AAAI/ACM Conference on AI, Ethics, and Society 271–278 (ACM, 2019).
54. Hung, A. J. et al. Face, content and construct validity of a novel robotic surgery simulator. J. Urol. 186, 1019–1025 (2011).
55. Hung, A. J. et al. Validation of a novel robotic-assisted partial nephrectomy surgical training model. BJU Int. 110, 870–874 (2012).
56. Hung, A. J. et al. Development and validation of objective performance metrics for robot-assisted radical prostatectomy: a pilot study. J. Urol. 199, 296–304 (2018).
57. Martinez, C. H. et al. Effect of prostate gland size on the learning curve for robot-assisted laparoscopic radical prostatectomy: does size matter initially? J. Endourol. 24, 261–266 (2010).
58. Goldstraw, M. et al. Overcoming the challenges of robot-assisted radical prostatectomy. Prostate Cancer Prostatic Dis. 15, 1–7 (2012).
59. Pierson, E., Cutler, D. M., Leskovec, J., Mullainathan, S. & Obermeyer, Z. An algorithmic approach to reducing unexplained pain disparities in underserved populations. Nat. Med. 27, 136–140 (2021).
60. Caron, M. et al. Emerging properties in self-supervised vision transformers. In Proc. IEEE/CVF International Conference on Computer Vision 9650–9660 (IEEE, 2021).
61. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Proc. 33rd International Conference on Neural Information Processing Systems 8026–8037 (Curran Associates Inc., 2019).

ACKNOWLEDGEMENTS
Research reported in this publication was supported by the National Cancer Institute under Award No. R01CA251579-01A1.

AUTHOR CONTRIBUTIONS
D.K. contributed to the conception of the study and the study design, developed the deep learning models, and wrote the manuscript. J.L. collected the data from the training environment. D.K., J.L., T.F.H., and M.O. provided annotations for the video samples. D.A.D. and Q.-D.T. provided feedback on the manuscript. C.W. collected data from St. Antonius Hospital and B.J.M. collected data from Houston Methodist Hospital and provided feedback on the manuscript. A.J.H. and A.A. provided supervision and contributed to edits of the manuscript.

COMPETING INTERESTS
The authors declare no competing non-financial interests but the following competing financial interests: D.K. is a paid consultant of Flatiron Health and an employee of Vicarious Surgical, C.W. is a paid consultant of Intuitive Surgical, A.A. is an employee of Nvidia, and A.J.H. is a consultant of Intuitive Surgical.

ADDITIONAL INFORMATION
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41746-023-00766-2.

Correspondence and requests for materials should be addressed to Dani Kiyasseh or Andrew J. Hung.

Reprints and permission information is available at http://www.nature.com/reprints

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2023