population. Results for one population may not necessarily translate to others (5,7). Thus, the patient population should be defined in the claim. This includes aspects such as sex, ethnicity, age group, geographic location, disease stage, social determinants of health, and other disease- and application-relevant biologic variables.

II.3. Definition of Imaging Process
The imaging system, acquisition protocol, and reconstruction and analysis parameters may affect task performance. For example, an AI algorithm evaluated for a high-resolution PET system may rely on high-frequency features captured by this system and thus not apply to low-resolution systems (8). Depending on the algorithm, specific acquisition protocol parameters may need to be specified, or the requirement to comply with a certain accreditation standard, such as the SNMMI Clinical Trials Network, RSNA QIBA profile, or EARL standards, may need to be stated. For example, an AI-based denoising algorithm for ordered-subsets expectation-maximization (OSEM)-based reconstructed images may not apply to images reconstructed using filtered backprojection, or even to a different number of OSEM iterations, since noise properties change with iteration number. Thus, depending on the application, the claim should specify the imaging protocol. Further, if the algorithm was evaluated across multiple scanners or with multiple protocols, that should be specified.

II.4. Process to Extract Task-Specific Information
Task-based evaluation of an imaging algorithm requires a strategy to extract task-specific information from the images. For classification tasks, a typical strategy is to have human observer(s) read the images, detect lesions, and classify the patient or each detected lesion into a certain class (e.g., malignant or benign). Here, observer competency (multiple trained radiologists/one trained radiologist/resident/untrained reader) will impact task performance. The choice of strategy may also impact confidence in the validity of the algorithm. This is also true for quantification and joint classification/quantification tasks. Thus, this strategy should be specified in the claim.

II.5. Figure of Merit (FoM) to Quantify Task Performance
FoMs quantitatively describe the algorithm's performance on the clinical task, enabling comparison of different methods, comparison to standard of care, and definition of quantitative metrics of success. FoMs should be accompanied by confidence intervals (CIs), which quantify uncertainty in performance. To obtain the FoM, a reference standard is needed. The process to define the reference standard should be stated.

The Claim Describes the Generalizability of an AI Algorithm: Generalizability is defined as an algorithm's ability to work properly with new, previously unseen data, such as data from a different institution or scanner, data acquired with a different image-acquisition protocol, or data processed by a different reader. By providing all the components of a claim, an evaluation study will describe the algorithm's generalizability to unseen data, since the claim will specify the characteristics of the population used for evaluation, state whether the evaluation was single-center or multicenter, and define the image acquisition and analysis protocols used, as well as the competency of the observer performing the evaluation study. Figure 2 presents a schematic showing how different kinds of generalizability could be established. Some key points from this figure are:

- Providing evidence for generalizability requires external validation. This is defined as validation where some portion of the testing study, such as the data (patient population demographics) or the process used to acquire the data, differs from that in the development cohort. Depending on the level of external validation, the claim can be appropriately defined.
- For a study that claims to be generalizable across populations, scanners, and readers, the external cohort would come from different patient demographics, be acquired on different scanners, and be analyzed by different readers than the development cohort, respectively.
- Multicenter studies provide higher confidence about generalizability compared with single-center studies, since they typically include some level of external validation (patients from different geographic locations/different scanners/different readers).

III. METHODS FOR EVALUATION
The evaluation framework for AI algorithms is provided in Figure 3. The 4 classes of this framework are differentiated based on their objectives, as briefly described below, with details provided in the ensuing subsections. An example for an AI low-dose PET reconstruction algorithm is provided. Figure 3 contains another example for an AI-based automated segmentation algorithm. A detailed example of using this framework to evaluate a hypothetical AI-based transmission-less attenuation compensation method for SPECT
FIGURE 3. Framework for evaluation of AI-based algorithms. The left of the pyramid provides a brief description of the phase, and the right provides
an example of evaluating an AI-based segmentation algorithm on the task of evaluating MTV using this framework.
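The external-validation principle described in the key points above can be made concrete with a data split in which the test cohort comes from a site not seen during development. The following Python sketch is illustrative only; the cohort table, column names, and institution labels are hypothetical placeholders rather than anything defined in this article.

```python
# Minimal sketch of an external-validation split: every patient from one
# institution (an unseen site and scanner) is held out as the test cohort.
import pandas as pd

cohort = pd.DataFrame({
    "patient_id":  ["P01", "P02", "P03", "P04", "P05", "P06"],
    "institution": ["A", "A", "A", "B", "B", "C"],
    "scanner":     ["S1", "S1", "S2", "S3", "S3", "S4"],
})

held_out_site = "C"  # hypothetical external site
external_test = cohort[cohort["institution"] == held_out_site]
development = cohort[cohort["institution"] != held_out_site]  # training + validation

# No patient may appear in both the development and the external test cohorts.
assert set(development["patient_id"]).isdisjoint(external_test["patient_id"])
print(f"{len(development)} development patients, {len(external_test)} external test patients")
```

Stronger generalizability claims (across populations, scanners, and readers) would widen what differs between the development and external cohorts, in line with the key points listed above.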
require multiple studies. Further, necessary resources (such as a large, representative dataset) may not be available to conduct these studies. Thus, a task-agnostic objective facilitates timely dissemination and widens the scope of newly developed AI methods.

III.1.2. Study Design: The following are recommended best practices to conduct POC evaluation of an AI algorithm. Best practices to develop the algorithm are covered in the companion paper (16).

Data Collection: In POC evaluation, the study can use realistic simulations, physical phantoms, or retrospective clinical or research data, usually collected for a different purpose, for example, routine diagnosis. The data used for evaluation may come from the development cohort, that is, the same overall cohort that the training and validation cohorts were drawn from. However, there must be no overlap between these data. Public databases, such as those available at The Cancer Imaging Archive (18) and from medical image analysis challenges, such as at https://grand-challenge.org, can also be used.

Defining Reference Standard: For POC evaluations conducted with simulations and physical phantoms, the ground truth is known. For clinical data, curation by readers may be used; at this stage the reference standard need not be of the highest quality, and curation by a single reader may be sufficient.

Testing Procedure: The testing procedure should be designed to demonstrate promising technologic innovation. The algorithm should thus be compared against reference and/or standard-of-care methods and, preferably, other state-of-the-art algorithms.

Figures of Merit: While the evaluation is task-agnostic, the FoMs should be carefully chosen to show promise for progression to clinical task evaluation. For example, evaluating a new denoising algorithm that overly smooths the image at the cost of resolution using the FoM of contrast-to-noise ratio may be misleading. In such cases, a FoM such as the structural similarity index may be more relevant. We recommend evaluating algorithms using multiple FoMs. A list of some FoMs is provided in Supplemental Table 1.

III.1.3. Output Claim of the POC Study: The claim should state the following:
- The application (e.g., segmentation, reconstruction) for which the method is proposed.
- The patient population.
- The imaging and image analysis protocol(s).
- The process used to define the reference standard.
- Performance as quantified with a task-agnostic evaluation metric.

We reemphasize that the POC study claim should not be interpreted as an indication of the algorithm's expected performance in a clinical setting or on any clinical task.

Example Claim: Consider the evaluation of a new segmentation algorithm. The claim could read as follows: "An AI-based PET tumor-segmentation algorithm evaluated on 50 patients with locally advanced breast cancer acquired on a single scanner with single-reader evaluation yielded mean Dice scores of 0.78 (95% CI 0.71-0.85)."

III.2. Technical Task-Specific Evaluation
III.2.1. Objective: The objective of technical task-specific evaluation is to evaluate the technical performance of an AI algorithm on specific clinically relevant tasks, such as detection and quantification, using FoMs that quantify aspects such as accuracy (discrimination accuracy for a detection task and measurement bias for a quantification task) and precision (reproducibility and repeatability). The objective is not to assess the utility of the method in clinical decision making, because clinical decision making is a combination of factors beyond technical aspects, such as prior clinical history, patient biology, other patient characteristics (age/sex/ethnicity), and results of other clinical tests. Thus, this evaluation does not consider clinical outcomes. For example, evaluating the accuracy of an AI-based segmentation method to measure MTV would be a technical efficacy study. This study would not assess whether more accurate MTV measurement led to any change in clinical outcome.

III.2.2. Study Design: Given the goal of evaluating technical performance, the evaluation should be performed in controlled settings. Practices for designing such studies are outlined below. A framework and summary of tools to conduct these studies in the context of PET is provided in Jha et al. (10).
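As an illustration of reporting a FoM with a confidence interval, as recommended in section II.5 and quoted in the example POC claim above, the following Python sketch computes a nonparametric percentile-bootstrap 95% CI for a mean Dice score. The per-patient Dice values are synthetic placeholders, not results from this article.

```python
# Minimal sketch: percentile-bootstrap 95% CI for a mean Dice score.
# The per-patient Dice values below are synthetic, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
dice_per_patient = rng.normal(loc=0.78, scale=0.10, size=50).clip(0.0, 1.0)

n_boot = 10_000
boot_means = np.empty(n_boot)
for b in range(n_boot):
    resample = rng.choice(dice_per_patient, size=dice_per_patient.size, replace=True)
    boot_means[b] = resample.mean()

ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean Dice = {dice_per_patient.mean():.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```

The same patient-level resampling can be applied to other FoMs, such as those listed in Supplemental Table 1.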
Table 1. Criterion that can be evaluated with each study type (simulation studies, physical phantoms, clinical studies):
- Accuracy: simulation studies (Y); physical phantoms (Y)
- Repeatability/reproducibility/noise sensitivity with multiple replicates: simulation studies (Y); physical phantoms (Y)
- Repeatability/reproducibility/noise sensitivity with test–retest replicates: physical phantoms (Y); clinical studies (yes, and recommended)
- Biologic repeatability/reproducibility/noise sensitivity: clinical studies (Y)
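To illustrate the accuracy and test–retest repeatability criteria in Table 1 for a quantification task such as MTV estimation, the sketch below computes measurement bias against a reference standard and a test–retest repeatability coefficient. The MTV values are synthetic placeholders, and the repeatability-coefficient definition used here (1.96·√2 times the within-subject SD) is one common convention, not a formula prescribed by this article.

```python
# Illustrative only: technical FoMs for a quantification task (MTV, in mL).
# Values are synthetic; in practice they would come from the evaluation study.
import numpy as np

reference_mtv = np.array([120.0, 85.0, 250.0, 40.0, 160.0])   # reference standard
ai_mtv        = np.array([112.0, 90.0, 230.0, 43.0, 150.0])   # AI estimates, session 1
ai_mtv_retest = np.array([115.0, 88.0, 236.0, 41.0, 155.0])   # AI estimates, session 2

# Accuracy: mean (and percentage) bias of the AI estimates vs. the reference standard.
bias = np.mean(ai_mtv - reference_mtv)
percent_bias = 100.0 * np.mean((ai_mtv - reference_mtv) / reference_mtv)

# Precision: test-retest repeatability coefficient, 1.96 * sqrt(2) * within-subject SD,
# with the within-subject SD estimated from the paired session differences.
diff = ai_mtv - ai_mtv_retest
within_subject_sd = np.sqrt(np.mean(diff**2) / 2.0)
repeatability_coefficient = 1.96 * np.sqrt(2.0) * within_subject_sd

print(f"bias = {bias:.1f} mL ({percent_bias:.1f}%)")
print(f"repeatability coefficient = {repeatability_coefficient:.1f} mL")
```

Whether such values come from simulations, physical phantoms, or clinical test–retest scans depends on the study type chosen, as described next.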
Study Type: A technical evaluation study can be conducted through the following mechanisms:

1. Realistic simulations are studies conducted with anthropomorphic digital phantoms simulating patient populations, where measurements corresponding to these phantoms are generated using accurately simulated scanners. This includes virtual clinical trials, which can be used to obtain population-based inferences (19–21).
2. Anthropomorphic physical phantom studies are conducted on the scanners with devices that mimic human anatomy and physiology.
3. Clinical-data-based studies are those where clinical data are used to evaluate the technical performance of an AI algorithm, for example, the repeatability of an AI algorithm measuring MTV in test–retest PET scans.

The tradeoffs with these 3 study types are listed in Table 1. Each study type can be single- or multiscanner/center, depending on the claim:

- Single-center/single-scanner studies are typically performed with a specific system, image acquisition, and reconstruction protocol. In these studies, algorithm performance can be evaluated for variability in patients, including different demographics, habitus, or disease characteristics, while keeping the technical aspects of the imaging procedures constant. These studies can measure the sensitivity of the algorithm to patient characteristics. They can also study the repeatability of the AI algorithm. Reproducibility may be explored by varying factors such as reconstruction settings.
- Multicenter/multiscanner studies are mainly suitable to explore the sensitivity of the AI algorithm to acquisition variabilities, including variability in imaging procedures, systems, reconstruction methods and settings, and, if clinical data are used, patient demographics. Typically, multicenter studies are performed to improve patient accrual in trials, and therefore the same inclusion and exclusion criteria are applied to all centers. Further, multicenter studies can help assess the need for harmonization of imaging procedures and system performances.

Data Collection:
- Realistic simulation studies: To conduct realistic simulations, multiple digital anthropomorphic phantoms are available (22). In virtual clinical trial–based studies, the distribution of simulated image data should be similar to that observed in clinical populations. For this purpose, parameters derived directly from clinical data can be used during simulations (4). Expert reader–based studies can be used to validate the realism of simulations (23). Next, to simulate the imaging systems, tools such as GATE (24), SIMIND (25), SimSET (26), PeneloPET (27), and others (10) can be used. Different system configurations, including those replicating multicenter settings, can be simulated. If the methods use reconstruction, then clinically used reconstruction protocols should be simulated. Simulation studies should not use data used for algorithm training/validation.
- Anthropomorphic physical phantom studies: For clinical relevance, the tracer uptake and acquisition parameters when imaging these phantoms should mimic those in clinical settings. To claim generalizable performance across different scanner protocols, different clinical acquisition and reconstruction protocols should be used. A phantom used during training should not be used during evaluation, irrespective of changes in acquisition conditions between training and test phases.
- Clinical data: Technical evaluation studies will typically be retrospective. Use of external datasets, such as those from an institution or scanner not used for method training/validation, is
FIGURE 6. Chart showing the different objectives of postdeployment monitoring, grouped as a function of the scope and goal of the study.

tracking, and ongoing technical performance benchmarking (51). This approach may provide additional evidence on algorithm performance and could assist in finding areas of improvement, clinical needs not well served, or even deriving a hypothesis for further

a mechanism for postdeployment evaluation by serving as sanity checks to ensure that the AI algorithm was not affected by a maintenance operation such as a software update. These studies could include assessing contrast or SUV recovery, absence of nonuniformities or artifacts, cold-spot recovery, and other specialized tests depending on the AI algorithm. Also, tests can be conducted to
ensure that there is a minimal or harmonized image quality as required by the AI tool for the configurations as stated in the claim.

AI systems likely will operate on data generated in nonstationary environments, with shifting patient populations and clinical and operational practices changing over time (9). Postdeployment studies can help identify these dataset shifts and assess whether recalibration or retraining of the AI method may be necessary to maintain performance (52,53). Monitoring the distribution of various patient population descriptors, including demographics and disease prevalence, can provide cues for detecting dataset shifts. In the case of changes in these descriptors, the output of the AI algorithm can be verified by physicians for randomly selected test cases. A possible solution to data shift is continuous learning of the AI method (54). In Supplemental Section B, we discuss strategies (55–57) to evaluate continuous-learning-based methods.

Off-Label Evaluation: Typically, an AI algorithm is trained and tested using a well-defined cohort of patients, in terms of patient demographics, applicable guidelines, practice preferences, reader expertise, imaging instrumentation, and acquisition and analysis protocols. However, the design of the algorithm may suggest acceptable performance in cohorts outside the intended scope of the algorithm. Here, a series of cases is appropriate to collect preliminary data that may suggest a more thorough trial. An example is a study where an AI algorithm that was trained on patients with lymphoma and lung cancer (58) showed reliable performance in patients with breast cancer (59).

Collecting Feedback for Proactive Development: Medical products typically have a long lifetime. This motivates proactive development and maintenance to ensure that a product represents the state of the art throughout its lifetime. This may be imperative for AI, where technologic innovations are expected to evolve at a fast pace in the coming years. A deployed AI algorithm offers the opportunity to pool data from several users. Specifically, registry approaches enable cost-efficient pooling of uniform data, multicenter observational studies, and POC studies that can be used to develop a new clinical hypothesis or evaluate specific outcomes for particular diseases.

Figures of Merit: We provide the FoMs for the studies where quantitative metrics of success are defined.
- Monitoring study with clinical data: frequency of clinical usage of the AI algorithm, and number of times the AI-based method changed clinical decisions or affected patient management.
- Monitoring study with routine physical phantom studies: since these are mostly sanity checks, FoMs similar to those used when evaluating POC studies may be considered. In case task-based evaluation is required, FoMs as provided in Supplemental Table 1 may be used.
- Off-label evaluation: FoMs similar to those used when performing technical and clinical evaluation may be considered.

IV. DISCUSSION
The key recommendations from this article are summarized in Table 2. These are referred to as the RELAINCE (Recommendations