2 Imaging Biomarkers and Computer-Aided Diagnosis Laboratory, Clinical Center
3 Center for Information Technology, National Institutes of Health
4 King Abdullah University of Science & Technology
5 Biomedical Image Analysis, Imperial College London
Abstract. The volume of CT exams performed worldwide has risen every year, contributing to radiologist burnout. Large Language Models (LLMs) have the potential to reduce this burden, but their adoption in the clinic depends on radiologist trust and easy evaluation of generated content. Many automated methods are available to evaluate the reports generated for chest radiographs, but no equivalent approach exists for CT. In this paper, we propose a
novel evaluation framework to judge the capabilities of vision-language
LLMs in generating accurate summaries of CT-based abnormalities. We
input CT slices containing an abnormality (e.g., a lesion) into a vision-based LLM (GPT-4V, LLaVA-Med, or RadFM) to generate a free-text summary of the predicted characteristics of the abnormality. Next, a GPT-4
model decomposed the summary into specific aspects (body part, loca-
tion, type, and attributes), automatically evaluated the characteristics
against the ground-truth, and generated a score for each aspect based on
its clinical relevance and factual accuracy. These scores were then con-
trasted against those obtained from a clinician, and a high correlation
(≥ 85%, p < .001) was observed. Although GPT-4V outperformed other
models in our evaluation, it still requires overall improvement. Our eval-
uation method offers valuable insights into the specific areas that need
the most enhancement, guiding future development in this field.
1 Introduction
In current clinical practice, a radiologist communicates the results of an imaging
exam for a patient to their referring doctor through a signed report. While read-
ing the patient exam, the radiologist uses Speech Recognition Software (SRS)
⋆ These authors contributed equally to this work.
⋆⋆ Joint senior co-authors.
that converts dictated speech into text. SRS is widely used and has significantly
reduced the report turn-around time. However, any errors resulting from the
dictation have to be corrected by the radiologists themselves, and persistent
communication errors can negatively impact the interpretation of patient diag-
noses and lead to medical malpractice suits [1]. These errors are most common
for cross-sectional imaging [2], such as CT and MR, and the volume of these
exams has steadily increased each year [3]. This has contributed to a 54-72% radiologist burnout rate [4], as radiologists are under increased pressure to handle a substantially higher number of patients while maintaining a high level of accuracy.
To reduce their burden, a myriad of transformer-based approaches have been proposed to generate radiology reports in one shot [5,6]. However,
chest radiographs (CXR) have been the singular focus of these works with scant
effort devoted to other modalities, such as CT [7]. The development of such
a method for CT presents unique challenges, attributable to the intrinsic 3D
nature of CT data, computational complexity, and extensive reporting detail
necessary for clinical applications. However, there have been recent advances
with Large Language Models (LLMs), like GPT-4 [8], and vision-based LLMs,
such as GPT-4 Vision (GPT-4V), LLaVA-Med [9], and Radiology Foundation
Model (RadFM) [10]. These multimodal models have demonstrated their capa-
bilities across several tasks, such as passing medical exams, taking medical notes, and diagnosing diseases [11,12,13]. Therefore, they have immense potential to
pre-fill the “findings” section of a radiology report with relevant information
that a radiologist can quickly review [14].
Despite these advances, crucial factors determining their clinical use involve:
(1) radiologist trust, and (2) easy interpretation and evaluation of the generated
content. Current evaluation metrics, including Natural Language Generation
(NLG) and Clinical Efficacy (CE) metrics, are notoriously limited [15,14,16,17]
when it comes to capturing the semantic richness and clinical relevance neces-
sary for radiology reports. Additionally, they lack the explanatory power that
is required for clinical use. For CXR images, a variety of automated methods for validating the clinical accuracy of reports have been established [18,19,15]. However,
an equivalent automated system for the validation of clinical accuracy in CT is
notably absent. These limitations have prompted the exploration of LLM-based
evaluation methods that promise a nuanced assessment aligned with human
judgment [20,16] and emphasize accuracy and relevance of generated content.
In this paper, we present a novel evaluation framework that judges the ability of a vision-based LLM to generate diagnostically accurate findings, such
that they can be pre-filled into the radiology report. CT slices containing an ab-
normality (e.g., lesion) were input to a vision-based LLM (e.g., GPT-4V), and
it generated a free-text summary of the predicted characteristics of the abnor-
mality. Next, a language-centric GPT-4 model decomposed the summary into
specific aspects (body part, location, type, and attributes), automatically eval-
uated the characteristics against the ground-truth annotations, and generated a
score for each aspect based on its clinical relevance and factual accuracy. These
scores were then contrasted against those obtained from a clinician, and a high
correlation (≥ 85%, p < .001) was observed. Our approach is unique in that
it combines the expertise of radiologists with the Chain-of-Thought (COT) [21]
reasoning from LLMs to evaluate the characteristics of abnormalities predicted
by vision-based LLMs.
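For illustration, the decompose-and-score step could be orchestrated as in the following minimal sketch, assuming the OpenAI Python client (v1+); the prompt wording, 0-1 scoring scale, and helper name are our own illustrative choices rather than the exact prompts used in this work.

```python
# Minimal sketch of the decompose-and-score step; prompt wording and
# scoring scale are illustrative assumptions, not the paper's prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ASPECTS = ["body part", "location", "type", "attributes"]

def score_summary(predicted_summary: str, ground_truth: str) -> str:
    """Ask GPT-4 to decompose a predicted lesion summary into aspects
    and grade each aspect against the ground-truth annotation."""
    prompt = (
        "You are a radiologist grading a model-generated lesion summary.\n"
        f"Decompose the summary into these aspects: {', '.join(ASPECTS)}.\n"
        "For each aspect, compare against the ground truth and assign a "
        "score from 0 (wrong or missing) to 1 (clinically accurate), "
        "with a one-sentence justification (chain-of-thought).\n\n"
        f"Predicted summary: {predicted_summary}\n"
        f"Ground truth: {ground_truth}\n"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic grading
    )
    return response.choices[0].message.content
```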
Contributions: (1) The proposed auto-evaluation framework is a novel ap-
proach that decomposes the characteristics of a CT-based abnormal finding gen-
erated by a vision-based LLM into specific aspects, such that distinct dimensions
of report quality can be effectively isolated and verified. (2) Three recent vision-based LLMs were evaluated for their capability to generate summaries of CT-based findings. (3) Our results solidify the limitations of traditional NLG metrics for capturing factual accuracy and reporting complexity.
2 Methods
Dataset: To the best of our knowledge, there is no publicly available dataset with
CT exams and paired radiology reports. Consequently, this study utilizes a sub-
set of the publicly available DeepLesion dataset [22,23], comprising 496 CT vol-
umes (496 studies) from 486 patients. The subset contained 500 lesions of various
kinds (e.g., liver, kidney, bone) that were prospectively marked in 500 CT
slices. The dataset also provided specific characteristics of lesions that were ex-
tracted from the sentences in the radiology reports using an automated method
[23]. These included the body part, location, type of lesion, and shape and ap-
pearance attributes. As certain lesion characteristics were missed by the auto-
mated tool, two board-certified radiologists, each with 10+ years of experience,
manually reviewed and comprehensively annotated any missing lesion character-
istics. The dataset included both male and female participants (M: 294, F: 192),
ranging in age from 2 to 87 years (mean: 52.2, s.d.: 17.7).
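To make the annotation schema concrete, the sketch below shows one plausible in-memory representation of a single annotated lesion; the field names mirror the characteristics described above and are our own illustration, not DeepLesion's actual file format.

```python
# Illustrative record structure for one annotated lesion; field names
# are assumptions, not DeepLesion's actual column names.
from dataclasses import dataclass, field
from typing import List

@dataclass
class LesionRecord:
    study_id: str          # one of the 496 CT studies
    slice_path: str        # CT slice containing the marked lesion
    bbox: List[float]      # lesion bounding box [x1, y1, x2, y2]
    body_part: str         # e.g., "liver"
    location: str          # e.g., "right lobe of the liver"
    lesion_type: str       # e.g., "mass", "cyst"
    attributes: List[str] = field(default_factory=list)
    # e.g., ["hypodense", "well-circumscribed"]
```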
Decomposing Vision-based LLM Predictions for Auto-Evaluation: The
proposed framework decomposes the predicted descriptions of CT-based find-
ings (e.g., lesions) by a vision-based LLM, such that a language-centric LLM (GPT-4) can evaluate each aspect against the ground-truth annotations.
GPT-4V: ‘Location’: ‘Right abdomen, near the kidney’, ‘Body Part’: ‘Abdomen, kidney’, ‘Type’: ‘Mass’, ‘Attributes’: ‘Well-circumscribed, homogenous’, ‘Impression’: ‘Well-circumscribed homogenous mass in the right abdomen, adjacent to the right kidney’
LLaVA-Med: The image is an axial computed tomography (CT) scan that has been
annotated with a bounding box by a radiologist. The bounding box is a rectangular box
that encloses the lesion, which is indicated in green. This helps to highlight the area of
interest and provides a clear view of the lesion’s location, size, and shape.
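Because GPT-4V can be prompted to emit its summary in the key-value form shown above, downstream decomposition is straightforward. The following is a hypothetical parser for that format, assuming the model's output is a well-formed, dict-like string of quoted pairs.

```python
import ast

# Hypothetical parser for the key-value style GPT-4V output shown
# above; assumes the model emits well-formed quoted key-value pairs.
def parse_aspects(raw: str) -> dict:
    """Convert "'Location': '...', 'Type': '...'" into a dict."""
    return ast.literal_eval("{" + raw + "}")

pred = parse_aspects(
    "'Location': 'Right abdomen, near the kidney', "
    "'Body Part': 'Abdomen, kidney', 'Type': 'Mass', "
    "'Attributes': 'Well-circumscribed, homogenous'"
)
print(pred["Type"])  # -> Mass
```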
Table 2. Correlation scores between the clinician and GPT-4 grading of reports.
For LLaVA-Med and RadFM, COT prompting did not improve performance, which aligns with findings in the literature [25]. This sets them apart from GPT-4V, which
demonstrated enhanced performance when employing COT reasoning. Further-
more, the inclusion of bounding boxes generally led to improved model perfor-
mance as it provided guidance for the models, with GPT-4V showing the most
significant enhancement. Amongst the evaluated models, GPT-4V outperformed
its counterparts. However, despite these advancements, the overall quality of pre-
filled findings for CT reporting remained suboptimal. This indicated a pressing
need for further development of vision-based LLMs to meet clinical standards.
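As a sketch of the agreement analysis summarized in Table 2, per-aspect scores from the clinician and from GPT-4 can be correlated as below; the score values are placeholders, and the choice of Spearman rank correlation is one reasonable option rather than the method stated in the paper.

```python
# Sketch of the clinician-vs-GPT-4 agreement analysis; the score
# values are placeholders, not results from the paper.
from scipy.stats import spearmanr

clinician_scores = [0.9, 0.5, 1.0, 0.7, 0.8]  # placeholder values
gpt4_scores      = [0.8, 0.4, 1.0, 0.6, 0.9]

rho, p_value = spearmanr(clinician_scores, gpt4_scores)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3g}")
```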
Consider the ground truth text ‘left mediastinal pleural enhancing mass’
and a predicted result by RadFM ‘Axial contrast enhanced images through the
thoracic inlet demonstrating a homogeneous mass posterior to the trachea’. It
is evident that the NLG scores are low owing to the minimal textual overlap. Furthermore, individuals without medical training would likewise judge the two texts as dissimilar, since an understanding of anatomy is necessary to recognize that both pertain to the chest and lung regions.
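The weakness is easy to reproduce. The sketch below computes simple token overlap between the two sentences as a stand-in for BLEU-style NLG metrics: only a single token ("mass") is shared, so any overlap-based score is near zero despite both sentences describing the same chest finding.

```python
# Worked example of the overlap problem for the two sentences quoted
# above; token overlap is used as a stand-in for BLEU-style metrics.
from collections import Counter

gt = "left mediastinal pleural enhancing mass".lower().split()
pred = ("axial contrast enhanced images through the thoracic inlet "
        "demonstrating a homogeneous mass posterior to the trachea"
        ).lower().split()

overlap = sum((Counter(gt) & Counter(pred)).values())  # shared tokens
precision = overlap / len(pred)
recall = overlap / len(gt)
print(f"overlap = {overlap}, precision = {precision:.2f}, "
      f"recall = {recall:.2f}")  # overlap = 1 ("mass") -> near-zero scores
```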
5 Acknowledgements
This research was supported by the Intramural Research Program of the National
Library of Medicine and Clinical Center at the NIH.
References
1. John J. Smith and Leonard Berlin. Signing a colleague’s radiology report. Amer-
ican Journal of Roentgenology, 176(1):27–30, 2001. PMID: 11133532.
2. Michael D Ringler, Brian C Goss, and Brian J Bartholmai. Syntactic and semantic
errors in radiology reports associated with speech recognition software. Health
Informatics Journal, 23(1):3–13, 2017. PMID: 26635322.
3. M. Mahesh, A. J. Ansari, and F. A. Mettler Jr. Patient exposure from radiologic and nuclear medicine procedures in the United States and worldwide: 2009-2018. Radiology, 301, 2023.
4. N. A. Fawzy, M. J. Tahir, A. Saeed, M. J. Ghosheh, T. Alsheikh, A. Ahmed, K. Y. Lee, and Z. Yousaf. Incidence and factors associated with burnout in radiologists: A systematic review. European Journal of Radiology Open, 11:100530, 2023.
5. Zhihong Chen, Yan Song, Tsung-Hui Chang, and Xiang Wan. Generating radi-
ology reports via memory-driven transformer. In Bonnie Webber, Trevor Cohn,
Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empiri-
cal Methods in Natural Language Processing (EMNLP), pages 1439–1449, Online,
November 2020. Association for Computational Linguistics.
6. Zhihong Chen, Yaling Shen, Yan Song, and Xiang Wan. Cross-modal memory net-
works for radiology report generation. In Chengqing Zong, Fei Xia, Wenjie Li, and
Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Associa-
tion for Computational Linguistics and the 11th International Joint Conference on
Natural Language Processing (Volume 1: Long Papers), pages 5904–5914, Online,
August 2021. Association for Computational Linguistics.
7. Akimichi Ichinose, Taro Hatsutani, Keigo Nakamura, Yoshiro Kitamura, Satoshi
Iizuka, Edgar Simo-Serra, Shoji Kido, and Noriyuki Tomiyama. Visual ground-
ing of whole radiology reports for 3D CT images. In Hayit Greenspan, Anant
Madabhushi, Parvin Mousavi, Septimiu Salcudean, James Duncan, Tanveer Syeda-
Mahmood, and Russell Taylor, editors, Medical Image Computing and Computer
Assisted Intervention – MICCAI 2023, pages 611–621, Cham, 2023. Springer Na-
ture Switzerland.
8. Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Flo-
rencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal
Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
9. Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei
Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. LLaVA-Med: Training
a large language-and-vision assistant for biomedicine in one day. Advances in
Neural Information Processing Systems, 36, 2024.
10. Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Towards
generalist foundation model for radiology. arXiv preprint arXiv:2308.02463, 2023.
11. Shubo Tian, Qiao Jin, Lana Yeganova, Po-Ting Lai, Qingqing Zhu, Xiuying Chen,
Yifan Yang, Qingyu Chen, Won Kim, Donald C Comeau, et al. Opportunities
and challenges for ChatGPT and large language models in biomedicine and health.
Briefings in Bioinformatics, 25(1):bbad493, 2024.
12. Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric
Horvitz. Capabilities of GPT-4 on medical challenge problems. arXiv preprint
arXiv:2303.13375, 2023.
13. Qiao Jin, Robert Leaman, and Zhiyong Lu. Retrieve, summarize, and verify: How will ChatGPT impact information seeking from the medical literature? Journal of the American Society of Nephrology, 2023.
14. Qingqing Zhu, Tejas Sudharshan Mathai, Pritam Mukherjee, Yifan Peng,
Ronald M. Summers, and Zhiyong Lu. Utilizing longitudinal chest x-rays and
reports to pre-fill radiology reports. In MICCAI (5), volume 14224 of Lecture
Notes in Computer Science, pages 189–198. Springer, 2023.
15. Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus,
Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya,
et al. CheXpert: A large chest radiograph dataset with uncertainty labels and
expert comparison. In Thirty-Third AAAI Conference on Artificial Intelligence,
2019.
16. Qingqing Zhu, Xiuying Chen, Qiao Jin, Benjamin Hou, Tejas Sudharshan Mathai,
Pritam Mukherjee, Xin Gao, Ronald M Summers, and Zhiyong Lu. Leveraging pro-
fessional radiologists’ expertise to enhance LLMs’ evaluation for radiology reports,
2024.
17. Qiao Jin, Fangyuan Chen, Yiliang Zhou, Ziyang Xu, Justin M Cheung, Robert
Chen, Ronald M Summers, Justin F Rousseau, Peiyun Ni, Marc J Landsman,
et al. Hidden flaws behind expert-level accuracy of GPT-4 Vision in medicine. arXiv
preprint arXiv:2401.08396, 2024.
18. Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and
Ronald M Summers. ChestX-ray8: Hospital-scale chest X-ray database and bench-
marks on weakly-supervised classification and localization of common thorax dis-
eases. In Proceedings of the IEEE conference on computer vision and pattern recog-
nition, pages 2097–2106, 2017.
19. Yifan Peng, Xiaosong Wang, Le Lu, Mohammadhadi Bagheri, Ronald Summers,
and Zhiyong Lu. NegBio: a high-performance tool for negation and uncertainty de-
tection in radiology reports. AMIA Summits on Translational Science Proceedings,
2018:188, 2018.
20. Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang
Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Houda
Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference
on Empirical Methods in Natural Language Processing, pages 2511–2522, Singa-
pore, December 2023. Association for Computational Linguistics.
21. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi,
Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning
in large language models. Advances in Neural Information Processing Systems,
35:24824–24837, 2022.
22. Ke Yan, Xiaosong Wang, Le Lu, and Ronald M. Summers. DeepLesion: Automated
deep mining, categorization and detection of significant radiology image findings
using large-scale clinical lesion annotations. CoRR, abs/1710.01766, 2017.
23. Ke Yan, Yifan Peng, Veit Sandfort, Mohammadhadi Bagheri, Zhiyong Lu, and
Ronald M Summers. Holistic and comprehensive annotation of clinically significant
findings on diverse CT images: learning from radiology reports and label ontology.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 8523–8532, 2019.
24. OpenAI. GPT-4V(ision) System Card.
https://cdn.openai.com/papers/GPTV_System_Card.pdf, 2023.
25. Seungone Kim, Se June Joo, Doyoung Kim, Joel Jang, Seonghyeon Ye, Jamin Shin,
and Minjoon Seo. The CoT Collection: Improving zero-shot and few-shot learning
of language models via chain-of-thought fine-tuning. In The 2023 Conference on
Empirical Methods in Natural Language Processing, 2023.