
Decomposing Vision-based LLM Predictions

for Auto-Evaluation with GPT-4

Qingqing Zhu1⋆, Benjamin Hou2,5⋆, Tejas S. Mathai2, Pritam Mukherjee2,
Qiao Jin1, Xiuying Chen4, Zhizheng Wang1, Ruida Cheng3,
Ronald M. Summers2⋆⋆, and Zhiyong Lu1⋆⋆

1 National Center for Biotechnology Information, National Library of Medicine
2 Imaging Biomarkers and Computer-Aided Diagnosis Laboratory, Clinical Center
3 Center for Information Technology, National Institutes of Health
4 King Abdullah University of Science & Technology
5 Biomedical Image Analysis, Imperial College London

⋆ These authors contributed equally to this work
⋆⋆ Joint senior co-authors

Abstract. The volume of CT exams performed worldwide has been rising every year, contributing to radiologist burn-out. Large Language Models (LLMs) have the potential to reduce this burden, but their adoption in the clinic depends on radiologist trust and easy evaluation of generated content. Many automated methods are currently available to evaluate the reports generated for chest radiographs, but no such approach exists for CT. In this paper, we propose a novel evaluation framework to judge the capabilities of vision-language LLMs in generating accurate summaries of CT-based abnormalities. We input CT slices containing an abnormality (e.g., lesion) to a vision-based LLM (GPT-4V, LLaVA-Med, and RadFM) to generate a free-text summary of the predicted characteristics of the abnormality. Next, a GPT-4 model decomposed the summary into specific aspects (body part, location, type, and attributes), automatically evaluated the characteristics against the ground-truth, and generated a score for each aspect based on its clinical relevance and factual accuracy. These scores were then contrasted against those obtained from a clinician, and a high correlation (≥ 85%, p < .001) was observed. Although GPT-4V outperformed other models in our evaluation, it still requires overall improvement. Our evaluation method offers valuable insights into the specific areas that need the most enhancement, guiding future development in this field.

Keywords: Computed Tomography · Deep Learning · Automatic Evaluation · Large Language Models · GPT-4.

1 Introduction
In current clinical practice, a radiologist communicates the results of an imaging
exam for a patient to their referring doctor through a signed report. While read-
ing the patient exam, the radiologist uses Speech Recognition Software (SRS)
that converts dictated speech into text. SRS is widely used and has significantly
reduced the report turn-around time. However, any errors resulting from the
dictation have to be corrected by the radiologists themselves, and persistent
communication errors can negatively impact the interpretation of patient diag-
noses and lead to medical malpractice suits [1]. These errors are most common
for cross-sectional imaging [2], such as CT and MR, and the volume of these
exams has steadily increased each year [3]. This has contributed to a 54-72% radiologist burn-out rate [4], as radiologists are under increased pressure to deal with a substantially higher number of patients while maintaining a high level of accuracy.
To reduce this burden, a myriad of transformer-based approaches have been proposed to generate radiology reports in one shot [5,6]. However,
chest radiographs (CXR) have been the singular focus of these works with scant
effort devoted to other modalities, such as CT [7]. The development of such
a method for CT presents unique challenges, attributable to the intrinsic 3D
nature of CT data, computational complexity, and extensive reporting detail
necessary for clinical applications. However, there have been recent advances
with Large Language Models (LLMs), like GPT-4 [8], and vision-based LLMs,
such as GPT-4 Vision (GPT-4V), LLaVA-Med [9], and Radiology Foundation
Model (RadFM) [10]. These multimodal models have demonstrated their capa-
bilities across several tasks, such as passing medical exams, medical note-taking
and diagnosing diseases [11,12,13]. Therefore, they have immense potential to
pre-fill the “findings” section of a radiology report with relevant information
that a radiologist can quickly review [14].
Despite these advances, crucial factors determining their clinical use involve:
(1) radiologist trust, and (2) easy interpretation and evaluation of the generated
content. Current evaluation metrics, including Natural Language Generation
(NLG) and Clinical Efficacy (CE) metrics, are notoriously limited [15,14,16,17]
when it comes to capturing the semantic richness and clinical relevance neces-
sary for radiology reports. Additionally, they lack the explanatory power that
is required for clinical use. For CXR images, a variety of automated methods
validating the clinical accuracy of reports have been established [18,19,15]. But,
an equivalent automated system for the validation of clinical accuracy in CT is
notably absent. These limitations have prompted the exploration of LLM-based
evaluation methods that promise a nuanced assessment aligned with human
judgment [20,16] and emphasize accuracy and relevance of generated content.
In this paper, we present a novel evaluation framework that judges the abil-
ity of a vision-based LLM in generating diagnostically accurate findings, such
that they can be pre-filled into the radiology report. CT slices containing an ab-
normality (e.g., lesion) were input to a vision-based LLM (e.g., GPT-4V), and
it generated a free-text summary of the predicted characteristics of the abnor-
mality. Next, a language-centric GPT-4 model decomposed the summary into
specific aspects (body part, location, type, and attributes), automatically eval-
uated the characteristics against the ground-truth annotations, and generated a
score for each aspect based on its clinical relevance and factual accuracy. These
scores were then contrasted against those obtained from a clinician, and a high
[Fig. 1 schematic: a CT slice with pathology (with bounding box, Bx, or without, NoBx) and a prompt ("Imagine you are a radiologist...") are fed to vision-based LLMs (GPT-4V, LLaVA-Med, RadFM); the predicted text is compared with the ground-truth text via NLG metrics (BLEU-[1,2,3,4], ROUGE-L, METEOR), clinician evaluation, and GPT-4 auto-evaluation over Location, Body Parts, Type, and Attributes, yielding a correlation score. Dataset: 500 slices, 500 nodules; 486 patients (294M, 192F).]

Fig. 1. Pipeline for the auto-evaluation of CT-based findings characterization. First, CT slices with outlined lesions were analyzed by vision-based LLMs that generated a summary of their characteristics. The summaries were evaluated against the ground-truth annotations by a clinician, with NLG metrics, and auto-evaluation with GPT-4.

correlation (≥ 85%, p < .001) was observed. Our approach is unique in that
it combines the expertise of radiologists with the Chain-of-Thought (COT) [21]
reasoning from LLMs to evaluate the characteristics of abnormalities predicted
by vision-based LLMs.
Contributions: (1) The proposed auto-evaluation framework is a novel ap-
proach that decomposes the characteristics of a CT-based abnormal finding gen-
erated by a vision-based LLM into specific aspects, such that distinct dimensions
of report quality can be effectively isolated and verified. (2) Three recent vision-
based LLMs were evaluated for their capability to generate summarizations of
CT-based findings. (3) The limitations of traditional NLG metrics in capturing factual accuracy and reporting complexity are demonstrated.

2 Methods
Dataset: To the best of our knowledge, there is no publicly available dataset with
CT exams and paired radiology reports. Consequently, this study utilizes a sub-
set of the publicly available DeepLesion dataset [22,23], comprising 496 CT vol-
umes (496 studies) from 486 patients. The subset contained 500 lesions of various
kinds (e.g., liver, kidney, bone, etc.) that were prospectively marked in 500 CT
slices. The dataset also provided specific characteristics of lesions that were ex-
tracted from the sentences in the radiology reports using an automated method
[23]. These included the body part location, type of lesion, and shape and ap-
pearance attributes. As certain lesion characteristics were missed by the auto-
mated tool, two board-certified radiologists, each with 10+ years of experience,
manually reviewed and comprehensively annotated any missing lesion character-
istics. The dataset included both male and female participants (M: 294, F: 192),
ranging in age from 2 to 87 years (mean: 52.2, s.d.: 17.7).
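For downstream processing, each annotated lesion can be represented as a small record combining the slice, the bounding box, and the four annotated aspects. The sketch below is a hypothetical convenience structure (the LesionAnnotation dataclass and its field names are illustrative, not part of DeepLesion); the example values echo the ground truth in Table 1, while the file path and bounding-box coordinates are purely illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class LesionAnnotation:
    """Hypothetical container for one annotated lesion (not part of DeepLesion itself)."""
    slice_path: str                                # path to the CT slice containing the lesion
    bbox: Tuple[float, float, float, float]        # (x1, y1, x2, y2) in pixel coordinates
    body_part: str                                 # e.g., "kidney"
    location: str                                  # e.g., "right renal, parapelvic"
    lesion_type: str                               # e.g., "cyst"
    attributes: List[str] = field(default_factory=list)   # e.g., ["hypodense"]

# Example mirroring the ground truth of Table 1 (path and bounding box are illustrative):
gt = LesionAnnotation(
    slice_path="slice_000123.png",
    bbox=(112.0, 80.0, 168.0, 140.0),
    body_part="kidney",
    location="right renal, parapelvic",
    lesion_type="cyst",
    attributes=["parapelvic"],
)
```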
Decomposing Vision-based LLM Predictions for Auto-Evaluation: The
proposed framework decomposes the predicted descriptions of CT-based find-
ings (e.g., lesions) by a vision-based LLM, such that a language-centric LLM
can auto-evaluate the predictions by comparing them against the ground-truth annotations. First, several prominent vision-based LLMs from recent literature
were tasked with analyzing a CT slice with a known abnormality (lesion) and
generating a free-text description of its characteristics. Next, GPT-4 parsed this
prediction, and provided a score for each aspect by comparing them against the
ground truth annotations from DeepLesion. The objective was to move beyond
conventional natural language generation (NLG) metrics, which despite their
linguistic coherence, are insufficient to evaluate predictions for clinical accuracy.
Figure 1 illustrates the experimental design, consisting of three integral steps.
(1) Visual Context Integration: The abnormal findings within the CT slices
were delineated with a bounding box prior to being input into the vision-based
LLM. The clear visual context provided to the vision-based LLMs was expected
to enhance the accuracy of the generated summaries of the findings.
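A minimal sketch of this step is given below, assuming the CT slice has already been windowed and saved as an image and that the bounding box comes from the DeepLesion annotation; the helper name, file names, and the green box color are illustrative.

```python
from PIL import Image, ImageDraw

def draw_lesion_box(slice_png: str, bbox, out_png: str, width: int = 3) -> None:
    """Overlay a rectangular bounding box on a CT slice image before passing it to a vision LLM.

    bbox is (x1, y1, x2, y2) in pixel coordinates of the slice image.
    """
    img = Image.open(slice_png).convert("RGB")                # RGB so the box can be drawn in color
    ImageDraw.Draw(img).rectangle(list(bbox), outline=(0, 255, 0), width=width)  # green box
    img.save(out_png)

# draw_lesion_box("slice_000123.png", (112, 80, 168, 140), "slice_000123_bx.png")
```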
(2) Text-Based Chain-of-Thought: In the Text-Based Chain of Thought
(COT) approach, the vision-based LLM takes the input CT slice with an abnor-
mality in it, and generates a free-text description of the abnormality. The output
description should contain the following aspects: Body Part, Location (specific),
Type, and Attributes. ‘Body Part’ is the larger anatomical region or organ of the
body where the lesion or abnormality is situated. ‘Location’ refers to the precise
area or specific site within a body part where a lesion or abnormality is located.
‘Type’ includes classifications, such as nodule, mass, or enlarged lymph node.
‘Attributes’ describe characteristics like size, shape, density (hypo or hyper), or
calcification. The summary should be concise and clinically relevant, such that
the characteristics of the findings can be pre-filled in the findings section of a
radiology report. The summary should also lend itself to being auto-evaluated
by an LLM (language model only). The prompt used for this task was designed to
allow the model to concentrate on each aspect individually, thereby optimizing
the use of its natural language generation capabilities to produce clinically rele-
vant and informative descriptions of the findings. This approach contrasts with
the one-shot methods [21] found in the literature, which attempt to generate
entire radiology reports in a single step without explicit intermediate reasoning.
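The verbatim prompt is provided in the supplementary material; the sketch below is only a paraphrase of how the four aspects can be requested step by step, with the exact wording and output format being assumptions for illustration.

```python
# Paraphrased Chain-of-Thought prompt (the verbatim prompt is in the supplementary material).
COT_PROMPT = """Imagine you are a radiologist reviewing the attached axial CT slice.
A lesion of interest is outlined by a bounding box. Reason step by step and report:
1. Body Part: the anatomical region or organ where the lesion is situated.
2. Location: the specific site within that body part.
3. Type: the lesion classification (e.g., nodule, mass, enlarged lymph node).
4. Attributes: characteristics such as size, shape, density (hypo/hyper), or calcification.
Finally, give a concise, clinically relevant Impression.
Answer in the format:
'Location': ..., 'Body Part': ..., 'Type': ..., 'Attributes': ..., 'Impression': ..."""
```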
(3) Auto-Evaluation using GPT-4: This is the core of our framework that
involves employing GPT-4 to mimic the evaluation process of radiologists. In-
spired by [16], GPT-4 compared the predicted summary from the vision-based
LLM against ground-truth annotations from DeepLesion and assigned a score
for each aspect of the abnormality. The scores provided were: {“Correct”, “Partially Correct”, “Incorrect”, “Not Applicable”}. “Correct” indicated that the
interpretation was entirely accurate. “Partially Correct” signified that the in-
terpretation was somewhat accurate, but lacked full precision or completeness.
“Incorrect” meant that the interpretation did not match the correct answer in
any way. Lastly, “Not Applicable” denoted that the question was not relevant to
the situation and thus could not be graded. These scores enabled the automated
evaluation of the accuracy of the predicted descriptions. The prompt provided
to GPT-4, detailed in supplementary material, instructed the model to assess
the AI-generated predictions in the same way as a clinician.
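The full grading prompt is likewise in the supplementary material; the following is an illustrative sketch of the grading step, where call_gpt4 is a placeholder for whichever chat-completion client is used and the JSON output format is an assumption for convenience rather than the prescribed one.

```python
import json

# Paraphrased grading instruction (the full prompt is in the supplementary material).
GRADING_PROMPT = """You are a clinician grading an AI-generated description of a CT abnormality.
Ground truth: {gt}
AI prediction: {pred}
For each aspect (Location, Body Part, Type, Attributes), assign exactly one grade:
"Correct", "Partially Correct", "Incorrect", or "Not Applicable".
Return a JSON object mapping each aspect to its grade."""

def grade_prediction(gt: dict, pred: str, call_gpt4) -> dict:
    """Ask a language-only GPT-4 model to grade one prediction against the ground truth.

    `call_gpt4` is a placeholder for the chat-completion client; it should return
    the model's raw text response.
    """
    prompt = GRADING_PROMPT.format(gt=json.dumps(gt), pred=pred)
    raw = call_gpt4(prompt)
    return json.loads(raw)   # e.g., {"Location": "Correct", "Body Part": "Partially Correct", ...}
```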
3 Experiments and Results


Baseline Models: Three vision-based LLMs from recent literature were eval-
uated for their capability to generate a summary of the CT-based abnormal-
ity. These models included GPT-4 Vision (GPT-4V) [24], LLaVA-Med [9], and
RadFM [10]. For both LLaVA-Med and RadFM, the default configurations set
by the respective authors were used for inference. For GPT-4V, the OpenAI API version “2023-03-15-preview” with the “gpt-4” engine was employed. The configuration was set as follows: {"temperature": 0.7, "top_p": 0.95, "max_tokens": 4000, "model_version": "2024-02-15-preview"}.
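For reference, a sketch of such a call using the pre-1.0 Azure-style openai Python client is shown below; the image-payload format and the compatibility of the stated API version with image inputs are assumptions that should be checked against the actual deployment.

```python
import base64
import openai

openai.api_type = "azure"
openai.api_version = "2023-03-15-preview"   # API version stated above
# openai.api_base and openai.api_key are set from the deployment (omitted here).

def describe_slice(image_path: str, prompt: str) -> str:
    """Send one CT slice plus the prompt to a GPT-4V deployment (sketch only)."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = openai.ChatCompletion.create(
        engine="gpt-4",                       # deployment/engine name used above
        temperature=0.7, top_p=0.95, max_tokens=4000,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response["choices"][0]["message"]["content"]
```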
Experiments: To determine the utility of GPT-4 for auto-evaluation of find-
ings generated by vision-based LLMs, a baseline performance with respect to a
clinician was required. In particular, 100 random lesions were chosen from the entire set of 500 lesions; although the characteristics of all 500 lesions had already been verified by two board-certified radiologists, it would have been too cumbersome for the clinician to also evaluate the free-text generations for all of them. Once performance on this subset was quantified with different metrics, the utility of using GPT-4 alone for auto-evaluation of the findings generated by vision-based LLMs against the ground-truth was determined. Evaluation of the generated findings was conducted based on
four configurations: both with and without bounding boxes delineating the lesion
in the CT slice, and combined with and without text-based COT.
Metrics: The linguistic quality of the generated summaries was evaluated us-
ing traditional NLG metrics, such as BLEU, METEOR, and ROUGE. However,
considering the variability in reporting across different institutions, these met-
rics have limitations in capturing the clinical relevance of the generated summary
(see example in supplementary material). Inspired by [16], auto-evaluation by
GPT-4 was done by comparing the predictions against the ground-truth annota-
tions. This evaluation was also compared against the evaluation performed by a
clinician. The correlation between the evaluations done by the clinician and GPT-4 was considered to be a measure of the reliability of GPT-4 for auto-evaluation.
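As a point of reference, the NLG metrics can be computed per prediction roughly as follows, assuming the nltk and rouge_score packages; this mirrors standard usage rather than the exact evaluation scripts.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score   # requires the nltk 'wordnet' data
from rouge_score import rouge_scorer

def nlg_scores(reference: str, prediction: str) -> dict:
    """Compute BLEU-1..4, METEOR, and ROUGE-L for a single prediction (sketch)."""
    ref_tok, pred_tok = reference.split(), prediction.split()
    smooth = SmoothingFunction().method1
    scores = {
        f"BLEU-{n}": sentence_bleu([ref_tok], pred_tok,
                                   weights=tuple(1.0 / n for _ in range(n)),
                                   smoothing_function=smooth)
        for n in range(1, 5)
    }
    scores["METEOR"] = meteor_score([ref_tok], pred_tok)
    scores["ROUGE-L"] = rouge_scorer.RougeScorer(["rougeL"]).score(
        reference, prediction)["rougeL"].fmeasure
    return scores

# Example using the ground-truth/prediction pair discussed in Section 3:
# nlg_scores("left mediastinal pleural enhancing mass",
#            "Axial contrast enhanced images through the thoracic inlet "
#            "demonstrating a homogeneous mass posterior to the trachea")
```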
Results - Evaluation of Clinician vs. GPT-4: The clinician compared the
ground truth against the free-text predictions generated by the vision-based
LLMs for the 100 random lesions. The clinician provided a score for the predic-
tion based on: body part, location, type, and attributes. The score fell into one
of these categories: “Incorrect”: -1, “Partially Correct”: 0.5, “Correct”: 1. The
“Not Applicable” category was excluded from the analysis as these responses
did not contribute meaningful information regarding the accuracy or quality of
the generated findings. Next, GPT-4 compared the same predictions against the
ground-truth and provided the corresponding scores. Finally, NLG metrics were also computed between the predictions and the ground-truth text. In Fig. 2, heatmaps show the interaction
the various metrics computed based on the clinician and GPT-4 evaluations.
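A minimal sketch of this correlation analysis, following the scoring scheme above: categorical grades are mapped to numeric values, pairs containing “Not Applicable” are dropped, and Pearson's r is computed with scipy; function and variable names are illustrative.

```python
from scipy.stats import pearsonr

GRADE_TO_NUM = {"Incorrect": -1.0, "Partially Correct": 0.5, "Correct": 1.0}

def grade_correlation(clinician_grades, gpt4_grades):
    """Pearson correlation between clinician and GPT-4 grades for one aspect.

    Both inputs are lists of category strings; pairs containing 'Not Applicable'
    are excluded, as described above.
    """
    pairs = [(GRADE_TO_NUM[c], GRADE_TO_NUM[g])
             for c, g in zip(clinician_grades, gpt4_grades)
             if c in GRADE_TO_NUM and g in GRADE_TO_NUM]
    xs, ys = zip(*pairs)
    return pearsonr(xs, ys)   # (correlation coefficient, p-value)
```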
From Fig. 2, NLG metrics, such as BLEU and METEOR, revealed strong
correlations amongst themselves (purple box), particularly at lower levels of pre-
cision like BLEU-1 and BLEU-2. This indicated a consistency in evaluating the
linguistic quality of generated texts at these levels. However, the correlation
weakened considerably at increased precision levels of BLEU-3 and BLEU-4.
Fig. 2. Heatmap of pairwise Pearson’s Correlation Coefficient among various grading scores: NLG metrics, radiologist evaluations, and GPT-4 evaluations. L-to-R: vision-based LLM generated impressions from GPT-4V, LLaVA-Med, RadFM. Color intensity indicates the strength of correlation, with darker shades representing higher correlation.

Particularly for LLaVA-Med, the scores were predominantly at 0, and exhibited no correlation. This pattern aligned with expectations and highlighted the lim-
itations of BLEU metrics in capturing the nuanced accuracy of more complex
sentences within radiology reports.
Furthermore, the decomposition of the evaluation into specific aspects (loca-
tion, body part, lesion type, and attributes) revealed insightful patterns (peach
box). These aspects showed a lack of strong correlation with one another, and
other pairings also displayed no significant correlation. These observations af-
firmed the efficacy of the approach in dissecting the findings into their granular
elements (except location and body part), such that the distinct parts of report
quality can be isolated and assessed independently. Comparing NLG metrics
with the clinician evaluations showed a weak correlation, suggesting that the
NLG metrics may not serve as reliable indicators of clinical accuracy within ra-
diology reports (blue box). This highlighted a potential gap in utilizing NLG
metrics for assessing the clinical relevance of generated reports, pointing to the
necessity for domain-specific evaluation methods.
Lastly, the comparison between evaluations conducted by GPT-4 and the
clinician showed the strength of the automated evaluation framework (pink box), and is summarized in Table 2. The results from auto-evaluation showed a strong
correlation with those from a clinician based on Pearson’s correlation coeffi-
cient of 0.87 ± 0.02 (p<0.001). This suggests that GPT-4’s evaluation method
closely aligned with the clinical assessment paradigms utilized by radiologists.
This observation underscored the potential of language models like GPT-4 in
accurately mirroring radiologists’ evaluations, offering promise for automating
report assessment with a high degree of fidelity to clinical standards.
Results - Auto Evaluation using GPT-4: As there was a strong correlation
between the evaluations conducted by GPT-4 and the clinician, GPT-4 was uti-
lized to compare the predicted findings against the ground-truth for all 500
lesions. Figure 3 shows grading scores from GPT-4 across four configurations:
both with and without lesion bounding boxes in the CT slice, and combined
with and without text-based Chain-of-Thought (COT). Notably, LLaVA-Med
and RadFM lack the functionality to utilize text-based COT processes, which
Fig. 3. Comparative analysis of abnormality characterization by GPT-4V, LLaVA-Med, and RadFM with bounding boxes (Bx) vs. without bounding boxes (NoBx). Color mapping = {orange: ‘Incorrect’, beige: ‘Partially Correct’, teal: ‘Correct’, white: ‘Not Applicable’}. Numeric labels = {0: ‘Location’, 1: ‘Body Part’, 2: ‘Type’, 3: ‘Attribute’}. X-axis denotes scores {x ∈ R | 0 < x < 1}, N = 500 samples.

GPT-4V: ‘Location’: ‘Right abdomen, near the kidney’, ‘Body Part’: ‘Abdomen, kidney’, ‘Type’: ‘Mass’, ‘Attributes’: ‘Well-circumscribed, homogenous’, ‘Impression’: ‘Well-circumscribed homogenous mass in the right abdomen, adjacent to the right kidney’

LLaVA-Med: The image is an axial computed tomography (CT) scan that has been annotated with a bounding box by a radiologist. The bounding box is a rectangular box that encloses the lesion, which is indicated in green. This helps to highlight the area of interest and provides a clear view of the lesion’s location, size, and shape.

RadFM: { Lesion 1 }

GT: ‘right renal parapelvic cyst’

Table 1. Example Chain-of-Thought responses by vision-based LLMs for a renal cyst.

              Location  Body Part  Type  Attribute  Avg.          p-value
GPT-4V          0.86      0.90     0.84    0.86     0.87 ± 0.02   <0.001
LLaVA-Med       0.59      0.83     0.76    0.80     0.75 ± 0.10   <0.001
RadFM           0.99      0.92     0.82    0.89     0.90 ± 0.07   <0.001

Table 2. Correlation scores between the clinician and GPT-4 grading of reports.

aligns with findings in literature [25]. This sets them apart from GPT-4V, which
demonstrated enhanced performance when employing COT reasoning. Further-
more, the inclusion of bounding boxes generally led to improved model perfor-
mance as it provided guidance for the models, with GPT-4V showing the most
significant enhancement. Amongst the evaluated models, GPT-4V outperformed
its counterparts. However, despite these advancements, the overall quality of pre-
filled findings for CT reporting remained suboptimal. This indicated a pressing
need for further development of vision-based LLMs to meet clinical standards.
Consider the ground truth text ‘left mediastinal pleural enhancing mass’
and a predicted result by RadFM ‘Axial contrast enhanced images through the
thoracic inlet demonstrating a homogeneous mass posterior to the trachea’. It
is evident that the NLG scores are low owing to the minimal textual overlap.
Furthermore, this limitation extends to individuals without medical training, as an understanding of anatomy is necessary to recognize that both texts pertain to the chest and lung regions.
4 Discussion and Conclusion


Through experiments, it was observed that LLaVA-Med and RadFM were un-
able to leverage text-based COT as shown in Table 1. This reflects the architec-
tural or design constraints that prevented these models from effectively breaking
down and processing information in a stepwise manner. GPT-4V’s improved per-
formance with COT suggested its architecture was better suited to sequential
reasoning, mimicking a radiologist’s thought process, thereby leading to more ac-
curate generation of the characteristics of CT findings. We also noticed that the
bounding box delineating the lesion improved the performance of most models,
especially GPT-4V. We attribute this to the added visual cues that bounding
boxes provided. These cues helped focus the model’s attention on a specific
area of interest, thereby improving the accuracy of the abnormality summary.
GPT-4V outperforming other models can be linked to its advanced language pro-
cessing capabilities, likely benefiting from a more sophisticated understanding
of context, superior text generation algorithms, and perhaps a more extensive
training dataset that includes a diverse range of medical imaging scenarios.
With respect to the lesion characteristics (location, body part, type, at-
tributes), a varied performance across the categories was seen. This reflected
the differential impact of visual and textual guidance on model accuracy. For
instance, the improvement in “Body Part” and “Type” with bounding boxes
suggested these categories benefitted more from visual delineation. In contrast,
the “Location” and “Attributes” categories, which may require more abstract
reasoning or detailed textual information, showed mixed results. This indicated
a complex interplay between model capabilities and the nature of the task.
Despite advancements, the overall performance of vision-based LLMs in de-
scribing the characteristics of findings in CT remains inadequate for clinical
standards. This is largely due to the inherent complexity of medical images, the
subtlety of pathologies, and the high standard of accuracy required for clinical
diagnosis. Current vision-based LLMs are trained largely on natural images and
lack sufficient training on diverse datasets. Thus, they still lack the ability to
fully comprehend and articulate the complex interplay of visual features and
clinical implications present in CT exams.
In summary, a framework was proposed for the auto-evaluation of AI-
generated characteristics for findings in CT exams, which would be pre-filled into
the findings section of radiology reports. Results from auto-evaluation showed
a strong correlation with those from a clinician based on Pearson’s correlation
coefficient of 0.87 ± 0.02 (p<0.001). GPT-4V outperformed other recent vision-
language LLMs in predicting the characteristics of lesions in the dataset, and
GPT-4 was sufficient for the auto-evaluation of the predictions from GPT-4V
against ground-truth annotations. The evaluation approach pointed out particu-
lar weaknesses within various vision-based LLMs, and provided essential insights
to enhance the precision and dependability of AI-generated interpretations from
medical images. By addressing these identified gaps and focusing on comprehen-
sive model development, there is potential to significantly improve the utility of
vision- and language-based models in radiology.
5 Acknowledgements
This research was supported by the Intramural Research Program of the National
Library of Medicine and Clinical Center at the NIH.

References
1. John J. Smith and Leonard Berlin. Signing a colleague’s radiology report. Amer-
ican Journal of Roentgenology, 176(1):27–30, 2001. PMID: 11133532.
2. Michael D Ringler, Brian C Goss, and Brian J Bartholmai. Syntactic and semantic
errors in radiology reports associated with speech recognition software. Health
Informatics Journal, 23(1):3–13, 2017. PMID: 26635322.
3. M. Mahesh, A. J. Ansari, and F. A. Mettler Jr. Patient exposure from radiologic and nuclear medicine procedures in the United States and worldwide: 2009-2018. Radiology, 301, 2023.
4. N. A. Fawzy, M. J. Tahir, A. Saeed, M. J. Ghosheh, T. Alsheikh, A. Ahmed, K. Y. Lee, and Z. Yousaf. Incidence and factors associated with burnout in radiologists: A systematic review. European Journal of Radiology Open, 11:100530, 2023.
5. Zhihong Chen, Yan Song, Tsung-Hui Chang, and Xiang Wan. Generating radi-
ology reports via memory-driven transformer. In Bonnie Webber, Trevor Cohn,
Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empiri-
cal Methods in Natural Language Processing (EMNLP), pages 1439–1449, Online,
November 2020. Association for Computational Linguistics.
6. Zhihong Chen, Yaling Shen, Yan Song, and Xiang Wan. Cross-modal memory net-
works for radiology report generation. In Chengqing Zong, Fei Xia, Wenjie Li, and
Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Associa-
tion for Computational Linguistics and the 11th International Joint Conference on
Natural Language Processing (Volume 1: Long Papers), pages 5904–5914, Online,
August 2021. Association for Computational Linguistics.
7. Akimichi Ichinose, Taro Hatsutani, Keigo Nakamura, Yoshiro Kitamura, Satoshi
Iizuka, Edgar Simo-Serra, Shoji Kido, and Noriyuki Tomiyama. Visual ground-
ing of whole radiology reports for 3d ct images. In Hayit Greenspan, Anant
Madabhushi, Parvin Mousavi, Septimiu Salcudean, James Duncan, Tanveer Syeda-
Mahmood, and Russell Taylor, editors, Medical Image Computing and Computer
Assisted Intervention – MICCAI 2023, pages 611–621, Cham, 2023. Springer Na-
ture Switzerland.
8. Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Flo-
rencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal
Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
9. Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei
Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. LLaVA-Med: Training
a large language-and-vision assistant for biomedicine in one day. Advances in
Neural Information Processing Systems, 36, 2024.
10. Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Towards
generalist foundation model for radiology. arXiv preprint arXiv:2308.02463, 2023.
11. Shubo Tian, Qiao Jin, Lana Yeganova, Po-Ting Lai, Qingqing Zhu, Xiuying Chen,
Yifan Yang, Qingyu Chen, Won Kim, Donald C Comeau, et al. Opportunities
and challenges for chatgpt and large language models in biomedicine and health.
Briefings in Bioinformatics, 25(1):bbad493, 2024.
12. Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric
Horvitz. Capabilities of gpt-4 on medical challenge problems. arXiv preprint
arXiv:2303.13375, 2023.
13. Qiao Jin, Robert Leaman, and Zhiyong Lu. Retrieve, summarize, and verify: How
will chatgpt impact information seeking from the medical literature? Journal of
the American Society of Nephrology, pages 10–1681, 2023.
14. Qingqing Zhu, Tejas Sudharshan Mathai, Pritam Mukherjee, Yifan Peng,
Ronald M. Summers, and Zhiyong Lu. Utilizing longitudinal chest x-rays and
reports to pre-fill radiology reports. In MICCAI (5), volume 14224 of Lecture
Notes in Computer Science, pages 189–198. Springer, 2023.
15. Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus,
Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya,
et al. Chexpert: A large chest radiograph dataset with uncertainty labels and
expert comparison. In Thirty-Third AAAI Conference on Artificial Intelligence,
2019.
16. Qingqing Zhu, Xiuying Chen, Qiao Jin, Benjamin Hou, Tejas Sudharshan Mathai,
Pritam Mukherjee, Xin Gao, Ronald M Summers, and Zhiyong Lu. Leveraging pro-
fessional radiologists’ expertise to enhance llms’ evaluation for radiology reports,
2024.
17. Qiao Jin, Fangyuan Chen, Yiliang Zhou, Ziyang Xu, Justin M Cheung, Robert
Chen, Ronald M Summers, Justin F Rousseau, Peiyun Ni, Marc J Landsman,
et al. Hidden flaws behind expert-level accuracy of gpt-4 vision in medicine. arXiv
preprint arXiv:2401.08396, 2024.
18. Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and
Ronald M Summers. Chestx-ray8: Hospital-scale chest x-ray database and bench-
marks on weakly-supervised classification and localization of common thorax dis-
eases. In Proceedings of the IEEE conference on computer vision and pattern recog-
nition, pages 2097–2106, 2017.
19. Yifan Peng, Xiaosong Wang, Le Lu, Mohammadhadi Bagheri, Ronald Summers,
and Zhiyong Lu. Negbio: a high-performance tool for negation and uncertainty de-
tection in radiology reports. AMIA Summits on Translational Science Proceedings,
2018:188, 2018.
20. Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang
Zhu. G-eval: NLG evaluation using gpt-4 with better human alignment. In Houda
Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference
on Empirical Methods in Natural Language Processing, pages 2511–2522, Singa-
pore, December 2023. Association for Computational Linguistics.
21. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi,
Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning
in large language models. Advances in Neural Information Processing Systems,
35:24824–24837, 2022.
22. Ke Yan, Xiaosong Wang, Le Lu, and Ronald M. Summers. Deeplesion: Automated
deep mining, categorization and detection of significant radiology image findings
using large-scale clinical lesion annotations. CoRR, abs/1710.01766, 2017.
23. Ke Yan, Yifan Peng, Veit Sandfort, Mohammadhadi Bagheri, Zhiyong Lu, and
Ronald M Summers. Holistic and comprehensive annotation of clinically significant
findings on diverse ct images: learning from radiology reports and label ontology.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 8523–8532, 2019.
24. OpenAI. GPT-4V(ision) System Card.
https://cdn.openai.com/papers/GPTV_System_Card.pdf, 2023.
25. Seungone Kim, Se June Joo, Doyoung Kim, Joel Jang, Seonghyeon Ye, Jamin Shin,
and Minjoon Seo. The cot collection: Improving zero-shot and few-shot learning
of language models via chain-of-thought fine-tuning. In The 2023 Conference on
Empirical Methods in Natural Language Processing, 2023.
