arXiv:2402.08806v1 [cs.AI] 13 Feb 2024

Background Large language models (LLMs) such as OpenAI's GPT-4 or Google's PaLM 2 are proposed as viable diagnostic support tools or even spoken of as replacements for "curbside consults". However, even LLMs specifically trained on medical topics may lack sufficient diagnostic accuracy for real-life applications.
Methods Using collective intelligence methods and a dataset of 200 clinical vignettes of real-life
cases, we assessed and compared the accuracy of differential diagnoses obtained by asking individual
commercial LLMs (OpenAI GPT-4, Google PaLM 2, Cohere Command, Meta Llama 2) against the
accuracy of differential diagnoses synthesized by aggregating responses from combinations of the same
LLMs.
Results We find that aggregating responses from multiple, diverse LLMs leads to more accurate differential diagnoses (average accuracy for 3 LLMs: 75.3% ± 1.6pp) compared to the differential diagnoses produced by single LLMs (average accuracy for single LLMs: 59.0% ± 6.1pp).
Discussion The use of collective intelligence methods to synthesize differential diagnoses by combining the responses of different LLMs achieves two of the necessary steps towards advancing acceptance of LLMs as a diagnostic support tool: (1) demonstrating high diagnostic accuracy and (2) eliminating dependence on a single commercial vendor.
   in the differential (i.e., the first diagnosis is given the score 1/1 = 1, the second 1/2 = 0.5, the third 1/3 = 0.33, etc.).
4. Aggregation: Each of the unique diagnoses in the set created in step 2 is given an aggregate score calculated by adding all the individual scores of that diagnosis across all differentials.
5. Synthesis: A synthetic differential is generated by taking the five unique diagnoses with the highest aggregate scores and ranking them by their scores in decreasing order.

2.4 Accuracy measure

The accuracy of a solver (either an LLM or a group of LLMs) is calculated as the percentage of correctly diagnosed cases among all cases.

For this study we consider a case to be correctly diagnosed by a solver if the differential provided by that solver for that case contains the correct diagnosis among the five highest ranked diagnoses. This so-called TOP-5 matching mirrors similar correctness measures used in previous studies [11, 12]. Results obtained using TOP-1 or TOP-3 matching are provided in the supplemental material.

3 Results

Synthetic differential diagnoses aggregated from multiple LLMs (average accuracy for groups of 3 LLMs: 75.3% ± 1.6pp) are consistently better than the differential diagnoses produced by single LLMs (average accuracy of single LLMs: 59.0% ± 6.1pp).

The average accuracy of individual LLMs is 59.0% ± 6.1pp, i.e., on average an LLM produces a ranked differential diagnosis that contains the right diagnosis in 59.0% of the cases. The average accuracy of groups of LLMs increases as the group size grows: the average accuracy for groups of 2 LLMs is 69.1% ± 2.6pp, for 3 LLMs it is 75.3% ± 1.6pp, and for 4 LLMs it is 78.0% ± 0.1pp. Figure 1 shows the individual and average accuracy of the single LLMs and of groups of LLMs.

This finding also holds when the definition of a correctly diagnosed case is made stricter by considering only the 3 highest ranked diagnoses in a differential (TOP-3 matching), as illustrated in Figure 2.

This trend is also confirmed when the differential diagnosis of the LLM with the highest individual accuracy, GPT-4, is excluded from the experiment (average accuracy of single LLMs: 54.6% ± 6.4pp, of groups of 2 LLMs: 63.5% ± 2.3pp, of 3 LLMs: 70.0% ± 0.1pp). The supplemental material provides data on these alternative evaluation methods.
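As a minimal illustration of the scoring, aggregation, and synthesis steps described in the Methods, the procedure can be sketched in Python. Function names are ours, and the sketch assumes diagnosis strings have already been normalized so that identical diagnoses from different LLMs compare equal:

```python
from collections import defaultdict

def aggregate_differentials(differentials, k=5):
    """Combine ranked differentials from several solvers into one.

    `differentials` is a list of lists; each inner list holds one
    solver's diagnoses, ordered from most to least likely.
    """
    scores = defaultdict(float)
    for differential in differentials:
        for rank, diagnosis in enumerate(differential, start=1):
            # Scoring: reciprocal-rank weight (1st -> 1, 2nd -> 0.5, ...)
            scores[diagnosis] += 1.0 / rank
    # Synthesis: the k diagnoses with the highest aggregate scores,
    # ranked by score in decreasing order.
    return sorted(scores, key=scores.get, reverse=True)[:k]

def is_correct(differential, correct_diagnosis, top_k=5):
    """TOP-k matching: the correct diagnosis is among the k highest ranked."""
    return correct_diagnosis in differential[:top_k]
```

For example, a diagnosis ranked first by two LLMs receives an aggregate score of 1 + 1 = 2, outscoring a diagnosis ranked second by one LLM and third by another (0.5 + 0.33).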
[Figure 1 plot: % accuracy over 200 cases (0–100) for each single LLM (○, △, □, ☆) and for every group of 2, 3, and 4 LLMs, with per-group-size averages.]
Figure 1 | Synthetic differential diagnoses aggregated from different LLMs show a greater diagnostic accuracy compared to
differential diagnoses produced by single LLMs. This graph provides a visual representation of the data presented in Table 1.
○ = Cohere Command, △ = Google PaLM 2, □ = Meta Llama 2, ☆ = OpenAI GPT-4
[Figure 2 plot: % accuracy over 200 cases (40–80) vs. number of LLM responses contributing to the aggregated differential (0–5), with four curves: TOP-5, TOP-5 (without GPT-4), TOP-3, TOP-3 (without GPT-4).]
Figure 2 | Increasing the number of LLMs contributing to a synthetic differential leads to an increase in accuracy (a) when the definition of a correctly diagnosed case is made stricter by considering only the 3 highest ranked diagnoses in a differential (TOP-3 matching) and (b) when the top-performing LLM, GPT-4, is excluded from the experiment.
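For illustration, the per-group-size accuracy averages shown in Figures 1 and 2 could be computed along the following lines. This is a self-contained sketch: the names are ours, and the reciprocal-rank aggregation from the Methods is restated inline:

```python
from itertools import combinations
from statistics import mean

def synthesize(differentials, k=5):
    # Reciprocal-rank aggregation: rank r contributes 1/r to a diagnosis.
    scores = {}
    for differential in differentials:
        for rank, diagnosis in enumerate(differential, start=1):
            scores[diagnosis] = scores.get(diagnosis, 0.0) + 1.0 / rank
    return sorted(scores, key=scores.get, reverse=True)[:k]

def mean_group_accuracy(answers, truths, group_size, top_k=5):
    """Average TOP-k accuracy over every possible group of `group_size` solvers.

    answers: dict mapping solver name -> one ranked differential per case.
    truths:  the correct diagnosis for each case.
    """
    accuracies = []
    for group in combinations(sorted(answers), group_size):
        hits = sum(
            truth in synthesize([answers[s][i] for s in group], k=top_k)
            for i, truth in enumerate(truths)
        )
        accuracies.append(hits / len(truths))
    return mean(accuracies)
```

Calling `mean_group_accuracy` with `group_size` ranging from 1 to the number of available solvers yields one accuracy value per group size, as plotted in Figure 2; dropping one solver from `answers` reproduces the "without GPT-4" curves.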
low-accuracy differentials from multiple LLMs into synthetic high-accuracy differentials. The degree of increase in accuracy achieved by the method employed in this study is in line with similar results in the field of collective intelligence, both in the medical field [11, 12] and outside it [10, 13].

The mechanism that allows this increase in accuracy is that the presented aggregation method emphasizes plausible diagnoses (likely to be present in the differentials returned by multiple LLMs, and so bound to have a higher aggregate score), while minimizing the effects of hallucinated diagnoses (likely to be present in only one of the LLMs). This can be seen as an instance of the Anna Karenina principle (good answers are common to many LLMs, bad answers are local to a specific LLM).

This principle is exploited by similar techniques used in LLM research such as ensemble methods [14, 15] or multi-agent debates [16]. Two factors differentiate the method employed in this study from these techniques: universality and simplicity. First, this method works despite using LLMs with varying querying approaches and technical differences, and can thus be easily extended to work with any combination of LLMs. Second, the simplicity of this method means that not only can it be easily integrated in existing software applications, but it could even be performed by medical personnel manually querying separate LLMs and synthesizing the results themselves.

The trust of medical practitioners in LLM-based tools could be strengthened by the application of aggregation methods like the one employed in this study. In particular, knowing that multiple sources
contributed would increase clinician confidence in the final differential and lessen the fear of having encountered one of the many mistaken answers or hallucinations that LLMs are known to produce [5].

An additional advantage of the use of knowledge aggregation methods is preventing vendor lock-in, removing the need to engage with a single, potentially expensive, or legally problematic LLM vendor in order to obtain high-quality diagnostic differentials. The use of aggregation methods like the one we propose would address these issues by enabling the use of multiple cheaper LLMs or, alternatively, locally-deployed and fine-tuned LLMs. For instance, our results show that the 3-LLM group without GPT-4 (the top-performing LLM) offers a diagnostic accuracy within a couple of percentage points of GPT-4 alone.

With a clear baseline on diagnostic accuracy and trust, LLM-based tools can become valuable support instruments that can speed up diagnosis, reduce diagnostic mistakes and costs, and provide additional consulting services in underserved areas.

While this study shows that aggregating differentials produced by current LLMs leads to improved diagnostic accuracy, further studies are needed to examine the impact of future LLMs specialized on medical topics or the use of participatory AI methods (for instance, the synthesis of differentials aggregating responses from both LLMs and human practitioners).

5 Acknowledgements

The authors would like to thank Nikolas Zöller of the Max Planck Institute for Human Development for his valuable and constructive feedback.

The authors would also like to thank Irving Lin and Jay Komarneni of the Human Diagnosis Project for their suggested framing and review.

This work is supported by the European Union's Horizon Europe Research and Innovation Programme under grant agreement No 101070588 (HACID: Hybrid Human Artificial Collective Intelligence in Open-Ended Domains).

References

1. Sorin, V., Klang, E., Sklair-Levy, M., Cohen, I., Zippel, D. B., Balint Lahat, N., Konen, E. & Barash, Y. Large language model (ChatGPT) as a support tool for breast tumor board. NPJ Breast Cancer 9, 44 (2023).
2. Savage, T., Wang, J. & Shieh, L. A Large Language Model Screening Tool to Target Patients for Best Practice Alerts: Development and Validation. JMIR Medical Informatics 11, e49886 (2023).
3. Haug, C. J. & Drazen, J. M. Artificial intelligence and machine learning in clinical medicine, 2023. New England Journal of Medicine 388, 1201–1208 (2023).
4. Patel, S. B. & Lam, K. ChatGPT: the future of discharge summaries? The Lancet Digital Health 5, e107–e108 (2023).
5. Lee, P., Bubeck, S. & Petro, J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. New England Journal of Medicine 388, 1233–1239 (2023).
6. Schwartz, I. S., Link, K. E., Daneshjou, R. & Cortés-Penfield, N. Black box warning: large language models and the future of infectious diseases consultation. Clinical Infectious Diseases, ciad633 (2023).
7. Corrêa, N. K., Galvão, C., Santos, J. W., Del Pino, C., Pinto, E. P., Barbosa, C., Massmann, D., Mambrini, R., Galvão, L., Terem, E., et al. Worldwide AI ethics: A review of 200 guidelines and recommendations for AI governance. Patterns 4 (2023).
8. Eriksen, A. V., Möller, S. & Ryg, J. Use of GPT-4 to diagnose complex clinical cases. NEJM AI 1, AIp2300031 (2023).
9. Barile, J., Margolis, A., Cason, G., Kim, R., Kalash, S., Tchaconas, A. & Milanaik, R. Diagnostic Accuracy of a Large Language Model in Pediatric Case Studies. JAMA Pediatrics (2024).
10. Suran, S., Pattanaik, V. & Draheim, D. Frameworks for collective intelligence: A systematic literature review. ACM Computing Surveys (CSUR) 53, 1–36 (2020).
11. Barnett, M. L., Boddupalli, D., Nundy, S. & Bates, D. W. Comparative accuracy of diagnosis by collective intelligence of multiple physicians vs individual physicians. JAMA Network Open 2, e190096 (2019).
12. Kurvers, R. H., Nuzzolese, A. G., Russo, A., Barabucci, G., Herzog, S. M. & Trianni, V. Automating hybrid collective intelligence in open-ended medical diagnostics. Proceedings of the National Academy of Sciences 120, e2221473120 (2023).
13. Klein, N. & Epley, N. Group discussion improves lie detection. Proceedings of the National Academy of Sciences 112, 7460–7465 (2015).
14. Jiang, D., Ren, X. & Lin, B. Y. LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion. arXiv preprint arXiv:2306.02561 (2023).
15. Yang, H., Li, M., Xiao, Y., Zhou, H., Zhang, R. & Fang, Q. One LLM is not Enough: Harnessing the Power of Ensemble Learning for Medical Question Answering. medRxiv, 2023–12 (2023).
16. Chan, C.-M., Chen, W., Su, Y., Yu, J., Xue, W., Zhang, S., Fu, J. & Liu, Z. ChatEval: Towards better LLM-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201 (2023).