
Combining Insights From Multiple Large Language Models Improves Diagnostic Accuracy

Gioele Barabucci†,1, Victor Shia2,3, Eugene Chu4, Benjamin Harack3,5, and Nathan Fu3

1 University of Cologne, 2 Harvey Mudd College, 3 The Human Diagnosis Project, 4 Kaiser Permanente, 5 University of Oxford

arXiv:2402.08806v1 [cs.AI] 13 Feb 2024

Background Large language models (LLMs) such as OpenAI's GPT-4 or Google's PaLM 2 are proposed as viable diagnostic support tools or even spoken of as replacements for "curbside consults". However, even LLMs specifically trained on medical topics may lack sufficient diagnostic accuracy for real-life applications.

Methods Using collective intelligence methods and a dataset of 200 clinical vignettes of real-life cases, we assessed and compared the accuracy of differential diagnoses obtained by asking individual commercial LLMs (OpenAI GPT-4, Google PaLM 2, Cohere Command, Meta Llama 2) against the accuracy of differential diagnoses synthesized by aggregating responses from combinations of the same LLMs.

Results We find that aggregating responses from multiple different LLMs leads to more accurate differential diagnoses (average accuracy for groups of 3 LLMs: 75.3% ± 1.6pp) than the differential diagnoses produced by single LLMs (average accuracy for single LLMs: 59.0% ± 6.1pp).

Discussion The use of collective intelligence methods to synthesize differential diagnoses by combining the responses of different LLMs achieves two of the necessary steps towards advancing acceptance of LLMs as diagnostic support tools: (1) demonstrating high diagnostic accuracy and (2) eliminating dependence on a single commercial vendor.

1 Background

Large language models (LLMs) such as GPT-4 have been shown to be useful as support tools in various healthcare settings, such as during tumor boards [1] or as a screening tool to match patient notes to best practice alerts [2]. Their future use and deployment in healthcare is expected to parallel that of other AI tools such as automated electrocardiogram (ECG) anomaly detection, i.e., as support tools that provide insight to human practitioners to better inform their decisions [3].

In particular, there is ongoing research into the application of LLMs as summarization tools for patient and procedure information [4] or as a replacement for "curbside consults", especially in situations where colleagues may not be available (e.g., remote locations) or are too expensive (e.g., underserved groups) [5, 6].

Nevertheless, the use of LLM-based tools is impaired by, among other factors, their limited acceptance by medical professionals [7]. One major factor driving this low rate of acceptance is lack of trust that an LLM can provide correct answers and, additionally, avoid so-called "hallucinations", i.e., verisimilar but fictional responses. This lack of trust can in turn be traced back to a lack of data on the performance (e.g., correctness, accuracy, specificity) of said LLM-based tools. Put bluntly, before trusting them, medical practitioners want to know: "[Are they] good enough?" [5].

Recent studies have shown LLMs to be capable of performing well in certain medical contexts, but results also seem to vary depending on the study or application, which paints an unclear situation. For instance, Eriksen et al. [8] report that GPT-4 scores above 72% of the readers of medical journals in 38 clinical case challenges. On the other hand, Barile et al. [9] report that GPT-4 has a diagnostic error rate of 83% when confronted with 100 pediatric case challenges published in JAMA and NEJM.

This study investigates the use of collective intelligence methods to synthesize higher-accuracy differential diagnoses by aggregating differential diagnoses produced by a set of LLMs (even ones with low accuracy) in response to medical questions in the form of case vignettes.

† Corresponding author: gioele.barabucci@uni-koeln.de


Aggregating results from multiple LLMs could be the key to achieving high-accuracy responses (and potentially ones with fewer implausible or "hallucinated" responses). By employing algorithmic methods for knowledge aggregation from research on collective intelligence, it is possible to create a high-quality response to a question by aggregating lower-quality responses from multiple respondents [10]. For instance, past studies in the medical domain show that aggregating as few as three answers from inexperienced respondents led to high diagnostic accuracy (77%), significantly above the performance of individual human experts (62.5%) [11].

2 Methods

This study can be summarized as follows: we sampled 200 clinical vignettes of real-life cases from the Human Diagnosis Project (Human Dx) database, asked various LLMs to provide differential diagnoses for these cases, aggregated their responses using collective intelligence algorithms, and finally compared the accuracy of the individual LLM responses to the accuracy of the aggregated differentials.

The prompts and the Python scripts used to run the study are provided in the supplemental material. The case dataset and the LLM responses are available upon request.

2.1 Dataset and case selection

The data for this study is a set of 200 case vignettes extracted from the Human Dx database of clinical cases. Human Dx is a multinational online platform in which physicians and medical students solve teaching cases, as well as offer clinical reasoning support to fellow users.

The 200 cases have been randomly sampled from the dataset used by Barnett et al. [11], restricting the sampling to text-only vignettes. The correct diagnosis of each case (ground truth) is known and has been validated by medical experts as part of the Barnett et al. [11] study.

2.2 Querying of LLMs

Four general-purpose LLMs were asked to solve each case by providing a differential with five ranked diagnoses. The four LLMs used in this study are: OpenAI GPT-4, Google PaLM 2 for text (text-bison), Cohere Command, and Meta Llama 2 (llama-2-70b-f).

All prompts used to query the LLMs follow the same template: "[CASE TEXT] What is the differential (list format of common shorthand non-abbreviated diagnoses) for the above case? Respond with ONLY diagnosis names (one per line) up to a max of 5.", where [CASE TEXT] is replaced with the textual description of the case vignette. The actual prompts vary slightly between LLMs because of their different query paradigms (e.g., GPT-4 uses a chat paradigm while PaLM 2 uses a text generation paradigm).

In order to obtain cleaner differential diagnoses for combination into collectives during the scoring process, a round of manual prompt engineering was carried out using fewer than 5 Human Dx case vignettes to refine the format and structure of the responses.

We are highly confident that the case vignettes used in this study are not part of the training corpora of these LLMs. First, the case vignettes are only available to users logged into the Human Dx application and to select research partners. Second, there are contractual agreements in place between Human Dx and the providers of the LLMs (except for Cohere) that forbid the use of data included in prompts as training material. Finally, the correct diagnoses were never included in any of the prompts sent to the various LLMs.

2.3 Collective intelligence and response aggregation

Starting from the differential diagnoses provided by the single LLMs, 11 synthetic differential diagnoses have been generated by aggregating the single differential diagnoses in all possible combinations (6 two-fold combinations, 4 three-fold combinations, 1 four-fold combination).

The aggregation method is a frequency-based, 1/r-weighted method similar to those used in other collective intelligence studies focused on diagnostic tasks via differential diagnoses [11, 12]:

1. Normalization: All diagnoses in the differentials are normalized by removing common affixes (e.g., "syndrome", "disorder"), stop words (e.g., "by", "of", "with"), and punctuation. In addition, synonyms are merged into preferred terms, following the matching established by Barnett et al. [11].
2. Extraction of unique diagnoses: The set of all unique normalized diagnoses present across all differentials is created.
3. 1/r weighting: Inside each differential, each diagnosis is given an individual score calculated as the inverse of its rank r in the differential (i.e., the first diagnosis is given the score 1/1 = 1, the second 1/2 = 0.5, the third 1/3 ≈ 0.33, etc.).

4. Aggregation: Each of the unique diagnoses in the set created in step 2 is given an aggregate score calculated by adding all the individual scores of that diagnosis across all differentials.
5. Synthesis: A synthetic differential is generated by taking the five unique diagnoses with the highest aggregate scores and ranking them by their score in decreasing order.

2.4 Accuracy measure

The accuracy of a solver (either an LLM or a group of LLMs) is calculated as the percentage of correctly diagnosed cases among all cases.

For this study, we consider a case to be correctly diagnosed by a solver if the differential provided by that solver for that case contains the correct diagnosis among the five highest ranked diagnoses. This so-called TOP-5 matching mirrors similar correctness measures used in previous studies [11, 12]. Results obtained using TOP-1 or TOP-3 matching are provided in the supplemental material.

3 Results

The main finding of this study is that differential diagnoses created by aggregating differential diagnoses from multiple LLMs using collective intelligence methods (average accuracy for groups of 3 LLMs: 75.3% ± 1.6pp) are consistently more accurate than differential diagnoses produced by single LLMs (average accuracy of single LLMs: 59.0% ± 6.1pp).

The average accuracy of individual LLMs is 59.0% ± 6.1pp, i.e., on average an LLM produces a ranked differential diagnosis that contains the right diagnosis in 59.0% of the cases. The average accuracy of groups of LLMs increases as the group size grows: the average accuracy for groups of 2 LLMs is 69.1% ± 2.6pp, for groups of 3 LLMs 75.3% ± 1.6pp, and for the group of 4 LLMs 78.0% ± 0.1pp. Figure 1 shows the individual and average accuracy of the single LLMs and of groups of LLMs.

This finding also holds when the definition of a correctly diagnosed case is made stricter by considering only the 3 highest ranked diagnoses in a differential (TOP-3 matching), as illustrated in Figure 2. The trend is also confirmed when the differential diagnoses of the LLM with the highest individual accuracy, GPT-4, are excluded from the experiment (average accuracy of single LLMs: 54.6% ± 6.4pp, of groups of 2 LLMs: 63.5% ± 2.3pp, of groups of 3 LLMs: 70.0% ± 0.1pp). The supplemental material provides data on these alternative evaluation methods.
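As a concrete illustration, the aggregation method of Section 2.3 and the TOP-5 matching of Section 2.4 can be sketched in Python as follows. This is a minimal sketch, not the study's actual scripts: the stop-word list, affix list, synonym table, and example diagnoses are placeholder assumptions (the study follows the normalization and synonym matching established by Barnett et al. [11]).

```python
from collections import defaultdict

# Illustrative placeholders, not the actual resources used in the study.
STOP_WORDS = {"by", "of", "with"}
COMMON_AFFIXES = {"syndrome", "disorder"}
SYNONYMS = {"heart attack": "myocardial infarction"}

def normalize(diagnosis: str) -> str:
    """Step 1: strip punctuation, drop stop words and common affixes,
    then merge synonyms into preferred terms."""
    words = [w.strip(".,;:()") for w in diagnosis.lower().split()]
    words = [w for w in words if w and w not in STOP_WORDS and w not in COMMON_AFFIXES]
    term = " ".join(words)
    return SYNONYMS.get(term, term)

def aggregate(differentials, top_n=5):
    """Steps 2-5: pool the unique normalized diagnoses, give each entry a
    1/r score (r = its rank inside its differential), sum those scores
    across differentials, and return the top_n diagnoses by aggregate score."""
    scores = defaultdict(float)
    for differential in differentials:
        for rank, diagnosis in enumerate(differential, start=1):
            scores[normalize(diagnosis)] += 1.0 / rank
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

def is_correct(differential, ground_truth, k=5):
    """TOP-k matching (Section 2.4): a case counts as correctly diagnosed
    if the ground truth appears among the k highest ranked diagnoses."""
    return normalize(ground_truth) in [normalize(d) for d in differential[:k]]

# Two illustrative differentials from two hypothetical LLMs:
llm_a = ["Heart attack", "Pneumonia", "Pericarditis"]
llm_b = ["Myocardial infarction", "Pneumonia", "Pulmonary embolism"]
synthetic = aggregate([llm_a, llm_b])
# "Heart attack" and "Myocardial infarction" are merged by step 1, so the
# merged diagnosis receives the aggregate score 1/1 + 1/1 = 2.0 and ranks first.
```

Note how a diagnosis proposed by several LLMs accumulates score across differentials, while a diagnosis hallucinated by a single LLM keeps only its own 1/r weight; this is the mechanism discussed in Section 4.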

[Figure 1: bar chart of % accuracy over the 200 cases for single LLMs and for groups of 2, 3, and 4 LLMs, with per-LLM markers and per-size group averages.]

Figure 1 | Synthetic differential diagnoses aggregated from different LLMs show a greater diagnostic accuracy compared to differential diagnoses produced by single LLMs. This graph provides a visual representation of the data presented in Table 1. ⃝ = Cohere Command, △ = Google PaLM 2, □ = Meta Llama 2, ☆ = OpenAI GPT-4

[Figure 2: line chart of % accuracy over the 200 cases (y-axis, 40–80%) against the number of LLM responses contributing to the aggregated differential (x-axis, 1–4), with four series: TOP-5, TOP-5 (without GPT-4), TOP-3, and TOP-3 (without GPT-4).]
Figure 2 | Increasing the number of LLMs contributing to a synthetic differential leads to an increase in accuracy both (a) when the definition of a correctly diagnosed case is made stricter by considering only the 3 highest ranked diagnoses in a differential (TOP-3 matching) and (b) when the top-performing LLM, GPT-4, is excluded from the experiment.

4 Discussion

This study demonstrates the feasibility and validity of using collective intelligence methods to combine low-accuracy differentials from multiple LLMs into synthetic high-accuracy differentials. The degree of increase in accuracy achieved by the method employed in this study is in line with similar results in the field of collective intelligence, both in the medical field [11, 12] and outside it [10, 13].

The mechanism that allows this increase in accuracy is that the presented aggregation method emphasizes plausible diagnoses (likely to be present in the differentials returned by multiple LLMs, and thus bound to have a higher aggregate score), while minimizing the effects of hallucinated diagnoses (likely to be present in the output of only one LLM). This can be seen as an instance of the Anna Karenina principle (good answers are common to many LLMs, bad answers are local to a specific LLM).

This principle is exploited by similar techniques used in LLM research, such as ensemble methods [14, 15] or multi-agent debates [16]. Two factors differentiate the method employed in this study from these techniques: universality and simplicity. First, this method works despite the LLMs' varying querying approaches and technical differences, and can thus be easily extended to work with any combination of LLMs. Second, the simplicity of this method means that not only can it be easily integrated into existing software applications, but it could even be performed by medical personnel manually querying separate LLMs and synthesizing the results themselves.

The trust of medical practitioners in LLM-based tools could be strengthened by the application of aggregation methods like the one employed in this study.

Group size  LLMs in group                                               Accuracy  Average accuracy
1           Cohere Command                                              39.5%
1           Google PaLM 2                                               66.0%     59.0% ± 6.1pp
1           Meta Llama 2                                                58.5%
1           OpenAI GPT-4                                                72.0%
2           Cohere Command, Meta Llama 2                                58.0%
2           Google PaLM 2, Cohere Command                               64.5%
2           Google PaLM 2, Meta Llama 2                                 68.0%     69.1% ± 2.6pp
2           OpenAI GPT-4, Cohere Command                                73.5%
2           OpenAI GPT-4, Google PaLM 2                                 77.0%
2           OpenAI GPT-4, Meta Llama 2                                  73.5%
3           Google PaLM 2, Cohere Command, Meta Llama 2                 70.0%
3           OpenAI GPT-4, Cohere Command, Meta Llama 2                  75.5%     75.3% ± 1.6pp
3           OpenAI GPT-4, Google PaLM 2, Cohere Command                 79.0%
3           OpenAI GPT-4, Google PaLM 2, Meta Llama 2                   77.0%
4           Google PaLM 2, Cohere Command, Meta Llama 2, OpenAI GPT-4   78.0%     78.0% ± 0.1pp

Table 1 | Diagnostic accuracy of single LLMs and groups of LLMs over the 200 cases present in the dataset. The average accuracy is the mean of the accuracies of all groups of a given size.
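The "Average accuracy" column of Table 1 follows directly from the per-group accuracies: each value is the mean over all groups of that size (small residual differences come from rounding). A quick sketch:

```python
from statistics import mean

# Per-group TOP-5 accuracies (%) transcribed from Table 1.
accuracy_by_group_size = {
    1: [39.5, 66.0, 58.5, 72.0],
    2: [58.0, 64.5, 68.0, 73.5, 77.0, 73.5],
    3: [70.0, 75.5, 79.0, 77.0],
    4: [78.0],
}

# Average accuracy per group size = mean over all groups of that size.
averages = {size: mean(values) for size, values in accuracy_by_group_size.items()}
```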

In particular, knowing that multiple sources contributed would increase clinician confidence in the final differential and lessen the fear of having encountered one of the many mistaken answers or hallucinations that LLMs are known to produce [5].

An additional advantage of the use of knowledge aggregation methods is preventing vendor lock-in, removing the need to engage with a single, potentially expensive, or legally problematic LLM vendor in order to obtain high-quality diagnostic differentials. Aggregation methods like the one we propose would address these issues by enabling the use of multiple cheaper LLMs, or alternatively, locally deployed and fine-tuned LLMs. For instance, our results show that the 3-LLM group without GPT-4 (the top-performing LLM) offers a diagnostic accuracy within a couple of percentage points of GPT-4 alone.

With a clear baseline on diagnostic accuracy and trust, LLM-based tools can become valuable support instruments that speed up diagnosis, reduce diagnostic mistakes and costs, and provide additional consulting services in underserved areas.

While this study shows that aggregating differentials produced by current LLMs leads to improved diagnostic accuracy, further studies are needed to examine the impact of future LLMs specialized in medical topics or the use of participatory AI methods (for instance, the synthesis of differentials aggregating responses from both LLMs and human practitioners).

5 Acknowledgements

The authors would like to thank Nikolas Zöller of the Max Planck Institute for Human Development for his valuable and constructive feedback.

The authors would also like to thank Irving Lin and Jay Komarneni of the Human Diagnosis Project for their suggested framing and review.

This work is supported by the European Union's Horizon Europe Research and Innovation Programme under grant agreement No 101070588 (HACID: Hybrid Human Artificial Collective Intelligence in Open-Ended Domains).

References

1. Sorin, V., Klang, E., Sklair-Levy, M., Cohen, I., Zippel, D. B., Balint Lahat, N., Konen, E. & Barash, Y. Large language model (ChatGPT) as a support tool for breast tumor board. NPJ Breast Cancer 9, 44 (2023).
2. Savage, T., Wang, J. & Shieh, L. A Large Language Model Screening Tool to Target Patients for Best Practice Alerts: Development and Validation. JMIR Medical Informatics 11, e49886 (2023).
3. Haug, C. J. & Drazen, J. M. Artificial intelligence and machine learning in clinical medicine, 2023. New England Journal of Medicine 388, 1201–1208 (2023).
4. Patel, S. B. & Lam, K. ChatGPT: the future of discharge summaries? The Lancet Digital Health 5, e107–e108 (2023).
5. Lee, P., Bubeck, S. & Petro, J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. New England Journal of Medicine 388, 1233–1239 (2023).
6. Schwartz, I. S., Link, K. E., Daneshjou, R. & Cortés-Penfield, N. Black box warning: large language models and the future of infectious diseases consultation. Clinical Infectious Diseases, ciad633 (2023).
7. Corrêa, N. K., Galvão, C., Santos, J. W., Del Pino, C., Pinto, E. P., Barbosa, C., Massmann, D., Mambrini, R., Galvão, L., Terem, E., et al. Worldwide AI ethics: A review of 200 guidelines and recommendations for AI governance. Patterns 4 (2023).
8. Eriksen, A. V., Möller, S. & Ryg, J. Use of GPT-4 to diagnose complex clinical cases. NEJM AI 1, AIp2300031 (2023).
9. Barile, J., Margolis, A., Cason, G., Kim, R., Kalash, S., Tchaconas, A. & Milanaik, R. Diagnostic Accuracy of a Large Language Model in Pediatric Case Studies. JAMA Pediatrics (2024).
10. Suran, S., Pattanaik, V. & Draheim, D. Frameworks for collective intelligence: A systematic literature review. ACM Computing Surveys (CSUR) 53, 1–36 (2020).
11. Barnett, M. L., Boddupalli, D., Nundy, S. & Bates, D. W. Comparative accuracy of diagnosis by collective intelligence of multiple physicians vs individual physicians. JAMA Network Open 2, e190096 (2019).
12. Kurvers, R. H., Nuzzolese, A. G., Russo, A., Barabucci, G., Herzog, S. M. & Trianni, V. Automating hybrid collective intelligence in open-ended medical diagnostics. Proceedings of the National Academy of Sciences 120, e2221473120 (2023).
13. Klein, N. & Epley, N. Group discussion improves lie detection. Proceedings of the National Academy of Sciences 112, 7460–7465 (2015).
14. Jiang, D., Ren, X. & Lin, B. Y. LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion. arXiv preprint arXiv:2306.02561 (2023).
15. Yang, H., Li, M., Xiao, Y., Zhou, H., Zhang, R. & Fang, Q. One LLM is not Enough: Harnessing the Power of Ensemble Learning for Medical Question Answering. medRxiv, 2023–12 (2023).
16. Chan, C.-M., Chen, W., Su, Y., Yu, J., Xue, W., Zhang, S., Fu, J. & Liu, Z. ChatEval: Towards better LLM-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201 (2023).

