Arg Med LLM

ArgMed-Agents: Explainable Clinical Decision Reasoning with Large Language
Models via Argumentation Schemes
Shengxin Hong1 Liang Xiao2 , Xin Zhang1 and Jianxia Chen2

1
Detroit Green Technology Institute, Hubei University of Technology, China
2
School of Computer Science, Hubei University of Technology, China
seikin.shengxinhong@gmail.com, lx@mail.hbut.edu.cn
arXiv:2403.06294v1 [cs.AI] 10 Mar 2024
Abstract faced with highly complex clinical reasoning tasks [Pal et

al., 2023]. (ii) There is a perception that LLMs use unex-
There are two main barriers to using large language plainable methods to arrive at clinical decisions (known as
models (LLMs) in clinical reasoning. Firstly, while black boxes), which may have led to user distrust [Eigner and
LLMs exhibit significant promise in Natural Lan- Händler, 2024].
guage Processing (NLP) tasks, their performance
To address these barriers, exploring the capabilities of
in complex reasoning and planning falls short of
LLMs in argumentative reasoning is a promising direc-
expectations. Secondly, LLMs use uninterpretable
tion. Argumentation is a means of conveying a compelling
methods to make clinical decisions that are fun-
point of view that can increase user acceptance of a posi-
damentally different from the clinician’s cognitive
tion. Its considered a fundamental requirement for build-
processes. This leads to user distrust. In this
ing Human-Centric AI [Dietz et al., 2022]. As compu-
paper, we present a multi-agent framework called
tational argumentation has become a growing area of re-
ArgMed-Agents, which aims to enable LLM-based
search in Natural Language Processing (NLP) [Dietz et al.,
agents to make explainable clinical decision rea-
2021], researchers have begun to apply argumentation to
soning through interaction. ArgMed-Agents per-
a wide range of clinical reasoning applications, including
forms self-argumentation iterations via Argumen-
analysis of clinical discussions [Qassas et al., 2015], clini-
tation Scheme for Clinical Decision (a reasoning
cal decision making [Hong et al., 2023; Zeng et al., 2020;
mechanism for modeling cognitive processes in
clinical reasoning), and then constructs the argu- Sassoon et al., 2021], address clinical conflicting [Čyras et
mentation process as a directed graph representing al., 2018]. In the recent past, some work has assessed the abil-
conflicting relationships. Ultimately, Reasoner(a ity of LLMs in argumentation reasoning [Chen et al., 2023;
symbolic solver) identify a series of rational and Castagna et al., 2024] or non-monotonic reasoning [Xiu et
coherent arguments to support decision. ArgMed- al., 2022]. Although LLMs show some potential for compu-
Agents enables LLMs to mimic the process of clin- tational argumentation, more results show that LLMs perform
ical argumentative reasoning by generating expla- poorly in logical reasoning tasks [Xie et al., 2024], and bet-
nations of reasoning in a self-directed manner. The ter ways to utilise LLMs for non-monotonic reasoning tasks
setup experiments show that ArgMed-Agents not need to be explored.
only improves accuracy in complex clinical deci- Meanwhile, LLM as agent studies have been surpris-
sion reasoning problems compared to other prompt ingly successful [Zhang et al., 2023; Wang et al., 2023;
methods, but more importantly, it provides users Shi et al., 2024]. These methods use LLMs as computational
with decision explanations that increase their con- engines for autonomous agents, and optimise the reasoning,
fidence. planning capabilities of LLMs through external tools (e.g.
symbolic solvers, APIs, retrieval tools, etc.) [Pan et al., 2023;
Shi et al., 2024], multi-agent interactions [Tang et al., 2024]
1 Introduction and novel algorithmic frameworks [Gandhi et al., 2023].
Large Language Models (LLMs) [OpenAI, 2023] have re- Through this design, LLMs agents can interact with the envi-
ceived a lot of attention for their human-like performance in ronment and generate action plans through intermediate rea-
a variety of domains. In the medical field especially, prelim- soning steps that can be executed sequentially to obtain an
inary studies have shown that LLMs can be used as clinical effective solution.
assistants for tasks such as writing clinical texts [Nayak et Motivated by these concepts, we present ArgMed-Agents,
al., 2023], providing biomedical knowledge [Singhal et al., a multi-agent framework designed for explainable clinical de-
2022] and drafting responses to patients’ questions [Ayers et cision reasoning. We formalised the cognitive process of clin-
al., 2023]. However, the following barriers to building an ical reasoning using an argumentation scheme for clinical de-
LLM-based clinical decision-making system still exist: (i) cision (ASCD) as a prompt strategy for interactive reasoning
LLMs still struggle to provide secure, stable answers when by LLM agents. There are three types of agents in ArgMed-
Agents: the Generator, the Verifier, and the Reasoner. the inclusion) among the admissible sets(a set of all arguments
Generator generates arguments to support clinical decisions are acceptable and conflict-free) with respect to A.
based on the argumentation scheme; the Verifier checks the
arguments for legitimacy based on the critical question, and 2.2 Argumentation Scheme
if not legitimate, it asks the Generator to generate attack ar- The concept of Argumentation Scheme (AS) originated
guments; Reasoner is a symbolic solver that identifies reason- within the domain of informal logic, stemming from the semi-
able, non-contradictory arguments in the resulting directed ar- nal works of [Walton et al., 2008; Walton, 1996]. Argumenta-
gumentation graph as decision support. tion Scheme serves as a semi-formalized framework for cap-
In our method, we do not expect every proposed argument turing and analyzing human reasoning patterns. Formally de-
or detection of Generator or Verifier to be correct, instead we fined as AS = ⟨P, c, V ⟩, it comprises a collection of premises
consider their generation as a assumption. The LLM agents (P ), a conclusion (c) substantiated by these premises, and
are induced to recursively iterate in a self-argumentative man- variables (V ) inherent within the premises (P.V ) or the con-
ner through the prompt strategy , while the newly proposed clusion (c.V ). A pivotal aspect of the argumentation scheme
assumptions always contradict the old ones, and eventually is the delineation of Critical Questions (CQs) pertinent to
the Reasoner eliminates unreasonable assumptions and iden- AS. Failure to address them prompts a challenge to both the
tifies coherent arguments, leading to consistent reasoning re- premises and the conclusion posited by the scheme. Conse-
sults. ArgMed-Agents enables LLMs to explain its own out- quently, the role of CQs is to instigate argument generation;
puts in terms of self-cognitive profiling by modelling their when an AS is contested, it engenders the formulation of a
own generation as a prompt for question recursion. The gen- counter-argument in response to the initial AS. This iterative
erative process derives the attack relations of the arguments process culminates in the construction of an attack argument
are determined based on the ASCD prompt strategy. For ex- graph, facilitating a nuanced understanding of argumentative
ample, when there exist two decisions as arguments a and b, dynamics. Figure 1 illustrates three templates of argumenta-
they attack each other based on the decision exclusivity de- tion schemes.
fined in ASCD.
Our experiment was divided into two parts, evaluating ac- 3 Argumentation Scheme for Clinial Decision
curacy and explainability of ArgMed-Agents clinical reason- In the past, numerous studies [Oliveira et al., 2018; Sassoon
ing, respectively. First, we conducted experiments on two et al., 2021; Qassas et al., 2015] have explored the application
datasets, including MedQA [Jin et al., 2020] and PubMedQA of argumentation in the clinical domain. In this section, we
[Jin et al., 2019]. To better align with real-world applica-
provide a summary of these endeavors, focusing on the de-
tion scenarios, our study focused on a zero-shot setting. Sec- velopment of Argumentation Schemes for Clinical Decision
ond, we used predictability and traceability as measures of (ASCD). ASCD encapsulate various argumentation scheme
explainability by manually assessing whether the clinical rea- tailored for clinical decision-making processes and the rea-
soning process of ArgMed-Agents is meaningful in terms of soning process. Additionally, we propose to refine and adapt
explanation for users. The results show that ArgMed-Agents these schemes to enhance their suitability for LLM. ASCD
achieves better performance in both accuracy and explain- consists of three argumentation schemes which are Argumen-
ability compared to direct generation and Chain of Thought tation Scheme for Decision (ASD), Argumentation Scheme
(CoT). for Side Effect (ASSE) and Argumentation Scheme for Bet-
ter Decision (ASBD). Figure 1 includes the components of
2 Preliminaries the three argumentation schemes and the derivations between
them.
2.1 Abstract Argumentation Framework ASCD formalizes the decision-making process for clini-
Abstract Argumentation (AA) frameworks [Dung, 1995] are cians. Clinical decision reasoning begins with ASD, where
pair ⟨A, R⟩ composed of two components: a set of abstract decisions are related to clinical goals. Formally, in Figure
arguments A, and a binary attack relation R. Given an AA 1 we model the derivation of CQs. When the CQs are re-
framework ⟨A, R⟩, a, b ∈ Aand(a, b) ∈ R indicates that a jected, new ASs are generated as arguments to refute the first
attacks b. On top of that, Dung defines some notions: proposed AS based on the derivation relation. It is worth not-
ing that when a CQ without a derivation relationship (e.g.,
• ∃a ∈ A is acceptable if f ∄b ∈ A such that (b, a) ∈ R ASD.CQ1, ASSE.CQ1) is rejected, the AS will be refuted by
or ∃b, c ∈ A and (b, a) ∈ R such that (c, b) ∈ R. itself (on the grounds of the CQ’s answer). We illustrate how
• A set of arguments is conflict-free if there is no attack to reason using ASCD with an example:
between its arguments. • Joey is a 50 year old male patient who has suffered a
In order to decide whether a single parameter can be ac- stroke and has high blood pressure. For Joey, the goal is
cepted or whether multiple parameters can be accepted at the to control his blood pressure;
same time, the argumentation system allows for the use of First, we represent the initial ASD reasoning result as
various semantics to compute the set of parameters (called ASD1 (joey, control blood pressure, ACEI) → ACEI
extensions). For example, Given an AA framework ⟨A, R⟩, meaning that ACEI drugs should be considered as treatment
a, b ∈ Aand(a, b) ∈ R, E ⊆ A is a prefered extension only for joey. Given the known patient facts and clinical guide-
if it is a maximal element (with respect to the set-theoretical lines, both ASD1 .CQ1 and ASD1 .CQ2 were accepted,
knowledge to build knowledge bases and rule. In this sec-
Argumentation Scheme for Decision
ASD (F, D, G)
tion, we propose a multi-agent framework called ArgMed-
Premise - Given the patient Fact F Agents, which supports the seamless integration of prompt
Premise - To achieve Goal G
Premise - Decision D promotes the goal G strategy designed based on ASCD into agent interactions.
Conclusion: Decision D should be considered Our approach enhances LLMs to be able to perform explain-
CQ1: Is there any evidence to support Decision able clinical decision reasoning without the need for expert
D promote goal G? involvement in knowledge encoding.
derive
CQ2: Has Decision D caused side effects on the
patient? 4.1 ASCD-based Multi-agent Interaction
CQ3: Can Goal G be achieved by completing the
Decision D?
ArgMed-Agents framework includes three distinct types of
derive
CQ4: Are there alternative Decision to achieve LLMs agent:
derive
the same goal G? • Generator(s): Instantiate the AS according to the current
situation.
Argumentation Scheme for Side Effect Argumentation Scheme for Better Decision
ASSE (SE, D, F) ASBD (F, G, D1, D2)
• Verifier: Whenever Generator generates an instantiation
Premise - Given the patient Fact F Premise - Given the Decisions D1 and D2 of an AS, Verifier checks the accuracy of the instantia-
Premise - Decision D results in side effect SE Premise - Given the patient Fact F, for Goal G,
Conclusion: Decision D needs to be used with caution
Decision D2 is better than D1 tion of that AS via CQ. When the CQ validation is re-
Conclusion: Decision D2 should be considered jected, return to the generator the reason why the CQ
CQ1: Is there any evidence to support Decision D
results in side effect SE?
CQ1: Is there any evidence to support Decision D2 rejected, so that the generator generates a new AS based
better than Decision D1 for Goal G?
on the derivation rules of that CQ. The process iterates
CQ2: Is this side effect SE unacceptable?
until no more new AS are generated.
CQ3: Are there ways of ameliorating the side effects
observed? • Reasoner: A symbolic solver for computational argu-
mentation. At the end of the iteration, the arguments
Add New presented by the generator constitute a complete argu-
Goal to G mentation framework, and it identifies a subset of coher-
ent and reasonable arguments within that argumentation
framework.
Figure 1: The derivation rules between the argumentation schemes We convert ASCD into a step-by-step reasoning interac-
in ASCD. tion protocol as a prompt strategy and define the interac-
tion behavior between LLM agents based on ASCD prompt
strategy, which describes the specification of the interac-
and ASD1 .CQ4 was found to have substituted decision- tion behavior between different types of agents under the
making thta ARB can also control blood pressure. Ac- ASCD reasoning mechanism. Figure 2 depicts how ArgMed-
cording to the derivation rules of ASD1 .CQ4, first gen- Agents interacts. In this example, GeneratorA receives
erate the ASD2 (joey, control blood pressure, ARB) → the question at first, which uses ASD to generate an initial
ARB. Since decisions are exclusive, ASD1 and ASD2 decision Paroxetine for the patient. Then Verifier detects
are mutually attacks. Based on known clinical evidence, that the argument presented by GeneratorA is rejected by
ACEI have shown more significant effects in reducing the ASD.CQ2 and ASD.CQ4, which results in argument from
risk of cardiovascular events (e.g., heart attack, stroke). GeneratorB , GeneratorC and GeneratorD . The output
Therefore, another derivation of ASD1 .CQ4 is denoted as from GeneratorB indicates that Paroxetine exists side ef-
ASBD(joey, control blood pressure, ACEI, ARB) → fect derived from ASD.CQ2. On the other hand, accord-
ACEI. From this, we get a arguments graph in which ASD1 ing to the derivation rules of ASD.CQ4, firstly, GeneratorC
and ASD3 attack each other and ASBD attacks ASD2 . No- is generated to suggest that there exists an alternative deci-
tably, since we set exclusivity between ASD, in order for sion Trazodone; secondly, GeneratorD finds that Trazodone
ASD.CQ1 and ASD.CQ3 to be valid, when these two CQs is more suitable for the patient than Paroxetine by compar-
are rejected, that ASD first refutes itself (forming a closed ing Paroxetine and Trazodone. Moreover, the Verifier of
loop) and then generates a new ASD. GeneratorB suggests that GeneratorB ’s argument is re-
jected by ASSE.CQ3, but this CQ does not have a derivation
4 ArgMed-Agents: a Multi-LLM-Agents rule, so the argument representing GeneratorB would attack
Framework itself. The Verifier facilitates an iterative process of mutual
debate between Generators, which forms an argumentation
In Section 3, we discussed Argumentation Schemes for clini- framework that can be solved by the symbolic Reasonner. We
cal decision(ASCD). ASCD is a semi-formal reasoning tem- provide a formal description of the interaction process in Ap-
plate for clinical decision making by defining the logical pendix A.
structure and the reasoning mechanism in the clinical rea- In ArgMed-Agents, we do not expect single generation or
soning process. However, the clinical decision support sys- single verification to be correct. ArgMed-Agents uses gener-
tems based on argumentation scheme in the past [Sassoon et ation as a assumption to recursively prompt the model with
al., 2021; Qassas et al., 2015] have often been implemented critical questions, identifying conflicting and erroneous ar-
through knowledge-based approaches, which rely on expert guments in an iterative process. Ultimately, such mutually
Question
Premise - Given the patient Fact F [A 40-year-old woman with symptoms of
A 40-year-old woman presents with difficulty falling asleep, diminished depression including difficulty falling asleep, diminished appetite, Premise - Given the Decisions Trazodone and
appetite, and tiredness for the past 6 weeks. She says that, despite going tiredness, irritability, feelings of hopelessness, diminished Paroxetine
to bed early at night, she is unable to fall asleep. She denies feeling concentration and interest, and weight loss.]
anxious or having disturbing thoughts while in bed. Even when she manages to Premise -
Generator
fall asleep, she wakes up early in the morning and is unable to fall back Premise - To achieve Goal G [G: Treat the patient's condition] Effectiveness in Treating Depression: Generator
A
asleep. She says she has grown increasingly irritable and feels increasingly Both Trazodone and Paroxetine have evidence D
hopeless, and her concentration and interest at work have diminished. The Premise - Paroxetine promotes the goal G [ Paroxetine is a selective supporting their effectiveness in treating
patient denies thoughts of suicide or death. Because of her diminished serotonin reuptake inhibitor (SSRI) antidepressant known to effectively depression.
appetite, she has lost 4 kg (8.8 lb) in the last few weeks and has started treat depression by increasing serotonin levels in the brain.]
drinking a glass of wine every night instead of eating dinner. She has no Side Effect Profile:
significant past medical history and is not on any medications. Which of the Trazodone may be more sedating but generally has
following is the best course of treatment in this patient? Conclusion: Paroxetine should be considered fewer sexual side effects compared to Paroxetine.
However, both medications have their own set of
Options: (a) Diazepam (b) Paroxetine (c) Zolpidem (d) Trazodone potential adverse effects.
CQ1: Is there any evidence to support Decision D (Paroxetine) promote
goal G? Symptom Relief:
YES Both medications can potentially address the
Argumentation CQ2:Has Decision D (Paroxetine) caused side effects on the patient?
patient's symptoms, but Trazodone may be more
specifically targeted towards sleep disturbances.
Framework YES Verifier
CQ3:Can Goal G be achieved by completing the Decision D (Paroxetine)? Conclusion: Trazodone should be considered
YES
A CQ4: Are there alternative Decision of (Paroxetine) to achieve the
C same goal G?
YES
CQ1: Is there any evidence to support Decision D2
(Trazodone) better than Decision D1 (Paroxetine)
for Goal G?
YES
Premise - The patient is experiencing symptoms of depression, including Verifier
B difficulty falling asleep, diminished appetite, tiredness, irritability,
hopelessness, diminished concentration, and loss of interest in work.
D Generator
She has also lost weight due to decreased appetite and has started using
alcohol as a coping mechanism.
B
(Paroxetine) results in side effect SE?
Premise - The decision to prescribe Paroxetine, a selective serotonin
YES
reuptake inhibitor (SSRI) commonly used to treat depression and anxiety
disorders, is considered as a potential course of treatment. Verifier
CQ2:Is this side effect SE not acceptable?
NO
Premise - Paroxetine, like many medications, can have side effects.
Common side effects of Paroxetine include nausea, drowsiness, dizziness,
CQ3: Are there ways of ameliorating the side effects
dry mouth, sweating, blurred vision, constipation, or headache.
observed?
Additionally, more serious side effects such as serotonin syndrome,
NO
suicidal thoughts or behavior, and withdrawal symptoms can occur,
although they are less common.
Symbolic Conclusion: Considering the patient's symptoms and the potential side
effects of Paroxetine, the decision to prescribe Paroxetine needs to be
(Trazodone) promote goal G?
Solver made with caution. YES
CQ2:Has Decision D (Trazodone) caused side effects on the Verifier

patient?
Premise - Given the patient Fact F [A 40-year-old woman with symptoms YES
Reasoner of depression including difficulty falling asleep, diminished
appetite, tiredness, irritability, feelings of hopelessness, CQ3:Can Goal G be achieved by completing the Decision D
diminished concentration and interest, and weight loss.] (Trazodone)?
Generator YES
C Premise - To achieve Goal G [G: Treat the patient's condition]
CQ4: Are there alternative Decision of (Paroxetine,
Premise - Trazodone promotes the goal G [Trazodone is primarily Trazodone) to achieve the same goal G?
Decision prescribed as an antidepressant due to its serotonin modulating YES
properties and its sedative effects, which can help with sleep
disturbances associated with depression.] //The iteration will continues because this CQ2 and CQ4 are
Trazodone should be considered as treatment rejected but is not shown here due to space constraints
Conclusion: Trazodone should be considered
Figure 2: An example from the MedQA USMLE dataset, with the entire process of ArgMed-Agents reasoning about the clinical problem.
Notably, the letters in the argumentation framework correspond to the serial numbers of the four generators on the right, representing the
premises and conclusion generated by that generator. In the argumentation framework, the red nodes (A and C) represent arguments in
support of the decision and the yellow nodes (B and D) represent arguments in support of the beliefs.
attacking arguments are converted into a formal framework ized process consistent with clinical reasoning (i.e., following
that solves for a subset of reasonably coherent arguments via the ASCD prompt strategy). This design not only facilitates
Reasoner. LLMs to engage in clinical reasoning in a critical-thinking
manner, which in turn maintains a cognitive processes consis-
The theoretical motivation for our method stems from tent with clinicians to improve the explainability of decisions,
non-monotonic logic, logical intuition and cognitive clinical. but also allows for the effective highlighting of some implicit
Studies have shown that LLM performs reasonably well for knowledge in LLMs that cannot be readily accessed through
simple reasoning, single-step reasoning problems, however, traditional prompts.
as the number of reasoning steps rises, the rate of correct
reasoning decreases in a catastrophic manner [Creswell et
al., 2022]. Thus, we consider that LLM is primed with ra- 4.2 Explainable Argumentative Reasoning
tional intuition, but lacks true logical reasoning ability. In ArgMed-Agent obtains a complete argument graph by iter-
light of this, ASCD Prompt Stratygy guides LLM in a re- ating through the ASCD Prompt Strategy. In this section,
cursive form to generate a series of casual reasoning steps as we describe how Reasoner reasons and explains decisions
a tentative conclusion, with any further evidence withdraw- through these arguments. First, we map the reasoning results
ing their conclusion. The process ultimately leads to a di- of ASCD to the AA framework.
rected graph representing the disputed relationships. In addi-
tion this, agents with different LLMs roles within ArgMed- Definition 1. An AA framework for ASCD is a pair ⟨A, R⟩
Agents interact collaboratively with each other in a formal- such that:
• ∃arg = ⟨P, c, V ⟩ ∈ A that arguments set A is consti- • ∃a ∈ Argsd (A) is acceptable such that Argsd (E) ̸= ∅
tuted by the argumentation schemes.
• ∃a ∈ Argsd (E) such that ∀b ∈ Argsd (A), b ̸= a that
• A = Argsd (A) ∪ Argsb (A) such that arguments in ar- b∈
/ Argsd (E).
gument set A are distinguished between two types of ar-
guments: arguments in support of decisions Argsd (A) Proposition 1 states that using preferred semantics on the
and arguments in support of beliefs Argsb (A). AA framework for ASCD always selects extensions that con-
tain the only acceptable decision, which is consistent with our
In Definition 1, arguments in support of decisions expected reasoning. Meanwhile, arguments from Argsb (E)
Argsd (A) correspond to ASD in the argumentation scheme, can be regarded as explanation of decision Argsd (E) [Fan
and arguments in support of beliefs Argsb (A) correspond to and Toni, 2015]. The proof of the proposition proceeds as
ASSE and ASBD in the argumentation scheme. These two follows: We first prove that there will be only one decision in
types of arguments play different roles; arguments in sup- the preferred extension. According to Definition 2, we know
port of decisions build on beliefs and goals and try to jus- that decisions are exclusive, so different decisions do not ap-
tify choices, while arguments in support of beliefs always try pear in the same admissible set. Secondly, we prove that the
to undermine decision arguments. Therefore, based on the preferred extension will always contain a argument in sup-
modeling of ASCD, we make the following setup for the at- port of decision. According to the modeling of ASCD as well
tack relation between these two types of arguments: first, ar- as Definition 2, it can be seen that arguments in support of
guments in support of different decisions are in conflict with decision (ASD) are built on top of arguments in support of
each other, which is consistent with our previous description beliefs (ASSE and ASBD), i.e., the inclusion of ASD argu-
of decisions being exclusive in ASCD. Second, arguments ments does not lead to the ASSE and ASBD arguments be-
in support of decisions are not allowed to attack arguments coming unacceptable. Therefore, the set is maximized when
in support of beliefs, which is dictated by the modeling of the acceptable set includes pro-decision arguments.
ASCD, as can be seen in Figure 1, where ASSE and ASBD
are always trying to disprove ASD, while the converse does Definition 4. Given an AA framework for ASCD ⟨A, R⟩ and
not work. Formalized as: E is its prefered extension. There are errors in the clinical
reasoning of ArgMed-Agents in this case iff Argsd (E) = ∅.
Definition 2. Given an AA framework for ASCD ⟨A, R⟩,
R ⊆ Argsd (A) × Argsb (A) such that: Clinical reasoning errors are discussed in [Tang et al.,
2024]. They consider that most clinical reasoning errors in
• ∀arg1 , arg2 ∈ Argsd (A) such that arg1 ̸= arg2 , LLMs are due to confusion about domain knowledge. On
(arg1 , arg2 ) ∈ R. the other hand, research by [Singhal et al., 2022] points out
• ∄(arg1 , arg2 ) ∈ R such that arg1 ∈ Argsd (A) and that LLMs may produce compelling misinformation about
arg2 ∈ Argsb (A). medical treatment, so it is crucial to recognize this illusion,
which is difficult for humans to detect. For this purpose, we
Next, we use the preferred semantics in the AA framework define the relevant mechanisms for identifying clinical rea-
to illustrate how the Reasoner Agent selects the set of possible soning errors in Reasoner as follows: When any decision in
admissible arguments to support clinical decision. the AA framework is not accepted, we consider there are er-
Definition 3. Given an AA framework for ASCD ⟨A, R⟩, E ⊆ rors in the reasoning by ArgMed-Agents. This mechanism
E is a set of preferred extension such that Argsb (E) is the assists ArgMed-Agents in identifying the capability bound-
support decision for E. aries of LLMs when their knowledge reserves are insufficient
to address certain issues.This helps ArgMed-Agents avoid
Example 1. An AA framework for ASCD is given below such the risks associated with adopting erroneous decisions and
that Argsd (A) = {A, B, C} and Argsb (A) = {D, E}. In achieving more robust and safe clinical reasoning.
this case, decisions B and C are optional because they are
supported by preferred extension.
5 Experiments
D E
In this section, we demonstrate the potential of ArgMed-
Admissible Sets: {B, D, E}, Agents in clinical decision reasoning by evaluating the ac-
{C, D, E},{B, D},{B, E}, curacy and explainability.
A {C, D},{C, E},{D, E},{B}, We implemented Generator and Verifier in ArgMed-
{C},{D},{E},{} Agents using the APIs GPT-3.5-turbo and GPT-4 provided
by OpenAI [OpenAI, 2023], and according to our setup, the
Prefered Extension:{B, D, E}, LLM Agents are implemented with the same LLM and dif-
B C {C, D, E} ferent few-shot prompts. Each agent is configured with spe-
cific parameters: the temperature is set to 0.0, as well as a
dialogue limit of 4, indicating the maximum number of de-
In this regard, we present a proposition to justify the Rea- cisions allowed for ArgMed-Agents to generate, which is to
soner agent’s use of preferred semantics for decision making. prevent the agents from getting into loops with each other. On
Proposition 1. Given an AA framework for ASCD ⟨A, R⟩ the other hand, we using python3 to implemente a symbolic
and E ⊆ E is a set in preferred extension E where, solver of abstract argumentation framework as Reasoner.
Model Method MedQA PubMedQA Method Pre. Tra. % Arg.
Direct 52.7 68.4 CoT 0.63 2.8 59.5%
GPT-3.5-turbo CoT 48.0 71.5 ArgMed-Agents 0.91 4.2 87.5%
ArgMed-Agents 62.1 78.3
Table 2: Human evaluation result on 20 examples.
Direct 67.8 72.9
GPT-4 CoT 71.4 77.2
we analysed the examples in which both Direct Generation,
ArgMed-Agents 83.3 81.6 CoT and ArgMed-Agents reasoned wrongly. We found that
ArgMed-Agents can report 76% of errors in these examples
Table 1: Results of the accuracy of various clinical decision reason- according to Reasoner’s definition 4. This provides initial
ing methods on MedQA and PubMedQA datasets. evidence that ArgMed-Agents not only has higher accuracy
compared to baseline, but also that it is safer for clinical de-
5.1 Accuracy cision reasoning.
The following two datasets were used to assess the accuracy 5.2 Human Evaluation for Explainability
of ArgMed-Agents clinical reasoning:
The main focus of Explainable AI (XAI) is usually the rea-
• MedQA [Jin et al., 2020]:Answering multiple-choice soning behind decisions or predictions made by AI that be-
questions derived from the United States Medical Li- come more understandable and transparent. In the context
cense Exams (USMLE). This dataset is sourced from of- of artificial intelligence, Explainability is defined as follows
ficial medical board exams and encompasses questions [ISO, 2020]:
in English, simplified Chinese, and traditional Chinese.
The total question counts for each language are 12,723, ...level of understanding how the AI-based system came
34,251, and 14,123, respectively. up with a given result.
• PubMedQA [Jin et al., 2019]: A biomedical ques- Based on this criterion, we define two measures of LLMs’
tion and answer (QA) dataset collected from PubMed explanability:
abstracts. The task of PubMedQA is to answer • predictability(Pre.): Can the explanations given by
yes/no/maybe research questions using the correspond- LLMs help users predict LLM decisions?
ing abstracts (e.g., Do preoperative statins reduce atrial
fibrillation after coronary artery bypass graft surgery?). • proof-based traceability(Tra.): Can the model reason
about the correct decision based on the correct path of
We randomly selected 300 examples in each dataset for our inference (i.e. does the path of inference as an explana-
experiments. We set the baseline in our experiments to com- tion have relevance to the LLM’s decision)?
pare with gpt direct generation and Chain of Thought(CoT)
[Wei et al., 2023]. It is worth noting that we focussed on eval- For the evaluation of predictability, our experimental setup
uating the performance of ArgMed-Agents on clinical deci- is as follows: we recorded the inputs and the reasoning pro-
sion reasoning, however, there were some biomedical general cess (e.g., the complete dialogue between Generator and Ver-
knowledge quiz-type questions in MedQA and PubMedQA, ifier in ArgMed-Agents and the corresponding argumentation
and so we intentionally excluded this part of the questioning framework). We invited 50 undergraduate and graduate stu-
in selecting the example. dents in medical-related disciplines to answer questions given
Table 1 shows the accuracy results of MedQA and Pub- only the inputs to the LLMs as well as the explanations (four
MedQA. We compared ArgMed-Agents to several baselines questions were distributed to each individual), and counted
in direct generation and CoT settings. Notably, both GPT-3.5- to what extent they were able to predict the decisions of the
turbo and GPT4 models showed that our proposed ArgMed- LLMs.
Agents improved accuracy on a clinical decision reasoning Our team has produced a fully functional knowledge-based
task compared to a baseline such as CoT. This result demon- clinical decision support system at an early stage that can
strates the effectiveness of ArgMed-Agents to enhance the be used to assist in treatment decisions for complex dis-
clinical reasoning performance of LLMs. Interestingly, after eases such as cancer, neurological disorders, and infectious
we repeated the experiment several times, we found that the diseases. Figure 3 illustrates this knowledge-based CDSS,
introduction of CoT sometimes leads to a surprising drop in consisting of computer-interpretable guidelines in the form
performance. The reason for this may be that when LLMs of Resource Description Framework (RDF) [Klyne and Car-
experience hallucinatory phenomena, the use of CoT will roll, 2004] and the corresponding inference engine. We use
amplify such hallucinations indefinitely. In contrast, our the knowledge-based CDSS’s process of performing reason-
approach of using mutual argumentation iterations between ing on the 20 examples in MedQA as a logical criterion for
multiple agents effectively mitigates this problem. We anal- measuring the proof-based traceability of LLMs. We manu-
ysed the data from our experiments and found that each AS’s ally evaluated the consistency of ArgMed-Agents’ reasoning
CQ1 (i.e., asking the Verifier whether there is evidence to sup- paths with logical criteria and the relevance of explanations
port the argument made by the Generator) acts as a halluci- to decision making, scoring each example on a scale of one
nation detector in the reasoning process. On the other hand, to five. CoT performance was also used as a comparison.
not focus on critical reasoning in clinical decision-making,
which allows LLM to generate explanations of why one de-
cision is ”good” but not why others were not chosen. In
addition to this, the work such as [Singhal et al., 2023;
Li et al., 2023] fine-tuned LLMs using extensive datasets col-
lected from the medical and biomedical literature. Unlike our
approach, ours focuses more on exploiting the latent medi-
cal knowledge inherent in LLMs to improve their reasoning
ability in a training-free environment.
6.2 Logical Reasoning with LLMs

An extensive body of research has been dedicated to utiliz-
ing symbolic systems to augment reasoning, which includes
code environments, knowledge graphs, and formal theorem
provers [Pan et al., 2023; Pan et al., 2024; Wu et al., 2023].
Figure 3: A screenshot of the knowledge-based CDSS for diagnos- In the study by Jung et al. [Jung et al., 2022], reasoning is
ing depression, where prismatic nodes represent enquiry, circular constructed as a satisfiability problem of its logical relations
nodes represent decision, and square nodes represent action. through inverse causal reasoning, with consistent reasoning
enhanced using SAT solvers to improve the reasoning abili-
ties of Large Language Models (LLMs). Zhang et al.’s work
In Table 2, it is shown that ArgMed-Agents outperforms [Zhang et al., 2023] employs cumulative reasoning to decom-
CoT in both measures of explainability. Specifically, in pose tasks into smaller components, thereby simplifying the
the experiment evaluating predictability, 50 participants an- problem-solving process and enhancing efficiency. Xiu et al.
swered the question a total of 200 times, of which partici- [Xiu et al., 2022] delve into the non-monotonic reasoning
pants succeeded 91 out of 100 times in predicting ArgMed- capabilities of LLMs. However, despite the initial promise
Agents decisions, compared to 63/100 times in predicting the exhibited by LLMs, their performance falls short in terms
CoT. On the other hand, most of the reasoning nodes of the of generalization and proof-based traceability, with a signif-
knowledge-based CDSS can be found to correspond in the icant decline in performance observed with increasing depth
reasoning process of ArgMed-Agents, whereas the reasoning of reasoning. In line with our study, some recent works have
process of the CoT method to arrive at an answer is often log- started to explore the potential for argumentative reasoning
ically incomplete. Interestingly, we found that the reasoning in LLMs [Chen et al., 2023], aiming to enhance LLMs ar-
of the knowledge-based CDSS on many examples can be re- gumentative reasoning abilities [de Wynter and Yuan, 2023;
garded as a reasoning subgraph of ArgMed-Agents, probably Castagna et al., 2024].
because ArgMed-Agents traverses all the reasoning paths in
a randomly generated manner, and thus the reasoning graph
includes not only the reasoning paths of the correct decisions, 7 Conclusion
but also the explanations of the incorrect decisions.
In this work, We propose a novel medical multi-agent interac-
6 Related Work tion framework called ArgMed-Agents, which inspires agents
of different roles to iterate through argumentation and critical
6.1 LLM-based Clinical Decision Support System questioning, systematically generating an abstract argumen-
Extensive research highlights the potential for LLMs to be tation framework. Then, using a Reasoner agent to identify a
used in medicine [Bao et al., 2023; Nori et al., 2023; set of reasonable and coherent arguments in this framework as
Jin et al., 2023]. However, LLMs still struggle to make safe decision support. The experimental results indicate that, com-
and trustworthy decisions when they encounter clinical deci- pared to different baselines, using ArgMed-Agent for clini-
sions that require complex medical expertise and good rea- cal decision-making reasoning achieves greater accuracy and
soning skills [Singhal et al., 2022]. Therefore, much work provides inherent explanations for its inferences.
has begun to focus on ways to enhance clinical reasoning in Despite the success of ArgMed-Agents, there are still some
LLMs. [Tang et al., 2024] proposes a Multi-disciplinary Col- limitations. In particular, abstract arguments alone may strug-
laboration (MC) framework that utilises LLM-based agents gle with clinical reasoning tasks that numerical calculations
in a role-playing environment that engages in collaborative or probabilistic reasoning. Therefore, future research will fo-
multi-round discussions until consensus is reached. De- cus on enhancing the capabilities of ArgMed-Agents to ad-
spite the results achieved, the method is unable to formalise dress these challenges. One promising avenue involves ex-
the iterative results in such a way as to enhance the infer- ploring the integration of value-based argumentation or prob-
ence performance of the LLM using inference tools. [Sav- abilistic argumentation techniques. Our goal is to provide
age et al., 2024] proposes a method that uses diagnostic healthcare professionals with powerful tools to enhance their
reasoning prompts to improve clinical reasoning abilities decision-making process and ultimately improve patient out-
and interpretability in LLM. However, their approach does comes.
Ethical Statement of the Twenty-Ninth AAAI Conference on Artificial Intelli-
There are no ethical issues. gence, AAAI’15, page 1496–1492. AAAI Press, 2015.
[Gandhi et al., 2023] Kanishk Gandhi, Dorsa Sadigh, and
References Noah D. Goodman. Strategic reasoning with language
models, 2023.
[Ayers et al., 2023] John W. Ayers, Adam Poliak, Mark
Dredze, Eric C. Leas, Zechariah Zhu, Jessica B. Kel- [Hong et al., 2023] Shengxin Hong, Liang Xiao, and Jianxia
ley, Dennis J. Faix, Aaron M. Goodman, Christopher A. Chen. An interaction model for merging multi-agent
Longhurst, Michael Hogarth, and Davey M. Smith. Com- argumentation in shared clinical decision making. In
paring Physician and Artificial Intelligence Chatbot Re- 2023 IEEE International Conference on Bioinformatics
sponses to Patient Questions Posted to a Public Social Me- and Biomedicine (BIBM), pages 4304–4311, 2023.
dia Forum. JAMA Internal Medicine, 183(6):589–596, 06 [ISO, 2020] ISO, IEC: AWI TS 29119-11: Software and sys-
2023. tems engineering - software testing - part 11: Testing of AI
[Bao et al., 2023] Zhijie Bao, Wei Chen, Shengze Xiao, systems. Technical report, 2020.
Kuang Ren, Jiaao Wu, Cheng Zhong, Jiajie Peng, Xu- [Jin et al., 2019] Qiao Jin, Bhuwan Dhingra, Zhengping Liu,
anjing Huang, and Zhongyu Wei. Disc-medllm: Bridg- William W. Cohen, and Xinghua Lu. Pubmedqa: A dataset
ing general large language models and real-world medical for biomedical research question answering, 2019.
consultation, 2023. [Jin et al., 2020] Di Jin, Eileen Pan, Nassim Oufattole, Wei-
[Castagna et al., 2024] Federico Castagna, Nadin Kokciyan, Hung Weng, Hanyi Fang, and Peter Szolovits. What dis-
Isabel Sassoon, Simon Parsons, and Elizabeth Sklar. Com- ease does this patient have? a large-scale open domain
putational argumentation-based chatbots: a survey, 2024. question answering dataset from medical exams, 2020.
[Chen et al., 2023] Guizhen Chen, Liying Cheng, Luu Anh [Jin et al., 2023] Qiao Jin, Yifan Yang, Qingyu Chen, and
Tuan, and Lidong Bing. Exploring the potential of large Zhiyong Lu. Genegpt: Augmenting large language mod-
language models in computational argumentation, 2023. els with domain tools for improved access to biomedical
[Creswell et al., 2022] Antonia Creswell, Murray Shanahan, information, 2023.
and Irina Higgins. Selection-inference: Exploiting large [Jung et al., 2022] Jaehun Jung, Lianhui Qin, Sean Welleck,
language models for interpretable logical reasoning, 2022. Faeze Brahman, Chandra Bhagavatula, Ronan Le Bras,
[Čyras et al., 2018] Kristijonas Čyras, Brendan Delaney, and Yejin Choi. Maieutic prompting: Logically consistent
reasoning with recursive explanations, 2022.
Denys Prociuk, Francesca Toni, Martin Chapman, Jesús
Domı́nguez, and Vasa Curcin. Argumentation for ex- [Klyne and Carroll, 2004] Graham Klyne and Jeremy J. Car-
plainable reasoning with conflicting medical recommen- roll. Resource description framework (rdf): Concepts and
dations. CEUR Workshop Proceedings, 2237, January abstract syntax. W3C Recommendation, 2004.
2018. 2018 Joint Reasoning with Ambiguous and Con- [Li et al., 2023] Chunyuan Li, Cliff Wong, Sheng Zhang,
flicting Evidence and Recommendations in Medicine and Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan
the 3rd International Workshop on Ontology Modularity, Naumann, Hoifung Poon, and Jianfeng Gao. Llava-
Contextuality, and Evolution, MedRACER + WOMoCoE med: Training a large language-and-vision assistant for
2018 ; Conference date: 29-10-2018. biomedicine in one day, 2023.
[de Wynter and Yuan, 2023] Adrian de Wynter and Tommy [Nayak et al., 2023] Anupama Nayak, Michael S Alkaitis,
Yuan. I wish to have an argument: Argumentative reason- Karthik Nayak, Martin Nikolov, Kevin P Weinfurt, and
ing in large language models, 2023. Kevin Schulman. Comparison of history of present ill-
[Dietz et al., 2021] Emmanuelle Dietz, Antonis Kakas, and ness summaries generated by a chatbot and senior internal
Loizos Michael. Computational argumentation and cogni- medicine residents. JAMA Intern Med, 183(9):1026–1027,
tion, 2021. Sep 2023.
[Dietz et al., 2022] Emmanuelle Dietz, Antonis Kakas, and [Nori et al., 2023] Harsha Nori, Yin Tat Lee, Sheng Zhang,
Loizos Michael. Argumentation: A calculus for human- Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas
centric ai. Frontiers in Artificial Intelligence, 5, 2022. King, Jonathan Larson, Yuanzhi Li, Weishung Liu, Ren-
qian Luo, Scott Mayer McKinney, Robert Osazuwa Ness,
[Dung, 1995] Phan Minh Dung. On the acceptability of ar-
Hoifung Poon, Tao Qin, Naoto Usuyama, Chris White, and
guments and its fundamental role in nonmonotonic rea- Eric Horvitz. Can generalist foundation models outcom-
soning, logic programming and n-person games. Artificial pete special-purpose tuning? case study in medicine, 2023.
Intelligence, 77(2):321–357, 1995.
[Oliveira et al., 2018] Tiago Oliveira, Jérémie Dauphin, Ken
[Eigner and Händler, 2024] Eva Eigner and Thorsten
Satoh, Shusaku Tsumoto, and Paulo Novais. Argumen-
Händler. Determinants of llm-assisted decision-making, tation with goals for clinical decision support in multi-
2024. morbidity. In Proceedings of the 17th International Con-
[Fan and Toni, 2015] Xiuyi Fan and Francesca Toni. On ference on Autonomous Agents and MultiAgent Systems,
computing explanations in argumentation. In Proceedings 2018. 2031–2033.
[OpenAI, 2023] OpenAI. Gpt-4 technical report, 2023. [Tang et al., 2024] Xiangru Tang, Anni Zou, Zhuosheng
[Pal et al., 2023] Ankit Pal, Logesh Kumar Umapathi, and Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman
Malaikannan Sankarasubbu. Med-halt: Medical domain Cohan, and Mark Gerstein. Medagents: Large language
hallucination test for large language models, 2023. models as collaborators for zero-shot medical reasoning,
2024.
[Pan et al., 2023] Liangming Pan, Alon Albalak, Xinyi
Wang, and William Yang Wang. Logic-lm: Empowering [Walton et al., 2008] Douglas Walton, Chris Reed, and Fab-
large language models with symbolic solvers for faithful rizio Macagno. Argumentation Schemes. Cambridge Uni-
logical reasoning, 2023. versity Press, New York, 2008.
[Pan et al., 2024] Shirui Pan, Linhao Luo, Yufei Wang, Chen [Walton, 1996] Douglas Walton. Argumentation Schemes for
Chen, Jiapu Wang, and Xindong Wu. Unifying large lan- Presumptive Reasoning. Routledge, 1st edition, 1996.
guage models and knowledge graphs: A roadmap. IEEE [Wang et al., 2023] Lei Wang, Chen Ma, Xueyang Feng,
Transactions on Knowledge and Data Engineering, page Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Ji-
1–20, 2024. akai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei
[Qassas et al., 2015] Malik Al Qassas, Daniela Fogli, Massi- Wei, and Ji-Rong Wen. A survey on large language model
miliano Giacomin, and Giovanni Guida. Analysis of clin- based autonomous agents, 2023.
ical discussions based on argumentation schemes. Pro- [Wei et al., 2023] Jason Wei, Xuezhi Wang, Dale Schuur-
cedia Computer Science, 64:282–289, 2015. Conference mans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc
on ENTERprise Information Systems/International Con- Le, and Denny Zhou. Chain-of-thought prompting elicits
ference on Project MANagement/Conference on Health reasoning in large language models, 2023.
and Social Care Information Systems and Technologies, [Wu et al., 2023] Qingyun Wu, Gagan Bansal, Jieyu Zhang,
CENTERIS/ProjMAN / HCist 2015 October 7-9, 2015. Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun
[Sassoon et al., 2021] Isabel Sassoon, Nadin Kökciyan, San- Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadal-
jay Modgil, and Simon Parsons. Argumentation schemes lah, Ryen W White, Doug Burger, and Chi Wang. Auto-
for clinical decision support. Argument and Computation, gen: Enabling next-gen llm applications via multi-agent
12(3):329–355, November 2021. conversation, 2023.
[Savage et al., 2024] Thomas Savage, Abhishek Nayak, [Xie et al., 2024] Jian Xie, Kai Zhang, Jiangjie Chen,
Robert Gallo, and et al. Diagnostic reasoning prompts re- Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao,
veal the potential for large language model interpretability and Yu Su. Travelplanner: A benchmark for real-world
in medicine. npj Digital Medicine, 7:20, 2024. planning with language agents, 2024.
[Shi et al., 2024] Wenqi Shi, Ran Xu, Yuchen Zhuang, Yue [Xiu et al., 2022] Yeliang Xiu, Zhanhao Xiao, and Yongmei
Yu, Jieyu Zhang, Hang Wu, Yuanda Zhu, Joyce Ho, Carl Liu. LogicNMR: Probing the non-monotonic reasoning
Yang, and May D. Wang. Ehragent: Code empowers large ability of pre-trained language models. In Yoav Goldberg,
language models for few-shot complex tabular reasoning Zornitsa Kozareva, and Yue Zhang, editors, Findings of the
on electronic health records, 2024. Association for Computational Linguistics: EMNLP 2022,
[Singhal et al., 2022] Karan Singhal, Shekoofeh Azizi, Tao pages 3616–3626, Abu Dhabi, United Arab Emirates, De-
Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, cember 2022. Association for Computational Linguistics.
Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, [Zeng et al., 2020] Zhiwei Zeng, Zhiqi Shen, Benny
Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Toh Hsiang Tan, Jing Jih Chin, Cyril Leung, Yu Wang,
Gamble, Chris Kelly, Nathaneal Scharli, Aakanksha Ying Chi, and Chunyan Miao. Explainable and
Chowdhery, Philip Mansfield, Blaise Aguera y Arcas, Argumentation-based Decision Making with Qualita-
Dale Webster, Greg S. Corrado, Yossi Matias, Kather- tive Preferences for Diagnostics and Prognostics of
ine Chou, Juraj Gottweis, Nenad Tomasev, Yun Liu, Alzheimer’s Disease. In Proceedings of the 17th In-
Alvin Rajkomar, Joelle Barral, Christopher Semturs, Alan ternational Conference on Principles of Knowledge
Karthikesalingam, and Vivek Natarajan. Large language Representation and Reasoning, pages 816–826, 9 2020.
models encode clinical knowledge, 2022. [Zhang et al., 2023] Yifan Zhang, Jingqin Yang, Yang Yuan,
[Singhal et al., 2023] Karan Singhal, Tao Tu, Juraj Gottweis, and Andrew Chi-Chih Yao. Cumulative reasoning with
Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, large language models, 2023.
Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, Mike
Schaekermann, Amy Wang, Mohamed Amin, Sami Lach-
gar, Philip Mansfield, Sushant Prakash, Bradley Green,
Ewa Dominowska, Blaise Aguera y Arcas, Nenad Toma-
sev, Yun Liu, Renee Wong, Christopher Semturs, S. Sara
Mahdavi, Joelle Barral, Dale Webster, Greg S. Corrado,
Yossi Matias, Shekoofeh Azizi, Alan Karthikesalingam,
and Vivek Natarajan. Towards expert-level medical ques-
tion answering with large language models, 2023.
A Formal Description of ArgMed-Agents
Algorithm 1 ArgMed-Agents Interaction Reasoning

Require: Question Q, Argumentation Schemes AS∈{ASD , ASSE, ASBD}
Ensure: Decision D
1: function argGeneration(Q, AS)
2: arg ← Generator(Q, AS.prompt) ▷ Arguments presented by Generator
3: AA.add(arg) ▷ Add arguments to AA framework
4: for CQ in AS.CQs do
5: if Verifier(arg, CQ) is rejected then ▷ Verifier proposes the arg rejected by CQ
6: if AS.derive(CQ) = True then ▷ When CQ exist derived role
7: AS ← AS.derive(CQ) ▷ Update AS to derived role
8: arg2 ← argGeneration(Q, AS) ▷ recursive
9: AA.addAttack(arg2 , arg) ▷ arg2 attack arg
10: else
11: AA.addAttack(arg, arg) ▷ Self-Attack (Defeasible)
12: end if
13: end if
14: end for
15: return arg
16: end function
17:
18: function main
19: AA ← ∅ ▷ initialize the AA framework
20: argGeneration(Q, ASD)
21: D ← Reasoner(AA)
22: end function
B ASCD Reasoning Mechanisms
Argumentation Scheme Critical Questions Reject Rule Derive Role Attack Rule
ASD.CQ1 NO (no evidence to support the argument) none self-attack
ASD.CQ2 YES (there are side effects of decision-making) ASSE ASSE attack ASD
Argumentation Scheme
for Decision(ASD) ASD.CQ3 NO (incomplete decisions to fulfil goals) none self-attack
ASD.CQ4 YES (existence of alternative decisions) ASD ASBD attack the worse decision
in ASD 1 and ASD 2
ASSE.CQ1 NO (no evidence to support the argument) none self-attack
Argumentation Scheme
ASSE.CQ2 NO (the side-effect is acceptable) none self-attack
for Side Effect(ASSE)
ASSE.CQ3 YES (there are ways to ameliorate side effects) ASD none
Argumentation Scheme ASBD.CQ1 NO (no evidence to support the argument) none self-attack
for Better Decision(ASBD)
Table 3: List of specialized schemes, CQs, when CQs will be rejected and their derived roles and attack rules between derived roles and initial
schemes. Notably, every ASD is attacking each other by default, so it is not shown in the attack rules.
C More Case Details in Experiment

Question: An 18-year-old woman presents with recurrent headaches. The
Generator D [Derive from ASD(B).CQ4]:
pain is usually unilateral, pulsatile in character, exacerbated by
ASD(F, G, D)
light and noise, and usually lasts for a few hours to a full day. The
Premise - Given the patient Fact F [An 18-year-old woman presents with recurrent headaches, described as
pain is sometimes triggered by eating chocolates. These headaches unilateral, pulsatile, exacerbated by light and noise, sometimes triggered by eating chocolates, and
disturb her daily routine activities. The physical examination was accompanied by essential tremors.]
within normal limits. She also has essential tremors. Which drug is
suitable in her case for the prevention of headaches? Premise - To achieve Goal G [G: Prevent the recurrence of the patient's headaches and improve quality of
life.]
Options: a: "Ergotamine", b: "Verapamil", c: "Valproic, d: Premise - Valproic acid promotes the goal G [Valproic acid is an anticonvulsant and mood-stabilizing drug
"Propranolol" that has been shown to be effective in migraine prophylaxis and could also have a benefit on essential
tremors.]
Generator A: Conclusion: Valproic acid should be considered.

ASD(F, G, D)
Premise - Given the patient Fact F [An 18-year-old woman with recurrent headaches Verifier:
that are unilateral, pulsatile, exacerbated by light and noise, lasting for hours ASD.CQ1- Are there any evidence to support Decision D promote goal G?
to a day, sometimes triggered by chocolates, and disturb daily activities. YES
Diagnosis suggestive of migraines. She also has essential tremors.] ASD.CQ2- Has Decision D caused side effects on the patient?
YES
Premise - To achieve Goal G [G: Prevent the patient's headaches and manage ASD.CQ3- Can Goal G be achieved by completing the Decision D?
essential tremors.] YES
ASD.CQ4- Are there alternative Decision to achieve the same goal G?
Premise - Propranolol promotes the goal G [Propranolol is a beta-blocker that is NO
effective in preventing migraines and is also used to treat essential tremors,
which the patient has.] Generator E [Derive from ASD(B).CQ4]:
ASBD(F, G, D1, D2)
Conclusion: Propranolol should be considered. Premise - Given the decisions Valproic Acid (D1) and Propranolol (D2)
Verifier: Premise -
Effectiveness in Migraine Prophylaxis:
ASD.CQ1- Are there any evidence to support Decision D promote goal G?
Both Valproic Acid and Propranolol are effective in the prevention of migraines, which align with the patient's
YES
goal of preventing headaches.
ASD.CQ2- Has Decision D caused side effects on the patient?
NO
Side Effect Profile:
ASD.CQ3- Can Goal G be achieved by completing the Decision D?
Valproic Acid is associated with a variety of potential side effects including weight gain, hair loss, and
YES
teratogenicity, which may not be ideal for a young woman. Propranolol, while also having potential side
effects, is less likely to cause these particular issues and may provide a dual benefit for this patient's
YES
existing essential tremor condition.
Generator B [Derive from ASD(A).CQ4]: The medical profile of the Patient:

The patient is an 18-year-old woman with essential tremors, which is also an indication for Propranolol. As
ASD(F, G, D) Propranolol is a non-selective beta-blocker, it is not only effective for migraine prophylaxis but also for
Premise - Given the patient Fact F [An 18-year-old woman presents with recurrent treating essential tremors.
headaches characterized as unilateral, pulsatile, exacerbated by light and noise,
lasting a few hours to a full day, sometimes triggered by eating chocolates, and
Conclusion: Propranolol should be considered
disrupting her daily activities; she also has essential tremors.]
Premise - To achieve Goal G [G: Prevent the patient's recurrent headaches] Verifier:
Premise - Verapamil promotes the goal G [Verapamil is a calcium channel blocker YES
that can be effective in the prophylaxis of headaches, specifically in migraines ASD.CQ2- Has Decision D caused side effects on the patient?
and cluster headaches; it can also be beneficial for patients with coexisting YES
conditions where beta-blockers like propranolol are contraindicated or ASD.CQ3- Can Goal G be achieved by completing the Decision D?
ineffective.] YES
Conclusion: Verapamil should be considered. NO
Verifier: Generator F [Derive from ASD(D).CQ2]:

ASD.CQ1- Are there any evidence to support Decision D promote goal G? ASSE(SE, D, F)
YES Premise - The patient is an 18-year-old woman with recurrent, unilateral, pulsatile headaches that are
ASD.CQ2- Has Decision D caused side effects on the patient? exacerbated by light and noise, lasting for several hours to a day, and triggered by eating chocolates. She has a
NO history of essential tremors and her physical examination was within normal limits.
YES Premise - The decision to prescribe Valproic acid is made for the prevention of headaches. Valproic acid is a
ASD.CQ4- Are there alternative Decision to achieve the same goal G? medication that can be used to prevent migraines, which her symptoms suggest, as well as for treating seizures
YES and bipolar disorder.
Premise - Valproic acid can have side effects such as weight gain, hair loss, tremors, and liver toxicity. The
liver toxicity is particularly concerning and requires regular monitoring of liver function tests. In women of
Generator C [Derive from ASD(A).CQ4]: childbearing age, valproic acid also carries a risk of teratogenicity and thus must be used with effective
ASBD(F, G, D1,D2) contraception to prevent pregnancy.
Premise - Given the decisions Verapamil and Propranolol
Conclusion: Given the potential risk of significant side effects, including liver toxicity and teratogenicity, as
Premise - well as the patient's existing essential tremors which could potentially be exacerbated by valproic acid, the
Effectiveness in Preventing Headaches: decision to use Valproic acid as a preventive treatment for her headaches needs to be made with caution. Careful
Both Verapamil and Propranolol can be used to prevent headaches, but their consideration should be given to the patient's health status, potential for pregnancy, and the need for regular
mechanisms of action and efficacy in specific types of headaches may differ. monitoring if Valproic acid is chosen as a therapy.
Additional Medical Condition (Essential Tremors): Verifier:

Propranolol is not only used for the prophylaxis of headaches but also for
essential tremors, which the patient has. While Verapamil is a calcium channel ASD.CQ1- Is there any evidence to support Decision D results in side effect SE?
blocker used in various other conditions, it's not particularly indicated for YES
essential tremors. ASD.CQ1- Is this side effect SE not acceptable?
YES
Patient-specific factors and side effect profile: ASD.CQ1- Are there ways of ameliorating the side effects observed?
Considering the patient's age, gender, and the type of headaches, which are likely NO
migraines based on the description, Propranolol is often used as a first-line
preventive treatment for migraines. Propranolol might also help control her
essential tremor symptoms. Verapamil can be used for migraine prevention but does
not address her tremors.
AA framework: Decision:
Conclusion: Propranolol should be considered F A C A

Verifier:
Propranolol should
ASD.CQ1- Is there any evidence to support Decision D2 better than Decision D1 for
E B
Goal G?
YES D be considered
Figure 4: Example of correct reasoning by ArgMed-Agents. We show the flow of the multi-agent dialogue and the resulting formal AA
framework.
Question: A 14-year-old girl is brought to the physician by her Generator C [Derive from ASD(A).CQ4]:
mother for the evaluation of recurrent episodes of nose bleeding ASD(F, G, D)
for several months. The episodes occur unexpectedly and stop after Given the patient Fact F [A 14-year-old girl with recurrent nose bleeding
a few minutes by elevating the upper body and bending the head episodes and heavy menstrual flow, normal vital signs, a slightly reduced
hemoglobin level, a normal platelet count, prolonged partial thromboplastin
forward. Menses occur at regular 27-day intervals with heavy flow. time, and prolonged bleeding time.]
Her last menstrual period was 3 weeks ago. Vital signs are within
normal limits. Physical examination shows no abnormalities. Premise - To achieve Goal G [G: Stop the patient's bleeding and prevent future
Laboratory studies show:\nHemoglobin 11 g/dL\nHematocrit 34%\ episodes.]
nLeukocyte count 7,000/mm3\nPlatelet count 180,000/mm3\
Premise - Desmopressin promotes the goal G [Desmopressin is a medication that
nProthrombin time 13 sec\nPartial thromboplastin time 45 sec\ acts on the vasopressin receptors and can increase factor VIII levels, which
nFibrin split products negative\nThe bleeding time is 10 minutes. can help improve coagulation in bleeding disorders such as von Willebrand
Which of the following is the most appropriate next step in disease, which is suspected in this girl due to the symptoms and prolonged
treatment? bleeding time.]
Conclusion: Desmopressin should be considered

Options: a: "Oral prednisone", b: "Intravenous immunoglobulins",
c: "Tranexamic acid" , d: "Desmopressin"
Verifier:
YES
Generator A: ASD.CQ2- Has Decision D caused side effects on the patient?
NO
ASD(F, G, D)
Premise - Given the patient Fact F [A 14-year-old girl with a history of
YES
recurrent nosebleeds and heavy menstruation, presenting with laboratory findings
that suggest a potential bleeding disorder given the prolonged bleeding time and
NO
slightly prolonged partial thromboplastin time.]
Premise - To achieve Goal G [G: Stop the patient's current bleeding episode
effectively and reduce the likelihood of future bleeding episodes.]
Generator D [Derive from ASD(A).CQ4]:
Premise - Tranexamic acid promotes Goal G [Tranexamic acid is an ASBD(F, G, D1, D2)
antifibrinolytic agent known to reduce bleeding by inhibiting the activation of Premise - Given the decisions Tranexamic acid and Desmopressin
plasminogen to plasmin, thereby aiding in the stabilization of blood clots and
is indicated in cases of heavy menstrual bleeding and situations where excessive Premise -
bleeding is a concern.] Effectiveness in Treating Bleeding Disorders:
Desmopressin is effective in treating patients with bleeding disorders because
Conclusion: Tranexamic acid should be considered. it increases the levels of factor VIII and von Willebrand factor, which helps
in improving coagulation. Tranexamic acid helps by inhibiting the breakdown of
fibrin clots, but does not address underlying deficiencies in clotting
Verifier: factors.
YES Directness of Treatment:
ASD.CQ2- Has Decision D caused side effects on the patient? Desmopressin directly addresses the potential underlying problem in this
YES patient—impaired function of platelets or factor VIII deficiency—by increasing
ASD.CQ3- Can Goal G be achieved by completing the Decision D? factor VIII levels. Tranexamic acid would not correct the underlying
YES coagulation deficiency; it only helps to stabilize clots that have already
ASD.CQ4- Are there alternative Decision to achieve the same goal G? formed.
YES
Potential for Recurrence Prevention:
Desmopressin can prevent or reduce the severity of future bleeding episodes by
Generator B [Derive from ASD(A).CQ2]: increasing factor VIII levels, particularly before known triggers (like
surgical procedures). Tranexamic acid is more of a temporary measure during an
ASSE(SE, D, F) acute bleeding episode and does not have a prophylactic effect.
Premise - The patient is a 14-year-old girl with a history of recurrent
nosebleeds and heavy menstruation, presenting with laboratory findings Given the Fact the patient has a prolonged partial thromboplastin time,
suggestive of a potential bleeding disorder, including prolonged bleeding time indicating a possible deficiency in intrinsic pathway coagulation factors, and
and slightly prolonged partial thromboplastin time. Desmopressin can increase factor VIII levels, which could be deficient in this
patient:
Premise - The decision to administer Tranexamic acid, an antifibrinolytic
agent known to reduce bleeding, is considered as a potential course of
Conclusion: Desmopressin should be considered
treatment to achieve the goal of stopping the patient's current bleeding
episode effectively and reducing the likelihood of future bleeding episodes.
Verifier:
Premise - Tranexamic acid, like many medications, can have side effects. ASD.CQ1- Is there any evidence to support Decision D2 better than Decision D1
Common side effects of Tranexamic acid include nausea, diarrhea, headache, for Goal G?
muscle pain, or weakness. Additionally, more serious side effects such as YES
blood clots, visual disturbances, or allergic reactions can occur, although
they are less common.
Conclusion: Considering the patient's symptoms and the potential side effects
AA framework: Decision:
of Tranexamic acid, the decision to administer Tranexamic acid needs to be
D
C
made with caution.
Verifier:
ASD.CQ1- Is there any evidence to support D ecision D results in
B A
side effect SE? Desmopressin should be
YES
Is this side effect SE not acceptable? considered
C
ASD.CQ1-
NO
ASD.CQ1- Are there ways of ameliorating the side effects observed?
NO
Figure 5: (continued) Example of correct reasoning by ArgMed-Agents. We provide runnable demo files with available GPT-4 API Keys in
the code that include these cases.

Arg Med LLM

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Arg Med LLM

Uploaded by

Copyright:

Available Formats

ArgMed-Agents: Explainable Clinical Decision Reasoning with Large Language

Models via Argumentation Schemes

Shengxin Hong1 Liang Xiao2 , Xin Zhang1 and Jianxia Chen2

Abstract faced with highly complex clinical reasoning tasks [Pal et

Solver made with caution. YES

CQ2:Has Decision D (Trazodone) caused side effects on the Verifier

Conclusion: Trazodone should be considered

6.2 Logical Reasoning with LLMs

Algorithm 1 ArgMed-Agents Interaction Reasoning

B ASCD Reasoning Mechanisms

C More Case Details in Experiment

Generator A: Conclusion: Valproic acid should be considered.

Generator B [Derive from ASD(A).CQ4]: The medical profile of the Patient:

Verifier: Generator F [Derive from ASD(D).CQ2]:

Additional Medical Condition (Essential Tremors): Verifier:

Conclusion: Propranolol should be considered F A C A

Conclusion: Desmopressin should be considered

You might also like