Professional Documents
Culture Documents
D Generator
She has also lost weight due to decreased appetite and has started using
alcohol as a coping mechanism.
CQ1: Is there any evidence to support Decision D
B
(Paroxetine) results in side effect SE?
Premise - The decision to prescribe Paroxetine, a selective serotonin
YES
reuptake inhibitor (SSRI) commonly used to treat depression and anxiety
disorders, is considered as a potential course of treatment. Verifier
CQ2:Is this side effect SE not acceptable?
NO
Premise - Paroxetine, like many medications, can have side effects.
Common side effects of Paroxetine include nausea, drowsiness, dizziness,
CQ3: Are there ways of ameliorating the side effects
dry mouth, sweating, blurred vision, constipation, or headache.
observed?
Additionally, more serious side effects such as serotonin syndrome,
NO
suicidal thoughts or behavior, and withdrawal symptoms can occur,
although they are less common.
Symbolic Conclusion: Considering the patient's symptoms and the potential side
effects of Paroxetine, the decision to prescribe Paroxetine needs to be
CQ1: Is there any evidence to support Decision D
(Trazodone) promote goal G?
Figure 2: An example from the MedQA USMLE dataset, with the entire process of ArgMed-Agents reasoning about the clinical problem.
Notably, the letters in the argumentation framework correspond to the serial numbers of the four generators on the right, representing the
premises and conclusion generated by that generator. In the argumentation framework, the red nodes (A and C) represent arguments in
support of the decision and the yellow nodes (B and D) represent arguments in support of the beliefs.
attacking arguments are converted into a formal framework ized process consistent with clinical reasoning (i.e., following
that solves for a subset of reasonably coherent arguments via the ASCD prompt strategy). This design not only facilitates
Reasoner. LLMs to engage in clinical reasoning in a critical-thinking
manner, which in turn maintains a cognitive processes consis-
The theoretical motivation for our method stems from tent with clinicians to improve the explainability of decisions,
non-monotonic logic, logical intuition and cognitive clinical. but also allows for the effective highlighting of some implicit
Studies have shown that LLM performs reasonably well for knowledge in LLMs that cannot be readily accessed through
simple reasoning, single-step reasoning problems, however, traditional prompts.
as the number of reasoning steps rises, the rate of correct
reasoning decreases in a catastrophic manner [Creswell et
al., 2022]. Thus, we consider that LLM is primed with ra- 4.2 Explainable Argumentative Reasoning
tional intuition, but lacks true logical reasoning ability. In ArgMed-Agent obtains a complete argument graph by iter-
light of this, ASCD Prompt Stratygy guides LLM in a re- ating through the ASCD Prompt Strategy. In this section,
cursive form to generate a series of casual reasoning steps as we describe how Reasoner reasons and explains decisions
a tentative conclusion, with any further evidence withdraw- through these arguments. First, we map the reasoning results
ing their conclusion. The process ultimately leads to a di- of ASCD to the AA framework.
rected graph representing the disputed relationships. In addi-
tion this, agents with different LLMs roles within ArgMed- Definition 1. An AA framework for ASCD is a pair ⟨A, R⟩
Agents interact collaboratively with each other in a formal- such that:
• ∃arg = ⟨P, c, V ⟩ ∈ A that arguments set A is consti- • ∃a ∈ Argsd (A) is acceptable such that Argsd (E) ̸= ∅
tuted by the argumentation schemes.
• ∃a ∈ Argsd (E) such that ∀b ∈ Argsd (A), b ̸= a that
• A = Argsd (A) ∪ Argsb (A) such that arguments in ar- b∈
/ Argsd (E).
gument set A are distinguished between two types of ar-
guments: arguments in support of decisions Argsd (A) Proposition 1 states that using preferred semantics on the
and arguments in support of beliefs Argsb (A). AA framework for ASCD always selects extensions that con-
tain the only acceptable decision, which is consistent with our
In Definition 1, arguments in support of decisions expected reasoning. Meanwhile, arguments from Argsb (E)
Argsd (A) correspond to ASD in the argumentation scheme, can be regarded as explanation of decision Argsd (E) [Fan
and arguments in support of beliefs Argsb (A) correspond to and Toni, 2015]. The proof of the proposition proceeds as
ASSE and ASBD in the argumentation scheme. These two follows: We first prove that there will be only one decision in
types of arguments play different roles; arguments in sup- the preferred extension. According to Definition 2, we know
port of decisions build on beliefs and goals and try to jus- that decisions are exclusive, so different decisions do not ap-
tify choices, while arguments in support of beliefs always try pear in the same admissible set. Secondly, we prove that the
to undermine decision arguments. Therefore, based on the preferred extension will always contain a argument in sup-
modeling of ASCD, we make the following setup for the at- port of decision. According to the modeling of ASCD as well
tack relation between these two types of arguments: first, ar- as Definition 2, it can be seen that arguments in support of
guments in support of different decisions are in conflict with decision (ASD) are built on top of arguments in support of
each other, which is consistent with our previous description beliefs (ASSE and ASBD), i.e., the inclusion of ASD argu-
of decisions being exclusive in ASCD. Second, arguments ments does not lead to the ASSE and ASBD arguments be-
in support of decisions are not allowed to attack arguments coming unacceptable. Therefore, the set is maximized when
in support of beliefs, which is dictated by the modeling of the acceptable set includes pro-decision arguments.
ASCD, as can be seen in Figure 1, where ASSE and ASBD
are always trying to disprove ASD, while the converse does Definition 4. Given an AA framework for ASCD ⟨A, R⟩ and
not work. Formalized as: E is its prefered extension. There are errors in the clinical
reasoning of ArgMed-Agents in this case iff Argsd (E) = ∅.
Definition 2. Given an AA framework for ASCD ⟨A, R⟩,
R ⊆ Argsd (A) × Argsb (A) such that: Clinical reasoning errors are discussed in [Tang et al.,
2024]. They consider that most clinical reasoning errors in
• ∀arg1 , arg2 ∈ Argsd (A) such that arg1 ̸= arg2 , LLMs are due to confusion about domain knowledge. On
(arg1 , arg2 ) ∈ R. the other hand, research by [Singhal et al., 2022] points out
• ∄(arg1 , arg2 ) ∈ R such that arg1 ∈ Argsd (A) and that LLMs may produce compelling misinformation about
arg2 ∈ Argsb (A). medical treatment, so it is crucial to recognize this illusion,
which is difficult for humans to detect. For this purpose, we
Next, we use the preferred semantics in the AA framework define the relevant mechanisms for identifying clinical rea-
to illustrate how the Reasoner Agent selects the set of possible soning errors in Reasoner as follows: When any decision in
admissible arguments to support clinical decision. the AA framework is not accepted, we consider there are er-
Definition 3. Given an AA framework for ASCD ⟨A, R⟩, E ⊆ rors in the reasoning by ArgMed-Agents. This mechanism
E is a set of preferred extension such that Argsb (E) is the assists ArgMed-Agents in identifying the capability bound-
support decision for E. aries of LLMs when their knowledge reserves are insufficient
to address certain issues.This helps ArgMed-Agents avoid
Example 1. An AA framework for ASCD is given below such the risks associated with adopting erroneous decisions and
that Argsd (A) = {A, B, C} and Argsb (A) = {D, E}. In achieving more robust and safe clinical reasoning.
this case, decisions B and C are optional because they are
supported by preferred extension.
5 Experiments
D E
In this section, we demonstrate the potential of ArgMed-
Admissible Sets: {B, D, E}, Agents in clinical decision reasoning by evaluating the ac-
{C, D, E},{B, D},{B, E}, curacy and explainability.
A {C, D},{C, E},{D, E},{B}, We implemented Generator and Verifier in ArgMed-
{C},{D},{E},{} Agents using the APIs GPT-3.5-turbo and GPT-4 provided
by OpenAI [OpenAI, 2023], and according to our setup, the
Prefered Extension:{B, D, E}, LLM Agents are implemented with the same LLM and dif-
B C {C, D, E} ferent few-shot prompts. Each agent is configured with spe-
cific parameters: the temperature is set to 0.0, as well as a
dialogue limit of 4, indicating the maximum number of de-
In this regard, we present a proposition to justify the Rea- cisions allowed for ArgMed-Agents to generate, which is to
soner agent’s use of preferred semantics for decision making. prevent the agents from getting into loops with each other. On
Proposition 1. Given an AA framework for ASCD ⟨A, R⟩ the other hand, we using python3 to implemente a symbolic
and E ⊆ E is a set in preferred extension E where, solver of abstract argumentation framework as Reasoner.
Model Method MedQA PubMedQA Method Pre. Tra. % Arg.
Direct 52.7 68.4 CoT 0.63 2.8 59.5%
GPT-3.5-turbo CoT 48.0 71.5 ArgMed-Agents 0.91 4.2 87.5%
ArgMed-Agents 62.1 78.3
Table 2: Human evaluation result on 20 examples.
Direct 67.8 72.9
GPT-4 CoT 71.4 77.2
we analysed the examples in which both Direct Generation,
ArgMed-Agents 83.3 81.6 CoT and ArgMed-Agents reasoned wrongly. We found that
ArgMed-Agents can report 76% of errors in these examples
Table 1: Results of the accuracy of various clinical decision reason- according to Reasoner’s definition 4. This provides initial
ing methods on MedQA and PubMedQA datasets. evidence that ArgMed-Agents not only has higher accuracy
compared to baseline, but also that it is safer for clinical de-
5.1 Accuracy cision reasoning.
The following two datasets were used to assess the accuracy 5.2 Human Evaluation for Explainability
of ArgMed-Agents clinical reasoning:
The main focus of Explainable AI (XAI) is usually the rea-
• MedQA [Jin et al., 2020]:Answering multiple-choice soning behind decisions or predictions made by AI that be-
questions derived from the United States Medical Li- come more understandable and transparent. In the context
cense Exams (USMLE). This dataset is sourced from of- of artificial intelligence, Explainability is defined as follows
ficial medical board exams and encompasses questions [ISO, 2020]:
in English, simplified Chinese, and traditional Chinese.
The total question counts for each language are 12,723, ...level of understanding how the AI-based system came
34,251, and 14,123, respectively. up with a given result.
• PubMedQA [Jin et al., 2019]: A biomedical ques- Based on this criterion, we define two measures of LLMs’
tion and answer (QA) dataset collected from PubMed explanability:
abstracts. The task of PubMedQA is to answer • predictability(Pre.): Can the explanations given by
yes/no/maybe research questions using the correspond- LLMs help users predict LLM decisions?
ing abstracts (e.g., Do preoperative statins reduce atrial
fibrillation after coronary artery bypass graft surgery?). • proof-based traceability(Tra.): Can the model reason
about the correct decision based on the correct path of
We randomly selected 300 examples in each dataset for our inference (i.e. does the path of inference as an explana-
experiments. We set the baseline in our experiments to com- tion have relevance to the LLM’s decision)?
pare with gpt direct generation and Chain of Thought(CoT)
[Wei et al., 2023]. It is worth noting that we focussed on eval- For the evaluation of predictability, our experimental setup
uating the performance of ArgMed-Agents on clinical deci- is as follows: we recorded the inputs and the reasoning pro-
sion reasoning, however, there were some biomedical general cess (e.g., the complete dialogue between Generator and Ver-
knowledge quiz-type questions in MedQA and PubMedQA, ifier in ArgMed-Agents and the corresponding argumentation
and so we intentionally excluded this part of the questioning framework). We invited 50 undergraduate and graduate stu-
in selecting the example. dents in medical-related disciplines to answer questions given
Table 1 shows the accuracy results of MedQA and Pub- only the inputs to the LLMs as well as the explanations (four
MedQA. We compared ArgMed-Agents to several baselines questions were distributed to each individual), and counted
in direct generation and CoT settings. Notably, both GPT-3.5- to what extent they were able to predict the decisions of the
turbo and GPT4 models showed that our proposed ArgMed- LLMs.
Agents improved accuracy on a clinical decision reasoning Our team has produced a fully functional knowledge-based
task compared to a baseline such as CoT. This result demon- clinical decision support system at an early stage that can
strates the effectiveness of ArgMed-Agents to enhance the be used to assist in treatment decisions for complex dis-
clinical reasoning performance of LLMs. Interestingly, after eases such as cancer, neurological disorders, and infectious
we repeated the experiment several times, we found that the diseases. Figure 3 illustrates this knowledge-based CDSS,
introduction of CoT sometimes leads to a surprising drop in consisting of computer-interpretable guidelines in the form
performance. The reason for this may be that when LLMs of Resource Description Framework (RDF) [Klyne and Car-
experience hallucinatory phenomena, the use of CoT will roll, 2004] and the corresponding inference engine. We use
amplify such hallucinations indefinitely. In contrast, our the knowledge-based CDSS’s process of performing reason-
approach of using mutual argumentation iterations between ing on the 20 examples in MedQA as a logical criterion for
multiple agents effectively mitigates this problem. We anal- measuring the proof-based traceability of LLMs. We manu-
ysed the data from our experiments and found that each AS’s ally evaluated the consistency of ArgMed-Agents’ reasoning
CQ1 (i.e., asking the Verifier whether there is evidence to sup- paths with logical criteria and the relevance of explanations
port the argument made by the Generator) acts as a halluci- to decision making, scoring each example on a scale of one
nation detector in the reasoning process. On the other hand, to five. CoT performance was also used as a comparison.
not focus on critical reasoning in clinical decision-making,
which allows LLM to generate explanations of why one de-
cision is ”good” but not why others were not chosen. In
addition to this, the work such as [Singhal et al., 2023;
Li et al., 2023] fine-tuned LLMs using extensive datasets col-
lected from the medical and biomedical literature. Unlike our
approach, ours focuses more on exploiting the latent medi-
cal knowledge inherent in LLMs to improve their reasoning
ability in a training-free environment.
Argumentation Scheme Critical Questions Reject Rule Derive Role Attack Rule
ASD.CQ1 NO (no evidence to support the argument) none self-attack
ASD.CQ2 YES (there are side effects of decision-making) ASSE ASSE attack ASD
Argumentation Scheme
for Decision(ASD) ASD.CQ3 NO (incomplete decisions to fulfil goals) none self-attack
ASD.CQ4 YES (existence of alternative decisions) ASD ASBD attack the worse decision
in ASD 1 and ASD 2
ASSE.CQ1 NO (no evidence to support the argument) none self-attack
Argumentation Scheme
ASSE.CQ2 NO (the side-effect is acceptable) none self-attack
for Side Effect(ASSE)
ASSE.CQ3 YES (there are ways to ameliorate side effects) ASD none
Argumentation Scheme ASBD.CQ1 NO (no evidence to support the argument) none self-attack
for Better Decision(ASBD)
Table 3: List of specialized schemes, CQs, when CQs will be rejected and their derived roles and attack rules between derived roles and initial
schemes. Notably, every ASD is attacking each other by default, so it is not shown in the attack rules.
Options: a: "Ergotamine", b: "Verapamil", c: "Valproic, d: Premise - Valproic acid promotes the goal G [Valproic acid is an anticonvulsant and mood-stabilizing drug
"Propranolol" that has been shown to be effective in migraine prophylaxis and could also have a benefit on essential
tremors.]
Verifier: Premise -
Effectiveness in Migraine Prophylaxis:
ASD.CQ1- Are there any evidence to support Decision D promote goal G?
Both Valproic Acid and Propranolol are effective in the prevention of migraines, which align with the patient's
YES
goal of preventing headaches.
ASD.CQ2- Has Decision D caused side effects on the patient?
NO
Side Effect Profile:
ASD.CQ3- Can Goal G be achieved by completing the Decision D?
Valproic Acid is associated with a variety of potential side effects including weight gain, hair loss, and
YES
teratogenicity, which may not be ideal for a young woman. Propranolol, while also having potential side
ASD.CQ4- Are there alternative Decision to achieve the same goal G?
effects, is less likely to cause these particular issues and may provide a dual benefit for this patient's
YES
existing essential tremor condition.
Premise - To achieve Goal G [G: Prevent the patient's recurrent headaches] Verifier:
ASD.CQ1- Are there any evidence to support Decision D promote goal G?
Premise - Verapamil promotes the goal G [Verapamil is a calcium channel blocker YES
that can be effective in the prophylaxis of headaches, specifically in migraines ASD.CQ2- Has Decision D caused side effects on the patient?
and cluster headaches; it can also be beneficial for patients with coexisting YES
conditions where beta-blockers like propranolol are contraindicated or ASD.CQ3- Can Goal G be achieved by completing the Decision D?
ineffective.] YES
ASD.CQ4- Are there alternative Decision to achieve the same goal G?
Conclusion: Verapamil should be considered. NO
Premise - Valproic acid can have side effects such as weight gain, hair loss, tremors, and liver toxicity. The
liver toxicity is particularly concerning and requires regular monitoring of liver function tests. In women of
Generator C [Derive from ASD(A).CQ4]: childbearing age, valproic acid also carries a risk of teratogenicity and thus must be used with effective
ASBD(F, G, D1,D2) contraception to prevent pregnancy.
Premise - Given the decisions Verapamil and Propranolol
Conclusion: Given the potential risk of significant side effects, including liver toxicity and teratogenicity, as
Premise - well as the patient's existing essential tremors which could potentially be exacerbated by valproic acid, the
Effectiveness in Preventing Headaches: decision to use Valproic acid as a preventive treatment for her headaches needs to be made with caution. Careful
Both Verapamil and Propranolol can be used to prevent headaches, but their consideration should be given to the patient's health status, potential for pregnancy, and the need for regular
mechanisms of action and efficacy in specific types of headaches may differ. monitoring if Valproic acid is chosen as a therapy.
Figure 4: Example of correct reasoning by ArgMed-Agents. We show the flow of the multi-agent dialogue and the resulting formal AA
framework.
Question: A 14-year-old girl is brought to the physician by her Generator C [Derive from ASD(A).CQ4]:
mother for the evaluation of recurrent episodes of nose bleeding ASD(F, G, D)
for several months. The episodes occur unexpectedly and stop after Given the patient Fact F [A 14-year-old girl with recurrent nose bleeding
a few minutes by elevating the upper body and bending the head episodes and heavy menstrual flow, normal vital signs, a slightly reduced
hemoglobin level, a normal platelet count, prolonged partial thromboplastin
forward. Menses occur at regular 27-day intervals with heavy flow. time, and prolonged bleeding time.]
Her last menstrual period was 3 weeks ago. Vital signs are within
normal limits. Physical examination shows no abnormalities. Premise - To achieve Goal G [G: Stop the patient's bleeding and prevent future
Laboratory studies show:\nHemoglobin 11 g/dL\nHematocrit 34%\ episodes.]
nLeukocyte count 7,000/mm3\nPlatelet count 180,000/mm3\
Premise - Desmopressin promotes the goal G [Desmopressin is a medication that
nProthrombin time 13 sec\nPartial thromboplastin time 45 sec\ acts on the vasopressin receptors and can increase factor VIII levels, which
nFibrin split products negative\nThe bleeding time is 10 minutes. can help improve coagulation in bleeding disorders such as von Willebrand
Which of the following is the most appropriate next step in disease, which is suspected in this girl due to the symptoms and prolonged
treatment? bleeding time.]
Premise - To achieve Goal G [G: Stop the patient's current bleeding episode
effectively and reduce the likelihood of future bleeding episodes.]
Generator D [Derive from ASD(A).CQ4]:
Premise - Tranexamic acid promotes Goal G [Tranexamic acid is an ASBD(F, G, D1, D2)
antifibrinolytic agent known to reduce bleeding by inhibiting the activation of Premise - Given the decisions Tranexamic acid and Desmopressin
plasminogen to plasmin, thereby aiding in the stabilization of blood clots and
is indicated in cases of heavy menstrual bleeding and situations where excessive Premise -
bleeding is a concern.] Effectiveness in Treating Bleeding Disorders:
Desmopressin is effective in treating patients with bleeding disorders because
Conclusion: Tranexamic acid should be considered. it increases the levels of factor VIII and von Willebrand factor, which helps
in improving coagulation. Tranexamic acid helps by inhibiting the breakdown of
fibrin clots, but does not address underlying deficiencies in clotting
Verifier: factors.
ASD.CQ1- Are there any evidence to support Decision D promote goal G?
YES Directness of Treatment:
ASD.CQ2- Has Decision D caused side effects on the patient? Desmopressin directly addresses the potential underlying problem in this
YES patient—impaired function of platelets or factor VIII deficiency—by increasing
ASD.CQ3- Can Goal G be achieved by completing the Decision D? factor VIII levels. Tranexamic acid would not correct the underlying
YES coagulation deficiency; it only helps to stabilize clots that have already
ASD.CQ4- Are there alternative Decision to achieve the same goal G? formed.
YES
Potential for Recurrence Prevention:
Desmopressin can prevent or reduce the severity of future bleeding episodes by
Generator B [Derive from ASD(A).CQ2]: increasing factor VIII levels, particularly before known triggers (like
surgical procedures). Tranexamic acid is more of a temporary measure during an
ASSE(SE, D, F) acute bleeding episode and does not have a prophylactic effect.
Premise - The patient is a 14-year-old girl with a history of recurrent
nosebleeds and heavy menstruation, presenting with laboratory findings Given the Fact the patient has a prolonged partial thromboplastin time,
suggestive of a potential bleeding disorder, including prolonged bleeding time indicating a possible deficiency in intrinsic pathway coagulation factors, and
and slightly prolonged partial thromboplastin time. Desmopressin can increase factor VIII levels, which could be deficient in this
patient:
Premise - The decision to administer Tranexamic acid, an antifibrinolytic
agent known to reduce bleeding, is considered as a potential course of
Conclusion: Desmopressin should be considered
treatment to achieve the goal of stopping the patient's current bleeding
episode effectively and reducing the likelihood of future bleeding episodes.
Verifier:
Premise - Tranexamic acid, like many medications, can have side effects. ASD.CQ1- Is there any evidence to support Decision D2 better than Decision D1
Common side effects of Tranexamic acid include nausea, diarrhea, headache, for Goal G?
muscle pain, or weakness. Additionally, more serious side effects such as YES
blood clots, visual disturbances, or allergic reactions can occur, although
they are less common.
Conclusion: Considering the patient's symptoms and the potential side effects
AA framework: Decision:
of Tranexamic acid, the decision to administer Tranexamic acid needs to be
D
C
made with caution.
Verifier:
ASD.CQ1- Is there any evidence to support D ecision D results in
B A
side effect SE? Desmopressin should be
YES
Is this side effect SE not acceptable? considered
C
ASD.CQ1-
NO
ASD.CQ1- Are there ways of ameliorating the side effects observed?
NO
Figure 5: (continued) Example of correct reasoning by ArgMed-Agents. We provide runnable demo files with available GPT-4 API Keys in
the code that include these cases.