REFINER: Reasoning Feedback on Intermediate Representations

Debjit Paul, Mete Ismayilzada, Maxime Peyrard, Beatriz Borges,


Antoine Bosselut, Robert West, Boi Faltings
EPFL
{firstname.lastname}@epfl.ch

arXiv:2304.01904v1 [cs.CL] 4 Apr 2023

Abstract

Language models (LMs) have recently shown remarkable performance on reasoning tasks by explicitly generating intermediate inferences, e.g., chain-of-thought prompting. However, these intermediate inference steps may be inappropriate deductions from the initial context and lead to incorrect final predictions. Here we introduce REFINER, a framework for fine-tuning LMs to explicitly generate intermediate reasoning steps while interacting with a critic model that provides automated feedback on the reasoning. Specifically, the critic provides structured feedback that the reasoning LM uses to iteratively improve its intermediate arguments. Empirical evaluations of REFINER on three diverse reasoning tasks show significant improvements over baseline LMs of comparable scale. Furthermore, when using GPT-3.5 as the reasoner, the trained critic significantly improves reasoning without fine-tuning the reasoner. Finally, our critic model is trained without expensive human-in-the-loop data but can be substituted with humans at inference time.

[Figure 1 (math word problem example, three generator–critic iterations):
Context: Frank had number0 pieces of candy. He lost number1 of them. If he put the remaining pieces into bags with number2 pieces in each bag,
Question: How many bags would he have?
Generator: #0: divide ( number0 , number2 ) | #1: multiply ( #0 , number1 ) | EOS
Critic: The operator in #0 is incorrect, the second number in #0 is incorrect, the operator in #1 is incorrect, and the second number in #1 is incorrect.
Generator: #0: subtract ( number0 , number1 ) | #1: divide ( #0 , number0 ) | EOS
Critic: The second number in #1 is incorrect.
Generator: #0: subtract ( number0 , number1 ) | #1: divide ( #0 , number2 ) | EOS
Critic: No hint.]

Figure 1: REFINER example. The critic model provides the generator model with feedback on its reasoning errors after evaluating the generated intermediate steps. The feedback, alongside the original question and previous intermediate equation, is fed back to the generator model.

1 Introduction

Large language models (LLMs) have made significant strides in natural language processing (NLP) tasks (Brown et al., 2020). Recent work has shown that explicitly generating intermediate steps during reasoning tasks significantly improves a model's performance and interpretability (Shwartz et al., 2020; Paul and Frank, 2021; Marasovic et al., 2022; Lampinen et al., 2022; Wei et al., 2022). Producing such intermediate representations provides insight into the model's predictions and allows humans to inspect the model's reasoning process. However, these intermediate representations¹ can be unreliable (Ye and Durrett, 2022) and result in poor performance on downstream reasoning tasks. Most importantly, it is unclear how to meaningfully refine the intermediate representations to further improve the final performance.

¹ In a reasoning task, the intermediate representations can be viewed as inference rules, explanations or reasoning steps.

The standard practice for correcting reasoning errors is to annotate new data and either retrain or finetune the model (Feng et al., 2021; Hedderich et al., 2021). However, fixing such errors by finetuning with more data is not only data- and resource-intensive but can also be insufficient to generalize well in complex reasoning tasks (Ward et al., 2022). Other works have explored improving models using feedback by providing a scalar score (Ziegler et al., 2019; Martin et al., 2022) or directly revealing the correct missing answer (Mehta and Goldwasser, 2019; Elgohary et al., 2021; Tandon et al., 2022). However, in natural language reasoning tasks, defining a scalar value that captures different fine-grained reasoning error types (e.g., semantic consistency, logical validity, etc.) remains an open challenge (Golovneva et al., 2023).

In this work, we instead provide fine-grained and structured feedback on reasoning errors. We present REFINER, a novel interaction-based framework that allows a generator LM to iteratively use fine-grained feedback and refine its reasoning. The interaction happens between two models: a generator, which learns to solve the task by first generating the intermediate reasoning steps, and a critic, which provides structured feedback to the generator about errors in the intermediate steps.

To provide fine-grained feedback about reasoning errors, we develop a scheme to independently train the critic model on automatically constructed feedback data. More specifically, we create pairs of incorrect intermediate representations and structured² feedback on fine-grained reasoning errors. Then, we use this data to train the critic to provide fine-grained feedback on erroneous intermediate reasoning steps. Finally, the critic interacts with the generator LM, offering feedback both during the training of the generator and during inference.

² Note that we transform the structured feedback into semi-structured textual feedback using templates.

Figure 1 illustrates an example of our REFINER framework where, given a math word problem, the generator generates an equation as an intermediate representation. The critic identifies the errors in the equation and provides semi-structured textual feedback (e.g., "the operator in #0 is incorrect") to the generator. By interacting with the critic, REFINER enables the generator to reason over the semi-structured feedback and refine its generation.

Contributions. (i) We propose REFINER, a framework that refines LMs' reasoning capabilities through feedback. To the best of our knowledge, our work is the first to investigate how interacting with fine-grained reasoning feedback on intermediate reasoning steps impacts the performance of LMs on reasoning tasks.

(ii) We evaluate REFINER on three natural language reasoning tasks: math word problems (Koncel-Kedziorski et al., 2016; Patel et al., 2021), synthetic natural language reasoning (Liang et al., 2022), and moral action generation (Emelin et al., 2021). REFINER demonstrates significant performance gains across different LM architectures with different scales. Across different reasoning tasks, REFINER outperforms comparably-sized strong fine-tuned LM baselines (by +13.1, +3.2, and +15 pts., respectively).

(iii) We empirically demonstrate that for math word problems and synthetic natural language reasoning, our trained critic models alone are beneficial for improving intermediate representations, as they help GPT-3.5 significantly increase its performance in a few-shot setting (by +3 and +9.2 pts., respectively). We also demonstrate that providing structured feedback on fine-grained errors can be more beneficial than scalar value feedback for the moral action generation and math word problem tasks. Our critic model acts as a 'reasoning refinement tool' for LLMs.

(iv) Our analyses show that (a) improving the intermediate representation generation improves the performance on the reasoning tasks, and (b) training a generator with an imperfect (noisy) critic is still beneficial.

Our code is made publicly available.³

³ https://github.com/debjitpaul/refiner

2 Related Work

Intermediate Representations. While state-of-the-art LMs achieve incredible performances in a wide range of tasks, they have difficulty with many reasoning tasks (Wang et al., 2022), especially ones with multiple constraints or sub-problems, keywords similar to those of more common tasks, or a need for specialized knowledge (Austin et al., 2021) – such as mathematical problem solving (Ling et al., 2017; Andor et al., 2019; Ran et al., 2019; Geva et al., 2020; Piękos et al., 2021; Cobbe et al., 2021; Kim et al., 2022).

For these tasks, both intermediate representations and rationales have been shown to be beneficial in learning mathematical skills (Piękos et al., 2021), intermediate program execution computations (Nye et al., 2021), or general reasoning outputs (Wei et al., 2022; Golovneva et al., 2022).

Our work builds upon the observation that generating intermediate steps is valuable but distinguishes itself in several key aspects. Firstly, instead of prompting a large model, we finetune smaller models to learn to generate intermediate steps. Secondly, our framework can accommodate tasks that do not necessarily have a unique closed-form correct answer, such as the Moral Norm task (see §3). Finally, our framework is trained with a critic providing feedback, improving the model's reasoning process and teaching it how to leverage feedback.

Natural Language Feedback. Recent work has explored giving models richer and more complex feedback through the use of natural language (Ziegler et al., 2019; Nguyen et al., 2021; Scheurer et al., 2022), used for aligning LLMs' output with users' preferences (Christiano et al., 2017; Ziegler et al., 2019; Saunders et al., 2022; Scheurer et al., 2022; Bai et al., 2022), or to directly improve the model's performance in its current task (Weston, 2016; Rupprecht et al., 2018; Elgohary et al., 2020; Austin et al., 2021; Madaan et al., 2023).

This training is usually dependent on human-created feedback, generated in large quantities (Bai et al., 2022), which takes up considerable resources. Though an external feedback provider can guide models to correct answers and reasoning (Austin et al., 2021), demonstrably better than they can themselves (Saunders et al., 2022), feedback has rarely been used in this way – and automated critics for reasoning tasks have proved to be difficult (Scheurer et al., 2022; Wang et al., 2022; Huang et al., 2022).

Recently, Welleck et al. (2022) introduced a secondary model, the corrector, which improves the initial proposition of a generation model by learning the kind of mistakes made by the generator and how to fix them. In this work, we also use a secondary model, a critic, but apply it quite differently as we integrate it into an interaction loop with the generator model during training. We further differ from previous works as we provide feedback at the intermediate reasoning steps of the model and not at the final output. The feedback is thus closer to the source of mistakes and guides the model's reasoning toward the correct answer. Additionally, intermediate steps are often structured, allowing the critic to provide precise feedback.

3 REFINER

Problem Formulation. In this paper, we view natural language reasoning (NLR) as an autoregressive generation task where, given an input context x, a model needs to generate y, such that y satisfies the constraints of the task. Usually, to generate a correct or plausible y, the model needs to make the correct inference z as intermediate steps.⁴ We decompose NLR tasks as follows:

p(y | x) = p(y | x, z) p(z | x)    (1)

In practice, one can compute each conditional using an LM that includes its conditioning variables as a part of its input.

⁴ We use "inference steps/representations" and "hypothesis" interchangeably.

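
As an illustration of this decomposition, the sketch below generates the intermediate representation z and then the answer y with a sequence-to-sequence LM (UnifiedQA-T5, which §4 later uses as a base model); the prompt prefixes and generation settings are our own simplification, not the authors' exact setup.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("allenai/unifiedqa-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("allenai/unifiedqa-t5-base")

def generate(text: str) -> str:
    """Greedy-decode a single output for the given input text."""
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

x = ("Frank had number0 pieces of candy. He lost number1 of them. "
     "If he put the remaining pieces into bags with number2 pieces in each bag, "
     "how many bags would he have?")

# p(z | x): generate the intermediate representation (equation) from the context.
z = generate("generate equation: " + x)

# p(y | x, z): generate the final answer conditioned on the context and the equation.
y = generate("answer: " + x + " equation: " + z)
```
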
Before continuing with the model description, we illustrate the three NLR tasks where we conduct our study and their respective intermediate representations z. These three tasks broadly cover two types: (i) logical reasoning and (ii) normative reasoning. They are exemplified in Figure 2 and detailed below.

Synthetic natural language reasoning (sNLR), where given a reasoning scenario x consisting of 5 synthetic rules and a fact, the model needs to deduce a conclusion y. This task requires the model to perform deductive reasoning and generate the conclusion z using closed-world rules and facts.

Math word problem (MWP), where given a word problem x consisting of a context and a question, the goal is to map x to a valid mathematical expression z (the intermediate representation) and then to a solution y. This task requires the model to perform deduction using mathematical reasoning.

Moral norm and action generation for moral stories (MS), where given a context x consisting of a situation, an intention, and an immoral action, the model needs to generate the moral norm z and the moral action y. Moral actions are encouraged by the moral norm. This task requires the model to perform abductive reasoning to generate moral norms and deductive reasoning for the moral action.

We propose to solve these tasks by forcing the model to generate intermediate hypotheses (z) and improving them via structured feedback. We introduce an interactive framework named REFINER, made of two separate models: (a) a CRITIC model (§3.1) trained to provide structured feedback on intermediate reasoning steps and (b) a GENERATOR model trained to solve the reasoning task by first generating intermediate reasoning steps (§3.2). The core idea of REFINER is to exploit the interaction between the generator model and the critic model, where the generator's intermediate reasoning steps are improved via structured feedback from the critic. An overview of the framework is depicted in Fig. 3.

REFINER presents several important properties. First, the generator is trained to incorporate and leverage feedback, which helps it converge towards better reasoning during training and makes it capable of integrating feedback at test time, whether from a trained critic or a human (see §5). Second, the trained critic can be useful on its own; we demonstrate that a generalist LLM like GPT-3.5 can significantly benefit from interacting with our trained critic on the reasoning tasks we consider (see §5). Finally, having two separate models allows us to easily measure the benefits of feedback during training and/or during inference (see §6).

[Figure 2: one example per task (Math Word Problem, Synthetic Natural Language Reasoning, Moral Norm Generation), each showing a plausible hypothesis, perturbed implausible hypotheses, their fine-grained error types, and the corresponding feedback.]

Figure 2: An overview of the three tasks tackled in this paper, with examples of both valid and invalid intermediate reasoning steps, as well as their corresponding fine-grained error types. Notice that the Missing Steps error type, in the second task, actually encompasses two error types: reasoning misalignment, derived from not considering the or operation, and lack of implicit knowledge, where implicit knowledge is needed to match the existing rules.

Tasks   Error Types                        Feedback
MWP     Incorrect Numbers                  The <position> number in <equation-number> is incorrect.
        Incorrect Operators                The operator in <equation-number> is incorrect.
        Missing Operators                  An operator is missing.
sNLR    Logically Invalid                  The <X> operator makes inference rule <number> invalid.
        Missing Link                       Missing link between the fact and the rules.
        Missing Implicit Knowledge Step    The implicit knowledge is missing.
MS      Contradiction                      Contradiction
        Semantic Misalignment              Semantically misaligned: "<text snippet>"

Table 1: An overview of the error types and feedback for each reasoning task.

3.1 CRITIC Model

The role of the critic is to provide feedback on the intermediate hypotheses produced by the generator model. One way to evaluate the quality of the hypothesis and produce feedback on the hypothesis z would be to compare it against a gold hypothesis z*. Previous works employed automatic metrics like BLEU, ROUGE, etc., as value functions (Wu et al., 2018; Ramamurthy et al., 2022). However, these scalar value functions are not suitable for natural language reasoning tasks because: (i) it is unclear how to define a scalar value function that can encapsulate fine-grained reasoning errors (Golovneva et al., 2023) and (ii) during inference, these functions require access to the gold hypothesis (which is unavailable in practice). Therefore, we train a critic model and endow it with the ability to evaluate the hypothesis in a fine-grained manner and provide structured feedback.

Feedback Data Generation. To train the critic, we have to create example pairs of implausible hypotheses and their corresponding feedback with fine-grained reasoning errors. Inspired by Golovneva et al. (2023) and Talmor et al. (2020), we first define fine-grained reasoning error types for each reasoning task (see Table 1). For MWP, an equation can be incorrect due to: (i) the operands or operators in the equations being incorrect and/or (ii) one or more operators missing. For sNLR, an inference rule can be incorrect because it is (i) logically invalid and/or (ii) missing reasoning rules (failing to connect the correct facts with the correct rules or missing implicit knowledge). For MS, a moral norm can be incorrect due to (i) contradiction and/or (ii) semantic misalignment. In Figure 2, we provide a few examples of fine-grained error types for each reasoning task.

Based on these error types, we perturb the plausible hypotheses (z) in the training data and collect a pool of data D (x: input, z: plausible hypothesis, z′: implausible hypothesis). We perturb by omitting, replacing or adding some tokens or some rules from the plausible hypothesis to automatically create an implausible hypothesis. For example, in Fig. 2 for sNLR (second example) we omit a few inference steps from the correct hypothesis "#0: viridian is green, #1: rose is green" and create an incorrect (incomplete) hypothesis (see Fig. 2). Since our perturbations are based on logic and reasoning errors, we create structured feedback f for every example (x, z, z′) by stating the error type that occurs in z′ but not in z (see Table 1). The basic structure of feedback f for these tasks is ⟨error type, position (optional), hint (optional)⟩, where position denotes the error position in the implausible hypothesis (see Tab. 1). For example, in the previous scenario, we create the feedback "Missing link between fact and rules". Despite the simplicity of the strategy we used for our tasks, this approach is easily generalisable to other reasoning tasks.

For MWP and sNLR problems, the underlying reasoning requires symbolic systems with closed-world rules. Hence, we consider a simple rule-based method to automatically generate the pairs of errors and their corresponding structured feedback by considering the error types and the position of the errors (see Fig. 2).

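
To make this rule-based procedure concrete, the sketch below perturbs an MWP equation and emits the matching structured feedback from Table 1; the parsing helpers and the specific perturbation choices (e.g., substituting an unused variable name) are our own illustrative assumptions rather than the authors' implementation.

```python
import random

OPERATORS = ["add", "subtract", "multiply", "divide"]

def parse(equation: str):
    """Split '#0: subtract ( number0 , number1 ) | #1: divide ( #0 , number2 ) | EOS'
    into a list of (operator, [arguments]) steps, ignoring the EOS marker."""
    steps = []
    for chunk in equation.split("|"):
        chunk = chunk.strip()
        if ":" not in chunk:          # skip 'EOS' or empty chunks
            continue
        body = chunk.split(":", 1)[1]
        op, rest = body.split("(", 1)
        args = [a.strip() for a in rest.rstrip(" )").split(",")]
        steps.append((op.strip(), args))
    return steps

def render(steps) -> str:
    return " | ".join(f"#{i}: {op} ( {' , '.join(args)} )"
                      for i, (op, args) in enumerate(steps)) + " | EOS"

def perturb(steps):
    """Corrupt one step and return (implausible_steps, feedback) as in Table 1."""
    i = random.randrange(len(steps))
    op, args = steps[i]
    if random.random() < 0.5:                      # Incorrect Operators
        new_op = random.choice([o for o in OPERATORS if o != op])
        bad = steps[:i] + [(new_op, args)] + steps[i + 1:]
        feedback = f"The operator in #{i} is incorrect."
    else:                                          # Incorrect Numbers (simplified:
        j = random.randrange(len(args))            # swap in an unused variable name)
        new_args = list(args)
        new_args[j] = "number9"
        bad = steps[:i] + [(op, new_args)] + steps[i + 1:]
        feedback = f"The {'first' if j == 0 else 'second'} number in #{i} is incorrect."
    return bad, feedback

gold = "#0: subtract ( number0 , number1 ) | #1: divide ( #0 , number2 ) | EOS"
bad_steps, fb = perturb(parse(gold))
print(render(bad_steps), "->", fb)
```
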
In the moral norm generation task, we consider two kinds of fine-grained errors: logical contradiction and semantic misalignment (incoherent, uninformative). Moral norms are people's subjective judgments about the character and actions mentioned in the context. Each moral norm is a combination of two components (implicit structure): a moral judgment [You shouldn't] and an action [criticize your family's religion]. Firstly, to create logical contradictions, we use the concept of deontic logic from Kiehne et al. (2022) and derive new norms contrary to those of Moral Stories. Hence, we replace the correct moral judgments in the plausible hypothesis with inverse judgments. For example, we replace [You shouldn't] with [It's good] in the plausible hypothesis, as depicted in Fig. 2. To scale such inverse norms (implausible hypotheses), we paraphrase them by substituting the adjectives with synonyms from WordNet (see Fig. 2). Secondly, to create semantic misalignments, we should collect implausible hypotheses that are either misaligned with the plausible hypothesis or incomplete in nature. To create them, we replace the correct action (verb phrase) from the plausible hypothesis with random verb phrases selected from the context of the plausible hypothesis.

We also replace the correct judgment with random judgments to scale the number of implausible hypotheses per example. Finally, as feedback f, we provide ⟨error type, hint⟩. For non-monotonic reasoning tasks⁵ like norm and action generation, the critic should be able to provide hints that align the generator model's objective to the reasoning task. Hence, as a hint, we provide verb phrases from the norms. Since the critic provides textual feedback to the generator, we convert the structured feedback into natural language feedback.⁶ Formally, we create a data pool D = {x, z, z′, f} to train a critic model.

⁵ Non-monotonic reasoning deals with incomplete facts, whereby reasoners draw tentative conclusions, which may be retracted based on further evidence.

⁶ Further details about the feedback are provided in the Appendix.

Training the critic model. We train a supervised CRITIC model (π_β) with the context (x) and a (plausible or implausible) hypothesis (z or z′) as input and the textual feedback as output. We update the CRITIC using the following cross-entropy loss:

L(β) = − log p_β(f(u) | x, u)    (2)

where u ∈ {z, z′}. The trained critic is only used during inference. The oracle critic is used while training the generator model.

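
A minimal sketch of this supervised critic training (Eq. 2) with a sequence-to-sequence LM follows; the input formatting is an assumption on our part, while the optimizer and learning rate mirror the training details reported in §4.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("allenai/unifiedqa-t5-base")
critic = AutoModelForSeq2SeqLM.from_pretrained("allenai/unifiedqa-t5-base")
optimizer = torch.optim.Adam(critic.parameters(), lr=1e-4)

def critic_step(context: str, hypothesis: str, feedback: str) -> float:
    """One gradient step on a single (x, u, f(u)) example, u in {z, z'}."""
    inputs = tokenizer("context: " + context + " hypothesis: " + hypothesis,
                       return_tensors="pt", truncation=True)
    labels = tokenizer(feedback, return_tensors="pt", truncation=True).input_ids
    # The seq2seq LM head computes -log p_beta(f(u) | x, u) internally (Eq. 2).
    loss = critic(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Example pair as produced by the perturbation procedure above.
x = "Frank had number0 pieces of candy. He lost number1 of them. ..."
z_bad = "#0: divide ( number0 , number2 ) | #1: multiply ( #0 , number1 ) | EOS"
f_bad = "The operator in #0 is incorrect, the second number in #0 is incorrect."
critic_step(x, z_bad, f_bad)
```
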
3.2 GENERATOR Model

This section presents a generator model that iteratively learns to interact with the CRITIC model.

[Figure 3: the generator (π_θ) produces multiple hypotheses from the input x; the critic (π_β) selects one and returns textual feedback f, which is appended to x and the selected hypothesis for the next iteration.]

Figure 3: Overview of the REFINER interaction loop. In each iteration, the generator generates multiple hypotheses. The critic randomly selects one hypothesis and provides feedback based on reasoning errors.

Warm-up. Given a context x, the generator model (π_θ) is trained to generate plausible hypotheses. The warm-up phase is critical to ensure that, when the critic comes into the loop, the generator does not produce random answers likely to be very bad, given the size of the output space. Hence, we use a relatively small supervised dataset (10% of the training data) to fine-tune the model on the NLR task of interest. After the warm-up phase, we use the additional feedback f from the critic model and learn π_θ(z | x, z′, f).

Exploration. At each iteration (t), the generator model generates multiple hypotheses (z_k) using nucleus sampling. The critic model randomly selects one hypothesis and provides feedback on that hypothesis. The exploration step aims at increasing the output variance such that the generator receives a wide range of feedback during training.

Learning. We update the GENERATOR model using the following cross-entropy loss:

L(θ) = − Σ_{t=1}^{T} log p_θ(z_t | x, z′_t, f_t(z′))    (3)

where T is the total number of iterations. Since the feedback contains the error types and hints, which are (latent) fine-grained and logical, it should allow the model to learn and update its generation by addressing the reasoning errors mentioned in the feedback.

Inference. We use the trained critic (see Eq. 2) along with the trained generator to generate a trajectory z_0, z_1, ..., z_T and stop when either f(z_t) is generated by the generator or "No hint" is generated by the critic. We also experimented with chain-of-thought prompting, where the generator generates a trajectory z_0 y_0, z_1 y_1, ..., z_T y_T and stops when the critic generates "No hint".

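
Putting the pieces together, the inference-time interaction described above can be sketched as a short loop; `generator` and `critic` stand for thin wrappers around the trained models, and the prompt formats and stopping details are our simplification of the procedure, not the authors' code.

```python
def refine(x: str, generator, critic, max_iters: int = 3) -> str:
    """Iteratively refine the intermediate representation z for input x.

    `generator` and `critic` are callables mapping an input string to an
    output string (e.g. wrappers around the trained seq2seq models of Sec. 3).
    """
    z = generator("context: " + x)
    for _ in range(max_iters):
        feedback = critic("context: " + x + " hypothesis: " + z)
        if feedback.strip().lower().rstrip(".") == "no hint":
            break  # the critic found no remaining reasoning error
        # Feed the context, previous hypothesis and feedback back to the generator.
        z = generator("context: " + x + " hypothesis: " + z + " feedback: " + feedback)
    return z
```
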
4 Experimental Setup

Datasets. We evaluate REFINER on three diverse tasks (examples in Fig. 2). We briefly describe the datasets used for each task below.⁷

Math Word Problem (MWP): We train our models on the MAWPS dataset (Koncel-Kedziorski et al., 2016) and evaluate them on the challenging SVAMP dataset (Patel et al., 2021). We evaluate our model on both the equation generation (z) and answer prediction (y) tasks. Similar to Ling et al. (2017) and Amini et al. (2019), for equation generation we replace the numeric values with variable names, for example, number0, number1, etc. For Synthetic Natural Language Reasoning (sNLR), we use the dataset from Liang et al. (2022) with the difficulty level set to hard. We evaluate our model on both inference rule generation (z) and consequent generation (y). For Moral Stories (MS), we use the dataset from Emelin et al. (2021), where we evaluate our model on moral norm (z) and moral action (y) generation.

⁷ In Table 6 in the Appendix, we report the data statistics.

Training Details. For each task, we train a UnifiedQa-T5-base model (UQA-base) (Khashabi et al., 2020) as a critic (§3.1). For exploration (§3.2), we use nucleus sampling with p = 0.5. We select the hyper-parameters by the validation loss: for both the generator and critic model, we use the Adam optimizer with a learning rate of 1e−4. Each model is trained for 20 epochs with early stopping based on validation loss. We trained all models on one A100 GPU. We run our models with 3 random seeds and report the average results. For the human study, we selected outputs from the best models (baselines and our model) according to automatic metrics. We train models with T = 3 iterations. At inference time, we use greedy decoding for the generator and critic model, with T = 1 for the automatic critic and T = 3 for the oracle critic. On the MWP and sNLR tasks, we use the exact match (EM) metric for intermediate steps (equation generation and inference rules) and accuracy (Acc) for the final answers. For MS, we conduct a manual evaluation study to assess the relevance of norms and moral actions.⁸ Further evaluation details are provided in Appendix F.

⁸ Automatic scores such as BLEU, ROUGE, etc. only account for word-level similarity between gold norms or actions and generated norms or actions.

Baselines. We compare our method with three different LMs as generator models: UQA-base, UQA-large (supervised setting), GPT-3.5 code-DaVinci-002 (few-shot setting), and GPT-3.5* text-DaVinci-003. We also compare REFINER to the Proximal Policy Optimization (PPO) RL-based method (Schulman et al., 2017). We use the implementation of PPO from Ramamurthy et al. (2022). For GPT-3.5, the model is not finetuned but prompted with standard few-shot prompts (Brown et al., 2020), in which the LM is given in-context demonstrations of input-output pairs. We provide 2 demonstrations per class. We also experimented with chain-of-thought (CoT) prompting (Wei et al., 2022), where the model is prompted first to generate the intermediate steps (z) and then the final answer (y). Note that the sNLR task is a synthetic task where the model needs to perform either one-hop or two-hop reasoning. Clark et al. (2021) showed that fine-tuning large language models (354M parameter size) could achieve high performance (99% accuracy). Hence, we only compare our REFINER model with the UQA-base model (220M) (see Table 3). Since human annotation is expensive, we focus on comparing against the most meaningful baseline, UQA-large, for the MS task (see Table 4). It is important to highlight that our proposed framework is general and one can use any other LMs as GENERATOR or CRITIC.

5 Results

We evaluate our model on two aspects: (i) performance on intermediate steps and (ii) performance on final answer prediction. Tables 2, 3, and 4 show the performance comparisons.

Performance on Intermediate Steps. Table 2 reports the performance on the MWP task. We explored two different scenarios: (i) where the model only generates the equations (z) with variable names replacing the numeric values, and (ii) where the model generates both the equations and the final answers together. We observe for both scenarios that REFINER significantly outperforms baseline models of comparable sizes. Notably, UQA-base benefits the most (+13.1 EM) from adding a critic in the loop. We observe that GPT-3.5 significantly benefits from the REFINER-trained critic. Since LLMs like GPT-3.5 (175B parameters) are expensive to finetune, the improvement in equation generation of +3 and +3.2 EM without any modification is important.

Model                               Eq. (z)   Ans. (y)
UQA-base                            34.1      –
UQA-base + PPO                      31.5      –
REFINER base                        47.2      –
REFINER base + Oracle (T=3)         66.0      –
UQA-large                           46.7      –
UQA-large + PPO                     48.2      –
REFINER large                       53.8      –
REFINER large + Oracle (T=3)        68.1      –
GPT-3.5 + CoT                       59.3      63.5
GPT-3.5 + CoT + REFINER critic      62.3      66.4
GPT-3.5* + CoT                      64.1      67.1
GPT-3.5* + CoT + REFINER critic     67.3      70.6

Table 2: Results on MWP. Eq.: Equation, Ans.: Answer. Comparison of REFINER with baselines on the SVAMP dataset. GPT-3.5: code-DaVinci-002, GPT-3.5*: text-DaVinci-002. For models other than GPT-3.5, the answer can be obtained via symbolic execution of the equation and is thus a function of the validity of the equation. For GPT-3.5, the model is few-shot prompted to either generate the equation with variable names z, or generate the answer y.

Model                         IR (z)        Con (y)
UQA-base                      90.6 ± 0.8    94.1
REFINER base                  93.5 ± 0.4    97.3
REFINER base + Oracle         96.2 ± 0.9    98.9
GPT-3.5 + CoT                 17.4 ± 0.5    45.0
GPT-3.5 + CoT + REFINER       26.8 ± 0.5    46.6
GPT-3.5* + CoT                14.3 ± 0.9    40.6
GPT-3.5* + CoT + REFINER      21.1 ± 1.2    42.1

Table 3: Results on the sNLR task. IR: Inference Rules, Con: Consequent.

Interestingly, we observe that GPT-3.5 + CoT manages to have significantly higher accuracy on the answer y than on the equation z (see Table 2). This result is similar to the observation made by Ye and Durrett (2022) and suggests that the intermediate equations can be unreliable. Also, we observe that if the critic were perfect (Oracle), then REFINER could perform very strongly. Finally, REFINER can even outperform PPO, which uses the BLEU score as a reward function. This suggests that semi-structured fine-grained textual feedback can be more beneficial than value-based (where values are from automatic metrics) reward feedback. Note that this result may vary when these models are optimized directly with complex human values, as shown in Stiennon et al. (2020). Qualitatively, REFINER can correct incorrect equations through structured feedback, fixing the operators within a multistep solution (see Figure 8).

                Norm (z)                   Action (y)
Model           I↓   U↓   R↑   α           I↓   U↓   R↑   α
B               34   17   49   0.35        28   14   58   0.64
B + PPO         38   10   52   0.38        31   17   52   0.38
REFINER         19   12   69   0.33        18    9   73   0.55

Table 4: Results on Moral Norm and Moral Action. We report human evaluation. B: UQA-large; I: Irrelevant; U: Unsure; R: Relevant; α: Krippendorff's alpha.

Model                                       Eq. (z)
REFINER base                                47.2
REFINER base − critic_inference             39.8
REFINER base − critic_inference − exp       37.4
REFINER base − critic_training              34.1

Table 5: Ablation results on the MWP task, comparing the model without the critic during inference and without the exploration (exp) phase during training. The results are measured with exact match of the generated equation and are comparable to Table 2.

For sNLR, similar to Liang et al. (2022), we observe that GPT-3.5 (code & text) performs poorly (see Table 3). We also find that GPT-3.5 code-DaVinci is better than text-DaVinci on the sNLR task. REFINER improves +2.9, +6.8 and +9.4 EM scores over UQA-base, GPT-3.5* and GPT-3.5, respectively. Contrary to MWP, the final answer y is not a symbolic execution away from the intermediate step z, but we still observe that REFINER focuses on improving the intermediate step z, resulting in significant improvements in the answer y prediction. Again, we observe that REFINER with a UQA-base can outperform few-shot prompted GPT-3.5. Thus, our critic can identify the fine-grained reasoning errors and help improve the performance on inference rule generation.

For MS, we assess the generation quality with three human judges who indicate whether the generated norms and moral actions are relevant to the given moral story. In Table 4, we summarize human evaluation results on 100 moral story examples randomly sampled from the MS test dataset. More specifically, we report the evaluation breakdown for both norm and moral action by the number of instances that are either Irrelevant, Unsure or Relevant, along with Krippendorff's α (Krippendorff, 2018) agreement scores. The results show an improvement of 20 points, increasing the relevance over a strong UQA-large baseline. Hence, this suggests that feedback from a critic model, which has 3 times fewer parameters than the generator, can improve the performance on generating reasoning steps. Further evaluation details are provided in Appendix F.

Performance on Final Answer Prediction. We observe that REFINER outperforms the strong LM baselines by +2.9, +3.2, and +15 points for MWP, sNLR, and MS, respectively. These results support our hypothesis that improving the performance of generating better intermediate steps can result in better answer prediction. Notably, on the sNLR task, for GPT-3.5, we observe that by adding a critic, there is an improvement of 9.4 in inference step generation, but only +1.6 in consequent prediction. This result indicates that for LLMs, these intermediate inferences can be inappropriate for the actual computation leading to the model's answer. Hence, training to improve the quality of the intermediate step can result in better performance on the final prediction.

Ablation. To obtain better insight into the contributions of the individual components of our models, we perform an ablation study (Table 5). We observe that there is a considerable drop in performance from 47.2 to 39.8 when we do not use the critic model during inference. Hence, this result indicates that our generator model can leverage the feedback from the critic at inference time. Further, we find that the exploration step improves the performance by +3.3 over the baseline model. This result supports our hypothesis that the exploration step increases the output variance and gives the generator model the opportunity to learn over a wide range of feedback.

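
For concreteness, the exploration step isolated by this ablation (§3.2) can be sketched as below: several candidate hypotheses are drawn with nucleus sampling (p = 0.5, as in the training details) and one is picked at random for the critic to judge. The helper names and prompt prefix are our own assumptions.

```python
import random
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("allenai/unifiedqa-t5-base")
generator = AutoModelForSeq2SeqLM.from_pretrained("allenai/unifiedqa-t5-base")

def explore(x: str, num_hypotheses: int = 4) -> str:
    """Sample several intermediate hypotheses for x and return one at random
    (the critic then provides feedback on the selected hypothesis)."""
    inputs = tokenizer("context: " + x, return_tensors="pt")
    with torch.no_grad():
        outputs = generator.generate(
            **inputs,
            do_sample=True, top_p=0.5,            # nucleus sampling, p = 0.5
            num_return_sequences=num_hypotheses,
            max_new_tokens=64,
        )
    hypotheses = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return random.choice(hypotheses)
```
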
6 Analysis

6.1 Quantitative Analysis

Error Analysis. In order to get more insight into the performance of our method, we conduct a fine-grained error analysis on the MWP and MS datasets (Fig. 5). We note that the most frequent errors are Incorrect Numbers for MWP and Semantic Misalignment for MS. An intuitive reason can be that for the MWP task the models are sensitive to the order of the numbers, as argued in Patel et al. (2021). For MS, generating norms grounded in the context is challenging. Our analyses show a clear trend that REFINER is able to considerably reduce the errors for both datasets. This indicates that our trained critic model could identify fine-grained reasoning errors during inference.

[Figure 4: three worked MWP examples of the GPT-3.5 generator revising its equation after feedback from the trained critic (panels a–c).]

Figure 4: Examples of REFINER on the MWP task. Three different scenarios are highlighted in the figure, where (a) the CRITIC model provides correct feedback and the GENERATOR model utilizes the feedback and fixes the incorrect equation, (b) the CRITIC model provides correct feedback but the GENERATOR model fails to fix the incorrect equation, and (c) the CRITIC model provides incomplete feedback and the GENERATOR model only partially fixes the incorrect equation.

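
The GPT-3.5 + REFINER critic rows in Tables 2–3, and the examples in Figure 4, correspond to appending the trained critic's feedback to the few-shot prompt of a frozen LLM and asking it to regenerate the equation. A rough sketch of one such critique-and-regenerate round is shown below; `call_llm` is a hypothetical placeholder for whichever completion API is used, and the prompt layout is our own guess rather than the paper's prompt.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around a few-shot-prompted LLM (e.g. GPT-3.5)."""
    raise NotImplementedError

def refine_with_frozen_llm(few_shot_prefix: str, problem: str, critic) -> str:
    """One round of critique-then-regenerate for a frozen, prompted LLM."""
    equation = call_llm(few_shot_prefix + "\nProblem: " + problem + "\nEquation:")
    feedback = critic("context: " + problem + " hypothesis: " + equation)
    if feedback.strip().lower().rstrip(".") == "no hint":
        return equation
    retry_prompt = (few_shot_prefix
                    + "\nProblem: " + problem
                    + "\nEquation: " + equation
                    + "\nFeedback: " + feedback
                    + "\nCorrected Equation:")
    return call_llm(retry_prompt)
```
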

[Figure 5: bar chart of error counts per error type (Missing Operators, Incorrect Numbers, Incorrect Operators for MWP; Logical Contradiction, Semantic Misalignment for MS) for UQA-T5-large vs. REFINER.]

Figure 5: Error analysis. Number of errors made by baseline UQA-large and REFINER on 100 instances sampled randomly from the dev sets of both datasets. Errors are categorized according to Table 1.

[Figure 6: exact match on MWP as a function of (a) the noise level of the critic used during training, comparing an oracle critic and the automatic critic at inference, and (b) the noise level (0%–100%) of the critic used during inference, compared against REFINER with the automatic critic.]

Figure 6: Noisy-critics analysis. In plot (a), we vary the noise level of the critic used during training (0 noise corresponds to the oracle) and compare the resulting models when using the oracle and the trained automatic critic during inference. In plot (b), we train with the oracle critic but vary the noise level of the critic used during inference.


Noise Sensitivity. To further understand the behaviour of the REFINER framework, we run variations with noisy critics. We replace the oracle critic used during training with a noisy critic (Fig. 6 (a)) to inspect how training with an imperfect critic impacts the generator. Also, we use a noisy critic at inference but keep the oracle critic during training (Fig. 6 (b)). This analysis is performed on the SVAMP dataset for MWP. The noisy critics are generated by random perturbations of the oracle critic; for a noise level ε, the oracle feedback is replaced by random feedback with probability ε.

We observe, in Fig. 6 (a), that when training with a very noisy critic (> 75% noise), the generator LM learns to ignore the critic, as there is no difference between using the trained critic or the oracle during inference. Interestingly, training with a bit of noise (< 50%) does not seem to harm the model, as performances are not statistically different from training with the oracle (noise of 0%). Then, in Fig. 6 (b), we remark that the quality of the critic used at inference time has a huge impact. Having the oracle provide feedback is by far the best scenario. Already with 25% noise, the critic makes the generator perform worse than using our trained critic (REFINER). With more than 50% noise, the critic significantly harms the generator. Indeed, the generator, trained with an oracle critic, has learned to trust the critic and expects useful feedback.

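
A noisy critic of this kind can be simulated with a few lines; the pool of random feedback strings below is our own illustrative choice (drawn from the MWP templates in Table 1), as the paper does not specify the exact sampling pool.

```python
import random

# Feedback templates in the style of Table 1 (MWP), used as the random pool.
RANDOM_FEEDBACK = [
    "The first number in #0 is incorrect.",
    "The second number in #0 is incorrect.",
    "The operator in #0 is incorrect.",
    "The operator in #1 is incorrect.",
    "An operator is missing.",
    "No hint.",
]

def noisy_critic(oracle_feedback: str, noise_level: float) -> str:
    """With probability noise_level (epsilon), replace the oracle feedback
    with a randomly chosen feedback string; otherwise return it unchanged."""
    if random.random() < noise_level:
        return random.choice(RANDOM_FEEDBACK)
    return oracle_feedback
```
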
6.2 Qualitative Analysis

To explain the findings in §6.1, we further manually analyze 100 instances for the MWP task. We observe two different scenarios when REFINER failed to fix the outputs generated by the GENERATOR model: (a) the CRITIC model provides correct feedback, but the GENERATOR model still generates an incorrect equation, and (b) the CRITIC model provides incomplete or partially correct feedback. The former case indicates that either the GENERATOR model makes mistakes in following the instruction from the CRITIC or the feedback from the critic can be ambiguous. For example, in Fig. 4 (b) we observe a case where the critic is correct, but the feedback could still result in an incorrect equation. The latter case indicates that our trained critic model generates incorrect feedback, which can result in incorrect or partially correct equations. We also observe that our CRITIC model fails to generate correct feedback when the GENERATOR model generates incorrect equations with multiple mistakes (see Fig. 4 (c)).

7 Conclusion

In this paper, we propose REFINER, a framework to improve the reasoning abilities of LMs through an iterative feedback loop between two models, a generator and a critic. Our evaluation of this framework on three reasoning tasks showed that structured and fine-grained feedback on intermediate reasoning errors results in significant performance gains, surpassing scalar value feedback. Our trained critic model alone, even when noisy, can improve intermediate representations of LMs, showing that REFINER can significantly boost LMs' performance on reasoning tasks. Our REFINER framework is very general and, in principle, might be applied to steer language models in performing different reasoning tasks. More specifically, the critic model can be seen as a tool for LLMs to refine their generation quality.

Acknowledgment

We would like to thank Martin Josifoski, Syrielle Montariol, and Zeming Chen for their helpful feedback on a draft version of the paper. We acknowledge the support of the ICT-48 Network of AI Research Excellence Center "TAILOR" (EU Horizon 2020, GA No 952215). West's lab is partly supported by grants from the Swiss National Science Foundation (200021_185043), Swiss Data Science Center (P22_08), H2020 (952215), Microsoft Swiss Joint Research Center, and Google, and by generous gifts from Facebook, Google, and Microsoft. Antoine Bosselut gratefully acknowledges the support of Innosuisse under PFFS-21-29, the EPFL Science Seed Fund, the EPFL Center for Imaging, Sony Group Corporation, and the Allen Institute for AI.

Limitations

Our REFINER framework could not be comprehensively evaluated on all applicable downstream reasoning tasks due to their sheer number. While deliberately distinct, we focused on only three different reasoning tasks in order to study how natural language reasoning feedback can impact downstream tasks. We believe this represents an initial but important step towards exploring automated natural language feedback on intermediate representations. In addition, the critic we presented here is specific to each task, while the ideal critic would be a general one, capable of providing feedback on a wide range of reasoning tasks. Similarly, we considered fine-grained reasoning errors specific to each reasoning task. Recent work has mentioned several other fine-grained reasoning errors (Golovneva et al., 2023), which can't be fully covered by the reasoning tasks we considered. Generalizing both the critic and the fine-grained error types emerges as both the main limitation of this paper and a direction for future work.

Ethical Considerations

In this paper, we experiment with existing datasets which are, to the best of our knowledge, adequately cited. Our proposed framework, REFINER, is designed to improve the reasoning abilities of LMs. These LMs have been shown to encode biases about race, gender, and many other demographic attributes (Weidinger et al., 2021; Sheng et al., 2020). Since our framework does not offer a way to mitigate these biases, models improved using this framework could still reflect the same harmful behaviours normally exhibited by these models. We recommend that anyone deploying our model off-the-shelf first check whether the model is harmful towards any protected group, and that appropriate mitigation be taken. In addition, our MS task is based on a dataset of situations, intentions, and actions that heavily skews towards western culture and social norms (Emelin et al., 2021). Consequently, our human evaluation on the MS task was done with AMT workers based in the US who were paid adequately for the average time it took to solve the task.

References

Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2357–2367, Minneapolis, Minnesota. Association for Computational Linguistics.

Daniel Andor, Luheng He, Kenton Lee, and Emily Pitler. 2019. Giving BERT a calculator: Finding operations and arguments with reading comprehension. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5947–5952, Hong Kong, China. Association for Computational Linguistics.

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. 2021. Program synthesis with large language models.

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. 2022. Constitutional AI: Harmlessness from AI feedback.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences.

Peter Clark, Oyvind Tafjord, and Kyle Richardson. 2021. Transformers as soft reasoners over language. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI'20.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems.

Ahmed Elgohary, Saghar Hosseini, and Ahmed Hassan Awadallah. 2020. Speak to your parser: Interactive text-to-SQL with natural language feedback. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2065–2077, Online. Association for Computational Linguistics.

Ahmed Elgohary, Christopher Meek, Matthew Richardson, Adam Fourney, Gonzalo Ramos, and Ahmed Hassan Awadallah. 2021. NL-EDIT: Correcting semantic parse errors through natural language interaction. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5599–5610, Online. Association for Computational Linguistics.

Denis Emelin, Ronan Le Bras, Jena D. Hwang, Maxwell Forbes, and Yejin Choi. 2021. Moral stories: Situated reasoning about norms, intents, actions, and their consequences. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 698–718, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, and Eduard Hovy. 2021. A survey of data augmentation approaches for NLP. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 968–988, Online. Association for Computational Linguistics.

Mor Geva, Ankit Gupta, and Jonathan Berant. 2020. Injecting numerical reasoning skills into language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 946–958, Online. Association for Computational Linguistics.

Olga Golovneva, Moya Chen, Spencer Poff, Martin Corredor, Luke Zettlemoyer, Maryam Fazel-Zarandi, and Asli Celikyilmaz. 2022. ROSCOE: A suite of metrics for scoring step-by-step reasoning.

Olga Golovneva, Moya Chen, Spencer Poff, Martin Corredor, Luke Zettlemoyer, Maryam Fazel-Zarandi, and Asli Celikyilmaz. 2023. ROSCOE: A suite of metrics for scoring step-by-step reasoning. In The Eleventh International Conference on Learning Representations.

Michael A. Hedderich, Lukas Lange, Heike Adel, Jannik Strötgen, and Dietrich Klakow. 2021. A survey on recent approaches for natural language processing in low-resource scenarios. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2545–2568, Online. Association for Computational Linguistics.

Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. 2022. Large language models can self-improve.

Zhanming Jie, Jierui Li, and Wei Lu. 2022. Learning to reason deductively: Math word problem solving as complex relation extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5944–5955, Dublin, Ireland. Association for Computational Linguistics.

Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. 2020. UNIFIEDQA: Crossing format boundaries with a single QA system. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1896–1907, Online. Association for Computational Linguistics.

Niklas Kiehne, Hermann Kroll, and Wolf-Tilo Balke. 2022. Contextualizing language models for norms diverging from social majority. In Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Bugeun Kim, Kyung Seo Ki, Sangkyu Rhim, and Gahgene Gweon. 2022. EPT-X: An expression-pointer transformer model that generates eXplanations for numbers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4442–4458, Dublin, Ireland. Association for Computational Linguistics.

Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. 2016. MAWPS: A math word problem repository. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1152–1157, San Diego, California. Association for Computational Linguistics.

Klaus Krippendorff. 2018. Content analysis: An introduction to its methodology. Sage Publications.

Andrew K Lampinen, Nicholas A Roy, Ishita Dasgupta, Stephanie CY Chan, Allison C Tam, James L McClelland, Chen Yan, Adam Santoro, Neil C Rabinowitz, Jane X Wang, et al. 2022. Tell me why! Explanations support learning relational and causal structure. In International Conference on Machine Learning.

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. 2022. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.

Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 158–167, Vancouver, Canada. Association for Computational Linguistics.

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Sean Welleck, Bodhisattwa Prasad Majumder, Shashank Gupta, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-refine: Iterative refinement with self-feedback.

Ana Marasovic, Iz Beltagy, Doug Downey, and Matthew Peters. 2022. Few-shot self-rationalization with natural language prompts. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 410–424, Seattle, United States. Association for Computational Linguistics.

Alice Martin, Guillaume Quispe, Charles Ollion, Sylvain Le Corff, Florian Strub, and Olivier Pietquin. 2022. Learning natural language generation with truncated reinforcement learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 12–37, Seattle, United States. Association for Computational Linguistics.

Nikhil Mehta and Dan Goldwasser. 2019. Improving natural language interaction with robots using advice. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1962–1967, Minneapolis, Minnesota. Association for Computational Linguistics.

Khanh Nguyen, Dipendra Misra, Robert Schapire, Miro Dudík, and Patrick Shafto. 2021. Interactive learning from activity description.

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, and Augustus Odena. 2021. Show your work: Scratchpads for intermediate computation with language models.

Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080–2094, Online. Association for Computational Linguistics.

Debjit Paul and Anette Frank. 2021. COINS: Dynamically generating COntextualized inference rules for narrative story completion. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5086–5099, Online. Association for Computational Linguistics.

Piotr Piękos, Mateusz Malinowski, and Henryk Michalewski. 2021. Measuring and improving BERT's mathematical abilities by predicting the order of reasoning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 383–394, Online. Association for Computational Linguistics.

Rajkumar Ramamurthy, Prithviraj Ammanabrolu, Kianté Brantley, Jack Hessel, Rafet Sifa, Christian Bauckhage, Hannaneh Hajishirzi, and Yejin Choi. 2022. Is reinforcement learning (not) for natural language processing?: Benchmarks, baselines, and building blocks for natural language policy optimization.

Qiu Ran, Yankai Lin, Peng Li, Jie Zhou, and Zhiyuan Liu. 2019. NumNet: Machine reading comprehension with numerical reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2474–2484, Hong Kong, China. Association for Computational Linguistics.

Christian Rupprecht, Iro Laina, Nassir Navab, Gregory D. Hager, and Federico Tombari. 2018. Guide me: Interacting with deep networks. CoRR, abs/1803.11544.

William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, and Jan Leike. 2022. Self-critiquing models for assisting human evaluators.

Jérémy Scheurer, Jon Ander Campos, Jun Shern Chan, Angelica Chen, Kyunghyun Cho, and Ethan Perez. 2022. Training language models with language feedback.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. ArXiv, abs/1707.06347.

Emily Sheng, Kai-Wei Chang, Prem Natarajan, and Nanyun Peng. 2020. Towards controllable biases in language generation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3239–3254, Online. Association for Computational Linguistics.

Vered Shwartz, Peter West, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2020. Unsupervised commonsense question answering with self-talk. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4615–4629, Online. Association for Computational Linguistics.

Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. 2020. Learning to summarize from human feedback. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20, Red Hook, NY, USA. Curran Associates Inc.

Alon Talmor, Oyvind Tafjord, Peter Clark, Yoav Goldberg, and Jonathan Berant. 2020. Leap-of-thought: Teaching pre-trained models to systematically reason over implicit knowledge. In Advances in Neural Information Processing Systems, volume 33, pages 20227–20237. Curran Associates, Inc.

Niket Tandon, Aman Madaan, Peter Clark, and Yiming Yang. 2022. Learning to repair: Repairing model output errors after deployment using a dynamic memory of feedback. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 339–352, Seattle, United States. Association for Computational Linguistics.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models.

Francis Rhys Ward, Francesco Belardinelli, and Francesca Toni. 2022. Argumentative reward

Lijun Wu, Fei Tian, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. 2018. A study of reinforcement learning for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3612–3621, Brussels, Belgium. Association for Computational Linguistics.

Zhipeng Xie and Shichao Sun. 2019. A goal-driven tree-structured neural model for math word problems. In International Joint Conference on Artificial Intelligence.

Xi Ye and Greg Durrett. 2022. The unreliability of explanations in few-shot prompting for textual reasoning. In Advances in Neural Information Processing Systems.

Jipeng Zhang, Lei Wang, Roy Ka-Wei Lee, Yi Bin, Yan Wang, Jie Shao, and Ee-Peng Lim. 2020. Graph-to-tree learning for solving math word problems. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3928–3937, Online. Association for Computational Linguistics.

Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2019. Fine-tuning language models from human preferences.

learning: Reasoning about human preferences.
Workshop on Human-Machine Collaboration and
Teaming at ICML, abs/2209.14010.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten


Bosma, Ed H. Chi, Quoc Le, and Denny Zhou. 2022.
Chain of thought prompting elicits reasoning in large
language models. CoRR, abs/2201.11903.

Laura Weidinger, John Mellor, Maribeth Rauh, Conor


Griffin, Jonathan Uesato, Po-Sen Huang, Myra
Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh,
Zac Kenton, Sasha Brown, Will Hawkins, Tom
Stepleton, Courtney Biles, Abeba Birhane, Ju-
lia Haas, Laura Rimell, Lisa Anne Hendricks,
William S. Isaac, Sean Legassick, Geoffrey Irv-
ing, and Iason Gabriel. 2021. Ethical and so-
cial risks of harm from language models. CoRR,
abs/2112.04359.

Sean Welleck, Ximing Lu, Peter West, Faeze Brah-


man, Tianxiao Shen, Daniel Khashabi, and Yejin
Choi. 2022. Generating sequences by learning to
self-correct.

Jason Weston. 2016. Dialog-based language learning.


A REFINER Framework

Alg. 1 and Alg. 2 outline the training and inference algorithms for REFINER. We train a supervised CRITIC model (πβ) that takes the context (x) and a (plausible or implausible) hypothesis (z or z′) as input and produces textual feedback as output. Given a context x, the generator model (πθ) is trained to generate plausible hypotheses.

Algorithm 1 REFINER Training
1: for E epochs do
2:     for i (batch) ← 1 to N do
3:         Initialize (feedback) f_0 ← No
4:         for t ← 1 to T do
5:             ẑ^k_{i,t} ∼ πθ(y_i | c_i, f_{t−1}, ẑ_{i,t−1})
6:             f_t, ẑ ← πβ(c_i, z_i, ẑ^k_{i,t})
7:             L^i_lm += −log p(z_i | c_i, f_{t−1}, ẑ_{i,t−1})
8:         end for
9:     end for
10: end for
11: return πθ
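For illustration, the control flow of Algorithm 1 can be sketched as follows. The `generator.sample`, `generator.nll`, and `critic_feedback` helpers are hypothetical stand-ins for sampling from πθ, computing the negative log-likelihood of the gold hypothesis, and querying πβ; this is a sketch under those assumptions, not the authors' implementation.

```python
# A minimal sketch of one REFINER training epoch (Algorithm 1).
# `generator.sample`, `generator.nll`, and `critic_feedback` are hypothetical
# stand-ins for sampling from the generator (pi_theta), computing the negative
# log-likelihood of the gold hypothesis, and querying the critic (pi_beta).

def train_epoch(batches, generator, critic_feedback, optimizer, turns=3, k=4):
    for batch in batches:
        loss = 0.0
        for context, gold in batch:            # (c_i, z_i) pairs
            feedback, prev = "No", ""          # no feedback before the first turn
            for _ in range(turns):
                # Sample k candidate hypotheses conditioned on the context,
                # the previous feedback, and the previous hypothesis.
                candidates = generator.sample(context, feedback, prev, k=k)
                # The critic returns structured feedback plus the candidate
                # that is carried over to the next turn.
                feedback, prev = critic_feedback(context, gold, candidates)
                # Supervised loss on the gold hypothesis under the same
                # conditioning that was used for generation.
                loss = loss + generator.nll(gold, context, feedback, prev)
        loss.backward()                        # assumes a PyTorch-style tensor loss
        optimizer.step()
        optimizer.zero_grad()
```

The sketch only mirrors the loop structure of Algorithm 1; details such as batching, loss normalization, and optimizer scheduling are deliberately omitted.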
Algorithm 2 REFINER Inference
1: Initialize answers ← empty list
2: for i (batch) ← 1 to N do
3:     Initialize (reward) r_i ← 0, p_i ← 1
4:     Initialize (hint) h_0 ← No, ŷ_{i,0} ← []
5:     for (turn) t ← 1 to T do
6:         ŷ ← πθ(y_i | c_i, h_{t−1}, ŷ_{i,t−1})
7:         h_t ← πβ(c_i, ŷ_i)
8:         if h_t == No then
9:             break
10:        end if
11:    end for
12:    answers.append(ŷ)
13: end for
14: return answers
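A minimal Python sketch of this inference loop is shown below. The `generator` and `critic` callables are hypothetical wrappers around the trained πθ and πβ models, assumed here only for illustration; they are not a released API.

```python
# A minimal sketch of the inference loop in Algorithm 2. `generator` and
# `critic` are hypothetical callables wrapping the trained pi_theta and
# pi_beta models; they are assumptions for illustration.

def refiner_inference(contexts, generator, critic, max_turns=4):
    answers = []
    for context in contexts:
        hint, hypothesis = "No", ""            # no hint before the first turn
        for _ in range(max_turns):
            # Condition the generator on the context, the latest critic
            # feedback, and its own previous hypothesis.
            hypothesis = generator(context, hint, hypothesis)
            hint = critic(context, hypothesis)
            if hint == "No":                   # the critic finds no error
                break
        answers.append(hypothesis)             # keep the last hypothesis
    return answers
```

In this loop the trained critic can also be swapped for an oracle critic or a human providing the same kind of structured feedback at inference time.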
B Datasets and Models

In Table 6, we report the dataset statistics. In Table 7, we report the sizes of the models used.

Task    Train    Dev    Test
MWP     3,138    –      1,000
sNLR    1,000    5,000  5,000
MS      10,000   1,000  1,000

Table 6: Dataset statistics (number of instances).

Model            Parameter Size
UQA-base         220M
REFINER-base     440M
UQA-large        770M
REFINER-large    990M
GPT3.5           175B

Table 7: Model sizes.

C Additional Results

For the answer prediction task on MWP, we compare REFINER with the baselines previously reported by Jie et al. (2022): Graph2Tree (Zhang et al., 2020), which models quantity relations with a graph convolutional network; GTS (Xie and Sun, 2019), a sequence-to-tree model that relies on a GRU-based tree decoder; and DeductReasoner (Jie et al., 2022), which uses bottom-up DAG-structured decoding. The results of this comparison can be found in Table 8. For the sNLR task, we also experiment with a critic model trained on only 50% of its original training data and still observe a performance improvement over the baseline, as shown in Table 9.

Answer Prediction (y)               Acc %
GTS                                 30.8
Graph2Tree                          36.5
BERT-Tree                           32.4
Roberta-large-GTS                   41.0
Roberta-large-Graph2Tree            43.8
Roberta-large-DeductReasoner        45.0
Few-Shot GPT-3                      63.05
Few-Shot GPT-3 + COT                63.5
Few-Shot GPT-3 + COT + REFINER      66.4

Table 8: Results on the SVAMP dataset.

Model                     IR            C
T5-base                   84.28 ± 0.5   88.86
REFINER-base              88.26 ± 0.8   94.26
REFINER-base + Oracle     91.11 ± 0.5   97.28

Table 9: Results on the sNLR dataset with the critic trained on 50% of its original training data. IR: Inference Rules, C: Consequent.

D Qualitative Examples

Figure 8 depicts a qualitative example in which REFINER corrects an incorrect equation through structured feedback, fixing the operators within a multi-step solution. Table 12 shows some qualitatively improved examples for MS.
Figure 7: An overview of the three tasks tackled in this paper, with examples of both valid and invalid intermediate reasoning steps, as well as their corresponding fine-grained error types. Note that the Missing Steps error type in the second task actually encompasses two error types: reasoning misalignment, derived from not considering the "or" operation, and lack of implicit knowledge, where implicit knowledge is needed to match the existing rules.

E Feedback Data Generation

Based on the error types described in Figure 7 and Table 10, we perturb the plausible hypotheses (z) in the training data and collect a pool of data D (x: input, z: plausible hypothesis, z′: implausible hypothesis). Since our perturbations are based on logic and reasoning errors, we create structured feedback f for every example (x, z, z′) by stating the error type that occurs in z′ but not in z. For example, in sNLR, we omit a few inference steps from the correct hypothesis "#0: viridian is green, #1: rose is green" and thereby create an incorrect (incomplete) hypothesis. For such a scenario, we create the feedback "Missing link between the fact and the rules". In Table 11, we show some examples from the training data for MS.
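For concreteness, the sketch below illustrates this perturbation for the sNLR missing-lookup-step case only. The helper name and the string-level manipulation are assumptions made for illustration, not the data pipeline used in the paper.

```python
import random

# Drop one inference step from a plausible hypothesis and pair the resulting
# implausible hypothesis with the matching template feedback from Table 10.

def make_missing_step_example(context, plausible_hypothesis):
    steps = [s.strip() for s in plausible_hypothesis.split(",")]
    if len(steps) < 2:
        return None                            # nothing to omit
    dropped = random.randrange(len(steps))
    implausible = ", ".join(s for i, s in enumerate(steps) if i != dropped)
    feedback = "Missing link between the fact and the rules."
    # (x, z, z', f): input, plausible, implausible, structured feedback
    return context, plausible_hypothesis, implausible, feedback

example = make_missing_step_example(
    "viridian is a kind of green. the rose is viridian.",   # toy context
    "#0: viridian is green, #1: rose is green",
)
```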
F Human Evaluation on Moral Stories

As part of the human evaluation of model generations on MS, we asked Amazon Mechanical Turk (AMT) annotators to judge the relevance of the generated norm and moral action on a Likert scale, with 1 = strongly disagree, 2 = disagree, 3 = unsure, 4 = agree, and 5 = strongly agree. Ratings were subsequently aggregated: scores ≥ 4 were deemed Relevant, scores ≤ 2 were deemed Irrelevant, and ratings with score 3 (Unsure) were left as is. More specifically, we asked three different human judges to evaluate each example and performed majority voting over their answers, assigning the rating Unsure to examples with no clear majority winner. In Figures 9 and 10, we report a complete breakdown of the evaluation results for both the norm and the moral action. We also report agreement scores computed according to Krippendorff's α (Krippendorff, 2018) in Table 4. The low to moderate α values indicate that judging the plausibility of moral norms and actions is a challenging task. In Figures 11–19, we provide excerpts of the HIT instructions given to AMT workers during moral norm and moral action evaluation. Each task was supplemented by an Acceptance and Privacy Policy (Figure 19) explaining the participation and data collection terms. All workers were based in the US and were paid $0.10 per task, which took around 5 minutes to complete on average.
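The aggregation scheme described above can be sketched as follows. The function names are illustrative and not taken from the paper's evaluation scripts.

```python
from collections import Counter

# Map 5-point Likert scores to Relevant / Irrelevant / Unsure, then take a
# majority vote over the three judges, falling back to Unsure when there is
# no clear winner.

def bucket(score):
    if score >= 4:
        return "Relevant"
    if score <= 2:
        return "Irrelevant"
    return "Unsure"

def aggregate(scores):
    votes = Counter(bucket(s) for s in scores)
    label, top = votes.most_common(1)[0]
    # A clear majority requires strictly more votes than any other label.
    if sum(1 for c in votes.values() if c == top) > 1:
        return "Unsure"
    return label

print(aggregate([5, 4, 2]))   # Relevant
print(aggregate([5, 2, 3]))   # Unsure (no clear majority)
```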
Context: After resting they decided to go for a swim. The depth of the water is number0 times Dean's height. Dean is number1 feet taller than Ron. If Ron stands at number2 feet, how deep was the water?

Iteration 1:
    Generator: #0: add ( number0, number2 ) | #1: subtract ( #0, number2 )
    Critic: The first number in #0 is incorrect. The operator in #1 is incorrect.

Iteration 2:
    Generator: #0: add ( number1, number2 ) | #1: multiply ( #0, number2 )
    Critic: The number in #1 is incorrect.

Iteration 3:
    Generator: #0: add ( number0, number1 ) | #1: multiply ( #0, number2 )
    Critic: No hint.

Figure 8: REFINER on MWP. The generator's output improves step-wise.

Task | Error Type | Structured Feedback | Human-Readable Feedback
MWP  | Incorrect Numbers | ⟨error type, position, equation-number⟩ | The position number in equation-number is incorrect.
MWP  | Incorrect Operators | ⟨error type, equation-number⟩ | The operator in equation-number is incorrect.
MWP  | Missing Operators | ⟨error type⟩ | An operator is missing.
sNLR | Logically Incorrect | ⟨X operator, inference rule number⟩ | The X operator makes inference rule number invalid.
sNLR | Missing Lookup Step | ⟨error type⟩ | Missing link between the fact and the rules.
sNLR | Missing Implicit Knowledge Step | ⟨error type⟩ | The implicit knowledge is missing.

Table 10: Feedback templates.
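As a small illustration of how such structured feedback maps to the human-readable hints in Table 10, the sketch below renders the MWP templates. The template strings follow the table; the dictionary keys and the function itself are assumptions made for this sketch.

```python
# Render MWP structured feedback into human-readable hints (Table 10).

MWP_TEMPLATES = {
    "incorrect_number": "The {position} number in {equation} is incorrect.",
    "incorrect_operator": "The operator in {equation} is incorrect.",
    "missing_operator": "An operator is missing.",
}

def render_feedback(error_type, **slots):
    return MWP_TEMPLATES[error_type].format(**slots)

print(render_feedback("incorrect_number", position="second", equation="#1"))
# -> The second number in #1 is incorrect.
```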

Figure 9: Human Evaluation of Moral Norm on 100 test samples. (Chart comparing the distribution of Likert ratings, from strongly disagree to strongly agree, for UQA-T5-large, UQA-T5-large + PPO, and REFINER.)


Situation: Jim was in class when his classmate answered one of the teacher’s ques-
tions wrong.
Intention: Jim wants his classmate to know the answer was wrong.
Moral Action: Jim tells his classmate the right answer and offers to help him after school.
Immoral Action: Jim starts to laugh at his classmate and tells him he is stupid for not
knowing the answer.
Plausible Hypothesis: It’s hurtful to make fun of your classmates.
Contradictory Hypothesis: You should make fun of your classmates.
Implausible Hypothesis 1: It’s good to know the answer.
Implausible Hypothesis 2: It’s wrong to be stupid.

Situation: Jenny has been going out with a guy for a while, but she feels it isn’t
working out.
Intention: Jenny wants to end the relationship with the guy.
Moral Action: Jenny meets the guy and gently says that she wants to end the relationship.
Immoral Action: Jenny sends the guy a text message that informs him that the relation-
ship is over.
Plausible Hypothesis: It’s good to break up with someone face to face.
Contradictory Hypothesis: It’s bad to break up with someone face to face.
Implausible Hypothesis 1: You should not send a text message.
Implausible Hypothesis 2: It’s bad to end the relationship with the guy.

Situation: Paula is leaving home to do some shopping and notices that the dog’s
water bowl is empty.
Intention: Paula wants to leave the dog at home.
Moral Action: Paula fills the bowl with water before leaving dog alone in the house.
Immoral Action: Paula decides to fill the bowl when she gets home and leaves the dog with
no water.
Plausible Hypothesis: You must always be kind to animals
Contradictory Hypothesis: It’s bad to be kind to animals.
Implausible Hypothesis 1: It’s rude to leave the dog at home.
Implausible Hypothesis 2: It’s important to do some shopping.

Table 11: Critic training examples from Moral Stories. Norm judgments are highlighted in light blue, original norm actions are colored in light gray, and norm actions extracted from the context are colored in light yellow.
Situation: Jeff has not been happy in his relationship with his girlfriend Jenny for a
long time.
Intention: Jeff wants to break up with Jenny.
Moral Action: Jeff goes to Jenny’s house and tells her that he wants to break up with her.
Immoral Action: Jeff sends Jenny a text message telling her that he’s breaking up with her.
UQA-T5-large Output: It’s wrong to break up with someone.
Jeff calls Jenny and tells her that he’s breaking up with her.

REFINER Output: It's wrong to break up with someone over text.


Jeff sits down with Jenny and tells her that he's breaking up with her.

Situation: Jack and his friends are playing in his backyard on a sunny day.
Intention: Jack wants to have fun with his friends.
Moral Action: Jack sets up an inflatable pool and swims in the pool with his friends.
Immoral Action: Jack tosses a ball around with his friends, and he accidentally hit a window
of his neighbor.
UQA-T5-large Output: It’s wrong to hit someone.
Jack throws a ball with his friends and hits a window of his neighbor.

REFINER Output: It's wrong to cause harm to someone else's property.


Jack tosses a ball around with his friends and hits a tree with it.

Situation: Jack was invited to a party, but he has the flu.


Intention: Jack wants to attend the party and have fun.
Moral Action: Jack stays home, but Skypes in to the party to socialize.
Immoral Action: Jack goes to the party even though he is sick.
UQA-T5-large Output: It’s bad to be sick.
Jack calls his mom and tells her he is sick.

REFINER Output: It's bad to spread germs.


Jack calls his friend and tells him he can't go to the party.

Table 12: Moral Stories generations. Norm outputs (z) are highlighted in light blue, and moral action outputs (y) are colored in light green.
Figure 10: Human Evaluation of Moral Action on 100 test samples. (Chart comparing the distribution of Likert ratings, from strongly disagree to strongly agree, for UQA-T5-large, UQA-T5-large + PPO, and REFINER.)

Figure 11: Excerpt from AMT HIT instructions: Norm Evaluation Task
Figure 12: Excerpt from AMT HIT instructions: Moral Action Evaluation Task

Figure 13: Excerpt from AMT HIT instructions: Norm Evaluation Task instructions
Figure 14: Excerpt from AMT HIT instructions: Norm Evaluation Task Dos and Don’ts

Figure 15: Excerpt from AMT HIT instructions: Norm Evaluation Task examples
Figure 16: Excerpt from AMT HIT instructions: Moral Action Evaluation Task instructions

Figure 17: Excerpt from AMT HIT instructions: Moral Action Evaluation Task Dos and Don’ts
Figure 18: Excerpt from AMT HIT instructions: Moral Action Evaluation Task examples

Figure 19: Excerpt from AMT HIT instructions: Acceptance and Privacy Policy
