
Back to the Future: Unsupervised Backprop-based Decoding for

Counterfactual and Abductive Commonsense Reasoning


Lianhui Qin†‡  Vered Shwartz†‡  Peter West†‡  Chandra Bhagavatula‡
Jena D. Hwang‡  Ronan Le Bras‡  Antoine Bosselut^‡  Yejin Choi†‡

†Paul G. Allen School of Computer Science & Engineering, University of Washington
‡Allen Institute for Artificial Intelligence
^Stanford University
{lianhuiq, pawest, yejin}@cs.washington.edu
{vered, chandrab, jenah, ronanlb}@allenai.org

arXiv:2010.05906v3 [cs.CL] 7 Dec 2020

Abstract

Abductive and counterfactual reasoning, core abilities of everyday human cognition, require reasoning about what might have happened at time t, while conditioning on multiple contexts from the relative past and future. However, simultaneous incorporation of past and future contexts using generative language models (LMs) can be challenging, as they are trained either to condition only on the past context or to perform narrowly scoped text-infilling. In this paper, we propose DELOREAN, a new unsupervised decoding algorithm that can flexibly incorporate both the past and future contexts using only off-the-shelf, left-to-right language models and no supervision. The key intuition of our algorithm is incorporating the future through back-propagation, during which, we only update the internal representation of the output while fixing the model parameters. By alternating between forward and backward propagation, DELOREAN can decode the output representation that reflects both the left and right contexts. We demonstrate that our approach is general and applicable to two nonmonotonic reasoning tasks: abductive text generation and counterfactual story revision, where DELOREAN outperforms a range of unsupervised and some supervised methods, based on automatic and human evaluation.1

1 Code is available at https://github.com/qkaren/unsup_gen_for_cms_reasoning

Figure 1: DELOREAN, our proposed method, with generated reasoning results. Top: the goal in abductive reasoning is to generate a hypothesis (Y) of what happened between the observed past (X) and future (Z) contexts. Bottom: In counterfactual reasoning, given a story context altered by a counterfactual condition, X, and the original ending Z, the goal is to generate a new ending Y which is coherent with X while remaining similar to Z. The story from TIMETRAVEL (Qin et al., 2019a) consists of five sentences. Our approach alternates forward (left-to-right) and backward (right-to-left) passes that iteratively refine the generated texts w.r.t. context from each side.
[Figure 1 abductive example: past observation "Ray hung a tire on a rope to make his daughter a swing."; generated hypothesis "She hit the rope and the tire fell on top of her."; future observation "Ray ran to his daughter to make sure she was okay." Counterfactual example: story context "Zeke was throwing a party. All his friends were dressing up for this Halloween party."; counterfactual condition "All his friends were dressing up for this Game of Thrones themed party."; original ending "Zeke thought about being a vampire or a wizard. Then he decided on a scarier costume. Zeke dressed up like a skeleton."; rewritten ending "Zeke thought about Lannister, but he didn't want to look like a Lannister. He wanted to look like a Stark. Zeke dressed up like a Stark."]

1 Introduction

Everyday causal reasoning requires reasoning about the likely explanations to partially observable past and future (abductive reasoning (Peirce, 1960)) and reasoning about the alternative future based on counterfactual past (counterfactual reasoning). Such nonmonotonic reasoning requires inferring plausible but potentially defeasible conclusions from incomplete or hypothetical observations (Reiter, 1988). While humans are remarkably good at this type of causal reasoning, developing AI systems capable of nonmonotonic reasoning for
a wide range of situations describable in natural language has been a major open research question.

More concretely, with abductive reasoning, the goal is to find the most plausible explanation for incomplete observations (Peirce, 1960). In the top part of Figure 1, given the first observation that Ray is "making his daughter a swing" and the later observation that he "ran to [her] to make sure she was okay," we can hypothesize that she somehow got hurt by the swing.

In contrast, counterfactual reasoning concerns the causal changes to future events given a change in the past condition (i.e., "counterfactual condition"; Goodman, 1947). For example, the bottom part of Figure 1 shows the original five-sentence story (S1, ..., S5) and an alternative counterfactual condition given in S2′—that instead of being a generic "Halloween party", the new counterfactual condition is that it is going to be a "Game of Thrones themed party"! Given these, the problem we want to solve is to update the future events (S3′, ..., S5′), so that instead of "Zeke dressed up as skeleton", we have "Zeke dressed up like a Stark".2

2 "Lannister" in S3′ and "Stark" in S4′ and S5′ refer to character names in the TV show "Game of Thrones." All the output text shown in Figure 1 is the actual system output from DELOREAN.

Recently, two tasks and corresponding benchmarks have been introduced to tackle language-based nonmonotonic reasoning: the ART dataset for abductive NLG (Bhagavatula et al., 2019), and the TIMETRAVEL dataset for counterfactual story rewriting (Qin et al., 2019a). Both tasks are framed as conditional generation, with multiple contexts to condition on. The currently dominant paradigm for conditional text generation tasks is fine-tuning pre-trained language models (LMs), such as GPT2 (Radford et al., 2019a), on large-scale training data for supervision. However, despite the large number of training examples, supervised approaches still perform considerably worse than humans and are subject to developing superficial strategies such as repeating the observations as is or memorizing prevalent surface patterns specific to the dataset (Qin et al., 2019a). Furthermore, having to require large-scale training data for each domain and task would be utterly inefficient for broad-coverage nonmonotonic reasoning in language.

In this paper, we investigate an alternative path toward language-based nonmonotonic reasoning using pre-trained language models as is. Intuitively, both the abductive and counterfactual reasoning requires learning coherent patterns in narrative, which should be already available in large-scale pretrained language models. However, the key challenge is that most generative language models are trained to condition only on the left context, or to perform narrowly scoped text-infilling.

This paper presents DELOREAN: DEcoding for nonmonotonic LOgical REAsoNing, an unsupervised decoding algorithm that only assumes off-the-shelf left-to-right language models with no supervision. The key intuition of our algorithm is incorporating the future through back-propagation, during which, we only update the internal representation of the output while fixing the model parameters. More specifically, DELOREAN alternates between the forward and backward passes, where the forward pass performs left-to-right inference given the left context (roughly maximizing P(Y|X) in Figure 1), while the backward pass instills the right constraint through right-to-left backpropagation with a task-specific loss (roughly maximizing P(Z|XY)). The forward and backward outputs are mixed into a single vector, from which tokens are sampled to generate the desired output. To choose the best output across iterations, we employ an unsupervised ranking step based on BERT's next sentence prediction task to measure coherence (Devlin et al., 2018).

On both tasks, DELOREAN outperforms all other unsupervised methods in terms of both automatic metrics and human evaluation, demonstrating that nonmonotonic reasoning through conditional decoding is a promising research direction. Moreover, outputs produced by our model are judged as more coherent than those from the supervised models. In sum, our study shows that backpropagation-based decoding may enable additional future applications of unsupervised generation and reasoning.

2 Background

Most NLP benchmarks have focused on reasoning about information that is entailed from the premise. For instance, natural language inference (NLI; Bowman et al., 2015) focuses primarily on whether a hypothesis is entailed from a given premise, which means the information stated in the hypothesis is a subset of the information provided in the premise.
Figure 2: Illustration of the DELOREAN decoding procedure, using abductive reasoning as an example. At initialization (upper-left box), the language model (LM) initializes the logits Ỹ = (ỹ_1, . . . , ỹ_N) of the hypothesis by reading the past context X and generating a continuation with regular decoding. At each forward-backward iteration, we compute the task-specific loss of the logits based on the future constraint Z (red box). The backward pass then performs back-propagation and produces the backward logits Ỹ^b = (ỹ^b_1, . . . , ỹ^b_N). In the subsequent forward pass, for each step n, we compute the forward logits ỹ^f_n conditioning on the preceding logits ỹ_{1:n−1}, and then mix it with the respective backward logits to produce the new logits ỹ_n at step n.
[Diagram text: past context X "Ray hung a tire on a rope to make his daughter a swing."; future constraint Z "Ray ran to his daughter to make sure she was okay."; generated Y "She hit the rope and the tire fell on top of her."; the forward-backward loop is repeated T times.]
However, it has been noted that human reasoning is often the other way, where hypotheses often contain new information that was not available in the premise, but plausibly true (but possibly defeasible with new additional context) (Johnson-Laird, 2006; Mercier and Sperber, 2017). This type of reasoning corresponds to nonmonotonic reasoning (Kraus et al., 1990), as it contradicts the monotonicity property according to which valid arguments cannot be made invalid by adding premises. We study two tasks of that nature: abductive reasoning (§2.1) and counterfactual reasoning (§2.2).

2.1 Abductive Reasoning

Abductive reasoning aims at finding the most likely explanation to partial observations (Peirce, 1960). It has a central role in the human ability to "read between the lines," and is crucial for language acquisition (Andersen, 1973), understanding sentences in discourse (Hobbs et al., 1993), and many more. Despite the importance, however, relatively little focus has been given to it in NLP research.

Recently, Bhagavatula et al. (2019) propose the abductive reasoning task. Given two observations, the goal is to determine the most likely explanation of what happened in-between. The dataset introduced for the task, ART, consists of 20k observations derived from the first and last sentence of stories in the ROCStories dataset (Mostafazadeh et al., 2016a). We focus on the abductive NLG setup introduced in the paper, which is framed as a conditional generation task where a plausible explanation to the observations must be generated using language. The authors reported the performance of several pre-trained LM-based baselines and showed promises and limitations of such approaches.

2.2 Counterfactual Reasoning

Counterfactual reasoning aims at inferring alternative past events that could have happened given a certain change in conditions (Goodman, 1947; Starr, 2019).
While counterfactual reasoning plays an important role in AI systems (Isard, 1974; Ginsberg, 1986), it requires causal reasoning abilities, which are arguably absent from current association-based AI (Pearl and Mackenzie, 2018). While there has been work on counterfactual reasoning in NLP, including recognizing counterfactuals in text (Son et al., 2017), and improving the performance of NLP tasks using counterfactual learning (Lawrence et al., 2017; Lawrence and Riezler, 2018), it remains a major research challenge.

Recently, Qin et al. (2019a) introduce the task of counterfactual story generation. Given a 5-sentence original story, and an alternative context in which the second sentence of the story was altered by a counterfactual, the task is to generate a new 3-sentence story ending that addresses the alternative beginning while minimally editing the original ending. The associated TIMETRAVEL dataset is based on fictional narratives from ROCStories, for which counterfactual contexts and alternative endings are crowdsourced, yielding 29,849 problem instances. Qin et al. (2019a) report several baseline performances, and find that models based on pre-trained LMs produce output that recognizes the counterfactual but generates endings that deviate considerably from the original storyline. In contrast, in the supervised setup, models optimize the easier of the two goals and generate endings that are overly similar to the original endings.

3 The DELOREAN Approach

Humans make inferences based on available information and refine them when new information arrives. Since currently available pre-trained LMs generate text by sequentially predicting the next token from left to right, they are incapable of conditioning on future constraints. Therefore, we propose DELOREAN: an unsupervised backprop-based decoding algorithm, which is summarized in Algorithm 1, illustrated in Figure 2, and detailed below. DELOREAN intermittently refines the predictions to cohere with either the context or the constraints (Section 3.1). The candidate generations are then ranked by coherence (Section 3.2).

Algorithm 1: DELOREAN Decoding
Input: Pre-trained language model (LM), context X, future constraint Z
1: Initialize logits Ỹ^(0)
2: Initialize Ys, list of candidate generations
3: for t ← 1 to T do
4:   // Backward pass
5:   for n ← N to 1 do
6:     Compute backward logits ỹ^b_n, Eq. (1)
7:   end for
8:   // Forward pass
9:   for n ← 1 to N do
10:     Compute forward logits ỹ^f_n, Eq. (2)
11:     Mix forward and backward logits, Eq. (3)
12:   end for
13:   Sample candidate Y from logits Ỹ and add to Ys
14: end for
15: Rank Ys by coherence
Output: The most coherent generated text Y from Ys

3.1 Decoding Strategy

Given context text X, the goal is to generate continuation text Y = (y_1, . . . , y_N), such that Y satisfies certain constraints according to the reasoning tasks, usually defined based on another context Z (see Figure 1; we discuss the task-specific constraints in the respective task sections).

The proposed approach interleaves two procedures, namely forward and backward, that produce and iteratively refine the generation, for a predefined number of iterations T. In particular, the forward pass ensures the generated text is a fluent continuation of the context X, while the backward pass informs the model about the constraint and steers the generation to satisfy it.

As detailed below, the backward pass uses gradient descent to update the generation Y. However, Y is a discrete text that is not differentiable. Instead, throughout the algorithm, we maintain a soft representation of the sequence Ỹ = (ỹ_1, . . . , ỹ_N), where ỹ_n ∈ R^V represents the logits of the n-th token and V is the vocabulary size. After the logits are refined over multiple iterations of the forward and backward passes, we generate discrete text at each step by sampling from y_n ∼ softmax(ỹ_n/τ), where τ > 0 is the temperature.

We start by initializing the logits before the first iteration, Ỹ^(0) = (ỹ_1^(0), . . . , ỹ_N^(0)), by feeding the context X into the LM and greedily decoding N continuation tokens.
to cohere with either the context or the constraints
(Section 3.1). The candidate generations are then Backward The backward pass uses gradient
ranked by coherence (Section 3.2). backpropagation to update the generation with
respect to the constraint. Specifically, we ex-
3.1 Decoding Strategy press the task-specific constraint as a loss function
Given context text X, the goal is to generate contin- L(X, Ỹ (t−1) , Z) that evaluates how well the gener-
uation text Y = (y1 , . . . , yN ), such that Y satisfies ation Y (approximated with the soft representation
certain constraints according to the reasoning tasks, Ỹ ) obeys the constraint (see the subsequent sec-
usually defined based on another context Z (see tions for concrete instantiations of the loss). The
Figure 1; we discuss the task-specific constraints goal of this pass is thus to minimize the loss w.r.t
in the respective task sections). the generation. Specifically, at iteration t, for each
step n in the generation, we update its logits with: Model BLEU-4 ROUGE-L BERT
(t),b (t−1) Supervised
ỹn = ỹn − λ · ∇ỹn L(X, Ỹ (t−1) , Z), (1) Sup 3.46 25.60 49.38
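A single backward update can be sketched as follows, assuming the soft sequence is a PyTorch tensor and task_loss is a callable implementing the task-specific loss of Eq. (5) or Eq. (7); differentiating only with respect to the logits via torch.autograd.grad is our implementation assumption.

```python
def backward_pass(y_tilde: torch.Tensor, task_loss, step_size: float,
                  n_updates: int = 1) -> torch.Tensor:
    """Eq. (1): y~^b_n = y~_n - lambda * grad_n L(X, Y~, Z), applied to all steps at once."""
    y_tilde = y_tilde.detach().requires_grad_(True)
    for _ in range(n_updates):                       # the update may be repeated within one pass
        loss = task_loss(y_tilde)                    # scalar constraint loss L(X, Y~, Z)
        grad, = torch.autograd.grad(loss, y_tilde)
        y_tilde = (y_tilde - step_size * grad).detach().requires_grad_(True)
    return y_tilde.detach()                          # backward logits Y~^b
```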
Forward The forward pass ensures that Y is fluent and coherent with the preceding context X. At iteration t, for a particular step n, we compute the forward logits with the LM:

ỹ_n^(t),f = LM(X, Ỹ^(t)_{1:n−1}).   (2)

We then mix the n-th-step forward and backward logits to get the final logits of iteration t:

ỹ_n^(t) = γ · ỹ_n^(t),f + (1 − γ) · ỹ_n^(t),b,   (3)

where 0 < γ < 1 is the mixing weight. The resulting logits ỹ_n^(t) are then fed to the LM to compute the forward logits at the (n+1)-th step (Eq. 2). This way, information from the backward pass is integrated into the left-to-right generation process to produce text that is informed by the constraint.

We pre-define the number of tokens N required by the backward pass, but we allow the forward pass to generate more than N tokens if those are needed to obtain complete sentences. In that case, we set the logits of the extra tokens to the forward logits, without mixing: ỹ_n^(t) = ỹ_n^(t),f for n > N. We then prune any trailing tokens in the sampled text to get complete sentences.
where 0 < γ < 1 is the mixing weight. The result- map to X, Z and Y in our framework. Formally,
(t)
ing logits ỹn are then fed to the LM to compute the abductive generation task aims to maximize
the forward logits at the (n + 1)th step (Eq.2). This P (Y |X, Z) – i.e. models must consider both left
way, information from the backward pass is inte- and right contexts (X and Z) jointly.
grated into the left-to-right generation process to
produce text that is informed by the constraint. 4.1 Task Setup
We pre-define the number of tokens N required Constraints We maximize Z given X Ỹ by defin-
by the backward pass, but we allow the forward ing the loss function as the cross-entropy loss of
pass to generate more than N tokens if those are generating Z given X Ỹ with the LM:3
needed to obtain complete sentences. In that case,
we set the logits of the extra tokens to the forward L(X, Ỹ , Z) := −
PNZ
, Z1:n−1 ), (5)
(t) (t),f n=1 log PLM (zn |X, Ỹ
logits, without mixing: ỹn = ỹn for n > N .
We then prune any trailing tokens in the sampled where PLM (aj |a1:j−1 ) is the likelihood of generat-
text to get complete sentences. ing token aj given the preceding text a1:j−1 .
3.2 Ranking Ranking We rank candidates by the overall co-
The output of the decoding step is a list of candi- herence after inserting Y in between X and Z:
date generations for each iteration: Ys = {Y (t) |t =
1, ..., T }. We further use an unsupervised approach ranking score(Y ) = c(XY, Z) + c(X, Y Z). (6)
to rank and pick the best sample as the final out-
put. Specifically, we take advantage of the BERT Hyperparameters We use GPT2-345M (Rad-
model, which was pre-trained with a next-sentence ford et al., 2019b) as the pre-trained LM for all
prediction (NSP) objective. Given two sentences models. We use the ART development set to se-
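A sketch of this loss, differentiable with respect to the soft hypothesis Ỹ, is given below. As in the forward-pass sketch, representing Ỹ through softmax-weighted embeddings is an implementation assumption; lm is the GPT-2 handle defined earlier.

```python
def abductive_loss(context_ids: torch.Tensor, y_tilde: torch.Tensor,
                   future_ids: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Eq. (5): cross-entropy of the future constraint Z given X and the soft hypothesis Y~."""
    embed = lm.get_input_embeddings()
    x_emb = embed(context_ids)                                             # (1, |X|, d)
    y_emb = (F.softmax(y_tilde / tau, dim=-1) @ embed.weight).unsqueeze(0) # keeps grad to y_tilde
    z_emb = embed(future_ids)                                              # (1, |Z|, d)
    logits = lm(inputs_embeds=torch.cat([x_emb, y_emb, z_emb], dim=1)).logits
    n_z = future_ids.size(1)
    pred = logits[0, -n_z - 1:-1, :]          # positions whose next token is z_1 .. z_{N_Z}
    return F.cross_entropy(pred, future_ids[0])
```

In the backward pass it would be used as, e.g., task_loss = lambda y: abductive_loss(x_ids, y, z_ids), where x_ids and z_ids are the tokenized observations.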
Ranking We rank candidates by the overall coherence after inserting Y in between X and Z:

ranking_score(Y) = c(XY, Z) + c(X, YZ).   (6)
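Reusing the coherence helper sketched for Eq. (4), this ranking score is a simple two-term sum:

```python
def abductive_rank_score(x: str, y: str, z: str) -> float:
    """Eq. (6): coherence of Z after XY, plus coherence of YZ after X."""
    return coherence(x + " " + y, z) + coherence(x, y + " " + z)
```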
Hyperparameters We use GPT2-345M (Radford et al., 2019b) as the pre-trained LM for all models. We use the ART development set to select hyperparameters. We use greedy decoding for our method and top-k decoding (Fan et al., 2018) (k = 40, τ = 0.7) for our baselines. Other hyperparameters are outlined in the Appendix.
[Figure 3 example 1] O1: "Ray drive his car on a steep mountain road." O2: "Ray was fine but his car was totaled." Generated hypothesis: "As he drives the car to the top of the mountain his car is hit by a car."
[Figure 3 example 2] O1: "Peter was excited to go to the Sanders rally in New Hampshire." O2: "He couldn't wait to vote for him." Generated hypothesis: "He has a long history of supporting Bernie Sanders and was excited to see him in person."

Figure 3: Examples of generated hypotheses on three abductive reasoning cases. Given observations O1 and O2, DELOREAN generates a hypothesis explaining the observations.

4.2 Experimental Setup

Baselines We compare our method against baselines from Bhagavatula et al. (2019). The unsupervised baselines use a pre-trained GPT-2 model to generate Y given a prompt text—either the observation X alone (Zero-ShotX) or Z⟨e⟩X (Zero-ShotZX), where ⟨e⟩ denotes a special end-of-text token. The supervised method (Sup) follows the same input format as Zero-ShotZX, but fine-tunes GPT-2 on the ART training set. Finally, our knowledge-informed baseline (+COMET-Emb) further augments the representation of Sup with knowledge from COMET (Bosselut et al., 2019).

To separately study the contribution of our decoding strategy and ranking component, we also report the performance of ranking the baseline outputs. Specifically, we let each baseline generate 20 candidates and rank them by coherence (Eq. 6).4

4 We tried ablating the ranking component from our method in preliminary experiments, and found that ranking is essential to obtaining good performance. By adding ranking to our baselines, we assess the contribution of our decoding strategy.

4.3 Results

Automatic Evaluation We report the same metrics as Bhagavatula et al. (2019): BLEU-4 (Papineni et al., 2002), ROUGE-L (Lin, 2004) and BERTSCORE (Zhang et al., 2019) (with the bert-base-uncased model). The results in Table 1 show that DELOREAN performs best among the unsupervised systems across all metrics. We also note that our ranking step improves both the performance of our model and that of the zero-shot baselines.

Human Evaluation We conduct two sets of human evaluations on 100 test examples using crowdworkers from Amazon Mechanical Turk. In the scoring setting, presented in Table 2, workers were presented a pair of observations (X and Z) and a generated hypothesis Y, and asked to rate the coherence of the hypothesis with respect to the observation X (X-Y), the observation Z (Y-Z), and both (X-Y-Z), on a 4-point Likert scale. In the pairwise comparison setting, presented in Table 3, workers were presented the outputs from a pair of systems (DELOREAN and a baseline) and asked to choose the better output in terms of the same coherence criteria. Each example was labeled by 3 workers.5

5 The average inter-rater agreement measured by Fleiss' κ = 0.44 ("moderate agreement") (Fleiss, 1971).

Model X-Y Y-Z X-Y-Z
Supervised
Sup 0.510 0.375 0.314
+COMET-Emb 0.466 0.342 0.286
Unsupervised
Zero-ShotZX 0.233 0.103 0.108
Zero-ShotX-Ranked 0.478 0.208 0.195
Zero-ShotZX-Ranked 0.474 0.238 0.236
DELOREAN 0.522 0.325 0.297
Human 0.879 0.823 0.783

Table 2: Human calibration results on the test set of ART. All scores are normalized to [0, 1].

Overall - Human Judges Preferred
Our model / Neutral / Comparator
DELOREAN 21% 43% 36% Sup
DELOREAN 25% 44% 31% +COMET-Emb
DELOREAN 23% 62% 15% Zero-ShotX-Ranked
DELOREAN 27% 50% 23% Zero-ShotZX-Ranked
DELOREAN 3% 11% 86% Human

Table 3: Human pairwise comparison results on the test set of ART, between DELOREAN and each of the baselines, by jointly considering all 3 criteria from Table 2. "Neutral" means "equally good/bad".

In both evaluation setups, our method substantially outperforms the unsupervised baselines, achieving a relative improvement of 36%−215% with respect to Y-Z coherence. Our method also outperforms the supervised methods with respect to X-Y coherence (Table 2), and achieves competitive performance in the pairwise comparison (Table 3).
Again, the ranking component contributes to increasing performance for the zero-shot baselines. Finally, the large performance gap between the methods and human-written explanations stresses the difficulty of this reasoning task and warrants future research.

Qualitative Analysis Figure 3 presents two example outputs produced by DELOREAN. We can see our approach generates reasonable hypotheses by taking into account both the past and future contexts. For instance, in the first example, the future observation (O2) "car was totaled" indicates that Ray had a car accident, which is correctly captured in the generated hypothesis "car is hit by a car".

Model BLEU-4 ROUGE-L BERT
Supervised + Discriminative
Sup+Disc 75.71 72.72 62.39
Unsupervised + Discriminative
Recon+CF 75.92 70.93 62.49
Unsupervised
FT 4.06 24.09 62.55
FT+CF 4.02 24.35 62.63
Pretrained-only
Zero-ShotS1S2′ 1.74 21.41 59.31
Zero-ShotS1S2′-Ranked 2.26 25.81 60.07
DELOREAN 21.35 40.73 63.36
Human 64.93 67.64 61.87

Table 4: Automatic evaluation results of counterfactual story rewriting, on the test set of TIMETRAVEL.

Figure 4: Human calibration results for counterfactual generation in terms of the weighted harmonic mean of coherence and min-edit, Hβ = (1 + β²) · coherence · min_edit / (β² · coherence + min_edit), as a function of the scaling factor β. Low β values assign more weight to coherence, and high β values emphasize more on min-edit. [Plot comparing DELOREAN, Recon+CF, Zero-Shot-Ranked, Zero-Shot, FT and FT+CF; x-axis: β from 0 to 1.4; y-axis: harmonic mean from 1.2 to 2.0.]
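For reference, the weighted harmonic mean reported in Figure 4 is a direct transcription of the formula in the caption; the small function below is only illustrative.

```python
def h_beta(coherence_score: float, min_edit_score: float, beta: float) -> float:
    """H_beta from Figure 4: low beta emphasizes coherence, high beta emphasizes min-edit."""
    return ((1 + beta ** 2) * coherence_score * min_edit_score) / (
        beta ** 2 * coherence_score + min_edit_score)
```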
Coherence - Human Judges Preferred
Our model / Neutral / Comparator
DELOREAN 25% 58% 17% Sup+Disc
DELOREAN 23% 70% 7% Recon+CF
DELOREAN 22% 48% 30% FT
DELOREAN 18% 60% 22% Zero-ShotS1S2′
DELOREAN 27% 42% 31% Zero-ShotS1S2′-Ranked
DELOREAN 10% 29% 61% Human

Min-Edits - Human Judges Preferred
Our model / Neutral / Comparator
DELOREAN 4% 17% 79% Sup+Disc
DELOREAN 1% 14% 85% Recon+CF
DELOREAN 21% 76% 3% FT
DELOREAN 28% 71% 1% Zero-ShotS1S2′
DELOREAN 37% 56% 7% Zero-ShotS1S2′-Ranked
M+Sup 8% 22% 70% Human

Table 5: Human pairwise comparison results on the counterfactual task, between our best model and each baseline with respect to coherence and min-edits.

5 Task 2: Counterfactual Reasoning

Given an original story ending Z of story context X^ori, and a counterfactual condition X that changes X^ori to invalidate Z (see Fig. 1), the task is to generate a new story ending Y that minimally edits the original ending Z to regain coherence with the counterfactual condition X (Qin et al., 2019a).

5.1 Task Setup

Constraints The constraint we enforce is that Y is close to Z (i.e., minimal edits). We impose this constraint by minimizing their KL divergence:

L(X, Ỹ, Z) := KL(Z ‖ softmax(Ỹ/τ)),   (7)

where, with a slight abuse of notation, Z is the one-hot distribution of the tokens in the original ending. That is, we encourage the generated logits to recover the original ending.
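Since Z is one-hot, the KL term reduces to the token-level cross-entropy of the original ending under softmax(Ỹ/τ), which makes the loss easy to sketch; truncating to the shorter of the two lengths is our own simplification.

```python
import torch.nn.functional as F

def counterfactual_loss(y_tilde: torch.Tensor, original_ending_ids: torch.Tensor,
                        tau: float = 1.0) -> torch.Tensor:
    """Eq. (7): KL(Z || softmax(Y~/tau)) with one-hot Z, i.e. cross-entropy of the original tokens."""
    n = min(y_tilde.size(0), original_ending_ids.size(0))
    return F.cross_entropy(y_tilde[:n] / tau, original_ending_ids[:n])
```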
Ranking We rank the candidates based on both their coherence with the context, as well as the internal coherence between the multiple sentences of each candidate (a rewritten ending consists of 3 sentences). More concretely, given a candidate Y, we compute the aggregated coherence score:

ranking_score(Y) = c(X, Y) + Σ_{s=1}^{S−1} c(Y[s], Y[s+1]),   (8)

where each candidate has S sentences (here, S = 3) and Y[s] denotes the s-th sentence.
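Again reusing the coherence helper from Eq. (4), the aggregated score can be sketched as:

```python
def counterfactual_rank_score(context: str, ending_sentences: list) -> float:
    """Eq. (8): coherence of the candidate ending with the context, plus the
    coherence between every pair of adjacent sentences within the ending."""
    score = coherence(context, " ".join(ending_sentences))
    for a, b in zip(ending_sentences, ending_sentences[1:]):
        score += coherence(a, b)
    return score
```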
[Figure 5 example 1] Story Context: "Tara wanted to buy a new shirt for her upcoming school formal. She went to the mall with her mom." Counterfactual Condition: "She knew of a cool place online that did custom fits really cheaply, and ordered from there." Original Ending: "They browsed shirts from a variety of stores. Tara picked out a floral patterned shirt that she liked best. Tara looked forward to wearing it." Rewritten Ending: "They sent her a shirt that fit her perfectly. Tara was so excited to wear it. She looked forward to wearing it."
[Figure 5 example 2] Story Context: "Shane and John were best friends at school. Shane was caught stealing and got suspended from school." Counterfactual Condition: "Shane enjoyed volunteering his time helping others." Original Ending: "John was not allowed to be friends with Shane anymore. this bothered John greatly but his mom explained the reasons. She explained that Shane was a bad influence on John." Rewritten Ending: "John was a good student and was always looking for ways to help others. They were both very kind and caring people. Shane was a member of the Boy Scouts of America."

Figure 5: Examples of generated story endings on three counterfactual reasoning cases. Given a story context, a counterfactual condition, and an original ending, DELOREAN generates a rewritten ending which is coherent with the counterfactual condition and is similar to the original ending.

Hyperparameters We largely follow the same setting as in the abductive reasoning task, but tune hyperparameters on the TIMETRAVEL development set. Deviations from these settings are outlined in the Appendix.

5.2 Experimental Setup

Baselines We compare our method with baselines from Qin et al. (2019a). The zero-shot baseline uses the pre-trained GPT-2 model to generate Y as a continuation to the counterfactual condition X. It is the most apt comparison to our method, which also doesn't require additional supervision. We also experiment with two baselines that fine-tune GPT-2 on the original story X^ori Z to fit the model to the story domain, either with an LM objective (FT) or a tailored conditional objective that encourages minimal edits of Z (Recon+CF).6 Finally, we report the performance of a supervised baseline (Sup), in which GPT-2 is fine-tuned to produce the gold Y from X^ori Z and X.

6 See Qin et al. (2019a) for more details.

5.3 Results

Automatic Evaluation Following Qin et al. (2019a), we report BERTSCORE (Zhang et al., 2019), which was shown to best correlate with human judges' notion of counterfactual coherence, and BLEU-4 and ROUGE-L, which better measure minimum-edits. We find that the discriminative baselines achieve the highest degree of plot fidelity. Meanwhile, DELOREAN achieves the highest BERTSCORE for counterfactual coherence.

Human Evaluation We repeat the human evaluation setup from Section 4.3. Presented with the original story, the counterfactual condition X, and the generated ending Y, workers were asked to judge (1) the coherence of Y with respect to X; and (2) to what extent the generated ending minimally edits the original ending.7 In order to judge both criteria, we report the weighted harmonic mean Hβ of these scores across a range of weights β (Figure 4).

7 Fair inter-rater agreement with Fleiss' κ = 0.34.

Our results show that DELOREAN is the only model that maintains a consistent balance between coherence (1.66) and minimal edits (1.54). While the ranking-augmented zero-shot model produces the most coherent endings (coherence = 1.8), it deviates from the original ending. As β is increased (i.e., increasing importance of minimal edits), its weighted performance drops considerably, indicating it cannot generate new endings that follow the original plot of the story (min-edit = 1.25). Conversely, Recon+CF generates stories that are faithful to the original endings, but are far less coherent with the counterfactual condition (coherence = 1.23). Through human annotation, we found that Recon+CF copies the original ending word-for-word in 84% of cases.
The pairwise comparison results in Table 5 parallel these observations. DELOREAN significantly outperforms the discriminative approaches (Recon+CF and Sup+Disc) in coherence, while falling short of the zero-shot re-ranked baselines. In minimal edits, this pattern is flipped, with our approach outperforming the zero-shot baselines considerably and losing to the discriminative baselines.

Qualitative Analysis Figure 5 provides two example results for counterfactual story rewriting by DELOREAN. The approach successfully captures the causal relations between events and properly rewrites the endings with minimal edits. For instance, in the first example, given the counterfactual condition that "Tara ordered a shirt online" (as opposed to the original "went to mall"), the rewritten ending is about "sent shirt" to Tara (as opposed to the original "browsed from stores"). The last sentence of the original ending, "She looked forward to wearing it", is correctly preserved as it is coherent with the counterfactual condition.

6 Related Work

Unsupervised text generation. Unsupervised approaches are often applied to problems that copy information from a source text into decoded text. Unsupervised paraphrasing requires repeating this information (Miao et al., 2019; Bao et al., 2019), as does translation, but with a bilingual transformation (Artetxe et al., 2017; Lample et al., 2018). In summarization there is an additional task to select a subset of the original text (Baziotis et al., 2019; Schumann et al., 2020; West et al., 2019). In cases where information is mostly copied from the original, auto-encoding objectives can ensure the correct information is captured (Bao et al., 2019; Baziotis et al., 2019; Artetxe et al., 2017). This work tackles problems where generation is more open-ended. Rather than reproducing information from the prompt, generations should agree with and expand on it, making autoencoding less applicable.

Controllable language generation. Earlier approaches for controllable generation involved preserving the content of text while changing it along discrete dimensions, such as theme, sentiment, or style (Koncel-Kedziorski et al., 2016; Hu et al., 2017; Ficler and Goldberg, 2017; Shen et al., 2017; Lample et al., 2019). Recent works such as Grover (Zellers et al., 2019) and the CTRL model (Keskar et al., 2019) used these ideas to augment transformer language models that can condition on structured metadata such as source, domain, etc. The Plug & Play model (PPLM; Dathathri et al., 2019) controls topic and sentiment in an approach similar to ours that involves forward and backward passes to update token distributions. However, PPLM relies on trained attribute discriminators for supervision, while our method is unsupervised. While these models are restricted to specific dimensions, often with pre-defined values, our model can adjust to any open-ended textual constraint. Perhaps the most similar work in that aspect is "text infilling" models, which, however, operate in a narrower setting by filling only a relatively short text span (Devlin et al., 2018; Zhu et al., 2019; Donahue et al., 2020), and are more restrictive due to the reliance on an extra right-to-left language model (Sun et al., 2017) or a pre-specified generation length (Zeldes et al., 2020, which is not publicly available).

Reasoning about narratives. A prominent resource from recent years is the RocStories corpus (Mostafazadeh et al., 2016b), consisting of 98K crowdsourced 5-sentence everyday life stories. It was used for the story cloze task, whose goal was to predict the story ending from its first 4 sentences, but gained popularity and became the base of additional benchmarks (Rashkin et al., 2018). Additional related work includes "script knowledge", i.e. learning about prototypical series of events (Schank and Abelson, 1977; Chambers and Jurafsky, 2008; Pichotta and Mooney, 2014), temporal commonsense (Granroth-Wilding and Clark, 2016; Li et al., 2018), and modeling pre- and post-conditions of events (Roemmele et al., 2011; Sap et al., 2019; Bosselut et al., 2019). Qin et al. (2019b) studied conversation modeling that reads and connects the dots of events in related documents. Finally, a recent line of work explores counterfactual questions in reading comprehension (Huang et al., 2019; Tandon et al., 2019), but instantiates the problem of counterfactual reasoning as a multiple-choice task.

7 Conclusion

We presented DELOREAN, an unsupervised LM-based approach to generate text conditioned on past context as well as future constraints, through forward and backward passes considering each condition. We demonstrated its effectiveness for abductive and counterfactual reasoning, on which it performed substantially better than unsupervised baselines. Our method is general and can be easily adapted for other generative reasoning tasks.
Acknowledgements

We thank the anonymous reviewers and colleagues at UW NLP and AI2 for many helpful comments. This research was supported in part by DARPA CwC through ARO (W911NF15-1-0543), the DARPA MCS program through NIWC Pacific (N66001-19-2-4031), and the Allen Institute for AI.
References

Henning Andersen. 1973. Abductive and deductive change. Language, pages 765–793.

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2017. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041.

Yu Bao, Hao Zhou, Shujian Huang, Lei Li, Lili Mou, Olga Vechtomova, Xinyu Dai, and Jiajun Chen. 2019. Generating sentences from disentangled syntactic and semantic spaces. arXiv preprint arXiv:1907.05789.

Christos Baziotis, Ion Androutsopoulos, Ioannis Konstas, and Alexandros Potamianos. 2019. Seq3: Differentiable sequence-to-sequence-to-sequence autoencoder for unsupervised abstractive sentence compression. In NAACL-HLT.

Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Wen-tau Yih, and Yejin Choi. 2019. Abductive commonsense reasoning. In International Conference on Learning Representations.

Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. 2019. COMET: Commonsense transformers for automatic knowledge graph construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4762–4779, Florence, Italy. Association for Computational Linguistics.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.

Nathanael Chambers and Dan Jurafsky. 2008. Unsupervised learning of narrative event chains. In Proceedings of ACL-08: HLT, pages 789–797.

Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2019. Plug and play language models: A simple approach to controlled text generation. arXiv preprint arXiv:1912.02164.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

C. Donahue, M. Lee, and P. Liang. 2020. Enabling language models to fill in the blanks. In ACL.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In ACL.

Jessica Ficler and Yoav Goldberg. 2017. Controlling linguistic style aspects in neural language generation. In Proceedings of the Workshop on Stylistic Variation, pages 94–104.

Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378.

Matthew L. Ginsberg. 1986. Counterfactuals. Artificial Intelligence, 30(1):35–79.

Nelson Goodman. 1947. The problem of counterfactual conditionals. The Journal of Philosophy, 44(5):113–128.

Mark Granroth-Wilding and Stephen Clark. 2016. What happens next? Event prediction using a compositional neural network model. In Thirtieth AAAI Conference on Artificial Intelligence.

Jerry R. Hobbs, Mark E. Stickel, Douglas E. Appelt, and Paul Martin. 1993. Interpretation as abduction. Artificial Intelligence, 63(1-2):69–142.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In ICML.

Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Cosmos QA: Machine reading comprehension with contextual commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2391–2401.

Steve D. Isard. 1974. What would you have done if...? Theoretical Linguistics, 1(1-3):233–256.

Philip Nicholas Johnson-Laird. 2006. How We Reason. Oxford University Press, USA.

Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.

Rik Koncel-Kedziorski, Ioannis Konstas, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2016. A theme-rewriting approach for generating algebra word problems. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1617–1628.

Sarit Kraus, Daniel Lehmann, and Menachem Magidor. 1990. Nonmonotonic reasoning, preferential models and cumulative logics. Artificial Intelligence, 44(1-2):167–207.

Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018. Phrase-based & neural unsupervised machine translation. arXiv preprint arXiv:1804.07755.

Guillaume Lample, Sandeep Subramanian, Eric Smith, Ludovic Denoyer, Marc'Aurelio Ranzato, and Y-Lan Boureau. 2019. Multiple-attribute text rewriting. In ICLR.

Carolin Lawrence and Stefan Riezler. 2018. Improving a neural semantic parser by counterfactual learning from human bandit feedback. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1820–1830.

Carolin Lawrence, Artem Sokolov, and Stefan Riezler. 2017. Counterfactual learning from bandit feedback under deterministic logging: A case study in statistical machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2566–2576.

Zhongyang Li, Xiao Ding, and Ting Liu. 2018. Constructing narrative event evolutionary graph for script event prediction. In IJCAI.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out.

Hugo Mercier and Dan Sperber. 2017. The Enigma of Reason. Harvard University Press.

Ning Miao, Hao Zhou, Lili Mou, Rui Yan, and Lei Li. 2019. CGMH: Constrained sentence generation by Metropolis-Hastings sampling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6834–6842.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016a. A corpus and evaluation framework for deeper understanding of commonsense stories. arXiv preprint arXiv:1604.01696.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James F. Allen. 2016b. A corpus and cloze evaluation for deeper understanding of commonsense stories. In HLT-NAACL.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In ACL, pages 311–318.

Judea Pearl and Dana Mackenzie. 2018. The Book of Why: The New Science of Cause and Effect. Basic Books.

Charles Sanders Peirce. 1960. Collected Papers of Charles Sanders Peirce, volume 2. Harvard University Press.

Karl Pichotta and Raymond Mooney. 2014. Statistical script learning with multi-argument events. In EACL, pages 220–229.

Lianhui Qin, Antoine Bosselut, Ari Holtzman, Chandra Bhagavatula, Elizabeth Clark, and Yejin Choi. 2019a. Counterfactual story reasoning and generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5046–5056.

Lianhui Qin, Michel Galley, Chris Brockett, Xiaodong Liu, Xiang Gao, Bill Dolan, Yejin Choi, and Jianfeng Gao. 2019b. Conversing by reading: Contentful neural conversation with on-demand machine reading. In ACL, pages 5427–5436.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019a. Language models are unsupervised multitask learners.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019b. Language models are unsupervised multitask learners. OpenAI Blog, 1:8.

Hannah Rashkin, Antoine Bosselut, Maarten Sap, Kevin Knight, and Yejin Choi. 2018. Modeling naive psychology of characters in simple commonsense stories. arXiv preprint arXiv:1805.06533.

Raymond Reiter. 1988. Nonmonotonic reasoning. In Exploring Artificial Intelligence, pages 439–481. Elsevier.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series.

Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A. Smith, and Yejin Choi. 2019. ATOMIC: An atlas of machine commonsense for if-then reasoning. In AAAI.

Roger C. Schank and Robert P. Abelson. 1977. Scripts, Plans, Goals and Understanding: An Inquiry into Human Knowledge Structures.

Raphael Schumann, Lili Mou, Yao Lu, Olga Vechtomova, and Katja Markert. 2020. Discrete optimization for unsupervised sentence summarization with word-level extraction. arXiv preprint arXiv:2005.01791.

Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems, pages 6830–6841.

Youngseo Son, Anneke Buffone, Joe Raso, Allegra Larche, Anthony Janocko, Kevin Zembroski, H. Andrew Schwartz, and Lyle Ungar. 2017. Recognizing counterfactual thinking in social media texts. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 654–658.

William Starr. 2019. Counterfactuals. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy, fall 2019 edition. Metaphysics Research Lab, Stanford University.

Qing Sun, Stefan Lee, and Dhruv Batra. 2017. Bidirectional beam search: Forward-backward inference in neural sequence models for fill-in-the-blank image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6961–6969.

Niket Tandon, Bhavana Dalvi Mishra, Keisuke Sakaguchi, Antoine Bosselut, and Peter Clark. 2019. WIQA: A dataset for "what if..." reasoning over procedural text. In EMNLP.

Peter West, Ari Holtzman, Jan Buys, and Yejin Choi. 2019. BottleSum: Unsupervised and self-supervised sentence summarization using the information bottleneck principle. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3743–3752.

Yoel Zeldes, Dan Padnos, and Barak Peleg. 2020. HAIM-1.5 - the next generation.

Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake news. In Advances in Neural Information Processing Systems, pages 9051–9062.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. CoRR, abs/1904.09675.

Wanrong Zhu, Zhiting Hu, and Eric Xing. 2019. Text infilling. arXiv preprint arXiv:1901.00158.
