arXiv:2010.05906v3 [cs.CL] 7 Dec 2020

Abductive and counterfactual reasoning, core abilities of everyday human cognition, require …

[Figure 1: task illustration — a past observation and a future observation, with the generated hypothesis in-between (e.g., "She hit the rope").]
[Figure 2 panels (schematic): the LM reads the past context X ("Ray hung a tire on a rope to make his daughter a swing.") and the future constraint Z ("Ray ran to his daughter to make sure she was okay."); forward and backward passes over the logits ỹ1, . . . , ỹN are repeated T times, yielding the output Y: "She hit the rope and the tire fell on top of her."]

Figure 2: Illustration of the DELOREAN decoding procedure, using abductive reasoning as an example. At initialization (upper-left box), the language model (LM) initializes the logits Ỹ = {ỹ1, . . . , ỹN} of the hypothesis by reading the past context X and generating a continuation with regular decoding. At each forward-backward iteration, we compute the task-specific loss L of the logits Ỹ based on the future constraint Z (red box). The backward pass then performs back-propagation and produces the backward logits Ỹb = {ỹ1b, . . . , ỹNb}. In the subsequent forward pass, for each step n, we compute the forward logits ỹnf conditioning on the preceding logits ỹ1, . . . , ỹn−1, and then mix them with the respective backward logits to produce the new logits ỹn at step n.

… possibly defeasible with new additional context (Johnson-Laird, 2006; Mercier and Sperber, 2017). This type of reasoning corresponds to nonmonotonic reasoning (Kraus et al., 1990), as it contradicts the monotonicity property according to which valid arguments cannot be made invalid by adding premises. We study two tasks of that nature: abductive reasoning and counterfactual reasoning.
2.1 Abductive Reasoning

Abductive reasoning has a central role in the human ability to "read between the lines," and is crucial for language acquisition (Andersen, 1973), understanding sentences in discourse (Hobbs et al., 1993), and many more. Despite its importance, however, relatively little focus has been given to it in NLP research.

Recently, Bhagavatula et al. (2019) propose the abductive reasoning task. Given two observations, the goal is to determine the most likely explanation of what happened in-between. The dataset introduced for the task, ART, consists of 20k observations derived from the first and last sentence of stories in the ROCStories dataset (Mostafazadeh et al., 2016a). We focus on the abductive NLG setup.

2.2 Counterfactual Reasoning

Counterfactual reasoning aims at inferring alternative past events that could have happened given a certain change in conditions (Goodman, 1947; Starr, 2019). While counterfactual reasoning plays an important role in AI systems (Isard, 1974; Ginsberg, 1986), it requires causal reasoning abilities, which are arguably absent from current association-based AI (Pearl and Mackenzie, 2018). While there has been work on counterfactual reasoning in NLP, including recognizing counterfactuals in text (Son et al., 2017) and improving the performance of NLP tasks using counterfactual learning (Lawrence et al., 2017; Lawrence and Riezler, 2018), it remains a major research challenge.

Recently, Qin et al. (2019a) introduce the task of counterfactual story generation. Given a 5-sentence original story and an alternative context in which the second sentence of the story was altered by a counterfactual, the task is to generate a new 3-sentence story ending that addresses the alternative beginning while minimally editing the original ending. The associated TIMETRAVEL dataset is based on fictional narratives from ROCStories, for which counterfactual contexts and alternative endings are crowdsourced, yielding 29,849 problem instances. Qin et al. (2019a) report several baseline performances, and find that models based on pre-trained LMs produce output that recognizes the counterfactual, but generate endings that deviate considerably from the original storyline. In contrast, in the supervised setup, models optimize the easier of the two goals and generate endings that are overly similar to the original endings.
3 The DELOREAN Approach

Humans make inferences based on available information and refine them when new information arrives. Since currently available pre-trained LMs generate text by sequentially predicting the next token from left to right, they are incapable of conditioning on future constraints. Therefore, we propose DELOREAN: an unsupervised backprop-based decoding algorithm, which is summarized in Algorithm 1, illustrated in Figure 2, and detailed below. DELOREAN intermittently refines the predictions to cohere with either the context or the constraints (Section 3.1). The candidate generations are then ranked by coherence (Section 3.2).

Algorithm 1: DELOREAN Decoding
    Input: Pre-trained language model (LM)
           Context X
           Future constraint Z
     1: Initialize logits Ỹ(0)
     2: Initialize Ys, list of candidate generations
     3: for t ← 1 to T do
     4:   // Backward pass
     5:   for n ← N to 1 do
     6:     Compute backward logits ỹnb, Eq.(1)
     7:   end for
     8:   // Forward pass
     9:   for n ← 1 to N do
    10:     Compute forward logits ỹnf, Eq.(2)
    11:     Mix forward and backward logits, Eq.(3)
    12:   end for
    13:   Sample candidate Y from logits Ỹ and add to Ys
    14: end for
    15: Rank Ys by coherence
    Output: The most coherent generated text Y from Ys

3.1 Decoding Strategy

Given context text X, the goal is to generate continuation text Y = (y1, . . . , yN), such that Y satisfies certain constraints according to the reasoning tasks, usually defined based on another context Z (see Figure 1; we discuss the task-specific constraints in the respective task sections).

The proposed approach interleaves two procedures, namely, forward and backward, that produce and iteratively refine the generation, for a predefined number of iterations T. In particular, the forward pass ensures the generated text is a fluent continuation of the context X, while the backward pass informs the model about the constraint and steers the generation to satisfy it.

As detailed below, the backward pass uses gradient descent to update the generation Y. However, Y is a discrete text that is not differentiable. Instead, throughout the algorithm, we maintain a soft representation of the sequence Ỹ = (ỹ1, . . . , ỹN), where ỹn ∈ R^V represents the logits of the n-th token and V is the vocabulary size. After the logits are refined over multiple iterations of the forward and backward passes, we generate discrete text at each step by sampling from yn ∼ softmax(ỹn/τ), where τ > 0 is the temperature.

We start by initializing the logits before the first iteration, Ỹ(0) = (ỹ1(0), . . . , ỹN(0)), by feeding the context X into the LM and greedily decoding N continuation tokens.

Backward  The backward pass uses gradient backpropagation to update the generation with respect to the constraint. Specifically, we express the task-specific constraint as a loss function L(X, Ỹ(t−1), Z) that evaluates how well the generation Y (approximated with the soft representation Ỹ) obeys the constraint (see the subsequent sections for concrete instantiations of the loss). The goal of this pass is thus to minimize the loss w.r.t. the generation. Specifically, at iteration t, for each step n in the generation, we update its logits with:

    ỹn(t),b = ỹn(t−1) − λ · ∇ỹn L(X, Ỹ(t−1), Z),    (1)

where ∇ỹn L(X, Ỹ(t−1), Z) is the gradient of the constraint-informed loss L w.r.t. the n-th logits, and λ ∈ R is the step size. In practice, we may repeat the gradient updates multiple times in a single pass.

Forward  The forward pass ensures that Y is fluent and coherent with the preceding context X. At iteration t, for a particular step n, we compute the forward logits with the LM:

    ỹn(t),f = LM(X, Ỹ1:n−1(t)).    (2)

We then mix the n-th step forward and backward logits to get the final logits of iteration t:

    ỹn(t) = γ · ỹn(t),f + (1 − γ) · ỹn(t),b,    (3)

where 0 < γ < 1 is the mixing weight. The resulting logits ỹn(t) are then fed to the LM to compute the forward logits at the (n+1)-th step (Eq. 2). This way, information from the backward pass is integrated into the left-to-right generation process to produce text that is informed by the constraint.

We pre-define the number of tokens N required by the backward pass, but we allow the forward pass to generate more than N tokens if those are needed to obtain complete sentences. In that case, we set the logits of the extra tokens to the forward logits, without mixing: ỹn(t) = ỹn(t),f for n > N. We then prune any trailing tokens in the sampled text to get complete sentences.

3.2 Ranking

The output of the decoding step is a list of candidate generations for each iteration: Ys = {Y(t) | t = 1, . . . , T}. We further use an unsupervised approach to rank and pick the best sample as the final output. Specifically, we take advantage of the BERT model, which was pre-trained with a next-sentence prediction (NSP) objective. Given two sentences A and B, we use NSP to compute the likelihood of B following A as a proxy for coherence:

    c(A, B) = BERT_NSP(A, B),    (4)

where c(·, ·) denotes the coherence score. This score is used to evaluate the quality of a given candidate continuation Y by measuring (1) its compatibility with the subsequent text of the context X, (2) the internal consistency of Y if it consists of multiple sentences, and (3) the compatibility of Y with its right-side text when it is applicable.

4 Task 1: Abductive Reasoning

Each instance in the ART dataset consists of two observations O1, O2 and a hypothesis H that explains the two observations. These inputs naturally map to X, Z and Y in our framework. Formally, the abductive generation task aims to maximize P(Y | X, Z) – i.e. models must consider both left and right contexts (X and Z) jointly.

4.1 Task Setup

Constraints  We maximize Z given XỸ by defining the loss function as the cross-entropy loss of generating Z given XỸ with the LM:³

    L(X, Ỹ, Z) := − Σ_{n=1}^{N_Z} log P_LM(zn | X, Ỹ, Z1:n−1),    (5)

where P_LM(aj | a1:j−1) is the likelihood of generating token aj given the preceding text a1:j−1.

Ranking  We rank candidates by the overall coherence after inserting Y in between X and Z:

    ranking_score(Y) = c(XY, Z) + c(X, YZ).    (6)

Hyperparameters  We use GPT2-345M (Radford et al., 2019b) as the pre-trained LM for all models. We use the ART development set to select hyperparameters. We use greedy decoding for our method and top-k decoding (Fan et al., 2018) (k = 40, τ = 0.7) for our baselines. Other hyperparameters are outlined in the appendix.

    Model                  BLEU-4  ROUGE-L  BERT
    Supervised
    Sup                      3.46    25.60  49.38
    +COMET-Emb               4.06    26.06  49.71
    Unsupervised
    Zero-ShotX               0.65    14.99  39.36
    Zero-ShotZX              0.53    14.23  40.03
    Zero-ShotX-Ranked        0.87    16.76  41.58
    Zero-ShotZX-Ranked       0.98    17.25  41.93
    DELOREAN                 1.38    18.94  42.86
    Human                    8.25    30.40  53.30

Table 1: Automatic evaluation results on the abductive task, using the test set of ART.

4.2 Experimental Setup

Baselines  We compare our method against baselines from Bhagavatula et al. (2019).

³ Note that this is applied to each prefix of Ỹ, although some of them are not complete sentences.
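The forward-backward refinement of Eqs. 1–3 can be sketched end-to-end. The paper instantiates the LM with GPT-2; in the toy sketch below the vocabulary, the bigram "LM", the single-token constraint loss, and all hyperparameter values are illustrative assumptions, kept small enough to run anywhere:

```python
import math

# Toy sketch of DELOREAN's forward-backward decoding (Eqs. 1-3).
# The "LM", vocabulary, constraint, and hyperparameters are stand-ins
# for the paper's GPT-2 setup, not the actual implementation.

VOCAB = ["<s>", "she", "hit", "the", "rope", "tire", "fell"]
V = len(VOCAB)

def lm_logits(prev_id):
    """Stand-in LM: bigram logits that favor a canned next token."""
    chain = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 6}
    return [3.0 if i == chain[prev_id] else 0.0 for i in range(V)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def delorean_decode(N=4, T=5, lam=2.0, gamma=0.2, constraint_id=6):
    # Initialization: greedily decode the logits of N continuation tokens.
    y, prev = [], 0
    for _ in range(N):
        logits = lm_logits(prev)
        y.append(logits)
        prev = max(range(V), key=lambda i: logits[i])
    for _ in range(T):
        # Backward pass (Eq. 1): one gradient step on a toy constraint loss
        # L = -log softmax(y_N)[z]; its gradient is softmax(y_N) - onehot(z).
        probs = softmax(y[-1])
        grad = [p - (1.0 if i == constraint_id else 0.0) for i, p in enumerate(probs)]
        y_b = [row[:] for row in y]
        y_b[-1] = [v - lam * g for v, g in zip(y[-1], grad)]
        # Forward pass (Eq. 2) and mixing (Eq. 3), left to right.
        prev = 0
        for n in range(N):
            f = lm_logits(prev)
            y[n] = [gamma * fv + (1 - gamma) * bv for fv, bv in zip(f, y_b[n])]
            prev = max(range(V), key=lambda i: y[n][i])
    return [VOCAB[max(range(V), key=lambda i: row[i])] for row in y]

print(delorean_decode())  # → ['she', 'hit', 'the', 'fell']
```

With these toy settings the unconstrained greedy continuation ends in "rope", and the backward updates gradually pull the last position toward the constraint token "fell" while the forward mixing keeps the prefix fluent.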
The unsupervised baselines use a pre-trained GPT-2 model to generate candidates and rank them by coherence (Eq. 6).⁴

    O1: Ray drive his car on a steep mountain road.
    O2: Ray was fine but his car was totaled.
    Generated hypothesis: As he drives the car to the top of the mountain his car is hit by a car.

    O1: Peter was excited to go to the Sanders rally in New Hampshire.
    O2: He couldn't wait to vote for him.
    Generated hypothesis: He has a long history of supporting Bernie Sanders and was excited to see him in person.

Figure 3: Examples of generated hypotheses on abductive reasoning cases. Given observations O1 and O2, DELOREAN generates a hypothesis explaining the observations.

4.3 Results

Automatic Evaluation  We report the same metrics as Bhagavatula et al. (2019): BLEU-4 (Papineni et al., 2002), ROUGE-L (Lin, 2004) and BERTSCORE (Zhang et al., 2019) (with the bert-base-uncased model). The results in Table 1 show that DELOREAN performs best among the unsupervised systems across all metrics. We also note that our ranking step improves both the performance of our model and that of the zero-shot baselines.

Human Evaluation  We conduct two sets of human evaluations on 100 test examples using crowdworkers from Amazon Mechanical Turk. In the scoring setting, presented in Table 2, workers were presented a pair of observations (X and Z) and a generated hypothesis Y, and asked to rate the coherence of the hypothesis with respect to the observation X (X-Y), the observation Z (Y-Z), and both (X-Y-Z), on a 4-point Likert scale. In the pairwise comparison setting, presented in Table 3, workers were presented the outputs from a pair of systems (DELOREAN and baseline) and asked to choose the better output in terms of the same coherence criteria. Each example was labeled by 3 workers.⁵

In both evaluation setups, our method substantially outperforms the unsupervised baselines, achieving a relative improvement of 36%−215% with respect to Y-Z coherence. Our method also outperforms the supervised methods with respect to X-Y coherence (Table 2), and achieves competitive performance in the pairwise comparison (Table 3).

    Our model       Neutral         Comparator
    DELOREAN   21%    43%    36%    Sup
    DELOREAN   25%    44%    31%    +COMET-Emb
    DELOREAN   23%    62%    15%    Zero-ShotX-Ranked
    DELOREAN   27%    50%    23%    Zero-ShotZX-Ranked
    DELOREAN    3%    11%    86%    Human

Table 3: Human pairwise comparison results on the test set of ART, between DELOREAN and each of the baselines, by jointly considering all 3 criteria from Table 2. "Neutral" means "equally good/bad".

⁴ We tried ablating the ranking component from our method in preliminary experiments, and found that ranking is essential to obtaining good performance. By adding ranking to our baselines, we assess the contribution of our decoding strategy.
⁵ The average inter-rater agreement measured by Fleiss' κ = 0.44 ("moderate agreement") (Fleiss, 1971).
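The abductive constraint (Eq. 5) and the ranking score (Eq. 6) can be sketched directly. Here `toy_lm_prob` and `toy_coherence` are illustrative stand-ins for the GPT-2 token likelihood P_LM and the BERT NSP coherence score c(·, ·), not the actual models:

```python
import math

# Sketch of the abductive loss (Eq. 5) and ranking score (Eq. 6) with
# toy stand-ins for the LM likelihood and the BERT NSP scorer.

def toy_lm_prob(token, prefix):
    """Stand-in P_LM(token | prefix): higher when the token fits the prefix."""
    return 0.5 if token in prefix else 0.1

def abductive_loss(x_tokens, y_tokens, z_tokens):
    """Eq. 5: cross-entropy of generating Z given X and (soft) Y."""
    loss = 0.0
    for n, z in enumerate(z_tokens):
        prefix = x_tokens + y_tokens + z_tokens[:n]
        loss -= math.log(toy_lm_prob(z, prefix))
    return loss

def toy_coherence(a, b):
    """Stand-in for c(A, B) = BERT_NSP(A, B): word overlap in [0, 1]."""
    a_set, b_set = set(a), set(b)
    return len(a_set & b_set) / max(len(a_set | b_set), 1)

def ranking_score(x, y, z):
    """Eq. 6: c(XY, Z) + c(X, YZ)."""
    return toy_coherence(x + y, z) + toy_coherence(x, y + z)

x = ["ray", "hung", "a", "tire"]
z = ["the", "tire", "fell"]
cands = [["it", "was", "sunny"], ["she", "hit", "the", "rope"]]
print(max(cands, key=lambda y: ranking_score(x, y, z)))  # → ['she', 'hit', 'the', 'rope']
```

A hypothesis whose tokens make the future observation Z likely gets a lower loss and a higher ranking score, mirroring how the real loss and ranker favor explanations that bridge X and Z.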
Again, the ranking component contributes to increasing performance for the zero-shot baselines. Finally, the large performance gap between the methods and human-written explanations stresses the difficulty of this reasoning task and warrants future research.

Qualitative Analysis  Figure 3 presents two example outputs produced by DELOREAN. We can see our approach generates reasonable hypotheses by taking into account both the past and future contexts. For instance, in the first example, the future observation (O2) "car was totaled" indicates that Ray had a car accident, which is correctly captured in the generated hypothesis "car is hit by a car".

    Model                  BLEU-4  ROUGE-L  BERT
    Supervised + Discriminative
    Sup+Disc                75.71    72.72  62.39
    Unsupervised + Discriminative
    Recon+CF                75.92    70.93  62.49
    Unsupervised
    FT                       4.06    24.09  62.55
    FT+CF                    4.02    24.35  62.63
    Pretrained-only
    Zero-Shots1s′2           1.74    21.41  59.31

Table 4: Automatic evaluation results on the counterfactual task.

Figure 4: The weighted harmonic mean of coherence and min-edit, Hβ = (1+β²)·coherence·min_edit / (β²·coherence + min_edit), as a function of the scaling factor β. Low β values assign more weight to coherence, and high β values emphasize more on min-edit. [Curves shown for DELOREAN, RECON, ZS_ranked, ZS, FT, and FT+CF.]

Coherence — Human Judges Preferred

    Our model       Neutral         Comparator
    DELOREAN   25%    58%    17%    Sup+Disc
    DELOREAN   23%    70%     7%    Recon+CF
    DELOREAN   22%    48%    30%    FT
    DELOREAN   18%    60%    22%    Zero-Shots1s′2
    DELOREAN   27%    42%    31%    Zero-Shots1s′2-Ranked
    DELOREAN   10%    29%    61%    Human

Min-Edits — Human Judges Preferred

    Our model       Neutral         Comparator
    DELOREAN    4%    17%    79%    Sup+Disc
    DELOREAN    1%    14%    85%    Recon+CF
    DELOREAN   21%    76%     3%    FT
    DELOREAN   28%    71%     1%    Zero-Shots1s′2
    DELOREAN   37%    56%     7%    Zero-Shots1s′2-Ranked
    DELOREAN    8%    22%    70%    Human

Table 5: Human pairwise comparison results on the counterfactual task, between our best model and each baseline with respect to coherence and min-edits.

5 Task 2: Counterfactual Reasoning

Given an original story ending Z of story context X_ori, and a counterfactual condition X that changes X_ori to invalidate Z (see Fig. 1), the task is to generate a new story ending Y that minimally edits the original ending Z to regain coherence with the counterfactual condition X (Qin et al., 2019a).

5.1 Task Setup

Constraints  The constraint we enforce is that Y is close to Z (i.e., minimal edits). We impose this constraint by minimizing their KL divergence:

    L(X, Ỹ, Z) := KL(Z ‖ softmax(Ỹ/τ)),    (7)

where, with a slight abuse of notation, Z is the one-hot distribution of the tokens in the original ending. That is, we encourage the generated logits to recover the original ending.

Ranking  We rank the candidates based on both their coherence with the context, as well as the internal coherence between the multiple sentences of each candidate (a rewritten ending, consisting of 3 sentences). More concretely, given a candidate Y, we compute the aggregated coherence score:

    ranking_score(Y) = c(X, Y) + Σ_{s=1}^{S−1} c(Y[s], Y[s+1]),    (8)

where each candidate has S sentences (here, S = 3) and Y[s] denotes the s-th sentence.
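Because Z is a one-hot distribution per position, the KL divergence of Eq. 7 reduces, at each position n, to the negative log-probability that softmax(ỹn/τ) assigns to the original token zn. A minimal sketch (toy vocabulary and logit values assumed):

```python
import math

# Sketch of the minimal-edit constraint (Eq. 7): KL(Z || softmax(Y~/tau)).
# With one-hot Z, the per-position KL is just -log softmax(y_n / tau)[z_n].
# The logits and token ids below are illustrative, not from the paper.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def minimal_edit_loss(y_logits, z_ids, tau=1.0):
    """Eq. 7 summed over positions: KL(onehot(z_n) || softmax(y_n / tau))."""
    loss = 0.0
    for logits, z in zip(y_logits, z_ids):
        probs = softmax([v / tau for v in logits])
        loss -= math.log(probs[z])
    return loss

# Logits that already favor the original ending's tokens incur a small loss;
# logits that favor different tokens incur a large one.
close = [[3.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
far = [[0.0, 3.0, 0.0], [3.0, 0.0, 0.0]]
print(minimal_edit_loss(close, [0, 1]) < minimal_edit_loss(far, [0, 1]))  # → True
```

Minimizing this loss in the backward pass pulls the generated logits toward the original ending, while the forward pass keeps the rewritten ending consistent with the counterfactual condition.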
    Story context: Tara wanted to buy a new shirt for her upcoming school formal. She went to the mall with her mom.
    Counterfactual condition: She knew of a cool place online that did custom fits really cheaply, and ordered from there.
    Original ending: They browsed shirts from a variety of stores. Tara picked out a floral patterned shirt that she liked best. Tara looked forward to wearing it.
    Rewritten ending: They sent her a shirt that fit her perfectly. Tara was so excited to wear it. She looked forward to wearing it.

    Story context: Shane and John were best friends at school. Shane was caught stealing and got suspended from school.
    Original ending: John was not allowed to be friends with Shane anymore. This bothered John greatly but his mom explained the reasons. She explained that Shane was a bad influence on John.
    Rewritten ending: John was a good student and was always looking for ways to help others. They were both very kind and caring people. Shane was a member of the Boy Scouts of America.

Figure 5: Examples of generated story endings on counterfactual reasoning cases. Given a story context, a counterfactual condition, and an original ending, DELOREAN generates a rewritten ending which is coherent with the counterfactual condition and is similar to the original ending.
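The weighted harmonic mean Hβ defined in Figure 4's caption, which trades off the two human scores, is a one-line function. The example values 1.8 and 1.25 echo the coherence and min-edit scores discussed in the results, but the call itself is illustrative:

```python
# Weighted harmonic mean H_beta of coherence and min-edit, as defined in
# Figure 4. Low beta weights coherence; high beta weights min-edit.

def h_beta(coherence, min_edit, beta):
    """H_beta = (1 + beta^2) * coherence * min_edit / (beta^2 * coherence + min_edit)."""
    return (1 + beta ** 2) * coherence * min_edit / (beta ** 2 * coherence + min_edit)

# beta = 0 recovers the coherence score exactly; large beta approaches min-edit.
print(h_beta(1.8, 1.25, 0.0))  # → 1.8
```

Sweeping β therefore traces a curve from pure coherence to pure minimal-edit performance, which is how the models are compared in Figure 4.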
Hyperparameters  We largely follow the same setting as in the abductive reasoning task, but tune hyperparameters on the TIMETRAVEL development set. Deviations from these settings are outlined in the appendix.

5.2 Experimental Setup

Baselines  We compare our method with baselines from Qin et al. (2019a). The zero-shot baseline uses the pre-trained GPT-2 model to generate Y as a continuation to the counterfactual condition X. It is the most apt comparison to our method, which also doesn't require additional supervision. We also experiment with two baselines that fine-tune GPT-2 on the original story X_ori Z to fit the model to the story domain, either with an LM objective (FT) or a tailored conditional objective that encourages minimal edits of Z (Recon+CF).⁶ Finally, we report the performance of a supervised baseline (Sup), in which GPT-2 is fine-tuned to produce the gold Y from X_ori Z and X.

5.3 Results

Automatic Evaluation  Following Qin et al. (2019a), we report BERTSCORE (Zhang et al., 2019), which was shown to best correlate with human judges' notion of counterfactual coherence, and BLEU-4 and ROUGE-L, which better measure minimum-edits. We find that the discriminative baselines achieve the highest degree of plot fidelity. Meanwhile, DELOREAN achieves the highest BERTSCORE for counterfactual coherence.

Human Evaluation  We repeat the human evaluation setup from Section 4.3. Presented with the original story, the counterfactual condition X, and the generated ending Y, workers were asked to judge (1) the coherence of Y with respect to X; and (2) to what extent the generated ending minimally edits the original ending.⁷ In order to judge both criteria, we report the weighted harmonic mean Hβ of these scores across a range of weights β (Figure 4).

Our results show that DELOREAN is the only model that maintains a consistent balance between coherence (1.66) and minimal edits (1.54). While the ranking-augmented zero-shot model produces the most coherent endings (coherence = 1.8), it deviates from the original ending. As β is increased (i.e., increasing importance of minimal edits), its weighted performance drops considerably, indicating it cannot generate new endings that follow the original plot of the story (min-edit = 1.25). Conversely, Recon+CF generates stories that are faithful to the original endings, but are far less coherent with the counterfactual condition (coherence = 1.23). Through human annotation, we found that Recon+CF copies the original ending word-for-word in 84% of cases.

The pairwise comparison results in Table 5 parallel these observations. DELOREAN significantly outperforms the discriminative approaches (Recon+CF and Sup+Disc) in coherence, while falling short of the zero-shot re-ranked baselines. In minimal edits, this pattern is flipped, with our approach outperforming zero-shot baselines considerably and losing to the discriminative baselines.

Qualitative Analysis  Figure 5 provides two example results for counterfactual story rewriting by DELOREAN. The approach successfully captures the causal relations between events and properly rewrites the endings with minimal edits. For instance, in the first example, given the counterfactual condition that "Tara ordered a shirt online" (as opposed to the original "went to mall"), the rewritten ending is about "sent shirt" to Tara (as opposed to the original "browsed from stores"). The last sentence of the original ending "She looked forward to wearing it" is correctly preserved as it is coherent with the counterfactual condition.

6 Related Work

Unsupervised text generation.  Unsupervised approaches are often applied to problems that copy information from a source text into decoded text. Unsupervised paraphrasing requires repeating this information (Miao et al., 2019; Bao et al., 2019), as does translation, but with a bilingual transformation (Artetxe et al., 2017; Lample et al., 2018). In summarization there is an additional task to select a subset of the original text (Baziotis et al., 2019; Schumann et al., 2020; West et al., 2019). In cases where information is mostly copied from the original, auto-encoding objectives can ensure the correct information is captured (Bao et al., 2019; Baziotis et al., 2019; Artetxe et al., 2017). This work tackles problems where generation is more open-ended. Rather than reproducing information from the prompt, generations should agree with and expand on it, making autoencoding less applicable.

Controllable language generation.  Earlier approaches for controllable generation involved preserving the content of text while changing it along discrete dimensions, such as theme, sentiment, or style (Koncel-Kedziorski et al., 2016; Hu et al., 2017; Ficler and Goldberg, 2017; Shen et al., 2017; Lample et al., 2019). Recent works such as Grover (Zellers et al., 2019) and the CTRL model (Keskar et al., 2019) used these ideas to augment transformer language models that can condition on structured metadata such as source, domain, etc. The Plug & Play model (PPLM; Dathathri et al., 2019) controls topic and sentiment in an approach similar to ours that involves forward and backward passes to update token distributions. However, PPLM relies on trained attribute discriminators for supervision, while our method is unsupervised. While these models are restricted to specific dimensions, often with pre-defined values, our model can adjust to any open-ended textual constraint. Perhaps the most similar work in that aspect is the "text infilling" models, which, however, are in a more narrow setting by filling only a relatively short text span (Devlin et al., 2018; Zhu et al., 2019; Donahue et al., 2020), and more restrictive due to the reliance on an extra right-to-left language model (Sun et al., 2017) or a pre-specified generation length (Zeldes et al., 2020, which is not publicly available).

Reasoning about narratives.  A prominent resource from recent years is the ROCStories corpus (Mostafazadeh et al., 2016b), consisting of 98K crowdsourced 5-sentence everyday life stories. It was used for the story cloze task whose goal was to predict the story ending from its first 4 sentences, but gained popularity and became the base of additional benchmarks (Rashkin et al., 2018). Additional related work includes "script knowledge", i.e. learning about prototypical series of events (Schank and Abelson, 1977; Chambers and Jurafsky, 2008; Pichotta and Mooney, 2014), temporal commonsense (Granroth-Wilding and Clark, 2016; Li et al., 2018), and modeling pre- and post-conditions of events (Roemmele et al., 2011; Sap et al., 2019; Bosselut et al., 2019). Qin et al. (2019b) studied conversation modeling that reads and connects the dots of events in related documents. Finally, a recent line of work explores counterfactual questions in reading comprehension (Huang et al., 2019; Tandon et al., 2019), but instantiates the problem of counterfactual reasoning as a multiple-choice task.

7 Conclusion

We presented DELOREAN, an unsupervised LM-based approach to generate text conditioned on past context as well as future constraints, through forward and backward passes considering each condition. We demonstrated its effectiveness for abductive and counterfactual reasoning, on which it performed substantially better than unsupervised baselines. Our method is general and can be easily adapted for other generative reasoning tasks.

⁶ See Qin et al. (2019a) for more details.
⁷ Fair inter-rater agreement with Fleiss' κ = 0.34.
Acknowledgements

We thank the anonymous reviewers and colleagues at UW NLP and AI2 for many helpful comments. This research was supported in part by DARPA CwC through ARO (W911NF15-1-0543), DARPA MCS program through NIWC Pacific (N66001-19-2-4031), and Allen Institute for AI.

References

Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. 2019. COMET: Commonsense transformers for automatic knowledge graph construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4762–4779, Florence, Italy. Association for Computational Linguistics.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.

Nathanael Chambers and Dan Jurafsky. 2008. Unsupervised learning of narrative event chains. In Proceedings of ACL-08: HLT, pages 789–797.

Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2019. Plug and play language models: A simple approach to controlled text generation. arXiv preprint arXiv:1912.02164.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Chris Donahue, Mina Lee, and Percy Liang. 2020. Enabling language models to fill in the blanks. In ACL.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In ACL.

Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Cosmos QA: Machine reading comprehension with contextual commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2391–2401.

Steve D. Isard. 1974. What would you have done if...? Theoretical Linguistics, 1(1-3):233–256.

Philip Nicholas Johnson-Laird. 2006. How We Reason. Oxford University Press, USA.

Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.

Rik Koncel-Kedziorski, Ioannis Konstas, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2016. A theme-rewriting approach for generating algebra word problems. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1617–1628.

Sarit Kraus, Daniel Lehmann, and Menachem Magidor. 1990. Nonmonotonic reasoning, preferential models and cumulative logics. Artificial Intelligence, 44(1-2):167–207.

Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018. Phrase-based & neural unsupervised machine translation. arXiv preprint arXiv:1804.07755.

Guillaume Lample, Sandeep Subramanian, Eric Smith, Ludovic Denoyer, Marc'Aurelio Ranzato, and Y-Lan Boureau. 2019. Multiple-attribute text rewriting. In ICLR.

Carolin Lawrence and Stefan Riezler. 2018. Improving a neural semantic parser by counterfactual learning from human bandit feedback. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1820–1830.

Carolin Lawrence, Artem Sokolov, and Stefan Riezler. 2017. Counterfactual learning from bandit feedback under deterministic logging: A case study in statistical machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2566–2576.

Zhongyang Li, Xiao Ding, and Ting Liu. 2018. Constructing narrative event evolutionary graph for script event prediction. In IJCAI.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out.

Hugo Mercier and Dan Sperber. 2017. The Enigma of Reason. Harvard University Press.

Ning Miao, Hao Zhou, Lili Mou, Rui Yan, and Lei Li. 2019. CGMH: Constrained sentence generation by Metropolis-Hastings sampling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6834–6842.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016a. A corpus and evaluation framework for deeper understanding of commonsense stories. arXiv preprint arXiv:1604.01696.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James F. Allen. 2016b. A corpus and cloze evaluation for deeper understanding of commonsense stories. In HLT-NAACL.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-

Charles Sanders Peirce. 1960. Collected Papers of Charles Sanders Peirce, volume 2. Harvard University Press.

Karl Pichotta and Raymond Mooney. 2014. Statistical script learning with multi-argument events. In EACL, pages 220–229.

Lianhui Qin, Antoine Bosselut, Ari Holtzman, Chandra Bhagavatula, Elizabeth Clark, and Yejin Choi. 2019a. Counterfactual story reasoning and generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5046–5056.

Lianhui Qin, Michel Galley, Chris Brockett, Xiaodong Liu, Xiang Gao, Bill Dolan, Yejin Choi, and Jianfeng Gao. 2019b. Conversing by reading: Contentful neural conversation with on-demand machine reading. In ACL, pages 5427–5436.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019a. Language models are unsupervised multitask learners.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019b. Language models are unsupervised multitask learners. OpenAI Blog, 1:8.

Hannah Rashkin, Antoine Bosselut, Maarten Sap, Kevin Knight, and Yejin Choi. 2018. Modeling naive psychology of characters in simple commonsense stories. arXiv preprint arXiv:1805.06533.

Raymond Reiter. 1988. Nonmonotonic reasoning. In Exploring Artificial Intelligence, pages 439–481. Elsevier.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series.

Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A. Smith, and Yejin Choi. 2019. ATOMIC: An atlas of machine commonsense for if-then reasoning. In AAAI.

Roger C. Schank and Robert P. Abelson. 1977. Scripts, Plans, Goals and Understanding: An Inquiry into Human Knowledge Structures.

Raphael Schumann, Lili Mou, Yao Lu, Olga Vechtomova, and Katja Markert. 2020. Discrete optimization for unsupervised sentence summarization with word-level extraction. arXiv preprint
Jing Zhu. 2002. BLEU: a method for automatic eval- arXiv:2005.01791.
uation of machine translation. In ACL, pages 311–
318. Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi
Jaakkola. 2017. Style transfer from non-parallel text
Judea Pearl and Dana Mackenzie. 2018. The book of by cross-alignment. In Advances in neural informa-
why: the new science of cause and effect. Basic tion processing systems, pages 6830–6841.
Books.
Youngseo Son, Anneke Buffone, Joe Raso, Allegra 2019. Bottlesum: Unsupervised and self-supervised
Larche, Anthony Janocko, Kevin Zembroski, H An- sentence summarization using the information bot-
drew Schwartz, and Lyle Ungar. 2017. Recognizing tleneck principle. In Proceedings of the 2019 Con-
counterfactual thinking in social media texts. In Pro- ference on Empirical Methods in Natural Language
ceedings of the 55th Annual Meeting of the Associa- Processing and the 9th International Joint Confer-
tion for Computational Linguistics (Volume 2: Short ence on Natural Language Processing (EMNLP-
Papers), pages 654–658. IJCNLP), pages 3743–3752.
William Starr. 2019. Counterfactuals. In Edward N. Yoel Zeldes, Dan Padnos, and Barak Peleg. 2020.
Zalta, editor, The Stanford Encyclopedia of Philos- Haim-1.5 - the next generation.
ophy, fall 2019 edition. Metaphysics Research Lab,
Stanford University. Rowan Zellers, Ari Holtzman, Hannah Rashkin,
Yonatan Bisk, Ali Farhadi, Franziska Roesner, and
Qing Sun, Stefan Lee, and Dhruv Batra. 2017. Bidirec- Yejin Choi. 2019. Defending against neural fake
tional beam search: Forward-backward inference in news. In Advances in Neural Information Process-
neural sequence models for fill-in-the-blank image ing Systems, pages 9051–9062.
captioning. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q.
6961–6969. Weinberger, and Yoav Artzi. 2019. BERTScore:
Evaluating text generation with BERT. CoRR,
Niket Tandon, Bhavana Dalvi Mishra, Keisuke Sak- abs/1904.09675.
aguchi, Antoine Bosselut, and Peter Clark. 2019.
Wiqa: A dataset for” what if...” reasoning over pro- Wanrong Zhu, Zhiting Hu, and Eric Xing. 2019. Text
cedural text. In EMNLP. infilling. arXiv preprint arXiv:1901.00158.
Peter West, Ari Holtzman, Jan Buys, and Yejin Choi.