MERIt - Meta-Path Guided Contrastive Learning For Logical Reasoning
Abstract

Previous studies either employ graph-based models to incorporate prior knowledge about logical relations, or introduce symbolic logic into neural models through data augmentation. These methods, however, heavily depend on annotated training data, and thus suffer from overfitting and poor generalization due to dataset sparsity. To address these two problems, we propose MERIt, a MEta-path guided contrastive learning method for logical ReasonIng of text, which performs self-supervised pre-training on abundant unlabeled text data. Two novel strategies serve as indispensable components of our method. In particular, a strategy based on meta-paths is devised to discover the logical structure in natural texts, followed by a counterfactual data augmentation strategy to eliminate the information shortcut induced by pre-training. Experimental results on two challenging logical reasoning benchmarks, i.e., ReClor and LogiQA, demonstrate that our method outperforms the SOTA baselines with significant improvements.[1]

[*] Corresponding authors: Yangyang Guo and Liqiang Nie.
[1] Our code and pre-trained models are available at https://[Link]/SparkJiao/MERIt.

[Figure 1, text reproduced from the original figure box; the beginning of the context is truncated in this copy:]
Context: ... before individuals' goals (r3) cannot (1) emerge quickly from an economic recession.
Question: Which one of the following, if assumed, enables the economist's conclusion to be properly drawn?
Options:
A. People in (4) countries that put collective goals before individuals' goals (r4) lack (3) confidence in the economic policies of their countries.
B. A country's economic policies are the most significant factor determining whether that country's economy will experience a recession.
C. If the people in a country that puts individuals' goals first are willing to make new investments in their country's economy, their country will emerge quickly from an economic recession.
D. No new investment occurs in any country that does not emerge quickly from an economic recession.
Answer: A
Logic Structure: (4) -[r4]-> (3) -[r2]-> (2) -[r̄1]-> (1) <=> (4) -[r3]-> (1)

Figure 1: An instance of logical reasoning from the ReClor dataset. To infer the right answer, we should uncover the underlying logical structure, as shown at the bottom. (x) represents a logical variable (e.g., an entity or phrase) and r_j denotes the relation (e.g., a predicate) between two logical variables. r̄_j is the passive relation of r_j.

1 Introduction

Logical reasoning has long been recognized as a key critical thinking ability of human beings. Only recently have pioneering researchers crystallized this task for the NLP community and built several challenging public benchmarks, such as ReClor (Yu et al., 2020) and LogiQA (Liu et al., 2020). Logical reasoning[2] requires correctly inferring the semantic relations among the constituents of different sentences. A typical formulation of logical reasoning is illustrated in Figure 1, namely a real-world examination instance from ReClor. As can be seen, to find the correct answer to the given question, one needs to extract the logical structure residing in each option paired with the whole context, and justify its reasonableness.

[2] We use the term logical reasoning to refer to the task itself in the remainder of this paper.
As a matter of fact, logical reasoning research is still at an early stage, and existing studies are relatively rare in the literature. Some efforts have been devoted to designing specific model architectures or to integrating symbolic logic as hints attached to the potential logical structure. For instance, Huang et al. (2021) and Ouyang et al. (2021) first constructed a graph of different constituents and then performed implicit reasoning with graph neural networks (GNNs). Wang et al. (2021) proposed LReasoner, a unified context extension and data augmentation framework based on parsed logical expressions.

These approaches have achieved some progress on benchmark datasets. However, even though they are equipped with pre-trained language models, they still suffer from overfitting and poor generalization. We attribute these drawbacks to the difficulty of building a model that is aware of the logical relations beneath natural language, which manifests from two sides: 1) the high sparsity of the existing datasets, and 2) the objective of general pre-training, i.e., masked language modeling (Devlin et al., 2019), which deviates largely from that of logical reasoning. To tackle this issue, we aim to build a bridge between logical reasoning and self-supervised pre-training, and accordingly inherit the strong generalization power of pre-trained language models.

Our proposed method is inspired by recent progress in contrastive learning based pre-training. It mainly consists of two novel components: meta-path guided data construction and counterfactual data augmentation. Both components are leveraged to perform automatic instance composition from an unlabeled corpus (e.g., Wikipedia) for contrastive learning. Regarding the first component, we propose to employ the meta-path to define a symbolic form of logical structure. The intuition behind this is that a logical structure can be expressed as a reasoning path composed of a series of relation triplets, and a meta-path inherently offers such a means of consistency (Liu et al., 2021). Specifically, given an arbitrary document and a pair of entities in it, we try to find a positive instance pair in the document according to the logical structure. The negative ones can then be generated by modifying the relations involved in the structure, which explicitly breaks the logical consistency. Nevertheless, contrastive learning often fails when models easily locate trivial solutions (Lai et al., 2021). In this context, the pre-trained language model may exclude the negative options through their conflicts with world knowledge. To eliminate this information shortcut, our second novel component devises a strong counterfactual data augmentation (Zeng et al., 2020b) strategy. By mixing in counterfactual data during pre-training, in which the positive instance pair also contradicts world knowledge, this component shows a further advantage in reasoning over logical relations.

We integrate this method with both ALBERT (Lan et al., 2020) and RoBERTa (Liu et al., 2019)[3] for further pre-training, and then fine-tune them on two downstream logical reasoning benchmarks, i.e., ReClor and LogiQA. The experimental results demonstrate that our method outperforms all the existing strong baselines, without any augmentation of the original training data. Besides, the ablation studies also show the effectiveness of the two essential strategies in our method.

[3] In this paper, we refer to ALBERT-xxlarge and RoBERTa-large as ALBERT and RoBERTa for simplicity, respectively.

The contributions of this paper are summarized as follows:

1. We propose MERIt, a MEta-path guided contrastive learning method for logical ReasonIng of text, to reduce the heavy reliance on annotated data. To the best of our knowledge, we are the first to explore self-supervised pre-training for logical reasoning.
2. We successfully employ the meta-path strategy to mine the potential logical structure in raw text. It is able to automatically generate negative candidates for contrastive learning via logical relation editing.
3. We propose a simple yet effective counterfactual data augmentation method to eliminate the information shortcut during pre-training.
4. We evaluate our method on two logical reasoning tasks, LogiQA and ReClor. The experimental results show that our method achieves new state-of-the-art performance on both benchmark datasets.

2 Related Work

2.1 Self-Supervised Pre-training

With the success of language modeling based pre-training (Devlin et al., 2019; Brown et al., 2020), designing self-supervised pretext tasks to facilitate specific downstream tasks has been extensively studied. For example, Guu et al. (2020) proposed to train the retriever jointly with the encoder via retrieval-enhanced masked language modeling for open-domain question answering. Jiao et al. (2021) devised a retrieval-based pre-training approach to bridge the gap between language modeling and machine reading comprehension by enhancing the evidence extraction ability.
Deng et al. (2021) proposed ReasonBERT to facilitate complex reasoning over multiple and hybrid contexts. The model is pre-trained on automatically constructed query-evidence pairs, which involve different types of corpora and long-range relations.

In addition, contrastive learning (Hadsell et al., 2006) contributes a strong toolkit for implementing self-supervised pre-training. The key to contrastive learning is to build efficacious positive and negative counterparts. For example, Gao et al. (2021) leveraged Dropout (Srivastava et al., 2014) to build positive pairs from the same sentence while keeping the semantics untouched; other sentences in the same mini-batch serve as negative candidates to obtain better sentence embeddings. ERICA (Qin et al., 2021) is a knowledge-enhanced language model pre-trained through entity and relation discrimination, where the negative candidates are sampled from pre-defined dictionaries. Nevertheless, directly applying these contrastive learning approaches to logical reasoning is arduous. One possible reason is the absence of distant labels or strong assumptions for grouping naturally occurring text by its logical structure.

2.2 Logical Reasoning

Logical reasoning has attracted increasing research attention recently. Devising specific model architectures and integrating symbolic logic have proved to be two effective solutions. For example, Huang et al. (2021) and Ouyang et al. (2021) proposed to extract the basic units for logical reasoning, e.g., elementary discourse or fact units, and then employed GNNs to model possible relationships. The graph structure of constituents can be viewed as a form of prior knowledge pertaining to logical relations. Differently, Betz (2020) and Clark et al. (2020) used synthetically generated datasets to show that the Transformer (Vaswani et al., 2017) or a pre-trained GPT-2 is able to perform complex reasoning, motivating later researchers to introduce symbolic rules into neural models. For example, Wang et al. (2021) developed a context extension and data augmentation framework based on extracted logical expressions, and superior performance over its contenders was observed on the ReClor dataset.

In this paper, we propose a self-supervised contrastive learning approach to enhance the logical reasoning ability of neural models. Orthogonal to existing methods, our approach is endowed with two intriguing merits: 1) it shows a strong advantage in utilizing unlabeled text data, and 2) symbolic logic is seamlessly introduced into neural models via the guidance of meta-paths for automatic data construction.

3 Preliminary

3.1 Contrastive Learning

Contrastive Learning (CL) aims to learn recognizable representations by pulling semantically similar examples close and pushing apart dissimilar ones (Hadsell et al., 2006). Given an instance x, a semantically similar example x^+, and a set of dissimilar examples X^- with respect to x, the objective of CL can be formulated as:

    L_{CL} = L(x, x^+, X^-) = -\log \frac{\exp f(x, x^+)}{\sum_{x' \in X^- \cup \{x^+\}} \exp f(x, x')},    (1)

where f is the model to be optimized.
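To make Equation 1 concrete, the following is a minimal PyTorch sketch of this objective for a batch of instances, each with one positive and several negative candidates. The function name and the tensor shapes are illustrative assumptions, not part of the original paper; the scoring function that produces pos_score and neg_scores stands in for f.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(pos_score: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """Equation 1: -log exp(f(x, x+)) / sum_{x' in X- U {x+}} exp(f(x, x')).

    pos_score:  shape (batch,)        -- f(x, x+) for each instance
    neg_scores: shape (batch, n_neg)  -- f(x, x') for each dissimilar example
    """
    # Place the positive score at index 0, so the target label is 0 everywhere.
    logits = torch.cat([pos_score.unsqueeze(1), neg_scores], dim=1)
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    # Cross-entropy over the candidates is exactly the softmax form of Eq. 1.
    return F.cross_entropy(logits, targets)

# Toy usage with random scores (2 instances, 4 negatives each).
if __name__ == "__main__":
    pos = torch.randn(2)
    neg = torch.randn(2, 4)
    print(contrastive_loss(pos, neg).item())
```

Writing the loss as a cross-entropy over concatenated scores keeps it numerically stable and batches naturally over any number of negatives.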
3.2 Symbolic Logical Reasoning

As shown in Figure 1, given a context containing a series of logical variables {v_1, v_2, ..., v_n} and the relations between them, the objective of logical reasoning is to judge whether a triplet ⟨v_i, r_{i,j}, v_j⟩ expressed in language, where r_{i,j} is the relation between v_i and v_j, can be inferred from the context through a reasoning path:

    ⟨v_i, r_{i,j}, v_j⟩ ← (v_i \xrightarrow{r_{i,i+1}} v_{i+1} \cdots \xrightarrow{r_{j-1,j}} v_j).    (2)

The equation is also referred to as a symbolic logic rule (Clark et al., 2020; Liu et al., 2021).

3.3 Meta-Path

Given an entity-level knowledge graph, where the nodes refer to entities and the edges are the relations among them, the meta-path connecting two target entities ⟨e_i, e_j⟩ can be given as

    e_i \xrightarrow{r_{i,i+1}} e_{i+1} \xrightarrow{r_{i+1,i+2}} \cdots e_{j-1} \xrightarrow{r_{j-1,j}} e_j,    (3)

where r_{i,j} denotes the relation between entities e_i and e_j. The meta-path in an entity-level knowledge graph is often employed as a particular data structure expressing the relation between two indirectly connected entities (Zeng et al., 2020a; Xu et al., 2021).
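As a concrete illustration of Equations 2 and 3, a reasoning path can be represented simply as an ordered list of relation triplets whose endpoints chain together. The sketch below uses hypothetical entity and relation names and is not taken from the paper's implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass(frozen=True)
class Triplet:
    head: str      # e_i
    relation: str  # r_{i,j}
    tail: str      # e_j

def is_meta_path(path: List[Triplet]) -> bool:
    """A meta-path (Eq. 3) is a chain of triplets where each tail is the next head."""
    return all(path[k].tail == path[k + 1].head for k in range(len(path) - 1))

def endpoints(path: List[Triplet]) -> Optional[Tuple[str, str]]:
    """Return the indirectly connected entity pair <e_i, e_j> implied by the path (Eq. 2)."""
    if not path or not is_meta_path(path):
        return None
    return path[0].head, path[-1].tail

# Hypothetical example: e1 -r12-> e2 -r23-> e3 implies an indirect relation <e1, e3>.
path = [Triplet("e1", "r12", "e2"), Triplet("e2", "r23", "e3")]
print(endpoints(path))  # ('e1', 'e3')
```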
[Figure 2, panel text abridged. The running example is a Wikipedia passage about the film "Mirror Mask" (e1), its director McKean (e2), writer Neil Gaiman (e3), producer Jim Henson Studios (e4), and actresses Stephanie Leonidas (e5) and Gina McKee (e6). (a) Entity-level graph with intra-sentence and external relations. (b) For the target entities ⟨e1, e5⟩, the answer candidates are A+ = {s3}, the meta-path is P = (e1, e2, e5), and the positive data pair is S = {s1, s5} ↔ s3. (c) Negative options and negative contexts (e.g., S- = {s1, s5-}) are produced by entity replacement using a relation provider z. (d) Counterfactual sentences replace Mirror Mask and Stephanie Leonidas with the placeholders [ENT A] and [ENT B]. (e) Objectives: option-oriented contrastive learning requires f(S, a) >> f(S, a-); context-oriented contrastive learning requires f(S, a) >> f(S-, a).]

Figure 2: The overall framework of our proposed method. (a) A document D from Wikipedia and the corresponding entity-level graph construction; the sentences in black are extracted as the context input for (b). (b) Given two target entities ⟨e_1, e_5⟩, the possible answers A^+ and the meta-path are first extracted. The context sentences S connecting the entities in the meta-path, together with the answers in A^+, are leveraged to yield positive instance pairs. (c) Given a sentence z with alternative relations, the relation modification for negative context sentence and option construction is implemented through entity replacement; the top operation produces negative options while the bottom one produces negative contexts. (d) Counterfactual sentences are generated by entity replacement to eliminate the information shortcut during pre-training. (e) The generated positive and negative samples are used for contrastive learning.
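Read together, panels (b)-(e) imply a simple record format for each automatically constructed pre-training instance. The container below is a plausible sketch of that format under our reading of the figure; the field names are our own and the paper does not prescribe them.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PretrainInstance:
    """One contrastive pre-training example assembled from a Wikipedia document."""
    context: List[str]                                             # S: sentences covering the meta-path
    option: str                                                    # a: logically consistent answer sentence
    negative_options: List[str] = field(default_factory=list)      # a^- from entity replacement
    negative_contexts: List[List[str]] = field(default_factory=list)  # S^- with one edited sentence

# A schematic (hypothetical) instance mirroring Figure 2:
example = PretrainInstance(
    context=["sentence s1 ...", "sentence s5 ..."],
    option="sentence s3 ...",
    negative_options=["s3 with a swapped entity pair ..."],
    negative_contexts=[["sentence s1 ...", "s5 with a swapped entity pair ..."]],
)
```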
4 Method

In this paper, we study the problem of logical reasoning on the task of multiple-choice question answering (MCQA). Specifically, given a passage P, a question Q, and a set of K options O = {O_1, ..., O_K}, the goal is to select the correct option O_y, where y ∈ [1, K]. To tackle this task, we devise a novel pre-training method equipped with contrastive learning, in which the abundant knowledge contained in large-scale Wikipedia documents is explored. We then transfer the learned knowledge to the downstream logical reasoning task.

4.1 From Logical Reasoning to Meta-Path

In MCQA for logical reasoning, both the given context (i.e., passage and question) and the options express certain relations between different logical variables (Figure 1). Going a step further, following Equation 2, the relation triplet contained in the correct option should be deducible from the given context through a reasoning path, while that in the wrong options should not. In other words, the context is logically consistent with the correct option only.

In light of this, the training instances for our contrastive learning based pre-training should take the form of a context-option pair, where the context consists of multiple sentences and expresses the relations between the included constituents, while the option illustrates the potential relations between parts of the constituents. Nevertheless, it is non-trivial to derive such instance pairs from a large-scale unlabeled corpus like Wikipedia, due to the redundant constituents, e.g., nouns and predicates. To address this, we propose to take the entities contained in unlabeled text as logical variables, so that Equation 2 can be transformed into:

    ⟨e_i, r_{i,j}, e_j⟩ ← (e_i \xrightarrow{r_{i,i+1}} e_{i+1} \cdots \xrightarrow{r_{j-1,j}} e_j).    (4)

As can be seen, the right part above is indeed a meta-path connecting ⟨e_i, e_j⟩ as formulated in Equation 3, indicating an indirect relation between ⟨e_i, e_j⟩ through intermediary entities and relations. In order to establish logical consistency conditioned on entities, we posit the assumption that, under the same context (in the same passage), the definite relation between a pair of entities can be inferred from the contextual indirect one, or at least does not logically contradict it. Taking the passage in Figure 2 as an example, it can be concluded from sentences s_1 and s_5 that the director McKean has cooperated with Stephanie Leonidas. Therefore, the logic is consistent between {s_1, s_5} and s_3. This can be viewed as a weaker constraint than the original one in Equation 2 for logical consistency, yet it can be further strengthened by constructing negative candidates that violate the logic.

Motivated by this, given an arbitrary document D = {s_1, ..., s_m}, where s_i is the i-th sentence, we first build an entity-level graph, denoted as G = (V, E), where V is the set of entities contained in D and E denotes the set of relations between entities. Notably, to comprehensively capture the relations among entities, we take into account both the external relations from a knowledge graph and the intra-sentence relations. As illustrated in Figure 2 (a), there is an intra-sentence relation between two entities if they are mentioned in a common sentence. Thereafter, we can derive the pre-training instance pairs according to the meta-paths extracted from the graph, as detailed in the following subsections.

4.2 Meta-Path Guided Positive Instance Construction

As defined in Equation 4, in the positive instances, the answer should contain a relation triplet that is logically consistent with the given context. Since we take the intra-sentence relationship into consideration, given a pair of entities contained in the document, we first collect the sentences mentioning both of them as the set of answer candidates. Accordingly, we then try to find a meta-path connecting the entity pair and hence derive the corresponding logically consistent context.

In particular, as shown in Figure 2 (b), given an entity pair ⟨e_i, e_j⟩, we denote the collected answer candidates as A^+, and then use Depth-First Search (Tarjan, 1972) to find a meta-path linking the two entities on G, following Equation 3. Thereafter, the context sentences S corresponding to the answer candidates in A^+ are derived by retrieving the sentences carrying the intra-sentence relations visited during the search. Finally, for each answer candidate a ∈ A^+, the pair (S, a) is treated as a positive context-answer pair to facilitate our contrastive learning. The details of the positive instance generation algorithm are described in Appendix A.

4.3 Negative Instance Generation

To obtain negative instances (i.e., negative context-option pairs) in which the option is not logically consistent with the context, the most straightforward way is to randomly sample sentences from different documents. However, this approach could lead to trivial solutions, e.g., simply checking whether the entities involved in each option are the same as those in the given context. In light of this, we resort to directly breaking the logical consistency of the positive instance pair by modifying the relations, rather than the entities, in the context or the option, to derive the negative instance pair.

In particular, given a positive instance pair (S, a), we devise two negative instance generation methods, the context-oriented and the option-oriented method, which generate negative pairs by modifying the relations involved in the context S and in the answer a of the positive pair, respectively. Considering that relations are difficult to extract, especially intra-sentence relations, we propose to implement this reversely via entity replacement. For the option-oriented method, suppose that ⟨e_i, e_j⟩ is the target entity pair used for retrieving the answer a. We first randomly sample a sentence z that contains at least one entity pair ⟨e_a, e_b⟩ different from ⟨e_i, e_j⟩ as the relation provider. We then obtain the negative option by replacing the entities e_a and e_b in z with e_i and e_j, respectively. The operation is equivalent to replacing the relation contained in a with that in z. Formally, we denote the operation as

    a^- = Relation_Replace(z → a).

Pertaining to the context-oriented negative instance generation method, we first randomly sample a sentence s_i ∈ S and then conduct the modification process as follows,

    s_i^- = Relation_Replace(z → s_i),

where the entity pair to be replaced in s_i should be contained in the meta-path corresponding to the target entity pair ⟨e_i, e_j⟩. Accordingly, the negative context can be written as S^- = S \ {s_i} ∪ {s_i^-}. Figure 2 (c) illustrates the above operations on both the answer and the context sentence.
4.4 Counterfactual Data Augmentation

According to Ko et al. (2020), Guo et al. (2019), Lai et al. (2021), and Guo et al. (2022), neural models are adept at finding trivial solutions through illusory statistical cues in datasets, which often leads to inferior generalization. This issue can also occur in our scenario: the correct answer comes from a natural sentence and describes a real-world fact, whereas the negative option is synthesized by entity replacement and may conflict with commonsense knowledge. As a result, the pre-trained language model tends to identify the correct option directly by judging its factuality rather than its logical consistency with the given context. For example, as shown in Figure 2 (d) (left), the language model deems a correct simply because the other, synthetic option a^- conflicts with world knowledge.

To overcome this problem, we develop a simple yet effective counterfactual data augmentation method to further improve the capability of logical reasoning (Zeng et al., 2020b). Specifically, given the entities P involved in the meta-path, we randomly select some entities from P and replace their occurrences in the context and the answer of the positive instance pair (S, a) with entities extracted from other documents. In this manner, the positive instance also contradicts world knowledge. Notably, considering that the positive and negative instance pairs should keep the same set of entities, we also conduct the same replacement for a^- or S^- if they mention the selected entities. As illustrated in Figure 2 (d) (right), a counterfactual instance can be generated by replacing Mirror Mask and Stephanie Leonidas in a and a^- with [ENT A] and [ENT B], where [ENT A] and [ENT B] are arbitrary entities. Ultimately, the key to inferring the correct answer lies in accurately inferring the logical relation between entities [ENT A] and [ENT B] implied in each context-option pair. We provide more cases of the constructed data and their corresponding counterfactual samples in Appendix D.

4.5 Contrastive Learning based Pre-training

As discussed in the previous subsections, there are two contrastive learning schemes: option-oriented CL and context-oriented CL. Let A^- be the set of all constructed negative options with respect to the correct option a. The option-oriented CL can be formulated as:

    L_{OCL} = L(S, a, A^-).    (5)

In addition, given C^- as the set of all generated negative contexts corresponding to S, the objective of context-oriented CL can be written as:

    L_{CCL} = L(a, S, C^-).    (6)

To avoid the catastrophic forgetting problem, we also add the MLM objective during pre-training, and the final loss is:

    L = L_{OCL} + L_{CCL} + L_{MLM}.    (7)

4.6 Fine-tuning

During the fine-tuning stage, to approach the task of MCQA, we adopt the following loss function:

    L_{QA} = -\log \frac{\exp f(P, Q, O_y)}{\sum_i \exp f(P, Q, O_i)},    (8)

where O_y is the ground-truth option for the question Q, given the passage P.

[Figure 3, schematic omitted: pre-training on Wikipedia documents yields f(θ, ω_0), which is adapted to downstream task data either by prompt-tuning, f(θ, ω_0, φ), or by fine-tuning, f(θ, ω_1).]
Figure 3: The overall training scheme of our method.

Figure 3 shows the overall training scheme of our method. f is the model to be optimized; θ, ω_0, ω_1, and φ are the parameters of different modules. During pre-training, we use a 2-layer MLP as the output layer. The parameters of the output layer are denoted as ω_0, and θ represents the pre-trained Transformer parameters. For the fine-tuning stage, we employ two schemes. For simple fine-tuning, we follow Devlin et al. (2019) and add another 2-layer MLP with randomly initialized parameters ω_1 on top of the pre-trained Transformer. In addition, to take full advantage of the knowledge acquired during pre-training, we can instead directly fine-tune the pre-trained output layer, optimizing both θ and ω_0. To address the discrepancy that the question is absent during pre-training, the prompt-tuning technique (Lester et al., 2021) is employed: some learnable embeddings with randomly initialized parameters φ are appended to the input to transform the question in downstream tasks into a declarative constraint.
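Equations 5-8 combine the generic contrastive loss from Equation 1 with masked language modeling and the downstream multiple-choice loss. The sketch below is a schematic assembly of those terms; the 2-layer scoring head, the helper names, and every tensor shape are assumptions made only for illustration and do not reproduce the released training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScoringHead(nn.Module):
    """2-layer MLP mapping a pooled encoder state to a scalar score f(.)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.Tanh(),
                                 nn.Linear(hidden_size, 1))

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        return self.mlp(pooled).squeeze(-1)

def _nce(pos: torch.Tensor, neg: torch.Tensor) -> torch.Tensor:
    """Equation 1 with the positive candidate placed at index 0."""
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)

def pretraining_loss(pos_score, neg_option_scores, neg_context_scores, mlm_loss):
    """Eqs. 5-7: L = L_OCL + L_CCL + L_MLM.

    pos_score:          (batch,)        f(S, a) for the positive pair
    neg_option_scores:  (batch, |A^-|)  f(S, a^-)  -> option-oriented CL (Eq. 5)
    neg_context_scores: (batch, |C^-|)  f(S^-, a)  -> context-oriented CL (Eq. 6)
    mlm_loss:           scalar masked language modeling loss
    """
    return _nce(pos_score, neg_option_scores) + _nce(pos_score, neg_context_scores) + mlm_loss

def qa_loss(option_scores: torch.Tensor, gold: torch.Tensor) -> torch.Tensor:
    """Eq. 8 for fine-tuning: softmax cross-entropy over the K option scores f(P, Q, O_i)."""
    return F.cross_entropy(option_scores, gold)

# Toy shapes: 2 instances, 3 negative options, 2 negative contexts, 4 answer options downstream.
if __name__ == "__main__":
    loss = pretraining_loss(torch.randn(2), torch.randn(2, 3), torch.randn(2, 2), torch.tensor(0.7))
    print(loss.item(), qa_loss(torch.randn(2, 4), torch.tensor([1, 3])).item())
```

Keeping the pre-training scorer and the downstream option scorer in the same softmax form is what allows the pre-trained output head (ω_0 above, ScoringHead here) to be reused directly when prompt-tuning.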
Model / Dataset                | ReClor Dev | ReClor Test | ReClor Test-E | ReClor Test-H | LogiQA Dev | LogiQA Test
RoBERTa                        | 62.6 | 55.6 | 75.5 | 40.0 | 35.0 | 35.3
DAGN                           | 65.2 | 58.2 | 76.1 | 44.1 | 35.5 | 38.7
DAGN (Aug)                     | 65.8 | 58.3 | 75.9 | 44.5 | 36.9 | 39.3
LReasoner (RoBERTa)‡           | 64.7 | 58.3 | 77.6 | 43.1 | —    | —
Focal Reasoner                 | 66.8 | 58.9 | 77.1 | 44.6 | 41.0 | 40.3
MERIt                          | 66.8 | 59.6 | 78.1 | 45.2 | 40.0 | 38.9
MERIt + LReasoner              | 67.4 | 60.4 | 78.5 | 46.2 | —    | —
MERIt + Prompt                 | 69.4 | 61.6 | 79.3 | 47.8 | 39.9 | 40.7
MERIt + Prompt + LReasoner     | 67.3 | 61.4 | 79.8 | 46.9 | —    | —
ALBERT                         | 69.1 | 66.5 | 76.7 | 58.4 | 38.9 | 37.6
MERIt (ALBERT)                 | 74.2 | 70.1 | 81.6 | 61.0 | 43.7 | 42.5
MERIt (ALBERT) + Prompt        | 74.7 | 70.5 | 82.5 | 61.1 | 46.1 | 41.7
max:
LReasoner (RoBERTa)            | 66.2 | 62.4 | 81.4 | 47.5 | 38.1 | 40.6
MERIt                          | 67.8 | 60.7 | 79.6 | 45.9 | 42.4 | 41.5
MERIt + Prompt                 | 70.2 | 62.6 | 80.5 | 48.5 | 39.5 | 42.4
LReasoner (ALBERT)             | 73.2 | 70.7 | 81.1 | 62.5 | 41.6 | 41.2
MERIt (ALBERT)                 | 73.2 | 71.1 | 83.6 | 61.3 | 43.9 | 45.3
MERIt (ALBERT) + Prompt        | 75.0 | 72.2 | 82.5 | 64.1 | 45.8 | 43.8

Table 1: Overall results on ReClor and LogiQA. We adopt accuracy as the evaluation metric, and all baselines are based on RoBERTa unless stated otherwise. For each model we repeated training five times with different random seeds and report the average results. ‡: results reproduced by ourselves. max: results of the model achieving the best accuracy on the test set.
Conclusion

In this paper, we present MERIt, a meta-path guided contrastive learning method that facilitates logical reasoning via self-supervised pre-training. MERIt is built upon a meta-path strategy for automatic data construction and a counterfactual data augmentation strategy that eliminates the information shortcut during pre-training. Evaluated on two logical reasoning benchmarks, our method obtains significant improvements over strong baselines that rely on task-specific model architectures or augmentation of the original dataset. As for future work, we plan to strengthen our method from both the data construction and the model architecture design angles. More challenging instances could be constructed if multiple meta-paths were considered at the same time. Besides, leveraging GNNs may bring better interpretability and generalization, since the graph structure can be integrated into both the pre-training and fine-tuning stages.

References

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pages 4171–4186. ACL.

Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence embeddings. In EMNLP. ACL.

Yangyang Guo, Zhiyong Cheng, Liqiang Nie, Yibing Liu, Yinglong Wang, and Mohan S. Kankanhalli. 2019. Quantifying and alleviating the language prior problem in visual question answering. In SIGIR, pages 75–84. ACM.

Yangyang Guo, Liqiang Nie, Zhiyong Cheng, Qi Tian, and Min Zhang. 2022. Loss re-scaling VQA: Revisiting the language prior problem from a class-imbalance view. TIP, 31:227–238.

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-augmented language model pre-training. CoRR, abs/2002.08909.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In ICLR.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. In EMNLP. ACL.

Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. 2020. LogiQA: A challenge dataset for machine reading comprehension with logical reasoning. In IJCAI, pages 3622–3628.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Yushan Liu, Marcel Hildebrandt, Mitchell Joblin, Martin Ringsquandl, Rime Raissouni, and Volker Tresp. 2021. Neural multi-hop reasoning with logical rules on biomedical knowledge graphs. In ESWC, volume 12731, pages 375–391. Springer.

Siru Ouyang, Zhuosheng Zhang, and Hai Zhao. 2021. Fact-driven logical reasoning. CoRR, abs/2105.10334.

Yujia Qin, Yankai Lin, Ryuichi Takanobu, Zhiyuan Liu, Peng Li, Heng Ji, Minlie Huang, Maosong Sun, and Jie Zhou. 2021. ERICA: Improving entity and relation understanding for pre-trained language models via contrastive learning. In ACL/IJCNLP, pages 3350–3363. ACL.

Jiezhong Qiu, Qibin Chen, Yuxiao Dong, Jing Zhang, Hongxia Yang, Ming Ding, Kuansan Wang, and Jie Tang. 2020. GCC: Graph contrastive coding for graph neural network pre-training. In KDD, pages 1150–1160. ACM.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.

Kai Sun, Dian Yu, Jianshu Chen, Dong Yu, Yejin Choi, and Claire Cardie. 2019. DREAM: A challenge dataset and models for dialogue-based reading comprehension. TACL, 7:217–231.

Siyuan Wang, Wanjun Zhong, Duyu Tang, Zhongyu Wei, Zhihao Fan, Daxin Jiang, Ming Zhou, and Nan Duan. 2021. Logic-driven context extension and data augmentation for logical reasoning of text. In NAACL-HLT, pages 5642–5650. ACL.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In EMNLP: System Demonstrations, pages 38–45. ACL.

Wang Xu, Kehai Chen, and Tiejun Zhao. 2021. Discriminative reasoning for document-level relation extraction. In Findings of ACL/IJCNLP, pages 1653–1663. ACL.

Yang You, Jing Li, Sashank J. Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. 2020. Large batch optimization for deep learning: Training BERT in 76 minutes. In ICLR.

Weihao Yu, Zihang Jiang, Yanfei Dong, and Jiashi Feng. 2020. ReClor: A reading comprehension dataset requiring logical reasoning. In ICLR.

Shuang Zeng, Yuting Wu, and Baobao Chang. 2021. SIRE: Separate intra- and inter-sentential reasoning for document-level relation extraction. In Findings of ACL/IJCNLP, pages 524–534. ACL.

Shuang Zeng, Runxin Xu, Baobao Chang, and Lei Li. 2020a. Double graph based reasoning for document-level relation extraction. In EMNLP, pages 1630–1640. ACL.

Xiangji Zeng, Yunliang Li, Yuchen Zhai, and Yin Zhang. 2020b. Counterfactual generator: A weakly-supervised method for named entity recognition. In EMNLP, pages 7270–7280. ACL.
Table 7: Hyper-parameters for fine-tuning on ReClor and LogiQA. ♣: Fine-Tuning. ♠: Prompt Tuning.
Figure 6: Two cases of the generated examples and their counterfactual counterparts. The target entities used for extracting the meta-path are colored in red.