Abstract

Integrating external knowledge into commonsense reasoning tasks has shown progress in resolving some, but not all, knowledge gaps in these tasks. For knowledge integration to yield peak performance, it is critical to select a knowledge graph (KG) that is well-aligned with the given task's objective. We present an approach to assess how well a candidate KG can correctly identify and accurately fill in gaps of reasoning for a task, which we call KG-to-task match. We show this KG-to-task match in 3 phases: knowledge-task identification, knowledge-task alignment, and knowledge-task integration. We also analyze our transformer-based KG-to-task models via commonsense probes to measure how much knowledge is captured in these models before and after KG integration. Empirically, we investigate KG matches for the SocialIQA (SIQA) (Sap et al., 2019b), Physical IQA (PIQA) (Bisk et al., 2020), and MCScript2.0 (Ostermann et al., 2019) datasets with 3 diverse KGs: ATOMIC (Sap et al., 2019a), ConceptNet (Speer et al., 2017), and an automatically constructed instructional KG based on WikiHow (Koupaee and Wang, 2018). With our methods we are able to demonstrate that ATOMIC, an event-inference focused KG, is the best match for SIQA and MCScript2.0, and that the taxonomic ConceptNet and WikiHow-based KGs are the best matches for PIQA across all 3 analysis phases. We verify our methods and findings with human evaluation.¹

¹ Code and commonsense probes: https://github.com/lbauer6/IdentifyAlignIntegrate

1 Introduction

Recently, several datasets (Sap et al., 2019b; Huang et al., 2019; Bhagavatula et al., 2020; Talmor et al., 2019b) have been released to tackle the challenge of commonsense reasoning. While deep pretrained language models perform well on these tasks, they have shortcomings when it comes to commonsense reasoning (Sap et al., 2019b; Rajani et al., 2019; Mitra et al., 2019). Thus, incorporating knowledge graph (KG) information into these models is an active area of research (Lin et al., 2019a; Sun et al., 2019; Mitra et al., 2019; Bosselut et al., 2019). However, when selecting a KG match for a task, it is often difficult to quantitatively assess what kind of knowledge is missing from these models and how much of the missing knowledge required for the task is available in a candidate KG. It is also critical to examine how easily transformer-based models can learn commonsense knowledge, to determine the benefits of integrating a KG.

We investigate how well a KG matches with a task objective, referred to as KG-to-task match. We use a 3-step process that examines knowledge identification, alignment, and integration. We utilize a modular pipeline approach to allow for interpretable results and easy replacement of new and different modules. Our approach reveals features such as: how often a KG identifies a knowledge gap in a question-answer pair (identification), whether a KG identifies the correct knowledge gap (alignment), and whether the inserted knowledge correctly fills the knowledge gap required for the task (integration). These steps are depicted in Fig. 1. We also compare the effects of knowledge content, structure, and shape.

[Figure 1: a question-answer pair passes through (1) extraction, which identifies a knowledge gap; (2) knowledge insertion into a knowledge-surrounded (KS) model; and (3) integration, which compares the answer-class distributions of the base model and the KS model.]

Figure 1: We illustrate our 3 phases of analysis: Knowledge-Task Extraction Analysis, Knowledge-Task Alignment Analysis, and Knowledge-Task Integration Analysis with Probing Analysis.

The results of this analysis are impacted by the model we use, and thus we also develop probes to examine how much commonsense knowledge LMs already know and how easy it is for them to learn. We evaluate our KG-to-task models in a QA probe setup to examine how much commonsense is learned with and without the matched KG. Our probes are automatically built from ATOMIC, en-
abling us to leverage existing knowledge sources as a probing base without relying on expensive collection methods. We also include an MLM probe setup to obtain zero-shot and fine-tuned results on probes for social relations, agent-patient assignment, and world knowledge.

We present detailed empirical results on three diverse datasets: the SocialIQA (SIQA) task (Sap et al., 2019b), which requires social knowledge; Physical IQA (PIQA) (Bisk et al., 2020), which requires physical knowledge; and MCScript2.0 (Ostermann et al., 2019), which requires commonsense script knowledge, not restricted to a particular domain. Since both SIQA and PIQA require a particular domain of commonsense knowledge, these tasks allow us to draw strong conclusions about KG integration, as knowledge must be well aligned with the tasks to yield performance gains. Analyzing MCScript2.0, on the other hand, allows us to understand how this analysis applies to a task where the best match is not obvious. We compare KG-to-task match with three diverse KGs: ATOMIC (Sap et al., 2019a), ConceptNet (Speer et al., 2017), and automatically extracted subgraphs from WikiHow. Each KG is tailored for a different commonsense domain: ATOMIC focuses on social commonsense, ConceptNet on taxonomic commonsense, and WikiHow on instruction-based commonsense. This allows us to see how different tasks require different types of commonsense knowledge.

To investigate KG-to-task match, we follow three phases: identify, align, and integrate. In our first phase, we examine knowledge gap identification by analyzing our extraction quantities. In our second phase, we examine alignment by utilizing a 'knowledge-surrounded' (KS) model, in which we replace task candidate answers with knowledge-surrounded answers. We found that ATOMIC is the best match for SIQA across both identification and alignment: 11% more ATOMIC data is extracted for question-answer knowledge gaps than ConceptNet data, with a 4.8% performance increase over BERT using our ATOMIC KS model. We use our third phase, integration, to investigate the classification change distributions from BERT to the KS model, finding that our model is more confident about correct classification changes, supporting the ATOMIC-SIQA match. Additionally, both ConceptNet and WikiHow graphs outperformed ATOMIC on PIQA: 8% more ConceptNet data is extracted than ATOMIC and a 17.4% performance increase is achieved with our ConceptNet KS model, whereas we get a 15.5% increase with our WikiHow KS model. Finally, we find that ATOMIC is the best match for MCScript2.0, with a 2.7% increase with our ATOMIC KS model.

We also perform human evaluation and show important connections between the analysis phases. We see that if our KS model shows improvement for high quality settings, our extraction step is a valid knowledge-gap identification metric between 74% and 89% of the time, depending on the dataset. We also show that our best alignment strategy for ATOMIC-SIQA fills knowledge gaps 66% of the time, outperforming the best alignment strategy for ConceptNet-SIQA, which supports our KS model
performance results. We find similar trends for PIQA alignment and also find that the amount of information available at inference time may affect alignment results for MCScript2.0. Human evaluation shows that 93% of ATOMIC-SIQA KS model prediction changes (with respect to the baseline) select the answer with the highest knowledge quality, verifying our integration phase as a quality metric.

Our commonsense QA probes before and after KG integration show that our KS model only considerably outperforms the BERT baseline on certain relational probes, indicating the type of knowledge gaps ATOMIC is better at resolving, e.g., relational knowledge such as feelings, reactions, etc. Overall, our methods not only illustrate the type of knowledge that current transformer-based models are missing to approach human-level commonsense reasoning but also how we can identify, align, and integrate knowledge between a KG and a task to find the best match to fill in these missing gaps of reasoning.

2 Related Work

Language Model Probes: Recent work in probe construction has examined neural model knowledge (Richardson and Sabharwal, 2019; Zhou et al., 2020b; Rogers et al., 2020; Lin et al., 2020). Talmor et al. (2019a) constructed eight tasks that evaluated LMs for operations such as comparison, conjunction, and composition. Zhou et al. (2020a) created logically equivalent probes to evaluate robustness to syntax on commonsense tasks. Kwon et al. (2019) proposed tests based on ConceptNet to measure what types of commonsense MLMs understand. Our work instead focuses on probing models for causal, social commonsense in both the MLM and QA setup before and after KG integration and fine-tuning, and automatically constructs probes from existing knowledge sources.

Commonsense Reasoning: Recent commonsense reasoning datasets (Bhagavatula et al., 2020; Zellers et al., 2018; Zhou et al., 2019; Sap et al., 2019b; Bisk et al., 2020; Lin et al., 2019b; Zellers et al., 2019; Ostermann et al., 2019) have motivated research in several domains of commonsense: abductive, grounded, temporal, social, and physical. Commonsense reasoning can be learned either by pre-training on KGs (Bosselut et al., 2019; Bosselut and Choi, 2019; Ye et al., 2019) or by integrating explicit knowledge (Chen et al., 2017; Mitra et al., 2018; Bauer et al., 2018; Lin et al., 2019a; Zhang et al., 2019; Xiong et al., 2019). We show how finding nuanced knowledge for successful commonsense reasoning can be quantitatively examined.

Commonsense Knowledge Analysis: Zhang et al. (2020) presented a categorization of essential knowledge for the Winograd Schema Challenge (Levesque et al., 2012) via human annotation to identify what knowledge was required for better commonsense reasoning. Ma et al. (2019) investigated how KG integration methods affected model performance on different tasks and found that the degree of domain overlap between the KG and the task plays a crucial role in performance. We further investigate this by measuring KG-to-task match across 3 automatic phases, considering different extraction methods, and probing models for knowledge before and after KG integration.

3 Tasks & Knowledge Graphs

3.1 Tasks

SIQA: The SocialIQA (SIQA) (Sap et al., 2019b) task focuses on social commonsense. Given a context and question, a model selects from 3 answers. SIQA contexts are based on ATOMIC (Sap et al., 2019a) events and SIQA question types are guided by ATOMIC inference dimensions. Thus, we expect ATOMIC to match SIQA requirements. For simplicity, we refer to the concatenation of context and question as the question throughout the paper.

PIQA: The PhysicalIQA (PIQA) (Bisk et al., 2020) task objective focuses on physical commonsense reasoning. Given a goal, a model selects from 2 candidate solutions. PIQA is derived from the instruction domain, and thus we expect instructional physical commonsense to benefit PIQA. For simplicity, we refer to the goal as the question.

MCScript2.0: MCScript2.0 (Ostermann et al., 2019) focuses on script events and participants, requiring commonsense knowledge, in particular script knowledge, to answer questions correctly. We specifically choose this dataset such that it does not have a strong preference for any of the KGs we investigate, to illustrate what our analysis may look like for an unpredictable result. For simplicity, we refer to the concatenation of context and question as the question throughout the paper.

3.2 Knowledge Sources

We show results across three knowledge graphs to illustrate differences in KG-to-task identification, alignment, and integration, and to show how BERT
responds differently to various types of knowledge.

ATOMIC: ATOMIC (Sap et al., 2019a) is an inferential knowledge atlas that focuses on if-then reasoning. This knowledge is structured as event-inference pairs, where each pair reflects one of 9 possible inference dimensions for different if-then reasoning relations (cause vs. effect, etc.). Knowledge is in the form of short, abstracted, free text. See the left of Fig 2 for examples.

ConceptNet: ConceptNet (Speer et al., 2017) is a taxonomic knowledge graph that connects natural language concepts with relation edges. While there are relation edges similar to ATOMIC inference dimensions, the knowledge is in the form of structured triples and generally tends to focus on relations between words or phrases. See the right of Fig 2 for examples.

WikiHow: We automatically extract subgraphs from WikiHow (Koupaee and Wang, 2018) to build our own instruction-based, domain-specific KG. Details are found in the appendix.

[Figure 2. Left (ATOMIC knowledge): Context: "Sydney walked past a homeless woman asking for change but did not have any money they could give to her. Sydney felt bad afterwards. How would you describe Sydney?" Correct Answer: "sympathetic". Shown are a pair, a path, and a subgraph linking events such as "PersonX begs for money" and "PersonX gives _ some money" to the inference "PersonX often felt sympathetic". Right (ConceptNet knowledge): Context: "Tracy brought the kids to their dentist appointment, but it was scheduled during the school day. What does Tracy need to do before this?" Correct Answer: "Keep the kids home from school until after their appointments". Shown are a triple and a subgraph with atLocation edges among "kids", "home", and "school".]

Figure 2: Examples of different knowledge shapes per KG given SIQA context and ground truth answer.

4 Phase 1: Identify

4.1 Setup

We identify knowledge using the following extraction methods for each KG. Our setup with all possible options is illustrated in Table 1. We will use Fig 2 as a running example throughout this section.

KG  Shape     Cond.  Filter  Sets   Pres.
AT  triple    QC     HQ      CS-1   KS+
CN  path      A      HR      CS-2   KS-
WH  subgraph  -      -       CS-3   -

Table 1: Variations for each knowledge setting. AT=ATOMIC, CN=ConceptNet, WH=WikiHow, QC=question-conditioned, A=unconditioned, HR=high recall, HQ=high quality, KS+/- = knowledge presence at inference time.

4.1.1 Knowledge Conditioning

Unconditioned (A) Answer-Knowledge:
ATOMIC: For each candidate answer, we extract a pool of top scoring knowledge using tf-idf between the answer and all ATOMIC event-inference pairs.
ConceptNet & WikiHow: For each candidate answer, we extract knowledge that links concepts in the answer to any concept in the KG, where concepts are tokens in the answer and nodes in the KG.
Example: Consider the SIQA context and ground-truth answer on the right side of Fig 2. Here, the A conditioning setup for ConceptNet would extract the triple [keep, Antonym, get rid].

Question-Conditioned (QC) Answer-Knowledge:
ATOMIC: We select a question-conditioned knowledge pool via the top scoring tf-idf match between the question & candidate answer and all ATOMIC event-inference pairs. We then select a pool of top scoring knowledge for each candidate answer using tf-idf between the candidate answer and the question-conditioned knowledge pool.
ConceptNet & WikiHow: For each candidate answer, we extract knowledge that links concepts in the question directly to concepts in the answer.
Example: All knowledge illustrated in Fig 2 is extracted using QC conditioning.

4.1.2 Knowledge Shape

Knowledge Pairs/Triples:
ATOMIC: We take the highest scoring knowledge pair determined by the conditioning step.
ConceptNet & WikiHow: We select a triple at random from the conditioning step.

Knowledge Paths:
ATOMIC: In the QC setup, for each data point, we extract a question-knowledge pool via top scoring
tf-idf match between the question and all ATOMIC event-inference pairs. If there exists a concept link between the question-knowledge pool and the answer-knowledge pool from the conditioning step, we link this knowledge as a path. In the A setup, we make the modification that our answer-knowledge pool can link to any pair in ATOMIC.
WikiHow: In the QC setup, we find a path from a word in the question, to another word in the question, to a word in the answer. In the A setup, we find a path from a word in the answer to any word it connects to in the KG, as a path through the KG.

Knowledge Subgraphs:
ATOMIC: We take a maximum of the 3 highest scoring knowledge triples determined by the conditioning step to create 1-hop subgraphs.
ConceptNet & WikiHow: From the conditioning knowledge pool, we add the subgraph with the highest number of edges, as we assume these to be the most informative. We only consider 1-hop edges and take the top 5.
Example: All three shape variations are illustrated in Fig 2, using QC conditioning.

4.1.3 Knowledge Filtering

High Quality/Low Recall (HQ): We constrain each answer candidate to keep its highest scoring unique knowledge such that no answer candidate shares knowledge, intending to ensure the relevance of the knowledge to that candidate alone.
Low Quality/High Recall (HR): Each candidate keeps its highest scoring knowledge regardless of knowledge sharing among candidates.

4.1.4 Data Subsets & Baseline Training

We split data into subsets depending on how many candidate answers extracted knowledge (CS-X) to evaluate knowledge impact on task performance fairly. For our main results, we use the split in which each answer has access to knowledge (CS-2 for PIQA and MCScript2.0, CS-3 for SIQA). Table 2 illustrates the percent of original data for each split. We compare KS model subset results against a BERT baseline trained and evaluated on the same subset simply without the added knowledge.

4.2 Analysis

We examine how often a KG identifies a potential knowledge gap between a question and an answer. This is illustrated on the far left in Fig. 1. Table 2 shows the percent of knowledge extracted for each QC-HQ subset. We use the QC-HQ setting to show how often our KG specifically identifies question-answer knowledge gaps. We use each KG's best aligned shape in this comparison (Section 5.2.2). For SIQA, we extract more ATOMIC data than for ConceptNet, and for PIQA, we extract more ConceptNet data than for ATOMIC. This illustrates that ATOMIC identifies more knowledge gaps for SIQA, and ConceptNet identifies more knowledge gaps for PIQA. For MCScript2.0, we see that the same total knowledge is extracted from both KGs, however more ATOMIC knowledge is extracted in the CS-2 setup, indicating better coverage.

Variation      ATOMIC  ConceptNet
SIQA:
QC-HQ CS-1     51%     39%
QC-HQ CS-2     24%     22%
QC-HQ CS-3     2%      6%
QC-HQ CS       77%     66%
PIQA:
QC-HQ CS-1     16%     24%
QC-HQ CS-2     8%      8%
QC-HQ CS       24%     32%
MCScript2.0:
QC-HQ CS-1     2%      36%
QC-HQ CS-2     88%     54%
QC-HQ CS       90%     90%

Table 2: % knowledge extracted for each subset wrt. original data size. Results shown for the best aligned KG shape in the QC-HQ setting: SIQA=CN triples, ATOMIC paths; PIQA=CN subgraphs, ATOMIC pairs; MCScript2.0=CN subgraphs, ATOMIC pairs.

5 Phase 2: Align

5.1 Setup

Baseline Model: We fine-tune BERT-base (Devlin et al., 2019) as our baseline on SIQA following Sap et al. (2019b), BERT-base on MCScript2.0, and BERT-large (Devlin et al., 2019) as our baseline on PIQA following Koupaee and Wang (2018). See original papers for hyperparameter settings.

Knowledge-Surrounded (KS) Model: We enhance task candidate answers with knowledge-surrounded answers, in which answer-specific knowledge is appended to each candidate answer. This knowledge is intended to explicitly add missing knowledge gaps to the answer. This model is used in the alignment and integration steps in Fig. 1. We encode input as follows. For a question qi, we modify each candidate answer, cij, by appending its respective knowledge kij. The sequence of tokens {[CLS] qi [SEP] cij kij [SEP]} is then
passed as input to BERT. Thus, each candidate answer is surrounded by knowledge that allows BERT to potentially fill reasoning gaps between the question and answer. The extraction variations for kij are described in Section 4.1.

5.2 Analysis

We investigate how well the extracted knowledge and the task are aligned by allowing the knowledge to fill in the question-answer knowledge gap and determining whether this improves performance. This is illustrated in the center of Fig. 1.

5.2.1 Extraction Variation Analysis

Table 3 illustrates performance for each KG across each of the different extractions, using the subset in which each candidate answer has access to knowledge. We compare question-conditioned, unconditioned, high recall, and high quality settings for the best aligned knowledge shape (ConceptNet triples, ATOMIC paths for SIQA; ConceptNet subgraphs, ATOMIC pairs, WikiHow paths for PIQA; ConceptNet subgraphs, ATOMIC pairs for MCScript2.0) across the three KGs. For each setting, we show results both for when knowledge is present during inference time (KS+) and when it is not (KS-).

Variation         Base.  KS+   KS-
SIQA AT. QC-HQ    38.1   42.9  40.5
SIQA AT. QC-HR    60.3   59.0  56.4
SIQA AT. A-HQ     60.4   60.0  59.2
SIQA AT. A-HR     61.8   60.8  61.0
SIQA CN QC-HQ     54.1   45.0  43.2
SIQA CN QC-HR     54.5   52.2  49.1
SIQA CN A-HQ      61.5   61.5  61.9
SIQA CN A-HR      61.2   60.0  59.7
PIQA AT. QC-HQ    54.6   49.3  50.0
PIQA AT. QC-HR    61.8   51.2  48.8
PIQA AT. A-HQ     50.6   61.2  64.9
PIQA AT. A-HR     70.5   64.9  65.4
PIQA CN QC-HQ     60.5   66.5  59.9
PIQA CN QC-HR     51.4   57.2  58.7
PIQA CN A-HQ      49.0   64.5  66.4
PIQA CN A-HR      70.5   54.5  51.9
PIQA WH QC-HQ     53.3   54.7  48.0
PIQA WH QC-HR     59.9   54.7  55.0
PIQA WH A-HQ      53.5   68.3  69.0
PIQA WH A-HR      67.5   51.4  49.3
MC AT. QC-HQ      80.3   83.0  79.6
MC AT. QC-HR      82.4   80.8  79.4
MC AT. A-HQ       82.5   80.9  79.6
MC AT. A-HR       81.2   80.0  77.6
MC CN QC-HQ       78.6   79.4  76.2
MC CN QC-HR       78.8   79.3  78.4
MC CN A-HQ        82.6   80.3  77.4
MC CN A-HR        82.2   80.2  76.9

Table 3: Accuracy for extraction variations on: SIQA CS-3 for ATOMIC paths & ConceptNet triples; PIQA CS-2 for ATOMIC pairs & ConceptNet subgraphs & WikiHow paths; and MCScript2.0 CS-2 for ATOMIC pairs & ConceptNet subgraphs.

We see that SIQA performed best when it received QC-HQ knowledge from ATOMIC, reflecting the strong, one-to-one alignment between SIQA and ATOMIC. PIQA, however, performs well across most extractions for ConceptNet, indicating that PIQA is generally well aligned with ConceptNet and only performs poorly when the extraction process becomes too noisy. Additionally, PIQA performs well with WikiHow for the unconditioned, high quality setting, indicating that the WikiHow KG is not well aligned across question-answer pairs, but does identify useful knowledge gaps within the answer that may improve performance on the task. Finally, we see that MCScript2.0 performed best when it received QC-HQ knowledge from ATOMIC, and similarly to SIQA, improves when seeing this knowledge at inference time.

5.2.2 Knowledge Shape Analysis

We discuss knowledge shape effects on alignment.

ATOMIC: For SIQA, ATOMIC paths have the best alignment, due to the high quality achieved when constraining knowledge for the SIQA question to link to knowledge for the answer. ATOMIC pairs and subgraphs seem to be learned more implicitly and do not yield large overall improvements when added explicitly during inference time. It seems that SIQA requires longer, more informative knowledge at inference time, which pairs and subgraphs do not offer. For example, consider the SIQA context and answer on the right of Fig 2. For this data point, we extract the following ATOMIC path: [PersonX has to go to the dentist, need to make an appointment, PersonX picks up from school, to drive kids home], and the following ATOMIC pair: [PersonX picks up from school, to drive kids home]. We can see that the path clearly contains more context and detail for the knowledge required to make the correct prediction. For PIQA, we saw the largest improvements for ATOMIC pairs and subgraphs, where pairs ultimately perform best, indicating that PIQA might find concise and direct information from ATOMIC more useful. For MCScript2.0, ATOMIC pairs aligned best exclusively.

ConceptNet: For SIQA, ConceptNet triples and subgraphs show similar alignment results and we do not see major improvements. It seems that the content of ConceptNet is not aligned well to SIQA, regardless of shape. For PIQA, we see improvements for both triples and subgraphs, and get our
best improvements with subgraphs, indicating that the extra knowledge encoded in a subgraph shape via ConceptNet is helpful for the PIQA task. Similarly to SIQA, MCScript2.0 performed best with ConceptNet subgraphs, but these do not yield major improvements. Interestingly, results on MCScript2.0 only show slight improvement for ConceptNet when knowledge is present at inference time, and show no improvement otherwise.

WikiHow: WikiHow paths performed best, indicating that paths were the best way to extract information, as WikiHow pairs and subgraphs might have contained redundant information given limitations with the WikiHow KG extraction process.

5.2.3 Knowledge Graph Analysis

Given our alignment results, it is clear to see that ATOMIC is the best match for SIQA, that ConceptNet is the best match for PIQA, and that ATOMIC is the best match for MCScript2.0 (most likely due to its need for script knowledge, which often requires social knowledge). The encoding of each KG plays an important role in this match. We see that the ConceptNet to PIQA match is more robust to extraction methods, which may be a side effect of ConceptNet's encoding, where directly linking nodes is less noisy than using tf-idf measures for the ATOMIC encoding, in which we only see positive results when we have very selective filters in our extraction techniques. The concise, short nature of ConceptNet's knowledge also lends itself to more implicit knowledge learning for certain types of tasks, whereas the more descriptive nature of ATOMIC can be read during inference time (see Fig 2 for examples). This illustrates the possibility that ConceptNet may boost performance as a regularizer for certain tasks.

6 Phase 3: Integrate

6.1 Setup

We analyze two aspects of integration, depicted on the far right in Fig. 1. First, we construct commonsense probes to demonstrate how much knowledge we gain from our KGs via our transformer-based KS model with respect to a BERT baseline. Second, we examine distributional changes in our models before and after commonsense integration and verify our results with human evaluation. With our probes, we can compare how well models distinguish between several types of ATOMIC-style knowledge, outlined below.

Relational Probes: We predict the ATOMIC relation between an event and inference pair, constraining our candidate answer set to two specified inference dimensions. For example, xWant vs xNeed might refer to a probe that will predict an answer from the candidate set: [Person wants recognition, Person needs recognition] given some event, essentially pitting the two relations against each other to evaluate the difficulty of distinguishing them.

Agent-Patient Probes: We predict the agent of the inference where the candidate set is the agent and patient of the event (using ATOMIC abstractions).

Concept Probes: We predict concepts and constrain our candidate answer set to the most salient concept in the sequence and its respective antonym. A full description of probe construction and examples for each knowledge type can be found in the appendix. We evaluate QA probes via standard accuracy after fine-tuning.

6.2 Analysis: Commonsense Probes

Type                  Baseline  KS model
Relation Probes
xWant vs xEffect      53.66     46.59
xWant vs xReact       51.66     77.39
xWant vs xIntent      52.05     45.77
xWant vs xNeed        47.99     50.32
xWant vs xAttr        71.74     65.19
xEffect vs xReact     43.32     63.36
xEffect vs xIntent    54.29     44.85
xEffect vs xNeed      49.22     52.48
xEffect vs xAttr      61.52     59.44
xReact vs xIntent     52.34     73.32
xReact vs xNeed       49.21     74.76
xReact vs xAttr       48.48     57.71
xIntent vs xNeed      44.18     42.24
xIntent vs xAttr      78.96     64.64
xNeed vs xAttr        68.97     68.49
oWant vs oEffect      51.74     41.83
oWant vs oReact       43.90     72.04
oEffect vs oReact     49.12     59.02
Agent-Patient Probes
xWant vs oWant        70.12     70.09
xEffect vs oEffect    74.34     73.51
xReact vs oReact      74.34     69.51
Concept Probes
inference             74.96     77.40
event                 67.74     65.07

Table 4: QA probe results on ATOMIC-SIQA QC-HQ KS model vs BERT baseline.

Table 4 compares the performance between the best performing ATOMIC-SIQA KS model and the respective SIQA BERT baseline. We see notable performance differences for our relation probes, showing that the KS model does well at identifying the React relation and that the baseline tends
Correct  Incorrect
19.5%    12.4%

Table 5: Distribution changes for selected class from baseline to KS model.

      ATOMIC  CN
SIQA  89%     91%
PIQA  48%     82%
MC    75%     74%

Table 6: Human evaluation for valid knowledge gap.
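The integration statistic behind Table 5 (the fraction of examples whose predicted class changes from the baseline to the KS model, split by whether the change lands on the gold answer) can be sketched as follows. This is a minimal illustration with toy data; the function name and example values are ours, not from the released code:

```python
from typing import Dict, List

def prediction_change_distribution(baseline_preds: List[int],
                                   ks_preds: List[int],
                                   gold: List[int]) -> Dict[str, float]:
    """Table 5-style statistic: fraction of examples whose predicted
    class changes from the baseline to the KS model, split into
    changes that land on the gold label and changes that do not."""
    n = len(gold)
    to_correct = sum(b != k and k == g
                     for b, k, g in zip(baseline_preds, ks_preds, gold))
    to_incorrect = sum(b != k and k != g
                       for b, k, g in zip(baseline_preds, ks_preds, gold))
    return {"correct": to_correct / n, "incorrect": to_incorrect / n}

# Toy example: 5 questions, 3 answer candidates each.
print(prediction_change_distribution(
    baseline_preds=[0, 1, 2, 0, 1],
    ks_preds=[1, 1, 0, 0, 2],
    gold=[1, 1, 0, 1, 1]))  # {'correct': 0.4, 'incorrect': 0.2}
```

A model whose changes skew toward the "correct" bucket, as in Table 5, is integrating knowledge confidently rather than randomly perturbing predictions.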
References

Mahnaz Koupaee and William Yang Wang. 2018. WikiHow: A large scale text summarization dataset. arXiv preprint arXiv:1810.09305.

Sunjae Kwon, Cheongwoong Kang, Jiyeon Han, and Jaesik Choi. 2019. Why do masked neural language models still need common sense knowledge? arXiv preprint arXiv:1911.03024.

Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.

Simon Ostermann, Michael Roth, and Manfred Pinkal. 2019. MCScript2.0: A machine comprehension corpus focused on script events and participants. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019), pages 103–117.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).

Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Explain yourself! Leveraging language models for commonsense reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4932–4942.

Kyle Richardson and Ashish Sabharwal. 2019. What does my QA model know? Devising controlled probes using expert knowledge. arXiv preprint arXiv:1912.13337.

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. arXiv preprint arXiv:2002.12327.

Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A. Smith, and Yejin Choi. 2019a. ATOMIC: An atlas of machine commonsense for if-then reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3027–3035.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019b. Social IQa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4453–4463.

Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In Thirty-First AAAI Conference on Artificial Intelligence.

Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. ERNIE: Enhanced representation through knowledge integration. In ACL.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In NeurIPS.

Zhi-Xiu Ye, Qian Chen, Wen Wang, and Zhen-Hua Ling. 2019. Align, mask and select: A simple method for incorporating commonsense knowledge into language representation models. arXiv preprint arXiv:1908.06725.

Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. From recognition to cognition: Visual commonsense reasoning. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6713–6724.

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 93–104.

Hongming Zhang, Xinran Zhao, and Yangqiu Song. 2020. WinoWhy: A deep diagnosis of essential commonsense knowledge for answering Winograd Schema Challenge. arXiv preprint arXiv:2005.05763.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. In ACL.

Ben Zhou, Daniel Khashabi, Qiang Ning, and Dan Roth. 2019. "Going on a vacation" takes longer than "Going for a walk": A study of temporal commonsense understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
Table 10: Majority label and fine-tuning results for MLM probes.
For inference dimensions relating to Others, we map the inference dimensions oWant, oReact, and oEffect to the same verbs as before: wants, feels, and effect respectively. We set up probes in the same way as above.

Agent-Patient Probes. We create probes to evaluate whether a model can determine whether an inference dimension is assigned to PersonX or others. For example, consider the following probe:

PersonX puts out a fire. [MASK] wants to receive recognition.

In this example, the correct prediction is PersonX. However, in the below probe, the correct prediction is others.

PersonX puts out a fire. [MASK] want to thank PersonX.

Both of these probes use the following answer candidates: [PersonX, others]. We also remove plurals to ensure that the model does not make predictions using hints from grammar.

Concept Probes. We find the most salient concept in the event via POS tagging. We then replace this concept with [MASK] and set candidate answers as the ground truth answer and an antonym, as found via WordNet (Miller, 1995). For example, given the event and inference:

PersonX discovers the answer. PersonX feels accomplished.

We identify discovers as the most salient concept in the event, and use the lemma from WordNet: discovery. We then use this to find a viable antonym: lose. Finally, we have the probe:

PersonX [MASK] the answer. PersonX feels accomplished.

And the candidates: [discovery, lose].

We lemmatize the answers to allow for fair prediction between the truth concept and the antonym (which often comes lemmatized from WordNet). Similarly, we construct inference concept probes by predicting salient concepts in the inference dimension instead of the event.
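The event-concept probe construction sketched in prose above can be expressed in a few lines. In this minimal sketch, the salient-concept picker and the lemma/antonym lookups are simplified stand-ins for the POS tagging and WordNet lookups the paper describes; the dictionaries below are illustrative only:

```python
from typing import List, Tuple

# Hypothetical stand-ins for the POS-tagging salience step and the
# WordNet lemma/antonym lookups used in the paper.
LEMMAS = {"discovers": "discovery"}
ANTONYMS = {"discovery": "lose"}

def build_concept_probe(event: str, inference: str) -> Tuple[str, List[str]]:
    """Mask the salient concept in the event and pair the ground-truth
    lemma with its antonym as the candidate answers."""
    salient = next(tok for tok in event.split() if tok in LEMMAS)
    lemma = LEMMAS[salient]
    masked_event = event.replace(salient, "[MASK]", 1)
    probe = f"{masked_event} {inference}"
    return probe, [lemma, ANTONYMS[lemma]]

probe, candidates = build_concept_probe(
    "PersonX discovers the answer.", "PersonX feels accomplished.")
print(probe)       # PersonX [MASK] the answer. PersonX feels accomplished.
print(candidates)  # ['discovery', 'lose']
```

The same skeleton yields inference concept probes by masking a concept in the inference string instead of the event string.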
A.4 Reproducibility
We train each KS model using 2 GeForce GTX
1080 Ti GPUs. Hyperparameter settings are used
from previously reported BERT results on each
task (Sap et al., 2019b; Bisk et al., 2020; Da and
Kasai, 2019).
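For reference, the input sequence each KS model is trained on (Section 5.1, {[CLS] qi [SEP] cij kij [SEP]}) can be sketched as plain string assembly. A real implementation would insert the special tokens through a BERT tokenizer rather than literal strings; the example text is the Fig 2 SIQA context with its extracted ATOMIC pair:

```python
def ks_encode(question: str, candidate: str, knowledge: str) -> str:
    """Assemble the knowledge-surrounded input sequence
    [CLS] q_i [SEP] c_ij k_ij [SEP] from Section 5.1."""
    return f"[CLS] {question} [SEP] {candidate} {knowledge} [SEP]"

print(ks_encode(
    "Tracy brought the kids to their dentist appointment, but it was "
    "scheduled during the school day. What does Tracy need to do before this?",
    "Keep the kids home from school until after their appointments",
    "PersonX picks up from school, to drive kids home"))
```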