
Decomposing Generalization

Models of Generic, Habitual, and Episodic Statements

Venkata Govindarajan, University of Rochester
Benjamin Van Durme, Johns Hopkins University
Aaron Steven White, University of Rochester

Abstract

We present a novel semantic framework for modeling linguistic expressions of generalization—generic, habitual, and episodic statements—as combinations of simple, real-valued referential properties of predicates and their arguments. We use this framework to construct a dataset covering the entirety of the Universal Dependencies English Web Treebank. We use this dataset to probe the efficacy of type-level and token-level information—including hand-engineered features and static (GloVe) and contextual (ELMo) word embeddings—for predicting expressions of generalization.

1 Introduction

Natural language allows us to convey not only information about particular individuals and events, as in Example (1), but also generalizations about those individuals and events, as in (2).

(1) a. Mary ate oatmeal for breakfast today.
    b. The students completed their assignments.

(2) a. Mary eats oatmeal for breakfast.
    b. The students always complete their assignments on time.

This capacity for expressing generalization is extremely flexible—allowing for generalizations about the kinds of events that particular individuals are habitually involved in, as in (2), as well as characterizations about kinds of things, as in (3).

(3) a. Bishops move diagonally.
    b. Soap is used to remove dirt.

Such distinctions between episodic statements (1), on the one hand, and habitual (2) and generic (or characterizing) statements (3), on the other, have a long history in both the linguistics and artificial intelligence literatures (see Carlson, 2011; Maienborn et al., 2011; Leslie and Lerner, 2016). Nevertheless, few modern semantic parsers make a systematic distinction (cf. Abzianidze and Bos, 2017).

This is problematic, because the ability to accurately capture different modes of generalization is likely key to building systems with robust common sense reasoning (Zhang et al., 2017a; Bauer et al., 2018): Such systems need some source for general knowledge about the world (McCarthy, 1960, 1980, 1986; Minsky, 1974; Schank and Abelson, 1975; Hobbs et al., 1987; Reiter, 1987) and natural language text seems like a prime candidate. It is also surprising, because there is no dearth of data relevant to linguistic expressions of generalization (Doddington et al., 2004; Cybulska and Vossen, 2014b; Friedrich et al., 2015).

One obstacle to further progress on generalization is that current frameworks tend to take standard descriptive categories as sharp classes—for example, EPISODIC, GENERIC, HABITUAL for statements and KIND, INDIVIDUAL for noun phrases. This may seem reasonable for sentences like (1a), where Mary clearly refers to a particular individual, or (3a), where Bishops clearly refers to a kind; but natural text is less forgiving (Grimm, 2014, 2016, 2018). Consider the underlined arguments in (4): Do they refer to kinds or individuals?

(4) a. I will manage client expectations.
    b. The atmosphere may not be for everyone.
    c. Thanks again for great customer service!

To remedy this, we propose a novel framework for capturing linguistic expressions of generalization. Taking inspiration from decompositional semantics (Reisinger et al., 2015; White et al., 2016), we suggest that linguistic expressions
of generalization should be captured in a continuous multi-label system, rather than a multi-class system. We do this by decomposing categories such as EPISODIC, HABITUAL, and GENERIC into simple referential properties of predicates and their arguments. Using this framework (§3), we develop an annotation protocol, which we validate (§4) and compare against previous frameworks (§5). We then deploy this framework (§6) to construct a new large-scale dataset of annotations covering the entire Universal Dependencies (De Marneffe et al., 2014; Nivre et al., 2015) English Web Treebank (UD-EWT; Bies et al., 2012; Silveira et al., 2014)—yielding the Universal Decompositional Semantics-Genericity (UDS-G) dataset.1

Through exploratory analysis of this dataset, we demonstrate that this multi-label framework is well-motivated (§7). We then present models for predicting expressions of linguistic generalization that combine hand-engineered type and token-level features with static and contextual learned representations (§8). We find that (i) referential properties of arguments are easier to predict than those of predicates; and that (ii) contextual learned representations contain most of the relevant information for both arguments and predicates (§9).

1 Data, code, protocol implementation, and task instructions provided to annotators are available at decomp.io.

2 Background

Most existing annotation frameworks aim to capture expressions of linguistic generalization using multi-class annotation schemes. We argue that this reliance on multi-class annotation schemes is problematic on the basis of descriptive and theoretical work in the linguistics literature.

One of the earliest frameworks explicitly aimed at capturing expressions of linguistic generalization was developed under the ACE-2 program (Mitchell et al., 2003; Doddington et al., 2004, and see Reiter and Frank, 2010). This framework associates entity mentions with discrete labels for whether they refer to a specific member of the set in question (SPECIFIC) or any member of the set in question (GENERIC). The ACE-2005 Multilingual Training Corpus (Walker et al., 2006) extends these annotation guidelines, providing two additional classes: (i) negatively quantified entries (NEG) for referring to empty sets and (ii) underspecified entries (USP), where the referent is ambiguous between GENERIC and SPECIFIC.

The existence of the USP label already portends an issue with multi-class annotation schemes, which have no way of capturing the well-known phenomena of taxonomic reference (see Carlson and Pelletier, 1995, and references therein), abstract/event reference (Grimm, 2014, 2016, 2018), and weak definites (Carlson et al., 2006). For example, wines in (5) refers to particular kinds of wine; service in (6) refers to an abstract entity/event that could be construed as both particular-referring, in that it is the service at a specific restaurant, and kind-referring, in that it encompasses all service events at that restaurant; and bus in (7) refers to potentially multiple distinct buses that are grouped into a kind by the fact that they drive a particular line.

(5) That vintner makes three different wines.

(6) The service at that restaurant is excellent.

(7) That bureaucrat takes the 90 bus to work.

This deficit is remedied to some extent in the ARRAU (Poesio et al., 2018, and see Mathew, 2009; Louis and Nenkova, 2011) and ECB+ (Cybulska and Vossen, 2014a,b) corpora. The ARRAU corpus is mainly intended to capture anaphora resolution, but following the GNOME guidelines (Poesio, 2004), it also annotates entity mentions for a GENERIC attribute, sensitive to whether the mention is in the scope of a relevant semantic operator (e.g., a conditional or quantifier) and whether the nominal refers to a type of object whose genericity is left underspecified, such as a substance. The ECB+ corpus is an extension of the EventCorefBank (ECB; Bejan and Harabagiu, 2010; Lee et al., 2012), which annotates Google News texts for event coreference in accordance with the TimeML specification (Pustejovsky et al., 2003), and is an improvement in the sense that, in addition to entity mentions, event mentions may be labeled with a GENERIC class.

The ECB+ approach is useful, since episodic, habitual, and generic statements can straightforwardly be described using combinations of event and entity mention labels. For example, in ECB+, episodic statements involve only non-generic entity and event mentions; habitual statements involve a generic event mention and at least one
non-generic entity mention; and generic statements involve generic event mentions and at least one generic entity mention. This demonstrates the strength of decomposing statements into properties of the events and entities they describe; but there remain difficult issues arising from the fact that the decomposition does not go far enough. One is that, like ACE-2/2005 and ARRAU, ECB+ does not make it possible to capture taxonomic and abstract reference or weak definites; another is that, because ECB+ treats generics as mutually exclusive from other event classes, it is not possible to capture that events and states in those classes can themselves be particular or generic. This is well known for different classes of events, such as those determined by a predicate's lexical aspect (Vendler, 1957); but it is likely also important for distinguishing more particular stage-level properties (e.g., availability (8)) from more generic individual-level properties (e.g., strength (9)) (Carlson, 1977).

(8) Those firemen are available.

(9) Those firemen are strong.

This situation is improved upon in the Richer Event Descriptions (RED; O'Gorman et al., 2016) and Situation Entities (SitEnt; Friedrich and Palmer, 2014a,b; Friedrich et al., 2015; Friedrich and Pinkal, 2015b,a; Friedrich et al., 2016) frameworks, which annotate both NPs and entire clauses for genericity. In particular, SitEnt, which is used to annotate MASC (Ide et al., 2010) and Wikipedia, has the nice property that it recognizes the existence of abstract entities and lexical aspectual class of clauses' main verbs, along with habituality and genericity. This is useful because, in addition to decomposing statements using the genericity of the main referent and event, this framework recognizes that lexical aspect is an independent phenomenon. In practice, however, the annotations produced by this framework are mapped into a multi-class scheme containing only the high-level GENERIC-HABITUAL-EPISODIC distinction—alongside a conceptually independent distinction among illocutionary acts.

A potential argument in favor of mapping into a multi-class scheme is that, if it is sufficiently elaborated, the relevant decomposition may be recoverable. But regardless of such an elaboration, uncertainty about which class any particular entity or event falls into cannot be ignored. Some examples may just not have categorically correct answers; and even if they do, annotator uncertainty and bias may obscure them. To account for this, we develop a novel annotation framework that both (i) explicitly captures annotator confidence about the different referential properties discussed above and (ii) attempts to correct for annotator bias using standard psycholinguistic methods.

3 Annotation Framework

We divide our framework into two protocols—the argument and predicate protocols—that probe properties of individuals and situations (i.e., events or states) referred to in a clause. Drawing inspiration from prior work in decompositional semantics (White et al., 2016), a crucial aspect of our framework is that (i) multiple properties can be simultaneously true for a particular individual or situation (event or state); and (ii) we explicitly collect confidence ratings for each property. This makes our framework highly extensible, because further properties can be added without breaking a strict multi-class ontology.

Drawing inspiration from the prior literature on generalization discussed in §1 and §2, we focus on properties that lie along three main axes: whether a predicate or its arguments refer to (i) instantiated or spatiotemporally delimited (i.e., particular) situations or individuals; (ii) classes of situations (i.e., hypothetical situations) or kinds of individuals; and/or (iii) intangible (i.e., abstract concepts or stative situations).

This choice of axes is aimed at allowing our framework to capture not only the standard EPISODIC-HABITUAL-GENERIC distinction, but also phenomena that do not fit neatly into this distinction, such as taxonomic reference, abstract reference, and weak definites. The idea here is similar to prior decompositional semantics work on semantic protoroles (Reisinger et al., 2015; White et al., 2016, 2017), which associates categories like AGENT or PATIENT with sets of more basic properties, such as volitionality, causation, change-of-state, and so forth, and is similarly inspired by classic theoretical work (Dowty, 1991).

In our framework, prototypical episodics, habituals, and generics correspond to sets of properties that the referents of a clause's head predicate and arguments have—namely, clausal categories are built up from properties of the predicates that head
them along with those predicates' arguments. For instance, prototypical episodic statements, like those in (1), have arguments that only refer to particular, non-kind, non-abstract individuals and a predicate that refers to a particular event or (possibly) state; prototypical habitual statements, like those in (2), have arguments that refer to at least one particular, non-kind, non-abstract individual and a predicate that refers to a non-particular, dynamic event; and prototypical generics, like those in (3), have arguments that only refer to kinds of individuals and a predicate that refers to non-particular situations.

It is important to note that these are all prototypical properties of episodic, habitual, or generic statements, in the same way that volitionality is a prototypical property of agents and change-of-state is a prototypical property of patients. That is, our framework explicitly allows for bleed between categories because it only commits to the referential properties, not the categories themselves. It is this ambivalence toward sharp categories that also allows our framework to capture phenomena that fall outside the bounds of the standard three-way distinction. For instance, taxonomic reference, as in (5), and weak definites, as in (7), prototypically involve an argument being both particular- and kind-referring; stage-level properties, as in (8), prototypically involve particular, non-dynamic situations, while individual-level properties, as in (9), prototypically involve non-particular, non-dynamic situations.

Figure 1: Examples of argument protocol (top) and predicate protocol (bottom).

Figure 1 shows examples of the argument protocol (top) and predicate protocol (bottom), whose implementation is based on the event factuality annotation protocol described by White et al. (2016) and Rudinger et al. (2018). Annotators are presented with a sentence with one or many words highlighted, followed by statements pertaining to the highlighted words in the context of the sentence. They are then asked to fill in the statement with a binary response saying whether it does or does not hold and to give their confidence on a 5-point scale—not at all confident (1), not very confident (2), somewhat confident (3), very confident (4), and totally confident (5).

4 Framework Validation

To demonstrate the efficacy of our framework for use in bulk annotation (reported in §6), we conduct a validation study on both our predicate and argument protocols. The aim of these studies is to establish that annotators display reasonable agreement when annotating for the properties in each protocol, relative to their reported confidence. We expect that, the more confident both annotators are in their annotation, the more likely it should be that annotators agree on those annotations.

To ensure that the findings from our validation studies generalize to the bulk annotation setting, we simulate the bulk setting as closely as possible: (i) randomly sampling arguments and predicates for annotation from the same corpus we conduct the bulk annotation on (UD-EWT); and (ii) allowing annotators to do as many or as few annotations as they would like. This design makes standard measures of interannotator agreement somewhat difficult to accurately compute, because different pairs of annotators may annotate only a small number of overlapping items (arguments/predicates), so we turn to standard statistical methods from psycholinguistics to assist in estimation of interannotator agreement.

Predicate and argument extraction   We extracted predicates and their arguments from the gold UD parses from UD-EWT using PredPatt (White et al., 2016; Zhang et al., 2017b). From the UD-EWT training set, we then randomly sampled 100 arguments from those headed by a DET, NUM, NOUN, PROPN, or PRON and 100 predicates from
those headed by an ADJ, NOUN, NUM, DET, PROPN, PRON, VERB, or AUX.
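As a rough sketch of this extraction step (not the authors' exact code), the snippet below uses the PredPatt Python package's load_conllu and PredPatt interface; the instances/arguments/root attribute names, the behavior of load_conllu on a file path, and the idea of filtering candidates by the UPOS tag of their root are assumptions based on the description above.

    from random import sample, seed
    from predpatt import PredPatt, load_conllu   # reference PredPatt implementation

    ARG_TAGS = {"DET", "NUM", "NOUN", "PROPN", "PRON"}
    PRED_TAGS = {"ADJ", "NOUN", "NUM", "DET", "PROPN", "PRON", "VERB", "AUX"}

    def candidate_roots(conllu_path):
        """Collect (sentence id, root token) pairs for predicate and argument
        candidates whose roots carry one of the sampled UPOS tags."""
        preds, args = [], []
        for sent_id, parse in load_conllu(conllu_path):   # one gold UD parse per sentence (assumed)
            for predicate in PredPatt(parse).instances:    # attribute names assumed
                if predicate.root.tag in PRED_TAGS:
                    preds.append((sent_id, predicate.root))
                for argument in predicate.arguments:
                    if argument.root.tag in ARG_TAGS:
                        args.append((sent_id, argument.root))
        return preds, args

    seed(0)
    preds, args = candidate_roots("en_ewt-ud-train.conllu")   # gold UD-EWT training parses
    pilot = sample(args, 100) + sample(preds, 100)            # 100 arguments + 100 predicates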

Annotators   A total of 44 annotators were recruited from Amazon Mechanical Turk to annotate arguments; and 50 annotators were recruited to annotate predicates. In both cases, arguments and predicates were presented in batches of 10, with each predicate and argument annotated by 10 annotators.

Confidence normalization   Because different annotators use the confidence scale in different ways (e.g., some annotators use all five options while others only ever respond with totally confident (5)), we normalize the confidence ratings for each property using a standard ordinal scale normalization technique known as ridit scoring (Agresti, 2003). In ridit scoring, ordinal labels are mapped to (0, 1) using the empirical cumulative distribution function of the ratings given by each annotator. Specifically, for the responses y(a) given by annotator a,

    ridit_y(a)(yi(a)) = ECDF_y(a)(yi(a) − 1) + 0.5 × [ECDF_y(a)(yi(a)) − ECDF_y(a)(yi(a) − 1)].

Ridit scoring has the effect of reweighting the importance of a scale label based on the frequency with which it is used. For example, insofar as an annotator rarely uses extreme values, such as not at all confident or totally confident, the annotator is likely signaling very low or very high confidence, respectively, when they are used; and insofar as an annotator often uses extreme values, the annotator is likely not signaling particularly low or particularly high confidence.
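Ridit scoring itself is a few lines of code. The sketch below is a generic implementation of the standard definition (probability mass strictly below a rating plus half the mass at that rating), not the authors' implementation, but it reproduces the reweighting behavior described above.

    from collections import Counter

    def ridit_scores(ratings):
        """Map one annotator's ordinal ratings (e.g., confidence 1-5) into (0, 1)."""
        n = len(ratings)
        counts = Counter(ratings)
        below, cum = {}, 0
        for level in sorted(counts):
            below[level] = cum / n           # proportion of ratings strictly below this level
            cum += counts[level]
        return [below[r] + 0.5 * counts[r] / n for r in ratings]

    # An annotator who almost always answers "totally confident" (5):
    print(ridit_scores([5, 5, 5, 4, 5]))     # the lone 4 maps to 0.1; each 5 maps to 0.6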
Interannotator Agreement (IAA)   Common IAA statistics, such as Cohen's or Fleiss' κ, rely on the ability to compute both an expected agreement pe and an observed agreement po, with κ ≡ (po − pe) / (1 − pe). Such a computation is relatively straightforward when a small number of annotators annotate many items, but when many annotators each annotate a small number of items pairwise, pe and po can be difficult to estimate accurately, especially for annotators that only annotate a few items total. Further, there is no standard way to incorporate confidence ratings, like the ones we collect, into these IAA measures.

To overcome these obstacles, we use a combination of mixed and random effects models (Gelman and Hill, 2014), which are extremely common in the analysis of psycholinguistic data (Baayen, 2008), to estimate pe and po for each property. The basic idea behind using these models is to allow our estimates of pe and po to be sensitive to the number of items annotators annotated as well as how annotators' confidence relates to agreement.

To estimate pe for each property, we fit a random effects logistic regression to the binary responses for that property, with random intercepts for both annotator and item. The fixed intercept estimate β̂0 for this model is an estimate of the log-odds that the average annotator would answer true on that property for the average item; and the random intercepts give the distribution of actual annotator (σ̂ann) or item (σ̂item) biases. Table 1 gives the estimates for each property. We note a substantial amount of variability in the bias different annotators have for answering true on many of these properties. This variability is evidenced by the fact that σ̂ann and σ̂item are similar across properties, and it suggests the need to adjust for annotator biases when analyzing these data, which we do both here and for our bulk annotation.

              Property          β̂0      σ̂ann    σ̂item
  Argument    Is.Particular      0.49    1.15    1.76
              Is.Kind           −0.31    1.23    1.34
              Is.Abstract       −1.29    1.27    1.70
  Predicate   Is.Particular      0.98    0.91    0.72
              Is.Dynamic         0.24    0.82    0.59
              Is.Hypothetical   −0.78    1.24    0.90

Table 1: Bias (log-odds) for answering true.

To compute pe from these estimates, we use a parametric bootstrap. On each replicate, we sample annotator biases b1, b2 independently from N(β̂0, σ̂ann), then compute the expected probability of random agreement in the standard way: π1π2 + (1 − π1)(1 − π2), where πi = logit−1(bi). We compute the mean across 9,999 such replicates to obtain pe, shown in Table 2.
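This bootstrap is straightforward to reproduce; the sketch below plugs in the Is.Particular (argument) estimates from Table 1 and adds the κ definition from above for completeness.

    import numpy as np

    def expected_agreement(beta0, sigma_ann, n_reps=9999, seed=0):
        """Parametric-bootstrap estimate of chance agreement pe: sample two annotator
        biases from N(beta0, sigma_ann), convert them to probabilities of answering
        true, and average the probability that the two answers coincide."""
        rng = np.random.default_rng(seed)
        b = rng.normal(beta0, sigma_ann, size=(n_reps, 2))
        pi = 1.0 / (1.0 + np.exp(-b))                      # inverse logit
        return float((pi[:, 0] * pi[:, 1] + (1 - pi[:, 0]) * (1 - pi[:, 1])).mean())

    def kappa(po, pe):
        """Chance-corrected agreement, as defined above."""
        return (po - pe) / (1 - pe)

    pe = expected_agreement(beta0=0.49, sigma_ann=1.15)    # Is.Particular, argument protocol
    print(round(pe, 2))                                    # close to the 0.52 in Table 2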
To estimate po for each property in a way that takes annotator confidence into account, we first compute, for each pair of annotators, each item they both annotated, and each property they annotated that item on, whether or not they agree in their annotation. We then fit separate mixed effects logistic regressions for each property to this agreement variable, with a fixed intercept β0 and slope βconf for the product of the annotators' confidence for that item and random intercepts for both annotator and item.2

2 We use the product of annotator confidences because it is large when both annotators have high confidence and small when either annotator has low confidence and always remains between 0 (lowest confidence) and 1 (highest confidence).
              Property           pe     κlow    κhigh
  Argument    Is.Particular     0.52    0.21    0.77
              Is.Kind           0.51    0.12    0.51
              Is.Abstract       0.61    0.17    0.80
  Predicate   Is.Particular     0.58   −0.11    0.54
              Is.Dynamic        0.51   −0.02    0.22
              Is.Hypothetical   0.54   −0.04    0.60

Table 2: Interannotator agreement scores.

We find, for all properties, that there is a reliable increase (i.e., a positive β̂conf) in agreement as annotators' confidence ratings go up (ps < 0.001). This corroborates our prediction that annotators should have higher agreement for things they are confident about. It also suggests the need to incorporate confidence ratings into the annotations our models are trained on, which we do in our normalization of the bulk annotation responses.

From the fixed effects, we can obtain an estimate of the probability of agreement for the average pair of annotators at each confidence level between 0 and 1. We compute two versions of κ based on such estimates: κlow, which corresponds to 0 confidence for at least one annotator in a pair, and κhigh, which corresponds to perfect confidence for both. Table 2 shows these κ estimates.
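In code, the two κ variants amount to evaluating the fitted agreement model at the endpoints of the confidence product (0 and 1); the function below is only a restatement of that definition, with pe taken from the bootstrap described earlier in this section.

    import numpy as np

    def kappa_at_confidence(beta0, beta_conf, pe, conf_product):
        """kappa implied by the agreement regression at a given annotator-confidence
        product: conf_product = 0 gives kappa_low, conf_product = 1 gives kappa_high."""
        po = 1.0 / (1.0 + np.exp(-(beta0 + beta_conf * conf_product)))
        return (po - pe) / (1 - pe)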
As implied by the reliably positive β̂conf estimates, we see that κhigh is greater than κlow for all properties. Further, with the exception of DYNAMIC, κhigh is generally comparable to the κ estimates reported in annotations by trained annotators using a multi-class framework. For instance, compare the metrics in Table 2 to κann in Table 3 (see §5 for details), which gives the Fleiss' κ metric for clause types in the SitEnt dataset (Friedrich et al., 2016).

5 Comparison to Standard Ontology

To demonstrate that our framework subsumes standard distinctions (e.g., EPISODIC v. HABITUAL v. GENERIC), we conduct a study comparing annotations assigned under our multi-label framework to those assigned under a framework that recognizes such multi-class distinctions. We choose the SitEnt framework for this comparison, because it assumes a categorical distinction between GENERIC, HABITUAL (their GENERALIZING), EPISODIC (their EVENTIVE), and STATIVE clauses (Friedrich and Palmer, 2014a,b; Friedrich et al., 2015; Friedrich and Pinkal, 2015b,a; Friedrich et al., 2016).3 SitEnt is also a useful comparison because it was constructed by highly trained annotators who had access to the entire document containing the clause being annotated, thus allowing us to assess both how much it matters that we use only very lightly trained annotators and do not provide document context.

Predicate and argument extraction   For each of GENERIC, HABITUAL, STATIVE, and EVENTIVE, we randomly sample 100 clauses from SitEnt such that (i) that clause's gold annotation has that category; and (ii) all SitEnt annotators agreed on that annotation. We annotate the mainReferent of these clauses (as defined by SitEnt) in our argument protocol and the mainVerb in our predicate protocol, providing annotators only the sentence containing the clause.

Annotators   42 annotators were recruited from Amazon Mechanical Turk to annotate arguments, and 45 annotators were recruited to annotate predicates—both in batches of 10, with each predicate and argument annotated by 5 annotators.

Annotation normalization   As noted in §4, different annotators use the confidence scale differently and have different biases for responding true or false on different properties (see Table 1). To adjust for these biases, we construct a normalized score for each predicate and argument using mixed effects logistic regressions. These mixed effects models all had (i) a hinge loss with margin set to the normalized confidence rating; (ii) fixed effects for property (PARTICULAR, KIND, and ABSTRACT for arguments; PARTICULAR, HYPOTHETICAL, and DYNAMIC for predicates), token, and their interaction; and (iii) by-annotator random intercepts and random slopes for property with diagonal covariance matrices. The rationale behind (i) is that true should be associated with positive values; false should be associated with negative values; and the confidence rating should control how far from zero the normalized rating is, adjusting for the biases of annotators that responded to a particular item. The resulting response scale is analogous to current approaches to event factuality annotation (Lee et al., 2015; Stanovsky et al., 2017; Rudinger et al., 2018).

3 SitEnt additionally assumes three other classes, contrasting with the four above: IMPERATIVE, QUESTION, and REPORT. We ignore clauses labeled with these categories.
We obtain a normalized score from these models by setting the Best Linear Unbiased Predictors for the by-annotator random effects to zero and using the Best Linear Unbiased Estimators for the fixed effects to obtain a real-valued label for each token on each property. This procedure amounts to estimating a label for each property and each token based on the "average annotator."
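Written out in our own notation (the authors give the procedure only in prose), the normalized label for a token t on a property p is just the fixed-effects part of the fitted model's linear predictor, with the by-annotator random effects zeroed out:

    ŷ(t, p) = β̂property(p) + β̂token(t) + β̂property×token(p, t),   with all by-annotator random effects set to 0,

so a strongly positive score means the average annotator would confidently answer true for that property, and a strongly negative score means they would confidently answer false.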
Quantitative comparison   To compare our annotations to the gold situation entity types from SitEnt, we train a support vector classifier with a radial basis function kernel to predict the situation entity type of each clause on the basis of the normalized argument property annotations for that clause's mainReferent and the normalized predicate property annotations for that clause's mainVerb. The hyperparameters for this support vector classifier were selected using exhaustive grid search over the regularization parameter λ ∈ {1, 10, 100, 1000} and bandwidth σ ∈ {10−2, 10−3, 10−4, 10−5} in a 5-fold cross-validation (CV). This 5-fold CV was nested within a 10-fold CV, from which we calculate metrics.
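A sketch of this nested cross-validation with scikit-learn is below. The feature matrix X (six normalized property scores per clause) and the label vector y are loaded from hypothetical files, and the λ and σ grids above are mapped onto SVC's C and gamma parameters; whether that matches the authors' exact parameterization is an assumption.

    import numpy as np
    from sklearn.model_selection import GridSearchCV, cross_val_predict
    from sklearn.svm import SVC
    from sklearn.metrics import classification_report, cohen_kappa_score

    # X: one row per clause with the six normalized scores (mainReferent: particular,
    # kind, abstract; mainVerb: particular, hypothetical, dynamic); y: gold SitEnt labels.
    X = np.load("sitent_property_scores.npy")                 # hypothetical file names
    y = np.load("sitent_gold_labels.npy", allow_pickle=True)

    inner = GridSearchCV(
        SVC(kernel="rbf"),
        param_grid={"C": [1, 10, 100, 1000],
                    "gamma": [1e-2, 1e-3, 1e-4, 1e-5]},
        cv=5,                                                 # inner 5-fold CV for tuning
    )
    # Outer 10-fold CV: each clause is predicted by a model tuned only on the other folds.
    y_hat = cross_val_predict(inner, X, y, cv=10)

    print(classification_report(y, y_hat))                    # per-class P, R, F
    print(cohen_kappa_score(y, y_hat))                        # roughly the kappa_mod column of Table 3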
  Clause type     P      R      F      κmod    κann
  EVENTIVE       0.68   0.55   0.61   0.49    0.74
  STATIVE        0.61   0.59   0.60   0.47    0.67
  HABITUAL       0.49   0.52   0.50   0.33    0.43
  GENERIC        0.66   0.77   0.71   0.61    0.68

Table 3: Predictability of standard ontology using our property set in a kernelized support vector classifier.

Table 3 reports the precision, recall, and F-score computed using the held-out set in each fold of the 10-fold CV. For purposes of comparison, it also gives the Fleiss' κ reported by Friedrich et al. (2016) for each property (κann) as well as Cohen's κ between our model predictions on the held-out folds and the gold SitEnt annotations (κmod). One way to think about κmod is that it tells us what agreement we would expect if we used our model as an annotator instead of highly trained humans.

We see that our model's agreement (κmod) tracks interannotator agreement (κann) surprisingly well. Indeed, in some cases, such as for GENERIC, our model's agreement is within a few points of interannotator agreement. This pattern is surprising, because our model is based on annotations by very lightly trained annotators who have access to very limited context compared with the annotators of SitEnt, who receive the entire document in which a clause is found. Indeed, our model has access to even less context than it could otherwise have on the basis of our framework, since we only annotate one of the potentially many arguments occurring in a clause; thus, the metrics in Table 3 are likely somewhat conservative. This pattern may further suggest that, although having extra context for annotating complex semantic phenomena is always preferable, we still capture useful information by annotating only isolated sentences.

Figure 2: Mean property value for each clause type.

Qualitative comparison   Figure 2 shows the mean normalized value for each property in our framework broken out by clause type. As expected, we see that episodics tend to have particular-referring arguments and predicates, whereas generics tend to have kind-referring arguments and non-particular predicates. Also as expected, episodics and habituals tend to refer to situations that are more dynamic than statives and generics. But although it makes sense that generics would be, on average, near zero for dynamicity—since generics can be about both dynamic and non-dynamic situations—it is less clear why statives are not more negative. This pattern may arise in some way from the fact that there is relatively lower agreement on dynamicity, as noted in §4.
6 Bulk Annotation

We use our annotation framework to collect annotations of predicates and arguments on UD-EWT using the PredPatt system—thus yielding the Universal Decompositional Semantics–Genericity (UDS-G) dataset. Using UD-EWT in conjunction with PredPatt has two main advantages over other similar corpora: (i) UD-EWT contains text from multiple genres—not just newswire—with gold standard Universal Dependency parses; and (ii) there are now a wide variety of other semantic annotations on top of UD-EWT that use the PredPatt standard (White et al., 2016; Rudinger et al., 2018; Vashishtha et al., 2019).

Predicate and argument extraction   PredPatt identifies 34,025 predicates and 56,246 arguments of those predicates from 16,622 sentences. Based on analysis of the data from our validation study (§4) and other pilot experiments (not reported here), we developed a set of heuristics for filtering certain tokens that PredPatt identifies as predicates and arguments, either because we found that there was little variability in the label assigned to particular subsets of tokens—for example, pronominal arguments (such as I, we, he, she, etc.) are almost always labeled particular, non-kind, and non-abstract (with the exception of you and they, which can be kind-referring)—or because it is not generally possible to answer questions about those tokens (e.g., adverbial predicates are excluded). Based on these filtering heuristics, we retain 37,146 arguments and 33,114 predicates for annotation. Table 4 compares these numbers against the resources described in §2.
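The filtering heuristics are not listed exhaustively, so the sketch below is purely illustrative rather than the authors' rules: it keeps pronominal argument roots only when they can plausibly be kind-referring and drops adverb-rooted predicates, assuming each candidate root comes with a hypothetical (lemma, UPOS) pair.

    KIND_CAPABLE_PRONOUNS = {"you", "they", "yourself", "themselves"}

    def keep_argument(lemma, upos):
        """Drop pronominal argument roots whose labels show little variability."""
        if upos == "PRON":
            return lemma.lower() in KIND_CAPABLE_PRONOUNS
        return True

    def keep_predicate(lemma, upos):
        """Drop adverbial predicates, for which the protocol questions are hard to answer."""
        return upos != "ADV"

    assert keep_argument("they", "PRON") and not keep_argument("she", "PRON")
    assert keep_predicate("complete", "VERB") and not keep_predicate("diagonally", "ADV")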
  Corpus             Level    Scheme        Size
  ACE-2 / ACE-2005   NP       multi-class   40,106
  ECB+               Arg.     multi-class   12,540
                     Pred.    multi-class   14,884
  CFD                NP       multi-class    3,422
  Mathew (2009)      clause   multi-class    1,052
  ARRAU              NP       multi-class   91,933
  SitEnt             Topic    multi-class   40,940
                     Clause   multi-class
  RED                Arg.     multi-class   10,319
                     Pred.    multi-class    8,731
  UDS-G              Arg.     multi-label   37,146
                     Pred.    multi-label   33,114

Table 4: Survey of genericity annotated corpora for English, including our new corpus (in bold).

Annotators   We recruited 482 annotators from Amazon Mechanical Turk to annotate arguments, and 438 annotators were recruited to annotate predicates. Arguments and predicates in the UD-EWT validation and test sets were annotated by three annotators each; and those in the UD-EWT train set were annotated by one each. All annotations were performed in batches of 10.

Annotation normalization   We use the annotation normalization procedure described in §5, fit separately to our train and development splits, on the one hand, and our test split, on the other.

7 Exploratory Analysis

Before presenting models for predicting our properties, we conduct an exploratory analysis to demonstrate that the properties of the dataset relate to other token- and type-level semantic properties in intuitive ways. Figure 3 plots the normalized ratings for the argument (left) and predicate (right) protocols. Each point corresponds to a token and the density plots visualize the number of points in a region.

Arguments   We see that arguments have a slight tendency (Pearson correlation ρ = −0.33) to refer to either a kind or a particular—for example, place in (10) falls in the lower right quadrant (particular-referring) and transportation in (11) falls in the upper left quadrant (kind-referring)—though there are a not insignificant number of arguments that refer to something that is both—for example, registration in (12) falls in the upper right quadrant.

(10) I think this place is probably really great especially judging by the reviews on here.

(11) What made it perfect was that they offered transportation so that[...]

(12) Some places do the registration right at the hospital[...]

We also see that there is a slight tendency for arguments that are neither particular-referring (ρ = −0.28) nor kind-referring (ρ = −0.11) to be abstract-referring—for example, power in (13)
falls in the lower left quadrant (only abstract-referring)—but that there are some arguments that refer to abstract particulars and some that refer to abstract kinds—for example, both reputation (14) and argument (15) are abstract, but reputation falls in the lower right quadrant, while argument falls in the upper left.

(13) Power be where power lies.

(14) Meanwhile, his reputation seems to be improving, although Bangs noted a "pretty interesting social dynamic."

(15) The Pew researchers tried to transcend the economic argument.

Figure 3: Distribution of normalized annotations in argument (left) and predicate (right) protocols.

Predicates   We see that there is effectively no tendency (ρ = 0.00) for predicates that refer to particular situations to refer to dynamic events—for example, faxed in (16) falls in the upper right quadrant (particular- and dynamic-referring), while available in (17) falls in the lower right quadrant (particular- and non-dynamic-referring).

(16) I have faxed to you the form of Bond[...]

(17) is gare montparnasse storage still available?

But we do see that there is a slight tendency (ρ = −0.25) for predicates that are hypothetical-referring not to be particular-referring—for example, knows in (18a) and do in (18b) are hypotheticals in the lower left.

(18) a. Who knows what the future might hold, and it might be expensive?
     b. I have tryed to give him water but he wont take it...what should i do?

8 Models

We consider two forms of predicate and argument representations to predict the three attributes in our framework: hand-engineered features and learned features. For both, we contrast both type-level information and token-level information.

Hand-engineered features   We consider five sets of type-level hand-engineered features.

1. Concreteness   Concreteness ratings for root argument lemmas in the argument protocol from the concreteness database (Brysbaert et al., 2014) and the mean, maximum, and minimum concreteness rating of a predicate's arguments in the predicate protocol.

2. Eventivity   Eventivity and stativity ratings for the root predicate lemma in the predicate protocol and the predicate head of the root argument in the argument protocol from the LCS database (Dorr, 2001).

3. VerbNet   Verb classes from VerbNet (Schuler, 2005) for root predicate lemmas.

4. FrameNet   Frames evoked by root predicate lemmas in the predicate protocol and for both
the root argument lemma and its predicate head in the argument protocol from FrameNet (Baker et al., 1998).

5. WordNet   The union of WordNet (Fellbaum, 1998) supersenses (Ciaramita and Johnson, 2003) for all WordNet senses the root argument or predicate lemmas can have.

And we consider two sets of token-level hand-engineered features.

1. Syntactic features   POS tags, UD morphological features, and governing dependencies were extracted using PredPatt for the predicate/argument root and all of its dependents.

2. Lexical features   Function words (determiners, modals, auxiliaries) in the dependents of the arguments and predicates.

Learned features   For our type-level learned features, we use the 42B uncased GloVe embeddings for the root of the annotated predicate or argument (Pennington et al., 2014). For our token-level learned features, we use 1,024-dimensional ELMo embeddings (Peters et al., 2018). To obtain the latter, the UD-EWT sentences are passed as input to the ELMo three-layered biLM, and we extract the output of all three hidden layers for the root of the annotated predicates and arguments, giving us 3,072-dimensional vectors for each.
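One way to obtain such vectors is AllenNLP's ElmoEmbedder, sketched below; whether this matches the authors' exact pipeline is an assumption, but embed_sentence does return one 1,024-dimensional vector per token from each of the three biLM layers, which can be concatenated into the 3,072-dimensional representation described here.

    import numpy as np
    from allennlp.commands.elmo import ElmoEmbedder    # AllenNLP 0.x interface

    elmo = ElmoEmbedder()                               # downloads the default pretrained weights

    def root_vector(tokens, root_index):
        """Concatenate the three ELMo layers at the annotated root token: a 3,072-d vector."""
        layers = elmo.embed_sentence(tokens)            # shape: (3, len(tokens), 1024)
        return np.concatenate([layers[i, root_index] for i in range(3)])

    vec = root_vector(["Bishops", "move", "diagonally", "."], root_index=1)
    print(vec.shape)                                    # (3072,)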
Labeling models   For each protocol, we predict the three normalized properties corresponding to the annotated token(s) using different subsets of the above features. The feature representation is used as the input to a multilayer perceptron with ReLU nonlinearity and L1 loss. The number of hidden layers and their sizes are hyperparameters that we tune on the development set.

Implementation   For all experiments, we use stochastic gradient descent to train the multilayer perceptron parameters with the Adam optimizer (Kingma and Ba, 2015), using the default learning rate in pytorch (10−3). We performed ablation experiments on the four major classes of features discussed above.

Hyperparameters   For each of the ablation experiments, we ran a hyperparameter grid search over hidden layer sizes (one or two hidden layers with sizes h1, h2 ∈ {512, 256, 128, 64, 32}; h2 at most half of h1), L2 regularization penalty l ∈ {0, 10−5, 10−4, 10−3}, and the dropout probability d ∈ {0.1, 0.2, 0.3, 0.4, 0.5}.

Development   For all models, we train for at most 20 epochs with early stopping. At the end of each epoch, the L1 loss is calculated on the development set, and if it is higher than the previous epoch, we stop training, saving the parameter values from the previous epoch.
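A minimal PyTorch sketch of this labeling model under the setup just described (ReLU multilayer perceptron, L1 loss, Adam at PyTorch's default learning rate, with weight decay standing in for the L2 penalty); the particular hidden sizes and dropout value are one point from the grid above, not the selected configuration.

    import torch
    import torch.nn as nn

    class PropertyMLP(nn.Module):
        """Map one token's feature vector to its three normalized referential properties."""
        def __init__(self, in_dim, hidden=(512, 256), dropout=0.3, n_props=3):
            super().__init__()
            layers, prev = [], in_dim
            for h in hidden:
                layers += [nn.Linear(prev, h), nn.ReLU(), nn.Dropout(dropout)]
                prev = h
            layers.append(nn.Linear(prev, n_props))
            self.net = nn.Sequential(*layers)

        def forward(self, x):
            return self.net(x)

    model = PropertyMLP(in_dim=3072)                                  # e.g., ELMo-only features
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
    loss_fn = nn.L1Loss()

    def train_epoch(loader):
        """One pass over the training data; early stopping on development L1 loss sits outside."""
        model.train()
        for features, targets in loader:                              # targets: (batch, 3) scores
            optimizer.zero_grad()
            loss_fn(model(features), targets).backward()
            optimizer.step()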
Evaluation   Consonant with work in event factuality prediction, we report Pearson correlation (ρ) and proportion of mean absolute error (MAE) explained by the model, which we refer to as R1 on analogy with the variance explained R2 = ρ2:

    R1 = 1 − MAE_model^p / MAE_baseline^p,

where MAE_baseline^p is the error of always guessing the median for property p. We calculate R1 across properties (wR1) by taking the mean R1 weighted by the MAE for each property.

These metrics together are useful, because ρ tells us how similar the predictions are to the true values, ignoring scale, and R1 tells us how close the predictions are to the true values, after accounting for variability in the data. We focus mainly on differences in relative performance among our models, but for comparison, state-of-the-art event factuality prediction systems obtain ρ ≈ 0.77 and R1 ≈ 0.57 for predicting event factuality on the predicates we annotate (Rudinger et al., 2018).
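Both metrics are easy to state in code. The sketch below follows the definitions above, with the baseline MAE coming from always guessing the median of the gold scores; which MAE the wR1 weights use (the model's or the baseline's) is not stated, so the sketch uses the model's.

    import numpy as np

    def r1(y_true, y_pred):
        """Proportion of the median-baseline MAE explained by the model."""
        mae_model = np.abs(y_true - y_pred).mean()
        mae_baseline = np.abs(y_true - np.median(y_true)).mean()
        return 1.0 - mae_model / mae_baseline

    def weighted_r1(y_true_by_prop, y_pred_by_prop):
        """Mean R1 across properties, weighted by each property's MAE."""
        maes = np.array([np.abs(t - p).mean() for t, p in zip(y_true_by_prop, y_pred_by_prop)])
        r1s = np.array([r1(t, p) for t, p in zip(y_true_by_prop, y_pred_by_prop)])
        return float((maes * r1s).sum() / maes.sum())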

9 Results

Table 5 contains the results on the test set for both the argument (top) and predicate (bottom) protocols. We see that (i) our models are generally better able to predict referential properties of arguments than those of predicates; (ii) for both predicates and arguments, contextual learned representations contain most of the relevant information, though the addition of hand-engineered features can give a slight performance boost, particularly for the predicate properties; and (iii) the proportion of absolute error explained is significantly lower than what we might expect from the variance explained implied by the correlations. We discuss (i) and (ii) here, deferring discussion of (iii) to §10.

ARGUMENT
  Feature sets (Type, Token, GloVe, ELMo)   Is.Particular ρ/R1   Is.Kind ρ/R1   Is.Abstract ρ/R1   All wR1
  + − − −    42.4 /  7.4    30.2 /  4.9    51.4 / 11.7     8.1
  − + − −    50.6 / 13.0    41.5 /  8.8    33.8 /  4.8     8.7
  − − + −    44.5 /  8.3    33.4 /  4.6    45.2 /  7.7     6.9
  − − − +    57.5 / 17.0    48.1 / 13.3    55.7 / 14.9    15.1
  + + − −    55.3 / 14.1    46.2 / 11.6    52.6 / 13.0    12.9
  − + − +    58.6 / 15.6    48.6 / 13.7    56.8 / 14.2    14.5
  + + − +    58.3 / 16.3    47.8 / 13.2    56.3 / 15.2    14.9
  + + + +    58.1 / 17.0    48.9 / 13.2    56.1 / 15.1    15.1

PREDICATE
  Feature sets (Type, Token, GloVe, ELMo)   Is.Particular ρ/R1   Is.Hypothetical ρ/R1   Is.Dynamic ρ/R1   All wR1
  + − − −    14.0 /  0.8    13.4 /  0.0    32.5 /  5.6     2.0
  − + − −    22.3 /  2.8    37.7 /  7.3    31.7 /  5.1     5.1
  − − + −    20.6 /  2.2    23.4 /  2.4    29.7 /  4.6     3.0
  − − − +    26.2 /  3.6    43.1 / 10.0    37.0 /  6.8     6.8
  − − + +    26.8 /  4.0    42.8 /  8.9    37.3 /  7.3     6.7
  + + − −    24.0 /  3.3    37.9 /  7.6    37.1 /  7.6     6.1
  − + − +    27.4 /  4.1    43.3 / 10.1    38.6 /  7.8     7.4
  + − − +    27.1 /  4.0    43.0 / 10.1    37.5 /  7.6     7.2
  + + + +    26.8 /  4.1    43.5 / 10.3    37.1 /  7.2     7.2

Table 5: Correlation (ρ) and MAE explained (R1) on test split for argument (top) and predicate (bottom) protocols. Bolded numbers give the best result in the column; the models highlighted in blue are the ones analyzed in §10.

Argument properties   While type-level hand-engineered and learned features perform relatively poorly for properties such as IS.PARTICULAR and IS.KIND for arguments, they are able to predict IS.ABSTRACT relatively well compared to the models with all features. The converse of this also holds: Token-level hand-engineered features are better able to predict IS.PARTICULAR and IS.KIND, but perform relatively poorly on their own for IS.ABSTRACT.

This seems likely to be a product of abstract reference being fairly strongly associated with particular lexical items, while most arguments can refer to particulars and kinds (and which they refer to is context-dependent). And in light of the relatively good performance of contextual learned features alone, it suggests that these contextual learned features—in contrast to the hand-engineered token-level features—are able to use this information coming from the lexical item. Interestingly, however, the models with both contextual learned features (ELMo) and hand-engineered token-level features perform slightly better than those without the hand-engineered features across the board, suggesting that there is some (small) amount of contextual information relevant to generalization that the contextual learned features are missing. This performance boost may be diminished by improved contextual encoders, such as BERT (Devlin et al., 2019).

Predicate properties   We see a pattern similar to the one observed for the argument properties mirrored in the predicate properties: Whereas type-level hand-engineered and learned features perform relatively poorly for properties such as IS.PARTICULAR and IS.HYPOTHETICAL, they are able to predict IS.DYNAMIC relatively well compared with the models with all features. The converse of this also holds: Token-level hand-engineered features are better able to predict IS.PARTICULAR and IS.HYPOTHETICAL, but perform relatively poorly on their own for IS.DYNAMIC.

One caveat here is that, unlike for IS.ABSTRACT, type-level learned features (GloVe) alone perform quite poorly for IS.DYNAMIC, and the difference between the models with only type-level hand-engineered features and the ones with only token-level hand-engineered features is less stark for IS.DYNAMIC than for IS.ABSTRACT. This may suggest that, though IS.DYNAMIC is relatively constrained by the lexical item, it may be more contextually determined than IS.ABSTRACT. Another
major difference between the argument properties and the predicate properties is that IS.PARTICULAR is much more difficult to predict than IS.HYPOTHETICAL. This contrasts with IS.PARTICULAR for arguments, which is easier to predict than IS.KIND.

10 Analysis

Figure 4 plots the true (normalized) property values for the argument (top) and predicate (bottom) protocols from the development set against the values predicted by the models highlighted in blue in Table 5. Points are colored by the part-of-speech of the argument or predicate root.

Figure 4: True (normalized) property values for argument (top) and predicate (bottom) protocols in the development set plotted against values predicted by models highlighted in blue in Table 5.

We see two overarching patterns. First, our models are generally reluctant to predict values outside the [−1, 1] range, despite the fact that there are not an insignificant number of true values outside this range. This behavior likely contributes to the difference we saw between the ρ and R1 metrics, wherein R1 was generally worse than we would expect from ρ. This pattern is starkest for IS.PARTICULAR in the predicate protocol, where predictions are nearly all constrained to [0, 1].

Second, the model appears to be heavily reliant on part-of-speech information—or some semantic information related to part-of-speech—for making predictions. This behavior can be seen in the fact that, though common noun-rooted arguments get relatively variable predictions, pronoun- and proper noun-rooted arguments are almost always predicted to be particular, non-kind, non-abstract; and though verb-rooted predicates also get relatively variable predictions, common noun-, adjective-, and proper noun-rooted predicates are almost always predicted to be non-dynamic.

Argument protocol   Proper nouns tend to refer to particular, non-kind, non-abstract entities, but they can be kind-referring, which our models miss: iPhone in (20) and Marines in (19) were predicted to have low kind score and high particular score, while annotators label these arguments as non-particular and kind-referring.

(19) The US Marines took most of Fallujah Wednesday, but still face[...]

(20) I'm writing an essay...and I need to know if the iPhone was the first Smart Phone.

This similarly holds for pronouns. As mentioned in §6, we filtered out several pronominal arguments, but certain pronouns—like you, they, yourself, themselves—were not filtered because they can have both particular- and kind-referring uses. Our models fail to capture instances where pronouns are labeled kind-referring (e.g., you in (21) and (22)), consistently predicting low IS.KIND scores, likely because they are rare in our data.

(21) I like Hayes Street Grill....another plus, it's right by Civic Center, so you can take a romantic walk around the Opera House, City Hall, Symphony Auditorium[...]

(22) What would happen if you flew the flag of South Vietnam in Modern day Vietnam?

This behavior is not seen with common nouns: The model correctly predicts common nouns in certain contexts as non-particular, non-abstract, and kind-referring (e.g., food in (23) and men in (24)).

(23) Kitchen puts out good food[...]

(24) just saying most men suck!

Predicate protocol   As in the argument protocol, general trends associated with part-of-speech are exaggerated by the model. We noted in §7 that annotators tend to annotate hypothetical predicates as non-particular and vice-versa (ρ = −0.25), but the model's predictions are anti-correlated to a much greater extent (ρ = −0.79). For example, annotators are more willing to say a predicate can refer to particular, hypothetical situations (25) or a non-particular, non-hypothetical situation (26).

(25) Read the entire article[...]

(26) it s illegal to sell stolen property, even if you don't know its stolen.

The model also had a bias towards particular predicates referring to dynamic predicates (ρ = 0.34)—a correlation not present among annotators. For instance, is closed in (27) was annotated as particular but non-dynamic but predicted by the model to be particular and dynamic; and helped in (28) was annotated as non-particular and dynamic, but the model predicted particular and dynamic.

(27) library is closed.

(28) I have a new born daughter and she helped me with a lot.

11 Conclusion

We have proposed a novel semantic framework for modeling linguistic expressions of generalization as combinations of simple, real-valued referential properties of predicates and their arguments. We used this framework to construct a dataset covering the entirety of the Universal Dependencies English Web Treebank and probed the ability of both hand-engineered and learned type- and token-level features to predict the annotations in this dataset.

Acknowledgments

We would like to thank three anonymous reviewers and Chris Potts for useful comments on this paper, as well as Scott Grimm and the FACTS.lab at the University of Rochester for useful comments on the framework and protocol design. This research was supported by the University of Rochester, JHU HLTCOE, and DARPA AIDA. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes. The views and conclusions contained in this publication are those of the authors and should not be interpreted as representing official policies or endorsements of DARPA or the U.S. Government.

References

Lasha Abzianidze and Johan Bos. 2017. Towards universal semantic tagging. In IWCS 2017, 12th International Conference on Computational Semantics, Short papers.

Alan Agresti. 2003. Categorical Data Analysis, 482. John Wiley & Sons.

R.H. Baayen. 2008. Analyzing Linguistic Data: A Practical Introduction to Statistics using R. Cambridge University Press.

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet Project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 1, pages 86–90, Montreal.

Lisa Bauer, Yicheng Wang, and Mohit Bansal. 2018. Commonsense for generative multi-hop question answering tasks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4220–4230, Brussels.
Cosmin Adrian Bejan and Sanda Harabagiu. 2010. Unsupervised event coreference resolution with rich linguistic features. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1412–1422, Uppsala.

Ann Bies, Justin Mott, Colin Warner, and Seth Kulick. 2012. English Web Treebank LDC2012T13. Linguistic Data Consortium, Philadelphia, PA.

Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman. 2014. Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46(3):904–911.

Greg Carlson, Rachel Sussman, Natalie Klein, and Michael Tanenhaus. 2006. Weak definite noun phrases. In Proceedings of NELS 36, pages 179–196, Amherst, MA.

Greg N. Carlson. 1977. Reference to Kinds in English. Ph.D. thesis, University of Massachusetts, Amherst.

Gregory Carlson. 2011. Genericity. In (Maienborn et al., 2011), 1153–1185.

Gregory N. Carlson and Francis Jeffry Pelletier. 1995. The Generic Book. The University of Chicago Press.

Massimiliano Ciaramita and Mark Johnson. 2003. Supersense tagging of unknown nouns in WordNet. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 168–175.

Agata Cybulska and Piek Vossen. 2014a. Guidelines for ECB+ annotation of events and their coreference. Technical Report NWR-2014-1, VU University Amsterdam.

Agata Cybulska and Piek Vossen. 2014b. Using a sledgehammer to crack a nut? Lexical diversity and event coreference resolution. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik.

Marie-Catherine De Marneffe, Timothy Dozat, Natalia Silveira, Katri Haverinen, Filip Ginter, Joakim Nivre, and Christopher D. Manning. 2014. Universal Stanford dependencies: A cross-linguistic typology. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 4585–4592, Reykjavik.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, MN.

George R. Doddington, Alexis Mitchell, Mark A. Przybocki, Lance A. Ramshaw, Stephanie Strassel, and Ralph M. Weischedel. 2004. The Automatic Content Extraction (ACE) program—Tasks, data, and evaluation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04), Lisbon.

Bonnie J. Dorr. 2001. LCS Database. University of Maryland.

David Dowty. 1991. Thematic proto-roles and argument selection. Language, 67(3):547–619.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

Annemarie Friedrich and Alexis Palmer. 2014a. Automatic prediction of aspectual class of verbs in context. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 517–523, Baltimore, MD.

Annemarie Friedrich and Alexis Palmer. 2014b. Situation entity annotation. In Proceedings of LAW VIII - The 8th Linguistic Annotation Workshop, pages 149–158, Dublin.

Annemarie Friedrich, Alexis Palmer, Melissa Peate Sørensen, and Manfred Pinkal. 2015. Annotating genericity: A survey, a scheme, and a corpus. In Proceedings of The 9th Linguistic Annotation Workshop, pages 21–30, Denver, CO.
Annemarie Friedrich, Alexis Palmer, and Manfred Pinkal. 2016. Situation entity types: Automatic classification of clause-level aspect. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1757–1768, Berlin.

Annemarie Friedrich and Manfred Pinkal. 2015a. Automatic recognition of habituals: A three-way classification of clausal aspect. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2471–2481, Lisbon.

Annemarie Friedrich and Manfred Pinkal. 2015b. Discourse-sensitive automatic identification of generic expressions. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1272–1281, Beijing.

Andrew Gelman and Jennifer Hill. 2014. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, New York City.

Scott Grimm. 2014. Individuating the abstract. In Proceedings of Sinn und Bedeutung 18, pages 182–200, Bayonne and Vitoria-Gasteiz.

Scott Grimm. 2016. Crime investigations: The countability profile of a delinquent noun. Baltic International Yearbook of Cognition, Logic and Communication, 11. doi:10.4148/1944-3676.1111.

Scott Grimm. 2018. Grammatical number and the scale of individuation. Language, 94(3):527–574.

Jerry R. Hobbs, William Croft, Todd Davies, Douglas Edwards, and Kenneth Laws. 1987. Commonsense metaphysics and lexical semantics. Computational Linguistics, 13(3-4):241–250.

Nancy Ide, Christiane Fellbaum, Collin Baker, and Rebecca Passonneau. 2010. The manually annotated sub-corpus: A community resource for and by the people. In Proceedings of the ACL 2010 Conference Short Papers, pages 68–73, Uppsala.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA.

Heeyoung Lee, Marta Recasens, Angel Chang, Mihai Surdeanu, and Dan Jurafsky. 2012. Joint entity and event coreference resolution across documents. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 489–500, Jeju Island.

Kenton Lee, Yoav Artzi, Yejin Choi, and Luke Zettlemoyer. 2015. Event detection and factuality assessment with non-expert supervision. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1643–1648, Lisbon.

Sarah-Jane Leslie and Adam Lerner. 2016. Generic generalizations. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy, Winter 2016 edition. Metaphysics Research Lab, Stanford University.

Annie Louis and Ani Nenkova. 2011. Automatic identification of general and specific sentences by leveraging discourse annotations. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 605–613, Chiang Mai.

Claudia Maienborn, Klaus von Heusinger, and Paul Portner, editors. 2011. Semantics: An International Handbook of Natural Language Meaning, volume 2. Mouton de Gruyter, Berlin.

Thomas A. Mathew. 2009. Supervised categorization of habitual versus episodic sentences. Master's thesis, Georgetown University.

John McCarthy. 1960. Programs with Common Sense. RLE and MIT Computation Center.

John McCarthy. 1980. Circumscription—A form of nonmonotonic reasoning. Artificial Intelligence, 13(1–2):27–39.

John McCarthy. 1986. Applications of circumscription to formalizing common sense knowledge. Artificial Intelligence, 28:89–116.
Marvin Minsky. 1974. A framework for representing knowledge. MIT-AI Laboratory Memo 306.

Alexis Mitchell, Stephanie Strassel, Mark Przybocki, J. K. Davis, George Doddington, Ralph Grishman, Adam Meyers, Ada Brunstein, Lisa Ferro, and Beth Sundheim. 2003. ACE-2 Version 1.0 LDC2003T11. Linguistic Data Consortium, Philadelphia, PA.

Joakim Nivre, Zeljko Agic, Maria Jesus Aranzabe, Masayuki Asahara, Aitziber Atutxa, Miguel Ballesteros, John Bauer, Kepa Bengoetxea, Riyaz Ahmad Bhat, Cristina Bosco, Sam Bowman, Giuseppe G. A. Celano, Miriam Connor, Marie-Catherine de Marneffe, Arantza Diaz de Ilarraza, Kaja Dobrovoljc, Timothy Dozat, Tomaž Erjavec, Richárd Farkas, Jennifer Foster, Daniel Galbraith, Filip Ginter, Iakes Goenaga, Koldo Gojenola, Yoav Goldberg, Berta Gonzales, Bruno Guillaume, Jan Hajič, Dag Haug, Radu Ion, Elena Irimia, Anders Johannsen, Hiroshi Kanayama, Jenna Kanerva, Simon Krek, Veronika Laippala, Alessandro Lenci, Nikola Ljubešić, Teresa Lynn, Christopher Manning, Cătălina Mărănduc, David Mareček, Héctor Martínez Alonso, Jan Mašek, Yuji Matsumoto, Ryan McDonald, Anna Missilä, Verginica Mititelu, Yusuke Miyao, Simonetta Montemagni, Shunsuke Mori, Hanna Nurmi, Petya Osenova, Lilja Øvrelid, Elena Pascual, Marco Passarotti, Cenel-Augusto Perez, Slav Petrov, Jussi Piitulainen, Barbara Plank, Martin Popel, Prokopis Prokopidis, Sampo Pyysalo, Loganathan Ramasamy, Rudolf Rosa, Shadi Saleh, Sebastian Schuster, Wolfgang Seeker, Mojgan Seraji, Natalia Silveira, Maria Simi, Radu Simionescu, Katalin Simkó, Kiril Simov, Aaron Smith, Jan Štěpánek, Alane Suhr, Zsolt Szántó, Takaaki Tanaka, Reut Tsarfaty, Sumire Uematsu, Larraitz Uria, Viktor Varga, Veronika Vincze, Zdeněk Žabokrtský, Daniel Zeman, and Hanzhi Zhu. 2015. Universal Dependencies 1.2. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Tim O'Gorman, Kristin Wright-Bettner, and Martha Palmer. 2016. Richer event description: Integrating event coreference with temporal, causal and bridging annotation. In Proceedings of the 2nd Workshop on Computing News Storylines (CNS 2016), pages 47–56, Austin, TX.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, LA.

Massimo Poesio. 2004. Discourse annotation and semantic annotation in the GNOME corpus. In Proceedings of the 2004 ACL Workshop on Discourse Annotation, pages 72–79, Barcelona.

Massimo Poesio, Yulia Grishina, Varada Kolhatkar, Nafise Moosavi, Ina Roesiger, Adam Roussel, Fabian Simonjetz, Alexandra Uma, Olga Uryupina, Juntao Yu, and Heike Zinsmeister. 2018. Anaphora resolution with the ARRAU corpus. In Proceedings of the First Workshop on Computational Models of Reference, Anaphora and Coreference, pages 11–22, New Orleans, LA.

James Pustejovsky, Patrick Hanks, Roser Sauri, Andrew See, Robert Gaizauskas, Andrea Setzer, Dragomir Radev, Beth Sundheim, David Day, Lisa Ferro, and Marcia Lazo. 2003. The TIMEBANK corpus. In Proceedings of Corpus Linguistics, pages 647–656, Lancaster.

Drew Reisinger, Rachel Rudinger, Francis Ferraro, Craig Harman, Kyle Rawlins, and Benjamin Van Durme. 2015. Semantic proto-roles. Transactions of the Association for Computational Linguistics, 3:475–488.

Nils Reiter and Anette Frank. 2010. Identifying generic noun phrases. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 40–49, Uppsala.
Raymond Reiter. 1987. Nonmonotonic reasoning. In J. F. Traub, N. J. Nilsson, and B. J. Grosz, editors, Annual Review of Computer Science, volume 2, pages 147–186. Annual Reviews Inc.

Rachel Rudinger, Aaron Steven White, and Benjamin Van Durme. 2018. Neural models of factuality. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 731–744, New Orleans, LA.

Roger C. Schank and Robert P. Abelson. 1975. Scripts, plans, and knowledge. In Proceedings of the 4th International Joint Conference on Artificial Intelligence - Volume 1, pages 151–157.

Karin Kipper Schuler. 2005. VerbNet: A broad-coverage, comprehensive verb lexicon. Ph.D. thesis, Computer and Information Science Department, University of Pennsylvania.

Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel Bowman, Miriam Connor, John Bauer, and Chris Manning. 2014. A gold standard dependency corpus for English. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14).

Gabriel Stanovsky, Judith Eckle-Kohler, Yevgeniy Puzikov, Ido Dagan, and Iryna Gurevych. 2017. Integrating deep linguistic features in factuality prediction over unified datasets. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 352–357, Vancouver.

Siddharth Vashishtha, Benjamin Van Durme, and Aaron Steven White. 2019. Fine-grained temporal relation extraction. arXiv, cs.CL/1902.01390v2.

Zeno Vendler. 1957. Verbs and times. Philosophical Review, 66(2):143–160.

Christopher Walker, Stephanie Strassel, Julie Medero, and Kazuaki Maeda. 2006. ACE 2005 multilingual training corpus LDC2006T06. Linguistic Data Consortium, Philadelphia, PA.

Aaron Steven White, Kyle Rawlins, and Benjamin Van Durme. 2017. The semantic proto-role linking model. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, volume 2, pages 92–98, Valencia.

Aaron Steven White, Drew Reisinger, Keisuke Sakaguchi, Tim Vieira, Sheng Zhang, Rachel Rudinger, Kyle Rawlins, and Benjamin Van Durme. 2016. Universal decompositional semantics on universal dependencies. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1713–1723, Austin, TX.

Sheng Zhang, Rachel Rudinger, Kevin Duh, and Benjamin Van Durme. 2017a. Ordinal common-sense inference. Transactions of the Association for Computational Linguistics, 5:379–395.

Sheng Zhang, Rachel Rudinger, and Benjamin Van Durme. 2017b. An evaluation of PredPatt and Open IE via stage 1 semantic role labeling. In IWCS 2017, 12th International Conference on Computational Semantics, Short papers, Montpellier.