Figure 1: The basic framework of our policy gradient model for one trajectory. The policy network is an end-to-end neural module that generates probability distributions over coreference linking actions. The reward function computes a reward for a trajectory of actions based on coreference evaluation metrics. The solid line indicates model exploration and the (red) dashed line indicates the gradient update.
The reward function is used to measure how good the generated coreference clusters are, which is directly related to the coreference evaluation metrics. Besides, we introduce an entropy regularization term to encourage exploration and prevent the policy from prematurely converging to a bad local optimum. Finally, we update the regularized policy network parameters based on the rewards associated with sequences of sampled actions, which are computed on the whole input document.

We evaluate our end-to-end reinforced coreference resolution model on the English OntoNotes v5.0 benchmark. Our model achieves the new state-of-the-art F1-score of 73.8%, which outperforms the previous best published result (73.0%) of Lee et al. (2018) with statistical significance.
2 Related Work

Closely related to our work are the end-to-end coreference models developed by Lee et al. (2017) and Lee et al. (2018). Different from previous pipeline approaches, Lee et al. used neural networks to learn mention representations and to calculate mention and antecedent scores without using syntactic parsers. However, their models optimize a heuristic loss based on local decisions rather than the actual coreference evaluation metrics, while our reinforcement model directly optimizes the evaluation metrics based on the rewards calculated from sequences of actions.
Our work is also inspired by Clark and Manning (2016a) and Yin et al. (2018), which resolve coreferences with reinforcement learning techniques. They view the mention-ranking model as an agent taking a series of actions, where each action links a mention to a candidate antecedent. They also use pretraining for initialization. Nevertheless, their models assume mentions are given, while our work is end-to-end. Furthermore, we add entropy regularization to encourage more exploration (Mnih et al.; Eysenbach et al., 2019) and prevent our model from prematurely converging to a sub-optimal (or bad) local optimum.

3 Methodology

3.1 Task definition

Given a document, the task of end-to-end coreference resolution aims to identify a set of mention clusters, each of which refers to the same entity. Following Lee et al. (2017), we formulate the task as a sequence of linking decisions for each span $i$ to the set of its possible antecedents, denoted as $\mathcal{Y}(i) = \{\epsilon, 1, \cdots, i-1\}$: a dummy antecedent $\epsilon$ and all preceding spans. In particular, the dummy antecedent for a span handles two possible scenarios: (i) the span is not an entity mention, or (ii) the span is an entity mention but is not coreferent with any previous span. The final coreference clusters can be recovered with a backtracking step on the antecedent predictions.
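To make the backtracking step concrete, here is a minimal sketch of how final clusters could be recovered from per-span antecedent decisions. The function name and the convention of using None for the dummy antecedent are illustrative, not part of the original model.

```python
from collections import defaultdict

def recover_clusters(antecedents):
    """Recover coreference clusters from antecedent predictions.

    antecedents[i] is the predicted antecedent index for span i,
    or None when the dummy antecedent (epsilon) was chosen.
    Spans are assumed to be indexed in document order.
    """
    cluster_id = {}                 # span index -> cluster id
    clusters = defaultdict(list)    # cluster id -> list of span indices
    for i, ant in enumerate(antecedents):
        if ant is None:             # dummy antecedent: start a new (possibly singleton) entity
            cid = len(clusters)
        else:                       # link back to the antecedent's cluster
            cid = cluster_id[ant]
        cluster_id[i] = cid
        clusters[cid].append(i)
    # singletons are commonly dropped before CoNLL-style scoring
    return [spans for spans in clusters.values() if len(spans) > 1]

# toy example: span 2 links to span 0, span 4 links to span 2
print(recover_clusters([None, None, 0, None, 2]))   # [[0, 2, 4]]
```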
3.2 Our Model

Figure 2 illustrates a demonstration of our iterative coreference resolution model on a document. Given a document, our model first identifies the top-scored mentions, and then conducts a sequence of actions $a_{1:T} = \{a_1, a_2, \cdots, a_T\}$ over them, where $T$ is the number of mentions and each action $a_t$ assigns mention $t$ to a candidate antecedent $y_t \in \mathcal{Y}_t = \{\epsilon, 1, \cdots, t-1\}$. The state at time $t$ is defined as $S_t = \{g_1, \cdots, g_{t-1}, g_t\}$, where $g_i$ is mention $i$'s representation.

Once our model has finished all the actions, it observes a reward $R(a_{1:T})$. The calculated gradients are then propagated to update the model parameters. We use the average of the three metrics MUC (Grishman and Sundheim, 1995), B³ (Recasens and Hovy, 2011) and CEAF_φ4 (Cai and Strube, 2010) as the reward. Following Clark and Manning (2016a), we assume actions are independent and the next state $S_{t+1}$ is generated based on the natural order of the starting position and then the end position of the mentions, regardless of action $a_t$.
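Below is a small sketch of how the trajectory-level reward and the mention processing order described above could be computed. The metric callables (muc, b_cubed, ceaf_phi4) are placeholders for external scorer implementations and are not provided here.

```python
def trajectory_reward(predicted_clusters, gold_clusters,
                      muc, b_cubed, ceaf_phi4):
    """Average-F1 reward for one trajectory of linking actions.

    predicted_clusters / gold_clusters: lists of mention clusters.
    muc, b_cubed, ceaf_phi4: callables returning an F1 score; concrete
    scorer implementations are assumed to exist elsewhere.
    """
    scores = [metric(predicted_clusters, gold_clusters)
              for metric in (muc, b_cubed, ceaf_phi4)]
    return sum(scores) / len(scores)

def mention_order(mentions):
    """Process mentions in the natural order of start, then end position."""
    return sorted(mentions, key=lambda span: (span[0], span[1]))
```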
[Figure 2 diagram: (a) State S_t; (b) Policy network p_θ(a_t | S_t); (c) Action; (d) Execute action; (e) Update environment and compute reward r.]
Figure 2: A demonstration of our reinforced coreference resolution method on a document with 6 mentions. The upper and lower rows correspond to steps 5 and 6, respectively, in which the policy network selects mention (2) as the antecedent of mention (5) and leaves mention (6) as a singleton mention. The red (gray) nodes represent processed (current) mentions, and edges between them indicate currently predicted coreferential relations. The gray rectangles around circles are span embeddings, and the reward is calculated at the end of the trajectory.
[Figure 3 diagram; components include a char & word embedding encoder (BiLSTM).]

Figure 3: Architecture of the policy network. The components in the dashed square iteratively refine the span representations. The last layer is a masked softmax layer that computes a probability distribution only over the candidate antecedents for each mention. We omit the span generation and pruning component for simplicity.

Policy Network: We adopt the state-of-the-art end-to-end neural coreference scoring architecture from Lee et al. (2018) and add a masked softmax layer to compute the probability distribution over actions, as illustrated in Figure 3. The success of their approach lies in two aspects: (i) a coarse-to-fine pruning to reduce the search space, and (ii) an iterative procedure that refines the span representations with a self-attention mechanism averaging over the representations from previous iterations.
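As an illustration of the masked softmax layer, the following PyTorch-style sketch computes a probability distribution only over the valid candidate antecedents (the dummy plus the preceding, unpruned mentions). The tensor shapes and the use of PyTorch are assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def masked_softmax(scores, mask):
    """Probability distribution over candidate antecedents only.

    scores: [num_mentions, num_candidates] raw antecedent scores
            (column 0 can hold the dummy antecedent with score 0).
    mask:   same shape, True for valid candidates (dummy + preceding
            mentions that survived pruning), False elsewhere.
    """
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1)

# toy example: 3 mentions, up to 1 + 2 candidates each
scores = torch.zeros(3, 3)
mask = torch.tensor([[True, False, False],    # mention 1: only the dummy
                     [True, True,  False],    # mention 2: dummy + mention 1
                     [True, True,  True]])    # mention 3: dummy + mentions 1, 2
print(masked_softmax(scores, mask))
```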
For the dummy antecedent, the score $s(i, \epsilon)$ is fixed to 0. Here $s_m(\cdot)$ is the mention score function, $s_c(\cdot, \cdot)$ is a bilinear score function used to prune antecedents, and $s_a(\cdot, \cdot)$ is the antecedent score function. Let $g_i$ denote the refined representation for span $i$ after gating; the three functions are $s_m(i) = \theta_m^\top \mathrm{FFNN}_m(g_i)$, $s_c(i, j) = g_i^\top \Theta_c\, g_j$, and $s_a(i, j)$ is:

$$s_a(i, j) = \theta_a^\top \mathrm{FFNN}_a([g_i, g_j, g_i \circ g_j, \phi(i, j)])$$

where FFNN denotes a feed-forward neural network and $\circ$ denotes the element-wise product. $\theta_m$, $\Theta_c$ and $\theta_a$ are network parameters. $\phi(i, j)$ is the feature vector encoding speaker and genre information from the metadata.
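The three scoring functions above can be sketched as follows. The hidden size, the FFNN depth, and the class name Scorers are illustrative choices rather than the exact configuration used by the authors.

```python
import torch
import torch.nn as nn

class Scorers(nn.Module):
    """Mention score s_m, coarse bilinear score s_c, and antecedent score s_a."""

    def __init__(self, span_dim, feat_dim, hidden=150):
        super().__init__()
        # s_m(i) = theta_m^T FFNN_m(g_i)
        self.ffnn_m = nn.Sequential(nn.Linear(span_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, 1))
        # s_c(i, j) = g_i^T Theta_c g_j  (bilinear score used for antecedent pruning)
        self.theta_c = nn.Parameter(torch.empty(span_dim, span_dim))
        nn.init.xavier_uniform_(self.theta_c)
        # s_a(i, j) = theta_a^T FFNN_a([g_i, g_j, g_i * g_j, phi(i, j)])
        self.ffnn_a = nn.Sequential(nn.Linear(3 * span_dim + feat_dim, hidden),
                                    nn.ReLU(), nn.Linear(hidden, 1))

    def mention_score(self, g):                    # g: [n, span_dim]
        return self.ffnn_m(g).squeeze(-1)          # [n]

    def coarse_score(self, g_i, g_j):              # g_i, g_j: [n, span_dim]
        return g_i @ self.theta_c @ g_j.t()        # [n, n]

    def antecedent_score(self, g_i, g_j, phi_ij):  # pairwise: [m, span_dim], [m, feat_dim]
        pair = torch.cat([g_i, g_j, g_i * g_j, phi_ij], dim=-1)
        return self.ffnn_a(pair).squeeze(-1)       # [m]
```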
The Reinforced Algorithm: We explore using the policy gradient algorithm to maximize the expected reward:

$$J(\theta) = \mathbb{E}_{a_{1:T} \sim p_\theta(a)}\,[R(a_{1:T})] \qquad (3)$$

Computing the exact gradient of $J(\theta)$ is infeasible due to the expectation over all possible action sequences. Instead, we use Monte-Carlo methods to approximate the gradient from sampled trajectories of actions.
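A minimal sketch of the resulting Monte-Carlo policy gradient update with the entropy regularization term mentioned earlier; the baseline and the entropy weight beta are illustrative assumptions, since this excerpt does not specify the exact variance-reduction scheme.

```python
import torch

def reinforce_loss(log_probs, entropies, rewards, baseline=0.0, beta=0.01):
    """Monte-Carlo policy gradient loss with entropy regularization.

    log_probs: [num_traj, T] log p_theta(a_t | S_t) of the sampled actions
    entropies: [num_traj, T] entropy of each action distribution
    rewards:   [num_traj]    trajectory rewards R(a_{1:T}) (avg. of the metrics)
    baseline:  scalar baseline to reduce variance (illustrative)
    beta:      weight of the entropy bonus (illustrative)
    """
    advantage = (rewards - baseline).detach()                 # [num_traj]
    # REINFORCE: maximize E[R], i.e. minimize -(R - b) * sum_t log p(a_t)
    pg_loss = -(advantage * log_probs.sum(dim=1)).mean()
    entropy_bonus = entropies.sum(dim=1).mean()
    return pg_loss - beta * entropy_bonus
```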
Model                         MUC (Prec. Rec. F1)   B³ (Prec. Rec. F1)   CEAF_φ4 (Prec. Rec. F1)   Avg. F1
Wiseman et al. (2016) 77.5 69.8 73.4 66.8 57.0 61.5 62.1 53.9 57.7 64.2
Clark and Manning (2016a) 79.2 70.4 74.6 69.9 58.0 63.4 63.5 55.5 59.2 65.7
Clark and Manning (2016b) 79.9 69.3 74.2 71.0 56.5 63.0 63.8 54.3 58.7 65.3
Lee et al. (2017) 78.4 73.4 75.8 68.6 61.8 65.0 62.7 59.0 60.8 67.2
Zhang et al. (2018) 79.4 73.8 76.5 69.0 62.3 65.5 64.9 58.3 61.4 67.8
Luan et al. (2018)* 78.6 77.1 77.9 66.3 65.4 65.9 66.0 63.1 64.5 69.4
Lee et al. (2018)* 81.4 79.5 80.4 72.2 69.5 70.8 68.2 67.1 67.6 73.0
Our base reinforced model 79.0 76.9 77.9 66.8 64.9 65.8 66.5 63.0 64.7 69.5
+ Entropy Regularization 79.6 77.2 78.4 70.7 65.1 67.8 67.6 63.4 65.4 70.5
+ ELMo embedding* 85.4 77.9 81.4 77.9 66.4 71.7 70.6 66.3 68.4 73.8
Table 1: Experimental results with the MUC, B³ and CEAF_φ4 metrics on the test set of English OntoNotes. The models marked with * utilize word embeddings from the deep language model ELMo (Peters et al., 2018). The F1 improvement is statistically significant under a t-test with p < 0.05, compared with Lee et al. (2018).
Model            Prec.  Rec.  F1
Our full model   89.6   82.2  85.7