Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2325–2330,
Austin, Texas, November 1-5, 2016. © 2016 Association for Computational Linguistics
[Figure 1: four canonical segmentation parse trees, (a)–(d), for untestably and unlockable.]
Figure 1: Canonical segmentation parse trees for untestably and unlockable. For both words, the scope of un is ambiguous.
Arguably, (a) is the only correct parse tree for untestably; the reading associated with (b) is hard to get. On the other hand,
unlockable is truly ambiguous between “able to be unlocked” (c) and “unable to be locked” (d).
readings of untestably depending on how the negative prefix un scopes. The correct tree (see Figure 1) yields the reading "in the manner of not able to be tested." A second, likely infelicitous, reading—where the segment untest forms a constituent—yields the reading "in a manner of being able to untest." Recovering the hierarchical structure allows us to select the correct reading; note that there are even cases of true ambiguity, e.g., unlockable has two readings: "unable to be locked" and "able to be unlocked."

We also note that theoretical linguists often implicitly assume a context-free treatment of word formation, e.g., by employing brackets to indicate different levels of affixation. Others have explicitly modeled word-internal structure with grammars (Selkirk, 1982; Marvin, 2002).

3 Parsing the Lexicon

A novel component of this work is the development of a discriminative parser (Finkel et al., 2008; Hall et al., 2014) for morphology. The goal is to define a probability distribution over all trees that could arise from the input word, after reversal of orthographic and phonological processes. We employ the simple grammar shown in Table 1. Despite its simplicity, it models the order in which morphemes are attached.

More formally, our goal is to map a surface form w (e.g., w=untestably) into its underlying canonical form u (e.g., u=untestablely) and then into a parse tree t over its morphemes. We assume u, w ∈ Σ*, for some discrete alphabet Σ.² Note that a parse tree over the string implicitly defines a flat segmentation given our grammar—one can simply extract the characters spanned by all preterminals in the resulting tree. Before describing the joint model in detail, we first consider its pieces individually.

3.1 Restoring Orthographic Changes

To extract a canonical segmentation (Naradowsky and Goldwater, 2009; Cotterell et al., 2016), we restore orthographic changes that occur during word formation. To this end, we define the score function

    scoreη(u, a, w) = exp( g(u, a, w)⊤ η )    (1)

where a is a monotonic alignment between the strings u and w. The goal is for scoreη to assign higher values to better matched pairs, e.g., (w=untestably, u=untestablely). We refer to Dreyer et al. (2008) for a thorough exposition.

For ease of computation, we can encode this function as a weighted finite-state machine (WFST) (Mohri et al., 2002). This requires, however, that the feature function g factor over the topology of the finite-state encoding. Since our model conditions on the word w, the feature function g can extract features from any part of this string. Features on the output string, u, however, are more restricted. In this work, we employ a bigram model over output characters. This implies that each state remembers exactly one character: the previous one. See Cotterell et al. (2014) for details. We can compute the score for two strings u and w using a weighted generalization of the Levenshtein algorithm. Computing the partition function requires a different dynamic program, which runs in O(|w|² · |Σ|²) time. Note that since |Σ| ≈ 26 (lower-case English letters), it takes roughly 26² = 676 times longer to compute the partition function than to score a pair of strings.

Our model includes several simple feature templates, including features that fire on individual edit actions as well as conjunctions of edit actions and characters in the surrounding context. See Cotterell et al. (2016) for details.

² For efficiency, we assume u ∈ Σ^{|w|+k}, k = 5.
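As noted in Section 3, a parse under the grammar implicitly defines a flat segmentation: one reads off the characters spanned by the preterminals, left to right. A minimal sketch in Python; the tuple encoding of trees here is our own illustration, not the authors' data format:

```python
# A parse tree as (label, children...) tuples; a preterminal's single
# child is its segment string. This tree encodes [[un [[test] able]] ly]
# for the canonical form u = "untestablely".
TREE = ("WORD",
        ("WORD",
         ("PREFIX", "un"),
         ("WORD",
          ("WORD", "test"),
          ("SUFFIX", "able"))),
        ("SUFFIX", "ly"))

def flat_segmentation(tree):
    """Read off the flat segmentation implied by a parse: the segments
    spanned by the preterminals, in left-to-right order."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]          # preterminal: yield its segment
    segs = []
    for child in children:            # binary rule: recurse on both spans
        segs += flat_segmentation(child)
    return segs

print(flat_segmentation(TREE))        # ['un', 'test', 'able', 'ly']
```

Both parses of an ambiguous word such as unlockable yield the same flat segmentation, which is exactly why the tree carries information that the segmentation alone does not.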
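The weighted generalization of the Levenshtein algorithm used to score a string pair can be sketched as follows. This is an illustrative stand-in rather than the authors' implementation: the per-action weights η are invented, and only bare edit actions are featurized (the paper's model also conjoins edit actions with surrounding characters):

```python
import math

# Hypothetical per-edit-action log-weights (a toy stand-in for eta).
ETA = {"copy": 2.0, "sub": -1.0, "ins": -2.5, "del": -2.0}

def alignment_total_score(u: str, w: str) -> float:
    """Sum of exp-scores over all monotonic alignments of u and w,
    via a weighted generalization of the Levenshtein DP.
    Runs in O(|u| * |w|) time for a single string pair."""
    n, m = len(u), len(w)
    # chart[i][j] = total exp-score of aligning u[:i] with w[:j]
    chart = [[0.0] * (m + 1) for _ in range(n + 1)]
    chart[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0:          # u[i-1] aligned to nothing (inserted in u)
                total += chart[i - 1][j] * math.exp(ETA["ins"])
            if j > 0:          # w[j-1] aligned to nothing (deleted from u)
                total += chart[i][j - 1] * math.exp(ETA["del"])
            if i > 0 and j > 0:  # u[i-1] aligned to w[j-1]
                act = "copy" if u[i - 1] == w[j - 1] else "sub"
                total += chart[i - 1][j - 1] * math.exp(ETA[act])
            chart[i][j] = total
    return chart[n][m]
```

For example, `alignment_total_score("untestablely", "untestably")` scores the canonical/surface pair from the running example. The O(|w|²·|Σ|²) partition-function computation mentioned above requires a different dynamic program, not shown here.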
ROOT → WORD
WORD → PREFIX WORD
WORD → WORD SUFFIX
WORD → Σ+
PREFIX → Σ+
SUFFIX → Σ+

Table 1: The context-free grammar used in this work to model word formation. The productions closely resemble those of Johnson et al. (2006)'s Adaptor Grammar.

3.2 Morphological Analysis as Parsing

Next, we need to score an underlying canonical form (e.g., u=untestablely) together with a parse tree (e.g., t=[[un[[test]able]]ly]). Thus, we define the parser score with the following function

    scoreω(t, u) = exp( Σ_{π∈Π(t)} f(π, u)⊤ ω )    (2)

where Π(t) is the set of anchored productions in the tree t. An anchored production π is a grammar rule in Chomsky normal form attached to a span, e.g., A_{i,k} → B_{i,j} C_{j,k}. Each π is then assigned a weight by the linear function f(π, u)⊤ ω, where the function f extracts relevant features from the anchored production as well as the corresponding span of the underlying form u. This model is typically referred to as a weighted CFG (WCFG) (Smith and Johnson, 2007) or a CRF parser.

For f, we define three span features: (i) indicator features on the span's segment, (ii) an indicator feature that fires if the segment appears in an external corpus³ and (iii) the conjunction of the segment with the label (e.g., PREFIX) of the subtree root. Following Hall et al. (2014), we employ an indicator feature for each production as well as

alignments to the original word

    pθ(t, a, u | w) = (1 / Zθ(w)) · scoreω(t, u) · scoreη(u, a, w)    (3)

where θ = {ω, η} is the parameter vector and the normalizing partition function is defined as

    Zθ(w) = Σ_{u′∈Σ^{|w|+k}} Σ_{a∈A(u′,w)} Σ_{t′∈T(u′)} scoreω(t′, u′) · scoreη(u′, a, w)    (4)

where T(u) is the set of all parse trees for the string u. This involves a sum over all possible underlying orthographic forms and all parse trees for those forms.

The joint approach has the advantage that it allows both factors to work together to influence the choice of the underlying form u. This is useful as the parser now has access to which words are attested in the language; this helps guide the relatively weak transduction model. On the downside, the partition function Zθ now involves a sum over all strings in Σ^{|w|+k} and all possible parses of each string! Finally, we define the marginal distribution over trees and underlying forms as

    pθ(t, u | w) = Σ_{a∈A(u,w)} pθ(t, a, u | w)    (5)

where A(u, w) is the set of all monotonic alignments between u and w. The marginalized form in eq. (5) is our final model of morphological segmentation since we are not interested in the latent alignments a.

4.1 Learning and Inference

We use stochastic gradient descent to optimize the log-probability of the training data Σ_{n=1}^N log pθ(t⁽ⁿ⁾, u⁽ⁿ⁾ | w⁽ⁿ⁾); this requires the
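The inner sum over T(u′) in Zθ(w) can be computed with an inside (CKY-style) dynamic program over the grammar of Table 1. The sketch below is a simplified illustration: the anchored-production weight function is a placeholder for f(π, u)⊤ω, and the ROOT → WORD weight is folded into the top-level call:

```python
import math
from functools import lru_cache

def inside_total(u, weight):
    """Total exp-score over all parses of u under the Table 1 grammar,
    via the inside algorithm. `weight(rule, i, k, j)` is a hypothetical
    log-weight of an anchored production spanning u[i:k] with split j;
    rules are WORD -> PREFIX WORD, WORD -> WORD SUFFIX, and the unary
    X -> chars for X in {WORD, PREFIX, SUFFIX}."""

    @lru_cache(maxsize=None)
    def inside(label, i, k):
        # Any span may rewrite directly to its characters (X -> Sigma+).
        total = math.exp(weight((label, "chars"), i, k, None))
        if label == "WORD":
            for j in range(i + 1, k):
                # WORD -> PREFIX WORD, split point j
                total += (inside("PREFIX", i, j) * inside("WORD", j, k)
                          * math.exp(weight(("WORD", "PREFIX WORD"), i, k, j)))
                # WORD -> WORD SUFFIX, split point j
                total += (inside("WORD", i, j) * inside("SUFFIX", j, k)
                          * math.exp(weight(("WORD", "WORD SUFFIX"), i, k, j)))
        return total

    return inside("WORD", 0, len(u))
```

With all weights set to zero this simply counts parses, which is a convenient sanity check: a one-character word has a single parse, a two-character word has three.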
        |              Segmentation               |   Tree
        | Morph. F1    Edit         Acc.          | Const. F1
Flat    | 78.89 (0.9)  0.72 (0.04)  72.88 (1.21)  | N/A
Hier    | 85.55 (0.6)  0.55 (0.03)  73.19 (1.09)  | 79.01 (0.5)

Table 2: Results for the 10 splits of the treebank. Segmentation quality is measured by morpheme F1, edit distance and accuracy; tree quality by constituent F1.
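As a concrete reference for the segmentation metrics in Table 2, here is one common instantiation of morpheme F1: micro-averaged over words, with each word's segmentation treated as a multiset of segments. Exact scoring details vary across papers, so this is a sketch rather than the paper's evaluation script:

```python
from collections import Counter

def morpheme_f1(golds, preds):
    """Micro-averaged F1 over predicted morphemes. Each word's
    segmentation is compared as a multiset of segments (one common
    instantiation of the metric; conventions differ in the literature)."""
    tp = fp = fn = 0
    for gold, pred in zip(golds, preds):
        g, p = Counter(gold), Counter(pred)
        overlap = sum((g & p).values())   # multiset intersection
        tp += overlap
        fp += sum(p.values()) - overlap   # predicted but not in gold
        fn += sum(g.values()) - overlap   # gold but not predicted
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Toy gold/predicted canonical segmentations (invented for illustration).
golds = [["un", "test", "able", "ly"], ["un", "lock", "able"]]
preds = [["un", "test", "ably"],       ["un", "lock", "able"]]
```

Unlike full-form accuracy, which scores a word 0 unless every segment is correct, this metric awards partial credit, which is why the F1 gap in Table 2 can be large while the accuracy gap stays small.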
bracketing paradox (Pesetsky, 1985; Embick, 2015). We annotate exclusively at the syntactic-semantic tier.

Thirdly, in the context of derivational morphology, we force spans to be words themselves. Since derivational morphology—by definition—forms new words from existing words (Lieber and Štekauer, 2014), it follows that each span rooted with WORD or ROOT in the correct parse corresponds to a word in the lexicon. For example, consider unlickable. The correct parse, under our scheme, is [un [[lick] able]]. Each of the spans (lick, lickable and unlickable) exists as a word. By contrast, the parse [[un [lick]] able] contains the span unlick, which is not a word in the lexicon. The span in the segmented form may involve changes, e.g., [un [[achieve] able]], where achieveable is not a word, but achievable (after deleting e) is.

7 Experiments

We run a simple experiment to show the empirical utility of parsing words—we compare a WCFG-based canonical segmenter with the semi-Markov segmenter introduced in Cotterell et al. (2016). We divide the corpus into 10 distinct train/dev/test splits with 5454 words for train and 1000 for each of dev and test. We report three evaluation metrics: full form accuracy, morpheme F1 (Van den Bosch and Daelemans, 1999) and average edit distance to the gold segmentation with boundaries marked by a distinguished symbol. For the WCFG model, we also report constituent F1—typical for sentential constituency parsing—as a baseline for future systems. This F1 measures how well we predict the whole tree (not just a segmentation). For all models, we use L2 regularization and run 100 epochs of ADAGRAD (Duchi et al., 2011) with early stopping. We tune the regularization coefficient by grid search considering λ ∈ {0.0, 0.1, 0.2, 0.3, 0.4, 0.5}.

7.1 Results and Discussion

Table 2 shows the results. The hierarchical WCFG model outperforms the flat semi-Markov model on all metrics on the segmentation task. This shows that modeling structure among the morphemes indeed does help segmentation. The largest improvements are found under the morpheme F1 metric (≈ 6.5 points). In contrast, accuracy improves by < 1%. Edit distance is in between with an improvement of 0.2 characters. Accuracy, in general, is an all-or-nothing metric since it requires getting every canonical segment correct. Morpheme F1, on the other hand, gives us partial credit. Thus, what this shows us is that the WCFG gets a lot more of the morphemes in the held-out set correct, even if it only gets a few more complete forms correct. We provide additional results evaluating the entire tree with constituency F1 as a future baseline.

8 Conclusion

We presented a discriminative CFG-based model for canonical morphological segmentation and showed empirical improvements on its ability to segment words under three metrics. We argue that our hierarchical approach to modeling morphemes is more often appropriate than the traditional flat segmentation. Additionally, we have annotated 7454 words with a morphological constituency parse. The corpus is available online at http://ryancotterell.github.io/data/morphological-treebank to allow for exact comparison and to spark future research.

Acknowledgements

The first author was supported by a DAAD Long-Term Research Grant and an NDSEG fellowship. The third author was supported by DFG (SCHU 2246/10-1).

References

Mohamed Afify, Ruhi Sarikaya, Hong-Kwang Jeff Kuo, Laurent Besacier, and Yuqing Gao. 2006. On the use of morphological analysis for dialectal Arabic speech recognition. In INTERSPEECH.

Mark Aronoff. 1976. Word Formation in Generative Grammar. MIT Press.

R. Harald Baayen, Richard Piepenbrock, and H. van Rijn. 1993. The CELEX lexical data base on CD-ROM.

James K. Baker. 1979. Trainable grammars for speech recognition. The Journal of the Acoustical Society of America, 65(S1):S132–S132.

Kenneth R. Beesley and Lauri Karttunen. 2003. Finite-state Morphology: Xerox Tools and Techniques. CSLI, Stanford.

Yoshua Bengio, Jean-Sébastien Senécal, et al. 2003. Quick training of probabilistic neural nets by importance sampling. In AISTATS.

Antal Van den Bosch and Walter Daelemans. 1999. Memory-based morphological analysis. In ACL.
Jan A. Botha and Phil Blunsom. 2013. Adaptor grammars for learning non-concatenative morphology. In EMNLP, pages 345–356.

Ann Clifton and Anoop Sarkar. 2011. Combining morpheme-based machine translation with post-processing morpheme prediction. In ACL.

Ryan Cotterell, Nanyun Peng, and Jason Eisner. 2014. Stochastic contextual edit distance and probabilistic FSTs. In ACL.

Ryan Cotterell, Tim Vieira, and Hinrich Schütze. 2016. A joint model of orthography and morphological segmentation. In NAACL.

Markus Dreyer, Jason R. Smith, and Jason Eisner. 2008. Latent-variable modeling of string transductions with finite-state methods. In EMNLP.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121–2159.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In NAACL.

David Embick. 2015. The Morpheme: A Theoretical Introduction, volume 31. Walter de Gruyter GmbH & Co KG.

Jenny Rose Finkel, Alex Kleeman, and Christopher D. Manning. 2008. Efficient, feature-based, conditional random field parsing. In ACL, volume 46, pages 959–967.

David Leo Wright Hall, Greg Durrett, and Dan Klein. 2014. Less grammar, more features. In ACL, pages 228–237.

Sébastien Jean, KyungHyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In ACL, pages 1–10.

Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. 2006. Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. In NIPS, pages 641–648.

Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. 2007. Bayesian inference for PCFGs via Markov Chain Monte Carlo. In HLT-NAACL, pages 139–146.

Daphne Koller and Nir Friedman. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press.

Mehryar Mohri, Fernando Pereira, and Michael Riley. 2002. Weighted finite-state transducers in speech recognition. Computer Speech & Language, 16(1):69–88.

Jason Naradowsky and Sharon Goldwater. 2009. Improving morphology induction by learning spelling rules. In IJCAI.

Karthik Narasimhan, Damianos Karakos, Richard Schwartz, Stavros Tsakalidis, and Regina Barzilay. 2014. Morphological segmentation for keyword spotting. In EMNLP.

David Pesetsky. 1985. Morphology and logical form. Linguistic Inquiry, 16(2):193–246.

Teemu Ruokolainen, Oskar Kohonen, Kairit Sirts, Stig-Arne Grönroos, Mikko Kurimo, and Sami Virpioja. 2016. Comparative study of minimally supervised morphological segmentation. Computational Linguistics.

Helmut Schmid. 2005. Disambiguation of morphological structure using a PCFG. In EMNLP, pages 515–522. Association for Computational Linguistics.

Helmut Schmid, Arne Fitschen, and Ulrich Heid. 2004. SMOR: A German computational morphology covering derivation, composition and inflection. In LREC.

Wolfgang Seeker and Özlem Çetinoğlu. 2015. A graph-based lattice dependency parser for joint morphological segmentation and syntactic analysis. TACL.

Elisabeth Selkirk. 1982. The Syntax of Words. Number 7 in Linguistic Inquiry Monograph Series. MIT Press.

Kairit Sirts and Sharon Goldwater. 2013. Minimally-supervised morphological segmentation using adaptor grammars. TACL, 1:255–266.

Noah A. Smith and Mark Johnson. 2007. Weighted and probabilistic context-free grammars are equally expressive. Computational Linguistics, 33(4):477–491.

Meishan Zhang, Yue Zhang, Wanxiang Che, and Ting Liu. 2014. Character-level Chinese dependency parsing. In ACL, pages 1326–1336.
A Derivation of Eq. 6

Here we provide the gradient of the log-partition function as an expectation:

∇θ log Zθ(w)
  = (1/Zθ(w)) ∇θ Zθ(w)    (7)
  = (1/Zθ(w)) ∇θ Σ_{u′∈Σ^{|w|+k}} Σ_{a∈A(u′,w)} Σ_{t′∈T(u′)} scoreω(t′, u′) · scoreη(u′, a, w)
  = (1/Zθ(w)) Σ_{u′∈Σ^{|w|+k}} Σ_{a∈A(u′,w)} Σ_{t′∈T(u′)} ∇θ [scoreω(t′, u′) · scoreη(u′, a, w)]
  = (1/Zθ(w)) Σ_{u′∈Σ^{|w|+k}} Σ_{a∈A(u′,w)} Σ_{t′∈T(u′)} [scoreη(u′, a, w) · ∇ω scoreω(t′, u′) + scoreω(t′, u′) · ∇η scoreη(u′, a, w)]
  = (1/Zθ(w)) Σ_{u′∈Σ^{|w|+k}} Σ_{a∈A(u′,w)} Σ_{t′∈T(u′)} scoreη(u′, a, w) · scoreω(t′, u′) [Σ_{π∈Π(t′)} f(π, u′)⊤ + g(u′, a, w)⊤]
  = Σ_{u′∈Σ^{|w|+k}} Σ_{a∈A(u′,w)} Σ_{t′∈T(u′)} [scoreη(u′, a, w) · scoreω(t′, u′) / Zθ(w)] · [Σ_{π∈Π(t′)} f(π, u′)⊤ + g(u′, a, w)⊤]
  = E_{(t,a,u)∼pθ(·|w)} [Σ_{π∈Π(t)} f(π, u)⊤ + g(u, a, w)⊤]    (8)

The result above can be further improved through Rao-Blackwellization. In this case, when we sample a tree–underlying form pair (t, u), we marginalize out all alignments that could have given rise to the sampled pair. The final derivation is shown below:

∇θ log Zθ(w) = E_{(t,a,u)∼pθ(·|w)} [Σ_{π∈Π(t)} f(π, u)⊤ + g(u, a, w)⊤]
             = E_{(t,u)∼pθ(·|w)} [Σ_{π∈Π(t)} f(π, u)⊤ + Σ_{a∈A(u,w)} pθ(a | u, w) g(u, a, w)⊤]    (9)

The estimator in eq. (9) will have lower variance than that in eq. (8).
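The identity in eq. (8), namely that ∇θ log Zθ equals a model expectation of the feature vector, holds for any log-linear model. The following self-contained toy sketch (the outcome space, features, and θ are invented for illustration and have nothing to do with the paper's model) checks the identity numerically against central differences:

```python
import math

# A tiny log-linear model over four discrete outcomes.
OUTCOMES = ["aa", "ab", "ba", "bb"]

def features(x):
    """Toy feature vector: count of 'a', and an indicator of a doubled character."""
    return [x.count("a"), float(x[0] == x[1])]

def log_z(theta):
    """log-partition function of the log-linear model p(x) ∝ exp(θ·f(x))."""
    return math.log(sum(math.exp(sum(t * f for t, f in zip(theta, features(x))))
                        for x in OUTCOMES))

def expected_features(theta):
    """Model expectation E_p[f(x)] — the analytic gradient of log_z."""
    z = math.exp(log_z(theta))
    probs = [math.exp(sum(t * f for t, f in zip(theta, features(x)))) / z
             for x in OUTCOMES]
    return [sum(p * features(x)[k] for p, x in zip(probs, OUTCOMES))
            for k in range(2)]

def numeric_grad(theta, eps=1e-6):
    """Central-difference gradient of log_z, for verification."""
    grad = []
    for k in range(len(theta)):
        up, dn = theta[:], theta[:]
        up[k] += eps
        dn[k] -= eps
        grad.append((log_z(up) - log_z(dn)) / (2 * eps))
    return grad
```

In the paper's setting the expectation ranges over triples (t, a, u), so it cannot be enumerated and must instead be approximated by sampling, which is where the Rao-Blackwellized estimator of eq. (9) pays off.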