Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2325–2330,
Austin, Texas, November 1-5, 2016. © 2016 Association for Computational Linguistics
[Figure 1: four canonical segmentation parse trees, (a)–(d), for untestably and unlockable.]
Figure 1: Canonical segmentation parse trees for untestably and unlockable. For both words, the scope of un is ambiguous.
Arguably, (a) is the only correct parse tree for untestably; the reading associated with (b) is hard to get. On the other hand,
unlockable is truly ambiguous between “able to be unlocked” (c) and “unable to be locked” (d).
readings of untestably depending on how the negative prefix un scopes. The correct tree (see Figure 1) yields the reading "in the manner of not able to be tested." A second, likely infelicitous, reading—where the segment untest forms a constituent—yields the reading "in a manner of being able to untest." Recovering the hierarchical structure allows us to select the correct reading; note that there are even cases of true ambiguity, e.g., unlockable has two readings: "unable to be locked" and "able to be unlocked."

We also note that theoretical linguists often implicitly assume a context-free treatment of word formation, e.g., by employing brackets to indicate different levels of affixation. Others have explicitly modeled word-internal structure with grammars (Selkirk, 1982; Marvin, 2002).

3 Parsing the Lexicon

A novel component of this work is the development of a discriminative parser (Finkel et al., 2008; Hall et al., 2014) for morphology. The goal is to define a probability distribution over all trees that could arise from the input word, after reversal of orthographic and phonological processes. We employ the simple grammar shown in Table 1. Despite its simplicity, it models the order in which morphemes are attached.

More formally, our goal is to map a surface form w (e.g., w=untestably) into its underlying canonical form u (e.g., u=untestablely) and then into a parse tree t over its morphemes. We assume u, w ∈ Σ*, for some discrete alphabet Σ.² Note that a parse tree over the string implicitly defines a flat segmentation given our grammar—one can simply extract the characters spanned by all preterminals in the resulting tree. Before describing the joint model in detail, we first consider its pieces individually.

3.1 Restoring Orthographic Changes

To extract a canonical segmentation (Naradowsky and Goldwater, 2009; Cotterell et al., 2016), we restore orthographic changes that occur during word formation. To this end, we define the score function

    scoreη(u, a, w) = exp( g(u, a, w)⊤ η )    (1)

where a is a monotonic alignment between the strings u and w. The goal is for scoreη to assign higher values to better matched pairs, e.g., (w=untestably, u=untestablely). We refer to Dreyer et al. (2008) for a thorough exposition.

For ease of computation, we can encode this function as a weighted finite-state machine (WFST) (Mohri et al., 2002). This requires, however, that the feature function g factor over the topology of the finite-state encoding. Since our model conditions on the word w, the feature function g can extract features from any part of this string. Features on the output string, u, however, are more restricted. In this work, we employ a bigram model over output characters. This implies that each state remembers exactly one character: the previous one. See Cotterell et al. (2014) for details. We can compute the score for two strings u and w using a weighted generalization of the Levenshtein algorithm. Computing the partition function requires a different dynamic program, which runs in O(|w|² · |Σ|²) time. Note that since |Σ| ≈ 26 (lower-case English letters), it takes roughly 26² = 676 times longer to compute the partition function than to score a pair of strings.

Our model includes several simple feature templates, including features that fire on individual edit actions as well as conjunctions of edit actions and characters in the surrounding context. See Cotterell et al. (2016) for details.

² For efficiency, we assume u ∈ Σ^{|w|+k}, k = 5.
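As noted in Section 3, a parse under the grammar implicitly defines a flat segmentation: one reads off the characters spanned by the preterminals, left to right. A minimal sketch in Python; the tuple encoding of trees here is our own illustration, not the authors' data format:

```python
# A parse tree as (label, children...) tuples; a preterminal's single
# child is its segment string. This tree encodes [[un [[test] able]] ly]
# for the canonical form u = "untestablely".
TREE = ("WORD",
        ("WORD",
         ("PREFIX", "un"),
         ("WORD",
          ("WORD", "test"),
          ("SUFFIX", "able"))),
        ("SUFFIX", "ly"))

def flat_segmentation(tree):
    """Read off the flat segmentation implied by a parse: the segments
    spanned by the preterminals, in left-to-right order."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]          # preterminal: yield its segment
    segs = []
    for child in children:            # binary rule: recurse on both spans
        segs += flat_segmentation(child)
    return segs

print(flat_segmentation(TREE))        # ['un', 'test', 'able', 'ly']
```

Both parses of an ambiguous word such as unlockable yield the same flat segmentation, which is exactly why the tree carries information that the segmentation alone does not.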
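The weighted generalization of the Levenshtein algorithm used to score a string pair can be sketched as follows. This is an illustrative stand-in rather than the authors' implementation: the per-action weights η are invented, and only bare edit actions are featurized (the paper's model also conjoins edit actions with surrounding characters):

```python
import math

# Hypothetical per-edit-action log-weights (a toy stand-in for eta).
ETA = {"copy": 2.0, "sub": -1.0, "ins": -2.5, "del": -2.0}

def alignment_total_score(u: str, w: str) -> float:
    """Sum of exp-scores over all monotonic alignments of u and w,
    via a weighted generalization of the Levenshtein DP.
    Runs in O(|u| * |w|) time for a single string pair."""
    n, m = len(u), len(w)
    # chart[i][j] = total exp-score of aligning u[:i] with w[:j]
    chart = [[0.0] * (m + 1) for _ in range(n + 1)]
    chart[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0:          # u[i-1] aligned to nothing (inserted in u)
                total += chart[i - 1][j] * math.exp(ETA["ins"])
            if j > 0:          # w[j-1] aligned to nothing (deleted from u)
                total += chart[i][j - 1] * math.exp(ETA["del"])
            if i > 0 and j > 0:  # u[i-1] aligned to w[j-1]
                act = "copy" if u[i - 1] == w[j - 1] else "sub"
                total += chart[i - 1][j - 1] * math.exp(ETA[act])
            chart[i][j] = total
    return chart[n][m]
```

For example, `alignment_total_score("untestablely", "untestably")` scores the canonical/surface pair from the running example. The O(|w|²·|Σ|²) partition-function computation mentioned above requires a different dynamic program, not shown here.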
ROOT → WORD
WORD → PREFIX WORD
WORD → WORD SUFFIX
WORD → Σ+
PREFIX → Σ+
SUFFIX → Σ+

Table 1: The context-free grammar used in this work to model word formation. The productions closely resemble those of Johnson et al. (2006)'s Adaptor Grammar.

3.2 Morphological Analysis as Parsing

Next, we need to score an underlying canonical form (e.g., u=untestablely) together with a parse tree (e.g., t=[[un[[test]able]]ly]). Thus, we define the parser score with the following function

    scoreω(t, u) = exp( Σ_{π∈Π(t)} f(π, u)⊤ ω )    (2)

where Π(t) is the set of anchored productions in the tree t. An anchored production π is a grammar rule in Chomsky normal form attached to a span, e.g., A_{i,k} → B_{i,j} C_{j,k}. Each π is then assigned a weight by the linear function f(π, u)⊤ ω, where the function f extracts relevant features from the anchored production as well as the corresponding span of the underlying form u. This model is typically referred to as a weighted CFG (WCFG) (Smith and Johnson, 2007) or a CRF parser.

For f, we define three span features: (i) indicator features on the span's segment, (ii) an indicator feature that fires if the segment appears in an external corpus³ and (iii) the conjunction of the segment with the label (e.g., PREFIX) of the subtree root. Following Hall et al. (2014), we employ an indicator feature for each production as well as

alignments to the original word

    pθ(t, a, u | w) = (1 / Zθ(w)) · scoreω(t, u) · scoreη(u, a, w)    (3)

where θ = {ω, η} is the parameter vector and the normalizing partition function is defined as

    Zθ(w) = Σ_{u′∈Σ^{|w|+k}} Σ_{a∈A(u′,w)} Σ_{t′∈T(u′)} scoreω(t′, u′) · scoreη(u′, a, w)    (4)

where T(u) is the set of all parse trees for the string u. This involves a sum over all possible underlying orthographic forms and all parse trees for those forms.

The joint approach has the advantage that it allows both factors to work together to influence the choice of the underlying form u. This is useful as the parser now has access to which words are attested in the language; this helps guide the relatively weak transduction model. On the downside, the partition function Zθ now involves a sum over all strings in Σ^{|w|+k} and all possible parses of each string! Finally, we define the marginal distribution over trees and underlying forms as

    pθ(t, u | w) = Σ_{a∈A(u,w)} pθ(t, a, u | w)    (5)

where A(u, w) is the set of all monotonic alignments between u and w. The marginalized form in eq. (5) is our final model of morphological segmentation since we are not interested in the latent alignments a.

4.1 Learning and Inference

We use stochastic gradient descent to optimize the log-probability of the training data Σ_{n=1}^N log pθ(t⁽ⁿ⁾, u⁽ⁿ⁾ | w⁽ⁿ⁾); this requires the
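The inner sum over T(u′) in Zθ(w) can be computed with an inside (CKY-style) dynamic program over the grammar of Table 1. The sketch below is a simplified illustration: the anchored-production weight function is a placeholder for f(π, u)⊤ω, and the ROOT → WORD weight is folded into the top-level call:

```python
import math
from functools import lru_cache

def inside_total(u, weight):
    """Total exp-score over all parses of u under the Table 1 grammar,
    via the inside algorithm. `weight(rule, i, k, j)` is a hypothetical
    log-weight of an anchored production spanning u[i:k] with split j;
    rules are WORD -> PREFIX WORD, WORD -> WORD SUFFIX, and the unary
    X -> chars for X in {WORD, PREFIX, SUFFIX}."""

    @lru_cache(maxsize=None)
    def inside(label, i, k):
        # Any span may rewrite directly to its characters (X -> Sigma+).
        total = math.exp(weight((label, "chars"), i, k, None))
        if label == "WORD":
            for j in range(i + 1, k):
                # WORD -> PREFIX WORD, split point j
                total += (inside("PREFIX", i, j) * inside("WORD", j, k)
                          * math.exp(weight(("WORD", "PREFIX WORD"), i, k, j)))
                # WORD -> WORD SUFFIX, split point j
                total += (inside("WORD", i, j) * inside("SUFFIX", j, k)
                          * math.exp(weight(("WORD", "WORD SUFFIX"), i, k, j)))
        return total

    return inside("WORD", 0, len(u))
```

With all weights set to zero this simply counts parses, which is a convenient sanity check: a one-character word has a single parse, a two-character word has three.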
        |              Segmentation               |   Tree
        | Morph. F1    Edit         Acc.          | Const. F1
Flat    | 78.89 (0.9)  0.72 (0.04)  72.88 (1.21)  | N/A
Hier    | 85.55 (0.6)  0.55 (0.03)  73.19 (1.09)  | 79.01 (0.5)

Table 2: Results for the 10 splits of the treebank. Segmentation quality is measured by morpheme F1, edit distance and accuracy; tree quality by constituent F1.
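As a concrete reference for the segmentation metrics in Table 2, here is one common instantiation of morpheme F1: micro-averaged over words, with each word's segmentation treated as a multiset of segments. Exact scoring details vary across papers, so this is a sketch rather than the paper's evaluation script:

```python
from collections import Counter

def morpheme_f1(golds, preds):
    """Micro-averaged F1 over predicted morphemes. Each word's
    segmentation is compared as a multiset of segments (one common
    instantiation of the metric; conventions differ in the literature)."""
    tp = fp = fn = 0
    for gold, pred in zip(golds, preds):
        g, p = Counter(gold), Counter(pred)
        overlap = sum((g & p).values())   # multiset intersection
        tp += overlap
        fp += sum(p.values()) - overlap   # predicted but not in gold
        fn += sum(g.values()) - overlap   # gold but not predicted
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Toy gold/predicted canonical segmentations (invented for illustration).
golds = [["un", "test", "able", "ly"], ["un", "lock", "able"]]
preds = [["un", "test", "ably"],       ["un", "lock", "able"]]
```

Unlike full-form accuracy, which scores a word 0 unless every segment is correct, this metric awards partial credit, which is why the F1 gap in Table 2 can be large while the accuracy gap stays small.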
bracketing paradox (Pesetsky, 1985; Embick, 2015). We annotate exclusively at the syntactic-semantic tier.

Thirdly, in the context of derivational morphology, we force spans to be words themselves. Since derivational morphology—by definition—forms new words from existing words (Lieber and Štekauer, 2014), it follows that each span rooted with WORD or ROOT in the correct parse corresponds to a word in the lexicon. For example, consider unlickable. The correct parse, under our scheme, is [un [[lick] able]]. Each of the spans (lick, lickable and unlickable) exists as a word. By contrast, the parse [[un [lick]] able] contains the span unlick, which is not a word in the lexicon. The span in the segmented form may involve changes, e.g., [un [[achieve] able]], where achieveable is not a word, but achievable (after deleting e) is.

7 Experiments

We run a simple experiment to show the empirical utility of parsing words—we compare a WCFG-based canonical segmenter with the semi-Markov segmenter introduced in Cotterell et al. (2016). We divide the corpus into 10 distinct train/dev/test splits with 5454 words for train and 1000 for each of dev and test. We report three evaluation metrics: full form accuracy, morpheme F1 (Van den Bosch and Daelemans, 1999) and average edit distance to the gold segmentation with boundaries marked by a distinguished symbol. For the WCFG model, we also report constituent F1—typical for sentential constituency parsing—as a baseline for future systems. This F1 measures how well we predict the whole tree (not just a segmentation). For all models, we use L2 regularization and run 100 epochs of ADAGRAD (Duchi et al., 2011) with early stopping. We tune the regularization coefficient by grid search considering λ ∈ {0.0, 0.1, 0.2, 0.3, 0.4, 0.5}.

7.1 Results and Discussion

Table 2 shows the results. The hierarchical WCFG model outperforms the flat semi-Markov model on all metrics on the segmentation task. This shows that modeling structure among the morphemes indeed does help segmentation. The largest improvements are found under the morpheme F1 metric (≈ 6.5 points). In contrast, accuracy improves by < 1%. Edit distance is in between with an improvement of 0.2 characters. Accuracy, in general, is an all-or-nothing metric since it requires getting every canonical segment correct. Morpheme F1, on the other hand, gives us partial credit. Thus, what this shows us is that the WCFG gets a lot more of the morphemes in the held-out set correct, even if it only gets a few more complete forms correct. We provide additional results evaluating the entire tree with constituency F1 as a future baseline.

8 Conclusion

We presented a discriminative CFG-based model for canonical morphological segmentation and showed empirical improvements on its ability to segment words under three metrics. We argue that our hierarchical approach to modeling morphemes is more often appropriate than the traditional flat segmentation. Additionally, we have annotated 7454 words with a morphological constituency parse. The corpus is available online at http://ryancotterell.github.io/data/morphological-treebank to allow for exact comparison and to spark future research.

Acknowledgements

The first author was supported by a DAAD Long-Term Research Grant and an NDSEG fellowship. The third author was supported by DFG (SCHU 2246/10-1).

References

Mohamed Afify, Ruhi Sarikaya, Hong-Kwang Jeff Kuo, Laurent Besacier, and Yuqing Gao. 2006. On the use of morphological analysis for dialectal Arabic speech recognition. In INTERSPEECH.

Mark Aronoff. 1976. Word Formation in Generative Grammar. MIT Press.

R. Harald Baayen, Richard Piepenbrock, and H. van Rijn. 1993. The CELEX lexical data base on CD-ROM.

James K. Baker. 1979. Trainable grammars for speech recognition. The Journal of the Acoustical Society of America, 65(S1):S132–S132.

Kenneth R. Beesley and Lauri Karttunen. 2003. Finite-state Morphology: Xerox Tools and Techniques. CSLI, Stanford.

Yoshua Bengio, Jean-Sébastien Senécal, et al. 2003. Quick training of probabilistic neural nets by importance sampling. In AISTATS.

Antal Van den Bosch and Walter Daelemans. 1999. Memory-based morphological analysis. In ACL.
Jan A. Botha and Phil Blunsom. 2013. Adaptor grammars for learning non-concatenative morphology. In EMNLP, pages 345–356.

Ann Clifton and Anoop Sarkar. 2011. Combining morpheme-based machine translation with post-processing morpheme prediction. In ACL.

Ryan Cotterell, Nanyun Peng, and Jason Eisner. 2014. Stochastic contextual edit distance and probabilistic FSTs. In ACL.

Ryan Cotterell, Tim Vieira, and Hinrich Schütze. 2016. A joint model of orthography and morphological segmentation. In NAACL.

Markus Dreyer, Jason R. Smith, and Jason Eisner. 2008. Latent-variable modeling of string transductions with finite-state methods. In EMNLP.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121–2159.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In NAACL.

David Embick. 2015. The Morpheme: A Theoretical Introduction, volume 31. Walter de Gruyter GmbH & Co KG.

Jenny Rose Finkel, Alex Kleeman, and Christopher D. Manning. 2008. Efficient, feature-based, conditional random field parsing. In ACL, volume 46, pages 959–967.

David Leo Wright Hall, Greg Durrett, and Dan Klein. 2014. Less grammar, more features. In ACL, pages 228–237.

Sébastien Jean, KyungHyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In ACL, pages 1–10.

Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. 2006. Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. In NIPS, pages 641–648.

Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. 2007. Bayesian inference for PCFGs via Markov Chain Monte Carlo. In HLT-NAACL, pages 139–146.

Daphne Koller and Nir Friedman. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press.

Mehryar Mohri, Fernando Pereira, and Michael Riley. 2002. Weighted finite-state transducers in speech recognition. Computer Speech & Language, 16(1):69–88.

Jason Naradowsky and Sharon Goldwater. 2009. Improving morphology induction by learning spelling rules. In IJCAI.

Karthik Narasimhan, Damianos Karakos, Richard Schwartz, Stavros Tsakalidis, and Regina Barzilay. 2014. Morphological segmentation for keyword spotting. In EMNLP.

David Pesetsky. 1985. Morphology and logical form. Linguistic Inquiry, 16(2):193–246.

Teemu Ruokolainen, Oskar Kohonen, Kairit Sirts, Stig-Arne Grönroos, Mikko Kurimo, and Sami Virpioja. 2016. Comparative study of minimally supervised morphological segmentation. Computational Linguistics.

Helmut Schmid. 2005. Disambiguation of morphological structure using a PCFG. In EMNLP, pages 515–522. Association for Computational Linguistics.

Helmut Schmid, Arne Fitschen, and Ulrich Heid. 2004. SMOR: A German computational morphology covering derivation, composition and inflection. In LREC.

Wolfgang Seeker and Özlem Çetinoğlu. 2015. A graph-based lattice dependency parser for joint morphological segmentation and syntactic analysis. TACL.

Elisabeth Selkirk. 1982. The Syntax of Words. Number 7 in Linguistic Inquiry Monograph Series. MIT Press.

Kairit Sirts and Sharon Goldwater. 2013. Minimally-supervised morphological segmentation using adaptor grammars. TACL, 1:255–266.

Noah A. Smith and Mark Johnson. 2007. Weighted and probabilistic context-free grammars are equally expressive. Computational Linguistics, 33(4):477–491.

Meishan Zhang, Yue Zhang, Wanxiang Che, and Ting Liu. 2014. Character-level Chinese dependency parsing. In ACL, pages 1326–1336.
A Derivation of Eq. 6

Here we provide the gradient of the log-partition function as an expectation:

∇θ log Zθ(w)
  = (1/Zθ(w)) ∇θ Zθ(w)    (7)
  = (1/Zθ(w)) ∇θ Σ_{u′∈Σ^{|w|+k}} Σ_{a∈A(u′,w)} Σ_{t′∈T(u′)} scoreω(t′, u′) · scoreη(u′, a, w)
  = (1/Zθ(w)) Σ_{u′∈Σ^{|w|+k}} Σ_{a∈A(u′,w)} Σ_{t′∈T(u′)} ∇θ [scoreω(t′, u′) · scoreη(u′, a, w)]
  = (1/Zθ(w)) Σ_{u′∈Σ^{|w|+k}} Σ_{a∈A(u′,w)} Σ_{t′∈T(u′)} [scoreη(u′, a, w) · ∇ω scoreω(t′, u′) + scoreω(t′, u′) · ∇η scoreη(u′, a, w)]
  = (1/Zθ(w)) Σ_{u′∈Σ^{|w|+k}} Σ_{a∈A(u′,w)} Σ_{t′∈T(u′)} scoreη(u′, a, w) · scoreω(t′, u′) [Σ_{π∈Π(t′)} f(π, u′)⊤ + g(u′, a, w)⊤]
  = Σ_{u′∈Σ^{|w|+k}} Σ_{a∈A(u′,w)} Σ_{t′∈T(u′)} [scoreη(u′, a, w) · scoreω(t′, u′) / Zθ(w)] · [Σ_{π∈Π(t′)} f(π, u′)⊤ + g(u′, a, w)⊤]
  = E_{(t,a,u)∼pθ(·|w)} [Σ_{π∈Π(t)} f(π, u)⊤ + g(u, a, w)⊤]    (8)

The result above can be further improved through Rao-Blackwellization. In this case, when we sample a tree–underlying form pair (t, u), we marginalize out all alignments that could have given rise to the sampled pair. The final derivation is shown below:

∇θ log Zθ(w) = E_{(t,a,u)∼pθ(·|w)} [Σ_{π∈Π(t)} f(π, u)⊤ + g(u, a, w)⊤]
             = E_{(t,u)∼pθ(·|w)} [Σ_{π∈Π(t)} f(π, u)⊤ + Σ_{a∈A(u,w)} pθ(a | u, w) g(u, a, w)⊤]    (9)

The estimator in eq. (9) will have lower variance than that in eq. (8).
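The identity in eq. (8), namely that ∇θ log Zθ equals a model expectation of the feature vector, holds for any log-linear model. The following self-contained toy sketch (the outcome space, features, and θ are invented for illustration and have nothing to do with the paper's model) checks the identity numerically against central differences:

```python
import math

# A tiny log-linear model over four discrete outcomes.
OUTCOMES = ["aa", "ab", "ba", "bb"]

def features(x):
    """Toy feature vector: count of 'a', and an indicator of a doubled character."""
    return [x.count("a"), float(x[0] == x[1])]

def log_z(theta):
    """log-partition function of the log-linear model p(x) ∝ exp(θ·f(x))."""
    return math.log(sum(math.exp(sum(t * f for t, f in zip(theta, features(x))))
                        for x in OUTCOMES))

def expected_features(theta):
    """Model expectation E_p[f(x)] — the analytic gradient of log_z."""
    z = math.exp(log_z(theta))
    probs = [math.exp(sum(t * f for t, f in zip(theta, features(x)))) / z
             for x in OUTCOMES]
    return [sum(p * features(x)[k] for p, x in zip(probs, OUTCOMES))
            for k in range(2)]

def numeric_grad(theta, eps=1e-6):
    """Central-difference gradient of log_z, for verification."""
    grad = []
    for k in range(len(theta)):
        up, dn = theta[:], theta[:]
        up[k] += eps
        dn[k] -= eps
        grad.append((log_z(up) - log_z(dn)) / (2 * eps))
    return grad
```

In the paper's setting the expectation ranges over triples (t, a, u), so it cannot be enumerated and must instead be approximated by sampling, which is where the Rao-Blackwellized estimator of eq. (9) pays off.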