
A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing


Dat Quoc Nguyen¹,²
¹ The University of Melbourne, Australia. dqnguyen@unimelb.edu.au
² VinAI Research, Hanoi, Vietnam. v.datnq9@vinai.io

Abstract

We propose the first multi-task learning model for joint Vietnamese word segmentation, part-of-speech (POS) tagging and dependency parsing. In particular, our model extends the BIST graph-based dependency parser (Kiperwasser and Goldberg, 2016) with BiLSTM-CRF-based neural layers (Huang et al., 2015) for word segmentation and POS tagging. On Vietnamese benchmark datasets, experimental results show that our joint model obtains state-of-the-art or competitive performances.

[Figure 1: Illustration of our joint model. Linear transformations are not shown for simplification. From bottom to top: syllable vector representations (syllable embedding ◦ initial word-boundary tag embedding) for the syllables "Tôi là sinh viên" feed a BiLSTM_WS word segmentation layer with a CRF output (predicting the tags O O B I); word vector representations (word embedding ◦ syllable-level word embedding) for "Tôi là sinh_viên" feed a BiLSTM_POS POS tagging layer with a CRF output (predicting PRON VERB NOUN); and a BiLSTM_DEP dependency parsing layer with FFNN and softmax outputs predicts arcs and labels. The parsed example:

ID | Form      | Gloss   | POS  | Head | DepRel
1  | Tôi       | I       | PRON | 2    | sub
2  | là        | am      | VERB | 0    | root
3  | sinh viên | student | NOUN | 2    | vmod ]


1 Introduction
Dependency parsing (Kübler et al., 2009) is extremely useful in many downstream applications such as relation extraction (Bunescu and Mooney, 2005) and machine translation (Galley and Manning, 2009). POS tags are essential features used in dependency parsing. In real-world parsing, most parsers are used in a pipeline process with a precursor POS tagging model for producing predicted POS tags. In English, where white space is a strong word boundary indicator, POS tagging is considered to be the first important step towards dependency parsing (Ballesteros et al., 2015).

Unlike English, for Vietnamese NLP, word segmentation is considered to be the key first step. This is because, when written, white space is used in Vietnamese to separate syllables that constitute words, in addition to marking word boundaries (Nguyen et al., 2009). For example, the 4-syllable written text "Tôi là sinh viên" (I am student) forms 3 words: "Tôi" (I), "là" (am) and "sinh_viên" (student).¹

When parsing real-world Vietnamese text where gold word segmentation is not available, a pipeline process is defined that starts with a word segmenter to segment the text. The segmented text (e.g. "Tôi là sinh_viên") is provided as the input to the POS tagger, which automatically generates POS-annotated text (e.g. "Tôi/PRON là/VERB sinh_viên/NOUN"), which is in turn fed to the parser. See Figure 1 for the final parsing output.

However, Vietnamese word segmenters and POS taggers have a non-trivial error rate, thus leading to error propagation. A solution to these problems is to develop models for jointly learning word segmentation, POS tagging and dependency parsing, such as those that have been actively explored for Chinese. These include traditional feature-based models (Hatori et al., 2012; Qian and Liu, 2012; Zhang et al., 2014, 2015) and neural models (Kurita et al., 2017; Li et al., 2018). These models construct transition-based frameworks at the character level.

In this paper, we present a new multi-task learning model for joint word segmentation, POS tagging and dependency parsing. More specifically, our model can be viewed as an extension of the BIST graph-based dependency parser (Kiperwasser and Goldberg, 2016) that incorporates BiLSTM-CRF-based architectures (Huang et al., 2015) to predict the segmentation and POS tags. To the best of our knowledge, our model is the first one proposed to jointly learn these three tasks for Vietnamese. Experiments on Vietnamese benchmark datasets show that our model produces state-of-the-art or competitive results.

¹ About 85% of Vietnamese word types are composed of at least two syllables and 80%+ of syllable types are words by themselves (Thang et al., 2008). For Vietnamese word segmentation, white space is only used to separate word tokens while underscore is used to separate syllables inside a word.
2 Our proposed model
As illustrated in Figure 1, our joint multi-task model can be viewed as a hierarchical mixture of three components: word segmentation, POS tagging and dependency parsing. In particular, our word segmentation component formalizes the Vietnamese word segmentation task as a sequence labeling problem, thus using a BiLSTM-CRF architecture (Huang et al., 2015) to predict BIO word-boundary tags from input syllables, resulting in a word-segmented sequence. As for word segmentation, our POS tagging component also uses a BiLSTM-CRF, to predict POS tags from the sequence of segmented words. Based on the input segmented words and their predicted POS tags, our dependency parsing component uses a graph-based architecture similar to the one from Kiperwasser and Goldberg (2016) to decode dependency arcs and labels.

Syllable vector representation: Given an input sentence S of m syllables s_1, s_2, ..., s_m, we apply an initial word segmenter to produce initial BIO word-boundary tags b_1, b_2, ..., b_m. Following the state-of-the-art Vietnamese word segmenter VnCoreNLP's RDRsegmenter (Nguyen et al., 2018b), our initial word segmenter is based on the lexicon-based longest matching strategy (Poowarawan, 1986). We create a vector v_i to represent each i-th syllable in the input sentence S by concatenating its syllable embedding e_{s_i}^{(S)} and its initial word-boundary tag embedding e_{b_i}^{(B)}:

v_i = e_{s_i}^{(S)} ◦ e_{b_i}^{(B)}   (1)
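To make the lexicon-based longest matching strategy concrete, here is a small illustrative sketch (our own, not the paper's code; the toy lexicon and the 4-syllable cap are assumptions). It greedily merges syllables into the longest word found in the lexicon and emits the initial BIO word-boundary tags, with O marking a single-syllable word and B/I marking a multi-syllable word, as in Figure 1:

```python
def longest_match_segment(syllables, lexicon, max_len=4):
    """Greedy longest matching: at each position, take the longest
    multi-syllable word found in the lexicon, else emit one syllable."""
    tags, i = [], 0
    while i < len(syllables):
        # Try the longest candidate first and shrink until a lexicon hit.
        for k in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + k])
            if k == 1 or candidate in lexicon:
                tags += ["O"] if k == 1 else ["B"] + ["I"] * (k - 1)
                i += k
                break
    return tags

# Assuming "sinh viên" is in the lexicon, the paper's example sentence
# gets the initial tags shown in Figure 1.
print(longest_match_segment(["Tôi", "là", "sinh", "viên"], {"sinh viên"}))
# -> ['O', 'O', 'B', 'I']
```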
Word segmentation (WSeg): The WSeg component uses a BiLSTM (BiLSTM_WS) to learn a latent feature vector representing the i-th syllable from the sequence of vectors v_{1:m}:

r_i^{(WS)} = BiLSTM_WS(v_{1:m}, i)   (2)

The WSeg component then uses a single-layer feed-forward network (FFNN_WS) to perform a linear transformation over each latent feature vector:

h_i^{(WS)} = FFNN_WS(r_i^{(WS)})   (3)

Next, the WSeg component feeds the output vectors h_i^{(WS)} into a linear-chain CRF layer (Lafferty et al., 2001) for final BIO word-boundary tag prediction. A cross-entropy objective loss L_WS is computed during training, while the Viterbi algorithm is used for decoding.
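Equations 1–3 can be illustrated with the following PyTorch sketch (the paper's implementation uses DyNet; the syllable vocabulary size and tag indices are assumptions, while the embedding sizes, 2 BiLSTM layers and 128 hidden units follow the Appendix):

```python
import torch
import torch.nn as nn

syll_emb = nn.Embedding(5000, 100)  # e^(S); vocabulary size assumed
btag_emb = nn.Embedding(3, 25)      # e^(B) for the 3 initial B/I/O tags
bilstm_ws = nn.LSTM(input_size=125, hidden_size=128, num_layers=2,
                    bidirectional=True, batch_first=True)  # BiLSTM_WS
ffnn_ws = nn.Linear(2 * 128, 3)     # FFNN_WS: one score per BIO tag

syllables = torch.tensor([[11, 12, 13, 14]])  # assumed indices, "Tôi là sinh viên"
init_tags = torch.tensor([[2, 2, 0, 1]])      # O O B I from the initial segmenter

v = torch.cat([syll_emb(syllables), btag_emb(init_tags)], dim=-1)  # Eq. (1)
r_ws, _ = bilstm_ws(v)                                             # Eq. (2)
h_ws = ffnn_ws(r_ws)                                               # Eq. (3)
# h_ws (shape [1, 4, 3]) gives the emission scores for a linear-chain CRF,
# trained with a cross-entropy loss and decoded with the Viterbi algorithm.
```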
Word vector representation: Assume that we form n words w_1, w_2, ..., w_n based on the m syllables in the input sentence S. Note that we use gold word segmentation when training, and use the predicted segmentation produced by the WSeg component when decoding. We create a vector x_j to represent each j-th word w_j by concatenating its word embedding e_{w_j}^{(W)} and its syllable-level word embedding e_{w_j}^{(SW)}:

x_j = e_{w_j}^{(W)} ◦ e_{w_j}^{(SW)}   (4)

Here, inspired by Bohnet et al. (2018), to obtain e_{w_j}^{(SW)} we combine the sentence-level context-sensitive syllable encodings (from Equation 2) of the word's first and last syllables and feed the result into a FFNN (FFNN_SW):

e_{w_j}^{(SW)} = FFNN_SW(r_{f(w_j)}^{(WS)} ◦ r_{l(w_j)}^{(WS)})   (5)

where f(w_j) and l(w_j) denote the indices of the first and last syllables of w_j in S, respectively.
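Equations 4–5 can be sketched as follows (again PyTorch for illustration, not the authors' DyNet code; the word vocabulary size and the random stand-in for the syllable encodings are assumptions):

```python
import torch
import torch.nn as nn

# Stand-in for the context-sensitive syllable encodings r^(WS) of Equation 2,
# here random values for a 4-syllable sentence (256 = 2 * 128 BiLSTM units).
r_ws = torch.randn(4, 256)
word_emb = nn.Embedding(30000, 100)   # e^(W); vocabulary size assumed
ffnn_sw = nn.Linear(2 * 256, 100)     # FFNN_SW

def word_vector(word_idx, first, last):
    # Equation 5: concatenate the encodings of the word's first and last
    # syllables, then project; Equation 4: prepend the word embedding.
    e_sw = ffnn_sw(torch.cat([r_ws[first], r_ws[last]]))
    return torch.cat([word_emb(torch.tensor(word_idx)), e_sw])

x = word_vector(word_idx=7, first=2, last=3)  # "sinh_viên" spans syllables 2-3
print(x.shape)  # torch.Size([200])
```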
POS tagging: The POS tagging component first feeds the sequence of vectors x_{1:n} into a BiLSTM (BiLSTM_POS) to learn latent feature vectors representing the input words, and passes each of these latent vectors as input to a FFNN (FFNN_POS):

r_j^{(POS)} = BiLSTM_POS(x_{1:n}, j)   (6)

h_j^{(POS)} = FFNN_POS(r_j^{(POS)})   (7)

The output vectors h_j^{(POS)} are then fed into a CRF layer for POS tag prediction. A cross-entropy loss L_POS is computed for POS tagging when training.

Dependency parsing: Assume that the POS tagging component produces p_1, p_2, ..., p_n as the predicted POS tags for the input words w_1, w_2, ..., w_n, respectively. Each j-th predicted POS tag p_j is represented by an embedding e_{p_j}^{(P)}. We create a sequence of vectors z_{1:n} as input for the dependency parsing component, in which each z_j results from concatenating the word vector representation x_j (from Equation 4) and the corresponding POS tag embedding e_{p_j}^{(P)}. The dependency parsing component uses a BiLSTM (BiLSTM_DEP) to learn latent feature representations from the input z_{1:n}:

z_j = x_j ◦ e_{p_j}^{(P)}   (8)

r_j^{(DEP)} = BiLSTM_DEP(z_{1:n}, j)   (9)

Based on the latent feature vectors r_j^{(DEP)}, either a transition-based or a graph-based neural architecture can be applied for dependency parsing (Kiperwasser and Goldberg, 2016).

Nguyen et al. (2016) show that, in both the neural network-based and traditional feature-based categories, graph-based parsers perform better than transition-based parsers for Vietnamese. Thus, our parsing component is constructed similarly to the BIST graph-based dependency parser from Kiperwasser and Goldberg (2016). A difference is that we use FFNNs to split r_j^{(DEP)} into head and dependent representations:

h_j^{(A-H)} = FFNN_Arc-Head(r_j^{(DEP)})   (10)
h_j^{(A-D)} = FFNN_Arc-Dep(r_j^{(DEP)})   (11)
h_j^{(L-H)} = FFNN_Label-Head(r_j^{(DEP)})   (12)
h_j^{(L-D)} = FFNN_Label-Dep(r_j^{(DEP)})   (13)

To score a potential dependency arc, we use a FFNN (FFNN_ARC) with a one-node output layer:

score(i, j) = FFNN_ARC(h_i^{(A-H)} ◦ h_j^{(A-D)})   (14)

Given the scores of word pairs, we predict the highest scoring projective parse tree by using the Eisner (1996) decoding algorithm. This unlabeled parsing model is trained with a margin-based hinge loss L_ARC (Kiperwasser and Goldberg, 2016).

To label the predicted arcs, we use another FFNN (FFNN_LABEL) with softmax output:

v^{(i,j)} = FFNN_LABEL(h_i^{(L-H)} ◦ h_j^{(L-D)})   (15)

Based on the vectors v^{(i,j)}, a cross-entropy loss L_LABEL for dependency label prediction is computed when training, using the gold labeled tree.
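To make Equations 10–14 concrete, here is an illustrative PyTorch sketch (the paper uses DyNet, and the tanh hidden layer inside FFNN_ARC is our assumption; the paper only specifies its one-node output layer). It projects each BiLSTM_DEP state into head and dependent representations and scores every candidate arc:

```python
import torch
import torch.nn as nn

n, d = 3, 256                      # 3 words; 2 * 128 BiLSTM_DEP hidden units
r_dep = torch.randn(n, d)          # r^(DEP) from Equation 9, faked here

ffnn_arc_head = nn.Linear(d, 100)  # Equation 10
ffnn_arc_dep = nn.Linear(d, 100)   # Equation 11
# FFNN_ARC with a one-node output layer (Equation 14).
ffnn_arc = nn.Sequential(nn.Linear(200, 100), nn.Tanh(), nn.Linear(100, 1))

h_head = ffnn_arc_head(r_dep)      # head representations
h_dep = ffnn_arc_dep(r_dep)        # dependent representations

# score(i, j) for every potential arc with head i and dependent j.
scores = torch.stack([
    torch.stack([ffnn_arc(torch.cat([h_head[i], h_dep[j]])).squeeze()
                 for j in range(n)])
    for i in range(n)])
# The Eisner (1996) algorithm then decodes the highest-scoring projective
# tree from this score matrix; labels come from FFNN_LABEL (Equation 15)
# applied analogously to the Label-Head/Label-Dep projections.
```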
Joint multi-task learning: We train our model by summing the L_WS, L_POS, L_ARC and L_LABEL losses prior to computing gradients. Model parameters are learned to minimize the sum of these losses.
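As a minimal sketch of this objective (framework-agnostic Python; the four losses are assumed to be already-computed scalar tensors):

```python
def joint_loss(loss_ws, loss_pos, loss_arc, loss_label):
    # L_WS + L_POS + L_ARC + L_LABEL: a single backward pass through this
    # sum propagates gradients into shared and task-specific parameters.
    return loss_ws + loss_pos + loss_arc + loss_label
```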
Discussion: Our model is inspired by the stack-propagation based methods (Zhang and Weiss, 2016; Hashimoto et al., 2017), which are joint models for POS tagging and dependency parsing. For dependency parsing, the Stack-propagation model (Zhang and Weiss, 2016) uses a transition-based approach, and the joint multi-task model JMT (Hashimoto et al., 2017) uses a head-selection based approach which produces a probability distribution over possible heads for each word (Zhang et al., 2017), while our model uses a graph-based approach.

Our model can be viewed as an extension of the joint POS tagging and dependency parsing model jPTDP-v2 (Nguyen and Verspoor, 2018),² where we incorporate a BiLSTM-CRF for word boundary prediction. Other improvements to jPTDP-v2 include: (i) instead of using 'local' single word-based character-level embeddings, we use 'global' sentence-level context for learning word embeddings (see Equations 2 and 5), (ii) we use a CRF layer for POS tagging instead of a softmax layer, and (iii) following Dozat and Manning (2017), we employ head and dependent projection representations (in Equations 10–13) as feature vectors for dependency parsing rather than the top recurrent states (in Equation 9).

² On the benchmark English PTB-WSJ corpus, jPTDP-v2 does better than Stack-propagation, while obtaining similar performance to JMT.

3 Experimental setup

Datasets: We follow the setup used in the Vietnamese NLP toolkit VnCoreNLP (Vu et al., 2018). For word segmentation and POS tagging, we use the standard datasets from the Vietnamese Language and Speech Processing (VLSP) 2013 shared tasks.³ To train the word segmentation layer, we use 75K manually word-segmented sentences, in which 70K sentences are used for training and 5K sentences for development. For POS tagging, we use 27,870 manually word-segmented and POS-annotated sentences, in which 27K and 870 sentences are used for training and development, respectively. For both tasks, the test set consists of 2,120 manually word-segmented and POS-annotated sentences.

To train the dependency parsing layer, we use the benchmark Vietnamese dependency treebank VnDT (v1.1) of 10,197 sentences (Nguyen et al., 2014), and follow a standard split to use 1,020 sentences for test, 200 sentences for development and the remaining 8,977 sentences for training.

³ http://vlsp.org.vn/vlsp2013

Implementation: We implement our model (namely, jointWPD) using DyNet (Neubig et al., 2017). We learn model parameters using Adam (Kingma and Ba, 2014) and run for 50 epochs. We compute the average of the F1 scores for word segmentation, POS tagging and (LAS) dependency parsing after each training epoch, and choose the model with the highest average score over the development sets to apply to the test sets. See the Appendix for implementation details.
Model        | WSeg  | PTag   | LAS     | UAS
Our jointWPD | 97.81 | 94.05  | 71.50   | 77.23
VnCoreNLP    | 97.90 | 94.06  | 68.84∗∗ | 74.52∗∗
jPTDP-v2     | 97.90 | 93.82∗ | 70.78∗∗ | 76.80∗
Biaffine     | 97.90 | 94.06  | 72.59∗∗ | 78.54∗∗

Table 1: F1 scores (in %) for word segmentation (WSeg), POS tagging (PTag) and dependency parsing (LAS and UAS) on test sets of unsegmented sentences. Scores are computed on all tokens (including punctuation), employing the CoNLL 2017 shared task evaluation script (Zeman et al., 2017). In all tables, ∗ and ∗∗ denote statistically significant differences against jointWPD at p ≤ 0.05 and p ≤ 0.01, respectively. We compute sentence-level scores for each model and task, then use a paired t-test to measure the significance level.

Model              | WSeg    | PTag   | LAS     | UAS
WS ↦ PoS ↦ Dep     | 98.48∗  | 95.09∗ | 70.68∗  | 76.70∗
Our jointWPD       | 98.66   | 95.35  | 71.13   | 77.01
(a) w/o InitialBIO | 98.25∗∗ | 95.01∗ | 70.34∗∗ | 76.36∗∗
(b) w/o CRF_WSeg   | 98.32∗∗ | 95.06∗ | 70.48∗∗ | 76.47∗∗
(c) w/o CRF_PTag   | 98.65   | 95.14∗ | 71.00   | 76.94
(d) w/o PTag       | 98.63   | 95.10∗ | 69.78∗∗ | 76.03∗∗

Table 2: F1 scores on development sets of unsegmented sentences. (a): without using the initial word-boundary tag embedding, i.e. Equation 1 becomes v_i = e_{s_i}^{(S)}; (b): using a softmax layer for word-boundary tag prediction instead of a CRF layer; (c): using a softmax layer for POS tag prediction instead of a CRF layer; (d): without using the POS tag embeddings for the parsing component, i.e. Equation 8 becomes z_j = x_j.
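The significance tests reported in these tables (sentence-level scores compared with a paired t-test) can be reproduced along these lines; scipy.stats.ttest_rel is a standard paired t-test, and the per-sentence scores below are placeholder values, not the paper's data:

```python
from scipy.stats import ttest_rel

# Sentence-level LAS for two models on the same test sentences (placeholders).
jointwpd_las = [0.71, 0.80, 0.66, 0.75, 0.78]
baseline_las = [0.69, 0.78, 0.61, 0.74, 0.73]

t_stat, p_value = ttest_rel(jointwpd_las, baseline_las)
print(p_value <= 0.05, p_value <= 0.01)  # thresholds for the ∗ and ∗∗ markers
```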

4 Main results

End-to-end results: Our scores on the test sets are presented in Table 1. We compare our scores with the VnCoreNLP toolkit (Vu et al., 2018), which produced the previous highest reported results on the same test sets for the three tasks. Note that the published scores of VnCoreNLP for POS tagging and dependency parsing were reported using gold word segmentation, and its published scores for dependency parsing were reported using the previous VnDT v1.0. As the currently released VnCoreNLP version is retrained using the VnDT v1.1 and we also use the same experimental setup, we thus rerun VnCoreNLP on the unsegmented test sentences and compute its scores.⁴ Our jointWPD obtains a slightly lower word segmentation score and a similar POS tagging score against VnCoreNLP. However, jointWPD achieves 2.7% absolute higher LAS and UAS than VnCoreNLP.

We also show in Table 1 the scores of the joint POS tagging and dependency parsing model jPTDP-v2 (Nguyen and Verspoor, 2018) and the state-of-the-art Biaffine dependency parser (Dozat and Manning, 2017). For Biaffine, which requires automatically predicted POS tags, following Vu et al. (2018) we produce the predicted POS tags on the whole VnDT treebank by using VnCoreNLP. We train both jPTDP-v2 and Biaffine with gold word segmentation.⁵ For test, these models are fed with predicted word-segmented test sentences produced by VnCoreNLP. Our jointWPD performs significantly better than jPTDP-v2 on both the POS tagging and dependency parsing tasks. However, jointWPD obtains 1.1+% lower LAS and UAS than Biaffine, which uses a "biaffine" attention mechanism for predicting dependency arcs and labels. We will extend our parsing component with the biaffine attention mechanism to investigate the benefit for our joint model in future work.

Ablation analysis: Table 2 shows the performance of a pipeline strategy WS ↦ PoS ↦ Dep, where we treat our word segmentation, POS tagging and dependency parsing components as independent networks and train them separately. We find that jointWPD does significantly better than the pipeline strategy on all three tasks.

Table 2 also presents ablation tests over 4 factors. When not using either the initial word-boundary tag embeddings or the CRF layer for word-boundary tag prediction, all scores degrade by about 0.3+% absolute. The 2 remaining factors, i.e. (c) using a softmax classifier for POS tag prediction rather than a CRF layer and (d) removing the POS tag embeddings, do not affect the word segmentation score. Both factors notably decrease the POS tagging score. Factor (c) slightly decreases the LAS and UAS parsing scores. Factor (d) degrades the parsing scores by about 1.0+%, clearly showing the usefulness of POS tag information for the dependency parsing task.

⁴ See accuracy results w.r.t. the gold word segmentation in Table 3 in the Appendix.
⁵ We reimplement jPTDP-v2 such that its POS tagging layer makes use of the VLSP 2013 POS tagging training set of 27K sentences, and then perform hyper-parameter tuning. The original jPTDP-v2 implementation only uses the gold POS tags available in the 8,977 training dependency trees, thus giving lower parsing performance than ours. For Biaffine, we use its updated version (Dozat et al., 2017), which won the CoNLL 2017 shared task on multilingual Universal Dependencies (UD) parsing from raw text (Zeman et al., 2017). Biaffine was also employed in all the top systems at the follow-up CoNLL 2018 shared task (Zeman et al., 2018).
5 Related work
Nguyen et al. (2018b) propose a transformation rule-based learning model, RDRsegmenter, for Vietnamese word segmentation, which obtains the highest performance to date. Nguyen et al. (2017) briefly review word segmentation and POS tagging approaches for Vietnamese. In addition, Nguyen et al. (2017) also present an empirical comparison between state-of-the-art feature-based and neural network-based models for Vietnamese POS tagging, and show that a conventional feature-based model performs better than neural network-based models. In particular, on the VLSP 2013 POS tagging dataset, MarMoT (Mueller et al., 2013) obtains better accuracy than BiLSTM-CRF-based models with LSTM- and CNN-based character-level word embeddings (Lample et al., 2016; Ma and Hovy, 2016). Vu et al. (2018) incorporate RDRsegmenter and MarMoT as the word segmentation and POS tagging components of VnCoreNLP, respectively.

Thi et al. (2013) propose a conversion method to automatically convert the manually built Vietnamese constituency treebank (Nguyen et al., 2009) into a dependency treebank. However, Thi et al. (2013) do not clarify how dependency labels are inferred; also, they ignore the syntactic information encoded in grammatical function tags, and are unable to deal with coordination and empty category cases.⁶ Nguyen et al. (2014) later present a new conversion method to tackle all those issues, producing the high-quality dependency treebank VnDT, which is then widely used in Vietnamese dependency parsing research (Nguyen and Nguyen, 2015, 2016; Nguyen et al., 2016, 2018a; Vu et al., 2018). Recently, Nguyen (2018) manually built another Vietnamese dependency treebank, BKTreebank, consisting of about 7K sentences based on the Stanford Dependencies annotation scheme (Marneffe and Manning, 2008).

⁶ Thi et al. (2013) reformed their dependency treebank with the UD annotation scheme to create a Vietnamese UD treebank in 2017. Note that the CoNLL 2017 & 2018 multilingual parsing shared tasks also provided F1 scores for word segmentation, POS tagging and dependency parsing on this Vietnamese UD treebank. However, this UD treebank is small (containing about 1,400 training sentences), thus it might not be ideal for drawing a reliable conclusion.

6 Conclusions and future work

In this paper, we have presented the first multi-task learning model for joint word segmentation, POS tagging and dependency parsing in Vietnamese. Experiments on Vietnamese benchmark datasets show that our joint multi-task model obtains results competitive with the state-of-the-art.

Che et al. (2018) show that deep contextualized word representations (Peters et al., 2018; Devlin et al., 2019) help improve parsing performance. We will evaluate the effects of such contextualized representations on our joint model. A Vietnamese syllable is analogous to a character in other languages such as Chinese and Japanese. Thus, we will also evaluate the application of our model to those languages in future work.

Acknowledgments

I would like to thank Karin Verspoor and Vu Cong Duy Hoang as well as the anonymous reviewers for their feedback.

References

Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2015. Improved transition-based parsing by modeling characters instead of words with LSTMs. In Proceedings of EMNLP, pages 349–359.

Bernd Bohnet, Ryan McDonald, Gonçalo Simões, Daniel Andor, Emily Pitler, and Joshua Maynez. 2018. Morphosyntactic Tagging with a Meta-BiLSTM Model over Context Sensitive Token Encodings. In Proceedings of ACL, pages 2642–2652.

Razvan Bunescu and Raymond Mooney. 2005. A Shortest Path Dependency Kernel for Relation Extraction. In Proceedings of HLT-EMNLP, pages 724–731.

Wanxiang Che, Yijia Liu, Yuxuan Wang, Bo Zheng, and Ting Liu. 2018. Towards better UD parsing: Deep contextualized word embeddings, ensemble, and treebank concatenation. In Proceedings of the CoNLL 2018 Shared Task, pages 55–64.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT.

Timothy Dozat and Christopher D. Manning. 2017. Deep Biaffine Attention for Neural Dependency Parsing. In Proceedings of ICLR.
Timothy Dozat, Peng Qi, and Christopher D. Manning. 2017. Stanford's Graph-based Neural Dependency Parser at the CoNLL 2017 Shared Task. In Proceedings of the CoNLL 2017 Shared Task, pages 20–30.

Jason M. Eisner. 1996. Three New Probabilistic Models for Dependency Parsing: An Exploration. In Proceedings of COLING, pages 340–345.

Michel Galley and Christopher D. Manning. 2009. Quadratic-Time Dependency Parsing for Machine Translation. In Proceedings of ACL-IJCNLP, pages 773–781.

Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. 2017. A joint many-task model: Growing a neural network for multiple NLP tasks. In Proceedings of EMNLP, pages 1923–1933.

Jun Hatori, Takuya Matsuzaki, Yusuke Miyao, and Jun'ichi Tsujii. 2012. Incremental Joint Approach to Word Segmentation, POS Tagging, and Dependency Parsing in Chinese. In Proceedings of ACL, pages 1045–1053.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF Models for Sequence Tagging. arXiv preprint, arXiv:1508.01991.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv preprint, arXiv:1412.6980.

Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations. Transactions of ACL, 4:313–327.

Sandra Kübler, Ryan McDonald, and Joakim Nivre. 2009. Dependency Parsing. Synthesis Lectures on Human Language Technologies, Morgan & Claypool Publishers.

Shuhei Kurita, Daisuke Kawahara, and Sadao Kurohashi. 2017. Neural Joint Model for Transition-based Chinese Syntactic Analysis. In Proceedings of ACL, pages 1204–1214.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of ICML, pages 282–289.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural Architectures for Named Entity Recognition. In Proceedings of NAACL-HLT, pages 260–270.

Haonan Li, Zhisong Zhang, Yuqi Ju, and Hai Zhao. 2018. Neural Character-level Dependency Parsing for Chinese. In Proceedings of AAAI, pages 5205–5212.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. In Proceedings of ACL, pages 1064–1074.

Marie-Catherine de Marneffe and Christopher D. Manning. 2008. The Stanford typed dependencies representation. In Proceedings of the Coling 2008 Workshop on Cross-Framework and Cross-Domain Parser Evaluation, pages 1–8.

Thomas Mueller, Helmut Schmid, and Hinrich Schütze. 2013. Efficient higher-order CRFs for morphological tagging. In Proceedings of EMNLP, pages 322–332.

Graham Neubig, Chris Dyer, Yoav Goldberg, et al. 2017. DyNet: The Dynamic Neural Network Toolkit. arXiv preprint, arXiv:1701.03980.

Binh Duc Nguyen, Kiet Van Nguyen, and Ngan Luu-Thuy Nguyen. 2018a. LSTM Easy-first Dependency Parsing with Pre-trained Word Embeddings and Character-level Word Embeddings in Vietnamese. In Proceedings of KSE, pages 187–192.

Dat Quoc Nguyen, Mark Dras, and Mark Johnson. 2016. An empirical study for Vietnamese dependency parsing. In Proceedings of ALTA, pages 143–149.

Dat Quoc Nguyen, Dai Quoc Nguyen, Son Bao Pham, Phuong-Thai Nguyen, and Minh Le Nguyen. 2014. From Treebank Conversion to Automatic Dependency Parsing for Vietnamese. In Proceedings of NLDB, pages 196–207.

Dat Quoc Nguyen, Dai Quoc Nguyen, Thanh Vu, Mark Dras, and Mark Johnson. 2018b. A Fast and Accurate Vietnamese Word Segmenter. In Proceedings of LREC, pages 2582–2587.

Dat Quoc Nguyen and Karin Verspoor. 2018. An Improved Neural Network Model for Joint POS Tagging and Dependency Parsing. In Proceedings of the CoNLL 2018 Shared Task, pages 81–91.

Dat Quoc Nguyen, Thanh Vu, Dai Quoc Nguyen, Mark Dras, and Mark Johnson. 2017. From Word Segmentation to POS Tagging for Vietnamese. In Proceedings of ALTA, pages 108–113.

Kiem-Hieu Nguyen. 2018. BKTreebank: Building a Vietnamese Dependency Treebank. In Proceedings of LREC, pages 2164–2168.

Kiet Van Nguyen and Ngan Luu-Thuy Nguyen. 2015. Error Analysis for Vietnamese Dependency Parsing. In Proceedings of KSE, pages 79–84.

Kiet Van Nguyen and Ngan Luu-Thuy Nguyen. 2016. Vietnamese transition-based dependency parsing with supertag features. In Proceedings of KSE, pages 175–180.

Phuong Thai Nguyen, Xuan Luong Vu, Thi Minh Huyen Nguyen, Van Hiep Nguyen, and Hong Phuong Le. 2009. Building a Large Syntactically-Annotated Corpus of Vietnamese. In Proceedings of LAW, pages 182–185.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237.

Yuen Poowarawan. 1986. Dictionary-based Thai Syllable Separation. In Proceedings of the Ninth Electronics Engineering Conference, pages 409–418.

Xian Qian and Yang Liu. 2012. Joint Chinese Word Segmentation, POS Tagging and Parsing. In Proceedings of EMNLP-CoNLL, pages 501–511.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15:1929–1958.

Dinh Quang Thang, Le Hong Phuong, Nguyen Thi Minh Huyen, Nguyen Cam Tu, Mathias Rossignol, and Vu Xuan Luong. 2008. Word segmentation of Vietnamese texts: a comparison of approaches. In Proceedings of LREC, pages 1933–1936.

Luong Nguyen Thi, Linh Ha My, Hung Nguyen Viet, Huyen Nguyen Thi Minh, and Phuong Le Hong. 2013. Building a treebank for Vietnamese dependency parsing. In Proceedings of RIVF, pages 147–151.

Thanh Vu, Dat Quoc Nguyen, Dai Quoc Nguyen, Mark Dras, and Mark Johnson. 2018. VnCoreNLP: A Vietnamese Natural Language Processing Toolkit. In Proceedings of NAACL: Demonstrations, pages 56–60.

Daniel Zeman et al. 2017. CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. In Proceedings of the CoNLL 2017 Shared Task, pages 1–19.

Daniel Zeman et al. 2018. CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. In Proceedings of the CoNLL 2018 Shared Task, pages 1–21.

Meishan Zhang, Yue Zhang, Wanxiang Che, and Ting Liu. 2014. Character-Level Chinese Dependency Parsing. In Proceedings of ACL, pages 1326–1336.

Xingxing Zhang, Jianpeng Cheng, and Mirella Lapata. 2017. Dependency Parsing as Head Selection. In Proceedings of EACL, pages 665–676.

Yuan Zhang, Chengtao Li, Regina Barzilay, and Kareem Darwish. 2015. Randomized Greedy Inference for Joint Segmentation, POS Tagging and Dependency Parsing. In Proceedings of NAACL-HLT, pages 42–52.

Yuan Zhang and David Weiss. 2016. Stack-propagation: Improved representation learning for syntax. In Proceedings of ACL, pages 1557–1566.

Appendix

Implementation details: When training, each task component is fed with the corresponding task-associated sentences. The dependency parsing training set is the smallest in size (consisting of 8,977 sentences), thus for each training epoch we sample the same number of sentences from the word segmentation and POS tagging training sets. We train our model with a fixed random seed and without mini-batches. Dropout (Srivastava et al., 2014) is applied with a 67% keep probability to the inputs of the BiLSTMs and FFNNs. Following Kiperwasser and Goldberg (2016), we also use word dropout to learn embeddings for unknown syllables/words: we replace each syllable/word token s/w appearing #(s/w) times with a "unk" symbol with probability p_unk(s/w) = 0.25 / (0.25 + #(s/w)).

We initialize the syllable and word embeddings with the 100-dimensional pre-trained Word2Vec vectors used in Nguyen et al. (2017), while the initial word-boundary and POS tag embeddings are randomly initialized. All these embeddings are then updated during training. The sizes of the output layers of FFNN_WS, FFNN_POS and FFNN_LABEL are the number of BIO word-boundary tags (i.e. 3), the number of POS tags and the number of dependency relation types, respectively. We perform a minimal grid search of hyper-parameters, resulting in a size of 25 for the initial word-boundary tag embeddings, a POS tag embedding size of 100, a size of 100 for the output layers of the remaining FFNNs, 2 BiLSTM layers, and LSTM hidden states of size 128 in each layer.

Additional results: Table 3 presents POS tagging and parsing accuracies w.r.t. gold word segmentation. In this case, for our jointWPD, we feed the POS tagging and parsing components with gold word-segmented sentences when decoding.

Model        | WSeg  | PTag   | LAS     | UAS
Our jointWPD | 100.0 | 95.97  | 73.90   | 80.12
VnCoreNLP    | 100.0 | 95.88  | 71.38∗∗ | 77.35∗∗
jPTDP-v2     | 100.0 | 95.70∗ | 73.12∗∗ | 79.63∗
Biaffine     | 100.0 | 95.88  | 74.99∗∗ | 81.19∗∗

Table 3: POS tagging, LAS and UAS accuracy scores on the test sets w.r.t. gold word-segmented sentences. These scores are computed on all tokens (including punctuation). Recall that the LAS and UAS accuracies are computed on the VnDT v1.1 test set w.r.t. the automatically predicted POS tags.
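Finally, the word dropout scheme from the implementation details above can be sketched as follows (our own illustration; the example counts dictionary is an assumption):

```python
import random

def maybe_unk(token, counts, alpha=0.25):
    # Replace a syllable/word token with "unk" with probability
    # alpha / (alpha + #(token)), following Kiperwasser and Goldberg (2016).
    p_unk = alpha / (alpha + counts[token])
    return "unk" if random.random() < p_unk else token

counts = {"sinh_viên": 3}
print(maybe_unk("sinh_viên", counts))  # "unk" with probability 0.25 / 3.25
```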
