Xiao Zhang†, Yong Jiang‡, Hao Peng†, Kewei Tu‡, and Dan Goldwasser†
† Department of Computer Science, Purdue University, West Lafayette, USA
{zhang923, pengh, dgoldwas}@cs.purdue.edu
‡ School of Information Science and Technology, ShanghaiTech University, Shanghai, China
{jiangyong, tukw}@shanghaitech.edu.cn
[Figure: the NCRF-AE encoder and decoder over the example sentence "That is a money maker", with predicted labels y_{t-1}, y_t, y_{t+1}, decoder outputs r_1–r_5, and input words l_1–l_5.]

3.1 Neural CRF Encoder

The backward hidden states of the character-level LSTM are computed as

\overleftarrow{h}_t = H(W_{e\overleftarrow{h}}\, e^c_t + W_{\overleftarrow{h}\overleftarrow{h}}\, \overleftarrow{h}_{t-1} + b_{\overleftarrow{h}}),

where e^c_t is the character embedding for character c at position t in a word.

The inputs to the MLP are word embeddings e_v ∈ R^{k_2} for each word v, where k_2 is the dimensionality of the vector, concatenated with the character-level representations produced by the LSTM.
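To make the encoder input concrete, below is a minimal sketch of a character-level bidirectional LSTM whose final states are concatenated with the word embedding. It is written with PyTorch-style modules purely for illustration (the paper's implementation uses Theano), and the class name and dimensions are our own shorthand, not the authors' code.

import torch
import torch.nn as nn

class CharWordInput(nn.Module):
    # Sketch: build a character-level word representation with a BiLSTM and
    # concatenate it with the word embedding e_v, following Section 3.1.
    # Dimensions mirror the hyper-parameters reported in the experiments.
    def __init__(self, n_chars, n_words, char_dim=15, char_hidden=20, word_dim=200):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)   # e^c_t
        self.char_lstm = nn.LSTM(char_dim, char_hidden,
                                 bidirectional=True, batch_first=True)
        self.word_emb = nn.Embedding(n_words, word_dim)   # e_v

    def forward(self, char_ids, word_id):
        # char_ids: (1, num_chars) character indices of one word; word_id: (1,)
        chars = self.char_emb(char_ids)
        _, (h_n, _) = self.char_lstm(chars)                # final forward/backward states
        char_repr = torch.cat([h_n[0], h_n[1]], dim=-1)    # concatenate both directions
        return torch.cat([self.word_emb(word_id), char_repr], dim=-1)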
Table 1: Statistics of the different UD languages used in our experiments, including the number of tokens and the number of sentences in the training, development and test sets, respectively.
We used the standard UD division for training, development and testing in our experiments. The statistics of the data used in our experiments are described in Table 1. The UD dataset includes several low-resource languages which are of particular interest to us, as our model has practical applications for them: in reality, such languages have less annotated data, and a semi-supervised model is therefore especially useful.

Input Representation and Neural Architecture
Our model uses word embeddings as input. In our pilot experiments, we compared the performance on the English dataset of the pre-trained embeddings from Google News (Mikolov et al., 2013) with that of embeddings we trained directly on the UD dataset using the skip-gram algorithm (Mikolov et al., 2013). We found that the two types of embeddings yield very similar performance on the POS tagging task, so in our experiments we used embeddings trained directly on the UD dataset for each language as input to our model; their dimension is 200. For the MLP layer, the number of hidden nodes in the hidden layer is 20, which is also the size of the hidden layer in the character-level LSTM. The dimension of the character-level embeddings fed into the LSTM layer is 15, and they are randomly initialized. In order to incorporate the global information of the input sequence, we set the context window size to 3. The dropout rate for the dropout layer is set to 0.5.

Learning
We used ADADELTA (Zeiler, 2012) to update the parameters Λ of the encoder, as ADADELTA dynamically adapts the learning rate over time using only first-order information and has minimal computational overhead beyond vanilla stochastic gradient descent (SGD). The authors of ADADELTA also argue that the method is robust to noisy gradient information, different model architecture choices, various data modalities and the selection of hyper-parameters. We observed that ADADELTA indeed converged faster than vanilla SGD. In our experiments, we include the word embeddings and character embeddings as parameters as well. We used Theano to implement our algorithm, and all experiments were run on NVIDIA GPUs. To prevent over-fitting, we used the "early-stop" strategy to determine the appropriate number of training epochs. We made no effort to tune these hyper-parameters, and they remained the same in both our supervised and semi-supervised learning experiments.
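For reference, the hyper-parameters listed above can be gathered into a small configuration sketch. The names below are our own shorthand, not identifiers from the authors' Theano code.

from dataclasses import dataclass

# Hyper-parameters as reported in the text; the class itself is illustrative.
@dataclass
class NCRFAEConfig:
    word_emb_dim: int = 200      # skip-gram embeddings trained on the UD data
    mlp_hidden: int = 20         # hidden nodes in the MLP layer
    char_lstm_hidden: int = 20   # hidden layer size of the character-level LSTM
    char_emb_dim: int = 15       # randomly initialized character embeddings
    context_window: int = 3      # window capturing global sequence information
    dropout: float = 0.5         # dropout rate
    optimizer: str = "adadelta"  # ADADELTA (Zeiler, 2012) for the encoder parameters
    early_stopping: bool = True  # number of epochs chosen by the early-stop strategy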
6.2 Supervised Learning
In this setting, our neural CRF autoencoder model had access to the full amount of annotated training data in the UD dataset. As described in Section 5, the decoder's parameters Θ were estimated using real counts from the labeled data.

We compared our model with existing sequence labeling models, including an HMM, a CRF, an LSTM and a neural CRF (NCRF), on all 8 languages. Among these models, the NCRF is the most directly comparable to ours, as it serves as the base of our model but lacks the decoder (and, as a result, can only be used for supervised learning).

The results, summarized in Table 2, show that our NCRF-AE consistently outperformed all other systems on all 8 languages, including Russian, Indonesian and Croatian, which have considerably less data than the other languages. Interestingly, the NCRF consistently came second to our model, which demonstrates the efficacy of the expressivity added to our model by the decoder, together with an appropriate optimization approach.

To better understand the performance differences between the models, we performed an error analysis using an illustrative example, shown in Figure 3. In this example, the LSTM incorrectly predicted the POS tag of the word "search" as a verb instead of a noun (part of the NP "nice search engine"), while correctly predicting the preceding word, "nice", as an adjective. We attribute the error to the LSTM lacking explicit output transition scoring.
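To illustrate the point about transition scoring, the sketch below contrasts a purely local (LSTM-style) sequence score with a CRF-style score that adds an explicit term for each pair of adjacent output tags. This is a generic illustration of linear-chain scoring, not the authors' code; the arrays and tag inventory are made up.

import numpy as np

def local_score(emissions, tags):
    # LSTM-style scoring: each position is scored independently,
    # so nothing couples adjacent output tags.
    return sum(emissions[t, tag] for t, tag in enumerate(tags))

def crf_score(emissions, transitions, tags):
    # CRF-style scoring: add an explicit transition score between
    # consecutive tags, e.g. penalizing ADJ -> VERB inside a noun phrase.
    score = emissions[0, tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    return score

# Toy example: 3 positions, 2 tags (0 = NOUN, 1 = VERB).
emissions = np.array([[0.2, 1.0], [0.6, 0.4], [1.0, 0.1]])
transitions = np.array([[0.5, -1.0], [0.0, 0.3]])
print(local_score(emissions, [1, 0, 0]), crf_score(emissions, transitions, [1, 0, 0]))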
Models English French German Italian Russian Spanish Indonesian Croatian
HMM 86.28% 91.23% 85.59% 92.03% 79.82% 91.31% 89.40% 86.98%
CRF 89.96% 93.40% 86.83% 94.07% 83.38% 91.47% 88.63% 86.90%
LSTM 90.50% 94.16% 88.40% 94.96% 84.87% 93.17% 89.42% 88.95%
NCRF 91.52% 95.07% 90.27% 96.20% 93.37% 93.34% 92.32% 93.85%
NCRF-AE 92.50% 95.28% 90.50% 96.64% 93.60% 93.86% 93.96% 94.32%
Table 2: Supervised learning accuracy of POS tagging on 8 UD languages using different models.
Table 3: Semi-supervised learning accuracy of POS tagging on 8 UD languages. HEM means hard-EM,
used as a self-training approach, and OL means only 20% of the labeled data is used and no unlabeled
data is used.
Figure 4: UD English and Croatian POS tagging accuracy versus increasing proportion of unlabeled sequences, using 20% labeled data; panel (a) English, panel (b) Croatian. The green straight line is the performance of the neural CRF trained over the labeled data.

Proportion   Labeled Only   Both
10% − 90%    85.31%         85.80%
20% − 80%    88.41%         88.69%
30% − 70%    89.46%         89.85%
40% − 60%    90.06%         90.11%
50% − 50%    91.09%         91.15%

Table 4: Performance of the NCRF-AE with different proportions of labeled and unlabeled data. The left-most column shows the proportion of labeled (left) and unlabeled (right) data in the experiments. Results are reported both when using only the labeled data ("Labeled Only") and when using both the labeled and unlabeled data ("Both").
6.3.2 Semi-supervised POS Tagging on Multiple Languages
We compared our NCRF-AE model with other semi-supervised learning models, including the HMM-EM algorithm and the hard-EM version of our NCRF-AE. The hard-EM version of our model can be considered a variant of self-training, as it infers the missing labels using the current model in the E-step and uses the real counts of these labels to update the model in the M-step (a schematic is sketched below). To contextualize the results, we also provide the results of the NCRF model and of the supervised version of our NCRF-AE model trained on 20% of the data. We set the proportion of labeled data to 20% for each language and the proportion of unlabeled data to 50% of the dataset. There was no overlap between the labeled and unlabeled data.
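As referenced above, the following sketch shows the hard-EM / self-training loop in schematic form: the E-step predicts labels for the unlabeled data with the current model, and the M-step treats those predictions as real counts when re-estimating the decoder and updating the encoder. The `model` methods are hypothetical placeholders, not the authors' API.

def hard_em_epoch(labeled_data, unlabeled_data, model):
    # E-step: infer the missing labels for the unlabeled sentences
    # with the current model (self-training style).
    pseudo_labeled = [(x, model.predict_tags(x)) for x in unlabeled_data]

    # M-step: re-estimate the decoder from real counts over gold and
    # inferred labels, then update the encoder on the combined data.
    combined = list(labeled_data) + pseudo_labeled
    model.estimate_decoder_from_counts(combined)
    model.update_encoder(combined)
    return model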
The results are summarized in Table 3. Similar to the supervised experiments, the supervised version of our NCRF-AE, trained over 20% of the labeled data, outperforms the NCRF model. Our model was able to successfully use the unlabeled data, leading to improved performance in all languages over both the supervised version of our model and the HMM-EM and hard-EM models that were also trained over both the labeled and unlabeled data.
6.3.3 Varying Sizes of Labeled Data on English
Semi-supervised approaches tend to work well when only a small amount of labeled training data is available, but as the amount of labeled training data grows, the benefit may diminish. To verify this conjecture, we conducted additional experiments to show how varying sizes of labeled training data affect the effectiveness of our NCRF-AE model. In these experiments, we increased the proportion of labeled data and accordingly decreased the proportion of unlabeled data.

The results of these experiments are shown in Table 4. As we speculated, we observed diminishing effectiveness when increasing the proportion of labeled data in training.

7 Conclusion
We proposed an end-to-end neural CRF autoencoder (NCRF-AE) model for semi-supervised sequence labeling. Our NCRF-AE integrates a discriminative model and a generative model, extending the generalized autoencoder by using a neural CRF model as its encoder and building a generative decoder on top of it. We suggest a variant of the EM algorithm to learn the parameters of our NCRF-AE model.

We evaluated our model in both supervised and semi-supervised scenarios over multiple languages, and showed that it can outperform other supervised and semi-supervised methods. We also performed additional experiments to show how varying sizes of labeled training data affect the effectiveness of our model.

These results demonstrate the strength of our model: it was able to utilize the small amount of labeled data and exploit the hidden information in the large amount of unlabeled data, without the additional feature engineering that is often needed to make semi-supervised and weakly-supervised systems perform well. The superior performance on the low-resource languages also suggests its potential in practical use.
References

Waleed Ammar, Chris Dyer, and Noah A. Smith. 2014. Conditional random field autoencoders for unsupervised structured prediction. In The Conference on Advances in Neural Information Processing Systems (NIPS), pages 1–12.

Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 2442–2452.

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Learning to compose neural networks for question answering. In Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), pages 1545–1554.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. of the International Conference on Learning Representations (ICLR).

Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP).

Yong Cheng, Yang Liu, and Wei Xu. 2017. Maximum reconstruction estimation for generative latent-variable models. In Proc. of the National Conference on Artificial Intelligence (AAAI).

A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38.

Greg Durrett and Dan Klein. 2015. Neural CRF parsing. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 302–312.

Geoffrey E. Hinton and Richard S. Zemel. 1994. Autoencoders, minimum description length and Helmholtz free energy. In The Conference on Advances in Neural Information Processing Systems (NIPS), pages 3–10.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Arzoo Katiyar and Claire Cardie. 2016. Investigating LSTMs for joint extraction of opinion entities and relations. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 919–929, Berlin, Germany.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), pages 1746–1751.

Tomáš Kočiský, Gábor Melis, Edward Grefenstette, Chris Dyer, Wang Ling, Phil Blunsom, and Karl Moritz Hermann. 2016. Semantic parsing with semi-supervised sequential autoencoders. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), pages 1078–1087.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of the International Conference on Machine Learning (ICML), pages 282–289, San Francisco, CA, USA.

Guillaume Lample, Miguel Ballesteros, Kazuya Kawakami, Sandeep Subramanian, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), pages 1–10.

Chu-Cheng Lin, Waleed Ammar, Chris Dyer, and Lori Levin. 2015. Unsupervised POS induction with word embeddings. In Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), pages 1311–1316.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 1064–1074, Berlin, Germany.

Z. Marinho, A. F. T. Martins, S. B. Cohen, and N. A. Smith. 2016. Semi-supervised learning of sequence models with the method of moments. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP).

David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), pages 152–159, Stroudsburg, PA, USA.

Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, Oscar Täckström, Claudia Bedini, Núria Bertomeu Castelló, and Jungmee Lee. 2013. Universal dependency annotation for multilingual parsing. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL).

G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-Tur, X. He, L. Heck, G. Tur, D. Yu, and G. Zweig. 2015. Using recurrent neural networks for slot filling in spoken language understanding. Transactions on Audio, Speech, and Language Processing, 23(3):530–539.

Yishu Miao and Phil Blunsom. 2016. Language as a latent variable: Discrete generative models for sentence compression. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), pages 319–328, Austin, Texas.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In The Conference on Advances in Neural Information Processing Systems (NIPS), pages 1–9.

Nanyun Peng and Mark Dredze. 2016. Improving named entity recognition for Chinese social media with word segmentation representation learning. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 149–155.

Sam Wiseman and Alexander M. Rush. 2016. Sequence-to-sequence learning as beam-search optimization. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), pages 1296–1306, Austin, Texas.

Matthew D. Zeiler. 2012. ADADELTA: An adaptive learning rate method. CoRR, abs/1212.5701.