
Semi-Supervised Structured Prediction with Neural CRF Autoencoder

Xiao Zhang†, Yong Jiang‡, Hao Peng†, Kewei Tu‡, and Dan Goldwasser†

†Department of Computer Science, Purdue University, West Lafayette, USA
{zhang923, pengh, dgoldwas}@cs.purdue.edu

‡School of Information Science and Technology, ShanghaiTech University, Shanghai, China
{jiangyong, tukw}@shanghaitech.edu.cn

Abstract

In this paper we propose an end-to-end neural CRF autoencoder (NCRF-AE) model for semi-supervised learning of sequential structured prediction problems. Our NCRF-AE consists of two parts: an encoder, which is a CRF model enhanced by deep neural networks, and a decoder, which is a generative model trying to reconstruct the input. Our model has a unified structure with different loss functions for labeled and unlabeled data with shared parameters. We developed a variation of the EM algorithm for optimizing both the encoder and the decoder simultaneously by decoupling their parameters. Our experimental results over the Part-of-Speech (POS) tagging task on eight different languages show that our model can outperform competitive systems in both supervised and semi-supervised scenarios.

1 Introduction

The recent renaissance of deep learning has led to significant strides forward in several AI fields. In Natural Language Processing (NLP), characterized by highly structured tasks, promising results were obtained by models that combine deep learning methods with traditional structured learning algorithms (Chen and Manning, 2014; Durrett and Klein, 2015; Andor et al., 2016; Wiseman and Rush, 2016). These models combine the strengths of neural models, which can score local decisions using a rich non-linear representation, with efficient inference procedures used to combine the local decisions into a coherent global decision. Among these models, neural variants of the Conditional Random Fields (CRF) model (Lafferty et al., 2001) are especially popular. By replacing the linear potentials with non-linear potentials computed by neural networks, these models were able to improve performance in several structured prediction tasks (Andor et al., 2016; Peng and Dredze, 2016; Lample et al., 2016; Ma and Hovy, 2016; Durrett and Klein, 2015).

Despite their promise, wider adoption of these algorithms for new structured prediction tasks can be difficult. Neural networks are notoriously susceptible to over-fitting unless large amounts of training data are available. This problem is exacerbated in the structured settings, as accounting for the dependencies between decisions requires even more data. Providing it through manual annotation is often a difficult, labor-intensive task.

In this paper we tackle this problem, and propose an end-to-end neural CRF autoencoder (NCRF-AE) model for semi-supervised learning on sequence labeling problems.

An autoencoder is a special type of neural net, modeling the conditional probability P(X̂|X), where X is the original input to the model and X̂ is the reconstructed input (Hinton and Zemel, 1994). Autoencoders consist of two parts: an encoder projecting the input to a hidden space, and a decoder reconstructing the input from it.

Traditionally, autoencoders are used for generating a compressed representation of the input by projecting it into a dense low-dimensional space. In our setting the hidden space consists of discrete variables that comprise the output structure. These generalized settings are described in Figure 1a. By definition, it is easy to see that the encoder (lower half in Figure 1a) can be modeled by a discriminative model describing P(Y|X) directly, while the decoder (upper half in Figure 1a) naturally fits as a generative model, describing P(X̂|Y), where Y is the label. In our model, illustrated in Figure 1b, the encoder is a CRF model with neural networks as its potential extractors, while the decoder is a generative model trying to reconstruct the input.
Our model carries over the merit of autoencoders, which can exploit the valuable information in unlabeled data. Recent works (Ammar et al., 2014; Lin et al., 2015) suggested using an autoencoder with a CRF model as an encoder in an unsupervised setting. We significantly expand on these works and suggest the following contributions:

1. We propose a unified model seamlessly accommodating both unlabeled and labeled data. While past work focused on unsupervised structured prediction, neglecting the discriminative power of such models, our model easily supports learning in both fully supervised and semi-supervised settings. We developed a variation of the Expectation-Maximization (EM) algorithm, used for optimizing the encoder and the decoder of our model simultaneously.

2. We increase the expressivity of the traditional CRF autoencoder model by using neural networks as the potential extractors, thus avoiding the heavy feature engineering necessary in previous works. Interestingly, our model's predictions, which unify the discriminative neural CRF encoder and the generative decoder, have led to an improved performance over the highly optimized neural CRF (NCRF) model alone, even when trained in the supervised setting over the same data.

3. We demonstrate the advantages of our model empirically, focusing on the well-known Part-of-Speech (POS) tagging problem over 8 different languages, including low-resource languages. In the supervised setting, our NCRF-AE outperformed the highly optimized NCRF. In the semi-supervised setting, our model was able to successfully utilize unlabeled data, improving on the performance obtained when only using the labeled data, and outperforming competing semi-supervised learning algorithms.

Furthermore, our newly proposed algorithm is directly applicable to other sequential learning tasks in NLP, and can be easily adapted to other structured tasks such as dependency parsing or constituent parsing by replacing the forward-backward algorithm with the inside-outside algorithm. All of these tasks can benefit from semi-supervised learning algorithms.

2 Related Work

Neural networks were successfully applied to many NLP tasks, including tagging (Ma and Hovy, 2016; Mesnil et al., 2015; Lample et al., 2016), parsing (Chen and Manning, 2014), text generation (Sutskever et al., 2011), machine translation (Bahdanau et al., 2015), sentiment analysis (Kim, 2014) and question answering (Andreas et al., 2016). Most relevant to this work are structured prediction models capturing dependencies between decisions, either by modeling the dependencies between the hidden representations of connected decisions using an RNN/LSTM (Vaswani et al., 2016; Katiyar and Cardie, 2016), by explicitly modeling the structural dependencies between output predictions (Durrett and Klein, 2015; Lample et al., 2016; Andor et al., 2016), or by combining the two approaches (Socher et al., 2013; Wiseman and Rush, 2016).

In contrast to supervised latent variable models, such as the Hidden Conditional Random Fields of Quattoni et al. (2007), which use additional latent variables for supervised structured prediction, we do not presume any additional latent variables in our NCRF-AE model, in either the supervised or the semi-supervised setting.

The difficulty of providing sufficient supervision has motivated work on semi-supervised and unsupervised learning for many of these tasks (McClosky et al., 2006; Spitkovsky et al., 2010; Subramanya et al., 2010; Stratos and Collins, 2015; Marinho et al., 2016; Tran et al., 2016), including several that also used autoencoders (Ammar et al., 2014; Lin et al., 2015; Miao and Blunsom, 2016; Kociský et al., 2016; Cheng et al., 2017). In this paper we expand on these works, and suggest a neural CRF autoencoder that can leverage both labeled and unlabeled data.

3 Neural CRF Autoencoder

In semi-supervised learning the algorithm needs to utilize both labeled and unlabeled data. Autoencoders offer a convenient way of dealing with both types of data in a unified fashion.

A generalized autoencoder (Figure 1a) tries to reconstruct the input X̂ given the original input X, aiming to maximize the log probability P(X̂|X) without knowing the latent variable Y explicitly. Since we focus on sequential structured prediction problems, the encoding and decoding processes are no longer over a single data point (x, y) (x if unlabeled), but over the whole input instance and output sequence (x, y) (x if unlabeled). Additionally, as our main purpose in this study is to reconstruct the input with precision, x̂ is just a copy of x.
[Figure 1: two diagrams, (a) a generalized autoencoder and (b) the neural CRF autoencoder model in this work.]

Figure 1: On the left is a generalized autoencoder, of which the lower half is the encoder and the upper half is the decoder. On the right is an illustration of the graphical model of our NCRF-AE model. The yellow squares are interactive potentials among labels, and the green squares represent the unary potentials generated by the neural networks.

As shown in Figure 1b, our NCRF-AE model consists of two parts: the encoder (the lower half) is a discriminative CRF model enhanced by deep neural networks as its potential extractors, with encoding parameters Λ, describing the probability of a predicted sequence of labels given the input; the decoder (the upper half) is a generative model with reconstruction parameters Θ, modeling the probability of reconstructing the input given a sequence of labels. Accordingly, we present our model mathematically as follows:

PΘ,Λ(x̂|x) = Σy PΘ,Λ(x̂, y|x) = Σy PΘ(x̂|y) PΛ(y|x),

where PΛ(y|x) is the probability given by the neural CRF encoder, and PΘ(x̂|y) is the probability produced by the generative decoder.

When making a prediction, the model tries to find the most probable output sequence by performing the following inference procedure using the Viterbi algorithm:

y* = arg maxy PΘ,Λ(x̂, y|x).

To clarify, as we focus on POS tagging problems in this study, in the unsupervised setting where the true POS tags are unknown, the labels used for reconstruction are the POS tags being induced. The labels induced here correspond to the hidden nodes in a generalized autoencoder model.

3.1 Neural CRF Encoder

In a CRF model, the probability of the predicted labels y, given a sequence x as input, is modeled as

PΛ(y|x) = exp(Φ(x, y)) / Z,

where Z = Σỹ exp(Φ(x, ỹ)) is the partition function that marginalizes over all possible assignments to the predicted labels of the sequence, and Φ(x, y) is the scoring function, which is defined as:

Φ(x, y) = Σt [φ(x, yt) + ψ(yt−1, yt)].

The partition function Z can be computed efficiently via the forward-backward algorithm. The term φ(x, yt) corresponds to the score of a particular tag yt at position t in the sequence, and ψ(yt−1, yt) represents the score of the transition from the tag at position t − 1 to the tag at position t. In our NCRF-AE model, φ(x, yt) is produced by deep neural networks, while ψ(yt−1, yt) is given by a transition matrix. Such a structure allows for the use of distributed representations of the input, for instance, word embeddings in a continuous vector space (Mikolov et al., 2013).
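As a concrete illustration of the inference machinery above, the following sketch computes log Z with the forward recursion and decodes the best tag sequence with Viterbi, given a matrix of unary scores φ and a transition matrix ψ. This is a minimal NumPy sketch of the standard algorithms, not the authors' Theano implementation; the names unary, trans, log_partition and viterbi are ours.

```python
import numpy as np

def log_partition(unary, trans):
    """Forward algorithm in log space.

    unary: (T, L) array, unary[t, y] = phi(x, y_t = y)
    trans: (L, L) array, trans[i, j] = psi(y_{t-1} = i, y_t = j)
    Returns log Z = log sum over all tag sequences of exp(Phi(x, y)).
    """
    alpha = unary[0]                                  # log forward messages at t = 0
    for t in range(1, unary.shape[0]):
        # log-sum-exp over the previous tag, for every current tag
        scores = alpha[:, None] + trans + unary[t][None, :]
        alpha = np.logaddexp.reduce(scores, axis=0)
    return np.logaddexp.reduce(alpha)

def viterbi(unary, trans):
    """Most probable tag sequence under the same scores."""
    T, L = unary.shape
    delta = unary[0]
    backptr = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + trans + unary[t][None, :]
        backptr[t] = scores.argmax(axis=0)            # best previous tag per current tag
        delta = scores.max(axis=0)
    best = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                     # backtrace
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]
```

For the full NCRF-AE decision rule y* = arg maxy PΘ,Λ(x̂, y|x), the per-position reconstruction scores log θyt→xt would simply be added to unary before decoding.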
Typically, in our work φ(x, yt) is modeled jointly by a multi-layer perceptron (MLP) that utilizes the word-level information, and a bi-directional long short-term memory (LSTM) neural network (Hochreiter and Schmidhuber, 1997) that captures the character-level information within each word. A bi-directional structure can extract character-level information from both directions, with which we expect to capture the prefix and suffix information of words in an end-to-end system, rather than using hand-engineered features. The bi-directional LSTM neural network consumes character embeddings ec ∈ R^k1 as input, where k1 is the dimensionality of the character embeddings. A normal LSTM can be denoted as:
it = σ(Wei ect + Whi ht−1 + bi),
ft = σ(Wef ect + Whf ht−1 + bf),
ot = σ(Weo ect + Who ht−1 + bo),
gt = Relu(Wec ect + Whc ht−1 + bc),
ct = ft ⊙ ct−1 + it ⊙ gt,
ht = ot ⊙ tanh(ct),

where ⊙ denotes element-wise multiplication. A bi-directional LSTM neural network then extends this as follows, denoting the procedure of generating ht by H:

→ht = H(We→h ect + W→h→h →ht−1 + b→h),
←ht = H(We←h ect + W←h←h ←ht−1 + b←h),

where ect here is the character embedding for the character c at position t in a word.

The inputs to the MLP are word embeddings ev ∈ R^k2 for each word v, where k2 is the dimensionality of the vector, concatenated with the final representation generated by the bi-directional LSTM over the characters of that word: u = [ev; →hv; ←hv]. In order to leverage the capacity of the CRF model, we use a word and its context together to generate the unary potential. More specifically, we adopt the concatenation vt = [ut−(w−1)/2; · · · ; ut−1; ut; ut+1; · · · ; ut+(w−1)/2] as the input to the MLP model, where t denotes the position in a sequence, and w, an odd number, indicates the context size. Further, in order to enhance the generality of our model, we add a dropout layer on the input right before the MLP layer as a regularizer. Notice that, different from a normal MLP, the activation function of the last layer is no longer a softmax function, but a linear function that generates the log-linear part φt(x, yt) of the CRF model:

ht = Relu(W vt + b),
φt = wy⊤ ht + by.

The transition score ψ(yt−1, yt) is a single scalar representing the interactive potential. We use a transition matrix Ψ to cover all the transitions between different labels, and Ψ is part of the encoder parameters Λ.

All the parameters in the neuralized encoder are updated when the loss function is minimized via error back-propagation through all the structures of the neural networks and the transition matrix.
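To make the construction of the unary potentials concrete, the sketch below concatenates the word embedding with the final forward and backward character-LSTM states into ut, stacks a context window of width w into vt, and applies the ReLU hidden layer and linear output layer described above. It is a simplified NumPy rendering under our own naming (dropout and the character LSTM itself are omitted), not the released code.

```python
import numpy as np

def unary_potentials(word_emb, char_fwd, char_bwd, W, b, Wy, by, w=3):
    """Compute phi[t, y] for one sentence.

    word_emb: (T, k2)  word embeddings e_v
    char_fwd: (T, d)   final forward char-LSTM state per word
    char_bwd: (T, d)   final backward char-LSTM state per word
    W, b:     MLP hidden layer;  Wy, by: linear output layer (one row of Wy per tag)
    w:        odd context window size (w = 3 in the experiments of Section 6.1)
    """
    u = np.concatenate([word_emb, char_fwd, char_bwd], axis=1)   # u_t = [e_v; fwd; bwd]
    T, d_u = u.shape
    half = (w - 1) // 2
    padded = np.vstack([np.zeros((half, d_u)), u, np.zeros((half, d_u))])
    # v_t stacks the window [u_{t-half}, ..., u_{t+half}] into one vector
    v = np.stack([padded[t:t + w].reshape(-1) for t in range(T)])
    h = np.maximum(0.0, v @ W + b)          # ReLU hidden layer
    return h @ Wy.T + by                    # (T, num_tags) unary scores phi_t
```

A usage note: with the reported settings (200-dimensional word embeddings, 20 hidden units for the MLP and the character LSTM, window size 3), W would have shape (3 × (200 + 2 × 20), 20) and Wy shape (17, 20) for the 17 universal POS tags.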
The detailed structure of the neural CRF encoder is demonstrated in Figure 2. Note that the MLP layer is also interchangeable with a recurrent neural network (RNN) layer or an LSTM layer. But in our pilot experiments, we found a single MLP structure yields better performance, which we conjecture is due to over-fitting caused by the high complexity of those alternatives.

[Figure 2: the example sentence "That is a money maker" tagged PRON VERB DET NOUN NOUN, with character-level LSTM outputs l1–l5 and r1–r5, word embeddings e1–e5, and concatenated representations u1–u5.]

Figure 2: A demonstration of the neural CRF encoder. lt and rt are the output of the forward and backward character-level LSTM of the word at position t in a sentence, and et is the word-level embedding of that word. ut is the concatenation of et, lt and rt, denoted by blue dashed arrows.

3.2 Generative Decoder

In our NCRF-AE, we assume the generative process follows several multinomial distributions: each label y has the probability θy→x of reconstructing the corresponding word x, i.e., P(x|y) = θy→x. This setting naturally leads to the constraint Σx θy→x = 1. The number of parameters of the decoder is |Y| × |X|. For a whole sequence, the reconstruction probability is PΘ(x̂|y) = Πt P(x̂t|yt).
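A minimal sketch of this decoder, assuming θ is stored as a |Y| × |X| table whose rows are multinomials over the vocabulary; the class and variable names are ours, and the vocabulary size in the usage comment is purely illustrative.

```python
import numpy as np

class MultinomialDecoder:
    def __init__(self, num_tags, vocab_size):
        # theta[y, x] = P(x | y); each row is a multinomial over the vocabulary
        self.theta = np.full((num_tags, vocab_size), 1.0 / vocab_size)

    def log_prob(self, tags, words):
        """log P_Theta(x_hat | y) for a sequence of tag ids and word ids."""
        return float(np.sum(np.log(self.theta[tags, words])))

# decoder = MultinomialDecoder(num_tags=17, vocab_size=50000)  # 17 universal POS tags; vocab size illustrative
# decoder.log_prob([3, 7, 1], [1204, 88, 5])
```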
4 A Unified Learning Framework

We first constructed two loss functions for labeled and unlabeled data using the same model. Our model is trained in an on-line fashion: given a labeled or unlabeled sentence, our NCRF-AE optimizes the loss function by choosing the corresponding one. In an analogy to coordinate descent, we optimize the loss function of the
NCRF-AE by alternately updating the parameters Θ in the decoder and the parameters Λ in the encoder. The parameters Θ in the decoder are updated via a variation of the Expectation-Maximization (EM) algorithm, and the parameters Λ in the encoder are updated through a gradient-based method, due to the non-convexity of the neuralized CRF. In contrast to the early autoencoder models (Ammar et al., 2014; Lin et al., 2015), our model has two distinctions: first, we have two loss functions to model labeled examples and unlabeled examples; second, we designed a variant of the EM algorithm to alternately learn the parameters of the encoder and the decoder at the same time.

4.1 Unified Loss Functions for Labeled and Unlabeled Data

For a sequential input with labels, the complete data likelihood given by our NCRF-AE is

PΘ,Λ(x̂, y|x) = PΘ(x̂|y) PΛ(y|x) = [Πt P(x̂t|yt)] · exp(Φ(x, y)) / Z = exp(Σt st(x, y)) / Z,

where

st(x, y) = log P(xt|yt) + φ(x, yt) + ψ(yt−1, yt).

If the input sequence is unlabeled, we can simply marginalize over all the possible assignments to the labels. The probability is formulated as

PΘ,Λ(x̂|x) = Σy P(x̂, y|x) = U / Z,

where U = Σy exp(Σt st(x, y)).

Our formulation has two advantages. First, the term U is different from, but has a similar form to, the term Z, so that to calculate the probability P(x̂|x) for an unlabeled sequence, the forward-backward algorithm used to compute the partition function Z can also be applied to compute U efficiently. Second, our NCRF-AE highlights a unified structure of different loss functions for labeled and unlabeled data with shared parameters. Thus, during training, our model can address both labeled and unlabeled data well by alternating the loss functions. Using the negative log-likelihood as our loss function, if the data is labeled, the loss function is:

lossl = −log PΘ,Λ(x̂, y|x) = −(Σt st(x, y) − log Z).

If the data is unlabeled, the loss function is:

lossu = −log PΘ,Λ(x̂|x) = −(log U − log Z).

Thus, during training, based on whether the encountered data is labeled or unlabeled, our model can select the appropriate loss function for learning its parameters. In practice, we found that for labeled data, using a combination of lossl and lossu actually yields better performance.
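Both losses can be obtained from the same forward machinery: Z sums over tag sequences using only the encoder scores, while U uses the unary scores augmented with the per-position reconstruction terms log θyt→xt. The sketch below, which reuses the log_partition helper sketched in Section 3.1, is our own hedged rendering of this computation, not the paper's implementation.

```python
import numpy as np

def losses(unary, trans, log_theta, words, tags=None):
    """Return loss_l (if tags are given) and loss_u for one sentence.

    unary:     (T, L) encoder scores phi
    trans:     (L, L) transition scores psi
    log_theta: (L, V) log reconstruction parameters log theta_{y -> x}
    words:     length-T list of word ids (x_hat is a copy of x)
    tags:      length-T list of gold tag ids, or None for unlabeled data
    """
    log_Z = log_partition(unary, trans)
    # augment the unary scores with the reconstruction term to obtain U
    aug = unary + log_theta[:, words].T              # aug[t, y] = phi + log theta_{y -> x_t}
    log_U = log_partition(aug, trans)
    loss_u = -(log_U - log_Z)
    if tags is None:
        return loss_u
    score = sum(aug[t, y] for t, y in enumerate(tags))
    score += sum(trans[tags[t - 1], tags[t]] for t in range(1, len(tags)))
    loss_l = -(score - log_Z)                        # -(sum_t s_t(x, y) - log Z)
    return loss_l, loss_u                            # the paper combines both for labeled data
```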
5 Mixed Expectation-Maximization Algorithm

The Expectation-Maximization (EM) algorithm (Dempster et al., 1977) has been applied to a wide range of problems. Generally, it establishes a lower bound of the objective function by using Jensen's inequality. It first finds the posterior distribution of the latent variables, and then, based on this posterior distribution, it maximizes the lower bound. By alternating expectation (E) and maximization (M) steps, the algorithm iteratively improves the lower bound of the objective function.

In this section we describe the mixed Expectation-Maximization (EM) algorithm used in our study. Parameterized by the encoding parameters Λ and the reconstruction parameters Θ, our NCRF-AE consists of the encoder and the decoder, which together make the log-likelihood a highly non-convex function. However, a careful observation shows that if we fix the encoder, the lower bound derived in the E-step is convex with respect to the reconstruction parameters Θ in the M-step. Hence, in the M-step we can analytically obtain the global optimum of Θ. For the reconstruction parameters Θ, with Λ fixed, we describe our EM algorithm at iteration t as follows.

In the E-step, we let Q(yi) = P(yi|xi, x̂i), and treat yi as the latent variable, as it is not observable in unlabeled data. We derive the lower bound of the
log-likelihood using Q(yi):

Σi log P(x̂i|xi) = Σi log Σyi Q(yi) [P(x̂i, yi|xi) / Q(yi)]
               ≥ Σi Σyi Q(yi) log [P(x̂i, yi|xi) / Q(yi)],

where Q(yi) is computed using the parameters Θ(t−1) from the previous iteration t − 1.

In the M-step, we try to improve the aforementioned lower bound using all examples:

arg maxΘ(t) Σi Σyi Q(yi) log [PΘ(t)(x̂i|yi) PΛ(yi|xi) / Q(yi)]
= arg maxΘ(t) Σi Σyi Q(yi) log PΘ(t)(x̂i|yi) + const
= arg maxΘ(t) Σy→x log θ(t)y→x Σy Q(y) C(y, x)
= arg maxΘ(t) Σy→x log θ(t)y→x Ey∼Q[C(y, x)]
s.t. Σx θ(t)y→x = 1.

In this formulation, const is a constant with respect to the parameters we are updating. Q(y) is the distribution of a label y at any position, obtained by marginalizing over the labels at all other positions in a sequence. Denoting by C(y, x) the number of times that (x, y) co-occurs, Ey∼QΘ(t−1)[C(y, x)] is the expected count of a particular reconstruction at any position, which can also be calculated using the Baum-Welch algorithm (Welch, 2003), and can be summed over all examples in the dataset (for the labeled data, it is just a real count). The algorithm we use to calculate the expected counts is described in Algorithm 1. Therefore, it can be shown that the aforementioned global optimum can be obtained by simply normalizing the expected counts. The encoder's parameters Λ are first updated via gradient-based optimization before each EM iteration. Based on the above discussion, our Mixed EM Algorithm is presented in Algorithm 2.

Algorithm 1 Obtain Expected Count (Te)
Require: the expected count table Te
1: for an unlabeled data example xi do
2:   Compute the forward messages: α(y, t) ∀y, t.    ⊳ t is the position in a sequence.
3:   Compute the backward messages: β(y, t) ∀y, t.
4:   Calculate the expected count for each x in xi: P(yt|xt) ∝ α(y, t) × β(y, t).
5:   Te(xt, yt) ← Te(xt, yt) + P(yt|xt)    ⊳ Te is the expected count table.
6: end for

Algorithm 2 Mixed Expectation-Maximization
1: Initialize the expected count table Te using the labeled data {x, y}l and use it as Θ(0) in the decoder.
2: Initialize Λ(0) in the encoder randomly.
3: for t in epochs do
4:   Train the encoder on the labeled data {x, y}l and the unlabeled data {x}u to update Λ(t−1) to Λ(t).
5:   Re-initialize the expected count table Te with 0s.
6:   Use the labeled data {x, y}l to calculate real counts and update Te.
7:   Use the unlabeled data {x}u to compute the expected counts with parameters Λ(t) and Θ(t−1) and update Te.
8:   Obtain Θ(t) globally and analytically based on Te.
9: end for

This mixed EM algorithm is a combination of the gradient-based approach, which optimizes the encoder by minimizing the negative log-likelihood as the loss function, and the EM approach, which updates the decoder's parameters by improving the lower bound of the log-likelihood.
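To make Algorithm 1 and the closed-form M-step concrete, the sketch below computes posterior tag marginals with forward-backward, accumulates them into the expected count table Te, and row-normalizes the table to obtain the new θ. It is a schematic NumPy rendering under our own naming (scores_for, labeled_data and unlabeled_data are placeholders), not the released implementation.

```python
import numpy as np

def posterior_marginals(scores, trans):
    """P(y_t = y | x, x_hat) from forward-backward in log space.

    scores: (T, L) per-position scores (unary plus log-reconstruction terms)
    trans:  (L, L) transition scores
    """
    T, L = scores.shape
    alpha = np.zeros((T, L)); beta = np.zeros((T, L))
    alpha[0] = scores[0]
    for t in range(1, T):
        alpha[t] = scores[t] + np.logaddexp.reduce(alpha[t - 1][:, None] + trans, axis=0)
    for t in range(T - 2, -1, -1):
        beta[t] = np.logaddexp.reduce(trans + scores[t + 1][None, :] + beta[t + 1][None, :], axis=1)
    log_marg = alpha + beta
    log_marg -= np.logaddexp.reduce(log_marg, axis=1, keepdims=True)
    return np.exp(log_marg)                          # each row sums to 1

def m_step(count_table):
    """Closed-form M-step: theta_{y -> x} = (expected) counts, normalized per tag."""
    return count_table / count_table.sum(axis=1, keepdims=True)

# One EM sweep over the decoder (Algorithm 2, steps 5-8), schematically:
# Te = np.zeros((num_tags, vocab_size))
# for words, tags in labeled_data:                   # real counts
#     for w, y in zip(words, tags):
#         Te[y, w] += 1.0
# for words in unlabeled_data:                       # expected counts (Algorithm 1)
#     marg = posterior_marginals(scores_for(words), trans)
#     for t, w in enumerate(words):
#         Te[:, w] += marg[t]
# theta = m_step(Te + 1e-8)                          # small smoothing to avoid empty rows
```

Because each row of Te is normalized independently, the M-step is exact and global for Θ, which is what allows the mixed algorithm to interleave it with gradient steps on Λ.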
English French German Italian Russian Spanish Indonesian Croatian
Tokens 254830 391107 293088 272913 99389 423346 121923 139023
Training 12543 14554 14118 12837 4029 14187 4477 5792
Development 2002 1596 799 489 502 1552 559 200
Testing 2077 298 977 489 499 274 297 297

Table 1: Statistics of the different UD languages used in our experiments, including the number of tokens and the number of sentences in the training, development and testing sets, respectively.

6 Experiments

6.1 Experimental Settings

Dataset We evaluated our model on the POS tagging task, in both the supervised and semi-supervised learning settings, over eight different languages from the UD (Universal Dependencies) 1.4 dataset (McDonald et al., 2013). The task is defined over 17 different POS tags, used across the different languages. We followed the original UD division for training, development and testing in our experiments. The statistics of the data used in our experiments are described in Table 1. The UD dataset includes several low-resource languages, which are of particular interest to us, as our model has a practical application for them: in reality, they have less annotated data, and a semi-supervised model is useful for them.

Input Representation and Neural Architecture Our model uses word embeddings as input. In our pilot experiments, we compared the performance on the English dataset of the pre-trained embeddings from Google News (Mikolov et al., 2013) and the embeddings we trained directly on the UD dataset using the skip-gram algorithm (Mikolov et al., 2013). We found these two types of embeddings yield very similar performance on the POS tagging task. So in our experiments, we used embeddings of the different languages trained directly on the UD dataset as input to our model, whose dimension is 200. For the MLP neural network layer, the number of hidden nodes in the hidden layer is 20, which is the same for the hidden layer in the character-level LSTM. The dimension of the character-level embeddings fed into the LSTM layer is 15, and they are randomly initialized. In order to incorporate the global information of the input sequence, we set the context window size to 3. The dropout rate for the dropout layer is set to 0.5.
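For reference, the hyper-parameters reported above can be collected into a single configuration; the dictionary below is our own summary of the values stated in this section, not a file distributed with the paper.

```python
# Hyper-parameters as reported in Section 6.1 (names are ours).
NCRF_AE_CONFIG = {
    "word_embedding_dim": 200,     # skip-gram embeddings trained on UD
    "char_embedding_dim": 15,      # randomly initialized
    "char_lstm_hidden": 20,
    "mlp_hidden": 20,
    "context_window": 3,
    "dropout_rate": 0.5,
    "optimizer": "ADADELTA",       # used for the encoder parameters (Zeiler, 2012)
    "early_stopping": True,
}
```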
Learning We used ADADELTA (Zeiler, 2012) to update the parameters Λ in the encoder, as ADADELTA dynamically adapts the learning rate over time using only first-order information and has minimal computational overhead beyond vanilla stochastic gradient descent (SGD). The authors of ADADELTA also argue that the method appears robust to noisy gradient information, different model architecture choices, various data modalities and the selection of hyper-parameters. We observed that ADADELTA indeed converged faster than vanilla SGD optimization. In our experiments, we include the word embeddings and character embeddings as parameters as well. We used Theano to implement our algorithm, and all the experiments were run on NVIDIA GPUs. To prevent over-fitting, we used an "early-stop" strategy to determine the appropriate number of epochs during training. We did not make an effort to tune these hyper-parameters, and they remained the same in both our supervised and semi-supervised learning experiments.

6.2 Supervised Learning

In this setting our neural CRF autoencoder model had access to the full amount of annotated training data in the UD dataset. As described in Section 5, the decoder's parameters Θ were estimated using real counts from the labeled data.

We compared our model with existing sequence labeling models, including an HMM, a CRF, an LSTM and a neural CRF (NCRF), on all 8 languages. Among these models, the NCRF can be most directly compared to our model, as it is used as the base of our model, but without the decoder (and, as a result, can only be used for supervised learning).

The results, summarized in Table 2, show that our NCRF-AE consistently outperformed all other systems on all 8 languages, including Russian, Indonesian and Croatian, which have considerably less data compared to the other languages. Interestingly, the NCRF consistently came second to our model, which demonstrates the efficacy of the expressivity added to our model by the decoder, together with an appropriate optimization approach. To better understand the performance differences between the models, we performed error analysis using an illustrative example, described in Figure 3.
Models English French German Italian Russian Spanish Indonesian Croatian
HMM 86.28% 91.23% 85.59% 92.03% 79.82% 91.31% 89.40% 86.98%
CRF 89.96% 93.40% 86.83% 94.07% 83.38% 91.47% 88.63% 86.90%
LSTM 90.50% 94.16% 88.40% 94.96% 84.87% 93.17% 89.42% 88.95%
NCRF 91.52% 95.07% 90.27% 96.20% 93.37% 93.34% 92.32% 93.85%
NCRF-AE 92.50% 95.28% 90.50% 96.64% 93.60% 93.86% 93.96% 94.32%

Table 2: Supervised learning accuracy of POS tagging on 8 UD languages using different models

Models English French German Italian Russian Spanish Indonesian Croatian


NCRF (OL) 88.01% 93.38% 90.43% 91.75% 86.63% 91.22% 88.35% 86.11%
NCRF-AE (OL) 88.41% 93.69% 90.75% 92.17% 87.82% 91.70% 89.06% 87.92%
HMM-EM 79.92% 88.15% 77.01% 84.57% 72.96% 86.77% 83.61% 77.20%
NCRF-AE (HEM) 86.79% 92.83% 89.78% 90.68% 86.39% 91.30% 88.86% 86.55%
NCRF-AE 89.43% 93.89% 90.99% 92.85% 88.93% 92.17% 89.41% 89.14%

Table 3: Semi-supervised learning accuracy of POS tagging on 8 UD languages. HEM means hard-EM,
used as a self-training approach, and OL means only 20% of the labeled data is used and no unlabeled
data is used.

Text     Google  is    a    nice  search  engine  .
Gold     PROPN   VERB  DET  ADJ   NOUN    NOUN    PUNCT
NCRF-AE  PROPN   VERB  DET  ADJ   NOUN    NOUN    PUNCT
NCRF     NOUN    VERB  DET  ADJ   NOUN    NOUN    PUNCT
LSTM     PROPN   VERB  DET  ADJ   VERB    NOUN    PUNCT

Figure 3: An example from the test set to compare the predicted results of our NCRF-AE model, the NCRF model and the LSTM model.

In this example, the LSTM incorrectly predicted the POS tag of the word "search" as a verb, instead of a noun (part of the NP "nice search engine"), while correctly predicting the preceding word, "nice", as an adjective. We attribute the error to the LSTM lacking an explicit output transition scoring function, which would penalize the ungrammatical transition between "ADJ" and "VERB". The NCRF, which does score such transitions, correctly predicted that word. However, it incorrectly predicted "Google" as a noun rather than a proper noun. This is a subtle mistake, as the two are grammatically and semantically similar. This mistake appeared consistently in the NCRF results, while the NCRF-AE predictions were correct.

We attribute this success to the superior expressivity of our model: the prediction is made jointly by the encoder and the decoder, as the reconstruction decision is defined over all output sequences, picking the jointly optimal sequence. From another perspective, our NCRF-AE model is a combination of discriminative and generative models; in that sense the decoder can be regarded as a soft constraint that supplements the encoder, acting as a regularizer that checks and balances the choices made by the encoder.

6.3 Semi-supervised Learning

In the semi-supervised settings we compared our model with other semi-supervised structured prediction models. In addition, we studied how varying the amount of unlabeled data changes the performance of our model.

As described in Section 5, the decoder's parameters Θ are initialized from the labeled dataset using real counts and updated during training.

6.3.1 Varying Unlabeled Data Proportion

We first experimented with varying the proportion of unlabeled data, while fixing the amount of labeled data. We conducted these experiments over two languages, English and the low-resource language Croatian. We fixed the proportion of labeled data at 20%, and gradually added more unlabeled data, from 0% to 20% (from full supervision to semi-supervision). The unlabeled data was sampled from the same dataset (without overlapping with the labeled data), with the labels removed. The results are shown in Figure 4.

The leftmost point of both sub-figures is the accuracy of fully supervised learning with 20% of the whole data. As we can observe, the tagging accuracy increased as the proportion of unlabeled data increased.
[Figure 4: two panels, (a) English and (b) Croatian, plotting tagging accuracy against the proportion of unlabeled data (0–20%) for the neural CRF AE and the neural CRF baseline.]

Figure 4: UD English and Croatian POS tagging accuracy versus an increasing proportion of unlabeled sequences, using 20% labeled data. The green straight line is the performance of the neural CRF, trained over the labeled data.

6.3.2 Semi-supervised POS Tagging on Multiple Languages

We compared our NCRF-AE model with other semi-supervised learning models, including the HMM-EM algorithm and the hard-EM version of our NCRF-AE. The hard-EM version of our model can be considered a variant of self-training, as it infers the missing labels using the current model in the E-step, and uses the real counts of these labels to update the model in the M-step. To contextualize the results, we also provide the results of the NCRF model and of the supervised version of our NCRF-AE model trained on 20% of the data. We set the proportion of labeled data to 20% for each language and the proportion of unlabeled data to 50% of the dataset. There was no overlap between the labeled and unlabeled data.

The results are summarized in Table 3. Similar to the supervised experiments, the supervised version of our NCRF-AE, trained over 20% of the labeled data, outperforms the NCRF model. Our model was able to successfully use the unlabeled data, leading to improved performance in all languages, over both the supervised version of our model and the HMM-EM and hard-EM models that were also trained over both the labeled and unlabeled data.

6.3.3 Varying Sizes of Labeled Data on English

Semi-supervised approaches tend to work well when given a small amount of labeled training data, but as the size of the labeled training data increases, we expect diminishing returns. To verify this conjecture, we conducted additional experiments to show how varying sizes of labeled training data affect the effectiveness of our NCRF-AE model. In these experiments, we gradually increased the proportion of labeled data, and correspondingly decreased the proportion of unlabeled data.

Proportion   Labeled Only   Both
10% − 90%    85.31%         85.80%
20% − 80%    88.41%         88.69%
30% − 70%    89.46%         89.85%
40% − 60%    90.06%         90.11%
50% − 50%    91.09%         91.15%

Table 4: Performance of the NCRF-AE for different proportions of labeled and unlabeled data. The leftmost column shows the proportion of labeled (left) and unlabeled (right) data in the experiments. The results of using only the labeled data and of using both labeled and unlabeled data are included.

The results of these experiments are shown in Table 4. As we speculated, we observed diminishing effectiveness when increasing the proportion of labeled data in training.

7 Conclusion

We proposed an end-to-end neural CRF autoencoder (NCRF-AE) model for semi-supervised sequence labeling. Our NCRF-AE is an integration of a discriminative model and a generative model, which extends the generalized autoencoder by using a neural CRF model as its encoder and a generative decoder built on top of it. We suggest a variant of the EM algorithm to learn the parameters of our NCRF-AE model.

We evaluated our model in both supervised and semi-supervised scenarios over multiple languages, and showed that it can outperform other supervised and semi-supervised methods. We also performed additional experiments to show how varying sizes of labeled training data affect the effectiveness of our model.

These results demonstrate the strength of our model, as it was able to utilize the small amount of labeled data and exploit the hidden information in the large amount of unlabeled data, without the additional feature engineering that is often needed to get semi-supervised and weakly-supervised systems to perform well. The superior performance on the low-resource languages also suggests its potential in practical use.
References

Waleed Ammar, Chris Dyer, and Noah A. Smith. 2014. Conditional random field autoencoders for unsupervised structured prediction. In The Conference on Advances in Neural Information Processing Systems (NIPS), pages 1–12.

Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 2442–2452.

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Learning to compose neural networks for question answering. In Proc. of the Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 1545–1554.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. of the International Conference on Learning Representations (ICLR).

Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Yong Cheng, Yang Liu, and Wei Xu. 2017. Maximum reconstruction estimation for generative latent-variable models. In Proc. of the National Conference on Artificial Intelligence (AAAI).

A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38.

Greg Durrett and Dan Klein. 2015. Neural CRF parsing. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 302–312.

Geoffrey E. Hinton and Richard S. Zemel. 1994. Autoencoders, minimum description length and Helmholtz free energy. In The Conference on Advances in Neural Information Processing Systems (NIPS), pages 3–10.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Arzoo Katiyar and Claire Cardie. 2016. Investigating LSTMs for joint extraction of opinion entities and relations. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 919–929, Berlin, Germany.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751.

Tomás Kociský, Gábor Melis, Edward Grefenstette, Chris Dyer, Wang Ling, Phil Blunsom, and Karl Moritz Hermann. 2016. Semantic parsing with semi-supervised sequential autoencoders. In Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1078–1087.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of the International Conference on Machine Learning (ICML), pages 282–289, San Francisco, CA, USA.

Guillaume Lample, Miguel Ballesteros, Kazuya Kawakami, Sandeep Subramanian, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proc. of the Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 1–10.

Chu-Cheng Lin, Waleed Ammar, Chris Dyer, and Lori Levin. 2015. Unsupervised POS induction with word embeddings. In Proc. of the Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 1311–1316.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 1064–1074, Berlin, Germany.

Z. Marinho, A. F. T. Martins, S. B. Cohen, and N. A. Smith. 2016. Semi-supervised learning of sequence models with the method of moments. In Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In Proc. of the Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 152–159, Stroudsburg, PA, USA.

Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, Oscar Täckström, Claudia Bedini, Núria Bertomeu Castelló, and Jungmee Lee. 2013. Universal dependency annotation for multilingual parsing. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL).

G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-Tur, X. He, L. Heck, G. Tur, D. Yu, and G. Zweig. 2015. Using recurrent neural networks for slot filling in spoken language understanding. Transactions on Audio, Speech, and Language Processing, 23(3):530–539.

Yishu Miao and Phil Blunsom. 2016. Language as a latent variable: Discrete generative models for sentence compression. In Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 319–328, Austin, Texas.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In The Conference on Advances in Neural Information Processing Systems (NIPS), pages 1–9.

Nanyun Peng and Mark Dredze. 2016. Improving named entity recognition for Chinese social media with word segmentation representation learning. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 149–155.

Ariadna Quattoni, Sybor Wang, Louis-Philippe Morency, Michael Collins, and Trevor Darrell. 2007. Hidden conditional random fields. IEEE Trans. Pattern Anal. Mach. Intell., 29(10):1848–1852.

Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013. Parsing with compositional vector grammars. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 455–465, Sofia, Bulgaria.

Valentin I. Spitkovsky, Hiyan Alshawi, and Daniel Jurafsky. 2010. From baby steps to leapfrog: How less is more in unsupervised dependency parsing. In Proc. of the Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 751–759.

Karl Stratos and Michael Collins. 2015. Simple semi-supervised POS tagging. In Proc. of the Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL).

Amarnag Subramanya, Slav Petrov, and Fernando Pereira. 2010. Efficient graph-based semi-supervised learning of structured tagging models. In Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Ilya Sutskever, James Martens, and Geoffrey Hinton. 2011. Generating text with recurrent neural networks. In Proc. of the International Conference on Machine Learning (ICML), pages 1017–1024, New York, NY, USA.

Ke M. Tran, Yonatan Bisk, Ashish Vaswani, Daniel Marcu, and Kevin Knight. 2016. Unsupervised neural hidden Markov models. In Proceedings of the Workshop on Structured Prediction for NLP, pages 63–71, Austin, TX.

Ashish Vaswani, Yonatan Bisk, Kenji Sagae, and Ryan Musa. 2016. Supertagging with LSTMs. In Proc. of the Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 232–237.

Lloyd R. Welch. 2003. Hidden Markov models and the Baum-Welch algorithm. IEEE Information Theory Society Newsletter, 53(4):1, 10–13.

Sam Wiseman and Alexander M. Rush. 2016. Sequence-to-sequence learning as beam-search optimization. In Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1296–1306, Austin, Texas.

Matthew D. Zeiler. 2012. ADADELTA: An adaptive learning rate method. CoRR, abs/1212.5701.
