Professional Documents
Culture Documents
Adversarial Multi-Task Learning For Text Classification: A B A B
Adversarial Multi-Task Learning For Text Classification: A B A B
Abstract
A B A B
Neural network models have shown their
arXiv:1704.05742v1 [cs.CL] 19 Apr 2017
Figure 3: Adversarial shared-private model. Yel- where U ∈ Rd×d is a learnable parameter and b ∈
low and gray boxes represent shared and private Rd is a bias.
LSTM layers respectively.
Adversarial Loss Different with most existing
multi-task learning algorithm, we add an extra task
4.1 Adversarial Network adversarial loss LAdv to prevent task-specific fea-
Adversarial networks have recently surfaced and ture from creeping in to shared space. The task
are first used for generative model (Goodfellow adversarial loss is used to train a model to pro-
et al., 2014). The goal is to learn a generative dis- duce shared features such that a classifier cannot
tribution pG (x) that matches the real data distri- reliably predict the task based on these features.
bution Pdata (x) Specifically, GAN learns a gen- The original loss of adversarial network is limited
erative network G and discriminative model D, since it can only be used in binary situation. To
in which G generates samples from the genera- overcome this, we extend it to multi-class form,
tor distribution pG (x). and D learns to determine which allow our model can be trained together
whether a sample is from pG (x) or Pdata (x). This with multiple tasks:
min-max game can be optimized by the following Nk
K X
X
!
risk: LAdv = min λmax( dki k
log[D(E(x ))]) (13)
θs θD
k=1 i=1
φ = min max Ex∼Pdata [log D(x)] where dki denotes the ground-truth label indicating
G D
the type of the current task. Here, there is a min-
+ Ez∼p(z) [log(1 − D(G(z)))] (11)
max optimization and the basic idea is that, given
a sentence, the shared LSTM generates a repre-
While originally proposed for generating random
sentation to mislead the task discriminator. At the
samples, adversarial network can be used as a gen-
same time, the discriminator tries its best to make
eral tool to measure equivalence between distri-
a correct classification on the type of task. After
butions (Taigman et al., 2016). Formally, (Ajakan
the training phase, the shared feature extractor and
et al., 2014) linked the adversarial loss to the
task discriminator reach a point at which both can-
H-divergence between two distributions and suc-
not improve and the discriminator is unable to dif-
cessfully achieve unsupervised domain adaptation
ferentiate among all the tasks.
with adversarial network. Motivated by theory on
domain adaptation (Ben-David et al., 2010, 2007; Semi-supervised Learning Multi-task Learning
Bousmalis et al., 2016) that a transferable feature We notice that the LAdv requires only the input
is one for which an algorithm cannot learn to iden- sentence x and does not require the correspond-
tify the domain of origin of the input observation. ing label y, which makes it possible to combine
our model with semi-supervised learning. Finally,
4.2 Task Adversarial Loss for MTL in this semi-supervised multi-task learning frame-
Inspired by adversarial networks (Goodfellow work, our model can not only utilize the data from
et al., 2014), we proposed an adversarial shared- related tasks, but can employ abundant unlabeled
private model for multi-task learning, in which a corpora.
shared recurrent neural layer is working adversar-
ially towards a learnable multi-layer perceptron, 4.3 Orthogonality Constraints
preventing it from making an accurate prediction We notice that there is a potential drawback of the
about the types of tasks. This adversarial training above model. That is, the task-invariant features
encourages shared space to be more pure and en- can appear both in shared space and private space.
sure the shared representation not be contaminated Motivated by recently work(Jia et al., 2010;
by task-specific features. Salzmann et al., 2010; Bousmalis et al., 2016)
Dataset Train Dev. Test Unlab. Avg. L Vocab. 5 Experiment
Books 1400 200 400 2000 159 62K
Elec. 1398 200 400 2000 101 30K 5.1 Dataset
DVD 1400 200 400 2000 173 69K
Kitchen 1400 200 400 2000 89 28K To make an extensive evaluation, we collect 16
Apparel 1400 200 400 2000 57 21K different datasets from several popular review cor-
Camera 1397 200 400 2000 130 26K pora.
Health 1400 200 400 2000 81 26K
Music 1400 200 400 2000 136 60K The first 14 datasets are product reviews, which
Toys 1400 200 400 2000 90 28K contain Amazon product reviews from different
Video 1400 200 400 2000 156 57K
Baby 1300 200 400 2000 104 26K
domains, such as Books, DVDs, Electronics, ect.
Mag. 1370 200 400 2000 117 30K The goal is to classify a product review as either
Soft. 1315 200 400 475 129 26K positive or negative. These datasets are collected
Sports 1400 200 400 2000 94 30K
IMDB 1400 200 400 2000 269 44K based on the raw data 1 provided by (Blitzer et al.,
MR 1400 200 400 2000 21 12K 2007). Specifically, we extract the sentences and
corresponding labels from the unprocessed orig-
Table 1: Statistics of the 16 datasets. The columns inal data 2 . The only preprocessing operation of
2-5 denote the number of samples in training, de- these sentences is tokenized using the Stanford to-
velopment, test and unlabeled sets. The last two kenizer 3 .
columns represent the average length and vocabu- The remaining two datasets are about movie re-
lary size of corresponding dataset. views. The IMDB dataset4 consists of movie re-
views with binary classes (Maas et al., 2011). One
key aspect of this dataset is that each movie review
on shared-private latent space analysis, we intro- has several sentences. The MR dataset also con-
duce orthogonality constraints, which penalize re- sists of movie reviews from rotten tomato website
dundant latent representations and encourages the with two classes 5 (Pang and Lee, 2005).
shared and private extractors to encode different All the datasets in each task are partitioned ran-
aspects of the inputs. domly into training set, development set and test-
After exploring many optional methods, we find ing set with the proportion of 70%, 20% and 10%
below loss is optimal, which is used by Bousmalis respectively. The detailed statistics about all the
et al. (2016) and achieve a better performance: datasets are listed in Table 1.
Table 2: Error rates of our models on 16 datasets against typical baselines. The numbers in brackets
represent the improvements relative to the average performance (Avg.) of three single task baselines.
• MT-DNN: The model is proposed by Liu MTL achieves 4.1% average improvement sur-
et al. (2015b) with bag-of-words input and passing SP-MTL with 1.0%, which indicates the
multi-layer perceptrons, in which a hidden importance of adversarial learning. It is notewor-
layer is shared. thy that for FS-MTL, the performances of some
tasks are degraded, since this model puts all pri-
5.3 Hyperparameters vate and shared information into a unified space.
The word embeddings for all of the models are ini-
tialized with the 200d GloVe vectors ((Pennington 5.5 Shared Knowledge Transfer
et al., 2014)). The other parameters are initialized With the help of adversarial learning, the shared
by randomly sampling from uniform distribution feature extractor Es can generate more pure task-
in [−0.1, 0.1]. The mini-batch size is set to 16. invariant representations, which can be considered
For each task, we take the hyperparameters as off-the-shelf knowledge and then be used for
which achieve the best performance on the devel- unseen new tasks.
opment set via an small grid search over com- To test the transferability of our learned shared
binations of the initial learning rate [0.1, 0.01], extractor, we also design an experiment, in which
λ ∈ [0.01, 0.1], and γ ∈ [0.01, 0.1]. Finally, we we take turns choosing 15 tasks to train our model
chose the learning rate as 0.01, λ as 0.05 and γ as MS with multi-task learning, then the learned
0.01. shared layer are transferred to a second network
MT that is used for the remaining one task. The
5.4 Performance Evaluation parameters of transferred layer are kept frozen,
Table 2 shows the error rates on 16 text clas- and the rest of parameters of the network MT are
sification tasks. The column of “Single Task” randomly initialized.
shows the results of vanilla LSTM, bidirectional More formally, we investigate two mechanisms
LSTM (BiLSTM), stacked LSTM (sLSTM) and towards the transferred shared extractor. As shown
the average error rates of previous three models. in Figure 4. The first one Single Channel (SC)
The column of “Multiple Tasks” shows the re- model consists of one shared feature extractor Es
sults achieved by corresponding multi-task mod- from MS , then the extracted representation will
els. From this table, we can see that the perfor- be sent to an output layer. By contrast, the Bi-
mance of most tasks can be improved with a large Channel (BC) model introduces an extra LSTM
margin with the help of multi-task learning, in layer to encode more task-specific information. To
which our model achieves the lowest error rates. evaluate the effectiveness of our introduced adver-
More concretely, compared with SP-MTL, ASP- sarial training framework, we also make a compar-
Single Task Transfer Models
Source Tasks
LSTM BiLSTM sLSTM Avg. SP-MTL-SC SP-MTL-BC ASP-MTL-SC ASP-MTL-BC
φ (Books) 20.5 19.0 18.0 19.2 17.8(−1.4) 16.3(−2.9) 16.8(−2.4) 16.3(−2.9)
φ (Electronics) 19.5 21.5 23.3 21.4 15.3(−6.1) 14.8(−6.6) 17.8(−3.6) 16.8(−4.6)
φ (DVD) 18.3 19.5 22.0 19.9 14.8(−5.1) 15.5(−4.4) 14.5(−5.4) 14.3(−5.6)
φ (Kitchen) 22.0 18.8 19.5 20.1 15.0(−5.1) 16.3(−3.8) 16.3(−3.8) 15.0(−5.1)
φ (Apparel) 16.8 14.0 16.3 15.7 14.8(−0.9) 12.0(−3.7) 12.5(−3.2) 13.8(−1.9)
φ (Camera) 14.8 14.0 15.0 14.6 13.3(−1.3) 12.5(−2.1) 11.8(−2.8) 10.3(−4.3)
φ (Health) 15.5 21.3 16.5 17.8 14.5(−3.3) 14.3(−3.5) 12.3(−5.5) 13.5(−4.3)
φ (Music) 23.3 22.8 23.0 23.0 20.0(−3.0) 17.8(−5.2) 17.5(−5.5) 18.3(−4.7)
φ (Toys) 16.8 15.3 16.8 16.3 13.8(−2.5) 12.5(−3.8) 13.0(−3.3) 11.8(−4.5)
φ (Video) 18.5 16.3 16.3 17.0 14.3(−2.7) 15.0(−2.0) 14.8(−2.2) 14.8(−2.2)
φ (Baby) 15.3 16.5 15.8 15.9 16.5(+0.6) 16.8(+0.9) 13.5(−2.4) 12.0(−3.9)
φ (Magazines) 10.8 8.5 12.3 10.5 10.5(+0.0) 10.3(−0.2) 8.8(−1.7) 9.5(−1.0)
φ (Software) 15.3 14.3 14.5 14.7 13.0(−1.7) 12.8(−1.9) 14.5(−0.2) 11.8(−2.9)
φ (Sports) 18.3 16.0 17.5 17.3 16.3(−1.0) 16.3(−1.0) 13.3(−4.0) 13.5(−3.8)
φ (IMDB) 18.3 15.0 18.5 17.3 12.8(−4.5) 12.8(−4.5) 12.5(−4.8) 13.3(−4.0)
φ (MR) 27.3 25.3 28.0 26.9 26.0(−0.9) 26.5(−0.4) 24.8(−2.1) 23.5(−3.4)
AVG 18.2 17.4 18.3 18.0 15.6(−2.4) 15.2(−2.8) 14.7(−3.3) 14.3(−3.7)
Table 3: Error rates of our models on 16 datasets against vanilla multi-task learning. φ (Books) means
that we transfer the knowledge of the other 15 tasks to the target task Books.
Figure 5: (a) The change of the predicted sentiment score at different time steps. Y-axis represents the
sentiment score, while X-axis represents the input words in chronological order. The darker grey horizon-
tal line gives a border between the positive and negative sentiments. (b) The purple heat map describes the
behaviour of neuron hs18 from shared layer of SP-MTL, while the blue one is used to show the behaviour
of neuron hs21 , which belongs to the shared layer of our model.
Model Shared Layer Task-Movie Task-Baby for text classification. Liu et al. (2016b) proposes
good, great a generic multi-task framework, in which different
good, great, love, bad,
bad, love, tasks can share information by an external mem-
well-directed, cute, safety,
SP-MTL simple, cut,
slow, cheap,
pointless, cut, mild, broken ory and communicate by a reading/writing mech-
cheap, infantile simple anism. These work has potential limitation of just
infantile
good, great, well-directed, cute, safety, learning a shared space solely on sharing param-
ASP-MTL love, bad pointless, cut, mild, broken eters, while our model introduce two strategies to
poor cheap, infantile simple learn the clear and non-redundant shared-private
space.
Table 4: Typical patterns captured by shared layer
Another thread of work is adversarial network.
and task-specific layer of SP-MTL and ASP-MTL
Adversarial networks have recently surfaced as a
models on Movie and Baby tasks.
general tool measure equivalence between distri-
butions and it has proven to be effective in a va-
neurons from shared layer and task-specific layer riety of tasks. Ajakan et al. (2014); Bousmalis
in Table 4, and we have observed that: 1) for et al. (2016) applied adverarial training to domain
SP-MTL, if some patterns are captured by task- adaptation, aiming at transferring the knowledge
specific layer, they are likely to be placed into of one source domain to target domain. Park and
shared space. Clearly, suppose we have many tasks Im (2016) proposed a novel approach for multi-
to be trained jointly, the shared layer bear much modal representation learning which uses adver-
pressure and must sacrifice substantial amount sarial back-propagation concept.
of capacity to capture the patterns they actu- Different from these models, our model aims to
ally do not need. Furthermore, some typical task- find task-invariant sharable information for mul-
invariant features also go into task-specific layer. tiple related tasks using adversarial training strat-
2) for ASP-MTL, we find the features captured by egy. Moreover, we extend binary adversarial train-
shared and task-specific layer have a small amount ing to multi-class, which enable multiple tasks to
of intersection, which allows these two kinds of be jointly trained.
layers can work effectively.
7 Conclusion
6 Related Work
In this paper, we have proposed an adversarial
There are two threads of related work. One thread multi-task learning framework, in which the task-
is multi-task learning with neural network. Neu- specific and task-invariant features are learned
ral networks based multi-task learning has been non-redundantly, therefore capturing the shared-
proven effective in many NLP problems (Col- private separation of different tasks. We have
lobert and Weston, 2008; Glorot et al., 2011). demonstrated the effectiveness of our approach by
Liu et al. (2016c) first utilizes different LSTM applying our model to 16 different text classifica-
layers to construct multi-task learning framwork tion tasks. We also perform extensive qualitative
analysis, deriving insights and indirectly explain- Yaroslav Ganin and Victor Lempitsky. 2015. Unsu-
ing the quantitative improvements in the overall pervised domain adaptation by backpropagation. In
Proceedings of the 32nd International Conference
performance.
on Machine Learning (ICML-15). pages 1180–1189.
Acknowledgments Xavier Glorot, Antoine Bordes, and Yoshua Bengio.
2011. Domain adaptation for large-scale sentiment
We would like to thank the anonymous review- classification: A deep learning approach. In Pro-
ers for their valuable comments and thank Kaiyu ceedings of the 28th International Conference on
Qian, Gang Niu for useful discussions. This work Machine Learning (ICML-11). pages 513–520.
was partially funded by National Natural Sci-
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza,
ence Foundation of China (No. 61532011 and
Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron
61672162), the National High Technology Re- Courville, and Yoshua Bengio. 2014. Generative ad-
search and Development Program of China (No. versarial nets. In Advances in Neural Information
2015AA015408), Shanghai Municipal Science Processing Systems. pages 2672–2680.
and Technology Commission (No. 16JC1420401).
Alex Graves. 2013. Generating sequences with
recurrent neural networks. arXiv preprint
arXiv:1308.0850 .
References
Hana Ajakan, Pascal Germain, Hugo Larochelle, Sepp Hochreiter and Jürgen Schmidhuber. 1997.
François Laviolette, and Mario Marchand. 2014. Long short-term memory. Neural computation
Domain-adversarial neural networks. arXiv preprint 9(8):1735–1780.
arXiv:1412.4446 .
Yangqing Jia, Mathieu Salzmann, and Trevor Darrell.
Shai Ben-David, John Blitzer, Koby Crammer, Alex 2010. Factorized latent spaces with structured spar-
Kulesza, Fernando Pereira, and Jennifer Wortman sity. In Advances in Neural Information Processing
Vaughan. 2010. A theory of learning from different Systems. pages 982–990.
domains. Machine learning 79(1-2):151–175.
Rafal Jozefowicz, Wojciech Zaremba, and Ilya
Shai Ben-David, John Blitzer, Koby Crammer, Fer- Sutskever. 2015. An empirical exploration of recur-
nando Pereira, et al. 2007. Analysis of represen- rent network architectures. In Proceedings of The
tations for domain adaptation. Advances in neural 32nd International Conference on Machine Learn-
information processing systems 19:137. ing.
John Blitzer, Mark Dredze, Fernando Pereira, et al. Nal Kalchbrenner, Edward Grefenstette, and Phil Blun-
2007. Biographies, bollywood, boom-boxes and som. 2014. A convolutional neural network for
blenders: Domain adaptation for sentiment classifi- modelling sentences. In Proceedings of ACL.
cation. In ACL. volume 7, pages 440–447.
Zhouhan Lin, Minwei Feng, Cicero Nogueira dos San-
Konstantinos Bousmalis, George Trigeorgis, Nathan tos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua
Silberman, Dilip Krishnan, and Dumitru Erhan. Bengio. 2017. A structured self-attentive sentence
2016. Domain separation networks. In Advances in embedding. arXiv preprint arXiv:1703.03130 .
Neural Information Processing Systems. pages 343–
351. Pengfe Liu, Xipeng Qiu, Jifan Chen, and Xuanjing
Huang. 2016a. Deep fusion LSTMs for text seman-
Junyoung Chung, Caglar Gulcehre, KyungHyun Cho,
tic matching. In Proceedings of ACL.
and Yoshua Bengio. 2014. Empirical evaluation of
gated recurrent neural networks on sequence model-
ing. arXiv preprint arXiv:1412.3555 . PengFei Liu, Xipeng Qiu, Xinchi Chen, Shiyu Wu,
and Xuanjing Huang. 2015a. Multi-timescale long
Ronan Collobert and Jason Weston. 2008. A unified short-term memory neural network for modelling
architecture for natural language processing: Deep sentences and documents. In Proceedings of the
neural networks with multitask learning. In Pro- Conference on EMNLP.
ceedings of ICML.
Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016b.
Ronan Collobert, Jason Weston, Léon Bottou, Michael Deep multi-task learning with shared memory. In
Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Proceedings of EMNLP.
2011. Natural language processing (almost) from
scratch. The JMLR 12:2493–2537. PengFei Liu, Xipeng Qiu, and Xuanjing Huang. 2016c.
Recurrent neural network for text classification with
Jeffrey L Elman. 1990. Finding structure in time. Cog- multi-task learning. In Proceedings of International
nitive science 14(2):179–211. Joint Conference on Artificial Intelligence.
Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng,
Kevin Duh, and Ye-Yi Wang. 2015b. Representation
learning using multi-task deep neural networks for
semantic classification and information retrieval. In
NAACL.
Minh-Thang Luong, Quoc V Le, Ilya Sutskever, Oriol
Vinyals, and Lukasz Kaiser. 2015. Multi-task
sequence to sequence learning. arXiv preprint
arXiv:1511.06114 .
Andrew L Maas, Raymond E Daly, Peter T Pham, Dan
Huang, Andrew Y Ng, and Christopher Potts. 2011.
Learning word vectors for sentiment analysis. In
Proceedings of the ACL. pages 142–150.
Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and
Martial Hebert. 2016. Cross-stitch networks for
multi-task learning. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recog-
nition. pages 3994–4003.
Bo Pang and Lillian Lee. 2005. Seeing stars: Exploit-
ing class relationships for sentiment categorization
with respect to rating scales. In Proceedings of
the 43rd annual meeting on association for compu-
tational linguistics. Association for Computational
Linguistics, pages 115–124.
Gwangbeen Park and Woobin Im. 2016. Image-text
multi-modal representation learning by adversarial
backpropagation. arXiv preprint arXiv:1612.08354
.
Jeffrey Pennington, Richard Socher, and Christopher D
Manning. 2014. Glove: Global vectors for word rep-
resentation. Proceedings of the EMNLP 12:1532–
1543.
Mathieu Salzmann, Carl Henrik Ek, Raquel Urtasun,
and Trevor Darrell. 2010. Factorized orthogonal la-
tent spaces. In AISTATS. pages 701–708.
Richard Socher, Alex Perelygin, Jean Y Wu, Jason
Chuang, Christopher D Manning, Andrew Y Ng,
and Christopher Potts. 2013. Recursive deep mod-
els for semantic compositionality over a sentiment
treebank. In Proceedings of EMNLP.
Ilya Sutskever, Oriol Vinyals, and Quoc VV Le. 2014.
Sequence to sequence learning with neural net-
works. In Advances in NIPS. pages 3104–3112.
Yaniv Taigman, Adam Polyak, and Lior Wolf.
2016. Unsupervised cross-domain image genera-
tion. arXiv preprint arXiv:1611.02200 .
Zhanpeng Zhang, Ping Luo, Chen Change Loy, and
Xiaoou Tang. 2014. Facial landmark detection by
deep multi-task learning. In European Conference
on Computer Vision. Springer, pages 94–108.