Semi-supervised Text Style Transfer: Cross Projection in Latent Space
Mingyue Shang1*, Piji Li2, Zhenxin Fu1, Lidong Bing4, Dongyan Zhao1,3, Shuming Shi2, Rui Yan1,3†
1 Wangxuan Institute of Computer Technology, Peking University, China
2 Tencent AI Lab, Shenzhen, China
3 Center for Data Science, China
4 R&D Center Singapore, Machine Intelligence Technology, Alibaba DAMO Academy
{shangmy, fuzhenxin, zhaodongyan, ruiyan}@pku.edu.cn
{pijili, shumingshi}@tencent.com, l.bing@alibaba-inc.com
Figure 1: The process of training the sequence-to-sequence model. Take a sentence pair (a, b) for example. After a is encoded into ca by encoder A, the projection function f(·) (shown by the dotted green line) projects it to the latent space of B, where it is fed to decoder B to generate the output b̃. Encoder B gives the latent representation cb of b. One of the constraints is computed from the distance between c̃b and cb (shown by the red lines) and another is the NLL loss between b̃ and b. The dotted blue line refers to the projection function g(·) from the space of B to A.
We design a cross projection strategy and a cycle projection strategy to utilize the parallel data and the nonparallel data, respectively. These two strategies are applied iteratively during training, so our model is compatible with both types of training data. The following subsections give the details of the modules and of the training strategies.
5.1 Denoising Auto-Encoder

To learn the latent space representation of each style, we train an auto-encoder, composed of an encoder and a decoder, for each style of text by reconstructing the input. The encoder takes a sentence as input and maps it to a latent vector representation, and the decoder reconstructs the sentence from that vector. A common pitfall is that the decoder may simply copy every token from the input, so the encoder never learns to extract features. To alleviate this issue, we adopt the denoising auto-encoder (DAE), following previous work (Lample et al., 2019), which reconstructs the original sentence from a corrupted input. Specifically, we randomly shuffle a small portion of the words in the original sentence to produce the corrupted input, which is then fed to the denoising auto-encoder. For example, if the original input is "Listen to your heart and your mind.", the corrupted version could be "Listen your to and heart your mind.". The decoder is required to reconstruct the original sentence.
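As an illustration, a minimal sketch of this corruption step (the function name and the particular shuffling scheme are our assumptions; the paper specifies only that a small portion of words is randomly shuffled):

```python
import random

def corrupt(sentence, portion=0.2, seed=None):
    """Randomly shuffle a small portion of the words in a sentence.

    Sketch of the DAE input corruption: pick roughly `portion` of the
    word positions at random and permute the words occupying them.
    """
    rng = random.Random(seed)
    words = sentence.split()
    if len(words) < 2:
        return sentence
    k = max(2, int(len(words) * portion))         # need >= 2 slots to shuffle
    positions = rng.sample(range(len(words)), k)  # which slots to permute
    shuffled = positions[:]
    rng.shuffle(shuffled)
    corrupted = words[:]
    for src, dst in zip(positions, shuffled):
        corrupted[dst] = words[src]
    return " ".join(corrupted)

# e.g. corrupt("Listen to your heart and your mind.") might yield
# "Listen your to and heart your mind."
```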
The training of a DAE model relies on the corpus of a single style only, so we train each DAE model on the nonparallel corpus of its style. Formally, the encoder and decoder of styles sa and sb are referred to as Enc(a)-Dec(a) and Enc(b)-Dec(b). Given sentences a and b from the two styles, the corresponding encoders convert them into the latent vectors ca and cb, respectively, by which each encoder thus constructs the latent space of its style.

5.2 Cross Projection

To perform the style transfer task, we establish a projection relationship between the latent spaces and cross-link the encoder and the decoder of different styles. Take the transfer from style sa to style sb for example. Given a sentence a that should be transferred to b, we cross-link Enc(a) as the encoder and Dec(b) as the decoder to form the style transfer model. However, since the DAE models of different styles are trained separately, their latent vector spaces usually differ. The latent vector produced by Enc(a) is a representation in the latent space of style sa, while Dec(b) relies on information and features from the latent space of style sb. Therefore, we introduce a projection function that projects latent vectors from the space of sa to that of sb.

Concretely, after obtaining ca from Enc(a), a projection function f(·) is employed to project ca from the latent space of style sa to that of sb, denoted as c̃b = f(ca). The decoder Dec(b) then takes the projected vector c̃b as input and generates a sentence b̃ of style sb based on the prediction probability, denoted as pb(b̃|c̃b).

It is worth noting that only the nonparallel corpus has been exploited for the style transfer up to now. Recall that our framework can employ both the parallel corpus and the nonparallel corpus for model training. With the parallel corpus, we design the cross projection strategy. When the input a is accompanied by a ground-truth b, we can obtain the standard latent representation cb of b from Enc(b). In order to align the latent vectors from the different spaces, we design the constraints to pull c̃b towards cb and to penalize the NLL of b̃ against the ground truth b (Figure 1).
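As a concrete sketch, the cross projection forward pass might look as follows in PyTorch (the module wiring is our assumption, since the paper does not specify the parameterization of f(·); a single linear layer is assumed here, with the 500-dimensional hidden states reported in the implementation details):

```python
import torch
import torch.nn as nn

class CrossProjection(nn.Module):
    """Sketch of transferring style a -> b: Enc(a), f, Dec(b).

    Assumes the pretrained DAE encoder/decoder produce and consume a
    single 500-d latent vector, and that f is linear (an assumption).
    """
    def __init__(self, enc_a, dec_b, dim=500):
        super().__init__()
        self.enc_a = enc_a                # pretrained DAE encoder of style a
        self.dec_b = dec_b                # pretrained DAE decoder of style b
        self.f = nn.Linear(dim, dim)      # projection from space of s_a to s_b

    def forward(self, a_tokens):
        c_a = self.enc_a(a_tokens)        # latent vector in the space of s_a
        c_b_tilde = self.f(c_a)           # c̃_b = f(c_a), now in the space of s_b
        return self.dec_b(c_b_tilde)      # logits for b̃ ~ p_b(b̃ | c̃_b)
```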
With only nonparallel data, the cycle projection strategy is used instead: ca is projected to c̃b by f(·) and then projected back to c̃a by g(·), from which Dec(a) generates the output ã. Although the latent vector c̃b has no reference to train against, the latent vector c̃a and the output ã can be trained by treating ca and a as the references. The loss functions are formulated as:
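(A reconstruction: the L2 distance is our assumption, since the text specifies only a distance between c̃b and cb together with an NLL term.)

```latex
% Parallel pair (a, b): align the projected latent vector with the
% reference latent vector, and supervise the generated sentence with b.
\mathcal{L}_{\mathrm{cross}} = \lVert \tilde{c}_b - c_b \rVert_2^2
    \;-\; \log p_b\left(b \mid \tilde{c}_b\right)

% Nonparallel a: project forward with f and back with g, then treat
% c_a and a themselves as the references.
\mathcal{L}_{\mathrm{cycle}} = \lVert \tilde{c}_a - c_a \rVert_2^2
    \;-\; \log p_a\left(a \mid \tilde{c}_a\right)
```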
to Anc.P
  SLS:    寒月寒风吹得寒冷寒冷。 (The cold wind and the cold moon blew the cold)
  DAR:    寒风吹得很寒冷。 (The cold wind is very cold)
  CPLS:   水边寒冷的寒风吹过好像寒冷的刀刀。 (The cold wind from the water side like a cold knife.)

to F.en
  Source: give them a chance to discover you.
  S2S:    Share them an opportunity to meet you.

Table 2: Case study: the sentences in the brackets are the translations of the corresponding sentences.
Formality Dataset. The formality dataset used in this paper is built on the parallel dataset released by Rao and Tetreault (2018), which contains texts of formal and informal style. From the released data, we randomly sample 5,000 sentence pairs as the parallel corpus with limited data volume. We then use the Yahoo Answers L6 corpus, which is in the same content domain as the parallel data, as the source to construct the large-scale nonparallel data. To divide the nonparallel dataset into the two styles, we train a CNN-based classifier (Kim, 2014) on the style-annotated parallel data and use it to classify the nonparallel data.
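A minimal sketch of such a classifier in PyTorch (the architecture follows Kim (2014); the filter sizes and counts below are illustrative assumptions, not the paper's settings):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Kim (2014)-style CNN sentence classifier used to label the
    nonparallel data with one of the two styles."""
    def __init__(self, vocab_size, emb_dim=300, n_filters=100,
                 filter_sizes=(3, 4, 5), n_classes=2, dropout=0.4):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, k) for k in filter_sizes])
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(n_filters * len(filter_sizes), n_classes)

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        x = self.emb(tokens).transpose(1, 2)        # (batch, emb_dim, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        h = self.dropout(torch.cat(pooled, dim=1))  # (batch, n_filters * |sizes|)
        return self.fc(h)                           # style logits
```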
For the Chinese datasets, we set the maximum length of a sentence as 30. For the formality datasets, we use NLTK (Loper and Bird, 2002) to tokenize the texts, and set the minimum length as 5 and the maximum length as 30 for both the formal and informal styles.

We adopt GloVe (Pennington et al., 2014) to pretrain the embeddings, and the dimension of the embeddings is set to 300 for all the datasets. The hidden states are set to 500 for both the encoders and the decoders. We adopt the SGD optimizer with a learning rate of 1 for the DAE models and 0.1 for the S2S models. The dropout rate is 0.4. In the inference stage, the beam size is set to 5.
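For reference, the reported settings gathered into one place (the key names are ours; the values are those stated above):

```python
# Hyperparameters reported in the implementation details (key names are our own).
CONFIG = {
    "emb_dim": 300,      # GloVe-pretrained embeddings, all datasets
    "hidden_dim": 500,   # encoder and decoder hidden states
    "optimizer": "SGD",
    "lr_dae": 1.0,       # learning rate for the DAE models
    "lr_s2s": 0.1,       # learning rate for the S2S models
    "dropout": 0.4,
    "beam_size": 5,      # inference
    "min_len": 5,        # formality datasets
    "max_len": 30,       # all datasets
}
```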
8 Comparison Models and Evaluations

Table 3: Automatic evaluation results on four style transfer tasks. Acc refers to the style accuracy.
Model  Dataset    Content  Style   Fluency
S2S    to M.zh    0.1875   0.5675  0.3575
       to Anc.P   0.2275   0.5400  0.4425
       to Inf.en  0.3175   0.4600  0.5800
       to F.en    0.3625   0.5125  0.6325
CPLS   to M.zh    0.3100   0.5825  0.3050
       to Anc.P   0.4375   0.6875  0.5475
       to Inf.en  0.4675   0.4625  0.5725
       to F.en    0.4600   0.5675  0.6200

Table 4: The human annotation results of the S2S model and the CPLS model from three aspects.

We adopt BLEU (Papineni et al., 2002) and the style accuracy as the automatic evaluation metrics to measure the content preservation degree and the style changing degree. BLEU calculates the N-gram overlap between the generated sentence and the references, and thus can be used to measure the preservation of the text content. Considering that text style transfer is a monolingual text generation task, we also use GLEU, a generalization of BLEU proposed by Napoles et al. (2015). To evaluate the extent to which the sentences are transferred to the target style, we follow Shen et al. (2017) and Hu et al. (2017) in building a CNN-based style classifier and using it to measure the style accuracy.
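A sketch of two of these metrics (NLTK's corpus_bleu is a real API; the style-accuracy helper and its classifier interface are our assumptions; GLEU would come from Napoles et al.'s released implementation and is not sketched here):

```python
from nltk.translate.bleu_score import corpus_bleu

def bleu(list_of_references, hypotheses):
    # list_of_references[i] is a list of tokenized references for sentence i;
    # hypotheses[i] is the tokenized generated sentence.
    return corpus_bleu(list_of_references, hypotheses)

def style_accuracy(classifier, sentences, target_style):
    # Fraction of generated sentences that the pretrained style classifier
    # assigns to the target style. `classifier.predict` is an assumed
    # interface, not a real library call.
    predictions = [classifier.predict(s) for s in sentences]
    return sum(p == target_style for p in predictions) / len(predictions)
```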
8.3 Human Evaluation

We also adopt human evaluation to judge the quality of the transferred sentences from three aspects, namely content, style, and fluency. These aspects evaluate how well the transferred text preserves the content of the input, the style strength, and the fluency of the transferred text. Take the content relevance for example; the criterion is as follows: +2: the transferred sentence has the same meaning as the input sentence; +1: the transferred sentence preserves part of the content meaning of the input sentence; 0: the transferred sentence and the input sentence are irrelevant in content. The criteria for style strength and fluency are similar to the content relevance criterion.
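For concreteness, a tiny sketch of turning the per-annotation scores into the averages reported in Table 4 (the data layout and the normalization by the maximum score of 2 are our assumptions):

```python
from statistics import mean

def average_scores(annotations):
    """annotations: list of dicts like {"content": 2, "style": 1, "fluency": 2},
    one per (volunteer, test case). Returns per-aspect means, normalized to
    [0, 1] by the maximum score of 2 (an assumption to match Table 4)."""
    aspects = ("content", "style", "fluency")
    return {a: mean(x[a] for x in annotations) / 2 for a in aspects}
```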
To get the evaluation results, we first randomly sample 50 test cases from each dataset. As the style transfer is bilateral in this paper, there are 400 test cases in all. We invited four well-educated volunteers to score the results from the supervised baseline and the three semi-supervised models.

9 Results and Analysis

Evaluation Results. Table 3 presents the evaluation results of the automatic metrics on the models. It can be seen that the BLEU and GLEU scores of the semi-supervised models are better than those of the baseline S2S model on almost all the datasets. This result indicates that the models benefit from the nonparallel data in terms of content preservation. One interesting observation is that the overall BLEU scores on the ancient poem and modern Chinese datasets are lower than on the other datasets. This may be explained by the fact that the edit distance between formal and informal texts is smaller than that between ancient poems and modern Chinese texts. Therefore, it is more challenging for a model to preserve the content meaning when transferring between ancient poems and modern Chinese text. Among the three semi-supervised models, the CPLS model achieves the greatest improvement, verifying the effectiveness of the projection functions. However, the gain of the CPLS model in terms of style accuracy is not as significant. A possible explanation is the bias of the style classifier. Take the transfer task from ancient poems to modern Chinese text for example. We observe that the classifier tends to classify short sentences as ancient poems, since length is an obvious feature. We analysed the sentences generated by the S2S model and by the CPLS model, and the statistics show that the average length of the text generated by the S2S model is shorter, which may bias the style classifier. Therefore, we also adopt human evaluation to alleviate this issue.

Table 4 compares the human evaluation results of the S2S model and the CPLS model on all the datasets, calculated as the average score of the human annotations. As shown in Table 4, the CPLS model outperforms the S2S model in
the aspects of content preservation and style strength, and is on par in terms of fluency.

Case Study. Due to limited space, we present in Table 2 the generated results of the S2S baseline and the three semi-supervised models on two style transfer tasks. Compared with the Chinese literary style datasets, the formality datasets are less challenging, as discussed before; thus it can be seen from the table that the generated sentences on the formality datasets are more fluent. For the task of transferring from modern Chinese text to ancient poems, the S2S model generates a shorter sentence, while the CPLS model generates the sentence that preserves the most content information.

10 Conclusion

In this paper, we design a differentiable semi-supervised model that introduces a projection function between the latent spaces of different styles. The model can flexibly alternate its training mode between supervised and unsupervised learning. We also introduce two other semi-supervised methods that are simple but effective in using the nonparallel and parallel data. We evaluate our models on two datasets that have small-scale parallel data and large-scale nonparallel data, and verify the effectiveness of the model with both automatic metrics and human evaluation.

11 Acknowledgement

We would like to thank the anonymous reviewers for their constructive comments. This work was supported by the National Key Research and Development Program of China (No. 2017YFC0804001) and the National Science Foundation of China (NSFC No. 61876196 and No. 61672058). Rui Yan was sponsored by the Tencent Collaborative Research Fund.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Cheng-Kuan Chen, Zhufeng Pan, Ming-Yu Liu, and Min Sun. 2019. Unsupervised stylish image description generation via domain layer norm. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8151–8158.

Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 93–98.

Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan. 2018. Style transfer in text: Exploration and evaluation. In Thirty-Second AAAI Conference on Artificial Intelligence.

Yifan Gao, Lidong Bing, Wang Chen, Michael R. Lyu, and Irwin King. 2019. Difficulty controllable generation of reading comprehension questions. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. 2016. Dual learning for machine translation. In Advances in Neural Information Processing Systems, pages 820–828.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1587–1596. JMLR.org.

Zhijing Jin, Di Jin, Jonas Mueller, Nicholas Matthews, and Enrico Santus. 2019. Unsupervised text style transfer via iterative matching and translation. arXiv preprint arXiv:1901.11333.

Vineet John, Lili Mou, Hareesh Bahuleyan, and Olga Vechtomova. 2018. Disentangled representation learning for text style transfer. arXiv preprint arXiv:1808.04339.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018. Phrase-based & neural unsupervised machine translation. arXiv preprint arXiv:1804.07755.

Guillaume Lample, Sandeep Subramanian, Eric Smith, Ludovic Denoyer, Marc'Aurelio Ranzato, and Y-Lan Boureau. 2019. Multiple-attribute text rewriting. In International Conference on Learning Representations.
Yi Liao, Lidong Bing, Piji Li, Shuming Shi, Wai Lam, and Tong Zhang. 2018. QuaSE: Sequence editing under quantifiable guidance. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3855–3864.

Lajanugen Logeswaran, Honglak Lee, and Samy Bengio. 2018. Content preserving text generation with attribute controls. In Advances in Neural Information Processing Systems, pages 5103–5113.

Edward Loper and Steven Bird. 2002. NLTK: The natural language toolkit. arXiv preprint cs/0205028.

Jonas Mueller, David Gifford, and Tommi Jaakkola. 2017. Sequence to better sequence: Continuous revision of combinatorial structures. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 2536–2544. JMLR.org.

Courtney Napoles, Keisuke Sakaguchi, Matt Post, and Joel Tetreault. 2015. Ground truth for grammatical error correction metrics. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 588–593.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Shrimai Prabhumoye, Yulia Tsvetkov, Ruslan Salakhutdinov, and Alan W. Black. 2018. Style transfer through back-translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 866–876.

Sudha Rao and Joel Tetreault. 2018. Dear sir or madam, may I introduce the GYAFC dataset: Corpus, benchmarks and metrics for formality style transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 129–140.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709.

Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems, pages 6830–6841.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.

Yi Zhang, Jingjing Xu, Pengcheng Yang, and Xu Sun. 2018a. Learning sentiment memories for sentiment modification without parallel data. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1103–1108.

Zhirui Zhang, Shuo Ren, Shujie Liu, Jianyong Wang, Peng Chen, Mu Li, Ming Zhou, and Enhong Chen. 2018b. Style transfer as unsupervised machine translation. arXiv preprint arXiv:1808.07894.

Yanpeng Zhao, Wei Bi, Deng Cai, Xiaojiang Liu, Kewei Tu, and Shuming Shi. 2018. Language style transfer from sentences with arbitrary unknown styles. arXiv preprint arXiv:1808.04071.