
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2756658, IEEE Transactions on Knowledge and Data Engineering
IEEE Transactions on Knowledge and Data Engineering (Volume: 30, Issue: 1, Jan. 1 2018)
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. ?, NO. ?, AUGUST 2017

Weakly-supervised Deep Embedding for Product Review Sentiment Analysis
Wei Zhao, Ziyu Guan∗, Long Chen, Xiaofei He, Fellow, IAPR, Deng Cai, Beidou Wang and Quan Wang

Abstract—Product reviews are valuable for prospective buyers in helping them make decisions. To this end, different opinion mining techniques have been proposed, where judging a review sentence’s orientation (e.g. positive or negative) is one of the key challenges. Recently, deep learning has emerged as an effective means for solving sentiment classification problems. A neural network intrinsically learns a useful representation automatically, without human effort. However, the success of deep learning relies heavily on the availability of large-scale training data. We propose a novel deep learning framework for product review sentiment classification which employs prevalently available ratings as weak supervision signals. The framework consists of two steps: (1) learning a high level representation (an embedding space) which captures the general sentiment distribution of sentences through rating information; (2) adding a classification layer on top of the embedding layer and using labeled sentences for supervised fine-tuning. We explore two kinds of low level network structure for modeling review sentences, namely, convolutional feature extractors and long short-term memory. To evaluate the proposed framework, we construct a dataset containing 1.1M weakly labeled review sentences and 11,754 labeled review sentences from Amazon. Experimental results show the efficacy of the proposed framework and its superiority over baselines.

Index Terms—Deep learning, opinion mining, sentiment classification, weak-supervision.

1 INTRODUCTION

WITH the booming of e-commerce, people are getting used to consuming online and writing comments about their purchase experiences on merchant/review Websites. These opinionated contents are valuable resources both to future customers for decision-making and to merchants for improving their products and/or service. However, as the volume of reviews grows rapidly, people have to face a severe information overload problem. To alleviate this problem, many opinion mining techniques have been proposed, e.g. opinion summarization [8], [19], opinion polling [54], and comparative analysis [26]. The key challenge is how to accurately predict the sentiment orientation of review sentences.

Popular sentiment classification methods generally fall into two categories: (1) lexicon-based methods and (2) machine learning methods. Lexicon-based methods [8], [19], [47] typically first construct a sentiment lexicon of opinion words (e.g. “wonderful”, “disgusting”), and then design classification rules based on the opinion words that appear and on prior syntactic knowledge. Although effective, methods of this kind require substantial effort in lexicon construction and rule design. Furthermore, lexicon-based methods cannot handle implicit opinions well, i.e. objective statements such as “I bought the mattress a week ago, and a valley appeared today”. As pointed out in [12], this is also an important form of opinion. Factual information is usually more helpful than subjective feelings. Lexicon-based methods can only deal with implicit opinions in an ad-hoc way [52].

The first machine learning based sentiment classification work [35] applied popular machine learning algorithms such as Naive Bayes to the problem. After that, most research in this direction revolved around feature engineering for better classification performance. Different kinds of features have been explored, e.g. n-grams [6], Part-of-speech (POS) information and syntactic relations [32], etc. Feature engineering also costs a lot of human effort, and a feature set suitable for one domain may not generate good performance for other domains [34].

In recent years, deep learning has emerged as an effective means for solving sentiment classification problems [14], [21], [40], [41], [44], [45]. A deep neural network intrinsically learns a high level representation of the data [2], thus avoiding laborious work such as feature engineering. A second advantage is that deep models have exponentially stronger expressive power than shallow models. However, the success of deep learning heavily relies on the availability of large-scale training data [1], [2]. Labeling a large number of sentences is very laborious.

* Corresponding author
• W. Zhao and Q. Wang are with the School of Computer Science and Technology, Xidian University, Xi’an, CN 710127. E-mail: ywzhao@mail.xidian.edu.cn; qwang@xidian.edu.cn
• Z. Guan and L. Chen are with the School of Information and Technology, Northwest University of China, Xi’an, CN 710127. E-mail: ziyuguan@nwu.edu.cn; longchen@stumail.nwu.edu.cn
• X. He and D. Cai are with the State Key Lab of CAD&CG, College of Computer Science, Zhejiang University, Hangzhou, CN 310027. E-mail: {xiaofeihe, dengcai}@cad.zju.edu.cn
• B. Wang is with the School of Computing Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada. E-mail: beidouw@sfu.ca
Manuscript received April 19, 2005; revised August 26, 2015.

1041-4347 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Fortunately, most merchant/review Websites allow customers to summarize their opinions by an overall rating score (typically in 5-stars scale). Ratings reflect the overall sentiment of customer reviews and have already been exploited for sentiment analysis [27], [37]. Nevertheless, review ratings are not reliable labels for the constituent sentences, e.g. a 5-stars review can contain negative sentences and we may also see positive words occasionally in 1-star reviews. An example is shown in Figure 1. Therefore, treating binarized ratings as sentiment labels could confuse a sentiment classifier for review sentences.

Fig. 1. A 5-stars review with negative words.

Despite the promising performance of deep learning on sentiment classification, no previous work tried to leverage the prevalently available ratings for training deep models. In this work, we propose a novel deep learning framework for review sentence sentiment classification. The framework treats review ratings as weak labels to train deep neural networks. For example, with 5-stars scale we can deem ratings above/below 3-stars as positive/negative weak labels respectively. The framework generally consists of two steps. In the first step, rather than predicting sentiment labels directly, we try to learn an embedding space (a high level layer in the neural network) which reflects the general sentiment distribution of sentences, from a large number of weakly labeled sentences. That is, we force sentences with the same weak labels to be near each other, while sentences with different weak labels are kept away from one another. To reduce the impact of sentences with rating-inconsistent orientation (hereafter called wrong-labeled sentences), we propose to penalize the relative distances among sentences in the embedding space through a ranking loss. In the second step, a classification layer is added on top of the embedding layer, and we use labeled sentences to fine-tune the deep network. The framework is dubbed Weakly-supervised Deep Embedding (WDE). Regarding network structure, two popular schemes are adopted to learn to extract fixed-length feature vectors from review sentences, namely, convolutional feature extractors [21] and Long Short-Term Memory (LSTM) [16], [18]. With a slight abuse of terminology, we will refer to the former model as Convolutional Neural Network based WDE (WDE-CNN); the latter one is called LSTM based WDE (WDE-LSTM). We then compute high level features (embedding) by synthesizing the extracted features, as well as the contextual aspect information (e.g. screen of cell phones) of the product. The aspect input represents prior knowledge regarding the sentence’s orientation.

The main contributions of this paper are summarized as follows:

1) We propose a new deep learning framework WDE which can leverage the vast amount of weakly labeled review sentences for sentiment analysis. The framework first tries to capture the sentiment distribution of the data by embedding training on weakly labeled sentences. Then it uses a few labeled sentences for deep network fine-tuning, as well as for prediction model learning. We empirically demonstrate this “weakly pre-training + supervised fine-tuning” idea is feasible. The idea could also be useful for exploiting other kinds of weakly labeled data (e.g. tagging data [17]).
2) We devise a general neural network architecture for WDE and instantiate it by two popular neural network schemes for modeling text data: CNN and LSTM. We compare WDE-CNN and WDE-LSTM in terms of their effectiveness, efficiency and specialties on this sentiment classification task.
3) To evaluate WDE we construct a dataset containing 1.1M weakly labeled review sentences and 11,754 labeled review sentences from three domains of Amazon, i.e. digital cameras, cell phones and laptops. The dataset can be downloaded at https://www.dropbox.com/s/aji68llxmtcuu5l/data.zip. Experimental results show that WDE is effective and outperforms baseline methods.

The rest of the paper is organized as follows: the next section outlines related work. Section 3 presents the WDE framework and explains the design of each step in detail. Experiments are described in Section 4, and finally, Section 5 concludes our work and discusses future work.

2 RELATED WORK

Sentiment analysis is a long standing research topic. Readers can refer to [25] for a recent survey. Sentiment classification is one of the key tasks in sentiment analysis and can be categorized as document level, sentence level and aspect level [25]. Traditional machine learning methods for sentiment classification can generally be applied to the three levels [25]. Our work falls into the last category since we consider aspect information. Next, we review two subtopics closely related to our work.

2.1 Deep Learning for Sentiment Classification

In recent years, deep learning techniques have been exploited to address text related problems, e.g. information retrieval [38], question answering [36] and text categorization [53]. In the sentiment analysis community, researchers have explored different deep models for sentiment classification. Glorot et al. used a stacked denoising auto-encoder to train review representation in an unsupervised fashion, in order to address the domain adaptation problem of sentiment classification [14]. Socher et al. [39]–[41] proposed a series of Recursive Neural Network (RecNN) models for sentiment classification. These methods learn vector representations of variable-length sentences through compositional computation recursively. Kim investigated using CNN for sentence sentiment classification and found it outperformed RecNN [21]. A variant CNN with dynamic k-max pooling and multiple convolutional layers was proposed in [20]. Researchers have also investigated using sequential models such as Recurrent Neural Networks (RNNs), e.g. Long Short-Term Memory (LSTM), for sentiment classification [44]. In the following, we review neural models for text embedding and aspect level sentiment classification.

Text embedding learning. There is an increasing interest in learning distributed text embedding by neural models for text understanding tasks among which sentiment classification is a popular one. Le and Mikolov [24] developed an unsupervised embedding learning method for sentences, paragraphs and documents. Two simple network models were proposed which were inspired by [30]. Kiros et al. [22] proposed the unsupervised skip-thoughts model which generalized the skip-gram model [30] to the sentence level. The key idea was to use the embedding representation of a sentence to predict its surrounding sentences. A supervised sentence embedding learning framework was proposed by Wieting et al. [51], where sentence similarity in the embedding space was trained according to a paraphrase database via a margin loss. Six network architectures were tested and compared.

Aspect-based neural models for sentiment classification. Recently, neural models have been proposed for aspect level (or target dependent) sentiment classification. Several works were built on the RecNN models: Dong et al. proposed a variant of RecNN for target dependent sentiment classification on Twitter [9]. The
method used a set of composition functions and syntactic relationships between words to control the propagation of sentiments to targets. Nguyen and Shirai [33] further extended the method of [9] by using syntactic information from both the dependency and constituent trees of sentences. In [23], Lakkaraju et al. used RecNN to jointly perform aspect detection and sentiment prediction. Researchers also explored adapting LSTM for aspect level sentiment classification. Tang et al. proposed two LSTM based models for this task [43]. The first one used two LSTMs to model preceding and succeeding contexts of the target words respectively and built target specific representation by a concatenation of the outputs of the two LSTMs. The second one, which performed better, explicitly took the target vector as an input for each time step. The target vector was simply an average of target word vectors. Wang et al. [49] introduced the attention mechanism into LSTM to learn aspect specific sentence representation. Similar to our work, aspects were modeled as embedding vectors and used for attention estimation. One of the state-of-the-art methods was proposed by Tang et al. [45] which is based on deep memory network. The key idea was to use content- and location-based attention models to learn the weight of each context word and combine context words and the aspect word (after linear transformation) by summation, to learn sentence representation in different levels. However, the model was only naturally defined for single-word aspects, and it did not try to model implicit aspects [25].

None of the above works tried to use review ratings to train deep sentiment classifiers for sentences. This is not a trivial problem since ratings are too noisy to be used directly as sentence labels (see Section 3 and experiments for discussions of this issue). To our knowledge, the WDE framework is the first attempt to make use of rating information as weak labels for training deep sentence sentiment classifiers. While some of the studies described above also revolved around text embedding by neural networks, they were not concerned with weak supervision. Our work is orthogonal to theirs in that we could use/adapt their methods to pre-train the deep models. In this paper, our focus is to investigate whether the proposed WDE framework can effectively exploit weakly labeled data, thus we do not employ those methods for pre-training. Note that although we instantiate WDE with CNN and LSTM which are two popular neural schemes for text, the idea of WDE could also be applied to other types of deep models, such as the existing aspect-based neural models mentioned above. The major contribution of this work is a weakly-supervised deep learning framework, rather than specific deep models.

2.2 Exploiting Ratings in Sentence Sentiment Classification

Rating information has been exploited in sentence sentiment classification. Qu et al. incorporated ratings as weak labels in a probabilistic framework for sentence level sentiment classification [37]. However, their method still required careful feature design and relied on base predictors. In contrast, our method automatically learns a meaningful sentence representation for sentiment classification. Täckström and McDonald used conditional random fields to combine review level and sentence level sentiment labels for sentence sentiment analysis [42]. This method also required feature engineering. Maas et al. [27] proposed to learn sentiment-bearing word vectors by incorporating rating information in a probabilistic model. For sentiment classification, they simply averaged the word vectors of a document as its representation. A similar work is [46], which developed a variant of the C&W neural model [5] for learning sentiment-bearing word vectors from weak tweet labels derived from emoticons. The tweet representation was obtained by min, max and avg pooling on word vectors. Although methods of this kind can generate sentence representations automatically, the representations were derived by simple pooling of the learned word vectors. In comparison, our method generates a sentence representation by feeding word vectors through an expressive deep neural network. Moreover, we directly optimize sentence representation, rather than word vectors. We take the above two methods as baselines in experiments.

3 WEAKLY-SUPERVISED DEEP EMBEDDING

The classic deep learning methods take an “unsupervised training then supervised fine-tuning” scheme, where restricted Boltzmann machines (RBM) or auto-encoders are used to pre-train network parameters from large quantities of unlabeled data [1]. This works well when the data distribution is correlated with label prediction [1]. Nevertheless, in sentiment analysis the word co-occurrence information is usually not well correlated with sentiment prediction [27], which motivates us to exploit large-scale rating data for training deep sentiment classifiers.

However, ratings are noisy labels for review sentences and would mislead classifier training if directly used in supervised training. In this paper, we adopt a simple rule to assign weak labels to sentences with 5-stars rating scale:

$$\ell(s) = \begin{cases} \text{pos}, & \text{if } s \text{ is in a 4 or 5-stars review} \\ \text{neg}, & \text{if } s \text{ is in a 1 or 2-stars review} \end{cases} \qquad (1)$$

where $\ell(s)$ denotes the weak sentiment label of sentence $s$. Note we follow previous works on aspect level sentiment analysis [8], [19] to only consider positive and negative sentiment labels. The reason is that when commenting on various aspects of a product, people hardly express neutral opinions. Figure 2 shows the percentages of wrong-labeled sentences by $\ell(s)$, estimated in our labeled review dataset (detailed description of the dataset is in Section 4.1). We can see the noise level is moderate but not negligible.

Fig. 2. Percentages of wrong-labeled sentences by ratings in our labeled review dataset. The overall percentage is 13.4%.

The general idea behind WDE is that we use large quantities of weakly labeled sentences to train a good embedding space so that a linear classifier would suffice to accurately make sentiment predictions. Here a good embedding means that, in the space, sentences with the same sentiment labels are close to one another, while those with different labels are kept away from each other. In the following, we first present the network architecture and explain the specific design choices for WDE-CNN and WDE-LSTM. Then we discuss how to train it with large-scale rating data, followed by supervised fine-tuning on labeled sentences.
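The weak labeling rule of Eq. (1) can be illustrated with a minimal Python sketch. The function names `weak_label` and `split_weakly_labeled` are hypothetical, and skipping 3-stars reviews is an assumption: Eq. (1) defines no label for them, and the introduction only deems ratings above/below 3-stars as positive/negative.

```python
from typing import List, Optional, Tuple


def weak_label(stars: int) -> Optional[str]:
    """Weak label a sentence inherits from its review's star rating, Eq. (1)."""
    if stars not in (1, 2, 3, 4, 5):
        raise ValueError("expected a rating on a 5-stars scale")
    if stars >= 4:
        return "pos"
    if 2 >= stars:
        return "neg"
    return None  # assumption: sentences from 3-stars reviews are left unlabeled


def split_weakly_labeled(pairs: List[Tuple[str, int]]):
    """Partition (sentence, stars) pairs into weakly positive and negative sets."""
    pos, neg = [], []
    for sentence, stars in pairs:
        label = weak_label(stars)
        if label == "pos":
            pos.append(sentence)
        elif label == "neg":
            neg.append(sentence)
    return pos, neg
```

Both sets inherit the rating noise discussed above: a negative sentence inside a 5-stars review still ends up in the positive set, which is why these labels are treated as weak rather than used directly for supervised training.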

Fig. 3. The network architecture in general for sentence sentiment classification.

3.1 Network Architecture in General

The general architecture of the neural network designed for WDE is shown in Figure 3. At the first layer, the network takes a review sentence as input and extracts a fixed-length low-level feature vector from the sentence. Unlike many traditional methods for sentiment analysis, no feature engineering is required and the extractor is learned automatically. Specific implementation of the extractor will be discussed in the following for WDE-CNN and WDE-LSTM respectively. The low-level feature vector is then passed through a hidden layer, adding sufficient nonlinearity, and the output is used to compute the embedding representation of the sentence. The embedding representation also takes the sentence’s aspect contextual information into consideration. An aspect is a topic on which customers can comment with respect to a kind of entity. For instance, battery life is an aspect for cell phones. We use a learnable context vector to represent an aspect. The motivation for incorporating aspect information as the context of a sentence is that similar comments in different contexts could be of opposite orientations, e.g. “the screen is big” vs. “the size is big”.

In the weakly-supervised training phase, the goal is to learn an embedding space which can properly reflect data’s semantic distribution. Hence, the network used in this phase contains only the layers up to the embedding layer. The final classification layer (drawn in dotted lines in Figure 3) is added in the following supervised training phase, in order to learn the final prediction model.

3.2 Network Architecture of WDE-CNN

The network architecture of WDE-CNN, depicted in Figure 4, is a variant of the CNNs described in [5], [21]. In what follows, we use upper case bold letters such as $\mathbf{W}$ to denote matrices and lower case bold letters such as $\mathbf{x}$ to denote column vectors. The i-th element in vector $\mathbf{x}$ is denoted by $\mathbf{x}(i)$.

Fig. 4. The network architecture for WDE-CNN.

Input Layer. An input sentence of length $T$ is a word sequence $s = \langle w_1 w_2 \ldots w_T \rangle$. Each word $w$ in the vocabulary is described by a word vector $\mathbf{x}$. Let $k$ be the length of $\mathbf{x}$ and $n$ be the total number of words in the vocabulary. The trainable word lookup table $\mathbf{X}$ is then a $k \times n$ matrix with word vectors as its columns. The input layer simply maps $s = \langle w_1 w_2 \ldots w_T \rangle$ to its corresponding word vector representation $\langle \mathbf{x}_1 \mathbf{x}_2 \ldots \mathbf{x}_T \rangle$. The lookup table is initialized using the publicly available 300-dimensional word vectors trained on 100 billion words from Google News by word2vec [31]. Out-of-vocabulary words are initialized randomly.

Convolutional Layer and Max pooling Layer. The convolutional layer applies a set of filters on the sentence. Each filter $\mathbf{w} \in \mathbb{R}^{hk}$ is applied to a window of $h$ words to produce a local feature value:

$$u(t) = f(\mathbf{w}^T \mathbf{x}_{t:(t+h-1)} + b), \qquad (2)$$

where $\mathbf{x}_{t:(t+h-1)}$ represents the concatenated vector $[\mathbf{x}_t^T \, \mathbf{x}_{t+1}^T \ldots \mathbf{x}_{t+h-1}^T]^T$, $u(t)$ is the computed feature value at position $t$, $b$ is the bias of the current filter, and $f(\cdot)$ is a non-linear activation function such as hyperbolic tangent. Computing $u(t)$ at all possible positions in $s$ yields a $(T-h+1)$-dimensional feature vector (i.e. a feature map) $\mathbf{u} = [u(1) \, u(2) \ldots u(T-h+1)]^T$. Then the max pooling layer performs a max operation over each feature map $\mathbf{u}_j$ to find the most salient value of the filter’s corresponding feature as its final value [5]

$$\mathbf{v}(j) = \max_t \{\mathbf{u}_j(t)\}. \qquad (3)$$

This pooling scheme keeps the most important indicator of a feature and naturally leads to a fixed-length vector output $\mathbf{v}$ at the max pooling layer.

A filter with window size $h$ is intrinsically a feature extractor which performs “feature selection” from the h-gram features of a sentence. When the input h-gram matches its $\mathbf{w}$, we will obtain a high feature value, indicating this h-gram activates the
feature. This resembles the traditional feature selection in sentiment classification [34], but is done automatically by the network. Since traditional machine learning based methods often exploit unigrams, bigrams and trigrams [34], we also employ filters with different window sizes, i.e. $h = 1, 2, 3$.

Hidden Layer and Embedding Layer. The fixed-length feature vector $\mathbf{v}$ is then fed to the fully connected hidden layer and embedding layer to extract nonlinear higher level features. For the hidden layer, the computation is straightforward with weight matrix $\mathbf{W}_h$ and bias vector $\mathbf{b}_h$:

$$\mathbf{h} = f(\mathbf{W}_h \mathbf{v} + \mathbf{b}_h). \qquad (4)$$

The embedding layer gets its input from two sources: the output of the hidden layer $\mathbf{h}$, and the context vector $\mathbf{a}_s$ of sentence $s$. Context vectors of all aspects constitute the context lookup table $\mathbf{A}$ (as columns). The embedding layer output is computed as

$$\mathbf{y} = f\left(\mathbf{W}_e \begin{bmatrix} \mathbf{h} \\ \mathbf{a}_s \end{bmatrix} + \mathbf{b}_e\right). \qquad (5)$$

Classification Layer. This layer is fully connected to the embedding layer and outputs sentiment prediction for the input sentence. It is not until the supervised fine-tuning phase that the layer will be added to the network. We defer the description of this layer to Section 3.5.

Fig. 5. The network architecture for WDE-LSTM.

3.3 Network Architecture of WDE-LSTM

The convolutional filters in WDE-CNN can only capture input patterns from a text window, e.g., of 3 words. It is hard for a CNN model to explicitly capture long-term dependencies in sequential data. Hence, we propose another instantiation of WDE by Long short-term memory (LSTM) [18]. LSTM is a popular technique for recurrent neural networks (RNNs). An RNN progressively updates its hidden state given the current input and the previous hidden state from the last time step, making it a natural choice for modeling sequential data such as natural language sentences.

LSTM has recently attracted much attention due to its ability to learn long-term dependencies by using a gating mechanism. An LSTM maintains a structure called memory cell which can be viewed as a continuous analog of a memory circuit. The memory cell controls the read, write and reset operations of its internal state through output, input and forget gates respectively, allowing the gradient information to be back-propagated through many time steps. The forward computation of a block of memory cells at time $t$ is as follows [16]:

$$\mathbf{d}_t = g(\mathbf{W}_d \mathbf{x}_t + \mathbf{U}_d \mathbf{z}_{t-1} + \mathbf{b}_d) \qquad (6)$$
$$\mathbf{i}_t = \sigma(\mathbf{W}_i \mathbf{x}_t + \mathbf{U}_i \mathbf{z}_{t-1} + \mathbf{b}_i) \qquad (7)$$
$$\mathbf{f}_t = \sigma(\mathbf{W}_f \mathbf{x}_t + \mathbf{U}_f \mathbf{z}_{t-1} + \mathbf{b}_f) \qquad (8)$$
$$\mathbf{o}_t = \sigma(\mathbf{W}_o \mathbf{x}_t + \mathbf{U}_o \mathbf{z}_{t-1} + \mathbf{b}_o) \qquad (9)$$
$$\mathbf{c}_t = \mathbf{i}_t \odot \mathbf{d}_t + \mathbf{f}_t \odot \mathbf{c}_{t-1} \qquad (10)$$
$$\mathbf{z}_t = \mathbf{o}_t \odot g(\mathbf{c}_t) \qquad (11)$$

where $\{\mathbf{W}_*, \mathbf{U}_*, \mathbf{b}_*\}_{* \in \{d, i, f, o\}}$ is the set of model parameters, $\odot$ denotes the element-wise product between vectors, and each element in the vectors $\mathbf{d}_t$, $\mathbf{i}_t$, $\mathbf{f}_t$ and $\mathbf{o}_t$ corresponds to the outcome value of one memory cell’s input unit, input gate, forget gate and output gate, respectively, at time $t$. $\mathbf{c}_t$ represents the internal state of the memory cells and $\mathbf{z}_t$ represents their outputs at time $t$. $\sigma(\cdot)$ is the sigmoid activation function. $g(\cdot)$ is the activation function of inputs and outputs (usually hyperbolic tangent). The above definition is a variant of the vanilla LSTM [16]. We omit peephole connections from internal states to gates since it has been reported that removing them does not influence LSTM’s performance significantly [16]. Readers are referred to [16], [18] for a detailed description of LSTM and discussion of its advantages.

Taking LSTM as a building block, we design the network architecture of WDE-LSTM as depicted in Figure 5. The input layer is the same as that of WDE-CNN. We also use the 300-dimensional word vectors trained with Google News to initialize the word lookup table. The LSTM layer contains two LSTMs to form a bidirectional RNN [15]. More formally, we have

$$\overrightarrow{\mathbf{z}}_t = \text{LSTM}_f(\mathbf{x}_t, \overrightarrow{\mathbf{z}}_{t-1})$$
$$\overleftarrow{\mathbf{z}}_t = \text{LSTM}_b(\mathbf{x}_t, \overleftarrow{\mathbf{z}}_{t+1})$$
$$\mathbf{z}_t = \text{concat}(\overrightarrow{\mathbf{z}}_t, \overleftarrow{\mathbf{z}}_t), \qquad (12)$$

where $\text{LSTM}_f$/$\text{LSTM}_b$ represents forward/backward LSTM, and $\overrightarrow{\mathbf{z}}_t$ and $\overleftarrow{\mathbf{z}}_t$ are their output at time $t$ respectively. $\mathbf{z}_t$ is then constructed as a concatenation of $\overrightarrow{\mathbf{z}}_t$ and $\overleftarrow{\mathbf{z}}_t$. Such a bidirectional structure generates a more comprehensive feature representation of each word by encoding contextual information from the whole sentence. On the other hand, memory cells in LSTMs can be viewed as feature extractors for detecting sequential patterns in sentences. Therefore, we again perform max pooling over the whole word sequence of a sentence to obtain the most salient value of each extractor, forming the fixed-length feature vector $\mathbf{v}$:

$$\mathbf{v}(j) = \max_t \{\mathbf{z}_t(j)\}. \qquad (13)$$

The layers above the max pooling layer are the same as in WDE-CNN.
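The WDE-CNN forward pass of Section 3.2, Eqs. (2)-(5), can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the helper names (`conv_max_pool`, `dense_tanh`, `wde_cnn_embed`, `rand_matrix`), the toy dimensions and the random parameter values are all hypothetical, and tanh is used for $f(\cdot)$ as stated in the training details.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def conv_max_pool(word_vecs, filt, bias, h):
    """Eqs. (2)-(3): apply one filter to every window of h words, keep the max."""
    if h > len(word_vecs):
        raise ValueError("sentence shorter than the filter window")
    feature_map = []
    for t in range(len(word_vecs) - h + 1):
        # Concatenate the h word vectors of the window into one hk-dim vector.
        window = [x for vec in word_vecs[t:t + h] for x in vec]
        feature_map.append(math.tanh(dot(filt, window) + bias))
    return max(feature_map)


def dense_tanh(W, b, x):
    """Fully connected layer f(Wx + b) with f = tanh; W is a list of rows."""
    return [math.tanh(dot(row, x) + b_i) for row, b_i in zip(W, b)]


def wde_cnn_embed(word_vecs, filters, W_h, b_h, W_e, b_e, aspect_vec):
    """Map a sentence (list of word vectors) and its aspect vector to y."""
    v = [conv_max_pool(word_vecs, w, b, h) for (w, b, h) in filters]
    hidden = dense_tanh(W_h, b_h, v)                  # Eq. (4)
    return dense_tanh(W_e, b_e, hidden + aspect_vec)  # Eq. (5), input [h; a_s]


def rand_matrix(rng, rows, cols):
    """Small random matrix, standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (all hypothetical): k = 4, T = 6, two filters per window size.
rng = random.Random(0)
k, n_hidden, n_embed = 4, 5, 3
sentence = rand_matrix(rng, 6, k)
filters = [(rand_matrix(rng, 1, h * k)[0], 0.0, h)
           for h in (1, 2, 3) for _ in range(2)]
aspect = [0.05, -0.02]
W_h, b_h = rand_matrix(rng, n_hidden, len(filters)), [0.0] * n_hidden
W_e, b_e = rand_matrix(rng, n_embed, n_hidden + len(aspect)), [0.0] * n_embed
y = wde_cnn_embed(sentence, filters, W_h, b_h, W_e, b_e, aspect)
print(len(y))
```

Because max pooling keeps one value per filter, the length of $\mathbf{v}$ equals the number of filters regardless of the sentence length $T$, which is what lets the hidden and embedding layers use fixed-size weight matrices.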

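The LSTM feature extractor of WDE-LSTM (Section 3.3), i.e. the memory cell update of Eqs. (6)-(11) followed by the bidirectional concatenation and max pooling of Eqs. (12)-(13), can be sketched in plain Python as follows. This is a hedged illustration rather than the authors' code: the function names, the toy dimensions and the random parameters are hypothetical, $g(\cdot)$ is taken to be tanh, and zero initial states are an assumption.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def affine(W, U, b, x, z_prev):
    """Compute W x + U z_prev + b row by row (W and U are lists of rows)."""
    return [sum(w * xi for w, xi in zip(w_row, x))
            + sum(u * zi for u, zi in zip(u_row, z_prev)) + b_i
            for w_row, u_row, b_i in zip(W, U, b)]


def lstm_step(params, x, z_prev, c_prev):
    """One time step of Eqs. (6)-(11); params maps 'd','i','f','o' to (W, U, b)."""
    d = [math.tanh(a) for a in affine(*params["d"], x, z_prev)]  # input unit
    i = [sigmoid(a) for a in affine(*params["i"], x, z_prev)]    # input gate
    f = [sigmoid(a) for a in affine(*params["f"], x, z_prev)]    # forget gate
    o = [sigmoid(a) for a in affine(*params["o"], x, z_prev)]    # output gate
    c = [i_j * d_j + f_j * c_j for i_j, d_j, f_j, c_j in zip(i, d, f, c_prev)]
    z = [o_j * math.tanh(c_j) for o_j, c_j in zip(o, c)]
    return z, c


def run_lstm(params, xs, size):
    """Run an LSTM over xs from zero initial state (assumed); return all z_t."""
    z, c = [0.0] * size, [0.0] * size
    outputs = []
    for x in xs:
        z, c = lstm_step(params, x, z, c)
        outputs.append(z)
    return outputs


def bilstm_max_pool(fwd, bwd, xs, size):
    """Eqs. (12)-(13): concatenate both directions, then take the max over time."""
    z_fwd = run_lstm(fwd, xs, size)
    z_bwd = run_lstm(bwd, xs[::-1], size)[::-1]  # re-align to positions 1..T
    z = [a + b for a, b in zip(z_fwd, z_bwd)]    # list concatenation per step
    return [max(z_t[j] for z_t in z) for j in range(2 * size)]


def rand_params(rng, in_dim, size):
    """Small random parameters, standing in for trained ones."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {g: (mat(size, in_dim), mat(size, size), [0.0] * size) for g in "difo"}


rng = random.Random(0)
xs = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(6)]  # T = 6, k = 4
v = bilstm_max_pool(rand_params(rng, 4, 3), rand_params(rng, 4, 3), xs, 3)
print(len(v))
```

The backward LSTM is run on the reversed sentence and its outputs are reversed again, so that position $t$ of both directions refers to the same word before concatenation. The pooled vector has twice the number of memory cells per direction and plays the same role as $\mathbf{v}$ in WDE-CNN.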

3.4 Embedding Training with Ratings
With the weak label definition in Eq. (1), we can divide review sentences into two sets: $P = \{s \mid \ell(s) = pos\}$ and $N = \{s \mid \ell(s) = neg\}$. Since $P$ and $N$ contain wrong-labeled sentences, they cannot be used directly to train a classifier. Therefore, we propose to first train an embedding space that captures the general sentiment distribution of sentences. Intuitively, we should let sentences within $P$ (and within $N$) stick together, while keeping $P$ and $N$ separated. A straightforward training scheme could be adapted from [50] by stochastic gradient descent (SGD): we sample sentence pairs, reduce distances for same-label pairs and increase distances for opposite-label pairs. However, when wrong-labeled sentences are sampled, there is still a relatively high chance that we make a wrong move. To alleviate this issue, we propose to penalize relative distances for sentence triplets. The training objective is defined as a ranking loss [5]

$$L_{weak} = \sum_{\langle s_1, s_2, s_3 \rangle} \max\left(0, \lambda - dst(s_1, s_3) + dst(s_1, s_2)\right), \quad (14)$$

where $\lambda$ is the margin parameter and $dst(\cdot)$ is the Euclidean distance between sentences computed from their embedding layer representations:

$$dst(s_i, s_j) = \| y_{s_i} - y_{s_j} \|_2, \quad (15)$$

and $\langle s_1, s_2, s_3 \rangle$ denotes a valid triplet with $\ell(s_1) = \ell(s_2) \neq \ell(s_3)$. Eq. (14) means we require the distance between same-label sentences $s_1$ and $s_2$ to be shorter than that between $s_1$ and a sentence $s_3$ with the opposite label by at least $\lambda$. A sample triplet is generated as follows. First, we randomly choose $P$ or $N$ as the focus. Suppose we choose $P$. Then two sentences $s_1$ and $s_2$ are sampled from $P$ in turn, and a sentence $s_3$ is sampled from $N$. The case for $N$ as the focus is just a mirror case.

Fig. 6. Comparison between (a) pair-based training and (b) triplet-based training. Please see the text for detailed explanations.

Figure 6 illustrates the advantages of triplet-based training over pair-based training via a toy example. We use circles and triangles to represent sentences in $P$ and $N$ respectively. Black nodes denote wrong-labeled sentences. Since the majority of sentences have correct labels, they would gather together in the training process. Wrong-labeled sentences would go towards the wrong clusters, but at slower speeds. In both training methods, undesirable moves could happen when wrong-labeled sentences are sampled. For clarity, we just show three such cases that are representative for the respective methods. The three cases in Figure 6(a) all result in undesirable moves: sentences with different orientations become closer (cases 1 and 2), while same-orientation sentences become more separated (case 3). In Figure 6(b), case 1 generates only undesirable moves: since $s_3$ (black triangle) is closer to $s_1$ (white circle) than $s_2$ (black circle), the algorithm will drag $s_3$ away from $s_1$ and drag $s_2$ toward $s_1$. Cases 2 and 3 lead to a mixed behavior: one move is desirable while the other one is not. Therefore, cases 2 and 3 in Figure 6(b) are not as harmful as the cases in Figure 6(a). Furthermore, in triplet-based training there will not be a move if the difference in distances exceeds the margin $\lambda$, since the derivative of $L_{weak}$ becomes 0. This is useful because it keeps a wrong move from making things much worse. For example, in case 2 of Figure 6(b), $s_2$ is actually a negative sentence and should not be too close to $s_1$. Notice that $s_3$ is far away from $s_1$. Hence, the distance difference may already exceed $\lambda$ and there will be no move for this triplet. As a comparison, cases 1 and 2 in Figure 6(a) will continually move $s_1$ and $s_2$ toward each other until their distance becomes 0, which is the worst result.

Training details. Embedding training is done by taking the derivative of $L_{weak}$ in (14) with respect to all the parameters under the Embedding Layer (Figure 3). Hyperbolic tangent is employed as the activation function for all the layers. The detailed derivation of the back-propagation procedure for WDE can be found in the appendix. We do SGD over sampled sentence triplets with the AdaGrad update rule [10]. Each mini-batch consists of 1024 sentence triplets. To cope with the overfitting problem, we do early stopping in the training process according to the network's performance on the validation set. We devise a measure for validation. Intuitively, a good embedding space will keep intra-class instances near each other and separate instances from different classes, providing a proper prior model for supervised training. Hence, we propose to assess the intra-class affinity and inter-class separability of the embedding on the validation set as an indicator of the model's generalization ability. Specifically, the measure is defined as

$$Dst\ ratio = \frac{\frac{1}{|\Psi|} \sum_{(s_i, s_j) \in \Psi} dst(s_i, s_j)}{\frac{1}{|\Phi|} \sum_{(s_i, s_j) \in \Phi} dst(s_i, s_j)}, \quad (16)$$

where $\Psi$ and $\Phi$ are the sets of inter-class sentence pairs and intra-class sentence pairs in the validation set respectively. It is essentially the ratio between the inter-class average distance and the intra-class average distance. A better model would lead to a higher value of Dst ratio. In the training process, the learning rate is first set to 0.1. After every 1M triplets are processed, Dst ratio is evaluated. If it deteriorates noticeably, we start to halve the learning rate (multiplying it by 0.5). The training stops if either the learning rate is below a preset threshold (e.g. $10^{-4}$) or Dst ratio can no longer be increased significantly. The hyperparameters are tuned on the validation set. We find the performance becomes relatively stable when the hyperparameters are large enough. Considering both effectiveness and efficiency, for WDE-CNN and WDE-LSTM we set the context vector size to 50 and set both the hidden layer size and the embedding layer size to 300. For WDE-CNN, the number of filters for each window size is 200 (a total of 600 filters). For WDE-LSTM, both the forward LSTM and the backward LSTM contain 300 memory cells (a total of 600, matching the number of filters in WDE-CNN for fair comparison). The training is accelerated using a GPU.

CNN vs. LSTM. As discussed before, WDE-LSTM can naturally capture long-term dependencies while WDE-CNN is limited by the window sizes of convolutional filters. Nevertheless, WDE-CNN is more efficient than WDE-LSTM. With the above hyperparameter setting, WDE-CNN needs about half an hour for processing 1M triplets on an Nvidia GTX 980 Ti GPU, while for WDE-LSTM the corresponding time is about 5 hours. The low


efficiency of WDE-LSTM stems from two facts: (1) there are many more parameters in the LSTM layer than in the CNN layer; (2) when computing derivatives for an LSTM cell's weights, each word position must be considered, while for a convolutional filter, we only need to care about the word position which wins the max operation. Moreover, the parallelization can only be done sentence-wise in WDE-LSTM, while for WDE-CNN we can do word-level parallelization.

3.5 Supervised Fine-tuning
After obtaining a good enough sentence representation by the embedding layer, we add a classification layer on the top (Figure 4) to further train the network using labeled sentences. The classification layer simply performs a standard affine transformation of the embedding layer output $y$ and then applies a softmax activation function [3] to the result for label prediction. In this work, we focus on binary sentiment prediction (i.e. positive or negative) since we only consider sentences which comment on specific aspects of an entity. Such sentences are rarely neutral. Nevertheless, WDE could also be adapted to multi-class prediction problems. For binary prediction, the classification layer is equivalent to a logistic regression model. We train the network using standard SGD, since AdaGrad can easily “forget” the prior model learned in the first phase. The mini-batch size is set to 64. A similar early stopping strategy is adopted as in the embedding training phase (except that we employ prediction accuracy as the validation measure).

4 EXPERIMENTS
In this section, we present the empirical evaluation of WDE on reviews collected from Amazon.com.

TABLE 1
Statistics of the labeled dataset.

             Positive   Negative   Total
Subjective   3750       2024       5774
Objective    1860       4120       5980
Total        5610       6144       11754

4.1 Data and Preprocessing
We collected Amazon customer reviews of 3 domains: digital cameras, cell phones and laptops. All unlabeled reviews were extracted from the Amazon data product dataset [29]. In particular, we extracted all the reviews from 12 leaf categories closely related to the above three domains (3-star reviews were ignored). For the labeled dataset, we crawled the latest reviews in 2015 for random products in the above 12 categories, in order to be disjoint with the unlabeled data. We tried to keep a balance between reviews with 4 & 5 stars and those with 1 & 2 stars. We then summarized product aspects and their keywords by the popular method in [8] with manual calibration¹. A total of 107 aspects were extracted from the obtained reviews. Next, all reviews were split into sentences and those with no aspect keywords were discarded. In case a sentence mentioned multiple aspects in different clauses, we split it to treat each clause as a single-aspect sentence². After preprocessing, we obtained a vocabulary of 148,183 terms, an unlabeled set of 1,143,721 sentences with rating information only (named the 1.1M dataset), and 11,754 sentences for labeling. We labeled each sentence with respect to its subjectivity and orientation. For each labeling task, three students were instructed to perform labeling, and each sentence was labeled by all three students. Where the labeling results were inconsistent, they discussed the sentence to reach a consensus. The statistics of the labeled dataset are shown in Table 1. We can see the dataset is roughly balanced. We used Fleiss's kappa [13] to measure the degree of agreement between the raters. For the subjectivity and orientation labeling tasks, we achieved Fleiss's kappa values of 0.81 and 0.79 respectively. The labeled dataset was randomly split into a training set (50%), a validation set (20%) and a test set (30%), and each split maintains the class proportions shown in Table 1.

1. Although more recent methods such as [4] could be used for aspect extraction, we choose [8] for simplicity. Comparison between different aspect extraction methods is out of scope of this paper.
2. This is also to be fair to the classic machine learning baseline methods which do not take aspect information as input.
3. The source code of WDE can be downloaded at https://www.dropbox.com/s/yvo4x4c6dqtx3nm/code.zip.

4.2 Baselines and Evaluation Settings
The main research question we want to answer is whether the proposed weakly-supervised training method for deep networks can help boost performance. We compare WDE-CNN and WDE-LSTM³ with the following baseline methods for review sentence sentiment classification:
Lexicon: this is the popular lexicon-based method proposed in [8].
SVM: the support vector machine with n-gram features [35] is widely employed as a baseline for sentiment classification. We use up to tri-grams since this setting is shown to yield good performance for product reviews. Liblinear [11] is used to train the classifier.
NBSVM: NBSVM combines Naive Bayes and NB-enhanced SVM to predict sentiment labels [48]. It generates good performance on many sentiment classification datasets.
SSWE: SSWE learns sentiment-bearing word vectors by a neural network applied on weakly labeled data. We use min, max and avg pooling [46] on word vectors to generate the sentence representation, which is then fed to a classifier.
SentiWV: this is the sentiment word vector learning method on rating data described in Section 2 [27]. We also use the aforementioned three pooling functions to generate sentence vectors. Liblinear is used to train the classifiers for SentiWV and SSWE.
MemNet: MemNet is the deep memory network proposed in [45] for aspect level sentiment classification. The method is naturally defined only for single-word aspects. For multi-word aspects, we follow the suggestion of averaging the constituting word vectors [45]. For implicit aspects, the word that is the most relevant to the aspect according to word vectors is treated as the aspect word.
CNN-rand & CNN-rand11m: we train the same CNN based network (Figure 4) on labeled data with random parameter initialization. For the sake of fairness, we also initialize the word lookup table by word vectors trained by word2vec [31]. Besides the publicly available word vectors trained on Google News, we also employ word vectors trained on the 1.1M dataset. The corresponding baselines are named CNN-rand and CNN-rand11m, respectively.
CNN-weak: we train the same CNN based network on the 1.1M dataset by treating weak labels defined in Eq. (1) as real labels.


This baseline will answer whether rating data can be used to train sentence sentiment classifiers directly⁴.
LSTM-rand & LSTM-rand11m: the setting of these baselines is the same as that of CNN-rand and CNN-rand11m except that we employ the LSTM based network (Figure 5).

All methods (except CNN-weak and Lexicon) are trained on the training set and evaluated on the test set. WDE-CNN, WDE-LSTM, SSWE and SentiWV have a pre-training phase on the 1.1M dataset. The validation set is used for parameter tuning of all the methods and early stopping of neural network training. We employ Accuracy and Macro-F1 as the evaluation metrics.

4.3 Performance Comparison
The results are shown in Table 2. We also report performance for subjective sentences and objective sentences separately. The key observations are as follows. Lexicon performs poorly on objective sentences, since factual statements would not contain opinion words. When no opinion word is detected, we can only make random predictions. The machine learning methods all achieve acceptable performance, on both subjective and objective sentences. One exception is CNN-weak, which is trained on weakly labeled sentences. We find its validation performance fluctuates drastically during training. This indicates that directly binarizing ratings as labels for supervised training is not a good idea. SSWE performs better than traditional classifiers (SVM and NBSVM) by applying a neural model on the 1.1M dataset. However, it just uses word vectors to encode the useful information in the 1.1M dataset. The classifier is still a “shallow” linear model. The two neural network methods, CNN-rand and LSTM-rand, generate comparable results, with LSTM-rand exhibiting a slightly lower overall performance. The reason could be that the LSTM based model has more parameters and consequently needs more training data. The performance of CNN-rand11m and LSTM-rand11m is marginally better than that of CNN-rand and LSTM-rand. The reason could be that word vectors trained on the 1.1M dataset are “customized” for reviews. These four baselines cannot beat WDE since (1) word2vec does not explicitly capture sentiment information (as discussed in [27], co-occurrence information is not enough); (2) WDE pre-trains the whole network (except the last prediction layer) while word2vec only pre-trains input word vectors. MemNet is slightly worse than CNN-rand and LSTM-rand. This could be because MemNet cannot naturally handle multi-word and implicit aspects. In WDE, aspects are modeled by separate semantic vectors, thus avoiding the above issues. Finally, the WDE framework encodes both a large number of weak supervision signals and supervision signals in a deep neural network and beats all the baselines. By incorporating a large set of weakly labeled training data, WDE-LSTM performs slightly better than WDE-CNN. In Section 4.6, we will demonstrate by case studies that WDE-LSTM can handle sentences with more complicated structures.

TABLE 2
Performance comparison.

                      Accuracy             Macro-F1
Method          All    Subj   Obj    All    Subj   Obj
Lexicon         .722   .827   .621   .721   .812   .613
SVM             .818   .838   .800   .818   .821   .765
NBSVM           .826   .844   .808   .825   .831   .773
SSWE            .835   .857   .815   .834   .826   .804
SentiWV         .808   .806   .809   .807   .786   .771
MemNet          .839   .841   .837   .838   .829   .803
CNN-rand        .847   .861   .835   .847   .848   .802
CNN-rand11m     .849   .861   .836   .848   .850   .804
CNN-weak        .771   .773   .770   .771   .755   .741
LSTM-rand       .845   .863   .829   .845   .852   .794
LSTM-rand11m    .850   .861   .839   .849   .853   .809
WDE-CNN         .877   .886   .868   .876   .875   .843
WDE-LSTM        .879   .889   .870   .879   .878   .844

Fig. 7. Impact of labeled training data size on each method's performance: (a) Accuracy and (b) Macro-F1 versus percentage of training data (%), for WDE-CNN, WDE-LSTM, CNN-rand, MemNet, SentiWV, SSWE, SVM and NBSVM.

4. The result is similar by using the LSTM based network.
5. Since the curves of CNN-rand11m, LSTM-rand and LSTM-rand11m overlap with that of CNN-rand heavily, we omit them for clarity.

4.4 Varying the Size of Training Set
Next we examine the impact of the size of labeled training data on each method's performance. CNN-weak and Lexicon are not involved since they do not depend on labeled training data. We randomly select d% of the training data to train the classifiers and test them on the test set, with d ranging from 10 to 90. For each d, we generate the training set 30 times and the averaged performance is reported. Figure 7 shows the results. The curve of LSTM-rand is slightly lower than that of CNN-rand, and the margin increases a bit as the training set becomes smaller⁵. This is in conformity with the observation in the previous subsection. The reason could be that the LSTM based model requires more training data. Secondly, we can see that as the number of available training instances decreases, the performance of CNN-rand, NBSVM and SVM drops faster than that of WDE-CNN, WDE-LSTM, SSWE and SentiWV. This should be because the latter methods have gained prior knowledge about the sentiment distribution through pre-training, though with different capabilities. With 10% training


set (nearly 600 instances), WDE-CNN and WDE-LSTM can still achieve around 80% accuracy on the test set. According to a t-test, WDE-CNN and WDE-LSTM significantly outperform the other methods with p-value < 0.01. WDE-LSTM performs slightly better than WDE-CNN when the labeled training set is sufficiently large.

4.5 Performance Evaluation for Embedding Training
While traditional neural network training algorithms often use the prediction performance of the model on the validation set to assess its generalization ability, it is difficult to apply this strategy in the first phase of WDE since the target model is not a prediction model. Hence, in Section 3.4 we devised a measure called Dst ratio (Eq. (16)) for the first phase of WDE, which is intended to reflect the model's generalization ability. Here we investigate whether Dst ratio can reflect the generalization performance of the final prediction model. We save the learned model in the embedding training phase after every 1M triplets are processed (the early stopping strategy is not used). Then the saved models are fed to the supervised fine-tuning phase to get the classification performance. The two curves of Dst ratio and classification accuracy are shown in Figure 8. The results of WDE-CNN and WDE-LSTM are very similar, so we just show those of WDE-CNN. Figure 8 indicates that Dst ratio can to some extent forecast the final prediction model's performance. Therefore, Dst ratio could be used as the validation measure for embedding training.

Fig. 8. Investigating the relationship between Dst ratio (Eq. (16)) and the accuracy of the final prediction model: for every 1M triplets processed in the embedding training phase, we save the model and feed it to the supervised training phase; then we plot both Dst ratio and the final classification accuracy.

To get an intuitive understanding of the learned embedding space, we visualize it by t-SNE [28]. We sample 2000 sentences from the 1.1M dataset, manually label them and plot them as projected by t-SNE. The visualization results are shown in Fig. 9. We only report the results of WDE-CNN since those of WDE-LSTM are similar. The upper subfigure shows the status of the embedding space right after random initialization of the neural network. The lower one shows its status after training on 10M triplets. We also show the Dst ratio scores on top of the subfigures. From Fig. 9 we see that the general sentiment distribution of sentences can be captured by the embedding space.

Fig. 9. Visualization of the embedding space by t-SNE. The upper subfigure: after random initialization; the lower subfigure: after training on 10M triplets.

4.6 Case studies
As stated in Section 3.3, the LSTM layer is able to capture long-term dependencies. Compared to WDE-CNN, WDE-LSTM could be more capable of “perceiving” a sentence holistically. We examine the classification results of the two methods on the test set in detail and find that: (1) WDE-CNN is better at handling sentences with simple structures, e.g. “Sound is not that good.”; (2) for sentences with more complicated structures, WDE-LSTM can do better. Table 3 shows some examples for which WDE-LSTM makes correct predictions while WDE-CNN fails. The second column shows their true labels. In the first two sentences, there is an attitude transition between the two clauses. For WDE-CNN, conflicting local patterns (e.g. “not the greatest” and “is ok”) could lead to wrong predictions. In the last sentence, the negation word “none” is far from the words conveying the user's opinion, so it is hard for WDE-CNN to capture the dependence with limited window sizes. By modeling sentences holistically using LSTMs, WDE-LSTM is able to predict correctly for these sentences.

TABLE 3
Example sentences for which WDE-LSTM makes correct predictions while WDE-CNN fails.

Sentence in test set | Label
Battery capacity is not the greatest, but it is ok. | Pos
The internet drops randomly while the yoga in the same room is absolutely fine. | Neg
None of these cameras has an articulating front view screen. | Neg

We also compare WDE-CNN (WDE-LSTM) with its counterparts, CNN-rand and CNN-rand11m (LSTM-rand and LSTM-rand11m). Table 4 shows three example sentences for which WDE-CNN generates correct predictions while its counterparts give wrong predictions. Table 5 is for WDE-LSTM. The second column of the tables presents the most similar sentences to the corresponding example sentences in the 1.1M dataset. We use Latent Semantic Indexing [7] to assess sentence similarity. Most of these example sentences are objective ones. We find objective sentences are more likely to be diverse in expression (without popular opinion words), so it is harder to make correct predictions for them. With pre-training on the 1.1M dataset, WDE-CNN and WDE-LSTM have a much higher chance of being trained on similar expressions, as shown in Tables 4 and 5. This could explain their superior performance on those sentences.
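To make the training objective behind these embeddings concrete, the following is a minimal pure-Python sketch of one summand of the ranking loss in Eq. (14), using the Euclidean distance of Eq. (15). The function names and the 2-dimensional toy embeddings are assumptions made for illustration only; the actual model uses 300-dimensional embedding layer outputs and gradient-based updates that are not shown here.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def dst(a: Vector, b: Vector) -> float:
    """Eq. (15): Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(y1: Vector, y2: Vector, y3: Vector, margin: float) -> float:
    """One summand of Eq. (14); y1 and y2 share a weak label, y3 has the opposite one."""
    return max(0.0, margin - dst(y1, y3) + dst(y1, y2))


# Toy embeddings (made-up values): s1 and s2 weakly positive, s3 weakly negative.
y_s1, y_s2, y_s3 = (0.0, 0.0), (0.0, 1.0), (3.0, 0.0)

# dst(s1, s3) - dst(s1, s2) = 3 - 1 = 2.
loss_small_margin = triplet_loss(y_s1, y_s2, y_s3, margin=1.0)  # margin already met
loss_large_margin = triplet_loss(y_s1, y_s2, y_s3, margin=5.0)  # margin violated
print(loss_small_margin, loss_large_margin)
```

With the small margin the loss is zero, so this triplet would cause no move; with the large margin the same triplet still contributes a loss of 3, which is the behavior that the margin analysis in Section 4.7 relies on.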

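The validation measure examined in Section 4.5, Dst ratio (Eq. (16)), can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, the toy embeddings are made up, and enumerating all pairs of the given sentences is an assumption, since the text does not state how the pair sets $\Psi$ and $\Phi$ are drawn from the validation set.

```python
import math
from itertools import combinations
from typing import List, Sequence

Vector = Sequence[float]


def dst(a: Vector, b: Vector) -> float:
    """Eq. (15): Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dst_ratio(embeddings: List[Vector], labels: List[str]) -> float:
    """Eq. (16): mean inter-class distance divided by mean intra-class distance."""
    inter, intra = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        d = dst(embeddings[i], embeddings[j])
        # Same label -> intra-class pair (Phi); different label -> inter-class pair (Psi).
        (intra if labels[i] == labels[j] else inter).append(d)
    if not inter or not intra:
        raise ValueError("need at least one inter-class and one intra-class pair")
    return (sum(inter) / len(inter)) / (sum(intra) / len(intra))


# Toy validation set (made-up values): two compact, well-separated classes.
emb = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
lab = ["pos", "pos", "neg", "neg"]

ratio = dst_ratio(emb, lab)
print(round(ratio, 4))
```

A value above 1 means that sentences of different classes are, on average, farther apart than sentences of the same class; under the stopping rule described in Section 3.4, training would continue while this value keeps increasing significantly.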

TABLE 4
Example sentences for which WDE-CNN makes correct predictions while its counterparts, CNN-rand and CNN-rand11m, fail to give correct answers.

Sentence in test set | The most similar sentence in 1.1M dataset
I was using a motorola t720i with verizon and I had a lot of dropped calls and I had troubles syncing data. | It dropped all calls every two minutes.
This thing was amazing for about 20 minutes, then I started charging it and it bricked out, started rebooting randomly and then stopped turning on altogether. | The phone has an issue that causes it to reboot at random or simply freeze up.
No future updates for this device huawei refused to release any updates for this device. | There will (obviously) be no updates to push this phone any further and I don't think it would be necessary to do so.

TABLE 5
Example sentences for which WDE-LSTM makes correct predictions while its counterparts, LSTM-rand and LSTM-rand11m, fail to give correct answers.

Sentence in test set | The most similar sentence in 1.1M dataset
It leaks it leaks it leaks. | Basically, at an angle, you can see a horizontal light leaking at the top edge of the screen.
Bluetooth does not work on this phone. | The Bluetooth does not work half the time.
You can shoot in very low light without dark backgrounds looking grainy. | Low light photos come out with color intact (no flash) - ideal for museum items.

4.7 Effect of λ in WDE
The margin parameter $\lambda$ in Eq. (14) controls the extent to which we require weakly labeled positive instances to be separated from weakly labeled negative ones. A small value of $\lambda$ may not effectively capture the sentiment distribution, while too large a $\lambda$ could amplify the impact of wrong-labeled sentences. Take case 2 in Figure 6(b) as an illustrative example. First, let us assume $s_2$ in case 2 is white (i.e. correctly labeled). In that case, a small $\lambda$ may result in no move since the margin requirement is already satisfied. Consequently, the positive class will not be compact since $s_2$ is still far from the majority of the class. When $s_2$ represents a wrong-labeled sentence (as shown in Figure 6(b)), large margin values tend to force $s_2$ to move toward $s_1$ until they are very close to each other, making the classifier training hard.

Here we investigate the impact of $\lambda$ on the classification performance. Recall that the embedding layer is a 300-dimensional vector, and the output range of its neural nodes is $[-1, 1]$. It forms a hypercube with side length 2, where the maximal distance between any two points is $dia = \sqrt{300 \times 2^2} = \sqrt{1200} \approx 35$. Hence, we vary $\lambda$ from 1 to 30. Figure 10 plots the accuracy curves of WDE-CNN and WDE-LSTM. We also show the best baseline performance achieved by CNN-rand for comparison. We find the performance drops quickly when $\lambda > 15$, and when $\lambda < 15$ we can easily find a value leading to good performance. Moreover, when $\lambda$ is set to a relatively high value ($> 0.5\,dia$), the network is more easily trapped in saturating regions [2] after long training. In this paper we set $\lambda = 5$.

Fig. 10. Impact of λ on classification performance (accuracy of WDE-CNN, WDE-LSTM and CNN-rand versus λ).

5 CONCLUSION
In this work we proposed a novel deep learning framework named Weakly-supervised Deep Embedding for review sentence sentiment classification. WDE trains deep neural networks by exploiting rating information of reviews, which is prevalently available on many merchant/review Websites. The training is a 2-step procedure: first we learn an embedding space which tries to capture the sentiment distribution of sentences by penalizing relative distances among sentences according to weak labels inferred from ratings; then a softmax classifier is added on top of the embedding layer and we fine-tune the network with labeled data. Experiments on reviews collected from Amazon.com show that WDE is effective and outperforms baseline methods.

Two specific instantiations of the framework, WDE-CNN and WDE-LSTM, are proposed. Compared to WDE-LSTM, WDE-CNN has fewer model parameters, and its computation is more easily parallelized on GPUs. Nevertheless, WDE-CNN cannot handle long-term dependencies in sentences well. WDE-LSTM is more capable of modeling the long-term dependencies in sentences, but it is less efficient than WDE-CNN and needs more training data. For future work, we plan to investigate how to combine different methods to generate better prediction performance. We will also try to apply WDE to other problems involving weak labels.

ACKNOWLEDGMENT
This research was supported by the National Natural Science Foundation of China (Grant Nos. 61672409, 61522206, 61373118), the Major Basic Research Project of Shaanxi Province (Grant No. 2017ZDJC-31) and the Science and Technology Plan Program in Shaanxi Province of China (Grant No. 2017KJXX-80). The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

REFERENCES
[1] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
[2] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE TPAMI, 35(8):1798–1828, 2013.
[3] C. M. Bishop. Pattern recognition and machine learning. Springer, 2006.
[4] L. Chen, J. Martineau, D. Cheng, and A. Sheth. Clustering for simultaneous extraction of aspects and features from reviews. In NAACL-HLT, pages 789–799, 2016.
[5] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. JMLR, 12:2493–2537, 2011.


[6] K. Dave, S. Lawrence, and D. M. Pennock. Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. In WWW, pages 519–528, 2003.
[7] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391, 1990.
[8] X. Ding, B. Liu, and P. S. Yu. A holistic lexicon-based approach to opinion mining. In WSDM, pages 231–240, 2008.
[9] L. Dong, F. Wei, C. Tan, D. Tang, M. Zhou, and K. Xu. Adaptive recursive neural network for target-dependent twitter sentiment classification. In ACL, pages 49–54, 2014.
[10] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121–2159, 2011.
[11] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. Liblinear: A library for large linear classification. JMLR, 9:1871–1874, 2008.
[12] R. Feldman. Techniques and applications for sentiment analysis. Communications of the ACM, 56(4):82–89, 2013.
[13] J. L. Fleiss. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378, 1971.
[37] L. Qu, R. Gemulla, and G. Weikum. A weakly supervised model for sentence-level semantic orientation analysis with multiple experts. In EMNLP-CoNLL, pages 149–159, 2012.
[38] Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, pages 101–110, 2014.
[39] R. Socher, B. Huval, C. D. Manning, and A. Y. Ng. Semantic compositionality through recursive matrix-vector spaces. In EMNLP-CoNLL, pages 1201–1211, 2012.
[40] R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning. Semi-supervised recursive autoencoders for predicting sentiment distributions. In EMNLP, pages 151–161, 2011.
[41] R. Socher, A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, volume 1631, page 1642, 2013.
[42] O. Täckström and R. McDonald. Semi-supervised latent variable models for sentence-level sentiment analysis. In ACL, pages 569–574, 2011.
[43] D. Tang, B. Qin, X. Feng, and T. Liu. Target-dependent sentiment classification with long short term memory. CoRR, abs/1512.01100, 2015.
[44] D. Tang, B. Qin, and T. Liu. Deep learning for sentiment analysis:
[14] X. Glorot, A. Bordes, and Y. Bengio. Domain adaptation for large-scale
successful approaches and future challenges. Wiley Interdisciplinary
sentiment classification: A deep learning approach. In ICML, pages 513–
Reviews: Data Mining and Knowledge Discovery, 5(6):292–303, 2015.
520, 2011.
[45] D. Tang, B. Qin, and T. Liu. Aspect level sentiment classification with
[15] A. Graves and J. Schmidhuber. Framewise phoneme classification deep memory network. arXiv preprint arXiv:1605.08900, 2016.
with bidirectional lstm and other neural network architectures. Neural [46] D. Tang, F. Wei, N. Yang, M. Zhou, T. Liu, and B. Qin. Learning
Networks, 18(5):602–610, 2005. sentiment-specific word embedding for twitter sentiment classification.
[16] K. Greff, R. K. Srivastava, J. Koutnı́k, B. R. Steunebrink, and J. Schmid- In ACL, volume 1, pages 1555–1565, 2014.
huber. Lstm: A search space odyssey. arXiv preprint arXiv:1503.04069, [47] P. D. Turney. Thumbs up or thumbs down?: semantic orientation applied
2015. to unsupervised classification of reviews. In ACL, pages 417–424, 2002.
[17] H. Halpin, V. Robu, and H. Shepherd. The complex dynamics of [48] S. Wang and C. D. Manning. Baselines and bigrams: Simple, good
collaborative tagging. In WWW, pages 211–220, 2007. sentiment and topic classification. In ACL, pages 90–94, 2012.
[18] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural [49] Y. Wang, M. Huang, X. Zhu, and L. Zhao. Attention-based lstm for
computation, 9(8):1735–1780, 1997. aspect-level sentiment classification. In EMNLP, pages 606–615, 2016.
[19] M. Hu and B. Liu. Mining and summarizing customer reviews. In [50] J. Weston, F. Ratle, and R. Collobert. Deep learning via semi-supervised
SIGKDD, pages 168–177, 2004. embedding. In ICML, pages 1168–1175, 2008.
[20] N. Kalchbrenner, E. Grefenstette, and P. Blunsom. A convolutional neural [51] J. Wieting, M. Bansal, K. Gimpel, and K. Livescu. Towards universal
network for modelling sentences. In ACL, 2014. paraphrastic sentence embeddings. arXiv preprint arXiv:1511.08198,
[21] Y. Kim. Convolutional neural networks for sentence classification. In 2015.
EMNLP, pages 1746–1751, 2014. [52] L. Zhang and B. Liu. Identifying noun product features that imply
[22] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, opinions. In ACL, pages 575–580, 2011.
and S. Fidler. Skip-thought vectors. In NIPS, pages 3294–3302, 2015. [53] X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks
[23] H. Lakkaraju, R. Socher, and C. Manning. Aspect specific sentiment for text classification. In NIPS, pages 649–657, 2015.
analysis using hierarchical deep learning. In NIPS Workshop on Deep [54] J. Zhu, H. Wang, M. Zhu, B. K. Tsou, and M. Ma. Aspect-based
Learning and Representation Learning, 2014. opinion polling from customer reviews. IEEE Transactions on Affective
[24] Q. V. Le and T. Mikolov. Distributed representations of sentences and Computing, 2(1):37–49, 2011.
documents. In ICML, volume 14, pages 1188–1196, 2014.
[25] B. Liu. Sentiment analysis and opinion mining. Morgan & Claypool
Publishers, 2012.
[26] B. Liu, M. Hu, and J. Cheng. Opinion observer: analyzing and comparing
opinions on the web. In WWW, pages 342–351, 2005.
Wei Zhao received the B.S., M.S. and Ph.D.
[27] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. degrees from Xidian University, Xi’an, China, in
Learning word vectors for sentiment analysis. In ACL, pages 142–150, 2002, 2005 and 2015, respectively. He is cur-
2011. rently an associated professor in the School
[28] L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. JMLR, of Computer Science and Technology at Xid-
9(Nov):2579–2605, 2008. ian University. His research direction is pattern
[29] J. McAuley, R. Pandey, and J. Leskovec. Inferring networks of substi- recognition and intelligent systems, with specific
tutable and complementary products. In SIGKDD, pages 785–794, 2015. interests in attributed graph mining and search,
[30] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of machine learning, signal processing and preci-
word representations in vector space. arXiv preprint arXiv:1301.3781, sion guiding technology.
2013.
[31] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed
representations of words and phrases and their compositionality. In NIPS,
pages 3111–3119, 2013.
[32] T. Mullen and N. Collier. Sentiment analysis using support vector
machines with diverse information sources. In EMNLP, volume 4, pages
412–418, 2004. Ziyu Guan received the B.S. and Ph.D. degrees
[33] T. H. Nguyen and K. Shirai. Phrasernn: Phrase recursive neural network in Computer Science from Zhejiang University,
for aspect-based sentiment analysis. In EMNLP, pages 2509–2514, 2015. China, in 2004 and 2010, respectively. He had
[34] B. Pang and L. Lee. Opinion mining and sentiment analysis. Foundations worked as a research scientist in the Univer-
and trends in information retrieval, 2(1-2):1–135, 2008. sity of California at Santa Barbara from 2010
to 2012. He is currently a full professor in the
[35] B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up?: sentiment classi-
School of Information and Technology of North-
fication using machine learning techniques. In EMNLP, pages 79–86,
west University, China. His research interests
2002.
include attributed graph mining and search, ma-
[36] X. Qiu and X. Huang. Convolutional neural tensor network architecture chine learning, expertise modeling and retrieval,
for community-based question answering. In IJCAI, pages 1305–1311, and recommender systems.
2015.


Long Chen received the B.S. degree in Electronic Information Engineering from Xi’an Jiaotong University City College, Xi’an, China, in 2012 and the M.S. degree in Electronics and Communications Engineering from Northwest University, Xi’an, China, in 2015. He is currently working toward the Ph.D. degree with the School of Information Science and Technology, Northwest University, Xi’an, China. His research interests include deep learning, sentiment analysis, text mining and natural language processing.

Xiaofei He received the B.S. degree in Computer Science from Zhejiang University, China, in 2000 and the Ph.D. degree in Computer Science from the University of Chicago, in 2005. He is a Professor in the State Key Lab of CAD&CG at Zhejiang University, China. Prior to joining Zhejiang University, he was a Research Scientist at Yahoo! Research Labs, Burbank, CA. His research interests include machine learning, information retrieval, and computer vision.

Deng Cai is an Associate Professor in the State Key Lab of CAD&CG, College of Computer Science at Zhejiang University, China. He received the Ph.D. degree in computer science from the University of Illinois at Urbana-Champaign in 2009. Before that, he received his B.S. and M.S. degrees from Tsinghua University in 2000 and 2003, respectively, both in automation. His research interests include machine learning, data mining and information retrieval. He is a member of the IEEE.

Beidou Wang received his B.S. degree from Zhejiang University, China, in 2011. He is currently in the dual Ph.D. program of Zhejiang University, China and Simon Fraser University, Canada. His research interests include social network mining, active learning, cross domain recommendation, recommendation explanation and recommendation in social networks.

Quan Wang was born in 1970. He received the B.S., M.S. and Ph.D. degrees in Computer Science and Technology from Xidian University, Xi’an, China. He is now a full professor at Xidian University. His current research interests include input and output technologies and systems, image processing and image understanding.
