Sentiment Analysis Based On Weighted Word2vec and Att-LSTM
Huanhuan Yuan
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing
8615720622990
15720622990@163.com

Yongli Wang, Xia Feng
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing
8618936032016, 8618761878717
yongliwang@njust.edu.cn, 779477284@qq.com

Shurong Sun
Zhenjiang Analysis InfoTech Ltd, Zhenjiang
8618851026397
779477284@qq.com
phenomenon. Distributed representation solves the problem of one-hot representation by mapping each word to a shorter word vector during the training process. However, the dimension of the short word vector generally needs to be specified manually.

Before the appearance of word2vec, neural networks had already been used to train word vectors. Bengio [10] used a three-layer neural network to construct a language model. Training this model is very time-consuming: the vocabulary generally contains more than one million words, which means that the number of output probabilities to compute for each word is very large. Mnih [11] proposed a Log-Bilinear model to train the language model. To optimize the Log-Bilinear model, Mikolov [12] provided a log-bilinear model that removes the hidden layer of the neural network, uses only the linear representation capability, and computes real-valued word vectors.

Traditional text representation methods have the disadvantages of high dimensionality, sparseness, and lack of semantic information. This paper uses the word2vec model to represent vectors. The word2vec method trains an N-gram language model through a neural network, and generates the vector corresponding to each word in the training process. The neural network language model can map text to a low-dimensional vector, and the word vectors obtained by training carry semantic information. The word2vec word vector captures the semantic information of the context, but the word2vec model cannot distinguish the importance of the words in the text. This paper therefore uses the TFIDF algorithm to weight the word2vec model.

2.2 Sentiment Analysis Based on Machine Learning
At present, the main research methods of sentiment analysis are divided into two categories. One is based on emotional dictionaries and rules, and the other, more commonly used, is based on machine learning. Turney [13] used PMI to expand the reference dictionary to address the deficiencies of the emotional dictionary; Yang [14] extracted and analyzed relevant features of emotional words, then used SVM to identify and classify sentences. Pang [15] used the n-gram model and the SVM model to classify emotions, and selected unigrams as features to obtain the best classification results. However, in machine-learning-based sentiment analysis, the main work is to design features manually. This work is largely ad hoc and takes a long time. Deep neural networks can automatically learn the deep features of text; recurrent neural network models in particular are well suited to feature learning on sequential data such as text.

Based on previous research, this paper addresses the vanishing gradient problem in text feature learning. Through the three kinds of gates of the LSTM model, the long-term dependence problem in training the RNN model is solved. At the same time, the Attention mechanism is added to the LSTM model to obtain a semantic code containing the attention probability distribution over the nodes of the input sequence. It is used as the input of the classifier to reduce information loss and information redundancy in the feature vector extraction process. This can effectively highlight the role of keywords.

3. SENTIMENT ANALYSIS MODEL
3.1 Model Framework
The model for implementing sentiment analysis in this paper is shown in the following figure. First, we input the preprocessed emotion text to word2vec. Word2vec constructs a vector for each word, and each word vector is assigned a weight by the TFIDF algorithm. Then the LSTM network with the Attention mechanism is used to train feature vectors, so that the output feature vectors contain word semantic features and word sequence features. Finally, a softmax regression classifier is used in the output layer to predict the emotional orientation of the text.

Figure 1. Framework of the sentiment analysis model

3.2 Weighted Word2vec
The word2vec model is trained on each sentence in the data set, sliding a fixed window over the sentence. It predicts the vector of the word in the middle of the window from the context of the sentence, and then trains the model according to a loss function and an optimization method.

The word2vec model captures the semantic information of the context, but it cannot distinguish the importance of the words in the text. The TFIDF algorithm is one of the important algorithms for calculating the weight of feature items. Therefore, this paper proposes a weighted word2vec model based on the TFIDF algorithm: building on the word2vec model, the review document vector is represented as a TFIDF-weighted combination of word vectors.

Algorithm 1 Weighted W2V
Input: word vector w
Output: weighted word vector x
Step:
While (Text is not empty) do
  Compute the tf of word t in a single document: $tf_{t,d_i} = n_{t,d_i} / \sum_{k} n_{k,d_i}$,
  Compute the idf of word t over the whole document set: $idf_t = \log(M / m_t)$.
  Compute the weight of the word: $K_{t,d_i} = tf_{t,d_i} \cdot idf_t / \sqrt{\sum_{t \in d_i} (tf_{t,d_i} \cdot idf_t)^2}$,
  Multiply each word vector $w_t$ by its weight and sum to form the weighted comment vector x: $x = \sum_{t \in d_i} K_{t,d_i} w_t$
End while

In the above algorithm, a set D includes M comment texts. The text $d_i$ ($i = 1, 2, \ldots, M$) has been segmented into words, and is trained by the word2vec model to obtain the N-dimensional
word vector $w = (w_1, w_2, \ldots, w_N)$ corresponding to each word. This paper uses the TFIDF algorithm to calculate the weight value $K_{t,d_i}$ of every word in the text, which expresses the weight of the word t in the text $d_i$ ($i = 1, 2, \ldots, M$). TFIDF jointly considers the frequency of a word in a single text and the weight of the word in the entire text set.

$n_{t,d_i}$ denotes the number of occurrences of the word t in the text $d_i$ ($i = 1, 2, \ldots, M$), and $\sum_{k} n_{k,d_i}$ denotes the total number of occurrences of all words in the text $d_i$. M is the total number of training texts, and $m_t$ is the number of texts in the training text set in which the word t appears. $tf_{t,d_i}$ is the word frequency of the word t in the text $d_i$, and $w_t$ is the word vector of the word t.

3.3 LSTM Based on Attention Mechanism
The LSTM architecture consists of a memory cell c and three gates: the input gate i, the output gate o and the forget gate f. The outputs of the three gates are each connected to a multiplication unit to control the network input, the output, and the state of the cell unit. Formally, the LSTM can be expressed as:

$i_t = \sigma(W_i x_t + U_i h_{t-1} + V_i c_{t-1} + b_i)$
$f_t = \sigma(W_f x_t + U_f h_{t-1} + V_f c_{t-1} + b_f)$
$c_t = f_t \cdot c_{t-1} + i_t \cdot \tanh(W_c x_t + U_c h_{t-1} + b_c)$
$o_t = \sigma(W_o x_t + U_o h_{t-1} + V_o c_t + b_o)$
$h_t = o_t \cdot \tanh(c_t)$    (1)

$\sigma$ is the sigmoid activation function; W, U, V and b represent the coefficient matrices and the offset vectors; $i_t$, $f_t$ and $o_t$ are the three gates at time t; $c_t$ is the memory cell at time t, and $h_t$ is the output of the LSTM cell at time t.

Based on the idea of the Attention Model, this paper designs an Attention-Based LSTM model tailored to the characteristics of sentiment analysis. This model designs a new method of attention probability calculation, and uses it to generate a semantic code containing the attention probability distribution and, at the same time, the final feature vector.

The Attention-based LSTM (Att-LSTM) model preserves the intermediate outputs of the LSTM encoder for the input sequence, then trains the model to selectively learn these inputs and to associate the output sequence with them when the model produces its output. Figure 2 shows the architecture of the Att-LSTM model. The attention mechanism calculates the weight of each historical node's influence on the current node, and forms the attention probability distribution. The Attention mechanism removes the restriction that the traditional encoder-decoder structure depends on a fixed-length vector.

The input sequence of the text is $X = (x_1, x_2, \ldots, x_T)$. The average of the input vectors of the history nodes is taken as the overall vector of the text, $x_k$, which is the last input of the encoding stage. $h_1, h_2, \ldots, h_T$ are the hidden layer state values corresponding to the input sequence $x_1, x_2, \ldots, x_T$. $h_k$ is the hidden layer state value of the input $x_k$. $a_{ki}$ in the figure is the attention probability of a historical node with respect to the last node. The influence weight of each element of the input sequence $x_1, x_2, \ldots, x_T$ on the text as a whole can thus be calculated, which highlights the role of keywords and reduces the influence of non-keywords on the overall semantics of the text. The Att-LSTM model has two calculation steps:

Step 1: Calculate the attention probability distribution.

$a_{ki} = \exp(e_{ki}) / \sum_{j=1}^{T} \exp(e_{kj})$,  $e_{ki} = V \tanh(W h_k + U h_i)$    (2)

$a_{ki}$ represents the attention probability weight of node i for node k. T is the number of elements of the input sequence. V, W, U are weight matrices, and $h_k$ is the hidden layer state corresponding to the last input. $h_i$ represents the state value of the hidden layer corresponding to the i-th element of the input sequence.

Step 2: Calculate the semantic encoding of the attention probability distribution and the feature vector.

$C = \sum_{i=1}^{T} a_{ki} h_i$,  $H'_k = H(C, h_k, x_k)$    (3)

The semantic code C is obtained by accumulating the products of the attention probability weights and the hidden layer states of the historical input nodes. We use the semantic code and the overall text vector as input of the LSTM module, and the hidden state value $H'_k$ of the last node is the final feature vector. $H'_k$ contains the weight information of the historical input nodes, and highlights the semantic information of the key nodes.

Finally, the softmax layer transforms the feature vector into a conditional probability distribution. $W_s$ and $b_s$ are parameters of the softmax layer.

$y = \mathrm{softmax}(W_s H'_k + b_s)$    (4)
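As a rough illustration of the two attention steps in Section 3.3, the sketch below turns a list of hidden states into attention weights and a semantic code C. It is not the paper's implementation: the scoring function is assumed to be a plain dot product instead of the learned one, the later LSTM step that produces the final feature vector is left out, and the names `attention_context` and `dot_score` as well as the toy values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_score(h_k, h_i):
    # Assumption: a plain dot product replaces the learned scoring function.
    return sum(p * q for p, q in zip(h_k, h_i))


def attention_context(hidden_states, h_k, score=dot_score):
    """Return the attention weights a_ki and the semantic code C."""
    # Step 1: normalise the scores into an attention probability distribution.
    a = softmax([score(h_k, h_i) for h_i in hidden_states])
    # Step 2: C is the attention-weighted sum of the hidden states.
    dim = len(hidden_states[0])
    code = [sum(a_i * h_i[d] for a_i, h_i in zip(a, hidden_states))
            for d in range(dim)]
    return a, code


# Toy hidden states for a three-word text (hypothetical values).
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_k = [1.0, 1.0]  # hidden state of the overall text vector, also hypothetical
weights, context = attention_context(hidden, h_k)
```

In this toy case the third hidden state is the most similar to `h_k`, so it receives the largest weight and dominates the semantic code.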
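The weighted word2vec idea of Section 3.2 (Algorithm 1) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' code: idf is taken as log(M / m_t), the weights are assumed to be L2-normalised within each document, the sum runs over the distinct words of the document, and the word vectors are toy values rather than trained word2vec output. The function names `tfidf_weights` and `weighted_doc_vector` are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(doc, corpus):
    """Weight K per distinct word of `doc`; `doc` must be one of the corpus texts."""
    counts = Counter(doc)
    total = len(doc)           # total number of word occurrences in the text
    M = len(corpus)            # total number of texts
    raw = {}
    for t, n in counts.items():
        m_t = sum(1 for d in corpus if t in d)   # texts containing word t
        raw[t] = (n / total) * math.log(M / m_t)  # tf * idf
    # Assumption: L2 normalisation of the weights within the document.
    norm = math.sqrt(sum(v * v for v in raw.values()))
    if norm == 0.0:
        return {t: 0.0 for t in raw}
    return {t: v / norm for t, v in raw.items()}


def weighted_doc_vector(doc, corpus, vectors):
    """Sum of word vectors, each scaled by its tf-idf weight."""
    weights = tfidf_weights(doc, corpus)
    dim = len(next(iter(vectors.values())))
    x = [0.0] * dim
    for t, k in weights.items():
        for j in range(dim):
            x[j] += k * vectors[t][j]
    return x


# Toy segmented reviews and 2-dimensional stand-ins for word2vec vectors.
corpus = [["good", "hotel"], ["bad", "hotel"], ["good", "good", "room"]]
vectors = {"good": [1.0, 0.0], "hotel": [0.0, 1.0],
           "bad": [-1.0, 0.0], "room": [0.5, 0.5]}
x = weighted_doc_vector(corpus[0], corpus, vectors)
```

A word that occurs in every text gets idf 0 and therefore contributes nothing to the document vector, which is the behaviour the weighting is meant to provide for uninformative words.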
weights, and the learning rate does not change during the training process. The Adam algorithm designs an independent adaptive learning rate for different parameters by calculating the first and second moment estimates of the gradient.

Algorithm 2 Adam Algorithm
Input: initial parameters $\theta$
Parameter:
  Step length $\epsilon$.
  Exponential decay rates of moment estimation $\rho_1$, $\rho_2$.
  Small constant for numerical stability $\delta$
Step:
Initialize first-order and second-order moment variables $s = 0$, $r = 0$
Initialize time step $t = 0$
While (Failed to meet stop condition) do
  Take a small batch containing m samples $\{x^{(1)}, \cdots, x^{(m)}\}$ from the training set, and the corresponding target is $y^{(i)}$
  Calculate the gradient: $g \leftarrow \frac{1}{m} \nabla_{\theta} \sum_{i} L(f(x^{(i)}; \theta), y^{(i)})$
  Update time step: $t \leftarrow t + 1$
  Update first moment estimation: $s \leftarrow \rho_1 s + (1 - \rho_1) g$
  Update second moment estimation: $r \leftarrow \rho_2 r + (1 - \rho_2) g \odot g$
  Correct the deviation of the first moment: $\hat{s} \leftarrow s / (1 - \rho_1^t)$
  Correct the deviation of the second moment: $\hat{r} \leftarrow r / (1 - \rho_2^t)$
  Calculate update value: $\Delta\theta = -\epsilon \, \hat{s} / (\sqrt{\hat{r}} + \delta)$
  Apply update value: $\theta \leftarrow \theta + \Delta\theta$
End while

4. EXPERIMENT
4.1 Data Set
The data set of this paper includes two parts: an English data set and a Chinese data set. The English data set is the IMDB film review set, which contains 25,000 movie reviews, 12,500 positive and 12,500 negative. The Chinese data set is selected from the hotel review corpus ChnSentiCorp, a corpus of hotel reviews collected by Dr. Tan of the Chinese Academy of Sciences. This paper uses the ChnSentiCorp-Htl-ba-6000 data, with 3,000 positive and 3,000 negative texts, to conduct experiments. Unlike English text, Chinese text needs word segmentation.

4.2 Indicator
Precision and recall are two indicators used to evaluate the outcome of the classification.

$precision = TP / (TP + FP)$,  $recall = TP / (TP + FN)$    (5)

TP indicates the number of texts that are actually positive and predicted to be positive; FN indicates the number of texts that are actually positive but predicted to be negative; FP indicates the number of texts that are actually negative but predicted to be positive; TN indicates the number of texts that are actually negative and predicted to be negative. Precision measures the accuracy of the classifier's positive predictions. Recall measures whether the classifier can find all the positive samples. Both indicators are indispensable and should be considered together, so the F1 measure is used to balance them.

$F1 = 2 \cdot precision \cdot recall / (precision + recall)$    (6)

4.3 Experiment
4.3.1 Comparison to Other Methods
We compare with several traditional methods: the word2vec-based SVM method, the word2vec-based LSTM method, the weighted word2vec-based LSTM method, and the Attention mechanism-based LSTM method.
W2V-SVM model: The word2vec model is used to form word vectors, and the classifier is trained with the SVM model.
W2V-LSTM model: After the text is converted into vectors by word2vec, the classifier is trained with the LSTM model.
Weighted W2V-LSTM model: The TFIDF algorithm and word2vec are combined to convert the text into vectors, which are classified by a trained LSTM.
W2V-Att-LSTM model: After the text is converted into vectors by word2vec, the LSTM model based on the Attention mechanism is used for training.

4.3.2 Parameter Settings
1. Experimental Environment Configuration
(1) Software Environment
Language: Python; Platform: Google TensorFlow Deep Learning Framework.
(2) Hardware environment
Operating platform: Win10; CPU: Intel dual-core 4.0GHz; Memory: 12G; Hard disk: 1T.
2. LSTM hyper-parameter definition
The batch size is 24, the number of units of the LSTM is 64, the number of classification categories is 2, the number of training iterations is 100000, the optimizer is the commonly used Adam, and the learning rate is set to 0.001 by default.

4.3.3 Experimental Results
The performance results of the following comparison experiments show that the method proposed in this paper achieves better accuracy, recall, and F1 measures than the other models. From Figure 3 and Figure 4, it is not difficult to find that no matter which model is used on an English data set or on a Chinese data set, the effect of our method is much better. The main reason is that Chinese needs word segmentation in the preprocessing stage. Word segmentation always produces some minor errors for various reasons. The semantic bias caused by word segmentation reduces the performance of the model. Among the comparison algorithms, the performance of the weighted W2V-LSTM is not as good as that of the W2V-Att-LSTM, indicating that for text sentiment analysis, the TFIDF algorithm does not perform as well as the Attention mechanism when calculating the importance of words.
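For readers who want to reproduce the precision, recall and F1 indicators of Section 4.2 from raw predictions, the following minimal sketch counts TP, FP and FN for the positive class and applies the standard definitions. The function name `precision_recall_f1` and the toy labels are hypothetical; returning 0.0 when a denominator is zero is a convention chosen here, not something the paper specifies.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Convention (assumption): report 0.0 when a ratio is undefined.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels: 1 = positive review, 0 = negative review.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Here two of the three predicted positives are correct and two of the three actual positives are found, so all three indicators equal 2/3.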
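The Adam update of Algorithm 2 can be written out as a short, framework-free sketch. The experiments themselves use TensorFlow, so this is only an illustration of the update rule: the decay rates 0.9 and 0.999 and the stability constant 1e-8 are commonly used defaults assumed here (the paper states only the learning rate of 0.001), and the quadratic objective, the larger demo step length and the name `adam_step` are hypothetical.

```python
import math


def adam_step(theta, grad, s, r, t, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam update on a list of parameters; returns the new state."""
    t += 1
    # Exponential moving averages of the gradient and the squared gradient.
    s = [rho1 * s_j + (1 - rho1) * g for s_j, g in zip(s, grad)]
    r = [rho2 * r_j + (1 - rho2) * g * g for r_j, g in zip(r, grad)]
    # Bias correction for the zero initialisation of s and r.
    s_hat = [s_j / (1 - rho1 ** t) for s_j in s]
    r_hat = [r_j / (1 - rho2 ** t) for r_j in r]
    theta = [th - lr * sh / (math.sqrt(rh) + delta)
             for th, sh, rh in zip(theta, s_hat, r_hat)]
    return theta, s, r, t


# Demo: minimise f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta, s, r, t = [0.0], [0.0], [0.0], 0
for _ in range(500):
    grad = [2.0 * (theta[0] - 3.0)]
    theta, s, r, t = adam_step(theta, grad, s, r, t, lr=0.1)
```

Because of the bias correction, the very first step has a magnitude close to the step length regardless of how large the gradient is, which is one reason the method is fairly insensitive to gradient scale.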
Figure 3. Performance results of each model on the English dataset

Figure 4. Performance results of each model on the Chinese dataset

5. CONCLUSION
In this paper, we propose a new text sentiment analysis method. After the text is encoded into word vectors by word2vec, a weight matrix computed with the TFIDF algorithm is applied to form the LSTM input. To achieve emotion classification, text-related features are obtained by the LSTM, and the Attention mechanism is then combined to obtain the feature vectors. The experimental results show that the proposed method is feasible and effective, and that it can better identify the emotional orientation of textual information. At present, more and more review texts mix English and Chinese. The next step is to explore sentiment analysis of mixed Chinese and English texts.

6. ACKNOWLEDGMENT
This work was supported by the National Natural Science Foundation of China (61170035, 61272420, 81674099), the Fundamental Research Fund for the Central Universities (30916011328, 30918015103), and the Nanjing Science and Technology Development Plan Project (201805036).

7. REFERENCES
[1] Sugimoto F, Yoneyama M. A method for classifying emotion of text based on emotional dictionaries for emotional reading[C]//IASTED International Conference on Artificial Intelligence and Applications. ACTA Press, 2006: 91-96.
[2] Tang H F, Tan S B, Cheng X Q. Research on Sentiment Classification of Chinese Reviews Based on Supervised Machine Learning Techniques[J]. Journal of Chinese Information Processing, 2007, 21(6): 88-126.
[3] Pak A, Paroubek P. Twitter as a Corpus for Sentiment Analysis and Opinion Mining[C]//LREC. 2010, 10(2010).
[4] Baharudin B. Sentence based sentiment classification from online customer reviews[C]//Proceedings of the 8th International Conference on Frontiers of Information Technology. ACM, 2010: 25.
[5] Mnih V, Heess N, Graves A, et al. Recurrent models of visual attention[J]. 2014, 3: 2204-2212.
[6] Bahdanau D, Cho K, Bengio Y. Neural Machine Translation by Jointly Learning to Align and Translate[J]. Computer Science, 2014.
[7] Rocktäschel T, Grefenstette E, Hermann K M, et al. Reasoning about Entailment with Neural Attention[J]. 2015.
[8] Rush A M, Chopra S, Weston J. A Neural Attention Model for Abstractive Sentence Summarization[J]. Computer Science, 2015.
[9] Hermann K M, Kočiský T, Grefenstette E, et al. Teaching machines to read and comprehend[J]. 2015: 1693-1701.
[10] Bengio Y, Ducharme R, Vincent P. A neural probabilistic language model[J]. Journal of Machine Learning Research, 2003, 3: 1137-1155.
[11] Mnih A, Hinton G E. A scalable hierarchical distributed language model[C]//Proc of the NIPS. 2009: 1081-1088.
[12] Mikolov T, Chen K, Corrado G, et al. Efficient Estimation of Word Representations in Vector Space[J]. Computer Science, 2013.
[13] Turney P D, Littman M L. Measuring Praise and Criticism: Inference of Semantic Orientation from Association[J]. ACM Transactions on Information Systems (TOIS), 2003, 21(04): 315-346.
[14] Yang J, Lin S P. Emotion Analysis on Text Words and Sentences based on SVM[J]. Computer Applications and Software, 2011, 28(09): 225-228.
[15] Pang B, Lee L, Vaithyanathan S. Thumbs up?: Sentiment Classification Using Machine Learning Techniques[C]//Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2002: 79-86.