
Information Sciences 571 (2021) 459–474


Learning sentiment sentence representation with multiview attention model

You Zhang, Jin Wang *, Xuejie Zhang
School of Information Science and Engineering, Yunnan University, Kunming, China
* Corresponding author. E-mail addresses: yzhang@mail.ynu.edu.cn (Y. Zhang), wangjin@ynu.edu.cn (J. Wang), xjzhang@ynu.edu.cn (X. Zhang).

Article history: Received 21 September 2020; Received in revised form 14 May 2021; Accepted 19 May 2021; Available online 26 May 2021

Keywords: Sentiment analysis; Text classification; Multiview attention; Sentence representation

Abstract
Self-attention mechanisms in deep neural networks, such as CNN, GRU and LSTM, have been proven to be effective for sentiment analysis. However, existing attention models tend to focus on individual tokens or aspect meanings in an expression. If a text contains information on multiple sentiments from different perspectives, the existing models will fail to extract the most critical and comprehensive features of the whole text. In the present study, a multiview attention model was proposed for learning sentence representation. Instead of using a single attention, multiple view vectors were used to map the attentions from different perspectives. Then, a fusion gate was adopted to combine these multiview attentions to draw a conclusion. To ensure the differences between multiview attentions, a regularization item was introduced to add a penalty to the loss function. In addition, the proposed model can be extended to other text tasks, such as questions and topics, to provide a comprehensive representation for the classification. Comparative experiments were conducted on both multiclass and multilabel classification datasets. The results revealed that the proposed method improves the performance of several previously proposed attention models.

© 2021 Elsevier Inc. All rights reserved.

1. Introduction

Sentiment analysis aims to computationally study people’s opinions, sentiments and emotions via texts or reviews of
entities. This type of analysis can be used to detect the sentiment polarity (e.g., positive or negative opinion) within the text,
regardless of whether it is a whole document, paragraph, sentence, or clause. Understanding people’s emotions is essential
for businesses because customers can express their thoughts and feelings more openly than ever before. By automatically
analyzing customer feedback, from survey responses to social media conversations, brands can attentively listen to their cus-
tomers, and tailor products and services to meet the customers’ needs [1–3].
The traditional approach for sentiment analysis mainly focuses on machine learning algorithms and feature engineering.
The features, such as n-gram, bag-of-words (BoW), term frequency-inverse document frequency (TF-IDF) and part-of-speech
(POS), are first extracted from the text. Then, classification algorithms, such as support vector machine (SVM) and logistic
regression, are applied to these features for polarity classification of the sentiment.
Recent studies have focused on convolutional neural networks (CNNs) [4], gated recurrent units (GRUs) [5], and long
short-term memory (LSTM) [6]. A CNN extracts local n-gram features, and a GRU or LSTM can capture long-range dependencies.
However, all tokens are treated equally in these models. To address this issue, self-attention [7] was proposed for text
classification, which could extract more relevant features from the input data to provide a representation that has meaning-
ful and task-driven information. The attention mechanism allows a deep neural network (DNN) model to learn which fea-
tures should be highlighted or ignored. By reweighting the input representations based on their contributions to the final
classification, self-attention can boost the performance of several natural language processing (NLP) tasks.
In sentiment analysis, a subjective sentence often carries different sentiment information from different perspectives.
However, existing attention-based models typically compute weights that are related to only a certain target word or aspect
meaning. For example, the passage in Fig. 1(b) consists of three positive opinions in the three different aspects of salad, steak
and wine. In addition, there is a negative summary for the taste and experience: it is a bit too spicy and expensive. As indicated
in Fig. 1(a), the existing attention-based models heavily emphasize the word excellent, which is not what the author wants to
express.
In the present study, we proposed a multiview attention model for sentiment analysis to extract comprehensive senti-
ment features from different views without extra inputs. Instead of using a single attention in existing models, a set of view
vectors are first initialized and subsequently mapped from different views to ensure that different attentions can be learned.
For example, in the example text mentioned above, it would be useful for the model to emphasize delicious, good, excellent,
spicy and expensive, which contain different sentiment information via attention from different perspectives. Then, the over-
all polarity of the sentence that the author wants to express can be determined by using a fusion strategy to integrate the
information from different views, as shown in Fig. 1(c). To ensure the diversity between multiple views, we used a regular-
ization term of penalty to ensure that the learned attentions are different from one another. In addition, other text classifi-
cation tasks, such as question and topic classification, can also benefit from the proposed model, since it provides a
comprehensive representation for the classification. Compared with other aspect-based attention models, the proposed
model does not require any extra target-guided information. Although a comprehensive representation of a sentence can be generated, the proposed method may be limited in addressing aspect-based classification tasks, since the specific attribute information required in a certain view could impose a strong constraint that misguides the attention
distributions.
Empirical experiments were conducted on several datasets of both multiclass and multilabel classifications to compare
the proposed model against several existing attention-based methods. The results revealed that the multiview attention
model can extract meaningful and comprehensive features from different perspectives, thereby outperforming other meth-
ods with competitive performance on all datasets.
The main contributions can be summarized as follows:

• A multiview attention-based model was proposed to learn sentence representation for text classification.
• A regularization term was introduced to impose a penalty on the loss function to ensure that the learned view attentions were different from each other.
• To obtain a comprehensive representation, a fusion gate was adopted to integrate the attention-guided representations of consistent and complementary information.

The remaining sections of the paper are organized as follows. Section 2 briefly reviews the previously proposed methods
for text classification based on both the pretrained language model and attention mechanism. Section 3 describes the pro-
posed multiview attention model. Comparative experiments on several datasets were conducted and are summarized in Sec-
tion 4. The conclusions are drawn in Section 5.

2. Related work

In this section, we briefly review the algorithms for learning sentiment representation and the attention mechanism used
to extract the task-specific features to enhance the performance of text classification tasks.

2.1. Sentiment representation learning

Sentiment analysis in NLP focuses mainly on determining the subjective attitude (i.e., positive, negative, or neutral) of the
writer from texts. It can be further applied for other downstream applications, such as irony detection [8] and aspect-based
sentiment analysis [9]. Based on distributed word vectors, deep neural networks, such as CNNs [4], GRUs [5] and LSTMs [6],
can be built to learn sentiment representations of texts for polarity classification. A CNN uses convolution operators to
extract the local n-gram features, while an LSTM can handle sequential and variable-length texts and alleviate the vanishing
gradient issue that arises in standard recurrent neural networks (RNNs). By obtaining all of the local features, the long-range
dependency and any other syntactic information, structured models, such as CNN-LSTM [10], bidirectional LSTM (BiLSTM)
[11], Tree-LSTM [6], linguistic regularization-based LSTM (LR-LSTM) [12], and graph-based models [13], have been proposed
to learn sentiment representations with structured information for polarity classification. Other extra sentiment knowledge,
e.g., sentiment lexicons [14,15] or specific attributes [9,16], such as aspect, topic, user, and product, can be incorporated into
these classifiers to further locate aspect-based and personalized sentiment information, thus improving the performance.

Fig. 1. Illustrative examples of the self-attention, multiview attention, and multiview attention fusion models for sentiment analysis. Each square presents an attention weight (between 0 and 1) that is assigned to a corresponding word. A square with a deeper color indicates a higher weight assigned to the corresponding word for the sentence representation.

Fig. 2. System architecture of the proposed multiview attention model.


2.2. Pretrained language model

Word embedding is a technique that maps words into a space using continuous value vectors in such a way that seman-
tically similar words appear closer to each other. The word2vec toolkit [17] applies continuous bag-of-words (CBOW) or
skip-grams trained on a large number of unlabeled text corpora to obtain a fixed-dimension word representation with rel-
evant semantic information between words. Pennington et al. [18] proposed GloVe, which can be trained on the aggregated
global word-word co-occurrence statistics from a corpus. The resulting representations exhibited interesting linear substruc-
tures of the word vector space. Both GloVe and word2vec are widely used in many NLP tasks, such as sentiment analysis
[19,20], language understanding [21], and machine translation [5].
Although distributed word representation can achieve good performance, it fails to capture contextual information in dif-
ferent contexts. More recently, contextualized word embeddings, such as CoVe [22] and ELMo [23], have achieved a signif-
icant improvement on multiple NLP tasks. These word embeddings are often employed as extra features for specific tasks.
Bidirectional encoder representations from transformers (BERT) [24], as one of the pretrained language models (including
GPT [25] and ULMFiT [26]), outperforms its predecessors and achieves state-of-the-art performance on a wide range of natural language understanding (NLU) tasks. These approaches typically consist of two steps. First, several specific unsupervised subtasks are designed to learn a language model from a large number of natural language texts in an unsupervised manner. Then, the trained language model is transferred to the target-specific downstream tasks with the supervision of the labeled
dataset. Furthermore, pretrained models can also transfer the semantic and syntactic information from the corpus to the
downstream tasks by fine-tuning the models rather than randomly initializing the parameters of the present models. Devlin
et al. [24] pretrained and fine-tuned BERT by feeding the representation of the special token ([CLS]) into a fully connected
layer with a softmax function for the sentiment analysis task. Furthermore, Sun et al. [27] investigated different studies
based on BERT for text classification tasks and performed a detailed analysis of BERT. The results revealed that a higher layer
of BERT is more useful for text classification.

2.3. Attention mechanism

To achieve better performance, an attention mechanism is applied over other neural layers, which reassigns different
attention for different time steps, according to its contribution to the final classification. The main idea of the attention
mechanism is to automatically adjust neural networks to focus on more relevant parts compared to others.
Attention models were first introduced in machine translation, in which each predicted word in the decoder was
calculated from its relevant parts in the encoder, for more precise word forecasting [28]. The attention mechanism was also
properly applied in the field of text classification. Cheng et al. [29] incorporated a memory network into traditional LSTM
architectures, in which each time step would calculate its correlations with previous time steps to enrich its message for
language understanding. Yang et al. [30] proposed a hierarchical attention structure that applied a self-attention mechanism
to hierarchically integrate sentence representations from words and document representations from sentences. To obtain
interpretable sentence embeddings, a structural attentive network [7] was designed for locating the latent information by
introducing self-attention to generate matrix attention. In particular, the sentiment-aware attention network (SAAN) [31]
and linguistic-aware attention network (LAAN) [32] were proposed to learn sentence representation through the attention
mechanism for sentiment and linguistics, respectively.
Similar to self-attention, capsule networks (CapsNet) apply a dynamic routing algorithm to automatically learn a more
robust representation from low-level features. CapsNet was first introduced to improve the representations of the CNN
for computer vision. Recently, several types of CapsNets have been proposed for NLP tasks. Sabour et al. [33] proposed a
dynamic routing algorithm that can dynamically decide what and how much information must be transferred from each
word to the whole sentence representation. Gong et al. [34] proposed a capsule network with routing-by-agreement for text
classification and reported that capsule networks have potential for transfer learning and zero-shot learning. Yang et al. [35]
proposed three strategies to boost the performance of the dynamic routing process and alleviate the disturbance of noisy
capsules for text classification. Du et al. [36] proposed a routing-on-hyperplane algorithm, which improves the classification performance by selecting proper capsules to transfer to high-level capsules.
The attention mechanism and capsule network can extract task-related features and adjust the NN models to focus on
relevant tokens and ignore others. However, these models tend to emphasize only a certain target word or aspect meaning.
To obtain comprehensive information, a multiview graph attention network [37] can initialize a group of single view models
and then integrate the internal relationships between these single views. Based on this approach, Yuan et al. [38] proposed a
multiview attention network (MuVAN) to learn fine-grained representations from multivariate temporal data. Deng et al.
[39] took two types of question answering tasks as different views and proposed a multiview attention approach that com-
bines both of them to benefit each other.

3. Multiview attention model

In this section, the proposed multiview attention fusion model is described in detail. The proposed model is mainly com-
posed of three parts: sequential encoder, multiview attention and fusion gate (see Fig. 2). For sequential encoders, both LSTM
and BERT were introduced to generate contextual word representations. Then, the proposed multiview attention process was
performed on the hidden states of the LSTM layer. Compared with the aforementioned multiview models, the proposed
model introduces bilinear attention to learn comprehensive information from different view features. To combine the fea-
tures of different views, a fusion gate is used to compose the information from multiple views into the final sentence
representation.

3.1. Sequential encoder

The sequential encoder aims to generate the corresponding sequence of hidden representations $\mathbf{h} = [h_1, h_2, \ldots, h_n]$, where $h_i \in \mathbb{R}^{d_h}$ and $n$ is the length of the sentence. There are three different ways to learn such a representation. One intuitive way is to directly use the sequence of word vectors, i.e., $\mathbf{x} = [x_1, x_2, \ldots, x_n]$ from GloVe [18] or word2vec [17], where $x_i \in \mathbb{R}^{d_x}$, as the hidden representations $\mathbf{h}$. Such a method is of high efficiency and low consumption but could lose long-term dependency. The second manner is to stack a neural layer, such as an LSTM, over the word embeddings $\mathbf{x}$ to obtain $\mathbf{h}$. The last method is to apply pretrained language models, e.g., ELMo [23] and BERT [24], to replace the static word embedding and directly learn the contextual word representation as $\mathbf{h}$.

3.2. Multiview attention

To learn multiview attention, we first randomly initialized $m$ view vectors $\mathbf{u} = [u_1, u_2, \ldots, u_m]$ with a uniform distribution $U(-0.2, 0.2)$, where $u_j \in \mathbb{R}^{d_h}$, $j \in \{1, \ldots, m\}$. These view vectors were subsequently trained and updated iteratively. The view vectors guide the model to attend to the most task-relevant tokens. Given $h_i$ and $u_j$, the model computes the attention vector $r_i^j$, beginning with a tensor operator that produces a composition vector $t_i^j \in \mathbb{R}^{d_k}$, defined as

$$t_i^j = \tanh\big(h_i^{T} V^j u_j\big), \qquad (1)$$

where $V^j \in \mathbb{R}^{d_k \times d_h \times d_h}$ is a 3-dimensional trainable parameter, and each slice $V_k^j \in \mathbb{R}^{d_h \times d_h}$ is a bilinear term that interacts with these two vectors and captures a specific relation; $\tanh$ denotes the hyperbolic tangent function. According to Socher et al. [40], this tensor operator can be used to model complicated compositions between those vectors.
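As an illustration of Eq. (1), the tensor operator can be expressed with a single einsum; the sketch below assumes batch-first tensors and uses randomly generated parameters purely for demonstration.

```python
import torch

batch, n, d_h, d_k, m = 2, 20, 300, 128, 3

h = torch.randn(batch, n, d_h)                # contextual hidden states from the encoder
V = torch.randn(m, d_k, d_h, d_h)             # one trainable tensor V^j per view (random here)
u = torch.empty(m, d_h).uniform_(-0.2, 0.2)   # view vectors u_j ~ U(-0.2, 0.2)

# t[b, j, i, k] = tanh(h_i^T V^j_k u_j): composition vectors of shape (batch, m, n, d_k)
t = torch.tanh(torch.einsum('bnh,jkhg,jg->bjnk', h, V, u))
```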
Each attention vector $r_i^j \in \mathbb{R}^{d_r}$ for the i-th word in the j-th view is subsequently generated through a BiLSTM model and can be defined as

$$\overrightarrow{r}_i^{\,j} = \overrightarrow{\mathrm{LSTM}}\big(t_i^j, \overrightarrow{r}_{i-1}^{\,j}\big), \quad
\overleftarrow{r}_i^{\,j} = \overleftarrow{\mathrm{LSTM}}\big(t_i^j, \overleftarrow{r}_{i+1}^{\,j}\big), \quad
r_i^j = \mathrm{concat}\big(\overrightarrow{r}_i^{\,j}, \overleftarrow{r}_i^{\,j}\big), \qquad (2)$$

where $r_i^j$ is the composition of both directions. Inspired by Wang et al. [41], the correlation captured in each view at different time steps could influence the following step, and the attention vector $r_i^j$ can be considered to be a content-dependent representation that can review previous information in a sequence. Therefore, the introduction of a BiLSTM layer would help in the computation of attention scores, making the scores more accurate. The attention score $a_i^j$ for the i-th token in the j-th view was calculated using a softmax function that can be defined as follows:

$$a_i^j = \frac{\exp\big(W^j r_i^j\big)}{\sum_i \exp\big(W^j r_i^j\big)}, \quad j = 1, 2, \ldots, m, \qquad (3)$$

where $W^j$ is the shared parameter for assigning weights to each attention vector $r_i^j$ in the j-th view. Using the softmax function, the entries of the attention weights $a^j$ in the j-th view are within the range of $[0, 1]$, and $\sum_i a_i^j = 1$.
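Under the same assumptions, Eqs. (2) and (3) amount to a BiLSTM over the composition vectors of one view followed by a scoring layer and a softmax; the module below is one possible reading of the equations, with d_r split evenly between the two LSTM directions.

```python
import torch
import torch.nn as nn

class ViewAttention(nn.Module):
    """Attention vectors r_i^j (Eq. 2) and attention scores a_i^j (Eq. 3) for a single view j."""
    def __init__(self, d_k: int = 128, d_r: int = 300):
        super().__init__()
        self.bilstm = nn.LSTM(d_k, d_r // 2, batch_first=True, bidirectional=True)
        self.score = nn.Linear(d_r, 1, bias=False)   # the shared scoring parameter W^j

    def forward(self, t: torch.Tensor):
        # t: (batch, n, d_k) composition vectors of this view
        r, _ = self.bilstm(t)                                   # r: (batch, n, d_r)
        a = torch.softmax(self.score(r).squeeze(-1), dim=-1)    # a: (batch, n), rows sum to 1
        return r, a
```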

3.3. Fusion gate

To weight the input token, features from both the contextual representation and attention vectors were used. The atten-
tion score was first applied to the contextualized word vectors to form the integrated token features, which were defined as
follows:
$$o_h^j = \sum_{i=1}^{n} a_i^j h_i. \qquad (4)$$

Similarly, we considered that an attention vector can still be associated with multiple previous views. Therefore, the
attention scores were applied for the attention vectors to form integrated view features, which can be described, as follows:


$$o_r^j = \sum_{i=1}^{n} a_i^j r_i^j. \qquad (5)$$

Since all of the attention vectors were learned from the same hidden states in different views, the information in the hid-
den states and attention vectors could differ but become complementary and consistent. Therefore, a fusion gate was used to
combine the features from both the contextual hidden vectors and the attention vectors:

$$\begin{aligned}
g^j &= \sigma\big(W_r o_r^j + W_h o_h^j + b\big),\\
o_g^j &= g^j \odot o_r^j + \big(1 - g^j\big) \odot o_h^j,\\
s &= \mathrm{concat}\big(o_g^1, o_g^2, \ldots, o_g^m\big),
\end{aligned} \qquad (6)$$

where $\{W_r, W_h\}$ and $b$ are the weights and bias associated with the fusion gate, respectively; $\sigma$ is the sigmoid function; $o_g^j$ denotes the representation of the j-th view; and $s$ is the output sentence representation, which is the concatenation of the $m$ view features.
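Eqs. (4)-(6) can be sketched as follows, assuming d_r = d_h as in the experimental settings of Section 4.3; the class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class FusionGate(nn.Module):
    """Combine pooled hidden states o_h^j and pooled attention vectors o_r^j (Eqs. 4-6)."""
    def __init__(self, d: int = 300):            # assumes d_r == d_h == d
        super().__init__()
        self.w_r = nn.Linear(d, d, bias=False)
        self.w_h = nn.Linear(d, d, bias=True)     # this layer carries the bias b

    def forward(self, h, r, a):
        # h: (batch, n, d) hidden states, r: (batch, n, d) attention vectors, a: (batch, n) scores
        o_h = torch.einsum('bn,bnd->bd', a, h)    # Eq. (4): attention-weighted hidden states
        o_r = torch.einsum('bn,bnd->bd', a, r)    # Eq. (5): attention-weighted attention vectors
        g = torch.sigmoid(self.w_r(o_r) + self.w_h(o_h))
        return g * o_r + (1.0 - g) * o_h          # view representation o_g^j, Eq. (6)

# The sentence representation s concatenates the m view features:
# s = torch.cat([o_g_1, ..., o_g_m], dim=-1)
```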

3.4. Regularization

The multiview attention was mainly computed using Eqs. (1)-(3), where the view vectors $u_j$ were randomly initialized. Without any constraint, this method could lead the proposed model to focus on only one token instead of considering the attention from all views, which makes all of the view vectors similar. Thus, we applied a regularization penalty term on the objective function to disperse the distributions of the multiple attention scores across different views. Furthermore, we maximized the likelihood between each pair of view vectors with the following penalty:
$$\Omega(\mathbf{u}) = \sum_{c \neq j}^{m} \log p_{c,j} \propto \prod_{c \neq j}^{m} p_{c,j}, \qquad (7)$$

where $p_{c,j}$ is a joint distribution between a pair of the c-th and the j-th view vectors $u_c$ and $u_j$, respectively, which is calculated by

$$p_{c,j} = p(u_c, u_j) = \frac{1}{1 + \exp\big(u_c^{T} \cdot u_j\big)}. \qquad (8)$$
Thus, if the cosine similarity between $u_c$ and $u_j$ is large, the dot product will lead to a correspondingly large penalty on the final objective function.
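A minimal sketch of the penalty in Eqs. (7) and (8), applied to the matrix of m view vectors (the function name is an illustrative choice):

```python
import torch
import torch.nn.functional as F

def view_diversity_penalty(u: torch.Tensor) -> torch.Tensor:
    """Omega(u) = sum_{c != j} log p_{c,j}, with p_{c,j} = 1 / (1 + exp(u_c . u_j)); u: (m, d_h)."""
    dots = u @ u.t()                                     # pairwise dot products u_c^T u_j
    log_p = F.logsigmoid(-dots)                          # log(1 / (1 + exp(dots)))
    off_diag = ~torch.eye(u.size(0), dtype=torch.bool)   # keep only the c != j pairs
    return log_p[off_diag].sum()

# In Eqs. (10) and (12) this term is subtracted from the loss with weight lambda,
# so highly similar view vectors (large dot products) incur a large penalty.
```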

3.5. Training objective

3.5.1. Multiclass loss


Based on the obtained sentence representation, a fully connected layer with the softmax activation function was applied
to predict the probability distribution over multiple classes:
$$\hat{y} = \mathrm{softmax}\big(\mathrm{MLP}(s)\big), \qquad (9)$$

where $\mathrm{MLP}(\cdot)$ is the multilayer perceptron and $\hat{y} \in \mathbb{R}^{C}$ is the normalized probability distribution over the $C$ classes. Thus, the loss function for the multiclass task is a categorical cross-entropy, which can be defined as follows:
$$\mathcal{L}_{cls} = -\sum_{d=1}^{D} I(y_d) \odot \log\big(\hat{y}_d\big) - \lambda \cdot \Omega(\mathbf{u}), \qquad (10)$$

where $y_d$ and $\hat{y}_d$ represent the ground truth and the probability distribution of sample $d$; $\Omega(\mathbf{u})$ is the aforementioned regularization item; $\lambda$ is the decay factor; $D$ is the number of training samples; $I(y)$ denotes a one-hot vector with the y-th component being one; and $\odot$ represents the elementwise multiplication operation.
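Assuming integer class labels and the penalty function sketched in Section 3.4, the multiclass objective of Eq. (10) could be written as follows; the decay factor value is only a placeholder.

```python
import torch
import torch.nn.functional as F

def multiclass_loss(logits, labels, u, lam=1e-6):
    # logits: (batch, C) raw MLP outputs; labels: (batch,) integer class ids; u: (m, d_h) view vectors
    # F.cross_entropy combines the softmax of Eq. (9) with the categorical cross-entropy of Eq. (10).
    ce = F.cross_entropy(logits, labels, reduction='sum')
    return ce - lam * view_diversity_penalty(u)   # lam is the decay factor (illustrative value)
```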

3.5.2. Multilabel loss


For multilabel tasks, each view representation (i.e., $o_g^l$ in Eq. (6)) was applied to predict a single label. In the present implementation, $m$ was consistent with the number of labels. For the l-th label, a fully connected layer with a sigmoid activation function was adopted to predict the probability, which was defined as follows:

$$\hat{y}^l = \mathrm{sigmoid}\big(\mathrm{MLP}(o_g^l)\big). \qquad (11)$$

Thus, the loss function is designed as follows:


$$\mathcal{L}_{lab} = -\sum_{d=1}^{D} \sum_{l} I(y_d^l) \odot \log\big(\hat{y}_d^l\big) - \lambda \cdot \Omega(\mathbf{u}), \qquad (12)$$


where $\hat{y}_d^l$ represents the probability of the l-th label normalized by the sigmoid function. The proposed model can be trained end-to-end using the backpropagation algorithm.
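A corresponding sketch of the multilabel objective (Eqs. (11) and (12)), assuming one sigmoid output per label (i.e., per view) and using the standard binary cross-entropy form:

```python
import torch
import torch.nn.functional as F

def multilabel_loss(logits, labels, u, lam=1e-6):
    # logits: (batch, L) one raw score per label (one view per label); labels: (batch, L) in {0, 1}
    bce = F.binary_cross_entropy_with_logits(logits, labels.float(), reduction='sum')
    return bce - lam * view_diversity_penalty(u)   # same diversity penalty as in Eq. (10)
```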

4. Experimental results

In this section, the proposed multiview attention model was compared with several previously proposed methods on sev-
eral corpora for both multiclass and multilabel classification to investigate the effectiveness of the proposed models.

4.1. Datasets

All of the models are evaluated on six multiclass text classification tasks, along with two more challenging tasks of mul-
tilabel classification. For multiclass tasks, only one true label was contained in each text. For multilabel tasks, each text can
hold two or more true labels simultaneously.
The datasets for the multiclass text classification included the following:

• Both the Stanford Sentiment Treebank (SST-5) [40] and Movie Review (MR) [42] are movie review datasets for determining user opinions in terms of five fine-grained and two coarse-grained sentiment classes, respectively.
• The subjectivity dataset (SUBJ) [43] contains both subjective and objective sentences or snippets, which are labeled according to whether the sentences are subjective or objective.
• The Customer Review (CR) [44] dataset consists of reviews of five electronics products downloaded from Amazon and CNET, and all sentences have been manually labeled as to whether an opinion is expressed.
• TREC question dataset (TREC) [45]: The TREC dataset is a question classification dataset that consists of open-domain, fact-based questions categorized into six semantic classes.
• AG's news corpus (AG's) [46]: AG's dataset is a topic classification dataset that groups a large number of news articles into four topic categories.

Correspondingly, these datasets were used for multilabel classification and included the following:

• Reuters-21578 (Reuters): This dataset consists of 10,788 documents from the Reuters financial newswire service, in which each document may carry one or more labels, supporting both single-label and multilabel classification.
• Emotion Classification (EC) [47]: This dataset is available from the SemEval-2018 shared task 1: Affect in Tweets. Each tweet in the dataset is classified into one or more of the 11 given emotions (anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise and trust), presenting the mental state of the tweet.

Only SST-5, Reuters and EC provide separate training, development and test sets for evaluation. For the other datasets, a
hold-out strategy is used to generate them by randomly splitting the raw samples with ratios of 0.8, 0.1 and 0.1, following
previous work [35,36]. If only the development set is missing, it will be split from the training sets with a ratio of 0.1. The
detailed statistics of all datasets are shown in Table 1. Each sentence was first tokenized using the spaCy1 tokenizer. Then, the
pretrained GloVe2 840B Common Crawl [18] was used as the word embedding. For BERT, WordPiece tokenization3 [24] was
adopted, which means that one word will be divided into several pieces that start with ##. The BERT model was initialized from
a checkpoint in an uncased base version and fixed during the training process. For the Reuters dataset, the experiments can ben-
efit from the previous work [35], where preprocessed features embedded with 300-dimensional word2vec vectors are available
for a quick start.

4.2. Evaluation metrics

For multiclass tasks, considering that all datasets have an approximately balanced distribution of classes (Table 1), the accuracy score was used as the performance metric, following previous work [48,32,35,36]. For multilabel tasks, microaveraged F1-scores (F1), precision (precision), recall (recall), and the exact match ratio (ER) were used. ER treats a fully correct prediction as correct (value of 1) and a partially correct prediction as incorrect (value of 0) for each sample. The final results are the mean values over all samples, according to the studies conducted by Yang et al. [35] and Sorower et al. [49]. For all of these metrics, a higher score indicates better performance.
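For reference, a small NumPy sketch of the exact match ratio and the microaveraged precision, recall and F1 on binary indicator matrices (the helper name is illustrative):

```python
import numpy as np

def multilabel_metrics(y_true: np.ndarray, y_pred: np.ndarray):
    """y_true, y_pred: (num_samples, num_labels) binary indicator matrices."""
    er = np.mean(np.all(y_true == y_pred, axis=1))        # exact match ratio
    tp = np.sum((y_true == 1) & (y_pred == 1))            # micro-averaged counts over all labels
    precision = tp / max(np.sum(y_pred == 1), 1)
    recall = tp / max(np.sum(y_true == 1), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return er, precision, recall, f1
```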

1 https://spacy.io/
2 https://nlp.stanford.edu/projects/glove/
3 https://github.com/google/sentencepiece


Table 1
The statistics of each experimental dataset. Train, Dev and Test denote the numbers of training, development and test samples, respectively. The numbers of classes and labels in the multiclass and multilabel tasks are listed in the "Classes (distribution) or Labels" column, where "distribution" denotes the data distribution over classes for the multiclass datasets. Mean Length and Vocab represent the average word number and vocabulary size, respectively.

Dataset | Train | Dev | Test | Classes (distribution) or Labels | Mean Length | Vocab | Description
SST-5 | 8.5 k | 1.1 k | 2.2 k | 5 (0.26, 0.26, 0.19, 0.16, 0.13) | 19.8 | 18,280 | Sentiment classification
MR | 8.6 k | 1.0 k | 1.0 k | 2 (0.5, 0.5) | 22.4 | 18,572 | Sentiment classification
SUBJ | 8.0 k | 1.0 k | 1.0 k | 2 (0.5, 0.5) | 25.2 | 21,125 | Sentiment classification
CR | 3.0 k | 0.4 k | 0.4 k | 2 (0.36, 0.64) | 20.6 | 5,384 | Sentiment classification
TREC | 5.0 k | 0.5 k | 0.5 k | 6 (0.22, 0.22, 0.22, 0.17, 0.15, 0.02) | 10.0 | 9,344 | Question classification
AG's | 108 k | 12.0 k | 7.6 k | 4 (0.25, 0.25, 0.25, 0.25) | 44.0 | 112,103 | Topic classification
Reuters | 5.8 k | 0.6 k | 0.3 k | 9 | 67.1 | 21,055 | Multilabel classification
EC | 6.8 k | 0.9 k | 3.3 k | 11 | 19.8 | 18,645 | Multilabel classification

4.3. Baselines

To comprehensively evaluate the performance of the multiview attention model, we compared it with several previously
proposed methods for text classification. The implementation details for each method are described as follows:

• LSTM and BiLSTM [5]: These apply single and bidirectional LSTMs with a dense layer for the final classification.
• Tree-LSTM [6]: Tree-structured LSTM accounts for semantic and syntactic information for sentiment analysis via dependency parse trees.
• LR-LSTM [12]: Linguistically regularized LSTM applies a type of linguistic regularization on the intermediate outputs of neural networks to involve linguistic information.
• CNN-nonstatic [4]: A CNN is applied to capture n-gram information for sentiment analysis without tuning the pretrained embeddings.
• Self-attention [7]: A structured self-attentive network was proposed to locate the structured information in sentences using attention mechanisms over an LSTM.
• SAAN [31]: A sentiment-aware attention network integrates sentiment information with a sentiment-oriented attention mechanism for extracting sentiment features.
• LAAN [32]: A linguistic-aware attention network adopts an attention mechanism to generate linguistic features for sentence representation.
• Capsule-B [35]: This capsule network adopts a new strategy to boost the dynamic routing process and remedy the disturbance of noisy capsules.
• HCapsNet [36]: The capsule network on hyperplanes designs a routing-on-hyperplane method applied to capsule networks to tackle the polysemy problem in natural language.
• LSTMBERT: This model stacks an LSTM layer over the contextualized word representations from BERT to learn the sentence representation.
• BERT(FiT): This model fine-tunes pretrained BERT for sentiment representation.

The proposed MVA model was implemented with both contextual (BERT) and noncontextual (GloVe) word representations. For a fair comparison, results with and without an additional context encoder (LSTM) are reported for both word representations.
The dimensionality of the GloVe and BERT vectors was 300 and 768, respectively. For all models, if there was an additional LSTM-based encoder, the dimension of the contextual representation $d_h$ was set to 300. The parameter $V^j$, $j \in \{1, \ldots, m\}$, in each view was initialized from a Xavier uniform distribution [50]. For all tasks, the dimension of the attention vector ($d_r$) was set equal to that of the word representation ($d_h$), i.e., 768 for BERT and 300 for GloVe or when an additional LSTM encoder was added. The dimension of the composition vector ($d_k$) in Eq. (1) was set to 128. The optimal number of view vectors ($m$) for the classification tasks was tuned using a grid-search strategy. As mentioned above, the view vector $u_j$ can be initialized from different distributions. Fig. 3 reports the results of using different distributions, i.e., $N(0, 1)$, $U(-0.02, 0.02)$, $U(-0.2, 0.2)$ and $U(-2.0, 2.0)$, for the final classification. Here, $N$ and $U$ denote a normal distribution and a uniform distribution, respectively. As indicated, $u_j \sim U(-0.2, 0.2)$ achieved the best performance in most cases, similar to the results reported in [41,16].
For all datasets, the Adam optimizer was used to train the model, with the penalty weight tuned on the development set. The base learning rate was set to 1e-4 for the BERT-based models and 5e-4 for the noncontextual embedding-based models. The batch size was set to 32. The detailed settings of the critical parameters are listed in Table 2.
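A short sketch of the initialization and optimizer settings described above; the dimensions follow the text, while the variable names and the choice of which parameters to pass to the optimizer are illustrative.

```python
import torch
import torch.nn as nn

d_h, d_k, m = 300, 128, 3

# View vectors u_j ~ U(-0.2, 0.2); bilinear tensors V^j with Xavier uniform initialization [50].
u = nn.Parameter(torch.empty(m, d_h).uniform_(-0.2, 0.2))
V = nn.Parameter(nn.init.xavier_uniform_(torch.empty(m, d_k, d_h, d_h)))

# Adam with the base learning rates reported above: 5e-4 for GloVe-based and 1e-4 for BERT-based models.
optimizer = torch.optim.Adam([u, V], lr=5e-4)
```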


Fig. 3. Dev accuracy (%) on sentiment classification task with different initialization of view vectors.

Table 2
The statistics for the critical hyperparameters. The optimal number of views (m) and decay factors (λ) are tuned on the development set.

Dataset | m: GloVe, GloVe+LSTM, BERT, BERT+LSTM | λ: GloVe, GloVe+LSTM, BERT, BERT+LSTM
MR | 2, 2, 3, 2 | 1e-6, 1e-10, 1e-8, 1e-4
SST-5 | 2, 3, 3, 2 | 1e-6, 1e-3, 1e-3, 1e-8
SUBJ | 3, 3, 2, 3 | 1e-6, 1e-4, 1e-2, 1e-10
CR | 3, 3, 2, 3 | 1e-6, 1e-4, 1e-6, 1e-8
TREC | 2, 3, 3, 2 | 1e-6, 1e-2, 1e-6, 1e-4
AG's | 3, 3, 3, 3 | 1e-2, 1e-8, 1e-4, 1e-6
Reuters | –, 9, –, – | –, 1e-4, –, –
EC | –, 11, –, – | –, 1e-8, –, –

Table 3
Comparative results for the sentiment classification. "[Word Embeddings] + [Sequential Encoder]" represents the context encoder layer stacked on the word embeddings to generate the word representation. (FiT) denotes BERT encoders with the fine-tuning strategy. Performance is measured in average accuracy (%) over five runs. Additionally, the standard deviation is reported for more validity of the proposed methods. The boldfaced results are the best results in all experiments.

Models Sentiment Question Topic


MR SST-5 Subj CR TREC AG’s
Conventional neural models LSTM 75.9 45.6 89.3 78.4 86.8 86.1
BiLSTM 79.3 46.5 90.5 82.1 89.6 88.2
Tree-LSTM 80.7 48.1 91.8 83.2 91.8 90.1
LR-LSTM 81.5 48.3 – 82.5 – –
CNN-non-static 81.5 48.0 93.6 84.3 93.6 92.3
Self-attention 82.5 48.7 – – – –
SAAN 84.3 49.7 – – – –
LAAN 83.9 49.1 – – – –
Capsule-B 82.3 – 92.8 85.1 92.8 92.6
HCapsNet 83.5 50.8 94.2 – 94.2 93.5
LSTMBERT 84.3 50.0 95.9 87.7 96.0 94.0
Multiview attention models MVAGloVe 84.2(±0.25) 49.5(±0.45) 96.2(±0.29) 86.8(±0.32) 95.4(±0.22) 93.5(±0.13)
MVABERT 86.8(±0.42) 53.2(±0.18) 96.9(±0.08) 91.0(±0.37) 96.7(±0.30) 94.4(±0.07)
MVAGloVe+LSTM 84.4(±0.28) 50.9(±0.23) 95.6(±0.12) 87.2(±0.29) 95.1(±0.26) 93.7(±0.14)
MVABERT+LSTM 86.4(±0.24) 52.6(±0.17) 97.0(±0.19) 89.5(±0.75) 97.0(±0.19) 94.4(±0.06)
Fine-tuned models BERT(FiT) 87.6(±0.27) 53.2(±0.34) 96.8(±0.07) 92.5(±0.32) 97.4(±0.17) 94.7(±0.11)
MVABERT(FiT) 88.0(±0.35) 54.8(±0.19) 96.9(±0.06) 92.9(±0.56) 97.5(±0.14) 94.8(±0.08)


Table 4
Comparative results of the multilabel classification. The results marked with * were obtained from the online codes provided by the corresponding authors, or
reimplementation based on the previous papers. Average performance is measured over five runs. Additionally, standard deviation is reported for more validity
on the proposed methods.

Models Reuters-Multi-Label Emotion Classification (EC)


ER Precision Recall F1 ER Precision Recall F1
LSTM 23.3 86.7 54.7 63.5 24.7* 72.3* 60.1* 65.6*
BiLSTM 26.4 82.3 55.9 64.6 25.4* 70.7* 61.6* 65.9*
CNN-non-static 27.4 92.0 59.7 70.4 25.5* 74.9* 58.7* 65.8*
Self-attention 36.0* 94.4* 65.4* 75.1* 24.7* 74.4* 57.6* 64.9*
Capsule-B 60.3 95.4 82.0 85.8 22.3* 70.3* 55.0* 61.7*
MVAword2vec/GloVe+LSTM 74.9(±0.20) 95.2(±0.15) 90.3(±0.34) 91.3(±0.15) 29.6(±0.18) 75.4(±0.13) 62.7(±0.25) 68.8(±0.21)

4.4. Comparative results

Tables 3 and 4 summarize the results of the multiclass and multilabel tasks. Table 3 consists of three groups: conventional
neural models, various multiview attention models, and fine-tuned language models. Table 4 compares the LSTM-based
MVA with other noncontextual word representations reported in previous work [35].
As indicated in Table 3, the proposed model outperforms other baselines in all datasets, demonstrating the effectiveness
of the MVA model. Compared with other attention-based methods, including self-attention, SAAN and LAAN, the MVA model
shows competitive results for learning sentence representations with multiple attentions from different perspectives. Even
when conventional word embeddings (e.g., GloVe) were used, the proposed MVA model also yielded better results than the
capsule networks, such as Capsule-B and HCapsNet, which use dynamic routing strategies to reweight the tokens and are
similar to the attention mechanism.
Comparative experiments also investigate the performance of different word representations. For conventional word
embeddings, the model that applied both GloVe and LSTM outperformed the other models that only used GloVe. Conversely,
models with only the BERT representation can achieve relatively better results than BERT with LSTM, especially for senti-
ment tasks. A detailed analysis of these results will be discussed in Section 4.5.
For the fine-tuned BERT model, the results show that it achieved significant improvement due to the large amount of
knowledge transferred. By applying multiview attention, the fine-tuned BERT model achieves further improvements, espe-
cially on SST-5, which demonstrates the effectiveness of the proposed method.
Table 4 shows the comparative results for the multilabel tasks. As indicated, the proposed model achieves the best per-
formance with respect to ER, recall and F1 on both the Reuters and EC datasets. In particular, the MVA model shows more
competitive results than previous attention- and capsule-based neural networks. Since the model focuses on the important
parts of syntax and structural information, such as the linguistic and semantic features, this approach could improve the final
performance. There were two main reasons why the proposed model could yield more competitive results. One reason is that
diversified attention maps were applied over sequential representations from multiple perspectives. The other reason is that
the contextual word representation can benefit from sequential encoders, such as pretrained BERT and LSTM models.
Capsule networks have been proven to possess a strong ability to learn sentence representation and have been applied for
both multiclass and multilabel tasks in previous work [35]. In the multilabel tasks, the proposed model takes the same pretrained word vectors as input as the Capsule-B and Self-attention models and, as a result, obtains better results than those models. This finding indicates that the proposed model can capture multiple types of information from a single sentence for various targets. Table 4 shows that the proposed model achieved the best performance for all four criteria on
EC. For Reuters-Multi-Label, the proposed model achieved a very competitive score on the precision and a higher score on
the recall, resulting in a convincingly higher performance on the F1-score and ER metrics than Capsule-B.

4.5. Effect of different sequential encoders

Given a sentence, a sequential encoder produces a representation as the input of the multiview attention. Thus, a different sentence representation would impact the performance of the proposed model.
The results of the second group in Table 3 show the performance of different pretrained representations in the proposed
MVA model. Individual GloVe can be considered to be fixed word embeddings, which have the same representations in all
situations, while BERT produces contextual word embedding to provide different representations to capture complex lin-
guistic information. Hence, an LSTM was introduced as an additional sequential encoder to generate a deeper contextual representation. As indicated, LSTMBERT achieves a higher accuracy than the LSTM and BiLSTM models, which are based on noncontextual word embeddings; this finding indicates the excellent performance of contextual word representations in text
classification. The comparative results between MVABERT and MVAGloVe also revealed similar evidence. Furthermore, the BERT
representation can benefit from the contextualized information and hierarchical structure of stacking transformers. For the
noncontextual word embeddings of GloVe, an extra LSTM layer was applied to contextualize the word representation. The
comparative results between MVAGloVe+LSTM and MVABERT revealed that the BERT-based contextual representation achieved
better accuracy, which demonstrates the effectiveness of the BERT representation. It is noteworthy that when an LSTM is
stacked over BERT as a sequential encoder to generate a deeper representation, both MVABERT+LSTM and MVABERT outperform
other models. MVABERT+LSTM achieved a higher performance for the TREC, Subj and AG’s datasets. On the other hand, MVABERT
achieves better performance for MR, SST-5, and CR.
Based on the above analysis, the following conclusions can be drawn: (i) Pretrained contextual representation (i.e., BERT)
outperforms conventional models for all tasks; (ii) Noncontextual word representations (i.e., GloVe) must be used with a
context encoder (i.e., LSTM or CNN) to achieve better performance; (iii) For sentiment analysis tasks, there is no need to stack
an extra context encoder layer on the BERT representation, since it has already learned the high-layer representation along
with rich semantic and syntactic information. However, an additional context encoder along with pretrained BERT can still
improve the performance of other text classification tasks, such as TREC and AG’s.

4.6. Effect of multiview attention

The number of view vectors ($m$) is a parameter that can strongly impact the final performance of the proposed models. To determine how much improvement different numbers of views can provide, experiments with values of $m$ within the range of 1-4 were conducted; when $m$ is much larger than 4, the performance is difficult to improve further because of information saturation. The performance on all development sets of the multiclass tasks is shown in Fig. 4. The optimal $m$ values were selected for the final experiments. In the multilabel tasks, the number of views was consistent with the number of labels. Note that when $m$ was 1, the contextual word representations were integrated into a sentence representation with a single attention.
As indicated in Fig. 4, the single-attention MVA model ($m$ = 1) achieved a relatively low result compared to the other models. The performance of the MVABERT and MVABERT+LSTM models improved when multiple attentions were used. The optimal value of $m$ for each dataset was finally used for prediction on the test split.

4.7. Effect of regularization term

The function of the regularization term in the proposed MVA model is mainly intended to make multiple attention maps
different from one another. To investigate the effectiveness of the regularization term in these models, ablation experiments
were conducted for sentiment classification tasks based on the MVA models, as shown in Table 5. The experimental results
revealed that the MVA models with regularization terms outperformed those without regularization terms in all datasets.
To further analyze the phenomenon that multiview attention can capture diverse task-specific information, two examples
from SST-5 and CR were selected to explore the effectiveness of the regularization term. A visualization experiment was con-
ducted using a heat map presentation to show the weight learned in multiview attention for all word representations. As
indicated in Fig. 5, attention maps with regularization terms were more diverse than those without regularization terms,
and these maps always contained a certain number of redundant and repetitive attention maps with similar distributions.
As shown in Fig. 5(a) and (b), the model with a regularization term emphasized three different parts that reflect the different
views, such as funny and mediocre. Conversely, the model without a regularization term focused on one of those perspectives,
providing a large amount of redundant information in each view. A similar situation also occurred in Fig. 5(c) and (d), where
the learned attention maps without a regularization term had a large number of repetitions on views 1 and 4 and on views 2
and 3. These two examples indicate that the proposed MVA model can benefit from the regularization term to guarantee the
diversity of multiview attention.
Fig. 6 shows the performance of the MVABERT model with different decay factors λ on the development set of the SST-5 dataset. Notably, λ = 0 means that the regularization term was not involved in the training phase. As indicated, the performance improved as λ gradually increased. Once λ exceeded 1e-3, the performance of the model began to decline. The possible reason is that an excessively small decay factor might not sufficiently affect the diversity of the attention maps, whereas a large λ would add a high penalty to the loss function, making the objective function of the classification useless and resulting in a decrease in accuracy.

4.8. Effect of fusion strategy

Because contextualized word representation can capture the syntactic information while the attention vectors can extract
view-specific features, different fusion strategies could perform differently. To investigate the effectiveness of the proposed
fusion gate against other fusion strategies on the MVABERT model, comparative experiments were conducted and are reported
in Table 6. Notably, Concat and Add denote concatenation and addition operations applied to the contextualized representation (CR) and attention vectors (AV), respectively. Linear represents a linear projection of CR and AV.
As indicated in Table 6, the MVABERT model that only applied AV outperformed the one that only applied CR, which indicates that the bilinear term operating on the view and the text can generate different view-specific representations. The MVABERT model with the different fusion strategies applied to both AV and CR achieved better performance by learning the representation from both contextualized and view-specific information. Additionally, the proposed fusion gate strategy achieved relatively better performance than the other strategies, which demonstrates the effectiveness of the fusion gate: it can flexibly integrate the information from AV and CR into a complementary and consistent representation.

Fig. 4. Comparative performance on the development dataset with different m.

Table 5
Ablation studies for the sentiment classification.

Models MR SST-5 Subj CR


MVAGloVe 84.2 49.5 96.2 86.8
MVAGloVe-regularization 83.6 48.3 95.9 86.7
MVABERT 86.8 53.2 96.9 91.0
MVABERT-regularization 86.8 52.7 96.5 90.6
MVABERT+LSTM 86.4 52.6 97.0 89.5
MVABERT+LSTM-regularization 86.1 52.5 96.9 89.2

Fig. 5. Visualization of the attention-based heat maps on the two samples. Each attention block with a deeper color indicates a higher attention score for
the corresponding word.

Fig. 6. The performance of different decay factors λ on the development set of the SST-5 dataset.


Table 6
Accuracy (%) results of MVABERT model with different fusion strategies on contextualized representation (CR) and attention vectors (AV).

Models MR SST-5 Subj CR


MVABERT w/Fusion Gate 86.8 53.2 96.9 91.0
MVABERT w/Concat 86.2 52.6 97.1 90.2
MVABERT w/Linear 86.6 52.6 96.9 90.5
MVABERT w/Add 86.6 52.2 96.9 90.3
MVABERT only CR 86.2 51.8 96.5 89.7
MVABERT only AV 86.5 52.2 96.9 90.5

Fig. 7. The visualization of several examples from different datasets.


4.9. Visualization and discussion

Compared with self-attention, the proposed multiview attention can exploit the integration of more complex informa-
tion. To further investigate this phenomenon, we selected four example sentences from different datasets that were the most
improved by multiview attention against self-attention (in a single view). Visualization experiments were also conducted
using a heat map. As indicated in Fig. 7(a), self-attention tended to focus on an individual token in the sentiment analysis
task, thereby giving the token funny a high attention score. As a result, the final classification result of self-attention was positive, which differs from the actual label, i.e., negative. In contrast, the proposed model learned a more comprehensive attention distribution, where view 1 focused on mediocre and funny, view 2 highlighted mediocre, and view 3 emphasized funny. Through the integration of the fusion gate, the model finally drew a conclusion on mediocre and classified the sentence as negative, which is consistent with the actual label. Another observation is that the
attention mechanism, including both self-attention and multiview attention, was more likely to focus on attributes, adver-
bials and verbs with an obvious sentiment tendency.
Fig. 7(b) and (c) show two examples from the question (TREC) and topic (AG's) classification datasets. With multiview
attention, the model can locate nouns with critical task-specific information, such as blood, artery, stereo, system and price.
In contrast, a single self-attention mechanism can be misled by the context and subsequently focus on tokens, such as only
and price, which are not related to the expression of these whole sentences.
Fig. 7(d) shows an example of the multilabel classification selected from the EC dataset. This dataset provides 11 different
types of emotions, where each sample can be annotated with one or more labels. In our implementation, the proposed model
sets the number of views (m) to be consistent with the number of classification labels. In other words, each view can be
specifically responsible for the classification of a single label and shares the same hidden representation learned by the
sequential encoder. According to the label, each view can individually emphasize different tokens to extract more compre-
hensive features. For example, view 5 is in charge of the joy label and focuses on exhilarating and love, while view 6 is responsible for love and highlights love. Conversely, a single self-attention can only highlight feeling and love; it appears overburdened when leveraging the complex information in multilabel tasks. Furthermore, there are relatively fewer differences among the various views in Fig. 7(d) than in the previous examples. The possible reason is that there are implicit relationships between different labels in multilabel tasks. However, the bilinear operation can still generate different view-specific representations for different labels. Table 6 also demonstrates the effectiveness of multiview attention compared with self-attention. Therefore, the good performance of MVA mainly stems from its diverse attention distributions and view-specific sentiment representations.
In some extreme cases, the MVA model might not be able to learn the appropriate representation. For example, when
other explicit sentiment knowledge has been given, such as aspect, topic, user, and product, the model can impose a strong
restriction on the attention distribution in such a way that a contradiction could occur between the learned representation and
the given sentiment information. In most cases without extra sentiment information, the MVA model could learn the implicit
sentiment distribution; thus, it outperformed other existing attention-based models.

5. Conclusions

In this study, we proposed a multiview attention model to learn a sentence representation from multiple perspectives for
classification tasks. To integrate the information from these multiview attentions, a fusion gate was applied to obtain a con-
sistent and complementary representation from tokens and attentions. In addition, a special penalty term was used to ensure diverse multiview attention distributions without extra information. The comparative experiments show that the proposed model
outperforms other attention-based and capsule-based methods.
Future studies will investigate multiview models that fuse various embeddings and stacked representations, such as different layers of BERT, and will extend the proposed model to other fields of NLP or to more complex sentiment analysis, such as aspect-based and personalized sentiment analysis.

CRediT authorship contribution statement

You Zhang: Investigation, Methodology, Software, Formal analysis, Validation, Writing - original draft. Jin Wang: Concep-
tualization, Software, Formal analysis, Resources, Writing - review & editing, Funding acquisition. Xuejie Zhang: Project
administration, Resources, Supervision, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have
appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (NSFC) under Grants Nos. 61702443,
61966038 and 61762091. The authors would like to thank the anonymous reviewers for their constructive comments.

References

[1] B. Liu, Sentiment analysis and opinion mining, Synth. Lect. Hum. Lang. Technol. 5 (1) (2012) 1–167.
[2] R.A. Calvo, S. D’Mello, Affect detection: an interdisciplinary review of models, methods, and their applications, IEEE Trans. Affect. Comput. 1 (1) (2010)
18–37.
[3] R.K. Amplayo, S. Lee, M. Song, Incorporating product description to sentiment topic models for improved aspect-based sentiment analysis, Inf. Sci. (Ny)
454-455 (2018) 200–215.
[4] Y. Kim, Convolutional neural networks for sentence classification, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language
Processing (EMNLP-2014), 2014, pp. 1746–1751.
[5] K. Cho et al, Learning phrase representations using RNN encoder-decoder for statistical machine translation, in: Proceedings of the 2014 Conference on
Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1724–1734.
[6] K.S. Tai, R. Socher, C.D. Manning, Improved semantic representations from tree-structured long short-term memory networks, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL-2015), 2015, pp. 1556–1566.
[7] Z. Lin et al., A structured self-attentive sentence embedding, arXiv Prepr. arXiv1703.03130, 2017.
[8] D.I.H. Farías, V. Patti, P. Rosso, Irony detection in Twitter, ACM Trans. Internet Technol. 16 (3) (2016) 1–24.
[9] A. Nazir, Y. Rao, L. Wu, L. Sun, Issues and challenges of aspect-based sentiment analysis: a comprehensive survey, IEEE Trans. Affect. Comput (2020) 1.
[10] J. Wang, L.-C. Yu, K.R. Lai, X. Zhang, Dimensional sentiment analysis using a regional CNN-LSTM model, in: Proceedings of the 54th Annual Meeting of
the Association for Computational Linguistics (Volume 2: Short Papers), 2016, pp. 225–230.
[11] A. Graves, J. Schmidhuber, Framewise phoneme classification with bidirectional LSTM and other neural network architectures, Neural Networks 18 (5-
6) (2005) 602–610.
[12] Q. Qian, M. Huang, J. Lei, X. Zhu, Linguistically regularized LSTM for sentiment classification, in: Proceedings of the Conference the 55th Annual
Meeting of the Association for Computational Linguistics (ACL-2017), 2017, pp. 1679–1689.
[13] L. Yao, C. Mao, Y. Luo, Graph convolutional networks for text classification, in: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI-
2019), 2019, pp. 7370–7377.
[14] D. Cavaliere, S. Senatore, Emotional concept extraction through ontology-enhanced classification, in: 13th International Conference on Metadata and
Semantics Research, MTSR 2019, 2019, pp. 52–63.
[15] M.T. Al-Sharuee, F. Liu, M. Pratama, Sentiment analysis: an automatic contextual analysis and ensemble clustering approach and comparison, Data
Knowl. Eng. 115 (2018) 194–213.
[16] Z. Wu, X.Y. Dai, C. Yin, S. Huang, J. Chen, Improving review representations with user attention and product attention for sentiment classification, in:
Proceedings of The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), 2018, pp. 5989–5996.
[17] T. Mikolov, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Proceedings of Advances in
Neural Information Processing Systems (NIPS-2013), 2013, pp. 3111–3119.
[18] J. Pennington, R. Socher, C. Manning, Glove: global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in
Natural Language Processing (EMNLP-2014), 2014, pp. 1532–1543.
[19] B. Peng, J. Wang, X. Zhang, Adversarial learning of sentiment word representations for sentiment analysis, Inf. Sci. (Ny) 541 (2020) 426–441.
[20] L.i. Kong, C. Li, J. Ge, F. Zhang, Y.i. Feng, Z. Li, B. Luo, Leveraging multiple features for document sentiment classification, Inf. Sci. (Ny) 518 (2020) 39–55.
[21] D. Tang, B. Qin, T. Liu, Document modeling with gated recurrent neural network for sentiment classification, in: Proceedings of the 2015 Conference on
Empirical Methods in Natural Language Processing, 2015, pp. 1422–1432.
[22] B. McCann, J. Bradbury, C. Xiong, R. Socher, Learned in translation: contextualized word vectors, in: Advances in Neural Information Processing Systems, 2017, pp. 6294–6305.
[23] M. Peters et al., Deep contextualized word representations, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (NAACL-HLT 2018), 2018, pp. 2227–2237.
[24] J. Devlin, M.W. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the
2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019),
2019, pp. 4171–4186.
[25] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, Improving language understanding by generative pre-training, 2018.
[26] J. Howard, S. Ruder, Universal language model fine-tuning for text classification, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL-2018), 2018, pp. 328–339.
[27] C. Sun, X. Qiu, Y. Xu, X. Huang, How to fine-tune BERT for text classification?, in: China National Conference on Chinese Computational Linguistics, 2019, pp. 194–206.
[28] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, in: Proceedings of the 3rd International Conference on Learning Representations (ICLR-2015), 2015.
[29] J. Cheng, L. Dong, M. Lapata, Long short-term memory-networks for machine reading, in: Proceedings of the 2016 Conference on Empirical Methods in
Natural Language Processing (EMNLP-2016), 2016, pp. 551–561.
[30] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, E. Hovy, Hierarchical attention networks for document classification, in: Proceedings of the 2016 Conference
of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT-2016), 2016, pp. 1480–
1489.
[31] Z. Lei, Y. Yang, M. Yang, SAAN: A sentiment-aware attention network for sentiment analysis, in: 41st International ACM SIGIR Conference on Research
and Development in Information Retrieval, SIGIR 2018, 2018, pp. 1197–1200.
[32] Z. Lei, Y. Yang, Y. Liu, LAAN: a linguistic-aware attention network for sentiment analysis, in: Companion Proceedings of The Web Conference 2018, 2018, pp. 47–48.
[33] S. Sabour, N. Frosst, G.E. Hinton, Dynamic routing between capsules, in: Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS-2017), 2017, pp. 3859–3869.
[34] J. Gong, X. Qiu, S. Wang, X. Huang, Information aggregation via dynamic routing for sequence encoding, in: Proceedings of the 27th International
Conference on Computational Linguistics, 2018, pp. 2742–2752.
[35] M. Yang, W. Zhao, J. Ye, Z. Lei, Z. Zhao, S. Zhang, Investigating capsule networks with dynamic routing for text classification, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP-2018), 2018, pp. 3110–3119.
[36] C. Du et al., Investigating capsule network and semantic feature on hyperplanes for text classification, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 456–465.
[37] Y. Xie, Y. Zhang, M. Gong, Z. Tang, C. Han, MGAT: multi-view graph attention networks, Neural Networks 132 (2020) 180–189.

[38] Y. Yuan et al., MuVAN: a multi-view attention network for multivariate temporal data, in: 2018 IEEE International Conference on Data Mining (ICDM), 2018, pp. 717–726.
[39] Y. Deng et al., Multi-task learning with multi-view attention for answer selection and knowledge base question answering, in: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI-2019), 2019, pp. 6318–6325.
[40] R. Socher, A. Perelygin, J. Wu, J. Chuang, C.D. Manning, A.Y. Ng, C. Potts, Recursive deep models for semantic compositionality over a sentiment treebank, in: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP-2013), 2013, pp. 1631–1642.
[41] W. Wang, S.J. Pan, D. Dahlmeier, X. Xiao, Coupled multi-layer attentions for co-extraction of aspect and opinion terms, in: Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI-2017), 2017, pp. 3316–3322.
[42] B. Pang, L. Lee, Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales, in: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-2005), 2005, pp. 115–124.
[43] B. Pang, L. Lee, A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts, in: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-2004), 2004, pp. 271–278.
[44] M. Hu, B. Liu, Mining and summarizing customer reviews, in: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 168–177.
[45] X. Li, D. Roth, Learning question classifiers, in: Proceedings of the 19th International Conference on Computational Linguistics, 2002, pp. 1–7.
[46] A. Conneau, H. Schwenk, Y. Le Cun, L. Barrault, Very deep convolutional networks for text classification, in: Proceedings of the 15th Conference of the
European Chapter of the Association for Computational Linguistics, 2017, pp. 1107–1116.
[47] S. Mohammad, F. Bravo-Marquez, M. Salameh, S. Kiritchenko, SemEval-2018 Task 1: affect in tweets, in: Proceedings of International Workshop on
Semantic Evaluation (SemEval-2018), 2018, pp. 1–17.
[48] Z. Lei, Y. Yang, M. Yang, SAAN: A sentiment-aware attention network for sentiment analysis, in: 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2018, 2018, pp. 1197–1200.
[49] M. Sorower, A literature survey on algorithms for multi-label learning, Oregon State University, Corvallis, 2010, pp. 1–25.
[50] X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, J. Mach. Learn. Res. 9 (2010) 249–256.

You Zhang is a Ph.D. candidate in the School of Information Science and Engineering, Yunnan University, China. He received the
BS degree in Computer Science and Technology from Beijing Jiaotong University, China. His research interests include natural
language processing, text mining, and machine learning.

Jin Wang is an associate professor in the School of Information Science and Engineering, Yunnan University, China. He received Ph.D. degrees in Computer Science and Engineering from Yuan Ze University, Taoyuan, Taiwan, and in Communication and Information Systems from Yunnan University, Kunming, China. His research interests include natural language processing, text mining, and machine learning.

Xuejie Zhang is a professor in the School of Information Science and Engineering and Director of the High-Performance Computing Center, Yunnan University, China. He received his Ph.D. in Computer Science and Engineering from The Chinese University of Hong Kong in 1998. His research interests include high-performance computing, cloud computing, and big data analytics.
