Feature Fusion for Multimodal Emotion Recognition Based on Deep Canonical Correlation Analysis

Ke Zhang, Yuanqing Li, Jingyu Wang, Member, IEEE, Zhen Wang, and Xuelong Li, Fellow, IEEE

Abstract—Fusion of multimodal features is a central problem for video emotion recognition. With the development of deep learning, directly fusing the feature matrices of each modality through neural networks at the feature level has become the mainstream approach. However, unlike unimodal problems, multimodal analysis requires finding the correlations between different modalities, which is as important as discovering effective unimodal features. To make up for the deficiency in unearthing the intrinsic relationships between modalities, a novel modularized multimodal emotion recognition model based on deep canonical correlation analysis (MERDCCA) is proposed in this letter. In MERDCCA, every four utterances are gathered into a new group, and each utterance contains text, audio and visual information as multimodal input. Gated recurrent unit layers are used to extract the unimodal features, and deep canonical correlation analysis based on an encoder-decoder network is designed to extract cross-modal correlations by maximizing the relevance between modalities. Experiments on two public datasets show that MERDCCA achieves better results.

Index Terms—Deep canonical correlation analysis, gated recurrent unit, multimodal emotion recognition.

Manuscript received July 1, 2021; revised August 29, 2021; accepted August 30, 2021. Date of publication September 14, 2021; date of current version September 29, 2021. This work was supported in part by the Basic Research Strengthening Program of China under Grant 2020-JCJQ-ZD-015-00-02; in part by the National Natural Science Foundation for Distinguished Young Scholars under Grant 62025602; and in part by the National Natural Science Foundation of China under Grants U1803263, 61871470, 11931015, and 61502391. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Xun Cao. (Corresponding author: Yuanqing Li.)

Ke Zhang and Yuanqing Li are with the National Key Laboratory of Aerospace Flight Dynamics and the School of Astronautics, Northwestern Polytechnical University, Xi'an, Shaanxi 710072, China (e-mail: zhangke@nwpu.edu.cn; yuanqingli@mail.nwpu.edu.cn).

Jingyu Wang is with the School of Artificial Intelligence, Optics and Electronics (iOPEN) and the School of Astronautics, Northwestern Polytechnical University, Xi'an, Shaanxi 710072, China (e-mail: jywang@nwpu.edu.cn).

Zhen Wang and Xuelong Li are with the School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi'an, Shaanxi 710072, China (e-mail: zhenwang0@gmail.com; xuelong_li@nwpu.edu.cn).

Digital Object Identifier 10.1109/LSP.2021.3112314

I. INTRODUCTION

MULTIMODAL emotion recognition research has always been a coexistence of opportunities and challenges [1]. Whenever multimodal emotion recognition is mentioned, it is closely tied to our daily life [2], which is also why people's daily habits have become hot spots for researchers seeking new inspiration [3]. Visual [4], audio [5], text [6], electroencephalogram [7] and many other expression signals [8] of humans can be collected to carry out emotion recognition. However, humans, and especially human emotions, are obviously very complicated [9]. There are a great many ways for people to express their emotions [10], depending on cultural traditions and living habits [11]. This complexity and uncertainty have made multimodal emotion recognition receive more and more attention than unimodal emotion recognition in recent years, owing to its wider application scenarios and market prospects [12]. Along with these advantages, the difficulty of extracting and fusing features from heterogeneous modalities has become the main topic and obstacle for multimodal emotion recognition.

At present, most multimodal emotion recognition methods choose to extract notable features within each modality before fusion. Audio features have been integrated into contextualized word embeddings through a parallel bidirectional language model inspired by word embeddings [13]. A fuzzy-logic convolutional neural network (CNN) has been utilized to process tri-modal data [14]. Audio and visual information have been fused at the feature level by a belief-network-based CNN [15], and three separate networks have been designed to capture trimodal features in parallel [16]. These works all process the multimodal inputs simultaneously by directly calculating the feature matrices through deep neural networks at the feature level or the decision level.

Unlike the work mentioned above, which makes great efforts to discover emotion-related unimodal features [17], some recent works shift their focus to the nonlinear relationships between modalities and their influence on feature selection. Tang et al. [18] design a deep neural network to fuse multi-scale features for defocus blur detection. Implicit fusion via joint feature representation learning has been verified as feasible through a single shared network [19] and a cross-view local structure [20]. Deep transfer networks are used by Shu et al. [21] to mitigate insufficient image data with knowledge from the text domain, and by Tang et al. [22] to transfer information across heterogeneous domains. Wen et al. [23] propose a context-gated convolutional neural network to discover cross-modal information under both word-aligned and unaligned conditions. Chen et al. [24] use two separate CNNs to process the audio and visual information and fuse them by calculating feature correlations. Chen et al. [25] also put their key research point on digging out the implicit relation between the audio and visual modalities: all features are mixed up and re-clustered by k-means, and the new feature groups are then used to compute the cross-modal relationship according to the emotion tags. Both of the latter two works maximize the correlation between heterogeneous modalities by using canonical correlation analysis, which is designed to reflect the overall correlation between two groups of indicators. However, both of them ignore the effective internal features within each modality.


Fig. 1. The structure of the proposed MERDCCA.

Motivated by the issues mentioned above, this letter proposes a modularized multimodal video emotion recognition model based on deep canonical correlation analysis (MERDCCA). To the best of our knowledge, this is the first time that both unimodal and cross-modal features are given consideration while using audio, text and visual data as trimodal inputs. The details of MERDCCA are introduced in the following sections. The main contributions of this letter are listed below: (1) a novel modularized multimodal emotion recognition model based on deep canonical correlation analysis is proposed to improve multimodal emotion recognition performance; (2) deep canonical correlation analysis (DCCA) is applied to generate integrated multimodal variables and maximize their correlations; (3) four related utterances are united as a new group, and an encoder-decoder system with an optimized loss function is used to retain the unimodal features and the subsequent cross-modal correlations at the same time.

II. PROPOSED METHOD

Screening and amplifying related features is the first and most important step for effective multimodal emotion recognition [26]. Extracting features from unimodal or cross-modal sources are both popular options in recent years, and the two aspects have their own irreplaceable advantages in emotion recognition. Unimodal features can better filter out the information most relevant to emotion recognition from the other noise. Cross-modal features can better explore the interaction between modalities, because, as a complex human activity, emotion is often a comprehensive expression carried simultaneously by several channels of information, and unimodal features alone cannot express all of the effective information. However, traditional and recent methods either concentrate on the internal feature extraction of each modality or focus on cross-modal features; there is almost no method that takes both into account. To make up for this defect and open a discussion in this area of research, this letter proposes a modularized multimodal video emotion recognition model based on deep canonical correlation analysis, shown in Fig. 1. As Fig. 1 illustrates, there are mainly three modules in MERDCCA: unimodal feature extraction, correlation analysis and classification. The loss function is specially designed with three parts to ensure that effective feature selection is constrained and optimized for both the unimodal and the cross-modal information simultaneously. The details are explained in the following subsections.

A. Unimodal Feature Extraction

This module is the first part of MERDCCA and is responsible for extracting the unimodal features related to emotion recognition.

Firstly, to make full use of context information, the video segments are preprocessed so that every four utterances are packaged into a new input group, with the third utterance being the target to be recognized, as shown in Fig. 1. In the new group, the context of both interlocutors is recorded, and the emotion persistence and contagion of the target are potentially exposed. Let $G_A(t)$, $G_T(t)$ and $G_V(t)$ represent the audio, text and visual features of the grouped utterances, including the target, at timestamp $t$. Secondly, three parallel sub-nets consisting of bidirectional gated recurrent units ($\overleftrightarrow{GRU}$) are designed to fuse the tri-modal information of the new group, and an attention mechanism ($Attention$) is combined with them to better extract context information and unimodal features. The processed audio ($A_{fusion}$), text ($T_{fusion}$) and visual ($V_{fusion}$) features are obtained as below:

$A_{fusion} = \overleftrightarrow{GRU}(Attention(G_A(t)))$   (1)

$T_{fusion} = \overleftrightarrow{GRU}(Attention(G_T(t)))$   (2)

$V_{fusion} = \overleftrightarrow{GRU}(Attention(G_V(t)))$   (3)
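As an illustration of (1)-(3), a minimal PyTorch sketch of one per-modality branch is given below. It is not the authors' released code: the class name, the stand-in input dimension and the single-head dot-product attention are assumptions made for the example, while the group length of four utterances, the bidirectional GRU and the 1024 hidden nodes follow the description in this letter.

```python
import torch
import torch.nn as nn

class UnimodalExtractor(nn.Module):
    """Hypothetical sketch of Eqs. (1)-(3): attention over a group of four
    utterances followed by a bidirectional GRU; one instance per modality."""

    def __init__(self, in_dim, hidden_dim=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(in_dim, num_heads=1, batch_first=True)
        self.bigru = nn.GRU(in_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, group):
        # group: (batch, 4, in_dim) -- four consecutive utterances, the third is the target.
        attended, _ = self.attn(group, group, group)   # Attention(G(t))
        fused, _ = self.bigru(attended)                # bidirectional GRU over the group
        return fused                                   # (batch, 4, 2 * hidden_dim)

# Three parallel branches share the same structure (audio shown here).
audio_net = UnimodalExtractor(in_dim=100)              # in_dim=100 is a placeholder
A_fusion = audio_net(torch.randn(8, 4, 100))           # -> (8, 4, 2048)
```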
B. Correlation Analysis

To compute the cross-modal correlations, we employ deep canonical correlation analysis (DCCA) [27] to maximize the dependence between the representative vectors. DCCA builds on the cross-covariance matrix and is a multivariate statistical analysis method that uses the correlation between comprehensive variable pairs to reflect the overall correlation between two groups of indicators. It is a data analysis and dimension reduction method similar to principal component analysis (PCA) [28]. Different from PCA, DCCA reduces the dimensions of two spaces and provides a shared representation of the heterogeneous inputs.


In order to utilize DCCA to compute the cross-modal correlations, the extracted features of the three modalities are first fused into two channels according to the input dimension limitation of DCCA. These two channels are acquired by fusing the text modality with the audio and visual modalities separately through two GRU layers, to which an attention mechanism is added. The reason we use the text modality as the medium of the bimodal fusion, helping to select the key points of the bimodal features during training, is that, according to [29], compared with the other two modalities, text contains the most comprehensive content of the conversations and carries less chance of misunderstanding.

After this step, the fused bimodal information is input to the next encoder-decoder system, which is composed of two groups of fully connected layers with symmetrical numbers of nodes. These two groups of layers correspond to the two former bimodal information channels, respectively, as shown in Fig. 1. The outputs of the encoder are represented as $(H_1, H_2)$, which are two comprehensive variable pairs of the original bimodal features after dimension reduction by the encoder. We extract the cross-modal features from these two pairs by using DCCA to maximize the correlation between them. The detailed calculation formulas are given below:

$Fusion_{TA} = \overleftrightarrow{GRU}(Attention(T_{fusion}, A_{fusion}))$   (4)

$Fusion_{TV} = \overleftrightarrow{GRU}(Attention(T_{fusion}, V_{fusion}))$   (5)

$(H_1, H_2) = Encoder(Fusion_{TA}, Fusion_{TV})$   (6)

where $Fusion_{TA}$ and $Fusion_{TV}$ are the bimodal features fused after the GRU layers.
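The sketch below shows one way (4)-(6) and the mirror-symmetric decoder of (11) could be wired up in PyTorch. It is only an illustration under stated assumptions: the concatenation of the text stream with the other modality, the single-head attention and the stand-in feature sizes are choices made for the example, while the 512-256-2-256-512 encoder-decoder widths follow the setting reported in Section III-A.

```python
import torch
import torch.nn as nn

class BimodalFusion(nn.Module):
    """Sketch of Eqs. (4)-(5): the text features act as the pivot and are fused
    with one other modality (audio or visual) via attention and a BiGRU."""

    def __init__(self, dim, hidden_dim=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.bigru = nn.GRU(dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, text_feat, other_feat):
        # Concatenating along the utterance axis is an assumption; the letter
        # does not fix how the two streams enter the shared attention-GRU block.
        pair = torch.cat([text_feat, other_feat], dim=1)
        attended, _ = self.attn(pair, pair, pair)
        fused, _ = self.bigru(attended)
        return fused[:, -1]                              # (batch, 2 * hidden_dim)

def mlp(sizes):
    layers = []
    for a, b in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(a, b), nn.ReLU()]
    return nn.Sequential(*layers[:-1])                   # no activation after the last layer

# Encoder/decoder with the 512-256-2-256-512 widths reported in Section III-A.
encoder = mlp([1024, 512, 256, 2])                       # Eq. (6): H = Encoder(Fusion)
decoder = mlp([2, 256, 512, 1024])                       # Eq. (11): R = Decoder(H)

text_feat = torch.randn(8, 4, 2048)                      # e.g. outputs of the unimodal extractors
audio_feat = torch.randn(8, 4, 2048)
fusion_TA = BimodalFusion(dim=2048)(text_feat, audio_feat)   # Eq. (4)
H1 = encoder(fusion_TA)                                  # 2-D comprehensive variable
R_TA = decoder(H1)                                       # reconstruction, compared to Fusion_TA
```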
In DCCA, in order to use the correlation coefficient to measure correlations for high-dimensional data, $(H_1, H_2)$ have to be projected as below:

$\tilde{H}_1 = a^T H_1, \quad \tilde{H}_2 = b^T H_2$   (7)

The goal of DCCA is to find the values $(a^*, b^*)$ of $(a, b)$ that maximize $corr(\tilde{H}_1, \tilde{H}_2)$, the correlation coefficient of $(\tilde{H}_1, \tilde{H}_2)$:

$(a^*, b^*) = \arg\max_{(a,b)} corr(\tilde{H}_1, \tilde{H}_2)$   (8)

$corr(\tilde{H}_1, \tilde{H}_2) = \dfrac{cov(\tilde{H}_1, \tilde{H}_2)}{\sqrt{D(\tilde{H}_1)\, D(\tilde{H}_2)}}$   (9)

where $cov(\tilde{H}_1, \tilde{H}_2)$ is the covariance of $\tilde{H}_1$ and $\tilde{H}_2$, and $D(\tilde{H}_1)$, $D(\tilde{H}_2)$ are their variances, respectively. To optimize this objective, the formula above can be rewritten as the following constrained optimization problem:

$\arg\max_{(a,b)} a^T S_{H_1 H_2} b \quad \text{s.t.} \quad a^T S_{H_1 H_1} a = 1, \; b^T S_{H_2 H_2} b = 1$   (10)
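Problem (10) has the classical CCA solution via an SVD of the whitened cross-covariance matrix, and this is also how the DCCA correlation used in the loss below can be computed on a mini-batch of encoder outputs. The function is a sketch of that computation rather than the authors' implementation: the ridge term eps, the eigendecomposition-based inverse square root and the batch covariance estimates are implementation choices.

```python
import torch

def dcca_corr(h1, h2, eps=1e-6):
    """Canonical correlation between two mini-batch representations.

    h1, h2: (batch, d) outputs of the two encoder branches.
    Returns the sum of singular values of the whitened cross-covariance,
    i.e. the quantity that the DCCA loss maximizes (loss_DCCA = -corr).
    """
    n = h1.size(0)
    h1 = h1 - h1.mean(dim=0, keepdim=True)
    h2 = h2 - h2.mean(dim=0, keepdim=True)

    # Covariance estimates S_H1H1, S_H2H2, S_H1H2 with a small ridge term.
    s11 = h1.t() @ h1 / (n - 1) + eps * torch.eye(h1.size(1), device=h1.device)
    s22 = h2.t() @ h2 / (n - 1) + eps * torch.eye(h2.size(1), device=h2.device)
    s12 = h1.t() @ h2 / (n - 1)

    def inv_sqrt(s):
        # s is symmetric positive semi-definite, so use its eigendecomposition.
        w, v = torch.linalg.eigh(s)
        return v @ torch.diag(w.clamp_min(eps).rsqrt()) @ v.t()

    t = inv_sqrt(s11) @ s12 @ inv_sqrt(s22)
    # The canonical correlations are the singular values of t (Eq. (10) solved by SVD).
    return torch.linalg.svdvals(t).sum()
```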
To obtain the maximal value of the objective and the corresponding dimension-reduction vectors $(a, b)$, singular value decomposition (SVD) is used to carry out the optimization. Then, the comprehensive variable pairs $(H_1, H_2)$ are reconstructed through the decoder, which is mirror-symmetric to the encoder network. The reconstructed output of the decoder is represented as $(R_{TA}, R_{TV})$:

$(R_{TA}, R_{TV}) = Decoder(H_1, H_2)$   (11)

C. Classification and Loss Function

The reconstructed pair $(R_{TA}, R_{TV})$ is finally fused again through a GRU layer followed by a fully connected layer to complete the classification. To give consideration to both the unimodal feature extraction and the cross-modal correlation extraction, the final loss function $Loss$ is composed of three parts in our method, namely

$Loss = loss_{DCCA} + loss_{rec} + loss_{class}$   (12)

where $loss_{DCCA}$ is the DCCA loss, which is used to maximize the cross-modal features by measuring the correlation between the comprehensive variable pairs $(H_1, H_2)$: the larger the correlation, the smaller the loss. $loss_{rec}$ is the reconstruction difference, measured by the Euclidean distance between $(R_{TA}, R_{TV})$ and $(Fusion_{TA}, Fusion_{TV})$; it is intended to maintain the unimodal features extracted by the first module. $loss_{class}$ is the cross-entropy loss of the final classification.
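A sketch of how the three terms of (12) could be combined is given below. The function signature and tensor names are illustrative assumptions, dcca_corr refers to the correlation sketch given at the end of Section II-B, and the unweighted sum mirrors (12).

```python
import torch
import torch.nn.functional as F

def merdcca_loss(H1, H2, R_TA, R_TV, fusion_TA, fusion_TV, logits, labels):
    """Sketch of Eq. (12): Loss = loss_DCCA + loss_rec + loss_class."""
    # dcca_corr: the mini-batch canonical correlation sketched after Eq. (10)/(11).
    loss_dcca = -dcca_corr(H1, H2)                        # larger correlation -> smaller loss
    loss_rec = (torch.norm(R_TA - fusion_TA, dim=1) +     # Euclidean reconstruction error
                torch.norm(R_TV - fusion_TV, dim=1)).mean()
    loss_class = F.cross_entropy(logits, labels)          # final classification term
    return loss_dcca + loss_rec + loss_class
```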
III. EXPERIMENTAL RESULTS AND ANALYSIS

A. Experimental Settings

IEMOCAP [30] and CMU-MOSI [31] are chosen to carry out the experiments with our model. The tri-modal information of both datasets is recorded at the utterance level. The former is labeled with one of six emotion labels, namely happy, sad, neutral, angry, excited and frustrated, while the latter is tagged with a value between −3 and +3, where a higher value indicates a more positive emotion and 0 is the neutral boundary. The split of the datasets is shown in Table I.

TABLE I
DATASETS SPLIT

To evaluate the experimental results, the accuracy and the macro-average F-score are calculated as below:

$F_\beta = (1 + \beta^2) \cdot \dfrac{precision \cdot recall}{\beta^2 \cdot precision + recall}$   (13)

where $\beta$ represents the weight between precision and recall. In this letter, $\beta$ is set to 1, meaning that precision and recall have the same weight. The p-value is calculated through a paired t-test between our method and the baselines, as below, to evaluate the significance, with the significance level $\alpha$ set to 0.05 [32]:

$t = \dfrac{\bar{d} - d_0}{S_d / \sqrt{n}}$   (14)

$df = n - 1$   (15)

where $\bar{d}$ and $d_0$ represent the sample mean of the differences and the hypothesized population mean difference, respectively, $S_d$ is the standard deviation of the differences, and $df$ is the degree of freedom.
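Both evaluation quantities map onto standard library routines. The snippet below is a sketch using scikit-learn and SciPy, with placeholder arrays standing in for real per-utterance predictions and per-fold accuracies.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from scipy import stats

y_true = np.array([0, 1, 2, 1, 0, 2])                 # placeholder labels
y_pred = np.array([0, 1, 1, 1, 0, 2])                 # placeholder predictions

acc = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")  # Eq. (13) with beta = 1

# Paired t-test between per-run accuracies of our method and a baseline
# (Eqs. (14)-(15); d0 = 0 and df = n - 1 are handled internally by ttest_rel).
ours = np.array([0.81, 0.79, 0.83, 0.80, 0.82])       # placeholder scores
baseline = np.array([0.78, 0.77, 0.80, 0.79, 0.80])   # placeholder scores
t_stat, p_value = stats.ttest_rel(ours, baseline)
significant = p_value < 0.05                          # alpha = 0.05 as in the letter
```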
The $\overleftrightarrow{GRU}$ in our method is set as a single layer with 1024 hidden nodes for the unimodal feature extraction in the first module and 512 hidden nodes elsewhere. The hidden nodes of the encoder-decoder system are set as 512, 256, 2, 256 and 512, respectively. The dropout probability is set to 0.45 and the learning rate is 0.0015, with the optimizer weight decay set to 0.0005.


B. Results and Discussion

The experimental results on the two datasets are shown in Fig. 2 and Fig. 3, respectively. For CMU-MOSI, sentiment analysis is carried out and the utterances are classified into positive and negative. The trimodal text, audio and visual information from the first module is shown in Fig. 2(a). The values of $(H_1, H_2)$ after the cross-modal correlation calculation through DCCA in the second module are shown in Fig. 2(b). For IEMOCAP, emotion recognition is performed and six emotion types are classified. Fig. 3(a) depicts the unimodal features and Fig. 3(b) shows the corresponding cross-modal values of each emotion type after DCCA.

Fig. 2. Features of the CMU-MOSI dataset.

Fig. 3. Features of the IEMOCAP dataset.

The final recognition performance of our method is compared with several baselines, and the results on the two datasets are shown in Table II and Table III, respectively. In these tables, Acc. means accuracy and bold font denotes the best performance.

TABLE II
EXPERIMENTAL RESULTS ON CMU-MOSI

TABLE III
EXPERIMENTAL RESULTS ON IEMOCAP

From Fig. 2(b) and Fig. 3(b), it can be seen clearly that the distribution characteristics of each modality are maintained; at the same time, the features are preliminarily re-clustered while the cross-modal correlations are maximized through the second module using DCCA. The addition of the reconstruction loss and the DCCA loss to the loss function enables our method to retain and utilize both unimodal and cross-modal features at the same time, which is also one of the most significant differences from other methods.

In Table II, our model achieves the best performance on the CMU-MOSI dataset, about 1% higher than the state-of-the-art. The reason is that the method in [29] only considers cross-modal features based on an attention mechanism, although it found that text is the most effective modality to serve as a pivot for the other modalities. [33] and [34] focus on multimodal information fusion, but all fusions are completed directly through attention-based deep networks, so neither unimodal nor cross-modal features are specifically extracted. [35] does not pay attention to cross-modal features; it obtains its best results by taking the context of the other related utterances into account and fusing the trimodal information through a hierarchical module. The results of [33], [34] and [35] are clearly higher than those of [29] because these three methods put their emphasis on unimodal feature extraction, whereas [29] only cares about cross-modal features. However, unimodal information actually plays an essential role in multimodal emotion recognition and cannot be ignored. As a result, our method achieves better overall recognition accuracy than the baselines by considering both the unimodal and the cross-modal aspects. In Table III, our method obtains the best accuracy in the sad, excited and average categories, and is also very close to the best result for the frustrated type. The gap in the happy category arises because happy and excited are treated as the same class in the method of [38]. View-specific and cross-view features are both considered in [36]; however, the definition of a view in that method is confined to a single modality at different time nodes rather than to cross-modal views. The method in [37] puts most of its effort into mining contextual relations, and the trimodal information is fused by cascading at the very first step; the lack of effective feature screening makes its accuracy the lowest. Our model achieves the best results on CMU-MOSI and in most categories of IEMOCAP because we first unite related utterances into new groups to grasp the context; secondly, the inner unimodal features are screened and the cross-modal correlations are maximized under specific emotions; finally, the loss function is optimized to maintain the extracted unimodal and cross-modal features.

IV. CONCLUSION

In this letter, we propose a novel modularized multimodal emotion recognition model based on deep canonical correlation analysis. Unimodal feature extraction and cross-modal correlation analysis are placed in equally important positions. Effective unimodal features are retained while the subsequent cross-modal correlations are maximized, and related contextual information is also considered to improve the performance. The experimental results on two public datasets prove the validity of our method.


REFERENCES

[1] P. Jiang, B. Wan, Q. Wang, and J. Wu, "Fast and efficient facial expression recognition using a Gabor convolutional network," IEEE Signal Process. Lett., vol. 27, pp. 1954–1958, 2020.
[2] M. Hu, Q. Chu, X. Wang, L. He, and F. Ren, "A two-stage spatiotemporal attention convolution network for continuous dimensional emotion recognition from facial video," IEEE Signal Process. Lett., vol. 28, pp. 698–702, 2021.
[3] Y. Tian, J. Cheng, Y. Li, and S. Wang, "Secondary information aware facial expression recognition," IEEE Signal Process. Lett., vol. 26, no. 12, pp. 1753–1757, Dec. 2019.
[4] P. Jiang, G. Liu, Q. Wang, and J. Wu, "Accurate and reliable facial expression recognition using advanced softmax loss with fixed weights," IEEE Signal Process. Lett., vol. 27, pp. 725–729, 2020.
[5] G. Kim, H. Lee, B.-K. Kim, S.-H. Oh, and S.-Y. Lee, "Unpaired speech enhancement by acoustic and adversarial supervision for speech recognition," IEEE Signal Process. Lett., vol. 26, no. 1, pp. 159–163, Jan. 2019.
[6] S. Yang, Y. Wang, and L. Xie, "Adversarial feature learning and unsupervised clustering based speech synthesis for found data with acoustic and textual noise," IEEE Signal Process. Lett., vol. 27, pp. 1730–1734, 2020.
[7] R. Nawaz, K. H. Cheah, H. Nisar, and V. V. Yap, "Comparison of different feature extraction methods for EEG-based emotion recognition," Biocybern. Biomed. Eng., vol. 40, no. 3, pp. 910–926, 2020.
[8] C. Gouveia, A. Tomé, F. Barros, S. C. Soares, J. Vieira, and P. Pinho, "Study on the usage feasibility of continuous-wave radar for emotion recognition," Biomed. Signal Process. Control, vol. 58, 2020, Art. no. 101835.
[9] Y. Jiang, W. Li, M. S. Hossain, M. Chen, A. Alelaiwi, and M. Al-Hammadi, "A snapshot research and implementation of multimodal information fusion for data-driven emotion recognition," Inf. Fusion, vol. 53, pp. 209–221, 2020.
[10] V. Rajan, A. Brutti, and A. Cavallaro, "ConflictNet: End-to-end learning for speech-based conflict intensity estimation," IEEE Signal Process. Lett., vol. 26, no. 11, pp. 1668–1672, Nov. 2019.
[11] Z. Li, D. Wu, F. Nie, R. Wang, Z. Sun, and X. Li, "Multi-view clustering based on invisible weights," IEEE Signal Process. Lett., vol. 28, pp. 1051–1055, 2021.
[12] X. Li, M. Chen, and F. Nie, "Locality adaptive discriminant analysis," in Proc. Int. Joint Conf. Artif. Intell., 2017, pp. 2201–2207.
[13] S.-Y. Tseng, S. Narayanan, and P. Georgiou, "Multimodal embeddings from language models for emotion recognition in the wild," IEEE Signal Process. Lett., vol. 28, pp. 608–612, 2021.
[14] T.-L. Nguyen, S. Kavuri, and M. Lee, "A multimodal convolutional neuro-fuzzy network for emotion understanding of movie clips," Neural Netw., vol. 118, pp. 208–219, 2019.
[15] D. Nguyen, K. Nguyen, S. Sridharan, D. Dean, and C. Fookes, "Deep spatio-temporal feature fusion with compact bilinear pooling for multimodal emotion recognition," Comput. Vis. Image Understanding, vol. 174, pp. 33–42, 2018.
[16] P. Tzirakis, J. Chen, S. Zafeiriou, and B. Schuller, "End-to-end multimodal affect recognition in real-world environments," Inf. Fusion, vol. 68, pp. 46–53, 2021.
[17] J. Wang, Z. Ma, F. Nie, and X. Li, "Progressive self-supervised clustering with novel category discovery," IEEE Trans. Cybern., early access, 2021, doi: 10.1109/TCYB.2021.3069836.
[18] C. Tang, X. Zhu, X. Liu, L. Wang, and A. Zomaya, "DefusionNet: Defocus blur detection via recurrently fusing and refining multi-scale deep features," in Proc. IEEE/CVF Conf. Comput. Vision Pattern Recognit., 2019, pp. 2700–2709.
[19] Y. Wang, F. Sun, M. Lu, and A. Yao, "Learning deep multimodal feature representation with asymmetric multi-layer fusion," in Proc. 28th ACM Int. Conf. Multimedia, 2020, pp. 3902–3910.
[20] C. Tang et al., "Cross-view locality preserved diversity and consensus learning for multi-view unsupervised feature selection," IEEE Trans. Knowl. Data Eng., early access, 2021, doi: 10.1109/TKDE.2020.3048678.
[21] X. Shu, G.-J. Qi, J. Tang, and J. Wang, "Weakly-shared deep transfer networks for heterogeneous-domain knowledge propagation," in Proc. 23rd ACM Int. Conf. Multimedia, 2015, pp. 35–44.
[22] J. Tang, X. Shu, Z. Li, G.-J. Qi, and J. Wang, "Generalized deep transfer networks for knowledge propagation in heterogeneous domains," ACM Trans. Multimedia Comput. Commun. Appl., vol. 12, no. 4, pp. 1–22, 2016.
[23] H. Wen, S. You, and Y. Fu, "Cross-modal context-gated convolution for multi-modal sentiment analysis," Pattern Recognit. Lett., vol. 146, pp. 252–259, 2021.
[24] C. Guanghui and Z. Xiaoping, "Multi-modal emotion recognition by fusing correlation features of speech-visual," IEEE Signal Process. Lett., vol. 28, pp. 533–537, 2021.
[25] L. Chen, K. Wang, M. Wu, W. Pedrycz, and K. Hirota, "K-means clustering-based kernel canonical correlation analysis for multimodal emotion recognition," IFAC-PapersOnLine, vol. 53, no. 2, pp. 10250–10254, 2020.
[26] X. Li, M. Chen, F. Nie, and Q. Wang, "A multiview-based parameter free framework for group detection," in Proc. AAAI Conf. Artif. Intell., vol. 31, no. 1, 2017, pp. 4147–4153.
[27] C. Guo and D. Wu, "Discriminative sparse generalized canonical correlation analysis (DSGCCA)," in Proc. Chin. Automat. Congr., 2019, pp. 1959–1964.
[28] Z. Wang, F. Nie, C. Zhang, R. Wang, and X. Li, "Capped ℓp-norm LDA for outliers robust dimension reduction," IEEE Signal Process. Lett., vol. 27, pp. 1315–1319, 2020.
[29] D. Gkoumas, Q. Li, C. Lioma, Y. Yu, and D. Song, "What makes the difference? An empirical comparison of fusion strategies for multimodal language analysis," Inf. Fusion, vol. 66, no. 6, pp. 184–197, 2021.
[30] C. Busso et al., "IEMOCAP: Interactive emotional dyadic motion capture database," Lang. Resour. Eval., vol. 42, no. 4, p. 335, 2008.
[31] A. Zadeh, R. Zellers, E. Pincus, and L.-P. Morency, "MOSI: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos," 2016, arXiv:1606.06259.
[32] K. Zhang, Y. Li, J. Wang, E. Cambria, and X. Li, "Real-time video emotion recognition based on reinforcement learning and domain knowledge," IEEE Trans. Circuits Syst. Video Technol., early access, 2021, doi: 10.1109/TCSVT.2021.3072412.
[33] M. G. Huddar, S. S. Sannakki, and S. R. Vijay, "Attention-based multimodal contextual fusion for sentiment and emotion classification using bidirectional LSTM," Multimedia Tools Appl., vol. 80, no. 9, pp. 13059–13076, 2021.
[34] R. Cao, C. Ye, and H. Zhou, "Multimodel sentiment analysis with self-attention," in Proc. Future Technol. Conf., 2021, vol. 1, pp. 16–26.
[35] N. Majumder, D. Hazarika, A. Gelbukh, E. Cambria, and S. Poria, "Multimodal sentiment analysis using hierarchical fusion with context modeling," Knowl.-Based Syst., vol. 161, pp. 124–133, 2018.
[36] A. Zadeh, P. P. Liang, N. Mazumder, S. Poria, E. Cambria, and L.-P. Morency, "Memory fusion network for multi-view sequential learning," in Proc. AAAI Conf. Artif. Intell., 2018, pp. 5634–5641.
[37] D. Hazarika, S. Poria, R. Mihalcea, E. Cambria, and R. Zimmermann, "ICON: Interactive conversational memory network for multimodal emotion detection," in Proc. Conf. Empirical Methods Natural Lang. Process., 2018, pp. 2594–2604.
[38] S. Yildirim, Y. Kaya, and F. Kılıç, "A modified feature selection method based on metaheuristic algorithms for speech emotion recognition," Appl. Acoust., vol. 173, 2021, Art. no. 107721.
