
Multimodal sentiment classification using spatial attention in Spanish

Anonymous ACL submission

Abstract

Thanks to the large amount of data that different sources can offer, we can now work with several types of data at once. This paper presents a multimodal audio-and-text model for classifying the emotion of a speaker, built on an attention mechanism that improves the results and reaches an accuracy of 0.9567. This is a work in progress.

1 Introduction

The field of Affective Computing has been evolving to close the gap between technology and humanity; to achieve this, it is essential that machines can empathize with humans in order to understand the relationship they should have with them (Picard, 2000). With the advancement of Deep Learning in fields such as computer vision, speech, and text, many important advances have been made in Affective Computing, especially in emotion recognition tasks. In particular, recognizing emotion from speech is one of the most important tasks in paralinguistics (Kamiloğlu et al., 2021). Its applications have recently expanded, as it is a crucial factor in optimal human-computer interaction, including dialog systems. The goal of speech emotion recognition is to predict the emotional content of speech and to classify utterances into one of several labels (e.g., happy, sad, neutral, and angry). Several authors have applied different types of deep learning methods to increase the performance of emotion classifiers (Thao et al., 2021). However, the task is still considered challenging for several reasons. First, insufficient data are available to train complex neural network-based models, due to the cost of human annotation; this is especially pronounced for Spanish. Second, the characteristics of emotions must be learned from low-level speech signals, and feature-based models show limited ability on this problem. Third, emotions are highly subjective: many characteristics of the speakers can bias the results. In this article, we propose an approach to alleviate the problem of insufficient data in Spanish by combining two modalities, audio and text. With this combination we obtain good results with very high accuracy, thanks in part to recent attention models that learn to focus on the most important words.

2 Related Work

Many predictive models have been trained on spectral, prosodic, and voice quality characteristics, as well as on text using sentiment analysis, to recognize emotional patterns. Typical machine learning algorithms include support vector machines (SVM), hidden Markov models (HMM), k-nearest neighbors, decision trees, and artificial neural networks. Unfortunately, no classification method has yet been standardized (Duville et al., 2021). The choice of classification algorithm depends mainly on the properties of the data (for example, the number of observations and classes) and on the nature and number of features (the attributes of the object to be identified). Recently, Ancilin and Milton (2021) extracted spectral features from voice recordings from six databases: the Berlin Emotional Speech Database (German), the RAVDESS database (North American English), the SAVEE database (British English), the EMOVO emotional speech database (Italian), the eNTERFACE database (English), and the Urdu database (Urdu) (Khan et al., 2021). A multiclass SVM classifier was used to recognize emotions in a one-versus-one approach. In another study, the Berlin Database of Emotional Speech was used to compare emotion recognition performance among backpropagation neural networks, extreme learning machines, probabilistic neural networks, and MFCC-based SVM supervised learning models (Shen et al., 2020). In a multimodal approach, SVM has been used as a late fusion algorithm that collects the predictions previously obtained from acoustic and text features separately and produces a final multimodal prediction; the authors found that SVM-based late fusion improved emotion recognition performance (Do et al., 2021).

3 Methodology

We use a multimodal architecture that combines text and audio to improve results, together with an attention module (see Fig. 1).

Figure 1: Multimodal model of audio and text with an attention module in Spanish for emotion recognition.

3.1 Audio Features

The base audio model is pretrained on 16 kHz sampled speech, so any speech input must also be sampled at 16 kHz. The model does not have a tokenizer because it was pretrained on audio alone; to use it for speech recognition, a tokenizer would have to be created and the model fine-tuned on labeled text data. We chose wav2vec 2.0 (Iwamoto and Shinozaki, 2021) to provide the audio features used as part of our proposed model inputs. There are thousands of languages spoken around the world, many with several dialects, which presents a huge challenge for building high-quality speech recognition technology: it is simply not feasible to obtain resources for every dialect and language across all possible domains (read speech, telephone speech, etc.). Our model, wav2vec 2.0 pretrained on Spanish, is part of a broader trend toward machine learning models that rely less on labeled data thanks to self-supervised learning.
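As an illustration, the sketch below shows how utterance-level audio embeddings could be extracted with a Spanish wav2vec 2.0 checkpoint from the Hugging Face Hub; the checkpoint name and the mean-pooling strategy are assumptions for this example, not details reported in the paper.

```python
# Minimal sketch: extracting utterance-level audio features with wav2vec 2.0.
# The checkpoint name and mean-pooling choice are illustrative assumptions.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

CHECKPOINT = "facebook/wav2vec2-large-xlsr-53-spanish"  # assumed Spanish checkpoint

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(CHECKPOINT)
model = Wav2Vec2Model.from_pretrained(CHECKPOINT)
model.eval()

def audio_embedding(waveform_16khz: torch.Tensor) -> torch.Tensor:
    """Return a fixed-size embedding for a mono waveform sampled at 16 kHz."""
    inputs = feature_extractor(
        waveform_16khz.numpy(), sampling_rate=16_000, return_tensors="pt"
    )
    with torch.no_grad():
        hidden_states = model(**inputs).last_hidden_state  # (1, frames, hidden)
    return hidden_states.mean(dim=1).squeeze(0)            # mean-pool over time
```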

3.2 Text Features

BETO is a BERT model trained on a large Spanish corpus (de Arriba Serra et al., 2021). BETO is similar in size to BERT-Base and was trained with the Whole Word Masking technique. TensorFlow and PyTorch checkpoints are available for the uncased and cased versions, together with results on Spanish benchmarks comparing BETO with Multilingual BERT and with other (non BERT-based) models. For further details on how to use BETO, see the Hugging Face Transformers library.
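For illustration, a sentence-level text embedding could be obtained from BETO as sketched below; the checkpoint name, the [CLS]-token pooling, and the example phrase are assumptions made for this sketch.

```python
# Minimal sketch: sentence embeddings from BETO (Spanish BERT).
# The checkpoint name and [CLS]-token pooling are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

CHECKPOINT = "dccuchile/bert-base-spanish-wwm-uncased"  # assumed BETO checkpoint

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModel.from_pretrained(CHECKPOINT)
model.eval()

def text_embedding(sentence: str) -> torch.Tensor:
    """Return the [CLS] embedding of a Spanish sentence."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden_states = model(**inputs).last_hidden_state  # (1, tokens, hidden)
    return hidden_states[:, 0, :].squeeze(0)               # [CLS] token

emb = text_embedding("qué alegría verte")  # hypothetical example phrase
```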

3.3 Proposed AttendAffectNet Model

Inspired by the Transformer model (Vaswani et al., 2017), we propose the AttendAffectNet (AAN) model, a multimodal neural network that applies the self-attention mechanism to predict emotions across six classes (neutral, happy, angry, sad, disgusted, and afraid) using features extracted from audio and text. We implemented the so-called Feature AAN, in which self-attention is applied to the features obtained from the different modalities; the design takes the Transformer architecture as its starting point.
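A minimal sketch of this kind of Feature AAN fusion is shown below: each modality embedding is projected to a common dimension, treated as a token, and passed through a self-attention (Transformer encoder) block before classification. The layer sizes, number of heads, and mean pooling are assumptions for illustration, not the exact architecture used in the paper.

```python
# Minimal sketch of a Feature AAN-style fusion module (assumed sizes, not the
# exact architecture from the paper): modality embeddings become "tokens"
# over which a Transformer encoder applies self-attention.
import torch
import torch.nn as nn

class FeatureAAN(nn.Module):
    def __init__(self, audio_dim=1024, text_dim=768, d_model=256, num_classes=6):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)   # project audio features
        self.text_proj = nn.Linear(text_dim, d_model)     # project text features
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, audio_emb, text_emb):
        # Stack the two modality embeddings as a length-2 token sequence.
        tokens = torch.stack(
            [self.audio_proj(audio_emb), self.text_proj(text_emb)], dim=1
        )                                                  # (batch, 2, d_model)
        fused = self.encoder(tokens)                       # self-attention across modalities
        return self.classifier(fused.mean(dim=1))          # (batch, num_classes) logits

# Usage with dummy embeddings:
logits = FeatureAAN()(torch.randn(8, 1024), torch.randn(8, 768))
```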

4 Results

4.1 MESD Database

The Mexican Emotional Speech Database (MESD) provides single-word utterances for the affective prosodies of anger, disgust, fear, happiness, neutral, and sadness, with Mexican cultural shaping. The MESD was recorded by non-professional adult and child actors: 3 female, 2 male, and 6 children's voices are available. The words of the emotional and neutral utterances come from two corpora: corpus A, composed of nouns and adjectives that are repeated across emotional prosodies and voice types (female, male, child), and corpus B, consisting of words controlled for age of acquisition, frequency of use, familiarity, concreteness, valence, arousal, and discrete-emotion dimensionality ratings. Audio files were stored as 24-bit streams with a sample rate of 48,000 Hz, and the amplitude of the acoustic waveforms was rescaled to the range [-1, 1]. We also manually labeled the text based on the words in this dataset for the text model (Duville et al., 2021).
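Since MESD audio is stored at 48 kHz while the wav2vec 2.0 model expects 16 kHz input, a preprocessing step along the following lines could be used; the use of torchaudio, peak normalization, and the file path are assumptions for illustration, not the paper's exact pipeline.

```python
# Minimal sketch: load a MESD clip, peak-normalize to [-1, 1], and resample
# from 48 kHz to the 16 kHz expected by wav2vec 2.0. torchaudio usage and the
# file path are illustrative assumptions.
import torch
import torchaudio

def load_for_wav2vec(path: str) -> torch.Tensor:
    waveform, sample_rate = torchaudio.load(path)           # (channels, samples)
    waveform = waveform.mean(dim=0)                          # mix down to mono
    waveform = waveform / waveform.abs().max()               # rescale to [-1, 1]
    if sample_rate != 16_000:
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)
    return waveform

wave = load_for_wav2vec("mesd/example_clip.wav")  # hypothetical path
```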

Modality                 Model              Accuracy
Only Text                BETO               0.903
Only Audio               Wav2Vec2           0.9308
Both Text and Audio      without attention  0.9512
Both Text and Audio      with attention     0.9567

Table 1: Results of the experiments using multimodal models. Combining text and audio with an attention mechanism gives the best accuracy.

4.2 Implementation

The open-source MESD dataset, which contains 1,200 audio recordings, all recorded in professional studios and only about one second long, was used to fine-tune the Wav2Vec2 base model. Of the 1,200 recordings, only 890 were used for training. Because of these factors, the model (and hence the accompanying Gradio application) may not perform well in noisy environments or on audio with background music or noise. For the implementation we used the following hyperparameters: learning rate 0.0001, training batch size 64, evaluation batch size 40, seed 42, Adam optimizer with betas=(0.9, 0.999) and epsilon=1e-08, and 100 epochs, implemented with Transformers 4.17.0 and PyTorch 1.10.0+cu111 on the Hugging Face platform with Tokenizers 0.11.6.
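These hyperparameters translate roughly into the Hugging Face Trainer configuration sketched below; the base checkpoint, output directory, and the datasets are placeholders assumed for illustration, not artifacts released with the paper.

```python
# Minimal sketch of the fine-tuning setup with the hyperparameters reported in
# Section 4.2. The base checkpoint and output directory are assumptions.
from transformers import TrainingArguments, Wav2Vec2ForSequenceClassification

# Assumed base checkpoint with a 6-class emotion head (neutral, happy, angry,
# sad, disgusted, afraid); the exact checkpoint used is not specified here.
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=6
)

# Hyperparameters as reported in Section 4.2.
training_args = TrainingArguments(
    output_dir="wav2vec2-mesd",            # hypothetical output directory
    learning_rate=1e-4,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=40,
    num_train_epochs=100,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)

# A transformers.Trainer would then be built with these arguments and the
# preprocessed MESD train/eval splits, and fine-tuned via trainer.train().
```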

4.3 Experiments

Table 1 shows the accuracy obtained with text only, with audio only, and with both modalities with and without the attention mechanism; the last configuration gives the best result. It is also worth mentioning that the model performs poorly on audio recordings from the class "Fear", which it often misclassifies.

5 Conclusions

The prototype described above may not perform well in noisy environments or on audio with a musical or noisy background, because the models are trained on very little data (due to poor availability). Our contribution so far is a multimodal model of text and audio with an attention mechanism that reaches an accuracy of 0.9567, with a higher bias in the "fear" class. To make the model robust, our future work includes an in-depth study and root-cause analysis to verify and establish whether prosodic cues that are critical for audio sentiment classification are lost when the model is fine-tuned for ASR purposes.

Acknowledgements

Thanks to the organizers of the "Somos NLP" hackathon and to my audio sentiment classification team, which was the basis for this project.
CEUR-WS. org. 218
References

J. Ancilin and A. Milton. 2021. Improved speech emotion recognition with mel frequency magnitude coefficient. Applied Acoustics, 179:108046.

Ariadna de Arriba Serra, Marc Oriol Hilari, and Javier Franch Gutiérrez. 2021. Applying sentiment analysis on Spanish tweets using BETO. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021), co-located with the XXXVII International Conference of the Spanish Society for Natural Language Processing (SEPLN 2021), Málaga, Spain, pages 1–8. CEUR-WS.org.

Luu-Ngoc Do, Hyung-Jeong Yang, Hai-Duong Nguyen, Soo-Hyung Kim, Guee-Sang Lee, and In-Seop Na. 2021. Deep neural network-based fusion model for emotion recognition using visual data. The Journal of Supercomputing, 77(10):10773–10790.

Mathilde Marie Duville, Luz María Alonso-Valerdi, and David I. Ibarra-Zarate. 2021. Mexican emotional speech database based on semantic, frequency, familiarity, concreteness, and cultural shaping of affective prosody. Data, 6(12):130.

Yu Iwamoto and Takahiro Shinozaki. 2021. Unsupervised spoken term discovery using wav2vec 2.0. In 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pages 1082–1086. IEEE.

Roza G. Kamiloğlu, George Boateng, Alisa Balabanova, Chuting Cao, and Disa A. Sauter. 2021. Superior communication of positive emotions through nonverbal vocalisations compared to speech prosody. Journal of Nonverbal Behavior, 45(4):419–454.

Ihsan Ullah Khan, Aurangzeb Khan, Wahab Khan, Mazliham Mohd Su'ud, Muhammad Mansoor Alam, Fazli Subhan, and Muhammad Zubair Asghar. 2021. A review of Urdu sentiment analysis with multilingual perspective: A case of Urdu and Roman Urdu language. Computers, 11(1):3.

Rosalind W. Picard. 2000. Affective Computing. MIT Press.

Weizheng Shen, Ding Tu, Yanling Yin, and Jun Bao. 2020. A new fusion feature based on convolutional neural network for pig cough recognition in field situations. Information Processing in Agriculture.

Ha Thi Phuong Thao, BT Balamurali, Gemma Roig, and Dorien Herremans. 2021. AttendAffectNet: Emotion prediction of movie viewers using multimodal fusion with self-attention. Sensors, 21(24):8356.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.
