[…] to classify the emotion of the speaker, based on data in the Spanish language, and an attention model that allows us to improve the results, reaching an accuracy of 0.9567. This is a work in progress.

1 Introduction

The field of Affective Computing has been evolving in order to close the gap between technology and humanity. To achieve this, it is very important that machines can empathize with humans and assertively understand the relationship they should have (Picard, 2000). With the advancement of Deep Learning together with different fields such as computer vision, voice, text, and others, many important advances have been made in the area of Affective Computing, especially in emotion recognition tasks. In particular, speech emotion is one of the most important topics in the field of paralinguistics (Kamiloğlu et al., 2021). Its applications have recently expanded, as it is a crucial factor in optimal human-computer interaction, including dialog systems. The goal of speech emotion recognition is to predict the emotional content of speech and to classify speech according to one of several labels (i.e., happy, sad, neutral, and angry). Several studies have applied different types of deep learning methods to increase the performance of emotion classifiers (Thao et al., 2021). However, this task is still considered challenging for various reasons. First, insufficient data is available to train complex neural network-based models, due to the costs associated with human intervention; this is especially true for the Spanish language. Second, the characteristics of emotions must be learned from low-level speech signals. Feature-based models show limited abilities when applied […] which includes one of the biggest challenges: multimodality, in this case audio and text. With this, good results with very high precision have been obtained, thanks also to the latest attention models, which help the network pay attention to the most important words.

2 Related Work

Many predictive models have been trained on spectral, prosodic, and voice quality characteristics, as well as on text using sentiment analysis, to recognize emotional patterns. Typical machine learning algorithms are based on models such as support vector machines (SVM), hidden Markov models (HMM), k-nearest neighbors, decision trees, and artificial neural networks. Unfortunately, the classification method has not yet been standardized (Duville et al., 2021). The selection of classification algorithms depends mainly on the properties of the data (for example, the number of observations and classes) and the nature and number of features (the attributes of the object to be identified). Recently, Ancilin and Milton (2021) extracted spectral features from voice recordings from six databases: the Berlin Emotional Speech Database (German), the RAVDESS database (North American English), the SAVEE database (British English), the EMOVO emotional speech database (Italian), the eNTERFACE database (English), and the Urdu database (Urdu) (Khan et al., 2021). A multiclass SVM classifier was used to recognize emotions in a one-versus-one approach. In another study, the Berlin Database of Emotional Speech was used to compare emotion recognition performance between backpropagation neural networks, […]
[…] (Iwamoto and Shinozaki, 2021) to include those audio features as part of our proposed model inputs. There are thousands of languages spoken around the world, many with several different dialects, which presents a huge challenge for building high-quality speech recognition technology. It is simply not feasible to obtain resources for each dialect and every language across the many possible domains (read speech, telephone speech, etc.). Our new model, wav2vec 2.0 pretrained in Spanish, is part of our vision for machine learning models that rely less on labeled data, thanks to self-supervised learning.
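As a rough sketch of how a wav2vec 2.0 encoder turns raw waveform into frame-level features that can feed a downstream classifier: in practice one would load a Spanish pretrained checkpoint with `Wav2Vec2Model.from_pretrained(...)` (e.g. the public `facebook/wav2vec2-large-xlsr-53-spanish` checkpoint; the exact model used here may differ). The tiny randomly initialized configuration below is only an illustrative assumption that keeps the example self-contained and download-free.

```python
import torch
from transformers import Wav2Vec2Config, Wav2Vec2Model

# Tiny illustrative configuration (NOT the real pretrained weights):
# two conv feature-extractor layers and two transformer layers.
config = Wav2Vec2Config(
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    conv_dim=(32, 32),
    conv_kernel=(10, 3),
    conv_stride=(5, 2),
)
model = Wav2Vec2Model(config)
model.eval()

# One second of 16 kHz audio (wav2vec 2.0 expects 16 kHz input).
waveform = torch.zeros(1, 16000)

with torch.no_grad():
    # Frame-level contextual features: (batch, frames, hidden_size).
    hidden = model(waveform).last_hidden_state
```

With a real pretrained Spanish checkpoint, `hidden` would be the self-supervised audio representation used as the audio branch of the model.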
Modality                         Model            Accuracy
Only text                        BETO             0.903
Only audio                       Wav2Vec2         0.9308
Text and audio, no attention     BETO + Wav2Vec2  0.9512
Text and audio, with attention   BETO + Wav2Vec2  0.9567

Table 1: Results of the experiments with the multimodal models. Combining text and audio with an attention mechanism yields the best accuracy.
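The combination behind the last two rows of Table 1 can be sketched as a small fusion module: text token embeddings (e.g. from BETO) attend over audio frame embeddings (e.g. from Wav2Vec2) via cross-attention, and pooled representations from both branches are concatenated for classification. This is a hypothetical minimal sketch, not the authors' exact architecture; the embedding dimensions and the four example labels (happy, sad, neutral, angry) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionFusionClassifier(nn.Module):
    """Hypothetical sketch: fuse text and audio embeddings with
    cross-attention, then classify into emotion classes."""

    def __init__(self, text_dim=768, audio_dim=768, num_classes=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, text_dim)
        # Text tokens are the queries; audio frames are keys/values.
        self.cross_attn = nn.MultiheadAttention(
            text_dim, num_heads=8, batch_first=True
        )
        self.classifier = nn.Linear(2 * text_dim, num_classes)

    def forward(self, text_emb, audio_emb):
        # text_emb: (B, T_text, text_dim); audio_emb: (B, T_audio, audio_dim)
        audio = self.audio_proj(audio_emb)
        attended, _ = self.cross_attn(text_emb, audio, audio)
        # Mean-pool each branch and concatenate before the classifier.
        pooled = torch.cat([text_emb.mean(dim=1), attended.mean(dim=1)], dim=-1)
        return self.classifier(pooled)

model = AttentionFusionClassifier()
text = torch.randn(2, 12, 768)   # e.g. BETO token embeddings
audio = torch.randn(2, 49, 768)  # e.g. Wav2Vec2 frame embeddings
logits = model(text, audio)      # (2, 4): one score per emotion class
```

Cross-attention here plays the role the paper attributes to attention over important words; mean-pooling plus concatenation is one common fusion choice among several.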
[…]emotional prosodies and types of voice (female, male, child), and (corpus B) consisting of words controlled for age of acquisition, frequency of use, familiarity, concreteness, valence, arousal, and discrete emotion dimensionality ratings. Audio files were stored as a 24-bit stream with a sample rate of 48,000 Hz. The amplitude of the acoustic waveforms was rescaled between -1 and 1. We have also manually labeled the text based on the words in this dataset for the model with text (Duville et al., 2021).

4.2 Implementation

The open-source MESD dataset was used to fine-tune the Wav2Vec2 base model. It contains 1,200 audio recordings, all recorded in professional studios and only one second long. Of the 1,200 recordings, only 890 were utilized for training. Due to these factors, the model, and hence this Gradio application, may not perform well in noisy environments or on audio with background music or noise. For the implementation we used the following parameters: learning rate 0.0001, train batch size 64, eval batch size 40, seed 42, optimizer Adam with betas=(0.9, 0.999) and epsilon=1e-08, and 100 epochs, implemented with Transformers 4.17.0 and PyTorch 1.10.0+cu111, using Hugging Face as the platform with Tokenizers 0.11.6.

4.3 Experiments

Table 1 shows the precision results using data with only text, with only audio, and with both characteristics, without and with the attention mechanism, the latter giving the best precision result. It is also worth mentioning that the model performs poorly on audio recordings from the class "Fear", which it often misclassifies.

5 Conclusions

The aforementioned prototype may not perform well in noisy environments or on audio with a musical or noisy background, because the models are trained on very little data (due to poor availability). We presented a multimodal model of text and audio with an attention mechanism, which reaches a precision of 0.9567 with a higher bias in the "Fear" class. To make our model robust, our future work includes an in-depth study and root cause analysis to verify and establish whether critical prosodies for the audio sentiment classification task are lost during fine-tuning of the model for ASR purposes.

Acknowledgements

Thanks to the organizers of the "Somos NLP" hackathon and to my audio sentiment classification team, which was the basis for this project.

References

J. Ancilin and A. Milton. 2021. Improved speech emotion recognition with mel frequency magnitude coefficient. Applied Acoustics, 179:108046.

Ariadna de Arriba Serra, Marc Oriol Hilari, and Javier Franch Gutiérrez. 2021. Applying sentiment analysis on Spanish tweets using BETO. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021), co-located with the Conference of the Spanish Society for Natural Language Processing (SEPLN 2021), Málaga, Spain, September 2021, pages 1–8. CEUR-WS.org.

Luu-Ngoc Do, Hyung-Jeong Yang, Hai-Duong Nguyen, Soo-Hyung Kim, Guee-Sang Lee, and In-Seop Na. 2021. Deep neural network-based fusion model for emotion recognition using visual data. The Journal of Supercomputing, 77(10):10773–10790.

Mathilde Marie Duville, Luz María Alonso-Valerdi, and David I. Ibarra-Zarate. 2021. Mexican emotional speech database based on semantic, frequency, familiarity, concreteness, and cultural shaping of affective prosody. Data, 6(12):130.
Yu Iwamoto and Takahiro Shinozaki. 2021. Unsupervised spoken term discovery using wav2vec 2.0. In 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pages 1082–1086. IEEE.

Roza G. Kamiloğlu, George Boateng, Alisa Balabanova, Chuting Cao, and Disa A. Sauter. 2021. Superior communication of positive emotions through nonverbal vocalisations compared to speech prosody. Journal of Nonverbal Behavior, 45(4):419–454.

Ihsan Ullah Khan, Aurangzeb Khan, Wahab Khan, Mazliham Mohd Su'ud, Muhammad Mansoor Alam, Fazli Subhan, and Muhammad Zubair Asghar. 2021. A review of Urdu sentiment analysis with multilingual perspective: A case of Urdu and Roman Urdu language. Computers, 11(1):3.

Rosalind W. Picard. 2000. Affective Computing. MIT Press.

Weizheng Shen, Ding Tu, Yanling Yin, and Jun Bao. 2020. A new fusion feature based on convolutional neural network for pig cough recognition in field situations. Information Processing in Agriculture.

Ha Thi Phuong Thao, BT Balamurali, Gemma Roig, and Dorien Herremans. 2021. AttendAffectNet: Emotion prediction of movie viewers using multimodal fusion with self-attention. Sensors, 21(24):8356.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.