Manuscript #: SDATA-21-00959
Current Revision #: 0
Submission Date: 27 August 2021
Current Stage: Initial QC Started
Title: ESCorpus-PE: A speech emotional database in Spanish with Peruvian accent
Manuscript Type: Data Descriptor
Corresponding Author: Dr Alvaro Cuno (alvaroecp@gmail.com), Universidad Nacional de San Agustín de Arequipa
Contributing Authors: Ms Alessandra Delgado, Professor Wilber Ramos, Ms Harieth Bernedo
Subject Terms: Scientific community and society / Scientific community / Research data
Competing Interests: no competing interest declared
Human Subjects: no
Applicable Funding Source: no applicable funding
ESCorpus-PE: A speech emotional database in Spanish with Peruvian accent

Alvaro Cuno1,*, Alessandra Delgado1, Wilber Ramos Lovón1, and Harieth Bernedo1

1 Universidad Nacional de San Agustín de Arequipa, Departamento Académico de Ingeniería de Sistemas e Informática, Arequipa, Perú
* Corresponding author: Alvaro Cuno (acunopa@unsa.edu.pe)
ABSTRACT
Recently, there has been considerable progress in speech emotion recognition systems based on deep learning. However, many of them do not generalize to languages they were not trained on. One reason for this limitation is the absence of corpora for all existing human languages, dialects, and accents. This paper contributes to filling this gap by presenting ESCorpus-PE, which contains three hours, five minutes, and six seconds of speech from adult speakers in Spanish with a Peruvian accent, annotated with three emotional attributes: valence, arousal, and dominance. The speech data originates from YouTube videos with interactions in their natural form. Our database can be a valuable resource for researchers studying emotion recognition in multilingual settings.
Selection of Platform
There are several audio platforms on the Internet; however, not all of them were feasible for our project. We established two criteria to determine the most appropriate one: diversity and accessibility. Diversity, in this work, refers to platforms offering a wide variety of available audio/video (including diversity of topics and speakers); accessibility refers to platforms providing unlimited access to audio/video without imposing search restrictions. With these two criteria, three candidate platforms were identified: Spotify, Ivoox, and YouTube. The best adapted to our needs was YouTube, since it has a greater diversity of videos, topics, and speakers and allows broader searches than Spotify and Ivoox.
Audio/Video Selection
Once the platform was selected, candidate audios/videos were identified by applying the following criteria:
• Inclusion criteria:
– Select audios/videos with a Creative Commons license or with explicit consent for unrestricted use.
– Choose audios/videos using geographical keywords found in the audio description: Peru, Lima, Arequipa, Cuzco.
– Select audios/videos in Spanish with a Peruvian accent in one of its dialects from Lima, Arequipa, Cuzco, Piura, and other Peruvian departments.
– Select audios that present emotional variety: happy, sad, calm, active, and dominant.
– Maintain diversity, including audios/videos covering diverse topics, with both male and female speakers, and in which intense emotions of the participants are also present.
– Select audios/videos with natural and spontaneous content; they must not be the product of acted performances.
63 • Exclusion criteria:
With these criteria, the videos listed in Table 1 were selected.
Videos 3, 4, 5, and 6 contain noise in some of their segments. In video 3, happiness is expressed at its peak while the participants watch the Peru vs. New Zealand soccer match in which Peru qualified for the 2018 World Cup, so seventy-nine segments contain noise. In video 4, the 1996 tragedy in Arequipa is narrated as part of a journalistic report, so seventeen segments contain noise. Video 5 comprises two reports about the cold weather in Arequipa and Puno, and one hundred and one of its segments have instrumental music in the background. Finally, in video 6, a report on one of the traditions of the Huaynacotas region of Arequipa, sixty-eight segments have musical backgrounds.
Pre-processing
Pre-processing is the procedure by which the videos are prepared for segmentation and annotation. It consists of converting the videos to single-channel WAV audio with 16-bit PCM storage precision; the Audacity®1 tool was used for this activity.
1 https://www.audacityteam.org/
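The original workflow used Audacity interactively. Purely as an illustration of the target format, the sketch below (standard-library `wave` plus NumPy) downmixes an already-extracted 16-bit WAV to a single channel; extracting the audio track from the video is outside its scope.

```python
import wave

import numpy as np


def to_mono_16bit(in_path: str, out_path: str) -> None:
    """Downmix a 16-bit PCM WAV file to a single channel.

    A sketch of the pre-processing step, not the authors' actual
    Audacity workflow; it assumes the audio track has already been
    extracted from the video as a 16-bit WAV.
    """
    with wave.open(in_path, "rb") as w:
        assert w.getsampwidth() == 2, "expects 16-bit PCM input"
        sr, n_ch = w.getframerate(), w.getnchannels()
        samples = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    if n_ch > 1:
        # Average the interleaved channels frame by frame.
        samples = samples.reshape(-1, n_ch).mean(axis=1).astype(np.int16)
    with wave.open(out_path, "wb") as w:
        w.setnchannels(1)   # single channel
        w.setsampwidth(2)   # 16-bit PCM storage precision
        w.setframerate(sr)
        w.writeframes(samples.tobytes())
```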
Table 1. Selected videos.

Video ID | Name | URL | Duration | Emotional profile
1 | Presidential debate 2021 | https://youtu.be/4ejMuYd9SO4 | 1h 28m 18s | Diverse emotions
2 | Testimonials of Flor Huilca and Martha Flores | https://youtu.be/ez_4KGuirLE | 0h 31m 00s | Negative emotions
3 | Testimonials of Peru's qualification to the World Cup 2018 | https://youtu.be/BCKtag_eQdw | 0h 17m 11s | Positive emotions
4 | Report on the tragedy in Arequipa in 1996 | https://youtu.be/RV9M8pIg6PA | 0h 12m 47s | Negative emotions
5 | Frost report in Arequipa | https://youtu.be/7aLDKH6Lnp0 | 0h 20m 09s | Negative emotions
6 | Report on Huaynacotas in Arequipa | https://youtu.be/tTNFWbsCUi8 | 0h 30m 59s | Positive emotions
Segmentation
Segmentation is the procedure that receives a set of audios as input and, for each audio, generates segments as output. The initial step is the automated generation of segments using the PRAAT®17 tool, with the silences contained in the audio as the segmentation criterion. The next step verifies, for each automatically generated segment, the following eligibility criteria:
• Contain only the voice of one speaker. Segments containing the voice of more than one speaker, simultaneously or not, are discarded.
• Have a minimum duration of three seconds and a maximum duration of eleven seconds. Segments shorter than three seconds are discarded. Segments longer than eleven seconds can be subdivided into new segments as long as all the criteria are met.
• Have a phrase or word that expresses a single emotion. Segments where more than one emotion is perceived are discarded.
Fulfillment of these criteria was verified manually: each input segment was listened to in its entirety, one by one. As a final segmentation step, a sequential identifier was assigned to each segment that passed the eligibility criteria, so that the following activities could be performed without ambiguity. Table 2 summarizes the number of segments generated automatically and the number of segments that meet the eligibility criteria.
In all cases, the number of segments obtained manually is greater than the number generated automatically; this is because many of the segments generated by PRAAT were too long and had to be subdivided.
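The automatic step was performed with PRAAT. Purely to illustrate the silence-based criterion and the duration rules above, here is a minimal NumPy sketch: frames whose RMS energy exceeds a threshold count as speech, contiguous speech runs become candidate segments, and the 3 s / 11 s rules are then applied. The frame length and energy threshold are assumptions, not values from the paper.

```python
import numpy as np


def silence_segments(signal, sr, frame_ms=20, threshold=0.01,
                     min_s=3.0, max_s=11.0):
    """Sketch of silence-based segmentation with the duration criteria.

    Not the PRAAT procedure itself. Returns (start, end) sample
    indices of segments that satisfy the duration rules.
    """
    frame = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame
    rms = np.array([np.sqrt(np.mean(signal[i * frame:(i + 1) * frame] ** 2))
                    for i in range(n_frames)])
    voiced = rms > threshold

    # Collect contiguous voiced runs as candidate segments.
    runs, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i
        elif not v and start is not None:
            runs.append((start * frame, i * frame))
            start = None
    if start is not None:
        runs.append((start * frame, n_frames * frame))

    # Apply the eligibility rules on duration.
    out, max_len = [], int(max_s * sr)
    for s, e in runs:
        while (e - s) / sr > max_s:   # subdivide overly long segments
            out.append((s, s + max_len))
            s += max_len
        if (e - s) / sr >= min_s:     # discard too-short segments
            out.append((s, e))
    return out
```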
Annotation
Annotation is a procedure that receives the output of the segmentation procedure and generates a set of emotional attributes, a vector (valence, arousal, dominance), for each segment. The audio segments and their corresponding labels make up a corpus of n labeled samples S = {z_i}_{i=1}^{n}, where z_i = (x_i, y_i), x_i ∈ X, and y_i ∈ ℝ³. Annotation comprises two main activities: individual annotation and consensus. In the first activity, each annotator j generates a set of annotations A_j individually and independently. Then, the elements of the A_j sets are classified into two groups: those that match in all three emotional dimensions and those that differ in at least one dimension. The matching annotations become part of the labels of the final corpus. For the unmatched annotations, the final label of each segment is assigned the average value of its emotional attributes across annotators. A summary of matched and unmatched labels is presented in Table 3.
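The consensus rule described above can be sketched as follows; the dictionary-based bookkeeping is an illustration, not the authors' tooling.

```python
def consensus(ann_a, ann_b):
    """Merge two annotators' labels per segment.

    `ann_a` and `ann_b` map a segment identifier to a
    (valence, arousal, dominance) triple on the 1-5 SAM scale.
    Matching triples become final labels directly; for mismatches,
    the final label is the per-dimension average of both annotations.
    Returns (final_labels, number_of_matched_segments).
    """
    final, matched = {}, 0
    for seg, a in ann_a.items():
        b = ann_b[seg]
        if a == b:
            final[seg] = a
            matched += 1
        else:
            final[seg] = tuple((x + y) / 2 for x, y in zip(a, b))
    return final, matched
```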
The annotation of the audio segments was performed using the PRAAT tool. Each segment was labeled with the three emotional attributes according to the Self-Assessment Manikin (SAM) scale shown in Figure 1.
At the end of the consensus stage, a PRAAT script was used to save all the audio segments in WAV format with their respective labels. The emotional attribute values (valence, arousal, and dominance) were placed in the name of each WAV file.
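Since the attribute values are encoded in the WAV file names, a consumer of the corpus needs to parse them back out. The exact naming pattern is not specified above, so the sketch below assumes a hypothetical "V3_A4_D2_0001.wav" convention purely for illustration.

```python
from pathlib import Path


def parse_label(wav_path):
    """Recover (valence, arousal, dominance) from a segment file name.

    The 'V<v>_A<a>_D<d>_<id>.wav' pattern assumed here is
    hypothetical; adapt it to the actual naming used in the
    released corpus files.
    """
    v, a, d = Path(wav_path).stem.split("_")[:3]
    return int(v[1:]), int(a[1:]), int(d[1:])
```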
Corpus collection summary
Number of videos | 6
Total video length | 3h 20m 24s
Number of annotators | 2
Signal | voice

Corpus contents
Number of audio segments | 1559
Total segment length | 3h 5m 6s
Dimensions | valence, arousal, and dominance
Annotation scale | 1, 2, 3, 4, 5
Number of speakers | 50
Gender of speakers | 18 women and 32 men
Age range of speakers | adults
Figure 2. Histograms of the dimensional attributes in the corpus.
Limitations
Among the limitations of the corpus, we highlight the following:
• Size. The corpus has a duration of 3h 5m 6s, shorter than popular corpora such as IEMOCAP18 (twelve hours), MSP-IMPROV19 (nine hours), the Design and Evaluation of Adult Emotional Speech Corpus for Natural Environment21 (13,500 sentences), and the MSP-PODCAST corpus16 (twenty-seven hours); for comparison, EMO-DB20 contains twenty-two minutes. This aspect could affect its use in systems that require large amounts of training data. However, the limitation could be alleviated by applying data augmentation operations, such as time-stretching, speed and pitch shifting, and noise addition.
• Emotional balance. Although the design sought a balanced corpus, the result is not fully balanced: the corpus has some shortcomings in evenly covering the emotional space. This limitation could make classifiers less accurate when presented with cases falling in these sparsely covered regions of the emotional space.
• Two annotators. Using only two annotators allowed us to obtain a high Cronbach's alpha value; however, this does not mean that we are free of biases or that all the segments have been well annotated. Increasing the number of annotators remains as future work.
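The noise-addition operation mentioned under Size can be sketched as below; the Gaussian-noise model and the target-SNR parameterization are assumptions, and time-stretching or pitch shifting would additionally require a resampling library.

```python
import numpy as np


def add_noise(signal, snr_db, rng=None):
    """Additive white-noise augmentation at a target SNR in dB.

    Scales Gaussian noise so that signal power / noise power equals
    10**(snr_db / 10), then adds it to the input. A sketch of one of
    the augmentation operations named in the text, not the others
    (time-stretching, speed/pitch shifting).
    """
    rng = rng or np.random.default_rng(0)
    sig_power = float(np.mean(np.asarray(signal, dtype=float) ** 2))
    noise_power = sig_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=np.shape(signal))
    return signal + noise
```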
References
1. Park, C. Y. et al. K-EmoCon, a multimodal sensor dataset for continuous emotion recognition in naturalistic conversations. Sci. Data 7, 1–16 (2020).
2. Archer, M. S. Can humans and AI robots be friends? In Post-Human Futures, 132–152 (Routledge, 2021).
3. McStay, A. Emotional AI: The Rise of Empathic Media (Sage, 2018).
4. Williamson, B., Bayne, S. & Shay, S. The datafication of teaching in higher education: critical issues and perspectives. Teach. High. Educ. 25, 351–365, 10.1080/13562517.2020.1748811 (2020).
5. Yoon, J., Arik, S. & Pfister, T. Data valuation using reinforcement learning. In International Conference on Machine Learning, 10842–10851 (PMLR, 2020).
6. Pérez-Espinosa, H. et al. IESC-Child: an interactive emotional children's speech corpus. Comput. Speech & Lang. 59, 55–74 (2020).
7. Villada Zapata, J. & Chaves Castaño, L. Theoretical contributions derived from language research between 2000 and 2010: a review. Divers. Perspectivas en Psicología 8, 331–343 (2012).
8. Douglas-Cowie, E., Campbell, N., Cowie, R. & Roach, P. Emotional speech: towards a new generation of databases. Speech Commun. 40, 33–60, 10.1016/S0167-6393(02)00070-5 (2003).
9. David, E. M., Lourde, A., Cesar, G. F., Pascual, V. & Valentin, C. P. Caracterización acústica del acento basada en corpus: un enfoque multilingüe inglés-español [Corpus-based acoustic characterization of accent: a multilingual English-Spanish approach]. In International Conference on Machine Learning, 10 (Departamento de Informática, Universidad de Valladolid, España; Departamento de Filología Hispánica, Universidad Autónoma de Barcelona, España, 2011).
10. Laghari, M., Tahir, M. J., Azeem, A., Riaz, W. & Zhou, Y. Robust speech emotion recognition for Sindhi language based on deep convolutional neural network. In 2021 International Conference on Communications, Information System and Computer Engineering (CISCE), 543–548 (IEEE, 2021).
11. Zaheer, N., Ahmad, O. U., Ahmed, A., Khan, M. S. & Shabbir, M. SEMOUR: a scripted emotional speech repository for Urdu. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, 1–12 (2021).
12. Hansen, J. H. & Liu, G. Unsupervised accent classification for deep data fusion of accent and language information. Speech Commun. 78, 19–33 (2016).
13. Yang, X. et al. Joint modeling of accents and acoustics for multi-accent speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5 (IEEE, 2018).
14. Latif, S., Qayyum, A., Usman, M. & Qadir, J. Cross lingual speech emotion recognition: Urdu vs. western languages. In 2018 International Conference on Frontiers of Information Technology (FIT), 88–93 (IEEE, 2018).
15. Zehra, W., Javed, A. R., Jalil, Z., Khan, H. U. & Gadekallu, T. R. Cross corpus multi-lingual speech emotion recognition using ensemble learning. Complex & Intell. Syst. 1–10 (2021).
16. Lotfian, R. & Busso, C. Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings. IEEE Transactions on Affect. Comput. 10, 471–483 (2019).
17. Boersma, P. PRAAT, a system for doing phonetics by computer. Glot. Int. 5, 341–345 (2001).
18. Busso, C. et al. IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resour. Eval. 42, 335–359 (2008).
19. Busso, C. et al. MSP-IMPROV: an acted corpus of dyadic interactions to study emotion perception. IEEE Transactions on Affect. Comput. 8, 10.1109/TAFFC.2016.2515617 (2016).
20. Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W. & Weiss, B. A database of German emotional speech. In Interspeech 2005, vol. 5, 1517–1520, 10.21437/Interspeech.2005-446 (2005).
21. Jia, N., Zheng, C. & Sun, W. Design and evaluation of adult emotional speech corpus for natural environment. In 2020 12th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC), vol. 1, 53–56, 10.1109/IHMSC49165.2020.00020 (2020).