ESCorpus-PE: A speech emotional database in Spanish with Peruvian accent

Alvaro Cuno1,*, Alessandra Delgado1, Wilber Ramos Lovón1, and Harieth Bernedo1

1 Universidad Nacional de San Agustín de Arequipa, Departamento Académico de Ingeniería de Sistemas e Informática, Arequipa, Perú
* Corresponding author: Alvaro Cuno (acunopa@unsa.edu.pe)

ABSTRACT

Lately, there has been considerable progress in speech emotion recognition systems based on deep learning. However, many of them do not generalize to languages they were not trained on. One of the reasons for this limitation is the absence of corpora covering all existing human languages, dialects, and accents. This paper contributes to filling this gap by presenting ESCorpus-PE, which contains three hours, five minutes, and six seconds of speech from adult speakers of Spanish with a Peruvian accent, annotated with three emotional attributes: valence, arousal, and dominance. The speech data originates from YouTube videos with interactions in their natural form. Our database can be a valuable resource for researchers studying emotion recognition in multilingual settings.

Background & Summary


Emotions are an essential component of communication between human beings, and they are equally essential in communication between humans and machines1,2. For this reason, automatic emotion recognition systems are an active research topic in many areas of the scientific community3,4. Among the approaches that have received the most attention, largely because of the availability of speech data, are those that recognize emotions from speech using supervised deep learning. It is well known, however, that these techniques depend strongly on the existence of data for training, validation, and testing, and that large-scale, diverse, and high-quality data yields better-performing models5.

Currently, there are several challenges related to the availability of speech datasets (also known as corpora) usable in intelligent systems, particularly those intended for emotion recognition: for example, recognizing the emotions of speakers of different ages6 or recognizing the voices of speakers of different languages7. Concerning the language problem, corpora fall into two main groups: a majority group in which several corpora of the same language can be found (English or German)8, and a minority group made up of corpora in languages such as Spanish9, Hebrew, Russian, Korean, Japanese, Sindhi10, and Urdu11. One aspect that increases the complexity of the language challenge is the variety of dialects that exist within the same language12 and the differences in the speakers' accents13; a Spanish speaker from Spain is not directly comparable to a Spanish speaker from Peru. Studies by Hansen et al.12 and Yang et al.13 have shown that variation in speech due to dialect significantly affects speech systems, since each dialect is differentiated by acoustic features such as vowel and consonant phonetics, rhythm, and prosodic characteristics, as well as grammar and vocabulary. These characteristics are what usually cause a mismatch in the performance of machine learning systems.
For the reasons above, an oral corpus in Spanish with a Peruvian accent is justified. Machine learning systems from different developers could use this corpus to achieve greater accuracy when recognizing the emotions of Peruvian speakers. Likewise, it could also be used in cross-lingual speech emotion recognition research14,15. It is essential to point out that, traditionally, the construction of a specialized corpus for emotion detection is a complex task5, as it requires teams of expert, previously trained actors, transcriptionists, annotators, and evaluators, which is very costly both economically and in time. Moreover, the acted modality generates overemphasized expressions that differ subtly from the behaviors observed during daily-life interactions16. An alternative is to create a naturalistic corpus with balanced emotional content. This paper presents ESCorpus-PE, a freely accessible database obtained from realistic interaction scenarios between Spanish speakers with a Peruvian accent. It contains audio segments labeled with three emotional attributes: valence (negative to positive), arousal (calm to active), and dominance (dominated to dominating). This corpus contributes to the current state of the art in automatic emotion recognition and, to the best of our knowledge, is the first of its kind.
Methods
This section introduces our research method in five phases: platform selection, audio/video selection, pre-processing, segmentation, and annotation.

Selection of Platform
There are several audio platforms on the Internet, but not all of them were feasible for our project. We established two criteria to determine the most appropriate one: diversity and accessibility. In this work, diversity refers to platforms that offer a wide variety of available audio/video (including a diversity of topics and speakers), while accessibility refers to platforms that provide unrestricted access to audio/video and do not limit searches. With these two criteria, three candidate platforms were identified: Spotify, Ivoox, and YouTube. The platform that best fit our needs was YouTube, since it has a greater diversity of videos, topics, and speakers and allows broader searches than Spotify and Ivoox.

Audio/Video Selection
Once the platform was selected, the candidate audios/videos were identified by applying the following criteria:

• Inclusion criteria:

  – Select audios/videos with a Creative Commons license or with explicit consent for unrestricted use.
  – Choose audios/videos using geographical keywords found in the audio description: Peru, Lima, Arequipa, Cuzco.
  – Select audios/videos in Spanish with a Peruvian accent in one of its dialects from Lima, Arequipa, Cuzco, Piura, and other Peruvian departments.
  – Select audios that present emotional variety: happy, sad, calm, active, and dominant.
  – Maintain diversity, including audios/videos that cover diverse topics and diverse speakers, both men and women, and in which intense emotions of the participants are present.
  – Select audios/videos with natural and spontaneous content; the audios/videos must not be the product of acted performances.

• Exclusion criteria:

  – Audios/videos that do not meet the inclusion criteria are excluded.
  – Audios/videos with an accent other than Peruvian are excluded.
  – Audios/videos with neutral emotions are excluded.
  – Audios/videos of acted recordings, such as films, series, or programs previously rehearsed by actors, are excluded.

With these criteria, the videos listed in Table 1 were selected.
Videos 3, 4, 5, and 6 have noise in some of their segments. In video 3, happiness appears at its maximum expression while the participants watch the Peru vs. New Zealand soccer match in which Peru qualified for the 2018 World Cup, so seventy-nine segments contain noise. In video 4, the 1996 tragedy in Arequipa is narrated as part of a journalistic report, so seventeen segments contain noise. Video 5 comprises two reports about the cold weather in Arequipa and Puno, and one hundred and one of its segments have instrumental music in the background. Finally, in video 6, a report on one of the traditions of the Huaynacotas region of Arequipa, sixty-eight segments have musical backgrounds.

Pre-processing
Pre-processing is the procedure by which the videos are prepared for segmentation and annotation. The tool used for this activity was Audacity® (https://www.audacityteam.org/). It consists of converting the videos to WAV audio files, single-channel, with 16-bit PCM storage precision.
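For readers who prefer a scripted pipeline, the same conversion can be sketched in Python with the pydub library (this is an illustrative alternative, not the procedure used by the authors, who worked in Audacity; the file names are placeholders):

# Sketch: convert a downloaded video/audio file to a mono, 16-bit PCM WAV file.
# pydub delegates decoding to ffmpeg, which must be installed; file names are illustrative.
from pydub import AudioSegment

source = "video1.mp4"               # placeholder input file
audio = AudioSegment.from_file(source)
audio = audio.set_channels(1)       # single channel (mono)
audio = audio.set_sample_width(2)   # 2 bytes per sample = 16-bit PCM
audio.export("audio1.wav", format="wav")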

Video ID | Name | URL | Duration | Emotional profile
1 | Presidential debate 2021 | https://youtu.be/4ejMuYd9SO4 | 1h 28m 18s | Diverse emotions
2 | Testimonials of Flor Huilca and Martha Flores | https://youtu.be/ez_4KGuirLE | 0h 31m 00s | Negative emotions
3 | Testimonials of Peru's qualification to the World Cup 2018 | https://youtu.be/BCKtag_eQdw | 0h 17m 11s | Positive emotions
4 | Report on the tragedy in Arequipa in 1996 | https://youtu.be/RV9M8pIg6PA | 0h 12m 47s | Negative emotions
5 | Frost report in Arequipa | https://youtu.be/7aLDKH6Lnp0 | 0h 20m 09s | Negative emotions
6 | Report on Huaynacotas in Arequipa | https://youtu.be/tTNFWbsCUi8 | 0h 30m 59s | Positive emotions

Table 1. Selected videos.

Segmentation
Segmentation is the procedure that receives a set of audio files as input and, for each audio file, generates segments as output. The initial step is the automated generation of segments using the PRAAT®17 tool, with the silences contained in the audio as the segmentation criterion. The next step is the verification, for each automatically generated segment, of the following eligibility criteria:

• Contain only the voice of one speaker. Segments containing the voice of more than one speaker, simultaneously or not, are discarded.

• Have a minimum duration of three seconds and a maximum duration of eleven seconds. Segments shorter than three seconds are discarded. Segments longer than eleven seconds can be subdivided into new segments as long as all the criteria are met.

• Contain a phrase or word that expresses a single emotion. Segments in which more than one emotion is perceived are discarded.

These criteria were verified manually: a person listened to each input segment, one by one, in its entirety. As a final segmentation step, a sequential identifier was assigned to each segment that passed the eligibility criteria, so that the following activities could be performed without ambiguity. Table 2 summarizes the number of segments generated automatically and the number of segments that meet the eligibility criteria.

Audio ID | Segments generated by the PRAAT tool | Segments meeting eligibility criteria
1 | 265 | 639
2 | 94 | 210
3 | 52 | 199
4 | 37 | 116
5 | 57 | 231
6 | 88 | 166

Table 2. Number of segments generated automatically and manually.

In all cases, the number of segments that meet the eligibility criteria is greater than the number generated automatically; this is because many of the segments generated by PRAAT were too long and had to be subdivided.
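As an illustration only, the silence-based splitting step can be approximated in Python with pydub's split_on_silence, followed by the 3–11 second duration rule from the eligibility criteria. The authors used PRAAT for this step, and the threshold values below are assumptions, not the parameters used to build the corpus:

# Sketch: split an audio file on silences and keep segments of 3-11 seconds.
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_wav("audio1.wav")
chunks = split_on_silence(audio,
                          min_silence_len=500,   # ms of silence that ends a segment (assumed)
                          silence_thresh=-40,    # dBFS level treated as silence (assumed)
                          keep_silence=100)      # ms of padding kept around each chunk

kept = []
for chunk in chunks:
    seconds = len(chunk) / 1000.0                # pydub lengths are in milliseconds
    if seconds < 3:
        continue                                 # too short: discard
    if seconds > 11:
        continue                                 # too long: would require manual subdivision
    kept.append(chunk)

for i, chunk in enumerate(kept, start=1):
    chunk.export(f"Audio1_{i}.wav", format="wav")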

Annotation
Annotation is a procedure that receives the output of the segmentation procedure and generates a set of emotional attributes, a vector (valence, arousal, dominance), for each segment. The audio segments and their corresponding labels make up a corpus of $n$ labeled samples $S = \{z_i\}_{i=1}^{n}$, where $z_i = (x_i, y_i)$, $x_i \in X$, and $y_i \in \mathbb{R}^3$. The annotation comprises two main activities: individual annotation and consensus. In the first activity, each annotator $j$ generates a set of annotations $A_j$ individually and independently. Then, the elements of the $A_j$ sets are classified into two groups: those that match in all three emotional dimensions and those that differ in at least one dimension. The matching annotations become part of the labels of the final corpus. For the unmatched annotations, the average value of each of the segment's emotional attributes is calculated and assigned as the final label. A summary of matched and unmatched labels is presented in Table 3:

Audio ID | Matching labels on three emotional attributes | Matching labels on two emotional attributes | Matching labels on one emotional attribute | Unmatched labels
1 | 4 (0.63%) | 40 (6.26%) | 268 (41.94%) | 332 (51.96%)
2 | 9 (4.29%) | 33 (15.71%) | 84 (40%) | 90 (42.86%)
3 | 11 (5.53%) | 31 (15.58%) | 100 (50.25%) | 63 (31.66%)
4 | 6 (5.17%) | 23 (19.83%) | 60 (51.72%) | 27 (23.28%)
5 | 7 (3.04%) | 72 (31.3%) | 102 (44.35%) | 49 (21.3%)
6 | 0 (0%) | 40 (24.1%) | 84 (50.6%) | 42 (25.3%)

Table 3. Annotation results.

The annotation of the audio segments was performed using the PRAAT tool. Each segment was labeled with the three emotional attributes according to the Self-Assessment Manikin (SAM) scale shown in Figure 1.

Figure 1. The Self-Assessment Manikin scale.

At the end of the consensus stage, a PRAAT script was used to save all the audio segments in WAV format with their respective labels. The emotional attributes (the valence, arousal, and dominance values) were encoded in the name of each WAV file.
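To make the consensus rule concrete, the following Python sketch merges two annotators' (valence, arousal, dominance) vectors per segment: identical vectors are kept as-is, otherwise the per-dimension average is used as the final label. This is a simplified reading of the procedure described above; the variable names and the rounding of averages are assumptions.

# Sketch: consensus between two annotators.
# Each annotation is a (valence, arousal, dominance) tuple on the 1-5 SAM scale.
annotations_a = {"Audio1_1": (2, 2, 4), "Audio1_2": (3, 4, 2)}   # illustrative values
annotations_b = {"Audio1_1": (2, 2, 4), "Audio1_2": (4, 4, 3)}

final_labels = {}
for segment_id, label_a in annotations_a.items():
    label_b = annotations_b[segment_id]
    if label_a == label_b:
        # Matching annotation: adopt it directly as the final label.
        final_labels[segment_id] = label_a
    else:
        # Unmatched annotation: average each emotional attribute
        # (rounding back to the 1-5 scale is an assumption of this sketch).
        final_labels[segment_id] = tuple(
            round((a + b) / 2) for a, b in zip(label_a, label_b)
        )

print(final_labels)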

Data Records

Corpus summary
ESCorpus-PE contains 3h 5m 6s of audio divided into 1559 segments of natural speech. Table 4 summarizes the results of the audio compilation and the contents of the corpus.

Corpus collection summary
Number of videos | 6
Total videos length | 3h 20m 24s
Number of annotators | 2
Signal | Voice

Corpus contents
Number of audio segments | 1559
Total segments length | 3h 5m 6s
Dimensions | Valence, Arousal, and Dominance
Annotation scale | 01, 02, 03, 04, 05
Number of speakers | 50
Gender of speakers | 18 women and 32 men
Age range of speakers | Adults

Table 4. Summary of data collection results and the dataset.

Corpus contents

The corpus is freely available in the DRYAD repository (https://datadryad.org/stash/share/4PWU2tgtarFKkSLr3diVswoGAlILeJiZof-v0K_AwDs) as a ZIP package. The 1559 audio segments in WAV format are organized in six folders corresponding to the six source videos. The annotation for each segment is encoded in the name of the corresponding WAV file following the pattern AudioN_N-VXX-AYY-DZZ.wav. The file name is divided into five parts separated by the underscore (_) and dash (-) symbols, where:

• in AudioN, N indicates the numeric identifier of the source audio, which ranges from 1 to 6;
• the second N indicates the number of the audio segment, which ranges from 1 to the number of segments in that audio;
• in VXX, V refers to the emotional dimension of valence, and XX indicates the numerical value of the valence of that segment, which can be between 01 and 05;
• in AYY, A refers to the emotional dimension of arousal, and YY indicates the numerical value of the arousal of that segment, which can be between 01 and 05;
• in DZZ, D refers to the emotional dimension of dominance, and ZZ indicates the numerical value of the dominance of that segment, which can be between 01 and 05.

For example, the file name Audio1_1-02-02-04 indicates that this segment belongs to audio 1, that it is the first segment of that audio, and that its emotional dimensions are valence 02 (somewhat unhappy), arousal 02 (somewhat calm), and dominance 04 (somewhat dominating).
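The naming scheme can be parsed programmatically. The sketch below assumes the stated pattern AudioN_N-VXX-AYY-DZZ.wav; note that the example above omits the V/A/D letters, so the regular expression may need adjusting to match the files as actually distributed.

# Sketch: recover source audio id, segment number, and V/A/D labels from a file name.
import re

PATTERN = re.compile(r"Audio(\d+)_(\d+)-V(\d{2})-A(\d{2})-D(\d{2})\.wav$")

def parse_segment_name(filename):
    match = PATTERN.search(filename)
    if match is None:
        raise ValueError(f"Unexpected file name: {filename}")
    audio_id, segment_no, valence, arousal, dominance = map(int, match.groups())
    return {"audio": audio_id, "segment": segment_no,
            "valence": valence, "arousal": arousal, "dominance": dominance}

print(parse_segment_name("Audio1_1-V02-A02-D04.wav"))
# {'audio': 1, 'segment': 1, 'valence': 2, 'arousal': 2, 'dominance': 4}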

Technical Validation

Emotional diversity
The emotional diversity of the corpus content was evaluated. Figure 2 presents a histogram for each dimensional attribute. As can be seen, the distribution of emotions is not uniform in any of the three dimensions; valence shows the widest emotional spread, while arousal leans predominantly to the left and dominance to the right.
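Per-dimension histograms of this kind can be reproduced directly from the file names. A sketch assuming the naming pattern described in the Data Records section and a hypothetical local copy of the corpus under the folder ESCorpus-PE:

# Sketch: plot valence/arousal/dominance histograms from the segment file names.
import re
from pathlib import Path
import matplotlib.pyplot as plt

PATTERN = re.compile(r"-V(\d{2})-A(\d{2})-D(\d{2})\.wav$")
values = {"valence": [], "arousal": [], "dominance": []}

for wav in Path("ESCorpus-PE").rglob("*.wav"):     # hypothetical local path
    match = PATTERN.search(wav.name)
    if match:
        v, a, d = map(int, match.groups())
        values["valence"].append(v)
        values["arousal"].append(a)
        values["dominance"].append(d)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, (name, vals) in zip(axes, values.items()):
    ax.hist(vals, bins=range(1, 7), align="left", rwidth=0.8)
    ax.set_title(name)
    ax.set_xlabel("SAM value (1-5)")
plt.tight_layout()
plt.show()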

Annotator agreement

This section reports the agreement between the individual annotations in the corpus using Cronbach's alpha, calculated over each emotional dimension. Table 5 presents the alpha value for each dimension. In all three cases, the value is greater than 0.9, which means that the annotations have relatively high internal consistency.

Valence | Arousal | Dominance
0.9283 | 0.9995 | 0.9840

Table 5. Cronbach's alpha for each dimension.
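For reference, Cronbach's alpha over two annotators can be computed per dimension as sketched below; the ratings array is illustrative, and the formula is the standard α = k/(k−1) · (1 − Σ item variances / total-score variance), with the annotators treated as items:

# Sketch: Cronbach's alpha over k annotators (here k = 2) for one emotional dimension.
import numpy as np

def cronbach_alpha(ratings):
    """ratings: array of shape (n_segments, k_annotators) with SAM scores 1-5."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]
    item_variances = ratings.var(axis=0, ddof=1)       # variance of each annotator's scores
    total_variance = ratings.sum(axis=1).var(ddof=1)   # variance of the summed scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Illustrative valence ratings from two annotators for five segments.
valence = [[2, 2], [4, 4], [3, 4], [5, 5], [1, 2]]
print(round(cronbach_alpha(valence), 4))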

Figure 2. Histograms of the dimensional attributes in the corpus.

Usage Notes

The corpus is publicly available in our DRYAD repository. No request is required to download the data.

Potential applications

ESCorpus-PE can be used to train, validate, and test supervised emotion recognition systems for Spanish-language speakers with a Peruvian accent. Additionally, it can be used to test cross-language emotion recognition systems. The dimensional annotation allows the study of emotions beyond the basic ones; it could even be used to determine affective states.

Limitations
Among the limitations of the corpus, we highlight the following:

• Size. The corpus has a duration of 3h 5m 6s, which is shorter than the most popular corpora, for example, IEMOCAP18 (twelve hours), MSP-IMPROV19 (nine hours), EMO-DB20 (twenty-two minutes), the Adult Emotional Speech Corpus for Natural Environment21 (13,500 sentences), and the MSP-PODCAST corpus16 (twenty-seven hours). This aspect could affect its use in systems that require a large amount of training data. However, this limitation could be alleviated by applying data augmentation operations, such as time stretching, speed and pitch shifting, and noise injection (see the sketch after this list).

• Emotional balance. Although the design sought to create a balanced corpus, the result is not fully balanced: the corpus has some shortcomings in evenly filling the emotional space. This limitation could make classifiers less accurate when presented with cases belonging to these voids in the emotional space.

• Two annotators. Using only two annotators allowed us to obtain a high Cronbach's alpha value; however, this does not mean that the annotations are free of bias or that all the segments have been well annotated. Increasing the number of annotators is a potential task for future work.
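As an illustration of the augmentation operations mentioned above (not part of the corpus construction), a librosa-based sketch follows; the stretch factor, pitch steps, noise level, and output file names are arbitrary example values:

# Sketch: simple waveform-level augmentations for one corpus segment.
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("Audio1_1-V02-A02-D04.wav", sr=None)     # keep original sample rate

stretched = librosa.effects.time_stretch(y, rate=1.1)          # ~10% faster
shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)     # up two semitones
noisy = y + 0.005 * np.random.randn(len(y))                    # additive Gaussian noise

sf.write("segment_stretched.wav", stretched, sr)
sf.write("segment_pitch_up.wav", shifted, sr)
sf.write("segment_noisy.wav", noisy, sr)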

Code availability

No custom computer code is required to use the corpus. However, a GitHub repository (https://github.com/Alessandra-UNSA/ESCorpus-PE) has been created containing material that supports the reproducibility of the corpus creation process, such as the videos, a script for automatic segmentation, a script to assign file names, and Excel files with intermediate results.

References
1. Park, C. Y. et al. K-EmoCon, a multimodal sensor dataset for continuous emotion recognition in naturalistic conversations. Sci. Data 7, 1–16 (2020).
2. Archer, M. S. Can humans and AI robots be friends? In Post-Human Futures, 132–152 (Routledge, 2021).
3. McStay, A. Emotional AI: The rise of empathic media (Sage, 2018).
4. Williamson, B., Bayne, S. & Shay, S. The datafication of teaching in higher education: critical issues and perspectives. Teach. High. Educ. 25, 351–365, 10.1080/13562517.2020.1748811 (2020).
5. Yoon, J., Arik, S. & Pfister, T. Data valuation using reinforcement learning. In International Conference on Machine Learning, 10842–10851 (PMLR, 2020).
6. Pérez-Espinosa, H. et al. IESC-Child: An interactive emotional children's speech corpus. Comput. Speech & Lang. 59, 55–74 (2020).
7. Villada Zapata, J. & Chaves Castaño, L. Theoretical contributions derived from language research between 2000 and 2010: a review. Divers. Perspectivas en Psicología 8, 331–343 (2012).
8. Douglas-Cowie, E., Campbell, N., Cowie, R. & Roach, P. Emotional speech: Towards a new generation of databases. Speech Commun. 40, 33–60, 10.1016/S0167-6393(02)00070-5 (2003).
9. David, E. M., Lourde, A., Cesar, G. F., Pascual, V. & Valentin, C. P. Caracterización acústica del acento basada en corpus: un enfoque multilingüe inglés-español. In International Conference on Machine Learning, 10 (Departamento de Informática, Universidad de Valladolid, España; Departamento de Filología Hispánica, Universidad Autónoma de Barcelona, España, 2011).
10. Laghari, M., Tahir, M. J., Azeem, A., Riaz, W. & Zhou, Y. Robust speech emotion recognition for Sindhi language based on deep convolutional neural network. In 2021 International Conference on Communications, Information System and Computer Engineering (CISCE), 543–548 (IEEE, 2021).
11. Zaheer, N., Ahmad, O. U., Ahmed, A., Khan, M. S. & Shabbir, M. SEMOUR: A scripted emotional speech repository for Urdu. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, 1–12 (2021).
12. Hansen, J. H. & Liu, G. Unsupervised accent classification for deep data fusion of accent and language information. Speech Commun. 78, 19–33 (2016).
13. Yang, X. et al. Joint modeling of accents and acoustics for multi-accent speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5 (IEEE, 2018).
14. Latif, S., Qayyum, A., Usman, M. & Qadir, J. Cross lingual speech emotion recognition: Urdu vs. western languages. In 2018 International Conference on Frontiers of Information Technology (FIT), 88–93 (IEEE, 2018).
15. Zehra, W., Javed, A. R., Jalil, Z., Khan, H. U. & Gadekallu, T. R. Cross corpus multi-lingual speech emotion recognition using ensemble learning. Complex & Intell. Syst. 1–10 (2021).
16. Lotfian, R. & Busso, C. Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings. IEEE Transactions on Affect. Comput. 10, 471–483 (2019).
17. Boersma, P. PRAAT, a system for doing phonetics by computer. Glot. Int. 5, 341–345 (2001).
18. Busso, C. et al. IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resour. Eval. 42, 335–359 (2008).
19. Busso, C. et al. MSP-IMPROV: An acted corpus of dyadic interactions to study emotion perception. IEEE Transactions on Affect. Comput. 8, 1–1, 10.1109/TAFFC.2016.2515617 (2016).
20. Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W. & Weiss, B. A database of German emotional speech. In Interspeech 2005, vol. 5, 1517–1520, 10.21437/Interspeech.2005-446 (2005).
21. Jia, N., Zheng, C. & Sun, W. Design and evaluation of adult emotional speech corpus for natural environment. In 2020 12th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC), vol. 1, 53–56, 10.1109/IHMSC49165.2020.00020 (2020).


Author contributions statement

A.C. conceived the idea, designed the data collection and the technical validation, wrote the manuscript, and advised the overall project. A.D. conceived the idea, designed, prepared, and conducted the data collection, pre-processed and constructed the corpus, and revised the manuscript. W.R.L. supervised the dataset design and the data collection, advised the overall project, and revised and verified the manuscript. H.B. performed the data collection, constructed and pre-processed the collected dataset, performed the technical validation, and revised the manuscript.

Competing interests

The authors declare no competing interests.
