
Received September 20, 2018, accepted October 31, 2018, date of publication November 13, 2018, date of current version December 19, 2018.

Digital Object Identifier 10.1109/ACCESS.2018.2881096

Evaluation of an Arabic Speech Corpus of Emotions: A Perceptual and Statistical Analysis

ALI HAMID MEFTAH¹, YOUSEF AJAMI ALOTAIBI¹, AND SID-AHMED SELOUANI² (Senior Member, IEEE)
¹College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
²Information Management Department, Université de Moncton, campus de Shippagan, Shippagan, NB E8S 1P6, Canada
Corresponding author: Ali Hamid Meftah (ameftah@ksu.edu.sa)
This work was funded by the Deanship of Scientific Research at King Saud University through research group No. RG-1439-033.

ABSTRACT The processing of emotion has a wide range of applications in many different fields and
has become the subject of increasing interest and attention for many speech and language researchers.
Speech emotion recognition systems face many challenges. One of these is the degree of naturalness of the
emotions in speech corpora. To prove the ability of speakers to accurately emulate emotions and to check
whether listeners could identify the intended emotion, a human perception test was designed for a new
emotional speech corpus. This paper presents an exhaustive statistical and perceptual investigation of the
Arabic emotional speech corpus (KSUEmotions) of King Saud University, approved by the Linguistic Data
Consortium. The KSUEmotions corpus was built in two phases and involved 23 native speakers (10 males
and 13 females) to emulate the following five emotions: neutral, sadness, happiness, surprise, and anger. Nine
listeners participated in a blind and randomly structured human perceptual test to assess the validity of
the intended emotions. Statistical tests were used to analyze the effects of speaker gender, reviewer (listener)
gender, emotion type, sentence length, and the interaction between these factors. Conducted statistical tests
included the two-way analysis of variance, normality, chi-square, Bonferroni, Tukey, and Mann–Whitney
U tests. One of the outcomes of the study is that the speaker gender, emotion type, and interaction between
emotion type and speaker gender yield significant effects on the emotion perception in this corpus.

INDEX TERMS Corpus, modern standard Arabic, perceptual test, speech emotion recognition, statistical
evaluation.

I. INTRODUCTION
In human communication, the role of nonlinguistic information is very important, and it may be observed visually in videos in the form of facial expressions, in speech in the form of expression of emotions, and in written text in the form of punctuation [1], [2]. Lately, an enormous amount of attention has been paid to the investigation of speech emotion data [3]. Accordingly, the interactions between computers and humans continue to increase. Correspondingly, the quality of these extensive man–machine interactions is very important and requires proper attention. Good-quality interactions between humans and machines are thus a primary motivation for recognizing emotion from speech, to ensure that this type of communication is efficient.

Human emotions lack a consensual definition. They could be communicated as physiological variations in the body (e.g., anxious muscles, moisture, dry mouth) attributed to distinct behaviors (e.g., facial expressions, walloping, escaping), or can be expressed as a mixture of psychological and physiological manifestations.

A chief difficulty in speech emotion processing is the nature of different types of emotion and the lack of consensual definitions of emotions, owing to the difficulty in distinguishing the different states (i.e., moods, attitudes, interpersonal stances, and even affective personality traits) by their relative intensities and durations [4]. Scherer [5] defined emotions as episodes of coordinated variations in many components, including neurophysiological reaction, motor expression, subjective feeling, and cognitive process, in response to exterior and/or interior events and stimuli.

Research related to emotional taxonomy is generally based on two assumptions: the first is that a discrete number of


TABLE 1. Summary of basic emotions used by researchers [6].

FIGURE 1. Space model of emotions (two dimensions) [48].

basic emotions are representative of all other emotions, and the second considers a continuum of emotions and attempts to map all emotions in a two- or three-dimensional space [2]. Ortony and Turner [6] summarized the list of basic emotions used by researchers (Table 1).

Regarding the second assumption, the characterization of emotions can be viewed as a three-class continuous space model referred to as the arousal, valence, and power space. In this model, active and passive emotions are represented by the arousal dimension, positive and negative emotions (i.e., from anger to happiness) are represented by the valence dimension, and the power dimension represents the sense or degree of power of control over one's emotions [2], [7]. Fig. 1 shows a number of these emotions distributed across the two-dimensional space.


Real-time emotion recognition from speech is a highly challenging task, where noise and reverberations are two typically encountered problems [8]. Additionally, features that are commonly extracted from speech waveforms (e.g., pitch and energy contours) are considerably affected by the variations between speakers and the rate of speech generation [9]. Moreover, features are not equally effective in emotional classification. Therefore, deciding which relevant features need to be extracted is still a challenge [3]. Another important issue is the degree of naturalness of the recorded emotions.

An important step required to represent the emotional state of the speaker is to perform an accurate feature extraction from the speech signal. These speech features can be divided into the following four categories: i) continuous features that contain pitch-related, energy-related, formant, timing, and articulation features; ii) qualitative features that include the level of voicing and pitch, temporal structures, and boundaries between phrases, phonemes, words, and features; iii) spectral features that can include the mel-frequency cepstrum, linear prediction, and the linear prediction cepstral coefficients; and iv) nonlinguistic vocalizations that are nonverbal phenomena [10].

Relevant acoustic and prosodic features are used by various classification techniques to perform speech emotion recognition. Popular techniques are based on, among others, Hidden Markov Models (HMMs), Support Vector Machines (SVMs), Gaussian Mixture Models (GMMs), artificial neural networks, and deep belief networks [11]–[13].

In this study, we present a comprehensive statistical analysis that aims to provide a useful tool to assess emotional speech data and hence to draw best practices for designing emotional speech corpora. The assessment presented in this study is carried out on a new Arabic corpus of emotional speech that incorporated a blind human perceptual test to analyze and evaluate the relevant speech recordings. The speech corpus was designed with the help of speakers who emulated the targeted emotions. To ensure adequate performance quality of the recorded utterances, a blind and randomly scrambled human perceptual test was conducted. The results of this perceptual test were analyzed in terms of the effects of speaker gender, reviewer gender, emotion type, and the interactions between these factors.

The remaining parts of this study are organized as follows. The emotional speech corpora evaluation is presented in Section II, the hypotheses testing and statistics overview is outlined in Section III, and the KSUEmotions corpus design in Section IV. The human perceptual test results and analyses are presented in Section V. Statistical analyses outcomes and their discussion are presented in Section VI. Finally, Section VII concludes the study and summarizes the major findings.

II. EMOTIONAL SPEECH CORPORA EVALUATION
Spontaneous, acted, and elicited speech corpora are three types of speech corpora that have been used to study emotions based on speech. Spontaneous speech corpora are very difficult to obtain. To produce acted speech corpora, a professional actor reads written texts. Elicited speech corpora are created by asking the speaker to imagine a particular situation intended to evoke a target emotion [10].

The purpose of the evaluation of an emotion database is to prove the ability of the speaker to accurately emulate the emotions, thus allowing the assessment of the validity of the database and the determination of whether listeners could identify the intended emotion.

Depending on the purpose and the type of emotional speech corpus, the number of listeners involved in the evaluation process may be different. In [14], 20 listeners were used to perform a Danish emotional listening speech test which was associated with the emotions of neutral, sadness, happiness, surprise, and anger. In [15], different phrases were spoken to emulate different emotional states by a native Swedish male speaker judged by listeners of different nationalities (35 Swedish, 12 English, 23 Finnish, and 23 Spanish). In [16], the emotion corpus database contained 700 short utterances that expressed five emotions, including happiness, fear, anger, and sadness, which were produced by thirty speakers. The recorded file utterances were evaluated by 23 listeners. Twenty of these listeners participated in the recording. In [17], eight actors, four from each gender, emulated seven emotions in the Spanish language: joy, fury, sadness, desire, fear, surprise, and disgust. The recorded file was evaluated by 1054 persons based on a perceptual test. In [18], four emotions, namely, anger, fear, joy, and sadness, were emulated in Chinese by nine speakers and were evaluated by four listeners. In [19], six emotions were considered, namely, anger, fear, sadness, disgust, joy, and surprise, for the Slovenian, Spanish, English, and French languages. Each language was recorded by two professional actors, one male and one female, except for the case of the English language where two males and one female speaker were recorded. To evaluate the recorded speech file, 16 nonprofessional listeners were chosen to perform a perception test in Spanish and Slovenian. In [18], 30 students with normal hearing participated in a test to evaluate the following emotions in the Serbian language: neutral, anger, happiness, sadness, and fear. The data of six actors, three from each gender, were recorded. In [20], six nonprofessional native Mandarin language speakers were asked to emulate six emotions, namely, anger, fear, disgust, sadness, joy, and surprise. To determine the ability of listeners to correctly classify the emotional modes of the recorded files, three listeners were engaged in the subjective tests. A perception test with 20 listeners was conducted to evaluate a recorded speech file for the Berlin Emotional Speech Database [21]. In this corpus, five actors and five actresses contributed to produce the following seven emotions: happiness, neutral, boredom, disgust, fear, sadness, and anger. In [22], 13 listeners contributed to the perception tests to evaluate the utterance file that was recorded by seven speakers (3 women and 4 men) who emulated the following emotions in the Hungarian language: happiness, sadness,


anger, surprise, scorn/disgust, nervousness/excitement, and neutral.

III. HYPOTHESES TESTING AND STATISTICS OVERVIEW
Hypotheses testing methods are some of the pertinent and beneficial approaches of inferential statistics. These methods are used to draw conclusions about the values of a few population parameters, or about the shape of a probability distribution, inferred from data samples [23]. In this section, we provide definitions of the main parameters used throughout this study [23]–[25]:

Mean of the observations (x̄): If x_1, x_2, ..., x_n are the sample values, then the sample mean is

\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i    (1)

Sample variance (S²): Let x_1, x_2, ..., x_n be the set of sample observations. The sample variance is denoted by S² and is defined by

S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}    (2)

Standard deviation (S) is the square root of the variance:

S = \sqrt{S^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}    (3)
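As a minimal illustration of Equations (1)–(3), the following Python sketch computes the sample mean, variance, and standard deviation for a small set of perception scores. The data values are invented for illustration only and do not come from the KSUEmotions corpus.

```python
import numpy as np

# Hypothetical listener scores (percentages) for one emotion; values are invented.
x = np.array([82.0, 91.5, 78.0, 88.0, 95.0, 84.5])

n = len(x)
mean = x.sum() / n                       # Eq. (1): sample mean
var = ((x - mean) ** 2).sum() / (n - 1)  # Eq. (2): sample variance (n - 1 in the denominator)
std = np.sqrt(var)                       # Eq. (3): sample standard deviation

# The same quantities via NumPy's built-ins (ddof=1 selects the n - 1 denominator).
assert np.isclose(var, x.var(ddof=1))
assert np.isclose(std, x.std(ddof=1))
print(mean, var, std)
```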
A population is an arbitrary collection of elements, and a sample is a subset of it. Alternatively, a population is the set of all possible events, or the set of all potential observations, i.e., emotions, speakers, etc.

A hypothesis is a declaration or a statement about a characteristic of the selected population.

A test of hypotheses is a test that follows a process which utilizes trial data to choose between two opposing hypotheses regarding a population characteristic.

The null hypothesis (H0) is a declaration regarding the population characteristic which is initially presumed to be true. It is considered as the "starting point" of the investigation.

The alternative hypothesis (H1) stipulates that the population assumes a value or a status that is different in some way from the value/status assigned to it under the null hypothesis.

The hypothesis test is performed using the test statistic. A test statistic is a function of sample data based on which a decision is made on whether to reject or accept H0.

The p-value (also known as the observed significance level) is the probability, assuming that H0 is true, of obtaining a test statistic value at least as inconsistent with H0 as what actually resulted.

Two types of errors can occur in hypothesis testing:
• Type I error, which is termed the level of significance of the test (α). In this error, we reject H0 when H0 is true.
• Type II error, which is failing to reject H0 when H0 is false. We also define β as the probability of committing a type II error.

The two types of errors are frequently resolved within the region defined by α = 0.05 (which is the most common value), subject to a constraint such that the sample size is sufficient and representative to cause β to decrease.

FIGURE 2. Rejection and acceptance regions.

A. ELEMENTS OF A HYPOTHESIS TEST
As illustrated in Fig. 3, the sequential steps to test a hypothesis are the following: (1) state the null hypothesis (H0); (2) state the alternative hypothesis (H1); (3) specify a suitable test statistic (T.S) for the null and alternative hypotheses; (4) determine the significance level α (α = 0.01, 0.025, 0.05, or 0.10); (5) specify the region of rejection (R.R) or acceptance region (A.R) for the test statistic; (6) reach a decision: reject H0 (accept H1) if the value of the T.S. falls in the R.R. of H0, and vice versa. For example, if θ is an unknown parameter of the population, we may be interested in testing the conjecture that θ > θ0 for some specific value θ0 [23], [24].

We usually test the null hypothesis H0: θ = θ0 against one of the following alternative hypotheses:

H_1: \theta \neq \theta_0, \quad \theta > \theta_0, \quad \text{or} \quad \theta < \theta_0    (4)

The possible situations of statistical hypothesis testing are shown in Table 2.

TABLE 2. Possible outcomes of statistical hypothesis testing.

• One-sided alternative hypotheses are

H_0: \theta = \theta_0 \text{ vs. } H_1: \theta > \theta_0    (5)
H_0: \theta = \theta_0 \text{ vs. } H_1: \theta < \theta_0    (6)

• Two-sided alternative hypothesis:

H_0: \theta = \theta_0 \text{ vs. } H_1: \theta \neq \theta_0    (7)

Fig. 2 shows the R.R and A.R for one- and two-sided alternative hypotheses.
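To make the six steps above concrete, the sketch below runs a one-sample, two-sided test of H0: µ = µ0 in Python and applies the decision rule at α = 0.05. The sample data and the value of µ0 are hypothetical; the study itself relied on SPSS rather than this code.

```python
import numpy as np
from scipy import stats

alpha = 0.05                  # step (4): significance level
mu0 = 80.0                    # hypothesized population mean (a made-up value)
sample = np.array([82.0, 91.5, 78.0, 88.0, 95.0, 84.5, 79.0, 86.0])

# Steps (1)-(3): H0: mu = mu0 versus the two-sided H1: mu != mu0,
# using the one-sample t statistic as the test statistic.
t_stat, p_value = stats.ttest_1samp(sample, popmean=mu0)

# Steps (5)-(6): for a two-sided test, p < alpha is equivalent to the
# test statistic falling in the rejection region.
if p_value < alpha:
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}: reject H0")
else:
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}: fail to reject H0")
```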


TABLE 3. Hypotheses test on single mean [24].

FIGURE 3. Hypothesis testing procedure, where H0 is the null hypothesis, H1 is the alternative hypothesis, R.R is the rejection region, and A.R is the acceptance region.

The descriptions of the two parametric tests (single mean and two means), as examples of hypothesis testing for normal populations, with their assumptions and the relevant test statistics, are presented below.

• Single sample: tests on a single mean (variance unknown)

Suppose that X_1, X_2, ..., X_n is a random sequence of samples of size n from a normal distribution with a mean µ and an unknown variance σ². Then

E(\bar{X}) = \mu_{\bar{X}} = \mu    (8)

\mathrm{Var}(\bar{X}) = \sigma_{\bar{X}}^2 = \frac{\sigma^2}{n}    (9)

Let µ0 be the known mean value. The test procedure is summarized in Table 3.

• Two samples: tests on two means

Suppose we need to test the null hypothesis

H_0: \mu_1 = \mu_2 \Leftrightarrow H_0: \mu_1 - \mu_2 = 0    (10)

Generally, suppose we need to test H0: µ1 − µ2 = d for some specific value d against one of the following alternative hypotheses, as shown in Table 4:

H_1: \mu_1 - \mu_2 \neq d, \quad \mu_1 - \mu_2 > d, \quad \text{or} \quad \mu_1 - \mu_2 < d    (11)

When more than two samples of data are compared, applying statistical analyses is generally an effective approach [26]. In particular, evaluating the variance is very useful for assessing whether the differences between samples are not determined just by chance.

When testing hypotheses, we typically apply several tests on a well-defined set of hypotheses before we select the test that elicits the best properties. In accordance with a formalized mathematical approach to hypothesis tests, we start with a clearly defined set of hypotheses and choose the test with the best properties for the set of hypotheses, as shown previously at the beginning of subsection A. We then specify a suitable test statistic approach in accordance with the following considerations [26], [27]:
i) if the measurements of the samples of data are normally distributed, then t-tests and associated confidence intervals are used to compare the two mean values;
ii) if the measurements of the samples of data are not normally distributed, then the analysis of variance (ANOVA) is used to compare three or more mean values;
iii) the Mann–Whitney (Wilcoxon) test with confidence interval is used for datasets with median values that are close to each other.
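The considerations i)–iii) above map directly onto standard library routines. The following Python sketch, run on invented sample data, shows the corresponding calls; it is an illustration of the test choices described here, not the SPSS procedure used in the study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(85, 5, size=30)   # hypothetical scores, group A
group_b = rng.normal(82, 5, size=30)   # hypothetical scores, group B
group_c = rng.normal(80, 5, size=30)   # hypothetical scores, group C

# i) Two means, approximately normal data: independent two-sample t-test
#    (Eqs. (10)-(11) with d = 0).
t_stat, p_t = stats.ttest_ind(group_a, group_b)

# ii) Three or more means: one-way analysis of variance.
f_stat, p_f = stats.f_oneway(group_a, group_b, group_c)

# iii) Distribution-free comparison of two samples: Mann-Whitney (Wilcoxon rank-sum) test.
u_stat, p_u = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"t-test p = {p_t:.4f}, ANOVA p = {p_f:.4f}, Mann-Whitney p = {p_u:.4f}")
```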


TABLE 4. Hypotheses test on two means [24].

The two-way ANOVA is useful when comparing the means of two independent variables A and B. It studies the effects of the variables and their interactions [28]. ANOVA is extensively used in experimental sciences, such as biology, psychology, and physics.

Tukey's test is an analysis of variance test that determines whether there is a significant difference between the means of three groups, where two comparisons are conducted each time. Typically, after the existence of a significant difference is discovered using the ANOVA, Tukey's test is used to determine where a difference exists. Tukey's test is determined as

HSD = q \sqrt{\frac{MS}{n}}    (12)

where HSD is the honestly significant difference, MS is the mean square value calculated by the ANOVA, n is the number of samples in individual groups, and q is obtained from the studentized range distribution table [29].

The Mann–Whitney U test (sometimes referred to as the Mann–Whitney–Wilcoxon test) [30] is a nonparametric test used to compare two population means estimated from the same population, where the null and two-sided research hypotheses for the nonparametric test are stated as follows: H0: the two populations (θ and θ0) are equal versus H1: the two populations are not equal:

H_0: \theta = \theta_0, \quad H_1: \theta \neq \theta_0    (13)

The chi-square statistic is commonly used for testing relationships between categorical variables. The H0 of the chi-square test implies that no relationship exists between the categorical variables in the population, i.e., they are independent. The chi-square statistic is computed by estimating the sum of the squares of the differences of the observed minus the expected frequency, divided by the expected frequency, in each of the four cells. The chi-square formula is expressed in accordance with [31]

\chi^2 = \sum \frac{(O - E)^2}{E}    (14)

where O is the observed frequency (the observed counts in the cells) and E is the expected frequency if no relationship exists between the variables.
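As a hedged illustration of how the Tukey HSD comparison (Eq. 12) and the chi-square statistic (Eq. 14) are typically computed in practice, the sketch below uses statsmodels and SciPy on invented data; the study itself performed these tests in SPSS, and the group labels and counts here are hypothetical.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(1)
# Hypothetical perception scores for three emotions (invented data).
scores = np.concatenate([
    rng.normal(98, 2, 20),   # "neutral"
    rng.normal(90, 4, 20),   # "happiness"
    rng.normal(95, 3, 20),   # "sadness"
])
labels = ["neutral"] * 20 + ["happiness"] * 20 + ["sadness"] * 20

# Post-hoc Tukey HSD: pairwise comparison of the three group means after an ANOVA.
print(pairwise_tukeyhsd(endog=scores, groups=labels, alpha=0.05))

# Chi-square test of independence on a hypothetical 2x2 contingency table
# (e.g., correct/incorrect classification split by reviewer gender).
observed = np.array([[310, 50],
                     [280, 80]])
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```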
B. HYPOTHESIS TESTING EXAMPLE
This example illustrates a situation in which we seek to identify whether the gender of the reviewers has an effect on the accuracy of the emotion classification based on the human perception test. The hypothesis testing procedure will lead us to one of the following conclusions: reject H0, or fail to reject H0. Our starting assumption (H0) is that there is no effect of the gender on the accuracy of the emotion classification. The procedural steps are as follows.
• State the hypotheses

H_0: \mu \text{ (male)} = \mu_0 \text{ (female)}
H_1: \mu \text{ (male)} \neq \mu_0 \text{ (female)}


• Specify a suitable test statistic: the Mann–Whitney test
• Determine the significance level: α = 0.05
• Specify the R.R or A.R: accept the null hypothesis if the p-value ≥ 0.05; reject the null hypothesis if the p-value < 0.05
For Phase 1: (U = 770169.5, p = 0.022)
For Phase 2: (U = 1085978.0, p = 0.000)
Since p-value = 0.022 < 0.05 = α in Phase 1 and p-value = 0.000 < 0.05 = α in Phase 2, we reject the null hypothesis in both phases. Therefore, at the α = 0.05 level of significance, there is adequate evidence to conclude that there is a difference in the accuracy of emotion classification between male and female reviewers.
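A minimal Python analogue of this Mann–Whitney example is sketched below. The two arrays stand in for the per-utterance accuracy scores of male and female reviewers; the values are invented, so the resulting U and p will not reproduce the figures reported above.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)
# Hypothetical per-utterance accuracy scores (percent) split by reviewer gender.
male_scores = rng.normal(83, 10, size=500).clip(0, 100)
female_scores = rng.normal(87, 9, size=300).clip(0, 100)

u_stat, p_value = mannwhitneyu(male_scores, female_scores, alternative="two-sided")

alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"U = {u_stat:.1f}, p = {p_value:.4f} -> {decision} at alpha = {alpha}")
```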
TABLE 5. Overview of the KSUEmotions corpus for Phases 1 and 2.

IV. KSUEMOTIONS CORPUS DESIGN
The KSUEmotions corpus [32]–[34] was recorded in MSA by 23 speakers (10 males and 13 females) from three Arab countries: Yemen, Saudi Arabia, and Syria. The recordings took place in two separate phases at two different times, as shown in Table 5.

In Phase 1, 10 male speakers were selected from Saudi Arabia, Yemen, and Syria, and 10 female speakers from only Saudi Arabia and Syria. All speakers recited the 16 MSA sentences selected from the former corpus [32], [33]. At this stage, the neutral, sadness, happiness, surprise, and questioning emotions were targeted. Questioning was maintained as an emotion in Phase 1 since it was included in the design of the original corpus [32], [33], but we have discarded it from the Phase 2 design in order to obtain a coherent analysis of emotions.

To evaluate the Phase 1 recordings, a blind human perceptual test was performed. Nine listeners (six males and three females) were chosen to listen to the recorded files to test whether they were able to recognize the recorded emotion. In accordance with the results of the human perceptual test, and by avoiding defective speakers and/or files, and to ensure uniformity among different variables, such as the speaker gender, Phase 2 was produced. Seven male speakers from Phase 1, seven female speakers (four of them were chosen from the participants in Phase 1 and three new female speakers were added from Yemen), and 10 sentences were chosen for Phase 2. In this phase, the questioning emotion was excluded while the anger emotion was added to this corpus to be more consistent with other similar corpora in the field. In Phase 2, each sentence was uttered over two trials.

The male speakers and perception reviewers had ages between 20 and 37 years, and the female speakers and perception reviewers had ages between 19 and 30 years. All speakers were undergraduate or graduate students except for two females who were secondary school students. Each utterer or reviewer was requested to fill in a documentation card that comprised his/her name, nationality, age, place of birth, geographical living area where a part of his/her childhood was spent, current living place, highest level of education achieved, marital status, and other details, as indicated by the identification card shown in Fig. 4.

The total duration of the recorded audio files was 2 hours and 55 minutes for Phase 1 and 2 hours and 21 minutes for Phase 2. Additionally, a blind human perceptual test was performed for Phase 2 with the same nine listeners who reviewed Phase 1. PRAAT software was used for the KSUEmotions corpus recording process [35]. Table 5 lists a statistical comparison of the two phases of KSUEmotions.

V. HUMAN PERCEPTUAL TEST RESULTS
A blind human perceptual test was performed to check the degree to which a listener could easily detect an emotion type. Nine listeners participated in the test: six males and three females. All of them were native Arabic speakers except for one male in his forties who was Indian and could speak fluently and write as a native in the Arabic language. The rest were undergraduate students in their twenties. In order to avoid any influence on their performances, the listeners were not given any prior training. They were also given the chance to replay an utterance if it was necessary. However, they were not allowed to replay a past utterance once they moved to a subsequent one. In this way, they could avoid the comparison of different utterances of the same speaker. The listeners could take breaks when needed [32], [33].


TABLE 6. Perception test results using the KSUEmotions corpus in the two phases.

FIGURE 4. Example of identification card.

The test was carried out as follows. First, all audio files were shuffled. A list of hyperlinks to the randomly ordered files was provided to each listener. Thereafter, the listeners submitted their emotion classification feedback, which was converted to a mean opinion score (MOS) [36]. Each listener actually provided a percentage representing his/her opinion about the emotion. An increased percentage value corresponded to a high confidence of perception by the listener for the specific emotion [32], [33]. Table 6 lists the blind human perceptual test results obtained for the KSUEmotions corpus for both phases regarding the speaker gender, and the general results of the corpus, regardless of the speaker gender. Fig. 5 presents the average percentage of the emotional recognition accuracy obtained by the blind human perceptual test for KSUEmotions regarding the emotions of each phase. Table 7 shows the confusion matrices of the KSUEmotions corpus for Phases 1 and 2. In this table, the unclassified column refers to the situation where it was not possible for the reviewer to classify the specific emotion audio files.
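For readers who want to reproduce this kind of summary on their own listening-test data, the sketch below builds a confusion matrix of intended versus perceived emotions and the per-emotion recognition accuracy with pandas. The labels and responses are invented, and the "unclassified" response is treated as just another perceived label, similar to the unclassified column in Table 7.

```python
import pandas as pd

# Hypothetical perceptual-test responses: intended emotion vs. the listener's choice.
records = [
    ("neutral", "neutral"), ("sadness", "sadness"), ("happiness", "surprise"),
    ("surprise", "surprise"), ("anger", "anger"), ("happiness", "happiness"),
    ("anger", "unclassified"), ("sadness", "sadness"), ("neutral", "neutral"),
]
df = pd.DataFrame(records, columns=["intended", "perceived"])

# Confusion matrix: rows are intended emotions, columns are perceived labels,
# including an "unclassified" column.
confusion = pd.crosstab(df["intended"], df["perceived"])
print(confusion)

# Per-emotion recognition accuracy: fraction of utterances whose perceived
# label matches the intended one.
accuracy = (df["intended"] == df["perceived"]).groupby(df["intended"]).mean()
print(accuracy)
```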

VI. STATISTICAL ANALYSES AND DISCUSSION
Statistical analyses of the KSUEmotions corpus were conducted. It is important to mention that the questioning emotion was not included in any of the analyses. In order to be consistent with other corpora in the field, and in accordance with the approaches adopted by many other experts, such as those presented in Table 1, "questioning" should not be considered as a human speech emotion.

In the following subsections, we explore the effects of speaker gender, reviewer gender, length of sentences, emotion type, and the interactions among these factors on the emotional perception. For this purpose, statistical tests were applied, including the two-way ANOVA, Mann–Whitney, chi-square, Bonferroni, and Tukey tests, using IBM's SPSS software program [37].

A. EFFECT OF DIFFERENT FACTORS ON EMOTION CLASSIFICATION
This section investigated the effect of the following factors: speaker gender (male or female), emotion (five emotions in each phase), sentence length (long or short), and the interaction effects between these factors on emotion perception. For this purpose, we applied the two-way ANOVA.
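The two-way ANOVA reported in Tables 16–21 was run in SPSS; a rough Python equivalent on synthetic data, assuming a long-format table with one perception score per reviewed utterance, could look like the sketch below. The column names, scores, and factor levels are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
emotions = ["neutral", "sadness", "happiness", "surprise", "anger"]

# Hypothetical long-format data: one perception score per reviewed utterance.
df = pd.DataFrame({
    "score": rng.normal(90, 8, size=600).clip(0, 100),
    "speaker_gender": rng.choice(["male", "female"], size=600),
    "emotion": rng.choice(emotions, size=600),
})

# Two-way ANOVA with interaction: speaker gender, emotion, and their interaction,
# mirroring the factor and interaction effects examined in Tables 16 and 17.
model = ols("score ~ C(speaker_gender) * C(emotion)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```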


TABLE 7. Confusion matrix results for human perceptual tests.

TABLE 8. Sentence types in KSUEmotions corpus (S: Short; L: Long; 1: Phase 1; 2: Phase 2).

FIGURE 5. Blind human perceptual final test results for both phases.

Throughout our experiment, all sentences were divided into long and short sentences. A sentence was considered a long sentence if it contained more than seven words; otherwise, it was considered a short sentence. Table 8 shows the sentence types (short or long) and the phases considered. Tables 16 to 21 in the Appendix show the two-way ANOVA results for the two phases. From these tables, and in consideration of Phase 1, it can be observed that there was a statistically significant effect of the emotion (p < 0.05 and F = 179.141) and a statistically significant interaction between speaker gender and emotion (p < 0.05 and F = 19.518). Furthermore, the speaker gender (p = 0.683 > 0.05 and F = 0.167), sentence length (p = 0.147 > 0.05 and F = 2.108), the interaction between speaker gender and sentence length (p = 0.661 > 0.05 and F = 0.192), and the interaction between emotion and sentence length (p = 0.90 > 0.05 and F = 0.195) were not statistically significant, as shown in Tables 16, 18, and 20.

In Phase 2, there are statistically significant effects of the three factors, speaker gender, emotion, and sentence length, and a statistically significant interaction effect between speaker gender and emotion (p < 0.05 and F is large), as indicated in Tables 17, 19, and 21.

Speaker gender, emotion, and sentence length factors in Phase 2 yielded statistically significant effects, while in Phase 1 only the emotion factor had a statistically significant effect. Disparities in the impact of these factors between Phases 1 and 2 can be explained by the changes of some speakers. Indeed, 52% of the speakers were changed in Phase 2: nine speakers of Phase 1 did not participate in Phase 2, while three new female speakers were involved. In addition, in Phase 2, 10 sentences among the 16 sentences used in Phase 1 were selected for the test.


TABLE 9. Speaker gender Mann–Whitney test results.

TABLE 10. Phase 1 and Phase 2 reviewer gender Mann–Whitney test results.

To ensure that the assumption of variance homogeneity was respected, the Welch test [37] was performed, and the results were found to be identical with the results that we obtained in Phases 1 and 2, as presented in this section.

B. NORMALITY TEST
According to the two-way ANOVA results, we applied the normality test for the factors that exhibited a significant effect in the two phases to determine the type of test that we can use. For this purpose, we applied two tests, the Kolmogorov–Smirnov and Shapiro–Wilk tests, using IBM's SPSS software [37]. Table 22 in the Appendix shows that data normality was not observed for the studied factors. Even though the two-way ANOVA is considered a robust test against violations of the normality assumption, we performed the Mann–Whitney test in order to determine the significance of the studied factors, as will be described in the following subsections.
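A rough Python counterpart of the Kolmogorov–Smirnov and Shapiro–Wilk checks used here (performed in SPSS in the study) is sketched below on invented scores; both tests take H0 to be that the data are normally distributed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Hypothetical perception scores for one factor level (invented data).
scores = rng.normal(90, 6, size=200).clip(0, 100)

# Shapiro-Wilk test of normality.
sw_stat, sw_p = stats.shapiro(scores)

# Kolmogorov-Smirnov test against a normal distribution fitted to the sample.
ks_stat, ks_p = stats.kstest(scores, "norm", args=(scores.mean(), scores.std(ddof=1)))

for name, p in [("Shapiro-Wilk", sw_p), ("Kolmogorov-Smirnov", ks_p)]:
    verdict = "normality rejected" if p < 0.05 else "normality not rejected"
    print(f"{name}: p = {p:.4f} ({verdict})")
```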
C. EFFECT OF SPEAKER GENDER ON EMOTION PERCEPTION
To determine whether the perception of emotion is affected by the speaker gender (10 males and 13 females), i.e., to determine whether the gender of the speaker has any effect on the accuracy of the perception of emotion classification, the Mann–Whitney test was applied to two independent sets of utterances in each phase.

As depicted in Table 9, the results show that there is no statistically significant effect of the speaker gender on emotion perception in Phase 1 (p = 0.603 > 0.05). Therefore, the research hypothesis H0: µ (male) = µ0 (female) is accepted. In contrast, there is a significant effect elicited by the speaker's gender in Phase 2 (p = 0.000). Therefore, the research hypothesis H0 is rejected.

To determine the differences according to the speaker gender, pairwise comparisons were performed using the Bonferroni test. The result of the Bonferroni test showed that the emotion emulations performed by the male speakers are more accurate than those performed by the females in both phases of this corpus.

D. EFFECT OF REVIEWER GENDER ON EMOTION PERCEPTION
To determine whether emotion perception is affected by the reviewer gender, i.e., to determine whether the gender of the reviewer has any effect on the accuracy of the perception of emotion classification, the Mann–Whitney test was also applied to two independent sets (males and females) of utterances in the two phases. There were significant effects of the reviewer's gender in Phase 1 (p = 0.022 < 0.05) and Phase 2 (p = 0.000 < 0.05), which imply that the reviewer gender had a statistically significant effect on emotion perception, as shown in Table 10.
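The Bonferroni-style pairwise comparisons mentioned above can be approximated in Python by running the pairwise tests and then adjusting the p-values, for example with statsmodels' multipletests. The group names and data below are hypothetical and only illustrate the adjustment step.

```python
import numpy as np
from itertools import combinations
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(3)
# Hypothetical per-utterance scores grouped by speaker gender and phase.
groups = {
    "male_phase1": rng.normal(88, 9, 200), "female_phase1": rng.normal(86, 9, 200),
    "male_phase2": rng.normal(93, 6, 200), "female_phase2": rng.normal(89, 7, 200),
}

pairs = list(combinations(groups, 2))
raw_p = [mannwhitneyu(groups[a], groups[b], alternative="two-sided").pvalue
         for a, b in pairs]

# Bonferroni adjustment of the pairwise p-values.
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")
for (a, b), p, r in zip(pairs, adj_p, reject):
    print(f"{a} vs {b}: adjusted p = {p:.4f}, significant = {r}")
```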


TABLE 11. Chi-square test results.

E. EFFECTS OF THE REVIEWERS ON THE EMOTION PERCEPTION REGARDLESS OF THEIR GENDER
To determine whether emotion perception is affected by the reviewers regardless of their gender, repetition and percentage accuracy were considered. The first six reviewers were males and the remaining three (reviewers 7–9) were females. Reviewers 7 and 9 achieved the highest scores in the two successive phases. The seventh reviewer (in Phase 1) had the highest score, and 88 % of her evaluations agreed with the actual emotions. In Phase 2, the ninth reviewer had the highest score because 98.9 % of her evaluations agreed with the actual emotions. The eighth reviewer obtained the lowest scores in Phases 1 and 2, at 53.6 % and 66.1 %, respectively.


TABLE 12. Phase 1 results of Tukey test.

TABLE 13. Phase 2 results of Tukey test.

TABLE 14. Phase 1 and Phase 2 sentence length Mann–Whitney test results.

TABLE 15. Semantic interpretation examples.

F. EFFECT OF EMOTION TYPE ON EMOTION PERCEPTION
To determine whether emotion perception was affected by the emotion type (four emotions in Phase 1 and five in Phase 2), we used the chi-square test, which is usually applied to compare several means, as we mentioned previously. The chi-square test results are listed in Table 11. The results show that the emotion type has a statistically significant effect on the emotion perception in Phase 1 (p = 0.00 < 0.05). The highest review score in Phase 1 was observed for the neutral emotion, where it reached 100 %. The second highest score was 99.7 % for sadness, followed by surprise at 85.6 %. The happiness emotion yielded the lowest score, at 82.4 %.

In Phase 2, the chi-square test results showed that the emotion type elicited a statistically significant effect on emotion perception (p = 0.00 < 0.05). The highest review score was observed for sadness and reached 98.8 %. The second highest score was 98.5 % for neutral, followed by anger and surprise at 92.6 % and 92 %, respectively. Finally, the happiness emotion also obtained the lowest score, at 91.1 %.

To determine the differences between means according to the emotion type, a post-hoc comparison was performed using the Tukey test. Results for Phases 1 and 2 for this test are summarized in Tables 12 and 13, whereby the positive and negative numbers show the prominent differences between emotions.


TABLE 16. Effects of tests on speaker genders and emotion in Phase 1.

TABLE 17. Effects of tests on speaker genders and emotion in Phase 2.

TABLE 18. Effects of tests on speaker genders and sentences in Phase 1.

Tables 12 and 13 show that there is a statistically significant difference in the emotional perception of the neutral emotion compared to happiness, sadness, and surprise in Phase 1. The accuracy of the neutral emotion is the best. In Phase 2, the perception of the neutral emotion is better than that of happiness, surprise, and anger. Furthermore, there is a statistically significant difference in the emotion perception for surprise and sadness compared to happiness in Phase 1. The accuracy of the emotion emulation for neutral is better when compared to the other emulated emotions in the two phases, with the exception of sadness in Phase 2. The happiness emotion emulation achieved the lowest performance.


TABLE 19. Effects of tests on speaker genders and sentences in Phase 2.

TABLE 20. Effects of tests on emotion and sentences in Phase 1.

TABLE 21. Effects of tests on emotion and sentences in Phase 2.

In Phase 2, the perception of anger, surprise, and sadness was more accurate compared to happiness, and the difference was statistically significant. We can also observe that the accuracy of the anger emulation was better than that for surprise. The accuracy of the emulation of sadness was better than that for surprise in the two phases, and was better than the emulations of neutral and anger in Phase 2. These results confirm the findings we obtained previously (see Table 11) based on the chi-square test.

G. EFFECT OF THE SENTENCE LENGTH ON THE EMOTION PERCEPTION
To determine whether emotional perception is affected by the sentence length, the Mann–Whitney test was applied to two independent sets of sentences (long and short) in the two phases. As shown in Table 14, the sentence length did not yield any significant effects in Phase 1, while it yielded a noticeably significant effect in Phase 2 (p < 0.05).

Additionally, it can be observed that there is a significant difference in emotion perception, where the results elicited from short sentences were better than those for long sentences, as shown clearly in the case of the Phase 2 subset. The sentences that were used to create the corpus were extracted from the data of regular news media and broadcasting, or from journalistic sentences. Most of these sentences are neutral and some of them have sad content, as shown in Table 15.

There is no doubt that the semantic meaning of the sentences has had a significant impact. For instance, when the semantic meaning is sad and the speaker is supposed to talk about it in a happy way, this imposes a negative impact on the quality of workmanship and the quality of the acted emotion. This effect was mainly noticed in Phase 1. As a result, and in order to reduce this negative effect in Phase 2, we selected 10 sentences among the 16 used in Phase 1. Therefore, the long sentences and those that carried a prominent sad meaning were discarded in Phase 2.
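The long/short split used in this comparison follows the seven-word rule described in subsection A. A small Python sketch of that classification and the corresponding Mann–Whitney comparison is given below; the sentences and scores are invented, and real data would contain many utterances per group.

```python
from scipy.stats import mannwhitneyu

# Hypothetical (sentence, mean perception score) pairs; illustrative data only.
results = [
    ("sentence with exactly seven words right here", 95.0),
    ("a much longer sentence that contains well over seven words in total", 88.5),
    ("short remark", 97.0),
    ("another fairly long sentence used to emulate the target emotion in the corpus", 86.0),
    ("brief neutral statement", 96.0),
    ("one more long sentence whose word count clearly exceeds the seven word threshold", 87.0),
]

# The paper's rule: a sentence is "long" if it contains more than seven words.
long_scores = [score for text, score in results if len(text.split()) > 7]
short_scores = [score for text, score in results if len(text.split()) <= 7]

u_stat, p_value = mannwhitneyu(short_scores, long_scores, alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.4f}")
```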


TABLE 22. Normality test results.

Additionally, we have to mention that the words "Yes" and "No" were added in Phase 2. Because they are very short words, the speakers encountered difficulties in emulating the desired emotion.

VII. CONCLUSION
This study presented an evaluation of the KSUEmotions corpus from a statistical perspective. Ten male and thirteen female speakers were asked to record 3280 files using five simulated emotions, namely, neutral, sadness, happiness, surprise, and anger. A blind human perceptual test was carried out to evaluate the recorded corpus, where nine listeners (six males and three females) participated. The results showed correct identification rates of 79.94 % (Phase 1), 88.33 % (Phase 2), and 84.24 % (both phases). Our experiments showed that factors such as speaker gender, emotion type, reviewer gender, and sentence length played important roles in the design, analysis, and/or testing of emotional speech corpora. To analyze the blind human perceptual test, we applied a series of statistical tests to study the effect of each factor as well as the effects of the interactions between factors. Statistical analyses used in this investigation included the two-way ANOVA, Mann–Whitney, chi-square, Bonferroni, and Tukey tests. The analyses of these tests showed that the gender of both speakers and reviewers, as well as the emotion type, yielded considerable effects on the perception of the emotion expressed by speech. The investigation of the interactions among all pairs of factors showed that there were significant impacts of all interactions on the emotion perception except for the interaction between emotion type and sentence length.

This study demonstrated the usefulness of statistical tests to assess emotional speech corpora. The statistical analyses can be used to measure the effect of multiple variables that have a direct impact on the design quality of emotional speech corpora.

APPENDIX
Note that in the following tables (16–22), the symbol ∗ denotes the interaction between the studied parameters.

ACKNOWLEDGMENT
The authors extend their appreciation to the Deanship of Scientific Research at King Saud University for funding this work through research group No. RG–1439–033.


REFERENCES
[1] S. G. Koolagudi and K. S. Rao, "Emotion recognition from speech: A review," Int. J. Speech Technol., vol. 15, no. 2, pp. 99–117, Jan. 2012.
[2] A. Hassan, "On automatic emotion classification using acoustic features," Ph.D. dissertation, Fac. Phys. Appl. Sci., Univ. Southampton, Southampton, U.K., 2012, p. 204.
[3] M. El Ayadi, M. S. Kamel, and F. Karray, "Survey on speech emotion recognition: Features, classification schemes, and databases," Pattern Recognit., vol. 44, no. 3, pp. 572–587, 2011.
[4] K. R. Scherer, "Vocal communication of emotion: A review of research paradigms," Speech Commun., vol. 40, nos. 1–2, pp. 227–256, 2003.
[5] K. R. Scherer, "Psychological models of emotion," in The Neuropsychology of Emotion, vol. 137, no. 3. New York, NY, USA: Oxford Univ. Press, 2000, pp. 137–162.
[6] A. Ortony and T. J. Turner, "What's basic about basic emotions?" Psychol. Rev., vol. 97, no. 3, p. 315, 1990.
[7] R. Tato, R. Santos, R. Kompe, and J. M. Pardo, "Emotional space improves emotion recognition," in Proc. INTERSPEECH, 2002, pp. 2029–2032.
[8] S. Johar, Emotion, Affect and Personality in Speech: The Bias of Language and Paralanguage. Cham, Switzerland: Springer, 2015.
[9] K. Laskowski, "Contrasting emotion-bearing laughter types in multiparticipant vocal activity detection for meetings," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Apr. 2009, pp. 4765–4768.
[10] E. Navas, A. Castelruiz, I. Luengo, J. Sánchez, and I. Hernáez, "Designing and recording an audiovisual database of emotional speech in Basque," in Proc. LREC, 2004, pp. 1387–1390.
[11] M. Swain, A. Routray, and P. Kabisatpathy, "Databases, features and classifiers for speech emotion recognition: A review," Int. J. Speech Technol., vol. 21, no. 1, pp. 93–120, Mar. 2018.
[12] M. Srikanth, D. Pravena, and D. Govind, "Tamil speech emotion recognition using deep belief network (DBN)," in Proc. Int. Symp. Signal Process. Intell. Recognit. Syst., 2017, pp. 328–336.
[13] V. V. R. Vegesna, K. Gurugubelli, and A. K. Vuppala, "Prosody modification for speech recognition in emotionally mismatched conditions," Int. J. Speech Technol., vol. 21, no. 3, pp. 521–532, Sep. 2018.
[14] I. S. Engberg and A. V. Hansen, "Documentation of the Danish emotional speech database DES," Center for PersonKommunikation, Aalborg Univ., Denmark, Internal AAU Rep., Sep. 1996.
[15] Å. Abelin and J. Allwood, "Cross linguistic interpretation of emotional prosody," in Proc. ITRW Speech Emotion, 2000, pp. 110–113.
[16] V. A. Petrushin, "Emotion recognition in speech signal: Experimental study, development, and application," in Proc. 6th Int. Conf. Spoken Lang. Process., vol. 3, no. 4, 2000, pp. 222–225.
[17] I. Iriondo et al., "Validation of an acoustical modelling of emotional expression in Spanish using speech synthesis techniques," in Proc. ISCA Tutorial Res. Workshop (ITRW) Speech Emotion, 2000, pp. 1–6.
[18] J. Yuan, L. Shen, and F. Chen, "The acoustic realization of anger, fear, joy and sadness in Chinese," in Proc. 7th Int. Conf. Spoken Language Process., 2002, pp. 2025–2028.
[19] V. Hozjan, Z. Kacic, and A. Moreno, "Interface databases: Design and collection of a multilingual emotional speech database," in Proc. LREC, 2002, pp. 2024–2028.
[20] T. L. Nwe, S. W. Foo, and L. C. De Silva, "Speech emotion recognition using hidden Markov models," Speech Commun., vol. 41, no. 4, pp. 603–623, 2003.
[21] F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, and B. Weiss, "A database of German emotional speech," in Proc. Interspeech, vol. 5, 2005, pp. 1517–1520.
[22] S. L. Tóth, D. Sztahó, and K. Vicsi, "Speech emotion perception by human and machine," in Verbal and Nonverbal Features of Human-Human and Human-Machine Interaction (Lecture Notes in Computer Science), vol. 5042. Berlin, Germany: Springer, 2008, pp. 213–224.
[23] M. Naghettini, Fundamentals of Statistical Hydrology. Cham, Switzerland: Springer, 2017.
[24] R. E. Walpole, R. H. Myers, S. L. Myers, and K. Ye, Probability and Statistics for Engineers and Scientists. London, U.K.: Pearson, 2014.
[25] R. Peck, C. Olsen, and J. L. Devore, Introduction to Statistics and Data Analysis. Boston, MA, USA: Cengage Learning, 2015.
[26] T. J. Cleophas and A. H. Zwinderman, Statistics Applied to Clinical Studies. Amsterdam, The Netherlands: Springer, 2012.
[27] H. P. Wijnand and R. van de Velde, "Mann–Whitney/Wilcoxon's nonparametric cumulative probability distribution," Comput. Methods Programs Biomed., vol. 63, no. 1, pp. 21–28, 2000.
[28] M. Kery, "Normal two-way ANOVA," in Introduction to WinBUGS for Ecologists, M. Kery, Ed. Boston, MA, USA: Academic, 2010, ch. 10, pp. 129–139.
[29] J. W. Tukey, Exploratory Data Analysis, vol. 2. Reading, MA, USA: Addison-Wesley, 1977.
[30] N. Feltovich, "Nonparametric tests of differences in medians: Comparison of the Wilcoxon–Mann–Whitney and robust rank-order tests," Exp. Econ., vol. 6, no. 3, pp. 273–297, 2003.
[31] R. Schumacker and S. Tomek, "Chi-square test," in Understanding Statistics Using R. New York, NY, USA: Springer, 2013, pp. 169–175.
[32] A. Meftah, Y. Alotaibi, and S.-A. Selouani, "Designing, building, and analyzing an Arabic speech emotional corpus," in Proc. Workshop Free/Open-Source Arabic Corpora Corpora Process. Tools, 2014, p. 22.
[33] A. Meftah, Y. A. Alotaibi, and S.-A. Selouani, "Designing, building, and analyzing an Arabic speech emotional corpus: Phase 2," in Proc. 5th Int. Conf. Arabic Language Process., Nov. 2014, pp. 181–184.
[34] A. H. Meftah, Y. A. Alotaibi, and S.-A. Selouani. King Saud University Emotional Speech Corpus (KSUEmotions). Linguistic Data Consortium (LDC), 2017. [Online]. Available: https://catalog.ldc.upenn.edu/LDC2017S12
[35] P. Boersma and D. Weenink. (Sep. 8, 2018). Praat: Doing Phonetics by Computer [Computer Program]. Version 6.0.43. [Online]. Available: http://www.praat.org/
[36] F. Ribeiro, D. Florêncio, C. Zhang, and M. Seltzer, "CrowdMOS: An approach for crowdsourcing mean opinion score studies," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2011, pp. 2416–2419.
[37] IBM SPSS Statistics for Windows, IBM Corp., Armonk, NY, USA, 2013.
[38] W. McDougall, An Introduction to Social Psychology. New York, NY, USA: Dover, 2013.
[39] M. B. Arnold, Emotion and Personality: Neurological and Physiological Aspects, vol. 2. New York, NY, USA: Columbia Univ. Press, 1960.
[40] C. E. Izard, The Face of Emotion. New York, NY, USA: Appleton-Century-Crofts, 1971.
[41] R. Plutchik, "A general psychoevolutionary theory of emotion," in Theories of Emotion, vol. 1, no. 3. Amsterdam, The Netherlands: Elsevier, 1980, pp. 3–33.
[42] P. Ekman and W. V. Friesen, "What emotion categories or dimensions can observers judge from facial behavior?" in Emotion in the Human Face, P. Ekman, Ed. Cambridge, U.K.: Cambridge Univ. Press, 1982, pp. 39–55.
[43] J. A. Gray, "The neuropsychology of anxiety," Brit. J. Psychol., vol. 69, no. 4, pp. 417–434, Nov. 1982.
[44] S. S. Tomkins, "Affect theory," in Approaches to Emotion, K. Scherer and P. Ekman, Eds. New York, NY, USA: Psychology Press, 1984, pp. 163–195.
[45] B. Weiner and S. Graham, "An attributional approach to emotional development," in Emotions, Cognition, and Behavior. New York, NY, USA: Cambridge Univ. Press, 1984, pp. 167–191.
[46] N. H. Frijda, The Emotions. Cambridge, U.K.: Cambridge Univ. Press, 1986.
[47] K. Oatley and P. N. Johnson-Laird, "Towards a cognitive theory of emotions," Cognition Emotion, vol. 1, no. 1, pp. 29–50, 1987.
[48] S. Zhang, "Emotion recognition in Chinese natural speech by combining prosody and voice quality features," in Advances in Neural Networks (Lecture Notes in Computer Science), vol. 5264. Berlin, Germany: Springer, 2008, pp. 457–464.

ALI HAMID MEFTAH received the B.Sc. and M.Sc. degrees in computer engineering from King Saud University, Riyadh, Saudi Arabia, in 2009 and 2015, respectively, where he is currently pursuing the Ph.D. degree. Since 2010, he has been a Researcher with King Saud University. His research interests are digital speech processing, specifically speech recognition and Arabic language and speech processing.


YOUSEF AJAMI ALOTAIBI (S'94–M'97–SM'11) received the B.Sc. degree in computer engineering from King Saud University, Riyadh, Saudi Arabia, in 1988, and the M.Sc. and Ph.D. degrees in computer engineering from the Florida Institute of Technology, FL, USA, in 1994 and 1997, respectively. From 1988 to 1992 and from 1998 to 1999, he joined Al-ELM Research and Development Corporation, Riyadh, as a Research Engineer. From 1999 to 2008, he joined the College of Computer and Information Sciences, King Saud University, as an Assistant Professor, where he was an Associate Professor from 2008 to 2012. Since 2012, he has been a Professor with King Saud University. His research interests are digital speech processing, specifically speech recognition and Arabic language and speech processing.

SID-AHMED SELOUANI (M'04–SM'08) received the B.E. and M.S. degrees in electronics engineering from the University of Science and Technology Houari Boumediene (USTHB) in 1987 and 1991, respectively, and the D.Sc. degree in speech processing from USTHB in 2000. He joined the Algerian-French Double Degree Program with the Université Joseph Fourier of Grenoble in 2000. He is currently a Full Professor with the Université de Moncton, Shippagan, where he is also the Founder of the Laboratory of Research in Human–System Interaction. He is also an invited Professor with INRS-Telecommunications, Montreal, Canada. His main areas of research involve artificial intelligence, human–computer interaction, speech recognition robustness, speaker adaptation, speech processing using soft computing and deep learning, dialog systems, ubiquitous systems, intelligent agents, and speech enhancement.
