
Speech Communication 111 (2019) 29–43


Anomaly detection based pronunciation verification approach using speech attribute features
Mostafa Shahin a,b,1,∗, Beena Ahmed a,b,1
a Department of Electrical and Computer Engineering, Texas A&M University, Doha 23874, Qatar
b School of Electrical Engineering and Telecommunications, The University of New South Wales, UNSW Sydney, NSW 2052, Australia

Keywords: Anomaly detection; One class SVM; Pronunciation verification; Speech attributes

Abstract: Computer aided pronunciation training tools require accurate automatic pronunciation error detection algorithms to identify errors made by their users. However, the performance of these algorithms is highly dependent on the amount of mispronounced speech data used to train them and the reliability of its manual annotation. To overcome this problem, we turned mispronunciation detection into an anomaly detection problem, which utilizes algorithms trained with only correctly pronounced speech data. In this work we adopted the One-Class SVM as our anomaly detection model, with a specific model built for each phoneme. Each model was fed with a set of speech attribute features, namely the manners and places of articulation, extracted from a bank of binary DNN speech attribute detectors. We also applied multi-task learning and dropout approaches to alleviate the overfitting problem in the DNN speech attribute detectors. We trained the system using the WSJ0 and TIMIT standard data sets, which contain only native English speech data, and then evaluated it using three different data sets: a native English speaker corpus with artificial errors, a foreign-accented speech corpus and a children's disordered speech corpus. Finally, we compared our system with the conventional Goodness-of-Pronunciation (GOP) algorithm to demonstrate the effectiveness of our method. The results show that our method reduced the false-acceptance and false-rejection rates by 26% and 39% respectively compared to the GOP method.

1. Introduction

Automatic pronunciation verification tools are widely used in a variety of applications including Computer Aided Language Learning (CALL), Computer Aided Speech and Language Therapy (CASLT), language proficiency tests and foreign accent detection systems. Their main purpose is to automate the process of assessing the speaker's pronunciation and reduce or, in some cases, eliminate the need of human intervention, significantly saving cost and time.

Pronunciation verification can be performed at different levels, starting from speaker-level, where a single score representing the speaker's pronunciation fluency is estimated from a few sentences, as commonly used in proficiency test applications, down to phoneme-level pronunciation verification where evaluation is performed for each individual phoneme. The task becomes more challenging as the level of pronunciation verification goes lower, towards phoneme level. Phoneme-level evaluation can provide rich information about the position and the type of error made by the user, which can then be used to generate informative and corrective feedback to improve the learning process (Neri et al., 2008; Neri et al., 2006). Neri et al. (2008) found that using a Computer Assisted Pronunciation Training (CAPT) application with simple Automatic Speech Recognition (ASR) to teach children English as a foreign language achieved short-term improvement in pronunciation comparable to traditional teacher-led learning. However, for the feedback to be effective, it is essential that the pronunciation verification algorithms used be accurate and reliable, as inaccurate feedback may lead the user to lose motivation or even negatively impact their progress.

Several approaches have been proposed to address the phoneme-level pronunciation verification problem, which can be categorized, in terms of the detection algorithm, into three main groups: confidence-score based, rule based and classification based methods. In the confidence-score based approach, a single score measuring the pronunciation quality of each pronounced phoneme is computed and the phoneme then accepted or rejected based on a predefined threshold, such as the posterior probability and log likelihood ratio (Kim et al., 1997). The rule-based approach is a task-specific method where a set of predefined rules representing the expected pronunciation errors for a particular task are estimated either manually by a language expert (Abdou et al., 2006) or automatically using a data-driven method (Lo et al., 2010). Finally, in the classification-based approach each phoneme is classified as either correctly or incorrectly pronounced


∗ Corresponding author.
E-mail address: m.shahin@student.unsw.edu.au (M. Shahin).
1 Member, IEEE

https://doi.org/10.1016/j.specom.2019.06.003
Received 31 May 2018; Received in revised form 23 April 2019; Accepted 7 June 2019
Available online 11 June 2019
0167-6393/© 2019 Elsevier B.V. All rights reserved.

after prior training of the classifier using both correct and incorrect pronunciations of the phoneme (Franco et al., 2014). A detailed discussion of each approach is provided in the next section. Most of these methods are highly dependent on the availability and quality of sufficient annotated non-native training corpora, which are hard to collect. Moreover, human labeling of mispronounced data is more challenging than correctly pronounced data (Bonaventura et al., 2000), which adds an additional source of error in the training data.

In this paper we propose a novel method for achieving phoneme-level pronunciation verification by treating the problem as an anomaly detection task, thus requiring only correct pronunciations of each phoneme to train a phoneme-specific One-Class Support Vector Machine (OCSVM) model. We trained the OCSVM model using a set of attribute features, namely the manners and places of articulation, derived from a bank of Deep Neural Network (DNN) based attribute detectors. Speech attribute features derived from detectors of the manners and places of articulation have recently been used to tackle the phoneme-level pronunciation verification problem (Duan et al., 2016; Li et al., 2017b). We then tested the algorithm with three speech corpora: 1) two standard English native-speaker databases, namely the WSJ0 and TIMIT (Paul and Baker, 1992; Garofolo et al., 1993), 2) the GMU foreign-accented English speech collected from speakers with different L1 languages (Weinberger, 2015) and 3) a disordered speech corpus recorded from children with Childhood Apraxia of Speech (CAS). Furthermore, we compared our method to the most commonly used pronunciation verification method, the Goodness Of Pronunciation (GOP) (Witt and Young, 2000), to show the effectiveness of our approach.

The rest of the paper is organized as follows. An overview of pronunciation verification approaches is given in Section 2. The method and the speech corpora used are explained in Section 3. Section 4 presents the experiments and results. Finally, conclusions are drawn in Section 5.

2. Related work

A number of different approaches have been implemented to achieve phoneme-level pronunciation error detection, as outlined below.

2.1. Confidence score-based methods

These methods compute a confidence score representing how close the pronunciation is to the target phoneme and then compare it to a predefined threshold to accept or reject the produced phoneme. Methods used to compute the confidence score include a posterior-probability based score (Kim et al., 1997), a log-likelihood score and a log-likelihood ratio (LLR) of two different acoustic models trained on both correct and mispronounced non-native speaker speech (Jo et al., 1998; Franco et al., 1999). Though the log-likelihood ratio (LLR) was shown to outperform posterior-probability scores, it needed large amounts of annotated non-native data to build the mispronounced acoustic model (Jo et al., 1998).

The most widely used confidence score is the goodness of pronunciation (GOP) introduced by Witt and Young in 1999 (Witt and Young, 2000). The GOP approximates the posterior probability of each phoneme by taking the ratio between the forced alignment likelihood and the maximum likelihood of the free-phone loop decoding using a Hidden Markov Model (HMM) acoustic model. It has a high correlation with human rating, specifically when using a phoneme-specific threshold (Witt and Young, 2000), and has thus been adopted as a standard method to measure phoneme-level pronunciation quality in a range of applications (Mak et al., 2003; Al Hindi et al., 2014; Maier et al., 2009; Saz et al., 2009; Pellegrini et al., 2014; de Wet et al., 2009; Luo et al., 2009). Variations to the GOP include the generalized posterior probability (GPP), first proposed to compute the word posterior probability (Soong et al., 2004) and then modified to be used with the GOP score on the phoneme level to relax the time boundaries produced by forced alignment (Zheng et al., 2007). Mispronunciation detection error rates of the GOP have been improved using scaling (Zhang et al., 2008), linear transformation of the posterior score and discriminative training of the acoustic model (Yan and Gong, 2011; Sim, 2009). A DNN-HMM based version of the GOP averages the frame-level posterior probabilities output from the last softmax layer for each phoneme (Hu et al., 2013).

The GOP algorithm is very sensitive to the quality of the acoustic model used. It affects not only the estimation of the posterior probability but also the accuracy of time boundaries obtained from forced alignment. In addition, as the decision threshold is determined using a mispronunciation dataset, it can be error specific and thus very hard to generalize to different types of pronunciation errors.

2.2. Rule based methods

Rule based methods require a priori knowledge of expected mispronunciation rules. Unlike previous methods, they offer the advantage of not only detecting the position of the pronunciation error but also the type of error made by the speaker. However, as the rules are customized to a specific problem they can fail if the speaker produces errors not catered for by the designed rules.

Typically, the mispronunciation rules are developed by domain experts. These have been implemented using techniques such as a phoneme mispronunciation network of common expert-defined errors made by speakers of Quranic Arabic, decoded using a HMM based ASR system (Abdou et al., 2006). The error rate was decreased by replacing the conventional GMM-HMM acoustic model with a hybrid DNN-HMM in addition to implementing Minimum Phone Error (MPE) discriminative training (Elaraby et al., 2016). A similar approach was adopted to identify errors made by second language learners (Al-Barhamtoshy et al., 2014; Harrison et al., 2009). In our previous work on recognizing children's disordered speech, we used a lattice consisting of the correct phoneme sequence and an alternate garbage node to collect the pronunciation errors (Shahin et al., 2015). We added alternative nodes representing expected pronunciation errors specified by a speech and language pathologist (Shahin et al., 2014). We were also able to improve the error detection rate by using a DNN-HMM acoustic model instead of a GMM-HMM acoustic model (Shahin et al., 2014). Instead of using expert defined rules, mispronunciation rules automatically derived from a L2 speech corpus have been shown to improve performance (Lo et al., 2010). Error rates have also been decreased by integrating the rule-based method with the GOP score (Wang and Lee, 2012), either by using the GOP to double check the output of the rule-based decoder or by evaluating each phoneme based on which method can handle it better. Most recently, Li et al. (2017c) and Ryu and Chung (2017) introduced the acoustic-graphemic-phonemic model (AGPM) by combining the acoustic features along with the graphemes and canonical transcription in one multi-distribution DNN model.

2.3. Classifier based methods

As phoneme level error detection can be treated as a binary classification problem where each phoneme is classified as "correct" or "mispronounced", conventional classification methods such as SVM, decision tree, ANN, etc. have also been applied to this problem. When LDA and a decision tree were used to assess the pronunciation of three Dutch sounds produced by non-native speakers (Truong et al., 2004), only the LDA classifier outperformed the GOP method and on only one Dutch sound (Strik et al., 2007). However, a SVM classifier (Franco et al., 2014) outperformed the LLR confidence score when used on non-native Spanish speakers; the best performance was obtained using a simple weighted combination of both systems.

Different features have also been experimented with to improve classifier performance. A comparison of a linear kernel SVM used with three types of extracted features, the GOP score, Mel-Frequency Cepstral Coefficients (MFCCs) and formant frequencies, found the MFCCs were best at detecting pronunciation errors of Dutch vowels (van Doremalen


et al., 2009). Another feature set used to improve SVM performance includes the difference between an utterance distance matrix of the production, consisting of the distances between a specific phoneme and other phonemes in the same utterance, and a similar matrix extracted from a native Chinese utterance (Zhao et al., 2012). To cope with the pronunciation variations of each phoneme, Pronunciation Space Models (PSMs) have been used to develop phoneme-specific SVM models (Wei et al., 2009), while DNNs have been used to extract phoneme-level features to train phoneme-specific logistic regression classifiers (Hu et al., 2014).

Although all these classifiers led to a significant improvement compared to confidence score methods such as GOP, they still need sufficient amounts of accurately annotated non-native data to model the mispronounced phonemes. Moreover, the mispronounced data has to include all possible pronunciation errors, which is usually not feasible.

In this work we tackle the limitations of the three methods discussed above in two ways. Firstly, we use speech attribute features. Speech attribute features (manners and places of articulation) are more robust against speech variations due to speakers, environmental noise, dialect etc. compared to traditional features. In addition, mispronunciations, by nature, can be defined as a change in one or more attributes of the pronounced phoneme, making these features more effective in detecting pronunciation errors. Despite their potential, speech attribute features have found limited use in the phoneme-level pronunciation error detection problem. In early work by Jo et al. (1998), pair-wise manner and place of articulation classifiers were used to detect missing articulation feature/s in mispronounced phonemes. They have also been used to improve the performance of the GOP in (Stouten and Martens, 2006), where acoustic features were converted to the corresponding phonological features using a set of ANN speech attribute detectors, and in (Yoon et al., 2009), where the posterior probability of each phoneme was estimated using speech landmark-based SVMs for combination with the conventional GOP score in a second SVM. Recently, speech attribute features have been used with a DNN classifier and a binary Long Short-Term Memory (LSTM) classifier to detect phoneme-level mispronunciations in L2 speech (Duan et al., 2016; Li et al., 2017b).

Secondly, we utilize anomaly detection to identify the phoneme level mispronunciations. Anomaly detection eliminates the need for a mispronunciation training dataset, instead enabling training with only native-speech corpora, which are abundantly available for languages such as English. A variety of methods have been proposed for anomaly detection, e.g. decision trees, neural networks, nearest neighbors, Bayesian classifiers and the one-class SVM (OCSVM) (Chandola et al., 2009; Khan and Madden, 2014); however, we advocated the one-class SVM approach. The OCSVM is a novelty detection method that performs very well when the training data is pure and not contaminated with outliers (Khan and Madden, 2014) because its decision boundary is affected significantly by outliers. This high sensitivity makes it well suited to our problem. We trained the OCSVM using the speech attribute features of the correct pronunciation of each phoneme, making the anomaly detection model sensitive to abnormal changes in the attribute features caused by mispronunciations.

3. Method

3.1. System overview

Fig. 1 presents the flow diagram of our proposed pronunciation error detection method. The system consists of four main stages: pre-processing, forced alignment, speech attribute detection and OCSVM model building.

In the pre-processing stage, the speech signal is framed with a Hamming window of 25 msec width and a frame shift of 10 msec. From each window, two types of features are obtained: (1) 26 filter bank features extracted by applying triangular filters on the Mel-scale to the power spectrum and (2) MFCC coefficients, which are the decorrelated and compressed version of the filter bank features produced by applying a Discrete Cosine Transform (DCT) on the filter banks, yielding typically 12 coefficients. The filter bank coefficients are fed to the DNN speech attribute detectors, given they have proven to be more efficient with deep learning architectures than traditional MFCC features (Mohamed et al., 2012). The MFCC features are used to train the context dependent GMM-HMM acoustic models employed in the forced alignment stage.

As our method is text-dependent, the prompted word/sentence is known ahead and therefore the phonetic transcription can be extracted using any existing pronunciation dictionary. In this work, we adopted the CMU pronunciation dictionary and standard ARPAbet phonetic symbols (C. M. University 2014). The resultant phoneme sequence is then passed to the forced alignment stage along with the MFCC features extracted from the speech signal to determine the time boundary of each phoneme.

In the speech attribute detection stage, a separate binary DNN classifier is trained to recognize the existence or absence of each attribute in the current frame. A mapping step is performed first to map each phoneme to its corresponding speech attributes. Each phoneme is then modeled using an OCSVM which is trained with the speech attribute features produced from the previous stage.

In the verification mode, forced alignment is performed first and then the frames of each individual phoneme are classified as in-class, i.e. belonging to the underlying phoneme, or out-of-class, i.e. anomalous pronunciation, using the phoneme-specific OCSVM model. The whole phoneme is accepted or rejected based on the ratio between the number of in-class and out-of-class frames. As the OCSVM is an anomaly detection method based solely on the correct data set, only native speech corpora were used for its training.

3.2. Speech attribute detectors

The speech attributes, namely the manners and places of articulation, provide a knowledge-rich representation of speech articulation. The motivations behind the choice of these features are that they are robust against background noise and inter-speaker variations due to dialect, age and/or gender of the speakers (Lee and Siniscalchi, 2013). Furthermore, these features are shared among multiple languages and hence speech corpora from different languages can be utilized in training one universal speech attribute model (Behravan et al., 2014), thus overcoming the shortage of labeled speech corpora from low-resource languages (Wang et al., 2014).

In this work, we utilized 25 manner and place of articulation attributes in addition to silence. Silence was used to measure the absence or existence of speech. A complete list of the adopted speech attributes and their associated phonemes is shown in Table 1. For each attribute, a separate binary frame-level DNN classifier was trained to determine the existence or absence of the attribute.

Furthermore, we examined different DNN methods to improve the performance of the speech attribute detectors, including multi-task learning (MTL) and dropout regularization (Dout). Bottleneck (BN) features were also employed to better represent the phonetic variations within each speech attribute by increasing the dimensionality of the extracted features. As depicted in Fig. 2, a set of DNN models were trained using filter-bank features to detect the existence or absence of each of the 26 speech attributes. The resultant feature vector either had 26 features, when features were formed from the output of the softmax layer, or 260, when a bottleneck layer with 10 neurons was used for each speech attribute detector. The extracted features from each DNN configuration were then used to train phoneme-specific OCSVM models. In other words, each phoneme had 4 different OCSVM models, each trained using features extracted from one DNN configuration. The 4 OCSVM models of each phoneme were then evaluated using a validation set and the model that achieved the best performance was selected. In the following subsections we detail each of these methods.
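As a concrete illustration of the pre-processing stage described in Section 3.1, the short sketch below derives the two frame-level representations (26 Mel filter bank coefficients plus deltas and accelerations, and 12 MFCCs) from a waveform. It assumes the librosa library; the paper does not name its feature-extraction toolkit, so the parameter choices shown are illustrative rather than the authors' exact configuration.

```python
# Hedged sketch of the pre-processing stage (Section 3.1): 25 ms Hamming
# windows with a 10 ms shift, 26 Mel filter bank features and 12 MFCCs.
# Assumes librosa; toolkit and variable names are illustrative.
import librosa
import numpy as np

def extract_features(wav_path, sr=16000, n_mels=26, n_mfcc=12):
    y, sr = librosa.load(wav_path, sr=sr)
    win = int(0.025 * sr)          # 25 ms window
    hop = int(0.010 * sr)          # 10 ms frame shift
    # Mel filter bank energies (input to the DNN attribute detectors)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=win,
                                         win_length=win, hop_length=hop,
                                         window="hamming", n_mels=n_mels)
    fbank = librosa.power_to_db(mel)                   # (26, n_frames)
    # MFCCs (used for the GMM-HMM forced-alignment models): DCT of the
    # log filter bank energies, keeping 12 coefficients
    mfcc = librosa.feature.mfcc(S=fbank, n_mfcc=n_mfcc)  # (12, n_frames)
    # Delta and acceleration coefficients appended to the filter banks
    d1 = librosa.feature.delta(fbank, order=1)
    d2 = librosa.feature.delta(fbank, order=2)
    fbank_da = np.vstack([fbank, d1, d2])               # (78, n_frames)
    return fbank_da.T, mfcc.T                           # frames as rows
```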


Fig. 1. The system flow chart.

Table 1
List of speech attributes.

Attribute Phonemes

Vowels ao, aa, iy, uw, eh, ih, uh, ah, ax, ae, ey, ay, ow, aw, oy, er, axr
Stops p, b, t, d, k, g
Affricates ch, jh
Fricatives f, v, th, dh, s, z, sh, zh, hh
Nasals m, n, em, en, ng, eng
Liquids l, el, r, dx
Semivowels y, w
Approximant w, y, l, r
Coronal d, l, n, s, t, z
High ch, ih, iy, jh, sh, uh, uw, y, ow, g, k, ng
Dental dh, th
Glottal hh
Labial b, f, m, p, v, w
Low aa, ae, aw, ay, oy
Mid ah, eh, ey, ow
Velar g, k, ng
Back ay, aa, ah, ao, aw, ow, oy, uh, uw, g, k
Retroflex er, r
Anterior b, d, dh, f, l, m, n, p, s, t, th, v, z, w
Continuant ao, aa, iy, ay, r, ey, ih, uw, eh, uh, ah, ae, ow, aw, oy, er, f, v, th, dh, s, z, sh, l, y, w
Round aw, ow, uw, ao, uh, v, y, oy, r, w
Tense aa, ae, ao, aw, ay, ey, iy, ow, oy, uw, ch, s, sh, f, th, p, t, k, hh
Voiced aa, ae, ah, aw, ay, ao, b, d, dh, eh, er, ey, g, ih, iy, jh, l, m, n, ng, ow, oy, r, uh, uw, v, w, y, z
Monophthongs ao, aa, iy, uw, eh, ih, uh, ah, ax, ae
Diphthongs ey, ay, ow, aw, oy
Silence Silence

3.2.1. Deep neural network (DNN) architecture

We used a feed-forward deep neural network (DNN) as our attribute detector, as shown in Fig. 3. The input to the DNN was the set of 26 filter bank features extracted from each frame along with their delta and acceleration coefficients. To identify context, we combined features from 11 frames (5 neighboring frames on either side of the underlying frame) to form one 858-dimensional super feature vector. The output layer consisted of 2 neurons, one fired when the input was a +ve example, i.e. where the frame belonged to the attribute, while the other neuron fired when the input was a –ve example, i.e. where the attribute is absent. The softmax function was then applied to the output to convert the arbitrary output values to probabilities. The softmax output of the j-th neuron was calculated as follows:

$$f(z_j) = \frac{e^{z_j}}{\sum_{i=1}^{n} e^{z_i}} \quad (1)$$

where n is the number of output neurons.

Following the work done in (Yu et al., 2012) on the Wall Street Journal (WSJ0) corpus, we used 5 hidden layers, each with typically 2048 non-linear neurons, with the Rectified Linear Unit (ReLU) adopted for the hidden neurons, defined as follows:

$$f(x) = \max(0, x) \quad (2)$$

As the ReLU function has been shown to alleviate the vanishing gradient problem, making pre-training less effective (Li et al., 2017a), we did not apply pre-training and the weights were initialized randomly based on a uniform distribution (Glorot and Bengio, 2010):

$$W \sim U\left[-\sqrt{\frac{6}{\mathrm{fanIn}+\mathrm{fanOut}}},\ \sqrt{\frac{6}{\mathrm{fanIn}+\mathrm{fanOut}}}\right] \quad (3)$$

where fanIn is the number of inputs to the neuron and fanOut is the number of outputs.

The Mini-Batch Stochastic Gradient Descent (MBSGD) method was used to fine-tune the network with a mini-batch size of 100 samples. The learning rate started at 0.1 and remained unchanged as long as


Fig. 2. Comparison amongst the four different DNN configurations used in our speech attribute detectors.

Fig. 3. The feed-forward DNN architecture used in the speech attribute detectors.

the loss on a separate validation set was greater than a certain threshold; otherwise the learning rate was halved and the weights returned to their previous values. Training continued until the learning rate reached its minimum threshold of typically 0.0001. We used the binary cross entropy as our objective function, which is defined as below:

$$C = -\frac{1}{N}\sum_{i=1}^{N}\left[t_i \ln y_i + (1-t_i)\ln(1-y_i)\right] \quad (4)$$

where N is the number of samples in the mini-batch and t_i and y_i are the target and predicted values of sample i.

A separate DNN classifier was trained and tuned for each of the 26 attributes used in this work. As the classifier outputs are probabilities derived from the softmax function, we used the value of only one of the two output neurons, i.e. either the +ve neuron, which represented how likely it was that the attribute existed in the current frame, or the –ve neuron, which represented how likely it was that the attribute was absent in the current frame. The filter-bank features of each frame were then converted to a feature vector containing the 26 speech attributes.

Fig. 4. The architecture of DNN with bottleneck layer of 2048 nodes.

3.2.2. Bottleneck features

As using a single output from each attribute classifier was inefficient in discriminating between frames of different phonemes that shared the existence or absence of the same attributes, we utilized bottleneck features to increase the dimensionality of the features extracted from each attribute. These features were taken from a bottleneck layer, i.e. a hidden layer with a significantly lower number of non-linear ReLU neurons compared to the other hidden layers in the DNN network, as shown in Fig. 4. We used a bottleneck hidden layer with typically 10 neurons. We tried different numbers of neurons in the bottleneck layer, namely 20, 15 and 10, with no significant impact on the performance of the whole system. A slight improvement in the performance was obtained using 10 neurons. The size of the speech attribute feature vector of each frame was increased from 26 to 260 values, where 10 values represented a single attribute.

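The bottleneck variant of Section 3.2.2 can be sketched in the same style: a narrow 10-neuron layer is placed before the last wide hidden layer and its activations are read out as the 10-value representation of the attribute. The layer name and the helper that exposes the activations are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch: attribute detector with a 10-neuron bottleneck layer
# (Section 3.2.2). Its activations give 10 values per attribute, so the
# 26 detectors together yield a 260-dimensional frame representation.
import tensorflow as tf

def build_bottleneck_detector(input_dim=858, bn_units=10):
    inputs = tf.keras.Input(shape=(input_dim,))
    x = inputs
    for _ in range(3):                                  # wide hidden layers
        x = tf.keras.layers.Dense(2048, activation="relu")(x)
    bn = tf.keras.layers.Dense(bn_units, activation="relu",
                               name="bottleneck")(x)    # narrow layer
    x = tf.keras.layers.Dense(2048, activation="relu")(bn)
    outputs = tf.keras.layers.Dense(2, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    extractor = tf.keras.Model(inputs, bn)              # exposes bottleneck features
    return model, extractor

# bn_feats = extractor.predict(frames)   # shape (n_frames, 10) per attribute
```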

Fig. 5. The architecture of multi-task learning with (1) attribute detection as the main task and (2) phoneme classification as the secondary task. Each task has its own output layer but shares the same input and hidden layers.

3.2.3. Multi-task learning (MTL)

We used multi-task learning, as it has been shown to improve the generalization of the main classification task by learning secondary different but related tasks, which share the same model parameters (Ferber, 1999; Collobert and Weston, 2008; Zhang et al., 2014b). Multi-task learning can be achieved on the DNN using either hard or soft parameter sharing (Ruder, 2017). In hard parameter sharing, all tasks share the same model parameters but have a separate output layer, while in soft parameter sharing, each task is trained using its own model and regularization is used to decrease the distance among the models' parameters.

We adopted the hard parameter sharing method, the most commonly used method to avoid overfitting (Baxter, 1997). As shown in Fig. 5, we trained each DNN attribute classifier to achieve two tasks; the main task was the binary classification of each frame's attribute, while the secondary task was a phoneme classification task. Unlike the attribute detector, which has two output neurons, the phoneme classifier has 120 output neurons representing the states of the 40 phonemes, 3 states per phoneme. Using phoneme classification as a secondary task not only allowed us to improve the performance of the attribute classification task but, more importantly, allowed the network to learn variations amongst phonemes that shared the same attribute class and hence produce more discriminative features.

We employed a simple implementation similar to the one proposed by Huang et al. (2013). All the hidden layers were shared between the main and secondary tasks, while each task had its own softmax layer. The dataset was divided into batches and each batch was fed to the network twice, once for each task. During training, the error computed from the objective function belonging to a specific task was used to update the shared hidden layer weights and only the softmax layer parameters of this specific task, while the parameters of the softmax layer of the other task remained intact.

3.2.4. Dropout regularization

In addition to the MTL, we utilized the dropout approach to further reduce the effect of overfitting, a major issue in DNN training, as proposed by Srivastava et al. (2014). It works by randomly selecting a percentage of nodes in the input and hidden layers to be inactive (dropped-out) with probability p during the training process, which means that all fan-in and fan-out connections of this node are removed and the weights are not updated; Fig. 6 shows the network structure with and without dropout. During the testing phase, all nodes were active and fully connected; however, the fan-out weights of each neuron were scaled by 1-p, where p is the probability of being dropped-out during the training phase. Based on recommendations in (Srivastava et al., 2014) and previous studies on speech processing problems (Xu et al., 2015; Zhang et al., 2014a), we chose dropout probability values of 0.2 for the input layer and 0.3 for all hidden layers.

3.3. One class support vector machine (OCSVM)

As aforementioned, we treated the pronunciation error detection problem as an anomaly detection problem, where the model was trained on only the normal (correct) samples and anomalies (mispronunciations) are classified as outliers.

The one class SVM is a special variation of the SVM classifier introduced by Schölkopf et al. (2001) to solve the one class classification problem. Unlike the traditional binary SVM classifier, where samples from +ve and –ve classes are available in the training set, the OCSVM is trained on only the +ve samples to find the hyperplane that separates them from the origin. This is achieved by developing a decision function with value +1 in a "small" region that includes most of the training samples and −1 elsewhere, with a maximum margin from the origin, as shown in Fig. 7.

The OCSVM training was tuned using two parameters: first, the regularization parameter ν, which takes values in the interval [0, 1] and represents the maximum fraction of training samples allowed on the negative side of the decision hyperplane; and second, the minimum number of support vectors as a percentage of the total number of training samples. A Gaussian kernel function was used, as it has been shown to be the most successful kernel for the OCSVM (Bounsiar and Madden, 2014), with the function defined as:

$$K(X_i, X_j) = \exp\left(-\gamma \left\|X_i - X_j\right\|^2\right), \quad \gamma > 0 \quad (5)$$

where the value of γ was determined empirically.

The OCSVM model was trained using the speech attribute features produced from the bank of DNN attribute detectors. The size of the input feature vector was either 26, when taken from the +ve node of the last softmax layer with one value for each attribute, or 260, when the attribute features were taken from the bottleneck layer with 10 values for each attribute.

3.4. Goodness of pronunciation (GOP)

For comparison purposes, we implemented the GOP as proposed in (Witt and Young, 2000), where the posterior probability of each phoneme was estimated using the following equation:

$$P(p_i \mid O) = \frac{P(O \mid p_i)}{\max_{p_j \in Q} P(O \mid p_j)} \quad (6)$$

where p_i is the underlying phoneme, the numerator P(O|p_i) is the phoneme likelihood computed from the forced alignment step, and O is the observation segment of p_i obtained from the forced alignment. A free-phoneme recognition step was performed using a phoneme loop grammar created from the list of phonemes in Q. The denominator was the maximum likelihood from the free-phoneme recognition of the observation segment O. Fig. 8 summarizes the GOP algorithm.

We used a DNN-HMM acoustic model to estimate the posterior probability in the GOP algorithm, which consists of 5 hidden layers with 2048 neurons in each layer. The activation of the hidden units is ReLU and the output layer is a softmax layer with 120 units representing the 3 states of the 40 monophones (Mohamed et al., 2012b).

The log value of the computed score was first normalized over the phoneme duration and then compared to a predefined threshold in order to accept or reject the pronunciation. The specific threshold for each phoneme was tuned to maximize the phoneme F1 score on the validation set.
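A minimal sketch of the phoneme-specific OCSVM of Section 3.3 and the frame-voting decision of Section 3.1 is given below, using scikit-learn's OneClassSVM with an RBF (Gaussian) kernel. The ν and γ values shown are placeholders, since the paper tunes them per phoneme on a validation set.

```python
# Hedged sketch of a phoneme-specific one-class SVM (Section 3.3) and the
# phoneme-level accept/reject decision (Section 3.1). nu and gamma are
# placeholder values; the paper tunes them empirically per phoneme.
import numpy as np
from sklearn.svm import OneClassSVM

def train_phoneme_ocsvm(inclass_frames, nu=0.05, gamma=0.1):
    """inclass_frames: (n_frames, 26) or (n_frames, 260) attribute features
    taken only from correct pronunciations of one phoneme."""
    model = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma)  # Gaussian kernel, Eq. (5)
    model.fit(inclass_frames)
    return model

def verify_phoneme(model, segment_frames):
    """Accept the phoneme if the majority of its frames are in-class."""
    labels = model.predict(segment_frames)   # +1 = in-class, -1 = outlier
    return np.mean(labels == 1) >= 0.5
```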

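For the DNN-GOP baseline of Section 3.4, the sketch below computes the duration-normalized log score of Eq. (6) from frame-level phoneme log-likelihoods, assuming forced-alignment boundaries are already available. The free phone-loop decoding in the denominator is approximated here by a frame-wise maximum, a simplification; the helper names are illustrative.

```python
# Hedged sketch of the GOP score of Eq. (6) (Section 3.4). loglik[t, j] is
# assumed to hold the frame-level log-likelihood of phoneme j at frame t,
# and (start, end) the forced-alignment boundaries of the target phoneme.
import numpy as np

def gop_score(loglik, start, end, target_idx):
    seg = loglik[start:end]                      # observation segment O
    numer = seg[:, target_idx].sum()             # log P(O | p_i), forced alignment
    denom = seg.max(axis=1).sum()                # approximates the phone-loop maximum
    return (numer - denom) / max(end - start, 1) # normalized over duration

def accept(loglik, start, end, target_idx, threshold):
    # Phoneme-specific threshold, tuned to maximize F1 on a validation set
    return gop_score(loglik, start, end, target_idx) >= threshold
```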

Fig. 6. The DNN architecture (a) without and (b) with dropout
(Srivastava et al., 2014).
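Returning to the hard parameter sharing scheme of Section 3.2.3 (Fig. 5), the sketch below builds two output heads, one for the binary attribute task and one for the 120-state phoneme task, over shared hidden layers. The alternating per-task update described in the paper is approximated here by joint multi-output training in tf.keras, so this is an illustration of the architecture rather than the authors' exact training procedure.

```python
# Hedged sketch of hard parameter sharing for multi-task learning
# (Section 3.2.3, Fig. 5): shared hidden layers with two softmax heads.
# The paper alternates per-task updates; here both losses are trained
# jointly for brevity, so this only approximates that scheme.
import tensorflow as tf

def build_mtl_detector(input_dim=858, n_states=120):
    inputs = tf.keras.Input(shape=(input_dim,))
    x = inputs
    for _ in range(5):                                # shared hidden layers
        x = tf.keras.layers.Dense(2048, activation="relu")(x)
    attr_out = tf.keras.layers.Dense(2, activation="softmax",
                                     name="attribute")(x)       # main task
    phone_out = tf.keras.layers.Dense(n_states, activation="softmax",
                                      name="phoneme_state")(x)  # secondary task
    model = tf.keras.Model(inputs, [attr_out, phone_out])
    model.compile(optimizer="sgd",
                  loss={"attribute": "categorical_crossentropy",
                        "phoneme_state": "sparse_categorical_crossentropy"})
    return model
```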

Fig. 7. The one-class SVM decision regions.

3.5. Speech corpora

We used four different speech corpora to evaluate the algorithms, two of which were collected from English native speakers while the other two contain foreign-accented and disordered speech.

To validate and test our algorithms, we used two native speech corpora: (1) the speaker independent subset (SI84) of the standard Wall Street Journal (WSJ0) dataset, which consists of 101 speakers and 13,048 sentences (∼21 h in total), and (2) the TIMIT dataset, which contains 6300 sentences produced by 630 speakers (∼3.5 h in total). The overall validation set consisted of 1400 sentences from 76 speakers, while the test set consisted of 1500 sentences from 100 speakers.

To evaluate the effectiveness of the method in detecting pronunciation errors, we used two other datasets that contain natural pronunciation errors. (1) The GMU foreign-accented speech, which was collected from non-native English speakers from a variety of L1 language backgrounds reading a common paragraph carefully selected to cover all English phonemes. The data was transcribed by 2 to 4 native English speakers who were phonetically educated (Weinberger, 2015). In this work we used 15 non-native English speakers with 4 different L1 languages: Arabic (5 speakers), Dutch (4 speakers), German (4 speakers), and Farsi (2 speakers). (2) A disordered speech corpus collected from children with Childhood Apraxia of Speech (CAS). The recording and annotation process were performed by speech and language pathologists (SLPs) at the University of Sydney. The corpus contained speech from 11 children pronouncing 450 single word prompts.

4. Experiments and results

We conducted three separate experiments to optimize the parameters of each module separately and then evaluated the whole system in detecting mispronunciation errors.

In the first experiment, we studied the speech attribute detection module and tuned the parameters of the DNN classifiers to achieve the best performance for each detector using a native-speaker English corpus. In the second experiment, we looked at the effectiveness of the phoneme-specific OCSVM models, trained using the speech attribute features, in modeling the correct pronunciation of each phoneme. Finally, in the last experiment, we evaluated the integrated system in pronunciation verification of a native-speaker English test set with artificial errors, a foreign-accented English test set and children's disordered speech. We compared these results to the DNN-GOP algorithm.

4.1. Results on speech attribute detection

In this experiment, we studied the extraction of the speech attribute features. A separate binary DNN classifier was trained for each of the 26 speech attributes in Table 1, evaluating the existence or absence of the attribute in the input frame. Both the WSJ0 and TIMIT speech corpora were used to train, validate and test the DNN classifiers. Frames of all the phonemes that belonged to the underlying attribute formed the +ve examples, while all other frames that were not included in the +ve samples formed the –ve examples. The training, validation and testing sets were balanced by randomly choosing equal numbers of +ve and –ve samples to prevent biasing of the model. The exact numbers of frames used in the training, validation and testing of each speech attribute are listed in Table 2.

With a balanced test set, the classification accuracy score can be used to represent the performance of the attribute detector. Fig. 9 shows the accuracy of the 26 attribute detectors over the training, validation and testing sets using a shallow ANN (with one hidden layer) and a DNN (with 5 hidden layers). For both models the number of neurons is fixed to 2048 ReLUs. The results show that the DNN model outperformed the single hidden layer ANN. The average performance of the shallow ANN was 89% ± 3%, while the 5-layer DNN achieved an average performance of 91% ± 2%. Affricates and retroflex achieved the best performance of around 94% accuracy, while the tense and mid attributes gave the lowest accuracy of around 85%. Overall, the average accuracy of all attributes was 90% with a standard deviation of 2.7%.

It is obvious from the results that the DNN models overfit the training data, with an accuracy of almost 100% reached in classifiers such as affricates and semivowels. To relax this overfitting and improve the generalization of the model, we employed two different techniques, dropout regularization and MTL. A dropout rate of 20% was used for the input


Fig. 8. The flow diagram of the GOP algorithm (Witt and Young, 2000).

Table 2
The distribution of the dataset over the training, validation and testing sets.

Attribute N# training N# validation N# testing
Vowels 6,639,500 719,958 545,282
Stops 3,129,490 354,896 236,484
Affricates 200,818 19,364 15,740
Fricatives 2,914,646 319,510 224,296
Nasals 1,465,152 166,466 104,850
Liquids 1,124,678 119,766 93,590
Semivowels 319,916 33,346 29,082
Approximant 4,580,322 477,940 390,168
Coronal 1,406,386 153,112 115,094
High 4,316,732 489,946 327,352
Dental 3,656,524 379,804 291,956
Glottal 328,972 36,862 26,336
Labial 267,900 30,126 16,086
Low 2,081,142 234,694 160,284
Mid 1,526,916 162,298 133,810
Velar 2,434,322 286,848 178,306
Back 978,232 103,570 71,136
Retroflex 1,048,488 116,674 82,670
Anterior 6,726,846 761,502 513,972
Continuant 3,876,888 431,238 297,094
Round 5,346,482 572,188 426,176
Tense 2,204,564 230,514 184,400
Voiced 7,750,246 844,118 634,428
Monophthongs 4,848,766 529,628 387,524
Diphthongs 1,257,336 135,772 109,012
Silence 1,111,396 88,178 118,294

Table 3
The distribution of training samples over different phonemes.

Phoneme N# training Phoneme N# training Phoneme N# training
ae 221,085 er 213,735 v 117,585
t 440,685 sh 83,727 ow 127,005
eh 207,897 aa 176,252 b 101,506
n 413,451 l 221,794 uw 98,513
iy 286,984 m 184,162 y 50,256
s 378,509 jh 40,670 aw 48,422
dh 94,311 z 207,143 ey 176,512
ah 594,392 w 77,548 oy 17,406
d 245,684 k 315,311 uh 13,529
f 186,781 p 252,357 g 43,651
ao 104,796 ay 146,427 th 52,698
r 246,965 ng 57,559 ch 38,046
ih 337,133 hh 119,823 zh 5461

layer, while a rate of 30% was used for each of the 5 hidden layers for all the attribute detectors.

In the MTL, each attribute detector was trained to classify each frame as +ve or –ve, as a primary task, and to assign each frame to one of the monophone states, as a secondary task. There were 40 monophones with 3 states each, forming 120 units in the output layer of the secondary task. State alignment was performed using a context dependent GMM-HMM acoustic model trained on the same training data.

The effect of using dropout, multi-task learning and the combination of both techniques is demonstrated in Fig. 10. When applying only dropout, the average training accuracy dropped from 97.5% ± 1.9 to around 92.7% ± 2.3, as opposed to a reduction in the test set classification error ranging from 5% to 18%. On the other hand, MTL maintained the average performance on the training set with a decrease in the test set error rate varying from 1% up to around 13%. Combining both techniques further improved the accuracy of almost all the attribute detectors; the classification accuracy of the affricates and coronal attributes increased from 94.7% and 87.7% using the DNN model without any regularization techniques to 95.8% and 90% respectively when using both MTL and dropout methods.

Fig. 11 presents 2-D scatter plots of the attribute features of frames from different phonemes selected randomly from the validation set. Each frame was converted to its corresponding 26 attribute features using the pre-trained attribute detectors and then reduced to 2 dimensions using t-Distributed Stochastic Neighbor Embedding (t-SNE) (Maaten and Hinton, 2008). The figures show how effective the attribute features are in discriminating between the different phonemes. It is obvious from the plots that each phoneme has a dominant region clearly separated from other phonemes in most cases. Fig. 12 shows the effect of the MTL and dropout techniques in further improving the discriminability of the attribute features. Here also the samples were randomly selected from the validation set.

4.2. Results on OCSVM phoneme specific models

In this experiment we studied the effectiveness of the OCSVM phoneme-specific model in discriminating between the in-class frames, i.e. the frames belonging to the underlying phoneme, and the out-of-class frames, i.e. frames from other phonemes. Here also the WSJ0 and TIMIT data sets were used to train and validate the OCSVM models. The training set consisted of only in-class frames. 30% of the validation set consisted of randomly selected in-class frames, with the remaining frames randomly selected from all the other phonemes, with all phonemes contributing equally. Table 3 shows the number of frames used in the training of each phoneme, while the validation set was fixed to 2000 frames. Due to the imbalance in the validation set we chose model parameters that maximize the F1 score, computed as follows:

$$F_1 = \frac{2TA}{2TA + FA + FR} \quad (7)$$

where TA, FA and FR are the true-acceptance, false-acceptance and false-rejection respectively.
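The evaluation quantities used throughout Section 4 follow directly from the accept/reject decisions; a small sketch of Eq. (7), with array names chosen only for illustration:

```python
# Hedged sketch of the phoneme-level metrics around Eq. (7). `accepted`
# holds the system decisions and `correct` the reference labels
# (True = the phoneme was pronounced correctly); names are illustrative.
import numpy as np

def pv_metrics(accepted, correct):
    accepted, correct = np.asarray(accepted), np.asarray(correct)
    ta = np.sum(accepted & correct)        # true acceptances
    fa = np.sum(accepted & ~correct)       # false acceptances
    fr = np.sum(~accepted & correct)       # false rejections
    f1 = 2 * ta / (2 * ta + fa + fr)       # Eq. (7)
    fa_rate = fa / max(np.sum(~correct), 1)
    fr_rate = fr / max(np.sum(correct), 1)
    return fa_rate, fr_rate, f1
```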

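The 2-D projections of Figs. 11 and 12 can be reproduced in outline with scikit-learn's t-SNE; a hedged sketch, with matplotlib used only for display and all variable names illustrative:

```python
# Hedged sketch of the t-SNE visualization behind Figs. 11 and 12:
# project the 26-dimensional attribute features of validation frames
# to 2-D and color them by phoneme label.
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_attribute_tsne(attr_feats, phoneme_labels):
    """attr_feats: (n_frames, 26) attribute features; phoneme_labels: list of str."""
    emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(attr_feats)
    for ph in sorted(set(phoneme_labels)):
        idx = [i for i, p in enumerate(phoneme_labels) if p == ph]
        plt.scatter(emb[idx, 0], emb[idx, 1], s=4, label=ph)
    plt.legend(markerscale=3, fontsize=6)
    plt.show()
```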

Fig. 9. The accuracy of the speech attribute detectors on the training, validation and test sets when using a shallow neural network (1 hidden layer) and a deep neural network (5 hidden layers).

Fig. 10. The accuracy of the speech attribute detectors when using Multi Task Learning (MTL), dropout regularization (Dout), both MTL and Dout algorithms
(MTL + Dout) and deep neural network without any regularization method (DNN) for (a) the test set and (b) the training set.


Fig. 11. 2D scatter plot of the t-SNE projection of the 26 attribute features of a subset of the phonemes.

Fig. 12. 2D scatter plot of the t-SNE projection of the 26 attribute features of a subset of the phonemes with and without dropout and MTL.

The attribute features used in training the OCSVM models were extracted from four different attribute detectors, DNN-Dout, DNN-Dout-BN, DNN-MTL and DNN-MTL-BN, where BN refers to extracting the features from the bottleneck layer rather than the output layer. Fig. 13(a) and (b) show the frame-level performance of each phoneme-specific model when using each of the aforementioned attribute features for the consonants and vowels respectively.

The results show that using features extracted from the attribute detectors trained with the MTL criterion improved the performance of almost all the OCSVM phoneme specific models as compared to when features extracted from attribute detectors trained on a single task were used. Overall, the consonants performed better than the vowels, with an average F1 score of 0.88±0.05 compared to 0.86±0.04 for the vowels. The consonants /sh/ and /y/ had the highest performance with an F1 score of around 0.95, while the vowels /ah/ and /ih/ had the lowest performance with a 0.71 F1 score.

We then evaluated the OCSVM models on a phoneme level by testing their performance in classifying the whole speech segment. The whole phoneme was considered in-class if the majority of its frames were classified as in-class, and out-of-class otherwise. In this experiment each phoneme model was tested against samples from the same phoneme and samples from all other phonemes in the test set, and each model's FR and FA rates were computed respectively. For each phoneme we used the model that achieved the best frame-level F1 score on the validation set. The results are summarized in Table 4 along with the number of samples from each phoneme in the test set.

4.3. Results on PV task

In these experiments, we evaluated the performance of the OCSVM based classifiers in detecting pronunciation errors in native speech with artificial errors and in foreign-accented and disordered speech with natural pronunciation errors. In all the experiments we used a system trained only on native speech with the parameters that achieved the best performance on the validation set, as explained in the previous experiments.

In the first experiment, we generated artificial errors by manipulating the labels of the native speakers test set (WSJ0 + TIMIT). This was done to overcome the lack of accurately annotated mispronounced data and provide a larger amount of data to better evaluate the system (Zhao et al., 2012; Yoon et al., 2009; Witt and Young, 1997). Kanters et al. (2009) showed that the behavior of the GOP algorithm is almost the same on both real and artificial error datasets.

In order to simulate a real pronunciation error, the phonemes were changed based on common mispronunciation mistakes made by speakers of Scandinavian languages who learn English as a second language, according to (Smith, 2001). Table 5 summarizes these common errors. The alterations are equally distributed over all possible mispronunciations of any specific phoneme. If phoneme i is the expected error of Ni


Fig. 13. The frame-level F1 score of the OCSVM models of the (a) consonants and (b) vowels when trained using features extracted from different architectures for
speech attribute detectors.

Table 4
The phoneme-level false-acceptance (FA) and false-rejection (FR) rates
of the OCSVM model.

Ph N# FR (%) FA (%) Ph N# FR (%) FA (%)

aa 940 3.19 4.82 g 243 3.29 3.17


ae 623 7.04 6.95 hh 522 6.30 5.32
ah 3695 7.46 12.37 Jh 267 4.87 2.04
ao 306 3.94 3.17 k 1832 2.02 3.48
aw 20 4.17 2.94 l 1552 3.51 1.27
ay 576 2.11 2.74 m 1199 2.59 2.25
eh 1274 7.40 9.42 n 3431 4.65 5.59
er 1355 4.06 3.92 ng 332 5.42 4.45
ey 811 1.34 5.62 p 1479 5.92 4.33
ih 2613 12.60 12.71 r 1961 3.62 2.98
iy 1978 4.64 4.65 s 2403 2.75 2.51
ow 557 3.17 2.53 sh 451 1.55 4.40
oy 152 3.23 2.40 t 3181 8.62 9.75
uh 108 7.27 8.63 th 343 7.58 10.24
uw 508 5.47 4.45 v 803 5.73 3.76
b 693 2.16 5.62 w 602 2.99 2.30
ch 176 5.68 1.71 y 380 3.42 2.34
d 2108 5.68 1.71 z 1337 3.07 6.51
dh 759 5.67 8.39 zh 30 6.67 2.54
f 976 5.68 1.71

Table 5
Common pronunciation errors of Scandinavian speakers learning English.

Phone b ch d dh g jh th w z
Error p t t d k d s v s
Phone zh ae ah aw ey ih ow uw
Error sh eh uh ow ae iy uw aw

Fig. 14. Phoneme-level comparison between the OCSVM and GOP based PV systems in terms of the false-acceptance (FA) and false-rejection (FR) rates when applied on the native English dataset with artificial errors.

phonemes, the occurrences of phoneme i in the test set were divided into Ni + 1 equal folds, where Ni folds were replaced with the Ni phonemes, to simulate the pronunciation errors, and one fold was kept intact, to represent the correct pronunciation of phoneme i.

To demonstrate the effectiveness of our system in pronunciation verification, we compared our OCSVM algorithm with the DNN-GOP algorithm. Fig. 14 shows the FA and FR rates of each phoneme using both the OCSVM and DNN-GOP models. Overall, the OCSVM method outperforms the DNN-GOP algorithm on most of the phonemes.
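A hedged sketch of the artificial-error procedure just described: occurrences of a target phoneme are split into Ni + 1 equal folds, Ni of which have their labels replaced by the expected confusions of Table 5 while one fold is left intact. The error map below reproduces Table 5; the helper name and random seed are illustrative.

```python
# Hedged sketch of the artificial-error generation of Section 4.3:
# labels of a target phoneme are split into N_i + 1 folds, N_i of which
# are replaced by its expected Scandinavian-speaker confusions (Table 5).
import numpy as np

ERROR_MAP = {                      # Table 5 (phone -> expected errors)
    "b": ["p"], "ch": ["t"], "d": ["t"], "dh": ["d"], "g": ["k"],
    "jh": ["d"], "th": ["s"], "w": ["v"], "z": ["s"], "zh": ["sh"],
    "ae": ["eh"], "ah": ["uh"], "aw": ["ow"], "ey": ["ae"],
    "ih": ["iy"], "ow": ["uw"], "uw": ["aw"],
}

def corrupt_labels(labels, seed=0):
    rng = np.random.default_rng(seed)
    labels = list(labels)
    for phone, errors in ERROR_MAP.items():
        idx = [i for i, p in enumerate(labels) if p == phone]
        rng.shuffle(idx)
        folds = np.array_split(idx, len(errors) + 1)   # N_i error folds + 1 intact
        for err, fold in zip(errors, folds[:-1]):
            for i in fold:
                labels[i] = err                         # simulated mispronunciation
    return labels
```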


Fig. 15. Comparison between the OCSVM and GOP based PV when applied on the native English (WSJ0+TIMIT), the disordered speech (CAS) and the foreign-accented speech (GMU) datasets in terms of the F1 score and false-acceptance (FA) and false-rejection (FR) rates.

Fig. 16. A phoneme-level F1 score of (a) CAS and (b) GMU speech using both GOP and OCSVM methods.

The vowel /uw/ achieved the best performance, with FA and FR rates of 1.8% and 1% using the OCSVM method as opposed to 11.1% and 13.6% obtained with the GOP algorithm respectively. However, the GOP discriminated between /zh/ and /sh/ better, where /sh/ is the common mistake of /zh/ as in Table 5, with a FA of 9.2% compared to 16% using the OCSVM algorithm. The vowels /ae/ and /ih/ and the consonant /z/ are the most problematic phonemes, with FA and FR between 25% and 30% respectively for both the OCSVM and GOP models, where the actual pronounced phonemes of the FA samples of these three phonemes were /eh/, /iy/ and /s/ respectively.

In addition to the artificial error dataset, we evaluated the system with two natural-error datasets, the CAS disordered speech and the GMU foreign-accented speech. Fig. 15 shows the FA and FR rates and the F1 scores of the 3 data sets using our OCSVM method and the GOP method. Our method improved the FA and FR rates obtained with the artificial error dataset, decreasing from around 26% and 29% respectively with the GOP method to 19% and 17%, while with the CAS data the FR reduced significantly from 40% using the GOP method to around 26% when using the OCSVM method. Both methods achieved similar FA rates. Moreover, the OCSVM method achieved an F1 score and FA and FR rates of 0.80, 25% and 24% respectively when tested against the GMU foreign-accented data, as opposed to 0.70, 38% and 30% obtained by the GOP.

Fig. 16(a) and (b) show a phoneme-level breakdown of the performance on the CAS and GMU speech using both the GOP and OCSVM methods. The mispronunciation detection of most of the phonemes was better with our OCSVM method compared to the GOP method, with some minor exceptions such as /w/ and /b/ in the CAS test set and /p/ in the GMU test set. The vowel /eh/ and the consonants /n/, /w/, /p/, /th/ and /t/ achieved F1 scores above or close to 0.9 in the CAS speech test set. On the other hand, the consonants /f/, /s/ and /t/ had F1 scores of almost 0.9 or greater. By looking at the common phonemes between the CAS and GMU test sets, we see that both /b/ and /t/ maintained almost the same F1 score, while /th/ degraded significantly from 0.89 to 0.73 with CAS and GMU respectively.

The mispronounced phonemes most difficult to detect were the voiced fricatives /z/ and /v/ in the GMU speech test set. The common mistakes in the pronunciation of these phonemes were replacements with their unvoiced versions /s/ and /f/ respectively. A possible reason as to why the system failed to detect these mispronunciations could be the noise level in the speech files recorded by volunteers with their own personal recording equipment in the GMU archive.

Fig. 17 shows the average performance of our OCSVM system for each accent of the GMU foreign-accented speech dataset. The Dutch and Arabic accents achieved the higher average F1 scores of 0.83 and 0.79 respectively, while the worst performance was obtained with the Farsi accent with an average F1 score of 0.67.

Fig. 17. The breakdown of OCSVM performance on the GMU speech dataset over different accents.


5. Discussion and conclusions ing and testing data were recorded from different domains and environ-
ments demonstrating that the system is domain-independent and robust
In this paper we proposed a novel pronunciation verification ap- against recording setups.
proach that overcomes the need for annotated mispronounced data Dropout is a powerful regularization tool applied successfully on dif-
to model pronunciation errors. Our approach is based upon anomaly ferent applications, and speech attribute detection is no exception. Our
detection, where only correctly pronounced speech is used to train a experiments showed clearly how dropout controlled the model overfit-
phoneme-specific acoustic model that can detect any deviation from the ting over the training set and improved the model generalization. On the
correct pronunciation. We adopted the OCSVM classifier as our anomaly other hand, even though MTL had a minor impact on the performance
detector which was fed by a set of speech attribute features derived from of the speech attribute detectors, it significantly improved the ability
a bank of DNN-based speech attribute detectors modeling the manners of the attribute features to discriminate between phonemes by using
and places of articulation. These attribute features perfectly describe the phoneme classification as a secondary task and therefore improved the
pronunciation characteristics of each phoneme and hence are sensitive performance of the anomaly detection model. This demonstrated an ad-
to any pronunciation error. ditional powerful benefit of using MTL, namely that it can be used to
As our system consists of two basic modules, speech attribute detec- improve the discriminability of the main task in the direction of the
tion and phoneme-specific OCSVM modeling, we first optimized each secondary task.
module separately to obtain the structure that achieves best perfor- Moreover, instead of representing each speech attribute with one
mance and then evaluated the whole system in pronunciation verifi- probability value from the softmax output layer of the DNN classifier,
cation. Two standard correctly pronounced speech corpora, namely the an extended representation was extracted from a bottleneck layer lo-
TIMIT and WSJ, were utilized for training, validating and testing the cated before the last hidden layer. This increases the degrees of freedom
two modules, while pronunciation verification was evaluated using (1) of the extracted features allowing variations among different phonemes
part of the correctly pronouncing speech corpora with artificial errors, sharing same speech attribute model. However, the results showed that
(2) the GMU foreign-accented speech corpus collected from non-native only four consonants, namely /sh/, /y/, /zh/ and /dh/, and two vowels,
English speakers and (3) the CAS children disordered speech corpus. namely /ao/ and /uw/, benefitted from the bottleneck features, while
The DNN-based speech attribute detectors achieved an average ac- the rest achieved better performance when features were extracted from
curacy of 90% ± 2.7% when 26 different manners and places of artic- the output of the softmax layer. A deeper analysis of the extracted fea-
ulation were utilized. These results showed that the DNN models over- tures and applying some sort of feature selection techniques may help
fit the training data. We therefore employed two different techniques in taking advantage of the bottleneck features.
to overcome this problem, dropout and the Multi-Task Learning (MTL) However, some phonemes still suffer from high FA and FR rates
with phoneme classification as a secondary task. The combination of reaching 30% for vowels such as /ih/ and /ae/. These results could
these two algorithms improved the accuracy of the attributes to an av- be improved by working on phoneme-specific feature selection of the
erage accuracy of 91.5% ± 2.5%. Our experiments demonstrated that speech attribute features. Furthermore, employing a more sophisticated
using MTL with phoneme classification as a secondary task increased anomaly detection approach can improve the sensitivity of the phoneme
the ability of the speech attribute DNN classifiers to further discrimi- model to pronunciation errors. Broader testing of the system on a wider
nate between phonemes that shared the same attribute(s). range of different languages and pronunciation verification domains is
OCSVM models were trained and tuned for each specific phoneme. Each speech frame was first converted to its corresponding speech attribute feature vector of size 26, representing the existence probability of each of the 26 adopted speech attributes. The results demonstrated that using the OCSVM with speech attribute features is effective in modeling the correct pronunciation of each phoneme and in discriminating between frames belonging to the underlying phoneme (in-class) and frames belonging to other phonemes (out-of-class), with average F1 scores of 0.88 ± 0.05 and 0.86 ± 0.04 for the consonants and vowels respectively. The results also showed that the features extracted from the speech attribute detectors trained with MTL achieved the best performance across the majority of the phonemes.
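A minimal sketch of this per-phoneme modeling step is given below. The function and variable names, the segment-level decision rule and the hyperparameter values are assumptions made for illustration, not the tuned settings reported here.

```python
# Per-phoneme one-class SVM models trained only on correctly pronounced
# frames; each frame is a 26-dimensional speech attribute probability vector.
import numpy as np
from sklearn.svm import OneClassSVM

def train_phoneme_models(frames_by_phoneme, nu=0.1, gamma="scale"):
    """frames_by_phoneme: dict mapping phoneme -> array of shape (n_frames, 26)."""
    models = {}
    for phoneme, feats in frames_by_phoneme.items():
        models[phoneme] = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(feats)
    return models

def verify_segment(models, target_phoneme, feats, min_in_class_ratio=0.5):
    """Accept the segment if enough of its frames fall inside the in-class region."""
    decisions = models[target_phoneme].predict(feats)   # +1 = in-class, -1 = outlier
    return "correct" if np.mean(decisions == 1) >= min_in_class_ratio else "mispronounced"
```

In practice the kernel parameters and the frame-aggregation rule would be tuned per phoneme on held-out correctly pronounced data.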
Our results with the integrated PV system showed that our anomaly detection approach outperforms the commonly used DNN-GOP algorithm on all test sets. Our approach achieved F1 scores of 0.84, 0.82 and 0.80, compared to 0.76, 0.71 and 0.70 obtained using the DNN-GOP algorithm, when applied to the native speech with artificial errors, the GMU foreign-accented speech and the CAS disordered speech respectively. The system was more effective with real errors than with the artificially produced ones. This could be explained by the nature of real pronunciation errors, which in most cases are distorted versions of the target phoneme rather than complete replacements with another phoneme. Real pronunciation errors are more challenging for confidence-score based methods such as the DNN-GOP because it is very hard to determine a pre-defined threshold value that can discriminate between correct and distorted pronunciations.
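For reference, one common form of the DNN-based GOP baseline can be sketched as follows; the exact normalisation and the threshold value vary between GOP variants, so the numbers below are placeholders.

```python
# Sketch of a DNN-GOP style confidence score: the average log posterior of the
# canonical phone over its force-aligned frames, thresholded to accept/reject.
import numpy as np

def dnn_gop(frame_posteriors, target_phone_idx):
    """frame_posteriors: array (n_frames, n_phones) of DNN phone posteriors for
    the frames aligned to the target phone."""
    post = np.clip(frame_posteriors[:, target_phone_idx], 1e-10, 1.0)
    return float(np.mean(np.log(post)))

def gop_decision(score, threshold=-2.0):
    # A single global (or phone-dependent) cut-off; as discussed above, such a
    # threshold struggles to separate correct from merely distorted realisations.
    return "correct" if score >= threshold else "mispronounced"
```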
The results showed that the speech attribute features can efficiently model the articulatory characteristics of each phoneme. The anomaly detection model trained using these features was sensitive to any deviation in the modeled characteristics. More interestingly, the system is robust against adult-child acoustic variations, as the model was trained on adult speech and tested on disordered children's speech. Moreover, the train- […]

On the other hand, even though MTL had a minor impact on the performance of the speech attribute detectors, it significantly improved the ability of the attribute features to discriminate between phonemes by using phoneme classification as a secondary task, and therefore improved the performance of the anomaly detection model. This demonstrated an additional powerful benefit of using MTL, namely that it can be used to improve the discriminability of the main task in the direction of the secondary task.
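This discriminability claim lends itself to a simple qualitative check: projecting frame-level attribute feature vectors extracted with and without MTL into two dimensions and inspecting how well frames of different phonemes separate. The sketch below illustrates one such check; the variable names are illustrative and this is not a procedure reported in the present work.

```python
# Qualitative check of phoneme separability in the 26-dimensional attribute
# feature space using a 2-D t-SNE projection.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_phoneme_separability(features, phoneme_labels, perplexity=30):
    """features: (n_frames, 26) attribute vectors; phoneme_labels: list of str."""
    embedded = TSNE(n_components=2, perplexity=perplexity, init="pca").fit_transform(features)
    for ph in sorted(set(phoneme_labels)):
        idx = np.array([i for i, p in enumerate(phoneme_labels) if p == ph])
        plt.scatter(embedded[idx, 0], embedded[idx, 1], s=4, label=ph)
    plt.legend(fontsize=6, ncol=4)
    plt.title("t-SNE of speech attribute features")
    plt.show()
```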

Moreover, instead of representing each speech attribute with one probability value from the softmax output layer of the DNN classifier, an extended representation was extracted from a bottleneck layer located before the last hidden layer. This increases the degrees of freedom of the extracted features, allowing variation among the different phonemes sharing the same speech attribute model. However, the results showed that only four consonants, namely /sh/, /y/, /zh/ and /dh/, and two vowels, namely /ao/ and /uw/, benefitted from the bottleneck features, while the rest achieved better performance when features were extracted from the output of the softmax layer. A deeper analysis of the extracted features, together with suitable feature selection techniques, may help in taking full advantage of the bottleneck features.
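As an illustration of this extended representation, the detector sketched earlier can be given a narrow layer whose activations are read out as features; the bottleneck width below is an assumed value, not the one used in this work.

```python
# Attribute detector with a bottleneck layer placed before the last hidden
# layer; its activations provide the extended per-attribute representation.
import torch.nn as nn

class AttributeDetectorBottleneck(nn.Module):
    def __init__(self, n_inputs=440, n_hidden=1024, n_bottleneck=13):
        super().__init__()
        self.front = nn.Sequential(
            nn.Linear(n_inputs, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_hidden), nn.Sigmoid(),
        )
        self.bottleneck = nn.Linear(n_hidden, n_bottleneck)
        self.last_hidden = nn.Sequential(nn.Linear(n_bottleneck, n_hidden), nn.Sigmoid())
        self.out = nn.Linear(n_hidden, 2)                    # present / absent

    def forward(self, x, return_bottleneck=False):
        b = self.bottleneck(self.front(x))
        if return_bottleneck:
            return b      # extended features fed to the OCSVM instead of the
                          # single softmax probability
        return self.out(self.last_hidden(b))
```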

However, some phonemes still suffer from high FA and FR rates, reaching 30% for vowels such as /ih/ and /ae/. These results could be improved by working on phoneme-specific selection of the speech attribute features. Furthermore, employing a more sophisticated anomaly detection approach could improve the sensitivity of the phoneme models to pronunciation errors. Broader testing of the system on a wider range of languages and pronunciation verification domains, such as second language learning, is also needed.
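One possible (hypothetical, not part of the reported system) phoneme-specific selection scheme would rank the 26 attribute features by mutual information with the in-class/out-of-class label and retain only the most informative ones for each phoneme's OCSVM:

```python
# Hypothetical phoneme-specific feature selection: keep the k attribute
# features most informative about whether a frame is in-class or out-of-class.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def select_attributes_for_phoneme(in_class_feats, out_class_feats, k=10):
    X = np.vstack([in_class_feats, out_class_feats])
    y = np.concatenate([np.ones(len(in_class_feats)), np.zeros(len(out_class_feats))])
    scores = mutual_info_classif(X, y)
    return np.argsort(scores)[::-1][:k]   # indices of the k most informative attributes
```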
In the future, we intend to focus on (1) improving the performance of the speech attribute detectors by utilizing more speech corpora and trying different DNN architectures, (2) carefully studying the speech attribute features and selecting the most discriminative features for each phoneme, and (3) exploring other anomaly detection methods such as the auto-encoder and LSTM. Moreover, to further validate the effectiveness of our approach over other PV methods, we intend to conduct a comparison with a binary classification method, which has proved to be the most accurate when enough correctly and incorrectly pronounced data is available.

Declaration of Competing Interest

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.

Acknowledgments

This work was made possible by NPRP grant #[8-293-2-124] from the Qatar National Research Fund (a member of Qatar Foundation). The statements made herein are solely the responsibility of the authors.