
Biomedical Engineering: Applications, Basis and Communications, Vol. --, No. -- (--)
DOI: --

Computer-Aided Speech-Language Therapy Using Automatic Speech Recognition Technique During the COVID-19 Pandemic

Hala S. Abuelmakarem*, Sahar Ali Fawzi†, Amal Quriba‡, Ahmed Elbialy§ and Ahmed Hisham Kandil¶

*Systems and Biomedical Engineering Department, The Higher Institute of Engineering, El-Shorouk Academy, El-Shorouk City, Cairo, Egypt
†School of Information Technology and Computer Science, Nile University, Sheikh Zayed District, 6th of October City, Giza, Egypt
‡Phoniatric Unit, ENT Department, Faculty of Medicine, Zagazig University, Zagazig, Egypt
§¶Systems and Biomedical Engineering Department, Faculty of Engineering, Cairo University, Giza, Egypt
*h.saad@sha.edu.eg
†sahar_fw@yahoo.com
‡asq281@yahoo.com
§abialy_86@yahoo.com
¶ahkandil_1@yahoo.com

Accepted Day Month Year
Published Day Month Year

ABSTRACT
Objectives: This study aims to develop a computer-aided therapy (CAT) application that helps children who suffer from delayed language development (DLD) improve their language, especially during the COVID-19 pandemic.
Methods: The implemented system teaches children four body parts using the Egyptian accent. Two datasets were collected at the clinic: one from healthy children (2800 words) and one from children who have DLD (236 words). The model was implemented as a speaker-independent isolated word recognizer based on a discrete-observation Hidden Markov Model (DHMM) classifier. After the speech-signal preprocessing step, the k-means algorithm generated a codebook to cluster the speech segments. This task was completed using MATLAB. The graphical user interface was implemented successfully in C# to complete the CAT application. The system was tested on healthy children and children with DLD. In addition, in a small clinical trial, five children who have DLD used the program so that their pronunciation progress could be monitored during therapeutic sessions.
Results: The maximum recognition rate was 95.25% for the healthy-children dataset and 93.82% for the DLD dataset.
Conclusion: The discrete-observation HMM was implemented successfully using nine and five states based on different codebook sizes (160, 200). The implemented system achieved a high recognition rate on both datasets. The children enjoyed using the application because it was interactive. Children who have DLD can use speech recognition applications.

Keywords: Speech recognition, isolated word recognition technique, Hidden Markov Model, delayed language development therapy
1. INTRODUCTION

Speech is the key to communication. Children who have DLD encounter communication difficulties and subsequently cannot practice everyday life activities such as playing and education. DLD is a symptom of many disorders, such as mental retardation, hearing loss, expressive language disorder, psychosocial deprivation, autism, and cerebral palsy1. Those children need a therapeutic plan to develop their language. They attend periodic education sessions to enhance their ability to speak and communicate. The COVID-19 pandemic decreased the number of patients in therapeutic clinics and increased the need to develop new education applications. Therefore, computer-assisted language learning (CALL) and home learning are keys to improving those children's progress. They are considered suitable alternatives to the traditional (face-to-face) learning process. Computer-aided pronunciation learning (CAPL) received considerable attention as people moved into the twenty-first century.

¶ Corresponding author: Hala S. Abuelmakarem, Systems and Biomedical Engineering Department, The Higher Institute of
Engineering, El-Shorouk Academy, El-Shorouk City, Cairo, Egypt.
E-mail: h.saad@sha.edu.eg

Computer-aided learning got great attention after the COVID-19 crisis. Recent research improved speech recognition systems using various techniques based on statistical approaches such as Gaussian Mixture Models, Artificial Neural Networks, and HMMs3. In the last two decades, researchers have been interested in HMMs because of their success in countless real-world applications such as automated speech recognition, gene prediction, and automatic facial expression recognition4.

Lately, a review has confirmed that the HMM is the most simplified and successful approach for speech recognition5. Recent research improved speech recognition systems, especially in second-language teaching: Witt SM6 developed a system to assess phone pronunciation based on the automatic speech recognition (ASR) technique. Hamid B7 developed an education system to model speech attributes for foreign-language training based on i-vector modeling. Moreover, Mahdy S8 built a recognition system to teach Arabic speech via the 'HAFSS' application based on HMM. Bezoui M9 implemented a system using HMM to recognize the Moroccan dialect. Automatic speech recognition applications need an essential training stage to prepare each word for the recognition stage. In terms of segmentation, after speech signal processing, a feature extraction process extracts temporal and/or spectral information from the speech signal by converting the input signal into a compact set of parameters. Different feature extraction approaches, such as linear predictive coding (LPC) coefficients, cepstral coefficients, and Mel-frequency cepstral coefficients (MFCCs), are commonly used10. Cepstral analysis is a particular case of homomorphic signal processing.

A homomorphic system is a nonlinear system that obeys a generalized principle of superposition of the input signals under a nonlinear transformation. Cepstral analysis has become popular in speech recognition since its introduction in the late 1960s due to its power in modeling human speech production11. Deller J. R. defines the Mel-scale conversion as the most popular derivation of cepstral analysis, combining the cepstrum with a nonlinear frequency warping12. Among all speech feature extractors, MFCCs are the most efficient because they extract spectral information in the frequency domain; spectral analysis is more concise than temporal analysis13 because it mimics the human ear's spectral characteristics. The extracted features are introduced to vector quantization (VQ) and codebook generation. VQ is a powerful technique that converts a colossal set of vectors into groups represented by their centroid vectors, using the k-means algorithm to implement the codebook, which is introduced to the statistical model to implement the recognition system4.

This research aims to develop a CAT application based on automatic speech recognition technology to help children who suffer from DLD improve their language. The application was implemented using MFCCs and HMMs with different codebook generators. The code was developed under the MATLAB environment, and the graphical user interface (GUI) was implemented using C#. The children's performance was monitored in the clinic during therapeutic sessions to evaluate the program's validity. This application can be used at home under a parent's supervision or in clinics to help DLD children along their treatment journey.

2. MATERIALS AND METHODS

2.1. Dataset collection

Two datasets were collected from healthy children and children who have DLD. The healthy dataset (2800 words) was collected from fourteen children (ages ranging from 3 to 9 years). Each child repeated each word 50 times (at different pronunciation speeds, in different places). This dataset was collected from children at home and in nursery school; it was divided between training (2400 words) and testing (400 words). The unhealthy dataset, which contained 236 words, was collected in the clinic (Phoniatric Unit, ENT Department, Zagazig University Hospital) from DLD children, to test children's performance in the age range from four to nine years. All words were uttered in the Egyptian dialect (44.1 kHz sampling frequency, 16-bit resolution).

2.2. System Implementation

This CAT application aims to develop a speaker-independent isolated word recognizer (IWR) from acoustic signals based on the DHMM. The IWR system was divided into two phases (a training phase and a recognition phase). In both phases, the speech boundaries were detected, and the speech features were extracted14. Then, speech segments were allocated to their clusters using vector quantization based on the k-means algorithm. In the training phase, the codebook was generated from quantized observation sequences of the speech segments based on the k-means algorithm. In the recognition phase, the outputs of the HMM training were used to recognize the word of interest; Fig. 1 illustrates the two phases of the IWR block diagram.

2.2.1. Speech preprocessing and word boundaries detection

Signal preprocessing, endpoint detection (EPD), and speech segment framing play an essential role in the success of the recognition system4. A notch filter at a cut-off frequency of 60 Hz removes equipment-related disturbances14 to improve the recognition rate.

2.2.1.1. Zero-crossing rate

Short-time energy (STE) and zero-crossing rate (ZCR) were used to detect the boundaries of each spoken word by separating silence from the spoken segments. Equation (1) describes the algorithm that determines the spoken word boundaries based on STE and ZCR; this algorithm compares the average frequency of the primary energy concentration in the voiced speech signal to that of the unvoiced speech segments14, because most voiced speech energy lies at lower frequencies than unvoiced signals.

ZCR = \frac{1}{N}\sum_{n=m-N+1}^{m}\frac{|\mathrm{sgn}\,s(n)-\mathrm{sgn}\,s(n-1)|}{2}\,w(m-n)   (1)

The N-length frame interval ends at n = m; the multiplication by a factor of 1/N takes the average value of the zero-crossing measurements.

2.2.1.2. Framing and windowing for short-term analysis
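The framing and endpoint-detection steps of Secs. 2.2.1.1 and 2.2.1.2 can be sketched in a few lines. The following is an illustrative Python reconstruction, not the authors' MATLAB implementation: it frames a signal into 20 ms windows with 50% overlap, applies the Hamming window of Eq. (2), and computes the per-frame short-time energy and zero-crossing rate of Eq. (1). The toy signal, the 200 Hz tone standing in for voiced speech, and the 0.1 x max energy threshold are assumptions introduced for the example only.

```python
import numpy as np

def frame_signal(x, fs, frame_ms=20, overlap=0.5):
    """Split x into overlapping frames and apply a Hamming window (Eq. (2))."""
    n = int(fs * frame_ms / 1000)            # samples per 20 ms frame
    hop = int(n * (1 - overlap))             # 50% overlap -> hop = n/2
    starts = range(0, len(x) - n + 1, hop)
    frames = np.array([x[i:i + n] for i in starts])
    # Hamming window: w(k) = 0.54 - 0.46*cos(2*pi*k/(N-1))
    return frames * np.hamming(n)

def ste_zcr(frames):
    """Per-frame short-time energy and zero-crossing rate (cf. Eq. (1))."""
    ste = np.sum(frames ** 2, axis=1)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) / 2, axis=1)
    return ste, zcr

# Toy signal at the paper's 44.1 kHz rate: near-silence followed by a
# 200 Hz tone standing in for a voiced segment (illustrative assumption).
fs = 44100
rng = np.random.default_rng(0)
t = np.arange(fs // 2) / fs
x = np.concatenate([0.001 * rng.standard_normal(fs // 2),
                    np.sin(2 * np.pi * 200 * t)])
ste, zcr = ste_zcr(frame_signal(x, fs))
# Illustrative threshold: voiced frames show high energy and, typically,
# a lower ZCR than unvoiced or noise frames.
voiced = ste > 0.1 * ste.max()
```

In the paper's pipeline, the frames flagged as speech would then pass to the MFCC feature extractor of Sec. 2.2.2.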

Speech is a dynamic and non-stationary process. Speech analysis usually presumes that the non-stationary speech properties change relatively slowly over time15,16. Therefore, the speech signal is divided into overlapping short frames, ranging from 10 ms to 40 ms, to capture the speech features17.

In mathematical terms, framing (windowing) means the multiplication of a speech signal s(n) by a window w(n) to weight the samples by the shape and duration of the window. The Hamming window is a raised cosine with the particular coefficients represented in Eq. (2):

w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad n = 0, 1, \ldots, N-1; \qquad w(n) = 0 \text{ otherwise}   (2)

This research uses the Hamming window to avoid data loss and to keep the frequency and temporal variations. The current work applied a notch filter to remove tonal disturbances and the ZCR approach to detect the word boundaries. The notch filter was set at a cut-off frequency of 60 Hz to remove equipment-related disturbances. The speech signals were divided into overlapping frames (length: 20 ms, overlap: 50%) to capture temporal changes over a frame. Each frame was multiplied by a Hamming window function to minimize the spectral distortion caused by the discontinuities of abrupt changes at the window endpoints16.

2.2.2. Mel-frequency cepstral coefficients feature extractor

Mel-scale conversion is used to mimic the human ear's spectral characteristics. The Mel-frequency cepstral coefficients (MFCC) feature extractor is a leading technique for speech feature extraction because it improves speaker verification. In this study, 12 MFCCs and 12 delta-MFCCs, computed from a Mel-scale filter bank of 24 triangular filters, were selected to extract the speech features of each frame. The Mel scale is expressed as linear frequency spacing below 1 kHz and logarithmic spacing above 1 kHz to mimic the human ear's spectral characteristics. It maps an acoustic frequency to a perceptual frequency scale as in the following equation17:

F_{Mel} = 2595 \log_{10}\left(1 + \frac{f(\mathrm{Hz})}{700}\right)   (3)

The MFCCs of monosyllabic word recognition are computed as in Eq. (4):

MFCC_n = \sum_{k=1}^{m} X_k \cos\left(n\left(k - \frac{1}{2}\right)\frac{\pi}{m}\right), \quad n = 0, 1, 2, \ldots, M   (4)

where M is the number of MFCC coefficients and X_k, k = 1, 2, ..., m, represents the kth filter's log-energy output.

2.2.3. Vector quantization and codebook generation

Vector quantization is a technique in which speech feature vectors are clustered around some centroid locations. This research implemented the codebook using the k-means algorithm depending on the centroid vectors. The generation process is summarized simply as follows16: in vector quantization, the vector x is mapped to another real-valued, discrete-amplitude, L-dimensional column vector y as

y = Q(x)   (5)

Typically, y takes on one of K distinct values, Y = \{ y_j, 1 \le j \le K \}, where y_j represents a vector of the form y_j = [\, y_j(1), y_j(2), \ldots, y_j(L) \,]^T. The set Y is the codebook, and each y_j is a code vector.

The L-dimensional space S_L of the speech feature vector x is partitioned into K mutually exclusive regions or clusters C_j (1 \le j \le K), where K is pre-selected, referring to the number of required clusters. A code vector y_j is associated with each cluster, and the vector quantizer assigns the code vector y_j to the feature vector x_i that lies in the cluster C_j:

Q(x) = y_j \quad \text{if } x \in C_j   (6)

In this study, the computation of the code vectors of the training set starts with an arbitrary random initial estimate of the code vectors and applies the nearest-neighbor condition and the centroid condition iteratively until a termination criterion is satisfied. This research implemented two codebooks using two cluster sizes (160, 200) to obtain the optimal recognition rate.

2.2.4. Discrete-symbol Hidden Markov Model

The DHMM is a statistical approach in which the system is modeled with unobserved (hidden) states. The DHMM works with a finite set of integer symbols from a codebook; it represents unknown words as observation sequences. Fig. 2 shows the isolated word recognition technique based on the DHMM. The system was implemented in the following steps13:

First step: A separate HMM is built for each word of the vocabulary using the same number of states. This step involves the training phase to estimate the model parameters

\lambda_w = (A, B, \pi)   (7)

where A represents the state transition probabilities, B is the symbol probability distribution in a state s_j, and \pi = \{1\ 0\ 0\ 0\ 0\} is the initial state probability distribution.

Second step: The trained HMMs identify the unknown word in the testing dataset. The recognition phase involves computing, for all possible models, the likelihood that the observation sequence belongs to the designated unknown word. The index of the model with the maximum likelihood identifies the designated word as13

W^* = \arg\max_{1 \le w \le N} P(O \mid \lambda_w)   (8)
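To make the two steps of the DHMM recognizer concrete, the sketch below implements the maximum-likelihood decision of Eq. (8) for discrete observation sequences: a scaled forward algorithm accumulates log P(O | lambda_w) for each word model, and the recognizer returns the word with the highest likelihood. This is an illustrative Python sketch, not the authors' MATLAB code; the two hand-set five-state models, the 8-symbol codebook, and all probability values are invented for the example (real models would be Baum-Welch-trained on the 160- or 200-symbol codebooks described above).

```python
import numpy as np

def forward_loglik(obs, A, B, pi):
    """Scaled forward algorithm: log P(O | lambda=(A,B,pi)) for a
    discrete observation sequence obs of codebook indices."""
    alpha = pi * B[:, obs[0]]            # initialization
    c = alpha.sum()
    loglik = np.log(c)
    alpha = alpha / c                    # scale to avoid underflow
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # induction over states
        c = alpha.sum()
        loglik += np.log(c)              # accumulate scaling factors
        alpha = alpha / c
    return loglik

def recognize(obs, models):
    """Eq. (8): pick the word whose model maximizes P(O | lambda_w)."""
    return max(models, key=lambda w: forward_loglik(obs, *models[w]))

# Toy left-to-right models: 5 states with pi = [1 0 0 0 0] as in the
# paper, but an invented 8-symbol codebook and hand-set probabilities.
n_states, n_symbols = 5, 8
pi = np.array([1.0, 0, 0, 0, 0])
A = 0.5 * (np.eye(n_states) + np.eye(n_states, k=1))  # stay or advance
A[-1, -1] = 1.0                                       # absorbing final state

def emissions(favored):
    """Uniform low floor with extra mass on the favored symbols."""
    B = np.full((n_states, n_symbols), 0.02)
    B[:, favored] = (1 - 0.02 * (n_symbols - len(favored))) / len(favored)
    return B

models = {"mouth": (A, emissions([0, 1]), pi),
          "hand":  (A, emissions([6, 7]), pi)}
print(recognize([0, 1, 0, 1, 1], models))   # -> mouth
```

The left-to-right structure is encoded in A (each state may only stay or advance), matching the paper's pi = {1 0 0 0 0} initialization for sequential word patterns.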

In this research, the implemented system uses the Egyptian dialect to teach the children four body parts (mouth, hand, leg, hair). Two thousand four hundred spoken words were introduced to generate the two codebooks (160, 200) and train the DHMM. A left-to-right DHMM structure attempts to model the sequential pattern. The DHMM was implemented using two different numbers of states (5, 9) to evaluate the optimal number of states. Nine states were assumed to allot three states to each word phoneme; five states is a suggested model size for IWR16. The training sequences of quantized vectors were introduced to train the DHMM and calculate the model parameters. In the testing step, the recognizer selects the model that maximizes the probability of the data11, as in the following equation:

M^* = \arg\max_{i = 1, \ldots, 4} P(O \mid M_i)

The code was implemented under the MATLAB environment. The implemented system was tested with the datasets collected from healthy children and from children who have DLD at the clinic.

2.2.5. Interactive graphical user interface

This application teaches the children some of their body parts. The first stage is the teaching stage: after the mouse is moved over the body, the selected organ is shown enlarged, and the computer repeats the utterance of the organ name until the children move to the pronunciation stage. In the second stage, the children utter the chosen words until the pronunciation is successful, and then the computer displays a congratulation message in case of a correct answer. Fig. 3 displays a screenshot of the interactive interface; this task was completed using the C# programming language.

2.3. Clinical study

This research evaluated the progress of children who have DLD during their therapeutic sessions in the clinic; five cases joined the study, with ages ranging from four to six years. The children's responses differed from one child to another; some cases refused to utter some words.

3. RESULTS

Children who have DLD encounter learning difficulties. Physicians plan a complete therapeutic protocol to improve their communication skills. This research aims to develop a CAT application based on the automatic speech recognition (ASR) technique and to evaluate the validity of ASR in the therapy of DLD children. The ASR system was implemented using the DHMM13. The experimental study was assayed in two stages. The first stage evaluates the appropriate number of clusters (160, 200) using different numbers of states (5, 9) to recognize the input spoken word using the testing datasets collected from healthy (400 words) and DLD (236 words) children. In the second stage, DLD children used the application in a small clinical study to evaluate the possibility of dealing with those children.

3.1. System evaluation

Table 1 shows the average recognition rate for healthy and DLD children when the HMM was implemented using five and nine states. Each state count was tested using 160 or 200 clusters.

3.2. Clinical study assessment

In the clinical study, the implemented system taught the children four body parts in the Egyptian dialect (Mouth, Foot, Hair, Hand). The DLD children's trials are shown in Table 2. After several trials, the system recorded that the patients started to pronounce the selected words well. Some of those children refused to utter some words and to complete the session.

4. DISCUSSION

DLD children encounter everyday-life difficulties. They undergo a therapeutic protocol to improve their skills, including conversation sessions to enhance their language. Home learning and computer-aided pronunciation learning are necessary nowadays. This research aims to develop a computer-aided pronunciation therapy application based on the ASR technique and to evaluate the validity of ASR in the therapy of DLD children. Table 1 shows the children's performance: the best average recognition rate (95.25%) was for the healthy children when the data were quantized into 160 clusters and modeled with nine states. The DLD recognition rate was 93.82% when the DHMM was implemented using nine states and the speech features were quantized into 200 clusters. Overall, after a thorough analysis, the different numbers of clusters did not cause a significant effect on the recognition rate for the healthy children, but they caused a dramatic change in the DLD children's results. Increasing the number of states to nine improves the recognition rate because three states are assigned to each syllable. On the other hand, in a small clinical evaluation, the implemented recognition system evaluated the children's responses while teaching them their body parts (Mouth, Foot, Hair, Hand) using the Egyptian accent. As represented in Table 2, after several trials the children pronounced the selected words well. The TRUE and FALSE terms used in Table 2 indicate the maximum-likelihood output as represented in Eq. (8). The first case uttered two words (Foot, Mouth) and refused to utter the other words (Hair, Hand). The fifth case refused to utter the word (Hand). The remaining cases uttered all words. The children's responses differ according to age and delay stage.

Fortunately, the interactive user interface led to grateful use by the children because it was friendly; they enjoyed the congratulation message. In general, it is not fair to compare the results of implemented speech recognition systems because of the differences in datasets and in classification techniques. However, Table 3 shows previous Arabic speech recognition research: Sadeghian R18 implemented a recognition system based on MFCCs, HMM, and a deep neural network (DNN); it achieves different recognition rates according to the children's age. Moner N. M.19 implemented an Arabic speech recognition system using the MFCCs feature extractor and the HMM and KNN classifiers based on phonemes, and it attains 83.75% recognition accuracy. Other Arabic speech recognition systems were implemented on adults' datasets: Mahfoudh A. S.20 improved a speech recognition system based on

long short-term memory and MFCC features, using HMM and a recurrent neural network (RNN); the accuracy rate was 94%. Zada B.21 used MFCCs as the feature extractor with HMM and CNN classifiers and achieved an 84.1% recognition rate. Bourouba H.22 accomplished a recognition rate of 89.79% using the KNN algorithm and 87.48% using support vector machines.

The presented study evaluated the automatic speech recognition application as part of a therapeutic plan for children who suffer from DLD. Children enjoyed using the application. Consequently, our application is a suitable candidate for clinics and homes to improve children's pronunciation. Implementing children's speech recognition is a challenging task because dataset collection is difficult; the wide age range causes a disparity in voice tone, leading to a slight decrease in the recognition rate. The recognition rate is not strongly affected by the choice between a neural network and the HMM.

In future experiments, the system can be improved by using other feature extraction techniques and other classification approaches such as machine and deep learning. Also, the system can be implemented based on phoneme recognition.

5. CONCLUSION

In this study, the DHMM was implemented effectively using nine and five states based on different codebook sizes (160, 200) in the MATLAB environment. The graphical user interface was implemented successfully in C# to complete the CAT application. The difference in cluster number was insignificant for the normal children's dataset but significant when tested on the DLD children's dataset. A small clinical study assessed the validity of using the automatic speech recognition system in DLD children's therapy. Overall, the children joyously interacted with the GUI, and their pronunciation was enhanced. The developed program is suitable for clinics and homes during the COVID-19 pandemic. Future explorations to improve the recognition system using other implementation criteria are in progress.

Acknowledgment: The authors would like to acknowledge Prof. Amal Saeed Quriba, Faculty of Medicine, Zagazig University, who allowed testing the performance of children with language disabilities.

References
1. Committee on the Evaluation of the Supplemental Security Income (SSI) Disability Program for Children with Speech Disorders and Language Disorders, Speech and Language Disorders in Children: Implications for the Social Security Administration's Supplemental Security Income Program, Washington: National Academies Press, 2016.
2. Tsai P-h, Beyond self-directed computer-assisted pronunciation learning: a qualitative investigation of a collaborative approach, Comput. Assist. Lang. Learn., 32:7, 2019.
3. Benselama ZA, Arabic speech pathology therapy computer-aided system, J. Comput. Sci., 3:9, 685-692, 2007.
4. Jang R, Audio Signal Processing and Recognition, Mirlab, online book, 2007.
5. Bhavya M, Sunita G, Ajay K, A systematic review of hidden Markov models and their applications, Arch. Comput. Methods Eng., 28, 2021.
6. Witt SM, Young SJ, Phone-level pronunciation scoring and assessment for interactive language learning, Speech Commun., 30, 2000.
7. Hamid B, Ville H, i-Vector modeling of speech attributes for automatic foreign accent recognition, IEEE/ACM Trans. Audio, Speech, Language Process., 24:11, 2016.
8. Mahdy S, Eldeen S, Computer aided pronunciation learning system using speech recognition techniques, in: INTERSPEECH 2006 - ICSLP, Ninth International Conference on Spoken Language Processing, Pittsburgh, PA, 2006.
9. Bezoui M, Beni HA, Elmoutaouakkil A, Speech recognition of Moroccan dialect using hidden Markov models, Procedia Comput. Sci., 151:1, 2019.
10. Batista GC, Silva WLS, Menezes AG, Automatic speech recognition using support vector machine and particle swarm optimization, Athens, 2016.
11. Kurcan RS, Isolated word recognition using in-ear microphone data and hidden Markov models, Monterey, California: Naval Postgraduate School, 2006.
12. Deller JR, Discrete-Time Processing of Speech Signals, New York: Wiley-IEEE Press, 2000.
13. Rabiner LR, Juang B-H, Fundamentals of Speech Recognition, Prentice Hall, 1993.
14. Nilsson A, Claesson I, GSM TDMA frame rate internal active noise cancellation, Int. J. Acoust. Vib., 8:3, 2003.
15. Deng DL, Speech Processing: A Dynamic and Optimization-Oriented Approach, Marcel Dekker, New York, 2003.
16. Picone JW, Signal modeling techniques in speech recognition, Proc. IEEE, 81:9, 1993.
17. Kurcan RS, Isolated word recognition using in-ear microphone data and hidden Markov models, Monterey, California: Naval Postgraduate School, 2006.
18. Sadeghian R, Automatic speech recognition techniques for diagnostic predictions of human health disorders, University of New York, 2017.
19. Moner NM, A dataset for speech recognition to support Arabic phoneme pronunciation, Int. J. Image Graph. Signal Process., 2018.
20. Mahfoudh AS, Spoken Arabic digits recognition using deep learning, IEEE International Conference on Automatic Control and Intelligent Systems (I2CACIS 2019), 2019.
21. Zada B, Pashto isolated digits recognition using deep convolutional neural network, Heliyon, 2020.
22. Bourouba H, New hybrid system (supervised classifier/HMM) for isolated Arabic speech recognition, International Conference on Information & Communication Technologies, IEEE Xplore, 2006.

Fig 1: Block Diagram for Isolated Word Recognition System.

Table 1. Recognition rate for healthy children and DLD children

States   Clusters   Healthy children (%)   DLD children (%)
5        160        94.5                   74.92
5        200        94.5                   80.92
9        160        95.25                  79.25
9        200        94.92                  93.82

Table 2. Children's progress in the clinical study

Case     Trial   Mouth   Foot    Hair    Hand
Case 1   1       FALSE   TRUE    -       -
         2       TRUE    TRUE    -       -
Case 2   1       FALSE   TRUE    TRUE    FALSE
         2       TRUE    TRUE    TRUE    FALSE
         3       TRUE    TRUE    TRUE    FALSE
         4       TRUE    -       -       -
Case 3   1       FALSE   TRUE    FALSE   FALSE
         2       FALSE   TRUE    FALSE   FALSE
         3       FALSE   FALSE   TRUE    -
         4       FALSE   TRUE    TRUE    -
         5       TRUE    TRUE    -       -
Case 4   1       TRUE    FALSE   TRUE    TRUE
         2       FALSE   TRUE    -       -
         3       TRUE    -       -       -
Case 5   1       TRUE    FALSE   TRUE    -
         2       TRUE    TRUE    -       -
         3       TRUE    -       -       -

Fig. 2: Block diagram of the DHMM recognizer.

Fig. 3: The interactive graphical user interface shown to the children.

Table 3: Summary of previous work related to this study

Reference                     Feature extraction          Classification method      Dataset                                        Age    Accuracy
Children
Sadeghian R. (2017)18         MFCC                        HMM + DNN                  206 children, each child spoke 100 words       5-11   72.38% (age 10), 68% (age 9), 65% (age 8)
Moner N. M. et al. (2018)19   MFCCs                       HMM + KNN                  63 children, each child produced 28 Arabic     5-11   83.75%
                                                                                     phonemes 10 times
Adults
Mahfoudh A. S. (2019)20       MFCCs + LSTM                HMM + recurrent neural     1040 samples (840 for training, 200 for               94%
                                                          network (RNN)              testing)
Zada B., Ullah R. (2020)21    MFCC                        HMM + CNN                  50 utterances of the digits 0 to 9                    84.1%
Bourouba H. et al. (2006)22   MFCC, log pitch of energy   HMM + KNN                  920 samples (92 speakers x 10 digits);                89.79% for KNN, 87.48% for SVM
                                                                                     92 participants (46 males, 46 females)
