
Applied Acoustics 175 (2021) 107810

Contents lists available at ScienceDirect

Applied Acoustics
journal homepage: www.elsevier.com/locate/apacoust

Noise robust in-domain children speech enhancement for automatic Punjabi recognition system under mismatched conditions

Puneet Bawa a, Virender Kadyan b,*

a Centre of Excellence for Speech and Multimodal Laboratory, Chitkara University Institute of Engineering & Technology, Chitkara University, Punjab, India
b Department of Informatics, School of Computer Science, University of Petroleum & Energy Studies (UPES), Energy Acres, Bidholi, Dehradun-248007, Uttarakhand, India

ARTICLE INFO

Article history:
Received 6 August 2020
Received in revised form 29 September 2020
Accepted 24 November 2020
Available online 24 December 2020

Keywords:
Children speech recognition
Data augmentation
Gammatone frequency cepstral coefficient (GFCC)
Punjabi speech recognition
Vocal tract length normalization

ABSTRACT

The success of any commercial Automatic Speech Recognition (ASR) system depends upon the availability of training data, and performance degrades when a low-resource language corpus lacks sufficient signal-processing characteristics. The development of a Punjabi children's speech system is one such challenge, where near-zero resource conditions meet the variability of children's speech, which differs from adult speech in speaking rate and vocal tract length. In this paper, efforts have been made to build a Punjabi children's ASR system under mismatched conditions using noise-robust approaches such as Mel Frequency Cepstral Coefficients (MFCC) and Gammatone Frequency Cepstral Coefficients (GFCC). Acoustic and phonetic variation between adult and children's speech is handled using gender-based in-domain training data augmentation, and acoustic variability between training and test speakers is then normalised using Vocal Tract Length Normalization (VTLN). We demonstrate that including pitch features with a normalized children's test set significantly enhances system performance in both clean and noisy environments. The experimental results show a relative improvement of 30.94% when adult female speech pooled with limited children's speech is used instead of an adult male corpus under noise-based training data augmentation.

© 2020 Elsevier Ltd. All rights reserved.

1. Introduction

Speech recognition systems play a major role in the success of smart devices such as Amazon Alexa and Siri. The performance of these systems is strong for general-purpose English or Arabic speech, but intrinsic spectral differences between adult and children's voices cause such engines to degrade rapidly. The successful development of a robust system requires richer spectral coverage in the training dataset. In real-life applications, the quality of a conversation is influenced by varying degrees of undesired frequencies in the input signal, and tuning under such conditions is considered the biggest challenge when deploying any commercial ASR system [1]. Demand for speech technologies among children (for example, in the creation of online educational resources) has been rising consistently, whereas adult data is mostly collected for general-purpose speech applications [2]. The problem could be solved by collecting children's data and rebuilding the whole system from mismatched or multilingual datasets, but this is not recommended, because the manual collection of children-only speech data is time consuming as well as resource intensive. Most earlier systems have instead been developed by tuning an existing adult dataset, which eventually improves the performance of small or medium children's voice-dependent systems. Phonetic and acoustic differences between adults and children arise from children's shorter vocal tract length (VTL) and slower speaking rate; these are the two major reasons why a children's ASR system degrades compared with an adult system [3]. ASR performance is ideal under matched training and testing conditions, whereas a significant drop in proficiency is noticeable under mismatched noisy conditions [4]. It is therefore necessary to adopt a noise-robust feature extraction technique when building practical applications. Mel Frequency Cepstral Coefficients (MFCC) have been among the most broadly used features, but researchers observe drastic degradation when the noise type or noise level in the test speech changes [5].

* Corresponding author.
E-mail addresses: puneet.bawa@chitkara.edu.in (P. Bawa), vkadyan@ddn.upes.ac.in (V. Kadyan).

https://doi.org/10.1016/j.apacoust.2020.107810
0003-682X/© 2020 Elsevier Ltd. All rights reserved.

Therefore, various noise-robust techniques have been explored in earlier work, including Perceptual Linear Prediction (PLP) [6], Relative Spectral PLP (RASTA-PLP), and GFCC [7-8]. Moreover, a sufficient quantity of data plays an inevitable role in enhancing ASR performance, but such data is not available for most under-resourced languages of the world. The major challenge faced by a Punjabi ASR system is therefore the unavailability of the corpora required for effective training [9]. Techniques such as maximum a posteriori (MAP) adaptation and replication of the original data through Data Augmentation (DA) have been explored with the objective of artificially boosting the amount of training data. Most DA techniques perform best under a Multi-Style Training (MST) strategy [10], in which synthesized data is mixed into the training set while testing is carried out on the original dataset.

In this paper, the robust feature extraction techniques MFCC and GFCC are explored to build an efficient mismatched Punjabi children's ASR system. A DNN-HMM acoustic model is then evaluated to fully exploit Neural Network (NN) efficiency on an artificially pooled training corpus, made possible through an in-domain training augmentation strategy that contributes to robust children's ASR performance. The linguistic and acoustic differences between adult and child speakers are reflected in the degraded performance of such systems. An effort has therefore been made to normalize the test dataset against a training model artificially augmented with different combinations of Babble, Factory, and Volvo noise at different SNR values. An acoustic feature that exhibits the tonal characteristics of the training and test speakers is also evaluated to further enhance system performance. Finally, acoustic differences between the adult and children sets are reduced by warping the spectrum along the frequency axis using Vocal Tract Length Normalization (VTLN) on the test dataset only, which yields a substantial reduction in system error rate.

The remaining paper is organised as follows: Section 2 presents the motivation and Section 3 describes related research. Section 4 presents the current state-of-the-art techniques used for building and enhancing a children's ASR system. Section 5 describes the proposed robust in-domain training based performance enhancement of the Punjabi children's ASR system. Section 6 describes the corpus setup strategy employed in the experimental setup, along with result evaluation. Finally, conclusions and future work are presented in Section 7.

2. Motivation

Developing an ASR system for native languages is an active field of research, as people generally feel more comfortable interacting with machines in their native language [11]. The widespread applications of a native-language ASR, such as assistive technology for physically handicapped persons or handling customer queries through a voice-based interface (e.g. at railway stations), have encouraged researchers to work in this direction [12]. Due to the scarcity of electronic resources in native languages, these are regarded as under-resourced (UR) languages. Developing a UR-language ASR is a challenging task in which the foremost obstacle is the near-total absence of adequate resources and corpora [13]. Another important issue is the gap between technology experts and native speakers: it is very difficult to find a person with the technical skills to develop an ASR in their native language [14]. In the present work, we have chosen the Punjabi language, which is spoken mainly in the northern states of India, specifically Punjab. Although the total population of Punjabi speakers is around 105 million worldwide, there is an inadequacy of resources, which makes developing Punjabi ASR (P-ASR) a challenging task. To date, no significant work has been reported on P-ASR, and no significant amount of speech data is freely available. Limited research on adult speakers has been reported in the literature [15-17], with researchers creating their own corpora to carry out individual studies. Data scarcity and limited resources are thus the major reasons for the non-availability of P-ASR. The most challenging problem in P-ASR is the recognition of children's speech: for Punjabi, work on children's speech is almost non-existent because no child speech data is available. To overcome these issues, three efforts have been made:

- Inspecting similarities between existing adult speech corpora through gender-based selection, identifying the data most relevant to children's speech.
- Handling limited-data scenarios through an in-domain data augmentation strategy that combines the original signal with noise-induced synthetic data.
- Capturing the tonal characteristics of the Punjabi language by extracting pitch features, followed by test-set normalization using the VTLN approach.

3. Related work

Nowadays, people feel more comfortable communicating with commercial devices, making speech recognition an active area of research. Hu R. et al. [18] presented the use of speech technologies in real-life applications, including speech-based dictation tasks that help in mail and report generation. In addition, those researchers pointed to next-generation speech technologies, which raise the need to develop noise-robust ASR systems with adequate performance under adverse environmental conditions. Likewise, the daily use of speech-based technologies and communication systems in education has turned out to be particularly important for children with physical disabilities. Lopez G. et al. [19] carried out functional and usability tests comparing the commercially available voice personal assistants Amazon Alexa, Microsoft Cortana, Apple Siri, and Google Assistant. Later, Singh A. et al. [20] presented a comprehensive survey of the state-of-the-art development of robust ASR systems for Indian languages, exploring the impact of numerous feature extraction techniques across speakers of various languages with different characteristics under varied environmental conditions. How to improve efficiency in the face of such speaker diversity remains a vital challenge. Likewise, Zhen B. [21] examined the importance of individual components of the widely used MFCC features in both speech and speaker recognition through Dynamic Time Warping (DTW) under noisy environments, concluding that system performance increases when the lower MFCC coefficients are discarded and only the middle and higher coefficients, whose usefulness depends on noise sensitivity, are retained. Another noise-robust feature extraction technique, PLP, was later proposed by Hermansky H. [22] with the objective of reducing inter-speaker differences while preserving the relevant information in the input speech signal. Short-term noise variations across applications of band-pass filters used to process the energy of each frequency subband were further investigated by Hermansky H. [23]. Linear Discriminant Analysis (LDA) and a log-linear modelling approach, in combination with the MFCC and PLP front ends, were analysed by Zolnay A. et al. [24].
Moreover, the transformation technique for Cochlear Filter Cepstral Coefficient (CFCC) feature extraction was explored by Li Q. et al. [25] and proved to perform better than MFCC features under acoustic mismatch conditions. Li Z. et al. [26] then investigated the newer GFCC features and found them more effective than the MFCC, RASTA-PLP, and CFCC feature extraction techniques in the presence of external noise. On the other hand, extreme acoustic mismatch arises when children's speech is recognised with acoustic models trained on adult speech. Burnett D. [27] evaluated speaker adaptation on an adult-speech-trained recognizer dependent on the optimal selection of a Bark factor, noting a reduction of more than 50% in error rate from simple adaptation of adult speech to children's speech. Kinoshita et al. [28] analysed various configurations of frequency- and time-domain denoising Neural Networks (NNs), reporting more than 30% relative improvement using DenoisingTasNet on real-life recordings with a strong ASR back-end. However, differences in the intrinsic properties of the speaker can shift the spectral formant peaks, which are inversely proportional to vocal tract length. Giuliani D. [29] therefore investigated the use of VTLN under both matched and mismatched conditions, observing a 20.1% relative improvement over the baseline system under mismatched conditions. In this study, the noise robustness of MFCC and GFCC is exploited, intrinsic variations in spectral formant peaks tied to the vocal tract are overcome through VTLN, and pitch-based feature vectors are analysed for the Punjabi language.

4. Theoretical background

4.1. Feature extraction

A raw speech signal is too complex a representation to act as an effective input to an ASR system. The MFCC front end has been one of the most widely utilized methodologies for deriving a dense representation of raw audio. The radiation effect of the lips gives the high-frequency content of an input signal a lower amplitude [30]; passing the signal through a pre-emphasis filter therefore helps to obtain comparable amplitude across all frequency segments, as in equation (1):

y(t) = s(t) - a * s(t - 1)    (1)

The lip spectral contribution is thereby removed, and numerical problems in computing the Fourier transform are avoided, using the transfer function H(z, t) of equation (2):

H(z, t) = 1 - b * z^-1    (2)

where b controls the slope of the filter, with values ranging between 0.4 and 1.0. As the frequency content of speech changes over time, the frequency contours of the whole signal would otherwise be lost; under the assumption that the spectrum is stationary over a very short period, the signal is split into short-time frames. A window function is then applied to each frame of the sliced signal:

w[k, t] = 0.54 - 0.46 * cos(2*pi*k / (N - 1))    (3)

A Short-Time Fourier Transform (STFT) is evaluated frame by frame to obtain the corresponding frequency spectrum, and the power spectrum is calculated as:

P(i, t) = |FFT(y(i, t))|^2 / N    (4)

Finally, frequency bands are extracted from the power spectrum by applying triangular filters on the Mel scale [31]. The conversion between the Mel scale and frequency in Hertz is given by:

Mel(f(y, t)) = 2595 * log10(1 + f(y, t) / 700)    (5)

This yields highly correlated coefficients, so a DCT is applied to produce a compressed representation; coefficients 2-13 are retained, while the remaining coefficients, which represent fast changes, are discarded. On the other hand, GFCC is a noise-robust FFT-based front-end feature extraction technique that helps extract contextual information from highly noisy speech signals with low SNRs [8]. The extraction procedure employs a 64-channel Gammatone filterbank at the pre-emphasis stage. As with MFCC, the Gammatone filter helps to obtain comparable amplitude for the higher-frequency segments of an input speech signal; the magnitude of the time-frequency representation is obtained by decimating the rectified filter response to 100 Hz, again using time windowing.

g(f, t) = a * e^(-2*pi*t*b_cm) * cos(2*pi*t*f_cm + phi)    (6)

where f_cm corresponds to the central frequency of the m-th Gammatone filter between the lower (f_LO) and higher (f_HI) frequencies:

f_cm = -1000/4.37 + (f_HI + 1000/4.37) * exp(-(m/M) * (ln(f_HI + 1000/4.37) - ln(f_LO + 1000/4.37)))    (7)

GFCC uses a rectangular filterbank on the ERB scale instead of the 26 triangular Mel-band channels of MFCC. The expanded number of channels yields more noise robustness; it also provides finer resolution at lower frequencies, matching human perceived loudness at a given signal intensity [32]. The other significant distinction of GFCC is the use of a cubic root, unlike the log in MFCC, just before the numerical estimation in the DCT.

b_m = b * ERB(f_cm) = 24.7 * (4.37 * f_cm / 1000 + 1)    (8)

4.2. Vocal tract length normalization

As stated earlier, in a mismatched children's ASR system the upscaling of formant frequencies in children's speech degrades the recognition rate. This formant-frequency mismatch is addressed through VTLN, which estimates a warping factor used to stretch the frequency axis during feature extraction; here the acoustic model is evaluated against warped features of the child test dataset. There are two commonly used warping functions, the piecewise linear warp [33] and the bilinear warp [34], calculated as:

w_a(w) = a * w          if w < w0
w_a(w) = b * w + c      if w >= w0    (9)

w_a(w) = w + 2 * arctan( (1 - a) * sin(w) / (1 - (1 - a) * cos(w)) )    (10)

where a is the warp factor characterising a speaker, w_a(w) is the transformed frequency for an unwarped frequency w, and w0 is a constant that controls the mismatch in bandwidth among different speakers. Equation (10) shows the dependency of the bilinear warp on the warping factor. The major variation between the two is the form of the function: the bilinear warp is nonlinear, in contrast to the piecewise linear function of equation (9).
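As a concrete illustration of Eqs. (1)-(5), the MFCC-style front end of Section 4.1 can be sketched in a few lines of NumPy. This is a minimal sketch rather than the paper's implementation: the function name and the default pre-emphasis coefficient of 0.97 are assumptions, while the 25 ms frame size and 10 ms shift are the values given in Section 5.

```python
import numpy as np

def mfcc_front_end(s, sr=16000, alpha=0.97, frame_ms=25, shift_ms=10):
    """Sketch of Eqs. (1)-(5): pre-emphasis, framing, Hamming window,
    power spectrum, and Mel conversion of the FFT bin frequencies."""
    # Eq. (1): pre-emphasis  y(t) = s(t) - alpha * s(t - 1)
    y = np.append(s[0], s[1:] - alpha * s[:-1])

    # Split into short-time frames (stationarity assumption)
    flen, fshift = int(sr * frame_ms / 1000), int(sr * shift_ms / 1000)
    n_frames = 1 + max(0, (len(y) - flen) // fshift)
    frames = np.stack([y[i * fshift: i * fshift + flen]
                       for i in range(n_frames)])

    # Eq. (3): Hamming window  w[k] = 0.54 - 0.46 * cos(2*pi*k / (N-1))
    k = np.arange(flen)
    frames = frames * (0.54 - 0.46 * np.cos(2 * np.pi * k / (flen - 1)))

    # Eq. (4): power spectrum  P = |FFT|^2 / N
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2 / flen

    # Eq. (5): Mel scale of each FFT bin centre frequency (Hz)
    freqs = np.fft.rfftfreq(flen, d=1.0 / sr)
    mels = 2595.0 * np.log10(1.0 + freqs / 700.0)
    return power, mels
```

The Mel-filterbank integration, log (or cubic root for GFCC) compression, and DCT of Section 4.1 would follow on the returned power spectrum.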
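The two warping functions of Eqs. (9) and (10) can likewise be sketched directly. This is a hedged illustration: the paper does not specify the piecewise constants b and c, so they are chosen here, as is conventional, to keep the warp continuous at w0 and to map pi to pi; the default cut-off w0 = 0.85*pi is also an assumed, commonly used value.

```python
import numpy as np

def piecewise_linear_warp(omega, alpha, omega0=0.85 * np.pi):
    """Eq. (9): scale by the warp factor alpha below a cut-off omega0;
    above it, a linear segment b*omega + c keeps the warped axis
    continuous and maps pi to pi (assumed convention)."""
    b = (np.pi - alpha * omega0) / (np.pi - omega0)
    c = (alpha - b) * omega0
    return np.where(omega < omega0, alpha * omega, b * omega + c)

def bilinear_warp(omega, alpha):
    """Eq. (10): omega + 2*atan((1-alpha)*sin(omega) /
    (1 - (1-alpha)*cos(omega)))."""
    return omega + 2.0 * np.arctan(
        (1.0 - alpha) * np.sin(omega)
        / (1.0 - (1.0 - alpha) * np.cos(omega)))
```

Both functions fix the endpoints 0 and pi while compressing or stretching the interior of the frequency axis according to alpha, which is exactly the behaviour VTLN needs when matching child test speakers to adult-dominated training data.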

Fig. 1. Block diagram of the gender-based training data mixture on a robust front-end approach.

5. Proposed system overview

The front-end feature extraction techniques MFCC and GFCC have been explored for enhancing the performance of children's ASR systems under noisy and mismatched conditions. Fig. 1 shows the initial system, consisting of a mixture of permutation-based adult and children's speech corpora. First, the initial system is built on a real mixed dataset for training and the original child dataset for testing, and is evaluated with a DNN-HMM classifier in clean and noisy conditions. Second, a gender-based classification of the adult speech data into male and female speakers is employed; each portion is individually pooled with the original children's dataset.

The gender-based augmented systems are likewise processed with either the MFCC or the GFCC approach on each individual mixed training dataset. The objective of such a system is to identify the amount of noise required to develop an augmented training set by varying the noise in the test set. GFCC methods are found to be beneficial in reducing the impact of noise. As in the MFCC method, the utterances are divided using a frame size of 25 ms and a frame shift of 10 ms. After multiplying each frame by a Hamming window, the Fast Fourier Transform (FFT) is applied. Finally, 39 features are obtained for either MFCC or GFCC after applying Cepstral Mean and Variance Normalization (CMVN) and appending delta and delta-delta feature vectors. The output 117 feature dimensions are further transformed into 40 dimensions using the diagonalising Maximum Likelihood Linear Transform (MLLT). The final output hypothesis is selected from the baseline system after decoding with the DNN modelling classifier.

Fig. 2 shows the noise-augmented baseline system built from a mixture of the female-adult and child training datasets. Three different kinds of noise - babble and factory, which are dynamic noises, and white, which is stationary - have been injected at SNRs ranging from 20 dB down to 0 dB in steps of 5 dB. The noise-robust GFCC feature extraction technique is then employed, again with a frame size of 25 ms and a frame shift of 10 ms. Pitch features are concatenated with the MFCC and GFCC feature vectors; this is made possible by constraining the pitch trajectory to remain continuous throughout the performance enhancement of a mismatched system.

Fig. 2. Block diagram of the test-normalized in-domain augmented Punjabi children's ASR system under mismatched conditions.
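The 39-dimensional vectors described in Section 5 (13 static coefficients plus delta and delta-delta, after CMVN) can be sketched as follows. The regression window of two frames and the per-utterance CMVN are standard choices assumed here; the function names are our own.

```python
import numpy as np

def delta(feats, window=2):
    """Regression-based delta over a (T x D) feature matrix,
    with edge padding at the utterance boundaries."""
    denom = 2 * sum(i * i for i in range(1, window + 1))
    padded = np.pad(feats, ((window, window), (0, 0)), mode="edge")
    T = len(feats)
    return sum(i * (padded[window + i: window + i + T]
                    - padded[window - i: window - i + T])
               for i in range(1, window + 1)) / denom

def add_deltas(static):
    """13 static coefficients -> 39-dim vectors
    (CMVN-normalised static, delta, delta-delta)."""
    cmvn = (static - static.mean(axis=0)) / (static.std(axis=0) + 1e-8)
    d1 = delta(cmvn)
    return np.hstack([cmvn, d1, delta(d1)])
```

Stacking these 39-dimensional vectors over spliced context and projecting to 40 dimensions is then the job of the LDA-MLLT stage described in the text.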

The training noise-augmented systems are further trained using monophones (mono) and triphones (tri1, tri2), and the corresponding transformation of the feature dimension from 126 to 40 is achieved using LDA-MLLT (tri3) with feature-space Maximum Likelihood Linear Regression (fMLLR) alignment. Moreover, the speaker normalization technique of VTLN with a piecewise linear function is employed on the test (child) dataset with the objective of reducing inter-speaker variation under mismatched conditions. Finally, the noise-augmented system is trained using a DNN-HMM classifier, following the training of monophones and triphones. The step-by-step procedure for enhancing the performance of low-resource children's speech recognition using test normalization is as follows:

Step 1: Collect the original adult (male/female) [adult_female_data] and children's speech [child_data] corpora.
Step 2: Initialise:
    training_data = adult_female_data + child_data
    testing_data = child_data
    test_vtln_flag = 0
    noise_factor = [20, 15, 10, 5, 0]
Step 3: Augment the dataset with noise (noise_data) from the NOISEX-92 database, injected at an SNR drawn at random from noise_factor:
    train_aug_data = training_data + random.choice(noise_factor) * noise_data
    test_aug_data = testing_data + random.choice(noise_factor) * noise_data
Step 4: Extract MFCC and GFCC features from the training set:
    mfcc(train_aug_data)    // using equation (5)
    gfcc(train_aug_data)    // using equation (8)
Step 5: Extract MFCC and GFCC features from the testing set:
    mfcc(child_data)        // using equation (5)
    mfcc(test_aug_data)     // using equation (5)
    gfcc(child_data)        // using equation (8)
    gfcc(test_aug_data)     // using equation (8)
    Go to Step 6.
Step 6: Perform monophone training (mono) and align the monophones.
Step 7: Perform delta triphone training (tri1) and align the triphones.
Step 8: Perform delta + delta-delta triphone training (tri2) and align the triphones.
Step 9: If test_vtln_flag == 0, go to Step 11; else go to Step 10.
Step 10: Perform VTLN normalisation on the test set using the following procedure:
    a. def test_vtln(data_type):
        Step 10.1: Select the optimum value of the warp factor from the range 0.8 to 1.2.
        Step 10.2: Obtain the best state segmentation over data_type through Viterbi alignment, based on the given test-set transcription.
        Step 10.3: Search for the most efficient warping factor with a suboptimal method based on the best state segmentation, reducing computational cost without loss of precision.
        Step 10.4: Compute the best warping factor and update the corresponding model parameters using Viterbi alignment against each input speech transcription.
    b. Depending on the type of augmented dataset, normalise the data as:
        test_vtln(test_aug_data)    // using equation (10)
        test_vtln(child_data)       // using equation (10)
    c. Go to Step 5.
Step 11: Train LDA + MLLT triphones (tri3) on the output generated from Step 8 and align the corresponding triphones using fMLLR.
Step 12: Finally, analyse performance by training DNN-HMM acoustic models on the obtained features and corresponding datasets.

6. Experimental setup

The presented methodology has been implemented with the objective of evaluating the effectiveness of augmented children's ASR systems under mismatched noisy conditions. Performance was evaluated on 21 adult speakers and 32 child speakers, with a sampling frequency of 16 kHz. Three different training sets, comprising varying numbers of speakers, were used and tested with 4 male and 5 female speakers, as listed below:

- 21 adult speakers and 22 child speakers (Adult-Child)
- 9 male-adult speakers and 22 child speakers (Male_Adult-Child)
- 12 female-adult speakers and 22 child speakers (Female_Adult-Child)

Three different noises - factory, babble, and white - from the standard noise database NOISEX-92 [35] were used for augmentation. The sox command was used together with a random function to inject noise at varying SNR levels without disturbing the useful information in the input speech signal. The experiments were performed on an Ubuntu operating system using the Kaldi toolkit [36]. Finally, system performance was evaluated using two key parameters, Word Error Rate (WER), equation (11), and Relative Improvement (RI), equation (12):

WER(%) = (S + I + D) / N * 100    (11)

RI(%) = ((O - N') / O) * 100    (12)

where S, I, and D are the numbers of substituted, inserted, and deleted words, N is the number of words in the reference, O is the baseline WER, and N' is the new WER.

6.1. Performance analysis

6.1.1. Performance analysis of different feature extraction techniques on varying environment and training conditions
Children's speech generally differs in quality and quantity from adult speech when building an ASR system. To build a baseline, clean child test data was first evaluated against the child training dataset, generating a WER of 15.43% with the MFCC approach; this baseline outperformed the corresponding GFCC system. Then, to address the scarcity of training data, adult data alone was used for training while the test data was kept the same; MFCC yielded a WER of around 41.67% owing to the mismatched training and test data.

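Step 3 of the procedure above mixes NOISEX-92 noise into the speech at an SNR drawn from {20, 15, 10, 5, 0} dB (the paper does this with sox and a random function). A minimal NumPy equivalent, with hypothetical function names, is:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech at a target SNR in dB: the noise is
    tiled/truncated to the speech length and scaled so that
    10*log10(P_speech / P_noise) equals snr_db."""
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

# SNR grid used in the paper (20 dB down to 0 dB in 5 dB steps)
snr_grid = [20, 15, 10, 5, 0]
```

Drawing snr_db at random from snr_grid for each utterance reproduces the multi-style flavour of the augmentation: every noisy copy of the training set carries a different noise level.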
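The two evaluation metrics of Eqs. (11) and (12) can be computed as below. The edit-distance dynamic programme is the standard way of obtaining S + I + D between reference and hypothesis word sequences; the function names are our own.

```python
def wer(ref, hyp):
    """Word Error Rate, Eq. (11): (S + I + D) / N * 100, via the
    usual Levenshtein distance over word sequences."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(h) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(r)][len(h)] / len(r)

def relative_improvement(wer_old, wer_new):
    """Relative improvement, Eq. (12): (O - N') / O * 100."""
    return 100.0 * (wer_old - wer_new) / wer_old
```

For example, a drop from a 20% to a 15% WER is a 25% relative improvement, which is how the gains reported in Sections 6.1.1-6.1.3 should be read.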

Consequently, a mixture of children's and adult training data was framed, and the resulting system achieved an R.I. of 7.5% over the baseline system. The results are depicted in Table 1, where noisy data was finally also employed in testing on the mixed training dataset, with GFCC performing best at a WER of 16.75%.

6.1.2. Performance analysis of gender based data augmentation under noisy conditions
In this subsection, gender-based combinations have been investigated with and without the original children's training dataset. Table 2 shows that the female training dataset has more similarity with the children's data, enhancing system performance over the male training data with an R.I. of 8.35%. System performance has been tested under both clean and noisy environment conditions. The combined children's and female training dataset later boosted performance with an R.I. of 3.92% on MFCC in clean conditions and 5.79% on GFCC in noisy conditions. In all cases, testing was performed on the children's test dataset.

Table 2
WER (%) under the DNN model for gender-based data combinations in different training conditions.

Training set          Noise in child test set    MFCC     GFCC
Male_Adult-Child      No                         14.96    17.48
Female_Adult_Child    No                         13.71    16.32
Male_Adult-Child      Yes                        16.06    15.80
Female_Adult_Child    Yes                        15.84    15.62

6.1.3. Performance evaluation with training data augmentation through test normalization
There exists variability between the vocal tracts of children and adults. To analyse the effect of the vocal tract, system testing was performed by normalizing only the test dataset, which improved system performance by 8.22% compared with the baseline children's ASR system. To obtain this, the optimal value of the warp factor was first selected. To further exploit the efficiency of the DNN approach, the training dataset was built artificially on a mixture of adult_female + children original speech, extended by injecting three different types of noise to generate in-domain augmented datasets. The adopted in-domain data augmentation strategy combines clean data with artificial noise (factory, babble, or white) in both the female-adult and children's training datasets. This pooled dataset attempts to overcome the challenge of data scarcity while also addressing mismatch variability through extraction of vocal tract length information. Testing was performed on a child dataset with random selection of clean and noisy data. To capture the tonal characteristics of the Punjabi language, pitch features were also computed alongside the MFCC or GFCC front ends. Three methods were adopted: the conventional MFCC or GFCC approach; pitch-induced front-end feature vectors (MFCC + pitch or GFCC + pitch); and the test-normalized hybrid MFCC + pitch + VTLN or GFCC + pitch + VTLN approaches, all employed on the pooled dataset. The system obtained a performance improvement of 30.15% with the GFCC + pitch + VTLN hybrid approach in comparison to the other evaluated approaches. Test normalization was selected because it has more impact with the child test dataset than normalization of the training data. Table 3 showed that system perfor-

6.2. Proposed system analysis with earlier proposed technique

Unfortunately, the success of any robust ASR system depends on the availability of necessary resources and rich linguistic datasets. In real-life applications, advanced ASR technology currently serves mainly resource-rich languages such as English, Thai, Chinese, and Arabic. This motivates the generalization of techniques, through multilingual or augmentation strategies, to under-resourced languages such as Punjabi or Dogri. The work presented earlier has mostly focused on languages with well-developed text and speech resources, whose speakers have smart devices operable in their native language. Constructing a system for an endangered or under-resourced language is always a challenging task: it basically requires large training data, and the unavailability of a benchmark dataset is one of the major issues, demanding large manual effort together with knowledge of the language, as for Punjabi or Dogri. In the past, a small amount of research work has been presented on the Punjabi language [37-39], spoken by a native community of 105 million. That earlier work focused mostly on small-vocabulary isolated, connected, or continuous systems operating in clean or noisy environment conditions [12,40], and such systems were built only on adult speech datasets, with no children's ASR system at all. This is mainly due to differences in the intrinsic properties of the speakers, which lead to variation in the spectral formant peaks, inversely proportional to vocal tract length, between adult and child utterances. The work implemented in this study addresses the noise-robustness aspect by inspecting the impact of the MFCC and GFCC approaches; intrinsic variations in the spectral formant peaks between female adults and children, tied to the vocal tract, are found to be handled beneficially through test-set VTLN, and the tonal characteristics are exhibited using pitch-induced feature vectors. These properties are applied to a very small children's speech corpus enhanced through an augmentation strategy, with artificial data extension performed to fully exploit the efficiency of the DNN model through in-domain data augmentation. Table 4 depicts the different state of the art existing work on different well developed or less
mance increased slightly with both the approaches using normal- resource languages along with small work presented in Punjabi
ization and pitch information extraction. adult dataset. It can be seen from the results that the proposed
system contributed towards more training time which further
results into complexity of the system. Therefore at the end sys-
Table 1
WER(%) for different train and environment conditions using heterogeneous front end
tem testing has been performed with no modification in its orig-
approaches. inal signals which takes less time. Therefore, the system
improvement is enhanced with pooling of train data only using
Training set Test set DNN(WER %)
in domain strategy which helped in usage of maximum DNN
Noise MFCC GFCC efficiency. While considering the tonal characteristics of Punjabi
Child No 15.43 15.61 language, pitch induced front end features are tried to enhance
Adult No 41.67 45.26 its performance. The test normalization on children’s dataset
Adult-Child No 14.27 16.75
has also contributed substantial improvement on different test-
Adult-Child Yes 17.28 16.58
ing environment conditions.
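The in-domain augmentation described above mixes each clean training utterance with factory, babble or white noise. A minimal sketch of such SNR-controlled mixing, assuming utterances are plain lists of float samples (the function names and the pooling policy are illustrative, not taken from the actual experimental recipe):

```python
import math
import random

def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean` at a target signal-to-noise ratio.

    Both inputs are plain lists of float samples; the noise segment
    is tiled and truncated to the length of the clean utterance."""
    reps = len(clean) // len(noise) + 1
    noise = (noise * reps)[:len(clean)]
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Gain g chosen so that 10*log10(p_clean / (g^2 * p_noise)) == snr_db.
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return [c + gain * n for c, n in zip(clean, noise)]

def build_augmented_pool(clean_utts, noise_bank, snr_db=10.0, seed=0):
    """Pool each clean utterance with one noisy copy (in-domain style)."""
    rng = random.Random(seed)
    pool = list(clean_utts)
    for utt in clean_utts:
        pool.append(mix_at_snr(utt, rng.choice(noise_bank), snr_db))
    return pool
```

Running this over the female-adult and child training sets with factory, babble and white noise segments would yield a pooled clean-plus-noisy corpus of the kind described above; the SNR value shown is only a placeholder.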
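The pitch feature appended to the MFCC/GFCC vectors is produced in practice by a dedicated pitch tracker (the experiments use the Kaldi toolkit). Purely as an illustration of the underlying idea, a bare-bones autocorrelation F0 estimator might look like the following; a real tracker adds normalization, a voicing probability and trajectory smoothing, none of which are shown here:

```python
def estimate_f0(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Simplified autocorrelation pitch estimate for one frame.

    Searches lags corresponding to [f_min, f_max] Hz and returns the
    frequency of the lag with the largest autocorrelation."""
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    # Remove DC so a constant offset does not dominate the correlation.
    mean = sum(frame) / len(frame)
    x = [s - mean for s in frame]
    best_lag, best_r = 0, 0.0
    for lag in range(lag_min, min(lag_max, len(x) - 1) + 1):
        r = sum(x[i] * x[i + lag] for i in range(len(x) - lag))
        if r > best_r:
            best_lag, best_r = lag, r
    return sample_rate / best_lag if best_lag else 0.0
```

For tonal languages such as Punjabi, appending this kind of F0 trajectory to the spectral features is what lets the acoustic model separate words that differ only in tone.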
Table 3
WER (%) on in-domain data augmentation with different hybrid feature combinations.

Training set: Female_Adult + Child + Random Noise            DNN (WER %)

Testing set    MFCC    MFCC+Pitch   MFCC+Pitch+VTLN   GFCC    GFCC+Pitch   GFCC+Pitch+VTLN
Clean Child    18.19   17.87        17.43             12.36   12.01        11.35
Noisy Child    17.93   17.36        17.02             12.09   11.77        10.91
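Test-side VTLN, which produces the best columns of Table 3, rescales the frequency axis of the filterbank by a per-speaker warp factor. A sketch of the common piecewise-linear warp, together with the grid of candidate factors typically searched (0.88–1.12 in steps of 0.02; the exact boundaries used in the experiments are not specified here):

```python
def vtln_warp(f, alpha, f_nyq=8000.0, knee=0.85):
    """Piecewise-linear VTLN warp of frequency `f` (Hz).

    Below the knee the axis is scaled by the warp factor `alpha`;
    above it a straight segment maps the Nyquist frequency onto
    itself so the warped axis stays inside [0, f_nyq]."""
    f0 = knee * f_nyq
    if f <= f0:
        return alpha * f
    slope = (f_nyq - alpha * f0) / (f_nyq - f0)
    return alpha * f0 + slope * (f - f0)

def candidate_alphas(lo=0.88, hi=1.12, step=0.02):
    """Typical grid of warp factors searched per test speaker."""
    n = round((hi - lo) / step)
    return [round(lo + i * step, 2) for i in range(n + 1)]
```

In a full system each candidate factor rewarps the filterbank and the one maximising the acoustic-model likelihood of the test utterance is kept; applying the selected factor only at test time leaves the training data untouched, matching the test-only normalization adopted above.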
Table 4
Comparative analysis of the proposed work with earlier presented systems.

Kadyan et al. [41] — Augmentation: original dataset. Front end: MFCC, GFCC. Dataset: adult connected words and continuous sentences. Acoustic modeling: GMM-HMM and DNN-HMM. Results: relative improvement of 4–5% on connected words and 1–3% on continuous sentences. Observations: speaker adaptation was employed on two types of acoustic models, GMM-HMM and DNN-HMM.

Kadyan et al. [12] — Augmentation: noise-based, using various noises from the NOISEX-92 database. Front end: MFCC, PLP, MFCC-PLP, RASTA-PLP. Dataset: 45,000 adult utterances from 15 male and 10 female speakers. Acoustic modeling: refined HMM parameter optimisation using GA + HMM and DE + HMM. Results: HMM (59.88%), GA + HMM (69.05%), DE + HMM (73.56%). Observations: different feature extraction approaches were employed on a large Punjabi adult dataset; comparatively low performance was reported under both clean and noisy environmental conditions.

Shahnawazuddin et al. [42] — Augmentation: noise-based, using white and babble noise at 5 dB and 10 dB. Front end: MFCC, SMAC, R-SMAC. Dataset: British English speech corpora including the WSJCAM0 adult and PF-STAR children datasets (92 adult and 122 child speakers). Acoustic modeling: DNN-HMM. Results: MFCC (12.93%), SMAC (12.54%). Observations: among MFCC, SMAC and R-SMAC, the SMAC features proved most robust to additive noise.

Shahnawazuddin et al. [43] — Augmentation: in-domain through speed and pitch perturbation; out-of-domain including CycleGAN-based voice conversion from adult data. Front end: MFCC. Dataset: WSJCAM0 adult data comprising 5608 words and the PF-STAR children dataset containing 5607 words. Acoustic modeling: TDNN. Results: Equal Error Rate (EER) of 2.269 on the children dataset using the mix of speed and pitch perturbation. Observations: in-domain and out-of-domain training augmentation applied to large-vocabulary English adult and children datasets under mismatched conditions.

Guglani and Mishra [44] — Augmentation: original dataset. Front end: MFCC and PLP with Pitch + SAcC POV features. Dataset: 1500 long sentences. Acoustic modeling: DNN-HMM. Results: 69.4% accuracy using a mix of pitch and POV (SAcC) features on the Kaldi toolkit. Observations: comparatively low performance on a small adult dataset.

Proposed approach — Augmentation: adult female data mixed with children data. Front end: GFCC + Pitch + VTLN. Dataset: 9 h of adult and 5 h of child data in Punjabi. Acoustic modeling: DNN-HMM. Results: 30.94% relative improvement. Observations: gender-based data selection for a medium-size, sentence-level Punjabi children ASR system under mismatched and varying environment conditions.

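The GFCC front end compared in Tables 3 and 4 replaces the mel filterbank with a gammatone filterbank whose center frequencies are spaced on the ERB (equivalent rectangular bandwidth) scale. A sketch of that frequency layout, with an illustrative filter count and band edges:

```python
import math

def hz_to_erbs(f):
    """Glasberg-Moore ERB-number scale (in ERBs) for frequency f in Hz."""
    return 21.4 * math.log10(4.37e-3 * f + 1.0)

def erbs_to_hz(e):
    """Inverse of hz_to_erbs."""
    return (10 ** (e / 21.4) - 1.0) / 4.37e-3

def gammatone_center_freqs(n_filters=64, f_min=50.0, f_max=8000.0):
    """Center frequencies equally spaced on the ERB scale, the usual
    layout of the gammatone filterbank behind GFCC features."""
    e_lo, e_hi = hz_to_erbs(f_min), hz_to_erbs(f_max)
    step = (e_hi - e_lo) / (n_filters - 1)
    return [erbs_to_hz(e_lo + i * step) for i in range(n_filters)]
```

GFCCs are then obtained by filtering each frame through this bank, compressing the filter energies with a cubic root (rather than the logarithm used for MFCC), and taking a DCT; this compression is one source of the noise robustness reported for the GFCC columns above.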
7. Conclusion

In this work, a Punjabi children ASR system has been developed under limited-data conditions. The tonal characteristics of Punjabi have been induced into the noise-robust GFCC approach, while the vocal tract variability of the test speech has been overcome by introducing test-set normalization. We also demonstrated gender-based training data selection on a small children's speech corpus, as compared with male-based training data. The tuned mixture of female-adult and children speech removed the high acoustic mismatch in the training data. Furthermore, artificial noise has been injected into the training and test samples, which enhanced performance through in-domain data augmentation. The result analysis showed that a system trained on pooled female-adult and children speech outperformed training on adult-male speech, or on a mixture of adult speech with children speech, with a relative improvement of 30.94%. The experiments also showed that test normalization of children speech combined with the noise-robust front end (GFCC) enhances system performance on both the original and the augmented training corpora. Our test-normalization approach, with the system trained on three hybrid acoustic models processed by the DNN-HMM system, outperforms the original children speech corpus in both the test and train configurations. Future work can include SMAC feature extraction on a real speech corpus, together with prosody-modification feature vectors that capture more intrinsic spectral properties. System performance may be further enhanced by building mismatched acoustic models or by combining multi-dialectal characteristics using TDNN and CNN-TDNN acoustic models with multi-objective functions.
CRediT authorship contribution statement

Puneet Bawa: Investigation, Validation, Visualization, Writing - original draft, Formal analysis, Writing - review & editing. Virender Kadyan: Conceptualization, Data curation, Formal analysis, Methodology, Supervision, Resources.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] Whipple G. Low residual noise speech enhancement utilizing time-frequency filtering. 1994; Vol. 1, p. I-5.
[2] Shahnawazuddin S, Sinha R. Low-memory fast on-line adaptation for acoustically mismatched children's speech recognition. In: Sixteenth Annual Conference of the International Speech Communication Association; 2015.
[3] Shahnawazuddin S, Kathania HK, Sinha R. Enhancing the recognition of children's speech on acoustically mismatched ASR system. IEEE; 2015. p. 1–5.
[4] Kozou H, Kujala T, Shtyrov Y, Toppila E, Starck J, Alku P, et al. The effect of different noise types on the speech and non-speech elicited mismatch negativity. Hear Res 2005;199(1–2):31–9.
[5] Martin A, Charlet D, Mauuary L. Robust speech/non-speech detection using LDA applied to MFCC. In: 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1; 2001. p. 237–40.
[6] Psutka J, Müller L, Psutka JV. Comparison of MFCC and PLP parameterizations in the speaker independent continuous speech recognition task. In: Seventh European Conference on Speech Communication and Technology; 2001.
[7] Hermansky H, Morgan N, Bayya A, Kohn P. RASTA-PLP speech analysis. In: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 1; 1991. p. 121–4.
[8] Shi X, Yang H, Zhou P. Robust speaker recognition based on improved GFCC. In: 2016 2nd IEEE International Conference on Computer and Communications (ICCC); 2016. p. 1927–31.
[9] Kaur H, Kadyan V. Feature space discriminatively trained Punjabi children speech recognition system using Kaldi toolkit. Available at SSRN 3565906; 2020.
[10] Lippmann R, Martin E, Paul D. Multi-style training for robust isolated-word speech recognition. In: ICASSP'87, Vol. 12; 1987. p. 705–8.
[11] Huang X, Acero A, Hon HW. Spoken language processing: a guide to theory, algorithm, and system development. Prentice Hall PTR; 2001.
[12] Kadyan V, Mantri A, Aggarwal RK. Refinement of HMM model parameters for Punjabi automatic speech recognition (PASR) system. IETE Journal of Research 2018;64(5):673–88.
[13] Hartmann W, Ng T, Hsiao R, Tsakalidis S, Schwartz RM. Two-stage data augmentation for low-resourced speech recognition. In: Interspeech; 2016. p. 2378–82.
[14] Besacier L, Barnard E, Karpov A, Schultz T. Automatic speech recognition for under-resourced languages: a survey. Speech Commun 2014;56:85–100.
[15] Mittal P, Singh N. Speaker-independent automatic speech recognition system for mobile phone applications in Punjabi. In: International Symposium on Signal Processing and Intelligent Recognition Systems. Cham: Springer; 2017. p. 369–82.
[16] Kadyan V, Mantri A, Aggarwal RK. A heterogeneous speech feature vectors generation approach with hybrid HMM classifiers. Int J Speech Technol 2017;20(4):761–9.
[17] Kaur J, Singh A, Kadyan V. Automatic speech recognition system for tonal languages: state-of-the-art survey. Arch Comput Methods Eng 2020:1–30.
[18] Hu R, Zhu S, Feng J, Sears A. Use of speech technology in real life environment. Berlin, Heidelberg: Springer; 2011. p. 62–71.
[19] López G, Quesada L, Guerrero LA. Alexa vs. Siri vs. Cortana vs. Google Assistant: a comparison of speech-based natural user interfaces. Cham: Springer; 2017. p. 241–50.
[20] Singh A, Kadyan V, Kumar M, Bassan N. ASRoIL: a comprehensive survey for automatic speech recognition of Indian languages. Artif Intell Rev 2019:1–32.
[21] Zhen B, Wu X, Liu Z, Chi H. On the importance of components of the MFCC in speech and speaker recognition. In: Sixth International Conference on Spoken Language Processing; 2000.
[22] Hermansky H. Perceptual linear predictive (PLP) analysis of speech. The Journal of the Acoustical Society of America 1990;87(4):1738–52.
[23] Hermansky H, Morgan N. RASTA processing of speech. IEEE Transactions on Speech and Audio Processing 1994;2(4):578–89.
[24] Zolnay A, Schlüter R, Ney H. Robust speech recognition using a voiced-unvoiced feature. In: Seventh International Conference on Spoken Language Processing; 2002.
[25] Li Q, Huang Y. An auditory-based feature extraction algorithm for robust speaker identification under mismatched conditions. IEEE Trans Audio Speech Lang Process 2010;19(6):1791–801.
[26] Li Z, Gao Y. Acoustic feature extraction method for robust speaker identification. Multimedia Tools and Applications 2016;75(12):7391–406.
[27] Burnett DC, Fanty M. Rapid unsupervised adaptation to children's speech on a connected-digit task. In: ICSLP'96, Vol. 2; 1996. p. 1145–8.
[28] Kinoshita K, Ochiai T, Delcroix M, Nakatani T. Improving noise robust automatic speech recognition with single-channel time-domain enhancement network. IEEE; 2020. p. 7009–13.
[29] Giuliani D, Gerosa M. Investigating recognition of children's speech. In: 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'03), Vol. 2; 2003. p. II-137.
[30] Airaksinen M, Bäckström T, Alku P. Automatic estimation of the lip radiation effect in glottal inverse filtering. In: Fifteenth Annual Conference of the International Speech Communication Association; 2014.
[31] Kadyan V, Mantri A, Aggarwal RK. Improved filter bank on multitaper framework for robust Punjabi-ASR system. Int J Speech Technol 2020;23(1):87–100.
[32] Uppenkamp S, Röhl M. Human auditory neuroimaging of intensity and loudness. Hear Res 2014;307:65–73.
[33] Uebel LF, Woodland PC. An investigation into vocal tract length normalisation. In: Sixth European Conference on Speech Communication and Technology; 1999.
[34] Acero A, Stern RM. Robust speech recognition by normalization of the acoustic space. In: ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing; 1991. p. 893–6.
[35] Varga A, Steeneken HJ. Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun 1993;12(3):247–51.
[36] Povey D, Ghoshal A, Boulianne G, Burget L, Glembek O, Goel N, et al. The Kaldi speech recognition toolkit. In: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society; 2011.
[37] Ghai W, Singh N. Continuous speech recognition for Punjabi language. International Journal of Computer Applications 2013;72(14).
[38] Arora A, Kadyan V, Singh A. Effect of tonal features on various dialectal variations of Punjabi language. In: Advances in Signal Processing and Communication. Singapore: Springer; 2019. p. 467–75.
[39] Kadyan V. Acoustic features optimization for Punjabi automatic speech recognition system. Doctoral dissertation, Chitkara University; 2018.
[40] Dua M, Aggarwal RK, Kadyan V, Dua S. Punjabi automatic speech recognition using HTK. International Journal of Computer Science Issues (IJCSI) 2012;9(4):359.
[41] Kadyan V, Mantri A, Aggarwal RK, Singh A. A comparative study of deep neural network based Punjabi-ASR system. Int J Speech Technol 2019;22(1):111–9.
[42] Shahnawazuddin S, Deepak KT, Pradhan G, Sinha R. Enhancing noise and pitch robustness of children's ASR. IEEE; 2017. p. 5225–9.
[43] Shahnawazuddin S, Ahmad W, Adiga N, Kumar A. In-domain and out-of-domain data augmentation to improve children's speaker verification system in limited data scenario. IEEE; 2020. p. 7554–8.
[44] Guglani J, Mishra AN. Automatic speech recognition system with pitch dependent features for Punjabi language on KALDI toolkit. Appl Acoust 2020;167:107386.