You are on page 1of 3

Applied Acoustics 167 (2020) 107386

Contents lists available at ScienceDirect

Applied Acoustics
journal homepage: www.elsevier.com/locate/apacoust

Automatic speech recognition system with pitch dependent features


for Punjabi language on KALDI toolkit
Jyoti Guglani a,⇑, A.N. Mishra b
a
Department of Electronics & Communication, IMSEC, Ghaziabad, India
b
Department of Electronics & Communication, KEC Engineering College, Ghaziabad, India

a r t i c l e i n f o a b s t r a c t

Article history: In this paper the improvement in performance of automatic speech recognition (ASR) system is achieved
Received 28 January 2020 with help of pitch dependent features and probability of voicing estimated features. The pitch dependent
Received in revised form 27 March 2020 features are useful for tonal language ASR system. Punjabi language is highly tonal language and hence
Accepted 17 April 2020
here we are building ASR system for Punjabi language with pitch dependent features and probability of
Available online 5 May 2020
voicing estimated features. The word error rate of system gives the performance of system which dras-
tically improves with pitch dependent features and probability of voicing estimated features. Comparison
Keywords:
of Yin, SAcC, Fundamental Frequency Variation (FFV) and Kaldi pitch features of ASR system were done in
Automatic speech recognition system
Pitch dependent feature
terms of WER. The KALDI pitch tracker of Kaldi toolkit gives the best performance ASR system among
Probability voicing estimated feature other featured ASR systems. The performance of ASR system is evaluated for Punjabi language.
Tonal language Ó 2020 Elsevier Ltd. All rights reserved.
Punjabi language
Kaldi toolkit

1. Introduction dependent features and probability of voicing features which are


dependent on tones. So here the ASR system with pitch dependent
Speech recognition is an attractive field nowadays. Speech features for Punjabi language is built. That gives high performance
recognition of regional languages in India is of highly a concern as compared to normal ASR system. The various applications based
area due to the complexity of these languages. The progress in on pitch dependent ASR system were also built [13,14].
technology should be spread around the country without the bon- The paper is organized in Six sections. Section II describes the
dage of language. This can be realized with the help of ASR system different existing pitch extraction methods. Section III explains
of regional languages. Punjabi is an important Indian language and the Kaldi pitch extractor. Section IV represents the system descrip-
is spoken by a large portion of people around the world. The tion. Section V and VI gives the experimental results and
Speech recognition system based on Deep Neural Network is conclusions.
formed for the Punjabi language in this paper. The good performing
pitch dependent features to be used in speech recognition and to
produce a standardized pitch feature for use in the Kaldi Automatic 2. Pitch extraction methods
Speech Recognition (ASR) toolkit [1].
Pitch features are useful for recognition of speech, especially for The different pitch extractors are available such as, Yin [15],
tonal languages, such as Mandarin [2–4], Cantonese [5–7], Viet- Getf0 [16], SAcC [17], Wu [18], SWIPE [19] and YAAPT [20]. These
namese [8,9] and Thai [10,11], since pitch can serve as an informa- are used to extract the pitch dependent features. But in all these
tive source to distinguish different tones in tonal languages. extractors voiced and unvoiced frames are treated separately
Punjabi is an Indo-Aryan language with about one hundred million [21]. Yin is one of popular pitch estimator which is based on auto-
native speakers around the world and especially in the Indian correlation method having several refinements to reduce possible
sub-continent. Punjabi is a tonal language with three tones: high errors [22]. Getf0 performs much better than Yin and it is one of
falling, low rising, and level [12]. Due to this the ASR system for simpler algorithm to implement. Getf0 makes a hard decision
Punjabi language is to be modified i.e. the system with pitch whether any given frame is voiced or unvoiced. SAcC gives better
performance but it is complex method. SWIPE (Sawtooth Inspired
Pitch Estimator) enhances the performance of pitch tracking
⇑ Corresponding author. adopting some measures as it avoids the use of log of spectrum
E-mail address: jyoti.guglani@imsec.ac.in (J. Guglani). and it applies monotonically decaying weight to harmonics. Also,

https://doi.org/10.1016/j.apacoust.2020.107386
0003-682X/Ó 2020 Elsevier Ltd. All rights reserved.
2 J. Guglani, A.N. Mishra / Applied Acoustics 167 (2020) 107386

the spectrum in the neighborhood of the harmonics and middle positive. We process the raw NCCF value in two ways, for reasons
points between harmonics are observed and smooth weighting that will become clear.
functions are used [23]. YAAPT (Yet Another Algorithm for Pitch
Tracking) is a Pitch tracker algorithm designed to be very accurate
3.5. Method intended to give accurate probability of voicing
and robust for high quality speech [24].
Kaldi pitch tracker is a highly modified version of getf0 algo-
The method is only used as part of the pitch mean-subtraction
rithm. In this, first we find the lag values that maximizes NCCF
algorithm. It processes the NCCF value into a reasonably accurate
then all frames treated as voiced and allow Viterbi search ti inter-
probability of voicing measure.
polate across unvoiced regions. It improves the accuracy as we
extract features based on NCCF related to probability of voicing
which helps ASR. 3.6. Method for use as a feature

In this method we use to process the NCCF, gives a value that


3. System description
seems to give good performance when used as a feature. This tech-
nique was to provides the feature a reasonably Gaussian
Kaldi uses a modified version of the RAPT algorithm. The key
distribution.
difference is that Kaldi does not make a hard decision whether
any given frame is voiced or unvoiced; instead, it assigns a pitch
even to unvoiced frames while constraining the pitch trajectory 3.7. Normalization of pitch
to be continuous. ASR system with Kaldi toolkit is shown in Fig. 1.
And it is also providing a probability of voicing and other fea- We use the short-time mean subtraction approach [26], but
tures used in ASR system. The steps of Kaldi pitch tracker in ASR with POV weighting on each time t we subtract a weighted average
system are as follows: pitch value which was computed over a window width of 155
frames centered at t and weighted by the probability of voicing
3.1. Subsampling and normalization value [27]. The features, MFCC and PLP were used in ASR system.

Let the input to the algorithm be a discretely sampled signal,


4. Database preparation
sampled with sampling frequency S. The first stage is to use the
resampling method above, with the filter parameterized by
The database of the Punjabi continuous speech from a hundred
lowpass-cutoff and lowpass-filter-width, to sample the signal at
native speakers of Punjab was recorded. The 1500 phonetical rich
sampling frequency resample frequency. Next we normalize the
sentences of Punjabi language were recorded with each speaker.
resampled signal’s dynamic range by dividing by the root-mean-
Sixty-five Male and twenty female speakers of different age group
square signal value [25].
(25–45 years) and from the different region of Punjab (Ludhiana
and Bathinda) were chosen for the recording of database. The data-
3.2. Computing the NCCF base of the Punjabi continuous speech in Malwai dialect was made.
Total 12.5 h database was recorded using the cool software.
First, we need to establish the range of lags over which to com- Then the samples of each sentence for each speaker were seg-
pute the NCCF. These depend on the frequency range we search mented with sampling rate as 16 kHz, 16-bits and mono. For
over. Define the quantities min-lag = 1/max-f0, max-lag = 1/min- recording the microphone was kept at a 3 to 4 in. far from the
f0, which are the minimum and maximum lags (in seconds) at speaker’s mouth. Intex and iBall company microphones were used
which we need the NCCF, and furthermore define upsample- for audio recording purpose.
filter-frequency as resample-frequency/2 which is the filter cutoff. Meta-data for preparing language data is mandatory for Kaldi
ASR. That includes Lexicon.txt, nonsilence-phones.txt and silence.
3.3. Output txt files. The lexicon.txt contains the phone transcriptions of each
word. The nonsilence_phones.txt having all the phones provided
The output of this algorithm is the pitch on each frame and the to prepare the database. And the silence_phones.txt contains the
NCCF on each frame. silence.
The Meta-data for Kaldi ASR includes:
3.4. Processing NCCF into a POV measure
 spk2gender < speakers’ gender information.
The basis for our Probability of Voicing (POV) measures are the  wav.scp < the path of recorded audio files with speakers ID.
NCCF values for each frame. Its range is [1, 1], but it is usually  text < utterances matched with text transcriptions.
 utt2spk < mapping of utterance of a speaker.
 corpus.txt < utterance transcription used for implementing the
model

Kaldi is a progressively developing speech recognition toolkit


with excellent support provided by its authors and the wide base
of users. According to its development status, Kaldi allows imple-
menting the efficient speech recognition systems.
Only the transcriptions of the whole sentences of the training
sets were used to generate the text corpus for LM estimation.
The Kaldi toolkit provides an F0 estimate of each frame, whether
voiced or not, so an arbitrary threshold of 0.5 was applied on nor-
malized cross correlation function to make a voiced or unvoiced
Fig. 1. ASR model using Kaldi toolkit. decision for each frame [28].
J. Guglani, A.N. Mishra / Applied Acoustics 167 (2020) 107386 3

extractors. The WER reduces by 1.5% using Kaldi. As Kaldi pitch


tracker is used to extract tone related pitch features which are aug-
mented feature to acoustic model. So, for tonal language, Punjabi
language, performance of ASR system improved with pitch and
POV features.

Declaration of Competing Interest

The authors declare that they have no known competing finan-


cial interests or personal relationships that could have appeared
to influence the work reported in this paper.

REFERENCES

[1] Povey D, Ghoshal A, et al. The Kaldi speech recognition toolkit. In Proc. ASRU;
Fig. 2. Performance graph of different ASR features in terms of WER. 2011.
[2] Chen CJ, Gopinath RA, Monkowski MD, et al. New methods in continuous
mandarin speech recognition. In EUROSPEECH; 1997.
Table1 [3] Wang H-m, Ho T-H, Yang R-C, et al. Complete recognition of continuous
WER of ASR with different features. mandarin speech for Chinese language with very large vocabulary using
limited training data. IEEE Trans Speech Audio Process 1997;5(2):195–200.
ASR features % WER [4] Chang E, Zhou J, Di S, et al. Large vocabulary mandarin speech recognition with
different approaches in modeling tones. In ICSLT; 2000.
Yin pitch + FFV 69 [5] Ng AYP, Chan LW, Ching P. Automatic recognition of continuous cantonese
SAcC pitch + FFV 67.5 speech with very large vocabulary. In EUROSPEECH; 1997.
Kaldi pitch 64.25 [6] Lee T, Lau W, Wong YW, et al. Using tone information in cantonese continuous
speech recognition. TALIP 2002;1(1):83–102.
[7] Hu S, Liu S, Chang HF, Geng M, Chen J, Chung TKH, et al. The cuhk dysarthric
speech recognition systems for English and Cantonese. In Proc. interspeech
Table 2 2019; Sep. 2019.
Comparison of pitch and POV algorithm on [8] Vu NT, Schultz T, Vietnamese large vocabulary continuous speech recognition.
Punjabi language ASR system. In ASRU; 2009. p. 333–8.
[9] Nguyen HQ, Le TD, et al. Automatic speech recognition for Vietnamese using
Pitch POV % WER htk system. In RIVF; 2010. p. 1–4.
[10] Kasuriya S, Kanokphara S, Thatphilhakkul N, et al. Context-independent
SAcC SAcC 69.4
acoustic models for thai speech recognition. In ISCIT, vol. 2; 2004. p. 991–4.
Kaldi SAcC 64.1
[11] Suebvisai S, Charoenpornsawat P, Black A, et al. Thai automatic speech
Kaldi Kaldi 62.6 recognition. In ICASSP, vol. 1; 2005. p. I–857.
[12] Sandhu Balbir Singh. The tonal system of the Punjabi language. Chandigarh:
PARKH, Punjab University; Feb 24–25, 2009.
In our experiment we appended fundamental frequency varia- [13] Liu S, Hu S, Liu X, Meng H. On the use of pitch features for disordered speech
recognition. In Interspeech 2019; Sep. 2019
tion (FFV) features [29]. These are seven dimensional features
[14] Stahl J, Mowlaee P. Exploiting temporal correlation in pitch-adaptive speech
which are informative about pitch change. enhancement. Speech Commun 2019;111:1–13.
[15] De Alain, Cheveigné, Kawahara Hideki. Yin, a fundamental frequency estimator
for speech and music. J Acoust Soc Am 2002;111:1917.
5. Experimental results [16] Talkin David. A robust algorithm for pitch tracking (rapt). Speech Coding
Synthesis 1995;495:518.
To compare and evaluate the results of ASR system, the Word [17] Ellis DPW, Lee BS. Noise robust pitch tracking by sub band autocorrelation
classification. In 13th Annual conference of the international speech
Error Rate (WER) criterion was used. The WER is given as, communication association; 2012.
ðD þ S þ IÞ [18] Wu M, Wang DL, Brown GJ. A multipitch tracking algorithm for noisy speech.
WERð%Þ ¼  100ð%Þ IEEE Trans Speech Audio Process 2003;11(3):229–41.
N [19] Camacho A, Harris JG. A sawtooth waveform inspired pitch estimator for
speech and music. J Acoust Soc Am 2008;124(3):1638–52.
where, N is the number of words utilized in a test, D is the number [20] Kasi K, Zahorian SA. Yet another algorithm for pitch tracking. In 2002 IEEE
of Deletions in the test, S is the number of substitutions in test, and I international conference on acoustics, speech, and signal processing (ICASSP),
is the number of insertion error in the test (Fig. 2). vol. 1. IEEE; 2002, p. I–361.
[21] Lei X. Modeling lexical tones for mandarin large vocabulary continuous speech
Table1 shows the Comparison between the different pitch
recognition [Ph.D. thesis]. University of Washington; 2006.
extractors. With the Yin pitch extraction and fundamental fre- [22] de Chevegne A, Kawahara H. Yin a fundamental frequency estimators for
quency variation gives 69% WER and SAcC pitch gives 67.5% speech and music. J Acoust Soc Am 2002;III(4):1917–30.
[23] Camacho A. SWIPE A sawtooth waveform inspired pitch estimator for speech
WER. The Kaldi pitch without FFV performed very well among
and music [PhD thesis]. University of Florida.
all. It gives WER of 64.25%. The Kaldi pitch feature gives the best [24] Zahorian SA, Hu H. A spectral/temporal temporal method for robust
performance among Yin, SAcC and FVV features. Table 2 shows fundamental frequency tracking. J Acoust Soc Am 2008;123:4559–71.
the Pitch and POV performance with SAcC and Kaldi. It shows that [25] Plante F, Meyer GF, Ainsworth WA. A pitch extraction reference database. In
Eurospeech; 1995. p. 837–40.
the Pitch and POV of Kaldi performed best among the SAcC- SAcC [26] Chen G, Khudanpur S, Povey D, Trmal J, Yarowsky D, Yilmaz O. Quantifying the
and Kaldi- SAcC Pitch POV with 62.6% WER. value of pronunciation lexicons for keyword search in low resource languages.
In 2013 IEEE International conference on acoustics, speech and signal
processing (ICASSP); 2013. p. 8560–4.
6. Conclusions [27] D. Povey L. Burget et al. The subspace Gaussian mixture model–a structured
model for speech recognition Comput Speech Lang 25 2 2011 404 439
[28] Povey D, Kanevsky D, Kingsbury B, Ramabhadran B, Saon G, Visweswariah K.
In this paper the efficiency of ASR system of Punjabi language
Boosted MMI for model and feature-space discriminative training. In IEEE
has been reported on Kaldi toolkit in terms of WER. The perfor- international conference on acoustics, speech and signal processing, 2008,
mance of Kaldi pitch features ASR system is best among SAcC ICASSP 2008; 2008. p. 4057–60.
and Yin featured ASR system. ASR with SAcC pitch features give [29] Laskowski K, Heldner M, Edlund J. The fundamental frequency variation
spectrum. In FONETIK; 2000.
1.5% less WER than Yin feature based ASR. Also, the pitch and
POV feature of Kaldi gives the best performance among other pitch

You might also like