Applied Acoustics
journal homepage: www.elsevier.com/locate/apacoust
ARTICLE INFO

Article history:
Received 28 January 2020
Received in revised form 27 March 2020
Accepted 17 April 2020
Available online 5 May 2020

Keywords:
Automatic speech recognition system
Pitch dependent feature
Probability voicing estimated feature
Tonal language
Punjabi language
Kaldi toolkit

ABSTRACT

In this paper, the performance of an automatic speech recognition (ASR) system is improved with the help of pitch dependent features and probability of voicing (POV) estimated features. Pitch dependent features are useful for tonal language ASR systems. Punjabi is a highly tonal language, and hence we build an ASR system for Punjabi with pitch dependent features and probability of voicing estimated features. The word error rate (WER), which measures the performance of the system, improves drastically with pitch dependent features and probability of voicing estimated features. Yin, SAcC, Fundamental Frequency Variation (FFV) and Kaldi pitch features were compared in terms of WER. The pitch tracker of the Kaldi toolkit gives the best-performing ASR system among the featured ASR systems compared. The performance of the ASR system is evaluated for the Punjabi language.
© 2020 Elsevier Ltd. All rights reserved.
https://doi.org/10.1016/j.apacoust.2020.107386
0003-682X/© 2020 Elsevier Ltd. All rights reserved.
J. Guglani, A.N. Mishra / Applied Acoustics 167 (2020) 107386
the spectrum in the neighborhood of the harmonics and middle points between harmonics are observed and smooth weighting functions are used [23]. YAAPT (Yet Another Algorithm for Pitch Tracking) is a pitch tracking algorithm designed to be very accurate and robust for high quality speech [24].

The Kaldi pitch tracker is a highly modified version of the getf0 algorithm. It first finds the lag values that maximize the normalized cross-correlation function (NCCF); it then treats all frames as voiced and allows the Viterbi search to interpolate across unvoiced regions. It improves accuracy, as we extract NCCF-based features related to the probability of voicing, which help ASR.

positive. We process the raw NCCF value in two ways, for reasons that will become clear.

3.5. Method intended to give accurate probability of voicing

The method is only used as part of the pitch mean-subtraction algorithm. It processes the NCCF value into a reasonably accurate probability of voicing measure.

3.6. Method for use as a feature
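As a rough illustration of the NCCF on which the processing described above operates, the sketch below computes the NCCF of one analysis window against a lagged copy of itself and picks the lag that maximizes it. It is a minimal pure-Python sketch and not the Kaldi implementation: the function names `nccf` and `best_lag`, the window length, the lag range, the small `eps` guard in the denominator, and the synthetic 100 Hz test tone are all assumptions made for the example.

```python
import math


def nccf(signal, start, window, lag, eps=1e-12):
    """Normalized cross-correlation between signal[start:start+window]
    and the same-length window shifted forward by `lag` samples.

    `eps` is an illustrative guard against division by zero on silent
    frames; it is an assumption of this sketch.
    """
    a = signal[start:start + window]
    b = signal[start + lag:start + lag + window]
    if len(a) < window or len(b) < window:
        raise ValueError("signal too short for this start/window/lag")
    dot = sum(x * y for x, y in zip(a, b))
    energy_a = sum(x * x for x in a)
    energy_b = sum(y * y for y in b)
    return dot / math.sqrt(energy_a * energy_b + eps)


def best_lag(signal, start, window, min_lag, max_lag):
    """Return (lag, nccf_value) for the lag with the highest NCCF."""
    scores = {lag: nccf(signal, start, window, lag)
              for lag in range(min_lag, max_lag + 1)}
    lag = max(scores, key=scores.get)
    return lag, scores[lag]


# Synthetic test tone (assumed values): 100 Hz sine sampled at 8000 Hz,
# so one pitch period is 80 samples.
sample_rate = 8000
f0 = 100.0
signal = [math.sin(2 * math.pi * f0 * n / sample_rate) for n in range(800)]

lag, score = best_lag(signal, start=0, window=200, min_lag=20, max_lag=120)
print(lag, round(score, 3), sample_rate / lag)
```

The NCCF value at the chosen lag lies in [-1, 1] and is close to 1 for strongly periodic (voiced) frames, which is why it is a natural starting point for a probability of voicing measure. The mapping from NCCF to POV used in the paper is not reproduced here.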
REFERENCES

[1] Povey D, Ghoshal A, et al. The Kaldi speech recognition toolkit. In Proc. ASRU; 2011.
[2] Chen CJ, Gopinath RA, Monkowski MD, et al. New methods in continuous mandarin speech recognition. In EUROSPEECH; 1997.
[3] Wang H-m, Ho T-H, Yang R-C, et al. Complete recognition of continuous mandarin speech for Chinese language with very large vocabulary using limited training data. IEEE Trans Speech Audio Process 1997;5(2):195–200.
[4] Chang E, Zhou J, Di S, et al. Large vocabulary mandarin speech recognition with different approaches in modeling tones. In ICSLT; 2000.
[5] Ng AYP, Chan LW, Ching P. Automatic recognition of continuous cantonese speech with very large vocabulary. In EUROSPEECH; 1997.
[6] Lee T, Lau W, Wong YW, et al. Using tone information in cantonese continuous speech recognition. TALIP 2002;1(1):83–102.
[7] Hu S, Liu S, Chang HF, Geng M, Chen J, Chung TKH, et al. The cuhk dysarthric speech recognition systems for English and Cantonese. In Proc. interspeech 2019; Sep. 2019.
[8] Vu NT, Schultz T. Vietnamese large vocabulary continuous speech recognition. In ASRU; 2009. p. 333–8.
[9] Nguyen HQ, Le TD, et al. Automatic speech recognition for Vietnamese using htk system. In RIVF; 2010. p. 1–4.
[10] Kasuriya S, Kanokphara S, Thatphilhakkul N, et al. Context-independent acoustic models for thai speech recognition. In ISCIT, vol. 2; 2004. p. 991–4.
[11] Suebvisai S, Charoenpornsawat P, Black A, et al. Thai automatic speech recognition. In ICASSP, vol. 1; 2005. p. I–857.
[12] Sandhu Balbir Singh. The tonal system of the Punjabi language. Chandigarh: PARKH, Punjab University; Feb 24–25, 2009.
[13] Liu S, Hu S, Liu X, Meng H. On the use of pitch features for disordered speech recognition. In Interspeech 2019; Sep. 2019.
[14] Stahl J, Mowlaee P. Exploiting temporal correlation in pitch-adaptive speech enhancement. Speech Commun 2019;111:1–13.
[15] de Cheveigné A, Kawahara H. Yin, a fundamental frequency estimator for speech and music. J Acoust Soc Am 2002;111:1917.
[16] Talkin David. A robust algorithm for pitch tracking (rapt). Speech Coding Synthesis 1995;495:518.
[17] Ellis DPW, Lee BS. Noise robust pitch tracking by sub band autocorrelation classification. In 13th Annual conference of the international speech communication association; 2012.
[18] Wu M, Wang DL, Brown GJ. A multipitch tracking algorithm for noisy speech. IEEE Trans Speech Audio Process 2003;11(3):229–41.
[19] Camacho A, Harris JG. A sawtooth waveform inspired pitch estimator for speech and music. J Acoust Soc Am 2008;124(3):1638–52.
[20] Kasi K, Zahorian SA. Yet another algorithm for pitch tracking. In 2002 IEEE international conference on acoustics, speech, and signal processing (ICASSP), vol. 1. IEEE; 2002, p. I–361.
[21] Lei X. Modeling lexical tones for mandarin large vocabulary continuous speech recognition [Ph.D. thesis]. University of Washington; 2006.
[22] de Cheveigné A, Kawahara H. Yin, a fundamental frequency estimator for speech and music. J Acoust Soc Am 2002;111(4):1917–30.
[23] Camacho A. SWIPE: A sawtooth waveform inspired pitch estimator for speech and music [PhD thesis]. University of Florida.
[24] Zahorian SA, Hu H. A spectral/temporal method for robust fundamental frequency tracking. J Acoust Soc Am 2008;123:4559–71.
[25] Plante F, Meyer GF, Ainsworth WA. A pitch extraction reference database. In Eurospeech; 1995. p. 837–40.
[26] Chen G, Khudanpur S, Povey D, Trmal J, Yarowsky D, Yilmaz O. Quantifying the value of pronunciation lexicons for keyword search in low resource languages. In 2013 IEEE International conference on acoustics, speech and signal processing (ICASSP); 2013. p. 8560–4.
[27] Povey D, Burget L, et al. The subspace Gaussian mixture model: a structured model for speech recognition. Comput Speech Lang 2011;25(2):404–39.
[28] Povey D, Kanevsky D, Kingsbury B, Ramabhadran B, Saon G, Visweswariah K. Boosted MMI for model and feature-space discriminative training. In IEEE international conference on acoustics, speech and signal processing, 2008, ICASSP 2008; 2008. p. 4057–60.
[29] Laskowski K, Heldner M, Edlund J. The fundamental frequency variation spectrum. In FONETIK; 2000.

Fig. 2. Performance graph of different ASR features in terms of WER.

Table 1
WER of ASR with different features.

ASR features        % WER
Yin pitch + FFV     69
SAcC pitch + FFV    67.5
Kaldi pitch         64.25

Table 2
Comparison of pitch and POV algorithms on the Punjabi language ASR system.

Pitch    POV      % WER
SAcC     SAcC     69.4
Kaldi    SAcC     64.1
Kaldi    Kaldi    62.6

In our experiment we appended fundamental frequency variation (FFV) features [29]. These are seven-dimensional features which are informative about pitch change.

5. Experimental results

To compare and evaluate the ASR systems, the Word Error Rate (WER) criterion was used. The WER is given as

\[
\mathrm{WER}(\%) = \frac{D + S + I}{N} \times 100
\]

where N is the number of words used in the test, D is the number of deletions, S is the number of substitutions, and I is the number of insertions in the test (Fig. 2).

Table 1 shows the comparison between the different pitch extractors. Yin pitch extraction with fundamental frequency variation gives 69% WER, and SAcC pitch with FFV gives 67.5% WER. The Kaldi pitch without FFV performed best of all, with a WER of 64.25%. The Kaldi pitch feature thus gives the best performance among the Yin, SAcC and FFV features. Table 2 shows the pitch and POV performance with SAcC and Kaldi. The combination of Kaldi pitch and Kaldi POV performed best, with 62.6% WER, compared with the SAcC-SAcC and Kaldi-SAcC pitch-POV combinations.

6. Conclusions

In this paper the performance of an ASR system for the Punjabi language, built with the Kaldi toolkit, has been reported in terms of WER. The ASR system with Kaldi pitch features performs best, ahead of the SAcC and Yin featured ASR systems. ASR with SAcC pitch features gives a WER 1.5 percentage points lower than Yin feature based ASR (67.5% versus 69%). Also, the pitch and
POV feature of Kaldi gives the best performance among other pitch