MUSIC Pitch Algorithm

Acoust. Sci. & Tech. 22, 4 (2001)

TECHNICAL REPORT
Abstract: In this article, a new method for fundamental frequency estimation from the noisy spectrum of a speech signal is introduced. The fundamental frequency is one of the most essential characteristics for speech recognition, speech coding, and related applications. The proposed method uses the MUSIC algorithm, an eigen-based subspace decomposition method.
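As background, a MUSIC-style estimator forms the sample autocorrelation matrix of a signal frame, eigen-decomposes it into signal and noise subspaces, and evaluates a pseudospectrum that peaks where a candidate-frequency steering vector is orthogonal to the noise subspace. A minimal sketch in Python/NumPy follows; the matrix size, subspace dimension, and search grid here are illustrative assumptions, not the authors' exact settings:

```python
import numpy as np

def music_pseudospectrum(x, fs, m=64, p=10, freqs=None):
    """MUSIC pseudospectrum of one signal frame.

    m: size of the autocorrelation matrix; p: assumed dimension of the
    signal subspace. Both are illustrative choices.
    """
    # Sample autocorrelation matrix from overlapping length-m snapshots.
    snapshots = np.array([x[i:i + m] for i in range(len(x) - m + 1)])
    R = snapshots.T @ snapshots / snapshots.shape[0]

    # Eigen-decomposition; eigh returns eigenvalues in ascending order,
    # so the first m - p eigenvectors span the noise subspace.
    _, V = np.linalg.eigh(R)
    En = V[:, :m - p]

    if freqs is None:
        freqs = np.arange(40.0, 1000.0, 1.0)  # pitch search range [Hz]
    n = np.arange(m)
    P = np.empty(len(freqs))
    for i, f in enumerate(freqs):
        a = np.exp(2j * np.pi * f / fs * n)   # steering vector at f
        # Peaks occur where a is (nearly) orthogonal to the noise subspace.
        P[i] = 1.0 / (np.linalg.norm(En.conj().T @ a) ** 2)
    return freqs, P

# Toy usage: a 150 Hz harmonic signal in white Gaussian noise.
fs = 11025
t = np.arange(2048) / fs
x = sum(np.sin(2 * np.pi * 150 * k * t) for k in (1, 2, 3))
x = x + 0.5 * np.random.default_rng(0).standard_normal(len(t))
freqs, P = music_pseudospectrum(x, fs)
f0 = freqs[np.argmax(P)]  # strongest peak lies at a harmonic of 150 Hz
```

The strongest pseudospectrum peak may fall on a harmonic rather than the fundamental itself, which is why a pitch estimator restricts and post-processes the peak search.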
appears more clearly in a low-frequency domain [1]. Then, before describing the estimation method of fundamental […] and the sampling frequency is 11.025 [kHz]. In consideration of the existence range of fundamental frequencies, only the frequency components below 1 [kHz] are used for […]

[Figure omitted: panels (a)–(c), horizontal axis Frequency [Hz], 0–5000 Hz.]

The components of a MUSIC spectrum are the set of those at the frequencies f_k = 11025/256 · k [Hz], i.e., 43 [Hz], 86 [Hz], …, 991 [Hz], with 1 ≤ k ≤ 23 (= K). The size of the autocorrelation matrix R_yy is 256 × 256 and its rank will […]
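The frequency grid quoted above follows directly from the stated analysis parameters (sampling frequency 11.025 kHz, 256-point resolution, components kept below 1 kHz); a quick check:

```python
fs = 11025.0
N = 256        # transform length, matching the 256 x 256 autocorrelation matrix
K = 23         # number of grid points below 1 kHz

f = [fs / N * k for k in range(1, K + 1)]   # f_k = 11025/256 * k [Hz]
print(round(f[0], 1), round(f[1], 1), round(f[-1], 1))
# 43.1 86.1 990.5  -> the 43 Hz, 86 Hz, ..., 991 Hz grid quoted in the text
```

Since f_24 = 1033.6 Hz would exceed 1 kHz, K = 23 is exactly the number of usable components.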
T. MURAKAMI and Y. ISHIDA: FUNDAMENTAL FREQUENCY ESTIMATION OF SPEECH SIGNALS USING MUSIC
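For reference, the conventional cepstral pitch estimator used as the baseline in this section works as sketched below; the window and quefrency search range are illustrative assumptions, not the authors' settings:

```python
import numpy as np

def cepstral_pitch(frame, fs, fmin=50.0, fmax=500.0):
    """Estimate F0 from the real cepstrum by peak picking.

    The cepstrum is the inverse FFT of the log magnitude spectrum; a
    voiced frame shows a peak at the quefrency of the pitch period
    (fs / F0 samples).
    """
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    ceps = np.fft.irfft(np.log(spec + 1e-12))
    qmin, qmax = int(fs / fmax), int(fs / fmin)   # plausible pitch periods
    q = qmin + np.argmax(ceps[qmin:qmax])         # the peak-picking step
    return fs / q

# Toy usage: a frame of a 220 Hz harmonic signal (period ~ 50 samples).
fs = 11025
t = np.arange(1024) / fs
frame = sum(np.sin(2 * np.pi * 220 * k * t) for k in range(1, 21))
f0 = cepstral_pitch(frame, fs)
```

Because the estimate is quantized to an integer period in samples, the resolution near 220 Hz is roughly 4–5 Hz, which is one reason the comparison below reports error rates rather than exact matches.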
and the fundamental frequency is estimated from the peak location of the time-domain signal (i.e., the cepstrum) obtained by its transformation, using peak picking [10].

Japanese male and female vowels, /a/ and /i/, are tested in both noise free and noisy environments. In the experiment, the additive noise is Gaussian. We compare the proposed method with the cepstral method, which is commonly used for estimating the fundamental frequencies.

Table 1  Average value of absolute error rates of estimated fundamental frequencies.

Male speakers (%)

                    /a/    /i/    /u/    /e/    /o/    Average
  Cepstral method   22.5   40.7   3.6    2.5    1.8    14.2
  MUSIC algorithm   0.5    3.9    2.2    3.9    0.5    2.2

Female speakers (%)

                    /a/    /i/    /u/    /e/    /o/    Average
  Cepstral method   0.9    3.9    4.2    2.0    3.3    2.9
  MUSIC algorithm   0.7    3.8    1.6    0.8    0.7    1.5

These numerical values represent the average value for each vowel.

In Figs. 4 and 5, the experimental results for the Japanese male vowel /a/ and the Japanese female vowel /i/ are shown, respectively. In each figure, (a) shows the original speech signal, (b) the speech signal corrupted with additive noise (SNR = 0.63 [dB] and 0.19 [dB], respectively), (c) the FFT spectrum of the speech signal, (d) the cepstrum obtained by the FFT, (e) the MUSIC spectrum, and (f) the cepstrum by the MUSIC algorithm. In (c)–(f) of Figs. 4 and 5, the solid lines denote the noise free environment and the dotted lines denote the noisy environment, respectively.
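Table 1 reports an average absolute error rate between estimated and true fundamental frequencies. Assuming the usual definition |f_est − f_true| / f_true × 100, averaged over the analyzed frames (the formula is not spelled out here, so this is an assumption), a sketch:

```python
def avg_abs_error_rate(estimated, true):
    """Mean absolute error rate in percent between paired F0 estimates
    (one per analysis frame) and the corresponding true values."""
    assert len(estimated) == len(true)
    rates = [abs(fe - ft) / ft * 100.0 for fe, ft in zip(estimated, true)]
    return sum(rates) / len(rates)

# Toy usage with the /a/ figures quoted in the text: a 136.1 Hz estimate
# against a 116.1 Hz true value is a ~17.2% error, while a perfect
# 116.1 Hz estimate contributes 0%.
r = avg_abs_error_rate([136.1, 116.1], [116.1, 116.1])
```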
In the case of the cepstral method, the estimated fundamental frequencies of the Japanese male vowel /a/ are 116.1 [Hz] for the noise free speech and 136.1 [Hz] for the noisy speech, whereas the frequencies estimated by the MUSIC algorithm are 116.1 [Hz] for both cases. For the Japanese female vowel /i/, the fundamental frequencies estimated by the cepstral method are 220.5 [Hz] and 229.7 [Hz] for the noise free and noisy speech, respectively, while the MUSIC algorithm again gives 220.5 [Hz] in both cases. The true fundamental frequencies were directly estimated from the original speech waveforms. In this example, the average absolute error rate of the cepstral method for male speakers is 14.2% and that of the MUSIC algorithm is 2.2%. In addition, the average absolute error rates for female speakers are 2.9% and 1.5%, respectively. Though all the average values are large because of the low SNR, Table 1 suggests that the proposed method is superior to the conventional cepstral method for estimating the approximately true fundamental frequency.

[Fig. 4 omitted: analysis results for the Japanese male vowel /a/ (SNR = 0.63 [dB]); the cepstral peaks in panel (d) are at 116.1 [Hz] and 136.1 [Hz], and the peak in panel (f) is at 116.1 [Hz].]

[Fig. 5 omitted: the cepstral peaks in panel (d) are at 220.5 [Hz] and 229.7 [Hz], and the peak in panel (f) is at 220.5 [Hz]. Panels: (a) Original speech signal. (b) Noisy speech signal. (c) FFT spectrum. (d) Cepstrum obtained by the FFT. (e) MUSIC spectrum. (f) Cepstrum by the MUSIC algorithm.]
Fig. 5  Analysis results for a Japanese female vowel /i/ (SNR = 0.19 [dB]).

4. CONCLUSION

We have proposed a new method to estimate the
fundamental frequency of noisy speech signals. Although the MUSIC algorithm is actively used in the field of mobile communications, it seems to be seldom used in the field of speech analysis. This research is a fundamental step in applying the MUSIC algorithm to speech signal processing, and we have confirmed that the characteristics of the method can be exploited effectively.

ACKNOWLEDGEMENT

The authors are grateful to the anonymous reviewers for their helpful suggestions in improving the quality of this paper.

REFERENCES

[1] W. Hess, Pitch Determination of Speech Signals (Springer-Verlag, New York, 1983).
[2] M. Kaveh and A. J. Barabell, "The statistical performance of the MUSIC and the minimum-norm algorithms in resolving plane waves in noise", IEEE Trans. ASSP-34, 331–341 (1986).
[3] M. Egawa, T. Kobayashi and S. Imai, "Instantaneous frequency estimation in low SNR environments using improved DFT-MUSIC", 1996 IEICE General Conference, A-158 (1996) (in Japanese).
[4] Y. Ogawa and K. Itoh, "High-resolution estimation using the MUSIC algorithm", Trans. IEE Jpn. 116, 671–677 (1996) (in Japanese).
[5] S. L. Marple, Digital Spectral Analysis with Applications (Prentice-Hall, New Jersey, 1987).
[6] S. V. Vaseghi, Advanced Signal Processing and Digital Noise Reduction (Wiley, New York, 1996).
[7] N. Kikuma, Adaptive Signal Processing with Array Antenna (Science and Technology Publishing Company, Tokyo, 1999) (in Japanese).
[8] R. O. Schmidt, "Multiple emitter location and signal parameter estimation", IEEE Trans. AP-34, 276–280 (1986).
[9] M. S. Andrews, J. Picone and R. D. Degroat, "Robust pitch determination via SVD based cepstral methods", ICASSP '90, 253–256 (1990).
[10] J. D. Markel and A. H. Gray, Linear Prediction of Speech (Springer-Verlag, New York, 1976).

Takahiro Murakami was born in Chiba, Japan, on February 8, 1978. He received the B.E. degree in Electronics and Communication from Meiji University, Kawasaki, Japan, in 2000. He is currently working toward the M.E. degree at the Graduate School of Electrical Engineering, Meiji University. He is interested in speech signal processing. He is a member of IEICE.

Yoshihisa Ishida was born in Tokyo, Japan, on February 24, 1947. He received the B.E., M.E., and Dr. Eng. degrees in Electrical Engineering from Meiji University, Kawasaki, Japan, in 1970, 1972, and 1978, respectively. In 1975 he joined the Department of Electrical Engineering, Meiji University, as a Research Assistant and became a Lecturer and an Associate Professor in 1978 and 1981, respectively. He is currently a Professor at the Department of Electronics and Communication, Meiji University. His current research interests are in the area of digital signal processing and speech analysis. He is a member of ASJ, IEEE, and IEICE.