LPC Vocoder
1- Introduction:
The goal of this project is the development of a voice coder (vocoder) based on Linear Predictive
Coding (LPC). Quantization issues are not investigated in this work, so the LPC filter is not
quantized. The main objective here is to gain intuition about the limitations of the adopted
excitation model: basically a white-noise generator (for unvoiced speech) and a pulse generator
(for voiced speech).
A summary of the activities conducted during this work is given below.
2- Voice Activity Detection (VAD):
Nowadays, with some communication channels supporting variable bit rates, as in ATM
(Asynchronous Transfer Mode) networks for example, one can save bandwidth by exploiting the fact
that, in a conversation, one of the two channels is inactive most of the time. Thus, VAD
algorithms are very important today.
One very good reference on VAD algorithms is the paper describing the VAD adopted for the
G.729 vocoder [3]. In fact, we started implementing it, but there was not enough time to finish.
The algorithm used here is a very simple one, inspired by the end-point detection algorithm
described in [1] for the purpose of isolated-word speech recognition. The algorithm is documented
in the Appendix. Basically, the algorithm implemented makes a decision based on the zero-crossing
rate (ZCR) and the energy of a given frame. The frame is considered valid silence only if the
actual values of both the ZCR and the energy are below their thresholds. Otherwise, the frame is
considered valid speech. More sophisticated algorithms, like the one adopted for the G.729
vocoder, use moving averages in order to adapt to variations in the background noise. The VAD
implemented here uses thresholds that are fixed during operation. Of course, this is not a good
approach in real situations. The thresholds were obtained through histograms such as the ones
shown in Figures 1 and 2.
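The decision rule above can be sketched in a few lines. The threshold values below are hypothetical placeholders; the actual thresholds were read off the histograms:

```python
def frame_features(frame):
    """Short-time energy and zero-crossing rate (ZCR) of one frame."""
    energy = sum(s * s for s in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return energy, crossings / len(frame)

def is_silence(frame, energy_thr=1e-4, zcr_thr=0.3):
    """Fixed-threshold VAD: a frame is valid silence only if BOTH the
    energy and the ZCR are below their thresholds (the threshold values
    here are illustrative, not the ones estimated from the histograms)."""
    energy, zcr = frame_features(frame)
    return energy < energy_thr and zcr < zcr_thr
```

For speech sampled at 8 kHz, a 20 ms frame corresponds to 160 samples.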
Aldebaro Klautau - http://speech.ucsd.edu/aldebaro - 02/03/00. Page 2.
[Histogram figures omitted; both plotted occurrences per frame, with per-frame energy (×10⁻³) and per-frame ZCR on the horizontal axes.]
Fig. 1- Example of histogram of the noise energy for frames of 20 ms.
Fig. 2- Example of histogram of the ZCR for frames of 20 ms (environment where the A/D converter has an offset).
3- Pitch Estimation:
The problem of pitch estimation has been a research topic throughout the entire evolution of
digital speech processing, and it remains one of great interest. A good estimate of the pitch
period is crucial to improving the performance of speech analysis and synthesis systems.
Most of the classical techniques, as described in [1], [2], are based on:
⇒ Autocorrelation function
⇒ Spectrum
⇒ Cepstrum
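As a small illustration of the cepstrum route listed above, the sketch below computes the real cepstrum with a plain O(N²) DFT (pure Python, chosen for clarity, not speed) and picks the quefrency peak; the frame length, pulse period and search range are arbitrary choices for the example:

```python
import cmath
import math

def dft(x):
    """Naive O(N^2) discrete Fourier transform (clarity over speed)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def real_cepstrum(x):
    """Real cepstrum: inverse DFT of the log-magnitude spectrum."""
    N = len(x)
    logmag = [math.log(abs(X) + 1e-12) for X in dft(x)]
    return [sum(logmag[k] * cmath.exp(2j * math.pi * k * q / N)
                for k in range(N)).real / N for q in range(N)]

# A pulse train with a 40-sample period: the cepstrum peaks at
# quefrencies that are multiples of that period.
x = [1.0 if n % 40 == 0 else 0.0 for n in range(200)]
c = real_cepstrum(x)
lag = max(range(20, 161), key=lambda q: c[q])
```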
Despite the large number of pitch detection algorithms developed so far, the estimation of the
fundamental frequency is still a problem without a definitive solution. Recently, a range of
advanced digital signal processing methods has been applied. Some of these "modern" techniques
for pitch estimation are described in [5], [6], [7].
Due to the scope of this work, it was not possible to implement one of these modern techniques.
Instead, we implemented two classical techniques, one based on the autocorrelation and one based
on the cepstrum, which are discussed in the next paragraphs. Both implementations are based on
[1] and [2] (mainly on [1]).
[Figure omitted; its panels showed the signal (×10⁻³), the clipped signal, and the normalized autocorrelation, whose values at short lags exceed the peak inside the pitch range [24, 160].]
Fig. 3- Example of one problem with the pitch estimation based on the autocorrelation.
In order to circumvent the problem exemplified in Figure 3, we derived a new algorithm, called
pitch4.m, that basically has the following heuristic: if the minimum value of the autocorrelation function
in the interval of lags [1, 23] is greater than the maximum value in the interval of interest [24, 160] (the
situation exemplified in Figure 3), the frame is considered unvoiced. This small modification leads to a
much better speech quality.
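A minimal sketch of this heuristic follows; the lag bounds are the values quoted above, while the frame length and test signals are illustrative:

```python
import math

def autocorr_pitch(frame, min_lag=24, max_lag=160):
    """Pitch lag by normalized autocorrelation, with the pitch4-style
    heuristic: if the minimum over the short lags [1, min_lag - 1]
    stays above the best peak in [min_lag, max_lag], the frame is
    declared unvoiced (returned as lag 0)."""
    r0 = sum(s * s for s in frame)
    if r0 == 0:
        return 0
    def r(lag):
        return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag)) / r0
    short_min = min(r(lag) for lag in range(1, min_lag))
    band = {lag: r(lag) for lag in range(min_lag, max_lag + 1)}
    best = max(band, key=band.get)
    if short_min > band[best]:
        return 0  # low-frequency trend dominates: unvoiced
    return best

voiced = [math.sin(2 * math.pi * n / 50) for n in range(400)]  # period 50
ramp = [n / 400.0 for n in range(400)]  # slow trend, no periodicity
```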
⇒ continuity of pitch
⇒ pitch smoothing using a nonlinear median filter
It should be noticed that the greater the number of pitch values considered in the median
filter, the greater the delay of the system.
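The nonlinear median smoothing can be sketched as below; the window width is the number of pitch values mentioned in the figure captions, and a width of w implies roughly w // 2 frames of extra look-ahead delay:

```python
def median_smooth(track, width=5):
    """Median-filter a pitch contour: isolated double/half-pitch
    spikes narrower than half the window are removed."""
    half = width // 2
    out = []
    for i in range(len(track)):
        window = sorted(track[max(0, i - half):i + half + 1])
        out.append(window[len(window) // 2])
    return out
```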
[Figure omitted; its panels plotted the pitch lag versus frame number for the pitch2 (autocorrelation), pitch3 (cepstrum) and pitch4 (autocorrelation) functions.]
Fig.4 - Results obtained without using any pitch smoothing technique.
[Figure omitted; its panels plotted the pitch lag versus frame number for the pitch2 (autocorrelation), pitch3 (cepstrum) and pitch4 (autocorrelation) functions.]
Fig.5 - Results obtained using a median filter with 3 pitch values.
[Figure omitted; its panels plotted the pitch lag versus frame number for the pitch2 (autocorrelation), pitch3 (cepstrum) and pitch4 (autocorrelation) functions.]
Fig.6 - Results obtained using a median filter with 5 pitch values.
[Figure omitted; its panels plotted the pitch lag versus frame number for the pitch2 (autocorrelation), pitch3 (cepstrum) and pitch4 (autocorrelation) functions.]
Fig.7 - Results obtained using a median filter with 7 pitch values.
Comparing the result obtained with a median filter of 5 pitch values (Fig. 6) with Fig. 4, it is
clear that the median filter is doing a good job. Using this kind of filter, the synthesized
speech achieves a quality comparable to the one obtained with the LPC-10e (which uses
quantization). However, even when the number of values used in this filter is increased (Fig. 7,
for example), one can still see double- and half-pitch problems. Thus, the algorithm needs more
intelligence in order to get better results. It should be noticed that the results obtained with
the cepstrum were slightly better than those obtained with the autocorrelation. One can also see
that the modification suggested for the first autocorrelation-based implementation leads to
better results, avoiding very small pitch values caused by low-frequency fluctuations of the
signal.
An important aspect is the gain of the excitation sequence. In fact, the LPC analysis gives, as
an "extra" result, the energy of the excitation sequence. However, the signals used as excitation
(random noise and pulse train) do not have the same spectrum as the true excitation, so the
matching between the original speech energy and the energy of the synthesized speech is not
straightforward. In fact, some empirical values were adopted so that only a fixed percentage of
the gain given by the LPC analysis is used. These empirical values led to better subjective
results.
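A sketch of that gain adjustment follows; the fixed fraction below is a hypothetical placeholder standing in for the empirically tuned percentages mentioned above:

```python
import math

def scale_excitation(excitation, lpc_gain, fraction=0.7):
    """Scale the excitation frame so its RMS equals a fixed fraction
    of the gain delivered by the LPC analysis. The default fraction
    0.7 is illustrative, not the empirical value actually used."""
    rms = math.sqrt(sum(e * e for e in excitation) / len(excitation))
    if rms == 0.0:
        return list(excitation)
    target = fraction * lpc_gain
    return [e * target / rms for e in excitation]
```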
Figure 8 shows an example of the generation of voiced speech. Figures 9, 10 and 11 compare the
time sequences. Figures 12, 13 and 14 compare the spectrograms. One can see that we should do a
better job in order to obtain the same harmonic content shown by the DoD vocoder. Figure 15 shows
how the VAD works. In this specific case, the sentence did not belong to the training sequence
used to estimate the thresholds of the VAD algorithm. At the beginning and in the middle of the
sentence, the VAD correctly detected silence. However, it wrongly classified the stop closure of
the plosive "p" as silence. So, as mentioned before, this algorithm is too simple and is not able
to adapt to different background noises.
[Figures omitted; the panels of Fig. 8 showed the synthetic excitation (×10⁻³) and the corresponding synthesized speech, and the waveform panels of Fig. 9 are also omitted.]
Fig.9 - Original speech file: "Paper scares so write with much care" (multiplied by 10).
Fig. 14 - Spectrogram of the speech synthesized by the vocoder implemented in this work (VAD algorithm disabled).
Fig. 15 - Spectrogram of the speech synthesized by the vocoder implemented in this work.
Example of the VAD algorithm: regions in dark blue correspond to silence (filled with zero values).
5- Conclusions:
The implementation of an LPC vocoder is really an exciting and challenging matter. Many
techniques were learned from the literature and from practice during this work. Looking at the
complexity of the voiced/unvoiced decision in the LPC-10e DoD vocoder [4], it is clear that a
good algorithm must have a lot of intelligence and adaptability in order to get good results.
The main problem is the estimation of the pitch. Secondly, a robust voiced/unvoiced decision is
very important.
It was found that considering the memory of the LPC filter leads to better results.
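What "considering the memory" means here can be sketched as follows: the all-pole synthesis filter keeps its last `order` output samples across frame boundaries instead of restarting from zeros at every frame. The sign convention assumes predictor coefficients a[k] with s[n] = e[n] − Σ a[k]·s[n−k]; everything else is illustrative:

```python
def synthesize(frames, lpc_coeffs, order=10):
    """All-pole LPC synthesis carrying the filter memory across frames.
    frames: list of excitation frames; lpc_coeffs: one coefficient list
    per frame, holding denominator coefficients a[1..order] (a[0] = 1)."""
    memory = [0.0] * order  # last `order` output samples, newest first
    out = []
    for excitation, a in zip(frames, lpc_coeffs):
        for e in excitation:
            s = e - sum(a[k] * memory[k] for k in range(order))
            memory = [s] + memory[:-1]  # memory survives the frame boundary
            out.append(s)
    return out
```

Resetting the memory to zeros at every frame would truncate the filter's response to the previous frame's excitation, producing discontinuities at the frame rate.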
The median filter alone was not able to give a completely smooth pitch contour. Techniques such
as avoiding abrupt changes in the pitch value and avoiding double and half pitches should be
incorporated in order to get better results.
References:
[1] Rabiner, L. and Schafer, R. Digital Processing of Speech Signals. Prentice-Hall.
[2] Kondoz, A. Digital Speech - Coding for Low Bit Rate Communications Systems. Wiley, 1994.
[4] Campbell, J. and Tremain, T. Voiced / Unvoiced Classification of Speech with Applications to
    the U.S. Government LPC-10E Algorithm. Proceedings of IEEE ICASSP, 1986, pp. 473-476.
[5] Ghaemmaghami, S.; Deriche, M. and Boashash, B. A new approach to pitch and voicing detection
    through spectrum periodicity measurement. Proceedings of IEEE TENCON '97, vol. 2, Brisbane,
    Australia, Dec. 1997.
[6] Juhar, J. Advanced pitch detection algorithms. Proceedings of DSP '97, 3rd International
    Conference on Digital Signal Processing, Herl'any, Slovakia, 3-4 Sept. 1997.
[7] Janer, L.; Bonet, J.J. and Lleida-Solano, E. Pitch detection and voiced/unvoiced decision
    algorithm based on wavelet transforms. Proceedings of ICSLP '96, vol. 2, Philadelphia, PA,
    USA, 3-6 Oct. 1996.