

LPC Vocoder
1- Introduction:

The goal of this project is the development of a voice coder (vocoder) based on Linear Predictive
Coding (LPC). Matters related to quantization are not investigated in this work, so the LPC filter
is not quantized. The main objective is to gain intuition about the limitations of the adopted excitation
model: basically a white noise generator (for unvoiced speech) and a pulse generator (for voiced
speech).
A summary of the activities conducted during this work is given below:

a) Bibliographical review of pitch estimation algorithms.
b) Bibliographical review of silence detection (or voice activity detection - VAD).
c) Implementation of a VAD algorithm inspired by the one described in Rabiner [1].
d) Implementation of a pitch estimation algorithm based on the autocorrelation function.
e) Implementation of a pitch estimation algorithm based on the cepstrum.
f) Implementation of techniques for improving the pitch estimation:
⇒ spectrum flattening
⇒ continuity of pitch
⇒ pitch smoothing using a nonlinear median filter
g) Implementation and "tuning" of an LPC vocoder.
h) Comparison between the implemented LPC vocoder and the LPC-10e DoD vocoder (Fortran
code obtained on the Internet).

2- Voice Activity Detection (VAD):

Nowadays, with some communication channels supporting variable bit rates, as in ATM
(Asynchronous Transfer Mode) networks for example, one can save bandwidth by exploiting the fact that,
in a conversation, one of the two channels is inactive most of the time. Thus, VAD algorithms are very
important today.
One very good reference on VAD algorithms is the paper describing the VAD adopted for the
G.729 vocoder [3]. In fact, we started implementing it, but unfortunately there was not enough time.
The algorithm used here is a very simple one, inspired by the end-point detection algorithm
described in [1] for isolated-word speech recognition. The algorithm is documented in the
Appendix. Basically, it makes a decision based on the zero-crossing rate (ZCR)
and the energy of a given frame. The frame is considered valid silence only if the values of both the ZCR
and the energy are below their thresholds; otherwise, the frame is considered valid speech. More
sophisticated algorithms, like the one adopted for the G.729 vocoder, use moving averages in order to
adapt to variations in the background noise. The VAD implemented here uses thresholds that are fixed
during operation, which of course is not a good approach in real situations. The thresholds were obtained
through histograms such as the ones shown in Figures 1 and 2.
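
As an illustration, a minimal MATLAB sketch of this decision rule is given below. The function name and threshold arguments are ours; the actual threshold values are not reproduced here and are assumed to have been read from histograms such as those in Figures 1 and 2.

function isSpeech = vadFrame(frame, energyThr, zcrThr)
% VADFRAME  Sketch of the frame-level speech/silence decision.
% energyThr and zcrThr are fixed thresholds, assumed to have been
% obtained beforehand from histograms of silence-only frames.
  N = length(frame);
  energy = sum(frame .^ 2) / N;           % short-time energy of the frame
  s = double(frame >= 0);                 % sign sequence of the samples
  zcr = sum(abs(diff(s))) / N;            % zero-crossing rate
  % The frame is silence only if BOTH measures are below their thresholds
  isSpeech = ~(energy < energyThr && zcr < zcrThr);
end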

Fig. 1 - Example of histogram of the noise energy for frames of 20 ms.
Fig. 2 - Example of histogram of the ZCR for frames of 20 ms (environment where the A/D converter has
an offset).

3- Pitch Estimation:

Pitch estimation has been a topic of research throughout the evolution of digital speech
processing, and it remains a topic of great interest. A good estimate of the pitch period is crucial for
improving the performance of speech analysis and synthesis systems.
Most of the classical techniques, as described in [1] and [2], are based on:

⇒ Autocorrelation function
⇒ Spectrum
⇒ Cepstrum

Despite the large number of pitch detection algorithms developed so far, the
estimation of the fundamental frequency is still a problem without a definitive solution. Recently, a range
of advanced methods from digital signal processing have been applied. Some of these "modern"
techniques for pitch estimation are [5], [6], [7]:

⇒ Wavelet transforms [6], [7]
⇒ Instantaneous frequency [5]
⇒ Neural networks [6]
⇒ Adaptive filtering [6]

Due to the scope of this work, it was not possible to implement one of these modern techniques.
Instead, we implemented two techniques:

⇒ Autocorrelation of a 3-level signal (implemented in pitch2.m and pitch4.m)
⇒ Cepstrum (implemented in pitch3.m)

which are discussed in the next paragraphs. Both implementations are based on [1] and [2] (mainly on [1]).

Implementation of the pitch estimation based on Autocorrelation:


Initially, the signal is filtered with a linear-phase FIR lowpass filter with cutoff frequency equal to
900 Hz. Then, the samples corresponding to one frame of the signal are clipped to 3 values: -1, 0 and 1.
The clipping threshold is adaptive, given by 68% of the smaller of two values: the greatest absolute
value in the first 10 ms of the frame and the greatest absolute value in the final 10 ms, as suggested in [1].
The minimum and maximum lags were specified as 24 and 160 samples, respectively. For the sampling
frequency of 8 kHz, this corresponds to a pitch range of 50 Hz to 333 Hz. It should be noted that some
speakers can reach pitches around 500 Hz, but it is assumed, for the sake of simplicity, that this vocoder
will not work satisfactorily for such high-pitched voices. The voicing threshold adopted was 30% of the
autocorrelation at lag zero, as suggested in [1].
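
A MATLAB sketch of this estimator follows; it is a simplified version of what pitch2.m does, with a function name of our choosing, and it assumes the frame has already been lowpass filtered.

function lag = pitchAutocorr(frame, fs)
% Sketch of the autocorrelation pitch estimator with 3-level clipping.
% Returns the pitch lag in samples, or 0 if the frame is unvoiced.
  minLag = 24; maxLag = 160;              % 50 Hz to 333 Hz at fs = 8 kHz
  n = round(0.01 * fs);                   % number of samples in 10 ms
  % Adaptive clipping level: 68% of the smaller of the two peak values
  cl = 0.68 * min(max(abs(frame(1:n))), max(abs(frame(end-n+1:end))));
  c = zeros(size(frame));                 % 3-level clipping to -1, 0, +1
  c(frame > cl) = 1;
  c(frame < -cl) = -1;
  R = zeros(1, maxLag + 1);               % autocorrelation, R(k+1) = lag k
  for k = 0:maxLag
      R(k + 1) = sum(c(1:end-k) .* c(1+k:end));
  end
  [peak, idx] = max(R(minLag+1:maxLag+1));
  if peak > 0.3 * R(1)                    % 30% voicing threshold from [1]
      lag = minLag + idx - 1;             % voiced frame: pitch lag in samples
  else
      lag = 0;                            % unvoiced frame
  end
end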
We detected one problem with the first implementation (corresponding to the file pitch2.m). When
the signal has a low-frequency fluctuation, it leads to high values of the autocorrelation function: as
shown in Fig. 3, the autocorrelation exceeds the threshold within the valid lag range [24, 160].
This leads to very high values of the pitch frequency, which produce an annoying synthesized
voice.

Fig. 3 - Example of one problem with the pitch estimation based on the autocorrelation (panels: signal,
clipped signal, and normalized autocorrelation, with the 0.3 threshold and the [24, 160] lag range marked).

In order to circumvent the problem exemplified in Figure 3, we derived a new algorithm, called
pitch4.m, which basically adds the following heuristic: if the minimum value of the autocorrelation function
in the interval of lags [1, 23] is greater than the maximum value in the interval of interest [24, 160] (the
situation exemplified in Figure 3), the frame is considered unvoiced. This small modification leads to
much better speech quality.
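
In terms of the autocorrelation vector R of the previous sketch, the extra test amounts to something like:

% pitch4.m heuristic (sketch): if even the smallest autocorrelation value
% at the short lags [1, 23] exceeds the largest value in the range of
% interest [24, 160], the energy is dominated by a low-frequency
% fluctuation rather than by pitch, so declare the frame unvoiced.
if min(R(2:24)) > max(R(25:161))   % R(k+1) holds lag k
    lag = 0;
end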

Implementation of the pitch estimation based on Cepstrum:


The homomorphic technique [1] is a very useful approach when one tries to separate two signals
combined by a convolution operation (and by multiplication, too), as in the case of an excitation passed
through a filter, the model adopted for speech production.
In this work, the algorithm based on the cepstrum uses a fixed threshold. The only important aspect
to be considered here is perhaps the symmetry of the Fourier transforms involved in the cepstrum
calculation. Suppose an L-sample frame. After taking the DFT, squaring the magnitude and taking the log,
the resulting spectrum is real and even. The inverse DFT of this spectrum yields the cepstrum, and one
should notice that the cepstrum will also be real and even, in order to obey the symmetry properties of the
DFT. In this case, only the first half of the samples (more specifically, the L/2 + 1 samples from 0 to L/2)
are valid for estimating the pitch. So, if one starts with L = 160 (corresponding to 20 ms at a sampling
frequency of 8 kHz) and takes a 160-point DFT, the maximum pitch lag that can be estimated from the
resulting cepstrum is 80. One alternative to circumvent this problem is to use zero-padding so that the
DFT has at least 2 × MaxLag points, where MaxLag is the maximum lag specified for the pitch.
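
A MATLAB sketch of the cepstrum-based estimator, including the zero-padding discussed above, could read as follows (the function name and the fixed threshold argument are ours):

function lag = pitchCepstrum(frame, threshold)
% Sketch of the cepstrum pitch estimator (in the spirit of pitch3.m).
  minLag = 24; maxLag = 160;                 % same lag range as before
  Nfft = 2 * maxLag;                         % zero-padded DFT length (= 2*MaxLag)
  S = log(abs(fft(frame, Nfft)) .^ 2 + eps); % log power spectrum: real and even
  c = real(ifft(S));                         % real cepstrum: also real and even
  % Look for a cepstral peak in the valid quefrency (pitch lag) range
  [peak, idx] = max(c(minLag+1:maxLag+1));   % c(k+1) holds quefrency k
  if peak > threshold                        % fixed threshold, as in the text
      lag = minLag + idx - 1;                % voiced frame
  else
      lag = 0;                               % unvoiced frame
  end
end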

Improvements on the pitch estimation:


The improvements implemented were:

⇒ continuity of pitch
⇒ pitch smoothing using a nonlinear median filter

It should be noted that the greater the number of pitch values considered in the median filter, the
greater the delay of the system.
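
A running median is straightforward to implement; the sketch below is equivalent to medfilt1 from the Signal Processing Toolbox. Note that producing the smoothed value for frame i requires the pitch of frames up to i + (n-1)/2, which is the source of the delay just mentioned.

function y = medianSmooth(x, n)
% MEDIANSMOOTH  Running median of n pitch values (n odd) - sketch.
  h = (n - 1) / 2;                    % half-window: the look-ahead needed
  y = x;
  for i = 1:length(x)
      lo = max(1, i - h);             % window clipped at the signal edges
      hi = min(length(x), i + h);
      y(i) = median(x(lo:hi));        % median of the neighboring pitch values
  end
end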

Comparison of pitch estimation methods:


In order to compare performances, it was necessary to obtain a reference for the pitch. The LPC-
10e DoD vocoder was used to obtain a very smooth pitch contour, which is considered the reference
here. The results shown correspond to the sentence "Paper scares, so write with much
care", obtained from the UCSB Speech Processing Group Web site.
The results shown in Figure 4 were obtained without using any smoothing technique (the median
filter operated on just one pitch value). Clearly, the algorithms did not lead to a satisfactory result.
In order to smooth the pitch contour, we used a median filter. The results in Figures 5, 6 and 7
correspond to median filters with 3, 5 and 7 pitch values, respectively.

Fig. 4 - Results obtained without using any pitch smoothing technique (pitch lag versus frame number;
panels: reference obtained with the LPC-10e DoD vocoder, pitch2 (autocorrelation), pitch3 (cepstrum),
pitch4 (autocorrelation)).

Fig. 5 - Results obtained using a median filter with 3 pitch values (same panels as Fig. 4).

Fig. 6 - Results obtained using a median filter with 5 pitch values (same panels as Fig. 4).

Fig. 7 - Results obtained using a median filter with 7 pitch values (same panels as Fig. 4).

Comparing the result obtained with a median filter of 5 pitch values (Fig. 6) with Fig. 4, it is clear
that the median filter is doing a good job. Using this kind of filter, the synthesized speech achieves a
quality comparable to the one obtained with the LPC-10e (which uses quantization). However, even when
the number of values used in this filter is increased (Fig. 7, for example), one can still see problems with
double and half pitch. Thus, the algorithm needs more intelligence in order to get better results. It should
be noted that the results obtained with the cepstrum were slightly better than the ones obtained with the
autocorrelation. One can also see that the modification applied to the first autocorrelation-based
implementation leads to better results, avoiding the very small pitch lags (very high pitch frequencies)
caused by low-frequency fluctuations of the signal.

4- Implementation of the LPC vocoder:


The implemented LPC vocoder is based on the following steps:

1) the LPC filter and the pitch are calculated using frames of 30 ms
2) the analysis window is shifted by 20 ms
3) the pitch is estimated based on a lowpass (cutoff equal to 900 Hz) version of the speech frame
4) for the LPC analysis, the speech is passed through a pre-emphasis filter (coefficient equal
to 0.9) and multiplied by a Hamming window
5) the order of the LPC analysis is 10; the routine developed in a previous homework was used
6) the speech is synthesized with a delay, in order to allow the smoothing of the pitch value
7) the LPC filter memory of the previous frame is used, and this memory is updated for the
next frame
8) the synthesized speech is passed through a de-emphasis filter

An important aspect is the gain of the excitation sequence. In fact, the LPC analysis gives, as an
"extra" result, the energy of the excitation sequence. However, the signals used as excitation (random
noise and pulse train) do not have the same spectrum as the true excitation, so the matching between the
original speech energy and the energy of the synthesized speech is not so direct. In practice, some
empirical values were adopted so that only a fixed percentage of the gain given by the LPC analysis is
used. These empirical values led to better subjective results.
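
To make the sequence of steps concrete, a condensed MATLAB sketch of the per-frame synthesis is given below. This is an illustration under stated assumptions, not the exact implementation: pitchAutocorr is the sketch from Section 3, lpc (Signal Processing Toolbox) stands in for the routine developed in the previous homework, gFactor is an assumed value for the empirical gain percentage, and the pitch smoothing, VAD and synthesis delay are omitted for brevity.

% Per-frame LPC synthesis sketch; x is the input speech (column vector).
fs = 8000;
frameLen = round(0.03 * fs);            % 30 ms analysis frame
hop = round(0.02 * fs);                 % 20 ms window shift
p = 10;                                 % LPC order
gFactor = 0.7;                          % fraction of the LPC gain (assumed value)
mem = zeros(p, 1);                      % LPC filter memory kept across frames
out = [];
for start = 1:hop:length(x) - frameLen + 1
    frame = x(start:start + frameLen - 1);
    s = filter([1 -0.9], 1, frame) .* hamming(frameLen);  % pre-emphasis + window
    [a, g2] = lpc(s, p);                % LPC filter and prediction-error power
    lag = pitchAutocorr(frame, fs);     % 0 means unvoiced (900 Hz lowpass omitted)
    if lag > 0
        e = zeros(hop, 1);              % pulse train with unit average power
        e(1:lag:end) = sqrt(lag);       % (pulse phase not carried across frames)
    else
        e = randn(hop, 1);              % white noise with unit power
    end
    e = sqrt(g2) * gFactor * e;         % scale the excitation
    [y, mem] = filter(1, a, e, mem);    % synthesis filter, memory carried over
    out = [out; y];
end
out = filter(1, [1 -0.9], out);         % de-emphasis of the synthesized speech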
Figure 8 shows an example of the generation of voiced speech. Figures 9, 10 and 11 compare the
time sequences. Figures 12, 13 and 14 compare the spectrograms. One can see that we should do a better
job in order to obtain the same harmonic content shown by the DoD vocoder. Figure 15 shows how the
VAD works. In this specific case, the sentence did not belong to the training sequence used to estimate
the thresholds of the VAD algorithm. At the beginning and in the middle of the sentence, the VAD correctly
detected silence. However, it wrongly classified the stop due to the plosive "p" as silence. So, as
mentioned before, this algorithm is too simple and is not able to adapt to different background noises.

Fig. 8 - Example of the generation of voiced speech in the implemented vocoder (panels: synthesized
excitation, synthesized speech, original excitation, original speech).

Fig. 9 - Original speech file: "Paper scares, so write with much care" (multiplied by 10).

Fig. 10 - Speech synthesized by the LPC-10e vocoder.

Fig. 11 - Speech synthesized by the vocoder implemented in this work.



Fig. 12 - Spectrogram of the original speech.

Fig. 13 - Spectrogram of the speech synthesized by the LPC-10e vocoder.



Fig. 14 - Spectrogram of the speech synthesized by the vocoder implemented in this work (VAD algorithm disabled).

Fig. 15 - Spectrogram of the speech synthesized by the vocoder implemented in this work.
Example of the VAD algorithm: regions in dark blue correspond to silence (filled with zero values).

5- Conclusions:
The implementation of an LPC vocoder is really an exciting and challenging matter. Many
techniques were learned from the literature and from practice during this work. Looking at the complexity
of the voiced / unvoiced decision in the LPC-10e DoD vocoder [4], it is clear that a good algorithm must
have a lot of intelligence and adaptability in order to get good results.
The main problem is the estimation of the pitch. Secondly, a robust voiced / unvoiced decision is
very important.
It was found that keeping the memory of the LPC filter across frames leads to better results.
The median filter alone was not able to give a smooth pitch contour. Techniques such as avoiding
abrupt changes in the pitch value and avoiding double and half pitches should be incorporated in order to
get better results.

References:

[1] Rabiner, L. and Schafer, R. Digital Processing of Speech Signals. Prentice-Hall.

[2] Kondoz, A. Digital Speech: Coding for Low Bit Rate Communications Systems. Wiley, 1994.

[3] IEEE Communications Magazine, special issue on Standardization and Characterization of
G.729, Sep. 1997.

[4] Campbell, J. and Tremain, T. "Voiced/Unvoiced Classification of Speech with Applications to the
U.S. Government LPC-10E Algorithm." Proceedings of IEEE ICASSP, 1986, pp. 473-476.

[5] Ghaemmaghami, S., Deriche, M. and Boashash, B. "A New Approach to Pitch and Voicing
Detection through Spectrum Periodicity Measurement." Proceedings of IEEE TENCON '97
(Speech and Image Technologies for Computing and Telecommunications), vol. 2, Brisbane,
Australia, Dec. 1997.

[6] Juhar, J. "Advanced Pitch Detection Algorithms." Proceedings of the 3rd International Conference
on Digital Signal Processing (DSP '97), Herl'any, Slovakia, Sept. 1997.

[7] Janer, L., Bonet, J. J. and Lleida-Solano, E. "Pitch Detection and Voiced/Unvoiced Decision
Algorithm Based on Wavelet Transforms." Proceedings of the Fourth International Conference on
Spoken Language Processing (ICSLP '96), vol. 2, Philadelphia, PA, USA, Oct. 1996.
