IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 3, NO. 4, JULY 1995

A Mixed Excitation LPC Vocoder Model for Low Bit Rate Speech Coding

Alan V. McCree, Member, IEEE, and Thomas P. Barnwell III, Fellow, IEEE

Abstract- Traditional pitch-excited linear predictive coding (LPC) vocoders use a fully parametric model to efficiently encode the important information in human speech. These vocoders can produce intelligible speech at low data rates (800-2400 b/s), but they often sound synthetic and generate annoying artifacts such as buzzes, thumps, and tonal noises. These problems increase dramatically if acoustic background noise is present at the speech input.

This paper presents a new mixed excitation LPC vocoder model that preserves the low bit rate of a fully parametric model but adds more free parameters to the excitation signal so that the synthesizer can mimic more characteristics of natural human speech. The new model also eliminates the traditional requirement for a binary voicing decision so that the vocoder performs well even in the presence of acoustic background noise. A 2400-b/s LPC vocoder based on this model has been developed and implemented in simulations and in a real-time system. Formal subjective testing of this coder confirms that it produces natural sounding speech even in a difficult noise environment. In fact, diagnostic acceptability measure (DAM) test scores show that the performance of the 2400-b/s mixed excitation LPC vocoder is close to that of the government standard 4800-b/s CELP coder.

I. INTRODUCTION

THE PROBLEM of representing digital speech signals using as few bits as possible is becoming increasingly important in the modern era of digital communications. One particularly efficient speech coding algorithm, called the linear predictive coding (LPC) vocoder, uses a fully parametric model to mimic human speech. In this approach, only the parameters of a speech model are transmitted across the communication channel, and a synthesizer is used to regenerate speech with the same perceptual characteristics as the input speech waveform. Since periodic update of the model parameters requires fewer bits than direct representation of the speech signal, an LPC vocoder can operate at low bit rates (less than 3000 b/s). Unfortunately, the speech output from LPC vocoders is not acceptable for many applications because it does not always sound like natural human speech, especially in the presence of acoustic background noise.

A block diagram of a typical LPC synthesizer is shown in Fig. 1 [1]-[3]. Either a periodic impulse train or white noise is used to excite an all-pole filter, and the output is scaled to match the level of the input speech. The voicing decision, pitch period, filter coefficients, and gain are updated for every input frame to track changes in the speech signal. The all-pole filter provides an efficient representation of the spectral envelope of the speech signal while capturing the resonances of the vocal tract at the perceptually important formant frequencies.

Since the basic LPC vocoder does not produce high quality speech, there has been significant effort aimed at improving the standard model. Makhoul et al. [4] described an LPC vocoder with a mixture of pulse and noise excitation. They concluded that adding highpass filtered noise to the pulse excitation could reduce the buzzy quality of the synthetic speech, but perhaps at the cost of introducing noisiness. They also described earlier work with mixed excitation performed by Itakura and Saito [2] and by Fujimura [5], who developed a multiband mixed excitation channel vocoder. Kwon and Goldberg [6] suggested a simpler mixed excitation LPC synthesizer with similar capabilities. Kang and Everett [7], [8] presented a number of experimentally derived improvements to the LPC model, again including a form of mixed pulse and noise excitation. Sambur et al. [9] reported some success with the use of a short triangle pulse for the LPC excitation to reduce the synthetic character of the vocoded speech. On the other hand, Wong [10] concluded that using a less peaked excitation pulse derived from natural speech was not very useful, while handpicking individual pitch pulses and performing pitch synchronous LPC analysis did reduce the synthetic quality. Despite these efforts, however, the fundamental problem of producing high quality speech at low bit rates using an LPC vocoder has not been solved.

This paper presents a new mixed excitation LPC vocoder model that represents a significant step toward the ultimate solution of this problem. This model is based on the traditional LPC vocoder with either a periodic impulse train or white noise exciting an all-pole filter, but the model contains four additional features. As shown in Fig. 2, the synthesizer has the following added capabilities:

1) mixed pulse and noise excitation
2) periodic or aperiodic pulses
3) adaptive spectral enhancement
4) pulse dispersion filter.

These features allow the mixed excitation LPC vocoder to mimic more of the characteristics of natural human speech. The next section of this paper describes the new model in detail, including the purpose and design approach for each of these features. Section III presents the design and

Manuscript received August 3, 1993; revised January 19, 1995. The associate editor coordinating the review of this paper and approving it for publication was Dr. Huseyin Abut.

A. V. McCree was with the School of Electrical Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA. He is now with Systems and Information Science Laboratory, Texas Instruments Corporate R&D, Dallas, TX 75265 USA.

T. P. Barnwell III is with the School of Electrical Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA.

IEEE Log Number 9411831.

1063-6676/95$04.00 © 1995 IEEE

Fig. 1. Classical LPC vocoder synthesizer.

Fig. 2. New LPC synthesizer.

implementation of a 2400-b/s LPC vocoder based on this model, Section IV describes the addition of Fourier series modeling, and Section V discusses the results of a formal performance evaluation.

II. THE NEW LPC VOCODER MODEL

A. Mixed Excitation

The most important feature of the model shown in Fig. 2 is the mixed pulse and noise excitation. Since the most annoying aspect of the speech output from the basic LPC vocoder is a strong buzzy quality, LPC vocoders have previously been proposed with mixtures of pulse and noise excitation [4], [6], [8]. Mixed excitations are also commonly used in formant synthesizers [11], [12] and have recently been applied in the context of sinusoidal coding [13], [14].
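For reference, the unmodified pulse/noise synthesis of Fig. 1, which the following features extend, can be sketched in a few lines. The frame interface, state handling, and per-frame RMS scaling below are illustrative choices, not details specified in the paper:

```python
import math
import random

def lpc_synthesize_frame(a, gain, voiced, pitch_period, n_samples, state):
    """One frame of classical pulse/noise LPC synthesis (cf. Fig. 1).

    a: predictor coefficients a[1..p] of the all-pole filter
       H(z) = 1 / (1 - sum_k a_k z^-k).
    state: carries filter memory and pulse phase across frames.
    """
    mem = state.setdefault("mem", [0.0] * len(a))   # all-pole filter memory
    phase = state.setdefault("phase", 0)            # position within pitch period
    out = []
    for _ in range(n_samples):
        if voiced:
            e = 1.0 if phase == 0 else 0.0          # periodic impulse train
            phase = (phase + 1) % pitch_period
        else:
            e = random.gauss(0.0, 1.0)              # white noise excitation
        s = e + sum(ak * m for ak, m in zip(a, mem))
        mem.insert(0, s)                            # update filter memory
        mem.pop()
        out.append(s)
    state["phase"] = phase
    rms = math.sqrt(sum(x * x for x in out) / n_samples)
    scale = gain / rms if rms > 0.0 else 0.0        # scale to the input level
    return [scale * x for x in out]
```

The voicing decision, pitch period, coefficients, and gain would be updated once per frame, as described above.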
We have developed a mixed excitation LPC synthesizer that can generate an excitation signal with different mixtures of pulse and noise in each of a number (4-10) of frequency bands [15]. As shown in Fig. 2, the pulse train and noise sequence are each passed through time-varying spectral shaping filters and then added together to give a fullband excitation. For each frame, the frequency shaping filter coefficients are generated by a weighted sum of fixed bandpass filters. The pulse filter is calculated as the sum of each of the bandpass filters weighted by the voicing strength in that band. The noise filter is generated by a similar weighted sum, with weights set to keep the total pulse and noise power constant in each frequency band. These two frequency shaping filters combine to give a spectrally flat excitation signal with a staircase approximation to any desired noise spectrum. Since only two filters are required, regardless of the number of frequency bands, this structure is more computationally efficient than using a bank of bandpass filters [16].

For ideal bandpass filters, the excitation signal generated by this approach will have a flat power spectrum as long as the sum of the pulse and noise power in each frequency band is kept constant. The important parameters in a practical filter design are the passband and stopband ripple and the amount of pulse distortion. We implement the filter bank with FIR filters, designed by windowing the ideal bandpass filter impulse responses with a Hamming window. This design technique yields linear phase FIR filters with good frequency response characteristics and has the additional benefit of a nice reconstruction property: the sum of all the bandpass filter responses is a digital impulse. Therefore, if all bands are fully voiced, the fullband excitation will be an undistorted pulse. Fig. 3 shows the frequency responses of a nonuniform five-band design.
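The windowed-ideal filter bank and the two shaping filters can be sketched as follows. The band edges, filter length, and the complementary (1 - v) noise weights are illustrative simplifications (the paper keeps pulse plus noise power constant per band); the final lines verify the reconstruction property that the bandpass responses sum to a digital impulse:

```python
import math

def ideal_bandpass(f1, f2, N):
    """Hamming-windowed ideal bandpass FIR; f1, f2 are normalized
    frequencies with 0 <= f1 < f2 <= 0.5, and N is odd."""
    M = (N - 1) // 2
    h = []
    for n in range(N):
        k = n - M
        if k == 0:
            ideal = 2.0 * (f2 - f1)
        else:
            ideal = (math.sin(2 * math.pi * f2 * k)
                     - math.sin(2 * math.pi * f1 * k)) / (math.pi * k)
        w = 0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1))  # Hamming window
        h.append(ideal * w)
    return h

# Five nonuniform bands at an assumed 8-kHz sampling rate (edges illustrative).
edges = [0.0, 500/8000, 1000/8000, 2000/8000, 3000/8000, 0.5]
N = 49  # odd FIR length (the paper's example design is 48th-order)
bands = [ideal_bandpass(lo, hi, N) for lo, hi in zip(edges, edges[1:])]

# Pulse and noise shaping filters: band filters weighted by voicing strengths.
vs = [1.0, 1.0, 0.7, 0.3, 0.0]  # example per-band voicing strengths
pulse_filt = [sum(v * h[n] for v, h in zip(vs, bands)) for n in range(N)]
noise_filt = [sum((1.0 - v) * h[n] for v, h in zip(vs, bands)) for n in range(N)]

# Reconstruction: with all bands fully voiced, the summed response is an impulse.
full = [sum(h[n] for h in bands) for n in range(N)]
```

Because the window is applied band by band and the ideal responses telescope to an allband lowpass, the summed response is exactly a unit impulse at the filter center.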
To make full use of this mixed excitation synthesizer, the
desired mixture spectral shaping must be accurately estimated
for each frame. The relative pulse and noise power in each
frequency band is determined by an estimate of the voicing
strength at that frequency in the input speech. We have
developed an algorithm to estimate these voicing strengths by
combining two methods of analysis of the bandpass filtered


Fig. 3. Example of bandpass filter responses; five-band, 48th-order.

input speech. First, the periodicity in each band is estimated using the strength of normalized correlation coefficients around the pitch lag, where the correlation coefficient at delay t is defined by

c(t) = \frac{\sum_{n=0}^{N-1} s_n s_{n+t}}{\sqrt{\left(\sum_{n=0}^{N-1} s_n^2\right)\left(\sum_{n=0}^{N-1} s_{n+t}^2\right)}}.   (1)

This technique works well for stationary speech, but the correlation values can be too low in regions of varying pitch. The problem is worst at higher frequencies and results in a slightly whispered quality to the synthetic speech. The second method is similar to time-domain analysis of a wideband spectrogram. The envelopes of the bandpass filtered speech are generated by full wave rectification and smoothing with a one-pole lowpass filter, while using a first-order notch filter to remove the dc term from the output. At higher frequencies, these envelopes can be seen to rise and fall with each pitch pulse, just as in the spectrogram display. Autocorrelation analysis of these bandpass filter envelopes yields an estimate of the amount of pitch periodicity. Since the peaks in the envelope signal are quite broad, the small pitch fluctuations often encountered in natural speech have little effect on the correlation values. Fig. 4 shows examples of these waveforms during an interval of changing pitch. The overall voicing strength in each frequency band is chosen as the largest of the correlation of the bandpass filtered input speech and the correlation of the envelope of the bandpass filtered speech.

Fig. 4. Example of speech waveforms for a vowel spoken by a male speaker during a region of changing pitch: (a) input speech; (b) bandpass speech, 0-500 Hz; (c) bandpass speech, 2000-3000 Hz; (d) bandpass envelope, 2000-3000 Hz.

B. Aperiodic Pulses

This mixed excitation can remove the buzzy quality from the LPC speech output, but another distortion is sometimes apparent. This is the presence of short, isolated tones in the synthesized speech, especially for female speakers. The tones can be eliminated by adding noise in the lower frequencies, but so much noise is required that the output speech sounds rough and noisy. A more effective solution is to destroy the periodicity in the voiced excitation by varying each pitch period length with a pulse position jitter uniformly distributed up to ±25%. This allows the synthesizer to mimic the erratic glottal pulses which are often encountered in voicing transitions or in vocal fry [17]. This cannot be done for strongly voiced frames without introducing a hoarse quality, however, so a control algorithm is needed to determine when the jitter should be added.

Therefore, we have added a third voicing state to the voicing decision which is made at the transmitter [18]. The input speech is now classified as either voiced, jittery voiced, or unvoiced. In both voiced states, the synthesizer uses a mixed pulse/noise excitation, but in the jittery voiced state the synthesizer uses aperiodic pulses, as shown in Fig. 2. This makes the problem of voicing detection easier, since strong voicing is defined by periodicity and is easily detected from the strength of the normalized correlation coefficient of the pitch search algorithm. Jittery voicing corresponds to erratic glottal pulses, so it can be detected by either marginal correlation or


peakiness in the input speech. In this work, peakiness p is defined by the ratio of the RMS power to the average value of the full-wave rectified LPC residual signal [19]

p = \frac{\sqrt{\frac{1}{N}\sum_{n=0}^{N-1} e_n^2}}{\frac{1}{N}\sum_{n=0}^{N-1} \lvert e_n \rvert}.   (2)
In vector space terminology, this is the ratio of the L2 norm


to the L1 norm of the LPC residual vector. This peakiness
detector will detect unvoiced plosives as well as jittery voicing,
but this is not a problem since the use of randomly spaced
pulses has previously been suggested to improve the synthesis
for plosives [8]. In fact, misclassification of unvoiced frames
as jittery voiced is usually imperceptible, since the synthesizer
uses aperiodic pulses with a noise mixture. Thus, the jittery
voicing detector can be made very sensitive, for example,
with a threshold of p = 1.8. The carefully controlled use of
aperiodic pulses effectively removes the occasional tones from
the synthetic speech without introducing additional distortion.
This improvement comes at the expense of only one bit per
frame.
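A sketch of the peakiness test and the pulse position jitter follows. The function interfaces are illustrative, while the p = 1.8 example threshold and the ±25% uniform jitter come from the text:

```python
import math
import random

def peakiness(residual):
    """Peakiness p of an LPC residual frame, as in (2): the ratio of the
    RMS value to the mean absolute value (the L2-to-L1 norm ratio)."""
    n = len(residual)
    rms = math.sqrt(sum(x * x for x in residual) / n)
    avg = sum(abs(x) for x in residual) / n
    return rms / avg if avg > 0.0 else 0.0

def is_jittery(residual, threshold=1.8):
    """Sensitive jittery-voicing test using the example threshold p = 1.8."""
    return peakiness(residual) > threshold

def jittered_period(pitch_period, rng):
    """Aperiodic pulses: vary each period by a uniform jitter of up to 25%."""
    return int(round(pitch_period * (1.0 + rng.uniform(-0.25, 0.25))))
```

A single residual spike gives a high L2/L1 ratio, so plosives and erratic glottal pulses both trigger the detector, which is harmless for the reasons given above.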
It is important to note that mixed excitation and aperiodic
pulses are both needed in the LPC synthesizer because they
perform different roles. For example, simply jittering the
pulses without using mixed excitation does not reduce the
buzzy quality, even though it destroys the periodicity of the
synthetic speech. Therefore, we believe that the human ear is
capable of separately detecting both periodicity and peakiness
[16]. Both characteristics should be present for strongly voiced
speech, but each results in distortion if present for all unvoiced
speech. We believe that the buzz associated with LPC vocoders
comes from higher frequency waveform peakiness in unvoiced
or partially voiced speech frames, while tonal noises are
introduced when periodicity is present in speech frames which
should be unvoiced.
C. Adaptive Spectral Enhancement
The third feature in the model shown in Fig. 2 is adaptive
spectral enhancement. This adaptive filter helps the bandpass
filtered synthetic speech to match natural speech waveforms
in the formant regions. Typical formant resonances usually
do not completely decay in the time between pitch pulses in
either natural or synthetic speech, but the synthetic speech
waveforms reach a lower valley between the peaks than natural
speech waveforms do. This is probably caused by the inability
of the poles in the LPC synthesis filter to reproduce the features
of formant resonances in natural human speech. There are two
possible reasons for this problem. One cause could be improper
LPC pole bandwidth; the synthetic time signal may decay too
quickly because the LPC pole has a weaker resonance than
the true formant. Another possible explanation is that the true
formant bandwidth may vary somewhat within the pitch period
[12], and the synthetic speech cannot mimic this behavior.
These problems can be alleviated by systematically varying the synthesis pole bandwidths within each pitch period, as was done by Holmes in his formant synthesizer [12]. This is easily implemented in the LPC vocoder by replacing each z^{-1} term in the z-transform of the LPC filter with αz^{-1}. If α is less than one, this moves the poles of the LPC synthesis filter away from the unit circle in the z-plane and weakens the pole resonances. An α greater than one will sharpen the resonances or even make the filter unstable. Simply sharpening the LPC poles with a fixed α slightly greater than one does improve the waveform match in many places, but it introduces chirpy sounds to the synthetic speech, presumably due to occasional quasi-stable LPC filters. A better approach is to sharpen the LPC filter for the half of the pitch period starting with the pitch pulse, and weaken it for the other half. Depending upon the precise values of α used, this can sharpen the overall response, while avoiding the steady sinusoidal response typical of a quasi-stable filter. It can also provide a better match to natural speech waveforms. Unfortunately, the performance of this pole modulation technique degrades with quantization of the LPC filter coefficients, since the bandwidths of the poles of the quantized LPC filter often vary significantly from frame to frame.

The adaptive spectral enhancement filter provides a simpler solution to the problem of matching formant waveforms. This adaptive pole/zero filter is widely used in CELP coders [20], [21], since it is intended to reduce quantization noise in between the formant frequencies. The poles are generated by a bandwidth expanded version of the LPC synthesis filter, with α equal to 0.8. Since this all-pole filter introduces a disturbing lowpass filtering effect by increasing the spectral tilt, a weaker all-zero filter calculated with α equal to 0.5 is used to decrease the tilt of the overall filter without reducing the formant enhancement. In addition, a simple first-order FIR filter is used to further reduce the lowpass muffling effect.
In the mixed excitation LPC vocoder, reducing quantization
noise is not a concern, but the time-domain properties of
this filter produce an effect similar to pitch-synchronous pole
bandwidth modulation. As shown in Fig. 5, a simple decaying
resonance has a less abrupt time-domain attack when this
enhancement filter is applied. This feature allows the LPC
vocoder speech output to better match the bandpass waveform
properties of natural speech in formant regions, and it increases
the perceived quality of the synthetic speech.
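The enhancement filter described above can be sketched as a direct-form pole/zero filter built from bandwidth-expanded LPC coefficients, H(z) = A(z/0.5)/A(z/0.8), followed by a first-order FIR stage. The tilt coefficient mu below is an assumed illustrative constant, since its value is not given here:

```python
def bandwidth_expand(a, alpha):
    """Scale predictor coefficients a[1..p] by alpha**k, which moves the
    poles of 1/A(z) toward the origin (weaker resonances for alpha < 1)."""
    return [ak * alpha ** (k + 1) for k, ak in enumerate(a)]

def enhance(x, a, alpha_pole=0.8, alpha_zero=0.5, mu=0.5):
    """Adaptive spectral enhancement sketch: all-zero section A(z/0.5),
    all-pole section 1/A(z/0.8), then a first-order FIR (1 - mu z^-1)
    to counter the lowpass muffling effect. mu is illustrative."""
    num = bandwidth_expand(a, alpha_zero)   # all-zero (numerator) coefficients
    den = bandwidth_expand(a, alpha_pole)   # all-pole (denominator) coefficients
    y = [0.0] * len(x)
    for n in range(len(x)):
        acc = x[n]
        for k in range(1, len(a) + 1):
            if n - k >= 0:
                acc -= num[k - 1] * x[n - k]   # A(z/0.5) zeros reduce tilt
                acc += den[k - 1] * y[n - k]   # 1/A(z/0.8) poles enhance formants
        y[n] = acc
    return [y[n] - mu * y[n - 1] if n > 0 else y[n] for n in range(len(y))]
```

In a complete synthesizer this filter would be applied to the excitation before the LPC synthesis filter, with its coefficients updated from the same LPC analysis.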
D. Pulse Dispersion Filter
The pulse dispersion filter shown in Fig. 2 improves the
match of bandpass filtered synthetic and natural speech waveforms in frequency bands which do not contain a formant
resonance. At these frequencies, the synthetic speech often
decays to a very small value between the pitch pulses. This
is also true for frequencies near the higher formants, since
these resonances decay significantly between excitation points,
especially for the longer pitch periods of male speakers.
In these cases, the bandpass filtered natural speech has a
smaller peak-to-valley ratio than the synthetic speech. In
natural speech, the excitation may not all be concentrated at the
point in time corresponding to closure of the glottis [22]. This
additional excitation prevents the natural bandpass envelope
from falling as low as the synthetic version. This could be due
to a secondary excitation peak from the opening of the glottis,
aspiration noise resulting from incomplete glottal closure, or


Fig. 5. Natural speech versus decaying resonance waveforms: (a) first formant of natural speech vowel; (b) synthetic exponentially decaying resonance; (c) pole/zero enhancement filter impulse response for this resonance; (d) enhanced decaying resonance.

a small amount of acoustic background noise which is visible


in between the excitation peaks. In all of these cases, there
is a greater difference between the peak and valley levels of
the bandpass filtered waveform envelopes for the LPC speech
than for natural human speech.
The pulse dispersion filter is a fixed FIR filter, based on
a spectrally flattened synthetic glottal pulse, which introduces
time-domain spread to the synthetic speech. We use a fixed
triangle pulse [9], [23] based on a typical male pitch period,
but first remove the lowpass character from its frequency
response. The filter coefficients are generated by taking a
DFT of the triangle pulse, setting the magnitudes to unity,
and taking the inverse DFT. The dispersion filter is applied to
the entire synthetic speech signal to avoid introducing delay
to the excitation signal prior to synthesis. Fig. 6 shows some
properties of this triangle pulse and the resulting dispersion

Fig. 6. Synthetic triangle pulse and FIR filter: (a) triangle waveform; (b) filter coefficients after spectral flattening with length-65 DFT; (c) Fourier transform (DTFT) after spectral flattening.

filter. The pulse has considerable time-domain spread, as


well as some fine detail in its Fourier magnitude spectrum.
Using this dispersion filter decreases the synthetic bandpass
waveform peakiness in frequencies away from the formants
and results in more natural sounding LPC speech output.
However, pulse dispersion gives only a minor decrease in
peakiness at the high frequencies, so the mixed excitation is
still needed to remove the buzz by using noise instead of pulse
excitation at the high frequencies when necessary.
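The construction of the dispersion filter (DFT of the pulse, magnitudes set to unity, inverse DFT) can be sketched as follows. The length-65 DFT follows Fig. 6, but the asymmetric 1/3-2/3 rise/fall shape of the triangle is an assumption; a perfectly symmetric pulse has linear phase and would flatten to a plain delayed impulse, giving no dispersion:

```python
import cmath

def dispersion_filter(pulse):
    """Spectrally flatten a pulse: DFT, set magnitudes to one while keeping
    the phase, inverse DFT. The result is an FIR filter with unit magnitude
    response whose coefficients spread the pulse in time."""
    N = len(pulse)
    X = [sum(pulse[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
         for k in range(N)]
    X = [x / abs(x) if abs(x) > 1e-12 else 1.0 for x in X]  # unit magnitudes
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N)
                for k in range(N)).real / N
            for n in range(N)]

# Asymmetric triangle pulse on a 65-sample period (shape is illustrative).
P = 65
peak = P // 3
tri = [n / peak if n <= peak else (P - 1 - n) / (P - 1 - peak) for n in range(P)]
h = dispersion_filter(tri)
```

Because every DFT bin has unit magnitude, the filter passes all frequencies at equal power (its coefficient energy is one by Parseval's relation), so it redistributes the excitation in time without coloring the spectrum.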

III. A 2400-b/s MIXED EXCITATION LPC VOCODER
The addition of these four features to the LPC vocoder
model provides a significant increase in speech quality at
the expense of only a few extra bits per frame. This mixed
excitation LPC vocoder produces speech which is free of
major artifacts and more natural in character. In addition,
the coder maintains good performance even in the presence
of acoustic background noise. Since these preliminary results




TABLE I
2400-b/s MIXED EXCITATION LPC VOCODER BIT ALLOCATION

LPC coefficients (10 LSPs)    34
gain (2 per frame)             8
pitch and overall voicing      7
bandpass voicing               4
aperiodic flag                 1

TOTAL: 54 bits / 22.5 msec = 2400 bps

for the new vocoder model were so encouraging, we have


implemented a 2400-b/s LPC vocoder based on this model for
formal evaluation of performance [24]. This section describes
the design and implementation of this vocoder.
A. Design

The 2400-b/s speech coder based on this mixed excitation


LPC vocoder model uses the bit allocation shown in Table
I. The mixed excitation requires separate binary voicing decisions in each of five frequency bands, and the aperiodic pulses
require a single bit to distinguish the voiced and jittery voiced
states. Since the lowest frequency band is given the overall
voicing decision implicit in the pitch code, this requires only
a total of five extra bits for better parameterization of the
excitation signal. Neither the triangle pulse dispersion nor the
adaptive spectral enhancement require any new information at
the receiver, so their improvements come at no expense in bit
rate.
The analysis portion of the 2400-b/s vocoder algorithm is
similar to a traditional LPC vocoder. The LPC coefficients are
determined by the autocorrelation technique with a Hamming
window of length 25 ms and coded using scalar quantization
of the LSP parameters [21]. The pitch is estimated from a
search of the normalized correlation coefficients of the lowpass
filtered LPC residual signal, with an explicit check for possible
pitch doubling. The sixth-order pole/zero lowpass filter allows
only the frequencies below about 1200 Hz to be used in the
pitch search, since they show stronger periodicity when the
pitch is changing. The LPC inverse filter removes the formant
structure to prevent harmonic tracking in the pitch search. The
normalized correlation coefficients, defined in (1), give reliable
estimates even during energy transitions, with only a small
increase in computation. To improve performance in noise, a
second pitch search is performed on the lowpass filtered input
speech if the first search does not yield significant correlation.
The input speech usually has a higher signal-to-noise ratio
(SNR) in a noisy environment since the LPC inverse filter
reduces the speech energy in the formant regions. Finally,
gross pitch errors are corrected using one past and one future
frame. The overall voicing decision is based primarily on
the strength of the pitch periodicity. Strong pitch correlation
results in classification as strongly voiced, and a frame with
marginal pitch correlation or high peakiness in the residual
signal is classified as jittery voiced so that aperiodic pulses
will be used. An unvoiced frame is declared if none of these conditions are met. The speech gain is estimated twice per frame using the RMS power level of the input speech.
For voiced frames, the estimation window is adjusted to be
a multiple of the pitch period. Both gains are uniformly
quantized on a logarithmic scale, but the second gain is coded
with five bits while the first gain uses three bits to span the
range from twice the maximum of its neighbors to one half
of the minimum of its neighbors. The mixed excitation in the
lowest frequency band is based on the overall voicing state,
while the binary voicing decisions in each of the higher four
frequency bands come from normalized correlation coefficients
calculated on the bandpass filtered speech and envelopes, as
described previously.
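The two-method bandpass voicing strength estimate used here can be sketched as follows. The envelope smoothing constant beta is illustrative, and the mean subtraction stands in for the first-order notch filter described in Section II:

```python
import math

def norm_corr(x, lag):
    """Normalized correlation of x at the given lag, as in (1)."""
    n = len(x) - lag
    num = sum(x[i] * x[i + lag] for i in range(n))
    e0 = sum(x[i] * x[i] for i in range(n))
    e1 = sum(x[i + lag] * x[i + lag] for i in range(n))
    d = math.sqrt(e0 * e1)
    return num / d if d > 0.0 else 0.0

def envelope(x, beta=0.95):
    """Full-wave rectify and smooth with a one-pole lowpass filter, then
    remove the dc term. beta and the mean-based dc removal are
    simplifications of the paper's one-pole/notch filter pair."""
    env, s = [], 0.0
    for v in x:
        s = (1.0 - beta) * abs(v) + beta * s
        env.append(s)
    m = sum(env) / len(env)
    return [e - m for e in env]

def band_voicing_strength(band_signal, pitch_lag):
    """Voicing strength for one band: the larger of the signal correlation
    and the envelope correlation at the pitch lag."""
    return max(norm_corr(band_signal, pitch_lag),
               norm_corr(envelope(band_signal), pitch_lag))
```

Comparing each band's strength against a threshold would then give the binary bandpass voicing decisions transmitted for the four upper bands.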
The synthesizer section of the vocoder is shown in Fig. 2.
The mixed excitation is generated as the frequency weighted
sum of a pulse train and white noise. This excitation is
then filtered by the adaptive spectral enhancement and LPC
synthesis filters and scaled to the correct power level. This
signal is filtered by the fixed triangle pulse dispersion to
give the output synthetic speech. The pitch period, bandpass
voicing strengths, LPC and adaptive spectral enhancement
filter coefficients, and gain are all interpolated with each pitch
period.
B. Implementation
This 2400-b/s mixed excitation LPC vocoder is implemented in the C language and runs on a SUN workstation. In addition, we have developed a real-time version on
a floating-point DSP processor (33-MHz Texas Instruments
TMS320C30) by inserting assembly code for time-critical
routines. The analysis and synthesis routines each use about
60% of one DSP processor; therefore, the coder is capable
of half-duplex operation in real-time. We expect that the full
vocoder could be made to run in real-time on a single DSP
chip either by optimizing the assembly code or by utilizing
a faster DSP.
IV. FOURIER SERIES EXPANSION
In addition to the 2400-b/s vocoder described above, we
have also developed a 4800-b/s version using a Fourier series
expansion of the voiced excitation signal. This vocoder is
still based on the mixed excitation LPC vocoder model with
the additional features described previously, but the method
of generating the voiced excitation is more flexible. This
capability can be used for additional experimentation into the
perceptual importance of various aspects of the LPC excitation,
as was done in [25], or for the development of a higher bit
rate speech coder.
Implementing a Fourier series expansion in LPC synthesis is
straightforward. Instead of using a simple time-domain digital
impulse, each pitch pulse of the mixed excitation signal, shown
in Fig. 2, is generated by an inverse DFT of exactly one period
in length. If the magnitude values of this DFT are all equal
to one and the phases are all set to zero, this is simply an
alternate representation for a digital impulse. However, by
systematically varying these values, this synthesizer can be


248

E E E TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 3, NO. 4, JULY 1995

TABLE I1
used to evaluate various Fourier series magnitude or phase
CLEANINPUT
DAM TESTSCORES
distortions, and it can also be used to reproduce transmitted
estimates of the magnitude or phase.
6 Speaker Male Female
Speech Coder
An interesting application of the Fourier series expansion
is to reproduce the magnitude or phase of selected harmonics
53.2
54.0
54.7
2400 bps DoD LPC-lOe
of the pitch fundamental. This requires the difficult task of
57.7
58.9
59.8
2400 bps ME LPC
accurately estimating these values. Early versions of this
61.5
62.6
63.5
4800 bps DoD CELP
speech coder used a Fourier series analysis of the LPC residual
61.3
61.6
61.6
4800 bps ME LPC
signal. This can give good estimates of both magnitude and
phase, but it depends on accurately locating individual pitch
pulses. A more reliable analysis technique is to perform a
DFT of an entire frame of the LPC residual signal, and to
estimate the Fourier series magnitudes by the largest DFT
magnitudes within frequency bands corresponding to each
pitch harmonic. This approach, which is similar to the analysis
used in sinusoidal speech coding [13], provides accurate and
2400 bps ME LPC
consistent estimates of the Fourier magnitudes since it requires
no timing alignment. Unfortunately, it does not measure the
4800 bps DoD CELP
phase of the excitation signal relative to the pitch pulse since
4800 bps ME LPC
it contains a large unknown linear phase term.
To evaluate the capabilities of Fourier series modeling, a 4800-b/s speech coder has been developed. This coder is based on the 2400-b/s mixed excitation LPC vocoder with the four features described in the previous section but also includes coding of the Fourier series magnitudes. The magnitudes are estimated with a 512-point fast Fourier transform (FFT) of each frame of the LPC residual signal generated by inverse filtering the input speech with the quantized LPC filter. To obtain a 4800-b/s bit rate, the magnitudes of the first 18 harmonics are divided by the average over all the harmonics, and the logarithms of these normalized magnitudes are uniformly quantized and coded with three bits each. The remaining harmonic magnitudes are synthesized with a fixed value of one, and all the phases are set to zero to align the harmonics into a single pulse per pitch period.

V. PERFORMANCE EVALUATION

The 2400- and 4800-b/s mixed excitation LPC vocoders have undergone both informal and formal listening tests. Informal listening on a database of about 20 speakers shows that these coders can produce high quality speech for both male and female speakers. In addition, the coders maintain good performance in acoustic background noise. In a synthetic white noise environment, the mixed excitation produces natural sounding speech without obvious artifacts such as buzz or thumps. In standard military communications environments such as airplanes, tanks, and helicopters, the new coders still produce natural sounding speech, although the noise itself sounds somewhat distorted. The 4800-b/s coder performs better than the 2400-b/s version due to the additional spectral information from the Fourier series magnitudes. This appears to provide improvement in producing nasals, in reproducing the identity of a particular speaker, and in the quality of vowels in broad-band acoustic background noise. These are all cases where the all-pole assumption inherent in the LPC model may not be accurate, and presumably the better representation of the Fourier spectrum can compensate for this mismatch.

To determine how well this mixed excitation vocoder performs in comparison to existing speech coders, formal subjective listening tests have been conducted. Since the goal of this work is to produce natural sounding synthetic speech, the overall user acceptability of the processed speech has been measured with the diagnostic acceptability measure (DAM) [26] as performed by Dynastat. This test is widely used to evaluate low bit rate speech coders, so there is a substantial literature of DAM scores for various speech coding algorithms [21], [27]. In addition to the 2400-b/s mixed excitation LPC and 4800-b/s Fourier series mixed excitation LPC vocoders described previously, two U.S. government standard speech coders were also included in the DAM testing: 2400-b/s DoD LPC-10e v.55 and 4800-b/s DoD CELP release 3.2 [21]. LPC-10e is a significantly enhanced version of the classical LPC vocoder standard LPC-10, and it includes some of the features developed by Kang and Everett [7], [8]. This vocoder has previously been reported to score a 54 on the DAM test, which represents a significant improvement over the earlier score of 47 for LPC-10 [21], [27]. The tests were run on a speech database consisting of 12 sentences from each of three male speakers and three female speakers, and all processing was done on a SUN workstation. Additional testing was done with synthetic white noise added to the same speech input. The noise was generated by a Gaussian random number generator, and the SNR over the six-speaker database was about 8 dB.

The DAM test results for both clean and noisy speech are shown in Tables II and III. The clean speech DAM scores confirm that the 2400-b/s mixed excitation LPC vocoder produces speech which is significantly better than the current standard LPC-10e vocoder. In fact, the DAM score for the mixed excitation LPC is closer to the higher rate 4800-b/s CELP standard than to LPC-10e. For the noisy speech, all the scores are low due to the annoying amount of background noise, but the speech can still be clearly understood. In this difficult environment, the new coder is clearly superior to LPC-10e, and even performs slightly better than the higher rate standard. In both cases, the additional Fourier series magnitude information in the 4800-b/s mixed excitation coder provides a further improvement of about 3 DAM points. Since the Fourier magnitudes require a significant increase in both computation and bit rate, the 4800-b/s coder may not be as useful as the 2400-b/s version. However, its performance does serve as a target for possible future improvements to the parametric model.
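The Fourier magnitude coding used in the 4800-b/s coder can be sketched roughly as follows. All identifiers and the quantizer range are assumptions for illustration; the paper specifies only normalization by the average magnitude, logarithms, three-bit uniform quantization, unity magnitudes for the uncoded harmonics, and zero phase:

```python
import numpy as np

def quantize_fourier_magnitudes(mags, n_coded=18, bits=3, log_range=2.0):
    """Normalize the first n_coded harmonic magnitudes by the average over
    all harmonics, then uniformly quantize their logarithms with `bits`
    bits each. The quantizer range (log_range) is an assumed parameter."""
    mags = np.asarray(mags, dtype=float)
    normalized = mags[:n_coded] / mags.mean()
    log_mags = np.log10(np.maximum(normalized, 1e-6))
    levels = 2 ** bits
    step = 2.0 * log_range / levels
    idx = np.clip(np.round((log_mags + log_range) / step), 0, levels - 1)
    return idx.astype(int), 10.0 ** (idx * step - log_range)

def synthesize_excitation_pulse(decoded_mags, pitch_period):
    """Zero-phase synthesis: harmonics beyond those coded are fixed at one,
    and all phases are zero, so the harmonics align into a single pulse."""
    n_harmonics = pitch_period // 2
    amps = np.ones(n_harmonics)
    amps[:len(decoded_mags)] = decoded_mags
    n = np.arange(pitch_period)
    pulse = np.zeros(pitch_period)
    for k, a in enumerate(amps, start=1):
        pulse += a * np.cos(2 * np.pi * k * n / pitch_period)
    return pulse
```

Setting every phase to zero makes all the cosine harmonics peak at the same sample, which is what concentrates the excitation energy into one pulse per pitch period.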

VI. CONCLUSION
We have presented a mixed excitation LPC vocoder model
that can produce high quality speech at low bit rates. The
model maintains the efficiency of a fully parametric LPC
vocoder model, but it adds more free parameters to the excitation signal so the synthesizer can mimic more characteristics of
natural human speech. In addition, the requirement for a single
binary voicing decision is eliminated, so the vocoder performs
well even in the presence of severe acoustic background noise.
This mixed excitation model is based on the traditional LPC
vocoder with either a periodic impulse train or white noise
exciting an all-pole filter but contains four additional features:
mixed pulse and noise excitation, periodic or aperiodic pulses,
adaptive spectral enhancement, and a pulse dispersion filter.
Each of these capabilities is intended to remove a particular
distortion from the synthetic speech. The mixed excitation
eliminates the buzzy quality usually associated with LPC
vocoders by allowing frequency-dependent voicing strength.
A separate aperiodic voiced state is added so the synthesizer can reproduce erratic glottal pulses without introducing
tonal noises. Adaptive spectral enhancement sharpens the formant resonances and improves the bandpass filtered waveform
match between synthetic and natural speech in the frequency
bands including formant frequencies. The pulse dispersion
filter allows the LPC synthesizer to better match waveforms
away from the formant regions by introducing time-domain
spread to the excitation signal.
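As a rough sketch of the frequency-dependent pulse/noise mixing underlying this model, the excitation can be formed by weighting a pulse train and white noise in each frequency band by that band's voicing strength. The band edges, windowed-sinc filter design, and voicing strengths below are illustrative assumptions, not the paper's actual filterbank:

```python
import numpy as np

def bandpass_fir(low, high, n_taps=65):
    """Windowed-sinc bandpass FIR; band edges are normalized (1.0 = Nyquist)."""
    n = np.arange(n_taps) - (n_taps - 1) / 2
    def lowpass(cut):
        return np.sinc(cut * n) * cut * np.hamming(n_taps)
    return lowpass(high) - lowpass(low)

def mixed_excitation(pitch_period, n_samples, voicing, bands):
    """Mix a pulse train and white noise per band: each band contributes the
    pulse train scaled by its voicing strength plus noise scaled by the
    complementary weight. Band edges and strengths are illustrative."""
    pulses = np.zeros(n_samples)
    pulses[::pitch_period] = 1.0
    noise = np.random.randn(n_samples) * 0.3
    out = np.zeros(n_samples)
    for (lo, hi), v in zip(bands, voicing):
        h = bandpass_fir(lo, hi)
        out += v * np.convolve(pulses, h, mode="same")
        out += (1.0 - v) * np.convolve(noise, h, mode="same")
    return out
```

With voicing strengths near one in every band the output approaches a pure pulse train; lowering the strength in a high band replaces the periodic energy there with noise, which is the mechanism that removes the buzzy quality.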
To verify the performance of the mixed excitation LPC vocoder model, we have developed and implemented a 2400-b/s LPC vocoder. This speech coder is machine portable since it is written in the C language, but it runs in real time on a special hardware platform using a fast DSP microprocessor.
Informal and formal subjective testing of this coder has shown
that it performs significantly better than the current state of
the art at such a low bit rate. Additional testing on a 4800-b/s
vocoder, which also includes Fourier magnitudes of the true
excitation signal, demonstrates the performance improvement
possible with more accurate spectral representation in the
parametric model.
This work could be extended in a number of ways. The
mixed excitation LPC vocoder model could be further improved, perhaps by better utilizing the available Fourier series
information. The design of the 2400-b/s LPC vocoder could be
tailored for specific applications by addressing issues such as
computational efficiency, channel errors, and input microphone
response. Finally, the mixed excitation LPC model could be
applied to other problems such as very low bit rate speech
coding (800-1200 b/s), speech synthesis, pitch scaling, or time
scale modification.


REFERENCES
[1] B. S. Atal and S. L. Hanauer, "Speech analysis and synthesis by linear prediction of the speech wave," J. Acoust. Soc. Amer., vol. 50, pp. 637-655, Aug. 1971.
[2] F. Itakura and S. Saito, "Analysis synthesis telephony based on the maximum likelihood method," in Proc. Rep. 6th Int. Congr. Acoust., Aug. 1968, pp. C17-C20.
[3] T. E. Tremain, "The government standard linear predictive coding algorithm: LPC-10," Speech Technol., pp. 40-49, Apr. 1982.
[4] J. Makhoul, R. Viswanathan, R. Schwartz, and A. W. F. Huggins, "A mixed-source model for speech compression and synthesis," J. Acoust. Soc. Amer., vol. 64, pp. 1577-1581, Dec. 1978.
[5] O. Fujimura, "An approximation to voice aperiodicity," IEEE Trans. Audio Electroacoust., vol. AE-16, pp. 68-72, Mar. 1968.
[6] S. Y. Kwon and A. J. Goldberg, "An enhanced LPC vocoder with no voiced/unvoiced switch," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-32, pp. 851-858, Aug. 1984.
[7] G. S. Kang and S. S. Everett, "Improvement of the narrowband LPC analysis," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Boston, 1983, pp. 89-92.
[8] ——, "Improvement of the excitation source in the narrowband linear prediction vocoder," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-33, pp. 377-386, Apr. 1985.
[9] M. R. Sambur, A. E. Rosenberg, L. R. Rabiner, and C. A. McGonegal, "On reducing the buzz in LPC synthesis," J. Acoust. Soc. Amer., vol. 63, pp. 918-924, Mar. 1978.
[10] D. Y. Wong, "On understanding the quality problems of LPC speech," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1980, pp. 725-728.
[11] D. H. Klatt, "Review of text-to-speech conversion for English," J. Acoust. Soc. Amer., vol. 82, pp. 737-793, Sept. 1987.
[12] J. N. Holmes, "The influence of glottal waveform on the naturalness of speech from a parallel formant synthesizer," IEEE Trans. Audio Electroacoust., vol. AE-21, pp. 298-305, June 1973.
[13] R. McAulay, T. Parks, T. Quatieri, and M. Sabin, "Sine-wave amplitude coding at low data rates," in Advances in Speech Coding. Norwell, MA: Kluwer, 1991, pp. 203-214.
[14] M. Brandstein, J. Hardwick, and J. Lim, "The multiband excitation speech coder," in Advances in Speech Coding. Norwell, MA: Kluwer, 1991, pp. 215-224.
[15] A. V. McCree and T. P. Barnwell III, "Improving the performance of a mixed excitation LPC vocoder in acoustic noise," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, San Francisco, 1992, pp. II-137-II-140.
[16] A. V. McCree, "A new LPC vocoder model for low bit rate speech coding," Ph.D. thesis, Georgia Inst. Technol., Atlanta, GA, Aug. 1992.
[17] W. Hess, Pitch Determination of Speech Signals. New York: Springer, 1983.
[18] A. V. McCree and T. P. Barnwell III, "A new mixed excitation LPC vocoder," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Toronto, 1991, pp. 593-596.
[19] D. L. Thomson and D. P. Prezas, "Selective modeling of the LPC residual during unvoiced frames: White noise or pulse excitation," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Tokyo, 1986, pp. 3087-3090.
[20] J. H. Chen and A. Gersho, "Real-time vector APC speech coding at 4800 bps with adaptive postfiltering," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Dallas, 1987, pp. 2185-2188.
[21] J. P. Campbell, Jr., T. E. Tremain, and V. C. Welch, "The DoD 4.8 kbps standard (proposed federal standard 1016)," in Advances in Speech Coding. Norwell, MA: Kluwer, 1991, pp. 121-133.
[22] J. N. Holmes, "Formant excitation before and after glottal closure," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1976, pp. 39-42.
[23] A. E. Rosenberg, "Effect of glottal pulse shape on the quality of natural vowels," J. Acoust. Soc. Amer., vol. 49, pp. 583-590, 1971.
[24] A. V. McCree and T. P. Barnwell III, "Implementation and evaluation of a 2400 bps mixed excitation LPC vocoder," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Minneapolis, 1993, pp. II-159-II-162.
[25] B. S. Atal and N. David, "On synthesizing natural-sounding speech by linear prediction," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1979, pp. 44-47.
[26] W. D. Voiers, "Diagnostic acceptability measure for speech communications systems," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1977, pp. 204-207.
[27] C. Smith, "Relating the performance of speech processors to the bit error rate," Speech Technol., pp. 41-53, Sept. 1983.

Alan V. McCree (M'85-S'91-M'92) was born in Lower Hutt, New Zealand, on April 20, 1960. He received the B.S. degree in electrical engineering (magna cum laude) in 1981 and the M.E.E. degree in electrical engineering in 1982, both from Rice University, Houston, TX, and the Ph.D. degree in electrical engineering in 1992 from the Georgia Institute of Technology, Atlanta.
From 1982 to 1984, he worked as a Geophysicist for Shell Oil, Houston, TX, interpreting electromagnetic data. From 1984 to 1988, he worked as a Senior Engineer in the Voice Products group at M/A-COM Linkabit, San Diego, CA. From 1992 to 1993, he was a consultant to AT&T Bell Laboratories in speech coding. Since 1993, he has been a Member of Technical Staff at Texas Instruments Corporate R&D, Dallas, TX. His primary research interest is in speech coding.

Thomas P. Barnwell III (M'76-SM'84-F'88) received the B.S. degree in 1965, the M.S. degree in 1967, and the Ph.D. degree in 1970, all from the Massachusetts Institute of Technology (M.I.T.), Cambridge, MA.
Since 1971, he has been with the School of Electrical Engineering, Georgia Institute of Technology, Atlanta, GA, where he has developed courses in speech therapy and digital systems and has developed computer aided instruction support facilities for undergraduate laboratories and for networking and distributed processing. He has also been principal investigator on numerous research contracts and grants in the areas of speech coding and analysis, objective quality measures for speech, multiprocessor architectures for digital signal processing, and computer networking and distributed processing. His current research interests are in three primary areas: speech analysis, synthesis, and coding; digital architectures for signal processing; and computer networks for distributed processing.
Dr. Barnwell received a National Institutes of Health Fellowship and a National Science Foundation Fellowship while at M.I.T.
