A Survey of Speech Bandwidth Compression Techniques

104 IRE TRANSACTIONS ON AUDIO September-October
S. J. CAMPANELLA†
Summary-Application of speech bandwidth compression techniques to voice communications provides the promise of more efficient utilization of the available radio spectrum and the possibility of improved performance of noisy, long distance communications links. That significant potential gain exists is evident from the fact that the information rate required for transmission of the conventional speech signal is approximately 24,000 bits per second, whereas that for transmission of the equivalent word-intelligence content by means of teletype is bits per second. The paper presents brief descriptions of several speech bandwidth compression techniques which are currently being employed or investigated to achieve varying degrees of compression. Also a method of estimating the influence of signal-to-noise ratio on a communications link employing such compression techniques is presented.

Manuscript received by the PGA, September 2, 1958; revised manuscript received October 27, 1958. This paper was presented at the IRE WESCON Convention, Los Angeles, Calif., August 20, 1958.
† Melpar, Inc., Falls Church, Va.

INTRODUCTION

For the purpose of conveying information from one human being to another, speech communication is the vehicle most readily accepted and rapidly comprehended. The human speech communication system may be considered to consist of the brain of the speaker as the source of information, the vocal mechanism of the speaker as the transmitter, the hearing mechanism of the listener as the sensory device, and the brain of the listener as the receiver of information. Throughout his evolution, man has naturally adapted himself to this communication system, using the speech signal as the medium of communication.

Unfortunately, from the point of view of the communication engineer, nature has not provided an efficient system. This fact is made evident when the information rate required to transmit the speech signal of 3000 cps bandwidth (24,000 to 50,000 bits per second) is compared with that theoretically required to transmit the basic information content of the speech signal (approximately 60 bits per second).

This inefficiency results in uneconomical use of channel capacity and radio spectrum space. The desire for more efficient use of our communication facilities for speech, coupled with the advantages of low information rate digital transmission for purposes of both privacy and improved reliability, has led to the investigation of several speech bandwidth compression systems in this country and abroad.

CHARACTERISTICS OF THE SPEECH SIGNAL SPECTRUM

It is the idea-forming element of the brain acting on the vocal mechanism that gives rise to the speech signal. A simplified sketch of the vocal mechanism is shown in Fig. 1. It consists of a tube terminated at one end by the larynx and at the other end by the lips. Acoustic excitation may be applied either in the form of a periodic pressure wave generated at the larynx, or by turbulence generated at some point of constriction along the tube. Sounds produced by the periodic larynx excitation are usually referred to as being voiced, and sounds produced by turbulent excitation as unvoiced.

Fig. 1-The vocal mechanism.

The principal element entering into vocal tract variation is the tongue. It divides the vocal tract into two resonant cavities which to a large extent control the transmission characteristic of the vocal tract. The transmission characteristic can also be modified by coupling into the nasal cavity controlled by the velum. The influence of the vocal cavity on voiced excitation is illustrated in Fig. 2. The vocal cavity transmission characteristic acting on the larynx signal of Fig. 2(a) causes certain frequency components to pass with less attenuation than others. A simple resonance transmission characteristic would produce a damped sinusoid, as shown in Fig. 2(b). For the more complicated transmission characteristic of the human vocal tract, the result is best illustrated by the spectral energy distribution shown in Fig. 2(c). The line structure appearing in the spectrum is produced by the harmonics of the periodic larynx excitation. The frequency of larynx vibration is generally referred to as the pitch frequency. The peaks of energy observed in the spectral distribution, resulting from the influence of the transmission characteristic of the vocal cavity, are referred to as formants. The formants are designated in their ascending order of appearance as F1, F2, etc. In general, statistical studies have indicated that two formants suffice for specification of most of the voiced English vowels and consonants.1 Similar formant structure is also observed for the unvoiced sounds of speech. Most of the significant information content of speech falls in the frequency range below 3000 cps.

Fig. 2-Typical speech waveforms and spectral distribution. (a) Pressure waveshape at larynx. (b) Pressure waveshape at lips. (c) Spectral distribution of speech signal energy.

BANDWIDTH COMPRESSION SCHEMES

In conventional telephony, speech is transmitted by the more or less exact replica of the complicated acoustic waveforms that represent the sounds. In fact, if speech is regarded as a function of time, the process consists in principle of a mere change of the dependent variable from sound pressure to some electrical quantity at the sending end and vice-versa at the receiving end. It is possible to estimate the rate of information transmission for the speech signal by considering the normal telephone channel. The conventional telephone channel has approximately 3000 cps bandwidth and performs well at a signal-to-noise ratio of 20 db. Employing Shannon's relation for channel capacity,2

$$C = 2W \log_2 \left(1 + \frac{S}{N}\right)$$

where $W$ is the bandwidth of the channel and $S/N$ is the signal-to-noise ratio, the capacity of such a channel is found to be equal to $6000 \log_2 (1 + 100) \approx 40{,}000$ bits per second. Marginal performance might be achieved at a signal-to-noise ratio of 10 db, this giving a channel capacity equal to $6000 \log_2 (1 + 10) \approx 20{,}000$ bits per second.

Consider the rate of speech information transmission now from the point of view of its source. In speech, a human utilizes approximately 40 basic sounds and produces these at an average rate of 10 per second. Using a six bit code, it is possible to uniquely specify 64 sounds. Hence, the rate of information transmission of the speech channel is seen to be less than 60 bits per second.

So, there is a considerable difference between the information rate necessary to communicate the speech signal waveform and the actual rate at which information appears to be generated by the vocal mechanism. Several factors may be offered in explanation of the fact. It is rather obvious that in addition to the information contained in the principal articulatory positions of the vocal mechanism, each individual speaker also imposes information concerning his vocal identity and emotional status. This information is transmitted continuously, since at any instant in a conversation one can usually identify the speaker from the "sound" of his voice. Thus, not only does the speech signal identify the speaker initially, but it continues to do so throughout the event. This points out the fact that the speech signal is redundant. Speaker identity is not the only form of redundancy present in speech. Another significant form of redundancy is evident in the fact that most vowel sounds have a duration in excess of that required for the hearing mechanism to identify the sound, and furthermore, the typical vowel sound contains many fold repetitions of a basic vowel waveshape. A final factor influencing the speech channel information rate is inefficient use of the spectrum: the speech signal does not occupy all of its spectrum space all of the time.

To summarize, there is a considerable difference between the information rate necessary to communicate the speech signal waveform in the conventional manner and that contained in the articulatory action of the vocal mechanism. The difference can be attributed to identity of the speaker and his emotional status, redundancy, and inefficient use of the spectrum. In general, all speech bandwidth compression systems attempt to exploit one or more of these factors to obtain compression of the bandwidth and consequently channel capacity (in bits per second).

The fact that communication of speech information by transmitting a replica of the speech waveform has been and remains the most widely used technique and is virtually considered as fundamental is undoubtedly due to the inherent simplicity and reliability of the equipment involved. Only recently the ever increasing demands on radio spectrum space and the requirement to achieve communication over long distance radio channels at reduced channel capacities have spurred considerable activity in achieving a means of voice communication which is more conservative in terms of channel capacity requirements.

In the following discussion several speech bandwidth or channel capacity reduction systems are described. In general these systems can be grouped into four principal categories:

1) Time or frequency compression methods. Such methods exploit the redundancy or regularities existing in the speech signal by sampling and frequency division techniques. These systems generally exhibit bandwidth compression in the order of 1:2 to 1:4 and can be transmitted in binary code form over channels of 5000 to 10,000 bits per second channel capacity.

2) Continuous analysis-synthesis methods. Such methods transmit in place of the speech signal spectrum a description of the spectrum in terms of a number of analog parametric control signals. As such, they exploit both the redundancy and inefficiency existing in the speech signal. In general, these systems exhibit bandwidth compression in the order of 1:10 to 1:20. A binary channel capacity of 1600 to 2000 bits per second has been demonstrated for vocoders, and it is expected that this channel capacity will be below 1000 bits per second for formant coding schemes.

3) Discrete sound analysis-synthesis methods. Such methods transmit in place of the speech signal code groups which identify the fundamental sounds that constitute the speech. As such, they exploit the redundancy and inefficiency of the speech signal, and, in addition, remove speaker identity and emotional status cues. These methods are best categorized in terms of the binary channel capacity required for transmission of the code groups. It is expected that such systems should be capable of transmitting speech at information rates as low as 60 bits per second.

4) Sound group analysis-synthesis methods. Such methods transmit only certain groups of sounds (particular words and phrases), each identified by a code group. Information rates in this case are, of course, a function of the size of the vocabulary. Such a system appears to be most useful at information rates in the order of 5-10 bits per second.

TIME OR FREQUENCY COMPRESSION METHODS

Doppler Frequency Compression

One of the principal forms of redundancy present in the speech signal is repetition of a waveshape characteristic of the sound generated. This is especially true for the prolonged vowel sounds. Consequently, it is possible to sample the speech signal periodically, and to ignore the remainder of the speech signal between the samples.3 Each sampling interval should be long enough to obtain a sample representative of the sound (the pitch period would be the minimum). By the proper speech reduction apparatus, it is possible to simultaneously sample the speech and spread the sampled portion over the entire interval between samples. A method of accomplishing this is illustrated in Fig. 3. The sound signal to be processed is carried on magnetic tape. The tape is passed over a rotating playback head assembly with heads spaced apart at positions A, B, C, and D. The magnetic tape is in contact with the head over an arc of 90°; thus, one of the playback heads is always in contact with the surface of the magnetic tape.

Fig. 3-Sampling action of Doppler frequency compressor.

Let $s$ be the distance of contact along the arc, $u$ the peripheral speed of the playback heads and $v$ the speed of the magnetic tape. The time required for one of the playback heads to move through the distance $s$ is $s/u$. This is the duration of the interval between samples. During this same time interval, the magnetic tape will move a distance $s(v/u)$, and the net slippage between the tape and the playback head will be $s(v-u)/u$ (length of the tape sampled by the playback head). Since the speed of the tape is $v$, the duration of the sample is $s(v-u)/uv$. The compression ratio is equal to the ratio of duration of sample to duration of interval between samples. Hence

$$K = \text{compression ratio} = \frac{v-u}{v}.$$

Since the sampled signal is spread over the entire interval, this is the ratio by which the frequencies of the speech signal are compressed. Also, it is identical to the relation for shift due to Doppler effect. Hence the name "Doppler frequency compression" may be applied to the technique.

Actually, the method may be used for both frequency compression and expansion. One unit, placed at the transmission end of the system, would be used to compress the signal by the factor $K$, and another unit, placed at the receiver, would expand the signal back to the

1 G. E. Peterson and H. L. Barney, "Control methods used in a study of the vowels," J. Acoust. Soc. Amer., vol. 24, pp. 175-184; March, 1952.
2 W. G. Tuller, "Theoretical limitations on the rate of transmission of information," PROC. IRE, vol. 37, pp. 468-478; May, 1949.
3 G. E. Peterson, Ph.D. dissertation, Louisiana State Univ., Baton Rouge; 1939.
4 G. Fairbanks, W. L. Everitt, and R. P. Jaeger, "Method for time or frequency compression-expansion of speech," 1953 IRE CONVENTION RECORD, pt. pp.
occur in synchronism with the pitch period of the speech, and the sampling interval is equal to the duration of an integral number of pitch periods. For this ideal case, half of the pitch waveshapes are played back at half speed, thus effectively reducing all spectral components by one-half, and consequently reducing the bandwidth by the same factor.
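The timing relations of the rotating-head sampler can be checked numerically. The sketch below is only an illustration, assuming a contact length s, head speed u and tape speed v with 0 < u < v; the function names and the numerical speeds are hypothetical and are not taken from the paper.

```python
def doppler_compression(s, u, v):
    """Return (interval, sample_duration, K) for the rotating-head sampler.

    s: length of the tape-to-head contact arc
    u: peripheral speed of the playback heads
    v: speed of the magnetic tape (same units as u)
    """
    if not 0 < u < v:
        raise ValueError("this sketch assumes 0 < u < v (compression)")
    interval = s / u                           # time for a head to cross the arc
    sample_duration = s * (v - u) / (u * v)    # tape length scanned / tape speed
    k = sample_duration / interval             # reduces to (v - u) / v
    return interval, sample_duration, k


def compressed_frequency(f_cps, k):
    """Frequency of a component after compression by the factor K."""
    return f_cps * k


# Hypothetical speeds: heads moving at half the tape speed (the K = 1/2 case).
interval, duration, k = doppler_compression(s=1.0, u=7.5, v=15.0)
print("K =", k)
print("3000 cps component moves to", compressed_frequency(3000.0, k), "cps")
```

With the heads at half the tape speed, the computed ratio corresponds to the ideal half-speed playback case described above, where every spectral component and the bandwidth are halved.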
Systems employing the Doppler frequency compression principle have been developed by Fairbanks et al.,4 Gabor,5 and Vilbig.6 Experimental results indicate that compression factors in the order of 1:4 are possible, and that by observing proper precautions, it may be possible to obtain compression ratios of the order 1:6. Gabor points out that sampling noise is a very serious limitation. He has constructed a device in which a multiplicity of overlapping samples, each modulated in amplitude by an error function, are summed to achieve a minimum transient condition in turning a sample on and off. Gabor also points out that the compressed and expanded speech contains not only spectral components related to the original pitch frequency of the input speech, but also spurious components which are caused by the sampling action. He shows that these components may produce some roughness in the processed speech and proposes to eliminate this by synchronizing the sampling action with the pitch frequency of the speech signal.

Fig. 5-Idealized frequency compressed speech waveform for K
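The benefit of summing overlapping, smoothly shaped samples can be illustrated with a small numerical sketch. It assumes that "error function" here denotes a bell-shaped (Gaussian) amplitude envelope, which is an interpretation and not something the text states; the function names, the spacing and the widths are all hypothetical. The sketch does not model Gabor's apparatus, only the idea that well-overlapped envelopes add up to a nearly constant gain while each sample fades in and out gradually.

```python
import math


def gaussian_envelope(t, center, sigma):
    """Bell-shaped gain applied to one sample segment centred at `center`."""
    return math.exp(-0.5 * ((t - center) / sigma) ** 2)


def summed_gain(t, spacing, sigma, n_segments):
    """Total gain at time t when n_segments overlapping envelopes are added."""
    return sum(gaussian_envelope(t, k * spacing, sigma) for k in range(n_segments))


def interior_ripple(spacing, sigma, n_segments=41, steps=200):
    """Peak-to-peak variation of the summed gain, relative to its mean,
    measured over one spacing interval in the middle of the train."""
    start = (n_segments // 2) * spacing
    gains = [
        summed_gain(start + spacing * i / steps, spacing, sigma, n_segments)
        for i in range(steps + 1)
    ]
    return (max(gains) - min(gains)) / (sum(gains) / len(gains))


print("heavy overlap:", interior_ripple(spacing=1.0, sigma=0.5))
print("light overlap:", interior_ripple(spacing=1.0, sigma=0.25))
```

In this sketch, widening each envelope relative to the spacing between samples sharply reduces the fluctuation of the total gain, which is one plausible reading of the "minimum transient condition" mentioned above.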
Pitch-Synchronous Processing of Speech

The method of pitch-synchronous processing7 of speech proposes to exploit the pitch period regularity in speech to achieve information rate compression. In general, the generation of impulses by the larynx occurs at a rate considerably in excess of that exhibited by the relatively slowly moving vocal cavity. Literally, the impulses from the larynx appear to sample the size and shape of the vocal cavity; the resulting signal is radiated at the lips in the form of speech. The periodic impulse rates vary generally from as low as 70 cps for male speakers to as high as 500 cps for female speakers. Knowledge of the periodicity affords a means of reducing the channel capacity necessary to transmit a speech signal. The pitch-synchronous method proposes to eliminate N - 1 of every N pitch periods before transmission, and to restore the missing parts at the receiver by simply repeating each of the received periods N - 1 times. Provided consonants can be similarly treated, the channel capacity in bits per second required to accommodate the chopped speech signal is reduced to 1/N of that required for the entire signal.

A block diagram of the transmitting and receiving terminal devices is shown in Fig. 6. The input speech signal is applied both to a gate and a pitch pulse generator. The pitch pulse generator produces a single impulse for each pitch period. These impulses are fed to a scale of N counter. The output of the scale of N counter opens the gate via a gate control circuit for a full pitch period. Hence the gate circuit passes a pitch period waveshape for every N pitch periods and remains closed for N - 1 periods. The output may be referred to as the reduced representation. It can subsequently be coded to match a given transmission channel whose capacity need be 1/N of that required if the entire wave were to be transmitted. At the receiving terminal, the reduced representation is used to synthesize the original wave as closely as possible. Where a number of pitch periods are removed, their places are filled by repetitions of the previous retained period. As shown in Fig. 6, this is accomplished by supplying the reduced representation to a set of pitch-interval delay lines. For a removal of N - 1 pitch intervals in the reduced representation, N - 1 delays are used. The outputs of all delay lines are summed to provide the restored speech signal.

Provided that the speech intervals are periodic between samples, the restored speech signal may be quite accurate; however, should appreciable variation be exhibited between sampling intervals, considerable distortion in the form of low-frequency "gargle" or "burble" can occur. One method available for overcoming the difficulty is to perform a linear interpolation on the pitch period samples immediately preceding and following the zero interval in the reduced representation. By this means the intervening pitch intervals can be restored without creating sharp discontinuities. A circuit for accomplishing this for N = 3 is shown in Fig. 7. A double set of pitch-length delay lines is used to supply both preceding and following periods to a weighted-summing network. Let the pitch period preceding the zero interval of the reduced representation be $P_{10}$ and the one following be $P_{20}$; then, if $P_1, P_2, P_3, \ldots, P_{N-1}$ are the periods to be inserted,

$$P_1 = \frac{N-1}{N} P_{10} + \frac{1}{N} P_{20}$$
$$P_2 = \frac{N-2}{N} P_{10} + \frac{2}{N} P_{20}$$
$$\vdots$$
$$P_{N-1} = \frac{1}{N} P_{10} + \frac{N-1}{N} P_{20}.$$

Fig. 7-Linear interpolator for N = 3. (Courtesy J. Acoust. Soc. Amer.)
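The linear interpolation between the retained pitch periods can be sketched in a few lines. This is a minimal illustration, assuming each pitch period is stored as an equal-length list of samples and that the k-th inserted period weights the preceding period by (N - k)/N and the following period by k/N; the function name `interpolate_periods` and the sample values are hypothetical.

```python
def interpolate_periods(p10, p20, n):
    """Return the N - 1 periods P_1 .. P_{N-1} to insert between p10 and p20.

    p10: samples of the retained period preceding the gap
    p20: samples of the retained period following the gap
    n:   one period in every n was transmitted
    """
    if n < 2:
        raise ValueError("N must be at least 2")
    if len(p10) != len(p20):
        raise ValueError("this sketch assumes equal-length pitch periods")
    inserted = []
    for k in range(1, n):
        w_prev = (n - k) / n   # weight on the preceding retained period
        w_next = k / n         # weight on the following retained period
        inserted.append([w_prev * a + w_next * b for a, b in zip(p10, p20)])
    return inserted


# Hypothetical periods whose amplitude doubles across the gap, with N = 3.
p10 = [0.0, 3.0, 0.0, -3.0]
p20 = [0.0, 6.0, 0.0, -6.0]
for period in interpolate_periods(p10, p20, n=3):
    print(period)
```

For N = 3 the two inserted periods step the amplitude evenly from the preceding period toward the following one, instead of repeating the preceding period and then jumping.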
This system of speech compression has been tested at the Bell Telephone Laboratory and reported by Davis and McDonald.7 For this test, a channel vocoder was used as a convenient source of monotone speech. This procedure eliminated the necessity of making the pitch delay lines variable and was considered sufficient to demonstrate feasibility of the approach.

The following results were reported. Elimination of N - 1 of every N pitch periods does not destroy the fundamental phonemic information for values of N as great as 6 and a pitch frequency of 200 cps. The restored speech is highly articulate, although it is somewhat distorted by harmonic effects. The latter effect can be reduced by interpolating the missing parts as linear combinations of the adjacent terms rather than by simply repeating them. The unvoiced consonants appear to be relatively undisturbed by this processing, even though they are gated with the same duty cycle as the vowels.
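The gating and repetition scheme that was tested can be sketched as follows. This is only an illustration, assuming the speech has already been divided into pitch periods (the job of the pitch pulse generator) and that the number of periods is a multiple of N; `reduce_periods`, `restore_by_repetition` and the sample values are hypothetical.

```python
def reduce_periods(periods, n):
    """Gate the signal: keep one pitch period out of every n."""
    return [p for i, p in enumerate(periods) if i % n == 0]


def restore_by_repetition(reduced, n):
    """Fill each gap by repeating the retained period n - 1 more times."""
    restored = []
    for p in reduced:
        restored.extend(list(p) for _ in range(n))
    return restored


# Monotone test signal: twelve identical pitch periods of five samples each.
pitch_cps = 200.0
n = 6
period = [0.0, 1.0, 0.5, -0.5, -1.0]
speech = [list(period) for _ in range(12)]

reduced = reduce_periods(speech, n)
restored = restore_by_repetition(reduced, n)

print("fraction of periods transmitted:", len(reduced) / len(speech))
print("periods transmitted per second:", pitch_cps / n)
print("restored equals original:", restored == speech)
```

The restoration is exact here only because the test signal is perfectly periodic; with real speech the repeated periods differ from the ones that were removed, which is the source of the distortion noted above.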
Vobanc

The Vobanc8 (Voice Band Compression) is a speech bandwidth compression system which provides a reduction by a factor of two in the transmission channel bandwidth required to accommodate the speech signal. The general principle is to divide the speech band into three parts (0.2-1 kc, 1-2 kc, and 2-3.2 kc) by filters located at carrier frequency. Each of these bands contains one of the three principal vowel formants. The signal in each band is passed through a regenerative modulator, which halves the frequency of the strongest components of the formant and translates the neighboring frequency components downward correspondingly. The output of the regenerative modulator9 is filtered to a bandwidth one-half that of the original. At the receiving end, each of the component bands is frequency doubled and recombined to occupy the normal speech spectrum range.

A block diagram of the Vobanc is shown in Fig. 8. The input speech signal is modulated by a 108-kc oscillator. The difference frequency components are selected by the A filters in three separate channels. Transmission characteristics of the A filters are shown in Fig. 9(a). The A1 filter transmits the band from 107.8 to 107 kc, which corresponds to the difference components resulting from the product of the local oscillator and the 0.2 to 1-kc first formant range. Second and third formant ranges are transmitted by filters A2 (107 to 106 kc) and A3 (106 to 104.8 kc) respectively.

Fig. 8-Block diagram of the Vobanc. (Courtesy J. Acoust. Soc. Amer.)

Fig. 9-Vobanc filter response characteristics. (Courtesy J. Acoust. Soc. Amer.)

The output of each of the A filters is supplied to a regenerative modulator. A block diagram of the modulator is shown in Fig. 10. The circuit consists of a balanced modulator for which the local oscillator input is fed back from the modulator output via a filter which selects only difference frequency components. Because of balance, no feedback develops unless an input signal is applied, and the output frequency is one-half the input frequency. In practice, a dynamic range of 35 db can be obtained with the circuit. In order that the circuit be stable, it is necessary to tune the carrier feedback amplifier broadly to the output frequency.

Fig. 10-Block diagram of the regenerative modulator. (Courtesy J. Acoust. Soc. Amer.)

If two closely spaced frequencies are impressed on the regenerative modulator, the average frequency of the input signal is halved, while the difference frequency between the two components is preserved. For more complicated speech signals, the frequency of the strongest components appears to be halved with the surrounding components displaced downward by the same amount. The spacing between harmonic components of the speech signal remains the same in the process, but the range of formant variation is halved. Thus, at the output of the regenerative modulator the speech formant range can be included within a bandwidth one-half that of the A filters. The filters which select the half frequency components are designated the B filters. Their transmission characteristics are shown in Fig. 9(b). The frequencies passed by the B filters are now modulated down to the frequency range 75 to 1925 cps by the output mixers of the transmitting terminal.

At the receiving terminal the compressed bandwidth speech signal is again modulated up to the frequency range of the B filters and supplied to a set of B filters. Each of the B filter outputs is then doubled in frequency, summed, and filtered to restore the 108-kc carrier range. The resulting signal is then mixed with a 108-kc local oscillator to restore the signal to the original audio frequency range.

In actual use, the frequency bandwidth is not quite halved since some guard-band must be allowed between the three channels. The most noticeable distortion introduced by the Vobanc is the "burbles," which may be produced on fricative speech sounds, breathing and other background noises. This distortion is more of an annoyance than a restriction on intelligibility. Consonant articulation scores for the system range between 79 and 91 per cent, depending on the degree of learning involved. Compared to this, the score on the conventional 3500-cps bandwidth telephone is 80 to per cent.

CONTINUOUS ANALYSIS-SYNTHESIS METHODS

Vocoder

The vocoder (Voice Coder) is very likely the most widely used speech bandwidth compression system in use today. It was first proposed and successfully demonstrated by Dudley10 in 1939. Except for changes in the circuitry details, the functional concept of the device has changed little from that originally conceived. A block diagram of the vocoder is shown in Fig. 11. It consists basically of a speech analyzer located at the transmitting terminal and a synthesizer at the receiving terminal. In the analyzer, the speech spectrum is divided into a number of contiguous bands by an analyzing filter bank. It has been a general practice to identify a vocoder in terms of the number of analyzing filters. Thus, the device shown in Fig. 11 is a 10-channel vocoder. The signals transmitted by each filter are detected to determine the amplitude of the signal level in each band. These signals are then transmitted to the synthesizer, and there are used to enable artificial voice and noise (unvoiced) excitation falling in a band corresponding to that from which they were originally detected.

Fig. 11-Block diagram of the vocoder.

A pitch control channel is also included to control the artificial voice excitation in the synthesizer. The pitch information is derived from the speech signal in the analyzer by a pitch extractor circuit. This signal not only controls the frequency of pitch of the synthesized speech, but also controls the selection of voice excitation for the vowel sounds and noise excitation for the fricative sounds. The pitch control signal can be used to control voice-unvoice selection because of its nature. It takes on values above a certain threshold for voiced sounds, but remains at a steady state value below the threshold for silence and fricative sounds.

The method of allocating the analyzing filters is of interest. It is generally the practice to select the bandwidths of the filters so that they increase logarithmically with increasing frequency. This arrangement, called a Koenig scale, keeps the amount of voice energy intercepted by each filter roughly equal.

It is possible to transmit each of the amplitude control signals and the pitch control signal in analog form over a channel of approximately 25 cps bandwidth. Hence, the total bandwidth required for the 10-channel vocoder is 275 cps. For binary coding it is possible to encode each channel in a 4 bit code employing a sampling rate of 50 cps. The entire information output of the 10 channel vocoder can therefore be transmitted at approximately 2200 bits per second.

For a 10 channel vocoder, Swaffield11 has reported syllable articulation in the order of 83 to 85 per cent, as compared to a value of 90 to 91 per cent for a high quality voice circuit degraded by restricting its bandwidth to 250 cps to 3000 cps. It can be expected that a vocoder using a larger number of analyzing filters (up to 18) will exhibit improved performance over that achieved for the 10 channel unit reported above; however, the performance will be improved at the expense of increased bandwidth and information rate.

Formant Tracking

Formant tracking speech compression systems exploit the fact that all of the vowel sounds occurring in speech can be specified uniquely in terms of the relative frequency positions of the first two principal formants encountered in the speech signal spectrum, and that the specification is improved by including the third formant.1,12 Typical formant structure of speech is illustrated in the sonagrams shown in Fig. 12. For male spoken vowels, the first formant can range from 200 < F1 < 1000

Fig. 12-Typical speech sonagrams.

8 B. P. Bogert, "The vobanc-a two-to-one speech bandwidth reduction system," J. Acoust. Soc. Amer., vol. 28, pp. ; 1956.
9 R. L. Miller, "Fractional-frequency generators utilizing regenerative modulation," PROC. IRE, vol. 27, pp. 446-457; July, 1939.
10 H. Dudley, "Remaking speech," J. Acoust. Soc. Amer., vol. 11, pp. ; October, 1939.
11 J. Swaffield, "The potentialities of the vocoder for telephony over very long distances," P.O. Elect. Eng. J., vol. 41, pt. 1, pp. 22-28; April, 1948.
12 P. C. Delattre, A. M. Liberman, and F. Cooper, "Two-formant synthetic vowels and cardinal vowels," Le Maître Phonétique; July-December,
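The vocoder bandwidth and bit-rate figures quoted in the Vocoder section follow from simple arithmetic, sketched below. The sketch assumes that the pitch channel is treated exactly like a spectrum-amplitude channel (25 cps of analog bandwidth, or a 4 bit code at 50 samples per second), which is what the quoted totals imply; the function names are hypothetical, and the 18-channel figures are an extrapolation under the same assumption, not values from the paper.

```python
def analog_bandwidth_cps(n_spectrum_channels, per_channel_cps=25.0):
    """Total analog bandwidth: spectrum channels plus one pitch channel."""
    return (n_spectrum_channels + 1) * per_channel_cps


def binary_rate_bps(n_spectrum_channels, bits_per_sample=4, sample_rate_cps=50):
    """Total bit rate: spectrum channels plus one pitch channel."""
    return (n_spectrum_channels + 1) * bits_per_sample * sample_rate_cps


for channels in (10, 18):
    print(
        channels,
        "channels:",
        analog_bandwidth_cps(channels),
        "cps analog,",
        binary_rate_bps(channels),
        "bits per second",
    )
```

For the 10-channel case this reproduces the 275 cps and 2200 bits per second stated in the text, and it shows how both figures grow when more analyzing filters are used.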
112 IRE TRANSACTIONS AUDIO
Low Po$;
I
No.1
No.2
4 1
4
t
From these facts it appearsreasonable to expect that nals would be either timedivision or freqnwc\* division
the speech signal can be specified in terms of a small multiplexed on a carrier, or convcrted tobillnt.\- mtle lor
number of control signals or parameters and accurately transmission.
reconstructred from such information. Lawrence dem- At the synthesizer a heterodyneprocessis used to
onstrated the reconstruction of speech from a set of six generate the artificial speech. Each of the iornlnnt con-
parameters consisting of three formants-voice ampli- trolsignalsdisplaces a formantgencratoroscillator
tude, unvoiced amplitude and pitch.I4 Each of these from a carrier of 18 kc by an amount proportian:tI the
signals can be conservatively passed through low-pass formant frequency. The first formant gcncx-ntor hns ;I.
filters of 25 cps bandwidth, thus making i t possible to rangefrom 18.2 to 19.0 kc, the second from 18.8
transmit speech in ‘a total bandwidth of 150 cps. kc, and the third from 20.3 t o 21.8 kc. ’I‘hcst,
-la M. Liberman, P.C. Delattre, F. Cooper, and L. J. Gerst- S. J. Campanella and T. E. Bayston, “A continuous analysis
man, “The role of consonant-vowel transitions in the perception of speech bandwidth compression system,” Third d 4 n ~ 2 z c n lAer*o-Com
the stop andnasal consonants,’’Psyc. fiionographs General and Appl., Symposiullz Abstracts, pp. November, 1957.
vol. pp. 1-13; 1954. l8 S. Chang, “Two schemes of speechcompression system,” J.
l4 W. Lawrence, “The ,:ynthesis of speech from sig:als which Acoust. Anzer., vol. 28, pp. 565-572; July,
have low information rate, in “CommunicationTheory, W. Jack-
son, ed., Butterworth Scientific Publications, London, Eng., ch. 34;
1953.
Campanella: A Survey of Speech Bandwidth Compression Techniques 113
pulses are supplied to balanced modulators, where they are mixed with sources of voiced or unvoiced excitation centered at 18 kc. The bandwidth of the sources of excitation determines the bandwidth of the synthesized formants and is in the range of 150 to 200 cps. The outputs of all the modulators are summed and the difference frequency components selected to produce the synthesized speech. The selection of voiced or unvoiced excitation is controlled by the pitch signal. The repetition frequency of the source of voiced excitation is of course controlled by the pitch signal. To permit independent control of the amplitude of excitation in the low and high frequency ranges of the synthesized speech, two sources are supplied for each type of excitation. Such independent control was found necessary to produce the nasal and fricative consonants.

Articulation tests conducted on an experimental model of the system described above gave an average score of 67 percent for PB (phonetically balanced) word lists.18 It is expected that considerably improved performance will be achieved in future models of the system.

The total information rate required to accommodate the six control signals has been estimated to lie between 640 and 1200 bits per second. These estimates are based on quantizing each parametric control signal with three or four bits (i.e., with 8 or 16 quantized levels) and sampling at a rate of 40 or 50 cps. The resulting quantized levels of each control signal produce changes in the parameter under control which are in the neighborhood of those just discernible by the human ear.19 The upper estimate in information rate of 1200 bits per second assumes four-bit quantization at a sampling rate of 50 cps. The lower estimate of 640 bits per second is obtained by considering three-bit quantization for all parameters with the exception of the pitch, using one bit for the pitch (this will give a monotone quality to the synthesized speech) and sampling at 40 cps.

DISCRETE ANALYSIS-SYNTHESIS METHODS

Spectrum Pattern Quantization

For all of the sounds of speech, it is well known that each is characterized by a given shape of its spectrum. If one could specify a set of shapes which uniquely define all of the basic speech spectrum patterns both for steady state and transient sounds, it would be possible, provided this set is of reasonable size, to transmit the speech spectrum at relatively low information rates.

A technique to exploit this possibility has been proposed by Smith.20 A block diagram of the method is shown in Fig. 14. At the transmitting terminal, the output channels of a conventional vocoder are supplied to a pattern correlation matrix. It is the function of this matrix to select the stored spectrum pattern most closely correlated with the input spectrum and to indicate this selection at its output. A 4 x 5 pattern correlation matrix for performing the pattern selection is shown in Fig. 15(a). The device consists of a matrix of cells. The columns of cells are connected to the spectrum amplitude outputs of a vocoder analyzing filter bank. The rows are connected to an amplitude quantizing resistance divider. When the inputs to any cell are within plus or minus one-half the difference between the quantized amplitude levels, the cell produces a unit of output voltage. Otherwise its output is zero. In order to normalize the amplitude levels, the voltage reference for the amplitude quantizing is determined by the average of all the vocoder outputs. The result of the amplitude and frequency quantization in a sample spectrum is illustrated in Fig. 15(b). In this case, cells giving unit output are 41, 22, 13, 24, and 45. If these cells are interconnected so that the output from each is added, the output exceeds that of all other possible interconnections by at least one unit. Thus, if many pattern outputs are laced through the matrix, the most probable pattern existing at any instant will be that corresponding to the interconnection of cells having the greatest output voltage. For each pattern selection a code can be assigned and transmitted to the receiving terminal. At the receiving terminal a corresponding pattern can be selected from a set of stored patterns and used to control a conventional vocoder synthesizer.

Fig. 14-Correlation instrumentation for the vocoder.

Fig. 15-Pattern correlation matrix.

It has been estimated that the total information rate necessary to transmit speech information in this form would be less than 1000 bits per second.20 An 11-bit code could be used to specify a set of 2048 spectrum patterns. If the sampling rate were 50 per second, then 550 bits per second would be required to transmit these patterns. An additional 450 bits per second would suffice for pitch and amplitude, thus giving a total of 1000 bits per second.

Phoneme Quantizing

As far as speech compression is concerned, phoneme quantizing would most probably provide the least possible information rate for transmitting the information content of speech. Using such a technique, speech information could be transmitted at teletype code rates. Approximately 40 phonemes constitute the sounds of the English language, and they occur at rates of approximately 10 per second.21 A 6-bit code would suffice to specify 64 different phonemes. Thus, the phoneme content of speech can theoretically be transmitted at an information rate of 60 bits per second. However, this information rate does not allow for inflections in pitch or changes in intensity of the synthesized speech.

A block diagram of a possible phoneme coded system is shown in Fig. 16. To this date no one has successfully demonstrated a speech compression device based on phoneme code, although a synthesizer has been demonstrated which, when properly programmed, could produce recognizable speech from a set of phoneme sounds recorded on magnetic drums. The greatest difficulty in realizing the device is in the analysis of the original speech. Although only 40 phonemes may be required to specify the speech, each individual speaker imposes his own unique characteristics on the spectrum, and it becomes difficult to produce a device which indicates the correct phoneme selection invariant of the characteristics of the individual speaker. Perhaps with improved analysis devices associated with appropriate spectrum normalization to remove speaker-to-speaker variances, a phoneme coded system may become a reality.

Fig. 16-Discrete phoneme coding system.

SOUND GROUP ANALYSIS-SYNTHESIS METHODS

For a limited vocabulary, a system which automatically recognizes entire words, transmits a unique code for each word, and at the receiver converts the word code to synthetic speech may offer a means of transmission at extremely low information rates. For example, a vocabulary of 32 words can be handled at information rates of 10 bits per second. No attempt has been made to produce such a system; however, both analyzers and synthesizers are being separately investigated.

The telephone company is working on an analyzer which will automatically recognize spoken digits. This device has been given the name of Audrey22 (Automatic Digit Recognizer). The machine analyzes the speech input to determine which sound in its memory it is most like. It first breaks the spoken digit into a series of sound identifications and then determines by comparison which of the ten digits in its memory has the same sequence. The device is able to recognize the digits as spoken by certain voices; however, it cannot as yet perform satisfactorily for all voices. It is pointed out that the sound recognition portion of the Audrey system is operating quite successfully, and may constitute a useful analyzing device for a phoneme coded speech compression system.

18 "Development of a Continuous Analysis Speech Compression System," Final Eng. Rep., Contract AF33(600)-32293; July, 1952.
19 J. L. Flanagan, "A Speech Analyzer for a Formant-Coding Compression System," Sci. Rep. No. 4, USAF Contract AF19(604)-626; May, 1955.
20 C. P. Smith, "Speech Data Reduction," AFCRC-TR-57-111, ASTIA Document No. AD117290; May, 1957.
W. E. Kock, "Speech bandwidth compression," Bell Labs. Rec., pp. 81-85; March, 1956.
R. J. Halsey and J. Swaffield, "Analysis-synthesis telephony with special reference to the vocoder," J. IEE, vol. pp. 391-406; September, 1948.
23 C. W. Poppe and P. J. Suhr, "An automatic voice readout system," Proc. Eastern Joint Computer Conf., pp. 219-221; December,

Fig. 17-Discrete word coding system.

For the synthesis of complete words, magnetic tape playbacks operated from a source of digital information provide a satisfactory device. The Automatic Voice Readout System (AVRS)23 shown in Fig. 17 is such a device. The AVRS is intended for use as a digital code to voice converter to read out commands from a digital computer vocally. A present model of the device has a capacity of 35 words and has a potential capacity of 100 words. Five digit control signals required to control a relay pyramid are supplied from a coding unit and synthesizer. The output of the relay pyramid drives an audio amplifier. The synchronizer serves as the basic timing unit within the system. Synchronization pulses are received every drum revolution or one-half second,
and are used to advance the synchronizer to the next word interval. The function of the coding unit is to convert the input code to a five digit parallel code required to drive the relay pyramid.

A system consisting of a limited vocabulary coding analyzer such as Audrey, and a limited vocabulary decoding synthesizer such as the AVRS would, when operated with both parts, constitute an operable voice communication system of extremely low information rate. Although such a system possesses a highly limited vocabulary, there are certain situations, such as aircraft traffic control, where such a limited vocabulary would be entirely satisfactory.

Signal-To-Noise Ratio in Compressed Speech Channels

The amount of information which can be transmitted over a channel per unit time is given by (1) in terms of bandwidth and signal-to-noise ratio. By use of this expression, it is possible to relate the signal-to-noise ratio in the compressed speech channel to the ratio of bandwidth reduction, the ratio of information reduction and the signal-to-noise ratio in the original speech channel. If $S_1/N_1$, $W_1$, and $C_1$ are the signal-to-noise ratio, bandwidth, and channel capacity, respectively, of the original speech channel, and if the corresponding parameters for the compressed speech channel are indicated by a subscript 2, then the signal-to-noise ratio in the compressed speech channel is given by the relation

$$\frac{S_2}{N_2} = \left[1 + \frac{S_1}{N_1}\right]^{(W_1/W_2)(C_2/C_1)} - 1, \qquad (4)$$

and for $S_2/N_2 \gg 1$ and $S_1/N_1 \gg 1$ the following approximation is valid:

$$\frac{S_2}{N_2} \approx \left(\frac{S_1}{N_1}\right)^{(W_1/W_2)(C_2/C_1)}. \qquad (5)$$

In Fig. 18 the compressed speech channel signal-to-noise ratio is plotted as a function of the signal-to-noise ratio in the noncompressed speech channel for values of the exponent $(W_1/W_2)(C_2/C_1)$ of 1 and 10. The value of unity corresponds to the case where, by elimination of redundancy, inefficiency and/or speaker identity, the channel capacity is reduced by the same factor as the channel bandwidth. To obtain comparable performance in this case the signal-to-noise ratio in the compressed speech channel will be the same as that in the noncompressed speech channel. Furthermore, since the bandwidth required to accommodate the compressed speech channel is reduced by the factor $(W_1/W_2)$, the interfering noise energy intercepted is reduced by the same factor (assuming that the noise is uniformly distributed in the frequency spectrum) and the immunity of the compressed speech channel to noise interference is improved by $10 \log (W_1/W_2)$ db.

Values of the exponent $(W_1/W_2)(C_2/C_1) > 1$ result when the channel capacity is not reduced by a factor as great as that for the reduction in bandwidth. In this case, the signal-to-noise ratio in the compressed speech channel will always be greater than that in the noncompressed speech channel, if comparable performance is to be maintained. Consider for example a case in which the value of the exponent is 10. This could result if an attempt were made to reduce the bandwidth by a factor $W_1/W_2 = 10$, but no attempt were made to eliminate redundancy, inefficiency and/or speaker identity to reduce channel capacity, i.e., $C_1/C_2 = 1$. As shown in Fig. 18, the signal-to-noise ratio in the compressed speech channel must exceed that in the noncompressed speech channel by a considerable amount if comparable performance is to be achieved. For example, to obtain performance comparable to that achieved with a 20 db signal-to-noise ratio in the noncompressed speech channel, the signal-to-noise ratio in the compressed speech channel would have to be 200 db.

Fig. 18-Compressed speech channel signal to noise ratio as a function of original channel signal to noise ratio.

The preceding discussion points out the rather interesting fact that speech bandwidth compression without corresponding elimination of information can be accomplished only at the cost of considerable transmitted power. In this respect it is pointed out that the true measure of the effectiveness of a bandwidth compression system cannot be measured by the bandwidth reduction factor alone; the influence of information reduction must also enter the picture in terms of the signal-to-noise ratio, or the information rate that must exist in the compressed speech channel to obtain speech reproduction with good signal-to-noise ratio.

CONCLUSIONS

There are several methods of speech bandwidth compression which are currently being investigated or used today. These methods may be divided into four categories:

1) frequency or time compression,
2) continuous analysis-synthesis,
3) discrete sound analysis-synthesis, and
4) sound group analysis-synthesis.
Systems in the first category are the most simple to implement and can provide bandwidth compression ratios of 1 to 1:6 with corresponding reductions in information rate. Systems in the second category are considerably more complicated and can provide bandwidth compression ratios in the order of 1:10 to 1:20, using information rates in the range from 800 to 3000 bits per second. Systems in the third category have not as yet been successfully demonstrated. They promise to provide speech communication at information rates below 1,000 bits per second, and perhaps as low as 60 bits per second. The lower bit rates would be achieved by use of a phoneme coding and would possess no speaker identity cues. Systems in the fourth category may prove useful with a limited vocabulary of perhaps less than 100 words. Such systems would employ extremely low bit rates and would probably be used only in highly specialized applications. It is not expected that they would be useful for general speech communication.

It is important to note that bandwidth reduction without elimination of unnecessary information can be accomplished only by an excessive increase in the signal-to-noise ratio of the compressed bandwidth channel. Hence, in specifying the performance of a speech bandwidth compression system, either the signal-to-noise ratio or information rate required to provide satisfactory speech reproduction should be indicated.
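
The size of that increase can be checked numerically. The sketch below evaluates the relation S2/N2 = [1 + S1/N1]^((W1/W2)(C2/C1)) - 1 from the signal-to-noise section, together with its high-ratio approximation, in which the exponent simply multiplies the decibel value. The function and parameter names are assumptions chosen here for illustration and do not come from the paper.

```python
import math


def compressed_snr_db(snr1_db, bandwidth_reduction, capacity_ratio=1.0, exact=True):
    """Signal-to-noise ratio (db) needed in the compressed channel.

    snr1_db             -- S1/N1 of the original channel, in db
    bandwidth_reduction -- W1/W2
    capacity_ratio      -- C2/C1 (1.0 means no information was eliminated)
    """
    exponent = bandwidth_reduction * capacity_ratio
    if not exact:
        # Approximation for large ratios: a power law in the ratio is a
        # multiplication in db.
        return exponent * snr1_db
    # Work with log10(1 + S2/N2) to keep intermediate values small.
    log_one_plus_snr2 = exponent * math.log10(1.0 + 10.0 ** (snr1_db / 10.0))
    return 10.0 * math.log10(10.0 ** log_one_plus_snr2 - 1.0)


def noise_immunity_gain_db(bandwidth_reduction):
    """Improvement in noise immunity, 10 log (W1/W2), for uniformly
    distributed noise when the exponent is unity."""
    return 10.0 * math.log10(bandwidth_reduction)


# Bandwidth cut tenfold with no reduction in channel capacity.
print(compressed_snr_db(20.0, 10.0, exact=False))
print(compressed_snr_db(20.0, 10.0))
# Bandwidth and capacity both cut tenfold: the exponent is unity.
print(compressed_snr_db(20.0, 10.0, capacity_ratio=0.1))
print(noise_immunity_gain_db(10.0))
```

With a 20 db original channel and a tenfold bandwidth reduction, the approximation gives the 200 db figure quoted in the text, and the exact relation gives a value about 0.4 db higher. When capacity is reduced by the same factor, the required ratio stays at 20 db.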
Contributors
John P. Quitter (S'40-A'43-M'44-SM'54) was born in Budapest, Hungary, July 24, 1919. He received the E.E. degree from the University of Cincinnati, Ohio, in 1941, and has done graduate work in physics.

In 1941 and 1942 he was associated with the Westinghouse Electronics Division in the development of magnetrons and high-frequency triodes. From 1943 to 1948 he was with Sperti, Inc., of Cincinnati in the development and manufacture of various instruments and armament devices.

In 1948 he joined the Baldwin Piano Company as research engineer engaged in acoustical study of the piano, and is now supervisor of the Piano Laboratory. His work is primarily in the fields of vibration problems, audio instrumentation, electronic counterparts of the piano, and application of modern engineering principles to the design and manufacture of the piano. Mr. Quitter is also mathematics instructor at the University of Cincinnati Evening College.

He is a registered professional engineer and a member of AIEE, Eta Kappa Nu, and the Engineering Society of Cincinnati. Mr. Quitter has served as Cincinnati Chairman of IRE and AIEE, and of the Technical and Scientific Societies Council of Cincinnati. He is presently secretary of the Cincinnati PGA.

S. Joseph Campanella (A'52-M'57) was born in Washington, D. C., on December 26, 1926. He attended the Catholic University of America in Washington, where he received the B.E.E. degree (magna cum laude) in 1950, and the University of Maryland, College Park, Md., where he received the M.S.E.E. degree in 1956.

In 1950, he joined the Naval Research Laboratory in Washington as an electronic scientist. He participated in development of various experimental sonar systems and designed and developed sonar receiving systems and correlation devices. Since 1953 he has been at Melpar, Inc., where he has performed fire control systems study, investigated computer techniques for statistical data processing, designed and developed special purpose power spectrum analysis equipment, conducted research and development in speech bandwidth compression techniques for use in communications systems, and performed communications system analysis and design.

He is a member of Sigma Xi, Phi Eta Sigma, and AOA.
