Foundations and Trends in Signal Processing
Vol. 1, Nos. 1–2 (2007) 1–194
© 2007 L. R. Rabiner and R. W. Schafer
DOI: 10.1561/2000000001

Introduction to Digital Speech Processing

Lawrence R. Rabiner¹ and Ronald W. Schafer²

¹ Rutgers University and University of California, Santa Barbara, USA, rabiner@ece.ucsb.edu
² Hewlett-Packard Laboratories, Palo Alto, CA, USA
Abstract

Since even before the time of Alexander Graham Bell's revolutionary invention, engineers and scientists have studied the phenomenon of speech communication with an eye on creating more efficient and effective systems of human-to-human and human-to-machine communication. Starting in the 1960s, digital signal processing (DSP) assumed a central role in speech studies, and today DSP is the key to realizing the fruits of the knowledge that has been gained through decades of research. Concomitant advances in integrated circuit technology and computer architecture have aligned to create a technological environment with virtually limitless opportunities for innovation in speech communication applications. In this text, we highlight the central role of DSP techniques in modern speech communication research and applications. We present a comprehensive overview of digital speech processing that ranges from the basic nature of the speech signal, through a variety of methods of representing speech in digital form, to applications in voice communication and automatic synthesis and recognition of speech. The breadth of this subject does not allow us to discuss any aspect of speech processing to great depth; hence our goal is to provide a useful introduction to the wide range of important concepts that comprise the field of digital speech processing. A more comprehensive treatment will appear in the forthcoming book, Theory and Application of Digital Speech Processing [101].
1 Introduction
The fundamental purpose of speech is communication, i.e., the transmission of messages. According to Shannon's information theory [116], a message represented as a sequence of discrete symbols can be quantified by its information content in bits, and the rate of transmission of information is measured in bits/second (bps). In speech production, as well as in many human-engineered electronic communication systems, the information to be transmitted is encoded in the form of a continuously varying (analog) waveform that can be transmitted, recorded, manipulated, and ultimately decoded by a human listener. In the case of speech, the fundamental analog form of the message is an acoustic waveform, which we call the speech signal. Speech signals, as illustrated in Figure 1.1, can be converted to an electrical waveform by a microphone, further manipulated by both analog and digital signal processing, and then converted back to acoustic form by a loudspeaker, a telephone handset, or a headphone, as desired. This form of speech processing is, of course, the basis for Bell's telephone invention as well as today's multitude of devices for recording, transmitting, and manipulating speech and audio signals. Although Bell made his invention without knowing the fundamentals of information theory, these ideas
Fig. 1.1 A speech waveform with phonetic labels for the text message "Should we chase."
have assumed great importance in the design of sophisticated modern communications systems. Therefore, even though our main focus will be mostly on the speech waveform and its representation in the form of parametric models, it is nevertheless useful to begin with a discussion of how information is encoded in the speech waveform.
1.1 The Speech Chain
Figure 1.2 shows the complete process of producing and perceiving speech, from the formulation of a message in the brain of a talker, to the creation of the speech signal, and finally to the understanding of the message by a listener. In their classic introduction to speech science, Denes and Pinson aptly referred to this process as the "speech chain" [29]. The process starts in the upper left as a message represented somehow in the brain of the speaker. The message information can be thought of as having a number of different representations during the process of speech production (the upper path in Figure 1.2).
Fig. 1.2 The Speech Chain: from message, to speech signal, to understanding. The diagram traces speech production (message formulation, language code, neuro-muscular controls, vocal tract system) through the transmission channel to speech perception (basilar membrane motion, neural transduction, language translation, message understanding), with information rates rising from about 50 bps (text) to 200 bps (phonemes, prosody), 2000 bps (articulatory motions), and 64–700 Kbps (acoustic waveform).
For example, the message could be represented initially as English text. In order to "speak" the message, the talker implicitly converts the text into a symbolic representation of the sequence of sounds corresponding to the spoken version of the text. This step, called the language code generator in Figure 1.2, converts text symbols to phonetic symbols (along with stress and durational information) that describe the basic sounds of a spoken version of the message and the manner (i.e., the speed and emphasis) in which the sounds are intended to be produced. As an example, the segments of the waveform of Figure 1.1 are labeled with phonetic symbols using a computer-keyboard-friendly code called ARPAbet.¹ Thus, the text "should we chase" is represented phonetically (in ARPAbet symbols) as [SH UH D — W IY — CH EY S]. (See Chapter 2 for more discussion of phonetic transcription.) The third step in the speech production process is the conversion to "neuro-muscular controls," i.e., the set of control signals that direct the neuro-muscular system to move the speech articulators, namely the tongue, lips, teeth, jaw, and velum, in a manner that is consistent with the sounds of the desired spoken message and with the desired degree of emphasis. The end result of the neuro-muscular controls step is a set of articulatory motions (continuous control) that cause the vocal tract articulators to move in a prescribed manner in order to create the desired sounds. Finally, the last step in the speech production process is the "vocal tract system" that physically creates the necessary sound sources and the appropriate vocal tract shapes over time so as to create an acoustic waveform, such as the one shown in Figure 1.1, that encodes the information in the desired message into the speech signal.

¹ The International Phonetic Association (IPA) provides a set of rules for phonetic transcription using an equivalent set of specialized symbols. The ARPAbet code does not require special fonts and is thus more convenient for computer applications.
To determine the rate of information flow during speech production, assume that there are about 32 symbols (letters) in the language (in English there are 26 letters, but if we include simple punctuation we get a count closer to 32 = 2^5 symbols). Furthermore, the rate of speaking for most people is about 10 symbols per second (somewhat on the high side, but still acceptable for a rough information rate estimate). Hence, assuming independent letters as a simple approximation, we estimate the base information rate of the text message as about 50 bps (5 bits per symbol times 10 symbols per second). At the second stage of the process, where the text representation is converted into phonemes and prosody (e.g., pitch and stress) markers, the information rate is estimated to increase by a factor of 4 to about 200 bps. For example, the ARPAbet phonetic symbol set used to label the speech sounds in Figure 1.1 contains approximately 64 = 2^6 symbols, or about 6 bits/phoneme (again a rough approximation assuming independence of phonemes). In Figure 1.1, there are 8 phonemes in approximately 600 ms. This leads to an estimate of 8 × 6/0.6 = 80 bps. Additional information required to describe prosodic features of the signal (e.g., duration, pitch, loudness) could easily add 100 bps to the total information rate for a message encoded as a speech signal.
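The back-of-the-envelope arithmetic above can be checked in a few lines. This is only a sketch of the same simplifying assumption the text makes (independent, equiprobable symbols); the function name is ours, not the authors':

```python
import math

def info_rate(n_symbols, symbols_per_sec):
    """Rough information rate in bps, assuming independent, equiprobable symbols."""
    return math.log2(n_symbols) * symbols_per_sec

# Text: ~32 = 2^5 symbols at ~10 symbols/s -> 5 bits/symbol * 10 symbols/s = 50 bps
text_bps = info_rate(32, 10)

# Phonemes: ~64 = 2^6 ARPAbet symbols, 8 phonemes in 0.6 s -> 6 * (8/0.6) = 80 bps
phoneme_bps = info_rate(64, 8 / 0.6)

print(text_bps, round(phoneme_bps))  # prints: 50.0 80
```

The same function reproduces both estimates because both reduce to bits-per-symbol times symbols-per-second.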
The information representations for the first two stages in the speech chain are discrete, so we can readily estimate the rate of information flow with some simple assumptions. For the next stage in the speech production part of the speech chain, the representation becomes continuous (in the form of control signals for articulatory motion). If they could be measured, we could estimate the spectral bandwidth of these
control signals and appropriately sample and quantize these signals to obtain equivalent digital signals for which the data rate could be estimated. The articulators move relatively slowly compared to the time variation of the resulting acoustic waveform. Estimates of bandwidth and required accuracy suggest that the total data rate of the sampled articulatory control signals is about 2000 bps [34]. Thus, the original text message is represented by a set of continuously varying signals whose digital representation requires a much higher data rate than the information rate that we estimated for transmission of the message as a speech signal.² Finally, as we will see later, the data rate of the digitized speech waveform at the end of the speech production part of the speech chain can be anywhere from 64,000 to more than 700,000 bps. We arrive at such numbers by examining the sampling rate and quantization required to represent the speech signal with a desired perceptual fidelity. For example, "telephone quality" requires that a bandwidth of 0–4 kHz be preserved, implying a sampling rate of 8000 samples/s. Each sample can be quantized with 8 bits on a log scale, resulting in a bit rate of 64,000 bps. This representation is highly intelligible (i.e., humans can readily extract the message from it), but to most listeners it will sound different from the original speech signal uttered by the talker. On the other hand, the speech waveform can be represented with "CD quality" using a sampling rate of 44,100 samples/s with 16-bit samples, or a data rate of 705,600 bps. In this case, the reproduced acoustic signal will be virtually indistinguishable from the original speech signal.
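The waveform data rates above are simply sampling rate times word length, and the "8 bits on a log scale" remark refers to logarithmic companding before quantization. The sketch below illustrates both; the companding curve uses the standard μ-law constant from North American telephony, but this is an illustrative fragment, not the full G.711 encoder:

```python
import math

def waveform_bps(sample_rate, bits_per_sample):
    """Data rate of a sampled, uniformly quantized waveform."""
    return sample_rate * bits_per_sample

assert waveform_bps(8_000, 8) == 64_000      # "telephone quality"
assert waveform_bps(44_100, 16) == 705_600   # "CD quality"

MU = 255.0  # mu-law companding constant

def mu_compress(x):
    """Map x in [-1, 1] onto a log scale, giving more levels to small amplitudes."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_expand(y):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def quantize(y, bits=8):
    """Uniformly quantize the companded value to 2^bits levels."""
    levels = 2 ** (bits - 1)
    return round(y * levels) / levels

x = 0.02  # small amplitude, typical of much of a speech waveform
x_hat = mu_expand(quantize(mu_compress(x)))
```

Because quantization happens after the log mapping, the reconstruction error stays roughly proportional to the signal amplitude, which is why 8 log-scale bits suffice for intelligible telephone speech.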
As we move from text to speech waveform through the speech chain, the result is an encoding of the message that can be effectively transmitted by acoustic wave propagation and robustly decoded by the hearing mechanism of a listener. The above analysis of data rates shows that as we move from text to sampled speech waveform, the data rate can increase by a factor of 10,000. Part of this extra information represents characteristics of the talker such as emotional state, speech mannerisms, accent, etc., but much of it is due to the inefficiency of simply sampling and finely quantizing analog signals. Thus, motivated by an awareness of the low intrinsic information rate of speech, a central theme of much of digital speech processing is to obtain a digital representation with a lower data rate than that of the sampled waveform.

² Note that we introduce the term data rate for digital representations to distinguish it from the inherent information content of the message represented by the speech signal.
The complete speech chain consists of a speech production/generation model, of the type discussed above, as well as a speech perception/recognition model, as shown progressing to the left in the bottom half of Figure 1.2. The speech perception model shows the series of steps from capturing speech at the ear to understanding the message encoded in the speech signal. The first step is the effective conversion of the acoustic waveform to a spectral representation. This is done within the inner ear by the basilar membrane, which acts as a non-uniform spectrum analyzer by spatially separating the spectral components of the incoming speech signal and thereby analyzing them by what amounts to a non-uniform filter bank. The next step in the speech perception process is a neural transduction of the spectral features into a set of sound features (or distinctive features, as they are referred to in the area of linguistics) that can be decoded and processed by the brain. The next step in the process is a conversion of the sound features into the set of phonemes, words, and sentences associated with the incoming message by a language translation process in the human brain. Finally, the last step in the speech perception model is the conversion of the phonemes, words, and sentences of the message into an understanding of the meaning of the basic message in order to be able to respond to or take some appropriate action. Our fundamental understanding of the processes in most of the speech perception modules in Figure 1.2 is rudimentary at best, but it is generally agreed that some physical correlate of each of the steps in the speech perception model occurs within the human brain, and thus the entire model is useful for thinking about the processes that occur.
There is one additional process shown in the diagram of the complete speech chain in Figure 1.2 that we have not discussed, namely the transmission channel between the speech generation and speech perception parts of the model. In its simplest embodiment, this transmission channel consists of just the acoustic wave connection between a speaker and a listener who are in a common space. It is essential to include this transmission channel in our model for the speech chain since it includes real-world noise and channel distortions that make speech and message understanding more difficult in real communication environments. More interestingly for our purpose here, it is in this domain that we find the applications of digital speech processing.
1.2 Applications of Digital Speech Processing
The first step in most applications of digital speech processing is to convert the acoustic waveform to a sequence of numbers. Most modern A-to-D converters operate by sampling at a very high rate, applying a digital lowpass filter with cutoff set to preserve a prescribed bandwidth, and then reducing the sampling rate to the desired sampling rate, which can be as low as twice the cutoff frequency of the sharp-cutoff digital filter. This discrete-time representation is the starting point for most applications. From this point, other representations are obtained by digital processing. For the most part, these alternative representations are based on incorporating knowledge about the workings of the speech chain as depicted in Figure 1.2. As we will see, it is possible to incorporate aspects of both the speech production and speech perception processes into the digital representation and processing. It is not an oversimplification to assert that digital speech processing is grounded in a set of techniques that have the goal of pushing the data rate of the speech representation to the left along either the upper or lower path in Figure 1.2.
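The filter-then-downsample step described above can be sketched as follows. The windowed-sinc design, the tap count, and the 4:1 rate reduction are illustrative choices of ours; a real converter uses carefully engineered multirate filters:

```python
import math

def lowpass_fir(num_taps, cutoff):
    """Windowed-sinc lowpass FIR; cutoff is a fraction of the input Nyquist frequency."""
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        t = n - mid
        ideal = cutoff if t == 0 else math.sin(math.pi * cutoff * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))  # Hamming
        taps.append(ideal * window)
    gain = sum(taps)
    return [h / gain for h in taps]  # normalize to unity gain at DC

def decimate(x, factor, taps):
    """Lowpass filter, then keep only every factor-th output sample."""
    half = len(taps) // 2
    out = []
    for i in range(0, len(x), factor):
        acc = 0.0
        for k, h in enumerate(taps):
            j = i + half - k
            if 0 <= j < len(x):
                acc += h * x[j]
        out.append(acc)
    return out

# e.g., reduce a signal sampled 4x too fast, keeping the lowest quarter of the band
taps = lowpass_fir(101, 0.25)
y = decimate([1.0] * 400, 4, taps)
```

Filtering before the rate reduction is what prevents aliasing: energy above the new Nyquist frequency is removed before samples are discarded.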
The remainder of this chapter is devoted to a brief summary of the applications of digital speech processing, i.e., the systems that people interact with daily. Our discussion will confirm the importance of the digital representation in all application areas.
1.2.1 Speech Coding
Perhaps the most widespread applications of digital speech processing technology occur in the areas of digital transmission and storage of speech signals. In these areas the centrality of the digital representation is obvious, since the goal is to compress the digital waveform representation of speech into a lower bit-rate representation. It is common to refer to this activity as "speech coding" or "speech compression."

Fig. 1.3 Speech coding block diagram — encoder and decoder. (The encoder path runs from the speech signal x_c(t) through the A-to-D converter to samples x[n], then through analysis/encoding to data y[n] on the channel or medium; the decoder path runs from received data through synthesis/decoding to samples and a D-to-A converter, producing the decoded signal.)
Figure 1.3 shows a block diagram of a generic speech encoding/decoding (or compression) system. In the upper part of the figure, the A-to-D converter converts the analog speech signal x_c(t) to a sampled waveform representation x[n]. The digital signal x[n] is analyzed and coded by digital computation algorithms to produce a new digital signal y[n] that can be transmitted over a digital communication channel or stored in a digital storage medium as ŷ[n]. As we will see, there are a myriad of ways to do the encoding so as to reduce the data rate over that of the sampled and quantized speech waveform x[n]. Because the digital representation at this point is often not directly related to the sampled speech waveform, y[n] and ŷ[n] are appropriately referred to as data signals that represent the speech signal. The lower path in Figure 1.3 shows the decoder associated with the speech coder. The received data signal ŷ[n] is decoded using the inverse of the analysis processing, giving the sequence of samples x̂[n], which is then converted (using a D-to-A converter) back to an analog signal x̂_c(t) for human listening. The decoder is often called a synthesizer because it must reconstitute the speech waveform from data that may bear no direct relationship to the waveform.
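As a toy instance of the encoder/decoder pairing in Figure 1.3, a first-order DPCM coder transmits quantized sample-to-sample differences rather than the samples themselves. This is a sketch of the general idea (the data signal need not resemble the waveform), far simpler than any deployed speech coder; the step size is an arbitrary illustrative choice:

```python
def dpcm_encode(x, step=0.05):
    """Code each sample as a quantized difference from the previously decoded value."""
    codes, pred = [], 0.0
    for s in x:
        c = round((s - pred) / step)   # small integers: the transmitted data signal y[n]
        codes.append(c)
        pred += c * step               # track exactly what the decoder will reconstruct
    return codes

def dpcm_decode(codes, step=0.05):
    """Invert the analysis: accumulate the quantized differences."""
    out, pred = [], 0.0
    for c in codes:
        pred += c * step
        out.append(pred)
    return out

x = [0.0, 0.1, 0.2, 0.15, 0.05]
x_hat = dpcm_decode(dpcm_encode(x))   # each sample recovered to within step/2
```

Note that the encoder predicts from the decoder's reconstruction, not from the original samples, so quantization errors do not accumulate between encoder and decoder.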
With carefully designed error protection coding of the digital representation, the transmitted (y[n]) and received (ŷ[n]) data can be essentially identical. This is the quintessential feature of digital coding. In theory, perfect transmission of the coded digital representation is possible even under very noisy channel conditions, and in the case of digital storage, it is possible to store a perfect copy of the digital representation in perpetuity if sufficient care is taken to update the storage medium as storage technology advances. This means that the speech signal can be reconstructed to within the accuracy of the original coding for as long as the digital representation is retained. In either case, the goal of the speech coder is to start with samples of the speech signal and reduce (compress) the data rate required to represent the speech signal while maintaining a desired perceptual fidelity. The compressed representation can be more efficiently transmitted or stored, or the bits saved can be devoted to error protection.
Speech coders enable a broad range of applications including narrowband and broadband wired telephony, cellular communications, voice over internet protocol (VoIP) (which utilizes the internet as a real-time communications medium), secure voice for privacy and encryption (for national security applications), extremely narrowband communications channels (such as battlefield applications using high frequency (HF) radio), and storage of speech for telephone answering machines, interactive voice response (IVR) systems, and prerecorded messages. Speech coders often utilize many aspects of both the speech production and speech perception processes, and hence may not be useful for more general audio signals such as music. Coders that are based on incorporating only aspects of sound perception generally do not achieve as much compression as those based on speech production, but they are more general and can be used for all types of audio signals. These coders are widely deployed in MP3 and AAC players and for audio in digital television systems [120].
1.2.2 Text-to-Speech Synthesis
For many years, scientists and engineers have studied the speech production process with the goal of building a system that can start with text and produce speech automatically. In a sense, a text-to-speech synthesizer such as depicted in Figure 1.4 is a digital simulation of the entire upper part of the speech chain diagram. The input to the system is ordinary text such as an email message or an article from a newspaper or magazine. The first block in the text-to-speech synthesis system, labeled linguistic rules, has the job of converting the printed text input into a set of sounds that the machine must synthesize. The conversion from text to sounds involves a set of linguistic rules that must determine the appropriate set of sounds (perhaps including things like emphasis, pauses, rates of speaking, etc.) so that the resulting synthetic speech will express the words and intent of the text message in what passes for a natural voice that can be decoded accurately by human speech perception. This is more difficult than simply looking up the words in a pronouncing dictionary because the linguistic rules must determine how to pronounce acronyms, how to pronounce ambiguous words like read, bass, object, how to pronounce abbreviations like St. (street or Saint), Dr. (Doctor or drive), and how to properly pronounce proper names, specialized terms, etc. Once the proper pronunciation of the text has been determined, the role of the synthesis algorithm is to create the appropriate sound sequence to represent the text message in the form of speech. In essence, the synthesis algorithm must simulate the action of the vocal tract system in creating the sounds of speech. There are many procedures for assembling the speech sounds and compiling them into a proper sentence, but the most promising one today is called "unit selection and concatenation." In this method, the computer stores multiple versions of each of the basic units of speech (phones, half-phones, syllables, etc.), and then decides which sequence of speech units sounds best for the particular text message that is being produced. The basic digital representation is not generally the sampled speech wave. Instead, some sort of compressed representation is normally used to save memory and, more importantly, to allow convenient manipulation of durations and blending of adjacent sounds. Thus, the speech synthesis algorithm would include an appropriate decoder, as discussed in Section 1.2.1, whose output is converted to an analog representation via the D-to-A converter.

Fig. 1.4 Text-to-speech synthesis system block diagram (text, linguistic rules, synthesis algorithm, D-to-A converter, speech).
Text-to-speech synthesis systems are an essential component of modern human–machine communications systems and are used to do things like read email messages over a telephone, provide voice output from GPS systems in automobiles, provide the voices for talking agents for completion of transactions over the internet, handle call center help desks and customer care applications, serve as the voice for providing information from handheld devices such as foreign language phrasebooks, dictionaries, crossword puzzle helpers, and as the voice of announcement machines that provide information such as stock quotes, airline schedules, updates on arrivals and departures of flights, etc. Another important application is in reading machines for the blind, where an optical character recognition system provides the text input to a speech synthesis system.
1.2.3 Speech Recognition and Other Pattern Matching Problems
Another large class of digital speech processing applications is concerned with the automatic extraction of information from the speech signal. Most such systems involve some sort of pattern matching. Figure 1.5 shows a block diagram of a generic approach to pattern matching problems in speech processing. Such problems include the following: speech recognition, where the object is to extract the message from the speech signal; speaker recognition, where the goal is to identify who is speaking; speaker verification, where the goal is to verify a speaker's claimed identity from analysis of their speech signal; word spotting, which involves monitoring a speech signal for the occurrence of specified words or phrases; and automatic indexing of speech recordings based on recognition (or spotting) of spoken keywords.

Fig. 1.5 Block diagram of general pattern matching system for speech signals (speech, A-to-D converter, feature analysis, pattern matching, symbols).
The first block in the pattern matching system converts the analog speech waveform to digital form using an A-to-D converter. The feature analysis module converts the sampled speech signal to a set of feature vectors. Often, the same analysis techniques that are used in speech coding are also used to derive the feature vectors. The final block in the system, namely the pattern matching block, dynamically time aligns the set of feature vectors representing the speech signal with a concatenated set of stored patterns, and chooses the identity associated with the pattern that is the closest match to the time-aligned set of feature vectors of the speech signal. The symbolic output consists of a set of recognized words, in the case of speech recognition, or the identity of the best matching talker, in the case of speaker recognition, or a decision as to whether to accept or reject the identity claim of a speaker, in the case of speaker verification.
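The dynamic time alignment step can be illustrated with a bare-bones dynamic time warping (DTW) computation over one-dimensional "features." Real systems align sequences of multidimensional spectral feature vectors (and modern recognizers rely on statistical models rather than template matching), so treat this only as a sketch of the alignment-and-nearest-template idea; the template values are invented:

```python
def dtw_cost(a, b):
    """Minimum cumulative distance between sequences a and b under time warping."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])          # local feature distance
            D[i][j] = d + min(D[i - 1][j],        # advance a only
                              D[i][j - 1],        # advance b only
                              D[i - 1][j - 1])    # advance both
    return D[n][m]

def recognize(features, templates):
    """Return the label of the stored pattern with the lowest warped distance."""
    return min(templates, key=lambda label: dtw_cost(features, templates[label]))

templates = {"yes": [1.0, 2.0, 3.0, 2.0, 1.0], "no": [5.0, 5.0, 4.0, 5.0]}
best = recognize([1.0, 2.0, 2.0, 3.0, 2.0, 1.0], templates)  # "yes"
```

The warping matters because two utterances of the same word are rarely spoken at the same rate: here the input matches the "yes" template perfectly once one frame is allowed to repeat.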
Although the block diagram of Figure 1.5 represents a wide range of speech pattern matching problems, the biggest use has been in the area of recognition and understanding of speech in support of human–machine communication by voice. The major areas where such a system finds applications include command and control of computer software, voice dictation to create letters, memos, and other documents, natural language voice dialogues with machines to enable help desks and call centers, and agent services such as calendar entry and update, address list modification and entry, etc.
Pattern recognition applications often occur in conjunction with other digital speech processing applications. For example, one of the preeminent uses of speech technology is in portable communication devices. Speech coding at bit rates on the order of 8 Kbps enables normal voice conversations in cell phones. Spoken name speech recognition in cell phones enables voice dialing capability that can automatically dial the number associated with the recognized name. Names from directories with upwards of several hundred names can readily be recognized and dialed using simple speech recognition technology.
Another major speech application that has long been a dream of speech researchers is automatic language translation. The goal of language translation systems is to convert spoken words in one language to spoken words in another language so as to facilitate natural language voice dialogues between people speaking different languages. Language translation technology requires speech synthesis systems that work in both languages, along with speech recognition (and generally natural language understanding) that also works for both languages; hence it is a very difficult task and one for which only limited progress has been made. When such systems exist, it will be possible for people speaking different languages to communicate at data rates on the order of that of printed text reading!
1.2.4 Other Speech Applications
The range of speech communication applications is illustrated in Figure 1.6. As seen in this figure, the techniques of digital speech processing are a key ingredient of a wide range of applications that include the three areas of transmission/storage, speech synthesis, and speech recognition, as well as many others such as speaker identification, speech signal quality enhancement, and aids for the hearing- or visually-impaired.

The block diagram in Figure 1.7 represents any system where time signals such as speech are processed by the techniques of DSP. This figure simply depicts the notion that once the speech signal is sampled, it can be manipulated in virtually limitless ways by DSP techniques. Here again, manipulations and modifications of the speech signal are
Fig. 1.6 Range of speech communication applications (digital speech processing techniques feeding digital transmission and storage, speech synthesis, speech recognition, speaker verification/identification, enhancement of speech quality, and aids for the handicapped).
Fig. 1.7 General block diagram for application of digital signal processing to speech signals (speech, A-to-D converter, computer algorithm, D-to-A converter, speech).
usually achieved by transforming the speech signal into an alternative representation (that is motivated by our understanding of speech production and speech perception), operating on that representation by further digital computation, and then transforming back to the waveform domain, using a D-to-A converter.

One important application area is speech enhancement, where the goal is to remove or suppress noise, echo, or reverberation picked up by a microphone along with the desired speech signal. In human-to-human communication, the goal of speech enhancement systems is to make the speech more intelligible and more natural; however, in reality the best that has been achieved so far is less perceptually annoying speech that essentially maintains, but does not improve, the intelligibility of the noisy speech. Success has been achieved, however, in making distorted speech signals more useful for further processing as part of a speech coder, synthesizer, or recognizer. An excellent reference in this area is the recent textbook by Loizou [72].
Other examples of manipulation of the speech signal include time-scale modification to align voices with video segments, to modify voice qualities, and to speed up or slow down prerecorded speech (e.g., for talking books, rapid review of voice mail messages, or careful scrutinizing of spoken material).
1.3 Our Goal for this Text
We have discussed the speech signal and how it encodes information for human communication. We have given a brief overview of the way in which digital speech processing is being applied today, and we have hinted at some of the possibilities that exist for the future. These and many more examples all rely on the basic principles of digital speech processing, which we will discuss in the remainder of this text. We make no pretense of exhaustive coverage. The subject is too broad and too deep. Our goal is only to provide an up-to-date introduction to this fascinating field. We will not be able to go into great depth, and we will not be able to cover all the possible applications of digital speech processing techniques. Instead our focus is on the fundamentals of digital speech processing and their application to coding, synthesis, and recognition. This means that some of the latest algorithmic innovations and applications will not be discussed, not because they are not interesting, but simply because there are so many fundamental tried-and-true techniques that remain at the core of digital speech processing. We hope that this text will stimulate readers to investigate the subject in greater depth using the extensive set of references provided.
2 The Speech Signal
As the discussion in Chapter 1 shows, the goal in many applications of digital speech processing techniques is to move the digital representation of the speech signal from the waveform samples back up the speech chain toward the message. To gain a better idea of what this means, this chapter provides a brief overview of the phonetic representation of speech and an introduction to models for the production of the speech signal.
2.1 Phonetic Representation of Speech
Speech can be represented phonetically by a ﬁnite set of symbols called
the phonemes of the language, the number of which depends upon the
language and the reﬁnement of the analysis. For most languages the
number of phonemes is between 32 and 64. A condensed inventory of
the sounds of speech in the English language is given in Table 2.1,
where the phonemes are denoted by a set of ASCII symbols called the
ARPAbet. Table 2.1 also includes some simple examples of ARPAbet
transcriptions of words containing each of the phonemes of English.
Additional phonemes can be added to Table 2.1 to account for allophonic
variations and events such as glottal stops and pauses.
Table 2.1 Condensed list of ARPAbet phonetic symbols for North American English.

Class                   ARPAbet   Example   Transcription
Vowels and diphthongs   IY        beet      [B IY T]
                        IH        bit       [B IH T]
                        EY        bait      [B EY T]
                        EH        bet       [B EH T]
                        AE        bat       [B AE T]
                        AA        bob       [B AA B]
                        AO        born      [B AO R N]
                        UH        book      [B UH K]
                        OW        boat      [B OW T]
                        UW        boot      [B UW T]
                        AH        but       [B AH T]
                        ER        bird      [B ER D]
                        AY        buy       [B AY]
                        AW        down      [D AW N]
                        OY        boy       [B OY]
Glides                  Y         you       [Y UH]
                        R         rent      [R EH N T]
Liquids                 W         wit       [W IH T]
                        L         let       [L EH T]
Nasals                  M         met       [M EH T]
                        N         net       [N EH T]
                        NG        sing      [S IH NG]
Stops                   P         pat       [P AE T]
                        B         bet       [B EH T]
                        T         ten       [T EH N]
                        D         debt      [D EH T]
                        K         kit       [K IH T]
                        G         get       [G EH T]
Fricatives              HH        hat       [HH AE T]
                        F         fat       [F AE T]
                        V         vat       [V AE T]
                        TH        thing     [TH IH NG]
                        DH        that      [DH AE T]
                        S         sat       [S AE T]
                        Z         zoo       [Z UW]
                        SH        shut      [SH AH T]
                        ZH        azure     [AE ZH ER]
Affricates              CH        chase     [CH EY S]
                        JH        judge     [JH AH JH]

Note: This set of 39 phonemes is used in the CMU Pronouncing Dictionary, available online at
http://www.speech.cs.cmu.edu/cgi-bin/cmudict.
Figure 1.1 on p. 4 shows how the sounds corresponding to the text
“should we chase” are encoded into a speech waveform. We see that, for
the most part, phonemes have a distinctive appearance in the speech
waveform. Thus sounds like /SH/ and /S/ look like (spectrally shaped)
random noise, while the vowel sounds /UH/, /IY/, and /EY/ are highly
structured and quasi-periodic. These differences result from the
distinctively different ways that these sounds are produced.
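The phonetic representation above lends itself to a simple computational illustration. The sketch below stores a few word-level ARPAbet transcriptions in a Python dictionary and concatenates them to transcribe the phrase "should we chase" from Figure 1.1. The entry for "chase" comes from Table 2.1; the entries for "should" and "we" are our own ARPAbet renderings, added here only for the example.

```python
# Minimal sketch of a phonemic lexicon in the style of Table 2.1.
# "chase" is taken from the table; "should" and "we" are illustrative
# additions written in the same ARPAbet notation.
LEXICON = {
    "should": ["SH", "UH", "D"],
    "we": ["W", "IY"],
    "chase": ["CH", "EY", "S"],
}

def transcribe(phrase):
    """Return the concatenated phoneme sequence for a phrase of known words."""
    phones = []
    for word in phrase.lower().split():
        phones.extend(LEXICON[word])
    return phones

print(transcribe("should we chase"))
# ['SH', 'UH', 'D', 'W', 'IY', 'CH', 'EY', 'S']
```

A real system would of course use a full pronouncing dictionary (such as the CMU dictionary cited in Table 2.1) rather than a three-word toy lexicon.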
2.2 Models for Speech Production
A schematic longitudinal cross-sectional drawing of the human vocal
tract mechanism is given in Figure 2.1 [35]. This diagram highlights
the essential physical features of human anatomy that enter into the
final stages of the speech production process. It shows the vocal tract
as a tube of nonuniform cross-sectional area that is bounded at one end
by the vocal cords and at the other by the mouth opening. This tube
serves as an acoustic transmission system for sounds generated inside
the vocal tract. For creating nasal sounds like /M/, /N/, or /NG/,
a side-branch tube, called the nasal tract, is connected to the main
acoustic branch by the trapdoor action of the velum. This branch path
radiates sound at the nostrils. The shape (variation of cross-section
along the axis) of the vocal tract varies with time due to motions of
the lips, jaw, tongue, and velum. Although the actual human vocal
tract is not laid out along a straight line as in Figure 2.1, this type of
model is a reasonable approximation for wavelengths of the sounds in
speech.
The sounds of speech are generated in the system of Figure 2.1 in
several ways.
Fig. 2.1 Schematic model of the vocal tract system. (After Flanagan et al. [35].)
Voiced sounds (vowels, liquids, glides, nasals in Table 2.1) are produced
when the vocal tract tube is excited by pulses of air pressure resulting
from quasi-periodic opening and closing of the glottal orifice (the opening
between the vocal cords). Examples in Figure 1.1 are the vowels /UH/,
/IY/, and /EY/, and the liquid consonant /W/. Unvoiced sounds are
produced by creating a constriction somewhere in the vocal tract tube
and forcing air through that constriction, thereby creating turbulent air
flow, which acts as a random noise excitation of the vocal tract tube.
Examples are the unvoiced fricative sounds such as /SH/ and /S/. A third
sound production mechanism occurs when the vocal tract is partially
closed off, causing turbulent flow due to the constriction, while at the
same time allowing quasi-periodic flow due to vocal cord vibrations.
Sounds produced in this manner include the voiced fricatives /V/, /DH/,
/Z/, and /ZH/. Finally, plosive sounds such as /P/, /T/, and /K/ and
affricates such as /CH/ are formed by momentarily closing off the air
flow, allowing pressure to build up behind the closure, and then abruptly
releasing the pressure. All these excitation sources create a wideband
excitation signal for the vocal tract tube, which acts as an acoustic
transmission line with certain vocal-tract-shape-dependent resonances
that tend to emphasize some frequencies of the excitation relative to
others.
As discussed in Chapter 1 and illustrated by the waveform in
Figure 1.1, the general character of the speech signal varies at the
phoneme rate, which is on the order of 10 phonemes per second, while
the detailed time variations of the speech waveform are at a much higher
rate. That is, the changes in vocal tract conﬁguration occur relatively
slowly compared to the detailed time variation of the speech signal. The
sounds created in the vocal tract are shaped in the frequency domain
by the frequency response of the vocal tract. The resonance frequencies
resulting from a particular configuration of the articulators are
instrumental in forming the sound corresponding to a given phoneme. These
resonance frequencies are called the formant frequencies of the sound
[32, 34]. In summary, the ﬁne structure of the time waveform is created
by the sound sources in the vocal tract, and the resonances of the vocal
tract tube shape these sound sources into the phonemes.
The system of Figure 2.1 can be described by acoustic theory,
and numerical techniques can be used to create a complete physical
Fig. 2.2 Source/system model for a speech signal. (An excitation generator, controlled by excitation parameters, produces an excitation signal that drives a linear system, controlled by vocal tract parameters, to produce the speech signal.)
simulation of sound generation and transmission in the vocal tract
[36, 93], but, for the most part, it is sufficient to model the production
of a sampled speech signal by a discrete-time system model such
as the one depicted in Figure 2.2. The discrete-time, time-varying linear
system on the right in Figure 2.2 simulates the frequency shaping of
the vocal tract tube. The excitation generator on the left simulates the
different modes of sound generation in the vocal tract. Samples of a
speech signal are assumed to be the output of the time-varying linear
system.
In general, such a model is called a source/system model of speech
production. The short-time frequency response of the linear system
simulates the frequency shaping of the vocal tract system, and since the
vocal tract changes shape relatively slowly, it is reasonable to assume
that the linear system response does not vary over time intervals on the
order of 10 ms or so. Thus, it is common to characterize the discrete-time
linear system by a system function of the form:

H(z) = ( Σ_{k=0}^{M} b_k z^{−k} ) / ( 1 − Σ_{k=1}^{N} a_k z^{−k} )
     = b_0 Π_{k=1}^{M} (1 − d_k z^{−1}) / Π_{k=1}^{N} (1 − c_k z^{−1}),   (2.1)
where the filter coefficients a_k and b_k (labeled as vocal tract parameters
in Figure 2.2) change at a rate on the order of 50–100 times/s. Some
of the poles (c_k) of the system function lie close to the unit circle
and create resonances to model the formant frequencies. In detailed
modeling of speech production [32, 34, 64], it is sometimes useful to
employ zeros (d_k) of the system function to model nasal and fricative
sounds. However, as we discuss further in Chapter 4, many applications
of the source/system model only include poles in the model because this
simplifies the analysis required to estimate the parameters of the model
from the speech signal.
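The role of the poles in (2.1) can be illustrated numerically. The sketch below evaluates the frequency response H(e^{jω}) of a small all-pole model on the unit circle; the pole radius and the two "formant" frequencies (500 Hz and 1500 Hz at an assumed 8 kHz sampling rate) are invented for the demonstration, not taken from the text.

```python
import cmath
import math

def freq_response(b, a, omega):
    """Evaluate H(e^{j*omega}) for the rational system function of (2.1).
    Here a[] is the fully expanded denominator polynomial (a[0] = 1,
    minus signs already folded in)."""
    z = cmath.exp(1j * omega)
    num = sum(bk * z ** (-k) for k, bk in enumerate(b))
    den = sum(ak * z ** (-k) for k, ak in enumerate(a))
    return num / den

# Build the denominator A(z) as a product of complex-conjugate pole
# pairs close to the unit circle. The formant values are hypothetical.
fs = 8000.0
a = [1.0]
for f_formant in (500.0, 1500.0):
    pole = 0.95 * cmath.exp(2j * math.pi * f_formant / fs)
    # (1 - p z^-1)(1 - p* z^-1) = 1 - 2 Re(p) z^-1 + |p|^2 z^-2
    factor = [1.0, -2.0 * pole.real, abs(pole) ** 2]
    # polynomial multiplication (convolution of coefficient lists)
    a = [sum(a[i] * factor[j]
             for i in range(len(a)) for j in range(len(factor))
             if i + j == n)
         for n in range(len(a) + 2)]
b = [1.0]  # all-pole model: constant numerator

def gain_at(f):
    return abs(freq_response(b, a, 2.0 * math.pi * f / fs))

# The magnitude response peaks near the pole angles (the formants)
# and falls off away from them.
print(gain_at(500.0) > gain_at(1000.0) > gain_at(3000.0))  # True
```

Sweeping `gain_at` over frequency would trace out the familiar formant-resonance spectrum envelope described in the text.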
The box labeled excitation generator in Figure 2.2 creates an appropriate
excitation for the type of sound being produced. For voiced
speech, the excitation to the linear system is a quasi-periodic sequence
of discrete (glottal) pulses that look very much like those shown in the
right-hand half of the excitation signal waveform in Figure 2.2. The
fundamental frequency of the glottal excitation determines the perceived
pitch of the voice. The individual finite-duration glottal pulses have a
lowpass spectrum that depends on a number of factors [105]. Therefore,
the periodic sequence of smooth glottal pulses has a harmonic line
spectrum with components that decrease in amplitude with increasing
frequency. Often it is convenient to absorb the glottal pulse spectrum
contribution into the vocal tract system model of (2.1). This can be
achieved by a small increase in the order of the denominator over what
would be needed to represent the formant resonances. For unvoiced
speech, the linear system is excited by a random number generator
that produces a discrete-time noise signal with flat spectrum, as shown
in the left-hand half of the excitation signal. The excitation in
Figure 2.2 switches from unvoiced to voiced, leading to the speech signal
output as shown in the figure. In either case, the linear system imposes
its frequency response on the spectrum to create the speech sounds.
This model of speech as the output of a slowly time-varying digital
filter, with an excitation that captures the nature of the voiced/unvoiced
distinction in speech production, is the basis for thinking about the
speech signal, and a wide variety of digital representations of the speech
signal are based on it. That is, the speech signal is represented by the
parameters of the model instead of the sampled waveform. By assuming
that the properties of the speech signal (and the model) are constant
over short time intervals, it is possible to compute/measure/estimate
the parameters of the model by analyzing short blocks of samples of
the speech signal. It is through such models and analysis techniques
that we are able to build properties of the speech production process
into digital representations of the speech signal.
2.3 More Reﬁned Models
Source/system models as shown in Figure 2.2, with the system
characterized by a time-sequence of time-invariant systems, are quite
sufficient for most applications in speech processing, and we shall rely on
such models throughout this text. However, such models are based on
many approximations, including the assumption that the source and the
system do not interact, the assumption of linearity, and the assumption
that the distributed continuous-time vocal tract transmission system
can be modeled by a discrete linear time-invariant system. Fluid
mechanics and acoustic wave propagation theory are fundamental physical
principles that must be applied for detailed modeling of speech
production. Since the early work of Flanagan and Ishizaka [34, 36, 51],
much work has been devoted to creating detailed simulations of glottal
flow, the interaction of the glottal source and the vocal tract in speech
production, and the nonlinearities that enter into sound generation and
transmission in the vocal tract. Stevens [121] and Quatieri [94] provide
useful discussions of these effects. For many years, researchers have
sought to measure the physical dimensions of the human vocal tract
during speech production. This information is essential for detailed
simulations based on acoustic theory. Early efforts to measure vocal tract
area functions involved hand tracing on X-ray pictures [32]. Recent
advances in MRI and computer image analysis have provided
significant advances in this area of speech science [17].
3
Hearing and Auditory Perception
In Chapter 2, we introduced the speech production process and showed
how we could model speech production using discrete-time systems.
In this chapter we turn to the perception side of the speech chain to
discuss properties of human sound perception that can be employed to
create digital representations of the speech signal that are perceptually
robust.
3.1 The Human Ear
Figure 3.1 shows a schematic view of the human ear, identifying its three
distinct sound-processing sections: the outer ear, consisting of
the pinna, which gathers sound and conducts it through the external
canal to the middle ear; the middle ear beginning at the tympanic
membrane, or eardrum, and including three small bones, the malleus
(also called the hammer), the incus (also called the anvil) and the stapes
(also called the stirrup), which perform a transduction from acoustic
waves to mechanical pressure waves; and ﬁnally, the inner ear, which
consists of the cochlea and the set of neural connections to the auditory
nerve, which conducts the neural signals to the brain.
Fig. 3.1 Schematic view of the human ear (inner and middle structures enlarged). (After
Flanagan [34].)
Figure 3.2 [107] depicts a block diagram abstraction of the auditory
system. The acoustic wave is transmitted from the outer ear to the
inner ear where the ear drum and bone structures convert the sound
wave to mechanical vibrations which ultimately are transferred to the
basilar membrane inside the cochlea. The basilar membrane vibrates in
a frequency-selective manner along its extent and thereby performs a
rough (nonuniform) spectral analysis of the sound. Distributed along
Fig. 3.2 Schematic model of the auditory mechanism. (After Sachs et al. [107].)
the basilar membrane are a set of inner hair cells that serve to convert
motion along the basilar membrane to neural activity. This produces
an auditory nerve representation in both time and frequency. The
processing at higher levels in the brain, shown in Figure 3.2 as a sequence
of central processing with multiple representations followed by some
type of pattern recognition, is not well understood, and we can only
postulate the mechanisms used by the human brain to perceive sound
or speech. Even so, a wealth of knowledge about how sounds are
perceived has been discovered by careful experiments that use tones and
noise signals to stimulate the auditory system of human observers in
very speciﬁc and controlled ways. These experiments have yielded much
valuable knowledge about the sensitivity of the human auditory system
to acoustic properties such as intensity and frequency.
3.2 Perception of Loudness
A key factor in the perception of speech and other sounds is loudness.
Loudness is a perceptual quality that is related to the physical property
of sound pressure level. Loudness is quantified by relating the actual
sound pressure level of a pure tone (in dB relative to a standard
reference level) to the perceived loudness of the same tone (in a unit called
phons) over the range of human hearing (20 Hz–20 kHz). This
relationship is shown in Figure 3.3 [37, 103]. These loudness curves show that
the perception of loudness is frequency-dependent. Specifically, the
dotted curve at the bottom of the figure, labeled "threshold of audibility,"
shows the sound pressure level that is required for a sound of a given
frequency to be just audible (by a person with normal hearing). It can
be seen that low frequencies must be significantly more intense than
frequencies in the mid-range in order that they be perceived at all. The
solid curves are equal-loudness-level contours measured by comparing
sounds at various frequencies with a pure tone of frequency 1000 Hz and
known sound pressure level. For example, the point at frequency 100 Hz
on the curve labeled 50 (phons) is obtained by adjusting the power of
the 100 Hz tone until it sounds as loud as a 1000 Hz tone having a sound
pressure level of 50 dB. Careful measurements of this kind show that a
100 Hz tone must have a sound pressure level of about 60 dB in order
Fig. 3.3 Loudness level for human hearing. (After Fletcher and Munson [37].)
to be perceived to be equal in loudness to the 1000 Hz tone of sound
pressure level 50 dB. By convention, both the 50 dB 1000 Hz tone and
the 60 dB 100 Hz tone are said to have a loudness level of 50 phons
(pronounced as /F OW N Z/).
The equal-loudness-level curves show that the auditory system is
most sensitive for frequencies ranging from about 100 Hz up to about
6 kHz, with the greatest sensitivity at around 3 to 4 kHz. This is almost
precisely the range of frequencies occupied by most of the sounds of
speech.
3.3 Critical Bands
The nonuniform frequency analysis performed by the basilar membrane
can be thought of as equivalent to that of a set of bandpass filters
whose frequency responses become increasingly broad with increasing
frequency. An idealized version of such a filter bank is depicted
schematically in Figure 3.4. In reality, the bandpass filters are not ideal
Fig. 3.4 Schematic representation of bandpass ﬁlters according to the critical band theory
of hearing.
as shown in Figure 3.4, but their frequency responses overlap
significantly, since points on the basilar membrane cannot vibrate
independently of each other. Even so, the concept of bandpass filter analysis
in the cochlea is well established, and the critical bandwidths have
been defined and measured using a variety of methods, showing that
the effective bandwidths are constant at about 100 Hz for center
frequencies below 500 Hz, with a relative bandwidth of about 20%
of the center frequency above 500 Hz. An equation that fits empirical
measurements over the auditory range is

∆f_c = 25 + 75 [ 1 + 1.4 (f_c/1000)² ]^0.69,   (3.1)

where ∆f_c is the critical bandwidth associated with center frequency
f_c [134]. Approximately 25 critical band filters span the range from
0 to 20 kHz. The concept of critical bands is very important in
understanding such phenomena as loudness perception, pitch perception, and
masking, and it therefore provides motivation for digital representations
of the speech signal that are based on a frequency decomposition.
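Equation (3.1) is straightforward to evaluate. The short sketch below checks the two regimes described above: a roughly constant bandwidth of about 100 Hz at low center frequencies, and a bandwidth near 20% of f_c well above 500 Hz.

```python
def critical_bandwidth(fc):
    """Critical bandwidth in Hz for center frequency fc in Hz, per (3.1)."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (fc / 1000.0) ** 2) ** 0.69

# Low center frequencies: bandwidth stays near 100 Hz.
print(round(critical_bandwidth(100.0)))   # 101
# Higher center frequencies: bandwidth grows roughly like 20% of fc;
# at 1000 Hz, (3.1) gives about the 160 Hz value quoted in Section 3.4.
print(round(critical_bandwidth(1000.0)))  # 162
```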
3.4 Pitch Perception
Most musical sounds as well as voiced speech sounds have a periodic
structure when viewed over short time intervals, and such sounds are
perceived by the auditory system as having a quality known as pitch.
Like loudness, pitch is a subjective attribute of sound that is related to
the fundamental frequency of the sound, which is a physical attribute of
the acoustic waveform [122].
Fig. 3.5 Relation between subjective pitch and frequency of a pure tone.
The relationship between pitch (measured on a nonlinear frequency scale
called the mel scale) and frequency of a pure tone is approximated by
the equation [122]:

Pitch in mels = 1127 log_e(1 + f/700),   (3.2)
which is plotted in Figure 3.5. This expression is calibrated so that a
frequency of 1000 Hz corresponds to a pitch of 1000 mels. This empirical
scale describes the results of experiments where subjects were asked to
adjust the pitch of a measurement tone to half the pitch of a reference
tone. To calibrate the scale, a tone of frequency 1000 Hz is given a
pitch of 1000 mels. Below 1000 Hz, the relationship between pitch and
frequency is nearly proportional. For higher frequencies, however, the
relationship is nonlinear. For example, (3.2) shows that a frequency of
f = 5000 Hz corresponds to a pitch of 2364 mels.
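Equation (3.2) translates directly into code. The sketch below reproduces the two calibration points quoted in the text: 1000 Hz maps to 1000 mels, and 5000 Hz maps to approximately 2364 mels.

```python
import math

def hz_to_mel(f):
    """Pitch in mels for a pure tone of frequency f Hz, per (3.2)."""
    return 1127.0 * math.log(1.0 + f / 700.0)

# Calibration point: 1000 Hz corresponds to 1000 mels.
print(round(hz_to_mel(1000.0)))     # 1000
# At 5000 Hz, (3.2) gives roughly the 2364 mels quoted in the text.
print(round(hz_to_mel(5000.0), 1))  # 2363.5
```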
The psychophysical phenomenon of pitch, as quantiﬁed by the mel
scale, can be related to the concept of critical bands [134]. It turns out
that more or less independently of the center frequency of the band, one
critical bandwidth corresponds to about 100 mels on the pitch scale.
This is shown in Figure 3.5, where a critical band of width ∆f_c = 160 Hz
centered on f_c = 1000 Hz maps into a band of width 106 mels and a
critical band of width 100 Hz centered on 350 Hz maps into a band of
width 107 mels. Thus, what we know about pitch perception reinforces
the notion that the auditory system performs a frequency analysis that
can be simulated with a bank of bandpass ﬁlters whose bandwidths
increase as center frequency increases.
Voiced speech is quasi-periodic, but contains many frequencies.
Nevertheless, many of the results obtained with pure tones are relevant to
the perception of voice pitch as well. Often the term pitch period is used
for the fundamental period of the voiced speech signal even though its
usage in this way is somewhat imprecise.
3.5 Auditory Masking
The phenomenon of critical band auditory analysis can be explained
intuitively in terms of vibrations of the basilar membrane. A related
phenomenon, called masking, is also attributable to the mechanical
vibrations of the basilar membrane. Masking occurs when one sound
makes a second superimposed sound inaudible. Loud tones causing
strong vibrations at a point on the basilar membrane can swamp out
vibrations that occur nearby. Pure tones can mask other pure tones,
and noise can mask pure tones as well. A detailed discussion of masking
can be found in [134].
Fig. 3.6 Illustration of the effects of masking. (A masker tone shifts the threshold of hearing; signals below the shifted threshold are masked (dashed), those above remain audible; amplitude vs. log frequency.)
Figure 3.6 illustrates masking of tones by tones. The notion that a
sound becomes inaudible can be quantified with respect to the threshold
of audibility. As shown in Figure 3.6, an intense tone (called the
masker) tends to raise the threshold of audibility around its location on
the frequency axis, as shown by the solid line. All spectral components
whose level is below this raised threshold are masked and therefore do
not need to be reproduced in a speech (or audio) processing system
because they would not be heard. Similarly, any spectral component
whose level is above the raised threshold is not masked, and therefore
will be heard. It has been shown that the masking eﬀect is greater for
frequencies above the masking frequency than below. This is shown in
Figure 3.6 where the falloﬀ of the shifted threshold is less abrupt above
than below the masker.
Masking is widely employed in digital representations of speech (and
audio) signals by “hiding” errors in the representation in areas where
the threshold of hearing is elevated by strong frequency components in
the signal [120]. In this way, it is possible to achieve lower data rate
representations while maintaining a high degree of perceptual ﬁdelity.
3.6 Complete Model of Auditory Processing
In Chapter 2, we described elements of a generative model of speech
production which could, in theory, completely describe and model the
ways in which speech is produced by humans. In this chapter, we
described elements of a model of speech perception. However, the
problem that arises is that our detailed knowledge of how speech is
perceived and understood, beyond the basilar membrane processing
of the inner ear, is rudimentary at best, and thus we rely on
psychophysical experimentation to understand the role of loudness, critical
bands, pitch perception, and auditory masking in speech perception in
humans. Although some excellent auditory models have been proposed
[41, 48, 73, 115] and used in a range of speech processing systems, all
such models are incomplete representations of our knowledge about
how speech is understood.
4
Short-Time Analysis of Speech
In Figure 2.2 of Chapter 2, we presented a model for speech production
in which an excitation source provides the basic temporal fine
structure while a slowly varying filter provides spectral shaping (often
referred to as the spectrum envelope) to create the various sounds of
speech. In Figure 2.2, the source/system separation was presented at
an abstract level, and few details of the excitation or the linear system
were given. Both the excitation and the linear system were defined
implicitly by the assertion that the sampled speech signal was the output
of the overall system. Clearly, this is not sufficient to uniquely
specify either the excitation or the system. Since our goal is to extract
parameters of the model by analysis of the speech signal, it is common
to assume structures (or representations) for both the excitation
generator and the linear system. One such model uses a more detailed
representation of the excitation in terms of separate source generators
for voiced and unvoiced speech as shown in Figure 4.1. In this model
the unvoiced excitation is assumed to be a random noise sequence,
and the voiced excitation is assumed to be a periodic impulse train
with impulses spaced by the pitch period (P_0) rounded to the nearest
sample (footnote 1).
Fig. 4.1 Voiced/unvoiced/system model for a speech signal.
The pulses needed to model the glottal flow waveform during
voiced speech are assumed to be combined (by convolution) with
the impulse response of the linear system, which is assumed to be
slowly time-varying (changing every 50–100 ms or so). By this we mean
that over the time-scale of phonemes, the impulse response, frequency
response, and system function of the system remain relatively constant.
For example, over time intervals of tens of milliseconds, the system
can be described by the convolution expression
s_n̂[n] = Σ_{m=0}^{∞} h_n̂[m] e_n̂[n − m],   (4.1)

where the subscript n̂ denotes the time index pointing to the block of
samples of the entire speech signal s[n] wherein the impulse response
h_n̂[m] applies. We use n for the time index within that interval, and m is
the index of summation in the convolution sum. In this model, the gain
G_n̂ is absorbed into h_n̂[m] for convenience. As discussed in Chapter 2,
the most general linear time-invariant system would be characterized
by a rational system function as given in (2.1). However, to simplify
analysis, it is often assumed that the system is an all-pole system with
Footnote 1: As mentioned in Chapter 3, the period of voiced speech is related to the fundamental frequency (perceived as pitch) of the voice.
system function of the form:
H(z) = G / ( 1 − Σ_{k=1}^{p} a_k z^{−k} ).   (4.2)
The coefficients G and a_k in (4.2) change with time, and should
therefore be indexed with the subscript n̂ as in (4.1) and Figure 4.1, but
this complication of notation is not generally necessary since it is usually
clear that the system function or difference equation only applies
over a short time interval (footnote 2). Although the linear system is assumed to
model the composite spectrum effects of radiation, vocal tract tube,
and glottal excitation pulse shape (for voiced speech only) over a short
time interval, the linear system in the model is commonly referred
to as simply the "vocal tract" system and the corresponding impulse
response is called the "vocal tract impulse response." For all-pole linear
systems, as represented by (4.2), the input and output are related by
a difference equation of the form:

s[n] = Σ_{k=1}^{p} a_k s[n − k] + G e[n],   (4.3)

where, as discussed above, we have suppressed the indication of the time
at which the difference equation applies.
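The difference equation (4.3) can be implemented directly as a recursion. The sketch below synthesizes the output of a small all-pole "vocal tract" for both a voiced-style impulse-train excitation and an unvoiced-style noise excitation, in the spirit of Figure 4.1; the pole radius, resonance frequency, pitch period, and sampling rate are all invented for the demonstration.

```python
import math
import random

def synthesize(a, excitation, G=1.0):
    """Direct implementation of the all-pole difference equation (4.3):
    s[n] = sum_{k=1}^{p} a[k] s[n-k] + G e[n], with a_1..a_p in list a."""
    s = []
    for n, e in enumerate(excitation):
        val = G * e
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                val += ak * s[n - k]
        s.append(val)
    return s

# Illustrative second-order resonator: one complex-conjugate pole pair
# at radius 0.95, resonance 200 Hz at an assumed 8 kHz sampling rate.
r = 0.95
theta = 2.0 * math.pi * 200.0 / 8000.0
a = [2.0 * r * math.cos(theta), -r * r]  # a_1 and a_2 in (4.3)

# Voiced-style excitation: impulse train with a 40-sample pitch period.
voiced = [1.0 if n % 40 == 0 else 0.0 for n in range(200)]
# Unvoiced-style excitation: zero-mean random noise.
random.seed(0)
unvoiced = [random.uniform(-1.0, 1.0) for _ in range(200)]

s_voiced = synthesize(a, voiced)
s_unvoiced = synthesize(a, unvoiced)
print(len(s_voiced), len(s_unvoiced))  # 200 200
```

With the pole inside the unit circle the recursion is stable, so the output stays bounded; the voiced output shows damped oscillations restarting at each pitch pulse, while the noise-driven output is aperiodic.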
With such a model as the basis, common practice is to partition
the analysis of the speech signal into techniques for extracting
the parameters of the excitation model, such as the pitch period and
voiced/unvoiced classification, and techniques for extracting the linear
system model (which imparts the spectrum envelope or spectrum
shaping).
Footnote 2: In general, we use n and m for discrete indices for sequences, but whenever we want to indicate a specific analysis time, we use n̂.
Because of the slowly varying nature of the speech signal, it is common
to process speech in blocks (also called "frames") over which the
properties of the speech waveform can be assumed to remain relatively
constant. This leads to the basic principle of short-time analysis, which
is represented in a general form by the equation:

X_n̂ = Σ_{m=−∞}^{∞} T{ x[m] w[n̂ − m] },   (4.4)

where X_n̂ represents the short-time analysis parameter (or vector of
parameters) at analysis time n̂ (footnote 3). The operator T{·} defines the nature
of the short-time analysis function, and w[n̂ − m] represents a time-shifted
window sequence, whose purpose is to select a segment of the
sequence x[m] in the neighborhood of sample m = n̂. We will see several
examples of such operators that are designed to extract or highlight
certain features of the speech signal. The infinite limits in (4.4) imply
summation over all nonzero values of the windowed segment x_n̂[m] =
x[m] w[n̂ − m]; i.e., for all m in the region of support of the window.
For example, a finite-duration (footnote 4) window might be a Hamming window
defined by

w_H[m] = 0.54 + 0.46 cos(πm/M) for −M ≤ m ≤ M, and 0 otherwise.   (4.5)

Figure 4.2 shows a discrete-time Hamming window and its discrete-time
Fourier transform, as used throughout this chapter (footnote 5). It can be shown
that a (2M + 1)-sample Hamming window has a frequency main lobe
(full) bandwidth of 4π/M. Other windows will have similar properties,
i.e., they will be concentrated in time and frequency, and the frequency
width will be inversely proportional to the time width [89].
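As a concrete check of (4.5), the sketch below constructs the symmetric (2M + 1)-point Hamming window and verifies its center and edge values; the choice M = 200 (a 401-point window) matches the window length used later in Figure 4.4.

```python
import math

def hamming(M):
    """Symmetric (2M+1)-point Hamming window per (4.5):
    w_H[m] = 0.54 + 0.46*cos(pi*m/M) for -M <= m <= M."""
    return [0.54 + 0.46 * math.cos(math.pi * m / M) for m in range(-M, M + 1)]

w = hamming(200)           # 401-point window
print(len(w))              # 401
print(round(w[200], 2))    # 1.0 at the center (m = 0)
print(round(w[0], 2))      # 0.08 at the edges (m = -M and m = M)
```

Tapering the segment edges toward 0.08 rather than cutting them off abruptly is what keeps the window's spectral sidelobes low relative to a rectangular window of the same length.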
Figure 4.3 shows a 125 ms segment of a speech waveform that
includes both unvoiced (0–50 ms) and voiced speech (50–125 ms). Also
shown is a sequence of data windows of duration 40 ms, shifted by
15 ms (240 samples at a 16 kHz sampling rate) between windows. This
illustrates how short-time analysis is implemented.
Footnote 3: Sometimes (4.4) is normalized by dividing by the effective window length, i.e., Σ_{m=−∞}^{∞} w[m], so that X_n̂ is a weighted average.
Footnote 4: An example of an infinite-duration window is w[n] = n a^n for n ≥ 0. Such a window can lead to recursive implementations of short-time analysis functions [9].
Footnote 5: We have assumed the time origin at the center of a symmetric interval of 2M + 1 samples. A causal window would be shifted to the right by M samples.
Fig. 4.2 Hamming window (a) and its discrete-time Fourier transform (b).
Fig. 4.3 Section of speech waveform with short-time analysis windows (time axis 0–120 ms).
4.1 Short-Time Energy and Zero-Crossing Rate
Two basic short-time analysis functions useful for speech signals are the
short-time energy and the short-time zero-crossing rate. These functions
are simple to compute, and they are useful for estimating properties
of the excitation function in the model.
The short-time energy is defined as

E_n̂ = Σ_{m=−∞}^{∞} ( x[m] w[n̂ − m] )² = Σ_{m=−∞}^{∞} x²[m] w²[n̂ − m].   (4.6)
In this case the operator T{·} is simply squaring the windowed samples.
As shown in (4.6), it is often possible to express short-time analysis
operators as a convolution or linear filtering operation. In this case,
E_n̂ = ( x²[n] ∗ h_e[n] ) evaluated at n = n̂, where the impulse response
of the linear filter is h_e[n] = w²[n].
Similarly, the short-time zero-crossing rate is defined as the weighted
average of the number of times the speech signal changes sign within
the time window. Representing this operator in terms of linear filtering
leads to

Z_n̂ = Σ_{m=−∞}^{∞} 0.5 | sgn{x[m]} − sgn{x[m − 1]} | w[n̂ − m],   (4.7)

where

sgn{x} = 1 for x ≥ 0, and −1 for x < 0.   (4.8)

Since 0.5 | sgn{x[m]} − sgn{x[m − 1]} | is equal to 1 if x[m] and x[m − 1]
have different algebraic signs and 0 if they have the same sign, it follows
that Z_n̂ in (4.7) is a weighted sum of all the instances of alternating sign
(zero-crossings) that fall within the support region of the shifted window
w[n̂ − m]. While this is a convenient representation that fits the general
framework of (4.4), the computation of Z_n̂ could be implemented in
other ways.
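Definitions (4.6) and (4.7) translate directly into code. The sketch below evaluates the short-time energy and zero-crossing rate at single analysis times for a synthetic signal whose first half is low-level noise (unvoiced-like, many sign changes) and whose second half is a slow sinusoid (voiced-like, few sign changes); the rectangular window and all signal parameters are invented for the demonstration.

```python
import math
import random

def sgn(x):
    """Signum per (4.8): +1 for x >= 0, -1 for x < 0."""
    return 1 if x >= 0 else -1

def short_time_energy(x, w, n_hat):
    """E_n = sum_m (x[m] w[n-m])^2, per (4.6)."""
    return sum((x[m] * w[n_hat - m]) ** 2
               for m in range(len(x)) if 0 <= n_hat - m < len(w))

def short_time_zcr(x, w, n_hat):
    """Z_n = sum_m 0.5*|sgn(x[m]) - sgn(x[m-1])| * w[n-m], per (4.7)."""
    return sum(0.5 * abs(sgn(x[m]) - sgn(x[m - 1])) * w[n_hat - m]
               for m in range(1, len(x)) if 0 <= n_hat - m < len(w))

# Rectangular window of 101 samples (any window fitting (4.4) would do).
w = [1.0] * 101

# Synthetic signal: low-level noise, then a unit-amplitude slow sinusoid.
random.seed(1)
x = [random.uniform(-0.1, 0.1) for _ in range(200)]
x += [math.sin(2.0 * math.pi * n / 50.0) for n in range(200)]

# Analysis positions centered in each half.
print(short_time_zcr(x, w, 150) > short_time_zcr(x, w, 350))        # True
print(short_time_energy(x, w, 150) < short_time_energy(x, w, 350))  # True
```

As the model predicts, the noise-like segment shows the higher zero-crossing rate and the lower energy, and the voiced-like segment the reverse.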
Figure 4.4 shows an example of the short-time energy and zero-crossing rate for a segment of speech with a transition from unvoiced to voiced speech. In both cases, the window is a Hamming window (two examples shown) of duration 25 ms (equivalent to 401 samples at a 16 kHz sampling rate).[6] Thus, both the short-time energy and the short-time zero-crossing rate are the output of a lowpass filter whose frequency response is as shown in Figure 4.2(b). For the 401-point Hamming window used in Figure 4.4, the frequency response is very small for discrete-time frequencies above $2\pi/200$ rad (equivalent to $16000/200 = 80$ Hz analog frequency). This means that the short-time

[6] In the case of the short-time energy, the window applied to the signal samples was the square-root of the Hamming window, so that $h_e[n] = w^2[n]$ is the Hamming window defined by (4.5).
Fig. 4.4 Section of speech waveform with short-time energy and zero-crossing rate superimposed (time axis in ms).
energy and zero-crossing rate functions are slowly varying compared to the time variations of the speech signal, and therefore they can be sampled at a much lower rate than that of the original speech signal. For finite-length windows like the Hamming window, this reduction of the sampling rate is accomplished by moving the window position $\hat{n}$ in jumps of more than one sample, as shown in Figure 4.3.

Note that during the unvoiced interval, the zero-crossing rate is relatively high compared to the zero-crossing rate in the voiced interval. Conversely, the energy is relatively low in the unvoiced region compared to the energy in the voiced region. Note also that there is a small shift of the two curves relative to events in the time waveform. This is due to the time delay of $M$ samples (equivalent to 12.5 ms) added to make the analysis window filter causal.
The short-time energy and short-time zero-crossing rate are important because they abstract valuable information about the speech signal, and they are simple to compute. The short-time energy is an indication of the amplitude of the signal in the interval around time $\hat{n}$. From our model, we expect unvoiced regions to have lower short-time energy than voiced regions. Similarly, the short-time zero-crossing rate is a crude frequency analyzer: voiced signals have a high-frequency (HF) falloff due to the lowpass nature of the glottal pulses, while unvoiced sounds have much more HF energy. Thus, the short-time energy and short-time zero-crossing rate can be the basis for an algorithm for deciding whether the speech signal is voiced or unvoiced at any particular time $\hat{n}$. A complete algorithm would involve measurements of the statistical distributions of the energy and zero-crossing rate for both voiced and unvoiced speech segments (and also background noise distributions). These distributions can be used to derive the thresholds used in the voiced/unvoiced decision [100].
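The decision logic itself can be sketched as below; the threshold values in the usage are invented for synthetic test signals and are not derived from real speech statistics, which is what a complete algorithm would use:

```python
import numpy as np

def classify_frame(frame, energy_thresh, zcr_thresh):
    """Toy voiced/unvoiced decision: voiced speech tends to have high
    energy and a low zero-crossing rate; unvoiced speech the reverse.
    Thresholds are assumed to come from measured distributions."""
    energy = np.mean(frame ** 2)
    sgn = np.where(frame >= 0, 1.0, -1.0)
    zcr = np.mean(0.5 * np.abs(np.diff(sgn)))  # sign changes per sample
    if energy > energy_thresh and zcr < zcr_thresh:
        return "voiced"
    if energy < energy_thresh and zcr > zcr_thresh:
        return "unvoiced"
    return "uncertain"
```

For example, a 120 Hz sinusoid at a 16 kHz sampling rate has high energy and about 0.015 crossings per sample, while low-level white noise shows roughly 0.5 crossings per sample.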
4.2 Short-Time Autocorrelation Function (STACF)
The autocorrelation function is often used as a means of detecting periodicity in signals, and it is also the basis for many spectrum analysis methods. This makes it a useful tool for short-time speech analysis. The STACF is defined as the deterministic autocorrelation function of the sequence $x_{\hat{n}}[m] = x[m]w[\hat{n}-m]$ that is selected by the window shifted to time $\hat{n}$, i.e.,

$$\phi_{\hat{n}}[\ell] = \sum_{m=-\infty}^{\infty} x_{\hat{n}}[m]\,x_{\hat{n}}[m+\ell] = \sum_{m=-\infty}^{\infty} x[m]w[\hat{n}-m]\,x[m+\ell]w[\hat{n}-m-\ell].  \quad (4.9)$$
Using the familiar even-symmetric property of the autocorrelation, $\phi_{\hat{n}}[-\ell] = \phi_{\hat{n}}[\ell]$, (4.9) can be expressed in terms of linear time-invariant (LTI) filtering as

$$\phi_{\hat{n}}[\ell] = \sum_{m=-\infty}^{\infty} x[m]\,x[m-\ell]\,\tilde{w}_{\ell}[\hat{n}-m],  \quad (4.10)$$

where $\tilde{w}_{\ell}[m] = w[m]\,w[m+\ell]$. Note that the STACF is a two-dimensional function of the discrete-time index $\hat{n}$ (the window position) and the discrete-lag index $\ell$. If the window has finite duration, (4.9) can be evaluated directly or using FFT techniques (see Section 4.6). For infinite-duration decaying exponential windows, the short-time autocorrelation of (4.10) can be computed recursively at time $\hat{n}$ by using a different filter $\tilde{w}_{\ell}[m]$ for each lag value [8, 9].
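A direct NumPy transcription of (4.9) for a causal length-L window (the function name is ours):

```python
import numpy as np

def stacf(x, window, n_hat, max_lag):
    """Short-time autocorrelation (4.9): deterministic autocorrelation of
    the windowed segment x_nhat[m] = x[m] w[n_hat - m], lags 0..max_lag."""
    L = len(window)
    seg = x[n_hat - L + 1 : n_hat + 1] * window[::-1]
    return np.array([np.sum(seg[: L - l] * seg[l:]) for l in range(max_lag + 1)])
```

By the even symmetry noted above, negative lags need not be computed.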
To see how the short-time autocorrelation can be used in speech analysis, assume that a segment of the sampled speech signal is a segment of the output of the discrete-time model shown in Figure 4.1, where the system is characterized at a particular analysis time by an impulse response $h[n]$, and the input is either a periodic impulse train or random white noise. (Note that we have suppressed the indication of analysis time.) Different segments of the speech signal will have the same form of model with different excitation and system impulse response. That is, assume that $s[n] = e[n] * h[n]$, where $e[n]$ is the excitation to the linear system with impulse response $h[n]$. A well known, and easily proved, property of the autocorrelation function is that

$$\phi^{(s)}[\ell] = \phi^{(e)}[\ell] * \phi^{(h)}[\ell],  \quad (4.11)$$
i.e., the autocorrelation function of $s[n] = e[n] * h[n]$ is the convolution of the autocorrelation functions of $e[n]$ and $h[n]$. In the case of the speech signal, $h[n]$ represents the combined (by convolution) effects of the glottal pulse shape (for voiced speech), vocal tract shape, and radiation at the lips. For voiced speech, the autocorrelation of a periodic impulse train excitation with period $P_0$ is a periodic impulse train sequence with the same period. In this case, therefore, the autocorrelation of voiced speech is the periodic autocorrelation function

$$\phi^{(s)}[\ell] = \sum_{m=-\infty}^{\infty} \phi^{(h)}[\ell - mP_0].  \quad (4.12)$$
In the case of unvoiced speech, the excitation can be assumed to be random white noise, whose stochastic autocorrelation function would be an impulse sequence at $\ell = 0$. Therefore, the autocorrelation function of unvoiced speech computed using averaging would be simply

$$\phi^{(s)}[\ell] = \phi^{(h)}[\ell].  \quad (4.13)$$

Equation (4.12) assumes periodic computation over an infinite periodic signal, and (4.13) assumes a probability average or averaging over an infinite time interval (for stationary random signals). However, the deterministic autocorrelation function of a finite-length segment of the speech waveform will have properties similar to those of (4.12) and (4.13), except that the correlation values will taper off with lag due to the tapering of the window and the fact that less and less data is involved in the computation of the short-time autocorrelation for longer
Fig. 4.5 Voiced and unvoiced segments of speech and their corresponding STACF: (a) voiced segment, (b) voiced autocorrelation, (c) unvoiced segment, (d) unvoiced autocorrelation (time axes in ms).
lag values. This tapering off in level is depicted in Figure 4.5 for both a voiced and an unvoiced speech segment. Note the peaks in the autocorrelation function for the voiced segment at the pitch period and at twice the pitch period, and note the absence of such peaks in the autocorrelation function for the unvoiced segment. This suggests that the STACF could be the basis for an algorithm for estimating/detecting the pitch period of speech. Usually such algorithms involve the autocorrelation function together with other short-time measurements, such as zero-crossings and energy, to aid in making the voiced/unvoiced decision.

Finally, observe that the STACF implicitly contains the short-time energy, since

$$E_{\hat{n}} = \sum_{m=-\infty}^{\infty} \bigl(x[m]w[\hat{n}-m]\bigr)^2 = \phi_{\hat{n}}[0].  \quad (4.14)$$
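Such a pitch detector can be sketched as follows; this is our own simplified construction (parameter names and the search range are illustrative), and real algorithms would add the voiced/unvoiced logic discussed above:

```python
import numpy as np

def estimate_pitch_period(frame, fs, min_f0=60.0, max_f0=400.0):
    """Estimate the pitch period from the autocorrelation peak, searched
    over lags corresponding to plausible fundamental frequencies."""
    frame = frame * np.hamming(len(frame))
    # Deterministic autocorrelation for nonnegative lags.
    phi = np.correlate(frame, frame, mode="full")[len(frame) - 1 :]
    lo = int(fs / max_f0)                 # shortest allowed period (samples)
    hi = int(fs / min_f0)                 # longest allowed period (samples)
    lag = lo + int(np.argmax(phi[lo : hi + 1]))
    return lag / fs                       # period in seconds
```

The window taper shifts the peak slightly toward shorter lags, so estimates carry a small bias of a sample or two.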
4.3 Short-Time Fourier Transform (STFT)
The short-time analysis functions discussed so far are examples of the general short-time analysis principle that is the basis for most algorithms for speech processing. We now turn our attention to what is perhaps the most important basic concept in digital speech processing. In subsequent chapters, we will find that the STFT, defined as

$$X_{\hat{n}}(e^{j\hat{\omega}}) = \sum_{m=-\infty}^{\infty} x[m]\,w[\hat{n}-m]\,e^{-j\hat{\omega}m},  \quad (4.15)$$

is the basis for a wide range of speech analysis, coding, and synthesis systems. By definition, for fixed analysis time $\hat{n}$, the STFT is the discrete-time Fourier transform (DTFT) of the signal $x_{\hat{n}}[m] = x[m]w[\hat{n}-m]$, i.e., the DTFT of the (usually finite-duration) signal selected and amplitude-weighted by the sliding window $w[\hat{n}-m]$ [24, 89, 129]. Thus, the STFT is a function of two variables: $\hat{n}$, the discrete-time index denoting the window position, and $\hat{\omega}$, the analysis frequency.[7] Since (4.15) is a sequence of DTFTs, the two-dimensional function $X_{\hat{n}}(e^{j\hat{\omega}})$ at discrete time $\hat{n}$ is a periodic function of continuous radian frequency $\hat{\omega}$ with period $2\pi$ [89].
As in the case of the other short-time analysis functions discussed in this chapter, the STFT can be expressed in terms of a linear filtering operation. For example, (4.15) can be expressed as the discrete convolution

$$X_{\hat{n}}(e^{j\hat{\omega}}) = \bigl(x[n]e^{-j\hat{\omega}n}\bigr) * w[n]\Big|_{n=\hat{n}},  \quad (4.16)$$

or, alternatively,

$$X_{\hat{n}}(e^{j\hat{\omega}}) = \Bigl(x[n] * \bigl(w[n]e^{j\hat{\omega}n}\bigr)\Bigr)\,e^{-j\hat{\omega}n}\Big|_{n=\hat{n}}.  \quad (4.17)$$

Recall that a typical window like a Hamming window, when viewed as a linear filter impulse response, has a lowpass frequency response with the cutoff frequency varying inversely with the window length. (See Figure 4.2(b).) This means that for a fixed value of $\hat{\omega}$, $X_{\hat{n}}(e^{j\hat{\omega}})$ is slowly varying as $\hat{n}$ varies. Equation (4.16) can be interpreted as follows: the amplitude modulation $x[n]e^{-j\hat{\omega}n}$ shifts the spectrum of $x[n]$ down by $\hat{\omega}$, and the window (lowpass) filter selects the resulting band of frequencies around zero frequency. This is, of course, the band of frequencies of $x[n]$ that were originally centered on the analysis frequency $\hat{\omega}$. An identical conclusion follows from (4.17): $x[n]$ is the input to a bandpass filter with impulse response $w[n]e^{j\hat{\omega}n}$, which selects the band of frequencies centered on $\hat{\omega}$. Then that band of frequencies is shifted down by the amplitude modulation with $e^{-j\hat{\omega}n}$, resulting again in the same lowpass signal [89].

[7] As before, we use $\hat{n}$ to specify the analysis time, and $\hat{\omega}$ is used to distinguish the STFT analysis frequency from the frequency variable $\omega$ of the non-time-dependent Fourier transform.
In summary, the STFT has three interpretations: (1) it is a sequence of discrete-time Fourier transforms of windowed signal segments, i.e., a periodic function of $\hat{\omega}$ at each window position $\hat{n}$; (2) for each frequency $\hat{\omega}$ with $\hat{n}$ varying, it is the time sequence output of a lowpass filter that follows frequency down-shifting by $\hat{\omega}$; (3) for each frequency $\hat{\omega}$, it is the time sequence output resulting from frequency down-shifting the output of a bandpass filter.
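Interpretations (1) and (2) can be confirmed numerically. This sketch (our own construction, with illustrative parameter values) evaluates the definition (4.15) at one frequency and compares it with the modulate-then-lowpass-filter route of (4.16):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(300)
w = np.hamming(64)
L = len(w)
omega = 2 * np.pi * 0.1      # fixed analysis frequency (rad/sample)
n_hat = 200

# Definition (4.15): DTFT of the window-weighted segment at frequency omega.
m = np.arange(n_hat - L + 1, n_hat + 1)
X_def = np.sum(x[m] * w[n_hat - m] * np.exp(-1j * omega * m))

# Filtering view (4.16): modulate x down by omega, lowpass-filter with w,
# and read the output at n = n_hat.
modulated = x * np.exp(-1j * omega * np.arange(len(x)))
X_filt = np.convolve(modulated, w)[n_hat]
```

The two routes agree to machine precision, since full convolution at index n_hat is exactly the sum in (4.16).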
4.4 Sampling the STFT in Time and Frequency
As defined in (4.15), the STFT is a function of a continuous analysis frequency $\hat{\omega}$. The STFT becomes a practical tool for both analysis and applications when it is implemented with a finite-duration window moved in steps of $R > 1$ samples in time and computed at a discrete set of frequencies, as in

$$X_{rR}[k] = \sum_{m=rR-L+1}^{rR} x[m]\,w[rR-m]\,e^{-j(2\pi k/N)m}, \quad k = 0,1,\ldots,N-1,  \quad (4.18)$$

where $N$ is the number of uniformly spaced frequencies across the interval $0 \leq \hat{\omega} < 2\pi$, and $L$ is the window length (in samples). Note that we have assumed that $w[m]$ is causal and nonzero only in the range $0 \leq m \leq L-1$, so that the windowed segment $x[m]w[rR-m]$ is nonzero over $rR-L+1 \leq m \leq rR$. To aid in interpretation, it is helpful to write (4.18) in the equivalent form

$$X_{rR}[k] = \tilde{X}_{rR}[k]\,e^{-j(2\pi k/N)rR}, \quad k = 0,1,\ldots,N-1,  \quad (4.19)$$
where

$$\tilde{X}_{rR}[k] = \sum_{m=0}^{L-1} x[rR-m]\,w[m]\,e^{j(2\pi k/N)m}, \quad k = 0,1,\ldots,N-1.  \quad (4.20)$$

Since we have assumed, for specificity, that $w[m] \neq 0$ only in the range $0 \leq m \leq L-1$, the alternative form $\tilde{X}_{rR}[k]$ has the interpretation of an $N$-point DFT of the sequence $x[rR-m]w[m]$, which, due to the definition of the window, is nonzero in the interval $0 \leq m \leq L-1$.[8] In (4.20), the analysis time $rR$ is shifted to the time origin of the DFT computation, and the segment of the speech signal is the time-reversed sequence of $L$ samples that precedes the analysis time. The complex exponential factor $e^{-j(2\pi k/N)rR}$ in (4.19) results from the shift of the time origin.
From the discrete-time Fourier transform interpretation, it follows that $\tilde{X}_{rR}[k]$ can be computed by the following process:

(1) Form the sequence $x_{rR}[m] = x[rR-m]w[m]$, for $m = 0,1,\ldots,L-1$.
(2) Compute the complex conjugate of the $N$-point DFT of the sequence $x_{rR}[m]$. (This can be done efficiently with an $N$-point FFT algorithm.)
(3) The multiplication by $e^{-j(2\pi k/N)rR}$ can be done if necessary, but often can be omitted (as in computing a sound spectrogram or spectrographic display).
(4) Move the time origin by $R$ samples (i.e., $r \to r+1$) and repeat steps (1), (2), and (3), etc.
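The four steps can be sketched for a single frame as follows (the function name is ours); step (4) is just a loop over r:

```python
import numpy as np

def sampled_stft(x, w, R, N, r):
    """Compute X_tilde_{rR}[k] of (4.20) and X_{rR}[k] of (4.19)."""
    L = len(w)
    m = np.arange(L)
    seg = x[r * R - m] * w[m]                 # step (1): x[rR - m] w[m]
    X_tilde = np.conj(np.fft.fft(seg, N))     # step (2): conjugate N-point DFT
    k = np.arange(N)
    X = X_tilde * np.exp(-2j * np.pi * k * r * R / N)   # step (3): phase factor
    return X_tilde, X
```

Since the windowed sequence is real, the conjugate of its DFT realizes the positive exponent in (4.20), as noted in the footnote.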
The remaining issue for complete specification of the sampled STFT is the choice of the temporal sampling period, $R$, and the number of uniform frequencies, $N$. It can easily be shown that both $R$ and $N$ are determined entirely by the time width and frequency bandwidth of the lowpass window, $w[m]$, used to compute the STFT [1], giving the following constraints on $R$ and $N$:

(1) $R \leq L/(2C)$, where $C$ is a constant that depends on the window frequency bandwidth; $C = 2$ for a Hamming window, $C = 1$ for a rectangular window.
(2) $N \geq L$, where $L$ is the window length in samples.

Constraint (1) is related to sampling the STFT in time at a rate of twice the window bandwidth in frequency, in order to eliminate aliasing in frequency of the STFT; constraint (2) is related to sampling in frequency at a rate of twice the equivalent time width of the window, to ensure that there is no aliasing in time of the STFT.

[8] The DFT is normally defined with a negative exponent. Thus, since $x[rR-m]w[m]$ is real, (4.20) is the complex conjugate of the DFT of the windowed sequence [89].
4.5 The Speech Spectrogram
Since the 1940s, the sound spectrogram has been a basic tool for gaining understanding of how the sounds of speech are produced and how phonetic information is encoded in the speech signal. Up until the 1970s, spectrograms were made by an ingenious device comprised of an audio tape loop, a variable analog bandpass filter, and electrically sensitive paper [66]. Today, spectrograms like those in Figure 4.6 are made by DSP techniques [87] and displayed as either pseudocolor or grayscale images on computer screens.
Sound spectrograms like those in Figure 4.6 are simply a display of the magnitude of the STFT. Specifically, the images in Figure 4.6 are plots of

$$S(t_r, f_k) = 20\log_{10}\bigl|\tilde{X}_{rR}[k]\bigr| = 20\log_{10}\bigl|X_{rR}[k]\bigr|,  \quad (4.21)$$

where the plot axes are labeled in terms of analog time and frequency through the relations $t_r = rRT$ and $f_k = k/(NT)$, where $T$ is the sampling period of the discrete-time signal $x[n] = x_a(nT)$. In order to make smooth-looking plots like those in Figure 4.6, $R$ is usually quite small compared to both the window length $L$ and the number of samples in the frequency dimension, $N$, which may be much larger than the window length $L$. Such a function of two variables can be plotted on a two-dimensional surface (such as this text) as either a grayscale or a color-mapped image. Figure 4.6 shows the time waveform at the top and two spectrograms computed with different-length analysis windows. The bars on the right calibrate the color map (in dB).

Fig. 4.6 Spectrogram for speech signal of Figure 1.1. (Top: waveform with phonetic labels SH UH D W IY CH EY S; below: two spectrograms, frequency in kHz versus time in ms, each with a color bar in dB.)
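A minimal spectrogram computation in the spirit of (4.21) can be sketched as follows (the function name is ours; the small floor inside the log avoids log of zero, and only the nonnegative-frequency half is kept):

```python
import numpy as np

def spectrogram_db(x, L, R, N, fs):
    """S(t_r, f_k) of (4.21): 20 log10 |X_rR[k]| for each window position,
    returned with the matching analog time and frequency axes."""
    w = np.hamming(L)
    n_frames = (len(x) - L) // R + 1
    S = np.empty((n_frames, N // 2 + 1))
    for r in range(n_frames):
        seg = x[r * R : r * R + L] * w
        S[r] = 20 * np.log10(np.abs(np.fft.rfft(seg, N)) + 1e-12)
    times = np.arange(n_frames) * R / fs     # t_r = rRT
    freqs = np.arange(N // 2 + 1) * fs / N   # f_k = k/(NT)
    return S, times, freqs
```

Choosing L = 101 or L = 401 at fs = 16000 reproduces the wideband/narrowband contrast discussed below.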
A careful interpretation of (4.20) and the corresponding spectrogram images leads to valuable insight into the nature of the speech signal. First note that the window sequence $w[m]$ is nonzero only over the interval $0 \leq m \leq L-1$. The length of the window has a major effect on the spectrogram image. The upper spectrogram in Figure 4.6 was computed with a window length of 101 samples, corresponding to 10 ms time duration. This window length is on the order of the length of a pitch period of the waveform during voiced intervals. As a result, in voiced intervals, the spectrogram displays vertically oriented striations, corresponding to the fact that the sliding window sometimes includes mostly large-amplitude samples, then mostly small-amplitude samples, etc. As a result of the short analysis window, each individual pitch period is resolved in the time dimension, but the resolution in the frequency dimension is poor. For this reason, if the analysis window is short, the spectrogram is called a wideband spectrogram. This is consistent with the linear filtering interpretation of the STFT, since a short analysis filter has a wide passband. Conversely, when the window length is long, the spectrogram is a narrowband spectrogram, which is characterized by good frequency resolution and poor time resolution.
The upper plot in Figure 4.7, for example, shows $S(t_r, f_k)$ as a function of $f_k$ at time $t_r = 430$ ms. This vertical slice through the spectrogram is at the position of the black vertical line in the upper spectrogram of Figure 4.6. Note the three broad peaks in the spectrum slice at time $t_r = 430$ ms, and observe that similar slices would be obtained at other times around $t_r = 430$ ms. These large peaks are representative of the underlying resonances of the vocal tract at the corresponding time in the production of the speech signal.

Fig. 4.7 Short-time spectrum (log magnitude in dB versus frequency in kHz) at time 430 ms (dark vertical line in Figure 4.6), with a Hamming window of length M = 101 in the upper plot and M = 401 in the lower plot.
The lower spectrogram in Figure 4.6 was computed with a window length of 401 samples, corresponding to 40 ms time duration. This window length is on the order of several pitch periods of the waveform during voiced intervals. As a result, the spectrogram no longer displays vertically oriented striations, since several periods are included in the window no matter where it is placed on the waveform in the vicinity of the analysis time $t_r$. As a result, the spectrogram is not as sensitive to rapid time variations, but the resolution in the frequency dimension is much better. Therefore, the striations tend to be horizontally oriented in the narrowband case, since the fundamental frequency and its harmonics are all resolved. The lower plot in Figure 4.7, for example, shows $S(t_r, f_k)$ as a function of $f_k$ at time $t_r = 430$ ms. In this case, the signal is voiced at the position of the window, so over the analysis window interval it acts very much like a periodic signal. Periodic signals have Fourier spectra that are composed of impulses at the fundamental frequency, $f_0$, and at integer multiples (harmonics) of the fundamental frequency [89]. Multiplying by the analysis window in the time domain results in convolution of the Fourier transform of the window with the impulses in the spectrum of the periodic signal [89]. This is evident in the lower plot of Figure 4.7, where the local maxima of the curve are spaced at multiples of the fundamental frequency $f_0 = 1/T_0$, where $T_0$ is the fundamental period (pitch period) of the signal.[9]
4.6 Relation of STFT to STACF
A basic property of Fourier transforms is that the inverse Fourier transform of the magnitude-squared of the Fourier transform of a signal is the autocorrelation function of that signal [89]. Since the STFT defined in (4.15) is a discrete-time Fourier transform for fixed window position, it follows that $\phi_{\hat{n}}[\ell]$ given by (4.9) is related to the STFT given by (4.15) as

$$\phi_{\hat{n}}[\ell] = \frac{1}{2\pi} \int_{-\pi}^{\pi} \bigl|X_{\hat{n}}(e^{j\hat{\omega}})\bigr|^2\, e^{j\hat{\omega}\ell}\, d\hat{\omega}.  \quad (4.22)$$

[9] Each "ripple" in the lower plot is essentially a frequency-shifted copy of the Fourier transform of the Hamming window used in the analysis.
The STACF can also be computed from the sampled STFT. In particular, an inverse DFT can be used to compute

$$\tilde{\phi}_{rR}[\ell] = \frac{1}{N} \sum_{k=0}^{N-1} \bigl|\tilde{X}_{rR}[k]\bigr|^2\, e^{j(2\pi k/N)\ell},  \quad (4.23)$$

and $\tilde{\phi}_{rR}[\ell] = \phi_{rR}[\ell]$ for $\ell = 0,1,\ldots,L-1$ if $N \geq 2L$. If $L < N < 2L$, time aliasing will occur, but $\tilde{\phi}_{rR}[\ell] = \phi_{rR}[\ell]$ for $\ell = 0,1,\ldots,N-L$. Note that (4.14) shows that the short-time energy can also be obtained from either (4.22) or (4.23) by setting $\ell = 0$.
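The N >= 2L condition is easy to verify numerically; with N = 2L zero padding, the inverse DFT of the squared magnitude reproduces the deterministic autocorrelation exactly for lags 0 to L-1 (this check is our own construction):

```python
import numpy as np

rng = np.random.default_rng(3)
L = 100
seg = rng.standard_normal(L) * np.hamming(L)   # a windowed segment x_nhat[m]

# Direct deterministic autocorrelation, phi[l] for l = 0..L-1.
phi = np.array([np.sum(seg[: L - l] * seg[l:]) for l in range(L)])

# Via (4.23): inverse N-point DFT of |DFT|^2 with N >= 2L (no time aliasing).
N = 2 * L
phi_fft = np.fft.ifft(np.abs(np.fft.fft(seg, N)) ** 2).real[:L]
```

With N < 2L the circular (time-aliased) autocorrelation would differ at the larger lags, as the text states.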
Figure 4.8 illustrates the equivalence between the STACF and the STFT. Figures 4.8(a) and 4.8(c) show the voiced and unvoiced autocorrelation functions that were shown in Figures 4.5(b) and 4.5(d). On the right are the corresponding STFTs. The peaks around 9 and 18 ms in the autocorrelation function of the voiced segment imply a fundamental frequency at this time of approximately $1000/9 = 111$ Hz. Similarly, note in the STFT on the right in Figure 4.8(b) that there are approximately 18 local, regularly spaced peaks in the range 0–2000 Hz. Thus, we can estimate the fundamental frequency to be $2000/18 = 111$ Hz, as before.

Fig. 4.8 STACF and corresponding STFT: (a) voiced autocorrelation, (b) voiced STFT, (c) unvoiced autocorrelation, (d) unvoiced STFT (time in ms; frequency axis 0–8000 Hz).
4.7 Short-Time Fourier Synthesis
From the linear filtering point of view of the STFT, (4.19) and (4.20) represent the downsampled (by the factor $R$) output of the process of bandpass filtering with impulse response $w[n]e^{j(2\pi k/N)n}$ followed by frequency down-shifting by $(2\pi k/N)$. Alternatively, (4.18) shows that $X_{rR}[k]$ is a downsampled output of the lowpass window filter with frequency-down-shifted input $x[n]e^{-j(2\pi k/N)n}$. The left half of Figure 4.9 shows a block diagram representation of the STFT as a combination of modulation, followed by lowpass filtering, followed by a downsampler by $R$. This structure is often called a filter bank, and the outputs of the individual filters are called the channel signals. The spectrogram is comprised of the set of outputs of the filter bank, with each channel signal corresponding to a horizontal line in the spectrogram. If we want to use the STFT for other types of speech processing, we may need to consider if and how it is possible to reconstruct the speech signal from the STFT, i.e., from the channel signals. The remainder of Figure 4.9 shows how this can be done.

Fig. 4.9 Filter bank interpretation of short-time Fourier analysis and synthesis. (Analysis: modulation of $x[n]$ by $e^{-j(2\pi k/N)n}$, filtering by $w[n]$, and downsampling by $R$ to give $X_{rR}[k]$; after short-time modifications producing $Y_{rR}[k]$, synthesis proceeds by upsampling by $R$, filtering by $f[n]$, modulation by $e^{j(2\pi k/N)n}$, and summation to give $y[n]$.)
First, the diagram shows the possibility that the STFT might be modified by some sort of processing. An example might be quantization of the channel signals for data compression. The modified STFT is denoted as $Y_{rR}[k]$. The remaining parts of the structure on the right in Figure 4.9 implement the synthesis of a new time sequence $y[n]$ from the STFT. This part of the diagram represents the synthesis equation

$$y[n] = \sum_{k=0}^{N-1} \left( \sum_{r=-\infty}^{\infty} Y_{rR}[k]\, f[n-rR] \right) e^{j(2\pi k/N)n}.  \quad (4.24)$$

The steps represented by (4.24) and the right half of Figure 4.9 involve first upsampling by $R$ followed by linear filtering with $f[n]$ (called the synthesis window). This operation, defined by the part within the parentheses in (4.24), interpolates the STFT $Y_{rR}[k]$ to the time sampling rate of the speech signal for each channel $k$.[10] Then, synthesis is achieved by modulation with $e^{j(2\pi k/N)n}$, which up-shifts the lowpass interpolated channel signals back to their original frequency bands centered on frequencies $(2\pi k/N)$. The sum of the up-shifted signals is the synthesized output.
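The analysis/synthesis loop can be sketched with a simplified overlap-add scheme; rather than designing f[n] to satisfy the exact-reconstruction condition, this sketch (our own construction) normalizes by the summed window, which also recovers x[n] exactly wherever the shifted windows cover the signal:

```python
import numpy as np

def stft_frames(x, w, R, N):
    """Analysis: N-point DFTs of successive windowed frames (hop R)."""
    L = len(w)
    starts = range(0, len(x) - L + 1, R)
    return [np.fft.fft(x[s : s + L] * w, N) for s in starts]

def ola_synthesis(frames, w, R, n_total):
    """Synthesis by overlap-add: invert each frame, sum the shifted copies,
    and divide by the summed window so the result matches the input."""
    L = len(w)
    y = np.zeros(n_total)
    wsum = np.zeros(n_total)
    for i, F in enumerate(frames):
        s = i * R
        y[s : s + L] += np.fft.ifft(F).real[:L]   # windowed segment back
        wsum[s : s + L] += w
    covered = wsum > 1e-8            # avoid dividing by ~0 at the edges
    y[covered] /= wsum[covered]
    return y
```

Modifying the frames before synthesis (e.g., quantizing them) turns this into the short-time modification structure of Figure 4.9.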
There are many variations on the short-time Fourier analysis/synthesis paradigm. The FFT algorithm can be used to implement both analysis and synthesis; special efficiencies result when $L > R = N$, and the analysis and synthesis can be accomplished using only real filtering, with the modulation operations being implicitly achieved by the downsampling [24, 129]. However, it is most important to note that with careful choice of the parameters $L$, $R$, and $N$, and careful design of the analysis window $w[m]$ together with the synthesis window $f[n]$, it is possible to reconstruct the signal $x[n]$ with negligible error ($y[n] = x[n]$) from the unmodified STFT $Y_{rR}[k] = X_{rR}[k]$. One condition that guarantees exact reconstruction is [24, 129]

$$\sum_{r=-\infty}^{\infty} w[rR-n+qN]\,f[n-rR] = \begin{cases} 1 & q = 0 \\ 0 & q \neq 0. \end{cases}  \quad (4.25)$$

[10] The sequence $f[n]$ is the impulse response of an LTI filter with gain $R/N$ and normalized cutoff frequency $\pi/R$. It is often a scaled version of $w[n]$.
Short-time Fourier analysis and synthesis is generally formulated with equally spaced channels of equal bandwidth. However, we have seen that models for auditory processing involve nonuniform filter banks. Such filter banks can be implemented as tree structures in which frequency bands are successively divided into low- and high-frequency bands. Such structures are essentially the same as wavelet decompositions, a topic beyond the scope of this text, but one for which a large body of literature exists. See, for example, [18, 125, 129].
4.8 Short-Time Analysis is Fundamental to our Thinking
It can be argued that the short-time analysis principle, and particularly the short-time Fourier representation of speech, is fundamental to our thinking about the speech signal, and that it leads us to a wide variety of techniques for achieving our goal of moving from the sampled time waveform back along the speech chain toward the implicit message. The fact that almost perfect reconstruction can be achieved from the filter bank channel signals gives the short-time Fourier representation major credibility in the digital speech processing tool kit. This importance is strengthened by the fact that, as discussed briefly in Chapter 3, models for auditory processing are based on a filter bank as the first stage of processing. Much of our knowledge of perceptual effects is framed in terms of frequency analysis, and thus the STFT representation provides a natural framework within which this knowledge can be represented and exploited to obtain efficient representations of speech and more general audio signals. We will have more to say about this in Chapter 7, but first we will consider (in Chapter 5) the technique of cepstrum analysis, which is based directly on the STFT, and the technique of linear predictive analysis (in Chapter 6), which is based on the STACF and thus equivalently on the STFT.
5
Homomorphic Speech Analysis
The STFT provides a useful framework for thinking about almost all the important techniques for analysis of speech that have been developed so far. An important concept that flows directly from the STFT is the cepstrum, more specifically the short-time cepstrum, of speech. In this chapter, we explore the use of the short-time cepstrum as a representation of speech and as a basis for estimating the parameters of the speech generation model. A more detailed discussion of the uses of the cepstrum in speech processing can be found in [110].
5.1 Definition of the Cepstrum and Complex Cepstrum
The cepstrum was defined by Bogert, Healy, and Tukey to be the inverse Fourier transform of the log magnitude spectrum of a signal [16]. Their original definition, loosely framed in terms of spectrum analysis of analog signals, was motivated by the fact that the logarithm of the Fourier spectrum of a signal containing an echo has an additive periodic component depending only on the echo size and delay, and that further Fourier analysis of the log spectrum can aid in detecting the presence of that echo. Oppenheim, Schafer, and Stockham showed that the cepstrum is related to the more general concept of homomorphic filtering of signals that are combined by convolution [85, 90, 109]. They gave a definition of the cepstrum of a discrete-time signal as
$$c[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log\bigl|X(e^{j\omega})\bigr|\, e^{j\omega n}\, d\omega,  \quad (5.1a)$$

where $\log|X(e^{j\omega})|$ is the logarithm of the magnitude of the DTFT of the signal, and they extended the concept by defining the complex cepstrum as

$$\hat{x}[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log\{X(e^{j\omega})\}\, e^{j\omega n}\, d\omega,  \quad (5.1b)$$

where $\log\{X(e^{j\omega})\}$ is the complex logarithm of $X(e^{j\omega})$, defined as

$$\hat{X}(e^{j\omega}) = \log\{X(e^{j\omega})\} = \log\bigl|X(e^{j\omega})\bigr| + j\arg\bigl[X(e^{j\omega})\bigr].  \quad (5.1c)$$
The transformation implied by (5.1b) is depicted as the block diagram in Figure 5.1. The same diagram represents the cepstrum if the complex logarithm is replaced by the logarithm of the magnitude of the DTFT. Since we restrict our attention to real sequences $x[n]$, it follows from the symmetry properties of Fourier transforms that the cepstrum is the even part of the complex cepstrum, i.e., $c[n] = (\hat{x}[n] + \hat{x}[-n])/2$ [89]. As shown in Figure 5.1, the operation of computing the complex cepstrum from the input can be denoted as $\hat{x}[n] = D_*\{x[n]\}$. In the theory of homomorphic systems, $D_*\{\ \}$ is called the characteristic system for convolution. The connection between the cepstrum concept and homomorphic filtering of convolved signals is that the complex cepstrum has the property that if $x[n] = x_1[n] * x_2[n]$, then

$$\hat{x}[n] = D_*\{x_1[n] * x_2[n]\} = \hat{x}_1[n] + \hat{x}_2[n].  \quad (5.2)$$
Fig. 5.1 Computing the complex cepstrum using the DTFT.
Fig. 5.2 The inverse of the characteristic system for convolution (inverse complex cepstrum).
That is, the complex cepstrum operator transforms convolution into addition. This property, which is true for both the cepstrum and the complex cepstrum, is what makes the cepstrum and the complex cepstrum useful for speech analysis, since our model for speech production involves convolution of the excitation with the vocal tract impulse response, and our goal is often to separate the excitation signal from the vocal tract signal. In the case of the complex cepstrum, the inverse of the characteristic system exists, as in Figure 5.2, which shows the reverse cascade of the inverses of the operators in Figure 5.1. Homomorphic filtering of convolved signals is achieved by forming a modified complex cepstrum

$$\hat{y}[n] = g[n]\,\hat{x}[n],  \quad (5.3)$$

where $g[n]$ is a window (a "lifter" in the terminology of Bogert et al.) which selects a portion of the complex cepstrum for inverse processing. A modified output signal $y[n]$ can then be obtained as the output of Figure 5.2 with $\hat{y}[n]$ given by (5.3) as input. Observe that (5.3) defines a linear operator in the conventional sense, i.e., if $\hat{x}[n] = \hat{x}_1[n] + \hat{x}_2[n]$, then $\hat{y}[n] = g[n]\hat{x}_1[n] + g[n]\hat{x}_2[n]$. Therefore, the output of the inverse characteristic system will have the form $y[n] = y_1[n] * y_2[n]$, where $\hat{y}_1[n] = g[n]\hat{x}_1[n]$ is the complex cepstrum of $y_1[n]$, etc. Examples of lifters used in homomorphic filtering of speech are given in Sections 5.4 and 5.6.2.
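As a preview of such liftering, the sketch below (our own construction; the cutoff of 20 quefrency samples is illustrative) applies a low-quefrency lifter g[n] in the spirit of (5.3) to the real cepstrum, retaining the slowly varying spectral-envelope component of the log spectrum:

```python
import numpy as np

def lifter_lowpass(c, cutoff):
    """Low-quefrency lifter g[n]: keep quefrencies |n| < cutoff of the real
    cepstrum (negative quefrencies wrap around in the DFT indexing)."""
    g = np.zeros_like(c)
    g[:cutoff] = 1.0
    g[-(cutoff - 1):] = 1.0
    return g * c

def smoothed_log_spectrum(x, N, cutoff):
    """Smoothed log magnitude spectrum via cepstral liftering."""
    logmag = np.log(np.abs(np.fft.fft(x, N)) + 1e-12)
    c = np.fft.ifft(logmag).real          # real cepstrum, DFT version of (5.1a)
    return np.fft.fft(lifter_lowpass(c, cutoff)).real
```

Because the lifter discards the high-quefrency (fast-varying) part of the log spectrum, the output is a smooth envelope, which is the basis of the vocal tract estimation methods discussed later in the chapter.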
The key issue in the definition and computation of the complex cepstrum is the computation of the complex logarithm, more specifically, the computation of the phase angle $\arg[X(e^{j\omega})]$, which must be done so as to preserve an additive combination of phases for two signals combined by convolution [90, 109].
The independent variable of the cepstrum and complex cepstrum is nominally time. The crucial observation leading to the cepstrum concept is that the log spectrum can be treated as a waveform to be subjected to further Fourier analysis. To emphasize this interchanging of domains of reference, Bogert et al. [16] coined the word cepstrum by transposing some of the letters in the word spectrum. They created many other special terms in this way, including quefrency as the name for the independent variable of the cepstrum and liftering for the operation of linearly filtering the log magnitude spectrum by the operation of (5.3). Only the terms cepstrum, quefrency, and liftering are widely used today.
5.2 The Short-Time Cepstrum
The application of these definitions to speech requires that the DTFT be replaced by the STFT. Thus the short-time cepstrum is defined as

$$c_{\hat{n}}[m] = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log\bigl|X_{\hat{n}}(e^{j\hat{\omega}})\bigr|\, e^{j\hat{\omega}m}\, d\hat{\omega},  \quad (5.4)$$

where $X_{\hat{n}}(e^{j\hat{\omega}})$ is the STFT defined in (4.15), and the short-time complex cepstrum is likewise defined by replacing $X(e^{j\omega})$ by $X_{\hat{n}}(e^{j\hat{\omega}})$ in (5.1b).[1] The similarity to the STACF defined by (4.9) should be clear. The short-time cepstrum is a sequence of cepstra of windowed finite-duration segments of the speech waveform. By analogy, a "cepstrogram" would be an image obtained by plotting the magnitude of the short-time cepstrum as a function of quefrency $m$ and analysis time $\hat{n}$.
5.3 Computation of the Cepstrum
In (5.1a) and (5.1b), the cepstrum and complex cepstrum are deﬁned
in terms of the DTFT. This is useful in the basic deﬁnitions, but not
¹ In cases where we wish to explicitly indicate analysis-time dependence of the short-time cepstrum, we will use $\hat{n}$ for the analysis time and m for quefrency as in (5.4), but as in other instances, we often suppress the subscript $\hat{n}$.
for use in processing sampled speech signals. Fortunately, several computational options exist.
5.3.1 Computation Using the DFT
Since the DFT (computed with an FFT algorithm) is a sampled (in frequency) version of the DTFT of a finite-length sequence (i.e., $X[k] = X(e^{j2\pi k/N})$ [89]), the DFT and inverse DFT can be substituted for the DTFT and its inverse in (5.1a) and (5.1b) as shown in Figure 5.3, which shows that the complex cepstrum can be computed approximately using the equations
$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j(2\pi k/N)n} \qquad (5.5a)$$
$$\hat{X}[k] = \log|X[k]| + j\,\arg\{X[k]\} \qquad (5.5b)$$
$$\tilde{\hat{x}}[n] = \frac{1}{N}\sum_{k=0}^{N-1} \hat{X}[k]\, e^{j(2\pi k/N)n}. \qquad (5.5c)$$
Note the "tilde" symbol above $\hat{x}[n]$ in (5.5c) and in Figure 5.3. Its purpose is to emphasize that using the DFT instead of the DTFT results in an approximation to (5.1b) due to the time-domain aliasing resulting from the sampling of the log of the DTFT [89]. That is,
$$\tilde{\hat{x}}[n] = \sum_{r=-\infty}^{\infty} \hat{x}[n + rN], \qquad (5.6)$$
where $\hat{x}[n]$ is the complex cepstrum defined by (5.1b). An identical equation holds for the time-aliased cepstrum $\tilde{c}[n]$.
The effect of time-domain aliasing can be made negligible by using a large value for N. A more serious problem in computation of the complex cepstrum is the computation of the complex logarithm. This is
Fig. 5.3 Computing the cepstrum or complex cepstrum using the DFT.
because the angle of a complex number is usually specified modulo 2π, i.e., by the principal value. In order for the complex cepstrum to be evaluated properly, the phase of the sampled DTFT must be evaluated as samples of a continuous function of frequency. If the principal value phase is first computed at discrete frequencies, then it must be "unwrapped" modulo 2π in order to ensure that convolutions are transformed into additions. A variety of algorithms have been developed for phase unwrapping [90, 109, 127]. While accurate phase unwrapping presents a challenge in computing the complex cepstrum, it is not a problem in computing the cepstrum, since the phase is not used. Furthermore, phase unwrapping can be avoided by using numerical computation of the poles and zeros of the z-transform.
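The DFT-based procedure of (5.5a)–(5.5c) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the phase unwrapping relies on `np.unwrap`, which implements only the simplest principal-value unwrapping strategy, and a large FFT size is used to keep the aliasing of (5.6) negligible.

```python
import numpy as np

def cepstra(x, nfft=4096):
    """Approximate real and complex cepstra of x via the DFT, per (5.5a)-(5.5c).

    A large nfft keeps the time-domain aliasing of (5.6) negligible.
    """
    X = np.fft.fft(x, nfft)                                   # (5.5a)
    c = np.fft.ifft(np.log(np.abs(X))).real                   # real cepstrum (phase discarded)
    X_hat = np.log(np.abs(X)) + 1j * np.unwrap(np.angle(X))   # (5.5b) with unwrapped phase
    x_hat = np.fft.ifft(X_hat).real                           # (5.5c)
    return c, x_hat

# Check against the analytic result for x[n] = delta[n] + 0.5*delta[n-1]:
# X(z) = 1 + 0.5 z^{-1} is minimum phase, so x_hat[1] = 0.5, x_hat[2] = -0.125,
# and the real cepstrum is the even part, giving c[1] = 0.25.
c, x_hat = cepstra([1.0, 0.5])
```

Because the test sequence here is minimum phase, its phase never leaves the principal value range and `np.unwrap` has nothing to do; for signals with zeros near or outside the unit circle, the more careful unwrapping algorithms of [90, 109, 127] become necessary.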
5.3.2 z-Transform Analysis
The characteristic system for convolution can also be represented by the two-sided z-transform as depicted in Figure 5.4. This is very useful for theoretical investigations, and recent developments in polynomial root finding [117] have made the z-transform representation a viable computational basis as well. For this purpose, we assume that the input signal x[n] has a rational z-transform of the form:
$$X(z) = X_{\max}(z)\,X_{uc}(z)\,X_{\min}(z), \qquad (5.7)$$
where
$$X_{\max}(z) = z^{M_o}\prod_{k=1}^{M_o}(1 - a_k z^{-1}) = \left[\prod_{k=1}^{M_o}(-a_k)\right]\prod_{k=1}^{M_o}(1 - a_k^{-1} z), \qquad (5.8a)$$
$$X_{uc}(z) = \prod_{k=1}^{M_{uc}}(1 - e^{j\theta_k} z^{-1}), \qquad (5.8b)$$
$$X_{\min}(z) = \frac{A\displaystyle\prod_{k=1}^{M_i}(1 - b_k z^{-1})}{\displaystyle\prod_{k=1}^{N_i}(1 - c_k z^{-1})}. \qquad (5.8c)$$
The zeros of $X_{\max}(z)$, i.e., $z_k = a_k$, are zeros of X(z) outside of the unit circle ($|a_k| > 1$). $X_{\max}(z)$ is thus the maximum-phase part of X(z).
Fig. 5.4 z-transform representation of characteristic system for convolution.
$X_{uc}(z)$ contains all the zeros (with angles $\theta_k$) on the unit circle. The minimum-phase part is $X_{\min}(z)$, where $b_k$ and $c_k$ are zeros and poles, respectively, that are inside the unit circle ($|b_k| < 1$ and $|c_k| < 1$). The factor $z^{M_o}$ implies a shift of $M_o$ samples to the left. It is included to simplify the results in (5.11).
The complex cepstrum of x[n] is determined by assuming that the complex logarithm $\log\{X(z)\}$ results in the sum of logarithms of each of the product terms, i.e.,
$$\hat{X}(z) = \log\left[\prod_{k=1}^{M_o}(-a_k)\right] + \sum_{k=1}^{M_o}\log(1 - a_k^{-1}z) + \sum_{k=1}^{M_{uc}}\log(1 - e^{j\theta_k}z^{-1}) + \log|A| + \sum_{k=1}^{M_i}\log(1 - b_k z^{-1}) - \sum_{k=1}^{N_i}\log(1 - c_k z^{-1}). \qquad (5.9)$$
Applying the power series expansion
$$\log(1 - a) = -\sum_{n=1}^{\infty}\frac{a^n}{n}, \qquad |a| < 1 \qquad (5.10)$$
to each of the terms in (5.9) and collecting the coefficients of the positive and negative powers of z gives
$$\hat{x}[n] = \begin{cases} \displaystyle\sum_{k=1}^{M_o}\frac{a_k^n}{n} & n < 0 \\[2ex] \displaystyle\log|A| + \log\left|\prod_{k=1}^{M_o}(-a_k)\right| & n = 0 \\[2ex] \displaystyle-\sum_{k=1}^{M_{uc}}\frac{e^{j\theta_k n}}{n} - \sum_{k=1}^{M_i}\frac{b_k^n}{n} + \sum_{k=1}^{N_i}\frac{c_k^n}{n} & n > 0. \end{cases} \qquad (5.11)$$
Given all the poles and zeros of a z-transform X(z), (5.11) allows us to compute the complex cepstrum with no approximation. This is the case in theoretical analysis where the poles and zeros are specified. However, (5.11) is also useful as the basis for computation. All that is needed is a process for obtaining the z-transform as a rational function and a process for finding the zeros of the numerator and denominator. This has become more feasible with increasing computational power and with new advances in finding roots of large polynomials [117].
One method of obtaining a z-transform is simply to select a finite-length sequence of samples of a signal. The z-transform is then simply a polynomial with the samples x[n] as coefficients, i.e.,
$$X(z) = \sum_{n=0}^{M} x[n]z^{-n} = A\prod_{k=1}^{M_o}(1 - a_k z^{-1})\prod_{k=1}^{M_i}(1 - b_k z^{-1}). \qquad (5.12)$$
A second method that yields a z-transform is the method of linear predictive analysis, to be discussed in Chapter 6.
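The root-based evaluation of (5.11) for a finite-length sequence can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it assumes no zeros lie exactly on the unit circle, and it relies on `np.roots`, which is adequate only for modest polynomial orders (the large-polynomial root finders of [117] would be needed for long frames).

```python
import numpy as np

def cc_from_roots(x, nmax=32):
    """Complex cepstrum of a finite-length sequence from the zeros of X(z),
    using (5.11). Assumes no zeros exactly on the unit circle.

    Returns (neg, pos): pos[n] = x_hat[n] for n = 0..nmax, and
    neg[m] = x_hat[-m] for m = 0..nmax.
    """
    x = np.asarray(x, dtype=float)
    zeros = np.roots(x)                      # zeros of X(z) = sum_n x[n] z^{-n}
    a = zeros[np.abs(zeros) > 1.0]           # maximum-phase zeros (outside unit circle)
    b = zeros[np.abs(zeros) < 1.0]           # minimum-phase zeros (inside unit circle)
    pos = np.zeros(nmax + 1)
    neg = np.zeros(nmax + 1)
    pos[0] = np.log(np.abs(x[0] * np.prod(-a)))   # n = 0 term of (5.11), with A = x[0]
    for n in range(1, nmax + 1):
        pos[n] = -np.sum(b ** n).real / n         # n > 0 terms of (5.11)
        neg[n] = -np.sum(a ** (-n)).real / n      # n < 0 terms of (5.11)
    return neg, pos

# Example: x[n] = {1, 0.5} has a single zero at z = -0.5 (minimum phase),
# so (5.11) predicts x_hat[1] = 0.5 and x_hat[2] = -0.125.
neg, pos = cc_from_roots([1.0, 0.5])
```

Since the roots of a real polynomial occur in conjugate pairs, the sums over $b_k^n$ and $a_k^{-n}$ are real up to rounding, which is why only their real parts are kept.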
5.3.3 Recursive Computation of the Complex Cepstrum
Another approach to computing the complex cepstrum applies only to minimum-phase signals, i.e., signals having a z-transform whose poles and zeros are inside the unit circle. An example would be the impulse response of an all-pole vocal tract model with system function
$$H(z) = \frac{G}{1 - \displaystyle\sum_{k=1}^{p}\alpha_k z^{-k}} = \frac{G}{\displaystyle\prod_{k=1}^{p}(1 - c_k z^{-1})}. \qquad (5.13)$$
Such models are implicit in the use of linear predictive analysis of speech (Chapter 6). In this case, all the poles $c_k$ must be inside the unit circle for stability of the system. From (5.11) it follows that the complex cepstrum of the impulse response h[n] corresponding to H(z) is
$$\hat{h}[n] = \begin{cases} 0 & n < 0 \\ \log G & n = 0 \\ \displaystyle\sum_{k=1}^{p}\frac{c_k^n}{n} & n > 0. \end{cases} \qquad (5.14)$$
It can be shown [90] that the impulse response and its complex cepstrum are related by the recursion formula:
$$\hat{h}[n] = \begin{cases} 0 & n < 0 \\ \log G & n = 0 \\ \dfrac{h[n]}{h[0]} - \displaystyle\sum_{k=1}^{n-1}\left(\frac{k}{n}\right)\hat{h}[k]\frac{h[n-k]}{h[0]} & n \ge 1. \end{cases} \qquad (5.15)$$
Furthermore, working with the reciprocal (negative logarithm) of (5.13), it can be shown that there is a direct recursive relationship between the coefficients of the denominator polynomial in (5.13) and the complex cepstrum of the impulse response of the model filter, i.e.,
$$\hat{h}[n] = \begin{cases} 0 & n < 0 \\ \log G & n = 0 \\ \alpha_n + \displaystyle\sum_{k=1}^{n-1}\left(\frac{k}{n}\right)\hat{h}[k]\alpha_{n-k} & n > 0. \end{cases} \qquad (5.16)$$
From (5.16) it follows that the coefficients of the denominator polynomial can be obtained from the complex cepstrum through
$$\alpha_n = \hat{h}[n] - \sum_{k=1}^{n-1}\left(\frac{k}{n}\right)\hat{h}[k]\alpha_{n-k}, \qquad 1 \le n \le p. \qquad (5.17)$$
From (5.17), it follows that p + 1 values of the complex cepstrum are sufficient to fully determine the speech model system in (5.13), since all the denominator coefficients and G can be computed from $\hat{h}[n]$ for $n = 0, 1, \ldots, p$ using (5.17). This fact is the basis for the use of the cepstrum in speech coding and speech recognition as a vector representation of the vocal tract properties of a frame of speech.
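The two recursions (5.16) and (5.17) are straightforward to implement and can be checked against each other: converting predictor coefficients to cepstra and back should be an exact round trip. The following sketch (with illustrative coefficient values, not taken from the text) does exactly that, taking $\alpha_m = 0$ for $m > p$.

```python
import numpy as np

def cepstrum_from_lpc(alpha, G, nmax):
    """Complex cepstrum h_hat[0..nmax] of the all-pole model (5.13) via (5.16)."""
    p = len(alpha)
    h_hat = np.zeros(nmax + 1)
    h_hat[0] = np.log(G)
    for n in range(1, nmax + 1):
        acc = alpha[n - 1] if n <= p else 0.0      # alpha_n, zero beyond order p
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * h_hat[k] * alpha[n - k - 1]
        h_hat[n] = acc
    return h_hat

def lpc_from_cepstrum(h_hat, p):
    """Recover alpha_1..alpha_p from the complex cepstrum via (5.17)."""
    alpha = np.zeros(p)
    for n in range(1, p + 1):
        acc = h_hat[n]
        for k in range(1, n):
            acc -= (k / n) * h_hat[k] * alpha[n - k - 1]
        alpha[n - 1] = acc
    return alpha

alpha = np.array([0.9, -0.4])                      # stable example model, p = 2
h_hat = cepstrum_from_lpc(alpha, G=2.0, nmax=10)
alpha_rec = lpc_from_cepstrum(h_hat, p=2)
```

Note that, as the text states, only $\hat{h}[0], \ldots, \hat{h}[p]$ enter the inverse recursion: the p + 1 leading cepstrum values fully determine the model.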
5.4 Short-Time Homomorphic Filtering of Speech
Figure 5.5 shows an example of the short-time cepstrum of speech for the segments of voiced and unvoiced speech in Figures 4.5(a) and 4.5(c). The low quefrency part of the cepstrum is expected to be representative of the slow variations (with frequency) in the log spectrum, while the
Fig. 5.5 Short-time cepstra and corresponding STFTs and homomorphically smoothed spectra: (a) voiced cepstrum and (c) unvoiced cepstrum (time in ms); (b) voiced log STFT and (d) unvoiced log STFT (frequency in kHz).
high quefrency components would correspond to the more rapid fluctuations of the log spectrum. This is illustrated by the plots in Figure 5.5. The log magnitudes of the corresponding short-time spectra are shown on the right as Figures 5.5(b) and 5.5(d). Note that the spectrum for the voiced segment in Figure 5.5(b) has a structure of periodic ripples due to the harmonic structure of the quasi-periodic segment of voiced speech. This periodic structure in the log spectrum of Figure 5.5(b) manifests itself in the cepstrum peak at a quefrency of about 9 ms in Figure 5.5(a). The existence of this peak in the quefrency range of expected pitch periods strongly signals voiced speech. Furthermore, the quefrency of the peak is an accurate estimate of the pitch period during the corresponding speech interval. As shown in Figure 4.5(b), the autocorrelation function also displays an indication of periodicity, but not nearly as unambiguously as does the cepstrum. On the other hand, note that the rapid variations of the unvoiced spectra appear random with no periodic structure. This is typical of Fourier transforms (periodograms) of short segments of random signals. As a result, there is no strong peak indicating periodicity as in the voiced case.
To illustrate the effect of liftering, the quefrencies above 5 ms are multiplied by zero and the quefrencies below 5 ms are multiplied by 1 (with a short transition taper as shown in Figures 5.5(a) and 5.5(c)). The DFT of the resulting modified cepstrum is plotted as the smooth curve that is superimposed on the short-time spectra in Figures 5.5(b) and 5.5(d), respectively. These slowly varying log spectra clearly retain the general spectral shape with peaks corresponding to the formant resonance structure for the segment of speech under analysis. Therefore, a useful perspective is that by liftering the cepstrum, it is possible to separate information about the vocal tract contributions from the short-time speech spectrum [88].
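The lowpass-liftering operation just described can be sketched as follows. This is a minimal sketch, not the authors' code: the lifter here is a hard rectangle rather than the tapered one in Figure 5.5, and the cutoff is given as a sample count `n_cut` (the 5 ms cutoff of the text corresponds to `n_cut = 0.005 * fs` for a sampling rate `fs`).

```python
import numpy as np

def liftered_log_spectrum(frame, n_cut, nfft=1024):
    """Homomorphically smoothed log magnitude spectrum: compute the real
    cepstrum, zero out quefrencies at and above n_cut (keeping the symmetric
    negative quefrencies), and transform back, as in Section 5.4."""
    c = np.fft.ifft(np.log(np.abs(np.fft.fft(frame, nfft)) + 1e-10)).real
    g = np.zeros(nfft)
    g[:n_cut] = 1.0                 # low quefrencies 0 .. n_cut-1
    g[-(n_cut - 1):] = 1.0          # mirrored negative quefrencies
    return np.fft.fft(c * g).real   # smooth log magnitude spectrum

# Illustration on a short hypothetical frame:
frame = np.array([1.0, 0.5, 0.25])
smooth = liftered_log_spectrum(frame, n_cut=5, nfft=8)
```

A useful sanity check: if the lifter passes every quefrency, the "smoothed" spectrum is just the original log magnitude spectrum; the smoothing comes entirely from the quefrencies that are discarded.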
5.5 Application to Pitch Detection
The cepstrum was first applied in speech processing to determine the excitation parameters for the discrete-time speech model of Figure 7.10. Noll [83] applied the short-time cepstrum to detect local periodicity (voiced speech) or the lack thereof (unvoiced speech). This is illustrated in Figure 5.6, which shows a plot that is very similar to the plot first published by Noll [83]. On the left is a sequence of log short-time spectra (rapidly varying curves) and on the right is the corresponding sequence of cepstra computed from the log spectra on the left. The successive spectra and cepstra are for 50 ms segments obtained by moving the window in steps of 12.5 ms (100 samples at a sampling rate of 8000 samples/sec). From the discussion of Section 5.4, it is apparent that for positions 1 through 5, the window includes only unvoiced speech, while for positions 6 and 7 the signal within the window is partly voiced and partly unvoiced. For positions 8 through 15 the window only includes voiced speech. Note again that the rapid variations of the unvoiced spectra appear random with no periodic structure. On the other hand, the spectra for voiced segments have a structure of periodic ripples due to the harmonic structure of the quasi-periodic segment of voiced speech. As can be seen from the plots on the right, the cepstrum peak at a quefrency of about 11–12 ms strongly signals voiced speech, and the quefrency of the peak is an accurate estimate of the pitch period during the corresponding speech interval.
Fig. 5.6 Short-time log spectra in cepstrum analysis (left, frequency in kHz) and the corresponding short-time cepstra (right, time in ms), for window numbers 1 through 15.
The essence of the pitch detection algorithm proposed by Noll is to compute a sequence of short-time cepstra and search each successive cepstrum for a peak in the quefrency region of the expected pitch period. Presence of a strong peak implies voiced speech, and the quefrency location of the peak gives the estimate of the pitch period. As in most model-based signal processing applications of concepts such as the cepstrum, the pitch detection algorithm includes many features designed to handle cases that do not fit the underlying model very well. For example, for frames 6 and 7 the cepstrum peak is weak
corresponding to the transition from unvoiced to voiced speech. In other problematic cases, the peak at twice the pitch period may be stronger than the peak at the quefrency of the pitch period. Noll applied temporal continuity constraints to prevent such errors.
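The core of the Noll-style detector (without the continuity constraints) can be sketched in a few lines. This is a minimal illustration, not Noll's algorithm: the peak threshold, search range, and window choice are all hypothetical values chosen for the example.

```python
import numpy as np

def cepstral_pitch(frame, fs, fmin=60.0, fmax=400.0, thresh=0.1):
    """Minimal cepstral pitch detector: search the short-time cepstrum for a
    peak in the expected pitch-period quefrency range. Returns
    (voiced_flag, period_in_seconds). Threshold value is illustrative."""
    w = np.hamming(len(frame))
    nfft = 4 * len(frame)
    c = np.fft.ifft(np.log(np.abs(np.fft.fft(frame * w, nfft)) + 1e-10)).real
    lo, hi = int(fs / fmax), int(fs / fmin)      # quefrency search range, in samples
    q = lo + np.argmax(c[lo:hi])
    return (True, q / fs) if c[q] > thresh else (False, 0.0)

# Synthetic 50 ms voiced-like frame: 10 harmonics of a 200 Hz fundamental,
# so the cepstral peak should appear near a quefrency of 5 ms.
fs = 8000
t = np.arange(400) / fs
voiced = sum(np.cos(2 * np.pi * 200 * k * t) for k in range(1, 11))
is_voiced, period = cepstral_pitch(voiced, fs)
```

A production detector would add exactly the refinements the text describes: temporal continuity constraints across frames, and logic for the peak at twice the pitch period.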
5.6 Applications to Pattern Recognition
Perhaps the most pervasive application of the cepstrum in speech processing is its use in pattern recognition systems such as the design of vector quantizers (VQ) and automatic speech recognizers (ASR). In such applications, a speech signal is represented on a frame-by-frame basis by a sequence of short-time cepstra. As we have shown, cepstra can be computed either by z-transform analysis or by DFT implementation of the characteristic system. In either case, we can assume that the cepstrum vector corresponds to a gain-normalized (c[0] = 0) minimum-phase vocal tract impulse response that is defined by the complex cepstrum²
$$\hat{h}[n] = \begin{cases} 2c[n] & 1 \le n \le n_{co} \\ 0 & n < 0. \end{cases} \qquad (5.18)$$
In problems such as VQ or ASR, a test pattern c[n] (vector of cepstrum values $n = 1, 2, \ldots, n_{co}$) is compared against a similarly defined reference pattern $\bar{c}[n]$. Such comparisons require a distance measure. For example, the Euclidean distance applied to the cepstrum would give
$$D = \sum_{n=1}^{n_{co}} |c[n] - \bar{c}[n]|^2. \qquad (5.19a)$$
Equivalently, in the frequency domain,
$$D = \frac{1}{2\pi}\int_{-\pi}^{\pi}\left(\log|H(e^{j\hat{\omega}})| - \log|\bar{H}(e^{j\hat{\omega}})|\right)^2 d\hat{\omega}, \qquad (5.19b)$$
where $\log|H(e^{j\hat{\omega}})|$ is the log magnitude of the DTFT of h[n] corresponding to the complex cepstrum in (5.18) or the real part of the
² For minimum-phase signals, the complex cepstrum satisfies $\hat{h}[n] = 0$ for $n < 0$. Since the cepstrum is always the even part of the complex cepstrum, it follows that $\hat{h}[n] = 2c[n]$ for $n > 0$.
DTFT of $\hat{h}[n]$ in (5.18). Thus, cepstrum-based comparisons are strongly related to comparisons of smoothed short-time spectra. Therefore, the cepstrum offers an effective and flexible representation of speech for pattern recognition problems, and its interpretation as a difference of log spectra suggests a match to auditory perception mechanisms.
5.6.1 Compensation for Linear Filtering
Suppose that we have only a linearly filtered version of the speech signal, $y[n] = h[n] * x[n]$, instead of x[n]. If the analysis window is long compared to the length of h[n], the short-time cepstrum of one frame of the filtered speech signal y[n] will be³
$$c^{(y)}_{\hat{n}}[m] = c^{(x)}_{\hat{n}}[m] + c^{(h)}[m], \qquad (5.20)$$
where $c^{(h)}[m]$ will appear essentially the same in each frame. Therefore, if we can estimate $c^{(h)}[m]$,⁴ which we assume is non-time-varying, we can obtain $c^{(x)}_{\hat{n}}[m]$ at each frame from $c^{(y)}_{\hat{n}}[m]$ by subtraction, i.e., $c^{(x)}_{\hat{n}}[m] = c^{(y)}_{\hat{n}}[m] - c^{(h)}[m]$. This property is extremely attractive in situations where the reference pattern $\bar{c}[m]$ has been obtained under different recording or transmission conditions from those used to acquire the test vector. In these circumstances, the test vector can be compensated for the effects of the linear filtering prior to computing the distance measures used for comparison of patterns.
Another approach to removing the effects of linear distortions is to observe that the cepstrum component due to the distortion is the same in each frame. Therefore, it can be removed by a simple first difference operation of the form:
$$\Delta c^{(y)}_{\hat{n}}[m] = c^{(y)}_{\hat{n}}[m] - c^{(y)}_{\hat{n}-1}[m]. \qquad (5.21)$$
It is clear that if $c^{(y)}_{\hat{n}}[m] = c^{(x)}_{\hat{n}}[m] + c^{(h)}[m]$ with $c^{(h)}[m]$ being independent of $\hat{n}$, then $\Delta c^{(y)}_{\hat{n}}[m] = \Delta c^{(x)}_{\hat{n}}[m]$, i.e., the linear distortion effects are removed.
³ In this section, it will be useful to use somewhat more complicated notation. Specifically, we denote the cepstrum at analysis time $\hat{n}$ of a signal x[n] as $c^{(x)}_{\hat{n}}[m]$, where m denotes the quefrency index of the cepstrum.
⁴ Stockham [124] showed how $c^{(h)}[m]$ for such linear distortions can be estimated from the signal y[n] by time-averaging the log of the STFT.
Furui [39, 40] first noted that the sequence of cepstrum values has temporal information that could be of value for a speaker verification system. He used polynomial fits to cepstrum sequences to extract simple representations of the temporal variation. The delta cepstrum as defined in (5.21) is simply the slope of a first-order polynomial fit to the cepstrum time evolution.
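Both compensation ideas above reduce to one-line array operations on a frame-by-quefrency matrix of cepstra. The sketch below uses cepstral mean subtraction as the estimate of the channel term (one common way to realize the subtraction approach; it recovers $c^{(x)}$ exactly only up to the mean of the clean cepstra) and the first difference of (5.21); the data are synthetic, for illustration only.

```python
import numpy as np

def cms(C):
    """Cepstral mean subtraction over a frame sequence C (frames x quefrency):
    removes any frame-invariant additive term such as c^(h)[m] in (5.20)."""
    return C - C.mean(axis=0)

def delta(C):
    """First-difference (delta) cepstrum of (5.21), frame to frame."""
    return C[1:] - C[:-1]

# Synthetic check: add a fixed "channel" cepstrum h to every frame, as in (5.20).
rng = np.random.default_rng(1)
C = rng.standard_normal((5, 13))     # clean cepstra c^(x), 5 frames x 13 quefrencies
h = rng.standard_normal(13)          # frame-invariant channel cepstrum c^(h)
Y = C + h                            # filtered-speech cepstra c^(y)
```

Since the channel term is identical in every frame, both operations cancel it: the delta cepstra of `Y` equal those of `C`, and the mean-subtracted `Y` equals the mean-subtracted `C`.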
5.6.2 Liftered Cepstrum Distance Measures
In using linear predictive analysis (discussed in Chapter 6) to obtain cepstrum feature vectors for pattern recognition problems, it is observed that there is significant statistical variability due to a variety of factors including short-time analysis window position, bias toward harmonic peaks, and additive noise [63, 126]. A solution to this problem is to use weighted distance measures of the form:
$$D = \sum_{n=1}^{n_{co}} g^2[n]\,|c[n] - \bar{c}[n]|^2, \qquad (5.22a)$$
which can be written as the Euclidean distance of liftered cepstra
$$D = \sum_{n=1}^{n_{co}} |g[n]c[n] - g[n]\bar{c}[n]|^2. \qquad (5.22b)$$
Tohkura [126] found, for example, that when averaged over many frames of speech and speakers, cepstrum values c[n] have zero means and variances on the order of $1/n^2$. This suggests that $g[n] = n$ for $n = 1, 2, \ldots, n_{co}$ could be used to equalize the contribution of each term to the cepstrum difference.
Juang et al. [63] observed that the variability due to the vagaries of LPC analysis could be lessened by using a lifter of the form:
$$g[n] = 1 + 0.5\,n_{co}\sin(\pi n/n_{co}), \qquad n = 1, 2, \ldots, n_{co}. \qquad (5.23)$$
Tests of weighted distance measures showed consistent improvements
in automatic speech recognition tasks.
Itakura and Umezaki [55] used the group delay function to derive a different cepstrum weighting function. Instead of $g[n] = n$ for all n, or the lifter of (5.23), Itakura proposed the lifter
$$g[n] = n^s e^{-n^2/2\tau^2}. \qquad (5.24)$$
This lifter has great flexibility. For example, if s = 0 we have simply lowpass liftering of the cepstrum. If s = 1 and τ is large, we have essentially g[n] = n for small n with high quefrency tapering. Itakura and Umezaki [55] tested the group delay spectrum distance measure in an automatic speech recognition system. They found that for clean test utterances, the difference in recognition rate was small for different values of s when τ ≈ 5, although performance suffered with increasing s for larger values of τ. This was attributed to the fact that for larger s the group delay spectrum becomes very sharply peaked and thus more sensitive to small differences in formant locations. However, in test conditions with additive white noise and also with linear filtering distortions, recognition rates improved significantly with τ = 5 and increasing values of the parameter s.
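The lifters of (5.23) and (5.24) and the weighted distance of (5.22b) are simple enough to state directly in code; this sketch uses an illustrative order $n_{co} = 12$, not a value prescribed by the text.

```python
import numpy as np

def juang_lifter(n_co):
    """Sinusoidal lifter of (5.23), evaluated at n = 1..n_co."""
    n = np.arange(1, n_co + 1)
    return 1.0 + 0.5 * n_co * np.sin(np.pi * n / n_co)

def itakura_lifter(n_co, s, tau):
    """Group-delay-based lifter of (5.24), evaluated at n = 1..n_co."""
    n = np.arange(1, n_co + 1)
    return n ** s * np.exp(-n ** 2 / (2 * tau ** 2))

def liftered_distance(c, c_ref, g):
    """Weighted (liftered) Euclidean cepstral distance of (5.22b)."""
    return np.sum((g * c - g * c_ref) ** 2)

# The Juang lifter rises to a maximum of 1 + n_co/2 at n = n_co/2 and falls
# back to about 1 at the endpoints; with s = 0 the Itakura lifter is a pure
# Gaussian lowpass taper.
g = juang_lifter(12)
h = itakura_lifter(12, s=0, tau=3.0)
```

In a recognizer, `g` would simply be applied once to every stored reference pattern and to each incoming test vector, after which ordinary Euclidean distance comparisons implement (5.22a).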
5.6.3 Mel-Frequency Cepstrum Coefficients
As we have seen, weighted cepstrum distance measures have a directly equivalent interpretation in terms of distance in the frequency domain. This is significant in light of models for human perception of sound, which, as noted in Chapter 3, are based upon a frequency analysis performed in the inner ear. With this in mind, Davis and Mermelstein [27] formulated a new type of cepstrum representation that has come to be widely used and known as the mel-frequency cepstrum coefficients (mfcc).
The basic idea is to compute a frequency analysis based upon a filter bank with approximately critical band spacing of the filters and bandwidths. For a 4 kHz bandwidth, approximately 20 filters are used.
In most implementations, a short-time Fourier analysis is done first, resulting in a DFT $X_{\hat{n}}[k]$ for analysis time $\hat{n}$. Then the DFT values are grouped together in critical bands and weighted by a triangular weighting function as depicted in Figure 5.7. Note that the bandwidths in Figure 5.7 are constant for center frequencies below 1 kHz and then increase exponentially up to half the sampling rate of 4 kHz, resulting
Fig. 5.7 Weighting functions for mel-frequency filter bank (frequency in Hz).
in a total of 22 "filters." The mel-frequency spectrum at analysis time $\hat{n}$ is defined for $r = 1, 2, \ldots, R$ as
$$MF_{\hat{n}}[r] = \frac{1}{A_r}\sum_{k=L_r}^{U_r}\left|V_r[k]X_{\hat{n}}[k]\right|^2, \qquad (5.25a)$$
where $V_r[k]$ is the triangular weighting function for the rth filter, ranging from DFT index $L_r$ to $U_r$, where
$$A_r = \sum_{k=L_r}^{U_r} |V_r[k]|^2 \qquad (5.25b)$$
is a normalizing factor for the rth mel-filter. This normalization is built into the weighting functions of Figure 5.7. It is needed so that a perfectly flat input Fourier spectrum will produce a flat mel-spectrum. For each frame, a discrete cosine transform of the log of the magnitude of the filter outputs is computed to form the function $\mathrm{mfcc}_{\hat{n}}[m]$, i.e.,
$$\mathrm{mfcc}_{\hat{n}}[m] = \frac{1}{R}\sum_{r=1}^{R}\log\left(MF_{\hat{n}}[r]\right)\cos\left[\frac{2\pi}{R}\left(r + \frac{1}{2}\right)m\right]. \qquad (5.26)$$
Typically, $\mathrm{mfcc}_{\hat{n}}[m]$ is evaluated for a number of coefficients, $N_{\mathrm{mfcc}}$, that is less than the number of mel-filters, e.g., $N_{\mathrm{mfcc}} = 13$ and $R = 22$. Figure 5.8 shows the result of mfcc analysis of a frame of voiced speech in comparison with the short-time Fourier spectrum, LPC spectrum (discussed in Chapter 6), and a homomorphically smoothed spectrum. The large dots are the values of $\log(MF_{\hat{n}}[r])$ and the line interpolated
Fig. 5.8 Comparison of spectral smoothing methods (log magnitude vs. frequency in Hz): short-time Fourier transform, homomorphic smoothing, LPC smoothing, and mel cepstrum smoothing.
between them is a spectrum reconstructed at the original DFT frequencies. Note that all these spectra are different, but they have in common peaks at the formant resonances. At higher frequencies, the reconstructed mel-spectrum of course has more smoothing due to the structure of the filter bank.
Note that the delta cepstrum idea expressed by (5.21) can be applied to mfcc to remove the effects of linear filtering, as long as the frequency response of the distorting linear filter does not vary much across each of the mel-frequency bands.
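Given the mel-filter energies of (5.25a), the cosine transform of (5.26) is a single matrix product. The sketch below implements only that final step (the triangular filterbank construction is omitted), using the text's exact cosine kernel $\cos[(2\pi/R)(r + 1/2)m]$ and the typical sizes $R = 22$, $N_{\mathrm{mfcc}} = 13$.

```python
import numpy as np

def mfcc_from_filterbank(MF, n_mfcc=13):
    """Cosine transform (5.26) of the log mel-filter energies MF[1..R]."""
    R = len(MF)
    r = np.arange(1, R + 1)
    m = np.arange(n_mfcc)
    # basis[m, r-1] = cos((2*pi/R) * (r + 1/2) * m)
    basis = np.cos((2 * np.pi / R) * np.outer(m, r + 0.5))
    return (basis @ np.log(MF)) / R

# Sanity check of the normalization in (5.25b)/(5.26): a perfectly flat
# mel-spectrum (all energies equal to e, so log = 1) should give
# mfcc[0] = 1 and mfcc[m] = 0 for m > 0, since the cosine sums cancel.
MF_flat = np.full(22, np.e)
cc = mfcc_from_filterbank(MF_flat)
```

This is why the $A_r$ normalization of (5.25b) matters: it guarantees that a flat input Fourier spectrum maps to a flat mel-spectrum, which in turn maps to a cepstrum that is zero beyond $m = 0$.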
5.7 The Role of the Cepstrum
As discussed in this chapter, the cepstrum entered the realm of speech processing as a basis for pitch detection. It remains one of the most effective indicators of voice pitch that have been devised. Because the vocal tract and excitation components are well separated in the cepstrum, it was natural to consider analysis techniques for estimation of the vocal tract system as well [86, 88, 111]. While separation techniques based on the cepstrum can be very effective, the linear predictive analysis methods to be discussed in the next chapter have proven to be more effective for a variety of reasons. Nevertheless, the cepstrum concept has demonstrated its value when applied to vocal tract system estimates obtained by linear predictive analysis.
6
Linear Predictive Analysis
Linear predictive analysis is one of the most powerful and widely used speech analysis techniques. The importance of this method lies both in its ability to provide accurate estimates of the speech parameters and in its relative speed of computation. In this chapter, we present a formulation of the ideas behind linear prediction, and discuss some of the issues that are involved in using it in practical speech applications.
6.1 Linear Prediction and the Speech Model
We come to the idea of linear prediction of speech by recalling the source/system model that was introduced in Chapter 2, where the sampled speech signal was modeled as the output of a linear, slowly time-varying system excited by either quasi-periodic impulses (during voiced speech) or random noise (during unvoiced speech). The particular form of the source/system model implied by linear predictive analysis is depicted in Figure 6.1, where the speech model is the part inside the dashed box. Over short time intervals, the linear system is described by an all-pole system function of the form:
$$H(z) = \frac{S(z)}{E(z)} = \frac{G}{1 - \displaystyle\sum_{k=1}^{p} a_k z^{-k}}. \qquad (6.1)$$
Fig. 6.1 Model for linear predictive analysis of speech signals.
In linear predictive analysis, the excitation is defined implicitly by the vocal tract system model, i.e., the excitation is whatever is needed to produce s[n] at the output of the system. The major advantage of this model is that the gain parameter, G, and the filter coefficients $\{a_k\}$ can be estimated in a very straightforward and computationally efficient manner by the method of linear predictive analysis.
For the system of Figure 6.1 with the vocal tract model of (6.1), the speech samples s[n] are related to the excitation e[n] by the difference equation
$$s[n] = \sum_{k=1}^{p} a_k s[n-k] + G\,e[n]. \qquad (6.2)$$
A linear predictor with prediction coefficients $\alpha_k$ is defined as a system whose output is
$$\tilde{s}[n] = \sum_{k=1}^{p} \alpha_k s[n-k], \qquad (6.3)$$
and the prediction error, defined as the amount by which $\tilde{s}[n]$ fails to exactly predict sample s[n], is
$$d[n] = s[n] - \tilde{s}[n] = s[n] - \sum_{k=1}^{p} \alpha_k s[n-k]. \qquad (6.4)$$
From (6.4) it follows that the prediction error sequence is the output of an FIR linear system whose system function is
$$A(z) = 1 - \sum_{k=1}^{p} \alpha_k z^{-k} = \frac{D(z)}{S(z)}. \qquad (6.5)$$
It can be seen by comparing (6.2) and (6.4) that if the speech signal obeys the model of (6.2) exactly, and if $\alpha_k = a_k$, then $d[n] = Ge[n]$. Thus, the prediction error filter, A(z), will be an inverse filter for the system, H(z), of (6.1), i.e.,
$$H(z) = \frac{G}{A(z)}. \qquad (6.6)$$
The basic problem of linear prediction analysis is to determine the set of predictor coefficients $\{\alpha_k\}$ directly from the speech signal in order to obtain a useful estimate of the time-varying vocal tract system through the use of (6.6). The basic approach is to find a set of predictor coefficients that will minimize the mean-squared prediction error over a short segment of the speech waveform. The resulting parameters are then assumed to be the parameters of the system function H(z) in the model for production of the given segment of the speech waveform. This process is repeated periodically at a rate appropriate to track the phonetic variation of speech (i.e., on the order of 50–100 times per second).
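The inverse-filter relationship between (6.2) and (6.4) can be verified numerically: synthesize a signal from the difference equation (6.2) and apply the prediction error filter (6.4) with matched coefficients. The coefficient and gain values below are hypothetical, chosen only to give a stable second-order example.

```python
import numpy as np

# If s[n] follows the all-pole model (6.2) and the predictor matches it
# (alpha_k = a_k), the prediction error (6.4) reduces to d[n] = G e[n].
a = np.array([0.9, -0.4])            # illustrative stable model coefficients a_k
G = 0.5                              # illustrative gain
rng = np.random.default_rng(0)
e = rng.standard_normal(200)         # white-noise excitation (unvoiced case)

s = np.zeros(200)
for n in range(200):                 # synthesis by the difference equation (6.2)
    s[n] = sum(a[k] * s[n - k - 1] for k in range(2) if n - k - 1 >= 0) + G * e[n]

d = np.array([s[n] - sum(a[k] * s[n - k - 1] for k in range(2) if n - k - 1 >= 0)
              for n in range(200)])  # prediction error (6.4) with alpha = a
```

The interesting problem, taken up next, is the converse: estimating the $\alpha_k$ from `s` alone when the true `a` and `e` are unknown.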
That this approach will lead to useful results may not be immediately obvious, but it can be justified in several ways. First, recall that if $\alpha_k = a_k$, then $d[n] = Ge[n]$. For voiced speech this means that d[n] would consist of a train of impulses, i.e., d[n] would be small except at isolated samples spaced by the current pitch period, $P_0$. Thus, finding $\alpha_k$s that minimize the mean-squared prediction error seems consistent with this observation. A second motivation for this approach follows from the fact that if a signal is generated by (6.2) with non-time-varying coefficients and excited either by a single impulse or by a stationary white noise input, then it can be shown that the predictor coefficients that result from minimizing the mean-squared prediction error (over all time) are identical to the coefficients of (6.2). A third, pragmatic justification for using the minimum mean-squared prediction error as a basis for estimating the model parameters is that this approach leads to an exceedingly useful and accurate representation of the speech signal that can be obtained by efficient solution of a set of linear equations.
The short-time average prediction error is defined as
$$E_{\hat{n}} = \left\langle d^2_{\hat{n}}[m] \right\rangle = \left\langle \left( s_{\hat{n}}[m] - \sum_{k=1}^{p}\alpha_k s_{\hat{n}}[m-k] \right)^2 \right\rangle, \qquad (6.7)$$
where $s_{\hat{n}}[m]$ is a segment of speech that has been selected in a neighborhood of the analysis time $\hat{n}$, i.e.,
$$s_{\hat{n}}[m] = s[m + \hat{n}], \qquad -M_1 \le m \le M_2. \qquad (6.8)$$
That is, the time origin of the analysis segment is shifted to sample $\hat{n}$ of the entire signal. The notation $\langle\ \rangle$ denotes averaging over a finite number of samples. The details of specific definitions of the averaging operation will be discussed below.
We can find the values of $\alpha_k$ that minimize $E_{\hat{n}}$ in (6.7) by setting $\partial E_{\hat{n}}/\partial\alpha_i = 0$, for $i = 1, 2, \ldots, p$, thereby obtaining the equations¹
$$\sum_{k=1}^{p} \tilde{\alpha}_k \left\langle s_{\hat{n}}[m-i]s_{\hat{n}}[m-k] \right\rangle = \left\langle s_{\hat{n}}[m-i]s_{\hat{n}}[m] \right\rangle, \qquad 1 \le i \le p, \qquad (6.9)$$
where the $\tilde{\alpha}_k$ are the values of $\alpha_k$ that minimize $E_{\hat{n}}$ in (6.7). (Since the $\tilde{\alpha}_k$ are unique, we will drop the tilde and use the notation $\alpha_k$ to denote the values that minimize $E_{\hat{n}}$.) If we define
$$\varphi_{\hat{n}}[i,k] = \left\langle s_{\hat{n}}[m-i]s_{\hat{n}}[m-k] \right\rangle, \qquad (6.10)$$
then (6.9) can be written more compactly as²
$$\sum_{k=1}^{p} \alpha_k \varphi_{\hat{n}}[i,k] = \varphi_{\hat{n}}[i,0], \qquad i = 1, 2, \ldots, p. \qquad (6.11)$$
¹ More accurately, the solutions of (6.9) only provide a stationary point, which can be shown to be a minimum of $E_{\hat{n}}$ since $E_{\hat{n}}$ is a convex function of the $\alpha_i$s.
² The quantities $\varphi_{\hat{n}}[i,k]$ are in the form of a correlation function for the speech segment $s_{\hat{n}}[m]$. The details of the definition of the averaging operation used in (6.10) have a significant effect on the properties of the prediction coefficients that are obtained by solving (6.11).
If we know $\varphi_{\hat{n}}[i,k]$ for $1 \le i \le p$ and $0 \le k \le p$, this set of p equations in p unknowns, which can be represented by the matrix equation:
$$\boldsymbol{\Phi}\boldsymbol{\alpha} = \boldsymbol{\psi}, \qquad (6.12)$$
can be solved for the vector $\boldsymbol{\alpha} = \{\alpha_k\}$ of unknown predictor coefficients that minimize the average squared prediction error for the segment $s_{\hat{n}}[m]$.³ Using (6.7) and (6.9), the minimum mean-squared prediction error can be shown to be [3, 5]
$$E_{\hat{n}} = \varphi_{\hat{n}}[0,0] - \sum_{k=1}^{p} \alpha_k \varphi_{\hat{n}}[0,k]. \qquad (6.13)$$
Thus, the total minimum mean-squared error consists of a fixed component equal to the mean-squared value of the signal segment minus a term that depends on the predictor coefficients that satisfy (6.11), i.e., the optimum coefficients reduce $E_{\hat{n}}$ in (6.13) the most.
To solve for the optimum predictor coefficients, we must first compute the quantities $\varphi_{\hat{n}}[i,k]$ for $1 \le i \le p$ and $0 \le k \le p$. Once this is done we only have to solve (6.11) to obtain the $\alpha_k$s. Thus, in principle, linear prediction analysis is very straightforward. However, the details of the computation of $\varphi_{\hat{n}}[i,k]$ and the subsequent solution of the equations are somewhat intricate, and further discussion is required.
6.2 Computing the Prediction Coeﬃcients
So far we have not been explicit about the meaning of the averaging notation $\langle\ \rangle$ used to define the mean-squared prediction error in (6.7). As we have stated, in a short-time analysis procedure, the averaging must be over a finite interval. We shall see below that two methods for linear predictive analysis emerge out of a consideration of the limits of summation and the definition of the waveform segment $s_{\hat{n}}[m]$.⁴
3
Although the α
k
s are functions of ˆ n (the time index at which they are estimated) this
dependence will not be explicitly shown. We shall also ﬁnd it advantageous to drop the
subscripts ˆ n on E
ˆ n
, s
ˆ n
[m], and ϕ
ˆ n
[i, k] when no confusion will result.
^4 These two methods, applied to the same speech signal, yield slightly different optimum predictor coefficients.
6.2.1 The Covariance Method
One approach to computing the prediction coefficients is based on the definition

$$E_{\hat n} = \sum_{m=-M_1}^{M_2} \left(d_{\hat n}[m]\right)^2 = \sum_{m=-M_1}^{M_2}\left(s_{\hat n}[m] - \sum_{k=1}^{p}\alpha_k\, s_{\hat n}[m-k]\right)^2, \qquad (6.14)$$
where the averaging interval is −M_1 ≤ m ≤ M_2. The quantities ϕ_n̂[i,k] needed in (6.11) inherit the same definition of the averaging operator, i.e.,
$$\varphi_{\hat n}[i,k] = \sum_{m=-M_1}^{M_2} s_{\hat n}[m-i]\, s_{\hat n}[m-k], \qquad 1 \le i \le p,\quad 0 \le k \le p. \qquad (6.15)$$
Both (6.14) and (6.15) require values of s_n̂[m] = s[m + n̂] over the range −M_1 − p ≤ m ≤ M_2. By changes of the index of summation, (6.15) can be expressed in the equivalent forms:
$$\varphi_{\hat n}[i,k] = \sum_{m=-M_1-i}^{M_2-i} s_{\hat n}[m]\, s_{\hat n}[m+i-k] \qquad (6.16a)$$

$$\varphi_{\hat n}[i,k] = \sum_{m=-M_1-k}^{M_2-k} s_{\hat n}[m]\, s_{\hat n}[m+k-i], \qquad (6.16b)$$

from which it follows that ϕ_n̂[i,k] = ϕ_n̂[k,i].
Figure 6.2 shows the sequences that are involved in computing the mean-squared prediction error as defined by (6.14). The top part of this figure shows a sampled speech signal s[m], and the box denotes a segment of that waveform selected around some time index n̂. The second plot shows that segment extracted as a finite-length sequence s_n̂[m] = s[m + n̂] for −M_1 − p ≤ m ≤ M_2. Note the p "extra" samples (light shading) at the beginning that are needed to start the prediction error filter at time −M_1. This method does not require any assumption about the signal outside the interval −M_1 − p ≤ m ≤ M_2, since the samples s_n̂[m] for −M_1 − p ≤ m ≤ M_2 are sufficient to evaluate ϕ[i,k] for all required values of i and k. The third plot shows the prediction error computed with the optimum predictor coefficients. Note that in solving for the optimum prediction coefficients using (6.11), the prediction error is implicitly computed over the range −M_1 ≤ m ≤ M_2, as required in (6.14) and (6.15). The minimum mean-squared prediction error would be simply the sum of squares of all the samples shown. In most cases of linear prediction, the prediction error sequence is not explicitly computed, since solution of (6.11) does not require it.

Fig. 6.2 Illustration of windowing and prediction error for the covariance method.
The mathematical structure that deﬁnes the covariance method of
linear predictive analysis implies a number of useful properties of the
solution. It is worthwhile to summarize them as follows:
(C.1) The mean-squared prediction error satisfies E_n̂ ≥ 0. With this method, it is theoretically possible for the average error to be exactly zero.
(C.2) The matrix Φ in (6.12) is a symmetric positive-semidefinite matrix.
(C.3) The roots of the prediction error filter A(z) in (6.5) are not guaranteed to lie within the unit circle of the z-plane. This implies that the vocal tract model filter (6.6) is not guaranteed to be stable.
(C.4) As a result of (C.2), the equations (6.11) can be solved efficiently using, for example, the well-known Cholesky decomposition of the covariance matrix Φ [3].
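The steps above can be sketched in a short program. The following is a minimal illustration, not a production implementation: it forms ϕ[i,k] as in (6.15) (with the analysis interval shifted to start at index 0 for convenience) and solves the normal equations (6.11) with a hand-coded Cholesky factorization, as suggested in (C.4). The function name and indexing conventions are our own.

```python
import math

def covariance_lpc(s, p):
    """Covariance-method LPC, a sketch of (6.11), (6.14)-(6.15).

    s : samples s[m] for m = 0..N-1; the prediction error is minimized
        over m = p..N-1, so no samples outside s are needed.
    Returns the predictor coefficients alpha[1..p] as a 0-indexed list.
    """
    N = len(s)
    # phi[i][k] per (6.15), with the analysis interval shifted to 0..N-1
    phi = [[sum(s[m - i] * s[m - k] for m in range(p, N))
            for k in range(p + 1)] for i in range(p + 1)]
    # Normal equations (6.11): sum_k alpha_k phi[i][k] = phi[i][0], i = 1..p
    A = [[phi[i][k] for k in range(1, p + 1)] for i in range(1, p + 1)]
    b = [phi[i][0] for i in range(1, p + 1)]
    # Cholesky factorization A = L L^T (Phi is symmetric positive
    # semidefinite, property (C.2)), then two triangular solves.
    L = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            acc = A[i][j] - sum(L[i][t] * L[j][t] for t in range(j))
            L[i][j] = math.sqrt(acc) if i == j else acc / L[j][j]
    y = [0.0] * p
    for i in range(p):
        y[i] = (b[i] - sum(L[i][t] * y[t] for t in range(i))) / L[i][i]
    alpha = [0.0] * p
    for i in reversed(range(p)):
        alpha[i] = (y[i] - sum(L[t][i] * alpha[t]
                               for t in range(i + 1, p))) / L[i][i]
    return alpha
```

Consistent with (C.1), a segment generated by an exact pth-order recursion yields zero prediction error, and the true coefficients are recovered exactly.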
6.2.2 The Autocorrelation Method
Perhaps the most widely used method of linear predictive analysis is called the autocorrelation method, because the covariance function ϕ_n̂[i,k] needed in (6.11) reduces to the STACF φ_n̂[|i − k|] that we discussed in Chapter 4 [53, 54, 74, 78]. In the autocorrelation method, the analysis segment s_n̂[m] is defined as
$$s_{\hat n}[m] = \begin{cases} s[\hat n + m]\, w[m], & -M_1 \le m \le M_2 \\ 0, & \text{otherwise}, \end{cases} \qquad (6.17)$$
where the analysis window w[m] is used to taper the edges of the segment to zero. Since the analysis segment is defined by the windowing of (6.17) to be zero outside the interval −M_1 ≤ m ≤ M_2, it follows that the prediction error sequence d_n̂[m] can be nonzero only in the range −M_1 ≤ m ≤ M_2 + p. Therefore, E_n̂ is defined as
$$E_{\hat n} = \sum_{m=-M_1}^{M_2+p} \left(d_{\hat n}[m]\right)^2 = \sum_{m=-\infty}^{\infty} \left(d_{\hat n}[m]\right)^2. \qquad (6.18)$$
The windowing of (6.17) allows us to use the infinite limits to signify that the sum is over all nonzero values of d_n̂[m]. Applying this notion to (6.16a) and (6.16b) leads to the conclusion that
$$\varphi_{\hat n}[i,k] = \sum_{m=-\infty}^{\infty} s_{\hat n}[m]\, s_{\hat n}[m + |i-k|] = \phi_{\hat n}[|i-k|]. \qquad (6.19)$$
Thus, ϕ_n̂[i,k] is a function only of |i − k|. Therefore, we can replace ϕ_n̂[i,k] by φ_n̂[|i − k|], which is the STACF defined in Chapter 4 as
$$\phi_{\hat n}[k] = \sum_{m=-\infty}^{\infty} s_{\hat n}[m]\, s_{\hat n}[m+k] = \phi_{\hat n}[-k]. \qquad (6.20)$$
The resulting set of equations for the optimum predictor coefficients is therefore

$$\sum_{k=1}^{p} \alpha_k\, \phi_{\hat n}[|i-k|] = \phi_{\hat n}[i], \qquad i = 1, 2, \dots, p. \qquad (6.21)$$
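As a concrete illustration of (6.17) and (6.20), the short sketch below windows a segment with a Hamming window and computes φ[k] for k = 0, ..., p. The Hamming window and the shifted indexing are assumptions for the example; any tapering window would do.

```python
import math

def stacf(s, n0, M1, M2, p):
    """Short-time autocorrelation phi[k], k = 0..p, per (6.17) and (6.20).

    A Hamming window tapers the segment s[n0 - M1 .. n0 + M2] to zero at
    its edges; outside the window the segment is defined to be zero, so
    finite summation limits suffice in (6.20).
    """
    L = M1 + M2 + 1
    w = [0.54 - 0.46 * math.cos(2 * math.pi * m / (L - 1)) for m in range(L)]
    seg = [s[n0 - M1 + m] * w[m] for m in range(L)]            # (6.17)
    # phi[k] = sum_m seg[m] * seg[m + k]                        # (6.20)
    return [sum(seg[m] * seg[m + k] for m in range(L - k))
            for k in range(p + 1)]
```

By the Cauchy-Schwarz inequality, the values returned satisfy |φ[k]| ≤ φ[0], as any valid autocorrelation must.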
Fig. 6.3 Illustration of windowing and prediction error for the autocorrelation method.
Figure 6.3 shows the sequences that are involved in computing the optimum prediction coefficients using the autocorrelation method. The upper plot shows the same sampled speech signal s[m] as in Figure 6.2, with a Hamming window centered at time index n̂. The middle plot shows the result of multiplying the signal s[n̂ + m] by the window w[m] and redefining the time origin to obtain s_n̂[m]. Note that the zero-valued samples outside the window are shown with light shading. The third plot shows the prediction error computed using the optimum coefficients. Note that for this segment, the prediction error (which is implicit in the solution of (6.21)) is nonzero over the range −M_1 ≤ m ≤ M_2 + p. Also note the lightly shaded p samples at the beginning. These samples can be large because the predictor must predict these samples from the zero-valued samples that precede the windowed segment s_n̂[m]. It is easy to see that at least one of these first p samples of the prediction error must be nonzero. Similarly, the last p samples of the prediction error can be large because the predictor must predict zero-valued samples from windowed speech samples. It can also be seen that at least one of these last p samples of the prediction error must be nonzero. For this reason, it follows that E_n̂, being the sum of squares of the prediction error samples, must always be strictly greater than zero.^5
As in the case of the covariance method, the mathematical structure of the autocorrelation method implies a number of properties of the solution, including the following:
(A.1) The mean-squared prediction error satisfies E_n̂ > 0. With this method, it is theoretically impossible for the error to be exactly zero, because there will always be at least one sample at the beginning and one at the end of the prediction error sequence that will be nonzero.
(A.2) The matrix Φ in (6.12) is a symmetric positive-definite Toeplitz matrix [46].
(A.3) The roots of the prediction error filter A(z) in (6.5) are guaranteed to lie within the unit circle of the z-plane, so that the vocal tract model filter of (6.6) is guaranteed to be stable.
(A.4) As a result of (A.2), the equations (6.11) can be solved efficiently using the Levinson–Durbin algorithm, which, because of its many implications, we discuss in more detail in Section 6.3.
6.3 The Levinson–Durbin Recursion
As stated in (A.2) above, the matrix Φ in (6.12) is a symmetric positive-definite Toeplitz matrix, which means that all the elements on a given diagonal in the matrix are equal. Equation (6.22) shows the detailed structure of the matrix equation Φα = ψ for the autocorrelation method.
$$\begin{bmatrix} \phi[0] & \phi[1] & \cdots & \phi[p-1] \\ \phi[1] & \phi[0] & \cdots & \phi[p-2] \\ \vdots & \vdots & \ddots & \vdots \\ \phi[p-1] & \phi[p-2] & \cdots & \phi[0] \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_p \end{bmatrix} = \begin{bmatrix} \phi[1] \\ \phi[2] \\ \vdots \\ \phi[p] \end{bmatrix} \qquad (6.22)$$
^5 For this reason, a tapering window is generally used in the autocorrelation method.
Note that the vector ψ is composed of almost the same autocorrelation values as comprise Φ. Because of the special structure of (6.22), it is possible to derive a recursive algorithm for inverting the matrix Φ. That algorithm, known as the Levinson–Durbin algorithm, is specified by the following steps:
Levinson–Durbin Algorithm

$$E^{(0)} = \phi[0] \qquad (D.1)$$

for i = 1, 2, ..., p

$$k_i = \left(\phi[i] - \sum_{j=1}^{i-1} \alpha_j^{(i-1)}\, \phi[i-j]\right) \Big/ E^{(i-1)} \qquad (D.2)$$

$$\alpha_i^{(i)} = k_i \qquad (D.3)$$

if i > 1 then for j = 1, 2, ..., i − 1

$$\alpha_j^{(i)} = \alpha_j^{(i-1)} - k_i\, \alpha_{i-j}^{(i-1)} \qquad (D.4)$$

end

$$E^{(i)} = (1 - k_i^2)\, E^{(i-1)} \qquad (D.5)$$

end

$$\alpha_j = \alpha_j^{(p)}, \quad j = 1, 2, \dots, p \qquad (D.6)$$
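Steps (D.1)–(D.6) translate almost line for line into code. The sketch below uses our own naming, with 0-indexed Python lists standing in for the 1-indexed coefficients; it returns the final predictor coefficients together with the PARCOR coefficients k_i and the error sequence E^(i) for all orders.

```python
def levinson_durbin(phi, p):
    """Levinson-Durbin recursion, steps (D.1)-(D.6).

    phi : autocorrelation values phi[0..p] from (6.20).
    Returns (alpha, ks, E): predictor coefficients alpha[1..p],
    PARCOR coefficients k[1..p], and the mean-squared errors
    E[0..p] for predictors of every order from 0 to p.
    """
    E = [phi[0]]                                   # (D.1)
    alpha = []                                     # order-0 predictor: empty
    ks = []
    for i in range(1, p + 1):
        acc = phi[i] - sum(alpha[j] * phi[i - 1 - j] for j in range(i - 1))
        k = acc / E[-1]                            # (D.2)
        ks.append(k)
        new = [alpha[j] - k * alpha[i - 2 - j] for j in range(i - 1)]  # (D.4)
        new.append(k)                              # (D.3)
        alpha = new
        E.append((1 - k * k) * E[-1])              # (D.5)
    return alpha, ks, E                            # alpha = alpha^(p)  (D.6)
```

For instance, the autocorrelation φ[k] = 0.9^k of a first-order process yields α_1 = 0.9 with all higher coefficients zero, and E^(1) = (1 − 0.81)φ[0], consistent with (D.5).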
An important feature of the Levinson–Durbin algorithm is that it determines by recursion the optimum ith-order predictor from the optimum (i − 1)th-order predictor, and as part of the process, all predictors from order 0 (no prediction) to order p are computed, along with the corresponding mean-squared prediction errors E^(i). Specifically, the equations labeled (D.3) and (D.4) can be used to show that the prediction error system function satisfies

$$A^{(i)}(z) = A^{(i-1)}(z) - k_i\, z^{-i} A^{(i-1)}(z^{-1}). \qquad (6.23)$$
Defining the ith-order forward prediction error e^(i)[n] as the output of the prediction error filter with system function A^(i)(z), and b^(i)[n] as the output of the ith-order backward prediction error filter B^(i)(z) = z^{-i} A^(i)(z^{-1}), (6.23) leads (after some manipulations) to an interpretation of the Levinson–Durbin algorithm in terms of a lattice filter structure, as in Figure 6.4(a).
Fig. 6.4 Lattice structures derived from the Levinson–Durbin recursion. (a) Prediction error
ﬁlter A(z). (b) Vocal tract ﬁlter H(z) = 1/A(z).
Also, by solving two equations in two unknowns recursively, it is possible to start at the output of Figure 6.4(a) and work to the left, eventually computing s[n] in terms of e[n]. The corresponding lattice structure is shown in Figure 6.4(b). Lattice structures like this can be derived from acoustic principles applied to a physical model composed of concatenated lossless tubes [101]. If the input and output signals in such physical models are sampled at just the right sampling rate, the sampled signals are related by a transfer function identical to (6.6). In this case, the coefficients k_i behave as reflection coefficients at the tube boundaries [3, 78, 101].
Note that the parameters k_i for i = 1, 2, ..., p play a key role in the Levinson–Durbin recursion and also in the lattice filter interpretation. Itakura and Saito [53, 54] showed that the parameters k_i in the Levinson–Durbin recursion and the lattice filter interpretation obtained from it also could be derived by looking at linear predictive analysis from a statistical perspective. They called the k_i parameters PARCOR (for partial correlation) coefficients [54], because they can be computed directly as a ratio of cross-correlation values between the forward and backward prediction errors at the output of the (i − 1)th stage of prediction in Figure 6.4(a), i.e.,
$$k_i = \frac{\displaystyle\sum_{m=-\infty}^{\infty} e^{(i-1)}[m]\, b^{(i-1)}[m-1]}{\left(\displaystyle\sum_{m=-\infty}^{\infty} \left(e^{(i-1)}[m]\right)^2 \displaystyle\sum_{m=-\infty}^{\infty} \left(b^{(i-1)}[m-1]\right)^2\right)^{1/2}}. \qquad (6.24)$$
In the PARCOR interpretation, each stage of Figure 6.4(a) removes part of the correlation in the input signal. The PARCOR coefficients computed using (6.24) are identical to the k_i obtained as a result of the Levinson–Durbin algorithm. Indeed, Equation (D.2) in the Levinson–Durbin algorithm can be replaced by (6.24), and the result is an algorithm for transforming the PARCOR representation into the linear predictor coefficient representation.
The Levinson–Durbin formulation provides one more piece of useful information about the PARCOR coefficients. Specifically, from Equation (D.5) of the algorithm, it follows that, since E^(i) = (1 − k_i²)E^(i−1) is strictly greater than zero for predictors of all orders, it must be true that −1 < k_i < 1 for all i. It can be shown that this condition on the PARCORs also guarantees that all the zeros of a prediction error filter A^(i)(z) of any order must be strictly inside the unit circle of the z-plane [74, 78].
6.4 LPC Spectrum
The frequency-domain interpretation of linear predictive analysis provides an informative link to our earlier discussions of the STFT and cepstrum analysis. The autocorrelation method is based on the short-time autocorrelation function, φ_n̂[m], which is the inverse discrete Fourier transform of the magnitude-squared of the STFT, |S_n̂(e^{jω̂})|², of the windowed speech signal s_n̂[m] = s[n̂ + m]w[m]. The values φ_n̂[m] for m = 0, 1, ..., p are used to compute the prediction coefficients and gain, which in turn define the vocal tract system function H(z) in (6.6). Therefore, the magnitude-squared of the frequency response of this system, obtained by evaluating H(z) on the unit circle at angles 2πf/f_s, is of the form:
$$\left|H(e^{j2\pi f/f_s})\right|^2 = \left|\frac{G}{1 - \displaystyle\sum_{k=1}^{p} \alpha_k\, e^{-j2\pi f k/f_s}}\right|^2, \qquad (6.25)$$
and can be thought of as an alternative short-time spectral representation. Figure 6.5 shows a comparison between short-time Fourier analysis and linear predictive spectrum analysis for segments of voiced and unvoiced speech. Figures 6.5(a) and 6.5(c) show the STACF, with the first 23 values plotted with a heavy line. These values are used to compute the predictor coefficients and gain for an LPC model with p = 22. The frequency responses of the corresponding vocal tract system models are computed using (6.25), where the sampling frequency is f_s = 16 kHz. These frequency responses are superimposed on the corresponding STFTs (shown in gray). As we have observed before, the rapid variations with frequency in the STFT are due primarily to the excitation, while the overall shape is assumed to be determined by the effects of glottal pulse, vocal tract transfer function, and radiation.

Fig. 6.5 Comparison of short-time Fourier analysis with linear predictive analysis: (a) voiced autocorrelation; (b) voiced LPC spectrum; (c) unvoiced autocorrelation; (d) unvoiced LPC spectrum.

In cepstrum analysis, the excitation effects are removed by lowpass liftering the cepstrum. In linear predictive spectrum analysis, the excitation effects are removed by focusing on the low-time autocorrelation coefficients. The amount of smoothing of the spectrum is controlled by the choice of p. Figure 6.5 shows that a linear prediction model with p = 22 matches the general shape of the short-time spectrum but does not represent all its local peaks and valleys, and this is exactly what is desired.
The question naturally arises as to how p should be chosen. Figure 6.6 offers a suggestion. Figure 6.6(a) shows the STFT (in gray) and the frequency responses of a 12th-order model (heavy dark line) and a 40th-order model (thin dark line). Evidently, the linear predictive spectra tend to favor the peaks of the short-time Fourier transform. That this is true in general can be argued using the Parseval theorem of Fourier analysis [74]. This is in contrast to homomorphic smoothing of the STFT, which tends toward an average of the peaks and valleys of the STFT. As an example, see Figure 5.8, which shows a comparison between the short-time Fourier spectrum, the LPC spectrum with p = 12, a homomorphically smoothed spectrum using 13 cepstrum values, and a mel-frequency spectrum.

Fig. 6.6 Effect of predictor order on: (a) estimating the spectral envelope of a 4-kHz bandwidth signal; and (b) the normalized mean-squared prediction error for the same signal.
Figure 6.6(b) shows the normalized mean-squared prediction error V^(p) = E^(p)/φ[0] as a function of p for the segment of speech used to produce Figure 6.6(a). Note the sharp decrease from p = 0 (no prediction implies V^(0) = 1) to p = 1 and the less abrupt decrease thereafter. Furthermore, notice that the mean-squared error curve flattens out above about p = 12 and then decreases only modestly thereafter. The choice p = 12 gives a good match to the general shape of the STFT, highlighting the formant structure imposed by the vocal tract filter while ignoring the periodic pitch structure. Observe that p = 40 gives a much different result. In this case, the vocal tract model is highly influenced by the pitch harmonics in the short-time spectrum. It can be shown that if p is increased to the length of the windowed speech segment, then |H(e^{j2πf/f_s})|² → |S_n̂(e^{j2πf/f_s})|², but as pointed out in (A.1) above, E^(2M) does not go to zero [75].
In a particular application, the prediction order is generally fixed at a value that captures the general spectral shape due to the glottal pulse, vocal tract resonances, and radiation. From the acoustic theory of speech production, it follows that the glottal pulse spectrum is lowpass, the radiation filtering is highpass, and the vocal tract imposes a resonance structure that, for adult speakers, comprises about one resonance per kilohertz of frequency [101]. For the sampled speech signal, the combination of the lowpass glottal pulse spectrum and the highpass filtering of radiation is usually adequately represented by one or two additional complex pole pairs. When coupled with an estimate of one resonance per kilohertz, this leads to a rule of thumb of p = 4 + f_s/1000. For example, for a sampling rate of f_s = 8000 Hz, it is common to use a predictor order of 10–12. Note that in Figure 6.5, where the sampling rate was f_s = 16000 Hz, a predictor order of p = 22 gave a good representation of the overall shape and resonance structure of the speech segment over the band from 0 to 8000 Hz.
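The rule of thumb and the spectral evaluation in (6.25) are easy to demonstrate in a few lines. The sketch below is illustrative only; the function names are our own, and the gain and coefficients are assumed given.

```python
import cmath
import math

def lpc_order(fs):
    """Rule-of-thumb predictor order p = 4 + fs/1000."""
    return 4 + int(fs / 1000)

def lpc_spectrum_db(alpha, G, fs, freqs):
    """Evaluate 10*log10 |H(e^{j 2 pi f / fs})|^2 per (6.25) at the analog
    frequencies freqs (in Hz), given coefficients alpha[1..p] and gain G."""
    out = []
    for f in freqs:
        zinv = cmath.exp(-2j * cmath.pi * f / fs)  # z^{-1} on the unit circle
        A = 1 - sum(a * zinv ** (k + 1) for k, a in enumerate(alpha))
        out.append(10 * math.log10(abs(G / A) ** 2))
    return out
```

For f_s = 8000 Hz the rule gives p = 12, consistent with the common choice of 10–12 mentioned above; for a single-pole model with α_1 = 0.9 and G = 1, the response at f = 0 is |1/(1 − 0.9)|², i.e., 20 dB.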
The example of Figure 6.6 illustrates an important point about linear predictive analysis. In that example, the speaker was a high-pitched female, and the wide spacing between harmonics causes the peaks of the linear predictive vocal tract model to be biased toward those harmonics. This effect is a serious limitation for high-pitched voices, but it is less problematic for male voices, where the spacing between harmonics is generally much smaller.
6.5 Equivalent Representations
The basic parameters obtained by linear predictive analysis are the gain G and the prediction coefficients {α_k}. From these, a variety of different equivalent representations can be obtained. These different representations are important, particularly when they are used in speech coding, because we often wish to quantize the model parameters for efficiency in storage or transmission of coded speech.
In this section, we give only a brief summary of the most important
equivalent representations.
6.5.1 Roots of Prediction Error System Function
Equation (6.5) shows that the system function of the prediction error filter is a polynomial in z^{-1}, and therefore it can be represented in terms of its zeros as

$$A(z) = 1 - \sum_{k=1}^{p} \alpha_k z^{-k} = \prod_{k=1}^{p} \left(1 - z_k z^{-1}\right). \qquad (6.26)$$
According to (6.6), the zeros of A(z) are the poles of H(z). Therefore, if the model order is chosen judiciously, as discussed in the previous section, then it can be expected that roughly f_s/1000 of the roots will be close in frequency (angle in the z-plane) to the formant frequencies. Figure 6.7 shows an example of the roots (marked with ×) of a 12th-order predictor. Note that eight of the roots (four complex conjugate pairs) are close to the unit circle. These are the poles of H(z) that model the formant resonances. The remaining four roots lie well within the unit circle, which means that they only provide for the overall spectral shaping resulting from the glottal and radiation spectral shaping.
Fig. 6.7 Poles of H(z) (zeros of A(z)) marked with × and LSP roots marked with ∗ and o.
The prediction coefficients are perhaps the most susceptible to quantization errors of all the equivalent representations. It is well known that the roots of a polynomial are highly sensitive to errors in its coefficients, since all the roots are a function of all the coefficients [89]. Since the pole locations are crucial to accurate representation of the spectrum, it is important to maintain high accuracy in the location of the zeros of A(z). One possibility, not often invoked, would be to factor the polynomial as in (6.26) and then quantize each root (magnitude and angle) individually.
6.5.2 LSP Coeﬃcients
A much more desirable alternative to quantization of the roots of A(z) was introduced by Itakura [52], who defined the line spectrum pair (LSP) polynomials^6

$$P(z) = A(z) + z^{-(p+1)} A(z^{-1}) \qquad (6.27a)$$

$$Q(z) = A(z) - z^{-(p+1)} A(z^{-1}). \qquad (6.27b)$$

^6 Note the similarity to (6.23), from which it follows that P(z) and Q(z) are system functions of lattice filters obtained by extending Figure 6.4(a) with an additional section with k_{p+1} = ∓1, respectively.
This transformation of the linear prediction parameters is invertible:
to recover A(z) from P(z) and Q(z) simply add the two equations to
obtain A(z) = (P(z) + Q(z))/2.
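In coefficient terms, forming P(z) and Q(z) from A(z) amounts to adding (or subtracting) a reversed, one-step-shifted copy of the coefficient list, since z^{-(p+1)}A(z^{-1}) simply reverses the coefficients of A(z) within a length-(p+2) array. A minimal sketch under our own conventions (coefficients listed in increasing powers of z^{-1}):

```python
def lsp_polynomials(a):
    """Form P(z) and Q(z) of (6.27) from A(z) = 1 - sum alpha_k z^{-k}.

    a : coefficients of A(z) in powers of z^{-1},
        a = [1, -alpha_1, ..., -alpha_p].
    Returns (P, Q) as coefficient lists of length p + 2.
    """
    rev = a[::-1]          # z^{-(p+1)} A(z^{-1}) reverses the coefficients
    P = [x + y for x, y in zip(a + [0.0], [0.0] + rev)]
    Q = [x - y for x, y in zip(a + [0.0], [0.0] + rev)]
    return P, Q

def lsp_to_a(P, Q):
    """Invert the LSP transformation: A(z) = (P(z) + Q(z))/2."""
    return [(p + q) / 2 for p, q in zip(P, Q)][:-1]
```

By construction P(z) is symmetric and Q(z) antisymmetric, which is why their roots lie on the unit circle (LSP.1); for even p one can also check P(−1) = 0 and Q(1) = 0, as stated in (LSP.2) below.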
An illustration of the LSP representation is given in Figure 6.7, which shows the roots of A(z), P(z), and Q(z) for a 12th-order predictor. This new representation has some very interesting and useful properties that are confirmed by Figure 6.7 and summarized below [52, 119].
(LSP.1) All the roots of P(z) and Q(z) are on the unit circle, i.e., they can be represented as

$$P(z) = \prod_{k=1}^{p+1} \left(1 - e^{j\Omega_k} z^{-1}\right) \qquad (6.28a)$$

$$Q(z) = \prod_{k=1}^{p+1} \left(1 - e^{j\Theta_k} z^{-1}\right). \qquad (6.28b)$$

The normalized discrete-time frequencies (angles in the z-plane), Ω_k and Θ_k, are called the line spectrum frequencies or LSFs. Knowing the p LSFs that lie in the range 0 < ω < π is sufficient to completely define the LSP polynomials and therefore A(z).
(LSP.2) If p is an even integer, then P(−1) = 0 and Q(1) = 0.
(LSP.3) The PARCOR coefficients corresponding to A(z) satisfy |k_i| < 1 if and only if the roots of P(z) and Q(z) alternate on the unit circle, i.e., the LSFs are interlaced over the range of angles 0 to π.
(LSP.4) The LSFs are close together when the roots of A(z)
are close to the unit circle.
These properties make it possible to represent the linear predictor by quantized differences between the successive LSFs [119]. If property (LSP.3) is maintained through the quantization, the reconstructed polynomial A(z) will still have its zeros inside the unit circle.
6.5.3 Cepstrum of Vocal Tract Impulse Response
One of the most useful alternative representations of the linear predictor is the cepstrum of the impulse response, h[n], of the vocal tract filter. The impulse response can be computed recursively through the difference equation:

$$h[n] = \sum_{k=1}^{p} \alpha_k\, h[n-k] + G\,\delta[n], \qquad (6.29)$$
or a closed-form expression for the impulse response can be obtained by making a partial fraction expansion of the system function H(z). As discussed in Section 5.3.3, since H(z) is a minimum-phase system (all its poles inside the unit circle), it follows that the complex cepstrum of h[n] can be computed using (5.14), which would require that the poles of H(z) be found by polynomial rooting. Then, any desired number of cepstrum values can be computed. However, because of the minimum-phase condition, a more direct recursive computation is possible [101]. Equation (5.16) gives the recursion for computing ĥ[n] from the predictor coefficients and gain, and (5.17) gives the inverse recursion for computing the predictor coefficients from the complex cepstrum of the vocal tract impulse response. These relationships are particularly useful in speech recognition applications, where a small number of cepstrum values is used as a feature vector.
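The recursion (6.29) itself is straightforward to realize directly; a short sketch (the function name is our own):

```python
def vocal_tract_impulse_response(alpha, G, n_samples):
    """Impulse response h[n] of H(z) = G / (1 - sum alpha_k z^{-k}),
    computed recursively per (6.29)."""
    p = len(alpha)
    h = []
    for n in range(n_samples):
        val = G if n == 0 else 0.0                 # the G * delta[n] term
        # sum over k = 1..p of alpha_k * h[n - k], for n - k >= 0
        val += sum(alpha[k] * h[n - 1 - k] for k in range(p)
                   if n - 1 - k >= 0)
        h.append(val)
    return h
```

For a one-pole example with α_1 = 0.5 and G = 2, the recursion produces the geometric sequence 2, 1, 0.5, 0.25, ....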
6.5.4 PARCOR Coeﬃcients
We have seen that the PARCOR coefficients are bounded by ±1. This makes them an attractive parameter for quantization.

We have mentioned that if we have a set of PARCOR coefficients, we can simply use them at step (D.2) of the Levinson–Durbin algorithm, thereby obtaining an algorithm for converting PARCORs to predictor coefficients. By working backward through the Levinson–Durbin algorithm, we can compute the PARCOR coefficients from a given set of predictor coefficients. The resulting algorithm is given below.
Predictor-to-PARCOR Algorithm

$$\alpha_j^{(p)} = \alpha_j, \quad j = 1, 2, \dots, p; \qquad k_p = \alpha_p^{(p)} \qquad (P.1)$$

for i = p, p − 1, ..., 2

for j = 1, 2, ..., i − 1

$$\alpha_j^{(i-1)} = \frac{\alpha_j^{(i)} + k_i\, \alpha_{i-j}^{(i)}}{1 - k_i^2} \qquad (P.2)$$

end

$$k_{i-1} = \alpha_{i-1}^{(i-1)} \qquad (P.3)$$

end
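This step-down algorithm can be written compactly; the sketch below uses 0-indexed Python lists for the 1-indexed coefficients, and the function name is our own:

```python
def predictor_to_parcor(alpha):
    """Convert predictor coefficients to PARCOR coefficients,
    steps (P.1)-(P.3) of the Predictor-to-PARCOR algorithm."""
    a = list(alpha)                 # alpha^(p), 0-indexed
    p = len(a)
    ks = [0.0] * p
    ks[p - 1] = a[p - 1]            # (P.1): k_p = alpha_p^(p)
    for i in range(p, 1, -1):
        k = ks[i - 1]
        # (P.2): step down from the order-i to the order-(i-1) predictor
        a = [(a[j] + k * a[i - 2 - j]) / (1 - k * k) for j in range(i - 1)]
        ks[i - 2] = a[i - 2]        # (P.3): k_{i-1} = alpha_{i-1}^{(i-1)}
    return ks
```

As a check, the order-2 predictor with α_1 = 0.4 and α_2 = 0.2 (which the Levinson–Durbin recursion produces from k_1 = 0.5, k_2 = 0.2) maps back to exactly those PARCOR values.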
6.5.5 Log Area Coeﬃcients
As mentioned above, the lattice filter representation of the vocal tract system is strongly related to concatenated acoustic tube models of the physics of sound propagation in the vocal tract. Such models are characterized by a set of tube cross-sectional areas denoted A_i. These "area function" parameters are useful alternative representations of the vocal tract model obtained from linear predictive analysis. Specifically, the log area ratio parameters are defined as

$$g_i = \log\left(\frac{A_{i+1}}{A_i}\right) = \log\left(\frac{1 - k_i}{1 + k_i}\right), \qquad (6.30)$$
where A_i and A_{i+1} are the areas of two successive tubes and the PARCOR coefficient −k_i is the reflection coefficient for sound waves impinging on the junction between the two tubes [3, 78, 101]. The inverse transformation (from g_i to k_i) is

$$k_i = \frac{1 - e^{g_i}}{1 + e^{g_i}}. \qquad (6.31)$$
The PARCOR coefficients can be converted to predictor coefficients, if required, using the technique discussed in Section 6.5.4.
Viswanathan and Makhoul [131] showed that the frequency response
of a vocal tract ﬁlter represented by quantized log area ratio coeﬃcients
is relatively insensitive to quantization errors.
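Equations (6.30) and (6.31) form a simple invertible pair, well defined whenever −1 < k_i < 1. For illustration (hypothetical function names):

```python
import math

def parcor_to_log_area_ratio(k):
    """g_i = log((1 - k_i)/(1 + k_i)), per (6.30); requires -1 < k_i < 1."""
    return [math.log((1 - ki) / (1 + ki)) for ki in k]

def log_area_ratio_to_parcor(g):
    """Inverse transformation (6.31): k_i = (1 - e^{g_i})/(1 + e^{g_i})."""
    return [(1 - math.exp(gi)) / (1 + math.exp(gi)) for gi in g]
```

The mapping stretches the sensitive regions near k_i = ±1 out toward ±∞, which is one intuition for why quantizing g_i is better behaved than quantizing k_i directly.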
6.6 The Role of Linear Prediction
In this chapter, we have discussed many of the fundamentals of linear predictive analysis of speech. The value of linear predictive analysis stems from its ability to represent in a compact form the part of the speech model associated with the vocal tract, which in turn is closely related to the phonemic representation of the speech signal. Over 40 years of research on linear predictive methods have yielded a wealth of knowledge that has been widely and effectively applied in almost every area of speech processing, but especially in speech coding, speech synthesis, and speech recognition. The present chapter and Chapters 3–5 provide the basis for the remaining chapters of this text, which focus on these particular application areas.
7
Digital Speech Coding
This chapter, the first of three on digital speech processing applications, focuses on specific techniques that are used in digital speech coding. We begin by describing the basic operation of sampling a speech signal and directly quantizing and encoding the samples. The remainder of the chapter discusses a wide variety of techniques that represent the speech signal in terms of parametric models of speech production and perception.
7.1 Sampling and Quantization of Speech (PCM)
In any application of digital signal processing (DSP), the first step is sampling the analog signal and quantizing the resulting samples into digital form. These operations, which comprise the process of A-to-D conversion, are depicted for convenience in analysis and discussion in the block diagram of Figure 7.1.^1

^1 Current A-to-D and D-to-A converters use oversampling techniques to implement the functions depicted in Figure 7.1.
Fig. 7.1 The operations of sampling and quantization.
The sampler produces a sequence of numbers x[n] = x_c(nT), where T is the sampling period and f_s = 1/T is the sampling frequency. Theoretically, the samples x[n] are real numbers, which are useful for analysis but never available with unlimited precision for computation. The well-known Shannon sampling theorem states that a bandlimited signal can be reconstructed exactly from samples taken at a rate of at least twice the highest frequency in the input signal spectrum (typically between 7 and 20 kHz for speech and audio signals) [89]. Often, we use a lowpass filter to remove spectral information above some frequency of interest (e.g., 4 kHz) and then use a sampling rate such as 8000 samples/s to avoid aliasing distortion [89].
Figure 7.2 depicts a typical quantizer definition. The quantizer simply takes the real number inputs x[n] and assigns an output x̂[n] according to the nonlinear discrete-output mapping Q{ }.^2 In the case of the example of Figure 7.2, the output samples are mapped to one of eight possible values, with samples within the peak-to-peak range being rounded and samples outside the range being "clipped" to either the maximum positive or negative level. For samples within range, the quantization error, defined as

$$e[n] = \hat x[n] - x[n], \qquad (7.1)$$

^2 Note that in this chapter, the notation x̂[n] denotes the quantized version of x[n], not the complex cepstrum of x[n] as in Chapter 5.
Fig. 7.2 8-level mid-tread quantizer.
satisfies the condition

$$-\Delta/2 < e[n] \le \Delta/2, \qquad (7.2)$$

where Δ is the quantizer step size. A B-bit quantizer such as the one shown in Figure 7.2 has 2^B levels (1 bit usually signals the sign). Therefore, if the peak-to-peak range is 2X_m, the step size will be Δ = 2X_m/2^B. Because the quantization levels are uniformly spaced by Δ, such quantizers are called uniform quantizers. Traditionally, representation of a speech signal by binary-coded quantized samples is called pulse-code modulation (or just PCM) because binary numbers can be represented for transmission as on/off pulse amplitude modulation.
The block marked "encoder" in Figure 7.1 represents the assigning of a binary code word to each quantization level. These code words represent the quantized signal amplitudes, and generally, as in Figure 7.2, these code words are chosen to correspond to some convenient binary number system such that arithmetic can be done on the code words as if they were proportional to the signal samples.^3 In cases where accurate amplitude calibration is desired, the step size could be applied as depicted in the block labeled "Decoder" at the bottom of Figure 7.1.^4 Generally, we do not need to be concerned with the fine distinction between the quantized samples and the coded samples when we have a fixed step size, but this is not the case if the step size is adapted from sample to sample. Note that the combination of the decoder and lowpass reconstruction filter represents the operation of D-to-A conversion [89].
The operation of quantization confronts us with a dilemma. Most signals, speech especially, have a wide dynamic range, i.e., their amplitudes vary greatly between voiced and unvoiced sounds and between speakers. This means that we need a large peak-to-peak range so as to avoid clipping of the loudest sounds. However, for a given number of levels (bits), the step size Δ = 2X_m/2^B is proportional to the peak-to-peak range. Therefore, as the peak-to-peak range increases, (7.2) states that the size of the maximum quantization error grows. Furthermore, for a uniform quantizer, the maximum size of the quantization error is the same whether the signal sample is large or small. For a given peak-to-peak range for a quantizer such as the one in Figure 7.2, we can decrease the quantization error only by adding more levels (bits). This is the fundamental problem of quantization.
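To make the trade-offs concrete, here is a sketch of a B-bit mid-tread uniform quantizer in the style of Figure 7.2. The rounding and clipping conventions are our assumptions for the example (zero is an output level; codes occupy −2^(B−1) to 2^(B−1) − 1).

```python
def uniform_quantize(x, B, Xm):
    """B-bit mid-tread uniform quantizer with peak-to-peak range 2*Xm.

    In-range samples are rounded to the nearest level, spaced by
    Delta = 2*Xm / 2**B, so the error obeys (7.2); out-of-range samples
    are clipped to the extreme levels.
    Returns (xhat, code): the quantized value and its integer code word.
    """
    delta = 2.0 * Xm / 2 ** B
    code = round(x / delta)                      # mid-tread: 0 is a level
    code = max(-(2 ** (B - 1)), min(2 ** (B - 1) - 1, code))  # clip
    return code * delta, code
```

With B = 3 and X_m = 1 this reproduces the 8-level quantizer of Figure 7.2: the step size is 0.25, in-range errors stay within ±Δ/2 = ±0.125, and samples beyond the range saturate at the extreme code words.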
The data rate (measured in bits/second) of sampled and quantized
speech signals is I = B f
s
. The standard values for sampling and quan
tizing sound signals (speech, singing, instrumental music) are B = 16
and f
s
= 44.1 or 48 kHz. This leads to a bitrate of I = 16 44100 =
705, 600 bits/s.
5
This value is more than adequate and much more than
desired for most speech communication applications. The bit rate can
be lowered by using fewer bits/sample and/or using a lower sampling
rate; however, both these simple approaches degrade the perceived
quality of the speech signal. This chapter deals with a wide range of
³ In a real A-to-D converter, the sampler, quantizer, and coder are all integrated into one system, and only the binary code words c[n] are available.
⁴ The primed symbol denotes the possibility of errors in the codewords. Symbol errors would cause additional error in the reconstructed samples.
⁵ Even higher rates (24 bits and 96 kHz sampling rate) are used for high-quality audio.
techniques for significantly reducing the bit rate while maintaining an adequate level of speech quality.
7.1.1 Uniform Quantization Noise Analysis
A more quantitative description of the effect of quantization can be obtained using random signal analysis applied to the quantization error. If the number of bits in the quantizer is reasonably high and no clipping occurs, the quantization error sequence, although it is completely determined by the signal amplitudes, nevertheless behaves as if it is a random signal with the following properties [12]:

(Q.1) The noise samples appear to be⁶ uncorrelated with the signal samples.
(Q.2) Under certain assumptions, notably smooth input probability density functions and high rate quantizers, the noise samples appear to be uncorrelated from sample to sample, i.e., e[n] acts like a white noise sequence.
(Q.3) The amplitudes of noise samples are uniformly distributed across the range −∆/2 < e[n] ≤ ∆/2, resulting in average power σ_e² = ∆²/12.
These simplifying assumptions allow a linear analysis that yields accurate results if the signal is not too coarsely quantized. Under these conditions, it can be shown that if the output levels of the quantizer are optimized, then the quantizer error will be uncorrelated with the quantizer output (although not the quantizer input, as is commonly stated). These results can easily be shown to hold in the simple case of a uniformly distributed memoryless input, and Bennett has shown how the result can be extended to inputs with smooth densities if the bit rate is assumed high [12].

With these assumptions, it is possible to derive the following formula for the signal-to-quantizing-noise ratio (in dB) of a B-bit

⁶ By "appear to be" we mean that measured correlations are small. Bennett has shown that the correlations are only small because the error is small and that, under suitable conditions, the correlations are equal to the negative of the error variance [12]. The term "uncorrelated" often suggests independence, but in this case the error is a deterministic function of the input and hence cannot be independent of it.
uniform quantizer:

SNR_Q = 10 log₁₀(σ_x²/σ_e²) = 6.02B + 4.78 − 20 log₁₀(X_m/σ_x),   (7.3)
where σ_x and σ_e are the rms values of the input signal and quantization noise samples, respectively. The formula of Equation (7.3) is an increasingly good approximation as the bit rate (or the number of quantization levels) gets large; it can be far off, however, if the bit rate is not large [12]. Figure 7.3 shows a comparison of (7.3) with signal-to-quantization-noise ratios measured for speech signals. The measurements were done by quantizing 16-bit samples to 8, 9, and 10 bits. The faint dashed lines are from (7.3) and the dark dashed lines are measured values for uniform quantization. There is good agreement between these graphs, indicating that (7.3) is a reasonable estimate of SNR.
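As a rough numerical check of (7.3), the following sketch (our own illustration, not from the text) quantizes a Laplacian-distributed signal (a common stand-in for speech amplitudes) with a uniform B-bit quantizer and compares the measured SNR with the formula:

```python
import numpy as np

def uniform_quantize(x, B, Xm=1.0):
    """Mid-riser uniform B-bit quantizer over [-Xm, Xm)."""
    delta = 2 * Xm / 2**B
    idx = np.clip(np.floor(x / delta), -2**(B - 1), 2**(B - 1) - 1)
    return (idx + 0.5) * delta

rng = np.random.default_rng(0)
x = rng.laplace(scale=0.05, size=200_000)   # heavy-tailed, speech-like amplitudes
results = {}
for B in (8, 9, 10):
    e = uniform_quantize(x, B) - x
    snr_measured = 10 * np.log10(np.mean(x**2) / np.mean(e**2))
    snr_formula = 6.02 * B + 4.78 - 20 * np.log10(1.0 / np.std(x))  # (7.3), Xm = 1
    results[B] = (snr_measured, snr_formula)
```

With X_m/σ_x ≈ 14 here, little clipping occurs, the two numbers agree to well within a dB, and each added bit buys about 6 dB, consistent with the discussion of Figure 7.3.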
Note that X_m is a fixed parameter of the quantizer, while σ_x depends on the input signal level. As signal level increases, the ratio X_m/σ_x decreases, moving to the left in Figure 7.3. When σ_x gets close to X_m, many samples are clipped, and the assumptions underlying (7.3) no longer hold. This accounts for the precipitous fall in SNR for 1 < X_m/σ_x < 8.
Fig. 7.3 Comparison of µ-law and linear quantization for B = 8, 9, 10 (SNR in dB): Equation (7.3) — light dashed lines. Measured uniform quantization SNR — dark dashed lines. µ-law (µ = 100) compression — solid lines.
Also, it should be noted that (7.3) and Figure 7.3 show that, with all other parameters being fixed, increasing B by 1 bit (doubling the number of quantization levels) increases the SNR by 6 dB. On the other hand, it is also important to note that halving σ_x decreases the SNR by 6 dB. In other words, cutting the signal level in half is like throwing away one bit (half of the levels) of the quantizer. Thus, it is exceedingly important to keep input signal levels as high as possible without clipping.
7.1.2 µ-Law Quantization
Also note that the SNR curves in Figure 7.3 decrease linearly with increasing values of log₁₀(X_m/σ_x). This is because the size of the quantization noise remains the same as the signal level decreases. If the quantization error were proportional to the signal amplitude, the SNR would be constant regardless of the size of the signal. This could be achieved if log(x[n]) were quantized instead of x[n], but this is not possible because the log function is ill-behaved for small values of x[n]. As a compromise, µ-law compression (of the dynamic range) of the signal, defined as

y[n] = X_m [log(1 + µ|x[n]|/X_m) / log(1 + µ)] sign(x[n]),   (7.4)
can be used prior to uniform quantization [118]. Quantization of µ-law compressed signals is often termed log-PCM. If µ is large (> 100), this nonlinear transformation has the effect of distributing the effective quantization levels uniformly for small samples and logarithmically over the remaining range of the input samples. The flat curves in Figure 7.3 show that the signal-to-noise ratio with µ-law compression remains relatively constant over a wide range of input levels.

µ-law quantization for speech is standardized in the CCITT G.711 standard. In particular, 8-bit, µ = 255 log-PCM with f_s = 8000 samples/s is widely used for digital telephony. This is an example of where speech quality is deliberately compromised in order to achieve a much lower bit rate. Once the bandwidth restriction has been imposed, 8-bit log-PCM introduces little or no perceptible distortion.
This configuration is often referred to as "toll quality" because, when it was introduced at the beginning of the digital telephony era, it was perceived to render speech equivalent to speech transmitted over the best long-distance lines. Nowadays, readily available hardware devices convert analog signals directly into binary-coded µ-law samples and also expand compressed samples back to a uniform scale for conversion back to analog signals. Such an A-to-D/D-to-A device is called an encoder/decoder or "codec." If the digital output of a codec is used as input to a speech processing algorithm, it is generally necessary to restore the linearity of the amplitudes through the inverse of (7.4).
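A direct transcription of (7.4) and its inverse can be sketched as follows (our own illustration, assuming µ = 255 and X_m = 1):

```python
import numpy as np

def mu_compress(x, mu=255.0, Xm=1.0):
    """Equation (7.4): logarithmic compression of the dynamic range."""
    return Xm * np.log1p(mu * np.abs(x) / Xm) / np.log1p(mu) * np.sign(x)

def mu_expand(y, mu=255.0, Xm=1.0):
    """Inverse of (7.4), used to restore linear amplitudes after decoding."""
    return (Xm / mu) * np.expm1(np.abs(y) * np.log1p(mu) / Xm) * np.sign(y)

x = np.linspace(-1.0, 1.0, 201)
roundtrip = mu_expand(mu_compress(x))   # recovers x (up to rounding)
```

Uniformly quantizing y[n] to 8 bits and expanding at the receiver is log-PCM in the spirit of G.711; the actual standard uses a segmented piecewise-linear approximation of (7.4) rather than the exact formula.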
7.1.3 Non-Uniform and Adaptive Quantization
µ-law compression is an example of nonuniform quantization. It is based on the intuitive notion of constant percentage error. A more rigorous approach is to design a nonuniform quantizer that minimizes the mean-squared quantization error. To do this analytically, it is necessary to know the probability distribution of the signal sample values, so that the most probable samples, which for speech are the low amplitude samples, will incur less error than the least probable samples. To apply this idea to the design of a nonuniform quantizer for speech requires an assumption of an analytical form for the probability distribution or some algorithmic approach based on measured distributions. The fundamentals of optimum quantization were established by Lloyd [71] and Max [79]. Paez and Glisson [91] gave an algorithm for designing optimum quantizers for assumed Laplace and Gamma probability densities, which are useful approximations to measured distributions for speech. Lloyd [71] gave an algorithm for designing optimum nonuniform quantizers based on sampled speech signals.⁷ Optimum nonuniform quantizers can improve the signal-to-noise ratio by as much as 6 dB over µ-law quantizers with the same number of bits, but with little or no improvement in perceived quality of reproduction.
⁷ Lloyd's work was initially published in a Bell Laboratories Technical Note, with portions of the material having been presented at the Institute of Mathematical Statistics Meeting in Atlantic City, New Jersey in September 1957. Subsequently this pioneering work was published in the open literature in March 1982 [71].
Another way to deal with the wide dynamic range of speech is to let the quantizer step size vary with time. When the quantizer in a PCM system is adaptive, the system is called an adaptive PCM (APCM) system. For example, the step size could satisfy

∆[n] = ∆₀ (E_n̂)^{1/2},   (7.5)

where ∆₀ is a constant and E_n̂ is the short-time energy of the speech signal defined, for example, as in (4.6). Such adaptation of the step size is equivalent to fixed quantization of the signal after division by the rms signal amplitude (E_n̂)^{1/2}. With this definition, the step size will go up and down with the local rms amplitude of the speech signal. The adaptation speed can be adjusted by varying the analysis window size. If the step size control is based on the unquantized signal samples, the quantizer is called a feed-forward adaptive quantizer, and the step size information must be transmitted or stored along with the quantized sample, thereby adding to the information rate. To amortize this overhead, feed-forward quantization is usually applied to blocks of speech samples. On the other hand, if the step size control is based on past quantized samples, the quantizer is a feedback adaptive quantizer. Jayant [57] studied a class of feedback adaptive quantizers where the step size ∆[n] is a function of the step size for the previous sample, ∆[n − 1]. In this approach, if the previous sample is quantized using step size ∆[n − 1] to one of the low quantization levels, then the step size is decreased for the next sample. On the other hand, the step size is increased if the previous sample was quantized in one of the highest levels. By basing the adaptation on the previous sample, it is not necessary to transmit the step size; it can be derived at the decoder by the same algorithm as used for encoding.
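A minimal sketch of a Jayant-style feedback adaptive quantizer follows (a 2-bit illustration with step multipliers of our own choosing, not values from [57]). The point to notice is that the decoder regenerates the step-size sequence from the code words alone:

```python
import numpy as np

MULT = {0: 0.9, 1: 1.6}   # multipliers: shrink after inner level, grow after outer

def jayant_encode(x, d0=0.02, dmin=1e-4, dmax=1.0):
    """2-bit quantizer with levels ±0.5∆[n], ±1.5∆[n]; ∆ adapts from past codes only."""
    d, codes = d0, []
    for s in x:
        mag = 0 if abs(s) < d else 1            # inner/outer magnitude bit
        codes.append((1 if s < 0 else 0, mag))  # (sign bit, magnitude bit)
        d = min(max(d * MULT[mag], dmin), dmax)
    return codes

def jayant_decode(codes, d0=0.02, dmin=1e-4, dmax=1.0):
    """Reconstruct with no transmitted step size: rerun the same adaptation."""
    d, y = d0, []
    for sign, mag in codes:
        amp = (0.5 if mag == 0 else 1.5) * d
        y.append(-amp if sign else amp)
        d = min(max(d * MULT[mag], dmin), dmax)
    return np.array(y)

x = 0.3 * np.sin(2 * np.pi * np.arange(2000) / 50)
y = jayant_decode(jayant_encode(x))   # tracks x despite only 2 bits/sample
```

Because both ends apply the same step-size rule to the same code stream, the step sequences stay in lockstep, which is exactly why no side information is needed.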
Adaptive quantizers can achieve about 6 dB improvement (equivalent to adding one bit) over fixed quantizers [57, 58].
7.2 Digital Speech Coding
As we have seen, the data rate of sampled and quantized speech is B·f_s. Virtually perfect perceptual quality is achievable with a high data rate.⁸ By reducing B or f_s, the bit rate can be reduced, but perceptual quality may suffer. Reducing f_s requires bandwidth reduction, and reducing B too much will introduce audible distortion that may resemble random noise. The main objective in digital speech coding (sometimes called speech compression) is to lower the bit rate while maintaining an adequate level of perceptual fidelity. In addition to the two dimensions of quality and bit rate, the complexity (in terms of digital computation) is often of equal concern.
Since the 1930s, engineers and speech scientists have worked toward the ultimate goal of achieving more efficient representations. This work has led to a basic approach in speech coding that is based on the use of DSP techniques and the incorporation of knowledge of speech production and perception into the quantization process. There are many ways that this can be done. Traditionally, speech coding methods have been classified according to whether they attempt to preserve the waveform of the speech signal or whether they only seek to maintain an acceptable level of perceptual quality and intelligibility. The former are generally called waveform coders, and the latter model-based coders. Model-based coders are designed for obtaining efficient digital representations of speech and only speech. Straightforward sampling and uniform quantization (PCM) is perhaps the only pure waveform coder. Nonuniform quantization and adaptive quantization are simple attempts to build in properties of the speech signal (time-varying amplitude distribution) and speech perception (quantization noise is masked by loud sounds), but these extensions are still aimed at preserving the waveform.
Most modern model-based coding methods are based on the notion that the speech signal can be represented in terms of an excitation signal and a time-varying vocal tract system. These two components are quantized, and then a "quantized" speech signal can be reconstructed by exciting the quantized filter with the quantized excitation. Figure 7.4 illustrates why linear predictive analysis can be fruitfully applied for model-based coding. The upper plot is a segment of a speech signal.

⁸ If the signal that is reconstructed from the digital representation is not perceptibly different from the original analog speech signal, the digital representation is often referred to as being "transparent."
Fig. 7.4 Speech waveform (upper plot) and corresponding prediction error sequence (lower plot) for a 12th-order predictor; horizontal axis in sample index.
The lower plot is the output of a linear prediction error filter A(z) (with p = 12) that was derived from the given segment by the techniques of Chapter 6. Note that the prediction error sequence amplitude in Figure 7.4 is a factor of five lower than that of the signal itself, which means that for a fixed number of bits, this segment of the prediction error could be quantized with a smaller step size than the waveform itself. If the quantized prediction error (residual) segment were used as input to the corresponding vocal tract filter H(z) = 1/A(z), an approximation to the original segment of speech would be obtained.⁹ While direct implementation of this idea does not lead to a practical method of speech coding, it is nevertheless suggestive of more practical schemes, which we will discuss.
Today, the line between waveform coders and model-based coders is not distinct. A more useful classification of speech coders focuses on how the speech production and perception models are incorporated into the quantization process. One class of systems simply attempts to extract an excitation signal and a vocal tract system from the speech signal without any attempt to preserve a relationship between the waveforms of the original and the quantized speech. Such systems are attractive because they can be implemented with modest computation.

⁹ To implement this time-varying inverse filtering/reconstruction requires care in fitting the segments of the signals together at the block boundaries. Overlap-add methods [89] can be used effectively for this purpose.
These coders, which we designate as open-loop coders, are also called vocoders (voice coders), since they are based on the principles established by H. Dudley early in the history of speech processing research [30]. A second class of coders employs the source/system model for speech production inside a feedback loop, and these are thus called closed-loop coders. They compare the quantized output to the original input and attempt to minimize the difference between the two in some prescribed sense. Differential PCM systems are simple examples of this class, but increased availability of computational power has made it possible to implement much more sophisticated closed-loop systems called analysis-by-synthesis coders. Closed-loop systems, since they explicitly attempt to minimize a time-domain distortion measure, often do a good job of preserving the speech waveform while employing many of the same techniques used in open-loop systems.
7.3 Closed-Loop Coders
7.3.1 Predictive Coding
The essential features of predictive coding of speech were set forth in a classic paper by Atal and Schroeder [5], although the basic principles of predictive quantization were introduced by Cutler [26]. Figure 7.5 shows a general block diagram of a large class of speech coding systems that are called adaptive differential PCM (ADPCM) systems. These systems are generally classified as waveform coders, but we prefer to emphasize that they are closed-loop, model-based systems that also preserve the waveform. The reader should initially ignore the blocks concerned with adaptation and all the dotted control lines, and focus instead on the core DPCM system, which is comprised of a feedback structure that includes the blocks labeled Q and P. The quantizer Q can have a number of levels ranging from 2 to much higher, it can be uniform or nonuniform, and it can be adaptive or not, but irrespective of the type of quantizer, the quantized output can be expressed as d̂[n] = d[n] + e[n], where e[n] is the quantization error. The block labeled P is a linear predictor, so

x̃[n] = Σ_{k=1}^{p} α_k x̂[n − k],   (7.6)
Fig. 7.5 General block diagram for adaptive differential PCM (ADPCM).

i.e., the signal x̃[n] is predicted based on p past samples of the signal x̂[n].¹⁰ The signal d[n] = x[n] − x̃[n], the difference between the input and the predicted signal, is the input to the quantizer. Finally, the

¹⁰ In a feed-forward adaptive predictor, the predictor parameters are estimated from the input signal x[n], although, as shown in Figure 7.5, they can also be estimated from past samples of the reconstructed signal x̂[n]. The latter would be termed a feedback adaptive predictor.
relationship between x̂[n] and d̂[n] is

x̂[n] = Σ_{k=1}^{p} α_k x̂[n − k] + d̂[n],   (7.7)

i.e., the signal x̂[n] is the output of what we have called the "vocal tract filter," and d̂[n] is the quantized excitation signal. Finally, since x̂[n] = x̃[n] + d̂[n], it follows that

x̂[n] = x[n] + e[n].   (7.8)

This is the key result for DPCM systems. It says that no matter what predictor is used in Figure 7.5, the quantization error in x̂[n] is identical to the quantization error in the quantized excitation signal d̂[n]. If the prediction is good, then the variance of d[n] will be less than the variance of x[n], so it will be possible to use a smaller step size and therefore to reconstruct x̂[n] with less error than if x[n] were quantized directly. The feedback structure of Figure 7.5 contains within it a source/system speech model, which, as shown in Figure 7.5(b), is the system needed to reconstruct the quantized speech signal x̂[n] from the coded difference signal input.
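The identity (7.8) is easy to verify in code. The sketch below (our illustration, with a synthetic correlated signal standing in for speech) runs the closed loop of Figure 7.5 with a fixed first-order predictor and a fixed uniform quantizer:

```python
import numpy as np

def dpcm_encode_decode(x, alphas, quantize):
    """Core DPCM loop of Fig. 7.5: predict from past *reconstructed* samples."""
    p = len(alphas)
    xhat = np.zeros(len(x) + p)      # reconstructed signal, zero initial history
    d = np.zeros(len(x))             # unquantized difference signal
    for n in range(len(x)):
        pred = sum(alphas[k] * xhat[p + n - 1 - k] for k in range(p))
        d[n] = x[n] - pred
        xhat[p + n] = pred + quantize(d[n])
    return xhat[p:], d

rng = np.random.default_rng(1)
x = np.convolve(rng.standard_normal(5000), np.ones(8) / 8.0, mode="same")
quantize = lambda v: 0.01 * np.round(v / 0.01)     # fixed uniform quantizer
xr, d = dpcm_encode_decode(x, [0.9], quantize)
recon_error = xr - x                # error in the reconstructed signal
quant_error = quantize(d) - d       # e[n], quantization error of d[n]
```

The two error sequences coincide, as (7.8) asserts, and the variance of d[n] comes out well below that of x[n], which is the prediction gain exploited below.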
From (7.8), the signal-to-noise ratio of the ADPCM system is SNR = σ_x²/σ_e². A simple, but informative, modification leads to

SNR = (σ_x²/σ_d²)(σ_d²/σ_e²) = G_P · SNR_Q,   (7.9)

where the obvious definitions apply.

SNR_Q is the signal-to-noise ratio of the quantizer, which, for fine quantization (i.e., a large number of quantization levels with small step size), is given by (7.3). The number of bits B determines the bit rate of the coded difference signal. As shown in Figure 7.5, the quantizer can be either feed-forward or feedback-adaptive. As indicated, if the quantizer is feed-forward adaptive, step size information must be part of the digital representation.
The quantity G_P is called the prediction gain, i.e., if G_P > 1, it represents an improvement gained by placing a predictor based on the speech model inside the feedback loop around the quantizer. Neglecting
the correlation between the signal and the quantization error, it can be shown that

G_P = σ_x²/σ_d² = 1 / (1 − Σ_{k=1}^{p} α_k ρ[k]),   (7.10)

where ρ[k] = φ[k]/φ[0] is the normalized autocorrelation function used to compute the optimum predictor coefficients. The predictor can be either fixed or adaptive, and adaptive predictors can be either feed-forward or feedback (derived by analyzing past samples of the quantized speech signal reconstructed as part of the coder). Fixed predictors can be designed based on long-term average correlation functions. The lower plot in Figure 7.6 shows an estimated long-term autocorrelation function for speech sampled at an 8 kHz sampling rate. The upper plot shows 10 log₁₀ G_P computed from (7.10) as a function of predictor order p. Note that a first-order fixed predictor can yield about 6 dB prediction gain, so that either the quantizer can have 1 bit less or the
Fig. 7.6 Long-term autocorrelation function ρ[m] (lower plot, vs. autocorrelation lag) and corresponding prediction gain G_P in dB (upper plot, vs. predictor order p).
reconstructed signal can have a 6 dB higher SNR. Higher-order fixed predictors can achieve about 4 dB more prediction gain, and adapting the predictor at a phoneme rate can produce an additional 4 dB of gain [84]. Great flexibility is inherent in the block diagram of Figure 7.5, and not surprisingly, many systems based upon the basic principle of differential quantization have been proposed, studied, and implemented as standard systems. Here we can only mention a few of the most important.
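Equation (7.10), with the optimum α_k obtained by solving the normal equations of Chapter 6, can be evaluated directly from a measured autocorrelation. The sketch below is our own illustration; a low-pass synthetic signal stands in for the long-term speech statistics of Figure 7.6:

```python
import numpy as np

def prediction_gain_db(rho, p):
    """10 log10 G_P from (7.10), with the optimum order-p predictor
    found by solving the normal equations for the given rho[k]."""
    R = np.array([[rho[abs(i - j)] for j in range(p)] for i in range(p)])
    r = np.array(rho[1:p + 1])
    alpha = np.linalg.solve(R, r)          # optimum predictor coefficients
    return 10 * np.log10(1.0 / (1.0 - alpha @ r))

rng = np.random.default_rng(2)
x = np.convolve(rng.standard_normal(4000), np.ones(4) / 4.0, mode="full")
phi = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(13)])
rho = phi / phi[0]                          # normalized autocorrelation, lags 0..12
gains = [prediction_gain_db(rho, p) for p in (1, 2, 4, 12)]
```

The gain is non-decreasing in p, and for p = 1 it reduces to the closed form 1/(1 − ρ²[1]) used in the delta modulation discussion that follows.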
7.3.2 Delta Modulation
Delta modulation (DM) systems are the simplest differential coding systems, since they use a 1-bit quantizer and usually only a first-order predictor. DM systems originated in the classic work of Cutler [26] and de Jager [28]. While DM systems evolved somewhat independently of the more general differential coding methods, they nevertheless fit neatly into the general theory of predictive coding. To see how such systems can work, note that the optimum predictor coefficient for a first-order predictor is α₁ = ρ[1], so from (7.10) it follows that the prediction gain for a first-order predictor is

G_P = 1/(1 − ρ²[1]).   (7.11)
Furthermore, note the nature of the long-term autocorrelation function in Figure 7.6. This correlation function is for f_s = 8000 samples/s. If the same bandwidth speech signal were oversampled at a sampling rate higher than 8000 samples/s, we could expect that the correlation value ρ[1] would lie on a smooth curve interpolating between samples 0 and 1 in Figure 7.6, i.e., as f_s gets large, both ρ[1] and α₁ approach unity and G_P gets large. Thus, even though a 1-bit quantizer has a very low SNR, this can be compensated by the prediction gain due to oversampling.
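The effect of oversampling on ρ[1], and hence on the first-order prediction gain (7.11), can be illustrated with a simple bandlimited test signal (a sketch of our own; a single 200 Hz tone is used purely for convenience):

```python
import numpy as np

def rho1(x):
    """Normalized autocorrelation at lag 1."""
    x = x - x.mean()
    return float(np.dot(x[1:], x[:-1]) / np.dot(x, x))

gp = {}
for fs in (8000, 16000, 32000):
    t = np.arange(fs) / fs                       # one second of signal
    x = np.sin(2 * np.pi * 200.0 * t)
    r1 = rho1(x)                                 # approaches 1 as fs grows
    gp[fs] = 10 * np.log10(1.0 / (1.0 - r1**2))  # (7.11), in dB
```

Each doubling of the sampling rate pushes ρ[1] closer to unity and adds several dB of prediction gain, which is what makes a 1-bit quantizer viable.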
Delta modulation systems can be implemented with very simple hardware, and in contrast to more general predictive coding systems, they are not generally implemented with block processing. Instead, adaptation algorithms are usually based on the bit stream at the output of the 1-bit quantizer. The simplest delta modulators use a very high sampling rate with a fixed predictor. Their bit rate, being equal to f_s, must be very high to achieve good quality reproduction. This limitation can be mitigated, with only a modest increase in complexity, by using an adaptive 1-bit quantizer [47, 56]. For example, Jayant [56] showed that for bandlimited (3 kHz) speech, adaptive delta modulation (ADM) with a bit rate (sampling rate) of 40 kbits/s has about the same SNR as 6-bit log-PCM sampled at 6.6 kHz. Furthermore, he showed that doubling the sampling rate of his ADM system improved the SNR by about 10 dB.
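The simplest fixed (non-adaptive) delta modulator mentioned above is just the DPCM loop with a 1-bit quantizer and α₁ = 1. A minimal sketch of our own, on a heavily oversampled signal:

```python
import numpy as np

def delta_modulate(x, step):
    """1-bit DPCM: transmit only sign bits of the prediction difference."""
    xhat_prev, bits = 0.0, []
    for s in x:
        b = 1 if s >= xhat_prev else 0       # 1-bit quantizer on d[n]
        bits.append(b)
        xhat_prev += step if b else -step    # first-order predictor, alpha = 1
    return bits

def delta_demodulate(bits, step):
    """Reconstruct by accumulating the ±step sequence."""
    y, acc = [], 0.0
    for b in bits:
        acc += step if b else -step
        y.append(acc)
    return np.array(y)

x = np.sin(2 * np.pi * np.arange(4000) / 3200)   # slow relative to the sample rate
y = delta_demodulate(delta_modulate(x, step=0.01), step=0.01)
```

As long as the signal's per-sample slope stays below the step size, the staircase output tracks the input to within a couple of steps; a faster signal (or a smaller step) would produce slope-overload distortion, which is precisely what the adaptive step of ADM combats.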
Adaptive delta modulation has the following advantages: it is very simple to implement, only bit synchronization is required in transmission, it has very low coding delay, and it can be adjusted to be robust to channel errors. For these reasons, ADM is still used for terminal-cost-sensitive digital transmission applications.
7.3.3 Adaptive Differential PCM Systems
Differential quantization with a multi-bit quantizer is called differential PCM (DPCM). As depicted in Figure 7.5, adaptive differential PCM systems can have any combination of adaptive or fixed quantizers and/or predictors. Generally, the operations depicted in Figure 7.5 are implemented on short blocks of input signal samples, which introduces delay. A careful comparison of a variety of ADPCM systems by Noll [84] gives a valuable perspective on the relative contributions of the quantizer and predictor. Noll compared log-PCM, adaptive PCM, fixed DPCM, and three ADPCM systems, all using quantizers of 2, 3, 4, and 5 bits with 8 kHz sampling rate. For all bit rates (16, 24, 32, and 40 kbps), the results were as follows¹¹:

(1) Log-PCM had the lowest SNR.
(2) Adapting the quantizer (APCM) improved SNR by 6 dB.
(3) Adding first-order fixed or adaptive prediction improved the SNR by about 4 dB over APCM.
¹¹ The bit rates mentioned do not include overhead information for the quantized prediction coefficients and quantizer step sizes. See Section 7.3.3.1 for a discussion of this issue.
(4) A fourth-order adaptive predictor added about 4 dB, and increasing the predictor order to 12 added only 2 dB more.

The superior performance of ADPCM relative to log-PCM and its relatively low computational demands have led to several standard versions (ITU G.721, G.726, G.727), which are often operated at 32 kbps, where quality is superior to log-PCM (ITU G.711) at 64 kbps.
The classic paper by Atal and Schroeder [5] contained a number of ideas that have since been applied with great success in adaptive predictive coding of speech. One of these concerns the quantization noise introduced by ADPCM, which we have seen is simply added to the input signal in the reconstruction process. If we invoke the white noise approximation for quantization error, it is clear that the noise will be most prominent in the speech at high frequencies, where speech spectrum amplitudes are low. A simple solution to this problem is to pre-emphasize the speech signal before coding, i.e., filter the speech with a linear filter that boosts the high-frequency (HF) part of the spectrum. After reconstruction of the pre-emphasized speech, it can be filtered with the inverse system to restore the spectral balance, and in the process, the noise spectrum will take on the shape of the de-emphasis filter spectrum. A simple pre-emphasis system of this type has system function (1 − γz^{-1}), where γ is less than one.¹² A more sophisticated application of this idea is to replace the simple fixed pre-emphasis filter with a time-varying filter designed to shape the quantization noise spectrum. An equivalent approach is to define a new feedback coding system where the quantization noise is computed as the difference between the input and output of the quantizer and then shaped before adding to the prediction residual. All these approaches raise the noise spectrum at low frequencies and lower it at high frequencies, thus taking advantage of the masking effects of the prominent low frequencies in speech [2].
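The first-difference pre-emphasis filter and its inverse are one-liners. The sketch below (our own illustration, using the γ = 0.4 value suggested in the footnote) verifies that de-emphasis undoes pre-emphasis exactly:

```python
import numpy as np

def preemphasize(x, gamma=0.4):
    """FIR pre-emphasis: y[n] = x[n] - gamma*x[n-1], i.e. H(z) = 1 - gamma z^-1."""
    y = x.copy()
    y[1:] -= gamma * x[:-1]
    return y

def deemphasize(y, gamma=0.4):
    """IIR de-emphasis 1/(1 - gamma z^-1): x[n] = y[n] + gamma*x[n-1].
    Noise added between the two filters emerges low-pass shaped, which is
    the spectral tilt exploited for masking."""
    x = np.zeros_like(y)
    prev = 0.0
    for n in range(len(y)):
        prev = y[n] + gamma * prev
        x[n] = prev
    return x

sig = np.sin(0.1 * np.arange(500))
restored = deemphasize(preemphasize(sig))   # recovers sig
```

White quantization noise injected after pre-emphasis passes only through the de-emphasis stage, so at the output it is attenuated at high frequencies at the cost of a boost at low frequencies, where the loud speech components mask it.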
Another idea that has been applied effectively in ADPCM systems as well as analysis-by-synthesis systems is to include a long-delay predictor to capture the periodicity as well as the short-delay correlation

¹² Values around γ = 0.4 for f_s = 8 kHz work best. If γ is too close to one, the low-frequency noise is emphasized too much.
inherent in voiced speech. Including the simplest long-delay predictor would change (7.6) to

x̃[n] = β x̂[n − M] + Σ_{k=1}^{p} α_k (x̂[n − k] − β x̂[n − k − M]).   (7.12)

The parameter M is essentially the pitch period (in samples) of voiced speech, and the parameter β accounts for amplitude variations between periods. The predictor parameters in (7.12) cannot be jointly optimized. One suboptimal approach is to estimate β and M first, by determining the M that maximizes the correlation around the expected pitch period and setting β equal to the normalized correlation at M. Then the short-delay predictor parameters α_k are estimated from the output of the long-delay prediction error filter.¹³ When this type of predictor is used, the "vocal tract system" used in the decoder would have system function

H(z) = [1 / (1 − Σ_{k=1}^{p} α_k z^{-k})] [1 / (1 − β z^{-M})].   (7.13)

Adding long-delay prediction takes more information out of the speech signal and encodes it in the predictor. Even more complicated predictors with multiple delays and gains have been found to improve the performance of ADPCM systems significantly [2].
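The two-step estimation of M and β described above can be sketched as follows (our own illustration on a synthetic "voiced" signal; the search range of 40–120 samples corresponds to pitch periods of roughly 67–200 Hz at 8 kHz sampling):

```python
import numpy as np

def long_delay_predictor(x, min_lag=40, max_lag=120):
    """Step 1: pick M maximizing the normalized correlation;
    step 2: set beta to the regression gain at that lag."""
    best_M, best_c = min_lag, -np.inf
    for M in range(min_lag, max_lag + 1):
        a, b = x[M:], x[:-M]
        c = np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b))
        if c > best_c:
            best_M, best_c = M, c
    a, b = x[best_M:], x[:-best_M]
    beta = np.dot(a, b) / np.dot(b, b)
    return best_M, beta

# Synthetic voiced-like signal: impulse train through a decaying exponential
n = np.arange(800)
excitation = (n % 80 == 0).astype(float)     # pitch period of 80 samples
x = np.convolve(excitation, 0.9 ** np.arange(50))[:800]
M, beta = long_delay_predictor(x)            # recovers M = 80, beta near 1
```

On real speech the correlation peak is less clean, and, per the footnote, the analysis would be run on the input speech with the resulting parameters applied to the reconstructed signal.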
7.3.3.1 Coding the ADPCM Parameters
A glance at Figure 7.5 shows that if the quantizer and predictor are fixed or feedback-adaptive, then only the coded difference signal is needed to reconstruct the speech signal (assuming that the decoder has the same coefficients and feedback-adaptation algorithms). Otherwise, the predictor coefficients and quantizer step size must be included as auxiliary data (side information) along with the quantized difference

¹³ Normally, this analysis would be performed on the input speech signal, with the resulting predictor coefficients applied to the reconstructed signal as in (7.12).
signal. The difference signal will have the same sampling rate as the input signal. The step size and predictor parameters will be estimated and changed at a much lower rate, e.g., 50–100 times/s. The total bit rate will be the sum of all the bit rates.
Predictor Quantization. The predictor parameters must be quantized for efficient digital representation of the speech signal. As discussed in Chapter 6, the short-delay predictors can be represented in many equivalent ways, almost all of which are preferable to direct quantization of the predictor coefficients themselves. The PARCOR coefficients can be quantized with a nonuniform quantizer, or transformed with an inverse sine or hyperbolic tangent function to flatten their statistical distribution and then quantized with a fixed uniform quantizer. Each coefficient can be allocated a number of bits contingent on its importance in accurately representing the speech spectrum [131].
Atal [2] reported a system which used a 20th-order predictor and quantized the resulting set of PARCOR coefficients after transformation with an inverse sine function. The number of bits per coefficient ranged from 5 for each of the lowest two PARCORs down to 1 each for the six highest-order PARCORs, yielding a total of 40 bits per frame. If the parameters are updated 100 times/s, then a total of 4000 bps is required for the short-delay predictor information. To save bit rate, it is possible to update the short-delay predictor 50 times/s for a total bit rate contribution of 2000 bps [2]. The long-delay predictor has a delay parameter M, which requires 7 bits to cover the range of pitch periods to be expected with an 8 kHz sampling rate of the input. The long-delay predictor in the system mentioned above used delays of M − 1, M, M + 1 and three gains, each quantized to 4- or 5-bit accuracy. This gave a total of 20 bits/frame and added 2000 bps to the overall bit rate.
Vector Quantization for Predictor Quantization. Another approach to quantizing the predictor parameters is called vector quantization, or VQ [45]. The basic principle of vector quantization is depicted in Figure 7.7. VQ can be applied in any context where a set of parameters naturally groups together into a vector (in the sense of
Fig. 7.7 Vector quantization. The encoder compares the input vector v against a codebook of 2^B vectors under a distortion measure and transmits the index i of the closest codeword v̂_i; the decoder recovers v̂_i by table lookup in an identical codebook.
linear algebra).¹⁴ For example, the predictor can be represented by vectors of predictor coefficients, PARCOR coefficients, cepstrum values, log area ratios, or line spectrum frequencies. In VQ, a vector, v, to be quantized is compared exhaustively to a "codebook" populated with representative vectors of that type. The index i of the closest vector v̂_i to v, according to a prescribed distortion measure, is returned from the exhaustive search. This index i then represents the quantized vector in the sense that if the codebook is known, the corresponding quantized vector v̂_i can be looked up. If the codebook has 2^B entries, the index i can be represented by a B-bit number.

The design of a vector quantizer requires a training set consisting of a large number (L) of examples of vectors that are drawn from the same distribution as vectors to be quantized later, along with a distortion measure for comparing two vectors. Using an iterative process, 2^B codebook vectors are formed from the training set of vectors. The "training phase" is computationally intense, but need only be done once, and it is facilitated by the LBG algorithm [70]. One advantage of VQ is that the distortion measure can be designed to sort the vectors into perceptually relevant prototype vectors. The resulting codebook is then used to quantize general test vectors. Typically a VQ training set consists of at least 10 times the ultimate number of codebook vectors, i.e., L ≥ 10 · 2^B.
14. Independent quantization of individual speech samples or individual predictor coefficients is called scalar quantization.
The quantization process is also time consuming because a given vector must be systematically compared with all the codebook vectors to determine which one it is closest to. This has led to numerous innovations in codebook structure that speed up the lookup process [45].
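The exhaustive search and table-lookup decoding described above can be sketched as follows. The toy two-dimensional codebook and the squared-error distortion measure are illustrative assumptions, not values from the text:

```python
import numpy as np

def vq_encode(v, codebook):
    """Return the index i of the codebook vector closest to v
    under a squared-error distortion measure (exhaustive search)."""
    distortions = np.sum((codebook - v) ** 2, axis=1)
    return int(np.argmin(distortions))

def vq_decode(i, codebook):
    """Table lookup: recover the quantized vector v_hat_i from its index."""
    return codebook[i]

# Toy codebook with 2^B = 4 two-dimensional vectors (B = 2 bits/vector).
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
v = np.array([0.9, 0.1])
i = vq_encode(v, codebook)       # only the B-bit index is transmitted
v_hat = vq_decode(i, codebook)   # decoder lookup in an identical codebook
```

Only the index travels over the channel, so the rate is B bits per vector regardless of the vector dimension.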
Difference Signal Quantization. It is necessary to code the difference signal in ADPCM. Since the difference signal has the same sampling rate as the input signal, it is desirable to use as few bits as possible for the quantizer. If straightforward B-bit uniform quantization is used, the additional bit rate would be B · f_s. This was the approach used in earlier studies by Noll [84] as mentioned in Section 7.3.3. In order to reduce the bit rate for coding the difference samples, some sort of block coding must be applied. One approach is to design a multi-bit quantizer that only operates on the largest samples of the difference signal. Atal and Schroeder [2, 7] proposed to precede the quantizer by a "center clipper," which is a system that sets to zero all samples whose amplitudes fall within a threshold band and passes those samples above the threshold unchanged. By adjusting the threshold, the number of zero samples can be controlled. For high thresholds, long runs of zero samples result, and the entropy of the quantized difference signal falls below one bit/sample. The blocks of zero samples can be encoded efficiently using variable-length-to-block codes, yielding an average on the order of 0.7 bits/sample [2]. For a sampling rate of 8 kHz, this coding method needs a total of 5600 bps for the difference signal. When this bit rate of the differential coder is added to the bit rate for quantizing the predictor information (≈4000 bps), the total comes to approximately 10,000 bps.
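A minimal sketch of the center clipper described above; the threshold and sample values are hypothetical:

```python
import numpy as np

def center_clip(d, threshold):
    """Zero all samples whose magnitude falls within the threshold band;
    pass samples above the threshold unchanged."""
    return np.where(np.abs(d) > threshold, d, 0.0)

d = np.array([0.05, -0.3, 0.8, -0.02, 1.2, 0.1])
clipped = center_clip(d, threshold=0.25)
zero_fraction = np.mean(clipped == 0.0)  # grows as the threshold is raised
```

Raising the threshold lengthens the runs of zeros, which is what makes the run-length (variable-length-to-block) coding effective.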
7.3.3.2 Quality vs. Bit Rate for ADPCM Coders

ADPCM systems normally operate with an 8 kHz sampling rate, which is a compromise aimed at providing adequate intelligibility and moderate bit rate for voice communication. With this sampling rate, the speech signal must be bandlimited to somewhat less than 4 kHz by a lowpass filter. Adaptive prediction and quantization can easily lower the bit rate to 32 kbps with no degradation in perceived quality (with 64 kbps log-PCM toll quality as a reference). Of course, raising the bit rate for ADPCM above 64 kbps cannot improve the quality over log-PCM because of the inherent frequency limitation. However, the bit rate can be lowered below 32 kbps with only modest distortion down to about 10 kbps. Below this value, quality degrades significantly. In order to achieve near-toll-quality at low bit rates, all the techniques discussed above and more must be brought into play. This increases the system complexity, although not unreasonably so for current DSP hardware. Thus, ADPCM is an attractive coding method when toll quality is required at modest cost, and where adequate transmission and/or storage capacity is available to support bit rates on the order of 10 kbps.
7.3.4 Analysis-by-Synthesis Coding

While ADPCM coding can produce excellent results at moderate bit rates, its performance is fundamentally constrained by the fact that the difference signal has the same sampling rate as the input signal. The center clipping quantizer produces a difference signal that can be coded efficiently. However, this approach is clearly not optimal, since the center clipper throws away information in order to obtain a sparse sequence. What is needed is a way of creating an excitation signal for the vocal tract filter that is both efficient to code and also produces decoded speech of high quality. This can be done within the same closed-loop framework as ADPCM, but with some significant modifications.
7.3.4.1 Basic Analysis-by-Synthesis Coding System

Figure 7.8 shows the block diagram of another class of closed-loop digital speech coders. These systems are called "analysis-by-synthesis coders" because the excitation is built up using an iterative process to produce a "synthetic" vocal tract filter output x̂[n] that matches the input speech signal according to a perceptually weighted error criterion. As in the case of the most sophisticated ADPCM systems, the operations of Figure 7.8 are carried out on blocks of speech samples. In particular, the difference, d[n], between the input, x[n], and the output of the vocal tract filter, x̂[n], is filtered with a linear filter called the
Fig. 7.8 Structure of analysis-by-synthesis speech coders. (The excitation generator drives the vocal tract filter H(z) to produce x̂[n]; the difference d[n] = x[n] − x̂[n] is passed through the perceptual weighting W(z) to form d′[n], which guides the selection of the excitation d̂[n].)
perceptual weighting filter, W(z). As the first step in coding a block of speech samples, both the vocal tract filter and the perceptual weighting filter are derived from a linear predictive analysis of the block. Then, the excitation signal is determined from the perceptually weighted difference signal d′[n] by an algorithm represented by the block labeled "Excitation Generator."
Note the similarity of Figure 7.8 to the core ADPCM diagram of Figure 7.5. The perceptual weighting and excitation generator inside the dotted box play the role played by the quantizer in ADPCM, where an adaptive quantization algorithm operates on d[n] to produce a quantized difference signal d̂[n], which is the input to the vocal tract system. In ADPCM, the vocal tract model is in the same position in the closed-loop system, but instead of the synthetic output x̂[n], a signal x̃[n] predicted from x̂[n] is subtracted from the input to form the difference signal. This is a key difference. In ADPCM, the synthetic output differs from the input x[n] by the quantization error. In analysis-by-synthesis, x̂[n] = x[n] − d[n], i.e., the reconstruction error is −d[n], and a perceptually weighted version of that error is minimized in the mean-squared sense by the selection of the excitation d̂[n].
7.3.4.2 Perceptual Weighting of the Difference Signal

Since −d[n] is the error in the reconstructed signal, it is desirable to shape its spectrum to take advantage of perceptual masking effects. In ADPCM, this is accomplished by quantization noise feedback or preemphasis/deemphasis filtering. In analysis-by-synthesis coding, the input to the vocal tract filter is determined so as to minimize the perceptually weighted error d′[n]. The weighting is implemented by linear filtering, i.e., d′[n] = d[n] ∗ w[n], with the weighting filter usually defined in terms of the vocal tract filter as the linear system with system function
W(z) = A(z/α1)/A(z/α2) = H(z/α2)/H(z/α1).  (7.14)
The poles of W(z) lie at the same angles but at α2 times the radii of the poles of H(z), and the zeros of W(z) lie at the same angles but at α1 times the radii of the poles of H(z). If α1 > α2, the frequency response is like a "controlled" inverse filter for H(z), which is the shape desired. Figure 7.9 shows the frequency response of such a filter, where typical values of α1 = 0.9 and α2 = 0.4 are used in (7.14). Clearly, this filter tends to emphasize the high frequencies (where the vocal tract filter gain is low) and it deemphasizes the low frequencies in the error signal. Thus, the error will be distributed in frequency so that relatively more error occurs at low frequencies, where, in this case, such errors would be masked by the high-amplitude low frequencies. By varying α1 and α2 in (7.14), the relative distribution of error can be adjusted.
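Computing the coefficients of W(z) is simple once one notes that replacing z by z/α in A(z) = 1 − Σ_k a_k z^{-k} multiplies the kth coefficient by α^k (often called bandwidth expansion). A sketch, using a hypothetical 2nd-order predictor and the typical α1 = 0.9, α2 = 0.4 from (7.14):

```python
import numpy as np

def bandwidth_expand(a, alpha):
    """Coefficient vector of A(z/alpha) for A(z) = 1 - sum_k a_k z^{-k}:
    substituting z -> z/alpha scales the k-th coefficient by alpha**k."""
    poly = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    return poly * alpha ** np.arange(len(poly))

# Hypothetical 2nd-order predictor coefficients a_1, a_2.
a = [1.2, -0.6]
num = bandwidth_expand(a, 0.9)   # numerator polynomial A(z/alpha1)
den = bandwidth_expand(a, 0.4)   # denominator polynomial A(z/alpha2)

def W(omega):
    """Frequency response of the weighting filter at radian frequency omega."""
    z = np.exp(-1j * omega * np.arange(len(num)))
    return np.dot(num, z) / np.dot(den, z)
```

Plotting |W(e^{jω})| for such a pair reproduces the "controlled inverse filter" shape discussed above.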
Fig. 7.9 Comparison of frequency responses of vocal tract filter and perceptual weighting filter in analysis-by-synthesis coding. (Log magnitude in dB versus frequency in Hz over the band 0–4000 Hz.)
7.3.4.3 Generating the Excitation Signal

Most analysis-by-synthesis systems generate the excitation from a finite fixed collection of input components, which we designate here as f_γ[n] for 0 ≤ n ≤ L − 1, where L is the excitation frame length and γ ranges over the finite set of components. The input is composed of a finite sum of scaled components selected from the given collection, i.e.,

d̂[n] = Σ_{k=1}^{N} β_k f_{γ_k}[n].  (7.15)
The β_k's and the sequences f_{γ_k}[n] are chosen to minimize

E = Σ_{n=0}^{L−1} ((x[n] − h[n] ∗ d̂[n]) ∗ w[n])²,  (7.16)

where h[n] is the vocal tract impulse response and w[n] is the impulse response of the perceptual weighting filter with system function (7.14). Since the component sequences are assumed to be known at both the coder and decoder, the β's and γ's are all that is needed to represent the input d̂[n]. It is very difficult to solve simultaneously for the optimum β's and γ's that minimize (7.16). However, satisfactory results are obtained by solving for the component signals one at a time.
Starting with the assumption that the excitation signal is zero during the current excitation^15 analysis frame (indexed 0 ≤ n ≤ L − 1), the output in the current frame due to the excitation determined for previous frames is computed and denoted x̂_0[n] for 0 ≤ n ≤ L − 1. Normally this would be a decaying signal that could be truncated after L samples. The error signal in the current frame at this initial stage of the iterative process would be d_0[n] = x[n] − x̂_0[n], and the perceptually weighted error would be d′_0[n] = d_0[n] ∗ w[n]. Now at the first iteration stage, assume that we have determined which of the collection of input components f_{γ_1}[n] will reduce the weighted mean-squared error the most, and that we have also determined the required gain β_1. By superposition, x̂_1[n] = x̂_0[n] + β_1 f_{γ_1}[n] ∗ h[n]. The weighted error at the first iteration is d′_1[n] = (d_0[n] − β_1 f_{γ_1}[n] ∗ h[n]) ∗ w[n], or, if we define the perceptually weighted vocal tract impulse response as h′[n] = h[n] ∗ w[n]
15. Several analysis frames are usually included in one linear predictive analysis frame.
and invoke the distributive property of convolution, then the weighted error sequence is d′_1[n] = d′_0[n] − β_1 f_{γ_1}[n] ∗ h′[n]. This process is continued by finding the next component, subtracting it from the previously computed residual error, and so forth. Generalizing to the kth iteration, the corresponding equation takes the form:

d′_k[n] = d′_{k−1}[n] − β_k y_k[n],  (7.17)
where y_k[n] = f_{γ_k}[n] ∗ h′[n] is the output of the perceptually weighted vocal tract impulse response due to the input component f_{γ_k}[n]. The mean-squared error at stage k of the iteration is defined as

E_k = Σ_{n=0}^{L−1} (d′_k[n])² = Σ_{n=0}^{L−1} (d′_{k−1}[n] − β_k y_k[n])²,  (7.18)

where it is assumed that d′_{k−1}[n] is the weighted difference signal that remains after k − 1 steps of the process.
Assuming that γ_k is known, the value of β_k that minimizes E_k in (7.18) is

β_k = ( Σ_{n=0}^{L−1} d′_{k−1}[n] y_k[n] ) / ( Σ_{n=0}^{L−1} (y_k[n])² ),  (7.19)
and the corresponding minimum mean-squared error is

E_k^(min) = Σ_{n=0}^{L−1} (d′_{k−1}[n])² − β_k² Σ_{n=0}^{L−1} (y_k[n])².  (7.20)
While we have assumed in the above discussion that f_{γ_k}[n] and β_k are known, we have not discussed how they can be found. Equation (7.20) suggests that the mean-squared error can be minimized by maximizing

β_k² Σ_{n=0}^{L−1} (y_k[n])² = ( Σ_{n=0}^{L−1} d′_{k−1}[n] y_k[n] )² / ( Σ_{n=0}^{L−1} (y_k[n])² ),  (7.21)
which is essentially the normalized cross-correlation between the new input component and the weighted residual error d′_{k−1}[n]. An exhaustive search through the collection of input components will determine which component f_{γ_k}[n] will maximize the quantity in (7.21) and thus reduce the mean-squared error the most. Once this component is found, the corresponding β_k can be found from (7.19), and the new error sequence computed from (7.17).

After N iterations of this process, the complete set of component input sequences and corresponding coefficients will have been determined.^16 If desired, the set of input sequences so determined can be assumed and a new set of β's can be found jointly by solving a set of N linear equations.
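The one-at-a-time procedure of (7.17)–(7.21) can be sketched directly. The demonstration at the end uses identity-impulse components and a trivial h′[n] = δ[n] purely so the arithmetic is checkable by hand; these are simplifying assumptions, not part of any particular coder:

```python
import numpy as np

def greedy_excitation_search(d0, components, h_w, N):
    """One-at-a-time component selection, Eqs. (7.17)-(7.21).
    d0:         perceptually weighted initial error d'_0[n], length L
    components: candidate sequences f_gamma[n], shape (num_candidates, L)
    h_w:        perceptually weighted impulse response h'[n]
    Returns chosen indices gamma_k, gains beta_k, and the final residual."""
    L = len(d0)
    # y_gamma[n] = f_gamma[n] * h'[n], truncated to the analysis frame
    y = np.array([np.convolve(f, h_w)[:L] for f in components])
    energy = np.sum(y ** 2, axis=1)
    d = d0.astype(float).copy()
    gammas, betas = [], []
    for _ in range(N):
        corr = y @ d                   # cross-correlations with the residual
        score = corr ** 2 / energy     # quantity maximized in (7.21)
        g = int(np.argmax(score))
        beta = corr[g] / energy[g]     # optimum gain, Eq. (7.19)
        d = d - beta * y[g]            # residual update, Eq. (7.17)
        gammas.append(g)
        betas.append(beta)
    return gammas, betas, d

# Demonstration: impulse components and h'[n] = delta[n].
components = np.eye(4)
gammas, betas, resid = greedy_excitation_search(
    np.array([0.0, 2.0, 0.0, -1.0]), components, np.array([1.0]), N=2)
```

With these toy inputs the two largest residual samples are matched exactly and the residual is driven to zero after two iterations.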
7.3.4.4 Multi-Pulse Excitation Linear Prediction (MPLP)

The procedure described in Section 7.3.4.3 is quite general, since the only constraint was that the collection of component input sequences be finite. The first analysis-by-synthesis coder was called multi-pulse linear predictive coding [4]. In this system, the component input sequences are simply isolated unit impulse sequences, i.e., f_γ[n] = δ[n − γ], where γ is an integer such that 0 ≤ γ ≤ L − 1. The excitation sequence derived by the process described in Section 7.3.4.3 is therefore of the form:

d̂[n] = Σ_{k=1}^{N} β_k δ[n − γ_k],  (7.22)

where N is usually on the order of 4–5 impulses in a 5 ms (40 samples for f_s = 8 kHz) excitation analysis frame.^17 In this case, an exhaustive search at each stage can be achieved by computing just one cross-correlation function and locating the impulse at the maximum of the cross-correlation. This is because the components y_k[n] are all just shifted versions of h′[n].
16. Alternatively, N need not be fixed in advance. The error reduction process can be stopped when E_k^(min) falls below a prescribed threshold.
17. It is advantageous to include two or more excitation analysis frames in one linear predictive analysis frame, which normally is of 20 ms (160 samples) duration.
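Because every y_γ[n] is a shifted copy of h′[n], each multi-pulse search stage reduces to a single cross-correlation of the residual with h′[n]. A sketch, with hypothetical signal values chosen so the answer is known:

```python
import numpy as np

def best_impulse(d_res, h_w):
    """Find the best impulse location gamma and gain beta in one pass.
    Since y_gamma[n] = h'[n - gamma], the score (7.21) for every gamma
    comes from one cross-correlation of the residual with h'[n]."""
    L = len(d_res)
    corr = np.array([np.dot(d_res[g:], h_w[:L - g]) for g in range(L)])
    energy = np.array([np.sum(h_w[:L - g] ** 2) for g in range(L)])
    gamma = int(np.argmax(corr ** 2 / energy))
    beta = corr[gamma] / energy[gamma]
    return gamma, beta

# A residual equal to 2 * h'[n - 1] should be matched exactly.
h_w = np.array([1.0, 0.5, 0.0, 0.0])
d_res = np.array([0.0, 2.0, 1.0, 0.0])
gamma, beta = best_impulse(d_res, h_w)
```

The per-shift energy normalization accounts for truncation of h′[n] at the frame boundary.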
In a speech coder based on multi-pulse analysis-by-synthesis, the speech signal is represented by quantized versions of the prediction coefficients, the pulse locations, and the pulse amplitudes. The predictor coefficients can be coded as discussed for ADPCM by encoding any of the alternative representations by either scalar or vector quantization. A number of different approaches have been devised for coding the excitation parameters. These usually involve some sort of differential coding scheme. At bit rates on the order of 10–16 kbps, multi-pulse coding can approach toll quality; however, below about 10 kbps, quality degrades rapidly due to the many parameters that must be coded [13].
Regular-pulse excitation (RPE) is a special case of multi-pulse excitation in which it is easier to encode the impulse locations. Specifically, after determining the location γ_1 of the first impulse, all other impulses are located at integer multiples of a fixed spacing on either side of γ_1. Thus, only γ_1, the integer multiples, and the gain constants must be encoded. Using a preset spacing of 4 and an excitation analysis window of 40 samples yields high quality at about 10 kbps [67]. Because the impulse locations are fixed after the first iteration, RPE requires significantly less computation than full-blown multi-pulse excitation analysis. One version of RPE is the basis for the 13 kbps digital coder in the GSM mobile communications system.
7.3.4.5 Code-Excited Linear Prediction (CELP)

Since the major portion of the bit rate of an analysis-by-synthesis coder lies in the excitation signal, it is not surprising that a great deal of effort has gone into schemes for finding excitation signals that are easier to encode than multi-pulse excitation, yet maintain high quality of reproduction of the speech signal. The next major innovation in the history of analysis-by-synthesis systems was called code-excited linear predictive coding, or CELP [112].^18 In this scheme, the excitation signal components are Gaussian random sequences stored in a "codebook."^19 If the codebook contains 2^M sequences, then a given sequence can be specified by an M-bit number. Typical codebook sizes are 256, 512, or 1024. Within the codebook, the sequence lengths are typically L = 40–60 samples (5–7.5 ms at 8 kHz sampling rate). If N sequences are used to form the excitation, then a total of M · N bits will be required to code the sequences. Additional bits are still required to code the β_k's. With this method, the decoder must also have a copy of the analysis codebook so that the excitation can be regenerated at the decoder.

18. Earlier work by Stewart [123] used a residual codebook (populated by the Lloyd algorithm [71]) with a low-complexity trellis search.
The main disadvantage of CELP is the high computational cost of exhaustively searching the codebook at each stage of the error minimization. This is because each codebook sequence must be filtered with the perceptually weighted impulse response before computing the cross-correlation with the residual error sequence. Efficient searching schemes and structured codebooks have eased the computational burden, and modern DSP hardware can easily implement the computations in real time.
The basic CELP framework has been applied in the development of numerous speech coders that operate with bit rates in the range from 4800 to 16,000 bps. Federal Standard 1016 (FS1016) was adopted by the Department of Defense for use in secure voice transmission at 4800 bps. Another system called VSELP (vector sum excited linear prediction), which uses multiple codebooks to achieve a bit rate of 8000 bps, was adopted in 1989 for North American cellular systems. The GSM half-rate coder operating at 5600 bps is based on the IS-54 VSELP coder. Due to their block processing structure, CELP coders can introduce more than 40 ms of delay into communication systems. The ITU G.728 low-delay CELP coder uses very short excitation analysis frames and backward linear prediction to achieve high quality at a bit rate of 16,000 bps with delays less than 2 ms. Still another standardized CELP system is the ITU G.729 conjugate-structure algebraic CELP (CS-ACELP) coder, which is used in some mobile communication systems.
19. While it may seem counterintuitive that the excitation could be comprised of white noise sequences, remember that, in fact, the object of linear prediction is to produce just such a signal.
7.3.4.6 Long-Delay Predictors in Analysis-by-Synthesis Coders

Long-delay predictors have an interesting interpretation in analysis-by-synthesis coding. If we maintain a memory of one or more frames of previous values of the excitation signal, we can add a term β_0 d̂[n − γ_0] to (7.15). The long segment of the past excitation acts as a sort of codebook in which the L-sample codebook sequences overlap by L − 1 samples. The first step of the excitation computation would be to compute γ_0 using (7.21) and then β_0 using (7.19). For each value of γ_0 to be tested, this requires that the weighted vocal tract impulse response be convolved with the L-sample segment of the past excitation starting at sample γ_0. This can be done recursively to save computation. Then the component β_0 d̂[n − γ_0] can be subtracted from the initial error to start the iteration process in either the multi-pulse or the CELP framework. The additional bit rate for coding β_0 and γ_0 is often well worthwhile, and long-delay predictors are used in many of the standardized coders mentioned above. This incorporation of components of the past history of the excitation has been referred to as "self-excitation" [104] or the use of an "adaptive codebook" [19].
7.4 Open-Loop Coders

ADPCM coding has not been useful below about 9600 bps, multi-pulse coding has similar limitations, and CELP coding has not been used at bit rates below about 4800 bits/s. While these closed-loop systems have many attractive features, it has not been possible to generate excitation signals that can be coded at low bit rates and also produce high-quality synthetic speech. For bit rates below 4800 bps, engineers have turned to the vocoder principles that were established decades ago. We call these systems open-loop systems because they do not determine the excitation by a feedback process.
7.4.1 The Two-State Excitation Model

Figure 7.10 shows a source/system model for speech that is closely related to physical models for speech production.

Fig. 7.10 Two-state excitation model for speech synthesis.

As we have discussed in the previous chapters, the excitation model can be very simple. Unvoiced sounds are produced by exciting the system with white noise, and voiced sounds are produced by a periodic impulse train excitation, where the spacing between impulses is the pitch period, P_0. We shall refer to this as the two-state excitation model. The slowly time-varying linear system models the combined effects of vocal tract transmission, radiation at the lips, and, in the case of voiced speech, the lowpass frequency shaping of the glottal pulse. The V/UV (voiced/unvoiced excitation) switch produces the alternating voiced and unvoiced segments of speech, and the gain parameter G controls the level of the filter output. When values for the V/UV decision, G, P_0, and the parameters of the linear system are supplied at periodic intervals (frames), the model becomes a speech synthesizer, as discussed in Chapter 8. When the parameters of the model are estimated directly from a speech signal, the combination of estimator and synthesizer becomes a vocoder or, as we prefer, an open-loop analysis/synthesis speech coder.
7.4.1.1 Pitch, Gain, and V/UV Detection

The fundamental frequency of voiced speech can range from well below 100 Hz for low-pitched male speakers to over 250 Hz for the high-pitched voices of women and children. The fundamental frequency varies slowly with time, more or less at the same rate as the vocal tract motions. It is common to estimate the fundamental frequency F_0, or equivalently the pitch period P_0 = 1/F_0, at a frame rate of about 50–100 times/s. To do this, short segments of speech are analyzed to detect periodicity (signaling voiced speech) or aperiodicity (signaling unvoiced speech). One of the simplest, yet most effective, approaches to pitch detection operates directly on the time waveform by locating corresponding peaks and valleys in the waveform and measuring the times between the peaks [43]. The STACF is also useful for this purpose, as illustrated in Figure 4.8, which shows autocorrelation functions for both voiced speech, evincing a peak at the pitch period, and unvoiced speech, which shows no evidence of periodicity. In pitch detection applications of the short-time autocorrelation, it is common to preprocess the speech by a spectrum-flattening operation such as center clipping [96] or inverse filtering [77]. This preprocessing tends to enhance the peak at the pitch period for voiced speech while suppressing the local correlation due to formant resonances.
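A sketch of an autocorrelation pitch detector with center clipping as the spectrum-flattening step; the clipping level, lag search range, and voicing threshold below are illustrative choices, not values from the text:

```python
import numpy as np

def pitch_autocorr(x, fs, fmin=60.0, fmax=400.0, clip=0.3):
    """Estimate the pitch period from the short-time autocorrelation peak.
    Center clipping (threshold = clip * max|x|) flattens the spectrum and
    suppresses formant-induced local correlation before the search."""
    c = clip * np.max(np.abs(x))
    xc = np.where(np.abs(x) > c, x, 0.0)
    r = np.correlate(xc, xc, mode="full")[len(xc) - 1:]  # lags 0..len-1
    lo, hi = int(fs / fmax), int(fs / fmin)              # candidate period range
    lag = lo + int(np.argmax(r[lo:hi]))
    voiced = r[lag] > 0.3 * r[0]                         # crude V/UV decision
    return lag, voiced

fs = 8000
n = np.arange(400)
x = np.sign(np.sin(2 * np.pi * 100 * n / fs))  # 100 Hz source: period 80 samples
lag, voiced = pitch_autocorr(x, fs)
```

For the 100 Hz test waveform the peak lands at lag 80 (P_0 = 80 samples at 8 kHz) and the segment is declared voiced.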
Another approach to pitch detection is suggested by Figure 5.5, which shows the cepstra of segments of voiced and unvoiced speech. In this case, a strong peak in the expected pitch period range signals voiced speech, and the location of the peak is the pitch period. Similarly, the lack of a peak in the expected range signals unvoiced speech [83]. Figure 5.6 shows a sequence of short-time cepstra that moves from unvoiced to voiced speech in going from the bottom to the top of the figure. The time variation of the pitch period is evident in the upper plots.
The gain parameter G is also found by analysis of short segments of speech. It should be chosen so that the short-time energy of the synthetic output matches the short-time energy of the input speech signal. For this purpose, the autocorrelation function value φ_n̂[0] at lag 0 or the cepstrum value c_n̂[0] at quefrency 0 can be used to determine the energy of the segment of the input signal.
For digital coding applications, the pitch period, V/UV decision, and gain must be quantized. Typical values are 7 bits for the pitch period (P_0 = 0 signals UV) and 5 bits for G. For a frame rate of 50 frames/s, this totals 600 bps, which is well below the bit rate used to encode the excitation signal in closed-loop coders such as ADPCM or CELP. Since the vocal tract filter can be coded as in ADPCM or CELP, much lower total bit rates are common in open-loop systems. This comes at a large cost in quality of the synthetic speech output, however.
7.4.1.2 Vocal Tract System Estimation

The vocal tract system in the synthesizer of Figure 7.10 can take many forms. The primary methods that have been used are homomorphic filtering and linear predictive analysis, as discussed in Chapters 5 and 6, respectively.

The Homomorphic Vocoder. Homomorphic filtering can be used to extract a sequence of impulse responses from the sequence of cepstra that result from short-time cepstrum analysis. Thus, one cepstrum computation can yield both an estimate of pitch and the vocal tract impulse response. In the original homomorphic vocoder, the impulse response was digitally coded by quantizing each cepstrum value individually (scalar quantization) [86]. The impulse response, reconstructed from the quantized cepstrum at the synthesizer, is simply convolved with the excitation created from the quantized pitch, voicing, and gain information, i.e., s[n] = Ge[n] ∗ h[n].^20 In a more recent application of homomorphic analysis in an analysis-by-synthesis framework [21], the cepstrum values were coded using vector quantization, and the excitation was derived by analysis-by-synthesis as described in Section 7.3.4.3. In still another approach to digital coding, homomorphic filtering was used to remove excitation effects in the short-time spectrum, and then three formant frequencies were estimated from the smoothed spectra. Figure 5.6 shows an example of the formants estimated for voiced speech. These formant frequencies were used to control the resonance frequencies of a synthesizer comprised of a cascade of second-order IIR digital filter sections [111]. Such a speech coder is called a formant vocoder.
LPC Vocoder. Linear predictive analysis can also be used to estimate the vocal tract system for an open-loop coder with two-state excitation [6]. In this case, the prediction coefficients can be coded in one of the many ways that we have already discussed. Such analysis/synthesis coders are called LPC vocoders. A vocoder of this type was standardized by the Department of Defense as Federal Standard FS1015. This system is also called LPC-10 (or LPC-10e) because a 10th-order covariance linear predictive analysis is used to estimate the vocal tract system. The LPC-10 system has a bit rate of 2400 bps using a frame size of 22.5 ms, with 12 bits/frame allocated to pitch, voicing, and gain, and the remaining bits allocated to the vocal tract filter coded as PARCOR coefficients.

20. Care must be taken at frame boundaries. For example, the impulse response can be changed at the time a new pitch impulse occurs, and the resulting output can overlap into the next frame.
7.4.2 Residual-Excited Linear Predictive Coding

In Section 7.2, we presented Figure 7.4 as motivation for the use of the source/system model in digital speech coding. This figure shows an example of inverse filtering of the speech signal using a prediction error filter, where the prediction error (or residual) is significantly smaller and less lowpass in nature. The speech signal can be reconstructed from the residual by passing it through the vocal tract system H(z) = 1/A(z). None of the methods discussed so far attempts to code the prediction error signal directly in an open-loop manner. ADPCM and analysis-by-synthesis systems derive the excitation to the synthesis filter by a feedback process. The two-state model attempts to construct the excitation signal by direct analysis and measurement of the input speech signal. Systems that attempt to code the residual signal directly are called residual-excited linear predictive (RELP) coders.

Direct coding of the residual faces the same problem faced in ADPCM or CELP: the sampling rate is the same as that of the input, and accurate coding could require several bits per sample. Figure 7.11 shows a block diagram of a RELP coder [128]. In this system, which is quite similar to voice-excited vocoders (VEV) [113, 130], the problem of reducing the bit rate of the residual signal is attacked by reducing its bandwidth to about 800 Hz, lowering the sampling rate, and coding the samples with adaptive quantization. Adaptive delta modulation was used in [128], but APCM could be used if the sampling rate is
Fig. 7.11 Residual-excited linear predictive (RELP) coder and decoder. (Encoder: x[n] is inverse-filtered with A(z) obtained from linear predictive analysis; the residual r[n] is lowpass-filtered, decimated, and adaptively quantized. Decoder: the quantized residual is decoded, restored to full bandwidth by an interpolator and spectrum flattener, and applied to the synthesis filter H(z) to produce x̂[n].)
lowered to 1600 Hz. The 800 Hz band is wide enough to contain several harmonics of the highest-pitched voices. The reduced-bandwidth residual is restored to full bandwidth prior to its use as an excitation signal by a nonlinear spectrum-flattening operation, which restores the higher harmonics of voiced speech. White noise is also added according to an empirically derived recipe. In the implementation of [128], the sampling rate of the input was 6.8 kHz, and the total bit rate was 9600 bps, with 6800 bps devoted to the residual signal. The quality achieved at this rate was not significantly better than that of the LPC vocoder with a two-state excitation model. The principal advantage of this system is that no hard V/UV decision must be made, and no pitch detection is required. While this system did not become widely used, its basic principles can be found in subsequent open-loop coders that have produced much better speech quality at bit rates around 2400 bps.
7.4.3 Mixed Excitation Systems

While two-state excitation allows the bit rate to be quite low, the quality of the synthetic speech output leaves much to be desired. The output of such systems is often described as "buzzy," and in many cases, errors in estimating the pitch period or voicing decision cause the speech to sound unnatural if not unintelligible. The weaknesses of the two-state model for excitation spurred interest in a mixed excitation model where a hard decision between V and UV is not required. Such a model was first proposed by Makhoul et al. [75] and greatly refined by McCree and Barnwell [80].
Figure 7.12 depicts the essential features of the mixed-excitation linear predictive (MELP) coder proposed by McCree and Barnwell [80]. This configuration was developed as the result of careful experimentation, which focused one-by-one on the sources of distortion manifest in the two-state excitation coder, such as buzziness and tonal distortions. The main feature is that impulse train excitation and noise excitation are added instead of switched. Prior to their addition, they each pass through a multiband spectral shaping filter. The gains in each of five bands are coordinated between the two filters so that the spectrum of e[n] is flat. This mixed excitation helps to model short-time spectral effects such as "devoicing" of certain bands during voiced speech.^21 In some situations, a "jitter" parameter ∆P is invoked to better model voicing transitions. Other important features of the MELP system are lumped into the block labeled "Enhancing Filters." This represents adaptive spectrum enhancement filters used to enhance formant regions,^22 and a spectrally flat "pulse dispersion filter" whose purpose is to reduce "peakiness" due to the minimum-phase nature of the linear predictive vocal tract system.

Fig. 7.12 Mixed-excitation linear predictive (MELP) decoder.

21. Devoicing is evident in frame 9 in Figure 5.6.
22. Such filters are also used routinely in CELP coders and are often referred to as postfilters.
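The add-instead-of-switch idea can be sketched as follows. The five-band FFT-mask shaping, the particular per-band gains, and the frame constants are all illustrative assumptions chosen for brevity, not MELP's actual filters; the complementary gains are chosen so the two shaped spectra sum to unit power in every band:

```python
import numpy as np

rng = np.random.default_rng(0)
L, P0 = 160, 80                     # 20 ms frame at 8 kHz, pitch period 80 samples

# Hypothetical per-band voicing strengths for five bands (strongly voiced
# at low frequencies, mostly noise-like at high frequencies).
pulse_gain = np.array([0.9, 0.8, 0.5, 0.3, 0.1])
noise_gain = np.sqrt(1.0 - pulse_gain ** 2)   # power-complementary gains

edges = np.linspace(0, L // 2 + 1, 6).astype(int)  # five equal FFT bands

def band_shape(signal, gains):
    """Apply per-band gains in the frequency domain (multiband shaping filter)."""
    spec = np.fft.rfft(signal)
    for b in range(5):
        spec[edges[b]:edges[b + 1]] *= gains[b]
    return np.fft.irfft(spec, n=L)

pulses = np.zeros(L)
pulses[::P0] = 1.0                  # periodic impulse train
noise = rng.standard_normal(L)

# Mixed excitation: shaped impulse train plus shaped noise, added, not switched.
e = band_shape(pulses, pulse_gain) + band_shape(noise, noise_gain)
```

Varying the per-band gains from frame to frame models effects like partial devoicing of individual bands, which the hard V/UV switch cannot represent.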
These modifications to the basic two-state excitation LPC vocoder produce marked improvements in the quality of reproduction of the speech signal. Several new parameters of the excitation must be estimated at analysis time and coded for transmission, but these add only slightly to either the analysis computation or the bit rate [80]. The MELP coder is said to produce speech quality at 2400 bps that is comparable to CELP coding at 4800 bps. In fact, its superior performance led to a new Department of Defense standard in 1996 and subsequently to MIL-STD-3005 and NATO STANAG 4591, which operates at 2400, 1200, and 600 bps.
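The additive pulse-plus-noise excitation described above can be sketched in a few lines. This is a simplified illustration, not the actual MELP algorithm: the band edges, voicing strengths, and the FFT-based band shaping below are placeholder choices for the example.

```python
import numpy as np

def mixed_excitation(n_samples, pitch_period, voicing, band_edges, fs):
    """MELP-style excitation sketch: a spectrally shaped impulse train
    plus complementary spectrally shaped noise (added, not switched).

    voicing: per-band voicing strengths in [0, 1], one per band.
    The gains are coordinated so the combined excitation spectrum
    stays roughly flat (g_pulse^2 + g_noise^2 = 1 in each band)."""
    pulses = np.zeros(n_samples)
    pulses[::pitch_period] = 1.0                  # impulse train
    noise = np.random.default_rng(0).standard_normal(n_samples)
    noise /= np.sqrt(np.mean(noise**2))           # unit RMS

    # Apply per-band gains in the frequency domain (crude multiband filter).
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    gain_p = np.zeros_like(freqs)
    gain_n = np.zeros_like(freqs)
    for (lo, hi), v in zip(band_edges, voicing):
        band = (freqs >= lo) & (freqs < hi)
        gain_p[band] = np.sqrt(v)                 # voiced share of the band
        gain_n[band] = np.sqrt(1.0 - v)           # noise share of the band
    return (np.fft.irfft(np.fft.rfft(pulses) * gain_p, n_samples)
            + np.fft.irfft(np.fft.rfft(noise) * gain_n, n_samples))

# Five bands (Hz): strong voicing at low frequencies, "devoiced" top band.
bands = [(0, 500), (500, 1000), (1000, 2000), (2000, 3000), (3000, 4000)]
e = mixed_excitation(n_samples=320, pitch_period=80,
                     voicing=[1.0, 0.9, 0.7, 0.4, 0.0],
                     band_edges=bands, fs=8000)
```

In a real MELP coder the per-band voicing strengths are estimated from the input speech each frame; here they are fixed by hand to show the mechanics.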
7.5 Frequency-Domain Coders
Although we have discussed a wide variety of digital speech coding systems, we have really only focused on a few of the most important examples that have current relevance. There is much room for variation within the general frameworks that we have identified. There is one more class of coders that should be discussed. These are called frequency-domain coders because they are based on the principle of decomposing the speech signal into individual frequency bands. Figure 7.13 depicts the general nature of this large class of coding systems. On the far left of the diagram is a set (bank) of analysis bandpass filters. Each of these is followed by a downsampler by a factor appropriate for the bandwidth of its bandpass input. On the far right in the diagram is a set of "bandpass interpolators," where each interpolator is composed of an upsampler (i.e., raising the sampling rate by the same factor as the corresponding downsampling box) followed by a bandpass filter similar to the analysis filter. If the filters are carefully designed, and the outputs of the downsamplers are connected to the corresponding inputs of the bandpass interpolators, then it is possible for the output x̂[n] to be virtually identical to the input x[n] [24]. Such perfect reconstruction filter bank systems are the basis for a wide-ranging class of speech and audio coders called subband coders [25].
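The analysis/synthesis structure of Figure 7.13 can be illustrated with the simplest possible perfect reconstruction filter bank, the two-band, 2-tap Haar (QMF) pair. The filters used in practical subband coders are longer and more selective, but the downsample/upsample bookkeeping is the same.

```python
import numpy as np

def haar_analysis(x):
    """Two-band analysis: split x into a lowpass (sum) channel and a
    highpass (difference) channel, each downsampled by 2. This is the
    simplest perfect-reconstruction filter bank (the Haar QMF pair)."""
    x = np.asarray(x, dtype=float)
    assert len(x) % 2 == 0
    low = (x[0::2] + x[1::2]) / np.sqrt(2.0)    # lowpass + downsample
    high = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # highpass + downsample
    return low, high

def haar_synthesis(low, high):
    """Bandpass interpolators: upsample each channel by 2, filter, and
    add. For the Haar pair this inverts the analysis exactly."""
    x = np.empty(2 * len(low))
    x[0::2] = (low + high) / np.sqrt(2.0)
    x[1::2] = (low - high) / np.sqrt(2.0)
    return x

x = np.random.default_rng(1).standard_normal(64)
low, high = haar_analysis(x)
x_hat = haar_synthesis(low, high)   # matches x to machine precision
```

With no quantization between analysis and synthesis, x̂[n] reproduces x[n] exactly; a subband coder inserts quantizers on `low` and `high` and trades that exactness for bit rate.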
When the outputs of the filter bank are quantized by some sort of quantizer, the output x̂[n] is not equal to the input, but by careful design of the quantizer the output can be made perceptually indistinguishable from the input. The goal with such coders is, of course, for the total composite bit rate (the sum, over all channels, of the product of each downsampled channel's sampling rate and the number of bits allocated to that channel) to be as low as possible while maintaining high quality.

Fig. 7.13 Subband coder and decoder for speech audio.
In contrast to the coders that we have already discussed, which incorporated the speech production model into the quantization process, the filter bank structure incorporates important features of the speech perception model. As discussed in Chapter 3, the basilar membrane effectively performs a frequency analysis of the sound impinging on the eardrum. The coupling between points on the basilar membrane results in the masking effects that we mentioned in Chapter 3. This masking was incorporated in some of the coding systems discussed so far, but only in a rudimentary manner.

To see how subband coders work, suppose first that the individual channels are quantized independently. The quantizers can be any quantization operator that preserves the waveform of the signal. Typically, adaptive PCM is used, but ADPCM could be used in principle. Because the downsampled channel signals are full band at the lower sampling rate, the quantization error affects all frequencies in
that band, but the important point is that no other bands are affected by the errors in a given band. Just as important, a particular band can be quantized according to its perceptual importance. Furthermore, bands with low energy can be encoded with correspondingly low absolute error.
The simplest approach is to preassign a fixed number of bits to each channel. In an early nonuniformly spaced five-channel system, Crochiere et al. [25] found that subband coders operating at 16 kbps and 9.6 kbps were preferred by listeners to ADPCM coders operating at 22 and 19 kbps, respectively.
Much more sophisticated quantization schemes are possible in the configuration shown in Figure 7.13 if, as suggested by the dotted boxes, the quantization of the channels is done jointly. For example, if the channel bandwidths are all the same, so that the output samples of the downsamplers can be treated as a vector, vector quantization can be used effectively [23].
Another approach is to allocate bits among the channels dynamically, according to a perceptual criterion. This is the basis for modern audio coding standards such as the various MPEG standards. Such systems are based on the principles depicted in Figure 7.13. What is not shown in that figure is the additional analysis processing that is done to determine how to allocate the bits among the channels. This typically involves the computation of a fine-grained spectrum analysis using the DFT. From this spectrum it is possible to determine which frequencies will mask other frequencies, and on this basis a threshold of audibility is determined. Using this audibility threshold, bits are allocated among the channels in an iterative process, so that as much of the quantization error as possible is inaudible for the given total bit budget. The details of perceptual audio coding are presented in the recent textbook by Spanias [120].
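A toy version of such an iterative allocation might look as follows. The 6 dB-per-bit noise model and the power/threshold numbers are illustrative assumptions, not the MPEG procedure, which works with psychoacoustic models far more detailed than this.

```python
import numpy as np

def allocate_bits(channel_power, mask_threshold, total_bits):
    """Greedy perceptual bit allocation sketch: repeatedly give one bit
    to the channel whose estimated quantization noise most exceeds its
    masking threshold. Each added bit is assumed to lower that
    channel's noise power by about 6 dB."""
    n = len(channel_power)
    bits = np.zeros(n, dtype=int)

    def nmr_db(k):
        # noise-to-mask ratio in dB under the ~6 dB/bit assumption
        noise_db = 10 * np.log10(channel_power[k]) - 6.02 * bits[k]
        return noise_db - 10 * np.log10(mask_threshold[k])

    for _ in range(total_bits):
        worst = max(range(n), key=nmr_db)   # channel with most audible noise
        bits[worst] += 1
    return bits

power = np.array([1.0, 0.5, 0.1, 0.01])     # channel signal powers
mask = np.array([1e-3, 1e-3, 1e-3, 1e-3])   # audibility thresholds
b = allocate_bits(power, mask, total_bits=16)
```

The loud low channels end up with more bits than the quiet ones, because their quantization noise sits further above the masking threshold.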
7.6 Evaluation of Coders
In this chapter, we have discussed a wide range of options for digital coding of speech signals. These systems rely heavily on the techniques of linear prediction, filter banks, and cepstrum analysis, and also on models for speech production and speech perception. All this knowledge and technique is brought to bear on the problem of reducing the bit rate of the speech representation while maintaining high-quality reproduction of the speech signal.
Throughout this chapter we have mentioned bit rates, complexity of computation, processing delay, and quality as important practical dimensions of the speech coding problem. Most coding schemes have many operations and parameters that can be chosen to trade off among these important factors. In general, increasing the bit rate will improve the quality of reproduction; however, it is important to note that for many systems, increasing the bit rate indefinitely does not necessarily continue to improve quality. For example, increasing the bit rate of an open-loop two-state-excitation LPC vocoder above about 2400 bps does not improve quality very much, but lowering the bit rate causes noticeable degradation. On the other hand, improved pitch detection and source modeling can improve quality in an LPC vocoder, as witnessed by the success of the MELP system, but this generally comes with an increase in computation and processing delay.
In the final analysis, the choice of a speech coder will depend on constraints imposed by the application, such as the cost of the coder/decoder, available transmission capacity, robustness to transmission errors, and quality requirements. Fortunately, there are many possibilities to choose from.
8
Text-to-Speech Synthesis Methods
In this chapter, we discuss systems whose goal is to convert ordinary text messages into intelligible and natural-sounding synthetic speech so as to transmit information from a machine to a human user. In other words, we will be concerned with computer simulation of the upper part of the speech chain in Figure 1.2. Such systems are often referred to as text-to-speech synthesis (or TTS) systems, and their general structure is illustrated in Figure 8.1. The input to the TTS system is text and the output is synthetic speech. The two fundamental processes performed by all TTS systems are text analysis (to determine the abstract underlying linguistic description of the speech) and speech synthesis (to produce the speech sounds corresponding to the text input).

Fig. 8.1 Block diagram of general TTS system.
8.1 Text Analysis
The text analysis module of Figure 8.1 must determine three things from the input text string, namely:

(1) pronunciation of the text string: the text analysis process must decide on the set of phonemes to be spoken, the degree of stress at various points in speaking, the intonation of the speech, and the duration of each of the sounds in the utterance;

(2) syntactic structure of the sentence to be spoken: the text analysis process must determine where to place pauses, what rate of speaking is most appropriate for the material being spoken, and how much emphasis should be given to individual words and phrases within the final spoken output speech;

(3) semantic focus and ambiguity resolution: the text analysis process must resolve homographs (words that are spelled alike but can be pronounced differently, depending on context), and must also use rules to determine word etymology to decide how best to pronounce names and foreign words and phrases.
Figure 8.2 shows more detail on how text analysis is performed. The input to the analysis is plain English text. The first stage of processing performs some basic text processing operations, including detecting the structure of the document containing the text (e.g., an email message versus a paragraph of text from an encyclopedia article), normalizing the text (so as to determine how to pronounce words like proper names or homographs with multiple pronunciations), and finally performing a linguistic analysis to determine grammatical information about words and phrases within the text. The basic text processing benefits from an online dictionary of word pronunciations, along with rules for determining word etymology. The output of the basic text processing step is tagged text, where the tags denote the linguistic properties of the words of the input text string.
Fig. 8.2 Components of the text analysis process.
8.1.1 Document Structure Detection
The document structure detection module seeks to determine the location of all punctuation marks in the text, and to decide their significance with regard to the sentence and paragraph structure of the input text. For example, an end-of-sentence marker is usually a period (.), a question mark (?), or an exclamation point (!). However, this is not always the case, as in the sentence "This car is 72.5 in. long," where there are two periods, neither of which denotes the end of the sentence.
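A toy rule for this period-disambiguation problem might look as follows. The abbreviation list and the capitalization heuristic are placeholder assumptions, far simpler than a real document structure detector.

```python
import re

# Tiny illustrative abbreviation list; a real system needs a large lexicon.
ABBREVIATIONS = {"in.", "st.", "dr.", "mr.", "mrs.", "etc."}

def sentence_ends(text):
    """Return character offsets of periods judged to end a sentence.
    Rule of thumb: the token carrying the period is neither a known
    abbreviation nor a number, and what follows (if anything) is
    whitespace and then an uppercase letter, or the end of the text."""
    ends = []
    for m in re.finditer(r"\S+", text):
        tok = m.group()
        if not tok.endswith("."):
            continue
        if tok.lower() in ABBREVIATIONS or re.fullmatch(r"\d+(\.\d+)*\.", tok):
            continue
        rest = text[m.end():]
        if rest == "" or re.match(r"\s+[A-Z]", rest):
            ends.append(m.end() - 1)
    return ends

# Only the periods after "long" and "red" count as sentence boundaries.
ends = sentence_ends("This car is 72.5 in. long. It is red.")
```

On the example from the text, the periods inside "72.5" and after the abbreviation "in." are correctly skipped.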
8.1.2 Text Normalization
Text normalization methods handle a range of text problems that occur in real applications of TTS systems, including how to handle abbreviations and acronyms, as in the following sentences:

Example 1: "I live on Bourbon St. in St. Louis"
Example 2: "She worked for DEC in Maynard MA"
where, in Example 1, the text "St." is pronounced as street or saint, depending on the context, and in Example 2 the acronym DEC can be pronounced either as the word "deck" (the spoken acronym) or as the name of the company, i.e., Digital Equipment Corporation, but is virtually never pronounced as the letter sequence "D E C."

Other examples of text normalization include number strings like "1920," which can be pronounced as the year "nineteen twenty" or the number "one thousand, nine hundred, and twenty," and dates, times, currency, account numbers, etc. Thus the string "$10.50" should be pronounced as "ten dollars and fifty cents" rather than as a sequence of characters.
One other important text normalization problem concerns the pronunciation of proper names, especially those from languages other than English.
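A fragment of such a normalizer, covering only the dollars-and-cents pattern above, might be sketched as follows. The helper names and the 0–99 number speller are illustrative assumptions, not part of any system described in the text.

```python
import re

def number_to_words(n):
    """Minimal number speller, enough for 0-99 (illustration only)."""
    units = ["zero", "one", "two", "three", "four", "five", "six",
             "seven", "eight", "nine", "ten", "eleven", "twelve",
             "thirteen", "fourteen", "fifteen", "sixteen", "seventeen",
             "eighteen", "nineteen"]
    tens = ["", "", "twenty", "thirty", "forty", "fifty",
            "sixty", "seventy", "eighty", "ninety"]
    if n < 20:
        return units[n]
    word = tens[n // 10]
    return word if n % 10 == 0 else word + "-" + units[n % 10]

def normalize_currency(text):
    """Expand simple $X.YY amounts into words; a hypothetical helper
    covering only dollars-and-cents patterns like "$10.50"."""
    def expand(m):
        dollars, cents = int(m.group(1)), int(m.group(2))
        return (f"{number_to_words(dollars)} dollars and "
                f"{number_to_words(cents)} cents")
    return re.sub(r"\$(\d+)\.(\d\d)", expand, text)

out = normalize_currency("The price is $10.50 today.")
# out == "The price is ten dollars and fifty cents today."
```

A full normalizer needs many such expanders (dates, years, account numbers, ordinals), each triggered by its own pattern and context.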
8.1.3 Linguistic Analysis
The third step in the basic text processing block of Figure 8.2 is a linguistic analysis of the input text, with the goal of determining, for each word in the printed string, the following linguistic properties:

• the part of speech (POS) of the word
• the sense in which each word is used in the current context
• the location of phrases (or phrase groups) within a sentence (or paragraph), i.e., where a pause in speaking might be appropriate
• the presence of anaphora (e.g., the use of a pronoun to refer back to another word unit)
• the word (or words) on which emphasis is to be placed, for prominence in the sentence
• the style of speaking, e.g., irate, emotional, relaxed, etc.

A conventional parser could be used as the basis of the linguistic analysis of the printed text, but typically a simple, shallow analysis is performed, since most linguistic parsers are very slow.
8.1.4 Phonetic Analysis
Ultimately, the tagged text obtained from the basic text processing block of a TTS system has to be converted to a sequence of tagged phones that describe both the sounds to be produced and the manner of speaking, both locally (emphasis) and globally (speaking style). The phonetic analysis block of Figure 8.2 provides the processing that enables the TTS system to perform this conversion, with the help of a pronunciation dictionary. These steps are performed as follows.
8.1.5 Homograph Disambiguation
The homograph disambiguation operation must resolve the correct pronunciation of each word in the input string that has more than one pronunciation. The basis for this is the context in which the word occurs. One simple example of homograph disambiguation is seen in the phrase "an absent boy" versus the sentence "do you choose to absent yourself." In the first, the word "absent" is an adjective and the accent is on the first syllable; in the second, the word "absent" is a verb and the accent is on the second syllable.
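A POS-conditioned pronunciation lookup can be sketched as follows. The mini-lexicon and its ARPAbet-style entries are illustrative, not taken from any dictionary referenced in the text.

```python
# Hypothetical mini-lexicon: homographs mapped from part-of-speech tags
# to ARPAbet-like pronunciations with stress digits (1 = primary stress).
HOMOGRAPHS = {
    "absent": {"ADJ": "AE1 B S AH0 N T",    # AB-sent
               "VERB": "AH0 B S AE1 N T"},  # ab-SENT
    "record": {"NOUN": "R EH1 K ER0 D",
               "VERB": "R IH0 K AO1 R D"},
}

def pronounce(word, pos):
    """Pick the pronunciation of a homograph from its POS tag;
    fall back to the first listed form when the tag is unknown."""
    forms = HOMOGRAPHS[word]
    return forms.get(pos, next(iter(forms.values())))

adj = pronounce("absent", "ADJ")    # stress on the first syllable
verb = pronounce("absent", "VERB")  # stress on the second syllable
```

The hard part in practice is supplying the POS tag, which is exactly what the shallow linguistic analysis of Section 8.1.3 provides.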
8.1.6 Letter-to-Sound (LTS) Conversion
The second step of phonetic analysis is the process of grapheme-to-phoneme conversion, namely conversion from the text to (marked) speech sounds. Although there are a variety of ways of performing this analysis, perhaps the most straightforward method is to rely on a standard pronunciation dictionary, along with a set of letter-to-sound rules for words outside the dictionary.
Figure 8.3 shows the processing for a simple dictionary search for word pronunciation. Each individual word in the text string is searched independently. First, a "whole word" search is initiated to see if the printed word exists, in its entirety, in the word dictionary. If so, the conversion to sounds is straightforward and the dictionary search begins on the next word. If not, as is most often the case, the dictionary search attempts to find affixes (both prefixes and suffixes) and strips them from the word, attempting to find the "root form" of the word, and then does another "whole word" search. If the root form is not present in the dictionary, a set of letter-to-sound rules is used to determine the best pronunciation (usually based on the etymology of the word) of the root form, again followed by the reattachment of any stripped affixes (including the case of no stripped affixes).

Fig. 8.3 Block diagram of dictionary search for proper word pronunciation.
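The whole-word search with affix stripping and a letter-to-sound fallback can be sketched as follows. The toy dictionary, the suffix list, and the spell-it-out stand-in for the LTS rules are all placeholder assumptions.

```python
# Toy pronunciation dictionary and suffix list; a real TTS lexicon has
# tens of thousands of entries and handles prefixes as well.
DICT = {"walk": "W AO1 K", "happy": "HH AE1 P IY0"}
SUFFIXES = {"ing": "IH0 NG", "ed": "D", "s": "Z"}

def letter_to_sound(word):
    """Stand-in for the letter-to-sound rules: spell the word out.
    A real system applies etymology-aware pronunciation rules here."""
    return " ".join(word.upper())

def lookup(word):
    """Dictionary search sketch: whole-word lookup first, then strip a
    suffix and retry on the root, then fall back to the LTS rules,
    finally reattaching the pronunciation of any stripped suffix."""
    if word in DICT:                          # whole-word hit
        return DICT[word]
    for suf, suf_pron in SUFFIXES.items():    # affix stripping
        if word.endswith(suf):
            root = word[: -len(suf)]
            if root in DICT:
                return DICT[root] + " " + suf_pron
            return letter_to_sound(root) + " " + suf_pron
    return letter_to_sound(word)              # LTS fallback

pron = lookup("walking")   # root "walk" found after stripping "ing"
```

Real affix stripping must also undo spelling changes at the boundary (e.g., "hopping" → "hop"), which this sketch ignores.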
8.1.7 Prosodic Analysis
The last step in the text analysis system of Figure 8.2 is prosodic analysis, which provides the speech synthesizer with the complete set of synthesis controls, namely the sequence of speech sounds, their durations, and an associated pitch contour (the variation of fundamental frequency with time). The determination of the sequence of speech sounds is mainly performed by the phonetic analysis step, as outlined above. The assignment of duration and pitch contours is done by a set of pitch and duration rules, along with a set of rules for assigning stress and determining where appropriate pauses should be inserted, so that the local and global speaking rates appear to be natural.
8.2 Evolution of Speech Synthesis Systems
A summary of the progress in speech synthesis over the period 1962–1997 is given in Figure 8.4. This figure shows that there have been three generations of speech synthesis systems. During the first generation (between 1962 and 1977), formant synthesis of phonemes using a terminal analog synthesizer was the dominant technology, using rules that related the phonetic decomposition of the sentence to formant frequency contours. The synthesis suffered from poor intelligibility and poor naturalness. The second generation of speech synthesis methods (from 1977 to 1992) was based primarily on an LPC representation of subword units such as diphones (half phones). By carefully modeling and representing diphone units via LPC parameters, it was shown that synthetic speech of good intelligibility could be reliably obtained from text input by concatenating the appropriate diphone units. Although the intelligibility improved dramatically over first-generation formant synthesis, the naturalness of the synthetic speech remained low, due to the inability of single diphone units to represent all possible combinations of sounds using that diphone unit. A detailed survey of progress in text-to-speech conversion up to 1987 is given in the review paper by Klatt [65]. The synthesis examples that accompanied that paper are available for listening at http://www.cs.indiana.edu/rhythmsp/ASA/Contents.html.

Fig. 8.4 Time line of progress in speech synthesis and TTS systems.
The third generation of speech synthesis technology covers the period from 1992 to the present, in which the method of "unit selection synthesis" was introduced and perfected, primarily by Sagisaka at ATR Labs in Kyoto [108]. The resulting synthetic speech from this third-generation technology has good intelligibility, with naturalness that approaches that of human-generated speech. We begin our discussion of speech synthesis approaches with a review of early systems, and then discuss unit selection methods of speech synthesis later in this chapter.
8.2.1 Early Speech Synthesis Approaches
Once the abstract underlying linguistic description of the text input has been determined via the steps of Figure 8.2, the remaining (major) task of a TTS system is to synthesize a speech waveform whose intelligibility is very high (to make the speech useful as a means of communication between a machine and a human), and whose naturalness is as close to real speech as possible. Both tasks, namely attaining high intelligibility along with reasonable naturalness, are difficult to achieve and depend critically on three issues in the processing of the speech synthesizer "backend" of Figure 8.1, namely:

(1) choice of synthesis units: including whole words, phones, diphones, dyads, or syllables;
(2) choice of synthesis parameters: including LPC features, formants, waveform templates, articulatory parameters, sinusoidal parameters, etc.;
(3) method of computation: including rule-based systems or systems that rely on the concatenation of stored speech units.
8.2.2 Word Concatenation Synthesis
Perhaps the simplest approach to creating a speech utterance corresponding to a given text string is to literally splice together prerecorded words corresponding to the desired utterance. For greatest simplicity, the words can be stored as sampled waveforms and simply concatenated in the correct sequence. This approach generally produces intelligible but unnatural-sounding speech, since it does not take into account the "coarticulation" effects of producing phonemes in continuous speech, and it does not provide for either the adjustment of phoneme durations or the imposition of a desired pitch variation across the utterance. Words spoken in continuous speech sentences are generally much shorter in duration than when spoken in isolation (often up to 50% shorter), as illustrated in Figure 8.5. This figure shows wideband spectrograms of the sentence "This shirt is red" spoken as a sequence of isolated words (with short, distinct pauses between words), as shown at the top of Figure 8.5, and as a continuous utterance, as shown at the bottom of Figure 8.5. It can be seen that, even for this trivial example, the duration of the continuous sentence is on the order of half that of the isolated word version, and further, the formant tracks of the continuous sentence do not look like a set of uniformly compressed formant tracks from the individual words. Furthermore, the boundaries between words in the upper plot are sharply defined, while they are merged in the lower plot.

Fig. 8.5 Wideband spectrograms of a sentence spoken as a sequence of isolated words (top panel) and as a continuous speech utterance (bottom panel).
The word concatenation approach can be made more sophisticated by storing the vocabulary words in a parametric form (formants, LPC parameters, etc.) such as employed in the speech coders discussed in Chapter 7 [102]. The rationale for this is that the parametric representations, being more closely related to the model for speech production, can be manipulated so as to blend the words together, shorten them, and impose a desired pitch variation. This requires that the control parameters for all the words in the task vocabulary (as obtained from a training set of words) be stored as representations of the words. A special set of word concatenation rules is then used to create the control signals for the synthesizer. Although such a synthesis system would appear to be an attractive alternative for general-purpose synthesis of speech, in reality this type of synthesis is not a practical approach. (It should be obvious that whole-word concatenation synthesis from stored waveforms is also impractical for general-purpose synthesis of speech.)
There are many reasons for this, but a major problem is that there are far too many words to store in a word catalog for word concatenation synthesis to be practical, except in highly restricted situations. For example, there are about 1.7 million distinct surnames in the United States, and each of them would have to be spoken and stored for a general word concatenation synthesis method. A second, equally important limitation is that word-length segments of speech are simply the wrong subunits. As we will see, shorter units such as phonemes or diphones are more suitable for synthesis.
Efforts to overcome the limitations of word concatenation followed two separate paths. One approach was based on controlling the motions of a physical model of the speech articulators based on the sequence of phonemes from the text analysis. This requires sophisticated control
rules that are mostly derived empirically. From the vocal tract shapes and sources produced by the articulatory model, control parameters (e.g., formants and pitch) can be derived by applying the acoustic theory of speech production and then used to control a synthesizer such as those used for speech coding. An alternative approach eschews the articulatory model and proceeds directly to the computation of the control signals for a source/system model (e.g., formant parameters, LPC parameters, pitch period, etc.). Again, the rules (algorithms) for computing the control parameters are mostly derived by empirical means.
8.2.3 Articulatory Methods of Synthesis
Articulatory models of human speech are based upon the application of acoustic theory to physical models such as that depicted in highly stylized form in Figure 2.1 [22]. Such models were thought to be inherently more natural than vocal tract analog models since:

• we could impose known and fairly well understood physical constraints on articulator movements to create realistic motions of the tongue, jaw, teeth, velum, etc.;
• we could use X-ray data (MRI data today) to study the motion of the articulators in the production of individual speech sounds, thereby increasing our understanding of the dynamics of speech production;
• we could model smooth articulatory parameter motions between sounds, either via direct methods (namely, solving the wave equation) or indirectly by converting articulatory shapes to formants or LPC parameters;
• we could highly constrain the motions of the articulatory parameters so that only natural motions would occur, thereby potentially making the speech more natural sounding.
What we have learned about articulatory modeling of speech is that it requires a highly accurate model of the vocal cords and of the vocal tract for the resulting synthetic speech quality to be considered acceptable. It further requires rules for handling the dynamics of the articulator motion in the context of the sounds being produced. So far we have been unable to learn all such rules, and thus articulatory speech synthesis methods have not been found to be practical for synthesizing speech of acceptable quality.
8.2.4 Terminal Analog Synthesis of Speech
The alternative to articulatory synthesis is called terminal analog speech synthesis. In this approach, each sound of the language (phoneme) is characterized by a source excitation function and an ideal vocal tract model. Speech is produced by varying (in time) the excitation and the vocal tract model control parameters at a rate commensurate with the sounds being produced. This synthesis process has been called terminal analog synthesis because it is based on a model (analog) of the human vocal tract production of speech that seeks to produce a signal at its output terminals that is equivalent to the signal produced by a human talker. (In the designation "terminal analog synthesis," the terms analog and terminal result from the historical context of early speech synthesis studies. This can be confusing since today "analog" implies "not digital" as well as an analogous thing; "terminal" originally implied the "output terminals" of an electronic analog, i.e., not digital, circuit or system that was an analog of the human speech production system.)
The basis for terminal analog synthesis of speech is the source/system model of speech production that we have used many times in this text; namely, an excitation source, e[n], and a transfer function of the human vocal tract in the form of a rational system function, i.e.,

H(z) = \frac{S(z)}{E(z)} = \frac{B(z)}{A(z)} = \frac{b_0 + \sum_{k=1}^{q} b_k z^{-k}}{1 - \sum_{k=1}^{p} a_k z^{-k}},   (8.1)

where S(z) is the z-transform of the output speech signal, s[n], E(z) is the z-transform of the vocal tract excitation signal, e[n], and {b_k} = {b_0, b_1, b_2, ..., b_q} and {a_k} = {a_1, a_2, ..., a_p} are the (time-varying) coefficients of the vocal tract filter. (See Figures 2.2 and 4.1.)
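Equation (8.1) maps directly onto a difference equation, which a synthesizer evaluates sample by sample. A minimal sketch follows; the coefficient values are illustrative, and note the sign convention implied by the denominator 1 − Σ a_k z^{-k}:

```python
import numpy as np

def vocal_tract_filter(e, b, a):
    """Evaluate s[n] = sum_k b[k] e[n-k] + sum_k a[k] s[n-k], the
    difference equation corresponding to H(z) = B(z)/A(z) in (8.1).
    b = [b0, ..., bq]; a = [a1, ..., ap]."""
    s = np.zeros(len(e))
    for n in range(len(e)):
        acc = sum(b[k] * e[n - k] for k in range(len(b)) if n - k >= 0)
        acc += sum(a[k - 1] * s[n - k]
                   for k in range(1, len(a) + 1) if n - k >= 0)
        s[n] = acc
    return s

# Single-resonance example: all-pole filter with poles inside |z| = 1.
b = [1.0]
a = [1.2, -0.72]            # denominator 1 - 1.2 z^-1 + 0.72 z^-2
impulse = np.zeros(32)
impulse[0] = 1.0
h = vocal_tract_filter(impulse, b, a)   # impulse response of H(z)
```

In a real synthesizer the coefficients are updated frame by frame as the phoneme sequence unfolds, rather than held fixed as here.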
For most practical speech synthesis systems, both the excitation signal properties (P_0 and V/UV) and the filter coefficients of (8.1) change periodically so as to synthesize different phonemes.

The vocal tract representation of (8.1) can be implemented as a speech synthesis system using a direct form implementation. However, it has been shown that it is preferable to factor the numerator and denominator polynomials into either a series of cascade (serial) resonances or a parallel combination of resonances. It has also been shown that an all-pole model (B(z) = constant) is most appropriate for voiced (non-nasal) speech. For unvoiced speech, a simpler model (with one complex pole and one complex zero), implemented via a parallel branch, is adequate. Finally, a fixed spectral compensation network, based on two real poles in the z-plane, can be used to model the combined effects of glottal pulse shape and radiation characteristics from the lips and mouth. A complete serial terminal analog speech synthesis model, based on the above discussion, is shown in Figure 8.6 [95, 111].

The voiced speech (upper) branch includes an impulse generator (controlled by a time-varying pitch period, P_0), a time-varying voiced signal gain, A_V, and an all-pole discrete-time system consisting of a cascade of three time-varying resonances (the first three formants, F_1, F_2, F_3) and one fixed resonance, F_4.
Fig. 8.6 Speech synthesizer based on a cascade/serial (formant) synthesis model.
The unvoiced speech (lower) branch includes a white noise generator, a time-varying unvoiced signal gain, A_N, and a resonance/antiresonance system consisting of a time-varying pole (F_P) and a time-varying zero (F_Z).
The voiced and unvoiced components are added and processed by the fixed spectral compensation network to provide the final synthetic speech output.
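The cascade of formant resonances in Figure 8.6 can be sketched as a chain of second-order digital resonators. The formant frequencies and bandwidths below are illustrative values, not calibrated to any particular phoneme.

```python
import numpy as np

def resonator_coeffs(f, bw, fs):
    """Second-order digital resonator for a formant at frequency f (Hz)
    with bandwidth bw (Hz): pole radius exp(-pi*bw/fs), pole angle
    2*pi*f/fs, numerator scaled for unity gain at DC."""
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * f / fs
    a1, a2 = 2 * r * np.cos(theta), -r * r
    b0 = 1 - a1 - a2          # unity DC gain
    return b0, a1, a2

def cascade_formant_synth(excitation, formants, bandwidths, fs):
    """Pass the excitation through a cascade of formant resonators,
    as in the voiced branch of Figure 8.6."""
    s = np.asarray(excitation, dtype=float)
    for f, bw in zip(formants, bandwidths):
        b0, a1, a2 = resonator_coeffs(f, bw, fs)
        y = np.zeros_like(s)
        for n in range(len(s)):
            y[n] = b0 * s[n]
            if n >= 1:
                y[n] += a1 * y[n - 1]
            if n >= 2:
                y[n] += a2 * y[n - 2]
        s = y                 # output of one resonator feeds the next
    return s

fs = 8000
pulses = np.zeros(400)
pulses[::100] = 1.0           # 80 Hz pitch at fs = 8000 Hz
out = cascade_formant_synth(pulses, formants=[500, 1500, 2500, 3500],
                            bandwidths=[60, 90, 120, 150], fs=fs)
```

In a full synthesizer the formant values, gains, and pitch period would be updated every frame by the control rules, and the unvoiced branch and spectral compensation would be added around this core.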
The resulting quality of the synthetic speech produced using a terminal analog synthesizer of the type shown in Figure 8.6 is highly variable, with explicit model shortcomings due to the following:

• voiced fricatives are not handled properly, since their mixed excitation is not part of the model of Figure 8.6;
• nasal sounds are not handled properly, since nasal zeros are not included in the model;
• stop consonants are not handled properly, since there is no precise timing and control of the complex excitation signal;
• use of a fixed pitch pulse shape, independent of the pitch period, is inadequate and produces buzzy-sounding voiced speech;
• the spectral compensation model is inaccurate and does not work well for unvoiced sounds.
Many of the shortcomings of the model of Figure 8.6 are alleviated by a more complex model proposed by Klatt [64]. However, even with a more sophisticated synthesis model, it remains a very challenging task to compute the synthesis parameters. Nevertheless, Klatt's Klattalk system achieved adequate quality by 1983 to justify commercialization by the Digital Equipment Corporation as the DECtalk system. Some DECtalk systems are still in operation today as legacy systems, although the unit selection methods to be discussed in Section 8.3 now provide superior quality in current applications.
8.3 Unit Selection Methods
The key idea of a concatenative TTS system using unit selection methods is to use synthesis segments that are sections of prerecorded natural speech [31, 50, 108]. The word concatenation method discussed in Section 8.2.2 is perhaps the simplest embodiment of this idea; however, as we discussed, shorter segments are required to achieve better quality synthesis. The basic idea is that the more segments recorded, annotated, and saved in the database, the better the potential quality of the resulting speech synthesis. Ultimately, if an infinite number of segments were recorded and saved, the resulting synthetic speech would sound natural for virtually all possible synthesis tasks. Concatenative speech synthesis systems based on unit selection methods are what are conventionally known as "data-driven" approaches, since their performance tends to get better the more data is used for training the system and selecting appropriate units.
In order to design and build a unit selection system based on recorded speech segments, several issues have to be resolved, including the following:

(1) What speech units should be used as the basic synthesis building blocks?
(2) How are the synthesis units selected (extracted) from natural speech utterances?
(3) How are the units labeled for retrieval from a large database of units?
(4) What signal representation should be used to represent the units for storage and reproduction purposes?
(5) What signal processing methods can be used to spectrally smooth the units (at unit junctures) and for prosody modification (pitch, duration, amplitude)?

We now attempt to answer each of these questions.
8.3.1 Choice of Concatenation Units
The units for unit selection can (in theory) be as large as words and as small as phonemes. Words are prohibitive, since there are essentially an infinite number of words in English. Subword units include syllables (about 10,000 in English), phonemes (about 45 in English, but highly context dependent), demisyllables (about 2500 in English), and diphones (about 1500–2500 in English). The ideal synthesis unit is context independent and easily concatenates with other (appropriate) subword units [15]. Based on this criterion, the most reasonable choice for unit selection synthesis is the set of diphone units.
Before any synthesis can be done, it is necessary to prepare an inventory of units (diphones). This requires significant effort if done manually. In any case, a large corpus of speech must be obtained from which to extract the diphone units as waveform snippets. These units are then represented in some compressed form for efficient storage; high-quality coding such as MPLP or CELP is used to limit compression artifacts. At the final synthesis stage, the diphones are decoded into waveforms for final merging, duration adjustment, and pitch modification, all of which take place on the time waveform.
8.3.2 From Text to Diphones
The synthesis procedure from subword units is straightforward, but far from trivial [14]. Following the analysis of the text into phonemes and prosody, the phoneme sequence is first converted to the appropriate sequence of units from the inventory. For example, the phrase “I want” would be converted to diphone units as follows:

Text Input: I want.
Phonemes: /#/ /AY/ /W/ /AA/ /N/ /T/ /#/
Diphones: /#-AY/ /AY-W/ /W-AA/ /AA-N/ /N-T/ /T-#/,

where the symbol # represents silence (at the beginning and end of each sentence or phrase).
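The phoneme-to-diphone conversion above is a simple pairwise pass over the phoneme string. A minimal sketch (the function name is illustrative; the # silence symbol and the dash notation follow the example and footnote in the text):

```python
def to_diphones(phonemes):
    """Convert a phoneme sequence (with # marking silence) into
    the corresponding sequence of diphone names."""
    # Each diphone joins two sequentially contiguous phonemes.
    return [f"{a}-{b}" for a, b in zip(phonemes, phonemes[1:])]
```

For the phrase “I want”, `to_diphones(["#", "AY", "W", "AA", "N", "T", "#"])` yields the six diphones of the example above.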
The second step in online synthesis is to select the most appropriate sequence of diphone units from the stored inventory. Since each diphone unit occurs many times in the stored inventory, the selection of the best sequence of diphone units involves solving a dynamic programming search for the sequence of units that minimizes a specified cost function. The cost function is generally based on diphone matches at each of the boundaries between diphones, where the diphone match is defined in terms of spectral matching characteristics, pitch matching characteristics, and possibly phase matching characteristics.

^3 A diphone is simply the concatenation of two phonemes that are allowed by the language constraints to be sequentially contiguous in natural speech. For example, a diphone based upon the phonemes AY and W would be denoted AY-W, where the dash denotes the joining of the two phonemes.
8.3.3 Unit Selection Synthesis
The “unit selection” problem is basically one of having a given set of target features corresponding to the spoken text, and then automatically finding the sequence of units in the database that most closely matches these features. This problem is illustrated in Figure 8.7, which shows a target feature set corresponding to the sequence of sounds (phonemes for this simple example) /HH/ /EH/ /L/ /OW/ from the word “hello”.^4 As shown, each of the phonemes in this word has multiple representations (units) in the inventory of sounds, having been extracted from different phonemic environments. Hence there are many versions of /HH/, many versions of /EH/, etc., as illustrated in Figure 8.7. The task of the unit selection module is to choose one of each of the multiple representations of the sounds, with the goal of minimizing the total perceptual distance between segments of the chosen sequence, based on spectral, pitch, and phase differences throughout the sounds and especially at the boundaries between sounds. By specifying costs associated with each unit, both globally across the unit and locally at the unit boundaries with adjacent units, we can find the sequence of units that best “join each other” in the sense of minimizing the accumulated distance across the sequence of units.

Fig. 8.7 Illustration of the basic process of unit selection. (After Dutoit [31].)

^4 If diphones were used, the sequence would be /HH-EH/ /EH-L/ /L-OW/.

Fig. 8.8 Online unit selection based on a Viterbi search through a lattice of alternatives.

This
optimal sequence (or equivalently, the optimal path through the combination of all possible versions of each unit in the string) can be found using a Viterbi search (dynamic programming) [38, 99]. This dynamic programming search process is illustrated in Figure 8.8 for a 3-unit search. The Viterbi search effectively computes the cost of every possible path through the lattice and determines the path with the lowest total cost, where the transitional costs (the arcs) reflect the cost of concatenating a pair of units based on acoustic distances, and the nodes represent the target costs based on the linguistic identity of the unit.
Thus there are two costs associated with the Viterbi search: a nodal cost based on the unit segmental distortion (USD), which is defined as the difference between the desired spectral pattern of the target (suitably defined) and that of the candidate unit throughout the unit; and a transitional cost based on the unit concatenative distortion (UCD), which is defined as the spectral (and/or pitch and/or phase) discontinuity across the boundaries of the concatenated units. By way of example, consider a target context of the word “cart” with target phonemes /K/ /AH/ /R/ /T/, and a source context of the phoneme /AH/ obtained from the source word “want” with source phonemes /W/ /AH/ /N/ /T/. The USD distance would be the cost (specified analytically) between the sound /AH/ in the context /W/ /AH/ /N/ and the desired sound in the context /K/ /AH/ /R/.

Fig. 8.9 Illustration of unit selection costs associated with a string of target units and a presumptive string of selected units.
Figure 8.9 illustrates concatenative synthesis for a given string of
target units (at the bottom of the ﬁgure) and a string of selected units
from the unit inventory (shown at the top of the figure). We focus our attention on target unit t_j and selected units θ_j and θ_{j+1}. Associated with the match between units t_j and θ_j is a USD unit cost, and associated with the sequence of selected units θ_j and θ_{j+1} is a UCD concatenation cost. The total cost of an arbitrary string of N selected units, Θ = {θ_1, θ_2, ..., θ_N}, and the string of N target units, T = {t_1, t_2, ..., t_N}, is defined as:

d(Θ, T) = \sum_{j=1}^{N} d_u(θ_j, t_j) + \sum_{j=1}^{N−1} d_t(θ_j, θ_{j+1}),   (8.2)

where d_u(θ_j, t_j) is the USD cost associated with matching target unit t_j with selected unit θ_j, and d_t(θ_j, θ_{j+1}) is the UCD cost associated with concatenating the units θ_j and θ_{j+1}. It should be noted that when selected units θ_j and θ_{j+1} come from the same source sentence and are adjacent units, the UCD cost goes to zero, as this is as natural a concatenation as can exist in the database. Further, there is generally a small overlap region between concatenated units. The optimal path (corresponding to the optimal sequence of units) can be efficiently computed using a standard Viterbi search [38, 99], in which the computational demand scales linearly with both the number of target units and the number of concatenation units.
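The dynamic programming recursion behind (8.2) can be sketched in a few lines. This is an illustrative implementation, not the authors' code: `candidates`, `d_u`, and `d_t` are hypothetical stand-ins for the unit lattice, the USD target cost, and the UCD concatenation cost.

```python
def select_units(candidates, d_u, d_t):
    """Viterbi (dynamic programming) search over a unit lattice.

    candidates[j]: list of candidate units for target position j.
    d_u(unit, j):  target (USD) cost of a unit at position j.
    d_t(u, v):     concatenation (UCD) cost of joining u then v.
    Returns the minimum-cost unit sequence and its total cost (Eq. 8.2).
    """
    N = len(candidates)
    # cost[k]: best accumulated cost of any path ending at candidates[j][k]
    cost = [d_u(u, 0) for u in candidates[0]]
    back = [[None] * len(candidates[0])]
    for j in range(1, N):
        new_cost, new_back = [], []
        for u in candidates[j]:
            trans = [cost[k] + d_t(candidates[j - 1][k], u)
                     for k in range(len(candidates[j - 1]))]
            k_best = min(range(len(trans)), key=trans.__getitem__)
            new_cost.append(trans[k_best] + d_u(u, j))
            new_back.append(k_best)
        cost = new_cost
        back.append(new_back)
    # Trace back the lowest-cost path through the lattice.
    k = min(range(len(cost)), key=cost.__getitem__)
    total = cost[k]
    path = [k]
    for j in range(N - 1, 0, -1):
        k = back[j][k]
        path.append(k)
    path.reverse()
    return [candidates[j][path[j]] for j in range(N)], total
```

The linear scaling noted above follows directly: the inner loops visit each lattice arc exactly once.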
The concatenation cost between two units is essentially the spectral (and/or phase and/or pitch) discontinuity across the boundary between the units, and is defined as:

d_t(θ_j, θ_{j+1}) = \sum_{k=1}^{p+2} w_k C_k(θ_j, θ_{j+1}),   (8.3)

where p is the size of the spectral feature vector (typically p = 12 mel-frequency cepstral coefficients, as explained in Section 5.6.3, often represented as a VQ codebook vector), and the extra two features are log power and pitch. The weights, w_k, are chosen during the unit selection inventory creation phase and are optimized using a trial-and-error procedure. The concatenation cost essentially measures a spectral plus log energy plus pitch difference between the two concatenated units at the boundary frames. Clearly the definition of concatenation cost could be extended to more than a single boundary frame. Also, as stated earlier, the concatenation cost is defined to be zero (d_t(θ_j, θ_{j+1}) = 0) whenever units θ_j and θ_{j+1} are consecutive in the database since, by definition, there is no discontinuity in either spectrum or pitch in this case. Although there are a variety of choices for measuring the spectral/log energy/pitch discontinuity at the boundary, a common cost function is the normalized mean-squared error in the feature parameters, namely:

C_k(θ_j, θ_{j+1}) = [f_k^{θ_j}(m) − f_k^{θ_{j+1}}(1)]² / σ_k²,   (8.4)

where f_k^{θ_j}(l) is the kth feature parameter of the lth frame of segment θ_j, m is the (normalized) duration of each segment, and σ_k² is the variance of the kth feature vector component.
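The weighted boundary cost of (8.3), with the normalized squared difference of (8.4), amounts to comparing the last frame of one unit with the first frame of the next, feature by feature. A small sketch under the assumption that each unit carries its feature frames as lists of numbers (all names here are illustrative):

```python
def concat_cost(unit_a, unit_b, weights, variances):
    """UCD between two units: weighted, variance-normalized squared
    difference between the last frame of unit_a and the first frame
    of unit_b (cf. Eqs. (8.3) and (8.4))."""
    last_a, first_b = unit_a[-1], unit_b[0]
    return sum(w * (fa - fb) ** 2 / v
               for w, fa, fb, v in zip(weights, last_a, first_b, variances))
```

Per the text, a practical system would additionally force this cost to zero when the two units are consecutive in the source database.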
The USD or target costs are conceptually more difficult to understand and, in practice, more difficult to instantiate. The USD cost associated with units θ_j and t_j is of the form:

d_u(θ_j, t_j) = \sum_{i=1}^{q} w_i^t φ_i(T_i(f_i^{θ_j}), T_i(f_i^{t_j})),   (8.5)

where q is the number of features that specify the unit θ_j or t_j, w_i^t, i = 1, 2, ..., q is a trained set of target weights, and T_i() can be either a continuous function (for a set of features, f_i, such as segmental pitch, power, or duration) or a set of integers (in the case of categorical features, f_i, such as unit identity, phonetic class, position in the syllable from which the unit was extracted, etc.). In the latter case, φ_i can be looked up in a table of distances. Otherwise, the local distance function can be expressed as a simple quadratic distance of the form:

φ_i(T_i(f_i^{θ_j}), T_i(f_i^{t_j})) = [T_i(f_i^{θ_j}) − T_i(f_i^{t_j})]².   (8.6)

The training of the weights, w_i^t, is done offline. For each phoneme in each phonetic class in the training speech database (which might be the entire recorded inventory), each exemplar of each unit is treated as a target and all others are treated as candidate units. Using this training set, a least-squares system of linear equations can be derived from which the weight vector can be solved. The details of the weight training methods are described by Schroeter [114].
The final step in unit selection synthesis, having chosen the optimal sequence of units to match the target sentence, is to smooth/modify the selected units at each of the boundary frames to better match the spectra, pitch, and phase at each unit boundary. Various smoothing methods based on the concepts of time-domain harmonic scaling (TDHS) [76] and pitch-synchronous overlap add (PSOLA) [20, 31, 82] have been proposed and optimized for such smoothing/modification at the boundaries between adjacent diphone units.
8.4 TTS Applications
Speech technology serves as a way of intelligently and efficiently enabling humans to interact with machines, with the goal of bringing down the cost of service for an existing service capability, or of providing new products and services that would be prohibitively expensive without the automation provided by a viable speech processing interactive system. Examples of existing services where TTS enables significant cost reduction are the following:

• a dialog component for customer care applications
• a means for delivering text messages over an audio connection (e.g., a cellphone)
• a means of replacing expensive recorded Interactive Voice Response system prompts (a service which is particularly valuable when the prompts change often during the course of the day, e.g., stock price quotations).
Similarly, some examples of new products and services enabled by a viable TTS technology are the following:

• location-based services, e.g., alerting you to restaurants, gas stations, and stores in the vicinity of your current location
• providing information in cars (e.g., driving directions, traffic reports)
• unified messaging (e.g., reading email and fax messages aloud)
• voice portals providing voice access to web-based services
• e-commerce agents
• customized news, stock reports, sports scores, etc.
• giving voice to small and embedded devices for reporting information and alerts.
8.5 TTS Future Needs
Modern TTS systems are capable of producing highly intelligible, and surprisingly natural, speech utterances, so long as the utterance is not too long, too syntactically complicated, or too technical. The biggest problem with most TTS systems is that they have no idea how things should be said, relying instead on the text analysis for emphasis, prosody, phrasing, and all the so-called suprasegmental features of the spoken utterance. The more TTS systems learn to produce context-sensitive pronunciations of words (and phrases), the more natural sounding these systems will become. By way of example,
the utterance “I gave the book to John” has at least three different semantic interpretations, each with a different emphasis on words in the utterance, i.e.,

I gave the book to *John*, i.e., not to Mary or Bob.
I gave the *book* to John, i.e., not the photos or the apple.
*I* gave the book to John, i.e., I did it, not someone else.
The second future need of TTS systems is improvement of the unit selection process so as to better capture the target cost of the mismatch between the predicted unit specification (i.e., phoneme name, duration, pitch, spectral properties) and the actual features of a candidate recorded unit. Also needed in the unit selection process is a better spectral distance measure that incorporates measures of human perception so as to find the best sequence of units for a given utterance.

Finally, better signal processing would enable improved compression of the units database, thereby making the footprint of TTS systems small enough to be usable in handheld and mobile devices.
9
Automatic Speech Recognition (ASR)
In this chapter, we examine the process of speech recognition by machine, which is in essence the inverse of the text-to-speech problem. The driving factor behind research in machine recognition of speech has been the potentially huge payoff of providing services where humans interact solely with machines, thereby eliminating the cost of live agents and significantly reducing the cost of providing services. Interestingly, as a side benefit, this process often provides users with a natural and convenient way of accessing information and services.
9.1 The Problem of Automatic Speech Recognition
The goal of an ASR system is to accurately and eﬃciently convert
a speech signal into a text message transcription of the spoken words,
independent of the device used to record the speech (i.e., the transducer
or microphone), the speaker’s accent, or the acoustic environment in
which the speaker is located (e.g., quiet oﬃce, noisy room, outdoors).
That is, the ultimate goal, which has not yet been achieved, is to perform as well as a human listener.
Fig. 9.1 Conceptual model of speech production and speech recognition processes.
A simple conceptual model of the speech generation and speech
recognition processes is given in Figure 9.1, which is a simpliﬁed version
of the speech chain shown in Figure 1.2. It is assumed that the speaker
intends to express some thought as part of a process of conversing
with another human or with a machine. To express that thought, the
speaker must compose a linguistically meaningful sentence, W, in the
form of a sequence of words (possibly with pauses and other acoustic
events such as uh’s, um’s, er’s etc.). Once the words are chosen, the
speaker sends appropriate control signals to the articulatory speech
organs which form a speech utterance whose sounds are those required
to speak the desired sentence, resulting in the speech waveform s[n]. We
refer to the process of creating the speech waveform from the speaker’s
intention as the Speaker Model since it reﬂects the speaker’s accent and
choice of words to express a given thought or request. The processing
steps of the Speech Recognizer are shown at the right side of Figure 9.1
and consist of an acoustic processor which analyzes the speech signal
and converts it into a set of acoustic (spectral, temporal) features, X,
which efficiently characterize the speech sounds, followed by a linguistic decoding process which makes a best (maximum likelihood) estimate of the words of the spoken sentence, resulting in the recognized sentence Ŵ. This is in essence a digital simulation of the lower part of the speech chain diagram in Figure 1.2.
Figure 9.2 shows a more detailed block diagram of the overall speech recognition system. The input speech signal, s[n], is converted to the sequence of feature vectors, X = {x_1, x_2, ..., x_T}, by the feature analysis block (also denoted spectral analysis). The feature vectors are computed on a frame-by-frame basis using the techniques discussed in the earlier chapters. In particular, the mel-frequency cepstrum coefficients are widely used to represent the short-time spectral characteristics.

Fig. 9.2 Block diagram of an overall speech recognition system.

The pattern classification block (also denoted the decoding and search block) decodes the sequence of feature vectors into a symbolic representation that is the maximum likelihood string, Ŵ, that could have produced the input sequence of feature vectors. The pattern recognition system uses a set of acoustic models (represented as hidden Markov models) and a word lexicon to provide the acoustic match score for each proposed string. In addition, an N-gram language model is used to compute a language model score for each proposed word string. The final block in the process is a confidence scoring process (also denoted an utterance verification block), which provides a confidence score for each individual word in the recognized string. Each of the operations in Figure 9.2 involves many details and, in some cases, extensive digital computation. The remainder of this chapter attempts to give the flavor of what is involved in each part of Figure 9.2.
9.2 Building a Speech Recognition System
The steps in building and evaluating a speech recognition system are the following:

(1) choose the feature set and the associated signal processing for representing the properties of the speech signal over time;
(2) choose the recognition task, including the recognition word vocabulary (the lexicon), the basic speech sounds to represent the vocabulary (the speech units), the task syntax or language model, and the task semantics (if any);
(3) train the set of acoustic and language models;
(4) evaluate the performance of the resulting speech recognition system.

Each of these steps may involve many choices and significant research and development effort. Some of the important issues are summarized in this section.
9.2.1 Recognition Feature Set
There is no “standard” set of features for speech recognition. Instead, various combinations of acoustic, articulatory, and auditory features have been utilized in a range of speech recognition systems. The most popular acoustic features have been the (LPC-derived) mel-frequency cepstrum coefficients and their derivatives.
A block diagram of the signal processing used in most modern large vocabulary speech recognition systems is shown in Figure 9.3. The analog speech signal is sampled and quantized at rates between 8000 and 20,000 samples/s. A first-order (highpass) preemphasis network, (1 − αz^{−1}), is used to compensate for the speech spectral falloff at higher frequencies and approximates the inverse of the mouth transmission frequency response. The preemphasized signal is next blocked into frames of N samples, with adjacent frames spaced M samples apart. Typical values for N and M correspond to frames of duration 15–40 ms, with frame shifts of 10 ms being most common; hence adjacent frames overlap by 5–30 ms depending on the chosen values of N and M. A Hamming window is applied to each frame prior to spectral analysis using either standard spectral analysis or LPC methods. Following (optional) simple noise removal, the spectral coefficients are normalized and converted to mel-frequency cepstral coefficients via standard analysis methods of the type discussed in Chapter 5 [27]. Some type of cepstral bias removal is often used prior to calculation of the first- and second-order cepstral derivatives.
Fig. 9.3 Block diagram of the feature extraction process for a feature vector consisting of mfcc coefficients and their first and second derivatives.

Typically the resulting feature vector is the set of cepstral coefficients together with their first- and second-order derivatives. It is typical to use about 13 mfcc coefficients, 13 first-order cepstral derivative coefficients, and 13 second-order cepstral derivative coefficients, making a feature vector of size D = 39.
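The Δ and ΔΔ coefficients appended to the static mfcc vector are, in the simplest form, local time differences of the cepstral trajectories. A sketch of the 39-dimensional stacking under that simple central-difference assumption (production systems typically use a longer regression window instead):

```python
def delta(frames):
    """First-order time derivative of feature trajectories, using a
    central difference with edge replication."""
    out = []
    for t in range(len(frames)):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, len(frames) - 1)]
        out.append([(n - p) / 2.0 for p, n in zip(prev, nxt)])
    return out

def feature_vectors(mfcc):
    """Stack 13 static mfccs with their delta and delta-delta
    coefficients into 39-dimensional vectors, one per frame."""
    d1 = delta(mfcc)
    d2 = delta(d1)
    return [c + a + b for c, a, b in zip(mfcc, d1, d2)]
```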
9.2.2 The Recognition Task
Recognition tasks vary from simple word and phrase recognition systems to large vocabulary conversational interfaces to machines. For example, using a digits vocabulary, the task could be recognition of a string of digits that forms a telephone number or an identification code, or of a highly constrained sequence of digits that forms a password.
9.2.3 Recognition Training
There are two aspects to training models for speech recognition, namely
acoustic model training and language model training. Acoustic model training requires recording each of the model units (whole words, phonemes) in as many contexts as possible so that the statistical learning method can create accurate distributions for each of the model states. Acoustic training relies on accurately labeled sequences of speech utterances that are segmented according to the transcription; training of the acoustic models thus first involves segmenting the spoken strings into recognition model units (via either a Baum–Welch [10, 11] or Viterbi alignment method), and then using the segmented utterances to simultaneously build model distributions for each state of the vocabulary unit models. The resulting statistical models form the basis for the pattern recognition operations at the heart of the ASR system. As discussed in Section 9.3, concepts such as the Viterbi search are employed in the pattern recognition process as well as in training.
Language model training requires a sequence of text strings that
reﬂect the syntax of spoken utterances for the task at hand. Generally
such text training sets are created automatically (based on a model of
grammar for the recognition task) or by using existing text sources,
such as magazine and newspaper articles, or closed-caption transcripts of television news broadcasts. Other times, training sets for language models can be created from databases; e.g., valid strings of telephone numbers can be created from existing telephone directories.
9.2.4 Testing and Performance Evaluation
In order to improve the performance of any speech recognition system,
there must be a reliable and statistically signiﬁcant way of evaluating
recognition system performance based on an independent test set of
labeled utterances. Typically we use word error rate and sentence (or task) error rate as measures of recognizer performance. A brief
summary of performance evaluations across a range of ASR applica
tions is given in Section 9.4.
9.3 The Decision Processes in ASR
The heart of any automatic speech recognition system is the pattern
classiﬁcation and decision operations. In this section, we shall give a
brief introduction to these important topics.
9.3.1 Mathematical Formulation of the ASR Problem
The problem of automatic speech recognition is represented as a statistical decision problem. Specifically, it is formulated as a Bayes maximum a posteriori probability (MAP) decision process in which we seek the word string Ŵ (in the task language) that maximizes the a posteriori probability P(W|X) of that string, given the measured feature vector, X, i.e.,

Ŵ = argmax_W P(W|X).   (9.1)

Using Bayes' rule we can rewrite (9.1) in the form:

Ŵ = argmax_W [P(X|W)P(W)] / P(X).   (9.2)

Equation (9.2) shows that the calculation of the a posteriori probability is decomposed into two terms, one that defines the a priori probability of the word sequence W, namely P(W), and one that defines the likelihood that the word string W produced the feature vector X, namely P(X|W). For all subsequent calculations we disregard the denominator term, P(X), since it is independent of the word sequence W being optimized. The term P(X|W) is known as the “acoustic model” and is generally denoted P_A(X|W) to emphasize the acoustic nature of this term. The term P(W) is known as the “language model” and is generally denoted P_L(W) to emphasize its linguistic nature. The probabilities associated with P_A(X|W) and P_L(W) are estimated or learned from a set of training data that have been labeled by a knowledge source, usually a human expert, where the training set is as large as reasonably possible. The recognition decoding process of (9.2) is often written as a 3-step process:

Ŵ = argmax_W P_A(X|W) P_L(W),   (9.3)

where Step 1 is the computation of the probability associated with the acoustic model of the speech sounds in the sentence W, Step 2 is the computation of the probability associated with the linguistic model of the words in the utterance, and Step 3 is the search through all valid sentences in the task language for the maximum likelihood sentence.
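For a small, explicitly enumerated task language, the three steps of (9.3) reduce to scoring each candidate string and taking the argmax. A toy sketch in log-probability form (the score functions here are hypothetical placeholders for the acoustic and language models):

```python
def map_decode(candidates, acoustic_logp, lm_logp):
    """Step 3 of Eq. (9.3): search the candidate word strings for the
    one maximizing log P_A(X|W) + log P_L(W)."""
    return max(candidates, key=lambda W: acoustic_logp(W) + lm_logp(W))
```

Real recognizers cannot enumerate all sentences, of course; the search is organized as a Viterbi decoding over the composed acoustic and language models, as described in the remainder of this chapter.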
In order to be more explicit about the signal processing and computations associated with each of the three steps of (9.3), we need to be more explicit about the relationship between the feature vector, X, and the word sequence, W. As discussed above, the feature vector X is a sequence of acoustic observations corresponding to each of T frames of the speech, of the form:

X = {x_1, x_2, ..., x_T},   (9.4)

where the speech signal duration is T frames (i.e., T times the frame shift in ms) and each frame, x_t, t = 1, 2, ..., T, is an acoustic feature vector of the form:

x_t = (x_{t1}, x_{t2}, ..., x_{tD})   (9.5)

that characterizes the spectral/temporal properties of the speech signal at time t, where D is the number of acoustic features in each frame. Similarly, we can express the optimally decoded word sequence, W, as:

W = w_1, w_2, ..., w_M,   (9.6)

where there are assumed to be exactly M words in the decoded string.
9.3.2 The Hidden Markov Model
The most widely used method of building acoustic models (for both phonemes and words) is a statistical characterization known as the hidden Markov model (HMM) [33, 69, 97, 98]. Figure 9.4 shows a simple Q = 5-state HMM for modeling a whole word. Each HMM state is characterized by a mixture density Gaussian distribution that characterizes the statistical behavior of the feature vectors within that state [61, 62]. In addition to the statistical feature densities within states, the HMM is also characterized by an explicit set of state transition probabilities, a_{ij}, which specify the probability of making a transition from state i to state j at each frame, thereby defining the time sequence of the feature vectors over the duration of the word. Usually the self-transitions, a_{ii}, are large (close to 1.0), and the jump transitions, a_{12}, a_{23}, a_{34}, a_{45}, in the model are small (close to 0).
Fig. 9.4 Word-based, left-to-right, HMM with 5 states.
The complete HMM characterization of a Q-state word model (or a subword unit such as a phoneme model) is generally written as λ(A, B, π), with state transition matrix A = {a_{ij}, 1 ≤ i, j ≤ Q}, state observation probability density B = {b_j(x_t), 1 ≤ j ≤ Q}, and initial state distribution π = {π_i, 1 ≤ i ≤ Q}, with π_1 set to 1 for the “left-to-right” models of the type shown in Figure 9.4.
In order to train the HMM (i.e., learn the optimal model parameters) for each word (or subword) unit, a labeled training set of sentences (transcribed into words and subword units) is used to guide an efficient training procedure known as the Baum–Welch algorithm [10, 11].^1 This algorithm aligns each of the various words (or subword units) with the spoken inputs and then estimates the appropriate means, covariances, and mixture gains for the distributions in each model state. The Baum–Welch method is a hill-climbing algorithm and is iterated until a stable alignment of models and speech is obtained. The details of the Baum–Welch procedure are beyond the scope of this chapter but can be found in several references on speech recognition methods [49, 99]. The heart of the training procedure for reestimating HMM model parameters using the Baum–Welch procedure is shown in Figure 9.5. An initial HMM model is used to begin the training process; the initial model can be randomly chosen or selected based on a priori knowledge of the model parameters.

^1 The Baum–Welch algorithm is also widely referred to as the forward–backward method.

Fig. 9.5 The Baum–Welch training procedure based on a given training set of utterances.

The iteration loop is a simple updating procedure for computing the forward and backward model probabilities based on an input
speech database (the training set of utterances) and then optimizing
the model parameters to give an updated HMM. This process is iterated until no further improvement in probabilities occurs from one iteration to the next.
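The forward probabilities at the core of the Baum–Welch (forward–backward) computation can be sketched for a discrete-observation HMM as follows. The matrices `A` and `pi` follow the λ(A, B, π) notation above, while `b[j][o]` is a simplified, discrete stand-in for the Gaussian mixture state observation densities used in practice:

```python
def forward(A, b, pi, obs):
    """Forward algorithm: total likelihood P(obs | lambda) for a
    discrete-observation HMM lambda = (A, b, pi)."""
    Q = len(pi)
    # alpha[j]: P(o_1..o_t, state j at time t), initialized at t = 1
    alpha = [pi[j] * b[j][obs[0]] for j in range(Q)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(Q)) * b[j][o]
                 for j in range(Q)]
    return sum(alpha)
```

The backward pass is the mirror image of this recursion, and the Baum–Welch update combines the two to reestimate A, B, and π.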
It is a simple matter to go from the HMM for a whole word, as shown in Figure 9.4, to an HMM for a subword unit (such as a phoneme) as shown in Figure 9.6. This simple 3-state HMM is a basic subword unit model with an initial state representing the statistical characteristics at the beginning of a sound, a middle state representing the heart of the sound, and an ending state representing the spectral characteristics at the end of the sound.

Fig. 9.6 Subword-based HMM with 3 states.

Fig. 9.7 Word-based HMM for the word /is/ created by concatenating 3-state subword models for the subword units /ih/ and /z/.

A word model is made by concatenating the appropriate subword HMMs, as illustrated in Figure 9.7, which concatenates the 3-state HMM for the sound /IH/ with the 3-state HMM for the sound /Z/, giving the word model for
a word (from subword units) is speciﬁed in a word lexicon or dictio
nary; however once the word model has been built it can be used much
the same as whole word models for training and for evaluating word
strings for maximizing the likelihood as part of the speech recognition
process.
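Building a word model from subword models is then a dictionary lookup plus concatenation. A schematic sketch (the lexicon, the state-naming scheme, and the function name are illustrative, not from the text):

```python
def word_model_states(word, lexicon, n_states=3):
    """Concatenate n_states-state subword HMMs, in lexicon order,
    into the state sequence of a word-level HMM."""
    states = []
    for phone in lexicon[word]:
        # e.g., IH_1, IH_2, IH_3 for the 3-state model of /IH/
        states.extend(f"{phone}_{k}" for k in range(1, n_states + 1))
    return states
```

With `lexicon = {"is": ["IH", "Z"]}`, `word_model_states("is", lexicon)` gives the six states IH_1 through Z_3 of the word model in Figure 9.7.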
We are now ready to deﬁne the procedure for aligning a sequence
of M word models, w
1
, w
w
, . . . , w
M
with a sequence of feature vectors,
X = ¦x
1
, x
2
, . . . , x
T
¦. The resulting alignment procedure is illustrated
in Figure 9.8. We see the sequence of feature vectors along the horizon
tal axis and the concatenated sequence of word states along the vertical
axis. An optimal alignment procedure determines the exact best match
ing sequence between word model states and feature vectors such that
the ﬁrst feature vector, x
1
, aligns with the ﬁrst state in the ﬁrst word
model, and the last feature vector, x
T
, aligns with the last state in the
Mth word model. (For simplicity we show each word model as a 5state
HMM in Figure 9.8, but clearly the alignment procedure works for any
size model for any word, subject to the constraint that the total number
of feature vectors, T, exceeds the total number of model states, so
that every state has at least a single feature vector associated with that
state.) The procedure for obtaining the best alignment between feature
vectors and model states is based on either using the Baum–Welch
statistical alignment procedure (in which we evaluate the probability of
every alignment path and add them up to determine the probability of
the word string), or a Viterbi alignment procedure [38, 132] for which
we determine the single best alignment path and use the probability
score along that path as the probability measure for the current word
string. The utility of the alignment procedure of Figure 9.8 is based on
the ease of evaluating the probability of any alignment path using the
Baum–Welch or Viterbi procedures.

Fig. 9.8 Alignment of concatenated HMM word models with acoustic feature vectors based
on either a Baum–Welch or Viterbi alignment procedure.
We now return to the mathematical formulation of the ASR problem
and examine in more detail the three steps in the decoding
Equation (9.3).
9.3.3 Step 1 — Acoustic Modeling
The function of the acoustic modeling step (Step 1) is to assign
probabilities to the acoustic realizations of a sequence of words, given the
observed acoustic vectors, i.e., we need to compute the probability that
the acoustic vector sequence X = {x_1, x_2, ..., x_T} came from the word
sequence W = w_1, w_2, ..., w_M (assuming each word is represented as an
HMM) and perform this computation for all possible word sequences.
This calculation can be expressed as:

P_A(X|W) = P_A({x_1, x_2, ..., x_T} | w_1, w_2, ..., w_M).  (9.7)
If we make the assumption that each frame, x_t, is aligned with
word i(t) and HMM model state j(t) via the function w^{i(t)}_{j(t)}, and if we
assume that each frame is independent of every other frame, we can
express (9.7) as the product

P_A(X|W) = ∏_{t=1}^{T} P_A(x_t | w^{i(t)}_{j(t)}),  (9.8)
where we associate each frame of X with a unique word and state,
w^{i(t)}_{j(t)}, in the word sequence. Further, we calculate the local probability
P_A(x_t | w^{i(t)}_{j(t)}) given that we know the word from which frame t came.
The process of assigning individual speech frames to the appropriate
word model in an utterance is based on an optimal alignment process
between the concatenated sequence of word models and the sequence
of feature vectors of the spoken input utterance being recognized. This
alignment process is illustrated in Figure 9.9, which shows the set of
T feature vectors (frames) along the horizontal axis, and the set of M
words (and word model states) along the vertical axis. The optimal
segmentation of these feature vectors (frames) into the M words is
shown by the sequence of boxes, each of which corresponds to one of
the words in the utterance and its set of optimally matching feature
vectors.
We assume that each word model is further decomposed into a set
of states which reflect the changing statistical properties of the feature
vectors over time for the duration of the word. We assume that
each word is represented by an N-state HMM model, and we denote
the states as S_j, j = 1, 2, ..., N. Within each state of each word there
is a probability density that characterizes the statistical properties of
the feature vectors in that state. We have seen in the previous section
that the probability density of each state, and for each word, is learned
during a training phase of the recognizer. Using a mixture of Gaussian
densities to characterize the statistical distribution of the feature
vectors in each state, j, of the word model, which we denote as b_j(x_t),
Fig. 9.9 Illustration of time alignment process between unknown utterance feature vectors
and set of M concatenated word models.

we get a state-based probability density of the form:

b_j(x_t) = Σ_{k=1}^{K} c_{jk} N[x_t, µ_{jk}, U_{jk}],  (9.9)

where K is the number of mixture components in the density function,
c_{jk} is the weight of the kth mixture component in state j, with the
constraint c_{jk} ≥ 0, and N is a Gaussian density function with mean
vector, µ_{jk}, for mixture k in state j, and covariance matrix, U_{jk}, for
mixture k in state j. The density constraints are:

Σ_{k=1}^{K} c_{jk} = 1,  1 ≤ j ≤ N  (9.10)

∫_{−∞}^{∞} b_j(x_t) dx_t = 1,  1 ≤ j ≤ N.  (9.11)
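The mixture density of (9.9) is straightforward to evaluate directly. The sketch below is a hypothetical illustration rather than any recognizer's actual code; it computes b_j(x) for a single state, assuming diagonal covariances (a common simplification of the full covariance matrix U_{jk}):

```python
import math

def gaussian_mixture_density(x, weights, means, variances):
    """b_j(x) = sum over k of c_jk * N[x; mu_jk, U_jk] for one state j,
    with diagonal covariances: variances[k][d] is the d-th diagonal entry
    of U_jk. weights must be nonnegative and sum to 1, per (9.10)."""
    total = 0.0
    for c, mu, var in zip(weights, means, variances):
        # log of a diagonal-covariance Gaussian evaluated at x
        log_n = sum(
            -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
            for xi, m, v in zip(x, mu, var)
        )
        total += c * math.exp(log_n)
    return total
```

In practice recognizers work with log b_j(x_t) and log-sum tricks to avoid underflow over long utterances; the plain form above is kept for clarity.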
We now return to the issue of the calculation of the probability of
frame x_t being associated with the j(t)th state of the i(t)th word in
the utterance, P_A(x_t | w^{i(t)}_{j(t)}), which is calculated as

P_A(x_t | w^{i(t)}_{j(t)}) = b^{i(t)}_{j(t)}(x_t).  (9.12)
The computation of (9.12) is incomplete since we have ignored the
computation of the probability associated with the links between word
states, and we have also not specified how to determine the within-word
state, j, in the alignment between a given word and a set of feature
vectors corresponding to that word. We come back to these issues later
in this section.
The key point is that we assign probabilities to acoustic realizations
of a sequence of words by using hidden Markov models of the acoustic
feature vectors within words. Using an independent (and orthographically
labeled) set of training data, we “train the system” and learn
the parameters of the best acoustic models for each word (or more
specifically for each sound that comprises each word). The parameters,
according to the mixture model of (9.9), are, for each state of the model,
the mixture weights, the mean vectors, and the covariance matrices.
Although we have been discussing acoustic models for whole words,
it should be clear that for any reasonable size speech recognition task,
it is impractical to create a separate acoustic model for every possible
word in the vocabulary since each word would have to be spoken in
every possible context in order to build a statistically reliable model
of the density functions of (9.9). Even for modest size vocabularies
of about 1000 words, the amount of training data required for word
models is excessive.
The alternative to word models is to build acoustic-phonetic models
for the 40 or so phonemes in the English language and construct the
model for a word by concatenating (stringing together sequentially)
the models for the constituent phones in the word (as represented in a
word dictionary or lexicon). The use of such subword acoustic-phonetic
models poses no real difficulties in either training or when used to
build up word models, and hence it is the most widely used representation
for building word models in a speech recognition system. State-of-the-art
systems use context-dependent phone models as the basic units of
recognition [99].
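The lexicon-driven concatenation described above can be sketched in a few lines. The toy lexicon, phone names, and state labels below are illustrative assumptions of ours, not part of the original text:

```python
def word_model(word, lexicon, phone_models):
    """Concatenate subword (phone) HMM state lists into one word-level
    left-to-right state sequence, as prescribed by the word lexicon.
    phone_models maps each phone to its list of per-state parameters."""
    states = []
    for phone in lexicon[word]:
        states.extend(phone_models[phone])  # 3 states per phone here
    return states

# Hypothetical toy lexicon and 3-state phone models (labels illustrative)
lexicon = {"is": ["IH", "Z"]}
phone_models = {
    "IH": ["IH.begin", "IH.middle", "IH.end"],
    "Z":  ["Z.begin", "Z.middle", "Z.end"],
}
```

For the word “is” this yields the 6-state chain of Figure 9.7; a context-dependent system would select among several variants of each phone model based on the neighboring phones before concatenating.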
9.3.4 Step 2 — The Language Model
The language model assigns probabilities to sequences of words, based
on the likelihood of that sequence of words occurring in the context
of the task being performed by the speech recognition system. Hence
the probability of the text string W =“Call home” for a telephone
number identiﬁcation task is zero since that string makes no sense for
the specified task. There are many ways of building language models
for specific tasks, including:
(1) statistical training from text databases transcribed from
task-specific dialogs (a learning procedure)
(2) rule-based learning of the formal grammar associated with
the task
(3) enumerating, by hand, all valid text strings in the language
and assigning appropriate probability scores to each string.
The purpose of the language model, or grammar, is to enable the
computation of the a priori probability, P_L(W), of a word string, W,
consistent with the recognition task [59, 60, 106]. Perhaps the most
popular way of constructing the language model is through the use
of a statistical N-gram word grammar that is estimated from a large
training set of text utterances, either from the task at hand or from a
generic database with applicability to a wide range of tasks. We now
describe the way in which such a language model is built.
Assume we have a large text training set of word-labeled utterances.
(Such databases could include millions or even tens of millions of text
sentences.) For every sentence in the training set, we have a text file
that identifies the words in that sentence. If we make the assumption
that the probability of a word in a sentence is conditioned on only
the previous N − 1 words, we have the basis for an N-gram language
model. Thus we assume we can write the probability of the sentence
W, according to an N-gram language model, as
P_L(W) = P_L(w_1, w_2, ..., w_M)  (9.13)
       = ∏_{n=1}^{M} P_L(w_n | w_{n−1}, w_{n−2}, ..., w_{n−N+1}),  (9.14)
where the probability of a word occurring in the sentence only depends
on the previous N − 1 words, and we estimate this probability by
counting the relative frequencies of N-tuples of words in the training
set. Thus, for example, to estimate word “trigram” probabilities
(i.e., the probability that a word w_n was preceded by the pair of words
(w_{n−1}, w_{n−2})), we compute this quantity as
P(w_n | w_{n−1}, w_{n−2}) = C(w_{n−2}, w_{n−1}, w_n) / C(w_{n−2}, w_{n−1}),  (9.15)
where C(w_{n−2}, w_{n−1}, w_n) is the frequency count of the word triplet
(i.e., the trigram of words) consisting of (w_{n−2}, w_{n−1}, w_n) as it occurs
in the text training set, and C(w_{n−2}, w_{n−1}) is the frequency count of
the word doublet (i.e., bigram of words) (w_{n−2}, w_{n−1}) as it occurs in
the text training set.
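The trigram estimate of (9.15) amounts to two counters and a division. A minimal sketch follows; the sentence-start padding token `<s>` and the tiny two-sentence corpus are our own illustrative choices, not from the original text:

```python
from collections import Counter

def train_trigram(sentences):
    """Estimate P(w_n | w_{n-1}, w_{n-2}) by relative-frequency counting,
    as in Equation (9.15): C(w_{n-2}, w_{n-1}, w_n) / C(w_{n-2}, w_{n-1})."""
    tri, bi = Counter(), Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + words      # sentence-start padding
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            bi[tuple(padded[i - 2:i])] += 1
    def prob(w, prev, prev2):
        """P(w_n = w | w_{n-1} = prev, w_{n-2} = prev2)."""
        denom = bi[(prev2, prev)]
        return tri[(prev2, prev, w)] / denom if denom else 0.0
    return prob

corpus = [["call", "home", "now"], ["call", "home", "later"]]
p = train_trigram(corpus)
```

Real systems must also smooth these estimates (many valid trigrams never occur in any finite training set), which the raw relative-frequency sketch above does not attempt.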
9.3.5 Step 3 — The Search Problem
The third step in the Bayesian approach to automatic speech recognition
is to search the space of all valid word sequences from the language
model, to find the one with the maximum likelihood of having been spoken.
The key problem is that the potential size of the search space can
be astronomically large (for large vocabularies and high average word
branching factor language models), thereby taking inordinate amounts
of computing power to solve by heuristic methods. Fortunately, through
the use of methods from the field of Finite State Automata Theory,
Finite State Network (FSN) methods have evolved that reduce the
computational burden by orders of magnitude, thereby enabling exact
maximum likelihood solutions in computationally feasible times, even
for very large speech recognition problems [81].
The basic concept of a finite state network transducer is illustrated
in Figure 9.10, which shows a word pronunciation network for the word
/data/. Each arc in the state diagram corresponds to a phoneme in the
word pronunciation network, and the weight is an estimate of the
probability that the arc is utilized in the pronunciation of the word in context.

Fig. 9.10 Word pronunciation transducer for four pronunciations of the word /data/.
(After Mohri [81].)

We see that for the word /data/ there are four total pronunciations,
namely (along with their (estimated) pronunciation probabilities):
(1) /D/ /EY/ /D/ /AX/ — probability of 0.32
(2) /D/ /EY/ /T/ /AX/ — probability of 0.08
(3) /D/ /AE/ /D/ /AX/ — probability of 0.48
(4) /D/ /AE/ /T/ /AX/ — probability of 0.12.
The combined FSN of the 4 pronunciations is far more efficient than
using 4 separate enumerations of the word, since all the arcs are shared
among the 4 pronunciations, and the total computation for the full FSN
for the word /data/ is close to 1/4 the computation of the 4 variants
of the same word.
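The sharing of arcs becomes visible if the four path probabilities are factored into per-arc weights: 0.4 versus 0.6 on the /EY/ versus /AE/ arcs and 0.8 versus 0.2 on the second /D/ versus /T/ arcs reproduce all four listed values when multiplied along each path. The sketch below enumerates the paths of such a small weighted FSN; the dictionary encoding and the factored arc weights are our assumptions inferred from the listed pronunciation probabilities, not taken from Figure 9.10 itself:

```python
def path_probabilities(network, start, end):
    """Enumerate all paths through a small weighted pronunciation FSN.
    network maps a state to a list of (phone, probability, next_state)
    arcs; arc probabilities multiply along each path."""
    paths = []
    def walk(state, phones, prob):
        if state == end:
            paths.append((tuple(phones), prob))
            return
        for phone, p, nxt in network.get(state, []):
            walk(nxt, phones + [phone], prob * p)
    walk(start, [], 1.0)
    return paths

# A /data/-style transducer with factored arc weights (our assumption)
data_fsn = {
    0: [("D", 1.0, 1)],
    1: [("EY", 0.4, 2), ("AE", 0.6, 2)],
    2: [("D", 0.8, 3), ("T", 0.2, 3)],
    3: [("AX", 1.0, 4)],
}
```

Six shared arcs here encode the same information as the four separate four-arc pronunciations, which is exactly the economy the combined FSN provides.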
We can continue the process of creating eﬃcient FSNs for each
word in the task vocabulary (the speech dictionary or lexicon), and
then combine word FSNs into sentence FSNs using the appropriate
language model. Further, we can carry the process down to the level of
HMM phones and HMM states, making the process even more eﬃcient.
Ultimately we can compile a very large network of model states, model
phones, model words, and even model phrases into a much smaller
network via the method of weighted finite state transducers (WFSTs),
which combine the various representations of speech and language and
optimize the resulting network to minimize the number of search states
(and, equivalently, thereby minimize the amount of duplicate computation).
A simple example of such a WFST network optimization is given
in Figure 9.11 [81].
Using the techniques of network combination (which include network
composition, determinization, minimization, and weight pushing)
and network optimization, the WFST uses a unified mathematical
framework to efficiently compile a large network into a minimal
representation that is readily searched using standard Viterbi decoding
methods [38]. Using these methods, an unoptimized network with 10^22
states (the result of the cross product of model states, model phones,
model words, and model phrases) was able to be compiled down to
a mathematically equivalent model with 10^8 states that was readily
searched for the optimum word string with no loss of performance or
word accuracy.

Fig. 9.11 Use of WFSTs to compile a set of FSNs into a single optimized network to
minimize redundancy in the network. (After Mohri [81].)
9.4 Representative Recognition Performance
The block diagram of Figure 9.1 represents a wide range of possibilities
for the implementation of automatic speech recognition. In Section 9.2.1,
we suggest how the techniques of digital speech analysis
discussed in Chapters 4–6 can be applied to extract a sequence of feature
vectors from the speech signal, and in Section 9.3 we describe the
most widely used statistical pattern recognition techniques that are
employed for mapping the sequence of feature vectors into a sequence
of symbols or words.
Many variations on this general theme have been investigated over
the past 30 years or more, and as we mentioned before, testing and
evaluation is a major part of speech recognition research and development.
Table 9.1 summarizes the performance of a range of speech
recognition and natural language understanding systems that have been
developed so far; the table covers a range of vocabulary sizes, speaking
styles, and application contexts [92].

Table 9.1 Word error rates for a range of speech recognition systems.

Corpus                                         | Type of speech         | Vocabulary size | Word error rate (%)
Connected digit strings (TI Database)          | Spontaneous            | 11 (0–9, oh)    | 0.3
Connected digit strings (AT&T Mall Recordings) | Spontaneous            | 11 (0–9, oh)    | 2.0
Connected digit strings (AT&T HMIHY)           | Conversational         | 11 (0–9, oh)    | 5.0
Resource management (RM)                       | Read speech            | 1000            | 2.0
Airline travel information system (ATIS)       | Spontaneous            | 2500            | 2.5
North American business (NAB & WSJ)            | Read text              | 64,000          | 6.6
Broadcast news                                 | Narrated news          | 210,000         | ≈15
Switchboard                                    | Telephone conversation | 45,000          | ≈27
Call home                                      | Telephone conversation | 28,000          | ≈35
It can be seen that for a vocabulary of 11 digits, the word error
rates are very low (0.3% for a very clean recording environment for
the TI (Texas Instruments) connected digits database [68]), but when
the digit strings are spoken in a noisy shopping mall environment the
word error rate rises to 2.0%, and when embedded within conversational
speech (the AT&T HMIHY (How May I Help You) system) the word
error rate increases significantly to 5.0%, showing the lack of robustness
of the recognition system to noise and other background disturbances
[44]. Table 9.1 also shows the word error rates for a range of DARPA
tasks ranging from:
• read speech of commands and informational requests about
a naval ships database (the resource management system, or
RM) with a 1000 word vocabulary and a word error rate of
2.0%
• spontaneous speech input for booking airline travel (the
airline travel information system, or ATIS [133]) with a 2500
word vocabulary and a word error rate of 2.5%
• read text from a range of business magazines and newspapers
(the North American business task, or NAB) with a
vocabulary of 64,000 words and a word error rate of 6.6%
• narrated news broadcasts from a range of TV news providers
like CNBC (the broadcast news task) with a 210,000 word
vocabulary and a word error rate of about 15%
• recorded live telephone conversations between two unrelated
individuals (the Switchboard task [42]) with a vocabulary of
45,000 words and a word error rate of about 27%, and a
separate task for live telephone conversations between two
family members (the call home task) with a vocabulary of
28,000 words and a word error rate of about 35%.
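The word error rates quoted above are conventionally computed as the minimum number of word substitutions, deletions, and insertions needed to turn the recognized string into the reference transcription, divided by the number of reference words. A minimal sketch of that computation (not any particular evaluation toolkit) is:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with the standard Levenshtein dynamic program over words."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j]: minimum edits to turn r[:i] into h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # delete all reference words
    for j in range(len(h) + 1):
        d[0][j] = j                      # insert all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

Note that WER can exceed 100% when the recognizer inserts many spurious words, which is why the conversational tasks in Table 9.1 remain so challenging.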
9.5 Challenges in ASR Technology
So far, ASR systems fall far short of human speech perception in all
but the simplest, most constrained tasks. Before ASR systems become
ubiquitous in society, many improvements will be required in both
system performance and operational performance. In the system area
we need large improvements in accuracy, eﬃciency, and robustness in
order to utilize the technology for a wide range of tasks, on a wide
range of processors, and under a wide range of operating conditions.
In the operational area we need better methods of detecting when a
person is speaking to a machine and isolating the spoken input from
the background; we need to be able to handle users talking over the
voice prompts (so-called barge-in conditions); we need more reliable
and accurate utterance rejection methods so we can be sure that a word
needs to be repeated when poorly recognized the first time; and finally
we need better methods of confidence scoring of words, phrases, and
even sentences so as to maintain an intelligent dialog with a customer.
Conclusion
We have attempted to provide the reader with a broad overview of
the ﬁeld of digital speech processing and to give some idea as to the
remarkable progress that has been achieved over the past 4–5 decades.
Digital speech processing systems have permeated society in the form of
cellular speech coders, synthesized speech response systems, and speech
recognition and understanding systems that handle a wide range of
requests about airline flights, stock price quotations, specialized help
desks, etc.
There remain many diﬃcult problems yet to be solved before digital
speech processing will be considered a mature science and technology.
Our basic understanding of the human articulatory system and how the
various muscular controls come together in the production of speech is
rudimentary at best, and our understanding of the processing of speech
in the human brain is at an even lower level.
In spite of our shortfalls in understanding, we have been able to
create remarkable speech processing systems whose performance increases
at a steady pace. A firm understanding of the fundamentals of acoustics,
linguistics, signal processing, and perception provides the tools for
building systems that work and can be used by the general public. As
we increase our basic understanding of speech, the application systems
will only improve and the pervasiveness of speech processing in our
daily lives will increase dramatically, with the end result of improving
the productivity in our work and home environments.
Acknowledgments
We wish to thank Professor Robert Gray, Editor-in-Chief of Now
Publishers' Foundations and Trends in Signal Processing, for inviting us
to prepare this text. His patience, technical advice, and editorial skill
were crucial at every stage of the writing. We also wish to thank
Abeer Alwan, Yariv Ephraim, Luciana Ferrer, Sadaoki Furui, and Tom
Quatieri for their detailed and perceptive comments, which greatly
improved the final result. Of course, we are responsible for any
weaknesses or inaccuracies that remain in this text.
References
[1] J. B. Allen and L. R. Rabiner, “A unified theory of short-time spectrum
analysis and synthesis,” Proceedings of the IEEE, vol. 65, no. 11, pp. 1558–1564,
November 1977.
[2] B. S. Atal, “Predictive coding of speech at low bit rates,” IEEE Transactions
on Communications, vol. COM-30, no. 4, pp. 600–614, April 1982.
[3] B. S. Atal and S. L. Hanauer, “Speech analysis and synthesis by linear prediction
of the speech wave,” Journal of the Acoustical Society of America, vol. 50,
pp. 561–580, 1971.
[4] B. S. Atal and J. Remde, “A new model of LPC excitation for producing
natural-sounding speech at low bit rates,” Proceedings of IEEE ICASSP,
pp. 614–617, 1982.
[5] B. S. Atal and M. R. Schroeder, “Adaptive predictive coding of speech signals,”
Bell System Technical Journal, vol. 49, pp. 1973–1986, October 1970.
[6] B. S. Atal and M. R. Schroeder, “Predictive coding of speech signals and subjective
error criterion,” IEEE Transactions on Acoustics, Speech, and Signal
Processing, vol. ASSP-27, pp. 247–254, June 1979.
[7] B. S. Atal and M. R. Schroeder, “Improved quantizer for adaptive predictive
coding of speech signals at low bit rates,” Proceedings of ICASSP, pp. 535–538,
April 1980.
[8] T. B. Barnwell III, “Recursive windowing for generating autocorrelation analysis
for LPC analysis,” IEEE Transactions on Acoustics, Speech and Signal
Processing, vol. ASSP-29, no. 5, pp. 1062–1066, October 1981.
[9] T. B. Barnwell III, K. Nayebi, and C. H. Richardson, Speech Coding, A Computer
Laboratory Textbook. John Wiley and Sons, 1996.
[10] L. E. Baum, “An inequality and associated maximization technique in statistical
estimation for probabilistic functions of Markov processes,” Inequalities,
vol. 3, pp. 1–8, 1972.
[11] L. E. Baum, T. Petrie, G. Soules, and N. Weiss, “A maximization technique
occurring in the statistical analysis of probabilistic functions of Markov
chains,” Annals of Mathematical Statistics, vol. 41, pp. 164–171, 1970.
[12] W. R. Bennett, “Spectra of quantized signals,” Bell System Technical Journal,
vol. 27, pp. 446–472, July 1948.
[13] M. Berouti, H. Garten, P. Kabal, and P. Mermelstein, “Efficient computation
and encoding of the multipulse excitation for LPC,” Proceedings of ICASSP,
pp. 384–387, March 1984.
[14] M. Beutnagel, A. Conkie, and A. K. Syrdal, “Diphone synthesis using
unit selection,” Third Speech Synthesis Workshop, Jenolan Caves, Australia,
November 1998.
[15] M. Beutnagel and A. Conkie, “Interaction of units in a unit selection
database,” Proceedings of Eurospeech ’99, Budapest, Hungary, September
1999.
[16] B. P. Bogert, M. J. R. Healy, and J. W. Tukey, “The quefrency alanysis of
time series for echoes: Cepstrum, pseudo-autocovariance, cross-cepstrum, and
saphe cracking,” in Proceedings of the Symposium on Time Series Analysis,
(M. Rosenblatt, ed.), New York: John Wiley and Sons, Inc., 1963.
[17] E. Bresch, J. Nielsen, K. Nayak, and S. Narayanan, “Synchronized and noise-robust
audio recordings during real-time MRI scans,” Journal of the Acoustical
Society of America, vol. 120, no. 4, pp. 1791–1794, October 2006.
[18] C. S. Burrus and R. A. Gopinath, Introduction to Wavelets and Wavelet
Transforms. Prentice-Hall Inc., 1998.
[19] J. P. Campbell Jr., V. C. Welch, and T. E. Tremain, “An expandable error-protected
4800 bps CELP coder,” Proceedings of ICASSP, vol. 2, pp. 735–738,
May 1989.
[20] F. Charpentier and M. G. Stella, “Diphone synthesis using an overlap-add
technique for speech waveform concatenation,” Proceedings of International
Conference on Acoustics, Speech and Signal Processing, pp. 2015–2018, 1986.
[21] J. H. Chung and R. W. Schafer, “Performance evaluation of analysis-by-synthesis
homomorphic vocoders,” Proceedings of IEEE ICASSP, vol. 2,
pp. 117–120, March 1992.
[22] C. H. Coker, “A model of articulatory dynamics and control,” Proceedings of
IEEE, vol. 64, pp. 452–459, 1976.
[23] R. V. Cox, S. L. Gay, Y. Shoham, S. Quackenbush, N. Seshadri, and N. Jayant,
“New directions in subband coding,” IEEE Journal on Selected Areas in Communications,
vol. 6, no. 2, pp. 391–409, February 1988.
[24] R. E. Crochiere and L. R. Rabiner, Multirate Digital Signal Processing.
Prentice-Hall Inc., 1983.
[25] R. E. Crochiere, S. A. Webber, and J. L. Flanagan, “Digital coding of speech
in subbands,” Bell System Technical Journal, vol. 55, no. 8, pp. 1069–1085,
October 1976.
[26] C. C. Cutler, “Diﬀerential quantization of communication signals,” U.S.
Patent 2,605,361, July 29, 1952.
[27] S. B. Davis and P. Mermelstein, “Comparison of parametric representations
for monosyllabic word recognition in continuously spoken sentences,” IEEE
Transactions on Acoustics, Speech and Signal Processing, vol. 28, pp. 357–366,
August 1980.
[28] F. de Jager, “Delta modulation — a new method of PCM transmission using
the 1-unit code,” Philips Research Reports, pp. 442–466, December 1952.
[29] P. B. Denes and E. N. Pinson, The Speech Chain. W. H. Freeman Company,
2nd Edition, 1993.
[30] H. Dudley, “The vocoder,” Bell Labs Record, vol. 17, pp. 122–126, 1939.
[31] T. Dutoit, An Introduction to Text-to-Speech Synthesis. Netherlands: Kluwer
Academic Publishers, 1997.
[32] G. Fant, Acoustic Theory of Speech Production. The Hague: Mouton & Co.,
1960; Walter de Gruyter, 1970.
[33] J. D. Ferguson, “Hidden Markov Analysis: An Introduction,” Hidden Markov
Models for Speech, Princeton: Institute for Defense Analyses, 1980.
[34] J. L. Flanagan, Speech Analysis, Synthesis and Perception. Springer-Verlag,
1972.
[35] J. L. Flanagan, C. H. Coker, L. R. Rabiner, R. W. Schafer, and N. Umeda,
“Synthetic voices for computers,” IEEE Spectrum, vol. 7, pp. 22–45, October
1970.
[36] J. L. Flanagan, K. Ishizaka, and K. L. Shipley, “Synthesis of speech from a
dynamic model of the vocal cords and vocal tract,” Bell System Technical
Journal, vol. 54, no. 3, pp. 485–506, March 1975.
[37] H. Fletcher and W. J. Munson, “Loudness, its definition, measurement and
calculation,” Journal of the Acoustical Society of America, vol. 5, no. 2,
pp. 82–108, October 1933.
[38] G. D. Forney, “The Viterbi algorithm,” Proceedings of the IEEE, vol. 61,
pp. 268–278, March 1973.
[39] S. Furui, “Cepstral analysis technique for automatic speaker verification,”
IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-29,
no. 2, pp. 254–272, April 1981.
[40] S. Furui, “Speaker independent isolated word recognition using dynamic features
of speech spectrum,” IEEE Transactions on Acoustics, Speech, Signal
Processing, vol. ASSP-26, no. 1, pp. 52–59, February 1986.
[41] O. Ghitza, “Auditory nerve representation as a basis for speech processing,” in
Advances in Speech Signal Processing, (S. Furui and M. Sondhi, eds.), pp. 453–485,
NY: Marcel Dekker, 1991.
[42] J. J. Godfrey, E. C. Holliman, and J. McDaniel, “SWITCHBOARD: Telephone
speech corpus for research and development,” Proceedings of ICASSP 1992,
pp. 517–520, 1992.
[43] B. Gold and L. R. Rabiner, “Parallel processing techniques for estimating
pitch period of speech in the time domain,” Journal of Acoustical Society of
America, vol. 46, no. 2, pt. 2, pp. 442–448, August 1969.
[44] A. L. Gorin, B. A. Parker, R. M. Sachs, and J. G. Wilpon, “How may I help
you?,” Proceedings of the Interactive Voice Technology for Telecommunications
Applications (IVTTA), pp. 57–60, 1996.
[45] R. M. Gray, “Vector quantization,” IEEE Signal Processing Magazine,
pp. 4–28, April 1984.
[46] R. M. Gray, “Toeplitz and circulant matrices: A review,” Foundations and
Trends in Communications and Information Theory, vol. 2, no. 3, pp. 155–239,
2006.
[47] J. A. Greefkes and K. Riemens, “Code modulation with digitally controlled
companding for speech transmission,” Philips Technical Review, pp. 335–353,
1970.
[48] H. Hermansky, “Auditory modeling in automatic recognition of speech,” in
Proceedings of First European Conference on Signal Analysis and Prediction,
pp. 17–21, Prague, Czech Republic, 1997.
[49] X. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing. Prentice-Hall
Inc., 2001.
[50] A. Hunt and A. Black, “Unit selection in a concatenative speech synthesis
system using a large speech database,” Proceedings of ICASSP-96, Atlanta,
vol. 1, pp. 373–376, 1996.
[51] K. Ishizaka and J. L. Flanagan, “Synthesis of voiced sounds from a two-mass
model of the vocal cords,” Bell System Technical Journal, vol. 51, no. 6,
pp. 1233–1268, 1972.
[52] F. Itakura, “Line spectrum representation of linear predictive coefficients of
speech signals,” Journal of the Acoustical Society of America, vol. 57,
p. S35(A).
[53] F. Itakura and S. Saito, “Analysis-synthesis telephony based upon the maximum
likelihood method,” Proceedings of 6th International Congress on
Acoustics, pp. C17–C20, 1968.
[54] F. Itakura and S. Saito, “A statistical method for estimation of speech spectral
density and formant frequencies,” Electronics and Communications in Japan,
vol. 53A, no. 1, pp. 36–43, 1970.
[55] F. Itakura and T. Umezaki, “Distance measure for speech recognition based on
the smoothed group delay spectrum,” in Proceedings of ICASSP-87, pp. 1257–1260,
Dallas TX, April 1987.
[56] N. S. Jayant, “Adaptive delta modulation with a one-bit memory,” Bell System
Technical Journal, pp. 321–342, March 1970.
[57] N. S. Jayant, “Adaptive quantization with one word memory,” Bell System
Technical Journal, pp. 1119–1144, September 1973.
[58] N. S. Jayant and P. Noll, Digital Coding of Waveforms. Prentice-Hall, 1984.
[59] F. Jelinek, Statistical Methods for Speech Recognition. Cambridge: MIT Press,
1997.
[60] F. Jelinek, R. L. Mercer, and S. Roucos, “Principles of lexical language modeling
for speech recognition,” in Advances in Speech Signal Processing, (S. Furui
and M. M. Sondhi, eds.), pp. 651–699, Marcel Dekker, 1991.
[61] B. H. Juang, “Maximum likelihood estimation for mixture multivariate
stochastic observations of Markov chains,” AT&T Technology Journal, vol. 64,
no. 6, pp. 1235–1249, 1985.
[62] B. H. Juang, S. E. Levinson, and M. M. Sondhi, “Maximum likelihood estimation
for multivariate mixture observations of Markov chains,” IEEE Transactions
on Information Theory, vol. 32, no. 2, pp. 307–309, 1986.
[63] B. H. Juang, L. R. Rabiner, and J. G. Wilpon, “On the use of bandpass
liftering in speech recognition,” IEEE Transactions on Acoustics, Speech and
Signal Processing, vol. ASSP-35, no. 7, pp. 947–954, July 1987.
[64] D. H. Klatt, “Software for a cascade/parallel formant synthesizer,” Journal of
the Acoustical Society of America, vol. 67, pp. 971–995, 1980.
[65] D. H. Klatt, “Review of text-to-speech conversion for English,” Journal of the
Acoustical Society of America, vol. 82, pp. 737–793, September 1987.
[66] W. Koenig, H. K. Dunn, and L. Y. Lacey, “The sound spectrograph,” Journal
of the Acoustical Society of America, vol. 18, pp. 19–49, 1946.
[67] P. Kroon, E. F. Deprettere, and R. J. Sluyter, “Regular-pulse excitation:
A novel approach to effective and efficient multipulse coding of speech,”
IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-34,
pp. 1054–1063, October 1986.
[68] R. G. Leonard, “A database for speaker-independent digit recognition,”
Proceedings of ICASSP 1984, pp. 42.11.1–42.11.4, 1984.
[69] S. E. Levinson, L. R. Rabiner, and M. M. Sondhi, “An introduction to the
application of the theory of probabilistic functions of a Markov process to
automatic speech recognition,” Bell System Technical Journal, vol. 62, no. 4,
pp. 1035–1074, 1983.
[70] Y. Linde, A. Buzo, and R. M. Gray, “An algorithm for vector quantizer
design,” IEEE Transactions on Communications, vol. COM-28, pp. 84–95,
January 1980.
[71] S. P. Lloyd, “Least squares quantization in PCM,” IEEE Transactions on
Information Theory, vol. 28, pp. 129–137, March 1982.
[72] P. C. Loizou, Speech Enhancement, Theory and Practice. CRC Press, 2007.
[73] R. F. Lyon, “A computational model of ﬁltering, detection and compression in
the cochlea,” in Proceedings of IEEE International Conference on Acoustics,
Speech and Signal Processing, Paris, France, May 1982.
[74] J. Makhoul, “Linear prediction: A tutorial review,” Proceedings of IEEE,
vol. 63, pp. 561–580, 1975.
[75] J. Makhoul, V. Viswanathan, R. Schwarz, and A. W. F. Huggins, “A mixed
source model for speech compression and synthesis,” Journal of the Acoustical
Society of America, vol. 64, pp. 1577–1581, December 1978.
[76] D. Malah, “Time-domain algorithms for harmonic bandwidth reduction and
time scaling of speech signals,” IEEE Transactions on Acoustics, Speech and
Signal Processing, vol. 27, no. 2, pp. 121–133, 1979.
[77] J. D. Markel, “The SIFT algorithm for fundamental frequency estimation,”
IEEE Transactions on Audio and Electroacoustics, vol. AU-20, no. 5,
pp. 367–377, December 1972.
[78] J. D. Markel and A. H. Gray, Linear Prediction of Speech. New York:
Springer-Verlag, 1976.
188 References
[79] J. Max, “Quantizing for minimum distortion,” IRE Transactions on
Information Theory, vol. IT-6, pp. 7–12, March 1960.
[80] A. V. McCree and T. P. Barnwell III, “A mixed excitation LPC vocoder
model for low bit rate speech coding,” IEEE Transactions on Speech and
Audio Processing, vol. 3, no. 4, pp. 242–250, July 1995.
[81] M. Mohri, “Finite-state transducers in language and speech processing,”
Computational Linguistics, vol. 23, no. 2, pp. 269–312, 1997.
[82] E. Moulines and F. Charpentier, “Pitch-synchronous waveform processing
techniques for text-to-speech synthesis using diphones,” Speech
Communication, vol. 9, no. 5–6, 1990.
[83] A. M. Noll, “Cepstrum pitch determination,” Journal of the Acoustical Society
of America, vol. 41, no. 2, pp. 293–309, February 1967.
[84] P. Noll, “A comparative study of various schemes for speech encoding,” Bell
System Technical Journal, vol. 54, no. 9, pp. 1597–1614, November 1975.
[85] A. V. Oppenheim, “Superposition in a class of nonlinear systems,” PhD
dissertation, MIT, 1964. Also: MIT Research Lab. of Electronics, Cambridge,
Massachusetts, Technical Report No. 432, 1965.
[86] A. V. Oppenheim, “A speech analysis-synthesis system based on homomorphic
filtering,” Journal of the Acoustical Society of America, vol. 45, no. 2,
pp. 293–309, February 1969.
[87] A. V. Oppenheim, “Speech spectrograms using the fast Fourier transform,”
IEEE Spectrum, vol. 7, pp. 57–62, August 1970.
[88] A. V. Oppenheim and R. W. Schafer, “Homomorphic analysis of speech,”
IEEE Transactions on Audio and Electroacoustics, vol. AU-16, pp. 221–228,
June 1968.
[89] A. V. Oppenheim, R. W. Schafer, and J. R. Buck, Discrete-Time Signal
Processing. Prentice-Hall Inc., 1999.
[90] A. V. Oppenheim, R. W. Schafer, and T. G. Stockham Jr., “Nonlinear filtering
of multiplied and convolved signals,” Proceedings of IEEE, vol. 56, no. 8,
pp. 1264–1291, August 1968.
[91] M. D. Paez and T. H. Glisson, “Minimum mean-squared error quantization in
speech,” IEEE Transactions on Communications, vol. COM-20, pp. 225–230,
April 1972.
[92] D. S. Pallett et al., “The 1994 benchmark tests for the ARPA spoken language
program,” Proceedings of 1995 ARPA Human Language Technology Workshop,
pp. 5–36, 1995.
[93] M. R. Portnoff, “A quasi-one-dimensional simulation for the time-varying
vocal tract,” MS Thesis, MIT, Department of Electrical Engineering, 1973.
[94] T. F. Quatieri, Discrete-Time Speech Signal Processing. Prentice Hall, 2002.
[95] L. R. Rabiner, “A model for synthesizing speech by rule,” IEEE Transactions
on Audio and Electroacoustics, vol. AU-17, no. 1, pp. 7–13, March 1969.
[96] L. R. Rabiner, “On the use of autocorrelation analysis for pitch detection,”
IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-25,
no. 1, pp. 24–33, February 1977.
[97] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications
in speech recognition,” IEEE Proceedings, vol. 77, no. 2, pp. 257–286, 1989.
[98] L. R. Rabiner and B. H. Juang, “An introduction to hidden Markov models,”
IEEE Signal Processing Magazine, 1985.
[99] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Prentice
Hall Inc., 1993.
[100] L. R. Rabiner and M. R. Sambur, “An algorithm for determining the endpoints
of isolated utterances,” Bell System Technical Journal, vol. 54, no. 2, pp. 297–
315, February 1975.
[101] L. R. Rabiner and R. W. Schafer, Theory and Application of Digital Speech
Processing. PrenticeHall Inc., 2009. (In preparation).
[102] L. R. Rabiner, R. W. Schafer, and J. L. Flanagan, “Computer synthesis of
speech by concatenation of formant coded words,” Bell System Technical
Journal, vol. 50, no. 5, pp. 1541–1558, May–June 1971.
[103] D. W. Robinson and R. S. Dadson, “A re-determination of the equal-loudness
contours for pure tones,” British Journal of Applied Physics, vol. 7,
pp. 166–181, 1956.
[104] R. C. Rose and T. P. Barnwell III, “The self excited vocoder — an alternate
approach to toll quality at 4800 bps,” Proceedings of ICASSP ’86, vol. 11,
pp. 453–456, April 1986.
[105] A. E. Rosenberg, “Effect of glottal pulse shape on the quality of natural
vowels,” Journal of the Acoustical Society of America, vol. 43, no. 4,
pp. 822–828, February 1971.
[106] R. Rosenfeld, “Two decades of statistical language modeling: Where do we go
from here?,” IEEE Proceedings, vol. 88, no. 8, pp. 1270–1278, 2000.
[107] M. B. Sachs, C. C. Blackburn, and E. D. Young, “Rate-place and temporal-
place representations of vowels in the auditory nerve and anteroventral
cochlear nucleus,” Journal of Phonetics, vol. 16, pp. 37–53, 1988.
[108] Y. Sagisaka, “Speech synthesis by rule using an optimal selection of non-
uniform synthesis units,” Proceedings of the International Conference on
Acoustics, Speech and Signal Processing, pp. 679–682, 1988.
[109] R. W. Schafer, “Echo removal by discrete generalized linear filtering,” PhD
dissertation, MIT, 1968. Also: MIT Research Laboratory of Electronics,
Cambridge, Massachusetts, Technical Report No. 466, 1969.
[110] R. W. Schafer, “Homomorphic systems and cepstrum analysis of speech,”
Springer Handbook of Speech Processing and Communication, Springer, 2007.
[111] R. W. Schafer and L. R. Rabiner, “System for automatic formant analysis of
voiced speech,” Journal of the Acoustical Society of America, vol. 47, no. 2,
pp. 458–465, February 1970.
[112] M. R. Schroeder and B. S. Atal, “Code-excited linear prediction (CELP):
High-quality speech at very low bit rates,” Proceedings of IEEE ICASSP,
pp. 937–940, 1985.
[113] M. R. Schroeder and E. E. David, “A vocoder for transmitting 10 kc/s speech
over a 3.5 kc/s channel,” Acustica, vol. 10, pp. 35–43, 1960.
[114] J. H. Schroeter, “Basic principles of speech synthesis,” Springer Handbook of
Speech Processing, SpringerVerlag, 2006.
[115] S. Seneff, “A joint synchrony/mean-rate model of auditory speech processing,”
Journal of Phonetics, vol. 16, pp. 55–76, 1988.
[116] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication.
University of Illinois Press, Urbana, 1949.
[117] G. A. Sitton, C. S. Burrus, J. W. Fox, and S. Treitel, “Factoring very-high-
degree polynomials,” IEEE Signal Processing Magazine, vol. 20, no. 6,
pp. 27–42, November 2003.
[118] B. Smith, “Instantaneous companding of quantized signals,” Bell System
Technical Journal, vol. 36, no. 3, pp. 653–709, May 1957.
[119] F. K. Soong and B.H. Juang, “Optimal quantization of LSP parameters,”
IEEE Transactions on Speech and Audio Processing, vol. 1, no. 1, pp. 15–24,
January 1993.
[120] A. Spanias, T. Painter, and V. Atti, Audio Signal Processing and Coding.
Wiley Interscience, 2007.
[121] K. N. Stevens, Acoustic Phonetics. MIT Press, 1998.
[122] S. S. Stevens and J. Volkman, “The relation of pitch to frequency,” American
Journal of Psychology, vol. 53, p. 329, 1940.
[123] L. C. Stewart, R. M. Gray, and Y. Linde, “The design of trellis waveform
coders,” IEEE Transactions on Communications, vol. COM-30, pp. 702–710,
April 1982.
[124] T. G. Stockham Jr., T. M. Cannon, and R. B. Ingebretsen, “Blind
deconvolution through digital signal processing,” Proceedings of IEEE,
vol. 63, pp. 678–692, April 1975.
[125] G. Strang and T. Nguyen, Wavelets and Filter Banks. Wellesley, MA:
Wellesley-Cambridge Press, 1996.
[126] Y. Tohkura, “A weighted cepstral distance measure for speech recognition,”
IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 35,
pp. 1414–1422, October 1987.
[127] J. M. Tribolet, “A new phase unwrapping algorithm,” IEEE Transactions on
Acoustics, Speech, and Signal Processing, vol. ASSP-25, no. 2, pp. 170–177,
April 1977.
[128] C. K. Un and D. T. Magill, “The residual-excited linear prediction vocoder
with transmission rate below 9.6 kbits/s,” IEEE Transactions on
Communications, vol. COM-23, no. 12, pp. 1466–1474, December 1975.
[129] P. P. Vaidyanathan, Multirate Systems and Filter Banks. Prentice-Hall Inc.,
1993.
[130] R. Viswanathan, W. Russell, and J. Makhoul, “Voice-excited LPC coders for
9.6 kbps speech transmission,” vol. 4, pp. 558–561, April 1979.
[131] V. Viswanathan and J. Makhoul, “Quantization properties of transmission
parameters in linear predictive systems,” IEEE Transactions on Acoustics,
Speech, and Signal Processing, vol. ASSP-23, no. 3, pp. 309–321, June 1975.
[132] A. J. Viterbi, “Error bounds for convolutional codes and an asymptotically
optimum decoding algorithm,” IEEE Transactions on Information Theory,
vol. IT-13, pp. 260–269, April 1967.
[133] W. Ward, “Evaluation of the CMU ATIS system,” Proceedings of DARPA
Speech and Natural Language Workshop, pp. 101–105, February 1991.
[134] E. Zwicker and H. Fastl, Psychoacoustics. SpringerVerlag, 2nd Edition, 1990.
Supplemental References
The speciﬁc references that comprise the Bibliography of this text
are representative of the literature of the ﬁeld of digital speech pro
cessing. In addition, we provide the following list of journals and
books as a guide for further study. Listing the books in chronologi
cal order of publication provides some perspective on the evolution of
the ﬁeld.
Speech Processing Journals
• IEEE Transactions on Signal Processing. Main publication of the IEEE Signal
Processing Society.
• IEEE Transactions on Speech and Audio Processing. Publication of the IEEE
Signal Processing Society that is focused on speech and audio processing.
• Journal of the Acoustical Society of America. General publication of the
American Institute of Physics. Papers on speech and hearing as well as
other areas of acoustics.
• Speech Communication. Published by Elsevier. A publication of the
European Association for Signal Processing (EURASIP) and of the
International Speech Communication Association (ISCA).
General Speech Processing References
• Speech Analysis, Synthesis and Perception, J. L. Flanagan, Springer-Verlag,
Second Edition, Berlin, 1972.
• Linear Prediction of Speech, J. D. Markel and A. H. Gray, Jr., Springer-Verlag,
Berlin, 1976.
• Digital Processing of Speech Signals, L. R. Rabiner and R. W. Schafer,
Prentice-Hall Inc., 1978.
• Speech Analysis, R. W. Schafer and J. D. Markel (eds.), IEEE Press
Selected Reprint Series, 1979.
• Speech Communication, Human and Machine, D. O’Shaughnessy,
Addison-Wesley, 1987.
• Advances in Speech Signal Processing, S. Furui and M. M. Sondhi,
Marcel Dekker Inc., New York, 1991.
• Discrete-Time Processing of Speech Signals, J. Deller, Jr., J. H. L.
Hansen, and J. G. Proakis, Wiley-IEEE Press, Classic Reissue, 1999.
• Acoustic Phonetics, K. N. Stevens, MIT Press, 1998.
• Speech and Audio Signal Processing, B. Gold and N. Morgan, John Wiley
and Sons, 2000.
• Digital Speech Processing, Synthesis and Recognition, S. Furui, Second
Edition, Marcel Dekker Inc., New York, 2001.
• Discrete-Time Speech Signal Processing, T. F. Quatieri, Prentice Hall
Inc., 2002.
• Speech Processing, A Dynamic and Optimization-Oriented Approach,
L. Deng and D. O’Shaughnessy, Marcel Dekker, 2003.
• Springer Handbook of Speech Processing and Speech Communication,
J. Benesty, M. M. Sondhi and Y. Huang (eds.), Springer, 2008.
• Theory and Application of Digital Speech Processing, L. R. Rabiner and
R. W. Schafer, Prentice Hall Inc., 2009.
Speech Coding References
• Digital Coding of Waveforms, N. S. Jayant and P. Noll, Prentice Hall
Inc., 1984.
• Practical Approaches to Speech Coding, P. E. Papamichalis, Prentice
Hall Inc., 1987.
• Vector Quantization and Signal Compression, A. Gersho and R. M.
Gray, Kluwer Academic Publishers, 1992.
• Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal, Elsevier,
1995.
• Speech Coding, A Computer Laboratory Textbook, T. P. Barnwell and
K. Nayebi, John Wiley and Sons, 1996.
• A Practical Handbook of Speech Coders, R. Goldberg and L. Riek, CRC
Press, 2000.
• Speech Coding Algorithms, W. C. Chu, John Wiley and Sons, 2003.
• Digital Speech: Coding for Low Bit Rate Communication Systems, Second
Edition, A. M. Kondoz, John Wiley and Sons, 2004.
Speech Synthesis
• From Text to Speech, J. Allen, S. Hunnicutt and D. Klatt, Cambridge
University Press, 1987.
• Acoustics of American English, J. P. Olive, A. Greenwood and
J. Coleman, Springer-Verlag, 1993.
• Computing Prosody, Y. Sagisaka, N. Campbell and N. Higuchi,
Springer-Verlag, 1996.
• Progress in Speech Synthesis, J. VanSanten, R. W. Sproat, J. P. Olive
and J. Hirschberg (eds.), Springer-Verlag, 1996.
• An Introduction to Text-to-Speech Synthesis, T. Dutoit, Kluwer
Academic Publishers, 1997.
• Speech Processing and Synthesis Toolboxes, D. Childers, John Wiley and
Sons, 1999.
• Text To Speech Synthesis: New Paradigms and Advances, S. Narayanan
and A. Alwan (eds.), Prentice Hall Inc., 2004.
• Text-to-Speech Synthesis, P. Taylor, Cambridge University Press, 2008.
Speech Recognition and Natural Language Processing
• Fundamentals of Speech Recognition, L. R. Rabiner and B. H. Juang,
Prentice Hall Inc., 1993.
• Connectionist Speech Recognition: A Hybrid Approach, H. A. Bourlard
and N. Morgan, Kluwer Academic Publishers, 1994.
• Automatic Speech and Speaker Recognition, C. H. Lee, F. K. Soong and
K. K. Paliwal (eds.), Kluwer Academic Publisher, 1996.
• Statistical Methods for Speech Recognition, F. Jelinek, MIT Press, 1998.
• Foundations of Statistical Natural Language Processing, C. D. Manning
and H. Schutze, MIT Press, 1999.
• Spoken Language Processing, X. Huang, A. Acero and H.-W. Hon,
Prentice Hall Inc., 2000.
• Speech and Language Processing, D. Jurafsky and J. H. Martin, Prentice
Hall Inc., 2000.
• Mathematical Models for Speech Technology, S. E. Levinson, John Wiley
and Sons, 2005.
Speech Enhancement
• Digital Speech Transmission, Enhancement, Coding and Error
Concealment, P. Vary and R. Martin, John Wiley and Sons, Ltd., 2006.
• Speech Enhancement, Theory and Practice, P. C. Loizou, CRC Press,
2007.
Audio Processing
• Applications of Digital Signal Processing to Audio and Acoustics,
M. Kahrs and K. Brandenburg (eds.), Kluwer Academic Publishers,
1998.
• Audio Signal Processing and Coding, A. Spanias, T. Painter and V. Atti,
John Wiley and Sons, 2007.
1 Introduction
The fundamental purpose of speech is communication, i.e., the transmission of messages. According to Shannon’s information theory [116], a message represented as a sequence of discrete symbols can be quantified by its information content in bits, and the rate of transmission of information is measured in bits/second (bps). In speech production, as well as in many human-engineered electronic communication systems, the information to be transmitted is encoded in the form of a continuously varying (analog) waveform that can be transmitted, recorded, manipulated, and ultimately decoded by a human listener. In the case of speech, the fundamental analog form of the message is an acoustic waveform, which we call the speech signal. Speech signals, as illustrated in Figure 1.1, can be converted to an electrical waveform by a microphone, further manipulated by both analog and digital signal processing, and then converted back to acoustic form by a loudspeaker, a telephone handset or headphone, as desired. This form of speech processing is, of course, the basis for Bell’s telephone invention as well as today’s multitude of devices for recording, transmitting, and manipulating speech and audio signals. Although Bell made his invention without knowing the fundamentals of information theory, these ideas have assumed great importance in the design of sophisticated modern communications systems. Therefore, even though our main focus will be mostly on the speech waveform and its representation in the form of parametric models, it is nevertheless useful to begin with a discussion of how information is encoded in the speech waveform.

1.1 The Speech Chain

Figure 1.2 shows the complete process of producing and perceiving speech from the formulation of a message in the brain of a talker, to the creation of the speech signal, and finally to the understanding of the message by a listener. In their classic introduction to speech science, Denes and Pinson aptly referred to this process as the “speech chain” [29]. The process starts in the upper left as a message represented somehow in the brain of the speaker. The message information can be thought of as having a number of different representations during the process of speech production (the upper path in Figure 1.2). For example the message could be represented initially as English text. In order to “speak” the message, the talker implicitly converts the text into a symbolic representation of the sequence of sounds corresponding to the spoken version of the text. This step, called the language code generator in Figure 1.2, converts text symbols to phonetic symbols (along with stress and durational information) that describe the basic sounds of a spoken version of the message and the manner (i.e., the speed and emphasis) in which the sounds are intended to be produced. As an example, the text “should we chase” is represented phonetically (in ARPAbet symbols) as [SH UH D — W IY — CH EY S]. (See Chapter 2 for more discussion of phonetic transcription.) Thus, the segments of the waveform of Figure 1.1 are labeled with phonetic symbols using a computer-keyboard-friendly code called ARPAbet.¹

Fig. 1.1 A speech waveform with phonetic labels for the text message “Should we chase.” (The waveform, plotted against time in seconds over about 0.6 s, is segmented into the phones SH, UH, D, W, IY, CH, EY, and S.)

¹ The International Phonetic Association (IPA) provides a set of rules for phonetic transcription using an equivalent set of specialized symbols. The ARPAbet code does not require special fonts and is thus more convenient for computer applications.

The third step in the speech production process is the conversion to “neuromuscular controls,” i.e., the set of control signals that direct the neuromuscular system to move the speech articulators, namely the tongue, lips, teeth, jaw and velum, in a manner that is consistent with the sounds of the desired spoken message and with the desired degree of emphasis. The end result of the neuromuscular controls step is a set of articulatory motions (continuous control) that cause the vocal tract articulators to move in a prescribed manner in order to create the desired sounds. Finally, the last step in the speech production process is the “vocal tract system” that physically creates the necessary sound sources and the appropriate vocal tract shapes over time so as to create an acoustic waveform, such as the one shown in Figure 1.1, that encodes the information in the desired message into the speech signal.

Fig. 1.2 The Speech Chain: from message, to speech signal, to understanding. (Speech production, along the top path: message formulation → language code → neuro-muscular controls → vocal tract system, with the information rate rising from about 50 bps for text, to 200 bps for phonemes and prosody, to 2000 bps for articulatory controls, to 64–700 kbps for the acoustic waveform; speech perception, along the bottom path: basilar membrane motion → neural transduction → language translation → message understanding; a transmission channel connects the two paths.)

To determine the rate of information flow during speech production, assume that there are about 32 symbols (letters) in the language (in English there are 26 letters, but if we include simple punctuation we get a count closer to 32 = 2^5 symbols). Furthermore, the rate of speaking for most people is about 10 symbols per second (somewhat on the high side, but still acceptable for a rough information rate estimate). Hence, assuming independent letters as a simple approximation, we estimate the base information rate of the text message as about 50 bps (5 bits per symbol times 10 symbols per second). At the second stage of the process, where the text representation is converted into phonemes and prosody (e.g., pitch and stress) markers, the information rate is estimated to increase by a factor of 4 to about 200 bps. For example, the ARPAbet phonetic symbol set used to label the speech sounds in Figure 1.1 contains approximately 64 = 2^6 symbols, or about 6 bits/phoneme (again a rough approximation assuming independence of phonemes). In Figure 1.1, there are 8 phonemes in approximately 600 ms. This leads to an estimate of 8 × 6/0.6 = 80 bps. Additional information required to describe prosodic features of the signal (e.g., pitch, duration, loudness) could easily add 100 bps to the total information rate for a message encoded as a speech signal.

The information representations for the first two stages in the speech chain are discrete, so we can readily estimate the rate of information flow with some simple assumptions. For the next stage in the speech production part of the speech chain, the representation becomes continuous (in the form of control signals for articulatory motion). If they could be measured, we could estimate the spectral bandwidth of these control signals and appropriately sample and quantize them to obtain equivalent digital signals for which the data rate could be estimated. The articulators move relatively slowly compared to the time variation of the resulting acoustic waveform. Estimates of bandwidth and required accuracy suggest that the total data rate of the sampled articulatory control signals is about 2000 bps [34]. Thus, the original text message is represented by a set of continuously varying signals whose digital representation requires a much higher data rate than the information rate that we estimated for transmission of the message as a speech signal.²

² Note that we introduce the term data rate for digital representations to distinguish it from the inherent information content of the message represented by the speech signal.

Finally, as we will see later, the data rate of the digitized speech waveform at the end of the speech production part of the speech chain can be anywhere from 64,000 to more than 700,000 bps. We arrive at such numbers by examining the sampling rate and quantization required to represent the speech signal with a desired perceptual fidelity. For example, “telephone quality” requires that a bandwidth of 0–4 kHz be preserved, implying a sampling rate of 8000 samples/s. Each sample can be quantized with 8 bits on a log scale, resulting in a bit rate of 64,000 bps. This representation is highly intelligible (i.e., humans can readily extract the message from it), but to most listeners it will sound different from the original speech signal uttered by the talker. On the other hand, the speech waveform can be represented with “CD quality” using a sampling rate of 44,100 samples/s with 16 bit samples, or a data rate of 705,600 bps. In this case, the reproduced acoustic signal will be virtually indistinguishable from the original speech signal.

As we move from text to speech waveform through the speech chain, the result is an encoding of the message that can be effectively transmitted by acoustic wave propagation and robustly decoded by the hearing mechanism of a listener. The above analysis of data rates shows that as we move from text to sampled speech waveform, the data rate can increase by a factor of 10,000. Part of this extra information represents characteristics of the talker such as emotional state, speech mannerisms, accent, etc., but much of it is due to the inefficiency of simply sampling and finely quantizing analog signals. Thus, motivated by an awareness of the low intrinsic information rate of speech, a central theme of much of digital speech processing is to obtain a digital representation with lower data rate than that of the sampled waveform.

The complete speech chain consists of a speech production/generation model, of the type discussed above, as well as a speech perception/recognition model, as shown progressing to the left in the bottom half of Figure 1.2. The speech perception model shows the series of steps from capturing speech at the ear to understanding the message encoded in the speech signal. The first step is the effective conversion of the acoustic waveform to a spectral representation. This is done within the inner ear by the basilar membrane, which acts as a nonuniform spectrum analyzer by spatially separating the spectral components of the incoming speech signal and thereby analyzing them by what amounts to a nonuniform filter bank. The next step in the speech perception process is a neural transduction of the spectral features into a set of sound features (or distinctive features as they are referred to in the area of linguistics) that can be decoded and processed by the brain. The next step in the process is a conversion of the sound features into the set of phonemes, words, and sentences associated with the incoming message by a language translation process in the human brain. Finally, the last step in the speech perception model is the conversion of the phonemes, words and sentences of the message into an understanding of the meaning of the basic message in order to be able to respond to or take some appropriate action. Our fundamental understanding of the processes in most of the speech perception modules in Figure 1.2 is rudimentary at best, but it is generally agreed that some physical correlate of each of the steps in the speech perception model occurs within the human brain, and thus the entire model is useful for thinking about the processes that occur.

There is one additional process shown in the diagram of the complete speech chain in Figure 1.2 that we have not discussed — namely the transmission channel between the speech generation and speech perception parts of the model. In its simplest embodiment, this transmission channel consists of just the acoustic wave connection between
a speaker and a listener who are in a common space. It is essential to include this transmission channel in our model for the speech chain since it includes real world noise and channel distortions that make speech and message understanding more diﬃcult in real communication environments. More interestingly for our purpose here — it is in this domain that we ﬁnd the applications of digital speech processing.
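Before turning to applications, the back-of-envelope information-rate estimates in this section are easy to check. The short calculation below (a sketch using the chapter's own round numbers) walks the same chain from text to sampled waveform:

```python
import math

# Text stage: ~32 symbols (letters plus simple punctuation) at ~10 symbols/s
bits_per_symbol = math.log2(32)              # 5 bits/symbol
text_rate = bits_per_symbol * 10             # -> about 50 bps

# Phonetic stage: an ARPAbet-sized inventory of ~64 symbols,
# with 8 phonemes in the roughly 600 ms utterance of Figure 1.1
bits_per_phoneme = math.log2(64)             # 6 bits/phoneme
phoneme_rate = 8 * bits_per_phoneme / 0.6    # -> about 80 bps

# Sampled-waveform stage: telephone quality vs. CD quality
telephone_rate = 8000 * 8                    # 8 kHz, 8-bit log PCM -> 64,000 bps
cd_rate = 44100 * 16                         # 44.1 kHz, 16-bit -> 705,600 bps

print(text_rate, round(phoneme_rate), telephone_rate, cd_rate)
```

The jump from tens of bits per second at the message level to hundreds of kilobits per second at the waveform level is precisely the factor-of-10,000 gap that speech coding seeks to close.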
1.2 Applications of Digital Speech Processing
The first step in most applications of digital speech processing is to convert the acoustic waveform to a sequence of numbers. Most modern A-to-D converters operate by sampling at a very high rate, applying a digital lowpass filter with cutoff set to preserve a prescribed bandwidth, and then reducing the sampling rate to the desired sampling rate, which can be as low as twice the cutoff frequency of the sharp-cutoff digital filter. This discrete-time representation is the starting point for most applications. From this point, other representations are obtained by digital processing. For the most part, these alternative representations are based on incorporating knowledge about the workings of the speech chain as depicted in Figure 1.2. As we will see, it is possible to incorporate aspects of both the speech production and speech perception process into the digital representation and processing. It is not an oversimplification to assert that digital speech processing is grounded in a set of techniques that have the goal of pushing the data rate of the speech representation to the left along either the upper or lower path in Figure 1.2. The remainder of this chapter is devoted to a brief summary of the applications of digital speech processing, i.e., the systems that people interact with daily. Our discussion will confirm the importance of the digital representation in all application areas.
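The sample-high-then-decimate strategy just described can be sketched in a few lines. The following is an illustrative windowed-sinc design (the filter length and window are arbitrary choices, not any particular converter's filter), reducing a high-rate capture to telephone bandwidth:

```python
import numpy as np

def downsample_after_lowpass(x, fs_in, fs_out, numtaps=101):
    """Low-pass at the new Nyquist frequency with a windowed-sinc FIR
    filter, then keep every M-th sample. Assumes fs_in is an integer
    multiple of fs_out."""
    M = fs_in // fs_out
    cutoff = 0.5 * fs_out / fs_in                     # cycles/sample
    n = np.arange(numtaps) - (numtaps - 1) / 2
    h = 2 * cutoff * np.sinc(2 * cutoff * n) * np.hamming(numtaps)
    y = np.convolve(x, h, mode="same")                # low-pass filter
    return y[::M]                                     # reduce the rate

# e.g., a 48 kHz high-rate capture reduced to an 8 kHz telephone-band signal
fs_in, fs_out = 48000, 8000
t = np.arange(fs_in) / fs_in
x = np.sin(2 * np.pi * 440 * t)                       # 440 Hz test tone
y = downsample_after_lowpass(x, fs_in, fs_out)
```

A 440 Hz tone lies well inside the 4 kHz passband, so it survives the rate reduction essentially unchanged, while any energy above the new Nyquist frequency would be attenuated before the downsampling step rather than aliased into the band.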
1.2.1 Speech Coding
Perhaps the most widespread applications of digital speech processing technology occur in the areas of digital transmission and storage
Fig. 1.3 Speech coding block diagram — encoder and decoder. (Encoder: analog speech xc(t) → A-to-D converter → samples x[n] → analysis/encoding → data y[n] → channel or medium; decoder: received data ŷ[n] → synthesis/decoding → samples x̂[n] → D-to-A converter → decoded signal x̂c(t).)
of speech signals. In these areas the centrality of the digital representation is obvious, since the goal is to compress the digital waveform representation of speech into a lower bit-rate representation. It is common to refer to this activity as “speech coding” or “speech compression.” Figure 1.3 shows a block diagram of a generic speech encoding/decoding (or compression) system. In the upper part of the figure, the A-to-D converter converts the analog speech signal xc(t) to a sampled waveform representation x[n]. The digital signal x[n] is analyzed and coded by digital computation algorithms to produce a new digital signal y[n] that can be transmitted over a digital communication channel or stored in a digital storage medium as ŷ[n]. As we will see, there are a myriad of ways to do the encoding so as to reduce the data rate over that of the sampled and quantized speech waveform x[n]. Because the digital representation at this point is often not directly related to the sampled speech waveform, y[n] and ŷ[n] are appropriately referred to as data signals that represent the speech signal. The lower path in Figure 1.3 shows the decoder associated with the speech coder. The received data signal ŷ[n] is decoded using the inverse of the analysis processing, giving the sequence of samples x̂[n], which is then converted (using a D-to-A converter) back to an analog signal x̂c(t) for human listening. The decoder is often called a synthesizer because it must reconstitute the speech waveform from data that may bear no direct relationship to the waveform.
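As a minimal concrete instance of the encoder/decoder pair in Figure 1.3, the sketch below implements μ-law companding, the logarithmic 8-bit quantization used in digital telephony (cf. the instantaneous companding studied in [118]). Real speech coders are far more elaborate, but the encode/decode structure is the same:

```python
import numpy as np

MU = 255.0  # standard mu-law constant for 8-bit telephone PCM

def encode(x):
    """Analysis/encoding: compress x in [-1, 1] with the mu-law curve,
    then quantize to an 8-bit code y[n]."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return np.round((y + 1) / 2 * 255).astype(np.uint8)

def decode(code):
    """Synthesis/decoding: invert the quantized mu-law mapping."""
    y = code.astype(np.float64) / 255 * 2 - 1
    return np.sign(y) * ((1 + MU) ** np.abs(y) - 1) / MU

x = np.linspace(-1.0, 1.0, 2001)      # stand-in for speech samples x[n]
x_hat = decode(encode(x))             # x-hat[n] after a perfect channel
err = np.abs(x_hat - x)
# quantization error is small everywhere and smallest near zero amplitude,
# which is where speech samples concentrate
```

At 8000 samples/s this coder produces exactly the 64,000 bps telephone-quality representation discussed earlier; the logarithmic curve spends its quantization levels where speech amplitudes are most likely.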
With carefully designed error protection coding of the digital representation, the transmitted (y[n]) and received (ŷ[n]) data can be essentially identical. This is the quintessential feature of digital coding. In theory, perfect transmission of the coded digital representation is possible even under very noisy channel conditions, and in the case of digital storage, it is possible to store a perfect copy of the digital representation in perpetuity if sufficient care is taken to update the storage medium as storage technology advances. This means that the speech signal can be reconstructed to within the accuracy of the original coding for as long as the digital representation is retained. In either case, the goal of the speech coder is to start with samples of the speech signal and reduce (compress) the data rate required to represent the speech signal while maintaining a desired perceptual fidelity. The compressed representation can be more efficiently transmitted or stored, or the bits saved can be devoted to error protection. Speech coders enable a broad range of applications including narrowband and broadband wired telephony, cellular communications, voice over internet protocol (VoIP) (which utilizes the internet as a real-time communications medium), secure voice for privacy and encryption (for national security applications), extremely narrowband communications channels (such as battlefield applications using high frequency (HF) radio), and storage of speech for telephone answering machines, interactive voice response (IVR) systems, and prerecorded messages. Speech coders often utilize many aspects of both the speech production and speech perception processes, and hence may not be useful for more general audio signals such as music. Coders that are based on incorporating only aspects of sound perception generally do not achieve as much compression as those based on speech production, but they are more general and can be used for all types of audio signals.
These coders are widely deployed in MP3 and AAC players and for audio in digital television systems [120].

1.2.2 Text-to-Speech Synthesis
For many years, scientists and engineers have studied the speech production process with the goal of building a system that can start with text and produce speech automatically. In a sense, a text-to-speech synthesizer such as the one depicted in Figure 1.4 is a digital simulation of the entire upper part of the speech chain diagram. The input to the system is ordinary text such as an email message or an article from a newspaper or magazine. The first block in the text-to-speech synthesis system, labeled linguistic rules, has the job of converting the printed text input into a set of sounds that the machine must synthesize. The conversion from text to sounds involves a set of linguistic rules that must determine the appropriate set of sounds (perhaps including things like emphasis, pauses, rates of speaking, etc.) so that the resulting synthetic speech will express the words and intent of the text message in what passes for a natural voice that can be decoded accurately by human speech perception. This is more difficult than simply looking up the words in a pronouncing dictionary because the linguistic rules must determine how to pronounce acronyms, how to pronounce ambiguous words like read or object, how to pronounce abbreviations like St. (street or Saint) and Dr. (Doctor or drive), and how to properly pronounce proper names, specialized terms, etc.

Once the proper pronunciation of the text has been determined, the role of the synthesis algorithm is to create the appropriate sound sequence to represent the text message in the form of speech. In essence, the synthesis algorithm must simulate the action of the vocal tract system in creating the sounds of speech. There are many procedures for assembling the speech sounds and compiling them into a proper sentence, but the most promising one today is called "unit selection and concatenation." In this method, the computer stores multiple versions of each of the basic units of speech (phones, half phones, syllables, etc.), and then decides which sequence of speech units sounds best for the particular text message that is being produced. The basic digital representation is not generally the sampled speech wave. Instead, some sort of compressed representation is normally used to save memory and, more importantly, to allow convenient manipulation of durations and blending of adjacent sounds. Thus, the speech synthesis algorithm would include an appropriate decoder, as discussed in Section 1.2.1, whose output is converted to an analog representation via the D-to-A converter.

Fig. 1.4 Text-to-speech synthesis system block diagram (text → linguistic rules → synthesis algorithm → D-to-A converter → speech).

Text-to-speech synthesis systems are an essential component of modern human–machine communications systems and are used to do things like read email messages over a telephone, provide voice output from GPS systems in automobiles, provide the voices for talking agents for completion of transactions over the internet, handle call center help desks and customer care applications, serve as the voice for providing information from handheld devices such as foreign language phrasebooks, dictionaries, and crossword puzzle helpers, and act as the voice of announcement machines that provide information such as stock quotes, airline schedules, and updates on arrivals and departures of flights. Another important application is in reading machines for the blind, where an optical character recognition system provides the text input to a speech synthesis system.

1.2.3 Speech Recognition and Other Pattern Matching Problems

Another large class of digital speech processing applications is concerned with the automatic extraction of information from the speech signal. Most such systems involve some sort of pattern matching. Figure 1.5 shows a block diagram of a generic approach to pattern matching problems in speech processing. Such problems include the following: speech recognition, where the object is to extract the message from the speech signal; speaker recognition, where the goal is to identify who is speaking; speaker verification, where the goal is to verify a speaker's claimed identity from analysis of their speech signal; word spotting, which involves monitoring a speech signal for the occurrence of specified words or phrases; and automatic indexing of speech recordings based on recognition (or spotting) of spoken keywords.

Fig. 1.5 Block diagram of general pattern matching system for speech signals (speech → A-to-D converter → feature analysis → pattern matching → symbols).

The first block in the pattern matching system converts the analog speech waveform to digital form using an A-to-D converter. The feature analysis module then converts the sampled speech signal to a set of feature vectors. Often, the same analysis techniques that are used in speech coding are also used to derive the feature vectors. The final block in the system, namely the pattern matching block, dynamically time aligns the set of feature vectors representing the speech signal with a concatenated set of stored patterns, and chooses the identity associated with the pattern which is the closest match to the time-aligned set of feature vectors of the speech signal. The symbolic output consists of a set of recognized words, in the case of speech recognition; the identity of the best matching talker, in the case of speaker recognition; or a decision as to whether to accept or reject the identity claim of a speaker, in the case of speaker verification.

Although the block diagram of Figure 1.5 represents a wide range of speech pattern matching problems, the biggest use has been in the area of recognition and understanding of speech in support of human–machine communication by voice. The major areas where such a system finds applications include command and control of computer software; voice dictation to create letters, memos, and other documents; natural language voice dialogues with machines to enable help desks and call centers; and agent services such as calendar entry and update, address list modification and entry, etc.

Pattern recognition applications often occur in conjunction with other digital speech processing applications. For example, one of the preeminent uses of speech technology is in portable communication devices. Speech coding at bit rates on the order of 8 kbps enables normal voice conversations in cell phones, and spoken name speech recognition in cellphones enables voice dialing capability that can automatically dial the number associated with the recognized name. Names from directories with upwards of several hundred names can readily be recognized and dialed using simple speech recognition technology.
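The "dynamically time aligns ... closest match" step above is classically realized with dynamic time warping (DTW). The sketch below is a generic illustration, not code from this text; the toy one-dimensional "feature vectors" and template names are invented, and a real system would use spectral feature vectors derived much as in speech coding.

```python
import numpy as np

def dtw_distance(a, b):
    """Minimum-cost alignment of feature-vector sequences a (Ta x D) and
    b (Tb x D) using classic dynamic time warping with unit step costs."""
    ta, tb = len(a), len(b)
    D = np.full((ta + 1, tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # local frame distance
            D[i, j] = cost + min(D[i - 1, j],       # stretch the template
                                 D[i, j - 1],       # stretch the input
                                 D[i - 1, j - 1])   # step both sequences
    return D[ta, tb]

def recognize(features, templates):
    """Return the label of the stored pattern with the closest DTW match."""
    return min(templates, key=lambda name: dtw_distance(features, templates[name]))

# Toy 1-D "features": a time-stretched pattern still matches its own class.
templates = {"rise": np.array([[0.], [1.], [2.], [3.]]),
             "fall": np.array([[3.], [2.], [1.], [0.]])}
query = np.array([[0.], [0.], [1.], [1.], [2.], [3.]])   # a slow "rise"

print(recognize(query, templates))  # prints "rise"
```

The warping is what makes the comparison robust to differences in speaking rate: the slow six-frame "rise" aligns to the four-frame template at zero cost.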
Another major speech application that has long been a dream of speech researchers is automatic language translation. The goal of language translation systems is to convert spoken words in one language to spoken words in another language so as to facilitate natural language voice dialogues between people speaking different languages. Language translation technology requires speech synthesis systems that work in both languages, along with speech recognition (and generally natural language understanding) that also works for both languages; hence it is a very difficult task and one for which only limited progress has been made. When such systems exist, it will be possible for people speaking different languages to communicate at data rates on the order of that of printed text reading!

1.2.4 Other Speech Applications

The range of speech communication applications is illustrated in Figure 1.6. As seen in this figure, the techniques of digital speech processing are a key ingredient of a wide range of applications that include the three areas of transmission/storage, speech synthesis, and speech recognition, as well as many others such as speaker identification, speech signal quality enhancement, and aids for the hearing- or visually-impaired.

Fig. 1.6 Range of speech communication applications (digital speech processing techniques applied to transmission and storage, synthesis of speech, speech recognition, speaker verification/identification, enhancement of speech quality, and aids for the handicapped).

The block diagram in Figure 1.7 represents any system where time signals such as speech are processed by the techniques of DSP. This figure simply depicts the notion that once the speech signal is sampled, it can be manipulated in virtually limitless ways by DSP techniques. Here again, manipulations and modifications of the speech signal are usually achieved by transforming the speech signal into an alternative representation (one that is motivated by our understanding of speech production and speech perception), operating on that representation by further digital computation, and then transforming back to the waveform domain using a D-to-A converter.

Fig. 1.7 General block diagram for application of digital signal processing to speech signals (speech → A-to-D converter → computer algorithm → D-to-A converter → speech).

One important application area is speech enhancement, where the goal is to remove or suppress noise, echo, or reverberation picked up by a microphone along with the desired speech signal. In human-to-human communication, the goal of speech enhancement systems is to make the speech more intelligible and more natural; in reality, however, the best that has been achieved so far is less perceptually annoying speech that essentially maintains, but does not improve, the intelligibility of the noisy speech. Success has been achieved, however, in making distorted speech signals more useful for further processing as part of a speech coder, synthesizer, or recognizer. An excellent reference in this area is the recent textbook by Loizou [72]. Other examples of manipulation of the speech signal include time-scale modification to align voices with video segments, to modify voice qualities, and to speed up or slow down prerecorded speech (e.g., for talking books, rapid review of voice mail messages, or careful scrutiny of spoken material). These and many more examples all rely on the basic principles of digital speech processing, which we will discuss in the remainder of this text.

1.3 Our Goal for this Text

We have discussed the speech signal and how it encodes information for human communication, and we have hinted at some of the possibilities that exist for the future.
We have given a brief overview of the way in which digital speech processing is being applied today. We make no pretense of exhaustive coverage, however; the subject is too broad and too deep, and we will not be able to go into great depth, nor will we be able to cover all the possible applications of digital speech processing techniques. Instead, our focus is on the fundamentals of digital speech processing and their application to coding, synthesis, and recognition. This means that some of the latest algorithmic innovations and applications will not be discussed, not because they are not interesting, but simply because there are so many fundamental tried-and-true techniques that remain at the core of digital speech processing. Our goal is only to provide an up-to-date introduction to this fascinating field. We hope that this text will stimulate readers to investigate the subject in greater depth using the extensive set of references provided.
2 The Speech Signal

As the discussion in Chapter 1 shows, the goal in many applications of digital speech processing techniques is to move the digital representation of the speech signal from the waveform samples back up the speech chain toward the message. To gain a better idea of what this means, this chapter provides a brief overview of the phonetic representation of speech and an introduction to models for the production of the speech signal.

2.1 Phonetic Representation of Speech

Speech can be represented phonetically by a finite set of symbols called the phonemes of the language, the number of which depends upon the language and the refinement of the analysis. For most languages the number of phonemes is between 32 and 64. A condensed inventory of the sounds of speech in the English language is given in Table 2.1, where the phonemes are denoted by a set of ASCII symbols called the ARPAbet. Table 2.1 also includes some simple examples of ARPAbet transcriptions of words containing each of the phonemes of English. Additional phonemes can be added to Table 2.1 to account for allophonic variations and events such as glottal stops and pauses.
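A toy illustration of this kind of phonemic representation, in the style of the ARPAbet transcriptions of Table 2.1: the three-word dictionary and the `transcribe` helper below are hand-built assumptions for this sketch; a real system would consult a full pronouncing dictionary such as the CMU Pronouncing Dictionary.

```python
# Hand-built mini-dictionary of ARPAbet transcriptions (illustrative only).
MINI_DICT = {
    "should": ["SH", "UH", "D"],
    "we":     ["W", "IY"],
    "chase":  ["CH", "EY", "S"],
}

def transcribe(text):
    """Map each word to its ARPAbet phoneme string, marking unknown words."""
    out = []
    for word in text.lower().split():
        out.append("[" + " ".join(MINI_DICT.get(word, ["?"])) + "]")
    return " ".join(out)

print(transcribe("should we chase"))  # prints: [SH UH D] [W IY] [CH EY S]
```

Note how few symbols carry the message: three words collapse to eight phonemes, which is the sense in which the phonemic representation sits far "up the speech chain" from the waveform samples.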
Table 2.1 Condensed list of ARPAbet phonetic symbols for North American English.(a)

Class                  ARPAbet  Example  Transcription
Vowels and diphthongs  IY       beet     [B IY T]
                       IH       bit      [B IH T]
                       EY       bait     [B EY T]
                       EH       bet      [B EH T]
                       AE       bat      [B AE T]
                       AA       bob      [B AA B]
                       AO       born     [B AO R N]
                       UH       book     [B UH K]
                       OW       boat     [B OW T]
                       UW       boot     [B UW T]
                       AH       but      [B AH T]
                       ER       bird     [B ER D]
                       AY       buy      [B AY]
                       AW       down     [D AW N]
                       OY       boy      [B OY]
Glides                 Y        you      [Y UH]
                       W        wit      [W IH T]
Liquids                R        rent     [R EH N T]
                       L        let      [L EH T]
Nasals                 M        met      [M EH T]
                       N        net      [N EH T]
                       NG       sing     [S IH NG]
Stops                  P        pat      [P AE T]
                       B        bet      [B EH T]
                       T        ten      [T EH N]
                       D        debt     [D EH T]
                       K        kit      [K IH T]
                       G        get      [G EH T]
Fricatives             HH       hat      [HH AE T]
                       F        fat      [F AE T]
                       V        vat      [V AE T]
                       TH       thing    [TH IH NG]
                       DH       that     [DH AE T]
                       S        sat      [S AE T]
                       Z        zoo      [Z UW]
                       SH       shut     [SH AH T]
                       ZH       azure    [AE ZH ER]
Affricates             CH       chase    [CH EY S]
                       JH       judge    [JH AH JH]

(a) This set of 39 phonemes is used in the CMU Pronouncing Dictionary, available online at http://www.speech.cs.cmu.edu/cgi-bin/cmudict.

Figure 1.1 on p. 4 shows how the sounds corresponding to the text "should we chase" are encoded into a speech waveform. We see that, for the most part, phonemes have a distinctive appearance in the speech waveform. Thus sounds like /SH/ and /S/ look like (spectrally shaped) random noise, while the vowel sounds /UH/, /IY/, and /EY/ are highly structured and quasi-periodic. These differences result from the distinctively different ways that these sounds are produced.

2.2 Models for Speech Production

A schematic longitudinal cross-sectional drawing of the human vocal tract mechanism is given in Figure 2.1 [35]. This diagram highlights the essential physical features of human anatomy that enter into the final stages of the speech production process. It shows the vocal tract as a tube of nonuniform cross-sectional area that is bounded at one end by the vocal cords and at the other by the mouth opening. This tube serves as an acoustic transmission system for sounds generated inside the vocal tract. For creating nasal sounds like /M/, /N/, or /NG/, a side-branch tube, called the nasal tract, is connected to the main acoustic branch by the trapdoor action of the velum. This branch path radiates sound at the nostrils. The shape (variation of cross-section along the axis) of the vocal tract varies with time due to motions of the lips, jaw, tongue, and velum. Although the actual human vocal tract is not laid out along a straight line as in Figure 2.1, this type of model is a reasonable approximation for wavelengths of the sounds in speech.

Fig. 2.1 Schematic model of the vocal tract system. (After Flanagan et al. [35].)

The sounds of speech are generated in the system of Figure 2.1 in several ways.
Voiced sounds (vowels, liquids, glides, and nasals in Table 2.1) are produced when the vocal tract tube is excited by pulses of air pressure resulting from quasi-periodic opening and closing of the glottal orifice (the opening between the vocal cords). Examples in Figure 1.1 are the vowels /UH/, /IY/, and /EY/, and the liquid consonant /W/. Unvoiced sounds are produced by creating a constriction somewhere in the vocal tract tube and forcing air through that constriction, thereby creating turbulent air flow, which acts as a random noise excitation of the vocal tract tube. Examples are the unvoiced fricative sounds such as /SH/ and /S/. A third sound production mechanism occurs when the vocal tract is partially closed off, causing turbulent flow due to the constriction while at the same time allowing quasi-periodic flow due to vocal cord vibrations. Sounds produced in this manner include the voiced fricatives /V/, /DH/, /Z/, and /ZH/. Finally, plosive sounds such as /P/, /T/, and /K/ and affricates such as /CH/ are formed by momentarily closing off air flow, allowing pressure to build up behind the closure, and then abruptly releasing the pressure.

All these excitation sources create a wide-band excitation signal to the vocal tract tube. The sounds created in the vocal tract are shaped in the frequency domain by the frequency response of the vocal tract, which acts as an acoustic transmission line with certain vocal tract shape-dependent resonances that tend to emphasize some frequencies of the excitation relative to others. These resonance frequencies are called the formant frequencies of the sound [32, 34]. The resonance frequencies resulting from a particular configuration of the articulators are instrumental in forming the sound corresponding to a given phoneme.

As discussed in Chapter 1 and illustrated by the waveform in Figure 1.1, the general character of the speech signal varies at the phoneme rate, which is on the order of 10 phonemes per second, while the detailed time variations of the speech waveform are at a much higher rate. That is, the changes in vocal tract configuration occur relatively slowly compared to the detailed time variation of the speech signal. In summary, the fine structure of the time waveform is created by the sound sources in the vocal tract, and the resonances of the vocal tract tube shape these sound sources into the phonemes.

The system of Figure 2.1 can be described by acoustic theory, and numerical techniques can be used to create a complete physical simulation of sound generation and transmission in the vocal tract [36, 93]. However, since the vocal tract changes shape relatively slowly, it is reasonable to assume that the linear system response does not vary over time intervals on the order of 10 ms or so. Thus, it is sufficient to model the production of a sampled speech signal by a discrete-time system model such as the one depicted in Figure 2.2. In general such a model is called a source/system model of speech production. The excitation generator on the left simulates the different modes of sound generation in the vocal tract, and samples of the speech signal are assumed to be the output of the time-varying linear system on the right. The short-time frequency response of the discrete-time, time-varying linear system simulates the frequency shaping of the vocal tract system.

Fig. 2.2 Source/system model for a speech signal (excitation parameters → excitation generator → excitation signal; vocal tract parameters → linear system → speech signal).

In detailed modeling of speech production [32, 34, 64], it is common to characterize the discrete-time linear system by a system function of the form:

    H(z) = \frac{\sum_{k=0}^{M} b_k z^{-k}}{1 - \sum_{k=1}^{N} a_k z^{-k}}
         = b_0 \, \frac{\prod_{k=1}^{M} (1 - d_k z^{-1})}{\prod_{k=1}^{N} (1 - c_k z^{-1})},    (2.1)

where the filter coefficients a_k and b_k (labeled as vocal tract parameters in Figure 2.2) change at a rate on the order of 50–100 times/s. Some of the poles (c_k) of the system function lie close to the unit circle and create resonances to model the formant frequencies. It is sometimes useful to employ zeros (d_k) of the system function to model nasal and fricative sounds. However, many applications of the source/system model only include poles in the model because this simplifies the analysis required to estimate the parameters of the model from the speech signal.

The box labeled excitation generator in Figure 2.2 creates an appropriate excitation for the type of sound being produced. For voiced speech, the excitation to the linear system is a quasi-periodic sequence of discrete (glottal) pulses that look very much like those shown in the right-hand half of the excitation signal waveform in Figure 2.2. The individual finite-duration glottal pulses have a lowpass spectrum that depends on a number of factors [105], and the periodic sequence of smooth glottal pulses has a harmonic line spectrum with components that decrease in amplitude with increasing frequency. The fundamental frequency of the glottal excitation determines the perceived pitch of the voice. Often it is convenient to absorb the glottal pulse spectrum contribution into the vocal tract system model of (2.1); this can be achieved by a small increase in the order of the denominator over what would be needed to represent the formant resonances. For unvoiced speech, the linear system is excited by a random number generator that produces a discrete-time noise signal with a flat spectrum, as shown in the left-hand half of the excitation signal. In either case, the linear system imposes its frequency response on the excitation spectrum to create the speech sounds. The excitation in Figure 2.2 switches from unvoiced to voiced, leading to the speech signal output as shown in the figure.

This model of speech as the output of a slowly time-varying digital filter, with an excitation that captures the nature of the voiced/unvoiced distinction in speech production, is the basis for thinking about the speech signal, and a wide variety of digital representations of the speech signal are based on it. By assuming that the properties of the speech signal (and the model) are constant over short time intervals, it is possible to compute/measure/estimate the parameters of the model by analyzing short blocks of samples of the speech signal. In this way, as we discuss further in Chapter 4, the speech signal is represented by the parameters of the model instead of the sampled waveform. It is through such models and analysis techniques that we are able to build properties of the speech production process into digital representations of the speech signal.
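The source/system model can be sketched directly from the all-pole special case of (2.1): a quasi-periodic impulse train (voiced) or flat-spectrum noise (unvoiced) excites an all-pole filter whose conjugate pole pairs place the formant resonances. Everything below (sampling rate, formant frequencies, bandwidths, segment lengths, helper names) is an assumed illustration, not a design from this text.

```python
import numpy as np

np.random.seed(0)
fs = 8000  # sampling rate in Hz (an assumed value)

def vocal_tract_poly(formants, bandwidths):
    """Build the denominator polynomial A(z) from one conjugate pole pair
    c_k = r * exp(+/- j * 2*pi*F/fs) per formant. In the notation of (2.1),
    A(z) = 1 - sum_k a_k z^{-k}, so a_k = -A[k] for k >= 1."""
    A = np.array([1.0])
    for F, B in zip(formants, bandwidths):
        r = np.exp(-np.pi * B / fs)          # pole radius sets the bandwidth
        theta = 2.0 * np.pi * F / fs         # pole angle sets the formant frequency
        A = np.convolve(A, [1.0, -2.0 * r * np.cos(theta), r * r])
    return A

def synthesize(excitation, A):
    """All-pole recursion y[n] = x[n] - sum_{k>=1} A[k] y[n-k], i.e. H(z) = 1/A(z)."""
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, min(len(A), n + 1)):
            acc -= A[k] * y[n - k]
        y[n] = acc
    return y

# Illustrative (assumed) vowel-like formant frequencies and bandwidths, in Hz.
A = vocal_tract_poly(formants=[500, 1500, 2500], bandwidths=[60, 90, 120])

n = fs // 5                                     # 200 ms per segment
voiced = np.zeros(n); voiced[::fs // 100] = 1.0 # 100 Hz glottal impulse train
unvoiced = 0.05 * np.random.randn(n)            # flat-spectrum noise excitation

# Unvoiced-then-voiced output, mimicking the switch in Figure 2.2.
speech_like = np.concatenate([synthesize(unvoiced, A), synthesize(voiced, A)])
```

A time-varying version of this sketch would simply recompute A every 10 ms or so, which is exactly the slowly varying vocal tract parameter stream of Figure 2.2.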
2.3 More Refined Models

Source/system models as shown in Figure 2.2, with the system characterized by a time-sequence of time-invariant systems, are quite sufficient for most applications in speech processing, and we shall rely on such models throughout this text. However, such models are based on many approximations, including the assumption that the source and the system do not interact, the assumption of linearity, and the assumption that the distributed continuous-time vocal tract transmission system can be modeled by a discrete linear time-invariant system. Fluid mechanics and acoustic wave propagation theory are fundamental physical principles that must be applied for detailed modeling of speech production. Since the early work of Flanagan and Ishizaka [34, 36, 51], much work has been devoted to creating detailed simulations of glottal flow, the interaction of the glottal source and the vocal tract in speech production, and the nonlinearities that enter into sound generation and transmission in the vocal tract. Stevens [121] and Quatieri [94] provide useful discussions of these effects.

For many years, researchers have sought to measure the physical dimensions of the human vocal tract during speech production. This information is essential for detailed simulations based on acoustic theory. Early efforts to measure vocal tract area functions involved hand tracing on X-ray pictures [32]. Recent advances in MRI imaging and computer image analysis have provided significant advances in this area of speech science [17].
3 Hearing and Auditory Perception

In Chapter 2, we introduced the speech production process and showed how we could model speech production using discrete-time systems. In this chapter we turn to the perception side of the speech chain to discuss properties of human sound perception that can be employed to create digital representations of the speech signal that are perceptually robust.

3.1 The Human Ear

Figure 3.1 shows a schematic view of the human ear with its three distinct sound processing sections, namely: the outer ear, consisting of the pinna, which gathers sound and conducts it through the external canal to the middle ear; the middle ear, beginning at the tympanic membrane, or eardrum, and including three small bones, the malleus (also called the hammer), the incus (also called the anvil), and the stapes (also called the stirrup), which perform a transduction from acoustic waves to mechanical pressure waves; and finally, the inner ear, which consists of the cochlea and the set of neural connections to the auditory nerve, which conducts the neural signals to the brain.
Fig. 3.1 Schematic view of the human ear (inner and middle structures enlarged). (After Flanagan [34].)

Figure 3.2 [107] depicts a block diagram abstraction of the auditory system. The acoustic wave is transmitted from the outer ear to the inner ear, where the eardrum and bone structures convert the sound wave to mechanical vibrations which ultimately are transferred to the basilar membrane inside the cochlea. The basilar membrane vibrates in a frequency-selective manner along its extent and thereby performs a rough (nonuniform) spectral analysis of the sound. Distributed along the basilar membrane are a set of inner hair cells that serve to convert motion along the basilar membrane to neural activity. This produces an auditory nerve representation in both time and frequency, shown in Figure 3.2 as a sequence of central processing stages with multiple representations followed by some type of pattern recognition. The processing at higher levels in the brain, beyond the basilar membrane processing of the inner ear, is not well understood, and we can only postulate the mechanisms used by the human brain to perceive sound or speech. Even so, a wealth of knowledge about how sounds are perceived has been discovered by careful experiments that use tones and noise signals to stimulate the auditory system of human observers in very specific and controlled ways. These experiments have yielded much valuable knowledge about the sensitivity of the human auditory system to acoustic properties such as intensity and frequency.

Fig. 3.2 Schematic model of the auditory mechanism. (After Sachs et al. [107].)

3.2 Perception of Loudness

A key factor in the perception of speech and other sounds is loudness. Loudness is a perceptual quality that is related to the physical property of sound pressure level. Loudness is quantified by relating the actual sound pressure level of a pure tone (in dB relative to a standard reference level) to the perceived loudness of the same tone (in a unit called phons) over the range of human hearing (20 Hz–20 kHz). This relationship is shown in Figure 3.3 [37, 103]. The solid curves are equal-loudness-level contours measured by comparing sounds at various frequencies with a pure tone of frequency 1000 Hz and known sound pressure level. Specifically, the point at frequency 100 Hz on the curve labeled 50 (phons) is obtained by adjusting the power of the 100 Hz tone until it sounds as loud as a 1000 Hz tone having a sound pressure level of 50 dB. Careful measurements of this kind show that a 100 Hz tone must have a sound pressure level of about 60 dB in order to be perceived as equal in loudness to the 1000 Hz tone of sound pressure level 50 dB. By convention, both the 50 dB 1000 Hz tone and the 60 dB 100 Hz tone are said to have a loudness level of 50 phons (pronounced as /F OW N Z/).

These loudness curves show that the perception of loudness is frequency-dependent. For example, the dotted curve at the bottom of the figure labeled "threshold of audibility" shows the sound pressure level that is required for a sound of a given frequency to be just audible (by a person with normal hearing). It can be seen that low frequencies must be significantly more intense than frequencies in the mid-range in order that they be perceived at all. The equal-loudness-level curves show that the auditory system is most sensitive for frequencies ranging from about 100 Hz up to about 6 kHz, with the greatest sensitivity at around 3 to 4 kHz. This is almost precisely the range of frequencies occupied by most of the sounds of speech.

Fig. 3.3 Loudness level for human hearing. (After Fletcher and Munson [37].)

3.3 Critical Bands

The nonuniform frequency analysis performed by the basilar membrane can be thought of as equivalent to that of a set of bandpass filters whose frequency responses become increasingly broad with increasing frequency. An idealized version of such a filter bank is depicted schematically in Figure 3.4. In reality, the bandpass filters are not ideal, and their frequency responses overlap significantly, since points on the basilar membrane cannot vibrate independently of each other. Even so, the concept of bandpass filter analysis in the cochlea is well established, and the critical bandwidths have been defined and measured using a variety of methods. An equation that fits empirical measurements over the auditory range is

    ∆fc = 25 + 75[1 + 1.4(fc/1000)^2]^0.69,    (3.1)

where ∆fc is the critical bandwidth associated with center frequency fc [134]. As shown in Figure 3.4, the effective bandwidths are constant at about 100 Hz for center frequencies below 500 Hz, with a relative bandwidth of about 20% of the center frequency above 500 Hz. Approximately 25 critical band filters span the range from 0 to 20 kHz. The concept of critical bands is very important in understanding such phenomena as loudness perception, pitch perception, and masking, and it therefore provides motivation for digital representations of the speech signal that are based on a frequency decomposition.

Fig. 3.4 Schematic representation of bandpass filters according to the critical band theory of hearing.
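Equation (3.1) is easy to evaluate directly. The short sketch below (the function name is my own) reproduces the two behaviors noted above: a roughly constant bandwidth of about 100 Hz at low center frequencies, growing toward roughly 20% of the center frequency well above 500 Hz.

```python
import math

def critical_bandwidth(fc):
    """Critical bandwidth (Hz) at center frequency fc (Hz), per Eq. (3.1)."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (fc / 1000.0) ** 2) ** 0.69

for fc in (100, 350, 500, 1000, 2000, 4000):
    print(f"fc = {fc:4d} Hz -> critical bandwidth ~ {critical_bandwidth(fc):6.1f} Hz")
```

Evaluating the formula this way also makes plain why a perceptually motivated filter bank needs only about 25 channels to cover 0–20 kHz: the bandwidths grow rapidly with center frequency.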
3.4 Pitch Perception

Most musical sounds as well as voiced speech sounds have a periodic structure when viewed over short time intervals, and such sounds are perceived by the auditory system as having a quality known as pitch. Like loudness, pitch is a subjective attribute of sound; it is related to the fundamental frequency of the sound, which is a physical attribute of the acoustic waveform [122]. The relationship between pitch (measured on a nonlinear frequency scale called the mel-scale) and frequency of a pure tone is approximated by the equation [122]:

    Pitch in mels = 1127 log_e(1 + f/700),    (3.2)

which is plotted in Figure 3.5. This empirical scale describes the results of experiments where subjects were asked to adjust the pitch of a measurement tone to half the pitch of a reference tone. The expression is calibrated so that a frequency of 1000 Hz corresponds to a pitch of 1000 mels. Below 1000 Hz, the relationship between pitch and frequency is nearly proportional. For higher frequencies, however, the relationship is nonlinear; for example, (3.2) shows that a frequency of f = 5000 Hz corresponds to a pitch of 2364 mels.

Fig. 3.5 Relation between subjective pitch and frequency of a pure tone.

The psychophysical phenomenon of pitch, as quantified by the mel-scale, can be related to the concept of critical bands [134]. It turns out that, more or less independently of the center frequency of the band, one critical bandwidth corresponds to about 100 mels on the pitch scale. This is shown in Figure 3.5, where a critical band of width ∆fc = 160 Hz centered on fc = 1000 Hz maps into a band of width 106 mels, and a critical band of width 100 Hz centered on 350 Hz maps into a band of width 107 mels. Thus, what we know about pitch perception reinforces the notion that the auditory system performs a frequency analysis that can be simulated with a bank of bandpass filters whose bandwidths increase as center frequency increases.
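Equation (3.2) can be checked numerically. The sketch below (the function name is my own) reproduces the calibration point at 1000 Hz, the 5000 Hz example, and the observation that one critical band spans roughly 100 mels.

```python
import math

def hz_to_mels(f):
    """Subjective pitch (mels) of a pure tone of frequency f in Hz, per Eq. (3.2)."""
    return 1127.0 * math.log(1.0 + f / 700.0)

print(hz_to_mels(1000))   # calibration point: 1000 Hz maps to ~1000 mels
print(hz_to_mels(5000))   # ~2364 mels, the example quoted in the text

# A 160 Hz critical band centered at 1000 Hz spans roughly 100 mels:
print(hz_to_mels(1080) - hz_to_mels(920))   # ~106 mels
```

The same warping underlies mel-frequency analysis in modern speech recognizers, where feature extraction spaces its filter bank uniformly in mels rather than in hertz.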
3.5 Auditory Masking

Voiced speech is quasi-periodic, but contains many frequencies. Nevertheless, many of the results obtained with pure tones are relevant to the perception of voice pitch as well. Often the term pitch period is used for the fundamental period of the voiced speech signal, even though its usage in this way is somewhat imprecise.

A related phenomenon, called masking, is also attributable to the mechanical vibrations of the basilar membrane. Masking occurs when one sound makes a second superimposed sound inaudible. The phenomenon of critical band auditory analysis can be explained intuitively in terms of vibrations of the basilar membrane: loud tones causing strong vibrations at a point on the basilar membrane can swamp out vibrations that occur nearby. Pure tones can mask other pure tones, and noise can mask pure tones as well. A detailed discussion of masking can be found in [134].

The notion that a sound becomes inaudible can be quantified with respect to the threshold of audibility. Figure 3.6 illustrates masking of tones by tones. As shown in Figure 3.6, an intense tone (called the masker) tends to raise the threshold of audibility around its location on the frequency axis, as shown by the solid line. All spectral components whose level is below this raised threshold are masked, and therefore do not need to be reproduced in a speech (or audio) processing system because they would not be heard. Similarly, any spectral component whose level is above the raised threshold is not masked, and therefore will be heard. It has been shown that the masking effect is greater for frequencies above the masking frequency than below it; this appears in Figure 3.6 as the less abrupt falloff of the shifted threshold above the masker than below it. Masking is widely employed in digital representations of speech (and audio) signals by "hiding" errors in the representation in regions where the threshold of hearing is elevated by strong frequency components in the signal [120]. In this way, it is possible to achieve lower data rate representations while maintaining a high degree of perceptual fidelity.

Fig. 3.6 Illustration of the effects of masking: masker, shifted threshold, unmasked signal, masked signals (dashed), and threshold of hearing, plotted against log frequency.

3.6 Complete Model of Auditory Processing

In Chapter 2, we described elements of a generative model of speech production which could, in theory, completely describe and model the ways in which speech is produced by humans. Similarly, in this chapter, we have described elements of a model of speech perception. However, the problem that arises is that our detailed knowledge of how speech is perceived and understood, beyond the basilar-membrane processing of the inner ear, is rudimentary at best, and thus we rely on psychophysical experimentation to understand the roles of loudness, critical bands, pitch perception, and auditory masking in speech perception by humans. Although some excellent auditory models have been proposed [41, 48, 73, 115] and used in a range of speech processing systems, all such models are incomplete representations of our knowledge about how speech is understood.
4 Short-Time Analysis of Speech

In Figure 2.2 of Chapter 2, we presented a model for speech production in which an excitation source provides the basic temporal fine structure while a slowly varying filter provides spectral shaping (often referred to as the spectrum envelope) to create the various sounds of speech. In Figure 2.2, the source/system separation was presented at an abstract level, and few details of the excitation or the linear system were given. Both the excitation and the linear system were defined implicitly by the assertion that the sampled speech signal was the output of the overall system. Clearly, this is not sufficient to uniquely specify either the excitation or the system. Since our goal is to extract parameters of the model by analysis of the speech signal, it is common to assume structures (or representations) for both the excitation generator and the linear system. One such model uses a more detailed representation of the excitation in terms of separate source generators for voiced and unvoiced speech, as shown in Figure 4.1. In this model, the unvoiced excitation is assumed to be a random noise sequence, and the voiced excitation is assumed to be a periodic impulse train with impulses spaced by the pitch period (P0) rounded to the nearest sample. As discussed in Chapter 2, the period of voiced speech is related to the fundamental frequency (perceived as pitch) of the voice. The pulses needed to model the glottal flow waveform during voiced speech are assumed to be combined (by convolution) with the impulse response of the linear system, which is assumed to be slowly time-varying (changing every 50–100 ms or so). By this we mean that, over the time scale of phonemes, the impulse response, frequency response, and system function of the system remain relatively constant. For example, over time intervals of tens of milliseconds, the system can be described by the convolution expression

s_n̂[n] = Σ_{m=0}^{∞} h_n̂[m] e_n̂[n − m],   (4.1)

where the subscript n̂ denotes the time index pointing to the block of samples of the entire speech signal s[n] wherein the impulse response h_n̂[m] applies, n is the time index within that interval, and m is the index of summation in the convolution sum. The gain G_n̂ is absorbed into h_n̂[m] for convenience. (In general, we use n and m as discrete indices for sequences, but whenever we want to indicate a specific analysis time, we use n̂.)

Fig. 4.1 Voiced/unvoiced/system model for a speech signal.

To simplify analysis, it is often assumed that the system is an all-pole system with system function of the form:

H(z) = G / (1 − Σ_{k=1}^{p} a_k z^{−k}),   (4.2)

although, as mentioned in Chapter 3, the most general linear time-invariant system would be characterized by a rational system function as given in Chapter 2. The coefficients G and a_k in (4.2) change with time, and should therefore be indexed with the subscript n̂ as in (4.1), but this complication of notation is not generally necessary, since it is usually clear that the system function or difference equation only applies over a short time interval; for this reason we have suppressed the indication of the time at which the model applies. Although the linear system is assumed to model the composite spectrum effects of radiation, vocal tract tube, and glottal excitation pulse shape (for voiced speech only) over a short time interval, the linear system in the model is commonly referred to as simply the "vocal tract" system, and the corresponding impulse response is called the "vocal tract impulse response." For all-pole linear systems, the input and output are related by a difference equation of the form:

s[n] = Σ_{k=1}^{p} a_k s[n − k] + G e[n].   (4.3)

With such a model as the basis, common practice is to partition the analysis of the speech signal into techniques for extracting the parameters of the excitation model, such as the pitch period and voiced/unvoiced classification, and techniques for extracting the linear system model (which imparts the spectrum envelope or spectrum shaping). Because of the slowly varying nature of the speech signal, it is common to process speech in blocks (also called "frames") over which the properties of the speech waveform can be assumed to remain relatively constant. This leads to the basic principle of short-time analysis.
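As an illustration of the all-pole difference equation (4.3), the following minimal Python sketch generates the output of such a system sample by sample; the function name and the zero initial conditions are assumptions made for the example:

```python
def allpole_filter(e, a, G):
    """Implement s[n] = sum_k a[k] s[n-k] + G e[n], i.e., Eq. (4.3).

    e : excitation sequence (list of floats)
    a : coefficients a[1..p], given as the list [a1, ..., ap]
    G : gain applied to the excitation
    Samples before n = 0 are taken to be zero (zero initial conditions).
    """
    s = []
    for n in range(len(e)):
        acc = G * e[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * s[n - k]
        s.append(acc)
    return s

# Single-pole example: with a = [0.5] and an impulse excitation,
# the output is the decaying sequence G * 0.5**n.
y = allpole_filter([1.0, 0.0, 0.0, 0.0], [0.5], 1.0)
```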
Short-time analysis is represented in a general form by the equation:

X_n̂ = Σ_{m=−∞}^{∞} T{x[m] w[n̂ − m]},   (4.4)

where X_n̂ represents the short-time analysis parameter (or vector of parameters) at analysis time n̂. The operator T{ } defines the nature of the short-time analysis function; we will see several examples of such operators that are designed to extract or highlight certain features of the speech signal. (Sometimes (4.4) is normalized by dividing by the effective window length, Σ_{m=−∞}^{∞} w[m], so that X_n̂ is a weighted average.) The sequence w[n̂ − m] represents a time-shifted window, whose purpose is to select a segment of the sequence x[m] in the neighborhood of sample m = n̂. The infinite limits in (4.4) imply summation over all nonzero values of the windowed segment x_n̂[m] = x[m] w[n̂ − m], i.e., over all m in the region of support of the window.

For example, a finite-duration window might be a Hamming window, defined by

w_H[m] = 0.54 + 0.46 cos(πm/M),  −M ≤ m ≤ M  (and 0 otherwise).   (4.5)

(An example of an infinite-duration window is w[n] = n a^n for n ≥ 0; such a window can lead to recursive implementations of short-time analysis functions [9].) Figure 4.2 shows a discrete-time Hamming window and its discrete-time Fourier transform. Here we have assumed the time origin at the center of a symmetric interval of 2M + 1 samples; a causal window would be shifted to the right by M samples. It can be shown that a (2M + 1)-sample Hamming window has a frequency main lobe (full) bandwidth of 4π/M. Other windows will have similar properties: they will be concentrated in time and frequency, and the frequency width will be inversely proportional to the time width [89].

Figure 4.3 shows a 125 ms segment of a speech waveform that includes both unvoiced (0–50 ms) and voiced (50–125 ms) speech. Also shown is a sequence of data windows of duration 40 ms, shifted by 15 ms (320 samples at a 16 kHz sampling rate) between windows. This illustrates how short-time analysis is implemented.
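A small Python sketch of the Hamming window of (4.5) and the general short-time analysis operation of (4.4); the helper names and the causal list indexing of the window are illustrative assumptions:

```python
import math

def hamming(M):
    """Symmetric (2M+1)-point Hamming window per Eq. (4.5):
    w[m] = 0.54 + 0.46 cos(pi*m/M) for -M <= m <= M."""
    return [0.54 + 0.46 * math.cos(math.pi * m / M) for m in range(-M, M + 1)]

def short_time_analysis(x, w, n_hat, T):
    """Generic short-time analysis X_n̂ = sum_m T(x[m] w[n̂ - m]), Eq. (4.4).
    The window w is stored as a causal list w[0..L-1]."""
    total = 0.0
    for k in range(len(w)):
        m = n_hat - k
        if 0 <= m < len(x):
            total += T(x[m] * w[k])
    return total
```

With `T = lambda v: v * v` this computes the short-time energy discussed next; other choices of `T` yield the other short-time functions of this chapter.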
Fig. 4.2 Hamming window (a) and its discrete-time Fourier transform (b).

Fig. 4.3 Section of speech waveform with short-time analysis windows.

4.1 Short-Time Energy and Zero-Crossing Rate

Two basic short-time analysis functions useful for speech signals are the short-time energy and the short-time zero-crossing rate. These functions are simple to compute, and they are useful for estimating properties of the excitation function in the model. The short-time energy is defined as

E_n̂ = Σ_{m=−∞}^{∞} (x[m] w[n̂ − m])^2 = Σ_{m=−∞}^{∞} x^2[m] w^2[n̂ − m].   (4.6)

In this case, the operator T{ } simply squares the windowed samples. As shown in (4.6), it is often possible to express short-time analysis operators as a convolution or linear filtering operation; here E_n̂ = (x^2[n] ∗ h_e[n])|_{n=n̂}, where the impulse response of the linear filter is h_e[n] = w^2[n].

Similarly, the short-time zero-crossing rate is defined as the weighted average of the number of times the speech signal changes sign within the time window. Representing this operator in terms of linear filtering leads to

Z_n̂ = Σ_{m=−∞}^{∞} 0.5 |sgn{x[m]} − sgn{x[m − 1]}| w[n̂ − m],   (4.7)

where

sgn{x} = 1 for x ≥ 0, and −1 for x < 0.   (4.8)

Since 0.5 |sgn{x[m]} − sgn{x[m − 1]}| is equal to 1 if x[m] and x[m − 1] have different algebraic signs and 0 if they have the same sign, it follows that Z_n̂ in (4.7) is a weighted sum of all the instances of alternating sign (zero-crossings) that fall within the support region of the shifted window w[n̂ − m]. While this is a convenient representation that fits the general framework of (4.4), the computation of Z_n̂ could be implemented in other ways.

Figure 4.4 shows an example of the short-time energy and zero-crossing rate for a segment of speech with a transition from unvoiced to voiced speech. In both cases, the window is a Hamming window (two examples shown) of duration 25 ms (equivalent to 401 samples at a 16 kHz sampling rate). (In the case of the short-time energy, the window applied to the signal samples was the square root of the Hamming window, so that h_e[n] = w^2[n] is the Hamming window defined by (4.5).) Thus, both the short-time energy and the short-time zero-crossing rate are outputs of a lowpass filter whose frequency response is as shown in Figure 4.2(b). For the 401-point Hamming window used in Figure 4.4, the frequency response is very small for discrete-time frequencies above 2π/200 rad/s (equivalent to 16000/200 = 80 Hz analog frequency). This means that the short-time energy and zero-crossing-rate functions are slowly varying compared to the time variations of the speech signal, and therefore they can be sampled at a much lower rate than that of the original speech signal. For finite-length windows like the Hamming window, this reduction of the sampling rate is accomplished by moving the window position n̂ in jumps of more than one sample, as shown in Figure 4.3.

Fig. 4.4 Section of speech waveform with short-time energy and zero-crossing rate superimposed.

The short-time energy and short-time zero-crossing rate are important because they abstract valuable information about the speech signal, and they are simple to compute. The short-time energy is an indication of the amplitude of the signal in the interval around time n̂. From our model, we expect unvoiced regions to have lower short-time energy than voiced regions; note that during the unvoiced interval in Figure 4.4, the energy is relatively low compared to the energy in the voiced region. Similarly, the short-time zero-crossing rate is a crude frequency analyzer: voiced signals have a high-frequency (HF) falloff due to the lowpass nature of the glottal pulses, while unvoiced sounds have much more HF energy. Conversely, then, the zero-crossing rate is relatively high in the unvoiced interval compared to the rate in the voiced interval. Note also that there is a small shift of the two curves relative to events in the time waveform; this is due to the time delay of M samples (equivalent to 12.5 ms) added to make the analysis window filter causal. Thus, the short-time energy and short-time zero-crossing rate can be the basis for an algorithm for deciding whether the speech signal is voiced or unvoiced at a given analysis time.
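The short-time energy and zero-crossing computations of (4.6)-(4.8) can be sketched as follows; the function names and the causal window-list convention are illustrative assumptions:

```python
def sgn(v):
    # sgn{x} per Eq. (4.8): +1 for x >= 0, -1 for x < 0
    return 1 if v >= 0 else -1

def short_time_energy(x, w, n_hat):
    """E_n̂ = sum_m (x[m] w[n̂-m])^2, Eq. (4.6); w is a causal list w[0..L-1]."""
    return sum((x[n_hat - k] * w[k]) ** 2
               for k in range(len(w)) if 0 <= n_hat - k < len(x))

def short_time_zero_crossings(x, w, n_hat):
    """Z_n̂ per Eq. (4.7): weighted count of sign alternations under the window."""
    total = 0.0
    for k in range(len(w)):
        m = n_hat - k
        if 1 <= m < len(x):
            total += 0.5 * abs(sgn(x[m]) - sgn(x[m - 1])) * w[k]
    return total
```

For a rapidly alternating input the zero-crossing sum counts one crossing per sample, while a monotone input yields zero, matching the intuition that the measure acts as a crude frequency analyzer.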
A complete voiced/unvoiced algorithm would involve measurements of the statistical distributions of the energy and zero-crossing rate for both voiced and unvoiced speech segments (and also background-noise distributions). These distributions can be used to derive thresholds for the voiced/unvoiced decision [100].

4.2 Short-Time Autocorrelation Function (STACF)

The autocorrelation function is often used as a means of detecting periodicity in signals, and it is also the basis for many spectrum analysis methods. This makes it a useful tool for short-time speech analysis. The STACF is defined as the deterministic autocorrelation function of the sequence x_n̂[m] = x[m] w[n̂ − m] that is selected by the window shifted to time n̂:

φ_n̂[ℓ] = Σ_{m=−∞}^{∞} x_n̂[m] x_n̂[m + ℓ]
       = Σ_{m=−∞}^{∞} x[m] w[n̂ − m] x[m + ℓ] w[n̂ − m − ℓ].   (4.9)

The STACF has the familiar even-symmetric property of autocorrelation functions, φ_n̂[−ℓ] = φ_n̂[ℓ]. Note that the STACF is a two-dimensional function of the discrete-time index n̂ (the window position) and the discrete-lag index ℓ. If the window has finite duration, (4.9) can be evaluated directly or using FFT techniques (see Section 4.6). Alternatively, the short-time autocorrelation of (4.9) can be expressed in terms of linear time-invariant (LTI) filtering as

φ_n̂[ℓ] = Σ_{m=−∞}^{∞} x[m] x[m − ℓ] w̃_ℓ[n̂ − m],   (4.10)

where w̃_ℓ[m] = w[m] w[m + ℓ]. For infinite-duration decaying exponential windows, (4.10) can be computed recursively at time n̂ by using a different filter w̃_ℓ[m] for each lag value ℓ [8, 9].

To see how the short-time autocorrelation can be used in speech analysis, assume that a segment of the sampled speech signal is a segment of the output of the discrete-time model shown in Figure 4.1, where the system is characterized at a particular analysis time by an impulse response h[n], and the input is either a periodic impulse train or random white noise. That is, assume that s[n] = e[n] ∗ h[n], where e[n] is the excitation to the linear system with impulse response h[n]. In the case of the speech signal, h[n] represents the combined (by convolution) effects of the glottal pulse shape (for voiced speech), vocal tract shape, and radiation at the lips. (Note that we have suppressed the indication of analysis time.) Different segments of the speech signal will have the same form of model, with different excitation and system impulse response. A well known, and easily proved, property of the autocorrelation function is that

φ^(s)[ℓ] = φ^(e)[ℓ] ∗ φ^(h)[ℓ],   (4.11)

i.e., the autocorrelation function of s[n] = e[n] ∗ h[n] is the convolution of the autocorrelation functions of e[n] and h[n]. For voiced speech, the autocorrelation of a periodic impulse train excitation with period P0 is a periodic impulse train sequence with the same period. Therefore, the autocorrelation of voiced speech is the periodic autocorrelation function

φ^(s)[ℓ] = Σ_{m=−∞}^{∞} φ^(h)[ℓ − m P0].   (4.12)

In the case of unvoiced speech, the excitation can be assumed to be random white noise, whose stochastic autocorrelation function would be an impulse sequence at ℓ = 0. In this case, the autocorrelation function of unvoiced speech computed using averaging would be simply

φ^(s)[ℓ] = φ^(h)[ℓ].   (4.13)

Equation (4.12) assumes periodic computation over an infinite periodic signal, and (4.13) assumes a probability average or averaging over an infinite time interval (for stationary random signals). In the case of the speech signal, however, the deterministic autocorrelation function of a finite-length segment of the speech waveform will have properties similar to those of (4.12) and (4.13), except that the correlation values will taper off with lag ℓ, due both to the tapering of the window and to the fact that less and less data is involved in the computation of the short-time autocorrelation for longer lag values. This tapering off in level is depicted in Figure 4.5 for both a voiced and an unvoiced speech segment. Note the peaks in the autocorrelation function for the voiced segment at the pitch period and at twice the pitch period, and note the absence of such peaks in the autocorrelation function for the unvoiced segment. This suggests that the STACF could be the basis for an algorithm for estimating/detecting the pitch period of speech. Usually such algorithms involve the autocorrelation function together with other short-time measurements, such as zero-crossings and energy, to aid in making the voiced/unvoiced decision.

Fig. 4.5 Voiced and unvoiced segments of speech and their corresponding STACFs.

Finally, observe that the STACF implicitly contains the short-time energy, since

E_n̂ = Σ_{m=−∞}^{∞} (x[m] w[n̂ − m])^2 = φ_n̂[0].   (4.14)

4.3 Short-Time Fourier Transform (STFT)

The short-time analysis functions discussed so far are examples of the general short-time analysis principle that is the basis for most algorithms for speech processing.
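A direct Python sketch of the STACF definition (4.9); the function name and the causal window convention are assumptions made for illustration:

```python
def stacf(x, w, n_hat, max_lag):
    """Short-time autocorrelation phi_n̂[l] of Eq. (4.9) for l = 0..max_lag.
    w is a causal window list w[0..L-1]; the windowed segment is
    x_n̂[m] = x[m] w[n̂ - m], taken as zero outside the window's support."""
    seg = []
    for k in range(len(w) - 1, -1, -1):   # m = n̂ - k runs forward in m
        m = n_hat - k
        seg.append(x[m] * w[k] if 0 <= m < len(x) else 0.0)
    # Deterministic autocorrelation of the finite-length windowed segment.
    return [sum(seg[i] * seg[i + lag] for i in range(len(seg) - lag))
            for lag in range(max_lag + 1)]
```

Consistent with (4.14), the lag-zero value equals the short-time energy, and for a periodic input the returned sequence peaks at multiples of the period, which is the basis of autocorrelation pitch detection.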
We now turn our attention to what is perhaps the most important basic concept in digital speech processing. In subsequent chapters, we will find that the STFT, defined as

X_n̂(e^{jω̂}) = Σ_{m=−∞}^{∞} x[m] w[n̂ − m] e^{−jω̂m},   (4.15)

is the basis for a wide range of speech analysis, coding, and synthesis systems. By definition, the STFT is a function of two variables: n̂, the discrete-time index denoting the window position, and ω̂, the analysis frequency. (As before, we use n̂ to specify the analysis time, and ω̂ distinguishes the STFT analysis frequency from the frequency variable ω of the non-time-dependent Fourier transform.) For fixed analysis time n̂, the STFT is the discrete-time Fourier transform (DTFT) of the (usually finite-duration) signal x_n̂[m] = x[m] w[n̂ − m] selected and amplitude-weighted by the sliding window w[n̂ − m] [24, 89, 129]. Thus, the two-dimensional function X_n̂(e^{jω̂}) at discrete time n̂ is a periodic function of the continuous radian frequency ω̂ with period 2π [89], and (4.15) is a sequence of DTFTs.

As in the case of the other short-time analysis functions discussed in this chapter, the STFT can be expressed in terms of a linear filtering operation. For example, (4.15) can be expressed as the discrete convolution

X_n̂(e^{jω̂}) = ((x[n] e^{−jω̂n}) ∗ w[n])|_{n=n̂},   (4.16)

or, alternatively,

X_n̂(e^{jω̂}) = ((x[n] ∗ (w[n] e^{jω̂n})) e^{−jω̂n})|_{n=n̂}.   (4.17)

Recall that a typical window like a Hamming window, when viewed as a linear-filter impulse response, has a lowpass frequency response with cutoff frequency varying inversely with the window length (see Figure 4.2(b)). This means that, for a fixed value of ω̂, X_n̂(e^{jω̂}) is slowly varying as n̂ varies. Equation (4.16) can be interpreted as follows: the amplitude modulation x[n] e^{−jω̂n} shifts the spectrum of x[n] down by ω̂, and the window (lowpass) filter selects the resulting band of frequencies around zero frequency. This is, of course, the band of frequencies of x[n] that were originally centered on the analysis frequency ω̂. An identical conclusion follows from (4.17): x[n] is the input to a bandpass filter with impulse response w[n] e^{jω̂n}, which selects the band of frequencies centered on ω̂; then that band of frequencies is shifted down by the amplitude modulation with e^{−jω̂n}, resulting again in the same lowpass signal [89].

In summary, the STFT has three interpretations: (1) it is a sequence of discrete-time Fourier transforms of windowed signal segments; (2) for each frequency ω̂, with n̂ varying, it is the time-sequence output of a lowpass filter that follows frequency-downshifting by ω̂; and (3) for each frequency ω̂, it is the time-sequence output resulting from frequency-downshifting the output of a bandpass filter.

4.4 Sampling the STFT in Time and Frequency

As defined in (4.15), the STFT is a function of a continuous analysis frequency ω̂, i.e., a periodic function of ω̂ at each window position n̂. The STFT becomes a practical tool for both analysis and applications when it is implemented with a finite-duration window moved in steps of R > 1 samples in time and computed at a discrete set of frequencies, as in

X_{rR}[k] = Σ_{m=rR−L+1}^{rR} x[m] w[rR − m] e^{−j(2πk/N)m},  k = 0, 1, ..., N − 1,   (4.18)

where N is the number of uniformly spaced frequencies across the interval 0 ≤ ω̂ < 2π, and L is the window length (in samples). Note that we have assumed that w[m] is causal and nonzero only in the range 0 ≤ m ≤ L − 1, so that the windowed segment x[m] w[rR − m] is nonzero over rR − L + 1 ≤ m ≤ rR. To aid in interpretation, it is helpful to write (4.18) in the equivalent form:

X_{rR}[k] = X̃_{rR}[k] e^{−j(2πk/N)rR},  k = 0, 1, ..., N − 1,   (4.19)

where

X̃_{rR}[k] = Σ_{m=0}^{L−1} x[rR − m] w[m] e^{j(2πk/N)m},  k = 0, 1, ..., N − 1.   (4.20)

The complex exponential factor e^{−j(2πk/N)rR} in (4.19) results from the shift of the time origin: in the alternative form (4.20), the analysis time rR is shifted to the time origin of the DFT computation, and the segment of the speech signal is the time-reversed sequence of L samples that precedes the analysis time. Since we have assumed, for specificity, that w[m] = 0 except in the range 0 ≤ m ≤ L − 1, (4.20) has the interpretation of an N-point DFT of the sequence x[rR − m] w[m], for m = 0, 1, ..., L − 1. Furthermore, since x[rR − m] w[m] is real, (4.20) is the complex conjugate of the DFT of the windowed sequence [89]. (The DFT is normally defined with a negative exponent.) Thus, X̃_{rR}[k] can be computed by the following process:

(1) Form the sequence x_{rR}[m] = x[rR − m] w[m].
(2) Compute the complex conjugate of the N-point DFT of the sequence x_{rR}[m]. (This can be done efficiently with an N-point FFT algorithm.)
(3) Multiply by e^{−j(2πk/N)rR} if necessary; this step often can be omitted (as in computing a sound spectrogram or spectrographic display).
(4) Move the time origin by R samples (i.e., r → r + 1) and repeat steps (1), (2), and (3).

The remaining issue for a complete specification of the sampled STFT is the choice of the temporal sampling period R and the number of uniform frequencies N used to compute the STFT [1]. It can easily be shown that both R and N are determined entirely by the time width and frequency bandwidth of the lowpass window.
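The sampled STFT of (4.18) can be computed directly, as in the following sketch (a literal sum rather than the efficient conjugate-DFT/FFT procedure of steps (1)-(4); the function name is illustrative):

```python
import cmath

def sampled_stft(x, w, r, R, N):
    """X_{rR}[k] for k = 0..N-1, computed directly from Eq. (4.18).
    w is a causal window of length L; the analyzed frame ends at sample rR."""
    L = len(w)
    nR = r * R
    out = []
    for k in range(N):
        acc = 0.0 + 0.0j
        for m in range(nR - L + 1, nR + 1):   # support of x[m] w[rR - m]
            if 0 <= m < len(x):
                acc += x[m] * w[nR - m] * cmath.exp(-2j * cmath.pi * k * m / N)
        out.append(acc)
    return out
```

For a constant input and a rectangular window, all of the signal energy lands in the k = 0 (DC) channel, as expected of a frequency analyzer.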
These considerations give the following constraints on R and N:

(1) R ≤ L/(2C), where L is the window length (in samples) and C is a constant that depends on the window frequency bandwidth (C = 1 for a rectangular window, C = 2 for a Hamming window);
(2) N ≥ L.

Constraint (1) corresponds to sampling the STFT in time at a rate of twice the window bandwidth in frequency, in order to eliminate aliasing in frequency of the STFT; constraint (2) corresponds to sampling in frequency at a rate of twice the equivalent time width of the window, to ensure that there is no aliasing in time of the STFT.

4.5 The Speech Spectrogram

Since the 1940s, the sound spectrogram has been a basic tool for gaining understanding of how the sounds of speech are produced and how phonetic information is encoded in the speech signal. Up until the 1970s, spectrograms were made by an ingenious device comprised of an audio tape loop, a variable analog bandpass filter, and electrically sensitive paper [66]. Today, spectrograms like those in Figure 4.6 are made by DSP techniques [87] and displayed as either pseudo-color or gray-scale images on computer screens.

The images in Figure 4.6 are simply a display of the magnitude of the STFT. Specifically, they are plots of

S(t_r, f_k) = 20 log10 |X_{rR}[k]| = 20 log10 |X̃_{rR}[k]|,   (4.21)

where the plot axes are labeled in terms of analog time and frequency through the relations t_r = rRT and f_k = k/(NT), with T the sampling period of the discrete-time signal x[n] = x_a(nT). Such a function of two variables can be plotted on a two-dimensional surface (such as this text) as either a gray-scale or a color-mapped image. In order to make smooth-looking plots like those in Figure 4.6, R is usually quite small compared to both the window length L and the number of samples N in the frequency dimension, which may be much larger than the window length L.

Figure 4.6 shows the time waveform at the top and two spectrograms computed with different-length analysis windows; the bars on the right calibrate the color map (in dB). A careful interpretation of (4.20) and the corresponding spectrogram images leads to valuable insight into the nature of the speech signal. First note that the window sequence w[m] is nonzero only over the interval 0 ≤ m ≤ L − 1, and that the length of the window has a major effect on the spectrogram image.

Fig. 4.6 Wideband (upper) and narrowband (lower) spectrograms of a speech signal; phoneme labels SH UH D W IY CH EY S.

The upper spectrogram in Figure 4.6 was computed with a window length of 101 samples, corresponding to 10 ms time duration. This window length is on the order of the length of a pitch period of the waveform during voiced intervals. As a result, in voiced intervals, the spectrogram displays vertically oriented striations, corresponding to the fact that the sliding window sometimes includes mostly large-amplitude samples and then mostly small-amplitude samples, etc. For this reason, when the analysis window is short, the spectrogram is called a wideband spectrogram: each individual pitch period is resolved in the time dimension, but the resolution in the frequency dimension is poor. This is consistent with the linear filtering interpretation of the STFT, since a short analysis filter has a wide passband. Conversely, when the window length is long, the spectrogram is a narrowband spectrogram, which is characterized by good frequency resolution and poor time resolution.

The upper plot in Figure 4.7, for example, shows S(t_r, f_k) as a function of f_k at time t_r = 430 ms. This vertical slice through the spectrogram is at the position of the black vertical line in the upper spectrogram of Figure 4.6. Note the three broad peaks in the spectrum slice at time t_r = 430 ms.

Fig. 4.7 Short-time spectrum at time 430 ms (the dark vertical line in Figure 4.6), with a Hamming window of length M = 101 in the upper plot and M = 401 in the lower plot.
Similar slices would be obtained at other times near t_r = 430 ms. These large peaks are representative of the underlying resonances of the vocal tract at the corresponding time in the production of the speech signal.

The lower spectrogram in Figure 4.6 was computed with a window length of 401 samples, corresponding to 40 ms time duration. This window length is on the order of several pitch periods of the waveform during voiced intervals. As a result, the spectrogram no longer displays vertically oriented striations, since several periods are included in the window no matter where it is placed on the waveform in the vicinity of the analysis time t_r. The spectrogram is therefore not as sensitive to rapid time variations, but the resolution in the frequency dimension is much better. In this case, the signal is voiced at the position of the window, so over the analysis window interval it acts very much like a periodic signal. Periodic signals have Fourier spectra that are composed of impulses at the fundamental frequency, f0, and at integer multiples (harmonics) of the fundamental frequency [89]. Multiplying by the analysis window in the time domain results in convolution of the Fourier transform of the window with the impulses in the spectrum of the periodic signal [89]. This is evident in the lower plot of Figure 4.7, where the local maxima of the curve are spaced at multiples of the fundamental frequency f0 = 1/T0, with T0 the fundamental period (pitch period) of the signal. (Each "ripple" in the lower plot is essentially a frequency-shifted copy of the Fourier transform of the Hamming window used in the analysis.) As a result, the striations in the narrowband spectrogram tend to be horizontally oriented, since the fundamental frequency and its harmonics are all resolved.

4.6 Relation of the STFT to the STACF

A basic property of Fourier transforms is that the inverse Fourier transform of the magnitude squared of the Fourier transform of a signal is the autocorrelation function of that signal [89]. Since the STFT defined in (4.15) is a discrete-time Fourier transform for fixed window position, it follows that φ_n̂[ℓ] given by (4.9) is related to the STFT of (4.15) by

φ_n̂[ℓ] = (1/2π) ∫_{−π}^{π} |X_n̂(e^{jω̂})|^2 e^{jω̂ℓ} dω̂.   (4.22)

The STACF can also be computed from the sampled STFT. In particular, an inverse DFT can be used to compute

φ̃_{rR}[ℓ] = (1/N) Σ_{k=0}^{N−1} |X̃_{rR}[k]|^2 e^{j(2πk/N)ℓ},   (4.23)

and φ̃_{rR}[ℓ] = φ_{rR}[ℓ] for ℓ = 0, 1, ..., L − 1 if N ≥ 2L. If L < N < 2L, time aliasing will occur, but φ̃_{rR}[ℓ] = φ_{rR}[ℓ] still holds for ℓ = 0, 1, ..., N − L. Note that (4.14) shows that the short-time energy can also be obtained from either (4.22) or (4.23) by setting ℓ = 0.

Figure 4.8 illustrates the equivalence between the STACF and the STFT. Figures 4.8(a) and 4.8(c) show the voiced and unvoiced autocorrelation functions that were shown in Figures 4.5(b) and 4.5(d); on the right are the corresponding STFTs.
The peaks around 9 and 18 ms in the autocorrelation function of the voiced segment imply a fundamental frequency at this time of approximately 1000/9 = 111 Hz. Alternatively, note in the STFT on the right in Figure 4.8(b) that there are approximately 18 local, regularly spaced peaks in the range 0–2000 Hz. Thus, we can estimate the fundamental frequency to be 2000/18 = 111 Hz, as before.

4.7 Short-Time Fourier Synthesis

From the linear filtering point of view of the STFT, (4.18) shows that XrR[k] is a downsampled output of the lowpass window filter with frequency-downshifted input x[n]e^{−j(2πk/N)n}. Similarly, (4.19) and (4.20) represent the downsampled (by the factor R) output of the process of bandpass filtering with impulse response w[n]e^{j(2πk/N)n}, followed by frequency-downshifting by (2πk/N). The left half of Figure 4.9 shows a block diagram representation of the STFT as a combination of modulation, followed by lowpass filtering, followed by a downsampler by R. This structure is often called a filter bank, and the outputs of the individual filters are called the channel signals. The spectrogram is comprised of the set of outputs of the filter bank, with each channel signal corresponding to a horizontal line in the spectrogram.

[Figure 4.9: Filterbank interpretation of short-time Fourier analysis and synthesis. In each channel k, x[n] is modulated by e^{−j(2πk/N)n}, filtered by w[n], and downsampled by R to produce XrR[k]; after possible short-time modifications producing YrR[k], each channel is upsampled by R, filtered by f[n], modulated by e^{j(2πk/N)n}, and the channels are summed to form y[n].]

If we want to use the STFT for other types of speech processing, we may need to consider if and how it is possible to reconstruct the speech signal from the STFT, i.e., from the channel signals. The remaining parts of the structure on the right in Figure 4.9 implement the synthesis of a new time sequence y[n] from the STFT. First, the diagram shows the possibility that the STFT might be modified by some sort of processing; an example might be quantization of the channel signals for data compression. The modified STFT is denoted as YrR[k]. The remainder of Figure 4.9 shows how the synthesis can be done. The steps involve first upsampling by R, followed by linear filtering with f[n] (called the synthesis window).(10) This operation, defined by the part within the parentheses in (4.24), interpolates the STFT YrR[k] to the time sampling rate of the speech signal for each channel k. Then, synthesis is achieved by modulation with e^{j(2πk/N)n}, which upshifts the lowpass interpolated channel signals back to their original frequency bands centered on frequencies (2πk/N). The sum of the upshifted signals is the synthesized output. This part of the diagram represents the synthesis equation

    y[n] = Σ_{k=0}^{N−1} ( Σ_{r=−∞}^{∞} YrR[k] f[n − rR] ) e^{j(2πk/N)n}.   (4.24)

There are many variations on the short-time Fourier analysis/synthesis paradigm represented by (4.24) and the right half of Figure 4.9. The FFT algorithm can be used to implement both analysis and synthesis, and special efficiencies result when L > R = N; in that case the analysis and synthesis can be accomplished using only real filtering, with the modulation operations being implicitly achieved by the downsampling [24, 129]. However, it is most important to note that with careful choice of the parameters L, R, and N, and careful design of the analysis window w[m] together with the synthesis window f[n], it is possible to reconstruct the signal x[n] with negligible error (y[n] = x[n]) from the unmodified STFT (YrR[k] = XrR[k]).

(10) The sequence f[n] is the impulse response of an LTI filter with gain R/N and normalized cutoff frequency π/R. It is often a scaled version of w[n].
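As a concrete sanity check on the analysis/synthesis idea, the following sketch (our own illustration; the Hann window, hop R = L/4, and random test signal are arbitrary choices) performs FFT-based analysis followed by overlap-add synthesis, normalizing by the summed window overlap. With the STFT left unmodified, the reconstruction is essentially exact away from the signal edges.

```python
import numpy as np

L, R = 256, 64              # window length and hop (R = L/4)
n = np.arange(L)
w = 0.5 - 0.5 * np.cos(2 * np.pi * n / L)   # "periodic" Hann analysis window

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)

# Analysis: one windowed DFT frame per window position rR
frames = [x[r * R:r * R + L] * w for r in range((len(x) - L) // R + 1)]
X = np.fft.fft(frames, axis=-1)

# Synthesis: inverse DFT of each (unmodified) frame, then overlap-add
y = np.zeros(len(x))
wsum = np.zeros(len(x))
for r, Xr in enumerate(X):
    y[r * R:r * R + L] += np.real(np.fft.ifft(Xr))
    wsum[r * R:r * R + L] += w
ok = wsum > 1e-8
y[ok] /= wsum[ok]           # normalize by the summed window overlap

# Away from the edges the reconstruction is essentially exact
assert np.allclose(y[L:-L], x[L:-L])
```

With this hop the shifted Hann windows sum to a constant in the interior, which is a discrete analog of the reconstruction condition discussed below.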
One condition that guarantees exact reconstruction (y[n] = x[n]) from the unmodified STFT (YrR[k] = XrR[k]) is [24, 129]:

    Σ_{r=−∞}^{∞} w[rR − n + qN] f[n − rR] = { 1, q = 0;  0, q ≠ 0 }.   (4.25)

The fact that almost perfect reconstruction can be achieved from the filter bank channel signals gives the short-time Fourier representation major credibility in the digital speech processing tool kit.

Short-time Fourier analysis and synthesis is generally formulated, as above, with equally spaced channels of equal bandwidth. However, as discussed briefly in Chapter 3, models for auditory processing involve nonuniform filter banks. Such filter banks can be implemented as tree structures in which frequency bands are successively divided into low- and high-frequency bands. Such structures are essentially the same as wavelet decompositions, a topic beyond the scope of this text, but one for which a large body of literature exists. See, for example, [18, 125, 129].

4.8 Short-Time Analysis is Fundamental to our Thinking

It can be argued that the short-time analysis principle, and particularly the short-time Fourier representation of speech, is fundamental to our thinking about the speech signal, and that it leads us to a wide variety of techniques for achieving our goal of moving from the sampled time waveform back along the speech chain toward the implicit message. This importance is strengthened by the fact that, as discussed briefly in Chapter 3, models for auditory processing are based on a filter bank as the first stage of processing. Much of our knowledge of perceptual effects is framed in terms of frequency analysis, and thus the STFT representation provides a natural framework within which this knowledge can be represented and exploited to obtain efficient representations of speech and more general audio signals. We will have more to say about this in Chapter 7, but first we will consider the technique of cepstrum analysis (in Chapter 5), which is based directly on the STFT, and the technique of linear predictive analysis (in Chapter 6), which is based on the STACF and thus equivalently on the STFT.
5 Homomorphic Speech Analysis

The STFT provides a useful framework for thinking about almost all the important techniques for analysis of speech that have been developed so far. An important concept that flows directly from the STFT is the cepstrum, or more specifically, the short-time cepstrum. In this chapter, we explore the use of the short-time cepstrum as a representation of speech and as a basis for estimating the parameters of the speech generation model. A more detailed discussion of the uses of the cepstrum in speech processing can be found in [110].

5.1 Definition of the Cepstrum and Complex Cepstrum

The cepstrum was defined by Bogert, Healy, and Tukey to be the inverse Fourier transform of the log magnitude spectrum of a signal [16]. Their original definition, loosely framed in terms of spectrum analysis of analog signals, was motivated by the fact that the logarithm of the Fourier spectrum of a signal containing an echo has an additive periodic component depending only on the echo size and delay, and that further Fourier analysis of the log spectrum can aid in detecting the presence of that echo. Oppenheim, Schafer, and Stockham showed that the cepstrum is related to the more general concept of homomorphic filtering of signals that are combined by convolution [85, 90, 109]. They gave a definition of the cepstrum of a discrete-time signal x[n] as

    c[n] = (1/2π) ∫_{−π}^{π} log|X(e^{jω})| e^{jωn} dω,   (5.1a)

where log|X(e^{jω})| is the logarithm of the magnitude of the DTFT of the signal, and they extended the concept by defining the complex cepstrum as

    x̂[n] = (1/2π) ∫_{−π}^{π} log{X(e^{jω})} e^{jωn} dω,   (5.1b)

where log{X(e^{jω})} is the complex logarithm of X(e^{jω}), defined as

    X̂(e^{jω}) = log{X(e^{jω})} = log|X(e^{jω})| + j arg[X(e^{jω})].   (5.1c)

The transformation implied by (5.1b) is depicted as the block diagram in Figure 5.1. The same diagram represents the cepstrum if the complex logarithm is replaced by the logarithm of the magnitude of the DTFT.

[Figure 5.1: Computing the complex cepstrum using the DTFT.]

As shown in Figure 5.1, the operation of computing the complex cepstrum from the input can be denoted as x̂[n] = D∗{x[n]}. In the theory of homomorphic systems, D∗{ } is called the characteristic system for convolution. Since we restrict our attention to real sequences, it follows from the symmetry properties of Fourier transforms that the cepstrum is the even part of the complex cepstrum, i.e., c[n] = (x̂[n] + x̂[−n])/2 [89].

The connection between the cepstrum concept and homomorphic filtering of convolved signals is that the complex cepstrum has the property that if x[n] = x1[n] ∗ x2[n], then

    x̂[n] = D∗{x1[n] ∗ x2[n]} = x̂1[n] + x̂2[n].   (5.2)
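The additivity in (5.2) is easy to check numerically for the (real) cepstrum, since the log magnitude of a product of DFTs is the sum of the log magnitudes. The sketch below is our own illustration with arbitrary random test signals; the helper name `real_cepstrum` is ours, not from the text.

```python
import numpy as np

def real_cepstrum(x, N=1024):
    """Cepstrum via the DFT: inverse DFT of the log magnitude spectrum."""
    X = np.fft.fft(x, N)
    return np.real(np.fft.ifft(np.log(np.abs(X))))

rng = np.random.default_rng(1)
x1 = rng.standard_normal(64)
x2 = rng.standard_normal(32)

c1 = real_cepstrum(x1)
c2 = real_cepstrum(x2)
c12 = real_cepstrum(np.convolve(x1, x2))   # cepstrum of the convolution

# log|X1 X2| = log|X1| + log|X2|, so convolution maps to addition
assert np.allclose(c12, c1 + c2)
```

Because the DFT length exceeds the length of the convolved sequence, the DFT of the convolution equals the product of the individual DFTs, and the identity holds to machine precision.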
That is, the complex cepstrum operator transforms convolution into addition. This property, which holds for both the cepstrum and the complex cepstrum, is what makes them useful for speech analysis, since our model for speech production involves convolution of the excitation with the vocal tract impulse response, and our goal is often to separate the excitation signal from the vocal tract signal.

In the case of the complex cepstrum, the inverse of the characteristic system exists, as shown in Figure 5.2, which shows the reverse cascade of the inverses of the operators in Figure 5.1.

[Figure 5.2: The inverse of the characteristic system for convolution (inverse complex cepstrum).]

Homomorphic filtering of convolved signals is achieved by forming a modified complex cepstrum

    ŷ[n] = g[n] x̂[n],   (5.3)

where g[n] is a window (a "lifter" in the terminology of Bogert et al.) which selects a portion of the complex cepstrum for inverse processing. A modified output signal y[n] can then be obtained as the output of Figure 5.2 with ŷ[n] given by (5.3) as input. Observe that (5.3) defines a linear operator in the conventional sense, i.e., if x̂[n] = x̂1[n] + x̂2[n], then ŷ[n] = g[n]x̂1[n] + g[n]x̂2[n]. Therefore, the output of the inverse characteristic system will have the form y[n] = y1[n] ∗ y2[n], where ŷ1[n] = g[n]x̂1[n] is the complex cepstrum of y1[n], etc. Examples of lifters used in homomorphic filtering of speech are given in Sections 5.4 and 5.6.

The key issue in the definition and computation of the complex cepstrum is the computation of the complex logarithm, more specifically, the computation of the phase angle arg[X(e^{jω})], which must be done so as to preserve an additive combination of phases for two signals combined by convolution [90, 109].
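To make the liftering idea concrete, the following sketch (entirely our own synthetic example; the pitch period, the resonance used as a stand-in vocal tract response, and the lifter cutoff are arbitrary choices) applies a low-quefrency lifter to the cepstrum of a "voiced-like" signal. The DFT of the liftered cepstrum is a smoothed log spectrum with the pitch ripples removed.

```python
import numpy as np

# Synthetic "voiced" segment: periodic impulses convolved with a
# decaying resonance (a crude stand-in for a vocal tract response)
P = 80                                   # pitch period in samples
e = np.zeros(400); e[::P] = 1.0          # excitation: 5 pitch pulses
n = np.arange(200)
h = (0.9 ** n) * np.cos(0.3 * np.pi * n) # vocal-tract-like impulse response
x = np.convolve(e, h)

N = 2048
logmag = np.log(np.abs(np.fft.fft(x, N)))
c = np.real(np.fft.ifft(logmag))         # real cepstrum

# The periodic excitation shows up as a cepstral peak at the pitch period
peak = 60 + np.argmax(c[60:121])
assert abs(peak - P) <= 1

# Low-quefrency lifter: keep quefrencies below 40 samples (both ends of
# the cepstrum buffer, since the cepstrum is even modulo N)
g = np.zeros(N); g[:40] = 1.0; g[-39:] = 1.0
smooth = np.real(np.fft.fft(g * c))      # homomorphically smoothed log spectrum

# Liftering strips the fast pitch ripples, leaving a smoother envelope
assert np.sum(np.abs(np.diff(smooth))) < np.sum(np.abs(np.diff(logmag)))
```

The smoothed curve retains the broad resonance shape contributed by h[n], while the rapid harmonic ripples contributed by the excitation are removed; this is exactly the separation exploited in Section 5.4.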
Bogert et al. [16] coined the word cepstrum by transposing some of the letters in the word spectrum. The crucial observation leading to the cepstrum concept is that the log spectrum can be treated as a waveform to be subjected to further Fourier analysis. They created many other special terms in this way, including quefrency as the name for the independent variable of the cepstrum, and liftering for the operation of linearly filtering the log magnitude spectrum by the operation of (5.3). Only the terms cepstrum, quefrency, and liftering are widely used today. The independent variable of the cepstrum and complex cepstrum is nominally time; to emphasize this interchanging of domains of reference, a "cepstrogram" would be an image obtained by plotting the magnitude of the short-time cepstrum as a function of quefrency m and analysis time n.

5.2 The Short-Time Cepstrum

The application of these definitions to speech requires that the DTFT be replaced by the STFT. The short-time cepstrum is a sequence of cepstra of windowed finite-duration segments of the speech waveform. Thus the short-time cepstrum is defined as(1)

    cn[m] = (1/2π) ∫_{−π}^{π} log|X̂n(e^{jω̂})| e^{jω̂m} dω̂,   (5.4)

where X̂n(e^{jω̂}) is the STFT defined in (4.15); the short-time complex cepstrum is likewise defined by replacing X(e^{jω}) by X̂n(e^{jω̂}) in (5.1b). Note the similarity of (5.4) to the STACF defined by (4.9).

(1) In cases where we wish to explicitly indicate analysis-time dependence of the short-time cepstrum, we will use n for the analysis time and m for quefrency, as in (5.4); but as in other instances, we often suppress the subscript n.

5.3 Computation of the Cepstrum

In (5.1a) and (5.1b), the cepstrum and complex cepstrum are defined in terms of the DTFT. This is useful in the basic definitions, but not for use in processing sampled speech signals. Fortunately, several computational options exist.

5.3.1 Computation Using the DFT

Since the DFT (computed with an FFT algorithm) is a sampled (in frequency) version of the DTFT of a finite-length sequence (i.e., X[k] = X(e^{j2πk/N}) [89]), the DFT and inverse DFT can be substituted for the DTFT and its inverse in (5.1a) and (5.1b), as shown in Figure 5.3. That is, the complex cepstrum can be computed approximately using the equations

    X[k] = Σ_{n=0}^{N−1} x[n] e^{−j(2πk/N)n},   (5.5a)
    X̂[k] = log|X[k]| + j arg{X[k]},   (5.5b)
    x̃̂[n] = (1/N) Σ_{k=0}^{N−1} X̂[k] e^{j(2πk/N)n}.   (5.5c)

[Figure 5.3: Computing the cepstrum or complex cepstrum using the DFT.]

Note the "tilde" symbol above x̂[n] in (5.5c). Its purpose is to emphasize that using the DFT instead of the DTFT results in an approximation to (5.1b), due to the time-domain aliasing that results from the sampling of the log of the DTFT [89]; specifically,

    x̃̂[n] = Σ_{r=−∞}^{∞} x̂[n + rN],   (5.6)

where x̂[n] is the complex cepstrum defined by (5.1b). An identical equation holds for the time-aliased cepstrum c̃[n]. The effect of time-domain aliasing can be made negligible by using a large value for N.

A more serious problem in the computation of the complex cepstrum is the computation of the complex logarithm.
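The DFT computation of (5.5a)–(5.5c) can be checked against a case whose complex cepstrum is known in closed form. The sketch below is our own example (the zero location a = 0.8 and DFT length are arbitrary): for x[n] = δ[n] − a δ[n − 1] with |a| < 1, the complex cepstrum is x̂[n] = −a^n/n for n ≥ 1 and zero otherwise, and a large N makes the aliasing of (5.6) negligible.

```python
import numpy as np

# Complex cepstrum of x[n] = d[n] - a d[n-1] (one zero inside the unit
# circle), computed with the DFT as in (5.5a)-(5.5c).
a = 0.8
x = np.array([1.0, -a])
N = 4096                                  # large N keeps (5.6) aliasing negligible

X = np.fft.fft(x, N)
logX = np.log(np.abs(X)) + 1j * np.unwrap(np.angle(X))   # (5.5b), unwrapped phase
xhat = np.real(np.fft.ifft(logX))                        # (5.5c)

n = np.arange(1, 11)
analytic = -(a ** n) / n                 # closed-form complex cepstrum
assert np.allclose(xhat[1:11], analytic, atol=1e-8)
assert abs(xhat[0]) < 1e-8               # x-hat[0] = log A = log 1 = 0
```

Here the densely sampled phase is continuous, so simple `np.unwrap` suffices; for signals with zeros near or on the unit circle, phase unwrapping is far more delicate, which is exactly the difficulty discussed next.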
The difficulty arises because the angle of a complex number is usually specified modulo 2π, i.e., by the principal value. In order for the complex cepstrum to be evaluated properly, the phase of the sampled DTFT must be evaluated as samples of a continuous function of frequency. If the principal-value phase is first computed at discrete frequencies, then it must be "unwrapped" modulo 2π in order to ensure that convolutions are transformed into additions. A variety of algorithms have been developed for phase unwrapping [90, 109, 127]. While accurate phase unwrapping presents a challenge in computing the complex cepstrum, it is not a problem in computing the cepstrum, since the phase is not used.

5.3.2 z-Transform Analysis

The characteristic system for convolution can also be represented by the two-sided z-transform, as depicted in Figure 5.4. Furthermore, phase unwrapping can be avoided by using numerical computation of the poles and zeros of the z-transform. This is very useful for theoretical investigations, and recent developments in polynomial root finding [117] have made the z-transform representation a viable computational basis as well.

[Figure 5.4: z-transform representation of the characteristic system for convolution.]

For this purpose, we assume that the input signal x[n] has a rational z-transform of the form:

    X(z) = Xmax(z) Xuc(z) Xmin(z),   (5.7)

where

    Xmax(z) = z^{Mo} ∏_{k=1}^{Mo} (1 − a_k z^{−1}) = ∏_{k=1}^{Mo} (−a_k) ∏_{k=1}^{Mo} (1 − a_k^{−1} z),   (5.8a)
    Xuc(z) = ∏_{k=1}^{Muc} (1 − e^{jθ_k} z^{−1}),   (5.8b)
    Xmin(z) = A ∏_{k=1}^{Mi} (1 − b_k z^{−1}) / ∏_{k=1}^{Ni} (1 − c_k z^{−1}).   (5.8c)

The zeros of Xmax(z), z_k = a_k, are zeros of X(z) outside the unit circle (|a_k| > 1); Xmax(z) is thus the maximum-phase part of X(z). The factor z^{Mo} implies a shift of Mo samples to the left; it is included to simplify the results in (5.11). Xuc(z) contains all the zeros (with angles θ_k) on the unit circle. The minimum-phase part is Xmin(z), where b_k and c_k are the zeros and poles, respectively, that are inside the unit circle (|b_k| < 1 and |c_k| < 1).

The complex cepstrum of x[n] is determined by assuming that the complex logarithm log{X(z)} results in the sum of logarithms of each of the product terms, i.e.,

    X̂(z) = log[∏_{k=1}^{Mo} (−a_k)] + Σ_{k=1}^{Mo} log(1 − a_k^{−1} z) + Σ_{k=1}^{Muc} log(1 − e^{jθ_k} z^{−1})
            + log A + Σ_{k=1}^{Mi} log(1 − b_k z^{−1}) − Σ_{k=1}^{Ni} log(1 − c_k z^{−1}).   (5.9)

Applying the power series expansion

    log(1 − a) = −Σ_{n=1}^{∞} a^n/n,  |a| < 1,   (5.10)

to each of the terms in (5.9) and collecting the coefficients of the positive and negative powers of z gives

    x̂[n] = { Σ_{k=1}^{Mo} a_k^n / n,                                                        n < 0;
              log A + log[∏_{k=1}^{Mo} (−a_k)],                                              n = 0;
              −Σ_{k=1}^{Mi} b_k^n / n + Σ_{k=1}^{Ni} c_k^n / n − Σ_{k=1}^{Muc} e^{jθ_k n}/n,  n > 0 }.   (5.11)

Given all the poles and zeros of a z-transform X(z), (5.11) allows us to compute the complex cepstrum with no approximation. This is the case in theoretical analysis, where the poles and zeros are specified. However, (5.11) is also useful as the basis for computation; this has become more feasible with increasing computational power and with new advances in finding roots of large polynomials [117]. All that is needed is a process for obtaining the z-transform as a rational function and a process for finding the zeros of the numerator and denominator.

One method of obtaining a z-transform is simply to select a finite-length sequence of samples of a signal. The z-transform is then simply a polynomial with the samples x[n] as coefficients, which can be represented in terms of its roots, i.e.,

    X(z) = Σ_{n=0}^{M} x[n] z^{−n} = A ∏_{k=1}^{Mo} (1 − a_k z^{−1}) ∏_{k=1}^{Mi} (1 − b_k z^{−1}).   (5.12)

A second method that yields a rational z-transform is the method of linear predictive analysis, to be discussed in Chapter 6. An example would be the impulse response of an all-pole vocal tract model with system function

    H(z) = G / (1 − Σ_{k=1}^{p} α_k z^{−k}) = G / ∏_{k=1}^{p} (1 − c_k z^{−1}).   (5.13)

Such models are implicit in the use of linear predictive analysis of speech (Chapter 6). In this case, all the poles c_k must be inside the unit circle for stability of the system. From (5.11) it follows that the complex cepstrum of the impulse response h[n] corresponding to H(z) is

    ĥ[n] = { 0, n < 0;  log G, n = 0;  Σ_{k=1}^{p} c_k^n / n, n > 0 }.   (5.14)

5.3.3 Recursive Computation of the Complex Cepstrum

Another approach to computing the complex cepstrum applies only to minimum-phase signals, i.e., signals having a z-transform whose poles and zeros are all inside the unit circle.
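The n > 0 branch of (5.11) can be exercised numerically for a minimum-phase example and checked against the DFT computation of (5.5a)–(5.5c). The sketch below is our own (the particular zero locations are arbitrary; `np.poly` is used to build the signal whose z-transform has those zeros).

```python
import numpy as np

# Complex cepstrum from the zeros of X(z), using the n > 0 branch of
# (5.11) for a minimum-phase case (all zeros inside the unit circle),
# checked against the DFT method of (5.5a)-(5.5c).
zeros = np.array([0.8, -0.6,
                  0.5 * np.exp(0.4j * np.pi), 0.5 * np.exp(-0.4j * np.pi)])
x = np.real(np.poly(zeros))          # x[n] with X(z) = prod_k (1 - b_k z^-1)

n = np.arange(1, 21)
xhat_roots = np.array([-np.sum(zeros ** m / m).real for m in n])  # (5.11), n > 0

N = 8192
X = np.fft.fft(x, N)
logX = np.log(np.abs(X)) + 1j * np.unwrap(np.angle(X))
xhat_dft = np.real(np.fft.ifft(logX))

assert np.allclose(xhat_roots, xhat_dft[1:21], atol=1e-6)
```

Because the zeros occur in conjugate pairs, the root-based cepstrum is real, and the dense frequency sampling keeps both the phase unwrapping and the aliasing of (5.6) benign.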
It can be shown [90] that the impulse response and its complex cepstrum are related by the recursion formula:

    ĥ[n] = { 0,                                                          n < 0;
             log G,                                                      n = 0;
             h[n]/h[0] − Σ_{k=1}^{n−1} (k/n) ĥ[k] (h[n−k]/h[0]),          n ≥ 1 }.   (5.15)

Furthermore, it can be shown that there is a direct recursive relationship between the coefficients of the denominator polynomial in (5.13) and the complex cepstrum of the impulse response of the model filter. Working with the reciprocal (negative logarithm) of (5.13), it can be shown that

    ĥ[n] = { 0,                                              n < 0;
             log G,                                          n = 0;
             α_n + Σ_{k=1}^{n−1} (k/n) ĥ[k] α_{n−k},          n > 0 },   (5.16)

(with α_n = 0 for n > p). From (5.16) it follows that the coefficients of the denominator polynomial can be obtained from the complex cepstrum through

    α_n = ĥ[n] − Σ_{k=1}^{n−1} (k/n) ĥ[k] α_{n−k},  1 ≤ n ≤ p.   (5.17)

From (5.17), it follows that p + 1 values of the complex cepstrum are sufficient to fully determine the speech model system in (5.13), since all the denominator coefficients and G can be computed from ĥ[n] for n = 0, 1, ..., p using (5.17). This fact is the basis for the use of the cepstrum in speech coding and speech recognition as a vector representation of the vocal tract properties of a frame of speech.

5.4 Short-Time Homomorphic Filtering of Speech

Figure 5.5 shows an example of the short-time cepstrum of speech for the segments of voiced and unvoiced speech in Figures 4.5(a) and 4.5(c). The low quefrency part of the cepstrum is expected to be representative of the slow variations (with frequency) in the log spectrum, while the high quefrency components correspond to the more rapid fluctuations of the log spectrum.
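The recursion (5.15) can be verified against the closed form (5.14) for the simplest all-pole model. The sketch below is our own check for H(z) = G/(1 − c z⁻¹), whose impulse response is h[n] = G cⁿ and whose complex cepstrum is ĥ[0] = log G, ĥ[n] = cⁿ/n for n > 0 (the values of G and c are arbitrary choices).

```python
import numpy as np

# Recursion (5.15) for a minimum-phase impulse response, checked against
# the closed-form cepstrum (5.14) of H(z) = G / (1 - c z^-1).
G, c = 2.0, 0.9
M = 21
h = G * c ** np.arange(M)            # impulse response h[n] = G c^n

hhat = np.zeros(M)
hhat[0] = np.log(h[0])               # log G
for n in range(1, M):
    acc = h[n] / h[0]
    for k in range(1, n):
        acc -= (k / n) * hhat[k] * h[n - k] / h[0]
    hhat[n] = acc

n = np.arange(1, M)
assert np.allclose(hhat[1:], c ** n / n)
assert np.isclose(hhat[0], np.log(G))
```

The recursion needs only the first p + 1 (here, M) samples of h[n], which is what makes it attractive for minimum-phase model-based analysis.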
This is illustrated by the plots in Figure 5.5, which shows the short-time cepstra of the voiced and unvoiced segments on the left, as Figures 5.5(a) and 5.5(c); the log magnitudes of the corresponding short-time spectra are shown on the right, as Figures 5.5(b) and 5.5(d).

[Figure 5.5: Short-time cepstra (left, vs. time in ms) and corresponding STFTs with homomorphically smoothed spectra superimposed (right, log magnitude vs. frequency in Hz), for voiced (a, b) and unvoiced (c, d) segments.]

Note that the spectrum for the voiced segment in Figure 5.5(b) has a structure of periodic ripples due to the harmonic structure of the quasi-periodic segment of voiced speech. This periodic structure in the log spectrum of Figure 5.5(b) manifests itself in the cepstrum peak at a quefrency of about 11–12 ms in Figure 5.5(a). The existence of this peak in the quefrency range of expected pitch periods strongly signals voiced speech, and the quefrency of the peak is an accurate estimate of the pitch period during the corresponding speech interval. As shown in Figure 4.5, the autocorrelation function also displays an indication of periodicity, but not nearly as unambiguously as does the cepstrum. On the other hand, note that the rapid variations of the unvoiced spectrum in Figure 5.5(d) appear random, with no periodic structure; this is typical of Fourier transforms (periodograms) of short segments of random signals. As a result, there is no strong peak in Figure 5.5(c) indicating periodicity, as in the voiced case.

To illustrate the effect of liftering, the quefrencies above 5 ms are multiplied by zero and the quefrencies below 5 ms are multiplied by 1 (with a short transition taper, as shown in Figures 5.5(a) and 5.5(c)). The DFT of the resulting modified cepstrum is plotted as the smooth curve that is superimposed on the short-time spectra in Figures 5.5(b) and 5.5(d). These slowly varying log spectra clearly retain the general spectral shape, with peaks corresponding to the formant resonance structure for the segment of speech under analysis. Therefore, a useful perspective is that by liftering the cepstrum, it is possible to separate information about the vocal tract contributions from the short-time speech spectrum [88].

Noll [83] applied the short-time cepstrum to detect local periodicity (voiced speech) or the lack thereof (unvoiced speech). This is illustrated in Figure 5.6, which shows a plot that is very similar to the plot first published by Noll [83]. On the left is a sequence of log short-time spectra (the rapidly varying curves), and on the right is the corresponding sequence of cepstra computed from the log spectra on the left. The successive spectra and cepstra are for 50 ms segments obtained by moving the window in steps of 12.5 ms (100 samples at a sampling rate of 8000 samples/sec). From the plots, it is apparent that for positions 1 through 5 the window includes only unvoiced speech, that for positions 6 and 7 the signal within the window is partly voiced and partly unvoiced, and that for positions 8 through 15 the window includes only voiced speech.
5.5 Application to Pitch Detection

The cepstrum was first applied in speech processing to determine the excitation parameters for the discrete-time speech model of Figure 7.10. As can be seen from the plots on the right of Figure 5.6, the cepstra for the voiced window positions display a strong peak, while those for the unvoiced positions do not. Note again that the spectra for the voiced segments have a structure of periodic ripples due to the harmonic structure of the quasi-periodic voiced speech, while the rapid variations of the unvoiced spectra appear random, with no periodic structure.

[Figure 5.6: Short-time log spectra used in cepstrum analysis (left, frequency 0–4 kHz) and the corresponding short-time cepstra (right, time 0–25 ms), for window positions 1 through 15.]

The essence of the pitch detection algorithm proposed by Noll is to compute a sequence of short-time cepstra and to search each successive cepstrum for a peak in the quefrency region of the expected pitch period. Presence of a strong peak implies voiced speech, and the quefrency location of the peak gives the estimate of the pitch period. As in most model-based signal processing applications of concepts such as the cepstrum, the pitch detection algorithm includes many features designed to handle cases that do not fit the underlying model very well. For example, for frames 6 and 7 the cepstrum peak is weak, corresponding to the transition from unvoiced to voiced speech.
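The core of such a detector can be sketched in a few lines. The code below is our own illustration of the peak-picking step only (the synthetic voiced signal, the noise segment, the search band of 60–400 Hz, and the helper name `cepstral_pitch` are all our choices; a practical detector would add the continuity constraints discussed above).

```python
import numpy as np

def cepstral_pitch(frame, fs, fmin=60.0, fmax=400.0, nfft=2048):
    """Peak picking in the cepstrum over the expected pitch-period range."""
    c = np.real(np.fft.ifft(np.log(np.abs(np.fft.fft(frame, nfft)) + 1e-12)))
    qlo, qhi = int(fs / fmax), int(fs / fmin)
    q = qlo + np.argmax(c[qlo:qhi])
    return c[q], q / fs            # peak strength, candidate pitch period (s)

fs = 8000
P = 80                              # simulated pitch period: 100 Hz at 8 kHz
e = np.zeros(400); e[::P] = 1.0
h = (0.9 ** np.arange(200)) * np.cos(0.3 * np.pi * np.arange(200))
voiced = np.convolve(e, h)          # periodic excitation through a resonance
unvoiced = np.random.default_rng(3).standard_normal(400)

pv, Tv = cepstral_pitch(voiced, fs)
pu, Tu = cepstral_pitch(unvoiced, fs)

assert abs(Tv - P / fs) < 0.5e-3    # pitch period recovered to ~0.5 ms
assert pv > pu                      # voiced peak much stronger than unvoiced
```

Thresholding the peak strength (pv versus pu here) is what separates the voiced/unvoiced decision; errors such as picking the peak at twice the pitch period are handled by the continuity logic described next.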
In other problematic cases, the peak at twice the pitch period may be stronger than the peak at the quefrency of the pitch period. Noll applied temporal continuity constraints to prevent such errors.

5.6 Applications to Pattern Recognition

Perhaps the most pervasive application of the cepstrum in speech processing is its use in pattern recognition systems, such as the design of vector quantizers (VQ) and automatic speech recognizers (ASR). In such applications, a speech signal is represented on a frame-by-frame basis by a sequence of short-time cepstra. As we have shown, cepstra can be computed either by z-transform analysis or by DFT implementation of the characteristic system. In either case, we can assume that the cepstrum vector corresponds to a gain-normalized (c[0] = 0) minimum-phase vocal tract impulse response that is defined by the complex cepstrum(2)

    ĥ[n] = { 2c[n], 1 ≤ n ≤ nco;  0, otherwise }.   (5.18)

In problems such as VQ or ASR, a test pattern c[n] (a vector of cepstrum values, n = 1, 2, ..., nco) is compared against a similarly defined reference pattern c̄[n]. Such comparisons require a distance measure. For example, the Euclidean distance applied to the cepstrum would give

    D = Σ_{n=1}^{nco} |c[n] − c̄[n]|².   (5.19a)

Equivalently, in the frequency domain,

    D = (1/2π) ∫_{−π}^{π} | log|H(e^{jω̂})| − log|H̄(e^{jω̂})| |² dω̂,   (5.19b)

where log|H(e^{jω̂})| is the log magnitude of the DTFT of ĥ[n] in (5.18), and similarly for the reference. Thus, cepstrum-based comparisons are strongly related to comparisons of smoothed short-time spectra, and the interpretation of the cepstrum distance as a difference of log spectra suggests a match to auditory perception mechanisms. In short, the cepstrum offers an effective and flexible representation of speech for pattern recognition problems.

(2) For minimum-phase signals, the complex cepstrum satisfies ĥ[n] = 0 for n < 0. Since the cepstrum is always the even part of the complex cepstrum, it follows that ĥ[n] = 2c[n] for n > 0.

5.6.1 Compensation for Linear Filtering

Suppose that we have only a linearly filtered version of the speech signal, i.e., y[n] = h[n] ∗ x[n] instead of x[n], where the distorting filter h[n] is assumed to be non-time-varying. In this section, it will be useful to use somewhat more complicated notation: we denote the cepstrum at analysis time n of a signal x[n] as c_n^{(x)}[m], where m denotes the quefrency index of the cepstrum. If the analysis window is long compared to the length of h[n], the short-time cepstrum of one frame of the filtered speech signal y[n] will be

    c_n^{(y)}[m] = c_n^{(x)}[m] + c^{(h)}[m],   (5.20)

where c^{(h)}[m] will appear essentially the same in each frame. Therefore, if we can estimate c^{(h)}[m],(4) we can obtain c_n^{(x)}[m] at each frame from c_n^{(y)}[m] by subtraction, i.e., c_n^{(x)}[m] = c_n^{(y)}[m] − c^{(h)}[m]. In this way, the test vector can be compensated for the effects of the linear filtering prior to computing the distance measures used for comparison of patterns.

Another approach to removing the effects of linear distortions is to observe that the cepstrum component due to the distortion is the same in each frame; therefore, it can be removed by a simple first difference operation of the form:

    Δc_n^{(y)}[m] = c_n^{(y)}[m] − c_{n−1}^{(y)}[m].   (5.21)

It is clear that if c_n^{(y)}[m] = c_n^{(x)}[m] + c^{(h)}[m], with c^{(h)}[m] being independent of n, then Δc_n^{(y)}[m] = Δc_n^{(x)}[m], i.e., the linear distortion effects are removed. This property is extremely attractive in situations where the reference pattern c̄[m] has been obtained under different recording or transmission conditions from those used to acquire the test vector.

(4) Stockham [124] showed how c^{(h)}[m] for such linear distortions can be estimated from the signal y[n] by time-averaging the log of the STFT.
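Both compensation strategies, mean subtraction in the spirit of Stockham's time-averaged estimate and the delta cepstrum of (5.21), can be demonstrated numerically. In the synthetic setup below (our own; random "frames" convolved with a fixed random channel, with a DFT long enough that (5.20) holds exactly rather than approximately), both remove the channel term completely.

```python
import numpy as np

def cep(x, N=1024):
    """Real cepstrum via the DFT."""
    return np.real(np.fft.ifft(np.log(np.abs(np.fft.fft(x, N)))))

rng = np.random.default_rng(7)
frames = [rng.standard_normal(256) for _ in range(6)]     # "clean" frames
h = rng.standard_normal(32)                               # fixed channel filter

cx = np.array([cep(f) for f in frames])
cy = np.array([cep(np.convolve(f, h)) for f in frames])   # filtered frames

# (5.20): each filtered-frame cepstrum is the clean one plus c_h
assert np.allclose(cy, cx + cep(h))

# Delta cepstrum (5.21): frame-to-frame differences are unaffected by h
assert np.allclose(np.diff(cy, axis=0), np.diff(cx, axis=0))

# Subtracting the time-averaged cepstrum also removes the channel term
assert np.allclose(cy - cy.mean(axis=0), cx - cx.mean(axis=0))
```

In real short-time analysis the additivity of (5.20) is only approximate (the window must be long relative to h[n]), so the cancellation is correspondingly approximate rather than exact.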
Furui [39, 40] first noted that the sequence of cepstrum values has temporal information that could be of value for a speaker verification system. He used polynomial fits to cepstrum sequences to extract simple representations of the temporal variation; the delta cepstrum as defined in (5.21) is simply the slope of a first-order polynomial fit to the cepstrum time evolution.

5.6.2 Liftered Cepstrum Distance Measures

In using linear predictive analysis (discussed in Chapter 6) to obtain cepstrum feature vectors for pattern recognition problems, it is observed that there is significant statistical variability due to a variety of factors, including short-time analysis window position, bias toward harmonic peaks, and additive noise [63, 126]. Tohkura [126] found, for example, that when averaged over many frames of speech and speakers, cepstrum values c[n] have zero means and variances on the order of 1/n². This suggests that g[n] = n for n = 1, 2, ..., nco could be used to equalize the contributions of each term to the cepstrum difference. A solution to the variability problem is thus to use weighted distance measures of the form:

    D = Σ_{n=1}^{nco} g²[n] |c[n] − c̄[n]|²,   (5.22a)

which can be written as the Euclidean distance of liftered cepstra,

    D = Σ_{n=1}^{nco} |g[n]c[n] − g[n]c̄[n]|².   (5.22b)

Tests of weighted distance measures showed consistent improvements in automatic speech recognition tasks. Juang et al. [63] observed that the variability due to the vagaries of LPC analysis could be lessened by using, instead of g[n] = n for all n, a lifter of the form:

    g[n] = 1 + 0.5 nco sin(πn/nco),  n = 1, 2, ..., nco.   (5.23)

Itakura and Umezaki [55] used the group delay function to derive a different cepstrum weighting function.
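The lifter of (5.23) and the equivalence of the two distance forms (5.22a) and (5.22b) are easily checked. The short sketch below is our own (the cepstrum vectors are arbitrary random examples, scaled by 1/n to mimic the 1/n² variance behavior noted by Tohkura):

```python
import numpy as np

nco = 12
n = np.arange(1, nco + 1)
g = 1.0 + 0.5 * nco * np.sin(np.pi * n / nco)     # lifter of (5.23)

rng = np.random.default_rng(11)
c_test = rng.standard_normal(nco) / n             # test cepstrum vector
c_ref = rng.standard_normal(nco) / n              # reference cepstrum vector

# (5.22a) weighted distance vs. (5.22b) Euclidean distance of liftered cepstra
D_a = np.sum(g ** 2 * (c_test - c_ref) ** 2)
D_b = np.sum((g * c_test - g * c_ref) ** 2)
assert np.isclose(D_a, D_b)

# The lifter peaks at mid quefrency, de-emphasizing both ends
assert np.argmax(g) == nco // 2 - 1
```

The raised-sine shape suppresses both the low-quefrency terms (sensitive to spectral tilt and channel effects) and the high-quefrency terms (sensitive to analysis variability).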
Itakura proposed the lifter

g[n] = n^s e^{−n²/2τ²}.  (5.24)

This lifter has great flexibility: if s = 0 we have simply lowpass liftering of the cepstrum, while if s = 1 and τ is large, we have essentially g[n] = n for small n with high-quefrency tapering.

Itakura and Umezaki [55] tested the group delay spectrum distance measure obtained with the lifter of (5.23) in an automatic speech recognition system. They found that for clean test utterances, the difference in recognition rate was small for different values of s when τ ≈ 5, although performance suffered with increasing s for larger values of τ. This was attributed to the fact that for larger s the group delay spectrum becomes very sharply peaked and thus more sensitive to small differences in formant locations. However, in test conditions with additive white noise and also with linear filtering distortions, recognition rates improved significantly with τ = 5 and increasing values of the parameter s.

5.6.3 Mel-Frequency Cepstrum Coefficients

As we have seen, weighted cepstrum distance measures have a directly equivalent interpretation in terms of distance in the frequency domain. This is significant in light of models for human perception of sound, which, as noted in Chapter 3, are based upon a frequency analysis performed in the inner ear. With this in mind, Davis and Mermelstein [27] formulated a new type of cepstrum representation that has come to be widely used and is known as the mel-frequency cepstrum coefficients (mfcc). The basic idea is to compute a frequency analysis based upon a filter bank with approximately critical band spacing of the filters and bandwidths. For 4 kHz bandwidth, approximately 20 filters are used. In most implementations, a short-time Fourier analysis is done first, resulting in a DFT X_n̂[k] for analysis time n̂. Then the DFT values are grouped together in critical bands and weighted by a triangular weighting function as depicted in Figure 5.7. Note that the bandwidths in Figure 5.7 are constant for center frequencies below 1 kHz and then increase exponentially up to half the sampling rate of 4 kHz, resulting in a total of 22 "filters."

Fig. 5.7 Weighting functions for mel-frequency filter bank.

The mel-frequency spectrum at analysis time n̂ is defined for r = 1, 2, ..., R as

MF_n̂[r] = (1/A_r) Σ_{k=L_r}^{U_r} |V_r[k] X_n̂[k]|²,  (5.25a)

where V_r[k] is the triangular weighting function for the rth filter, ranging from DFT index L_r to U_r, and where

A_r = Σ_{k=L_r}^{U_r} |V_r[k]|²  (5.25b)

is a normalizing factor for the rth mel-filter. This normalization is built into the weighting functions of Figure 5.7; it is needed so that a perfectly flat input Fourier spectrum will produce a flat mel-spectrum. For each frame, a discrete cosine transform of the log of the magnitude of the filter outputs is computed to form the function mfcc_n̂[m], i.e.,

mfcc_n̂[m] = (1/R) Σ_{r=1}^{R} log(MF_n̂[r]) cos((2π/R)(r + 1/2)m).  (5.26)

Typically, mfcc_n̂[m] is evaluated for a number of coefficients, N_mfcc, that is less than the number of mel-filters, e.g., N_mfcc = 13 and R = 22.

Figure 5.8 shows the result of mfcc analysis of a frame of voiced speech in comparison with the short-time Fourier spectrum, the LPC spectrum (discussed in Chapter 6), and a homomorphically smoothed spectrum. The large dots are the values of log(MF_n̂[r]) and the line interpolated between them is a spectrum reconstructed at the original DFT frequencies.
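The computation of (5.25a)-(5.26) can be sketched numerically as below. This is a minimal illustration only: the mel warping formula and triangular band edges used here are common textbook approximations, not the specific critical-band edges of Figure 5.7, and the frame below is synthetic.

```python
# Sketch of the mfcc computation of (5.25a)-(5.26), assuming a simple
# mel-spaced triangular filter bank (an approximation, not the exact
# critical-band edges of Figure 5.7).
import numpy as np

def mel_filterbank(R, nfft, fs):
    """Triangular weights V_r[k] on mel-spaced center frequencies."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)       # assumed warping
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = imel(np.linspace(0.0, mel(fs / 2.0), R + 2))      # R+2 edge freqs
    bins = np.floor((nfft + 1) * edges / fs).astype(int)
    V = np.zeros((R, nfft // 2 + 1))
    for r in range(R):
        lo, c, hi = bins[r], bins[r + 1], bins[r + 2]
        for k in range(lo, c):
            V[r, k] = (k - lo) / max(c - lo, 1)               # rising edge
        for k in range(c, hi):
            V[r, k] = (hi - k) / max(hi - c, 1)               # falling edge
    return V

def mfcc(frame, R=22, n_mfcc=13, fs=8000):
    X = np.fft.rfft(frame)                       # DFT X_n[k] of the frame
    V = mel_filterbank(R, len(frame), fs)
    A = np.sum(V ** 2, axis=1)                   # normalizers A_r of (5.25b)
    A[A == 0] = 1.0
    MF = (V ** 2) @ np.abs(X) ** 2 / A           # mel spectrum, (5.25a)
    r = np.arange(1, R + 1)
    m = np.arange(n_mfcc)
    # DCT of the log mel spectrum, (5.26)
    basis = np.cos(2.0 * np.pi / R * (r[None, :] + 0.5) * m[:, None])
    return (basis @ np.log(np.maximum(MF, 1e-12))) / R

coeffs = mfcc(np.hanning(256) * np.random.randn(256))
```

With the A_r normalization in place, a perfectly flat input spectrum (e.g., a unit impulse frame) produces a flat mel-spectrum and hence zero coefficients for m ≥ 1, as the text requires.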
At higher frequencies, the reconstructed mel-spectrum of course has more smoothing due to the structure of the filter bank. Note that all these spectra are different, but they have in common that they have peaks at the formant resonances.

Fig. 5.8 Comparison of spectral smoothing methods: short-time Fourier transform, homomorphic smoothing, mel cepstrum smoothing, and LPC smoothing (log magnitude versus frequency in Hz).

Note that the delta cepstrum idea expressed by (5.21) can be applied to mfcc to remove the effects of linear filtering, as long as the frequency response of the distorting linear filter does not vary much across each of the mel-frequency bands.

5.7 The Role of the Cepstrum

As discussed in this chapter, the cepstrum entered the realm of speech processing as a basis for pitch detection. It remains one of the most effective indicators of voice pitch that have been devised. Because the
vocal tract and excitation components are well separated in the cepstrum, it was natural to consider analysis techniques for estimation of the vocal tract system as well [86, 88, 111]. While separation techniques based on the cepstrum can be very effective, the linear predictive analysis methods to be discussed in the next chapter have proven to be more effective for a variety of reasons. Nevertheless, the cepstrum concept has demonstrated its value when applied to vocal tract system estimates obtained by linear predictive analysis.
6 Linear Predictive Analysis

Linear predictive analysis is one of the most powerful and widely used speech analysis techniques. The importance of this method lies both in its ability to provide accurate estimates of the speech parameters and in its relative speed of computation. In this chapter, we present a formulation of the ideas behind linear prediction, and discuss some of the issues that are involved in using it in practical speech applications.

6.1 Linear Prediction and the Speech Model

We come to the idea of linear prediction of speech by recalling the source/system model that was introduced in Chapter 2, where the sampled speech signal was modeled as the output of a linear, slowly time-varying system excited by either quasi-periodic impulses (during voiced speech) or random noise (during unvoiced speech). The particular form of the source/system model implied by linear predictive analysis is depicted in Figure 6.1, where the speech model is the part inside the dashed box. Over short time intervals, the linear system is
described by an all-pole system function of the form:

H(z) = S(z)/E(z) = G / (1 − Σ_{k=1}^{p} a_k z^{−k}).  (6.1)

Fig. 6.1 Model for linear predictive analysis of speech signals.

The major advantage of this model is that the gain parameter, G, and the filter coefficients {a_k} can be estimated in a very straightforward and computationally efficient manner by the method of linear predictive analysis.

In linear predictive analysis, the excitation is defined implicitly by the vocal tract system model, i.e., the excitation is whatever is needed to produce s[n] at the output of the system. For the system of Figure 6.1 with the vocal tract model of (6.1), the speech samples s[n] are related to the excitation e[n] by the difference equation

s[n] = Σ_{k=1}^{p} a_k s[n − k] + G e[n].  (6.2)

A linear predictor with prediction coefficients, α_k, is defined as a system whose output is

s̃[n] = Σ_{k=1}^{p} α_k s[n − k],  (6.3)

and the prediction error, defined as the amount by which s̃[n] fails to exactly predict sample s[n], is

d[n] = s[n] − s̃[n] = s[n] − Σ_{k=1}^{p} α_k s[n − k].  (6.4)
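The relationship between (6.2) and (6.4) can be checked numerically: if the predictor coefficients equal the model coefficients (α_k = a_k), the prediction error reduces to G e[n]. The model coefficients and excitation below are illustrative choices, not values from the text.

```python
# Numerical check of (6.2)-(6.4): with alpha_k = a_k, the prediction
# error d[n] equals G e[n]. Coefficients here are illustrative.
import numpy as np

a = np.array([1.2, -0.6])           # assumed model coefficients a_1, a_2
G = 0.5
rng = np.random.default_rng(0)
e = rng.standard_normal(200)        # white-noise excitation (unvoiced case)

s = np.zeros(len(e))
for n in range(len(e)):             # difference equation (6.2)
    for k, ak in enumerate(a, start=1):
        if n - k >= 0:
            s[n] += ak * s[n - k]
    s[n] += G * e[n]

d = np.copy(s)                      # prediction error (6.4) with alpha = a
for n in range(len(s)):
    for k, ak in enumerate(a, start=1):
        if n - k >= 0:
            d[n] -= ak * s[n - k]
```

Here d matches G e[n] to floating-point precision, which is the point made below: a good predictor whitens the speech down to the (scaled) excitation.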
From (6.4) it follows that the prediction error sequence is the output of an FIR linear system whose system function is

A(z) = 1 − Σ_{k=1}^{p} α_k z^{−k} = D(z)/S(z).  (6.5)

It can be seen by comparing (6.2) and (6.4) that if the speech signal obeys the model of (6.2) exactly, and if α_k = a_k, then d[n] = G e[n]. Thus, the prediction error filter, A(z), will be an inverse filter for the system, H(z), of (6.1), i.e.,

H(z) = G / A(z).  (6.6)

The basic problem of linear prediction analysis is to determine the set of predictor coefficients {α_k} directly from the speech signal in order to obtain a useful estimate of the time-varying vocal tract system through the use of (6.6). The basic approach is to find a set of predictor coefficients that will minimize the mean-squared prediction error over a short segment of the speech waveform. This process is repeated periodically at a rate appropriate to track the phonetic variation of speech (i.e., on the order of 50–100 times per second). The resulting parameters are then assumed to be the parameters of the system function H(z) in the model for production of the given segment of the speech waveform.

That this approach will lead to useful results may not be immediately obvious, but it can be justified in several ways. First, recall that if α_k = a_k, then d[n] = G e[n]. For voiced speech this means that d[n] would consist of a train of impulses, i.e., d[n] would be small except at isolated samples spaced by the current pitch period, P0. Thus, finding α_k's that minimize the mean-squared prediction error seems consistent with this observation. A second motivation for this approach follows from the fact that if a signal is generated by (6.2) with non-time-varying coefficients and excited either by a single impulse or by a stationary white noise input, then it can be shown that the predictor coefficients that result from minimizing the mean-squared prediction error (over all time) are identical to the coefficients of (6.2). A third pragmatic justification for using the minimum mean-squared prediction error as a
basis for estimating the model parameters is that this approach leads to an exceedingly useful and accurate representation of the speech signal that can be obtained by efficient solution of a set of linear equations.

The short-time average prediction error is defined as

E_n̂ = ⟨d_n̂²[m]⟩ = ⟨(s_n̂[m] − Σ_{k=1}^{p} α_k s_n̂[m − k])²⟩,  (6.7)

where s_n̂[m] is a segment of speech that has been selected in a neighborhood of the analysis time n̂, i.e.,

s_n̂[m] = s[m + n̂],  −M1 ≤ m ≤ M2.  (6.8)

That is, the time origin of the analysis segment is shifted to sample n̂ of the entire signal. The notation ⟨ ⟩ denotes averaging over a finite number of samples. The details of specific definitions of the averaging operation will be discussed below.

We can find the values of α_k that minimize E_n̂ in (6.7) by setting ∂E_n̂/∂α_i = 0, for i = 1, 2, ..., p, thereby obtaining the equations¹

Σ_{k=1}^{p} α̃_k ⟨s_n̂[m − i] s_n̂[m − k]⟩ = ⟨s_n̂[m − i] s_n̂[m]⟩,  1 ≤ i ≤ p,  (6.9)

where the α̃_k are the values of α_k that minimize E_n̂ in (6.7). (Since the α̃_k are unique, we will drop the tilde and use the notation α_k to denote the values that minimize E_n̂.) If we define

φ_n̂[i, k] = ⟨s_n̂[m − i] s_n̂[m − k]⟩,  (6.10)

then (6.9) can be written more compactly as²

Σ_{k=1}^{p} α_k φ_n̂[i, k] = φ_n̂[i, 0],  i = 1, 2, ..., p.  (6.11)

The details of the definition of the averaging operation used in (6.10) have a significant effect on the properties of the prediction coefficients that are obtained by solving (6.11).

¹ More precisely, setting the derivatives to zero in (6.7) only provides a stationary point, which can be shown to be a minimum of E_n̂ since E_n̂ is a convex function of the α_i's.
² The quantities φ_n̂[i, k] are in the form of a correlation function for the speech segment s_n̂[m].
In principle, linear prediction analysis is very straightforward. If we know φ_n̂[i, k] for 1 ≤ i ≤ p and 0 ≤ k ≤ p, this set of p equations in p unknowns, which can be represented by the matrix equation

Φα = ψ,  (6.12)

can be solved for the vector α = {α_k} of unknown predictor coefficients that minimize the average squared prediction error for the segment s_n̂[m].³ Using (6.11), the minimum mean-squared prediction error can be shown to be [3, 5]

E_n̂ = φ_n̂[0, 0] − Σ_{k=1}^{p} α_k φ_n̂[0, k].  (6.13)

Thus, the total minimum mean-squared error consists of a fixed component equal to the mean-squared value of the signal segment minus a term that depends on the predictor coefficients that satisfy (6.11); i.e., the optimum coefficients reduce E_n̂ in (6.13) the most. As we have stated, the details of the computation of φ_n̂[i, k] and the subsequent solution of the equations are somewhat intricate, and further discussion is required.

6.2 Computing the Prediction Coefficients

So far we have not been explicit about the meaning of the averaging notation used to define the mean-squared prediction error in (6.7), and φ_n̂[i, k] in (6.9) and (6.11). However, in a short-time analysis procedure, the averaging must be over a finite interval. We shall see below that two methods for linear predictive analysis emerge out of a consideration of the limits of summation and the definition of the waveform segment s_n̂[m].⁴ To solve for the optimum predictor coefficients, we must first compute the quantities φ_n̂[i, k] for 1 ≤ i ≤ p and 0 ≤ k ≤ p. Once this is done, we only have to solve (6.11) to obtain the α_k's. We shall also find it advantageous to drop the subscripts n̂ on E_n̂ and φ_n̂[i, k] when no confusion will result.

³ Although the α_k's are functions of n̂ (the time index at which they are estimated), this dependence will not be explicitly shown.
⁴ These two methods, applied to the same speech signal, yield slightly different optimum predictor coefficients.
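The normal equations (6.11)-(6.13) can be sketched concretely as follows. The correlation values here are computed with one arbitrary but valid choice of finite averaging (sums over a range where every index stays inside the segment); the segment itself is a random stand-in, not speech.

```python
# Sketch of solving the normal equations (6.11) in matrix form (6.12),
# then evaluating the minimum error via (6.13). Illustrative data only.
import numpy as np

p = 4
rng = np.random.default_rng(1)
seg = rng.standard_normal(120)      # stand-in for the segment s_n[m]

def phi(i, k):
    """phi[i, k] = sum_m seg[m - i] seg[m - k] over in-range m (cf. (6.10))."""
    m = np.arange(p, len(seg))
    return float(np.dot(seg[m - i], seg[m - k]))

Phi = np.array([[phi(i, k) for k in range(1, p + 1)] for i in range(1, p + 1)])
psi = np.array([phi(i, 0) for i in range(1, p + 1)])

alpha = np.linalg.solve(Phi, psi)   # (6.12): Phi alpha = psi
E = phi(0, 0) - alpha @ psi         # minimum error, (6.13)
```

The test below confirms the identity behind (6.13): the value φ[0,0] − Σ α_k φ[0,k] equals the directly computed sum of squared prediction errors at the optimum, and is nonnegative.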
6.2.1 The Covariance Method

One approach to computing the prediction coefficients is based on the definition

E_n̂ = Σ_{m=−M1}^{M2} (d_n̂[m])² = Σ_{m=−M1}^{M2} (s_n̂[m] − Σ_{k=1}^{p} α_k s_n̂[m − k])².  (6.14)

The quantities φ_n̂[i, k] needed in (6.11) inherit the same definition of the averaging operator, i.e.,

φ_n̂[i, k] = Σ_{m=−M1}^{M2} s_n̂[m − i] s_n̂[m − k],  1 ≤ i ≤ p, 0 ≤ k ≤ p.  (6.15)

Both (6.14) and (6.15) require values of s_n̂[m] = s[m + n̂] over the range −M1 − p ≤ m ≤ M2. By changes of index of summation, (6.15) can be expressed in the equivalent forms:

φ_n̂[i, k] = Σ_{m=−M1−i}^{M2−i} s_n̂[m] s_n̂[m + i − k]  (6.16a)
         = Σ_{m=−M1−k}^{M2−k} s_n̂[m] s_n̂[m + k − i],  (6.16b)

from which it follows that φ_n̂[i, k] = φ_n̂[k, i]. This method does not require any assumption about the signal outside the interval −M1 − p ≤ m ≤ M2, since the samples s_n̂[m] for −M1 − p ≤ m ≤ M2 are sufficient to evaluate φ_n̂[i, k] for all required values of i and k.

Figure 6.2 shows the sequences that are involved in computing the mean-squared prediction error as defined by (6.14). The top part of this figure shows a sampled speech signal s[m], and the box denotes a segment of that waveform selected around some time index n̂. The second plot shows that segment extracted as a finite-length sequence s_n̂[m] = s[m + n̂] for −M1 − p ≤ m ≤ M2. Note the p "extra" samples (light shading) at the beginning that are needed to start the prediction error filter at time −M1. The third plot shows the prediction error computed with the optimum predictor coefficients. With this method, the
prediction error sequence is not explicitly computed, since solution of (6.11) does not require it; rather, the prediction error is implicitly computed over the range −M1 ≤ m ≤ M2, as required in (6.14) and (6.15). The minimum mean-squared prediction error would be simply the sum of squares of all the samples shown.

Fig. 6.2 Illustration of windowing and prediction error for the covariance method.

The mathematical structure that defines the covariance method of linear predictive analysis implies a number of useful properties of the solution. It is worthwhile to summarize them as follows:

(C.1) The mean-squared prediction error satisfies E_n̂ ≥ 0; in most cases of linear prediction, it is theoretically possible for the average error to be exactly zero.
(C.2) The matrix Φ in (6.12) is a symmetric positive-semidefinite matrix.
(C.3) The roots of the prediction error filter A(z) in (6.5) are not guaranteed to lie within the unit circle of the z-plane. This implies that the vocal tract model filter (6.6) is not guaranteed to be stable.
(C.4) As a result of (C.2), the equations (6.11) can be solved efficiently using, for example, the well-known Cholesky decomposition of the covariance matrix Φ [3].

6.2.2 The Autocorrelation Method

Perhaps the most widely used method of linear predictive analysis is called the autocorrelation method, because the covariance function φ_n̂[i, k] needed in (6.11) reduces to the STACF φ_n̂[i − k] that we discussed in Chapter 4 [53, 54, 74, 78]. In the autocorrelation method, the analysis segment s_n̂[m] is defined as

s_n̂[m] = s[n̂ + m] w[m] for −M1 ≤ m ≤ M2, and 0 otherwise,  (6.17)

where the analysis window w[m] is used to taper the edges of the segment to zero. Since the analysis segment is defined by the windowing of (6.17) to be zero outside the interval −M1 ≤ m ≤ M2, it follows that the prediction error sequence d_n̂[m] can be nonzero only in the range −M1 ≤ m ≤ M2 + p. Therefore, E_n̂ is defined as

E_n̂ = Σ_{m=−M1}^{M2+p} (d_n̂[m])² = Σ_{m=−∞}^{∞} (d_n̂[m])².  (6.18)

The windowing of (6.17) allows us to use the infinite limits to signify that the sum is over all nonzero values of d_n̂[m]. Applying this notion to (6.16a) and (6.16b) leads to the conclusion that

φ_n̂[i, k] = Σ_{m=−∞}^{∞} s_n̂[m] s_n̂[m + i − k] = φ_n̂[i − k].  (6.19)

Thus, φ_n̂[i, k] is a function only of i − k. Therefore, we can replace φ_n̂[i, k] by φ_n̂[i − k], which is the STACF defined in Chapter 4 as

φ_n̂[k] = Σ_{m=−∞}^{∞} s_n̂[m] s_n̂[m + k] = φ_n̂[−k].  (6.20)

The resulting set of equations for the optimum predictor coefficients is therefore

Σ_{k=1}^{p} α_k φ_n̂[i − k] = φ_n̂[i],  i = 1, 2, ..., p.  (6.21)
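The autocorrelation method of (6.17)-(6.21) can be sketched end to end as below. A Hamming window is assumed (as in the figure that follows), the test signal is a synthetic second-order autoregressive process with illustrative coefficients, and the Toeplitz system is solved directly here; the recursive solution is the subject of Section 6.3.

```python
# Sketch of the autocorrelation method, (6.17)-(6.21): window the segment,
# form the short-time autocorrelation, solve the Toeplitz normal equations.
import numpy as np

def autocorr_lpc(segment, p):
    s = segment * np.hamming(len(segment))        # windowing of (6.17)
    phi = np.array([np.dot(s[:len(s) - k], s[k:]) for k in range(p + 1)])
    Phi = np.array([[phi[abs(i - k)] for k in range(p)] for i in range(p)])
    alpha = np.linalg.solve(Phi, phi[1:])         # normal equations (6.21)
    E = phi[0] - alpha @ phi[1:]                  # minimum error, cf. (6.13)
    return alpha, E

# toy AR(2) signal; the coefficients 1.2, -0.6 are illustrative
rng = np.random.default_rng(2)
x = np.zeros(400)
for n in range(2, 400):
    x[n] = 1.2 * x[n - 1] - 0.6 * x[n - 2] + rng.standard_normal()

alpha, E = autocorr_lpc(x[100:300], p=2)
```

The assertions below check the two structural properties derived next in the text: E > 0 (property (A.1)) and all zeros of A(z) strictly inside the unit circle (property (A.3)), plus a loose check that the known generating coefficients are approximately recovered.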
Figure 6.3 shows the sequences that are involved in computing the optimum prediction coefficients using the autocorrelation method. The upper plot shows the same sampled speech signal s[m] as in Figure 6.2, with a Hamming window centered at time index n̂. The middle plot shows the result of multiplying the signal s[n̂ + m] by the window w[m] and redefining the time origin to obtain s_n̂[m]. Note that the zero-valued samples outside the window are shown with light shading. The third plot shows the prediction error computed using the optimum coefficients. Note that for this segment, the prediction error (which is implicit in the solution of (6.21)) is nonzero over the range −M1 ≤ m ≤ M2 + p.

Fig. 6.3 Illustration of windowing and prediction error for the autocorrelation method.

Also note the lightly shaded p samples at the beginning. These samples can be large due to the fact that the predictor must predict these samples from the zero-valued samples that precede the windowed segment s_n̂[m]. It is easy to see that at least one of these first p samples of the prediction error must be nonzero. Similarly, the last p samples of the prediction error can be large due to the fact that the predictor must predict zero-valued samples from windowed speech samples. It can also be seen that at least one of these last p samples of the prediction error must be nonzero. For this reason, it follows that E_n̂, being the sum of
12) is a symmetric positivedeﬁnite Toeplitz matrix. With ˆ this method.2) The matrix Φ in (6. Equation (6.3 The Levinson–Durbin Recursion As stated in (A. including the following: (A. (A. the matrix Φ in (6.4) As a result of (A. which. (A. which means that all the elements on a given diagonal in the matrix are equal.22) = ··· ··· · · · · · · · · · ··· αp φ[p] φ[p − 1] φ[p − 2] · · · φ[0] 5 For this reason. it is theoretically impossible for the error to be exactly zero because there will always be at least one sample at the beginning and one at the end of the prediction error sequence that will be nonzero. .3) The roots of the prediction error ﬁlter A(z) in (6. a tapering window is generally used in the autocorrelation method.82 Linear Predictive Analysis squares of the prediction error samples.1) The meansquared prediction error satisﬁes En > 0. must always be strictly greater than zero.5 As in the case of the covariance method.6) is guaranteed to be stable.5) are guaranteed to lie within the unit circle of the zplane so that the vocal tract model ﬁlter of (6. because of its many implications.22) shows the detailed structure of the matrix equation Φα = ψ for the autocorrelation method. we discuss in more detail in Section 6. the mathematical structure of the autocorrelation method implies a number of properties of the solution.3.12) is a symmetric positivedeﬁnite Toeplitz matrix [46].2) above.2).11) can be solved eﬃciently using the Levinson–Durbin algorithm. the equations (6. (A. 6. φ[1] φ[0] φ[1] · · · φ[p − 1] α1 φ[0] · · · φ[p − 2] α2 φ[2] φ[1] (6.
Note that the vector ψ is composed of almost the same autocorrelation values as comprise Φ. Because of the special structure of (6.22), it is possible to derive a recursive algorithm for inverting the matrix Φ. That algorithm, known as the Levinson–Durbin algorithm, is specified by the following steps:

Levinson–Durbin Algorithm

E^(0) = φ[0]                                                (D.1)
for i = 1, 2, ..., p
    k_i = (φ[i] − Σ_{j=1}^{i−1} α_j^(i−1) φ[i − j]) / E^(i−1)   (D.2)
    α_i^(i) = k_i                                           (D.3)
    if i > 1 then for j = 1, 2, ..., i − 1
        α_j^(i) = α_j^(i−1) − k_i α_{i−j}^(i−1)             (D.4)
    end
    E^(i) = (1 − k_i²) E^(i−1)                              (D.5)
end
α_j = α_j^(p),  j = 1, 2, ..., p                            (D.6)

An important feature of the Levinson–Durbin algorithm is that it determines by recursion the optimum ith-order predictor from the optimum (i − 1)th-order predictor, and as part of the process, all predictors from order 0 (no prediction) to order p are computed along with the corresponding mean-squared prediction errors E^(i). Specifically, the equations labeled (D.3) and (D.4) can be used to show that the prediction error system function satisfies

A^(i)(z) = A^(i−1)(z) − k_i z^{−i} A^(i−1)(z^{−1}).  (6.23)

Defining the ith-order forward prediction error e^(i)[n] as the output of the prediction error filter with system function A^(i)(z), and b^(i)[n] as the output of the ith-order backward prediction error filter B^(i)(z) = z^{−i} A^(i)(z^{−1}), (6.23) leads (after some manipulations) to an interpretation of the Levinson–Durbin algorithm in terms of a lattice filter structure as in Figure 6.4(a).
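The steps (D.1)-(D.6) can be transcribed almost directly into code. The autocorrelation values below are illustrative placeholders; the test verifies the recursion against a direct solution of (6.22) and checks that |k_i| < 1 and E^(p) > 0, as derived later in this section.

```python
# A direct transcription of the Levinson-Durbin steps (D.1)-(D.6).
import numpy as np

def levinson_durbin(phi, p):
    """Solve the Toeplitz normal equations (6.22) given phi[0..p]."""
    alpha = np.zeros(p + 1)       # alpha[j] holds alpha_j of the current order
    k = np.zeros(p + 1)
    E = phi[0]                                      # (D.1)
    for i in range(1, p + 1):
        acc = phi[i] - np.dot(alpha[1:i], phi[i - 1:0:-1])
        k[i] = acc / E                              # (D.2)
        prev = alpha.copy()
        alpha[i] = k[i]                             # (D.3)
        for j in range(1, i):                       # (D.4)
            alpha[j] = prev[j] - k[i] * prev[i - j]
        E = (1.0 - k[i] ** 2) * E                   # (D.5)
    return alpha[1:], k[1:], E                      # (D.6)

phi = np.array([2.0, 1.2, 0.6, 0.2])   # illustrative STACF values phi[0..3]
alpha, k, E = levinson_durbin(phi, p=3)
```

Note the O(p²) cost of the recursion, versus O(p³) for a general linear-system solve; this, together with the by-product k_i and E^(i) sequences, is why the algorithm is preferred in practice.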
Fig. 6.4 Lattice structures derived from the Levinson–Durbin recursion: (a) prediction error filter A(z); (b) vocal tract filter H(z) = 1/A(z).

Note that the parameters k_i for i = 1, 2, ..., p play a key role in the Levinson–Durbin recursion and also in the lattice filter interpretation. Lattice structures like this can be derived from acoustic principles applied to a physical model composed of concatenated lossless tubes [101]. If the input and output signals in such physical models are sampled at just the right sampling rate, the sampled signals are related by a transfer function identical to (6.6). In this case, the coefficients k_i behave as reflection coefficients at the tube boundaries [3, 78, 101].

Also, by solving two equations in two unknowns recursively, it is possible to start at the output of Figure 6.4(a) and work to the left, eventually computing s[n] in terms of e[n]. The lattice structure corresponding to this is shown in Figure 6.4(b).

Itakura and Saito [53, 54] showed that the parameters k_i in the Levinson–Durbin recursion, and the lattice filter interpretation obtained from it, also could be derived by looking at linear predictive analysis from a statistical perspective. They called the k_i parameters PARCOR (for partial correlation) coefficients [54], because they can be computed directly as a ratio of cross-correlation values between the forward and
backward prediction errors at the output of the (i − 1)th stage of prediction in Figure 6.4(a), i.e.,

k_i = Σ_{m=−∞}^{∞} e^(i−1)[m] b^(i−1)[m − 1] / ( Σ_{m=−∞}^{∞} (e^(i−1)[m])² Σ_{m=−∞}^{∞} (b^(i−1)[m − 1])² )^{1/2}.  (6.24)

In the PARCOR interpretation, each stage of Figure 6.4(a) removes part of the correlation in the input signal. The PARCOR coefficients computed using (6.24) are identical to the k_i obtained as a result of the Levinson–Durbin algorithm. Equation (D.2) in the Levinson–Durbin algorithm can be replaced by (6.24), and the result is an algorithm for transforming the PARCOR representation into the linear predictor coefficient representation.

The Levinson–Durbin formulation provides one more piece of useful information about the PARCOR coefficients. Specifically, since E^(i) = (1 − k_i²)E^(i−1) is strictly greater than zero for predictors of all orders, it follows from Equation (D.5) of the algorithm that it must be true that −1 < k_i < 1 for all i. It can be shown that this condition on the PARCORs also guarantees that all the zeros of a prediction error filter A^(i)(z) of any order must be strictly inside the unit circle of the z-plane [74, 78].

6.4 LPC Spectrum

The frequency-domain interpretation of linear predictive analysis provides an informative link to our earlier discussions of the STFT and cepstrum analysis. The autocorrelation method is based on the short-time autocorrelation function, φ_n̂[m], which is the inverse discrete Fourier transform of the magnitude-squared of the STFT, |S_n̂(e^{jω})|², of the windowed speech signal s_n̂[m] = s[n̂ + m]w[m]. The values φ_n̂[m] for m = 0, 1, ..., p are used to compute the prediction coefficients and gain, which in turn define the vocal tract system function H(z) in (6.6). Therefore, the magnitude-squared of the frequency response of this system, obtained by evaluating H(z) on the unit circle at angles 2πf/f_s,
is of the form:

|H(e^{j2πf/f_s})|² = G² / |1 − Σ_{k=1}^{p} α_k e^{−j2πf k/f_s}|²,  (6.25)

and can be thought of as an alternative short-time spectral representation. Figure 6.5 shows a comparison between short-time Fourier analysis and linear predictive spectrum analysis for segments of voiced and unvoiced speech, where the sampling frequency is f_s = 16 kHz. Figures 6.5(a) and 6.5(c) show the STACF, with the first 23 values plotted with a heavy line. These values are used to compute the predictor coefficients and gain for an LPC model with p = 22. The frequency responses of the corresponding vocal tract system models are computed using (6.25). These frequency responses are superimposed on the corresponding STFTs (shown in gray).

Fig. 6.5 Comparison of short-time Fourier analysis with linear predictive analysis: (a) voiced autocorrelation, (b) voiced LPC spectrum, (c) unvoiced autocorrelation, (d) unvoiced LPC spectrum.

As we have observed before, the rapid variations with frequency in the STFT are due primarily to the excitation, while the overall shape is assumed to be determined by the effects of glottal pulse, vocal tract transfer function, and radiation.
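Evaluating (6.25) on a grid of frequencies is straightforward once G and the α_k are in hand. The gain and coefficients below are illustrative placeholders standing in for the output of a linear predictive analysis.

```python
# Sketch of evaluating the LPC spectrum (6.25) on a grid of frequencies.
# The gain and coefficients are illustrative, not from the text.
import numpy as np

def lpc_spectrum(G, alpha, fs, nfreq=512):
    f = np.arange(nfreq) * (fs / 2.0) / nfreq        # grid over 0 .. fs/2
    k = np.arange(1, len(alpha) + 1)
    # denominator 1 - sum_k alpha_k exp(-j 2 pi f k / fs) at each frequency
    A = 1.0 - np.exp(-2j * np.pi * np.outer(f, k) / fs) @ alpha
    return f, (G ** 2) / np.abs(A) ** 2

f, H2 = lpc_spectrum(G=0.5, alpha=np.array([1.2, -0.6]), fs=8000)
```

Plotting 10 log10(H2) against f reproduces the kind of smooth envelope overlaid on the STFT in Figure 6.5.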
In cepstrum analysis, the excitation effects are removed by lowpass liftering the cepstrum. In linear predictive spectrum analysis, the excitation effects are removed by focusing on the low-time autocorrelation coefficients. Figure 6.5 shows that a linear prediction model with p = 22 matches the general shape of the short-time spectrum, but does not represent all its local peaks and valleys, and this is exactly what is desired. The amount of smoothing of the spectrum is controlled by the choice of p. The question naturally arises as to how p should be chosen, and Figure 6.6 offers a suggestion. In Figure 6.6(a) are shown the STFT (in gray) and the frequency responses of a 12th-order model (heavy dark line) and a 40th-order model (thin dark line).

Fig. 6.6 Effect of predictor order on: (a) estimating the spectral envelope of a 4 kHz bandwidth signal, and (b) the normalized mean-squared prediction error for the same signal.

Evidently, the linear predictive spectra tend to favor the peaks of the short-time Fourier transform. That this is true in general can be argued using the Parseval theorem of Fourier analysis [74]. This is in contrast to homomorphic
smoothing of the STFT, which tends toward an average of the peaks and valleys of the STFT. The choice p = 12 gives a good match to the general shape of the STFT, highlighting the formant structure imposed by the vocal tract filter while ignoring the periodic pitch structure. Observe that p = 40 gives a much different result: in this case, the vocal tract model is highly influenced by the pitch harmonics in the short-time spectrum. It can be shown that if p is increased to the length of the windowed speech segment, then |H(e^{j2πf/f_s})|² → |S_n̂(e^{j2πf/f_s})|², but as pointed out in (A.1) above, E^(2M) does not go to zero [75].

Figure 6.6(b) shows the normalized mean-squared prediction error V(p) = E^(p)/φ[0] as a function of p for the segment of speech used to produce Figure 6.6(a). Note the sharp decrease from p = 0 (no prediction implies V(0) = 1) to p = 1 and the less abrupt decrease thereafter. Furthermore, notice that the mean-squared error curve flattens out above about p = 12 and then decreases modestly thereafter.

In a particular application, the prediction order is generally fixed at a value that captures the general spectral shape due to the glottal pulse, vocal tract resonances, and radiation. From the acoustic theory of speech production, it follows that the glottal pulse spectrum is lowpass, the radiation filtering is highpass, and the vocal tract imposes a resonance structure that, for adult speakers, is comprised of about one resonance per kilohertz of frequency [101]. For the sampled speech signal, the combination of the lowpass glottal pulse spectrum and the highpass filtering of radiation is usually adequately represented by one or two additional complex pole pairs. When coupled with an estimate of one resonance per kilohertz, this leads to a rule of thumb of p = 4 + f_s/1000. For example, for a sampling rate of f_s = 8000 Hz, it is common to use a predictor order of 10–12. As an example, see Figure 5.8, which shows a comparison between the short-time Fourier spectrum, the LPC spectrum with p = 12, a homomorphically smoothed spectrum using 13 cepstrum values, and a mel-frequency spectrum. Note that in Figure 6.5, where the sampling rate was f_s = 16000 Hz, a predictor order of p = 22 gave a good representation of the overall shape and resonance structure of the speech segment over the band from 0 to 8000 Hz.
(6. but is less problematic for male voices where the spacing between harmonics is generally much smaller.7 shows an example of the roots (marked with ×) of a 12thorder predictor.6. and the wide spacing between harmonics causes the peaks of the linear predictive vocal tract model to be biased toward those harmonics. because we often wish to quantize the model parameters for eﬃciency in storage or transmission of coded speech. particularly when they are used in speech coding. 6.26) According to (6.5 Equivalent Representations The basic parameters obtained by linear predictive analysis are the gain G and the prediction coeﬃcients {αk }. we give only a brief summary of the most important equivalent representations. a variety of diﬀerent equivalent representations can be obtained. The remaining four roots lie well within the unit circle. if the model order is chosen judiciously as discussed in the previous section. Figure 6. Note that eight (four complex conjugate pairs) of the roots are close to the unit circle. Therefore. the zeros of A(z) are the poles of H(z). the speaker was a highpitched female. These diﬀerent representations are important. These are the poles of H(z) that model the formant resonances. which means that they only provide for the overall spectral shaping resulting from the glottal and radiation spectral shaping. . then it can be expected that roughly fs /1000 of the roots will be close in frequency (angle in the zplane) to the formant frequencies.5) shows that the system function of the prediction error ﬁlter is a polynomial in z −1 and therefore it can be represented in terms of its zeros as p p A(z) = 1 − k=1 αk z −k = k=1 (1 − zk z −1 ).5. In this section. In that example.6).5 Equivalent Representations 89 The example of Figure 6. 6.6 illustrates an important point about linear predictive analysis. From these. This eﬀect is a serious limitation for highpitched voices.1 Roots of Prediction Error System Function Equation (6.
Fig. 6.7 Poles of H(z) (zeros of A(z)) marked with ×, and LSP roots marked with ∗ and o.

Since the pole locations are crucial to accurate representation of the spectrum, it is important to maintain high accuracy in the location of the zeros of A(z). The prediction coefficients are perhaps the most susceptible to quantization errors of all the equivalent representations: it is well known that the roots of a polynomial are highly sensitive to errors in its coefficients, all the roots being a function of all the coefficients [89]. One possibility, not often invoked, would be to factor the polynomial as in (6.26) and then quantize each root (magnitude and angle) individually.

6.5.2 LSP Coefficients

A much more desirable alternative to quantization of the roots of A(z) was introduced by Itakura [52], who defined the line spectrum pair (LSP) polynomials⁶

P(z) = A(z) + z^{−(p+1)} A(z^{−1}),  (6.27a)
Q(z) = A(z) − z^{−(p+1)} A(z^{−1}).  (6.27b)

⁶ Note the similarity to (6.23), from which it follows that P(z) and Q(z) are system functions of lattice filters obtained by extending Figure 6.4(a) with an additional section with k_{p+1} = ∓1, respectively.
This transformation of the linear prediction parameters is invertible: to recover A(z) from P(z) and Q(z), simply add the two equations to obtain A(z) = (P(z) + Q(z))/2.

This new representation has some very interesting and useful properties, which are confirmed by Figure 6.7 (showing the roots of A(z), P(z), and Q(z) for a 12th-order predictor) and summarized below [52, 119]:

(LSP.1) All the roots of P(z) and Q(z) are on the unit circle, i.e., they can be represented as

P(z) = Π_{k=1}^{p+1} (1 − e^{jΩ_k} z^{−1}),  (6.28a)
Q(z) = Π_{k=1}^{p+1} (1 − e^{jΘ_k} z^{−1}).  (6.28b)

The normalized discrete-time frequencies (angles in the z-plane), Ω_k and Θ_k, are called the line spectrum frequencies, or LSFs.

(LSP.2) If p is an even integer, then P(−1) = 0 and Q(1) = 0.

(LSP.3) The PARCOR coefficients corresponding to A(z) satisfy |k_i| < 1 if and only if the roots of P(z) and Q(z) alternate on the unit circle, i.e., the LSFs are interlaced over the range of angles 0 to π.

(LSP.4) The LSFs are close together when the roots of A(z) are close to the unit circle.

Knowing the p LSFs that lie in the range 0 < ω < π is sufficient to completely define the LSP polynomials and therefore A(z). These properties make it possible to represent the linear predictor by quantized differences between the successive LSFs [119]. If property (LSP.3) is maintained through the quantization, the reconstructed polynomial A(z) will still have its zeros inside the unit circle.
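The construction (6.27a)-(6.27b) and the properties above can be verified numerically for a small stable predictor; the coefficients below are illustrative.

```python
# Sketch of the LSP construction (6.27a)-(6.27b), with numerical checks
# of properties (LSP.1)-(LSP.3). The predictor is illustrative and stable.
import numpy as np

alpha = np.array([1.2, -0.6])                 # stable 2nd-order predictor
a = np.concatenate(([1.0], -alpha))           # coefficients of A(z), p = 2
a_rev = a[::-1]                               # coefficients of z^{-(p+1)} A(z^{-1})
P = np.concatenate((a, [0.0])) + np.concatenate(([0.0], a_rev))  # (6.27a)
Q = np.concatenate((a, [0.0])) - np.concatenate(([0.0], a_rev))  # (6.27b)

rP, rQ = np.roots(P), np.roots(Q)
lsf = np.sort(np.concatenate((np.angle(rP), np.angle(rQ))))
lsf = lsf[lsf > 0]                            # LSFs in (0, pi]
```

The checks below confirm that all LSP roots sit on the unit circle, that P(−1) = 0 and Q(1) = 0 for even p, that the positive-angle LSFs are distinct and ordered (interlacing), and that averaging P and Q recovers A(z).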
6.5.3 Cepstrum of Vocal Tract Impulse Response

One of the most useful alternative representations of the linear predictor is the cepstrum of the impulse response, h[n], of the vocal tract filter. The impulse response can be computed recursively through the difference equation

h[n] = Σ_{k=1}^{p} αk h[n-k] + Gδ[n],    (6.29)

or a closed-form expression for the impulse response can be obtained by making a partial fraction expansion of the system function H(z), which would require that the poles of H(z) be found by polynomial rooting. As discussed in Section 5.3, since H(z) is a minimum-phase system (all its poles inside the unit circle), it follows that the complex cepstrum of h[n] can be computed using (5.14). However, because of the minimum-phase condition, a more direct recursive computation of ĥ[n] is possible [101]. Equation (5.16) gives the recursion for computing ĥ[n] from the predictor coefficients and gain, and (5.17) gives the inverse recursion for computing the predictor coefficients from the complex cepstrum of the vocal tract impulse response. In this way, any desired number of cepstrum values can be computed. These relationships are particularly useful in speech recognition applications, where a small number of cepstrum values is used as a feature vector.

6.5.4 PARCOR Coefficients

We have seen that the PARCOR coefficients are bounded by ±1, which makes them an attractive parameter for quantization. We have also mentioned that if we have a set of PARCOR coefficients, we can simply use them at step (D.2) of the Levinson-Durbin algorithm, thereby obtaining an algorithm for converting PARCORs to predictor coefficients. Conversely, by working backward through the Levinson-Durbin algorithm, we can compute the PARCOR coefficients from a given set of predictor coefficients. The resulting algorithm is given below.
Predictor-to-PARCOR Algorithm

(P.1)  αj^(p) = αj,  j = 1, 2, ..., p;  kp = αp^(p)
(P.2)  for i = p, p-1, ..., 2
           for j = 1, 2, ..., i-1
               αj^(i-1) = (αj^(i) + ki α_{i-j}^(i)) / (1 - ki²)
           end
(P.3)      k_{i-1} = α_{i-1}^(i-1)
       end

6.5.5 Log Area Coefficients

As mentioned above, the lattice filter representation of the vocal tract system is strongly related to concatenated acoustic tube models of the physics of sound propagation in the vocal tract. Such models are characterized by a set of tube cross-sectional areas denoted Ai. These "area function" parameters are useful alternative representations of the vocal tract model obtained from linear predictive analysis. Specifically, the log area ratio parameters are defined as

gi = log(A_{i+1}/Ai) = log((1 - ki)/(1 + ki)),    (6.30)

where Ai and A_{i+1} are the areas of two successive tubes, and the PARCOR coefficient -ki is the reflection coefficient for sound waves impinging on the junction between the two tubes [3, 78, 101]. The inverse transformation (from gi to ki) is

ki = (1 - e^{gi}) / (1 + e^{gi}).    (6.31)

The PARCOR coefficients can be converted to predictor coefficients, if required, using the technique discussed in Section 6.5.4. Viswanathan and Makhoul [131] showed that the frequency response of a vocal tract filter represented by quantized log area ratio coefficients is relatively insensitive to quantization errors.
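The conversions of Sections 6.5.3-6.5.5 can be collected in one short sketch. The cepstrum recursions below are the standard minimum-phase recursions that (5.16) and (5.17) express (up to indexing conventions), assuming H(z) = G/A(z) with A(z) = 1 - Σk αk z^{-k}; the numerical test values are our own:

```python
import numpy as np

def lpc_to_cepstrum(alpha, G, n_ceps):
    """Recursion for the complex cepstrum h_hat[n] of the minimum-phase
    filter H(z) = G / A(z) (cf. (5.16), up to conventions)."""
    p = len(alpha)
    c = np.zeros(n_ceps)
    c[0] = np.log(G)
    for n in range(1, n_ceps):
        acc = alpha[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * alpha[n - k - 1]
        c[n] = acc
    return c

def cepstrum_to_lpc(c, p):
    """Inverse recursion (cf. (5.17)): predictor coefficients from the
    first p+1 cepstrum values."""
    alpha = np.zeros(p)
    for n in range(1, p + 1):
        acc = c[n]
        for k in range(1, n):
            acc -= (k / n) * c[k] * alpha[n - k - 1]
        alpha[n - 1] = acc
    return alpha

def predictor_to_parcor(alpha):
    """The Predictor-to-PARCOR algorithm above: step down from order p
    to order 1, collecting k_p, ..., k_1."""
    a = list(map(float, alpha))
    k = [0.0] * len(a)
    for i in range(len(a), 0, -1):
        k[i - 1] = a[i - 1]
        a = [(a[j] + k[i - 1] * a[i - 2 - j]) / (1 - k[i - 1] ** 2)
             for j in range(i - 1)]
    return np.array(k)

def log_area_ratios(k):                      # Eq. (6.30)
    return np.log((1 - np.asarray(k)) / (1 + np.asarray(k)))

def lar_to_parcor(g):                        # Eq. (6.31)
    return (1 - np.exp(g)) / (1 + np.exp(g))

# second-order example with zeros of A(z) at 0.9 e^{+-j pi/4}
alpha = [2 * 0.9 * np.cos(np.pi / 4), -0.81]
k = predictor_to_parcor(alpha)               # |k_i| < 1 for a stable A(z)
c = lpc_to_cepstrum(alpha, 1.0, 12)          # e.g. a 12-term feature vector
print(k, np.allclose(cepstrum_to_lpc(c, 2), alpha),
      np.allclose(lar_to_parcor(log_area_ratios(k)), k))
```

The round trips (cepstrum back to predictor, and LAR back to PARCOR) confirm that all of these are equivalent representations of the same predictor.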
6.6 The Role of Linear Prediction

In this chapter, we have discussed many of the fundamentals of linear predictive analysis of speech. The value of linear predictive analysis stems from its ability to represent, in a compact form, the part of the speech model associated with the vocal tract, which in turn is closely related to the phonemic representation of the speech signal. Over 40 years of research on linear predictive methods have yielded a wealth of knowledge that has been widely and effectively applied in almost every area of speech processing, but especially in speech coding, speech synthesis, and speech recognition. The present chapter and Chapters 3-5 provide the basis for the remaining chapters of this text, which focus on these particular application areas.
7 Digital Speech Coding

This chapter, the first of three on digital speech processing applications, focuses on specific techniques that are used in digital speech coding. We begin by describing the basic operation of sampling a speech signal and directly quantizing and encoding the samples. The remainder of the chapter discusses a wide variety of techniques that represent the speech signal in terms of parametric models of speech production and perception.

7.1 Sampling and Quantization of Speech (PCM)

In any application of digital signal processing (DSP), the first step is sampling of the analog signal and quantization of the resulting samples into digital form. These operations, which comprise the process of A-to-D conversion, are depicted for convenience in analysis and discussion in the block diagram of Figure 7.1.^1

^1 Current A-to-D and D-to-A converters use oversampling techniques to implement the functions depicted in Figure 7.1.
The well-known Shannon sampling theorem states that a bandlimited signal can be reconstructed exactly from samples taken at a rate of at least twice the highest frequency in the input signal spectrum (typically between 7 and 20 kHz for speech and audio signals) [89]. Often, we use a lowpass filter to remove spectral information above some frequency of interest (e.g., 4 kHz) and then use a sampling rate such as 8000 samples/s to avoid aliasing distortion [89]. The sampler produces a sequence of numbers x[n] = xc(nT), where T is the sampling period and fs = 1/T is the sampling frequency. The samples x[n] are real numbers that are useful for theoretical analysis, but never available for computations. The quantizer simply takes the real-number inputs x[n] and assigns an output x̂[n] according to the nonlinear discrete-output mapping Q{ }.^2 Figure 7.2 depicts a typical quantizer definition. In the example of Figure 7.2, the output samples are mapped to one of eight possible values, with samples within the peak-to-peak range being rounded and samples outside the range being "clipped" to either the maximum positive or negative level. For samples within range, the quantization error is defined as

e[n] = x̂[n] - x[n].    (7.1)

Fig. 7.1 The operations of sampling and quantization. (A-to-D converter: xc(t) → Sampler (T) → x[n] → Quantizer Q{ } (Δ) → x̂[n] → Encoder → c[n]. D-to-A converter: c′[n] → Decoder (Δ) → x̂′[n] → Reconstruction filter → x̂c′(t).)

^2 Note that in this chapter, the notation x̂[n] denotes the quantized version of x[n], not the complex cepstrum of x[n] as in Chapter 5.
For samples within range, the quantization error satisfies the condition

-Δ/2 < e[n] ≤ Δ/2,    (7.2)

where Δ is the quantizer step size. Therefore, if the peak-to-peak range is 2Xm, the step size of a B-bit quantizer will be Δ = 2Xm/2^B. Because the quantization levels are uniformly spaced by Δ, as in Figure 7.2, such quantizers are called uniform quantizers. A B-bit quantizer such as the one shown in Figure 7.2 has 2^B levels (1 bit usually signals the sign). The block marked "encoder" in Figure 7.1 represents the assigning of a binary code word to each quantization level. These code words represent the quantized signal amplitudes, and traditionally they are chosen to correspond to some convenient binary number system, such that arithmetic can be done on the code words as if they were proportional to the signal samples. Representation of a speech signal by binary-coded quantized samples is called pulse-code modulation (or just PCM) because binary numbers can be represented for transmission as on/off pulse amplitude modulation.

Fig. 7.2 8-level mid-tread quantizer.
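A mid-tread uniform quantizer like the one in Figure 7.2 can be sketched in a few lines (B = 3 gives the 8-level example; the integer code-word assignment shown is one convenient choice, not necessarily that of the figure):

```python
import numpy as np

def uniform_quantize(x, B, Xm):
    """B-bit mid-tread uniform quantizer with peak-to-peak range 2*Xm.
    In-range samples are rounded to the nearest multiple of the step
    size Delta = 2*Xm/2**B; out-of-range samples are clipped to the
    extreme levels. Returns (x_hat, integer code words)."""
    delta = 2 * Xm / 2 ** B
    c = np.clip(np.round(x / delta), -2 ** (B - 1), 2 ** (B - 1) - 1)
    return c * delta, c.astype(int)

# 0.1 is rounded to a nearby level; 0.9 and -1.5 hit the extreme levels
xhat, c = uniform_quantize(np.array([0.1, 0.9, -1.5]), B=3, Xm=1.0)
print(xhat, c)        # step size is 0.25 for B = 3, Xm = 1
```

For in-range samples the error obeys (7.2): it never exceeds Δ/2 in magnitude, regardless of the sample's size, which is exactly the dilemma discussed next.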
Note that the combination of the decoder and lowpass reconstruction filter represents the operation of D-to-A conversion [89].^4 Generally, we do not need to be concerned with the fine distinction between the quantized samples and the coded samples when we have a fixed step size, but this is not the case if the step size is adapted from sample to sample. In cases where accurate amplitude calibration is desired, the step size could be applied as depicted in the block labeled "Decoder" at the bottom of Figure 7.1.^3

The operation of quantization confronts us with a dilemma. Equation (7.2) states that the size of the maximum quantization error grows as the peak-to-peak range increases, since, for a uniform quantizer, the step size Δ = 2Xm/2^B is proportional to the peak-to-peak range. Most signals, speech especially, have a wide dynamic range; i.e., their amplitudes vary greatly between voiced and unvoiced sounds and between speakers. This means that we need a large peak-to-peak range so as to avoid clipping of the loudest sounds. On the other hand, for a given number of levels (bits), the maximum size of the quantization error is the same whether the signal sample is large or small. This is the fundamental problem of quantization: for a given peak-to-peak range, we can decrease the quantization error of a quantizer such as the one in Figure 7.2 only by adding more levels (bits).

The data rate (measured in bits/second) of sampled and quantized speech signals is I = B · fs. The standard values for sampling and quantizing sound signals (speech, singing, instrumental music) are B = 16 and fs = 44.1 or 48 kHz.^5 This leads to a bit rate of I = 16 · 44100 = 705,600 bits/s. This value is more than adequate, and much more than desired, for most speech communication applications. The bit rate can be lowered by using fewer bits/sample and/or using a lower sampling rate; however, both of these simple approaches degrade the perceived quality of the speech signal. This chapter deals with a wide range of techniques for significantly reducing the bit rate while maintaining an adequate level of speech quality.

^3 In a real A-to-D converter, the sampler, quantizer, and coder are all integrated into one system, and only the binary code words c[n] are available.
^4 The ′ denotes the possibility of errors in the code words. Symbol errors would cause additional error in the reconstructed samples.
^5 Even higher rates (24 bits and 96 kHz sampling rate) are used for high-quality audio.
7.1.1 Uniform Quantization Noise Analysis

A more quantitative description of the effect of quantization can be obtained using random signal analysis applied to the quantization error. If the number of bits in the quantizer is reasonably high and no clipping occurs, the quantization error sequence e[n], although it is completely determined by the signal amplitudes, nevertheless behaves as if it is a random signal with the following properties [12]:

(Q.1) The noise samples appear to be^6 uncorrelated with the signal samples.

(Q.2) Under certain assumptions, the noise samples appear to be uncorrelated from sample to sample; i.e., e[n] acts like a white noise sequence.

(Q.3) The amplitudes of the noise samples are uniformly distributed across the range -Δ/2 < e[n] ≤ Δ/2, resulting in average power σe² = Δ²/12.

These simplifying assumptions allow a linear analysis that yields accurate results if the signal is not too coarsely quantized. With these assumptions, it is possible to derive the following formula for the signal-to-quantizing-noise ratio (in dB) of a B-bit uniform quantizer:

^6 By "appear to be" we mean that measured correlations are small. Bennett has shown that the correlations are only small because the error is small; the results can easily be shown to hold in the simple case of a uniformly distributed memoryless input, and Bennett showed how they can be extended under suitable conditions, notably smooth input probability density functions and high-rate quantizers [12]. The condition "uncorrelated" often implies independence, but in this case the error is a deterministic function of the input and hence it cannot be independent of the input. Furthermore, it can be shown that if the output levels of the quantizer are optimized, the quantizer error will be uncorrelated with the quantizer output (however, not with the quantizer input, as is commonly stated), and the correlations are equal to the negative of the error variance [12].
SNRQ = 10 log10(σx²/σe²) = 6.02B + 4.78 - 20 log10(Xm/σx),    (7.3)

where σx and σe are the rms values of the input signal and the quantization noise samples, respectively. Note that Xm is a fixed parameter of the quantizer, while σx depends on the input signal level. The formula of Equation (7.3) is an increasingly good approximation as the bit rate (or the number of quantization levels) gets large; it can be way off, however, if the bit rate is not large [12].

Figure 7.3 shows a comparison of (7.3) with signal-to-quantization-noise ratios measured for speech signals. The measurements were done by quantizing 16-bit samples to 8, 9, and 10 bits. The light dashed lines are from (7.3), and the dark dashed lines are measured values for uniform quantization. There is good agreement between these graphs, indicating that (7.3) is a reasonable estimate of SNR. As signal level increases, the ratio Xm/σx decreases, moving to the left in Figure 7.3. When σx gets close to Xm, many samples are clipped, and the assumptions underlying (7.3) no longer hold. This accounts for the precipitous fall in SNR for 1 < Xm/σx < 8.

Fig. 7.3 Comparison of µ-law and linear quantization for B = 8, 9, and 10. Equation (7.3): light dashed lines; measured uniform quantization SNR: dark dashed lines; µ-law (µ = 100) compression: solid lines.
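A quick numerical check of (7.3), using Laplacian random samples as a rough stand-in for speech amplitudes (the parameter choices below are arbitrary and not those of Figure 7.3):

```python
import numpy as np

def measured_snr_db(x, B, Xm):
    """Quantize x with a B-bit uniform quantizer of range [-Xm, Xm)
    and return the measured signal-to-quantization-noise ratio in dB."""
    delta = 2 * Xm / 2 ** B
    xhat = delta * np.clip(np.round(x / delta),
                           -2 ** (B - 1), 2 ** (B - 1) - 1)
    e = xhat - x
    return 10 * np.log10(np.mean(x ** 2) / np.mean(e ** 2))

# Laplacian samples as a rough stand-in for speech amplitudes
rng = np.random.default_rng(0)
x = rng.laplace(scale=1 / np.sqrt(2), size=200_000)   # sigma_x = 1

B, Xm = 10, 10.0           # Xm/sigma_x = 10, so clipping is rare
predicted = 6.02 * B + 4.78 - 20 * np.log10(Xm / 1.0)
print(predicted, measured_snr_db(x, B, Xm))           # close agreement
```

Repeating this with Xm only slightly above σx reproduces the clipping-induced collapse seen at the left of Figure 7.3.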
Also, it should be noted that (7.3) and Figure 7.3 show that, with all other parameters being fixed, increasing B by 1 bit (doubling the number of quantization levels) increases the SNR by 6 dB. On the other hand, it is also important to note that halving σx decreases the SNR by 6 dB. In other words, cutting the signal level in half is like throwing away one bit (half of the levels) of the quantizer. Thus, it is exceedingly important to keep input signal levels as high as possible without clipping.

7.1.2 µ-Law Quantization

Also note that the SNR curves in Figure 7.3 decrease linearly with increasing values of log10(Xm/σx). This is because the size of the quantization noise remains the same as the signal level decreases. If the quantization error were proportional to the signal amplitude, the SNR would be constant regardless of the size of the signal. This could be achieved if log(x[n]) were quantized instead of x[n], but this is not possible because the log function is ill-behaved for small values of x[n]. As a compromise, µ-law compression (of the dynamic range) of the signal, defined as

y[n] = Xm · ( log(1 + µ|x[n]|/Xm) / log(1 + µ) ) · sign(x[n]),    (7.4)

can be used prior to uniform quantization [118]. If µ is large (> 100), this nonlinear transformation has the effect of distributing the effective quantization levels uniformly for small samples and logarithmically over the remaining range of the input samples. Quantization of µ-law compressed signals is often termed log-PCM. The flat curves in Figure 7.3 show that the signal-to-noise ratio with µ-law compression remains relatively constant over a wide range of input levels.

In particular, 8-bit log-PCM with fs = 8000 samples/s is widely used for digital telephony; 8-bit, µ = 255 µ-law quantization for speech is standardized in the CCITT G.711 standard. This is an example of where speech quality is deliberately compromised in order to achieve a much lower bit rate. Once the bandwidth restriction has been imposed, 8-bit log-PCM introduces little or no perceptible distortion.
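The compression of (7.4) and the corresponding expansion can be sketched as follows:

```python
import numpy as np

def mu_law_compress(x, Xm=1.0, mu=255.0):
    """Mu-law compression, Eq. (7.4)."""
    return Xm * np.log1p(mu * np.abs(x) / Xm) / np.log1p(mu) * np.sign(x)

def mu_law_expand(y, Xm=1.0, mu=255.0):
    """Inverse of Eq. (7.4), used to restore linear amplitudes."""
    return (Xm / mu) * np.expm1(np.abs(y) * np.log1p(mu) / Xm) * np.sign(y)

# small samples are spread out before uniform quantization, so they
# keep roughly constant *relative* precision after 8-bit quantization
x = np.array([0.001, 0.01, 0.1, 0.5])
delta = 2.0 / 2 ** 8
xq = mu_law_expand(delta * np.round(mu_law_compress(x) / delta))
print((xq - x) / x)        # relative errors, all of similar order
```

Note that the deployed G.711 codec uses a piecewise-linear (segmented) approximation of this continuous curve; the closed form above is the textbook definition.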
This configuration is often referred to as "toll quality" because, when it was introduced at the beginning of the digital telephony era, it was perceived to render speech equivalent to speech transmitted over the best long-distance lines. Nowadays, readily available hardware devices convert analog signals directly into binary-coded µ-law samples and also expand compressed samples back to a uniform scale for conversion back to analog signals. Such an A-to-D/D-to-A device is called a coder/decoder, or "codec." If the digital output of a codec is used as input to a speech processing algorithm, it is generally necessary to restore the linearity of the amplitudes through the inverse of (7.4).

7.1.3 Non-Uniform and Adaptive Quantization

µ-law compression is an example of non-uniform quantization. It is based on the intuitive notion of constant percentage error. A more rigorous approach is to design a non-uniform quantizer that minimizes the mean-squared quantization error. To do this analytically, it is necessary to know the probability distribution of the signal sample values, so that the most probable samples, which for speech are the low-amplitude samples, will incur less error than the least probable samples. To apply this idea to the design of a non-uniform quantizer for speech requires an assumption of an analytical form for the probability distribution, or some algorithmic approach based on measured distributions. The fundamentals of optimum quantization were established by Lloyd [71] and Max [79]. Lloyd [71] gave an algorithm for designing optimum non-uniform quantizers based on sampled speech signals.^7 Paez and Glisson [91] gave an algorithm for designing optimum quantizers for assumed Laplace and Gamma probability densities, which are useful approximations to measured distributions for speech. Optimum non-uniform quantizers can improve the signal-to-noise ratio by as much as 6 dB over µ-law quantizers with the same number of bits, but with little or no improvement in perceived quality of reproduction.

^7 Lloyd's work was initially published in a Bell Laboratories Technical Note, with portions of the material having been presented at the Institute of Mathematical Statistics Meeting in Atlantic City, New Jersey, in September 1957. Subsequently, this pioneering work was published in the open literature in March 1982 [71].
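Lloyd's iteration can be sketched in a few lines (trained here on synthetic Laplacian samples as a stand-in for measured speech amplitudes; the level count and iteration count are arbitrary choices):

```python
import numpy as np

def lloyd_max(samples, levels, iters=100):
    """Sketch of Lloyd's algorithm: alternate (1) nearest-level
    partitioning of the training samples and (2) replacement of each
    output level by the centroid (mean) of its cell, which minimizes
    the squared error within the cell. Returns the output levels."""
    y = np.linspace(samples.min(), samples.max(), levels)
    for _ in range(iters):
        b = (y[:-1] + y[1:]) / 2              # decision boundaries
        idx = np.digitize(samples, b)
        for i in range(levels):
            cell = samples[idx == i]
            if cell.size:
                y[i] = cell.mean()
    return np.sort(y)

rng = np.random.default_rng(1)
x = rng.laplace(size=50_000)      # Laplace: a common speech-amplitude model
y = lloyd_max(x, levels=8)
print(y)                          # levels crowd toward the origin
```

Because the Laplacian density concentrates near zero, the optimized levels cluster around small amplitudes, exactly the behavior the text motivates.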
Another way to deal with the wide dynamic range of speech is to let the quantizer step size vary with time. When the quantizer in a PCM system is adaptive, the system is called an adaptive PCM (APCM) system. If the step size control is based on the unquantized signal samples, the quantizer is called a feedforward adaptive quantizer; in this case, the step size information must be transmitted or stored along with the quantized samples, thereby adding to the information rate. To amortize this overhead, feedforward quantization is usually applied to blocks of speech samples. For example, the step size could satisfy

Δ[n] = Δ0 (Ên)^{1/2},    (7.5)

where Δ0 is a constant and Ên is the short-time energy of the speech signal defined, for example, as in (4.6). With this definition, the step size will go up and down with the local rms amplitude of the speech signal. Such adaptation of the step size is equivalent to fixed quantization of the signal after division by the rms signal amplitude (Ên)^{1/2}. The adaptation speed can be adjusted by varying the analysis window size.

On the other hand, if the step size control is based on past quantized samples, the quantizer is a feedback adaptive quantizer. In this approach, it is not necessary to transmit the step size: it can be derived at the decoder by the same algorithm as used for encoding. Jayant [57] studied a class of feedback adaptive quantizers where the step size Δ[n] is a function of the step size for the previous sample, Δ[n-1]. By basing the adaptation on the previous sample, the step size is increased if the previous sample was quantized in one of the highest levels, while if the previous sample was quantized using step size Δ[n-1] to one of the low quantization levels, then the step size is decreased for the next sample. Adaptive quantizers can achieve about 6 dB improvement (equivalent to adding one bit) over fixed quantizers [57, 58].

7.2 Digital Speech Coding

As we have seen, the data rate of sampled and quantized speech is B · fs. Virtually perfect perceptual quality is achievable with a high data rate.
By reducing B or fs, the bit rate can be reduced, but perceptual quality may suffer. Reducing fs requires bandwidth reduction, and reducing B too much will introduce audible distortion that may resemble random noise. The main objective in digital speech coding (sometimes called speech compression) is to lower the bit rate while maintaining an adequate level of perceptual fidelity.^8 In addition to the two dimensions of quality and bit rate, the complexity (in terms of digital computation) is often of equal concern. Since the 1930s, engineers and speech scientists have worked toward the ultimate goal of achieving more efficient representations. This work has led to a basic approach in speech coding that is based on the use of DSP techniques and the incorporation of knowledge of speech production and perception into the quantization process.

Traditionally, speech coding methods have been classified according to whether they attempt to preserve the waveform of the speech signal, or whether they only seek to maintain an acceptable level of perceptual quality and intelligibility. The former are generally called waveform coders, and the latter model-based coders. Straightforward sampling and uniform quantization (PCM) is perhaps the only pure waveform coder. Non-uniform quantization and adaptive quantization are simple attempts to build in properties of the speech signal (time-varying amplitude distribution) and speech perception (quantization noise is masked by loud sounds), but these extensions still are aimed at preserving the waveform. Model-based coders, in contrast, are designed for obtaining efficient digital representations of speech, and only speech. Most modern model-based coding methods are based on the notion that the speech signal can be represented in terms of an excitation signal and a time-varying vocal tract system. These two components are quantized, and then a "quantized" speech signal can be reconstructed by exciting the quantized filter with the quantized excitation. There are many ways that this can be done. Figure 7.4 illustrates why linear predictive analysis can be fruitfully applied for model-based coding. The upper plot is a segment of a speech signal.

^8 If the signal that is reconstructed from the digital representation is not perceptibly different from the original analog speech signal, the digital representation is often referred to as being "transparent."
The lower plot is the output of a linear prediction error filter A(z) (with p = 12) that was derived from the given segment by the techniques of Chapter 6. Note that the prediction error sequence amplitude of Figure 7.4 is a factor of five lower than that of the signal itself, which means that, for a fixed number of bits, this segment of the prediction error could be quantized with a smaller step size than the waveform itself. If the quantized prediction error (residual) segment is used as input to the corresponding vocal tract filter H(z) = 1/A(z), an approximation to the original segment of speech would be obtained.^9 While direct implementation of this idea does not lead to a practical method of speech coding, it is nevertheless suggestive of more practical schemes, which we will discuss.

Fig. 7.4 Speech signal and corresponding prediction error signal for a 12th-order predictor.

Today, the line between waveform coders and model-based coders is not distinct. A more useful classification of speech coders focuses on how the speech production and perception models are incorporated into the quantization process. One class of systems simply attempts to extract an excitation signal and a vocal tract system from the speech signal without any attempt to preserve a relationship between the waveforms of the original and the quantized speech. Such systems are attractive because they can be implemented with modest computation.

^9 To implement this time-varying inverse filtering/reconstruction requires care in fitting the segments of the signals together at the block boundaries. Overlap-add methods [89] can be used effectively for this purpose.
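The amplitude reduction seen in Figure 7.4 is easy to reproduce in miniature on a synthetic example (a pulse-excited two-pole resonator standing in for voiced speech; every parameter value below is invented for the illustration, not taken from the figure):

```python
import numpy as np

def lpc_autocorrelation(x, p):
    """Autocorrelation-method linear predictive analysis via the
    Levinson-Durbin recursion (Chapter 6): returns alpha_1..alpha_p
    for A(z) = 1 - sum_k alpha_k z^{-k}."""
    r = np.array([np.dot(x[: len(x) - m], x[m:]) for m in range(p + 1)])
    a = np.zeros(0)
    E = r[0]
    for i in range(1, p + 1):
        k = (r[i] - np.dot(a, r[i - 1 : 0 : -1])) / E
        a = np.concatenate((a - k * a[::-1], [k]))
        E *= 1 - k * k
    return a

# synthetic "speech": sparse pulses through a resonant two-pole filter
n = 400
exc = np.zeros(n)
exc[::80] = 1.0
a1, a2 = 2 * 0.95 * np.cos(np.pi / 4), -0.95 ** 2
x = np.zeros(n)
for i in range(n):
    x[i] = exc[i] + (a1 * x[i - 1] if i >= 1 else 0.0) \
                  + (a2 * x[i - 2] if i >= 2 else 0.0)

# prediction error d[n] = x[n] - sum_k alpha_k x[n-k]
alpha = lpc_autocorrelation(x, p=12)
d = x.copy()
for k in range(12):
    d[k + 1 :] -= alpha[k] * x[: -(k + 1)]
print(np.std(x) / np.std(d))   # residual is several times smaller
```

The residual is essentially the sparse excitation: the predictor has absorbed the resonant "vocal tract" part of the signal, which is exactly what makes it cheaper to quantize.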
These coders, which we designate as open-loop coders, are also called vocoders (voice coders), since they are based on the principles established by H. Dudley early in the history of speech processing research [30]. A second class of coders employs the source/system model for speech production inside a feedback loop, and thus are called closed-loop coders. These compare the quantized output to the original input and attempt to minimize the difference between the two in some prescribed sense. Closed-loop systems are generally classified as waveform coders, since they explicitly attempt to minimize a time-domain distortion measure, but we prefer to emphasize that they are closed-loop, model-based systems that also preserve the waveform; they often do a good job of preserving the speech waveform while employing many of the same techniques used in open-loop systems. Differential PCM systems are simple examples of this class, but increased availability of computational power has made it possible to implement much more sophisticated closed-loop systems called analysis-by-synthesis coders.

7.3 Closed-Loop Coders

7.3.1 Predictive Coding

The essential features of predictive coding of speech were set forth in a classic paper by Atal and Schroeder [5], although the basic principles of predictive quantization were introduced by Cutler [26]. Figure 7.5 shows a general block diagram of a large class of speech coding systems that are called adaptive differential PCM (ADPCM) systems. The reader should initially ignore the blocks concerned with adaptation and all the dotted control lines, and focus instead on the core DPCM system, which is comprised of a feedback structure that includes the blocks labeled Q and P. The quantizer Q can have a number of levels ranging from 2 to much higher; it can be uniform or non-uniform; and it can be adaptive or not; but irrespective of the type of quantizer, the quantized output can be expressed as d̂[n] = d[n] + e[n], where e[n] is the quantization error. The block labeled P is a linear predictor,^10 i.e., the signal x[n] is predicted based on p past samples of the reconstructed signal:

x̃[n] = Σ_{k=1}^{p} αk x̂[n-k].    (7.6)
Finally.. . as shown in Figure 7. the diﬀerence between the input ˜ x[n]. 7.3 ClosedLoop Coders 107 Fig. although. the signal x[n] is predicted based on p past samples of the signal ˜ 10 The signal d[n] = x[n] − x[n]. The latter would be termed a feedback adaptive ˆ predictor. the 10 In a feedforward adaptive predictor. the predictor parameters are estimated from the input signal x[n]. is the input to the quantizer. ˆ and the predicted signal.5 General block diagram for adaptive diﬀerential PCM (ADPCM).e.7. i.5. they can also be estimated from past samples of the reconstructed signal x[n].
e. the quantization error in x[n] is idenˆ ˆ tical to the quantization error in the quantized excitation signal d[n]. for ﬁne quantization (i. the signaltonoise ratio of the ADPCM system is 2 2 SNR = σx /σe . ˆ (7. as shown in Figure 7.108 Digital Speech Coding ˆ relationship between x[n] and d[n] is ˆ p x[n] = ˆ k=1 ˆ αk x[n − k] + d[n]. which. A simple.7) i. the quantizer can be either feedforward. The number of bits B determines the bitrate of the coded diﬀerence signal... The feedback structure of Figure 7. Finally. As shown in Figure 7.5 contains within it a source/system speech model. step size information must be part of the digital representation.8) This is the key result for DPCM systems. Neglecting . which. since ˆ x[n] = x[n] + d[n] it follows that ˆ ˜ x[n] = x[n] + e[n]. but informative modiﬁcation leads to SNR = 2 2 σx σd · 2 = GP · SNRQ . The quantity GP is called the prediction gain.e.. is the system needed to reconstruct the quantized speech signal x[n] from the ˆ coded diﬀerence signal input.” and d[n] is the quantized excitation signal.e. If the prediction is good. is given by (7. 2 σd σe (7.9) where the obvious deﬁnitions apply. From (7.5. It says that no matter what predictor is used in Figure 7.5. ˆ (7. if GP > 1 it represents an improvement gained by placing a predictor based on the speech model inside the feedback loop around the quantizer. i.or feedbackadaptive.5(b). the signal x[n] is the output of what we have called the “vocal ˆ ˆ tract ﬁlter. large number of quantization levels with small step size).3). then the variance of d[n] will be less than the variance of x[n] so it will be possible to use a smaller step size and therefore to reconstruct x[n] with less error than if x[n] were quanˆ tized directly. As indicated. if the quantizer is feedforward adaptive. SNRQ is the signaltonoise ratio of the quantizer.8).
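The key result (7.8) is easy to verify in a minimal first-order DPCM sketch (fixed predictor and fixed uniform quantizer; the AR(1) test signal and all constants are invented for the illustration):

```python
import numpy as np

def dpcm_encode_decode(x, alpha, delta):
    """First-order DPCM sketch with a fixed predictor and a fixed
    uniform quantizer. Because the predictor inside the feedback loop
    operates on the reconstructed signal x_hat, the decoder (which
    sees only the quantized differences) stays in lockstep with it."""
    xr = 0.0                                  # x_hat[n-1]
    xhat = np.zeros(len(x))
    for n in range(len(x)):
        pred = alpha * xr                     # x_tilde[n]
        d = x[n] - pred                       # prediction error d[n]
        dq = delta * np.round(d / delta)      # quantized d_hat[n]
        xr = pred + dq                        # x_hat[n], Eq. (7.7)
        xhat[n] = xr
    return xhat

rng = np.random.default_rng(3)
x = np.zeros(2000)                            # AR(1) test signal
for n in range(1, 2000):
    x[n] = 0.9 * x[n - 1] + 0.1 * rng.standard_normal()

xhat = dpcm_encode_decode(x, alpha=0.9, delta=0.02)
# Eq. (7.8): the reconstruction error equals the quantizer error on d[n]
print(np.max(np.abs(xhat - x)))               # never exceeds delta/2
```

Because d[n] has a much smaller variance than x[n] for this correlated signal, the same step size applied directly to x[n] would waste range, which is the prediction-gain argument formalized next.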
The predictor can be either fixed or adaptive, and adaptive predictors can be either feedforward or feedback (derived by analyzing past samples of the quantized speech signal reconstructed as part of the coder). Fixed predictors can be designed based on long-term average correlation functions. If we neglect the correlation between the signal and the quantization error, it can be shown that the prediction gain is

    GP = σx²/σd² = 1/(1 − Σ_{k=1}^{p} αk ρ[k]),    (7.10)

where ρ[k] = φ[k]/φ[0] is the normalized autocorrelation function used to compute the optimum predictor coefficients. The lower plot of Figure 7.6 shows an estimated long-term autocorrelation function for speech sampled at an 8 kHz sampling rate, and the upper plot shows 10 log10 GP computed from (7.10) as a function of predictor order p.

Fig. 7.6 Long-term autocorrelation function ρ[m] (lower plot, autocorrelation lag 0–12) and corresponding prediction gain GP in dB as a function of predictor order p (upper plot).

Note that a first-order fixed predictor can yield about 6 dB of prediction gain, so that either the quantizer can have 1 bit less or the reconstructed signal can have a 6 dB higher SNR. Higher-order fixed predictors can achieve about 4 dB more prediction gain, and adapting the predictor at a phoneme rate can produce an additional 4 dB of gain [84].

Great flexibility is inherent in the block diagram of Figure 7.5, and not surprisingly, many systems based upon the basic principle of differential quantization have been proposed, studied, and implemented as standard systems. Here we can only mention a few of the most important.

7.3.2 Delta Modulation

Delta modulation (DM) systems are the simplest differential coding systems, since they use a 1-bit quantizer and usually only a first-order predictor. DM systems originated in the classic work of Cutler [26] and de Jager [28]. While DM systems evolved somewhat independently of the more general differential coding methods, they nevertheless fit neatly into the general theory of predictive coding. To see how such systems can work, i.e., how the very low SNR of a 1-bit quantizer can be compensated by the prediction gain due to oversampling, note the nature of the long-term autocorrelation function in Figure 7.6. This correlation function is for fs = 8000 samples/s. If the same bandwidth speech signal were oversampled at a sampling rate higher than 8000 samples/s, we could expect that the correlation value ρ[1] would lie on a smooth curve interpolating between samples 0 and 1 in Figure 7.6. The optimum predictor coefficient for a first-order predictor is α1 = ρ[1], so from (7.10) it follows that the prediction gain for a first-order predictor is

    GP = 1/(1 − ρ²[1]).    (7.11)

Thus, as fs gets large, both ρ[1] and α1 approach unity and GP gets large, compensating for the low SNR of the 1-bit quantizer.

Delta modulation systems can be implemented with very simple hardware, and in contrast to more general predictive coding systems, they are not generally implemented with block processing. Instead, adaptation algorithms are usually based on the bit stream at the output of the 1-bit quantizer. The simplest delta modulators use a very high sampling rate with a fixed predictor.
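To make (7.10) and (7.11) concrete, the following sketch computes the prediction gain from a normalized autocorrelation via the Levinson–Durbin recursion. The ρ values below are illustrative stand-ins, not the measured long-term correlation of Figure 7.6.

```python
import math

def levinson(rho, p):
    """Levinson-Durbin recursion on normalized autocorrelation values
    rho[0..p] (rho[0] = 1).  Returns the optimum predictor coefficients
    alpha[1..p] and the normalized prediction-error power sigma_d^2/sigma_x^2."""
    a = [0.0] * (p + 1)
    err = rho[0]
    for i in range(1, p + 1):
        acc = rho[i] - sum(a[j] * rho[i - j] for j in range(1, i))
        k = acc / err
        a_new = a[:]
        a_new[i] = k
        for j in range(1, i):
            a_new[j] = a[j] - k * a[i - j]
        a = a_new
        err *= 1.0 - k * k
    return a[1:], err

# Illustrative normalized correlation values (NOT the measured curve of Fig. 7.6).
rho = [1.0, 0.85, 0.55, 0.25, 0.05, -0.05, -0.1]

a1, e1 = levinson(rho, 1)
gp1 = 1.0 / (1.0 - rho[1] ** 2)       # Eq. (7.11): first order, alpha_1 = rho[1]

a4, e4 = levinson(rho, 4)
gp4 = 1.0 / (1.0 - sum(a4[k - 1] * rho[k] for k in range(1, 5)))   # Eq. (7.10)

print("GP(p=1) = %.2f dB, GP(p=4) = %.2f dB"
      % (10 * math.log10(gp1), 10 * math.log10(gp4)))
```

Note that the prediction gain is exactly the reciprocal of the normalized prediction-error power returned by the recursion, which is a quick consistency check on (7.10).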
Their bit rate, being equal to fs, must be very high to achieve good quality reproduction. This limitation can be mitigated with only a modest increase in complexity by using an adaptive 1-bit quantizer [47, 56]. Jayant [56] showed that for bandlimited (3 kHz) speech, adaptive delta modulation (ADM) with bit rate (sampling rate) 40 kbits/s has about the same SNR as 6-bit log PCM sampled at 6.6 kHz. Furthermore, he showed that doubling the sampling rate of his ADM system improved the SNR by about 10 dB.

Adaptive delta modulation has the following advantages: it is very simple to implement, it has very low coding delay, only bit synchronization is required in transmission, and it can be adjusted to be robust to channel errors. For these reasons, ADM is still used for terminal-cost-sensitive digital transmission applications.

7.3.3 Adaptive Differential PCM Systems

Differential quantization with a multibit quantizer is called differential PCM (DPCM). As depicted in Figure 7.5, adaptive differential PCM (ADPCM) systems can have any combination of adaptive or fixed quantizers and/or predictors. Generally, the operations depicted in Figure 7.5 are implemented on short blocks of input signal samples, which introduces delay. See Section 7.3.3.1 for a discussion of this issue.

A careful comparison of a variety of ADPCM systems by Noll [84] gives a valuable perspective on the relative contributions of the quantizer and the predictor. Noll compared log PCM, adaptive PCM, fixed DPCM, and three ADPCM systems, all using quantizers of 2, 3, 4, and 5 bits with 8 kHz sampling rate (i.e., bit rates of 16, 24, 32, and 40 kbps). For all bit rates, the results were as follows11: (1) Log PCM had the lowest SNR. (2) Adapting the quantizer (APCM) improved the SNR by 6 dB. (3) Adding first-order fixed or adaptive prediction improved the SNR by about 4 dB over APCM.

11 The bit rates mentioned do not include overhead information for the quantized prediction coefficients and quantizer step sizes.
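The bit-stream-driven step-size adaptation described above can be sketched as follows. The step multipliers (1.5 and 0.66) and step limits are illustrative choices in the spirit of Jayant's adaptation rule, not values from the text; the point is that the decoder recovers the step size from the transmitted bits alone, with no side information.

```python
import math

def adm_encode(x, step0=0.05, pmul=1.5, qmul=0.66, smin=1e-4, smax=1.0):
    """One-bit adaptive delta modulation.  The step size grows when
    successive bits agree (slope overload) and shrinks when they alternate
    (granular noise); the adaptation depends only on the bit stream."""
    bits, xhat, step, prev = [], 0.0, step0, 1
    for sample in x:
        bit = 1 if sample >= xhat else 0                   # 1-bit quantizer
        step = min(max(step * (pmul if bit == prev else qmul), smin), smax)
        xhat += step if bit else -step                     # first-order predictor update
        bits.append(bit)
        prev = bit
    return bits

def adm_decode(bits, step0=0.05, pmul=1.5, qmul=0.66, smin=1e-4, smax=1.0):
    out, xhat, step, prev = [], 0.0, step0, 1
    for bit in bits:
        step = min(max(step * (pmul if bit == prev else qmul), smin), smax)
        xhat += step if bit else -step
        out.append(xhat)
        prev = bit
    return out

# Oversampled low-frequency tone: adjacent samples are highly correlated,
# which is what lets a 1-bit quantizer work at all.
x = [0.5 * math.sin(2 * math.pi * 100 * n / 32000) for n in range(1000)]
y = adm_decode(adm_encode(x))
```

Because the encoder and decoder run the identical state update from the same bits, the decoder tracks the encoder exactly; the reconstruction error is limited to granular noise on the order of the adapted step size.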
(4) A fourth-order adaptive predictor added about 4 dB, and increasing the predictor order to 12 added only 2 dB more.

The superior performance of ADPCM relative to log-PCM and its relatively low computational demands have led to several standard versions (ITU G.721, G.726, G.727), which are often operated at 32 kbps, where quality is superior to log-PCM (ITU G.711) at 64 kbps.

The classic paper by Atal and Schroeder [5] contained a number of ideas that have since been applied with great success in adaptive predictive coding of speech. One of these concerns the quantization noise introduced by ADPCM, which we have seen is simply added to the input signal in the reconstruction process. If we invoke the white noise approximation for the quantization error, it is clear that the noise will be most prominent in the speech at high frequencies, where the speech spectrum amplitudes are low. A simple solution to this problem is to preemphasize the speech signal before coding, i.e., to filter the speech with a linear filter that boosts the high-frequency (HF) part of the spectrum. A simple preemphasis system of this type has system function (1 − γz⁻¹), where γ is less than one.12 After reconstruction of the preemphasized speech, it can be filtered with the inverse system to restore the spectral balance, and in the process, the noise spectrum takes on the shape of the deemphasis filter spectrum. If γ is too close to one, the low-frequency noise is emphasized too much. An equivalent approach is to define a new feedback coding system where the quantization noise is computed as the difference between the input and output of the quantizer and then shaped before being added to the prediction residual. A more sophisticated application of this idea is to replace the simple fixed preemphasis filter with a time-varying filter designed to shape the quantization noise spectrum. All these approaches raise the noise spectrum at low frequencies and lower it at high frequencies, thus taking advantage of the masking effects of the prominent low frequencies in speech [2].

Another idea that has been applied effectively in ADPCM systems, as well as in analysis-by-synthesis systems, is to include a long-delay predictor to capture the periodicity, as well as the short-delay correlation, inherent in voiced speech.

12 Values around γ = 0.4 for fs = 8 kHz work best.
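A minimal sketch of the preemphasis/deemphasis idea, assuming γ = 0.4 as suggested in footnote 12: the two filters are exact inverses, so the speech itself is recovered unchanged, while flat "quantization" noise injected between them emerges with the lowpass shape of the deemphasis filter.

```python
import random

random.seed(0)
GAMMA = 0.4   # footnote 12: values around 0.4 work best for fs = 8 kHz

def preemphasize(x, g=GAMMA):
    """y[n] = x[n] - g*x[n-1]: boosts the high-frequency part of the spectrum."""
    return [x[n] - g * (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]

def deemphasize(y, g=GAMMA):
    """Inverse system 1/(1 - g z^-1): restores the spectral balance; any
    additive noise takes on its lowpass shape."""
    out, prev = [], 0.0
    for v in y:
        prev = v + g * prev
        out.append(prev)
    return out

x = [random.uniform(-1.0, 1.0) for _ in range(2000)]
rec = deemphasize(preemphasize(x))          # exact inverse: rec == x

# White noise injected after preemphasis comes out of the deemphasis filter
# tilted toward low frequencies, where loud speech components help mask it.
noise = [random.uniform(-0.01, 0.01) for _ in range(2000)]
shaped = deemphasize(noise)
```

A crude way to see the tilt without an FFT is to compare first-difference energies (differencing emphasizes high frequencies): the shaped noise has measurably less of it than the white noise going in.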
Including the simplest long-delay predictor would change (7.6) to

    x̃[n] = β x̂[n − M] + Σ_{k=1}^{p} αk (x̂[n − k] − β x̂[n − k − M]).    (7.12)

The parameter M is essentially the pitch period (in samples) of voiced speech, and the parameter β accounts for amplitude variations between periods. When this type of predictor is used, the "vocal tract system" used in the decoder would have system function

    H(z) = 1 / [(1 − Σ_{k=1}^{p} αk z⁻ᵏ)(1 − β z⁻ᴹ)].    (7.13)

Adding long-delay prediction takes more information out of the speech signal and encodes it in the predictor. Even more complicated predictors with multiple delays and gains have been found to improve the performance of ADPCM systems significantly [2].

The predictor parameters in (7.12) cannot be jointly optimized. One suboptimal approach is to estimate β and M first, by determining the M that maximizes the correlation around the expected pitch period and setting β equal to the normalized correlation at M. Then the short-delay predictor parameters αk are estimated from the output of the long-delay prediction error filter.13

13 Normally, this analysis would be performed on the input speech signal, with the resulting predictor coefficients applied to the reconstructed signal as in (7.12).

7.3.3.1 Coding the ADPCM Parameters

A glance at Figure 7.5 shows that if the quantizer and predictor are fixed or feedback-adaptive, then only the coded difference signal is needed to reconstruct the speech signal (assuming that the decoder has the same coefficients and feedback-adaptation algorithms). Otherwise, the predictor coefficients and quantizer step size must be included as auxiliary data (side information) along with the quantized difference signal.
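The suboptimal two-step estimate of β and M might be sketched as follows. The lag range (40–160 samples, i.e., roughly 50–200 Hz pitch at fs = 8 kHz) and the particular normalization used to score candidate lags are illustrative assumptions, not choices specified in the text.

```python
import math

def estimate_long_delay(x, min_lag=40, max_lag=160):
    """Two-step procedure: pick M maximizing a normalized correlation over
    the expected pitch range, then set beta to the normalized correlation
    at M (the gain minimizing the error of the predictor beta*x[n-M])."""
    def corr(m):
        return sum(x[n] * x[n - m] for n in range(m, len(x)))
    def energy(m):
        return sum(x[n - m] ** 2 for n in range(m, len(x)))
    best_m, best_score = min_lag, -float("inf")
    for m in range(min_lag, max_lag + 1):
        e = energy(m)
        score = corr(m) / math.sqrt(e) if e > 0 else 0.0
        if score > best_score:
            best_m, best_score = m, score
    beta = corr(best_m) / energy(best_m)
    return best_m, beta

# Perfectly periodic toy input with period 80 samples: M should be 80 and
# beta should be 1 (no amplitude variation between periods).
x = [math.sin(2 * math.pi * n / 80) for n in range(400)]
M, beta = estimate_long_delay(x)
```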
The difference signal will have the same sampling rate as the input signal, while the step size and predictor parameters will be estimated and changed at a much lower rate, e.g., 50–100 times/s. The total bit rate will be the sum of all the component bit rates.

Predictor Quantization

The predictor parameters must be quantized for efficient digital representation of the speech signal. As discussed in Chapter 6, the short-delay predictors can be represented in many equivalent ways, almost all of which are preferable to direct quantization of the predictor coefficients themselves. The PARCOR coefficients can be quantized with a nonuniform quantizer, or transformed with an inverse sine or hyperbolic tangent function to flatten their statistical distribution and then quantized with a fixed uniform quantizer. Each coefficient can be allocated a number of bits contingent on its importance in accurately representing the speech spectrum [131]. Atal [2] reported a system which used a 20th-order predictor and quantized the resulting set of PARCOR coefficients after transformation with an inverse sine function. The number of bits per coefficient ranged from 5 for each of the lowest two PARCORs down to 1 each for the six highest-order PARCORs, yielding a total of 40 bits per frame. If the parameters are updated 100 times/s, then a total of 4000 bps is required for the short-delay predictor information. To save bit rate, it is possible to update the short-delay predictor 50 times/s for a total bit rate contribution of 2000 bps [2].

The long-delay predictor has a delay parameter M, which requires 7 bits to cover the range of pitch periods to be expected with an 8 kHz sampling rate of the input. The long-delay predictor in the system mentioned above used delays of M − 1, M, and M + 1 and three gains, each quantized to 4- or 5-bit accuracy. This gave a total of 20 bits/frame and added 2000 bps to the overall bit rate.

Vector Quantization for Predictor Quantization

Another approach to quantizing the predictor parameters is called vector quantization, or VQ [45]. The basic principle of vector quantization is depicted in Figure 7.7. VQ can be applied in any context where a set of parameters naturally groups together into a vector (in the sense of linear algebra).
Fig. 7.7 Vector quantization: the encoder exhaustively compares the input vector v against a codebook of 2^B vectors under a distortion measure and transmits the index i of the closest entry v̂i; the decoder recovers v̂i by table lookup.

In VQ, a vector v to be quantized is compared exhaustively to a "codebook" populated with representative vectors of that type, according to a prescribed distortion measure. The index i of the closest vector v̂i to v is returned by the exhaustive search. This index then represents the quantized vector, in the sense that if the codebook is known, the corresponding quantized vector v̂i can be looked up. If the codebook has 2^B entries, the index i can be represented by a B-bit number. For example, the predictor coefficients can be represented by vectors of predictor coefficients, PARCOR coefficients, cepstrum values, log area ratios, or line spectrum frequencies.14

The design of a vector quantizer requires a training set consisting of a large number L of example vectors drawn from the same distribution as the vectors to be quantized later, along with a distortion measure for comparing two vectors. Typically, a VQ training set consists of at least 10 times the ultimate number of codebook vectors, i.e., L ≥ 10 · 2^B. Using an iterative process, the 2^B codebook vectors are formed from the training set. The "training phase" is computationally intense, but need only be done once, and it is facilitated by the LBG algorithm [70]. The resulting codebook is then used to quantize general test vectors. One advantage of VQ is that the distortion measure can be designed to sort the vectors into perceptually relevant prototype vectors.

14 Independent quantization of individual speech samples or individual predictor coefficients is called scalar quantization.
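A minimal VQ encoder/decoder with a squared-error distortion is sketched below (codebook training by the LBG algorithm is not shown). The tiny 2-bit codebook is purely illustrative.

```python
def vq_encode(v, codebook):
    """Exhaustive search for the codebook vector closest to v under a
    squared-error distortion; the returned index needs only B bits for a
    codebook of 2^B vectors."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(codebook)), key=lambda i: dist(v, codebook[i]))

def vq_decode(i, codebook):
    return codebook[i]   # the decoder is just a table lookup

codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # 2^B = 4, B = 2
i = vq_encode((0.9, 0.1), codebook)
vhat = vq_decode(i, codebook)
```

The exhaustive comparison in `vq_encode` is exactly the "time consuming" step mentioned below; structured codebooks reduce it at some cost in distortion.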
Difference Signal Quantization

It is also necessary to code the difference signal in ADPCM. Since the difference signal has the same sampling rate as the input signal, it is desirable to use as few bits as possible for the quantizer. If straightforward B-bit uniform quantization is used, the additional bit rate would be B · fs. In order to reduce the bit rate for coding the difference samples, some sort of block coding must be applied. One approach is to design a multibit quantizer that only operates on the largest samples of the difference signal. Atal and Schroeder [2, 7] proposed to precede the quantizer by a "center clipper," which is a system that sets to zero all samples whose amplitudes are within a threshold band and passes those samples above threshold unchanged. By adjusting the threshold, the number of zero samples can be controlled. For high thresholds, long runs of zero samples result, and the entropy of the quantized difference signal becomes less than one bit/sample. The blocks of zero samples can be encoded efficiently using variable-length-to-block codes, yielding average bit rates on the order of 0.7 bits/sample [2]. For a sampling rate of 8 kHz, this coding method needs a total of 5600 bps for the difference signal. When this bit rate of the differential coder is added to the bit rate for quantizing the predictor information (≈4000 bps), the total comes to approximately 10,000 bps.

7.3.3.2 Quality vs. Bit Rate for ADPCM Coders

ADPCM systems normally operate with an 8 kHz sampling rate, which is a compromise aimed at providing adequate intelligibility and moderate bit rate for voice communication. With this sampling rate, the speech signal must be bandlimited to somewhat less than 4 kHz by a lowpass filter. This was the approach used in the earlier studies by Noll [84] mentioned in Section 7.3.3. Adaptive prediction and quantization can easily lower the bit rate to 32 kbps with no degradation in perceived quality (with 64 kbps log PCM toll quality as a reference).
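The center clipper itself is a one-liner. The Gaussian test signal and the 1.5σ threshold below are illustrative, chosen only to show how the threshold controls the fraction of zero samples (and hence the length of the zero runs available to the variable-length-to-block code).

```python
import random

def center_clip(d, threshold):
    """Set to zero every sample inside the +/-threshold band; pass samples
    above threshold unchanged."""
    return [v if abs(v) > threshold else 0.0 for v in d]

random.seed(1)
d = [random.gauss(0.0, 1.0) for _ in range(1000)]
clipped = center_clip(d, 1.5)
zero_fraction = sum(1 for v in clipped if v == 0.0) / len(clipped)
```

For a unit-variance Gaussian difference signal, a threshold of 1.5 zeroes roughly 87% of the samples, which is what drives the entropy well below one bit/sample.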
Using the techniques discussed above, the bit rate can be lowered below 32 kbps with only modest distortion, down to about 10 kbps. Below this value, quality degrades significantly. The center clipping quantizer produces a difference signal that can be coded efficiently, but this approach is clearly not optimal, since the center clipper throws away information in order to obtain a sparse sequence. In order to achieve near-toll quality at low bit rates, all the techniques discussed above and more must be brought into play. This increases the system complexity, although not unreasonably so for current DSP hardware. On the other hand, raising the bit rate for ADPCM above 64 kbps cannot improve the quality over log-PCM, because of the inherent frequency limitation of the 8 kHz sampling rate. Thus, ADPCM is an attractive coding method when toll quality is required at modest cost, and where adequate transmission and/or storage capacity is available to support bit rates on the order of 10 kbps.

7.3.4 Analysis-by-Synthesis Coding

While ADPCM coding can produce excellent results at moderate bit rates, its performance is fundamentally constrained by the fact that the difference signal has the same sampling rate as the input signal. What is needed is a way of creating an excitation signal for the vocal tract filter that is both efficient to code and also produces decoded speech of high quality. This can be done within the same closed-loop framework as ADPCM, but with some significant modifications.

7.3.4.1 Basic Analysis-by-Synthesis Coding System

Figure 7.8 shows the block diagram of another class of closed-loop digital speech coders. These systems are called "analysis-by-synthesis coders" because the excitation is built up using an iterative process to produce a "synthetic" vocal tract filter output x̂[n] that matches the input speech signal according to a perceptually weighted error criterion. As in the case of the most sophisticated ADPCM systems, the operations of Figure 7.8 are carried out on blocks of speech samples. In particular, the difference d[n] between the input x[n] and the output x̂[n] of the vocal tract filter is filtered with a linear filter called the perceptual weighting filter.
Fig. 7.8 Structure of analysis-by-synthesis speech coders: the difference d[n] = x[n] − x̂[n] is passed through the perceptual weighting filter W(z); the excitation generator selects the excitation d̂[n], which drives the vocal tract filter H(z) to produce x̂[n].

Note the similarity of Figure 7.8 to the core ADPCM diagram of Figure 7.5. In both cases, the vocal tract model is in the same position in the closed-loop system. In ADPCM, a signal x̃[n] predicted from x̂[n] is subtracted from the input to form the difference signal d[n], and an adaptive quantization algorithm operates on d[n] to produce a quantized difference signal d̂[n], which is the input to the vocal tract system. In analysis-by-synthesis coding, the perceptual weighting filter and the excitation generator inside the dotted box play the role played by the quantizer in ADPCM: the excitation signal is determined from the perceptually weighted difference signal d′[n] by an algorithm represented by the block labeled "Excitation Generator." This is a key difference. In ADPCM, the synthetic output differs from the input x[n] by the quantization error; in analysis-by-synthesis coding, a perceptually weighted version of the reconstruction error is minimized in the mean-squared sense by the selection of the excitation d̂[n]. As the first step in coding a block of speech samples, both the vocal tract filter and the perceptual weighting filter are derived from a linear predictive analysis of the block.

7.3.4.2 Perceptual Weighting of the Difference Signal

Since x̂[n] = x[n] − d[n], the error in the reconstructed signal is −d[n], and it is desirable to shape its spectrum to take advantage of perceptual masking effects. In ADPCM, this is accomplished by quantization noise feedback or preemphasis/deemphasis filtering. In analysis-by-synthesis coding, the input to the vocal tract filter is determined so as to minimize the perceptually weighted error d′[n]. The weighting is implemented by linear filtering, i.e., d′[n] = d[n] ∗ w[n], with the weighting filter usually defined in terms of the vocal tract filter as the linear system with system function

    W(z) = A(z/α1)/A(z/α2) = H(z/α2)/H(z/α1),    (7.14)

where typical values of α1 = 0.9 and α2 = 0.4 are used in (7.14). The zeros of W(z) are at the same angles as the poles of H(z), but at radii of α1 times the radii of those poles; the poles of W(z) lie at the same angles, but at α2 times the radii of the poles of H(z). If α1 > α2, the frequency response is like a "controlled" inverse filter for H(z). Figure 7.9 shows the frequency response of such a filter.

Fig. 7.9 Comparison of frequency responses of the vocal tract filter and the perceptual weighting filter in analysis-by-synthesis coding (log magnitude in dB versus frequency in Hz, 0–4000 Hz).

Clearly, this filter tends to emphasize the high frequencies (where the vocal tract filter gain is low) and to deemphasize the low frequencies in the error signal. It follows that the error will be distributed in frequency so that relatively more error occurs at low frequencies, where such errors are masked by the high-amplitude low-frequency components of the speech; this is the shape desired. By varying α1 and α2 in (7.14), the relative distribution of the error can be adjusted.
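The coefficient scaling behind (7.14) can be sketched directly: replacing A(z) by A(z/α) scales the kth predictor coefficient by α^k, which moves the roots of A(z) toward the origin by the factor α. The second-order predictor used here is an arbitrary illustrative example, not taken from the text.

```python
def bandwidth_expand(a, alpha):
    """Turn A(z) = 1 - sum_k a_k z^-k into A(z/alpha) by scaling the kth
    coefficient by alpha**k (list index 0 corresponds to delay k = 1)."""
    return [ak * alpha ** (k + 1) for k, ak in enumerate(a)]

def pole_zero_filter(b, a, x):
    """Filter x with numerator 1 - sum_k b_k z^-(k+1) and denominator
    1 - sum_k a_k z^-(k+1), matching the sign convention of A(z) above."""
    y = []
    for n in range(len(x)):
        v = x[n]
        v -= sum(bk * x[n - 1 - k] for k, bk in enumerate(b) if n - 1 - k >= 0)
        v += sum(ak * y[n - 1 - k] for k, ak in enumerate(a) if n - 1 - k >= 0)
        y.append(v)
    return y

a = [1.2, -0.6]                  # illustrative second-order predictor
num = bandwidth_expand(a, 0.9)   # zeros of W(z): 0.9 x the pole radii of H(z)
den = bandwidth_expand(a, 0.4)   # poles of W(z): 0.4 x the pole radii of H(z)
w = pole_zero_filter(num, den, [1.0] + [0.0] * 63)   # impulse response of W(z)
```

Because the poles are pulled well inside the unit circle (radius scaled by 0.4), W(z) is strongly stable and its impulse response decays rapidly.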
7.3.4.3 Generating the Excitation Signal

Most analysis-by-synthesis systems generate the excitation from a finite, fixed collection of input components, which we designate here as fγ[n] for 0 ≤ n ≤ L − 1, where L is the excitation frame length and γ ranges over the finite set of components. The input is composed of a finite sum of scaled components selected from the given collection, i.e.,

    d̂[n] = Σ_{k=1}^{N} βk fγk[n].    (7.15)

The βk's and the sequences fγk[n] are chosen to minimize

    E = Σ_{n=0}^{L−1} ((x[n] − h[n] ∗ d̂[n]) ∗ w[n])²,    (7.16)

where h[n] is the vocal tract impulse response and w[n] is the impulse response of the perceptual weighting filter with system function (7.14). Since the component sequences are assumed to be known at both the coder and the decoder, the βs and γs are all that is needed to represent the input d̂[n].

It is very difficult to solve simultaneously for the optimum βs and γs that minimize (7.16). However, satisfactory results are obtained by solving for the component signals one at a time. Starting with the assumption that the excitation signal is zero during the current excitation analysis frame15 (indexed 0 ≤ n ≤ L − 1), the output in the current frame due to the excitation determined for previous frames is computed and denoted x̂0[n] for 0 ≤ n ≤ L − 1. Normally this would be a decaying signal that could be truncated after L samples. The error signal in the current frame at this initial stage of the iterative process would be d0[n] = x[n] − x̂0[n], and the perceptually weighted error would be d′0[n] = d0[n] ∗ w[n]. Now at the first iteration stage, assume that we have determined which member fγ1[n] of the collection of input components will reduce the weighted mean-squared error the most, and that we have also determined the required gain β1. By superposition, x̂1[n] = x̂0[n] + β1 fγ1[n] ∗ h[n]. The weighted error at the first iteration is d′1[n] = (d0[n] − β1 fγ1[n] ∗ h[n]) ∗ w[n], or, if we define the perceptually weighted vocal tract impulse response as h′[n] = h[n] ∗ w[n] and invoke the distributive property of convolution, d′1[n] = d′0[n] − β1 fγ1[n] ∗ h′[n].

15 Several excitation analysis frames are usually included in one linear predictive analysis frame.
Generalizing to the kth iteration, the corresponding equation takes the form

    d′k[n] = d′k−1[n] − βk yk[n],    (7.17)

where yk[n] = fγk[n] ∗ h′[n] is the output of the perceptually weighted vocal tract impulse response due to the input component fγk[n], and d′k−1[n] is the weighted difference signal that remains after k − 1 steps of the process. The mean-squared error at stage k of the iteration is defined as

    Ek = Σ_{n=0}^{L−1} (d′k[n])² = Σ_{n=0}^{L−1} (d′k−1[n] − βk yk[n])².    (7.18)

Assuming that γk is known, the value of βk that minimizes Ek in (7.18) is

    βk = Σ_{n=0}^{L−1} d′k−1[n] yk[n] / Σ_{n=0}^{L−1} (yk[n])²,    (7.19)

and the corresponding minimum mean-squared error is

    Ek(min) = Σ_{n=0}^{L−1} (d′k−1[n])² − βk² Σ_{n=0}^{L−1} (yk[n])².    (7.20)

While we have assumed in the above discussion that fγk[n] and βk are known, we have not discussed how they can be found. Equation (7.20) suggests that the mean-squared error can be minimized by maximizing

    βk² Σ_{n=0}^{L−1} (yk[n])² = (Σ_{n=0}^{L−1} d′k−1[n] yk[n])² / Σ_{n=0}^{L−1} (yk[n])²,    (7.21)
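One stage of the greedy search in (7.17)–(7.21) might look like the sketch below. The candidate lists stand for the filtered components yk[n] = fγk[n] ∗ h′[n], and the toy data are chosen so the answer is easy to verify by hand.

```python
def select_component(d_prev, candidates):
    """One greedy stage: score each filtered candidate y by
    (sum d*y)^2 / sum y^2 (Eq. 7.21), take the best, compute its gain by
    (7.19), and update the weighted residual by (7.17)."""
    best = None
    for idx, y in enumerate(candidates):
        num = sum(dv * yv for dv, yv in zip(d_prev, y))
        den = sum(yv * yv for yv in y)
        if den == 0.0:
            continue
        score = num * num / den
        if best is None or score > best[0]:
            best = (score, idx, num / den)
    _, gamma, beta = best
    residual = [dv - beta * yv for dv, yv in zip(d_prev, candidates[gamma])]
    return gamma, beta, residual

# Toy check: the residual is an exact scaled copy of candidate 2, so one
# stage should find it and drive the weighted error to zero.
cands = [[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
g, b, r = select_component([0.0, 0.0, 2.0, -2.0], cands)
```

Note that the residual energy after the update equals the starting energy minus βk² Σ(yk[n])², exactly as (7.20) predicts.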
which is essentially the normalized cross-correlation between the new input component and the weighted residual error d′k−1[n]. An exhaustive search through the collection of input components will determine which component fγk[n] maximizes the quantity in (7.21), and thus reduces the mean-squared error the most. Once this component is found, the corresponding βk can be found from (7.19), and the new error sequence is computed from (7.17). This process is continued by finding the next component, subtracting it from the previously computed residual error, and so forth. After N iterations of this process, the complete set of component input sequences and corresponding coefficients will have been determined.16 If desired, N need not be fixed in advance; the error reduction process can be stopped when Ek(min) falls below a prescribed threshold.

16 Alternatively, the set of input sequences so determined can be assumed and a new set of βs can be found jointly by solving a set of N linear equations.

7.3.4.4 Multi-Pulse Excitation Linear Prediction (MPLP)

The procedure described in Section 7.3.4.3 is quite general, since the only constraint was that the collection of component input sequences be finite. The first analysis-by-synthesis coder was called multipulse linear predictive coding [4]. In this system, the component input sequences are simply isolated unit impulse sequences, i.e., fγ[n] = δ[n − γ], where γ is an integer such that 0 ≤ γ ≤ L − 1. In this case, the components yk[n] are all just shifted versions of h′[n], so an exhaustive search at each stage can be achieved by computing just one cross-correlation function and locating the impulse at the maximum of the cross-correlation. The excitation sequence derived by the process described in Section 7.3.4.3 is therefore of the form

    d̂[n] = Σ_{k=1}^{N} βk δ[n − γk],    (7.22)

where N is usually on the order of 4–5 impulses in a 5 ms (40 samples for fs = 8 kHz) excitation analysis frame.17

17 It is advantageous to include two or more excitation analysis frames in one linear predictive analysis frame, which normally is of 20 ms (160 samples) duration.
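Because every candidate output is a shifted copy of h′[n], the multipulse search reduces to a single cross-correlation pass over the frame. A sketch, with an artificial residual built from a shifted, scaled h′[n], follows; the exponential impulse response is purely illustrative.

```python
def best_impulse(d, h, L):
    """Multipulse search step: score every impulse position gamma by the
    quantity of Eq. (7.21), with y[n] = h'[n - gamma] truncated at the frame
    edge, and return the winning position and its gain beta (Eq. 7.19)."""
    best = None
    for gamma in range(L):
        num = sum(d[n] * h[n - gamma] for n in range(gamma, L))
        den = sum(h[n - gamma] ** 2 for n in range(gamma, L))
        if den == 0.0:
            continue
        score = num * num / den
        if best is None or score > best[0]:
            best = (score, gamma, num / den)
    _, gamma, beta = best
    return gamma, beta

# Artificial residual: h' delayed by 7 samples and scaled by 3.
L = 40
h = [0.8 ** n for n in range(L)]     # illustrative weighted impulse response
d = [3.0 * h[n - 7] if n >= 7 else 0.0 for n in range(L)]
gamma, beta = best_impulse(d, h, L)
```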
In a speech coder based on multipulse analysis-by-synthesis, the speech signal is represented by quantized versions of the prediction coefficients, the pulse locations, and the pulse amplitudes. The predictor coefficients can be coded as discussed for ADPCM, by encoding any of the alternative representations with either scalar or vector quantization. The pulse locations and the gain constants must also be encoded, usually with some sort of differential coding scheme. At bit rates on the order of 10–16 kbps, multipulse coding can approach toll quality; below about 10 kbps, however, quality degrades rapidly due to the many parameters that must be coded [13].

Regular-pulse excitation (RPE) is a special case of multipulse excitation in which it is easier to encode the impulse locations. Specifically, after determining the location γ1 of the first impulse, all other impulses are located at integer multiples of a fixed spacing on either side of γ1. Because the impulse locations are fixed after the first iteration, only γ1, the integer multiples, and the pulse amplitudes must be encoded, and RPE requires significantly less computation than full-blown multipulse excitation analysis. Using a preset spacing of 4 and an excitation analysis window of 40 samples yields high quality at about 10 kbps [67]. One version of RPE is the basis for the 13 kbps digital coder in the GSM mobile communications system.

7.3.4.5 Code-Excited Linear Prediction (CELP)

Since the major portion of the bit rate of an analysis-by-synthesis coder lies in the excitation signal, it is not surprising that a great deal of effort has gone into schemes for finding excitation signals that are easier to encode than multipulse excitation, yet maintain high quality of reproduction of the speech signal.
The next major innovation in the history of analysis-by-synthesis systems was called code-excited linear predictive coding, or CELP [112].18 In this scheme, the excitation signal components are Gaussian random sequences stored in a "codebook."19 If the codebook contains 2^M sequences, then a given sequence can be specified by an M-bit number, and if N sequences are used to form the excitation, then a total of M · N bits will be required to code the sequences. Additional bits are still required to code the βk's. Within the codebook, the sequence lengths are typically L = 40–60 samples (5–7.5 ms at an 8 kHz sampling rate), and typical codebook sizes are 256, 512, or 1024. With this method, the decoder must also have a copy of the analysis codebook so that the excitation can be regenerated at the decoder.

18 Earlier work by Stewart [123] used a residual codebook (populated by the Lloyd algorithm [71]) with a low-complexity trellis search.

19 While it may seem counterintuitive that the excitation could be comprised of white noise sequences, remember that, in fact, the object of linear prediction is to produce just such a signal.

The main disadvantage of CELP is the high computational cost of exhaustively searching the codebook at each stage of the error minimization. This is because each codebook sequence must be filtered with the perceptually weighted impulse response before computing the cross-correlation with the residual error sequence. Efficient searching schemes and structured codebooks have eased the computational burden, and modern DSP hardware can easily implement the computations in real time. Due to the block processing structure, CELP coders can introduce more than 40 ms of delay into communication systems.

The basic CELP framework has been applied in the development of numerous speech coders that operate with bit rates in the range from 4800 to 16,000 bps. The Federal Standard 1016 (FS1016) coder was adopted by the Department of Defense for use in secure voice transmission at 4800 bps. Another system, called VSELP (vector sum excited linear prediction), was adopted in 1989 for North American cellular systems, and the GSM half-rate coder operating at 5600 bps is based on the IS-54 VSELP coder. The ITU G.728 low-delay CELP coder uses very short excitation analysis frames and backward linear prediction to achieve high quality at a bit rate of 16,000 bps with delays of less than 2 ms. Still another standardized CELP system is the ITU G.729 conjugate-structure algebraic CELP (CS-ACELP) coder, which uses multiple codebooks to achieve a bit rate of 8000 bps; it is used in some mobile communication systems.
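The cost structure of the CELP search is visible in a direct sketch: each codeword must be filtered with h′[n] before the correlation score of (7.21) can be evaluated. The 64-entry Gaussian codebook and the exponential impulse response below are illustrative, far smaller than practical codebooks.

```python
import random

def filt(f, h, L):
    """Truncated convolution of codeword f with the weighted impulse response h'."""
    return [sum(f[m] * h[n - m] for m in range(n + 1)) for n in range(L)]

def celp_search(d, codebook, h):
    """Exhaustive CELP search: filter every codeword with h' (the expensive
    part), then apply the scoring of Eq. (7.21) and the gain of Eq. (7.19)."""
    best = None
    for idx, f in enumerate(codebook):
        y = filt(f, h, len(d))
        num = sum(dv * yv for dv, yv in zip(d, y))
        den = sum(yv * yv for yv in y)
        if den == 0.0:
            continue
        score = num * num / den
        if best is None or score > best[0]:
            best = (score, idx, num / den)
    return best[1], best[2]

random.seed(3)
L = 40
h = [0.7 ** n for n in range(L)]                                   # illustrative h'
codebook = [[random.gauss(0.0, 1.0) for _ in range(L)] for _ in range(64)]  # 2^6 entries
target = filt(codebook[17], h, L)       # a residual that really is codeword 17
idx, beta = celp_search(target, codebook, h)
```

By the Cauchy–Schwarz inequality, the true codeword scores strictly highest here, so the search recovers index 17 with unit gain; the per-codeword filtering is exactly what structured codebooks are designed to avoid.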
7.3.4.6 Long-Delay Predictors in Analysis-by-Synthesis Coders

Long-delay predictors have an interesting interpretation in analysis-by-synthesis coding. If we maintain a memory of one or more frames of previous values of the excitation signal, we can add a term β0 d̂[n − γ0] to (7.15). The first step of the excitation computation would then be to compute γ0 using (7.21) and β0 using (7.19). For each value of γ0 to be tested, this requires that the weighted vocal tract impulse response be convolved with L-sample segments of the past excitation starting at sample γ0. This can be done recursively to save computation. Then the component β0 d̂[n − γ0] can be subtracted from the initial error to start the iteration process in either the multipulse or the CELP framework. The long segment of the past excitation acts as a sort of codebook in which the L-sample codebook sequences overlap by L − 1 samples. This incorporation of components of the past history of the excitation has been referred to as "self-excitation" [104] or as the use of an "adaptive codebook" [19]. The additional bit rate for coding β0 and γ0 is often well worthwhile, and long-delay predictors are used in many of the standardized coders mentioned above.

7.4 Open-Loop Coders

ADPCM coding has not been useful below about 9600 bps, multipulse coding has similar limitations, and CELP coding has not been used at bit rates below about 4800 bits/s. While these closed-loop systems have many attractive features, it has not been possible to generate excitation signals that can be coded at low bit rates and also produce high-quality synthetic speech. For bit rates below 4800 bps, engineers have turned to the vocoder principles that were established decades ago. We call these systems open-loop systems because they do not determine the excitation by a feedback process.

7.4.1 The Two-State Excitation Model

Figure 7.10 shows a source/system model for speech that is closely related to physical models for speech production.
Fig. 7.10 Two-state-excitation model for speech synthesis.

As we have discussed in the previous chapters, the excitation model can be very simple. Unvoiced sounds are produced by exciting the system with white noise, and voiced sounds are produced by a periodic impulse train excitation, where the spacing between impulses is the pitch period, P0. The V/UV (voiced/unvoiced excitation) switch produces the alternating voiced and unvoiced segments of speech, and the gain parameter G controls the level of the filter output. The slowly-time-varying linear system models the combined effects of vocal tract transmission, the lowpass frequency shaping of the glottal pulse (in the case of voiced speech), and radiation at the lips. When values for the V/UV decision, P0, G, and the parameters of the linear system are supplied at periodic intervals (frames), the model becomes a speech synthesizer, as discussed in Chapter 8. When the parameters of the model are estimated directly from a speech signal, the combination of estimator and synthesizer becomes a vocoder or, as we prefer, an open-loop analysis/synthesis speech coder. We shall refer to this as the two-state excitation model.

7.4.1.1 Pitch, Gain, and V/UV Detection

The fundamental frequency of voiced speech can range from well below 100 Hz for low-pitched male speakers to over 250 Hz for the high-pitched voices of women and children. The fundamental frequency varies slowly, more or less at the same rate as the vocal tract motions. It is common to estimate the fundamental frequency F0, or equivalently the pitch period P0 = 1/F0, at a frame rate of about 50–100 times/s.
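A minimal two-state excitation generator, assuming per-frame (V/UV, P0, G) triples: gain-scaled impulses every P0 samples in voiced frames, gain-scaled white noise in unvoiced ones. The frame length, the convention of carrying impulse phase across voiced frame boundaries and resetting it after unvoiced frames, and all parameter values are assumptions for illustration.

```python
import random

random.seed(5)

def two_state_excitation(frames, frame_len=160):
    """Build e[n] from per-frame (voiced, P0, G) triples.  Voiced frames
    emit impulses of height G spaced P0 samples apart (phase carried across
    frame boundaries); unvoiced frames emit G-scaled white Gaussian noise."""
    e, next_impulse = [], 0
    for voiced, p0, gain in frames:
        start = len(e)
        for n in range(start, start + frame_len):
            if voiced:
                if n >= next_impulse:
                    e.append(gain)
                    next_impulse = n + p0
                else:
                    e.append(0.0)
            else:
                e.append(gain * random.gauss(0.0, 1.0))
        if not voiced:
            next_impulse = len(e)   # assumed convention: restart pitch phase
    return e

# 8 kHz example: P0 = 80 samples corresponds to F0 = 100 Hz.
frames = [(True, 80, 1.0), (True, 80, 1.0), (False, 0, 0.1)]
e = two_state_excitation(frames)
```

Driving the slowly-time-varying vocal tract filter with e[n] completes the synthesizer of Figure 7.10.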
It is common to estimate the fundamental frequency F0 or, equivalently, the pitch period P0 = 1/F0, at a frame rate of about 50–100 times/s. To do this, short segments of speech are analyzed to detect periodicity (signaling voiced speech) or aperiodicity (signaling unvoiced speech) and, in the case of voiced speech, the pitch period. One of the simplest, yet most effective, approaches to pitch detection operates directly on the time waveform, by locating corresponding peaks and valleys in the waveform and measuring the times between the peaks [43]. The STACF is also useful for this purpose, as illustrated in Figure 4.5, which shows autocorrelation functions for both voiced speech, evincing a peak at the pitch period, and unvoiced speech, which shows no evidence of periodicity. In pitch detection applications of the short-time autocorrelation, it is common to preprocess the speech by a spectrum flattening operation such as center clipping [96] or inverse filtering [77]. This preprocessing tends to enhance the peak at the pitch period for voiced speech while suppressing the local correlation due to formant resonances. In this case, a strong peak in the expected pitch period range signals voiced speech, and the location of the peak is the pitch period; the lack of a peak in the expected range signals unvoiced speech [83].

Another approach to pitch detection is suggested by Figure 5.8, which shows the cepstra of segments of voiced and unvoiced speech. Figure 5.6 shows a sequence of short-time cepstra which moves from unvoiced to voiced speech in going from the bottom to the top of the figure. The time variation of the pitch period is evident in the upper plots.

The gain parameter G is also found by analysis of short segments of speech. It should be chosen so that the short-time energy of the synthetic output matches the short-time energy of the input speech signal. For this purpose, the autocorrelation function value φ̂n[0] at lag 0, or the cepstrum value ĉn[0] at quefrency 0, can be used to determine the energy of the segment of the input signal.

For digital coding applications, the pitch period, V/UV decision, and gain must be quantized. Typical values are 7 bits for the pitch period (P0 = 0 signals UV) and 5 bits for G. For a frame rate of 50 frames/s, this totals 600 bps, which is well below the bit rate used to encode the excitation signal in closed-loop coders such as ADPCM or CELP. Since the vocal tract filter can be coded as in ADPCM or CELP, much lower total bit rates are common in open-loop systems.
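The center-clipping-plus-autocorrelation pitch detector described above might be sketched as follows; the clipping fraction and the voicing threshold are illustrative values, not ones prescribed by the text.

```python
import math

def center_clip(x, fraction=0.3):
    """Center clipping: remove low-level samples to flatten the
    spectrum and suppress formant-induced local correlation."""
    c = fraction * max(abs(v) for v in x)
    return [v - c if v > c else (v + c if v < -c else 0.0) for v in x]

def autocorr_pitch(x, fs, fmin=60.0, fmax=400.0, threshold=0.3):
    """Return (is_voiced, pitch_period_in_samples) from the normalized
    short-time autocorrelation of a center-clipped frame."""
    y = center_clip(x)
    n = len(y)
    r0 = sum(v * v for v in y) or 1.0
    lag_min, lag_max = int(fs / fmax), int(fs / fmin)
    best_lag, best_val = 0, 0.0
    for lag in range(lag_min, min(lag_max, n - 1) + 1):
        r = sum(y[i] * y[i + lag] for i in range(n - lag)) / r0
        if r > best_val:
            best_lag, best_val = lag, r
    # A strong peak in the expected range signals voiced speech;
    # its location is the pitch period estimate.
    return (True, best_lag) if best_val > threshold else (False, 0)

# A 100 Hz sinusoid sampled at 8 kHz has a period of exactly 80 samples.
frame = [math.sin(2 * math.pi * 100 * i / 8000) for i in range(800)]
voiced, period = autocorr_pitch(frame, 8000)
```

On an aperiodic (noise-like) frame the normalized peak stays below the threshold, and the detector returns the unvoiced state instead.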
This comes at a large cost in quality of the synthetic speech output, however.

Vocal Tract System Estimation

The vocal tract system in the synthesizer of Figure 7.10 can take many forms. The primary methods that have been used have been homomorphic filtering and linear predictive analysis, as discussed in Chapters 5 and 6, respectively.

The Homomorphic Vocoder

Homomorphic filtering can be used to extract a sequence of impulse responses from the sequence of cepstra that result from short-time cepstrum analysis. Thus, one cepstrum computation can yield both an estimate of pitch and the vocal tract impulse response. The impulse response, reconstructed from the quantized cepstrum at the synthesizer, is simply convolved with the excitation created from the quantized pitch, voicing, and gain information, i.e., s[n] = Ge[n] ∗ h[n].20 In the original homomorphic vocoder, the impulse response was digitally coded by quantizing each cepstrum value individually (scalar quantization) [86]. In a more recent application of homomorphic analysis in an analysis-by-synthesis framework [21], the cepstrum values were coded using vector quantization, and the excitation was derived by analysis-by-synthesis as described in Section 7.3.

In still another approach to digital coding, homomorphic filtering was used to remove excitation effects in the short-time spectrum, and then three formant frequencies were estimated from the smoothed spectra. Figure 5.6 shows an example of the formants estimated for voiced speech. These formant frequencies were used to control the resonance frequencies of a synthesizer comprised of a cascade of second-order IIR digital filter sections [111]. Such a speech coder is called a formant vocoder.

LPC Vocoder

Linear predictive analysis can also be used to estimate the vocal tract system for an open-loop coder with two-state

20 Care must be taken at frame boundaries. For example, the impulse response can be changed at the time a new pitch impulse occurs, and the resulting output can overlap into the next frame.
excitation [6]. In this system, the prediction coefficients can be coded in one of the many ways that we have already discussed. Such analysis/synthesis coders are called LPC vocoders. A vocoder of this type was standardized by the Department of Defense as the Federal Standard FS1015. This system is also called LPC10 (or LPC10e) because a 10th-order covariance linear predictive analysis is used to estimate the vocal tract system. The LPC10 system has a bit rate of 2400 bps using a frame size of 22.5 ms, with 12 bits/frame allocated to pitch, voicing, and gain, and the remaining bits allocated to the vocal tract filter coded as PARCOR coefficients.

7.4.2 Residual-Excited Linear Predictive Coding

Earlier in this chapter, we presented Figure 7.4 as motivation for the use of the source/system model in digital speech coding. This figure shows an example of inverse filtering of the speech signal using a prediction error filter, where the prediction error (or residual) is significantly smaller and less lowpass in nature. The speech signal can be reconstructed from the residual by passing it through the vocal tract system H(z) = 1/A(z). The two-state model attempts to construct the excitation signal by direct analysis and measurement of the input speech signal, while ADPCM and analysis-by-synthesis systems derive the excitation to the synthesis filter by a feedback process. None of the methods discussed so far attempts to directly code the prediction error signal in an open-loop manner. Systems that attempt to code the residual signal directly are called residual-excited linear predictive (RELP) coders. Direct coding of the residual faces the same problem faced in ADPCM or CELP: the sampling rate is the same as that of the input, and accurate coding could require several bits per sample. Figure 7.11 shows a block diagram of a RELP coder [128]. In this system, which is quite similar to voice-excited vocoders (VEV) [113, 130], the problem of reducing the bit rate of the residual signal is attacked by reducing its bandwidth to about 800 Hz, lowering the sampling rate, and coding the samples with adaptive quantization. Adaptive delta modulation was used in [128], but APCM could be used if the sampling rate is lowered to 1600 Hz.
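The LPC10 numbers quoted above can be checked with a little arithmetic: at 2400 bps and 22.5 ms per frame there are 54 bits per frame, 12 of which cover pitch, voicing, and gain, leaving the rest for the PARCOR coefficients.

```python
def frame_bits(bit_rate_bps, frame_size_ms):
    """Bits available per analysis frame at a given bit rate."""
    frames_per_second = 1000.0 / frame_size_ms
    return bit_rate_bps / frames_per_second

total = frame_bits(2400, 22.5)               # 54 bits per 22.5 ms frame
excitation_bits = 12                         # pitch, voicing, and gain
vocal_tract_bits = total - excitation_bits   # left for PARCOR coefficients
```

The same helper applied to the 600 bps two-state excitation figure (7 + 5 bits at 50 frames/s) confirms the earlier total as well.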
The 800 Hz band is wide enough to contain several harmonics of the highest pitched voices. The reduced bandwidth residual is restored to full bandwidth, prior to its use as an excitation signal, by a nonlinear spectrum flattening operation, which restores higher harmonics of voiced speech. White noise is also added according to an empirically derived recipe. The principal advantage of this system is that no hard V/UV decision must be made, and no pitch detection is required. In the implementation of [128], the sampling rate of the input was 6.8 kHz and the total bit rate was 9600 bps, with 6800 bps devoted to the residual signal. The quality achieved at this rate was not significantly better than that of the LPC vocoder with a two-state excitation model. While this system did not become widely used, its basic principles can be found in subsequent open-loop coders that have produced much better speech quality at bit rates around 2400 bps.

Fig. 7.11 Residual-excited linear predictive (RELP) coder and decoder.

7.4.3 Mixed Excitation Systems

While two-state excitation allows the bit rate to be quite low, the quality of the synthetic speech output leaves much to be desired. The output of such systems is often described as "buzzy," and in many cases errors in estimating the pitch period or the voicing decision cause the speech to sound unnatural, if not unintelligible. The weaknesses of the two-state model
for excitation spurred interest in a mixed excitation model in which a hard decision between V and UV is not required. Such a model was first proposed by Makhoul et al. [75] and greatly refined by McCree and Barnwell [80]. Figure 7.12 depicts the essential features of the mixed-excitation linear predictive (MELP) coder proposed by McCree and Barnwell [80]. This configuration was developed as the result of careful experimentation, which focused one-by-one on the sources of distortion manifest in the two-state excitation coder, such as buzziness and tonal distortions.

The main feature is that impulse train excitation and noise excitation are added instead of switched. Prior to their addition, they each pass through a multiband spectral shaping filter. The gains in each of five bands are coordinated between the two filters so that the spectrum of e[n] is flat. This mixed excitation helps to model short-time spectral effects such as "devoicing" of certain bands during voiced speech.21 In some situations, a "jitter" parameter ∆P is invoked to better model voicing transitions. Other important features of the MELP system are lumped into the block labeled "Enhancing Filters." This block represents adaptive spectrum enhancement filters used to enhance formant regions,22 and a spectrally flat "pulse dispersion filter" whose purpose is to reduce "peakiness" due to the minimum-phase nature of the linear predictive vocal tract system.

Fig. 7.12 Mixed-excitation linear predictive (MELP) decoder.

21 Devoicing is evident in frame 9 in Figure 5.6.
22 Such filters are also used routinely in CELP coders and are often referred to as post filters.
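A toy sketch of the additive (rather than switched) excitation idea follows. The per-band pulse gain p and complementary noise gain q = sqrt(1 − p²) loosely mirror the gain coordination described above; the bandpass shaping filters and all parameter estimation are omitted, and the two voicing strengths are hypothetical.

```python
import math
import random

def mixed_excitation(n, pitch_period, voicing_strengths, seed=0):
    """Toy MELP-style excitation: in each band, impulse-train and noise
    components are ADDED with gains satisfying p**2 + q**2 = 1, instead
    of a hard switch between a voiced and an unvoiced source."""
    rng = random.Random(seed)
    pulses = [1.0 if i % pitch_period == 0 else 0.0 for i in range(n)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    e = [0.0] * n
    for p in voicing_strengths:          # one strength per (unfiltered) band
        q = math.sqrt(1.0 - p * p)       # complementary noise gain
        for i in range(n):
            e[i] += p * pulses[i] + q * noise[i]
    return e

# Strongly voiced low band, weakly voiced high band (hypothetical values).
e = mixed_excitation(160, 80, [0.9, 0.4])
```

With a strength near 1 a band is pulse-dominated; near 0 it is noise-dominated, which is how "devoicing" of individual bands can be modeled without a global V/UV decision.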
Several new parameters of the excitation must be estimated at analysis time and coded for transmission, but these add only slightly to either the analysis computation or the bit rate [80]. These modifications to the basic two-state excitation LPC vocoder produce marked improvements in the quality of reproduction of the speech signal. The MELP coder is said to produce speech quality at 2400 bps that is comparable to CELP coding at 4800 bps. In fact, its superior performance led to a new Department of Defense standard in 1996, and subsequently to MIL-STD-3005 and NATO STANAG 4591, which operates at 2400, 1200, and 600 bps.

7.5 Frequency-Domain Coders

Although we have discussed a wide variety of digital speech coding systems, we have really only focused on a few of the most important examples that have current relevance. There is much room for variation within the general frameworks that we have identified. There is one more class of coders, however, that should be discussed. These are called frequency-domain coders because they are based on the principle of decomposing the speech signal into individual frequency bands. Figure 7.13 depicts the general nature of this large class of coding systems. On the far left of the diagram is a set (bank) of analysis bandpass filters. Each of these is followed by a downsampler by a factor appropriate for the bandwidth of its bandpass input. On the far right in the diagram is a set of "bandpass interpolators," where each interpolator is composed of an upsampler (i.e., raising the sampling rate by the same factor as the corresponding downsampling box) followed by a bandpass filter similar to the analysis filter. The outputs of the downsamplers are connected to the corresponding inputs of the bandpass interpolators. If the filters are carefully designed, then it is possible for the output x̂[n] to be virtually identical to the input x[n] [24]. Such perfect reconstruction filter bank systems are the basis for a wide-ranging class of speech and audio coders called subband coders [25]. When the outputs of the filter bank are quantized by some sort of quantizer, the output x̂[n] is not equal to the input, but by careful design of the quantizer the output can be perceptually indistinguishable from the input.

Fig. 7.13 Subband coder and decoder for speech and audio.

The goal with such coders is, of course, for the total composite bit rate23 to be as low as possible while maintaining high quality. The quantizers can be any quantization operator that preserves the waveform of the signal. Typically, adaptive PCM is used, but ADPCM could be used in principle. In contrast to the coders that we have already discussed, which incorporated the speech production model into the quantization process, the filter bank structure incorporates important features of the speech perception model. As discussed in Chapter 3, the basilar membrane effectively performs a frequency analysis of the sound impinging on the eardrum, and the coupling between points on the basilar membrane results in the masking effects that we mentioned in Chapter 3. This masking was incorporated in some of the coding systems discussed so far, but only in a rudimentary manner.

To see how subband coders work, suppose first that the individual channels are quantized independently. Because the downsampled channel signals are full band at the lower sampling rate, the quantization error affects all frequencies in that band, but the important point is that no other bands are affected by the errors in a given band.

23 The total bit rate will be the sum of the products of the sampling rates of the downsampled channel signals times the number of bits allocated to each channel.
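Footnote 23's bit-rate accounting is easy to make concrete; the band sampling rates and bit allocations below are hypothetical.

```python
def subband_bit_rate(channel_rates_hz, bits_per_channel):
    """Footnote 23: total rate = sum over channels of
    (downsampled sampling rate) x (bits allocated to that channel)."""
    return sum(r * b for r, b in zip(channel_rates_hz, bits_per_channel))

# Hypothetical 4-band allocation: rates in samples/s, bits per sample.
rate = subband_bit_rate([1000, 1000, 2000, 4000], [5, 4, 3, 2])
```

Giving more bits per sample to perceptually important low-frequency bands, while the bands themselves run at low sampling rates, is exactly what keeps the composite rate down.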
Just as important, a particular band can be quantized according to its perceptual importance. For example, bands with low energy can be encoded with correspondingly low absolute error. The simplest approach is to preassign a fixed number of bits to each channel. In an early nonuniformly spaced five-channel system, Crochiere et al. [25] found that subband coders operating at 16 kbps and 9.6 kbps were preferred by listeners to ADPCM coders operating at 22 and 19 kbps, respectively.

Another approach is to allocate bits among the channels dynamically, according to a perceptual criterion. This typically involves the computation of a fine-grained spectrum analysis using the DFT. From this spectrum it is possible to determine which frequencies will mask other frequencies, and on the basis of this a threshold of audibility is determined. Using this audibility threshold, bits are allocated among the channels in an iterative process so that as much of the quantization error as possible is inaudible within the total bit budget. This is the basis for modern audio coding standards such as the various MPEG standards. Such systems are based on the principles depicted in Figure 7.13. The details of perceptual audio coding are presented in the recent textbook by Spanias [120].

Much more sophisticated quantization schemes are possible in the configuration shown in Figure 7.13 if, as suggested by the dotted boxes, the quantization of the channels is done jointly. Furthermore, if the channel bandwidths are all the same, so that the output samples of the downsamplers can be treated as a vector, vector quantization can be used effectively [23].

7.6 Evaluation of Coders

In this chapter, we have discussed a wide range of options for digital coding of speech signals. These systems rely heavily on the techniques of linear prediction, filter banks, and cepstrum analysis, and also on models for speech production and speech perception.
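The iterative perceptual allocation described above can be caricatured as a greedy loop: give the next bit to the channel whose quantization noise most exceeds its audibility threshold, assuming each added bit lowers that channel's noise by roughly 6 dB. All of the numbers here are hypothetical.

```python
def allocate_bits(noise_db, threshold_db, total_bits):
    """Greedy perceptual bit allocation sketch: repeatedly assign one
    bit to the channel whose noise is farthest above its audibility
    threshold; each bit is assumed to buy about 6 dB of noise reduction."""
    bits = [0] * len(noise_db)
    noise = list(noise_db)
    for _ in range(total_bits):
        margins = [n - t for n, t in zip(noise, threshold_db)]
        worst = margins.index(max(margins))   # most audible channel
        bits[worst] += 1
        noise[worst] -= 6.0
    return bits

# Three channels, equal thresholds, six bits to spend (hypothetical).
bits = allocate_bits([60.0, 50.0, 40.0], [30.0, 30.0, 30.0], 6)
```

The loop naturally starves channels whose noise is already below threshold, which is the sense in which the error is made "as inaudible as possible" for a fixed bit budget.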
All this knowledge and technique is brought to bear on the problem of reducing the bit rate of the speech representation while maintaining high-quality reproduction of the speech signal.

Throughout this chapter we have mentioned bit rate and quality as important practical dimensions of the speech coding problem. In general, increasing the bit rate will improve the quality of reproduction, but lowering the bit rate causes noticeable degradation. It is important to note, however, that for many systems, increasing the bit rate indefinitely does not necessarily continue to improve quality. For example, increasing the bit rate of an open-loop two-state-excitation LPC vocoder above about 2400 bps does not improve quality very much. On the other hand, improved pitch detection and source modeling can improve quality in an LPC vocoder, as witnessed by the success of the MELP system, but this generally comes with an increase in computation and processing delay. Fortunately, there are many possibilities to choose from; most coding schemes have many operations and parameters that can be chosen to trade off among the important factors. In the final analysis, the choice of a speech coder will depend on constraints imposed by the application, such as cost of the coder/decoder, complexity of computation, available transmission capacity, robustness to transmission errors, processing delay, and quality requirements.
8 Text-to-Speech Synthesis Methods

In this chapter, we discuss systems whose goal is to convert ordinary text messages into intelligible and natural sounding synthetic speech so as to transmit information from a machine to a human user. In other words, we will be concerned with computer simulation of the upper part of the speech chain in Figure 1.2. Such systems are often referred to as text-to-speech synthesis (or TTS) systems. The input to a TTS system is text and the output is synthetic speech. The two fundamental processes performed by all TTS systems are text analysis (to determine the abstract underlying linguistic description of the speech) and speech synthesis (to produce the speech sounds corresponding to the text input), and their general structure is illustrated in Figure 8.1.

Fig. 8.1 Block diagram of general TTS system.
8.1 Text Analysis

The text analysis module of Figure 8.1 must determine three things from the input text string, namely:

(1) pronunciation of the text string: the text analysis process must decide on the set of phonemes to be spoken, the degree of stress at various points in speaking, the intonation of the speech, and the duration of each of the sounds in the utterance;
(2) syntactic structure of the sentence to be spoken: the text analysis process must determine where to place pauses, what rate of speaking is most appropriate for the material being spoken, and how much emphasis should be given to individual words and phrases within the final spoken output speech;
(3) semantic focus and ambiguity resolution: the text analysis process must resolve homographs (words that are spelled alike but can be pronounced differently, depending on context), and must also use rules to determine word etymology to decide how best to pronounce names and foreign words and phrases.

Figure 8.2 shows more detail on how text analysis is performed. The input to the analysis is plain English text. The first stage of processing does some basic text processing operations, including detecting the structure of the document containing the text (e.g., email message versus paragraph of text from an encyclopedia article), normalizing the text (so as to determine how to pronounce words like proper names or homographs with multiple pronunciations), and finally performing a linguistic analysis to determine grammatical information about words and phrases within the text. The basic text processing benefits from an online dictionary of word pronunciations, along with rules for determining word etymology. The output of the basic text processing step is tagged text, where the tags denote the linguistic properties of the words of the input text string.
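A minimal sketch of the dictionary-plus-rules pronunciation lookup mentioned above (in the spirit of the affix-stripping search of Figure 8.3, described later in this section); the mini-dictionary, the suffix phone forms, and the letter-to-sound fallback are all hypothetical placeholders.

```python
PRONUNCIATIONS = {           # hypothetical mini-dictionary: word -> phones
    "walk": "W AO K",
    "happy": "HH AE P IY",
}
SUFFIXES = {"ing": "IH NG", "ed": "D", "s": "Z"}   # hypothetical phone forms

def pronounce(word, letter_to_sound=lambda w: "<LTS:" + w + ">"):
    """Whole-word dictionary lookup first; failing that, strip a suffix,
    look up the root, and reattach the stripped affix's phones; as a
    last resort fall back to letter-to-sound rules."""
    word = word.lower()
    if word in PRONUNCIATIONS:
        return PRONUNCIATIONS[word]
    for suffix, suffix_phones in SUFFIXES.items():
        if word.endswith(suffix):
            root = word[:-len(suffix)]
            if root in PRONUNCIATIONS:
                return PRONUNCIATIONS[root] + " " + suffix_phones
    return letter_to_sound(word)   # rules for words outside the dictionary

walking = pronounce("walking")
```

A real system would also strip prefixes, try multiple affix combinations, and base the letter-to-sound rules on the word's etymology.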
Fig. 8.2 Components of the text analysis process.

8.1.1 Document Structure Detection

The document structure detection module seeks to determine the location of all punctuation marks in the text, and to decide their significance with regard to the sentence and paragraph structure of the input text. For example, an end-of-sentence marker is usually a period (.), a question mark (?), or an exclamation point (!). However, this is not always the case, as in the sentence "This car is 72.5 in. long," where there are two periods, neither of which denotes the end of the sentence.

8.1.2 Text Normalization

Text normalization methods handle a range of text problems that occur in real applications of TTS systems, including how to handle abbreviations and acronyms, as in the following sentences:

Example 1: "I live on Bourbon St. in St. Louis"
Example 2: "She worked for DEC in Maynard MA"
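A toy illustration of why document structure detection cannot simply split on periods: a small abbreviation list and a digit check handle cases like "St. Louis" and "72.5 in." The abbreviation list is, of course, hypothetical and far from complete.

```python
import re

ABBREVIATIONS = {"st.", "dr.", "mr.", "mrs.", "in."}   # tiny illustrative list

def sentence_ends(text):
    """Return indices of periods that plausibly end a sentence: a '.'
    not attached to a known abbreviation or to a digit string."""
    ends = []
    for m in re.finditer(r"\S+\.", text):
        token = m.group().lower()
        if token in ABBREVIATIONS:
            continue               # "St." in "St. Louis" is not an end
        if re.search(r"\d\.$", token):
            continue               # "72." inside "72.5" is not an end
        ends.append(m.end() - 1)
    return ends

ends = sentence_ends("I live on Bourbon St. in St. Louis.")
```

Only the final period survives the two checks here; production systems add many more cues (capitalization, following whitespace, known acronym patterns).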
where, in Example 1, the text "St." is pronounced as street or saint, depending on the context, and in Example 2 the acronym DEC can be pronounced either as the word "deck" (the spoken acronym) or as the name of the company, Digital Equipment Corporation, but is virtually never pronounced as the letter sequence "D E C." Other examples of text normalization include number strings like "1920," which can be pronounced as the year "nineteen twenty" or the number "one thousand, nine hundred, and twenty," as well as dates, times, currency, account numbers, etc. Thus the string "$10.50" should be pronounced as "ten dollars and fifty cents" rather than as a sequence of characters. One other important text normalization problem concerns the pronunciation of proper names, especially those from languages other than English.

8.1.3 Linguistic Analysis

The third step in the basic text processing block of Figure 8.2 is a linguistic analysis of the input text, with the goal of determining, for each word in the printed string, the following linguistic properties:

• the part of speech (POS) of the word
• the sense in which each word is used in the current context
• the location of phrases (or phrase groups) within a sentence (or paragraph), i.e., where a pause in speaking might be appropriate
• the presence of anaphora (e.g., the use of a pronoun to refer back to another word unit)
• the word (or words) on which emphasis is to be placed, i.e., for prominence in the sentence
• the style of speaking, e.g., relaxed, emotional, irate, etc.

A conventional parser could be used as the basis of the linguistic analysis of the printed text, but typically a simple, shallow analysis is performed, since most linguistic parsers are very slow.
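A toy normalizer for the "$10.50" example above; the number-name tables cover only 0–99 and are purely illustrative.

```python
def expand_currency(token):
    """Expand a simple "$D.CC" money string into words, e.g. "$10.50"
    becomes "ten dollars and fifty cents". Toy subset of number names."""
    ones = ["zero", "one", "two", "three", "four", "five", "six",
            "seven", "eight", "nine", "ten"]
    tens = {20: "twenty", 30: "thirty", 40: "forty", 50: "fifty",
            60: "sixty", 70: "seventy", 80: "eighty", 90: "ninety"}

    def number_name(n):
        if n <= 10:
            return ones[n]
        if n in tens:
            return tens[n]
        if 20 < n < 100 and n % 10:
            return tens[n - n % 10] + "-" + ones[n % 10]
        raise ValueError("toy converter handles 0-99 only")

    dollars, cents = token.lstrip("$").split(".")
    return (number_name(int(dollars)) + " dollars and "
            + number_name(int(cents)) + " cents")

spoken = expand_currency("$10.50")
```

Deciding that "1920" is a year ("nineteen twenty") rather than a count ("one thousand, nine hundred, and twenty") requires the surrounding context and is the genuinely hard part of text normalization.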
8.1.4 Phonetic Analysis

Ultimately, the tagged text obtained from the basic text processing block of a TTS system has to be converted to a sequence of tagged phones, which describe both the sounds to be produced as well as the manner of speaking, both locally (emphasis) and globally (speaking style). The phonetic analysis block of Figure 8.2 provides the processing that enables the TTS system to perform this conversion, namely conversion from the text to (marked) speech sounds, with the help of a pronunciation dictionary along with a set of letter-to-sound rules for words outside the dictionary. The way in which these steps are performed is as follows.

8.1.5 Homograph Disambiguation

The homograph disambiguation operation must resolve the correct pronunciation of each word in the input string that has more than one pronunciation. The basis for this is the context in which the word occurs. One simple example of homograph disambiguation is seen in the phrase "an absent boy" versus the sentence "do you choose to absent yourself." In the first phrase the word "absent" is an adjective and the accent is on the first syllable; in the second phrase the word "absent" is a verb and the accent is on the second syllable.

8.1.6 Letter-to-Sound (LTS) Conversion

The second step of phonetic analysis is the process of grapheme-to-phoneme conversion. Although there are a variety of ways of performing this analysis, perhaps the most straightforward method is to rely on a standard pronunciation dictionary. Figure 8.3 shows the processing for a simple dictionary search for word pronunciation. Each individual word in the text string is searched independently. First a "whole word" search is initiated to see if the printed word exists, in its entirety, in the word dictionary. If so, the conversion to sounds is straightforward and the dictionary search begins on the next word. If not, as is the case most often, the dictionary search attempts to find affixes (both prefixes and suffixes) and strips them from the word, attempting to find the "root form" of the word, and then does another "whole word" search. If the root form is not present in the dictionary, a set of letter-to-sound rules is used to determine the best pronunciation (usually based on the etymology of the word) of the root form of the word, again followed by the reattachment of the stripped affixes (including the case of no stripped affixes).

Fig. 8.3 Block diagram of dictionary search for proper word pronunciation.

8.1.7 Prosodic Analysis

The last step in the text analysis system of Figure 8.2 is prosodic analysis, which provides the speech synthesizer with the complete set of synthesis controls, namely the sequence of speech sounds, their durations, and an associated pitch contour (variation of fundamental frequency with time). The determination of the sequence of speech sounds is mainly performed by the phonetic analysis step as outlined above. The assignment of duration and pitch contours is done by a set of pitch
and duration rules, along with a set of rules for assigning stress and determining where appropriate pauses should be inserted, so that the local and global speaking rates appear to be natural.

8.2 Evolution of Speech Synthesis Systems

A summary of the progress in speech synthesis, over the period 1962–1997, is given in Figure 8.4. This figure shows that there have been three generations of speech synthesis systems. During the first generation (between 1962 and 1977), formant synthesis of phonemes using a terminal analog synthesizer was the dominant technology, using rules which related the phonetic decomposition of the sentence to formant frequency contours. The synthesis suffered from poor intelligibility and poor naturalness.

The second generation of speech synthesis methods (from 1977 to 1992) was based primarily on an LPC representation of subword units such as diphones (half phones). By carefully modeling and representing diphone units via LPC parameters, it was shown that synthetic speech of good intelligibility could be reliably obtained from text input by concatenating the appropriate diphone units.

Fig. 8.4 Time line of progress in speech synthesis and TTS systems.

Although the intelligibility improved dramatically
over first generation formant synthesis, the naturalness of the synthetic speech remained low, due to the inability of single diphone units to represent all possible combinations of sound using that diphone unit.

The third generation of speech synthesis technology was the period from 1992 to the present, in which the method of "unit selection synthesis" was introduced and perfected, primarily by Sagisaka at ATR Labs in Kyoto [108]. The resulting synthetic speech from this third generation technology had good intelligibility and naturalness that approached that of human-generated speech.

A detailed survey of progress in text-to-speech conversion up to 1987 is given in the review paper by Klatt [65]. The synthesis examples that accompanied that paper are available for listening at http://www.cs.indiana.edu/rhythmsp/ASA/Contents.html.

8.2.1 Early Speech Synthesis Approaches

Once the abstract underlying linguistic description of the text input has been determined via the steps of Figure 8.2, the remaining (major) task of TTS systems is to synthesize a speech waveform whose intelligibility is very high (to make the speech useful as a means of communication between a machine and a human) and whose naturalness is as close to real speech as possible. Both tasks, namely attaining high intelligibility along with reasonable naturalness, are difficult to achieve and depend critically on three issues in the processing of the speech synthesizer "backend" of Figure 8.1, namely:

(1) choice of synthesis units: including whole words, phones, diphones, dyads, or syllables;
(2) choice of synthesis parameters: including LPC features, formants, articulatory parameters, sinusoidal parameters, waveform templates, etc.;
(3) method of computation: including rule-based systems or systems which rely on the concatenation of stored speech units.

We begin our discussion of speech synthesis approaches with a review of early systems, and then discuss unit selection methods of speech synthesis later in this chapter.
8.2.2 Word Concatenation Synthesis

Perhaps the simplest approach to creating a speech utterance corresponding to a given text string is to literally splice together prerecorded words corresponding to the desired utterance. For greatest simplicity, the words can be stored as sampled waveforms and simply concatenated in the correct sequence. This approach generally produces intelligible, but unnatural sounding, speech, since it does not take into account the "coarticulation" effects of producing phonemes in continuous speech, and it does not provide either for the adjustment of phoneme durations or for the imposition of a desired pitch variation across the utterance. Words spoken in continuous speech sentences are generally much shorter in duration than when spoken in isolation (often up to 50% shorter), as illustrated in Figure 8.5. This figure shows wideband spectrograms for the sentence "This shirt is red" spoken as a sequence of isolated words (with short, distinct pauses between words), as shown at the top of Figure 8.5, and as a continuous utterance, as shown at the bottom of Figure 8.5.

Fig. 8.5 Wideband spectrograms of a sentence spoken as a sequence of isolated words (top panel) and as a continuous speech utterance (bottom panel).

It can be seen that, even for this trivial
example, the duration of the continuous sentence is on the order of half that of the isolated word version, and, further, the formant tracks of the continuous sentence do not look like a set of uniformly compressed formant tracks from the individual words. Furthermore, the boundaries between words in the upper plot are sharply defined, while they are merged in the lower plot.

Although such a synthesis system would appear to be an attractive alternative for general purpose synthesis of speech, in reality this type of synthesis is not a practical approach.1 There are many reasons for this, but a major problem is that there are far too many words to store in a word catalog for word concatenation synthesis to be practical except in highly restricted situations. For example, there are about 1.7 million distinct surnames in the United States, and each of them would have to be spoken and stored for a general word concatenation synthesis method. A second, equally important, limitation is that word-length segments of speech are simply the wrong subunits: as we will see, shorter units such as phonemes or diphones are more suitable for synthesis.

The word concatenation approach can be made more sophisticated by storing the vocabulary words in a parametric form (formants, LPC parameters, etc.) such as employed in the speech coders discussed in Chapter 7 [102]. This requires that the control parameters for all the words in the task vocabulary (as obtained from a training set of words) be stored as representations of the words. A special set of word concatenation rules is then used to create the control signals for the synthesizer. The rationale for this is that the parametric representations, being more closely related to the model for speech production, can be manipulated so as to blend the words together, shorten them, and impose a desired pitch variation.

Efforts to overcome the limitations of word concatenation followed two separate paths. One approach was based on controlling the motions of a physical model of the speech articulators based on the sequence of phonemes from the text analysis. This requires sophisticated control

1 It should be obvious that whole word concatenation synthesis (from stored waveforms) is also impractical for general purpose synthesis of speech.
rules that are mostly derived empirically. An alternative approach eschews the articulatory model and proceeds directly to the computation of the control signals for a source/system model (e.g., pitch period, formant parameters, LPC parameters, etc.). Again, the rules (algorithm) for computing the control parameters are mostly derived by empirical means.

8.2.3 Articulatory Methods of Synthesis

Articulatory models of human speech are based upon the application of acoustic theory to physical models such as the one depicted in highly stylized form in Figure 2.1 [22]. Such models were thought to be inherently more natural than vocal tract analog models since:

• we could impose known and fairly well understood physical constraints on articulator movements to create realistic motions of the tongue, teeth, jaw, velum, etc., thereby potentially making the speech more natural sounding;
• we could use X-ray data (MRI data today) to study the motion of the articulators in the production of individual speech sounds, thereby increasing our understanding of the dynamics of speech production;
• we could model smooth articulatory parameter motions between sounds, either via direct methods (namely, solving the wave equation) or indirectly by converting articulatory shapes to formants or LPC parameters;
• we could highly constrain the motions of the articulatory parameters so that only natural motions would occur.

From the vocal tract shapes and sources produced by the articulatory model, control parameters (e.g., formants and pitch) can be derived by applying the acoustic theory of speech production, and then used to control a synthesizer such as those used for speech coding. What we have learned about articulatory modeling of speech is that it requires a highly accurate model of the vocal cords and of the vocal tract for the resulting synthetic speech quality to be considered acceptable. It further requires rules for handling the dynamics of the
articulator motion in the context of the sounds being produced. So far we have been unable to learn all such rules, and thus articulatory speech synthesis methods have not been found to be practical for synthesizing speech of acceptable quality.

8.2.4 Terminal Analog Synthesis of Speech

The alternative to articulatory synthesis is called terminal analog speech synthesis. This synthesis process has been called terminal analog synthesis because it is based on a model (analog) of the human vocal tract production of speech that seeks to produce a signal at its output terminals that is equivalent to the signal produced by a human talker.² The basis for terminal analog synthesis of speech is the source/system model of speech production that we have used many times in this text, namely an excitation source, e[n], and a transfer function of the human vocal tract in the form of a rational system function, i.e.,

    H(z) = S(z)/E(z) = B(z)/A(z) = (b_0 + Σ_{k=1}^{q} b_k z^{−k}) / (1 − Σ_{k=1}^{p} a_k z^{−k}),        (8.1)

where S(z) is the z-transform of the output speech signal, s[n], E(z) is the z-transform of the vocal tract excitation signal, e[n], and {b_k} = {b_0, b_1, b_2, ..., b_q} and {a_k} = {a_1, a_2, ..., a_p} are the (time-varying) coefficients of the vocal tract filter. (See Figures 2.2 and 4.) In this approach, each sound of the language (phoneme) is characterized by a source excitation function and an ideal vocal tract model. Speech is produced by varying (in time) the excitation and the vocal tract model control parameters at a rate commensurate with the sounds being produced.

²In the designation "terminal analog synthesis," the terms analog and terminal result from the historical context of early speech synthesis studies. "Terminal" originally implied the "output terminals" of an electronic analog (not digital) circuit or system that was an analog of the human speech production system. This can be confusing since today, "analog" implies "not digital" as well as an analogous thing.
The vocal tract representation of (8. However. and one ﬁxed resonance F4 . Fig. AV .6 Speech synthesizer based on a cascade/serial (formant) synthesis model. The voiced speech (upper) branch includes an impulse generator (controlled by a timevarying pitch period. Finally a ﬁxed spectral compensation network can be used to model the combined eﬀects of glottal pulse shape and radiation characteristics from the lips and mouth. For unvoiced speech a simpler model (with one complex pole and one complex zero).1) can be implemented as a speech synthesis system using a direct form implementation. both the excitation signal properties (P0 and V /U V ) and the ﬁlter coeﬃcients of (8. and an allpole discretetime system that consists of a cascade of 3 timevarying resonances (the ﬁrst three formants. is shown in Figure 8.148 TexttoSpeech Synthesis Methods For most practical speech synthesis systems. based on the above discussion. P0 ). it has been shown that it is preferable to factor the numerator and denominator polynomials into either a series of cascade (serial) resonances. It has also been shown that an allpole model (B(z) = constant) is most appropriate for voiced (nonnasal) speech.1) change periodically so as to synthesize diﬀerent phonemes. or into a parallel combination of resonances. A complete serial terminal analog speech synthesis model. . F2 . 8. based on two real poles in the zplane. F1 . implemented via a parallel branch is adequate. 111].6 [95. a timevarying voiced signal gain. F3 ).
The unvoiced speech (lower) branch includes a white noise generator, a time-varying unvoiced signal gain, AN, and a resonance/antiresonance system consisting of a time-varying pole (FP) and a time-varying zero (FZ). The voiced and unvoiced components are added and processed by the fixed spectral compensation network to provide the final synthetic speech output.

The resulting quality of the synthetic speech produced using a terminal analog synthesizer of the type shown in Figure 8.6 is highly variable, with explicit model shortcomings due to the following:

• voiced fricatives are not handled properly since their mixed excitation is not part of the model of Figure 8.6
• nasal sounds are not handled properly since nasal zeros are not included in the model
• stop consonants are not handled properly since there is no precise timing and control of the complex excitation signal
• use of a fixed pitch pulse shape, independent of the pitch period, is inadequate and produces buzzy sounding voiced speech
• the spectral compensation model is inaccurate and does not work well for unvoiced sounds.

Many of the shortcomings of the model of Figure 8.6 are alleviated by a more complex model proposed by Klatt [64]. However, even with a more sophisticated synthesis model, it remains a very challenging task to compute the synthesis parameters. Nevertheless, Klatt's Klattalk system achieved adequate quality by 1983 to justify commercialization by the Digital Equipment Corporation as the DECtalk system. Some DECtalk systems are still in operation today as legacy systems, although the unit selection methods to be discussed in Section 8.3 now provide superior quality in current applications.

8.3 Unit Selection Methods

The key idea of a concatenative TTS system, using unit selection methods, is to use synthesis segments that are sections of prerecorded
natural speech [31, 50, 108]. The word concatenation method discussed in Section 8.2.2 is perhaps the simplest embodiment of this idea; however, as we discussed, shorter segments are required to achieve better quality synthesis. The basic idea is that the more segments recorded, annotated, and saved in the database, the better the potential quality of the resulting speech synthesis. Ultimately, if an infinite number of segments were recorded and saved, the resulting synthetic speech would sound natural for virtually all possible synthesis tasks. Concatenative speech synthesis systems based on unit selection methods are conventionally known as "data driven" approaches, since their performance tends to improve as more data is used for training the system and selecting appropriate units. In order to design and build a unit selection system based on using recorded speech segments, several issues have to be resolved, including the following:

(1) What speech units should be used as the basic synthesis building blocks?
(2) How are the synthesis units selected (extracted) from natural speech utterances?
(3) How are the units labeled for retrieval from a large database of units?
(4) What signal representation should be used to represent the units for storage and reproduction purposes?
(5) What signal processing methods can be used to spectrally smooth the units (at unit junctures) and for prosody modification (pitch, duration, amplitude)?

We now attempt to answer each of these questions.

8.3.1 Choice of Concatenation Units
The units for unit selection can (in theory) be as large as words and as small as phoneme units. Words are prohibitive since there are essentially an inﬁnite number of words in English. Subword units include syllables (about 10,000 in English), phonemes (about 45 in English, but they are highly context dependent), demisyllables (about 2500
in English), and diphones (about 1500–2500 in English). The ideal synthesis unit is context independent and easily concatenates with other (appropriate) subword units [15]. Based on this criterion, the most reasonable choice for unit selection synthesis is the set of diphone units.3 Before any synthesis can be done, it is necessary to prepare an inventory of units (diphones). This requires signiﬁcant eﬀort if done manually. In any case, a large corpus of speech must be obtained from which to extract the diphone units as waveform snippets. These units are then represented in some compressed form for eﬃcient storage. High quality coding such as MPLP or CELP is used to limit compression artifacts. At the ﬁnal synthesis stage, the diphones are decoded into waveforms for ﬁnal merging, duration adjustment and pitch modiﬁcation, all of which take place on the time waveform. 8.3.2 From Text to Diphones
The synthesis procedure from subword units is straightforward, but far from trivial [14]. Following the process of text analysis into phonemes and prosody, the phoneme sequence is first converted to the appropriate sequence of units from the inventory. For example, the phrase "I want" would be converted to diphone units as follows:

Text Input: I want.
Phonemes: /#/ /AY/ /W/ /AA/ /N/ /T/ /#/
Diphones: /#-AY/ /AY-W/ /W-AA/ /AA-N/ /N-T/ /T-#/,

where the symbol # is used to represent silence (at the beginning and end of each sentence or phrase). The second step in online synthesis is to select the most appropriate sequence of diphone units from the stored inventory. Since each diphone unit occurs many times in the stored inventory, the selection
³A diphone is simply the concatenation of two phonemes that are allowed by the language constraints to be sequentially contiguous in natural speech. For example, a diphone based upon the phonemes AY and W would be denoted AY-W, where the - denotes the joining of the two phonemes.
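The phoneme-to-diphone expansion illustrated in the "I want" example is a simple pairwise pass over the phoneme string; a minimal sketch (the function name is illustrative, and the hyphen notation follows footnote 3):

```python
def phonemes_to_diphones(phonemes):
    """Convert a phoneme sequence (including the silence symbol '#') into
    the corresponding diphone sequence by pairing each phoneme with its
    successor."""
    return [f"{a}-{b}" for a, b in zip(phonemes, phonemes[1:])]
```

Applied to the phoneme string for "I want", this produces the six diphones /#-AY/ through /T-#/ shown above.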
of the best sequence of diphone units involves solving a dynamic programming search for the sequence of units that minimizes a specified cost function. The cost function generally is based on diphone matches at each of the boundaries between diphones, where the diphone match is defined in terms of spectral matching characteristics, pitch matching characteristics, and possibly phase matching characteristics.

8.3.3 Unit Selection Synthesis
The "unit selection" problem is basically one of having a given set of target features corresponding to the spoken text, and then automatically finding the sequence of units in the database that most closely matches these features. This problem is illustrated in Figure 8.7, which shows a target feature set corresponding to the sequence of sounds (phonemes for this trivial example) /HH/ /EH/ /L/ /OW/ from the word /hello/.⁴ As shown, each of the phonemes in this word has multiple representations (units) in the inventory of sounds, having been extracted from different phonemic environments. Hence there are many versions of /HH/, many versions of /EH/, etc., as illustrated in Figure 8.7. The task of the unit selection module is to choose one of each of the multiple representations of the sounds, with the goal being to minimize the total perceptual distance between segments of the chosen sequence, based on spectral, pitch, and phase differences throughout the sounds and especially at the boundary between sounds. By specifying costs associated with each unit, both globally across the unit,
Fig. 8.7 Illustration of basic process of unit selection. (After Dutoit [31].)
⁴If diphones were used, the sequence would be /HH-EH/ /EH-L/ /L-OW/.
and locally at the unit boundaries with adjacent units, we can find the sequence of units that best "join each other" in the sense of minimizing the accumulated distance across the sequence of units. This optimal sequence (or equivalently the optimal path through the combination of all possible versions of each unit in the string) can be found using a Viterbi search (dynamic programming) [38, 99]. Thus there are two costs associated with the Viterbi search, namely a nodal cost based on the unit segmental distortion (USD), which is defined as the difference between the desired spectral pattern of the target (suitably defined) and that of the candidate unit throughout the unit, and the unit concatenative distortion (UCD), which is defined as the spectral (and/or pitch and/or phase) discontinuity across the boundaries of the concatenated units.

This dynamic programming search process is illustrated in Figure 8.8 for a 3-unit search, where the transitional costs (the arcs) reflect the cost of concatenation of a pair of units based on acoustic distances, and the nodes represent the target costs based on the linguistic identity of the unit. The Viterbi search effectively computes the cost of every possible path through the lattice and determines the path with the lowest total cost.

Fig. 8.8 Online unit selection based on a Viterbi search through a lattice of alternatives.

By way of example, consider a target context of the word "cart" with target phonemes /K/ /AH/ /R/ /T/, and a source context of the phoneme /AH/ obtained from the
Fig. 8.9 Illustration of unit selection costs associated with a string of target units and a presumptive string of selected units.
source word “want” with source phonemes /W/ /AH/ /N/ /T/. The USD distance would be the cost (speciﬁed analytically) between the sound /AH/ in the context /W/ /AH/ /N/ versus the desired sound in the context /K/ /AH/ /R/. Figure 8.9 illustrates concatenative synthesis for a given string of target units (at the bottom of the ﬁgure) and a string of selected units from the unit inventory (shown at the top of the ﬁgure). We focus our attention on Target Unit tj and Selected Units θj and θj+1 . Associated with the match between units tj and θj is a USD Unit Cost and associated with the sequence of selected units, θj and θj+1 , is a UCD concatenation cost. The total cost of an arbitrary string of N selected units, Θ = {θ1 , θ2 , . . . , θN }, and the string of N target units, T = {t1 , t2 , . . . , tN } is deﬁned as:
    d(Θ, T) = Σ_{j=1}^{N} d_u(θ_j, t_j) + Σ_{j=1}^{N−1} d_t(θ_j, θ_{j+1}),        (8.2)
where du (θj , tj ) is the USD cost associated with matching target unit tj with selected unit θj and dt (θj , θj+1 ) is the UCD cost associated with concatenating the units θj and θj+1 . It should be noted that when selected units θj and θj+1 come from the same source sentence and are adjacent units, the UCD cost goes to zero, as this is as natural a concatenation as can exist in the database. Further, it is noted that generally there is a small overlap region between concatenated units. The optimal path (corresponding to the optimal sequence of units) can be eﬃciently computed using a standard Viterbi search ([38, 99]) in
which the computational demand scales linearly with both the number of target and the number of concatenation units. The concatenation cost between two units is essentially the spectral (and/or phase and/or pitch) discontinuity across the boundary between the units, and is deﬁned as:
    d_t(θ_j, θ_{j+1}) = Σ_{k=1}^{p+2} w_k C_k(θ_j, θ_{j+1}),        (8.3)
where p is the size of the spectral feature vector (typically p = 12 mfcc coefficients (mel-frequency cepstral coefficients, as explained in Section 5.6.3), often represented as a VQ codebook vector), and the extra two features are log power and pitch. The weights, w_k, are chosen during the unit selection inventory creation phase and are optimized using a trial-and-error procedure. The concatenation cost essentially measures a spectral plus log energy plus pitch difference between the two concatenated units at the boundary frames. Clearly the definition of concatenation cost could be extended to more than a single boundary frame. Also, as stated earlier, the concatenation cost is defined to be zero (d_t(θ_j, θ_{j+1}) = 0) whenever units θ_j and θ_{j+1} are consecutive (in the database) since, by definition, there is no discontinuity in either spectrum or pitch in this case. Although there are a variety of choices for measuring the spectral/log energy/pitch discontinuity at the boundary, a common cost function is the normalized mean-squared error in the feature parameters, namely:

    C_k(θ_j, θ_{j+1}) = [f_k^{θ_j}(m) − f_k^{θ_{j+1}}(1)]² / σ_k²,        (8.4)
where f_k^{θ_j}(l) is the kth feature parameter of the lth frame of segment θ_j, m is the (normalized) duration of each segment, and σ_k² is the variance of the kth feature vector component.

The USD or target costs are conceptually more difficult to understand, and, in practice, more difficult to instantiate. The USD cost associated with units θ_j and t_j is of the form:
    d_u(θ_j, t_j) = Σ_{i=1}^{q} w_i^t φ_i(T_i(f_i^{θ_j}), T_i(f_i^{t_j})),        (8.5)
156 TexttoSpeech Synthesis Methods where q is the number of features that specify the unit θj or tj , t wi , i = 1, 2, . . . , q is a trained set of target weights, and Ti (·) can be either a continuous function (for a set of features, fi , such as segmental pitch, power or duration), or a set of integers (in the case of categorical features, fi , such as unit identity, phonetic class, position in the syllable from which the unit was extracted, etc.). In the latter case, φi can be looked up in a table of distances. Otherwise, the local distance function can be expressed as a simple quadratic distance of the form φi Ti (fi j ), Ti (fi j ) = Ti (fi j ) − Ti (fi j )
θ t θ t 2
.
(8.6)
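The concatenation cost of (8.3) and (8.4) amounts to a weighted, variance-normalized squared difference between the feature vectors at the boundary frames of the two units. A minimal sketch (the function name and argument layout are illustrative assumptions):

```python
import numpy as np

def concat_cost(f_end, f_start, w, var):
    """UCD concatenation cost of (8.3)-(8.4).
    f_end: feature vector of the last frame of unit theta_j.
    f_start: feature vector of the first frame of unit theta_{j+1}.
    w: trained weights w_k; var: per-feature variances sigma_k^2."""
    c = (f_end - f_start) ** 2 / var   # Eq. (8.4), per feature
    return float(np.dot(w, c))         # Eq. (8.3), weighted sum
```

A system would also short-circuit this to zero when the two units are consecutive in the database, per the convention stated above.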
The training of the weights, w_i^t, is done offline. For each phoneme in each phonetic class in the training speech database (which might be the entire recorded inventory), each exemplar of each unit is treated as a target and all others are treated as candidate units. Using this training set, a least-squares system of linear equations can be derived from which the weight vector can be solved. The details of the weight training methods are described by Schroeter [114].

The final step in the unit selection synthesis, having chosen the optimal sequence of units to match the target sentence, is to smooth/modify the selected units at each of the boundary frames to better match the spectra, pitch, and phase at each unit boundary. Various smoothing methods based on the concepts of time domain harmonic scaling (TDHS) [76] and pitch synchronous overlap add (PSOLA) [20, 31, 82] have been proposed and optimized for such smoothing/modification at the boundaries between adjacent diphone units.
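The online lattice search described in this section can be sketched as a standard Viterbi recursion over node (USD) and arc (UCD) costs, minimizing the total cost of (8.2). This is an illustrative implementation under assumed interfaces (candidate lists per target position, plus cost callbacks), not the code of any deployed TTS system:

```python
def viterbi_unit_selection(candidates, target_cost, concat_cost):
    """Find the unit sequence minimizing Eq. (8.2):
    d = sum_j du(theta_j, t_j) + sum_j dt(theta_j, theta_{j+1}).

    candidates[j]: list of database units available for target position j.
    target_cost(j, u): USD (node) cost of unit u against target j.
    concat_cost(u, v): UCD (arc) cost of joining unit u to unit v."""
    best = {u: target_cost(0, u) for u in candidates[0]}
    backptr = [{}]
    for j in range(1, len(candidates)):
        new_best, ptr = {}, {}
        for u in candidates[j]:
            prev = min(best, key=lambda v: best[v] + concat_cost(v, u))
            new_best[u] = best[prev] + concat_cost(prev, u) + target_cost(j, u)
            ptr[u] = prev
        best, backptr = new_best, backptr + [ptr]
    # Trace back from the cheapest final unit.
    u = min(best, key=best.get)
    total = best[u]
    path = [u]
    for j in range(len(candidates) - 1, 0, -1):
        u = backptr[j][u]
        path.append(u)
    return path[::-1], total
```

As the text notes, the work scales linearly with the number of target positions (times the product of adjacent candidate-list sizes per step), rather than enumerating every path explicitly.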
8.4 TTS Applications
Speech technology serves as a way of intelligently and efficiently enabling humans to interact with machines, with the goal of bringing down the cost of service for an existing service capability, or providing new products and services that would be prohibitively expensive without the automation provided by a viable speech processing interactive
system. Examples of existing services where TTS enables significant cost reduction are as follows:

• a dialog component for customer care applications
• a means for delivering text messages over an audio connection (e.g., cellphone)
• a means of replacing expensive recorded Interactive Voice Response system prompts (a service which is particularly valuable when the prompts change often during the course of the day, e.g., stock price quotations).

Similarly, some examples of new products and services which are enabled by a viable TTS technology are the following:

• location-based services, e.g., alerting you to locations of restaurants, gas stations, and stores in the vicinity of your current location
• providing information in cars (e.g., driving directions, traffic reports)
• unified messaging (e.g., reading email and fax messages)
• voice portal providing voice access to web-based services
• e-commerce agents
• customized news, sports scores, stock reports, etc.
• giving voice to small and embedded devices for reporting information and alerts.

8.5 TTS Future Needs

Modern TTS systems have the capability of producing highly intelligible, and surprisingly natural, speech utterances, so long as the utterance is not too long, too syntactically complicated, or too technical. The biggest problem with most TTS systems is that they have no idea as to how things should be said, but instead rely on the text analysis for emphasis, prosody, phrasing, and all so-called suprasegmental features of the spoken utterance. The more TTS systems learn how to produce context-sensitive pronunciations of words (and phrases), the more natural sounding these systems will become. Hence, by way of example,
the utterance "I gave the book to John" has at least three different semantic interpretations, each with a different emphasis on words in the utterance: I gave the book to John, i.e., I did it, not someone else; I gave the book to John, i.e., the book, not the photos or the apple; and I gave the book to John, i.e., to John, not to Mary or Bob.

The second future need of TTS systems is improvements in the unit selection process so as to better capture the target cost for mismatch between predicted unit specification (i.e., phoneme name, duration, pitch, spectral properties) and actual features of a candidate recorded unit. Also needed in the unit selection process is a better spectral distance measure that incorporates human perception measures so as to find the best sequence of units for a given utterance. Finally, better signal processing would enable improved compression of the units database, thereby making the footprint of TTS systems small enough to be usable in handheld and mobile devices.
9 Automatic Speech Recognition (ASR)

In this chapter, we examine the process of speech recognition by machine, which is in essence the inverse of the text-to-speech problem. The driving factor behind research in machine recognition of speech has been the potentially huge payoff of providing services where humans interact solely with machines, thereby eliminating the cost of live agents and significantly reducing the cost of providing services. Interestingly, as a side benefit, this process often provides users with a natural and convenient way of accessing information and services.

9.1 The Problem of Automatic Speech Recognition

The goal of an ASR system is to accurately and efficiently convert a speech signal into a text message transcription of the spoken words, independent of the device used to record the speech (i.e., the transducer or microphone), the speaker's accent, or the acoustic environment in which the speaker is located (e.g., quiet office, noisy room, outdoors). That is, the ultimate goal, which has not yet been achieved, is to perform as well as a human listener.
A simple conceptual model of the speech generation and speech recognition processes is given in Figure 9.1, which is a simplified version of the speech chain shown in Figure 1.2. It is assumed that the speaker intends to express some thought as part of a process of conversing with another human or with a machine. To express that thought, the speaker must compose a linguistically meaningful sentence, W, in the form of a sequence of words (possibly with pauses and other acoustic events such as uh's, um's, er's, etc.). Once the words are chosen, the speaker sends appropriate control signals to the articulatory speech organs which form a speech utterance whose sounds are those required to speak the desired sentence, resulting in the speech waveform s[n]. We refer to the process of creating the speech waveform from the speaker's intention as the Speaker Model, since it reflects the speaker's accent and choice of words to express a given thought or request. This is in essence a digital simulation of the lower part of the speech chain diagram in Figure 1.2.

Fig. 9.1 Conceptual model of speech production and speech recognition processes.

The processing steps of the Speech Recognizer are shown at the right side of Figure 9.1 and consist of an acoustic processor which analyzes the speech signal and converts it into a set of acoustic (spectral, temporal) features which efficiently characterize the speech sounds, followed by a linguistic decoding process which makes a best (maximum likelihood) estimate of the words of the spoken sentence, resulting in the recognized sentence Ŵ.

Figure 9.2 shows a more detailed block diagram of the overall speech recognition system. The input speech signal, s[n], is converted to the sequence of feature vectors, X = {x1, x2, ..., xT}, by the feature analysis block (also denoted spectral analysis). The feature vectors are computed on a frame-by-frame basis using the techniques discussed in the earlier chapters; in particular, the mel frequency cepstrum coefficients are widely used to represent the short-time spectral characteristics. The pattern classification block (also denoted as the decoding and search block) decodes the sequence of feature vectors into a symbolic representation that is the maximum likelihood string, Ŵ, that could have produced the input sequence of feature vectors. The pattern recognition system uses a set of acoustic models (represented as hidden Markov models) and a word lexicon to provide the acoustic match score for each proposed string. Also, an N-gram language model is used to compute a language model score for each proposed word string. The final block in the process is a confidence scoring process (also denoted as an utterance verification block), which is used to provide a confidence score for each individual word in the recognized string.

Fig. 9.2 Block diagram of an overall speech recognition system.

Each of the operations in Figure 9.2 involves many details and, in some cases, extensive digital computation. The remainder of this chapter is an attempt to give the flavor of what is involved in each part of Figure 9.2.

9.2 Building a Speech Recognition System

The steps in building and evaluating a speech recognition system are the following:

(1) choose the feature set and the associated signal processing for representing the properties of the speech signal over time
(2) choose the recognition task, including the recognition word vocabulary (the lexicon), the basic speech sounds to represent the vocabulary (the speech units), the task syntax or language model, and the task semantics (if any)
(3) train the set of speech acoustic and language models
(4) evaluate performance of the resulting speech recognition system.

Each of these steps may involve many choices and can involve significant research and development effort. Some of the important issues are summarized in this section.

9.2.1 Recognition Feature Set

There is no "standard" set of features for speech recognition. Instead, various combinations of acoustic, articulatory, and auditory features have been utilized in a range of speech recognition systems. The most popular acoustic features have been the (LPC-derived) mel-frequency cepstrum coefficients and their derivatives. A block diagram of the signal processing used in most modern large vocabulary speech recognition systems is shown in Figure 9.3. The analog speech signal is sampled and quantized at rates between 8000 and 20,000 samples/s. A first order (highpass) preemphasis network (1 − αz⁻¹) is used to compensate for the speech spectral falloff at higher frequencies and approximates the inverse of the mouth transmission frequency response. The preemphasized signal is next blocked into frames of N samples, with adjacent frames spaced M samples apart. Typical values for N and M correspond to frames of duration 15–40 ms, with frame shifts of 10 ms being most common; hence adjacent frames overlap by 5–30 ms depending on the chosen values of N and M. A Hamming window is applied to each frame prior to spectral analysis using either standard spectral analysis or LPC methods. Following (optionally used) simple noise removal methods, the spectral coefficients are normalized and converted to mel-frequency cepstral coefficients via standard analysis methods of the type discussed in Chapter 5 [27]. Some type of cepstral bias removal is often used prior to calculation of the first and second order cepstral derivatives.
Fig. 9.3 Block diagram of feature extraction process for feature vector consisting of mfcc coeﬃcients and their ﬁrst and second derivatives.
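The preemphasis, frame blocking, and Hamming windowing steps described above can be sketched as follows (the values α = 0.95, N = 400, and M = 160 are illustrative choices for a 16 kHz sampling rate, i.e., 25 ms frames with a 10 ms shift, not values prescribed by the text):

```python
import numpy as np

def frame_signal(s, alpha=0.95, N=400, M=160):
    """Preemphasize with 1 - alpha*z^-1, block into overlapping frames of
    N samples spaced M samples apart, and apply a Hamming window to each
    frame. Returns an (n_frames, N) array ready for spectral analysis."""
    s = np.append(s[0], s[1:] - alpha * s[:-1])        # preemphasis
    n_frames = 1 + (len(s) - N) // M
    idx = np.arange(N)[None, :] + M * np.arange(n_frames)[:, None]
    return s[idx] * np.hamming(N)
```

Each windowed row would then be passed to the spectral analysis and mel-cepstrum stages of Figure 9.3.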
Typically, the resulting feature vector is the set of cepstral coefficients and their first- and second-order derivatives. It is typical to use about 13 mfcc coefficients, 13 first-order cepstral derivative coefficients, and 13 second-order cepstral derivative coefficients, making a D = 39 size feature vector.

9.2.2 The Recognition Task
Recognition tasks vary from simple word and phrase recognition systems to large vocabulary conversational interfaces to machines. For example, using a digits vocabulary, the task could be recognition of a string of digits that forms a telephone number, or identiﬁcation code, or a highly constrained sequence of digits that form a password. 9.2.3 Recognition Training
There are two aspects to training models for speech recognition, namely acoustic model training and language model training. Acoustic model
164 Automatic Speech Recognition (ASR) training requires recording each of the model units (whole words, phonemes) in as many contexts as possible so that the statistical learning method can create accurate distributions for each of the model states. Acoustic training relies on accurately labeled sequences of speech utterances which are segmented according to the transcription, so that training of the acoustic models ﬁrst involves segmenting the spoken strings into recognition model units (via either a Baum–Welch [10, 11] or Viterbi alignment method), and then using the segmented utterances to simultaneously build model distributions for each state of the vocabulary unit models. The resulting statistical models form the basis for the pattern recognition operations at the heart of the ASR system. As discussed in Section 9.3, concepts such as Viterbi search are employed in the pattern recognition process as well as in training. Language model training requires a sequence of text strings that reﬂect the syntax of spoken utterances for the task at hand. Generally such text training sets are created automatically (based on a model of grammar for the recognition task) or by using existing text sources, such as magazine and newspaper articles, or closed caption transcripts of television news broadcasts, etc. Other times, training sets for language models can be created from databases, e.g., valid strings of telephone numbers can be created from existing telephone directories. 9.2.4 Testing and Performance Evaluation
In order to improve the performance of any speech recognition system, there must be a reliable and statistically signiﬁcant way of evaluating recognition system performance based on an independent test set of labeled utterances. Typically we measure word error rate and sentence (or task) error rate as a measure of recognizer performance. A brief summary of performance evaluations across a range of ASR applications is given in Section 9.4.
9.3 The Decision Processes in ASR
The heart of any automatic speech recognition system is the pattern classiﬁcation and decision operations. In this section, we shall give a brief introduction to these important topics.
9.3.1 Mathematical Formulation of the ASR Problem
The problem of automatic speech recognition is represented as a statistical decision problem. Specifically, it is formulated as a Bayes maximum a posteriori probability (MAP) decision process in which we seek to find the word string Ŵ (in the task language) that maximizes the a posteriori probability P(W|X) of that string, given the measured feature vector, X, i.e.,

    Ŵ = arg max_W P(W|X).        (9.1)
Using Bayes' rule we can rewrite (9.1) in the form:

    Ŵ = arg max_W [P(X|W) P(W)] / P(X).        (9.2)
Equation (9.2) shows that the calculation of the a posteriori probability is decomposed into two terms, one that defines the a priori probability of the word sequence, W, namely P(W), and the other that defines the likelihood that the word string, W, produced the feature vector, X, namely P(X|W). For all future calculations we disregard the denominator term, P(X), since it is independent of the word sequence W which is being optimized. The term P(X|W) is known as the "acoustic model" and is generally denoted as P_A(X|W) to emphasize the acoustic nature of this term. The term P(W) is known as the "language model" and is generally denoted as P_L(W) to emphasize the linguistic nature of this term. The probabilities associated with P_A(X|W) and P_L(W) are estimated or learned from a set of training data that have been labeled by a knowledge source, usually a human expert, where the training set is as large as reasonably possible. The recognition decoding process of (9.2) is often written in the form of a 3-step process, i.e.,

    Ŵ = arg max_W P_A(X|W) P_L(W),        (9.3)
where Step 1 is the computation of the probability associated with the acoustic model of the speech sounds in the sentence W, Step 2 is the computation of the probability associated with the linguistic model of the words in the utterance, and Step 3 is the computation associated with the search through all valid sentences in the task language for the maximum likelihood sentence.

In order to be more explicit about the signal processing and computations associated with each of the three steps of (9.3), we need to be more explicit about the relationship between the feature vector, X, and the word sequence, W. As discussed above, the feature vector, X, is a sequence of acoustic observations corresponding to each of T frames of the speech, of the form:

    X = {x_1, x_2, ..., x_T},    (9.4)

where the speech signal duration is T frames (i.e., T times the frame shift in ms) and each frame, x_t, t = 1, 2, ..., T, is an acoustic feature vector of the form:

    x_t = (x_t1, x_t2, ..., x_tD)    (9.5)

that characterizes the spectral/temporal properties of the speech signal at time t, where D is the number of acoustic features in each frame. Similarly we can express the optimally decoded word sequence, W, as:

    W = w_1, w_2, ..., w_M,    (9.6)

where there are assumed to be exactly M words in the decoded string.

9.3.2 The Hidden Markov Model

The most widely used method of building acoustic models (for both phonemes and words) is the use of a statistical characterization known as Hidden Markov Models (HMMs) [33, 97, 98]. Figure 9.4 shows a simple Q = 5-state HMM for modeling a whole word. Each HMM state is characterized by a mixture density Gaussian distribution that characterizes the statistical behavior of the feature vectors within the states of the model [61, 62, 69]. In addition to the statistical feature densities within states, the HMM is also characterized by an explicit set of state transitions, a_ij, which specify the probability of making a transition from state i to state j at each frame, thereby defining the time sequence of the feature vectors over the duration of the word. Usually the self-transitions, a_ii, are large (close to 1.0), and the jump transitions, a_12, a_23, a_34, a_45, in the model, are small (close to 0).
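A minimal sketch of such a left-to-right model, with large self-transitions and small jump transitions, together with the standard forward recursion for scoring an observation sequence against it. All numbers are invented, and a discrete emission table stands in for the Gaussian mixture densities used in practice:

```python
import numpy as np

Q = 5  # states in a left-to-right word model (as in Figure 9.4)

# Transition matrix: large self-loops a_ii, small forward jumps a_i,i+1.
A = np.zeros((Q, Q))
for i in range(Q - 1):
    A[i, i], A[i, i + 1] = 0.8, 0.2
A[Q - 1, Q - 1] = 1.0  # final state absorbs

# Toy discrete emission probabilities over a 3-symbol alphabet
# (real systems use Gaussian mixture densities instead).
B = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6],
              [0.1, 0.8, 0.1],
              [0.6, 0.3, 0.1]])

pi = np.zeros(Q)
pi[0] = 1.0  # left-to-right model: must start in the first state

def forward_likelihood(obs):
    """P(O | lambda) computed by the forward recursion:
    alpha_t(j) = [sum_i alpha_{t-1}(i) a_ij] * b_j(o_t)."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

print(forward_likelihood([0, 1, 2, 1, 0]))
```

This is the quantity that the Baum–Welch training procedure described next repeatedly evaluates (in forward and backward form) while re-estimating the model parameters.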
The complete HMM characterization of a Q-state word model (or a subword unit like a phoneme model) is generally written as λ(A, B, π), with state transition matrix A = {a_ij, 1 ≤ i, j ≤ Q}, state observation probability density B = {b_j(x_t), 1 ≤ j ≤ Q}, and initial state distribution π = {π_i, 1 ≤ i ≤ Q}, with π_1 set to 1 for the "left-to-right" models of the type shown in Figure 9.4.

Fig. 9.4 Word-based, left-to-right, HMM with 5 states.

In order to train the HMM (i.e., learn the optimal model parameters) for each word (or subword) unit, a labeled training set of sentences (transcribed into words and subword units) is used to guide an efficient training procedure known as the Baum–Welch algorithm [10, 11].(1) This algorithm aligns each of the various words (or subword units) with the spoken inputs and then estimates the appropriate means, covariances, and mixture gains for the distributions in each model state. The Baum–Welch method is a hill-climbing algorithm and is iterated until a stable alignment of models and speech is obtained. The heart of the training procedure for re-estimating HMM model parameters using the Baum–Welch procedure is shown in Figure 9.5. The details of the Baum–Welch procedure are beyond the scope of this chapter but can be found in several references on speech recognition methods [49, 99].

An initial HMM model is used to begin the training process. The initial model can be randomly chosen or selected based on a priori knowledge of the model parameters. The iteration loop is a simple updating procedure for computing the forward and backward model probabilities based on an input speech database (the training set of utterances) and then optimizing the model parameters to give an updated HMM. This process is iterated until no further improvement in probabilities occurs with each new iteration.

Fig. 9.5 The Baum–Welch training procedure based on a given training set of utterances.

It is a simple matter to go from the HMM for a whole word, as illustrated in Figure 9.4, to an HMM for a subword unit (such as a phoneme), as shown in Figure 9.6. This simple 3-state HMM is a basic subword unit model with an initial state representing the statistical characteristics at the beginning of a sound, a middle state representing the heart of the sound, and an ending state representing the spectral characteristics at the end of the sound. A word model is made by concatenating the appropriate subword HMMs.

Fig. 9.6 Subword-based HMM with 3 states.

(1) The Baum–Welch algorithm is also widely referred to as the forward–backward method.
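Concatenation of subword models can be sketched as a block-diagonal transition matrix with a "bridge" transition that routes the exit probability of one unit into the entry state of the next. The 3-state models and transition probabilities below are hypothetical:

```python
import numpy as np

def subword_hmm(p_stay=0.7):
    """3-state left-to-right subword (phoneme) HMM transition matrix.
    The final row's missing mass (1 - p_stay) is the probability of
    exiting the unit, to be routed to the next unit on concatenation."""
    Q = 3
    A = np.zeros((Q, Q))
    for i in range(Q):
        A[i, i] = p_stay
        if i + 1 < Q:
            A[i, i + 1] = 1 - p_stay
    return A

def concatenate(A1, A2):
    """Word model, e.g. /is/ = /IH/ + /Z/: place the two unit models on
    the block diagonal and bridge the exit probability of the first
    unit into the first state of the second."""
    q1, q2 = len(A1), len(A2)
    A = np.zeros((q1 + q2, q1 + q2))
    A[:q1, :q1] = A1
    A[q1:, q1:] = A2
    A[q1 - 1, q1] = 1 - A1[q1 - 1, q1 - 1]  # bridge, e.g. /IH/ -> /Z/
    return A

word_is = concatenate(subword_hmm(), subword_hmm())
print(word_is.shape)  # → (6, 6)
```

After the bridge is added, every non-final row of the word model is again a proper probability distribution, so the concatenated model can be trained and scored exactly like a whole-word model.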
Figure 9.7 shows the word model for the word "is" (pronounced as /IH Z/), which concatenates the 3-state HMM for the sound /IH/ with the 3-state HMM for the sound /Z/. In general the composition of a word (from subword units) is specified in a word lexicon or dictionary; however, once the word model has been built, it can be used much the same as whole word models for training and for evaluating word strings for maximizing the likelihood as part of the speech recognition process.

Fig. 9.7 Word-based HMM for the word /is/ created by concatenating 3-state subword models for the subword units /ih/ and /z/.

We are now ready to define the procedure for aligning a sequence of M word models, w_1, w_2, ..., w_M, with a sequence of feature vectors, X = {x_1, x_2, ..., x_T}. An optimal alignment procedure determines the exact best matching sequence between word model states and feature vectors such that the first feature vector, x_1, aligns with the first state in the first word model, and the last feature vector, x_T, aligns with the last state in the Mth word model, subject to the constraint that the total number of feature vectors, T, exceeds the total number of model states, so that every state has at least a single feature vector associated with that state. The resulting alignment procedure is illustrated in Figure 9.8. We see the sequence of feature vectors along the horizontal axis and the concatenated sequence of word states along the vertical axis. (For simplicity we show each word model as a 5-state HMM in Figure 9.8, but clearly the alignment procedure works for any size model for any word.) The procedure for obtaining the best alignment between feature vectors and model states is based on either using the Baum–Welch statistical alignment procedure (in which we evaluate the probability of every alignment path and add them up to determine the probability of the word string), or a Viterbi alignment procedure [38, 132], for which we determine the single best alignment path and use the probability score along that path as the probability measure for the current word string.

Fig. 9.8 Alignment of concatenated HMM word models with acoustic feature vectors based on either a Baum–Welch or Viterbi alignment procedure.

The utility of the alignment procedure of Figure 9.8 is based on the ease of evaluating the probability of any alignment path using the Baum–Welch or Viterbi procedures. We now return to the mathematical formulation of the ASR problem and examine in more detail the three steps in the decoding Equation (9.3).

9.3.3 Step 1 — Acoustic Modeling

The function of the acoustic modeling step (Step 1) is to assign probabilities to the acoustic realizations of a sequence of words, given the observed acoustic vectors, i.e., we need to compute the probability that the acoustic vector sequence X = {x_1, x_2, ..., x_T} came from the word sequence W = w_1, w_2, ..., w_M (assuming each word is represented as an HMM) and perform this computation for all possible word sequences.
This calculation can be expressed as:

    P_A(X|W) = P_A({x_1, x_2, ..., x_T}|w_1, w_2, ..., w_M).    (9.7)

We assume that each word is represented by an N-state HMM model, and we denote the states as S_j, j = 1, 2, ..., N. Within each state of each word there is a probability density that characterizes the statistical properties of the feature vectors in that state. We have seen in the previous section that the probability density of each state, j, of the word model, which we denote as b_j(x_t), is learned during a training phase of the recognizer. If we make the assumption that each frame, x_t, is aligned with word i(t) and HMM model state j(t) via the function w_{j(t)}^{i(t)}, and if we assume that each frame is independent of every other frame, we can express (9.7) as the product

    P_A(X|W) = prod_{t=1}^{T} P_A(x_t | w_{j(t)}^{i(t)}),    (9.8)

where we associate each frame of X with a unique word and state, and for each frame we calculate the local probability P_A(x_t | w_{j(t)}^{i(t)}), given that we know the word from which frame t came.

The process of assigning individual speech frames to the appropriate word model in an utterance is based on an optimal alignment process between the concatenated sequence of word models and the sequence of feature vectors of the spoken input utterance being recognized. This alignment process is illustrated in Figure 9.9, which shows the set of T feature vectors (frames) along the horizontal axis and the set of M words (and word model states) along the vertical axis. The optimal segmentation of these feature vectors (frames) into the M words is shown by the sequence of boxes, each of which corresponds to one of the words in the utterance and its set of optimally matching feature vectors.
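The frame-to-state assignment described above can be sketched as a Viterbi-style dynamic program over the concatenated state sequence. The per-frame state log-likelihoods below are invented, and only stay/advance moves are allowed, so the alignment is monotonic as in Figure 9.9:

```python
import numpy as np

def viterbi_align(loglik):
    """Best monotonic alignment of T frames to S concatenated states.

    loglik[t, s] = log b_s(x_t). Each frame either stays in the current
    state or advances to the next one; the path starts in state 0 and
    ends in state S-1 (which requires T >= S)."""
    T, S = loglik.shape
    NEG = -np.inf
    delta = np.full((T, S), NEG)
    delta[0, 0] = loglik[0, 0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        for s in range(S):
            stay = delta[t - 1, s]
            move = delta[t - 1, s - 1] if s > 0 else NEG
            if move > stay:
                delta[t, s], back[t, s] = move + loglik[t, s], s - 1
            else:
                delta[t, s], back[t, s] = stay + loglik[t, s], s

    path = [S - 1]                     # must end in the last state
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]                  # state index for each frame

# Toy 6 frames x 3 states: the frames clearly favor states 0,0,1,1,2,2.
ll = np.log(np.array([[.8, .1, .1], [.8, .1, .1], [.1, .8, .1],
                      [.1, .8, .1], [.1, .1, .8], [.1, .1, .8]]))
print(viterbi_align(ll))  # → [0, 0, 1, 1, 2, 2]
```

The Baum–Welch alternative mentioned in the text would sum over all such paths instead of keeping only the best one.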
Using a mixture of Gaussian densities to characterize the statistical distribution of the feature vectors in each state, j, we get a state-based probability density of the form:

    b_j(x_t) = sum_{k=1}^{K} c_jk N[x_t, mu_jk, U_jk],    (9.9)

where K is the number of mixture components in the density function, c_jk is the weight of the kth mixture component in state j, with the constraint c_jk >= 0, and N is a Gaussian density function with mean vector, mu_jk, for mixture k in state j, and covariance matrix, U_jk, for mixture k in state j. The density constraints are:

    sum_{k=1}^{K} c_jk = 1,  1 <= j <= N,    (9.10)

    int_{-inf}^{inf} b_j(x_t) dx_t = 1,  1 <= j <= N.    (9.11)

Fig. 9.9 Illustration of time alignment process between unknown utterance feature vectors and set of M concatenated word models.

We now return to the issue of the calculation of the probability of frame x_t being associated with the j(t)th state of the i(t)th word in
the utterance, which is calculated as

    P_A(x_t | w_{j(t)}^{i(t)}) = b_{j(t)}^{i(t)}(x_t).    (9.12)

The computation of (9.12) is incomplete, since we have ignored the computation of the probability associated with the links between word states, and we have also not specified how to determine the within-word state, j, in the alignment between a given word and a set of feature vectors corresponding to that word. We come back to these issues later in this section. The key point is that we assign probabilities to acoustic realizations of a sequence of words by using hidden Markov models of the acoustic feature vectors within words. Using an independent (and orthographically labeled) set of training data, we "train the system" and learn the parameters of the best acoustic models for each word (or, more specifically, for each sound that comprises each word). The parameters, according to the mixture model of (9.9), are the mixture weights, the mean vectors, and the covariance matrices for each state of the model.

Although we have been discussing acoustic models for whole words, it should be clear that for any reasonable size speech recognition task, it is impractical to create a separate acoustic model for every possible word in the vocabulary, since each word would have to be spoken in every possible context in order to build a statistically reliable model of the density functions of (9.9). Even for modest size vocabularies of about 1000 words, the amount of training data required for word models is excessive.

The alternative to word models is to build acoustic-phonetic models for the 40 or so phonemes in the English language and construct the model for a word by concatenating (stringing together sequentially) the models for the constituent phones in the word (as represented in a word dictionary or lexicon). The use of such subword acoustic-phonetic models poses no real difficulties in either training or when used to build up word models, and hence it is the most widely used representation for building word models in a speech recognition system. State-of-the-art systems use context-dependent phone models as the basic units of recognition [99].
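The state density of Equation (9.9) is straightforward to evaluate once the mixture parameters have been trained. A sketch with a hypothetical two-component, diagonal-covariance mixture (real systems train many more components per state, often with full training of the covariance structure):

```python
import numpy as np

def gaussian(x, mu, var):
    """Diagonal-covariance multivariate Gaussian density N[x, mu, U]."""
    d = len(mu)
    z = ((x - mu) ** 2 / var).sum()
    return np.exp(-0.5 * z) / np.sqrt((2 * np.pi) ** d * np.prod(var))

def b_j(x, c, mus, vars_):
    """Mixture density of Equation (9.9): sum_k c_jk N[x, mu_jk, U_jk]."""
    # Constraints (9.10): weights are non-negative and sum to one.
    assert abs(sum(c) - 1.0) < 1e-9 and all(ck >= 0 for ck in c)
    return sum(ck * gaussian(x, mu, v) for ck, mu, v in zip(c, mus, vars_))

# Hypothetical 2-component mixture for one state, with D = 2 features.
c = [0.6, 0.4]
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
vars_ = [np.array([1.0, 1.0]), np.array([0.5, 0.5])]
print(b_j(np.array([0.1, -0.2]), c, mus, vars_))
```

Because each component integrates to one and the weights sum to one, the mixture automatically satisfies the normalization constraint (9.11).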
9.3.4 Step 2 — The Language Model

The language model assigns probabilities to sequences of words, W, based on the likelihood of that sequence of words occurring in the context of the task being performed by the speech recognition system. Hence the probability of the text string W = "Call home" for a telephone number identification task is zero, since that string makes no sense for the specified task. The purpose of the language model, or grammar, is to enable the computation of the a priori probability, P_L(W), of a word string, W, consistent with the recognition task [59, 60, 106]. There are many ways of building language models for specific tasks, including:

(1) statistical training from text databases transcribed from task-specific dialogs (a learning procedure),
(2) rule-based learning of the formal grammar associated with the task,
(3) enumerating, by hand, all valid text strings in the language and assigning appropriate probability scores to each string.

Perhaps the most popular way of constructing the language model is through the use of a statistical N-gram word grammar that is estimated from a large training set of text utterances, either from the task at hand or from a generic database with applicability to a wide range of tasks. (Such databases could include millions or even tens of millions of text sentences.) We now describe the way in which such a language model is built.

Assume we have a large text training set of word-labeled utterances; for every sentence in the training set, we have a text file that identifies the words in that sentence. If we make the assumption that the probability of a word in a sentence is conditioned on only the previous N − 1 words, we have the basis for an N-gram language model. Thus we assume we can write the probability of the sentence W, according to an N-gram language model, as:

    P_L(W) = P_L(w_1, w_2, ..., w_M)    (9.13)
           = prod_{n=1}^{M} P_L(w_n | w_{n−1}, w_{n−2}, ..., w_{n−N+1}),    (9.14)
where the probability of a word occurring in the sentence only depends on the previous N − 1 words, and we estimate this probability by counting the relative frequencies of N-tuples of words in the training set. Thus, for example, to estimate word "trigram" probabilities (i.e., the probability that a word w_n was preceded by the pair of words (w_{n−1}, w_{n−2})), we compute this quantity as:

    P(w_n | w_{n−1}, w_{n−2}) = C(w_{n−2}, w_{n−1}, w_n) / C(w_{n−2}, w_{n−1}),    (9.15)

where C(w_{n−2}, w_{n−1}, w_n) is the frequency count of the word triplet (i.e., the trigram of words) consisting of (w_{n−2}, w_{n−1}, w_n) as it occurs in the text training set, and C(w_{n−2}, w_{n−1}) is the frequency count of the word doublet (i.e., bigram of words) (w_{n−2}, w_{n−1}) as it occurs in the text training set.
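The relative-frequency estimate of Equation (9.15) amounts to two nested counts over the training text. A minimal sketch on an invented toy corpus (with no smoothing of unseen N-grams, which any practical language model would add):

```python
from collections import Counter

def trigram_lm(sentences):
    """Relative-frequency trigram estimates, Equation (9.15):
    P(w_n | w_{n-1}, w_{n-2}) = C(w_{n-2}, w_{n-1}, w_n) / C(w_{n-2}, w_{n-1}).
    No smoothing: unseen history bigrams raise ZeroDivisionError."""
    tri, bi = Counter(), Counter()
    for s in sentences:
        words = s.split()
        for i in range(len(words) - 1):
            bi[tuple(words[i:i + 2])] += 1
        for i in range(len(words) - 2):
            tri[tuple(words[i:i + 3])] += 1
    return lambda w1, w2, w3: tri[(w1, w2, w3)] / bi[(w1, w2)]

# Tiny invented training corpus.
p = trigram_lm(["call my home phone",
                "call my office phone",
                "call my home"])
print(p("call", "my", "home"))  # C(call,my,home)/C(call,my) = 2/3
```

Real systems additionally smooth these counts so that word sequences never seen in training still receive a small, nonzero probability.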
9.3.5 Step 3 — The Search Problem

The third step in the Bayesian approach to automatic speech recognition is to search the space of all valid word sequences from the language model, to find the one with the maximum likelihood of having been spoken. The key problem is that the potential size of the search space can be astronomically large (for large vocabularies and high average word branching factor language models), thereby taking inordinate amounts of computing power to solve by heuristic methods. Fortunately, through the use of methods from the field of finite state automata theory, Finite State Network (FSN) methods have evolved that reduce the computational burden by orders of magnitude, thereby enabling exact maximum likelihood solutions in computationally feasible times, even for very large speech recognition problems [81].

The basic concept of a finite state network transducer is illustrated in Figure 9.10, which shows a word pronunciation network for the word /data/. Each arc in the state diagram corresponds to a phoneme in the word pronunciation network, and the weight is an estimate of the probability that the arc is utilized in the pronunciation of the word in context. We see that for the word /data/ there are four total pronunciations,
namely (along with their (estimated) pronunciation probabilities):

(1) /D/ /EY/ /D/ /AX/ — probability of 0.48
(2) /D/ /EY/ /T/ /AX/ — probability of 0.32
(3) /D/ /AE/ /T/ /AX/ — probability of 0.08
(4) /D/ /AE/ /D/ /AX/ — probability of 0.12

Fig. 9.10 Word pronunciation transducer for four pronunciations of the word /data/. (After Mohri [81].)

The combined FSN of the 4 pronunciations is a lot more efficient than using 4 separate enumerations of the word, since all the arcs are shared among the 4 pronunciations, and the total computation for the full FSN for the word /data/ is close to 1/4 the computation of the 4 variants of the same word.

We can continue the process of creating efficient FSNs for each word in the task vocabulary (the speech dictionary or lexicon), and then combine word FSNs into sentence FSNs using the appropriate language model. Further, we can carry the process down to the level of HMM phones and HMM states, making the process even more efficient. Ultimately we can compile a very large network of model states, model phones, model words, and even model phrases into a much smaller network via the method of weighted finite state transducers (WFST), which combine the various representations of speech and language and optimize the resulting network to minimize the number of search states (and, equivalently, thereby minimize the amount of duplicate computation). A simple example of such a WFST network optimization is given in Figure 9.11 [81].

Using the techniques of network combination (which include network composition, determinization, minimization, and weight pushing) and network optimization, the WFST uses a unified mathematical
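The arc-sharing argument can be checked with a small count over the four pronunciations above. Note one assumption: the last pronunciation probability is truncated in the source, and it is taken here as 0.12 so that the four probabilities sum to one:

```python
# The four /data/ pronunciations listed above; the 0.12 entry is an
# assumption chosen so that the probabilities sum to one.
prons = {("D", "EY", "D", "AX"): 0.48,
         ("D", "EY", "T", "AX"): 0.32,
         ("D", "AE", "T", "AX"): 0.08,
         ("D", "AE", "D", "AX"): 0.12}

# Arcs needed if each pronunciation is enumerated separately:
separate_arcs = sum(len(p) for p in prons)

# Arcs in a shared left-to-right network: distinct (position, phoneme)
# pairs, since arcs common to several pronunciations are stored once.
shared_arcs = len({(i, ph) for p in prons for i, ph in enumerate(p)})

print(separate_arcs, shared_arcs)  # → 16 6
```

Even for this single 4-pronunciation word the shared network needs far fewer arcs than the separate enumerations; the WFST operations described next apply the same sharing idea across entire lexicons and language models.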
model phones. Many variations on this general theme have been investigated over the past 30 years or more. model words.2.9. Using these methods.1. an unoptimized network with 1022 states (the result of the cross product of model states. In Section 9.3 we describe the most widely used statistical pattern recognition techniques that are employed for mapping the sequence of feature vectors into a sequence of symbols or words. and as we mentioned before.11 Use of WFSTs to compile a set of FSNs into a single optimized network to minimize redundancy in the network.) framework to eﬃciently compile a large network into a minimal representation that is readily searched using standard Viterbi decoding methods [38].1 gives a summary of a range of the performance of some speech recognition and natural language understanding systems .4 Representative Recognition Performance 177 Fig. we suggest how the techniques of digital speech analysis discussed in Chapters 4–6 can be applied to extract a sequence of feature vectors from the speech signal. testing and evaluation is a major part of speech recognition research and development. 9. and in Section 9. 9. and model phrases) was able to be compiled down to a mathematically equivalent model with 108 states that was readily searched for the optimum word string with no loss of performance or word accuracy. Table 9. (After Mohri [81].1 represents a wide range of possibilities for the implementation of automatic speech recognition.4 Representative Recognition Performance The block diagram of Figure 9.
3% for a very clean recording environment for the TI (Texas Instruments) connected digits database [68]).0 Conversational Read speech Spontaneous Read text Narrated News Telephone conversation Telephone conversation 11 (0–9.0%.1 Word error rates for a range of speech recognition systems. Table 9. showing the lack of robustness of the recognition system to noise and other background disturbances [44].3 2.178 Automatic Speech Recognition (ASR) Table 9. oh) 1000 2500 64. Corpus Connected digit strings (TI Database) Connected digit strings (AT&T Mall Recordings) Connected digit strings (AT&T HMIHY c ) Resource management (RM) Airline travel information system (ATIS) North American business (NAB & WSJ) Broadcast News Switchboard Callhome Type of speech Spontaneous Spontaneous Vocabulary size 11 (0–9. speaking styles.000 210.000 5.000 45.0 2.0% and when embedded within conversational speech (the AT&T HMIHY (How May I Help You) c system) the word error rate increases signiﬁcantly to 5. oh) 11 (0–9.0 2. or ATIS [133]) with a 2500 word vocabulary and a word error rate of 2.000 28. This table covers a range of vocabulary sizes.1 also shows the word error rates for a range of DARPA tasks ranging from • read speech of commands and informational requests about a naval ships database (the resource management system or RM) with a 1000 word vocabulary and a word error rate of 2.5% .6 ≈15 ≈27 ≈35 that have been developed so far. and application contexts [92]. It can be seen that for a vocabulary of 11 digits. oh) Word error rate (%) 0. but when the digit strings are spoken in a noisy shopping mall environment the word error rate rises to 2.5 6. the word error rates are very low (0.0% • spontaneous speech input for booking airlines travel (the airline travel information system.
000 word vocabulary and a word error rate of about 15% • recorded live telephone conversations between two unrelated individuals (the switchboard task [42]) with a vocabulary of 45. eﬃciency. most constrained tasks. we need to be able to handle users talking over the voice prompts (socalled bargein conditions). Before ASR systems become ubiquitous in society. (the broadcast news task) with a 210. In the system area we need large improvements in accuracy. and even sentences so as to maintain an intelligent dialog with a customer. or NAB) with a vocabulary of 64. and robustness in order to utilize the technology for a wide range of tasks. phrases. and ﬁnally we need better methods of conﬁdence scoring of words.000 words and a word error rate of about 27%. many improvements will be required in both system performance and operational performance. In the operational area we need better methods of detecting when a person is speaking to a machine and isolating the spoken input from the background. on a wide range of processors.000 words and a word error rate of about 35%.6% • narrated news broadcasts from a range of TV news providers like CNBC. we need more reliable and accurate utterance rejection methods so we can be sure that a word needs to be repeated when poorly recognized the ﬁrst time. ASR systems fall far short of human speech perception in all but the simplest. .000 words and a word error rate of 6.5 Challenges in ASR Technology So far.9.5 Challenges in ASR Technology 179 • read text from a range of business magazines and newspapers (the North American business task. 9. and under a wide range of operating conditions. and a separate task for live telephone conversations between two family members (the call home task) with a vocabulary of 28.
Conclusion

We have attempted to provide the reader with a broad overview of the field of digital speech processing and to give some idea as to the remarkable progress that has been achieved over the past 4–5 decades. Digital speech processing systems have permeated society in the form of cellular speech coders, synthesized speech response systems, and speech recognition and understanding systems that handle a wide range of requests about airline flights, stock price quotations, specialized help desks, etc.

There remain many difficult problems yet to be solved before digital speech processing will be considered a mature science and technology. Our basic understanding of the human articulatory system, and how the various muscular controls come together in the production of speech, is rudimentary at best, and our understanding of the processing of speech in the human brain is at an even lower level. In spite of our shortfalls in understanding, we have been able to create remarkable speech processing systems whose performance increases at a steady pace. A firm understanding of the fundamentals of acoustics, linguistics, signal processing, and perception provides the tools for building systems that work and can be used by the general public. As we increase our basic understanding of speech, the application systems will only improve, and the pervasiveness of speech processing in our daily lives will increase dramatically, with the end result of improving the productivity in our work and home environments.
Acknowledgments

We wish to thank Professor Robert Gray, Editor-in-Chief of Now Publishers' Foundations and Trends in Signal Processing, for inviting us to prepare this text. His patience, technical advice, and editorial skill were crucial at every stage of the writing, which greatly improved the final result. We also wish to thank Abeer Alwan, Yariv Ephraim, Luciana Ferrer, Sadaoki Furui, and Tom Quatieri for their detailed and perceptive comments. Of course, we are responsible for any weaknesses or inaccuracies that remain in this text.
References

[1] J. Allen and L. R. Rabiner, "A unified theory of short-time spectrum analysis and synthesis," Proceedings of the IEEE, vol. 65, no. 11, pp. 1558–1564, November 1977.
[2] B. S. Atal, "Predictive coding of speech at low bit rates," IEEE Transactions on Communications, vol. COM-30, no. 4, pp. 600–614, April 1982.
[3] B. S. Atal and S. L. Hanauer, "Speech analysis and synthesis by linear prediction of the speech wave," Journal of the Acoustical Society of America, vol. 50, pp. 561–580, 1971.
[4] B. S. Atal and J. R. Remde, "A new model of LPC excitation for producing natural-sounding speech at low bit rates," Proceedings of IEEE ICASSP, pp. 614–617, 1982.
[5] B. S. Atal and M. R. Schroeder, "Adaptive predictive coding of speech signals," Bell System Technical Journal, vol. 49, pp. 1973–1986, October 1970.
[6] B. S. Atal and M. R. Schroeder, "Predictive coding of speech signals and subjective error criterion," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-27, pp. 247–254, June 1979.
[7] B. S. Atal and M. R. Schroeder, "Improved quantizer for adaptive predictive coding of speech signals at low bit rates," Proceedings of ICASSP, pp. 535–538, April 1980.
[8] T. P. Barnwell III, "Recursive windowing for generating autocorrelation analysis for LPC analysis," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-29, no. 5, pp. 1062–1066, October 1981.
[9] T. P. Barnwell III, K. Nayebi, and C. H. Richardson, Speech Coding, A Computer Laboratory Textbook, John Wiley and Sons, 1996.
[10] L. E. Baum, "An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes," Inequalities, vol. 3, pp. 1–8, 1972.
[11] L. E. Baum, T. Petrie, G. Soules, and N. Weiss, "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains," Annals in Mathematical Statistics, vol. 41, no. 1, pp. 164–171, 1970.
[12] W. R. Bennett, "Spectra of quantized signals," Bell System Technical Journal, vol. 27, pp. 446–472, July 1948.
[13] M. Berouti, H. Garten, P. Kabal, and P. Mermelstein, "Efficient computation and encoding of the multipulse excitation for LPC," Proceedings of ICASSP, pp. 384–387, March 1984.
[14] M. Beutnagel, A. Conkie, and A. Syrdal, "Diphone synthesis using unit selection," Third Speech Synthesis Workshop, Jenolan Caves, Australia, November 1998.
[15] M. Beutnagel and A. Conkie, "Interaction of units in a unit selection database," Proceedings of Eurospeech '99, Budapest, Hungary, September 1999.
[16] B. Bogert, M. Healy, and J. Tukey, "The quefrency alanysis of times series for echos: Cepstrum, pseudo-autocovariance, cross-cepstrum, and saphe cracking," in Proceedings of the Symposium on Time Series Analysis, (M. Rosenblatt, ed.), New York: John Wiley and Sons, 1963.
[17] E. Bresch, J. Nielsen, K. Nayak, and S. Narayanan, "Synchronized and noise-robust audio recordings during realtime MRI scans," Journal of the Acoustical Society of America, vol. 120, pp. 1791–1794, October 2006.
[18] C. S. Burrus and R. A. Gopinath, Introduction to Wavelets and Wavelet Transforms, Prentice-Hall Inc., 1998.
[19] J. P. Campbell Jr., V. C. Welch, and T. E. Tremain, "An expandable error-protected 4800 bps CELP coder," Proceedings of ICASSP, pp. 735–738, May 1989.
[20] F. Charpentier and M. Stella, "Diphone synthesis using an overlap-add technique for speech waveform concatenation," Proceedings of International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 2015–2018, 1986.
[21] J. H. Chung and R. W. Schafer, "Performance evaluation of analysis-by-synthesis homomorphic vocoders," Proceedings of IEEE ICASSP, vol. 2, pp. 117–120, March 1992.
[22] C. H. Coker, "A model of articulatory dynamics and control," Proceedings of IEEE, vol. 64, no. 4, pp. 452–459, April 1976.
[23] R. V. Cox, S. L. Gay, Y. Shoham, S. R. Quackenbush, N. Seshadri, and N. S. Jayant, "New directions in subband coding," IEEE Journal of Selected Areas in Communications, vol. 6, no. 2, pp. 391–409, February 1988.
[24] R. E. Crochiere and L. R. Rabiner, Multirate Digital Signal Processing, Prentice-Hall Inc., 1983.
[25] R. E. Crochiere, S. A. Webber, and J. L. Flanagan, "Digital coding of speech in subbands," Bell System Technical Journal, vol. 55, pp. 1069–1085, October 1976.
vol. Speech and Signal Processing. August 1969. pp. its deﬁnition. ASSP26. and K. pp. E. “Parallel processing techniques for estimating pitch period of speech in the time domain. vol. [35] J. no. [33] J. Walter de Gruyter. [34] J. Netherlands: Kluwer Academic Publishers. 52–59. [40] S. R. 1970. L. J. vol. 2. R. [36] J. 1993. pp. eds. and Signal Processing. H. Denes and E.” IEEE Spectrum. Dutoit. D. [27] S. and N. 61. “SWITCHBOARD: Telephone Speech corpus for research and development. “Diﬀerential quantization of communication signals. Princeton: Institute for Defense Analyses. vol. Flanagan. and J. L. C.” IEEE Transactions on Acoustics. Freeman Company. “Delta modulation — a new method of PCM transmission using the 1unit code. 1960. [41] O. 442–466. February 1986. Davis and P. K. Acoustic Theory of Speech Production. “Speaker independent isolated word recognition using dynamic features of speech spectrum. NY: Marcel Dekker.361. pp. Fletcher and W.” Bell Labs Record. (S. 3. Sondhi.” IEEE Proceedings. Shipley.S. Furui. Speech Analysis. vol. March 1975. L. 357–366.” Proceedings of ICASSP 1992. December 1952. [38] G. 254–272. ASSP29. vol. [43] B. Umeda. 5. “Loudness.” Journal of Acoustical Society of America. pt. no. Cutler. pp. Godfrey. Ghitza. Coker. An Introduction to TexttoSpeech Synthesis. July 29. no. C. 1997. 82– 108. pp. B.” IEEE Transactions on Acoustics Speech. Mermelstein. Gold and L. J. [39] S. McDaniel. Ishizaka. “Audiotry nerve representation as a basis for speech processing. October 1933. March 1973.” IEEE Transactions on Acoustics. April 1981. 28. pp. measurement and calculation. October 1970. The speech chain. 485–506. 2. Flanagan. “Synthesis of speech from a dynamic model of the vocal cords and vocal tract. [30] H. 2nd Edition. 17. deJager. Dudley. August 1980. [32] G. “Cepstral analysis technique for automatic speaker veriﬁcation. B. Speech. C. 1939. vol. “The vocoder.. 46. [42] J. Holliman. W. Rabiner. pp.” Hidden Markov Models for Speech. 
“Hidden Markov Analysis: An Introduction. L.” Bell System Technical Journal. The Hague: Mouton & Co. 268– 278. L. Munson. 22–45. 1992. 1952. N. 1. W. H. 2. Patent 2. [29] P. 1991. 2. 122–126. Synthesis and Perception. Forney.” Philips Research Reports. 1972. [31] T. pp. Ferguson. Fant. “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. 7. D. Flanagan. no.References 185 [26] C. Signal Processing. SpringerVerlag. pp.).” Journal of Acoustical Society of America. no. 442–448. pp.605. Rabiner. . Pinson. “The Viterbi algorithm.” in Advances in Speech Signal Processing. “Synthetic voices for computers. [28] F. 453– 485.” U. Schafer. vol. Furui. R. [37] H. Furui and M. 517–520. pp. 1980. vol. 54.
57–60. Furui and M. [56] N. Mercer.” Philips Technical Review. Wilpon. A. vol. Statistical Methods for Speech Recognition. Roucos. “Line spectrum representation of linear predictive coeﬃcients of speech signals. no. pp. Jayant. L. Cambridge: MIT Press. vol. 1997. 1972. (S. “Toeplitz and circulant matrices: A review. B. pp. Hon. 1257– 1260. S. September 1973. “How may I help you?. 53A. 1997. no. R. vol. Sondhi. Itakura and S.” Bell System Technical Journal. 651–699. [46] R. pp. Hermansky. pp. “Adaptive quantization with one word memory. Marcel Dekker. “Code modulation with digitally controlled companding for speech transmission. 535(a). 17–21. Prague. pp. [59] F. A.W. Itakura and S. S. Itakura. vol. M. Sachs.” Bell System Technical Journal. A. C17–C20. 373–376. [60] F. M. 1119–1144. no. Jelinek. 1996.” in Advances in Speech Signal Processing. [50] A. pp. 2.” Proceedings of ICASSP96. pp.” Foundations and Trends in Communications and Information Theory. Dallas TX. 155–239. 4–28. Huang.” Electronics and Communications in Japan. and J. PrenticeHall. p. [58] N. 1233–1268. “Vector quantization. [57] N. 1984.). Saito. eds.. “Adaptive delta modulation with a onebit memory. pp. [45] R. [55] F. Jelinek. 36–43. pp.” Bell System Technical Journal. Riemens. Gorin. 51. “Principles of lexical language modeling for speech recognition. “Synthesis of voiced sounds from a twomass model of the vocal cords. Hunt and A. Umezaki. R. 3. s35(A). Itakura and T. Acero. Gray.” in Proceedings of First European Conference on Signal Analysis and Prediction. Jayant. “Distance measure for speech recognition based on the smoothed group delay spectrum. 1968. M. 2001. Ishizaka and J. “Auditory modeling in automatic recognition of speech. Gray. 1996. G. Digital Coding of Waveforms. 2006. 57. April 1987. Saito. Greefkes and K.186 References [44] A. [52] F. L. M. March 1970. [53] F. 1970. pp. 6. “Analysissynthesis telephony based upon the maximum likelihood method. pp. [49] X. pp. April 1984. [47] J. 
[54] F.” Journal of Acoustical Society of America. “Unit selection in a concatenative speech synthesis system using a large speech database. Parker. 1.” Proceedings of 6th International of Congress on Acoustics. [51] K. and H. Flanagan. pp. and S. “A statistical method for estimation of speech spectral density and formant frequencies. 1991. Czech Republic. L.” IEEE Signal Processing Magazine. 335–353. Jayant and P. pp. Black. 1. 321–342. Noll. PrenticeHall Inc.” in Proceedings of ICASSP87. [48] H. . 1970. S. Spoken Language Processing. vol.” Proceedings of the Interactive Voice Technology for Telecommunications Applications (IVTTA). Atlanta.
Speech and Signal Processing. Makhoul.” IEEE Transactions on Acoustics. [65] D. 1979. CRC Press. Theory and Practice. 121–133. pp. “Linear prediction: A tutorial review. 18. F.References 187 [61] B. no. no. Rabiner. Malah. vol. Levinson. September 1987. Klatt. vol. 84–95. vol. [68] R. 561–580. December 1972. pp. L. pp. 62. October 1986. “A database for speakerindependent digit recognition. M. 129–137. H. Speech. Koenig.” Journal of the Acoustical Society of America. 2007. Paris. F. 737–793. pp. Kroon. 1984. 64. [64] D.” Journal of the Acoustical Society of America. 32.1–42. and J. F. 64. vol. [72] P. D. 1054–1063. and Signal Processing.11. 82. “Software for a cascade/parallel formant synthesizer. no. H. 1980. Linde. E. 63. [70] Y. 67. ASSP35. Buzo. 7. vol. “The sound spectrograph. Markel. Speech and Signal Processing. [77] J. J. E. S. Markel and A. vol. C. Levinson. ASSP34. Deprettere. 367– 377. Schwarz. Klatt. “An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition.4.” IEEE Transactions on Acoustics. Makhoul. 27. M. pp. pp. 28. 1975. pp. 1235–1249. March 1982. 6. vol. V. Y.” Bell System Technical Journal. [62] B. no. Sondhi. 1946. D. “Timedomain algorithms for harmonic bandwidth reduction and timescaling of pitch signals. R. pp.” IEEE Transactions on Audio and Electroacoustics. vol. and R. 1577–1581.” Proceedings of ICASSP 1984. 42.” IEEE Transactions on Information Theory. pp. and M. Huggins. Loizou.” IEEE Transactions in Information Theory.” Journal of the Acoustical Society of America. France. “Maximum likelihood estimation for mixture multivariate stochastic observations of Markov chains. detection and compression in the cochlea. W. [66] W. R. [67] P.” IEEE Transactions on Acoustics. A. M. Dunn. [63] B. New York: SpringerVerlag. “An algorithm for vector quantizer design. no. [75] J. December 1978. Gray. no. [78] J.11. vol. H. Wilpon. 1983. R. vol. L. [76] D. G. 
“Maximum likelihood estimation for multivariate mixture observations of Markov chains. K. pp. 19–49. [73] R. [71] S. [74] J. 307–309. Speech Enhancement. 2. “Least square quantization in PCM. pp.” Journal of the Acoustical Society of America. Lloyd. “A mixed source model for speech compression and synthesis. 1986. Speech and Signal Processing. 971–995. Viswanathan. and A. Juang. 1985. and R. July 1987.” AT&T Technology Journal. vol. 2. and L. pp. Juang. Linear Prediction of Speech. Gray. 1035–1074. “A computational model of ﬁltering. Sondhi. 947–954.” IEEE Transactions on Communications. . Lyon. Juang. “Review of texttospeech conversion for English. pp. Rabiner. and M. “The SIFT algorithm for fundamental frequency estimation. H. 5. May 1982. 4. vol. vol. vol. E. 1976. pp. AU20. Lacey. “On the use of bandpass liftering in speehc recognition. H. [69] S. Leonard. H. COM28. P. Sluyter.” in Proceedings of IEEE International Conference on Acoustics. H. “Regularpulse excitation: A nove approach to eﬀective and eﬃcient multipulse coding of speech.” Proceedings of IEEE. G. January 1980. pp.
no. V. 293–309.” IRE Transactions on Information Theory.” IEEE Spectrum. pp.” Computational Linguistics. June 1968. “Pitch synchronous waveform processing techniques for texttospeech synthesis using diphones.” MS Thesis. DiscreteTime Signal Processing. Paez and T. Oppenheim and R. [84] P.. March 1960. pp. no. “The 1994 benchmark tests for the ARPA spoken language program. [85] A. 1. 293– 309. no. V. MIT. 57–62. vol. McCree and T. Oppenheim. V. R. [82] E. Portnoﬀ. pp. 1989. [93] M. Quatieri. no. pp. [87] A. R. [91] M. “A mixed excitation LPC vocoder model for low bit rate speech coding. 5–36. Mohri. R. 7–12. “Finitestate transducers in language and speech processing. Pallett et al. R.. “A model for synthesizing speech by rule.. Moulines and F. AU16. 9. 1990. November 1975. Technical Report No. 1964. “A quasionedimensional simulation for the timevarying vocal tract. 2. Schafer. Schafer. R. [86] A. “On the use of autocorrelation analysis for pitch detection. 2.” Proceedings of IEEE. pp. vol. [95] L. 56. Massachusetts. G. [94] T. Speech. 1597–1614. Discretetime speech signal processing. Oppenheim. [97] L. ASSP25. April 1972. 2002. Glisson. 1. Oppenheim. 221–228. 225–230. V.” Speech Communication. “Superposition in a class of nonlinear systems. vol. R. vol. pp. 5–6. [92] D. V. vol. [89] A. “Quantizing for minimum distortion. 24–33. no. 3. D. no.” Bell System Technical Journal. and T. of Electronics. V. 45.” IEEE Transactions on Audio and Electroacoustics. vol. PrenticeHall Inc. Oppenheim.” PhD dissertation. Prentice Hall. pp. “Cepstrum pitch determination.” IEEE Proceedings. February 1969. vol. pp. R. 1965. [88] A. vol. [80] A. F. 41. Department of Electrical Engineering. vol. 257–286. Rabiner. vol. 77. P. pp. July 1995. 1999. 432. V. 54. pp. no. vol. “A tutorial on hidden Markov models and selected applications in speech recognition. pp. 23. H. Also: MIT Research Lab. “Speech spectrograms using the fast Fourier transform. 9. 2. 
“Nonlinear ﬁltering of multiplied and convolved signals. 7. “A comparative study of various schemes for speech encoding. vol.” IEEE Transactions on Communications. and Signal Processing. Rabiner. Schafer. 4. . Barnwell III. 1264–1291. Charpentier. W. Cambridge. [81] M. vol. 1995. pp. pp.” Journal of the Acoustical Society of America. 7–13. Max. Buck. no. W. Com20. Rabiner. no.” IEEE Transactions on Acoustics. MIT. AU17. August 1968. [83] A.188 References [79] J. 1973. [90] A. S.” Journal of the Acoustical Society of America. no.” Proceedings of 1995 ARPA Human Language Technology Workshop.” IEEE Transactions on Speech and Audio Processing. March 1969. vol. and J. February 1977. Stockham Jr. W. 8. February 1967. 1997. “Homomorphic analysis of speech. 2. M. 242–250. pp. [96] L. 269–312. August 1970.” IEEE Transactions on Audio and Electroacoustics. Noll. IT6. “Minimum meansquared error quantization in speech. Noll. Oppenehim. “A speech analysissynthesis system based on homomorphic ﬁltering.
Seneﬀ. 2. Juang. pp. 1270–1278. pp. 35–43. 2009. no. Speech and Signal Processing. W.” Journal of the Acoustical Society of America. 1956. 2007.” Journal of Phonetics. [102] L. “The self excited vocoder — an alternate approach to toll quality at 4800 bps. pp. [115] S. L. no. 16.” Journal of the Acoustical Society of America. 16. 1993. Technical Report No.” IEEE Signal Processing Magazine. Rabiner. 1968. H. 43.. Schroeter. 88. E. pp.” Springer Handbook of Speech Processing. “Speech synthesis by rule using an optimal selection of nonuniform synthesis units. no. 466. vol. Blackburn. vol. 297– 315. 1960. Rabiner. vol. Rabiner and B. R. Fundamentals of Speech Recognition. 166– 181. R. S. Springer. 2006. (In preparation). C. 2.” Acustica. Also: MIT Research Laboratory of Electronics.” Journal of Phonetics. pp. April 1986. Schafer and L. [106] R. PrenticeHall Inc. pp. Robinson and R. “Homomorphic systems and cepstrum analysis of speech. “System for automatic formant analysis of voiced speech. 37–53. “Two decades of statistical language modeling: Where do we go from here?. February 1975. vol. “Rateplace and temporalplace representations of vowels in the auditory nerve and anteroventral cochlear nucleus. [103] D. Flanagan. W. Sagisaka. E. 11.” Bell System Technical Journal. 8. “A joint synchrony/meanrate model of auditory speech processing. 1985. 55–76. vol. R.” Springer Handbook of Speech Processing and Communication. R. and E. [100] L. S. vol. pp. [112] M. February 1970. [107] M. Schafer. “An introduction to hidden Markov models. D. “Basic principles of speech synthesis.” IEEE Proceedings. 50. Schroeder and B. Cambridge. Schafer. vol. R. R. vol. 453–456. W. MIT.” Proceedings of IEEE ICASSP. 1988. “A vocoder for transmitting 10 kc/s speech over a 3. W. Juang. 10. pp. Atal. Rose and T. Rabiner and B. SpringerVerlag. “Echo removal by discrete generalized linear ﬁltering. 7. R.” Proceedings of ICASSP ’86. [111] R. [99] L.” British Journal of Applied Physics. pp. vol. 5. vol. 
pp. “Computer synthesis of speech by concatenation of formant coded words. no. pp.” Bell System Technical Journal. David. C. W. 1969. [108] Y. “Codeexcited linear prediction (CELP): Highquality speech at very low bit rates. 679–682.5 kc/s channel. May–June 1971. [101] L. 1985. [104] R. Sambur. and J. Rosenfeld. “Eﬀect of glottal pulse shape on the quality of natural vowels. . 822–828. 47. Schafer. Rabiner and M. 2000. 458–465.” PhD dissertation.. W. 1541–1558. Schroeder and E. P. [110] R. Rabiner and R. no. [109] R. [114] J. R. PrenticeHall Inc. Schafer. “An algorithm for determining the endpoints of isolated utterances. B. 4. Massachusetts. 1988. 937–940. “A redetermination of the equalloudness contours for pure tones. 54.” Proceedings of the International Conference on Acoustics. Rosenberg. February 1971. R. H. C. [105] A. Young. H. Sachs. R.References 189 [98] L. Barnwell III. Theory and Application of Digital Speech Processing. [113] M. pp. 1988. Dadson.
[125] G. “Instantaneous companding of quantized signals. Weaver. PrenticeHall Inc. Juang. 702–710. Burrus. pp. Zwicker and H. November 2003. 63. Volkman. [133] W. [131] V. Treitel. 653–709. Strang and T. COM23. COM30. 260–269. Speech. “Factoring veryhighdegree polynomials. Viswanathan and J.” vol. Acoustic Phonetics. [124] T.” American Journal of Psychology. [118] B. M. May 1957. vol. [117] G. C. 678– 692. and V. SpringerVerlag. no. vol. B. and Y. Soong and B. . and S. Ingebretsen. ASSP25. “Voiceexcited LPC coders for 9. P. G. pp. Multirate Systems and Filter Banks. “The relation of pitch to frequency. Makhoul. W. C. April 1977. vol.” IEEE Transactions on Speech and Audio Processing. S. 4. no. The Mathematical Theory of Communication. [128] C. vol. December 1975. T. ASSP23. IT13. “The residualexcited linear prediction vocoder with transmission rate below 9. Audio Signal Processing and Coding. no. and Signal Processing. no. T.6 kbps speech transmission. Wellesley. Wavelets and Filter Banks. J. E. 101–105. 53. [127] J.” Proceedings of IEEE. 2007. Sitton. pp.” IEEE Transactions on Acoustics. no. Russell. pp. Stevens and J. Vaidyanathan. vol. 3. Atti. Gray. 12. Magill. pp. T. April 1975. A. M. W. [121] K. Psychoacoustics. K.. University of Illinois Press. Viterbi. “A weighted cepstral distance measure for speech recognition. 558–561. Fastl. “Blind deconvolution through digital signal processing. pp.190 References [116] C. vol.” IEEE transactions on Communications. 329. [126] Y. “A new phase unwrapping algorithm. S. MA: WellesleyCambridge Press. pp. 170–177. Wiley Interscience. Nguyen. [120] A. K.” Bell System Technical Journal. 1990. 309–321. Stockham Jr. Tribolet. 1949.6 kbits/s. Cannon. MIT Press. vol. vol. Smith. Ward. 15–24.. pp. “The design of trellis waveform coders.” IEEE Transactions on Acoustics. [129] P. [130] R. vol. [122] S. February 1991.” IEEE Transactions on Acoustical. October 1987. 3. J. vol. pp. pp. [119] F. 1940. June 1975. 1. p. 1414–1422. 
1466–1474. “Quantization properties of transmission parameters in linear predictive systems. 1998. Painter. pp. Spanias. vol. Stewart.” IEEE Signal Processing Magazine. “Evaluation of the CMU ATIS system. Speech. 1. 2. no. January 1993. 27– 42. Stevens. Urbana. 36. [134] E.” Proceedings of DARPA Speech and Natural Language Workshop. [123] L. “Optimal quantization of LSP parameters. and Signal Processing. April 1982. 1993. and J.” IEEE Transactions on Communications. 6. Viswanathan. R. Fox. Un and D. April 1979. Makhoul. Linde.” IEEE Transactions on Information Theory. “Error bounds for convolutional codes and an asymptotically optimal decoding Aalgorithm. N. 2nd Edition. 35. and R. M. 1996. pp.H. April 1967. Tohkura. Speech and Signal Processing. Shannon and W. [132] A. 20.
Supplemental References

The specific references that comprise the Bibliography of this text are representative of the literature of the field of digital speech processing. In addition, we provide the following list of journals and books as a guide for further study. Listing the books in chronological order of publication provides some perspective on the evolution of the field.

Speech Processing Journals

• IEEE Transactions on Signal Processing. Main publication of the IEEE Signal Processing Society.
• IEEE Transactions on Speech and Audio Processing. Publication of the IEEE Signal Processing Society that is focused on speech and audio processing.
• Journal of the Acoustical Society of America. General publication of the American Institute of Physics. Papers on speech and hearing as well as other areas of acoustics.
• Speech Communication. A publication of the European Association for Signal Processing (EURASIP) and of the International Speech Communication Association (ISCA). Published by Elsevier.

General Speech Processing References

• Speech Analysis, Synthesis and Perception, J. L. Flanagan, Second Edition, Springer-Verlag, 1972.
• Digital Processing of Speech Signals, L. R. Rabiner and R. W. Schafer, Prentice Hall Inc., 1978.
• Speech Analysis, R. W. Schafer and J. D. Markel (eds.), IEEE Press Selected Reprint Series, 1979.
• Speech Communication, Human and Machine, D. O'Shaughnessy, Addison-Wesley, 1987 (Second Edition, IEEE Press, 2000).
• Advances in Speech Signal Processing, S. Furui and M. M. Sondhi (eds.), Marcel Dekker Inc., New York, 1991.
• Acoustic Phonetics, K. N. Stevens, MIT Press, 1998.
• Discrete-Time Processing of Speech Signals, J. R. Deller, Jr., J. H. L. Hansen and J. G. Proakis, Classic Reissue, Wiley-IEEE Press, 2000.
• Speech and Audio Signal Processing, B. Gold and N. Morgan, John Wiley and Sons, 2000.
• Digital Speech Processing, Synthesis and Recognition, S. Furui, Second Edition, Marcel Dekker Inc., New York, 2001.
• Discrete-Time Speech Signal Processing, T. F. Quatieri, Prentice Hall Inc., 2002.
• Speech Processing, A Dynamic and Optimization-Oriented Approach, L. Deng and D. O'Shaughnessy, Marcel Dekker Inc., 2003.
• Springer Handbook of Speech Processing and Speech Communication, J. Benesty, M. M. Sondhi and Y. Huang (eds.), Springer, Berlin, 2008.
• Theory and Application of Digital Speech Processing, L. R. Rabiner and R. W. Schafer, Prentice Hall Inc., 2009.

Speech Coding References

• Linear Prediction of Speech, J. D. Markel and A. H. Gray, Jr., Springer-Verlag, 1976.
• Digital Coding of Waveforms, N. S. Jayant and P. Noll, Prentice Hall Inc., 1984.
• Practical Approaches to Speech Coding, P. E. Papamichalis, Prentice Hall Inc., 1987.
• Vector Quantization and Signal Compression, A. Gersho and R. M. Gray, Kluwer Academic Publishers, 1992.
• Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal (eds.), Elsevier, 1995.
• Speech Coding, A Computer Laboratory Textbook, T. P. Barnwell and K. Nayebi, John Wiley and Sons, 1996.
• A Practical Handbook of Speech Coders, R. Goldberg and L. Riek, CRC Press, 2000.
• Speech Coding Algorithms, W. C. Chu, John Wiley and Sons, 2003.
• Digital Speech: Coding for Low Bit Rate Communication Systems, A. M. Kondoz, Second Edition, John Wiley and Sons, 2004.

Speech Synthesis

• From Text to Speech, J. Allen, S. Hunnicutt and D. Klatt, Cambridge University Press, 1987.
• Acoustics of American English, J. P. Olive, A. Greenwood and J. Coleman, Springer-Verlag, 1993.
• Progress in Speech Synthesis, J. VanSanten, R. Sproat, J. Olive and J. Hirschberg (eds.), Springer-Verlag, 1996.
• An Introduction to Text-to-Speech Synthesis, T. Dutoit, Kluwer Academic Publishers, 1997.
• Computing Prosody, Y. Sagisaka, N. Campbell and N. Higuchi (eds.), Springer-Verlag, 1997.
• Speech Processing and Synthesis Toolboxes, D. G. Childers, John Wiley and Sons, 2000.
• Text To Speech Synthesis: New Paradigms and Advances, S. Narayanan and A. Alwan (eds.), Prentice Hall Inc., 2004.
• Text-to-Speech Synthesis, P. Taylor, Cambridge University Press, 2008.

Speech Recognition and Natural Language Processing

• Fundamentals of Speech Recognition, L. Rabiner and B. H. Juang, Prentice Hall Inc., 1993.
• Connectionist Speech Recognition, A Hybrid Approach, H. Bourlard and N. Morgan, Kluwer Academic Publishers, 1994.
• Automatic Speech and Speaker Recognition, C. H. Lee, F. K. Soong and K. K. Paliwal (eds.), Kluwer Academic Publishers, 1996.
• Statistical Methods for Speech Recognition, F. Jelinek, MIT Press, 1997.
• Foundations of Statistical Natural Language Processing, C. Manning and H. Schutze, MIT Press, 1999.
• Speech and Language Processing, D. Jurafsky and J. H. Martin, Prentice Hall Inc., 2000.
• Spoken Language Processing, X. Huang, A. Acero and H. W. Hon, Prentice Hall Inc., 2001.
• Mathematical Models for Speech Technology, S. E. Levinson, John Wiley and Sons, 2005.

Speech Enhancement

• Digital Speech Transmission, Enhancement, Coding and Error Concealment, P. Vary and R. Martin, John Wiley and Sons, Ltd., 2006.
• Speech Enhancement, Theory and Practice, P. C. Loizou, CRC Press, 2007.

Audio Processing

• Applications of Digital Signal Processing to Audio and Acoustics, M. Kahrs and K. Brandenburg (eds.), Kluwer Academic Publishers, 1998.
• Audio Signal Processing and Coding, A. Spanias, T. Painter and V. Atti, John Wiley and Sons, 2007.