
Power-Normalized Cepstral Coefficients (PNCC) for Robust Speech Recognition

Chanwoo Kim and Richard M. Stern, Member, IEEE

Abstract—This paper presents a new feature extraction algorithm called power-normalized cepstral coefficients (PNCC) that is motivated by auditory processing. Major new features of PNCC processing include the use of a power-law nonlinearity that replaces the traditional log nonlinearity used in MFCC coefficients, a noise-suppression algorithm based on asymmetric filtering that suppresses background excitation, and a module that accomplishes temporal masking. We also propose the use of medium-time power analysis, in which environmental parameters are estimated over a longer duration than is commonly used for speech, as well as frequency smoothing. Experimental results demonstrate that PNCC processing provides substantial improvements in recognition accuracy compared to MFCC and PLP processing for speech in the presence of various types of additive noise and in reverberant environments, with only slightly greater computational cost than conventional MFCC processing, and without degrading the recognition accuracy that is observed while training and testing using clean speech. PNCC processing also provides better recognition accuracy in noisy environments than techniques such as Vector Taylor Series (VTS) and the ETSI Advanced Front End (AFE) while requiring much less computation. We describe an implementation of PNCC using "online processing" that does not require future knowledge of the input.

Index Terms—Robust speech recognition, feature extraction, physiological modeling, rate-level curve, power function, asymmetric filtering, medium-time power estimation, spectral weight smoothing, temporal masking, modulation filtering, online speech processing

EDICS Category: SPE-ROBU, SPE-SPER

Chanwoo Kim (Corresponding Author) is with The Microsoft Corporation, Redmond, WA 98052 USA (email: chanwook@microsoft.com). Richard M. Stern is with the Language Technologies Institute and the Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213 USA (email: rms@cs.cmu.edu).

Manuscript received XXXXX, XXXXX; revised XXXXX, XXXXX.

I. INTRODUCTION

IN recent decades following the introduction of hidden Markov models (e.g. [1]) and statistical language models (e.g. [2]), the performance of speech recognition systems in benign acoustical environments has dramatically improved. Nevertheless, most speech recognition systems remain sensitive to the nature of the acoustical environments within which they are deployed, and their performance deteriorates sharply in the presence of sources of degradation such as additive noise, linear channel distortion, and reverberation.

One of the most challenging contemporary problems is that recognition accuracy degrades significantly if the test environment is different from the training environment and/or if the acoustical environment includes disturbances such as additive noise, channel distortion, speaker differences, reverberation, and so on. Over the years dozens if not hundreds of algorithms have been introduced to address this problem. Many of these conventional noise compensation algorithms have provided substantial improvements in accuracy for recognizing speech in the presence of quasi-stationary noise (e.g. [3], [4], [5], [6], [7], [8], [9], [10]). Unfortunately, these same algorithms frequently do not provide significant improvements in more difficult environments with transitory disturbances such as a single interfering speaker or background music (e.g. [11]).

Many of the current systems developed for automatic speech recognition, speaker identification, and related tasks are based on variants of one of two types of features: mel frequency cepstral coefficients (MFCC) [12] or perceptual linear prediction (PLP) coefficients [13]. Spectro-temporal features have also been recently introduced with promising results (e.g. [14], [15]). It has been observed that two-dimensional Gabor filters provide a reasonable approximation to the spectro-temporal response fields of neurons in the auditory cortex, which has led to various approaches to extracting features for speech recognition (e.g. [16], [17], [18], [19]). In this paper we describe the development of an additional feature set for speech recognition which we refer to as power-normalized cepstral coefficients (PNCC).

We introduced several previous versions of PNCC processing in [20] and [21], and these implementations have been evaluated by several teams of researchers and compared to several different algorithms including zero crossing peak amplitude (ZCPA) [22], RASTA-PLP [23], perceptual minimum variance distortionless response (PMVDR) [24], invariant-integration features (IIF) [25], and subband spectral centroid histograms (SSCH) [26]. As described in several papers (e.g. [27], [28], [29], [30], [31]), PNCC has been shown to provide better speech recognition accuracy than the other algorithms cited above, particularly in conditions where the training and testing environments are mismatched. For example, Müller and Mertins [32] found that PNCC provides better results than the original IIF features, but that if IIF is combined with PNCC (PN-IIF), the result is somewhat better than the original PNCC. Similar results have been obtained with delta-spectral cepstral coefficients (DSCC) [33] as well. Our previous implementations of PNCC have also been employed in industry [34]. In selected other studies, portions of PNCC processing have been incorporated into other feature extraction algorithms (e.g. [35], [36]). Even though previous implementations of PNCC processing appear to be promising, a major problem is that they cannot easily be implemented for online applications without look-ahead over an entire sentence. In addition, previous implementations of PNCC did not consider the effects of temporal masking, as is the case for MFCC and PLP processing.
Fig. 1. Comparison of the structure of the MFCC, PLP, and PNCC feature extraction algorithms. The modules of PNCC that function on the basis of "medium-time" analysis (with a temporal window of 65.6 ms) are plotted in the rightmost column. If the shaded blocks of PNCC are omitted, the remaining processing is referred to as simple power-normalized cepstral coefficients (SPNCC). (PNCC pipeline, rightmost column: pre-emphasis → STFT → magnitude squared → gammatone frequency integration P[m, l] → medium-time power calculation Q̃[m, l] → asymmetric noise suppression with temporal masking R̃[m, l] → weight smoothing S̃[m, l] → time-frequency normalization T[m, l] → mean power normalization U[m, l] → power-function nonlinearity (·)^(1/15) V[m, l] → DCT → mean normalization → PNCC coefficients.)

The implementation of PNCC processing in the present paper has been significantly revised to address these issues in a fashion that enables it to provide superior recognition accuracy over a broad range of conditions of noise and reverberation, using features that are computable in real time with "online" algorithms that do not require extensive look-ahead, and with a computational complexity that is comparable to that of traditional MFCC and PLP features. In the subsequent subsections of this Introduction we discuss the broader motivation and overall structure of PNCC processing. We specify the key elements of the processing in some detail in Sec. II. In Sec. III we compare the recognition accuracy provided by PNCC processing under a variety of conditions with that of other processing schemes, and we consider the impact of various components of PNCC on these results. We compare the computational complexity of the MFCC, PLP, and PNCC feature extraction algorithms in Sec. IV, and we summarize our results in the final section.

A. Broader motivation for the PNCC algorithm

The development of PNCC feature extraction was motivated by a desire to obtain a set of practical features for speech recognition that are more robust with respect to acoustical variability in their native form, without loss of performance when the speech signal is undistorted, and with a degree of computational complexity that is comparable to that of MFCC and PLP coefficients. While many of the attributes of PNCC processing have been strongly influenced by consideration of various attributes of human auditory processing, in developing the specific processing performed we have favored approaches that provide pragmatic gains in robustness at small computational cost over approaches that are more faithful to auditory physiology.

Some of the innovations of PNCC processing that we consider to be the most important include:
• The replacement of the log nonlinearity in MFCC processing by a power-law nonlinearity that is carefully chosen to approximate the nonlinear relation between signal intensity and auditory-nerve firing rate. We believe that this nonlinearity provides superior robustness by suppressing small signals and their variability, as discussed in Sec. II-G.

• The use of "medium-time" processing with a duration of 50-120 ms to analyze the parameters characterizing environmental degradation, in combination with the traditional short-time Fourier analysis with frames of 20-30 ms used in conventional speech recognition systems. We believe that this approach enables us to estimate environmental degradation more accurately while maintaining the ability to respond to rapidly-changing speech signals, as discussed in Sec. II-B.

• The use of a form of "asymmetric nonlinear filtering" to estimate the level of the acoustical background noise for each time frame and frequency bin. We believe that this approach enables us to remove slowly-varying components easily, without the need to deal with many of the artifacts associated with over-correction in techniques such as spectral subtraction [37], as discussed in Sec. II-C. As shown in Sec. III-C, this approach is more effective than RASTA processing [23].

• The development of a signal processing block that realizes temporal masking.

• The development of computationally efficient realizations of the algorithms above that support "online" real-time processing that does not require substantial non-causal look-ahead of the input signal to compute the PNCC coefficients.

B. Structure of the PNCC algorithm

Figure 1 compares the structure of conventional MFCC processing [12], PLP processing [13], [23], and the new PNCC approach which we introduce in this paper. As noted above, the major innovations of PNCC processing include the redesigned nonlinear rate-intensity function, along with the series of processing elements that suppress the effects of background acoustical activity based on medium-time analysis.

As can be seen from Fig. 1, the initial processing stages of PNCC are quite similar to the corresponding stages of MFCC and PLP analysis, except that the frequency analysis is performed using gammatone filters [38]. This is followed by the series of nonlinear time-varying operations, performed using the longer-duration temporal analysis, that accomplish noise subtraction as well as providing a degree of robustness with respect to reverberation. The final stages of processing are also similar to MFCC and PLP processing, with the exception of the carefully-chosen power-law nonlinearity with exponent 1/15, which will be discussed in Sec. II-G below. Finally, we note that if the shaded blocks in Fig. 1 are omitted, the processing that remains is referred to as simple power-normalized cepstral coefficients (SPNCC). SPNCC processing has been employed in other studies on robust recognition (e.g. [36]).

II. COMPONENTS OF PNCC PROCESSING

In this section we describe and discuss the major components of PNCC processing in greater detail. While the detailed description below assumes a sampling rate of 16 kHz, the PNCC features are easily modified to accommodate other sampling frequencies.

A. Initial processing

As in the case of MFCC, a pre-emphasis filter of the form H(z) = 1 − 0.97z⁻¹ is applied. A short-time Fourier transform (STFT) is performed using Hamming windows of duration 25.6 ms, with 10 ms between frames and a DFT size of 1024. Spectral power in 40 analysis bands is obtained by weighting the magnitude-squared STFT outputs for positive frequencies by the frequency response associated with a 40-channel gammatone-shaped filter bank [38] whose center frequencies are linearly spaced in Equivalent Rectangular Bandwidth (ERB) [39] between 200 Hz and 8000 Hz, using the implementation of gammatone filters in Slaney's Auditory Toolbox [40]. In previous work [20] we observed that the use of gammatone frequency weighting provides slightly better ASR accuracy in white noise, but the differences compared to the traditional triangular weights in MFCC processing are small.

The frequency response of the gammatone filterbank is shown in Fig. 2. In each channel the area under the squared transfer function is normalized to unity:

$$\int_0^{8000} |H_l(f)|^2 \, df = 1 \tag{1}$$

where $H_l(f)$ is the frequency response of the $l$th gammatone channel. To reduce the amount of computation, we modified the gammatone filter responses slightly by setting $H_l(f)$ equal to zero for all values of $f$ for which the unmodified $H_l(f)$ would be less than 0.5 percent of its maximum value (corresponding to −46 dB).

Fig. 2. The frequency response of a gammatone filterbank with the area of each squared frequency response normalized to be unity. Characteristic frequencies are uniformly spaced between 200 and 8000 Hz according to the Equivalent Rectangular Bandwidth (ERB) scale [39].

We obtain the short-time spectral power P[m, l] using the squared gammatone summation

$$P[m,l] = \sum_{k=0}^{(K/2)-1} \left| X[m, e^{j\omega_k}] \, H_l(e^{j\omega_k}) \right|^2 \tag{2}$$

where $K$ is the DFT size, $m$ and $l$ represent the frame and channel indices, respectively, and $\omega_k = 2\pi k / K$. $X[m, e^{j\omega_k}]$ is the short-time spectrum of the $m$th frame of the signal.
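To make the initial processing concrete, the following sketch (in Python with NumPy; ours, not from the paper) computes P[m, l] as in Eq. (2). It assumes that the gammatone weighting matrix H has been precomputed externally (e.g., with Slaney's Auditory Toolbox) and already normalized per Eq. (1); the function name and argument layout are our own.

```python
import numpy as np

def gammatone_power(x, H, frame_len=410, hop=160, n_fft=1024):
    """Short-time gammatone spectral power P[m, l] per Eq. (2).

    x : 1-D waveform sampled at 16 kHz.
    H : (L, n_fft // 2) magnitude response of each gammatone channel,
        sampled at the positive DFT bins and assumed normalized so that
        each squared response integrates to one (Eq. (1)).
    """
    window = np.hamming(frame_len)                 # 25.6-ms Hamming window
    n_frames = 1 + (len(x) - frame_len) // hop     # 10-ms frame advance
    P = np.empty((n_frames, H.shape[0]))
    for m in range(n_frames):
        frame = window * x[m * hop : m * hop + frame_len]
        X = np.fft.rfft(frame, n_fft)[: n_fft // 2]    # bins k = 0 ... K/2 - 1
        P[m] = (np.abs(X) ** 2) @ (H ** 2).T           # sum over k of |X H_l|^2
    return P
```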
B. Temporal integration for environmental analysis

Most speech recognition and speech coding systems use analysis frames of duration between 20 ms and 30 ms. Nevertheless, it is frequently observed that longer analysis windows provide better performance for noise modeling and/or environmental normalization (e.g. [21], [41], [42]), because the power associated with most background noise conditions changes more slowly than the instantaneous power associated with speech.

In PNCC processing we estimate a quantity we refer to as "medium-time power" Q̃[m, l] by computing the running average of P[m, l], the power observed in a single analysis frame, according to the equation

$$\tilde{Q}[m,l] = \frac{1}{2M+1} \sum_{m'=m-M}^{m+M} P[m',l] \tag{3}$$

where $m$ represents the frame index and $l$ is the channel index. We will apply the tilde symbol to all power estimates that are performed using medium-time analysis.

We observed experimentally that the choice of the temporal integration factor M has a substantial impact on performance in white noise (and presumably other types of broadband background noise). This factor has less impact on the accuracy that is observed in more dynamic interference or reverberation, although the longer temporal analysis window does provide some benefit in these environments as well [43]. We chose the value M = 2 (corresponding to five consecutive windows with a total net duration of 65.6 ms) on the basis of these observations. Since Q̃[m, l] is a moving average of P[m, l], it is a low-pass function of m; if M = 2, the upper cutoff frequency is approximately 15 Hz. Nevertheless, if we were to use features based on Q̃[m, l] directly for speech recognition, recognition accuracy would be degraded because the onsets and offsets of the frequency components would become blurred. Hence in PNCC we use Q̃[m, l] only for noise estimation and compensation, which are in turn used to modify the information based on the short-time power estimates P[m, l]. We also apply smoothing over the various frequency channels, which will be discussed in Sec. II-E below.
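A minimal NumPy sketch of Eq. (3) follows; frames near the utterance edges are averaged over however many neighbors exist, a boundary choice the text does not specify.

```python
import numpy as np

def medium_time_power(P, M=2):
    """Medium-time power Q~[m, l]: running mean of P over 2M+1 frames (Eq. (3))."""
    n_frames = P.shape[0]
    Q = np.empty_like(P)
    for m in range(n_frames):
        lo, hi = max(0, m - M), min(n_frames, m + M + 1)
        Q[m] = P[lo:hi].mean(axis=0)   # M = 2: five 25.6-ms windows, 65.6 ms total
    return Q
```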
Fig. 3. Functional block diagram of the modules for asymmetric noise suppression (ANS) and temporal masking in PNCC processing. All processing is performed on a channel-by-channel basis. Q̃[m, l] is the medium-time-averaged input power as defined by Eq. (3), R̃[m, l] is the speech output of the ANS module, and S̃[m, l] is the output after temporal masking (which is applied only to the speech frames). The block labelled Temporal Masking is depicted in detail in Fig. 5.

C. Asymmetric noise suppression

In this section we discuss a new approach to noise compensation which we refer to as asymmetric noise suppression (ANS). This procedure is motivated by the observation mentioned above that the speech power in each channel usually changes more rapidly than the background noise power in the same channel. Alternately, we might say that speech usually has a higher-frequency modulation spectrum than noise. Motivated by this observation, many algorithms have been developed using either high-pass filtering or band-pass filtering in the modulation spectrum domain (e.g. [23], [44]). The simplest way to accomplish this objective is to perform high-pass filtering in each channel (e.g. [45], [46]), which has the effect of removing slowly-varying components that typically represent the effects of additive noise sources rather than the speech signal.

One significant problem with the application of conventional linear high-pass filtering in the power domain is that the filter output can become negative. Negative values for the power coefficients are problematic in the formal mathematical sense (in that power itself is positive). They also cause problems in the application of the compressive nonlinearity and in speech resynthesis unless a suitable floor value is applied to the power coefficients (e.g. [46]). Rather than filtering in the power domain, we could perform filtering after applying the logarithmic nonlinearity, as is done with conventional cepstral mean normalization in MFCC processing. Nevertheless, as will be seen in Sec. III, this approach is not very helpful for environments with additive noise. Spectral subtraction is another way to reduce the effects of noise whose power changes slowly (e.g. [37]). In spectral subtraction techniques, the noise level is typically estimated from the power of non-speech segments (e.g. [37]) or through the use of a continuous-update approach (e.g. [45]). In the approach that we introduce, we obtain a running estimate of the time-varying noise floor using an asymmetric nonlinear filter, and subtract that from the instantaneous power.

Figure 3 is a block diagram of the complete asymmetric noise suppression processing with temporal masking. Let us begin by describing the general characteristics of the asymmetric nonlinear filter that is the first stage of processing.
This filter is represented by the following equation for arbitrary input and output Q̃in[m, l] and Q̃out[m, l], respectively:

$$\tilde{Q}_{\text{out}}[m,l] = \begin{cases} \lambda_a \tilde{Q}_{\text{out}}[m-1,l] + (1-\lambda_a)\,\tilde{Q}_{\text{in}}[m,l], & \text{if } \tilde{Q}_{\text{in}}[m,l] \ge \tilde{Q}_{\text{out}}[m-1,l] \\ \lambda_b \tilde{Q}_{\text{out}}[m-1,l] + (1-\lambda_b)\,\tilde{Q}_{\text{in}}[m,l], & \text{if } \tilde{Q}_{\text{in}}[m,l] < \tilde{Q}_{\text{out}}[m-1,l] \end{cases} \tag{4}$$

where $m$ is the frame index, $l$ is the channel index, and $\lambda_a$ and $\lambda_b$ are constants between zero and one.

If λa = λb, it is easy to verify that Eq. (4) reduces to a conventional IIR filter that is lowpass in nature because the values of the λ parameters are positive, as shown in Fig. 4(a). In contrast, if 1 > λb > λa > 0, the nonlinear filter functions as a conventional "upper" envelope detector, as illustrated in Fig. 4(b). Finally, and most usefully for our purposes, if 1 > λa > λb > 0, the filter output Q̃out tends to follow the lower envelope of Q̃in[m, l], as seen in Fig. 4(c). In our processing, we will use this slowly-varying lower envelope in Fig. 4(c) to serve as a model for the estimated medium-time noise level, and the activity above this envelope is assumed to represent speech activity. Hence, subtracting this low-level envelope from the original input Q̃in[m, l] will remove a slowly varying non-speech component.

Fig. 4. Sample inputs (solid curves) and outputs (dashed curves) of the asymmetric nonlinear filter defined by Eq. (4) for conditions when (a) λa = λb, (b) λa < λb, and (c) λa > λb. In this example, the channel index l is 8.

We will use the notation

$$\tilde{Q}_{\text{out}}[m,l] = \mathcal{AF}_{\lambda_a,\lambda_b}\!\left[\tilde{Q}_{\text{in}}[m,l]\right] \tag{5}$$

to represent the nonlinear filter described by Eq. (4). We note that this filter operates only on the frame indices m for each channel index l.
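The recursion of Eqs. (4)-(5) is compact in code. The NumPy sketch below (ours; the default initialization to the first input frame is an assumption, since the appropriate initial value depends on where the filter is used) runs along the frame axis independently in each channel.

```python
import numpy as np

def asym_filter(Q_in, lam_a, lam_b, init=None):
    """Asymmetric nonlinear filter AF_{lam_a, lam_b} of Eqs. (4)-(5).

    Q_in : (n_frames, L) array of nonnegative powers.
    init : optional (L,) initial output; defaults to Q_in[0].
    """
    Q_out = np.empty_like(Q_in)
    Q_out[0] = Q_in[0] if init is None else init
    for m in range(1, Q_in.shape[0]):
        rising = Q_in[m] >= Q_out[m - 1]          # selects the branch of Eq. (4)
        lam = np.where(rising, lam_a, lam_b)
        Q_out[m] = lam * Q_out[m - 1] + (1.0 - lam) * Q_in[m]
    return Q_out
```

With lam_a > lam_b the output rises slowly and falls quickly, so it tracks the lower envelope; reversing the inequality yields an upper envelope detector, and lam_a = lam_b gives an ordinary first-order lowpass filter, matching the three cases of Fig. 4.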
Keeping the characteristics of the asymmetric filter described above in mind, we may now consider the structure shown in Fig. 3. In the first stage, the lower envelope Q̃le[m, l], which represents the average noise power, is obtained according to the equation

$$\tilde{Q}_{le}[m,l] = \mathcal{AF}_{0.999,\,0.5}\!\left[\tilde{Q}[m,l]\right] \tag{6}$$

as depicted in Fig. 4(c); Q̃le[0, l] is initialized to 0.9Q̃[0, l]. Q̃le[m, l] is subtracted from the input Q̃[m, l], effectively highpass filtering the input, and the result is passed through an ideal half-wave linear rectifier to produce the rectified output Q̃0[m, l]. The impact of the specific values of the forgetting factors λa and λb on speech recognition accuracy is discussed below.

The remaining elements of ANS processing in the right-hand side of Fig. 3 (other than the temporal masking block) are included to cope with problems that develop when the rectifier output Q̃0[m, l] remains zero for an interval, or when the local variance of Q̃0[m, l] becomes excessively small. Our approach to this problem is motivated by our previous work [21], in which it was noted that applying a well-motivated flooring level to power is very important for noise robustness. In PNCC processing we apply the asymmetric nonlinear filter a second time to obtain the lower envelope Q̃f[m, l] of the rectifier output, and we use this envelope to establish the floor level. This envelope is obtained using asymmetric filtering as before:

$$\tilde{Q}_f[m,l] = \mathcal{AF}_{0.999,\,0.5}\!\left[\tilde{Q}_0[m,l]\right] \tag{7}$$

where Q̃f[0, l] is initialized to Q̃0[0, l]. As shown in Fig. 3, we use the lower envelope of the rectified signal, Q̃f[m, l], as a floor level for the ANS processing output after temporal masking:

$$\tilde{R}_{sp}[m,l] = \max\!\left(\tilde{Q}_{tm}[m,l],\; \tilde{Q}_f[m,l]\right) \tag{8}$$

where Q̃tm[m, l] is the temporal-masking output depicted in Fig. 3. Temporal masking for speech segments is discussed in Sec. II-D.

We have found that applying lowpass filtering to the signal segments that do not appear to be driven by a periodic excitation function (as in voiced speech) improves recognition accuracy in noise by a small amount. For this reason we use the lower envelope Q̃f[m, l] of the rectified signal directly for these non-excitation segments. This operation, which is effectively a further lowpass filtering, is not performed for the speech segments, because blurring the power coefficients for speech degrades recognition accuracy.

Excitation/non-excitation decisions for this purpose are obtained for each value of m and l in a very simple fashion:

$$\text{``excitation segment''} \quad \text{if } \tilde{Q}[m,l] \ge c\,\tilde{Q}_{le}[m,l] \tag{9a}$$

$$\text{``non-excitation segment''} \quad \text{if } \tilde{Q}[m,l] < c\,\tilde{Q}_{le}[m,l] \tag{9b}$$

where Q̃le[m, l] is the lower envelope of Q̃[m, l] as described above and c is a fixed constant. In other words, a particular value of Q̃[m, l] is not considered to represent sufficiently large excitation if it is less than a fixed multiple of its own lower envelope.
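Under the same assumptions, the first ANS stages (Eqs. (6)-(9)) can be sketched using the asym_filter function from above; the function name, the grouping of the stages, and the return signature are ours, while the initializations follow the values quoted in the text.

```python
import numpy as np

def ans_stages(Q, c=2.0):
    """Noise-floor estimation, rectification, flooring envelope, and the
    excitation decision of Eqs. (6)-(9). Q is the medium-time power."""
    # Eq. (6): slowly varying noise-level estimate (lower envelope of Q),
    # initialized to 0.9 Q[0] as in the text.
    Q_le = asym_filter(Q, lam_a=0.999, lam_b=0.5, init=0.9 * Q[0])

    # Subtract the noise estimate and half-wave rectify.
    Q0 = np.maximum(Q - Q_le, 0.0)

    # Eq. (7): lower envelope of the rectified signal, used as the floor.
    Q_f = asym_filter(Q0, lam_a=0.999, lam_b=0.5, init=Q0[0])

    # Eq. (9): an excitation cell exceeds c times its own lower envelope.
    excitation = Q >= c * Q_le
    return Q0, Q_f, excitation
```

Downstream, Eq. (8) floors the temporally masked output at Q_f for the excitation cells, while the non-excitation cells take Q_f directly.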
We observed experimentally that while a broad range of values of λb between 0.25 and 0.75 appears to provide reasonable recognition accuracy, the choice of λa = 0.9 appears to be best under most circumstances [43]. The parameter values used for the current standard implementation are λa = 0.999 and λb = 0.5, which were chosen in part to maximize recognition accuracy in clean speech as well as performance in noise. We also observed (in experiments in which the temporal masking described below was bypassed) that the threshold-parameter value c = 2 provides the best performance for white noise (and presumably other types of broadband noise). The value of c has little impact on performance in background music and in the presence of reverberation.

D. Temporal masking

Many authors have noted that the human auditory system appears to focus more on the onset of an incoming power envelope than on the falling edge of that same power envelope (e.g. [47], [48]). This observation has led to several onset enhancement algorithms (e.g. [49], [46], [50], [51]). In this section we describe a simple way to incorporate this effect in PNCC processing, by obtaining a moving peak for each frequency channel l and suppressing the instantaneous power if it falls below this envelope.

The processing invoked for temporal masking is depicted in block diagram form in Fig. 5. We first obtain the online peak power Q̃p[m, l] for each channel using the equation

$$\tilde{Q}_p[m,l] = \max\!\left(\lambda_t \tilde{Q}_p[m-1,l],\; \tilde{Q}_0[m,l]\right) \tag{10}$$

where λt is the forgetting factor for obtaining the online peak. As before, m is the frame index and l is the channel index.

Fig. 5. Block diagram of the components that accomplish temporal masking in Fig. 3.

Temporal masking for speech segments is accomplished using the equation

$$\tilde{R}_{sp}[m,l] = \begin{cases} \tilde{Q}_0[m,l], & \tilde{Q}_0[m,l] \ge \lambda_t \tilde{Q}_p[m-1,l] \\ \mu_t \tilde{Q}_p[m-1,l], & \tilde{Q}_0[m,l] < \lambda_t \tilde{Q}_p[m-1,l] \end{cases} \tag{11}$$

We have found [43] that recognition accuracy remains almost constant for clean speech and most additive noise conditions if the forgetting factor λt is equal to or less than 0.85 and if µt ≤ 0.2; if λt increases beyond 0.85, performance degrades. The value λt = 0.85 also appears to be best in the reverberant condition. For these reasons we use the values λt = 0.85 and µt = 0.2 in the standard implementation of PNCC. Note that λt = 0.85 corresponds to a time constant of 28.2 ms, which means that the offset attenuation lasts approximately 100 ms. This characteristic is in accordance with observed data for humans [52].

Figure 6 illustrates the effect of this temporal masking. In general, with temporal masking the response of the system is inhibited for portions of the input signal R̃[m, l] other than rising "attack transients". The difference between the signals with and without masking is especially pronounced in reverberant environments, for which the temporal-processing module is especially helpful.

Fig. 6. Demonstration of the effect of temporal masking in the ANS module for speech in simulated reverberation with T60 = 0.5 s (upper panel) and clean speech (lower panel). In this example, the channel index l is 18.

The final output of the asymmetric noise suppression and temporal masking modules is R̃[m, l] = R̃sp[m, l] for the excitation segments and R̃[m, l] = Q̃f[m, l] for the non-excitation segments.
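The masking recursion is straightforward to state in code. The sketch below (ours; the initialization of the peak tracker at the first frame is an assumption the text does not spell out) implements Eqs. (10) and (11):

```python
import numpy as np

def temporal_masking(Q0, lam_t=0.85, mu_t=0.2):
    """Online peak tracking (Eq. (10)) and onset-preserving suppression
    (Eq. (11)) applied to the rectified ANS output Q0."""
    Qp = Q0.copy()     # online peak power per channel
    R_sp = Q0.copy()
    for m in range(1, Q0.shape[0]):
        decayed = lam_t * Qp[m - 1]
        Qp[m] = np.maximum(decayed, Q0[m])                    # Eq. (10)
        R_sp[m] = np.where(Q0[m] >= decayed,
                           Q0[m], mu_t * Qp[m - 1])           # Eq. (11)
    return R_sp
```

Per Eq. (8), R̃sp is then floored at Q̃f, and Q̃f itself is substituted for the non-excitation segments to form R̃[m, l].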
E. Spectral weight smoothing

In our previous research on speech enhancement and noise compensation techniques (e.g. [20], [21], [41], [53], [54]) it has frequently been observed that smoothing the response across channels is helpful. This is especially true in processing schemes such as PNCC in which there are nonlinearities and/or thresholds that vary in their effect from channel to channel, as well as in processing schemes that are based on the inclusion of responses from only a subset of time frames and frequency channels (e.g. [53]) or in systems that rely on missing-feature approaches (e.g. [55]).

From the discussion above, we can represent the combined effects of asymmetric noise suppression and temporal masking for a specific time frame and frequency bin as the transfer function R̃[m, l]/Q̃[m, l].
Smoothing this transfer function across frequency is accomplished by computing a running average over the channel index l of the ratio R̃[m, l]/Q̃[m, l]. Hence the frequency-averaged weighting function S̃[m, l] (whose constituent terms had previously been subjected to temporal averaging) is given by

$$\tilde{S}[m,l] = \frac{1}{l_2 - l_1 + 1} \sum_{l'=l_1}^{l_2} \frac{\tilde{R}[m,l']}{\tilde{Q}[m,l']} \tag{12}$$

where $l_2 = \min(l+N, L)$, $l_1 = \max(l-N, 1)$, and $L$ is the total number of channels.

The time-averaged, frequency-averaged transfer function S̃[m, l] is used to modulate the original short-time power P[m, l]:

$$T[m,l] = P[m,l]\,\tilde{S}[m,l] \tag{13}$$

In the present implementation of PNCC we use a value of N = 4 with a total of L = 40 gammatone channels, again based on empirical optimization from the results of pilot studies [43]. We note that if we were to use a different number of channels L, the optimal value of N would also be different.
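A NumPy sketch of Eqs. (12) and (13) follows; the eps guard against division by zero in silent channels is our addition, and the channel indices are 0-based rather than the paper's 1-based convention.

```python
import numpy as np

def smooth_and_modulate(P, Q, R, N=4, eps=1e-20):
    """Frequency-smoothed weighting S~[m, l] (Eq. (12)) applied to the
    short-time power P[m, l] (Eq. (13)). Q and R are the medium-time
    input power and the ANS/temporal-masking output, respectively."""
    ratio = R / np.maximum(Q, eps)          # per-cell transfer function
    L = P.shape[1]
    S = np.empty_like(ratio)
    for l in range(L):                      # 0-indexed version of Eq. (12)
        l1, l2 = max(l - N, 0), min(l + N, L - 1)
        S[:, l] = ratio[:, l1 : l2 + 1].mean(axis=1)
    return P * S                            # T[m, l]
```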

F. Mean power normalization

In conventional MFCC processing, multiplication of the input signal by a constant scale factor produces only an additive shift of the C0 coefficient, because a logarithmic nonlinearity is included in the processing, and this shift is easily removed by cepstral mean normalization. In PNCC processing, however, the replacement of the log nonlinearity by a power-law nonlinearity, as discussed below, causes the response of the processing to be affected by changes in absolute power, although we have observed that this effect is usually small. In order to further minimize the potential impact of amplitude scaling in PNCC, we invoke a stage of mean power normalization.

While the easiest way to normalize power would be to divide the instantaneous power by the average power over the utterance, this is not feasible for real-time online processing because of the "look ahead" that would be required. For this reason, we normalize input power in the present online implementation of PNCC by dividing the incoming power by a running average of the overall power. The mean power estimate µ[m] is computed from the simple difference equation

$$\mu[m] = \lambda_\mu\, \mu[m-1] + \frac{1-\lambda_\mu}{L} \sum_{l=0}^{L-1} T[m,l] \tag{14}$$

where $m$ and $l$ are the frame and channel indices, as before, and $L$ represents the number of frequency channels. We use a value of 0.999 for the forgetting factor λµ, and for the initial value of µ[m] we use a value obtained from the training database. Since the time constant corresponding to λµ is around 4.6 seconds, we normally do not need to use a formal voice activity detector (VAD) in conjunction with PNCC provided that continuous non-speech segments are no longer than 3 to 4 seconds. If silences of longer duration are interspersed with the speech, however, we recommend the use of a VAD in combination with PNCC processing.

The normalized power is obtained directly from the running power estimate µ[m]:

$$U[m,l] = k\,\frac{T[m,l]}{\mu[m]} \tag{15}$$

where the value of the constant k is arbitrary. In pilot experiments we found that the speech recognition accuracy obtained using the online power normalization described above is comparable to the accuracy that would be obtained by normalizing according to a power estimate computed over the entire utterance in offline fashion.
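Both steps fit in a few lines; the sketch below is ours, with mu0 standing in for the initial mean-power value that the paper takes from the training database.

```python
import numpy as np

def mean_power_normalize(T, lam_mu=0.999, k=1.0, mu0=1.0):
    """Running mean-power estimate (Eq. (14)) and normalization (Eq. (15))."""
    mu = mu0
    U = np.empty_like(T)
    for m in range(T.shape[0]):
        mu = lam_mu * mu + (1.0 - lam_mu) * T[m].mean()   # Eq. (14)
        U[m] = k * T[m] / mu                              # Eq. (15)
    return U
```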

G. Rate-level nonlinearity

Several studies in our group (e.g. [20], [54]) have confirmed the critical importance of the nonlinear function that describes the relationship between the incoming signal amplitude in a given frequency channel and the corresponding response of the processing model. This "rate-level nonlinearity" is explicitly or implicitly a crucial part of every conceptual or physiological model of auditory processing (e.g. [57], [58], [59]). In this section we summarize our approach to the development of the rate-level nonlinearity used in PNCC processing.

It is well known that the nonlinear curve relating sound pressure level in decibels to the auditory-nerve firing rate is compressive (e.g. [56], [60]). It has also been observed that the average auditory-nerve firing rate exhibits an overshoot at the onset of an input signal, as illustrated by the synapse output shown in Fig. 7. As an example, we compare in Fig. 8 the average onset firing rate versus the sustained rate as predicted by the model of Heinz et al. [56]. The curves in this figure were obtained by averaging the rate-intensity values obtained from sinusoidal tone bursts over seven frequencies: 100, 200, 400, 800, 1600, 3200, and 6400 Hz. For the onset-rate results we partitioned the response into bins of length 2.5 ms and searched for the bin with the maximum rate during the initial 10 ms of the tone burst. To measure the sustained rate, we averaged the response rate between 50 and 100 ms after the onset of the signals. The curves were generated under the assumption that the spontaneous rate is 50 spikes/second.

Fig. 7. Synapse output for a pure tone input with a carrier frequency of 500 Hz at 60 dB SPL. This synapse output is obtained using the auditory model of Heinz et al. [56].

Fig. 8. Comparison of the onset rate (solid curve) and sustained rate (dashed curve) obtained using the model proposed by Heinz et al. [56]. The curves were obtained by averaging responses over seven frequencies. See text for details.

We observe in Fig. 8 that the sustained firing rate (dashed curve) is S-shaped, with a threshold around 0 dB SPL and a saturating segment that begins at around 30 dB SPL. The onset rate (solid curve), on the other hand, increases continuously without apparent saturation over the conversational hearing range of 0 to 80 dB SPL. We choose to model the onset rate-intensity curve for PNCC processing because of the important role that it appears to play in auditory perception. Figure 9 compares the onset rate-intensity curve depicted in Fig. 8 with various analytical functions that approximate it. The curves are plotted as a function of sound pressure level in dB in the lower panel of the figure and as a function of absolute pressure in Pascals in the upper panel, and the putative spontaneous firing rate of 50 spikes per second is subtracted from the curves in both cases.

Fig. 9. Comparison between a human rate-intensity relation obtained using the auditory model developed by Heinz et al. [56], a cube-root power-law approximation, an MMSE power-law approximation, and a logarithmic approximation. Upper panel: comparison using pressure (Pa) as the x-axis. Lower panel: comparison using sound pressure level (SPL) in dB as the x-axis.

The most widely used current feature extraction algorithms are mel frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) coefficients. Both the MFCC and PLP procedures include an intrinsic nonlinearity, which is logarithmic in the case of MFCC and a cube-root power function in the case of PLP analysis. We plot these curves relating the power of the input pressure p to the response s in Fig. 9, using values of the arbitrary scaling parameters that are chosen to provide the best fit to the curve of the Heinz et al. model, resulting in the following equations:

$$s_{\text{cube}} = 4294.1\, p^{2/3} \tag{16}$$

$$s_{\log} = 120.2 \log(p) + 1319.3 \tag{17}$$

We note that the exponent of the power function is doubled because we are plotting power rather than pressure. Even though scaling and shifting by fixed constants in Eqs. (16) and (17) have no significance for speech recognition systems, we include them in the equations above to fit these curves to the rate-intensity curve in Fig. 9(a). The constants in Eqs. (16) and (17) were obtained using an MMSE criterion over the sound pressure range between 0 dB SPL (20 µPa) and 80 dB SPL (0.2 Pa), fitting the rate-intensity curve plotted on the linear pressure axis in the upper panel of Fig. 9.

We have also observed experimentally [43] that a power-law curve with an exponent of 1/15 for sound pressure provides a reasonably good fit to the physiological data while optimizing recognition accuracy in the presence of noise. We have observed that larger values of the pressure exponent such as 1/5 provide better performance in white noise, but they degrade the recognition accuracy that is obtained for clean speech [43]. We consider the value 1/15 for the pressure exponent to represent a pragmatic compromise that provides reasonable accuracy in white noise without sacrificing recognition accuracy for clean speech, producing the power-law nonlinearity

$$V[m,l] = U[m,l]^{1/15} \tag{18}$$

where again U[m, l] and V[m, l] have the dimensions of power. This curve is closely approximated by the equation

$$s_{\text{power}} = 1389.6\, p^{0.1264} \tag{19}$$

which is also plotted in Fig. 9. The exponent of 0.1264 happens to be the best fit to the Heinz et al. data as depicted in the upper panel of Fig. 9. As before, this estimate was developed in the MMSE sense over the sound pressure range between 0 dB SPL (20 µPa) and 80 dB SPL (0.2 Pa).

The power-law function was chosen for PNCC processing for several reasons. First, it is a relationship that is not affected in form by multiplying the input by a constant. Second, it has the attractive property that its asymptotic response at very low intensities is zero rather than negative infinity, which reduces the variance of the response to low-level inputs such as spectral valleys or silence segments. Finally, the power law has been demonstrated to provide a good approximation to the "psychophysical transfer functions" that are observed in experiments relating the physical intensity of a stimulus to its perceived intensity using direct magnitude-estimation procedures (e.g. [61]).
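The nonlinearity itself is a one-liner; the short check below (ours, with an arbitrary example array) illustrates the scale-invariance property named as the first reason above.

```python
import numpy as np

def power_law(U, exponent=1.0 / 15.0):
    """Power-law rate-level nonlinearity of Eq. (18)."""
    return np.power(U, exponent)

# Scaling the input power by a constant c scales every output feature by
# the same factor c**(1/15); a log nonlinearity would instead add a
# constant offset that cepstral mean normalization must remove.
rng = np.random.default_rng(0)
U = rng.random((5, 40)) + 1e-3
assert np.allclose(power_law(100.0 * U), 100.0 ** (1.0 / 15.0) * power_law(U))
```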
Figure 10 is a final comparison of the effects of the asymmetric noise suppression, temporal masking, channel weighting, and power-law nonlinearity modules discussed in Secs. II-C through II-G. The curves in both panels compare the response of the system in the channel with center frequency 490 Hz to clean speech and to speech in the presence of street noise at an SNR of 5 dB. The curves in the upper panel were obtained using conventional MFCC processing, including the logarithmic nonlinearity and without ANS processing or temporal masking. The curves in the lower panel were obtained using PNCC processing, which includes the power-law transformation described in this section, as well as ANS processing and temporal masking. We note that the difference between the two curves representing clean and noisy speech is much greater with MFCC processing (upper panel), especially for times during which the signal is at a low level.

Fig. 10. The effects of the asymmetric noise suppression, temporal masking, and the rate-level nonlinearity used in PNCC processing. Shown are the outputs of these stages of processing for clean speech and for speech corrupted by street noise at an SNR of 5 dB when the logarithmic nonlinearity is used without ANS processing or temporal masking (upper panel), and when the power-law nonlinearity is used with ANS processing and temporal masking (lower panel). In this example, the channel index l is 8.

III. EXPERIMENTAL RESULTS

In this section we present experimental results that are intended to demonstrate the superiority of PNCC processing over competing approaches in a wide variety of acoustical environments. We begin in Sec. III-A with a review of the experimental procedures that were used. We then provide some general results for PNCC processing and assess the contributions of its various components in Sec. III-B, and we compare PNCC to a small number of other approaches in Sec. III-C.

It should be noted that in general we selected an algorithm configuration and associated parameter values that provide very good performance over a wide variety of conditions using a single set of parameters and settings, without sacrificing word error rate in clean conditions relative to MFCC processing. In previous work we described slightly different feature extraction algorithms that provide even better performance for speech recognition in the presence of reverberation [21] and in background music [46], but these approaches do not perform as well as MFCC processing in clean speech.

We used five standard testing environments in our work: (1) digitally-added white noise, (2) digitally-added noise that had been recorded live on urban streets, (3) digitally-added single-speaker interference, (4) digitally-added background music, and (5) passage of the signal through simulated reverberation. The street noise was recorded by us on streets with steady but moderate traffic. The masking signal used for the single-speaker-interference experiments consisted of other utterances drawn from the same database as the target speech, and the background music was selected from music segments in the original DARPA Hub 4 Broadcast News database. The reverberation simulations were accomplished using the Room Impulse Response open source software package [62] based on the image method [63]. The room size used was 3 × 4 × 5 meters, the microphone was placed in the center of the room, the spacing between the target speaker and the microphone was assumed to be 3 meters, and the reverberation time was manipulated by changing the assumed absorption coefficients in the room appropriately. These conditions were selected so that interfering additive noise sources of progressively greater difficulty were included, along with basic reverberation effects.

A. Experimental Configuration

The PNCC features described in this paper were evaluated by comparing the recognition accuracy obtained with PNCC to that obtained using MFCC and RASTA-PLP processing. We used the version of conventional MFCC processing implemented as part of sphinx_fe in sphinxbase 0.4.1, both from the CMU Sphinx open source codebase [64], and the PLP-RASTA implementation that is available at [65]. In all cases decoding was performed using the publicly-available CMU Sphinx 3.8 system [64] with training performed by SphinxTrain 1.0. We also compared PNCC with the vector Taylor series (VTS) noise compensation algorithm [4] and with the ETSI Advanced Front End (AFE), which includes several noise suppression algorithms [8]. In the case of the ETSI AFE, we excluded the log energy element because doing so yielded better results in our experiments. A bigram language model was used in all the experiments. We used feature vectors of length 39, including delta and delta-delta features. For experiments using the DARPA Resource Management (RM1) database we used subsets of 1600 utterances of clean speech for training and 600 utterances of clean or degraded speech for testing. For experiments based on the DARPA Wall Street Journal (WSJ) 5000-word database we trained the system using the WSJ0 SI-84 training set and tested it on the WSJ0 5K test set.

We typically plot word recognition accuracy, which is 100 percent minus the word error rate (WER), using the standard definition for WER of the number of insertions, deletions, and substitutions divided by the number of words spoken.
B. General performance of PNCC in noise and reverberation

In this section we describe the recognition accuracy obtained using PNCC processing in the presence of various types of degradation of the incoming speech signals. Figures 11 and 12 describe the recognition accuracy obtained with PNCC processing in the presence of white noise, street noise, background music, and speech from a single interfering speaker as a function of SNR, as well as in the simulated reverberant environment as a function of reverberation time. These results are plotted for the DARPA RM database in Fig. 11 and for the DARPA WSJ database in Fig. 12. For the experiments conducted in noise we prefer to characterize the improvement in recognition accuracy by the amount of lateral shift of the curves provided by the processing, which corresponds to an increase of the effective SNR. For white noise using the RM task, PNCC provides an improvement of about 12 dB to 13 dB compared to MFCC processing, as shown in Fig. 11. In the presence of street noise, background music, and interfering speech, PNCC provides improvements of approximately 8 dB, 3.5 dB, and 3.5 dB, respectively. We also note that PNCC processing provides considerable improvement in reverberation, especially for longer reverberation times. PNCC processing exhibits similar performance trends for speech from the DARPA WSJ0 database in similar environments, as seen in Fig. 12, although the magnitude of the improvement is diminished somewhat, which is commonly observed as we move to larger databases.

The curves in Figs. 11 and 12 are also organized in a way that highlights the contributions of the major components. Beginning with baseline MFCC processing, the remaining curves show the effects of adding, in sequence, (1) the power-law nonlinearity (along with mean power normalization and the gammatone frequency integration), (2) the ANS processing including spectral smoothing, and finally (3) temporal masking. It can be seen from the curves that a substantial improvement can be obtained by simply replacing the logarithmic nonlinearity of MFCC processing by the power-law rate-intensity function described in Sec. II-G. The addition of the ANS processing provides a substantial further improvement in recognition accuracy in noise. Although it is not explicitly shown in Figs. 11 and 12, temporal masking is particularly helpful in improving accuracy for reverberated speech and for speech in the presence of interfering speech.

Fig. 11. Recognition accuracy obtained using PNCC processing in various types of additive noise and reverberation. Curves are plotted separately to indicate the contributions of the power-law nonlinearity, asymmetric noise suppression, and temporal masking. Results are described for the DARPA RM1 database in the presence of (a) white noise, (b) street noise, (c) background music, (d) interfering speech, and (e) artificial reverberation.

C. Comparison with other algorithms

Figures 13 and 14 provide comparisons of PNCC processing to the baseline MFCC processing with cepstral mean normalization, to MFCC processing combined with the vector Taylor series (VTS) algorithm for noise robustness [4], and to RASTA-PLP feature extraction [23] and the ETSI Advanced Front End (AFE) [8]. We compare PNCC processing to MFCC and RASTA-PLP processing because these features are most widely used in baseline systems, even though neither MFCC nor PLP features were designed to be robust in the presence of additive noise. The experimental conditions used were the same as those used to produce Figs. 11 and 12.

We note in Figs. 13 and 14 that PNCC provides substantially better recognition accuracy than both MFCC and RASTA-PLP processing for all conditions examined.
It also provides recognition accuracy that is better than the combination of MFCC with VTS, at a substantially lower computational cost than that incurred in implementing VTS. We also note that the VTS algorithm provides little or no improvement over the baseline MFCC performance in difficult environments such as background music, a single interfering speaker, or reverberation.

Fig. 12. Recognition accuracy obtained using PNCC processing in various types of additive noise and reverberation. Curves are plotted separately to indicate the contributions of the power-law nonlinearity, asymmetric noise suppression, and temporal masking. Results are described for the DARPA WSJ0 database in the presence of (a) white noise, (b) street noise, (c) background music, (d) interfering speech, and (e) artificial reverberation.

Fig. 13. Comparison of recognition accuracy for PNCC with processing using MFCC features, the ETSI AFE, MFCC with VTS, and RASTA-PLP features using the DARPA RM1 corpus. Environmental conditions are (a) white noise, (b) street noise, (c) background music, (d) interfering speech, and (e) reverberation.
The ETSI Advanced Front End (AFE) [8] generally provides slightly better recognition accuracy than VTS in noisy environments, but the accuracy obtained with the AFE does not approach that obtained with PNCC processing in the most difficult noise conditions. Neither the ETSI AFE nor VTS improves recognition accuracy in reverberant environments compared to MFCC features, while PNCC provides measurable improvements in reverberation, and a closely related algorithm [46] provides even greater recognition accuracy in reverberation (at the expense of somewhat worse performance in clean speech).

Fig. 14. Comparison of recognition accuracy for PNCC with processing using MFCC features, the ETSI AFE, MFCC with VTS, and RASTA-PLP features using the DARPA WSJ0 corpus. Environmental conditions are (a) white noise, (b) street noise, (c) background music, (d) interfering speech, and (e) reverberation.

IV. COMPUTATIONAL COMPLEXITY

Table I provides estimates of the computational demands of MFCC, PLP, and PNCC feature extraction. (RASTA processing is not included in these tabulations.) As before, we use the standard open source Sphinx code in sphinx_fe [64] for the implementation of MFCC, and the implementation in [65] for PLP. We assume that the window length is 25.6 ms and that the interval between successive windows is 10 ms. The sampling rate is assumed to be 16 kHz, and we use a 1024-point FFT for each analysis frame.

It can be seen in Table I that because all three algorithms use 1024-point FFTs, the greatest difference from algorithm to algorithm in the amount of computation required is associated with the spectral integration component. Specifically, the triangular weighting used in the MFCC calculation encompasses a narrower range of frequencies than the trapezoids used in PLP processing, which is in turn considerably narrower than the gammatone filter shapes, and the amount of computation needed for spectral integration is directly proportional to the effective bandwidth of the channels. For this reason, as mentioned in Sec. II-A, we limited the gammatone filter computation to those frequencies for which the filter transfer function is 0.5 percent or more of the maximum filter gain. In Table I, for all spectral integration types, we considered only the portion of each filter whose magnitude is 0.5 percent or more of the maximum filter gain.

As can be seen in Table I, PLP processing by this tabulation is about 32.9 percent more costly than baseline MFCC processing. PNCC processing is approximately 34.6 percent more costly than MFCC processing and 1.31 percent more costly than PLP processing.
TABLE I
NUMBER OF MULTIPLICATIONS AND DIVISIONS IN EACH FRAME

Item                           |  MFCC |   PLP |  PNCC
Pre-emphasis                   |   410 |       |   410
Windowing                      |   410 |   410 |   410
FFT                            | 10240 | 10240 | 10240
Magnitude squared              |   512 |   512 |   512
Medium-time power calculation  |       |       |    40
Spectral integration           |   958 |  4955 |  4984
ANS filtering                  |       |       |   200
Equal loudness pre-emphasis    |       |   512 |
Temporal masking               |       |       |   120
Weight averaging               |       |       |   120
IDFT                           |       |   504 |
LPC and cepstral recursion     |       |   156 |
DCT                            |   480 |       |   480
Sum                            | 13010 | 17289 | 17516
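The column totals can be reproduced from the individual entries in Table I; the following consistency check (ours) mirrors the table:

    # Per-item counts from Table I; zero marks an item a pipeline omits.
    items = {  #                        MFCC    PLP   PNCC
        "Pre-emphasis":               (  410,     0,   410),
        "Windowing":                  (  410,   410,   410),
        "FFT":                        (10240, 10240, 10240),
        "Magnitude squared":          (  512,   512,   512),
        "Medium-time power":          (    0,     0,    40),
        "Spectral integration":       (  958,  4955,  4984),
        "ANS filtering":              (    0,     0,   200),
        "Equal loudness pre-emphasis":(    0,   512,     0),
        "Temporal masking":           (    0,     0,   120),
        "Weight averaging":           (    0,     0,   120),
        "IDFT":                       (    0,   504,     0),
        "LPC and cepstral recursion": (    0,   156,     0),
        "DCT":                        (  480,     0,   480),
    }

    sums = [sum(col) for col in zip(*items.values())]
    print(sums)  # [13010, 17289, 17516], matching the "Sum" row of Table I
    assert sums == [13010, 17289, 17516]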
Further details about the motivation for and implementation of PNCC processing are available in [43]. This thesis also includes additional relevant experimental findings, including results obtained for PNCC processing using multi-style training and in combination with speaker-by-speaker MLLR. Open-source MATLAB code for PNCC may be found at http://www.cs.cmu.edu/~robust/archive/algorithms/PNCC_IEEETran. The code in this directory was used for obtaining the results for this paper.
ACKNOWLEDGEMENTS

This research was supported by NSF (Grants IIS-0420866 and IIS-0916918). The authors are grateful to Bhiksha Raj, Kshitiz Kumar, and Mark Harvilla for many helpful discussions. A summary version of part of this paper was published in [66].
REFERENCES

[1] L. R. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: PTR Prentice Hall, 1993.
[2] F. Jelinek, Statistical Methods for Speech Recognition (Language, Speech, and Communication). MIT Press, 1998.
[3] A. Acero and R. M. Stern, “Environmental robustness in automatic speech recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing (Albuquerque, NM), vol. 2, Apr. 1990, pp. 849–852.
[4] P. J. Moreno, B. Raj, and R. M. Stern, “A vector Taylor series approach for environment-independent speech recognition,” in IEEE Int. Conf. Acoust., Speech and Signal Processing, May 1996, pp. 733–736.
[5] P. Pujol, D. Macho, and C. Nadeu, “On real-time mean-and-variance normalization of speech recognition features,” in IEEE Int. Conf. Acoust., Speech and Signal Processing, vol. 1, May 2006, pp. 773–776.
[6] R. M. Stern, B. Raj, and P. J. Moreno, “Compensation for environmental degradation in automatic speech recognition,” in Proc. of the ESCA Tutorial and Research Workshop on Robust Speech Recognition for Unknown Communication Channels, Apr. 1997, pp. 33–42.
[7] R. Singh, R. M. Stern, and B. Raj, “Signal and feature compensation methods for robust speech recognition,” in Noise Reduction in Speech Applications, G. M. Davis, Ed. CRC Press, 2002, pp. 219–244.
[8] Speech Processing, Transmission and Quality Aspects (STQ); Distributed Speech Recognition; Advanced Front-end Feature Extraction Algorithm; Compression Algorithms, European Telecommunications Standards Institute ES 202 050, Rev. 1.1.5, Jan. 2007.
[9] S. Molau, M. Pitz, and H. Ney, “Histogram based normalization in the acoustic feature space,” in IEEE Workshop on Automatic Speech Recognition and Understanding, Nov. 2001, pp. 21–24.
[10] H. Misra, S. Ikbal, H. Bourlard, and H. Hermansky, “Spectral entropy based feature for robust ASR,” in IEEE Int. Conf. Acoust., Speech, and Signal Processing, May 2004, pp. 193–196.
[11] B. Raj, V. N. Parikh, and R. M. Stern, “The effects of background music on speech recognition accuracy,” in IEEE Int. Conf. Acoust., Speech and Signal Processing, vol. 2, Apr. 1997, pp. 851–854.
[12] S. B. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Trans. Acoust., Speech, and Signal Processing, vol. 28, no. 4, pp. 357–366, Aug. 1980.
[13] H. Hermansky, “Perceptual linear prediction analysis of speech,” J. Acoust. Soc. Am., vol. 87, no. 4, pp. 1738–1752, Apr. 1990.
[14] S. Ganapathy, S. Thomas, and H. Hermansky, “Robust spectro-temporal features based on autoregressive models of Hilbert envelopes,” in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, March 2010, pp. 4286–4289.
[15] M. Heckmann, X. Domont, F. Joublin, and C. Goerick, “A hierarchical framework for spectro-temporal feature extraction,” Speech Communication, vol. 53, no. 5, pp. 736–752, May–June 2011.
[16] N. Mesgarani, M. Slaney, and S. Shamma, “Discrimination of speech from nonspeech based on multiscale spectro-temporal modulations,” IEEE Trans. Audio, Speech, and Language Processing, vol. 14, no. 3, pp. 920–929, May 2006.
[17] M. Kleinschmidt, “Localized spectro-temporal features for automatic speech recognition,” in INTERSPEECH-2003, Sept. 2003, pp. 2573–2576.
[18] H. Hermansky and F. Valente, “Hierarchical and parallel processing of modulation spectrum for ASR applications,” in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, March 2008, pp. 4165–4168.
[19] S. Y. Zhao and N. Morgan, “Multi-stream spectro-temporal features for robust speech recognition,” in INTERSPEECH-2008, Sept. 2008, pp. 898–901.
[20] C. Kim and R. M. Stern, “Feature extraction for robust speech recognition using a power-law nonlinearity and power-bias subtraction,” in INTERSPEECH-2009, Sept. 2009, pp. 28–31.
[21] ——, “Feature extraction for robust speech recognition based on maximizing the sharpness of the power distribution and on power flooring,” in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, March 2010, pp. 4574–4577.
[22] D.-S. Kim, S.-Y. Lee, and R. M. Kil, “Auditory processing of speech signals for robust speech recognition in real-world noisy environments,” IEEE Trans. Speech and Audio Processing, vol. 7, no. 1, pp. 55–69, 1999.
[23] H. Hermansky and N. Morgan, “RASTA processing of speech,” IEEE Trans. Speech Audio Process., vol. 2, no. 4, pp. 578–589, Oct. 1994.
[24] U. H. Yapanel and J. H. L. Hansen, “A new perceptually motivated MVDR-based acoustic front-end (PMVDR) for robust automatic speech recognition,” Speech Communication, vol. 50, no. 2, pp. 142–152, Feb. 2008.
[25] F. Müller and A. Mertins, “Contextual invariant-integration features for improved speaker-independent speech recognition,” Speech Communication, vol. 53, no. 6, pp. 830–841, July 2011.
[26] B. Gajic and K. K. Paliwal, “Robust parameters for speech recognition based on subband spectral centroid histograms,” in Eurospeech-2001, Sept. 2001, pp. 591–594.
[27] F. Kelly and N. Harte, “A comparison of auditory features for robust speech recognition,” in EUSIPCO-2010, Aug. 2010, pp. 1968–1972.
[28] ——, “Auditory features revisited for robust speech recognition,” in International Conference on Pattern Recognition, Aug. 2010, pp. 4456–4459.
[29] J. K. Siqueira and A. Alcaim, “Comparação dos atributos MFCC, SSCH e PNCC para reconhecimento robusto de voz contínua,” in XXIX Simpósio Brasileiro de Telecomunicações, Oct. 2011.
[30] G. Sárosi, M. Mozsáry, B. Tarján, A. Balog, P. Mihajlik, and T. Fegyó, “Recognition of multiple language voice navigation queries in traffic situations,” in COST 2102 International Conference, Sept. 2010, pp. 199–213.
[31] G. Sárosi, M. Mozsáry, P. Mihajlik, and T. Fegyó, “Comparison of feature extraction methods for speech recognition in noise-free and in traffic noise environment,” in Speech Technology and Human-Computer Dialogue (SpeD), May 2011, pp. 1–8.
[32] F. Müller and A. Mertins, “Noise robust speaker-independent speech recognition with invariant-integration features using power-bias subtraction,” in INTERSPEECH-2011, Aug. 2011, pp. 1677–1680.
[33] K. Kumar, C. Kim, and R. M. Stern, “Delta-spectral cepstral coefficients for robust speech recognition,” in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, May 2011, pp. 4784–4787.
[34] H. Zhang, X. Zhu, T.-R. Su, K.-W. Eom, and J.-W. Lee, “Data-driven lexicon refinement using local and web resources for Chinese speech recognition,” in International Symposium on Chinese Spoken Language Processing, Dec. 2010, pp. 233–237.
[35] A. Fazel and S. Chakrabartty, “Sparse auditory reproducing kernel (SPARK) features for noise-robust speech recognition,” IEEE Trans. Audio, Speech, Language Processing, Dec. 2011 (to appear).
[36] M. J. Harvilla and R. M. Stern, “Histogram-based subband power warping and spectral averaging for robust speech recognition under matched and multistyle training,” in IEEE Int. Conf. Acoust. Speech Signal Processing, May 2012 (to appear).
[37] S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. Acoust., Speech and Signal Processing, vol. 27, no. 2, pp. 113–120, Apr. 1979.
[38] R. D. Patterson, K. Robinson, J. Holdsworth, D. McKeown, C. Zhang, and M. H. Allerhand, “Complex sounds and auditory images,” in Auditory Physiology and Perception, Y. Cazals, L. Demany, and K. Horner, Eds. Oxford, UK: Pergamon Press, 1992, pp. 429–446.
[39] B. C. J. Moore and B. R. Glasberg, “A revision of Zwicker's loudness model,” Acustica - Acta Acustica, vol. 82, pp. 335–345, 1996.
[40] M. Slaney, “Auditory toolbox version 2,” Interval Research Corporation Technical Report, no. 10, 1998. [Online]. Available: http://cobweb.ecn.purdue.edu/~malcolm/interval/1998-010/
[41] C. Kim and R. M. Stern, “Power function-based power distribution normalization algorithm for robust speech recognition,” in IEEE Automatic Speech Recognition and Understanding Workshop, Dec. 2009, pp. 188–193.
[42] D. Gelbart and N. Morgan, “Evaluating long-term spectral subtraction for reverberant ASR,” in IEEE Workshop on Automatic Speech Recognition and Understanding, 2001, pp. 103–106.
[43] C. Kim, “Signal processing for robust speech recognition motivated by auditory processing,” Ph.D. dissertation, Carnegie Mellon University, Pittsburgh, PA, USA, Dec. 2010.
[44] B. E. D. Kingsbury, N. Morgan, and S. Greenberg, “Robust speech recognition using the modulation spectrogram,” Speech Communication, vol. 25, no. 1–3, pp. 117–132, Aug. 1998.
[45] H. G. Hirsch and C. Ehrlicher, “Noise estimation techniques for robust speech recognition,” in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, May 1995, pp. 153–156.
[46] C. Kim and R. M. Stern, “Nonlinear enhancement of onset for robust speech recognition,” in INTERSPEECH-2010, Sept. 2010, pp. 2058–2061.
[47] C. Lemyre, M. Jelinek, and R. Lefebvre, “New approach to voiced onset detection in speech signal and its application for frame error concealment,” in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, May 2008, pp. 4757–4760.
[48] S. R. M. Prasanna and P. Krishnamoorthy, “Vowel onset point detection using source, spectral peaks, and modulation spectrum energies,” IEEE Trans. Audio, Speech, and Lang. Process., vol. 17, no. 4, pp. 556–565, May 2009.
[49] K. D. Martin, “Echo suppression in a computational model of the precedence effect,” in IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics, Oct. 1997.
[50] C. Kim, K. Kumar, and R. M. Stern, “Binaural sound source separation motivated by auditory processing,” in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, May 2011, pp. 5072–5075.
[51] T. S. Gunawan and E. Ambikairajah, “A new forward masking model and its application to speech enhancement,” in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, May 2006, pp. 149–152.
[52] W. Jesteadt, S. P. Bacon, and J. R. Lehman, “Forward masking as a function of frequency, masker level, and signal delay,” J. Acoust. Soc. Am., vol. 71, no. 4, pp. 950–962, Apr. 1982.
[53] C. Kim, K. Kumar, B. Raj, and R. M. Stern, “Signal separation for robust speech recognition based on phase difference information obtained in the frequency domain,” in INTERSPEECH-2009, Sept. 2009, pp. 2495–2498.
[54] C. Kim, K. Kumar, and R. M. Stern, “Robust speech recognition using small power boosting algorithm,” in IEEE Automatic Speech Recognition and Understanding Workshop, Dec. 2009, pp. 243–248.
[55] B. Raj and R. M. Stern, “Missing-feature methods for robust automatic speech recognition,” IEEE Signal Processing Magazine, vol. 22, no. 5, pp. 101–116, Sept. 2005.
[56] M. G. Heinz, X. Zhang, I. C. Bruce, and L. H. Carney, “Auditory-nerve model for predicting performance limits of normal and impaired listeners,” Acoustics Research Letters Online, vol. 2, no. 3, pp. 91–96, July 2001.
[57] S. Seneff, “A joint synchrony/mean-rate model of auditory speech processing,” J. Phonetics, vol. 16, no. 1, pp. 55–76, Jan. 1988.
[58] J. Tchorz and B. Kollmeier, “A model of auditory perception as front end for automatic speech recognition,” J. Acoust. Soc. Am., vol. 106, no. 4, pp. 2040–2050, 1999.
[59] X. Zhang, M. G. Heinz, I. C. Bruce, and L. H. Carney, “A phenomenological model for the responses of auditory-nerve fibers: I. Nonlinear tuning with compression and suppression,” J. Acoust. Soc. Am., vol. 109, no. 2, pp. 648–670, Feb. 2001.
[60] ——, “A phenomenological model for the responses of auditory-nerve fibers: I. Nonlinear tuning with compression and suppression,” J. Acoust. Soc. Am., vol. 109, no. 2, pp. 648–670, Feb. 2001.
[61] S. S. Stevens, “On the psychophysical law,” Psychological Review, vol. 64, no. 3, pp. 153–181, 1957.
[62] S. G. McGovern, “A model for room acoustics,” http://2pi.us/rir.html.
[63] J. Allen and D. Berkley, “Image method for efficiently simulating small-room acoustics,” J. Acoust. Soc. Am., vol. 65, no. 4, pp. 943–950, April 1979.
[64] CMU Sphinx Consortium. CMU Sphinx Open Source Toolkit for Speech Recognition: Downloads. [Online]. Available: http://cmusphinx.sourceforge.net/wiki/download/
[65] D. Ellis. (2006) PLP and RASTA (and MFCC, and inversion) in MATLAB using melfcc.m and invmelfcc.m. [Online]. Available: http://labrosa.ee.columbia.edu/matlab/rastamat/
[66] C. Kim and R. M. Stern, “Power-normalized cepstral coefficients (PNCC) for robust speech recognition,” in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, March 2012 (to appear).
Chanwoo Kim is presently a software development engineer at the Microsoft Corporation. He received a Ph.D. from the Language Technologies Institute of the Carnegie Mellon University School of Computer Science in 2010. He received his B.S. and M.S. degrees in Electrical Engineering from Seoul National University in 1998 and 2001, respectively. Dr. Kim's doctoral research was focused on enhancing the robustness of automatic speech recognition systems in noisy environments. Toward this end he has developed a number of different algorithms for single-microphone, dual-microphone, and multiple-microphone applications, which have been applied to various real-world systems. Between 2003 and 2005 Dr. Kim was a Senior Research Engineer at LG Electronics, where he worked primarily on embedded signal processing and protocol stacks for multimedia systems. Prior to his employment at LG, he worked for EdumediaTek and SK Teletech as an R&D engineer.

Richard Stern received the S.B. degree from the Massachusetts Institute of Technology in 1970, the M.S. from the University of California, Berkeley, in 1972, and the Ph.D. from MIT in 1977, all in electrical engineering. He has been on the faculty of Carnegie Mellon University since 1977, where he is currently a Professor in the Electrical and Computer Engineering, Computer Science, and Biomedical Engineering Departments and the Language Technologies Institute, and a Lecturer in the School of Music. Much of Dr. Stern's current research is in spoken language systems, where he is particularly concerned with the development of techniques with which automatic speech recognition can be made more robust with respect to changes in environment and acoustical ambience. He has also developed sentence parsing and speaker adaptation algorithms for earlier CMU speech systems. In addition to his work in speech recognition, Dr. Stern also maintains an active research program in psychoacoustics, where he is best known for theoretical work in binaural perception. Dr. Stern is a Fellow of the Acoustical Society of America, the 2008–2009 Distinguished Lecturer of the International Speech Communication Association, a recipient of the Allen Newell Award for Research Excellence in 1992, and he served as General Chair of Interspeech 2006. He is also a member of the IEEE and of the Audio Engineering Society.