Abstract—This paper presents a new feature extraction algorithm called Power Normalized Cepstral Coefficients (PNCC) that is motivated by auditory processing. Major new features of PNCC processing include the use of a power-law nonlinearity that replaces the traditional log nonlinearity used in MFCC coefficients, a noise-suppression algorithm based on asymmetric filtering that suppresses background excitation, and a module that accomplishes temporal masking. We also propose the use of medium-time power analysis, in which environmental parameters are estimated over a longer duration than is commonly used for speech, as well as frequency smoothing. Experimental results demonstrate that PNCC processing provides substantial improvements in recognition accuracy compared to MFCC and PLP processing for speech in the presence of various types of additive noise and in reverberant environments, with only slightly greater computational cost than conventional MFCC processing, and without degrading the recognition accuracy that is observed while training and testing using clean speech. PNCC processing also provides better recognition accuracy in noisy environments than techniques such as Vector Taylor Series (VTS) and the ETSI Advanced Front End (AFE) while requiring much less computation. We describe an implementation of PNCC using "online processing" that does not require future knowledge of the input.

Index Terms—Robust speech recognition, feature extraction, physiological modeling, rate-level curve, power function, asymmetric filtering, medium-time power estimation, spectral weight smoothing, temporal masking, modulation filtering, online speech processing

EDICS Category: SPE-ROBU, SPE-SPER

I. INTRODUCTION

[...] have been introduced to address this problem. Many of these conventional noise compensation algorithms have provided substantial improvement in accuracy for recognizing speech in the presence of quasi-stationary noise (e.g. [3], [4], [5], [6], [7], [8], [9], [10]). Unfortunately, these same algorithms frequently do not provide significant improvements in more difficult environments with transitory disturbances such as a single interfering speaker or background music (e.g. [11]).

Many of the current systems developed for automatic speech recognition, speaker identification, and related tasks are based on variants of one of two types of features: mel frequency cepstral coefficients (MFCC) [12] or perceptual linear prediction (PLP) coefficients [13]. Spectro-temporal features have also been recently introduced with promising results (e.g. [14], [15]). It has been observed that two-dimensional Gabor filters provide a reasonable approximation to the spectro-temporal response fields of neurons in the auditory cortex, which has led to various approaches to extracting features for speech recognition (e.g. [16], [17], [18], [19]). In this paper we describe the development of an additional feature set for speech recognition which we refer to as power-normalized cepstral coefficients (PNCC).

We introduced several previous versions of PNCC processing in [20] and [21], and these implementations have been evaluated by several teams of researchers and compared to several different algorithms including zero crossing peak amplitude (ZCPA) [22], RASTA-PLP [23], perceptual minimum variance distortionless response (PMVDR) [24], invariant-integration features (IIF) [25], and subband spectral centroid histograms [...]
[Fig. 1 image: parallel block diagrams of the three feature extraction algorithms. All three begin with an STFT. The PNCC column proceeds through short-time power calculation P[m, l]; medium-time power calculation Q̃[m, l]; asymmetric noise suppression with temporal masking R̃[m, l]; weight smoothing S̃[m, l]; time-frequency normalization T[m, l]; mean power normalization U[m, l]; a power-function nonlinearity (·)^(1/15) yielding V[m, l]; and a final DCT. The MFCC column uses a logarithmic nonlinearity followed by a DCT; the PLP column uses a power-function nonlinearity (·)^(1/3), an IDFT, and LPC-based cepstral recursion.]
Fig. 1. Comparison of the structure of the MFCC, PLP, and PNCC feature extraction algorithms. The modules of PNCC that function on the basis of
“medium-time” analysis (with a temporal window of 65.6 ms) are plotted in the rightmost column. If the shaded blocks of PNCC are omitted, the remaining
processing is referred to as simple power-normalized cepstral coefficients (SPNCC).
This paper has been significantly revised to address these issues in a fashion that enables it to provide superior recognition accuracy over a broad range of conditions of noise and reverberation, using features that are computable in real time with "online" algorithms that do not require extensive look-ahead, and with a computational complexity that is comparable to that of traditional MFCC and PLP features. In the subsequent subsections of this Introduction we discuss the broader motivations and overall structure of PNCC processing. We specify the key elements of the processing in some detail in Sec. II. In Sec. III we compare the recognition accuracy provided by PNCC processing under a variety of conditions with that of other processing schemes, and we consider the impact of various components of PNCC on these results. We compare the computational complexity of the MFCC, PLP, and PNCC feature extraction algorithms in Sec. IV, and we summarize our results in the final section.

A. Broader motivation for the PNCC algorithm

The development of PNCC feature extraction was motivated by a desire to obtain a set of practical features for speech recognition that are more robust with respect to acoustical variability in their native form, without loss of performance when the speech signal is undistorted, and with a degree of computational complexity that is comparable to that of MFCC and PLP coefficients. While many of the attributes of PNCC processing have been strongly influenced by consideration of various attributes of human auditory processing, in developing the specific processing that is performed we have favored approaches that provide pragmatic gains in robustness at small computational cost over approaches that are more faithful to auditory physiology.

Some of the innovations of PNCC processing that we consider to be the most important include:

• The replacement of the log nonlinearity in MFCC processing by a power-law nonlinearity that is carefully chosen to approximate the nonlinear relation between signal intensity and auditory-nerve firing rate. We believe
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. X, NO. X, MONTH, YEAR 3
that this nonlinearity provides superior robustness by suppressing small signals and their variability, as discussed [...]

• The use of "medium-time" processing with a duration of 50-120 ms to analyze the parameters characterizing environmental degradation, in combination with the traditional short-time Fourier analysis with frames of 20-30 ms used in conventional speech recognition systems. [...]
B. Temporal integration for environmental analysis

Most speech recognition and speech coding systems use analysis frames of duration between 20 ms and 30 ms. Nevertheless, it is frequently observed that longer analysis windows provide better performance for noise modeling and/or environmental normalization (e.g. [21], [41], [42]), because the power associated with most background noise conditions changes more slowly than the instantaneous power associated with speech.

In PNCC processing we estimate a quantity we refer to as "medium-time power" Q̃[m, l] by computing the running average of P[m, l], the power observed in a single analysis frame, according to the equation

Q̃[m, l] = (1 / (2M + 1)) Σ_{m' = m−M}^{m+M} P[m', l]    (3)

where m represents the frame index and l is the channel index. We will apply the tilde symbol to all power estimates that are performed using medium-time analysis.

We observed experimentally that the choice of the temporal integration factor M has a substantial impact on performance in white noise (and presumably other types of broadband background noise). This factor has less impact on the accuracy that is observed in more dynamic interference or reverberation, although the longer temporal analysis window does provide some benefit in these environments as well [43]. We chose the value of M = 2 (corresponding to five consecutive windows with a total net duration of 65.6 ms) on the basis of these observations. Since Q̃[m, l] is the moving average of P[m, l], Q̃[m, l] is a low-pass function of m. If M = 2, the upper frequency is approximately 15 Hz. Nevertheless, if we were to use features based on Q̃[m, l] directly for speech recognition, recognition accuracy would be degraded because onsets and offsets of the frequency components would become blurred. Hence in PNCC, we use Q̃[m, l] only for noise estimation and compensation, which are used to modify the information based on the short-time power estimates P[m, l]. We also apply smoothing over the various frequency channels, which will be discussed in Sec. II-E below.

[Fig. 3 image: block diagram. The medium-time power Q̃[m, l] passes through asymmetric lowpass filtering to produce the lower envelope Q̃le[m, l], which is subtracted from Q̃[m, l] and halfwave-rectified to give Q̃0[m, l]; a second asymmetric lowpass filter yields the floor Q̃f[m, l]; temporal masking yields Q̃tm[m, l]; a MAX operation produces R̃sp[m, l]; an excitation/non-excitation decision selects the final output R̃[m, l].]

Fig. 3. Functional block diagram of the modules for asymmetric noise suppression (ANS) and temporal masking in PNCC processing. All processing is performed on a channel-by-channel basis. Q̃[m, l] is the medium-time-averaged input power as defined by Eq. (3), R̃[m, l] is the speech output of the ANS module, and S̃[m, l] is the output after temporal masking (which is applied only to the speech frames). The block labelled Temporal Masking is depicted in detail in Fig. 5.
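In code, the medium-time power estimate of Eq. (3) amounts to a moving average along the frame axis. The sketch below is illustrative, assuming the short-time powers are held in a NumPy array; how the paper treats the first and last M frames is not specified in this text, so here edge frames simply average over the frames that exist.

```python
import numpy as np

def medium_time_power(P, M=2):
    """Running average of short-time power P[m, l] over 2M+1 frames (Eq. 3).

    P is a (num_frames, num_channels) array.  Interior frames average
    over exactly 2M+1 frames; edge frames average over whichever of
    those frames exist (an implementation choice, not from the paper).
    """
    num_frames = P.shape[0]
    Q = np.empty_like(P, dtype=float)
    for m in range(num_frames):
        lo, hi = max(0, m - M), min(num_frames, m + M + 1)
        Q[m] = P[lo:hi].mean(axis=0)
    return Q
```

With M = 2 and a 10 ms frame advance on 25.6 ms windows, each estimate spans five windows, i.e. the 65.6 ms total duration quoted above.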
C. Asymmetric noise suppression

In this section, we discuss a new approach to noise compensation which we refer to as asymmetric noise suppression (ANS). This procedure is motivated by the observation mentioned above that the speech power in each channel usually changes more rapidly than the background noise power in the same channel. Alternately, we might say that speech usually has a higher-frequency modulation spectrum than noise. Motivated by this observation, many algorithms have been developed using either high-pass filtering or band-pass filtering in the modulation spectrum domain (e.g. [23], [44]). The simplest way to accomplish this objective is to perform high-pass filtering in each channel (e.g. [45], [46]), which has the effect of removing slowly-varying components that typically represent the effects of additive noise sources rather than the speech signal.

One significant problem with the application of conventional linear high-pass filtering in the power domain is that the filter output can become negative. Negative values for the power coefficients are problematic in the formal mathematical sense (in that power itself is positive). They also cause problems in the application of the compressive nonlinearity and in speech resynthesis unless a suitable floor value is applied to the power coefficients (e.g. [46]). Rather than filtering in the power domain, we could perform filtering after applying the logarithmic nonlinearity, as is done with conventional cepstral mean normalization in MFCC processing. Nevertheless, as will be seen in Sec. III, this approach is not very helpful for environments with additive noise. Spectral subtraction is another way to reduce the effects of noise whose power changes slowly (e.g. [37]). In spectral subtraction techniques, the noise level is typically estimated from the power of non-speech segments (e.g. [37]) or through the use of a continuous-update approach (e.g. [45]). In the approach that we introduce, we obtain a running estimate of the time-varying noise floor using an asymmetric nonlinear filter, and subtract that from the instantaneous power.

Figure 3 is a block diagram of the complete asymmetric nonlinear suppression processing with temporal masking. Let us begin by describing the general characteristics of the
input and output Q̃in[m, l] and Q̃out[m, l], respectively:

Q̃out[m, l] = λa Q̃out[m − 1, l] + (1 − λa) Q̃in[m, l],  if Q̃in[m, l] ≥ Q̃out[m − 1, l]
Q̃out[m, l] = λb Q̃out[m − 1, l] + (1 − λb) Q̃in[m, l],  if Q̃in[m, l] < Q̃out[m − 1, l]    (4)

where m is the frame index and l is the channel index, and λa and λb are constants between zero and one.

If λa = λb it is easy to verify that Eq. (4) reduces to a conventional IIR filter that is lowpass in nature because the values of the λ parameters are positive, as shown in Fig. 4(a). In contrast, if 1 > λb > λa > 0, the nonlinear filter functions as a conventional "upper" envelope detector, as illustrated in Fig. 4(b). Finally, and most usefully for our purposes, if 1 > λa > λb > 0, the filter output Q̃out tends to follow the lower envelope of Q̃in[m, l], as seen in Fig. 4(c). In our processing, we will use this slowly-varying lower envelope in Fig. 4(c) to serve as a model for the estimated medium-time noise level, and the activity above this envelope is assumed to represent speech activity. Hence, subtracting this low-level envelope from the original input Q̃in[m, l] will remove a slowly varying non-speech component.

We will use the notation

Q̃out[m, l] = AF_{λa,λb}[Q̃in[m, l]]    (5)

to represent the nonlinear filter described by Eq. (4). We note that this filter operates only on the frame indices m for each channel index l.

[Fig. 4 image: three panels of power (dB) versus frame index m, showing Q̃in[m, l] (solid) and Q̃out[m, l] (dashed) for (a) λa = 0.9, λb = 0.9; (b) λa = 0.5, λb = 0.95; and (c) λa = 0.999, λb = 0.5.]

Fig. 4. Sample inputs (solid curves) and outputs (dashed curves) of the asymmetric nonlinear filter defined by Eq. (4) for conditions when (a) λa = λb, (b) λa < λb, and (c) λa > λb. In this example, the channel index l is 8.

Keeping the characteristics of the asymmetric filter described above in mind, we may now consider the structure shown in Fig. 3. In the first stage, the lower envelope Q̃le[m, l], which represents the average noise power, is obtained by ANS processing according to the equation

Q̃le[m, l] = AF_{0.999,0.5}[Q̃[m, l]]    (6)

as depicted in Fig. 4(c). Q̃le[0, l] is initialized to 0.9 Q̃[0, l]. Q̃le[m, l] is subtracted from the input Q̃[m, l], effectively highpass filtering the input, and that signal is passed through an ideal half-wave linear rectifier to produce the rectified output Q̃0[m, l]. The impact of the specific values of the forgetting factors λa and λb on speech recognition accuracy is discussed below.

The remaining elements of ANS processing in the right-hand side of Fig. 3 (other than the temporal masking block) are included to cope with problems that develop when the rectifier output Q̃0[m, l] remains zero for an interval, or when the local variance of Q̃0[m, l] becomes excessively small. Our approach to this problem is motivated by our previous work [21], in which it was noted that applying a well-motivated flooring level to power is very important for noise robustness. In PNCC processing we apply the asymmetric nonlinear filter a second time to obtain the lower envelope of the rectifier output, Q̃f[m, l], and we use this envelope to establish the floor level. This envelope Q̃f[m, l] is obtained using asymmetric filtering as before:

Q̃f[m, l] = AF_{0.999,0.5}[Q̃0[m, l]]    (7)

Q̃f[0, l] is initialized as Q̃0[0, l]. As shown in Fig. 3, we use the lower envelope of the rectified signal Q̃f[m, l] as a floor level for the ANS processing output R̃[m, l] after temporal masking:

R̃sp[m, l] = max(Q̃tm[m, l], Q̃f[m, l])    (8)

where Q̃tm[m, l] is the temporal masking output depicted in Fig. 3. Temporal masking for speech segments is discussed in Sec. II-D.

We have found that applying lowpass filtering to the signal segments that do not appear to be driven by a periodic excitation function (as in voiced speech) improves recognition accuracy in noise by a small amount. For this reason we use the lower envelope of the rectified signal, Q̃f[m, l], directly for these non-excitation segments. This operation, which is effectively a further lowpass filtering, is not performed for the speech segments because blurring the power coefficients for speech degrades recognition accuracy.

Excitation/non-excitation decisions for this purpose are obtained for each value of m and l in a very simple fashion:

"excitation segment" if Q̃[m, l] ≥ c Q̃le[m, l]    (9a)
"non-excitation segment" if Q̃[m, l] < c Q̃le[m, l]    (9b)

where Q̃le[m, l] is the lower envelope of Q̃[m, l] as described above, and c is a fixed constant. In other words, a particular value of Q̃[m, l] is not considered to be a sufficiently large excitation if it is less than a fixed multiple of its own lower envelope.
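The asymmetric filter AF of Eqs. (4)-(7) can be sketched per channel as follows. This is an illustrative NumPy rendering, with the frame-zero initialization passed in explicitly; `ans_channel` is a hypothetical helper chaining the ANS steps of Fig. 3 for one channel.

```python
import numpy as np

def asymmetric_filter(q_in, lam_a, lam_b, init=None):
    """Asymmetric nonlinear filter AF_{lam_a,lam_b} of Eq. (4), one channel.

    A rising input (q_in >= previous output) is smoothed with forgetting
    factor lam_a; a falling input with lam_b.  With 1 > lam_a > lam_b > 0
    the output tracks the lower envelope of the input.
    """
    q_out = np.empty_like(q_in, dtype=float)
    prev = q_in[0] if init is None else init
    for m, x in enumerate(q_in):
        lam = lam_a if x >= prev else lam_b
        prev = lam * prev + (1.0 - lam) * x
        q_out[m] = prev
    return q_out

def ans_channel(Q, c=2.0):
    """ANS steps of Fig. 3 (Eqs. 6, 7, 9a/9b) for one channel of
    medium-time power Q; c is the excitation-decision threshold."""
    Q_le = asymmetric_filter(Q, 0.999, 0.5, init=0.9 * Q[0])  # Eq. (6)
    Q0 = np.maximum(Q - Q_le, 0.0)      # subtract noise floor, rectify
    Q_f = asymmetric_filter(Q0, 0.999, 0.5)                   # Eq. (7)
    excitation = Q >= c * Q_le          # Eqs. (9a)/(9b)
    return Q_le, Q0, Q_f, excitation
```

The slow rise (λa = 0.999) and fast fall (λb = 0.5) are what make the output cling to the lower envelope, as in Fig. 4(c).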
[Fig. 5 image: block diagram. Q̃0[m, l] and the delayed online peak λt Q̃p[m − 1, l] (via z^−1) feed a MAX operation producing Q̃p[m, l]; the output R̃sp[m, l] equals Q̃0[m, l] when Q̃0[m, l] ≥ λt Q̃p[m − 1, l], and µt Q̃p[m − 1, l] otherwise.]

Fig. 5. Block diagram of the components that accomplish temporal masking in Fig. 3.

We observed experimentally that while a broad range of values of λb between 0.25 and 0.75 appear to provide reasonable recognition accuracy, the choice of λa = 0.9 appears to be best under most circumstances [43]. The parameter values used for the current standard implementation are λa = 0.999 and λb = 0.5, which were chosen in part to maximize the recognition accuracy in clean speech as well as performance in noise. We also observed (in experiments in which the temporal masking described below was bypassed) that the threshold-parameter value c = 2 provides the best performance for white noise (and presumably other types of broadband noise). The value of c has little impact on performance in background music and in the presence of reverberation.

D. Temporal masking

Many authors have noted that the human auditory system appears to focus more on the onset of an incoming power envelope than on the falling edge of that same power envelope (e.g. [47], [48]). This observation has led to several onset enhancement algorithms (e.g. [49], [46], [50], [51]). In this section we describe a simple way to incorporate this effect in PNCC processing, by obtaining a moving peak for each frequency channel l and suppressing the instantaneous power if it falls below this envelope.

The processing invoked for temporal masking is depicted in block diagram form in Fig. 5. We first obtain the online peak power Q̃p[m, l] for each channel using the following equation:

Q̃p[m, l] = max(λt Q̃p[m − 1, l], Q̃0[m, l])    (10)

where λt is the forgetting factor for obtaining the online peak. As before, m is the frame index and l is the channel index. Temporal masking for speech segments is accomplished using the following equation:

R̃sp[m, l] = Q̃0[m, l],           if Q̃0[m, l] ≥ λt Q̃p[m − 1, l]
R̃sp[m, l] = µt Q̃p[m − 1, l],    if Q̃0[m, l] < λt Q̃p[m − 1, l]    (11)

[Fig. 6 image: two panels of power (dB) versus frame index, comparing S̃[m, l] with and without temporal masking.]

Fig. 6. Demonstration of the effect of temporal masking in the ANS module for speech in simulated reverberation with T60 = 0.5 s (upper panel) and clean speech (lower panel). In this example, the channel index l is 18.

We have found [43] that if the forgetting factor λt is equal to or less than 0.85 and if µt ≤ 0.2, recognition accuracy remains almost constant for clean speech and most additive noise conditions, and if λt increases beyond 0.85, performance degrades. The value of λt = 0.85 also appears to be best in the reverberant condition. For these reasons we use the values λt = 0.85 and µt = 0.2 in the standard implementation of PNCC. Note that λt = 0.85 corresponds to a time constant of 28.2 ms, which means that the offset attenuation lasts approximately 100 ms. This characteristic is in accordance with observed data for humans [52].

Figure 6 illustrates the effect of this temporal masking. In general, with temporal masking the response of the system is inhibited for portions of the input signal R̃[m, l] other than rising "attack transients". The difference between the signals with and without masking is especially pronounced in reverberant environments, for which the temporal processing module is especially helpful.

The final output of the asymmetric noise suppression and temporal masking modules is R̃[m, l] = R̃sp[m, l] for the excitation segments and R̃[m, l] = Q̃f[m, l] for the non-excitation segments.

E. Spectral weight smoothing

In our previous research on speech enhancement and noise compensation techniques (e.g., [20], [21], [41], [53], [54]), it has been frequently observed that smoothing the response across channels is helpful. This is especially true in processing schemes such as PNCC where there are nonlinearities and/or thresholds that vary in their effect from channel to channel, as well as in processing schemes that are based on inclusion of responses from only a subset of time frames and frequency channels (e.g. [53]) or systems that rely on missing-feature approaches (e.g. [55]).
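The peak tracking of Eq. (10) and the masking rule of Eq. (11) can be sketched per channel as follows. This is an illustrative NumPy rendering; the peak is assumed to start at zero before the first frame, an initialization the text does not specify.

```python
import numpy as np

def temporal_masking(Q0, lam_t=0.85, mu_t=0.2):
    """Temporal masking of Eqs. (10) and (11) for one channel.

    Q0 holds the rectified ANS output over frames.  Power that falls
    below lam_t times the decayed online peak is replaced by mu_t times
    that peak, attenuating trailing edges while passing onsets through.
    """
    R_sp = np.empty_like(Q0, dtype=float)
    peak = 0.0                          # assumed initial peak
    for m, x in enumerate(Q0):
        # Eq. (11): compare against the peak from frame m-1
        R_sp[m] = x if x >= lam_t * peak else mu_t * peak
        # Eq. (10): update the online peak for the next frame
        peak = max(lam_t * peak, x)
    return R_sp
```

A sharp attack exceeds the decayed peak and passes through unchanged; the frames that trail it are attenuated toward µt times the peak, which is the onset-enhancement behavior described above.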
From the discussion above, we can represent the combined effects of asymmetric noise suppression and temporal masking for a specific time frame and frequency bin as the transfer function R̃[m, l]/Q̃[m, l]. Smoothing the transfer function [...] again based on empirical optimization from the results of pilot studies [43]. We note that if we were to use a different number of channels L, the optimal value of N would also be different. [...] where m and l are the frame and channel indices, as before, and L represents the number of frequency channels. [...] We use a value of 0.999 for the forgetting factor λµ. For the initial value of µ[m], we use the value obtained from the training database. Since the time constant corresponding to λµ is around 4.6 seconds, we normally do not need to incorporate a formal voice activity detector (VAD) in conjunction with PNCC, provided that continuous non-speech portions are not longer than 3 to 4 seconds.

[...]

Several studies in our group (e.g. [20], [54]) have confirmed the critical importance of the nonlinear function that describes the relationship between incoming signal amplitude in a given frequency channel and the corresponding response of the processing model. This "rate-level nonlinearity" is explicitly or implicitly a crucial part of every conceptual or physiological model of auditory processing (e.g. [57], [58], [59]). In this section we summarize our approach to the development of the rate-level nonlinearity used in PNCC processing.
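To illustrate why a power-law nonlinearity is better behaved than the logarithm for small channel powers, the short sketch below (illustrative values only) compares the MFCC log compression with the PNCC exponent of 1/15:

```python
import numpy as np

# Compare the compressive nonlinearities over a wide range of channel powers.
# MFCC applies a logarithm; PNCC applies a power law with exponent 1/15.
powers = np.array([1e-8, 1e-4, 1.0, 1e4])

log_out = powers_log = np.log(powers)   # unbounded below as power -> 0
pow_out = powers ** (1.0 / 15.0)        # stays bounded and positive near zero

# Doubling any power shifts the log output by a constant log(2) ~ 0.69,
# so small-signal fluctuations pass through at full strength; the power
# law scales its output by only 2**(1/15) ~ 1.047, suppressing them.
ratio = (2 * powers) ** (1.0 / 15.0) / pow_out
```

The multiplicative change under the power law is the same small factor at every level, but its absolute effect on the feature value shrinks with the signal, which is the small-signal variability suppression cited earlier.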
[Fig. 9 image: two panels of rate (spikes/sec) versus pressure (Pa), each comparing the human rate-intensity model with cube-root power-law, MMSE power-law, and logarithmic approximations.]

[...] Pascals in the upper panel, and the putative spontaneous firing rate of 50 spikes per second is subtracted from the curves in both cases.

The most widely used current feature extraction algorithms are Mel Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Prediction (PLP) coefficients. Both the MFCC and PLP procedures include an intrinsic nonlinearity, which is logarithmic in the case of MFCC and a cube-root power function in the case of PLP analysis. We plot these curves relating the power of the input pressure p to the response s in Fig. 9, using values of the arbitrary scaling parameters that are chosen to provide the best fit to the curve of the Heinz et al. model, resulting in the following equations: [...]
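Best-fit scaling parameters of this kind can be obtained by least-squares fitting in the log-log domain. The sketch below is purely illustrative: it fits a power law s = a·p^b to synthetic cube-root-like data standing in for the Heinz et al. model curve, which is not reproduced here.

```python
import numpy as np

# Hypothetical stand-in data: exact power-law samples playing the role of
# the rate-intensity curve to which the paper fits its approximations.
p = np.linspace(0.01, 0.2, 50)        # sound pressure in Pa
rate = 900.0 * p ** (1.0 / 3.0)       # synthetic cube-root-like responses

# Least-squares fit of s = a * p**b via linear regression in the
# log-log domain: log(s) = log(a) + b * log(p).
b, log_a = np.polyfit(np.log(p), np.log(rate), 1)
a = np.exp(log_a)
```

For data that truly follow a power law, the regression recovers the exponent b and scale a exactly; fitting a real model curve instead yields the best-fit approximation in the least-squares sense.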
[...] very good performance over a wide variety of conditions using a single set of parameters and settings, without sacrificing [...]

B. General performance of PNCC in noise and reverberation

In this section we describe the recognition accuracy obtained using PNCC processing in the presence of various types of degradation of the incoming speech signals. Figures 11 and 12 describe the recognition accuracy obtained with PNCC processing in the presence of white noise, street noise, background music, and speech from a single interfering speaker as a function of SNR, as well as in the simulated reverberant environment [...] way that highlights the various contributions of the major components. Beginning with baseline MFCC processing, the remaining curves show the effects of adding, in sequence, (1) the power-law nonlinearity (along with mean power normalization and the gammatone frequency integration), (2) the ANS processing including spectral smoothing, and finally (3) temporal masking. It can be seen from the curves that a substantial improvement can be obtained by simply replacing the logarithmic nonlinearity of MFCC processing by the power-law nonlinearity.

[Fig. 11 image: panels of accuracy (100 − WER) versus SNR (dB), SIR (dB), or reverberation time for the DARPA RM1 corpus, with panel titles such as RM1 (Street Noise), RM1 (Interfering Speaker), and RM1 (Reverberation).]

Fig. 11. Recognition accuracy obtained using PNCC processing in various types of additive noise and reverberation. Curves are plotted separately to indicate the contributions of the power-law nonlinearity, asymmetric noise suppression, and temporal masking. Results are described for the DARPA RM1 database in the presence of (a) white noise, (b) street noise, (c) background music, (d) interfering speech, and (e) artificial reverberation.

Figures 13 and 14 provide comparisons of PNCC processing to the baseline MFCC processing with cepstral mean normalization, MFCC processing combined with the vector Taylor series (VTS) algorithm for noise robustness [4], as well as RASTA-PLP feature extraction [23] and the ETSI Advanced Front End (AFE) [8]. We compare PNCC processing to MFCC and RASTA-PLP processing because these features are most widely used in baseline systems, even though neither MFCC nor PLP features were designed to be robust in the presence of additive noise. The experimental conditions used were the same as those used to produce Figs. 11 and 12.

We note in Figs. 13 and 14 that PNCC provides substantially [...]
[Figs. 12 and 13 images: panels of accuracy (100 − WER) versus SNR (dB), SIR (dB), or reverberation time (s) for the RM1 and WSJ0-5k corpora (white noise, street noise, music noise, interfering speaker, reverberation); legends include PNCC, power-law nonlinearity with ANS processing, power-law nonlinearity, MFCC, ETSI AFE, MFCC with VTS, and RASTA-PLP.]

Fig. 12. Recognition accuracy obtained using PNCC processing in various types of additive noise and reverberation. Curves are plotted separately to indicate the contributions of the power-law nonlinearity, asymmetric noise suppression, and temporal masking. Results are described for the DARPA WSJ0 database in the presence of (a) white noise, (b) street noise, (c) background music, (d) interfering speech, and (e) artificial reverberation.

Fig. 13. Comparison of recognition accuracy for PNCC with processing using MFCC features, the ETSI AFE, MFCC with VTS, and RASTA-PLP features using the DARPA RM1 corpus. Environmental conditions are (a) white noise, (b) street noise, (c) background music, (d) interfering speech, and (e) reverberation.
[Fig. 14 image: panels of accuracy (100 − WER) versus SNR (dB), SIR (dB), or reverberation time (s) for the WSJ0-5k corpus (white noise, street noise, music noise, interfering speaker, reverberation), with curves for PNCC, ETSI AFE, MFCC with VTS, MFCC, and RASTA-PLP.]

Fig. 14. Comparison of recognition accuracy for PNCC with processing using MFCC features, the ETSI AFE, MFCC with VTS, and RASTA-PLP features using the DARPA WSJ0 corpus. Environmental conditions are (a) white noise, (b) street noise, (c) background music, (d) interfering speech, and (e) reverberation.

[...] difficult environments like background music noise, a single-channel interfering speaker, or reverberation.

The ETSI Advanced Front End (AFE) [8] generally provides slightly better recognition accuracy than VTS in noisy environments, but the accuracy obtained with the AFE does not approach that obtained with PNCC processing in the most difficult noise conditions. Neither the ETSI AFE nor VTS improves recognition accuracy in reverberant environments compared to MFCC features, while PNCC provides measurable improvements in reverberation, and a closely related algorithm [46] provides even greater recognition accuracy in reverberation (at the expense of somewhat worse performance in clean speech).

IV. COMPUTATIONAL COMPLEXITY

Table I provides estimates of the computational demands of MFCC, PLP, and PNCC feature extraction. (RASTA processing is not included in these tabulations.) As before, we use the standard open source Sphinx code in sphinx_fe [64] for the implementation of MFCC, and the implementation in [65] for PLP. We assume that the window length is 25.6 ms and that the interval between successive windows is 10 ms. The sampling rate is assumed to be 16 kHz, and we use a 1024-point FFT for each analysis frame. [...] more costly than MFCC processing and 1.31 percent more costly than PLP processing.

V. SUMMARY

In this paper we introduce power-normalized cepstral coefficients (PNCC), which we characterize as a feature set that provides better recognition accuracy than MFCC and RASTA-PLP processing in the presence of common types of additive noise and reverberation. PNCC processing is motivated by the desire to develop computationally efficient feature extraction for automatic speech recognition that is based on a pragmatic abstraction of various attributes of auditory processing, including the rate-level nonlinearity, temporal and spectral integration, and temporal masking. The processing also includes a component that implements suppression of various types of common additive noise. PNCC processing requires only about 33 percent more computation compared to MFCC.
TABLE I
NUMBER OF MULTIPLICATIONS AND DIVISIONS IN EACH FRAME

Item                             MFCC     PLP    PNCC
Pre-emphasis                      410             410
Windowing                         410     410     410
FFT                             10240   10240   10240
Magnitude squared                 512     512     512
Medium-time power calculation                      40
Spectral integration              958    4955    4984
ANS filtering                                     200
Equal loudness pre-emphasis               512
Temporal masking                                  120
Weight averaging                                  120
IDFT                                      504
LPC and cepstral recursion                156
DCT                               480             480
Sum                             13010   17289   17516
[16] N. Mesgarani, M. Slaney, and S. Shamma, “Discrimination of speech
from nonspeech based on multiscale spectro-temporal modulations,”
IEEE Trans. Audio, Speech, and Language Processing, vol. 14, no. 3,
Further details about the motivation for and implementation pp. 920–929, May 2006.
[17] M. Kleinschmidt, “Localized spectro-temporal features for automatic
of PNCC processing are available in [43]. This thesis also speech recognition,” in INTERSPEECH-2003, Sept. 2003, pp. 2573–
includes additional relevant experimental findings including 2576.
results obtained for PNCC processing using multi-style train- [18] H. Hermansky and F. Valente, “Hierarchical and parallel processing
of modulation spectrum for asr applications,” in IEEE Int. Conf. on
ing and in combination with speaker-by-speaker MLLR. Acoustics, Speech, and Signal Processing, March 2008, pp. 4165–4168.
Open Source MATLAB code for PNCC may be found at [19] S. Y. Zhao and N. Morgan, “Multi-stream spectro-temporal features for
http://www.cs.cmu.edu/ robust speech recognition,” in INTERSPEECH-2008, Sept. 2008, pp.
898–901.
˜robust/archive/algorithms/PNCC_IEEETran. The [20] C. Kim and R. M. Stern, “Feature extraction for robust speech recog-
code in this directory was used for obtaining the results for nition using a power-law nonlinearity and power-bias subtraction,” in
this paper. INTERSPEECH-2009, Sept. 2009, pp. 28–31.
[21] ——, “Feature extraction for robust speech recognition based on maxi-
mizing the sharpness of the power distribution and on power flooring,”
ACKNOWLEDGEMENTS in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, March
2010, pp. 4574–4577.
This research was supported by NSF (Grants IIS-0420866 [22] D.-S. Kim, S.-Y. Lee, and R. M. Kil, “Auditory processing of speech
signals for robust speech recognition in real-world noisy environments,”
and IIS-0916918). The authors are grateful to Bhiksha Raj, IEEE Trans. Speech and Audio Processing, vol. 7, no. 1, pp. 55–69,
Kshitiz Kumar, and Mark Harvilla for many helpful discus- 1999.
sions. A summary version of part of this paper was published [23] H. Hermansky and N. Morgan, “RASTA processing of speech,” IEEE.
Trans. Speech Audio Process., vol. 2, no. 4, pp. 578–589, Oct. 1994.
at [66]. [24] U. H. Yapanel and J. H. L. Hansen, “A new perceptually motivated
MVDR-based acoustic front-end (PMVDR) for robust automatic speech
recognition,” Speech Communication, vol. 50, no. 2, pp. 142–152, Feb.
R EFERENCES 2008.
[25] F. Müller and A. Mertins, “Contextual invariant-integration features for
[1] L. R. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. improved speaker-independent speech recognition,” Speech Communi-
Englewood Cliffs, New Jersey: PTR Prentice Hall, 1993. cation, vol. 53, no. 6, pp. 830–841, July 2011.
[2] F. Jelinek, Statistical Methods for Speech Recognition (Language, [26] B. Gajic and K. K. Paliwal, “Robust parameters for speech recognition
Speech, and Communication). MIT Press, 1998. based on subband spectral centroid histograms,” in Eurospeech-2001,
[3] A. Acero and R. M. Stern, “Environmental Robustness in Automatic Sept. 2001, pp. 591–594.
Speech Recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal [27] F. Kelly and N. Harte, “A comparison of auditory features for robust
Processing (Albuquerque, NM), vol. 2, Apr. 1990, pp. 849–852. speech recognition,” in EUSIPCO-2010, Aug 2010, pp. 1968–1972.
[4] P. J. Moreno, B. Raj, and R. M. Stern, “A vector Taylor series approach [28] ——, “Auditory features revisited for robust speech recognition,” in
for environment-independent speech recognition,” in IEEE Int. Conf. International conference on pattern recognition, Aug. 2010, pp. 4456–
Acoust., Speech and Signal Processing, May. 1996, pp. 733–736. 4459.
[5] P. Pujol, D. Macho, and C. Nadeu, “On real-time mean-and-variance nor- [29] J. K. Siqueira and A. Alcaim, “Comparação dos atributos mfcc, ssch e
malization of speech recognition features,” in IEEE Int. Conf. Acoust., pncc para reconhecimento robusto de voz contı́nua,” in XXIX Simpósio
Speech and Signal Processing, vol. 1, May 2006, pp. 773–776. Brasileiro de Telecomunicações, Oct. 2011.
[6] R. M. Stern, B. Raj, and P. J. Moreno, “Compensation for environmental [30] G. Sárosi, M. Mozsáry, B. Tarján, A. Balog, P. Mihajlik, and T. Fegyó,
degradation in automatic speech recognition,” in Proc. of the ESCA “Recognition of multiple language voice navigation queries in traffic
Tutorial and Research Workshop on Robust Speech Recognition for situations,” in COST 2102 International Conference, Sept. 2010, pp.
Unknown Communication Channels, Apr. 1997, pp. 33–42. 199–213.
[7] R. Singh, R. M. Stern, and B. Raj, “Signal and feature compensation [31] G. Sárosi, M. Mozsáry, P. Mihajlik, and T. Fegyó, “Comparison of
methods for robust speech recognition,” in Noise Reduction in Speech feature extraction methods for speech recognition in noise-free and in
Applications, G. M. Davis, Ed. CRC Press, 2002, pp. 219–244. traffic noise environment,” in Speech Technology and Human-Computer
[8] Speech Processing, Transmission and Quality Aspects (STQ); Dis- Dialogue (SpeD), May 2011, pp. 1–8.
tributed Speech Recognition; Advanced Front-end Feature Extraction Al- [32] F. Müller and A. Mertins, “Noise robust speaker-independent speech
gorithm; Compression Algorithms, European Telecommunications Stan- recognition with invariant-integration features using power-bias subtrac-
dards Institute ES 202 050, Rev. 1.1.5, Jan. 2007. tion,” in INTERSPEECH-2011, Aug. 2011, pp. 1677–1680.
[33] K. Kumar, C. Kim, and R. M. Stern, "Delta-spectral cepstral coefficients for robust speech recognition," in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, May 2011, pp. 4784–4787.
[34] H. Zhang, X. Zhu, T.-R. Su, K.-W. Eom, and J.-W. Lee, "Data-driven lexicon refinement using local and web resources for Chinese speech recognition," in International Symposium on Chinese Spoken Language Processing, Dec. 2010, pp. 233–237.
[35] A. Fazel and S. Chakrabartty, "Sparse auditory reproducing kernel (SPARK) features for noise-robust speech recognition," IEEE Trans. Audio, Speech, and Language Processing, Dec. 2011 (to appear).
[36] M. J. Harvilla and R. M. Stern, "Histogram-based subband power warping and spectral averaging for robust speech recognition under matched and multistyle training," in IEEE Int. Conf. Acoust., Speech, and Signal Processing, May 2012 (to appear).
[37] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoust., Speech, and Signal Processing, vol. 27, no. 2, pp. 113–120, Apr. 1979.
[38] R. D. Patterson, K. Robinson, J. Holdsworth, D. McKeown, C. Zhang, and M. H. Allerhand, "Complex sounds and auditory images," in Auditory Physiology and Perception, Y. Cazals, L. Demany, and K. Horner, Eds. Oxford, UK: Pergamon Press, 1992, pp. 429–446.
[39] B. C. J. Moore and B. R. Glasberg, "A revision of Zwicker's loudness model," Acustica - Acta Acustica, vol. 82, pp. 335–345, 1996.
[40] M. Slaney, "Auditory Toolbox Version 2," Interval Research Corporation Technical Report, no. 10, 1998. [Online]. Available: http://cobweb.ecn.purdue.edu/~malcolm/interval/1998-010/
[41] C. Kim and R. M. Stern, "Power function-based power distribution normalization algorithm for robust speech recognition," in IEEE Automatic Speech Recognition and Understanding Workshop, Dec. 2009, pp. 188–193.
[42] D. Gelbart and N. Morgan, "Evaluating long-term spectral subtraction for reverberant ASR," in IEEE Workshop on Automatic Speech Recognition and Understanding, 2001, pp. 103–106.
[43] C. Kim, "Signal processing for robust speech recognition motivated by auditory processing," Ph.D. dissertation, Carnegie Mellon University, Pittsburgh, PA, USA, Dec. 2010.
[44] B. E. D. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1–3, pp. 117–132, Aug. 1998.
[45] H. G. Hirsch and C. Ehrlicher, "Noise estimation techniques for robust speech recognition," in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, May 1995, pp. 153–156.
[46] C. Kim and R. M. Stern, "Nonlinear enhancement of onset for robust speech recognition," in INTERSPEECH-2010, Sept. 2010, pp. 2058–2061.
[47] C. Lemyre, M. Jelinek, and R. Lefebvre, "New approach to voiced onset detection in speech signal and its application for frame error concealment," in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, May 2008, pp. 4757–4760.
[48] S. R. M. Prasanna and P. Krishnamoorthy, "Vowel onset point detection using source, spectral peaks, and modulation spectrum energies," IEEE Trans. Audio, Speech, and Lang. Process., vol. 17, no. 4, pp. 556–565, May 2009.
[49] K. D. Martin, "Echo suppression in a computational model of the precedence effect," in IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics, Oct. 1997.
[50] C. Kim, K. Kumar, and R. M. Stern, "Binaural sound source separation motivated by auditory processing," in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, May 2011, pp. 5072–5075.
[51] T. S. Gunawan and E. Ambikairajah, "A new forward masking model and its application to speech enhancement," in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, May 2006, pp. 149–152.
[52] W. Jesteadt, S. P. Bacon, and J. R. Lehman, "Forward masking as a function of frequency, masker level, and signal delay," J. Acoust. Soc. Am., vol. 71, no. 4, pp. 950–962, Apr. 1982.
[53] C. Kim, K. Kumar, B. Raj, and R. M. Stern, "Signal separation for robust speech recognition based on phase difference information obtained in the frequency domain," in INTERSPEECH-2009, Sept. 2009, pp. 2495–2498.
[54] C. Kim, K. Kumar, and R. M. Stern, "Robust speech recognition using small power boosting algorithm," in IEEE Automatic Speech Recognition and Understanding Workshop, Dec. 2009, pp. 243–248.
[55] B. Raj and R. M. Stern, "Missing-feature methods for robust automatic speech recognition," IEEE Signal Processing Magazine, vol. 22, no. 5, pp. 101–116, Sept. 2005.
[56] M. G. Heinz, X. Zhang, I. C. Bruce, and L. H. Carney, "Auditory-nerve model for predicting performance limits of normal and impaired listeners," Acoustics Research Letters Online, vol. 2, no. 3, pp. 91–96, July 2001.
[57] S. Seneff, "A joint synchrony/mean-rate model of auditory speech processing," J. Phonetics, vol. 16, no. 1, pp. 55–76, Jan. 1988.
[58] J. Tchorz and B. Kollmeier, "A model of auditory perception as front end for automatic speech recognition," J. Acoust. Soc. Am., vol. 106, no. 4, pp. 2040–2050, 1999.
[59] X. Zhang, M. G. Heinz, I. C. Bruce, and L. H. Carney, "A phenomenological model for the responses of auditory-nerve fibers: I. Nonlinear tuning with compression and suppression," J. Acoust. Soc. Am., vol. 109, no. 2, pp. 648–670, Feb. 2001.
[60] ——, "A phenomenological model for the responses of auditory-nerve fibers: I. Nonlinear tuning with compression and suppression," J. Acoust. Soc. Am., vol. 109, no. 2, pp. 648–670, Feb. 2001.
[61] S. S. Stevens, "On the psychophysical law," Psychological Review, vol. 64, no. 3, pp. 153–181, 1957.
[62] S. G. McGovern, "A model for room acoustics," http://2pi.us/rir.html.
[63] J. Allen and D. Berkley, "Image method for efficiently simulating small-room acoustics," J. Acoust. Soc. Am., vol. 65, no. 4, pp. 943–950, April 1979.
[64] CMU Sphinx Consortium. CMU Sphinx Open Source Toolkit for Speech Recognition: Downloads. [Online]. Available: http://cmusphinx.sourceforge.net/wiki/download/
[65] D. Ellis. (2006) PLP and RASTA (and MFCC, and inversion) in MATLAB using melfcc.m and invmelfcc.m. [Online]. Available: http://labrosa.ee.columbia.edu/matlab/rastamat/
[66] C. Kim and R. M. Stern, "Power-normalized cepstral coefficients (PNCC) for robust speech recognition," in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, March 2012 (to appear).

Chanwoo Kim is presently a software development engineer at the Microsoft Corporation. He received a Ph.D. from the Language Technologies Institute of the Carnegie Mellon University School of Computer Science in 2010. He received his B.S. and M.S. degrees in Electrical Engineering from Seoul National University in 1998 and 2001, respectively. Dr. Kim's doctoral research was focused on enhancing the robustness of automatic speech recognition systems in noisy environments. Toward this end he has developed a number of different algorithms for single-microphone, dual-microphone, and multiple-microphone applications, which have been applied to various real-world systems. Between 2003 and 2005 Dr. Kim was a Senior Research Engineer at LG Electronics, where he worked primarily on embedded signal processing and protocol stacks for multimedia systems. Prior to his employment at LG, he worked for EdumediaTek and SK Teletech as an R&D engineer.

Richard Stern received the S.B. degree from the Massachusetts Institute of Technology in 1970, the M.S. from the University of California, Berkeley, in 1972, and the Ph.D. from MIT in 1977, all in electrical engineering. He has been on the faculty of Carnegie Mellon University since 1977, where he is currently a Professor in the Electrical and Computer Engineering, Computer Science, and Biomedical Engineering Departments and the Language Technologies Institute, and a Lecturer in the School of Music. Much of Dr. Stern's current research is in spoken language systems, where he is particularly concerned with the development of techniques with which automatic speech recognition can be made more robust with respect to changes in environment and acoustical ambience. He has also developed sentence parsing and speaker adaptation algorithms for earlier CMU speech systems. In addition to his work in speech recognition, Dr. Stern also maintains an active research program in psychoacoustics, where he is best known for theoretical work in binaural perception. Dr. Stern is a Fellow of the Acoustical Society of America, the 2008–2009 Distinguished Lecturer of the International Speech Communication Association, a recipient of the Allen Newell Award for Research Excellence in 1992, and he served as General Chair of Interspeech 2006. He is also a member of the IEEE and of the Audio Engineering Society.